#463 swamp serve scheduled workflows do not load repo extension registries
Opened by phy2vir · 5/27/2026 · Shipped 5/29/2026
Summary
A workflow that can be discovered and validated from the CLI fails when fired by swamp serve scheduled execution. The scheduled run appears to execute with only built-in or otherwise incomplete registries loaded, even though a fresh CLI process in the same repository can see the required pulled model and vault extension types.
This report intentionally uses generic placeholder names and omits concrete extension names, hostnames, IPs, and domain-specific workflow details.
Environment
- Server command:
  swamp serve --repo-dir <repo> --host 0.0.0.0 --port 9090
- Reproduced after updating server binary and repository metadata to:
  20260527.181855.0-sha.2efdbeea
- Repository contains:
  - pulled model extensions, e.g. <pulled-model-type>
  - a local custom model extension, e.g. <local-model-type>
  - a pulled custom vault extension, e.g. <custom-vault-type>
  - vault config using the pulled vault type
  - workflows whose steps call those model types and whose model definitions use vault.get(...)
What works from the CLI
From the same repo directory, fresh CLI commands see the extension catalog correctly:
swamp vault type search --json
# includes local_encryption and <custom-vault-type>
swamp model type search <query> --json
# includes <pulled-model-type> and <local-model-type>
swamp doctor extensions --json
# overallStatus: pass
# relevant pulled/local model sources: Indexed
# relevant vault source: Indexed
The workflow validates from the CLI:
swamp workflow validate <workflow-name> --json
# passed: true
A manually triggered workflow run from the same repo had also succeeded before the scheduler-specific test.
Reproduction
- Start swamp serve with --repo-dir <repo>.
- Confirm the process is running the updated binary.
- Add a temporary cron trigger to an existing read-only workflow, scheduled for the next minute.
- Confirm /health shows scheduling enabled and the schedule registered with the expected nextRun.
- Wait for the scheduler to fire.
The scheduler fires as expected:
Scheduled workflow "<workflow-name>" ("<cron>")
Registered schedule for workflow "<workflow-name>": "<cron>"
Running scheduled workflow "<workflow-name>"
Firing scheduled run for workflow "<workflow-name>"
Actual result
The scheduled run fails immediately before executing the first method. In the latest retest on 20260527.181855.0-sha.2efdbeea, the run failed in about 19 ms with:
Unknown model type: <pulled-model-type>
A previous scheduled run against another workflow in the same repo failed similarly with:
Unknown model type: <local-model-type>
The service journal from that previous run also showed the vault registry was incomplete during scheduled execution:
Unsupported vault type: <custom-vault-type> (vault "<vault-name>"). Available vault types: local_encryption
This is the key contrast: a fresh CLI process in the same repo sees the pulled/local model types and custom vault type, while the scheduled execution path inside the long-running swamp serve process behaves as if the repo extension registries are not loaded.
Expected result
Scheduled workflow execution should use the same repo extension registry/catalog resolution as manual CLI workflow execution and WebSocket-triggered execution. If swamp model type search, swamp vault type search, and swamp doctor extensions can see the extension types from the repo, swamp serve scheduled runs should be able to resolve those same types.
Cleanup after reproduction
The temporary cron trigger was removed after the scheduled run, and /health returned schedules: [].
Notes from source inspection
In the inspected source, swamp serve creates ScheduledExecutionService, and scheduled runs call executeWorkflowWithLocks(...). That path appears intended to call registry loading through workflow run dependency construction, but the observed runtime behavior suggests one of these is happening in the serve scheduler process:
- registry loaders are not configured for the serve command process, or
- ensureLoaded() returns without loading the repo extension catalog, or
- scheduled workflow execution constructs part of its runtime through a path that bypasses the configured repo extension loaders.
The repo layout itself does not appear incomplete: pulled extension lock data and pulled files exist, and swamp doctor extensions reports the relevant entries as indexed.
Privacy note
Concrete extension names, workflow names, hostnames, IPs, and domain-specific model names have been replaced with placeholders in this report.
stack72 commented 5/27/2026, 8:53:13 PM
@phy2vir thanks for the detailed report — the structured reproduction steps and the 19ms timing observation were particularly helpful for narrowing this down.
We have triaged this and done a thorough code walkthrough of the extension loading architecture plus multiple reproduction attempts.
How serve extension loading is designed to work:
When you run swamp serve --repo-dir <repo>, the process goes through runCli() which pre-parses --repo-dir and calls configureExtensionLoaders() before the serve command action runs. This sets up lazy loader closures on the global registry singletons (modelRegistry, vaultTypeRegistry, driverTypeRegistry, reportRegistry) — the same path every CLI command uses. The serve command is not excluded from this setup.
When a schedule fires, the path is: ScheduledExecutionService.executeWorkflow → executeWorkflowWithLocks → createWorkflowRunDeps (in src/serve/deps.ts), which calls ensureLoaded() on all four registries before constructing the workflow deps. This triggers the lazy loaders, which read the extension catalog DB, enumerate pulled extension directories from the lockfile, and register types (either fully or as lazy catalog entries for per-bundle loading). Then WorkflowExecutionService.executeStep calls resolveModelType(), which uses ensureTypeLoaded() for on-demand bundle import, followed by the auto-resolver as a fallback.
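The lazy-loading behavior described above can be illustrated with a small sketch. This is not swamp's implementation: the LazyRegistry class, its setLoader method, and the sample type name are hypothetical, and the sketch is synchronous for brevity. Only the ensureLoaded() name comes from the description above.

```typescript
// Hypothetical sketch of a registry whose loader runs on first use.
// Only the ensureLoaded() name is taken from the thread; the rest is assumed.
type Loader = (register: (typeName: string) => void) => void;

class LazyRegistry {
  private types = new Set<string>();
  private loader: Loader | undefined;
  private loaded = false;
  loadCount = 0; // counts loader runs, for illustration only

  // Would be called once during CLI setup, before the command action runs.
  setLoader(loader: Loader): void {
    this.loader = loader;
  }

  // Runs the loader on the first call; later calls return immediately.
  ensureLoaded(): void {
    if (this.loaded) return;
    if (this.loader !== undefined) {
      this.loadCount++;
      this.loader((name) => this.types.add(name));
    }
    this.loaded = true;
  }

  has(typeName: string): boolean {
    return this.types.has(typeName);
  }
}

function demoLazyRegistry(): { found: boolean; loadCount: number } {
  const registry = new LazyRegistry();
  registry.setLoader((register) => {
    register("example/model-type"); // stands in for reading the catalog
  });
  registry.ensureLoaded(); // first scheduled run: the loader executes
  registry.ensureLoaded(); // second run: cached, the loader is skipped
  return {
    found: registry.has("example/model-type"),
    loadCount: registry.loadCount,
  };
}
```

In this sketch the second ensureLoaded() call does no work, which is consistent with a slow first scheduled run followed by fast later runs.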
The WebSocket-triggered execution path goes through the same executeWorkflowWithLocks function — there is no separate code path for scheduled vs WebSocket execution.
Where swamp serve should run from:
swamp serve --repo-dir <path> should work from any working directory. All internal paths — the extension catalog, lockfile, pulled extension sources, bundles, model definitions, vault configs, and workflows — are resolved as absolute paths from the --repo-dir value. The process CWD is not used for any extension or repo lookups.
We verified this explicitly: starting swamp serve --repo-dir /tmp/test-repo from / (root), ~ (home), and the repo directory itself all produced identical behavior — extensions loaded, scheduled workflows executed.
If you are running serve via systemd or another service manager where the working directory is / or /root, that should be fine as long as --repo-dir points to the repo root (the directory containing .swamp.yaml).
Filesystem layout the loaders expect:
The extension loaders resolve all paths from --repo-dir. Here is what they expect:
<repo-dir>/                              ← --repo-dir points here
├── .swamp.yaml                          # repo marker
├── .swamp/
│   ├── _extension_catalog.db            # SQLite catalog, indexes all extension sources
│   ├── bundles/<hash>/                  # pre-compiled JS bundles (from pull)
│   │   ├── user_pool.js
│   │   └── ...
│   └── pulled-extensions/
│       └── @<collective>/<extension>/
│           ├── manifest.yaml
│           ├── models/*.ts              # pulled model sources
│           ├── vaults/*.ts              # pulled vault sources
│           └── ...
├── extensions/
│   └── models/                          # default modelsDir
│       ├── upstream_extensions.json     # lockfile, maps pulled extensions to dirs
│       └── <local-model>.ts             # local custom model extensions live here
├── models/                              # model instance definitions (YAML)
│   └── @<collective>/<type>/<uuid>.yaml
├── vaults/                              # vault configurations
│   └── <vault-name>.yaml
└── workflows/                           # workflow definitions
    └── workflow-<uuid>.yaml
Key paths:
- Models dir: defaults to extensions/models, overridable via SWAMP_MODELS_DIR env or modelsDir in .swamp.yaml
- Lockfile: <modelsDir>/upstream_extensions.json
- Extension catalog: .swamp/_extension_catalog.db (rebuilt on demand if missing or stale)
- Bundles: .swamp/bundles/<hash>/ (pre-compiled JS from extension pull)
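As a rough illustration of how these defaults compose, the sketch below derives the key paths from a repo directory. The resolveRepoPaths function, its return shape, and the joinPath helper are hypothetical; only the path segments come from the list above. The sketch assumes an override is given relative to the repo root and takes a single, already-resolved override value, since the precedence between SWAMP_MODELS_DIR and modelsDir is not stated here.

```typescript
// Hypothetical helper types and functions; only the path segments are from the thread.
interface RepoPaths {
  modelsDir: string;
  lockfile: string;
  catalog: string;
  bundlesDir: string;
}

// Joins segments with "/" and collapses repeated slashes (POSIX-style paths only).
function joinPath(...parts: string[]): string {
  return parts.join("/").replace(/\/+/g, "/");
}

// Assumption: modelsDirOverride, when given, is relative to the repo root.
function resolveRepoPaths(repoDir: string, modelsDirOverride?: string): RepoPaths {
  const modelsDir = joinPath(repoDir, modelsDirOverride ?? "extensions/models");
  return {
    modelsDir,
    lockfile: joinPath(modelsDir, "upstream_extensions.json"),
    catalog: joinPath(repoDir, ".swamp", "_extension_catalog.db"),
    bundlesDir: joinPath(repoDir, ".swamp", "bundles"),
  };
}

const examplePaths = resolveRepoPaths("/tmp/test-repo");
// examplePaths.lockfile is "/tmp/test-repo/extensions/models/upstream_extensions.json"
```

Because every path is built from repoDir, nothing in this sketch depends on the process working directory.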
Reproduction attempts — all passed:
- Dev mode (deno run dev serve) with a pulled extension (@swamp/gcp/cloudshell), scheduled workflow firing every minute: extensions loaded correctly.
- Compiled binary with @swamp/aws/cognito pulled, model + scheduled workflow: first and second scheduled runs both completed. First run took ~470ms (catalog rebuild + bundle imports), second was instant (registries cached).
- Compiled binary from CWD / with --repo-dir /tmp/...: extensions loaded identically to running from the repo directory.
- Compiled binary from CWD ~ with deleted extension catalog (_extension_catalog.db* removed): catalog rebuilt from disk on first scheduled execution, bundles loaded from cache, run completed.
- CLI validation before each test: swamp model type search, swamp workflow validate, and swamp doctor extensions all confirmed extensions indexed and healthy.
What the 19ms timing tells us:
In our successful tests, the first scheduled run takes 400-500ms because ensureLoaded() triggers the lazy loaders. The 19ms from your report means ensureLoaded() returned almost instantly — either the registries were already marked as loaded (but empty), or the loaders ran and failed silently. The loader functions have catch {} blocks that swallow all errors, so if anything goes wrong, the registries end up empty but permanently marked as "loaded."
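The failure mode described here, a swallowed loader error that leaves a registry empty but marked as loaded, can be reproduced in a few lines. This is a hedged sketch: RegistryState, ensureLoadedSilently, and ensureLoadedLoudly are hypothetical names that do not mirror swamp's source, and the second function is only one possible defensive variant, not the planned fix.

```typescript
// Hypothetical registry state; names are illustrative only.
interface RegistryState {
  loaded: boolean;
  types: string[];
  lastError?: string;
}

// Mirrors the described behavior: the error is swallowed and the registry
// is still marked as loaded, so the loader never runs again.
function ensureLoadedSilently(state: RegistryState, loader: () => string[]): void {
  if (state.loaded) return;
  try {
    state.types = loader();
  } catch {
    // swallowed: nothing is logged and nothing is registered
  }
  state.loaded = true;
}

// One possible defensive variant: record the failure and leave the
// registry unloaded so that a later call can retry.
function ensureLoadedLoudly(state: RegistryState, loader: () => string[]): void {
  if (state.loaded) return;
  try {
    state.types = loader();
    state.loaded = true;
  } catch (err) {
    state.lastError = err instanceof Error ? err.message : String(err);
  }
}

function failingLoader(): string[] {
  throw new Error("simulated loader failure");
}

const silentState: RegistryState = { loaded: false, types: [] };
ensureLoadedSilently(silentState, failingLoader);
// silentState is now loaded but has no types and no recorded error

const loudState: RegistryState = { loaded: false, types: [] };
ensureLoadedLoudly(loudState, failingLoader);
// loudState stays unloaded and keeps the error message for diagnostics
```

With the silent version, every later lookup reports an unknown type even though the underlying cause was a one-time loader error.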
To help us reproduce, could you share:
- Does the repo use .swamp-sources.yaml (additional extension source directories)?
- Is the datastore filesystem-based or custom (e.g., S3)?
- Is the --repo-dir path a symlink or does it cross any mount boundaries?
- Does .swamp.yaml have a modelsDir or workflowsDir override?
- If you can reproduce again, could you run with --log-level debug? The debug logs should show whether the extension loaders actually ran and what they found.
We are planning a defensive fix regardless — adding diagnostic logging to the silent catch blocks and eager extension loading at serve startup so failures surface immediately rather than getting swallowed.
phy2vir commented 5/28/2026, 7:08:44 PM
Follow-up after the requested environment checks, debug repro, and a scheduler-run restore test. I am sanitizing concrete package/workflow/host names here.
Answers to the environment questions:
- No .swamp-sources.yaml is present, and swamp extension source list reports no additional sources.
- Datastore is filesystem-backed, rooted at <repo>/.swamp; swamp datastore status --json reports healthy.
- --repo-dir points at a real directory, not a symlink.
- <repo>, <repo>/.swamp, <repo>/extensions, <repo>/workflows, and <repo>/vaults are all on the same ext4 mount.
- .swamp.yaml has no modelsDir or workflowsDir overrides.
- systemd unit uses WorkingDirectory=<repo> and ExecStart=/usr/local/bin/swamp serve --repo-dir <repo> --host 0.0.0.0 --port 9090.
One important environment difference: the systemd service initially had NO_COLOR=1, PATH=..., and USER=root, but no HOME or USERPROFILE.
On startup, swamp serve logged these warnings:
Failed to load user datastore extensions: "Cannot determine home directory (HOME/USERPROFILE not set)"
Failed to load user model extensions: "Cannot determine home directory (HOME/USERPROFILE not set)"
Failed to load user vault extensions: "Cannot determine home directory (HOME/USERPROFILE not set)"
Failed to load user driver extensions: "Cannot determine home directory (HOME/USERPROFILE not set)"
Failed to load user report extensions: "Cannot determine home directory (HOME/USERPROFILE not set)"
With debug enabled and no HOME, a temporary scheduled workflow reproduced the issue:
- scheduled run failed in 21ms
- first step failed with Unknown model type for a pulled model type
- logs also showed the pulled vault type was unsupported, with only local_encryption available
- debug logs showed auto-resolution being attempted/skipped for the relevant collectives, rather than the already-pulled repo bundles being loaded
Then I added a systemd drop-in:
[Service]
Environment=HOME=/root
After restarting swamp serve, the startup loader warnings disappeared. Running the same scheduled workflow through the scheduler then succeeded in about 28s.
I then tested the actual restore workflow through swamp serve scheduling, not via manual CLI execution. With HOME=/root still set, that scheduled restore run also succeeded end-to-end in about 249s. It produced the expected restore-test data artifact, reported success: true, confirmed the temporary VM was network-isolated and guest-agent reachable, and cleanup completed successfully.
To answer the likely deployment question: this was a normal systemd wrapper around the documented/built-in swamp serve --repo-dir <repo> command. I do not see a swamp serve install / swamp service install style command in the current CLI that would have generated a unit with HOME set automatically. In hindsight, setting HOME or XDG_CONFIG_HOME explicitly in the service unit is a good hardening step, and it is now the local workaround.
So this looks reproducible when swamp serve runs under a service manager without HOME/USERPROFILE: startup user-extension loading fails, and after that the serve process appears to have incomplete registries for pulled repo extensions during scheduled execution. Setting HOME avoids the failure in this environment and fixes both a small scheduled repro workflow and the real scheduled restore workflow.
Thanks again for digging into this. Hopefully this narrows the repro surface: service-managed swamp serve, valid --repo-dir, filesystem datastore, no source overrides, no directory overrides, but missing HOME/USERPROFILE in the process environment.
stack72 commented 5/29/2026, 12:54:59 AM
Thanks for the excellent reproduction — that nailed it. 🙏
Your follow-up isolated the root cause precisely: the systemd unit ran swamp serve with no HOME/USERPROFILE in its environment. swamp loads every extension — including already-pulled repo bundles — through an embedded runtime that lives under ~/.swamp, so home-directory resolution threw deep inside each loader at startup. That cascaded into the misleading Failed to load user X extensions warnings and, at scheduled-run time, Unknown model type for your pulled model/vault types. Your Environment=HOME=/root drop-in is exactly the right fix, and it's the recommended workaround.
We've shipped a follow-up in #1470 that closes the diagnosability gap so nobody else has to reverse-engineer this:
- A guard clause in the extension-loader setup now detects the missing-home condition once, up front, and emits a single actionable warning that names the real cause and the fix (set HOME, e.g. Environment=HOME=/root), replacing the five confusing per-kind Failed to load user … extensions warnings.
- swamp serve --help now documents the HOME/USERPROFILE requirement for service-managed deployments.
Note this is intentionally a diagnosability + docs fix: it doesn't make headless serve load extensions without a home directory (the embedded runtime genuinely needs one) — it just tells you so clearly instead of failing obscurely. Setting HOME (or XDG-appropriate equivalents) in the service unit remains the correct configuration, as you found.
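The guard-clause idea can be sketched as follows. This is only an illustration of checking once, up front: the missingHomeWarning and configureLoaders functions, the env parameter, and the warning wording are assumptions, and the actual change in #1470 may look quite different.

```typescript
// Environment passed in explicitly so the sketch is runtime-neutral.
type Env = Record<string, string | undefined>;

// Returns one actionable warning when neither HOME nor USERPROFILE is set,
// or undefined when a home directory can be determined.
function missingHomeWarning(env: Env): string | undefined {
  const home = env["HOME"] || env["USERPROFILE"];
  if (home) return undefined;
  return "Cannot determine home directory: set HOME (for example " +
    "Environment=HOME=/root in a systemd unit) so extensions can load.";
}

// Hypothetical setup step: check once, instead of letting each loader kind
// fail separately with its own warning.
function configureLoaders(env: Env, loaderKinds: string[]): string[] {
  const warning = missingHomeWarning(env);
  if (warning !== undefined) return [warning]; // one message, not one per kind
  return loaderKinds.map((kind) => `configured ${kind} loader`);
}

const loaderKinds = ["datastore", "model", "vault", "driver", "report"];

// Environment similar to the systemd unit in the report: no HOME or USERPROFILE.
const headlessMessages = configureLoaders(
  { NO_COLOR: "1", PATH: "/usr/bin", USER: "root" },
  loaderKinds,
);

// Same environment with the drop-in applied.
const fixedMessages = configureLoaders({ HOME: "/root" }, loaderKinds);
```

In the headless case the sketch yields a single message naming the cause, where the original behavior produced five separate warnings.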
Thanks again for the detailed, sanitized write-up and the restore-workflow verification — it made this a quick fix.