01 Issue
Bug · Shipped · Swamp CLI
Assignees: stack72

#382 Many CLI commands acquire global .datastore.lock unnecessarily, causing 60s LockTimeoutError under any concurrent writer

Opened by mgreten · 5/20/2026 · Shipped 5/20/2026

Summary

Most non-read CLI commands (and even some read-ish ones like vault read-secret) acquire the global S3 datastore lock via requireInitializedRepo(). When any other process holds that lock — including a normal background swamp model method run doing a routine push-to-S3 — these commands wait up to 60s and then fail with LockTimeoutError.

In our environment two scheduled models on two different machines (a Mac and a Linux box) regularly run a few minutes apart. Each one holds .datastore.lock for ~30–60s during its push phase. The lock is also continuously heartbeat-renewed (extend() every TTL/3 = 10s), so when both holders' windows overlap, the lock is effectively never free. Every interactive swamp extension install, swamp vault put, etc. that lands in that window hits the 60s timeout and errors out.
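
The overlap effect can be illustrated with a deliberately simplified timeline model. This is a sketch, not swamp's actual lock code: the Hold type and both functions are hypothetical, and the only values taken from this report are the 30 s TTL, the TTL/3 heartbeat, and the 60 s maximum wait.

```typescript
// Simplified model: each hold is a half-open interval [startMs, endMs) during
// which a healthy process owns .datastore.lock. Because the holder renews the
// lock every HEARTBEAT_MS, the TTL never expires while the holder is alive.
interface Hold {
  startMs: number;
  endMs: number;
}

const TTL_MS = 30_000; // DEFAULT_TTL_MS as quoted in this report
const HEARTBEAT_MS = TTL_MS / 3; // extend() interval: 10 s
const MAX_WAIT_MS = 60_000; // DEFAULT_MAX_WAIT_MS as quoted in this report

// First instant at or after arrivalMs at which no hold is active.
function firstFreeInstant(arrivalMs: number, holds: Hold[]): number {
  const sorted = [...holds].sort((a, b) => a.startMs - b.startMs);
  let t = arrivalMs;
  for (const h of sorted) {
    if (h.startMs <= t && t < h.endMs) {
      t = h.endMs; // still locked, so the waiter is pushed past this hold
    }
  }
  return t;
}

// What a waiting command sees under this model (hypothetical helper).
function waiterOutcome(
  arrivalMs: number,
  holds: Hold[],
  maxWaitMs: number = MAX_WAIT_MS,
): "acquired" | "LockTimeoutError" {
  const waitedMs = firstFreeInstant(arrivalMs, holds) - arrivalMs;
  return waitedMs <= maxWaitMs ? "acquired" : "LockTimeoutError";
}
```

With a single 45 s push, a command arriving 10 s in waits 35 s and succeeds. With two back-to-back 45 s pushes, the same command would need to wait 80 s, which exceeds the 60 s limit.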

This is not a stale-lock bug (the holder is healthy) and not an S3/MinIO bug (conditional writes are working). The problem is that the lock scope is too coarse for the read/write profile of these commands.

Affected commands (verified in source 20260519.225230.0-sha.4f9d3e72)

All call requireInitializedRepo which acquires .datastore.lock:

  • swamp extension install: src/cli/commands/extension_install.ts:67
  • swamp vault put: src/cli/commands/vault_put.ts
  • swamp vault create: src/cli/commands/vault_create.ts
  • swamp vault edit: src/cli/commands/vault_edit.ts
  • swamp vault migrate: src/cli/commands/vault_migrate.ts
  • swamp vault read-secret: src/cli/commands/vault_read_secret.ts (even though it's a read)

Notably, extension install only restores the local .swamp/ cache from a JSON lockfile. It writes no datastore data, yet it still takes the global lock. The same applies to vault read-secret, which is a read.

Steps to reproduce

  1. Use the S3 datastore (e.g. MinIO over a network — Tailscale, LAN, anything where push is not instant).
  2. Trigger a long-running write: in one shell, run a swamp model method run that produces enough output to push for 30–60s (or just let a scheduled model do it).
  3. In another shell on the same machine, or any other machine sharing the datastore, run any of:
    swamp extension install
    swamp vault put my-vault some-key
    swamp vault read-secret my-vault some-key
  4. Observe: the command waits up to 60s and then fails with:
    LockTimeoutError: Lock .datastore.lock held by <host> (pid <pid>) — timed out after 60xxx ms
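
For scripts that need to tell this failure apart from other errors, the message can be parsed. The sketch below assumes the error text follows the shape shown in step 4. The real format may differ between versions, and parseLockTimeout and LockTimeoutInfo are hypothetical names, not part of swamp.

```typescript
// Fields recoverable from the error line shown in step 4 (assumed format).
interface LockTimeoutInfo {
  lockName: string;
  host: string;
  pid: number;
  waitedMs: number;
}

// The separator between "(pid N)" and "timed out" is matched loosely (\S+)
// so the pattern does not depend on one specific dash character.
const LOCK_TIMEOUT_RE =
  /LockTimeoutError: Lock (\S+) held by (\S+) \(pid (\d+)\)\s+\S+\s+timed out after (\d+)\s*ms/;

// Returns the parsed fields, or null if the text is not a lock timeout.
function parseLockTimeout(text: string): LockTimeoutInfo | null {
  const m = LOCK_TIMEOUT_RE.exec(text);
  if (m === null) return null;
  return {
    lockName: m[1],
    host: m[2],
    pid: Number(m[3]),
    waitedMs: Number(m[4]),
  };
}
```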

Why this is painful in practice

We run agentic pipelines (ADW) that create git worktrees and run swamp extension install from inside each new worktree to populate the local cache. Every worktree creation rolls the dice against whatever scheduled model is currently pushing. We had to wrap our adapter in a 30s-sleep-and-retry loop just to get reliable worktree setup. The retry works, but it's papering over a lock scope that doesn't match what these commands actually do.

vault read-secret is the most surprising case: it's a read, yet contention on a writer-class lock can stop you from reading a secret you need.

Hypothesized fix

Several writer-style commands likely don't need the global write lock:

  1. extension install is a pure local-cache restore from upstream_extensions.json. It does not need the datastore lock at all — only the per-extension pull operation (if any extension is missing) needs to coordinate with the datastore, and that could happen under a narrower lock.
  2. vault read-secret is a read. It should follow the same pattern as other read commands and not take the writer lock.
  3. vault put/create/edit do write to the datastore, but they write to a specific vault path. A vault-scoped lock (.vault-{name}.lock) would let them proceed concurrently with unrelated work.

A broader option: introduce a separate .push.lock for the push-to-S3 step of model method run's flush, distinct from .datastore.lock. That alone would unblock most cases because push is the dominant lock holder.
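
The narrower scoping proposed above can be sketched as a lookup from command to lock. This is only an illustration of the hypothesis: proposedLockScope and LockScope are hypothetical names, and nothing here reflects how swamp actually dispatches commands. Commands not discussed in the list (vault migrate, for example) are left on the global lock.

```typescript
// Lock a command would take under the hypothesized fix (illustrative only).
type LockScope =
  | { kind: "none" }
  | { kind: "vault"; lockFile: string }
  | { kind: "global"; lockFile: string };

function proposedLockScope(command: string, vaultName?: string): LockScope {
  switch (command) {
    // Local-cache restore and pure reads: no datastore lock.
    case "extension install":
    case "vault read-secret":
      return { kind: "none" };
    // Writes confined to one vault path: a vault-scoped lock.
    case "vault put":
    case "vault create":
    case "vault edit":
      if (!vaultName) {
        throw new Error(`"${command}" needs a vault name to pick its lock`);
      }
      return { kind: "vault", lockFile: `.vault-${vaultName}.lock` };
    // Everything else keeps today's behavior.
    default:
      return { kind: "global", lockFile: ".datastore.lock" };
  }
}
```

Under this mapping, swamp vault put my-vault some-key would contend only with other writers to my-vault, not with a model run that is pushing unrelated data.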

TTL/wait timeout not user-tunable

S3Lock in the bundled @swamp/s3-datastore (s3.js:~55900) hardcodes DEFAULT_TTL_MS=30000 and DEFAULT_MAX_WAIT_MS=60000. Neither is exposed in .swamp.yaml or via env. Even where lock acquisition is correct, users can't tune it for slow networks.
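
One possible shape for making these values tunable is sketched below. Everything except the two default values is an assumption: the environment variable names SWAMP_LOCK_TTL_MS and SWAMP_LOCK_MAX_WAIT_MS, the config keys, and the resolveLockTimings function do not exist in swamp today.

```typescript
interface LockTimings {
  ttlMs: number;
  maxWaitMs: number;
}

// Defaults as quoted in this report.
const DEFAULT_TIMINGS: LockTimings = { ttlMs: 30_000, maxWaitMs: 60_000 };

// Precedence: environment variable, then config value, then default.
// Non-numeric, zero, or negative values are ignored.
function resolveLockTimings(
  config: Partial<LockTimings> = {},
  env: Record<string, string | undefined> = {},
): LockTimings {
  const pick = (
    envKey: string,
    configValue: number | undefined,
    fallback: number,
  ): number => {
    const raw = env[envKey];
    const fromEnv = raw === undefined ? NaN : Number(raw);
    if (Number.isFinite(fromEnv) && fromEnv > 0) return fromEnv;
    if (configValue !== undefined && Number.isFinite(configValue) && configValue > 0) {
      return configValue;
    }
    return fallback;
  };
  return {
    // Hypothetical variable names, chosen only for this sketch.
    ttlMs: pick("SWAMP_LOCK_TTL_MS", config.ttlMs, DEFAULT_TIMINGS.ttlMs),
    maxWaitMs: pick("SWAMP_LOCK_MAX_WAIT_MS", config.maxWaitMs, DEFAULT_TIMINGS.maxWaitMs),
  };
}
```

A user on a slow link could then raise the wait to, say, 180000 ms without touching the bundled code.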

Environment

  • swamp 20260516.045246.0-sha.e6eda98d (latest is 20260519.225230.0-sha.4f9d3e72; behavior unchanged — verified by reading the new source)
  • macOS Darwin 25.3.0, ARM64
  • Datastore: @swamp/s3-datastore against MinIO over Tailscale
  • Two machines run scheduled swamp model method run on a cron
  • .swamp.yaml excerpt:
    datastore:
      type: '@swamp/s3-datastore'
      config:
        bucket: swamp-data
        region: us-east-1
        endpoint: 'http://minio-swamp.tail001dd.ts.net:9000'
        forcePathStyle: true

Workaround we're using

A 30-second sleep-and-retry around swamp extension install in our worktree adapter. Works reliably for the install path; doesn't help interactive vault put invocations.
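
A minimal sketch of such a wrapper is shown below. It is not our actual adapter code. retryOnLockTimeout is a hypothetical helper, the default of 5 attempts is an assumption, and the caller is expected to supply a run function that executes the command and rejects with the error text on failure.

```typescript
type Sleep = (ms: number) => Promise<void>;

const realSleep: Sleep = (ms) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Retries run() only when the failure mentions LockTimeoutError.
// Any other error is rethrown immediately.
async function retryOnLockTimeout<T>(
  run: () => Promise<T>,
  opts: { attempts?: number; delayMs?: number; sleep?: Sleep } = {},
): Promise<T> {
  // 30 s matches the delay described above; 5 attempts is an assumption.
  const { attempts = 5, delayMs = 30_000, sleep = realSleep } = opts;
  let lastError: unknown;
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await run();
    } catch (err) {
      const message =
        err instanceof Error ? `${err.name}: ${err.message}` : String(err);
      if (!message.includes("LockTimeoutError")) throw err;
      lastError = err;
      // Sleep between attempts, but not after the final one.
      if (attempt < attempts) await sleep(delayMs);
    }
  }
  throw lastError;
}
```

The sleep function is injectable so the wrapper can be exercised in tests without real 30-second pauses.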

02 Bog Flow

Shipped

5/20/2026, 7:36:28 AM

03 Sludge Pulse
stack72 assigned stack72 · 5/20/2026, 6:57:28 AM

stack72 commented 5/20/2026, 7:00:04 AM

hey @mgreten - I am currently in the middle of redesigning the datastores so that they don't use global locks! You can see the work in #378, #379 and #380

mgreten commented 5/20/2026, 3:39:57 PM

ahhh thank you!
