Issue
Bug · Closed · Swamp CLI
Assignee: stack72

#224 S3 datastore stale-lock retry loop never evicts orphan locks left by swamp processes that panic at shutdown

Opened by bixu · 5/4/2026

Description

Two behaviors observed today compound to make swamp effectively hang forever once any prior invocation has panicked at shutdown.

1. swamp panics on shutdown but does not exit

After a successful method run (kubevirt-dev discover against a Harvester cluster), the binary printed:

thread 'main' panicked at ext/node/ops/tls_wrap.rs:2018:31:
called `Option::unwrap()` on a `None` value
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

thread 'main' (10569428) panicked at library/core/src/panicking.rs:230:5:
panic in a function that cannot unwind
stack backtrace:
   0:        0x100e32ecc - _uv_mutex_unlock
   1:        0x10009fb5c - <unknown>
   ...
thread caused non-unwinding panic. aborting.

The shell wrapper returned and printed this stack, but the underlying swamp binary remained alive — ps aux showed the process for ~33 minutes afterward, still holding the S3 datastore global lock. Killing it manually freed the lock.

2. Stale-lock detection enters an infinite retry loop instead of evicting

Any subsequent swamp invocation that touches the datastore enters this loop, with the same dead PID being reported as the holder repeatedly:

datastore·lock Global lock acquired by "[email protected]" during per-model lock acquisition — releasing and retrying
datastore·lock Global lock held by "[email protected]" appears stale (exceeded TTL of 30000ms) — proceeding
datastore·lock Global lock held by "[email protected]" — waiting for structural command to finish
datastore·lock Global lock held by "[email protected]" appears stale (exceeded TTL of 30000ms) — proceeding
datastore·lock Global lock released, proceeding with per-model locks
datastore·lock Global lock acquired by "[email protected]" during per-model lock acquisition — releasing and retrying
[...repeats indefinitely until process B also dies of the same panic...]

The "Global lock released, proceeding with per-model locks" line is misleading — the next per-model lock attempt re-encounters the same orphan global lock and starts the cycle again. Net effect: every swamp invocation hangs until a human kills the wedged earlier process.

Note: from a single user's perspective the holder shows the same identity ([email protected]) — so the retry logic is contending with our own dead self. There's no PID/instance discriminator visible to distinguish the live caller from the dead holder.

Suggested fixes

  1. Lock eviction must be atomic. When the TTL has expired, the caller should CAS-replace the lock holder (claim it as ours), not "proceed" while leaving the stale entry visible to the next read. Otherwise every caller sees the same stale orphan and never makes progress. (A sketch of this follows the list.)
  2. Process B should give up faster when it observes itself looping on its own identity — at minimum, log a clear "another swamp instance from this user is dead, run kill <pid>" message and exit non-zero, rather than spinning.
  3. Investigate the Deno tls_wrap panic on shutdown. It appears to fire reliably after method runs that use kubectl subprocesses. Even if the upstream Deno panic itself can't be fixed quickly, the swamp binary should ensure it actually exits when main panics: currently it prints "thread caused non-unwinding panic. aborting." and then never actually aborts, leaving the process resident.
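
For concreteness, here is a minimal TypeScript sketch of what fix #1 asks for. It is hypothetical throughout: the LockRecord shape, claimStaleLock, and the conditional read/replace primitives are illustrative stand-ins, not swamp's actual API. The nonce field also stands in for the per-acquisition discriminator that the note above says is missing.

// Hypothetical sketch, not swamp's real lock schema or API.
interface LockRecord {
  holder: string;     // e.g. "[email protected]"
  nonce: string;      // unique per acquisition; the missing instance discriminator
  acquiredAt: number; // epoch millis
}

// Stand-ins for conditional S3 reads/writes (assumed, not swamp's API).
declare function readLock(): Promise<LockRecord | null>;
declare function replaceLockIf(expectedNonce: string, next: LockRecord): Promise<boolean>;

// Atomically claim a TTL-expired lock instead of "proceeding" past it.
async function claimStaleLock(ttlMs: number, self: LockRecord): Promise<boolean> {
  const current = await readLock();
  if (current === null) return false;                        // nothing to evict
  if (Date.now() - current.acquiredAt < ttlMs) return false; // genuinely held, keep waiting
  // CAS: succeeds only if the lock still carries the stale nonce we just read.
  // A concurrent live re-acquire writes a new nonce, so the swap fails safely.
  return replaceLockIf(current.nonce, self);
}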

Environment

  • swamp 20260504.140403.0-sha.d4c9188f
  • macOS Darwin 25.4.0 (arm64, Apple Silicon)
  • Shared S3 datastore (HiveMQ collective recently migrated from local sqlite to S3 — PLT-477)
  • Method that triggered it: swamp model method run kubevirt-dev exec (@hivemq/harvester/kubevirt)

Workaround

ps -ax -o pid,etime,command | grep 'swamp model' | grep -v grep
kill <pid>

Lock frees immediately and any pending swamp callers complete on the next retry.

Bog Flow

Closed · 5/4/2026, 5:05:01 PM


Sludge Pulse
stack72 assigned stack72 · 5/4/2026, 4:47:05 PM

stack72 commented 5/4/2026, 5:04:58 PM

Thanks @bixu for the detailed report. With the wider context across lab #213, #219, and the deploy CI hits, this turns out to be three independent problems compounding — let me break them apart.

Problem 1: stale-lock recovery isn't atomic — already fixed

The "infinite retry loop" / "Global lock released, proceeding with per-model locks" cycle is exactly what swamp-club#218 described, and it was fixed in commit 60829024 (PR #1291), which landed shortly before you filed this. Your version 20260504.140403.0-sha.d4c9188f is the commit immediately before that fix.

That commit wires the existing DistributedLock.forceRelease(expectedNonce) breakglass primitive into both wait loops in acquireModelLocks (src/cli/repo_context.ts) via a new tryForceReleaseStaleLock helper. Behaviour now:

  • When the wait loop observes a stale global lock, the helper performs a CAS-style delete keyed on the lock's nonce. forceRelease re-verifies the nonce immediately before the deleteObject, so there's zero risk of clobbering a live re-acquire.
  • The post-acquire TOCTOU re-check sees null instead of the same orphan, breaking the recursion that previously triggered "Global lock acquired by ... during per-model lock acquisition — releasing and retrying" forever.
  • A regression test (acquireModelLocks - force-releases stale global lock instead of infinite-looping) was added to src/cli/repo_context_test.ts to lock this in.

Verified end-to-end against MinIO in the original PR: planted a stale lock object, ran a model method, observed clean exit in ~1.9s with .datastore.lock deleted from the bucket. Once you upgrade past 60829024 you'll stop hitting the loop, regardless of what's holding the orphan.
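
For readers following along, a minimal sketch of the flow described above, assuming a DistributedLock with a read() accessor alongside forceRelease(expectedNonce); read() is my stand-in, forceRelease is named in the PR, and the bodies are illustrative rather than the actual swamp source.

// Illustrative only; mirrors the described behaviour, not the real implementation.
interface LockState { nonce: string; acquiredAt: number; }

interface DistributedLock {
  read(): Promise<LockState | null>;
  // Deletes the lock object only if it still carries expectedNonce,
  // re-verified immediately before the deleteObject.
  forceRelease(expectedNonce: string): Promise<boolean>;
}

async function tryForceReleaseStaleLock(
  lock: DistributedLock,
  ttlMs: number,
): Promise<boolean> {
  const state = await lock.read();
  if (state === null) return true;                          // already gone
  if (Date.now() - state.acquiredAt < ttlMs) return false;  // live holder, keep waiting
  // Keyed on the stale nonce: a live re-acquire changes the nonce,
  // so this can never clobber a freshly acquired lock.
  return lock.forceRelease(state.nonce);
}

The key difference from the old behaviour is that staleness detection now has a side effect on the S3 state, so the next read observes null instead of the same orphan.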

This addresses your suggested fixes #1 (atomic eviction via CAS) and #2 (process B no longer spins).

Problem 2: the Deno tls_wrap.rs panic — waiting on Deno 2.7.15

The Rust panic at ext/node/ops/tls_wrap.rs:2018:31 (called Option::unwrap() on a None value) is the same upstream Deno bug hitting lab #213, lab #219, our swamp-club deploy CI, and your kubevirt-dev run. Triggered when a detached or zero-length ArrayBuffer reaches the TLSWrap _writev / write_buffer code path — different swamp call sites tickle it (background update fetch, datastore push, kubectl-using extensions), but it's the same unwrap().
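
For anyone unfamiliar with the trigger class: a detached buffer is what remains after an ArrayBuffer's contents are transferred away, and every view over it collapses to zero length. A quick illustration in plain ES2024, nothing swamp-specific:

// Detaching an ArrayBuffer: byteLength drops to 0 and existing views go empty.
const buf = new ArrayBuffer(8);
const view = new Uint8Array(buf);
buf.transfer();              // transfers the contents away, detaching buf (ES2024)
console.log(buf.detached);   // true
console.log(buf.byteLength); // 0
console.log(view.length);    // 0; a zero-length view of this kind reaching the
                             // TLS write path is the shape that hit the unwrap()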

Already fixed upstream in denoland/deno#33737, merged 2026-05-01. The PR replaces the three unwrap() calls (lines 2000, 2018, 2068) with let-else that returns/continues gracefully, plus a regression test under tests/specs/node/tls_write_detached_buffer/. We're waiting on the next Deno release (likely v2.7.15, expected this week) and will rebundle swamp on top of it. That release closes out problem 2 across every trigger including yours.

This addresses your suggested fix #3 (the actual Deno-side investigation).

Problem 3: process stays alive after non-unwinding panic — residual concern

The "binary remained alive ~33 minutes" observation is independent of problems 1 and 2 and isn't fixed by either of the patches above. Rust prints "thread caused non-unwinding panic. aborting." but the swamp process stays in ps, holding whatever it was holding at the moment of panic. Even after Deno 2.7.15 lands and the specific tls_wrap panic is gone, anything else that panics during teardown could surface the same pattern. This sits at the Deno-runtime / event-loop boundary rather than in any extension.

This needs its own swamp-side investigation. We'll file a separate tracking issue for it once we've confirmed (post-2.7.15) that the panic-during-teardown class of bugs still leaves the process resident — that lets us scope the fix without conflating it with the upstream tls_wrap issue.

Closing this out

Going to close #224 as a duplicate — problem 1 is resolved on tip via #218, problem 2 is resolved by the pending Deno 2.7.15 bump, and problem 3 will get its own issue once we have a clean repro on a 2.7.15 build.

Immediate action for you: upgrade past commit 60829024 and the lock retry-loop will stop. Until Deno 2.7.15 ships, the panic itself can still fire against the kubevirt extension's TLS traffic, but with the lock-recovery fix in place it will no longer wedge subsequent invocations the way it did for you here.

bixu commented 5/4/2026, 5:05:23 PM

Stronger repro of behavior #2 (the retry loop): once an orphan lock exists in S3, killing every local swamp process does not clear it. With ps -ax | grep swamp | grep -v grep returning empty (zero live swamp PIDs system-wide), the next invocation still spins:

datastore·lock Global lock held by "[email protected]" — waiting for structural command to finish
datastore·lock Global lock held by "[email protected]" appears stale (exceeded TTL of 30000ms) — proceeding
datastore·lock Global lock released, proceeding with per-model locks
datastore·lock Global lock acquired by "[email protected]" during per-model lock acquisition — releasing and retrying
[loops indefinitely]

So "kill the wedged process" is not a reliable workaround — the orphan lock persists in S3 even when no holder is alive anywhere. The only escape is whatever forces eviction of the S3 lock object itself.

This reinforces fix suggestion #1 in the original report: stale-lock detection must atomically replace the holder (CAS swap), not "proceed" while leaving the stale entry intact. Right now stale detection has no actual side effect on the S3 state — the entry is read, declared stale, and left in place.

What are the supported procedures for clearing an orphaned S3 lock? Is there a swamp datastore lock clear or similar, or do users need to manually aws s3 rm the lock object?

bixu commented 5/4/2026, 5:08:28 PM

Found the actual mechanism behind the infinite retry loop: it's AWS SSO credential expiration during/before lock acquisition being misclassified as lock contention.

With fresh SSO creds, the same swamp model method run works normally. With expired SSO creds, the first S3.putObjectConditional call inside S3Lock.acquire throws CredentialsProviderError: Token is expired. That exception is bubbled up — but the calling layer in datastore_sync_coordinator.ts / repo_context.ts does not distinguish "auth failure" from "lock held by another caller", so it goes into the retry loop.

Fatal trace from a single, currently-failing invocation (no concurrent swamp processes anywhere on the host):

FTL error S3OperationError [CredentialsProviderError]: S3 putObjectConditional failed CredentialsProviderError — Token is expired. To refresh this SSO session run 'aws sso login' with the corresponding profile.
  at S3Client2.wrapError (.swamp/datastore-bundles/7c811b12/s3.js:55300:12)
  at S3Client2.run (.swamp/datastore-bundles/7c811b12/s3.js:55273:18)
  at async S3Client2.putObjectConditional (.swamp/datastore-bundles/7c811b12/s3.js:55392:7)
  at async S3Lock.acquire (.swamp/datastore-bundles/7c811b12/s3.js:55504:23)
  at async registerDatastoreSyncNamed (src/infrastructure/persistence/datastore_sync_coordinator.ts:269:7)
  at async acquireModelLocks (src/cli/repo_context.ts:810:5)
  ...
  [cause]: _CredentialsProviderError: Token is expired. To refresh this SSO session run 'aws sso login' with the corresponding profile.

Updated theory of behavior #2 (the retry loop, from the original report):

  • When SSO creds expire at startup → fast fatal exit (above stack), good behavior.
  • When SSO creds expire mid-execution (long-running model run, or token aged out between processes that share a daemon/cache) → S3Lock.acquire keeps throwing CredentialsProviderError, the lock layer treats those failures as transient lock contention, and the loop runs forever. The "stale lock" log lines we saw earlier weren't evidence of contention — they were the lock layer reading old state and then failing to write the replacement, indistinguishable in its logs from a real held lock.

Suggested fix narrows: in the lock-acquire retry path, distinguish CredentialsProviderError (and other auth/permission errors) from PreconditionFailed (the only condition that should mean "another caller holds the lock"). Auth errors should be fatal and surfaced immediately. The "Global lock released, proceeding" log line should also only fire when an actual eviction succeeded, not just when a read showed the entry as TTL-expired.
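
A sketch of the triage being proposed, with hypothetical names (acquireWithRetry, isPreconditionFailed, sleep); only the error classes themselves come from the traces above:

// Illustrative retry-path triage, not swamp's actual code. Only a
// PreconditionFailed on the conditional put means "another caller holds the lock".
async function acquireWithRetry(tryAcquire: () => Promise<void>): Promise<void> {
  for (;;) {
    try {
      await tryAcquire();
      return;
    } catch (err) {
      if (isPreconditionFailed(err)) {
        await sleep(1_000); // real contention: back off and retry
        continue;
      }
      // CredentialsProviderError and other auth/permission failures are fatal;
      // retrying cannot succeed, and looping on them is the bug reported here.
      throw err;
    }
  }
}

function isPreconditionFailed(err: unknown): boolean {
  return err instanceof Error && err.name === "PreconditionFailed";
}

function sleep(ms: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, ms));
}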

Workaround for users: when you see the retry loop, first check aws sso login status — odds are the credentials are expired, not that there's a real lock holder.
