Issue
Bug · Closed · Swamp CLI
Assignee: stack72

#224 S3 datastore stale-lock retry loop never evicts orphan locks left by swamp processes that panic at shutdown

Opened by bixu · 5/4/2026

Description

Two behaviors observed today compound to make swamp effectively hang forever once any prior invocation has panicked at shutdown.

1. swamp panics on shutdown but does not exit

After a successful method run (kubevirt-dev discover against a Harvester cluster), the binary printed:

thread 'main' panicked at ext/node/ops/tls_wrap.rs:2018:31:
called `Option::unwrap()` on a `None` value
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

thread 'main' (10569428) panicked at library/core/src/panicking.rs:230:5:
panic in a function that cannot unwind
stack backtrace:
   0:        0x100e32ecc - _uv_mutex_unlock
   1:        0x10009fb5c - <unknown>
   ...
thread caused non-unwinding panic. aborting.

The shell wrapper returned and printed this stack, but the underlying swamp binary remained alive — ps aux showed the process for ~33 minutes afterward, still holding the S3 datastore global lock. Killing it manually freed the lock.

2. Stale-lock detection enters an infinite retry loop instead of evicting

Any subsequent swamp invocation that touches the datastore enters this loop, with the same dead PID being reported as the holder repeatedly:

datastore·lock Global lock acquired by "[email protected]" during per-model lock acquisition — releasing and retrying
datastore·lock Global lock held by "[email protected]" appears stale (exceeded TTL of 30000ms) — proceeding
datastore·lock Global lock held by "[email protected]" — waiting for structural command to finish
datastore·lock Global lock held by "[email protected]" appears stale (exceeded TTL of 30000ms) — proceeding
datastore·lock Global lock released, proceeding with per-model locks
datastore·lock Global lock acquired by "[email protected]" during per-model lock acquisition — releasing and retrying
[...repeats indefinitely until process B also dies of the same panic...]

The "Global lock released, proceeding with per-model locks" line is misleading — the next per-model lock attempt re-encounters the same orphan global lock and starts the cycle again. Net effect: every swamp invocation hangs until a human kills the wedged earlier process.

Note: from a single user's perspective the holder shows the same identity ([email protected]) — so the retry logic is contending with our own dead self. There's no PID/instance discriminator visible to distinguish the live caller from the dead holder.

Suggested fixes

  1. Lock eviction must be atomic. When the TTL has expired, the caller should CAS-replace the lock holder (claim it as ours), not "proceed" while leaving the stale entry visible to the next read. Otherwise every caller sees the same stale orphan and never makes progress. (A sketch of this follows the list.)
  2. Process B should give up faster when it observes itself looping on its own identity — at minimum, log a clear "another swamp instance from this user is dead, run kill <pid>" message and exit non-zero, rather than spinning.
  3. Investigate the Deno tls_wrap panic on shutdown. It appears to fire reliably after method runs that use kubectl subprocesses. Even if the upstream Deno panic itself can't be fixed quickly, the swamp binary should ensure it actually exits when main panics: currently it prints "thread caused non-unwinding panic. aborting." and then never actually aborts, leaving the process resident.
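
For concreteness, here is a minimal TypeScript sketch of what fix #1 asks for. It is hypothetical throughout: the LockRecord shape, claimStaleLock, and the conditional read/replace primitives are illustrative stand-ins, not swamp's actual API. The nonce field also stands in for the per-acquisition discriminator that the note above says is missing.

// Hypothetical sketch, not swamp's real lock schema or API.
interface LockRecord {
  holder: string;     // e.g. "[email protected]"
  nonce: string;      // unique per acquisition; the missing instance discriminator
  acquiredAt: number; // epoch millis
}

// Stand-ins for conditional S3 reads/writes (assumed, not swamp's API).
declare function readLock(): Promise<LockRecord | null>;
declare function replaceLockIf(expectedNonce: string, next: LockRecord): Promise<boolean>;

// Atomically claim a TTL-expired lock instead of "proceeding" past it.
async function claimStaleLock(ttlMs: number, self: LockRecord): Promise<boolean> {
  const current = await readLock();
  if (current === null) return false;                        // nothing to evict
  if (Date.now() - current.acquiredAt < ttlMs) return false; // genuinely held, keep waiting
  // CAS: succeeds only if the lock still carries the stale nonce we just read.
  // A concurrent live re-acquire writes a new nonce, so the swap fails safely.
  return replaceLockIf(current.nonce, self);
}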

Environment

  • swamp 20260504.140403.0-sha.d4c9188f
  • macOS Darwin 25.4.0 (arm64, Apple Silicon)
  • Shared S3 datastore (HiveMQ collective recently migrated from local sqlite to S3 — PLT-477)
  • Method that triggered it: swamp model method run kubevirt-dev exec (@hivemq/harvester/kubevirt)

Workaround

ps -ax -o pid,etime,command | grep 'swamp model' | grep -v grep
kill <pid>

Lock frees immediately and any pending swamp callers complete on the next retry.

Bog Flow

Closed · 5/4/2026, 5:05:01 PM


Sludge Pulse
stack72 assigned stack72 · 5/4/2026, 4:47:05 PM

stack72 commented 5/4/2026, 5:04:58 PM

Thanks @bixu for the detailed report. With the wider context across lab #213, #219, and the deploy CI hits, this turns out to be three independent problems compounding — let me break them apart.

Problem 1: stale-lock recovery isn't atomic — already fixed

The "infinite retry loop" / "Global lock released, proceeding with per-model locks" cycle is exactly what swamp-club#218 described, and it was fixed in commit 60829024 (PR #1291), which landed shortly before you filed this. Your version 20260504.140403.0-sha.d4c9188f is the commit immediately before that fix.

That commit wires the existing DistributedLock.forceRelease(expectedNonce) breakglass primitive into both wait loops in acquireModelLocks (src/cli/repo_context.ts) via a new tryForceReleaseStaleLock helper. Behaviour now:

  • When the wait loop observes a stale global lock, the helper performs a CAS-style delete keyed on the lock's nonce. forceRelease re-verifies the nonce immediately before the deleteObject, so there's zero risk of clobbering a live re-acquire.
  • The post-acquire TOCTOU re-check sees null instead of the same orphan, breaking the recursion that previously triggered "Global lock acquired by ... during per-model lock acquisition — releasing and retrying" forever.
  • A regression test (acquireModelLocks - force-releases stale global lock instead of infinite-looping) was added to src/cli/repo_context_test.ts to lock this in.

Verified end-to-end against MinIO in the original PR: planted a stale lock object, ran a model method, observed clean exit in ~1.9s with .datastore.lock deleted from the bucket. Once you upgrade past 60829024 you'll stop hitting the loop, regardless of what's holding the orphan.
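
For readers following along, a minimal sketch of the flow described above, assuming a DistributedLock with a read() accessor alongside forceRelease(expectedNonce); read() is my stand-in, forceRelease is named in the PR, and the bodies are illustrative rather than the actual swamp source.

// Illustrative only; mirrors the described behaviour, not the real implementation.
interface LockState { nonce: string; acquiredAt: number; }

interface DistributedLock {
  read(): Promise<LockState | null>;
  // Deletes the lock object only if it still carries expectedNonce,
  // re-verified immediately before the deleteObject.
  forceRelease(expectedNonce: string): Promise<boolean>;
}

async function tryForceReleaseStaleLock(
  lock: DistributedLock,
  ttlMs: number,
): Promise<boolean> {
  const state = await lock.read();
  if (state === null) return true;                          // already gone
  if (Date.now() - state.acquiredAt < ttlMs) return false;  // live holder, keep waiting
  // Keyed on the stale nonce: a live re-acquire changes the nonce,
  // so this can never clobber a freshly acquired lock.
  return lock.forceRelease(state.nonce);
}

The key difference from the old behaviour is that staleness detection now has a side effect on the S3 state, so the next read observes null instead of the same orphan.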

This addresses your suggested fixes #1 (atomic eviction via CAS) and #2 (process B no longer spins).

Problem 2: the Deno tls_wrap.rs panic — waiting on Deno 2.7.15

The Rust panic at ext/node/ops/tls_wrap.rs:2018:31 (called Option::unwrap() on a None value) is the same upstream Deno bug hitting lab #213, lab #219, our swamp-club deploy CI, and your kubevirt-dev run. Triggered when a detached or zero-length ArrayBuffer reaches the TLSWrap _writev / write_buffer code path — different swamp call sites tickle it (background update fetch, datastore push, kubectl-using extensions), but it's the same unwrap().
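
For anyone unfamiliar with the trigger class: a detached buffer is what remains after an ArrayBuffer's contents are transferred away, and every view over it collapses to zero length. A quick illustration in plain ES2024, nothing swamp-specific:

// Detaching an ArrayBuffer: byteLength drops to 0 and existing views go empty.
const buf = new ArrayBuffer(8);
const view = new Uint8Array(buf);
buf.transfer();              // transfers the contents away, detaching buf (ES2024)
console.log(buf.detached);   // true
console.log(buf.byteLength); // 0
console.log(view.length);    // 0; a zero-length view of this kind reaching the
                             // TLS write path is the shape that hit the unwrap()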

Already fixed upstream in denoland/deno#33737, merged 2026-05-01. The PR replaces the three unwrap() calls (lines 2000, 2018, 2068) with let-else that returns/continues gracefully, plus a regression test under tests/specs/node/tls_write_detached_buffer/. We're waiting on the next Deno release (likely v2.7.15, expected this week) and will rebundle swamp on top of it. That release closes out problem 2 across every trigger including yours.

This addresses your suggested fix #3 (the actual Deno-side investigation).

Problem 3: process stays alive after non-unwinding panic — residual concern

The "binary remained alive ~33 minutes" observation is independent of problems 1 and 2 and isn't fixed by either of the patches above. Rust prints "thread caused non-unwinding panic. aborting." but the swamp process stays in ps, holding whatever it was holding at the moment of panic. Even after Deno 2.7.15 lands and the specific tls_wrap panic is gone, anything else that panics during teardown could surface the same pattern. This sits at the Deno-runtime / event-loop boundary rather than in any extension.

This needs its own swamp-side investigation. We'll file a separate tracking issue for it once we've confirmed (post-2.7.15) that the panic-during-teardown class of bugs still leaves the process resident — that lets us scope the fix without conflating it with the upstream tls_wrap issue.

Closing this out

Going to close #224 as a duplicate — problem 1 is resolved on tip via #218, problem 2 is resolved by the pending Deno 2.7.15 bump, and problem 3 will get its own issue once we have a clean repro on a 2.7.15 build.

Immediate action for you: upgrade past commit 60829024 and the lock retry-loop will stop. Until Deno 2.7.15 ships, the panic itself can still fire against the kubevirt extension's TLS traffic, but with the lock-recovery fix in place it will no longer wedge subsequent invocations the way it did for you here.

bixu commented 5/4/2026, 5:05:23 PM

Stronger repro of behavior #2 (the retry loop): once an orphan lock exists in S3, killing every local swamp process does not clear it. With ps -ax | grep swamp | grep -v grep returning empty (zero live swamp PIDs system-wide), the next invocation still spins:

datastore·lock Global lock held by "[email protected]" — waiting for structural command to finish
datastore·lock Global lock held by "[email protected]" appears stale (exceeded TTL of 30000ms) — proceeding
datastore·lock Global lock released, proceeding with per-model locks
datastore·lock Global lock acquired by "[email protected]" during per-model lock acquisition — releasing and retrying
[loops indefinitely]

So "kill the wedged process" is not a reliable workaround — the orphan lock persists in S3 even when no holder is alive anywhere. The only escape is whatever forces eviction of the S3 lock object itself.

This reinforces fix suggestion #1 in the original report: stale-lock detection must atomically replace the holder (CAS swap), not "proceed" while leaving the stale entry intact. Right now stale detection has no actual side effect on the S3 state — the entry is read, declared stale, and left in place.

What are the supported procedures for clearing an orphaned S3 lock? Is there a swamp datastore lock clear or similar, or do users need to manually aws s3 rm the lock object?

bixu commented 5/4/2026, 5:08:28 PM

Found the actual mechanism behind the infinite retry loop: it's AWS SSO credential expiration during/before lock acquisition being misclassified as lock contention.

With fresh SSO creds, the same swamp model method run works normally. With expired SSO creds, the first S3.putObjectConditional call inside S3Lock.acquire throws CredentialsProviderError: Token is expired. That exception is bubbled up — but the calling layer in datastore_sync_coordinator.ts / repo_context.ts does not distinguish "auth failure" from "lock held by another caller", so it goes into the retry loop.

Fatal trace from a single, currently-failing invocation (no concurrent swamp processes anywhere on the host):

FTL error S3OperationError [CredentialsProviderError]: S3 putObjectConditional failed CredentialsProviderError — Token is expired. To refresh this SSO session run 'aws sso login' with the corresponding profile.
  at S3Client2.wrapError (.swamp/datastore-bundles/7c811b12/s3.js:55300:12)
  at S3Client2.run (.swamp/datastore-bundles/7c811b12/s3.js:55273:18)
  at async S3Client2.putObjectConditional (.swamp/datastore-bundles/7c811b12/s3.js:55392:7)
  at async S3Lock.acquire (.swamp/datastore-bundles/7c811b12/s3.js:55504:23)
  at async registerDatastoreSyncNamed (src/infrastructure/persistence/datastore_sync_coordinator.ts:269:7)
  at async acquireModelLocks (src/cli/repo_context.ts:810:5)
  ...
  [cause]: _CredentialsProviderError: Token is expired. To refresh this SSO session run 'aws sso login' with the corresponding profile.

Updated theory of behavior #2 (the retry loop, from the original report):

  • When SSO creds expire at startup → fast fatal exit (above stack), good behavior.
  • When SSO creds expire mid-execution (long-running model run, or token aged out between processes that share a daemon/cache) → S3Lock.acquire keeps throwing CredentialsProviderError, the lock layer treats those failures as transient lock contention, and the loop runs forever. The "stale lock" log lines we saw earlier weren't evidence of contention — they were the lock layer reading old state and then failing to write the replacement, indistinguishable in its logs from a real held lock.

Suggested fix narrows: in the lock-acquire retry path, distinguish CredentialsProviderError (and other auth/permission errors) from PreconditionFailed (the only condition that should mean "another caller holds the lock"). Auth errors should be fatal and surfaced immediately. The "Global lock released, proceeding" log line should also only fire when an actual eviction succeeded, not just when a read showed the entry as TTL-expired.
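
A sketch of the triage being proposed, with hypothetical names (acquireWithRetry, isPreconditionFailed, sleep); only the error classes themselves come from the traces above:

// Illustrative retry-path triage, not swamp's actual code. Only a
// PreconditionFailed on the conditional put means "another caller holds the lock".
async function acquireWithRetry(tryAcquire: () => Promise<void>): Promise<void> {
  for (;;) {
    try {
      await tryAcquire();
      return;
    } catch (err) {
      if (isPreconditionFailed(err)) {
        await sleep(1_000); // real contention: back off and retry
        continue;
      }
      // CredentialsProviderError and other auth/permission failures are fatal;
      // retrying cannot succeed, and looping on them is the bug reported here.
      throw err;
    }
  }
}

function isPreconditionFailed(err: unknown): boolean {
  return err instanceof Error && err.name === "PreconditionFailed";
}

function sleep(ms: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, ms));
}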

Workaround for users: when you see the retry loop, first check aws sso login status — odds are the credentials are expired, not that there's a real lock holder.
