01 Issue
Bug · Triaged · Swamp CLI
Assignees: stack72

#520 Per-model LockTimeoutError at 60s causes cascading failures under concurrent access

Opened by webframp · 6/1/2026

Summary

Under concurrent access (25 parallel subagents hitting the same model instance), a single long-running model method or workflow step holds the per-model datastore lock for more than 60s, so every other concurrent caller receives LockTimeoutError. The failure cascades: once one process times out, the next queued caller also hits the 60s limit, producing a chain of failures.

Environment

  • swamp version: 20260530.005533.0-sha.1c117111
  • OS: Linux (WSL2)
  • Single model instance of type @webframp/gitlab with ~45 extension methods
  • Local filesystem datastore (no Postgres/S3)

Steps to Reproduce

  1. Create a model instance and register ~45 extension methods on it.
  2. Spawn 10+ concurrent swamp model method run <model> <method> calls targeting the same model with varying methods (some fast at ~1s, some slow at ~3.5s).
  3. Meanwhile, spawn 3+ concurrent swamp workflow run calls that also target the same model across multiple steps.
  4. Observe lock contention within 1-2 minutes.
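
The reproduction steps can be driven from a small script. The following is a minimal sketch, not the harness used for this report: the helper names run_once and count_lock_timeouts are hypothetical, and it assumes the command writes the LockTimeoutError text shown under Observed Behavior to stdout or stderr.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_once(cmd: list[str]) -> bool:
    """Run one command; return True if its output mentions LockTimeoutError."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return "LockTimeoutError" in (proc.stdout + proc.stderr)


def count_lock_timeouts(cmd: list[str], workers: int, calls: int) -> int:
    """Issue `calls` invocations of `cmd` across `workers` threads.

    Returns how many invocations reported a lock timeout.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(run_once, [cmd] * calls))


# Example (placeholders: substitute a real model and method name):
# cmd = ["swamp", "model", "method", "run", "<model>", "<method>"]
# print(count_lock_timeouts(cmd, workers=10, calls=50))
```

Running several copies of this with different methods, plus a few swamp workflow run loops alongside, approximates steps 2 and 3.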

Observed Behavior

WRN datastore·lock Waiting for lock "data/@webframp/gitlab/.../.lock"
  held by "user@host" (pid XXXXX, acquired XXXms ago)

FTL error LockTimeoutError: Lock "data/@webframp/gitlab/.../.lock"
  held by user@host (pid XXXXX) — timed out after 60130ms

A single workflow process (pid 4130092 in our test) held the lock for several minutes across consecutive lock acquisitions, causing every other concurrent process to time out at 60s. The holder released the lock after ~3 rounds of acquisition (~3 minutes), at which point another queued process took over and repeated the pattern.

Workflow runs wrapped this as:

FTL error Error: 'Workflow execution failed: Lock ".../.lock" held by user@host (pid XXXXX) — timed out after 60135ms'
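
The ~3 minute hold against a 60s timeout can be checked with simple arithmetic. This is a rough model only: it assumes the holder's consecutive acquisitions behave like one continuous hold, which the logs above suggest but do not prove, and the function names are hypothetical.

```python
def waiter_times_out(arrival_s: float, hold_s: float, timeout_s: float = 60.0) -> bool:
    """True if a waiter arriving `arrival_s` seconds after the holder acquired
    the lock would give up before the holder releases it."""
    remaining_hold = hold_s - arrival_s
    return remaining_hold > timeout_s


def doomed_window_s(hold_s: float, timeout_s: float = 60.0) -> float:
    """Length of the arrival window in which every waiter is bound to time out."""
    return max(0.0, hold_s - timeout_s)


# With the observed ~180s hold and the 60s timeout, any caller arriving
# in the first 120s of the hold cannot succeed.
print(doomed_window_s(180.0))
```

Under the same model, a lock held only for a single ~3.5s method call leaves no doomed window at all, which is why releasing between workflow steps matters.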

Impact

  • ~80% of concurrent model method calls failed during contention periods
  • Entire workflow runs failed because any single step couldn't acquire the lock
  • No automatic recovery — the contention only resolved when the lock-holding process naturally completed its cycle

Expected Behavior

  • Lock acquisition should have a fair queue (FIFO) rather than everyone contending and timing out
  • Workflow steps should share or release the lock between steps, not hold it across the entire workflow job
  • A configurable timeout or retry mechanism would help
  • Consider a read/write lock pattern for read-only methods
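
To illustrate the FIFO and timeout requests above, here is a minimal in-process sketch of a ticket lock. It is an analogy only: swamp's per-model lock is a cross-process lock (a .lock path appears in the logs) and its implementation is not shown in this issue. The class name FairLock and its methods are hypothetical.

```python
import threading
import time


class FairLock:
    """Ticket lock: callers are served strictly in arrival order (FIFO)."""

    def __init__(self) -> None:
        self._cond = threading.Condition()
        self._next_ticket = 0      # next ticket number to hand out
        self._now_serving = 0      # ticket currently allowed to hold the lock
        self._abandoned: set[int] = set()  # tickets whose owners timed out

    def acquire(self, timeout: float) -> bool:
        """Wait up to `timeout` seconds for our turn; False on timeout."""
        deadline = time.monotonic() + timeout
        with self._cond:
            ticket = self._next_ticket
            self._next_ticket += 1
            while ticket != self._now_serving:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    # Give up, and mark the ticket so release() skips it.
                    self._abandoned.add(ticket)
                    return False
                self._cond.wait(remaining)
            return True

    def release(self) -> None:
        """Pass the lock to the next ticket that has not been abandoned."""
        with self._cond:
            self._now_serving += 1
            while self._now_serving in self._abandoned:
                self._abandoned.remove(self._now_serving)
                self._now_serving += 1
            self._cond.notify_all()
```

With this shape, a process that releases and immediately re-acquires goes to the back of the queue instead of beating the existing waiters, and the timeout is a per-call parameter rather than a fixed 60s.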
02 Bog Flow

Triaged

6/1/2026, 9:35:00 PM

03 Sludge Pulse
stack72 assigned stack72 · 6/1/2026, 9:33:02 PM

webframp commented 6/1/2026, 6:44:42 PM

A second stress test (1 hour, 15 parallel subagents) confirms this is severe and reproducible.

Results

Pattern | Subagents | Lock Waits | Lock Timeouts (60s)
Rapid dashboard (1-4s sleep between calls) | 3 | ~90 each | ~40 each
Mixed methods (get_current_user, list_all_projects, list_todos, 2-6s sleep) | 3 | 69 each | 19-32 each
daily-summary workflow only | 3 | ~63 each | 22-33 each
health-check + mr-triage workflows | 2 | ~59 each | ~32 each
Zero-sleep rapid fire (get_current_user, list_runners, list_todos in tight loop) | 1 | 110 | 45

All 7 model-method subagents in the table (3 rapid dashboard, 3 mixed methods, 1 zero-sleep) hit lock waits and timeouts within the first few minutes. The zero-sleep agent was worst, with 110 waits and 45 timeouts in 1 hour.

Key observation

Data operations (swamp data list, data query, data search) ran concurrently with zero lock contention — they use a different lock resource. This suggests the bottleneck is specifically the per-model datastore lock, not a global lock.

Impact quantification

  • ~30-40% of rapid-fire model method calls failed with LockTimeoutError during contention periods
  • 7-16 workflow runs failed entirely per subagent because individual steps could not acquire the lock
  • Workflow lock failures cascade: one step timeout fails the entire workflow

Additional note

The lock holder is tracked by PID, but there is no evidence of stale locks (a process dying while holding the lock). The contention comes purely from multiple processes queuing behind a slow method call such as dashboard (~3.5s execution).

webframp commented 6/1/2026, 9:02:31 PM

Retested on version 20260601.163824.0-sha.c2872a24 (15 subagents, 1 hour). Results are essentially unchanged:

Metric | First test | Retest
Lock waits (rapid-fire, 3 agents) | ~90 each | ~80 each
Lock timeouts (rapid-fire, 3 agents) | ~40 each | ~35 each
Lock waits (zero-sleep, 1 agent) | 110 | 99
Lock timeouts (zero-sleep, 1 agent) | 45 | 38

The per-model lock is still a hard bottleneck. Data operations (swamp data list/query/search) again showed zero contention, since they use a different lock domain.

Swamp version 20260601.163824.0-sha.c2872a24 does not appear to contain a fix for this issue.