#260 Configurable concurrency limits for workflow fan-out (forEach, parallel jobs/steps)
Opened by john · 5/6/2026 · Shipped 5/6/2026
Problem statement
Swamp's workflow engine has no concurrency controls anywhere in the parallelism path. Independent jobs at the same topological level, parallel steps within a job, and forEach expansions are all funnelled through merge() (src/infrastructure/stream/merge.ts) without any semaphore, pool, or cap. The design doc explicitly calls this out as a goal — "maximum parallelism through the job" (design/workflow.md:14) — and the only hardcoded bound I could find anywhere in the engine is MAX_WORKFLOW_NESTING_DEPTH = 10 in src/.../execution_service.ts, which is about nesting depth, not width.
In practice, unbounded fan-out runs into two real ceilings that swamp itself is in the best position to enforce:
- Downstream API rate limits. A forEach over N items where each step calls an external API (LLM provider, GitHub, Slack, AWS, any rate-limited SaaS) sends N concurrent requests the instant the level becomes runnable. For a 100-item fan-out against an API with a cap of 50 concurrent requests, half the runs immediately fail with 429s. This isn't a programming error — it's the natural behaviour of a workflow engine that has no notion of "how many of these am I allowed to run at once."
- Local execution capacity. Under the docker driver, every parallel step spawns its own container — there is no cap in DockerExecutionDriver. A 200-item fan-out tries to spin up 200 simultaneous containers; the docker daemon, host kernel (file descriptors, PID limits), and host memory will fail well before swamp does. Under the raw driver, the same applies to in-process resource pressure.
Today every user solves this themselves, badly:
- Outer shell loops that batch invocations of the workflow (defeats the point of forEach)
- In-model sleep(index * delay) staggering (crude, brittle, model-author burden)
- Host-mounted flock lockfiles in the docker image (works, but completely outside swamp's awareness)
These all push concurrency policy out of the workflow definition and into ad-hoc user code, where it's reinvented per workflow with different tradeoffs.
Proposed solution
Introduce a concurrency: field on the workflow schema, configurable at three levels with the existing "first non-undefined wins" resolution (matching how driver: already works):
```yaml
# workflow level — applies to the whole DAG
concurrency: 10
jobs:
  - name: fan-out
    # job level — applies to all steps in this job
    concurrency: 5
    steps:
      - name: per-item
        forEach:
          item: target
          in: ${{ inputs.targets }}
          # forEach level — caps the expansion specifically
          concurrency: 3
        task: { ... }
```

Semantics:
- A non-negative integer is a hard cap on simultaneously executing units at that level (jobs / steps / forEach iterations).
- 0 or unset = current unbounded behaviour (so it's a pure additive change, no migration).
- A semaphore acquired before each merge() task starts and released on completion handles this cleanly without restructuring the topological executor (see the sketch below).
- Resolution order: forEach > step > job > workflow > unbounded. The most-local cap wins.
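A minimal sketch of the semaphore idea in TypeScript (Semaphore and withLimit are hypothetical names for illustration, not swamp's actual internals):

```ts
// Hypothetical sketch, not swamp's merge() implementation: a counting
// semaphore that hands slots to queued waiters on release.
class Semaphore {
  private queue: Array<() => void> = [];
  private active = 0;

  constructor(private readonly limit: number) {}

  async acquire(): Promise<void> {
    if (this.active < this.limit) {
      this.active++;
      return;
    }
    await new Promise<void>((resolve) => this.queue.push(resolve));
  }

  release(): void {
    const next = this.queue.shift();
    if (next) {
      next(); // hand the slot directly to the next waiter; active is unchanged
    } else {
      this.active--;
    }
  }
}

// Acquire before each task starts, release on completion; an undefined
// semaphore preserves today's unbounded behaviour.
async function withLimit<T>(
  sem: Semaphore | undefined,
  task: () => Promise<T>,
): Promise<T> {
  if (!sem) return task();
  await sem.acquire();
  try {
    return await task();
  } finally {
    sem.release();
  }
}
```

Wrapping each task handed to merge() in withLimit caps width without touching the topological executor, which is what makes this a small change rather than a scheduler rewrite.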
A complementary global setting would also be useful for daemon-wide protection of the host:
```sh
SWAMP_MAX_CONCURRENT_STEPS=20 swamp serve
```

This is a defence-in-depth setting for shared-host scenarios (CI runners, dev laptops) where individual workflow authors may not know the host's real ceiling.
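For the resolution order, a sketch of the "first non-undefined wins" lookup with the env var applied as a daemon-wide clamp (the field names and the clamping behaviour are assumptions, not settled design):

```ts
// Hypothetical resolution helper mirroring the proposed precedence:
// forEach > step > job > workflow > unbounded.
interface ConcurrencyScopes {
  forEach?: number;
  step?: number;
  job?: number;
  workflow?: number;
}

function resolveConcurrency(scopes: ConcurrencyScopes): number {
  // First non-undefined value wins; an explicit 0 means unbounded, so a
  // local `concurrency: 0` can deliberately lift an outer cap.
  const local =
    scopes.forEach ?? scopes.step ?? scopes.job ?? scopes.workflow ?? 0;

  // Defence-in-depth: never exceed the daemon-wide ceiling if one is set.
  const globalCap = Number(process.env.SWAMP_MAX_CONCURRENT_STEPS ?? 0);
  if (globalCap > 0) {
    return local > 0 ? Math.min(local, globalCap) : globalCap;
  }
  return local; // 0 = today's unbounded behaviour
}
```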
Alternatives considered
- Document the absence and tell users to batch externally. Status quo. Fine for trivial fan-outs; falls apart the moment a workflow does anything against a real API. Pushes a load-shedding concern from the engine into every workflow author's brain.
- Per-driver concurrency caps only (e.g., on docker:). Helps with container exhaustion but doesn't help with the more common API-rate-limit case, since you can absolutely overwhelm an HTTP API with the raw driver.
- Dynamic backpressure based on observed failure rates. Interesting but a much bigger feature; explicit caps are the 80% case and a prerequisite for any adaptive scheme later.
Why this belongs in swamp
The workflow engine is the only layer that knows the topology — which steps will run in parallel, which are blocked behind dependencies, when a level becomes runnable. Asking each model author to invent their own concurrency control means they don't have that view; they only see one invocation at a time. Putting the cap in the workflow schema keeps the policy where the topology is.
This is also the cleanest path to making forEach actually usable for real fan-out workloads. Right now, anyone running forEach over more than a handful of items has to either accept downstream failures or implement workarounds that defeat the declarative spirit of swamp.