UNDERSTANDING WORKFLOW SUSPENSION

A manual_approval step does something the rest of a workflow does not: it stops. Most steps run, produce data, and hand off to the next. A gate, by contrast, suspends the whole run and waits — possibly for minutes, possibly for days — until a person decides. That difference shapes how swamp persists, resumes, and accounts for these runs, and it is worth understanding why the design is the way it is.

Why a workflow suspends rather than blocks

A naive way to "wait for approval" is to keep the running process alive, blocking on input. Swamp does not do this, and the reason is durability. An approval might not come for a long time. A process holding open for hours is fragile: a laptop sleeps, a CI runner is reclaimed, a deploy box reboots. If the wait lived only in memory, any of those would lose the run.

Instead, reaching a gate ends the current invocation. The run is not held in a process — it is written down. The run record is persisted under .swamp/workflow-runs/, capturing the run's status as suspended, each step's status (the gate sits at waiting_approval, completed steps keep succeeded, and steps after the gate stay pending), and the outputs produced so far. Once that record is on disk, no process needs to stay alive. The approval can arrive from a different shell, a different machine, or a different person, hours later, and the run is still exactly where it was left.
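To make the idea concrete, here is a minimal Python sketch of a suspended run being written down and read back. The status values (suspended, waiting_approval, succeeded, pending) and the .swamp/workflow-runs/ directory come from the description above. The JSON layout, the field names, the file name, the step names, and the helper functions are assumptions made for illustration, not swamp's actual on-disk format.

```python
import json
import tempfile
from pathlib import Path


def persist_run(root: Path, run: dict) -> Path:
    """Write the run record to disk so that no process has to stay alive."""
    runs_dir = root / ".swamp" / "workflow-runs"
    runs_dir.mkdir(parents=True, exist_ok=True)
    path = runs_dir / f"{run['id']}.json"  # hypothetical file name
    path.write_text(json.dumps(run, indent=2))
    return path


def load_run(path: Path) -> dict:
    """Read a run record back, possibly from another shell or machine."""
    return json.loads(path.read_text())


# Hypothetical record for a run that stopped at a gate.
suspended_run = {
    "id": "run-001",
    "status": "suspended",
    "steps": {
        "build": {"status": "succeeded", "outputs": {"image": "app:1.2.3"}},
        "approve-deploy": {"status": "waiting_approval"},
        "deploy": {"status": "pending"},
    },
}

with tempfile.TemporaryDirectory() as tmp:
    record_path = persist_run(Path(tmp), suspended_run)
    # The writing "process" is done; everything needed is in the file.
    restored = load_run(record_path)
```

The point of the sketch is that nothing outside the file is needed to know where the run stands: the restored dictionary carries the same statuses and outputs as the one that was written.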

How resume reconstructs the run

Because the run lives in its record rather than in memory, resuming is a matter of reading that record back and continuing from it. swamp workflow resume loads the persisted run, sees which steps already reached a terminal status, and starts only the steps that have not run yet — the gate's dependents and whatever follows them. Completed steps are not re-executed; their recorded outputs remain available to the steps that depend on them, exactly as in an uninterrupted run.

This is why a suspended run is not a paused thread waiting to be unparked. It is a checkpoint. Resume rebuilds the execution context from that checkpoint, which is also why the same dependency and data-chaining rules that govern a normal run govern the resumed remainder.
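The checkpoint behaviour described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than swamp's implementation: the record layout, the needs field, the set of terminal statuses, and the resume function are all hypothetical, and the example assumes the gate has already been approved and carries a terminal status in the record.

```python
# Assumed set of terminal statuses; the real list lives in the workflows reference.
TERMINAL = {"succeeded", "failed", "skipped"}


def resume(run, actions):
    """Execute only steps that have not reached a terminal status yet."""
    steps = run["steps"]
    remaining = [name for name, step in steps.items() if step["status"] not in TERMINAL]
    executed = []
    while remaining:
        # A step is ready once everything it depends on has succeeded.
        ready = [
            name for name in remaining
            if all(steps[dep]["status"] == "succeeded" for dep in steps[name].get("needs", []))
        ]
        if not ready:
            break  # nothing can make progress; leave the rest untouched
        for name in ready:
            # Recorded outputs of earlier steps are chained in as inputs.
            inputs = {dep: steps[dep].get("outputs", {}) for dep in steps[name].get("needs", [])}
            steps[name]["outputs"] = actions[name](inputs)
            steps[name]["status"] = "succeeded"
            executed.append(name)
            remaining.remove(name)
    if not remaining:
        run["status"] = "succeeded"
    return executed


build_calls = []
actions = {
    "build": lambda inputs: build_calls.append("build") or {},  # must never be called on resume
    "deploy": lambda inputs: {"deployed": inputs["build"]["image"]},
    "notify": lambda inputs: {"message": "deployed " + inputs["deploy"]["deployed"]},
}

run = {
    "status": "suspended",
    "steps": {
        "build": {"status": "succeeded", "needs": [], "outputs": {"image": "app:1.2.3"}},
        "gate": {"status": "succeeded", "needs": ["build"], "outputs": {}},  # assumed post-approval state
        "deploy": {"status": "pending", "needs": ["gate", "build"]},
        "notify": {"status": "pending", "needs": ["deploy"]},
    },
}

executed = resume(run, actions)
```

Here only deploy and notify run. The build step is never invoked again, yet its recorded image output still reaches deploy, which is the same data chaining an uninterrupted run would perform.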

Why approve and resume are separate

Approving a gate and resuming a run are two commands, not one, and the separation is deliberate. Approval is a small, fast, auditable act: it writes an approvalDecision — who decided, when, and optionally why — to the gate's step record. It does not run anything. Resume is the heavy operation: it reconstructs the context and executes the remaining steps, which may take real time and have real consequences.

Keeping them apart means the decision and the execution can be made by different people at different times. One operator can review and approve while another later resumes; the audit trail records the approver independently of whoever triggers the work. It also means a rejection is cheap and total: rejecting records the decision and marks the run failed without executing anything downstream.
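A small Python sketch shows how little a decision has to do. The approvalDecision name comes from the text above; its inner field names (decision, by, at, reason), the record layout, and the decide function are assumptions for illustration only.

```python
import copy
from datetime import datetime, timezone


def decide(run, gate, approver, approved, reason=None):
    """Record a decision on a waiting gate. Nothing downstream is executed here."""
    step = run["steps"][gate]
    if step["status"] != "waiting_approval":
        raise ValueError(f"step {gate!r} is not waiting for approval")
    step["approvalDecision"] = {  # inner field names are hypothetical
        "decision": "approved" if approved else "rejected",
        "by": approver,
        "at": datetime.now(timezone.utc).isoformat(),
        "reason": reason,
    }
    if not approved:
        # Rejection is total: the run fails and later steps never start.
        run["status"] = "failed"
    return run


template = {
    "status": "suspended",
    "steps": {
        "gate": {"status": "waiting_approval"},
        "deploy": {"status": "pending"},
    },
}

approved_run = decide(copy.deepcopy(template), "gate", "alice", True, "metrics look healthy")
rejected_run = decide(copy.deepcopy(template), "gate", "bob", False, "error rate too high")
```

After an approval the run is still suspended and deploy is still pending; a separate resume, possibly by someone else, does the work. After a rejection the run is failed and deploy has not been touched.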

A timeout adds a third consideration: the value of an approval can decay. A sign-off given against three-hour-old staging metrics may no longer be safe. The timeout expresses that the window for a decision is bounded; an approval offered after it has elapsed is refused rather than honoured.
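The bounded window can be expressed as a simple deadline check. This Python sketch assumes the window is measured from the moment the gate began waiting; that starting point, the ApprovalExpired exception, and the check_approval_window function are hypothetical and only illustrate the rule that a late approval is refused.

```python
from datetime import datetime, timedelta, timezone


class ApprovalExpired(Exception):
    """Raised when a decision arrives after the gate's window has closed."""


def check_approval_window(waiting_since, timeout, now):
    """Return the deadline if the decision is in time, otherwise refuse it."""
    deadline = waiting_since + timeout
    if now > deadline:
        raise ApprovalExpired(f"approval window closed at {deadline.isoformat()}")
    return deadline


waiting_since = datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc)
timeout = timedelta(hours=2)

# A decision one hour in is accepted.
in_time = check_approval_window(waiting_since, timeout, waiting_since + timedelta(hours=1))

# A decision three hours in is refused rather than honoured.
try:
    check_approval_window(waiting_since, timeout, waiting_since + timedelta(hours=3))
    late_refused = False
except ApprovalExpired:
    late_refused = True
```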

How this compares with other systems

Suspension is a common pattern in workflow engines, and swamp's version sits among familiar relatives, with its own emphasis on a durable on-disk record.

  • Argo Workflows offers a suspend template that pauses a workflow until it is resumed manually (argo resume) or an optional duration elapses. The shape — pause, wait for an external signal, continue — is the same.
  • AWS Step Functions uses a wait-for-callback pattern (.waitForTaskToken): a task pauses and emits a token, and the state machine only continues when an external caller returns that token with success or failure. The decision comes from outside the running execution, as a gate's approval does.
  • Jenkins pipelines use the input step to pause for human input or approval before proceeding. It is the closest analogue to a manual_approval gate guarding a deploy.

The recurring idea across all of these is that some transitions in an automated process are not the machine's to make. Where swamp differs is in keeping the suspended run as a plain, inspectable record on disk rather than as live engine state, so the wait survives anything that might happen to the process that started it.

For the lifecycle's exact statuses and fields, see the workflows reference. To gate a workflow yourself, see Gate a Workflow with Manual Approval.