Feature · Open · Swamp CLI
Assignees: None

#369 Add a manual_approval (pause) task type to workflow steps

Opened by bixu · 5/18/2026

Problem

Per design/workflow.md, workflow YAML supports only two task types: model_method and nested-workflow. There is no way to halt a workflow mid-flight and wait for an explicit operator action before the next step runs.

Current workarounds:

  1. Split a multi-phase rollout into N separate workflows the operator runs in sequence (file-level gate).
  2. Use allowFailure: true on a deliberately failing step so the run errors out at the gate point, which leaves a confusing "errored" entry in run history.

Both are awkward when the gate point is "operator must do something out-of-band (verify in a UI, mint a key, smoke-test a deploy) and then resume the same run."

Motivating use case

Rolling out Tailscale SSH to a 10-node Harvester bare-metal fleet (HiveMQ PLT-487). The rollout has four operator-gated phases (dev-canary → dev-fleet → prod-canary → prod-fleet) and each phase has a bootstrap step (install + tailscale up) followed by an SSH hardening step (disable password auth). Between bootstrap and harden the operator must verify Tailscale SSH from their laptop before allowing the harden step to proceed — otherwise a Tailscale bring-up failure plus the harden step together would lock everyone out of the node.

We had to express this as eight workflows (4 bootstrap + 4 harden) plus operator hygiene rules in workflow descriptions, because there is no way to encode "pause here, wait for operator OK, then continue" inside a single workflow.

The file-split workaround works but introduces real problems:

  • Run history fragments a single logical operation across eight runs. Reconstructing "what happened during the prod rollout" requires cross-referencing multiple run IDs by timestamp and name.
  • The gate point cannot share state cleanly. Auth keys, advertised tags, and other workflow inputs have to be re-passed at each phase boundary, multiplying operator surface area.
  • Operators can accidentally skip a phase or run them in the wrong order (e.g. harden before bootstrap, or harden before verifying SSH). No structural guardrail prevents this — only documentation in workflow descriptions.
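
For contrast, here is a rough sketch of how the same rollout could be laid out as one workflow if a gate step existed. It builds the step list in Python purely for illustration: the dict keys (name, type, prompt) and the step names are hypothetical and not taken from design/workflow.md; only the model_method and manual_approval type names come from this issue.

```python
# Hypothetical step layout for a single gated rollout workflow.
# Field names are illustrative assumptions, not swamp's schema.
PHASES = ["dev-canary", "dev-fleet", "prod-canary", "prod-fleet"]


def build_rollout_steps(phases):
    """Return bootstrap -> verify gate -> harden for each phase, in order."""
    steps = []
    for phase in phases:
        steps.append({"name": f"{phase}-bootstrap", "type": "model_method"})
        steps.append({
            "name": f"{phase}-verify-ssh",
            "type": "manual_approval",
            "prompt": f"Verify Tailscale SSH on {phase} nodes before hardening.",
        })
        steps.append({"name": f"{phase}-harden", "type": "model_method"})
    return steps
```

Because the gate sits structurally between bootstrap and harden, the ordering mistakes described above could not happen within a run, and all four phases would share one run ID and one set of inputs.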

Linked context (HiveMQ-internal, but readable):

  • Closed prior attempt: hivemq/hivemq-terraform-harvester#254
  • Current redesign PR: hivemq/hivemq-terraform-harvester#257

Proposed shape (conceptual)

A new step task type — call it manual_approval or pause — that suspends workflow execution at the step boundary and waits for an explicit ack before continuing. The minimal viable shape:

  • A prompt string the operator sees describing what they need to verify or do.
  • An approvers field that scopes who can approve (could be Tailscale identities, Okta groups, swamp users — whatever the auth surface supports). For v1 this could default to "any swamp user with write access to this repo."
  • A timeout after which the suspended step auto-fails, releasing the workflow lock.
  • A CLI command — e.g. swamp workflow approve <run-id> <step-name> — for granting or rejecting the gate. CLI-only is sufficient for our use case; a web UI under swamp serve would be a nice-to-have but not required.
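
To make the semantics of those four pieces concrete, here is a minimal in-memory sketch of a gate. Everything in it (the class name, field names, the explicit now argument) is an assumption made for illustration; it is not swamp's API, and a real implementation would sit behind the approve/reject CLI command.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class GateState(Enum):
    AWAITING_APPROVAL = "awaiting_approval"
    APPROVED = "approved"
    REJECTED = "rejected"
    TIMED_OUT = "timed_out"


@dataclass
class ManualApprovalGate:
    """Hypothetical model of a manual_approval step (not swamp's API)."""
    prompt: str                 # what the operator must verify or do
    timeout_seconds: float      # after this long the gate auto-fails
    opened_at: float            # timestamp when the step suspended
    # None stands in for the suggested v1 default: any user with write access.
    approvers: Optional[frozenset] = None
    state: GateState = GateState.AWAITING_APPROVAL
    decided_by: Optional[str] = None

    def _expire_if_due(self, now: float) -> None:
        waiting = self.state is GateState.AWAITING_APPROVAL
        if waiting and now - self.opened_at >= self.timeout_seconds:
            self.state = GateState.TIMED_OUT

    def decide(self, user: str, approve: bool, now: float) -> GateState:
        """Record an approve/reject decision; late or repeat decisions are ignored."""
        self._expire_if_due(now)
        if self.state is not GateState.AWAITING_APPROVAL:
            return self.state
        if self.approvers is not None and user not in self.approvers:
            raise PermissionError(f"{user} may not decide this gate")
        self.state = GateState.APPROVED if approve else GateState.REJECTED
        self.decided_by = user
        return self.state
```

Passing the clock in as now keeps the sketch deterministic. The one design point it tries to show is that the timeout check runs before any decision is accepted, so an ack that arrives after the deadline cannot revive a gate that has already failed.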

Why this matters beyond our case

Multi-phase rollouts with operator gates are the norm for any change that touches production infrastructure. Today swamp users will reach for the same file-split workaround we used and end up with worse audit history and more fragile orchestration than necessary. A first-class pause primitive would put swamp on par with Argo Workflows' suspend, AWS Step Functions' wait-for-token, and Jenkins Pipeline's input step, all of which solve the same shape of problem.

Affected components

  • Workflow YAML schema (new task type variant).
  • Workflow runtime (suspension + ack handling, persistence of suspended state across swamp process restarts).
  • Workflow run history (a new step state, "awaiting approval", that is neither running nor failed).
  • CLI surface (a new swamp workflow approve / reject command).
  • design/workflow.md (document the new task type).
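
The runtime and run-history items can be sketched together: a run record carries a per-step state, the runner stops at the first gate, and the record is written to disk so a restarted process can pick it up. This is a toy model under stated assumptions: the JSON layout, the state names other than "awaiting approval", and the function names are all hypothetical, and task execution is replaced by a stand-in.

```python
import json


def advance(run: dict) -> dict:
    """Run steps in order until the run finishes, fails, or reaches a gate."""
    for step in run["steps"]:
        if step["state"] == "succeeded":
            continue
        if step["state"] in ("failed", "awaiting_approval"):
            break  # nothing past a failed or still-pending gate may run
        if step["type"] == "manual_approval":
            step["state"] = "awaiting_approval"  # suspend at the step boundary
            break
        step["state"] = "succeeded"  # stand-in for executing the real task
    return run


def resolve_gate(run: dict, step_name: str, approved: bool) -> dict:
    """Apply an operator ack or rejection to a suspended gate step."""
    for step in run["steps"]:
        if step["name"] == step_name:
            if step["state"] != "awaiting_approval":
                raise ValueError(f"{step_name} is not awaiting approval")
            step["state"] = "succeeded" if approved else "failed"
            return run
    raise KeyError(step_name)


def save_run(run: dict, path: str) -> None:
    """Persist the run record so suspension survives a process restart."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(run, f)


def load_run(path: str) -> dict:
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

In this model a resumed process would call load_run, then resolve_gate with the operator's decision, then advance again. An approved gate lets the remaining steps run; a rejected gate leaves every later step untouched in its pending state.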

What I'd like from upstream

  1. Acknowledgement that this is a missing primitive worth adding.
  2. Pointer to any in-progress design (I checked design/workflow.md on main but didn't see one).
  3. If accepted, a rough sense of priority so we can plan around it on our side — keep the file-split if it's months out; collapse to a single workflow once it lands.

Out of scope for this issue

  • No web UI request; CLI-only would unblock us.
  • No fine-grained approval policies (multi-approver, OPA-style rules). Those can come later. The minimal viable primitive is "halt; wait for a CLI ack; continue or abort."

Status: Open (5/18/2026, 4:13:12 PM)
