Lab: Improve skill trigger routing for cross-model edge cases

Context

PR #1364 restructured 16 swamp-* skills into 11, improving trigger eval pass rate from 198/202 (98.0%) to 262/264 (99.2%) on sonnet. Cross-model CI confirms all 4 models pass the 90% threshold:

Sonnet: 262/264 (99.2%)
GPT-5.4: 258/264 (97.7%)
Opus: 251/263 (95.4%)
Gemini 2.5 Pro: 245/263 (93.2%)

However, the cross-model analysis surfaced patterns worth addressing in a follow-up.

Cross-model failures to revisit

1. Report authoring API terms route to swamp-extension instead of swamp-report (3/4 models)

"How do I read execution data in a report using dataHandles?" fails on sonnet, opus, and gemini. The terms dataHandles, UnifiedDataRepository, and MethodReportContext pull strongly toward swamp-extension because the report creation content was moved there. swamp-report still has these in its reference files but the frontmatter description no longer claims them.

Options: Either accept that authoring queries route to swamp-extension (and remove from swamp-report evals), or add authoring-specific triggers back to swamp-report's description.

2. "manifest.yaml to push" splits 50/50 across models

Sonnet/gemini route to swamp-extension-publish; GPT/opus route to swamp-extension. This query genuinely spans both skills — manifest creation is authoring, but "push" signals publishing intent.

Options: Remove this eval from both skills (it's legitimately ambiguous), or strengthen the publish description to claim manifest creation when paired with push intent.

3. Opus over-indexes on swamp-extension for debugging queries

Opus routes "type error in custom model code", "module not found", and "extension crashing" to swamp-extension instead of swamp-troubleshooting (5 failures). The "Do NOT use for debugging" negative trigger in swamp-extension's description is not strong enough for opus.

Options: Strengthen the negative trigger wording, or add "type error", "module not found", "crashing" as explicit triggers in swamp-troubleshooting.

4. "workflow erroring on second step" routes to swamp-workflow (2/4 models)

GPT and opus route this to swamp-workflow instead of swamp-troubleshooting despite "erroring" being in troubleshooting's triggers. The word "workflow" dominates.

Options: Add a stronger negative trigger to swamp-workflow for error/debugging queries, or accept that workflow-context debugging naturally lands in the workflow skill first.

5. Gemini text-only responses (no tool call)

Gemini 2.5 Pro fails several evals by responding with text instead of making a tool call. This appears to be a systemic issue with the promptfoo system prompt rather than a skill description problem.

Options: Tune the system prompt in generate_config.ts for gemini specifically, or accept the lower gemini baseline.

extension unyank needs -y/--yes flag for non-interactive use

Step key parsing in execution_service uses naive split(":") and silently truncates colon-containing step names

Workflow-scope report artifacts unreachable via `swamp data get --workflow`

Add --stdin support to method run and workflow run for Unix pipe composition

Docs: add run namespace to CEL expressions reference

Fix invalidate-then-reconcile sequencing in doctor extensions; failure-mode RowStates unreachable in sourceDetails

doctor extensions invalidateAll does not trigger fingerprint recheck for existing Indexed rows

Extension failure recording has dual write paths (legacy buildIndex vs W3 reconcile)

Expose workflow run ID in CEL so resource keys can be run-scoped

Add `swamp doctor workflows` subcommand to surface workflow YAML parse errors during preflight

Add 'swamp issue comment' command for updating existing issues

Close #327 — fixed in 20260511.160514.0-sha.9d03b09a

Nested workflow task fails with 'Bad resource ID' when workflowIdOrName uses @collective/ prefix

Detect stale skill directories and prompt for repo upgrade

Improve skill trigger routing for cross-model edge cases

Add type search for driver, datastore, and report kinds

Surface ReconcileFromDisk dryRun transitions in swamp doctor extensions

Docs: update doctor extensions reference for W6 aggregate-state rendering + repair flags

swamp issue: check if reporter is on an outdated binary before opening

Implement W6: swamp doctor extensions aggregate-state rendering + repair surface (extension catalog rearchitecture)

extension pull fails referencing removed source after extension source rm

Investigate whether allExtensionMethodsAttached guard can be removed via registration-path consolidation

Trace context is lost: CLI ignores inbound TRACEPARENT and raw driver doesn't propagate active context into in-process methods

swamp model create --global-arg KEY=VALUE doesn't coerce strings to z.number() schemas

swamp model create --global-arg KEY=VALUE doesn't coerce strings to z.number() schemas

swamp issue get should rate-limit unauthenticated users instead of blocking

swamp issue get should not require authentication

telemetry: emit child entries for follow-up action method invocations

Publish release-candidate / unstable extension versions

extension source rm leaves stale @local/. rows in catalog, blocking later pulls of registry types

extension push drops binaries: field from re-emitted archive manifest.yaml

macOS launchd autoupdate (club.swamp.autoupdate) silently fails — binary stays stale

Clicking @hivemq/honeycomb extension card on /extensions shows 'Something went wrong'

Docs: Update model-definitions.md and workflows.md for direct type execution

Docs: document binaries manifest field in extension-manifest.md

Add an official @swamp/ssh extension for general-purpose SSH (brownfield-friendly)

Accept and display binaries field from extension push metadata

Direct type execution: collapse model create + method run into one command

Per-method telemetry events for workflow runs

Workflow run liveness: orphaned 'running' records when originating CLI process dies mid-run

Provide a CLI-shape primer for AI agents to reduce rediscovery overhead

quality rubric: don't penalize extensions whose upstream constrains them to a single platform

Extension update rejects multiple .ts files extending the same target type within one local extension (regression)

swamp extension push deadlocks when invoked from inside a swamp workflow step

Collective-scoped auth keys + OIDC federation for CI publishing

forEach self.* in modelIdOrName not resolved in runtime execution path

Award leaderboard points for referrals and collective invites

Add agent harness detection and AiTool to telemetry

Workflow-level runtime expressions (env.*, vault.*) not resolved in driverConfig — docker driver receives literal ${{ ... }} strings

Implement W5: Per-fingerprint import URLs + subprocess test harness (extension catalog rearchitecture)

swamp config set crashes with YAML serialization error

datastore compact: VACUUM fails in compiled binary (SQLITE_LIMIT_ATTACHED=0)

Repo-level version gating: minSwampVersion high-water mark for team consistency

Docs: document self.* expressions in modelIdOrName during forEach

Docs: How-to guide for background autoupdating

Manifest version bumps silently ignored for existing local extension aggregates

materialiseExtensions misclassifies pulled rows when manifest name collides with a pulled extension

Local extension edits don't reliably trigger rebundle

Missing unique indexes on user.email and user.username allow duplicate users

Resolve self.* expressions in modelIdOrName during forEach expansion

discord-bot double-sends sign_up notifications

Discord bot sends duplicate signup notifications

Add 'swamp workflow list' as alias for 'swamp workflow search'

Add 'swamp auth status' as alias for 'swamp auth whoami'

Extension bundle cache does not invalidate on source edits

extension pull fails on @local/[email protected] phantom-claim collision when local repo has its own extension

Lab search by numeric issue ID returns no results

W3 sourceToRow writes empty source_mtime — should carry filesystem mtime through Source entity

Warm-start rebundleAndUpdateCatalog should respect terminal RowStates set by reconcile

Implement W4: KindAdapter + unified loader (extension catalog rearchitecture)

Docs: add vault read-secret command to reference manual

Extension layer garbage collection: prune catalog rows + evict orphaned bundles

Docs: document workflow concurrency limits in reference manual

Locally-sourced extension: source_mtime updates without regenerating stale bundle

Extension push: allow shipping executable host helpers (bin/mudroom blocker)

Vault expressions silently deliver __SWAMP_VSEC__ sentinels under the docker driver

First-class shell-shim support for extensions, with registry-level visibility

Local extension model bundles don't rebuild when source changes (no rebuild CLI; manual cache delete breaks the runner)

Configurable concurrency limits for workflow fan-out (forEach, parallel jobs/steps)

feat(security): redact sensitive method arg values from audit log

Workflow-level runtime expressions (env., vault.) not resolved in driverConfig — docker driver receives literal ${{ ... }} strings