01 Issue
Bug · Open · Swamp CLI
Assignees: None

Improve skill trigger routing accuracy across models

Opened by stack72 · 4/10/2026

Description

Multi-model skill trigger evals reveal systematic failure patterns that reduce skill routing accuracy across Claude Sonnet, Claude Opus, GPT-5.4, and Gemini 2.5 Pro. All four models pass the 90% threshold, but the failures point to specific skill description issues that need fixing.

CI results (2026-04-10): https://github.com/systeminit/swamp/actions/runs/24239470178

Model            Pass Rate   Passed    Failed
Sonnet           99.0%       200/202   2
GPT-5.4          98.0%       198/202   4
Opus             94.1%       190/202   12
Gemini 2.5 Pro   91.6%       185/202   17
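
The pass rates in the table above can be recomputed from the passed/total counts and checked against the 90% threshold. This is a minimal sketch; the `ModelResult` shape and the `summarize` helper are hypothetical names for illustration and are not taken from the swamp repository.

```typescript
// Hypothetical row shape mirroring the results table; not the repo's actual type.
interface ModelResult {
  model: string;
  passed: number;
  total: number;
}

// The 90% pass threshold stated in the description.
const THRESHOLD = 0.9;

// Counts copied from the CI results table (2026-04-10).
const results: ModelResult[] = [
  { model: "Sonnet", passed: 200, total: 202 },
  { model: "GPT-5.4", passed: 198, total: 202 },
  { model: "Opus", passed: 190, total: 202 },
  { model: "Gemini 2.5 Pro", passed: 185, total: 202 },
];

// Fraction of test cases that passed for one model.
function passRate(r: ModelResult): number {
  return r.passed / r.total;
}

// One summary line per model: rate, failure count, and threshold status.
function summarize(rows: ModelResult[]): string[] {
  return rows.map((r) => {
    const rate = passRate(r);
    const status = rate >= THRESHOLD ? "pass" : "FAIL";
    const failed = r.total - r.passed;
    return `${r.model}: ${(rate * 100).toFixed(1)}% (${failed} failed) ${status}`;
  });
}

const summaryLines = summarize(results);
summaryLines.forEach((line) => console.log(line));
```

Gemini 2.5 Pro has the least headroom of the four models above the threshold, which is why the description fixes below matter even though every model currently passes.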

Cross-Model Failures (highest priority)

  1. extension-model should NOT trigger for "Can I run this custom model as part of a scheduled workflow?" - fails on 3/4 models (sonnet, gpt-5.4, opus). Description too broad.
  2. report SHOULD trigger for "What methods does UnifiedDataRepository expose in reports?" - fails on 3/4 models (sonnet, opus, gemini). Routed to troubleshooting instead.
  3. extension-driver should NOT trigger for "Run this workflow step on a remote Kubernetes cluster" - fails on 3/4 models (gpt-5.4, opus, gemini). Description too broad.
  4. model should NOT trigger for "How do I chain this model into an automated workflow?" - fails on 2/4 models (gpt-5.4, opus).
  5. workflow should NOT trigger for "The workflow is erroring on the second step" - fails on 2/4 models (opus, gemini).
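
The ranking above amounts to grouping failures by test case and counting how many distinct models fail each one. The following is a sketch of that grouping under assumed names (`Failure`, `rankCrossModelFailures`); it is not the logic in scripts/analyze_eval_results.ts, whose implementation is not shown in this issue.

```typescript
// Hypothetical record of a single failed test case on a single model.
interface Failure {
  skill: string;
  query: string;
  model: string;
}

// A test case together with every model that failed it.
interface CrossModelFailure {
  skill: string;
  query: string;
  models: string[];
}

// Group failures by (skill, query), keep cases failing on at least
// `minModels` models, and sort so the most widespread failures come first.
function rankCrossModelFailures(
  failures: Failure[],
  minModels = 2,
): CrossModelFailure[] {
  const byCase = new Map<string, CrossModelFailure>();
  for (const f of failures) {
    const key = `${f.skill}\u0000${f.query}`;
    const entry = byCase.get(key) ?? { skill: f.skill, query: f.query, models: [] };
    if (entry.models.indexOf(f.model) === -1) entry.models.push(f.model);
    byCase.set(key, entry);
  }
  return Array.from(byCase.values())
    .filter((c) => c.models.length >= minModels)
    .sort((a, b) => b.models.length - a.models.length);
}

// Sample input built from items 1 and 4 in the list above.
const q1 = "Can I run this custom model as part of a scheduled workflow?";
const q4 = "How do I chain this model into an automated workflow?";
const sampleFailures: Failure[] = [
  { skill: "model", query: q4, model: "gpt-5.4" },
  { skill: "model", query: q4, model: "opus" },
  { skill: "extension-model", query: q1, model: "sonnet" },
  { skill: "extension-model", query: q1, model: "gpt-5.4" },
  { skill: "extension-model", query: q1, model: "opus" },
];

const ranked = rankCrossModelFailures(sampleFailures);
const widespread = rankCrossModelFailures(sampleFailures, 3);
console.log(ranked.map((c) => `${c.skill}: ${c.models.length}/4`));
```

Raising `minModels` to 3 isolates the cases most likely caused by the skill description itself rather than by one model's behavior.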

Pattern 1: swamp-report poorly differentiated

3/4 models affected. Gemini alone has 7 report failures. Report queries are consistently routed to troubleshooting. Fix: Update .claude/skills/swamp-report/SKILL.md to mention report creation, output formats, dataRepository, UnifiedDataRepository, and dataHandles, and to differentiate the skill from troubleshooting.

Pattern 2: Extension descriptions too broad

Each skill fails on 3/4 models. extension-model and extension-driver trigger on usage queries that are not about creating extensions. Fix: Update .claude/skills/swamp-extension-model/SKILL.md and swamp-extension-driver/SKILL.md to emphasize creating new TypeScript extensions, and add explicit exclusions.

Pattern 3: Text responses instead of tool calls

2/4 models affected (Opus: 8 cases, Gemini: 10 cases). Both respond conversationally instead of routing via tool call. Fix: Strengthen system prompt in evals/promptfoo/generate_config.ts (~line 242) or adjust expectations per model.
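
To keep text-response failures separate from genuine misroutes when adjusting expectations per model, each result for a should-trigger case could be classified before scoring. This sketch assumes a hypothetical `EvalResponse` shape; the real output format produced by the promptfoo config is not shown in this issue and may differ.

```typescript
// Hypothetical response shape; an assumption for illustration only.
interface EvalResponse {
  toolCall?: { skill: string };
  text?: string;
}

type Outcome = "routed" | "misrouted" | "text_response";

// Classify a response for a case where `expectedSkill` SHOULD trigger.
// A missing tool call is reported separately from a wrong tool call.
function classify(response: EvalResponse, expectedSkill: string): Outcome {
  if (!response.toolCall) return "text_response";
  return response.toolCall.skill === expectedSkill ? "routed" : "misrouted";
}

// Count outcomes so conversational answers can be tracked per model.
function countOutcomes(
  responses: EvalResponse[],
  expectedSkill: string,
): Record<Outcome, number> {
  const counts: Record<Outcome, number> = { routed: 0, misrouted: 0, text_response: 0 };
  for (const r of responses) counts[classify(r, expectedSkill)] += 1;
  return counts;
}

// Illustrative inputs, not taken from the CI run.
const outcomeCounts = countOutcomes(
  [
    { toolCall: { skill: "report" } },
    { toolCall: { skill: "troubleshooting" } },
    { text: "Reports expose several methods..." },
  ],
  "report",
);
console.log(outcomeCounts);
```

Separating the two failure kinds would show whether a prompt change in generate_config.ts reduces conversational answers without masking real routing errors.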

Pattern 4: Ambiguous test cases

Some test cases may need reclassification in trigger_evals.json files. Review cases that fail on 2-3 models where routing is genuinely ambiguous.

Priority

  1. Fix report SKILL.md description
  2. Fix extension-model and extension-driver SKILL.md descriptions
  3. Review ambiguous test cases
  4. Address text response handling for Opus and Gemini

Reproduction

Set ANTHROPIC_API_KEY, OPENAI_API_KEY, and GOOGLE_API_KEY, then run eval-skill-triggers for each model. See scripts/analyze_eval_results.ts for cross-model analysis.
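
Before starting a multi-model run, it can help to confirm that all three keys are present. A minimal sketch, assuming a hypothetical `missingKeys` helper that takes the environment as a plain object so it works in any runtime:

```typescript
// The three keys named in the reproduction steps.
const REQUIRED_KEYS = ["ANTHROPIC_API_KEY", "OPENAI_API_KEY", "GOOGLE_API_KEY"];

// Return the names of required keys that are unset or blank in `env`.
function missingKeys(env: Record<string, string | undefined>): string[] {
  return REQUIRED_KEYS.filter((key) => {
    const value = env[key];
    return value === undefined || value.trim() === "";
  });
}

// Illustrative environment with placeholder values, not real credentials.
const exampleMissing = missingKeys({
  ANTHROPIC_API_KEY: "placeholder",
  OPENAI_API_KEY: "",
});
console.log(exampleMissing);
```

In practice the argument would be `process.env` under Node or `Deno.env.toObject()` under Deno, and the run would stop early if the returned list is non-empty.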

02 Bog Flow
OPEN · TRIAGED · IN PROGRESS · SHIPPED

Open

4/10/2026, 12:31:02 PM

No activity in this phase yet.
