Improve skill trigger routing accuracy across models
Opened by stack72 · 4/10/2026
Description
Multi-model skill trigger evals reveal systematic failure patterns affecting skill routing accuracy across Claude Sonnet, Claude Opus, GPT-5.4, and Gemini 2.5 Pro. All four models pass the 90% threshold, but the failures point to clear skill description issues that need fixing.
CI results (2026-04-10): https://github.com/systeminit/swamp/actions/runs/24239470178
| Model | Pass Rate | Passed | Failed |
|---|---|---|---|
| Sonnet | 99.0% | 200/202 | 2 |
| GPT-5.4 | 98.0% | 198/202 | 4 |
| Opus | 94.1% | 190/202 | 12 |
| Gemini 2.5 Pro | 91.6% | 185/202 | 17 |
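The pass rates above follow directly from the passed/total counts. A minimal sketch of that arithmetic and the 90% threshold check, using the numbers from the table; the `ModelResult` type and the `passRate` / `formatRate` helpers are hypothetical names for illustration and are not taken from the repository:

```typescript
// Hypothetical shape for one row of the results table.
interface ModelResult {
  model: string;
  passed: number;
  total: number;
}

// Threshold stated in the issue description (90%).
const PASS_THRESHOLD = 0.9;

// Pass rate as a fraction in [0, 1]; an empty run counts as 0.
function passRate(r: ModelResult): number {
  return r.total === 0 ? 0 : r.passed / r.total;
}

// Percentage with one decimal place, matching the table's format.
function formatRate(r: ModelResult): string {
  return `${(passRate(r) * 100).toFixed(1)}%`;
}

// Counts copied from the CI results table.
const results: ModelResult[] = [
  { model: "Sonnet", passed: 200, total: 202 },
  { model: "GPT-5.4", passed: 198, total: 202 },
  { model: "Opus", passed: 190, total: 202 },
  { model: "Gemini 2.5 Pro", passed: 185, total: 202 },
];

for (const r of results) {
  const status = passRate(r) >= PASS_THRESHOLD ? "pass" : "below threshold";
  console.log(`${r.model}: ${formatRate(r)} (${r.total - r.passed} failed, ${status})`);
}
```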
Cross-Model Failures (highest priority)
- extension-model should NOT trigger for "Can I run this custom model as part of a scheduled workflow?" - fails on 3/4 models (sonnet, gpt-5.4, opus). Description too broad.
- report SHOULD trigger for "What methods does UnifiedDataRepository expose in reports?" - fails on 3/4 models (sonnet, opus, gemini). Routed to troubleshooting instead.
- extension-driver should NOT trigger for "Run this workflow step on a remote Kubernetes cluster" - fails on 3/4 models (gpt-5.4, opus, gemini). Description too broad.
- model should NOT trigger for "How do I chain this model into an automated workflow?" - fails on 2/4 models (gpt-5.4, opus).
- workflow should NOT trigger for "The workflow is erroring on the second step" - fails on 2/4 models (opus, gemini).
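A list like the one above can be derived by grouping per-model failures by test case and counting the distinct models that fail each one. The following is a rough sketch of that grouping, not the logic in scripts/analyze_eval_results.ts; the `Failure` record shape, its field names, and the `crossModelFailures` helper are all assumptions made for illustration:

```typescript
// Hypothetical record of one failed eval case; field names are assumptions.
interface Failure {
  model: string;
  skill: string;
  query: string;
  shouldTrigger: boolean;
}

interface CrossModelFailure {
  skill: string;
  query: string;
  shouldTrigger: boolean;
  models: string[];
}

// Group failures by (skill, query, expectation) and keep the cases that
// fail on at least `minModels` distinct models, most widespread first.
function crossModelFailures(failures: Failure[], minModels = 2): CrossModelFailure[] {
  const groups = new Map<string, CrossModelFailure>();
  for (const f of failures) {
    const key = JSON.stringify([f.skill, f.query, f.shouldTrigger]);
    let group = groups.get(key);
    if (!group) {
      group = { skill: f.skill, query: f.query, shouldTrigger: f.shouldTrigger, models: [] };
      groups.set(key, group);
    }
    if (!group.models.includes(f.model)) group.models.push(f.model);
  }
  return Array.from(groups.values())
    .filter((g) => g.models.length >= minModels)
    .sort((a, b) => b.models.length - a.models.length);
}

// Two cases taken from the list above, expanded to one record per failing model.
const reportQuery = "What methods does UnifiedDataRepository expose in reports?";
const workflowQuery = "The workflow is erroring on the second step";
const sample: Failure[] = [
  { model: "opus", skill: "workflow", query: workflowQuery, shouldTrigger: false },
  { model: "gemini", skill: "workflow", query: workflowQuery, shouldTrigger: false },
  { model: "sonnet", skill: "report", query: reportQuery, shouldTrigger: true },
  { model: "opus", skill: "report", query: reportQuery, shouldTrigger: true },
  { model: "gemini", skill: "report", query: reportQuery, shouldTrigger: true },
];

for (const c of crossModelFailures(sample)) {
  const expectation = c.shouldTrigger ? "SHOULD" : "should NOT";
  console.log(`${c.skill} ${expectation} trigger for "${c.query}" - fails on ${c.models.length}/4 models (${c.models.join(", ")})`);
}
```

Raising `minModels` to 3 keeps only the highest-priority cases; lowering it to 2 also surfaces the cases that Pattern 4 suggests reviewing for ambiguity.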
Pattern 1: swamp-report poorly differentiated
3/4 models affected. Gemini alone has 7 report failures. Report queries are consistently routed to troubleshooting instead. Fix: Update .claude/skills/swamp-report/SKILL.md to mention report creation, output formats, dataRepository, UnifiedDataRepository, and dataHandles, and to differentiate the skill from troubleshooting.
Pattern 2: Extension descriptions too broad
Each skill fails on 3/4 models. extension-model and extension-driver trigger on usage queries that are not about creating extensions. Fix: Update .claude/skills/swamp-extension-model/SKILL.md and swamp-extension-driver/SKILL.md to emphasize creating new TypeScript extensions, and add explicit exclusions for usage queries.
Pattern 3: Text responses instead of tool calls
2/4 models affected (Opus: 8 cases, Gemini: 10 cases). Both respond conversationally instead of routing via tool call. Fix: Strengthen system prompt in evals/promptfoo/generate_config.ts (~line 242) or adjust expectations per model.
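If expectations are adjusted per model, it helps to first separate "routed to the wrong skill" from "did not route at all". A minimal sketch of that split follows. The `ModelResponse` shape is a simplified assumption, not the actual output format produced by evals/promptfoo/generate_config.ts or by any provider, and the sample responses are illustrative only:

```typescript
// Hypothetical, simplified response shape; real provider output may differ.
interface ModelResponse {
  model: string;
  text?: string;
  toolCalls?: { name: string }[];
}

type ResponseKind = "tool_call" | "text_only" | "empty";

// A response counts as routed only if it contains at least one tool call.
function classify(r: ModelResponse): ResponseKind {
  if (r.toolCalls && r.toolCalls.length > 0) return "tool_call";
  if (r.text && r.text.trim().length > 0) return "text_only";
  return "empty";
}

// Count conversational (text-only) responses per model.
function textOnlyCounts(responses: ModelResponse[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const r of responses) {
    if (classify(r) === "text_only") {
      counts[r.model] = (counts[r.model] ?? 0) + 1;
    }
  }
  return counts;
}

// Illustrative responses, not real eval output.
const responses: ModelResponse[] = [
  { model: "opus", text: "You can create a report by..." },
  { model: "opus", toolCalls: [{ name: "swamp-report" }] },
  { model: "gemini", text: "Sure, here is how reports work." },
  { model: "gemini", text: "   " },
];

console.log(textOnlyCounts(responses));
```

Tracking text-only responses separately from misrouted tool calls makes it easier to tell whether a system prompt change or a skill description change is the right fix for a given failure.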
Pattern 4: Ambiguous test cases
Some test cases may need reclassification in trigger_evals.json files. Review cases that fail on 2-3 models where routing is genuinely ambiguous.
Priority
- Fix report SKILL.md description
- Fix extension-model and extension-driver SKILL.md descriptions
- Review ambiguous test cases
- Address text response handling for Opus and Gemini
Reproduction
Set ANTHROPIC_API_KEY, OPENAI_API_KEY, and GOOGLE_API_KEY, then run eval-skill-triggers for each model. See scripts/analyze_eval_results.ts for cross-model analysis.
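Before running the evals, it can save a failed run to confirm that all three keys are present. A small sketch of such a pre-flight check; the `missingKeys` helper is a hypothetical name and is not part of the repository:

```typescript
// The three keys named in the reproduction steps.
const REQUIRED_KEYS = ["ANTHROPIC_API_KEY", "OPENAI_API_KEY", "GOOGLE_API_KEY"] as const;

// Return the required keys that are unset or blank in the given environment.
// The environment is passed in so the check works the same under any runtime.
function missingKeys(env: Record<string, string | undefined>): string[] {
  return REQUIRED_KEYS.filter((k) => (env[k] ?? "").trim() === "");
}

// Example with a placeholder value; only one of the three keys is set.
const missing = missingKeys({ ANTHROPIC_API_KEY: "placeholder-value" });
if (missing.length > 0) {
  console.log(`Missing API keys: ${missing.join(", ")}`);
}
```

In practice the argument would be the real environment, for example `Deno.env.toObject()` under Deno or `process.env` under Node.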