Skip to main content

Models reference

The operator-facing surface for helmdeck's model-tier classification. Every chat completion model that hits the gateway falls into one of three tiers (A, B, C) plus an implicit "unknown" bucket that resolves to Tier C. The tier drives five tier-aware knobs simultaneously — catalog projection size, output token budget, prompt-template variant, strict-JSON mode, and prefix-cache routing. The full mechanism lives in internal/llmcontext/budgets.go and is described architecturally in ADR 050, ADR 051, and ADR 053. This page is the operator's quick lookup.

TierStructural complianceMandatory deposit-step complianceWhat to doDoc
Tier A (frontier)Works out of the box. Generic SKILL.md as shipped. No per-model profile required. Handles parallel tool use, full N-platform fanout, multi-criterion fit checks, one-clarifying-question discipline.Skips the mandatory deposit step in single-response workflows. Empirically observed (2026-06-09): Claude Sonnet 4.6 ran 8 platform variations via blog.rewrite_for_audience but never called artifact.put or artifact.verify_manifest — conflated "append CTA" with "deposit to artifacts." This is workflow-shape-dependent, not tier-specific (revised — see empirical findings below). Iterative multi-turn workflows succeed even on Tier C. The architectural answer is the audit-callback pattern (PR #462) wired as a typed call the agent invokes, with engine-level enforcement (#461 Phase 3) as the durable fix.Build your agent with the standard skill + your own SOUL/USER/IDENTITY/AGENTS for voice and operator persona. For multi-deposit workflows, structure the skill as a multi-turn workflow with explicit operator handoffs rather than asking the agent to do everything in one response. Verify behavior on your specific Tier A model.(no special doc — but see empirical findings below)
Tier B (mid-tier paid or strong free)Unknown — experiment first. Treat as a research question.Unknown. Run the same A/B and count artifact.put vs claimed deposits.A/B test the same prompt with generic skill vs. a borrowed Tier C profile. Compare artifact.put calls + verify_manifest result + hallucination count. Decide based on YOUR trace.docs/howto/experiment-with-tier-b-models.md
Tier C (free / open-weight)Must customize. Generic skill prose fails empirically — three failure modes documented (PR #462 + the field reports).Profile + iterative workflow shape makes mandatory tool calls reliable. Profile-aware Tier C agent in single-response mode fires verify_manifest via auto-deposit pipelines (2026-06-09 trace). Three-turn iterative workflow (outline → draft → operator-triggered deposit+verify) successfully calls explicit artifact.put + artifact.verify_manifest with all_present: true — even on a free gpt-oss-120b:free route. Latency is the trade-off: ~3-5 min per turn on free Tier C routes; feels like hanging in the UI but completes correctly.Use a model profile from models/<provider>-<model>.yaml as the starting point. Fork SOUL/USER/AGENTS per the profile's prompt_template. For multi-deposit workflows, split the work into explicit operator-triggered turns rather than asking for everything in one response — each turn small enough that chain-call reliability holds. Encode use-case-specific success criteria with hard counts.docs/howto/add-free-models.md

The library is a starting point, not a finished product. Operators MUST customize per (model × use-case). One library entry won't fit every use case. The mandatory deposit step needs engine-level enforcement to be reliable across tiers — that's #461 Phase 3.

Empirical findings from 2026-06-09

Same prompt (tech-blog-publisher on the mcp-adr-analysis-server source), same skill, two model tiers — three traces captured:

BehaviorTier C baseline (no profile)Tier C w/ profile (gpt-oss-120b:free)Tier A (claude-sonnet-4.6)
Parallel tool use at startup✓ 3 simultaneous
Real blog.rewrite_for_audience calls4 (in chat, not deposited)0 (used pipeline shortcut)✓ 8
Real pipeline-run calls (auto-deposit producer)0✓ 20
Real blog.append_cta calls008 (all REJECTED — see PR #468)
InfoQ 6-criterion fit check executedskippedskipped✓ per-criterion grades
Multi-step plan acknowledged upfrontpartialpartial✓ full 5-step plan stated
Honored "ask at most ONE clarifying question" rule✗ (hedged)✗ (hedged)✓ one question + stated defaults
artifact.put calls000
artifact.verify_manifest calls01 (all_present: true, 2 of 2 verified)0
Hallucinated manifest entries6 (earlier session)00

Reading the table: Tier A handles every structural aspect of the skill better than either Tier C variant — parallel tool use, full fanout, fit-check rigor, clarifying-question discipline. But on the load-bearing "deposit step is mandatory" rule, all three traces above (single-response workflows) skipped explicit artifact.put. Only the profile-aware Tier C variant fired verify_manifest — and it did so via the pipeline-run shortcut, where the audit was implicit in the pipeline contract rather than an explicit agent decision.

Refined finding from a fourth trace (later 2026-06-09): a Tier C agent running an iterative three-turn workflow (outline → draft → operator-triggered deposit+verify) on the same gpt-oss-120b:free route successfully called BOTH artifact.put AND artifact.verify_manifest, returning all_present: true. Latency was significant (~5 minutes for the deposit-and-verify turn on the free route), but the mandatory tool calls executed correctly.

Corrected conclusion: deposit-step skipping is workflow-shape-dependent, not tier-invariant. Single-response workflows asking the agent to do classify-outline-draft-deposit-verify-checklist in one go fail on every tier. Multi-turn iterative workflows with explicit operator handoffs (each turn small enough that 1-2 pack calls suffices) drive the mandatory calls reliably even on cheap Tier C. Engine-level enforcement (#461 Phase 3) remains the durable architectural answer because it removes the workflow-shape dependency entirely — but well-shaped iterative skill prose CAN drive the mandatory call on every tier we've tested.

When a skill produces N artifacts that must be deposited, structure it as:

  • Turn 1: agent classifies + outlines + asks at most one clarifying question + stops with explicit handoff (Reply 'proceed' to write the draft)
  • Turn 2: agent writes the full draft + stops with explicit handoff (Reply 'deposit' to save and verify)
  • Turn 3: agent calls artifact.put + artifact.verify_manifest + reports the result

Each turn is small enough that chain_call_reliability: high (1-2 pack calls per turn) actually applies. The mandatory tool calls fire reliably. Trade-off is operator effort (three message exchanges instead of one) and latency on free Tier C routes (~3-5 min per turn).

The handoff line at the end of each turn IS load-bearing — if the skill prose says "produce a handoff line" but doesn't list missing-handoff as an invalidation condition, the model will drop it. Pin handoff lines as success-criteria invalidation conditions in AGENTS.md (A response is invalid unless it ends with the literal text 'Reply with deposit ...').

Per-model profiles available today (per issue #464 Phase 1):

  • openai/gpt-oss-120b:free — sourced from OpenAI Harmony + Together AI + IBM watsonx docs; empirically validated 2026-06-09.
  • google/gemma-4-26b-a4b-it:freeprofile stub; 26B-total / 3.8B-active MoE Gemma 4 (256K context, multimodal); sourced from Google AI + Hugging Face + DeepMind docs; baseline empirical trace pending — Google AI Studio shared free pool gates upstream (2026-06-10 429 finding), BYOK recommended for sustained empirical work.
  • meta-llama/llama-3.3-70b-instruct:freeprofile stub; 70B dense Llama 3.3; sourced from Meta Llama 3.3 docs; baseline empirical trace pending.
  • nvidia/nemotron-3-super-120b-a12b:freeprofile stub; 120B/12B hybrid Mamba-Transformer MoE (1M context); sourced from Nvidia NIM + agentic-coding cookbook + technical blog; baseline empirical trace pending.
  • qwen/qwen3-coder:freeprofile stub; 480B-total / 35B-active MoE coder-specialized variant (1M context via YaRN, 256K native); sourced from Alibaba Qwen HF card + GitHub README + announcement blog; baseline empirical trace pending. Substituted for the originally-listed z-ai/glm-4.5-air:free, which was deprecated from OpenRouter on or before 2026-06-10.

Non-OpenRouter profiles (alternative routing) — per issue #482:

  • huggingface/openai/gpt-oss-120b — first non-OpenRouter template; reuses the gpt-oss prompting guidance from the OpenRouter sibling above (model behavior is provider-agnostic); routes through HuggingFace Inference Providers (router.huggingface.co/v1) with :fastest / :cheapest / :preferred provider-selection policies. Empirical sections empty — community contributions invited. See docs/howto/configure-non-openrouter-providers.md for routing setup.

Stub profiles ship the schema scaffold with docs-sourced prompting guidance. They invite community contributions per docs/howto/add-free-models.md § 7: operators running custom Tier C agents can submit trace excerpts to populate community_traces[] or open issues against any per-model YAML.

See also: docs/reference/model-profiles-schema.md for the canonical YAML schema reference; docs/howto/configure-non-openrouter-providers.md for routing setup on HuggingFace / Together / Groq / self-hosted.

How tier affects behavior

KnobTier ATier BTier CSource ADR
Catalog projectionpassthrough (no compaction)trim metadata to ~22 KBaggressive trim to ~10 KB + lexical pre-filterADR 050
Output token budget4,000–8,0002,0001,500ADR 051
helmdeck.plan prompt variantfull_steps — emit the complete pipeline JSON in one shotfull_stepssingle_pick — emit ONE step at a time + more_steps_likely flag; agent re-calls plan for the next stepADR 053
Strict-JSON / response_formatenabled when the provider supports itenabled for providers that support itdisabled — constrained-decoding deadlock risk on quantized inferenceADR 051 PR #3
Prefix-cache routing (catalog block in system prompt)enabled where supportedenabled on DeepSeek V4 ProdisabledADR 051 PR #4
LLM filter pass (two-pass cascade)off by defaultoffon by default (catalog won't fit otherwise)ADR 050 PR #4

The asymmetry is deliberate: Tier C models get more tier-aware protection (smaller catalog, smaller output, simpler prompt, conservative encoder flags) because they have less raw capability to handle the unprotected path. Tier A models get fewer protections because they don't need them.

When you'll see Tier C behavior

Three situations trigger the Tier C path:

  1. The model is explicitly in the Tier C list below — e.g. openrouter/nvidia/nemotron-3-super-120b-a12b:free, openrouter/moonshotai/kimi-k2.6, or any other entry below.
  2. The model matches a Tier C prefix — e.g. anything starting with openrouter/qwen/qwen-2.5- matches the broader openrouter/qwen/qwen-2.5- prefix entry.
  3. The model is unknown — never seen before; helmdeck falls back to Tier C as the conservative default. This is intentional: when we don't know enough about a model to classify it, we assume it might struggle with the unprotected path.

If you observe Tier C behavior on a model you believe should be Tier A or B (you're hitting the smaller output budget, the simpler single_pick plan path, or the aggressive catalog trim), check that your model id matches an entry below verbatim or as a prefix. The model id openrouter/nvidia/nemotron-3-super-120b-a12b:free is technically a 120B model but is Tier C because the free-tier inference quality (quantization + reasoning-token leak + 50% multi-step plan success — see the validation arc blog post) doesn't match what parameter count alone would suggest.

Tier A — frontier and frontier-relay models

Reliable on multi-step plans, structured output, and prefix caching. Catalog projection passes through unchanged; full output budget; full_steps plan variant; strict-JSON enabled.

Model id (prefix or exact)Input ceilingOutput budgetStrict JSONPrefix cacheHybrid reasoningNotes
anthropic/claude-opus-200K8KAnthropic flagship; extreme robustness under load
anthropic/claude-sonnet-200K8KPrimary working model class for helmdeck
anthropic/claude-3.7-sonnet200K8KThinking mode; <think> blocks stripped by ADR 051 helper. BFCL 73.24%, Aider 84.2%
anthropic/claude-haiku-200K4KFastest Anthropic; used for helmdeck.plan filter pass when budget allows
openai/gpt-4o100K4KBFCL 83.88%; stable; surfaces real HTTP errors rather than silent drops
openai/gpt-51M4KBFCL 72.92%, Aider 88.0%, 91.6% diff-format adherence — current frontier benchmark
openai/o3-mini200K4.4KBFCL 84.00%; reasoning model; <think> blocks stripped
google/gemini-2.5-pro1M4KBFCL 85.04% (leaderboard top), Aider 99.6% edit-format
google/gemini-2.5-flash1M2.5KBFCL 75.58%; fast + cheap; watch for safety-filter drops on code-execution prompts
openrouter/anthropic/claude-200K8KOpenRouter relay of the Anthropic family
openrouter/openai/gpt-4o100K4KOpenRouter relay
openrouter/openai/gpt-51M4KOpenRouter relay
openrouter/openai/o3-mini200K4.4KOpenRouter relay; reasoning model
openrouter/google/gemini-2.5-1M4KCovers both pro and flash routes

Tier B — capable mid-tier and large open-weight models

Reliable on most tasks but with documented degradation modes — trailing-comma JSON at high temperature, narrower context windows, occasional timeout on long reasoning. Catalog trimmed to ~22 KB; output budget capped at 2K; full_steps plan variant; strict-JSON enabled where the provider supports it.

Model id (prefix)Input ceilingOutput budgetStrict JSONPrefix cacheHybrid reasoningNotes
openrouter/meta-llama/llama-3.1-70b32K2KBaseline mid-tier hosted
openrouter/meta-llama/llama-3.3-70b32K2KLlama 3.3 family
openrouter/google/gemma-2-16K1.5KSmaller-context Tier B route
openrouter/mistralai/32K2KTrailing-comma JSON degradation at high temperature; native API supports strict mode
openrouter/deepseek/deepseek-v4-pro1M2KBFCL proxy 71.4%, Aider proxy 74.2%; HYBRID reasoning, 30× cache discount, can hit 30-min serverless timeouts — keep filter cascade on
openrouter/deepseek/deepseek-v3.2128K2KAider 74.2%; smaller context window than V4; same 30× cache discount
openrouter/deepseek/deepseek-chat128K2KCatches the broader deepseek-chat-v3 / v3.1 family
openrouter/x-ai/grok-256K2KBFCL proxy 61.38%, Aider 97.3% edit-format; price-tier bump past 128K context; xAI native supports strict mode

Tier C — free and weak open-weight models

Inconsistent on multi-step plans; chronic empty completions on the free tier; structured-output failures (constrained-decoding deadlock); hybrid-reasoning models that leak <think> content. helmdeck.plan routes these to the single_pick variant by default — one step per call, agent re-plans for the next — per ADR 053. The two-pass LLM filter cascade is on by default because the catalog won't fit a single-pass call.

The tier table rows below describe routing-level behavior — input ceiling, output budget, strict-JSON support, and the like. For prompting-level behavior — what prompt shape this model expects, what reasoning-effort knob it exposes, what failure modes are documented — see the per-model profile library linked in the previous section. Per-model profiles override the row-level Notes column with specific best practices, anti-patterns, and prompt templates sourced from official model docs.

Model id (prefix or exact)Input ceilingOutput budgetStrict JSONHybrid reasoningNotes
openrouter/openrouter/free16K1.5KOpenRouter's free routing tier — chronic empty completions due to infrastructure drops
openrouter/nvidia/nemotron-16K1.5KReasoning-trace models; the validation arc post measured 50% success at multi-step plan, ~71-char near-empty responses with 423 completion tokens (canonical reasoning leak). single_pick plan variant addresses both failure modes. Profile: nvidia/nemotron-3-super-120b-a12b:free (stub)
openrouter/z-ai/glm-16K1.5KBFCL 70.85%; infrastructure drops on the free routing tier. Note: z-ai/glm-4.5-air:free was deprecated from OpenRouter on or before 2026-06-10; only the paid z-ai/glm-4.5-air remains. Prefix row covers future GLM :free variants if Z.ai restores any.
openrouter/google/gemma-4-16K1.5K26B-A4B MoE Gemma 4; 256K context window, multimodal; binary thinking-mode toggle (no graded knob). Google AI Studio shared free pool gates upstream (2026-06-10 429 finding); BYOK recommended. Profile: google/gemma-4-26b-a4b-it:free (stub)
openrouter/meta-llama/llama-3.3-70b-instruct:free16K1.5KFree routing of Llama 3.3 70B; the non-:free route is Tier B. Profile: meta-llama/llama-3.3-70b-instruct:free (stub)
openrouter/qwen/qwen3-coder16K1.5K480B/35B MoE coder-specialized variant; 1M context via YaRN, 256K native; Agent RL post-trained for multi-turn tool use; non-thinking-mode only. Profile: qwen/qwen3-coder:free (stub)
openrouter/qwen/qwen-2.5-16K1.5KAider 71.4%; injects ```json code fences even in strict mode — ADR 051 helper unwraps them
openrouter/moonshotai/kimi-k2256K1.5KHYBRID reasoning (large <think> output); observed 5-minute timeouts on long prompts
openrouter/moonshotai/kimi-256K1.5KCovers kimi-latest and future Kimi releases until empirically reclassified
openrouter/tencent/250K1.5Khy3-preview + future Tencent routes; conservative until live-validated

Picking a model for your goal

If you wantUseWhy
The most reliable structured output on any complexity intentA Tier A model with strict-JSON enabled (anthropic/claude-sonnet-, google/gemini-2.5-pro, openai/gpt-5, or the openrouter/auto relay which Smart-routes among these)Full output budget, full catalog, strict-JSON-enforced shape
Lowest cost per helmdeck.plan call AND consistent outputopenrouter/autoOpenRouter's complexity-aware router picks the cheapest model that can handle the intent; usually lands on Tier A
To exercise the single_pick agent-loop patternAny Tier C model — e.g. openrouter/nvidia/nemotron-3-super-120b-a12b:free, openrouter/moonshotai/kimi-latestThese default to the single_pick plan variant — see ADR 053. The agent re-calls helmdeck.plan after each step
Best price-per-token among Tier Bopenrouter/deepseek/deepseek-v3.2 or openrouter/deepseek/deepseek-v4-proDeepSeek native route has 30× prefix-cache discount; cached input ~$0.0145/M tokens
Maximum context (1M+ tokens)openai/gpt-5, google/gemini-2.5-pro, openrouter/deepseek/deepseek-v4-proAll three accept 1M+ input tokens
Reasoning model (the model "thinks" before answering)Any IsHybridReasoning: true model — anthropic/claude-3.7-sonnet, openai/o3-mini, openrouter/deepseek/deepseek-v4-pro, the Kimi K2 family<think> blocks are stripped automatically per ADR 051 PR #1; first-byte latency is higher; net output quality is often better on hard problems

Overriding the tier

When a model defies its tier classification — most commonly a Tier C model whose operator knows it handles multi-step plans reliably, or a Tier A model where the operator wants single_pick for cost reasons — the explicit override mechanism lives on the Budget entry in internal/llmcontext/budgets.go. Set PromptVariant: PromptVariantFullSteps to force full_steps despite a Tier C default, or PromptVariant: PromptVariantSinglePick to force single_pick despite a Tier A default. Same posture as the existing IsHybridReasoning / WantsStrictJSON / SupportsPrefixCache flags from ADR 051 PR #2 — operators override per-entry when their per-model knowledge contradicts the table.

To add a new model entry or revise an existing one, follow the methodology in Calibrate model tiers. The tier classifications are not guesses; each entry's trailing comment cites the benchmark (BFCL, Aider) or production observation that motivated the placement.