Helmdeck — Run agentic workflows on cheap or local LLMs at 10× lower cost than frontier-model APIs.

Why helmdeck

Frontier-model APIs price a single agentic workflow at $0.20–$0.50. Helmdeck runs the same workflow on a cheap or local model for $0.05–$0.10, with deterministic packs absorbing the ambiguity that the model would otherwise burn tokens rediscovering.

Cheap models, real work

Run agentic browser, code, slides, vision, and desktop workflows on gpt-oss-120b, Gemma 4, or Mistral — the same Phase 5.5 code-edit loop that needs Sonnet on Cursor.

Deterministic primitives

36 typed capability packs do the work. The LLM only picks which pack to call. Move recurring deterministic work out of the expensive token-priced layer.

Self-hosted, audited

Your data, your keys, your hardware. Per-pack audit log, vault-backed credentials, egress-guarded network. Apache 2.0.

Read the full comparison →

Tutorials

Learning-oriented walkthroughs. Start here if helmdeck is new — go from zero to a working pack-driven agent with explicit steps.

Read →

How-to guides

Problem-solving recipes. Wire helmdeck into a specific MCP client, extend a sidecar, ship a webhook integration.

Read →

Reference

Information lookup. Pack contracts, SKILLS for LLMs, every Architecture Decision Record, project tracking.

Read →

Explanation

Understanding-oriented background. The why behind the security model and architecture choices.

Read →

Recently shipped

The latest engineering notes, design rationale, and field reports from the helmdeck project.

Jun 17, 2026

Render ≠ preview: what we learned shipping a hyperframes integration

A v0.29.2 pipeline produced 15 seconds of animation followed by 83 seconds of blank canvas. We assumed it was a slot-lifetime bug, filed upstream issues, shipped a fix, and tagged a release — then discovered that even upstream's own decision-tree example doesn't render at all (2 distinct frames over 15 seconds). The actual story: hyperframes has a known, documented 'render ≠ preview' bug class, and the registry's own decision-tree trips over it. Upstream's own `hyperframes lint` was telling us this the whole time. We wrapped it as a helmdeck pack so the next agent catches it before burning the render budget.

Read post →Jun 14, 2026

When agent-instruction docs drift from upstream spec

I wrote a best-practices guide for helmdeck's HyperFrames integration. A maintainer asked one question — 'where's this sourced from?' — and the answer turned out to be 'I made it up.' Here's what we did about it, and the broader lesson for anyone writing agent reference docs.

Read post →Jun 10, 2026

HuggingFace isn't just another LLM router — it's a platform helmdeck barely uses

PR #489 added HF Inference Providers as alternative routing. The bigger opportunity is everything else HF offers — datasets, embeddings, Spaces, tokenizers — that helmdeck currently ignores. Epic #490 frames the strategic direction.

Read post →Jun 9, 2026

Empirical validation: the audit-callback pattern fires (and the profile only gets you partway)

A profile-aware Tier C agent ran the audit-callback pattern end-to-end on openai/gpt-oss-120b:free — real artifacts, real verify_manifest with all_present:true. It also simplified the skill's 9-platform table to 2 variations. The library is a starting point, not a finished product.

Read post →Jun 9, 2026

Plausibility-shaped output: when Tier C models manifest deposits they never made

A Tier C free model produced a confidently-formatted six-entry deposit manifest, with byte sizes and a policy citation, for artifacts that never existed. One real pack call, six fabricated. The architectural fix is verify-against-ground-truth.

Read post →Jun 9, 2026

The audit-callback pattern: verify-against-ground-truth as anti-hallucination middleware

For any pack call an LLM might transform in its text response, ship a paired audit pack that reads ground truth. The architecture is the same shape as ADR 052 av-validate — applied at the chat-response layer instead of the artifact layer.

Read post →Jun 9, 2026

Tier A is structurally better. The deposit-step failure is universal.

We ran the same prompt on Claude Sonnet 4.6 that we ran on gpt-oss-120b:free. Tier A handles parallel tool use, 8-platform fanout, the InfoQ 6-criterion fit check, and the "one clarifying question" rule. It also skips the mandatory artifact.put step the same way Tier C does. The deposit-step failure is tier-invariant.

Read post →Jun 5, 2026

Recipe-style docs are dramatically underused. Here's the case for them.

We shipped a cookbook of intent → prompt recipes alongside our reference docs. Within 48 hours it had eclipsed the prompt-templates page as the most-linked-to doc in our reference site. The pattern is simple, the per-recipe cost is ~15 minutes, and most projects don't do it.

Read post →

All posts