<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/">
    <channel>
        <title>Helmdeck blog</title>
        <link>https://helmdeck.dev/blog</link>
        <description>Engineering notes, design rationale, and field reports from the helmdeck project.</description>
        <lastBuildDate>Fri, 08 May 2026 00:00:00 GMT</lastBuildDate>
        <docs>https://validator.w3.org/feed/docs/rss2.html</docs>
        <generator>https://github.com/jpmonette/feed</generator>
        <language>en</language>
        <copyright>Copyright © 2026 Tosin Akinosho.</copyright>
        <item>
            <title><![CDATA[Why a $0.10 model can do work that needs a $3 model]]></title>
            <link>https://helmdeck.dev/blog/cheap-models-do-frontier-work</link>
            <guid>https://helmdeck.dev/blog/cheap-models-do-frontier-work</guid>
            <pubDate>Fri, 08 May 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Helmdeck moves intelligence from the LLM to the pack handler. A look at where the cost actually goes — and how cheap or local models can run agentic workflows that frontier-model APIs charge 10× more for. With the prompts and recipe to test it yourself.]]></description>
            <content:encoded><![CDATA[<blockquote>
<p>⚠️ <strong>These are my findings, not a vendor benchmark.</strong> I ran them on one helmdeck install, with a specific set of prompts, against a few specific competitor stacks. Your numbers will probably differ. The recipe to reproduce is at the bottom — if your numbers disagree, please <a href="https://github.com/tosin2013/helmdeck/issues/new" target="_blank" rel="noopener noreferrer" class="">share</a> and I'll update this page.</p>
</blockquote>
<p>Today my helmdeck install ran a 6-step Phase 5.5 code-edit loop on <code>gpt-oss-120b</code> for <strong>$0.07 total</strong>: clone a repo, read a file, apply a one-line patch, run tests, commit, push. The same loop run directly through Cursor or Claude Code on Sonnet would have cost <strong>$0.30+</strong>. Same outcome; roughly a 4&ndash;5&times; cost gap.</p>
<p>That's not unusual. Here's what I see across four common workflows (the long-form page covers a fifth):</p>
<table>
<thead>
<tr><th>Workflow</th><th>Frontier-model approach</th><th>Helmdeck (gpt-oss-120b)</th></tr>
</thead>
<tbody>
<tr><td>Browser scrape + GitHub comment</td><td>$0.25 (Anthropic Computer Use)</td><td><strong>$0.005</strong></td></tr>
<tr><td>Code edit loop (6 steps)</td><td>$0.35 (Cursor / Aider)</td><td><strong>$0.07</strong></td></tr>
<tr><td>Multi-step browser test</td><td>$0.20 (Browser-use NL)</td><td><strong>$0.03</strong></td></tr>
<tr><td>PDF → structured Markdown</td><td>$1.00 (naive Sonnet vision)</td><td><strong>$0.003</strong></td></tr>
</tbody>
</table>
<p>That's a per-task cost reduction ranging from about 5&times; to over 300&times;, depending on the workflow. Why?</p>
<h2 id="the-structural-reason">The structural reason</h2>
<p>Every alternative approach asks the LLM to do all the work. Anthropic's Computer Use API has Sonnet drive a screenshot-reason-action loop where every step is a vision-laden API call. OpenAI Operator does the same shape on GPT-4o. Browser-use has the LLM author selectors and decisions per step. Cursor and Claude Code read entire files into context to reason about a one-line edit. Naive function-calling on Sonnet has the model figure out tool schemas, retries, error semantics, and state management on every fresh agent session.</p>
<p>Helmdeck inverts the split. <strong>Packs are typed, security-bounded, audited primitives.</strong> The pack handler is Go code that already knows how to talk to Firecrawl / Docling / Playwright MCP / git / xdotool / GitHub's REST API. The LLM emits a short JSON tool call (~50–200 tokens) and reads a short JSON response (~200–800 tokens). It doesn't need to figure out the API surface — that's done once, in code, and amortized forever.</p>
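<p>The split is easy to see in miniature. Here is a sketch of the dispatch side in Python (helmdeck's real handlers are Go; the registry, handler names, and stubbed Firecrawl call below are illustrative, not helmdeck's actual API):</p>

```python
import json

# Hypothetical pack-handler registry; helmdeck's real handlers are Go code
# bound to Firecrawl, Playwright, git, etc. All names here are illustrative.
HANDLERS = {}

def pack(name):
    """Register a deterministic handler under a tool name."""
    def register(fn):
        HANDLERS[name] = fn
        return fn
    return register

@pack("helmdeck__web-scrape")
def web_scrape(url):
    # A real handler would call Firecrawl and return clean Markdown;
    # stubbed here so the sketch is self-contained.
    return {"status": "ok", "markdown": f"# Article scraped from {url}"}

def dispatch(tool_call_json):
    # The entire LLM-facing surface: short JSON in, short JSON out.
    call = json.loads(tool_call_json)
    return json.dumps(HANDLERS[call["name"]](**call["arguments"]))

result = dispatch('{"name": "helmdeck__web-scrape", '
                  '"arguments": {"url": "https://example.com/article"}}')
```

<p>Everything above the <code>dispatch</code> boundary is paid for in tokens on every run; everything below it is paid for once, in code.</p>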
<p>This is the pattern that's been load-bearing in software for decades. Compilers vs. interpreters. Postgres vs. let-the-LLM-compute-everything. Move recurring deterministic work <em>out</em> of the expensive token-priced layer <em>into</em> the cheap deterministic layer. Reserve the expensive layer for the irreducibly judgment-y parts.</p>
<p><a class="" href="https://helmdeck.dev/integrations/SKILLS"><code>SKILLS.md</code></a> — the agent skill bundle — teaches the model the catalog and contracts upfront so it picks the right pack on first try. It's ~9 KB, prompt-cached, and shaves another ~50% off per-workflow cost on weak models because they stop fumbling schemas and dropping session ids.</p>
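<p>The overhead of carrying that bundle is easy to bound. A back-of-envelope sketch, with every rate below an illustrative assumption rather than a provider quote:</p>

```python
# Back-of-envelope cost of shipping SKILLS.md with every request.
# All rates below are illustrative assumptions, not provider quotes.
skills_tokens = 9_000 / 4      # ~9 KB at ~4 bytes per token ≈ 2,250 tokens
input_price = 0.15 / 1e6       # assumed $/input token for a cheap open model
cache_discount = 0.10          # assumed cached-read price multiplier

cold_cost = skills_tokens * input_price    # first, uncached request
warm_cost = cold_cost * cache_discount     # every prompt-cached request after

print(f"cold ≈ ${cold_cost:.5f}, warm ≈ ${warm_cost:.6f} per request")
```

<p>Even uncached, the bundle costs a few hundredths of a cent per request; cached, it is close to free, so it pays for itself the first time it prevents one schema-fumbling retry.</p>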
<h2 id="one-concrete-example-webscrape">One concrete example: web.scrape</h2>
<p>Say you need to scrape an article and post a GitHub issue summarizing it.</p>
<p><strong>Anthropic Computer Use approach.</strong> Sonnet receives a goal. It takes a screenshot. It reasons "I should navigate to the URL" and emits a <code>computer</code> tool call. It gets a screenshot back. It reasons "the page is loaded, I should select the article body" and emits a <code>computer</code> call to scroll. Another screenshot. Another reason step. Maybe 8–12 turns later, it has the content extracted, summarizes it, and emits one more call to file the issue. Each turn carries 1500+ image tokens. Total: ~$0.25.</p>
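<p>One set of assumptions that lands near the ~$0.25 figure: each turn resends the whole conversation, so screenshot tokens accumulate and total input grows quadratically with turn count. The per-token price and per-screenshot token count below are illustrative:</p>

```python
# Why a screenshot-reason-action loop gets expensive: each turn resends
# the full conversation, so image tokens accumulate across turns.
# 1,500 tokens/screenshot and $3/M input tokens are assumed figures.
turns = 10
image_tokens_per_screenshot = 1_500
input_price = 3 / 1e6

# Turn k carries all k screenshots seen so far: 1 + 2 + ... + 10 = 55 screenshots.
total_input = sum(k * image_tokens_per_screenshot for k in range(1, turns + 1))
cost = total_input * input_price
print(f"{total_input} input tokens, ≈ ${cost:.2f}")
```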
<p><strong>Helmdeck approach.</strong> The LLM emits one tool call:</p>
<pre><code class="language-json">{"name": "helmdeck__web-scrape", "arguments": {"url": "https://example.com/article"}}</code></pre>
<p>The pack handler talks to Firecrawl, gets clean Markdown back, returns it as the tool result. The LLM emits one more tool call:</p>
<pre><code class="language-json">{"name": "helmdeck__github-create_issue",
 "arguments": {"repo": "owner/repo", "title": "...", "body": "..."}}</code></pre>
<p>Two short LLM round-trips. Total: ~$0.005.</p>
<p>The 50× cost gap isn't because helmdeck has a cleverer model — it's because Firecrawl already knows how to scrape SPAs, and a deterministic pack handler is doing 90% of the work that the model would otherwise spend tokens rediscovering on every run.</p>
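<p>For comparison, a rough token budget for the two-call helmdeck flow, again with assumed prices for a cheap hosted open model (every number below is illustrative):</p>

```python
# Rough token budget for the two-call helmdeck flow shown above.
# Prices and token counts are illustrative assumptions, not quotes.
in_price, out_price = 0.15 / 1e6, 0.60 / 1e6   # assumed $/token

call1_in, call1_out = 4_000, 150    # system prompt + SKILLS + goal -> tool call
call2_in, call2_out = 16_000, 400   # adds the scraped-Markdown tool result

cost = (call1_in + call2_in) * in_price + (call1_out + call2_out) * out_price
print(f"two round-trips ≈ ${cost:.4f}")
```

<p>That lands in the half-cent range quoted above; even doubling every token count keeps it an order of magnitude under the screenshot loop.</p>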
<h2 id="where-it-doesnt-win">Where it doesn't win</h2>
<p>I'm not arguing helmdeck wins everywhere:</p>
<ul>
<li class=""><strong>One-off, ad-hoc tasks where no pack fits.</strong> Pack overhead doesn't amortize over a single use; just ask Sonnet directly.</li>
<li class=""><strong>Truly novel workflows</strong> where the LLM has to reason from first principles. Packs absorb common shapes; new shapes still need the model to invent.</li>
<li class=""><strong>Orgs already running tuned Sonnet pipelines that work.</strong> Don't fix what isn't broken.</li>
<li class=""><strong>Self-hosted ops cost.</strong> A helmdeck install needs CPU/RAM for sidecars, storage, upgrades. The economics work when you're running many tasks across shared infra, not for one user / one machine / one workflow.</li>
</ul>
<p>If your situation hits any of these, the comparison numbers don't apply to you. The full breakdown — including all five workflows and the model-vs-pack split per task — is on the long-form <a class="" href="https://helmdeck.dev/explanation/why-helmdeck">Why helmdeck</a> page.</p>
<h2 id="test-it-yourself">Test it yourself</h2>
<p>The most useful thing you can do with this post is reproduce the numbers (or refute them) on your own hardware:</p>
<ol>
<li class=""><a class="" href="https://helmdeck.dev/tutorials/install-cli">Install helmdeck</a> (~30 min).</li>
<li class="">Connect <a class="" href="https://helmdeck.dev/integrations/openclaw">OpenClaw</a> — it's the validated end-to-end client.</li>
<li class="">Run the prompts at <a href="https://github.com/tosin2013/helmdeck/tree/main/scripts/oc-capture/prompts" target="_blank" rel="noopener noreferrer" class=""><code>scripts/oc-capture/prompts/easy-cluster.txt</code></a> against your model of choice.</li>
<li class="">Run the same workflows on whichever competitor stack you're evaluating against.</li>
<li class="">Compare costs from each provider's billing dashboard.</li>
</ol>
<p>If your numbers come back within the ranges quoted above, that's a reproduction. If they <strong>disagree</strong> — lower or higher — please <a href="https://github.com/tosin2013/helmdeck/issues/new" target="_blank" rel="noopener noreferrer" class="">open an issue</a> titled <code>cost-reproduction: &lt;workflow&gt;</code>, or <a class="" href="https://helmdeck.dev/blog">submit a community blog post</a> with your full methodology. See <a href="https://github.com/tosin2013/helmdeck/blob/main/CONTRIBUTING.md" target="_blank" rel="noopener noreferrer" class=""><code>CONTRIBUTING.md</code></a> §"Other contribution types" for how to add yourself as an author.</p>
<p>We particularly want <strong>independent reproductions</strong> — your real findings on your real hardware are more valuable than another marketing pitch from me. The <a class="" href="https://helmdeck.dev/explanation/why-helmdeck">Why helmdeck</a> page will get updated with your numbers (and a link to your post) if your reproduction surfaces a meaningful discrepancy.</p>
<h2 id="see-also">See also</h2>
<ul>
<li class=""><a class="" href="https://helmdeck.dev/explanation/why-helmdeck">Why helmdeck</a> — the long-form version of this post with all five comparison tables and a full reproduction recipe</li>
<li class=""><a class="" href="https://helmdeck.dev/tutorials/install-cli">Get started — install helmdeck</a></li>
<li class=""><a class="" href="https://helmdeck.dev/integrations/SKILLS">SKILLS.md</a> — the agent skill bundle that's load-bearing for the cheap-model story</li>
<li class=""><a class="" href="https://helmdeck.dev/PACKS">Pack catalog</a> — the 36 capability packs the comparisons use</li>
</ul>]]></content:encoded>
            <category>cost</category>
            <category>mcp</category>
            <category>weak-models</category>
            <category>agent-architecture</category>
        </item>
    </channel>
</rss>