Skip to main content

4 posts tagged with "mcp"

View All Tags

The audit-callback pattern: verify-against-ground-truth as anti-hallucination middleware

· 6 min read
Tosin Akinosho
Helmdeck maintainer

Hook

Three architectural fixes from a single morning closed three different Tier C failure modes. A fourth — the agent producing a confidently-formatted manifest of fictitious deposits — survived all three. The structural answer isn't another fix at the producer side. It's a typed audit pack that reads ground truth after the fact, with the skill forced to surface the gap.

Context

Helmdeck's been on a Tier C reliability arc for a week. Three patterns kept recurring:

PatternExampleFix shape
Skill prose ignored"Save to artifacts" → markdown returned inlineTurn the advisory into a typed pack call (PR #450)
Required arg omittedcontent.ground rejects when model missingResolve a default at the pack layer (PR #453)
Mechanism vs. persona mixedTier C overwhelmed by 17 KB monolithic SKILL.mdSplit per OpenClaw's canonical agent-workspace modelissue #457 and follow-ups

We shipped all three, plus the layered workspace refactor, and retested on openai/gpt-oss-120b:free. The first three fixes worked — the agent loaded the layered files correctly, applied the decision rules from AGENTS.md, picked the right publishing mode, and made one successful blog.rewrite_for_audience call without specifying model. Then it produced a six-entry deposit manifest table for artifacts that didn't exist. The skill was in context. The pack was reachable. The model invented the calls as text.

That class of failure can't be fixed at the producer side — the producer was never called. It needs a verifier at the consumer side.

Finding

The shape that worked

artifact.verify_manifest:

{
"tool": "helmdeck__artifact-verify-manifest",
"arguments": {
"expected": [
{ "artifact_key": "blog.publish/abc-mcp-adr-canonical.md" },
{ "artifact_key": "blog.publish/def-mcp-adr-linkedin.md" }
]
}
}

Returns:

{
"verified": [
{ "artifact_key": "blog.publish/abc-mcp-adr-canonical.md",
"filename": "mcp-adr-canonical.md",
"namespace": "blog.publish",
"size": 7421,
"content_type": "text/markdown" }
],
"missing": [
{ "artifact_key": "blog.publish/def-mcp-adr-linkedin.md", "reason": "artifact not found" }
],
"all_present": false,
"summary": "1 of 2 claimed artifacts verified; 1 missing"
}

Handler: pure passthrough to ArtifactStore.Get per claimed key, dedup before lookup, accumulate found vs. not-found. ~150 LOC, 100% per-function coverage on 15 tests.

The skill update is two paragraphs:

### 4b. Verify deposit — MANDATORY, NOT ADVISORY

After producing the deposit-manifest table in §4, you MUST call
helmdeck__artifact-verify-manifest with every artifact_key from
the table. This is an anti-hallucination audit.

If `all_present: false` — DO NOT claim the deposit succeeded.
Report the missing[] entries explicitly and propose retrying the
deposit step for those specifically.

That's it. The audit pack is a tool name, not advisory prose — Tier C invokes it ~most of the time because it's a concrete tool call, not a "remember to" reminder. When it does invoke it, the returned missing[] is in the LLM's context window for the next response turn, making "all six deposited" implausible to assert.

Why this is the same shape as ADR 052

ADR 052 (av-output-validation-post-step) made av.validate a default-on post-step on slides.narrate and podcast.generate. The token-savings claim was concrete: every "the video has issues" diagnostic burns ~3,000 tokens of bash output and analysis; reading the validation field from the run record collapses that to ~200 tokens. The architecture: turn an implicit trust in the artifact ("looks fine, ship it") into a typed pack output the agent reads in O(200) tokens.

artifact.verify_manifest is the same shape at a different layer:

LayerWhat's verifiedTrust replaced
ADR 052 (artifact layer)The artifact's structural integrity (codec, faststart, packet contiguity, RMS)"the encoder produced a usable file" → typed validation.checks[]
artifact.verify_manifest (chat-response layer)The agent's claims about what's in the store"the agent said it deposited" → typed verified[] / missing[]

Both move from implicit trust to explicit verification, both surface findings in O(200) tokens, both pin the failure mode at a place where it can't drift back.

Phase 2 — generalize

The pattern fits a lot of helmdeck packs. Anywhere the LLM might transform a producer's output in its text response, you can pair the producer with an audit pack that re-reads authoritative state:

ProducerAuditor (planned)Verifies
artifact.putartifact.verify_manifest (shipped)Keys exist in store
repo.fetchrepo.verify-cloneClaimed clone_path exists, commit SHA matches
blog.publishblog.verify-publishedPublished URL is reachable, content matches
pack.start (async)pack.verify-completedjob_id is completed, not working
slides.narrateslides.verify-renderedMP4 exists + passes av.validate
content.groundcontent.verify-groundedclaims_grounded_count matches grounded[] length
pipeline-runpipeline.verify-completionClaimed step outputs match run record

Each follows the same shape: input is the agent's claim, output is {verified[], missing[], summary}. Handler reads authoritative state and reports the gap. Tracking in #461.

Phase 3 — engine-level hook (deferred)

The skill-prose dependency in Phase 1 ("after the deposit step, you MUST call verify-manifest") is itself a Tier C failure surface — small chance the model ignores it. The next architectural step is an engine-level post-call hook: when a producer pack completes, the engine auto-invokes the registered auditor, attaches the result to the same response envelope, and the LLM sees both without skill-prose dependency.

That's its own ADR. Not shipping it until Phase 1 + 2 prove the pattern is generally useful. Premature middleware is a way to build a complicated system you can't justify.

Why this matters to you

If you're building an agent on weak models, the producer-audit pair is a more durable shape than trying to make the model infallible.

Three principles that fall out of the work:

  1. Trust the producer; verify the consumer. Packs are reliable when they're called. The unreliability is the agent's claims about what it called. Verifying the consumer side closes that gap regardless of model tier.
  2. Make the audit a typed tool, not prose. "Remember to verify" is a Tier C failure mode. "Call helmdeck__artifact-verify-manifest" is a tool dispatch. The tool's existence in the catalog AND the skill's mandatory-step prose together raise the floor.
  3. The audit response has to be in context when the agent writes its final text. If verification runs out-of-band and the result lands in a log, the agent never sees it and continues asserting compliance. The audit must be a tool call whose result the LLM reads before its next text turn.

The pattern transfers to any MCP-tooling system, not just helmdeck. The MCP spec's tool-call envelope is exactly the surface this pattern uses. If your agent produces structured claims about world state (deposits, sends, publishes, mutations), pair each producer with an auditor and require the auditor in your skill template.

See also

Helmdeck v0.10.0 — content packs, 38 packs total, registry-published

· 6 min read
Tosin Akinosho
Helmdeck maintainer

Tag: v0.10.0. Upgrade in 5 min: git checkout v0.10.0 && make install.

v0.10.0 is a content-packs release. The two new packs let an agent generate something it can publish — a blog post via Ghost, a multi-speaker podcast MP3 via ElevenLabs — in one tool call, instead of needing to chain http.fetch + custom JWT logic + ffmpeg shellouts. Pack count climbs from 36 to 38.

It also closes a gap on the operator side: until this release helmdeck had zero documentation on how to upgrade a running install. New docs/howto/upgrade-helmdeck.md covers the in-place Compose upgrade, schema migrations, post-upgrade validation, and rollback. With Kubernetes coming in v1.0, the absence of this guide had become a real risk.

vision.click_anywhere works mechanically. The model still doesn't. Five projects waiting for someone to build them.

· 10 min read
Tosin Akinosho
Helmdeck maintainer

Issue #112 is the canonical research thread. This post pulls the same thinking into a project-shaped frame: five separable projects, any one of which would close the gap. If you're an OSS maintainer or researcher looking for a 3–6 month project that lots of people would benefit from — pick one.

In v0.10.0 we shipped a mechanical fix for vision.click_anywhere's "loops forever clicking the same coords" bug. The fix worked: per-step screenshots now show genuine visual progression instead of identical bytes. Xvfb repaints, scrot captures the new frame, the next iteration sees a different image.

And the model still doesn't emit done.

We tested with three plausible goals — "click the URL bar to focus it", "click the New Tab button at the top of the Chromium window", "click anywhere in the center of the visible window". Per-step coordinates were sensible, screenshots changed turn-to-turn, but claude-haiku-4.5 (and similar small/cheap vision models) hit max_steps: 10 with completed: false every time. The model could see the screenshot. It couldn't decide whether the click had succeeded.

Why a $0.10 model can do work that needs a $3 model

· 6 min read
Tosin Akinosho
Helmdeck maintainer

⚠️ These are my findings, not a vendor benchmark. I ran them on one helmdeck install, with a specific set of prompts, against a few specific competitor stacks. Your numbers will probably differ. The recipe to reproduce is at the bottom — if your numbers disagree, please share and I'll update this page.

Today's helmdeck install ran a 6-step Phase 5.5 code-edit loop on gpt-oss-120b for $0.07 total — clone a repo, read a file, apply a one-line patch, run tests, commit, push. The same loop on Cursor / Claude Code direct via Sonnet would have run $0.30+. Same outcome; ~5× cost gap.

That's not unusual. Here's what I see across five common workflows: