Changelog

All notable changes to helmdeck are documented here. The format follows Keep a Changelog 1.1.0 and this project adheres to Semantic Versioning starting at v1.0.0; pre-1.0 minor versions may break compatibility (documented per release).

For the forward-looking release plan — what is targeted for upcoming versions and the hard exit gates for each — see docs/RELEASES.md.

Unreleased

[0.29.10] - 2026-06-22

Theme: "Error-path findings extraction: the empirical loop actually closes."

Single-PR hot-fix following the v0.29.9 BYO empirical test. v0.29.9 shipped the findings-memory architecture (data layer + projection + compose prompt injection); the first run on v0.29.9 surfaced a subtle but load-bearing bug — when a validation pack errors with output (lint-strict-mode's standard contract), Engine.Execute's post-handler short-circuit dropped the output before findings extraction could see it. Without this fix, the findings-memory loop never closes on Tier C runs because strict validation packs always error. v0.29.10 fixes the engine to capture handler output into a closure-visible variable before the error short-circuit, so audit-row findings get extracted from BOTH success-path AND error-path output. Regression test simulates the exact lint-strict pattern (pre-fix=0 findings, post-fix=2 findings).

Operator upgrade: clean — single engine-internal change. After tag push, pull ghcr.io/tosin2013/helmdeck:0.29.10, restart control-plane. Sidecar-hyperframes is unchanged from v0.29.8. No operator-side actions required beyond the upgrade. Empirical follow-through: re-run the BYO pipeline test ONCE — confirm /api/v1/memory/defaults?caller=openclaw-configure now returns common_findings populated from the run; re-run a SECOND time — confirm the compose-prompt findings prefix appears in the next lint sidecar's findings (ideally with fewer/different codes). That's the two-run validation closing today's iteration arc.

Fixed

Findings extraction on the error path: pack handlers that return BOTH a structured output AND a PackError (e.g. hyperframes.lint in strict mode — its standard contract returns the findings JSON + CodeArtifactFailed) now correctly land findings in the audit row. Surfaced empirically minutes after v0.29.9 deployed: re-ran the BYO pipeline, lint failed with artifact_failed, the lint sidecar artifact contained findings, but /api/v1/memory/defaults showed common_findings: 0. Root cause: Engine.Execute's post-handler short-circuit if err != nil { return nil, wrap(err) } returns nil as the result, and the audit deferred-closure was reading result.Output — so the output blob that the handler had written got dropped before findings extraction ran. The error path is exactly where we WANT findings recorded (the lint pack's strict-mode contract is "emit findings + error"). Fix: declare var handlerOutput json.RawMessage at the top of Execute, assign it right after safeInvoke returns (BEFORE the error short-circuit), pass it to writePackAudit from the closure. One regression test simulates the lint-strict pattern (return findings JSON + CodeArtifactFailed) and asserts the audit row's Findings field carries the 2 codes — pre-fix this was 0, post-fix it's 2. 1672 tests pass across packs + api + memory + packs/builtin. Closes the empirical gap surfaced by the first post-v0.29.9 BYO run; without this, the findings-memory loop never closes on Tier C runs because validation packs in strict mode always error.

[0.29.9] - 2026-06-22

Theme: "Empirical-reinforcement loop closes + admin observability."

Same-day continuation of the 24-hour BYO empirical-iteration cycle. v0.29.4 shipped the pre-render validation suite; v0.29.6 → v0.29.8 fixed the infrastructure bugs surfaced by running it (operator-uploads visibility, S3 Get URL, memory forget bypass-decrypt, sidecar pin). With infrastructure clean, the first complete BYO run produced real LLM-output findings (missing_local_asset, gsap_studio_edit_blocked, timeline_track_too_dense) — exactly the antipatterns the helmdeck-hyperframes-authoring skill documents. The skill is in-context; the LLM ignored it. v0.29.9 closes that gap with three architectural additions:

#572 — PackAudit.Findings + BuildDefaults.CommonFindings. Every pack audit row now carries structured findings; aggregation surfaces them as a per-caller frequency-ranked list via /api/v1/memory/defaults and the MCP helmdeck://my-defaults resource.
#573 — hyperframes.compose injects top-N common findings into its system prompt on every run. Empty findings → zero token cost. Empirical "you did X N times" beats abstract rules — biggest lift on Tier C models.
#571 — Routing Memory page gains a caller selector for admins so operators can inspect what their agents have been doing (not just their own admin activity).

Architecture writeup (draft, ships once empirically validated): 2026-06-22-findings-memory-empirical-reinforcement.md covers the three generalizable takeaways — empirical signal beats abstract rules for weak models, loop closes at prompt layer (not fine-tune time), validator rule codes should be load-bearing in both the gate AND the generator's prompt.

Operator upgrade: clean — no schema migrations, no removed packs. Additive across the board. New PackAudit.Findings field is optional (omitempty); existing audit rows with no findings remain valid. New CommonFindings array on defaults is omitempty — clients ignoring it see no change. The compose-prompt findings prefix is conditional on ec.Memory != nil AND non-empty findings — deployments without memory wired (default without HELMDECK_MEMORY_KEY) see zero change. After tag push, pull ghcr.io/tosin2013/helmdeck:0.29.9, restart control-plane, re-run ./scripts/configure-openclaw.sh. Sidecar-hyperframes is unchanged from v0.29.8. Empirical validation pending: the BYO pipeline test on v0.29.9 should produce fewer (or different) lint findings than the same test on v0.29.8, because the compose step now incorporates the prior-run findings as constraints. If the same codes recur, slice 4's prompt-template phrasing needs tuning (the static code→guidance map follow-up).

Added

Findings-memory: agent USES the data (#570 slice 4, compose prompt injection). The hyperframes.compose pack now reads its caller's CommonFindings via ec.Memory.List(AuditKeyPrefixPack) + ProjectDefaults, and appends a "FINDINGS FROM YOUR PRIOR RUNS" section to the system prompt before dispatching to the LLM. Closes the empirical-reinforcement loop: slices 1+2 (the data plumbing) record every lint/inspect/validate finding the agent has produced; slice 4 (this PR) feeds those back into the next compose call so the LLM sees concrete antipattern counts ("missing_local_asset seen 2 time(s), severity=error") alongside the abstract authoring-rules system prompt. Empty findings → empty prefix → zero token cost for new callers (auto-tunes per caller). Capped at composeFindingsTopN = 10 so the prefix tops out at ~300 tokens (negligible against the multi-thousand-token compose prompt). Prefix lives in the SYSTEM message rather than the user message so the LLM gateway / OpenRouter can cache the per-caller system half across requests while only the description varies per call. Tier coverage: helps all tiers. Tier A (claude-sonnet, gpt-4) gets marginal reinforcement of rules they already mostly follow; Tier B (llama-3-70b, etc.) gets meaningful gap-closing on specific recurring failures; Tier C (gpt-oss-120b:free, gemma-9b) gets the highest lift because the empirical "you did X N times" signal cuts through the abstract-rule-ignoring drift these models exhibit. 6 new sub-tests cover the empty-memory + nil-memory + audits-without-findings → no-prefix paths, the empirical-data path (simulates the 2026-06-22 BYO lint findings → confirms missing_local_asset appears with count=2 + gsap_studio_edit_blocked appears with count=1), the hard-constraint closing line, and the topN cap. 1671 tests pass across packs + api + memory + packs/builtin.
Empirical validation pending: next BYO test run on this code should show the agent avoiding the prior-run failure modes; if it still hallucinates the same codes, slice 4's prompt-template phrasing needs tuning (possible follow-up: add a static code→human-guidance map for the most common codes).
Findings-memory layer (#570 slices 1+2 — data plumbing). The engine now records structured rule-violation findings on every pack audit row, and BuildDefaults (used by /api/v1/memory/defaults + the MCP helmdeck://my-defaults resource) aggregates them into common_findings so the agent can see which validation findings keep recurring across runs. Closes the gap surfaced by the first empirical BYO test (2026-06-22): the lint pack emitted missing_local_asset, gsap_studio_edit_blocked, timeline_track_too_dense against the LLM's authored composition — exactly the antipatterns the helmdeck-hyperframes-authoring skill documents — and there was no mechanism for the agent to learn from those failures between runs. Three changes: (1) extend PackAudit in internal/packs/audit.go with a terse Findings []AuditFinding slice (code + severity + file only — verbose message/fixHint/snippet stay in the pack's sidecar artifact). Capped at maxAuditFindings = 50 so a single dense run can't monopolize the audit budget. (2) extractFindings(output) heuristically pulls findings from THREE recognized output shapes: top-level {"findings": [...]} (any pack), nested {lint: {findings: [...]}} (hyperframes.lint), and {inspect: {issues: [...]}} / {validate: {errors: [...], warnings: [...]}} (the other two validation packs). Each finding is normalized to {code, severity, file} — entries without a code are skipped. Both code and severity/level field-name variations are tolerated (lint uses severity; validate uses level). (3) BuildDefaults adds CommonFindings []CommonFinding to the projection — group-by-code aggregation with OccurrenceCount, LastSeenUnix, and pack-attribution from the most-recent occurrence. Sorted busiest-first, capped at DefaultsFindingsTopN = 20. 
14 new sub-tests cover the extraction (all three output shapes; code-missing skip; bad-JSON returns nil; row-cap enforcement) and the projection (3 distinct codes across 3 runs sorted correctly; cross-pack aggregation; empty input; top-N truncation). 1665 tests pass across 4 packages. Slices 3+4 (UI surface + compose-prompt injection — the agent actually READING the common_findings) follow as separate PRs.
Routing Memory page gains a caller selector for admins (#569). Closes the operator-visible gap surfaced 2026-06-22: an operator logged in as admin cleared Routing Memory then ran BYO pipeline tests via OpenClaw, saw "No history yet" on the page. The data was being recorded correctly — it just landed under openclaw-configure (the JWT subject minted by configure-openclaw.sh for the MCP bridge), not under the operator's own admin subject. ADR 047's per-caller isolation is correct multi-tenant design, but the UI had no affordance for admins to inspect what their agents had been doing. Three changes: (1) new MemoryStore.ListNamespaces method on both InMemory + SQLite implementations — returns distinct caller namespaces + row counts, sorted busiest-first, never decrypts (raw column read). (2) new GET /api/v1/memory/callers endpoint backed by ListNamespaces — admin-gated so non-admin operators see only their own caller (defense in depth for the per-caller isolation contract). (3) GET /api/v1/memory/defaults accepts ?caller=<name> query param; admin scope required to override, non-admins see their own scope regardless of the param. UI: dropdown above the existing Refresh/Clear buttons, populated from /api/v1/memory/callers, only renders when more than one caller exists; selecting another caller re-fetches /api/v1/memory/defaults?caller=<selected> and the three sections (Recent activity / Learned pack defaults / Learned pipeline defaults) repopulate. 6 new sub-tests cover ListNamespaces on both backends (busiest-first ordering + empty-after-drain), the callers endpoint (empty store + non-admin filter), and the defaults endpoint's admin-only override gate (regression guard for the per-caller-isolation contract). Empirical use case: when your OpenClaw agent is running BYO pipelines, switch the dropdown to openclaw-configure to see exactly what it's been doing — the same audit history the agent reads as defaults.
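
For the findings extraction described in the #570 slices 1+2 entry above, the normalization step could be sketched like this. It is a simplified reading of the recognized output shapes, not the project's extractFindings: the struct layouts are assumptions, and the constant `maxFindings` simply reuses the documented cap of 50.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// auditFinding is the terse shape kept on the audit row.
type auditFinding struct {
	Code     string
	Severity string
	File     string
}

// rawFinding tolerates both "severity" (lint) and "level" (validate).
type rawFinding struct {
	Code     string `json:"code"`
	Severity string `json:"severity"`
	Level    string `json:"level"`
	File     string `json:"file"`
}

// packOutput covers the recognized output shapes (layout is an assumption).
type packOutput struct {
	Findings []rawFinding `json:"findings"`
	Lint     struct {
		Findings []rawFinding `json:"findings"`
	} `json:"lint"`
	Inspect struct {
		Issues []rawFinding `json:"issues"`
	} `json:"inspect"`
	Validate struct {
		Errors   []rawFinding `json:"errors"`
		Warnings []rawFinding `json:"warnings"`
	} `json:"validate"`
}

const maxFindings = 50 // mirrors the documented per-row cap

// extractFindings returns nil on bad JSON and skips entries without a code.
func extractFindings(output []byte) []auditFinding {
	var po packOutput
	if err := json.Unmarshal(output, &po); err != nil {
		return nil
	}
	var raw []rawFinding
	raw = append(raw, po.Findings...)
	raw = append(raw, po.Lint.Findings...)
	raw = append(raw, po.Inspect.Issues...)
	raw = append(raw, po.Validate.Errors...)
	raw = append(raw, po.Validate.Warnings...)

	var out []auditFinding
	for _, r := range raw {
		if r.Code == "" {
			continue
		}
		if len(out) == maxFindings {
			break
		}
		sev := r.Severity
		if sev == "" {
			sev = r.Level
		}
		out = append(out, auditFinding{Code: r.Code, Severity: sev, File: r.File})
	}
	return out
}

func main() {
	out := extractFindings([]byte(`{"lint":{"findings":[{"code":"missing_local_asset","severity":"error"}]}}`))
	fmt.Println(len(out), out[0].Code)
}
```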

[0.29.8] - 2026-06-22

Theme: "Validation-suite sidecar pin hot-fix — third BYO empirical iteration."

Same-day follow-up to v0.29.7's two fixes (S3 Get URL + memory forget bypass-decrypt). v0.29.7's BYO pipeline test reached the lint step (compose finally resolving the artifact URL post-#564) — then died with handler_failed: hyperframes lint emitted no JSON (exit 127). Exit 127 is bash for "command not found." The v0.29.4 hyperframes.{lint,inspect,validate} packs set NeedsSession: true but forgot to pin SessionSpec.Image, so the session executor spawned them into the default base sidecar (helmdeck-sidecar:latest) which doesn't have the hyperframes CLI on PATH. v0.29.8 ships #567's pin (same convention hyperframes.render has used since v0.13.0, including the HELMDECK_SIDECAR_HYPERFRAMES env override). Operator-visible: builtin.byo-audio-narrated-video now actually completes the lint → inspect → validate gates instead of exit-127-ing at lint. Third hot-fix-from-empirical-testing in 24 hours.

Operator upgrade: clean — single backend code change in three packs. After tag push, pull ghcr.io/tosin2013/helmdeck:0.29.8, restart control-plane. Sidecar-hyperframes is unchanged from v0.29.7. No operator-side actions required beyond the upgrade. After it lands, re-run any failed BYO pipeline test — the lint gate should now find the upstream CLI and either pass (clean composition) or produce structured findings (which is the publish-gate working as designed, not a regression).

Fixed

hyperframes.{lint,inspect,validate} packs (v0.29.4 pre-render validation suite) now pin SessionSpec.Image to hyperframesSidecarImage(), matching hyperframes.render's pattern. Surfaced empirically the same day v0.29.7 shipped: a builtin.byo-audio-narrated-video run failed at the lint step with handler_failed: hyperframes lint emitted no JSON (exit 127). Exit code 127 is bash for "command not found" — the lint pack was being spawned into the default base sidecar (helmdeck-sidecar:latest) which doesn't have the hyperframes CLI on PATH. The render pack pins the right image via Image: hyperframesSidecarImage() in its SessionSpec; my v0.29.4 lint/inspect/validate packs set NeedsSession: true but forgot the image pin, so the session executor used its default. Fix: add the same SessionSpec.Image = hyperframesSidecarImage() pin to all three packs, plus sensible MemoryLimit/Timeout/CPUProfile defaults per pack's compute shape (lint: 1g/5min/IO; inspect: 2g/10min/Compute since it loads in headless Chrome with at_transitions sampling; validate: 2g/5min/Compute since it boots Chrome + DevTools console). Operator-visible effect: the builtin.byo-audio-narrated-video pipeline now actually completes the lint→inspect→validate gates instead of exit-127-failing at lint. Both the HELMDECK_SIDECAR_HYPERFRAMES env override + the default pinned image are honored, matching render's behavior.

[0.29.7] - 2026-06-22

Theme: "BYO empirical-test recovery — two same-day-surfaced production blockers fixed."

Same-day follow-up to v0.29.6, surfaced during the first real end-to-end test of builtin.byo-audio-narrated-video. v0.29.4 shipped the BYO pipeline, v0.29.5 shipped the operator upload UI, v0.29.6 fixed the artifact-list visibility. v0.29.7 closes the two operator-visible blockers that ONLY surface against a production S3/Garage backend + multi-restart deployment — neither caught by unit tests because the memory store + Get URL contract differences only manifest at the integration layer.

Bug 1 — S3 Get returned Artifact with empty URL. My v0.29.4 BYO implementation in hyperframes.compose calls ec.Artifacts.Get(ctx, key) and asserts art.URL != "" before threading it into the audio_url codepath. The MemoryArtifactStore.Get filled URL with "memory://" + key (non-empty, contract met). The S3ArtifactStore.Get filled URL on Put but returned Artifact{Key, Size, ContentType, CreatedAt} on Get — no URL, fails the assert. All 6 of the operator's BYO pipeline test runs failed at compose with artifact_failed: audio_artifact_key "..." resolved to empty URL (artifact store does not expose presigned URLs?). PR #564 fixes by calling s.presign(ctx, key) in Get like Put does; same contract both directions.

Bug 2 — Routing Memory's "Clear all history" couldn't recover from rotated keys. Restarting the control plane across releases without a pinned HELMDECK_MEMORY_KEY generates a fresh ephemeral master each time. The SQLite memory table persists ciphertext from old keys; new process can't decrypt. The UI showed build defaults: memory: decrypt: cipher: message authentication failed. The Clear button hit POST /api/v1/memory/forget which listed THEN deleted each entry one by one — the list step decrypt-failed and forget got stuck. Operator's only recovery was a manual sqlite3 ... DELETE. PR #565 fixes by adding MemoryStore.DeletePrefix that operates on raw SQL rows without decrypting, and switching the forget handler to use it. Also documents pinning HELMDECK_MEMORY_KEY (32-byte hex in .env.local) to prevent rotation in the first place.
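
The Fixed entry for this release notes that DeletePrefix escapes LIKE wildcards in the caller's prefix. A sketch of that escaping step, assuming a backslash ESCAPE character as in the quoted SQL; `likePrefixPattern` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"strings"
)

// deletePrefixSQL is the statement quoted in the Fixed entry.
const deletePrefixSQL = `DELETE FROM memory_entries WHERE namespace=? AND key LIKE ? ESCAPE '\'`

// likePrefixPattern escapes the escape character itself plus the two LIKE
// wildcards, then appends a single unescaped % to make it a prefix match.
func likePrefixPattern(prefix string) string {
	// NewReplacer works in one pass, so inserted backslashes are not re-escaped.
	r := strings.NewReplacer(`\`, `\\`, `%`, `\%`, `_`, `\_`)
	return r.Replace(prefix) + "%"
}

func main() {
	fmt.Println(deletePrefixSQL)
	fmt.Println(likePrefixPattern("pack_audit/50%"))
}
```

The pattern is passed as a bound parameter, so the prefix never becomes part of the SQL text. The statement reads only the namespace and key columns, which is why it works on rows no current key can decrypt.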

Operator upgrade: clean, backend-only changes. After tag push, pull ghcr.io/tosin2013/helmdeck:0.29.7, restart control-plane. Sidecar-hyperframes is unchanged from v0.29.6. One operator-side action recommended: pin HELMDECK_MEMORY_KEY if you haven't already (echo "HELMDECK_MEMORY_KEY=$(openssl rand -hex 32)" >> deploy/compose/.env.local); see PR #565's CHANGELOG entry for context.

Fixed

Routing Memory's "Clear all history" button now works even when the encryption key has rotated, unblocking the memory: decrypt: cipher: message authentication failed recovery path. Surfaced empirically the same day v0.29.6 shipped: operator restarted the control plane multiple times across v0.29.4/5/6 deployments without a pinned HELMDECK_MEMORY_KEY, each restart generated a fresh ephemeral master, the SQLite memory table persisted entries encrypted with the OLD keys, the NEW process couldn't decrypt them → AES-256-GCM auth tag verification failed on every list call → build defaults: memory: decrypt: cipher: message authentication failed error in the Routing Memory UI. The UI's Clear button hit POST /api/v1/memory/forget which listed THEN deleted each entry one by one — so the list step decrypt-failed and forget got stuck (the only path operators had to clear stale entries was a manual sqlite3 ... DELETE). Fix: new MemoryStore.DeletePrefix(ctx, ns, prefix) (int, error) method (internal/memory/memory.go's interface + implementations on both InMemoryStore and SQLiteStore). SQL DELETE FROM memory_entries WHERE namespace=? AND key LIKE ? ESCAPE '\' operates on raw rows and never touches ciphertext, so it succeeds even when no key can decrypt the existing rows. LIKE wildcards (%, _) in the caller's prefix are escaped so they match literally — caller's audit-key vocabulary is operator-extensible and the SQL injection / wildcard-leak surface needs to be tight. The /api/v1/memory/forget handler now uses DeletePrefix instead of List + per-key Delete. 7 new sub-tests cover the round-trip happy path, idempotency on empty namespaces, cross-namespace isolation, LIKE-wildcard literal-matching for both % and _, and the load-bearing "rotated key" regression: open the same DB with a different master, confirm List fails with auth-tag mismatch AND DeletePrefix succeeds + clears the orphans + post-clear List works again. 
Documentation note: pin HELMDECK_MEMORY_KEY (32-byte hex) in your .env.local to prevent the rotation in the first place — the autogenerate-with-warning fallback is fine for development but loses state on every restart.
S3ArtifactStore.Get now populates the URL field on returned Artifact with a presigned link, matching the Put path's contract. This unblocks hyperframes.compose's BYO audio_artifact_key resolution (and any downstream pack that chains an existing artifact into another via URL). Surfaced empirically the same day v0.29.6 shipped: an operator ran builtin.byo-audio-narrated-video against a UI-uploaded MP3 → all 6 pipeline attempts failed at compose with artifact_failed: audio_artifact_key "..." resolved to empty URL (artifact store does not expose presigned URLs?). Root cause: my v0.29.4 BYO implementation in hyperframes.compose calls ec.Artifacts.Get(ctx, key) and asserts art.URL != "". The Memory store filled URL with "memory://" + key (non-empty, contract honored). The S3 store filled URL on Put (via s.presign(ctx, key)) but Get returned Artifact{Key, Size, ContentType, CreatedAt} with no URL — empty string, fails the assert. Fix is two added lines in internal/packs/s3store.go: call s.presign(ctx, key) in Get, set URL: signed on the returned Artifact. presign errors are non-fatal — the empty URL surfaces back through the existing BYO assert (same contract as before the fix; the assert was correct, the precondition was wrong). One regression test in s3store_test.go (the live-S3 path, skipped without endpoint env vars but exercised in CI) asserts URL is populated on Get. Validation: with the fix in place, the user's BYO test prompt now succeeds at compose; pipeline reaches lint/inspect/validate/render gates without the URL-empty short-circuit.

[0.29.6] - 2026-06-21

Theme: "Operator-uploads list-visibility hot-fix."

Same-day hot-fix to v0.29.5's drag-drop upload card. The upload bytes layer was fine (artifact persisted + downloadable + usable by pipelines), but the Management UI's Artifacts page table didn't surface operator-uploaded files because the default list endpoint iterates the pack registry only. v0.29.6 ships the targeted fix: after the pack-registry loop, also iterate special non-pack namespaces (currently operator-uploads) and append their artifacts to the result. Operators can now see their uploads in the Artifacts table immediately after dropping a file. Back-compat-safe; no API surface change, no pipeline shape change. Single-PR release pattern matching the v0.13.1 same-day-hotfix discipline (see the 2026-05-13 v0.12.1 blog post for the rationale).

Operator upgrade: clean — single backend code change. After tag push, pull ghcr.io/tosin2013/helmdeck:0.29.6, restart control-plane. The sidecar-hyperframes image is unchanged from v0.29.5 — no need to pull or restart anything else. The fix is purely in the artifact-list HTTP handler; existing operator-uploads keys from v0.29.5 (which were correctly persisted, just invisibly) immediately become listable in the UI.

Fixed

GET /api/v1/artifacts (the Management UI's Artifacts list endpoint) now surfaces operator-uploads/* artifacts in the default listing. Bug surfaced empirically the same day v0.29.5 shipped: an operator uploaded an MP3 via the new drag-drop card (PR #556), the upload succeeded (the operator-uploads/<hash>-<filename> key was returned + the bytes were correctly stored — verified via GET /api/v1/artifacts/download/<key> returning 200 + 2.65 MB), but the artifact didn't appear in the Artifacts page table. Root cause: the default list (no ?pack= filter) iterates the pack registry and queries store.ListForPack(packName) for each registered pack. operator-uploads isn't a registered pack — it's a special namespace introduced by the upload endpoint. So the iteration skipped it entirely. Fix: after the pack-registry loop, also iterate a hardcoded list of special non-pack namespaces (currently just operator-uploads) and append their artifacts to the result. The artifacts were always in the store + listable via ?pack=operator-uploads filter; this just makes them visible in the default view. One regression test covers the no-registry-wired path (which would have caught the bug in CI if we'd added it on the original PR #556).
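
A compact sketch of the listing fix, assuming a map-backed store. The `store` type and `listAll` function are hypothetical; the one grounded detail is the second loop over a hardcoded list of non-pack namespaces.

```go
package main

import "fmt"

// store maps a namespace to its artifact keys (hypothetical layout).
type store map[string][]string

func (s store) ListForPack(name string) []string { return s[name] }

// specialNamespaces holds namespaces that are not registered packs.
var specialNamespaces = []string{"operator-uploads"}

// listAll walks the pack registry first, then the special namespaces.
func listAll(s store, registeredPacks []string) []string {
	var out []string
	for _, p := range registeredPacks {
		out = append(out, s.ListForPack(p)...)
	}
	for _, ns := range specialNamespaces {
		out = append(out, s.ListForPack(ns)...)
	}
	return out
}

func main() {
	s := store{
		"hyperframes.render": {"hyperframes.render/out.mp4"},
		"operator-uploads":   {"operator-uploads/abc-clip.mp3"},
	}
	fmt.Println(listAll(s, []string{"hyperframes.render"}))
}
```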

[0.29.5] - 2026-06-21

Theme: "Operator artifact upload + BYO-audio worked example."

Same-day follow-up to v0.29.4 that closes the chat-side-file-ingestion gap and refines the gpt-oss-120b reference recipes with a BYO-audio variant. v0.29.4 shipped builtin.byo-audio-narrated-video but operators couldn't easily get an MP3 INTO the artifact store — artifact.put is a pack that takes bytes via the agent's tool input, and a 2.5 MiB MP3 means ~3.3 MiB of base64 in chat which is impractical. v0.29.5 ships the drag-drop upload card on the Management UI's Artifacts page plus the new POST /api/v1/artifacts/upload REST endpoint behind it, AND refines the gpt-oss-120b-concept-animator howto with a BYO variant that collapses the 5-call from-scratch narrated-video chain to a single pipeline call when the operator supplies the audio. Together these close the workflow surfaced by the v0.29.3 retest: operator drops MP3 → copies artifact key → asks Tier C agent → narrated MP4 with pre-render validation gates inlined. Back-compat-safe; operators on v0.29.4 can upgrade directly with no input or schema changes.
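
The base64 overhead quoted above can be checked with the standard library; `base64Size` is a hypothetical helper name used only for this check.

```go
package main

import (
	"encoding/base64"
	"fmt"
)

const mib = 1 << 20

// base64Size returns the padded standard-encoding length for n raw bytes.
func base64Size(n int) int {
	return base64.StdEncoding.EncodedLen(n)
}

func main() {
	raw := 5 * mib / 2 // 2.5 MiB
	fmt.Printf("%.2f MiB\n", float64(base64Size(raw))/mib) // prints "3.33 MiB"
}
```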

Operator upgrade: clean — no schema migrations, no removed packs, no pipeline-shape changes. The upload endpoint is additive; the existing artifact.put pack continues to work unchanged for agent-driven uploads. After tag push, pull ghcr.io/tosin2013/helmdeck:0.29.5 + ghcr.io/tosin2013/helmdeck-sidecar-hyperframes:0.29.5, restart control-plane, and re-run ./scripts/configure-openclaw.sh to stamp the new helmdeckVersion. Smoke-test the upload UX: open the Management UI's Artifacts page, drag any file onto the new upload card, confirm the resulting artifact_key appears with a copy button, paste into a chat invocation of builtin.byo-audio-narrated-video (audio files only; the pipeline rejects non-audio at compose's content-type validation).

Documentation

gpt-oss-120b-concept-animator howto gains a bring-your-own-audio variant section (#496 refinement). Recipe now covers the operator workflow where Maya (the sanitized worked-example persona) already has an audio file — recorded interview, stitched podcast clip, ElevenLabs render — and wants a narrated video against it without re-generating audio from a prompt. Operator-side: open Management UI → Artifacts → drag-drop the audio file (the new POST /api/v1/artifacts/upload endpoint shipped above) → copy the returned audio_artifact_key → paste into chat. Agent-side: a CONSTRAINTS-section override on the base AGENTS.md template that locks the model to ONE pack call (helmdeck__pipeline-run with builtin.byo-audio-narrated-video) and explicitly invalidates podcast.generate regeneration when an audio_artifact_key is supplied. Why this matters for Tier C models: from-scratch is a 5-call chain (podcast → compose → render → av.validate → verify_manifest) where each call is a drift opportunity; the BYO variant collapses to 1 call because the pipeline inlines the chain including the pre-render validation gates (lint/inspect/validate). Includes a sample test prompt + the expected JSON pack-call shape so an operator can verify their agent fires correctly. Companion gpt-oss-120b-slide-narrator recipe's Related section updated to point at the BYO variant.

Added

Operator-facing artifact upload — new POST /api/v1/artifacts/upload REST endpoint plus a drag-drop card on the Management UI's Artifacts page (web/src/pages/artifacts.tsx). Closes the UX gap surfaced during v0.29.4 testing: the operator has an MP3 (or any media file) on their laptop and wants to use it via the BYO-audio narrated-video pipeline, but there was no clean path to get the file INTO the artifact store. artifact.put is a pack — it takes bytes in the agent's tool input, but for a 2.5 MiB MP3 that means ~3.3 MiB of base64 in the chat message, which is impractical. The new endpoint accepts multipart/form-data with a file field, persists under the operator-uploads/ namespace, and returns {artifact_key, url, size, content_type, filename}. Content-type detection: prefer the browser-set Content-Type on the upload part, fall back to mime.TypeByExtension(filename), then to http.DetectContentType on the first 512 bytes. 100 MiB cap (50 MiB above hyperframes.attach_audio's audio cap so long-form audio + large video both fit). Filename sanitization strips path prefixes + control characters + truncates at 200 chars. UI surface: drag-drop zone with a fallback file input, success state shows the resulting artifact_key with a copy-to-clipboard button + a hint about pasting it into pipeline inputs. 8 new sub-tests cover input validation (happy path, content-type inference from extension, missing file field, plain-JSON instead of multipart, no-artifact-store-wired), filename sanitization (normal/spaces/path-prefixes/control-chars/empty/truncation), and the operator-uploads namespace contract. Workflow now: operator opens Management UI → Artifacts → drags MP3 → copies returned audio_artifact_key → asks agent to run builtin.byo-audio-narrated-video with that key + a topic description + duration_seconds. No chat-side file ingestion, no SSH, no curl scripts.

[0.29.4] - 2026-06-21

Theme: "Pre-render validation suite + bring-your-own-audio pipeline + render-deterministic authoring docs."

Four days of follow-up work to the v0.29.3 retest investigation. The v0.29.3 render produced 2 distinct frames over 90 seconds despite PR #546's slot-lifetime fix landing correctly — diagnosis showed it wasn't a slot-lifetime bug at all, it was upstream's "render ≠ preview" bug class manifesting in the decision-tree example we'd chosen as the default. v0.29.4 ships the architectural response: three new pre-render validation packs that wrap upstream's own diagnostic tools (hyperframes lint, hyperframes inspect, hyperframes validate), a bring-your-own-audio pipeline so operators can compose visuals against an MP3 they've already uploaded, an explanation page + agent skill that codify the empirically-derived render-deterministic composition rules, and the operator-visible <audio id=...> fix in attach_audio that upstream's lint was flagging as silent-in-renders all along. Back-compat-safe; operators on v0.29.3 can upgrade directly with no input or schema changes.

Operator upgrade: clean — no schema migrations, no removed packs, no pipeline-shape changes. Existing pipelines/agents continue unchanged. The three validation packs are additive; the BYO pipeline is additive (starter pipeline count went 22 → 23); the audio_artifact_key input on hyperframes.compose is additive and mutually exclusive with the existing audio_url. After tag push, pull ghcr.io/tosin2013/helmdeck:0.29.4 + ghcr.io/tosin2013/helmdeck-sidecar-hyperframes:0.29.4, restart control-plane, and re-run ./scripts/configure-openclaw.sh to stamp the new helmdeckVersion into the deployed SKILL.md AND install the new helmdeck-hyperframes-authoring skill (auto-discovered from skills/<name>/SKILL.md). For a quick smoke-test of the BYO path: artifact.put an MP3 → pipeline-run builtin.byo-audio-narrated-video with the returned key + a topic description + the audio's duration (max 720s).

Added

Three new packs form a pre-render validation suite for hyperframes scaffold projects: hyperframes.lint (static source), hyperframes.inspect (runtime layout), hyperframes.validate (runtime errors + WCAG contrast). All three share the same input shape as hyperframes.render (project_artifact_key OR composition_html, mutually exclusive), the same setup helper (setupHyperframesProjectDir), the same JSON-parse + strict-mode contract, the same soft-surface default (findings ARE the output; the pack returns success even with errors). The trio targets the three failure-detection windows: lint catches STATIC issues from source files (~1s, file-system only), inspect catches RUNTIME LAYOUT issues by loading in headless Chrome and sampling the DOM at N timestamps, validate catches RUNTIME CONSOLE ERRORS during the headless load plus a WCAG AA contrast audit across timeline samples. hyperframes.lint wraps hyperframes lint --json — catches media_missing_id (audio silent in renders), google_fonts_import (external font fetches fail in sandboxed renders), gsap_studio_edit_blocked (manual __timelines registration conflicting with runtime auto-discovery), composition_self_attribute_selector (CSS that leaks across embedded instances), missing_gsap_script, etc. hyperframes.inspect wraps hyperframes inspect --json — catches text_box_overflow (text extends past its container at a specific timestamp), transition_overlap (sibling clips overlap at a transition seam), static_collapse (element width or height goes to 0); at_transitions:true samples every tween start/end boundary to catch transient overlaps. 
hyperframes.validate wraps hyperframes validate --json — catches CORS-blocked external assets (which produce silent blank media in renders), net::ERR_FAILED for any external resource, JS runtime errors during composition load (which lead to blank-canvas renders), plus WCAG AA contrast failures across sampled timestamps; strict mode targets console errors only (contrast failures are a separate audit dimension). All three pass strict:true to surface error-severity findings as typed CodeArtifactFailed, gating downstream packs on a clean result. Reference docs at docs/reference/packs/hyperframes/{lint,inspect,validate}.md. 14 + 11 + 8 = 33 new sub-tests cover input validation, happy paths, strict-mode behavior, CLI argv shape (verbose, at-transitions, no-contrast flags thread correctly), the JSON-prefix stripper (CLI emits a telemetry notice before the JSON payload on first session invocation), and contrast-vs-error severity separation in validate strict mode. Architectural twin of av.validate end-to-end; the four packs together (av.validate post-render + the new three pre-render) give pipelines symmetric validation on both sides of the render boundary.
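The JSON-prefix stripper and the strict-mode gate described above can be sketched roughly as follows. This is a minimal illustration, not the pack's actual code: the finding and report types, the field names, and the function names are hypothetical, and it assumes the telemetry notice itself never contains a '{' character.

```go
package main

import (
	"bytes"
	"encoding/json"
	"errors"
	"fmt"
)

// finding and report are hypothetical shapes used only for illustration;
// the real CLI payload schema may differ.
type finding struct {
	Code     string `json:"code"`
	Severity string `json:"severity"`
}

type report struct {
	Findings []finding `json:"findings"`
}

// stripJSONPrefix drops everything before the first '{' so that a notice
// printed ahead of the JSON payload does not break decoding.
// Assumption: the notice text never contains '{'.
func stripJSONPrefix(raw []byte) ([]byte, error) {
	i := bytes.IndexByte(raw, '{')
	if i < 0 {
		return nil, errors.New("no JSON object found in CLI output")
	}
	return raw[i:], nil
}

// strictFailure reports whether any finding has error severity, which is
// the condition under which strict mode would fail the pack.
func strictFailure(r report) bool {
	for _, f := range r.Findings {
		if f.Severity == "error" {
			return true
		}
	}
	return false
}

func main() {
	out := []byte("notice: telemetry is enabled\n" +
		`{"findings":[{"code":"media_missing_id","severity":"error"}]}`)
	payload, err := stripJSONPrefix(out)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	var r report
	if err := json.Unmarshal(payload, &r); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(len(r.Findings), strictFailure(r))
}
```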

Fixed

hyperframes.attach_audio injected <audio> element now carries id="aroll-audio-<sha256-prefix>" matching the content-addressed filename stem. Upstream's own hyperframes lint flags media without id as a hard error (media_missing_id): "The renderer requires id to discover media elements — this audio will be SILENT in renders." The content-hash id mirrors the filename's hash component so the same audio bytes always produce the same id (stable across re-runs of the same narration). Surfaced during the v0.29.3 retest — even with PR #546's slot-lifetime fix, the audio element our pack injected was technically render-silent per upstream's contract. Existing 15 attach_audio sub-tests updated to assert the id contract; one new sub-test verifies content-addressed id stability across calls with identical audio bytes. Reference: field report, upstream issue heygen-com/hyperframes#1437 (render ≠ preview bug class).
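A content-addressed id of this kind could be derived along these lines. The sketch is illustrative only: the 12-character prefix length, the src attribute layout, and the function names are assumptions, since the entry does not state them.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// hashPrefixLen is an assumed prefix length; the changelog does not say
// how many hex characters the real pack keeps.
const hashPrefixLen = 12

// audioStem returns the shared stem used for both the filename and the
// element id, so identical audio bytes always map to the same id.
func audioStem(audio []byte) string {
	sum := sha256.Sum256(audio)
	return "aroll-audio-" + hex.EncodeToString(sum[:])[:hashPrefixLen]
}

// audioElement renders an illustrative <audio> tag whose id mirrors the
// filename stem (hypothetical attribute layout).
func audioElement(audio []byte, ext string) string {
	stem := audioStem(audio)
	return fmt.Sprintf(`<audio id="%s" src="assets/%s.%s"></audio>`, stem, stem, ext)
}

func main() {
	narration := []byte("example narration bytes")
	fmt.Println(audioElement(narration, "mp3"))
}
```

Because the id depends only on the audio bytes, re-running the pack on the same narration leaves the id unchanged.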

Added

hyperframes.compose gains an audio_artifact_key input as an alternative to audio_url — the handler resolves the artifact key to a presigned URL via ec.Artifacts.Get(...).Artifact.URL and threads it into the existing audio_url codepath. Mutually exclusive with audio_url. Enables bring-your-own-audio pipelines that compose visuals against pre-existing audio (operator uploads via artifact.put, or output from a prior pack call) without an intermediate artifact.get step that would base64-encode the full audio bytes just to extract the URL. Five new sub-tests cover BYO happy path (key resolves, composition embeds URL), mutual-exclusion guard, key-not-found error, no-artifact-store-wired error, and a back-compat regression test confirming the existing audio_url path is unchanged. Back-compat: pre-existing callers passing audio_url see ZERO behavior change.
New pipeline builtin.byo-audio-narrated-video: the bring-your-own-audio counterpart to builtin.prompt-narrated-video. Inputs: audio_artifact_key + description + duration_seconds (required) plus optional aspect_ratio (16:9 default / 9:16 / 1:1) and resolution (1080p / 4k). Pipeline shape: hyperframes.compose (with audio_artifact_key) → hyperframes.lint → hyperframes.inspect (with at_transitions:true) → hyperframes.validate → hyperframes.render. All three validation gates pass strict:true, so any error-severity finding aborts the pipeline BEFORE render burns wall-clock. The 12-minute cap is enforced by hyperframes.compose's existing hyperframesComposeMaxDuration (720s): duration_seconds > 720 returns CodeInvalidInput. Use case: a user uploads an MP3 (interview, lecture, podcast clip) and wants a topic-relevant narrated MP4. Unlike builtin.prompt-narrated-video, this pipeline skips podcast.generate because the audio already exists. Authoring is Tier-A (the LLM writes the composition from scratch); for Tier-C scaffold-based workflows, where the user wants visuals borrowed from upstream's curated examples instead of LLM-authored ones, use builtin.scaffolded-narrated-video, but note that it regenerates audio via podcast.generate and so won't preserve a user-uploaded MP3. Pipeline count update: starter pipelines went 22 → 23.
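The two input guards mentioned in these entries (audio_url and audio_artifact_key being mutually exclusive, and the 720-second cap) might look roughly like this. The struct, the sentinel error standing in for CodeInvalidInput, and the function name are hypothetical; only the rules themselves come from the entries above.

```go
package main

import (
	"errors"
	"fmt"
)

// errInvalidInput is a stand-in for the pack's CodeInvalidInput error code.
var errInvalidInput = errors.New("invalid_input")

// maxDurationSeconds mirrors the 720s cap named in the changelog.
const maxDurationSeconds = 720

// composeInput is a hypothetical subset of the compose pack's inputs.
type composeInput struct {
	AudioURL         string
	AudioArtifactKey string
	DurationSeconds  float64
}

// validateComposeInput rejects requests that set both audio sources or
// that exceed the duration cap.
func validateComposeInput(in composeInput) error {
	if in.AudioURL != "" && in.AudioArtifactKey != "" {
		return fmt.Errorf("%w: audio_url and audio_artifact_key are mutually exclusive", errInvalidInput)
	}
	if in.DurationSeconds > maxDurationSeconds {
		return fmt.Errorf("%w: duration_seconds %.0f exceeds %d", errInvalidInput, in.DurationSeconds, maxDurationSeconds)
	}
	return nil
}

func main() {
	fmt.Println(validateComposeInput(composeInput{AudioArtifactKey: "artifact-key", DurationSeconds: 300}))
	fmt.Println(validateComposeInput(composeInput{AudioArtifactKey: "artifact-key", DurationSeconds: 900}))
}
```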

Documentation

New explanation page docs/explanation/authoring-render-deterministic-compositions.md codifies the empirically-derived rules an LLM or human author must follow so a hyperframes composition renders correctly (not just previews correctly). Covers the structural contract (single GSAP timeline per composition, key matches data-composition-id, declarative sub-composition sequencing via data-start), the authoring-style contract (layout-before-animation, synchronous construction, no setTimeout/setInterval/requestAnimationFrame/repeat:-1/post-paint DOM mutation), the asset contract (media needs id, no external CDN URLs except GSAP itself, no CSS transform on GSAP-animated elements), and the pre-render validation gate (lint → inspect → validate, all strict, before render). Sourced from the v0.29.3 retest investigation; references upstream's "render ≠ preview" tracking issue (heygen-com/hyperframes#1437).
New skill skills/helmdeck-hyperframes-authoring/SKILL.md packages the same rules in agent-context-injection format. Auto-discovered by scripts/configure-openclaw.sh (no script changes needed). Use when an agent is authoring composition HTML for hyperframes.compose, hyperframes.render, or any pipeline that produces a programmatic MP4 — including builtin.scaffolded-narrated-video, builtin.prompt-video, builtin.prompt-narrated-video. Includes a worked example of the smallest render-deterministic composition skeleton (title + subtitle, 8s, with gsap.from() entrances) so the LLM has a reference shape to extend rather than authoring from scratch.
skills/helmdeck/SKILL.md updated: adds bullets for hyperframes.lint / hyperframes.inspect / hyperframes.validate next to the existing hyperframes.compose and hyperframes.render entries, references the new authoring skill, and emphasizes the "always run lint → inspect → validate BEFORE render" publish-gate pattern with token-economics rationale (lint <1s, inspect+validate ~10-30s vs render's ~1-5 min — gates catch failures cheaply before render budget burns).

Changed

scripts/hyperframes-bare-baseline.sh now defaults to --example=kinetic-type (empirically render-deterministic: 10 distinct frames over 10 samples) instead of decision-tree (render-hostile: 2 distinct frames over 15s even when rendered bare from upstream's registry). It adds --lint=true|false (default true) and a --no-lint shorthand to run upstream's hyperframes lint --json upfront and surface findings in diagnostic.json and the final summary. The help text documents the render-deterministic example set (kinetic-type, swiss-grid, warm-grain) and notes that decision-tree remains available as the test bed for reproducing the v0.29.2/v0.29.3 slot-lifetime regression.

[0.29.3] - 2026-06-17

Theme: "Decision-tree blank-canvas fix + upstream pin hygiene."

Two follow-ups to v0.29.2's hyperframes.attach_audio pack. The first (#546) closes the blank-canvas symptom operators saw when narrated videos extended past the scaffold's 15-second child composition: attach_audio now stretches the child's data-duration to match the root's when they started equal, eliminating the upstream slot-lifetime trigger. The second (#548) bumps the sidecar's pinned hyperframes from 0.6.97 to 0.6.110 for general hygiene. Both are back-compat-safe; operators on v0.29.2 can upgrade directly with no input or schema changes.

Honest framing: the pin bump does NOT fix the slot-lifetime bug. Upstream #911 (closed 2026-05-17, shipped in 0.6.110) addresses an adjacent code path; helmdeck's actual bug is filed at heygen-com/hyperframes#1540 and tracked in #547. PR #546's child-composition rewrite is the only thing closing the operator-visible bug today. See the blog post for the empirical trail.

Operator upgrade: clean — no schema migrations, no removed packs. Existing pipelines/agents continue unchanged. After tag push, pull ghcr.io/tosin2013/helmdeck:0.29.3 + ghcr.io/tosin2013/helmdeck-sidecar-hyperframes:0.29.3, restart control-plane, and re-run ./scripts/configure-openclaw.sh to stamp the new helmdeckVersion into the deployed SKILL.md. Re-run any narrated-video pipeline that produced a 15-second-then-blank canvas on v0.29.2 to confirm the fix.

Changed

helmdeck-sidecar-hyperframes Dockerfile bumps the pinned upstream from hyperframes@0.6.97 to hyperframes@0.6.110 — version hygiene only; does NOT fix the child-composition slot-lifetime bug. Background: while drafting PR #546 (the helmdeck-side child-composition stretch fix), I noticed upstream #911 had been closed 2026-05-17 with a fix that sounded like our exact symptom ("Sub-composition slot goes black after GSAP timeline ends, regardless of host data-duration"). Empirically verified the closure by building the sidecar with hyperframes@0.6.110 and re-rendering the same bare npx hyperframes init decision-tree scaffold with root extended to 331s and child left at 15s: frames at t=20s/100s/200s/300s are byte-identical to the 0.6.97 result (md5 9c95fca0…, 8.6 KB blank canvas), continuing to blank the instant the child's data-duration elapses. Inspecting the shipped runtime bundle confirmed the #911 fix IS present (d.hasAttribute("data-composition-src")||d.hasAttribute("data-composition-file")) but addresses an adjacent code path — the producer's htmlCompiler stripping data-composition-src during inlining — not the duration-mismatch case helmdeck hits. Filed heygen-com/hyperframes#1540 with the reproducer; helmdeck-side watch issue #547 tracks when the shim can come back out. PR #546's helmdeck-side rewrite remains the only fix in play. The pin bump is still worth landing: 13 patch releases of unrelated upstream improvements, ADR-037 exact-version + CLI-surface-sentinel discipline intact, Dependabot tracking the live latest, and the regression-check render confirmed 0.6.110 does NOT break the working scenario (root and child durations matched). Operator-visible reproducer + the wider "trust-but-verify-an-issue-close" story in the 2026-06-17-child-composition-slot-lifetime blog post.

Fixed

hyperframes.attach_audio now stretches child compositions whose data-duration matched the root's original, closing the v0.29.2 follow-up bug where the decision-tree scaffold rendered as a blank canvas for 83 of 98 seconds. Empirical repro from run_6f6cb0ea40a94dd1: a ~98-second narrated video, audio attached correctly, but visuals went white at 15s and stayed white through the rest. Root cause: the decision-tree scaffold's index.html has both a root composition (data-composition-id="main", data-duration="15") AND a child composition (<div data-composition-id="decision-tree" data-composition-src="compositions/decision_tree.html" data-duration="15">). The v0.29.2 attach_audio rewrote root's duration to 97.9 but left the child at 15 — so the renderer played 0-15s of decision-tree animation followed by 83 seconds of inactive (blank) canvas. The fix extends updateRootDataDuration to ALSO rewrite any <div> with a data-composition-id attribute whose data-duration equals the root's original. Conservative heuristic: only stretch children that were span-aligned with the root. Operator-deliberate divergences (e.g. a 5-second intro composition under a 30-second root) are preserved — when a child's data-duration differs from root's original, it's left alone. class="clip" data-durations are still untouched (no data-composition-id anchor on clip elements). Four new sub-tests cover the empirical decision-tree shape, the stretches-matching-children behavior, the leaves-divergent-children-alone behavior, and the regression guard for class="clip" semantics. Existing 15 attach_audio tests pass unchanged. 1124 builtin / 1770 across consumers pass with race detector clean.
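The conservative stretch heuristic can be illustrated with a small regex-based sketch. It is not the pack's updateRootDataDuration: the function name and regular expressions are hypothetical, and it assumes attribute values never contain '>' and that the root div is marked with data-composition-id="main".

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var (
	divTag   = regexp.MustCompile(`<div\b[^>]*>`)
	durAttr  = regexp.MustCompile(`(\s)data-duration="([^"]*)"`)
	compAttr = regexp.MustCompile(`\sdata-composition-id="[^"]*"`)
)

// stretchDurations rewrites data-duration on the root div and on any div
// that carries a data-composition-id and whose duration equals the root's
// original value. Divergent children and plain clips are left alone.
func stretchDurations(html, newDur string) string {
	orig := ""
	for _, tag := range divTag.FindAllString(html, -1) {
		if strings.Contains(tag, `data-composition-id="main"`) {
			if m := durAttr.FindStringSubmatch(tag); m != nil {
				orig = m[2]
			}
			break
		}
	}
	if orig == "" {
		return html // no root duration to compare against
	}
	return divTag.ReplaceAllStringFunc(html, func(tag string) string {
		if !compAttr.MatchString(tag) {
			return tag // e.g. class="clip" elements without a composition id
		}
		m := durAttr.FindStringSubmatch(tag)
		if m == nil || m[2] != orig {
			return tag // deliberately divergent child
		}
		return durAttr.ReplaceAllString(tag, "${1}data-duration=\""+newDur+"\"")
	})
}

func main() {
	html := `<div data-composition-id="main" data-duration="15">` +
		`<div data-composition-id="decision-tree" data-duration="15"></div>` +
		`<div data-composition-id="intro" data-duration="5"></div>` +
		`<div class="clip" data-duration="15"></div></div>`
	fmt.Println(stretchDurations(html, "97.9"))
}
```

In the example, the root and the span-aligned child move to 97.9 while the 5-second intro and the clip keep their original values.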

[0.29.2] - 2026-06-17

Theme: "Silent-video fix + tunable Firecrawl/LLM concurrency."

Two follow-ups landed within hours of v0.29.1: the hyperframes.attach_audio pack that closes the v0.28.x silent-video bug for unreliable upstream examples (decision-tree empirically), and an operator env var to tune content.ground's per-call Firecrawl + verify concurrency. Both are back-compat-safe; operators on v0.29.1 can upgrade directly with no input or schema changes.

Operator upgrade: clean — no schema migrations, no removed packs. Existing pipelines/agents continue unchanged. The new pack is additive; the concurrency env var defaults to today's hardcoded 4 when unset. After tag push, pull ghcr.io/tosin2013/helmdeck:0.29.2, restart control-plane, and re-run ./scripts/configure-openclaw.sh to stamp the new helmdeckVersion into the deployed SKILL.md.

Added

content.ground Phase 2 concurrency is now operator-tunable via HELMDECK_CONTENT_GROUND_CONCURRENCY (#524). Default stays 4 (the historical hardcoded value), range [1, 32]. Out-of-range or non-numeric values silently fall back to the default — an operator typo can't break grounding. Use case: operators running self-hosted Firecrawl with relaxed rate limits + a dedicated LLM gateway can raise the limit for faster wall-clock on long posts (12+ claims); operators on free-tier shared infrastructure can lower it to avoid bumping into rate caps. The constant contentGroundConcurrency becomes a function contentGroundConcurrency() re-evaluated on every handler entry — restart not required after updating the env. Eighteen new sub-tests cover the default, valid overrides at the boundaries (1, 4, 8, 16, 32), out-of-range fallback (0, -1, 33, 1000, 100000), non-numeric fallback (typos like "fourrr", "4.5"), and whitespace trimming (" 4 " → 4). 1120 builtin tests pass with race detector clean; existing 45 content.ground tests pass unchanged (back-compat). Reference doc gains a "Tunable Phase 2 concurrency" section explaining when to raise vs lower the limit.
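The fallback behaviour described here could be implemented along these lines. This is a sketch under stated assumptions: parseConcurrency is a hypothetical helper split out for testability, and the real contentGroundConcurrency() may be structured differently.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

const (
	defaultConcurrency = 4
	minConcurrency     = 1
	maxConcurrency     = 32
)

// parseConcurrency trims whitespace, parses an integer, and silently falls
// back to the default for non-numeric or out-of-range values.
func parseConcurrency(raw string) int {
	n, err := strconv.Atoi(strings.TrimSpace(raw))
	if err != nil || n < minConcurrency || n > maxConcurrency {
		return defaultConcurrency
	}
	return n
}

// contentGroundConcurrency re-reads the environment on every call, so the
// value is evaluated per handler entry rather than once at startup.
func contentGroundConcurrency() int {
	return parseConcurrency(os.Getenv("HELMDECK_CONTENT_GROUND_CONCURRENCY"))
}

func main() {
	for _, v := range []string{"", " 4 ", "16", "33", "fourrr", "4.5"} {
		fmt.Printf("%q -> %d\n", v, parseConcurrency(v))
	}
	fmt.Println("from env:", contentGroundConcurrency())
}
```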

Added

New pack hyperframes.attach_audio closes the silent-video failure mode in builtin.scaffolded-narrated-video (#521). Background: upstream hyperframes init --audio=<path> is silently ignored by at least the decision-tree example (and possibly others — empirical per-example reliability), so threading audio_url through hyperframes.scaffold is unreliable. Despite v0.28.4 and v0.28.5 fixing every other step in the chain (audio threading, concat-vs-validate timing), runs against decision-tree continued to produce 15-second silent MP4s. This pack is the deterministic alternative: pure-Go in-process tarball transform that downloads the audio bytes, embeds them under assets/aroll-audio-<sha256-prefix>.<ext> (content-addressed for dedup), and injects an <audio> element as the first child of the root composition div (matched by data-composition-id="main" — the canonical hyperframes scaffold convention; tolerant of arbitrary attribute order). The element carries data-start="0", data-duration=<seconds>, data-volume=<volume>, data-track-index=<idx> per upstream's contract (volume defaults to 1.0, track index to 9 — the documented audio-track row). By default also rewrites the root composition div's data-duration to the audio length (update_root_duration: true) so the rendered video plays the full narration; set false when hyperframes.interpolate has already established the duration. Required inputs: project_artifact_key, audio_artifact_key, duration_seconds. Outputs: new project_artifact_key plus audio_filename / audio_size / duration_seconds_used / root_duration_updated / track_index_used / volume_used telemetry. Supported audio content types: audio/{mpeg,mp3,mp4,aac,wav,x-wav} covering ElevenLabs' default mp3_44100_192 and the common alternatives. 50 MiB cap matches hyperframes.attach_asset's. Same shape as attach_asset end-to-end — no dispatcher, no session executor, just ec.Artifacts. 
15 new sub-tests cover input validation (missing keys, negative duration, missing audio/project, empty bytes, oversize, unsupported content-type), the happy path (MP3 with all defaults — confirms audio element injected, data-duration rewritten, audio file written into tarball, content-addressed filename stable across calls), update_root_duration:false semantics (root duration preserved), custom volume/track_index, no-root-div rejection, missing-index.html rejection, and the regex helpers at unit level (spliceAudioIntoRoot finds data-composition-id="main" regardless of attribute order; updateRootDataDuration only rewrites the root, not child clip durations; handles data-duration before/after data-composition-id in the attribute list). builtin.scaffolded-narrated-video pipeline rewired: hyperframes.scaffold no longer receives audio_url (upstream's unreliable path), and hyperframes.attach_audio is inserted between interpolate and render chaining podcast.generate.audio_artifact_key + duration_s. Existing direct-pack callers of hyperframes.scaffold that still pass audio_url see no behavior change — the scaffold input is preserved for back-compat; this PR just stops the built-in pipeline from relying on it. Reference doc docs/reference/packs/hyperframes/attach_audio.md covers the splice algorithm, supported content types, and the issue #521 history. 1102 builtin / 1748 across consumers pass with race detector clean.
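The splice step (inject an audio element as the first child of the root div, regardless of attribute order) can be sketched as below. The function names, the regular expression, and the src attribute are illustrative assumptions rather than the pack's real helpers; the data-* attributes follow the contract quoted in the entry.

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
)

// rootOpenTag matches the opening tag of the root composition div whatever
// the attribute order. Assumption: attribute values never contain '>'.
var rootOpenTag = regexp.MustCompile(`<div\b[^>]*\sdata-composition-id="main"[^>]*>`)

// buildAudioTag renders the element with the data-* attributes listed in
// the entry above; the src attribute is an illustrative assumption.
func buildAudioTag(src string, durationSeconds, volume float64, trackIndex int) string {
	return fmt.Sprintf(
		`<audio src="%s" data-start="0" data-duration="%g" data-volume="%g" data-track-index="%d"></audio>`,
		src, durationSeconds, volume, trackIndex)
}

// injectAudio inserts audioTag immediately after the root div's opening
// tag, making it the first child. It fails when no root div exists.
func injectAudio(html, audioTag string) (string, error) {
	loc := rootOpenTag.FindStringIndex(html)
	if loc == nil {
		return "", errors.New("no root composition div found")
	}
	return html[:loc[1]] + audioTag + html[loc[1]:], nil
}

func main() {
	html := `<div data-duration="15" data-composition-id="main"><p>scene</p></div>`
	tag := buildAudioTag("assets/aroll-audio-example.mp3", 97.9, 1.0, 9)
	out, err := injectAudio(html, tag)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(out)
}
```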

[0.29.1] - 2026-06-17

Theme: "v0.29.0 follow-ups — close the audit gap and unblock operator upgrades."

Two fixes surfaced during v0.29.0 release prep + first-hour operator testing. Both are back-compat-safe; operators can upgrade directly from v0.29.0 with no input or schema changes.

Operator upgrade: clean — no schema migrations, no removed packs. Existing pipelines/agents continue unchanged. configure-openclaw.sh now actually completes on a correctly-configured deployment (was blocked by a false-positive preflight on every prior version). After tag push, run git pull && ./scripts/configure-openclaw.sh to pick up both fixes; the script refreshes the deployed SKILL.md with the new helmdeckVersion stamp.

Added

content.ground gains a handler-internal per-claim cache that survives unrelated edits to the input markdown (#523). The engine-level MemoryConfig cache (ADR 047, added in PR #522) keys on sha256(caller + input bytes), so a typo fix anywhere in the markdown invalidates every claim's cached source. The new per-claim cache keys on sha256(claim_text + "\0" + search_query) — claims whose text + query is unchanged across edits hit, skipping BOTH the Firecrawl /v1/search call AND the per-source verify LLM call. The two caches stack: the engine cache catches idempotent re-runs (~millisecond replay); the per-claim cache catches the "fix a typo, re-cite" workflow where the engine cache misses but the claim set is mostly unchanged. TTL is 7 days (vs the engine cache's 24h) because the per-claim key is content-derived rather than time-derived. The cache is goroutine-safe (Phase 2's bounded errgroup populates it concurrently); cache hits skip the errgroup slot entirely so a cached re-run completes in ~claim-extractor wall-clock with no Phase 2 work. Failed Firecrawl searches are NOT cached (transient outages shouldn't poison the cache for 7 days); empty picks (no source found) ARE cached so a re-run doesn't re-burn the verify LLM call on the same null result. Two new output fields: claims_cached (per-claim cache hits) and firecrawl_calls (real Firecrawl calls, excludes cache hits) — operators see "0 of 5 claims hit Firecrawl after the typo fix" telemetry. Existing Memory: &MemoryConfig{...} declaration on the pack stays (engine cache continues to work); the per-claim layer is additive. Four new sub-tests: typo-fix workflow with all-cache-hits (zero Firecrawl, zero verify), mutate-one-claim with 2 hits + 1 miss, nil-Memory safety (engine without WithMemoryStore works unchanged), key stability. Existing 41 content.ground tests pass unchanged (back-compat); race detector clean. Reference doc has a two-layer cache table explaining when each layer hits.
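A rough sketch of the per-claim key and of the caching rules (cache empty picks, never cache failed searches) follows. The types and function names are hypothetical, and the 7-day TTL is deliberately left out to keep the example short.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
	"sync"
)

// errSearchFailed simulates a transient search failure in this sketch.
var errSearchFailed = errors.New("search failed")

// claimCacheKey hashes claim text and query with a NUL separator so that
// ("ab", "c") and ("a", "bc") cannot collide.
func claimCacheKey(claimText, searchQuery string) string {
	h := sha256.New()
	h.Write([]byte(claimText))
	h.Write([]byte{0})
	h.Write([]byte(searchQuery))
	return hex.EncodeToString(h.Sum(nil))
}

// claimCache is a mutex-guarded map; the value is the picked source URL,
// where an empty string means "no source found".
type claimCache struct {
	mu sync.Mutex
	m  map[string]string
}

func newClaimCache() *claimCache { return &claimCache{m: map[string]string{}} }

// ground returns the cached pick when present; otherwise it calls search,
// caches successful results (including empty picks), and leaves failures
// uncached so a transient outage is retried on the next run.
func (c *claimCache) ground(claim, query string, search func(string) (string, error)) (url string, hit bool, err error) {
	key := claimCacheKey(claim, query)
	c.mu.Lock()
	cached, ok := c.m[key]
	c.mu.Unlock()
	if ok {
		return cached, true, nil
	}
	url, err = search(query)
	if err != nil {
		return "", false, err
	}
	c.mu.Lock()
	c.m[key] = url
	c.mu.Unlock()
	return url, false, nil
}

func main() {
	c := newClaimCache()
	search := func(q string) (string, error) { return "https://example.com/source", nil }
	_, hit1, _ := c.ground("claim text", "query", search)
	_, hit2, _ := c.ground("claim text", "query", search)
	fmt.Println(hit1, hit2)
}
```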

Fixed

scripts/configure-openclaw.sh: the auth probe now completes correctly on authenticated deployments (#539). Two bugs stacked. (1) SIGPIPE under pipefail: the probe piped openclaw models auth list 2>/dev/null directly into grep -q. grep -q exits immediately on first match, which closes its stdin and SIGPIPEs the upstream auth-list call (rc=141 = 128 + signal 13). With set -o pipefail (set at script line 34), the 141 propagates as the pipeline exit status; the if ! inverts it; the script dies with "missing openrouter auth" on a correctly-authenticated deployment. (2) Redundant ephemeral container: the original implementation used docker compose -f $OPENCLAW_COMPOSE_FILE run --rm -T openclaw-cli, which spawns a fresh container that exits non-zero for unrelated reasons, with stderr discarded by 2>/dev/null, on top of the SIGPIPE issue. Fix: capture-then-grep. The script captures the auth list into a variable first, then greps the variable. The capture lets the upstream command finish cleanly (no SIGPIPE); the in-memory grep against $auth_list never closes its stdin early. The probe also switched to docker exec against the running $OPENCLAW_CONTAINER (the pattern used elsewhere in this script) so the auth state checked is the one OpenClaw actually uses. The bug surfaced when it blocked the v0.29.0 SKILL.md refresh: every prior configure-openclaw.sh run on this deployment hit the false-positive die. The 4 other grep -q probes in this script (against tiny docker ps / docker network inspect outputs that fit in the kernel pipe buffer) are unaffected and unchanged; if they ever bite, the same capture-then-grep pattern applies.

[0.29.0] - 2026-06-16

Theme: "Packs measure their own input."

Closes the cross-pack JIT length-sizing convention adoption arc. All six length-variable packs (blog.rewrite_for_audience, podcast.generate, hyperframes.compose, slides.narrate, research.deep, content.ground) now accept length_intent (summary / thorough / exhaustive) + inspect:true + explicit numeric overrides, and report length_intent_applied + truncated on every generate response. Calling agents can declare intent uniformly and stop precomputing per-pack length surfaces in their AGENTS.md. Originally motivated by an undersized blog rewrite empirically observed 2026-06-16: a ~7,000-word source compressed to 1,161 words because the agent's static "1300-2000 words for technical-deep-dive" target couldn't scale with source size. All six adoptions strictly back-compat — existing callers passing the explicit numeric input (max_tokens, duration_target_min, duration_seconds, max_claims, limit) see ZERO behavior change. The umbrella tracking issue (#525) closed with this release.

Operator upgrade: clean — no schema migrations, no removed packs, no breaking input changes. All input/output additions are optional. Existing pipelines and agents continue to work unchanged; the new convention is opt-in via the new fields. Pack signatures podcast.GenerateScript, slides.narrate's generateEngagement, and content.ground's extractClaims gained finish-reason returns — internal API changes only, no caller impact.

Added

content.ground adopts the JIT length-sizing convention (#525 umbrella, #531 follow-up). Sixth and final pack in the cross-pack adoption sequence after blog.rewrite_for_audience (#527), podcast.generate (#533), hyperframes.compose (#534), slides.narrate (#535), and research.deep (#536). content.ground is cost-cap shaped like research.deep — each claim costs a Firecrawl /v1/search + per-source LLM verify call. Intent maps directly to the existing max_claims input: summary → 3 claims, thorough → 5 (matches legacy default), exhaustive → 8 (matches legacy ceiling). The issue's original "intentional back-compat break" framing was based on a wrong premise — the current code is already capped at 8 with a default of 5, not unlimited. The exhaustive row labels today's hard cap rather than relaxing it. New optional inputs: length_intent and inspect:true. Precedence: inspect:true short-circuit → explicit max_claims ("explicit", clamped to [1, 8]) → length_intent ("intent:*") → legacy default 5 ("default"). Strict back-compat: existing callers passing max_claims see ZERO behavior change. New outputs on every generate response: max_claims_applied (what was actually used after clamping), length_intent_applied (where the value came from), truncated (fires when EITHER the claim extractor LLM hit finish_reason=length OR the rewrite step truncated and fell back to citation-only). extractClaims's signature gains a finish-reason return: (claims, raw, finishReason, error). The rewrite step's pre-existing errRewriteTruncated signal is now also surfaced via truncated:true rather than being a silent log-only event. inspect:true short-circuits before the dispatcher / HELMDECK_FIRECRAWL_ENABLED checks — gateway-less, Firecrawl-less environments can plan a grounding pass. OutputSchema.Required narrowed from [claims_considered, claims_grounded, sha256] to [] so inspect responses (no extraction) satisfy the validator. 
Six new sub-tests cover inspect short-circuit (no Firecrawl, no dispatcher) + resolver precedence at unit level + each intent row mapping to the right max_claims_applied + explicit-max_claims-wins back-compat + no-input default → 5 / "default" + extractor finish_reason=length → truncated + unknown-intent fallback. 1073 builtin tests pass with race detector clean. Reference doc updated with the heuristic table, precedence rules, and the truncated-signal semantics. Closes the cross-pack adoption sequence — all six length-variable packs now share length_intent / inspect / truncated so agents can declare intent uniformly and stop precomputing per-pack length surfaces in their AGENTS.md.
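The precedence chain for max_claims (explicit value, then intent, then legacy default) might be expressed like this. The function name and the pointer-for-unset convention are assumptions, the inspect short-circuit is omitted, and mapping an unknown intent to "default" is a guess at the fallback the entry mentions.

```go
package main

import "fmt"

// resolveMaxClaims returns the claim budget and a label for where it came
// from. A nil explicit pointer means max_claims was not provided.
func resolveMaxClaims(explicit *int, intent string) (int, string) {
	if explicit != nil {
		n := *explicit
		if n < 1 {
			n = 1
		}
		if n > 8 {
			n = 8
		}
		return n, "explicit"
	}
	switch intent {
	case "summary":
		return 3, "intent:summary"
	case "thorough":
		return 5, "intent:thorough"
	case "exhaustive":
		return 8, "intent:exhaustive"
	}
	// Assumption: unknown or empty intents use the legacy default.
	return 5, "default"
}

func main() {
	twelve := 12
	fmt.Println(resolveMaxClaims(&twelve, "summary"))
	fmt.Println(resolveMaxClaims(nil, "exhaustive"))
	fmt.Println(resolveMaxClaims(nil, ""))
}
```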

Added

research.deep adopts the JIT length-sizing convention (#525 umbrella, #532 follow-up). Fifth in the cross-pack adoption sequence after blog.rewrite_for_audience (#527), podcast.generate (#533), hyperframes.compose (#534), and slides.narrate (#535). research.deep is cost-cap shaped: the "length" being controlled isn't output words or duration but the number of source URLs scraped per call (each costs a Firecrawl SERP page hit + a per-source markdown scrape + a slice of the synthesis LLM's context window). Intent maps directly to the existing limit input: summary → 3 sources, thorough → 5 (matches the legacy default), exhaustive → 10 (matches the hard cap). New optional inputs: length_intent and inspect:true. Precedence: inspect:true short-circuit → explicit limit ("explicit", clamped to [1, 10]) → length_intent ("intent:*") → legacy default 5 ("default"). Strict back-compat: existing callers passing limit see ZERO behavior change. New outputs on every generate response: limit_applied (what Firecrawl actually saw), sources_used (count after empty-markdown filtering — operators see how lossy the scrape was), length_intent_applied, truncated (fires when the synthesis LLM hit finish_reason=length; re-run with smaller intent or larger max_tokens). inspect:true short-circuits before the dispatcher / HELMDECK_FIRECRAWL_ENABLED / model-required checks — gateway-less, Firecrawl-less environments can plan a research call. InputSchema.Required narrowed from [query, model] to [query] so inspect-mode payloads omitting model aren't rejected by the engine validator (runtime model-required check is preserved for the generate path). OutputSchema.Required narrowed from [query, sources, synthesis, model] to [query] so inspect responses satisfy the validator. 
Eleven new sub-tests cover inspect short-circuit (no Firecrawl, no dispatcher) + each intent row mapping to the right Firecrawl limit + explicit-limit-wins back-compat + no-input default → 5 / "default" + finish-reason truncation + JIT-metric presence + resolver precedence at unit level + unknown-intent fallback. 1062 builtin tests pass with race detector clean. Reference doc updated with the heuristic table, precedence rules, and an inspect-mode response section.

Added

slides.narrate adopts the JIT length-sizing convention (#525 umbrella, #530 follow-up). slides.narrate's relationship to the convention is unusual: the pack does NOT generate narration (notes come from the input markdown — typically prepared by slides.outline) and per-slide duration is dictated by the natural length of the TTS audio. So length_intent is observational + reporting rather than active sizing: the agent declares the density they expected, the pack measures what they actually got, and reports the gap so the agent can iterate on slides.outline if needed. New optional inputs: length_intent (summary / thorough / exhaustive), inspect:true, words_per_slide_min/max. Heuristic table: summary → 40-60 words per narrated slide (~16-24 sec at 150 wpm), thorough → 80-120 (~32-48 sec), exhaustive → 150-220 (~60-88 sec). Precedence: inspect:true short-circuit → explicit words_per_slide_min + max ("explicit") → length_intent ("intent:*") → no input → "default:reporting-only" (thorough's range used as the stats baseline so within/outside counts stay meaningful). New outputs on every generate response: source_words_per_slide_avg/min/max, narrated_slide_count, slides_within_intent_range, slides_outside_intent_range, length_intent_applied, truncated (fires when the engagement-metadata LLM hit finish_reason=length — the only gateway-dispatch call in the pack; TTS is HTTP-direct). inspect:true short-circuits before the session executor / vault checks — gateway-less and session-less environments can run a deck quality check without renderable resources. Silent slides (empty notes) are excluded from the average so intro/outro placeholders don't drag the density signal down. generateEngagement's signature gains a finish-reason return value: (map, string, error). OutputSchema.Required narrowed from [video_artifact_key, video_size, slide_count, total_duration_s, has_narration] to [slide_count] so inspect-mode responses (parse-only, no rendering) satisfy the engine validator. 
Eight new sub-tests; 1056 builtin / 1702 across consumers pass with race detector clean. Reference doc updated with the density table, precedence rules, and an explanation of why this adoption is observational rather than active.
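The observational measurement can be sketched as a small word-count pass over the slide notes. The type and function names are hypothetical, explicit words_per_slide_min/max overrides are omitted, and whitespace-separated tokens stand in for whatever word counting the pack really uses.

```go
package main

import (
	"fmt"
	"strings"
)

// densityStats is a hypothetical container for the reported metrics.
type densityStats struct {
	Narrated, Min, Max, Within, Outside int
	Avg                                 float64
}

// intentRange maps an intent to its words-per-slide range. With no
// recognized intent, thorough's range serves as the reporting baseline.
func intentRange(intent string) (lo, hi int, applied string) {
	switch intent {
	case "summary":
		return 40, 60, "intent:summary"
	case "thorough":
		return 80, 120, "intent:thorough"
	case "exhaustive":
		return 150, 220, "intent:exhaustive"
	}
	return 80, 120, "default:reporting-only"
}

// measureDensity counts words per slide, skipping silent slides (empty
// notes) so placeholders do not drag the average down.
func measureDensity(notes []string, lo, hi int) densityStats {
	var s densityStats
	total := 0
	for _, n := range notes {
		w := len(strings.Fields(n))
		if w == 0 {
			continue
		}
		if s.Narrated == 0 || w < s.Min {
			s.Min = w
		}
		if w > s.Max {
			s.Max = w
		}
		s.Narrated++
		total += w
		if w >= lo && w <= hi {
			s.Within++
		} else {
			s.Outside++
		}
	}
	if s.Narrated > 0 {
		s.Avg = float64(total) / float64(s.Narrated)
	}
	return s
}

// secondsAt150WPM converts a word count to speaking time at 150 wpm.
func secondsAt150WPM(words int) float64 { return float64(words) * 60 / 150 }

func main() {
	lo, hi, applied := intentRange("summary")
	fmt.Println(lo, hi, applied, secondsAt150WPM(lo), secondsAt150WPM(hi))
}
```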

Added

hyperframes.compose adopts the JIT length-sizing convention (#525 umbrella, #529 follow-up). New optional inputs: length_intent (summary / thorough / exhaustive) and inspect:true. Unlike blog.rewrite_for_audience (#527) and podcast.generate (#533) — both of which scale by source word count — the compose pack picks a fixed duration from the intent table because the description is a planning instruction, not source material: summary → 60s (floor 30s, ceiling 120s), thorough (default for intent path) → 180s (120-360s), exhaustive → 600s (360-720s, matches hyperframes.render's 12-min cap). Precedence: inspect:true short-circuit → audio_url + duration_seconds ("explicit:audio-locked") → duration_seconds > 0 ("explicit") → length_intent set → legacy 8-sec default ("default:legacy-8sec", preserves back-compat — existing silent-micro-animation callers see ZERO behavior change). New outputs on every generate response: description_words, target_duration_sec_chosen, length_intent_applied, truncated (fires when the composition-HTML LLM hit finish_reason=length, signaling the assembled HTML may be incomplete — re-run with a richer description or smaller intent / larger max_tokens). inspect:true short-circuits before the dispatcher / model-required / audio-requires-duration checks — gateway-less and dispatcher-less environments can plan a composition without spending anything; an agent can also inspect with audio_url set even before measuring the audio duration. The model field is no longer in InputSchema.Required (was [description, model], now [description]) so inspect-mode payloads omitting model aren't rejected by the engine schema validator; runtime check still enforces model for the generate path. Thirteen new sub-tests cover inspect short-circuit + back-compat default + each intent row + numeric-overrides-intent + audio-locked precedence + finish-reason truncation + stop-finish-no-truncation + JIT-metric presence + resolver precedence + unknown-intent fallback. 
1036 builtin tests pass with race detector clean. Reference doc updated with the heuristic table, precedence rules, and an inspect-mode worked example.
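The precedence chain above can be sketched as a small resolver. This is a minimal illustration, not the pack's code: `resolveDuration` and `intentSeconds` are hypothetical names, the label returned on the intent path is an assumption, and treating an unknown intent as `thorough` is also an assumption (the entry only says an unknown-intent fallback exists). The `inspect:true` short-circuit and the audio-requires-duration check are not modeled.

```go
package main

import "fmt"

// intentSeconds holds the fixed per-intent durations quoted in the entry above.
var intentSeconds = map[string]int{"summary": 60, "thorough": 180, "exhaustive": 600}

// resolveDuration (hypothetical helper) returns the chosen duration in seconds
// and a length_intent_applied style label, following the documented precedence:
// audio-locked explicit, then explicit, then intent, then the legacy default.
func resolveDuration(audioURL string, durationSeconds int, lengthIntent string) (int, string) {
	switch {
	case audioURL != "" && durationSeconds > 0:
		return durationSeconds, "explicit:audio-locked"
	case durationSeconds > 0:
		return durationSeconds, "explicit"
	case lengthIntent == "":
		return 8, "default:legacy-8sec"
	}
	if secs, ok := intentSeconds[lengthIntent]; ok {
		// Assumption: the label on the intent path is the intent name itself.
		return secs, lengthIntent
	}
	// Assumption: an unrecognized intent is treated as "thorough".
	return intentSeconds["thorough"], "thorough"
}

func main() {
	secs, label := resolveDuration("", 0, "summary")
	fmt.Println(secs, label)
}
```

A caller that sets neither `duration_seconds` nor `length_intent` lands on the last documented rung and keeps the 8-second behavior.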

Added

podcast.generate adopts the JIT length-sizing convention (umbrella #525, pilot landed in v0.28.7's blog.rewrite_for_audience, podcast follow-up #528). New optional inputs: length_intent (summary / thorough / exhaustive) and inspect:true. The pack measures the source it actually sees (script text in mode A, source_text in mode C-2, scraped text in mode C-1 after Firecrawl) and picks a duration_target_min from a heuristic table: summary → reading time × 0.20, floor 1 min, ceiling 3 min; thorough → × 0.50, 3 min, 8 min; exhaustive → × 0.90, 6 min, 12 min. Reading time uses 150 wpm (matches slides.narrate's caption pacing constant). Back-compat is strict: when neither length_intent nor duration_target_min is set, the pack falls back to today's 8-min default (length_intent_applied: "default:legacy-8min"); existing callers see ZERO behavior change. duration_target_min still wins over intent when set (length_intent_applied: "explicit"); script mode reports "n/a:script" because the script's length is intrinsic. New outputs on every generate response: source_words, target_duration_min_chosen, actual_duration_min, length_intent_applied, truncated (fires on finish_reason=length from the script-generation LLM). inspect:true short-circuits before the dispatcher / session / vault checks — gateway-less and session-less environments can plan podcast duration without spending anything; the model-required check and Firecrawl-enabled check are both skipped when inspecting. inspect does NOT scrape source_url (the reason field tells the caller to call again without inspect to get a measured suggestion). podcast.GenerateScript's signature gains a finish-reason return value (([]Turn, string, error)); only 2 in-tree callers, both updated. Reference doc updated with the heuristic table, precedence rules, an inspect-mode worked example, and clarified mode-validation footnotes. 
Twelve new sub-tests cover inspect short-circuit + script-mode inspect + source_url-no-scrape + back-compat default + explicit-numeric-wins + JIT-metric-presence + the three intent rows + floor-clamp + unknown-intent-fallback + resolver precedence; 1078 builtin + podcast pkg tests pass with race detector clean.
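The heuristic table above reduces to "reading time times a ratio, clamped to a floor and ceiling". A minimal sketch, assuming the result stays a fractional minute count (whether the pack rounds is not stated); `suggestDurationMin` and `intentRow` are hypothetical names.

```go
package main

import (
	"fmt"
	"math"
)

// intentRow is one row of the heuristic table quoted in the entry above.
type intentRow struct {
	ratio, floorMin, ceilMin float64
}

var intentRows = map[string]intentRow{
	"summary":    {0.20, 1, 3},
	"thorough":   {0.50, 3, 8},
	"exhaustive": {0.90, 6, 12},
}

const wordsPerMinute = 150 // pacing constant from the entry above

// suggestDurationMin (hypothetical helper) converts a source word count into
// a target duration in minutes. ok is false for an unrecognized intent.
func suggestDurationMin(sourceWords int, intent string) (minutes float64, ok bool) {
	row, found := intentRows[intent]
	if !found {
		return 0, false
	}
	readingMin := float64(sourceWords) / wordsPerMinute
	target := readingMin * row.ratio
	return math.Min(math.Max(target, row.floorMin), row.ceilMin), true
}

func main() {
	// 3000 words is 20 minutes of reading; thorough halves that, then the ceiling applies.
	m, _ := suggestDurationMin(3000, "thorough")
	fmt.Println(m)
}
```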

Added

blog.rewrite_for_audience pack pilots the JIT length-sizing convention (#525 umbrella, #526 pilot). Calling agents no longer have to precompute a static word target in their AGENTS.md — the pack measures source_content, picks an output range from a declared length_intent (summary / thorough / exhaustive), and reports target_words_chosen / output_words / compression_ratio so the agent can see what scale it actually got. Heuristic table: summary → ratio 0.10, floor 300, ceiling 1200; thorough (default) → 0.30, 800, 2500; exhaustive → 0.55, 1500, 6000. The chosen range is injected into the system prompt as an explicit override of the persona's word-count guidance — without this the persona's "800-1200 words" silently out-voted a chosen exhaustive target of 3300-4400 and the JIT sizing had no visible effect. Two escape hatches alongside the intent path: explicit target_words_min + target_words_max (both must be set; partial falls through), and inspect:true which short-circuits before any dispatcher use to return a suggestion without spending a model call (gateway-less deployments can use this path). New truncated boolean on every generate-mode output: strong signal is finish_reason=length from the gateway; fallback heuristic fires when output is within 95% of the upper target bound AND ends without sentence-terminating punctuation, so providers that don't expose finish_reason (Ollama doesn't always) still surface silent truncation. The motivating failure mode: long-form source documents getting compressed below the agent's static target's lower bound — a generic shape, not a one-off. Eleven new sub-tests cover inspect short-circuit + dispatcher-less inspect + intent scaling across each row + floor/ceiling clamps + numeric-override precedence + partial-numeric fall-through + prompt-target injection + finish-reason truncation + mid-sentence-heuristic truncation + back-compat metric presence; race detector clean. 
Reference doc updated with the heuristic table, precedence rules, and an inspect-mode worked example. The same convention is opt-in for podcast.generate, hyperframes.compose, slides.narrate, content.ground, and research.deep as they're touched for unrelated work (no big-bang migration); per-pack adoption is tracked from the umbrella issue.
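The two-signal truncation check described above can be sketched as follows. This is an illustration under stated assumptions: `looksTruncated` is a hypothetical name, "within 95% of the upper target bound" is read as "at least 95% of the maximum word count", and the set of sentence-terminating characters (`.`, `!`, `?`) is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// looksTruncated (hypothetical helper) reports whether generated output was
// probably cut off. finishReason == "length" is the strong signal; otherwise
// the fallback fires when the output is near the upper word bound and does not
// end in sentence-terminating punctuation.
func looksTruncated(finishReason, output string, targetWordsMax int) bool {
	if finishReason == "length" {
		return true
	}
	words := len(strings.Fields(output))
	// Integer arithmetic avoids float rounding at the 95% boundary.
	if words*100 < targetWordsMax*95 {
		return false
	}
	trimmed := strings.TrimSpace(output)
	if trimmed == "" {
		return false
	}
	for _, end := range []string{".", "!", "?"} { // assumed terminator set
		if strings.HasSuffix(trimmed, end) {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(looksTruncated("stop", "A complete sentence.", 1000))
}
```

One limitation of this sketch: output that ends in a closing quote or bracket after the period would be flagged, so a real implementation may want a wider terminator set.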

[0.28.6] - 2026-06-16

Changed

content.ground pack closes four gaps from the 2026-06-15 audit. (A) Memory cache (ADR 047 compliance): pack now declares Memory: &packs.MemoryConfig{Cache: true, TTL: 24h, Category: "cache"}. Idempotent re-runs (same caller, same input bytes) get cached results instead of spending Firecrawl + LLM verify calls each time — matches the pattern github.go has used since v0.7. The TTL is 24 hours (vs github.go's 5-minute pattern) because source authority changes on a slow cadence. NOTE: the cache key is engine-derived from input bytes, so a typo-fix edit is still a miss — per-claim caching across edits would be a handler-internal layer, captured as audit follow-up. (B) Concurrent claim processing: per-claim Firecrawl search + LLM verify ran sequentially before, so a 12-claim post took 60-120s wall-clock even on a healthy stack. Refactored into three phases — Phase 1 fuzzy-locates findable claims (synchronous, fast), Phase 2 runs Firecrawl + verify under a bounded errgroup (SetLimit(4)), Phase 3 applies results to the document in original claim order. Patching stays sequential because each substitution can shift byte offsets of later claims; re-finding the span per-iteration handles that. Wall-clock drops to ~ceil(N/4)×(search+verify). (C) Fuzzy claim matching closes the silent-drop failure mode for Tier C extractors: the strict strings.Contains check dropped any claim whose text the LLM had normalized (double-space → single-space, soft-wrap newline → space, etc.) — even when the claim was real and the source was valid. New findClaimSpan helper: exact substring first (fast path preserves existing behavior for the 95% case), whitespace-tolerant scan on miss. Splices the citation after the doc's ORIGINAL bytes, not the LLM's normalized variant, so the patched file matches the original prose. 
Smart-quote / em-dash / Levenshtein folding intentionally deferred — whitespace is by far the most common normalization the extractor LLM applies, and broader fuzziness widens the false-positive surface. (D) ADR 051 verifier migration: the per-source verifier was the last content.ground caller still using the legacy extractFirstJSONObject fallback (the claim extractor had migrated earlier). Replaced with DecodeStructuredResponse, preserving the existing soft-degrade (parse failure → skip the claim, same as before). 7 new sub-tests cover the new helpers + memory cache declaration + a fuzzy-match end-to-end happy path; 2 existing tests updated to query-route their Firecrawl stubs (concurrent Phase 2 means non-deterministic call order); race detector clean.
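One way to get the "exact first, whitespace-tolerant on miss" behavior from item (C) is sketched below. The name `findClaimSpan` comes from the entry, but this body is only an assumption about how such a helper could work: it builds a regular expression that allows any whitespace run between the claim's words and returns offsets into the document's original bytes.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// findClaimSpan returns the [start, end) byte span of claim inside doc.
// Fast path: exact substring. Fallback: the same words separated by any
// run of whitespace. Offsets always refer to doc's original bytes.
func findClaimSpan(doc, claim string) (start, end int, ok bool) {
	if claim == "" {
		return 0, 0, false
	}
	if i := strings.Index(doc, claim); i >= 0 {
		return i, i + len(claim), true
	}
	words := strings.Fields(claim)
	if len(words) == 0 {
		return 0, 0, false
	}
	for i, w := range words {
		words[i] = regexp.QuoteMeta(w) // treat claim text literally
	}
	re, err := regexp.Compile(strings.Join(words, `\s+`))
	if err != nil {
		return 0, 0, false
	}
	loc := re.FindStringIndex(doc)
	if loc == nil {
		return 0, 0, false
	}
	return loc[0], loc[1], true
}

func main() {
	doc := "The kernel  verifies\nevery program before load."
	// The claim text has normalized whitespace, as an extractor LLM might emit it.
	_, end, ok := findClaimSpan(doc, "kernel verifies every program")
	if ok {
		// Splice the citation after the original bytes, not the normalized claim.
		fmt.Println(doc[:end] + " [1]" + doc[end:])
	}
}
```

Because the span is taken from the document itself, the double space and soft-wrap newline survive the splice unchanged.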

[0.28.5] - 2026-06-15

Fixed

podcast.generate's validation pass now actually finds the MP3 it's supposed to validate. The Concat helper used to rm -rf /tmp/helmdeck-podcast immediately after reading the final.mp3 bytes back to the control-plane process. podcast.generate's validation step (PR #515) then ran av-validate.sh --audio /tmp/helmdeck-podcast/final.mp3 a fraction of a second later, got exit 2 (file not found), soft-degraded into silent fallback (allow_silent_output:true's contract), and propagated audio_url="" through the entire scaffolded-narrated-video chain. Net effect was a 15-second silent MP4 that passed validation while consistency:audio_video_duration flagged "could not probe (arate= aframes=)": the validator was correctly reporting the bug; the bug was upstream. Empirically found 2026-06-15 chasing the v0.28.4 retest's 122922a5661bcb63-video.mp4 artifact, after av-validate.sh confirmed the file had vcodec=h264 acodec=<empty> and no audio stream at all. ElevenLabs credentials, TTS API call, and voice IDs were all healthy; the MP3 was generated, briefly written, and then deleted before validation could see it. Fix: drop the post-readback cleanup in Concat (lines 175-178). The session container's tmpfs is reclaimed when the session ends, and the next Concat call already runs rm -rf on the tempdir at its step 1 (line 84). Net: no leak across sessions, no leak between Concat calls within a session, AND the file stays available for the in-call validation pass. New TestConcat_DoesNotPostCleanupTempDir regression test pins the fix: it counts post-readback rm -rf calls and trips loudly if the cleanup ever sneaks back in. 1042 tests pass across pipelines + packs/builtin (up from 1041 with the new regression test).

[0.28.4] - 2026-06-15

Fixed

builtin.scaffolded-narrated-video pipeline now produces a narrated video that's actually narrated at the operator-controllable target length. Two related misses landed in v0.28.0's pipeline (#512) and surfaced empirically on the 2026-06-15 eBPF retest run, when the chain reached render successfully but produced a 9-second silent MP4 against an 11-minute generated narration. The pipeline (a) didn't thread podcast.generate's audio_url output to hyperframes.scaffold, so the scaffold used the upstream example's intrinsic 10-second data-duration and the rendered video had no audio track; and (b) didn't pass duration_target_min to podcast.generate at all, so the narration silently ran at podcast.generate's 8-minute internal default instead of the operator's expected 60-second social-first target. Both gaps are closed: hyperframes.scaffold gains an audio_url input that fetches + stages the bytes in-sidecar and passes --audio=<path> to hyperframes init (upstream then embeds the <audio> element and aligns data-duration to the audio length); the pipeline threads audio_url from podcast.generate.output.audio_url to scaffold, AND threads its own new duration_target_min input through to podcast.generate (default unset → 8-minute fallback; pass 1 for 60-second social-first per the old AGENTS.md convention; max 12 for long-form). 5 new sub-tests on scaffold cover audio_url empty / happy-path / 404 / 200-with-empty-body / oversize. Discovered architecturally because the prior 4 patch releases were all about Tier C model output variance; this one is the pipeline's first real "design miss" — composition gaps in the input plumbing the original PR didn't fully wire.
hyperframes.interpolate's content classifier now recognizes the decision-tree scaffold shape — <div class="node ...">, <div class="connector-label">, <div class="text-highlight">, and <span id="*-text">. Empirically found 2026-06-15 on the third eBPF retest after the v0.28.2 podcast-parser fix landed: hyperframes.scaffold succeeded, the agent's pipeline reached hyperframes.interpolate, but the pack rejected the scaffold with "no files in the scaffold matched a recognized content shape" because decision-tree's compositions/decision_tree.html uses sticky-note "node" boxes for its branching diagram (different element/class shape from swiss-grid's <h1>/<div class="stat-value"> patterns the classifier was originally calibrated against). The new patterns are word-boundary-anchored so existing swiss-grid / nyt-graph shapes still match. 8 new sub-tests cover the decision-tree node + connector-label + text-highlight + span-id-suffix-text shapes and the multi-class attribute preservation under splice. The known false-positive risk (\bnode\b matches inside compound class names like tree-node because - is non-word) is pinned by a test so a future tightening trips loudly.
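The false-positive risk noted above follows directly from how `\b` works: a hyphen is a non-word character, so a word boundary exists between `-` and `n`. The sketch below reduces the classifier to a bare class-attribute check to show this; `classMatchesNode` and `classHasToken` are hypothetical names, and the token-based variant is just one possible tightening, not something the release ships.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// nodeWord is the word-boundary-anchored pattern named in the entry above.
var nodeWord = regexp.MustCompile(`\bnode\b`)

// classMatchesNode (hypothetical) applies the word-boundary pattern to a
// class attribute value.
func classMatchesNode(classAttr string) bool {
	return nodeWord.MatchString(classAttr)
}

// classHasToken (hypothetical, stricter) only matches a whole
// whitespace-separated class token.
func classHasToken(classAttr, token string) bool {
	for _, c := range strings.Fields(classAttr) {
		if c == token {
			return true
		}
	}
	return false
}

func main() {
	for _, attr := range []string{"node sticky", "tree-node", "nodes"} {
		fmt.Printf("%-12q word-boundary=%v token=%v\n",
			attr, classMatchesNode(attr), classHasToken(attr, "node"))
	}
}
```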

[0.28.2] - 2026-06-15

Fixed

podcast.generate's script parser now also accepts multiple bare JSON objects in sequence (JSONL, whitespace-separated, or comma-separated {...}{...}{...} without [...] array brackets). Empirically found 2026-06-14 on the SECOND eBPF retest after the bare-single-object fix (v0.28.1) landed — gpt-oss-120b:free emitted ~10 sequential {"speaker":"Host","text":"..."} turns as JSONL instead of one array. The new Fallback C normalizes the }<whitespace and optional comma>{ boundary between sibling objects to },{ via regex and wraps the result in [ ... ] so the strict array parser succeeds. The earlier single-object fallback still fires on actual one-turn responses; well-formed array responses still take the fast path. Four new sub-tests cover JSONL / comma-separated / fenced multi-object / multi-object-with-preamble variants. Error message refined to "no JSON array, single object, or sequence of objects found in response" so the final failure mode is unambiguous.
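The boundary normalization behind Fallback C can be sketched in a few lines. This is an illustration only: `parseObjectSequence` is a hypothetical name, and the `Turn` fields are inferred from the `{"speaker":...,"text":...}` shape quoted above. Note that a plain regex also rewrites a `} {` sequence that happens to sit inside a text value, which changes whitespace in that value but keeps the JSON valid.

```go
package main

import (
	"encoding/json"
	"fmt"
	"regexp"
	"strings"
)

// Turn mirrors the object shape quoted in the entry (fields are inferred).
type Turn struct {
	Speaker string `json:"speaker"`
	Text    string `json:"text"`
}

// siblingBoundary matches "}" + optional whitespace/comma + "{" between
// sibling objects.
var siblingBoundary = regexp.MustCompile(`\}\s*,?\s*\{`)

// parseObjectSequence (hypothetical helper) turns JSONL, whitespace-separated,
// or comma-separated objects into one array and parses it strictly.
func parseObjectSequence(raw string) ([]Turn, error) {
	normalized := siblingBoundary.ReplaceAllString(strings.TrimSpace(raw), "},{")
	var turns []Turn
	if err := json.Unmarshal([]byte("["+normalized+"]"), &turns); err != nil {
		return nil, fmt.Errorf("sequence of objects did not parse: %w", err)
	}
	return turns, nil
}

func main() {
	raw := `{"speaker":"Host","text":"Welcome."}` + "\n" + `{"speaker":"Guest","text":"Thanks."}`
	turns, err := parseObjectSequence(raw)
	fmt.Println(len(turns), err)
}
```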

[0.28.1] - 2026-06-14

Fixed

podcast.generate's script parser now accepts a bare single JSON object ({"speaker":"...","text":"..."}) as a valid one-turn script, not just a [...] array. Closes a Tier C failure mode found empirically 2026-06-14 running openai/gpt-oss-120b:free through the new builtin.scaffolded-narrated-video pipeline: the model emitted one object instead of an array (semantically a valid one-turn script, just missing the array wrapping), and the parser returned "no JSON array found in response". Three new sub-tests cover the bare-object fallback (raw / with prose preamble / inside a json code fence). No behavior change for the array path; existing tests pass unchanged.
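A minimal sketch of an "array first, bare object second" parser follows. It is not the pack's implementation: `parseScript` and `between` are hypothetical names, the `Turn` fields are inferred from the quoted object shape, and locating the payload by first and last delimiter is an assumed way of tolerating a prose preamble or a code fence.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

// Turn mirrors the object shape quoted in the entry (fields are inferred).
type Turn struct {
	Speaker string `json:"speaker"`
	Text    string `json:"text"`
}

// between returns the substring from the first opener to the last closer.
func between(s string, opener, closer byte) (string, bool) {
	i := strings.IndexByte(s, opener)
	j := strings.LastIndexByte(s, closer)
	if i < 0 || j <= i {
		return "", false
	}
	return s[i : j+1], true
}

// parseScript (hypothetical helper) tries the array shape first and falls
// back to treating a single object as a one-turn script.
func parseScript(raw string) ([]Turn, error) {
	if arr, ok := between(raw, '[', ']'); ok {
		var turns []Turn
		if err := json.Unmarshal([]byte(arr), &turns); err == nil && len(turns) > 0 {
			return turns, nil
		}
	}
	if obj, ok := between(raw, '{', '}'); ok {
		var one Turn
		if err := json.Unmarshal([]byte(obj), &one); err == nil && one.Text != "" {
			return []Turn{one}, nil
		}
	}
	return nil, errors.New("no JSON array or single object found in response")
}

func main() {
	turns, err := parseScript("Here is the script:\n" + `{"speaker":"Host","text":"Hello."}`)
	fmt.Println(len(turns), err)
}
```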

[0.28.0] - 2026-06-14

The scaffold-mode video release. Closes the architectural arc surfaced empirically by the morning's 🎬 concept-animator retest against openai/gpt-oss-120b:free (rendered MP4 was structurally correct but visually flat — text on a black background, because asking a Tier C model to invent HTML/CSS/GSAP from scratch asks it to do the one thing Tier C reliably can't). Same evening, the architecture is rebuilt to borrow visual creativity from upstream's 140+ example catalog: the LLM's job becomes content interpolation, not design invention. Two original assumptions (#503 Path A: stitched HTML; first cut of #503 Path B: scaffold-mode in hyperframes.compose) were both surfaced + discarded mid-implementation when empirical scaffold inspection revealed multi-file structure (sub-compositions referenced by data-composition-src paths, JS TRANSCRIPT arrays in captions.html, A-roll slot in index.html) — the right shape was a 4-pack family matching helmdeck's existing decomposition pattern. Seven PRs over a single 2026-06-14 evening session: #506 ships scripts/hyperframes-init.sh inside helmdeck-sidecar-hyperframes plus a new CONTRIBUTING.md principle "prefer the upstream CLI over custom Go" (saved as [[feedback-upstream-cli-takes-precedence]] for future pack design); #507 pivots the script's output contract from "stitched HTML" to "gzipped project tarball" before any caller depends on it; #508 gives hyperframes.render a new project_artifact_key input alongside the existing composition_html (mutually exclusive, fully backward-compatible) so it consumes the project-shape upstream natively expects; #509 ships the new hyperframes.scaffold pack (picks an upstream --example, returns project_artifact_key + editable_slots manifest); #510 ships hyperframes.interpolate (pure-Go in-process tarball manipulation, per-file LLM rewriting for HTML text slots + JS TRANSCRIPT, tier-aware prompts, soft-degrade on per-file failure); #511 ships hyperframes.attach_asset (content-addressed asset 
embedding for A-roll image/video, videos emit muted per upstream convention, URL fetch deferred); #512 ships builtin.scaffolded-narrated-video — the sibling pipeline to builtin.prompt-narrated-video that wires podcast.generate → hyperframes.scaffold → hyperframes.interpolate → hyperframes.render. The 2026-06-14 blog post When agent-instruction docs drift from upstream spec (upstream-spec-drift) released yesterday tells the docs-layer prologue to this story; today is the implementation-layer chapter.

Operator upgrade: clean — no schema migrations, no removed packs, no breaking input changes. The additions:

Three new packs (hyperframes.scaffold v1, hyperframes.interpolate v1, hyperframes.attach_asset v1) — additive; existing hyperframes.compose freeform mode untouched and continues to work for callers who want raw HTML control.
One new pipeline (builtin.scaffolded-narrated-video) — additive; the existing builtin.prompt-narrated-video continues to work unchanged. Pipeline count is now 22.
hyperframes.render gains a project_artifact_key input alongside the existing composition_html (mutually exclusive; pass exactly one). Existing pipelines / callers passing composition_html see no behavior change.
helmdeck-sidecar-hyperframes bumped to HYPERFRAMES_VERSION=0.6.97 (was 0.6.7) — auto-pulled on first use of the new packs. Sidecar image is auto-rebuilt on main pushes and already shipped to GHCR.
New CONTRIBUTING.md principle "Prefer the upstream CLI over custom Go" (item 7 of "What makes a good pack") documents the architectural lesson for future pack contributions.

For Tier-C-targeted agents (gpt-oss-120b:free, gemma, smaller open-weight): update your AGENTS.md to call builtin.scaffolded-narrated-video (provide description + example) instead of builtin.prompt-narrated-video. The scaffolded pipeline borrows upstream's polished visuals so the model only does content interpolation — visually-rich output reliably, where the freeform compose path collapses to text-on-black. Common example picks by intent: swiss-grid (general explainer), decision-tree (flow diagrams + traces), code-snippet-dark-modern (technical content), kinetic-type (typography focus), nyt-graph (data viz). For Tier-A agents (Claude Sonnet/Opus, GPT-4-class): the freeform builtin.prompt-narrated-video path is still the right tool — the model authors HTML from scratch with full creative control.

Added

scripts/hyperframes-init.sh and a CONTRIBUTING.md "prefer the upstream CLI over custom Go" principle, executing the first half of the architectural refinement on #503. The script wraps hyperframes init --example=<x> inside helmdeck-sidecar-hyperframes and emits a gzipped tarball of the scaffolded project directory; it's the session-exec target the upcoming hyperframes.compose scaffold-mode change will call via ec.Exec, matching the av-validate.sh / hyperframes_render.go:276 pattern. Empirically grounded: the 140+ example catalog enumerated via hyperframes init's registry is the upstream-authoritative source of visual creativity — Tier C models (gpt-oss-120b:free, gemma) will only need to do content interpolation, not design invention. No caller changes yet; the script is dormant until subsequent PRs wire the compose handler to invoke it.
helmdeck-sidecar-hyperframes pins HYPERFRAMES_VERSION=0.6.97 (was 0.6.7) — the upstream renamed --template to --example and added --non-interactive "for CI/agents" between those versions, both of which the new script depends on. Image smoke test now also asserts hyperframes init --help succeeds and /usr/local/bin/hyperframes-init.sh is executable.
hyperframes.render gains a project_artifact_key input field alongside the existing composition_html (mutually exclusive — pass exactly one). When provided, render downloads the gzipped tarball from the artifact store, extracts it under /tmp/helmdeck-hf/, and runs hyperframes render <project-dir> against the multi-file scaffold the framework natively expects (index.html + compositions/*.html + assets/ + hyperframes.json). This is the consumer side of #503's Path B refactor — paired with the new hyperframes.scaffold pack (below) and upcoming interpolate / attach_asset packs, which produce the tarball this pack consumes. Backward-compatible: existing callers passing composition_html continue to work unchanged. Schema, error-mapping, and the 17 existing tests are untouched; 7 new tests cover both inputs missing/both set/happy-path/store-miss/tar-extract-fail/missing-index/empty-artifact.
New builtin.scaffolded-narrated-video pipeline — ties the four scaffold-mode packs together: podcast.generate (narration) → hyperframes.scaffold (picks an upstream example like swiss-grid / decision-tree / code-snippet-dark-modern / kinetic-type / nyt-graph / tiktok-follow — 140+ in the catalog) → hyperframes.interpolate (LLM rewrites visible text + caption transcript to fit the topic) → hyperframes.render (project tarball → MP4). Sibling to builtin.prompt-narrated-video — same narration + render halves, different compose strategy. prompt-narrated-video asks the LLM to author HTML from scratch (great on Tier A, visually-flat on Tier C); scaffolded-narrated-video borrows upstream's polished examples so Tier C produces visually-rich output reliably. Inputs: description + example (both required), resolution + aspect_ratio (optional, threaded to scaffold + render). For an A-roll image, chain image.generate + hyperframes.attach_asset between interpolate and render manually — the pipeline doesn't automate this (no conditional-step support in v1). Closes the four-pack #503 Option C refactor — see issue #503 for the full architectural arc from Path A (stitched HTML) → Path B (project artifact) → Option C (4-pack split + pipeline).
New hyperframes.attach_asset pack — third (optional) link in the scaffold-based video pipeline. Takes a project_artifact_key (from scaffold or interpolate) + an asset_artifact_key (from image.generate, stock.search, or any pack that uploaded an image/video to the artifact store), embeds the asset bytes at assets/aroll-<sha256-prefix>.<ext> in the project tarball, and modifies index.html to reference the asset from the target div (default #short_mag_cut_frame, matching upstream's canonical A-roll slot id). Returns a new project_artifact_key ready for hyperframes.render. Supports image/{png,jpeg,gif,webp,svg+xml} and video/{mp4,webm,quicktime} content types (50 MiB cap). Videos are emitted with muted per upstream's AGENTS.md convention. Asset filenames are content-addressed so identical asset bytes produce the same path — convenient for dedup across chained pipelines. Pure-Go in-process (like interpolate): no SessionSpec, no dispatcher, just the artifact store. URL fetching is intentionally not supported in v1 — chain http.fetch upstream if your asset is URL-only; keeps the pack focused. 18 tests cover input validation, store-miss / empty / oversize / unsupported-type rejection, missing-index, target-not-found, image happy-path, video happy-path with muted assertion, custom target_id, leading-# canonicalization, content-addressed filename dedup, and spliceAssetIntoTarget unit (image / video / no-match / preserves-div-attrs).
New hyperframes.interpolate pack — second link in the scaffold-based video pipeline. Takes a project_artifact_key (from hyperframes.scaffold) plus a user description + model, runs LLM passes per compositions/*.html file to rewrite the visible text content so it fits the topic, re-uploads the modified project as a new project_artifact_key. Auto-detects two content shapes per file: HTML text slots (<h1>, <h2>, <h3>, <div class="stat-value">, <div class="stat-label">) get on-topic text substituted via a numbered-slots LLM format, and the JS TRANSCRIPT word array (in captions.html) gets regenerated with timing aligned to duration_seconds at a 150 wpm cadence. Other files pass through unchanged. Pure-Go in-process tarball manipulation (archive/tar + compress/gzip) — no SessionSpec, no ec.Exec, just dispatcher + artifact store. Soft-degrades on per-file LLM failure (skipped files are surfaced in files_skipped; the whole call only fails when ZERO files got rewritten). Tier-aware prompts via llmcontext.BudgetFor. 23 tests cover input validation, content classification (transcript vs text-slots vs unknown), text-slot extract/splice round-trip, numbered-slot parsing (strict / out-of-order / extras), transcript parsing (strict JSON / lenient JS-keys / empty rejection), tarball roundtrip, the happy-path multi-file rewrite end-to-end, and the no-recognized-shape rejection path.
New hyperframes.scaffold pack — first link in the scaffold-based video pipeline. Picks one of upstream HyperFrames' 140+ pre-built examples (swiss-grid, decision-tree, code-snippet-dark-modern, kinetic-type, vignelli, tiktok-follow, etc.), runs hyperframes init --example=<name> inside helmdeck-sidecar-hyperframes, uploads the resulting project tarball to the artifact store, and returns a project_artifact_key plus an editable_slots manifest naming which compositions/*.html files the upcoming hyperframes.interpolate pack will rewrite. This is the first concrete pack from #503's Option C architectural decision: instead of folding scaffold-mode into hyperframes.compose (creating a multi-headed pack with split output schemas), the scaffold-based path becomes its own family of small composable packs — scaffold → interpolate → attach_asset → render — matching helmdeck's existing pattern (slides.outline + slides.render + slides.narrate, podcast.generate + image.generate + stock.search). hyperframes.compose freeform mode stays untouched for callers who want full HTML control. 15 tests cover input validation, resolution/aspect-ratio matrix, script exit-code mapping (caller-fix vs handler-failed), tarball-shape edge cases (empty / cat failure / leading-./ / directory entries), and artifact upload round-trip.
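The content-addressed asset naming described for hyperframes.attach_asset above (assets/aroll-<sha256-prefix>.<ext>) can be illustrated with the standard library. This is a sketch, not the pack's code: `assetPath` is a hypothetical name, and the 12-character prefix length is an assumption because the entry does not state how much of the digest is kept.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// prefixLen is an assumed digest prefix length; the changelog does not give one.
const prefixLen = 12

// assetPath (hypothetical helper) derives a tarball path from the asset bytes,
// so identical bytes always map to the same path.
func assetPath(data []byte, ext string) string {
	sum := sha256.Sum256(data)
	return fmt.Sprintf("assets/aroll-%s.%s", hex.EncodeToString(sum[:])[:prefixLen], ext)
}

func main() {
	a := assetPath([]byte("same bytes"), "png")
	b := assetPath([]byte("same bytes"), "png")
	fmt.Println(a == b) // identical content, identical path
}
```

Because the path depends only on the bytes and extension, attaching the same asset twice in a chained pipeline overwrites one entry instead of adding a duplicate.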

Changed

scripts/hyperframes-init.sh switched its output contract from "emit stitched composition HTML" to "emit a gzipped tarball of the scaffolded project directory" before any caller depended on it. The empirical hyperframes init scaffold is a multi-file project — index.html, compositions/*.html (with the caption-transcript word-timing array), assets/, hyperframes.json, package.json — and the sub-compositions are referenced by data-composition-src paths, not inlinable into a single HTML blob. This is the Path B branch of issue #503's plan: hyperframes.render will gain a project_artifact_key input that consumes the tarball natively (next PR in the chain). Bonus: the LLM content-interpolation step (PR 4) now operates on the structured TRANSCRIPT array in compositions/captions.html, not on regex-extracted HTML slots — a richer surface for word-level timing-preserving rewrites.

[0.27.1] - 2026-06-14

The video-pack hardening release. Closes the concept-animator empirical arc that began with PR #497's gpt-oss-120b:free recipes and ran through five follow-on PRs (#499, #500, #501, #502, #504) in a single 2026-06-14 session, driven by an empirical first run against openai/gpt-oss-120b:free. Four pack-level changes on hyperframes.compose close foot-guns the session surfaced: (1) audio_url now requires explicit duration_seconds (#498 → #499, silent-truncation bug closed); (2) duration-band-aware engagement metadata generation mirrors podcast.generate's metadata_model pattern (#500, short_form / mid_form / long_form payload shapes); (3) blank-screen guard via timeline-coverage validation and tier-aware system prompts (#502); (4) track-index collision check + upstream-sourced rule rewrites + comprehensive integration guide derived from the actual upstream HyperFrames AGENTS.md / SKILL.md / hyperframes-student-kit (#504, after the #502 best-practices doc turned out to be synthesis-without-citation; the lesson is captured in the new 2026-06-14 blog post When agent-instruction docs drift from upstream spec). Companion: gpt-oss-120b:free concept-animator + slide-narrator recipes (#497, updated in #501 to drive the new pack capabilities) demonstrate end-to-end free-tier video chains. Strategic-direction issue #503 proposes a template.fetch pack to surface upstream reference repositories as composition seeds for future releases. SEO Tier 1 + Tier 2 discoverability pass (#494, #495) addresses the GSC "Discovered – currently not indexed" bucket at the docs-site layer. HuggingFace platform epic Phase 4 expanded with the consume/publish Track A/B split (#490, companion blog post promoted from draft).

Operator upgrade: clean — no schema migrations, no breaking input changes, no removed packs. Three opt-in additions:

metadata_model field on hyperframes.compose (string-ptr; default openrouter/auto; pass "" to disable; pin a free model for end-to-end free-tier discipline)
Composition timeline-coverage + track-index collision pre-checks (reject at compose-time with the upstream rule cited; existing recipes that already used the upstream patterns are unaffected)
Tier-aware system prompt selection via llmcontext.BudgetFor(model) — Tier C gets the verbose verbatim-rule prompt; Tier A/B gets a lean prompt referencing the best-practices guide

Operators driving the concept-animator recipe should refresh their AGENTS.md from the updated docs/howto/per-model-agents/gpt-oss-120b-concept-animator.md to pick up the speakers map default, free-tier metadata_model pinning on both podcast.generate and hyperframes.compose, and the engagement payload surfacing in OUTPUT FORMAT.

HuggingFace epic (#490) Phase 4 expanded: Track A (consume via hf-space-invoke) + Track B (publish via hf-space-create / update / delete trio). Track B framed around operator self-service — any helmdeck workflow becomes a hosted UI under the operator's HF account — with scoped tokens, default-private semantics, per-deployment consent flow, quota caps, and mandatory delete pairing as the security envelope.
Companion blog post HuggingFace isn't just another LLM router — it's a platform helmdeck barely uses promoted from draft: true after the Phase 4 expansion synced the post's framing with the epic.
SEO discoverability pass addressing the GSC "Discovered – currently not indexed" bucket (validation started 2026-05-13, ~61 pages stuck). Adds description: frontmatter to 73 docs (50 ADRs via bulk script + 23 hand-crafted non-ADR pages), de-orphans previously zero-inbound ADRs via ## Related ADRs sections in 5 hub docs (PACKS.md, integrations/SKILLS.md, integrations/openclaw.md, howto/multi-model-recovery.md, RELEASES.md), adds OpenGraph site-wide defaults (og:type, og:site_name, og:locale) in docusaurus.config.ts, adds per-page Article + BreadcrumbList JSON-LD via theme swizzles at src/theme/BlogPostPage/ and src/theme/DocBreadcrumbs/, and bumps sitemap default priority 0.5 → 0.6 for /adrs/*, /reference/*, /howto/*. Tier 3 (manual GSC "Validate Fix" re-click + per-URL inspection submits) remains an operator action.
Tier 2 SEO follow-on: per-blog-post OpenGraph cards + homepage "Recently shipped" section. New website/scripts/generate-og-cards.mjs renders 1200×630 PNG cards via @resvg/resvg-js (manual SVG template — no JSX/satori dep) with helmdeck branding + wrapped title + tags + date; 25 non-draft posts get unique cards under static/img/og/<slug>.png and their frontmatter image: field updated. New website/scripts/generate-recent-data.mjs runs at build time, writes top 8 recent posts to src/data/recent.json; homepage renders them in a new card grid below the Diátaxis quadrants — gives newest content the highest-PageRank inbound link the site has to offer. Generator documented as npm run og:generate (ad-hoc; not on CI to avoid the heavier toolchain).
Two new per-model agent recipes for openai/gpt-oss-120b:free covering video workflows (#496): docs/howto/per-model-agents/gpt-oss-120b-concept-animator.md drives a 5-call podcast.generate → hyperframes.compose → hyperframes.render → av.validate → artifact.verify_manifest chain, with the AGENTS.md template hardened against several empirical foot-guns observed in a 2026-06-13 first-run session: the required speakers: {Narrator: "21m00Tcm4TlvDq8ikWAM"} map on podcast.generate (omitting it triggers an infinite retry loop), model + metadata_model pinning to openrouter/openai/gpt-oss-120b:free on podcast.generate plus model on hyperframes.compose (end-to-end free tier — the default metadata_model: "openrouter/auto" would route engagement metadata to PAID), and the duration_seconds data-flow constraint matching podcast.generate's duration_s (without this the compose pack used to silently truncate the rendered MP4 to 8s — fix in #498 makes the pack reject audio_url without an explicit duration_seconds going forward). The companion docs/howto/per-model-agents/gpt-oss-120b-slide-narrator.md drives a single helmdeck__pipeline-run call selecting the right builtin.research-narrate / builtin.grounded-narrate / builtin.repo-presentation pipeline by input type. Both recipes use the sanitized Maya security-researcher persona, embed AGENTS.md templates in the Objectives + Constraints + Success-Criteria-as-Invalidation-Rules style the gpt-oss profile prefers, and document the helmdeck-trace extract command for capturing empirical community_traces[] entries in a follow-on PR.
Bug fix: hyperframes.compose now rejects calls that provide audio_url without an explicit positive duration_seconds (#498). The previous behavior defaulted duration_seconds to 8s — which is correct for silent micro-animations but silently truncated narration tracks longer than 8 seconds in chained podcast.generate → hyperframes.compose → hyperframes.render workflows. The 8s default still applies to genuinely-silent compositions; the new validation only fires when audio_url is non-empty. Reference doc (docs/reference/packs/hyperframes/compose.md) updated to mark duration_seconds as conditional-required when audio_url is set. Empirical repro from a 2026-06-13 session driving the concept-animator recipe (PR #497) against openai/gpt-oss-120b:free: an 88.58s podcast became an 8s video, with av.validate's consistency:audio_video_duration check passing trivially (both clipped together).
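The validation rule in the entry above can be sketched as follows. This is a minimal illustration, not the pack's actual code: the helper name `resolve_duration` and the use of `ValueError` are hypothetical, and the handling of a non-positive duration on silent compositions is an assumption. Only the 8s silent default and the "audio_url requires an explicit positive duration_seconds" condition come from the entry.

```python
from typing import Optional

SILENT_DEFAULT_SECONDS = 8.0  # default for silent compositions, per the entry above


def resolve_duration(audio_url: str, duration_seconds: Optional[float]) -> float:
    """Return the duration to render, or raise if narration could be truncated.

    Hypothetical helper: the real pack's function names and error codes differ.
    """
    if audio_url:
        # Narrated composition: never fall back to the 8s default.
        if duration_seconds is None or duration_seconds <= 0:
            raise ValueError(
                "duration_seconds must be an explicit positive value when audio_url is set"
            )
        return float(duration_seconds)
    # Silent composition: the 8s default still applies.
    if duration_seconds is None:
        return SILENT_DEFAULT_SECONDS
    if duration_seconds <= 0:
        # Assumption: non-positive values are rejected here too.
        raise ValueError("duration_seconds must be positive")
    return float(duration_seconds)
```

With this shape, the 88.58s podcast from the repro passes through unchanged when the caller supplies its duration, and fails loudly when the caller omits it.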
New: hyperframes.compose gains opt-in engagement-metadata generation mirroring podcast.generate's metadata_model pattern. A string-ptr-shaped metadata_model input (default openrouter/auto; "" opts out; any model id pins to that model) triggers a second gateway LLM call after composition success that produces a duration-band-aware engagement payload: short_form shape (<60s; title / hook / hashtags / caption / thumbnail_prompt for TikTok / Shorts / Reels), mid_form (60–179s; adds social_blurb for Twitter / LinkedIn-native), or long_form (≥180s; adds YouTube-shaped description / chapters / tags / hook_30s / category). The payload lands as the new engagement output object and engagement_artifact_key (stable key to a JSON sidecar at hyperframes.compose/engagement.json). Generation failures soft-degrade: the composition still succeeds, the engagement field is just absent. Reference doc (docs/reference/packs/hyperframes/compose.md) gains an Engagement metadata section with the per-band shape table. Empirical motivation: the 2026-06-13 concept-animator session produced an 88-second narrated MP4 that had no accompanying title / hashtags / thumbnail prompt — operators had to hand-author all of them. Mirrors the existing podcast.generate and slides.narrate engagement patterns rather than introducing a new shape.
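A minimal sketch of the two decisions described in the entry above: resolving the pointer-shaped metadata_model input and picking the engagement band from the composition duration. The function names are hypothetical, and treating fractional durations between 179s and 180s as mid_form is an assumption; the band thresholds and the default / opt-out semantics come from the entry.

```python
from typing import Optional

DEFAULT_METADATA_MODEL = "openrouter/auto"


def resolve_metadata_model(metadata_model: Optional[str]) -> Optional[str]:
    """Map the input to the model to call, or None when generation is opted out."""
    if metadata_model is None:
        return DEFAULT_METADATA_MODEL  # field omitted: use the default
    if metadata_model == "":
        return None  # explicit empty string: opt out
    return metadata_model  # any other value pins that model id


def engagement_format(duration_seconds: float) -> str:
    """Pick the engagement payload shape from the composition duration."""
    if duration_seconds < 60:
        return "short_form"
    if duration_seconds < 180:  # assumption: 179.x seconds still counts as mid_form
        return "mid_form"
    return "long_form"
```

Under these rules the 88-second narrated MP4 from the motivating session would receive a mid_form payload.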
Concept-animator howto (docs/howto/per-model-agents/gpt-oss-120b-concept-animator.md) updated to drive the new PR #500 capability: AGENTS.md template now passes metadata_model: "openrouter/openai/gpt-oss-120b:free" to hyperframes.compose (keeping engagement gen on the free tier) and the invalidation rules require it. OUTPUT FORMAT section adds the engagement.format/title/hashtags/thumbnail_prompt + engagement_artifact_key surfacing requirements (with the YouTube-shaped extras when long_form). "What to capture" metrics table adds engagement_payload_surfaced + engagement_format_correct, and cost_discipline_observed now checks four model fields instead of three.
Blank-screen guard + tier-aware system prompt on hyperframes.compose. Closes a quality gap surfaced by the 2026-06-13 concept-animator session, where the rendered 8-second MP4 hit a 2+ second black run that av.validate warned on but the chain didn't surface as a failure. Two changes: (1) the pack now inspects the composition's class="clip" element intervals at compose-time and rejects (CodeInvalidInput) when their union leaves a gap longer than min(2.0s, duration * 0.05) — the gap range and suggested fix (add a permanent background element) are cited in the error message. (2) The system prompt is now tier-aware via the existing llmcontext.BudgetFor(model) registry: Tier C (free / weak open models) gets a constraint-heavy compact prompt with the timeline-coverage rule inlined verbatim; Tier A/B gets a leaner prompt that trusts the model and references the new HyperFrames composition best practices guide. The best-practices doc covers visual hierarchy (one focal element per ~3s), type-on-screen rules (≥60px, ≥1.5s read time), pacing, color choices, GSAP transition patterns that play well, audio-aware composition for narrated chains, and a common-failure-modes table. Reference doc (docs/reference/packs/hyperframes/compose.md) gains "Timeline coverage" and "Tier-aware system prompt" sections.
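The timeline-coverage check in the entry above can be sketched as an interval-union scan. This is an illustrative sketch, not the pack's implementation: `find_gaps` and `max_gap_allowed` are hypothetical names, and counting leading and trailing gaps against the same threshold is an assumption. The threshold min(2.0s, duration * 0.05) is taken from the entry.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start, end) in seconds


def max_gap_allowed(duration: float) -> float:
    """Longest uncovered run tolerated before the composition is rejected."""
    return min(2.0, duration * 0.05)


def find_gaps(clips: List[Interval], duration: float) -> List[Interval]:
    """Return uncovered ranges of [0, duration] longer than the allowed gap."""
    limit = max_gap_allowed(duration)
    gaps: List[Interval] = []
    cursor = 0.0  # end of the covered region so far
    for start, end in sorted(clips):
        start, end = min(start, duration), min(end, duration)
        if start - cursor > limit:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    if duration - cursor > limit:
        gaps.append((cursor, duration))
    return gaps
```

For an 8-second composition the threshold works out to 0.4s, so a 2+ second black run like the one in the motivating session is reported as a gap. A permanent background clip spanning the full duration yields no gaps, which is the suggested fix the entry mentions.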
Upstream-spec alignment on hyperframes.compose (PR #504) after the PR #502 best-practices guide turned out to be largely synthesis-without-citation. Three coupled changes: (1) new pack-side validation composeTrackCollision that rejects compositions where two class="clip" elements share an integer data-track-index AND temporally overlap — this is an upstream HyperFrames hard rule per the actual AGENTS.md (track-index is a non-linear-editor row index, NOT a CSS z-index; spatial layering happens via CSS z-index entirely separately). (2) Both Tier C and Tier A/B system prompts rewritten with upstream-sourced rules verbatim — layout-first pattern (write the static hero frame in flex/gap/padding before any GSAP), gsap.from()/tl.to() entrance-exit convention, track-index temporal-exclusion rule, audio data-volume is immutable (volume tweens silently ignored), DETERMINISTIC ONLY with PRNG seeding option. (3) Best-practices doc (docs/reference/packs/hyperframes/best-practices.md) completely rewritten as a helmdeck integration guide for the upstream HyperFrames project — cites the upstream AGENTS.md / SKILL.md / hyperframes-student-kit throughout; covers the seven-step pipeline (Capture → Design → Script → Storyboard → VO+Timing → Build → Validate), the full attribute vocabulary (data-media-start, data-composition-src, data-variable-values, data-layout-allow-overflow, data-layout-ignore), the upstream reference template catalog (warm-grain, swiss-grid, play-mode, vignelli, product-promo, nyt-graph, decision-tree, kinetic-type), WebGL shader transitions with optimal duration ranges, audio-reactive pre-extracted FFT pattern, ARM64 deployment escape hatch (PRODUCER_FORCE_SCREENSHOT=true), and React migration constraints — and explicitly marks helmdeck-specific guidance separately. Companion blog post When agent-instruction docs drift from upstream spec (2026-06-14, draft) captures the epistemic lesson. 
Companion issue #503 proposes a template.fetch pack to surface upstream reference repositories as composition seeds.
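The track-collision rule from this entry (two clips on the same integer track index may not overlap in time) can be sketched as a pairwise check. The `Clip` type and `track_collisions` function are hypothetical names for illustration, and treating clips that merely touch at an endpoint as non-overlapping is an assumption; the real composeTrackCollision validation may differ.

```python
from itertools import combinations
from typing import List, NamedTuple, Tuple


class Clip(NamedTuple):
    clip_id: str
    track_index: int  # editor row index, not a CSS z-index
    start: float      # seconds
    end: float        # seconds


def track_collisions(clips: List[Clip]) -> List[Tuple[str, str]]:
    """Return pairs of clip ids that share a track and overlap in time."""
    collisions = []
    for a, b in combinations(clips, 2):
        same_track = a.track_index == b.track_index
        # Half-open comparison: back-to-back clips (a.end == b.start) are allowed.
        overlaps = a.start < b.end and b.start < a.end
        if same_track and overlaps:
            collisions.append((a.clip_id, b.clip_id))
    return collisions
```

A background clip on its own track never collides with foreground clips, regardless of how long it runs, because the check only compares clips within the same track.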

[0.27.0] - 2026-06-10

The per-model profiles + audit-callback release. Closes a major arc: empirically validated that per-use-case AGENTS.md hardening is the load-bearing layer for reliable agentic behavior on Tier C models (PR #481 → PR #484 Nemotron baseline-vs-hardened A/B: 24 calls / 0 deposit → 7 calls / deposit + verify with all_present: true). The 5-profile prompting library ships (#464 Phase 1: gpt-oss-120b, gemma-4-26b-a4b-it, llama-3.3-70b, nemotron-3-super-120b-a12b, qwen3-coder), with the multi-provider YAML schema accepting non-OpenRouter routes (huggingface / together / groq / cerebras / sambanova / custom) and the first HF Inference Providers template as a community-contribution starting point per #482. The audit-callback pattern (#461) gets its anchor pack with artifact.verify_manifest; the typed artifact store (artifact.put / .get / .list) replaces prose-instruction deposit guidance that Tier C models silently ignore. Companion infrastructure: helmdeck-trace CLI for community_traces[] extraction, configure-openclaw.sh canonical 4-file workspace seeds, personalize-an-openclaw-agent howto, canonical file roles section in integrations/openclaw.md §5d, audit + persona-leak fix of skills/helmdeck/SKILL.md. Catalog grows from 53 → 57: four new artifact packs. Strategic direction: HuggingFace integration epic (#490) frames six phases, five of which go beyond the routing layer (Datasets, Embeddings, Spaces, Tokenizers, Self-hosted runtime patterns).

New packs: artifact.put / artifact.get / artifact.list (PR #450) + artifact.verify_manifest (PR #462, audit-callback anchor) — all available as helmdeck__artifact-* MCP tools, no AI gateway required.

Operator upgrade: clean — no schema migrations, no breaking input changes, no removed packs. New provider: union for models/*.yaml is backwards-compatible (existing provider: openrouter files need no changes; convention for new files going forward is models/<provider>-<model>.yaml). New --seed-canonical-layout and --force-overwrite flags on configure-openclaw.sh; existing --seed-identity flag preserved as alias. Re-run ./scripts/configure-openclaw.sh after upgrade to refresh the v0.27.0-stamped skill (catalog grew from 53 → 57 packs).

Added

HuggingFace integration epic (#490) + companion strategic-direction blog draft. PR #489 added HF Inference Providers as alternative LLM routing — multi-provider YAML schema, first HF template profile, routing setup walkthrough, CI validation gate. This filing reframes the broader opportunity: HuggingFace isn't just another LLM router; it's a platform spanning 100K+ datasets, embeddings APIs, Spaces hosting demos, tokenizers, fine-tuning hooks. Helmdeck currently uses zero of those beyond the routing-only integration. Epic #490 enumerates six phases — (1) Inference Providers (foundation, mostly shipped via #489; acceptance pending community-contributed community_traces[] entry per #482); (2) Datasets integration (two new packs helmdeck__hf-dataset-search + helmdeck__hf-dataset-stream for domain-corpus grounding in content.ground / research.deep beyond Firecrawl scraping); (3) Embeddings + similarity (helmdeck__hf-embeddings pack for sentence-transformers / cross-encoder embeddings + helmdeck.memory_store integration for semantic recall beyond key/value-only lookups); (4) Spaces (helmdeck__hf-space-invoke for remote demo-endpoint composition with explicit security review for arbitrary remote code invocation); (5) Tokenizers (helmdeck__hf-tokenize pack for accurate per-model token counting + per-model profile YAML schema gains optional tokenizer: field for context-engine budgeting); (6) Self-hosted runtime patterns (expanded vLLM / TGI / SGLang walkthroughs at docs/howto/self-host-with-*.md + per-engine tool_parser: field guidance with Nemotron's qwen3_coder parser as the canonical example + deploy/docker/sidecar-vllm.Dockerfile patterns). Each phase ships acceptance criteria; ordering is community-driven; phases 1-4 are independent (any order); phases 5-6 build on earlier work. Companion blog draft at website/blog/2026-06-10-huggingface-as-a-first-class-platform.md — 600-word strategic-direction framing per CLAUDE.md draft-on-finding norm. 
Uses sanitized Maya security-research persona for worked examples per the standing memory rule. draft: true until at least Phase 2 ships so the post has concrete deliverables to reference beyond strategic framing. Empirical motivation anchored in PR #481 → PR #484 Nemotron baseline-vs-hardened A/B (24 calls / 0 deposit → 7 calls / deposit + verify with all_present: true) — per-use-case AGENTS.md hardening is the lever regardless of platform; HuggingFace gives helmdeck more substrate to harden against. What this filing does NOT do: doesn't predict per-phase timelines (acceptance criteria are listed; pacing is community-driven); doesn't gatekeep contributions on maintainer review (external PRs welcomed via existing patterns); doesn't restructure existing #464 / #482 issues (they remain specific tracks for their respective scopes); doesn't create a GitHub milestone (single-tracking-issue pattern matches existing #461 / #464). Cross-link comments posted on #464 and #482 clarifying the broader context.

Multi-provider schema upgrade for the per-model profile library + first non-OpenRouter template (advances #464 Phase 1 + #482 HF community track). Today's empirical findings (3 of 5 Phase 1 models hit upstream rate limits on the OpenRouter :free pool — Google AI Studio 429 on gemma-4, "Venice"-attributed 429s on llama-3.3 and qwen3-coder) motivated this scope. Four deliverables in one cohesive change unblocking external contributions for routing layers beyond OpenRouter: (1) Schema reference doc at docs/reference/model-profiles-schema.md documenting the YAML schema explicitly for the first time — required + optional fields, accepted provider: union (openrouter / huggingface / together / groq / cerebras / sambanova / custom), per-provider extension fields (hf_routing_policy, hf_partner, endpoint_base_url, tool_parser), required empirical sections (validated_against / community_traces / comparison_traces — present even if empty []), file size soft cap, anonymization rules per the standing memory rule, full schema for each empirical section's entry shape with the existing models/openai-gpt-oss-120b-free.yaml as the most-populated reference. (2) First HF Inference Providers template at models/huggingface-openai-gpt-oss-120b.yaml — reuses the gpt-oss prompting guidance from the OpenRouter sibling unchanged (model behavior is provider-agnostic; only routing differs), adds HF-specific hf_routing_policy: ":preferred" default + context_window_notes explaining the HF routing layer (OpenAI-compatible at router.huggingface.co/v1, provider-selection policies :fastest/:cheapest/:preferred, free-tier credit ~$0.10/month writeup-quoted, BYOK alternative). Empirical sections [] — community contribution invited per #482. The cross-provider relationship (openai/gpt-oss-120b on OpenRouter vs HF) gives external contributors a clean A/B template: same model + same prompt across routing layers, measure whether reliability differs. 
(3) Routing setup howto at docs/howto/configure-non-openrouter-providers.md — primary section walks HF Inference Providers end-to-end (get HF API key, configure OpenClaw with base URL + key, provider-selection policies explained, free-tier credit ceiling notes, worked example of switching trace-test agent to HF for cross-provider trace contribution); secondary section briefs Together AI / Groq / Cerebras / SambaNova direct (all OpenAI-compatible with their own free tiers, base URLs + auth doc links per provider); tertiary section briefs self-hosted vLLM / SGLang / TGI with tool_parser: field reference for Nemotron-3 Super's qwen3_coder parser (per the Nvidia developer-forum thread's "Native fixed it" resolution captured in #475 research). Submission methodology section cross-links to existing helmdeck-trace CLI workflow. (4) CI YAML validation gate at scripts/validate-model-profiles.py + .github/workflows/model-profiles-validate.yml — Python stdlib + PyYAML, single file, validates: required top-level keys present, provider: in accepted union, tier: in A/B/C, file size under 30 KB soft cap (sanity check — bumped from 20 KB after nemotron landed at 22.5 KB post-PR #487 due to rich legitimate empirical content), empirical sections present even if empty arrays, provider-specific required fields when relevant (endpoint_base_url for custom). Workflow runs only when models/*.yaml or the validator changes — cheap to run, fast to fail. Negative-test validated against deliberately-broken fixture; positive-test passes all 6 existing profiles (5 OpenRouter + 1 new HF). 
Cross-references: docs/reference/models.md gains a "Non-OpenRouter profiles" subsection listing the new HF template + a "See also" pointer to the schema reference; docs/howto/add-free-models.md gains an "Adding a non-OpenRouter profile" section pointing at the schema + routing howto; CONTRIBUTING.md "Profile contribution" bullet updates the schema reference link to the new dedicated doc + notes non-OpenRouter providers are supported via the routing howto; sidebar registers both new docs under their respective categories (howto: "Per-model agent adaptation" alongside the gemma-4 recipe and personalize howto; reference: alongside reference/models). Backwards-compatibility: existing 5 OpenRouter YAMLs need no changes — the provider: openrouter line inside each is the explicit identifier; the convention for NEW files going forward is models/<provider>-<model-slug>.yaml (HF gpt-oss is the first to follow it). What this PR deliberately does NOT do: doesn't ship empirical HF traces (the HF gpt-oss profile starts empty; community contribution is the whole point of #482); doesn't rename the existing OpenRouter YAMLs; doesn't migrate to multi-provider-per-YAML object schema (sticking with simple union per-file: easier for external contributors to understand); doesn't ship HF templates for the other 4 Phase 1 models (community-contribution opportunities now that the schema + template + routing howto unblock them); doesn't add Together / Groq / Cerebras / SambaNova templates (community contributors can add specific templates as they validate models there).
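The checks listed for the CI validation gate in this entry can be sketched as a function over an already-parsed profile. This is an illustration under stated assumptions, not the contents of scripts/validate-model-profiles.py: the function name is hypothetical, the profile is passed in as a dict so the sketch stays stdlib-only (the real validator loads YAML via PyYAML), 30 KB is taken to mean 30 * 1024 bytes, and the full required-key list from the schema reference is not reproduced here.

```python
from typing import Any, Dict, List

ACCEPTED_PROVIDERS = {
    "openrouter", "huggingface", "together", "groq", "cerebras", "sambanova", "custom",
}
ACCEPTED_TIERS = {"A", "B", "C"}
EMPIRICAL_SECTIONS = ("validated_against", "community_traces", "comparison_traces")
MAX_FILE_BYTES = 30 * 1024  # assumption: "30 KB" means 30 * 1024 bytes


def validate_profile(profile: Dict[str, Any], file_size_bytes: int) -> List[str]:
    """Return a list of human-readable validation errors (empty when valid)."""
    errors: List[str] = []
    provider = profile.get("provider")
    if provider not in ACCEPTED_PROVIDERS:
        errors.append(f"provider {provider!r} is not in the accepted union")
    if profile.get("tier") not in ACCEPTED_TIERS:
        errors.append("tier must be one of A, B, C")
    for section in EMPIRICAL_SECTIONS:
        # Sections must exist even when empty ([]).
        if not isinstance(profile.get(section), list):
            errors.append(f"{section} must be present as a list (may be empty)")
    if provider == "custom" and not profile.get("endpoint_base_url"):
        errors.append("endpoint_base_url is required when provider is custom")
    if file_size_bytes > MAX_FILE_BYTES:
        errors.append(f"file is {file_size_bytes} bytes, over the {MAX_FILE_BYTES}-byte soft cap")
    return errors
```

Collecting every error instead of stopping at the first one lets a contributor fix a broken profile in a single pass.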

Changed

skills/helmdeck/SKILL.md audit + small persona-leak refactor (Fixes #455). Audit categorization across all 14 top-level sections (lines 27–524): pack catalog + MCP resources + async wrappers + pipelines + repo discovery pattern = mechanism (clean — describes packs by capability, decision tables, contracts); error handling rules + default model selection + session chaining + when-to-create-a-github-issue = operating rules served as baseline defaults (acceptable — operators can override in AGENTS.md per the PR #483 layered pattern); developer guidance section at end = audience-targeted (intentional, clearly delimited for helmdeck developers, doesn't pollute agent prompt). Only one real persona-leak found: "Pack composition — you are a creative agent" (section header at line 305) used persona-shaped framing ("YOU are a creative agent...YOU generate creative content"). Refactored to "Pack composition pattern" with mechanism-shaped framing ("agent generates content, packs handle production") — same operational guidance, no persona prescription. Added explicit "Operator override" note at the end of the section pointing operators at docs/integrations/openclaw.md §5d for the layered-customization pattern when they want to pin a different composition style. Audit conclusion: the skill IS well-layered overall; the operating-rules sections serve as defaults that AGENTS.md overrides (per the empirical lesson from PR #481 → PR #484: docs-only profile is necessary but not sufficient; per-use-case AGENTS.md hardening is the load-bearing layer). No major refactor warranted. Companion audit for skills/helmdeck-debug/SKILL.md still tracked in #456 with the same methodology.

models/nvidia-nemotron-3-super-120b-a12b-free.yaml final empirical refinements (closes #475). Four structural updates synthesizing what the v1→v2 baseline-vs-hardened A/B taught us about per-model profile sufficiency: (1) validated_against[] populated with a structured maintainer-curated finding capturing the full A/B comparison table (24 calls / 0 deposit → 7 calls / deposit + verify with all_present: true), the three hardenings that empirically closed both Nvidia-documented failure modes (explicit tool whitelist + async pattern bounds for content.ground + plain-text-tool-call invalidation), the resilience observation (content.ground job actually failed upstream in v2; agent honored "don't retry" rule, recovered via operator's deposit reply), and the strategic lesson that the YAML profile is necessary but NOT sufficient for reliable Tier C Nemotron behavior: per-use-case AGENTS.md hardening is the load-bearing layer. (2) best_practices[] extended with three empirical entries prefixed EMPIRICAL 2026-06-10 capturing the load-bearing hardening pattern, the bounded-polling pattern for async packs, and the upstream-failure-resilience observation. (3) anti_patterns[] extended with two empirical entries prefixed EMPIRICAL 2026-06-10 capturing the "deploy with profile but no hardened AGENTS.md" anti-pattern and the "parallel async pack jobs" anti-pattern (both reproduced verbatim in the v1 baseline). (4) chain_call_reliability.notes extended with the empirical refinement that chain-call reliability is workflow-shape-dependent, not a pure model property: the same model on the same prompt produced 24 calls / 0 deposit (v1) vs 7 calls / deposit + verify (v2) purely on AGENTS.md hardening differences. The short/medium/long buckets describe the model's CAPACITY; the actual call counts depend on whether operator AGENTS.md constrains the workflow. 
Memory-rule compliance: validated_against[] entry uses sanitized labels ("Tier C agent on nvidia/nemotron-3-super-120b-a12b:free, three-turn iterative workflow") rather than naming Press-Nemotron explicitly. Closes #475 — first Phase 1 follow-up issue to land all four empirical sections (validated_against + community_traces × 2 + refined best_practices/anti_patterns/chain_call_reliability notes). Mirrors how models/openai-gpt-oss-120b-free.yaml has all four populated; sets the canonical bar for the remaining three empirical follow-up issues (#473 gemma-4, #474 llama-3.3, #476 qwen3-coder) when their traces eventually land.

scripts/configure-openclaw.sh now seeds the canonical four-file workspace layout instead of dumping concerns into IDENTITY.md (closes #454). Previous behavior: --seed-identity wrote three files (IDENTITY, USER, SOUL) with leaky concerns: SOUL.md mixed voice posture with operating instructions ("Follow the SKILLS.md decision tables..."), USER.md mixed operator description with technical assumptions about pack vocabulary, and AGENTS.md was never seeded. New behavior: four files (SOUL.md, IDENTITY.md, USER.md, AGENTS.md) with cleanly separated concerns per OpenClaw's canonical model (SOUL=voice, IDENTITY=name, USER=operator, AGENTS=operating rules), each capped well under the 12,000-char bootstrap injection limit (current sizes: SOUL 980c, IDENTITY 198c, USER 855c, AGENTS 1988c) with operator-tunable comments at each section so operators know what to customize. SOUL.md covers voice posture / editorial discipline / banned phrases ONLY, with no operating instructions; IDENTITY.md is intentionally minimal (name / emoji / one-line theme); USER.md is the operator profile (who you are, where you publish, current focus, editorial preferences) with placeholders for customization; AGENTS.md carries operating rules (tool whitelist, workflow shape, hard constraints, etiquette) and explicitly references the per-model-agents recipes in docs/howto/per-model-agents/. New --seed-canonical-layout flag is the documented name for the new behavior; --seed-identity is preserved as an alias for backwards compatibility (no script breaking for any caller passing the old flag). New --force-overwrite flag for idempotency control: by default existing files are preserved (skip with informative log message pointing operators at docs/howto/personalize-an-openclaw-agent.md); pass --force-overwrite to back up existing files with a .bak.YYYYMMDD-HHMMSS suffix and write fresh seeds. 
Why this matters: today's PR #485 (personalize-an-openclaw-agent howto) + PR #483 (canonical file roles section in integrations/openclaw.md) document the layered SOUL/IDENTITY/USER/AGENTS pattern as the maintainability story; this PR makes the script that bootstraps new agents actually produce that layout instead of overloading IDENTITY.md (the observed pre-fix behavior from the 2026-06-09 tech-blog-publisher debugging arc that motivated #454 in the first place). Layered seeding empirically matters: PR #481 → PR #484 Nemotron A/B (24 calls / 0 deposit → 7 calls / deposit + verify with all_present: true) demonstrates that per-use-case AGENTS.md hardening is the lever — and the script now seeds an AGENTS.md skeleton that explicitly invites that hardening.
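The preserve-by-default and --force-overwrite semantics described in this entry can be sketched as below. The real logic lives in a shell script; this Python rendering is purely illustrative, `seed_file` is a hypothetical name, and the return strings are invented for the example. Only the skip-by-default behavior and the .bak.YYYYMMDD-HHMMSS backup suffix come from the entry.

```python
from datetime import datetime
from pathlib import Path


def seed_file(path: Path, content: str, force_overwrite: bool, now: datetime) -> str:
    """Write a seed file, preserving any existing file unless forced.

    Returns "created", "skipped", or "overwritten" (labels are illustrative).
    """
    if not path.exists():
        path.write_text(content)
        return "created"
    if not force_overwrite:
        # Default: leave operator customizations untouched.
        return "skipped"
    # Forced: move the existing file aside with a timestamped suffix first.
    backup = path.with_name(f"{path.name}.bak.{now:%Y%m%d-%H%M%S}")
    path.rename(backup)
    path.write_text(content)
    return "overwritten"
```

Passing the timestamp in as an argument keeps the sketch deterministic; a real script would read the clock at run time.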

Added

docs/howto/personalize-an-openclaw-agent.md — generic operator-personalization howto using the layered SOUL / IDENTITY / USER / AGENTS pattern (closes #458). Walkthrough for operators who want to use helmdeck shipped skills (and skills under ~/.openclaw/skills/) with their own persona, platforms, and goals — not the defaults baked into the upstream skill. Covers the five-layer mental model (SOUL=voice / IDENTITY=name / USER=operator / AGENTS=operating rules / SKILL=mechanism), what goes where with concrete examples per layer, walkthrough templates for populating USER.md (the most-customized file), tuning IDENTITY.md (override when defaults don't match), and SOUL.md (generally don't, but here's the dial). Tradeoffs table for when to fork the skill vs customize via identity files. Full worked example using sanitized Maya security-research persona (consistent across helmdeck docs — same persona docs/integrations/openclaw.md §5d canonical file roles section and docs/howto/per-model-agents/gemma-4-iterative-workflow.md recipe use) — Maya's SOUL.md, IDENTITY.md, USER.md files shown verbatim, demonstrating the persona-reuse pattern (same SOUL/IDENTITY/USER copied across multiple model variants; only AGENTS.md changes per model). Shows the multi-variant openclaw.json registration pattern with two example agents (maya-gemma-4 + maya-llama) sharing persona but using different per-model AGENTS.md. Verification section points operators at scripts/helmdeck-trace for empirically validating their personalized agent's behavior post-bootstrap — verify_manifest_called: True + all_present: True + tool tally matching AGENTS.md prescription. Bootstrap helper section points at configure-openclaw.sh and the #454 layered-seed work that will eventually automate the workspace scaffolding. 
Empirical motivation anchored in PR #481 → PR #484's Nemotron baseline-vs-hardened A/B (24 calls / 0 deposit → 7 calls / deposit + verify with all_present: true) showing that per-use-case AGENTS.md hardening is more impactful than persona dumping — the layered pattern documented here is what makes that hardening maintainable across many agents. Memory-rule compliance: worked example uses standing Maya persona throughout; no operator-personal agent names (Hat, Press-*, etc.) leak into the public doc. Sidebar registration: new sidebar entry under existing "Per-model agent adaptation" category, alongside the gemma-4 iterative workflow recipe — operators landing on the conceptual howto can hop to the per-model worked example and back.

models/nvidia-nemotron-3-super-120b-a12b-free.yaml — second community_traces[] entry, hardened-v2 success (#475 Phase 1 follow-up closure). Direct A/B against PR #481's v1 baseline — SAME model, SAME prompt (eBPF kernel rootkit detection), hardened AGENTS.md (operator-local per memory rule). 7 total pack calls vs v1's 24 (71% reduction). artifact-put + verify_manifest both fired with all_present: true on 1 of 1 artifact. Three hardenings empirically closed the Nvidia-documented gap: (1) Explicit tool whitelist ("You MAY call ONLY these tools") forbidding filesystem write/read packs — empirically 0 filesystem calls (vs 5 in v1); (2) async pattern bounds for content.ground ("Call ONCE, poll pack-status max 5x, then pack-result OR honest timeout. NEVER start a parallel job") — empirically 1 content.ground call (vs 6 in v1) + 4 pack-status polls (within 5-budget); (3) plain-text tool call invalidation — explicit rule that tool calls generated as plain text invalidate the response, empirically 0 plain-text tool calls (vs the documented anti-pattern that fired in v1's final turn). Resilience observation: the content.ground job ACTUALLY failed upstream in v2 (state transitioned working → failed by poll #4). The agent honored the "don't retry" rule and reported the failure honestly in the Turn 2 response, ending with the literal handoff line. Operator replied "deposit"; Turn 3 fired artifact.put + verify_manifest correctly with the un-grounded draft, returning all_present:true. The hardened workflow is resilient to upstream pack failure, not just clean-path. Decision: profile-works (vs v1's profile-not-enough) — per-use-case AGENTS.md hardening on top of the docs-only profile closes the Nvidia-documented failure modes. 
Strategic lesson for future Nemotron operators: the YAML profile gives the prompting shape, sampling, and reasoning controls Nvidia recommends; the AGENTS.md gives the workflow constraints that turn those mechanics into reliable agentic behavior — you need both layers. Submitted via the helmdeck-trace CLI (PR #478 / #479) — third YAML to receive a community_traces[] entry via the canonical Phase 1 contribution tool.
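The bounded-polling rule quoted in this entry ("Call ONCE, poll pack-status max 5x, then pack-result OR honest timeout. NEVER start a parallel job") can be sketched as a small driver. The callables `start_job`, `get_status`, and `get_result` are hypothetical stand-ins for the pack calls, and the returned dict shape is invented for illustration; the state names working / completed / failed and the 5-poll budget come from the entry. A real loop would also wait between polls.

```python
from typing import Any, Callable, Dict

MAX_POLLS = 5  # poll budget from the hardened AGENTS.md rule


def run_async_pack_once(
    start_job: Callable[[], str],
    get_status: Callable[[str], str],
    get_result: Callable[[str], Any],
    max_polls: int = MAX_POLLS,
) -> Dict[str, Any]:
    """Start one job, poll within a fixed budget, and never retry or fan out."""
    job_id = start_job()  # called exactly once
    polls = 0
    for _ in range(max_polls):
        polls += 1
        state = get_status(job_id)
        if state == "completed":
            return {"outcome": "completed", "polls": polls, "result": get_result(job_id)}
        if state == "failed":
            # Report the failure honestly instead of starting another job.
            return {"outcome": "failed", "polls": polls, "result": None}
    return {"outcome": "timeout", "polls": polls, "result": None}
```

Replaying the v2 session's status sequence (three working states, then failed) ends on the fourth poll with a failed outcome, which matches the "4 pack-status polls (within 5-budget)" observation.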

models/nvidia-nemotron-3-super-120b-a12b-free.yaml — first community_traces[] entry capturing both Nvidia-documented Tier C failure modes empirically (#475 Phase 1 follow-up advanced). Press-Nemotron agent (session 41863f17-43bc-447a-9828-87c812534615, 2026-06-10) ran the standard three-turn iterative blog-drafter workflow on nvidia/nemotron-3-super-120b-a12b:free. 15-minute session, 24 total pack calls, zero artifact-put or verify_manifest calls — workflow never reached the deposit step. Reproduces both anti-patterns the Nvidia agentic-coding cookbook documents: (1) Goal Drift: agent drifted from "blog draft + deposit" to "spam content.ground with multiple concurrent jobs and write random files" — used filesystem write/read packs (NOT prescribed by AGENTS.md) to save outline.md, draft.md, temp_draft.md, test.md to the workspace dir; six simultaneous content.ground jobs started, most hung at "progress: 10%", only ONE completed and only on a tiny 46-byte test file. (2) Tool-Call Failures: final assistant turn started generating <tool_call><function=helmdeck__pack-status><parameter=job_id>... as PLAIN TEXT instead of using the OpenAI toolCall format — literal "malformed function call" anti-pattern Nvidia documents. Decision: profile-not-enough — the docs-sourced profile guidance (ChatML format, sampling, enable_thinking, force_nonempty_content) was insufficient to prevent the failures; per-use-case AGENTS.md hardening is the apparent next step. Useful side observations: (a) content.ground is async (returns job_id + state:"working"); AGENTS.md says "Call content.ground ONCE" but doesn't mention the polling pattern — operators iterating on Nemotron recipes should add explicit "call once, poll pack-status until state:completed, then call pack-result" guidance. 
(b) The agent has access to filesystem packs that AGENTS.md never authorized (probably from a separate Claude Code MCP integration in OpenClaw); per-model AGENTS.md should explicitly enumerate allowed packs to prevent goal-drift escapes. Iterating Press-Nemotron AGENTS.md (operator-local per memory rule) sets up a v2-vs-v1 A/B for the next community_traces[] entry — and validates the helmdeck-trace CLI's role as the canonical evidence-capture tool for #464 Phase 1 follow-ups.

models/openai-gpt-oss-120b-free.yaml — second community_traces[] entry capturing first end-to-end CLI dogfood run (PR #478 consumer, captured via PR #479's fixed helmdeck-trace). Trace babfee13-9d81-4f88-a3c8-3cab900c562e from the new trace-test agent on openrouter/openai/gpt-oss-120b:free: three-turn iterative workflow on an MCP tool catalog deep-dive prompt; artifact.put + verify_manifest fired end-to-end with all_present:true on 1 of 1 deposited artifact. Captures three findings worth pinning beyond the metric_summary: (1) workflow shape EXPANDED rather than simplified — AGENTS.md prescribed "exactly two tool calls" for Turn 3 but the agent fired 5 (1 content-ground + 1 artifact-put + 1 verify-manifest + 2 exploratory probes via pack.status + pack.result before the deposit). The publishing-strategist trace from 2026-06-09 simplified 9 platforms to 2 (workflow contracted); this trace shows the opposite drift (workflow expanded with exploratory probes). Two different Tier C deviation patterns, both away from the AGENTS.md prescription; both shapes of customization pressure operators should expect when designing per-use-case agents. (2) non-deliverable-terminal-turn retry-recovery is a real Tier C resilience pattern — operator's first deposit reply triggered the trajectory error; retry with same input succeeded. The free gpt-oss-120b route can fail one turn and recover on the next attempt, which is useful operator-facing patience guidance. (3) Three-turn iterative shape held under iterative pressure — both handoff lines fired literally, Turn 3 ended with the prescribed Done. Artifact deposited and verified. line, zero citation URLs fabricated (model honored the content.ground rule and didn't author URLs). Namespace deviation: artifact landed at artifact.put/...md not blog.publish/...md per AGENTS.md — pack default kicked in where the model should have honored the explicit namespace arg; worth tightening in future revisions. 
Decision: profile-works (the audit-callback pattern fired correctly and produced a verified artifact; the workflow deviations are operational observations, not workflow-breaking failures). Submitted via the same helmdeck-trace extract shape Phase 1 community contributors are expected to use — proving the CLI's primary use case works end-to-end on a successful session for the first time.

Added

scripts/helmdeck-trace CLI — extracts structured community_traces[] blocks from OpenClaw session jsonl files (issue #464 Phase 1 contribution tooling). Single-file Python CLI (stdlib only, no PyYAML or requests) with three subcommands: extract (one session → one YAML block matching the canonical community_traces[] schema in models/openai-gpt-oss-120b-free.yaml), compare (baseline vs profile-aware A/B markdown table for the methodology described in each empirical-baseline issue), and summary (quick stdout key:value dump for eyeballing). Walks the OpenClaw session jsonl forward, pairs toolCall parts with the next toolResult turn FIFO (matching the existing scripts/oc-capture/extract-oc-transcript.py parser pattern), and computes: real_pack_calls (count of actual toolCall parts, NOT text claims like "I deposited 6 artifacts"), tool_calls_by_name (per-tool tally), verify_manifest_called + all_present (from parsing the verify_manifest tool result JSON), artifact_put_called, content_ground_called + claims_considered / claims_grounded / skipped (from parsing the content.ground tool result), pipeline_run_called, citation_urls_in_text / citation_urls_from_grounding / citation_urls_fabricated (parses [N](url) and [source](url) patterns from assistant final text; cross-checks each URL against content.ground response grounding[] array — flags any inline URL that did NOT come from content.ground as fabricated; this is the Tier C citation-confabulation failure mode documented in 2026-06-10 traces), hallucination_count (heuristic: assistant text claims a deposit / verify outcome but the corresponding tool call never fired), and terminal_errors (captured from trajectory model.completed.data.terminalError — exercises the 429 / non_deliverable_terminal_turn path proven against the 2026-06-10 gemma-4 rate-limit trace). 
simplification_observed is intentionally NOT auto-detected — heuristic is too fragile; CLI emits null so the YAML schema is satisfied and operator sets it manually after review. Anonymization: default behavior strips operator-personal data per the standing memory rule that workspace files + agent names stay private (agent_id: press-gemma-4 → comment # trace agent (anonymized): Tier C agent on <model>; workspace path omitted entirely). --no-anonymize flag is available for local testing but the default is safe for community PRs. Validation pattern documented: rather than running the CLI against personal agents (Hat / Press-Gemma / etc.), the README recommends spinning up a dedicated trace-test agent on a known-good model (e.g., openrouter/openai/gpt-oss-120b:free) with a generic AGENTS.md that runs the same three-turn iterative workflow shape Hat/Press-Gemma use. The agent stays on the operator's machine (NOT in helmdeck) but the pattern is community-useful — surfaced in scripts/helmdeck-trace/README.md as the recommended validation approach. What this CLI does NOT do (explicit scope boundaries): doesn't fire sessions (OpenClaw's internal IPC protocol isn't documented for external automation; filed as a research follow-up if upstream OpenClaw ships a documented session-fire API; meanwhile, operator manually pastes the test prompt into the OpenClaw UI then points the CLI at the resulting jsonl); doesn't compute simplification_observed (manual after review); doesn't compare against expected behavior (the output is the trace; the operator picks the decision: value). Consumers: the four empirical-baseline issues filed alongside PR #477 — #473 gemma-4, #474 llama-3.3, #475 nemotron-3-super, #476 qwen3-coder — each invites community contribution of trace excerpts to populate the respective profile YAML's community_traces[] array; this CLI is the canonical tool for producing those excerpts. Docs at scripts/helmdeck-trace/README.md.
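The citation cross-check described above can be sketched roughly as follows. This is a minimal illustration, not the CLI's actual code: the function name `classify_citations`, the exact regex, the de-duplication of repeated URLs, and the assumption that the content.ground grounding[] array reduces to a flat list of URL strings are all hypothetical.

```python
import re

# Matches inline markdown links whose label is a number ("[3](url)") or the
# literal word "source" ("[source](url)"), the two patterns named in the entry.
# The precise regex is an assumption made for this sketch.
CITATION_RE = re.compile(r"\[(?:\d+|source)\]\((https?://[^)\s]+)\)")


def classify_citations(final_text, grounding_urls):
    """Split inline citation URLs into grounded vs fabricated.

    grounding_urls stands in for the URLs returned by content.ground;
    any inline URL not in that set is flagged as fabricated.
    """
    allowed = set(grounding_urls)
    in_text = []
    for url in CITATION_RE.findall(final_text):
        if url not in in_text:  # keep first-seen order, count each URL once
            in_text.append(url)
    fabricated = [u for u in in_text if u not in allowed]
    return {
        "citation_urls_in_text": len(in_text),
        "citation_urls_from_grounding": len(in_text) - len(fabricated),
        "citation_urls_fabricated": len(fabricated),
        "fabricated": fabricated,
    }
```

Exact string matching keeps the check strict: a URL that differs from the grounded one by a trailing slash would be flagged, which may or may not match the real tool's behavior.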

Changed

Empirical refinement: deposit-step skipping is workflow-shape-dependent, not tier-invariant (issue #466 follow-up). PR #469 / the 4th blog post (/blog/tier-a-empirical-baseline) framed the deposit-step skipping as tier-invariant based on three single-response traces (Tier C baseline, Tier C with profile, Tier A baseline). A fourth trace on the same openai/gpt-oss-120b:free route, run with a three-turn iterative workflow (outline → draft → operator-triggered deposit+verify), successfully called BOTH helmdeck__artifact-put AND helmdeck__artifact-verify_manifest, returning all_present: true, 1 of 1 verified. Real 10,438-byte artifact landed at the expected blog.publish/ namespace key. Latency was significant (~5 minutes total for the deposit-and-verify turn on the free route), but the mandatory tool calls executed correctly. Corrected conclusion: single-response workflows asking the agent to do classify-outline-draft-deposit-verify-checklist in one go fail on every tier; multi-turn iterative workflows with explicit operator handoffs (each turn small enough that 1-2 pack calls suffices per chain_call_reliability: high in the profile) drive the mandatory calls reliably even on cheap Tier C. Engine-level enforcement (#461 Phase 3) remains the durable architectural answer because it removes the workflow-shape dependency entirely — but well-shaped iterative skill prose CAN drive the mandatory call on every tier tested so far. What changed in the docs: docs/reference/models.md Tier C row updated with the iterative-workflow recipe; "Empirical findings from 2026-06-09" section gains the refined finding paragraph + a new "Iterative workflow pattern" subsection documenting the recommended Turn 1 / Turn 2 / Turn 3 structure with operator-triggered handoffs. The handoff line at the end of each turn is itself load-bearing — if the skill prose says "produce a handoff line" but doesn't list missing-handoff as an invalidation condition, the model will drop it. 
The doc explicitly recommends pinning handoff lines as success-criteria invalidation conditions. models/openai-gpt-oss-120b-free.yaml schema extended with a new comparison_traces[] entry capturing the iterative-workflow trace alongside the original Tier A entry; the original entry's "tier-invariant" notes are revised in-place to point at the new entry as the corrected finding. Methodological lesson: empirical claims based on a single workflow shape are premature. The architectural answer (Phase 3 engine hook) still holds, but the per-tier customization recommendations gain a new dimension — workflow shape, not just model tier, drives reliability of mandatory tool calls.
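One way to picture "handoff line as an invalidation condition" is a check that rejects any turn whose reply does not end with the expected line. The sketch below is illustrative only: the helper names and the Turn 1 / Turn 2 handoff strings are hypothetical placeholders, and only the Turn 3 line is taken from a trace recorded elsewhere in this changelog.

```python
# Hypothetical handoff lines per turn. Turn 1 and Turn 2 strings are invented
# for this sketch; the Turn 3 line is the one quoted in the gpt-oss trace entry.
HANDOFFS = {
    1: "Outline ready. Reply 'draft' to continue.",
    2: "Draft ready. Reply 'deposit' to continue.",
    3: "Done. Artifact deposited and verified.",
}


def last_nonempty_line(reply):
    """Return the final non-blank line of a reply, stripped of whitespace."""
    lines = [ln.strip() for ln in reply.splitlines() if ln.strip()]
    return lines[-1] if lines else ""


def check_turn(turn, reply):
    """Return (ok, reason); a missing handoff line invalidates the turn."""
    expected = HANDOFFS[turn]
    actual = last_nonempty_line(reply)
    if actual == expected:
        return True, "handoff present"
    return False, f"missing handoff: expected {expected!r}, got {actual!r}"
```

Treating the handoff as a hard pass/fail condition mirrors the recommendation above: if a missing handoff is not listed as a failure, the model tends to drop it.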

Changed

docs/reference/models.md tier-level recommendation table rewritten with empirical Tier A baseline data (issue #466). The original table (shipped in PR #465) claimed Tier A "works out of the box" as an assumption. The 2026-06-09 Tier A baseline test on anthropic/claude-sonnet-4.6 empirically revealed the assumption is only partially supported: Tier A handles every structural aspect of skill compliance better than either Tier C variant (parallel tool use at startup, full N-platform fanout, InfoQ 6-criterion fit check with per-criterion grades, multi-step plan acknowledged upfront, "one clarifying question" rule honored exactly) — but Tier A also skips the mandatory artifact.put + verify_manifest deposit step, same as both Tier C variants. The agent's text says "Now appending CTAs and depositing to artifacts — all in parallel" but its parallel tool calls were 8× blog.append_cta — conflating "append CTA" with "deposit to artifacts." The mandatory deposit step was never executed. Strategic finding: the deposit-step skipping is tier-invariant, not Tier-C-specific. Skill prose marked "MANDATORY, NOT ADVISORY" is treated as advisory regardless of model capability. What changed in the docs: the recommendation table now has two columns ("Structural compliance" vs "Mandatory deposit-step compliance"); a new "Empirical findings from 2026-06-09" section presents the three-trace comparison (Tier C baseline, Tier C with profile, Tier A baseline) across 10 metrics. Architectural implication: Phase 3 of #461 (engine-level post-call hook) was originally deferred pending Phase 1 + 2 evidence — today's trace strengthens its justification. The pattern is necessary because skill prose can't carry the mandatory-call weight on any tier, not just Tier C. 
The architectural shape that closes the loop: producer pack registers a paired auditor; engine intercepts the producer's completion and auto-invokes the auditor; auditor result attaches to the producer's response envelope so the LLM sees both in its next-turn context; no skill-prose dependency. Field report captured in 2026-06-09-tier-a-empirical-baseline.md — fourth post in the 2026-06-09 series, frames the tier-invariant deposit-step failure mode honestly and points at #461 Phase 3 as the architectural answer. models/openai-gpt-oss-120b-free.yaml schema extended with a comparison_traces[] array (distinct from community_traces[]) so cross-tier maintainer-captured comparison runs have a structured place to live. Today's Tier A run is the first entry; future Tier B comparison runs will follow the same shape.
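The producer/auditor pairing described above might look something like this in miniature. It is a sketch of the proposed Phase 3 shape under stated assumptions, not the engine's design: the `Engine` class, its `register` and `execute` methods, the envelope keys, and the in-memory `store` are all hypothetical, and the two handlers are stand-ins rather than the real artifact.put and artifact.verify_manifest packs.

```python
class Engine:
    """Toy engine: a producer pack may register a paired auditor pack."""

    def __init__(self):
        self._packs = {}
        self._auditors = {}  # producer name -> auditor name

    def register(self, name, handler, auditor=None):
        self._packs[name] = handler
        if auditor is not None:
            self._auditors[name] = auditor

    def execute(self, name, payload):
        envelope = {"pack": name, "output": self._packs[name](payload)}
        auditor = self._auditors.get(name)
        if auditor is not None:
            # The engine, not skill prose, triggers the audit, and the result
            # rides along in the producer's response envelope.
            envelope["audit"] = {
                "pack": auditor,
                "output": self._packs[auditor](envelope["output"]),
            }
        return envelope


store = {}  # stand-in for the artifact store


def put_handler(payload):
    store[payload["key"]] = payload["content"]
    return {"artifact_key": payload["key"]}


def verify_handler(producer_output):
    return {"all_present": producer_output["artifact_key"] in store}


engine = Engine()
engine.register("artifact.verify_manifest", verify_handler)
engine.register("artifact.put", put_handler, auditor="artifact.verify_manifest")
```

Calling `engine.execute("artifact.put", {...})` returns one envelope carrying both the producer output and the audit result, so the caller sees ground truth without issuing a second call.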

Fixed

blog.append_cta now uses the defaultPackModel() resolver, closing the last hold-out from PR #453. When PR #453 added the default-model resolver to content.ground and blog.rewrite_for_audience, it deliberately excluded blog.append_cta because the conditional shape ("model is required when source_url / project_url / github_url is set") was thought to be a different failure surface. The 2026-06-09 Tier A baseline trace (issue #466) showed otherwise: anthropic/claude-sonnet-4.6 running tech-blog-publisher on the mcp-adr-analysis-server prompt called blog.append_cta 8 times in parallel with project_url set but no model arg, and the pack rejected ALL 8 with invalid_input: model is required when one of source_url / project_url / github_url is set. That's the same upstream failure mode #453 closed for the other content packs: the pack rejects the call before the LLM dispatcher fires. Fix: the handler now calls defaultPackModel(in.Model) exactly like content.ground and blog.rewrite_for_audience do, resolving the same precedence chain (caller input → HELMDECK_DEFAULT_PACK_MODEL env → first HELMDECK_OPENROUTER_MODELS entry → openrouter/auto hard fallback). The "model is required when..." error path is removed; the dispatcher gets a non-empty Model value on every call. Behavior change: callers omitting model while supplying any of the URL link inputs no longer hit CodeInvalidInput. Operators who want a specific model still pass it; the default fires only when omitted. Test surface: the existing TestBlogAppendCTA_RequiresModelWhenLinkSet was removed (the behavior it pinned no longer applies) and replaced with TestBlogAppendCTA_DefaultsModelWhenOmitted (asserts the dispatcher receives openrouter/auto when caller omits model) and TestBlogAppendCTA_DefaultsModelHonorsOperatorOverride (asserts the HELMDECK_DEFAULT_PACK_MODEL env wins over the hard fallback).
Inline comment on the removed test in blog_append_cta_test.go documents the empirical-trace lineage so a future maintainer can audit the relaxation. Empirical impact: the same Tier A retry (next session post-merge) should now produce 8 successful blog.append_cta calls and the chain that broke today flows through cleanly. Architectural finding captured separately in issue #466: even with this fix, today's Tier A trace skipped artifact.put AND artifact.verify_manifest calls entirely — the deposit-step skipping appears to be tier-invariant, not Tier-C-specific. That observation reframes the "Tier A works out of the box" assumption in docs/reference/models.md and strongly supports the engine-level post-call hook (Phase 3 of #461) as the architectural answer regardless of tier.

Added

models/google-gemma-4-26b-a4b-it-free.yaml — second per-model prompting profile, stub (issue #464 Phase 1.2). 26B-total / 3.8B-active MoE Gemma 4 IT variant on Tier C (256K context window, multimodal — text + image + video up to 60s). Profile sourced from OFFICIAL Google Gemma 4 docs only: Hugging Face model card, Google AI model card, DeepMind product page (τ2-bench numbers), and Google's announcement blog. Schema captures Gemma's role-turn conversational format (replaces Gemma 3's <start_of_turn> syntax with standard system/user/assistant roles via the chat template), binary thinking-mode control via the <|think|> token (NOT a graded low/medium/high knob like gpt-oss; toggled via enable_thinking=True/False through the chat template), Google's universal sampling defaults (temperature=1.0, top_p=0.95, top_k=64 across all tasks), harmony_format: false (Gemma uses its own channel-tag thinking format <|channel> / <channel|> — important: per Google's docs, "Thoughts from previous model turns must not be added" back into history), and multimodal ordering rules (image content BEFORE text, audio content AFTER text). chain_call_reliability: high for short chains (1-2 calls), medium for medium (3-4), low for long (5+) — based on DeepMind's published τ2-bench 85.5% (retail agentic tool-use, 26B-A4B variant) plus the 3.8B active-parameter budget (small-active MoEs typically degrade on long horizons; binary-only thinking control leaves no escalation knob). best_practices[] quotes from Google's official sources; anti_patterns[] captures Gemma-specific gotchas (replaying prior-turn thoughts, hand-rolling Gemma 3 turn markers, expecting nuance/sarcasm reliability — model card explicitly cautions on each). 
validated_against, community_traces, and comparison_traces ship empty — baseline empirical trace deferred to a follow-up issue because the Google AI Studio shared :free pool on OpenRouter rate-limited the trace prompt at zero token cost on 2026-06-10 (429 Provider returned error: google/gemma-4-26b-a4b-it:free is temporarily rate-limited upstream / provider_name: "Google AI Studio"). The 429 finding itself is captured in the YAML's header comment as a Tier C infrastructure observation — Google AI Studio gates at the upstream-provider level, NOT at the model level, affecting all google/*:free routes simultaneously. BYOK (https://openrouter.ai/settings/integrations) is required for sustained empirical work on Gemma 4 via OpenRouter.
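The chain_call_reliability bands in this profile (high for 1-2 calls, medium for 3-4, low for 5+) lend themselves to a small lookup. The sketch below is only an illustration of how an operator might apply those bands when planning turns; the function names and the idea of chunking a call list are assumptions, not part of the profile schema.

```python
def chain_call_reliability(planned_calls):
    """Map a planned chain length onto the profile's reliability bands."""
    if planned_calls < 1:
        raise ValueError("planned_calls must be at least 1")
    if planned_calls <= 2:
        return "high"
    if planned_calls <= 4:
        return "medium"
    return "low"


def split_into_turns(calls, max_per_turn=2):
    """Chunk a list of pack calls so every turn stays in the 'high' band."""
    if max_per_turn < 1:
        raise ValueError("max_per_turn must be at least 1")
    return [calls[i:i + max_per_turn] for i in range(0, len(calls), max_per_turn)]
```

For example, a five-call plan rates "low" as a single chain but becomes three turns of at most two calls each, every one of which rates "high".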
Per-model agent recipe: Gemma 4 iterative workflow (issue #464 Phase 4 down-payment). New how-to doc at docs/howto/per-model-agents/gemma-4-iterative-workflow.md walks through setting up an OpenClaw blog-drafter agent on google/gemma-4-26b-a4b-it:free with a Gemma-4-tuned AGENTS.md template — restructured for role-turn-conversational style instead of gpt-oss's Objectives + Source priority + Constraints + Output format + Success criteria sections. Same three-turn iterative workflow shape as PR #470's gpt-oss validation (outline → draft + ground → deposit + verify) for clean cross-model comparison_traces[] isolation. Sanitized worked example uses Maya persona (a hypothetical security researcher) per the standing memory rule that operator-personal workspace files stay anonymized in helmdeck-facing docs. Recipe covers pre-flight (OpenRouter key + Firecrawl overlay), per-agent model config (Google's universal temperature=1.0, top_p=0.95, top_k=64 sampling defaults + enable_thinking: true), the full AGENTS.md template, a test prompt that mirrors PR #470's validation arc, the metric-capture shape for comparison_traces[] submissions, and an honest "why three turns" rationale. Partial Phase 4 acceptance: issue #464 Phase 4 originally proposed shipping per-model templates under skills/tech-blog-publisher/templates/agents/<variant>/ — but the tech-blog-publisher skill itself isn't helmdeck-shipped (operators set it up locally per docs/howto/add-free-models.md). This recipe-doc shape closes the same intent without requiring helmdeck to ship the upstream skill: it gives operators a model-specific AGENTS.md template + worked example they can copy into their personal OpenClaw workspace. New sidebar category "Per-model agent adaptation" surfaces the recipe in the howto sidebar.
Profile stubs for three more #464 Phase 1 entries (issue #464 Phase 1.2). Schema scaffolds with docs-sourced metadata and prompting guidance ship for meta-llama/llama-3.3-70b-instruct:free (models/meta-llama-llama-3.3-70b-instruct-free.yaml, 70B dense Llama 3.3 on Tier C free route, role_header_chatml format with Meta's own <|start_header_id|> tokens, two function-calling paths documented (bracket-list vs JSON-after-<|python_tag|>), Meta's Llama prompting guide best-practices captured, the family-level "conversation alongside tool calling" anti-pattern noted), nvidia/nemotron-3-super-120b-a12b:free (models/nvidia-nemotron-3-super-120b-a12b-free.yaml, 120B-total / 12B-active hybrid Mamba-Transformer MoE with 1M context window, ChatML format with <|im_start|>/<|im_end|>, reasoning control via enable_thinking + low_effort sub-mode through chat_template_kwargs, Nvidia's force_nonempty_content: True recommendation for coding agents to prevent reasoning-only-empty turns — corroborates ADR 053, goal-drift and tool-call-failure documented as residual failure modes despite the 1M window, Nvidia's own Super+Nano deployment recommendation for long chains noted), and qwen/qwen3-coder:free (models/qwen-qwen3-coder-free.yaml, 480B-total / 35B-active MoE coder-specialized Qwen 3 variant with 256K native context extendable to 1M via YaRN, ChatML format with <|im_start|> / <|im_end|> plus FIM tokens for inline-completion contexts, NON-thinking-mode only — Qwen3-Coder explicitly does NOT generate <think></think> blocks per the HF card, Qwen-specific tool parser recommended in SGLang/vLLM, post-trained with long-horizon Agent RL for multi-turn tool trajectories, SWE-Bench Pro 38.7 / Terminalbench 2 23.9 documented; sourced from HF model card + GitHub README + Qwen announcement blog). 
All three stubs ship empirical sections empty (validated_against: [], community_traces: [], comparison_traces: []) with comments pointing at follow-up empirical-baseline issues; this lowers the bar for community contribution (per Phase 1 §7) — operators running these models on real workloads can submit trace excerpts to populate community_traces[] without rebuilding the schema scaffold first. Phase 1 substitution rationale: originally #464 Phase 1 listed z-ai/glm-4.5-air:free as the fifth entry, but live OpenRouter API enumeration on 2026-06-10 confirmed the :free variant has been deprecated (only the paid z-ai/glm-4.5-air remains; live /api/v1/models, the collections page, and third-party enumeration all agree). qwen/qwen3-coder:free is substituted in — it's an actively maintained coder-specialized model with strong agentic positioning (Agent RL post-training, SoTA among open models on Agentic Coding per the Qwen blog), and the Qwen upstream pool is independent of the Google AI Studio pool that gemma-4 hit today. Docs update: docs/reference/models.md "Per-model profiles available today" list promotes all four new YAMLs out of "Planned" into "Available today" (with four explicitly labeled as stubs); a new section above the Tier C routing table notes that per-model profiles override the row-level Notes column with prompting guidance sourced from official model docs; google/gemma-2-9b-it:free and z-ai/glm-4.5-air:free removed from the planned list (gemma-2 substituted with gemma-4-26b-a4b-it:free, glm-4.5-air substituted with qwen3-coder:free). Tier C table gets new rows for openrouter/google/gemma-4-, openrouter/meta-llama/llama-3.3-70b-instruct:free, and openrouter/qwen/qwen3-coder; the existing openrouter/z-ai/glm- prefix row gains a note about glm-4.5-air's deprecation; existing nemotron prefix row gets a "Profile: [...]" link. 
Follow-up empirical-baseline issues were filed alongside this PR for gemma-4, llama-3.3, nemotron-3-super, and qwen3-coder; each follows the methodology of issue #466 (which validated gpt-oss-120b) and invites community contribution. Why ship stubs instead of one PR per model: it moves Phase 1 acceptance from "1 of 5" to "5 of 5 with at least one fully empirically validated" in a single push, establishes the schema scaffold for community contributors to open PRs against, and surfaces the per-model prompting differences immediately in docs without waiting for empirical trace runs on every model. The gpt-oss profile started populated because a prior empirical session was available; the other four Phase 1 entries don't have prior traces, but the docs-sourced scaffold provides immediate value while empirical data accumulates.
models/openai-gpt-oss-120b-free.yaml — first entry in the per-model prompting-profile library (issue #464 Phase 1). Sourced from OFFICIAL model documentation only: OpenAI Harmony response format, Together AI GPT-OSS guide, IBM watsonx GPT-OSS behavior guidelines, and OpenRouter free-route. Schema captures: prompting_style (objectives + source priority + constraints + output format + success criteria — NOT step-by-step), reasoning_effort_control with per-task defaults (low/medium/high), source_priority_directive (gpt-oss can prefer internal knowledge unless told otherwise — skills must include an explicit source-priority section), harmony_format (gpt-oss uses harmony response format with internal chain-of-thought), chain_call_reliability per chain length (high for 1-2 calls, medium for 3-4, low for 5+ — Tier C reliably makes 1-2 real pack calls per turn then hallucinates the rest as text, per the 2026-06-09 trace in PR #462), best_practices[] (10 items derived from official docs), anti_patterns[] (5 items including the plausibility-shaped-output failure mode), and a prompt_template showing the canonical shape. Schema extended with community_traces[] array so external operators contributing their own use-case traces have a structured place to submit them (contributor / use_case / session_date / metric_summary / decision / notes / pr_or_issue_url). First entry is the 2026-06-09 empirical run: profile-aware agent on openai/gpt-oss-120b:free vs baseline, both calling the same publishing-strategist skill. Empirical finding (full results in validated_against.finding): profile-aware agent produced 2 real blog artifacts, called artifact.verify_manifest once with all_present: true, 2 of 2 verified, hallucinated 0 manifest entries — vs baseline which produced 0 deposits, 0 verify_manifest calls, and (in earlier sessions) 6 hallucinated entries. 
SAME agent simplified the skill's 9-platform table to 2 variations by choosing pipeline-run (auto-deposit) over per-platform blog.rewrite_for_audience calls — the strategic insight is that the profile raises the floor of structural compliance but does NOT eliminate per-use-case simplification on Tier C. Per-use-case AGENTS.md customization remains the architectural truth for non-frontier models. Documentation ships in three new operator-facing surfaces: docs/howto/add-free-models.md (strict recommendation: must customize per (model × use-case), with §7 community contribution paths), docs/howto/experiment-with-tier-b-models.md (Tier B is an open research question — A/B methodology and mandatory share-your-findings ask), docs/reference/models.md gains a tier-level recommendation table at the top (Tier A out-of-box / Tier B experiment / Tier C must-customize). CONTRIBUTING.md adds a "Reporting model behavior" section pointing at the two howtos and the community_traces[] schema. Blog field-report: 2026-06-09-empirical-validation-per-model-profile.md — third post in the 2026-06-09 series (companion to "plausibility-shaped output" and "the audit-callback pattern" in PR #463), explicitly frames the library as a starting point that operators must finish via per-use-case customization. Privacy: blog post + howto worked examples use a SANITIZED hypothetical persona (Maya, security researcher with generic platforms) — the operator's personal press-gpt-oss workspace files are NOT reproduced publicly. Phase 2 follow-ups (tracked in #464): same profile shape for meta-llama/llama-3.3-70b-instruct:free, nvidia/nemotron-3-super-120b-a12b:free, google/gemma-2-9b-it:free, z-ai/glm-4.5-air:free — each requires its own empirical validation trace before shipping. Tier B unknown, tracked as community research per the experiment-with-tier-b-models howto.
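As a rough aid for contributors, the community_traces[] field list above can be turned into a pre-submission check. This is a sketch under assumptions: the changelog names the fields but does not say which are mandatory, so treating all seven as required is a guess, and `missing_fields` is a hypothetical helper rather than anything shipped in the repo.

```python
# Field names come from the community_traces[] schema description above.
# Treating every field as required is an assumption of this sketch.
TRACE_FIELDS = (
    "contributor",
    "use_case",
    "session_date",
    "metric_summary",
    "decision",
    "notes",
    "pr_or_issue_url",
)


def missing_fields(entry):
    """Return the schema fields absent from a candidate trace entry."""
    return [name for name in TRACE_FIELDS if name not in entry]
```

An entry that passes this check is only structurally complete; the decision value and notes still need operator judgment.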
artifact.verify_manifest pack — anti-hallucination audit for the artifact deposit step (#461 Phase 1). Empirical motivation: live trace on 2026-06-09, tech-blog-publisher agent on openai/gpt-oss-120b:free with all morning fixes merged (PR #450 artifact triad, PR #452 declarative bridge, PR #453 default pack model, layered SOUL/IDENTITY/USER/AGENTS workspace split). Agent made one real blog.rewrite_for_audience call, then produced a confidently-formatted six-entry "Artifact Deposit Manifest" table with realistic byte sizes (7.4 KB, 2.1 KB, 3.8 KB, 4.0 KB, 3.5 KB, 3.2 KB) and the disclaimer "Artifact deposit was performed via helmdeck__artifact_put for each variation (mandatory per SKILL.md)." Empirical ground truth from GET /api/v1/artifacts: zero artifacts in the blog.publish namespace, one artifact total in the entire store (an unrelated test). Every line of the manifest was fabricated. The architectural fixes from this morning close the prose-instruction-skipped failure mode (PR #450's typed deposit) and the required-arg-missing failure mode (PR #453's default-model resolver) — they do not close the lying-about-tool-calls failure mode where a Tier C model produces a plausibility-shaped manifest for artifacts it never deposited. This pack closes that gap. Input: {expected: [{artifact_key: "..."}]} (also accepts a flat string array [...] for Tier C friendliness — both shapes decode). Output: {verified[], missing[], all_present, summary}. Handler: per-key ArtifactStore.Get accumulating found vs not-found, dedup before lookup, whitespace-only / empty-string entries dropped silently during decode, summary is one-line "M of N claimed artifacts verified; K missing". Architectural shape mirrors ADR 052 at the chat-response layer: turn an implicit trust ("the agent said it deposited") into a typed pack call that reads ground truth and surfaces the gap in O(200) tokens instead of the multi-thousand-token REST-poking dance an operator would otherwise do to verify. 
Skill integration is documented in docs/reference/packs/artifact/verify-manifest.md: every skill that produces multiple artifacts should chain helmdeck__artifact-verify-manifest as § 4b after the deposit step, with explicit instructions to surface the verified/missing result honestly in the response. Updates to tech-blog-publisher/SKILL.md that add the § 4b rule ship in the same release as a worked example. Test surface: 15 new tests in artifact_verify_manifest_test.go covering all-verified (object shape), all-verified (flat-string Tier C shape), partial-missing (a reproduction of today's trace, 1 of 6 verified), all-missing, dedup of duplicate keys, empty/whitespace entries dropped silently, verified-entry shape carrying filename + namespace + size + content_type + key, 6 error-path cases (missing field, empty array, all-empty entries, wrong type, malformed JSON, no store wired), and a round-trip test that puts two artifacts via artifact.put then verifies both, proving that the producer and consumer work as a matched pair. 100% per-function coverage on the new file; internal/packs/builtin package total: 93 artifact-related tests pass. Phase 2 follow-ups tracked in #461: same audit shape for repo.verify-clone (claimed clone_path exists, commit SHA matches), blog.verify-published (claimed URL is reachable, content matches), pack.verify-completed (job_id is completed not working), slides.verify-rendered (MP4 artifact exists + passes av.validate), content.verify-grounded (claims_grounded_count matches grounded[] length), pipeline.verify-completion (claimed step outputs match run record). Phase 3 (deferred): engine-level post-call hook that auto-invokes the registered auditor without skill-prose dependency; this likely warrants its own ADR if Phase 1 + 2 prove the pattern is generally useful.
Field-report blog drafts scheduled (per CLAUDE.md draft-on-finding norm): (a) "Plausibility-shaped output: Tier C models hallucinate multi-step pack-call chains as text, including manifests of fictitious deposits" quantified from the 2026-06-09 trace; (b) "The audit-callback pattern: verify-against-ground-truth as anti-hallucination middleware" — architectural framing, applicable beyond helmdeck.
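The artifact.verify_manifest handler logic described in this entry (decode either input shape, drop blank entries, dedup, look each key up, summarize) can be sketched as below. This is an illustrative Python rendering, not the shipped handler: the real verified[] entries carry filename, namespace, size, and content_type, whereas this sketch returns bare keys; a plain dict stands in for ArtifactStore; the flat shape is assumed to be a string array under `expected`; and counting N after dedup is an assumption.

```python
import json


class ManifestInputError(ValueError):
    """Raised for inputs the handler would reject (hypothetical name)."""


def decode_expected(raw):
    """Decode {"expected": [...]} accepting {artifact_key} objects or strings."""
    try:
        doc = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ManifestInputError(f"malformed JSON: {exc}") from exc
    if not isinstance(doc, dict) or "expected" not in doc:
        raise ManifestInputError("expected is required")
    items = doc["expected"]
    if not isinstance(items, list) or not items:
        raise ManifestInputError("expected must be a non-empty array")
    keys = []
    for item in items:
        key = item.get("artifact_key", "") if isinstance(item, dict) else item
        if not isinstance(key, str):
            raise ManifestInputError("entries must be strings or {artifact_key}")
        key = key.strip()
        if key and key not in keys:  # drop blanks silently, dedup before lookup
            keys.append(key)
    if not keys:
        raise ManifestInputError("expected has no usable entries")
    return keys


def verify_manifest(raw, store):
    """Check each claimed key against the store and summarize the gap."""
    keys = decode_expected(raw)
    verified = [k for k in keys if k in store]
    missing = [k for k in keys if k not in store]
    return {
        "verified": verified,
        "missing": missing,
        "all_present": not missing,
        "summary": f"{len(verified)} of {len(keys)} claimed artifacts verified; "
                   f"{len(missing)} missing",
    }
```

The point of the shape is that the response is small and unambiguous: a fabricated manifest shows up as a non-empty missing list rather than as prose the operator has to cross-check by hand.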

Fixed

content.ground and blog.rewrite_for_audience no longer hard-fail when the caller omits model. Live-trace evidence from today's tech-blog-publisher retest against openai/gpt-oss-120b:free (the Tier C validation case): agent loaded the skill, ran repo.fetch + web.scrape + fs.read successfully, then looped re-calling content.ground without the model argument and bouncing off Validation failed: model: must have required properties model indefinitely. Same architectural pattern PR #450 fixed for artifact deposit and the validation arc (ADR 052) fixed for AV post-processing: skill prose tells the agent to call the pack, pack contract requires a parameter the prose doesn't mention, Tier C model has no anchor to fill it in, call rejects, loop. Fix: new shared helper defaultPackModel(callerInput string) string in internal/packs/builtin/model_defaults.go resolves a sensible default when the caller omits model. Precedence: explicit caller input wins → HELMDECK_DEFAULT_PACK_MODEL env (operator override, new) → first entry of HELMDECK_OPENROUTER_MODELS env (reuses the existing gateway-side model registry pin from internal/gateway/hydrate_openrouter.go so a stack pinning HELMDECK_OPENROUTER_MODELS=minimax/minimax-m2.7 gets the same model resolved consistently on both the gateway-registration path AND the pack-default path) → openrouter/auto hard fallback. The hard fallback is openrouter/auto rather than a specific free model because (a) it routes through OpenRouter's per-call provider selection which is generally available on every deployment that has HELMDECK_OPENROUTER_API_KEY set, and (b) it preserves the existing project posture that the gateway prefers auto for orchestration work (ADR 053 PromptVariantFullSteps runs on auto by default). The first non-empty source wins; trimming guards against whitespace-only env values being treated as set. 
Wired into: content.ground (input schema Required drops model; handler resolves via the helper before any LLM dispatch); blog.rewrite_for_audience (same — both packs are in the tech-blog-publisher skill's chain). hyperframes.compose also requires model but is hyperframes-specific and deferred until a Tier C trace shows it as a real blocker. Why a hard fallback rather than returning CodeInvalidInput: the typical zero-config dev experience is a fresh helmdeck stack with HELMDECK_OPENROUTER_API_KEY set (the only way the gateway works) and no model override. The Tier C silent-skip mode means an agent calling a pack on that stack would hit model is required with no hint of what value to pass. Defaulting to openrouter/auto makes the pack succeed at the cost of using more tokens than a hand-tuned model choice would. Operators who want a different default set HELMDECK_DEFAULT_PACK_MODEL once at the stack level. Test surface: 10 new tests across model_defaults_test.go (caller-wins, whitespace-trim, operator-override, stack-pin-with-and-without-prefix, hard-fallback, empty/whitespace env handling, leading-empty-skip in comma list, prefix-preservation); plus TestBlogRewrite_DefaultsModelWhenOmitted + TestBlogRewrite_DefaultsModelHonorsOperatorOverride + TestContentGround_DefaultsModelWhenOmitted confirm the helper fires end-to-end through each pack's dispatcher path. Existing TestBlogRewrite_RequiredFields/no_model and TestContentGround_MissingRequiredFields/no_model were removed (the behavior they pinned no longer applies) with inline comments pointing to the replacement tests so a future maintainer can audit the relaxation. What this PR explicitly does NOT do (scope boundaries): (a) apply the same fix to every pack that takes a model parameter — blog.append_cta has a conditional "model is required when..." shape that's a different semantics, and hyperframes.compose hasn't surfaced as a real blocker yet; both are tracked for follow-up. 
(b) Touch the gateway-side LLM provider chain — the default model is resolved purely at the pack boundary; the dispatcher's existing provider routing handles whatever string lands in in.Model. (c) Change the OUTPUT schema or pack metadata — callers checking output.model still see the resolved value (now the default when the caller omitted one), preserving the wire shape.
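The precedence chain described in this entry can be sketched roughly as follows. This is a minimal illustration, not the code in internal/packs/builtin/model_defaults.go: the helper resolvePackModel and its injected getenv parameter are hypothetical (added only so the chain can be exercised without touching the process environment), and the prefix handling mentioned in the test list is not modeled here.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// hardFallbackModel is the last-resort default named in the entry above.
const hardFallbackModel = "openrouter/auto"

// resolvePackModel is a hypothetical, testable core. The first non-empty
// source wins, and every source is trimmed so whitespace-only values are
// treated as unset.
func resolvePackModel(callerInput string, getenv func(string) string) string {
	// 1. An explicit caller value always wins.
	if m := strings.TrimSpace(callerInput); m != "" {
		return m
	}
	// 2. Operator override set once at the stack level.
	if m := strings.TrimSpace(getenv("HELMDECK_DEFAULT_PACK_MODEL")); m != "" {
		return m
	}
	// 3. First non-empty entry of the comma-separated gateway pin.
	for _, entry := range strings.Split(getenv("HELMDECK_OPENROUTER_MODELS"), ",") {
		if m := strings.TrimSpace(entry); m != "" {
			return m
		}
	}
	// 4. Hard fallback.
	return hardFallbackModel
}

// defaultPackModel mirrors the signature quoted in the changelog entry.
func defaultPackModel(callerInput string) string {
	return resolvePackModel(callerInput, os.Getenv)
}

func main() {
	env := map[string]string{"HELMDECK_OPENROUTER_MODELS": " , minimax/minimax-m2.7"}
	get := func(k string) string { return env[k] }

	fmt.Println(resolvePackModel("", get))                                // stack pin, leading empty entry skipped
	fmt.Println(resolvePackModel("  ", func(string) string { return "" })) // hard fallback
	fmt.Println(defaultPackModel("caller/model"))                          // caller wins
}
```

Injecting the environment lookup keeps the precedence rules testable in isolation; a real helper could just as well read os.Getenv directly.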
OpenClaw ↔ helmdeck network bridge now declarative — survives docker compose up --build instead of evaporating on every rebuild. Recurring 24-hour debugging loop: the bundle-mcp process in openclaw-gateway needs DNS resolution for helmdeck-control-plane, which requires both containers to share the baas-net Docker network. The previous mechanism was a runtime docker network connect baas-net openclaw-openclaw-gateway-1 call in scripts/configure-openclaw.sh step 1 — runtime attachments are erased every time the container is recreated. Symptoms: bundle-mcp probes failing with getaddrinfo EAI_AGAIN (network gone) or 401 (a stale token survived but the network didn't), the agent stopping mid-conversation with "I don't have access to MCP tools," and the operator getting pulled back into manual recovery via configure-openclaw.sh --rotate-jwt. Fix: new deploy/openclaw-baas-net.compose.yml declares the attachment as a Docker Compose override on the OpenClaw service. configure-openclaw.sh step 1 now installs this override into the OpenClaw compose directory (typically /root/openclaw/docker-compose.override.yml) before the runtime network connect — the runtime call remains a best-effort safety net for the CURRENT container instance so the rest of the script can probe + verify without requiring a restart, but the override is what makes the bridge survive the NEXT compose-recreate. New --skip-compose-override flag opts out for operators who manage the override themselves. The script preserves any pre-existing differing override at docker-compose.override.yml.bak.YYYYMMDD-HHMMSS before replacing, so a hand-edited override isn't silently clobbered. 
Why this lives in helmdeck's tree rather than OpenClaw's: helmdeck and OpenClaw are independent projects with separate compose lifecycles — the override file is generated by helmdeck (so the integration runbook ships with it) but installed into OpenClaw's compose dir (so OpenClaw's container lifecycle remains the source of truth for OpenClaw's networking). Same pattern as Phase 1 of PR #450: turn an advisory step (re-run configure-openclaw.sh after every rebuild) into a declarative artifact (the override file is applied automatically by compose). Updated docs/integrations/openclaw.md §5b with the "Network bridge survival across rebuilds" troubleshooting section so operators hitting the symptom find the fix instead of re-running the script.
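The backup-before-replace behavior described for the override install can be illustrated with a small decision function. The real logic lives in the shell script scripts/configure-openclaw.sh; the Go types and function names below (overridePlan, planOverrideInstall, backupStamp) are hypothetical and exist only to show the rule: leave an identical file alone, preserve a differing one under a timestamped .bak name, then write.

```go
package main

import (
	"bytes"
	"fmt"
	"time"
)

// overridePlan describes what a hypothetical installer would do.
type overridePlan struct {
	Write      bool   // write the new override file
	BackupPath string // non-empty when an existing, differing file is preserved first
}

// backupStamp renders the YYYYMMDD-HHMMSS suffix used for the backup name.
func backupStamp(t time.Time) string {
	return t.Format("20060102-150405")
}

// planOverrideInstall decides whether to write the override and whether an
// existing file must be backed up before it is replaced.
func planOverrideInstall(target string, existing []byte, exists bool, desired []byte, stamp string) overridePlan {
	if exists && bytes.Equal(existing, desired) {
		return overridePlan{} // already up to date: nothing to do
	}
	plan := overridePlan{Write: true}
	if exists {
		// A differing (possibly hand-edited) file is never silently clobbered.
		plan.BackupPath = target + ".bak." + stamp
	}
	return plan
}

func main() {
	desired := []byte("generated override\n") // placeholder content, not the real file
	now := time.Date(2026, 6, 1, 12, 30, 45, 0, time.UTC)
	plan := planOverrideInstall("docker-compose.override.yml", []byte("hand-edited\n"), true, desired, backupStamp(now))
	fmt.Printf("%+v\n", plan)
}
```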

Added

artifact.put / artifact.get / artifact.list packs — typed surface for the artifact store, replacing prose-instruction "save to / read from artifacts" guidance that Tier C free models silently ignore. Motivating observation: the tech-blog-publisher OpenClaw skill was generating blog content correctly on openai/gpt-oss-120b:free but returning the markdown inline in the chat response instead of depositing it under the artifact store the way the SKILL.md prose instructed. Same failure mode the validation arc solved at the pack-output layer per ADR 052: turn an advisory step into a typed pack call so model tier doesn't matter. artifact.put accepts {content, kind, filename?, content_type?, encoding?, namespace?} and returns {artifact_key, url, size, content_type, filename, namespace}. The kind hint (one of blog, markdown, transcript, summary, json, text, html, csv, binary) drives default filename + content_type so skills don't have to think about MIME types — kind:"blog" → content.md + text/markdown. encoding:"base64" opt-in for binary content the JSON envelope can't carry literally; unsupported encodings reject fast rather than silently passing base64 text through as if it were UTF-8. Filename safety: leading slashes stripped, .. segments resolved, path.Clean applied, empty/./.. fall back to the kind default. artifact.get is the symmetric reader: input {artifact_key, encoding?}, output {content, encoding, content_type, size, artifact_key, filename, namespace}. Encoding policy: text-shaped content types (text/*, application/json, application/yaml, application/xml, *+json, *+xml, *+yaml per RFC 6839) return as UTF-8 strings by default; everything else returns base64 so a non-UTF-8 byte sequence doesn't blow up the JSON envelope. Callers can force either with encoding:"utf-8" / encoding:"base64". 
artifact.list is the introspection capability: input {namespace?, filename?, limit?} (filename is a case-insensitive substring match, not a glob), output {artifacts:[...], count, truncated}. Default limit 100 entries, newest-first sort by created_at. Pair artifact.list (find the key) with artifact.get (read the bytes) when an operator may have uploaded a file the agent needs to discover, or to enumerate what a multi-pack skill produced. No external deps, no NeedsSession — pure passthroughs to the existing ArtifactStore interface (already on ExecutionContext.Artifacts since T205). Each pack registers in cmd/control-plane/main.go in the always-available section. Test surface: 78 new test cases across three pack files. artifact.put covers happy path, all 9 kind defaults + unknown-kind fallback + case-insensitivity, explicit filename/content_type override of kind defaults, custom namespace, base64 round-trip, filename sanitization (absolute path, .. traversal, internal .. cleanup, ./../empty defaults), and 7 error paths (missing content, empty content, no store wired, bad base64, unsupported encoding, malformed JSON, store-backend failure). artifact.get covers UTF-8 vs base64 routing across 9 text content types and 7 binary content types, forced-encoding overrides in both directions, key-split parsing for filename/namespace extraction, 6 error paths (missing/empty/whitespace key, no store, not-found, malformed JSON), and a round-trip test that chains artifact.put → artifact.get to confirm the value comes back unchanged. artifact.list covers empty store, listAll, namespace filter, filename substring filter (case-insensitive + suffix matching), namespace+filename combined filter, limit + truncation, and 4 error paths. All three packs hit 97-100% per-function coverage; internal/packs/builtin stays above the 80% floor. 
Skill pattern documented in docs/reference/packs/artifact/put.md: every skill in ~/.openclaw/skills/ that produces audience-facing content should end its procedure with a mandatory helmdeck__artifact-put call. The pattern was introduced specifically because Tier C free models on OpenRouter (openai/gpt-oss-120b:free, meta-llama/llama-3.3-70b-instruct:free, etc. — see ADR 051 and the models reference) ignored the prose deposit step. What's NOT in this PR (explicit scope boundary): the POST /api/v1/artifacts upload endpoint that would let the management UI write operator-uploaded files into the store. That's a separate small PR with its own design questions (per-caller namespacing, MIME allowlists, size limits, auth posture). Once it lands, the round trip closes: operator uploads via REST → agent finds via artifact.list → agent reads via artifact.get → agent processes and deposits via artifact.put. Until then, artifact.list/get are still useful for inspecting pack-produced sidecars (validation.json, engagement.json, captions.srt) and for skills that chain artifacts between stages.
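Two of the rules in this entry, the artifact.get encoding policy and the artifact.put filename safety, lend themselves to a short sketch. The function names (isTextContentType, responseEncoding, sanitizeFilename) are assumptions made for illustration and are not taken from the pack source; only the rules themselves come from the entry above.

```go
package main

import (
	"fmt"
	"mime"
	"path"
	"strings"
)

// isTextContentType reports whether a content type is "text-shaped":
// text/*, application/json|yaml|xml, or an RFC 6839 +json/+xml/+yaml suffix.
func isTextContentType(ct string) bool {
	mediaType, _, err := mime.ParseMediaType(ct)
	if err != nil {
		mediaType = strings.ToLower(strings.TrimSpace(ct))
	}
	switch {
	case strings.HasPrefix(mediaType, "text/"):
		return true
	case mediaType == "application/json", mediaType == "application/yaml", mediaType == "application/xml":
		return true
	}
	for _, suffix := range []string{"+json", "+xml", "+yaml"} {
		if strings.HasSuffix(mediaType, suffix) {
			return true
		}
	}
	return false
}

// responseEncoding picks the output encoding: a forced value wins, otherwise
// text-shaped content is UTF-8 and everything else is base64.
func responseEncoding(contentType, forced string) (string, error) {
	switch strings.ToLower(strings.TrimSpace(forced)) {
	case "":
		if isTextContentType(contentType) {
			return "utf-8", nil
		}
		return "base64", nil
	case "utf-8", "base64":
		return strings.ToLower(strings.TrimSpace(forced)), nil
	default:
		return "", fmt.Errorf("unsupported encoding %q", forced)
	}
}

// sanitizeFilename roots the name before path.Clean so ".." segments resolve
// against "/" instead of escaping, strips leading slashes, and falls back to
// the kind default when nothing usable is left.
func sanitizeFilename(name, fallback string) string {
	cleaned := strings.TrimLeft(path.Clean("/"+strings.TrimSpace(name)), "/")
	if cleaned == "" || cleaned == "." || cleaned == ".." {
		return fallback
	}
	return cleaned
}

func main() {
	enc, _ := responseEncoding("application/vnd.api+json", "")
	fmt.Println(enc)
	fmt.Println(sanitizeFilename("../../etc/passwd", "content.md"))
	fmt.Println(sanitizeFilename("..", "content.md"))
}
```

Rejecting unknown encodings outright, rather than passing the bytes through, matches the "reject fast" behavior the entry describes for artifact.put.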

[0.26.0] - 2026-06-05

The validation-arc release. Closes the four-phase AV-validation arc (PRs #428 / #430 / #431 / #432 / #433) — script → pack → upstream fix → default-on integration → ADR record. Token cost of "the video has issues" diagnostics drops from ~3,000 tokens per incident (manual ffprobe loop) to ~200 tokens (read validation.checks[] from the run record) per ADR 052. Sibling work: tier-aware Budget.PromptVariant for helmdeck.plan (ADR 053, #437) routes Tier C models to a single_pick one-step-at-a-time plan path, addressing the 50% multi-step plan failure rate observed against openrouter/nvidia/nemotron-3-super-120b-a12b:free during validation-arc testing. New documentation surface: operator-facing models tier reference (#439), intent-first cookbook with 17 recipes (#435 + #441), two field-report blog posts capturing the arc + cookbook-pattern thesis. Pack catalog grows from 52 → 53: new av.validate pack ships no-gateway-required.

New packs: av.validate (helmdeck__av-validate via MCP).

Operator upgrade: clean — no schema migrations, no breaking input changes, no removed packs. Validate *bool and CaptionsSidecar *bool are pointer-bool default-on with backward-compatible nil-→-on semantics. Production deployments using the published ghcr.io/tosin2013/helmdeck-sidecar:latest get the new validation script automatically on next pull (the script COPY is in the sidecar Dockerfile). Local builders using compose.build.yaml get the local override automatically per #434.

Added

helmdeck.plan gets tier-aware prompt templates via Budget.PromptVariant — ADR 053. Motivated by empirical data captured during the validation-arc testing window on 2026-06-05: six helmdeck.plan calls against openrouter/nvidia/nemotron-3-super-120b-a12b:free for the same multi-step intent class showed 3/6 success (50%), with 33% length-truncation at the 600-token output cap and 17% near-empty responses with the canonical reasoning-token leak pattern (423-token completion, 71 chars of user-visible JSON — TokenMix measures the analogous behavior at ~40% on DeepSeek R1 with max_tokens=200). Same intent class on openrouter/auto in the same window: 2/2 clean stops at 15–34s latency. The architectural finding — captured in the new blog draft (PR #436) and the new ADR 053 — is that output shape, not model size, is the right primitive: small models reliably make ONE pack-pick decision in 50–200 tokens but fail at emitting a 1,500-token multi-step plan in one shot. BFCL data confirms the multi-turn cliff at the model-family level (xLAM-2-1B at 53.97% overall but 8.38% multi-turn; Qwen3-1.7B at 55.49% overall but 16.88% multi-turn — TinyLLM, arXiv 2511.22138). Two prompt variants now ship: PromptVariantFullSteps (Tier A/B default) emits the complete pipeline JSON in one shot (today's behavior — same planSystemPrompt template); PromptVariantSinglePick (Tier C default) emits the SINGLE NEXT step + a more_steps_likely flag, agent re-calls helmdeck.plan with updated context to plan the next step. The output schema is the same across both variants ({steps:[], complexity, more_steps_likely, reasoning}) so the handler doesn't need to parse two response shapes — only the model's TASK changes. 
Selection via Budget.ResolvePromptVariant(): an explicit PromptVariant field on a budget entry wins; otherwise tier defaults apply (Tier A/B → FullSteps, Tier C and unknown → SinglePick; the fallback for unknown tiers is the conservative path, matching the ADR 051 "we don't know, route to the safer path" posture). Operators override per-entry when their per-model knowledge contradicts the tier default: e.g. a Tier C model trained specifically for tool calling that handles multi-step plans reliably should get FullSteps even though the tier default would select SinglePick. Output additions: planOutput gains prompt_variant_used (which template fired) and more_steps_likely (set by SinglePick on the first step of a chain; always false on FullSteps). Both are omitempty, so the wire shape is stable for callers that haven't migrated. Backward compatibility: Tier A/B behavior is identical to pre-ADR-053 (same prompt, same output, same model interactions). Only Tier C and unknown-tier models see a behavioral change, and that change is that they now produce reliably parseable output where 50% of the time they previously did not. Agent loop pattern: the single_pick variant composes naturally with the MCP agent loop: agent calls helmdeck.plan → runs the step → calls helmdeck.plan again with updated context → repeats until more_steps_likely:false. Each call is a self-contained Tier-C-sized decision; the catalog projection is already cached on prefix-cache-enabled providers per ADR 051 PR #4, so the per-step cost is dominated by output tokens. Regression guards at two layers: TestResolvePromptVariant_TierDefaults + TestResolvePromptVariant_ExplicitOverride in internal/llmcontext/budgets_test.go assert the variant resolution rules; TestSelectPlanSystemPrompt in internal/packs/builtin/plan_test.go asserts template-marker presence per tier+variant (Tier A → "ORDERED sequence of tool/pipeline calls", Tier C → "Emit EXACTLY ONE step in the steps array", and the override paths in both directions). 
Same rule-with-test posture PR #404 introduced for the "no -c copy" audio-concat guard. Architectural framing in ADR 053: routes by output shape, not parameter count; references the literature converging on the same point (Portkey "Smart Fallback with Model-Optimized Prompts", DSPy compile-per-LM Signatures, PLAN-TUNING arXiv 2507.07495, Pre-Act arXiv 2505.09970, Anthropic's "Building Effective Agents" essay). Future-deferred: a PromptVariantHybrid value for Tier B models that handle short multi-step plans but not full pipelines, deferred until we have empirical data on a specific Tier B model failing the current FullSteps posture; speculative variants without motivating evidence are how the variant enum bloats into a footgun.
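The variant resolution rules in this entry can be sketched as below. The type layout is an assumption: the entry names Budget.ResolvePromptVariant(), PromptVariantFullSteps and PromptVariantSinglePick, but the Tier type, the field names, and the "full_steps" string value are hypothetical ("single_pick" is the spelling the entry itself uses).

```go
package main

import "fmt"

// Tier is a hypothetical representation of the model tier.
type Tier string

const (
	TierA Tier = "A"
	TierB Tier = "B"
	TierC Tier = "C"
)

// PromptVariant selects which plan prompt template fires.
type PromptVariant string

const (
	PromptVariantFullSteps  PromptVariant = "full_steps" // value assumed
	PromptVariantSinglePick PromptVariant = "single_pick"
)

// Budget is a reduced stand-in for a budget entry; only the two fields that
// matter for variant resolution are modeled.
type Budget struct {
	Tier          Tier
	PromptVariant PromptVariant // empty means "use the tier default"
}

// ResolvePromptVariant applies the rule from the entry: an explicit variant
// wins, Tier A/B default to FullSteps, and Tier C or any unknown tier takes
// the conservative SinglePick path.
func (b Budget) ResolvePromptVariant() PromptVariant {
	if b.PromptVariant != "" {
		return b.PromptVariant
	}
	switch b.Tier {
	case TierA, TierB:
		return PromptVariantFullSteps
	default:
		return PromptVariantSinglePick
	}
}

func main() {
	fmt.Println(Budget{Tier: TierB}.ResolvePromptVariant())
	fmt.Println(Budget{Tier: "mystery"}.ResolvePromptVariant())
	fmt.Println(Budget{Tier: TierC, PromptVariant: PromptVariantFullSteps}.ResolvePromptVariant())
}
```

Putting the unknown-tier case in the default branch means a newly added tier gets the safer path until someone decides otherwise.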
av.validate default-on integration in slides.narrate + podcast.generate — Phase 3 of 4 in the validation arc. Phase 1 (#428) shipped the standalone script. Phase 2 (#430) wrapped it as the av.validate pack. Phase 3 is the token-savings payoff the entire arc was built for: every successful slides.narrate and podcast.generate run now embeds the structured validation report directly in the run output. The next "the video has issues" diagnostic costs ~200 tokens (read validation.checks[] from the run record) instead of the ~3,000-token manual ffprobe loop we ran before the validator existed. Refactor first: the core validation logic in internal/packs/builtin/av_validate.go was extracted into a reusable runAVValidation(ctx, ec, opts) (scriptReport, string, error) function. The av.validate pack handler now calls it after resolving artifact-keys to paths; the new slides.narrate and podcast.generate post-concat steps call it directly with paths already in the session tmpfs (no double-fetch overhead — the whole point of accepting video_path / audio_path direct inputs back in Phase 2). The function applies the known-issue demotion map, persists the validation.json sidecar under the caller's namespace, and returns the typed report. slides.narrate integration: new Validate *bool input field on the pointer-bool default-on pattern (mirrors CaptionsSidecar from PR #425 and Mermaid from PR #379 — nil → on, &false → off). Validation runs at the new step 9b between video upload (step 9) and engagement metadata (step 10): runAVValidation is called with VideoPath: "/tmp/final.mp4" (still on disk in the sidecar) and CaptionsPath: captionsValidatePath (the SRT bytes written to /tmp/captions-validate.srt when sidecar is enabled but burn-in is not — a ~10 KB tmpfs write whose result is consumed by the script's srt:* + consistency:captions_coverage checks). 
The artifact namespace is set to "slides.narrate" so the validation.json sidecar lives next to the engagement.json + captions.srt artifacts the pack already persists. podcast.generate integration: same pattern. New Validate *bool input; validation runs after audio artifact upload at progress ~97%. runAVValidation is called with AudioPath: "/tmp/helmdeck-podcast/final.mp3" (the path internal/podcast/concat.go uses for its concat output). Audio-only invocation means mp4:* and consistency:audio_video_duration checks skip automatically per the script's argv dispatch — only audio:packet_contiguity, audio:rms_sweep, audio:loudness_lufs, and audio:silence_runs run, which is the correct check set for an MP3 output. The artifact namespace is "podcast.generate". Output additions: both packs gain validation (the structured report — shape mirrors av.validate's output: {checks[], passed, failed, warnings, all_passed}) and validation_artifact_key (the persisted sidecar) in their OutputSchema.Properties. validation_artifact_key is always emitted (empty string when validate is off or the script failed) so consumers can branch on its presence. validation is conditionally added — present when validate ran successfully and produced ≥1 check, absent when validate is off OR the script invocation failed. Soft-surface contract preserved (the load-bearing reason validation runs default-on): validation script-exec failures, JSON-parse failures, and validation findings (checks at any severity) all log and continue rather than failing the pack. The artifact is the value; validation is a description of the artifact. Operators who want fail-fast behavior call av.validate standalone with strict:true (Phase 2's escape hatch); the default-on integration intentionally never blocks artifact ship. scripts/pipelines-smoke.sh refactor: the mp4:av and mp3:av assertion specs gain a new validation_assert helper as the primary signal. 
When the run record contains a validation field, the smoke script reads validation.all_passed and short-circuits to green-success. When the field is absent (validate explicitly disabled, OR runs from before Phase 3 shipped during avbench's cutover window), the script falls back to the legacy inline ffprobe checks (mp4_faststart_ok, audio_packets_contiguous, audio_rms_above, audio_codec_params_ok, plus the engagement/captions structural assertions). Net effect: an avbench run on a post-Phase-3 artifact now does most of its work by reading one JSON field; the inline checks become a backwards-compatibility safety net rather than the primary verification path. Regression guards at three layers: TestSlidesNarrate_ValidationDefaultOn and TestPodcastGenerate_ValidationDefaultOn confirm the handler invokes av-validate.sh when the pointer-bool is nil (default-on) AND the output schema validates even when the script fails (the soft-surface contract). TestSlidesNarrate_ValidationExplicitlyDisabled and TestPodcastGenerate_ValidationExplicitlyDisabled confirm validate:false suppresses the script call entirely. The existing TestAVValidate_NoDemotionsInForce from the #429 fix continues to assert no checks are demoted and the demotion mechanism still works. Coverage: full go test ./internal/... -race -count=1 passes 2,006+ tests across 32 packages. Coverage gate PASS at every floor (internal/packs/builtin 80.6%). Phase 4 of the validation arc remaining: ADR audit of ADR 008, ADR 015, ADR 045, ADR 051 for the implications of default-on validation + a new ADR-052 capturing the architecture (severity policy, known-issue demotion lifecycle, soft-surface contract, script-delivery via sidecar Dockerfile COPY).
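The pointer-bool default-on pattern used for Validate (nil → on, &false → off) is compact enough to show in full. The struct below is a hypothetical reduction of the pack input, and the JSON key "validate" is inferred from the validate:false wording above; only the nil-means-on semantics are taken from the entry.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// narrateInput is a hypothetical, reduced input struct. A *bool distinguishes
// "caller said nothing" (nil) from "caller said false".
type narrateInput struct {
	Validate *bool `json:"validate,omitempty"`
}

// enabledByDefault implements the default-on rule: nil or true means on,
// only an explicit false turns the feature off.
func enabledByDefault(flag *bool) bool {
	return flag == nil || *flag
}

// shouldValidate decodes a raw JSON input and applies the default-on rule.
func shouldValidate(rawInput string) (bool, error) {
	var in narrateInput
	if err := json.Unmarshal([]byte(rawInput), &in); err != nil {
		return false, fmt.Errorf("decode input: %w", err)
	}
	return enabledByDefault(in.Validate), nil
}

func main() {
	for _, raw := range []string{`{}`, `{"validate":true}`, `{"validate":false}`} {
		on, err := shouldValidate(raw)
		fmt.Println(raw, on, err)
	}
}
```

A plain bool could not express this, because its zero value is indistinguishable from an explicit false, which is why existing callers that omit the field keep working and get validation on.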
av.validate pack — Phase 2 of 4 in the validation arc (Phase 1 shipped the standalone script in PR #428). The pack wraps scripts/av-validate.sh so any pipeline or agent can call validation as a typed surface and read structured findings rather than re-deriving the diagnostic flow from scratch every time. Token-savings rationale (the load-bearing motivation for the whole arc): every manual "the video has issues" diagnostic burns ~3,000 tokens of bash output + analysis. This pack collapses that to ~200 tokens once Phase 3 wires it as a default-on post-step. Pack inputs: video_artifact_key / audio_artifact_key / captions_artifact_key (fetched from the artifact store and written to /tmp/av-validate-{video,audio,captions}.{mp4,mp3,srt} in the session before invoking the script), OR video_path / audio_path / captions_path (direct paths — useful for chained-pack scenarios where the file is already in the session /tmp, eliminating double-fetch overhead Phase 3 will rely on). Plus ebur128_target (default -14 LUFS, YouTube spec; -23 for broadcast), skip_checks (comma-separated; video:freeze_runs is default-skipped because slide-deck videos hold a static image per slide and that check false-positives 100%), and strict (boolean, default false). Pack outputs: validation (object with checks[], passed, failed, warnings, all_passed mirroring the script's --json shape) + validation_artifact_key (the persisted validation.json sidecar — same pattern as engagement.json / captions.srt from #424 / #425). Severity policy is honest: the script reports each check at its natural severity (fail for matches-shipped-bug-fixes, warn for soft heuristics). The pack then overrides the script's severity for checks listed in an internal knownIssueDemotions map. When a fail-severity check is in the map, the pack demotes it to warn and appends the tracking-issue reference to the detail string. 
Current demotions (will shrink as fixes land): consistency:audio_video_duration → demoted to warn per issue #429. The 888de7b23142ba81 artifact diagnostic during Phase 1 development surfaced that PadAudioToMin produces duration-stretched AAC packets at slide boundaries — exactly 13 packets summing 26.246s on the symptom artifact, matching the 25.9s timeline-vs-content discrepancy. The audio PLAYS correctly (665s narration + 26s inter-slide pauses = 691s timeline); the container metadata over-claims because each silence-pad becomes a single AAC frame with stretched duration metadata. The demotion is coupled to the tracking issue, not to a release calendar — when the fix lands (replace GenerateSilence + ConcatAudio pad with an -af apad filter in runSegmentEncode; ~30 LOC), the same PR removes the entry from knownIssueDemotions, bumping severity back to fail together with the underlying fix. Same-PR coupling makes the regression guard impossible to silently leave behind. Default behavior is soft-surface (strict:false): the pack returns success even when checks fail; the findings ARE the output. The orchestrating LLM agent reads validation.all_passed, sees the specific check names + details, and decides whether to retry / escalate / report — matching the project norm "honest output > convenient lie" and the typed-error model from ADR 008 where errors are for "couldn't proceed," not for quality findings. Strict mode (strict:true): any fail-severity check failure after demotion surfaces as a typed CodeArtifactFailed error with the failing check names in the message. Use this for CI publish gates and downstream consumers that can't tolerate processing a structurally-invalid artifact. 
Runtime-error vs check-finding distinction is intentional (closes a class of confusing-error bugs): exit code 2 from the script (missing dependency / usage error — validation DIDN'T RUN) returns CodeHandlerFailed; failed checks (validation RAN AND REPORTED FINDINGS) return success with the findings in the output unless strict:true. Script-delivery mechanism: deploy/docker/sidecar.Dockerfile gains a COPY scripts/av-validate.sh /usr/local/bin/av-validate.sh && chmod +x step (alongside the existing helmdeck-entrypoint copy). The pack handler invokes the script via session exec at the stable /usr/local/bin/av-validate.sh path — no Go //go:embed complexity, no file duplication. ffprobe / ffmpeg / python3 (the script's only dependencies) are already in the sidecar from earlier installs; PR #425's libass smoke check confirms ffmpeg has the filters the script's silencedetect / volumedetect / blackdetect / ebur128 calls rely on. Test surface: 7 new unit tests in internal/packs/builtin/av_validate_test.go: input validation (no v/a inputs → CodeInvalidInput); happy-path JSON parse + sidecar artifact persist; known-issue demotion (the #429-class JSON in → all_passed:true, warnings:1, detail string contains #429); strict-mode surface (strict:true + fail-severity check → CodeArtifactFailed naming the failing check); soft-surface default (same inputs without strict → success with findings); script-exit-2 distinction (CodeHandlerFailed not check failure); argv wiring (paths, ebur128_target, skip_checks all flow through; --json is always passed). Pack registers in cmd/control-plane/main.go in the always-available section (no LLM, vault, or egress-guard deps). Not yet integrated as a post-step on slides.narrate / podcast.generate — that's Phase 3, explicitly deferred until this Phase 2 pack has been called against 5-10 real artifacts (avbench monthlies + ad-hoc operator invocations) to confirm the false-positive rate after demotion is acceptable. 
Remaining in the validation arc: Phase 3 (default-on integration) and Phase 4 (ADR audit plus a new ADR-052 capturing the architecture).
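The interplay between known-issue demotion, the soft-surface default and strict mode can be sketched as follows. The check and report field names are assumptions (the entry only names checks[], passed, failed, warnings and all_passed), and strictError stands in for the typed CodeArtifactFailed error with a plain Go error.

```go
package main

import (
	"fmt"
	"strings"
)

// check is a hypothetical shape for one entry of the script's checks[] array.
type check struct {
	Name     string
	Severity string // "fail" or "warn"
	Passed   bool
	Detail   string
}

// report mirrors the summary fields named in the entry.
type report struct {
	Checks    []check
	Passed    int
	Failed    int
	Warnings  int
	AllPassed bool
}

// knownIssueDemotions maps a check name to its tracking issue.
var knownIssueDemotions = map[string]string{
	"consistency:audio_video_duration": "#429",
}

// applyDemotions downgrades failing fail-severity checks that have a tracking
// issue to warn, appends the issue reference, and recomputes the summary.
func applyDemotions(in []check) report {
	r := report{Checks: make([]check, 0, len(in))}
	for _, c := range in {
		if issue, ok := knownIssueDemotions[c.Name]; ok && !c.Passed && c.Severity == "fail" {
			c.Severity = "warn"
			c.Detail += " (demoted to warn, tracking issue " + issue + ")"
		}
		switch {
		case c.Passed:
			r.Passed++
		case c.Severity == "fail":
			r.Failed++
		default:
			r.Warnings++
		}
		r.Checks = append(r.Checks, c)
	}
	r.AllPassed = r.Failed == 0
	return r
}

// strictError is what strict:true would surface; in soft-surface mode the
// caller simply returns the report and never calls this.
func strictError(r report) error {
	var failing []string
	for _, c := range r.Checks {
		if !c.Passed && c.Severity == "fail" {
			failing = append(failing, c.Name)
		}
	}
	if len(failing) == 0 {
		return nil
	}
	return fmt.Errorf("artifact_failed: %s", strings.Join(failing, ", "))
}

func main() {
	r := applyDemotions([]check{
		{Name: "consistency:audio_video_duration", Severity: "fail", Detail: "delta=27.930s"},
		{Name: "mp4:faststart", Severity: "fail", Passed: true},
	})
	fmt.Printf("all_passed=%v warnings=%d strict_err=%v\n", r.AllPassed, r.Warnings, strictError(r))
}
```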
scripts/av-validate.sh — standalone validator for slides.narrate / podcast.generate AV artifacts (Phase 1 of 4 in the validation arc). Every time an operator reports "the video has issues" we run the same manual ffprobe sweep: auth → fetch artifact → check faststart → sample RMS at intervals → verify packet contiguity → eyeball duration parity. The 888de7b23142ba81-video.mp4 diagnostic we just ran burned ~3,000 tokens of bash output + analysis to discover an audio/video duration mismatch (27.930s of trailing video-without-audio past the audio stream's end) — a finding that's trivially expressible as a single JSON field. This script is the executable spec for that diagnostic: a 350-LOC bash + python3 + ffprobe/ffmpeg validator that takes a video/audio/captions path and emits either a colored human report or a structured JSON document. Phase 2 will wrap it as an av.validate pack; Phase 3 will integrate that pack as a default-on post-encode step on slides.narrate and podcast.generate so the validation result lands in the run record's validation field — collapsing the next "video has issues" diagnostic from ~3,000 tokens to ~200. Phase 4 audits the relevant ADRs (008/015/045/051) and lands a new ADR-052 capturing the architectural decisions. 
Check set, calibrated to the bugs we've actually shipped fixes for (each labeled fail-severity, exits the script non-zero on regression): mp4:faststart (PR #422 — moov-before-mdat via pure-Python byte scan, no ffprobe dep), mp4:codec_pin (PR #421 — h264 + aac LC + 44.1 kHz pinned via ffprobe -show_entries stream), mp4:bitstream_decode (research §"Deep Bitstream Decoding" — ffmpeg -v error -xerror -err_detect crccheck+bitstream+buffer -f null - null muxer pass, catches macroblock corruption that survives the muxer but fails decoders), audio:packet_contiguity (PR #423-class — packet pts gap > 0.5s indicates the ElevenLabs partial-response cascade), audio:rms_sweep (5-point sweep, -45 dB floor, catches silent-fallback regressions), consistency:audio_video_duration (the bug we just found — audio_content_duration = aframes × 1024 / sample_rate vs container format=duration, 1s tolerance), srt:first_cue_anchor (PR #425 — must be exactly 00:00:00,000 for YouTube CC import), srt:comma_separator (period decimal silently parses as hours in some libass builds — 7-hour offset captions), consistency:captions_coverage (last SRT cue end within 2s of audio_content_duration). Plus three warn-severity heuristics that surface for review but don't fail the run: audio:loudness_lufs (EBU R128 integrated loudness, YouTube target -14 ± 2 LUFS via ebur128 filter — drift surfaces operators shipping out-of-spec audio that platforms then normalize aggressively), audio:silence_runs (silencedetect=noise=-50dB:d=2, ≥2s runs flagged — could be legitimate between-slide pauses), video:black_runs (blackdetect=d=2.0:pix_th=0.10 — catches marp render failures inserting accidental long black frames). The video:freeze_runs check (freezedetect=n=-60dB:d=2) is implemented but default-skipped via the SKIP_CHECKS env var because slides.narrate output is static-image-per-slide by design — every slide IS technically a freeze, so the check false-positives 100% of the time on our dominant use case. 
Talking-head pipelines (none exist yet) should --no-skip it. Operator interface: make av-validate VIDEO=/path/to.mp4 CAPTIONS=/path/to.srt JSON=1 or call the script directly with --video / --audio / --captions / --json / --ebur128-target / --skip-checks. Exit code 0 = no fail-severity check failed (warns may be present); exit 1 = at least one fail-severity check failed; exit 2 = usage error or missing dependency. Acceptance test against the artifact that motivated this work (slides.narrate/888de7b23142ba81-video.mp4): the script correctly fires consistency:audio_video_duration with container=693.344s audio_content=665.414s delta=27.930s exceeds 1s tolerance and exits 1, while every other applicable check passes — confirming the script catches what the manual diagnostic found AND doesn't false-positive on the surrounding healthy parts of the artifact. What this PR explicitly does NOT do (per the plan's "what this deliberately doesn't" section): no MP4Box/GPAC integration (CVE risk per CVE-2026-9572 / CVE-2026-7135 / CVE-2025-70116; functionally redundant with ffprobe for our use case where we control encoding); no Bento4 mp4dump deep atom inspection (overkill); no mp3val / mp3check (we control encoding so garbage MP3 frames aren't a realistic failure mode); no QCTools / qcli analog-tape forensics (we don't have analog tape); no MediaConch policy compliance (no operator has asked for institutional-archive policy schemas); no untrunc repair tooling (fix root causes upstream in the encoder, not patch corrupted output); no pack wrapping yet — that's Phase 2, deliberately deferred until the standalone script has been run against 5-10 real artifacts to confirm the false-positive rate is acceptable. 
Reusable patterns leaned on: the green/red/yellow color helpers + ffprobe wrapper functions from scripts/pipelines-smoke.sh (audio_packets_contiguous, audio_rms_above, audio_codec_params_ok, mp4_faststart_ok, captions_assert) are lifted with path-arg wrappers, so Phase 3 will refactor pipelines-smoke.sh mp4:av / mp3:av checks to read the validation field from the run record instead of re-implementing the same probes inline — net LOC reduction across the codebase once Phase 3 lands.
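The consistency:audio_video_duration check reduces to simple arithmetic, sketched here in Go although the real validator is bash + python3. The function names are hypothetical; the formula (aframes × 1024 / sample_rate), the 1s tolerance and the acceptance-test figures come from the entry above.

```go
package main

import (
	"fmt"
	"math"
)

// aacSamplesPerFrame is the per-frame sample count used by the formula above.
const aacSamplesPerFrame = 1024

// audioContentDuration converts a decoded AAC frame count into seconds of
// actual audio content.
func audioContentDuration(frames int64, sampleRate int) float64 {
	return float64(frames) * aacSamplesPerFrame / float64(sampleRate)
}

// checkAudioVideoDuration compares the container-reported duration with the
// audio content duration and reports whether they agree within tolerance.
func checkAudioVideoDuration(containerSec, audioContentSec, toleranceSec float64) (bool, string) {
	delta := math.Abs(containerSec - audioContentSec)
	summary := fmt.Sprintf("container=%.3fs audio_content=%.3fs delta=%.3fs", containerSec, audioContentSec, delta)
	if delta > toleranceSec {
		return false, fmt.Sprintf("%s exceeds %gs tolerance", summary, toleranceSec)
	}
	return true, fmt.Sprintf("%s within %gs tolerance", summary, toleranceSec)
}

func main() {
	// Figures from the acceptance test against the motivating artifact.
	ok, detail := checkAudioVideoDuration(693.344, 665.414, 1)
	fmt.Println(ok, detail)
}
```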
Multi-model recovery matrix workflow + openrouter/auto-as-default decision rule (v0.26.0 candidate). PR H of the v0.25.0 arc proved the cheap-model bet against ONE pinned model (openai/gpt-oss-120b:free recovers correctly on all 5 typed-error scenarios at ≥7/10). This PR takes the proof and turns it into a discovery mechanism: which other free models on OpenRouter handle helmdeck's typed-error contract reliably, and should openrouter/auto be surfaced as a recommended default for users without a configured API key? NEW .github/workflows/model-discovery.yml — weekly Wednesday 06:00 UTC (different day from model-recovery.yml's Sunday so the two don't compete for the runner pool). 4-row matrix with fail-fast: false: openai/gpt-oss-120b:free (required tier, MUST pass) + google/gemma-4-31b-it:free (observational, threshold modifier -1 for the size gap) + nvidia/nemotron-3-ultra-550b-a55b:free (observational, modifier 0) + openrouter/auto (observational, modifier -1 for per-call routing variance). continue-on-error: ${{ matrix.tier != 'required' }} — only the pinned model can fail the workflow; observational rows publish their per-scenario scores but don't block. An aggregator job downloads every per-model report and posts a combined summary table to the run page with a 3-state status legend: ✓ (all scenarios passed), ⚠ (at least one below threshold but received responses), ✗ (at least one scenario fully dark — provider may be deprecated). NEW internal/reliability/recovery_test.go gains the HELMDECK_RECOVERY_THRESHOLD_MODIFIER env var. Default 0 preserves v0.25.0 single-pinned-model behavior; the matrix workflow sets per-row modifiers so weaker observational models can have honest lower thresholds without globally weakening the v0.25.0 reliability bet. 
Gemma-4-31B at threshold-1 is the same reliability story as gpt-oss-120B at threshold+0 — "reliably correct in 60% of cases for the smaller model" vs "70% for the larger" — documented per-model so the comparison stays honest. Floor-clamped at 1 — even the most accommodating row demands "model emitted a usable recovery at least once." NEW .github/workflows/model-discovery-alert.yml — separate workflow with issues: write permission scoped HERE only. Chains off model-discovery.yml via workflow_run. Opens (or comments on) a GitHub issue when an observational row scores 0/N on at least one scenario — "fully dark" signals provider deprecation, model-id rotation, or upstream unreachability. A row at 4/10 is normal variance and does NOT trigger an alert; the narrow 0/N threshold avoids weekly issue spam. Duplicate-issue avoidance: searches for an open issue with label model-discovery-alert and the exact model in the title; if found, comments on it instead of opening a duplicate. Splitting the alert into a separate workflow confines the elevated issues: write permission to ~50 lines of YAML and keeps model-discovery.yml at contents: read — smaller blast radius if either workflow is ever compromised. NEW docs/howto/multi-model-recovery.md — operator-facing guide: per-row purpose, threshold-modifier rationale, summary-table reading guide, and the load-bearing decision rule for openrouter/auto-as-helmdeck-default: ≥7/10 across all 5 scenarios for 6 consecutive weekly runs → surface as recommended default; <5/10 on any scenario → never offer; between those lines → document the gaps in the howto and leave routing to the operator. The rule is in the howto (not in code) because it's a product decision informed by the matrix data. What's NOT in v0.26.0 (deliberate scope decisions): the actual UI change to recommend openrouter/auto as a default — that lands AFTER the 6-week observation window produces the evidence, in a separate small PR that cites the matrix run window. 
A long-term trend dashboard rendering per-scenario scores across weekly runs: deferred until maintainers find themselves diffing artifacts often. Auto-swap of the required row when an observational row outperforms it: deliberately manual so a maintainer confirms the new pin. Combined cadence after this PR: model-recovery.yml (pinned model, weekly Sunday) + mutation.yml (decision-dense code mutation, daily 04:00 UTC) + model-discovery.yml (4-model matrix, weekly Wednesday) = three workflows producing reliability signal across different time horizons. Total cost: ~260 runner-min/month (easily covered by the free tier, comfortable on GitHub Pro).
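The threshold modifier and the openrouter/auto decision rule above can be sketched in a few lines of Go. This is an illustrative sketch, not the shipped test code: the function names, the Verdict type, and the reading that the "<5/10" rule applies to every supplied run while the 6-week streak is counted over the most recent runs are all assumptions.

```go
package main

import "fmt"

// effectiveThreshold applies a per-row modifier to the base pass threshold
// and floor-clamps the result at 1, as the entry describes.
func effectiveThreshold(base, modifier int) int {
	t := base + modifier
	if t < 1 {
		return 1
	}
	return t
}

// Verdict is a hypothetical name for the three outcomes of the decision rule.
type Verdict string

const (
	Recommend    Verdict = "recommend-as-default"
	NeverOffer   Verdict = "never-offer"
	DocumentGaps Verdict = "document-gaps"
)

// decideAutoDefault takes weekly runs in chronological order; each run holds
// the per-scenario scores out of 10.
// Assumption: any score below 5 in any supplied run means "never offer",
// and the 6-week streak must be the trailing (most recent) streak.
func decideAutoDefault(weeks [][]int) Verdict {
	streak := 0
	for _, scores := range weeks {
		allStrong := len(scores) > 0
		for _, s := range scores {
			if s < 5 {
				return NeverOffer
			}
			if s < 7 {
				allStrong = false
			}
		}
		if allStrong {
			streak++
		} else {
			streak = 0
		}
	}
	if streak >= 6 {
		return Recommend
	}
	return DocumentGaps
}

func main() {
	fmt.Println(effectiveThreshold(7, -1), effectiveThreshold(7, 0), effectiveThreshold(1, -1))
	week := []int{7, 8, 9, 10, 7}
	fmt.Println(decideAutoDefault([][]int{week, week, week, week, week, week}))
}
```

Keeping the rule as a pure function over recorded scores matches the entry's point that the decision is a product call made from matrix data, not something the workflow enforces.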
Captions/SRT support on slides.narrate: sidecar default-on, burn-in opt-in (v0.26.0 candidate). PR #424 declared engagement.captions_recommended: true but didn't actually produce captions — operators saw the recommendation but had to write SRT files themselves. This PR makes the recommendation actionable. Sidecar SRT (default-on): a captions.srt artifact persisted alongside the MP4. YouTube/Vimeo auto-import as the CC track via Studio "Subtitles → Upload file → With timing" — the path that backs the research-cited ~12-13% YouTube view boost (sidecar CC, NOT burn-in). Essentially free: a few KB of bytes per run, zero encode cost. Burn-in (opt-in via captions_burn_in:true): renders captions into every frame via ffmpeg's libass subtitles= filter. Required on platforms that don't surface CC tracks (Twitter/X embedded videos, LinkedIn embeds, raw MP4 downloads viewed in players without CC support). The two outputs feed from the same SRT byte stream — generating once and persisting as sidecar is the cheap default; burn-in adds the encode cost + OOM risk when explicitly requested. NEW internal/packs/builtin/slides_captions.go — pure-function buildSRT(slides, durations) []byte + formatSRTTimestamp(seconds) string. Kept out of slides_narrate.go (already ~1,270 LOC) to match the slides_notes.go separation pattern. formatSRTTimestamp is intentionally DISTINCT from the existing formatTimestamp (M:SS with period, used by YouTube chapter markers in the engagement object) — SRT spec mandates the wider HH:MM:SS,mmm field AND a COMMA decimal separator; using a period would parse as hours in some libass builds and produce 7-hour-offset captions. Co-located rationale comments cite the spec source so a future refactor can't conflate them. 
Text normalization inside buildSRT: CRLF/CR → LF (paste-from-Word safety), per-cue whitespace strip, empty-notes → single literal space (preserves cue numbering so cue N corresponds to slide N+1 — operators reviewing the .srt by eye need this alignment for sane debugging). Pack inputs: captions_sidecar *bool (mirrors the Mermaid pointer-bool default-on shape — nil ⇒ on, explicit false ⇒ off) and captions_burn_in bool (default false). OutputSchema additions: captions_artifact_key: "string" (empty when sidecar suppressed or artifact-store Put failed) + captions_burned_in: "boolean" (ALWAYS emitted so consumers can branch on its presence — even when false). Handler integration sits between the close of the audio-generation loop (durations finalized) and the start of the per-segment encode loop — the only point where both slides[] and durations[] are simultaneously known. Burn-in wiring appends ,subtitles=/tmp/captions.srt to the existing vf chain right after the fade-filter block — same comma-separated filter-chain shape, no escaping needed because /tmp paths have no spaces or quotes. Failure semantics are intentionally soft: sidecar artifact-store Put failures log + continue (captions are auxiliary; failing a 3-minute encode over an artifact-store hiccup is worse than degraded output); burn-in write failures degrade to no-burn rather than failing the segment encode. Pipeline wiring: builtin.repo-presentation already inherits the default-on sidecar via the pack default; Produces updated to include srt_captions; Limitations gains a captions-honesty entry alongside the existing engagement-honesty entry — distinguishes the cheap sidecar path from the costly burn-in. 
Sidecar Dockerfile smoke (deploy/docker/sidecar.Dockerfile): the existing ffmpeg -version check is extended with ffmpeg -filters | grep -q ' subtitles ' so an image build fails LOUDLY if a future apt change drops libass support from the ffmpeg package — prevents the confusing Unrecognized option 'subtitles' exit at run-time that would otherwise surface only on the first captions_burn_in:true run. Burn-in OOM honesty per [[feedback-pipeline-description-honesty]] (and explicitly user-confirmed during planning): document the risk, don't engineer around it. The pack Description warns that burn-in adds 5-50% encode wall-clock + 20-50 MB per encoder thread; on memory-tight hosts with large decks the existing OOM-retry path may fire AND (if libass-with-threads=1 also OOMs) fail the run. No auto-fallback retry that silently drops captions (would violate honest-output preference); no preflight memory check (would add brittle estimation that's hard to keep accurate). The engagement.format_ceiling_note already establishes the precedent that helmdeck describes real constraints rather than papering over them. Test surface: 5 new pure-function tests in slides_captions_test.go (cue numbering, timestamp format with HH:MM:SS,mmm + comma separator + cumulative arithmetic, empty-notes cue preservation, multiline normalization, timestamp edge cases) + 4 new handler-level tests in slides_narrate_test.go (sidecar default-on emits non-empty captions_artifact_key; explicit captions_sidecar:false suppresses; captions_burn_in:true wires ,subtitles=/tmp/captions.srt into the per-segment ffmpeg argv AND sets captions_burned_in:true; the new output keys round-trip through OutputSchema.Validate). The existing TestSlidesNarrate_RealOutputMatchesSchema schema-contract test continues to gate Engine.Execute validation for the wider output shape. 
Pipeline-smoke / avbench (#423) deep asserts: pipelines-smoke.sh mp4:av spec gains a captions_assert block that extracts captions_artifact_key, fetches the SRT via the existing fetch_artifact helper, asserts size > 30 bytes (sane floor), and greps for both --> (cue separator) AND 00:00:00,000 (YouTube-acceptance signature: comma decimal, NOT period). Graceful CAPTIONS_ABSENT skip when the operator disabled the sidecar via captions_sidecar:false. Out of scope (deferred per the plan, each with one-line justification): per-word/karaoke captions (TTS gives per-cue timing only; word-level alignment needs Whisper or a forced-aligner, separate pack); caption styling (font/color/position would need ASS/SSA format and a styling schema; libass renders SRT in a serviceable default for v0); podcast.generate captions (audio-only output, no canvas to burn into); WebVTT (.vtt) sidecar (YouTube/Vimeo accept both; defer until an operator hits a platform that doesn't accept SRT); multi-language captions (operator can translate the SRT externally; bundled translation needs an LLM step + per-language artifact persistence); thumbnail with caption preview (orthogonal feature); auto-language-detect filename (video.en.srt matching; YouTube imports via Studio manual upload regardless of filename, and a future PR can detect language from engagement.language and rename). All affected packages pass go test ./internal/packs/builtin/ -race -count=1 (867 tests; +9 vs the engagement PR baseline of 858 in this package alone, adjusted up after the 5+4 new tests landed).
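The SRT rules described in this entry (HH:MM:SS,mmm with a comma separator, cumulative cue timing, CRLF/CR normalization, and a single-space placeholder for empty notes) can be sketched as follows. The function names come from the entry; the parameter types (notes as []string, durations as []float64 seconds) are assumptions, and the shipped slides_captions.go may differ in detail.

```go
package main

import (
	"bytes"
	"fmt"
	"math"
	"strings"
)

// formatSRTTimestamp renders seconds as HH:MM:SS,mmm. The comma before the
// milliseconds is what the SRT format expects; a period is not accepted.
func formatSRTTimestamp(seconds float64) string {
	if seconds < 0 {
		seconds = 0
	}
	ms := int64(math.Round(seconds * 1000))
	h := ms / 3600000
	m := ms % 3600000 / 60000
	s := ms % 60000 / 1000
	return fmt.Sprintf("%02d:%02d:%02d,%03d", h, m, s, ms%1000)
}

// buildSRT emits one cue per slide. Cue i+1 covers slide index i, starting
// where the previous cue ended.
// Assumption: notes[i] is the narration text for slide i and durations[i]
// is its audio length in seconds.
func buildSRT(notes []string, durations []float64) []byte {
	var buf bytes.Buffer
	start := 0.0
	for i, note := range notes {
		if i >= len(durations) {
			break
		}
		// Normalize line endings, then strip surrounding whitespace.
		text := strings.ReplaceAll(note, "\r\n", "\n")
		text = strings.ReplaceAll(text, "\r", "\n")
		text = strings.TrimSpace(text)
		if text == "" {
			text = " " // keep the cue so numbering stays aligned with slides
		}
		end := start + durations[i]
		fmt.Fprintf(&buf, "%d\n%s --> %s\n%s\n\n",
			i+1, formatSRTTimestamp(start), formatSRTTimestamp(end), text)
		start = end
	}
	return buf.Bytes()
}

func main() {
	fmt.Print(string(buildSRT([]string{"Hello\r\nworld", ""}, []float64{2, 1.5})))
}
```

Rounding to integer milliseconds once, before splitting into fields, avoids a fractional part that rounds up to 1000 and overflows the three-digit field.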
Engagement-metadata best practices baked into slides.narrate + podcast.generate (v0.26.0 candidate). Operator question: "should there be built-in best practices for video / podcast generation in the packs? what makes a YouTube video or podcast actually get views?" External research surfaced concrete, research-validated rules for each platform (YouTube official chapter spec, retention-curve data on the first 30 seconds, Apple Podcasts chapters guidance, Buzzsprout 2025 listen-duration data, Podcasting 2.0 namespace) AND the honest reality that slide-deck-with-voiceover videos sit in the lower retention bracket vs talking-head regardless of metadata polish (5-12pp structural gap that no prompt closes). This PR bakes the rules in as hard prompt constraints, ships them as a typed engagement output object on both packs, and surfaces the format-ceiling reality in three machine-readable places so the system can't silently drift optimistic. slides.narrate.engagement (YouTube-shaped): {title, title_char_count, description, chapters:[{timestamp,title,seconds}], hashtags, tags, hook_30s, captions_recommended, category, language, format_ceiling_note}. Structural rules enforced by the prompt — title 45-55 chars target (hard cap 60), first chapter MUST be at 0:00, ≥3 chapters when video > 7min, ≥10s between chapter starts, 3-5 hashtags, hook follows the pattern-interrupt → payoff-promise → commitment-hook structure that retention research validates. podcast.generate.engagement (Apple Podcasts + Podcasting 2.0): {title, subtitle, summary, show_notes_md, chapters:[{startTime,title}], hook_30s, cta:{placement,copy}, language, format_ceiling_note}. Structural rules: title 60-80 chars takeaway-first, chapters[0].startTime always 0, ≥3 chapters when episode > 10min and ≥120s each, cta.placement is force-overridden to "mid-roll" server-side regardless of what the LLM emitted — a defensive layer that means a future prompt drift can't silently flip the research-validated placement. 
Operator-overridable inputs (per the plan §2): metadata_model (podcast: default-on at openrouter/auto — pass "" to disable; slides: stays opt-in for back-compat), cta_style (podcast: natural/direct/none), hashtag_count (slides: clamped to 3-5), category+language (slides + podcast, server-authoritative override of LLM-emitted values). Everything else (chapter floors, char caps, hook structure, 0:00 anchor) is non-overridable — the research is unambiguous and an override would just let drift back to the patterns the research warns against. Sidecar artifact: both packs persist engagement.json alongside the binary artifact (mirrors the existing metadata.json pattern slides.narrate already had). New OutputSchema fields: engagement (object), engagement_artifact_key (string) on both packs. BREAKING change on slides.narrate: the v0.25.x metadata + metadata_artifact_key fields are renamed to engagement + engagement_artifact_key. helmdeck is pre-1.0 (CHANGELOG header authorizes breaking changes per minor release); the renamed path was already opt-in via metadata_model so consumers are power-users who can adapt. Engagement is a strict superset of the old metadata shape (gains chapters as a structured array, hashtags, hook_30s, captions_recommended, title_char_count, format_ceiling_note). Three-layer format-ceiling honesty per the user's stored preference ("pipeline descriptions must match the mechanism"): (1) engagement.format_ceiling_note constant string baked into both pack output objects — slides.narrate carries the explicit talking-head retention-gap note; podcast.generate carries the solo-vs-cohost honest caveat. (2) PipelineMetadata.Limitations entries added to builtin.repo-presentation, builtin.repo-readme-podcast (newly .withMeta()-promoted), and builtin.prompt-narrated-video (also newly .withMeta()-promoted) — each pipeline now declares the engagement-metadata reality alongside its existing constraints. 
(3) Pack Description suffix on slides.narrate — keeps the catalog (which agents read first via helmdeck://packs) consistent with the run-time output. Pipeline wiring: builtin.repo-presentation threads metadata_model:"openrouter/auto" into its narrate step so pipeline runs get engagement metadata default-on (the bare pack stays opt-in). builtin.repo-readme-podcast inherits the podcast's default-on behavior automatically. Test surface: 6 new unit tests on slides side (engagement shape, disabled path, operator-override-LLM precedence, hashtag-count clamp, existing happy-path retrofitted to assert constant enrichment) + 3 new on podcast side (default-on engagement, disabled-via-empty-string, custom cta_style/language prompt-shape verification). The existing schema-contract tests (TestSlidesNarrate_RealOutputMatchesSchema, TestPodcastGenerate_RealOutputMatchesSchema) continue to gate Engine.Execute → OutputSchema.Validate so a future field rename can't silently violate the declared schema — closes the [[feedback-pack-tests-bypass-execute-validation]] gap for this PR's surface. Pipeline-smoke / avbench (#423) deep asserts: pipelines-smoke.sh mp4:av spec now asserts engagement.title_char_count <= 60, engagement.chapters[0].timestamp == "0:00", len(engagement.chapters) >= 3 when video > 7min; mp3:av asserts engagement.chapters[0].startTime == 0, ≥3 chapters when episode > 10min, engagement.cta.placement == "mid-roll". Engagement helpers degrade gracefully when the field is absent (yellow ENGAGEMENT_ABSENT line) so pipelines without metadata_model set don't false-fail. 
Out of scope (deferred per the plan §"Out of scope"): transcript/SRT caption artifact (research-validated ~13% YouTube view boost — distinct artifact + handler work; user explicitly chose to defer), thumbnail generation (better as a dedicated image.thumbnail pack with aspect-ratio + face-prominence rules), end-screen / cards JSON (publish-time concern → future youtube.publish pack), auto-publish to YouTube/Spotify (needs OAuth + credential contract), A/B title variants (requires a downstream picker that doesn't exist), RSS feed XML generation (engagement object carries the data; serialization is a publish concern), B-roll/motion-graphics insertion (the actual lever against the format ceiling — touches the codec path, needs its own pack), and auto-validation of LLM-emitted chapters against the structural rules at LLM-call time (the prompt enforces; if the LLM violates, we accept what it produced and let avbench catch drift over time rather than pad with stub chapters, which would be dishonest output). All affected packages pass go test ./internal/packs/builtin/ ./internal/pipelines/... -race -count=1 (856 tests across 2 packages).
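The structural rules for slides.narrate.engagement read naturally as a checklist. The sketch below expresses them in Go for illustration only: the entry places the real assertions in pipelines-smoke.sh and deliberately does not validate at LLM-call time, so the type and function names here are hypothetical and only the field names (title, chapters, seconds, hashtags) come from the entry.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// Chapter and Engagement mirror a subset of the engagement object's fields.
type Chapter struct {
	Timestamp string
	Title     string
	Seconds   int
}

type Engagement struct {
	Title    string
	Chapters []Chapter
	Hashtags []string
}

// clampHashtagCount keeps an operator-supplied hashtag_count inside 3-5.
func clampHashtagCount(n int) int {
	if n < 3 {
		return 3
	}
	if n > 5 {
		return 5
	}
	return n
}

// checkSlidesEngagement returns one message per violated rule.
// Assumption: the title cap is counted in characters (runes), not bytes.
func checkSlidesEngagement(e Engagement, videoSeconds int) []string {
	var violations []string
	if n := utf8.RuneCountInString(e.Title); n > 60 {
		violations = append(violations, fmt.Sprintf("title is %d chars, hard cap is 60", n))
	}
	if len(e.Chapters) > 0 && e.Chapters[0].Seconds != 0 {
		violations = append(violations, "first chapter must start at 0:00")
	}
	if videoSeconds > 7*60 && len(e.Chapters) < 3 {
		violations = append(violations, "videos over 7 minutes need at least 3 chapters")
	}
	for i := 1; i < len(e.Chapters); i++ {
		if gap := e.Chapters[i].Seconds - e.Chapters[i-1].Seconds; gap < 10 {
			violations = append(violations, fmt.Sprintf("chapter %d starts %ds after the previous one, minimum is 10s", i+1, gap))
		}
	}
	if n := len(e.Hashtags); n < 3 || n > 5 {
		violations = append(violations, fmt.Sprintf("%d hashtags, expected 3-5", n))
	}
	return violations
}

func main() {
	e := Engagement{
		Title:    "Example deck walkthrough",
		Chapters: []Chapter{{Timestamp: "0:05", Seconds: 5}, {Timestamp: "0:09", Seconds: 9}},
		Hashtags: []string{"#a", "#b", "#c"},
	}
	for _, v := range checkSlidesEngagement(e, 480) {
		fmt.Println(v)
	}
}
```

Returning a list of violations instead of failing on the first one fits the entry's stance that drift is observed and reported over time, not patched with stub output.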
Monthly avbench workflow + mp4:av / mp3:av deep asserts in pipelines-smoke.sh — catches the bug class unit tests structurally can't see. Motivating bug: PR #422 fixed an MP4 +faststart regression that had silently shipped with every helmdeck-produced video since #379 (six months — the entire lifetime of slides.narrate). Every unit test in internal/avenc/audio_test.go pinned the ffmpeg argv shape; the broken file had a valid-looking command line, so coverage stayed green at 99.3% while every operator's MP4 was streaming-broken. The only test that would have caught it is one that runs the real pipeline end-to-end and ffprobes the output. This is that test. NEW .github/workflows/avbench.yml — runs the first Sunday of every month at 04:00 UTC (different time from model-recovery.yml's 06:00 Sun + model-discovery.yml's 06:00 Wed so the three workflows don't compete for runner slots) plus workflow_dispatch for ad-hoc runs. Brings up the full helmdeck stack from current source via scripts/install.sh --no-smoke --no-embeddings, runs builtin.repo-presentation (slides.narrate path → MP4) + builtin.repo-readme-podcast (podcast.generate path → MP3) against a configurable public repo (default tosin2013/helmdeck), deep-verifies each artifact, uploads on failure with 30-day retention, tears down. NEW deep-assert specs in scripts/pipelines-smoke.sh — extending the existing mp4 and mp3 assertions which previously only checked file magic + minimum size (the bugs they could see were "file is empty" / "wrong format" — both rare). 
The new mp4:av spec adds: (1) faststart layout check via pure-Python moov-before-mdat scan, always runs (no external dep); the regression-impossibility guard for #422; (2) audio packet contiguity via ffprobe -show_packets, fails on any consecutive packet gap > 0.5s — catches mid-segment dropouts where packets simply stop, the class of bug an ElevenLabs 200 OK with truncated body would cascade into; (3) RMS sanity sampled at 5 evenly-spaced 2-second windows across the file, fails if any window's mean is below -45 dB — catches the "TTS silent-fallback fired for slide N" failure mode; (4) codec/sample-rate verification — aac + 44100 Hz for MP4, mp3 + 44100 Hz for MP3 — catches encoder drift (codec swap, sample-rate not pinned). The mp3:av spec applies (2)-(4) but skips (1) since MP3 has no moov atom. Both specs degrade gracefully when ffprobe isn't installed (yellow "audio NOT verified" line), same posture pdf_pages takes for pdfinfo — keeps the script useful on hosts without the optional dep. NEW gate: elevenlabs — checks HELMDECK_ELEVENLABS_API_KEY is reachable to the control-plane (env-var fallback path) before running an av case. Like firecrawl / docling, when the gate isn't satisfied the case is SKIPPED (not failed) so PR contributors without a TTS key get clean green local runs. Cost analysis: ElevenLabs Creator-tier per-character rate × ~3,000-4,500 chars per run × 12 runs/year = ~$1/year in TTS credits at current rates. GitHub Actions: 2-3 minutes per run × 12/year = ~30 runner-minutes/year. Artifact storage: ~10 MB × 30-day retention = ~300 MB-days/year. All three are trivially small. What this catches that unit tests can't (the load-bearing claim): regressions in the FINAL artifact that look fine at every other layer. Container muxing flag drift (the #422 shape). ElevenLabs API-shape change. Native AAC encoder regression. Sample-rate-not-pinned reintroduction. Silent-fallback firing on a slide that should have audio. 
These all surface as "audio sounds wrong" operator reports days/weeks after they ship; the workflow turns each into a maintainer-visible monthly red signal. Repository secrets required to enable: ELEVENLABS_API_KEY (TTS) + OPENROUTER_API_KEY (LLM for slides.outline / podcast scripting). Without either secret the preflight emits a ::warning:: and skips the run cleanly — same gating shape the model-recovery workflow uses. What's deliberately NOT in this PR: (1) per-PR fire on internal/avenc/** changes — the monthly cadence is enough to catch upstream drift; per-PR adds noise for code-level changes the unit tests already cover. Could be added later if the existing unit-test pins prove insufficient. (2) The bash-side ffprobe asserts don't currently capture downloaded artifacts to the upload path (the script's mktemp dir is cleaned on exit). A follow-up could persist failures into /tmp/avbench-artifact-*.bin so the workflow's upload-artifact step has something to attach — useful when a future operator wants to ffprobe the broken file rather than re-run locally. (3) Long-term trend dashboard rendering pass/fail across months — deferred until a maintainer finds themselves diffing artifacts often, same posture as the model-discovery trend.
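The faststart check in (1) comes down to walking the top-level MP4 boxes and seeing whether moov or mdat appears first. The entry describes the shipped check as a pure-Python scan inside pipelines-smoke.sh; the sketch below shows the same idea in Go, with hypothetical function names, operating on an in-memory byte slice for simplicity.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// moovBeforeMdat walks top-level boxes (4-byte big-endian size, 4-byte type)
// and reports true when moov is seen before mdat, which is the faststart layout.
func moovBeforeMdat(data []byte) (bool, error) {
	n := uint64(len(data))
	off := uint64(0)
	for off+8 <= n {
		size := uint64(binary.BigEndian.Uint32(data[off : off+4]))
		switch string(data[off+4 : off+8]) {
		case "moov":
			return true, nil
		case "mdat":
			return false, nil
		}
		hdr := uint64(8)
		switch size {
		case 0: // box extends to the end of the file
			size = n - off
		case 1: // 64-bit size follows the type field
			if off+16 > n {
				return false, errors.New("truncated 64-bit box header")
			}
			size = binary.BigEndian.Uint64(data[off+8 : off+16])
			hdr = 16
		}
		if size < hdr || size > n-off {
			return false, fmt.Errorf("invalid box size %d at offset %d", size, off)
		}
		off += size
	}
	return false, errors.New("neither moov nor mdat found")
}

// box builds a minimal box with a zeroed payload, for demonstration only.
func box(typ string, payloadLen int) []byte {
	b := make([]byte, 8+payloadLen)
	binary.BigEndian.PutUint32(b[:4], uint32(len(b)))
	copy(b[4:8], typ)
	return b
}

func main() {
	fast := append(append(box("ftyp", 8), box("moov", 4)...), box("mdat", 4)...)
	slow := append(append(box("ftyp", 8), box("mdat", 4)...), box("moov", 4)...)
	fmt.Println(moovBeforeMdat(fast))
	fmt.Println(moovBeforeMdat(slow))
}
```

Because the scan only reads box headers, it needs no external tool, which is why this check can always run even when ffprobe is absent.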

Changed

Cookbook expansion (+7 recipes) + new blog draft about the cookbook pattern. The cookbook shipped in PR #435 had 10 recipes across 5 sections; user feedback on it surfaced demand for more entries (community + contributor angle). This PR adds 7 more recipes — every one validated against the actual shipped pack surface in docs/PACKS.md (no recipes for hypothetical capabilities). New recipes by section: Repos → code work gains "Audit a repo's code for a security pattern" (repo.fetch + cmd.run grep + LLM analysis, with the session-chaining contract note) and "Generate developer documentation from a codebase" (repo.fetch + repo.map + blog.rewrite_for_audience — flagged as a candidate for a builtin.repo-onboarding-doc pipeline). Web → structured output gains "Extract structured data from a single-page web app" (web.scrape_spa with CSS-selector schema; distinguished from web.scrape and web.test) and "Compare two competitor products' marketing pages" (web.scrape × N + blog.rewrite_for_audience with the persona knob for honest-vs-weighted comparison). Validation + reliability gains "Strict-mode validate before publishing" (av.validate strict:true as the CI publish gate — bridges the soft-surface default to the typed-error path per ADR 052). NEW section "Media & creativity" — three recipes targeting weekend-builder / hobbyist intents: "Generate AI artwork from a text prompt" (image.generate via fal.ai with the FLUX schnell-vs-pro cost trade-off documented), "Find stock photos for a topic" (stock.search Pexels-backed, with the photos-vs-illustration decision note for chaining), "Build a quick demo video from a HyperFrames description" (hyperframes.compose + hyperframes.render with the HyperFrames-vs-slides.narrate selection guidance). Plus "Generate marketing copy for an upcoming release" (repo.fetch + blog.rewrite_for_audience + image.generate chain — flagged as another candidate composition for a future builtin.repo-release-marketing pipeline). 
Total cookbook coverage: now 17 recipes across 6 sections covering repos-as-content, web extraction, code work, validation, media generation, and memory. NEW blog draft website/blog/2026-06-05-cookbook-pattern.md — "Recipe-style docs are dramatically underused. Here's the case for them." Frames the cookbook pattern as a generalizable docs technique that survives outside this codebase: the "I don't know what to type" gap is bigger than most docs systems account for; recipe-style docs reward composition because each entry stands alone; recipes are honest about what your system can do (the Tip block has space for non-obvious behavior that tutorials sell-past and reference can't fit). The post documents the four-field recipe shape (OpenClaw prompt + direct invocation + outputs + Tip) and includes a 5-step "how to contribute a recipe" walkthrough so the post functions as a contributor on-ramp. Cited time estimates: ~3 hours for a tutorial vs ~15 minutes for a recipe; per-recipe ROI is high; partial coverage (unlike a tutorial series where missing entries break later ones) is still valuable. draft: true per the template workflow; flips to draft: false in a follow-up after maintainer review. Targets ~1,100 words; tags contributor-experience + field-report + agent-architecture. Build verified locally via cd website && npm run build — both the cookbook page and the blog draft build clean with no broken links. What this PR doesn't do: doesn't create the pipeline candidates the recipes flag (builtin.repo-onboarding-doc, builtin.repo-release-marketing) — those are tracked as cookbook tips rather than filed issues today; revisit when concrete demand emerges. Doesn't extend the cookbook into integrations not currently shipped (Notion / Slack / Linear / Jira recipes — those packs are in the pack-candidate backlog #73–#80; they get cookbook entries when the packs ship). 
Doesn't refresh the cost-table sync (README.md + docs/explanation/why-helmdeck.md + 2026-05-08 blog); that stays deferred to the next release cut per RELEASES.md §"Agent sync checklist" step 6.
NEW docs/reference/models.md — operator-facing tier table. Closes the last deferred item from PR #435 ("docs/reference/models.md operator-facing tier table — depends on the tier-aware PromptVariant work landing first") and the natural-next-page sequencing from PR #437 (the tier-aware PromptVariant work that landed first). Surfaces the tier system to operators with a single information-oriented lookup page. Page structure: (1) "How tier affects behavior" — a 6-row matrix showing what changes per tier across catalog projection, output budget, helmdeck.plan prompt variant, strict-JSON mode, prefix-cache routing, and LLM filter pass. Each row links to the source ADR. (2) "When you'll see Tier C behavior" — the three situations that trigger the conservative path (explicit Tier C entry, prefix match, unknown model fallback) with a note explaining why parameter count is the wrong proxy (openrouter/nvidia/nemotron-3-super-120b-a12b:free is 120B parameters but Tier C because free-tier inference quality doesn't match what parameter count alone suggests — cites the validation-arc blog post's 50% multi-step success measurement). (3) Tier A table — 13 entries (anthropic/claude-opus-, claude-sonnet-, claude-3.7-sonnet, claude-haiku-, openai/gpt-4o, gpt-5, o3-mini, google/gemini-2.5-pro, gemini-2.5-flash, plus the OpenRouter relays) with input ceiling, output budget, strict-JSON / prefix-cache / hybrid-reasoning flags, and the calibration source from budgets.go. (4) Tier B table — 8 entries covering Llama 3.1/3.3 70B, Gemma 2, Mistral, DeepSeek V4 Pro / V3.2 / chat, Grok. (5) Tier C table — 7 entries (openrouter/openrouter/free, nvidia/nemotron-, z-ai/glm-, qwen/qwen-2.5-, moonshotai/kimi-k2, moonshotai/kimi-, tencent/) with the Tier C-specific notes about what happens on the single_pick path. (6) "Picking a model for your goal" — 6 scenarios (most reliable, lowest cost, exercise the agent loop, best Tier B price, max context, reasoning models) → recommended model + rationale. 
(7) "Overriding the tier" — when and how to set Budget.PromptVariant explicitly on an entry to defy the tier default. (8) Cross-links to ADRs 050/051/053, the calibrate-model-tiers HOWTO, the free-models-and-context HOWTO, and the validation arc blog post that motivated the single_pick design. Cross-links from existing docs: docs/howto/calibrate-model-tiers.md Related section gains a top-of-list pointer at /reference/models (the calibration methodology produces entries that surface here); docs/howto/free-models-and-context.md Related section gains the same plus an explicit pointer at ADR 053. docs/reference/index.md gains a "Models reference" bullet alongside the cookbook and prompt-templates entries in the pack-catalog section. Sidebar registration: website/sidebars.ts reference section gains 'reference/models' between agent-memory and the Prompt-templates category — same posture as PR #438 caught (orphan markdown pages slip through review when authors forget the sidebar registration; this PR registers proactively). Posture: information-oriented lookup, not a tutorial — the file reads like a contract, not prose. Operators looking up "what's my tier" or "why is my plan emitting one step at a time" land here and get a 5-second answer. The architectural narrative continues to live in the ADRs; this page is the index over the data. Verified locally via cd website && npm run build before push (catches the doc-id-vs-sidebar-id class of bug PR #438 hit). 
Sequencing impact: with this PR, the only loose threads from the validation-arc session are (a) the cost-table sync between README.md + docs/explanation/why-helmdeck.md + the 2026-05-08 blog post per RELEASES.md §"Agent sync checklist" step 6 (next release cut), (b) the orphan-page CI check that would have caught both the av-validate frontmatter mismatch in #438 and the cookbook+av sidebar omissions before merge ([good first issue] follow-up), and (c) the actual slides.narrate test run through OpenClaw end-to-end now that the stack is fully fixed.
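The three situations that lead to Tier C behavior (explicit entry, prefix match, unknown-model fallback) suggest a lookup shaped roughly like the sketch below. Everything here is an assumption for illustration: the table contents are a small sample drawn from the entry names above, the exact-then-longest-prefix ordering is a guess, and the real data lives in budgets.go.

```go
package main

import (
	"fmt"
	"strings"
)

type Tier string

const (
	TierA Tier = "A"
	TierB Tier = "B"
	TierC Tier = "C"
)

// exactEntries and prefixEntries are hypothetical stand-ins for the
// calibrated table; only a few sample rows are shown.
var exactEntries = map[string]Tier{
	"openai/gpt-4o":             TierA,
	"openrouter/openrouter/free": TierC,
}

var prefixEntries = []struct {
	prefix string
	tier   Tier
}{
	{"anthropic/claude-opus-", TierA},
	{"nvidia/nemotron-", TierC},
}

// resolveTier tries an exact entry first, then the longest matching prefix,
// and falls back to Tier C for unknown models (the conservative path).
func resolveTier(model string) Tier {
	if t, ok := exactEntries[model]; ok {
		return t
	}
	best, tier := "", TierC
	for _, p := range prefixEntries {
		if strings.HasPrefix(model, p.prefix) && len(p.prefix) > len(best) {
			best, tier = p.prefix, p.tier
		}
	}
	return tier
}

func main() {
	for _, m := range []string{"openai/gpt-4o", "anthropic/claude-opus-example", "some/unknown-model"} {
		fmt.Println(m, "->", resolveTier(m))
	}
}
```

Defaulting unknown models to the most conservative tier means a new or renamed model id degrades to the single_pick path instead of silently getting a larger budget.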
Doc refresh post-validation-arc + new docs/cookbook/intent-to-prompt.md. Five existing docs were stale after the validation arc landed (PRs #428 / #430 / #431 / #432 / #433) — they referenced the pre-validation-arc world by not mentioning av.validate at all. README.md updated: pack count 52 → 53, slides.narrate highlight row gains the validation + captions + engagement parenthetical, new av.validate row added to the "Document & vision" section. docs/PACKS.md updated: slides.narrate row's Input column gains captions_sidecar? / captions_burn_in? / validate? (the inputs that shipped in PRs #425 / #432), Output column gains engagement (renamed from metadata in PR #424) / engagement_artifact_key / captions_artifact_key / captions_burned_in / validation / validation_artifact_key. Description updated to call out the pointer-bool default-on pattern for captions + validation. podcast.generate row receives the analogous updates plus metadata_model? / cta_style? / language? (engagement defaults shipped default-on in PR #424). New av.validate row added under a new "AV utilities" section with the full 13-check set documented + severity model + strict-mode behavior + link to ADR 052. Gateway-gated count updated to 10 of 53 (43 without a gateway — av.validate has no LLM dependency). Source files section gets av_validate.go → av.validate. docs/explanation/why-helmdeck.md updated: header pack count bumped, new sixth per-task comparison entry — "Diagnosing 'the video has issues' — reliability as a token tax" — using the validation arc as a concrete worked example of the broader thesis (~3,000 LLM tokens / incident manual ffprobe loop vs ~200 tokens reading validation.checks[] from the run record). The 27.9-second audio/video duration mismatch on 888de7b23142ba81-video.mp4 (issue #429) is cited as the motivating example. 
The added paragraph closes by drawing the parallel: "moving the diagnostic mechanism from 'frontier model derives it' to 'deterministic pack computes it' is exactly the same lever as moving navigation from 'vision model interprets screenshots' to 'browser pack executes deterministic actions.'" docs/reference/prompt-templates/packs.md updated: podcast.generate template's Notes line gains a one-sentence pointer about engagement + validation defaults; new "AV utilities" section + av.validate template added between Podcast and Image sections. The template shape mirrors the rest of the file (Template / Variables / Notes blocks). NEW docs/cookbook/intent-to-prompt.md — the index this docs system has been missing. Ten worked recipes organized by intent class (repos→content, web→structured output, repos→code work, validation+reliability, memory), each showing three things: the OpenClaw natural-language prompt that resolves cleanly, the direct REST/MCP invocation underneath, and the structured output fields that land in the run record. Each recipe also has a Tip block calling out the non-obvious behavior (engagement defaults, soft-surface validation, citation handling, when to prefer pipelines over bare packs). The cookbook addresses what the Nemotron-3-super-120b-a12b:free testing surfaced as the highest-leverage onboarding gap: users not knowing what to type. The architectural alternative (a separate prompt-generator tool or website) was deliberately rejected — fragmentation cost > value when the helmdeck catalog already publishes intent metadata via /api/v1/packs and the cookbook can leverage the existing Docusaurus build. Cross-linked from the prompt-templates index (the cookbook is the intent-first index over the pack-first templates), the calibrate-model-tiers HOWTO, and the when-a-pipeline-fails HOWTO. 
What's deliberately deferred (sequenced after this PR): docs/reference/models.md operator-facing tier table (depends on the tier-aware PromptVariant work landing first), the HOWTO amendments calling out validation as a diagnostic step (small follow-up touching existing files), and refreshing the cost-table sync between README.md + docs/explanation/why-helmdeck.md + the 2026-05-08 blog post per RELEASES.md §"Agent sync checklist" step 6 (next release cut). Test plan: docs-only PR; no code touched. go vet ./... clean. CHANGELOG mirror byte-identical to website. Post-merge verification: Docusaurus build picks up the new cookbook page, internal links resolve, sitemap regenerates.
ADR 052 lands; ADRs 008, 015, 045, 051 get focused amendments — Phase 4/4 of the validation arc. Closes out the four-phase arc that started with the standalone script in PR #428. Phase 4 is the architecture record. NEW ADR 052 — "AV Output Validation as a Default-On Post-Encode Step" captures the five sub-decisions the arc encoded in code: (1) Tool selection: ffprobe + libavfilter (silencedetect, blackdetect, freezedetect, ebur128) + null-muxer decode pass + pure-Python moov-vs-mdat byte scan — explicit, per-tool rejection of MP4Box/GPAC (CVE risk + functional redundancy), Bento4 mp4dump (atom-level surgery not where our bugs live), mp3val/mp3check (over-scoped for a single-codec single-bitrate pipeline), QCTools/qcli (built for analog-tape forensics), MediaConch (policy-driven archival compliance — YAGNI), untrunc (we fix the encoder, not patch the output — same lesson as #431's apad swap). (2) Severity model: pass/warn/fail, with fail reserved for checks that match a shipped bug fix (faststart per #422, codec pin per #421, packet contiguity per #404, RMS floor, audio/video duration parity per #429→#431, SRT first-cue anchor + comma separator, captions coverage) — soft heuristics like loudness LUFS, silence runs, black-frame runs stay at warn so pipelines don't break on advisory findings. (3) Known-issue demotion lifecycle: three rules to keep the mechanism honest — file the issue first; same-PR coupling on removal; no demotions for already-warn checks. The lifecycle is enforced by the TestAVValidate_NoDemotionsInForce test (asserts the map is empty post-#431) + lifecycle documentation in the ADR. (4) Soft-surface contract: the pack's output IS the report; failing the pack over a silence_runs advisory would defeat the surface. Strict-mode (strict:true) is the opt-in escape hatch for CI publish gates. (5) Scope boundary: helmdeck-generated artifacts only. 
Operator-uploaded artifacts (future) have a different threat model (untrusted bitstreams need adversarial parsing + sandboxing posture + GPAC CVE mitigations) and get a sibling pack rather than extending av.validate's check set. ADR 008 amendment explains the severity-vs-error-code axis distinction: a failed check returns success at the runtime layer because the operation proceeded; a typed error code (CodeHandlerFailed) returns when the operation didn't proceed. strict:true is the bridge — translates fail-severity findings into CodeArtifactFailed, keeping the closed-set error vocabulary closed while letting quality findings flow as data. ADR 015 amendment documents the validation post-step as part of slides.narrate's contract: after ConcatVideoMP4s + video upload, runAVValidation runs against /tmp/final.mp4 + optional SRT path; report lands as validation field; validation.json sidecar persists alongside engagement.json + captions.srt. New input: validate *bool pointer-bool default-on. ADR 045 amendment captures the ~600 MB null-muxer decode pass memory peak on 1080p × 11-minute video: sits on top of the existing encoder peak; operators on memory-tight Compose hosts should set SessionSpec.MemoryLimit: 1g. The pass is CPU-bound short-burst, not parallel-heavy, so ProfileCompute's clamp(host_cores - 1, 1, 6) cap is unchanged. ADR 051 amendment clarifies that validation findings are NOT routed through FailureClass — the two systems target different concerns. FailureClass disambiguates empty-completion symptoms across hybrid models (safety filter vs. length truncation vs. constrained-decoding deadlock vs. timeout); validation findings are quality observations on a successfully-produced artifact. Routing retries on a silence_runs advisory would re-encode the entire video to chase a heuristic finding — burning encode time the validation step was built to save. strict:true is again the bridge for operators who explicitly opt into fail-fast. 
Architectural posture preserved across the arc: same-PR coupling between the fix and its regression guard (#431 demonstrated this — the apad swap landed with the severity-promotion test); soft-surface as the default; strict mode as the explicit opt-in for CI gates; tool-selection rationale documented per-tool so future maintainers don't need to re-derive why we said no to GPAC. What's now closed: the validation arc — Phase 1 (#428), Phase 2 (#430), Phase 3 (#432), Phase 4 (this PR). The token cost of "the video has issues" diagnostics is now ~200 tokens (read validation.checks[]) vs the previous ~3,000 tokens (manual ffprobe loop). The mechanism for catching future encoder regressions is in place at three layers (script, pack, default-on integration) with the architecture documented and the severity policy ossified in the script.
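The severity model and the strict-mode bridge described in this entry can be sketched in a few lines of Go. This is a minimal illustration, not the pack's actual code: the `Check` struct, the `Gate` function, and `ErrArtifactFailed` (standing in for CodeArtifactFailed) are hypothetical names chosen for the example.

```go
package main

import (
	"errors"
	"fmt"
)

// Severity mirrors the pass/warn/fail model described for ADR 052.
type Severity string

const (
	SeverityPass Severity = "pass"
	SeverityWarn Severity = "warn"
	SeverityFail Severity = "fail"
)

// Check is a hypothetical shape for one entry in validation.checks[].
type Check struct {
	Name     string
	Pass     bool
	Severity Severity
}

// ErrArtifactFailed is a stand-in for the CodeArtifactFailed typed error.
var ErrArtifactFailed = errors.New("artifact_failed")

// Gate implements the soft-surface contract: without strict mode the report
// is data only and never fails the pack. With strict mode, a failed check at
// fail severity is translated into an error; warn-level findings never are.
func Gate(checks []Check, strict bool) error {
	if !strict {
		return nil
	}
	for _, c := range checks {
		if !c.Pass && c.Severity == SeverityFail {
			return fmt.Errorf("%w: %s", ErrArtifactFailed, c.Name)
		}
	}
	return nil
}

func main() {
	checks := []Check{
		{Name: "silence_runs", Pass: false, Severity: SeverityWarn},
		{Name: "consistency:audio_video_duration", Pass: false, Severity: SeverityFail},
	}
	fmt.Println("default:", Gate(checks, false))
	fmt.Println("strict: ", Gate(checks, true))
}
```

Keeping the translation in one place means the quality findings stay ordinary data for every caller that does not opt in, which is the property the ADR 008 and ADR 051 amendments rely on.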
Audio quality lift for slides.narrate + podcast.generate: 128k → 192k MP3, 44.1 kHz pin throughout the avenc pipeline. Operators reported audio sounded "fine" but still noticeably compressed despite PRs #379–#408 closing every functional dropout/OOM/silent-failure bug. Audit traced the cause to four compounding lossiness events stacked on top of each other: (1) both packs hardcoded ElevenLabs output_format=mp3_44100_128 — the cheapest tier — bounding everything downstream by the source quality; (2) internal/avenc/video.go:runSegmentEncode didn't set -ar, so ffmpeg defaulted to 48 kHz AAC for MP4 output, forcing every per-segment encode to resample the 44.1 kHz TTS source through the worst-case 44.1→48 kHz non-integer libswresample path (audible high-frequency aliasing); (3) internal/avenc/audio.go:ConcatVideoMP4s re-encoded audio at concat (PR #404, load-bearing for the dropout fix) but inherited the 48 kHz mismatch and compounded AAC's psychoacoustic loss over already-AAC input; (4) internal/podcast/concat.go passed BitrateKbps: 128 explicitly, leaving no headroom for the silence-segment splice. Fix is one change applied through the whole pipeline: bump the ElevenLabs default to mp3_44100_192 (Creator-tier), pin -ar 44100 on every avenc ffmpeg command that re-encodes audio (runSegmentEncode, ConcatAudio, ConcatVideoMP4s), bump ConcatAudio's default bitrate 128→192 to match the new source, and drop the explicit 128k pin in podcast.Concat so it picks up the new default. NEW HELMDECK_ELEVENLABS_FORMAT env var is the escape hatch for operators on the ElevenLabs Starter tier (capped at mp3_44100_128): set the env var on the helmdeck process and both slides.narrate's and podcast.generate's TTS calls downgrade. Same env-var ladder shape as resolveElevenLabsKey in internal/packs/builtin/elevenlabs_creds.go; resolved per-call so a config reload doesn't require a restart. 
Kept package-local to internal/podcast and internal/packs/builtin to avoid an internal/podcast → internal/packs/builtin import cycle — minor duplication, much cleaner dependency graph. Why an env var, not a pack input: most operators set TTS tier once per deployment (their ElevenLabs subscription is fixed). A pack input would clutter every pipeline definition with a quality choice the operator doesn't change call-to-call. Env var is the right cardinality. Documented in docs/reference/packs/slides/narrate.md and docs/reference/packs/podcast/generate.md. Test surface: every flag change ships with a positive regression guard. internal/avenc/audio_test.go pins -b:a 192k + -ar 44100 on ConcatAudio's default and -c:v copy -c:a aac -b:a 192k -ar 44100 on ConcatVideoMP4s; internal/avenc/video_test.go pins -ar 44100 on the per-segment encode; internal/packs/builtin/slides_narrate_test.go pins -ar 44100 on the slides.narrate concat shape alongside the existing PR #404 dropout-regression guard. internal/podcast/elevenlabs_test.go pins the new 192k default in the API query AND adds two env-var tests (HELMDECK_ELEVENLABS_FORMAT=mp3_44100_128 returns the override, unset env returns the new 192k default) — same "rule-with-test" shape PR D's property tests follow so a future revert is loud. Cost impact: ~50% more per ElevenLabs character (Creator tier vs Starter tier per-character rate). For a typical 5-minute narrated slide deck (~750 chars/min × 5 = 3,750 chars), the absolute delta is sub-cent at current ElevenLabs rates. Operators sensitive to the cost set the env var. Why this is the right scope: reverting PR #404's concat re-encode (option to "eliminate" the lossiness chain by stream-copying audio at concat) would bring back the mid-segment AAC frame-boundary dropouts; the right move is to make the load-bearing re-encode less lossy by matching its input quality, not to eliminate it. 
PCM source format (pcm_44100), which eliminates source-side MP3 loss entirely, is available on higher ElevenLabs tiers and works through the same env var; it is deferred until operators ask for it. All affected packages pass go test ./internal/avenc/... ./internal/podcast/... ./internal/packs/builtin/... -race -count=1 (863 tests across 3 packages).
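A per-call resolution of HELMDECK_ELEVENLABS_FORMAT could look roughly like the sketch below. The function name `resolveFormat` and the injected `getenv` parameter are assumptions made for testability; trimming whitespace from the value is also an assumption, not something this changelog states.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// defaultFormat is the new default named in this entry.
const defaultFormat = "mp3_44100_192"

// resolveFormat reads the override on every call, so a changed environment
// is picked up without caching. The getenv parameter is injected so the
// function can be exercised without mutating the process environment.
func resolveFormat(getenv func(string) string) string {
	// Assumption: surrounding whitespace is ignored.
	if v := strings.TrimSpace(getenv("HELMDECK_ELEVENLABS_FORMAT")); v != "" {
		return v
	}
	return defaultFormat
}

func main() {
	// With the variable unset this prints the 192k default; a Starter-tier
	// operator would export HELMDECK_ELEVENLABS_FORMAT=mp3_44100_128.
	fmt.Println(resolveFormat(os.Getenv))
}
```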

Fixed

Compose build overlay: session sidecars now use the locally-built image, not the GHCR-published :latest. Surfaced as a missing-script bug on the first slides.narrate run after the validation arc shipped (PR #430): the COPY scripts/av-validate.sh /usr/local/bin/av-validate.sh directive was in the Dockerfile, make sidecar-build produced a fresh helmdeck-sidecar:dev image with the script present, but every session container spawned by slides.narrate's validation post-step failed with OCI runtime exec failed: ... stat /usr/local/bin/av-validate.sh: no such file or directory. Phase 3's soft-surface contract worked exactly as designed (ADR 052): the failure was logged as a warn, the artifact still shipped, the pack returned success — but the underlying mismatch had been silently masking every Dockerfile change made under the build overlay since the overlay shipped in PR #134. Root cause: compose.build.yaml previously only declared a build: directive for control-plane. The sidecar-warm service in the base compose.yaml runs docker pull ghcr.io/tosin2013/helmdeck-sidecar:${HELMDECK_VERSION:-latest} at every compose up, populating the local Docker cache with the GHCR-published image (built from the last release, not the current source). The session runtime (internal/session/docker/runtime.go:47) then defaults to that same :latest tag when HELMDECK_SIDECAR_IMAGE is unset. Net effect: developers running with the build overlay would see their control-plane changes land instantly, but their sidecar.Dockerfile changes would only take effect after a release to GHCR — defeating the whole point of building from source for local development. Fix: compose.build.yaml gains two complementary overrides. First, HELMDECK_SIDECAR_IMAGE: helmdeck-sidecar:local on the control-plane's environment block — pointing the runtime's imageOverride resolution path at a tag the build overlay can populate. 
Second, the sidecar-warm service gets repurposed: instead of image: docker:cli + command: ["docker pull ghcr.io/..."], it now declares image: helmdeck-sidecar:local + build: { context: ../.., dockerfile: deploy/docker/sidecar.Dockerfile } + entrypoint: ["true"] + command: []. Same end goal (the sidecar tag is in the local Docker cache before the control-plane starts launching sessions), inverted mechanism (BUILD from source instead of PULL from GHCR). Compose's build:+image: semantics tag the freshly-built image with the image: reference, the no-op entrypoint exits cleanly, and the freshly-built image is now what the runtime resolves to when launching session containers. What stays the same: compose.yaml without the overlay layered still pulls from GHCR as before — production deployments are untouched. scripts/install.sh already layers the overlay by default, so dev installs get the fix automatically. The HELMDECK_SIDECAR_IMAGE env var hooks into the existing override mechanism documented at runtime.go:40-47 and runtime.go:90-95 — no new code, just compose-level wiring of an already-existing knob. Verified via docker compose config showing both overrides land cleanly (HELMDECK_SIDECAR_IMAGE on control-plane env, helmdeck-sidecar:local tag + sidecar.Dockerfile build context on sidecar-warm), and via re-running the slides.narrate validation post-step which now finds the script at /usr/local/bin/av-validate.sh and lands a validation field in the pack output with consistency:audio_video_duration: pass: true, severity: fail.
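The image resolution that made the mismatch possible can be modeled as a simple override-or-default choice. The sketch below is illustrative only: `resolveSidecarImage` is a hypothetical name, and the real logic in runtime.go may differ in detail.

```go
package main

import "fmt"

// publishedImage is the GHCR tag the session runtime falls back to, per the
// root-cause description in this entry.
const publishedImage = "ghcr.io/tosin2013/helmdeck-sidecar:latest"

// resolveSidecarImage returns the override when one is configured and the
// published image otherwise. The override corresponds to the value of
// HELMDECK_SIDECAR_IMAGE on the control-plane.
func resolveSidecarImage(override string) string {
	if override != "" {
		return override
	}
	return publishedImage
}

func main() {
	// Before the fix: no override under the build overlay, so sessions ran
	// the published image rather than the locally built one.
	fmt.Println(resolveSidecarImage(""))
	// After the fix: the overlay sets the override to the locally built tag.
	fmt.Println(resolveSidecarImage("helmdeck-sidecar:local"))
}
```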
slides.narrate audio/video duration mismatch (#429): replace PadAudioToMin pre-encode silence pad with an -af apad=whole_dur=X filter inside encodeSegment. Issue surfaced during the Phase 1 (#428) av-validate.sh acceptance test against slides.narrate/888de7b23142ba81-video.mp4: ffprobe revealed exactly 13 abnormally-long audio packets at inter-slide boundaries summing to 26.246 seconds — matching the 25.9s timeline-vs-content discrepancy the validator's new consistency:audio_video_duration check detected. The audio PLAYED correctly (the decoder emitted the right samples); the bug was that each silence-pad got compressed into ONE duration-stretched AAC packet carrying metadata claiming ~2s of duration vs the natural 23.22ms-per-1024-sample-frame, pushing the audio stream's container-claimed duration ~26s past the actual content on a typical 14-slide deck (13 inter-slide pads × ~2s each). The discrepancy propagated to the video container, the SRT alignment, and the engagement chapter timestamps. Root cause flow (the previous, buggy path): TTS → narration MP3 (~3s); dur < minTurnSec → PadAudioToMin (internal/avenc/audio.go:159) → GenerateSilence(2s) via anullsrc → libmp3lame → ConcatAudio (libmp3lame re-encode) merges narration + silence MP3 → 5s padded MP3 → runSegmentEncode with -loop 1 -i image -i audio.mp3 -shortest -c:a aac -b:a 192k -ar 44100: the silent tail re-encodes to AAC and the encoder emits one duration-stretched packet covering the silent region in metadata while only containing a single 1024-sample frame of actual PCM silence. Fix (the new path): ffmpegEncodeOpts.AudioPadDur float64 field on the local encodeSegment in internal/packs/builtin/slides_narrate.go. When non-zero, the encode command gains -af 'apad=whole_dur=X.XXX' and the legacy -shortest is replaced with -t X.XXX for deterministic per-segment duration. 
apad generates the silence inline as PCM samples during the encode pass; the AAC encoder then emits normal-density 1024-sample frames covering the silent region — no more duration-stretched metadata. The handler call site (slides_narrate.go:657 area) drops the PadAudioToMin invocation and instead computes durations[i] = max(tts_dur, minTurnSec); the encode loop passes AudioPadDur: durations[i] unconditionally — apad's whole_dur is a no-op when the input audio is already at or above the target, so this is safe regardless of whether the per-slide TTS naturally exceeds the floor. PadAudioToMin itself stays in internal/avenc/audio.go because internal/podcast/concat.go still calls it for the podcast turn-padding flow. Podcast outputs are MP3 (libmp3lame end-to-end), not AAC — MP3 frames are time-uniform (1152 samples / 44100 Hz = 26.12ms each) with no per-frame duration field to stretch — so the bug doesn't manifest there and the existing podcast pad path stays correct. Same-PR severity promotion (per the av.validate demotion lifecycle): the consistency:audio_video_duration entry is removed from knownIssueDemotions in internal/packs/builtin/av_validate.go in this same PR. The check returns to its natural fail severity. Same-PR coupling means the fix and the regression guard travel together — if a future revert breaks the apad change, the validation will start failing at fail-severity again immediately. Regression guard (TestSlidesNarrate_AudioPadDur_WiresApadFilter in internal/packs/builtin/slides_narrate_test.go): asserts the per-segment ffmpeg argv contains -af 'apad=whole_dur=, contains -t (not -shortest), and explicitly does NOT contain -shortest when AudioPadDur is set. Same posture PR #404 introduced for the no--c copy audio-concat guard. 
TestAVValidate_NoDemotionsInForce (renamed from TestAVValidate_KnownIssue_DemotedToWarn) asserts the demotion map is empty and the check now lands at fail severity with no (known issue, …) suffix on the detail string — protects against accidentally re-adding a demotion entry without the corresponding tracking issue. Test suite: 2,006 tests pass across 32 packages under -race. Coverage gate PASS at every floor (internal/packs/builtin 80.6%). Phase 3 (default-on integration of av.validate as a post-step on slides.narrate/podcast.generate) is now unblocked — the validation check will surface real regressions at fail-severity going forward rather than producing a stream of pre-known warnings on every output.
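The duration floor and the argv change described in this entry can be sketched as follows. This is a simplified illustration under stated assumptions: `segmentDuration`, `buildSegmentArgs`, and `contains` are hypothetical helpers, the video-side flags are omitted, and the values in `main` are examples rather than the pack's configured floor.

```go
package main

import (
	"fmt"
	"math"
)

// segmentDuration applies the per-slide floor: max(tts_dur, minTurnSec).
func segmentDuration(ttsDur, minTurnSec float64) float64 {
	return math.Max(ttsDur, minTurnSec)
}

// buildSegmentArgs assembles a simplified per-segment ffmpeg argv. When
// audioPadDur is non-zero the silence is generated inline by the apad filter
// and the segment length is fixed with -t; otherwise the legacy -shortest
// behavior is kept.
func buildSegmentArgs(image, audio, out string, audioPadDur float64) []string {
	args := []string{"-loop", "1", "-i", image, "-i", audio}
	if audioPadDur > 0 {
		d := fmt.Sprintf("%.3f", audioPadDur)
		args = append(args, "-af", "apad=whole_dur="+d, "-t", d)
	} else {
		args = append(args, "-shortest")
	}
	return append(args, "-c:a", "aac", "-b:a", "192k", "-ar", "44100", out)
}

// contains reports whether want appears as an element of args.
func contains(args []string, want string) bool {
	for _, a := range args {
		if a == want {
			return true
		}
	}
	return false
}

func main() {
	d := segmentDuration(3.0, 5.0) // example values only
	args := buildSegmentArgs("slide.png", "narration.mp3", "segment.mp4", d)
	fmt.Println(args)
	fmt.Println("uses -shortest:", contains(args, "-shortest"))
}
```

A regression guard in the spirit of the one named above would assert on exactly these argv properties: the apad filter is present, -t is present, and -shortest is absent.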
slides.narrate MP4 playback dropout: -movflags +faststart on the final concat so streaming players can begin playback before download completes. Operator reported a deterministic audio dropout — "audio plays for ~45 seconds then goes silent; restart and it does the same thing at the same point" — that initially read like an audio-encoding bug. Live ffprobe of the affected artifact (slides.narrate/ee1d32882b4d9962-video.mp4, 5.8 MB, 229.7s duration) ruled out every audio-side cause: packets contiguous from 0 to 227.7s with no gaps, RMS uniform at -22 to -24 dB sampled every 30s across the full file, no DTS discontinuities, audio duration matches video duration. The audio track was fine. The actual bug was in the MP4 container layout: moov atom at byte 5,919,942 (97% into the file), placed after mdat — the mp4 muxer's default behavior when no -movflags +faststart is passed. Streaming consumers (HTML5 <video>, the OpenClaw chat-UI inline preview, mobile MP4 frameworks, most browser-based viewers) cannot begin playback until the entire file streams in because the seek index lives at the tail. In practice the player plays what it has buffered, then stalls — looking exactly like an audio dropout. Bug was present in every helmdeck-produced MP4 since slides.narrate shipped (#379 era); no +faststart flag has ever existed in the codebase per grep -r "faststart" internal/. Fix is one flag on internal/avenc/audio.go:ConcatVideoMP4s — -movflags +faststart — which triggers ffmpeg's second-pass moov-relocation. Confirmed via re-encode on the affected artifact: the second-pass log line [mp4 @ ...] Starting second pass: moving the moov atom to the beginning of the file appears, and the resulting file plays correctly in every player tested. 
Diagnostic methodology that surfaced this (worth preserving as a future debugging pattern): when an operator reports "audio dropout at deterministic timestamp" on a helmdeck artifact, the first move is not to read the code — it's to download the artifact and ffprobe it. The on-disk audio either has gaps in the packet stream (real audio truncation) or doesn't (container/playback issue). The two failure classes have completely different root causes and the audit reads completely different. PR #421 was scoped against the assumption it was the former; the actual issue was the latter. Test surface: positive regression guards at two levels. internal/avenc/audio_test.go:TestConcatVideoMP4s_VideoStreamCopyAudioReencode pins +faststart in the ffmpeg argv alongside the existing -c:v copy -c:a aac -b:a 192k -ar 44100 shape assertions; internal/packs/builtin/slides_narrate_test.go:TestSlidesNarrate_ConcatReencodesAudio pins +faststart at the pack-level concat shape so a future revert is caught even if someone refactors internal/avenc independently. Same regression-impossibility pattern PR #404 introduced for the no--c copy guard. External research surfaced two architectural follow-ons (deliberately out of scope for this tonight-ship fix, candidates for v0.26.0): (1) ElevenLabs has a documented "200 OK with no audio data" failure mode (status.elevenlabs.io incident 2025-11-27); the current code path reads response body via io.ReadAll(io.LimitReader(resp.Body, 32<<20)) which silently passes a truncated body through to ffmpeg. Hardening would add an ffprobe-based duration sanity check at the TTS-fetch boundary so the failure is loud instead of cascading as silent audio later. (2) The "per-segment encode → concat" architecture has known fragility around AAC priming-sample drift and DTS discontinuities (FFmpeg Trac #10379, #5448); production NLE tools (Descript, Adobe Premiere) use a "timeline → single-render" pattern where audio is one continuous track rendered in one encode pass. 
Migrating helmdeck to this pattern would eliminate an entire bug class but requires reshaping the segment loop in slides_narrate.go. This fix does not address podcast.generate (.mp3) dropouts: MP3 has no moov atom, so faststart doesn't apply. An operator reported the same symptom in podcast output; that case has a different root cause and needs its own diagnostic pass.
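The container-layout condition behind this bug (moov placed after mdat) can be checked by walking the top-level MP4 boxes. The Go sketch below is an illustration of that scan, not the project's validator; `moovBeforeMdat` and `box` are hypothetical helpers, and the sample files are synthetic byte slices built only to exercise the logic.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// moovBeforeMdat walks top-level boxes (4-byte big-endian size, 4-byte type)
// and reports whether moov is encountered before mdat.
func moovBeforeMdat(data []byte) (bool, error) {
	off := 0
	for off+8 <= len(data) {
		size := uint64(binary.BigEndian.Uint32(data[off : off+4]))
		typ := string(data[off+4 : off+8])
		header := uint64(8)
		switch size {
		case 1: // a 64-bit size follows the type field
			if off+16 > len(data) {
				return false, errors.New("truncated 64-bit box header")
			}
			size = binary.BigEndian.Uint64(data[off+8 : off+16])
			header = 16
		case 0: // box extends to the end of the file
			size = uint64(len(data) - off)
		}
		if size < header {
			return false, fmt.Errorf("invalid box size %d at offset %d", size, off)
		}
		switch typ {
		case "moov":
			return true, nil
		case "mdat":
			return false, nil
		}
		if size > uint64(len(data)-off) {
			return false, errors.New("box overruns the buffer")
		}
		off += int(size)
	}
	return false, errors.New("neither moov nor mdat found")
}

// box builds a synthetic box with a zero-filled payload of the given length.
func box(typ string, payload int) []byte {
	b := make([]byte, 8+payload)
	binary.BigEndian.PutUint32(b[0:4], uint32(8+payload))
	copy(b[4:8], typ)
	return b
}

func main() {
	faststart := append(append(box("ftyp", 4), box("moov", 16)...), box("mdat", 32)...)
	tailMoov := append(append(box("ftyp", 4), box("mdat", 32)...), box("moov", 16)...)
	for _, f := range [][]byte{faststart, tailMoov} {
		ok, err := moovBeforeMdat(f)
		fmt.Println(ok, err)
	}
}
```

A production check would stream the headers from disk instead of holding the file in memory; the in-memory form is used here only to keep the example short.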

[0.25.0] - 2026-06-04

Theme: The cheap-model reliability bet, empirically proved. Eight PRs (A–H) shipped the v0.24.0 + v0.25.0 reliability arcs as a single release. The architectural claim — that weak, cheap models can drive complex workflows iff the surrounding environment is perfectly reliable — has moved from "we have typed errors and contract tests" to "every layer (handlers, schemas, engine, MCP, S3, model recovery) has a regression-impossible backstop AND we have empirical evidence that a free 120B-class model recovers correctly from helmdeck's typed errors at ≥7/10 across all 5 reliability scenarios." Concretely: 7 packages newly tracked in the coverage gate (avenc/llmcontext/gateway/packs/builtin/api/packs/pipelines/mcp at 80–90% floors), 5 new contract / property / mutation / wire / recovery test surfaces, 2 nightly/weekly workflows surfacing reliability signal that didn't exist a week ago, ~2,000 internal tests (was ~1,650), and the first piece of empirical evidence in helmdeck's history that the cheap-model bet actually holds — openai/gpt-oss-120b:free passes all 5 typed-error recovery scenarios on the weekly model-recovery workflow.

Added

Model-recovery loop test against moonshotai/kimi-k2.6:free (PR H of the v0.25.0 reliability arc — final). PRs A-G proved helmdeck's environment is correct: coverage gates, contract tests at the schema seam, property tests on validators, mutation testing on decision-dense code, engine audit/memory machinery, S3 wire surface, MCP transport. The reliability bet under all of that is the headline claim: weak, cheap models can drive complex workflows iff the surrounding environment is reliable. A 100%-covered codebase still doesn't prove the LLM understands the typed-error vocabulary helmdeck advertises. PR H closes that gap with the first piece of empirical evidence in helmdeck's history that the cheap-model bet actually holds (or doesn't — both are useful results). NEW .github/workflows/model-recovery.yml — nightly schedule (06:00 UTC, after the 04:00 mutation workflow) + workflow_dispatch for ad-hoc runs. Pins moonshotai/kimi-k2.6:free via three workflow-level env vars: RECOVERY_MODEL, MODEL_LAST_VERIFIED=2026-06-04, MODEL_NEXT_REVIEW_DUE=2026-09-04. The dates live next to the model id so updating the pin prompts updating the review date — no separate calendar to drift. Preflight step: (a) calls OpenRouter's /api/v1/models and asserts RECOVERY_MODEL is in the catalog; if absent, fails loudly with "model deprecated; check https://openrouter.ai/models?supported_parameters=free and update RECOVERY_MODEL + bump LAST_VERIFIED + push NEXT_REVIEW_DUE forward." (b) Compares today's date against MODEL_NEXT_REVIEW_DUE and emits a GitHub ::warning:: annotation past the deadline — visible on the run summary, present on every subsequent nightly until the maintainer updates the dates. Same "loud-but-not-blocking" cadence the coverage gate uses. NEW internal/reliability/ package — build-tagged recovery so a default go test ./... compiles only doc.go and the package is a no-op for ordinary CI. 
Three gates protect the live API: the build tag (-tags=recovery), HELMDECK_RECOVERY_TESTS=1 env var, and OPENROUTER_API_KEY. All three must be set; the test skips cleanly otherwise so forks and PR contributors without the secret get clean green runs. scenarios.go declares 5 recovery scenarios + the closed-set action vocabulary the model returns (retry_corrected, retry_as_is, escalate_to_user, report_bug) + the system prompt that explains helmdeck's typed-error vocabulary to the model (mirroring what an MCP client surfaces). client.go is a 180-LOC OpenRouter chat-completions caller — deliberately not the production internal/gateway stack so the harness can't be confused with what's being measured. Forces response_format: json_object and temperature=0.2; strips ```json fences from providers that wrap output; treats malformed JSON as a recovery failure (a model that can't emit parseable output for a typed envelope is failing the contract just as much as one that picks the wrong action). recovery_test.go runs each scenario N=10 attempts (configurable via HELMDECK_RECOVERY_ATTEMPTS for ad-hoc shorter runs), tallies actions against ExpectedActions, asserts successes ≥ threshold. Default threshold 7/10; message-only ambiguity scenario uses 6/10 because multiple recoveries are inherently acceptable. Persists a per-scenario JSON report to /tmp/recovery-report.json; the workflow uploads it as a 30-day-retention artifact and posts a per-scenario table to the run summary via $GITHUB_STEP_SUMMARY. The 5 scenarios (each pinning a specific reliability claim): (1) CodeInvalidInput with named field — caller-fixable, model must emit a corrected input (the headline claim). (2) CodeSchemaMismatch on output — pack bug, model must report (NOT retry, NOT escalate to user) — the v0.17.1-class regression. (3) CodeHandlerFailed transient — model must retry with same inputs. (4) CodeCredentialInvalid — model must escalate to user (auto-retry could lock the account). 
(5) Message-only ambiguity — vague message, only the code carries actionable signal — tests whether the typed code is doing the work or the model is pattern-matching message text. What this proves: each PASS is direct empirical evidence that the typed-error vocabulary works for the weak-model regime the bet is making a claim about. A FAIL is also useful — it surfaces "the message isn't clear enough" or "the bet is weaker than we claimed for this code" honestly. The wrong move is hiding the result. Why a free model, not Haiku 4.5: real-token cost on nightly CI compounds; free tier keeps the budget at zero. More importantly, recovery from a weak model is a stronger reliability claim than recovery from a smart one — Kimi-K2.6 doesn't have the general intelligence to "figure out" the right action from prose, so a PASS is evidence the typed-error contract is doing the work. GitHub repository secret to add before first nightly fires: OPENROUTER_API_KEY (Settings → Secrets and variables → Actions). Without the secret the workflow's preflight emits a clear "secret not set" warning and skips the test. Closes the v0.25.0 reliability arc. Eight PRs (A-H) shipped: coverage gate (A), contract tests (B), handler coverage (C), property + mutation (D), S3 (E), engine audit/memory (F), MCP ratchet (G), model-recovery proof (H). The architectural reliability claim has moved from "coverage % says we're 80%-tested" to "every layer (handlers, schemas, engine, MCP, S3, model recovery) has a regression-impossible backstop + the cheap-model recovery loop is empirically measured."
internal/mcp ratcheted 69.5% → 81.5%, added to the coverage gate at floor=81 (PR G of the v0.25.0 reliability arc). PR D's reshape closed the v0.24.0 arc with internal/mcp deferred (69.5% < 80% infra floor; "would fail the gate"). PR G fixes that — the MCP package is the wire surface every connected agent (OpenClaw, Gemini CLI, Claude Code) talks to, and an untested branch in the tool dispatcher silently breaks every agent's pipeline-execution workflow at once. NEW pipelines_test.go (20 tests) — the highest-leverage file in internal/mcp was at 8.4% before this PR. Covers WithPipelines's nil-service gating (so deployments without pipelines still serve a working pack catalog) + the tools/list wire shape (tool names are BARE — pipeline-run, not helmdeck__pipeline-run — because namespacing MCP clients would double-prefix to helmdeck__helmdeck__pipeline-run; the docstring's load-bearing contract is pinned) + every action of dispatchPipelineTool: list/get/create/run/run-status/rerun/cancel, each with happy-path service forwarding, missing-required-field validation, and service-error translation. Pins specifically that coalesced: true from the single-flight guard is NOT an error in the tool-result envelope — the LLM's recovery code branches on this, and a regression that promoted it to isError would silently break every agent re-firing a pipeline. Also pins the distinct error codes per tool (pipeline_run_failed vs pipeline_cancel_failed vs the generic pipeline_error) — the LLM's recovery branches on the specific code, not the message. NEW my_resources_test.go (12 tests) — buildMyDefaults, buildMyMemory, buildRoutingGuide, formatPipelineAuditChunk. The helmdeck://my-defaults / helmdeck://my-memory / helmdeck://routing-guide MCP resources are what the chat agent reads at the top of every session to understand who it's talking to and what's been learned. 
Pin three distinct note states (memory-not-configured vs no-store vs empty-history) so the UI can distinguish "memory off" from "memory on but new caller"; pin the wire shape rewrite from packs.Defaults to MyDefaults (a future JSON-tag rename on the underlying type would silently corrupt the agent's defaults reading); pin the audit-category filter in buildMyMemory (pack_history / pipeline_history rows MUST be excluded — they're surfaced via my-defaults, not my-memory, and a regression that leaked them would clutter the agent's user-facts view with engine-written rows). formatPipelineAuditChunk (QMD MCP corpus bridge) — every field renders in a stable header → key/value layout; optional fields (empty Run ID, zero DurationMs, empty LearnInputs) MUST NOT produce dangling labels. NEW helpers_test.go (7 tests) — isInlineableImage closed-set MIME check (PNG/JPEG/GIF/WebP inlineable; SVG/AVIF/BMP fall back to text-URL — a regression here would silently break inline screenshot rendering); base64Encode round-trip; rpcError.Error() format stability (log-parsing scripts depend on the exact shape); extractWebhookFields security boundary (the pack handler MUST NEVER see webhook_url or webhook_secret — they're MCP-server-level metadata; the test pins that the cleaned input does not leak these fields, and that the no-url branch returns the input unchanged so a webhook_secret without a URL doesn't get silently stripped). NEW registry_factory_test.go (4 tests) — defaultAdapterFactory for all three transports (stdio / SSE / WebSocket): valid-config happy path + malformed-config typed error per transport. Unknown-transport branch surfaces a typed error naming the bad value (operator typos like "stio" in the DB row don't silently route to nil). Coverage: internal/mcp 69.5% → 81.5% (+12pp). pipelines.go 8.4% → 92.2% (the biggest single-file jump in the v0.25.0 arc). routing_guide.go 47% → 88%. my_defaults.go 35% → ~95%. my_memory.go 20% → ~85%. Floor: internal/mcp newly tracked at 81. 
Tests: 113 passing in internal/mcp/ (was 70). All ./internal/... package suites pass go test -race -count=1 -timeout=240s (1,994 total tests across 34 packages). Coverage gate reports PASS at the new floor. What's deliberately left: jobs.go.sweep (the SEP-1686 async-job janitor — exercised end-to-end by the integration suite when it next runs; standalone test would need timer-clock injection that adds more weight than it removes), stdio.go reader-side adapter (sub-process spawn semantics; integration territory), the trivial WithArtifacts / WithInlineImageThreshold one-line setters. v0.25.0 arc remaining: PR H — model-recovery loop test against Haiku 4.5 (the actual cheap-model-reliability proof; real-token cost, opt-in env var, budget plan).
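The webhook security boundary pinned by helpers_test.go can be illustrated with a small sketch. The function below is hypothetical (`splitWebhook` and its signature are assumptions, modeled on the behavior described for extractWebhookFields): with a URL present, both webhook fields are removed from what the pack handler sees; with no URL, the input is returned unchanged.

```go
package main

import "fmt"

// splitWebhook separates MCP-server-level webhook metadata from the input
// forwarded to a pack handler. If webhook_url is absent or empty, the input
// is returned as-is, so a lone webhook_secret is not silently stripped.
func splitWebhook(input map[string]any) (cleaned map[string]any, url, secret string) {
	u, ok := input["webhook_url"].(string)
	if !ok || u == "" {
		return input, "", ""
	}
	s, _ := input["webhook_secret"].(string)
	cleaned = make(map[string]any, len(input))
	for k, v := range input {
		if k == "webhook_url" || k == "webhook_secret" {
			continue
		}
		cleaned[k] = v
	}
	return cleaned, u, s
}

func main() {
	in := map[string]any{
		"url":            "https://example.com",
		"webhook_url":    "https://hooks.example.com/done",
		"webhook_secret": "s3cret",
	}
	cleaned, url, _ := splitWebhook(in)
	fmt.Println(len(cleaned), url)
}
```

Building a fresh map instead of deleting keys in place keeps the caller's original input intact, which makes the boundary easier to assert in a test.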
Engine audit + memory machinery covered (PR F of the v0.25.0 reliability arc). PR E closed the S3 store gap; PR F closes the LLM-context machinery — the ADR 048 surface that builds the model's per-caller defaults projection on every run. Before this PR: WritePlanAudit, WritePipelineAudit, MemoryStore() accessor, StoreFact, ProjectDefaults, CallerFromContext, WithProgress, ProgressFromContext, FactStoreError.Error() were all at 0%. The reliability story rests on these being right — CallerFromContext returning the wrong subject means every audit row lands under the wrong namespace and the per-caller learned defaults silently swap between users; WritePlanAudit losing the IntentSHA from its key shape means the planning-history projection breaks; MemoryStore() returning nil from a wired engine means the QMD MCP bridge mounts a 503 stub when it should serve real corpus. NEW audit_engine_test.go (9 tests): WritePlanAudit happy-path pinning the plan_history/<intent_sha>/<nano> key shape + category=plan_history (ADR 049's reservation); preserves caller-set non-zero AtUnix (the branch existed but was untested); nil-store no-op guard (without this, every plan run on a no-memory deployment would nil-deref); unknown-caller default namespace (callerFromContext falls back to "unknown" so memory writes always have a well-defined namespace). WritePipelineAudit happy-path with learnable-input filtering (theme + model extracted, markdown body dropped — same closed-set as writePackAudit); empty-pipelineID no-op (so the projection isn't polluted with empty-ID rows the my-defaults UI can't group); nil-store guard. MemoryStore() accessor: returns the configured store, returns nil when unwired — the QMD bridge's mount-vs-stub gate. 
NEW facts_engine_test.go (5 tests): FactStoreError.Error() round-trip through errors.As (the REST handler at internal/api/memory.go uses this seam to extract the typed code for status mapping); StoreFact happy path through the REST entry point; nil-store path synthesizes the entry so memory-disabled deployments get a stable response shape; validation errors pass through unchanged (so missing-key doesn't coerce to backend-error 500); backend errors wrap as FactErrBackend (so a SQLite write failure surfaces as 500 instead of 400 invalid_input). NEW context_test.go (4 tests): WithCaller/CallerFromContext round-trip + nested-child inheritance + empty-subject-fallback-to-"unknown" (the namespace MUST be non-empty); WithProgress/ProgressFromContext round-trip + the always-non-nil contract (no-op callback returned for bare context so handlers don't need a nil-check); nil-clears branch. NEW project_defaults_test.go (5 tests): ProjectDefaults (the slice-input variant used by helmdeck.route, distinct from the store-backed BuildDefaults) — empty-inputs returns non-nil empty slices (JSON marshals as [] not null, same shape contract as PR C's null-fixes); ranks by call count; excludes failed runs from learned defaults (a caller-fixable failure with persona="executive" must NOT pin executive as default — that's reinforcing the wrong intent, the regression class the LLM's recovery story depends on most); pipeline-audit accepts both "succeeded" and "ok" outcomes (pack-level vs pipeline-level vocabulary); top-N cap applied so a heavy caller doesn't blow up the routing prompt. Coverage: internal/packs 82.0% → 87.6% (+5.6pp). Per function: WritePlanAudit 0→76.9%, WritePipelineAudit 0→75%, StoreFact 0→full happy/error paths, ProjectDefaults 0→full, CallerFromContext/ProgressFromContext 0→full. Floor bumped to 87. Tests: 140 passing in internal/packs/ (was 117). All ./internal/... package suites pass go test -race -count=1 -timeout=240s. Coverage gate reports PASS at the new floor. 
What's still untested in internal/packs (deliberate scope decisions): memoryAdapter.Namespace/List/Delete (exercised indirectly via Engine.Execute; pinning the adapter directly is a follow-up if signals warrant), ExecutionContext.Report (no-op when no progress sink wired; integration-tested via the pipeline runner's progress capture), and the trivial WithCDPFactory/WithSessionExecutor/WithArtifactStore option setters (one-line assignments — testing them is theater). v0.25.0 arc remaining: PR G — internal/mcp ratchet (69.5 → 80; the deferred-from-PR-D infrastructure floor). PR H — model-recovery loop test against Haiku 4.5.
internal/packs/s3store.go wire-tested against a stub S3 endpoint (PR E of the v0.25.0 reliability arc). Post-v0.24.0 coverage audit surfaced the bigger latent risk: the artifact store EVERY operator's production deployment depends on — internal/packs/s3store.go — was 0% covered in CI. The existing s3store_test.go had a compile-time interface check + an opt-in TestS3ArtifactStoreLive that runs against real MinIO when HELMDECK_S3_TEST_ENDPOINT is set, but the CI surface was zero. Any operator deploying with MinIO/R2/B2/AWS S3 was running unreviewed code on every artifact upload. PR E closes that gap with a stub S3 server that speaks just enough of the AWS S3 wire protocol for the minio-go SDK to round-trip. Scope (NEW internal/packs/s3store_wire_test.go, 11 tests): full Put → Get round-trip with content-type + size preservation + presigned-URL shape (X-Amz-Signature query param verified, not just non-empty); BucketExists failure surfaced at construction time (operators get a clear error at startup, not on first Put); upstream-error translation to *PackError{Code: CodeArtifactFailed} on Put / Get / Delete (the engine's typed-error contract held end-to-end); ListForPack reads the in-process index (cross-handler-within-run lookup, not bucket scan); ListAll walks the bucket and parses Pack from the key prefix (the only entry point the TTL janitor uses — if the prefix parse breaks, janitor either deletes the wrong artifacts or stops working); Delete removes the object AND drops the index entry (without the index update a follow-up ListForPack would return a stale handle); PublicEndpoint rewrites the presigned-URL host (the docker-internal-vs-public-DNS seam compose deployments rely on); PresignTTL=0 defaults to 15min; Region defaults to us-east-1 for MinIO sign-path compatibility. Why a stub, not testcontainers: a real MinIO container in CI adds a docker-in-docker dependency and ~5s of startup per run. 
A stub server that emulates the path-style S3 endpoints (HEAD bucket, PUT/GET/DELETE object, GET ?list-type=2) is enough for unit-test coverage of the helmdeck-side translation logic — the wire-shape we care about. The two non-trivial pieces the stub had to model: (1) AWS chunked-signed PUT payloads (minio-go uses X-Amz-Content-Sha256: STREAMING-AWS4-HMAC-SHA256-PAYLOAD for streaming uploads, so each chunk arrives as <hex-size>;chunk-signature=<sig>\r\n<data>\r\n; the stub decodes this so Get round-trips the raw bytes the test wrote), and (2) persistent error injection (minio-go retries failed requests internally — a one-shot error stub fires on the first attempt and the retry succeeds against the stub's normal flow, so the error field has to stay set for the duration of the test). Why this matters for the reliability bet: PRs A–D proved the handlers are correct. The engine's artifact store is the substrate every artifact-producing pack writes to (slides.narrate, podcast.generate, image.generate, screenshot_url, hyperframes.render, swe.solve's trajectory dumps). A bug here breaks all of them at once with the worst possible failure mode: silent. The presigned URLs are how agents reach back to fetch what they produced — a regression in the URL shape would silently break every agent's fetch loop. Coverage: internal/packs jumped 72.6% → 82.0% (+9.4pp). internal/packs/s3store.go from 0% (across every function) to 75-100% per function. Added to the coverage gate at floor=80 — the engine layer now has the same regression-impossible backstop the v0.24.0 packages got. Tests: 117 passing in internal/packs/ (was 106). All ./internal/... package suites pass go test -race -count=1 -timeout=240s. Coverage gate reports PASS at the new floor. Why this kicks off v0.25.0: PR D closed the v0.24.0 arc with a self-audit promising "engine internals + S3 are next." 
PR E starts the v0.25.0 arc against the actually-untested infrastructure: S3 store (this PR), then engine audit/memory machinery (WritePlanAudit/WritePipelineAudit/StoreFact/ProjectDefaults/CallerFromContext — all 0%), then internal/mcp ratchet (69.5 → 80), then the model-recovery loop test (the actual cheap-model-reliability proof).
Property-based tests on seam validators + nightly mutation-testing workflow (PR D of 4, v0.24.0 reliability arc — final). PRs A-C ratcheted the quantity floor (coverage gate, contract tests at the schema seam, closing zero-coverage handlers). PR D adds the quality gates that coverage can't see — the things that actually prove the cheap-model reliability bet. Property tests (pgregory.net/rapid v1.3.0 added; test-only dep, no production import): internal/pipelines/validate_property_test.go (6 tests) pins pipelines.Validate's invariants — every well-formed pipeline must validate, every duplicate-step-ID / empty-pack / forward-step-ref / packExists-rejects pipeline must reject with a message naming the offending element (the LLM's recovery key). internal/packs/schema_property_test.go (4 tests) pins BasicSchema.Validate — every conforming output validates, every missing-required / type-mismatch / non-object input rejects with a clear message. internal/gateway/splitmodel_property_test.go (4 tests) pins gateway.SplitModel's round-trip identity AND the docstring's load-bearing claim that the split is on the FIRST / only (so "ollama/library/llama3" routes correctly to provider=ollama, model=library/llama3 — a naive strings.Split would corrupt it). And caught a real bug while writing them: BasicSchema.Validate accepted top-level null even though the docstring promises rejection of non-objects. json.Unmarshal([]byte("null"), &map[string]json.RawMessage{}) succeeds with the map left as nil — Go's decoder treats null as "no value" for map types. Without an explicit nil-check, any pack returning null instead of {} would silently pass validation. Same regression class as PR C's browser.interact null-slice screenshots: an empty value JSON-encoded as null instead of [] / {} slips past validation that "looks" right by line coverage. Fixed in internal/packs/schema.go with if obj == nil { return ... "got null" }. 
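The null-decoding behaviour behind this fix is easy to reproduce in isolation. The sketch below uses a hypothetical decodeObject helper (not the real BasicSchema.Validate) to show that encoding/json leaves a map nil when the input is the JSON literal null, and how an explicit nil check rejects it.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// decodeObject accepts only a top-level JSON object.
func decodeObject(raw []byte) (map[string]json.RawMessage, error) {
	var obj map[string]json.RawMessage
	if err := json.Unmarshal(raw, &obj); err != nil {
		return nil, fmt.Errorf("expected object: %w", err)
	}
	// json.Unmarshal succeeds on "null" and leaves the map nil,
	// so the non-object case has to be rejected explicitly.
	if obj == nil {
		return nil, errors.New("expected object, got null")
	}
	return obj, nil
}

func main() {
	for _, in := range []string{`{}`, `{"a":1}`, `null`, `[]`} {
		_, err := decodeObject([]byte(in))
		fmt.Printf("%-8s rejected=%v\n", in, err != nil)
	}
}
```

Without the nil check, null and {} would be indistinguishable to any code that only looks at the returned error.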
Why property tests, not more example tests: example tests at 95% line coverage pin the cases the test author thought of. Property tests pin the INVARIANT — across thousands of generated inputs per check. If a future refactor of extractStepRefs accepts ${{ steps. }} with a trailing dot, the well-formed property doesn't notice (it doesn't generate that shape), but the forward-ref property does — every well-formed run includes refs that the validator now misparses. The reliability bet rests on these validators being right for the inputs they haven't seen yet; properties are how we test that. Nightly mutation-testing workflow (NEW .github/workflows/mutation.yml, go-mutesting v1.2.0): scheduled at 04:00 UTC daily, scoped narrowly to three places where a flipped condition has the largest blast radius on the reliability story — internal/packs/classify.go (typed-error closed-set mapping; a mutation swapping CodeInvalidInput for CodeInternal would silently break every LLM's failure-recovery channel), internal/gateway/fallback.go (Chain.Dispatch retry/fallback ladder; flipped predicates surface as routing dead-letters no example test catches), internal/avenc/ (the codec byte-floor checks from PRs #400/#404/#405; size < floor vs size <= floor is a 1-byte difference coverage % can't detect). Runs as a matrix (3 parallel jobs), 25-minute timeout per target, uploads survivor lists as artifacts retained 14 days, posts a per-target summary to the run page. Not a per-PR gate because go-mutesting runs the test suite once per mutation and is slow (~5-15 min per file, longer for avenc which has more functions); per-PR would burn CI. Nightly + on-demand workflow_dispatch is the right cadence for a "this drift caught us before we noticed" signal. Final per-package floors locked: avenc=90 (99.3 actual), llmcontext=90 (92.1), gateway=88 (88.1; bumped from 85), packs/builtin=80 (80.5), api=80 (80.1), pipelines=80 (84.0; new tracked package). 
internal/mcp deliberately not yet tracked — currently 69.5%, below the 80% infrastructure floor; adding it to the gate now would fail the run. Ratcheting mcp is a focused v0.25.0 task. cmd/* excluded as documented (entry-point os.Exit/signal handling has a realistic ~60% ceiling). Deferred to v0.25.0: the model-recovery loop test — drive Haiku 4.5 through deliberately-broken pack outputs and assert the LLM's recovery behavior — is the test that would truly prove the cheap-model reliability bet end-to-end. Real token cost per CI run requires a dedicated budget plan + opt-in env var so dev machines don't burn credits, so it's its own arc, not bundled into PR D. Tests: 14 new property tests across 3 packages, all pass under rapid.Check (each runs ~100 generated cases per invocation by default). 855 total tests in internal/packs/... (was 750). All ./internal/... package suites pass go test -race -count=1 -timeout=240s. Coverage gate reports PASS at the new locked floors.
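The first-slash rule that the SplitModel property tests in this entry pin can be illustrated with strings.Cut. This is a sketch only: the splitModel signature below is an assumption, and the real gateway.SplitModel may return different types or handle edge cases differently.

```go
package main

import (
	"fmt"
	"strings"
)

// splitModel splits "provider/model" on the FIRST slash only, so model
// names that themselves contain slashes survive intact.
func splitModel(id string) (provider, model string, ok bool) {
	provider, model, found := strings.Cut(id, "/")
	if !found || provider == "" || model == "" {
		return "", "", false
	}
	return provider, model, true
}

func main() {
	p, m, ok := splitModel("ollama/library/llama3")
	fmt.Println(p, m, ok)

	// A naive split on every slash loses the tail of the model name
	// if the caller only reads the second element.
	parts := strings.Split("ollama/library/llama3", "/")
	fmt.Println(parts[0], parts[1])
}
```

The round-trip identity follows directly: for any accepted input, provider + "/" + model reproduces the original string.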
Close the zero-coverage handler set in internal/packs/builtin (PR C of 4, v0.24.0 reliability arc). PR A landed the regression gate, PR B added contract tests at the schema seam. PR C closes the actually-untested handlers that PRs A and B left at 0% — the LLM-facing surfaces where coverage genuinely was theater. browser.interact (NEW browser_interact_test.go, 14 tests): full happy-path walkthrough exercising every action shape (click, type, focus, screenshot, extract, assert_text, wait, execute) against cdpfake.Client, plus per-action input validation (selector/value/text required), navigate-error propagation, assert_text → CodeSchemaMismatch mapping, the no-CDP defense-in-depth path. And caught a real bug while writing the tests: the handler initialized screenshots as a nil slice (var screenshots []string), which JSON-marshals as null and violates the array type declared in the OutputSchema for action sequences that don't include a screenshot. Production runs without the screenshot action would have failed Engine.Execute's output-schema validation with invalid_output: expected array, got null. Initialized as []string{} so empty marshals as [] — exactly the class of bug PR B's schema-contract tests are designed to catch in the future, and a useful proof point that the contract-test pattern works. github.* handler set (NEW github_handlers_test.go, 8 tests): the existing github_cache_test.go exercises the engine's cache seam by stubbing the handler — never hits githubAPI. PR C closes that gap by overriding the package-global githubAPIBase (newly var instead of const so tests can point it at httptest.NewServer, same pattern as voices.ElevenLabsBaseURL) and running the real handlers through the real HTTP call. 
Tests pin: request-shape headers (Authorization Bearer, Accept, X-GitHub-Api-Version, User-Agent), the no-token branch (no Authorization header — public reads still work), upstream-error surface (4xx/5xx → CodeHandlerFailed with status + message), per-pack body/path shape for create_issue / list_prs / post_comment / create_release / search. A header regression in githubAPI is exactly the bug that would silently break every github pack at once — pinning it once protects the whole family. ElevenLabs credential ladder (NEW elevenlabs_creds_test.go, 7 tests): resolveElevenLabsKey is called at handler entry by both podcast.generate and slides.narrate. The 4-step resolve ladder (explicit credential → canonical vault name elevenlabs-key → back-compat alias elevenlabs-api-key → HELMDECK_ELEVENLABS_API_KEY env var) had 76.5% coverage with the alias + explicit + env paths untested. Tests pin each step's precedence + the no-source empty return + the explicit-missing-falls-through behavior + the nil-vault defensive path. A ladder reorder would now fail loudly pointing at the source step. Floors: internal/packs/builtin 77 → 80 (+3pp; actual coverage now 80.4%). The larger 76 → 88 sweep the original plan called for proved aspirational — PR C's 4pp ratchet (76 → 80 across PRs B+C) is the real path the available test surface supports. Tests: 750 passing in internal/packs/builtin/ (was 721). All ./internal/... package suites pass go test -race -count=1 -timeout=240s. Coverage gate reports PASS at the new floors. PR D (reshape per the v0.24.0 plan) adds rapid property tests + nightly go-mutesting workflow — quality gates on top of the quantity floor.
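The browser.interact bug in this entry comes down to how encoding/json treats nil and empty slices. A small sketch, using a hypothetical interactOutput struct rather than the pack's real output type:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// interactOutput is a stand-in for a pack output whose schema declares
// "screenshots" as an array.
type interactOutput struct {
	Screenshots []string `json:"screenshots"`
}

// marshalScreenshots returns the JSON encoding for the given slice value.
func marshalScreenshots(s []string) string {
	b, err := json.Marshal(interactOutput{Screenshots: s})
	if err != nil {
		return "error: " + err.Error()
	}
	return string(b)
}

func main() {
	var nilSlice []string // declared but never appended to
	fmt.Println(marshalScreenshots(nilSlice))   // {"screenshots":null}
	fmt.Println(marshalScreenshots([]string{})) // {"screenshots":[]}
}
```

Both values have length zero in Go, which is why line coverage alone cannot tell them apart; only a check against the declared output schema sees the difference.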
Schema-contract + typed-error contract tests across the builtin pack catalog (PR B of 4, v0.24.0 reliability arc). PR A landed a coverage gate; PR B addresses the dimension coverage can't see — quality. You can hit 90% coverage with tests that only assert HTTP 200 and never verify the typed-error code the LLM's recovery depends on. The bug we actually pay for is "pack returned CodeInvalidInput but the test only checked status 400" — coverage stays green while the model receives a generic error, gets confused, and burns tokens retrying. PR B closes two specific drift surfaces at the seam where the reliability bet lives. Output-schema contract (internal/packs/builtin/output_schema_contract_test.go) extended from 2 packs (slides.narrate, podcast.generate) to 7 — adds helmdeck.plan, helmdeck.route, content.ground, research.deep, swe.solve. The class of bug closed: a pack's unit tests call pack.Handler directly, bypassing Engine.Execute which is the only place OutputSchema.Validate runs. So a handler can emit tts_chars: {by_voice: {...}} while the schema declares tts_chars: number, every unit test passes, and pipeline runs fail in production with invalid_output: field "tts_chars": expected number, got object (this exact regression shipped in v0.17.1). Each contract test invokes the real handler with valid input and asserts pack.OutputSchema.Validate(output) returns nil. Typed-error all-pack contract (NEW internal/packs/builtin/typed_error_contract_test.go) — table-driven test enumerating 47 builtin packs. For each: invoke handler with deliberately-invalid input ({} in most cases), assert errors.As(err, &perr) succeeds AND packs.IsValidCode(perr.Code) is true. The architectural promise per ADR 008 is that NO error escaping Engine.Execute carries a code outside the closed set in internal/packs/classify.go. 
Pack handlers returning &PackError{Code: "weird"} get coerced to CodeInternal, which the pipeline-level FailureClass router maps to pack_bug — wrong bucket, wrong recovery, the LLM tries an issue-filing flow when the real fix is caller_fixable. The table makes future drift visible: a new pack returning &PackError{Code: "something_not_in_the_set"} fails the contract loudly. Why this matters for cheap-model reliability: helmdeck's bet is that weak models drive complex workflows iff the surrounding environment is perfectly reliable. The closed-set typed errors are the channel through which the LLM learns "this is your fault, fix your input" vs. "this is infrastructure, retry with backoff" vs. "this is a bug, escalate." Without these contract tests pinning the channel, schema-vs-handler drift breaks the channel silently and the model's recovery becomes a stab in the dark. Floors: internal/packs/builtin 76 → 77 (modest ratchet — the larger jump to 82 lands in PR C alongside browser_interact / github_handlers / elevenlabs_creds test coverage). Tests: 721 passing in internal/packs/builtin/ (was 712). All ./internal/... package suites pass go test -race -count=1 -timeout=240s. Coverage gate reports PASS at the new floors. Plan reshape: PR D was originally "close gateway 88→91, lock floors, write a docs page" — a symbolic close. After landing PR A and observing that coverage % alone doesn't prove the reliability bet, PR D was reshaped to add rapid property-based tests on the seam validators (pipelines.Validate, OutputSchema.Validate) and a nightly go-mutesting workflow against the decision-dense LLM-facing code (classify.go, gateway/fallback.go, avenc codec floors). Floor-locking stays but is no longer the headline. The model-recovery loop test — drive Haiku 4.5 through deliberately-broken pack outputs and assert the LLM's recovery behavior — is a v0.25.0 candidate (real token cost per CI run requires a dedicated budget plan).
Per-package coverage gate + golangci-lint job in CI (PR A of 4, v0.24.0 reliability arc). The architectural bet behind helmdeck is that weak, cheap models can drive complex workflows iff the environment around them is perfectly reliable: typed errors, strict schemas, context compaction. Coverage is the measurable proof that the bet holds; an untested branch that returns a raw Go error breaks LLM context and burns tokens while the model reasons its way around it. This PR blocks coverage regressions from this point forward, then closes the biggest gap (internal/api/) so the floor can lift. Coverage gate (scripts/coverage-gate.sh) parses coverage.txt directly, computing statement-weighted percentages per package via awk (the same metric go tool cover -func prints for the total: row), so a single big untested function can't hide behind many small tested ones the way it could under function-averaging via cover -func | tail. Initial floors land at current values rounded down (avenc:90, llmcontext:90, gateway:85, packs/builtin:75, api:60); PRs B–D ratchet these up. The script reports every tracked package on every CI run, success or fail, so maintainers see drift before it crosses a floor. Slack of 0.05% absorbs rounding noise so an 89.95% reading doesn't flake a 90% gate. golangci-lint job in .github/workflows/ci.yml uses the v2 config schema in .golangci.yml and enables errcheck, govet, staticcheck, unused, ineffassign: the bug-class linters that coverage doesn't replace (shadowed variables, unchecked errors, dead code). errcheck excludes the common Close/Flush patterns ((*sql.DB).Close, (*io.PipeWriter).Close, (*tabwriter.Writer).Flush, …) explicitly, because errcheck matches on the static receiver type, not interface satisfaction, so (io.Closer).Close alone wouldn't catch them. 
staticcheck is scoped to the SA* prefix (genuine bug checks); the ST* (style), S* (simplification), and QF* (quick-fix) categories are off in PR A because the codebase had never run staticcheck and would surface ~33 pre-existing findings whose cleanup belongs in a focused style PR. Subsequent reliability-arc PRs can opt categories back in as the backlog drains. only-new-issues: true ratchets the same way the coverage gate does — the lint blocks NEW issues a PR introduces, doesn't force the cleanup of pre-existing drift in code paths a PR doesn't touch. Action pinned to v8 with golangci-lint v2.12.2 — the major version of the action, the linter binary, and .golangci.yml's schema must stay in sync on bumps (action v6 only supports linter v1; v8 maps to linter v2). internal/api/ coverage lift: 62.8% → 80.1% (+17.3pp). Tests added across handler-shape branches that previously lived in the 0% column. Mostly small, high-fanout additions: REST handlers returned 503/404/405 paths that weren't exercised; pack/pipeline/MCP adapters that the in-process MCP surface routes through; SSE handshake paths for the QMD memory-corpus bridge; vault and key-store error mappings; webhook dispatch + post-issue-comment fallback. New tests honor existing seam patterns — fake.Runtime for session ops, cdpfake.Client for CDP, httptest recorders for HTTP shape, memory.NewInMemoryStore for memory paths — so future contributors find familiar scaffolding instead of one-off mocks. CI scope tightened: go vet and go test -race restricted to ./internal/... ./cmd/... (was ./...) so the test job doesn't try to compile the Docusaurus-bundled website/build/assets/*.go tree, which references internal/api symbols but isn't part of the go.mod build module — this drift was masking real failures behind an unrelated compile error. Why per-package, not aggregate. 
An org-wide average lets a strong package subsidize a weak one — avenc at 99.3% and gateway at 88% would absorb a hypothetical api drop from 80% to 30%, and operators would notice only when the cheap-model reliability story collapsed in production. Per-package floors with explicit exclusions (cmd/* entry points, generated code, integration-only paths) make backsliding loud at the PR that introduces it. The 4-PR cadence (A–D, this is A) lets each step add tests and raise its package's floor without locking in two more PRs against a flawed baseline — if the gate logic has a counting quirk, we discover it in this PR, not in PR D. Final floors at PR D: critical reliability packages (avenc, llmcontext, gateway, packs/builtin) at 90%; infrastructure packages (api, pipelines, mcp) at 80%; cmd/* excluded. Test count: 291 passing tests in internal/api/ (was 173). All ./internal/... package suites pass go test -race -count=1 -timeout=240s.
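To make the statement-weighted metric in this entry concrete, here is a sketch in Go of the same calculation the gate is described as doing in awk. It is not scripts/coverage-gate.sh: the function names, the example module path, and the choice to apply the slack as 0.05 percentage points are assumptions, and it ignores duplicate blocks that merged profiles can contain.

```go
package main

import (
	"bufio"
	"fmt"
	"path"
	"strconv"
	"strings"
)

// tally accumulates statement counts for one package.
type tally struct{ covered, total int }

// packageCoverage parses Go coverprofile text and returns the
// statement-weighted coverage percentage per package directory.
func packageCoverage(profile string) (map[string]float64, error) {
	tallies := map[string]*tally{}
	sc := bufio.NewScanner(strings.NewReader(profile))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "mode:") {
			continue
		}
		// Block lines look like: file.go:start,end numStatements hitCount
		fields := strings.Fields(line)
		if len(fields) != 3 {
			return nil, fmt.Errorf("malformed profile line %q", line)
		}
		colon := strings.LastIndex(fields[0], ":")
		if colon < 0 {
			return nil, fmt.Errorf("missing file separator in %q", line)
		}
		stmts, err := strconv.Atoi(fields[1])
		if err != nil {
			return nil, fmt.Errorf("bad statement count in %q: %w", line, err)
		}
		hits, err := strconv.Atoi(fields[2])
		if err != nil {
			return nil, fmt.Errorf("bad hit count in %q: %w", line, err)
		}
		pkg := path.Dir(fields[0][:colon])
		t, ok := tallies[pkg]
		if !ok {
			t = &tally{}
			tallies[pkg] = t
		}
		t.total += stmts
		if hits > 0 {
			t.covered += stmts
		}
	}
	if err := sc.Err(); err != nil {
		return nil, err
	}
	out := make(map[string]float64, len(tallies))
	for pkg, t := range tallies {
		if t.total > 0 {
			out[pkg] = 100 * float64(t.covered) / float64(t.total)
		}
	}
	return out, nil
}

// meetsFloor applies a small slack so rounding noise does not flake the gate.
func meetsFloor(actual, floor float64) bool { return actual+0.05 >= floor }

func main() {
	// One large untested block (90 statements) next to one small tested block.
	profile := "mode: set\n" +
		"example.com/m/api/big.go:1.1,90.2 90 0\n" +
		"example.com/m/api/small.go:1.1,10.2 10 1\n"
	cov, err := packageCoverage(profile)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	pct := cov["example.com/m/api"]
	fmt.Printf("api: %.1f%% meets floor 80: %v\n", pct, meetsFloor(pct, 80))
}
```

In this example the two blocks average to 50% when each block counts equally, but the statement-weighted figure is 10%, which is the case the entry says function-averaging would hide.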

[0.23.0] - 2026-06-03

Theme: Reliable narrated decks + shared audio/video helper. The slides.narrate failure surface that produced "ffmpeg segment N failed (exit 0)" — for weeks the most-reported single error — is closed end-to-end: silent-failure detection, honest error messages, OOM retry, post-encode validation, Mermaid pre-rendering, audio re-encode at concat boundaries, and most importantly a shared internal/avenc/ package that captures every lesson PRs #379–#405 paid for so the next pack that needs ffmpeg starts from a solid base. Plus a long string of supporting fixes: pipeline-run single-flight coalescing, session-timeout extension on pinned reuse, paid-API credential precheck, and the ADR 051 routing-reliability work (reasoning-token stripping, parser parity, calibration tooling, cause-typed errors, strict JSON, prefix cache).

Changed

internal/podcast.Concat adopts internal/avenc/ shared helpers (PR C of 3 — final consolidation step). Completes the 3-PR avenc consolidation arc. internal/podcast/concat.go was the second of two production callers shelling out to ffmpeg directly; this migration removes its duplicated generateSilence/probeAudioDuration/padTurnToMin/concat-command helpers and replaces them with avenc.GenerateSilence/avenc.ProbeAudioDuration/avenc.PadAudioToMin/avenc.ConcatAudio calls. The Concat function steps now read like the documentation: write the per-turn files (writeTurnFile stays local — streaming Stdin isn't ffmpeg-shaped), call avenc.GenerateSilence for the between-turn segment, build the concat list, call avenc.ConcatAudio for the audio-only concat with mandatory re-encode, then avenc.ProbeAudioDuration for the final duration. padTurnToMin collapsed from a 45-line inline 4-step ffmpeg pipeline to a 6-line composition of avenc.ProbeAudioDuration + avenc.PadAudioToMin. SilenceTurn now wraps avenc.GenerateSilence + a cat-readback (the wrap closes the 0-byte-output hole the original SilenceTurn had — avenc post-validates the produced file). Bridging shape: session.Executor is the runtime-side interface with an Exec(ctx, sessionID, req) signature, while avenc.Executor is a closure already bound to a sessionID. A 3-line avencBind(ex, sessionID) avenc.Executor adapter lives at the top of concat.go and gets passed to every avenc call. Considered putting the sessionID parameter on the avenc API directly — rejected because it would propagate the session-ID-dispatch indirection into every avenc caller including slides.narrate, which already passes ec.Exec directly without indirection; the closure-adapter pattern is a one-line cost paid only by callers that need it. Net change: 166 LOC → ~140 LOC in concat.go (the 45-line padTurnToMin shrank to 6 lines + 9-line doc comment, and the inline concat block lost ~12 lines of error-handling boilerplate that avenc owns now). 
Concat reads as a much shorter pipeline of named-helper calls. Behaviour deltas worth noting: (1) ffprobe invocations are now prefixed with LC_ALL=C (avenc's locale-stability guard from the external research in PR #406). A sidecar with LC_NUMERIC=de_DE previously had ffprobe emit "5,123" which fmt.Sscanf silently parsed to 0; now the parse simply succeeds since LC_ALL=C forces period decimal separators. (2) On a final-duration probe failure (corrupt MP3 etc.), Concat now ignores the avenc-returned error and falls back to duration = 0. This preserves the historical behaviour where fmt.Sscanf silently returned 0 on garbage, so the caller's cost-accounting code doesn't blow up on a probe failure. (3) padTurnToMin intermediate file names changed from /tmp/helmdeck-podcast/turn-NNN-pad.mp3 to /tmp/helmdeck-podcast/avenc-pad-turn-NNN.mp3; the prefix is now avenc-pad- instead of the bare turn-NNN-pad-* shape. The files still live in concatTempDir and are temporary, so there is no semantic impact. (4) SilenceTurn previously had a known 0-byte-output hole (mid-write SIGPIPE produced an empty file with exit 0). The avenc-wrapped version closes it: silent-fallback runs now surface "silence-gen output: produced only N bytes" instead of generating a video over an empty audio track. Tests: 3 existing concat_test.go tests pass (one needed the fake-executor's response map updated to include "wc -c < " so avenc's post-encode size validation gets a healthy response, the same fixture pattern PR B used for slides.narrate); 2 existing podcast_generate_test.go tests needed the same wc -c mock added and the ffprobe HasPrefix switched to Contains so the new LC_ALL=C ffprobe ... shape still matches. All 1702 internal tests pass (count unchanged from PR B; no leaf-helper tests existed at the podcast level to delete, since the duplicated helpers were exercised through the higher-level padTurnToMin and Concat orchestration). 
What this completes: every audio/video pack that shells out to ffmpeg directly (slides.narrate + podcast.generate via internal/podcast.Concat) now imports internal/avenc/. Future packs (tiktok.shorts, audiobook.generate, …) inherit every battle-tested pattern PRs #390–#405 paid for, with concentrated 99.3%-coverage tests rather than partial coverage across N call sites. The consolidation arc described in /root/.claude/plans/i-would-like-to-elegant-kahan.md is closed.
slides.narrate adopts internal/avenc/ shared helpers (PR B of 3 — slides.narrate migration). Validates PR #406's abstraction by deleting the duplicated requireNonEmptyOutput / generateSilence / probeAudioDuration / padSlideAudioToMin / looksLikeMP3 / validateElevenLabsBody helpers from internal/packs/builtin/slides_narrate.go and replacing them with avenc.RequireNonEmptyOutput / avenc.GenerateSilence / avenc.ProbeAudioDuration / avenc.PadAudioToMin / avenc.ValidateMP3Body calls. The manual concat command + error-handling block (10+ lines of OOM lifts, transport-error detection, stderr capture) collapses to a single avenc.ConcatVideoMP4s call that owns PR #404's byte-stable -c:v copy -c:a aac -b:a 192k shape. Net change: 298 LOC removed from slides_narrate.go (-200 net once you account for the avenc call sites added), 312 LOC removed from slides_narrate_test.go (the leaf-helper tests are now covered by avenc's 99.3%-coverage test surface — keeping them in slides.narrate would be redundant). Total: 567 lines deleted, 43 lines added across both files. Local helpers KEPT: (a) encodeSegment + ffmpegEncodeOpts + the per-segment OOM-retry loop — preserves persistFfmpegStderr artifact-store dump on the most-common production failure path (avenc.EncodeVideoSegment surfaces stderr inline only, truncated at 4 KB; the per-failure artifact dump is genuinely useful for production debug); (b) validateMarpPngs + pngMagicHex + minRenderedSlidePngBytes — Marp-specific PNG validation, not shared with any other pack; (c) slidesNarrateFfmpegThreads + the env-var → constants — slides.narrate-specific operator tuning knob; (d) persistFfmpegStderr + truncStr + artifactSuffix + extractFirstJSONObject — used by the kept encodeSegment path. Local helpers DELETED: every audio/duration/validation helper from PR #400 + PR #404 + PR #405 — they live in avenc now. 
Behaviour changes worth noting: (1) Concat-step failure messages no longer reference an ffmpeg-stderr-concat.txt artifact key (concat failures are rare — single occurrence vs. per-segment, which still gets the artifact dump). The inline 4 KB stderr is still in the surfaced error message, sufficient for diagnosis. (2) PadAudioToMin's intermediate-file names changed from /tmp/audio-NNN-pad.mp3 to /tmp/avenc-pad-slide-NNN.mp3 (and similar for the merged + list files). Operators inspecting the sidecar mid-run see the new names; the intermediate files are temporary so the rename has no semantic impact. (3) The new avenc floor constants apply: MinSilenceMP3Bytes = 256 (was the same), MinTTSResponseBytes = 512 (was the same), MinEncodedSegmentBytes = 1024 (was the same) — byte-identical, just relocated. Tests: existing slides.narrate orchestration tests pass byte-identically (37 SlidesNarrate-prefixed tests), proving the abstraction is right. The 13 leaf-helper test functions (~30 individual sub-cases) that directly called the deleted local helpers were removed since avenc's 99.3%-coverage test surface covers the same behaviours, and keeping them would be testing avenc twice. The test count drops from 1732 → 1702 (-30 redundant sub-cases). All 1702 internal tests pass. What this PR validates: PR A's abstraction is correct — every removed helper had a 1:1 avenc equivalent with the same signature shape and the same error-message conventions. No edge case surfaced that required reshaping avenc. PR C (internal/podcast.Concat migration) can proceed against the same stable surface.

Added

internal/avenc/ — shared ffmpeg/ffprobe/TTS-validation helper package (PR A of a 3-PR consolidation; no callers change yet). Operator framing: "the slides.narrate audio code keeps breaking; there's ffmpeg code all over the place across packs; could we have ONE helper everyone uses with 90% test coverage that covers things it can run into?" The intuition is correct. PRs #390, #399, #400, #401, #404, #405 each closed a real audio/video failure mode, but each landed in only one caller (slides.narrate or internal/podcast.Concat). Without consolidation, the next pack that shells out to ffmpeg starts from zero on lessons already paid for. This PR extracts the canonical patterns into internal/avenc/ so future packs (tiktok.shorts, audiobook.generate, …) inherit every battle-tested behaviour automatically AND so we can cover the patterns with a single concentrated test surface instead of partial coverage across N packs. Scope verified: the two packs that shell out to ffmpeg directly are slides.narrate (5+ call sites) and podcast.generate/internal/podcast.Concat (4 call sites). hyperframes.render uses a hyperframes CLI wrapper that opaquely runs ffmpeg — out of scope. slides.render and hyperframes.compose have no ffmpeg surface. External research (Mux, WaveSpeed, vidcutter, ffprobe docs — citations in the plan file) confirmed the canonical patterns the bug history already encoded AND surfaced two gaps closed in this PR: (1) LC_ALL=C prefix on every ffprobe invocation so a sidecar with LC_NUMERIC=de_DE doesn't emit "3,14" instead of "3.14" and strconv.ParseFloat silently returns 0; (2) ffprobe-based MP4 stream presence validation (-show_entries stream=codec_type) so an ffmpeg-exit-0 output with no moov atom / no audio stream surfaces honestly instead of slipping past the byte-floor check. 
The exported surface area: Executor type alias matching packs.ExecutionContext.Exec; MinEncodedSegmentBytes / MinSilenceMP3Bytes / MinTTSResponseBytes / LocalePrefix size + locale constants; IsOOMExitCode shared classifier; RequireNonEmptyOutput post-success size stat; LooksLikeMP3 byte-level MP3 sniffer (MPEG-1/2 Layer III sync words + ID3v2); ValidateMP3Body (size floor + MP3 sniff for HTTP-200-wraps-error case); ValidateMP4Streams (LC_ALL=C ffprobe stream-presence check, optional video/audio); ProbeAudioDuration (LC_ALL=C ffprobe + NaN/±Inf/non-positive rejection); GenerateSilence (anullsrc → libmp3lame + post-validate); ConcatAudio (audio-only concat with mandatory re-encode, configurable codec + bitrate); ConcatVideoMP4s (video stream-copy + audio re-encode, locking in PR #404's asymmetric pattern); PadAudioToMin (4-step composition of silence-gen + concat that no-ops within 1ms epsilon); EncodeVideoSegment (still-image-plus-audio → .mp4 with PR #390's OOM-retry pattern: primary -threads 4 / medium preset → on exit 137 retry ONCE with -threads 1 -preset veryfast, surface CodeResourceExhausted on double-OOM). Tests: 80 unit tests in internal/avenc/*_test.go covering every failure mode per function (happy path, transport error, OOM exit, generic non-zero exit, output validation: missing/0-byte/below-floor, edge cases: NaN/±Inf/0/negative/garbage stdout, MP3 sync variants, comma-decimal locale guard, OOM retry fires + double-OOM escalation + non-OOM-no-retry, byte-stable codec/bitrate/flag-shape regression guards: -c:v copy/-c:a aac/-b:a 192k/-tune stillimage/-shortest/-pix_fmt yuv420p). go test -race -coverprofile=avenc-cover.out ./internal/avenc/... reports 99.3% line coverage across 80 tests, well above the 90% target. 
The mock executor scaffolding (mockExec in validate_test.go) is a substring-keyed scripted-response Executor modelled on slides_narrate_test.go's narrateExecScript, generalised so every avenc test reuses the same goroutine-safe mock instead of each helper rolling its own. Three convenience builders chain: .stdout(needle, out) for happy-exit, .fail(needle, code, stderr) for non-zero exits, .transport(needle, errMsg) for the err != nil case. Reusable in test files of downstream packages without dragging in the engine. What this PR does NOT do: migrate any callers. The plan's PR B will delete slides.narrate's duplicated generateSilence/probeAudioDuration/padSlideAudioToMin/encodeSegment/looksLikeMP3/validateElevenLabsBody/requireNonEmptyOutput helpers and call avenc instead (~200 LOC net deletion in slides_narrate.go). PR C will do the same for internal/podcast/concat.go (~80 LOC net deletion). Each pack migration is a delete-and-replace refactor with byte-identical behaviour tests — if PR A's abstraction is wrong we'll discover it without locking in two pack rewrites. All 1732 internal tests pass (1652 from main + 80 new in avenc/).
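The duration-parsing guard described for ProbeAudioDuration can be illustrated with a minimal Go sketch. parseProbeDuration is a hypothetical name, not the avenc API; it only shows the parse-then-reject step, assuming ffprobe's stdout is a single decimal number of seconds.

```go
package main

import (
	"fmt"
	"math"
	"strconv"
	"strings"
)

// parseProbeDuration is a hypothetical stand-in for the parsing half of
// ProbeAudioDuration. It expects ffprobe's stdout to be one decimal number
// of seconds and rejects anything that cannot be a real duration.
func parseProbeDuration(stdout string) (float64, error) {
	s := strings.TrimSpace(stdout)
	dur, err := strconv.ParseFloat(s, 64)
	if err != nil {
		// Comma-decimal output such as "3,14" fails here; ignoring this
		// error is what would let a zero duration slip through.
		return 0, fmt.Errorf("ffprobe duration %q is not a number: %w", s, err)
	}
	// ParseFloat accepts "NaN" and "Inf" without error, so check explicitly.
	if math.IsNaN(dur) || math.IsInf(dur, 0) || dur <= 0 {
		return 0, fmt.Errorf("ffprobe duration %q is not a positive finite number", s)
	}
	return dur, nil
}

func main() {
	for _, out := range []string{"3.14\n", "3,14\n", "NaN", "-1", "0"} {
		dur, err := parseProbeDuration(out)
		fmt.Printf("%q -> dur=%v err=%v\n", out, dur, err)
	}
}
```

The LC_ALL=C prefix stops the sidecar from emitting comma decimals in the first place. The explicit error check is the second line of defence if that prefix is ever dropped.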

Fixed

slides.narrate no longer speaks image_prompt HTML comments (<!-- image_prompt: ... -->) aloud. The narrator was literally saying "image prompt colon: a chart of revenue by year" because the speaker-notes extractor matched every HTML comment block indiscriminately. Operator-reported on the run that produced the first video after PR #404 closed the Mermaid + audio-dropout gaps: the narrator on each slide was reading both the freeform speaker notes AND the image_prompt comment that slides.outline embeds next to them. The bug was structural: slides.outline instructs the LLM to emit a speaker-notes HTML comment (freeform text the narrator should say) AND an image_prompt HTML comment (structured metadata consumed by slides.outline's own extractImagePrompts to produce a typed image_prompts[] output array). slides.narrate's extractNotes at internal/packs/builtin/slides_notes.go:106 used the generic notePattern regex (which matches any HTML comment) and concatenated EVERY match into the spoken-notes string. Result: the image_prompt's content (a description of the visual the slide should show) ended up in the TTS payload and the narrator spoke it as if it were dialog. Fix: a small isStructuredMetadataComment helper checks each comment's inner-text prefix; comments whose trimmed lowercase body starts with image_prompt: are skipped when building the narrator's TTS input but still get stripped from the visible slide content (the existing catch-all ReplaceAllString keeps that behavior). The filter is an explicit allowlist of prefixes (currently just image_prompt:), chosen over a generic "anything-with-a-colon" filter so legitimate freeform notes that happen to contain a colon ("Note: discuss this further") still get spoken. Future structured-comment prefixes get added to the same allowlist as they ship.
Tests: 6 new sub-cases in TestExtractNotes table — speaker notes plus image_prompt — only notes spoken pins the production-shape behavior, image_prompt only — empty notes confirms a slide with only a prompt and no narration produces empty notes (the narrator path then falls back to silence — correct), image_prompt interleaved with speaker notes — image_prompt dropped confirms ordering doesn't matter, IMAGE_PROMPT uppercase — still filtered pins case-insensitivity so a model that produces uppercase or mixed-case doesn't slip through, image_prompt with weird whitespace — filtered pins whitespace tolerance, and the critical false-positive guard freeform note containing image_prompt as substring — preserved confirms that a legitimate narration that mentions the words "image_prompt" mid-sentence ("The image_prompt feature is documented in the README.") is NOT filtered — the HasPrefix check on trimmed inner text only matches when the metadata prefix is at the very start. All 1652 internal tests pass (+6 from main). What this means for the operator's video: the next narrated deck will speak only the actual speaker notes for each slide, not the image_prompt descriptions. The image_prompts themselves remain available on slides.outline's output as the typed image_prompts[] array (slides.outline's behavior is unchanged); downstream packs that consume that array (e.g. for hero-image generation) continue to work.
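The prefix filter described above might look roughly like the following Go sketch. The regex here is an assumed stand-in for notePattern, and this extractNotes is a simplified illustration, not the code in slides_notes.go; only the helper name isStructuredMetadataComment and the image_prompt: prefix come from the entry.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// commentPattern is an assumed stand-in for notePattern: any HTML comment.
var commentPattern = regexp.MustCompile(`(?s)<!--(.*?)-->`)

// structuredPrefixes lists comment prefixes that are metadata, not narration.
var structuredPrefixes = []string{"image_prompt:"}

// isStructuredMetadataComment reports whether the trimmed, lowercased inner
// text of a comment starts with one of the known metadata prefixes.
func isStructuredMetadataComment(inner string) bool {
	body := strings.ToLower(strings.TrimSpace(inner))
	for _, p := range structuredPrefixes {
		if strings.HasPrefix(body, p) {
			return true
		}
	}
	return false
}

// extractNotes returns the text to speak and the slide with all comments
// stripped. Metadata comments are stripped but never spoken.
func extractNotes(slide string) (notes, visible string) {
	var spoken []string
	for _, m := range commentPattern.FindAllStringSubmatch(slide, -1) {
		if isStructuredMetadataComment(m[1]) {
			continue
		}
		if text := strings.TrimSpace(m[1]); text != "" {
			spoken = append(spoken, text)
		}
	}
	visible = strings.TrimSpace(commentPattern.ReplaceAllString(slide, ""))
	return strings.Join(spoken, " "), visible
}

func main() {
	slide := "# Revenue\n<!-- Revenue grew every year. -->\n<!-- image_prompt: a chart of revenue by year -->"
	notes, visible := extractNotes(slide)
	fmt.Printf("notes=%q visible=%q\n", notes, visible)
}
```

Matching on the prefix rather than on a substring is what keeps a note that merely mentions image_prompt mid-sentence in the spoken output.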
slides.narrate now pre-renders mermaid fenced blocks via mmdc (parity with slides.render) AND re-encodes audio at concat to eliminate mid-segment dropouts. Operator-reported on the first successful run of builtin.repo-presentation after PR #401 unblocked the engine: the video completed end-to-end (no more session: not found), but two surface-level rendering bugs surfaced. (1) Mermaid not rendered. slides.render has shipped a preprocessMermaidFences helper for a while (slides_render.go:399) — it finds mermaid fences, runs mmdc (mermaid-cli) inside the sidecar, converts each diagram to SVG, and substitutes the fence with an inline <img src="data:image/svg+xml;base64,..." />. The sidecar image already ships mmdc with /etc/mmdc/puppeteer-config.json. slides.narrate simply was not calling that helper; raw mermaid blocks landed in /tmp/helmdeck-deck.md and Marp's headless Chromium (which has no built-in Mermaid renderer) left them blank in the per-slide PNGs. Fix: wire preprocessMermaidFences into the slides.narrate handler right after hero-image inlining and before injectFitStyle / write-to-sidecar. The helper lives in the same builtin package, so it's a one-call addition. New mermaid *bool input field on the pack mirrors slides.render's same field — default on (nil ⇒ on), explicit false opts out for decks without diagrams (saves ~500ms of mmdc startup per diagram). (2) Audio dropouts mid-slide. Per-segment AAC frames (1024 samples each) rarely divide cleanly into a TTS-driven segment duration, so the per-segment .mp4s contain partial AAC frames at their tail boundaries. The existing concat command was ffmpeg -y -f concat -safe 0 -i /tmp/concat.txt -c copy /tmp/final.mp4 — -c copy stream-copies BOTH streams, splicing at the wrong-boundary AAC frames and producing audible mid-segment dropouts whenever the audio crossed a segment edge mid-word. 
Fix: split the concat codec flags — video stays stream-copy (-c:v copy, fast and lossless), audio is re-encoded (-c:a aac -b:a 192k, matches the per-segment bitrate). The re-encode pass re-aligns AAC frames at concat time, eliminating dropouts. Cost is a single AAC pass over the total audio (typically 5-15 min of audio, encoded by libavcodec in seconds — negligible vs. the per-segment h264 encode that already spent 5-15 min on video). Video stream-copy is preserved because per-segment h264 IS identical across segments (same libx264 invocation, same params) and GOP structure aligns to keyframes at each segment start. Tests: 4 new in slides_narrate_test.go. Commit A (Mermaid): TestSlidesNarrate_MermaidFencePreprocessed asserts mmdc ran AND the markdown handed to Marp contains the inline-SVG data-URI AND NOT the raw mermaid fence. TestSlidesNarrate_MermaidOptOut asserts mermaid:false skips mmdc even on a deck with fences. TestSlidesNarrate_NoMermaidFenceSkipsMmdc asserts a fence-free deck pays zero mmdc cost. The narrateExecScript test harness gained an mmdc case (ordered BEFORE the cat > case because the mmdc wrapper shell script contains both substrings — a subtle ordering bug we caught with the first failing test run). Commit B (audio): TestSlidesNarrate_ConcatReencodesAudio pins the new flag shape — must contain -c:v copy, -c:a aac, -b:a 192k, and MUST NOT contain the legacy -c copy (which would stream-copy both streams and re-introduce dropouts). The explicit "must not contain" assertion is the bug-shape guard against a future "make concat faster" refactor that quietly reverts. All 1646 internal tests pass (+4 across both commits). 
What this means for the originally-reported video: re-running the same builtin.repo-presentation against the same deck now produces (a) per-slide PNGs with Mermaid diagrams visible (the slide that triggered PR #399's earlier failure shape will actually render correctly this time), and (b) continuous audio across segment boundaries with no mid-sentence dropouts.
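The split concat flags, and the "must not contain" guard that protects them, can be sketched in Go as follows. concatArgs and hasFlagPair are hypothetical names used only for illustration; the flags and paths are the ones quoted in this entry.

```go
package main

import "fmt"

// concatArgs builds the concat command with video stream-copied and audio
// re-encoded, so AAC frames are re-aligned across segment boundaries.
func concatArgs(listPath, outPath string) []string {
	return []string{
		"ffmpeg", "-y",
		"-f", "concat", "-safe", "0",
		"-i", listPath,
		"-c:v", "copy", // video: stream-copy
		"-c:a", "aac", "-b:a", "192k", // audio: re-encode at the per-segment bitrate
		outPath,
	}
}

// hasFlagPair reports whether flag is immediately followed by value.
func hasFlagPair(args []string, flag, value string) bool {
	for i := 0; i+1 < len(args); i++ {
		if args[i] == flag && args[i+1] == value {
			return true
		}
	}
	return false
}

func main() {
	args := concatArgs("/tmp/concat.txt", "/tmp/final.mp4")
	fmt.Println(args)
	fmt.Println("legacy -c copy present:", hasFlagPair(args, "-c", "copy"))
}
```

Checking exact argument pairs rather than substrings matters here, because "-c:v copy" would otherwise look like a match for "-c copy" in a naive string search.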
Pinned-session reuse honors the longest-needed Spec.Timeout across packs in a pipeline — closes the shared-session-watchdog bug where slides.narrate's 30-minute timeout was silently overridden by repo.fetch's 5-minute default and the watchdog killed multi-segment encodes at ~5 minutes with session: not found. Operator-observed on run_71be278e92d7bb5b after PR #400 made the failure honest: slides.narrate failed at segment 7 (~4 minutes into the encode loop) with the message PR #400 introduced — ffmpeg segment 7: docker-exec transport error (ffmpeg did NOT return a real exit code): session: not found. The honest error was the win — operators now know the session was killed, not that ffmpeg failed. The underlying bug was that the watchdog (internal/session/watchdog.go:57) computes the kill deadline as s.CreatedAt + s.Spec.Timeout, where s.Spec.Timeout is frozen at session-create time by whichever pack first called Runtime.Create. In builtin.repo-presentation's flow — repo.fetch (creates session, preserves via _session_id) → repo.map → slides.outline → slides.narrate — every follow-on pack reuses the session created by repo.fetch, inheriting its Spec.Timeout. Even though slides.narrate's pack declaration sets SessionSpec.Timeout = 30 * time.Minute, the reused session retained repo.fetch's (default) 5-minute timeout. Control-plane logs from the operator's run confirmed it: 13:41:30 reusing pinned session pack=slides.narrate session_id=f9a98cec…, then 13:45:45 watchdog terminating expired session age=5m7s — exactly the 5-minute pre-extension deadline, well inside slides.narrate's needed window. 
The fix adds a new method Runtime.ExtendTimeout(ctx, id, newTimeout) to the session.Runtime interface. When called with newTimeout > current Spec.Timeout, it updates the session's in-memory Spec.Timeout so the watchdog uses the longer deadline; when called with an equal or shorter value, it is a no-op (the deadline never shrinks under a pinned reuse, so a fast follow-on pack cannot accidentally pull the deadline down). Implemented in both internal/session/docker.Runtime (production) and internal/session/fake.Runtime (tests). internal/packs/packs.go around line 605 (the existing pinned-session-reuse branch) now calls ExtendTimeout when pack.SessionSpec.Timeout > sess.Spec.Timeout and logs extended pinned session timeout with the old and new values. The call is best-effort: on failure the engine logs at WARN and proceeds, and the worst case is the pre-fix behavior (watchdog kills at the old deadline); that's a fallback, not a regression. What this does NOT do (deliberate scope decisions): MemoryLimit, CPULimit, SHMSize, and the container's actual runtime resources are NOT mutated on reuse; those are frozen by Docker at container creation and cannot be changed on a live container without restart. A pipeline that needs more memory for slides.narrate than repo.fetch allocated would still need a "pipeline-level max-Spec aggregation" pass (separate follow-up, larger surgery). Timeout is uniquely runtime-mutable because it only affects the in-memory deadline the watchdog reads, not container resource caps. Tests: 8 new tests in 3 files. internal/session/fake/fake_test.go (NEW): TestFakeRuntime_ExtendTimeout_GrowsDeadline (basic extend contract), _NeverShrinks (table-driven across equal/shorter/zero values; critical so the deadline never goes backward), _UnknownSession (returns ErrSessionNotFound so callers can distinguish missing session from no-op).
internal/session/watchdog_test.go — TestWatchdogRespectsExtendedTimeout (regression guard for the actual production failure: session injected with CreatedAt 6 minutes ago and Timeout=5m would normally die immediately on watchdog tick; after ExtendTimeout to 30m the watchdog must skip it). internal/packs/packs_test.go — TestEngine_PinnedSessionReuse_ExtendsTimeoutWhenLonger (slides.narrate-shaped: pack Spec.Timeout=30m reusing a session with current Timeout=5m triggers exactly one ExtendTimeout call with the right session id and newTimeout=30m), _NoExtendWhenShorter (repo.map-shaped: pack Timeout=5m reusing a session with Timeout=30m triggers NO call — critical for the no-shrink invariant at the engine layer), _NoExtendWhenEqual (boundary: equal timeouts skip the extend so there is no spurious log line or registry mutation), _ExtendErrorDoesNotFailHandler (when ExtendTimeout errors, the handler still runs — best-effort posture). Three existing Runtime stub implementations across the test tree updated to satisfy the new interface method: internal/packs/builtin/screenshot_url_test.go, internal/api/desktop_vnc_test.go, and the existing engine-level fakeRuntime in packs_test.go (extended with extendCalls capture + getTimeout knob). All 1642 internal tests pass (+8 from main). What this means for the originally-reported failure (run_71be278e92d7bb5b): a re-run of the same builtin.repo-presentation against the same Mermaid-bearing deck now extends the shared session's timeout to slides.narrate's 30 minutes at the moment slides.narrate starts, so the watchdog will not kill the session mid-encode. If the deck still fails at segment 7, the failure must be something OTHER than the watchdog — either ffmpeg producing 0-byte output (caught by PR #400's post-encode check), a Mermaid render issue (caught by PR #399's PNG validation), or a genuinely transient docker disconnect (now visible as the honest transport error). 
The three remaining failure modes are distinguishable from each other and from real pack bugs.
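The grow-only contract of ExtendTimeout can be sketched in Go under simplifying assumptions. The registry type, the bool return value, and the age-based expired check are hypothetical: the real method takes a ctx, and the watchdog compares against CreatedAt + Spec.Timeout.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

var ErrSessionNotFound = errors.New("session: not found")

// Timeouts and session age taken from the scenario in this entry.
const (
	repoFetchTimeout = 5 * time.Minute
	narrateTimeout   = 30 * time.Minute
	sessionAge       = 6 * time.Minute
)

// registry is a hypothetical in-memory stand-in for the session runtime.
type registry struct {
	mu       sync.Mutex
	timeouts map[string]time.Duration
}

func newRegistry() *registry {
	return &registry{timeouts: make(map[string]time.Duration)}
}

func (r *registry) create(id string, timeout time.Duration) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.timeouts[id] = timeout
}

// ExtendTimeout grows the timeout and reports whether it changed.
// Equal or shorter values are a no-op, so the deadline never shrinks.
func (r *registry) ExtendTimeout(id string, newTimeout time.Duration) (bool, error) {
	r.mu.Lock()
	defer r.mu.Unlock()
	current, ok := r.timeouts[id]
	if !ok {
		return false, ErrSessionNotFound
	}
	if newTimeout <= current {
		return false, nil
	}
	r.timeouts[id] = newTimeout
	return true, nil
}

// expired expresses the watchdog rule (CreatedAt + Timeout) in terms of age.
func (r *registry) expired(id string, age time.Duration) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	timeout, ok := r.timeouts[id]
	return ok && age > timeout
}

func main() {
	r := newRegistry()
	r.create("s1", repoFetchTimeout)
	fmt.Println("expired before extend:", r.expired("s1", sessionAge))
	extended, err := r.ExtendTimeout("s1", narrateTimeout)
	fmt.Println("extended:", extended, "err:", err)
	fmt.Println("expired after extend:", r.expired("s1", sessionAge))
}
```

Returning a distinct error for an unknown session lets a caller tell "nothing to extend" apart from "no extension needed", which is the distinction the _UnknownSession test pins.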
slides.narrate silent-failure surface closed across PNG, ffmpeg encode/concat, ffprobe, ElevenLabs TTS, and silence/pad paths — eliminates the recurring "ffmpeg segment N failed (exit 0)" misclassification and the audit-identified gap class behind it. Operator reported the same handler_failed: ffmpeg segment 4 failed (exit 0) shape PR #399 was meant to eliminate, this time on a Mermaid sequence diagram in slide 4. Three parallel Explore-agent audits surfaced the structural cause: slides.narrate had twelve silent-failure modes, and PR #399's validateMarpPngs size-only check only covered two of them. Among the rest: the per-segment ffmpeg error template at slides_narrate.go:587 (and the parallel concat path at 616) printed res.ExitCode unconditionally even when the failure was err != nil (docker-exec transport error from a session disconnect / container kill mid-call) and res.ExitCode was the zero value — operators were reading "exit 0" when ffmpeg never actually returned anything, then classifyShellExitCode couldn't match 0 and the classification fell through to CodeHandlerFailed → FailurePackBug, minting a misleading "file a helmdeck issue" URL for what was really an infrastructural failure. The taxonomy also included no post-encode file existence check (ffmpeg can exit 0 yet produce a 0-byte mp4 on malformed input), no PNG magic-byte check (a >=1024-byte file can still be corrupt placeholder content), probeAudioDuration silently accepted NaN/Inf/0 (locale-affected ffprobe or upstream LLM garbage), generateSilence had no post-write stat, ElevenLabs returned {"error":"..."} wrapped in HTTP 200 was treated as valid audio bytes, and padSlideAudioToMin had zero test coverage on its 4-step pipeline. 
The fix is structural — two batched commits: (A) honest error messages on transport errors (lines 587/616 split into err != nil branch with "docker-exec transport error (ffmpeg did NOT return a real exit code): " and Cause: wrapping so callers can errors.As), a new requireNonEmptyOutput(ctx, ec, path, minBytes, label) helper that stats produced files via the same wc -c < FILE pattern validateMarpPngs already uses and is called after every per-segment encode AND after the concat output, and a PNG-magic-byte extension to validateMarpPngs that reads the first 8 bytes via head -c 8 | od -An -tx1 and compares against pngMagicHex (89504e470d0a1a0a) so corrupt-but-larger-than-floor placeholder content surfaces with the same Mermaid hint. (B) probeAudioDuration rejects NaN, ±Inf, and dur <= 0 with math.IsNaN/math.IsInf after ParseFloat, generateSilence calls requireNonEmptyOutput after exit 0 with a 256-byte floor for libmp3lame's ID3v2-plus-frame overhead, elevenLabsTTS validates the HTTP-200 body via a new validateElevenLabsBody helper (extracted so the logic is unit-testable without an HTTP stub — elevenLabsBaseURL is a const) that rejects bodies under minTTSResponseBytes (512) or that fail the looksLikeMP3 sniff (accepts MPEG-1/2 Layer III sync words 0xFF 0xFB/0xFA/0xF3/0xF2 and the ID3 v2 tag header). What about the cross-pack pattern? The audit found hyperframes.render already does if len(videoBytes) == 0 (line 330) and slides.render does if len(res.Stdout) == 0 (line 264) — both correct for their shape (already-loaded byte slices vs. files on disk). The new requireNonEmptyOutput is specifically for stat-after-write scenarios; it didn't make sense to retrofit packs whose checks are already correct in mechanism. Tests: 24 new tests + 1 stub expansion across both commits. 
Commit A: TestValidateMarpPngs_BadPngMagic_ReturnsInvalidInputNamingSlide (slide-2 corrupt magic → CodeInvalidInput naming slide and explaining the signature mismatch), TestSlidesNarrate_SegmentTransportError_HonestMessage (asserts the message must NOT contain "exit 0" on transport error, MUST contain "transport error" and "did NOT return a real exit code"), TestSlidesNarrate_SegmentExitZeroEmptyOutput_PostCheckFires (ffmpeg exit 0 + empty .mp4 surfaces at SEGMENT step, not later at concat), TestSlidesNarrate_ConcatTransportError_HonestMessage (mirror of the segment-path test for the concat step), TestRequireNonEmptyOutput_* (3 direct unit tests on the helper — healthy/missing/below-floor). Commit B: TestProbeAudioDuration_RejectsNaN/_RejectsInfinity/_RejectsNonPositive (table-driven), TestProbeAudioDuration_AcceptsPositiveFloat (happy baseline), TestGenerateSilence_PostCheckCatches0Byte (ffmpeg exit 0 + empty silence file surfaces an error), TestLooksLikeMP3_Identifies (9 sub-cases: MP3 sync variants, ID3v2, JSON envelope, empty, garbage, wrong second-byte mask), TestValidateElevenLabsBody (5 sub-cases: healthy/JSON-error/empty/under-floor/HTML-error-page), TestPadSlideAudioToMin_HappyPath/_NoOpWhenDeficitNegligible/_StopsOnMidStepFailure (closes the audit-flagged zero coverage on the 4-step pipeline). fakeMP3 expanded from 6 bytes to 1026 bytes of valid MP3 prefix + zero padding so existing TTS tests still pass minTTSResponseBytes. Two existing tests (TestSlidesNarrate_FfmpegConcatFailure, TestSlidesNarrate_FfmpegSegmentFailure_FullStderrSurfaced) updated to also script a healthy head -c 8 magic response so they reach their targeted ffmpeg failure paths past the new validation gates. All 1634 internal tests pass. 
What this means for the originally-reported failure (run_b0aacfabb479f5f3, segment 4 with Mermaid sequence diagram): if the failure was a transport error, the message is now honest about it ("ffmpeg segment 4: docker-exec transport error (ffmpeg did NOT return a real exit code): ") instead of misleading "exit 0". If the failure was ffmpeg-exit-0-but-no-output, the new post-encode check surfaces "ffmpeg segment 4: produced only 0 bytes (below the 1024-byte floor)" — operators see the encode produced nothing and the error names the actual cause. If the failure was Mermaid producing a corrupt but >=1024-byte PNG, the magic-byte check intercepts it at validation with the same caller-fixable "edit slide N" hint. The three paths are now distinguishable from each other and from real pack bugs.
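The HTTP-200-wraps-error guard on the TTS path can be illustrated with a short Go sketch. The sync bytes and the 512-byte floor come from this entry; validateTTSBody is a hypothetical stand-in for validateElevenLabsBody, and the sniff is deliberately shallow (first bytes only).

```go
package main

import (
	"bytes"
	"fmt"
)

const minTTSResponseBytes = 512

// looksLikeMP3 checks for an ID3v2 tag header or one of the MPEG-1/2
// Layer III sync words named in the changelog entry.
func looksLikeMP3(b []byte) bool {
	if len(b) >= 3 && bytes.Equal(b[:3], []byte("ID3")) {
		return true
	}
	if len(b) >= 2 && b[0] == 0xFF {
		switch b[1] {
		case 0xFB, 0xFA, 0xF3, 0xF2:
			return true
		}
	}
	return false
}

// validateTTSBody rejects bodies that are too small or do not start like
// an MP3, e.g. a JSON error envelope returned with HTTP 200.
func validateTTSBody(body []byte) error {
	if len(body) < minTTSResponseBytes {
		return fmt.Errorf("TTS body is only %d bytes (below the %d-byte floor)", len(body), minTTSResponseBytes)
	}
	if !looksLikeMP3(body) {
		return fmt.Errorf("TTS body does not look like MP3 audio")
	}
	return nil
}

func main() {
	audio := make([]byte, 1026)
	audio[0], audio[1] = 0xFF, 0xFB
	fmt.Println("audio:", validateTTSBody(audio))
	fmt.Println("json: ", validateTTSBody([]byte(`{"error":"quota exceeded"}`)))
}
```

The two checks are complementary: the size floor catches short error envelopes, and the sniff catches large non-audio bodies such as an HTML error page.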
slides.narrate validates each marp-rendered PNG BEFORE handing it to ffmpeg — silent marp render failures (Mermaid blocks, custom HTML, broken fenced YAML) now surface as caller_fixable: slide N produced no rendered PNG instead of the misleading pack_bug: ffmpeg segment N failed (exit 0). Operator-reported: a live builtin.repo-presentation run failed at slides.narrate step with handler_failed: ffmpeg segment 3 failed (exit 0), which the gateway classifier routed to failure_class: pack_bug and minted an auto-generated "file a helmdeck issue" URL. The smoking gun: ffmpeg exited 0 (success) yet the handler returned a failure — because the Exec wrapper observed a transport-layer error on what was nominally a successful segment, OR (more commonly) ffmpeg "succeeded" on a malformed input PNG and produced a zero-byte segment file. Either path pointed operators at a non-existent pack bug instead of the actual problem: the slide markdown contained an embedded block — in the reported case a flowchart LR Mermaid diagram — that marp's headless Chromium silently failed to render, leaving an empty or near-empty PNG for that slide. The bug class is structural: marp returns exit 0 from --images png even when individual slides render to nothing, so the existing exit-code check at the marp call site (line 402) cannot catch per-slide render failures. The fix is a pre-flight validateMarpPngs pass after marp succeeds and before the per-segment ffmpeg loop. For each expected slide PNG (/tmp/slides/deck.NNN.png, 1-based per marp's convention), wc -c < FILE is statted via the same shell-exec pattern fs.read uses (fs_packs.go:140-151). Two failure cases surface as CodeInvalidInput (which classify.go maps to FailureCallerFixable): (1) wc exits non-zero → file missing entirely → "slide N produced no rendered PNG (marp exited 0 but the expected output file is missing). 
Most common cause: an embedded block marp's headless Chromium can't render — a Mermaid diagram (flowchart, sequenceDiagram), custom HTML with broken CSS, or a fenced YAML that confuses the parser. Edit slide N's markdown to remove or simplify the offending block, then re-run."; (2) wc returns under minRenderedSlidePngBytes (1024) → file is the marp-blank signature → "slide N's rendered PNG is only X bytes (below the 1024-byte floor), which is the signature of a silent marp render failure …". A transport-layer error on the stat call surfaces as CodeHandlerFailed (NOT CodeInvalidInput) because the caller's input is fine and the failure is infrastructural — same defense-in-depth posture as the rest of the handler. Threshold reasoning: 1024 bytes is well below any real rendered slide (the smallest sensible solid-color 1920×1080 PNG is several KB after deflate overhead even with maximal compression) and well above the few hundred bytes marp's blank-output mode produces, so the floor is safe in both directions — no false positives on legitimately sparse slides, no false negatives on tiny garbled output. The existing per-segment ffmpeg error path is unchanged: if a PNG passes validation but ffmpeg still fails downstream, the operator gets the original ffmpeg-segment-failed message (the pre-flight check is additive, not a replacement). 
Tests: 5 new tests in slides_narrate_test.go — TestValidateMarpPngs_AllHealthy_NoError (3 slides all >=1024 bytes pass, 3 wc-c calls observed), TestValidateMarpPngs_MissingFile_ReturnsInvalidInputNamingSlide (slide 3 missing → CodeInvalidInput with "slide 3" in message + "Mermaid" hint; loop stops at first failure, so slide 4 is not statted), TestValidateMarpPngs_TinyFile_ReturnsInvalidInputWithSize (slide 2 at 256 bytes → CodeInvalidInput with "slide 2" and "256 bytes" both surfaced so operators can sanity-check), TestValidateMarpPngs_AtFloor_Passes (boundary test: exactly 1024 bytes passes, catches < vs <= off-by-one regressions), TestValidateMarpPngs_TransportError_ReturnsHandlerFailed (an Exec error returns CodeHandlerFailed, not CodeInvalidInput — input may be fine, failure is infrastructural). Two existing tests (TestSlidesNarrate_FfmpegConcatFailure, TestSlidesNarrate_FfmpegSegmentFailure_FullStderrSurfaced) updated to also script a healthy wc -c response so they reach their targeted ffmpeg failure paths instead of stopping at the new pre-flight gate. All 1600 internal tests pass. What this means for the originally-reported failure: the slide-3 Mermaid flowchart LR block would now stop the run with failure_class: caller_fixable and message "slide 3 produced no rendered PNG (marp exited 0 but the expected output file is missing). Most common cause: an embedded block marp's headless Chromium can't render — a Mermaid diagram…" — the operator gets the exact slide to edit, without burning ElevenLabs TTS credits and ~30s of ffmpeg work for the misleading-bug-report outcome. Out of scope: rendering Mermaid blocks ahead of marp (a marp-cli --engine plugin), or marp-side per-slide error reporting (would require an upstream marp change). Both are valid follow-ups but orthogonal to surfacing the failure honestly.
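How one stat result maps onto the two error codes can be sketched in Go. The packError type, the code string values, and checkSlidePNG are hypothetical illustrations of the classification rule, not the validateMarpPngs implementation; the 1024-byte floor is the one from this entry.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
)

const minRenderedSlidePngBytes = 1024

type code string

// Hypothetical code values; only the two-way split matters here.
const (
	codeInvalidInput  code = "invalid_input"  // caller-fixable: edit the slide
	codeHandlerFailed code = "handler_failed" // infrastructural: input may be fine
)

type packError struct {
	Code    code
	Message string
}

func (e *packError) Error() string { return string(e.Code) + ": " + e.Message }

// codeOf returns the code carried by err, or "" when err is nil or foreign.
func codeOf(err error) code {
	var pe *packError
	if errors.As(err, &pe) {
		return pe.Code
	}
	return ""
}

// checkSlidePNG interprets one `wc -c < FILE` result for 1-based slide n.
func checkSlidePNG(n int, stdout string, exitCode int, execErr error) error {
	if execErr != nil {
		return &packError{codeHandlerFailed, fmt.Sprintf("stat slide %d: %v", n, execErr)}
	}
	if exitCode != 0 {
		return &packError{codeInvalidInput, fmt.Sprintf("slide %d produced no rendered PNG", n)}
	}
	size, err := strconv.Atoi(strings.TrimSpace(stdout))
	if err != nil {
		// Assumption: unparseable stat output is treated as infrastructural.
		return &packError{codeHandlerFailed, fmt.Sprintf("stat slide %d: bad size %q", n, stdout)}
	}
	if size < minRenderedSlidePngBytes {
		return &packError{codeInvalidInput, fmt.Sprintf("slide %d's rendered PNG is only %d bytes", n, size)}
	}
	return nil
}

func main() {
	fmt.Println(checkSlidePNG(1, "48211\n", 0, nil))
	fmt.Println(checkSlidePNG(2, "256\n", 0, nil))
	fmt.Println(checkSlidePNG(3, "", 1, nil))
}
```

The comparison is strictly less-than, so a file of exactly 1024 bytes passes, matching the boundary case the _AtFloor_Passes test describes.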
Control-plane image builds the web bundle inside a Node Docker stage — eliminates the recurring "blank page after rebuild" failure mode where web/dist/index.html references bundle hashes that aren't in the image. Operator-observed pattern (visible twice in the local stash history as local web/dist rebuild entries): after a docker rebuild, the Management UI loads / but renders blank, because the embedded index.html references /assets/index-XXX.js paths that don't exist in the image. The 801-byte HTML returned for every URL was the SPA fallback serving index.html for any unknown path — so the browser tried to execute HTML as JavaScript and silently failed. Root cause: only web/dist/index.html was tracked in git (the placeholder mentioned in web/embed.go); the matching web/dist/assets/*.js,*.css were always untracked. The Dockerfile did not run npm run build — it just COPY web ./web from the host. So the image's embedded bundle = whatever happened to be on the developer's host at build time, and EVERY drift between the committed HTML and the local assets (a stale checkout, a pulled main, a git stash of a local rebuild, a git checkout reverting index.html while leaving local assets/* untouched) produced a broken image. The bug class is structural — not a one-off — which is why the fix is structural too. Fix shape: add a Node web-build stage to deploy/docker/control-plane.Dockerfile that runs npm ci && npm run build inside the image, producing a self-consistent web/dist/{index.html,assets/*}. The Go stage then COPY --from=web-build /web/dist ./web/dist so the embedded HTML and the embedded assets are co-generated from the SAME source tree in the SAME build — byte-for-byte consistent by construction. .dockerignore adds web/dist/ so the host's local dist is never copied into the build context, removing the only path by which host-side drift could leak in. 
web/dist/index.html is now a tiny stable stub (no asset references, no hashed filenames) that exists solely to satisfy //go:embed all:dist during host-side go build / IDE compilation / go test. The stub is byte-stable so it can't drift; a local npm run build overwrites it for browser-facing dev. Verified end-to-end: built dockerfile-fix-test image from this Dockerfile, ran it on port 3099 — GET / returned index.html referencing index-4AsUCFtK.js (fresh hash from the in-image Node build, distinct from whatever was on the host), and GET /assets/index-4AsUCFtK.js returned 215097 bytes of text/javascript — proving the image's HTML and assets are bound together, not subject to any host-side dependency. What this fixes long-term: no developer needs to remember to cd web && npm run build before docker build; no contributor needs to know which files to commit and which are gitignored; CI builds and local builds produce equivalent images; the recurring "rebuild + blank page" footgun is removed from the dev loop. web/embed.go's doc comment updated to describe the two-source flow (production = web-build stage; host = stub or local npm build). Out of scope: trimming web/dist/index.html from the git index (would require all contributors with stale local dist to do a clean checkout — separate housekeeping PR).
Pipeline-run single-flight coalescing — duplicate concurrent pipeline-run requests with the same (caller, pipeline id, inputs) no longer spawn a second identical execution. Operator-observed: some LLM clients time out on a long-running pipeline-run call (multi-minute pipelines like slides.narrate, *-video, research.deep) before the underlying pipeline finishes, then RETRY the same call thinking the original failed. The original run was still in-flight; the retry happily started a SECOND identical run. With pipelines like slides.narrate — which we JUST fixed (PR #390) to encode within an 8g memory cap by capping ffmpeg threads to 4 and adaptive-retrying OOM segments — two concurrent runs against the same memory budget reliably OOM both, defeating the single-run fix. The shape was wrong: internal/mcp/pipelines.go (line 146 case pipeline-run) called s.pipelines.StartRun(ctx, a.ID, a.Inputs) and returned the new run_id unconditionally — no fingerprint check, no in-flight-duplicate detection. The fix is single-flight coalescing at the StartRun boundary, not rejection: when an identical in-flight run already exists, the new caller gets back the ORIGINAL run's run_id plus coalesced: true. The caller's next pipeline-run-status poll works against the real run instead of spawning a duplicate execution. Fingerprint = sha256(caller || pipeline_id || canonical_json(inputs)). The canonicalization is deterministic across JSON whitespace and object-key ordering — two callers POSTing the same logical inputs with different formatting (one minified, one pretty-printed; one with keys declared, one alphabetized) coalesce together. Empty inputs normalize to null so empty-body POSTs coalesce with each other. Migration 0008_pipeline_run_fingerprint.sql adds caller TEXT NOT NULL DEFAULT '' + fingerprint TEXT NOT NULL DEFAULT '' columns to pipeline_runs plus a partial unique index WHERE fingerprint <> '' AND status IN ('pending','running'). 
Both columns are additive with safe defaults so a downgrade-to-prev-binary still reads old rows (the empty-fingerprint legacy rows are excluded from uniqueness). Concurrency guard: a new startMu sync.Mutex on runRegistry serializes the (fingerprint-lookup, INSERT) critical section so two goroutines racing with identical fingerprints can't both miss the lookup and insert duplicates. The partial unique index is the belt (DB-level guarantee against multi-process races, e.g. two control-plane replicas); the mutex is the suspenders (turns the constraint violation into a clean coalesced=true return). When the INSERT does collide despite the mutex (only possible across replicas), StartRun re-resolves the fingerprint and returns the winner instead of surfacing a UNIQUE error. What does NOT coalesce: different caller, different pipeline id, different inputs (all 3 are in the fingerprint); a terminal run (the lookup filters status IN ('pending','running'), so a finished run never coalesces a fresh request onto a stale result). Rerun gets the dedup for free since it delegates to StartRun — an operator who spam-clicks "Rerun" gets the existing in-flight back, not 5 duplicate runs. Surface change: Runner.StartRun and Runner.Rerun signatures gain a coalesced bool return; internal/mcp.PipelineService.StartRun/Rerun same; MCP pipeline-run/pipeline-rerun responses gain a coalesced field; REST POST /api/v1/pipelines/{id}/run and /runs/{runId}/rerun responses gain the same field. Existing callers that don't read coalesced see byte-identical run-status semantics — they still poll, the run-status flow is unchanged. MCP tool descriptions updated so LLM clients learn that coalesced: true is NOT an error and should be polled like any other run_id. 
Tests: 4 new tests in internal/pipelines/runner_test.go — TestComputeRunFingerprint_StableAndDistinct (8 sub-cases: identical / reordered-keys / whitespace / nested-reorder all coalesce; different-caller / different-pipeline / different-input / empty-vs-empty all distinguish correctly), TestRunner_StartRun_CoalescesIdenticalInFlight (4 sub-assertions: first call non-coalesced + 3 duplicates with whitespace-normalized inputs coalesce, different-caller spawns fresh, different-inputs spawns fresh), TestRunner_StartRun_DoesNotCoalesceOntoTerminalRun (the regression guard — a finished run must NOT coalesce a fresh request onto stale results), TestRunner_StartRun_ConcurrentIdenticalCalls (the race-window guard — N=8 goroutines fire simultaneously, exactly 1 run exists in the store, N-1 callers see coalesced=true). All 1595 internal tests pass. Architecture note: this is "single-flight at the API boundary, not the runner" — the runner still has exactly one execution per run; we just dedupe the creation of new runs when an identical one is already running. Friendlier than rejection (the retrying client gets a useful run_id to poll) and friendlier than back-pressure (no wait, no holding the connection).
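The fingerprint formula above can be sketched in Go. This is an illustration under assumptions: the zero-byte separator between fields and the function names are hypothetical, and canonicalization leans on encoding/json, which sorts map keys and drops insignificant whitespace when re-marshalling.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// canonicalJSON re-encodes raw JSON so that key order and whitespace do not
// affect the bytes. Empty input normalizes to null.
// Caveat: numbers pass through float64, so very large integers may lose
// precision; a decoder with UseNumber would avoid that.
func canonicalJSON(raw []byte) ([]byte, error) {
	if len(bytes.TrimSpace(raw)) == 0 {
		return []byte("null"), nil
	}
	var v interface{}
	if err := json.Unmarshal(raw, &v); err != nil {
		return nil, err
	}
	return json.Marshal(v)
}

// runFingerprint hashes caller, pipeline id and canonical inputs.
// The 0x00 separator is an assumption made to keep field boundaries unambiguous.
func runFingerprint(caller, pipelineID string, inputs []byte) (string, error) {
	canon, err := canonicalJSON(inputs)
	if err != nil {
		return "", err
	}
	h := sha256.New()
	for _, part := range [][]byte{[]byte(caller), []byte(pipelineID), canon} {
		h.Write(part)
		h.Write([]byte{0})
	}
	return hex.EncodeToString(h.Sum(nil)), nil
}

func main() {
	a, _ := runFingerprint("openclaw", "builtin.repo-presentation", []byte(`{"repo":"x","depth":2}`))
	b, _ := runFingerprint("openclaw", "builtin.repo-presentation", []byte("{ \"depth\": 2,\n  \"repo\": \"x\" }"))
	fmt.Println("same fingerprint:", a == b)
}
```

Two requests that produce the same fingerprint while a run is pending or running would then resolve to the same run_id, which is the coalescing behaviour this entry describes.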
slides.narrate ffmpeg thread cap (4) + adaptive retry on OOM-killed segments (degraded encoder settings). Operator reported that different LLM models produced decks of variable visual complexity, and the dense ones OOMed even at HELMDECK_SLIDES_NARRATE_MEMORY_LIMIT=8g while the sparse ones didn't. Same memory budget, same slide count, same resolution; only the per-frame encoder working set varied. The root cause: the per-segment ffmpeg command had no -threads flag, so libx264 grabbed every host core (12 on a typical workstation), and each thread holds ~50-80 MB of frame buffers at 1080p. That's ~800 MB of encoder state before reference frames, lookahead, and Chromium's resident set, so the marginal slide that pushes encoder peak over budget OOMs every time. Two fixes: (1) add an explicit -threads N to the per-segment ffmpeg command. Default N=4, which cuts peak by ~3× at the cost of ~20% wall-clock per segment, a small price for runs that no longer OOM. New env var HELMDECK_SLIDES_NARRATE_FFMPEG_THREADS lets operators with abundant RAM bump higher, or hosts with tight RAM drop to 1-2. Same operator-tunable idiom as HELMDECK_SLIDES_NARRATE_MEMORY_LIMIT. ADR 045 stays in place: CPUProfile=ProfileCompute still scales the container's CPU quota with host cores; this cap is narrowly about the encoder thread count, not CPU allocation. (2) Adaptive retry on CodeResourceExhausted: if a per-segment encode returns exit 137 (OOM-classified by classifyShellExitCode), the handler retries that ONE segment with -threads 1 -preset veryfast. The combination cuts encoder memory roughly in half versus the primary attempt at the cost of a small bitrate-efficiency hit (CRF 23 still looks fine; the cost is file size, not visible artifacts). Retry is bounded to one attempt per segment; if both OOM, the handler surfaces CodeResourceExhausted so the operator can bump MemoryLimit and rerun. Retry logs at WARN level so post-mortems show when degraded encoding fired and on which segment. 
Architecture note: this is "smart resource management without going to Kubernetes" — Docker compose stays, the operator gets two tunable knobs (memory cap + thread cap) plus automatic per-segment degradation. A future PR may add GPU/NVENC swap when the sidecar exposes /dev/nvidia* (filed as a follow-up issue), which would eliminate the memory wall entirely on GPU-equipped hosts. Tests: 3 new helper tests (TestSlidesNarrateFfmpegThreads_DefaultWhenEnvUnset / _OverrideHonored / _GarbageFallsThroughToDefault), 1 new retry-success test (TestSlidesNarrate_AdaptiveRetryOnOOM — primary returns exit 137, retry returns 0, asserts the retry carries the degraded flags AND the primary does NOT), 1 new retry-fails-too test (TestSlidesNarrate_DoubleOOMSurfacesCodeResourceExhausted — both attempts OOM, asserts exactly 2 attempts, no third escalation, returns CodeResourceExhausted). All 1583 internal tests pass.
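The two knobs above (env-driven thread cap with a default of 4, plus one degraded retry on exit 137) might look roughly like this. The function names and the run callback are hypothetical; only the env var name, the default, the exit code, and the degraded flags come from the entry itself.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

const defaultFfmpegThreads = 4

// ffmpegThreads reads the thread cap from the environment; unset, non-numeric
// or non-positive values fall through to the default.
func ffmpegThreads(getenv func(string) string) int {
	raw := strings.TrimSpace(getenv("HELMDECK_SLIDES_NARRATE_FFMPEG_THREADS"))
	n, err := strconv.Atoi(raw)
	if err != nil || n < 1 {
		return defaultFfmpegThreads
	}
	return n
}

// encodeSegment runs the primary attempt and, only when it exits 137, retries
// once with degraded flags. run is a stand-in for invoking ffmpeg and returns
// the process exit code.
func encodeSegment(run func(extraArgs []string) int, threads int) (exitCode, attempts int) {
	exitCode = run([]string{"-threads", strconv.Itoa(threads)})
	attempts = 1
	if exitCode == 137 {
		exitCode = run([]string{"-threads", "1", "-preset", "veryfast"})
		attempts = 2
	}
	return exitCode, attempts
}

func main() {
	threads := ffmpegThreads(os.Getenv)
	// Fake runner: the primary attempt is OOM-killed, the degraded retry succeeds.
	calls := 0
	code, attempts := encodeSegment(func(args []string) int {
		calls++
		if calls == 1 {
			return 137
		}
		return 0
	}, threads)
	fmt.Println("threads:", threads, "exit:", code, "attempts:", attempts)
}
```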
Closed-set classifier coerced PR #379 and PR #381 typed codes to internal (silent regression in both prior PRs); ElevenLabs precheck now uses /v1/voices (scope-matched). Two surgical fixes caught while a live builtin.repo-presentation run failed at slides.narrate with step "narrate": internal: credential_invalid: ElevenLabs rejected the stored API key (401): "The API key you used is missing the permission user_read" — failure_class: pack_bug. Two bugs in one error message: (A) internal/packs/classify.go:14-23 defines validCodes, the closed-set the engine's middleware uses to gate handler return codes. PR #379 added CodeResourceExhausted and PR #381 added CodeCredentialInvalid to internal/packs/errors.go, but neither added the new codes to validCodes. Result: Classify() lines 60-65 walked the chain, saw the handler returned a *PackError with a code NOT in the set, and silently coerced to CodeInternal. The pipeline-level classifier (internal/pipelines/classify.go) then mapped CodeInternal → FailurePackBug, minting a bogus "file an issue" URL for what was actually a resource/credential issue. Both prior PRs shipped non-functional in the wire envelope — the inner message was right, the outer code was internal, failure_class was pack_bug. (B) vault.ValidateElevenLabs (PR #381) called GET /v1/user which requires the user_read ElevenLabs scope. But ElevenLabs scopes are independent — text_to_speech, voices_read, user_read, history_read are granted separately. A production-shaped key minted with just text_to_speech + voices_read can do every TTS operation slides.narrate needs but 401s against /v1/user. The precheck therefore blocked working keys with a scope-mismatch false-positive. Fix A: add CodeResourceExhausted and CodeCredentialInvalid to the validCodes map. 2 lines + a doc comment naming the regression so future code additions don't repeat the omission. 
After this, OOM-killed ffmpeg surfaces as failure_class: transient (PR #379's intent), and a rejected ElevenLabs credential surfaces as failure_class: caller_fixable with the "update the vault" reason (PR #381's intent). Fix B: switch the precheck endpoint from GET /v1/user to GET /v1/voices. The voices_read scope is what slides.narrate's own pickRandomVoice path already calls — keys that pass the precheck are guaranteed to work through the rest of the handler. Updated doc comment explains the scope reasoning so the choice survives the next refactor. Tests: 2 new cases in internal/packs/classify_test.go asserting Classify(&PackError{Code: CodeResourceExhausted}) and Classify(&PackError{Code: CodeCredentialInvalid}) both round-trip unchanged — would have caught Bug A on its own. internal/vault/validate_test.go's path expectation flipped from /v1/user to /v1/voices with an explanatory error message naming the scope reasoning. All 1578 internal tests pass.
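Bug A is easy to reproduce in miniature. The sketch below assumes a simplified PackError and a closed-set map; the string values of the new codes and the exact membership of the set are guesses for illustration, not copied from internal/packs.

```go
package main

import (
	"errors"
	"fmt"
)

type ErrorCode string

const (
	CodeInternal          ErrorCode = "internal"
	CodeInvalidInput      ErrorCode = "invalid_input"
	CodeResourceExhausted ErrorCode = "resource_exhausted" // assumed string value
	CodeCredentialInvalid ErrorCode = "credential_invalid"
)

type PackError struct {
	Code    ErrorCode
	Message string
}

func (e *PackError) Error() string { return string(e.Code) + ": " + e.Message }

// validCodes is the closed set. A code declared as a constant but missing here
// is silently coerced to CodeInternal, which is the regression described above.
var validCodes = map[ErrorCode]bool{
	CodeInternal:          true,
	CodeInvalidInput:      true,
	CodeResourceExhausted: true,
	CodeCredentialInvalid: true,
}

// classify walks the error chain and returns the typed code only when it is
// a member of the closed set.
func classify(err error) ErrorCode {
	var pe *PackError
	if errors.As(err, &pe) && validCodes[pe.Code] {
		return pe.Code
	}
	return CodeInternal
}

func main() {
	wrapped := fmt.Errorf("step narrate: %w", &PackError{Code: CodeCredentialInvalid, Message: "key rejected"})
	fmt.Println(classify(wrapped))
}
```

A round-trip test over every declared constant, like the two cases added in this release, is what keeps the set and the constants from drifting apart again.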
slides.narrate resolution normalization + video pipelines no longer hardcode aspect_ratio/resolution. Two bugs the helmdeck-debug skill caught in the same sweep: (1) slides.narrate accepted resolution: "1080p" per its declared input schema, but passed the value verbatim to ffmpeg's scale= filter — which rejected it with Invalid size '1080p'. The schema and the handler disagreed on the vocabulary: hyperframes.render takes named presets (720p/1080p/4k), slides.narrate only took WIDTHxHEIGHT. (2) builtin.html-video, builtin.prompt-video, and builtin.prompt-narrated-video all hardcoded "resolution":"1080p","aspect_ratio":"16:9" in their hyperframes.render/hyperframes.compose step inputs. A caller passing an HTML composition whose intrinsic dimensions were vertical (1080×1920 for Shorts/TikTok) got back "outputResolution landscape does not match the composition" with no surface area to fix it — the pipelines didn't expose aspect_ratio as an input at all. Fix 1: new normalizeSlidesNarrateResolution() helper in slides_narrate.go translates named presets to WIDTHxHEIGHT before ffmpeg sees them — 720p→1280x720, 1080p→1920x1080, 1440p→2560x1440, 2160p/4k→3840x2160. Pre-formatted strings pass through; empty stays empty (caller's downstream default applies); unknown values pass through so ffmpeg surfaces its own "Invalid size" message (silent normalization would mask typos). Case-insensitive, whitespace-tolerant. Fix 2: the 3 video pipelines now thread "resolution":"${{ inputs.resolution }}" and "aspect_ratio":"${{ inputs.aspect_ratio }}" instead of literals. PR #380's resolver drops the fields when the caller omits them, so hyperframes.render/hyperframes.compose use their own 1080p+16:9 defaults — zero observable change for current callers. Callers who want vertical (9:16 for Shorts/TikTok) or square (1:1) compositions can now pass the value through the pipeline input. 
Tests: TestNormalizeSlidesNarrateResolution table-driven across 12 cases (presets / pre-formatted / empty / unknown / case-insensitive / whitespace); TestVideoPipelines_DoNotHardcodeAspectRatio regression guard on the 3 production pipelines. All 1566 internal tests pass.
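A sketch of the preset translation described in Fix 1, using the mapping listed above. The function name here is shortened and the choice to return the trimmed input on pass-through is an assumption; the real helper may differ in those details.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeResolution maps named presets to WIDTHxHEIGHT. Anything it does not
// recognize is passed through (trimmed) so ffmpeg can report its own error.
func normalizeResolution(in string) string {
	trimmed := strings.TrimSpace(in)
	switch strings.ToLower(trimmed) {
	case "720p":
		return "1280x720"
	case "1080p":
		return "1920x1080"
	case "1440p":
		return "2560x1440"
	case "2160p", "4k":
		return "3840x2160"
	}
	return trimmed
}

func main() {
	for _, in := range []string{"1080p", " 4K ", "1920x1080", "", "1080"} {
		fmt.Printf("%q -> %q\n", in, normalizeResolution(in))
	}
}
```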
Paid-API credential precheck + honest has_narration (slides.narrate); production narrate pipelines fail-fast on missing/rejected ElevenLabs key. Operator-reported: ran a *-narrate pipeline, got back a silent video, output said has_narration: true. Three architectural bugs in one shape: (1) slides.narrate set hasNarration from apiKey != "" BEFORE any provider call — the field was decided on key presence, not call outcome; (2) the per-slide TTS loop fell back to silence on ANY error (including 401/403/quota), so a dead key produced a video that looked narrated according to the output schema but was actually silent throughout; (3) the production pipelines builtin.grounded-narrate, builtin.research-narrate, builtin.repo-presentation all literally hardcoded allow_silent_output: true, which masked the missing credential entirely — a caller asking for "narrate this" got silence with no signal that the credential was the cause. The fix introduces a new typed error code packs.CodeCredentialInvalid (internal/packs/errors.go) for credentials rejected by an upstream paid API — distinct from CodeInvalidInput (caller passed bad input — they can fix without touching the vault) and CodeHandlerFailed (pack code misbehaved): the pack ran correctly, the caller's input was structurally fine, the stored credential is dead. classify.go maps it to FailureCallerFixable with the actionable reason "The vault-stored API credential this pack needed was rejected by the upstream provider (401/403/quota). The pack itself ran correctly; the credential is dead. Update it via /api/v1/vault/credentials/{id} (PUT) or re-hydrate from your .env.local, then re-run. Retrying with the same key would burn more provider quota for no benefit.". isRetryable=false. New helper vault.ValidateElevenLabs(ctx, hc, apiKey) (internal/vault/validate.go): single GET /v1/user against ElevenLabs to confirm the key is accepted before doing expensive work. 
Returns nil on 200, *packs.PackError{CodeCredentialInvalid} on 401/403/402-quota-exhausted, transient errors on 429/5xx (caller proceeds — per-slide TTS calls have their own fallback path). Signature template reusable for sibling providers (fal.ai, Firecrawl, HeyGen, Runway — see follow-up list). slides.narrate wiring: after key resolution, before voice listing + Marp render + LLM YouTube-metadata call, call ValidateElevenLabs. On CodeCredentialInvalid return immediately — saves ~$0.01-0.05 in wasted LLM tokens + ~30s of CPU per failed run. On transient error, log warning + proceed. Honest has_narration computed at return as narrationRequested && voiceID != "" && narratableSlideCount > 0 && ttsFailureCount == 0. New output field tts_failure_count for diagnostics so an operator can see "I asked for narration, but 3 of 25 slides fell back to silence." Pipeline cleanup: builtin.grounded-narrate, builtin.research-narrate, builtin.repo-presentation change "allow_silent_output": true → "allow_silent_output": "${{ inputs.allow_silent_output }}". PR #380's resolver drops the field when the caller doesn't pass it, so slides.narrate's AllowSilentOutput zero-values to false and the fail-fast credential check kicks in. Callers who explicitly want silence pass allow_silent_output: true on the run input — the opt-in path still works. Tests: 8 new tests in internal/vault/validate_test.go (200/401/403/402/429/5xx/empty-key/whitespace-key); 1 new case in classify_test.go asserting CodeCredentialInvalid → FailureCallerFixable; TestIsRetryable extended; 1 new pipeline-shape test TestNarratePipelines_DoNotHardcodeAllowSilentOutput that prevents regression on the 3 production pipelines. All 1562 internal tests pass. 
Out of scope (follow-up PRs that adopt the same pattern): podcast.generate (4 pipeline sites still carry hardcoded allow_silent_output: true; needs its own provider-specific precheck), image_generate (fal.ai), research.deep (Firecrawl), heygen_video, runway_video, slides.outline LLM metadata calls. Each adoption is ~50 LOC once the validator helper for that provider lands.
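The precheck's status handling and the honest has_narration rule can be sketched together. The outcome type and function names are hypothetical, and treating every status other than 200/401/402/403 as transient is an assumption of this sketch; the entry only specifies 429 and 5xx for that bucket.

```go
package main

import (
	"fmt"
	"net/http"
)

type precheckOutcome int

const (
	precheckOK precheckOutcome = iota
	precheckCredentialInvalid
	precheckTransient
)

// classifyPrecheckStatus maps the provider's HTTP status to an outcome:
// 200 proceeds, 401/402/403 mean the stored key is dead, and everything else
// is treated as transient so the caller proceeds with a warning.
func classifyPrecheckStatus(status int) precheckOutcome {
	switch status {
	case http.StatusOK:
		return precheckOK
	case http.StatusUnauthorized, http.StatusPaymentRequired, http.StatusForbidden:
		return precheckCredentialInvalid
	default:
		return precheckTransient
	}
}

// hasNarration reports narration only when it was requested, a voice was
// chosen, at least one slide was narratable, and no TTS call fell back.
func hasNarration(requested bool, voiceID string, narratableSlides, ttsFailures int) bool {
	return requested && voiceID != "" && narratableSlides > 0 && ttsFailures == 0
}

func main() {
	fmt.Println(classifyPrecheckStatus(401) == precheckCredentialInvalid)
	fmt.Println(hasNarration(true, "voice-1", 25, 3))
}
```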
Pipeline template resolver: drop the JSON field when a whole-value inputs.* reference misses (typed-field fix). Surfaced by the helmdeck-debug skill's sweep: builtin.repo-presentation failed at the outline step with field "export_outline": expected boolean, got string, and the skill correctly identified that 6 sibling pipelines reference the same optional booleans the same way (builtin.grounded-deck, builtin.grounded-narrate, builtin.research-deck, builtin.research-narrate, builtin.scrape-deck, builtin.research-ground-deck). PR #377 made the resolver substitute "" for missing top-level inputs.* references, which is correct for string-typed targets but broken for bool/number/array targets where 7 pipelines pass the input as a whole-value template (e.g. "export_outline": "${{ inputs.export_outline }}"). The receiving pack's JSON decoder rightly rejects an empty string in a bool field. The fix distinguishes whole-value misses from embedded misses: when a ${{ inputs.X }} reference is the ENTIRE value of a JSON field and X isn't supplied, the resolver now drops the field from the output JSON entirely. The receiving pack then sees an absent field and uses its declared zero-value default (false for bool, 0 for number, [] for array, "" for string). Embedded references ("prefix-${{ inputs.x }}-suffix") keep current behavior: they substitute "" because dropping would unhelpfully truncate the surrounding string. Array elements with a missing ref substitute null (preserves indices). Implementation uses a package-private missingRef sentinel that lookupExpr returns and walk() recognizes at the map/array boundary, so it never leaks to JSON. steps.* references stay loud always: a missing steps.X.output.Y indicates a real inter-step wiring bug and the safety net is unchanged. 
Tests: 4 new tests in template_test.go — TestResolve_MissingInput_WholeValueDropsField (the contract: drop on whole-value miss), TestResolve_MissingInput_EmbeddedKeepsField (substitute "" in embedded position), TestResolve_MissingInput_DropAcrossTypes (the motivating case — export_outline/include_image_prompts/fade_ms/voice_ids all dropped when omitted), TestResolve_MissingInput_ArrayBecomesNull (array indices preserved). The PR #377 test TestResolve_MissingInputDefaultsToEmpty was split into the two new whole-value/embedded tests since its prior conflated assertion no longer holds. TestResolve_MissingStepStillFails still passes — steps.X.output.Y references stay loud. All 1552 internal tests pass.
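The whole-value versus embedded distinction is the heart of this fix, so here is a deliberately small sketch of it. It handles only a flat map and inputs.* references; resolveFields and the regular expression are illustrative stand-ins, and the real resolver's sentinel-based walk over nested maps and arrays is not reproduced.

```go
package main

import (
	"fmt"
	"regexp"
)

// inputRef matches ${{ inputs.NAME }} with optional inner whitespace.
var inputRef = regexp.MustCompile(`\$\{\{\s*inputs\.([A-Za-z0-9_]+)\s*\}\}`)

// resolveFields resolves a flat template. A field whose entire value is one
// missing reference is dropped; a missing reference embedded in a longer
// string is replaced with "".
func resolveFields(tmpl, inputs map[string]any) map[string]any {
	out := make(map[string]any, len(tmpl))
	for key, raw := range tmpl {
		s, isString := raw.(string)
		if !isString {
			out[key] = raw
			continue
		}
		if m := inputRef.FindStringSubmatch(s); m != nil && m[0] == s {
			// Whole-value reference: keep the supplied value (and its type),
			// or omit the field so the receiver applies its own default.
			if val, ok := inputs[m[1]]; ok {
				out[key] = val
			}
			continue
		}
		out[key] = inputRef.ReplaceAllStringFunc(s, func(ref string) string {
			name := inputRef.FindStringSubmatch(ref)[1]
			if val, ok := inputs[name]; ok {
				return fmt.Sprint(val)
			}
			return ""
		})
	}
	return out
}

func main() {
	tmpl := map[string]any{
		"export_outline": "${{ inputs.export_outline }}",
		"title":          "deck-${{ inputs.name }}-v1",
		"topic":          "${{ inputs.topic }}",
	}
	fmt.Println(resolveFields(tmpl, map[string]any{"topic": "go"}))
}
```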
slides.narrate OOM-killed ffmpeg now classifies as transient, not pack_bug. Surfaced by the helmdeck-debug skill's diagnostic sweep: builtin.repo-presentation failed at slides.narrate with ffmpeg segment 9 failed (exit 137) and the gateway classifier emitted failure_class: pack_bug plus an auto-generated "file an issue" URL — but exit 137 is SIGKILL, which in our sandboxed sessions overwhelmingly means the kernel OOM killer reaped ffmpeg because the per-segment 1080p h264+AAC encode exceeded the cgroup memory limit. That's a resource/environment issue, not a bug in the pack. The fix introduces a new typed error code packs.CodeResourceExhausted (internal/packs/errors.go) — distinct from CodeTimeout (deadline expired) and CodeSessionUnavailable (couldn't acquire a session): the session ran fine, the workload was too heavy for the memory/CPU budget. classify.go maps it to FailureTransient with the actionable reason "The OS killed a child process for resource reasons (typically OOM — exit 137 / SIGKILL). The pack itself isn't buggy; the workload was too heavy for the session's memory/CPU budget. Bump SessionSpec.MemoryLimit, reduce the job size (fewer slides/segments/pages), or re-run on a host with more memory.". isRetryable() now returns true for CodeResourceExhausted so the ADR 044 auto-retry loop gives it a shot before surfacing. New shared helper classifyShellExitCode(exitCode int) (packs.ErrorCode, bool) in internal/packs/builtin/shell_exit.go is the single source of truth for "what does this exit code from a shelled-out tool mean in a typed way?" — today it lifts exit 137 to CodeResourceExhausted and returns ok=false for everything else (caller falls through to CodeHandlerFailed). Future packs (hyperframes.render, slides.render, scrape_spa, doc_ocr, etc. — every shell-out path) can adopt it incrementally; the lore lives in one place instead of being reinvented per-handler. 
slides.narrate wired: both the per-segment ffmpeg encode (line ~429) and the final concat step (line ~452) check the exit code through the helper. When OOM is detected the error message also improves — instead of the generic "ffmpeg segment 9 failed (exit 137): <stderr tail>" the operator sees "ffmpeg segment 9 killed by the OS on exit 137 (likely OOM at 1080p — bump SessionSpec.MemoryLimit, reduce slide count, or lower the encode resolution). stderr: ...". The stderr artifact still lands in the artifact store via the existing persistFfmpegStderr path so post-mortem debugging keeps the full ffmpeg output. Tests: 2 new tests in shell_exit_test.go (exit 137 → CodeResourceExhausted; every other code returns ok=false); 1 new case in classify_test.go asserting CodeResourceExhausted → FailureTransient; TestIsRetryable extended to assert it's in the retryable set. All 1549 internal tests pass. Scope kept tight: today only exit 137 is recognized empirically — adding SIGTERM (143), GNU timeout (124), etc. is a follow-up decision that now lives in one helper, not 8 handlers.
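The helper's contract is small enough to sketch in full. The signature mirrors the one quoted above; the local ErrorCode type and the string values of the codes are placeholders, and codeForExit is a hypothetical caller showing the fall-through.

```go
package main

import "fmt"

type ErrorCode string

const (
	CodeResourceExhausted ErrorCode = "resource_exhausted" // assumed string value
	CodeHandlerFailed     ErrorCode = "handler_failed"     // assumed string value
)

// classifyShellExitCode lifts exit codes with a known typed meaning.
// 137 is 128+9, i.e. the child was terminated by SIGKILL.
func classifyShellExitCode(exitCode int) (ErrorCode, bool) {
	if exitCode == 137 {
		return CodeResourceExhausted, true
	}
	return "", false
}

// codeForExit shows the caller-side fall-through to the generic code.
func codeForExit(exitCode int) ErrorCode {
	if code, ok := classifyShellExitCode(exitCode); ok {
		return code
	}
	return CodeHandlerFailed
}

func main() {
	for _, exit := range []int{1, 124, 137, 143} {
		fmt.Println(exit, codeForExit(exit))
	}
}
```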
isSafeClonePath / safeJoin accept ADR 040 persistent clone paths (<PersistentReposPath>/<Caller>/...). Surfaced by the helmdeck-debug skill while running its diagnostic sweep: builtin.repo-presentation failed with step "map": invalid_input: clone_path must be an absolute path under /tmp/helmdeck- or /home/helmdeck/work/. The bug was a Class 3 schema-vs-handler drift that the skill correctly identified: ADR 040 wired repo.fetch to emit clone_path = <ec.PersistentReposPath>/<ec.Caller>/<hash> (e.g. /repos/admin/6d3bd03b49986330) when the persistent repos volume is mounted, but isSafeClonePath (internal/packs/builtin/repo_push.go:324-333) was never updated to accept that prefix family — it still only allowed /tmp/helmdeck- and /home/helmdeck/work/. Every downstream consumer of repo.fetch's output (repo.map, repo.push, fs.read/fs.write/fs.list/fs.delete, cmd.run, and content.ground when reading from a clone) therefore rejected the legitimate fetch output with CodeInvalidInput. The failure_class came out as caller_fixable — secondary Class 4 misclassification, because the caller (the pipeline definition) had passed the fetch output verbatim and had no recourse to fix it. The fix widens isSafeClonePath to take an *packs.ExecutionContext and accept a third path family: strings.TrimSuffix(ec.PersistentReposPath, "/") + "/" + ec.Caller + "/". The per-caller subdir is required — bare /repos/loose-file and /repos/other-user/... are still rejected — so the validation continues to enforce the per-caller scoping ADR 040 documents. When ec is nil or persistence is off (PersistentReposPath == ""), behavior is byte-identical to the pre-fix version (pre-ADR-040 callers see no change). safeJoin now takes the same ec parameter and threads it through. Error message is now generated by clonePathRejectMessage(ec) so the rejection text surfaces the actual allowed prefix in this deployment — e.g. 
clone_path must be an absolute path under /tmp/helmdeck- or /home/helmdeck/work/ (or /repos/admin/ for ADR 040 persistent clones) — instead of a stale hard-coded list. Verification: live re-run of builtin.repo-presentation on https://github.com/octocat/Hello-World.git after the fix — repo.fetch produced /repos/admin/35045901fb0127aa, repo.map consumed it without rejection, the chain advanced two steps further before failing on an unrelated slides.outline input-shape issue (separately filed). Tests: 3 new tests in repo_push_test.go — TestIsSafeClonePath_ADR040Persistent (per-caller positive cases + cross-caller / bare-root / traversal negative cases), TestIsSafeClonePath_PersistenceOff (empty PersistentReposPath keeps pre-ADR-040 behavior), TestIsSafeClonePath_CallerEmpty (no anonymous fallback to a shared namespace). All 1546 internal tests pass. Surface area touched: 1 function signature change (isSafeClonePath), 1 helper signature change (safeJoin), 8 call sites updated to thread ec through, 5 hard-coded error messages replaced with the helper. Behavior on pre-ADR-040 paths is preserved bit-for-bit.
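The widened check can be approximated as below. The ExecutionContext here is a two-field stand-in for the real type, and cleaning the path before the prefix test is this sketch's way of rejecting traversal; the actual function may handle that differently.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// ExecutionContext carries only the two fields this sketch needs.
type ExecutionContext struct {
	PersistentReposPath string
	Caller              string
}

// allowedPrefixes returns the two legacy families plus, when persistence is
// on and the caller is known, the per-caller persistent family.
func allowedPrefixes(ec *ExecutionContext) []string {
	prefixes := []string{"/tmp/helmdeck-", "/home/helmdeck/work/"}
	if ec != nil && ec.PersistentReposPath != "" && ec.Caller != "" {
		prefixes = append(prefixes,
			strings.TrimSuffix(ec.PersistentReposPath, "/")+"/"+ec.Caller+"/")
	}
	return prefixes
}

// isSafeClonePath accepts absolute paths that, after cleaning away any ".."
// segments, still sit under one of the allowed prefixes.
func isSafeClonePath(ec *ExecutionContext, p string) bool {
	if !path.IsAbs(p) {
		return false
	}
	cleaned := path.Clean(p)
	for _, prefix := range allowedPrefixes(ec) {
		if strings.HasPrefix(cleaned, prefix) {
			return true
		}
	}
	return false
}

func main() {
	ec := &ExecutionContext{PersistentReposPath: "/repos/", Caller: "admin"}
	fmt.Println(isSafeClonePath(ec, "/repos/admin/6d3bd03b49986330"))
	fmt.Println(isSafeClonePath(ec, "/repos/other-user/6d3bd03b49986330"))
}
```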
Pipeline template resolver: tolerate missing top-level optional inputs.* references (resolve to ""). The opt-in CTA inputs that landed with the blog.append_cta wiring (project_url, github_url, cta_source_url, cta_copy) failed every real *-rewrite-blog pipeline run unless the caller passed every field explicitly. The validation-suite test fixture sets every input explicitly so unit tests passed, but a live call to builtin.brief-rewrite-blog with only project_url + github_url set returned: step "cta": unresolved reference "inputs.source_url": no field "source_url". The resolver was designed to fail loud on every unresolved ${{ ... }} reference — correct for inter-step wiring, where a missing steps.X.output.Y indicates a real producer/consumer bug; wrong for pipeline inputs, where callers routinely omit optional fields. The fix scopes leniency tightly: only top-level inputs.* references that miss → resolve to "". Nested traversal errors (inputs.foo.bar where foo exists but bar doesn't) still fail loud — that surfaces caller-side shape bugs. steps.*.output.* references stay loud always — that's the safety net for inter-step wiring bugs (the high-value catch). Verification: live run of builtin.brief-rewrite-blog on openrouter/anthropic/claude-haiku-4-5 post-fix — the optional CTA inputs cta_source_url / source_url resolved to empty when not passed; the CTA step LLM-rewrote a natural closing section weaving in project_url + github_url with cta_copy ("invite the reader to try it and contribute") honored throughout the 3 paragraphs. Tests: 2 new tests in template_test.go — TestResolve_MissingInputDefaultsToEmpty (whole-value ref and embedded ref both resolve to empty) and TestResolve_MissingStepStillFails (the safety-net guard for the inter-step path).

Added

blog.append_cta pack + opt-in CTA wiring across the *-rewrite-blog pipelines. An external agent driving helmdeck through OpenClaw asked it to "promote this project" via builtin.scrape-rewrite-blog and got back well-written articles with zero promotional links and visible [1] / [source] citation markers throughout — the pipeline did its job correctly, but the output shape (blog.rewrite_for_audience's ghostwriter contract + content.ground's verifiability contract) didn't match the user's conversational publication target (dev.to / Medium). New pack blog.append_cta (internal/packs/builtin/blog_append_cta.go) closes the CTA half of the gap: when ALL of source_url / project_url / github_url are empty, the pack is a strict no-op that returns markdown unchanged and never calls the dispatcher — so the step can slot into every blog pipeline unconditionally without burning a model call for the common no-CTA path. When at least one link is set, the pack LLM-rewrites a closing CTA section in the article's voice, reusing resolveBlogRewritePersona from blog.rewrite_for_audience so a "technical" / "marketing" / "educational" persona threaded through the pipeline locks the voice across both packs. The model is instructed to emit ONLY the closing section; the original article body is appended verbatim in code so the LLM cannot introduce drift. Optional cta_copy field lets the caller steer the ask in plain English ("invite contributors", "highlight the free tier"). Wired into four pipelines: builtin.brief-rewrite-blog, builtin.scrape-rewrite-blog, builtin.doc-rewrite-blog, and builtin.research-rewrite-blog all gained a cta step between content.ground and blog.publish. Optional pipeline inputs project_url?, github_url?, cta_source_url?, cta_copy? thread through. 
doc-rewrite-blog uses cta_source_url separately from its existing source_url (which is the doc URL) so the CTA stays opt-in — threading the doc URL into the CTA would have fired the LLM on every doc-rewrite-blog run regardless of intent. Pipeline descriptions tightened in internal/pipelines/seed.go for all four blog pipelines: each now explicitly calls out that the output includes inline [1] citations from content.ground and recommends stripping them in post-processing for conversational publication targets. Honors the existing project memory about pipeline descriptions matching the mechanism. Tests: 9 new tests in blog_append_cta_test.go (no-op when no links / no-op when whitespace-only links / appends when project_url set / all 3 links land in prompt / model required when link set / persona matches article voice / code fence unwrapping / empty markdown rejected / empty model response surfaces error). All 4 blog pipelines re-verified through TestBuiltins_RunEndToEnd. 1541 internal tests pass. Companion blog draft at website/blog/2026-06-02-pipeline-output-shape-vs-publication-target.md (draft: true) frames the broader pattern: pipelines are tight contracts on purpose; multi-action intents need the planner to compose pipeline-run + post-processing rather than asking the pipeline to absorb responsibilities it wasn't designed for. Out of scope: citation stripping (its own pack — the design question is sharper than "remove [N] markers"; footnote / inline-hyperlink / references-list-only are all valid targets), and the planner-asks-user clarifying-question flow ("want deep research?"), which needs helmdeck.plan prompt engineering plus a UI surface for asking back.
Prefix-cache routing for the catalog block in helmdeck.plan and helmdeck.route (ADR 051 PR #4). ADR 051 PR #2 added Budget.SupportsPrefixCache + CachedInputCostUSDPerMTok as capability flags on 15 tier entries (Anthropic / OpenAI / Google / DeepSeek native + their OpenRouter relays — the providers whose APIs document prompt-prefix caching). PR #4 wires the flag to message assembly. The cache-defeating mutation: today the catalog block (the largest chunk of input tokens by far — 3KB compacted, 30KB uncompacted on Tier A) lives in the USER message alongside the per-call intent and defaults. Two consecutive calls to helmdeck.plan with different intents produce different user messages → no shared prefix → no provider cache hit. Anthropic's prompt-prefix cache (50% input discount), Gemini's (75% discount), and DeepSeek's (96.7% discount, the 30× number on V4 Pro) all key on byte-identical message-array prefixes; the moment any byte differs, the cache misses. The fix: when the budget advertises SupportsPrefixCache, the catalog moves into the SYSTEM prompt. The system prompt then carries planSystemPrompt + "\n\nCATALOG (helmdeck routing-guide):\n<full catalog>" — stable across every call for that model, since catalog is global engine policy (not per-caller). Per-call variation (defaults projection + intent + optional context) lives in the user message tail. assemblePlanPrompt(budget, ...) and assembleRoutePrompt(budget, ...) helpers carry the branching logic — when budget.SupportsPrefixCache is true, return (systemWithCatalog, userWithoutCatalog); when false, return (legacySystem, legacyUser). The legacy path is byte-identical to pre-PR-4 dispatches, so behavior on non-caching providers (Tier C fallback, Ollama, Mistral / Grok / Fireworks without the flag) is unchanged. 
Cascade interaction: when the ADR 050 PR #4 filter cascade fires on a SupportsPrefixCache model (only openrouter/deepseek/deepseek-v4-pro carries both flags today), the restricted catalog goes into the system prompt for that call. The filter pass keeps its own system prompt (filter and planning system prompts have different role instructions — consolidating them is deferred). Tests: 6 new tests in plan_test.go + route_test.go — (1) Tier A SupportsPrefixCache=true puts catalog in system message and intent in user message, (2) byte-identical system prompts across two sequential calls with DIFFERENT intents (the cache-hit contract), (3) Tier C fallback without the flag keeps the legacy single-user-message shape. 2 pre-existing tests (TestPlan_TierAModelGetsFullCatalog, TestRoute_TierAModelGetsFullCatalog) updated to assert against the combined system + user text since the catalog lifted out of the user message on Tier A. All 1532 internal tests pass. Completes the ADR 051 4-PR roadmap (PR #5 calibration tooling shipped ahead of #2–#4 to unblock operator self-service).
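The cache-hit contract (byte-identical system prompt across calls with different intents) follows directly from where the catalog is placed. A minimal sketch, assuming a one-field Budget and a simplified parameter list; the layout of the legacy user message in particular is a guess, not the project's actual format.

```go
package main

import "fmt"

// Budget is reduced to the single flag this sketch needs.
type Budget struct {
	SupportsPrefixCache bool
}

// assemblePlanPrompt places the stable catalog in the system prompt when the
// provider caches prefixes, and leaves it in the user message otherwise.
func assemblePlanPrompt(b Budget, systemPrompt, catalog, defaults, intent string) (system, user string) {
	if b.SupportsPrefixCache {
		system = systemPrompt + "\n\nCATALOG (helmdeck routing-guide):\n" + catalog
		user = defaults + "\n\n" + intent
		return system, user
	}
	// Legacy shape (assumed layout): catalog travels with the per-call text.
	return systemPrompt, catalog + "\n\n" + defaults + "\n\n" + intent
}

func main() {
	cached := Budget{SupportsPrefixCache: true}
	sysA, _ := assemblePlanPrompt(cached, "You are a planner.", "pack-a\npack-b", "defaults", "make a deck")
	sysB, _ := assemblePlanPrompt(cached, "You are a planner.", "pack-a\npack-b", "defaults", "write a blog post")
	fmt.Println("shared prefix:", sysA == sysB)
}
```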
Provider-side strict JSON via response_format on gateway.ChatRequest (ADR 051 PR #3). ADR 051 PR #2 added Budget.WantsStrictJSON as a capability flag on 14 tier entries (Anthropic/OpenAI/Google native + their OpenRouter relays, Mistral, Grok), but no code read it — the gateway request shape had no field for constrained-decoding mode, so every plan/route call still relied entirely on prompt engineering to ask for JSON. The research synthesis cited in ADR 051 names provider-side strict JSON as the cleanest mitigation for the trailing-prose / markdown-injection failure modes Tier A models occasionally exhibit; it also flags constrained decoding as the wrong mode for quantized open-weight inference (Tier C), where the logit masker can deadlock and emit JSON-shaped garbage. New ResponseFormat string field on gateway.ChatRequest with documented values "" (unconstrained — current behavior, zero-diff for callers that don't set it) and "json_object" (provider validates output is syntactically valid JSON). String-based for forward-compat: a future "json_schema" value can be added without touching every adapter. Pack handlers set it from Budget.WantsStrictJSON; the dispatcher passes it through unchanged so any future gateway client (engine.Execute, integration tests) opts in without touching pack code. Per-provider translation: OpenAI adapter sends response_format: {type: "json_object"} upstream (Mistral, Groq, Fireworks, OpenRouter all share NewOpenAIProvider so they inherit the translation for free). Gemini adapter sets generationConfig.responseMimeType: "application/json". Anthropic uses tool-call structure for strict output and ignores the field. Ollama passes through unconstrained. Unknown ResponseFormat values fall through unconstrained at every adapter so a forward-compat value (e.g. "json_schema") rolling out faster than the translator can't break dispatch. 
helmdeck.plan + helmdeck.route wire-up: both handlers read budget.WantsStrictJSON and set ResponseFormat="json_object" when the flag is set AND the tier is not C. The Tier C guard is the safety belt the research synthesis explicitly called out — even an admin who manually sets WantsStrictJSON=true on a Tier C fallback entry stays on the prompt-engineered path because constrained decoding crashes there. Tests: 6 per-provider translation tests in internal/gateway/providers_test.go (openai forwards json_object envelope / openai omits response_format on unset / openai ignores unknown values / gemini sets responseMimeType / gemini omits responseMimeType on unset / anthropic ignores silently). 4 new pack-handler tests in plan_test.go + route_test.go (Tier A flips to json_object on a model with WantsStrictJSON=true / Tier C stays empty even when the flag is set on the fallback entry). All 1530 internal tests pass. Sets up PR #4 (prefix-cache-aware two-pass cascade gated on Budget.SupportsPrefixCache).
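The gating and the OpenAI-style translation can be sketched in a few lines. The Tier field on Budget and both function names are assumptions for illustration; the "json_object" value and the rule that unknown values fall through unconstrained come from the entry above.

```go
package main

import "fmt"

// Budget is a stand-in; the Tier field name is an assumption of this sketch.
type Budget struct {
	Tier            string // "A", "B" or "C"
	WantsStrictJSON bool
}

// responseFormatFor applies the Tier C guard: the flag alone is not enough.
func responseFormatFor(b Budget) string {
	if b.WantsStrictJSON && b.Tier != "C" {
		return "json_object"
	}
	return ""
}

// openAIResponseFormat returns the upstream response_format payload, or nil
// when the field should be omitted (unset or unrecognized value).
func openAIResponseFormat(responseFormat string) map[string]string {
	if responseFormat == "json_object" {
		return map[string]string{"type": "json_object"}
	}
	return nil
}

func main() {
	tierA := Budget{Tier: "A", WantsStrictJSON: true}
	tierC := Budget{Tier: "C", WantsStrictJSON: true}
	fmt.Println(openAIResponseFormat(responseFormatFor(tierA)))
	fmt.Println(openAIResponseFormat(responseFormatFor(tierC)) == nil)
}
```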
Cause-typed empty completions + Budget capability flags (ADR 051 PR #2). ADR 051 PR #1 stripped reasoning-token blocks and consolidated the JSON parser, but every parse failure still surfaced to operators as the same opaque "gateway returned an empty plan response" text regardless of root cause. The research synthesis cited in ADR 051 identifies four distinct causes for empty HTTP-200 completions, each with a different correct response: provider safety filter redaction, length truncation, constrained-decoding deadlock, and connection timeout on hybrid-reasoning models. PR #2 makes the cause inspectable via errors.Is. New sentinel errors in internal/packs/builtin/json_response.go: ErrSafetyFiltered, ErrLengthTruncated, ErrConstrainedDeadlock, ErrLikelyTimeout. Each is plain error (set as the Cause of the returned *packs.PackError). Callers that don't care keep using the existing wrapper; callers that want to bucket telemetry or pick a retry strategy use errors.Is(perr.Cause, ErrSafetyFiltered) etc. New DecodeStructuredResponseWithCause(rawBody, finishReason, packName, v) is the cause-typed variant. Reads finishReason (which gateway.ChatResponse.Choices[0].FinishReason has been carrying all along — the gateway captures it per provider for the provider_calls audit table) and classifies the failure. The existing DecodeStructuredResponse becomes a backward-compat wrapper that passes an empty finish reason — unchanged behavior on the wire except empty-completion paths now classify as ErrLikelyTimeout (preserving the historical message prefix). helmdeck.plan and helmdeck.route wire-up: both handlers now call DecodeStructuredResponseWithCause and thread chat.Choices[0].FinishReason through. No observable behavior change for callers that don't introspect the Cause; new visibility for those that do. 
Budget extended with four capability flags: IsHybridReasoning bool (model emits <think>/<reasoning> blocks — set on o3-mini, claude-3.7-sonnet, claude-opus/sonnet thinking variants, deepseek-v4-pro, the Moonshot kimi-k2/kimi- family). WantsStrictJSON bool (provider supports request-time strict-JSON mode — set on Anthropic / OpenAI / Google native, Mistral, Grok). SupportsPrefixCache bool (provider offers prompt-prefix caching for 2×–30× input-cost discount — set on Anthropic / OpenAI / Google / DeepSeek native + their OpenRouter relays). CachedInputCostUSDPerMTok float64 (cached-input rate per million tokens — populated from Artificial Analysis and per-provider pricing pages). Empty defaults on unmapped models are conservative ("we don't know" → don't make affirmative claims). helmdeck://context-budgets MCP resource extended: surfaces is_hybrid_reasoning, wants_strict_json, supports_prefix_cache, cached_input_cost_usd_per_mtok on each entry with omitempty so the resource stays compact for legacy entries while exposing the new flags on entries that carry them. 27 tier entries updated with PR #2 flags following the methodology in docs/howto/calibrate-model-tiers.md — Tier A native APIs get strict-JSON + prefix-cache + a cached rate (Anthropic 1.5/M, OpenAI gpt-5 0.46/M, Gemini 2.5 Pro 0.125/M, etc.). DeepSeek V4 Pro flagged hybrid + cache-supporting (30× discount at 0.0145/M). Kimi K2 family flagged hybrid. Open-weights routes (Llama, Gemma, Qwen, free tier) keep all flags off — the report warns of constrained-decoding deadlock when strict-JSON is forced on quantized inference engines. Tests: 9 new cause-typed tests in json_response_test.go (safety filter / length truncated / likely timeout via empty finish_reason / unknown finish_reason fallback / length-truncated parse fail / constrained deadlock / safety-filtered parse fail / backward-compat sentinel preservation / wrapper still produces historical message prefix). 
4 new capability-flag tests in budgets_test.go (hybrid reasoning, strict JSON, prefix cache, fallback conservative defaults). 1 new MCP resource test asserting o3-mini's flags surface on the wire. 1499 tests passing across all internal packages (was 1485 before PR #2, +14 new). Sets up PR #3 (provider-side response_format translation in gateway.ChatRequest gated on WantsStrictJSON) and PR #4 (prefix-cache-aware two-pass cascade gated on SupportsPrefixCache).
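A caller that wants to bucket telemetry by cause might look roughly like the sketch below. The sentinel names come from this entry; everything else is an assumption made for illustration: PackError is a simplified stand-in for *packs.PackError, the finish-reason strings "content_filter" and "length" are assumed provider values, unknown reasons are assumed to fall back to ErrLikelyTimeout, and ErrConstrainedDeadlock is left out because its trigger is not described here.

```go
package main

import (
	"errors"
	"fmt"
)

// Sentinel names follow the changelog entry; the message text is illustrative.
var (
	ErrSafetyFiltered  = errors.New("safety filtered")
	ErrLengthTruncated = errors.New("length truncated")
	ErrLikelyTimeout   = errors.New("likely timeout")
)

// PackError is a simplified stand-in for *packs.PackError (assumption).
type PackError struct {
	Message string
	Cause   error
}

func (e *PackError) Error() string { return e.Message }

// classifyEmpty (hypothetical helper) picks a cause from the finish reason.
// The reason strings and the fallback to ErrLikelyTimeout are assumptions.
func classifyEmpty(finishReason string) *PackError {
	cause := ErrLikelyTimeout
	switch finishReason {
	case "content_filter":
		cause = ErrSafetyFiltered
	case "length":
		cause = ErrLengthTruncated
	}
	return &PackError{Message: "gateway returned an empty plan response", Cause: cause}
}

// bucket maps a PackError to a telemetry label via errors.Is on its Cause.
func bucket(perr *PackError) string {
	switch {
	case errors.Is(perr.Cause, ErrSafetyFiltered):
		return "safety_filtered"
	case errors.Is(perr.Cause, ErrLengthTruncated):
		return "length_truncated"
	case errors.Is(perr.Cause, ErrLikelyTimeout):
		return "likely_timeout"
	default:
		return "unclassified"
	}
}

func main() {
	for _, fr := range []string{"content_filter", "length", ""} {
		fmt.Printf("%q -> %s\n", fr, bucket(classifyEmpty(fr)))
	}
}
```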
Model-tier calibration tooling + maintenance docs (ADR 051 PR #5). ADR 051 PR #1 introduced 14 new entries in internal/llmcontext/budgets.go calibrated from a research synthesis on 2026-06-02. Without a documented calibration process the table will be stale within a quarter and operators won't know how to extend it. PR #5 fixes that by shipping the methodology + automation that produced PR #1's table, so the next tier addition is a 5-minute task instead of an afternoon of reverse-engineering. New scripts/calibrate-model.sh runs a fixed suite of helmdeck-specific prompts against a given model id via the live /api/v1/packs/helmdeck.plan REST endpoint and emits a recommended tier + draft budgets.go entry. The prompt suite covers three failure-mode classes: trivial single-action (baseline + "does it respond at all"), multi-action 3-step pack-chain (structured-output reliability), paste-heavy multi-action (the original ADR 050 motivating prompt). For each prompt it measures HTTP status, wall-clock duration, parsed-response shape, and which cascade stages fired (lexical truncation / LLM filter pass via the compaction.dropped field). The tier decision tree maps to: Tier A when all prompts succeed with no compaction, Tier B when metadata trim alone suffices, Tier C when lexical/filter cascade fires, Tier C-unstable when only trivial works, "unsupported" when even trivial fails. Hybrid-reasoning detection via trivial-intent latency > 20s. Latency budget tunable via TIMEOUT_S env var; --skip-paste-heavy shortens the run on weak models. Output as human text (default) or --json for scripting. New scripts/calibrate-model.test.sh is the self-test — invokes the calibrator against two anchor models (openrouter/openrouter/free should be Tier C / C-unstable / unsupported; openrouter/anthropic/claude-haiku-4-5 should always be Tier A) and asserts the recommendation matches. Catches regressions in the heuristic logic. 
New docs/howto/calibrate-model-tiers.md is the operator-facing methodology walkthrough: when to calibrate, where to find benchmark scores (Berkeley Function-Calling Leaderboard, Aider polyglot edit-format adherence, Artificial Analysis pricing), how to identify architectural quirks from provider docs (hybrid reasoning, strict JSON support, prompt-cache support), how to interpret the calibrator's recommendation, how to set the capability flags PR #2 will introduce (IsHybridReasoning / WantsStrictJSON / SupportsPrefixCache), and what the trailing source-of-classification comment should contain. Includes the rules for selecting MaxCatalogBytes per tier (0 / 22000 / 10000) and AllowsLLMFilter per tier (false for A and B, true for C). docs/RELEASES.md §"Agent sync checklist" step 9 added: every release cut now includes a tier-table refresh check pointing operators at https://openrouter.ai/api/v1/models for newly-shipped models and at the calibrator + how-to for evaluating them. Discovery stays manual — no helmdeck cron watching provider catalogs — but the maintainer who runs the release also notices when their fallback chain has new options worth investigating. Why PR #5 ships ahead of PRs #2–#4: the calibration methodology is freshest in maintainer memory right now; operators wanting to add new models get unblocked immediately; PR #2's capability flags need calibration data the script feeds them; calibration can evolve asynchronously from the typed-errors / strict-JSON / prefix-cache architecture work.
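The tier decision tree can be restated as a small function. The real calibrator is a shell script; this Go sketch only mirrors the rules as written in this entry. The probe type and the recommend and cascadeFired helpers are hypothetical names, the dropped-stage prefixes ("lexical.", "llm_filter") are taken from the ADR 050 entries, and treating a partial success (trivial plus one other prompt) as Tier C is an assumption the entry does not spell out.

```go
package main

import (
	"fmt"
	"strings"
)

// probe is a hypothetical record of one calibration prompt's outcome.
type probe struct {
	OK      bool     // the call returned a parseable plan
	Dropped []string // compaction stages reported for the call
}

// cascadeFired reports whether a lexical or LLM-filter stage ran.
func cascadeFired(dropped []string) bool {
	for _, d := range dropped {
		if strings.HasPrefix(d, "lexical.") || strings.HasPrefix(d, "llm_filter") {
			return true
		}
	}
	return false
}

// recommend maps the three prompt outcomes to a tier label.
func recommend(trivial, multi, paste probe) string {
	if !trivial.OK {
		return "unsupported"
	}
	if !multi.OK && !paste.OK {
		return "C-unstable" // only the trivial prompt works
	}
	compacted := false
	for _, p := range []probe{trivial, multi, paste} {
		if cascadeFired(p.Dropped) {
			return "C"
		}
		if len(p.Dropped) > 0 {
			compacted = true
		}
	}
	if !multi.OK || !paste.OK {
		return "C" // assumption: partial success is treated as Tier C
	}
	if compacted {
		return "B" // metadata trim alone sufficed
	}
	return "A"
}

func main() {
	ok := probe{OK: true}
	fmt.Println(recommend(ok, ok, ok))
	fmt.Println(recommend(ok, ok, probe{OK: true, Dropped: []string{"lexical.top_n"}}))
}
```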
Reasoning-token stripping + JSON parser parity + research-calibrated tier table (ADR 051 PR #1). ADR 050 shipped a cascade calibrated against three free models. A research synthesis landed today documenting that the "empty completion" failure has four distinct root causes (only one of which — trailing prose — our json.Decoder tolerance fix addresses), and a live test with openrouter/moonshotai/kimi-k2.6 immediately exposed a fifth gap: hybrid-reasoning models emit <think>...</think> / <reasoning>...</reasoning> blocks before the structured payload, and nothing in helmdeck strips them. The Kimi-K2.6 call streamed for 296 seconds inside its <think> block and got cut off by OpenClaw's 5-minute timeout before reaching the JSON; even if it had finished, the parser would have hit the reasoning block first. New internal/llmcontext/reasoning.go exports StripReasoningTokens(s string) string and HasReasoningTokens(s string) bool. Strips <think>...</think>, <reasoning>...</reasoning>, and [REASONING]...[/REASONING] blocks — case-insensitive, multi-line, tolerates attributes (<think type="planning">), idempotent on clean input, requires a closing tag (unclosed open tags pass through so we never silently drop a real answer). Collapses runs of blank lines that the strip leaves behind; trims leading/trailing whitespace. New internal/packs/builtin/json_response.go exports DecodeStructuredResponse(rawBody, packName, v) consolidating the defensive parsing pipeline every LLM-backed pack was reimplementing slightly differently: strip reasoning tokens → trim → unwrap code fences (unwrapCodeFence existing helper) → json.Decoder.Decode (tolerates trailing prose/HTML/markdown, the ADR 050 PR #4 fix) → balanced-brace extractFirstJSONObject substring fallback (reuses the helper from webtest.go that properly handles } inside JSON string literals — better than the naive LastIndex("}") approach plan.go used to use). 
Returns *packs.PackError with CodeHandlerFailed and a packName-threaded Message ("gateway returned an empty plan response" / "empty routing response" / "empty rewrite response" depending on caller). Migration: plan.go (had the ADR 050 PR #4 tolerance fix), route.go (still on strict json.Unmarshal — this brings it to parity), and content_ground.go (had its own substring fallback) all now call DecodeStructuredResponse. Three independent fallback paths converge to one. Net code reduction; uniform behavior. Tier table refresh — internal/llmcontext/budgets.go gains 14 new entries calibrated from the research report's BFCL (Berkeley Function-Calling Leaderboard), Aider polyglot edit-format adherence, and Artificial Analysis pricing data. Tier A additions: openai/o3-mini (BFCL 84.00%; hybrid reasoning — emits <think>, now stripped), google/gemini-2.5-pro (BFCL 85.04% leaderboard top, Aider 99.6% edit-format), google/gemini-2.5-flash (BFCL 75.58%), anthropic/claude-3.7-sonnet (BFCL 73.24%, Aider 84.2%; hybrid thinking mode). Tier B additions: openrouter/deepseek/deepseek-v4-pro (BFCL proxy 71.4%, Aider proxy 74.2%; hybrid reasoning with documented 30-minute serverless timeouts), openrouter/deepseek/deepseek-v3.2 (Aider 74.2%), openrouter/deepseek/deepseek-chat (broader family), openrouter/x-ai/grok- prefix (BFCL proxy 61.38%, Aider 97.3% edit-format; price-tier bumps past 128K context). Tier C additions: openrouter/moonshotai/kimi-k2 (256K context, hybrid reasoning — observed to time out without the <think> stripper), openrouter/moonshotai/kimi- prefix (catches future Kimi releases), openrouter/tencent/ prefix (250K context, conservative until live-validated). Each entry's classification source is named in its trailing comment so future operators can trace the call to its evidence. 
Tests: 15 new tests in internal/llmcontext/reasoning_test.go (idempotent on clean input, drops think/reasoning/REASONING variants, case-insensitive, multi-line bodies, multiple blocks, tag attributes, unclosed-tag pass-through, blank-line collapse, regression sample modeled on real Kimi-K2.6 output). 12 new tests in internal/packs/builtin/json_response_test.go (happy path, think-block prefix, reasoning-block prefix, code-fence unwrap, trailing-content tolerance, leading-prose substring extraction, brace-inside-string regression guard, empty-body error message, reasoning-only input post-strip, unrecoverable garbage, packName threading, combined think + fence). 12 new tier-classification assertions in internal/llmcontext/budgets_test.go covering each of the report's recommended model ids. Sets up PR #2 (cause-typed empty completions: ErrSafetyFiltered, ErrLengthTruncated, ErrConstrainedDeadlock, ErrLikelyTimeout, plus Budget capability flags IsHybridReasoning / WantsStrictJSON / SupportsPrefixCache / CachedInputCostUSDPerMTok), PR #3 (provider-side strict JSON mode via response_format on gateway.ChatRequest), and PR #4 (prefix-cache-aware two-pass cascade — restructure the ADR 050 PR #4 filter so both passes hit the same provider cache).
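The stripping behavior described in this entry can be approximated with three regular expressions. This is a sketch under assumptions, not the contents of reasoning.go: stripReasoning is a hypothetical lowercase name, and the patterns are one plausible way to get case-insensitive, multi-line, attribute-tolerant matching that requires a closing tag.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// One pattern per block style. (?is) makes matching case-insensitive and lets
// "." span newlines; ".*?" is non-greedy so multiple blocks are removed
// separately. A block without its closing tag never matches.
var reasoningBlocks = []*regexp.Regexp{
	regexp.MustCompile(`(?is)<think\b[^>]*>.*?</think\s*>`),
	regexp.MustCompile(`(?is)<reasoning\b[^>]*>.*?</reasoning\s*>`),
	regexp.MustCompile(`(?is)\[reasoning\].*?\[/reasoning\]`),
}

// blankRuns matches three or more consecutive newlines left behind by a strip.
var blankRuns = regexp.MustCompile(`\n{3,}`)

// stripReasoning (hypothetical name) removes closed reasoning blocks,
// collapses blank-line runs, and trims surrounding whitespace.
func stripReasoning(s string) string {
	for _, re := range reasoningBlocks {
		s = re.ReplaceAllString(s, "")
	}
	s = blankRuns.ReplaceAllString(s, "\n\n")
	return strings.TrimSpace(s)
}

func main() {
	raw := "<think type=\"planning\">step one\nstep two</think>\n{\"ok\":true}"
	fmt.Println(stripReasoning(raw))
}
```

Requiring the closing tag is the conservative choice: an unclosed opening tag passes through untouched, so a real answer is never dropped silently.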

[0.22.0] - 2026-06-01

Theme: Agents that work on free models, with memory. Closes ADR 047 (catalog metadata + memory-driven routing), ADR 048 (memory write surface + OpenClaw memory-corpus bridge), ADR 049 PR #1 (helmdeck.plan intent decomposer), and ADR 050 (4-PR retrieval-augmented tool selection cascade). The exact MiniMax M3 launch paste + 3-action ask that motivated the cascade work now returns a valid 3-step plan on openrouter/openrouter/free (was: empty completion).

Added

Two-pass LLM-filter cascade + JSON-decoder tolerance — original motivating prompt now succeeds on free models (ADR 050 PR #4). Closes the ADR 050 roadmap and the gate the entire ADR was scoped to meet. PR #1 shipped per-model budgets + metadata compaction; PR #2 wired route + added the context-budgets resource; PR #3 added the cascading Select() with lexical pre-filter. Empirical gap PR #3 left open: complex multi-paragraph prompts (1.5KB MiniMax M3 launch paste + 3-action ask) still empty-completed on free models even with a 3KB catalog — failure had shifted from "catalog overflows working set" to "structured-output reliability on long user pastes." PR #4 closes that gap via two cooperating changes: (1) an optional pre-planning LLM filter pass that runs when lexical retrieval is ambiguous, and (2) a tolerant JSON parser that reads the first complete object from the model's response and ignores trailing prose/HTML/markup. The live acceptance gate now passes: the exact MiniMax M3 prompt that motivated ADR 050 returns a valid 3-step pack-chain plan in ~46s on openrouter/openrouter/free (was: empty completion at 29.5s before any PR). Diagnostic showed the model was producing JSON + trailing garbage all along; strict json.Unmarshal was rejecting otherwise-recoverable output, and operators saw "empty plan response" because the parser failed before extracting the leading object. json.Decoder reads one value and stops, surfacing the actual plan. Mechanism. When Budget.AllowsLLMFilter == true AND Select ended with lexical.low_confidence (an ambiguity signal — top scores within 40% of each other), the pack handler dispatches a SMALL second LLM call: catalog names + one-line descriptions + the user intent → returns just a JSON list of relevant tool ids. The handler then restricts the full catalog to the union of the filter picks and the lexical top-N, preserving lexical's strong signals while letting the filter recover from lexical false-negatives. 
The planning call then sees a SMALL catalog of only the picked ids. New surfaces in internal/llmcontext. FilterSystemPrompt (versioned alongside the pack code). BuildFilterUserMessage(rg, intent) string (~10KB for the current 70-entry catalog, vs ~35KB full-metadata). ParseFilterResponse(text) []string tolerates code-fenced JSON, leading prose, dedup. RestrictCatalog(rg, keep) subsets by id; unknown ids ignored. MergeKeepOrder(primary, secondary) unions two id lists preserving primary order. IDsFromRoutingGuide(rg) extracts sorted ids for reproducible filter prompts. ShouldEscalateToFilter(ranked, min) bool combines HighConfidence < 0.4 with len(ranked) >= min. Budget extension. Budget gains AllowsLLMFilter bool + FilterModel string. Tier C entries enabled by default with empty FilterModel (caller falls back to the planning model). Tier A/B disabled. helmdeck://context-budgets exposes the new fields. JSON parser tolerance. Switched helmdeck.plan from json.Unmarshal(body, &raw) to json.NewDecoder(strings.NewReader(body)).Decode(&raw) with a body[first{:last}+1] substring fallback. Reads the first complete JSON object and stops, tolerating trailing prose/HTML/markup that weak models produce. Cascade gating fix. PR #3's escalation was eagerly firing on every lexical truncation, adding ~30s of latency on cases lexical alone handled. PR #4 gates on the new lexical.low_confidence marker Select appends only when ShouldEscalateToFilter returns true — confident-pick cases bypass the filter pass entirely, restoring PR #3's 5-second latency on simple prompts. Wire-up. Both helmdeck.plan and helmdeck.route orchestrate the filter pass after Select returns when escalation conditions match. Filter failures (parse errors, dispatcher errors) fall back silently to the lexical-only selection — the filter is an enhancement, never a hard dependency. Successful runs append llm_filter(picks=N,kept=M) to the Trim record so operators see the filter stage in the same audit surface. Tests. 
14 new tests in internal/llmcontext/filter_test.go (prompt shape, JSON parsing variants — code-fenced, leading prose, dedup, real-world response — RestrictCatalog membership, MergeKeepOrder primary precedence, IDsFromRoutingGuide stability, ShouldEscalateToFilter thresholds). 2 new plan integration tests using scripted dispatchers (TwoPass cascade fires both LLM calls in correct order with correct system prompts, FilterFails falls back to lexical without breaking the plan call). 1446 tests passing across all internal packages.
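The parser-tolerance change is easiest to see side by side with its fallback. The sketch below shows the two steps as described: decode the first complete JSON value with json.Decoder, then fall back to the substring between the first "{" and the last "}". decodeFirstObject and the plan type are hypothetical names used only for illustration.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

// plan is a hypothetical, minimal response shape for the demo.
type plan struct {
	Steps []int `json:"steps"`
}

// decodeFirstObject (hypothetical name) reads the first complete JSON value
// from body and ignores anything after it. If the body does not start with
// JSON, it retries on the substring from the first '{' to the last '}'.
func decodeFirstObject(body string, v any) error {
	body = strings.TrimSpace(body)
	// Decoder.Decode stops after one value, so trailing prose is tolerated.
	if err := json.NewDecoder(strings.NewReader(body)).Decode(v); err == nil {
		return nil
	}
	first := strings.Index(body, "{")
	last := strings.LastIndex(body, "}")
	if first < 0 || last <= first {
		return errors.New("no JSON object found in response")
	}
	return json.Unmarshal([]byte(body[first:last+1]), v)
}

func main() {
	var p plan
	err := decodeFirstObject(`{"steps":[1,2,3]} Hope that helps! <br/>`, &p)
	fmt.Println(p.Steps, err)
}
```

The first-to-last-brace fallback is the naive form; as the ADR 051 PR #1 entry notes, a balanced-brace scan handles a "}" inside string literals more reliably.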
Cascading Select() + LexicalRank + helmdeck://my-plans (ADR 050 PR #3). Closes the "simple multi-intent prompts work on free models" gate. PR #1 shipped metadata compaction; PR #2 wired it into route; PR #3 wraps both stages in a cascading Select(catalog, intent, budget) → (selected, Trim) entry point that adds lexical retrieval + top-N truncation as stage 3 when compaction alone can't reach the model's budget. Live test on openrouter/openrouter/free: a 3-action intent ("remember this fact, then write a blog about it, then generate an image") returned a clean 3-step pack-chain plan in ~5 seconds post-cascade (catalog 30KB → 3.16KB, 89% reduction, all compaction steps + lexical.top_n fired). New LexicalRank(catalog, intent) []Scored scores every catalog entry by keyword overlap against intent_keywords (×3.0 weight), accepts/produces (×2.0), name (×2.0), description (×1.0), plus pipeline supersedes (×2.5) so the supersedes-honoring policy lives at the ranking layer too. Stop-word filtering, case-insensitive, single-character tokens dropped, deterministic ordering on ties (by entry name). New TopK(ranked, k) truncates ranked slice; HighConfidence(ranked, threshold) reports whether the top score is meaningfully ahead of the second (>=threshold ratio gap) — used by the future PR #4 LLM-filter pass to decide whether escalation is needed. New Select(catalog, intent, budget) cascade is the public entry point: Tier A pass-through → CompactCatalog (PR #1 metadata trim) → if still over budget, LexicalRank + TopK (cap is SelectMaxEntriesTierC=12 or SelectMaxEntriesTierB=25 by tier). Returns same Trim record callers already log; appends lexical.top_n to dropped[] when stage 3 fires. helmdeck.plan + helmdeck.route wire-up: both packs now call Select(...) instead of CompactCatalog(...) directly. The cascade is internal; callers see one function call. INFO log line renamed from "catalog compacted" to "catalog selection ran" to reflect the broader cascade. 
New helmdeck://my-plans MCP resource (always listed, ADR 050 PR #3 consolidation of ADR 049's deferred PR #2) projects the caller's plan_history audit rows into per-intent_sha cohorts: {intent_sha, count, complexity, top_tools[], last_unix, models[]}. Operators audit the planner's behavior over time; agents detect stable learned plans. Tests: 11 new tests across internal/llmcontext/ (tokenize, LexicalRank intent_keywords beat name matches, supersedes boost, deterministic ties, TopK, HighConfidence, Select Tier A passthrough, compact-only sufficient, lexical escalation, supersedes-survives-truncation, over-budget marker not forwarded). 3 new tests in internal/mcp/resources_test.go cover my-plans listing, aggregation correctness, and empty-history note. Bumped always-listed-resource count assertion to 6. Total: 1431 tests passing across all internal packages. Honest scope. Complex multi-paragraph prompts (e.g. the original MiniMax M3 launch paste + 3-action ask) still empty-complete on the worst free models even with 3KB catalog — the failure has shifted from "catalog overflows working set" to "structured-output reliability on long user pastes." That's a PR #4 problem (two-pass LLM-filter cascade with a paid filter model + a free planner model), not a PR #3 problem. PR #3 closed the "simple multi-action intents work on free models" gate that PR #1 was originally scoped to meet; PR #4 closes the remaining "complex paste + multi-action" gate.
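The ranking stage can be illustrated with a reduced scorer. This sketch keeps only three of the weighted fields named above (intent_keywords ×3.0, name ×2.0, description ×1.0) and omits accepts/produces and the supersedes boost. The Entry type, the stop-word list, and the lowercase function names are assumptions for the example, not the module's API.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"unicode"
)

// Entry is a hypothetical, reduced catalog entry.
type Entry struct {
	Name           string
	Description    string
	IntentKeywords []string
}

// Scored pairs an entry name with its lexical score.
type Scored struct {
	Name  string
	Score float64
}

// stopWords is an illustrative list, not the one helmdeck ships.
var stopWords = map[string]bool{"a": true, "an": true, "the": true, "and": true, "then": true, "it": true, "this": true, "about": true}

// tokenize lowercases, splits on non-alphanumerics, and drops stop words
// and single-character tokens.
func tokenize(s string) map[string]bool {
	out := map[string]bool{}
	fields := strings.FieldsFunc(strings.ToLower(s), func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
	for _, f := range fields {
		if len(f) > 1 && !stopWords[f] {
			out[f] = true
		}
	}
	return out
}

// overlap counts distinct tokens of text that also appear in the intent.
func overlap(intent map[string]bool, text string) float64 {
	n := 0
	for tok := range tokenize(text) {
		if intent[tok] {
			n++
		}
	}
	return float64(n)
}

// lexicalRank scores every entry and sorts by score, then by name on ties.
func lexicalRank(catalog []Entry, intent string) []Scored {
	want := tokenize(intent)
	ranked := make([]Scored, 0, len(catalog))
	for _, e := range catalog {
		score := 3.0*overlap(want, strings.Join(e.IntentKeywords, " ")) +
			2.0*overlap(want, e.Name) +
			1.0*overlap(want, e.Description)
		ranked = append(ranked, Scored{Name: e.Name, Score: score})
	}
	sort.Slice(ranked, func(i, j int) bool {
		if ranked[i].Score != ranked[j].Score {
			return ranked[i].Score > ranked[j].Score
		}
		return ranked[i].Name < ranked[j].Name
	})
	return ranked
}

// topK truncates the ranked slice to at most k entries.
func topK(ranked []Scored, k int) []Scored {
	if k < len(ranked) {
		return ranked[:k]
	}
	return ranked
}

func main() {
	catalog := []Entry{
		{Name: "zeta.unrelated"},
		{Name: "image.generate", IntentKeywords: []string{"image", "picture"}},
		{Name: "alpha.unrelated"},
	}
	for _, s := range topK(lexicalRank(catalog, "generate an image"), 2) {
		fmt.Println(s.Name, s.Score)
	}
}
```

Sorting ties by name keeps the selected catalog, and therefore the prompt, stable between runs with the same input.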
helmdeck.route compaction + helmdeck://context-budgets resource + plan compaction field (ADR 050 PR #2). Generalizes the llmcontext integration from PR #1: helmdeck.route's handler now calls llmcontext.BudgetFor(model) + CompactCatalog(catalog, budget) after buildCatalog, matching the pattern PR #1 added to helmdeck.plan. Free models hitting the router now see the same trimmed catalog they see from the planner, with the same INFO log surface so operators can correlate compaction events across both packs. New helmdeck://context-budgets MCP resource (always listed, no caller scoping, no memory dependency) projects the budgets table — budgets[] with {model, input_tokens, output_tokens, max_catalog_bytes, tier} per entry, a fallback row showing what unmapped models inherit, and a policy string explaining lookup rules. Operators can audit which model gets which tier without grepping source; agents reading the resource can understand why a plan was made under a slim catalog and decide whether to escalate to a stronger model. New compaction field on helmdeck.plan output (optional, omitted on Tier A pass-through) surfaces the Trim record on the wire: {before_bytes, after_bytes, dropped[]}. Same shape as the INFO log line so agents and operators see the same numbers. Output schema declares compaction: object; agents that ignore unknown fields are unaffected. Tests: 2 new route tests cover Tier A full-catalog pass-through and Tier C supersedes preservation; 2 new plan tests cover the omitempty contract on Tier A and the field-present contract on Tier C with a 30-pack catalog overflowing the budget; 2 new mcp tests cover context-budgets listing + read shape. Total: 1029 tests passing across mcp + packs + llmcontext + pipelines. 
Sets up PR #3 (retrieval-augmented tool selection: cascading Select() entry point with lexical pre-filter + TopK + helmdeck://my-plans projection; the public-API shift from CompactCatalog to Select is the migration target that wires PR #1 + PR #2 + PR #3 into one cohesive flow).
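The omitempty contract on the compaction field can be shown with a pointer field and encoding/json. The field names before_bytes, after_bytes, and dropped come from this entry; PlanOutput, its complexity field placement, and the sample values are illustrative assumptions.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Trim mirrors the wire shape {before_bytes, after_bytes, dropped[]}.
type Trim struct {
	BeforeBytes int      `json:"before_bytes"`
	AfterBytes  int      `json:"after_bytes"`
	Dropped     []string `json:"dropped"`
}

// PlanOutput is a hypothetical, reduced output struct. A nil *Trim is
// omitted entirely, which is the Tier A pass-through case.
type PlanOutput struct {
	Complexity string `json:"complexity"`
	Compaction *Trim  `json:"compaction,omitempty"`
}

// render marshals the output to a JSON string; the types used here cannot
// fail to marshal, so the error is ignored in this sketch.
func render(out PlanOutput) string {
	b, _ := json.Marshal(out)
	return string(b)
}

func main() {
	fmt.Println(render(PlanOutput{Complexity: "single-action"}))
	fmt.Println(render(PlanOutput{
		Complexity: "pack-chain",
		Compaction: &Trim{BeforeBytes: 30000, AfterBytes: 13900, Dropped: []string{"intent_keywords"}},
	}))
}
```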
internal/llmcontext module — per-model prompt budgets + deterministic catalog compaction (ADR 050 PR #1). ADR 049 PR #1 (helmdeck.plan) shipped correctly but live-test on a real multi-intent prompt reproducibly empty-completed on free models: openrouter/nvidia/nemotron-3-super-120b-a12b:free returned no completion after 29.5s; openrouter/z-ai/glm-4.5-air:free returned no completion after 58.0s; OpenClaw's MCP client logged 3 timeouts + 1 empty-plan error. Root cause: the catalog projection assembled by buildCatalog() is 35KB of JSON for the current stack (52 packs + 21 pipelines with full metadata). Combined with the user's paste, the system prompt, and the structured-output ceiling, free models with imperfect structured-output reliability bail. The pack itself is correct — the failure is a cross-cutting concern affecting every LLM-backed pack that ships catalog or large input context. New internal/llmcontext module exports three surfaces: Tier (A frontier / B mid-tier / C weak/free), Budget (per-model InputTokens / OutputTokens / MaxCatalogBytes), and CompactCatalog(full, budget) → (compacted, Trim). Tier classifications are calibrated against live OpenClaw tests, not vendor specs — a model with a 32K window that empty-completes at 20K of input is Tier C even though its window is larger than some Tier B models. Budgets table (budgets.go) maps canonical model ids to budgets via exact-match then longest-prefix-wins; unknown models fall back to Tier C so a fresh model still gets a working (if conservative) profile. Compaction order: pack intent_keywords[] → pack typical_use → pack limitations[] → pipeline steps[] bodies (kept: id/name/pack) → pipeline input/output schemas (replaced with field-name lists) → description truncation to first sentence. Each pass marshals + re-checks until len(JSON) <= MaxCatalogBytes. Pipeline metadata.supersedes is NEVER trimmed — it anchors plan's rule P2 (pipeline supersedes packs the user named by hand). 
Pack names + pipeline ids are also preserved (they're dispatch identifiers). helmdeck.plan integration: handler calls llmcontext.BudgetFor(model) + CompactCatalog(catalog, budget) immediately after buildCatalog, before assembling the prompt. When the trim record is non-empty, it logs an INFO line with model, tier, before_bytes, after_bytes, dropped[] so operators see when free models are getting a slim catalog. 20 tests in internal/llmcontext/ cover exact + prefix lookup, Tier-A pass-through, priority-order trim, supersedes survival, slimPipelineSteps preservation of dispatch-relevant fields, firstSentence helper, immutable-input contract, and the still-over-budget marker. 2 plan tests added asserting Tier C compaction never drops dispatch identifiers and Tier A models see the full catalog. Token-counting heuristic: tokens are estimated from byte count at 1 token ≈ 4 chars instead of running a real tokenizer. The trade-off is a bounded cost of being slightly conservative (sending a leaner catalog than the model needs) versus pulling a model-specific tokenizer into Go. Empirical results. Trivial-intent calls on openrouter/openrouter/free post-compaction succeed in ~23s (catalog 30KB → 13.9KB, 54% reduction, all 6 trim steps fire). Complex multi-paragraph intents on the same model still empty-complete: the 14KB irreducible floor (after stripping every metadata field but preserving names, ids, and supersedes) is still too much for some free models when combined with a long user paste and a structured-output ceiling. The right fix for that case is retrieval-augmented tool selection (load only the catalog entries relevant to the intent), designed as the next step of this ADR rather than a brittle entry-truncation hack on top of metadata compaction. Sets up PR #2 (wire helmdeck.route + add helmdeck://context-budgets MCP resource for operator visibility) and PR #3 (retrieval-augmented selection: lexical pre-filter + top-N selection over catalog entries via plan_history priors, ships helmdeck://my-plans projection).
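The lookup rule for the budgets table (exact match, then longest prefix, then a Tier C fallback) can be sketched as follows. The table contents are illustrative: the "example/" ids are hypothetical, the MaxCatalogBytes values follow the per-tier numbers given in the ADR 051 PR #5 entry, and treating every table key as a candidate prefix is an assumption of this sketch rather than a description of budgets.go.

```go
package main

import (
	"fmt"
	"strings"
)

// Budget is reduced to the two fields this sketch needs.
type Budget struct {
	Tier            string
	MaxCatalogBytes int
}

// table uses hypothetical "example/" ids plus one prefix named in this
// changelog; every key doubles as a prefix in this sketch.
var table = map[string]Budget{
	"example/":              {Tier: "C", MaxCatalogBytes: 10000},
	"example/pro-":          {Tier: "B", MaxCatalogBytes: 22000},
	"example/pro-max":       {Tier: "A", MaxCatalogBytes: 0},
	"openrouter/x-ai/grok-": {Tier: "B", MaxCatalogBytes: 22000},
}

// fallback is what unmapped models inherit: a conservative Tier C profile.
var fallback = Budget{Tier: "C", MaxCatalogBytes: 10000}

// budgetFor tries an exact match first, then the longest matching prefix,
// and finally the Tier C fallback.
func budgetFor(model string) Budget {
	if b, ok := table[model]; ok {
		return b
	}
	best := ""
	for prefix := range table {
		if strings.HasPrefix(model, prefix) && len(prefix) > len(best) {
			best = prefix
		}
	}
	if best != "" {
		return table[best]
	}
	return fallback
}

func main() {
	for _, m := range []string{"example/pro-max", "example/pro-2", "example/lite", "unknown/model"} {
		fmt.Println(m, budgetFor(m).Tier)
	}
}
```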
helmdeck.plan self-learning intent decomposer pack (ADR 049 PR #1). ADR 047 PR #3's helmdeck.route answers "given an intent, which ONE tool?" — and that's enough when the user's ask maps to a single pack or pipeline. But real conversational prompts often span multiple intents in one message. A live OpenClaw test made the gap concrete: a free model (nvidia/nemotron-3-super-120b-a12b:free) received a paste + "do you have memory using helmdeck for [paste]... we can use the memory to create a blog to test the memory" and only called the image-gen tool — never reaching for memory tools or the blog pipeline ADR 048 had just shipped. The bridge worked; the agent simply didn't decompose the multi-intent prompt. helmdeck.plan closes that gap: a new LLM-backed meta-pack that returns an ordered steps[] array (each {order, tool, args, rationale}), a derived rewritten_prompt string a free model can execute line-by-line, and a complexity classifier (single-action / pipeline-direct / pack-chain). The rewritten_prompt is derived from steps in the handler (not asked of the LLM independently) so the two surfaces can't drift. Pipeline-aware: reuses helmdeck.route's catalog projection so the model sees both packs AND pipelines; the system prompt teaches three rules — pipeline wins when accepts/produces fit, honor metadata.supersedes (a pipeline supersedes packs the user named by hand), decompose pack-by-pack only when no pipeline matches. Re-implementing a pipeline's curated chain as pack-by-pack steps would regress the supersedes guarantee. Guards: every step.tool MUST resolve to a registered pack name OR the literal helmdeck__pipeline-run with args.id matching a real pipeline; unknown ids get demoted to "tool": "unknown" with a populated rationale, and helmdeck.plan cannot call itself (recursive-call guard). Partial demotion is fine — valid steps survive alongside unknown ones, the agent decides. 
Self-learning seam: every successful plan writes a compact PlanAudit row to the caller's bare namespace under category plan_history (new) — intent SHA + complexity + per-step tool name + SHA-8 of args, NOT the rewritten prompt or rationales. Rows expire after 30 days (matching pack_history / pipeline_history) or via helmdeck.memory_forget. The plan_history category joins the reserved-category guard in internal/packs/facts.go so agents can't poison the projection through helmdeck.memory_store. Sets up PR #2 (helmdeck://my-plans projection mining the history into priors) and PR #3 (frontier-model gap detection comparing expert_baseline against the plan's decomposition). New docs/howto/intent-decomposition.md walks operators through when to call, the wire shape, the pipeline-aware behavior, and the self-learning story; SKILL.md adds a one-paragraph planning tip.
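The step guards can be sketched as a pass over the model's steps. The tool literal helmdeck__pipeline-run and the "unknown" demotion come from this entry; the Step struct layout, the guardSteps name, the rationale wording, and the choice to demote a self-call (rather than reject the whole plan) are assumptions.

```go
package main

import "fmt"

// Step is a hypothetical, reduced form of one plan step.
type Step struct {
	Order     int
	Tool      string
	Args      map[string]any
	Rationale string
}

const (
	pipelineRunTool = "helmdeck__pipeline-run"
	selfPack        = "helmdeck.plan"
)

// demote rewrites a step to the "unknown" tool and records why.
func demote(s Step, why string) Step {
	s.Tool = "unknown"
	s.Rationale = why
	return s
}

// guardSteps (hypothetical name) keeps valid steps and demotes the rest, so
// a partially valid plan still reaches the agent.
func guardSteps(steps []Step, packs, pipelines map[string]bool) []Step {
	out := make([]Step, 0, len(steps))
	for _, s := range steps {
		switch {
		case s.Tool == selfPack:
			s = demote(s, "recursive call to helmdeck.plan is not allowed")
		case s.Tool == pipelineRunTool:
			id, _ := s.Args["id"].(string)
			if !pipelines[id] {
				s = demote(s, fmt.Sprintf("pipeline id %q is not registered", id))
			}
		case !packs[s.Tool]:
			s = demote(s, fmt.Sprintf("pack %q is not registered", s.Tool))
		}
		out = append(out, s)
	}
	return out
}

func main() {
	packs := map[string]bool{"helmdeck.memory_store": true}
	pipelines := map[string]bool{"blog-from-notes": true} // hypothetical pipeline id
	steps := []Step{
		{Order: 1, Tool: "helmdeck.memory_store"},
		{Order: 2, Tool: pipelineRunTool, Args: map[string]any{"id": "missing"}},
		{Order: 3, Tool: selfPack},
	}
	for _, s := range guardSteps(steps, packs, pipelines) {
		fmt.Println(s.Order, s.Tool, s.Rationale)
	}
}
```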
OpenClaw memory-corpus bridge — QMD-compatible MCP endpoint at /api/v1/mcp/qmd/sse (ADR 048 PR #3). Closes the ADR 048 roadmap. ADR 047 + PRs #1 and #2 of ADR 048 built up helmdeck's per-caller memory layer (audit history + agent-written facts); this PR makes that memory queryable from OpenClaw's own memory_search tool so agents see helmdeck's corpus inline with their conversational memory. New QMDServer type (internal/mcp/qmd_server.go) speaks just enough MCP — initialize, tools/list, tools/call — to expose a single tool named query matching the wire shape OpenClaw's MCPorter daemon expects (extensions/memory-core/src/memory/qmd-manager.ts:2167–2205). Response shape: {results: [{docid, score, snippet, collection, file?, start_line?, end_line?}]}. Scoring is substring/keyword (helmdeck doesn't carry embeddings); semantic recall happens client-side via OpenClaw's embedding pipeline (PR #1 sidecar). New SSE transport at /api/v1/mcp/qmd/sse (internal/api/mcp_qmd_sse.go) mirrors /api/v1/mcp/sse 1:1 (session GET/POST pairing, 15s keepalives, chanWriter framing). Separate route + server because MCPorter expects the literal tool name query and the main PackServer uses dotted pack names — multiplexing would collide. Corpus projection renders three layers verbatim: pack_history rows format as ## Pack call: <name> summaries with input fields; pipeline_history rows format as ## Pipeline run: <id>; agent-written user_facts (and any other non-reserved category) surface the raw fact value with a category footer. Caller scoping reuses packs.CallerFromContext so the corpus is JWT-subject-namespaced just like every other memory surface. compose.openclaw-sidecar.yml wires OPENCLAW_MEMORY_QMD_MCPORTER_ENABLED=true + SERVERNAME=helmdeck + STARTDAEMON=true by default; operators opt out via OPENCLAW_QMD_ENABLED=false in their shell. 
scripts/openclaw-register-qmd.sh completes the wire by registering helmdeck with MCPorter's own config (reuses the helmdeck JWT OpenClaw already stores so token rotation propagates). Auto-runs from scripts/install.sh after the stack is healthy; idempotent for manual reruns. Memory-disabled deployments return 503 from /api/v1/mcp/qmd/sse so MCPorter logs a clean "tool not found" and memory_search degrades to the user's local chunks without an agent-side error. New docs/howto/openclaw-memory-corpus.md documents the wire path, verification steps, opt-out, and what the bridge intentionally does NOT do (no writes via this endpoint, no cross-caller mixing, no vault leaks). 9 new tests in internal/mcp/qmd_server_test.go cover handshake, tools/list shape, user-fact + pack-audit projection, per-caller isolation, unknown-tool rejection, nil-store safety, limit clamping, and the dual structuredContent + content envelope so both newer and older MCP clients parse the response.
Helmdeck memory write surface — POST /api/v1/memory/store + helmdeck.memory_store pack + helmdeck://my-memory MCP resource (ADR 048 PR #2). ADR 047 PR #2 made the memory layer queryable; this PR makes it writable. Any MCP client (and the chat agent) can now persist user-supplied facts under the caller's bare namespace with category tagging + TTL. Two surfaces share one engine policy (internal/packs/facts.go → packs.ValidateFact): POST /api/v1/memory/store for REST callers and the management UI; helmdeck.memory_store pack for chat agents calling helmdeck via MCP. Request shape: {key, value, category?, tags?, ttl_seconds?} — key/value required, category defaults to user_facts, TTL defaults to 90 days (min 1h, max 365d). Reserved-category guard: pack_history and pipeline_history are owned by the engine audit hooks and rejected with 400 / CodeInvalidInput so an agent can't poison the my-defaults projection. NoAudit: true on helmdeck.memory_store so storing a fact doesn't pollute the audit history with helmdeck.memory_store ranked alongside real packs. New helmdeck://my-memory MCP resource (always listed): per-caller index of stored facts grouped by category, with counts + recent_keys. Audit categories filtered out — those still surface via helmdeck://my-defaults. Agents read my-memory at the top of a session to discover existing facts before re-asking the user. Lifecycle: facts auto-expire via the existing memory TTL; the existing helmdeck.memory_forget pack (ADR 047 PR #2) already handles scope: "key:<exact-key>" so per-fact cleanup composes for free. SKILL.md teaches the agent to persist durable preferences/conventions via helmdeck__memory_store and to peek helmdeck://my-memory before re-asking. New docs/howto/agent-facts.md walks operators + users through the full lifecycle. Memory-disabled deployments degrade gracefully: writes soft-succeed with a note so chat agents don't have to special-case nil-store paths. 
Sets up PR #3 (OpenClaw memory-corpus bridge: helmdeck's audit + user_facts surface through OpenClaw's memory_search).
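The write policy described above (required key/value, the user_facts default, the reserved-category guard, and the TTL bounds) can be sketched roughly as follows. Every name here (Fact, validateFact, errInvalidInput) is hypothetical and not taken from internal/packs/facts.go. Rejecting an out-of-range TTL rather than clamping it is an assumption; the entry only states the bounds.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Fact mirrors the documented request shape {key, value, category?, tags?, ttl_seconds?}.
// Hypothetical type for illustration; not helmdeck's real struct.
type Fact struct {
	Key        string
	Value      string
	Category   string
	Tags       []string
	TTLSeconds int64
}

const (
	defaultCategory = "user_facts"
	defaultTTL      = 90 * 24 * time.Hour  // documented default
	minTTL          = time.Hour            // documented minimum
	maxTTL          = 365 * 24 * time.Hour // documented maximum
)

// Categories owned by the engine audit hooks; user writes must not use them.
var reserved = map[string]bool{"pack_history": true, "pipeline_history": true}

var errInvalidInput = errors.New("invalid_input")

// validateFact fills in defaults and returns the effective TTL.
// Assumption: an out-of-range TTL is rejected rather than clamped.
func validateFact(f Fact) (Fact, time.Duration, error) {
	if f.Key == "" || f.Value == "" {
		return f, 0, fmt.Errorf("%w: key and value are required", errInvalidInput)
	}
	if f.Category == "" {
		f.Category = defaultCategory
	}
	if reserved[f.Category] {
		return f, 0, fmt.Errorf("%w: category %q is reserved", errInvalidInput, f.Category)
	}
	ttl := defaultTTL
	if f.TTLSeconds != 0 {
		ttl = time.Duration(f.TTLSeconds) * time.Second
	}
	if ttl < minTTL || ttl > maxTTL {
		return f, 0, fmt.Errorf("%w: ttl must be between %s and %s", errInvalidInput, minTTL, maxTTL)
	}
	return f, ttl, nil
}

func main() {
	f, ttl, err := validateFact(Fact{Key: "preferred_persona", Value: "technical"})
	fmt.Println(f.Category, ttl, err)

	_, _, err = validateFact(Fact{Key: "k", Value: "v", Category: "pack_history"})
	fmt.Println(err)
}
```

Keeping the policy in one function is what lets the REST endpoint and the pack share identical behavior, which is the point the entry makes about packs.ValidateFact.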
Embedding sidecar overlay for OpenClaw memory_search semantic recall (ADR 048 PR #1). Today OpenClaw's memory_search degrades to FTS (keyword/BM25) when no embedding provider is configured — recall on a fresh install is "the OpenAI key for provider 'openai' is missing" and the agent falls back to lexical search. ADR 048's first PR ships an opt-in compose overlay (deploy/compose/compose.embeddings.yml) that runs ollama/ollama:latest as helmdeck-embeddings on baas-net, plus a one-shot init service that ollama pulls nomic-embed-text (~270 MB, ~600 MB RAM idle, CPU-only, Apache 2.0). A named volume persists the model cache so container re-creates don't re-download. scripts/install.sh layers the overlay by default; --no-embeddings opts out for operators who'd rather use OpenAI cloud or a remote Ollama. OpenClaw still needs one manual openclaw agents add main to register the openai-compatible provider pointing at http://helmdeck-embeddings:11434/v1 — there's no env-var auto-discovery in OpenClaw today, so zero-config will come in a follow-up once the upstream surface stabilizes. docs/howto/openclaw-memory.md walks operators through verify / override / opt-out paths. Sets up PR #2 (helmdeck memory write surface — POST /api/v1/memory/store + helmdeck.memory_store pack) and PR #3 (OpenClaw memory-corpus bridge — helmdeck's audit history + user_facts surface through OpenClaw's memory_search).
Blog persona directives now call out code blocks, mermaid diagrams, and numeric tables alongside tone/length. When the slides persona enrichment shipped, the technical slides directive started inviting fenced code + mermaid flowchart/sequenceDiagram blocks where the source supports them, and executive/educational/academic each gained a visual affordance hint. The blog rewriter was behind on that side — technical mentioned code blocks but was silent on diagrams; the other personas didn't mention either. blog.rewrite_for_audience's persona map now matches the slides vocabulary: technical invites a mermaid diagram for process/architecture sources; executive promotes a numeric comparison into a small markdown table when more than two values are involved; educational invites a minimal code block before each concept explanation + a mermaid sequence-of-steps where it builds a mental model; academic includes a mermaid diagram or numbered figure when the source presents structured data. marketing / general stay text-first by design (visual treatment for marketing is product screenshots, which the rewriter doesn't control). content.ground is left untouched — it's a citation pass, and asking it to introduce visual structure mid-grounding would destabilize the citations. New TestBlogRewrite_PersonaVisualAffordances asserts the new substrings land in the system prompt per persona so prompt drift surfaces as a test failure.
Auto-split slide overflow for code blocks and image+bullets. Marp silently clips anything that doesn't fit the slide: a 60-line code sample renders with its bottom 30 lines invisible, and the reader blames "the model produced bad slides" when really the renderer ate half the content. slides.outline now runs a deterministic post-pass between the LLM's output and the artifact write: it walks the deck, splits any code block longer than 22 lines into continuation slides ("Title (cont. 2/3)") with the fence reopened on each, and splits any slide where an image sits next to more than 8 lines of bullets/text into image-only + bullets-only continuation slides. Continuation slides carry a synthetic speaker note so slides.narrate produces sensible audio; the LLM's original speaker notes stay on the first chunk. Speaker notes, frontmatter, post-code paragraphs, and image-prompt indices all survive the split (the pass runs BEFORE extractImagePrompts so slide_index maps to the final slide count). max_slides is now a soft cap: the LLM aims for it, but the post-pass overshoots when overflow demands; the output's slide_count reflects the final post-split count. The 22-line threshold is tuned for Marp's default 14pt monospace on a 16:9 slide; the splitter prefers a blank-line boundary within ±3 lines of the cut so functions don't get sliced mid-statement when a natural break exists nearby. Idempotent on already-fitting decks. Wide-table pagination (>20 rows) is a known gap deferred to a follow-up; the existing 60vh CSS cap keeps tables on-slide for now. All 7 slide pipelines (grounded-deck, grounded-narrate, research-deck, research-narrate, research-ground-deck, scrape-deck, repo-presentation) get the fix automatically; no pipeline changes are needed.
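A minimal sketch of the code-block half of that post-pass, assuming a hypothetical splitCode helper (the real splitter's name and internals are not given here). It cuts at the line threshold but prefers to cut just after a blank line found within the slack window. Searching outward from the nominal cut and keeping the blank line at the end of the earlier chunk are both assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

func isBlank(s string) bool { return strings.TrimSpace(s) == "" }

// splitCode cuts lines into chunks of roughly limit lines each. When a blank
// line exists within slack lines of the nominal cut, the cut lands just after
// it so a function is less likely to be sliced mid-statement.
func splitCode(lines []string, limit, slack int) [][]string {
	var chunks [][]string
	for len(lines) > limit {
		cut := limit // default: hard cut at the threshold
		for d := 0; d <= slack; d++ {
			if i := limit - d; i > 0 && isBlank(lines[i]) {
				cut = i + 1
				break
			}
			if i := limit + d; i < len(lines) && isBlank(lines[i]) {
				cut = i + 1
				break
			}
		}
		chunks = append(chunks, lines[:cut])
		lines = lines[cut:]
	}
	if len(lines) > 0 {
		chunks = append(chunks, lines)
	}
	return chunks
}

func main() {
	// A 30-line sample with one blank line at index 20.
	code := make([]string, 30)
	for i := range code {
		code[i] = fmt.Sprintf("line %d", i)
	}
	code[20] = ""

	for i, chunk := range splitCode(code, 22, 3) {
		fmt.Printf("chunk %d: %d lines\n", i+1, len(chunk))
	}
}
```

A deck that already fits passes through as a single chunk, which matches the idempotence the entry describes.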
Routing Memory management UI (ADR 047 PR #4). Closes the ADR 047 roadmap. PR #2 added per-caller audit memory + helmdeck://my-defaults over MCP; this PR makes that data visible and clearable from the Management UI without needing an MCP-aware client. New page at /memory ("Routing Memory" in the sidebar) shows three blocks for the logged-in caller: (1) Learned pack defaults ranked by call count, each row carrying the common_inputs chips the chat agent pre-fills from; (2) Learned pipeline defaults — same but per pipeline; (3) Recent activity — the last 200 audit rows with {kind, id, outcome, when, learn_inputs}. Every row gets a forget button (per-pack-id / per-pipeline-id / per-exact-key), each defaults section has a "Clear all packs / pipelines" button, and the header has a global "Clear all history". Backed by two new REST endpoints — GET /api/v1/memory/defaults (the projection + recent rows) and POST /api/v1/memory/forget with scope body — both wired in internal/api/memory.go. The forget endpoint accepts the same scope vocabulary as the helmdeck.memory_forget pack (all / packs / pipelines / pack:<id> / pipeline:<id>) plus a new key:<exact-key> scope that backs the per-row buttons. Auth-disabled deployments resolve the caller to "unknown" (matching packs.callerFromContext's convention) so audit rows remain queryable. Memory-disabled deployments return an empty payload + an explanatory note; forget is a soft-success no-op.
helmdeck.route meta-pack with gap analysis (ADR 047 PR #3). PR #1 made the catalog self-describing; PR #2 gave it per-caller memory; PR #3 is the LLM-backed router that fuses both into a single call the agent makes BEFORE picking a pack/pipeline. Inputs: user_intent (the user's natural-language request) + model. Internally builds the same routing-guide projection PR #1 ships at helmdeck://routing-guide plus the same defaults projection PR #2 ships at helmdeck://my-defaults (now factored into a reusable packs.Defaults projection both surfaces share) and packs them into one model prompt. Returns a structured {recommendation, alternatives[<=3], gap_warning, reasoning, model} JSON object: recommendation is the best fit with suggested_inputs pre-filled from learned defaults; alternatives are runners-up; gap_warning is populated with a structured pack proposal (name, input_schema, output_schema, integration_pattern, why_useful) when nothing in the catalog fits. The agent confirms with the user, then either runs the recommendation or files the gap as a GitHub issue. Hallucination guard: if the model returns an id that doesn't exist in the catalog, the handler demotes the recommendation and surfaces a gap_warning so the agent can't dispatch to nothing. Audit IS recorded for helmdeck.route itself — "how often is the router called and what does it route to" is exactly the meta-signal PR #4's management UI surfaces. Registered in main.go with the existing vision dispatcher + pack registry + a thin *pipelines.Store adapter; degrades to "no dispatcher" CodeInternal when no gateway is configured. SKILL.md teaches the agent to prefer helmdeck__route over reading helmdeck://routing-guide directly for multi-step requests.
Per-caller audit memory + helmdeck://my-defaults + helmdeck.memory_forget (ADR 047 PR #2). PR #1 made the catalog self-describing; PR #2 turns every pack and pipeline run into a learning event so a fresh conversation can pre-fill from past use instead of starting from zero. *packs.Engine.Execute and *pipelines.Runner.RunSync now write one audit row per terminal outcome — pack name (or pipeline ID + run ID), outcome (ok or the closed-set error code), duration, and a tiny learn_inputs map containing the most useful low-cardinality string fields (persona, audience, angle, model, theme, voice, title, author, kind, format, persona_used). Markdown bodies, URLs, raw queries are dropped at write time — audit memory is for routing hints, not data retention. Rows live under the caller's bare namespace (just callerFromContext(ctx), not session-scoped) so learning spans sessions for the same authenticated subject. helmdeck://my-defaults is a new always-listed MCP resource that aggregates a caller's recent audits into top-N packs + top-N pipelines, each with a common_inputs map ("most-used value per field"). The agent's contract is to peek here before asking the user for inputs that have learned defaults: pre-fill and confirm rather than re-ask from scratch. Empty arrays mean no history; ask normally. helmdeck.memory_forget is the cleanup half — a pack the agent calls when the user says "forget my defaults" with scope = all (or packs, pipelines, pack:<id>, pipeline:<id> for targeted resets). Targets only audit rows (categories pack_history / pipeline_history); never touches pack caches or vault credentials. Audit rows otherwise expire automatically after 30 days via memory.WithTTL. Memory-disabled deployments: every surface degrades gracefully — audit is a no-op, my-defaults returns an empty payload with a note, forget is a soft-success no-op. SKILL.md teaches the agent to query my-defaults before asking. 
Sets up PR #3 (helmdeck.route meta-pack with gap analysis) and PR #4 (memory-management UI in web/).
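The common_inputs projection ("most-used value per field") described above could be computed along these lines. AuditRow and commonInputs are illustrative names, not helmdeck's own, and the lexicographic tie-break is an assumption added only to make the output deterministic.

```go
package main

import "fmt"

// AuditRow is a hypothetical stand-in for one audit record's learn_inputs.
type AuditRow struct {
	Pack        string
	LearnInputs map[string]string
}

// commonInputs returns, for each field seen in the rows, the value that
// occurred most often. Ties go to the lexicographically smaller value
// (an assumption; the changelog does not specify tie-breaking).
func commonInputs(rows []AuditRow) map[string]string {
	counts := map[string]map[string]int{}
	for _, r := range rows {
		for field, val := range r.LearnInputs {
			if counts[field] == nil {
				counts[field] = map[string]int{}
			}
			counts[field][val]++
		}
	}
	out := make(map[string]string, len(counts))
	for field, vals := range counts {
		best, bestN := "", 0
		for v, n := range vals {
			if n > bestN || (n == bestN && v < best) {
				best, bestN = v, n
			}
		}
		out[field] = best
	}
	return out
}

func main() {
	rows := []AuditRow{
		{Pack: "slides.outline", LearnInputs: map[string]string{"persona": "technical", "model": "openrouter/auto"}},
		{Pack: "slides.outline", LearnInputs: map[string]string{"persona": "executive"}},
		{Pack: "slides.outline", LearnInputs: map[string]string{"persona": "technical"}},
	}
	fmt.Println(commonInputs(rows))
}
```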
Self-describing routing metadata + helmdeck://routing-guide MCP resource (ADR 047 PR #1). Today every pack and pipeline is a schema with no machine-readable hint about when to use it vs. an alternative — the agent has to read SKILL.md prose and infer. Packs and pipelines now carry an additive metadata block (accepts / produces / intent_keywords / typical_use / limitations, plus supersedes on pipelines for doc-rewrite-blog → doc-ground-blog-style transitions) populated on 10 packs (blog.rewrite_for_audience, content.ground, slides.outline, doc.parse, web.scrape, research.deep, podcast.generate, swe.solve, github.get_issue, hyperframes.compose) and 5 pipelines. A new always-on MCP resource at helmdeck://routing-guide returns a thin catalog projection — policy text (6-step pack-vs-pipeline decision flow) + per-entry {id, title, description, metadata} for packs and pipelines, with empty metadata collapsed off the wire so the resource stays compact. Cooperates with the existing full helmdeck://packs catalog rather than replacing it: clients fetch routing-guide once per turn to pick, then pull the full schema for the chosen entry. SKILL.md gets a one-paragraph routing tip ("for any multi-step request, query helmdeck://routing-guide first"). Lays the groundwork for memory audit hooks (PR #2) and a helmdeck.route meta-pack with gap analysis (PR #3).
Persona + audience + angle + outline-export + image-prompts on all seven slide pipelines. slides.outline already accepted a persona input, but none of the slide pipelines (grounded-deck, grounded-narrate, research-deck, research-narrate, research-ground-deck, scrape-deck, repo-presentation) forwarded it, so every deck defaulted to the generic general register. All seven now thread persona / audience / angle / title / author plus two new opt-in flags through to the outline step. Persona vocabulary now matches blog: general / technical / marketing / executive / educational / academic (last one new on the slides side). Each persona's directive is enriched to drive slide content, not just tone: technical asks for fenced code blocks and mermaid diagrams; educational for a "Try this" slide; marketing for scannable bullets + CTA; executive for numbers + decisions; academic for hedged language and an open-questions closing. export_outline: true persists the final Marp markdown as an outline.md artifact alongside the PDF/MP4 so the user can review or edit the structure and re-render. include_image_prompts: true asks the model to embed image-prompt comments in speaker notes, and a handler post-pass emits a structured image_prompts: [{slide_index, prompt}] array on the outline-step output, so the prompts are visible inline (Marp presenter view) and available in structured form for downstream image-gen tools. SKILL.md teaches the agent to ask for persona + audience + angle on slide pipelines, mirroring the blog picker.
builtin.brief-rewrite-blog. Closes the rewrite-blog matrix for pasted user input. Takes a brief (markdown — a title idea + hook + what-to-cover + audience pitch — not a finished draft) and runs it through blog.rewrite_for_audience to expand into an original post, then content.ground (citation-only) and blog.publish. Inputs: brief, audience, angle?, persona?, title. Use this when the user pastes ideas/outline notes; the matrix is now: brief paste → brief-rewrite-blog, PDF/DOCX → doc-rewrite-blog, web page → scrape-rewrite-blog, research query → research-rewrite-blog.
persona input on blog.rewrite_for_audience and content.ground. Without it, every blog rewrite defaulted to a formal-academic register even when the audience was developers, marketers, or executives. Both packs now accept a closed-set persona (general / technical / marketing / executive / educational / academic — same vocabulary as slides.outline) that injects a tone+register+length directive into the system prompt. Unknown keys are treated as freeform tone hints (e.g. "crisp newsroom"). Persona is threaded through all four blog pipelines (brief-rewrite-blog, doc-rewrite-blog, scrape-rewrite-blog, research-rewrite-blog); each pack echoes persona_used on output. In content.ground, persona only affects the rewrite (rewrite:true) path — citation-only mode preserves voice by design. SKILL.md teaches the agent to ask for persona alongside audience and angle.
builtin.scrape-rewrite-blog and builtin.research-rewrite-blog. Mirrors of the doc-rewrite-blog swap shipped earlier — for borrowed sources from a web page (scrape-rewrite-blog) or a deep-research query (research-rewrite-blog), the pipeline now runs the source through blog.rewrite_for_audience before publishing instead of saving the cited synthesis verbatim. Both gain audience and angle inputs; the existing builtin.grounded-blog (which takes the user's OWN markdown as input) is unchanged because it should preserve the user's voice, not rewrite it. SKILL.md gains a small picker table so the agent reaches for the right pipeline by source type.
blog.rewrite_for_audience pack + builtin.doc-rewrite-blog pipeline. The old builtin.doc-ground-blog chain (doc.parse → content.ground → blog.publish) produced a citation-strengthened transcription of the source — useful as research notes, but as a blog post it read as republishing someone else's work. The new pack translates a source document into an ORIGINAL post for a stated audience and angle: it leads with why-it-matters, de-jargons the source's terms, connects them to tools the audience uses, and adds an explicit "Author's note" — staying grounded in source_content (the system prompt forbids claims not present in the source). The new pipeline chains it after doc.parse and follows with content.ground (citation-only) as a post-rewrite citation pass. Inputs: source_url, audience, angle?, title. SKILL.md instructs the agent to ask the user for audience+angle before running (defaults exist but produce bland output).

Removed

builtin.grounded-blog. Replaced by builtin.brief-rewrite-blog (above). The old pipeline ran content.ground (rewrite:true) → blog.publish on whatever markdown was pasted — but content.ground is an annotator, not a generator, so the output was always roughly the same length and shape as the input. A pasted brief came back as the brief plus a few [source] links — never a real blog post. The startup reaper deletes the persisted row on upgrade. Operators with finished drafts who want pure citation-strengthening (the one case grounded-blog WAS the right tool for) should call content.ground directly or helmdeck__pipeline-create a custom pipeline with rewrite:true.
builtin.scrape-ground-blog and builtin.research-blog. Same product issue as doc-ground-blog: they took a borrowed source and saved the cited synthesis verbatim, which reads as republishing rather than as an original blog post. Replaced by builtin.scrape-rewrite-blog and builtin.research-rewrite-blog (above). The startup reaper deletes the persisted rows on upgrade. Operators who depended on the raw cited-synthesis can recreate it via helmdeck__pipeline-create with the old shape.
builtin.doc-ground-blog. Replaced by builtin.doc-rewrite-blog (above). The old pipeline's output was an honest cited-rewrite (the description matched the mechanism), but the result wasn't a usable blog post — it cited the source's own claims without adding any perspective. Operators who depended on the raw cited-transcription should call doc.parse + content.ground (rewrite:true) directly via MCP, or helmdeck__pipeline-create a custom pipeline that matches the old shape. A new startup reaper in the pipeline store (PruneStaleBuiltins) deletes any persisted builtin=1 rows whose id is no longer in the current Builtins() set — so on the upgrade from a prior version, operators land on a clean catalog without running SQL by hand. User-created pipelines are never touched (the guard is the builtin column, not the id prefix).
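The reaper's guard (delete only rows flagged builtin whose id is no longer in the current set) can be illustrated with an in-memory sketch. Row and pruneStaleBuiltins are hypothetical names, and the slice stands in for the store's persisted rows; the real PruneStaleBuiltins operates on the pipeline store.

```go
package main

import "fmt"

// Row is a hypothetical in-memory stand-in for a persisted pipeline row.
type Row struct {
	ID      string
	Builtin bool
}

// pruneStaleBuiltins drops rows that are flagged builtin but whose id is not
// in the current built-in set. User-created rows are always kept, because the
// guard is the Builtin flag and not the id prefix.
func pruneStaleBuiltins(rows []Row, current map[string]bool) []Row {
	kept := make([]Row, 0, len(rows))
	for _, r := range rows {
		if r.Builtin && !current[r.ID] {
			continue // stale built-in: removed on upgrade
		}
		kept = append(kept, r)
	}
	return kept
}

func main() {
	current := map[string]bool{"builtin.doc-rewrite-blog": true}
	rows := []Row{
		{ID: "builtin.doc-ground-blog", Builtin: true},  // removed upstream
		{ID: "builtin.doc-rewrite-blog", Builtin: true}, // still shipped
		{ID: "builtin.my-copy", Builtin: false},         // user-created, kept despite the prefix
	}
	for _, r := range pruneStaleBuiltins(rows, current) {
		fmt.Println(r.ID)
	}
}
```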
Coding pipelines (beta) + github.get_issue pack. Four new pipelines wrap swe.solve for the common coding workflows — builtin.issue-to-pr (read a GitHub issue → open a PR), builtin.repo-solve-pr (ad-hoc task → PR), builtin.repo-solve-branch (push without PR), builtin.repo-solve-patch (preview as a unified diff, no remote write). They appear under a new Coding section on the Pipelines page (output badge: Code) with a yellow beta tag rendered from a " (beta)" suffix on the pipeline name. A new lightweight github.get_issue pack — mirror of github.list_issues but filtering by {repo, issue_number} — feeds the issue's title + body into swe.solve's task field; it shares the same 5-minute read-through cache as list_issues so a rerun against the same issue doesn't re-hit the REST API. ADR 046 documents the policy plus a research-backed recommendation for the next coding-agent integration (Cline is the recommended v2; OpenHands needs a spike; Aider doesn't fit the pack contract; full SWE-agent isn't needed alongside mini).
Pipelines page is grouped by output format. The flat table of every built-in is gone; the page now renders one section per output category — Video / Slides / Blog / Podcast / Other — in a fixed order, with each row showing its output as a badge (MP4 / PDF / MP3 / Blog). So "I want to make a video" is one heading and four rows away instead of a description-by-description scan. Category is inferred client-side from each pipeline's terminal pack (slides.render → PDF / Slides, slides.narrate & hyperframes.render → MP4 / Video, podcast.generate → MP3 / Podcast, blog.publish → Blog; anything else falls to "Other") — no SQL migration, no MCP wire change.
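The terminal-pack mapping above is small enough to write down directly. The real inference runs client-side in the UI; this Go sketch only illustrates the lookup. The categorize name and the empty badge for "Other" are assumptions.

```go
package main

import "fmt"

// categorize maps a pipeline's terminal pack to its page section and output
// badge, following the mapping listed in the entry above.
func categorize(terminalPack string) (category, badge string) {
	switch terminalPack {
	case "slides.render":
		return "Slides", "PDF"
	case "slides.narrate", "hyperframes.render":
		return "Video", "MP4"
	case "podcast.generate":
		return "Podcast", "MP3"
	case "blog.publish":
		return "Blog", "Blog"
	default:
		return "Other", "" // assumption: no badge for uncategorized pipelines
	}
}

func main() {
	for _, pack := range []string{"slides.narrate", "blog.publish", "repo.push"} {
		category, badge := categorize(pack)
		fmt.Printf("%s -> %s (%q)\n", pack, category, badge)
	}
}
```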
builtin.grounded-narrate and builtin.grounded-podcast pipelines. Mirrors of the existing builtin.grounded-deck / builtin.grounded-blog for the video and podcast outputs — a single markdown input is grounded against web sources via content.ground (so un-citable claims are marked skipped rather than silently passed through), then turned into a narrated MP4 (slides.outline → slides.narrate) or a multi-speaker MP3 (podcast.generate). Closes the matrix: paste a chunk of pre-researched notes and produce any of the four output formats in one call.

Fixed

doc.parse rejects non-document URLs upfront with a routing hint. The pack used to accept any source_url and forward it to Docling — so a Medium / blog / extension-less URL slipped through, Docling tried (and failed) to fetch + parse it, and the user got a cryptic pack_bug (e.g. docling 404: task result not found) instead of "wrong tool, here's the right one." The pack now closed-set-allowlists the URL's file extension at input validation: .pdf .docx .pptx .xlsx .odt .ods .odp .png .jpg .jpeg .tif .tiff (case-insensitive, query strings ignored). Anything else — web pages, arxiv abstract URLs (/abs/1706.03762), .epub, etc. — fails fast with a caller_fixable message that points to web.scrape for web pages and to source_b64 + filename for documents whose URL has no extension. Pack description rewritten to declare the same contract so MCP-listing agents pick the right pack first time. source_b64 path is unchanged (the filename requirement already carried the type discriminator).
doc.parse against current Docling. Upstream Docling consolidated its /v1/convert/source request body from separate http_sources / file_sources arrays into a single discriminated sources array (each entry tagged by kind: "http" | "file" | "s3" | …). The pack lagged and was sending the old shape, so every call against a recent docling-serve image failed pack_bug with HTTP 422: missing body.sources. The pack now sends sources: [{kind, …}] matching the live OpenAPI schema; the existing happy-path tests are updated to assert the new shape and explicitly fail if either legacy field reappears.
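The extension allowlist that doc.parse applies to source_url (closed set, case-insensitive, query strings ignored) might look roughly like this. The function name allowedDocURL is hypothetical, and using net/url plus path.Ext is one plausible way to implement the check, not necessarily the pack's own.

```go
package main

import (
	"fmt"
	"net/url"
	"path"
	"strings"
)

// The closed set of extensions listed in the changelog entry.
var allowedExt = map[string]bool{
	".pdf": true, ".docx": true, ".pptx": true, ".xlsx": true,
	".odt": true, ".ods": true, ".odp": true,
	".png": true, ".jpg": true, ".jpeg": true, ".tif": true, ".tiff": true,
}

// allowedDocURL reports whether the URL's path ends in an allowlisted
// extension. url.Parse keeps the query string out of u.Path, so "?dl=1"
// does not affect the result.
func allowedDocURL(raw string) bool {
	u, err := url.Parse(raw)
	if err != nil {
		return false
	}
	return allowedExt[strings.ToLower(path.Ext(u.Path))]
}

func main() {
	for _, raw := range []string{
		"https://example.com/report.PDF?download=1",
		"https://arxiv.org/abs/1706.03762",
		"https://example.com/some-blog-post",
	} {
		fmt.Println(raw, allowedDocURL(raw))
	}
}
```

Note that the arxiv abstract URL is rejected because path.Ext sees ".03762", which is not in the set; that is the same outcome the entry describes for /abs/ URLs.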

[0.21.0] - 2026-05-30

Theme: Pipelines you can see into, stop, and resize. Running runs now surface each step's live progress in the UI; a Cancel button (+ helmdeck__pipeline-cancel MCP tool, + REST) genuinely stops a wedged run by tearing down its session container; the runner auto-cleans in-flight runs orphaned by a control-plane restart; and CPU-bound packs (hyperframes.render, slides.narrate) declare a host-aware compute profile instead of inheriting the legacy 1-core default. Plus a new hyperframes.compose pack turns a plain-language description into a HyperFrames composition so callers no longer hand-author the data-* / window.__timelines contract.

Added

CPU profiles for session packs. A pack now declares its workload class — session.ProfileIO (the default, 1 core) or session.ProfileCompute (host-aware autodetect) — instead of a raw core count. The runtime resolves the compute profile to clamp(host_cores - 1, 1, 6) so an 8-core box gives a video render 6 cores instead of 1, and operators tune per-profile via HELMDECK_IO_CPU_LIMIT / HELMDECK_COMPUTE_CPU_LIMIT. hyperframes.render and slides.narrate (MP4 encode) migrate to ProfileCompute; every other session pack stays on the implicit ProfileIO default (no behavior change). On boot the control plane logs the resolved per-profile caps. New CPU-bound packs (and marketplace packs) pick a profile instead of reimplementing the host-aware math. See ADR 045 for the policy and docs/reference/hardware-sizing.md for operator-facing numbers.
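The compute-profile resolution clamp(host_cores - 1, 1, 6) works out as in the sketch below. The function names are illustrative, and the HELMDECK_COMPUTE_CPU_LIMIT override is deliberately left out.

```go
package main

import (
	"fmt"
	"runtime"
)

// clamp bounds v to the inclusive range [lo, hi].
func clamp(v, lo, hi int) int {
	if v < lo {
		return lo
	}
	if v > hi {
		return hi
	}
	return v
}

// computeCores resolves the compute profile for a host: leave one core for
// the rest of the system, never go below 1, and cap at 6.
func computeCores(hostCores int) int {
	return clamp(hostCores-1, 1, 6)
}

func main() {
	for _, host := range []int{1, 2, 4, 8, 32} {
		fmt.Printf("host=%d cores -> compute profile=%d\n", host, computeCores(host))
	}
	fmt.Println("this machine:", computeCores(runtime.NumCPU()))
}
```

The cap matters on large hosts: a 32-core box still resolves to 6, so a single render cannot monopolize the machine.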
Running pipelines show live per-step progress and can be cancelled. The pipelines UI now renders each running step's latest progress message (e.g. compose "outlining…", render "rendering 1920×1080…/uploading…") inline beside its status badge, refreshed by the existing 3s poll — so a long run stops being a black box. A Cancel button (and POST /api/v1/pipelines/{id}/runs/{runId}/cancel + the helmdeck__pipeline-cancel MCP tool) hard-stops a running or pending run: it fires the run's context cancel AND force-removes every session container tagged with the run's id (via a new helmdeck.run_id Docker label), so a stuck render frees its CPU within ~1-2s instead of waiting on the 30-min pipeline timeout. Partial output from the in-flight step is discarded by design. Already-terminal runs return 409 not_cancellable.
hyperframes.compose + describe-a-video pipelines. A new pack turns a plain-language description into a HyperFrames HTML/CSS/JS composition ready for hyperframes.render — so callers no longer hand-author the data-* / window.__timelines contract. The pack guarantees the render contract (canvas sized to the aspect ratio, root scaffolding, a paused GSAP window.__timelines registration); the model writes only the creative visuals. Two one-call pipelines chain it: builtin.prompt-video (describe → compose → render, silent) and builtin.prompt-narrated-video (describe → podcast.generate → compose with the narration synced → render). podcast.generate now always emits audio_url (empty when no presigned store is configured) so the narrated pipeline degrades to a silent video instead of failing on a missing reference. builtin.html-video stays for agent-authored compositions, with its description/docs reworded to make clear the HTML is agent-authored, not hand-typed.

Fixed

Docker image pulls retry on transient failures. Runtime.ensureImage would call docker pull exactly once — so a Docker Hub 429, a TLS handshake hiccup, or a transient connection reset on the way to a registry failed the whole session Create with no such image. CI runs that pulled alpine:3 for the runtime smoke test broke regularly on rate-limited shared runners. The pull now retries up to 3 times with a 0/2/4-second linear backoff, honoring ctx.Done() between attempts. Permanent errors (manifest unknown, unauthorized, denied, no such image) fail fast — the retry is for the transient class only.
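A sketch of the retry shape described above: three attempts, a linear backoff, cancellation honored between attempts, and a fast exit on permanent errors. All names are hypothetical, and classifying errors by message substring is an assumption about how the permanent class might be detected. With unit set to 2 seconds the waits before the three attempts are 0, 2, and 4 seconds.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"strings"
	"time"
)

const maxAttempts = 3

// Example errors for the demo; real pull errors come from the Docker client.
var (
	errTransient = errors.New("connection reset by peer")
	errPermanent = errors.New("manifest unknown")
)

// isPermanent matches the permanent classes named in the entry.
// Assumption: classification by message substring.
func isPermanent(err error) bool {
	msg := strings.ToLower(err.Error())
	for _, s := range []string{"manifest unknown", "unauthorized", "denied", "no such image"} {
		if strings.Contains(msg, s) {
			return true
		}
	}
	return false
}

// pullWithRetry calls pull up to maxAttempts times, waiting attempt*unit
// before each retry and returning early if ctx is cancelled.
func pullWithRetry(ctx context.Context, unit time.Duration, pull func() error) error {
	var err error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if attempt > 0 {
			timer := time.NewTimer(time.Duration(attempt) * unit)
			select {
			case <-ctx.Done():
				timer.Stop()
				return ctx.Err()
			case <-timer.C:
			}
		}
		if err = pull(); err == nil || isPermanent(err) {
			return err
		}
	}
	return err
}

// simulate runs the loop against a fake pull that fails the first `failures`
// calls with failWith, using a 1ms unit so the demo finishes quickly.
func simulate(failures int, failWith error) (calls int, err error) {
	err = pullWithRetry(context.Background(), time.Millisecond, func() error {
		calls++
		if calls <= failures {
			return failWith
		}
		return nil
	})
	return calls, err
}

func main() {
	fmt.Println(simulate(2, errTransient)) // succeeds on the third attempt
	fmt.Println(simulate(5, errPermanent)) // fails fast after one attempt
}
```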
In-flight pipeline runs are no longer stuck on running after a control-plane restart. A pipeline run's terminal status is written by the in-process goroutine that's executing it; that goroutine dies with its process, so a restart left every active run frozen at running in SQLite forever, with no way to clear them — Cancel even reported success because the API ack'd, but nothing in-process flipped the row. On boot the runner now scans pipeline_runs WHERE status IN ('pending','running') and reaps each one to failed with failure_class=transient and failure_reason="control plane restarted while this run was in flight", marking any in-flight steps inside the run the same way (so the UI's per-step badges aren't stuck either). Runs before the HTTP listener accepts requests, so there's no race with a live goroutine. Already-terminal runs are untouched; the reaper is idempotent.
Pipeline MCP tools are now callable as helmdeck__pipeline-run (and -list/-get/-create/-run-status/-rerun). They baked the helmdeck__ server prefix into their MCP tool names while pack tools are advertised bare — so namespacing MCP clients (OpenClaw, etc.) double-prefixed them to helmdeck__helmdeck__pipeline-run, making the documented name (the UI copy-prompt button, SKILL.md, the prompt templates) fail with "tool not found." Pipeline tools are now advertised bare (pipeline-run, …) like packs, so the client resolves them to helmdeck__pipeline-* exactly as documented. (MCP pipeline tools were previously only reachable via REST.)
Built-in podcast pipelines run without a manually-supplied model. builtin.research-podcast, builtin.repo-readme-podcast, and builtin.prompt-narrated-video chained podcast.generate in source-text mode (which writes the script via an LLM) but omitted the model field, so a real run failed caller_fixable ("model is required …"). They now default model to openrouter/auto like every other pipeline, so the run needs only its documented input (no model/voice to supply — speakers is already pre-set to ElevenLabs premade voices).
podcast.generate prompt / source_url / source_text modes work in gateway deployments again. The pack was registered twice at startup — with the gateway dispatcher inside the gateway-gated block, then again with nil after it — and the registry is last-wins, so the nil registration clobbered the dispatcher one. Any prompt/source-mode call (the script-generating modes) then failed internal: registered without a gateway dispatcher. The nil/body-mode registration now runs before the gated block (same order as blog.publish), so the dispatcher version wins when a gateway is configured.
Pipeline runner no longer threads a non-preserved session into later steps. It carried _session_id forward after every session-producing step, including ones whose session is torn down at step end (PreserveSession: false). So builtin.prompt-narrated-video handed podcast.generate's already-dead session id to hyperframes.render, which failed session_unavailable: session not found (render needs its own hyperframes-sidecar session anyway). The runner now only threads a session forward from a pack that preserves it — e.g. repo.fetch → repo.map/fs.*/repo.push still chain correctly.

[0.20.0] - 2026-05-28

Theme: A more trustworthy agent surface. Pipelines reject unfilled {{PLACEHOLDER}} inputs instead of running with them; built-in pipeline descriptions say what the packs actually do (cite + save, not "rewrite + publish"); slides.outline guarantees a title slide and gains audience personas so decks open and close properly; and a new installable helmdeck-debug skill sweeps every pipeline + pack and drafts GitHub issues for what it finds.

Added

Pipeline runs reject unfilled {{PLACEHOLDER}} inputs. An input whose value is still a literal prompt-template variable (e.g. title = {{TITLE}}, pasted from the prompt-template docs without substituting) now fails fast with a caller_fixable error that names the input and tells the agent to fill it — ask the user for a value, or propose one and confirm it — instead of silently running and producing a post titled {{TITLE}}.
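One way to detect an unfilled template variable is a whole-value pattern match, sketched below. The function name, the exact regular expression (an uppercase name between double braces), and the error wording are all assumptions; the entry only specifies that such inputs fail with a caller_fixable error naming the input.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
	"strings"
)

// Matches a value that is nothing but a {{NAME}} placeholder.
// Assumption: placeholder names are uppercase letters, digits, and underscores.
var placeholderRe = regexp.MustCompile(`^\{\{\s*[A-Z][A-Z0-9_]*\s*\}\}$`)

// rejectPlaceholders returns an error naming the first input (in sorted
// order) whose value is still a literal template variable.
func rejectPlaceholders(inputs map[string]string) error {
	names := make([]string, 0, len(inputs))
	for name := range inputs {
		names = append(names, name)
	}
	sort.Strings(names) // deterministic error regardless of map order

	for _, name := range names {
		value := strings.TrimSpace(inputs[name])
		if placeholderRe.MatchString(value) {
			return fmt.Errorf("caller_fixable: input %q is still the unfilled placeholder %s; ask the user for a value, or propose one and confirm it", name, value)
		}
	}
	return nil
}

func main() {
	fmt.Println(rejectPlaceholders(map[string]string{"title": "{{TITLE}}", "audience": "developers"}))
	fmt.Println(rejectPlaceholders(map[string]string{"title": "Why {{ braces }} matter in templates"}))
}
```

Anchoring the pattern to the whole value keeps legitimate text that merely mentions double braces from being rejected.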
helmdeck-debug integration-debugger skill. A second installable agent skill (skills/helmdeck-debug/SKILL.md) that sweeps every pipeline + pack — a static check of the definitions (oversold descriptions, unguarded inputs, schema-vs-handler drift, failure misclassification) plus a live end-to-end run sweep classified by failure_class — and drafts a ready-to-file GitHub issue per real bug, confirming before it files anything. Both installers now ship it: scripts/configure-openclaw.sh installs every skills/*/SKILL.md, and the new scripts/configure-claude.sh installs them into a project's .claude/skills/.
slides.outline guarantees a title slide and supports personas + an author byline. When title is provided, the pack deterministically prepends a title slide (with an optional author byline) if the model omitted one — and never duplicates one the model already wrote. A new persona input (general/technical/marketing/executive/educational, or any freeform audience string) injects an audience-appropriate tone and closing-slide directive (e.g. marketing → call-to-action, executive → the decision/ask). New outputs has_title_slide + persona_used. The strengthened prompt makes the opening/closing slides a hard requirement; SKILL.md now tells agents to ask the user for title/author/persona before generating.

Changed

Honest descriptions for the ground/blog built-in pipelines. grounded-blog, scrape-ground-blog, doc-ground-blog, research-blog, grounded-deck, and research-ground-deck no longer say "fact-check + rewrite … publish." content.ground cites claims against web sources (and, in rewrite mode, strengthens the cited sentences) — it does not rewrite a post into a new voice or structure — and blog.publish saves a markdown artifact by default (publishing to Ghost requires cloning the pipeline with a credential + host). The descriptions and prompt-template docs now say so.

[0.19.1] - 2026-05-28

Fixed

Pipelines page "Copy prompt" button now works over plain HTTP. navigator.clipboard only exists in a secure context (HTTPS or localhost), so on a Management UI served over plain HTTP on a LAN host the button hit an undefined clipboard, threw, and was silently swallowed — nothing reached the clipboard. It now falls back to a hidden-<textarea> + execCommand('copy') in non-secure contexts and reflects the real result (Copied / Copy failed), so it can never silently do nothing.

[0.19.0] - 2026-05-28

Theme: Repo presentations worth watching. builtin.repo-presentation (replacing repo-readme-narrate) builds a narrated deck from a repo's README plus its docs and code structure — not a paraphrase of the front page — backed by a new repo.fetch docs output.

Added

repo.fetch now surfaces a docs output — concatenated markdown/adoc/rst from the repo's doc dirs (docs/, doc/, content/, …) plus top-level design docs (ARCHITECTURE.md, DESIGN.md, …), bounded to 16 KB with a path header per file (empty when the repo has none). Lets presentation/grounding pipelines ground on a project's real docs, not just its README.

Changed

builtin.repo-readme-narrate replaced by builtin.repo-presentation. The old starter fed only the README to slides.outline, so a thin README produced a shallow deck. The new pipeline chains repo.fetch → repo.map → slides.outline → slides.narrate, building the deck from the README plus the repo's docs and code structure (repo.map's symbol map) — a fuller picture of what the project is and how it's built. Same repo_url input; the builtin.repo-readme-narrate id is gone.

[0.18.0] - 2026-05-28

Theme: Pipelines you can see and trust. The deck/narrate pipelines now turn prose into a real multi-slide deck via the new slides.outline pack, so a whole README no longer collapses onto one slide and renders as a degenerate 7-second video. The Management UI also shows which pipelines are running, plus a copy-paste agent prompt for each.

Added

slides.outline pack: restates prose/markdown (a README, a research.deep synthesis, content.ground output) as a structured Marp deck of slides separated by ---, with titles, bullets, and speaker notes, ready for slides.render/slides.narrate. Bounded by a max_slides ceiling and a clamped completion-token budget, it guarantees a multi-slide deck or fails with invalid_input ("content too thin") rather than emitting a degenerate one-slide deck.
Pipelines page (Management UI): live "running" indicators + a per-pipeline "Copy prompt" button. The /pipelines page polls a new GET /api/v1/pipeline-runs (recent runs across all pipelines) and shows a pulsing running badge on any pipeline with an active run, plus an N running header count — so you can see what's executing without expanding each row. Each pipeline also gets a Copy prompt button that copies a ready-to-paste agent prompt (helmdeck__pipeline-run …) with a fill-in line per ${{ inputs.* }} the pipeline declares — generated from the live definition, so it can't drift from the actual inputs.

Changed

Deck & narrate pipelines now structure prose into a real deck before rendering. grounded-deck, research-deck, research-narrate, research-ground-deck, scrape-deck, and repo-readme-narrate used to feed raw prose (a README, a synthesis, grounded text) straight into slides.render/slides.narrate, which split slides only on --- — so prose with no --- collapsed onto a single slide and produced a degenerate ~7-second silent video that still reported succeeded. They now insert a slides.outline step, so a README or synthesis becomes a genuine multi-slide deck — or fails legibly (caller_fixable) when the content is too thin. Podcast pipelines are unaffected (podcast.generate already turns source text into a multi-speaker script).

[0.17.2] - 2026-05-28

Theme: Honest failures. Pipeline runs now attribute failures correctly. A malformed input or a still-booting overlay no longer masquerades as a helmdeck pack_bug you should file an issue for: input problems are caller_fixable, and overlay-backed packs ride out a cold start instead of failing the first call. This release also fixes the v0.17.1 tts_chars schema regression that broke every slides.narrate/podcast.generate run.

Changed

Overlay-backed packs now retry a still-booting service instead of failing on the first hit. research.deep / content.ground / web.scrape (Firecrawl) and doc.parse (Docling) wrap their HTTP round-trip in a bounded cold-start retry: a connection-refused/reset or a 502/503/504 is treated as "still starting" and retried with exponential backoff (4 attempts, ~3.5s worst case). So the first pack or pipeline call from the OpenClaw chat UI after the stack — or an individual overlay — comes up waits a few seconds for readiness instead of surfacing a failed run. Genuine application errors (4xx/500) and successes return immediately and unchanged, so the pipeline failure classifier behaves exactly as before once the service is actually up.

Fixed

slides.narrate / podcast.generate failed every real run with invalid_output: field "tts_chars": expected number, got object (regression from v0.17.1). #299 declared the tts_chars cost-output field as number, but both handlers emit a per-speaker/per-slide breakdown map (with a _total key, see computeTTSChars/computeSlideTTSChars). The engine validates handler output against the declared OutputSchema on every Execute, so the mismatch failed slides.narrate, podcast.generate, and any pipeline using them (e.g. builtin.repo-readme-narrate). Corrected the declaration to object. The unit tests missed it because they call the pack handler directly, bypassing the engine's output validation — a new output-schema contract test now validates each pack's real output against its declared schema, so this class of drift fails in CI.
research.deep reported "no usable sources" as a pack_bug, telling callers to file a GitHub issue for a refine-your-query situation. When a Firecrawl search yields zero usable sources (query too long/narrow/obscure, or every result unscrapable), the pack returned handler_failed — which the pipeline classifier maps to pack_bug — even though the error message itself advised refining the query. It now returns invalid_input, so pipelines (e.g. builtin.research-blog) classify it caller_fixable: shorten/refocus the query and re-run, no issue to file. helmdeck searched fine; the query just didn't match anything usable.
hyperframes.render reported a malformed composition as a pack_bug. A composition missing data-composition-id/data-width/data-height, an unregistered window.__timelines, or an output preset whose orientation doesn't match the composition's dimensions made the hyperframes CLI exit non-zero, which the pack returned as handler_failed → the pipeline classifier (e.g. builtin.html-video) labeled it pack_bug and told callers to file a GitHub issue for "fix your HTML." It now classifies the known caller-input signatures as invalid_input → caller_fixable; genuine render/encode failures (browser crash, ffmpeg error) stay handler_failed.
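The two fixes above rely on the same mapping from a pack's error code to a pipeline failure class. A minimal sketch follows, covering only the two mappings these notes actually state (invalid_input to caller_fixable, handler_failed to pack_bug). The function name and the "unclassified" fallback are placeholders; the real classifier also knows other classes and codes not spelled out here.

```go
package main

import "fmt"

// classify maps a pack error code to a pipeline failure class.
// Only the two cases below come from the changelog; the fallback is a
// placeholder for this sketch.
func classify(errorCode string) string {
	switch errorCode {
	case "invalid_input":
		return "caller_fixable" // fix the inputs and re-run; no issue to file
	case "handler_failed":
		return "pack_bug" // a code error in helmdeck
	default:
		return "unclassified"
	}
}

func main() {
	for _, code := range []string{"invalid_input", "handler_failed"} {
		fmt.Printf("%s -> %s\n", code, classify(code))
	}
}
```

This is why returning the right code at the pack level matters: the class, and therefore the advice shown to the caller, is derived from it.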

[0.17.1] - 2026-05-28

Theme: Fresh-stack reliability — persistent repos, grounded decks, and slide rendering now work on a clean install, and the test-suite gaps that let those bugs ship green are closed (every Docker/integration test now runs in CI, gated against silent skips).

Changed

blog.publish now renders mermaid diagrams to inline SVG server-side (default mermaid: true): mermaid fenced blocks in a markdown body are pre-rendered via mmdc (the same renderer slides.render uses) into <img src="data:image/svg+xml;base64,…">, so diagrams show reliably on Ghost (any theme), in email, RSS, and plain-markdown readers — no client-side MermaidJS required. Set mermaid: false to keep the previous client-render behavior. As a result blog.publish now runs with a session (NeedsSession: true) to reach mmdc — each publish acquires a short-lived sidecar.

Fixed

Persistent repo.fetch (ADR 040) failed with mkdir: cannot create directory '/repos/…': Permission denied. The helmdeck-repos volume was root-owned, but the session sidecar runs as uid 1000 and the control-plane janitor as uid 65532 — neither could create clone directories under it, so every persistent clone (e.g. builtin.repo-readme-narrate/-podcast) failed. The session runtime now makes /repos world-writable on first use (a throwaway root container), with a repos-init compose one-shot as belt-and-suspenders — so it works for any deployment, not just Compose. The persistent clone also runs umask 000 so the janitor (a different uid) can GC clones. Covered by a new Docker integration CI job that runs the //go:build integration suite (which exercises this exact clone-into-/repos path but wasn't previously run in CI).
Pipeline-driven repo.fetch clones all collided in /repos/unknown. StartRun executes on a detached context that dropped the caller subject, so persistent clones weren't namespaced per caller. The runner now threads the caller (StartRun/Rerun carry it and re-attach it via packs.WithCaller), so a pipeline started by alice clones into /repos/alice/… like a direct pack call.
slides.render still clipped tall mermaid diagrams by ~39px in PDF/PPTX — the non-scrolling formats #280's auto-fit was meant to protect. The mermaid cap was max-height: 70vh (504px on a 720px slide), but a slide also carries its heading plus Marp's ~255px section padding, so a top-down diagram + chrome overflowed. Lowered the cap to 60vh, leaving headroom even for a two-line title. The integration suite's geometric overflow check (TestSlidesFit_NoSectionOverflow) had been silently skipping — it required a playwright module the sidecar doesn't ship — so it never caught this; it now runs on Marp's bundled puppeteer-core (the same Chromium that prints the PDF) and asserts zero section overflow, so the clip can't regress unnoticed.
slides.narrate and podcast.generate now declare their cost-transparency outputs (tts_chars, estimated_cost_usd, estimated_cost_breakdown) in their OutputSchema. The handlers already emitted them; declaring them fixes catalog/schema drift so agents and pipeline authors can see and reference the cost fields.
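The numbers in the slides.render clipping fix above can be checked with a few lines of arithmetic. This sketch assumes the ~255px figure is the total slide chrome (heading plus section padding), which is what makes the ~39px overflow come out; the function name is illustrative.

```go
package main

import "fmt"

// overflowPx returns how far a diagram capped at capVH percent of the slide
// height, plus the slide chrome, extends past the slide. A negative result is
// spare headroom.
func overflowPx(slideH, capVH, chromePx int) int {
	return slideH*capVH/100 + chromePx - slideH
}

func main() {
	fmt.Println(overflowPx(720, 70, 255)) // 39: the old 70vh cap overflows
	fmt.Println(overflowPx(720, 60, 255)) // -33: the 60vh cap leaves headroom
}
```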

[0.17.0] - 2026-05-28

Theme: Legible, recoverable failures — agents and operators can tell why a run failed and what to do: actionable model errors with a model catalog to pick from, and pipeline failure attribution with one-call re-run.

Added

helmdeck://models MCP resource (ADR 043): lists the chat-completion models the gateway can route to right now, as full provider/model IDs (e.g. openrouter/minimax/minimax-m2.7). Agents read it to pick a valid model for any pack's model input instead of guessing one that fails. Mirrors helmdeck://voices / helmdeck://image-models. (#293)
Legible pipeline failures + re-run (ADR 044, slice 1): when a pipeline run fails, each failed step is now attributed with a typed error_code, a failure_class — caller_fixable (the inputs/model given were wrong — fix and re-run), pack_bug (a code error in helmdeck — the reason includes a prefilled GitHub issue link to file), transient (environment blip — re-running may work), or state_changed — and a one-line failure_reason saying what to do. Surfaced in GET …/runs/{runId}, the helmdeck__pipeline-run-status tool, and the Management UI /pipelines run view (failure-class badge + "Report bug" link). Plus a one-call re-run: POST /api/v1/pipelines/{id}/runs/{runId}/rerun, the helmdeck__pipeline-rerun tool, and a "Re-run" button. Resume-from-failed-step and auto-retry are the next slice. (#294)
Pipeline run records now list each step's artifacts: a step's produced files (keys/URLs) are captured on the run, so run-status and the /pipelines UI show what each step emitted (previously only the final output JSON was visible). (#292)

Fixed

A bad/unroutable model now returns invalid_input with an actionable hint, not an opaque handler_failed. Calling an LLM pack (content.ground, research.deep, blog.publish prompt mode, web.test) with a model the gateway can't route — e.g. minimax/… when MiniMax is only reachable as openrouter/minimax/… — used to fail as handler_failed: … unknown provider: minimax: unknown provider: minimax (a non-recoverable code, with a doubled message). It now returns invalid_input pointing at the helmdeck://models resource, so the agent retries with a valid model instead of hallucinating another. The doubled message is gone. (#293)

[0.16.0] - 2026-05-27

Theme: Correctness + housekeeping — grounding stops truncating long slide decks, artifacts become deletable on demand, and the email.send pack lands.

Added

email.send pack (helmdeck__email-send): send a transactional email via Resend. Required input to; optional from, subject, html, cc, bcc, reply_to; returns a message_id. Vault credential resend-api-key. Brings the in-tree catalog to 44 packs. (#289)
Prompt-template reference pages at /reference/prompt-templates/: a copy-and-fill {{VARIABLE}} prompt for every built-in pack and pipeline, kept current by a contributor convention. (#288)
Manual artifact deletion: DELETE /api/v1/artifacts/{key} plus a delete (trash) button in the Management UI Artifact Explorer remove a single artifact on demand. Previously the only delete path was the TTL janitor (default 7-day age-out); operators can now reclaim space immediately. Delete is idempotent — a missing key still returns 204. (#290)
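The idempotent delete described above can be sketched with the standard library. The in-memory store and handler names are hypothetical stand-ins for the real artifact store; only the route and the "missing key still returns 204" behavior come from the entry.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
	"sync"
)

// store is a toy in-memory artifact store (assumption for this sketch).
type store struct {
	mu    sync.Mutex
	blobs map[string][]byte
}

func (s *store) remove(key string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.blobs, key) // deleting a missing key is a no-op
}

// newHandler serves DELETE /api/v1/artifacts/{key} and always answers 204,
// whether or not the key existed.
func newHandler(s *store) http.Handler {
	const prefix = "/api/v1/artifacts/"
	mux := http.NewServeMux()
	mux.HandleFunc(prefix, func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodDelete {
			w.WriteHeader(http.StatusMethodNotAllowed)
			return
		}
		s.remove(strings.TrimPrefix(r.URL.Path, prefix))
		w.WriteHeader(http.StatusNoContent)
	})
	return mux
}

// deleteStatus issues a DELETE against the handler and returns the status code.
func deleteStatus(h http.Handler, key string) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodDelete, "/api/v1/artifacts/"+key, nil))
	return rec.Code
}

func main() {
	h := newHandler(&store{blobs: map[string][]byte{"deck.pdf": []byte("x")}})
	fmt.Println(deleteStatus(h, "deck.pdf"), deleteStatus(h, "deck.pdf")) // 204 204
}
```

Answering 204 for a missing key lets clients and retries treat "already gone" as success.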

Fixed

content.ground no longer truncates or drops content during the optional rewrite. The full-document rewrite was hard-capped at 2048 output tokens, so a long input — e.g. a 20–25 slide deck — was silently cut off mid-document and every slide past the cap vanished. The rewrite's completion budget now scales with the input size (capped at 8192 tokens); a response that still hits the token ceiling is discarded in favor of the structure-preserving citation-only version; and the rewrite prompt is instructed to preserve every --- slide separator and slide count. grounded_text is now always present in the output (equal to the input when no claims were grounded), so pipeline steps wiring ${{ steps.<id>.output.grounded_text }} never fail on an unresolved reference. (#290)
builtin.grounded-deck and builtin.research-ground-deck now ground decks with citations only (rewrite: false) rather than a full prose rewrite, which reflowed and collapsed slide structure. Blog-oriented pipelines (grounded-blog, scrape-ground-blog, doc-ground-blog) keep rewrite: true and are protected by the truncation guard above. (#290)
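The content.ground truncation guard above combines two ideas: a completion budget that grows with the input but is capped at 8192 tokens, and a fallback to the citation-only text when the model still runs into the ceiling. In the sketch below only the 8192 cap and the old 2048 figure come from the changelog; using 2048 as a floor, the chars/4 token estimate, the 512-token headroom, and the "length" finish reason (the common OpenAI-style value) are assumptions.

```go
package main

import "fmt"

// rewriteBudget scales the completion budget with input size, clamped to
// [2048, 8192]. The scaling rule is a guess for illustration.
func rewriteBudget(inputChars int) int {
	const floor, ceiling = 2048, 8192
	budget := inputChars/4 + 512 // rough token estimate plus headroom (assumed)
	if budget < floor {
		return floor
	}
	if budget > ceiling {
		return ceiling
	}
	return budget
}

// chooseGrounded discards a rewrite that hit the token ceiling and keeps the
// structure-preserving citation-only version instead.
func chooseGrounded(rewritten, citationOnly, finishReason string) string {
	if finishReason == "length" {
		return citationOnly
	}
	return rewritten
}

func main() {
	fmt.Println(rewriteBudget(1000), rewriteBudget(20000), rewriteBudget(100000)) // 2048 5512 8192
	fmt.Println(chooseGrounded("cut off mid-", "full text [1]", "length"))        // full text [1]
}
```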

[0.15.0] - 2026-05-26

Theme: Pipelines as a first-class resource — a saved, runnable sequence of pack steps any actor can create, run, and watch.

Added

Pipelines (ADR 041): a pipeline is a stored, named, ordered list of pack steps with ${{ steps.<id>.output.<field> }} / ${{ inputs.<name> }} templating and automatic _session_id threading. Ships as a runnable slice — SQLite-persisted definitions + run history, a sequential runner reusing the pack engine, REST CRUD + async run + run-history at /api/v1/pipelines, the helmdeck__pipeline-{list,get,create,run,run-status} MCP tools (so agents create/run pipelines conversationally), ~13 auto-seeded built-in starters (grounded deck/blog, research→{deck,podcast,blog}, scrape→ground→blog, clone-a-repo→narrated-deck/podcast, …), and a Management UI /pipelines panel to list, run with JSON inputs, and watch run status/history poll live. Migration 0007_pipelines.sql (additive). (#283, #284)
podcast.generate now surfaces a presigned audio_url in its output, unlocking a clean podcast.generate → hyperframes.render narrated-video chain (embed the URL in the composition's <audio src>). (#283)
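To make the pipeline templating above concrete, here is a minimal resolver for the two reference forms, `${{ inputs.<name> }}` and `${{ steps.<id>.output.<field> }}`. It is a sketch under assumptions: values are plain strings, step ids and field names contain no dots, and the function name and error text are invented for illustration.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var ref = regexp.MustCompile(`\$\{\{\s*([A-Za-z0-9_.-]+)\s*\}\}`)

// resolve substitutes every ${{ ... }} reference in tmpl. Unresolved
// references are left in place and reported through the returned error.
func resolve(tmpl string, inputs map[string]string, steps map[string]map[string]string) (string, error) {
	var firstErr error
	out := ref.ReplaceAllStringFunc(tmpl, func(m string) string {
		path := strings.Split(ref.FindStringSubmatch(m)[1], ".")
		switch {
		case len(path) == 2 && path[0] == "inputs":
			if v, ok := inputs[path[1]]; ok {
				return v
			}
		case len(path) == 4 && path[0] == "steps" && path[2] == "output":
			if v, ok := steps[path[1]][path[3]]; ok {
				return v
			}
		}
		if firstErr == nil {
			firstErr = fmt.Errorf("unresolved reference %q", m)
		}
		return m
	})
	return out, firstErr
}

func main() {
	inputs := map[string]string{"repo_url": "https://example.com/r.git"}
	steps := map[string]map[string]string{"ground": {"grounded_text": "cited text"}}
	out, err := resolve("clone ${{ inputs.repo_url }} then use ${{ steps.ground.output.grounded_text }}", inputs, steps)
	fmt.Println(out, err)
}
```

A resolver like this fails on any field a step did not emit, which is why a pack that always includes a field in its output (even when empty or unchanged) is easier to wire into a pipeline.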

Fixed

slides.render and slides.narrate no longer clip oversized mermaid diagrams or wide tables off the fixed Marp slide canvas: a theme-independent auto-fit <style> scales diagrams/images down (max-height, object-fit:contain) and shrinks-to-fit tables (table-layout:fixed + wrapping). Applies to PDF/PPTX (which can't scroll) across curated and built-in themes. (#280, #282)

[0.14.0] - 2026-05-26

Theme: Autonomous code-fix (swe.solve) lands end-to-end, the Universal Memory layer and persistent repos ship as seams that are off by default in a bare deployment but enabled by the bundled install, and ADR 037 upstream pinning is fully enforced across every sidecar.

Added

swe.solve — an autonomous code-fix pack. Give it a repo_url + task and it runs a mini-swe-agent loop inside a session sidecar to produce a reviewable change. mode selects the output: patch (diff + trajectory, no push), branch, or pull_request. The agent never sees git or AI-gateway credentials (vault-injected), never pushes to the default branch, and every run uploads a replayable trajectory artifact to the object store. Built on a HelmdeckEnvironment adapter (mini-swe-agent's Environment contract routed through cmd.run). (#265, #271, #233 Phases 1/3/4)
GitHub-issue auto-trigger for swe.solve (ADR 033) — the webhook receiver now handles issues/issue_comment: label an issue and helmdeck opens a PR, then posts the result back as an issue comment. HMAC-verified, label-gated, dispatched on a detached context. (#277, #233 Phase 6)
Universal Memory delivery layer (ADR 039) — an ec.Memory engine seam giving packs transparent, per-caller, namespace-scoped memory, with a declarative read-through cache (Pack.Memory{Cache,TTL}; github.list_issues is the first exemplar) and Context() aggregation. Backed by a pluggable MemoryStore (SQLite default, AES-256-GCM at rest). Memory is durable by default — the installer now generates HELMDECK_MEMORY_KEY. (#272, #278, epic #254: #255/#256/#257/#258/#260)
Persistent repos volume (ADR 040) — repo.fetch (and swe.solve) clone into a per-caller path on a shared helmdeck-repos volume and git fetch instead of re-cloning on a repeat, with a persistent per-language dependency cache (.hdcache) and a GC janitor (TTL + size cap). Default-off (no volume ⇒ ephemeral /tmp); enabled by default in the bundled Compose. New repo.fetch output fields reused/persistent. (#274, #259)
New /reference/agent-memory and repo.fetch persistent-clones docs; the "Clones aren't browser state" design post.
The in-tree pack catalog grows to 43 (adds swe.solve, github.post_comment).

Changed

ADR 037 fully enforced — exact upstream version pins, Dependabot, CLI-surface sentinels, and docs across every sidecar Dockerfile, plus follow-up cleanups (drop marp --stdin, fix --html spec, pin the global playwright-mcp). (#240–#243, #264)
The clients-smoke matrix builds the control-plane from source and its bridge leg is response-driven rather than sleep-timed. (#273)

Fixed

clients-smoke no longer aborts a slow cold-sidecar screenshot via a blind sleep 30 then EOF — it polls for the reply and surfaces a real timeout distinctly. (#273)
The GitHub webhook's async dispatch no longer borrows the request context (cancelled the instant the 200 returns), which would have killed any long-running dispatched pack. (#277)

[0.13.2] - 2026-05-23

Theme: Hot-patch for the v0.13.1 release that shipped without a control-plane image. No code-behavior changes; only the build pipeline that produces the image is unblocked.

Fixed

web/ build now succeeds under Vite 8 + TypeScript 6 + lucide-react 1, restoring the Publish control-plane image step that failed silently on the v0.13.1 tag push. Dependabot PR #247 carried three breaking major bumps in one auto-merged group (Vite 6 → 8, TypeScript 5 → 6, lucide-react 0 → 1), each of which broke the web build. CI never exercised the failure because the CI workflow only builds the Go binary — only the Release workflow builds web/, so the regression was invisible until the v0.13.1 tag fired the release pipeline. Goreleaser binaries, helmdeck-bridge:0.13.1, and @helmdeck/mcp-bridge@0.13.1 on npm shipped fine; only ghcr.io/tosin2013/helmdeck:0.13.1 (the control-plane image) was missing. Three concrete fixes: (a) Vite 8 swapped Rollup for Rolldown, whose manualChunks only accepts the function form, so the declarative chunk-grouping moves to codeSplitting.groups — same two-chunk layout (react + query) preserved; (b) TypeScript 6 removed baseUrl, so paths are relative under ./src/* and a new web/src/vite-env.d.ts (/// <reference types="vite/client" />) restores side-effect CSS module resolution under TS 6's stricter rules; (c) lucide-react 1 dropped brand icons, so the GitHub-PAT preset swaps Github for GitBranch — purely visual, the preset label still names the system. (#250)

[0.13.1] - 2026-05-18

Theme: Post-v0.13.0 cleanup. No feature changes. Four post-release bugs found during v0.13.0 → v0.13.1 upgrade verification, each documented per-issue with a reproducer.

Fixed

repo.fetch now surfaces session_id inside its output (not only on the response envelope), so follow-on packs (fs.*, cmd.run, git.*, repo.push) can find the value adjacent to clone_path. Without this, callers reading only output.clone_path missed the session_id on the envelope, then issued follow-up calls without _session_id, which made the engine spin up a fresh session whose /tmp did not contain the clone — surfacing as silent empty results (fs.list, repo.map) or cannot open errors (fs.read, cmd.run). New internal/packs/builtin/session_reuse_integration_test.go (build-tagged integration) pins the cross-pack session reuse contract against a real Docker daemon so this can't silently regress. (#232)
deploy/compose/.env.example now documents HELMDECK_ELEVENLABS_API_KEY, HELMDECK_FAL_KEY, and HELMDECK_PEXELS_API_KEY. These keys have first-class vault auto-hydration but were absent from the example file an operator copies on first install, so the only way to discover them was via a CHANGELOG entry or a pack's "key not found" error message. (#229)
HELMDECK_PEXELS_API_KEY now auto-hydrates into the credential vault under pexels-key on startup — the v0.13.0 stock.search CHANGELOG advertised this behavior but the entry in internal/vault/hydrate.go was missed. Operators who set the env var no longer have to POST a credential by hand to get the vault rotation/audit story working, and stock.search's credential: input override now resolves through the vault path as documented. (#230)
compose.firecrawl.yml healthcheck for the firecrawl service now probes via node -e instead of wget. The upstream ghcr.io/firecrawl/firecrawl:latest image ships only node (no wget, no curl), so every prior healthcheck invocation hit exit 127 and the container reported unhealthy indefinitely despite serving traffic correctly. Real Firecrawl outages were invisible because the steady-state false negative looked identical to a real failure. (#231)

Changed

Every npm/corepack package installed globally in deploy/docker/sidecar*.Dockerfile is now pinned to an exact ARG <NAME>_VERSION=x.y.z (no @latest, @stable, ^x.y, ~x.y). Affects @playwright/mcp, @mermaid-js/mermaid-cli, pnpm, yarn, typescript, ts-node, eslint, prettier, vitest, and the previously-caret-pinned hyperframes (now exact 0.6.7). T-2 of ADR 037's migration plan; together with the Dependabot config from #240, every pinned dep now has a delivery mechanism for proposed upgrades that runs the full CI matrix. No functional change — same versions, just declared explicitly so a typosquat or yanked release fails the build instead of shipping silently. (#213)
CLI-surface sentinels split into two layers (T-3 of ADR 037). This catches the failure mode that motivated the ADR (an upstream flag rename or typosquat) at the earliest possible point. (#214)
- Layer 1 (docker build-time): each sidecar Dockerfile runs cheap <tool> --version smoke checks after install. A yanked release or missing binary fails the image build before the artifact escapes.
- Layer 2 (CI-time): new internal/packs/builtin/cli_surface_invariant_test.go (build-tagged integration) walks pack source via go/ast to extract every --flag string passed to a known sidecar binary, runs the binary's --help inside the built image, and asserts each extracted flag appears in the help output. The flag list is derived from Go pack source rather than hand-maintained, so adding a flag to a pack's argv automatically gets verified. A structured Skip allowlist handles deliberately-undocumented flags with a reason string. Covers marp (7 flags from slides_render.go) and hyperframes render (4 flags from hyperframes_render.go); adding a new sidecar binary takes one new cliSurfaceCase entry.
- Discovered while building this: slides_render.go passes marp --stdin (silently accepted, not documented; marp reads stdin automatically when piped), and sidecar-entrypoint.sh:85 invokes @playwright/mcp@latest via npx, bypassing the pinned global install. Both are tracked as separate follow-ups.

[0.13.0] - 2026-05-15

Theme: Marketplace beta — discover, install, and run community packs from a signed catalog.

Eight headline threads ship in v0.13.0. The marketplace track (T810 catalog endpoint, T812 install/uninstall REST, T813 /marketplace UI, T814 community repo scaffold) is the headline — operators browse helmdeck-marketplace's catalog from the Management UI or the new helmdeck CLI, install with one click, and run the pack immediately via the new helmdeck-sidecar-marketplace image. Trust ships as stage A (deterministic SHA256 content hash, hard-rejects install on mismatch); stage B (full sigstore keyless cosign-verify) is queued for v1.0 hardening. Alongside marketplace: hyperframes.render for HTML→MP4 short-form video (the bigger lift of the cycle, slotted at issue #200 with a new sidecar image and async render pipeline); stock.search for Pexels-backed stock photography that chains into every other media-output pack via the same feature_image_artifact_key contract image.generate introduced in v0.12.0; slides.render contrast guardrails (docs + lint + curated themes — the WCAG-AA reproducer goes from "render succeeds, slide unreadable" to "render succeeds with explicit warnings the agent can act on"); provider_calls diagnostic columns (job_id + finish_reason + raw_content_len, joining the gateway audit table back to the pack-job that triggered the call in a single SQL query); subprocess pack manifest format (typed I/O schemas via YAML sidecar, completing the v0.12.0 MVP); and the blog.publish artifact-first refactor (Ghost failures now return a partial-success response with the saved markdown instead of losing the expensive prompt-expanded body). Three new ADRs land with the cycle — ADR 034 captures the marketplace design ahead of the implementation, ADR 037 turns the hyperframes-npm-pin incident into a project-wide upstream-version discipline, and ADR 038 explains why marketplace packs route through a dedicated sidecar rather than running in the distroless control plane.

Added

Marketplace trust verification stage A (#30 follow-up) — replaces the structured stub from PR #220 with real deterministic content-hash verification. The installer now computes a stable SHA256 over a pack's non-manifest files (excluding helmdeck-pack.yaml itself to avoid the chicken-and-egg of "the file containing the hash is in the hash"), compares to manifest.trust.sha256, and hard-rejects the install on mismatch (removes the materialized files, returns trust verification failed). Algorithm is platform-deterministic — no tar/gzip non-determinism, no timestamp leakage — so the marketplace's sign.yml workflow can produce the same digest. What stage A catches: handler/data modified between author-sign and install, file rename/add/remove, corrupt downloads. What it doesn't catch (deliberate, documented): a malicious author modifying the manifest itself — that's stage B (full sigstore keyless verification of the signer identity), tracked as a v1.0 hardening item. New trust-note vocabulary surfaces verified hash + declared signed_by in the install response; UI's "Signed (pending)" badge flips to "Signed (verified)" on a passing stage A check. See docs/reference/marketplace/catalog.md §Trust model. 9 new tests cover hash determinism, sensitivity to file change/add/rename, install-rejects-mismatch with cleanup, and the no-sha256-but-signed-by intermediate state.
helmdeck CLI binary (#30 follow-up): an operator-facing CLI that wraps the marketplace REST endpoints from a terminal. Subcommands: pack list (every registered pack), pack marketplace [--refresh] (browse catalog), pack install <name>, pack uninstall <name>, pack installed (marketplace-installed only). Same env-var conventions as helmdeck-mcp: HELMDECK_URL (default http://localhost:3000) + HELMDECK_TOKEN. --json on any subcommand emits the raw response for shell pipelines (helmdeck pack installed --json | jq '.installed[] | .name'). Install output surfaces trust_verified + trust_note so operators see verification status in the terminal. The CLI exits non-zero on errors and preserves the structured error code (pack_not_in_catalog, marketplace_install_disabled, etc.). Ships via goreleaser alongside the existing control-plane + helmdeck-mcp binaries. New file: docs/howto/use-the-helmdeck-cli.md. 16 tests cover env-var resolution, request shape (Authorization header, JSON body, content-type), 4xx envelope preservation, and happy-path dispatch for every subcommand.
Marketplace UI panel + pack-detail endpoint (#31 / T813) — new /marketplace route in the Management UI: browse-by-category chips, free-text search across name/description/tags, pack-detail dialog with input/output schema preview + worked examples + trust badge (Signed / Unsigned), Install / Uninstall buttons with busy state and automatic tools/list cache invalidation, Refresh button, unsigned-pack confirmation dialog per ADR 034. New REST endpoint GET /api/v1/marketplace/packs/{name} returns the catalog entry + full helmdeck-pack.yaml manifest fetched from the marketplace repo on demand — the catalog endpoint deliberately doesn't pre-load every manifest. Sidebar gains a "Marketplace" nav link (Store icon). Operator reference: docs/reference/marketplace/catalog.md §Management UI panel.
Marketplace install / uninstall REST endpoints (#30 / T812) — packs from the marketplace catalog can now be materialized to disk and hot-loaded into the running control plane without a restart. POST /api/v1/marketplace/install resolves a pack from the cached catalog, git clone --depth=1 --filter=blob:none's the marketplace repo, copies packs/<name>/ to HELMDECK_PACKS_DIR (default ~/.helmdeck/packs/<name>/), preserves executable bits, then registers the pack with the live packs.Registry so it appears in tools/list and GET /api/v1/packs immediately. POST /api/v1/marketplace/uninstall reverses it (deregister-then-delete, atomic from the operator's POV). GET /api/v1/marketplace/installed enumerates everything the operator has installed via the marketplace (NOT built-in core packs). command-handler packs only in this beta — builtin / composite / wasm reject with a clear message. Trust verification ships as a structured stub: the response always carries trust_verified + trust_note, the manifest's trust: block flows through end-to-end, but the actual sigstore.dev cosign-verify call lands in a follow-up PR. CLI deferred to its own PR to keep this one review-sized; the REST surface is what T813's UI panel actually depends on. New file: docs/reference/marketplace/catalog.md §Install/uninstall.
Marketplace pack execution via dedicated sidecar (ADR 038, paired with #30): installed marketplace packs run inside a new helmdeck-sidecar-marketplace image (bash + jq + curl + python3 + Node 20 + standard Unix utils) rather than the distroless control-plane process. On each call the pack handler closure uploads the on-disk handler script to the sidecar via ec.Exec, marks it executable (chmod +x), and pipes the pack input to stdin, matching the slides.narrate / hyperframes.render execution model. Manifests can override the sidecar per-pack via a new optional handler.sidecar.image field (heavier toolchains, e.g. image processing, video, ML). Operators override the default globally with HELMDECK_SIDECAR_MARKETPLACE. The image is amd64-only at v0.13.0; multi-arch follows the base sidecar's track. New files: docs/adrs/038-marketplace-pack-execution-via-sidecar.md, deploy/docker/sidecar-marketplace.Dockerfile, .github/workflows/sidecar-marketplace.yml, Makefile sidecar-marketplace-build target.
Marketplace catalog endpoint (#28 / T810) — first slice of the v0.13.0 Marketplace beta. The control plane now fetches a community pack catalog (index.yaml) from HELMDECK_MARKETPLACE_URL (default https://github.com/tosin2013/helmdeck-marketplace) at boot and serves it via two REST endpoints: GET /api/v1/marketplace/catalog returns the cached snapshot, POST /api/v1/marketplace/refresh forces a fresh fetch. A failed refresh preserves the previously-cached catalog so a transient upstream blip doesn't blank the UI. Three source-URL shapes supported: github.com/<owner>/<repo> (auto-translated to raw index.yaml), direct raw URLs, and file:/// for air-gapped operators. Set HELMDECK_MARKETPLACE_DISABLE=1 to turn the endpoints off entirely. New Go types in internal/marketplace/ mirror the JSON Schemas published in the helmdeck-marketplace repo. Read-only in this PR — install/uninstall (#30 / T812) and /marketplace UI panel (#31 / T813) land in follow-up PRs. Operator reference: docs/reference/marketplace/catalog.md. Design: ADR 034.
stock.search built-in pack (#217) — search Pexels for stock photos matching a query, download the top 1-4 results into the artifact store, return their artifact keys + per-photo attribution metadata (photographer, photographer_url, source_url, width, height, alt_text). The output uses the same chained-input contract as image.generate so downloaded stock photos slot straight into slides.render (hero), slides.narrate (hero), blog.publish (feature_image_artifact_key), podcast.generate (cover_image_artifact_key), and hyperframes.render (embedded <img src>). Use stock.search for real photography; image.generate for AI-generated art. Filter knobs: orientation (landscape/portrait/square), size (large/medium/small min-size), color (hex or named). Credential: pexels-key (vault) or HELMDECK_PEXELS_API_KEY (env-var fallback). Free tier: 200 req/hr (see https://www.pexels.com/api/). Engine-pluggable from day 1: only engine: "pexels" ships in v0.13.0; unsplash/pixabay are reserved for community PRs. Photos only; media_type: "video" is a follow-up. See docs/reference/packs/stock/search.md. Pack count: 40 → 41.
slides.render contrast guardrails (#202) — three-pronged fix for "LLM picks a custom palette that produces unreadable slides" (the dark-blue-section-with-default-light-tables reproducer). (A) Docs + agent skill: new "Color contrast best practices" section in docs/reference/packs/slides/render.md + an updated slides.render entry in skills/helmdeck/SKILL.md teach the WCAG-AA 4.5:1 rule and the "override every nested element when you change section { background }" checklist. (B) Static contrast lint: the pack now parses the markdown's frontmatter style: block and embedded <style> tags before render, flagging two anti-patterns — section-background-without-nested-overrides (the reproducer pattern) and wcag-aa-text-contrast (any single rule whose hex color/background-color pair contrasts below 4.5:1). Warnings surface in the response's new warnings: [{rule, selector, recommendation}] array — informational, not errors; the render still succeeds. (C) Curated helmdeck themes: two embedded Marp themes ship with the control-plane binary — helmdeck-dark (slate/sky palette, modern technical look) and helmdeck-corporate (white/blue palette, business deck). Both declare WCAG-AA colors for every nested element type explicitly. The agent picks one via theme: helmdeck-dark in the frontmatter; the pack uploads the embedded CSS to the sidecar and passes --theme-set to marp automatically. Response carries curated_theme_used so callers can confirm the theme applied. Source: internal/packs/builtin/themes/.
hyperframes.render built-in pack (#200) — HTML/CSS/JS composition → deterministic MP4 via Chromium BeginFrame + ffmpeg using the upstream hyperframes CLI, running in the new helmdeck-sidecar-hyperframes image (env override HELMDECK_SIDECAR_HYPERFRAMES; Node 22 + ffmpeg on top of the base sidecar). Sizing surface is composable: resolution (1080p / 4k) × aspect_ratio (16:9 YouTube standard, 9:16 Shorts/TikTok/Reels, 1:1 Instagram feed) resolves to one of six upstream CLI presets (landscape/portrait/square ± -4k). Composition must be authored at the matching aspect ratio — upstream's --resolution flag is an integer-multiple upscale knob, not a dimension setter. Audio handling is mode-free: a composition with no <audio> tag produces a silent MP4; an inline <audio src> produces a narrated MP4 — chain podcast.generate → hyperframes.render by embedding the podcast's presigned audio URL in the composition's <audio src> and the audio track flows through automatically. Short-form only (≤12 min, 512 MiB cap; oversize rejects as CodeHandlerFailed pointing at #201 for the v1.x long-form streaming track). Pack is Async: true, 4 GiB session memory, 60-minute timeout. See docs/reference/packs/hyperframes/render.md, docs/SIDECAR-LANGUAGES.md. Pack count: 39 → 40.
provider_calls diagnostic columns (#183) — three new columns on the gateway audit table for diagnosing failed LLM-backed pack calls in a single SQL query instead of timestamp-matching ts against the job's ended_at: job_id (joins back to the pack job that triggered the call; indexed), finish_reason (provider-reported stop/length/tool_calls/content_filter/…), raw_content_len (bytes in choices[0].message.content after trim — instantly distinguishes "model returned no visible text" from "model returned text the pack couldn't parse"). Migration 0005_provider_calls_diagnostics.sql adds columns via SQLite ALTER TABLE ADD COLUMN (O(1) metadata-only, safe on multi-million-row tables). The async-job runner (internal/mcp/jobs.go) stamps the pack job ID on the dispatch context via the new gateway.WithJobID helper so existing per-pack call sites don't need touching. Existing rows keep NULL job_id / NULL finish_reason / 0 raw_content_len — no backfill required.
Subprocess pack manifest format (#173) — operator-supplied command packs ($HELMDECK_COMMAND_PACKS_DIR) can now declare typed input/output schemas + execution overrides via a sibling <basename>.helmdeck-pack.yaml file. The manifest carries name, version, description, author, input_schema/output_schema blocks (BasicSchema-compatible: string, number, boolean, object, array), timeout_s, max_output_bytes, and an env list. Missing manifest falls back to passthrough (the v0.12.x MVP behavior); malformed manifest skips the pack entirely with an error logged. New how-to: docs/howto/build-subprocess-pack.md.
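
For readers unfamiliar with the 4.5:1 threshold behind the wcag-aa-text-contrast rule in the slides.render entry above, here is a minimal sketch of the standard WCAG 2.x contrast-ratio formula. It is not helmdeck's lint code: the function names are hypothetical, and it assumes colors arrive as six-digit "#rrggbb" hex (the real lint may accept other forms).

```go
package main

import (
	"fmt"
	"math"
	"strconv"
)

// linearize converts one 8-bit sRGB channel to linear light,
// using the piecewise curve from the WCAG 2.x luminance definition.
func linearize(c uint8) float64 {
	v := float64(c) / 255.0
	if v <= 0.03928 {
		return v / 12.92
	}
	return math.Pow((v+0.055)/1.055, 2.4)
}

// relativeLuminance parses "#rrggbb" and returns its WCAG relative luminance (0..1).
func relativeLuminance(hex string) (float64, error) {
	if len(hex) != 7 || hex[0] != '#' {
		return 0, fmt.Errorf("expected #rrggbb, got %q", hex)
	}
	n, err := strconv.ParseUint(hex[1:], 16, 32)
	if err != nil {
		return 0, err
	}
	r, g, b := uint8(n>>16), uint8(n>>8), uint8(n)
	return 0.2126*linearize(r) + 0.7152*linearize(g) + 0.0722*linearize(b), nil
}

// contrastRatio returns (lighter + 0.05) / (darker + 0.05), a value between 1 and 21.
func contrastRatio(fg, bg string) (float64, error) {
	l1, err := relativeLuminance(fg)
	if err != nil {
		return 0, err
	}
	l2, err := relativeLuminance(bg)
	if err != nil {
		return 0, err
	}
	if l1 < l2 {
		l1, l2 = l2, l1
	}
	return (l1 + 0.05) / (l2 + 0.05), nil
}

func main() {
	// Example pair chosen for illustration only: white text on a dark-blue section.
	ratio, err := contrastRatio("#ffffff", "#1e3a8a")
	if err != nil {
		panic(err)
	}
	fmt.Printf("ratio %.2f:1, passes WCAG-AA body text: %v\n", ratio, ratio >= 4.5)
}
```

A rule whose color/background-color pair scores below 4.5 under this formula is the kind of pair the lint would flag as a warning.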

Changed

blog.publish artifact-first refactor (#203) — destination is now optional and defaults to "artifact". When destination="ghost", the pack ALSO saves the post body as an artifact (the safety net) by default; a new also_save_artifact: false input restores the pre-#203 ghost-only behaviour. Ghost failures with the safety net enabled return a partial-success response (status: "artifact_saved_ghost_failed" + ghost_error + artifact_key/artifact_url) instead of a hard error — agents can retry the Ghost step against the saved artifact without paying for prompt expansion again. Strictly additive schema change; existing callers that send destination="ghost" now also see artifact_key/artifact_url/size in the response. See docs/reference/packs/blog/publish.md §Partial success.

[0.12.1] - 2026-05-13

Theme: hot-patch for the v0.12.0 release-image regression + three reliability bugs found within hours of v0.12.0 shipping.

The release-blocker (#180) is the dominant fix: every fresh docker pull ghcr.io/tosin2013/helmdeck:0.12.0 user saw a blank Management UI because the embedded web/dist/index.html referenced asset hashes not present in the image. Root cause was a workflow sequencing bug — the release workflow never ran npm run build before bundling the docker image, so the image baked in whatever stale index.html was last committed. The fix adds a Node + web-build step before docker/build-push-action plus a verify step that fails the release loud if the rebuilt index.html references assets that aren't on disk. Defense in depth: if v0.12.0's release had run this check, the broken image would never have shipped.

The other three are smaller, but each addresses a real operator-visible failure mode introduced (or surfaced) by v0.12.0's content-pack push.

Fixed

Release image's blank Management UI on fresh pulls (#180) — .github/workflows/release.yml now runs cd web && npm ci && npm run build before docker/build-push-action, then verifies that every asset hash referenced from the rebuilt web/dist/index.html exists in web/dist/assets/. Closes #180. Doesn't change web/dist/'s gitignore status — the workflow-step fix is the architecturally correct choice (committing the dist folder would create merge churn on every web/src/ PR).
firecrawl-rabbitmq cold-boot race (#181) — deploy/compose/compose.firecrawl.yml bumps the rabbitmq healthcheck's start_period: 15s → 60s. RabbitMQ's Erlang VM + mnesia init takes 30-60s on alpine; the shorter window exhausted retries before the node was ready → container reported unhealthy → helmdeck-firecrawl (correctly waiting via depends_on: condition: service_healthy) never started → operator had to docker compose up again. 60s aligns with firecrawl-searxng's precedent in the same file. Tutorial note added that firecrawl overlay cold-boot takes ~60-90s. Closes #181.
content.ground truncated-JSON failure mode (#179) — the hard-coded 1024-token completion cap was too tight for the structured claim-plan JSON the extractor returns (~750 tokens for 5 claims left ~270 tokens of headroom; weak models or large posts blew through it). Default bumped to 2048 (~1200 tokens of output budget); new optional max_completion_tokens input on contentGroundInput lets operators raise the cap up to 8192. Over-cap requests now reject with CodeInvalidInput (runaway-cost guard) instead of silently truncating downstream. Closes #179.
content.ground silent degradation when Firecrawl unreachable (#182) — the per-claim grounding loop swallowed callFirecrawlSearch transport errors silently, producing an empty-success "no sources found" output instead of surfacing the underlying reachability issue. Now tracks firecrawlCalls vs firecrawlErrors separately; when 100% of attempted calls hit transport errors, the handler returns CodeHandlerFailed with a message pointing at the firecrawl service URL. Partial-success runs preserved: claims with "search succeeded but no usable source" still land under skipped and the run completes. Mirrors the v0.11 narration contract's fail-loud-on-missing-dependency pattern. Closes #182.
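
The #182 rule (fail only when every attempted Firecrawl call hit a transport error, otherwise keep partial results) can be sketched roughly as follows. This is an illustrative reconstruction, not the handler's code: searchResult, groundSummary, and summarize are hypothetical names, and counting an errored claim as skipped in a partial run is an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

// searchResult is a hypothetical outcome of one per-claim search call.
type searchResult struct {
	sources int   // usable sources found for the claim
	err     error // transport error; nil when the call reached the service
}

// groundSummary is a hypothetical tally of how the claims ended up.
type groundSummary struct {
	grounded int
	skipped  int
}

var (
	errTransport            = errors.New("dial tcp: connection refused")
	errFirecrawlUnreachable = errors.New("handler_failed: every firecrawl call hit a transport error; check the firecrawl service URL")
)

// summarize counts calls and transport errors separately and fails loudly
// only when 100% of the attempted calls errored.
func summarize(results []searchResult) (groundSummary, error) {
	var sum groundSummary
	calls, errs := 0, 0
	for _, r := range results {
		calls++
		switch {
		case r.err != nil:
			errs++
			sum.skipped++ // assumption: errored claims are skipped in partial runs
		case r.sources == 0:
			sum.skipped++ // search succeeded but produced no usable source
		default:
			sum.grounded++
		}
	}
	if calls > 0 && errs == calls {
		return groundSummary{}, errFirecrawlUnreachable
	}
	return sum, nil
}

func main() {
	_, err := summarize([]searchResult{{err: errTransport}, {err: errTransport}})
	fmt.Println("all calls failed:", err)

	sum, err := summarize([]searchResult{{err: errTransport}, {sources: 2}, {sources: 0}})
	fmt.Printf("partial run: %+v err=%v\n", sum, err)
}
```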

Tests

5 new tests in content_ground_test.go — DefaultMaxTokens, MaxCompletionTokensOverride, MaxCompletionTokensOverCap, FirecrawlAllErrors, FirecrawlPartialErrorsSucceed.

Changed

skills/helmdeck/SKILL.md — refreshed catalog (#184). Now correctly advertises 39 packs (was stamped at pre-v0.10.2 commit 24bd0c3 advertising 36 — missing blog.publish, podcast.generate, image.generate). Frontmatter helmdeckVersion bumped to v0.12.0. Brings SKILL.md in line with docs/integrations/SKILLS.md, which was already current.
website/docusaurus.config.ts — sitemap ignores /blog/tags, /blog/tags/**, /blog/archive, /blog/authors to concentrate Google crawl budget on content pages (137 URLs → 122). Filed as SEO follow-up after Search Console reported 61 URLs in "Discovered – currently not indexed" with crawl timestamp 1969-12-31 (never crawled). Pages still render at their URLs — they're just no longer advertised in the sitemap.

[0.12.0] - 2026-05-12

Theme: content-pack image chaining + v1.0 install-path unblocker + pack-authoring MVP.

A bundled release covering four threads that lined up after v0.11.0: chain image.generate into the three content packs (#146, unblocked by v0.11.0's #71); helmdeck://image-models MCP resource (#158, sibling to #146); unified install paths (#134 step 1, P1 blocker for v1.0.0-rc1); and the originally-planned Pack Authoring MVP (T606a UI + T811 subprocess pack type).

The narrative: covers come for free, the install path becomes Kubernetes-ready, and pack-authoring grows up — operators with no Go toolchain can install via pulled images, and pack authors with no Go can ship in any language via subprocess packs.

Added

Content-pack image chaining (#146) — additive convenience syntax across four packs, all backed by a shared RunImageGen entrypoint extracted from internal/packs/builtin/image_generate.go:
- podcast.generate cover_image: bool — auto-generates podcast cover artwork via image.generate; output gains cover_image_artifact_key + cover_image_model_used. Optional cover_image_model override (default fal-ai/flux/schnell).
- slides.render hero_image_prompt: string — auto-generates hero artwork; base64-inlined as <img data:image/png;base64,…> before slide 1 (after Marp frontmatter when present). Inline bytes avoid Marp needing network access inside the sidecar.
- slides.narrate hero_image_prompt: string — same as slides.render but inlined INTO slide 1 (no --- separator) so the per-slide TTS pipeline still sees a populated narrated slide.
- blog.publish feature_image_artifact_key + hero_image: bool — operator-supplied artifact OR auto-generate from the post title. For Ghost destination, uploads via /ghost/api/admin/images/upload/ (multipart, same JWT) then stamps the returned URL into the post's feature_image field. Artifact-mode writes a sidecar <slug>-cover.png.
helmdeck://image-models MCP resource (#158) — mirrors helmdeck://voices (shipped v0.11.0). Curated in-tree catalog of 7 fal.ai models (flux/schnell, flux/dev, flux-pro/v1.1, fast-sdxl, flux-realism, recraft-v3, ideogram/v2) with cost, p50 latency, supports-seed, supports-image-size, max resolution, capability tags, and one-sentence trade-off notes. Backed by new internal/imagemodels package.
fal-key in vault env-hydrate (#158) — closes the consistency gap image_generate.go:74 has advertised since v0.11.0 ("auto-hydrated to vault as 'fal-key' once #142 lands"). HELMDECK_FAL_KEY now imports into the vault under fal-key on startup, same shape as elevenlabs-key.
deploy/compose/compose.build.yaml overlay (#134 step 1) — operators choose between image-mode (just compose.yaml, pulls ghcr.io/tosin2013/helmdeck:${HELMDECK_VERSION:-latest}) and source-build (base + this overlay, builds locally). Compose's deep-merge picks build: when both are present, so the same image: tag becomes the local build's name.
scripts/install.sh --image-mode flag (#134 step 1) — pulls pre-built images instead of building from source. Implies --no-build. Skips host Go / Node / make preflight checks — the path needs only Docker, openssl, curl. Pin reproducible deploys via HELMDECK_VERSION=0.12.0 in .env.local.
Pack Test Runner UI MVP (T606a) — click any pack row in /packs → modal opens with a JSON textarea + Submit. POSTs to /api/v1/packs/{name} and renders the response (duration, cost hint when present, full JSON). Schema-derived form rendering ships in v0.13.0; this MVP unblocks "no UI today."
Subprocess pack type (T811 MVP) — packs.NewCommandPack(name, version, description, inSchema, outSchema, spec) constructor turns any executable into a pack via the stdin-JSON / stdout-JSON protocol. Operator-supplied packs auto-register from $HELMDECK_COMMAND_PACKS_DIR (one pack per executable, named cmd.<basename>). Pack authors can now ship in any language — Python, Node, Bash, Rust — without a Go toolchain dependency.
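
As an illustration of the stdin-JSON / stdout-JSON protocol in the subprocess pack entry above, here is a minimal sketch of a pack executable. The input and output fields are invented for this example, and signalling failure with a non-zero exit plus a message on stderr is inferred from the test list in this release rather than from a published spec.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"os"
	"strings"
)

// input and output are hypothetical schemas for an illustrative "shout" pack.
type input struct {
	Text string `json:"text"`
}

type output struct {
	Shouted string `json:"shouted"`
	Length  int    `json:"length"`
}

// handle reads one JSON document from r and writes one JSON document to w.
func handle(r io.Reader, w io.Writer) error {
	var in input
	if err := json.NewDecoder(r).Decode(&in); err != nil {
		return fmt.Errorf("decode input: %w", err)
	}
	if in.Text == "" {
		return fmt.Errorf("text is required")
	}
	out := output{Shouted: strings.ToUpper(in.Text), Length: len(in.Text)}
	return json.NewEncoder(w).Encode(out)
}

// runString exercises handle without a real pipe, which is handy in tests.
func runString(in string) (string, error) {
	var buf bytes.Buffer
	err := handle(strings.NewReader(in), &buf)
	return strings.TrimSpace(buf.String()), err
}

func main() {
	// The pack runner is assumed to pipe the pack input to stdin and read stdout.
	if err := handle(os.Stdin, os.Stdout); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```

Built as an executable named shout and dropped into $HELMDECK_COMMAND_PACKS_DIR, a binary like this would be expected to register as cmd.shout under the naming rule described above.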

Changed

deploy/compose/compose.yaml is now image-mode by default (#134 step 1) — build: blocks stripped from the base file; control-plane and sidecar-warm pin ghcr.io/tosin2013/helmdeck[-sidecar]:${HELMDECK_VERSION:-latest}. Operators wanting source-build layer in compose.build.yaml via docker compose -f compose.yaml -f compose.build.yaml. The Helm chart (v1.0-rc1) will reuse the same versioned-tag convention.
docs/tutorials/install-cli.md — adds "Pick your install mode" section with side-by-side prerequisites for image-mode (Docker only) vs source-build (Docker + Go + Node + make).
docs/howto/upgrade-helmdeck.md §2 splits into Path A (image-mode) + Path B (source-build) — operators on a fresh box can git clone && ./scripts/install.sh --image-mode and skip the Go toolchain entirely.
SlidesRender(v, eg) signature — was SlidesRender(); now takes vault + egress for RunImageGen access. cmd/control-plane/main.go updated to pass vaultStore, egressGuard.
SlidesNarrate(d, vs, eg) signature — gained third eg parameter for the same reason.

Tests

~50 new tests across the bundle. Highlights:

podcast.generate cover-image happy path + dry-run-skips-cover + model override (3 tests)
slides.render hero-image insertion (after frontmatter / no frontmatter / model override / no-fal-credential fails loud), empty-prompt skips, mermaid-coexistence (5 tests)
slides.narrate hero inlined into slide 1 + dry-run skips (2 tests)
blog.publish artifact + ghost feature-image paths, supplied-key + auto-gen, mutual-exclusion validation (4 tests)
helmdeck://image-models resource list/read/unwired + catalog shape + defensive copy (6 tests)
Subprocess pack via test-binary self-exec: happy path, transform, non-zero exit + stderr, non-JSON stdout, empty stdout, timeout, missing path/binary, raw-binary sniff, OutputSchema vs handler boundary, capped-writer truncation (11 tests)
Subprocess pack dir-loader: empty/nonexistent dir, executable discovery, non-executable skip, basename sanitization (6 tests)

Fixed

image_generate.go:74 consistency gap — the doc string promised fal-key auto-hydration "once #142 lands"; #142 shipped v0.11.0 but the WellKnownEnvCredentials entry was missing. Now added.

Out of scope (slipped to v0.13.0 / v1.0-rc1)

#134 step 2 — the Helm chart itself ships with v1.0-rc1.
T606a schema-derived form — JSON Schema → React form rendering; v0.13.0.
T811 manifest format — typed schemas via YAML sidecar (#173); v0.13.0.
T811 egress sandbox — confine subprocess pack network access (#174); v0.13.0.
arm64 sidecar image — still blocked on Marp's amd64-only upstream tarball.

MCP Registry

The auto-publish workflow (.github/workflows/mcp-registry.yml) republishes the listing on v* tag push. After tagging, verify at https://registry.modelcontextprotocol.io/v0/servers/io.github.tosin2013/helmdeck (expect version: 0.12.0, isLatest: true). Watch for the npm-publish race condition documented in release.yml:118-157 — workflow_dispatch the mcp-registry.yml after npm publish completes if the first run fails with "package not found."

[0.11.0] - 2026-05-10

Theme: podcast/slides UX hardening + onboarding fixes + image generation.

A coherent feature release that addresses 9 issues filed during a v0.10.2 OpenClaw integration: the new content packs work, but their first-run UX assumed you already knew the conventions. Silent MP3s when the credential name is wrong, hardcoded /root/openclaw paths, blocking Go preflight on the docker-only path, no voice discovery, no cost preview — all fixed.

The vault env-hydrate fix (#142) is the load-bearing piece: it root-causes the silent-fallback class of bug, not just the ElevenLabs instance. Pairing #138 (the per-pack contract change) with #142 (the platform fix) closes the bug class.

Added

image.generate pack (#71) — text → image via fal.ai's synchronous fal.run endpoint. Default model fal-ai/flux/schnell (~$0.003/image, 1-3s). 1-4 images per call. The engine input field is reserved so a follow-up community PR can add Replicate without a schema change. Vault credential fal-key (with HELMDECK_FAL_KEY env-var fallback, auto-hydrated). 9 unit tests cover happy path + multi-image + missing credential hard-fail + env fallback + bad engine + 401 surfacing.
Vault env-hydrate (#142) — at control-plane startup, WellKnownEnvCredentials registry auto-imports HELMDECK_*_API_KEY env vars into the vault under their canonical names. Operators who set HELMDECK_ELEVENLABS_API_KEY in .env.local per the README now get a working elevenlabs-key vault entry without a manual POST /vault/credentials call. Wildcard ACL granted on first create. Subsequent restarts respect user-managed entries (metadata.source != "env-hydrate" skips re-upsert). One INFO log per hydration (vault env hydrate ok name=elevenlabs-key host=api.elevenlabs.io).
vault.Store.UpsertByName — sibling to Create. Inserts if absent, rotates ciphertext + refreshes patterns/metadata in place if present. Returns (record, created, error).
helmdeck://voices MCP resource (#143) — exposes the operator's ElevenLabs voice catalog via the same resources/list + resources/read surface as helmdeck://packs and helmdeck://sessions. 1h in-memory cache keyed on the credential's plaintext fingerprint (rotating the key invalidates the cache automatically).
internal/voices/ — new package with ListVoices(ctx, apiKey) → []Voice extracted from slides.narrate's inline pickRandomVoice. Voice exposes voice_id, name, labels (accent/gender/use_case), preview_url, source. Tests use overridable ElevenLabsBaseURL package var.
podcast.generate + slides.narrate per-turn duration floor (#141) — new min_turn_duration_s: number input (default 5). Short TTS turns get padded with trailing anullsrc silence so the output respects a per-segment minimum (matches the slides.narrate house style). Pass min_turn_duration_s: 0 explicitly to opt out and preserve raw TTS pacing.
podcast.generate + slides.narrate dry_run / cost preview (#145) — new dry_run: bool (default false) short-circuits before TTS synthesis and returns the script + per-speaker (or per-slide) tts_chars map + estimated_cost_usd + breakdown. Cost block is also included in regular (non-dry-run) responses. New internal/podcast/cost.go with plan rate table (Free/Starter/Creator/Pro/Scale) and HELMDECK_ELEVENLABS_RATE_PER_CHAR_USD override.
podcast.generate + slides.narrate allow_silent_output opt-in — paired with the #138 contract change below; true activates the (now opt-in) silence-padded fallback for CI smoke tests / demo placeholders.
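
The env-hydrate behaviour in #142 (import on startup, but never overwrite an entry the operator manages) can be sketched as below. This is a simplified in-memory model, not the vault package: record, store, and hydrate are hypothetical, and upsertByName drops the error return and the pattern/metadata handling that the real vault.Store.UpsertByName is described as having.

```go
package main

import "fmt"

// record is a hypothetical, simplified vault entry.
type record struct {
	Secret string
	Source string // stands in for metadata.source; "env-hydrate" marks hydrated entries
}

// store is a toy in-memory vault keyed by credential name.
type store map[string]*record

// upsertByName inserts when the name is absent and rotates the secret in place
// when it is present. It reports whether a new record was created.
func (s store) upsertByName(name, secret, source string) (*record, bool) {
	if existing, ok := s[name]; ok {
		existing.Secret = secret
		existing.Source = source
		return existing, false
	}
	rec := &record{Secret: secret, Source: source}
	s[name] = rec
	return rec, true
}

// hydrate walks a registry of env var -> credential name and imports every
// non-empty value, skipping entries whose source is not "env-hydrate".
func hydrate(s store, wellKnown map[string]string, getenv func(string) string) []string {
	var hydrated []string
	for envVar, name := range wellKnown {
		val := getenv(envVar)
		if val == "" {
			continue
		}
		if existing, ok := s[name]; ok && existing.Source != "env-hydrate" {
			continue // user-managed entry: leave it alone
		}
		s.upsertByName(name, val, "env-hydrate")
		hydrated = append(hydrated, name)
	}
	return hydrated
}

func main() {
	env := map[string]string{"HELMDECK_ELEVENLABS_API_KEY": "sk-from-env"}
	wellKnown := map[string]string{"HELMDECK_ELEVENLABS_API_KEY": "elevenlabs-key"}
	s := store{}
	names := hydrate(s, wellKnown, func(k string) string { return env[k] })
	fmt.Println("hydrated:", names)
}
```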

Changed

podcast.generate + slides.narrate require narration by default (#138) — previously, a missing ElevenLabs credential silently produced a silence-padded artifact with has_narration: false buried in the response, and operators discovered the misconfiguration only by listening to the MP3. Now the packs hard-fail with a typed missing_credential error and an actionable message ("Set HELMDECK_ELEVENLABS_API_KEY in deploy/compose/.env.local..."). Pass allow_silent_output: true to opt back into the silent path. Shared 4-step credential resolver (internal/packs/builtin/elevenlabs_creds.go): explicit credential input → vault elevenlabs-key → vault elevenlabs-api-key (back-compat alias) → os.Getenv("HELMDECK_ELEVENLABS_API_KEY"). Both packs log one INFO line on successful resolve naming the ladder step that matched.
slides.narrate ffmpeg failure surfaces full stderr (#140) — inline error message cap raised from 512 → 4096 bytes. Full stderr (plus the failing command line) persisted to the artifact store as ffmpeg-stderr-segment-NNN.txt / ffmpeg-stderr-concat.txt; the artifact key is referenced from the inline error so operators can fetch the unredacted output via the artifacts API.
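
The 4-step credential ladder from #138 lends itself to a short sketch. The version below is an assumption-laden illustration rather than the contents of elevenlabs_creds.go: the function name, the step labels, and the error text are invented, the vault is modelled as a plain map, and the explicit input is treated as an already-resolved key value.

```go
package main

import (
	"errors"
	"fmt"
)

// errMissingCredential stands in for the typed missing_credential error.
var errMissingCredential = errors.New("missing_credential: no ElevenLabs key found on any ladder step")

// resolveElevenLabsKey walks the ladder in order and returns the key together
// with the name of the step that matched, so the caller can log it.
func resolveElevenLabsKey(explicit string, vault map[string]string, getenv func(string) string) (key, step string, err error) {
	// Step 1: explicit credential input on the pack call.
	if explicit != "" {
		return explicit, "explicit-input", nil
	}
	// Steps 2 and 3: canonical vault name, then the back-compat alias.
	for _, name := range []string{"elevenlabs-key", "elevenlabs-api-key"} {
		if v := vault[name]; v != "" {
			return v, "vault:" + name, nil
		}
	}
	// Step 4: environment variable fallback.
	if v := getenv("HELMDECK_ELEVENLABS_API_KEY"); v != "" {
		return v, "env:HELMDECK_ELEVENLABS_API_KEY", nil
	}
	return "", "", errMissingCredential
}

func main() {
	vault := map[string]string{"elevenlabs-api-key": "sk-legacy"}
	noEnv := func(string) string { return "" }

	key, step, err := resolveElevenLabsKey("", vault, noEnv)
	fmt.Println(key, step, err)

	_, _, err = resolveElevenLabsKey("", nil, noEnv)
	fmt.Println("no credential anywhere:", err)
}
```

Returning the matched step alongside the key keeps the "one INFO line naming the ladder step" behaviour trivial for the caller to implement.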

Fixed

scripts/install.sh blocked --no-build on hosts with old Go (#136) — check_go_version ran unconditionally even with --no-build, failing on Debian/Ubuntu's apt-default Go 1.22. The control-plane Dockerfile builds inside golang:1.26-alpine, so the docker-only path needs no host Go. Wrapped in if [[ "${DO_BUILD}" -eq 1 ]].
scripts/configure-openclaw.sh hardcoded /root/openclaw + over-strict shell-env auth check (#137) — added OPENCLAW_COMPOSE_FILE env override (default unchanged); replaced 3 hardcoded path references. Auth-list die downgraded to warn when the OpenClaw container has OPENCLAW_LOAD_SHELL_ENV=true and <PROVIDER>_API_KEY is set on it (the auth-list probe is a guaranteed false positive in that documented setup path).

Closed as duplicates

#139 (duplicate of #141) and #144 (duplicate of #145) — closed without separate fixes.

Deferred

#146 (chain image.generate into podcast/slide/blog covers) is deferred to a follow-up release. The image.generate pack lands in this release; the integration layer on top of it lands later.

MCP Registry

[0.10.2] - 2026-05-09

A small patch release that ships the MCP Resources surface (closes #44) plus a refined registry-listing description. Functionally additive only; no breaking changes.

Added

MCP Resources (#44) — the MCP server now serves resources/list and resources/read per the 2024-11-05 spec, alongside the existing tools/list / tools/call. Two read-only resources surface today:
- helmdeck://packs — the live pack catalog (every registered pack with its input schema). Equivalent to tools/list as a browsable resource.
- helmdeck://sessions — live session list (id, status, image, created_at). Wired only when the control plane has an active session runtime; safely omitted otherwise.
- The initialize response now declares the resources capability so MCP clients discover the new surface automatically.
- 7 unit tests cover both happy paths, the missing-runtime fallback, the unknown-URI error, lister error propagation, and the capability declaration.
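
To make the resources/list behaviour above concrete, here is a minimal sketch of the result payload as the MCP 2024-11-05 spec shapes it (a resources array whose items carry uri, name, and optional description and mimeType). The Go types, the descriptions, and the application/json MIME type are assumptions for illustration; only the two URIs and the omit-when-no-runtime rule come from the entry above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// resource mirrors the fields of an MCP resource descriptor.
type resource struct {
	URI         string `json:"uri"`
	Name        string `json:"name"`
	Description string `json:"description,omitempty"`
	MimeType    string `json:"mimeType,omitempty"`
}

// listResult is the "result" object of a resources/list response.
type listResult struct {
	Resources []resource `json:"resources"`
}

// listResources always advertises the pack catalog and adds the session list
// only when a session runtime is wired in.
func listResources(hasSessionRuntime bool) listResult {
	res := []resource{{
		URI:         "helmdeck://packs",
		Name:        "packs",
		Description: "Live pack catalog with input schemas",
		MimeType:    "application/json",
	}}
	if hasSessionRuntime {
		res = append(res, resource{
			URI:         "helmdeck://sessions",
			Name:        "sessions",
			Description: "Live session list",
			MimeType:    "application/json",
		})
	}
	return listResult{Resources: res}
}

func main() {
	// Wrap the result in a JSON-RPC 2.0 response envelope, as MCP transports do.
	resp := map[string]any{"jsonrpc": "2.0", "id": 1, "result": listResources(true)}
	out, err := json.MarshalIndent(resp, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```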

Changed

Registry description now reads "Self-hosted MCP server: sandboxed browser, desktop, vision, code-edit packs for any agent." (was "38 capability packs (browser, desktop, vision, repo, fs, slides, podcast) for MCP agents."). Leads with the value proposition + self-hosted differentiator instead of the feature list.
Registry submission script + workflow corrected to point at the search API URL — the registry has no human-facing web UI today, only the metadata API. Was a pre-1.0 documentation bug from the v0.10.1 cycle.

Operator notes

No action required for existing v0.10.1 installs — MCP Resources is purely additive (new methods don't break existing tools/* clients). Upgrade if you want to expose helmdeck://sessions and helmdeck://packs to your agent for browsing.
Out of scope for #44 (deferred): JWT scope filtering on resources, per-MCP-client integration tests. Tracked as follow-ups; the spec implementation is complete and the 7 unit tests cover the surface.

[0.10.1] - 2026-05-09

A patch release that completes helmdeck's listing on the official MCP Registry. The v0.10.0 attempt failed namespace verification because two pieces of metadata weren't yet declared on the published artifacts — this release adds them. Functionally identical to v0.10.0; no pack/API/binary behavior changes.

Fixed

@helmdeck/mcp-bridge npm package now declares mcpName: "io.github.tosin2013/helmdeck" in its package.json. The MCP Registry's npm validator reads this field to confirm the package belongs to the registered namespace; without it, registry submission failed with NPM package '@helmdeck/mcp-bridge' is missing required 'mcpName' field.
ghcr.io/tosin2013/helmdeck-mcp OCI image now carries the io.modelcontextprotocol.server.name="io.github.tosin2013/helmdeck" label. The OCI validator reads this label to confirm namespace ownership; the v0.10.0 image lacked it.

Operator notes

No action required for existing v0.10.0 installs. The bridge binary, control plane, and all 38 packs are unchanged. Skip this release unless you specifically need the registry-listed install path.
Registry entry goes live on tag push. .github/workflows/mcp-registry.yml auto-fires; verify via the search API at https://registry.modelcontextprotocol.io/v0/servers?search=io.github.tosin2013%2Fhelmdeck (the registry is API-only in preview — there is no human-facing web UI; browse downstream aggregators like mcp.so, Glama, and PulseMCP instead).

[0.10.0] - 2026-05-09

A "content packs" release. Two new packs land — blog.publish for posting to Ghost or stuffing markdown/HTML into the artifact store, and podcast.generate for multi-speaker podcast MP3s via a pluggable TTS engine. The capture pipeline ships in-repo, the upgrade procedure is documented for the first time, and the README now opens with the quantified cost-positioning argument the platform earned by shipping the per-pack reference work. Pack count: 36 → 38.

The originally-planned v0.10.0 theme (Pack Authoring + Test Runner) slips to v0.11.0: the work didn't happen this cycle, and the slot was repurposed because the new packs were ready.

Added

blog.publish pack (#68 via #103) — publish to a Ghost installation (live Admin API) OR render markdown/HTML to the helmdeck artifact store. Two body modes (agent-supplied OR prompt+model the pack expands). Goldmark added to go.mod for the markdown→HTML shim. Ghost JWT minted inline via golang-jwt/jwt/v5 (5-min HS256, audience /admin/).
podcast.generate pack (#106) — produce a 1..N speaker podcast MP3 from a script, a prompt, or long-form content (URL/text → LLM converts). Three input modes (script / prompt+model / source_*+model). Five themed system prompts: interview, debate, news-roundup, deep-dive, solo-essay. Day 1: ElevenLabs behind a podcast.Engine interface so future PRs (PlayHT, Hume.ai, Resemble.ai) slot in by adding a new file under internal/podcast/. Vault credential elevenlabs-key (same as slides.narrate); silent-fallback when missing. Optional cover_image_prompt output for downstream image-gen packs.
38 per-pack reference pages at helmdeck.dev/reference/packs — every shipped pack on the agent-first / developer-second template, with live OpenClaw chat-UI transcripts embedded alongside curl developer references. (PR-A #83 + PR-B #95 + PR-C #101.) Closes #51, #53, #54, #55, #56, #58, #59, #60, #61, #62, #63, #64.
OpenClaw transcript capture pipeline at scripts/oc-capture/ (#97 + #104) — three scripts (capture-oc.sh, extract-oc-transcript.py, inject-transcripts.py), a generic capture-batch.sh driver, and prompt files for the three pack-doc clusters.
Cost-positioning blog + long-form reference (#99) — website/blog/2026-05-08-cheap-models-do-frontier-work.md + docs/explanation/why-helmdeck.md with five per-task comparison tables vs. Anthropic Computer Use, OpenAI Operator, Browser-use, Cursor, Aider, Unstructured.io, LlamaParse, Pictory. Includes a "Run the comparison yourself" reproduction recipe + community-contribution invitation.
Operator upgrade documentation at docs/howto/upgrade-helmdeck.md (#107) — pre-flight checklist, in-place Compose-stack upgrade, schema-migration handling, post-upgrade validation, rollback, Kubernetes/Helm path preview.
SKILLS.md gains a "Freshness contract" section (#98) — teaches agents to re-call stateful packs when state may have changed since the last call. Plus per-client "Load the agent skills" subsections for every integration doc (Claude Code via CLAUDE.md, Claude Desktop via Projects, Gemini CLI via GEMINI.md, Hermes via system_prompt_file).
Per-release-checklist additions in docs/RELEASES.md: step 6 (refresh README + cost numbers per release, #100), step 7 (operator upgrade procedure smoke, #107).
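
For context on the Ghost JWT mentioned in the blog.publish entry (5-minute HS256, audience /admin/), the sketch below mints an equivalent token with only the standard library. The pack itself is described as using golang-jwt/jwt/v5, so this is a swapped-in technique for illustration. The id:secret key format with a hex-encoded secret and the kid header follow Ghost's Admin API token documentation as I understand it; the function name is hypothetical.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/base64"
	"encoding/hex"
	"encoding/json"
	"errors"
	"fmt"
	"strings"
	"time"
)

// mintGhostToken builds a compact HS256 JWT valid for five minutes.
// adminKey is assumed to look like "<id>:<hex secret>".
func mintGhostToken(adminKey string, nowUnix int64) (string, error) {
	id, secretHex, ok := strings.Cut(adminKey, ":")
	if !ok || id == "" {
		return "", errors.New("admin key must look like <id>:<hex secret>")
	}
	secret, err := hex.DecodeString(secretHex)
	if err != nil {
		return "", fmt.Errorf("decode secret: %w", err)
	}
	if len(secret) == 0 {
		return "", errors.New("admin key secret is empty")
	}

	// json.Marshal sorts map keys, so the output is deterministic for fixed inputs.
	header, _ := json.Marshal(map[string]string{"alg": "HS256", "typ": "JWT", "kid": id})
	claims, _ := json.Marshal(map[string]any{"iat": nowUnix, "exp": nowUnix + 5*60, "aud": "/admin/"})

	enc := base64.RawURLEncoding
	signingInput := enc.EncodeToString(header) + "." + enc.EncodeToString(claims)

	mac := hmac.New(sha256.New, secret)
	mac.Write([]byte(signingInput))
	return signingInput + "." + enc.EncodeToString(mac.Sum(nil)), nil
}

func main() {
	// Placeholder key for demonstration only.
	token, err := mintGhostToken("0123456789abcdef:00112233445566778899aabbccddeeff", time.Now().Unix())
	if err != nil {
		panic(err)
	}
	fmt.Println("Authorization: Ghost " + token)
}
```

Because the token expires after five minutes, minting it inline per request avoids any need to cache or refresh it.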

Fixed

vision.click_anywhere mechanical loop bug (#102 via #105) — per-step screenshots now genuinely reflect post-action desktop state. Two changes: Step and StepNative thread prior-turn actions into the next user message as textual history, and a 250 ms post-dispatch wait gives Xvfb time to repaint. Same fix applies to vision.fill_form_by_label. Verified live: per-step PNG artifacts now have distinct file sizes between iterations (vs. PR-B baseline where every step's bytes were identical because Xvfb hadn't repainted before scrot fired). However, the model-side completion-detection limitation remains — the model still rarely emits done on real tasks even when the click visibly landed. Tracked separately at #112 for follow-up research (try gpt-4o vs. haiku-4.5, native computer-use schema, two-shot verification). Treat vision.click_anywhere as experimental for production workflows until #112 lands an answer.
repo.fetch empty-remote infinite hang (#94 via #96) — git ls-remote --heads runs first; pack errors fast with invalid_input: remote has no branches; push at least one commit before cloning.
fs.patch Anthropic-edit-shape rejection (#90 via #93) — both {search, replace} and {edits: [{oldText, newText}]} shapes accepted.
doc.parse formats: "markdown" rejection (#91 via #93) — markdown aliases md; both work.
OpenClaw capture pipeline cross-prompt context bleed (#97) — every capture-oc.sh invocation now mints a fresh --session-id. Side-effect: per-call cost dropped ~140× (no 280-event session bloat shipped on every turn).
Vision pack loops now check ctx.Err() (in #105) — cancelled callers exit cleanly instead of spinning to max_steps.
vision.fill_form_by_label parity fix (#105) — now records per-step PNG artifacts (parity with click_anywhere).
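
The fs.patch fix (#90) accepts two input shapes. One way to handle that is to normalize both into a single edit list before applying anything, as sketched below. This is not the pack's implementation: the type and function names are hypothetical, other input fields such as the target path are omitted, and rejecting a request that mixes both shapes is an assumption.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// edit is the normalized form: replace OldText with NewText.
type edit struct {
	OldText string `json:"oldText"`
	NewText string `json:"newText"`
}

// patchInput accepts either {search, replace} or {edits: [...]}.
// Pointers distinguish "field absent" from "empty string".
type patchInput struct {
	Search  *string `json:"search"`
	Replace *string `json:"replace"`
	Edits   []edit  `json:"edits"`
}

// normalizeEdits converts either accepted shape into a slice of edits.
func normalizeEdits(raw string) ([]edit, error) {
	var in patchInput
	if err := json.Unmarshal([]byte(raw), &in); err != nil {
		return nil, fmt.Errorf("invalid_input: %w", err)
	}
	switch {
	case len(in.Edits) > 0 && in.Search != nil:
		// Assumption: mixing the two shapes is treated as ambiguous.
		return nil, errors.New("invalid_input: send either search/replace or edits, not both")
	case len(in.Edits) > 0:
		return in.Edits, nil
	case in.Search != nil:
		repl := ""
		if in.Replace != nil {
			repl = *in.Replace
		}
		return []edit{{OldText: *in.Search, NewText: repl}}, nil
	default:
		return nil, errors.New("invalid_input: no edits supplied")
	}
}

func main() {
	for _, raw := range []string{
		`{"search":"foo","replace":"bar"}`,
		`{"edits":[{"oldText":"foo","newText":"bar"}]}`,
	} {
		edits, err := normalizeEdits(raw)
		fmt.Printf("%+v err=%v\n", edits, err)
	}
}
```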

Changed

Pack count: 36 → 38 (blog.publish + podcast.generate)
README.md opens with the quantified cost-positioning argument ($0.07 Phase 5.5 loop on gpt-oss-120b vs $0.30+ on Sonnet via Cursor) plus a 4-row comparison table; "other 99%" framing kept as the follow-on paragraph
Homepage tagline rewritten from "Self-hosted AI agent platform for small open-weight models" to lead with the cost angle
docs/integrations/SKILLS.md picks up the Freshness contract, expanded "How to load" subsection with per-client instructions, "Blog" and "Podcast" catalog entries, and the pack count bump

Operator notes

Upgrade procedure: git fetch && git checkout v0.10.0 && make sidecars && make install. See /howto/upgrade-helmdeck for the full pre-/post-upgrade checklist.
Schema migrations: auto-applied on store.Open. Cross-version smoke is tracked in #108 (P1).
OpenClaw skill refresh: re-run ./scripts/configure-openclaw.sh after pulling so the new SKILL.md (with podcast/blog entries + Freshness contract) lands in the OpenClaw container.
No breaking changes to existing pack input/output schemas. All Added work is additive; all Fixed items improve observable behavior in agents' favor.
Pre-Kubernetes audit issues filed: #108 (schema-migration cross-version test, P1), #109 (sidecar version pinning, P2), #110 (vault master-key rotation, P2), #111 (cross-version upgrade smoke in CI, P2). All tagged Phase 7; none block v0.10.0.
Known limitation: vision.click_anywhere and vision.fill_form_by_label are experimental — the underlying loop fix in #105 works mechanically (screenshots progress per turn) but the vision model rarely emits done on real tasks. See #112 for the research track. Use at your own risk in production workflows; prefer web.test (Playwright MCP, deterministic) for browser-automation goals where possible.

0.9.0 - 2026-05-07

A "polish + plumbing" release. No new packs and no API changes: the 36 packs from v0.8.0 remain the full surface area. What landed: a fix for an install bug that was breaking first sessions, a public docs site at helmdeck.dev, two community-contributed AI provider adapters, secret scanning in CI, and the planning-doc cross-references that were documented but not implemented at v0.8.0.

Added

Documentation site at helmdeck.dev — Docusaurus 3, Diataxis-organized (Tutorials / How-to / Reference / Explanation), deployed to Vercel with auto-preview on PRs. Search via @easyops-cn/docusaurus-search-local. SEO-tuned for Google Search Console submission: explicit titles, OG social card, robots.txt, sitemap with per-route priority bumps, schema.org/WebSite + FAQPage JSON-LD.
Install tutorials — docs/tutorials/install-cli.md (10-minute walkthrough from git clone to running stack) and docs/tutorials/install-ui-walkthrough.md (panel-by-panel UI tour).
Troubleshooting how-to — docs/howto/troubleshoot-install.md with FAQPage schema covering 10 known sharp edges (502 on first session, GHCR pull failures, lost admin password, etc.).
Per-pack documentation framework — docs/reference/packs/ with template + fully-written browser family (browser.screenshot_url, browser.interact). 12 family-tracking issues opened for community to pick up the remaining 34 packs.
OSS hygiene files at repo root — CHANGELOG.md, SECURITY.md (90-day disclosure window), CODE_OF_CONDUCT.md (Contributor Covenant 2.1).
GitHub priority taxonomy — priority/P0..P3 labels applied to all 39 open issues. P1 cohort (14 items) is the next-release shortlist.
docs/sitemap.xml — documcp-generated source-side sitemap for link audits and search-engine submission tracking, separate from Docusaurus's runtime sitemap.
Custom logo — helm-wheel + H letterform mark, light/dark variants, SVG favicon. Replaced the scaffolded Docusaurus brand assets.
Provider adapters via community PRs — Groq (PR #45 by @Dev-31) and Mistral (PR #47, resolved from @vijit-vishnoi's PR #46) both ride the HELMDECK_{PROVIDER}_API_KEY[_FILE] / _BASE_URL / _MODELS env-var contract introduced for OpenRouter in v0.8.0.
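
As an illustration of the HELMDECK_{PROVIDER}_API_KEY[_FILE] / _BASE_URL / _MODELS contract mentioned in the last entry, the sketch below shows one way an adapter might read those variables. It is not the project's loader: providerConfig and loadProvider are hypothetical names, and two behaviors are assumptions made for the example (the inline key takes precedence over the _FILE variant, and _MODELS is a comma-separated list).

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// providerConfig is a hypothetical holder for the three settings in the contract.
type providerConfig struct {
	APIKey  string
	BaseURL string
	Models  []string
}

// loadProvider reads HELMDECK_<NAME>_* settings through the supplied lookup
// functions, which keeps the logic testable without touching the real environment.
func loadProvider(name string, getenv func(string) string, readFile func(string) ([]byte, error)) (providerConfig, error) {
	prefix := "HELMDECK_" + strings.ToUpper(name) + "_"
	cfg := providerConfig{
		APIKey:  getenv(prefix + "API_KEY"),
		BaseURL: getenv(prefix + "BASE_URL"),
	}
	// Assumption: fall back to the _FILE variant only when no inline key is set.
	if cfg.APIKey == "" {
		if path := getenv(prefix + "API_KEY_FILE"); path != "" {
			raw, err := readFile(path)
			if err != nil {
				return cfg, fmt.Errorf("read %sAPI_KEY_FILE: %w", prefix, err)
			}
			cfg.APIKey = strings.TrimSpace(string(raw))
		}
	}
	// Assumption: _MODELS is comma-separated; empty entries are skipped.
	for _, m := range strings.Split(getenv(prefix+"MODELS"), ",") {
		if m = strings.TrimSpace(m); m != "" {
			cfg.Models = append(cfg.Models, m)
		}
	}
	return cfg, nil
}

func main() {
	cfg, err := loadProvider("groq", os.Getenv, os.ReadFile)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("key set: %t, base URL: %q, models: %v\n", cfg.APIKey != "", cfg.BaseURL, cfg.Models)
}
```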

Changed

Planning docs (RELEASES.md, MILESTONES.md, TASKS.md) are now cross-linked. Every release has a Milestone + Tasks pointer; every milestone has a Ships-in pointer; the v0.8.0 RELEASES section was added (was missing). 19 task IDs that lived in MILESTONES without rows in TASKS got promoted into proper rows.
README's install section links to the new tutorial pages.
Trivy CI scan scope narrowed to scanners: vuln,misconfig. Action pin bumped 0.28.0 → 0.35.0.

Fixed

Install bug — docker compose up -d --build only builds services with a build: clause, so published images (Garage, the GHCR-published sidecar tag) weren't pulled before stack-up. Result: first session calls hung on a 30-second timeout. Fix: new compose_pull step in scripts/install.sh runs docker compose pull --ignore-buildable between sidecar build and compose up, fast-failing on network/proxy issues with an actionable error. The sidecar-warm service no longer swallows pull failures with || true.
CI race in TestBridgeRoundTrip: the test goroutine and the bridge writer shared a bytes.Buffer. The buffer is now wrapped in a sync.Mutex-guarded safeBuffer. Production code unchanged.
vercel.json — cleanUrls: true added so /PACKS resolves to /PACKS.html (matched to Docusaurus's trailingSlash: false).
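
For readers unfamiliar with the pattern behind the CI race fix above, a mutex-guarded buffer looks roughly like the sketch below. The type name safeBuffer comes from the changelog entry; the method set and field names are assumptions, and the test helper in the repository may differ.

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// safeBuffer serializes access to a bytes.Buffer so that one goroutine can
// write while another reads without tripping the race detector.
type safeBuffer struct {
	mu  sync.Mutex
	buf bytes.Buffer
}

// Write implements io.Writer under the lock.
func (b *safeBuffer) Write(p []byte) (int, error) {
	b.mu.Lock()
	defer b.mu.Unlock()
	return b.buf.Write(p)
}

// String returns a snapshot of the buffered contents under the lock.
func (b *safeBuffer) String() string {
	b.mu.Lock()
	defer b.mu.Unlock()
	return b.buf.String()
}

func main() {
	var out safeBuffer
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			fmt.Fprint(&out, "x") // concurrent writers are safe here
		}()
	}
	wg.Wait()
	fmt.Println(len(out.String())) // prints 8
}
```

Running a test that uses such a wrapper with go test -race should no longer report the shared-buffer access.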

Security

Gitleaks secret-scanning CI workflow on every push + PR. Runs via gitleaks/gitleaks-action@v2 with fetch-depth: 0 so the scanner walks full history. Allowlist covers stable dev credentials in deploy/compose/garage.toml (file header already documents these as override-in-production).
serialize-javascript bumped 6.0.2 → 7.0.5 via npm overrides to address GHSA-5c6j-r48x-rmvq (HIGH) and CVE-2026-34043 (MEDIUM). Both shipped as transitive deps in @docusaurus/bundler.

Developer experience

make check target wraps vet + race test + build — exactly what CI's vet + test + build job runs. Plus make install-hooks to wire an opt-in pre-push hook.

0.8.0 - 2026-04-12

Added

36 capability packs total (browser, web, research, slides, GitHub, repo, filesystem, shell, HTTP, document, desktop, vision, language families).
Phase 6.5 validation script (scripts/validate-phase-6-5.sh).
Multi-provider AI gateway adapters: Groq, Mistral.
gitleaks secret-scanning CI workflow with allowlist.

Changed

README leads with the weak-model success story; v0.8.0 + 36-pack catalog refresh.
Trivy CI scan scope narrowed to vuln+misconfig (secrets owned by gitleaks).

0.5.1 - 2026-04-08

Fixed

npm trusted publishing: bump npm + add --provenance so @helmdeck/mcp-bridge releases include attestations.

0.5.0 - 2026-04-08

Added

AES-256-GCM Credential Vault with placeholder-token injection (login, session cookies, API keys, OAuth-with-refresh, SSH/git).
CDP cookie injection at session start.
HTTP gateway intercept-and-substitute for outbound agent traffic.
repo.fetch, repo.push, web.login_and_fetch, web.fill_form, slides.video packs (vault-dependent).
NetworkPolicy egress allowlist + metadata IP / RFC 1918 block.
Sandbox baseline: non-root, drop-all-caps, seccomp.
OpenTelemetry GenAI semantic conventions on every span.
Trivy CRITICAL gate in CI.
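
To make the AES-256-GCM entry above concrete, the following sketch shows authenticated encryption of a single secret with Go's standard library. It is only an illustration: sealSecret and openSecret are hypothetical names, the nonce-prefixed layout is an assumption, and the vault's real storage format, key management, and placeholder-token handling are not described in this changelog.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"errors"
	"fmt"
)

// newGCM builds an AES-256-GCM AEAD; a 32-byte key selects AES-256.
func newGCM(key []byte) (cipher.AEAD, error) {
	if len(key) != 32 {
		return nil, errors.New("AES-256 requires a 32-byte key")
	}
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	return cipher.NewGCM(block)
}

// sealSecret encrypts plaintext and returns nonce || ciphertext || tag.
// Assumption: prefixing the random nonce is just one common layout.
func sealSecret(key, plaintext []byte) ([]byte, error) {
	gcm, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	return gcm.Seal(nonce, nonce, plaintext, nil), nil
}

// openSecret reverses sealSecret and fails if the data was tampered with.
func openSecret(key, sealed []byte) ([]byte, error) {
	gcm, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	if len(sealed) < gcm.NonceSize() {
		return nil, errors.New("sealed value too short")
	}
	nonce, ciphertext := sealed[:gcm.NonceSize()], sealed[gcm.NonceSize():]
	return gcm.Open(nil, nonce, ciphertext, nil)
}

func main() {
	key := make([]byte, 32)
	if _, err := rand.Read(key); err != nil {
		panic(err)
	}
	sealed, err := sealSecret(key, []byte("example-api-key"))
	if err != nil {
		panic(err)
	}
	plain, err := openSecret(key, sealed)
	fmt.Println(string(plain), err) // prints "example-api-key <nil>"
}
```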

0.3.0 - 2026-04-08

Added

MCP registry with stdio/SSE/WebSocket transports.
Built-in MCP server auto-derived from the pack catalog.
helmdeck-mcp bridge binary distributed via Homebrew, Scoop, npm (@helmdeck/mcp-bridge), GHCR OCI image, and signed GitHub Releases.
CI smoke matrix verifying browser.screenshot_url from Claude Code, Claude Desktop, OpenClaw, and Gemini CLI.

Fixed

release.yml: gate binary jobs to push events only.

0.2.0 - 2026-04-08

Added

OpenAI-compatible /v1/chat/completions and /v1/models.
Provider adapters: Anthropic, Gemini, OpenAI, Ollama, Deepseek.
Encrypted key store with rotation API.
Fallback routing rules (rate-limit / error / timeout triggers).
Pack Execution Engine with input/output schema validation.
Typed error code enforcement (closed set per pack).
Pack registry with versioned dispatch.
Three reference packs: browser.screenshot_url, web.scrape_spa, slides.render.
Object store integration with signed-URL artifacts.
A2A Agent Card at /.well-known/agent.json.
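
The fallback routing entry above names three triggers (rate-limit, error, timeout). A rough sketch of how such rules could drive provider selection is shown below; every identifier here (provider, rule, route, and the sentinel errors) is hypothetical, and the gateway's actual rule format is not documented in this changelog.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical sentinel errors standing in for upstream failure classes.
var (
	errRateLimited = errors.New("rate limited")
	errTimeout     = errors.New("timeout")
)

// provider is a hypothetical adapter: a name plus a completion call.
type provider struct {
	name string
	call func(prompt string) (string, error)
}

// rule says which failure classes allow falling through to the next provider.
type rule struct {
	onRateLimit, onTimeout, onError bool
}

// matches reports whether err should trigger a fallback under this rule.
func (r rule) matches(err error) bool {
	switch {
	case errors.Is(err, errRateLimited):
		return r.onRateLimit
	case errors.Is(err, errTimeout):
		return r.onTimeout
	default:
		return r.onError
	}
}

// route tries providers in order and returns the name and output of the
// first success, stopping early when a failure does not match the rule.
func route(chain []provider, r rule, prompt string) (string, string, error) {
	lastErr := errors.New("no providers configured")
	for _, p := range chain {
		out, err := p.call(prompt)
		if err == nil {
			return p.name, out, nil
		}
		lastErr = err
		if !r.matches(err) {
			break
		}
	}
	return "", "", lastErr
}

func main() {
	chain := []provider{
		{"primary", func(string) (string, error) { return "", errRateLimited }},
		{"secondary", func(p string) (string, error) { return "echo: " + p, nil }},
	}
	name, out, err := route(chain, rule{onRateLimit: true}, "hi")
	fmt.Println(name, out, err) // prints "secondary echo: hi <nil>"
}
```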

Hardware exit gate met

≥90% success rate on browser.screenshot_url and web.scrape_spa against MiniMax-M2.7 and Llama 3.2 7B.

0.1.1 - 2026-04-07

Fixed

sidecar.yml: publish amd64 only until Marp ships an arm64 tarball.

0.1.0 - 2026-04-07

Added

Go control plane binary (Gin + chromedp + Docker SDK).
Browser sidecar image with Chromium, Marp, Tesseract, ffmpeg, xdotool, Xvfb, XFCE4, noVNC.
Ephemeral session lifecycle (POST /api/v1/sessions … DELETE /api/v1/sessions/{id}).
CDP REST endpoints: navigate, extract, screenshot, execute, interact.
JWT bearer auth on every endpoint.
Audit log (write-only).
Single-node Compose deployment (deploy/compose/compose.yaml).
make smoke end-to-end harness in CI.
