Changelog
All notable changes to helmdeck are documented here. The format follows Keep a Changelog 1.1.0 and this project adheres to Semantic Versioning starting at v1.0.0; pre-1.0 minor versions may break compatibility (documented per release).
For the forward-looking release plan — what is targeted for upcoming versions
and the hard exit gates for each — see
docs/RELEASES.md.
Unreleased
[0.29.10] - 2026-06-22
Theme: "Error-path findings extraction — empirical loop empirically closes."
Single-PR hot-fix following the v0.29.9 BYO empirical test. v0.29.9 shipped the findings-memory architecture (data layer + projection + compose prompt injection); the first run on v0.29.9 surfaced a subtle but load-bearing bug — when a validation pack errors with output (lint-strict-mode's standard contract), Engine.Execute's post-handler short-circuit dropped the output before findings extraction could see it. Without this fix, the findings-memory loop never closes on Tier C runs because strict validation packs always error. v0.29.10 fixes the engine to capture handler output into a closure-visible variable before the error short-circuit, so audit-row findings get extracted from BOTH success-path AND error-path output. Regression test simulates the exact lint-strict pattern (pre-fix=0 findings, post-fix=2 findings).
Operator upgrade: clean — single engine-internal change. After tag push, pull ghcr.io/tosin2013/helmdeck:0.29.10, restart control-plane. Sidecar-hyperframes is unchanged from v0.29.8. No operator-side actions required beyond the upgrade. Empirical follow-through: re-run the BYO pipeline test ONCE — confirm /api/v1/memory/defaults?caller=openclaw-configure now returns common_findings populated from the run; re-run a SECOND time — confirm the compose-prompt findings prefix appears in the next lint sidecar's findings (ideally with fewer/different codes). That's the two-run validation closing today's iteration arc.
Fixed
- Findings extraction on the error path: pack handlers that return BOTH a structured output AND a PackError (e.g. hyperframes.lint in strict mode, whose standard contract returns the findings JSON + CodeArtifactFailed) now correctly land findings in the audit row. Surfaced empirically minutes after v0.29.9 deployed: re-ran the BYO pipeline, lint failed with artifact_failed, the lint sidecar artifact contained findings, but /api/v1/memory/defaults showed common_findings: 0. Root cause: Engine.Execute's post-handler short-circuit if err != nil { return nil, wrap(err) } returns nil as the result, and the audit deferred-closure was reading result.Output, so the output blob that the handler had written got dropped before findings extraction ran. The error path is exactly where we WANT findings recorded (the lint pack's strict-mode contract is "emit findings + error"). Fix: declare var handlerOutput json.RawMessage at the top of Execute, assign it right after safeInvoke returns (BEFORE the error short-circuit), pass it to writePackAudit from the closure. One regression test simulates the lint-strict pattern (return findings JSON + CodeArtifactFailed) and asserts the audit row's Findings field carries the 2 codes; pre-fix this was 0, post-fix it's 2. 1672 tests pass across packs + api + memory + packs/builtin. Closes the empirical gap surfaced by the first post-v0.29.9 BYO run; without this, the findings-memory loop never closes on Tier C runs because validation packs in strict mode always error.
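The control-flow change described in this entry can be sketched in a few lines of Go. This is a minimal illustration, not the engine's code: the types Result and auditRow, the function execute, and the strictLintHandler stand-in are all hypothetical names, and the real Execute, safeInvoke, and writePackAudit signatures are not shown in this changelog.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// Result and auditRow are simplified, hypothetical stand-ins for engine types.
type Result struct{ Output json.RawMessage }

type auditRow struct {
	Output json.RawMessage
	Err    string
}

// strictLintHandler mimics a strict-mode validation pack: it returns
// a findings payload AND an error in the same call.
func strictLintHandler() (json.RawMessage, error) {
	out := json.RawMessage(`{"findings":[{"code":"missing_local_asset"},{"code":"gsap_studio_edit_blocked"}]}`)
	return out, errors.New("artifact_failed")
}

// execute sketches the fixed flow: the handler output is copied into a
// variable the deferred audit closure can see before the error return.
func execute(handler func() (json.RawMessage, error), audits *[]auditRow) (res *Result, err error) {
	var handlerOutput json.RawMessage

	defer func() {
		row := auditRow{Output: handlerOutput}
		if err != nil {
			row.Err = err.Error()
		}
		*audits = append(*audits, row)
	}()

	out, herr := handler()
	handlerOutput = out // assigned BEFORE the error short-circuit
	if herr != nil {
		return nil, fmt.Errorf("pack failed: %w", herr)
	}
	return &Result{Output: out}, nil
}

func main() {
	var audits []auditRow
	_, err := execute(strictLintHandler, &audits)
	fmt.Println("error returned:", err != nil)
	fmt.Println("audit row kept output:", len(audits[0].Output) > 0)
}
```

The deferred closure reads the named return value err and the captured handlerOutput, so the audit row sees both the failure and the payload even though the caller receives a nil result.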
[0.29.9] - 2026-06-22
Theme: "Empirical-reinforcement loop closes + admin observability."
Same-day continuation of the 24-hour BYO empirical-iteration cycle. v0.29.4 shipped the pre-render validation suite; v0.29.6 → v0.29.8 fixed the infrastructure bugs surfaced by running it (operator-uploads visibility, S3 Get URL, memory forget bypass-decrypt, sidecar pin). With infrastructure clean, the first complete BYO run produced real LLM-output findings (missing_local_asset, gsap_studio_edit_blocked, timeline_track_too_dense) — exactly the antipatterns the helmdeck-hyperframes-authoring skill documents. The skill is in-context; the LLM ignored it. v0.29.9 closes that gap with three architectural additions:
- #572: PackAudit.Findings + BuildDefaults.CommonFindings. Every pack audit row now carries structured findings; aggregation surfaces them as a per-caller frequency-ranked list via /api/v1/memory/defaults and the MCP helmdeck://my-defaults resource.
- #573: hyperframes.compose injects top-N common findings into its system prompt on every run. Empty findings → zero token cost. Empirical "you did X N times" beats abstract rules, with the biggest lift on Tier C models.
- #571: Routing Memory page gains a caller selector for admins so operators can inspect what their agents have been doing (not just their own admin activity).
Architecture writeup (draft, ships once empirically validated): 2026-06-22-findings-memory-empirical-reinforcement.md covers the three generalizable takeaways — empirical signal beats abstract rules for weak models, loop closes at prompt layer (not fine-tune time), validator rule codes should be load-bearing in both the gate AND the generator's prompt.
Operator upgrade: clean — no schema migrations, no removed packs. Additive across the board. New PackAudit.Findings field is optional (omitempty); existing audit rows with no findings remain valid. New CommonFindings array on defaults is omitempty — clients ignoring it see no change. The compose-prompt findings prefix is conditional on ec.Memory != nil AND non-empty findings — deployments without memory wired (default without HELMDECK_MEMORY_KEY) see zero change. After tag push, pull ghcr.io/tosin2013/helmdeck:0.29.9, restart control-plane, re-run ./scripts/configure-openclaw.sh. Sidecar-hyperframes is unchanged from v0.29.8. Empirical validation pending: the BYO pipeline test on v0.29.9 should produce fewer (or different) lint findings than the same test on v0.29.8, because the compose step now incorporates the prior-run findings as constraints. If the same codes recur, slice 4's prompt-template phrasing needs tuning (the static code→guidance map follow-up).
Added
- Findings-memory: agent USES the data (#570 slice 4, compose prompt injection). The hyperframes.compose pack now reads its caller's CommonFindings via ec.Memory.List(AuditKeyPrefixPack) + ProjectDefaults, and appends a "FINDINGS FROM YOUR PRIOR RUNS" section to the system prompt before dispatching to the LLM. Closes the empirical-reinforcement loop: slices 1+2 (the data plumbing) record every lint/inspect/validate finding the agent has produced; slice 4 (this PR) feeds those back into the next compose call so the LLM sees concrete antipattern counts ("missing_local_asset seen 2 time(s), severity=error") alongside the abstract authoring-rules system prompt. Empty findings → empty prefix → zero token cost for new callers (auto-tunes per caller). Capped at composeFindingsTopN = 10 so the prefix tops out at ~300 tokens (negligible against the multi-thousand-token compose prompt). Prefix lives in the SYSTEM message rather than the user message so the LLM gateway / OpenRouter can cache the per-caller system half across requests and only the description varies per call. Tier coverage: helps all tiers. Tier A (claude-sonnet, gpt-4) gets marginal reinforcement of rules they already mostly follow; Tier B (llama-3-70b, etc.) gets meaningful gap-closing on specific recurring failures; Tier C (gpt-oss-120b:free, gemma-9b) gets the highest lift because the empirical "you did X N times" signal cuts through the abstract-rule-ignoring drift these models exhibit. 6 new sub-tests cover the empty-memory + nil-memory + audits-without-findings → no-prefix paths, the empirical-data path (simulates the 2026-06-22 BYO lint findings → confirms missing_local_asset appears with count=2 + gsap_studio_edit_blocked appears with count=1), the hard-constraint closing line, and the topN cap. 1671 tests pass across packs + api + memory + packs/builtin. Empirical validation pending: next BYO test run on this code should show the agent avoiding the prior-run failure modes; if it still hallucinates the same codes, slice 4's prompt-template phrasing needs tuning (possible follow-up: add a static code→human-guidance map for the most common codes).
- Findings-memory layer (#570 slices 1+2, data plumbing). The engine now records structured rule-violation findings on every pack audit row, and BuildDefaults (used by /api/v1/memory/defaults + the MCP helmdeck://my-defaults resource) aggregates them into common_findings so the agent can see which validation findings keep recurring across runs. Closes the gap surfaced by the first empirical BYO test (2026-06-22): the lint pack emitted missing_local_asset, gsap_studio_edit_blocked, timeline_track_too_dense against the LLM's authored composition (exactly the antipatterns the helmdeck-hyperframes-authoring skill documents) and there was no mechanism for the agent to learn from those failures between runs. Three changes: (1) extend PackAudit in internal/packs/audit.go with a terse Findings []AuditFinding slice (code + severity + file only; verbose message/fixHint/snippet stay in the pack's sidecar artifact). Capped at maxAuditFindings = 50 so a single dense run can't monopolize the audit budget. (2) extractFindings(output) heuristically pulls findings from THREE recognized output shapes: top-level {"findings": [...]} (any pack), nested {lint: {findings: [...]}} (hyperframes.lint), and {inspect: {issues: [...]}} / {validate: {errors: [...], warnings: [...]}} (the other two validation packs). Each finding is normalized to {code, severity, file}; entries without a code are skipped. Both code and severity/level field-name variations are tolerated (lint uses severity; validate uses level). (3) BuildDefaults adds CommonFindings []CommonFinding to the projection: group-by-code aggregation with OccurrenceCount, LastSeenUnix, and pack-attribution from the most-recent occurrence. Sorted busiest-first, capped at DefaultsFindingsTopN = 20. 14 new sub-tests cover the extraction (all three output shapes; code-missing skip; bad-JSON returns nil; row-cap enforcement) and the projection (3 distinct codes across 3 runs sorted correctly; cross-pack aggregation; empty input; top-N truncation). 1665 tests pass across 4 packages. Slices 3+4 (UI surface + compose-prompt injection, i.e. the agent actually READING the common_findings) follow as separate PRs.
- Routing Memory page gains a caller selector for admins (#569). Closes the operator-visible gap surfaced 2026-06-22: an operator logged in as admin cleared Routing Memory then ran BYO pipeline tests via OpenClaw, saw "No history yet" on the page. The data was being recorded correctly; it just landed under openclaw-configure (the JWT subject minted by configure-openclaw.sh for the MCP bridge), not under the operator's own admin subject. ADR 047's per-caller isolation is correct multi-tenant design, but the UI had no affordance for admins to inspect what their agents had been doing. Three changes: (1) new MemoryStore.ListNamespaces method on both InMemory + SQLite implementations: returns distinct caller namespaces + row counts, sorted busiest-first, never decrypts (raw column read). (2) new GET /api/v1/memory/callers endpoint backed by ListNamespaces, admin-gated so non-admin operators see only their own caller (defense in depth for the per-caller isolation contract). (3) GET /api/v1/memory/defaults accepts ?caller=<name> query param; admin scope required to override, non-admins see their own scope regardless of the param. UI: dropdown above the existing Refresh/Clear buttons, populated from /api/v1/memory/callers, only renders when more than one caller exists; selecting another caller re-fetches /api/v1/memory/defaults?caller=<selected> and the three sections (Recent activity / Learned pack defaults / Learned pipeline defaults) repopulate. 6 new sub-tests cover ListNamespaces on both backends (busiest-first ordering + empty-after-drain), the callers endpoint (empty store + non-admin filter), and the defaults endpoint's admin-only override gate (regression guard for the per-caller-isolation contract). Empirical use case: when your OpenClaw agent is running BYO pipelines, switch the dropdown to openclaw-configure to see exactly what it's been doing (the same audit history the agent reads as defaults).
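The extraction and aggregation steps in the findings-memory entries above can be approximated as follows. This is a simplified sketch under stated assumptions: the payload struct, rawFinding, and commonFindings are hypothetical names, LastSeenUnix and pack attribution are omitted, and the real extractFindings may parse the shapes differently.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// AuditFinding is the terse per-row shape: code + severity + file only.
type AuditFinding struct{ Code, Severity, File string }

// rawFinding tolerates both "severity" (lint) and "level" (validate).
type rawFinding struct {
	Code     string `json:"code"`
	Severity string `json:"severity"`
	Level    string `json:"level"`
	File     string `json:"file"`
}

// payload covers the recognized output shapes in one struct (a simplification).
type payload struct {
	Findings []rawFinding `json:"findings"`
	Lint     struct {
		Findings []rawFinding `json:"findings"`
	} `json:"lint"`
	Inspect struct {
		Issues []rawFinding `json:"issues"`
	} `json:"inspect"`
	Validate struct {
		Errors   []rawFinding `json:"errors"`
		Warnings []rawFinding `json:"warnings"`
	} `json:"validate"`
}

const maxAuditFindings = 50

func extractFindings(output []byte) []AuditFinding {
	var p payload
	if err := json.Unmarshal(output, &p); err != nil {
		return nil // bad JSON yields no findings
	}
	raws := append([]rawFinding{}, p.Findings...)
	raws = append(raws, p.Lint.Findings...)
	raws = append(raws, p.Inspect.Issues...)
	raws = append(raws, p.Validate.Errors...)
	raws = append(raws, p.Validate.Warnings...)

	var out []AuditFinding
	for _, r := range raws {
		if r.Code == "" {
			continue // entries without a code are skipped
		}
		sev := r.Severity
		if sev == "" {
			sev = r.Level
		}
		out = append(out, AuditFinding{Code: r.Code, Severity: sev, File: r.File})
		if len(out) == maxAuditFindings {
			break // row cap
		}
	}
	return out
}

type CommonFinding struct {
	Code            string
	OccurrenceCount int
}

// commonFindings groups by code, sorts busiest-first (ties by code), caps at topN.
func commonFindings(runs [][]AuditFinding, topN int) []CommonFinding {
	counts := map[string]int{}
	for _, run := range runs {
		for _, f := range run {
			counts[f.Code]++
		}
	}
	out := make([]CommonFinding, 0, len(counts))
	for code, n := range counts {
		out = append(out, CommonFinding{Code: code, OccurrenceCount: n})
	}
	sort.Slice(out, func(i, j int) bool {
		if out[i].OccurrenceCount != out[j].OccurrenceCount {
			return out[i].OccurrenceCount > out[j].OccurrenceCount
		}
		return out[i].Code < out[j].Code
	})
	if len(out) > topN {
		out = out[:topN]
	}
	return out
}

func main() {
	run := extractFindings([]byte(`{"lint":{"findings":[{"code":"missing_local_asset","severity":"error","file":"index.html"}]}}`))
	for _, cf := range commonFindings([][]AuditFinding{run, run}, 20) {
		fmt.Printf("%s seen %d time(s)\n", cf.Code, cf.OccurrenceCount)
	}
}
```

Decoding into one permissive struct keeps the sketch short; a payload whose recognized keys hold unexpected types fails to decode and yields no findings, which matches the "bad JSON returns nil" behavior only loosely.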
[0.29.8] - 2026-06-22
Theme: "Validation-suite sidecar pin hot-fix — third BYO empirical iteration."
Same-day follow-up to v0.29.7's two fixes (S3 Get URL + memory forget bypass-decrypt). v0.29.7's BYO pipeline test reached the lint step (compose finally resolving the artifact URL post-#564) — then died with handler_failed: hyperframes lint emitted no JSON (exit 127). Exit 127 is bash for "command not found." The v0.29.4 hyperframes.{lint,inspect,validate} packs set NeedsSession: true but forgot to pin SessionSpec.Image, so the session executor spawned them into the default base sidecar (helmdeck-sidecar:latest) which doesn't have the hyperframes CLI on PATH. v0.29.8 ships #567's pin (same convention hyperframes.render has used since v0.13.0, including the HELMDECK_SIDECAR_HYPERFRAMES env override). Operator-visible: builtin.byo-audio-narrated-video now actually completes the lint → inspect → validate gates instead of exit-127-ing at lint. Third hot-fix-from-empirical-testing in 24 hours.
Operator upgrade: clean — single backend code change in three packs. After tag push, pull ghcr.io/tosin2013/helmdeck:0.29.8, restart control-plane. Sidecar-hyperframes is unchanged from v0.29.7. No operator-side actions required beyond the upgrade. After it lands, re-run any failed BYO pipeline test — the lint gate should now find the upstream CLI and either pass (clean composition) or produce structured findings (which is the publish-gate working as designed, not a regression).
Fixed
- hyperframes.{lint,inspect,validate} packs (v0.29.4 pre-render validation suite) now pin SessionSpec.Image to hyperframesSidecarImage(), matching hyperframes.render's pattern. Surfaced empirically the same day v0.29.7 shipped: a builtin.byo-audio-narrated-video run failed at the lint step with handler_failed: hyperframes lint emitted no JSON (exit 127). Exit code 127 is bash for "command not found": the lint pack was being spawned into the default base sidecar (helmdeck-sidecar:latest) which doesn't have the hyperframes CLI on PATH. The render pack pins the right image via Image: hyperframesSidecarImage() in its SessionSpec; my v0.29.4 lint/inspect/validate packs set NeedsSession: true but forgot the image pin, so the session executor used its default. Fix: add the same SessionSpec.Image = hyperframesSidecarImage() pin to all three packs, plus sensible MemoryLimit/Timeout/CPUProfile defaults per pack's compute shape (lint: 1g/5min/IO; inspect: 2g/10min/Compute since it loads in headless Chrome with at_transitions sampling; validate: 2g/5min/Compute since it boots Chrome + DevTools console). Operator-visible effect: the builtin.byo-audio-narrated-video pipeline now actually completes the lint→inspect→validate gates instead of exit-127-failing at lint. Both the HELMDECK_SIDECAR_HYPERFRAMES env override + the default pinned image are honored, matching render's behavior.
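As a rough illustration of the pin plus env override, the sketch below resolves the image once and applies it to all three packs with the per-pack limits listed in this entry. The SessionSpec field types, the resolveImage and validationSpecs helpers, and the default image tag are assumptions for illustration; only the HELMDECK_SIDECAR_HYPERFRAMES variable name and the limit values come from the entry itself.

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// defaultHyperframesImage is a placeholder; the real pinned tag is not
// stated in this changelog.
const defaultHyperframesImage = "helmdeck-sidecar-hyperframes:example"

// resolveImage returns the override when set, else the pinned default.
func resolveImage(override string) string {
	if override != "" {
		return override
	}
	return defaultHyperframesImage
}

// hyperframesSidecarImage honors the HELMDECK_SIDECAR_HYPERFRAMES override.
func hyperframesSidecarImage() string {
	return resolveImage(os.Getenv("HELMDECK_SIDECAR_HYPERFRAMES"))
}

// SessionSpec is a simplified stand-in for the real session spec type.
type SessionSpec struct {
	Image       string
	MemoryLimit string
	Timeout     time.Duration
	CPUProfile  string
}

// validationSpecs pins the same image on all three validation packs.
func validationSpecs(image string) map[string]SessionSpec {
	return map[string]SessionSpec{
		"hyperframes.lint":     {Image: image, MemoryLimit: "1g", Timeout: 5 * time.Minute, CPUProfile: "IO"},
		"hyperframes.inspect":  {Image: image, MemoryLimit: "2g", Timeout: 10 * time.Minute, CPUProfile: "Compute"},
		"hyperframes.validate": {Image: image, MemoryLimit: "2g", Timeout: 5 * time.Minute, CPUProfile: "Compute"},
	}
}

func main() {
	for name, spec := range validationSpecs(hyperframesSidecarImage()) {
		fmt.Printf("%s -> %s (%s, %s)\n", name, spec.Image, spec.MemoryLimit, spec.Timeout)
	}
}
```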
[0.29.7] - 2026-06-22
Theme: "BYO empirical-test recovery — two same-day-surfaced production blockers fixed."
Same-day follow-up to v0.29.6, surfaced during the first real end-to-end test of builtin.byo-audio-narrated-video. v0.29.4 shipped the BYO pipeline, v0.29.5 shipped the operator upload UI, v0.29.6 fixed the artifact-list visibility. v0.29.7 closes the two operator-visible blockers that ONLY surface against a production S3/Garage backend + multi-restart deployment — neither caught by unit tests because the memory store + Get URL contract differences only manifest at the integration layer.
Bug 1 — S3 Get returned Artifact with empty URL. My v0.29.4 BYO implementation in hyperframes.compose calls ec.Artifacts.Get(ctx, key) and asserts art.URL != "" before threading it into the audio_url codepath. The MemoryArtifactStore.Get filled URL with "memory://" + key (non-empty, contract met). The S3ArtifactStore.Get filled URL on Put but returned Artifact{Key, Size, ContentType, CreatedAt} on Get — no URL, fails the assert. All 6 of the operator's BYO pipeline test runs failed at compose with artifact_failed: audio_artifact_key "..." resolved to empty URL (artifact store does not expose presigned URLs?). PR #564 fixes by calling s.presign(ctx, key) in Get like Put does; same contract both directions.
Bug 2 — Routing Memory's "Clear all history" couldn't recover from rotated keys. Restarting the control plane across releases without a pinned HELMDECK_MEMORY_KEY generates a fresh ephemeral master each time. The SQLite memory table persists ciphertext from old keys; new process can't decrypt. The UI showed build defaults: memory: decrypt: cipher: message authentication failed. The Clear button hit POST /api/v1/memory/forget which listed THEN deleted each entry one by one — the list step decrypt-failed and forget got stuck. Operator's only recovery was a manual sqlite3 ... DELETE. PR #565 fixes by adding MemoryStore.DeletePrefix that operates on raw SQL rows without decrypting, and switching the forget handler to use it. Also documents pinning HELMDECK_MEMORY_KEY (32-byte hex in .env.local) to prevent rotation in the first place.
Operator upgrade: clean, backend changes only. After tag push, pull ghcr.io/tosin2013/helmdeck:0.29.7, restart control-plane. Sidecar-hyperframes is unchanged from v0.29.6. One operator-side action recommended: pin HELMDECK_MEMORY_KEY if you haven't already (echo "HELMDECK_MEMORY_KEY=$(openssl rand -hex 32)" >> deploy/compose/.env.local); see PR #565's CHANGELOG entry for context.
Fixed
- Routing Memory's "Clear all history" button now works even when the encryption key has rotated, unblocking the memory: decrypt: cipher: message authentication failed recovery path. Surfaced empirically the same day v0.29.6 shipped: operator restarted the control plane multiple times across v0.29.4/5/6 deployments without a pinned HELMDECK_MEMORY_KEY, each restart generated a fresh ephemeral master, the SQLite memory table persisted entries encrypted with the OLD keys, the NEW process couldn't decrypt them → AES-256-GCM auth tag verification failed on every list call → build defaults: memory: decrypt: cipher: message authentication failed error in the Routing Memory UI. The UI's Clear button hit POST /api/v1/memory/forget which listed THEN deleted each entry one by one, so the list step decrypt-failed and forget got stuck (the only path operators had to clear stale entries was a manual sqlite3 ... DELETE). Fix: new MemoryStore.DeletePrefix(ctx, ns, prefix) (int, error) method (internal/memory/memory.go's interface + implementations on both InMemoryStore and SQLiteStore). SQL DELETE FROM memory_entries WHERE namespace=? AND key LIKE ? ESCAPE '\' operates on raw rows and never touches ciphertext, so it succeeds even when no key can decrypt the existing rows. LIKE wildcards (%, _) in the caller's prefix are escaped so they match literally; the caller's audit-key vocabulary is operator-extensible and the SQL injection / wildcard-leak surface needs to be tight. The /api/v1/memory/forget handler now uses DeletePrefix instead of List + per-key Delete. 7 new sub-tests cover the round-trip happy path, idempotency on empty namespaces, cross-namespace isolation, LIKE-wildcard literal-matching for both % and _, and the load-bearing "rotated key" regression: open the same DB with a different master, confirm List fails with auth-tag mismatch AND DeletePrefix succeeds + clears the orphans + post-clear List works again. Documentation note: pin HELMDECK_MEMORY_KEY (32-byte hex) in your .env.local to prevent the rotation in the first place; the autogenerate-with-warning fallback is fine for development but loses state on every restart.
- S3ArtifactStore.Get now populates the URL field on returned Artifact with a presigned link, matching the Put path's contract. This unblocks hyperframes.compose's BYO audio_artifact_key resolution (and any downstream pack that chains an existing artifact into another via URL). Surfaced empirically the same day v0.29.6 shipped: an operator ran builtin.byo-audio-narrated-video against a UI-uploaded MP3 → all 6 pipeline attempts failed at compose with artifact_failed: audio_artifact_key "..." resolved to empty URL (artifact store does not expose presigned URLs?). Root cause: my v0.29.4 BYO implementation in hyperframes.compose calls ec.Artifacts.Get(ctx, key) and asserts art.URL != "". The Memory store filled URL with "memory://" + key (non-empty, contract honored). The S3 store filled URL on Put (via s.presign(ctx, key)) but Get returned Artifact{Key, Size, ContentType, CreatedAt} with no URL: empty string, fails the assert. Fix is two added lines in internal/packs/s3store.go: call s.presign(ctx, key) in Get, set URL: signed on the returned Artifact. presign errors are non-fatal; the empty URL surfaces back through the existing BYO assert (same contract as before the fix; the assert was correct, the precondition was wrong). One regression test in s3store_test.go (the live-S3 path, skipped without endpoint env vars but exercised in CI) asserts URL is populated on Get. Validation: with the fix in place, the user's BYO test prompt now succeeds at compose; pipeline reaches lint/inspect/validate/render gates without the URL-empty short-circuit.
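The wildcard escaping behind DeletePrefix can be sketched as below. The helper names escapeLike and deletePrefixQuery are hypothetical, and the sketch only builds the statement and its bound arguments rather than executing it; the SQL text is the one quoted in the entry above.

```go
package main

import (
	"fmt"
	"strings"
)

// escapeLike escapes the LIKE metacharacters so a caller-supplied prefix
// matches literally under ESCAPE '\'. The backslash itself is escaped too,
// otherwise a literal backslash in the prefix would act as an escape.
func escapeLike(s string) string {
	r := strings.NewReplacer(`\`, `\\`, `%`, `\%`, `_`, `\_`)
	return r.Replace(s)
}

// deletePrefixQuery returns the statement and its bound arguments. The only
// unescaped wildcard is the trailing % appended here for the prefix match.
func deletePrefixQuery(ns, prefix string) (string, []any) {
	const q = `DELETE FROM memory_entries WHERE namespace=? AND key LIKE ? ESCAPE '\'`
	return q, []any{ns, escapeLike(prefix) + "%"}
}

func main() {
	q, args := deletePrefixQuery("openclaw-configure", "audit_pack/")
	fmt.Println(q)
	fmt.Println(args...)
}
```

Because the namespace and pattern are bound parameters, the prefix never reaches the SQL text; the escaping only controls which characters act as wildcards inside the pattern.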
[0.29.6] - 2026-06-21
Theme: "Operator-uploads list-visibility hot-fix."
Same-day hot-fix to v0.29.5's drag-drop upload card. The upload bytes layer was fine (artifact persisted + downloadable + usable by pipelines), but the Management UI's Artifacts page table didn't surface operator-uploaded files because the default list endpoint iterates the pack registry only. v0.29.6 ships the targeted fix: after the pack-registry loop, also iterate special non-pack namespaces (currently operator-uploads) and append their artifacts to the result. Operators can now see their uploads in the Artifacts table immediately after dropping a file. Back-compat-safe; no API surface change, no pipeline shape change. Single-PR release pattern matching the v0.13.1 same-day-hotfix discipline (see the 2026-05-13 v0.12.1 blog post for the rationale).
Operator upgrade: clean — single backend code change. After tag push, pull ghcr.io/tosin2013/helmdeck:0.29.6, restart control-plane. The sidecar-hyperframes image is unchanged from v0.29.5 — no need to pull or restart anything else. The fix is purely in the artifact-list HTTP handler; existing operator-uploads keys from v0.29.5 (which were correctly persisted, just invisibly) immediately become listable in the UI.
Fixed
- GET /api/v1/artifacts (the Management UI's Artifacts list endpoint) now surfaces operator-uploads/* artifacts in the default listing. Bug surfaced empirically the same day v0.29.5 shipped: an operator uploaded an MP3 via the new drag-drop card (PR #556), the upload succeeded (the operator-uploads/<hash>-<filename> key was returned + the bytes were correctly stored, verified via GET /api/v1/artifacts/download/<key> returning 200 + 2.65 MB), but the artifact didn't appear in the Artifacts page table. Root cause: the default list (no ?pack= filter) iterates the pack registry and queries store.ListForPack(packName) for each registered pack. operator-uploads isn't a registered pack; it's a special namespace introduced by the upload endpoint. So the iteration skipped it entirely. Fix: after the pack-registry loop, also iterate a hardcoded list of special non-pack namespaces (currently just operator-uploads) and append their artifacts to the result. The artifacts were always in the store + listable via ?pack=operator-uploads filter; this just makes them visible in the default view. One regression test covers the no-registry-wired path (which would have caught the bug in CI if we'd added it on the original PR #556).
[0.29.5] - 2026-06-21
Theme: "Operator artifact upload + BYO-audio worked example."
Same-day follow-up to v0.29.4 that closes the chat-side-file-ingestion gap and refines the gpt-oss-120b reference recipes with a BYO-audio variant. v0.29.4 shipped builtin.byo-audio-narrated-video but operators couldn't easily get an MP3 INTO the artifact store — artifact.put is a pack that takes bytes via the agent's tool input, and a 2.5 MiB MP3 means ~3.3 MiB of base64 in chat which is impractical. v0.29.5 ships the drag-drop upload card on the Management UI's Artifacts page plus the new POST /api/v1/artifacts/upload REST endpoint behind it, AND refines the gpt-oss-120b-concept-animator howto with a BYO variant that collapses the 5-call from-scratch narrated-video chain to a single pipeline call when the operator supplies the audio. Together these close the workflow surfaced by the v0.29.3 retest: operator drops MP3 → copies artifact key → asks Tier C agent → narrated MP4 with pre-render validation gates inlined. Back-compat-safe; operators on v0.29.4 can upgrade directly with no input or schema changes.
Operator upgrade: clean — no schema migrations, no removed packs, no pipeline-shape changes. The upload endpoint is additive; the existing artifact.put pack continues to work unchanged for agent-driven uploads. After tag push, pull ghcr.io/tosin2013/helmdeck:0.29.5 + ghcr.io/tosin2013/helmdeck-sidecar-hyperframes:0.29.5, restart control-plane, and re-run ./scripts/configure-openclaw.sh to stamp the new helmdeckVersion. Smoke-test the upload UX: open the Management UI's Artifacts page, drag any file onto the new upload card, confirm the resulting artifact_key appears with a copy button, paste into a chat invocation of builtin.byo-audio-narrated-video (audio files only; the pipeline rejects non-audio at compose's content-type validation).
Documentation
- gpt-oss-120b-concept-animator howto gains a bring-your-own-audio variant section (#496 refinement). Recipe now covers the operator workflow where Maya (the sanitized worked-example persona) already has an audio file (recorded interview, stitched podcast clip, ElevenLabs render) and wants a narrated video against it without re-generating audio from a prompt. Operator-side: open Management UI → Artifacts → drag-drop the audio file (via the new POST /api/v1/artifacts/upload endpoint shipped in this release) → copy the returned audio_artifact_key → paste into chat. Agent-side: a CONSTRAINTS-section override on the base AGENTS.md template that locks the model to ONE pack call (helmdeck__pipeline-run with builtin.byo-audio-narrated-video) and explicitly invalidates podcast.generate regeneration when an audio_artifact_key is supplied. Why this matters for Tier C models: from-scratch is a 5-call chain (podcast → compose → render → av.validate → verify_manifest) where each call is a drift opportunity; the BYO variant collapses to 1 call because the pipeline inlines the chain including the pre-render validation gates (lint/inspect/validate). Includes a sample test prompt + the expected JSON pack-call shape so an operator can verify their agent fires correctly. Companion gpt-oss-120b-slide-narrator recipe's Related section updated to point at the BYO variant.
Added
- Operator-facing artifact upload: new POST /api/v1/artifacts/upload REST endpoint plus a drag-drop card on the Management UI's Artifacts page (web/src/pages/artifacts.tsx). Closes the UX gap surfaced during v0.29.4 testing: the operator has an MP3 (or any media file) on their laptop and wants to use it via the BYO-audio narrated-video pipeline, but there was no clean path to get the file INTO the artifact store. artifact.put is a pack: it takes bytes in the agent's tool input, but for a 2.5 MiB MP3 that means ~3.3 MiB of base64 in the chat message, which is impractical. The new endpoint accepts multipart/form-data with a file field, persists under the operator-uploads/ namespace, and returns {artifact_key, url, size, content_type, filename}. Content-type detection: prefer the browser-set Content-Type on the upload part, fall back to mime.TypeByExtension(filename), then to http.DetectContentType on the first 512 bytes. 100 MiB cap (50 MiB above hyperframes.attach_audio's audio cap so long-form audio + large video both fit). Filename sanitization strips path prefixes + control characters + truncates at 200 chars. UI surface: drag-drop zone with a fallback file input, success state shows the resulting artifact_key with a copy-to-clipboard button + a hint about pasting it into pipeline inputs. 8 new sub-tests cover input validation (happy path, content-type inference from extension, missing file field, plain-JSON instead of multipart, no-artifact-store-wired), filename sanitization (normal/spaces/path-prefixes/control-chars/empty/truncation), and the operator-uploads namespace contract. Workflow now: operator opens Management UI → Artifacts → drags MP3 → copies returned audio_artifact_key → asks agent to run builtin.byo-audio-narrated-video with that key + a topic description + duration_seconds. No chat-side file ingestion, no SSH, no curl scripts.
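The content-type fallback chain and the filename sanitization described for the upload endpoint might look roughly like this. The function names detectContentType and sanitizeFilename are hypothetical, as is the "upload" fallback for an empty name; the changelog does not say what the handler does in that case.

```go
package main

import (
	"fmt"
	"mime"
	"net/http"
	"path"
	"strings"
	"unicode"
)

// detectContentType prefers the part header, then the file extension,
// then content sniffing on the leading bytes.
func detectContentType(partHeader, filename string, head []byte) string {
	if partHeader != "" {
		return partHeader
	}
	if t := mime.TypeByExtension(path.Ext(filename)); t != "" {
		return t
	}
	// http.DetectContentType inspects at most the first 512 bytes.
	return http.DetectContentType(head)
}

// sanitizeFilename strips path prefixes and control characters and
// truncates to 200 characters.
func sanitizeFilename(name string) string {
	// Normalize Windows separators so path.Base strips them as well.
	name = strings.ReplaceAll(name, `\`, "/")
	name = path.Base(name)
	name = strings.Map(func(r rune) rune {
		if unicode.IsControl(r) {
			return -1 // drop the rune
		}
		return r
	}, name)
	if r := []rune(name); len(r) > 200 {
		name = string(r[:200])
	}
	switch name {
	case "", ".", "..", "/":
		return "upload" // hypothetical fallback for empty or degenerate names
	}
	return name
}

func main() {
	fmt.Println(sanitizeFilename("../../tmp/my talk.mp3"))
	fmt.Println(detectContentType("", "cover.png", nil))
}
```

Passing path.Ext(filename) rather than the whole filename matters because mime.TypeByExtension expects an extension with its leading dot.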
[0.29.4] - 2026-06-21
Theme: "Pre-render validation suite + bring-your-own-audio pipeline + render-deterministic authoring docs."
Four days of follow-up work to the v0.29.3 retest investigation. The v0.29.3 render produced 2 distinct frames over 90 seconds despite PR #546's slot-lifetime fix landing correctly — diagnosis showed it wasn't a slot-lifetime bug at all, it was upstream's "render ≠ preview" bug class manifesting in the decision-tree example we'd chosen as the default. v0.29.4 ships the architectural response: three new pre-render validation packs that wrap upstream's own diagnostic tools (hyperframes lint, hyperframes inspect, hyperframes validate), a bring-your-own-audio pipeline so operators can compose visuals against an MP3 they've already uploaded, an explanation page + agent skill that codify the empirically-derived render-deterministic composition rules, and the operator-visible <audio id=...> fix in attach_audio that upstream's lint was flagging as silent-in-renders all along. Back-compat-safe; operators on v0.29.3 can upgrade directly with no input or schema changes.
Operator upgrade: clean — no schema migrations, no removed packs, no pipeline-shape changes. Existing pipelines/agents continue unchanged. The three validation packs are additive; the BYO pipeline is additive (starter pipeline count went 22 → 23); the audio_artifact_key input on hyperframes.compose is additive and mutually exclusive with the existing audio_url. After tag push, pull ghcr.io/tosin2013/helmdeck:0.29.4 + ghcr.io/tosin2013/helmdeck-sidecar-hyperframes:0.29.4, restart control-plane, and re-run ./scripts/configure-openclaw.sh to stamp the new helmdeckVersion into the deployed SKILL.md AND install the new helmdeck-hyperframes-authoring skill (auto-discovered from skills/<name>/SKILL.md). For a quick smoke-test of the BYO path: artifact.put an MP3 → pipeline-run builtin.byo-audio-narrated-video with the returned key + a topic description + the audio's duration (max 720s).
Added
- Three new packs form a pre-render validation suite for hyperframes scaffold projects: hyperframes.lint (static source), hyperframes.inspect (runtime layout), hyperframes.validate (runtime errors + WCAG contrast). All three share the same input shape as hyperframes.render (project_artifact_key OR composition_html, mutually exclusive), the same setup helper (setupHyperframesProjectDir), the same JSON-parse + strict-mode contract, the same soft-surface default (findings ARE the output; the pack returns success even with errors). The trio targets the three failure-detection windows: lint catches STATIC issues from source files (~1s, file-system only), inspect catches RUNTIME LAYOUT issues by loading in headless Chrome and sampling the DOM at N timestamps, validate catches RUNTIME CONSOLE ERRORS during the headless load plus a WCAG AA contrast audit across timeline samples. hyperframes.lint wraps hyperframes lint --json — catches media_missing_id (audio silent in renders), google_fonts_import (external font fetches fail in sandboxed renders), gsap_studio_edit_blocked (manual __timelines registration conflicting with runtime auto-discovery), composition_self_attribute_selector (CSS that leaks across embedded instances), missing_gsap_script, etc. hyperframes.inspect wraps hyperframes inspect --json — catches text_box_overflow (text extends past its container at a specific timestamp), transition_overlap (sibling clips overlap at a transition seam), static_collapse (element width or height goes to 0); at_transitions:true samples every tween start/end boundary to catch transient overlaps. hyperframes.validate wraps hyperframes validate --json — catches CORS-blocked external assets (which produce silent blank media in renders), net::ERR_FAILED for any external resource, JS runtime errors during composition load (which lead to blank-canvas renders), plus WCAG AA contrast failures across sampled timestamps; strict mode targets console errors only (contrast failures are a separate audit dimension).
  All three pass strict:true to surface error-severity findings as typed CodeArtifactFailed, gating downstream packs on a clean result. Reference docs at docs/reference/packs/hyperframes/{lint,inspect,validate}.md. 14 + 11 + 8 = 33 new sub-tests cover input validation, happy paths, strict-mode behavior, CLI argv shape (verbose, at-transitions, no-contrast flags thread correctly), the JSON-prefix stripper (CLI emits a telemetry notice before the JSON payload on first session invocation), and contrast-vs-error severity separation in validate strict mode. Architectural twin of av.validate end-to-end; the four packs together (av.validate post-render + the new three pre-render) give pipelines symmetric validation on both sides of the render boundary.
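The JSON-prefix stripper and the strict-mode gate described above can be sketched roughly as follows. This is a minimal sketch, not the pack's code: the report shape (a findings array with code and severity fields) and the names stripJSONPrefix, parseReport, and hasErrorFinding are hypothetical, and it assumes the telemetry notice never contains a `{` character.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// finding and report are an ASSUMED shape for the CLI's --json payload;
// the real schema is not given in this changelog.
type finding struct {
	Code     string `json:"code"`
	Severity string `json:"severity"`
}

type report struct {
	Findings []finding `json:"findings"`
}

// stripJSONPrefix drops any notice text printed before the first '{'.
func stripJSONPrefix(out string) (string, error) {
	for i := 0; i < len(out); i++ {
		if out[i] == '{' {
			return out[i:], nil
		}
	}
	return "", errors.New("no JSON object in CLI output")
}

// parseReport strips the prefix and decodes the remaining payload.
func parseReport(out string) (report, error) {
	var r report
	payload, err := stripJSONPrefix(out)
	if err != nil {
		return r, err
	}
	err = json.Unmarshal([]byte(payload), &r)
	return r, err
}

// hasErrorFinding reports whether a strict-mode caller should fail the step.
func hasErrorFinding(r report) bool {
	for _, f := range r.Findings {
		if f.Severity == "error" {
			return true
		}
	}
	return false
}

func main() {
	out := "telemetry notice (first invocation)\n" +
		`{"findings":[{"code":"media_missing_id","severity":"error"}]}`
	r, err := parseReport(out)
	fmt.Println(len(r.Findings), hasErrorFinding(r), err)
}
```

In the soft-surface default the caller would return the findings as output regardless; only a strict caller would consult hasErrorFinding.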
Fixed
- hyperframes.attach_audio injected <audio> element now carries id="aroll-audio-<sha256-prefix>" matching the content-addressed filename stem. Upstream's own hyperframes lint flags media without id as a hard error (media_missing_id): "The renderer requires id to discover media elements — this audio will be SILENT in renders." The content-hash id mirrors the filename's hash component so the same audio bytes always produce the same id (stable across re-runs of the same narration). Surfaced during the v0.29.3 retest — even with PR #546's slot-lifetime fix, the audio element our pack injected was technically render-silent per upstream's contract. Existing 15 attach_audio sub-tests updated to assert the id contract; one new sub-test verifies content-addressed id stability across calls with identical audio bytes. Reference: field report, upstream issue heygen-com/hyperframes#1437 (render ≠ preview bug class).
Added
- hyperframes.compose gains an audio_artifact_key input as an alternative to audio_url — the handler resolves the artifact key to a presigned URL via ec.Artifacts.Get(...).Artifact.URL and threads it into the existing audio_url codepath. Mutually exclusive with audio_url. Enables bring-your-own-audio pipelines that compose visuals against pre-existing audio (operator uploads via artifact.put, or output from a prior pack call) without an intermediate artifact.get step that would base64-encode the full audio bytes just to extract the URL. Five new sub-tests cover BYO happy path (key resolves, composition embeds URL), mutual-exclusion guard, key-not-found error, no-artifact-store-wired error, and a back-compat regression test confirming the existing audio_url path is unchanged. Back-compat: pre-existing callers passing audio_url see ZERO behavior change.
- New pipeline builtin.byo-audio-narrated-video — bring-your-own-audio counterpart to builtin.prompt-narrated-video. Inputs: audio_artifact_key + description + duration_seconds (required) plus optional aspect_ratio (16:9 default / 9:16 / 1:1) and resolution (1080p / 4k). Pipeline shape: hyperframes.compose (with audio_artifact_key) → hyperframes.lint → hyperframes.inspect (with at_transitions:true) → hyperframes.validate → hyperframes.render. All three validation gates pass strict:true — any error-severity finding aborts the pipeline BEFORE render burns wall-clock. 12-minute cap enforced by hyperframes.compose's existing hyperframesComposeMaxDuration (720s) → CodeInvalidInput on duration_seconds > 720. Use case: a user uploads an MP3 (interview, lecture, podcast clip) and wants a topic-relevant narrated MP4. Skips podcast.generate vs prompt-narrated-video because the audio already exists. Tier-A authoring (LLM writes the composition from scratch); for Tier-C scaffold-based workflows where the user wants visuals borrowed from upstream's curated examples instead of LLM-authored, use builtin.scaffolded-narrated-video — but it regenerates audio via podcast.generate, so it won't preserve a user-uploaded MP3. Pipeline count update: starter pipelines went 22 → 23.
Documentation
- New explanation page docs/explanation/authoring-render-deterministic-compositions.md codifies the empirically-derived rules an LLM or human author must follow so a hyperframes composition renders correctly (not just previews correctly). Covers the structural contract (single GSAP timeline per composition, key matches data-composition-id, declarative sub-composition sequencing via data-start), the authoring-style contract (layout-before-animation, synchronous construction, no setTimeout / setInterval / requestAnimationFrame / repeat:-1 / post-paint DOM mutation), the asset contract (media needs id, no external CDN URLs except GSAP itself, no CSS transform on GSAP-animated elements), and the pre-render validation gate (lint → inspect → validate, all strict, before render). Sourced from the v0.29.3 retest investigation; references upstream's "render ≠ preview" tracking issue (heygen-com/hyperframes#1437).
- New skill skills/helmdeck-hyperframes-authoring/SKILL.md packages the same rules in agent-context-injection format. Auto-discovered by scripts/configure-openclaw.sh (no script changes needed). Use when an agent is authoring composition HTML for hyperframes.compose, hyperframes.render, or any pipeline that produces a programmatic MP4 — including builtin.scaffolded-narrated-video, builtin.prompt-video, builtin.prompt-narrated-video. Includes a worked example of the smallest render-deterministic composition skeleton (title + subtitle, 8s, with gsap.from() entrances) so the LLM has a reference shape to extend rather than authoring from scratch.
- skills/helmdeck/SKILL.md updated: adds bullets for hyperframes.lint / hyperframes.inspect / hyperframes.validate next to the existing hyperframes.compose and hyperframes.render entries, references the new authoring skill, and emphasizes the "always run lint → inspect → validate BEFORE render" publish-gate pattern with token-economics rationale (lint <1s, inspect+validate ~10-30s vs render's ~1-5 min — gates catch failures cheaply before render budget burns).
Changed
- scripts/hyperframes-bare-baseline.sh now defaults --example=kinetic-type (empirically render-deterministic: 10 distinct frames over 10 samples) instead of decision-tree (render-hostile: 2 distinct frames over 15s even when rendered bare from upstream's registry). Adds --lint=true|false (default true) and --no-lint shorthand to run upstream's hyperframes lint --json upfront and surface findings in the diagnostic.json + final summary. Help text documents the render-deterministic example set (kinetic-type, swiss-grid, warm-grain) and notes decision-tree exists for reproducing the v0.29.2/v0.29.3 slot-lifetime regression test bed.
[0.29.3] - 2026-06-17
Theme: "Decision-tree blank-canvas fix + upstream pin hygiene."
Two follow-ups to v0.29.2's hyperframes.attach_audio pack. The first (#546) closes the blank-canvas symptom operators saw when narrated videos extended past the scaffold's 15-second child composition: attach_audio now stretches the child's data-duration to match the root's when they started equal, eliminating the upstream slot-lifetime trigger. The second (#548) bumps the sidecar's pinned hyperframes from 0.6.97 to 0.6.110 for general hygiene. Both back-compat-safe; operators on v0.29.2 can upgrade directly with no input or schema changes.
Honest framing: the pin bump does NOT fix the slot-lifetime bug. Upstream #911 (closed 2026-05-17, shipped in 0.6.110) addresses an adjacent code path; helmdeck's actual bug is filed at heygen-com/hyperframes#1540 and tracked in #547. PR #546's child-composition rewrite is the only thing closing the operator-visible bug today. See the blog post for the empirical trail.
Operator upgrade: clean — no schema migrations, no removed packs. Existing pipelines/agents continue unchanged. After tag push, pull ghcr.io/tosin2013/helmdeck:0.29.3 + ghcr.io/tosin2013/helmdeck-sidecar-hyperframes:0.29.3, restart control-plane, and re-run ./scripts/configure-openclaw.sh to stamp the new helmdeckVersion into the deployed SKILL.md. Re-run any narrated-video pipeline that produced a 15-second-then-blank canvas on v0.29.2 to confirm the fix.
Changed
- helmdeck-sidecar-hyperframes Dockerfile bumps the pinned upstream from hyperframes@0.6.97 to hyperframes@0.6.110 — version hygiene only; does NOT fix the child-composition slot-lifetime bug. Background: while drafting PR #546 (the helmdeck-side child-composition stretch fix), I noticed upstream #911 had been closed 2026-05-17 with a fix that sounded like our exact symptom ("Sub-composition slot goes black after GSAP timeline ends, regardless of host data-duration"). Empirically verified the closure by building the sidecar with hyperframes@0.6.110 and re-rendering the same bare npx hyperframes init decision-tree scaffold with root extended to 331s and child left at 15s: frames at t=20s/100s/200s/300s are byte-identical to the 0.6.97 result (md5 9c95fca0…, 8.6 KB blank canvas), continuing to blank the instant the child's data-duration elapses. Inspecting the shipped runtime bundle confirmed the #911 fix IS present (d.hasAttribute("data-composition-src")||d.hasAttribute("data-composition-file")) but addresses an adjacent code path — the producer's htmlCompiler stripping data-composition-src during inlining — not the duration-mismatch case helmdeck hits. Filed heygen-com/hyperframes#1540 with the reproducer; helmdeck-side watch issue #547 tracks when the shim can come back out. PR #546's helmdeck-side rewrite remains the only fix in play. The pin bump is still worth landing: 13 patch releases of unrelated upstream improvements, ADR-037 exact-version + CLI-surface-sentinel discipline intact, Dependabot tracking the live latest, and the regression-check render confirmed 0.6.110 does NOT break the working scenario (root and child durations matched). Operator-visible reproducer + the wider "trust-but-verify-an-issue-close" story in the 2026-06-17-child-composition-slot-lifetime blog post.
Fixed
- hyperframes.attach_audio now stretches child compositions whose data-duration matched the root's original, closing the v0.29.2 follow-up bug where the decision-tree scaffold rendered as a blank canvas for 83 of 98 seconds. Empirical repro from run_6f6cb0ea40a94dd1: a ~98-second narrated video, audio attached correctly, but visuals went white at 15s and stayed white through the rest. Root cause: the decision-tree scaffold's index.html has both a root composition (data-composition-id="main", data-duration="15") AND a child composition (<div data-composition-id="decision-tree" data-composition-src="compositions/decision_tree.html" data-duration="15">). The v0.29.2 attach_audio rewrote root's duration to 97.9 but left the child at 15 — so the renderer played 0-15s of decision-tree animation followed by 83 seconds of inactive (blank) canvas. The fix extends updateRootDataDuration to ALSO rewrite any <div> with a data-composition-id attribute whose data-duration equals the root's original. Conservative heuristic: only stretch children that were span-aligned with the root. Operator-deliberate divergences (e.g. a 5-second intro composition under a 30-second root) are preserved — when a child's data-duration differs from root's original, it's left alone. class="clip" data-durations are still untouched (no data-composition-id anchor on clip elements). Four new sub-tests cover the empirical decision-tree shape, the stretches-matching-children behavior, the leaves-divergent-children-alone behavior, and the regression guard for class="clip" semantics. Existing 15 attach_audio tests pass unchanged. 1124 builtin / 1770 across consumers pass with race detector clean.
[0.29.2] - 2026-06-17
Theme: "Silent-video fix + tunable Firecrawl/LLM concurrency."
Two follow-ups landed within hours of v0.29.1: the hyperframes.attach_audio pack that closes the v0.28.x silent-video bug for unreliable upstream examples (decision-tree empirically), and an operator env var to tune content.ground's per-call Firecrawl + verify concurrency. Both back-compat-safe; operators on v0.29.1 can upgrade directly with no input or schema changes.
Operator upgrade: clean — no schema migrations, no removed packs. Existing pipelines/agents continue unchanged. The new pack is additive; the concurrency env var defaults to today's hardcoded 4 when unset. After tag push, pull ghcr.io/tosin2013/helmdeck:0.29.2, restart control-plane, and re-run ./scripts/configure-openclaw.sh to stamp the new helmdeckVersion into the deployed SKILL.md.
Added
- content.ground Phase 2 concurrency is now operator-tunable via HELMDECK_CONTENT_GROUND_CONCURRENCY (#524). Default stays 4 (the historical hardcoded value), range [1, 32]. Out-of-range or non-numeric values silently fall back to the default — an operator typo can't break grounding. Use case: operators running self-hosted Firecrawl with relaxed rate limits + a dedicated LLM gateway can raise the limit for faster wall-clock on long posts (12+ claims); operators on free-tier shared infrastructure can lower it to avoid bumping into rate caps. The constant contentGroundConcurrency becomes a function contentGroundConcurrency() re-evaluated on every handler entry — restart not required after updating the env. Eighteen new sub-tests cover the default, valid overrides at the boundaries (1, 4, 8, 16, 32), out-of-range fallback (0, -1, 33, 1000, 100000), non-numeric fallback (typos like "fourrr", "4.5"), and whitespace trimming (" 4 " → 4). 1120 builtin tests pass with race detector clean; existing 45 content.ground tests pass unchanged (back-compat). Reference doc gains a "Tunable Phase 2 concurrency" section explaining when to raise vs lower the limit.
Added
- New pack hyperframes.attach_audio closes the silent-video failure mode in builtin.scaffolded-narrated-video (#521). Background: upstream hyperframes init --audio=<path> is silently ignored by at least the decision-tree example (and possibly others — empirical per-example reliability), so threading audio_url through hyperframes.scaffold is unreliable. Despite v0.28.4 and v0.28.5 fixing every other step in the chain (audio threading, concat-vs-validate timing), runs against decision-tree continued to produce 15-second silent MP4s. This pack is the deterministic alternative: pure-Go in-process tarball transform that downloads the audio bytes, embeds them under assets/aroll-audio-<sha256-prefix>.<ext> (content-addressed for dedup), and injects an <audio> element as the first child of the root composition div (matched by data-composition-id="main" — the canonical hyperframes scaffold convention; tolerant of arbitrary attribute order). The element carries data-start="0", data-duration=<seconds>, data-volume=<volume>, data-track-index=<idx> per upstream's contract (volume defaults to 1.0, track index to 9 — the documented audio-track row). By default also rewrites the root composition div's data-duration to the audio length (update_root_duration: true) so the rendered video plays the full narration; set false when hyperframes.interpolate has already established the duration. Required inputs: project_artifact_key, audio_artifact_key, duration_seconds. Outputs: new project_artifact_key plus audio_filename / audio_size / duration_seconds_used / root_duration_updated / track_index_used / volume_used telemetry. Supported audio content types: audio/{mpeg,mp3,mp4,aac,wav,x-wav} covering ElevenLabs' default mp3_44100_192 and the common alternatives. 50 MiB cap matches hyperframes.attach_asset's. Same shape as attach_asset end-to-end — no dispatcher, no session executor, just ec.Artifacts.
  15 new sub-tests cover input validation (missing keys, negative duration, missing audio/project, empty bytes, oversize, unsupported content-type), the happy path (MP3 with all defaults — confirms audio element injected, data-duration rewritten, audio file written into tarball, content-addressed filename stable across calls), update_root_duration:false semantics (root duration preserved), custom volume/track_index, no-root-div rejection, missing-index.html rejection, and the regex helpers at unit level (spliceAudioIntoRoot finds data-composition-id="main" regardless of attribute order; updateRootDataDuration only rewrites the root, not child clip durations; handles data-duration before/after data-composition-id in the attribute list). builtin.scaffolded-narrated-video pipeline rewired: hyperframes.scaffold no longer receives audio_url (upstream's unreliable path), and hyperframes.attach_audio is inserted between interpolate and render chaining podcast.generate.audio_artifact_key + duration_s. Existing direct-pack callers of hyperframes.scaffold that still pass audio_url see no behavior change — the scaffold input is preserved for back-compat; this PR just stops the built-in pipeline from relying on it. Reference doc docs/reference/packs/hyperframes/attach_audio.md covers the splice algorithm, supported content types, and the issue #521 history. 1102 builtin / 1748 across consumers pass with race detector clean.
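Content addressing of the embedded audio file can be sketched as below. The path pattern comes from this entry; the 12-character hex prefix length and the function name audioAssetName are assumptions made for illustration only.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// audioAssetName derives a stable filename stem from the audio bytes, so the
// same narration always maps to the same asset path. The stem is returned
// separately because it is the kind of value a matching element id could reuse.
// ASSUMPTION: the prefix length (12 hex chars) is illustrative.
func audioAssetName(audio []byte, ext string) (path, stem string) {
	sum := sha256.Sum256(audio)
	stem = "aroll-audio-" + hex.EncodeToString(sum[:])[:12]
	return "assets/" + stem + "." + ext, stem
}

func main() {
	path, stem := audioAssetName([]byte("example narration bytes"), "mp3")
	fmt.Println(path)
	fmt.Println(stem)
}
```

Because the name depends only on the bytes, re-running the pack with identical audio overwrites the same tarball entry instead of accumulating duplicates.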
[0.29.1] - 2026-06-17
Theme: "v0.29.0 follow-ups — close the audit gap and unblock operator upgrades."
Two fixes surfaced during v0.29.0 release prep + first-hour operator testing. Both are back-compat-safe; operators can upgrade directly from v0.29.0 with no input or schema changes.
Operator upgrade: clean — no schema migrations, no removed packs. Existing pipelines/agents continue unchanged. configure-openclaw.sh now actually completes on a correctly-configured deployment (was blocked by a false-positive preflight on every prior version). After tag push, run git pull && ./scripts/configure-openclaw.sh to pick up both fixes; the script refreshes the deployed SKILL.md with the new helmdeckVersion stamp.
Added
- content.ground gains a handler-internal per-claim cache that survives unrelated edits to the input markdown (#523). The engine-level MemoryConfig cache (ADR 047, added in PR #522) keys on sha256(caller + input bytes), so a typo fix anywhere in the markdown invalidates every claim's cached source. The new per-claim cache keys on sha256(claim_text + "\0" + search_query) — claims whose text + query is unchanged across edits hit, skipping BOTH the Firecrawl /v1/search call AND the per-source verify LLM call. The two caches stack: the engine cache catches idempotent re-runs (~millisecond replay); the per-claim cache catches the "fix a typo, re-cite" workflow where the engine cache misses but the claim set is mostly unchanged. TTL is 7 days (vs the engine cache's 24h) because the per-claim key is content-derived rather than time-derived. The cache is goroutine-safe (Phase 2's bounded errgroup populates it concurrently); cache hits skip the errgroup slot entirely so a cached re-run completes in ~claim-extractor wall-clock with no Phase 2 work. Failed Firecrawl searches are NOT cached (transient outages shouldn't poison the cache for 7 days); empty picks (no source found) ARE cached so a re-run doesn't re-burn the verify LLM call on the same null result. Two new output fields: claims_cached (per-claim cache hits) and firecrawl_calls (real Firecrawl calls, excludes cache hits) — operators see "0 of 5 claims hit Firecrawl after the typo fix" telemetry. Existing Memory: &MemoryConfig{...} declaration on the pack stays (engine cache continues to work); the per-claim layer is additive. Four new sub-tests: typo-fix workflow with all-cache-hits (zero Firecrawl, zero verify), mutate-one-claim with 2 hits + 1 miss, nil-Memory safety (engine without WithMemoryStore works unchanged), key stability. Existing 41 content.ground tests pass unchanged (back-compat); race detector clean. Reference doc has a two-layer cache table explaining when each layer hits.
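The per-claim key derivation can be sketched in a few lines. The formula sha256(claim_text + "\0" + search_query) is from this entry; the function name claimCacheKey and the hex encoding of the digest are assumptions.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// claimCacheKey hashes the claim text and search query with a NUL separator.
// The separator keeps ("ab", "c") and ("a", "bc") from colliding, which plain
// concatenation would allow.
func claimCacheKey(claimText, searchQuery string) string {
	sum := sha256.Sum256([]byte(claimText + "\x00" + searchQuery))
	return hex.EncodeToString(sum[:])
}

func main() {
	k1 := claimCacheKey("Go 1.0 shipped in 2012", "go 1.0 release date")
	k2 := claimCacheKey("Go 1.0 shipped in 2012", "go 1.0 release date")
	fmt.Println(k1 == k2) // unchanged claim + query: same key, cache hit
	fmt.Println(claimCacheKey("ab", "c") == claimCacheKey("a", "bc"))
}
```

Since the key ignores the rest of the markdown, a typo fix elsewhere in the post leaves every unchanged claim's key intact.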
Fixed
- scripts/configure-openclaw.sh — auth probe completes correctly on authenticated deployments (#539). Two bugs stacked. (1) SIGPIPE under pipefail: the probe piped openclaw models auth list 2>/dev/null directly into grep -q. grep -q exits immediately on first match, which closes its stdin and SIGPIPEs the upstream auth-list call (rc=141 = 128 + signal 13). With set -o pipefail (set at script line 34), the 141 propagates as the pipeline exit; the if ! inverts it; the script dies with "missing openrouter auth" on a correctly-authenticated deployment. (2) Redundant ephemeral container: the original implementation used docker compose -f $OPENCLAW_COMPOSE_FILE run --rm -T openclaw-cli, which spawns a fresh container that exits non-zero under 2>/dev/null for unrelated reasons on top of the SIGPIPE issue. Fix: capture-then-grep — capture the auth list into a variable first, then grep the variable. The capture lets the upstream finish cleanly (no SIGPIPE); the in-memory grep against $auth_list never closes its stdin early. Switched to docker exec against the running $OPENCLAW_CONTAINER (the pattern used elsewhere in this script) so the auth state is the one OpenClaw actually uses. Empirically blocked the v0.29.0 SKILL.md refresh that exposed the bug — every prior configure-openclaw.sh run on this deployment hit the false-positive die. The 4 other grep -q probes in this script (against tiny docker ps / docker network inspect outputs that fit in the kernel pipe buffer) are unaffected and unchanged; if they ever bite, the same capture-then-grep pattern applies.
[0.29.0] - 2026-06-16
Theme: "Packs measure their own input."
Closes the cross-pack JIT length-sizing convention adoption arc. All six length-variable packs (blog.rewrite_for_audience, podcast.generate, hyperframes.compose, slides.narrate, research.deep, content.ground) now accept length_intent (summary / thorough / exhaustive) + inspect:true + explicit numeric overrides, and report length_intent_applied + truncated on every generate response. Calling agents can declare intent uniformly and stop precomputing per-pack length surfaces in their AGENTS.md. Originally motivated by an undersized blog rewrite empirically observed 2026-06-16: a ~7,000-word source compressed to 1,161 words because the agent's static "1300-2000 words for technical-deep-dive" target couldn't scale with source size. All six adoptions strictly back-compat — existing callers passing the explicit numeric input (max_tokens, duration_target_min, duration_seconds, max_claims, limit) see ZERO behavior change. The umbrella tracking issue (#525) closed with this release.
Operator upgrade: clean — no schema migrations, no removed packs, no breaking input changes. All input/output additions are optional. Existing pipelines and agents continue to work unchanged; the new convention is opt-in via the new fields. Pack signatures podcast.GenerateScript, slides.narrate's generateEngagement, and content.ground's extractClaims gained finish-reason returns — internal API changes only, no caller impact.
Added
- content.ground adopts the JIT length-sizing convention (#525 umbrella, #531 follow-up). Sixth and final pack in the cross-pack adoption sequence after blog.rewrite_for_audience (#527), podcast.generate (#533), hyperframes.compose (#534), slides.narrate (#535), and research.deep (#536). content.ground is cost-cap shaped like research.deep — each claim costs a Firecrawl /v1/search + per-source LLM verify call. Intent maps directly to the existing max_claims input: summary → 3 claims, thorough → 5 (matches legacy default), exhaustive → 8 (matches legacy ceiling). The issue's original "intentional back-compat break" framing was based on a wrong premise — the current code is already capped at 8 with a default of 5, not unlimited. The exhaustive row labels today's hard cap rather than relaxing it. New optional inputs: length_intent and inspect:true. Precedence: inspect:true short-circuit → explicit max_claims ("explicit", clamped to [1, 8]) → length_intent ("intent:*") → legacy default 5 ("default"). Strict back-compat: existing callers passing max_claims see ZERO behavior change. New outputs on every generate response: max_claims_applied (what was actually used after clamping), length_intent_applied (where the value came from), truncated (fires when EITHER the claim extractor LLM hit finish_reason=length OR the rewrite step truncated and fell back to citation-only). extractClaims's signature gains a finish-reason return: (claims, raw, finishReason, error). The rewrite step's pre-existing errRewriteTruncated signal is now also surfaced via truncated:true rather than being a silent log-only event. inspect:true short-circuits before the dispatcher / HELMDECK_FIRECRAWL_ENABLED checks — gateway-less, Firecrawl-less environments can plan a grounding pass. OutputSchema.Required narrowed from [claims_considered, claims_grounded, sha256] to [] so inspect responses (no extraction) satisfy the validator.
  Six new sub-tests cover inspect short-circuit (no Firecrawl, no dispatcher) + resolver precedence at unit level + each intent row mapping to the right max_claims_applied + explicit-max_claims-wins back-compat + no-input default → 5 / "default" + extractor finish_reason=length → truncated + unknown-intent fallback. 1073 builtin tests pass with race detector clean. Reference doc updated with the heuristic table, precedence rules, and the truncated-signal semantics. Closes the cross-pack adoption sequence — all six length-variable packs now share length_intent / inspect / truncated so agents can declare intent uniformly and stop precomputing per-pack length surfaces in their AGENTS.md.
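The precedence chain for max_claims can be sketched as a small resolver. The table values, clamp range, and labels come from this entry; the function name resolveMaxClaims, the pointer-for-unset convention, and the choice to label an unknown intent as "default" are assumptions. The inspect:true short-circuit is omitted because it happens before any resolution.

```go
package main

import "fmt"

// resolveMaxClaims returns the claim cap and a label naming its source.
// explicit is nil when the caller did not pass max_claims (ASSUMED convention).
func resolveMaxClaims(explicit *int, intent string) (int, string) {
	if explicit != nil {
		n := *explicit
		if n < 1 {
			n = 1
		}
		if n > 8 {
			n = 8
		}
		return n, "explicit" // explicit numeric input always wins
	}
	table := map[string]int{"summary": 3, "thorough": 5, "exhaustive": 8}
	if n, ok := table[intent]; ok {
		return n, "intent:" + intent
	}
	// No intent, or an unrecognized one: fall back to the legacy default.
	// ASSUMPTION: the real label for an unknown intent is not documented here.
	return 5, "default"
}

func main() {
	twenty := 20
	fmt.Println(resolveMaxClaims(&twenty, "summary"))
	fmt.Println(resolveMaxClaims(nil, "exhaustive"))
	fmt.Println(resolveMaxClaims(nil, ""))
}
```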
Added
- research.deep adopts the JIT length-sizing convention (#525 umbrella, #532 follow-up). Fifth in the cross-pack adoption sequence after blog.rewrite_for_audience (#527), podcast.generate (#533), hyperframes.compose (#534), and slides.narrate (#535). research.deep is cost-cap shaped: the "length" being controlled isn't output words or duration but the number of source URLs scraped per call (each costs a Firecrawl SERP page hit + a per-source markdown scrape + a slice of the synthesis LLM's context window). Intent maps directly to the existing limit input: summary → 3 sources, thorough → 5 (matches the legacy default), exhaustive → 10 (matches the hard cap). New optional inputs: length_intent and inspect:true. Precedence: inspect:true short-circuit → explicit limit ("explicit", clamped to [1, 10]) → length_intent ("intent:*") → legacy default 5 ("default"). Strict back-compat: existing callers passing limit see ZERO behavior change. New outputs on every generate response: limit_applied (what Firecrawl actually saw), sources_used (count after empty-markdown filtering — operators see how lossy the scrape was), length_intent_applied, truncated (fires when the synthesis LLM hit finish_reason=length; re-run with smaller intent or larger max_tokens). inspect:true short-circuits before the dispatcher / HELMDECK_FIRECRAWL_ENABLED / model-required checks — gateway-less, Firecrawl-less environments can plan a research call. InputSchema.Required narrowed from [query, model] to [query] so inspect-mode payloads omitting model aren't rejected by the engine validator (runtime model-required check is preserved for the generate path). OutputSchema.Required narrowed from [query, sources, synthesis, model] to [query] so inspect responses satisfy the validator.
  Eleven new sub-tests cover inspect short-circuit (no Firecrawl, no dispatcher) + each intent row mapping to the right Firecrawl limit + explicit-limit-wins back-compat + no-input default → 5 / "default" + finish-reason truncation + JIT-metric presence + resolver precedence at unit level + unknown-intent fallback. 1062 builtin tests pass with race detector clean. Reference doc updated with the heuristic table, precedence rules, and an inspect-mode response section.
Added
- slides.narrate adopts the JIT length-sizing convention (#525 umbrella, #530 follow-up). slides.narrate's relationship to the convention is unusual: the pack does NOT generate narration (notes come from the input markdown — typically prepared by slides.outline) and per-slide duration is dictated by the natural length of the TTS audio. So length_intent is observational + reporting rather than active sizing: the agent declares the density they expected, the pack measures what they actually got, and reports the gap so the agent can iterate on slides.outline if needed. New optional inputs: length_intent (summary / thorough / exhaustive), inspect:true, words_per_slide_min/max. Heuristic table: summary → 40-60 words per narrated slide (~16-24 sec at 150 wpm), thorough → 80-120 (~32-48 sec), exhaustive → 150-220 (~60-88 sec). Precedence: inspect:true short-circuit → explicit words_per_slide_min + max ("explicit") → length_intent ("intent:*") → no input → "default:reporting-only" (thorough's range used as the stats baseline so within/outside counts stay meaningful). New outputs on every generate response: source_words_per_slide_avg/min/max, narrated_slide_count, slides_within_intent_range, slides_outside_intent_range, length_intent_applied, truncated (fires when the engagement-metadata LLM hit finish_reason=length — the only gateway-dispatch call in the pack; TTS is HTTP-direct). inspect:true short-circuits before the session executor / vault checks — gateway-less and session-less environments can run a deck quality check without renderable resources. Silent slides (empty notes) are excluded from the average so intro/outro placeholders don't drag the density signal down. generateEngagement's signature gains a finish-reason return value: (map, string, error). OutputSchema.Required narrowed from [video_artifact_key, video_size, slide_count, total_duration_s, has_narration] to [slide_count] so inspect-mode responses (parse-only, no rendering) satisfy the engine validator.
  Eight new sub-tests; 1056 builtin / 1702 across consumers pass with race detector clean. Reference doc updated with the density table, precedence rules, and an explanation of why this adoption is observational rather than active.
Added
- hyperframes.compose adopts the JIT length-sizing convention (#525 umbrella, #529 follow-up). New optional inputs: length_intent (summary / thorough / exhaustive) and inspect:true. Unlike blog.rewrite_for_audience (#527) and podcast.generate (#533) — both of which scale by source word count — the compose pack picks a fixed duration from the intent table because the description is a planning instruction, not source material: summary → 60s (floor 30s, ceiling 120s), thorough (default for intent path) → 180s (120-360s), exhaustive → 600s (360-720s, matches hyperframes.render's 12-min cap). Precedence: inspect:true short-circuit → audio_url + duration_seconds ("explicit:audio-locked") → duration_seconds > 0 ("explicit") → length_intent set → legacy 8-sec default ("default:legacy-8sec", preserves back-compat — existing silent-micro-animation callers see ZERO behavior change). New outputs on every generate response: description_words, target_duration_sec_chosen, length_intent_applied, truncated (fires when the composition-HTML LLM hit finish_reason=length, signaling the assembled HTML may be incomplete — re-run with a richer description or smaller intent / larger max_tokens). inspect:true short-circuits before the dispatcher / model-required / audio-requires-duration checks — gateway-less and dispatcher-less environments can plan a composition without spending anything; an agent can also inspect with audio_url set even before measuring the audio duration. The model field is no longer in InputSchema.Required (was [description, model], now [description]) so inspect-mode payloads omitting model aren't rejected by the engine schema validator; runtime check still enforces model for the generate path. Thirteen new sub-tests cover inspect short-circuit + back-compat default + each intent row + numeric-overrides-intent + audio-locked precedence + finish-reason truncation + stop-finish-no-truncation + JIT-metric presence + resolver precedence + unknown-intent fallback. 1036 builtin tests pass with race detector clean.
  Reference doc updated with the heuristic table, precedence rules, and an inspect-mode worked example.
Added
podcast.generateadopts the JIT length-sizing convention (umbrella #525, pilot landed in v0.28.7'sblog.rewrite_for_audience, podcast follow-up #528). New optional inputs:length_intent(summary/thorough/exhaustive) andinspect:true. The pack measures the source it actually sees (script text in mode A,source_textin mode C-2, scraped text in mode C-1 after Firecrawl) and picks aduration_target_minfrom a heuristic table:summary→ reading time × 0.20, floor 1 min, ceiling 3 min;thorough→ × 0.50, 3 min, 8 min;exhaustive→ × 0.90, 6 min, 12 min. Reading time uses 150 wpm (matchesslides.narrate's caption pacing constant). Back-compat is strict: when neitherlength_intentnorduration_target_minis set, the pack falls back to today's 8-min default (length_intent_applied: "default:legacy-8min"); existing callers see ZERO behavior change.duration_target_minstill wins over intent when set (length_intent_applied: "explicit"); script mode reports"n/a:script"because the script's length is intrinsic. New outputs on every generate response:source_words,target_duration_min_chosen,actual_duration_min,length_intent_applied,truncated(fires onfinish_reason=lengthfrom the script-generation LLM).inspect:trueshort-circuits before the dispatcher / session / vault checks — gateway-less and session-less environments can plan podcast duration without spending anything; the model-required check and Firecrawl-enabled check are both skipped when inspecting.inspectdoes NOT scrapesource_url(the reason field tells the caller to call again without inspect to get a measured suggestion).podcast.GenerateScript's signature gains a finish-reason return value (([]Turn, string, error)); only 2 in-tree callers, both updated. Reference doc updated with the heuristic table, precedence rules, an inspect-mode worked example, and clarified mode-validation footnotes. 
Twelve new sub-tests cover inspect short-circuit + script-mode inspect + source_url-no-scrape + back-compat default + explicit-numeric-wins + JIT-metric-presence + the three intent rows + floor-clamp + unknown-intent-fallback + resolver precedence; 1078 builtin + podcast pkg tests pass with race detector clean.
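The heuristic table above reduces to "reading time times a ratio, clamped to a floor and ceiling". A minimal sketch of that arithmetic follows; the function and type names are hypothetical, and whether the real pack rounds the result to whole minutes is not stated in the entry, so the sketch leaves it fractional.

```go
package main

import (
	"fmt"
	"math"
)

// wordsPerMinute is the 150 wpm reading pace the entry cites.
const wordsPerMinute = 150.0

// intentRow is one row of the heuristic table (hypothetical type name).
type intentRow struct{ ratio, floorMin, ceilMin float64 }

var podcastIntents = map[string]intentRow{
	"summary":    {0.20, 1, 3},
	"thorough":   {0.50, 3, 8},
	"exhaustive": {0.90, 6, 12},
}

// targetDurationMin scales reading time by the intent ratio and clamps the
// result to the row's floor and ceiling. ok is false for an unknown intent;
// the pack's own unknown-intent fallback is not modeled here.
func targetDurationMin(sourceWords int, intent string) (minutes float64, ok bool) {
	row, found := podcastIntents[intent]
	if !found {
		return 0, false
	}
	readingMin := float64(sourceWords) / wordsPerMinute
	scaled := readingMin * row.ratio
	return math.Max(row.floorMin, math.Min(row.ceilMin, scaled)), true
}

func main() {
	for _, words := range []int{600, 1500, 3000} {
		m, _ := targetDurationMin(words, "thorough")
		fmt.Printf("%d source words, thorough: %.1f min\n", words, m)
	}
}
```

A 600-word source shows the floor in action: four minutes of reading time at the summary ratio is under a minute, so the 1-minute floor applies.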
Added
blog.rewrite_for_audience pack pilots the JIT length-sizing convention (#525 umbrella, #526 pilot). Calling agents no longer have to precompute a static word target in their AGENTS.md: the pack measures source_content, picks an output range from a declared length_intent (summary / thorough / exhaustive), and reports target_words_chosen / output_words / compression_ratio so the agent can see what scale it actually got. Heuristic table: summary → ratio 0.10, floor 300, ceiling 1200; thorough (default) → 0.30, 800, 2500; exhaustive → 0.55, 1500, 6000. The chosen range is injected into the system prompt as an explicit override of the persona's word-count guidance; without this the persona's "800-1200 words" silently out-voted a chosen exhaustive target of 3300-4400 and the JIT sizing had no visible effect. Two escape hatches alongside the intent path: explicit target_words_min + target_words_max (both must be set; partial falls through), and inspect:true, which short-circuits before any dispatcher use to return a suggestion without spending a model call (gateway-less deployments can use this path). New truncated boolean on every generate-mode output: strong signal is finish_reason=length from the gateway; fallback heuristic fires when output is within 95% of the upper target bound AND ends without sentence-terminating punctuation, so providers that don't expose finish_reason (Ollama doesn't always) still surface silent truncation. The motivating failure mode: long-form source documents getting compressed below the agent's static target's lower bound, which is a generic shape, not a one-off. Eleven new sub-tests cover inspect short-circuit + dispatcher-less inspect + intent scaling across each row + floor/ceiling clamps + numeric-override precedence + partial-numeric fall-through + prompt-target injection + finish-reason truncation + mid-sentence-heuristic truncation + back-compat metric presence; race detector clean.
Reference doc updated with the heuristic table, precedence rules, and an inspect-mode worked example. The same convention is opt-in for podcast.generate, hyperframes.compose, slides.narrate, content.ground, and research.deep as they're touched for unrelated work (no big-bang migration); per-pack adoption is tracked from the umbrella issue.
[0.28.6] - 2026-06-16
Changed
content.groundpack closes four gaps from the 2026-06-15 audit. (A) Memory cache (ADR 047 compliance): pack now declaresMemory: &packs.MemoryConfig{Cache: true, TTL: 24h, Category: "cache"}. Idempotent re-runs (same caller, same input bytes) get cached results instead of spending Firecrawl + LLM verify calls each time — matches the pattern github.go has used since v0.7. The TTL is 24 hours (vs github.go's 5-minute pattern) because source authority changes on a slow cadence. NOTE: the cache key is engine-derived from input bytes, so a typo-fix edit is still a miss — per-claim caching across edits would be a handler-internal layer, captured as audit follow-up. (B) Concurrent claim processing: per-claim Firecrawl search + LLM verify ran sequentially before, so a 12-claim post took 60-120s wall-clock even on a healthy stack. Refactored into three phases — Phase 1 fuzzy-locates findable claims (synchronous, fast), Phase 2 runs Firecrawl + verify under a bounded errgroup (SetLimit(4)), Phase 3 applies results to the document in original claim order. Patching stays sequential because each substitution can shift byte offsets of later claims; re-finding the span per-iteration handles that. Wall-clock drops to ~ceil(N/4)×(search+verify). (C) Fuzzy claim matching closes the silent-drop failure mode for Tier C extractors: the strictstrings.Containscheck dropped any claim whose text the LLM had normalized (double-space → single-space, soft-wrap newline → space, etc.) — even when the claim was real and the source was valid. NewfindClaimSpanhelper: exact substring first (fast path preserves existing behavior for the 95% case), whitespace-tolerant scan on miss. Splices the citation after the doc's ORIGINAL bytes, not the LLM's normalized variant, so the patched file matches the original prose. 
Smart-quote / em-dash / Levenshtein folding intentionally deferred — whitespace is by far the most common normalization the extractor LLM applies, and broader fuzziness widens the false-positive surface. (D) ADR 051 verifier migration: the per-source verifier was the last content.ground caller still using the legacyextractFirstJSONObjectfallback (the claim extractor had migrated earlier). Replaced withDecodeStructuredResponse, preserving the existing soft-degrade (parse failure → skip the claim, same as before). 7 new sub-tests cover the new helpers + memory cache declaration + a fuzzy-match end-to-end happy path; 2 existing tests updated to query-route their Firecrawl stubs (concurrent Phase 2 means non-deterministic call order); race detector clean.
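The exact-first, whitespace-tolerant-second lookup in (C) can be sketched as below. The function name findClaimSpan comes from the entry, but the body is an assumption: it swaps in a regexp built from the claim's whitespace-separated fields, which may differ from the scan the real helper uses. The returned offsets index the document's original bytes, so a citation can be spliced after the original prose rather than the LLM's normalized variant.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// findClaimSpan returns the [start, end) byte offsets of claim inside doc.
// Fast path: exact substring. Fallback: any run of whitespace in doc matches
// any run of whitespace in claim. Implementation is an illustrative guess.
func findClaimSpan(doc, claim string) (start, end int, ok bool) {
	fields := strings.Fields(claim)
	if len(fields) == 0 {
		return 0, 0, false
	}
	if i := strings.Index(doc, claim); i >= 0 {
		return i, i + len(claim), true
	}
	for i, f := range fields {
		fields[i] = regexp.QuoteMeta(f) // treat claim text literally
	}
	re, err := regexp.Compile(strings.Join(fields, `\s+`))
	if err != nil {
		return 0, 0, false
	}
	loc := re.FindStringIndex(doc)
	if loc == nil {
		return 0, 0, false
	}
	return loc[0], loc[1], true
}

func main() {
	doc := "eBPF programs  run\nin the kernel. More text."
	claim := "eBPF programs run in the kernel." // whitespace normalized by the LLM
	if s, e, ok := findClaimSpan(doc, claim); ok {
		fmt.Printf("%q\n", doc[s:e])
	}
}
```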
[0.28.5] - 2026-06-15
Fixed
podcast.generate's validation pass now actually finds the MP3 it's supposed to validate. TheConcathelper used torm -rf /tmp/helmdeck-podcastimmediately after reading the finalfinal.mp3bytes back to the control-plane process — and thenpodcast.generate's validation step (PR #515) tried toav-validate.sh --audio /tmp/helmdeck-podcast/final.mp3a fraction of a second later, gotexit 2 (file not found), soft-degraded into silent fallback (allow_silent_output:true's contract), and propagatedaudio_url=""through the entire scaffolded-narrated-video chain. Net effect was a 15-second silent MP4 with passed-validationconsistency:audio_video_durationflagging "could not probe (arate= aframes=)" — the validator was correctly reporting the bug; the bug was upstream. Empirically found 2026-06-15 chasing the v0.28.4 retest's122922a5661bcb63-video.mp4artifact, after av-validate.sh confirmed the file hadvcodec=h264 acodec=<empty>and no audio stream at all. ElevenLabs credentials, TTS API call, and voice IDs were all healthy — the MP3 was generated, briefly written, and then deleted before validation could see it. Fix: drop the post-readback cleanup in Concat (line 175-178). The session container's tmpfs is reclaimed when the session ends; the next Concat call already rm -rfs the tempdir at its step 1 (line 84). Net: no leak across sessions, no leak between Concat calls within a session, AND the file stays available for the in-call validation pass. NewTestConcat_DoesNotPostCleanupTempDirregression test pins the fix — counts post-readback rm -rf calls; trips loudly if the cleanup ever sneaks back in. 1042 tests pass across pipelines + packs/builtin (up from 1041 with the new regression test).
[0.28.4] - 2026-06-15
Fixed
builtin.scaffolded-narrated-videopipeline now produces a narrated video that's actually narrated at the operator-controllable target length. Two related misses landed in v0.28.0's pipeline (#512) and surfaced empirically on the 2026-06-15 eBPF retest run, when the chain reachedrendersuccessfully but produced a 9-second silent MP4 against an 11-minute generated narration. The pipeline (a) didn't threadpodcast.generate'saudio_urloutput tohyperframes.scaffold, so the scaffold used the upstream example's intrinsic 10-seconddata-durationand the rendered video had no audio track; and (b) didn't passduration_target_mintopodcast.generateat all, so the narration silently ran atpodcast.generate's 8-minute internal default instead of the operator's expected 60-second social-first target. Both gaps are closed:hyperframes.scaffoldgains anaudio_urlinput that fetches + stages the bytes in-sidecar and passes--audio=<path>tohyperframes init(upstream then embeds the<audio>element and alignsdata-durationto the audio length); the pipeline threadsaudio_urlfrompodcast.generate.output.audio_urlto scaffold, AND threads its own newduration_target_mininput through topodcast.generate(default unset → 8-minute fallback; pass1for 60-second social-first per the old AGENTS.md convention; max12for long-form). 5 new sub-tests on scaffold cover audio_url empty / happy-path / 404 / 200-with-empty-body / oversize. Discovered architecturally because the prior 4 patch releases were all about Tier C model output variance; this one is the pipeline's first real "design miss" — composition gaps in the input plumbing the original PR didn't fully wire.hyperframes.interpolate's content classifier now recognizes the decision-tree scaffold shape —<div class="node ...">,<div class="connector-label">,<div class="text-highlight">, and<span id="*-text">. 
Empirically found 2026-06-15 on the third eBPF retest after the v0.28.2 podcast-parser fix landed: hyperframes.scaffold succeeded, the agent's pipeline reached hyperframes.interpolate, but the pack rejected the scaffold with "no files in the scaffold matched a recognized content shape" because decision-tree's compositions/decision_tree.html uses sticky-note "node" boxes for its branching diagram (different element/class shape from swiss-grid's <h1> / <div class="stat-value"> patterns the classifier was originally calibrated against). The new patterns are word-boundary-anchored so existing swiss-grid / nyt-graph shapes still match. 8 new sub-tests cover the decision-tree node + connector-label + text-highlight + span-id-suffix-text shapes and the multi-class attribute preservation under splice. The known false-positive risk (\bnode\b matches inside compound class names like tree-node because - is a non-word character) is pinned by a test so a future tightening trips loudly.
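The false-positive risk noted above is easy to reproduce with Go's regexp package. In this small illustration, the pattern \bnode\b is taken from the entry; applying it directly to a class attribute value, and the hasNodeClass name, are assumptions about how the classifier uses it.

```go
package main

import (
	"fmt"
	"regexp"
)

// nodeClass is the word-boundary-anchored pattern quoted in the entry.
var nodeClass = regexp.MustCompile(`\bnode\b`)

// hasNodeClass reports whether a class attribute value matches the pattern.
// Hypothetical helper, for illustration only.
func hasNodeClass(classAttr string) bool {
	return nodeClass.MatchString(classAttr)
}

func main() {
	for _, class := range []string{"node primary", "tree-node", "nodes", "connector-label"} {
		fmt.Printf("%q -> %v\n", class, hasNodeClass(class))
	}
}
```

Because the hyphen is not a word character, a boundary exists between "-" and "node", so "tree-node" matches while "nodes" does not.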
[0.28.2] - 2026-06-15
Fixed
podcast.generate's script parser now also accepts multiple bare JSON objects in sequence (JSONL, whitespace-separated, or comma-separated {...}{...}{...} without [...] array brackets). Empirically found 2026-06-14 on the SECOND eBPF retest after the bare-single-object fix (v0.28.1) landed: gpt-oss-120b:free emitted ~10 sequential {"speaker":"Host","text":"..."} turns as JSONL instead of one array. The new Fallback C normalizes the }<whitespace and optional comma>{ boundary between sibling objects to },{ via regex and wraps the result in [ ... ] so the strict array parser succeeds. The earlier single-object fallback still fires on actual one-turn responses; well-formed array responses still take the fast path. Four new sub-tests cover JSONL / comma-separated / fenced multi-object / multi-object-with-preamble variants. Error message refined to "no JSON array, single object, or sequence of objects found in response" so the final failure mode is unambiguous.
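Fallback C as described (normalize sibling boundaries, wrap in brackets, hand off to the strict array parser) can be sketched in a few lines. The Turn fields mirror the speaker/text keys quoted above; the function name and the exact regular expression are assumptions, and preamble or code-fence stripping is not shown.

```go
package main

import (
	"encoding/json"
	"fmt"
	"regexp"
	"strings"
)

// Turn mirrors the {"speaker":..., "text":...} objects quoted in the entry.
type Turn struct {
	Speaker string `json:"speaker"`
	Text    string `json:"text"`
}

// siblingBoundary matches "}" + whitespace + optional comma + "{".
// The exact pattern the pack uses is an assumption.
var siblingBoundary = regexp.MustCompile(`\}\s*,?\s*\{`)

// parseObjectSequence turns a sequence of bare objects into a JSON array
// and decodes it with the strict parser.
func parseObjectSequence(raw string) ([]Turn, error) {
	joined := siblingBoundary.ReplaceAllString(strings.TrimSpace(raw), "},{")
	var turns []Turn
	if err := json.Unmarshal([]byte("["+joined+"]"), &turns); err != nil {
		return nil, fmt.Errorf("no JSON array, single object, or sequence of objects found in response: %w", err)
	}
	return turns, nil
}

func main() {
	jsonl := "{\"speaker\":\"Host\",\"text\":\"Welcome.\"}\n{\"speaker\":\"Guest\",\"text\":\"Thanks.\"}"
	turns, err := parseObjectSequence(jsonl)
	fmt.Println(len(turns), err)
}
```

One caveat of any regex-level normalization like this: a "}" followed by "{" inside a string value would also be rewritten, so the sketch trades a small risk of altering turn text for a parser that needs no tokenizer.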
[0.28.1] - 2026-06-14
Fixed
podcast.generate's script parser now accepts a bare single JSON object ({"speaker":"...","text":"..."}) as a valid one-turn script, not just a [...] array. Closes a Tier C failure mode found empirically 2026-06-14 running openai/gpt-oss-120b:free through the new builtin.scaffolded-narrated-video pipeline: the model emitted one object instead of an array (semantically a valid one-turn script, just missing the array wrapping), and the parser returned "no JSON array found in response". Three new sub-tests cover the bare-object fallback (raw / with prose preamble / fenced in ```json). No behavior change for the array path; existing tests pass unchanged.
[0.28.0] - 2026-06-14
The scaffold-mode video release. Closes the architectural arc surfaced empirically by the morning's 🎬 concept-animator retest against openai/gpt-oss-120b:free (rendered MP4 was structurally correct but visually flat — text on a black background, because asking a Tier C model to invent HTML/CSS/GSAP from scratch asks it to do the one thing Tier C reliably can't). Same evening, the architecture is rebuilt to borrow visual creativity from upstream's 140+ example catalog: the LLM's job becomes content interpolation, not design invention. Two original assumptions (#503 Path A: stitched HTML; first cut of #503 Path B: scaffold-mode in hyperframes.compose) were both surfaced + discarded mid-implementation when empirical scaffold inspection revealed multi-file structure (sub-compositions referenced by data-composition-src paths, JS TRANSCRIPT arrays in captions.html, A-roll slot in index.html) — the right shape was a 4-pack family matching helmdeck's existing decomposition pattern. Seven PRs over a single 2026-06-14 evening session: #506 ships scripts/hyperframes-init.sh inside helmdeck-sidecar-hyperframes plus a new CONTRIBUTING.md principle "prefer the upstream CLI over custom Go" (saved as [[feedback-upstream-cli-takes-precedence]] for future pack design); #507 pivots the script's output contract from "stitched HTML" to "gzipped project tarball" before any caller depends on it; #508 gives hyperframes.render a new project_artifact_key input alongside the existing composition_html (mutually exclusive, fully backward-compatible) so it consumes the project-shape upstream natively expects; #509 ships the new hyperframes.scaffold pack (picks an upstream --example, returns project_artifact_key + editable_slots manifest); #510 ships hyperframes.interpolate (pure-Go in-process tarball manipulation, per-file LLM rewriting for HTML text slots + JS TRANSCRIPT, tier-aware prompts, soft-degrade on per-file failure); #511 ships hyperframes.attach_asset (content-addressed asset 
embedding for A-roll image/video, videos emit muted per upstream convention, URL fetch deferred); #512 ships builtin.scaffolded-narrated-video — the sibling pipeline to builtin.prompt-narrated-video that wires podcast.generate → hyperframes.scaffold → hyperframes.interpolate → hyperframes.render. The 2026-06-14 blog post When agent-instruction docs drift from upstream spec (upstream-spec-drift) released yesterday tells the docs-layer prologue to this story; today is the implementation-layer chapter.
Operator upgrade: clean — no schema migrations, no removed packs, no breaking input changes. The additions:
- Three new packs (hyperframes.scaffold v1, hyperframes.interpolate v1, hyperframes.attach_asset v1): additive; existing hyperframes.compose freeform mode untouched and continues to work for callers who want raw HTML control.
- One new pipeline (builtin.scaffolded-narrated-video): additive; the existing builtin.prompt-narrated-video continues to work unchanged. Pipeline count is now 22.
- hyperframes.render gains a project_artifact_key input alongside the existing composition_html (mutually exclusive; pass exactly one). Existing pipelines / callers passing composition_html see no behavior change.
- helmdeck-sidecar-hyperframes bumped to HYPERFRAMES_VERSION=0.6.97 (was 0.6.7), auto-pulled on first use of the new packs. Sidecar image is auto-rebuilt on main pushes and already shipped to GHCR.
- New CONTRIBUTING.md principle "Prefer the upstream CLI over custom Go" (item 7 of "What makes a good pack") documents the architectural lesson for future pack contributions.
For Tier-C-targeted agents (gpt-oss-120b:free, gemma, smaller open-weight): update your AGENTS.md to call builtin.scaffolded-narrated-video (provide description + example) instead of builtin.prompt-narrated-video. The scaffolded pipeline borrows upstream's polished visuals so the model only does content interpolation — visually-rich output reliably, where the freeform compose path collapses to text-on-black. Common example picks by intent: swiss-grid (general explainer), decision-tree (flow diagrams + traces), code-snippet-dark-modern (technical content), kinetic-type (typography focus), nyt-graph (data viz). For Tier-A agents (Claude Sonnet/Opus, GPT-4-class): the freeform builtin.prompt-narrated-video path is still the right tool — the model authors HTML from scratch with full creative control.
Added
scripts/hyperframes-init.shand a CONTRIBUTING.md "prefer the upstream CLI over custom Go" principle, executing the first half of the architectural refinement on #503. The script wrapshyperframes init --example=<x>insidehelmdeck-sidecar-hyperframesand emits a gzipped tarball of the scaffolded project directory; it's the session-exec target the upcominghyperframes.composescaffold-mode change will call viaec.Exec, matching theav-validate.sh/hyperframes_render.go:276pattern. Empirically grounded: the 140+ example catalog enumerated viahyperframes init's registry is the upstream-authoritative source of visual creativity — Tier C models (gpt-oss-120b:free, gemma) will only need to do content interpolation, not design invention. No caller changes yet; the script is dormant until subsequent PRs wire the compose handler to invoke it.helmdeck-sidecar-hyperframespinsHYPERFRAMES_VERSION=0.6.97(was 0.6.7) — the upstream renamed--templateto--exampleand added--non-interactive"for CI/agents" between those versions, both of which the new script depends on. Image smoke test now also assertshyperframes init --helpsucceeds and/usr/local/bin/hyperframes-init.shis executable.hyperframes.rendergains aproject_artifact_keyinput field alongside the existingcomposition_html(mutually exclusive — pass exactly one). When provided, render downloads the gzipped tarball from the artifact store, extracts it under/tmp/helmdeck-hf/, and runshyperframes render <project-dir>against the multi-file scaffold the framework natively expects (index.html+compositions/*.html+assets/+hyperframes.json). This is the consumer side of #503's Path B refactor — paired with the newhyperframes.scaffoldpack (below) and upcominginterpolate/attach_assetpacks, which produce the tarball this pack consumes. Backward-compatible: existing callers passingcomposition_htmlcontinue to work unchanged. 
Schema, error-mapping, and the 17 existing tests are untouched; 7 new tests cover both inputs missing/both set/happy-path/store-miss/tar-extract-fail/missing-index/empty-artifact.- New
builtin.scaffolded-narrated-videopipeline — ties the four scaffold-mode packs together:podcast.generate(narration) →hyperframes.scaffold(picks an upstream example likeswiss-grid/decision-tree/code-snippet-dark-modern/kinetic-type/nyt-graph/tiktok-follow— 140+ in the catalog) →hyperframes.interpolate(LLM rewrites visible text + caption transcript to fit the topic) →hyperframes.render(project tarball → MP4). Sibling tobuiltin.prompt-narrated-video— same narration + render halves, different compose strategy.prompt-narrated-videoasks the LLM to author HTML from scratch (great on Tier A, visually-flat on Tier C);scaffolded-narrated-videoborrows upstream's polished examples so Tier C produces visually-rich output reliably. Inputs:description+example(both required),resolution+aspect_ratio(optional, threaded to scaffold + render). For an A-roll image, chainimage.generate+hyperframes.attach_assetbetween interpolate and render manually — the pipeline doesn't automate this (no conditional-step support in v1). Closes the four-pack #503 Option C refactor — see issue #503 for the full architectural arc from Path A (stitched HTML) → Path B (project artifact) → Option C (4-pack split + pipeline). - New
hyperframes.attach_assetpack — third (optional) link in the scaffold-based video pipeline. Takes aproject_artifact_key(from scaffold or interpolate) + anasset_artifact_key(fromimage.generate,stock.search, or any pack that uploaded an image/video to the artifact store), embeds the asset bytes atassets/aroll-<sha256-prefix>.<ext>in the project tarball, and modifiesindex.htmlto reference the asset from the target div (default#short_mag_cut_frame, matching upstream's canonical A-roll slot id). Returns a newproject_artifact_keyready forhyperframes.render. Supportsimage/{png,jpeg,gif,webp,svg+xml}andvideo/{mp4,webm,quicktime}content types (50 MiB cap). Videos are emitted withmutedper upstream'sAGENTS.mdconvention. Asset filenames are content-addressed so identical asset bytes produce the same path — convenient for dedup across chained pipelines. Pure-Go in-process (likeinterpolate): no SessionSpec, no dispatcher, just the artifact store. URL fetching is intentionally not supported in v1 — chainhttp.fetchupstream if your asset is URL-only; keeps the pack focused. 18 tests cover input validation, store-miss / empty / oversize / unsupported-type rejection, missing-index, target-not-found, image happy-path, video happy-path withmutedassertion, customtarget_id, leading-#canonicalization, content-addressed filename dedup, andspliceAssetIntoTargetunit (image / video / no-match / preserves-div-attrs). - New
hyperframes.interpolatepack — second link in the scaffold-based video pipeline. Takes aproject_artifact_key(fromhyperframes.scaffold) plus a userdescription+model, runs LLM passes percompositions/*.htmlfile to rewrite the visible text content so it fits the topic, re-uploads the modified project as a newproject_artifact_key. Auto-detects two content shapes per file: HTML text slots (<h1>,<h2>,<h3>,<div class="stat-value">,<div class="stat-label">) get on-topic text substituted via a numbered-slots LLM format, and the JSTRANSCRIPTword array (incaptions.html) gets regenerated with timing aligned toduration_secondsat a 150 wpm cadence. Other files pass through unchanged. Pure-Go in-process tarball manipulation (archive/tar+compress/gzip) — no SessionSpec, noec.Exec, just dispatcher + artifact store. Soft-degrades on per-file LLM failure (skipped files are surfaced infiles_skipped; the whole call only fails when ZERO files got rewritten). Tier-aware prompts viallmcontext.BudgetFor. 23 tests cover input validation, content classification (transcript vs text-slots vs unknown), text-slot extract/splice round-trip, numbered-slot parsing (strict / out-of-order / extras), transcript parsing (strict JSON / lenient JS-keys / empty rejection), tarball roundtrip, the happy-path multi-file rewrite end-to-end, and the no-recognized-shape rejection path. - New
hyperframes.scaffoldpack — first link in the scaffold-based video pipeline. Picks one of upstream HyperFrames' 140+ pre-built examples (swiss-grid,decision-tree,code-snippet-dark-modern,kinetic-type,vignelli,tiktok-follow, etc.), runshyperframes init --example=<name>insidehelmdeck-sidecar-hyperframes, uploads the resulting project tarball to the artifact store, and returns aproject_artifact_keyplus aneditable_slotsmanifest naming whichcompositions/*.htmlfiles the upcominghyperframes.interpolatepack will rewrite. This is the first concrete pack from #503's Option C architectural decision: instead of folding scaffold-mode intohyperframes.compose(creating a multi-headed pack with split output schemas), the scaffold-based path becomes its own family of small composable packs —scaffold→interpolate→attach_asset→render— matching helmdeck's existing pattern (slides.outline + slides.render + slides.narrate, podcast.generate + image.generate + stock.search).hyperframes.composefreeform mode stays untouched for callers who want full HTML control. 15 tests cover input validation, resolution/aspect-ratio matrix, script exit-code mapping (caller-fix vs handler-failed), tarball-shape edge cases (empty / cat failure / leading-.// directory entries), and artifact upload round-trip.
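The content-addressed naming that hyperframes.attach_asset uses (assets/aroll-<sha256-prefix>.<ext>) can be illustrated as follows. This is a sketch under assumptions: the 12-character prefix length, the content-type to extension mapping, and the assetPath name are hypothetical. Only the path shape, the supported content types, and the 50 MiB cap come from the entry above.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// maxAssetBytes is the 50 MiB cap from the entry.
const maxAssetBytes = 50 << 20

// extByType maps the supported content types to file extensions.
// The extension choices are assumptions.
var extByType = map[string]string{
	"image/png":       "png",
	"image/jpeg":      "jpg",
	"image/gif":       "gif",
	"image/webp":      "webp",
	"image/svg+xml":   "svg",
	"video/mp4":       "mp4",
	"video/webm":      "webm",
	"video/quicktime": "mov",
}

// assetPath derives the in-tarball path from the asset bytes, so identical
// bytes always map to the same filename.
func assetPath(data []byte, contentType string) (string, error) {
	ext, ok := extByType[contentType]
	if !ok {
		return "", fmt.Errorf("unsupported content type %q", contentType)
	}
	if len(data) == 0 {
		return "", fmt.Errorf("empty asset")
	}
	if len(data) > maxAssetBytes {
		return "", fmt.Errorf("asset exceeds %d bytes", maxAssetBytes)
	}
	sum := sha256.Sum256(data)
	prefix := hex.EncodeToString(sum[:])[:12] // prefix length is an assumption
	return fmt.Sprintf("assets/aroll-%s.%s", prefix, ext), nil
}

func main() {
	p, err := assetPath([]byte("example image bytes"), "image/png")
	fmt.Println(p, err)
}
```

Deriving the name from the bytes rather than from a counter or timestamp is what makes repeated attaches of the same asset land on the same path.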
Changed
scripts/hyperframes-init.shswitched its output contract from "emit stitched composition HTML" to "emit a gzipped tarball of the scaffolded project directory" before any caller depended on it. The empiricalhyperframes initscaffold is a multi-file project —index.html,compositions/*.html(with the caption-transcript word-timing array),assets/,hyperframes.json,package.json— and the sub-compositions are referenced bydata-composition-srcpaths, not inlinable into a single HTML blob. This is the Path B branch of issue #503's plan:hyperframes.renderwill gain aproject_artifact_keyinput that consumes the tarball natively (next PR in the chain). Bonus: the LLM content-interpolation step (PR 4) now operates on the structured TRANSCRIPT array incompositions/captions.html, not on regex-extracted HTML slots — a richer surface for word-level timing-preserving rewrites.
[0.27.1] - 2026-06-14
The video-pack hardening release. Closes the concept-animator empirical arc that began with PR #497's gpt-oss-120b:free recipes and ran through six follow-on PRs (#499, #500, #501, #502, #504) over a single 2026-06-14 session driven by an empirical first-run session against openai/gpt-oss-120b:free. Four pack-level changes on hyperframes.compose close foot-guns the session surfaced: (1) audio_url now requires explicit duration_seconds (#498 → #499, silent-truncation bug closed); (2) duration-band-aware engagement metadata generation mirrors podcast.generate's metadata_model pattern (#500, short_form / mid_form / long_form payload shapes); (3) blank-screen guard via timeline-coverage validation and tier-aware system prompts (#502); (4) track-index collision check + upstream-sourced rule rewrites + comprehensive integration guide derived from the actual upstream HyperFrames AGENTS.md / SKILL.md / hyperframes-student-kit (#504, after the #502 best-practices doc turned out to be synthesis-without-citation — the lesson is captured in the new 2026-06-14 blog post When agent-instruction docs drift from upstream spec). Companion: gpt-oss-120b:free concept-animator + slide-narrator recipes (#497, updated in #501 to drive the new pack capabilities) demonstrate end-to-end free-tier video chains. Strategic-direction issue #503 proposes a template.fetch pack to surface upstream reference repositories as composition seeds for future releases. SEO Tier 1 + Tier 2 discoverability pass (#494, #495) addresses the GSC "Discovered – currently not indexed" bucket at the docs-site layer. HuggingFace platform epic Phase 4 expanded with the consume/publish Track A/B split (#490, companion blog post promoted from draft).
Operator upgrade: clean — no schema migrations, no breaking input changes, no removed packs. Three opt-in additions:
- metadata_model field on hyperframes.compose (string-ptr; default openrouter/auto; pass "" to disable; pin a free model for end-to-end free-tier discipline)
- Composition timeline-coverage + track-index collision pre-checks (reject at compose-time with the upstream rule cited; existing recipes that already used the upstream patterns are unaffected)
- Tier-aware system prompt selection via llmcontext.BudgetFor(model): Tier C gets the verbose verbatim-rule prompt; Tier A/B gets a lean prompt referencing the best-practices guide
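The string-ptr shape of metadata_model is what lets "omitted", "explicitly empty", and "pinned" mean three different things. A minimal sketch of that tri-state follows; resolveMetadataModel is a hypothetical name, and only the openrouter/auto default and the empty-string opt-out come from the list above.

```go
package main

import "fmt"

// defaultMetadataModel is the documented default when the field is omitted.
const defaultMetadataModel = "openrouter/auto"

// resolveMetadataModel interprets a pointer-typed input field:
//   nil      -> field omitted, use the default model
//   ""       -> caller opted out, skip engagement metadata
//   other id -> pin engagement metadata to that model
func resolveMetadataModel(field *string) (model string, enabled bool) {
	switch {
	case field == nil:
		return defaultMetadataModel, true
	case *field == "":
		return "", false
	default:
		return *field, true
	}
}

func main() {
	pinned := "openrouter/openai/gpt-oss-120b:free"
	optOut := ""
	for _, f := range []*string{nil, &optOut, &pinned} {
		model, enabled := resolveMetadataModel(f)
		fmt.Printf("enabled=%v model=%q\n", enabled, model)
	}
}
```

A plain string field could not distinguish the first two cases, which is why the pointer is needed for an opt-out that differs from the default.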
Operators driving the concept-animator recipe should refresh their AGENTS.md from the updated docs/howto/per-model-agents/gpt-oss-120b-concept-animator.md to pick up the speakers map default, free-tier metadata_model pinning on both podcast.generate and hyperframes.compose, and the engagement payload surfacing in OUTPUT FORMAT.
- HuggingFace epic (#490) Phase 4 expanded: Track A (consume via
hf-space-invoke) + Track B (publish viahf-space-create/update/deletetrio). Track B framed around operator self-service — any helmdeck workflow becomes a hosted UI under the operator's HF account — with scoped tokens, default-private semantics, per-deployment consent flow, quota caps, and mandatory delete pairing as the security envelope. - Companion blog post HuggingFace isn't just another LLM router — it's a platform helmdeck barely uses promoted from
draft: trueafter the Phase 4 expansion synced the post's framing with the epic. - SEO discoverability pass addressing the GSC "Discovered – currently not indexed" bucket (validation started 2026-05-13, ~61 pages stuck). Adds
description:frontmatter to 73 docs (50 ADRs via bulk script + 23 hand-crafted non-ADR pages), de-orphans previously zero-inbound ADRs via## Related ADRssections in 5 hub docs (PACKS.md,integrations/SKILLS.md,integrations/openclaw.md,howto/multi-model-recovery.md,RELEASES.md), adds OpenGraph site-wide defaults (og:type,og:site_name,og:locale) indocusaurus.config.ts, adds per-page Article + BreadcrumbList JSON-LD via theme swizzles atsrc/theme/BlogPostPage/andsrc/theme/DocBreadcrumbs/, and bumps sitemap default priority 0.5 → 0.6 for/adrs/*,/reference/*,/howto/*. Tier 3 (manual GSC "Validate Fix" re-click + per-URL inspection submits) remains an operator action. - Tier 2 SEO follow-on: per-blog-post OpenGraph cards + homepage "Recently shipped" section. New
website/scripts/generate-og-cards.mjsrenders 1200×630 PNG cards via@resvg/resvg-js(manual SVG template — no JSX/satori dep) with helmdeck branding + wrapped title + tags + date; 25 non-draft posts get unique cards understatic/img/og/<slug>.pngand their frontmatterimage:field updated. Newwebsite/scripts/generate-recent-data.mjsruns at build time, writes top 8 recent posts tosrc/data/recent.json; homepage renders them in a new card grid below the Diátaxis quadrants — gives newest content the highest-PageRank inbound link the site has to offer. Generator documented asnpm run og:generate(ad-hoc; not on CI to avoid the heavier toolchain). - Two new per-model agent recipes for
openai/gpt-oss-120b:freecovering video workflows (#496):docs/howto/per-model-agents/gpt-oss-120b-concept-animator.mddrives a 5-callpodcast.generate→hyperframes.compose→hyperframes.render→av.validate→artifact.verify_manifestchain, with the AGENTS.md template hardened against several empirical foot-guns observed in a 2026-06-13 first-run session: the requiredspeakers: {Narrator: "21m00Tcm4TlvDq8ikWAM"}map onpodcast.generate(omitting it triggers an infinite retry loop),model+metadata_modelpinning toopenrouter/openai/gpt-oss-120b:freeonpodcast.generateplusmodelonhyperframes.compose(end-to-end free tier — the defaultmetadata_model: "openrouter/auto"would route engagement metadata to PAID), and theduration_secondsdata-flow constraint matchingpodcast.generate'sduration_s(without this the compose pack used to silently truncate the rendered MP4 to 8s — fix in #498 makes the pack rejectaudio_urlwithout an explicitduration_secondsgoing forward). The companiondocs/howto/per-model-agents/gpt-oss-120b-slide-narrator.mddrives a singlehelmdeck__pipeline-runcall selecting the rightbuiltin.research-narrate/builtin.grounded-narrate/builtin.repo-presentationpipeline by input type. Both recipes use the sanitized Maya security-researcher persona, embed AGENTS.md templates in the Objectives + Constraints + Success-Criteria-as-Invalidation-Rules style the gpt-oss profile prefers, and document thehelmdeck-trace extractcommand for capturing empiricalcommunity_traces[]entries in a follow-on PR. - Bug fix:
hyperframes.composenow rejects calls that provideaudio_urlwithout an explicit positiveduration_seconds(#498). The previous behavior defaultedduration_secondsto 8s — which is correct for silent micro-animations but silently truncated narration tracks longer than 8 seconds in chainedpodcast.generate→hyperframes.compose→hyperframes.renderworkflows. The 8s default still applies to genuinely-silent compositions; the new validation only fires whenaudio_urlis non-empty. Reference doc (docs/reference/packs/hyperframes/compose.md) updated to markduration_secondsas conditional-required whenaudio_urlis set. Empirical repro from a 2026-06-13 session driving the concept-animator recipe (PR #497) againstopenai/gpt-oss-120b:free: an 88.58s podcast became an 8s video, withav.validate'sconsistency:audio_video_durationcheck passing trivially (both clipped together). - New:
hyperframes.composegains opt-in engagement-metadata generation mirroringpodcast.generate'smetadata_modelpattern. A string-ptr-shapedmetadata_modelinput (defaultopenrouter/auto;""opts out; any model id pins to that model) triggers a second gateway LLM call after composition success that produces a duration-band-aware engagement payload:short_formshape (<60s; title / hook / hashtags / caption / thumbnail_prompt for TikTok / Shorts / Reels),mid_form(60–179s; addssocial_blurbfor Twitter / LinkedIn-native), orlong_form(≥180s; adds YouTube-shapeddescription/chapters/tags/hook_30s/category). The payload lands as the newengagementoutput object andengagement_artifact_key(stable key to a JSON sidecar athyperframes.compose/engagement.json). Generation failures soft-degrade: the composition still succeeds, the engagement field is just absent. Reference doc (docs/reference/packs/hyperframes/compose.md) gains an Engagement metadata section with the per-band shape table. Empirical motivation: the 2026-06-13 concept-animator session produced an 88-second narrated MP4 that had no accompanying title / hashtags / thumbnail prompt — operators had to hand-author all of them. Mirrors the existingpodcast.generateandslides.narrateengagement patterns rather than introducing a new shape. - Concept-animator howto (
`docs/howto/per-model-agents/gpt-oss-120b-concept-animator.md`) updated to drive the new PR #500 capability: AGENTS.md template now passes `metadata_model: "openrouter/openai/gpt-oss-120b:free"` to `hyperframes.compose` (keeping engagement gen on the free tier) and the invalidation rules require it. OUTPUT FORMAT section adds the `engagement.format` / `title` / `hashtags` / `thumbnail_prompt` + `engagement_artifact_key` surfacing requirements (with the YouTube-shaped extras when `long_form`). "What to capture" metrics table adds `engagement_payload_surfaced` + `engagement_format_correct`, and `cost_discipline_observed` now checks four model fields instead of three.
- Blank-screen guard + tier-aware system prompt on
`hyperframes.compose`. Closes a quality gap surfaced by the 2026-06-13 concept-animator session, where the rendered 8-second MP4 hit a 2+ second black run that `av.validate` warned on but the chain didn't surface as a failure. Two changes: (1) the pack now inspects the composition's `class="clip"` element intervals at compose-time and rejects (`CodeInvalidInput`) when their union leaves a gap longer than `min(2.0s, duration * 0.05)`; the gap range and suggested fix (add a permanent background element) are cited in the error message. (2) The system prompt is now tier-aware via the existing `llmcontext.BudgetFor(model)` registry: Tier C (free / weak open models) gets a constraint-heavy compact prompt with the timeline-coverage rule inlined verbatim; Tier A/B gets a leaner prompt that trusts the model and references the new HyperFrames composition best practices guide. The best-practices doc covers visual hierarchy (one focal element per ~3s), type-on-screen rules (≥60px, ≥1.5s read time), pacing, color choices, GSAP transition patterns that play well, audio-aware composition for narrated chains, and a common-failure-modes table. Reference doc (`docs/reference/packs/hyperframes/compose.md`) gains "Timeline coverage" and "Tier-aware system prompt" sections.
- Upstream-spec alignment on
`hyperframes.compose` (PR #504) after the PR #502 best-practices guide turned out to be largely synthesis-without-citation. Three coupled changes: (1) new pack-side validation `composeTrackCollision` that rejects compositions where two `class="clip"` elements share an integer `data-track-index` AND temporally overlap; this is an upstream HyperFrames hard rule per the actual `AGENTS.md` (track-index is a non-linear-editor row index, NOT a CSS z-index; spatial layering happens via CSS z-index entirely separately). (2) Both Tier C and Tier A/B system prompts rewritten with upstream-sourced rules verbatim: layout-first pattern (write the static hero frame in flex/gap/padding before any GSAP), `gsap.from()` / `tl.to()` entrance-exit convention, track-index temporal-exclusion rule, audio `data-volume` is immutable (volume tweens silently ignored), DETERMINISTIC ONLY with PRNG seeding option. (3) Best-practices doc (`docs/reference/packs/hyperframes/best-practices.md`) completely rewritten as a helmdeck integration guide for the upstream HyperFrames project. It cites the upstream `AGENTS.md` / `SKILL.md` / `hyperframes-student-kit` throughout; covers the seven-step pipeline (Capture → Design → Script → Storyboard → VO+Timing → Build → Validate), the full attribute vocabulary (`data-media-start`, `data-composition-src`, `data-variable-values`, `data-layout-allow-overflow`, `data-layout-ignore`), the upstream reference template catalog (`warm-grain`, `swiss-grid`, `play-mode`, `vignelli`, `product-promo`, `nyt-graph`, `decision-tree`, `kinetic-type`), WebGL shader transitions with optimal duration ranges, audio-reactive pre-extracted FFT pattern, ARM64 deployment escape hatch (`PRODUCER_FORCE_SCREENSHOT=true`), and React migration constraints; and explicitly marks helmdeck-specific guidance separately. Companion blog post "When agent-instruction docs drift from upstream spec" (2026-06-14, draft) captures the epistemic lesson. Companion issue #503 proposes a `template.fetch` pack to surface upstream reference repositories as composition seeds.
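The two compose-time clip checks described in the entries above (timeline coverage and track collision) can be sketched roughly as follows. This is an illustrative Python sketch, not the pack's implementation: the `Clip` type and both function names are hypothetical, and two behaviors are assumptions rather than documented facts, namely that coverage is measured over the whole `[0, duration]` range (leading and trailing gaps count) and that clips which merely touch do not collide.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Clip:
    """Hypothetical model of one class="clip" element (times in seconds)."""
    start: float
    end: float
    track: int  # the element's data-track-index


def max_allowed_gap(duration: float) -> float:
    # Threshold quoted in the changelog entry: min(2.0s, duration * 0.05).
    return min(2.0, duration * 0.05)


def find_uncovered_gap(clips: List[Clip], duration: float) -> Optional[Tuple[float, float]]:
    """Return the first gap in the union of clip intervals above the threshold.

    Assumption: leading and trailing gaps count, not only gaps between clips.
    """
    limit = max_allowed_gap(duration)
    covered_until = 0.0
    for clip in sorted(clips, key=lambda c: c.start):
        if clip.start - covered_until > limit:
            return (covered_until, clip.start)
        covered_until = max(covered_until, clip.end)
    if duration - covered_until > limit:
        return (covered_until, duration)
    return None


def find_track_collision(clips: List[Clip]) -> Optional[Tuple[Clip, Clip]]:
    """Return a pair of clips that share a track index and overlap in time.

    Assumption: clips that only touch (prev.end == next.start) do not collide.
    """
    by_track: Dict[int, List[Clip]] = {}
    for clip in sorted(clips, key=lambda c: c.start):
        by_track.setdefault(clip.track, []).append(clip)
    for track_clips in by_track.values():
        # With clips sorted by start, any overlap shows up between neighbours.
        for prev, nxt in zip(track_clips, track_clips[1:]):
            if nxt.start < prev.end:
                return (prev, nxt)
    return None
```

For an 8-second composition the threshold works out to 0.4s, so two clips covering 0–3s and 5.5–8s would be rejected, while adding a full-length background clip on its own track closes the gap.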
[0.27.0] - 2026-06-10
The per-model profiles + audit-callback release. Closes a major arc: empirically validated that per-use-case AGENTS.md hardening is the load-bearing layer for reliable agentic behavior on Tier C models (PR #481 → PR #484 Nemotron baseline-vs-hardened A/B: 24 calls / 0 deposit → 7 calls / deposit + verify with all_present: true). The 5-profile prompting library ships (#464 Phase 1: gpt-oss-120b, gemma-4-26b-a4b-it, llama-3.3-70b, nemotron-3-super-120b-a12b, qwen3-coder), with the multi-provider YAML schema accepting non-OpenRouter routes (huggingface / together / groq / cerebras / sambanova / custom) and the first HF Inference Providers template as a community-contribution starting point per #482. The audit-callback pattern (#461) gets its anchor pack with artifact.verify_manifest; the typed artifact store (artifact.put / .get / .list) replaces prose-instruction deposit guidance that Tier C models silently ignore. Companion infrastructure: helmdeck-trace CLI for community_traces[] extraction, configure-openclaw.sh canonical 4-file workspace seeds, personalize-an-openclaw-agent howto, canonical file roles section in integrations/openclaw.md §5d, audit + persona-leak fix of skills/helmdeck/SKILL.md. Catalog grows from 53 → 57: four new artifact packs. Strategic direction: HuggingFace integration epic (#490) frames 6 phases beyond routing layer (Datasets, Embeddings, Spaces, Tokenizers, Self-hosted runtime patterns).
New packs: artifact.put / artifact.get / artifact.list (PR #450) + artifact.verify_manifest (PR #462, audit-callback anchor) — all available as helmdeck__artifact-* MCP tools, no AI gateway required.
Operator upgrade: clean — no schema migrations, no breaking input changes, no removed packs. New provider: union for models/*.yaml is backwards-compatible (existing provider: openrouter files need no changes; convention for new files going forward is models/<provider>-<model>.yaml). New --seed-canonical-layout and --force-overwrite flags on configure-openclaw.sh; existing --seed-identity flag preserved as alias. Re-run ./scripts/configure-openclaw.sh after upgrade to refresh the v0.27.0-stamped skill (catalog grew from 53 → 57 packs).
Added
- HuggingFace integration epic (#490) + companion strategic-direction blog draft. PR #489 added HF Inference Providers as alternative LLM routing — multi-provider YAML schema, first HF template profile, routing setup walkthrough, CI validation gate. This filing reframes the broader opportunity: HuggingFace isn't just another LLM router; it's a platform spanning 100K+ datasets, embeddings APIs, Spaces hosting demos, tokenizers, fine-tuning hooks. Helmdeck currently uses zero of those beyond the routing-only integration. Epic #490 enumerates six phases — (1) Inference Providers (foundation, mostly shipped via #489; acceptance pending community-contributed
`community_traces[]` entry per #482); (2) Datasets integration (two new packs `helmdeck__hf-dataset-search` + `helmdeck__hf-dataset-stream` for domain-corpus grounding in `content.ground` / `research.deep` beyond Firecrawl scraping); (3) Embeddings + similarity (`helmdeck__hf-embeddings` pack for sentence-transformers / cross-encoder embeddings + `helmdeck.memory_store` integration for semantic recall beyond key/value-only lookups); (4) Spaces (`helmdeck__hf-space-invoke` for remote demo-endpoint composition with explicit security review for arbitrary remote code invocation); (5) Tokenizers (`helmdeck__hf-tokenize` pack for accurate per-model token counting + per-model profile YAML schema gains optional `tokenizer:` field for context-engine budgeting); (6) Self-hosted runtime patterns (expanded vLLM / TGI / SGLang walkthroughs at `docs/howto/self-host-with-*.md` + per-engine `tool_parser:` field guidance with Nemotron's `qwen3_coder` parser as the canonical example + `deploy/docker/sidecar-vllm.Dockerfile` patterns). Each phase ships acceptance criteria; ordering is community-driven; phases 1-4 are independent (any order); phases 5-6 build on earlier work. Companion blog draft at `website/blog/2026-06-10-huggingface-as-a-first-class-platform.md`: 600-word strategic-direction framing per CLAUDE.md draft-on-finding norm. Uses sanitized Maya security-research persona for worked examples per the standing memory rule. `draft: true` until at least Phase 2 ships so the post has concrete deliverables to reference beyond strategic framing. Empirical motivation anchored in PR #481 → PR #484 Nemotron baseline-vs-hardened A/B (24 calls / 0 deposit → 7 calls / deposit + verify with `all_present: true`): per-use-case AGENTS.md hardening is the lever regardless of platform; HuggingFace gives helmdeck more substrate to harden against.
What this filing does NOT do: doesn't predict per-phase timelines (acceptance criteria are listed; pacing is community-driven); doesn't gatekeep contributions on maintainer review (external PRs welcomed via existing patterns); doesn't restructure existing #464 / #482 issues (they remain specific tracks for their respective scopes); doesn't create a GitHub milestone (single-tracking-issue pattern matches existing #461 / #464). Cross-link comments posted on #464 and #482 clarifying the broader context.
Added
- Multi-provider schema upgrade for the per-model profile library + first non-OpenRouter template (advances #464 Phase 1 + #482 HF community track). Today's empirical findings (3 of 5 Phase 1 models hit upstream rate limits on the OpenRouter
`:free` pool: Google AI Studio 429 on gemma-4, "Venice"-attributed 429s on llama-3.3 and qwen3-coder) motivated this scope. Four deliverables in one cohesive change unblocking external contributions for routing layers beyond OpenRouter: (1) Schema reference doc at `docs/reference/model-profiles-schema.md` documenting the YAML schema explicitly for the first time, covering required + optional fields, accepted `provider:` union (`openrouter` / `huggingface` / `together` / `groq` / `cerebras` / `sambanova` / `custom`), per-provider extension fields (`hf_routing_policy`, `hf_partner`, `endpoint_base_url`, `tool_parser`), required empirical sections (`validated_against` / `community_traces` / `comparison_traces`, present even if empty `[]`), file size soft cap, anonymization rules per the standing memory rule, full schema for each empirical section's entry shape with the existing `models/openai-gpt-oss-120b-free.yaml` as the most-populated reference. (2) First HF Inference Providers template at `models/huggingface-openai-gpt-oss-120b.yaml`: reuses the gpt-oss prompting guidance from the OpenRouter sibling unchanged (model behavior is provider-agnostic; only routing differs), adds HF-specific `hf_routing_policy: ":preferred"` default + `context_window_notes` explaining the HF routing layer (OpenAI-compatible at `router.huggingface.co/v1`, provider-selection policies `:fastest` / `:cheapest` / `:preferred`, free-tier credit ~$0.10/month writeup-quoted, BYOK alternative). Empirical sections `[]`; community contribution invited per #482. The cross-provider relationship (`openai/gpt-oss-120b` on OpenRouter vs HF) gives external contributors a clean A/B template: same model + same prompt across routing layers, measure whether reliability differs.
(3) Routing setup howto at `docs/howto/configure-non-openrouter-providers.md`: primary section walks HF Inference Providers end-to-end (get HF API key, configure OpenClaw with base URL + key, provider-selection policies explained, free-tier credit ceiling notes, worked example of switching trace-test agent to HF for cross-provider trace contribution); secondary section briefs Together AI / Groq / Cerebras / SambaNova direct (all OpenAI-compatible with their own free tiers, base URLs + auth doc links per provider); tertiary section briefs self-hosted vLLM / SGLang / TGI with `tool_parser:` field reference for Nemotron-3 Super's `qwen3_coder` parser (per the Nvidia developer-forum thread's "Native fixed it" resolution captured in #475 research). Submission methodology section cross-links to existing helmdeck-trace CLI workflow. (4) CI YAML validation gate at `scripts/validate-model-profiles.py` + `.github/workflows/model-profiles-validate.yml` (Python stdlib + PyYAML, single file). It validates: required top-level keys present, `provider:` in accepted union, `tier:` in `A` / `B` / `C`, file size under 30 KB soft cap (sanity check; bumped from 20 KB after nemotron landed at 22.5 KB post-PR #487 due to rich legitimate empirical content), empirical sections present even if empty arrays, provider-specific required fields when relevant (`endpoint_base_url` for `custom`). Workflow runs only when `models/*.yaml` or the validator changes, so it is cheap to run and fast to fail. Negative-test validated against deliberately-broken fixture; positive-test passes all 6 existing profiles (5 OpenRouter + 1 new HF).
Cross-references: `docs/reference/models.md` gains a "Non-OpenRouter profiles" subsection listing the new HF template + a "See also" pointer to the schema reference; `docs/howto/add-free-models.md` gains an "Adding a non-OpenRouter profile" section pointing at the schema + routing howto; `CONTRIBUTING.md` "Profile contribution" bullet updates the schema reference link to the new dedicated doc + notes non-OpenRouter providers are supported via the routing howto; sidebar registers both new docs under their respective categories (howto: "Per-model agent adaptation" alongside the gemma-4 recipe and personalize howto; reference: alongside `reference/models`). Backwards-compatibility: existing 5 OpenRouter YAMLs need no changes; the `provider: openrouter` line inside each is the explicit identifier; the convention for NEW files going forward is `models/<provider>-<model-slug>.yaml` (HF gpt-oss is the first to follow it). What this PR deliberately does NOT do: doesn't ship empirical HF traces (the HF gpt-oss profile starts empty; community contribution is the whole point of #482); doesn't rename the existing OpenRouter YAMLs; doesn't migrate to multi-provider-per-YAML object schema (sticking with simple union per-file: easier for external contributors to understand); doesn't ship HF templates for the other 4 Phase 1 models (community-contribution opportunities now that the schema + template + routing howto unblock them); doesn't add Together / Groq / Cerebras / SambaNova templates (community contributors can add specific templates as they validate models there).
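A rough sketch of the checks the CI validation gate is described as performing, written against an already-parsed profile so it stays stdlib-only. This is not the contents of `scripts/validate-model-profiles.py`: the function name and the `REQUIRED_KEYS` list are hypothetical (the authoritative key list lives in the schema reference doc), and reading "30 KB" as 30 × 1024 bytes is an assumption.

```python
from typing import Any, Dict, List

# Provider union and tier values as listed in the changelog entry.
ACCEPTED_PROVIDERS = {
    "openrouter", "huggingface", "together", "groq",
    "cerebras", "sambanova", "custom",
}
ACCEPTED_TIERS = {"A", "B", "C"}
EMPIRICAL_SECTIONS = ("validated_against", "community_traces", "comparison_traces")
SIZE_SOFT_CAP_BYTES = 30 * 1024  # assumption: "30 KB" means 30 * 1024 bytes
REQUIRED_KEYS = ("provider", "tier")  # hypothetical subset of the real schema


def validate_profile(profile: Dict[str, Any], raw_size_bytes: int) -> List[str]:
    """Return human-readable problems; an empty list means the profile passes."""
    errors: List[str] = []
    for key in REQUIRED_KEYS:
        if key not in profile:
            errors.append(f"missing required key: {key}")
    provider = profile.get("provider")
    if provider is not None and provider not in ACCEPTED_PROVIDERS:
        errors.append(f"provider {provider!r} is not in the accepted union")
    tier = profile.get("tier")
    if tier is not None and tier not in ACCEPTED_TIERS:
        errors.append(f"tier {tier!r} must be one of A, B, C")
    if raw_size_bytes >= SIZE_SOFT_CAP_BYTES:
        errors.append(f"file is {raw_size_bytes} bytes; soft cap is {SIZE_SOFT_CAP_BYTES}")
    for section in EMPIRICAL_SECTIONS:
        # Sections must exist even when there is nothing to report yet.
        if not isinstance(profile.get(section), list):
            errors.append(f"{section} must be present as a list (may be empty)")
    if provider == "custom" and not profile.get("endpoint_base_url"):
        errors.append("provider 'custom' requires endpoint_base_url")
    return errors
```

Collecting every problem instead of stopping at the first keeps one CI run useful to a contributor who has several fields wrong at once.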
Changed
- `skills/helmdeck/SKILL.md` audit + small persona-leak refactor (Fixes #455). Audit categorization across all 14 top-level sections (lines 27–524): pack catalog + MCP resources + async wrappers + pipelines + repo discovery pattern = mechanism (clean; describes packs by capability, decision tables, contracts); error handling rules + default model selection + session chaining + when-to-create-a-github-issue = operating rules served as baseline defaults (acceptable; operators can override in AGENTS.md per the PR #483 layered pattern); developer guidance section at end = audience-targeted (intentional, clearly delimited for helmdeck developers, doesn't pollute agent prompt). Only one real persona-leak found: "Pack composition — you are a creative agent" (section header at line 305) used persona-shaped framing ("YOU are a creative agent...YOU generate creative content"). Refactored to "Pack composition pattern" with mechanism-shaped framing ("agent generates content, packs handle production"): same operational guidance, no persona prescription. Added explicit "Operator override" note at the end of the section pointing operators at `docs/integrations/openclaw.md` §5d for the layered-customization pattern when they want to pin a different composition style. Audit conclusion: the skill IS well-layered overall; the operating-rules sections serve as defaults that AGENTS.md overrides (per the empirical lesson from PR #481 → PR #484: docs-only profile is necessary but not sufficient; per-use-case AGENTS.md hardening is the load-bearing layer). No major refactor warranted. Companion audit for `skills/helmdeck-debug/SKILL.md` still tracked in #456 with the same methodology.
Changed
- `models/nvidia-nemotron-3-super-120b-a12b-free.yaml` final empirical refinements (closes #475). Four structural updates synthesizing what the v1→v2 baseline-vs-hardened A/B taught us about per-model profile sufficiency: (1) `validated_against[]` populated with a structured maintainer-curated finding capturing the full A/B comparison table (24 calls / 0 deposit → 7 calls / deposit + verify with `all_present: true`), the three hardenings that empirically closed both Nvidia-documented failure modes (explicit tool whitelist + async pattern bounds for content.ground + plain-text-tool-call invalidation), the resilience observation (content.ground job actually failed upstream in v2; agent honored "don't retry" rule, recovered via operator's deposit reply), and the strategic lesson that the YAML profile is necessary but NOT sufficient for reliable Tier C Nemotron behavior: per-use-case AGENTS.md hardening is the load-bearing layer. (2) `best_practices[]` extended with three empirical entries prefixed `EMPIRICAL 2026-06-10` capturing the load-bearing hardening pattern, the bounded-polling pattern for async packs, and the upstream-failure-resilience observation. (3) `anti_patterns[]` extended with two empirical entries prefixed `EMPIRICAL 2026-06-10` capturing the "deploy with profile but no hardened AGENTS.md" anti-pattern and the "parallel async pack jobs" anti-pattern (both reproduced verbatim in the v1 baseline). (4) `chain_call_reliability.notes` extended with the empirical refinement that chain-call reliability is workflow-shape-dependent, not a pure model property: the same model on the same prompt produced 24 calls / 0 deposit (v1) vs 7 calls / deposit + verify (v2) purely on AGENTS.md hardening differences. The short/medium/long buckets describe the model's CAPACITY; the actual call counts depend on whether operator AGENTS.md constrains the workflow.
Memory-rule compliance: `validated_against[]` entry uses sanitized labels ("Tier C agent on `nvidia/nemotron-3-super-120b-a12b:free`, three-turn iterative workflow") rather than naming Press-Nemotron explicitly. Closes #475: first Phase 1 follow-up issue to land all four empirical sections (`validated_against` + `community_traces` × 2 + refined `best_practices` / `anti_patterns` / `chain_call_reliability` notes). Mirrors how `models/openai-gpt-oss-120b-free.yaml` has all four populated; sets the canonical bar for the remaining three empirical follow-up issues (#473 gemma-4, #474 llama-3.3, #476 qwen3-coder) when their traces eventually land.
Changed
- `scripts/configure-openclaw.sh` now seeds the canonical four-file workspace layout instead of dumping concerns into IDENTITY.md (closes #454). Previous behavior: `--seed-identity` wrote three files (IDENTITY, USER, SOUL) with leaky concerns: `SOUL.md` mixed voice posture with operating instructions ("Follow the SKILLS.md decision tables..."), `USER.md` mixed operator description with technical assumptions about pack vocabulary, and `AGENTS.md` was never seeded. New behavior: four files (`SOUL.md`, `IDENTITY.md`, `USER.md`, `AGENTS.md`) with cleanly separated concerns per OpenClaw's canonical model (SOUL=voice, IDENTITY=name, USER=operator, AGENTS=operating rules), each capped well under the 12,000-char bootstrap injection limit (current sizes: SOUL 980c, IDENTITY 198c, USER 855c, AGENTS 1988c) with operator-tunable `<!-- TODO: -->` comments at each section so operators know what to customize. `SOUL.md` covers voice posture / editorial discipline / banned phrases ONLY, with no operating instructions; `IDENTITY.md` is intentionally minimal (name / emoji / one-line theme); `USER.md` is the operator profile (who you are, where you publish, current focus, editorial preferences) with placeholders for customization; `AGENTS.md` carries operating rules (tool whitelist, workflow shape, hard constraints, etiquette) and explicitly references the per-model-agents recipes in `docs/howto/per-model-agents/`. New `--seed-canonical-layout` flag is the documented name for the new behavior; `--seed-identity` is preserved as an alias for backwards compatibility (no script breaking for any caller passing the old flag). New `--force-overwrite` flag for idempotency control: by default existing files are preserved (skip with informative log message pointing operators at `docs/howto/personalize-an-openclaw-agent.md`); pass `--force-overwrite` to back up existing files with a `.bak.YYYYMMDD-HHMMSS` suffix and write fresh seeds.
Why this matters: today's PR #485 (`personalize-an-openclaw-agent` howto) + PR #483 (canonical file roles section in `integrations/openclaw.md`) document the layered SOUL/IDENTITY/USER/AGENTS pattern as the maintainability story; this PR makes the script that bootstraps new agents actually produce that layout instead of overloading IDENTITY.md (the observed pre-fix behavior from the 2026-06-09 tech-blog-publisher debugging arc that motivated #454 in the first place). Layered seeding empirically matters: PR #481 → PR #484 Nemotron A/B (24 calls / 0 deposit → 7 calls / deposit + verify with `all_present: true`) demonstrates that per-use-case AGENTS.md hardening is the lever, and the script now seeds an AGENTS.md skeleton that explicitly invites that hardening.
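The skip-by-default and `--force-overwrite` behavior described in this entry amounts to the pattern below. The real logic lives in the shell script `scripts/configure-openclaw.sh`; this Python sketch only illustrates the idempotency rule, and the function name, return values, and `now` parameter are hypothetical.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional


def seed_file(path: Path, content: str, force_overwrite: bool = False,
              now: Optional[datetime] = None) -> str:
    """Seed one workspace file; return "written", "skipped", or "overwritten".

    Existing files are preserved unless force_overwrite is set, in which case
    the old file is renamed to <name>.bak.YYYYMMDD-HHMMSS before the fresh
    seed is written.
    """
    if path.exists():
        if not force_overwrite:
            return "skipped"
        stamp = (now or datetime.now()).strftime("%Y%m%d-%H%M%S")
        path.rename(path.with_name(f"{path.name}.bak.{stamp}"))
        path.write_text(content, encoding="utf-8")
        return "overwritten"
    path.write_text(content, encoding="utf-8")
    return "written"
```

Renaming rather than deleting means an operator who has already customized `AGENTS.md` can recover the edits after a forced re-seed.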
Added
- `docs/howto/personalize-an-openclaw-agent.md`: generic operator-personalization howto using the layered SOUL / IDENTITY / USER / AGENTS pattern (closes #458). Walkthrough for operators who want to use helmdeck shipped skills (and skills under `~/.openclaw/skills/`) with their own persona, platforms, and goals rather than the defaults baked into the upstream skill. Covers the five-layer mental model (SOUL=voice / IDENTITY=name / USER=operator / AGENTS=operating rules / SKILL=mechanism), what goes where with concrete examples per layer, walkthrough templates for populating USER.md (the most-customized file), tuning IDENTITY.md (override when defaults don't match), and SOUL.md (generally don't, but here's the dial). Tradeoffs table for when to fork the skill vs customize via identity files. Full worked example using sanitized Maya security-research persona (consistent across helmdeck docs; the same persona that the `docs/integrations/openclaw.md` §5d canonical file roles section and the `docs/howto/per-model-agents/gemma-4-iterative-workflow.md` recipe use). Maya's SOUL.md, IDENTITY.md, USER.md files are shown verbatim, demonstrating the persona-reuse pattern (same SOUL/IDENTITY/USER copied across multiple model variants; only AGENTS.md changes per model). Shows the multi-variant `openclaw.json` registration pattern with two example agents (`maya-gemma-4` + `maya-llama`) sharing persona but using different per-model AGENTS.md. Verification section points operators at `scripts/helmdeck-trace` for empirically validating their personalized agent's behavior post-bootstrap: `verify_manifest_called: True` + `all_present: True` + tool tally matching AGENTS.md prescription. Bootstrap helper section points at `configure-openclaw.sh` and the #454 layered-seed work that will eventually automate the workspace scaffolding.
Empirical motivation anchored in PR #481 → PR #484's Nemotron baseline-vs-hardened A/B (24 calls / 0 deposit → 7 calls / deposit + verify with `all_present: true`) showing that per-use-case AGENTS.md hardening is more impactful than persona dumping; the layered pattern documented here is what makes that hardening maintainable across many agents. Memory-rule compliance: worked example uses standing Maya persona throughout; no operator-personal agent names (Hat, Press-*, etc.) leak into the public doc. Sidebar registration: new sidebar entry under existing "Per-model agent adaptation" category, alongside the gemma-4 iterative workflow recipe, so operators landing on the conceptual howto can hop to the per-model worked example and back.
Added
- `models/nvidia-nemotron-3-super-120b-a12b-free.yaml`: second `community_traces[]` entry, hardened-v2 success (#475 Phase 1 follow-up closure). Direct A/B against PR #481's v1 baseline: SAME model, SAME prompt (eBPF kernel rootkit detection), hardened AGENTS.md (operator-local per memory rule). 7 total pack calls vs v1's 24 (71% reduction). `artifact-put` + `verify_manifest` both fired with `all_present: true` on 1 of 1 artifact. Three hardenings empirically closed the Nvidia-documented gap: (1) Explicit tool whitelist ("You MAY call ONLY these tools") forbidding filesystem write/read packs: empirically 0 filesystem calls (vs 5 in v1); (2) async pattern bounds for content.ground ("Call ONCE, poll pack-status max 5x, then pack-result OR honest timeout. NEVER start a parallel job"): empirically 1 content.ground call (vs 6 in v1) + 4 pack-status polls (within 5-budget); (3) plain-text tool call invalidation: explicit rule that tool calls generated as plain text invalidate the response, empirically 0 plain-text tool calls (vs the documented anti-pattern that fired in v1's final turn). Resilience observation: the content.ground job ACTUALLY failed upstream in v2 (state transitioned working → failed by poll #4). The agent honored the "don't retry" rule and reported the failure honestly in the Turn 2 response, ending with the literal handoff line. Operator replied "deposit"; Turn 3 fired artifact.put + verify_manifest correctly with the un-grounded draft, returning all_present:true. The hardened workflow is resilient to upstream pack failure, not just clean-path. Decision: `profile-works` (vs v1's `profile-not-enough`): per-use-case AGENTS.md hardening on top of the docs-only profile closes the Nvidia-documented failure modes. Strategic lesson for future Nemotron operators: the YAML profile gives the prompting shape, sampling, and reasoning controls Nvidia recommends; the AGENTS.md gives the workflow constraints that turn those mechanics into reliable agentic behavior. You need both layers.
Submitted via the `helmdeck-trace` CLI (PR #478 / #479): third YAML to receive a community_traces[] entry via the canonical Phase 1 contribution tool.
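The async bound quoted in the hardened AGENTS.md ("Call ONCE, poll pack-status max 5x, then pack-result OR honest timeout") is a prompt-level rule for the agent, but its control flow can be written down as a small sketch. Everything here is illustrative: the three callables stand in for the content.ground, pack-status, and pack-result tool calls, and the function name and return shape are hypothetical. A real client would also wait between polls; that is left out to keep the sketch short.

```python
from typing import Any, Callable, Dict

MAX_POLLS = 5  # poll budget quoted in the hardened AGENTS.md rule


def run_async_pack_once(
    start_job: Callable[[], str],
    poll_status: Callable[[str], str],
    fetch_result: Callable[[str], Any],
    max_polls: int = MAX_POLLS,
) -> Dict[str, Any]:
    """Start one async job, poll a bounded number of times, never retry."""
    job_id = start_job()  # called exactly once; no parallel job is ever started
    for attempt in range(1, max_polls + 1):
        state = poll_status(job_id)
        if state == "completed":
            return {"outcome": "completed", "polls": attempt,
                    "result": fetch_result(job_id)}
        if state == "failed":
            # Report the upstream failure honestly instead of starting over.
            return {"outcome": "failed", "polls": attempt, "result": None}
    return {"outcome": "timeout", "polls": max_polls, "result": None}
```

The v2 trace described above corresponds to the `failed` branch: the job moved from working to failed by the fourth poll and the agent reported it rather than launching a second job.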
Added
- `models/nvidia-nemotron-3-super-120b-a12b-free.yaml`: first `community_traces[]` entry capturing both Nvidia-documented Tier C failure modes empirically (#475 Phase 1 follow-up advanced). Press-Nemotron agent (session `41863f17-43bc-447a-9828-87c812534615`, 2026-06-10) ran the standard three-turn iterative blog-drafter workflow on `nvidia/nemotron-3-super-120b-a12b:free`. 15-minute session, 24 total pack calls, zero `artifact-put` or `verify_manifest` calls; the workflow never reached the deposit step. Reproduces both anti-patterns the Nvidia agentic-coding cookbook documents: (1) Goal Drift: agent drifted from "blog draft + deposit" to "spam content.ground with multiple concurrent jobs and write random files"; it used filesystem `write` / `read` packs (NOT prescribed by AGENTS.md) to save outline.md, draft.md, temp_draft.md, test.md to the workspace dir; six simultaneous content.ground jobs started, most hung at "progress: 10%", only ONE completed and only on a tiny 46-byte test file. (2) Tool-Call Failures: final assistant turn started generating `<tool_call><function=helmdeck__pack-status><parameter=job_id>...` as PLAIN TEXT instead of using the OpenAI toolCall format, the literal "malformed function call" anti-pattern Nvidia documents. Decision: `profile-not-enough`: the docs-sourced profile guidance (ChatML format, sampling, `enable_thinking`, `force_nonempty_content`) was insufficient to prevent the failures; per-use-case AGENTS.md hardening is the apparent next step. Useful side observations: (a) `content.ground` is async (returns `job_id` + `state:"working"`); AGENTS.md says "Call content.ground ONCE" but doesn't mention the polling pattern, so operators iterating on Nemotron recipes should add explicit "call once, poll pack-status until state:completed, then call pack-result" guidance.
(b) The agent has access to filesystem packs that AGENTS.md never authorized (probably from a separate Claude Code MCP integration in OpenClaw); per-model AGENTS.md should explicitly enumerate allowed packs to prevent goal-drift escapes. Iterating Press-Nemotron AGENTS.md (operator-local per memory rule) sets up a v2-vs-v1 A/B for the next `community_traces[]` entry, and validates the helmdeck-trace CLI's role as the canonical evidence-capture tool for #464 Phase 1 follow-ups.
Added
- `models/openai-gpt-oss-120b-free.yaml`: second `community_traces[]` entry capturing first end-to-end CLI dogfood run (PR #478 consumer, captured via PR #479's fixed helmdeck-trace). Trace `babfee13-9d81-4f88-a3c8-3cab900c562e` from the new `trace-test` agent on `openrouter/openai/gpt-oss-120b:free`: three-turn iterative workflow on an MCP tool catalog deep-dive prompt; `artifact.put` + `verify_manifest` fired end-to-end with `all_present:true` on 1 of 1 deposited artifact. Captures three findings worth pinning beyond the metric_summary: (1) workflow shape EXPANDED rather than simplified: AGENTS.md prescribed "exactly two tool calls" for Turn 3 but the agent fired 5 (1 content-ground + 1 artifact-put + 1 verify-manifest + 2 exploratory probes via pack.status + pack.result before the deposit). The publishing-strategist trace from 2026-06-09 simplified 9 platforms to 2 (workflow contracted); this trace shows the opposite drift (workflow expanded with exploratory probes). Two different Tier C deviation patterns, both away from the AGENTS.md prescription; both shapes of customization pressure operators should expect when designing per-use-case agents. (2) non-deliverable-terminal-turn retry-recovery is a real Tier C resilience pattern: operator's first `deposit` reply triggered the trajectory error; retry with same input succeeded. The free gpt-oss-120b route can fail one turn and recover on the next attempt, which is useful operator-facing patience guidance. (3) Three-turn iterative shape held under iterative pressure: both handoff lines fired literally, Turn 3 ended with the prescribed `Done. Artifact deposited and verified.` line, zero citation URLs fabricated (model honored the content.ground rule and didn't author URLs). Namespace deviation: artifact landed at `artifact.put/...md` not `blog.publish/...md` per AGENTS.md; pack default kicked in where the model should have honored the explicit `namespace` arg; worth tightening in future revisions.
Decision: `profile-works` (the audit-callback pattern fired correctly and produced a verified artifact; the workflow deviations are operational observations, not workflow-breaking failures). Submitted via the same `helmdeck-trace extract` shape Phase 1 community contributors are expected to use, proving the CLI's primary use case works end-to-end on a successful session for the first time.
Added
- `scripts/helmdeck-trace` CLI: extracts structured `community_traces[]` blocks from OpenClaw session jsonl files (issue #464 Phase 1 contribution tooling). Single-file Python CLI (stdlib only, no PyYAML or requests) with three subcommands: `extract` (one session → one YAML block matching the canonical `community_traces[]` schema in `models/openai-gpt-oss-120b-free.yaml`), `compare` (baseline vs profile-aware A/B markdown table for the methodology described in each empirical-baseline issue), and `summary` (quick stdout key:value dump for eyeballing). Walks the OpenClaw session jsonl forward, pairs `toolCall` parts with the next `toolResult` turn FIFO (matching the existing `scripts/oc-capture/extract-oc-transcript.py` parser pattern), and computes: `real_pack_calls` (count of actual `toolCall` parts, NOT text claims like "I deposited 6 artifacts"), `tool_calls_by_name` (per-tool tally), `verify_manifest_called` + `all_present` (from parsing the verify_manifest tool result JSON), `artifact_put_called`, `content_ground_called` + `claims_considered` / `claims_grounded` / `skipped` (from parsing the content.ground tool result), `pipeline_run_called`, `citation_urls_in_text` / `citation_urls_from_grounding` / `citation_urls_fabricated` (parses `[N](url)` and `[source](url)` patterns from assistant final text; cross-checks each URL against `content.ground` response `grounding[]` array; flags any inline URL that did NOT come from content.ground as fabricated; this is the Tier C citation-confabulation failure mode documented in 2026-06-10 traces), `hallucination_count` (heuristic: assistant text claims a deposit / verify outcome but the corresponding tool call never fired), and `terminal_errors` (captured from trajectory `model.completed.data.terminalError`; exercises the 429 / `non_deliverable_terminal_turn` path proven against the 2026-06-10 gemma-4 rate-limit trace). `simplification_observed` is intentionally NOT auto-detected: the heuristic is too fragile, so the CLI emits `null` to satisfy the YAML schema and the operator sets it manually after review.
Anonymization: default behavior strips operator-personal data per the standing memory rule that workspace files + agent names stay private (`agent_id: press-gemma-4` → comment `# trace agent (anonymized): Tier C agent on <model>`; workspace path omitted entirely). The `--no-anonymize` flag is available for local testing but the default is safe for community PRs. Validation pattern documented: rather than running the CLI against personal agents (Hat / Press-Gemma / etc.), the README recommends spinning up a dedicated `trace-test` agent on a known-good model (e.g., `openrouter/openai/gpt-oss-120b:free`) with a generic AGENTS.md that runs the same three-turn iterative workflow shape Hat/Press-Gemma use. The agent stays on the operator's machine (NOT in helmdeck) but the pattern is community-useful, so it is surfaced in `scripts/helmdeck-trace/README.md` as the recommended validation approach. What this CLI does NOT do (explicit scope boundaries): doesn't fire sessions (OpenClaw's internal IPC protocol isn't documented for external automation; filed as a research follow-up if upstream OpenClaw ships a documented session-fire API; meanwhile, the operator manually pastes the test prompt into the OpenClaw UI then points the CLI at the resulting jsonl); doesn't compute `simplification_observed` (manual after review); doesn't compare against expected behavior (the output is the trace; the operator picks the `decision:` value). Consumers: the four empirical-baseline issues filed alongside PR #477 (#473 gemma-4, #474 llama-3.3, #475 nemotron-3-super, #476 qwen3-coder) each invite community contribution of trace excerpts to populate the respective profile YAML's `community_traces[]` array; this CLI is the canonical tool for producing those excerpts. Docs at `scripts/helmdeck-trace/README.md`.
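The citation cross-check described for `extract` can be sketched as follows. This is a minimal illustration rather than the CLI's actual code: the function name `classify_citations`, the regex, and the choice to return URL lists (rather than counts) are assumptions; only the three metric names and the `[N](url)` / `[source](url)` patterns come from the entry above.

```python
import re

# Matches inline markdown links whose label is a number or the word "source",
# e.g. [1](https://a.example/x) or [source](https://b.example/y).
# The exact pattern the real CLI uses is not shown in the changelog; this is a guess.
CITATION_RE = re.compile(r"\[(?:\d+|source)\]\((https?://[^)\s]+)\)")


def classify_citations(final_text, grounding_urls):
    """Split URLs cited in the assistant's final text into grounded vs fabricated.

    final_text     -- the assistant's final message (markdown)
    grounding_urls -- URLs taken from the content.ground response grounding[] array
    """
    cited = []
    for url in CITATION_RE.findall(final_text):
        if url not in cited:  # de-duplicate while preserving first-seen order
            cited.append(url)
    grounded = set(grounding_urls)
    return {
        "citation_urls_in_text": cited,
        "citation_urls_from_grounding": [u for u in cited if u in grounded],
        # Any inline URL that content.ground never returned counts as fabricated.
        "citation_urls_fabricated": [u for u in cited if u not in grounded],
    }


text = (
    "See [1](https://a.example/x) and [source](https://b.example/y), "
    "then [1](https://a.example/x) again."
)
result = classify_citations(text, ["https://a.example/x"])
```

Here `result["citation_urls_fabricated"]` contains only `https://b.example/y`, because that URL appears in the text but not in the grounding list.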
Changed
- Empirical refinement: deposit-step skipping is workflow-shape-dependent, not tier-invariant (issue #466 follow-up). PR #469 / the 4th blog post (`/blog/tier-a-empirical-baseline`) framed the deposit-step skipping as tier-invariant based on three single-response traces (Tier C baseline, Tier C with profile, Tier A baseline). A fourth trace on the same `openai/gpt-oss-120b:free` route, run with a three-turn iterative workflow (outline → draft → operator-triggered deposit+verify), successfully called BOTH `helmdeck__artifact-put` AND `helmdeck__artifact-verify_manifest`, returning `all_present: true`, 1 of 1 verified. Real 10,438-byte artifact landed at the expected `blog.publish/` namespace key. Latency was significant (~5 minutes total for the deposit-and-verify turn on the free route), but the mandatory tool calls executed correctly. Corrected conclusion: single-response workflows asking the agent to do classify-outline-draft-deposit-verify-checklist in one go fail on every tier tested; multi-turn iterative workflows with explicit operator handoffs (each turn small enough that 1-2 pack calls suffice, per `chain_call_reliability: high` in the profile) drive the mandatory calls reliably even on cheap Tier C. Engine-level enforcement (#461 Phase 3) remains the durable architectural answer because it removes the workflow-shape dependency entirely, but well-shaped iterative skill prose CAN drive the mandatory call on every tier tested so far. What changed in the docs: `docs/reference/models.md` Tier C row updated with the iterative-workflow recipe; "Empirical findings from 2026-06-09" section gains the refined finding paragraph + a new "Iterative workflow pattern" subsection documenting the recommended Turn 1 / Turn 2 / Turn 3 structure with operator-triggered handoffs. The handoff line at the end of each turn is itself load-bearing: if the skill prose says "produce a handoff line" but doesn't list missing-handoff as an invalidation condition, the model will drop it.
The doc explicitly recommends pinning handoff lines as success-criteria invalidation conditions. `models/openai-gpt-oss-120b-free.yaml` schema extended with a new `comparison_traces[]` entry capturing the iterative-workflow trace alongside the original Tier A entry; the original entry's "tier-invariant" notes are revised in-place to point at the new entry as the corrected finding. Methodological lesson: empirical claims based on a single workflow shape are premature. The architectural answer (Phase 3 engine hook) still holds, but the per-tier customization recommendations gain a new dimension: workflow shape, not just model tier, drives reliability of mandatory tool calls.
Changed
`docs/reference/models.md` tier-level recommendation table rewritten with empirical Tier A baseline data (issue #466). The original table (shipped in PR #465) claimed Tier A "works out of the box" as an assumption. The 2026-06-09 Tier A baseline test on `anthropic/claude-sonnet-4.6` empirically revealed the assumption is only partially supported: Tier A handles every structural aspect of skill compliance better than either Tier C variant (parallel tool use at startup, full N-platform fanout, InfoQ 6-criterion fit check with per-criterion grades, multi-step plan acknowledged upfront, "one clarifying question" rule honored exactly), but Tier A also skips the mandatory `artifact.put` + `verify_manifest` deposit step, same as both Tier C variants. The agent's text says "Now appending CTAs and depositing to artifacts - all in parallel" but its parallel tool calls were 8× `blog.append_cta`, conflating "append CTA" with "deposit to artifacts." The mandatory deposit step was never executed. Strategic finding: the deposit-step skipping is tier-invariant, not Tier-C-specific. Skill prose marked "MANDATORY, NOT ADVISORY" is treated as advisory regardless of model capability. What changed in the docs: the recommendation table now has two columns ("Structural compliance" vs "Mandatory deposit-step compliance"); a new "Empirical findings from 2026-06-09" section presents the three-trace comparison (Tier C baseline, Tier C with profile, Tier A baseline) across 10 metrics. Architectural implication: Phase 3 of #461 (engine-level post-call hook) was originally deferred pending Phase 1 + 2 evidence; today's trace strengthens its justification. The pattern is necessary because skill prose can't carry the mandatory-call weight on any tier, not just Tier C.
The architectural shape that closes the loop: producer pack registers a paired auditor; engine intercepts the producer's completion and auto-invokes the auditor; auditor result attaches to the producer's response envelope so the LLM sees both in its next-turn context; no skill-prose dependency. Field report captured in `2026-06-09-tier-a-empirical-baseline.md`: fourth post in the 2026-06-09 series, frames the tier-invariant deposit-step failure mode honestly and points at #461 Phase 3 as the architectural answer. `models/openai-gpt-oss-120b-free.yaml` schema extended with a `comparison_traces[]` array (distinct from `community_traces[]`) so cross-tier maintainer-captured comparison runs have a structured place to live. Today's Tier A run is the first entry; future Tier B comparison runs will follow the same shape.
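The producer/auditor shape can be illustrated with a small sketch. Phase 3 is described in this changelog only as a design direction, so everything below is hypothetical: the `Engine` class, its `register` / `execute` methods, the `audit` envelope field, and the in-memory `store` are assumptions chosen for illustration, written in Python rather than the project's own engine code.

```python
from typing import Any, Callable, Dict, Optional

Payload = Dict[str, Any]
Handler = Callable[[Payload], Payload]


class Engine:
    """Toy engine: runs a producer pack, then its paired auditor if one exists."""

    def __init__(self) -> None:
        self._packs: Dict[str, Handler] = {}
        self._auditors: Dict[str, Handler] = {}

    def register(self, name: str, handler: Handler,
                 auditor: Optional[Handler] = None) -> None:
        # A producer pack may register a paired auditor alongside its handler.
        self._packs[name] = handler
        if auditor is not None:
            self._auditors[name] = auditor

    def execute(self, name: str, payload: Payload) -> Payload:
        output = self._packs[name](payload)
        envelope: Payload = {"pack": name, "output": output}
        auditor = self._auditors.get(name)
        if auditor is not None:
            # The audit result rides in the same envelope as the producer output,
            # so the caller sees both without any skill-prose instruction.
            envelope["audit"] = auditor(output)
        return envelope


store: Dict[str, str] = {}  # stand-in for the artifact store


def put(payload: Payload) -> Payload:
    store[payload["key"]] = payload["content"]
    return {"artifact_key": payload["key"]}


def verify(output: Payload) -> Payload:
    # Reads ground truth from the store instead of trusting the producer's claim.
    return {"all_present": output["artifact_key"] in store}


engine = Engine()
engine.register("artifact.put", put, auditor=verify)
envelope = engine.execute("artifact.put", {"key": "blog/content.md", "content": "# Draft"})
```

The design point is that the audit runs inside `execute`, so it fires whether or not the model remembers to ask for it.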
Fixed
`blog.append_cta` now uses the `defaultPackModel()` resolver, closing the last hold-out from PR #453. When PR #453 added the default-model resolver to `content.ground` and `blog.rewrite_for_audience`, it deliberately excluded `blog.append_cta` because the conditional shape ("model is required when source_url / project_url / github_url is set") was thought to be a different failure surface. The 2026-06-09 Tier A baseline trace (issue #466) empirically proved otherwise: `anthropic/claude-sonnet-4.6` running `tech-blog-publisher` on the mcp-adr-analysis-server prompt called `blog.append_cta` 8 times in parallel with `project_url` set but no `model` arg, and the pack rejected ALL 8 with `invalid_input: model is required when one of source_url / project_url / github_url is set`. That's the same upstream failure mode #453 closed for the other content packs: the pack rejects before the LLM dispatcher fires. Fix: the handler now calls `defaultPackModel(in.Model)` exactly like `content.ground` and `blog.rewrite_for_audience` do, resolving the same precedence chain (caller input → `HELMDECK_DEFAULT_PACK_MODEL` env → first `HELMDECK_OPENROUTER_MODELS` → `openrouter/auto` hard fallback). The "model is required when..." error path is removed; the dispatcher gets a non-empty `Model` value on every call. Behavior change: callers omitting `model` while supplying any of the URL link inputs no longer hit `CodeInvalidInput`. Operators who want a specific model still pass it; the default fires only when omitted. Test surface: the existing `TestBlogAppendCTA_RequiresModelWhenLinkSet` was removed (the behavior it pinned no longer applies) and replaced with `TestBlogAppendCTA_DefaultsModelWhenOmitted` (asserts the dispatcher receives `openrouter/auto` when the caller omits `model`) and `TestBlogAppendCTA_DefaultsModelHonorsOperatorOverride` (asserts the `HELMDECK_DEFAULT_PACK_MODEL` env wins over the hard fallback). Inline comment on the removed test in `blog_append_cta_test.go` documents the empirical-trace lineage so a future maintainer can audit the relaxation.
Empirical impact: the same Tier A retry (next session post-merge) should now produce 8 successful `blog.append_cta` calls and the chain that broke today flows through cleanly. Architectural finding captured separately in issue #466: even with this fix, today's Tier A trace skipped the `artifact.put` AND `artifact.verify_manifest` calls entirely; the deposit-step skipping appears to be tier-invariant, not Tier-C-specific. That observation reframes the "Tier A works out of the box" assumption in `docs/reference/models.md` and strongly supports the engine-level post-call hook (Phase 3 of #461) as the architectural answer regardless of tier.
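The precedence chain that `defaultPackModel` resolves can be sketched as below. The real helper is a Go function; this Python version is only an illustration of the documented order (caller input, then `HELMDECK_DEFAULT_PACK_MODEL`, then the first non-empty entry of `HELMDECK_OPENROUTER_MODELS`, then `openrouter/auto`). The function name `default_pack_model` and the injectable `env` parameter are assumptions, and any provider-prefix handling the Go helper may perform is deliberately left out because the changelog does not describe it.

```python
import os

HARD_FALLBACK = "openrouter/auto"


def default_pack_model(caller_input, env=None):
    """Return the first non-empty model name in precedence order.

    env is injectable for testing; it defaults to the process environment.
    """
    env = os.environ if env is None else env

    # 1. Explicit caller input wins (whitespace-only counts as unset).
    if caller_input and caller_input.strip():
        return caller_input.strip()

    # 2. Operator override at the stack level.
    override = env.get("HELMDECK_DEFAULT_PACK_MODEL", "").strip()
    if override:
        return override

    # 3. First non-empty entry of the comma-separated gateway model pin.
    for entry in env.get("HELMDECK_OPENROUTER_MODELS", "").split(","):
        if entry.strip():
            return entry.strip()

    # 4. Hard fallback.
    return HARD_FALLBACK


pinned = default_pack_model(None, {"HELMDECK_OPENROUTER_MODELS": " ,minimax/minimax-m2.7,other/model"})
```

With only the stack pin set, `pinned` resolves to `minimax/minimax-m2.7`: the leading empty entry in the comma list is skipped.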
Added
- `models/google-gemma-4-26b-a4b-it-free.yaml`: second per-model prompting profile, stub (issue #464 Phase 1.2). 26B-total / 3.8B-active MoE Gemma 4 IT variant on Tier C (256K context window, multimodal: text + image + video up to 60s). Profile sourced from OFFICIAL Google Gemma 4 docs only: Hugging Face model card, Google AI model card, DeepMind product page (τ2-bench numbers), and Google's announcement blog. Schema captures Gemma's role-turn conversational format (replaces Gemma 3's `<start_of_turn>` syntax with standard `system` / `user` / `assistant` roles via the chat template), binary thinking-mode control via the `<|think|>` token (NOT a graded `low` / `medium` / `high` knob like gpt-oss; toggled via `enable_thinking=True/False` through the chat template), Google's universal sampling defaults (`temperature=1.0, top_p=0.95, top_k=64` across all tasks), `harmony_format: false` (Gemma uses its own channel-tag thinking format `<|channel>` / `<channel|>`; important: per Google's docs, "Thoughts from previous model turns must not be added" back into history), and multimodal ordering rules (image content BEFORE text, audio content AFTER text). `chain_call_reliability`: high for short chains (1-2 calls), medium for medium (3-4), low for long (5+), based on DeepMind's published τ2-bench 85.5% (retail agentic tool-use, 26B-A4B variant) plus the 3.8B active-parameter budget (small-active MoEs typically degrade on long horizons; binary-only thinking control leaves no escalation knob). `best_practices[]` quotes from Google's official sources; `anti_patterns[]` captures Gemma-specific gotchas (replaying prior-turn thoughts, hand-rolling Gemma 3 turn markers, expecting nuance/sarcasm reliability; the model card explicitly cautions on each). `validated_against`, `community_traces`, and `comparison_traces` ship empty: the baseline empirical trace is deferred to a follow-up issue because the Google AI Studio shared `:free` pool on OpenRouter rate-limited the trace prompt at zero token cost on 2026-06-10 (`429 Provider returned error: google/gemma-4-26b-a4b-it:free is temporarily rate-limited upstream` / `provider_name: "Google AI Studio"`). The 429 finding itself is captured in the YAML's header comment as a Tier C infrastructure observation: Google AI Studio gates at the upstream-provider level, NOT at the model level, affecting all `google/*:free` routes simultaneously. BYOK (https://openrouter.ai/settings/integrations) is required for sustained empirical work on Gemma 4 via OpenRouter.
- Per-model agent recipe: Gemma 4 iterative workflow (issue #464 Phase 4 down-payment). New how-to doc at `docs/howto/per-model-agents/gemma-4-iterative-workflow.md` walks through setting up an OpenClaw blog-drafter agent on `google/gemma-4-26b-a4b-it:free` with a Gemma-4-tuned AGENTS.md template, restructured for role-turn-conversational style instead of gpt-oss's Objectives + Source priority + Constraints + Output format + Success criteria sections. Same three-turn iterative workflow shape as PR #470's gpt-oss validation (outline → draft + ground → deposit + verify) for clean cross-model `comparison_traces[]` isolation. Sanitized worked example uses the Maya persona (a hypothetical security researcher) per the standing memory rule that operator-personal workspace files stay anonymized in helmdeck-facing docs. Recipe covers pre-flight (OpenRouter key + Firecrawl overlay), per-agent model config (Google's universal `temperature=1.0, top_p=0.95, top_k=64` sampling defaults + `enable_thinking: true`), the full AGENTS.md template, a test prompt that mirrors PR #470's validation arc, the metric-capture shape for `comparison_traces[]` submissions, and an honest "why three turns" rationale. Partial Phase 4 acceptance: issue #464 Phase 4 originally proposed shipping per-model templates under `skills/tech-blog-publisher/templates/agents/<variant>/`, but the `tech-blog-publisher` skill itself isn't helmdeck-shipped (operators set it up locally per `docs/howto/add-free-models.md`). This recipe-doc shape closes the same intent without requiring helmdeck to ship the upstream skill: it gives operators a model-specific AGENTS.md template + worked example they can copy into their personal OpenClaw workspace. New sidebar category "Per-model agent adaptation" surfaces the recipe in the howto sidebar.
- Profile stubs for three more #464 Phase 1 entries (issue #464 Phase 1.2). Schema scaffolds with docs-sourced metadata and prompting guidance ship for `meta-llama/llama-3.3-70b-instruct:free` (`models/meta-llama-llama-3.3-70b-instruct-free.yaml`, 70B dense Llama 3.3 on Tier C free route, `role_header_chatml` format with Meta's own `<|start_header_id|>` tokens, two function-calling paths documented (bracket-list vs JSON-after-`<|python_tag|>`), Meta's Llama prompting guide best-practices captured, the family-level "conversation alongside tool calling" anti-pattern noted), `nvidia/nemotron-3-super-120b-a12b:free` (`models/nvidia-nemotron-3-super-120b-a12b-free.yaml`, 120B-total / 12B-active hybrid Mamba-Transformer MoE with 1M context window, ChatML format with `<|im_start|>` / `<|im_end|>`, reasoning control via `enable_thinking` + `low_effort` sub-mode through `chat_template_kwargs`, Nvidia's `force_nonempty_content: True` recommendation for coding agents to prevent reasoning-only-empty turns (corroborates ADR 053), goal-drift and tool-call-failure documented as residual failure modes despite the 1M window, Nvidia's own Super+Nano deployment recommendation for long chains noted), and `qwen/qwen3-coder:free` (`models/qwen-qwen3-coder-free.yaml`, 480B-total / 35B-active MoE coder-specialized Qwen 3 variant with 256K native context extendable to 1M via YaRN, ChatML format with `<|im_start|>` / `<|im_end|>` plus FIM tokens for inline-completion contexts, NON-thinking-mode only (Qwen3-Coder explicitly does NOT generate `<think></think>` blocks per the HF card), Qwen-specific tool parser recommended in SGLang/vLLM, post-trained with long-horizon Agent RL for multi-turn tool trajectories, SWE-Bench Pro 38.7 / Terminalbench 2 23.9 documented; sourced from HF model card + GitHub README + Qwen announcement blog).
All three stubs ship empirical sections empty (`validated_against: []`, `community_traces: []`, `comparison_traces: []`) with comments pointing at follow-up empirical-baseline issues; this lowers the bar for community contribution (per Phase 1 §7): operators running these models on real workloads can submit trace excerpts to populate `community_traces[]` without rebuilding the schema scaffold first. Phase 1 substitution rationale: originally #464 Phase 1 listed `z-ai/glm-4.5-air:free` as the fifth entry, but live OpenRouter API enumeration on 2026-06-10 confirmed the `:free` variant has been deprecated (only the paid `z-ai/glm-4.5-air` remains; live `/api/v1/models`, the collections page, and third-party enumeration all agree). `qwen/qwen3-coder:free` is substituted in: it's an actively maintained coder-specialized model with strong agentic positioning (Agent RL post-training, SoTA among open models on Agentic Coding per the Qwen blog), and the Qwen upstream pool is independent of the Google AI Studio pool that gemma-4 hit today. Docs update: `docs/reference/models.md` "Per-model profiles available today" list promotes all four new YAMLs out of "Planned" into "Available today" (with all four explicitly labeled as stubs); a new section above the Tier C routing table notes that per-model profiles override the row-level Notes column with prompting guidance sourced from official model docs; `google/gemma-2-9b-it:free` and `z-ai/glm-4.5-air:free` removed from the planned list (gemma-2 substituted with gemma-4-26b-a4b-it:free, glm-4.5-air substituted with qwen3-coder:free). Tier C table gets new rows for `openrouter/google/gemma-4-`, `openrouter/meta-llama/llama-3.3-70b-instruct:free`, and `openrouter/qwen/qwen3-coder`; the existing `openrouter/z-ai/glm-` prefix row gains a note about glm-4.5-air's deprecation; the existing nemotron prefix row gets a "Profile: [...]" link.
Follow-up empirical-baseline issues filed alongside this PR for gemma-4, llama-3.3, nemotron-3-super, and qwen3-coder; each follows the methodology shape of issue #466 (which validated gpt-oss-120b) and invites community contribution. Why ship stubs instead of one PR per model: it moves Phase 1 acceptance from "1 of 5" to "5 of 5 with at least one fully empirically validated" in a single push, declares the schema scaffold for community contributors to PR against, and surfaces the per-model prompting differences immediately in docs without waiting for empirical trace runs on every model. The gpt-oss profile started populated because a prior empirical session was available; the other four Phase 1 entries don't have prior traces, but the docs-sourced scaffold provides immediate value while empirical data accumulates.
- `models/openai-gpt-oss-120b-free.yaml`: first entry in the per-model prompting-profile library (issue #464 Phase 1). Sourced from OFFICIAL model documentation only: OpenAI Harmony response format, Together AI GPT-OSS guide, IBM watsonx GPT-OSS behavior guidelines, and OpenRouter free-route. Schema captures: `prompting_style` (objectives + source priority + constraints + output format + success criteria, NOT step-by-step), `reasoning_effort_control` with per-task defaults (`low` / `medium` / `high`), `source_priority_directive` (gpt-oss can prefer internal knowledge unless told otherwise, so skills must include an explicit source-priority section), `harmony_format` (gpt-oss uses harmony response format with internal chain-of-thought), `chain_call_reliability` per chain length (high for 1-2 calls, medium for 3-4, low for 5+; Tier C reliably makes 1-2 real pack calls per turn then hallucinates the rest as text, per the 2026-06-09 trace in PR #462), `best_practices[]` (10 items derived from official docs), `anti_patterns[]` (5 items including the plausibility-shaped-output failure mode), and a `prompt_template` showing the canonical shape. Schema extended with a `community_traces[]` array so external operators contributing their own use-case traces have a structured place to submit them (contributor / use_case / session_date / metric_summary / decision / notes / pr_or_issue_url). First entry is the 2026-06-09 empirical run: profile-aware agent on openai/gpt-oss-120b:free vs baseline, both calling the same publishing-strategist skill. Empirical finding (full results in `validated_against.finding`): the profile-aware agent produced 2 real blog artifacts, called `artifact.verify_manifest` once with `all_present: true`, 2 of 2 verified, and hallucinated 0 manifest entries; the baseline produced 0 deposits, 0 verify_manifest calls, and (in earlier sessions) 6 hallucinated entries.
SAME agent simplified the skill's 9-platform table to 2 variations by choosing `pipeline-run` (auto-deposit) over per-platform `blog.rewrite_for_audience` calls; the strategic insight is that the profile raises the floor of structural compliance but does NOT eliminate per-use-case simplification on Tier C. Per-use-case AGENTS.md customization remains the architectural truth for non-frontier models. Documentation ships in three new operator-facing surfaces: `docs/howto/add-free-models.md` (strict recommendation: must customize per (model × use-case), with §7 community contribution paths), `docs/howto/experiment-with-tier-b-models.md` (Tier B is an open research question; A/B methodology and mandatory share-your-findings ask), `docs/reference/models.md` gains a tier-level recommendation table at the top (Tier A out-of-box / Tier B experiment / Tier C must-customize). `CONTRIBUTING.md` adds a "Reporting model behavior" section pointing at the two howtos and the `community_traces[]` schema. Blog field-report: `2026-06-09-empirical-validation-per-model-profile.md`, third post in the 2026-06-09 series (companion to "plausibility-shaped output" and "the audit-callback pattern" in PR #463), explicitly frames the library as a starting point that operators must finish via per-use-case customization. Privacy: blog post + howto worked examples use a SANITIZED hypothetical persona (Maya, security researcher with generic platforms); the operator's personal `press-gpt-oss` workspace files are NOT reproduced publicly. Phase 2 follow-ups (tracked in #464): same profile shape for `meta-llama/llama-3.3-70b-instruct:free`, `nvidia/nemotron-3-super-120b-a12b:free`, `google/gemma-2-9b-it:free`, `z-ai/glm-4.5-air:free`; each requires its own empirical validation trace before shipping. Tier B unknown, tracked as community research per the experiment-with-tier-b-models howto.
- `artifact.verify_manifest` pack: anti-hallucination audit for the artifact deposit step (#461 Phase 1). Empirical motivation: live trace on 2026-06-09, `tech-blog-publisher` agent on `openai/gpt-oss-120b:free` with all morning fixes merged (PR #450 artifact triad, PR #452 declarative bridge, PR #453 default pack model, layered SOUL/IDENTITY/USER/AGENTS workspace split). Agent made one real `blog.rewrite_for_audience` call, then produced a confidently-formatted six-entry "Artifact Deposit Manifest" table with realistic byte sizes (7.4 KB, 2.1 KB, 3.8 KB, 4.0 KB, 3.5 KB, 3.2 KB) and the disclaimer "Artifact deposit was performed via `helmdeck__artifact_put` for each variation (mandatory per SKILL.md)." Empirical ground truth from `GET /api/v1/artifacts`: zero artifacts in the `blog.publish` namespace, one artifact total in the entire store (an unrelated test). Every line of the manifest was fabricated. The architectural fixes from this morning close the prose-instruction-skipped failure mode (PR #450's typed deposit) and the required-arg-missing failure mode (PR #453's default-model resolver); they do not close the lying-about-tool-calls failure mode where a Tier C model produces a plausibility-shaped manifest for artifacts it never deposited. This pack closes that gap. Input: `{expected: [{artifact_key: "..."}]}` (also accepts a flat string array `[...]` for Tier C friendliness; both shapes decode). Output: `{verified[], missing[], all_present, summary}`. Handler: per-key `ArtifactStore.Get` accumulating found vs not-found, dedup before lookup, whitespace-only / empty-string entries dropped silently during decode, `summary` is one-line `"M of N claimed artifacts verified; K missing"`. Architectural shape mirrors ADR 052 at the chat-response layer: turn an implicit trust ("the agent said it deposited") into a typed pack call that reads ground truth and surfaces the gap in roughly 200 tokens instead of the multi-thousand-token REST-poking dance an operator would otherwise do to verify.
Skill integration documented in `docs/reference/packs/artifact/verify-manifest.md`: every skill that produces multiple artifacts should chain `helmdeck__artifact-verify-manifest` as `§ 4b` after the deposit step, with explicit instructions to surface the `verified` / `missing` result honestly in the response. `tech-blog-publisher/SKILL.md` updates to add the § 4b rule ship in the same release as a worked example. Test surface: 15 new tests across `artifact_verify_manifest_test.go` covering all-verified (object shape), all-verified (flat-string Tier C shape), partial-missing (reproducing today's trace: 1 of 6 verified), all-missing, dedup of duplicate keys, empty/whitespace entries dropped silently, verified-entry shape carrying filename + namespace + size + content_type + key, 6 error-path cases (missing field, empty array, all-empty entries, wrong type, malformed JSON, no store wired), and a round-trip test that puts two artifacts via `artifact.put` then verifies both, proving the producer/consumer pair works as matched. 100% per-function coverage on the new file; `internal/packs/builtin` package total: 93 artifact-related tests pass. Phase 2 follow-ups tracked in #461: same audit shape for `repo.verify-clone` (claimed clone_path exists, commit SHA matches), `blog.verify-published` (claimed URL is reachable, content matches), `pack.verify-completed` (job_id is `completed` not `working`), `slides.verify-rendered` (MP4 artifact exists + passes `av.validate`), `content.verify-grounded` (claims_grounded_count matches grounded[] length), `pipeline.verify-completion` (claimed step outputs match run record). Phase 3 (deferred): engine-level post-call hook that auto-invokes the registered auditor without skill-prose dependency; likely its own ADR if Phase 1 + 2 prove the pattern is generally useful.
Field-report blog drafts scheduled (per CLAUDE.md draft-on-finding norm): (a) "Plausibility-shaped output: Tier C models hallucinate multi-step pack-call chains as text, including manifests of fictitious deposits", quantified from the 2026-06-09 trace; (b) "The audit-callback pattern: verify-against-ground-truth as anti-hallucination middleware", an architectural framing applicable beyond helmdeck.
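The handler behavior described for `artifact.verify_manifest` (both input shapes accepted, empty entries dropped, dedup before lookup, one-line summary) can be sketched as follows. This is a simplified illustration, not the pack's Go implementation: the function name, the plain-dict stand-in for the artifact store, and the exception types are assumptions, and the real `verified[]` entries carry filename, namespace, size, and content type rather than bare keys.

```python
def verify_manifest(expected, store):
    """Check claimed artifact keys against ground truth in `store`.

    expected -- list of {"artifact_key": "..."} objects or flat strings
    store    -- mapping of artifact_key -> content (stand-in for the artifact store)
    """
    keys = []
    for entry in expected:
        if isinstance(entry, dict):
            key = entry.get("artifact_key") or ""
        elif isinstance(entry, str):
            key = entry
        else:
            raise TypeError("expected entries must be objects or strings")
        key = key.strip()
        # Drop empty / whitespace-only entries and de-duplicate before lookup.
        if key and key not in keys:
            keys.append(key)
    if not keys:
        # Mirrors the documented error cases (empty array, all-empty entries).
        raise ValueError("expected must contain at least one artifact key")

    verified = [k for k in keys if k in store]
    missing = [k for k in keys if k not in store]
    return {
        "verified": verified,
        "missing": missing,
        "all_present": not missing,
        "summary": f"{len(verified)} of {len(keys)} claimed artifacts verified; "
                   f"{len(missing)} missing",
    }


store = {"blog.publish/a.md": "# real deposit"}
report = verify_manifest(
    [{"artifact_key": "blog.publish/a.md"}, "blog.publish/b.md", "  ", "blog.publish/a.md"],
    store,
)
```

In this example one of the two distinct claimed keys exists, so `report["all_present"]` is false and the missing key is named explicitly.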
Fixed
- `content.ground` and `blog.rewrite_for_audience` no longer hard-fail when the caller omits `model`. Live-trace evidence from today's `tech-blog-publisher` retest against `openai/gpt-oss-120b:free` (the Tier C validation case): agent loaded the skill, ran `repo.fetch` + `web.scrape` + `fs.read` successfully, then looped re-calling `content.ground` without the `model` argument and bouncing off `Validation failed: model: must have required properties model` indefinitely. Same architectural pattern PR #450 fixed for artifact deposit and the validation arc (ADR 052) fixed for AV post-processing: skill prose tells the agent to call the pack, pack contract requires a parameter the prose doesn't mention, Tier C model has no anchor to fill it in, call rejects, loop. Fix: new shared helper `defaultPackModel(callerInput string) string` in `internal/packs/builtin/model_defaults.go` resolves a sensible default when the caller omits `model`. Precedence: explicit caller input wins → `HELMDECK_DEFAULT_PACK_MODEL` env (operator override, new) → first entry of `HELMDECK_OPENROUTER_MODELS` env (reuses the existing gateway-side model registry pin from `internal/gateway/hydrate_openrouter.go` so a stack pinning `HELMDECK_OPENROUTER_MODELS=minimax/minimax-m2.7` gets the same model resolved consistently on both the gateway-registration path AND the pack-default path) → `openrouter/auto` hard fallback. The hard fallback is `openrouter/auto` rather than a specific free model because (a) it routes through OpenRouter's per-call provider selection which is generally available on every deployment that has `HELMDECK_OPENROUTER_API_KEY` set, and (b) it preserves the existing project posture that the gateway prefers `auto` for orchestration work (ADR 053 `PromptVariantFullSteps` runs on auto by default). The first non-empty source wins; trimming guards against whitespace-only env values being treated as set.
Wired into: `content.ground` (input schema `Required` drops `model`; handler resolves via the helper before any LLM dispatch); `blog.rewrite_for_audience` (same; both packs are in the `tech-blog-publisher` skill's chain). `hyperframes.compose` also requires `model` but is hyperframes-specific and deferred until a Tier C trace shows it as a real blocker. Why a hard fallback rather than returning `CodeInvalidInput`: the typical zero-config dev experience is a fresh helmdeck stack with `HELMDECK_OPENROUTER_API_KEY` set (the only way the gateway works) and no model override. The Tier C silent-skip mode means an agent calling a pack on that stack would hit `model is required` with no hint of what value to pass. Defaulting to `openrouter/auto` makes the pack succeed at the cost of using more tokens than a hand-tuned model choice would. Operators who want a different default set `HELMDECK_DEFAULT_PACK_MODEL` once at the stack level. Test surface: 10 new tests across `model_defaults_test.go` (caller-wins, whitespace-trim, operator-override, stack-pin-with-and-without-prefix, hard-fallback, empty/whitespace env handling, leading-empty-skip in comma list, prefix-preservation); plus `TestBlogRewrite_DefaultsModelWhenOmitted` + `TestBlogRewrite_DefaultsModelHonorsOperatorOverride` + `TestContentGround_DefaultsModelWhenOmitted` confirm the helper fires end-to-end through each pack's dispatcher path. Existing `TestBlogRewrite_RequiredFields/no_model` and `TestContentGround_MissingRequiredFields/no_model` were removed (the behavior they pinned no longer applies) with inline comments pointing to the replacement tests so a future maintainer can audit the relaxation. What this PR explicitly does NOT do (scope boundaries): (a) apply the same fix to every pack that takes a `model` parameter: `blog.append_cta` has a conditional "model is required when..." shape with different semantics, and `hyperframes.compose` hasn't surfaced as a real blocker yet; both are tracked for follow-up.
(b) Touch the gateway-side LLM provider chain: the default model is resolved purely at the pack boundary; the dispatcher's existing provider routing handles whatever string lands in `in.Model`. (c) Change the OUTPUT schema or pack metadata: callers checking `output.model` still see the resolved value (now the default when the caller omitted one), preserving the wire shape.
OpenClaw ↔ helmdeck network bridge now declarative — survives
docker compose up --buildinstead of evaporating on every rebuild. Recurring 24-hour debugging loop: thebundle-mcpprocess inopenclaw-gatewayneeds DNS resolution forhelmdeck-control-plane, which requires both containers to share thebaas-netDocker network. The previous mechanism was a runtimedocker network connect baas-net openclaw-openclaw-gateway-1call inscripts/configure-openclaw.shstep 1 — runtime attachments are erased every time the container is recreated. Symptoms: bundle-mcp probes failing withgetaddrinfo EAI_AGAIN(network gone) or401(a stale token survived but the network didn't), the agent stopping mid-conversation with "I don't have access to MCP tools," and the operator getting pulled back into manual recovery viaconfigure-openclaw.sh --rotate-jwt. Fix: newdeploy/openclaw-baas-net.compose.ymldeclares the attachment as a Docker Compose override on the OpenClaw service.configure-openclaw.shstep 1 now installs this override into the OpenClaw compose directory (typically/root/openclaw/docker-compose.override.yml) before the runtimenetwork connect— the runtime call remains a best-effort safety net for the CURRENT container instance so the rest of the script can probe + verify without requiring a restart, but the override is what makes the bridge survive the NEXT compose-recreate. New--skip-compose-overrideflag opts out for operators who manage the override themselves. The script preserves any pre-existing differing override atdocker-compose.override.yml.bak.YYYYMMDD-HHMMSSbefore replacing, so a hand-edited override isn't silently clobbered. Why this lives in helmdeck's tree rather than OpenClaw's: helmdeck and OpenClaw are independent projects with separate compose lifecycles — the override file is generated by helmdeck (so the integration runbook ships with it) but installed into OpenClaw's compose dir (so OpenClaw's container lifecycle remains the source of truth for OpenClaw's networking). 
Same pattern as Phase 1 of PR #450: turn an advisory step (re-run configure-openclaw.sh after every rebuild) into a declarative artifact (the override file is applied automatically by compose). Updated docs/integrations/openclaw.md §5b with the "Network bridge survival across rebuilds" troubleshooting section so operators hitting the symptom find the fix instead of re-running the script.
Added
- artifact.put / artifact.get / artifact.list packs: typed surface for the artifact store, replacing prose-instruction "save to / read from artifacts" guidance that Tier C free models silently ignore. Motivating observation: the tech-blog-publisher OpenClaw skill was generating blog content correctly on openai/gpt-oss-120b:free but returning the markdown inline in the chat response instead of depositing it under the artifact store the way the SKILL.md prose instructed. Same failure mode the validation arc solved at the pack-output layer per ADR 052: turn an advisory step into a typed pack call so model tier doesn't matter. artifact.put accepts {content, kind, filename?, content_type?, encoding?, namespace?} and returns {artifact_key, url, size, content_type, filename, namespace}. The kind hint (one of blog, markdown, transcript, summary, json, text, html, csv, binary) drives default filename + content_type so skills don't have to think about MIME types: kind:"blog" → content.md + text/markdown. encoding:"base64" is opt-in for binary content the JSON envelope can't carry literally; unsupported encodings reject fast rather than silently passing base64 text through as if it were UTF-8. Filename safety: leading slashes stripped, .. segments resolved, path.Clean applied, and empty / . / .. fall back to the kind default. artifact.get is the symmetric reader: input {artifact_key, encoding?}, output {content, encoding, content_type, size, artifact_key, filename, namespace}. Encoding policy: text-shaped content types (text/*, application/json, application/yaml, application/xml, *+json, *+xml, *+yaml per RFC 6839) return as UTF-8 strings by default; everything else returns base64 so a non-UTF-8 byte sequence doesn't blow up the JSON envelope. Callers can force either with encoding:"utf-8" / encoding:"base64". artifact.list is the introspection capability: input {namespace?, filename?, limit?} (filename is a case-insensitive substring match, not a glob), output {artifacts:[...], count, truncated}.
Default limit 100 entries, newest-first sort by created_at. Pair artifact.list (find the key) with artifact.get (read the bytes) when an operator may have uploaded a file the agent needs to discover, or to enumerate what a multi-pack skill produced. No external deps, no NeedsSession: pure passthroughs to the existing ArtifactStore interface (already on ExecutionContext.Artifacts since T205). Each pack registers in cmd/control-plane/main.go in the always-available section. Test surface: 78 new test cases across three pack files. artifact.put covers happy path, all 9 kind defaults + unknown-kind fallback + case-insensitivity, explicit filename/content_type override of kind defaults, custom namespace, base64 round-trip, filename sanitization (absolute path, .. traversal, internal .. cleanup, . / .. / empty defaults), and 7 error paths (missing content, empty content, no store wired, bad base64, unsupported encoding, malformed JSON, store-backend failure). artifact.get covers UTF-8 vs base64 routing across 9 text content types and 7 binary content types, forced-encoding overrides in both directions, key-split parsing for filename/namespace extraction, 6 error paths (missing/empty/whitespace key, no store, not-found, malformed JSON), and a round-trip test that chains artifact.put → artifact.get to confirm the value comes back unchanged. artifact.list covers empty store, listAll, namespace filter, filename substring filter (case-insensitive + suffix matching), namespace+filename combined filter, limit + truncation, and 4 error paths. All three packs hit 97-100% per-function coverage; internal/packs/builtin stays above the 80% floor. Skill pattern documented in docs/reference/packs/artifact/put.md: every skill in ~/.openclaw/skills/ that produces audience-facing content should end its procedure with a mandatory helmdeck__artifact-put call. The pattern was introduced specifically because Tier C free models on OpenRouter (openai/gpt-oss-120b:free, meta-llama/llama-3.3-70b-instruct:free, etc.; see ADR 051 and the models reference) ignored the prose deposit step. What's NOT in this PR (explicit scope boundary): the POST /api/v1/artifacts upload endpoint that would let the management UI write operator-uploaded files into the store. That's a separate small PR with its own design questions (per-caller namespacing, MIME allowlists, size limits, auth posture). Once it lands, the round trip closes: operator uploads via REST → agent finds via artifact.list → agent reads via artifact.get → agent processes and deposits via artifact.put. Until then, artifact.list / artifact.get are still useful for inspecting pack-produced sidecars (validation.json, engagement.json, captions.srt) and for skills that chain artifacts between stages.
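As an illustration of the artifact.get encoding policy described in this entry, here is a minimal Go sketch. The function names isTextShaped and resolveEncoding are hypothetical, stripping content-type parameters such as "; charset=utf-8" is an assumption, and rejecting unknown encodings on the read side is extrapolated from the artifact.put behavior rather than stated for artifact.get.

```go
package main

import (
	"fmt"
	"strings"
)

// isTextShaped reports whether a content type falls in the text-shaped set
// listed in the entry (text/*, a few application/* types, RFC 6839 suffixes).
func isTextShaped(contentType string) bool {
	// Assumption: parameters after ';' are ignored and matching is case-insensitive.
	ct := strings.ToLower(strings.TrimSpace(strings.SplitN(contentType, ";", 2)[0]))
	if strings.HasPrefix(ct, "text/") {
		return true
	}
	switch ct {
	case "application/json", "application/yaml", "application/xml":
		return true
	}
	for _, suffix := range []string{"+json", "+xml", "+yaml"} {
		if strings.HasSuffix(ct, suffix) {
			return true
		}
	}
	return false
}

// resolveEncoding applies the caller's explicit choice first, then the
// content-type default: UTF-8 for text-shaped types, base64 otherwise.
func resolveEncoding(requested, contentType string) (string, error) {
	switch requested {
	case "utf-8", "base64":
		return requested, nil
	case "":
		if isTextShaped(contentType) {
			return "utf-8", nil
		}
		return "base64", nil
	default:
		return "", fmt.Errorf("unsupported encoding %q", requested)
	}
}

func main() {
	for _, ct := range []string{"text/markdown", "application/ld+json", "image/png"} {
		enc, _ := resolveEncoding("", ct)
		fmt.Printf("%s -> %s\n", ct, enc)
	}
}
```

Defaulting binary types to base64 keeps the JSON envelope valid regardless of the stored bytes, while the explicit override lets a caller opt into either form.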
[0.26.0] - 2026-06-05
The validation-arc release. Closes the four-phase AV-validation arc (PRs #428 / #430 / #431 / #432 / #433) — script → pack → upstream fix → default-on integration → ADR record. Token cost of "the video has issues" diagnostics drops from ~3,000 tokens per incident (manual ffprobe loop) to ~200 tokens (read validation.checks[] from the run record) per ADR 052. Sibling work: tier-aware Budget.PromptVariant for helmdeck.plan (ADR 053, #437) routes Tier C models to a single_pick one-step-at-a-time plan path, addressing the 50% multi-step plan failure rate observed against openrouter/nvidia/nemotron-3-super-120b-a12b:free during validation-arc testing. New documentation surface: operator-facing models tier reference (#439), intent-first cookbook with 17 recipes (#435 + #441), two field-report blog posts capturing the arc + cookbook-pattern thesis. Pack catalog grows from 52 → 53: new av.validate pack ships no-gateway-required.
New packs: av.validate (helmdeck__av-validate via MCP).
Operator upgrade: clean — no schema migrations, no breaking input changes, no removed packs. Validate *bool and CaptionsSidecar *bool are pointer-bool default-on with backward-compatible nil-→-on semantics. Production deployments using the published ghcr.io/tosin2013/helmdeck-sidecar:latest get the new validation script automatically on next pull (the script COPY is in the sidecar Dockerfile). Local builders using compose.build.yaml get the local override automatically per #434.
Added
- helmdeck.plan gets tier-aware prompt templates via Budget.PromptVariant (ADR 053). Motivated by empirical data captured during the validation-arc testing window on 2026-06-05: six helmdeck.plan calls against openrouter/nvidia/nemotron-3-super-120b-a12b:free for the same multi-step intent class showed 3/6 success (50%), with 33% length-truncation at the 600-token output cap and 17% near-empty responses with the canonical reasoning-token leak pattern (423-token completion, 71 chars of user-visible JSON; TokenMix measures the analogous behavior at ~40% on DeepSeek R1 with max_tokens=200). Same intent class on openrouter/auto in the same window: 2/2 clean stops at 15–34s latency. The architectural finding, captured in the new blog draft (PR #436) and the new ADR 053, is that output shape, not model size, is the right primitive: small models reliably make ONE pack-pick decision in 50–200 tokens but fail at emitting a 1,500-token multi-step plan in one shot. BFCL data confirms the multi-turn cliff at the model-family level (xLAM-2-1B at 53.97% overall but 8.38% multi-turn; Qwen3-1.7B at 55.49% overall but 16.88% multi-turn; TinyLLM, arXiv 2511.22138). Two prompt variants now ship: PromptVariantFullSteps (Tier A/B default) emits the complete pipeline JSON in one shot (today's behavior, same planSystemPrompt template); PromptVariantSinglePick (Tier C default) emits the SINGLE NEXT step + a more_steps_likely flag, and the agent re-calls helmdeck.plan with updated context to plan the next step. The output schema is the same across both variants ({steps:[], complexity, more_steps_likely, reasoning}) so the handler doesn't need to parse two response shapes; only the model's TASK changes. Selection via Budget.ResolvePromptVariant(): explicit PromptVariant field on a budget entry wins; otherwise tier defaults apply (Tier A/B → FullSteps, Tier C and unknown → SinglePick; the fallback for unknown tiers is the conservative path, matching the ADR 051 "we don't know, route to the safer path" posture).
Operators override per-entry when their per-model knowledge contradicts the tier default, e.g. a Tier C model trained specifically for tool calling that handles multi-step plans reliably should get FullSteps despite the tier default suggesting otherwise. Output additions: planOutput gains prompt_variant_used (which template fired) and more_steps_likely (set by SinglePick on the first step of a chain; always false on FullSteps). Both are omitempty, so the wire shape stays stable for callers that haven't migrated. Backward compatibility: Tier A/B behavior is identical to pre-ADR-053 (same prompt, same output, same model interactions). Only Tier C and unknown-tier models see a behavioral change, and that change is they now produce reliably parseable output where 50% of the time they previously did not. Agent loop pattern: the single_pick variant composes naturally with the MCP agent loop: agent calls helmdeck.plan → runs the step → calls helmdeck.plan again with updated context → repeats until more_steps_likely:false. Each call is a self-contained Tier-C-sized decision; the catalog projection is already cached on prefix-cache-enabled providers per ADR 051 PR #4, so the per-step cost is dominated by output tokens. Regression guards at two layers: TestResolvePromptVariant_TierDefaults + TestResolvePromptVariant_ExplicitOverride in internal/llmcontext/budgets_test.go assert the variant resolution rules; TestSelectPlanSystemPrompt in internal/packs/builtin/plan_test.go asserts template-marker presence per tier+variant (Tier A → "ORDERED sequence of tool/pipeline calls", Tier C → "Emit EXACTLY ONE step in the steps array", and the override paths in both directions). Same rule-with-test posture PR #404 introduced for the no "-c copy" audio-concat guard.
Architectural framing in ADR 053: routes by output shape, not parameter count; references the literature converging on the same point (Portkey "Smart Fallback with Model-Optimized Prompts", DSPy compile-per-LM Signatures, PLAN-TUNING arXiv 2507.07495, Pre-Act arXiv 2505.09970, Anthropic's "Building Effective Agents" essay). Future-deferred: a PromptVariantHybrid value for Tier B models that handle short multi-step plans but not full pipelines, deferred until we have empirical data on a specific Tier B model failing the current FullSteps posture; speculative variants without motivating evidence are how the variant enum bloats into a footgun.
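The selection rule for Budget.ResolvePromptVariant() described in this entry can be sketched as follows. This is an illustration under stated assumptions, not the shipped code: the Budget struct here carries only a hypothetical Tier string plus the PromptVariant field, and the string value "full_steps" is assumed (only single_pick appears in the changelog text).

```go
package main

import "fmt"

// PromptVariant names the plan prompt template to use.
type PromptVariant string

const (
	PromptVariantFullSteps  PromptVariant = "full_steps"  // assumed string value
	PromptVariantSinglePick PromptVariant = "single_pick" // value named in the changelog
)

// Budget is a trimmed, hypothetical stand-in for a budget entry.
type Budget struct {
	Tier          string        // "A", "B", "C"; anything else counts as unknown (assumed encoding)
	PromptVariant PromptVariant // empty means no explicit override
}

// ResolvePromptVariant: an explicit override wins; otherwise Tier A/B get
// FullSteps, and Tier C plus unknown tiers take the conservative SinglePick.
func (b Budget) ResolvePromptVariant() PromptVariant {
	if b.PromptVariant != "" {
		return b.PromptVariant
	}
	switch b.Tier {
	case "A", "B":
		return PromptVariantFullSteps
	default:
		return PromptVariantSinglePick
	}
}

func main() {
	entries := []Budget{
		{Tier: "A"},
		{Tier: "C"},
		{Tier: "?"},
		{Tier: "C", PromptVariant: PromptVariantFullSteps},
	}
	for _, b := range entries {
		fmt.Printf("tier=%q override=%q -> %s\n", b.Tier, b.PromptVariant, b.ResolvePromptVariant())
	}
}
```

Putting the unknown-tier case in the default branch means a newly added model with no tier metadata lands on the one-step path rather than the 1,500-token plan path.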
- av.validate default-on integration in slides.narrate + podcast.generate: Phase 3 of 4 in the validation arc. Phase 1 (#428) shipped the standalone script. Phase 2 (#430) wrapped it as the av.validate pack. Phase 3 is the token-savings payoff the entire arc was built for: every successful slides.narrate and podcast.generate run now embeds the structured validation report directly in the run output. The next "the video has issues" diagnostic costs ~200 tokens (read validation.checks[] from the run record) instead of the ~3,000-token manual ffprobe loop we ran before the validator existed. Refactor first: the core validation logic in internal/packs/builtin/av_validate.go was extracted into a reusable runAVValidation(ctx, ec, opts) (scriptReport, string, error) function. The av.validate pack handler now calls it after resolving artifact-keys to paths; the new slides.narrate and podcast.generate post-concat steps call it directly with paths already in the session tmpfs (no double-fetch overhead, which was the whole point of accepting video_path / audio_path direct inputs back in Phase 2). The function applies the known-issue demotion map, persists the validation.json sidecar under the caller's namespace, and returns the typed report. slides.narrate integration: new Validate *bool input field on the pointer-bool default-on pattern (mirrors CaptionsSidecar from PR #425 and Mermaid from PR #379: nil → on, &false → off). Validation runs at the new step 9b between video upload (step 9) and engagement metadata (step 10): runAVValidation is called with VideoPath: "/tmp/final.mp4" (still on disk in the sidecar) and CaptionsPath: captionsValidatePath (the SRT bytes written to /tmp/captions-validate.srt when sidecar is enabled but burn-in is not; a ~10 KB tmpfs write whose result is consumed by the script's srt:* + consistency:captions_coverage checks).
The artifact namespace is set to "slides.narrate" so the validation.json sidecar lives next to the engagement.json + captions.srt artifacts the pack already persists. podcast.generate integration: same pattern. New Validate *bool input; validation runs after audio artifact upload at progress ~97%. runAVValidation is called with AudioPath: "/tmp/helmdeck-podcast/final.mp3" (the path internal/podcast/concat.go uses for its concat output). Audio-only invocation means mp4:* and consistency:audio_video_duration checks skip automatically per the script's argv dispatch; only audio:packet_contiguity, audio:rms_sweep, audio:loudness_lufs, and audio:silence_runs run, which is the correct check set for an MP3 output. The artifact namespace is "podcast.generate". Output additions: both packs gain validation (the structured report, whose shape mirrors av.validate's output: {checks[], passed, failed, warnings, all_passed}) and validation_artifact_key (the persisted sidecar) in their OutputSchema.Properties. validation_artifact_key is always emitted (empty string when validate is off or the script failed) so consumers can branch on it without first checking that the key exists. validation is conditionally added: present when validate ran successfully and produced ≥1 check, absent when validate is off OR the script invocation failed. Soft-surface contract preserved (the load-bearing reason validation runs default-on): validation script-exec failures, JSON-parse failures, and validation findings (checks at any severity) all log and continue rather than failing the pack. The artifact is the value; validation is a description of the artifact. Operators who want fail-fast behavior call av.validate standalone with strict:true (Phase 2's escape hatch); the default-on integration intentionally never blocks artifact ship. scripts/pipelines-smoke.sh refactor: the mp4:av and mp3:av assertion specs gain a new validation_assert helper as the primary signal.
When the run record contains a validation field, the smoke script reads validation.all_passed and short-circuits to green-success. When the field is absent (validate explicitly disabled, OR runs from before Phase 3 shipped during avbench's cutover window), the script falls back to the legacy inline ffprobe checks (mp4_faststart_ok, audio_packets_contiguous, audio_rms_above, audio_codec_params_ok, plus the engagement/captions structural assertions). Net effect: an avbench run on a post-Phase-3 artifact now does most of its work by reading one JSON field; the inline checks become a backwards-compatibility safety net rather than the primary verification path. Regression guards at three layers: TestSlidesNarrate_ValidationDefaultOn and TestPodcastGenerate_ValidationDefaultOn confirm the handler invokes av-validate.sh when the pointer-bool is nil (default-on) AND the output schema validates even when the script fails (the soft-surface contract). TestSlidesNarrate_ValidationExplicitlyDisabled and TestPodcastGenerate_ValidationExplicitlyDisabled confirm validate:false suppresses the script call entirely. The existing TestAVValidate_NoDemotionsInForce from the #429 fix continues to assert no checks are demoted and the demotion mechanism still works. Coverage: full go test ./internal/... -race -count=1 passes 2,006+ tests across 32 packages. Coverage gate PASS at every floor (internal/packs/builtin 80.6%). Phase 4 of the validation arc remaining: ADR audit of ADR 008, ADR 015, ADR 045, ADR 051 for the implications of default-on validation + a new ADR-052 capturing the architecture (severity policy, known-issue demotion lifecycle, soft-surface contract, script-delivery via sidecar Dockerfile COPY).
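The pointer-bool default-on pattern used by Validate and CaptionsSidecar (nil → on, explicit false → off) can be illustrated with a short Go sketch. The struct narrateInput, the helper names defaultOn and parseValidate, and the JSON tags are hypothetical; only the nil-means-on semantics come from the entry above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// narrateInput is a trimmed, hypothetical stand-in for the pack input.
// A *bool distinguishes "field omitted" (nil) from "explicitly false".
type narrateInput struct {
	Validate        *bool `json:"validate,omitempty"`
	CaptionsSidecar *bool `json:"captions_sidecar,omitempty"`
}

// defaultOn treats an omitted flag as enabled and honors an explicit value.
func defaultOn(p *bool) bool {
	return p == nil || *p
}

// parseValidate decodes a raw input document and reports whether validation
// would run under the default-on rule.
func parseValidate(raw string) (bool, error) {
	var in narrateInput
	if err := json.Unmarshal([]byte(raw), &in); err != nil {
		return false, err
	}
	return defaultOn(in.Validate), nil
}

func main() {
	for _, raw := range []string{`{}`, `{"validate":false}`, `{"validate":true}`} {
		on, err := parseValidate(raw)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%-20s validate=%v\n", raw, on)
	}
}
```

A plain bool could not express this: its zero value is false, so existing callers that never send the field would silently get validation turned off.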
- av.validate pack: Phase 2 of 4 in the validation arc (Phase 1 shipped the standalone script in PR #428). The pack wraps scripts/av-validate.sh so any pipeline or agent can call validation as a typed surface and read structured findings rather than re-deriving the diagnostic flow from scratch every time. Token-savings rationale (the load-bearing motivation for the whole arc): every manual "the video has issues" diagnostic burns ~3,000 tokens of bash output + analysis. This pack collapses that to ~200 tokens once Phase 3 wires it as a default-on post-step. Pack inputs: video_artifact_key / audio_artifact_key / captions_artifact_key (fetched from the artifact store and written to /tmp/av-validate-{video,audio,captions}.{mp4,mp3,srt} in the session before invoking the script), OR video_path / audio_path / captions_path (direct paths, useful for chained-pack scenarios where the file is already in the session /tmp; this eliminates double-fetch overhead, which Phase 3 will rely on). Plus ebur128_target (default -14 LUFS, YouTube spec; -23 for broadcast), skip_checks (comma-separated; video:freeze_runs is default-skipped because slide-deck videos hold a static image per slide and that check false-positives 100%), and strict (boolean, default false). Pack outputs: validation (object with checks[], passed, failed, warnings, all_passed mirroring the script's --json shape) + validation_artifact_key (the persisted validation.json sidecar, same pattern as engagement.json / captions.srt from #424 / #425). Severity policy is honest: the script reports each check at its natural severity (fail for matches-shipped-bug-fixes, warn for soft heuristics). The pack then overrides the script's severity for checks listed in an internal knownIssueDemotions map. When a fail-severity check is in the map, the pack demotes it to warn and appends the tracking-issue reference to the detail string. Current demotions (will shrink as fixes land): consistency:audio_video_duration → demoted to warn per issue #429.
The 888de7b23142ba81 artifact diagnostic during Phase 1 development surfaced that PadAudioToMin produces duration-stretched AAC packets at slide boundaries: exactly 13 packets summing 26.246s on the symptom artifact, matching the 25.9s timeline-vs-content discrepancy. The audio PLAYS correctly (665s narration + 26s inter-slide pauses = 691s timeline); the container metadata over-claims because each silence-pad becomes a single AAC frame with stretched duration metadata. The demotion is coupled to the tracking issue, not to a release calendar: when the fix lands (replace the GenerateSilence + ConcatAudio pad with an -af apad filter in runSegmentEncode; ~30 LOC), the same PR removes the entry from knownIssueDemotions, bumping severity back to fail together with the underlying fix. Same-PR coupling makes the regression guard impossible to silently leave behind. Default behavior is soft-surface (strict:false): the pack returns success even when checks fail; the findings ARE the output. The orchestrating LLM agent reads validation.all_passed, sees the specific check names + details, and decides whether to retry / escalate / report, matching the project norm "honest output > convenient lie" and the typed-error model from ADR 008 where errors are for "couldn't proceed," not for quality findings. Strict mode (strict:true): any fail-severity check failure after demotion surfaces as a typed CodeArtifactFailed error with the failing check names in the message. Use this for CI publish gates and downstream consumers that can't tolerate processing a structurally-invalid artifact. Runtime-error vs check-finding distinction is intentional (closes a class of confusing-error bugs): exit code 2 from the script (missing dependency / usage error, meaning validation DIDN'T RUN) returns CodeHandlerFailed; failed checks (validation RAN AND REPORTED FINDINGS) return success with the findings in the output unless strict:true.
Script-delivery mechanism: deploy/docker/sidecar.Dockerfile gains a COPY scripts/av-validate.sh /usr/local/bin/av-validate.sh && chmod +x step (alongside the existing helmdeck-entrypoint copy). The pack handler invokes the script via session exec at the stable /usr/local/bin/av-validate.sh path: no Go //go:embed complexity, no file duplication. ffprobe / ffmpeg / python3 (the script's only dependencies) are already in the sidecar from earlier installs; PR #425's libass smoke check confirms ffmpeg has the filters the script's silencedetect / volumedetect / blackdetect / ebur128 calls rely on. Test surface: 7 new unit tests in internal/packs/builtin/av_validate_test.go: input validation (no v/a inputs → CodeInvalidInput); happy-path JSON parse + sidecar artifact persist; known-issue demotion (the #429-class JSON in → all_passed:true, warnings:1, detail string contains #429); strict-mode surface (strict:true + fail-severity check → CodeArtifactFailed naming the failing check); soft-surface default (same inputs without strict → success with findings); script-exit-2 distinction (CodeHandlerFailed, not check failure); argv wiring (paths, ebur128_target, skip_checks all flow through; --json is always passed). Pack registers in cmd/control-plane/main.go in the always-available section (no LLM, vault, or egress-guard deps). Not yet integrated as a post-step on slides.narrate / podcast.generate: that's Phase 3, explicitly deferred until this Phase 2 pack has been called against 5-10 real artifacts (avbench monthlies + ad-hoc operator invocations) to confirm the false-positive rate after demotion is acceptable. The validation arc remaining: Phase 3 (default-on integration), Phase 4 (ADR audit + new ADR-052 capturing the architecture).
- scripts/av-validate.sh: standalone validator for slides.narrate / podcast.generate AV artifacts (Phase 1 of 4 in the validation arc). Every time an operator reports "the video has issues" we run the same manual ffprobe sweep: auth → fetch artifact → check faststart → sample RMS at intervals → verify packet contiguity → eyeball duration parity. The 888de7b23142ba81-video.mp4 diagnostic we just ran burned ~3,000 tokens of bash output + analysis to discover an audio/video duration mismatch (27.930s of trailing video-without-audio past the audio stream's end), a finding that's trivially expressible as a single JSON field. This script is the executable spec for that diagnostic: a 350-LOC bash + python3 + ffprobe/ffmpeg validator that takes a video/audio/captions path and emits either a colored human report or a structured JSON document. Phase 2 will wrap it as an av.validate pack; Phase 3 will integrate that pack as a default-on post-encode step on slides.narrate and podcast.generate so the validation result lands in the run record's validation field, collapsing the next "video has issues" diagnostic from ~3,000 tokens to ~200. Phase 4 audits the relevant ADRs (008/015/045/051) and lands a new ADR-052 capturing the architectural decisions.
Check set, calibrated to the bugs we've actually shipped fixes for (each labeled fail-severity, exits the script non-zero on regression): mp4:faststart (PR #422; moov-before-mdat via pure-Python byte scan, no ffprobe dep), mp4:codec_pin (PR #421; h264 + aac LC + 44.1 kHz pinned via ffprobe -show_entries stream), mp4:bitstream_decode (research §"Deep Bitstream Decoding"; ffmpeg -v error -xerror -err_detect crccheck+bitstream+buffer -f null - as a null-muxer pass, catches macroblock corruption that survives the muxer but fails decoders), audio:packet_contiguity (PR #423-class; packet pts gap > 0.5s indicates the ElevenLabs partial-response cascade), audio:rms_sweep (5-point sweep, -45 dB floor, catches silent-fallback regressions), consistency:audio_video_duration (the bug we just found; audio_content_duration = aframes × 1024 / sample_rate vs container format=duration, 1s tolerance), srt:first_cue_anchor (PR #425; must be exactly 00:00:00,000 for YouTube CC import), srt:comma_separator (period decimal silently parses as hours in some libass builds, producing 7-hour-offset captions), consistency:captions_coverage (last SRT cue end within 2s of audio_content_duration). Plus three warn-severity heuristics that surface for review but don't fail the run: audio:loudness_lufs (EBU R128 integrated loudness, YouTube target -14 ± 2 LUFS via the ebur128 filter; drift surfaces operators shipping out-of-spec audio that platforms then normalize aggressively), audio:silence_runs (silencedetect=noise=-50dB:d=2, ≥2s runs flagged; could be legitimate between-slide pauses), video:black_runs (blackdetect=d=2.0:pix_th=0.10; catches marp render failures inserting accidental long black frames). The video:freeze_runs check (freezedetect=n=-60dB:d=2) is implemented but default-skipped via the SKIP_CHECKS env var because slides.narrate output is static-image-per-slide by design: every slide IS technically a freeze, so the check false-positives 100% of the time on our dominant use case.
Talking-head pipelines (none exist yet) should --no-skip it. Operator interface: make av-validate VIDEO=/path/to.mp4 CAPTIONS=/path/to.srt JSON=1 or call the script directly with --video / --audio / --captions / --json / --ebur128-target / --skip-checks. Exit code 0 = no fail-severity check failed (warns may be present); exit 1 = at least one fail-severity check failed; exit 2 = usage error or missing dependency. Acceptance test against the artifact that motivated this work (slides.narrate/888de7b23142ba81-video.mp4): the script correctly fires consistency:audio_video_duration with container=693.344s audio_content=665.414s delta=27.930s exceeds 1s tolerance and exits 1, while every other applicable check passes, confirming the script catches what the manual diagnostic found AND doesn't false-positive on the surrounding healthy parts of the artifact. What this PR explicitly does NOT do (per the plan's "what this deliberately doesn't" section): no MP4Box/GPAC integration (CVE risk per CVE-2026-9572 / CVE-2026-7135 / CVE-2025-70116; functionally redundant with ffprobe for our use case where we control encoding); no Bento4 mp4dump deep atom inspection (overkill); no mp3val / mp3check (we control encoding so garbage MP3 frames aren't a realistic failure mode); no QCTools / qcli analog-tape forensics (we don't have analog tape); no MediaConch policy compliance (no operator has asked for institutional-archive policy schemas); no untrunc repair tooling (fix root causes upstream in the encoder, not patch corrupted output); no pack wrapping yet (that's Phase 2, deliberately deferred until the standalone script has been run against 5-10 real artifacts to confirm the false-positive rate is acceptable).
Reusable patterns leaned on: the green / red / yellow color helpers + ffprobe wrapper functions from scripts/pipelines-smoke.sh (audio_packets_contiguous, audio_rms_above, audio_codec_params_ok, mp4_faststart_ok, captions_assert) are lifted with path-arg wrappers, so Phase 3 will refactor the pipelines-smoke.sh mp4:av / mp3:av checks to read the validation field from the run record instead of re-implementing the same probes inline, a net LOC reduction across the codebase once Phase 3 lands.
- Multi-model recovery matrix workflow + openrouter/auto-as-default decision rule (v0.26.0 candidate). PR H of the v0.25.0 arc proved the cheap-model bet against ONE pinned model (openai/gpt-oss-120b:free recovers correctly on all 5 typed-error scenarios at ≥7/10). This PR takes the proof and turns it into a discovery mechanism: which other free models on OpenRouter handle helmdeck's typed-error contract reliably, and should openrouter/auto be surfaced as a recommended default for users without a configured API key? NEW .github/workflows/model-discovery.yml: weekly Wednesday 06:00 UTC (different day from model-recovery.yml's Sunday so the two don't compete for the runner pool). 4-row matrix with fail-fast: false: openai/gpt-oss-120b:free (required tier, MUST pass) + google/gemma-4-31b-it:free (observational, threshold modifier -1 for the size gap) + nvidia/nemotron-3-ultra-550b-a55b:free (observational, modifier 0) + openrouter/auto (observational, modifier -1 for per-call routing variance). continue-on-error: ${{ matrix.tier != 'required' }} means only the pinned model can fail the workflow; observational rows publish their per-scenario scores but don't block. An aggregator job downloads every per-model report and posts a combined summary table to the run page with a 3-state status legend: ✓ (all scenarios passed), ⚠ (at least one below threshold but received responses), ✗ (at least one scenario fully dark; provider may be deprecated). NEW: internal/reliability/recovery_test.go gains the HELMDECK_RECOVERY_THRESHOLD_MODIFIER env var. Default 0 preserves v0.25.0 single-pinned-model behavior; the matrix workflow sets per-row modifiers so weaker observational models can have honest lower thresholds without globally weakening the v0.25.0 reliability bet. Gemma-4-31B at threshold-1 is the same reliability story as gpt-oss-120B at threshold+0 ("reliably correct in 60% of cases for the smaller model" vs "70% for the larger"), documented per-model so the comparison stays honest.
Floor-clamped at 1: even the most accommodating row demands "model emitted a usable recovery at least once." NEW .github/workflows/model-discovery-alert.yml: separate workflow with issues: write permission scoped HERE only. Chains off model-discovery.yml via workflow_run. Opens (or comments on) a GitHub issue when an observational row scores 0/N on at least one scenario; "fully dark" signals provider deprecation, model-id rotation, or upstream unreachability. A row at 4/10 is normal variance and does NOT trigger an alert; the narrow 0/N threshold avoids weekly issue spam. Duplicate-issue avoidance: searches for an open issue with label model-discovery-alert and the exact model in the title; if found, comments on it instead of opening a duplicate. Splitting the alert into a separate workflow confines the elevated issues: write permission to ~50 lines of YAML and keeps model-discovery.yml at contents: read, giving a smaller blast radius if either workflow is ever compromised. NEW docs/howto/multi-model-recovery.md: operator-facing guide covering per-row purpose, threshold-modifier rationale, summary-table reading guide, and the load-bearing decision rule for openrouter/auto-as-helmdeck-default: ≥7/10 across all 5 scenarios for 6 consecutive weekly runs → surface as recommended default; <5/10 on any scenario → never offer; between those lines → document the gaps in the howto and leave routing to the operator. The rule is in the howto (not in code) because it's a product decision informed by the matrix data. What's NOT in v0.26.0 (deliberate scope decisions): the actual UI change to recommend openrouter/auto as a default, which lands AFTER the 6-week observation window produces the evidence, in a separate small PR that cites the matrix run window. A long-term trend dashboard rendering per-scenario scores across weekly runs is deferred until maintainers find themselves diffing artifacts often.
Auto-swap of the required row when an observational row outperforms is deliberately manual so a maintainer confirms the new pin. Combined cadence after this PR: model-recovery.yml (pinned model, weekly Sunday) + model-recovery + mutation.yml (decision-dense code mutation, daily 04:00 UTC) + model-discovery.yml (4-model matrix, weekly Wednesday) = three workflows producing reliability signal across different time horizons. Total cost: ~260 runner-min/month (free tier easily covers, GitHub Pro comfortable).
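The per-row threshold arithmetic and the 3-state legend from this entry can be sketched as below. All function names are hypothetical, the base threshold of 7 is read off the "≥7/10" bar quoted above, and equating a "fully dark" scenario with a score of 0 is an assumption based on the 0/N alert rule.

```go
package main

import "fmt"

// effectiveThreshold applies a per-row modifier and clamps the result at 1,
// so every row must produce at least one usable recovery.
func effectiveThreshold(base, modifier int) int {
	t := base + modifier
	if t < 1 {
		t = 1
	}
	return t
}

// rowStatus maps per-scenario scores to the summary legend:
// "✗" if any scenario scored 0, "⚠" if any is below threshold, else "✓".
func rowStatus(scores []int, threshold int) string {
	status := "✓"
	for _, s := range scores {
		if s == 0 {
			return "✗"
		}
		if s < threshold {
			status = "⚠"
		}
	}
	return status
}

// shouldAlert mirrors the narrow alert rule: only a 0/N scenario opens an issue.
func shouldAlert(scores []int) bool {
	for _, s := range scores {
		if s == 0 {
			return true
		}
	}
	return false
}

func main() {
	base := 7 // assumed base threshold out of 10
	scores := []int{8, 6, 9, 7, 10}
	th := effectiveThreshold(base, -1)
	fmt.Println(th, rowStatus(scores, th), shouldAlert(scores))
}
```

With a modifier of -1 the same scores that would show ⚠ against the pinned model's threshold of 7 show ✓ for an observational row, which is the "honest lower threshold" the entry describes.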
- Captions/SRT support on slides.narrate: sidecar default-on, burn-in opt-in (v0.26.0 candidate). PR #424 declared engagement.captions_recommended: true but didn't actually produce captions, so operators saw the recommendation but had to write SRT files themselves. This PR makes the recommendation actionable. Sidecar SRT (default-on): a captions.srt artifact persisted alongside the MP4. YouTube/Vimeo auto-import as the CC track via Studio "Subtitles → Upload file → With timing", the path that backs the research-cited ~12-13% YouTube view boost (sidecar CC, NOT burn-in). Essentially free: a few KB of bytes per run, zero encode cost. Burn-in (opt-in via captions_burn_in:true): renders captions into every frame via ffmpeg's libass subtitles= filter. Required on platforms that don't surface CC tracks (Twitter/X embedded videos, LinkedIn embeds, raw MP4 downloads viewed in players without CC support). The two outputs feed from the same SRT byte stream: generating once and persisting as sidecar is the cheap default; burn-in adds the encode cost + OOM risk when explicitly requested. NEW internal/packs/builtin/slides_captions.go: pure-function buildSRT(slides, durations) []byte + formatSRTTimestamp(seconds) string. Kept out of slides_narrate.go (already ~1,270 LOC) to match the slides_notes.go separation pattern. formatSRTTimestamp is intentionally DISTINCT from the existing formatTimestamp (M:SS with period, used by YouTube chapter markers in the engagement object): the SRT spec mandates the wider HH:MM:SS,mmm field AND a COMMA decimal separator; using a period would parse as hours in some libass builds and produce 7-hour-offset captions. Co-located rationale comments cite the spec source so a future refactor can't conflate them. Text normalization inside buildSRT: CRLF/CR → LF (paste-from-Word safety), per-cue whitespace strip, empty-notes → single literal space (preserves cue numbering so cue N corresponds to slide N+1; operators reviewing the .srt by eye need this alignment for sane debugging).
Pack inputs: `captions_sidecar *bool` (mirrors the `Mermaid` pointer-bool default-on shape: `nil` ⇒ on, explicit `false` ⇒ off) and `captions_burn_in bool` (default false). OutputSchema additions: `captions_artifact_key: "string"` (empty when sidecar suppressed or artifact-store Put failed) + `captions_burned_in: "boolean"` (ALWAYS emitted so consumers can branch on its presence, even when false). Handler integration sits between the close of the audio-generation loop (durations finalized) and the start of the per-segment encode loop, the only point where both `slides[]` and `durations[]` are simultaneously known. Burn-in wiring appends `,subtitles=/tmp/captions.srt` to the existing `vf` chain right after the fade-filter block: same comma-separated filter-chain shape, no escaping needed because `/tmp` paths have no spaces or quotes. Failure semantics are intentionally soft: sidecar artifact-store Put failures log + continue (captions are auxiliary; failing a 3-minute encode over an artifact-store hiccup is worse than degraded output); burn-in write failures degrade to no-burn rather than failing the segment encode. Pipeline wiring: `builtin.repo-presentation` already inherits the default-on sidecar via the pack default; `Produces` updated to include `srt_captions`; `Limitations` gains a captions-honesty entry alongside the existing engagement-honesty entry, distinguishing the cheap sidecar path from the costly burn-in. Sidecar Dockerfile smoke (deploy/docker/sidecar.Dockerfile): the existing `ffmpeg -version` check is extended with `ffmpeg -filters | grep -q ' subtitles '` so an image build fails LOUDLY if a future apt change drops libass support from the `ffmpeg` package; this prevents the confusing `Unrecognized option 'subtitles'` exit at run-time that would otherwise surface only on the first `captions_burn_in:true` run. Burn-in OOM honesty per [[feedback-pipeline-description-honesty]] (and explicitly user-confirmed during planning): document the risk, don't engineer around it. 
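The pointer-bool default-on shape described above can be illustrated with a short Go sketch. The struct name `captionInputs`, its JSON tags, and the `sidecarEnabled` helper are assumptions for the example; only the field semantics (`nil` means on, explicit `false` means off) come from the entry.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// captionInputs is a hypothetical stand-in for the pack's input struct.
// A *bool distinguishes "field absent" (nil) from "explicitly false".
type captionInputs struct {
	CaptionsSidecar *bool `json:"captions_sidecar,omitempty"`
	CaptionsBurnIn  bool  `json:"captions_burn_in,omitempty"`
}

// sidecarEnabled applies the default-on rule: nil or true means on.
func (in captionInputs) sidecarEnabled() bool {
	return in.CaptionsSidecar == nil || *in.CaptionsSidecar
}

// parseInputs decodes a JSON input document into captionInputs.
func parseInputs(raw string) (captionInputs, error) {
	var in captionInputs
	err := json.Unmarshal([]byte(raw), &in)
	return in, err
}

func main() {
	for _, raw := range []string{`{}`, `{"captions_sidecar":false}`, `{"captions_burn_in":true}`} {
		in, err := parseInputs(raw)
		if err != nil {
			fmt.Println("parse error:", err)
			continue
		}
		fmt.Printf("%-30s sidecar=%v burn_in=%v\n", raw, in.sidecarEnabled(), in.CaptionsBurnIn)
	}
}
```

A plain `bool` cannot express default-on because its zero value is indistinguishable from an explicit `false`; the pointer keeps the "not provided" case separate.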
The pack `Description` warns that burn-in adds 5-50% encode wall-clock + 20-50 MB per encoder thread; on memory-tight hosts with large decks the existing OOM-retry path may fire AND (if libass-with-threads=1 also OOMs) fail the run. No auto-fallback retry that silently drops captions (would violate honest-output preference); no preflight memory check (would add brittle estimation that's hard to keep accurate). The `engagement.format_ceiling_note` already establishes the precedent that helmdeck describes real constraints rather than papering over them. Test surface: 5 new pure-function tests in `slides_captions_test.go` (cue numbering, timestamp format with HH:MM:SS,mmm + comma separator + cumulative arithmetic, empty-notes cue preservation, multiline normalization, timestamp edge cases) + 4 new handler-level tests in `slides_narrate_test.go` (sidecar default-on emits non-empty `captions_artifact_key`; explicit `captions_sidecar:false` suppresses; `captions_burn_in:true` wires `,subtitles=/tmp/captions.srt` into the per-segment ffmpeg argv AND sets `captions_burned_in:true`; the new output keys round-trip through `OutputSchema.Validate`). The existing `TestSlidesNarrate_RealOutputMatchesSchema` schema-contract test continues to gate `Engine.Execute` validation for the wider output shape. Pipeline-smoke / avbench (#423) deep asserts: the `pipelines-smoke.sh` `mp4:av` spec gains a `captions_assert` block that extracts `captions_artifact_key`, fetches the SRT via the existing `fetch_artifact` helper, asserts size > 30 bytes (sane floor), and greps for both `-->` (cue separator) AND `00:00:00,000` (YouTube-acceptance signature: comma decimal, NOT period). Graceful CAPTIONS_ABSENT skip when the operator disabled the sidecar via `captions_sidecar:false`. 
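The three SRT smoke checks listed above (size floor, cue separator, comma-anchored first cue) are implemented in the shell script; the following is only a Go rendition of the same logic for readers who want to reuse it in a unit test. The function name `checkSRT` and the error messages are assumptions.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
)

// checkSRT mirrors the smoke asserts described in the entry:
// size > 30 bytes, a "-->" cue separator, and a first cue anchored
// at 00:00:00,000 (comma decimal, not period).
func checkSRT(data []byte) error {
	if len(data) <= 30 {
		return errors.New("srt is 30 bytes or smaller")
	}
	if !bytes.Contains(data, []byte("-->")) {
		return errors.New("no cue separator found")
	}
	if !bytes.Contains(data, []byte("00:00:00,000")) {
		return errors.New("no 00:00:00,000 anchor (comma separator) found")
	}
	return nil
}

func main() {
	good := []byte("1\n00:00:00,000 --> 00:00:02,500\nHello\n\n")
	bad := []byte("1\n00:00:00.000 --> 00:00:02.500\nHello\n\n") // period separator
	fmt.Println("good:", checkSRT(good))
	fmt.Println("bad: ", checkSRT(bad))
}
```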
Out of scope (deferred per the plan, each with one-line justification): per-word/karaoke captions (TTS gives per-cue timing only; word-level alignment needs Whisper or a forced-aligner, separate pack); caption styling (font/color/position would need ASS/SSA format and a styling schema; libass renders SRT in a serviceable default for v0); podcast.generate captions (audio-only output, no canvas to burn into); WebVTT (.vtt) sidecar (YouTube/Vimeo accept both; defer until an operator hits a platform that doesn't accept SRT); multi-language captions (operator can translate the SRT externally; bundled translation needs an LLM step + per-language artifact persistence); thumbnail with caption preview (orthogonal feature); auto-language-detect filename (`video.en.srt` matching; YouTube auto-imports via Studio manual upload regardless of filename; a future PR can detect language from `engagement.language` and rename). All affected packages pass `go test ./internal/packs/builtin/ -race -count=1` (867 tests; +9 vs the engagement PR baseline of 858 in this package alone, adjusted up after the 5+4 new tests landed).
-
Engagement-metadata best practices baked into `slides.narrate` + `podcast.generate` (v0.26.0 candidate). Operator question: "should there be built-in best practices for video / podcast generation in the packs? what makes a YouTube video or podcast actually get views?" External research surfaced concrete, research-validated rules for each platform (YouTube official chapter spec, retention-curve data on the first 30 seconds, Apple Podcasts chapters guidance, Buzzsprout 2025 listen-duration data, Podcasting 2.0 namespace) AND the honest reality that slide-deck-with-voiceover videos sit in the lower retention bracket vs talking-head regardless of metadata polish (5-12pp structural gap that no prompt closes). This PR bakes the rules in as hard prompt constraints, ships them as a typed `engagement` output object on both packs, and surfaces the format-ceiling reality in three machine-readable places so the system can't silently drift optimistic. `slides.narrate.engagement` (YouTube-shaped): `{title, title_char_count, description, chapters:[{timestamp,title,seconds}], hashtags, tags, hook_30s, captions_recommended, category, language, format_ceiling_note}`. Structural rules enforced by the prompt: title 45-55 chars target (hard cap 60), first chapter MUST be at `0:00`, ≥3 chapters when video > 7min, ≥10s between chapter starts, 3-5 hashtags, hook follows the pattern-interrupt → payoff-promise → commitment-hook structure that retention research validates. `podcast.generate.engagement` (Apple Podcasts + Podcasting 2.0): `{title, subtitle, summary, show_notes_md, chapters:[{startTime,title}], hook_30s, cta:{placement,copy}, language, format_ceiling_note}`. Structural rules: title 60-80 chars takeaway-first, `chapters[0].startTime` always `0`, ≥3 chapters when episode > 10min and ≥120s each, `cta.placement` is force-overridden to `"mid-roll"` server-side regardless of what the LLM emitted, a defensive layer that means a future prompt drift can't silently flip the research-validated placement. 
Operator-overridable inputs (per the plan §2): `metadata_model` (podcast: default-on at `openrouter/auto`, pass `""` to disable; slides: stays opt-in for back-compat), `cta_style` (podcast: `natural`/`direct`/`none`), `hashtag_count` (slides: clamped to 3-5), `category` + `language` (slides + podcast, server-authoritative override of LLM-emitted values). Everything else (chapter floors, char caps, hook structure, 0:00 anchor) is non-overridable: the research is unambiguous and an override would just let drift back to the patterns the research warns against. Sidecar artifact: both packs persist `engagement.json` alongside the binary artifact (mirrors the existing `metadata.json` pattern slides.narrate already had). New OutputSchema fields: `engagement` (object), `engagement_artifact_key` (string) on both packs. BREAKING change on slides.narrate: the v0.25.x `metadata` + `metadata_artifact_key` fields are renamed to `engagement` + `engagement_artifact_key`. helmdeck is pre-1.0 (CHANGELOG header authorizes breaking changes per minor release); the renamed path was already opt-in via `metadata_model` so consumers are power-users who can adapt. Engagement is a strict superset of the old metadata shape (gains `chapters` as a structured array, `hashtags`, `hook_30s`, `captions_recommended`, `title_char_count`, `format_ceiling_note`). Three-layer format-ceiling honesty per the user's stored preference ("pipeline descriptions must match the mechanism"): (1) `engagement.format_ceiling_note` constant string baked into both pack output objects; slides.narrate carries the explicit talking-head retention-gap note; podcast.generate carries the solo-vs-cohost honest caveat. (2) `PipelineMetadata.Limitations` entries added to `builtin.repo-presentation`, `builtin.repo-readme-podcast` (newly `.withMeta()`-promoted), and `builtin.prompt-narrated-video` (also newly `.withMeta()`-promoted); each pipeline now declares the engagement-metadata reality alongside its existing constraints. 
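The server-side normalization rules in this entry (hashtag clamp, operator values winning over LLM-emitted ones, forced CTA placement) can be sketched as below. All type and function names are hypothetical; only the rules themselves (3-5 clamp, operator precedence, `"mid-roll"` override) are taken from the text.

```go
package main

import "fmt"

// cta is a hypothetical stand-in for the podcast engagement CTA object.
type cta struct {
	Placement string
	Copy      string
}

// clampHashtagCount keeps the operator-supplied count inside 3-5.
func clampHashtagCount(n int) int {
	if n < 3 {
		return 3
	}
	if n > 5 {
		return 5
	}
	return n
}

// preferOperator returns the operator-supplied value when present,
// otherwise whatever the LLM emitted (server-authoritative override).
func preferOperator(operator, llm string) string {
	if operator != "" {
		return operator
	}
	return llm
}

// enforceCTAPlacement overwrites the placement regardless of LLM output.
func enforceCTAPlacement(c cta) cta {
	c.Placement = "mid-roll"
	return c
}

func main() {
	fmt.Println(clampHashtagCount(9), clampHashtagCount(1), clampHashtagCount(4))
	fmt.Println(preferOperator("es", "en"), preferOperator("", "en"))
	fmt.Printf("%+v\n", enforceCTAPlacement(cta{Placement: "pre-roll", Copy: "Subscribe"}))
}
```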
(3) Pack `Description` suffix on `slides.narrate`: keeps the catalog (which agents read first via `helmdeck://packs`) consistent with the run-time output. Pipeline wiring: `builtin.repo-presentation` threads `metadata_model:"openrouter/auto"` into its `narrate` step so pipeline runs get engagement metadata default-on (the bare pack stays opt-in). `builtin.repo-readme-podcast` inherits the podcast's default-on behavior automatically. Test surface: 6 new unit tests on slides side (engagement shape, disabled path, operator-override-LLM precedence, hashtag-count clamp, existing happy-path retrofitted to assert constant enrichment) + 3 new on podcast side (default-on engagement, disabled-via-empty-string, custom cta_style/language prompt-shape verification). The existing schema-contract tests (`TestSlidesNarrate_RealOutputMatchesSchema`, `TestPodcastGenerate_RealOutputMatchesSchema`) continue to gate `Engine.Execute` → `OutputSchema.Validate` so a future field rename can't silently violate the declared schema; this closes the [[feedback-pack-tests-bypass-execute-validation]] gap for this PR's surface. Pipeline-smoke / avbench (#423) deep asserts: the `pipelines-smoke.sh` `mp4:av` spec now asserts `engagement.title_char_count <= 60`, `engagement.chapters[0].timestamp == "0:00"`, `len(engagement.chapters) >= 3` when video > 7min; `mp3:av` asserts `engagement.chapters[0].startTime == 0`, ≥3 chapters when episode > 10min, `engagement.cta.placement == "mid-roll"`. Engagement helpers degrade gracefully when the field is absent (yellow `ENGAGEMENT_ABSENT` line) so pipelines without `metadata_model` set don't false-fail. 
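The slides-side structural rules asserted above (60-char title cap, `0:00` first chapter, chapter floor above 7 minutes, 10-second spacing) can be expressed as a small checker. This Go sketch is illustrative only: the `chapter` type and `checkSlidesEngagement` function are hypothetical, and treating an empty chapter list as a missing `0:00` anchor is an assumption.

```go
package main

import "fmt"

// chapter is a hypothetical, trimmed view of an engagement chapter.
type chapter struct {
	Timestamp string // e.g. "0:00"
	Seconds   int    // chapter start in seconds
}

// checkSlidesEngagement returns one message per violated rule.
func checkSlidesEngagement(titleChars, videoSeconds int, chapters []chapter) []string {
	var violations []string
	if titleChars > 60 {
		violations = append(violations, "title exceeds the 60-char hard cap")
	}
	if len(chapters) == 0 || chapters[0].Timestamp != "0:00" {
		violations = append(violations, "first chapter must be at 0:00")
	}
	if videoSeconds > 7*60 && len(chapters) < 3 {
		violations = append(violations, "videos over 7min need at least 3 chapters")
	}
	for i := 1; i < len(chapters); i++ {
		if chapters[i].Seconds-chapters[i-1].Seconds < 10 {
			violations = append(violations, "chapter starts must be at least 10s apart")
			break
		}
	}
	return violations
}

func main() {
	ok := []chapter{{"0:00", 0}, {"2:30", 150}, {"6:00", 360}}
	fmt.Println(checkSlidesEngagement(52, 480, ok))
	bad := []chapter{{"0:05", 5}, {"0:09", 9}}
	fmt.Println(checkSlidesEngagement(72, 480, bad))
}
```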
Out of scope (deferred per the plan §"Out of scope"): transcript/SRT caption artifact (research-validated ~13% YouTube view boost; distinct artifact + handler work; user explicitly chose to defer), thumbnail generation (better as a dedicated `image.thumbnail` pack with aspect-ratio + face-prominence rules), end-screen / cards JSON (publish-time concern → future `youtube.publish` pack), auto-publish to YouTube/Spotify (needs OAuth + credential contract), A/B title variants (requires a downstream picker that doesn't exist), RSS feed XML generation (engagement object carries the data; serialization is a publish concern), B-roll/motion-graphics insertion (the actual lever against the format ceiling; touches the codec path, needs its own pack), and auto-validation of LLM-emitted chapters against the structural rules at LLM-call time (the prompt enforces; if the LLM violates, we accept what it produced and let avbench catch drift over time rather than pad with stub chapters, which would be dishonest output). All affected packages pass `go test ./internal/packs/builtin/ ./internal/pipelines/... -race -count=1` (856 tests across 2 packages).
-
Monthly `avbench` workflow + `mp4:av` / `mp3:av` deep asserts in `pipelines-smoke.sh`: catches the bug class unit tests structurally can't see. Motivating bug: PR #422 fixed an MP4 `+faststart` regression that had silently shipped with every helmdeck-produced video since #379 (six months, the entire lifetime of `slides.narrate`). Every unit test in `internal/avenc/audio_test.go` pinned the ffmpeg argv shape; the broken file had a valid-looking command line, so coverage stayed green at 99.3% while every operator's MP4 was streaming-broken. The only test that would have caught it is one that runs the real pipeline end-to-end and ffprobes the output. This is that test. NEW `.github/workflows/avbench.yml`: runs the first Sunday of every month at 04:00 UTC (different time from `model-recovery.yml`'s 06:00 Sun + `model-discovery.yml`'s 06:00 Wed so the three workflows don't compete for runner slots) plus `workflow_dispatch` for ad-hoc runs. Brings up the full helmdeck stack from current source via `scripts/install.sh --no-smoke --no-embeddings`, runs `builtin.repo-presentation` (slides.narrate path → MP4) + `builtin.repo-readme-podcast` (podcast.generate path → MP3) against a configurable public repo (default `tosin2013/helmdeck`), deep-verifies each artifact, uploads on failure with 30-day retention, tears down. NEW deep-assert specs in `scripts/pipelines-smoke.sh`, extending the existing `mp4` and `mp3` assertions which previously only checked file magic + minimum size (the bugs they could see were "file is empty" / "wrong format", both rare). 
The new `mp4:av` spec adds: (1) faststart layout check via pure-Python `moov`-before-`mdat` scan, always runs (no external dep); the regression-impossibility guard for #422; (2) audio packet contiguity via `ffprobe -show_packets`, fails on any consecutive packet gap > 0.5s, which catches mid-segment dropouts where packets simply stop, the class of bug an ElevenLabs `200 OK` with truncated body would cascade into; (3) RMS sanity sampled at 5 evenly-spaced 2-second windows across the file, fails if any window's mean is below -45 dB, which catches the "TTS silent-fallback fired for slide N" failure mode; (4) codec/sample-rate verification (`aac` + 44100 Hz for MP4, `mp3` + 44100 Hz for MP3), which catches encoder drift (codec swap, sample-rate not pinned). The `mp3:av` spec applies (2)-(4) but skips (1) since MP3 has no `moov` atom. Both specs degrade gracefully when ffprobe isn't installed (yellow "audio NOT verified" line), same posture `pdf_pages` takes for `pdfinfo`; this keeps the script useful on hosts without the optional dep. NEW gate: `elevenlabs` checks `HELMDECK_ELEVENLABS_API_KEY` is reachable to the control-plane (env-var fallback path) before running an av case. Like `firecrawl`/`docling`, when the gate isn't satisfied the case is SKIPPED (not failed) so PR contributors without a TTS key get clean green local runs. Cost analysis: ElevenLabs Creator-tier per-character rate × ~3,000-4,500 chars per run × 12 runs/year = ~$1/year in TTS credits at current rates. GitHub Actions: 2-3 minutes per run × 12/year = ~30 runner-minutes/year. Artifact storage: ~10 MB × 30-day retention = ~300 MB-days/year. All three are trivially small. What this catches that unit tests can't (the load-bearing claim): regressions in the FINAL artifact that look fine at every other layer. Container muxing flag drift (the #422 shape). ElevenLabs API-shape change. Native AAC encoder regression. Sample-rate-not-pinned reintroduction. Silent-fallback firing on a slide that should have audio. 
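The faststart check in (1) amounts to walking the top-level MP4 boxes and seeing whether `moov` appears before `mdat`. The shipped check is described as pure Python inside the smoke script; the Go version below is a sketch of the same idea, with `topLevelBoxOrder`, `isFaststart`, and the synthetic `box`/`concat` helpers all being names invented for the example.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// topLevelBoxOrder walks top-level ISO BMFF boxes (4-byte big-endian
// size, 4-byte type) and returns the box types in file order.
func topLevelBoxOrder(data []byte) ([]string, error) {
	var order []string
	off := 0
	for off+8 <= len(data) {
		size := uint64(binary.BigEndian.Uint32(data[off : off+4]))
		typ := string(data[off+4 : off+8])
		header := uint64(8)
		switch size {
		case 0: // box extends to end of file
			size = uint64(len(data) - off)
		case 1: // 64-bit largesize follows the type field
			if off+16 > len(data) {
				return order, errors.New("truncated largesize header")
			}
			size = binary.BigEndian.Uint64(data[off+8 : off+16])
			header = 16
		}
		if size < header {
			return order, fmt.Errorf("invalid box size %d for %q", size, typ)
		}
		order = append(order, typ)
		if size > uint64(len(data)-off) {
			break // last box is truncated; its position is still known
		}
		off += int(size)
	}
	return order, nil
}

// isFaststart reports whether moov precedes mdat at the top level.
func isFaststart(data []byte) (bool, error) {
	order, err := topLevelBoxOrder(data)
	if err != nil {
		return false, err
	}
	for _, t := range order {
		switch t {
		case "moov":
			return true, nil
		case "mdat":
			return false, nil
		}
	}
	return false, errors.New("no moov or mdat box found")
}

// box builds a synthetic box with a zeroed payload (demo helper).
func box(typ string, payloadLen int) []byte {
	b := make([]byte, 8+payloadLen)
	binary.BigEndian.PutUint32(b[:4], uint32(len(b)))
	copy(b[4:8], typ)
	return b
}

// concat joins byte slices into one buffer (demo helper).
func concat(parts ...[]byte) []byte {
	var out []byte
	for _, p := range parts {
		out = append(out, p...)
	}
	return out
}

func main() {
	fast := concat(box("ftyp", 4), box("moov", 8), box("mdat", 16))
	slow := concat(box("ftyp", 4), box("mdat", 16), box("moov", 8))
	fmt.Println(isFaststart(fast))
	fmt.Println(isFaststart(slow))
}
```

Because only box headers are read, a scan of this kind needs no external tooling, which matches the entry's point that the check can always run.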
These all surface as "audio sounds wrong" operator reports days/weeks after they ship; the workflow turns each into a maintainer-visible monthly red signal. Repository secrets required to enable: `ELEVENLABS_API_KEY` (TTS) + `OPENROUTER_API_KEY` (LLM for slides.outline / podcast scripting). Without either secret the preflight emits a `::warning::` and skips the run cleanly, the same gating shape the model-recovery workflow uses. What's deliberately NOT in this PR: (1) per-PR fire on `internal/avenc/**` changes: the monthly cadence is enough to catch upstream drift; per-PR adds noise for code-level changes the unit tests already cover. Could be added later if the existing unit-test pins prove insufficient. (2) The `bash`-side ffprobe asserts don't currently capture downloaded artifacts to the upload path (the script's mktemp dir is cleaned on exit). A follow-up could persist failures into `/tmp/avbench-artifact-*.bin` so the workflow's `upload-artifact` step has something to attach, which is useful when a future operator wants to ffprobe the broken file rather than re-run locally. (3) Long-term trend dashboard rendering pass/fail across months: deferred until a maintainer finds themselves diffing artifacts often, same posture as the model-discovery trend.
Changed
-
Cookbook expansion (+7 recipes) + new blog draft about the cookbook pattern. The cookbook shipped in PR #435 had 10 recipes across 5 sections; user feedback on it surfaced demand for more entries (community + contributor angle). This PR adds 7 more recipes, every one validated against the actual shipped pack surface in `docs/PACKS.md` (no recipes for hypothetical capabilities). New recipes by section: Repos → code work gains "Audit a repo's code for a security pattern" (`repo.fetch` + `cmd.run` grep + LLM analysis, with the session-chaining contract note) and "Generate developer documentation from a codebase" (`repo.fetch` + `repo.map` + `blog.rewrite_for_audience`, flagged as a candidate for a `builtin.repo-onboarding-doc` pipeline). Web → structured output gains "Extract structured data from a single-page web app" (`web.scrape_spa` with CSS-selector schema; distinguished from `web.scrape` and `web.test`) and "Compare two competitor products' marketing pages" (`web.scrape` × N + `blog.rewrite_for_audience` with the `persona` knob for honest-vs-weighted comparison). Validation + reliability gains "Strict-mode validate before publishing" (`av.validate` `strict:true` as the CI publish gate; bridges the soft-surface default to the typed-error path per ADR 052). NEW section "Media & creativity": three recipes targeting weekend-builder / hobbyist intents: "Generate AI artwork from a text prompt" (`image.generate` via fal.ai with the FLUX schnell-vs-pro cost trade-off documented), "Find stock photos for a topic" (`stock.search` Pexels-backed, with the photos-vs-illustration decision note for chaining), "Build a quick demo video from a HyperFrames description" (`hyperframes.compose` + `hyperframes.render` with the HyperFrames-vs-`slides.narrate` selection guidance). Plus "Generate marketing copy for an upcoming release" (`repo.fetch` + `blog.rewrite_for_audience` + `image.generate` chain, flagged as another candidate composition for a future `builtin.repo-release-marketing` pipeline). Total cookbook coverage: now 17 recipes across 6 sections covering repos-as-content, web extraction, code work, validation, media generation, and memory. NEW blog draft `website/blog/2026-06-05-cookbook-pattern.md`: "Recipe-style docs are dramatically underused. Here's the case for them." 
Frames the cookbook pattern as a generalizable docs technique that survives outside this codebase: the "I don't know what to type" gap is bigger than most docs systems account for; recipe-style docs reward composition because each entry stands alone; recipes are honest about what your system can do (the Tip block has space for non-obvious behavior that tutorials sell-past and reference can't fit). The post documents the four-field recipe shape (OpenClaw prompt + direct invocation + outputs + Tip) and includes a 5-step "how to contribute a recipe" walkthrough so the post functions as a contributor on-ramp. Cited time estimates: ~3 hours for a tutorial vs ~15 minutes for a recipe; per-recipe ROI is high; partial coverage (unlike a tutorial series where missing entries break later ones) is still valuable. `draft: true` per the template workflow; flips to `draft: false` in a follow-up after maintainer review. Targets ~1,100 words; tags `contributor-experience` + `field-report` + `agent-architecture`. Build verified locally via `cd website && npm run build`: both the cookbook page and the blog draft build clean with no broken links. What this PR doesn't do: doesn't create the pipeline candidates the recipes flag (`builtin.repo-onboarding-doc`, `builtin.repo-release-marketing`); those are tracked as cookbook tips rather than filed issues today; revisit when concrete demand emerges. Doesn't extend the cookbook into integrations not currently shipped (Notion / Slack / Linear / Jira recipes; those packs are in the pack-candidate backlog #73–#80 and get cookbook entries when the packs ship). Doesn't refresh the cost table sync (`README.md` + `docs/explanation/why-helmdeck.md` + 2026-05-08 blog); still deferred to the next release cut per `RELEASES.md` §"Agent sync checklist" step 6.
-
NEW `docs/reference/models.md`: operator-facing tier table. Closes the last deferred item from PR #435 ("`docs/reference/models.md` operator-facing tier table, depends on the tier-aware `PromptVariant` work landing first") and the natural-next-page sequencing from PR #437 (the tier-aware `PromptVariant` work that landed first). Surfaces the tier system to operators with a single information-oriented lookup page. Page structure: (1) "How tier affects behavior": a 6-row matrix showing what changes per tier across catalog projection, output budget, `helmdeck.plan` prompt variant, strict-JSON mode, prefix-cache routing, and LLM filter pass. Each row links to the source ADR. (2) "When you'll see Tier C behavior": the three situations that trigger the conservative path (explicit Tier C entry, prefix match, unknown model fallback) with a note explaining why parameter count is the wrong proxy (`openrouter/nvidia/nemotron-3-super-120b-a12b:free` is 120B parameters but Tier C because free-tier inference quality doesn't match what parameter count alone suggests; cites the validation-arc blog post's 50% multi-step success measurement). (3) Tier A table: 13 entries (`anthropic/claude-opus-`, `claude-sonnet-`, `claude-3.7-sonnet`, `claude-haiku-`, `openai/gpt-4o`, `gpt-5`, `o3-mini`, `google/gemini-2.5-pro`, `gemini-2.5-flash`, plus the OpenRouter relays) with input ceiling, output budget, strict-JSON / prefix-cache / hybrid-reasoning flags, and the calibration source from `budgets.go`. (4) Tier B table: 8 entries covering Llama 3.1/3.3 70B, Gemma 2, Mistral, DeepSeek V4 Pro / V3.2 / chat, Grok. (5) Tier C table: 7 entries (`openrouter/openrouter/free`, `nvidia/nemotron-`, `z-ai/glm-`, `qwen/qwen-2.5-`, `moonshotai/kimi-k2`, `moonshotai/kimi-`, `tencent/`) with the Tier C-specific notes about what happens on the `single_pick` path. (6) "Picking a model for your goal": 6 scenarios (most reliable, lowest cost, exercise the agent loop, best Tier B price, max context, reasoning models) → recommended model + rationale. 
(7) "Overriding the tier": when and how to set `Budget.PromptVariant` explicitly on an entry to defy the tier default. (8) Cross-links to ADRs 050/051/053, the calibrate-model-tiers HOWTO, the free-models-and-context HOWTO, and the validation arc blog post that motivated the `single_pick` design. Cross-links from existing docs: `docs/howto/calibrate-model-tiers.md` Related section gains a top-of-list pointer at `/reference/models` (the calibration methodology produces entries that surface here); `docs/howto/free-models-and-context.md` Related section gains the same plus an explicit pointer at ADR 053. `docs/reference/index.md` gains a "Models reference" bullet alongside the cookbook and prompt-templates entries in the pack-catalog section. Sidebar registration: `website/sidebars.ts` reference section gains `'reference/models'` between `agent-memory` and the Prompt-templates category, guarding against the failure mode PR #438 caught (orphan markdown pages slip through review when authors forget the sidebar registration; this PR registers proactively). Posture: information-oriented lookup, not a tutorial; the file reads like a contract, not prose. Operators looking up "what's my tier" or "why is my plan emitting one step at a time" land here and get a 5-second answer. The architectural narrative continues to live in the ADRs; this page is the index over the data. Verified locally via `cd website && npm run build` before push (catches the doc-id-vs-sidebar-id class of bug PR #438 hit). Sequencing impact: with this PR, the only loose threads from the validation-arc session are (a) the cost-table sync between `README.md` + `docs/explanation/why-helmdeck.md` + the 2026-05-08 blog post per `RELEASES.md` §"Agent sync checklist" step 6 (next release cut), (b) the orphan-page CI check that would have caught both the av-validate frontmatter mismatch in #438 and the cookbook+av sidebar omissions before merge ([good first issue] follow-up), and (c) the actual slides.narrate test run through OpenClaw end-to-end now that the stack is fully fixed. 
-
Doc refresh post-validation-arc + new `docs/cookbook/intent-to-prompt.md`. Five existing docs were stale after the validation arc landed (PRs #428 / #430 / #431 / #432 / #433): they referenced the pre-validation-arc world by not mentioning `av.validate` at all. `README.md` updated: pack count `52 → 53`, `slides.narrate` highlight row gains the validation + captions + engagement parenthetical, new `av.validate` row added to the "Document & vision" section. `docs/PACKS.md` updated: `slides.narrate` row's Input column gains `captions_sidecar?` / `captions_burn_in?` / `validate?` (the inputs that shipped in PRs #425 / #432), Output column gains `engagement` (renamed from `metadata` in PR #424) / `engagement_artifact_key` / `captions_artifact_key` / `captions_burned_in` / `validation` / `validation_artifact_key`. Description updated to call out the pointer-bool default-on pattern for captions + validation. `podcast.generate` row receives the analogous updates plus `metadata_model?` / `cta_style?` / `language?` (engagement defaults shipped default-on in PR #424). New `av.validate` row added under a new "AV utilities" section with the full 13-check set documented + severity model + strict-mode behavior + link to ADR 052. Gateway-gated count updated to `10 of 53` (43 without a gateway; `av.validate` has no LLM dependency). Source files section gets `av_validate.go` → `av.validate`. `docs/explanation/why-helmdeck.md` updated: header pack count bumped, new sixth per-task comparison entry, "Diagnosing 'the video has issues': reliability as a token tax", using the validation arc as a concrete worked example of the broader thesis (~3,000 LLM tokens / incident manual ffprobe loop vs ~200 tokens reading `validation.checks[]` from the run record). The 27.9-second audio/video duration mismatch on `888de7b23142ba81-video.mp4` (issue #429) is cited as the motivating example. 
The added paragraph closes by drawing the parallel: "moving the diagnostic mechanism from 'frontier model derives it' to 'deterministic pack computes it' is exactly the same lever as moving navigation from 'vision model interprets screenshots' to 'browser pack executes deterministic actions.'" `docs/reference/prompt-templates/packs.md` updated: `podcast.generate` template's Notes line gains a one-sentence pointer about engagement + validation defaults; new "AV utilities" section + `av.validate` template added between Podcast and Image sections. The template shape mirrors the rest of the file (Template / Variables / Notes blocks). NEW `docs/cookbook/intent-to-prompt.md`: the index this docs system has been missing. Ten worked recipes organized by intent class (repos→content, web→structured output, repos→code work, validation+reliability, memory), each showing three things: the OpenClaw natural-language prompt that resolves cleanly, the direct REST/MCP invocation underneath, and the structured output fields that land in the run record. Each recipe also has a Tip block calling out the non-obvious behavior (engagement defaults, soft-surface validation, citation handling, when to prefer pipelines over bare packs). The cookbook addresses what the Nemotron-3-super-120b-a12b:free testing surfaced as the highest-leverage onboarding gap: users not knowing what to type. The architectural alternative (a separate prompt-generator tool or website) was deliberately rejected: fragmentation cost > value when the helmdeck catalog already publishes intent metadata via `/api/v1/packs` and the cookbook can leverage the existing Docusaurus build. Cross-linked from the prompt-templates index (the cookbook is the intent-first index over the pack-first templates), the calibrate-model-tiers HOWTO, and the when-a-pipeline-fails HOWTO. 
What's deliberately deferred (sequenced after this PR): `docs/reference/models.md` operator-facing tier table (depends on the tier-aware `PromptVariant` work landing first), the HOWTO amendments calling out `validation` as a diagnostic step (small follow-up touching existing files), and refreshing the cost-table sync between `README.md` + `docs/explanation/why-helmdeck.md` + the 2026-05-08 blog post per `RELEASES.md` §"Agent sync checklist" step 6 (next release cut). Test plan: docs-only PR; no code touched. `go vet ./...` clean. CHANGELOG mirror byte-identical to website. Post-merge verification: Docusaurus build picks up the new cookbook page, internal links resolve, sitemap regenerates.
-
ADR 052 lands; ADRs 008, 015, 045, 051 get focused amendments: Phase 4/4 of the validation arc. Closes out the four-phase arc that started with the standalone script in PR #428. Phase 4 is the architecture record. NEW ADR 052, "AV Output Validation as a Default-On Post-Encode Step", captures the five sub-decisions the arc encoded in code: (1) Tool selection: ffprobe + libavfilter (`silencedetect`, `blackdetect`, `freezedetect`, `ebur128`) + null-muxer decode pass + pure-Python `moov`-vs-`mdat` byte scan, with explicit, per-tool rejection of MP4Box/GPAC (CVE risk + functional redundancy), Bento4 mp4dump (atom-level surgery not where our bugs live), mp3val/mp3check (over-scoped for a single-codec single-bitrate pipeline), QCTools/qcli (built for analog-tape forensics), MediaConch (policy-driven archival compliance, YAGNI), untrunc (we fix the encoder, not patch the output; same lesson as #431's apad swap). (2) Severity model: `pass`/`warn`/`fail`, with `fail` reserved for checks that match a shipped bug fix (faststart per #422, codec pin per #421, packet contiguity per #404, RMS floor, audio/video duration parity per #429→#431, SRT first-cue anchor + comma separator, captions coverage); soft heuristics like loudness LUFS, silence runs, black-frame runs stay at `warn` so pipelines don't break on advisory findings. (3) Known-issue demotion lifecycle: three rules to keep the mechanism honest: file the issue first; same-PR coupling on removal; no demotions for already-`warn` checks. The lifecycle is enforced by the `TestAVValidate_NoDemotionsInForce` test (asserts the map is empty post-#431) + lifecycle documentation in the ADR. (4) Soft-surface contract: the pack's output IS the report; failing the pack over a `silence_runs` advisory would defeat the surface. Strict-mode (`strict:true`) is the opt-in escape hatch for CI publish gates. (5) Scope boundary: helmdeck-generated artifacts only. Operator-uploaded artifacts (future) have a different threat model (untrusted bitstreams need adversarial parsing + sandboxing posture + GPAC CVE mitigations) and get a sibling pack rather than extending `av.validate`'s check set. 
ADR 008 amendment explains the severity-vs-error-code axis distinction: a failed check returns success at the runtime layer because the operation proceeded; a typed error code (CodeHandlerFailed) returns when the operation didn't proceed. `strict:true` is the bridge: it translates fail-severity findings into `CodeArtifactFailed`, keeping the closed-set error vocabulary closed while letting quality findings flow as data. ADR 015 amendment documents the validation post-step as part of `slides.narrate`'s contract: after `ConcatVideoMP4s` + video upload, `runAVValidation` runs against `/tmp/final.mp4` + optional SRT path; report lands as `validation` field; `validation.json` sidecar persists alongside `engagement.json` + `captions.srt`. New input: `validate *bool` pointer-bool default-on. ADR 045 amendment captures the ~600 MB null-muxer decode pass memory peak on 1080p × 11-minute video: sits on top of the existing encoder peak; operators on memory-tight Compose hosts should set `SessionSpec.MemoryLimit: 1g`. The pass is CPU-bound short-burst, not parallel-heavy, so `ProfileCompute`'s `clamp(host_cores - 1, 1, 6)` cap is unchanged. ADR 051 amendment clarifies that validation findings are NOT routed through `FailureClass`: the two systems target different concerns. `FailureClass` disambiguates empty-completion symptoms across hybrid models (safety filter vs. length truncation vs. constrained-decoding deadlock vs. timeout); validation findings are quality observations on a successfully-produced artifact. Routing retries on a `silence_runs` advisory would re-encode the entire video to chase a heuristic finding, burning encode time the validation step was built to save. `strict:true` is again the bridge for operators who explicitly opt into fail-fast. 
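The severity-vs-error-code distinction can be illustrated with a small Go sketch: findings always travel in the report, and only strict mode turns a `fail` finding into an error. The types, the `finish` function, and the `errArtifactFailed` sentinel (a stand-in for the `CodeArtifactFailed` code named above) are assumptions for the example, not helmdeck's actual API.

```go
package main

import (
	"errors"
	"fmt"
)

type severity string

const (
	sevPass severity = "pass"
	sevWarn severity = "warn"
	sevFail severity = "fail"
)

// check is one validation finding; report is the pack's output.
type check struct {
	Name     string
	Severity severity
}

type report struct {
	Checks []check
}

// errArtifactFailed is a hypothetical stand-in for CodeArtifactFailed.
var errArtifactFailed = errors.New("artifact_failed")

// finish returns the report unchanged. In the default soft-surface mode
// it never errors; with strict=true the first fail-severity finding is
// translated into an error while the report is still returned as data.
func finish(r report, strict bool) (report, error) {
	if strict {
		for _, c := range r.Checks {
			if c.Severity == sevFail {
				return r, fmt.Errorf("%w: check %q failed", errArtifactFailed, c.Name)
			}
		}
	}
	return r, nil
}

func main() {
	r := report{Checks: []check{
		{"silence_runs", sevWarn},
		{"faststart", sevFail},
	}}
	_, softErr := finish(r, false)
	_, strictErr := finish(r, true)
	fmt.Println("soft:", softErr)
	fmt.Println("strict:", strictErr, errors.Is(strictErr, errArtifactFailed))
}
```

Keeping the translation in one place means advisory `warn` findings can never become errors, whichever mode the caller picks.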
Architectural posture preserved across the arc: same-PR coupling between the fix and its regression guard (#431 demonstrated this: the apad swap landed with the severity-promotion test); soft-surface as the default; strict mode as the explicit opt-in for CI gates; tool-selection rationale documented per-tool so future maintainers don't need to re-derive why we said no to GPAC. What's now closed: the validation arc, covering Phase 1 (#428), Phase 2 (#430), Phase 3 (#432), Phase 4 (this PR). The token cost of "the video has issues" diagnostics is now ~200 tokens (read validation.checks[]) vs the previous ~3,000 tokens (manual ffprobe loop). The mechanism for catching future encoder regressions is in place at three layers (script, pack, default-on integration) with the architecture documented and the severity policy ossified in the script. -
Audio quality lift for slides.narrate + podcast.generate: 128k → 192k MP3, 44.1 kHz pin throughout the avenc pipeline. Operators reported audio sounded "fine" but still noticeably compressed despite PRs #379–#408 closing every functional dropout/OOM/silent-failure bug. Audit traced the cause to four compounding lossiness events stacked on top of each other: (1) both packs hardcoded ElevenLabs output_format=mp3_44100_128 (the cheapest tier), bounding everything downstream by the source quality; (2) internal/avenc/video.go:runSegmentEncode didn't set -ar, so ffmpeg defaulted to 48 kHz AAC for MP4 output, forcing every per-segment encode to resample the 44.1 kHz TTS source through the worst-case 44.1→48 kHz non-integer libswresample path (audible high-frequency aliasing); (3) internal/avenc/audio.go:ConcatVideoMP4s re-encoded audio at concat (PR #404, load-bearing for the dropout fix) but inherited the 48 kHz mismatch and compounded AAC's psychoacoustic loss over already-AAC input; (4) internal/podcast/concat.go passed BitrateKbps: 128 explicitly, leaving no headroom for the silence-segment splice. Fix is one change applied through the whole pipeline: bump the ElevenLabs default to mp3_44100_192 (Creator-tier), pin -ar 44100 on every avenc ffmpeg command that re-encodes audio (runSegmentEncode, ConcatAudio, ConcatVideoMP4s), bump ConcatAudio's default bitrate 128→192 to match the new source, and drop the explicit 128k pin in podcast.Concat so it picks up the new default. NEW HELMDECK_ELEVENLABS_FORMAT env var is the escape hatch for operators on the ElevenLabs Starter tier (capped at mp3_44100_128): set the env var on the helmdeck process and both slides.narrate's and podcast.generate's TTS calls downgrade. Same env-var ladder shape as resolveElevenLabsKey in internal/packs/builtin/elevenlabs_creds.go; resolved per-call so a config reload doesn't require a restart.
Kept package-local to internal/podcast and internal/packs/builtin to avoid an internal/podcast → internal/packs/builtin import cycle: minor duplication, much cleaner dependency graph. Why an env var, not a pack input: most operators set TTS tier once per deployment (their ElevenLabs subscription is fixed). A pack input would clutter every pipeline definition with a quality choice the operator doesn't change call-to-call. Env var is the right cardinality. Documented in docs/reference/packs/slides/narrate.md and docs/reference/packs/podcast/generate.md. Test surface: every flag change ships with a positive regression guard. internal/avenc/audio_test.go pins -b:a 192k + -ar 44100 on ConcatAudio's default and -c:v copy -c:a aac -b:a 192k -ar 44100 on ConcatVideoMP4s; internal/avenc/video_test.go pins -ar 44100 on the per-segment encode; internal/packs/builtin/slides_narrate_test.go pins -ar 44100 on the slides.narrate concat shape alongside the existing PR #404 dropout-regression guard. internal/podcast/elevenlabs_test.go pins the new 192k default in the API query AND adds two env-var tests (HELMDECK_ELEVENLABS_FORMAT=mp3_44100_128 returns the override, unset env returns the new 192k default), the same "rule-with-test" shape PR D's property tests follow so a future revert is loud. Cost impact: ~50% more per ElevenLabs character (Creator tier vs Starter tier per-character rate). For a typical 5-minute narrated slide deck (~750 chars/min × 5 = 3,750 chars), the absolute delta is sub-cent at current ElevenLabs rates. Operators sensitive to the cost set the env var. Why this is the right scope: reverting PR #404's concat re-encode (the option to "eliminate" the lossiness chain by stream-copying audio at concat) would bring back the mid-segment AAC frame-boundary dropouts; the right move is to make the load-bearing re-encode less lossy by matching its input quality, not to eliminate it.
PCM source format (pcm_44100, which eliminates source-side MP3 loss entirely) is available on higher ElevenLabs tiers and works through the same env var; it is deferred until operators ask. All affected packages pass go test ./internal/avenc/... ./internal/podcast/... ./internal/packs/builtin/... -race -count=1 (863 tests across 3 packages).
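The per-call env-var resolution described in this entry can be sketched as follows. This is an illustrative Python sketch, not the Go resolver itself: the function name and the whitespace trimming are assumptions, while the HELMDECK_ELEVENLABS_FORMAT variable and the mp3_44100_192 / mp3_44100_128 values come from the entry.

```python
import os

DEFAULT_FORMAT = "mp3_44100_192"  # new default named in the entry


def resolve_elevenlabs_format(env=None):
    """Resolve the TTS output format on every call (hypothetical helper).

    Reading the environment per call, rather than caching at startup,
    is what lets a changed value take effect without a restart.
    """
    env = os.environ if env is None else env
    # Treating a blank value as "unset" is an assumption of this sketch.
    value = env.get("HELMDECK_ELEVENLABS_FORMAT", "").strip()
    return value or DEFAULT_FORMAT


starter = resolve_elevenlabs_format({"HELMDECK_ELEVENLABS_FORMAT": "mp3_44100_128"})
default = resolve_elevenlabs_format({})
```

The two calls at the end mirror the two env-var tests the entry describes: the override is returned when set, and the 192k default when unset.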
Fixed
-
Compose build overlay: session sidecars now use the locally-built image, not the GHCR-published
:latest. Surfaced as a missing-script bug on the firstslides.narraterun after the validation arc shipped (PR #430): theCOPY scripts/av-validate.sh /usr/local/bin/av-validate.shdirective was in the Dockerfile,make sidecar-buildproduced a freshhelmdeck-sidecar:devimage with the script present, but every session container spawned byslides.narrate's validation post-step failed withOCI runtime exec failed: ... stat /usr/local/bin/av-validate.sh: no such file or directory. Phase 3's soft-surface contract worked exactly as designed (ADR 052): the failure was logged as a warn, the artifact still shipped, the pack returned success — but the underlying mismatch had been silently masking every Dockerfile change made under the build overlay since the overlay shipped in PR #134. Root cause:compose.build.yamlpreviously only declared abuild:directive forcontrol-plane. Thesidecar-warmservice in the basecompose.yamlrunsdocker pull ghcr.io/tosin2013/helmdeck-sidecar:${HELMDECK_VERSION:-latest}at everycompose up, populating the local Docker cache with the GHCR-published image (built from the last release, not the current source). The session runtime (internal/session/docker/runtime.go:47) then defaults to that same:latesttag whenHELMDECK_SIDECAR_IMAGEis unset. Net effect: developers running with the build overlay would see theircontrol-planechanges land instantly, but theirsidecar.Dockerfilechanges would only take effect after a release to GHCR — defeating the whole point of building from source for local development. Fix:compose.build.yamlgains two complementary overrides. First,HELMDECK_SIDECAR_IMAGE: helmdeck-sidecar:localon the control-plane'senvironmentblock — pointing the runtime'simageOverrideresolution path at a tag the build overlay can populate. 
Second, thesidecar-warmservice gets repurposed: instead ofimage: docker:cli+command: ["docker pull ghcr.io/..."], it now declaresimage: helmdeck-sidecar:local+build: { context: ../.., dockerfile: deploy/docker/sidecar.Dockerfile }+entrypoint: ["true"]+command: []. Same end goal (the sidecar tag is in the local Docker cache before the control-plane starts launching sessions), inverted mechanism (BUILD from source instead of PULL from GHCR). Compose'sbuild:+image:semantics tag the freshly-built image with theimage:reference, the no-op entrypoint exits cleanly, and the freshly-built image is now what the runtime resolves to when launching session containers. What stays the same:compose.yamlwithout the overlay layered still pulls from GHCR as before — production deployments are untouched.scripts/install.shalready layers the overlay by default, so dev installs get the fix automatically. TheHELMDECK_SIDECAR_IMAGEenv var hooks into the existing override mechanism documented atruntime.go:40-47andruntime.go:90-95— no new code, just compose-level wiring of an already-existing knob. Verified viadocker compose configshowing both overrides land cleanly (HELMDECK_SIDECAR_IMAGEon control-plane env,helmdeck-sidecar:localtag + sidecar.Dockerfile build context onsidecar-warm), and via re-running the slides.narrate validation post-step which now finds the script at/usr/local/bin/av-validate.shand lands avalidationfield in the pack output withconsistency:audio_video_duration: pass: true, severity: fail. -
slides.narrate audio/video duration mismatch (#429): replace
PadAudioToMinpre-encode silence pad with an-af apad=whole_dur=Xfilter insideencodeSegment. Issue surfaced during the Phase 1 (#428) av-validate.sh acceptance test againstslides.narrate/888de7b23142ba81-video.mp4: ffprobe revealed exactly 13 abnormally-long audio packets at inter-slide boundaries summing to 26.246 seconds — matching the 25.9s timeline-vs-content discrepancy the validator's newconsistency:audio_video_durationcheck detected. The audio PLAYED correctly (the decoder emitted the right samples); the bug was that each silence-pad got compressed into ONE duration-stretched AAC packet carrying metadata claiming ~2s of duration vs the natural 23.22ms-per-1024-sample-frame, pushing the audio stream's container-claimed duration ~26s past the actual content on a typical 14-slide deck (13 inter-slide pads × ~2s each). The discrepancy propagated to the video container, the SRT alignment, and the engagement chapter timestamps. Root cause flow (the previous, buggy path): TTS → narration MP3 (~3s);dur < minTurnSec→PadAudioToMin(internal/avenc/audio.go:159) →GenerateSilence(2s)viaanullsrc → libmp3lame→ConcatAudio(libmp3lame re-encode) merges narration + silence MP3 → 5s padded MP3 →runSegmentEncodewith-loop 1 -i image -i audio.mp3 -shortest -c:a aac -b:a 192k -ar 44100: the silent tail re-encodes to AAC and the encoder emits one duration-stretched packet covering the silent region in metadata while only containing a single 1024-sample frame of actual PCM silence. Fix (the new path):ffmpegEncodeOpts.AudioPadDur float64field on the localencodeSegmentininternal/packs/builtin/slides_narrate.go. When non-zero, the encode command gains-af 'apad=whole_dur=X.XXX'and the legacy-shortestis replaced with-t X.XXXfor deterministic per-segment duration.apadgenerates the silence inline as PCM samples during the encode pass; the AAC encoder then emits normal-density 1024-sample frames covering the silent region — no more duration-stretched metadata. 
The handler call site (slides_narrate.go:657 area) drops the PadAudioToMin invocation and instead computes durations[i] = max(tts_dur, minTurnSec); the encode loop passes AudioPadDur: durations[i] unconditionally. apad's whole_dur is a no-op when the input audio is already at or above the target, so this is safe regardless of whether the per-slide TTS naturally exceeds the floor. PadAudioToMin itself stays in internal/avenc/audio.go because internal/podcast/concat.go still calls it for the podcast turn-padding flow. Podcast outputs are MP3 (libmp3lame end-to-end), not AAC; MP3 frames are time-uniform (1152 samples / 44100 Hz = 26.12ms each) with no per-frame duration field to stretch, so the bug doesn't manifest there and the existing podcast pad path stays correct. Same-PR severity promotion (per the av.validate demotion lifecycle): the consistency:audio_video_duration entry is removed from knownIssueDemotions in internal/packs/builtin/av_validate.go in this same PR. The check returns to its natural fail severity. Same-PR coupling means the fix and the regression guard travel together: if a future revert breaks the apad change, the validation will start failing at fail-severity again immediately. Regression guard (TestSlidesNarrate_AudioPadDur_WiresApadFilter in internal/packs/builtin/slides_narrate_test.go): asserts the per-segment ffmpeg argv contains -af 'apad=whole_dur=, contains -t (not -shortest), and explicitly does NOT contain -shortest when AudioPadDur is set. Same posture PR #404 introduced for the "no -c copy" audio-concat guard. TestAVValidate_NoDemotionsInForce (renamed from TestAVValidate_KnownIssue_DemotedToWarn) asserts the demotion map is empty and the check now lands at fail severity with no (known issue, …) suffix on the detail string, which protects against accidentally re-adding a demotion entry without the corresponding tracking issue. Test suite: 2,006 tests pass across 32 packages under -race. Coverage gate PASS at every floor (internal/packs/builtin 80.6%).
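The argv shape the regression guard pins can be sketched in a few lines. This is a Python illustration rather than the Go encodeSegment code: the function names are hypothetical, video-codec flags are omitted, and only the flags quoted in this entry (-loop 1, -af apad=whole_dur=, -t, -shortest, -c:a aac -b:a 192k -ar 44100) are used.

```python
def slide_duration(tts_dur, min_turn_sec):
    """Per-slide target duration: the TTS length, floored at the minimum turn."""
    return max(tts_dur, min_turn_sec)


def segment_encode_args(image, audio, out, audio_pad_dur=0.0):
    """Build a per-segment ffmpeg argv (hypothetical helper, audio flags only)."""
    args = ["ffmpeg", "-loop", "1", "-i", image, "-i", audio]
    if audio_pad_dur > 0:
        dur = f"{audio_pad_dur:.3f}"
        # apad synthesizes PCM silence inline; -t fixes the segment length.
        # No shell quoting is needed because this is an argv list.
        args += ["-af", f"apad=whole_dur={dur}", "-t", dur]
    else:
        args += ["-shortest"]
    args += ["-c:a", "aac", "-b:a", "192k", "-ar", "44100", out]
    return args


padded = segment_encode_args("slide.png", "tts.mp3", "seg.mp4",
                             audio_pad_dur=slide_duration(3.0, 5.0))

# Frame durations quoted in the entry, in milliseconds.
aac_frame_ms = round(1024 / 44100 * 1000, 2)
mp3_frame_ms = round(1152 / 44100 * 1000, 2)
```

Because the silence is generated as samples before the AAC encoder sees it, the encoder has no reason to emit a single stretched packet; the last two lines reproduce the 23.22 ms and 26.12 ms frame durations cited above.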
Phase 3 (default-on integration of av.validate as a post-step on slides.narrate / podcast.generate) is now unblocked: the validation check will surface real regressions at fail-severity going forward rather than producing a stream of pre-known warnings on every output. -
slides.narrate MP4 playback dropout:
-movflags +faststarton the final concat so streaming players can begin playback before download completes. Operator reported a deterministic audio dropout — "audio plays for ~45 seconds then goes silent; restart and it does the same thing at the same point" — that initially read like an audio-encoding bug. Live ffprobe of the affected artifact (slides.narrate/ee1d32882b4d9962-video.mp4, 5.8 MB, 229.7s duration) ruled out every audio-side cause: packets contiguous from 0 to 227.7s with no gaps, RMS uniform at -22 to -24 dB sampled every 30s across the full file, no DTS discontinuities, audio duration matches video duration. The audio track was fine. The actual bug was in the MP4 container layout:moovatom at byte 5,919,942 (97% into the file), placed aftermdat— the mp4 muxer's default behavior when no-movflags +faststartis passed. Streaming consumers (HTML5<video>, the OpenClaw chat-UI inline preview, mobile MP4 frameworks, most browser-based viewers) cannot begin playback until the entire file streams in because the seek index lives at the tail. In practice the player plays what it has buffered, then stalls — looking exactly like an audio dropout. Bug was present in every helmdeck-produced MP4 sinceslides.narrateshipped (#379 era); no+faststartflag has ever existed in the codebase pergrep -r "faststart" internal/. Fix is one flag oninternal/avenc/audio.go:ConcatVideoMP4s—-movflags +faststart— which triggers ffmpeg's second-pass moov-relocation. Confirmed via re-encode on the affected artifact: the second-pass log line[mp4 @ ...] Starting second pass: moving the moov atom to the beginning of the fileappears, and the resulting file plays correctly in every player tested. Diagnostic methodology that surfaced this (worth preserving as a future debugging pattern): when an operator reports "audio dropout at deterministic timestamp" on a helmdeck artifact, the first move is not to read the code — it's to download the artifact and ffprobe it. 
The on-disk audio either has gaps in the packet stream (real audio truncation) or doesn't (container/playback issue). The two failure classes have completely different root causes, and the audits read completely differently. PR #421 was scoped against the assumption it was the former; the actual issue was the latter. Test surface: positive regression guards at two levels. internal/avenc/audio_test.go:TestConcatVideoMP4s_VideoStreamCopyAudioReencode pins +faststart in the ffmpeg argv alongside the existing -c:v copy -c:a aac -b:a 192k -ar 44100 shape assertions; internal/packs/builtin/slides_narrate_test.go:TestSlidesNarrate_ConcatReencodesAudio pins +faststart at the pack-level concat shape so a future revert is caught even if someone refactors internal/avenc independently. Same regression-impossibility pattern PR #404 introduced for the "no -c copy" guard. External research surfaced two architectural follow-ons (deliberately out of scope for this tonight-ship fix, candidates for v0.26.0): (1) ElevenLabs has a documented "200 OK with no audio data" failure mode (status.elevenlabs.io incident 2025-11-27); the current code path reads the response body via io.ReadAll(io.LimitReader(resp.Body, 32<<20)), which silently passes a truncated body through to ffmpeg. Hardening would add an ffprobe-based duration sanity check at the TTS-fetch boundary so the failure is loud instead of cascading as silent audio later. (2) The "per-segment encode → concat" architecture has known fragility around AAC priming-sample drift and DTS discontinuities (FFmpeg Trac #10379, #5448); production NLE tools (Descript, Adobe Premiere) use a "timeline → single-render" pattern where audio is one continuous track rendered in one encode pass. Migrating helmdeck to this pattern would eliminate an entire bug class but requires reshaping the segment loop in slides_narrate.go. This fix doesn't address podcast.generate (.mp3) dropouts: MP3 has no moov atom, so faststart doesn't apply.
Operator reported the same symptom in podcast output; that case has a different root cause and needs its own diagnostic pass.
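The container-layout check behind this diagnosis (is moov placed before mdat?) can be sketched as a top-level box walk. This is a general illustration of the technique in Python, not the scan shipped in av-validate.sh; the function names are hypothetical, and make_box exists only to build tiny synthetic files for the example.

```python
import struct


def top_level_boxes(data):
    """Yield (type, offset, size) for each top-level MP4 box in data."""
    offset, n = 0, len(data)
    while offset + 8 <= n:
        size, box_type = struct.unpack(">I4s", data[offset:offset + 8])
        header = 8
        if size == 1:  # 64-bit size follows the type field
            if offset + 16 > n:
                break
            size = struct.unpack(">Q", data[offset + 8:offset + 16])[0]
            header = 16
        elif size == 0:  # box extends to the end of the file
            size = n - offset
        if size < header:  # malformed box; stop rather than loop forever
            break
        yield box_type.decode("latin-1"), offset, size
        offset += size


def is_faststart(data):
    """True when the moov box appears before the mdat box."""
    order = [t for t, _, _ in top_level_boxes(data) if t in ("moov", "mdat")]
    return order[:1] == ["moov"]


def make_box(box_type, payload):
    """Build a synthetic box: 4-byte size, 4-byte type, payload."""
    return struct.pack(">I4s", 8 + len(payload), box_type) + payload


ftyp = make_box(b"ftyp", b"isom")
streamable = ftyp + make_box(b"moov", b"m" * 4) + make_box(b"mdat", b"d" * 16)
tail_moov = ftyp + make_box(b"mdat", b"d" * 16) + make_box(b"moov", b"m" * 4)
```

Here is_faststart(streamable) is true and is_faststart(tail_moov) is false; the second layout is the one the affected artifact had, with the index at the tail of the file.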
[0.25.0] - 2026-06-04
Theme: The cheap-model reliability bet, empirically proved. Eight PRs (A–H) shipped the v0.24.0 + v0.25.0 reliability arcs as a single release. The architectural claim — that weak, cheap models can drive complex workflows iff the surrounding environment is perfectly reliable — has moved from "we have typed errors and contract tests" to "every layer (handlers, schemas, engine, MCP, S3, model recovery) has a regression-impossible backstop AND we have empirical evidence that a free 120B-class model recovers correctly from helmdeck's typed errors at ≥7/10 across all 5 reliability scenarios." Concretely: 7 packages newly tracked in the coverage gate (avenc/llmcontext/gateway/packs/builtin/api/packs/pipelines/mcp at 80–90% floors), 5 new contract / property / mutation / wire / recovery test surfaces, 2 nightly/weekly workflows surfacing reliability signal that didn't exist a week ago, ~2,000 internal tests (was ~1,650), and the first piece of empirical evidence in helmdeck's history that the cheap-model bet actually holds — openai/gpt-oss-120b:free passes all 5 typed-error recovery scenarios on the weekly model-recovery workflow.
Added
-
Model-recovery loop test against
moonshotai/kimi-k2.6:free(PR H of the v0.25.0 reliability arc — final). PRs A-G proved helmdeck's environment is correct: coverage gates, contract tests at the schema seam, property tests on validators, mutation testing on decision-dense code, engine audit/memory machinery, S3 wire surface, MCP transport. The reliability bet under all of that is the headline claim: weak, cheap models can drive complex workflows iff the surrounding environment is reliable. A 100%-covered codebase still doesn't prove the LLM understands the typed-error vocabulary helmdeck advertises. PR H closes that gap with the first piece of empirical evidence in helmdeck's history that the cheap-model bet actually holds (or doesn't — both are useful results). NEW.github/workflows/model-recovery.yml— nightly schedule (06:00 UTC, after the 04:00 mutation workflow) +workflow_dispatchfor ad-hoc runs. Pinsmoonshotai/kimi-k2.6:freevia three workflow-level env vars:RECOVERY_MODEL,MODEL_LAST_VERIFIED=2026-06-04,MODEL_NEXT_REVIEW_DUE=2026-09-04. The dates live next to the model id so updating the pin prompts updating the review date — no separate calendar to drift. Preflight step: (a) calls OpenRouter's/api/v1/modelsand assertsRECOVERY_MODELis in the catalog; if absent, fails loudly with "model deprecated; check https://openrouter.ai/models?supported_parameters=free and update RECOVERY_MODEL + bump LAST_VERIFIED + push NEXT_REVIEW_DUE forward." (b) Compares today's date againstMODEL_NEXT_REVIEW_DUEand emits a GitHub::warning::annotation past the deadline — visible on the run summary, present on every subsequent nightly until the maintainer updates the dates. Same "loud-but-not-blocking" cadence the coverage gate uses. NEWinternal/reliability/package — build-taggedrecoveryso a defaultgo test ./...compiles onlydoc.goand the package is a no-op for ordinary CI. Three gates protect the live API: the build tag (-tags=recovery),HELMDECK_RECOVERY_TESTS=1env var, andOPENROUTER_API_KEY. 
All three must be set; the test skips cleanly otherwise so forks and PR contributors without the secret get clean green runs.scenarios.godeclares 5 recovery scenarios + the closed-set action vocabulary the model returns (retry_corrected,retry_as_is,escalate_to_user,report_bug) + the system prompt that explains helmdeck's typed-error vocabulary to the model (mirroring what an MCP client surfaces).client.gois a 180-LOC OpenRouter chat-completions caller — deliberately not the productioninternal/gatewaystack so the harness can't be confused with what's being measured. Forcesresponse_format: json_objectand temperature=0.2; strips ```json fences from providers that wrap output; treats malformed JSON as a recovery failure (a model that can't emit parseable output for a typed envelope is failing the contract just as much as one that picks the wrong action).recovery_test.goruns each scenario N=10 attempts (configurable viaHELMDECK_RECOVERY_ATTEMPTSfor ad-hoc shorter runs), tallies actions againstExpectedActions, assertssuccesses ≥ threshold. Default threshold 7/10; message-only ambiguity scenario uses 6/10 because multiple recoveries are inherently acceptable. Persists a per-scenario JSON report to/tmp/recovery-report.json; the workflow uploads it as a 30-day-retention artifact and posts a per-scenario table to the run summary via$GITHUB_STEP_SUMMARY. The 5 scenarios (each pinning a specific reliability claim): (1)CodeInvalidInputwith named field — caller-fixable, model must emit a corrected input (the headline claim). (2)CodeSchemaMismatchon output — pack bug, model must report (NOT retry, NOT escalate to user) — the v0.17.1-class regression. (3)CodeHandlerFailedtransient — model must retry with same inputs. (4)CodeCredentialInvalid— model must escalate to user (auto-retry could lock the account). 
(5) Message-only ambiguity — vague message, only the code carries actionable signal — tests whether the typed code is doing the work or the model is pattern-matching message text. What this proves: each PASS is direct empirical evidence that the typed-error vocabulary works for the weak-model regime the bet is making a claim about. A FAIL is also useful — it surfaces "the message isn't clear enough" or "the bet is weaker than we claimed for this code" honestly. The wrong move is hiding the result. Why a free model, not Haiku 4.5: real-token cost on nightly CI compounds; free tier keeps the budget at zero. More importantly, recovery from a weak model is a stronger reliability claim than recovery from a smart one — Kimi-K2.6 doesn't have the general intelligence to "figure out" the right action from prose, so a PASS is evidence the typed-error contract is doing the work. GitHub repository secret to add before first nightly fires:OPENROUTER_API_KEY(Settings → Secrets and variables → Actions). Without the secret the workflow's preflight emits a clear "secret not set" warning and skips the test. Closes the v0.25.0 reliability arc. Eight PRs (A-H) shipped: coverage gate (A), contract tests (B), handler coverage (C), property + mutation (D), S3 (E), engine audit/memory (F), MCP ratchet (G), model-recovery proof (H). The architectural reliability claim has moved from "coverage % says we're 80%-tested" to "every layer (handlers, schemas, engine, MCP, S3, model recovery) has a regression-impossible backstop + the cheap-model recovery loop is empirically measured." -
internal/mcpratcheted 69.5% → 81.5%, added to the coverage gate at floor=81 (PR G of the v0.25.0 reliability arc). PR D's reshape closed the v0.24.0 arc withinternal/mcpdeferred (69.5% < 80% infra floor; "would fail the gate"). PR G fixes that — the MCP package is the wire surface every connected agent (OpenClaw, Gemini CLI, Claude Code) talks to, and an untested branch in the tool dispatcher silently breaks every agent's pipeline-execution workflow at once. NEWpipelines_test.go(20 tests) — the highest-leverage file ininternal/mcpwas at 8.4% before this PR. CoversWithPipelines's nil-service gating (so deployments without pipelines still serve a working pack catalog) + thetools/listwire shape (tool names are BARE —pipeline-run, nothelmdeck__pipeline-run— because namespacing MCP clients would double-prefix tohelmdeck__helmdeck__pipeline-run; the docstring's load-bearing contract is pinned) + every action ofdispatchPipelineTool: list/get/create/run/run-status/rerun/cancel, each with happy-path service forwarding, missing-required-field validation, and service-error translation. Pins specifically thatcoalesced: truefrom the single-flight guard is NOT an error in the tool-result envelope — the LLM's recovery code branches on this, and a regression that promoted it to isError would silently break every agent re-firing a pipeline. Also pins the distinct error codes per tool (pipeline_run_failedvspipeline_cancel_failedvs the genericpipeline_error) — the LLM's recovery branches on the specific code, not the message. NEWmy_resources_test.go(12 tests) —buildMyDefaults,buildMyMemory,buildRoutingGuide,formatPipelineAuditChunk. Thehelmdeck://my-defaults/helmdeck://my-memory/helmdeck://routing-guideMCP resources are what the chat agent reads at the top of every session to understand who it's talking to and what's been learned. 
Pin three distinct note states (memory-not-configured vs no-store vs empty-history) so the UI can distinguish "memory off" from "memory on but new caller"; pin the wire shape rewrite frompacks.DefaultstoMyDefaults(a future JSON-tag rename on the underlying type would silently corrupt the agent's defaults reading); pin the audit-category filter inbuildMyMemory(pack_history / pipeline_history rows MUST be excluded — they're surfaced via my-defaults, not my-memory, and a regression that leaked them would clutter the agent's user-facts view with engine-written rows).formatPipelineAuditChunk(QMD MCP corpus bridge) — every field renders in a stable header → key/value layout; optional fields (empty Run ID, zero DurationMs, empty LearnInputs) MUST NOT produce dangling labels. NEWhelpers_test.go(7 tests) —isInlineableImageclosed-set MIME check (PNG/JPEG/GIF/WebP inlineable; SVG/AVIF/BMP fall back to text-URL — a regression here would silently break inline screenshot rendering);base64Encoderound-trip;rpcError.Error()format stability (log-parsing scripts depend on the exact shape);extractWebhookFieldssecurity boundary (the pack handler MUST NEVER see webhook_url or webhook_secret — they're MCP-server-level metadata; the test pins that the cleaned input does not leak these fields, and that the no-url branch returns the input unchanged so awebhook_secretwithout a URL doesn't get silently stripped). NEWregistry_factory_test.go(4 tests) —defaultAdapterFactoryfor all three transports (stdio / SSE / WebSocket): valid-config happy path + malformed-config typed error per transport. Unknown-transport branch surfaces a typed error naming the bad value (operator typos like"stio"in the DB row don't silently route to nil). Coverage:internal/mcp69.5% → 81.5% (+12pp).pipelines.go8.4% → 92.2% (the biggest single-file jump in the v0.25.0 arc).routing_guide.go47% → 88%.my_defaults.go35% → ~95%.my_memory.go20% → ~85%. Floor:internal/mcpnewly tracked at 81. 
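The extractWebhookFields boundary pinned by helpers_test.go can be sketched as below. This is a hypothetical Python rendering of the behavior as the entry describes it, not the Go function: the return shape is an assumption, and only the webhook_url and webhook_secret field names and the two branches come from the text.

```python
def extract_webhook_fields(args):
    """Split server-level webhook metadata from the pack handler's input.

    Returns (cleaned_input, webhook); webhook is None when no URL is present.
    """
    url = args.get("webhook_url")
    if not url:
        # No-URL branch: hand the input back unchanged, so a lone
        # webhook_secret is not silently stripped.
        return args, None
    cleaned = {k: v for k, v in args.items()
               if k not in ("webhook_url", "webhook_secret")}
    return cleaned, {"url": url, "secret": args.get("webhook_secret", "")}


cleaned, hook = extract_webhook_fields(
    {"theme": "dark", "webhook_url": "https://example.test/hook", "webhook_secret": "s3"}
)
unchanged, no_hook = extract_webhook_fields({"theme": "dark", "webhook_secret": "s3"})
```

In the first call the handler input keeps only theme; in the second, the input is returned as-is because there is no URL to deliver to.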
Tests: 113 passing in internal/mcp/ (was 70). All ./internal/... package suites pass go test -race -count=1 -timeout=240s (1,994 total tests across 34 packages). Coverage gate reports PASS at the new floor. What's deliberately left: jobs.go.sweep (the SEP-1686 async-job janitor; exercised end-to-end by the integration suite when it next runs; standalone test would need timer-clock injection that adds more weight than it removes), stdio.go reader-side adapter (sub-process spawn semantics; integration territory), the trivial WithArtifacts / WithInlineImageThreshold one-line setters. v0.25.0 arc remaining: PR H, the model-recovery loop test against Haiku 4.5 (the actual cheap-model-reliability proof; real-token cost, opt-in env var, budget plan). -
Engine audit + memory machinery covered (PR F of the v0.25.0 reliability arc). PR E closed the S3 store gap; PR F closes the LLM-context machinery — the ADR 048 surface that builds the model's per-caller defaults projection on every run. Before this PR:
WritePlanAudit,WritePipelineAudit,MemoryStore()accessor,StoreFact,ProjectDefaults,CallerFromContext,WithProgress,ProgressFromContext,FactStoreError.Error()were all at 0%. The reliability story rests on these being right —CallerFromContextreturning the wrong subject means every audit row lands under the wrong namespace and the per-caller learned defaults silently swap between users;WritePlanAuditlosing the IntentSHA from its key shape means the planning-history projection breaks;MemoryStore()returning nil from a wired engine means the QMD MCP bridge mounts a 503 stub when it should serve real corpus. NEWaudit_engine_test.go(9 tests):WritePlanAudithappy-path pinning theplan_history/<intent_sha>/<nano>key shape + category=plan_history (ADR 049's reservation); preserves caller-set non-zero AtUnix (the branch existed but was untested); nil-store no-op guard (without this, every plan run on a no-memory deployment would nil-deref); unknown-caller default namespace (callerFromContextfalls back to "unknown" so memory writes always have a well-defined namespace).WritePipelineAudithappy-path with learnable-input filtering (theme+modelextracted,markdownbody dropped — same closed-set aswritePackAudit); empty-pipelineID no-op (so the projection isn't polluted with empty-ID rows the my-defaults UI can't group); nil-store guard.MemoryStore()accessor: returns the configured store, returns nil when unwired — the QMD bridge's mount-vs-stub gate. NEWfacts_engine_test.go(5 tests):FactStoreError.Error()round-trip througherrors.As(the REST handler atinternal/api/memory.gouses this seam to extract the typed code for status mapping);StoreFacthappy path through the REST entry point; nil-store path synthesizes the entry so memory-disabled deployments get a stable response shape; validation errors pass through unchanged (so missing-key doesn't coerce to backend-error 500); backend errors wrap asFactErrBackend(so a SQLite write failure surfaces as 500 instead of 400 invalid_input). 
NEWcontext_test.go(4 tests):WithCaller/CallerFromContextround-trip + nested-child inheritance + empty-subject-fallback-to-"unknown" (the namespace MUST be non-empty);WithProgress/ProgressFromContextround-trip + the always-non-nil contract (no-op callback returned for bare context so handlers don't need a nil-check); nil-clears branch. NEWproject_defaults_test.go(5 tests):ProjectDefaults(the slice-input variant used byhelmdeck.route, distinct from the store-backedBuildDefaults) — empty-inputs returns non-nil empty slices (JSON marshals as[]notnull, same shape contract as PR C's null-fixes); ranks by call count; excludes failed runs from learned defaults (a caller-fixable failure with persona="executive" must NOT pin executive as default — that's reinforcing the wrong intent, the regression class the LLM's recovery story depends on most); pipeline-audit accepts both "succeeded" and "ok" outcomes (pack-level vs pipeline-level vocabulary); top-N cap applied so a heavy caller doesn't blow up the routing prompt. Coverage:internal/packs82.0% → 87.6% (+5.6pp). Per function:WritePlanAudit0→76.9%,WritePipelineAudit0→75%,StoreFact0→full happy/error paths,ProjectDefaults0→full,CallerFromContext/ProgressFromContext0→full. Floor bumped to 87. Tests: 140 passing ininternal/packs/(was 117). All./internal/...package suites passgo test -race -count=1 -timeout=240s. Coverage gate reports PASS at the new floor. What's still untested ininternal/packs(deliberate scope decisions):memoryAdapter.Namespace/List/Delete(exercised indirectly viaEngine.Execute; pinning the adapter directly is a follow-up if signals warrant),ExecutionContext.Report(no-op when no progress sink wired; integration-tested via the pipeline runner's progress capture), and the trivialWithCDPFactory/WithSessionExecutor/WithArtifactStoreoption setters (one-line assignments — testing them is theater). v0.25.0 arc remaining: PR G —internal/mcpratchet (69.5 → 80; the deferred-from-PR-D infrastructure floor). 
PR H — model-recovery loop test against Haiku 4.5.
- internal/packs/s3store.go wire-tested against a stub S3 endpoint (PR E of the v0.25.0 reliability arc). Post-v0.24.0 coverage audit surfaced the bigger latent risk: the artifact store EVERY operator's production deployment depends on — internal/packs/s3store.go — was 0% covered in CI. The existing s3store_test.go had a compile-time interface check + an opt-in TestS3ArtifactStoreLive that runs against real MinIO when HELMDECK_S3_TEST_ENDPOINT is set, but the CI surface was zero. Any operator deploying with MinIO/R2/B2/AWS S3 was running untested code on every artifact upload. PR E closes that gap with a stub S3 server that speaks just enough of the AWS S3 wire protocol for the minio-go SDK to round-trip. Scope (NEW internal/packs/s3store_wire_test.go, 11 tests): full Put → Get round-trip with content-type + size preservation + presigned-URL shape (X-Amz-Signature query param verified, not just non-empty); BucketExists failure surfaced at construction time (operators get a clear error at startup, not on first Put); upstream-error translation to *PackError{Code: CodeArtifactFailed} on Put / Get / Delete (the engine's typed-error contract held end-to-end); ListForPack reads the in-process index (cross-handler-within-run lookup, not bucket scan); ListAll walks the bucket and parses Pack from the key prefix (the only entry point the TTL janitor uses — if the prefix parse breaks, janitor either deletes the wrong artifacts or stops working); Delete removes the object AND drops the index entry (without the index update a follow-up ListForPack would return a stale handle); PublicEndpoint rewrites the presigned-URL host (the docker-internal-vs-public-DNS seam compose deployments rely on); PresignTTL=0 defaults to 15min; Region defaults to us-east-1 for MinIO sign-path compatibility. Why a stub, not testcontainers: a real MinIO container in CI adds a docker-in-docker dependency and ~5s of startup per run.
A stub server that emulates the path-style S3 endpoints (HEAD bucket, PUT/GET/DELETE object, GET ?list-type=2) is enough for unit-test coverage of the helmdeck-side translation logic — the wire-shape we care about. The two non-trivial pieces the stub had to model: (1) AWS chunked-signed PUT payloads (minio-go uses X-Amz-Content-Sha256: STREAMING-AWS4-HMAC-SHA256-PAYLOAD for streaming uploads, so each chunk arrives as <hex-size>;chunk-signature=<sig>\r\n<data>\r\n; the stub decodes this so Get round-trips the raw bytes the test wrote), and (2) persistent error injection (minio-go retries failed requests internally — a one-shot error stub fires on the first attempt and the retry succeeds against the stub's normal flow, so the error field has to stay set for the duration of the test). Why this matters for the reliability bet: PRs A–D proved the handlers are correct. The engine's artifact store is the substrate every artifact-producing pack writes to (slides.narrate, podcast.generate, image.generate, screenshot_url, hyperframes.render, swe.solve's trajectory dumps). A bug here breaks all of them at once with the worst possible failure mode: silent. The presigned URLs are how agents reach back to fetch what they produced — a regression in the URL shape would silently break every agent's fetch loop. Coverage: internal/packs jumped 72.6% → 82.0% (+9.4pp). internal/packs/s3store.go from 0% (across every function) to 75-100% per function. Added to the coverage gate at floor=80 — the engine layer now has the same regression-impossible backstop the v0.24.0 packages got. Tests: 117 passing in internal/packs/ (was 106). All ./internal/... package suites pass go test -race -count=1 -timeout=240s. Coverage gate reports PASS at the new floor. Why this kicks off v0.25.0: PR D closed the v0.24.0 arc with a self-audit promising "engine internals + S3 are next."
PR E starts the v0.25.0 arc against the actually-untested infrastructure: S3 store (this PR), then engine audit/memory machinery (WritePlanAudit/WritePipelineAudit/StoreFact/ProjectDefaults/CallerFromContext — all 0%), then internal/mcp ratchet (69.5 → 80), then the model-recovery loop test (the actual cheap-model-reliability proof).
- Property-based tests on seam validators + nightly mutation-testing workflow (PR D of 4, v0.24.0 reliability arc — final). PRs A-C ratcheted the quantity floor (coverage gate, contract tests at the schema seam, closing zero-coverage handlers). PR D adds the quality gates that coverage can't see — the things that actually prove the cheap-model reliability bet. Property tests (pgregory.net/rapid v1.3.0 added; test-only dep, no production import): internal/pipelines/validate_property_test.go (6 tests) pins pipelines.Validate's invariants — every well-formed pipeline must validate, every duplicate-step-ID / empty-pack / forward-step-ref / packExists-rejects pipeline must reject with a message naming the offending element (the LLM's recovery key). internal/packs/schema_property_test.go (4 tests) pins BasicSchema.Validate — every conforming output validates, every missing-required / type-mismatch / non-object input rejects with a clear message. internal/gateway/splitmodel_property_test.go (4 tests) pins gateway.SplitModel's round-trip identity AND the docstring's load-bearing claim that the split is on the FIRST / only (so "ollama/library/llama3" routes correctly to provider=ollama, model=library/llama3 — a naive strings.Split would corrupt it). And caught a real bug while writing them: BasicSchema.Validate accepted top-level null even though the docstring promises rejection of non-objects. json.Unmarshal([]byte("null"), &map[string]json.RawMessage{}) succeeds with the map left as nil — Go's decoder treats null as "no value" for map types. Without an explicit nil-check, any pack returning null instead of {} would silently pass validation. Same regression class as PR C's browser.interact null-slice screenshots: an empty value JSON-encoded as null instead of [] / {} slips past validation that "looks" right by line coverage. Fixed in internal/packs/schema.go with if obj == nil { return ... "got null" }. Why property tests, not more example tests: example tests at 95% line coverage pin the cases the test author thought of. Property tests pin the INVARIANT — across thousands of generated inputs per check. If a future refactor of extractStepRefs accepts ${{ steps. }} with a trailing dot, the well-formed property doesn't notice (it doesn't generate that shape), but the forward-ref property does — every well-formed run includes refs that the validator now misparses.
The reliability bet rests on these validators being right for the inputs they haven't seen yet; properties are how we test that. Nightly mutation-testing workflow (NEW .github/workflows/mutation.yml, go-mutesting v1.2.0): scheduled at 04:00 UTC daily, scoped narrowly to three places where a flipped condition has the largest blast radius on the reliability story — internal/packs/classify.go (typed-error closed-set mapping; a mutation swapping CodeInvalidInput for CodeInternal would silently break every LLM's failure-recovery channel), internal/gateway/fallback.go (Chain.Dispatch retry/fallback ladder; flipped predicates surface as routing dead-letters no example test catches), internal/avenc/ (the codec byte-floor checks from PRs #400/#404/#405; size < floor vs size <= floor is a 1-byte difference coverage % can't detect). Runs as a matrix (3 parallel jobs), 25-minute timeout per target, uploads survivor lists as artifacts retained 14 days, posts a per-target summary to the run page. Not a per-PR gate because go-mutesting runs the test suite once per mutation and is slow (~5-15 min per file, longer for avenc which has more functions); per-PR would burn CI. Nightly + on-demand workflow_dispatch is the right cadence for a "this drift caught us before we noticed" signal. Final per-package floors locked: avenc=90 (99.3 actual), llmcontext=90 (92.1), gateway=88 (88.1; bumped from 85), packs/builtin=80 (80.5), api=80 (80.1), pipelines=80 (84.0; new tracked package). internal/mcp deliberately not yet tracked — currently 69.5%, below the 80% infrastructure floor; adding it to the gate now would fail the run. Ratcheting mcp is a focused v0.25.0 task. cmd/* excluded as documented (entry-point os.Exit/signal handling has a realistic ~60% ceiling). Deferred to v0.25.0: the model-recovery loop test — drive Haiku 4.5 through deliberately-broken pack outputs and assert the LLM's recovery behavior — is the test that would truly prove the cheap-model reliability bet end-to-end.
Real token cost per CI run requires a dedicated budget plan + opt-in env var so dev machines don't burn credits, so it's its own arc, not bundled into PR D. Tests: 14 new property tests across 3 packages, all pass under rapid.Check (each runs ~100 generated cases per invocation by default). 855 total tests in internal/packs/... (was 750). All ./internal/... package suites pass go test -race -count=1 -timeout=240s. Coverage gate reports PASS at the new locked floors.
- Close the zero-coverage handler set in internal/packs/builtin (PR C of 4, v0.24.0 reliability arc). PR A landed the regression gate, PR B added contract tests at the schema seam. PR C closes the actually-untested handlers that PRs A and B left at 0% — the LLM-facing surfaces where coverage genuinely was theater. browser.interact (NEW browser_interact_test.go, 14 tests): full happy-path walkthrough exercising every action shape (click, type, focus, screenshot, extract, assert_text, wait, execute) against cdpfake.Client, plus per-action input validation (selector/value/text required), navigate-error propagation, assert_text → CodeSchemaMismatch mapping, the no-CDP defense-in-depth path. And caught a real bug while writing the tests: the handler initialized screenshots as a nil slice (var screenshots []string), which JSON-marshals as null and violates the array type declared in the OutputSchema for action sequences that don't include a screenshot. Production runs without the screenshot action would have failed Engine.Execute's output-schema validation with invalid_output: expected array, got null. Initialized as []string{} so empty marshals as [] — exactly the class of bug PR B's schema-contract tests are designed to catch in the future, and a useful proof point that the contract-test pattern works. github.* handler set (NEW github_handlers_test.go, 8 tests): the existing github_cache_test.go exercises the engine's cache seam by stubbing the handler — never hits githubAPI. PR C closes that gap by overriding the package-global githubAPIBase (newly var instead of const so tests can point it at httptest.NewServer, same pattern as voices.ElevenLabsBaseURL) and running the real handlers through the real HTTP call.
Tests pin: request-shape headers (Authorization Bearer, Accept, X-GitHub-Api-Version, User-Agent), the no-token branch (no Authorization header — public reads still work), upstream-error surface (4xx/5xx → CodeHandlerFailed with status + message), per-pack body/path shape for create_issue / list_prs / post_comment / create_release / search. A header regression in githubAPI is exactly the bug that would silently break every github pack at once — pinning it once protects the whole family. ElevenLabs credential ladder (NEW elevenlabs_creds_test.go, 7 tests): resolveElevenLabsKey is called at handler entry by both podcast.generate and slides.narrate. The 4-step resolve ladder (explicit credential → canonical vault name elevenlabs-key → back-compat alias elevenlabs-api-key → HELMDECK_ELEVENLABS_API_KEY env var) had 76.5% coverage with the alias + explicit + env paths untested. Tests pin each step's precedence + the no-source empty return + the explicit-missing-falls-through behavior + the nil-vault defensive path. A ladder reorder would now fail loudly pointing at the source step. Floors: internal/packs/builtin 77 → 80 (+3pp; actual coverage now 80.4%). The larger 76 → 88 sweep the original plan called for proved aspirational — PR C's 4pp ratchet (76 → 80 across PRs B+C) is the real path the available test surface supports. Tests: 750 passing in internal/packs/builtin/ (was 721). All ./internal/... package suites pass go test -race -count=1 -timeout=240s. Coverage gate reports PASS at the new floors. PR D (reshape per the v0.24.0 plan) adds rapid property tests + nightly go-mutesting workflow — quality gates on top of the quantity floor.
- Schema-contract + typed-error contract tests across the builtin pack catalog (PR B of 4, v0.24.0 reliability arc). PR A landed a coverage gate; PR B addresses the dimension coverage can't see — quality. You can hit 90% coverage with tests that only assert HTTP 200 and never verify the typed-error code the LLM's recovery depends on. The bug we actually pay for is "pack returned CodeInvalidInput but the test only checked status 400" — coverage stays green while the model receives a generic error, gets confused, and burns tokens retrying. PR B closes two specific drift surfaces at the seam where the reliability bet lives. Output-schema contract (internal/packs/builtin/output_schema_contract_test.go) extended from 2 packs (slides.narrate, podcast.generate) to 7 — adds helmdeck.plan, helmdeck.route, content.ground, research.deep, swe.solve. The class of bug closed: a pack's unit tests call pack.Handler directly, bypassing Engine.Execute which is the only place OutputSchema.Validate runs. So a handler can emit tts_chars: {by_voice: {...}} while the schema declares tts_chars: number, every unit test passes, and pipeline runs fail in production with invalid_output: field "tts_chars": expected number, got object (this exact regression shipped in v0.17.1). Each contract test invokes the real handler with valid input and asserts pack.OutputSchema.Validate(output) returns nil. Typed-error all-pack contract (NEW internal/packs/builtin/typed_error_contract_test.go) — table-driven test enumerating 47 builtin packs. For each: invoke handler with deliberately-invalid input ({} in most cases), assert errors.As(err, &perr) succeeds AND packs.IsValidCode(perr.Code) is true. The architectural promise per ADR 008 is that NO error escaping Engine.Execute carries a code outside the closed set in internal/packs/classify.go. Pack handlers returning &PackError{Code: "weird"} get coerced to CodeInternal, which the pipeline-level FailureClass router maps to pack_bug — wrong bucket, wrong recovery, the LLM tries an issue-filing flow when the real fix is caller_fixable. The table makes future drift visible: a new pack returning &PackError{Code: "something_not_in_the_set"} fails the contract loudly. Why this matters for cheap-model reliability: helmdeck's bet is that weak models drive complex workflows iff the surrounding environment is perfectly reliable.
The closed-set typed errors are the channel through which the LLM learns "this is your fault, fix your input" vs. "this is infrastructure, retry with backoff" vs. "this is a bug, escalate." Without these contract tests pinning the channel, schema-vs-handler drift breaks the channel silently and the model's recovery becomes a stab in the dark. Floors: internal/packs/builtin 76 → 77 (modest ratchet — the larger jump to 82 lands in PR C alongside browser_interact / github_handlers / elevenlabs_creds test coverage). Tests: 721 passing in internal/packs/builtin/ (was 712). All ./internal/... package suites pass go test -race -count=1 -timeout=240s. Coverage gate reports PASS at the new floors. Plan reshape: PR D was originally "close gateway 88→91, lock floors, write a docs page" — a symbolic close. After landing PR A and observing that coverage % alone doesn't prove the reliability bet, PR D was reshaped to add rapid property-based tests on the seam validators (pipelines.Validate, OutputSchema.Validate) and a nightly go-mutesting workflow against the decision-dense LLM-facing code (classify.go, gateway/fallback.go, avenc codec floors). Floor-locking stays but is no longer the headline. The model-recovery loop test — drive Haiku 4.5 through deliberately-broken pack outputs and assert the LLM's recovery behavior — is a v0.25.0 candidate (real token cost per CI run requires a dedicated budget plan).
- Per-package coverage gate + golangci-lint job in CI (PR A of 4, v0.24.0 reliability arc). The architectural bet behind helmdeck is that weak, cheap models can drive complex workflows iff the environment around them is perfectly reliable — typed errors, strict schemas, context compaction. Coverage is the measurable proof that bet holds; an untested branch that returns a raw Go error breaks LLM context and burns tokens reasoning out. This PR makes regression impossible from this point forward, then closes the biggest gap (internal/api/) so the floor can lift. Coverage gate (scripts/coverage-gate.sh) parses coverage.txt directly, computing statement-weighted percentages per package via awk — same metric go tool cover -func prints for the total: row — so a single big untested function can't hide behind many small tested ones the way function-averaging via cover -func | tail would. Initial floors land at current values rounded down (avenc:90, llmcontext:90, gateway:85, packs/builtin:75, api:60); PRs B–D ratchet these up. The script reports every tracked package on every CI run, success or fail, so maintainers see drift before it crosses a floor. Slack of 0.05% absorbs rounding noise so an 89.95% reading doesn't flake a 90% gate. golangci-lint job in .github/workflows/ci.yml uses the v2 config schema in .golangci.yml and enables errcheck, govet, staticcheck, unused, ineffassign — the bug-class linters that coverage doesn't replace (shadowed variables, unchecked errors, dead code). errcheck excludes the common Close/Flush patterns ((*sql.DB).Close, (*io.PipeWriter).Close, (*tabwriter.Writer).Flush, …) explicitly — errcheck matches on the static receiver type, not interface satisfaction, so (io.Closer).Close alone wouldn't catch them. staticcheck is scoped to the SA* prefix (genuine bug checks); the ST* (style), S* (simplification), and QF* (quick-fix) categories are off in PR A because the codebase had never run staticcheck and would surface ~33 pre-existing findings whose cleanup belongs in a focused style PR. Subsequent reliability-arc PRs can opt categories back in as the backlog drains. only-new-issues: true ratchets the same way the coverage gate does — the lint blocks NEW issues a PR introduces, doesn't force the cleanup of pre-existing drift in code paths a PR doesn't touch.
Action pinned to v8 with golangci-lint v2.12.2 — the major version of the action, the linter binary, and .golangci.yml's schema must stay in sync on bumps (action v6 only supports linter v1; v8 maps to linter v2). internal/api/ coverage lift: 62.8% → 80.1% (+17.3pp). Tests added across handler-shape branches that previously lived in the 0% column. Mostly small, high-fanout additions: REST handlers returned 503/404/405 paths that weren't exercised; pack/pipeline/MCP adapters that the in-process MCP surface routes through; SSE handshake paths for the QMD memory-corpus bridge; vault and key-store error mappings; webhook dispatch + post-issue-comment fallback. New tests honor existing seam patterns — fake.Runtime for session ops, cdpfake.Client for CDP, httptest recorders for HTTP shape, memory.NewInMemoryStore for memory paths — so future contributors find familiar scaffolding instead of one-off mocks. CI scope tightened: go vet and go test -race restricted to ./internal/... ./cmd/... (was ./...) so the test job doesn't try to compile the Docusaurus-bundled website/build/assets/*.go tree, which references internal/api symbols but isn't part of the go.mod build module — this drift was masking real failures behind an unrelated compile error. Why per-package, not aggregate. An org-wide average lets a strong package subsidize a weak one — avenc at 99.3% and gateway at 88% would absorb a hypothetical api drop from 80% to 30%, and operators would notice only when the cheap-model reliability story collapsed in production. Per-package floors with explicit exclusions (cmd/* entry points, generated code, integration-only paths) make backsliding loud at the PR that introduces it. The 4-PR cadence (A–D, this is A) lets each step add tests and raise its package's floor without locking in two more PRs against a flawed baseline — if the gate logic has a counting quirk, we discover it in this PR, not in PR D.
Final floors at PR D: critical reliability packages (avenc, llmcontext, gateway, packs/builtin) at 90%; infrastructure packages (api, pipelines, mcp) at 80%; cmd/* excluded. Test count: 291 passing tests in internal/api/ (was 173). All ./internal/... package suites pass go test -race -count=1 -timeout=240s.
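The statement-weighted metric behind the coverage gate can be sketched in Go. This is an illustrative re-implementation only: the actual gate is the awk logic in scripts/coverage-gate.sh, and the names pkgCoverage and meetsFloor are hypothetical. The sketch assumes the standard cover-profile line shape (file:range, statement count, hit count) and, as a simplification, does not de-duplicate blocks that a merged profile may repeat.

```go
package main

import (
	"bufio"
	"path"
	"strconv"
	"strings"
)

// pkgCoverage returns statement-weighted coverage (0-100) per package from
// the text of a Go cover profile. Hypothetical helper, not the gate script.
func pkgCoverage(profile string) map[string]float64 {
	type tally struct{ covered, total int }
	byPkg := map[string]*tally{}

	sc := bufio.NewScanner(strings.NewReader(profile))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "mode:") {
			continue // skip blank lines and the "mode: set|count|atomic" header
		}
		// Each block line: <file>:<start>,<end> <numStatements> <hitCount>
		fields := strings.Fields(line)
		if len(fields) != 3 {
			continue
		}
		colon := strings.LastIndex(fields[0], ":")
		stmts, errStmts := strconv.Atoi(fields[1])
		hits, errHits := strconv.Atoi(fields[2])
		if colon < 0 || errStmts != nil || errHits != nil {
			continue
		}
		pkg := path.Dir(fields[0][:colon]) // package = directory of the file
		t := byPkg[pkg]
		if t == nil {
			t = &tally{}
			byPkg[pkg] = t
		}
		t.total += stmts
		if hits > 0 {
			t.covered += stmts // weight by statements, not by function
		}
	}

	out := make(map[string]float64, len(byPkg))
	for pkg, t := range byPkg {
		if t.total > 0 {
			out[pkg] = 100 * float64(t.covered) / float64(t.total)
		}
	}
	return out
}

// meetsFloor applies the rounding slack: a reading passes when it is within
// slack percentage points below the floor.
func meetsFloor(actual, floor, slack float64) bool {
	return actual+slack >= floor
}
```

Weighting by statements is what keeps one large untested function from hiding behind many small tested ones: a package with an 8-statement covered block and a 2-statement uncovered block reports 80%, regardless of how many functions those blocks belong to.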
[0.23.0] - 2026-06-03
Theme: Reliable narrated decks + shared audio/video helper. The slides.narrate failure surface that produced "ffmpeg segment N failed (exit 0)" — for weeks the most-reported single error — is closed end-to-end: silent-failure detection, honest error messages, OOM retry, post-encode validation, Mermaid pre-rendering, audio re-encode at concat boundaries, and most importantly a shared internal/avenc/ package that captures every lesson PRs #379–#405 paid for so the next pack that needs ffmpeg starts from a solid base. Plus a long string of supporting fixes: pipeline-run single-flight coalescing, session-timeout extension on pinned reuse, paid-API credential precheck, and the ADR 051 routing-reliability work (reasoning-token stripping, parser parity, calibration tooling, cause-typed errors, strict JSON, prefix cache).
Changed
- internal/podcast.Concat adopts internal/avenc/ shared helpers (PR C of 3 — final consolidation step). Completes the 3-PR avenc consolidation arc. internal/podcast/concat.go was the second of two production callers shelling out to ffmpeg directly; this migration removes its duplicated generateSilence / probeAudioDuration / padTurnToMin / concat-command helpers and replaces them with avenc.GenerateSilence / avenc.ProbeAudioDuration / avenc.PadAudioToMin / avenc.ConcatAudio calls. The Concat function steps now read like the documentation: write the per-turn files (writeTurnFile stays local — streaming Stdin isn't ffmpeg-shaped), call avenc.GenerateSilence for the between-turn segment, build the concat list, call avenc.ConcatAudio for the audio-only concat with mandatory re-encode, then avenc.ProbeAudioDuration for the final duration. padTurnToMin collapsed from a 45-line inline 4-step ffmpeg pipeline to a 6-line composition of avenc.ProbeAudioDuration + avenc.PadAudioToMin. SilenceTurn now wraps avenc.GenerateSilence + a cat-readback (the wrap closes the 0-byte-output hole the original SilenceTurn had — avenc post-validates the produced file). Bridging shape: session.Executor is the runtime-side interface with an Exec(ctx, sessionID, req) signature, while avenc.Executor is a closure already bound to a sessionID. A 3-line avencBind(ex, sessionID) avenc.Executor adapter lives at the top of concat.go and gets passed to every avenc call. Considered putting the sessionID parameter on the avenc API directly — rejected because it would propagate the session-ID-dispatch indirection into every avenc caller including slides.narrate, which already passes ec.Exec directly without indirection; the closure-adapter pattern is a one-line cost paid only by callers that need it. Net change: 166 LOC → ~140 LOC in concat.go (the 45-line padTurnToMin shrank to 6 lines + 9-line doc comment, and the inline concat block lost ~12 lines of error-handling boilerplate that avenc owns now).
Concat reads as a much shorter pipeline of named-helper calls. Behaviour deltas worth noting: (1) ffprobe invocations now prefix LC_ALL=C (avenc's locale-stability guard from the external research in PR #406). A sidecar with LC_NUMERIC=de_DE previously had ffprobe emit "5,123" which fmt.Sscanf silently parsed to 0 — now the parse simply succeeds since LC_ALL=C forces period decimal separators. (2) Final-duration probe failures (corrupt MP3 etc.) now ignore the avenc-returned error and fall back to duration = 0 — preserves the historical behaviour where fmt.Sscanf silently returned 0 on garbage so the caller's cost-accounting code didn't blow up on a probe failure. (3) padTurnToMin intermediate file names changed from /tmp/helmdeck-podcast/turn-NNN-pad.mp3 to /tmp/helmdeck-podcast/avenc-pad-turn-NNN.mp3 — the prefix is now avenc-pad- instead of the bare turn-NNN-pad-* shape. Files are still in concatTempDir; temporary; no semantic impact. (4) SilenceTurn previously had a known 0-byte-output hole (mid-write SIGPIPE produced an empty file with exit 0). The avenc-wrapped version closes it — silent-fallback runs now surface "silence-gen output: produced only N bytes" instead of generating a video over an empty audio track. Tests: 3 existing concat_test.go tests pass (one needed the fake-executor's response map updated to include "wc -c < " so avenc's post-encode size validation gets a healthy response — same fixture pattern as PR B used for slides.narrate); 2 existing podcast_generate_test.go tests needed the same wc -c mock added and the ffprobe HasPrefix switched to Contains so the new LC_ALL=C ffprobe ... shape still matches. All 1702 internal tests pass (count unchanged from PR B — no leaf-helper tests existed at the podcast level to delete, since the duplicated helpers were exercised through the higher-level padTurnToMin and Concat orchestration).
What this completes: every audio/video pack that shells out to ffmpeg directly (slides.narrate + podcast.generate via internal/podcast.Concat) now imports internal/avenc/. Future packs (tiktok.shorts, audiobook.generate, …) inherit every battle-tested pattern PRs #390–#405 paid for, with concentrated 99.3%-coverage tests rather than partial coverage across N call sites. The consolidation arc described in /root/.claude/plans/i-would-like-to-elegant-kahan.md is closed.
- slides.narrate adopts internal/avenc/ shared helpers (PR B of 3 — slides.narrate migration). Validates PR #406's abstraction by deleting the duplicated requireNonEmptyOutput / generateSilence / probeAudioDuration / padSlideAudioToMin / looksLikeMP3 / validateElevenLabsBody helpers from internal/packs/builtin/slides_narrate.go and replacing them with avenc.RequireNonEmptyOutput / avenc.GenerateSilence / avenc.ProbeAudioDuration / avenc.PadAudioToMin / avenc.ValidateMP3Body calls. The manual concat command + error-handling block (10+ lines of OOM lifts, transport-error detection, stderr capture) collapses to a single avenc.ConcatVideoMP4s call that owns PR #404's byte-stable -c:v copy -c:a aac -b:a 192k shape. Net change: 298 LOC removed from slides_narrate.go (-200 net once you account for the avenc call sites added), 312 LOC removed from slides_narrate_test.go (the leaf-helper tests are now covered by avenc's 99.3%-coverage test surface — keeping them in slides.narrate would be redundant). Total: 567 lines deleted, 43 lines added across both files. Local helpers KEPT: (a) encodeSegment + ffmpegEncodeOpts + the per-segment OOM-retry loop — preserves persistFfmpegStderr artifact-store dump on the most-common production failure path (avenc.EncodeVideoSegment surfaces stderr inline only, truncated at 4 KB; the per-failure artifact dump is genuinely useful for production debug); (b) validateMarpPngs + pngMagicHex + minRenderedSlidePngBytes — Marp-specific PNG validation, not shared with any other pack; (c) slidesNarrateFfmpegThreads + the env-var → constants — slides.narrate-specific operator tuning knob; (d) persistFfmpegStderr + truncStr + artifactSuffix + extractFirstJSONObject — used by the kept encodeSegment path. Local helpers DELETED: every audio/duration/validation helper from PR #400 + PR #404 + PR #405 — they live in avenc now. Behaviour changes worth noting: (1) Concat-step failure messages no longer reference an ffmpeg-stderr-concat.txt artifact key (concat failures are rare — single occurrence vs.
per-segment, which still gets the artifact dump). The inline 4 KB stderr is still in the surfaced error message, sufficient for diagnosis. (2) PadAudioToMin's intermediate-file names changed from /tmp/audio-NNN-pad.mp3 to /tmp/avenc-pad-slide-NNN.mp3 (and similar for the merged + list files). Operators inspecting the sidecar mid-run see the new names; the intermediate files are temporary so the rename has no semantic impact. (3) The new avenc floor constants apply: MinSilenceMP3Bytes = 256 (was the same), MinTTSResponseBytes = 512 (was the same), MinEncodedSegmentBytes = 1024 (was the same) — byte-identical, just relocated. Tests: existing slides.narrate orchestration tests pass byte-identically (37 SlidesNarrate-prefixed tests), proving the abstraction is right. The 13 leaf-helper test functions (~30 individual sub-cases) that directly called the deleted local helpers were removed since avenc's 99.3%-coverage test surface covers the same behaviours, and keeping them would be testing avenc twice. The test count drops from 1732 → 1702 (-30 redundant sub-cases). All 1702 internal tests pass. What this PR validates: PR A's abstraction is correct — every removed helper had a 1:1 avenc equivalent with the same signature shape and the same error-message conventions. No edge case surfaced that required reshaping avenc. PR C (internal/podcast.Concat migration) can proceed against the same stable surface.
Added
- internal/avenc/ — shared ffmpeg/ffprobe/TTS-validation helper package (PR A of a 3-PR consolidation; no callers change yet). Operator framing: "the slides.narrate audio code keeps breaking; there's ffmpeg code all over the place across packs; could we have ONE helper everyone uses with 90% test coverage that covers things it can run into?" The intuition is correct. PRs #390, #399, #400, #401, #404, #405 each closed a real audio/video failure mode, but each landed in only one caller (slides.narrate or internal/podcast.Concat). Without consolidation, the next pack that shells out to ffmpeg starts from zero on lessons already paid for. This PR extracts the canonical patterns into internal/avenc/ so future packs (tiktok.shorts, audiobook.generate, …) inherit every battle-tested behaviour automatically AND so we can cover the patterns with a single concentrated test surface instead of partial coverage across N packs. Scope verified: the two packs that shell out to ffmpeg directly are slides.narrate (5+ call sites) and podcast.generate / internal/podcast.Concat (4 call sites). hyperframes.render uses a hyperframes CLI wrapper that opaquely runs ffmpeg — out of scope. slides.render and hyperframes.compose have no ffmpeg surface. External research (Mux, WaveSpeed, vidcutter, ffprobe docs — citations in the plan file) confirmed the canonical patterns the bug history already encoded AND surfaced two gaps closed in this PR: (1) LC_ALL=C prefix on every ffprobe invocation so a sidecar with LC_NUMERIC=de_DE doesn't emit "3,14" instead of "3.14" and strconv.ParseFloat silently returns 0; (2) ffprobe-based MP4 stream presence validation (-show_entries stream=codec_type) so an ffmpeg-exit-0 output with no moov atom / no audio stream surfaces honestly instead of slipping past the byte-floor check.
The exported surface area: Executor type alias matching packs.ExecutionContext.Exec; MinEncodedSegmentBytes / MinSilenceMP3Bytes / MinTTSResponseBytes / LocalePrefix size + locale constants; IsOOMExitCode shared classifier; RequireNonEmptyOutput post-success size stat; LooksLikeMP3 byte-level MP3 sniffer (MPEG-1/2 Layer III sync words + ID3v2); ValidateMP3Body (size floor + MP3 sniff for HTTP-200-wraps-error case); ValidateMP4Streams (LC_ALL=C ffprobe stream-presence check, optional video/audio); ProbeAudioDuration (LC_ALL=C ffprobe + NaN/±Inf/non-positive rejection); GenerateSilence (anullsrc → libmp3lame + post-validate); ConcatAudio (audio-only concat with mandatory re-encode, configurable codec + bitrate); ConcatVideoMP4s (video stream-copy + audio re-encode — PR #404's asymmetric pattern locked in); PadAudioToMin (4-step composition of silence-gen + concat that no-ops within 1ms epsilon); EncodeVideoSegment (still-image-plus-audio → .mp4 with PR #390's OOM-retry pattern: primary -threads 4 / medium preset → on exit 137 retry ONCE with -threads 1 -preset veryfast, surface CodeResourceExhausted on double-OOM). Tests: 80 unit tests in internal/avenc/*_test.go covering every failure mode per function (happy path, transport error, OOM exit, generic non-zero exit, output validation: missing/0-byte/below-floor, edge cases: NaN/±Inf/0/negative/garbage stdout, MP3 sync variants, comma-decimal locale guard, OOM retry fires + double-OOM escalation + non-OOM-no-retry, byte-stable codec/bitrate/flag-shape regression guards: -c:v copy / -c:a aac / -b:a 192k / -tune stillimage / -shortest / -pix_fmt yuv420p). go test -race -coverprofile=avenc-cover.out ./internal/avenc/... reports 99.3% line coverage across 80 tests — well above the 90% target. The mock executor scaffolding (mockExec in validate_test.go) is a substring-keyed scripted-response Executor modelled on slides_narrate_test.go's narrateExecScript, generalised so every avenc test reuses the same goroutine-safe mock instead of each helper rolling its own.
Three convenience builders chain: .stdout(needle, out) for happy-exit, .fail(needle, code, stderr) for non-zero exits, .transport(needle, errMsg) for the err != nil case. Reusable in test files of downstream packages without dragging in the engine. What this PR does NOT do: migrate any callers. The plan's PR B will delete slides.narrate's duplicated generateSilence / probeAudioDuration / padSlideAudioToMin / encodeSegment / looksLikeMP3 / validateElevenLabsBody / requireNonEmptyOutput helpers and call avenc instead (~200 LOC net deletion in slides_narrate.go). PR C will do the same for internal/podcast/concat.go (~80 LOC net deletion). Each pack migration is a delete-and-replace refactor with byte-identical behaviour tests — if PR A's abstraction is wrong we'll discover it without locking in two pack rewrites. All 1732 internal tests pass (1652 from main + 80 new in avenc/).
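The size-floor-plus-sniff validation listed for this package can be sketched roughly as follows. This is an illustration written from the description in this entry (sync words 0xFF followed by 0xFB/0xFA/0xF3/0xF2, or an ID3v2 tag header), not the avenc source; the lowercase function names and error strings are assumptions.

```go
package main

import (
	"bytes"
	"fmt"
)

// looksLikeMP3 reports whether body starts like an MP3 stream: either an
// ID3v2 tag header or one of the Layer III sync words named in the entry.
func looksLikeMP3(body []byte) bool {
	if bytes.HasPrefix(body, []byte("ID3")) {
		return true
	}
	if len(body) < 2 || body[0] != 0xFF {
		return false
	}
	switch body[1] {
	case 0xFB, 0xFA, 0xF3, 0xF2:
		return true
	}
	return false
}

// validateMP3Body applies a size floor first, then the byte sniff, so an
// error payload delivered with HTTP 200 is not mistaken for audio.
func validateMP3Body(body []byte, minBytes int) error {
	if len(body) < minBytes {
		return fmt.Errorf("body is %d bytes, below the %d-byte floor", len(body), minBytes)
	}
	if !looksLikeMP3(body) {
		return fmt.Errorf("body does not start with an MP3 sync word or ID3v2 header")
	}
	return nil
}

func main() {
	audio := make([]byte, 600)
	audio[0], audio[1] = 0xFF, 0xFB
	fmt.Println(validateMP3Body(audio, 512))
	fmt.Println(validateMP3Body([]byte(`{"error":"quota exceeded"}`), 512))
}
```

Checking the floor before the sniff keeps the error message specific: a tiny JSON envelope is reported as too small, while a large HTML error page is reported as not being MP3.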
Fixed
- slides.narrate no longer speaks <!-- image_prompt: ... --> comments aloud — the narrator was literally saying "image prompt colon: a chart of revenue by year" because the speaker-notes extractor matched every <!-- ... --> block indiscriminately. Operator-reported on the run that produced the first video after PR #404 closed the Mermaid + audio-dropout gaps: the narrator on each slide was reading both the freeform speaker notes AND the image_prompt comment that slides.outline embeds next to them. The bug was structural: slides.outline instructs the LLM to emit a <!-- speaker notes --> comment (freeform text the narrator should say) AND a <!-- image_prompt: ... --> comment (structured metadata consumed by slides.outline's own extractImagePrompts to produce a typed image_prompts[] output array). slides.narrate's extractNotes at internal/packs/builtin/slides_notes.go:106 used the generic notePattern regex (<!--\s*(.*?)\s*-->) and concatenated EVERY match into the spoken-notes string. Result: the image_prompt's content (a description of the visual the slide should show) ended up in the TTS payload and the narrator spoke it as if it were dialog. Fix: a small isStructuredMetadataComment helper checks each comment's inner-text prefix; comments whose trimmed lowercase body starts with image_prompt: are skipped when building the narrator's TTS input but still get stripped from the visible slide content (the existing catch-all ReplaceAllString keeps that behavior). The filter is an explicit allowlist of prefixes — currently just image_prompt: — chosen over a generic "anything-with-a-colon" filter so legitimate freeform notes that happen to contain a colon ("Note: discuss this further") still get spoken. Future structured-comment prefixes get added to the same allowlist as they ship.
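The prefix filter described above can be sketched as follows. This is an illustration, not the code in slides_notes.go: the (?s) flag on the pattern, the structuredPrefixes slice, and joining notes with a single space are assumptions of the sketch.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// notePattern mirrors the generic comment regex quoted in the entry; the
// (?s) flag (letting "." span newlines) is an assumption of this sketch.
var notePattern = regexp.MustCompile(`(?s)<!--\s*(.*?)\s*-->`)

// structuredPrefixes is the explicit prefix list; only image_prompt: today.
var structuredPrefixes = []string{"image_prompt:"}

// isStructuredMetadataComment reports whether a comment's inner text starts
// with a known metadata prefix after trimming and lowercasing.
func isStructuredMetadataComment(inner string) bool {
	body := strings.ToLower(strings.TrimSpace(inner))
	for _, prefix := range structuredPrefixes {
		if strings.HasPrefix(body, prefix) {
			return true
		}
	}
	return false
}

// extractNotes collects the comments that should be spoken, skipping
// structured metadata. Joining with a space is a choice of this sketch.
func extractNotes(slide string) string {
	var spoken []string
	for _, m := range notePattern.FindAllStringSubmatch(slide, -1) {
		if isStructuredMetadataComment(m[1]) {
			continue
		}
		if text := strings.TrimSpace(m[1]); text != "" {
			spoken = append(spoken, text)
		}
	}
	return strings.Join(spoken, " ")
}

func main() {
	slide := "# Revenue\n<!-- Welcome everyone. -->\n<!-- image_prompt: a chart of revenue by year -->"
	fmt.Printf("%q\n", extractNotes(slide))
	fmt.Printf("%q\n", extractNotes("<!-- Note: discuss this further -->"))
}
```

Because the check is a prefix match on the trimmed body, a note that merely mentions image_prompt mid-sentence, or that contains some other colon, is still spoken.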
Tests: 6 new sub-cases in the TestExtractNotes table — "speaker notes plus image_prompt — only notes spoken" pins the production-shape behavior, "image_prompt only — empty notes" confirms a slide with only a prompt and no narration produces empty notes (the narrator path then falls back to silence — correct), "image_prompt interleaved with speaker notes — image_prompt dropped" confirms ordering doesn't matter, "IMAGE_PROMPT uppercase — still filtered" pins case-insensitivity so a model that produces uppercase or mixed-case doesn't slip through, "image_prompt with weird whitespace — filtered" pins whitespace tolerance, and the critical false-positive guard "freeform note containing image_prompt as substring — preserved" confirms that a legitimate narration that mentions the words "image_prompt" mid-sentence ("The image_prompt feature is documented in the README.") is NOT filtered — the HasPrefix check on trimmed inner text only matches when the metadata prefix is at the very start. All 1652 internal tests pass (+6 from main). What this means for the operator's video: the next narrated deck will speak only the actual speaker notes for each slide, not the image_prompt descriptions. The image_prompts themselves remain available on slides.outline's output as the typed image_prompts[] array (slides.outline's behavior is unchanged); downstream packs that consume that array (e.g. for hero-image generation) continue to work.
- slides.narrate now pre-renders mermaid fenced blocks via mmdc (parity with slides.render) AND re-encodes audio at concat to eliminate mid-segment dropouts. Operator-reported on the first successful run of builtin.repo-presentation after PR #401 unblocked the engine: the video completed end-to-end (no more session: not found), but two surface-level rendering bugs surfaced. (1) Mermaid not rendered. slides.render has shipped a preprocessMermaidFences helper for a while (slides_render.go:399) — it finds mermaid fences, runs mmdc (mermaid-cli) inside the sidecar, converts each diagram to SVG, and substitutes the fence with an inline <img src="data:image/svg+xml;base64,..." />. The sidecar image already ships mmdc with /etc/mmdc/puppeteer-config.json. slides.narrate simply was not calling that helper; raw mermaid blocks landed in /tmp/helmdeck-deck.md and Marp's headless Chromium (which has no built-in Mermaid renderer) left them blank in the per-slide PNGs. Fix: wire preprocessMermaidFences into the slides.narrate handler right after hero-image inlining and before injectFitStyle / write-to-sidecar. The helper lives in the same builtin package, so it's a one-call addition. New mermaid *bool input field on the pack mirrors slides.render's same field — default on (nil ⇒ on), explicit false opts out for decks without diagrams (saves ~500ms of mmdc startup per diagram). (2) Audio dropouts mid-slide. Per-segment AAC frames (1024 samples each) rarely divide cleanly into a TTS-driven segment duration, so the per-segment .mp4s contain partial AAC frames at their tail boundaries. The existing concat command was ffmpeg -y -f concat -safe 0 -i /tmp/concat.txt -c copy /tmp/final.mp4 — -c copy stream-copies BOTH streams, splicing at the wrong-boundary AAC frames and producing audible mid-segment dropouts whenever the audio crossed a segment edge mid-word. Fix: split the concat codec flags — video stays stream-copy (-c:v copy, fast and lossless), audio is re-encoded (-c:a aac -b:a 192k, matches the per-segment bitrate).
The re-encode pass re-aligns AAC frames at concat time, eliminating dropouts. Cost is a single AAC pass over the total audio (typically 5-15 min of audio, encoded by libavcodec in seconds — negligible vs. the per-segment h264 encode that already spent 5-15 min on video). Video stream-copy is preserved because per-segment h264 IS identical across segments (same libx264 invocation, same params) and GOP structure aligns to keyframes at each segment start. Tests: 4 new in slides_narrate_test.go. Commit A (Mermaid): TestSlidesNarrate_MermaidFencePreprocessed asserts mmdc ran AND the markdown handed to Marp contains the inline-SVG data-URI AND NOT the raw mermaid fence. TestSlidesNarrate_MermaidOptOut asserts mermaid:false skips mmdc even on a deck with fences. TestSlidesNarrate_NoMermaidFenceSkipsMmdc asserts a fence-free deck pays zero mmdc cost. The narrateExecScript test harness gained an mmdc case (ordered BEFORE the cat > case because the mmdc wrapper shell script contains both substrings — a subtle ordering bug we caught with the first failing test run). Commit B (audio): TestSlidesNarrate_ConcatReencodesAudio pins the new flag shape — must contain -c:v copy, -c:a aac, -b:a 192k, and MUST NOT contain the legacy -c copy (which would stream-copy both streams and re-introduce dropouts). The explicit "must not contain" assertion is the bug-shape guard against a future "make concat faster" refactor that quietly reverts. All 1646 internal tests pass (+4 across both commits). What this means for the originally-reported video: re-running the same builtin.repo-presentation against the same deck now produces (a) per-slide PNGs with Mermaid diagrams visible (the slide that triggered PR #399's earlier failure shape will actually render correctly this time), and (b) continuous audio across segment boundaries with no mid-sentence dropouts.
- Pinned-session reuse honors the longest-needed Spec.Timeout across packs in a pipeline — closes the shared-session-watchdog bug where slides.narrate's 30-minute timeout was silently overridden by repo.fetch's 5-minute default and the watchdog killed multi-segment encodes at ~5 minutes with session: not found. Operator-observed on run_71be278e92d7bb5b after PR #400 made the failure honest: slides.narrate failed at segment 7 (~4 minutes into the encode loop) with the message PR #400 introduced — ffmpeg segment 7: docker-exec transport error (ffmpeg did NOT return a real exit code): session: not found. The honest error was the win — operators now know the session was killed, not that ffmpeg failed. The underlying bug was that the watchdog (internal/session/watchdog.go:57) computes the kill deadline as s.CreatedAt + s.Spec.Timeout, where s.Spec.Timeout is frozen at session-create time by whichever pack first called Runtime.Create. In builtin.repo-presentation's flow — repo.fetch (creates session, preserves via _session_id) → repo.map → slides.outline → slides.narrate — every follow-on pack reuses the session created by repo.fetch, inheriting its Spec.Timeout. Even though slides.narrate's pack declaration sets SessionSpec.Timeout = 30 * time.Minute, the reused session retained repo.fetch's (default) 5-minute timeout. Control-plane logs from the operator's run confirmed it: 13:41:30 reusing pinned session pack=slides.narrate session_id=f9a98cec…, then 13:45:45 watchdog terminating expired session age=5m7s — exactly the 5-minute pre-extension deadline, well inside slides.narrate's needed window. The fix adds a new method Runtime.ExtendTimeout(ctx, id, newTimeout) to the session.Runtime interface — when called with newTimeout > current Spec.Timeout, it updates the session's in-memory Spec.Timeout so the watchdog uses the longer deadline; when called with an equal or shorter value, it is a no-op (the deadline never shrinks under a pinned reuse, so a fast follow-on pack cannot accidentally pull the deadline down).
Implemented in both internal/session/docker.Runtime (production) and internal/session/fake.Runtime (tests). internal/packs/packs.go around line 605 — the existing pinned-session-reuse branch — now calls ExtendTimeout when pack.SessionSpec.Timeout > sess.Spec.Timeout and logs extended pinned session timeout with the old and new values. The call is best-effort: on failure the engine logs at WARN and proceeds, and the worst case is the pre-fix behavior (watchdog kills at the old deadline) — that's a fallback, not a regression. What this does NOT do (deliberate scope decisions): MemoryLimit, CPULimit, SHMSize, and the container's actual runtime resources are NOT mutated on reuse — those are frozen by Docker at container creation and cannot be changed on a live container without restart. A pipeline that needs more memory for slides.narrate than repo.fetch allocated would still need a "pipeline-level max-Spec aggregation" pass (separate follow-up, larger surgery). Timeout is uniquely runtime-mutable because it only affects the in-memory deadline the watchdog reads, not container resource caps.
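The grow-only contract and its effect on the watchdog deadline can be sketched with an in-memory registry. The registry and session types below are hypothetical stand-ins, not the session.Runtime implementations; only the rule (deadline = creation time + timeout, and the timeout never shrinks) comes from this entry.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

var errSessionNotFound = errors.New("session: not found")

// session and registry are hypothetical stand-ins for the runtime's state.
type session struct {
	createdAt time.Time
	timeout   time.Duration
}

type registry struct {
	mu       sync.Mutex
	sessions map[string]*session
}

// extendTimeout only ever grows the timeout; equal or shorter values are
// a no-op so a fast follow-on pack cannot pull the deadline down.
func (r *registry) extendTimeout(id string, newTimeout time.Duration) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	s, ok := r.sessions[id]
	if !ok {
		return errSessionNotFound
	}
	if newTimeout > s.timeout {
		s.timeout = newTimeout
	}
	return nil
}

// expired applies the watchdog rule: deadline = createdAt + timeout.
func (r *registry) expired(id string, now time.Time) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	s, ok := r.sessions[id]
	return ok && now.After(s.createdAt.Add(s.timeout))
}

// demo replays the reported shape: a 5-minute session that is 6 minutes old.
func demo() (expiredBefore, expiredAfter bool, finalMinutes int) {
	now := time.Now()
	r := &registry{sessions: map[string]*session{
		"s1": {createdAt: now.Add(-6 * time.Minute), timeout: 5 * time.Minute},
	}}
	expiredBefore = r.expired("s1", now)
	_ = r.extendTimeout("s1", 30*time.Minute)
	_ = r.extendTimeout("s1", time.Minute) // shorter request: must be a no-op
	expiredAfter = r.expired("s1", now)
	finalMinutes = int(r.sessions["s1"].timeout / time.Minute)
	return
}

func main() {
	before, after, minutes := demo()
	fmt.Println(before, after, minutes)
	fmt.Println(new(registry).extendTimeout("missing", time.Minute))
}
```

Returning a distinct error for an unknown session lets a caller tell "nothing to extend" apart from "extension was unnecessary", which matters when the call is best-effort.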
Tests: 8 new tests in 3 files. internal/session/fake/fake_test.go (NEW) — TestFakeRuntime_ExtendTimeout_GrowsDeadline (basic extend contract), _NeverShrinks (table-driven across equal/shorter/zero values — critical so the deadline never goes backward), _UnknownSession (returns ErrSessionNotFound so callers can distinguish missing session from no-op). internal/session/watchdog_test.go — TestWatchdogRespectsExtendedTimeout (regression guard for the actual production failure: session injected with CreatedAt 6 minutes ago and Timeout=5m would normally die immediately on watchdog tick; after ExtendTimeout to 30m the watchdog must skip it). internal/packs/packs_test.go — TestEngine_PinnedSessionReuse_ExtendsTimeoutWhenLonger (slides.narrate-shaped: pack Spec.Timeout=30m reusing a session with current Timeout=5m triggers exactly one ExtendTimeout call with the right session id and newTimeout=30m), _NoExtendWhenShorter (repo.map-shaped: pack Timeout=5m reusing a session with Timeout=30m triggers NO call — critical for the no-shrink invariant at the engine layer), _NoExtendWhenEqual (boundary: equal timeouts skip the extend so there is no spurious log line or registry mutation), _ExtendErrorDoesNotFailHandler (when ExtendTimeout errors, the handler still runs — best-effort posture). Three existing Runtime stub implementations across the test tree updated to satisfy the new interface method: internal/packs/builtin/screenshot_url_test.go, internal/api/desktop_vnc_test.go, and the existing engine-level fakeRuntime in packs_test.go (extended with extendCalls capture + getTimeout knob). All 1642 internal tests pass (+8 from main). What this means for the originally-reported failure (run_71be278e92d7bb5b): a re-run of the same builtin.repo-presentation against the same Mermaid-bearing deck now extends the shared session's timeout to slides.narrate's 30 minutes at the moment slides.narrate starts, so the watchdog will not kill the session mid-encode.
If the deck still fails at segment 7, the failure must be something OTHER than the watchdog — either ffmpeg producing 0-byte output (caught by PR #400's post-encode check), a Mermaid render issue (caught by PR #399's PNG validation), or a genuinely transient docker disconnect (now visible as the honest transport error). The three remaining failure modes are distinguishable from each other and from real pack bugs.
- slides.narrate silent-failure surface closed across PNG, ffmpeg encode/concat, ffprobe, ElevenLabs TTS, and silence/pad paths — eliminates the recurring "ffmpeg segment N failed (exit 0)" misclassification and the audit-identified gap class behind it. Operator reported the same handler_failed: ffmpeg segment 4 failed (exit 0) shape PR #399 was meant to eliminate, this time on a Mermaid sequence diagram in slide 4. Three parallel Explore-agent audits surfaced the structural cause: slides.narrate had twelve silent-failure modes, and PR #399's validateMarpPngs size-only check only covered two of them. Among the rest: the per-segment ffmpeg error template at slides_narrate.go:587 (and the parallel concat path at 616) printed res.ExitCode unconditionally even when the failure was err != nil (docker-exec transport error from a session disconnect / container kill mid-call) and res.ExitCode was the zero value — operators were reading "exit 0" when ffmpeg never actually returned anything, then classifyShellExitCode couldn't match 0 and the classification fell through to CodeHandlerFailed → FailurePackBug, minting a misleading "file a helmdeck issue" URL for what was really an infrastructural failure. The audit's list also covered: no post-encode file existence check (ffmpeg can exit 0 yet produce a 0-byte mp4 on malformed input), no PNG magic-byte check (a >=1024-byte file can still be corrupt placeholder content), probeAudioDuration silently accepting NaN/Inf/0 (locale-affected ffprobe or upstream LLM garbage), generateSilence having no post-write stat, an ElevenLabs {"error":"..."} body wrapped in HTTP 200 being treated as valid audio bytes, and padSlideAudioToMin having zero test coverage on its 4-step pipeline.
The fix is structural — two batched commits: (A) honest error messages on transport errors (lines 587/616 split into err != nil branch with "docker-exec transport error (ffmpeg did NOT return a real exit code):" and Cause: wrapping so callers can errors.As), a new requireNonEmptyOutput(ctx, ec, path, minBytes, label) helper that stats produced files via the same wc -c < FILE pattern validateMarpPngs already uses and is called after every per-segment encode AND after the concat output, and a PNG-magic-byte extension to validateMarpPngs that reads the first 8 bytes via head -c 8 | od -An -tx1 and compares against pngMagicHex (89504e470d0a1a0a) so corrupt-but-larger-than-floor placeholder content surfaces with the same Mermaid hint. (B) probeAudioDuration rejects NaN, ±Inf, and dur <= 0 with math.IsNaN / math.IsInf after ParseFloat, generateSilence calls requireNonEmptyOutput after exit 0 with a 256-byte floor for libmp3lame's ID3v2-plus-frame overhead, elevenLabsTTS validates the HTTP-200 body via a new validateElevenLabsBody helper (extracted so the logic is unit-testable without an HTTP stub — elevenLabsBaseURL is a const) that rejects bodies under minTTSResponseBytes (512) or that fail the looksLikeMP3 sniff (accepts MPEG-1/2 Layer III sync words 0xFF 0xFB/0xFA/0xF3/0xF2 and the ID3v2 tag header). What about the cross-pack pattern? The audit found hyperframes.render already does if len(videoBytes) == 0 (line 330) and slides.render does if len(res.Stdout) == 0 (line 264) — both correct for their shape (already-loaded byte slices vs. files on disk). The new requireNonEmptyOutput is specifically for stat-after-write scenarios; it didn't make sense to retrofit packs whose checks are already correct in mechanism. Tests: 24 new tests + 1 stub expansion across both commits.
Commit A: TestValidateMarpPngs_BadPngMagic_ReturnsInvalidInputNamingSlide (slide-2 corrupt magic → CodeInvalidInput naming slide and explaining the signature mismatch), TestSlidesNarrate_SegmentTransportError_HonestMessage (asserts the message must NOT contain "exit 0" on transport error, MUST contain "transport error" and "did NOT return a real exit code"), TestSlidesNarrate_SegmentExitZeroEmptyOutput_PostCheckFires (ffmpeg exit 0 + empty .mp4 surfaces at SEGMENT step, not later at concat), TestSlidesNarrate_ConcatTransportError_HonestMessage (mirror of the segment-path test for the concat step), TestRequireNonEmptyOutput_* (3 direct unit tests on the helper — healthy/missing/below-floor). Commit B: TestProbeAudioDuration_RejectsNaN / _RejectsInfinity / _RejectsNonPositive (table-driven), TestProbeAudioDuration_AcceptsPositiveFloat (happy baseline), TestGenerateSilence_PostCheckCatches0Byte (ffmpeg exit 0 + empty silence file surfaces an error), TestLooksLikeMP3_Identifies (9 sub-cases: MP3 sync variants, ID3v2, JSON envelope, empty, garbage, wrong second-byte mask), TestValidateElevenLabsBody (5 sub-cases: healthy/JSON-error/empty/under-floor/HTML-error-page), TestPadSlideAudioToMin_HappyPath / _NoOpWhenDeficitNegligible / _StopsOnMidStepFailure (closes the audit-flagged zero coverage on the 4-step pipeline). fakeMP3 expanded from 6 bytes to 1026 bytes of valid MP3 prefix + zero padding so existing TTS tests still pass minTTSResponseBytes. Two existing tests (TestSlidesNarrate_FfmpegConcatFailure, TestSlidesNarrate_FfmpegSegmentFailure_FullStderrSurfaced) updated to also script a healthy head -c 8 magic response so they reach their targeted ffmpeg failure paths past the new validation gates. All 1634 internal tests pass.
What this means for the originally-reported failure (run_b0aacfabb479f5f3, segment 4 with Mermaid sequence diagram): if the failure was a transport error, the message is now honest about it ("ffmpeg segment 4: docker-exec transport error (ffmpeg did NOT return a real exit code):") instead of misleading "exit 0". If the failure was ffmpeg-exit-0-but-no-output, the new post-encode check surfaces "ffmpeg segment 4: produced only 0 bytes (below the 1024-byte floor)" — operators see the encode produced nothing and the error names the actual cause. If the failure was Mermaid producing a corrupt but >=1024-byte PNG, the magic-byte check intercepts it at validation with the same caller-fixable "edit slide N" hint. The three paths are now distinguishable from each other and from real pack bugs.
- slides.narrate validates each marp-rendered PNG BEFORE handing it to ffmpeg — silent marp render failures (Mermaid blocks, custom HTML, broken fenced YAML) now surface as caller_fixable: slide N produced no rendered PNG instead of the misleading pack_bug: ffmpeg segment N failed (exit 0). Operator-reported: a live builtin.repo-presentation run failed at the slides.narrate step with handler_failed: ffmpeg segment 3 failed (exit 0), which the gateway classifier routed to failure_class: pack_bug and minted an auto-generated "file a helmdeck issue" URL. The smoking gun: ffmpeg exited 0 (success) yet the handler returned a failure — because the Exec wrapper observed a transport-layer error on what was nominally a successful segment, OR (more commonly) ffmpeg "succeeded" on a malformed input PNG and produced a zero-byte segment file. Either path pointed operators at a non-existent pack bug instead of the actual problem: the slide markdown contained an embedded block — in the reported case a flowchart LR Mermaid diagram — that marp's headless Chromium silently failed to render, leaving an empty or near-empty PNG for that slide. The bug class is structural: marp returns exit 0 from --images png even when individual slides render to nothing, so the existing exit-code check at the marp call site (line 402) cannot catch per-slide render failures. The fix is a pre-flight validateMarpPngs pass after marp succeeds and before the per-segment ffmpeg loop. For each expected slide PNG (/tmp/slides/deck.NNN.png, 1-based per marp's convention), the file size is read with wc -c < FILE via the same shell-exec pattern fs.read uses (fs_packs.go:140-151). Two failure cases surface as CodeInvalidInput (which classify.go maps to FailureCallerFixable): (1) wc exits non-zero → file missing entirely → "slide N produced no rendered PNG (marp exited 0 but the expected output file is missing). Most common cause: an embedded block marp's headless Chromium can't render — a Mermaid diagram (flowchart, sequenceDiagram), custom HTML with broken CSS, or a fenced YAML that confuses the parser. Edit slide N's markdown to remove or simplify the offending block, then re-run."; (2) wc returns under minRenderedSlidePngBytes (1024) → file is the marp-blank signature → "slide N's rendered PNG is only X bytes (below the 1024-byte floor), which is the signature of a silent marp render failure …". A transport-layer error on the stat call surfaces as CodeHandlerFailed (NOT CodeInvalidInput) because the caller's input is fine and the failure is infrastructural — same defense-in-depth posture as the rest of the handler. Threshold reasoning: 1024 bytes is well below any real rendered slide (the smallest sensible solid-color 1920×1080 PNG is several KB after deflate overhead even with maximal compression) and well above the few hundred bytes marp's blank-output mode produces, so the floor is safe in both directions — no false positives on legitimately sparse slides, no false negatives on tiny garbled output. The existing per-segment ffmpeg error path is unchanged: if a PNG passes validation but ffmpeg still fails downstream, the operator gets the original ffmpeg-segment-failed message (the pre-flight check is additive, not a replacement).
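The three-way classification of a per-slide size check (transport error, missing file, below-floor file) might look roughly like this. All type and function names here are hypothetical; statResult stands in for whatever the shell-exec wrapper returns for the wc -c call, and the message text is abbreviated.

```go
package main

import (
	"errors"
	"fmt"
)

const minRenderedSlidePngBytes = 1024

type failureKind string

const (
	kindInvalidInput  failureKind = "invalid_input"  // caller can fix the slide
	kindHandlerFailed failureKind = "handler_failed" // infrastructural failure
)

type slideError struct {
	kind failureKind
	msg  string
}

func (e *slideError) Error() string { return string(e.kind) + ": " + e.msg }

// statResult is a hypothetical stand-in for the outcome of the size stat.
type statResult struct {
	transportErr error // the exec call itself failed
	exitCode     int   // non-zero when the file is missing
	sizeBytes    int
}

var errDisconnected = errors.New("session: not found")

// checkSlidePng classifies one slide: transport errors are infrastructural,
// while missing or tiny files point back at the slide's markdown.
func checkSlidePng(slide int, st statResult) error {
	switch {
	case st.transportErr != nil:
		return &slideError{kindHandlerFailed, fmt.Sprintf("slide %d: stat transport error: %v", slide, st.transportErr)}
	case st.exitCode != 0:
		return &slideError{kindInvalidInput, fmt.Sprintf("slide %d produced no rendered PNG", slide)}
	case st.sizeBytes < minRenderedSlidePngBytes:
		return &slideError{kindInvalidInput, fmt.Sprintf("slide %d's rendered PNG is only %d bytes (below the %d-byte floor)", slide, st.sizeBytes, minRenderedSlidePngBytes)}
	}
	return nil
}

// kindOf extracts the classification, or "" when err is nil or foreign.
func kindOf(err error) failureKind {
	var se *slideError
	if errors.As(err, &se) {
		return se.kind
	}
	return ""
}

func main() {
	fmt.Println(checkSlidePng(1, statResult{sizeBytes: 4096}))
	fmt.Println(checkSlidePng(2, statResult{sizeBytes: 256}))
	fmt.Println(checkSlidePng(3, statResult{exitCode: 1}))
	fmt.Println(checkSlidePng(4, statResult{transportErr: errDisconnected}))
}
```

Checking the transport error first matters: when the exec call itself failed, the exit code and size are zero values and say nothing about the slide.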
Tests: 5 new tests in slides_narrate_test.go — TestValidateMarpPngs_AllHealthy_NoError (3 slides all >=1024 bytes pass, 3 wc -c calls observed), TestValidateMarpPngs_MissingFile_ReturnsInvalidInputNamingSlide (slide 3 missing → CodeInvalidInput with "slide 3" in message + "Mermaid" hint; loop stops at first failure, so slide 4 is not statted), TestValidateMarpPngs_TinyFile_ReturnsInvalidInputWithSize (slide 2 at 256 bytes → CodeInvalidInput with "slide 2" and "256 bytes" both surfaced so operators can sanity-check), TestValidateMarpPngs_AtFloor_Passes (boundary test: exactly 1024 bytes passes, catches < vs <= off-by-one regressions), TestValidateMarpPngs_TransportError_ReturnsHandlerFailed (an Exec error returns CodeHandlerFailed, not CodeInvalidInput — input may be fine, failure is infrastructural). Two existing tests (TestSlidesNarrate_FfmpegConcatFailure, TestSlidesNarrate_FfmpegSegmentFailure_FullStderrSurfaced) updated to also script a healthy wc -c response so they reach their targeted ffmpeg failure paths instead of stopping at the new pre-flight gate. All 1600 internal tests pass. What this means for the originally-reported failure: the slide-3 Mermaid flowchart LR block would now stop the run with failure_class: caller_fixable and message "slide 3 produced no rendered PNG (marp exited 0 but the expected output file is missing). Most common cause: an embedded block marp's headless Chromium can't render — a Mermaid diagram…" — the operator gets the exact slide to edit, without burning ElevenLabs TTS credits and ~30s of ffmpeg work for the misleading-bug-report outcome. Out of scope: rendering Mermaid blocks ahead of marp (a marp-cli --engine plugin), or marp-side per-slide error reporting (would require an upstream marp change). Both are valid follow-ups but orthogonal to surfacing the failure honestly.
- Control-plane image builds the web bundle inside a Node Docker stage — eliminates the recurring "blank page after rebuild" failure mode where web/dist/index.html references bundle hashes that aren't in the image. Operator-observed pattern (visible twice in the local stash history as local web/dist rebuild entries): after a docker rebuild, the Management UI loads / but renders blank, because the embedded index.html references /assets/index-XXX.js paths that don't exist in the image. The 801-byte HTML returned for every URL was the SPA fallback serving index.html for any unknown path — so the browser tried to execute HTML as JavaScript and silently failed. Root cause: only web/dist/index.html was tracked in git (the placeholder mentioned in web/embed.go); the matching web/dist/assets/*.js, *.css were always untracked. The Dockerfile did not run npm run build — it just ran COPY web ./web from the host. So the image's embedded bundle = whatever happened to be on the developer's host at build time, and EVERY drift between the committed HTML and the local assets (a stale checkout, a pulled main, a git stash of a local rebuild, a git checkout reverting index.html while leaving local assets/* untouched) produced a broken image. The bug class is structural — not a one-off — which is why the fix is structural too. Fix shape: add a Node web-build stage to deploy/docker/control-plane.Dockerfile that runs npm ci && npm run build inside the image, producing a self-consistent web/dist/{index.html,assets/*}. The Go stage then runs COPY --from=web-build /web/dist ./web/dist so the embedded HTML and the embedded assets are co-generated from the SAME source tree in the SAME build — byte-for-byte consistent by construction. .dockerignore adds web/dist/ so the host's local dist is never copied into the build context, removing the only path by which host-side drift could leak in. web/dist/index.html is now a tiny stable stub (no asset references, no hashed filenames) that exists solely to satisfy //go:embed all:dist during host-side go build / IDE compilation / go test. The stub is byte-stable so it can't drift; a local npm run build overwrites it for browser-facing dev.
Verified end-to-end: built dockerfile-fix-test image from this Dockerfile, ran it on port 3099 — GET / returned index.html referencing index-4AsUCFtK.js (fresh hash from the in-image Node build, distinct from whatever was on the host), and GET /assets/index-4AsUCFtK.js returned 215097 bytes of text/javascript — proving the image's HTML and assets are bound together, not subject to any host-side dependency. What this fixes long-term: no developer needs to remember to cd web && npm run build before docker build; no contributor needs to know which files to commit and which are gitignored; CI builds and local builds produce equivalent images; the recurring "rebuild + blank page" footgun is removed from the dev loop. web/embed.go's doc comment updated to describe the two-source flow (production = web-build stage; host = stub or local npm build). Out of scope: trimming web/dist/index.html from the git index (would require all contributors with stale local dist to do a clean checkout — separate housekeeping PR).
- Pipeline-run single-flight coalescing — duplicate concurrent pipeline-run requests with the same (caller, pipeline id, inputs) no longer spawn a second identical execution. Operator-observed: some LLM clients time out on a long-running pipeline-run call (multi-minute pipelines like slides.narrate, *-video, research.deep) before the underlying pipeline finishes, then RETRY the same call thinking the original failed. The original run was still in-flight; the retry happily started a SECOND identical run. With pipelines like slides.narrate — which we JUST fixed (PR #390) to encode within an 8g memory cap by capping ffmpeg threads to 4 and adaptive-retrying OOM segments — two concurrent runs against the same memory budget reliably OOM both, defeating the single-run fix. The shape was wrong: internal/mcp/pipelines.go (line 146, case pipeline-run) called s.pipelines.StartRun(ctx, a.ID, a.Inputs) and returned the new run_id unconditionally — no fingerprint check, no in-flight-duplicate detection. The fix is single-flight coalescing at the StartRun boundary, not rejection: when an identical in-flight run already exists, the new caller gets back the ORIGINAL run's run_id plus coalesced: true. The caller's next pipeline-run-status poll works against the real run instead of spawning a duplicate execution. Fingerprint = sha256(caller || pipeline_id || canonical_json(inputs)). The canonicalization is deterministic across JSON whitespace and object-key ordering — two callers POSTing the same logical inputs with different formatting (one minified, one pretty-printed; one with keys in declaration order, one alphabetized) coalesce together. Empty inputs normalize to null so empty-body POSTs coalesce with each other. Migration 0008_pipeline_run_fingerprint.sql adds caller TEXT NOT NULL DEFAULT '' + fingerprint TEXT NOT NULL DEFAULT '' columns to pipeline_runs plus a partial unique index WHERE fingerprint <> '' AND status IN ('pending','running').
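One way to get a fingerprint that is stable across whitespace and key order is to decode the inputs and re-encode them with encoding/json, which emits map keys in sorted order. The sketch below is an illustration of that idea, not the runner's code: the function names, the NUL separator between fields, and the use of UseNumber to keep number literals intact are all assumptions.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// canonicalJSON re-encodes raw JSON so formatting and key order no longer
// matter. Empty input normalizes to null, as the entry describes.
func canonicalJSON(raw []byte) ([]byte, error) {
	if len(bytes.TrimSpace(raw)) == 0 {
		return []byte("null"), nil
	}
	dec := json.NewDecoder(bytes.NewReader(raw))
	dec.UseNumber() // keep number literals as written instead of float64
	var v interface{}
	if err := dec.Decode(&v); err != nil {
		return nil, err
	}
	return json.Marshal(v) // map keys are emitted in sorted order
}

// runFingerprint hashes caller, pipeline id, and canonical inputs. The NUL
// separator is an assumption of this sketch; it keeps ("ab","c") and
// ("a","bc") from colliding.
func runFingerprint(caller, pipelineID string, inputs []byte) (string, error) {
	canon, err := canonicalJSON(inputs)
	if err != nil {
		return "", err
	}
	h := sha256.New()
	for _, part := range [][]byte{[]byte(caller), []byte(pipelineID), canon} {
		h.Write(part)
		h.Write([]byte{0})
	}
	return hex.EncodeToString(h.Sum(nil)), nil
}

// mustFP is a convenience wrapper for the demo below.
func mustFP(caller, pipelineID, inputs string) string {
	fp, err := runFingerprint(caller, pipelineID, []byte(inputs))
	if err != nil {
		panic(err)
	}
	return fp
}

func main() {
	a := mustFP("openclaw", "p1", `{"x":1,"y":{"b":2,"a":3}}`)
	b := mustFP("openclaw", "p1", "{ \"y\": {\"a\": 3, \"b\": 2},\n  \"x\": 1 }")
	fmt.Println(a == b)
	fmt.Println(a == mustFP("other", "p1", `{"x":1,"y":{"b":2,"a":3}}`))
}
```

The two differently formatted payloads hash identically, while changing the caller produces a different fingerprint. Note that this approach treats 1 and 1.0 as different inputs, which may or may not match the real canonicalization.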
Both columns are additive with safe defaults so a downgrade-to-prev-binary still reads old rows (the empty-fingerprint legacy rows are excluded from uniqueness). Concurrency guard: a new startMu sync.Mutex on runRegistry serializes the (fingerprint-lookup, INSERT) critical section so two goroutines racing with identical fingerprints can't both miss the lookup and insert duplicates. The partial unique index is the belt (DB-level guarantee against multi-process races, e.g. two control-plane replicas); the mutex is the suspenders (turns the constraint violation into a clean coalesced=true return). When the INSERT does collide despite the mutex (only possible across replicas), StartRun re-resolves the fingerprint and returns the winner instead of surfacing a UNIQUE error. What does NOT coalesce: different caller, different pipeline id, different inputs (all 3 are in the fingerprint); a terminal run (the lookup filters status IN ('pending','running'), so a finished run never coalesces a fresh request onto a stale result). Rerun gets the dedup for free since it delegates to StartRun — an operator who spam-clicks "Rerun" gets the existing in-flight back, not 5 duplicate runs. Surface change: Runner.StartRun and Runner.Rerun signatures gain a coalesced bool return; internal/mcp.PipelineService.StartRun / Rerun same; MCP pipeline-run / pipeline-rerun responses gain a coalesced field; REST POST /api/v1/pipelines/{id}/run and /runs/{runId}/rerun responses gain the same field. Existing callers that don't read coalesced see byte-identical run-status semantics — they still poll, the run-status flow is unchanged. MCP tool descriptions updated so LLM clients learn that coalesced: true is NOT an error and should be polled like any other run_id.
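The in-process half of the guard (the mutex around lookup plus insert) can be sketched with an in-memory slice standing in for the pipeline_runs table. Everything here is a simplified assumption: the run struct, the status strings other than pending/running, and the id scheme are hypothetical, and the database-level unique index is not modelled.

```go
package main

import (
	"fmt"
	"sync"
)

// run is a hypothetical in-memory stand-in for a pipeline_runs row.
type run struct{ id, fingerprint, status string }

type runRegistry struct {
	startMu sync.Mutex
	runs    []*run
}

// startRun returns the id of an identical in-flight run when one exists
// (coalesced=true); otherwise it records a new pending run.
func (r *runRegistry) startRun(fingerprint string) (runID string, coalesced bool) {
	r.startMu.Lock()
	defer r.startMu.Unlock()
	for _, existing := range r.runs {
		inFlight := existing.status == "pending" || existing.status == "running"
		if existing.fingerprint == fingerprint && inFlight {
			return existing.id, true
		}
	}
	fresh := &run{id: fmt.Sprintf("run_%d", len(r.runs)+1), fingerprint: fingerprint, status: "pending"}
	r.runs = append(r.runs, fresh)
	return fresh.id, false
}

// finish marks a run terminal so later requests no longer coalesce onto it.
func (r *runRegistry) finish(runID string) {
	r.startMu.Lock()
	defer r.startMu.Unlock()
	for _, existing := range r.runs {
		if existing.id == runID {
			existing.status = "succeeded"
		}
	}
}

// raceStarts fires n identical requests at once and counts the outcomes.
func raceStarts(r *runRegistry, fingerprint string, n int) (created, coalesced int) {
	var wg sync.WaitGroup
	var mu sync.Mutex
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_, c := r.startRun(fingerprint)
			mu.Lock()
			defer mu.Unlock()
			if c {
				coalesced++
			} else {
				created++
			}
		}()
	}
	wg.Wait()
	return created, coalesced
}

func main() {
	r := &runRegistry{}
	created, coalesced := raceStarts(r, "fp-1", 8)
	fmt.Println(created, coalesced)
	r.finish(r.runs[0].id)
	fmt.Println(r.startRun("fp-1"))
}
```

Holding the lock across both the lookup and the insert is what closes the race window; locking only the insert would still let two goroutines miss the lookup together.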
Tests: 4 new tests ininternal/pipelines/runner_test.go—TestComputeRunFingerprint_StableAndDistinct(8 sub-cases: identical / reordered-keys / whitespace / nested-reorder all coalesce; different-caller / different-pipeline / different-input / empty-vs-empty all distinguish correctly),TestRunner_StartRun_CoalescesIdenticalInFlight(4 sub-assertions: first call non-coalesced + 3 duplicates with whitespace-normalized inputs coalesce, different-caller spawns fresh, different-inputs spawns fresh),TestRunner_StartRun_DoesNotCoalesceOntoTerminalRun(the regression guard — a finished run must NOT coalesce a fresh request onto stale results),TestRunner_StartRun_ConcurrentIdenticalCalls(the race-window guard — N=8 goroutines fire simultaneously, exactly 1 run exists in the store, N-1 callers seecoalesced=true). All 1595 internal tests pass. Architecture note: this is "single-flight at the API boundary, not the runner" — the runner still has exactly one execution per run; we just dedupe the creation of new runs when an identical one is already running. Friendlier than rejection (the retrying client gets a useful run_id to poll) and friendlier than back-pressure (no wait, no holding the connection). -
slides.narrateffmpeg thread cap (4) + adaptive retry on OOM-killed segments (degraded encoder settings). Operator reported that different LLM models produced decks of variable visual complexity, and the dense ones OOMed even atHELMDECK_SLIDES_NARRATE_MEMORY_LIMIT=8g— but the sparse ones didn't. Same memory budget, same slide count, same resolution; only the per-frame encoder working set varied. The root cause: the per-segment ffmpeg command had no-threadsflag, so libx264 grabbed every host core (12 on a typical workstation), and each thread holds ~50-80 MB of frame buffers at 1080p. That's ~800 MB of encoder state before reference frames, lookahead, and Chromium's resident set — and the marginal slide that pushes encoder peak over budget OOMs every time. Two fixes: (1) add an explicit-threads Nto the per-segment ffmpeg command. Default N=4 — cuts peak by ~3× at the cost of ~20% wall-clock per segment, which is negligible against the wins. New env varHELMDECK_SLIDES_NARRATE_FFMPEG_THREADSlets operators with abundant RAM bump higher, or hosts with tight RAM drop to 1-2. Same operator-tunable idiom asHELMDECK_SLIDES_NARRATE_MEMORY_LIMIT. ADR 045 stays in place —CPUProfile=ProfileComputestill scales the container's CPU quota with host cores; this cap is narrowly about the encoder thread count, not CPU allocation. (2) Adaptive retry onCodeResourceExhausted: if a per-segment encode returns exit 137 (OOM-classified byclassifyShellExitCode), the handler retries that ONE segment with-threads 1 -preset veryfast— combination cuts encoder memory roughly in half versus the primary attempt at the cost of a small bitrate-efficiency hit (CRF 23 still looks fine; the difference isn't visual artifacts). Retry is bounded to one attempt per segment; if both OOM, the handler surfacesCodeResourceExhaustedso the operator can bumpMemoryLimitand rerun. Retry logs at WARN level so post-mortems show when degraded encoding fired and on which segment. 
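The thread cap and the one-shot degraded retry described above can be sketched like this. It is a simplified model under stated assumptions: `ffmpegThreads`, `encodeSegment`, `encodeFunc`, and `errResourceExhausted` are hypothetical names, the encoder is abstracted as a function returning an exit code, and treating non-positive values as garbage is a guess at the real helper's behavior.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
)

const (
	defaultFfmpegThreads = 4   // default cap named in the entry above
	exitOOMKilled        = 137 // 128 + SIGKILL(9)
)

// errResourceExhausted stands in for packs.CodeResourceExhausted.
var errResourceExhausted = errors.New("resource_exhausted")

// ffmpegThreads parses the raw env value; unset or garbage falls back to
// the default. Rejecting values below 1 is an assumption of this sketch.
func ffmpegThreads(raw string) int {
	n, err := strconv.Atoi(raw)
	if err != nil || n < 1 {
		return defaultFfmpegThreads
	}
	return n
}

// encodeFunc abstracts one per-segment ffmpeg run and returns its exit code.
type encodeFunc func(args []string) int

// encodeSegment tries the primary settings once, then the degraded settings
// once if the primary was OOM-killed. It never makes a third attempt.
func encodeSegment(run encodeFunc, threads int) (attempts int, err error) {
	primary := []string{"-threads", strconv.Itoa(threads)}
	degraded := []string{"-threads", "1", "-preset", "veryfast"}
	for _, args := range [][]string{primary, degraded} {
		attempts++
		switch code := run(args); code {
		case 0:
			return attempts, nil
		case exitOOMKilled:
			continue // fall through to the degraded attempt, if any is left
		default:
			return attempts, fmt.Errorf("ffmpeg segment failed (exit %d)", code)
		}
	}
	return attempts, errResourceExhausted
}

func main() {
	// Simulated encoder: OOMs on the primary flags, succeeds on the degraded ones.
	oomUnlessDegraded := func(args []string) int {
		if len(args) == 4 {
			return 0
		}
		return exitOOMKilled
	}
	attempts, err := encodeSegment(oomUnlessDegraded, ffmpegThreads(""))
	fmt.Println(attempts, err) // prints: 2 <nil>
}
```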
Architecture note: this is "smart resource management without going to Kubernetes": Docker compose stays, and the operator gets two tunable knobs (memory cap + thread cap) plus automatic per-segment degradation. A future PR may add GPU/NVENC swap when the sidecar exposes /dev/nvidia* (filed as a follow-up issue), which would eliminate the memory wall entirely on GPU-equipped hosts. Tests: 3 new helper tests (TestSlidesNarrateFfmpegThreads_DefaultWhenEnvUnset / _OverrideHonored / _GarbageFallsThroughToDefault), 1 new retry-success test (TestSlidesNarrate_AdaptiveRetryOnOOM: primary returns exit 137, retry returns 0, asserts the retry carries the degraded flags AND the primary does NOT), 1 new retry-fails-too test (TestSlidesNarrate_DoubleOOMSurfacesCodeResourceExhausted: both attempts OOM, asserts exactly 2 attempts, no third escalation, returns CodeResourceExhausted). All 1583 internal tests pass. -
Closed-set classifier coerced PR #379 and PR #381 typed codes to
internal(silent regression in both prior PRs); ElevenLabs precheck now uses/v1/voices(scope-matched). Two surgical fixes caught while a livebuiltin.repo-presentationrun failed atslides.narratewithstep "narrate": internal: credential_invalid: ElevenLabs rejected the stored API key (401): "The API key you used is missing the permission user_read"—failure_class: pack_bug. Two bugs in one error message: (A)internal/packs/classify.go:14-23definesvalidCodes, the closed-set the engine's middleware uses to gate handler return codes. PR #379 addedCodeResourceExhaustedand PR #381 addedCodeCredentialInvalidtointernal/packs/errors.go, but neither added the new codes tovalidCodes. Result:Classify()lines 60-65 walked the chain, saw the handler returned a*PackErrorwith a code NOT in the set, and silently coerced toCodeInternal. The pipeline-level classifier (internal/pipelines/classify.go) then mappedCodeInternal → FailurePackBug, minting a bogus "file an issue" URL for what was actually a resource/credential issue. Both prior PRs shipped non-functional in the wire envelope — the inner message was right, the outer code wasinternal,failure_classwaspack_bug. (B)vault.ValidateElevenLabs(PR #381) calledGET /v1/userwhich requires theuser_readElevenLabs scope. But ElevenLabs scopes are independent —text_to_speech,voices_read,user_read,history_readare granted separately. A production-shaped key minted with justtext_to_speech+voices_readcan do every TTS operationslides.narrateneeds but 401s against/v1/user. The precheck therefore blocked working keys with a scope-mismatch false-positive. Fix A: addCodeResourceExhaustedandCodeCredentialInvalidto thevalidCodesmap. 2 lines + a doc comment naming the regression so future code additions don't repeat the omission. After this, OOM-killed ffmpeg surfaces asfailure_class: transient(PR #379's intent), and a rejected ElevenLabs credential surfaces asfailure_class: caller_fixablewith the "update the vault" reason (PR #381's intent). 
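Bug A is easy to reproduce in miniature. The sketch below models a closed-set gate in the spirit of `validCodes` and `Classify()`; the types are simplified, and the wire strings for `CodeHandlerFailed` and `CodeResourceExhausted` are assumptions (only `internal`, `invalid_input`, and `credential_invalid` appear in the error text quoted above).

```go
package main

import (
	"errors"
	"fmt"
)

type ErrorCode string

const (
	CodeInternal          ErrorCode = "internal"
	CodeInvalidInput      ErrorCode = "invalid_input"
	CodeHandlerFailed     ErrorCode = "handler_failed"     // wire string assumed
	CodeResourceExhausted ErrorCode = "resource_exhausted" // wire string assumed
	CodeCredentialInvalid ErrorCode = "credential_invalid"
)

// PackError is a simplified stand-in for packs.PackError.
type PackError struct {
	Code    ErrorCode
	Message string
}

func (e *PackError) Error() string { return string(e.Code) + ": " + e.Message }

// validCodes is the closed set. A new code that is declared above but not
// listed here gets coerced to CodeInternal, which is the regression.
var validCodes = map[ErrorCode]bool{
	CodeInternal:          true,
	CodeInvalidInput:      true,
	CodeHandlerFailed:     true,
	CodeResourceExhausted: true,
	CodeCredentialInvalid: true,
}

// Classify walks the error chain and keeps a typed code only if it is in the set.
func Classify(err error) *PackError {
	if err == nil {
		return nil
	}
	var pe *PackError
	if errors.As(err, &pe) && validCodes[pe.Code] {
		return pe
	}
	return &PackError{Code: CodeInternal, Message: err.Error()}
}

func main() {
	wrapped := fmt.Errorf("step narrate: %w", &PackError{Code: CodeCredentialInvalid, Message: "key rejected (401)"})
	fmt.Println(Classify(wrapped).Code)                            // prints: credential_invalid
	fmt.Println(Classify(&PackError{Code: "brand_new_code"}).Code) // prints: internal
}
```

The second call shows the coercion: the inner message survives, but the outer code becomes `internal`, which is what both earlier PRs hit.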
Fix B: switch the precheck endpoint from GET /v1/user to GET /v1/voices. That endpoint needs only the voices_read scope, and slides.narrate's own pickRandomVoice path already calls it, so a key that passes the precheck also passes the handler's own voice-listing call. Updated doc comment explains the scope reasoning so the choice survives the next refactor. Tests: 2 new cases in internal/packs/classify_test.go asserting Classify(&PackError{Code: CodeResourceExhausted}) and Classify(&PackError{Code: CodeCredentialInvalid}) both round-trip unchanged; these would have caught Bug A on their own. internal/vault/validate_test.go's path expectation flipped from /v1/user to /v1/voices with an explanatory error message naming the scope reasoning. All 1578 internal tests pass. -
slides.narrateresolution normalization + video pipelines no longer hardcode aspect_ratio/resolution. Two bugs thehelmdeck-debugskill caught in the same sweep: (1)slides.narrateacceptedresolution: "1080p"per its declared input schema, but passed the value verbatim to ffmpeg'sscale=filter — which rejected it withInvalid size '1080p'. The schema and the handler disagreed on the vocabulary:hyperframes.rendertakes named presets (720p/1080p/4k),slides.narrateonly tookWIDTHxHEIGHT. (2)builtin.html-video,builtin.prompt-video, andbuiltin.prompt-narrated-videoall hardcoded"resolution":"1080p","aspect_ratio":"16:9"in theirhyperframes.render/hyperframes.composestep inputs. A caller passing an HTML composition whose intrinsic dimensions were vertical (1080×1920 for Shorts/TikTok) got back"outputResolution landscape does not match the composition"with no surface area to fix it — the pipelines didn't exposeaspect_ratioas an input at all. Fix 1: newnormalizeSlidesNarrateResolution()helper inslides_narrate.gotranslates named presets toWIDTHxHEIGHTbefore ffmpeg sees them —720p→1280x720,1080p→1920x1080,1440p→2560x1440,2160p/4k→3840x2160. Pre-formatted strings pass through; empty stays empty (caller's downstream default applies); unknown values pass through so ffmpeg surfaces its own "Invalid size" message (silent normalization would mask typos). Case-insensitive, whitespace-tolerant. Fix 2: the 3 video pipelines now thread"resolution":"${{ inputs.resolution }}"and"aspect_ratio":"${{ inputs.aspect_ratio }}"instead of literals. PR #380's resolver drops the fields when the caller omits them, sohyperframes.render/hyperframes.composeuse their own1080p+16:9defaults — zero observable change for current callers. Callers who want vertical (9:16for Shorts/TikTok) or square (1:1) compositions can now pass the value through the pipeline input. 
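The preset table in Fix 1 is small enough to show whole. This is a minimal sketch of the mapping as described, not the shipped `normalizeSlidesNarrateResolution()`; the function name is shortened, and returning unknown values with whitespace trimmed (rather than byte-for-byte) is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeResolution translates named presets to WIDTHxHEIGHT.
// Pre-formatted, empty, and unknown values pass through so that ffmpeg
// can report its own "Invalid size" error for typos.
func normalizeResolution(in string) string {
	trimmed := strings.TrimSpace(in)
	switch strings.ToLower(trimmed) {
	case "720p":
		return "1280x720"
	case "1080p":
		return "1920x1080"
	case "1440p":
		return "2560x1440"
	case "2160p", "4k":
		return "3840x2160"
	}
	return trimmed
}

func main() {
	fmt.Printf("%q %q %q\n",
		normalizeResolution("1080p"),
		normalizeResolution(" 4K "),
		normalizeResolution("1080"))
	// prints: "1920x1080" "3840x2160" "1080"
}
```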
Tests: TestNormalizeSlidesNarrateResolution, table-driven across 12 cases (presets / pre-formatted / empty / unknown / case-insensitive / whitespace); TestVideoPipelines_DoNotHardcodeAspectRatio, a regression guard on the 3 production pipelines. All 1566 internal tests pass. -
Paid-API credential precheck + honest
has_narration(slides.narrate); production narrate pipelines fail-fast on missing/rejected ElevenLabs key. Operator-reported: ran a*-narratepipeline, got back a silent video, output saidhas_narration: true. Three architectural bugs in one shape: (1)slides.narratesethasNarrationfromapiKey != ""BEFORE any provider call — the field was decided on key presence, not call outcome; (2) the per-slide TTS loop fell back to silence on ANY error (including 401/403/quota), so a dead key produced a video that looked narrated according to the output schema but was actually silent throughout; (3) the production pipelinesbuiltin.grounded-narrate,builtin.research-narrate,builtin.repo-presentationall literally hardcodedallow_silent_output: true, which masked the missing credential entirely — a caller asking for "narrate this" got silence with no signal that the credential was the cause. The fix introduces a new typed error codepacks.CodeCredentialInvalid(internal/packs/errors.go) for credentials rejected by an upstream paid API — distinct fromCodeInvalidInput(caller passed bad input — they can fix without touching the vault) andCodeHandlerFailed(pack code misbehaved): the pack ran correctly, the caller's input was structurally fine, the stored credential is dead.classify.gomaps it toFailureCallerFixablewith the actionable reason"The vault-stored API credential this pack needed was rejected by the upstream provider (401/403/quota). The pack itself ran correctly; the credential is dead. Update it via /api/v1/vault/credentials/{id} (PUT) or re-hydrate from your .env.local, then re-run. Retrying with the same key would burn more provider quota for no benefit.".isRetryable=false. New helpervault.ValidateElevenLabs(ctx, hc, apiKey)(internal/vault/validate.go): single GET/v1/useragainst ElevenLabs to confirm the key is accepted before doing expensive work. 
Returnsnilon 200,*packs.PackError{CodeCredentialInvalid}on 401/403/402-quota-exhausted, transient errors on 429/5xx (caller proceeds — per-slide TTS calls have their own fallback path). Signature template reusable for sibling providers (fal.ai, Firecrawl, HeyGen, Runway — see follow-up list).slides.narratewiring: after key resolution, before voice listing + Marp render + LLM YouTube-metadata call, callValidateElevenLabs. OnCodeCredentialInvalidreturn immediately — saves ~$0.01-0.05 in wasted LLM tokens + ~30s of CPU per failed run. On transient error, log warning + proceed. Honesthas_narrationcomputed at return asnarrationRequested && voiceID != "" && narratableSlideCount > 0 && ttsFailureCount == 0. New output fieldtts_failure_countfor diagnostics so an operator can see "I asked for narration, but 3 of 25 slides fell back to silence." Pipeline cleanup:builtin.grounded-narrate,builtin.research-narrate,builtin.repo-presentationchange"allow_silent_output": true→"allow_silent_output": "${{ inputs.allow_silent_output }}". PR #380's resolver drops the field when the caller doesn't pass it, soslides.narrate'sAllowSilentOutputzero-values tofalseand the fail-fast credential check kicks in. Callers who explicitly want silence passallow_silent_output: trueon the run input — the opt-in path still works. Tests: 8 new tests ininternal/vault/validate_test.go(200/401/403/402/429/5xx/empty-key/whitespace-key); 1 new case inclassify_test.goassertingCodeCredentialInvalid → FailureCallerFixable;TestIsRetryableextended; 1 new pipeline-shape testTestNarratePipelines_DoNotHardcodeAllowSilentOutputthat prevents regression on the 3 production pipelines. All 1562 internal tests pass. Out of scope (follow-up PRs that adopt the same pattern):podcast.generate(4 pipeline sites still carry hardcodedallow_silent_output: true; needs its own provider-specific precheck),image_generate(fal.ai),research.deep(Firecrawl),heygen_video,runway_video,slides.outlineLLM metadata calls. 
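The two decisions at the core of this entry, how the precheck reads an HTTP status and how `has_narration` is computed, can be sketched without any network code. The names `classifyPrecheckStatus` and `hasNarration` are hypothetical; the status mapping follows the description above, and treating every status other than 200/401/402/403 as transient is an assumption (the entry only names 429 and 5xx).

```go
package main

import "fmt"

type precheckResult int

const (
	precheckOK precheckResult = iota
	precheckCredentialInvalid
	precheckTransient
)

// classifyPrecheckStatus maps the provider's HTTP status to a verdict.
func classifyPrecheckStatus(status int) precheckResult {
	switch status {
	case 200:
		return precheckOK
	case 401, 402, 403:
		// Rejected key or exhausted quota: fail fast, do not retry.
		return precheckCredentialInvalid
	default:
		// 429/5xx per the entry; other statuses are an assumption here.
		return precheckTransient
	}
}

// hasNarration mirrors the expression quoted in the entry: narration is
// reported only when it was requested and every narratable slide got audio.
func hasNarration(narrationRequested bool, voiceID string, narratableSlideCount, ttsFailureCount int) bool {
	return narrationRequested && voiceID != "" && narratableSlideCount > 0 && ttsFailureCount == 0
}

func main() {
	fmt.Println(classifyPrecheckStatus(401) == precheckCredentialInvalid) // prints: true
	// 3 of 25 slides fell back to silence, so the output must not claim narration.
	fmt.Println(hasNarration(true, "voice-1", 25, 3)) // prints: false
}
```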
Each adoption is ~50 LOC once the validator helper for that provider lands. -
Pipeline template resolver: drop the JSON field when a whole-value inputs.* reference misses (typed-field fix). Surfaced by the helmdeck-debug skill's sweep: builtin.repo-presentation failed at the outline step with field "export_outline": expected boolean, got string, and the skill correctly identified that 6 sibling pipelines reference the same optional booleans the same way (builtin.grounded-deck, builtin.grounded-narrate, builtin.research-deck, builtin.research-narrate, builtin.scrape-deck, builtin.research-ground-deck). PR #377 made the resolver substitute "" for missing top-level inputs.* references. That is correct for string-typed targets but broken for bool/number/array targets, where 7 pipelines pass the input as a whole-value template (e.g. "export_outline": "${{ inputs.export_outline }}"). The receiving pack's JSON decoder rightly rejects an empty string in a bool field. The fix distinguishes whole-value misses from embedded misses: when a ${{ inputs.X }} reference is the ENTIRE value of a JSON field and X isn't supplied, the resolver now drops the field from the output JSON entirely. The receiving pack then sees an absent field and uses its declared zero-value default (false for bool, 0 for number, [] for array, "" for string). Embedded references ("prefix-${{ inputs.x }}-suffix") keep current behavior: they substitute "" because dropping would unhelpfully truncate the surrounding string. Array elements with a missing ref substitute null (preserves indices). Implementation uses a package-private missingRef sentinel that lookupExpr returns and walk() recognizes at the map/array boundary, so it never leaks to JSON. steps.* references stay loud always: a missing steps.X.output.Y indicates a real inter-step wiring bug and the safety net is unchanged. 
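The whole-value versus embedded distinction can be sketched over decoded JSON. This is an illustrative model, not the resolver itself: `resolveValue` and the `missing` type are hypothetical stand-ins for `walk()` and the `missingRef` sentinel, only `inputs.*` references are handled, and `steps.*` references (which fail loudly in the real resolver) are left out.

```go
package main

import (
	"encoding/json"
	"fmt"
	"regexp"
)

var inputRef = regexp.MustCompile(`\$\{\{\s*inputs\.([A-Za-z0-9_]+)\s*\}\}`)

// missing marks a whole-value reference that did not resolve. It is
// consumed at the map/array boundary and never reaches the output.
type missing struct{}

func resolveValue(v any, inputs map[string]any) any {
	switch t := v.(type) {
	case string:
		// Whole-value reference: keep the input's own type, or signal a miss.
		if m := inputRef.FindStringSubmatch(t); m != nil && m[0] == t {
			if val, ok := inputs[m[1]]; ok {
				return val
			}
			return missing{}
		}
		// Embedded reference: a miss becomes the empty string.
		return inputRef.ReplaceAllStringFunc(t, func(ref string) string {
			if val, ok := inputs[inputRef.FindStringSubmatch(ref)[1]]; ok {
				return fmt.Sprint(val)
			}
			return ""
		})
	case map[string]any:
		out := make(map[string]any, len(t))
		for k, child := range t {
			r := resolveValue(child, inputs)
			if _, gone := r.(missing); gone {
				continue // drop the field entirely
			}
			out[k] = r
		}
		return out
	case []any:
		out := make([]any, len(t))
		for i, child := range t {
			r := resolveValue(child, inputs)
			if _, gone := r.(missing); gone {
				r = nil // null keeps the indices stable
			}
			out[i] = r
		}
		return out
	}
	return v
}

func main() {
	tmpl := map[string]any{
		"export_outline": "${{ inputs.export_outline }}",
		"title":          "deck-${{ inputs.name }}-v1",
		"topic":          "${{ inputs.topic }}",
		"voice_ids":      []any{"${{ inputs.voice }}", "b"},
	}
	out, _ := json.Marshal(resolveValue(tmpl, map[string]any{"topic": "go"}))
	fmt.Println(string(out))
	// prints: {"title":"deck--v1","topic":"go","voice_ids":[null,"b"]}
}
```

Because `export_outline` is absent from the output rather than set to `""`, a receiving decoder with a bool field falls back to its zero value instead of rejecting the payload.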
Tests: 4 new tests in template_test.go: TestResolve_MissingInput_WholeValueDropsField (the contract: drop on whole-value miss), TestResolve_MissingInput_EmbeddedKeepsField (substitute "" in embedded position), TestResolve_MissingInput_DropAcrossTypes (the motivating case: export_outline / include_image_prompts / fade_ms / voice_ids all dropped when omitted), TestResolve_MissingInput_ArrayBecomesNull (array indices preserved). The PR #377 test TestResolve_MissingInputDefaultsToEmpty was split into the two new whole-value/embedded tests since its prior conflated assertion no longer holds. TestResolve_MissingStepStillFails still passes: steps.X.output.Y references stay loud. All 1552 internal tests pass. -
slides.narrateOOM-killed ffmpeg now classifies astransient, notpack_bug. Surfaced by thehelmdeck-debugskill's diagnostic sweep:builtin.repo-presentationfailed atslides.narratewithffmpeg segment 9 failed (exit 137)and the gateway classifier emittedfailure_class: pack_bugplus an auto-generated "file an issue" URL — but exit 137 is SIGKILL, which in our sandboxed sessions overwhelmingly means the kernel OOM killer reaped ffmpeg because the per-segment 1080p h264+AAC encode exceeded the cgroup memory limit. That's a resource/environment issue, not a bug in the pack. The fix introduces a new typed error codepacks.CodeResourceExhausted(internal/packs/errors.go) — distinct fromCodeTimeout(deadline expired) andCodeSessionUnavailable(couldn't acquire a session): the session ran fine, the workload was too heavy for the memory/CPU budget.classify.gomaps it toFailureTransientwith the actionable reason"The OS killed a child process for resource reasons (typically OOM — exit 137 / SIGKILL). The pack itself isn't buggy; the workload was too heavy for the session's memory/CPU budget. Bump SessionSpec.MemoryLimit, reduce the job size (fewer slides/segments/pages), or re-run on a host with more memory.".isRetryable()now returns true forCodeResourceExhaustedso the ADR 044 auto-retry loop gives it a shot before surfacing. New shared helperclassifyShellExitCode(exitCode int) (packs.ErrorCode, bool)ininternal/packs/builtin/shell_exit.gois the single source of truth for "what does this exit code from a shelled-out tool mean in a typed way?" — today it lifts exit 137 toCodeResourceExhaustedand returnsok=falsefor everything else (caller falls through toCodeHandlerFailed). Future packs (hyperframes.render, slides.render, scrape_spa, doc_ocr, etc. 
— every shell-out path) can adopt it incrementally; the lore lives in one place instead of being reinvented per-handler.slides.narratewired: both the per-segment ffmpeg encode (line ~429) and the final concat step (line ~452) check the exit code through the helper. When OOM is detected the error message also improves — instead of the generic "ffmpeg segment 9 failed (exit 137): <stderr tail>" the operator sees "ffmpeg segment 9 killed by the OS on exit 137 (likely OOM at 1080p — bump SessionSpec.MemoryLimit, reduce slide count, or lower the encode resolution). stderr: ...". The stderr artifact still lands in the artifact store via the existingpersistFfmpegStderrpath so post-mortem debugging keeps the full ffmpeg output. Tests: 2 new tests inshell_exit_test.go(exit 137 →CodeResourceExhausted; every other code returnsok=false); 1 new case inclassify_test.goassertingCodeResourceExhausted → FailureTransient;TestIsRetryableextended to assert it's in the retryable set. All 1549 internal tests pass. Scope kept tight: today only exit 137 is recognized empirically — adding SIGTERM (143), GNUtimeout(124), etc. is a follow-up decision that now lives in one helper, not 8 handlers. -
isSafeClonePath/safeJoinaccept ADR 040 persistent clone paths (<PersistentReposPath>/<Caller>/...). Surfaced by thehelmdeck-debugskill while running its diagnostic sweep:builtin.repo-presentationfailed withstep "map": invalid_input: clone_path must be an absolute path under /tmp/helmdeck- or /home/helmdeck/work/. The bug was a Class 3 schema-vs-handler drift that the skill correctly identified: ADR 040 wiredrepo.fetchto emitclone_path = <ec.PersistentReposPath>/<ec.Caller>/<hash>(e.g./repos/admin/6d3bd03b49986330) when the persistent repos volume is mounted, butisSafeClonePath(internal/packs/builtin/repo_push.go:324-333) was never updated to accept that prefix family — it still only allowed/tmp/helmdeck-and/home/helmdeck/work/. Every downstream consumer ofrepo.fetch's output (repo.map,repo.push,fs.read/fs.write/fs.list/fs.delete,cmd.run, andcontent.groundwhen reading from a clone) therefore rejected the legitimate fetch output withCodeInvalidInput. The failure_class came out ascaller_fixable— secondary Class 4 misclassification, because the caller (the pipeline definition) had passed the fetch output verbatim and had no recourse to fix it. The fix widensisSafeClonePathto take an*packs.ExecutionContextand accept a third path family:strings.TrimSuffix(ec.PersistentReposPath, "/") + "/" + ec.Caller + "/". The per-caller subdir is required — bare/repos/loose-fileand/repos/other-user/...are still rejected — so the validation continues to enforce the per-caller scoping ADR 040 documents. When ec is nil or persistence is off (PersistentReposPath == ""), behavior is byte-identical to the pre-fix version (pre-ADR-040 callers see no change).safeJoinnow takes the sameecparameter and threads it through. 
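The widened check can be sketched as follows. It is a simplified model under assumptions: `ExecutionContext` is reduced to the two fields named above, the argument order is a guess, and handling traversal by cleaning the path before the prefix comparison is this sketch's choice rather than a confirmed detail of the real function.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// ExecutionContext is a reduced stand-in for packs.ExecutionContext.
type ExecutionContext struct {
	PersistentReposPath string
	Caller              string
}

// isSafeClonePath accepts the two legacy prefix families plus, when
// persistence is on and the caller is known, the per-caller repos subdir.
func isSafeClonePath(ec *ExecutionContext, p string) bool {
	if !strings.HasPrefix(p, "/") {
		return false
	}
	// Cleaning first means "/repos/admin/../other/x" is judged as "/repos/other/x".
	clean := path.Clean(p)
	prefixes := []string{"/tmp/helmdeck-", "/home/helmdeck/work/"}
	if ec != nil && ec.PersistentReposPath != "" && ec.Caller != "" {
		prefixes = append(prefixes,
			strings.TrimSuffix(ec.PersistentReposPath, "/")+"/"+ec.Caller+"/")
	}
	for _, pre := range prefixes {
		if strings.HasPrefix(clean, pre) {
			return true
		}
	}
	return false
}

func main() {
	ec := &ExecutionContext{PersistentReposPath: "/repos", Caller: "admin"}
	fmt.Println(
		isSafeClonePath(ec, "/repos/admin/6d3bd03b49986330"),
		isSafeClonePath(ec, "/repos/other-user/6d3bd03b49986330"),
		isSafeClonePath(nil, "/repos/admin/6d3bd03b49986330"),
	) // prints: true false false
}
```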
Error message is now generated byclonePathRejectMessage(ec)so the rejection text surfaces the actual allowed prefix in this deployment — e.g.clone_path must be an absolute path under /tmp/helmdeck- or /home/helmdeck/work/ (or /repos/admin/ for ADR 040 persistent clones)— instead of a stale hard-coded list. Verification: live re-run ofbuiltin.repo-presentationonhttps://github.com/octocat/Hello-World.gitafter the fix —repo.fetchproduced/repos/admin/35045901fb0127aa,repo.mapconsumed it without rejection, the chain advanced two steps further before failing on an unrelatedslides.outlineinput-shape issue (separately filed). Tests: 3 new tests inrepo_push_test.go—TestIsSafeClonePath_ADR040Persistent(per-caller positive cases + cross-caller / bare-root / traversal negative cases),TestIsSafeClonePath_PersistenceOff(emptyPersistentReposPathkeeps pre-ADR-040 behavior),TestIsSafeClonePath_CallerEmpty(no anonymous fallback to a shared namespace). All 1546 internal tests pass. Surface area touched: 1 function signature change (isSafeClonePath), 1 helper signature change (safeJoin), 8 call sites updated to threadecthrough, 5 hard-coded error messages replaced with the helper. Behavior on pre-ADR-040 paths is preserved bit-for-bit. -
Pipeline template resolver: tolerate missing top-level optional
inputs.*references (resolve to""). The opt-in CTA inputs that landed with theblog.append_ctawiring (project_url,github_url,cta_source_url,cta_copy) failed every real*-rewrite-blogpipeline run unless the caller passed every field explicitly. The validation-suite test fixture sets every input explicitly so unit tests passed, but a live call tobuiltin.brief-rewrite-blogwith onlyproject_url+github_urlset returned:step "cta": unresolved reference "inputs.source_url": no field "source_url". The resolver was designed to fail loud on every unresolved${{ ... }}reference — correct for inter-step wiring, where a missingsteps.X.output.Yindicates a real producer/consumer bug; wrong for pipeline inputs, where callers routinely omit optional fields. The fix scopes leniency tightly: only top-levelinputs.*references that miss → resolve to"". Nested traversal errors (inputs.foo.barwherefooexists butbardoesn't) still fail loud — that surfaces caller-side shape bugs.steps.*.output.*references stay loud always — that's the safety net for inter-step wiring bugs (the high-value catch). Verification: live run ofbuiltin.brief-rewrite-blogonopenrouter/anthropic/claude-haiku-4-5post-fix — the optional CTA inputscta_source_url/source_urlresolved to empty when not passed; the CTA step LLM-rewrote a natural closing section weaving inproject_url+github_urlwithcta_copy("invite the reader to try it and contribute") honored throughout the 3 paragraphs. Tests: 2 new tests intemplate_test.go—TestResolve_MissingInputDefaultsToEmpty(whole-value ref and embedded ref both resolve to empty) andTestResolve_MissingStepStillFails(the safety-net guard for the inter-step path).
Added
-
blog.append_ctapack + opt-in CTA wiring across the*-rewrite-blogpipelines. An external agent driving helmdeck through OpenClaw asked it to "promote this project" viabuiltin.scrape-rewrite-blogand got back well-written articles with zero promotional links and visible[1]/[source]citation markers throughout — the pipeline did its job correctly, but the output shape (blog.rewrite_for_audience's ghostwriter contract +content.ground's verifiability contract) didn't match the user's conversational publication target (dev.to / Medium). New packblog.append_cta(internal/packs/builtin/blog_append_cta.go) closes the CTA half of the gap: when ALL ofsource_url/project_url/github_urlare empty, the pack is a strict no-op that returns markdown unchanged and never calls the dispatcher — so the step can slot into every blog pipeline unconditionally without burning a model call for the common no-CTA path. When at least one link is set, the pack LLM-rewrites a closing CTA section in the article's voice, reusingresolveBlogRewritePersonafromblog.rewrite_for_audienceso a "technical" / "marketing" / "educational" persona threaded through the pipeline locks the voice across both packs. The model is instructed to emit ONLY the closing section; the original article body is appended verbatim in code so the LLM cannot introduce drift. Optionalcta_copyfield lets the caller steer the ask in plain English ("invite contributors", "highlight the free tier"). Wired into four pipelines:builtin.brief-rewrite-blog,builtin.scrape-rewrite-blog,builtin.doc-rewrite-blog, andbuiltin.research-rewrite-blogall gained actastep betweencontent.groundandblog.publish. Optional pipeline inputsproject_url?,github_url?,cta_source_url?,cta_copy?thread through.doc-rewrite-blogusescta_source_urlseparately from its existingsource_url(which is the doc URL) so the CTA stays opt-in — threading the doc URL into the CTA would have fired the LLM on every doc-rewrite-blog run regardless of intent. 
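The no-op contract and the "body appended verbatim in code" rule can be sketched as below. The names `ctaInput`, `rewriteFunc`, and `appendCTA` are hypothetical, the model call is abstracted as a function, and the prompt wording is a placeholder; persona resolution and code-fence unwrapping are left out.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

type ctaInput struct {
	Markdown   string
	SourceURL  string
	ProjectURL string
	GithubURL  string
	CTACopy    string
}

// rewriteFunc abstracts the model call that writes ONLY the closing section.
type rewriteFunc func(prompt string) (string, error)

func appendCTA(in ctaInput, rewrite rewriteFunc) (string, error) {
	if strings.TrimSpace(in.Markdown) == "" {
		return "", errors.New("markdown is required")
	}
	links := strings.TrimSpace(in.SourceURL) + strings.TrimSpace(in.ProjectURL) + strings.TrimSpace(in.GithubURL)
	if links == "" {
		// Strict no-op: return the article unchanged and never call the model.
		return in.Markdown, nil
	}
	prompt := fmt.Sprintf("Write a closing CTA section. Links: %s %s %s. Ask: %s",
		in.SourceURL, in.ProjectURL, in.GithubURL, in.CTACopy)
	closing, err := rewrite(prompt)
	if err != nil {
		return "", err
	}
	closing = strings.TrimSpace(closing)
	if closing == "" {
		return "", errors.New("model returned an empty CTA section")
	}
	// The original body is concatenated here, so the model cannot alter it.
	return in.Markdown + "\n\n" + closing, nil
}

func main() {
	calls := 0
	rewrite := func(string) (string, error) { calls++; return "## Try it", nil }
	out, _ := appendCTA(ctaInput{Markdown: "# Post"}, rewrite)
	fmt.Println(out == "# Post", calls) // prints: true 0
}
```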
Pipeline descriptions tightened ininternal/pipelines/seed.gofor all four blog pipelines: each now explicitly calls out that the output includes inline[1]citations fromcontent.groundand recommends stripping them in post-processing for conversational publication targets. Honors the existing project memory about pipeline descriptions matching the mechanism. Tests: 9 new tests inblog_append_cta_test.go(no-op when no links / no-op when whitespace-only links / appends when project_url set / all 3 links land in prompt / model required when link set / persona matches article voice / code fence unwrapping / empty markdown rejected / empty model response surfaces error). All 4 blog pipelines re-verified throughTestBuiltins_RunEndToEnd. 1541 internal tests pass. Companion blog draft atwebsite/blog/2026-06-02-pipeline-output-shape-vs-publication-target.md(draft: true) frames the broader pattern: pipelines are tight contracts on purpose; multi-action intents need the planner to compose pipeline-run + post-processing rather than asking the pipeline to absorb responsibilities it wasn't designed for. Out of scope: citation stripping (its own pack — the design question is sharper than "remove[N]markers"; footnote / inline-hyperlink / references-list-only are all valid targets), and the planner-asks-user clarifying-question flow ("want deep research?"), which needshelmdeck.planprompt engineering plus a UI surface for asking back. -
Prefix-cache routing for the catalog block in
helmdeck.planandhelmdeck.route(ADR 051 PR #4). ADR 051 PR #2 addedBudget.SupportsPrefixCache+CachedInputCostUSDPerMTokas capability flags on 15 tier entries (Anthropic / OpenAI / Google / DeepSeek native + their OpenRouter relays — the providers whose APIs document prompt-prefix caching). PR #4 wires the flag to message assembly. The cache-defeating mutation: today the catalog block (the largest chunk of input tokens by far — 3KB compacted, 30KB uncompacted on Tier A) lives in the USER message alongside the per-call intent and defaults. Two consecutive calls tohelmdeck.planwith different intents produce different user messages → no shared prefix → no provider cache hit. Anthropic's prompt-prefix cache (50% input discount), Gemini's (75% discount), and DeepSeek's (96.7% discount, the 30× number on V4 Pro) all key on byte-identical message-array prefixes; the moment any byte differs, the cache misses. The fix: when the budget advertisesSupportsPrefixCache, the catalog moves into the SYSTEM prompt. The system prompt then carriesplanSystemPrompt + "\n\nCATALOG (helmdeck routing-guide):\n<full catalog>"— stable across every call for that model, since catalog is global engine policy (not per-caller). Per-call variation (defaults projection + intent + optional context) lives in the user message tail.assemblePlanPrompt(budget, ...)andassembleRoutePrompt(budget, ...)helpers carry the branching logic — whenbudget.SupportsPrefixCacheis true, return(systemWithCatalog, userWithoutCatalog); when false, return(legacySystem, legacyUser). The legacy path is byte-identical to pre-PR-4 dispatches, so behavior on non-caching providers (Tier C fallback, Ollama, Mistral / Grok / Fireworks without the flag) is unchanged. Cascade interaction: when the ADR 050 PR #4 filter cascade fires on a SupportsPrefixCache model (onlyopenrouter/deepseek/deepseek-v4-procarries both flags today), the restricted catalog goes into the system prompt for that call. 
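The branching in the assemble helpers can be sketched as follows. The system-prompt layout follows the format quoted above; the `Budget` type is reduced to one field, the parameter list is a guess (the entry elides it), and the layout of the user message is an assumption.

```go
package main

import "fmt"

// Budget is reduced to the single flag this sketch needs.
type Budget struct {
	SupportsPrefixCache bool
}

const catalogHeader = "CATALOG (helmdeck routing-guide):\n"

// assemblePlanPrompt keeps the system prompt byte-identical across calls
// when the provider caches prefixes, by moving the catalog into it.
func assemblePlanPrompt(b Budget, planSystemPrompt, catalog, intent string) (system, user string) {
	if b.SupportsPrefixCache {
		return planSystemPrompt + "\n\n" + catalogHeader + catalog, "INTENT:\n" + intent
	}
	// Legacy shape: catalog travels in the user message with the intent.
	return planSystemPrompt, catalogHeader + catalog + "\n\nINTENT:\n" + intent
}

func main() {
	b := Budget{SupportsPrefixCache: true}
	sysA, _ := assemblePlanPrompt(b, "You are a planner.", "pack-a\npack-b", "make a deck")
	sysB, _ := assemblePlanPrompt(b, "You are a planner.", "pack-a\npack-b", "write a blog post")
	fmt.Println(sysA == sysB) // prints: true
}
```

With different intents the two system prompts are equal byte for byte, which is the property the provider caches key on.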
The filter pass keeps its own system prompt (filter and planning system prompts have different role instructions; consolidating them is deferred). Tests: 6 new tests across plan_test.go + route_test.go covering three cases: (1) Tier A SupportsPrefixCache=true puts catalog in system message and intent in user message, (2) byte-identical system prompts across two sequential calls with DIFFERENT intents (the cache-hit contract), (3) Tier C fallback without the flag keeps the legacy single-user-message shape. 2 pre-existing tests (TestPlan_TierAModelGetsFullCatalog, TestRoute_TierAModelGetsFullCatalog) updated to assert against the combined system + user text since the catalog lifted out of the user message on Tier A. All 1532 internal tests pass. Completes the ADR 051 4-PR roadmap (PR #5 calibration tooling shipped ahead of #2–#4 to unblock operator self-service). -
Provider-side strict JSON via
response_formatongateway.ChatRequest(ADR 051 PR #3). ADR 051 PR #2 addedBudget.WantsStrictJSONas a capability flag on 14 tier entries (Anthropic/OpenAI/Google native + their OpenRouter relays, Mistral, Grok), but no code read it — the gateway request shape had no field for constrained-decoding mode, so every plan/route call still relied entirely on prompt engineering to ask for JSON. The research synthesis cited in ADR 051 names provider-side strict JSON as the cleanest mitigation for the trailing-prose / markdown-injection failure modes Tier A models occasionally exhibit; it also flags constrained decoding as the wrong mode for quantized open-weight inference (Tier C), where the logit masker can deadlock and emit JSON-shaped garbage. NewResponseFormat stringfield ongateway.ChatRequestwith documented values""(unconstrained — current behavior, zero-diff for callers that don't set it) and"json_object"(provider validates output is syntactically valid JSON). String-based for forward-compat: a future"json_schema"value can be added without touching every adapter. Pack handlers set it fromBudget.WantsStrictJSON; the dispatcher passes it through unchanged so any future gateway client (engine.Execute, integration tests) opts in without touching pack code. Per-provider translation: OpenAI adapter sendsresponse_format: {type: "json_object"}upstream (Mistral, Groq, Fireworks, OpenRouter all shareNewOpenAIProviderso they inherit the translation for free). Gemini adapter setsgenerationConfig.responseMimeType: "application/json". Anthropic uses tool-call structure for strict output and ignores the field. Ollama passes through unconstrained. Unknown ResponseFormat values fall through unconstrained at every adapter so a forward-compat value (e.g."json_schema") rolling out faster than the translator can't break dispatch.helmdeck.plan+helmdeck.routewire-up: both handlers readbudget.WantsStrictJSONand setResponseFormat="json_object"when the flag is set AND the tier is not C. 
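The per-provider translation and the Tier C guard can be sketched together. The request-body shapes match the upstream fields named above; `ChatRequest` is reduced to two fields, and `openAIBody`, `geminiGenerationConfig`, and `responseFormatFor` are hypothetical helper names, with the tier passed as a plain string for brevity.

```go
package main

import "fmt"

// ChatRequest is a reduced stand-in for gateway.ChatRequest.
type ChatRequest struct {
	Model          string
	ResponseFormat string // "" (unconstrained) or "json_object"
}

// openAIBody adds response_format only for the one known value; unknown
// values fall through unconstrained instead of breaking dispatch.
func openAIBody(req ChatRequest) map[string]any {
	body := map[string]any{"model": req.Model}
	if req.ResponseFormat == "json_object" {
		body["response_format"] = map[string]string{"type": "json_object"}
	}
	return body
}

// geminiGenerationConfig expresses the same intent via responseMimeType.
func geminiGenerationConfig(req ChatRequest) map[string]any {
	cfg := map[string]any{}
	if req.ResponseFormat == "json_object" {
		cfg["responseMimeType"] = "application/json"
	}
	return cfg
}

// responseFormatFor applies the handler-side rule: flag set AND tier not C.
func responseFormatFor(wantsStrictJSON bool, tier string) string {
	if wantsStrictJSON && tier != "C" {
		return "json_object"
	}
	return ""
}

func main() {
	fmt.Printf("%q %q\n", responseFormatFor(true, "A"), responseFormatFor(true, "C"))
	// prints: "json_object" ""
}
```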
The Tier C guard is the safety belt the research synthesis explicitly called out: even an admin who manually sets WantsStrictJSON=true on a Tier C fallback entry stays on the prompt-engineered path, because constrained decoding can deadlock there. Tests: 6 per-provider translation tests in internal/gateway/providers_test.go (openai forwards json_object envelope / openai omits response_format on unset / openai ignores unknown values / gemini sets responseMimeType / gemini omits responseMimeType on unset / anthropic ignores silently). 4 new pack-handler tests in plan_test.go + route_test.go (Tier A flips to json_object on a model with WantsStrictJSON=true / Tier C stays empty even when the flag is set on the fallback entry). All 1530 internal tests pass. Sets up PR #4 (prefix-cache-aware two-pass cascade gated on Budget.SupportsPrefixCache). -
Cause-typed empty completions + Budget capability flags (ADR 051 PR #2). ADR 051 PR #1 stripped reasoning-token blocks and consolidated the JSON parser, but every parse failure still surfaced to operators as the same opaque "gateway returned an empty plan response" text regardless of root cause. The research synthesis cited in ADR 051 identifies four distinct causes for empty HTTP-200 completions, each with a different correct response: provider safety filter redaction, length truncation, constrained-decoding deadlock, and connection timeout on hybrid-reasoning models. PR #2 makes the cause inspectable via
errors.Is. New sentinel errors ininternal/packs/builtin/json_response.go:ErrSafetyFiltered,ErrLengthTruncated,ErrConstrainedDeadlock,ErrLikelyTimeout. Each is plainerror(set as theCauseof the returned*packs.PackError). Callers that don't care keep using the existing wrapper; callers that want to bucket telemetry or pick a retry strategy useerrors.Is(perr.Cause, ErrSafetyFiltered)etc. NewDecodeStructuredResponseWithCause(rawBody, finishReason, packName, v)is the cause-typed variant. ReadsfinishReason(whichgateway.ChatResponse.Choices[0].FinishReasonhas been carrying all along — the gateway captures it per provider for theprovider_callsaudit table) and classifies the failure. The existingDecodeStructuredResponsebecomes a backward-compat wrapper that passes an empty finish reason — unchanged behavior on the wire except empty-completion paths now classify asErrLikelyTimeout(preserving the historical message prefix).helmdeck.planandhelmdeck.routewire-up: both handlers now callDecodeStructuredResponseWithCauseand threadchat.Choices[0].FinishReasonthrough. No observable behavior change for callers that don't introspect the Cause; new visibility for those that do.Budgetextended with four capability flags:IsHybridReasoning bool(model emits<think>/<reasoning>blocks — set ono3-mini,claude-3.7-sonnet,claude-opus/sonnetthinking variants,deepseek-v4-pro, the Moonshotkimi-k2/kimi-family).WantsStrictJSON bool(provider supports request-time strict-JSON mode — set on Anthropic / OpenAI / Google native, Mistral, Grok).SupportsPrefixCache bool(provider offers prompt-prefix caching for 2×–30× input-cost discount — set on Anthropic / OpenAI / Google / DeepSeek native + their OpenRouter relays).CachedInputCostUSDPerMTok float64(cached-input rate per million tokens — populated from Artificial Analysis and per-provider pricing pages). 
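A caller that wants to bucket telemetry by cause might use the sentinels roughly like this. The sketch is illustrative only: `PackError` is reduced to two fields, `causeForEmptyCompletion` and `telemetryBucket` are hypothetical names, and the finish-reason strings (`content_filter`, `length`) are the OpenAI-style values, assumed here because the entry does not list the mapping. Detection of the constrained-decoding deadlock is not modeled.

```go
package main

import (
	"errors"
	"fmt"
)

var (
	ErrSafetyFiltered      = errors.New("safety filtered")
	ErrLengthTruncated     = errors.New("length truncated")
	ErrConstrainedDeadlock = errors.New("constrained-decoding deadlock")
	ErrLikelyTimeout       = errors.New("likely timeout")
)

// PackError is a reduced stand-in for packs.PackError.
type PackError struct {
	Message string
	Cause   error
}

func (e *PackError) Error() string { return e.Message }

// causeForEmptyCompletion picks a sentinel from the provider's finish
// reason. An empty or unrecognized reason falls back to ErrLikelyTimeout;
// the unrecognized-reason fallback is an assumption of this sketch.
func causeForEmptyCompletion(finishReason string) error {
	switch finishReason {
	case "content_filter":
		return ErrSafetyFiltered
	case "length":
		return ErrLengthTruncated
	default:
		return ErrLikelyTimeout
	}
}

// telemetryBucket shows the errors.Is pattern on perr.Cause.
func telemetryBucket(perr *PackError) string {
	switch {
	case errors.Is(perr.Cause, ErrSafetyFiltered):
		return "safety_filtered"
	case errors.Is(perr.Cause, ErrLengthTruncated):
		return "length_truncated"
	case errors.Is(perr.Cause, ErrConstrainedDeadlock):
		return "constrained_deadlock"
	case errors.Is(perr.Cause, ErrLikelyTimeout):
		return "likely_timeout"
	default:
		return "other"
	}
}

func main() {
	perr := &PackError{
		Message: "gateway returned an empty plan response",
		Cause:   causeForEmptyCompletion("length"),
	}
	fmt.Println(telemetryBucket(perr)) // prints: length_truncated
}
```

Callers that never inspect `Cause` see the same message as before, which matches the backward-compat wrapper described above.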
Empty defaults on unmapped models are conservative ("we don't know" → don't make affirmative claims). helmdeck://context-budgets MCP resource extended: surfaces is_hybrid_reasoning, wants_strict_json, supports_prefix_cache, cached_input_cost_usd_per_mtok on each entry with omitempty so the resource stays compact for legacy entries while exposing the new flags on entries that carry them. 27 tier entries updated with PR #2 flags following the methodology in docs/howto/calibrate-model-tiers.md: Tier A native APIs get strict-JSON + prefix-cache + a cached rate (Anthropic 1.5/M, OpenAI gpt-5 0.46/M, Gemini 2.5 Pro 0.125/M, etc.). DeepSeek V4 Pro flagged hybrid + cache-supporting (30× discount at 0.0145/M). Kimi K2 family flagged hybrid. Open-weights routes (Llama, Gemma, Qwen, free tier) keep all flags off, because the report warns of constrained-decoding deadlock when strict-JSON is forced on quantized inference engines. Tests: 9 new cause-typed tests in json_response_test.go (safety filter / length truncated / likely timeout via empty finish_reason / unknown finish_reason fallback / length-truncated parse fail / constrained deadlock / safety-filtered parse fail / backward-compat sentinel preservation / wrapper still produces historical message prefix). 4 new capability-flag tests in budgets_test.go (hybrid reasoning, strict JSON, prefix cache, fallback conservative defaults). 1 new MCP resource test asserting o3-mini's flags surface on the wire. 1499 tests passing across all internal packages (was 1485 before PR #2, +14 new). Sets up PR #3 (provider-side response_format translation in gateway.ChatRequest gated on WantsStrictJSON) and PR #4 (prefix-cache-aware two-pass cascade gated on SupportsPrefixCache).
-
Model-tier calibration tooling + maintenance docs (ADR 051 PR #5). ADR 051 PR #1 introduced 14 new entries in
internal/llmcontext/budgets.go calibrated from a research synthesis on 2026-06-02. Without a documented calibration process the table will be stale within a quarter and operators won't know how to extend it. PR #5 fixes that by shipping the methodology + automation that produced PR #1's table, so the next tier addition is a 5-minute task instead of an afternoon of reverse-engineering. New scripts/calibrate-model.sh runs a fixed suite of helmdeck-specific prompts against a given model id via the live /api/v1/packs/helmdeck.plan REST endpoint and emits a recommended tier + draft budgets.go entry. The prompt suite covers three failure-mode classes: trivial single-action (baseline + "does it respond at all"), multi-action 3-step pack-chain (structured-output reliability), paste-heavy multi-action (the original ADR 050 motivating prompt). For each prompt it measures HTTP status, wall-clock duration, parsed-response shape, and which cascade stages fired (lexical truncation / LLM filter pass via the compaction.dropped field). The tier decision tree maps to: Tier A when all prompts succeed with no compaction, Tier B when metadata trim alone suffices, Tier C when lexical/filter cascade fires, Tier C-unstable when only trivial works, "unsupported" when even trivial fails. Hybrid-reasoning detection via trivial-intent latency > 20s. Latency budget tunable via TIMEOUT_S env var; --skip-paste-heavy shortens the run on weak models. Output as human text (default) or --json for scripting. New scripts/calibrate-model.test.sh is the self-test: it invokes the calibrator against two anchor models (openrouter/openrouter/free should be Tier C / C-unstable / unsupported; openrouter/anthropic/claude-haiku-4-5 should always be Tier A) and asserts the recommendation matches. Catches regressions in the heuristic logic.
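The tier decision tree described above can be pictured roughly as follows. This is a Go rendering for illustration only (the calibrator itself is a shell script); the promptResult type, the marker prefixes checked in cascadeFired, and the handling of a run where only one of the two harder prompts fails are assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// promptResult is a hypothetical summary of one calibration prompt.
type promptResult struct {
	OK      bool     // HTTP success with a parseable plan
	Dropped []string // entries reported in compaction.dropped
}

// cascadeFired reports whether a lexical or LLM-filter stage ran.
// ASSUMPTION: those stages are recognizable by these marker prefixes.
func cascadeFired(dropped []string) bool {
	for _, d := range dropped {
		if strings.HasPrefix(d, "lexical.") || strings.HasPrefix(d, "llm_filter") {
			return true
		}
	}
	return false
}

// recommendTier follows the decision tree in the entry above.
func recommendTier(trivial, multi, paste promptResult) string {
	switch {
	case !trivial.OK:
		return "unsupported"
	case !multi.OK && !paste.OK:
		return "C-unstable"
	case !multi.OK || !paste.OK:
		// ASSUMPTION: the entry does not say what a partial failure
		// maps to; this sketch treats it as Tier C.
		return "C"
	}
	trimmed := false
	for _, r := range []promptResult{trivial, multi, paste} {
		if cascadeFired(r.Dropped) {
			return "C"
		}
		if len(r.Dropped) > 0 {
			trimmed = true // metadata trim only
		}
	}
	if trimmed {
		return "B"
	}
	return "A"
}

func main() {
	ok := promptResult{OK: true}
	trim := promptResult{OK: true, Dropped: []string{"intent_keywords"}}
	fmt.Println(recommendTier(ok, ok, trim))
}
```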
New docs/howto/calibrate-model-tiers.md is the operator-facing methodology walkthrough: when to calibrate, where to find benchmark scores (Berkeley Function-Calling Leaderboard, Aider polyglot edit-format adherence, Artificial Analysis pricing), how to identify architectural quirks from provider docs (hybrid reasoning, strict JSON support, prompt-cache support), how to interpret the calibrator's recommendation, how to set the capability flags PR #2 will introduce (IsHybridReasoning / WantsStrictJSON / SupportsPrefixCache), and what the trailing source-of-classification comment should contain. Includes the rules for selecting MaxCatalogBytes per tier (0 / 22000 / 10000) and AllowsLLMFilter per tier (false for A and B, true for C). docs/RELEASES.md §"Agent sync checklist" step 9 added: every release cut now includes a tier-table refresh check pointing operators at https://openrouter.ai/api/v1/models for newly-shipped models and at the calibrator + how-to for evaluating them. Discovery stays manual (no helmdeck cron watching provider catalogs), but the maintainer who runs the release also notices when their fallback chain has new options worth investigating. Why PR #5 ships ahead of PRs #2–#4: the calibration methodology is freshest in maintainer memory right now; operators wanting to add new models get unblocked immediately; PR #2's capability flags need calibration data the script feeds them; calibration can evolve asynchronously from the typed-errors / strict-JSON / prefix-cache architecture work.
-
Reasoning-token stripping + JSON parser parity + research-calibrated tier table (ADR 051 PR #1). ADR 050 shipped a cascade calibrated against three free models. A research synthesis landed today documenting that the "empty completion" failure has four distinct root causes (only one of which — trailing prose — our
json.Decoder tolerance fix addresses), and a live test with openrouter/moonshotai/kimi-k2.6 immediately exposed a fifth gap: hybrid-reasoning models emit <think>...</think> / <reasoning>...</reasoning> blocks before the structured payload, and nothing in helmdeck strips them. The Kimi-K2.6 call streamed for 296 seconds inside its <think> block and got cut off by OpenClaw's 5-minute timeout before reaching the JSON; even if it had finished, the parser would have hit the reasoning block first. New internal/llmcontext/reasoning.go exports StripReasoningTokens(s string) string and HasReasoningTokens(s string) bool. Strips <think>...</think>, <reasoning>...</reasoning>, and [REASONING]...[/REASONING] blocks: case-insensitive, multi-line, tolerates attributes (<think type="planning">), idempotent on clean input, requires a closing tag (unclosed open tags pass through so we never silently drop a real answer). Collapses runs of blank lines that the strip leaves behind; trims leading/trailing whitespace. New internal/packs/builtin/json_response.go exports DecodeStructuredResponse(rawBody, packName, v), consolidating the defensive parsing pipeline every LLM-backed pack was reimplementing slightly differently: strip reasoning tokens → trim → unwrap code fences (unwrapCodeFence existing helper) → json.Decoder.Decode (tolerates trailing prose/HTML/markdown, the ADR 050 PR #4 fix) → balanced-brace extractFirstJSONObject substring fallback (reuses the helper from webtest.go that properly handles } inside JSON string literals, which is better than the naive LastIndex("}") approach plan.go used to use). Returns *packs.PackError with CodeHandlerFailed and a packName-threaded Message ("gateway returned an empty plan response" / "empty routing response" / "empty rewrite response" depending on caller). Migration: plan.go (had the ADR 050 PR #4 tolerance fix), route.go (still on strict json.Unmarshal; this brings it to parity), and content_ground.go (had its own substring fallback) all now call DecodeStructuredResponse.
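A minimal sketch of the first stages of that pipeline (strip reasoning blocks, trim, decode the first JSON value) might look like the following. The function names stripReasoning and decodeFirstObject are hypothetical, the regular expressions are an assumption about how the stripper could be written, and the code-fence unwrap and balanced-brace fallback are left out for brevity.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"regexp"
	"strings"
)

// Each pattern requires a closing tag, so an unclosed open tag passes
// through untouched. (?is) makes the match case-insensitive and lets
// "." span newlines; [^>]* tolerates attributes on the open tag.
var reasoningBlocks = []*regexp.Regexp{
	regexp.MustCompile(`(?is)<think\b[^>]*>.*?</think\s*>`),
	regexp.MustCompile(`(?is)<reasoning\b[^>]*>.*?</reasoning\s*>`),
	regexp.MustCompile(`(?is)\[reasoning\].*?\[/reasoning\]`),
}

var blankRuns = regexp.MustCompile(`\n{3,}`)

// stripReasoning removes closed reasoning blocks, collapses the blank
// lines left behind, and trims surrounding whitespace.
func stripReasoning(s string) string {
	for _, re := range reasoningBlocks {
		s = re.ReplaceAllString(s, "")
	}
	s = blankRuns.ReplaceAllString(s, "\n\n")
	return strings.TrimSpace(s)
}

// decodeFirstObject decodes the first JSON value and ignores whatever
// trails it, because json.Decoder stops after one value.
func decodeFirstObject(raw string, v any) error {
	body := stripReasoning(raw)
	if body == "" {
		return errors.New("empty response after stripping reasoning tokens")
	}
	return json.NewDecoder(strings.NewReader(body)).Decode(v)
}

func main() {
	raw := "<think type=\"planning\">step one...</think>\n{\"steps\":[\"a\",\"b\"]}\nHope that helps!"
	var plan struct {
		Steps []string `json:"steps"`
	}
	if err := decodeFirstObject(raw, &plan); err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Println(len(plan.Steps), "steps decoded")
}
```

Leading prose before the object would still fail in this sketch; that case is what the balanced-brace fallback described above covers.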
Three independent fallback paths converge to one. Net code reduction; uniform behavior. Tier table refresh: internal/llmcontext/budgets.go gains 14 new entries calibrated from the research report's BFCL (Berkeley Function-Calling Leaderboard), Aider polyglot edit-format adherence, and Artificial Analysis pricing data. Tier A additions: openai/o3-mini (BFCL 84.00%; hybrid reasoning, emits <think>, now stripped), google/gemini-2.5-pro (BFCL 85.04% leaderboard top, Aider 99.6% edit-format), google/gemini-2.5-flash (BFCL 75.58%), anthropic/claude-3.7-sonnet (BFCL 73.24%, Aider 84.2%; hybrid thinking mode). Tier B additions: openrouter/deepseek/deepseek-v4-pro (BFCL proxy 71.4%, Aider proxy 74.2%; hybrid reasoning with documented 30-minute serverless timeouts), openrouter/deepseek/deepseek-v3.2 (Aider 74.2%), openrouter/deepseek/deepseek-chat (broader family), openrouter/x-ai/grok- prefix (BFCL proxy 61.38%, Aider 97.3% edit-format; price-tier bumps past 128K context). Tier C additions: openrouter/moonshotai/kimi-k2 (256K context, hybrid reasoning; observed to time out without the <think> stripper), openrouter/moonshotai/kimi- prefix (catches future Kimi releases), openrouter/tencent/ prefix (250K context, conservative until live-validated). Each entry's classification source is named in its trailing comment so future operators can trace the call to its evidence. Tests: 15 new tests in internal/llmcontext/reasoning_test.go (idempotent on clean input, drops think/reasoning/REASONING variants, case-insensitive, multi-line bodies, multiple blocks, tag attributes, unclosed-tag pass-through, blank-line collapse, regression sample modeled on real Kimi-K2.6 output).
12 new tests in internal/packs/builtin/json_response_test.go (happy path, think-block prefix, reasoning-block prefix, code-fence unwrap, trailing-content tolerance, leading-prose substring extraction, brace-inside-string regression guard, empty-body error message, reasoning-only input post-strip, unrecoverable garbage, packName threading, combined think + fence). 12 new tier-classification assertions in internal/llmcontext/budgets_test.go covering each of the report's recommended model ids. Sets up PR #2 (cause-typed empty completions: ErrSafetyFiltered, ErrLengthTruncated, ErrConstrainedDeadlock, ErrLikelyTimeout, plus Budget capability flags IsHybridReasoning / WantsStrictJSON / SupportsPrefixCache / CachedInputCostUSDPerMTok), PR #3 (provider-side strict JSON mode via response_format on gateway.ChatRequest), and PR #4 (prefix-cache-aware two-pass cascade: restructure the ADR 050 PR #4 filter so both passes hit the same provider cache).
[0.22.0] - 2026-06-01
Theme: Agents that work on free models, with memory. Closes ADR 047 (catalog metadata + memory-driven routing), ADR 048 (memory write surface + OpenClaw memory-corpus bridge), ADR 049 PR #1 (helmdeck.plan intent decomposer), and ADR 050 (4-PR retrieval-augmented tool selection cascade). The exact MiniMax M3 launch paste + 3-action ask that motivated the cascade work now returns a valid 3-step plan on openrouter/openrouter/free (was: empty completion).
Added
- Two-pass LLM-filter cascade + JSON-decoder tolerance — original motivating prompt now succeeds on free models (ADR 050 PR #4). Closes the ADR 050 roadmap and the gate the entire ADR was scoped to meet. PR #1 shipped per-model budgets + metadata compaction; PR #2 wired route + added the context-budgets resource; PR #3 added the cascading
Select() with lexical pre-filter. Empirical gap PR #3 left open: complex multi-paragraph prompts (1.5KB MiniMax M3 launch paste + 3-action ask) still empty-completed on free models even with a 3KB catalog; the failure had shifted from "catalog overflows working set" to "structured-output reliability on long user pastes." PR #4 closes that gap via two cooperating changes: (1) an optional pre-planning LLM filter pass that runs when lexical retrieval is ambiguous, and (2) a tolerant JSON parser that reads the first complete object from the model's response and ignores trailing prose/HTML/markup. The live acceptance gate now passes: the exact MiniMax M3 prompt that motivated ADR 050 returns a valid 3-step pack-chain plan in ~46s on openrouter/openrouter/free (was: empty completion at 29.5s before any PR). Diagnostic showed the model was producing JSON + trailing garbage all along; strict json.Unmarshal was rejecting otherwise-recoverable output, and operators saw "empty plan response" because the parser failed before extracting the leading object. json.Decoder reads one value and stops, surfacing the actual plan. Mechanism. When Budget.AllowsLLMFilter == true AND Select ended with lexical.low_confidence (an ambiguity signal: top scores within 40% of each other), the pack handler dispatches a SMALL second LLM call: catalog names + one-line descriptions + the user intent → returns just a JSON list of relevant tool ids. The handler then restricts the full catalog to the union of the filter picks and the lexical top-N, preserving lexical's strong signals while letting the filter recover from lexical false-negatives. The planning call then sees a SMALL catalog of only the picked ids.
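The union-then-restrict step in that mechanism could be sketched as below. The lowercase helper names, the map-based catalog, and the choice of the lexical list as the primary (order-defining) input are assumptions for illustration; the entry only states that the union preserves lexical's strong signals.

```go
package main

import "fmt"

// mergeKeepOrder unions two id lists, keeping primary's order first
// and dropping duplicates.
func mergeKeepOrder(primary, secondary []string) []string {
	seen := make(map[string]bool, len(primary)+len(secondary))
	out := make([]string, 0, len(primary)+len(secondary))
	for _, list := range [][]string{primary, secondary} {
		for _, id := range list {
			if !seen[id] {
				seen[id] = true
				out = append(out, id)
			}
		}
	}
	return out
}

// restrictCatalog keeps only the listed ids; ids that are not in the
// catalog are ignored. The catalog is modeled as id -> description,
// a simplification of the real routing guide.
func restrictCatalog(catalog map[string]string, keep []string) map[string]string {
	out := make(map[string]string, len(keep))
	for _, id := range keep {
		if desc, ok := catalog[id]; ok {
			out[id] = desc
		}
	}
	return out
}

func main() {
	catalog := map[string]string{
		"memory_store": "persist a fact",
		"blog.write":   "draft a blog post",
		"image.gen":    "generate an image",
		"repo.scan":    "scan a repository",
	}
	lexicalTop := []string{"blog.write", "image.gen"}
	filterPicks := []string{"memory_store", "blog.write", "made.up.id"}

	keep := mergeKeepOrder(lexicalTop, filterPicks)
	small := restrictCatalog(catalog, keep)
	fmt.Println(keep, len(small))
}
```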
New surfaces in internal/llmcontext. FilterSystemPrompt (versioned alongside the pack code). BuildFilterUserMessage(rg, intent) string (~10KB for the current 70-entry catalog, vs ~35KB full-metadata). ParseFilterResponse(text) []string tolerates code-fenced JSON, leading prose, dedup. RestrictCatalog(rg, keep) subsets by id; unknown ids ignored. MergeKeepOrder(primary, secondary) unions two id lists preserving primary order. IDsFromRoutingGuide(rg) extracts sorted ids for reproducible filter prompts. ShouldEscalateToFilter(ranked, min) bool combines HighConfidence < 0.4 with len(ranked) >= min. Budget extension. Budget gains AllowsLLMFilter bool + FilterModel string. Tier C entries enabled by default with empty FilterModel (caller falls back to the planning model). Tier A/B disabled. helmdeck://context-budgets exposes the new fields. JSON parser tolerance. Switched helmdeck.plan from json.Unmarshal(body, &raw) to json.NewDecoder(strings.NewReader(body)).Decode(&raw) with a body[first{:last}+1] substring fallback. Reads the first complete JSON object and stops, tolerating trailing prose/HTML/markup that weak models produce. Cascade gating fix. PR #3's escalation was eagerly firing on every lexical truncation, adding ~30s of latency on cases lexical alone handled. PR #4 gates on the new lexical.low_confidence marker, which Select appends only when ShouldEscalateToFilter returns true; confident-pick cases bypass the filter pass entirely, restoring PR #3's 5-second latency on simple prompts. Wire-up. Both helmdeck.plan and helmdeck.route orchestrate the filter pass after Select returns when escalation conditions match. Filter failures (parse errors, dispatcher errors) fall back silently to the lexical-only selection: the filter is an enhancement, never a hard dependency. Successful runs append llm_filter(picks=N,kept=M) to the Trim record so operators see the filter stage in the same audit surface. Tests.
14 new tests in internal/llmcontext/filter_test.go (prompt shape; JSON parsing variants: code-fenced, leading prose, dedup, real-world response; RestrictCatalog membership; MergeKeepOrder primary precedence; IDsFromRoutingGuide stability; ShouldEscalateToFilter thresholds). 2 new plan integration tests using scripted dispatchers (TwoPass cascade fires both LLM calls in correct order with correct system prompts, FilterFails falls back to lexical without breaking the plan call). 1446 tests passing across all internal packages.
- Cascading
Select() + LexicalRank + helmdeck://my-plans (ADR 050 PR #3). Closes the "simple multi-intent prompts work on free models" gate. PR #1 shipped metadata compaction; PR #2 wired it into route; PR #3 wraps both stages in a cascading Select(catalog, intent, budget) → (selected, Trim) entry point that adds lexical retrieval + top-N truncation as stage 3 when compaction alone can't reach the model's budget. Live test on openrouter/openrouter/free: a 3-action intent ("remember this fact, then write a blog about it, then generate an image") returned a clean 3-step pack-chain plan in ~5 seconds post-cascade (catalog 30KB → 3.16KB, 89% reduction, all compaction steps + lexical.top_n fired). New LexicalRank(catalog, intent) []Scored scores every catalog entry by keyword overlap against intent_keywords (×3.0 weight), accepts/produces (×2.0), name (×2.0), description (×1.0), plus pipeline supersedes (×2.5) so the supersedes-honoring policy lives at the ranking layer too. Stop-word filtering, case-insensitive, single-character tokens dropped, deterministic ordering on ties (by entry name). New TopK(ranked, k) truncates the ranked slice; HighConfidence(ranked, threshold) reports whether the top score is meaningfully ahead of the second (>=threshold ratio gap), used by the future PR #4 LLM-filter pass to decide whether escalation is needed. New Select(catalog, intent, budget) cascade is the public entry point: Tier A pass-through → CompactCatalog (PR #1 metadata trim) → if still over budget, LexicalRank + TopK (cap is SelectMaxEntriesTierC=12 or SelectMaxEntriesTierB=25 by tier). Returns the same Trim record callers already log; appends lexical.top_n to dropped[] when stage 3 fires. helmdeck.plan + helmdeck.route wire-up: both packs now call Select(...) instead of CompactCatalog(...) directly. The cascade is internal; callers see one function call. INFO log line renamed from "catalog compacted" to "catalog selection ran" to reflect the broader cascade.
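A stripped-down version of the weighted keyword-overlap ranking could look like this. The entry and scored types, the stop-word list, and the tokenizer are assumptions; only the weights (×3.0 intent keywords, ×2.0 name, ×1.0 description) and the tie-break by name come from the entry, and the accepts/produces and supersedes terms are omitted to keep the sketch short.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// entry and scored are hypothetical, simplified catalog shapes.
type entry struct {
	Name           string
	Description    string
	IntentKeywords []string
}

type scored struct {
	Name  string
	Score float64
}

// ASSUMPTION: a tiny illustrative stop-word list.
var stopWords = map[string]bool{
	"the": true, "then": true, "this": true, "about": true,
	"it": true, "and": true, "to": true, "an": true,
}

// tokenize lowercases, splits on non-alphanumerics, and drops
// stop words and single-character tokens.
func tokenize(s string) []string {
	fields := strings.FieldsFunc(strings.ToLower(s), func(r rune) bool {
		return !(r >= 'a' && r <= 'z') && !(r >= '0' && r <= '9')
	})
	var out []string
	for _, t := range fields {
		if len(t) > 1 && !stopWords[t] {
			out = append(out, t)
		}
	}
	return out
}

// overlap counts tokens of text that appear in the intent token set.
func overlap(want map[string]bool, text string) float64 {
	n := 0.0
	for _, t := range tokenize(text) {
		if want[t] {
			n++
		}
	}
	return n
}

// lexicalRank scores every entry and sorts by score, then by name.
func lexicalRank(catalog []entry, intent string) []scored {
	want := map[string]bool{}
	for _, t := range tokenize(intent) {
		want[t] = true
	}
	ranked := make([]scored, 0, len(catalog))
	for _, e := range catalog {
		s := 3.0*overlap(want, strings.Join(e.IntentKeywords, " ")) +
			2.0*overlap(want, e.Name) +
			1.0*overlap(want, e.Description)
		ranked = append(ranked, scored{Name: e.Name, Score: s})
	}
	sort.Slice(ranked, func(i, j int) bool {
		if ranked[i].Score != ranked[j].Score {
			return ranked[i].Score > ranked[j].Score
		}
		return ranked[i].Name < ranked[j].Name // deterministic ties
	})
	return ranked
}

// topK truncates the ranked slice to at most k entries.
func topK(ranked []scored, k int) []scored {
	if k >= 0 && k < len(ranked) {
		return ranked[:k]
	}
	return ranked
}

func main() {
	catalog := []entry{
		{Name: "image.gen", IntentKeywords: []string{"image", "generate"}},
		{Name: "memory_store", IntentKeywords: []string{"remember", "fact"}},
		{Name: "repo.scan", IntentKeywords: []string{"repository"}},
	}
	for _, s := range topK(lexicalRank(catalog, "remember this fact"), 2) {
		fmt.Printf("%s %.1f\n", s.Name, s.Score)
	}
}
```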
New helmdeck://my-plans MCP resource (always listed, ADR 050 PR #3 consolidation of ADR 049's deferred PR #2) projects the caller's plan_history audit rows into per-intent_sha cohorts: {intent_sha, count, complexity, top_tools[], last_unix, models[]}. Operators audit the planner's behavior over time; agents detect stable learned plans. Tests: 11 new tests across internal/llmcontext/ (tokenize, LexicalRank intent_keywords beat name matches, supersedes boost, deterministic ties, TopK, HighConfidence, Select Tier A passthrough, compact-only sufficient, lexical escalation, supersedes-survives-truncation, over-budget marker not forwarded). 3 new tests in internal/mcp/resources_test.go cover my-plans listing, aggregation correctness, and empty-history note. Bumped always-listed-resource count assertion to 6. Total: 1431 tests passing across all internal packages. Honest scope. Complex multi-paragraph prompts (e.g. the original MiniMax M3 launch paste + 3-action ask) still empty-complete on the worst free models even with a 3KB catalog; the failure has shifted from "catalog overflows working set" to "structured-output reliability on long user pastes." That's a PR #4 problem (two-pass LLM-filter cascade with a paid filter model + a free planner model), not a PR #3 problem. PR #3 closed the "simple multi-action intents work on free models" gate that PR #1 was originally scoped to meet; PR #4 closes the remaining "complex paste + multi-action" gate.
- helmdeck.route compaction + helmdeck://context-budgets resource + plan compaction field (ADR 050 PR #2). Generalizes the llmcontext integration from PR #1: helmdeck.route's handler now calls llmcontext.BudgetFor(model) + CompactCatalog(catalog, budget) after buildCatalog, matching the pattern PR #1 added to helmdeck.plan. Free models hitting the router now see the same trimmed catalog they see from the planner, with the same INFO log surface so operators can correlate compaction events across both packs.
New helmdeck://context-budgets MCP resource (always listed, no caller scoping, no memory dependency) projects the budgets table: budgets[] with {model, input_tokens, output_tokens, max_catalog_bytes, tier} per entry, a fallback row showing what unmapped models inherit, and a policy string explaining lookup rules. Operators can audit which model gets which tier without grepping source; agents reading the resource can understand why a plan was made under a slim catalog and decide whether to escalate to a stronger model. New compaction field on helmdeck.plan output (optional, omitted on Tier A pass-through) surfaces the Trim record on the wire: {before_bytes, after_bytes, dropped[]}. Same shape as the INFO log line so agents and operators see the same numbers. Output schema declares compaction: object; agents that ignore unknown fields are unaffected. Tests: 2 new route tests cover Tier A full-catalog pass-through and Tier C supersedes preservation; 2 new plan tests cover the omitempty contract on Tier A and the field-present contract on Tier C with a 30-pack catalog overflowing the budget; 2 new mcp tests cover context-budgets listing + read shape. Total: 1029 tests passing across mcp + packs + llmcontext + pipelines. Sets up PR #3 (retrieval-augmented tool selection: cascading Select() entry point with lexical pre-filter + TopK + helmdeck://my-plans projection; the public-API shift from CompactCatalog to Select is the migration target that wires PR #1 + PR #2 + PR #3 into one cohesive flow).
- internal/llmcontext module: per-model prompt budgets + deterministic catalog compaction (ADR 050 PR #1). ADR 049 PR #1 (helmdeck.plan) shipped correctly but live-test on a real multi-intent prompt reproducibly empty-completed on free models: openrouter/nvidia/nemotron-3-super-120b-a12b:free returned no completion after 29.5s; openrouter/z-ai/glm-4.5-air:free returned no completion after 58.0s; OpenClaw's MCP client logged 3 timeouts + 1 empty-plan error.
Root cause: the catalog projection assembled by buildCatalog() is 35KB of JSON for the current stack (52 packs + 21 pipelines with full metadata). Combined with the user's paste, the system prompt, and the structured-output ceiling, free models with imperfect structured-output reliability bail. The pack itself is correct; the failure is a cross-cutting concern affecting every LLM-backed pack that ships catalog or large input context. New internal/llmcontext module exports three surfaces: Tier (A frontier / B mid-tier / C weak/free), Budget (per-model InputTokens / OutputTokens / MaxCatalogBytes), and CompactCatalog(full, budget) → (compacted, Trim). Tier classifications are calibrated against live OpenClaw tests, not vendor specs: a model with a 32K window that empty-completes at 20K of input is Tier C even though its window is larger than some Tier B models. Budgets table (budgets.go) maps canonical model ids to budgets via exact-match then longest-prefix-wins; unknown models fall back to Tier C so a fresh model still gets a working (if conservative) profile. Compaction order: pack intent_keywords[] → pack typical_use → pack limitations[] → pipeline steps[] bodies (kept: id/name/pack) → pipeline input/output schemas (replaced with field-name lists) → description truncation to first sentence. Each pass marshals + re-checks until len(JSON) <= MaxCatalogBytes. Pipeline metadata.supersedes is NEVER trimmed, because it anchors plan's rule P2 (pipeline supersedes packs the user named by hand). Pack names + pipeline ids are also preserved (they're dispatch identifiers). helmdeck.plan integration: handler calls llmcontext.BudgetFor(model) + CompactCatalog(catalog, budget) immediately after buildCatalog, before assembling the prompt. When the trim record is non-empty, logs an INFO line with model, tier, before_bytes, after_bytes, dropped[] so operators see when free models are getting a slim catalog.
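The lookup rule described above (exact match, then longest prefix, then a Tier C fallback) can be sketched as follows. The budget struct is reduced to two fields, the byte caps reuse the per-tier values quoted elsewhere in this changelog, and the table rows, including the short "openrouter/x-ai/" prefix, are hypothetical examples rather than the shipped table.

```go
package main

import (
	"fmt"
	"strings"
)

// budget is a hypothetical, trimmed-down version of the Budget type.
type budget struct {
	Tier            string
	MaxCatalogBytes int // 0 means no cap in this sketch
}

// ASSUMPTION: exact ids and prefixes share one table here; the real
// layout of budgets.go may differ.
var table = map[string]budget{
	"google/gemini-2.5-pro": {Tier: "A", MaxCatalogBytes: 0},
	"openrouter/x-ai/grok-": {Tier: "B", MaxCatalogBytes: 22000},
	"openrouter/x-ai/":      {Tier: "C", MaxCatalogBytes: 10000},
}

// Unknown models inherit a conservative Tier C profile.
var fallback = budget{Tier: "C", MaxCatalogBytes: 10000}

// budgetFor resolves a model id: exact match first, then the longest
// matching prefix, then the fallback.
func budgetFor(model string) budget {
	if b, ok := table[model]; ok {
		return b
	}
	best := ""
	for prefix := range table {
		if strings.HasPrefix(model, prefix) && len(prefix) > len(best) {
			best = prefix
		}
	}
	if best != "" {
		return table[best]
	}
	return fallback
}

func main() {
	for _, m := range []string{
		"google/gemini-2.5-pro",
		"openrouter/x-ai/grok-4",
		"openrouter/x-ai/other",
		"some/brand-new-model",
	} {
		fmt.Println(m, budgetFor(m).Tier)
	}
}
```

Picking the longest prefix keeps the result independent of map iteration order, which matters in Go because that order is randomized.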
20 tests in internal/llmcontext/ cover exact + prefix lookup, Tier-A pass-through, priority-order trim, supersedes survival, slimPipelineSteps preservation of dispatch-relevant fields, firstSentence helper, immutable-input contract, and the still-over-budget marker. 2 plan tests added asserting Tier C compaction never drops dispatch identifiers and Tier A models see the full catalog. Token-counting heuristic: 1 token ≈ 4 chars by byte count instead of a real tokenizer, accepting the bounded cost of being slightly conservative (sending a leaner catalog than the model needs) rather than pulling a model-specific tokenizer into Go. Empirical results. Trivial-intent calls on openrouter/openrouter/free post-compaction succeed in ~23s (catalog 30KB → 13.9KB, 54% reduction, all 6 trim steps fire). Complex multi-paragraph intents on the same model still empty-complete: the 14KB irreducible floor (after stripping every metadata field but preserving names, ids, and supersedes) is still too much for some free models when combined with a long user paste and a structured-output ceiling. The right fix for that case is retrieval-augmented tool selection (load only the catalog entries relevant to the intent), designed as the next step of this ADR, not a brittle entry-truncation hack on top of metadata compaction. Sets up PR #2 (wire helmdeck.route + add helmdeck://context-budgets MCP resource for operator visibility) and PR #3 (retrieval-augmented selection: lexical pre-filter + top-N selection over catalog entries via plan_history priors, ships helmdeck://my-plans projection).
- helmdeck.plan self-learning intent decomposer pack (ADR 049 PR #1). ADR 047 PR #3's helmdeck.route answers "given an intent, which ONE tool?", and that's enough when the user's ask maps to a single pack or pipeline. But real conversational prompts often span multiple intents in one message.
A live OpenClaw test made the gap concrete: a free model (nvidia/nemotron-3-super-120b-a12b:free) received a paste + "do you have memory using helmdeck for [paste]... we can use the memory to create a blog to test the memory" and only called the image-gen tool, never reaching for memory tools or the blog pipeline ADR 048 had just shipped. The bridge worked; the agent simply didn't decompose the multi-intent prompt. helmdeck.plan closes that gap: a new LLM-backed meta-pack that returns an ordered steps[] array (each {order, tool, args, rationale}), a derived rewritten_prompt string a free model can execute line-by-line, and a complexity classifier (single-action / pipeline-direct / pack-chain). The rewritten_prompt is derived from steps in the handler (not asked of the LLM independently) so the two surfaces can't drift. Pipeline-aware: reuses helmdeck.route's catalog projection so the model sees both packs AND pipelines; the system prompt teaches three rules: pipeline wins when accepts/produces fit, honor metadata.supersedes (a pipeline supersedes packs the user named by hand), decompose pack-by-pack only when no pipeline matches. Re-implementing a pipeline's curated chain as pack-by-pack steps would regress the supersedes guarantee. Guards: every step.tool MUST resolve to a registered pack name OR the literal helmdeck__pipeline-run with args.id matching a real pipeline; unknown ids get demoted to "tool": "unknown" with a populated rationale, and helmdeck.plan cannot call itself (recursive-call guard). Partial demotion is fine: valid steps survive alongside unknown ones, and the agent decides. Self-learning seam: every successful plan writes a compact PlanAudit row to the caller's bare namespace under category plan_history (new), containing intent SHA + complexity + per-step tool name + SHA-8 of args, NOT the rewritten prompt or rationales. Rows expire after 30 days (matching pack_history / pipeline_history) or via helmdeck.memory_forget.
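To make the shape of that audit row concrete, here is a hedged sketch of building one. The struct and field names are hypothetical, and two details are assumptions: that the digests are SHA-256, and that "SHA-8" means the first 8 hex characters of the digest of the JSON-encoded args.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// step is a hypothetical plan step; only Tool and Args matter here.
type step struct {
	Tool string
	Args map[string]any
}

type stepAudit struct {
	Tool     string `json:"tool"`
	ArgsSHA8 string `json:"args_sha8"`
}

// planAudit is a hypothetical rendering of the compact audit row:
// no rewritten prompt, no rationales.
type planAudit struct {
	IntentSHA  string      `json:"intent_sha"`
	Complexity string      `json:"complexity"`
	Steps      []stepAudit `json:"steps"`
}

func shaHex(b []byte) string {
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:])
}

// auditPlan hashes the intent and each step's args. encoding/json
// writes map keys in sorted order, so the args digest does not depend
// on how the map was built.
func auditPlan(intent, complexity string, steps []step) (planAudit, error) {
	row := planAudit{IntentSHA: shaHex([]byte(intent)), Complexity: complexity}
	for _, s := range steps {
		raw, err := json.Marshal(s.Args)
		if err != nil {
			return planAudit{}, err
		}
		row.Steps = append(row.Steps, stepAudit{Tool: s.Tool, ArgsSHA8: shaHex(raw)[:8]})
	}
	return row, nil
}

func main() {
	row, err := auditPlan("remember this, then write a blog", "pack-chain", []step{
		{Tool: "helmdeck.memory_store", Args: map[string]any{"key": "launch", "value": "notes"}},
	})
	if err != nil {
		fmt.Println("audit failed:", err)
		return
	}
	out, _ := json.Marshal(row)
	fmt.Println(string(out))
}
```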
The plan_history category joins the reserved-category guard in internal/packs/facts.go so agents can't poison the projection through helmdeck.memory_store. Sets up PR #2 (helmdeck://my-plans projection mining the history into priors) and PR #3 (frontier-model gap detection comparing expert_baseline against the plan's decomposition). New docs/howto/intent-decomposition.md walks operators through when to call, the wire shape, the pipeline-aware behavior, and the self-learning story; SKILL.md adds a one-paragraph planning tip.
- OpenClaw memory-corpus bridge: QMD-compatible MCP endpoint at
/api/v1/mcp/qmd/sse (ADR 048 PR #3). Closes the ADR 048 roadmap. ADR 047 + PRs #1 and #2 of ADR 048 built up helmdeck's per-caller memory layer (audit history + agent-written facts); this PR makes that memory queryable from OpenClaw's own memory_search tool so agents see helmdeck's corpus inline with their conversational memory. New QMDServer type (internal/mcp/qmd_server.go) speaks just enough MCP (initialize, tools/list, tools/call) to expose a single tool named query matching the wire shape OpenClaw's MCPorter daemon expects (extensions/memory-core/src/memory/qmd-manager.ts:2167–2205). Response shape: {results: [{docid, score, snippet, collection, file?, start_line?, end_line?}]}. Scoring is substring/keyword (helmdeck doesn't carry embeddings); semantic recall happens client-side via OpenClaw's embedding pipeline (PR #1 sidecar). New SSE transport at /api/v1/mcp/qmd/sse (internal/api/mcp_qmd_sse.go) mirrors /api/v1/mcp/sse 1:1 (session GET/POST pairing, 15s keepalives, chanWriter framing). Separate route + server because MCPorter expects the literal tool name query and the main PackServer uses dotted pack names, so multiplexing would collide. Corpus projection renders three layers verbatim: pack_history rows format as ## Pack call: <name> summaries with input fields; pipeline_history rows format as ## Pipeline run: <id>; agent-written user_facts (and any other non-reserved category) surface the raw fact value with a category footer. Caller scoping reuses packs.CallerFromContext so the corpus is JWT-subject-namespaced just like every other memory surface. compose.openclaw-sidecar.yml wires OPENCLAW_MEMORY_QMD_MCPORTER_ENABLED=true + SERVERNAME=helmdeck + STARTDAEMON=true by default; operators opt out via OPENCLAW_QMD_ENABLED=false in their shell. scripts/openclaw-register-qmd.sh completes the wire by registering helmdeck with MCPorter's own config (reuses the helmdeck JWT OpenClaw already stores so token rotation propagates).
Auto-runs from scripts/install.sh after the stack is healthy; idempotent for manual reruns. Memory-disabled deployments return 503 from /api/v1/mcp/qmd/sse so MCPorter logs a clean "tool not found" and memory_search degrades to the user's local chunks without an agent-side error. New docs/howto/openclaw-memory-corpus.md documents the wire path, verification steps, opt-out, and what the bridge intentionally does NOT do (no writes via this endpoint, no cross-caller mixing, no vault leaks). 9 new tests in internal/mcp/qmd_server_test.go cover handshake, tools/list shape, user-fact + pack-audit projection, per-caller isolation, unknown-tool rejection, nil-store safety, limit clamping, and the dual structuredContent + content envelope so both newer and older MCP clients parse the response.
- Helmdeck memory write surface:
POST /api/v1/memory/store + helmdeck.memory_store pack + helmdeck://my-memory MCP resource (ADR 048 PR #2). ADR 047 PR #2 made the memory layer queryable; this PR makes it writable. Any MCP client (and the chat agent) can now persist user-supplied facts under the caller's bare namespace with category tagging + TTL. Two surfaces share one engine policy (internal/packs/facts.go → packs.ValidateFact): POST /api/v1/memory/store for REST callers and the management UI; helmdeck.memory_store pack for chat agents calling helmdeck via MCP. Request shape: {key, value, category?, tags?, ttl_seconds?}. key/value are required, category defaults to user_facts, TTL defaults to 90 days (min 1h, max 365d). Reserved-category guard: pack_history and pipeline_history are owned by the engine audit hooks and rejected with 400 / CodeInvalidInput so an agent can't poison the my-defaults projection. NoAudit: true on helmdeck.memory_store so storing a fact doesn't pollute the audit history with helmdeck.memory_store ranked alongside real packs. New helmdeck://my-memory MCP resource (always listed): per-caller index of stored facts grouped by category, with counts + recent_keys. Audit categories are filtered out; those still surface via helmdeck://my-defaults. Agents read my-memory at the top of a session to discover existing facts before re-asking the user. Lifecycle: facts auto-expire via the existing memory TTL; the existing helmdeck.memory_forget pack (ADR 047 PR #2) already handles scope: "key:<exact-key>" so per-fact cleanup composes for free. SKILL.md teaches the agent to persist durable preferences/conventions via helmdeck__memory_store and to peek helmdeck://my-memory before re-asking. New docs/howto/agent-facts.md walks operators + users through the full lifecycle. Memory-disabled deployments degrade gracefully: writes soft-succeed with a note so chat agents don't have to special-case nil-store paths. Sets up PR #3 (OpenClaw memory-corpus bridge: helmdeck's audit + user_facts surface through OpenClaw's memory_search).
- Embedding sidecar overlay for OpenClaw
memory_search semantic recall (ADR 048 PR #1). Today OpenClaw's memory_search degrades to FTS (keyword/BM25) when no embedding provider is configured: recall on a fresh install is "the OpenAI key for provider 'openai' is missing" and the agent falls back to lexical search. ADR 048's first PR ships an opt-in compose overlay (deploy/compose/compose.embeddings.yml) that runs ollama/ollama:latest as helmdeck-embeddings on baas-net, plus a one-shot init service that runs ollama pull for nomic-embed-text (~270 MB, ~600 MB RAM idle, CPU-only, Apache 2.0). A named volume persists the model cache so container re-creates don't re-download. scripts/install.sh layers the overlay by default; --no-embeddings opts out for operators who'd rather use OpenAI cloud or a remote Ollama. OpenClaw still needs one manual openclaw agents add main to register the openai-compatible provider pointing at http://helmdeck-embeddings:11434/v1; there's no env-var auto-discovery in OpenClaw today, so zero-config will come in a follow-up once the upstream surface stabilizes. docs/howto/openclaw-memory.md walks operators through verify / override / opt-out paths. Sets up PR #2 (helmdeck memory write surface: POST /api/v1/memory/store + helmdeck.memory_store pack) and PR #3 (OpenClaw memory-corpus bridge: helmdeck's audit history + user_facts surface through OpenClaw's memory_search).
- Blog persona directives now call out code blocks, mermaid diagrams, and numeric tables alongside tone/length. When the slides persona enrichment shipped, the
technical slides directive started inviting fenced code + mermaid flowchart/sequenceDiagram blocks where the source supports them, and executive/educational/academic each gained a visual affordance hint. The blog rewriter was behind on that side — technical mentioned code blocks but was silent on diagrams; the other personas didn't mention either. blog.rewrite_for_audience's persona map now matches the slides vocabulary: technical invites a mermaid diagram for process/architecture sources; executive promotes a numeric comparison into a small markdown table when more than two values are involved; educational invites a minimal code block before each concept explanation + a mermaid sequence-of-steps where it builds a mental model; academic includes a mermaid diagram or numbered figure when the source presents structured data. marketing/general stay text-first by design (visual treatment for marketing is product screenshots, which the rewriter doesn't control). content.ground is left untouched — it's a citation pass, and asking it to introduce visual structure mid-grounding would destabilize the citations. New TestBlogRewrite_PersonaVisualAffordances asserts the new substrings land in the system prompt per persona so prompt drift surfaces as a test failure.
- Auto-split slide overflow for code blocks and image+bullets. Marp silently clips anything that doesn't fit the slide — a 60-line code sample renders with its bottom 30 lines invisible, the reader blames "the model produced bad slides" when really the renderer ate half the content.
slides.outline now runs a deterministic post-pass between the LLM's output and the artifact write: it walks the deck, splits any code block longer than 22 lines into continuation slides ("Title (cont. 2/3)") with the fence reopened on each, and splits any slide where an image sits next to more than 8 lines of bullets/text into image-only + bullets-only continuation slides. Continuation slides carry a synthetic <!-- Continuation of X (chunk N of M). --> speaker note so slides.narrate produces sensible audio; the LLM's original speaker notes stay on the first chunk. Speaker notes, frontmatter, post-code paragraphs, and image-prompt indices all survive the split (the pass runs BEFORE extractImagePrompts so slide_index maps to the final slide count). max_slides is now a soft cap — the LLM aims for it, but the post-pass overshoots when overflow demands; the output's slide_count reflects the final post-split count. The 22-line threshold is tuned for Marp's default 14pt monospace on a 16:9 slide; the splitter prefers a blank-line boundary within ±3 lines of the cut so functions don't get sliced mid-statement when a natural break exists nearby. Idempotent on already-fitting decks. Wide-table pagination (>20 rows) is a known gap deferred to a follow-up; the existing 60vh CSS cap keeps tables on-slide for now. All 7 slide pipelines (grounded-deck, grounded-narrate, research-deck, research-narrate, research-ground-deck, scrape-deck, repo-presentation) get the fix automatically — no pipeline changes needed.
- Routing Memory management UI (ADR 047 PR #4). Closes the ADR 047 roadmap. PR #2 added per-caller audit memory +
helmdeck://my-defaults over MCP; this PR makes that data visible and clearable from the Management UI without needing an MCP-aware client. New page at /memory ("Routing Memory" in the sidebar) shows three blocks for the logged-in caller: (1) Learned pack defaults ranked by call count, each row carrying the common_inputs chips the chat agent pre-fills from; (2) Learned pipeline defaults — same but per pipeline; (3) Recent activity — the last 200 audit rows with {kind, id, outcome, when, learn_inputs}. Every row gets a forget button (per-pack-id / per-pipeline-id / per-exact-key), each defaults section has a "Clear all packs / pipelines" button, and the header has a global "Clear all history". Backed by two new REST endpoints — GET /api/v1/memory/defaults (the projection + recent rows) and POST /api/v1/memory/forget with scope body — both wired in internal/api/memory.go. The forget endpoint accepts the same scope vocabulary as the helmdeck.memory_forget pack (all / packs / pipelines / pack:<id> / pipeline:<id>) plus a new key:<exact-key> scope that backs the per-row buttons. Auth-disabled deployments resolve the caller to "unknown" (matching packs.callerFromContext's convention) so audit rows remain queryable. Memory-disabled deployments return an empty payload + an explanatory note; forget is a soft-success no-op.
- helmdeck.route meta-pack with gap analysis (ADR 047 PR #3). PR #1 made the catalog self-describing; PR #2 gave it per-caller memory; PR #3 is the LLM-backed router that fuses both into a single call the agent makes BEFORE picking a pack/pipeline. Inputs: user_intent (the user's natural-language request) + model. Internally builds the same routing-guide projection PR #1 ships at helmdeck://routing-guide plus the same defaults projection PR #2 ships at helmdeck://my-defaults (now factored into a reusable packs.Defaults projection both surfaces share) and packs them into one model prompt.
Returns a structured {recommendation, alternatives[<=3], gap_warning, reasoning, model} JSON object: recommendation is the best fit with suggested_inputs pre-filled from learned defaults; alternatives are runners-up; gap_warning is populated with a structured pack proposal (name, input_schema, output_schema, integration_pattern, why_useful) when nothing in the catalog fits. The agent confirms with the user, then either runs the recommendation or files the gap as a GitHub issue. Hallucination guard: if the model returns an id that doesn't exist in the catalog, the handler demotes the recommendation and surfaces a gap_warning so the agent can't dispatch to nothing. Audit IS recorded for helmdeck.route itself — "how often is the router called and what does it route to" is exactly the meta-signal PR #4's management UI surfaces. Registered in main.go with the existing vision dispatcher + pack registry + a thin *pipelines.Store adapter; degrades to "no dispatcher" CodeInternal when no gateway is configured. SKILL.md teaches the agent to prefer helmdeck__route over reading helmdeck://routing-guide directly for multi-step requests.
- Per-caller audit memory +
helmdeck://my-defaults + helmdeck.memory_forget (ADR 047 PR #2). PR #1 made the catalog self-describing; PR #2 turns every pack and pipeline run into a learning event so a fresh conversation can pre-fill from past use instead of starting from zero. *packs.Engine.Execute and *pipelines.Runner.RunSync now write one audit row per terminal outcome — pack name (or pipeline ID + run ID), outcome (ok or the closed-set error code), duration, and a tiny learn_inputs map containing the most useful low-cardinality string fields (persona, audience, angle, model, theme, voice, title, author, kind, format, persona_used). Markdown bodies, URLs, raw queries are dropped at write time — audit memory is for routing hints, not data retention. Rows live under the caller's bare namespace (just callerFromContext(ctx), not session-scoped) so learning spans sessions for the same authenticated subject. helmdeck://my-defaults is a new always-listed MCP resource that aggregates a caller's recent audits into top-N packs + top-N pipelines, each with a common_inputs map ("most-used value per field"). The agent's contract is to peek here before asking the user for inputs that have learned defaults: pre-fill and confirm rather than re-ask from scratch. Empty arrays mean no history; ask normally. helmdeck.memory_forget is the cleanup half — a pack the agent calls when the user says "forget my defaults" with scope=all (or packs, pipelines, pack:<id>, pipeline:<id> for targeted resets). Targets only audit rows (categories pack_history / pipeline_history); never touches pack caches or vault credentials. Audit rows otherwise expire automatically after 30 days via memory.WithTTL. Memory-disabled deployments: every surface degrades gracefully — audit is a no-op, my-defaults returns an empty payload with a note, forget is a soft-success no-op. SKILL.md teaches the agent to query my-defaults before asking. Sets up PR #3 (helmdeck.route meta-pack with gap analysis) and PR #4 (memory-management UI in web/).
- Self-describing routing metadata +
helmdeck://routing-guide MCP resource (ADR 047 PR #1). Today every pack and pipeline is a schema with no machine-readable hint about when to use it vs. an alternative — the agent has to read SKILL.md prose and infer. Packs and pipelines now carry an additive metadata block (accepts / produces / intent_keywords / typical_use / limitations, plus supersedes on pipelines for doc-rewrite-blog → doc-ground-blog-style transitions) populated on 10 packs (blog.rewrite_for_audience, content.ground, slides.outline, doc.parse, web.scrape, research.deep, podcast.generate, swe.solve, github.get_issue, hyperframes.compose) and 5 pipelines. A new always-on MCP resource at helmdeck://routing-guide returns a thin catalog projection — policy text (6-step pack-vs-pipeline decision flow) + per-entry {id, title, description, metadata} for packs and pipelines, with empty metadata collapsed off the wire so the resource stays compact. Cooperates with the existing full helmdeck://packs catalog rather than replacing it: clients fetch routing-guide once per turn to pick, then pull the full schema for the chosen entry. SKILL.md gets a one-paragraph routing tip ("for any multi-step request, query helmdeck://routing-guide first"). Lays the groundwork for memory audit hooks (PR #2) and a helmdeck.route meta-pack with gap analysis (PR #3).
- Persona + audience + angle + outline-export + image-prompts on all seven slide pipelines.
slides.outline already accepted a persona input, but none of the slide pipelines (grounded-deck, grounded-narrate, research-deck, research-narrate, research-ground-deck, scrape-deck, repo-presentation) forwarded it — so every deck defaulted to the generic general register. All seven now thread persona / audience / angle / title / author plus two new opt-in flags through to the outline step. Persona vocabulary now matches blog: general / technical / marketing / executive / educational / academic (last one new on the slides side). Each persona's directive is enriched to drive slide content, not just tone — technical asks for fenced code blocks and mermaid diagrams; educational for a "Try this" slide; marketing for scannable bullets + CTA; executive for numbers + decisions; academic for hedged language and an open-questions closing. export_outline: true persists the final Marp markdown as an outline.md artifact alongside the PDF/MP4 so the user can review or edit the structure and re-render. include_image_prompts: true asks the model to embed <!-- image_prompt: A flowchart showing… --> comments in speaker notes AND a handler post-pass emits a structured image_prompts: [{slide_index, prompt}] array on the outline-step output — visible inline (Marp presenter view), available structured for downstream image-gen tools. SKILL.md teaches the agent to ask for persona + audience + angle on slide pipelines, mirroring the blog picker.
- builtin.brief-rewrite-blog. Closes the rewrite-blog matrix for pasted user input. Takes a brief (markdown — a title idea + hook + what-to-cover + audience pitch — not a finished draft) and runs it through blog.rewrite_for_audience to expand into an original post, then content.ground (citation-only) and blog.publish. Inputs: brief, audience, angle?, persona?, title.
Use this when the user pastes ideas/outline notes; the matrix is now: brief paste → brief-rewrite-blog, PDF/DOCX → doc-rewrite-blog, web page → scrape-rewrite-blog, research query → research-rewrite-blog.
- persona input on blog.rewrite_for_audience and content.ground. Without it, every blog rewrite defaulted to a formal-academic register even when the audience was developers, marketers, or executives. Both packs now accept a closed-set persona (general / technical / marketing / executive / educational / academic — same vocabulary as slides.outline) that injects a tone+register+length directive into the system prompt. Unknown keys are treated as freeform tone hints (e.g. "crisp newsroom"). Persona is threaded through all four blog pipelines (brief-rewrite-blog, doc-rewrite-blog, scrape-rewrite-blog, research-rewrite-blog); each pack echoes persona_used on output. In content.ground, persona only affects the rewrite (rewrite:true) path — citation-only mode preserves voice by design. SKILL.md teaches the agent to ask for persona alongside audience and angle.
- builtin.scrape-rewrite-blog and builtin.research-rewrite-blog. Mirrors of the doc-rewrite-blog swap shipped earlier — for borrowed sources from a web page (scrape-rewrite-blog) or a deep-research query (research-rewrite-blog), the pipeline now runs the source through blog.rewrite_for_audience before publishing instead of saving the cited synthesis verbatim. Both gain audience and angle inputs; the existing builtin.grounded-blog (which takes the user's OWN markdown as input) is unchanged because it should preserve the user's voice, not rewrite it. SKILL.md gains a small picker table so the agent reaches for the right pipeline by source type.
- blog.rewrite_for_audience pack + builtin.doc-rewrite-blog pipeline. The old builtin.doc-ground-blog chain (doc.parse → content.ground → blog.publish) produced a citation-strengthened transcription of the source — useful as research notes, but as a blog post it read as republishing someone else's work.
The new pack translates a source document into an ORIGINAL post for a stated audience and angle: it leads with why-it-matters, de-jargons the source's terms, connects them to tools the audience uses, and adds an explicit "Author's note" — staying grounded in source_content (the system prompt forbids claims not present in the source). The new pipeline chains it after doc.parse and follows with content.ground (citation-only) as a post-rewrite citation pass. Inputs: source_url, audience, angle?, title. SKILL.md instructs the agent to ask the user for audience+angle before running (defaults exist but produce bland output).
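The code-block overflow rule from the slides.outline auto-split entry in this list can be sketched in Go as follows. This is a minimal illustration, not the pack's implementation: chunkLen and splitCodeLines are hypothetical names, and two behaviours are assumptions rather than documented facts: ties between candidate breaks prefer the earlier one, and a break found after the cut lets a chunk run up to 3 lines over the threshold. Only the 22-line threshold and the ±3-line blank-line preference come from the entry.

```go
package main

import (
	"fmt"
	"strings"
)

const (
	maxLines = 22 // overflow threshold quoted in the entry
	window   = 3  // how far from the hard cut a blank line may sit
)

// chunkLen picks how many lines go into the next chunk. It starts at the
// hard cut (maxLines) and looks outward, nearest first, for a chunk that
// would end on a blank line. Ties prefer the earlier position (assumption).
func chunkLen(lines []string) int {
	for d := 0; d <= window; d++ {
		for _, n := range []int{maxLines - d, maxLines + d} {
			if n >= 1 && n < len(lines) && strings.TrimSpace(lines[n-1]) == "" {
				return n
			}
		}
	}
	return maxLines
}

// splitCodeLines splits an over-long code block into continuation chunks.
// Blocks that already fit come back as a single chunk, so the pass is
// idempotent on decks that need no splitting.
func splitCodeLines(lines []string) [][]string {
	var chunks [][]string
	for len(lines) > maxLines {
		n := chunkLen(lines)
		chunks = append(chunks, lines[:n])
		lines = lines[n:]
	}
	return append(chunks, lines)
}

func main() {
	code := make([]string, 50)
	for i := range code {
		code[i] = fmt.Sprintf("line %d", i+1)
	}
	code[19] = "" // a natural break two lines before the hard cut
	for i, c := range splitCodeLines(code) {
		fmt.Printf("chunk %d: %d lines\n", i+1, len(c))
	}
}
```

A real post-pass would also reopen the fence on each continuation slide and attach the synthetic speaker note; those steps are left out to keep the sketch focused on where the cut lands.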
Removed
-
builtin.grounded-blog. Replaced by builtin.brief-rewrite-blog (above). The old pipeline ran content.ground (rewrite:true) → blog.publish on whatever markdown was pasted — but content.ground is an annotator, not a generator, so the output was always roughly the same length and shape as the input. A pasted brief came back as the brief plus a few [source] links — never a real blog post. The startup reaper deletes the persisted row on upgrade. Operators with finished drafts who want pure citation-strengthening (the one case grounded-blog WAS the right tool for) should call content.ground directly or helmdeck__pipeline-create a custom pipeline with rewrite:true.
-
builtin.scrape-ground-blog and builtin.research-blog. Same product issue as doc-ground-blog: they took a borrowed source and saved the cited synthesis verbatim, which reads as republishing rather than as an original blog post. Replaced by builtin.scrape-rewrite-blog and builtin.research-rewrite-blog (above). The startup reaper deletes the persisted rows on upgrade. Operators who depended on the raw cited-synthesis can recreate it via helmdeck__pipeline-create with the old shape.
-
builtin.doc-ground-blog. Replaced by builtin.doc-rewrite-blog (above). The old pipeline's output was an honest cited-rewrite (the description matched the mechanism), but the result wasn't a usable blog post — it cited the source's own claims without adding any perspective. Operators who depended on the raw cited-transcription should call doc.parse + content.ground (rewrite:true) directly via MCP, or helmdeck__pipeline-create a custom pipeline that matches the old shape. A new startup reaper in the pipeline store (PruneStaleBuiltins) deletes any persisted builtin=1 rows whose id is no longer in the current Builtins() set — so on the upgrade from a prior version, operators land on a clean catalog without running SQL by hand. User-created pipelines are never touched (the guard is the builtin column, not the id prefix).
-
Coding pipelines (beta) +
github.get_issue pack. Four new pipelines wrap swe.solve for the common coding workflows — builtin.issue-to-pr (read a GitHub issue → open a PR), builtin.repo-solve-pr (ad-hoc task → PR), builtin.repo-solve-branch (push without PR), builtin.repo-solve-patch (preview as a unified diff, no remote write). They appear under a new Coding section on the Pipelines page (output badge: Code) with a yellow beta tag rendered from a " (beta)" suffix on the pipeline name. A new lightweight github.get_issue pack — mirror of github.list_issues but filtering by {repo, issue_number} — feeds the issue's title + body into swe.solve's task field; it shares the same 5-minute read-through cache as list_issues so a rerun against the same issue doesn't re-hit the REST API. ADR 046 documents the policy plus a research-backed recommendation for the next coding-agent integration (Cline is the recommended v2; OpenHands needs a spike; Aider doesn't fit the pack contract; full SWE-agent isn't needed alongside mini).
-
Pipelines page is grouped by output format. The flat table of every built-in is gone; the page now renders one section per output category — Video / Slides / Blog / Podcast / Other — in a fixed order, with each row showing its output as a badge (
MP4 / PDF / MP3 / Blog). So "I want to make a video" is one heading and four rows away instead of a description-by-description scan. Category is inferred client-side from each pipeline's terminal pack (slides.render → PDF / Slides, slides.narrate & hyperframes.render → MP4 / Video, podcast.generate → MP3 / Podcast, blog.publish → Blog; anything else falls to "Other") — no SQL migration, no MCP wire change.
-
builtin.grounded-narrate and builtin.grounded-podcast pipelines. Mirrors of the existing builtin.grounded-deck / builtin.grounded-blog for the video and podcast outputs — a single markdown input is grounded against web sources via content.ground (so un-citable claims are marked skipped rather than silently passed through), then turned into a narrated MP4 (slides.outline → slides.narrate) or a multi-speaker MP3 (podcast.generate). Closes the matrix: paste a chunk of pre-researched notes and produce any of the four output formats in one call.
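The terminal-pack mapping from the "grouped by output format" entry above is small enough to show as a lookup. The entry says the inference happens client-side, so the real code lives in the UI; the Go below is only an illustration of the mapping itself. The category type and categoryFor are hypothetical names, and the empty badge for "Other" is an assumption (the entry does not say what badge those rows show).

```go
package main

import "fmt"

// category pairs the page section with the badge shown on each row.
type category struct {
	Section string
	Badge   string
}

// categoryFor maps a pipeline's terminal pack to its output category,
// following the mapping listed in the changelog entry.
func categoryFor(terminalPack string) category {
	switch terminalPack {
	case "slides.render":
		return category{Section: "Slides", Badge: "PDF"}
	case "slides.narrate", "hyperframes.render":
		return category{Section: "Video", Badge: "MP4"}
	case "podcast.generate":
		return category{Section: "Podcast", Badge: "MP3"}
	case "blog.publish":
		return category{Section: "Blog", Badge: "Blog"}
	default:
		// Badge for uncategorised pipelines is not specified; left empty here.
		return category{Section: "Other"}
	}
}

func main() {
	for _, pack := range []string{"slides.narrate", "blog.publish", "repo.push"} {
		c := categoryFor(pack)
		fmt.Printf("%-16s -> section=%s badge=%q\n", pack, c.Section, c.Badge)
	}
}
```

Because the category is derived from the definition rather than stored, adding a pipeline needs no migration; it lands in a section as soon as its last step is one of the known packs.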
Fixed
- doc.parse rejects non-document URLs upfront with a routing hint. The pack used to accept any source_url and forward it to Docling — so a Medium / blog / extension-less URL slipped through, Docling tried (and failed) to fetch + parse it, and the user got a cryptic pack_bug (e.g. docling 404: task result not found) instead of "wrong tool, here's the right one." The pack now closed-set-allowlists the URL's file extension at input validation: .pdf .docx .pptx .xlsx .odt .ods .odp .png .jpg .jpeg .tif .tiff (case-insensitive, query strings ignored). Anything else — web pages, arxiv abstract URLs (/abs/1706.03762), .epub, etc. — fails fast with a caller_fixable message that points to web.scrape for web pages and to source_b64 + filename for documents whose URL has no extension. Pack description rewritten to declare the same contract so MCP-listing agents pick the right pack first time. source_b64 path is unchanged (the filename requirement already carried the type discriminator).
- doc.parse against current Docling. Upstream Docling consolidated its /v1/convert/source request body from separate http_sources / file_sources arrays into a single discriminated sources array (each entry tagged by kind: "http" | "file" | "s3" | …). The pack lagged and was sending the old shape, so every call against a recent docling-serve image failed pack_bug with HTTP 422: missing body.sources. The pack now sends sources: [{kind, …}] matching the live OpenAPI schema; the existing happy-path tests are updated to assert the new shape and explicitly fail if either legacy field reappears.
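The extension allowlist from the first doc.parse fix can be sketched with the standard library alone. This is an illustration under assumptions, not the pack's code: isDocumentURL and allowedExt are hypothetical names, and treating an unparseable URL as "not a document" is a choice made here. The extension list, the case-insensitivity, and the ignored query string come from the entry.

```go
package main

import (
	"fmt"
	"net/url"
	"path"
	"strings"
)

// allowedExt is the closed set of extensions listed in the changelog entry.
var allowedExt = map[string]bool{
	".pdf": true, ".docx": true, ".pptx": true, ".xlsx": true,
	".odt": true, ".ods": true, ".odp": true,
	".png": true, ".jpg": true, ".jpeg": true, ".tif": true, ".tiff": true,
}

// isDocumentURL reports whether the URL's path ends in an allowlisted
// extension. url.Parse keeps the query string out of u.Path, so
// "?download=1" never affects the check.
func isDocumentURL(raw string) bool {
	u, err := url.Parse(raw)
	if err != nil {
		return false // assumption: unparseable input is rejected
	}
	return allowedExt[strings.ToLower(path.Ext(u.Path))]
}

func main() {
	for _, raw := range []string{
		"https://example.com/Report.PDF?download=1",
		"https://arxiv.org/abs/1706.03762",
		"https://example.com/book.epub",
		"https://example.com/blog/some-post",
	} {
		fmt.Printf("%-45s document=%v\n", raw, isDocumentURL(raw))
	}
}
```

The arxiv abstract URL is a useful edge case: path.Ext sees ".03762" as an extension, and a closed allowlist rejects it where a "has any extension" check would let it through.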
[0.21.0] - 2026-05-30
Theme: Pipelines you can see into, stop, and resize. Running runs now surface each step's live progress in the UI; a Cancel button (+ helmdeck__pipeline-cancel MCP tool, + REST) genuinely stops a wedged run by tearing down its session container; the runner auto-cleans in-flight runs orphaned by a control-plane restart; and CPU-bound packs (hyperframes.render, slides.narrate) declare a host-aware compute profile instead of inheriting the legacy 1-core default. Plus a new hyperframes.compose pack turns a plain-language description into a HyperFrames composition so callers no longer hand-author the data-* / window.__timelines contract.
Added
- CPU profiles for session packs. A pack now declares its workload class —
session.ProfileIO (the default, 1 core) or session.ProfileCompute (host-aware autodetect) — instead of a raw core count. The runtime resolves the compute profile to clamp(host_cores - 1, 1, 6) so an 8-core box gives a video render 6 cores instead of 1, and operators tune per-profile via HELMDECK_IO_CPU_LIMIT / HELMDECK_COMPUTE_CPU_LIMIT. hyperframes.render and slides.narrate (MP4 encode) migrate to ProfileCompute; every other session pack stays on the implicit ProfileIO default (no behavior change). On boot the control plane logs the resolved per-profile caps. New CPU-bound packs (and marketplace packs) pick a profile instead of reimplementing the host-aware math. See ADR 045 for the policy and docs/reference/hardware-sizing.md for operator-facing numbers.
- Running pipelines show live per-step progress and can be cancelled. The pipelines UI now renders each running step's latest progress message (e.g. compose "outlining…", render "rendering 1920×1080…/uploading…") inline beside its status badge, refreshed by the existing 3s poll — so a long run stops being a black box. A
Cancel button (and POST /api/v1/pipelines/{id}/runs/{runId}/cancel + the helmdeck__pipeline-cancel MCP tool) hard-stops a running or pending run: it fires the run's context cancel AND force-removes every session container tagged with the run's id (via a new helmdeck.run_id Docker label), so a stuck render frees its CPU within ~1-2s instead of waiting on the 30-min pipeline timeout. Partial output from the in-flight step is discarded by design. Already-terminal runs return 409 not_cancellable.
- hyperframes.compose + describe-a-video pipelines. A new pack turns a plain-language description into a HyperFrames HTML/CSS/JS composition ready for hyperframes.render — so callers no longer hand-author the data-* / window.__timelines contract. The pack guarantees the render contract (canvas sized to the aspect ratio, root scaffolding, a paused GSAP window.__timelines registration); the model writes only the creative visuals. Two one-call pipelines chain it: builtin.prompt-video (describe → compose → render, silent) and builtin.prompt-narrated-video (describe → podcast.generate → compose with the narration synced → render). podcast.generate now always emits audio_url (empty when no presigned store is configured) so the narrated pipeline degrades to a silent video instead of failing on a missing reference. builtin.html-video stays for agent-authored compositions, with its description/docs reworded to make clear the HTML is agent-authored, not hand-typed.
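The compute-profile formula from the CPU profiles entry, clamp(host_cores - 1, 1, 6), is worth seeing with a few host sizes. The sketch below covers only that arithmetic; clamp and computeCores are hypothetical names, and the HELMDECK_COMPUTE_CPU_LIMIT override is deliberately left out because the entry does not describe how it is parsed.

```go
package main

import (
	"fmt"
	"runtime"
)

// clamp restricts v to the inclusive range [lo, hi].
func clamp(v, lo, hi int) int {
	if v < lo {
		return lo
	}
	if v > hi {
		return hi
	}
	return v
}

// computeCores resolves the compute profile for a host: leave one core
// for the rest of the system, never go below 1, never above 6.
func computeCores(hostCores int) int {
	return clamp(hostCores-1, 1, 6)
}

func main() {
	for _, host := range []int{1, 2, 4, 8, 16} {
		fmt.Printf("host=%2d cores -> compute profile=%d\n", host, computeCores(host))
	}
	fmt.Printf("this machine (%d cores) -> %d\n", runtime.NumCPU(), computeCores(runtime.NumCPU()))
}
```

The lower bound matters on single-core hosts, where host_cores - 1 would otherwise resolve to zero; the upper bound keeps one render from monopolising a large box.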
Fixed
- Docker image pulls retry on transient failures.
Runtime.ensureImage would call docker pull exactly once — so a Docker Hub 429, a TLS handshake hiccup, or a transient connection reset on the way to a registry failed the whole session Create with no such image. CI runs that pulled alpine:3 for the runtime smoke test broke regularly on rate-limited shared runners. The pull now retries up to 3 times with a 0/2/4-second linear backoff, honoring ctx.Done() between attempts. Permanent errors (manifest unknown, unauthorized, denied, no such image) fail fast — the retry is for the transient class only.
- In-flight pipeline runs are no longer stuck on
running after a control-plane restart. A pipeline run's terminal status is written by the in-process goroutine that's executing it; that goroutine dies with its process, so a restart left every active run frozen at running in SQLite forever, with no way to clear them — Cancel even reported success because the API ack'd, but nothing in-process flipped the row. On boot the runner now scans pipeline_runs WHERE status IN ('pending','running') and reaps each one to failed with failure_class=transient and failure_reason="control plane restarted while this run was in flight", marking any in-flight steps inside the run the same way (so the UI's per-step badges aren't stuck either). Runs before the HTTP listener accepts requests, so there's no race with a live goroutine. Already-terminal runs are untouched; the reaper is idempotent.
- Pipeline MCP tools are now callable as
helmdeck__pipeline-run (and -list/-get/-create/-run-status/-rerun). They baked the helmdeck__ server prefix into their MCP tool names while pack tools are advertised bare — so namespacing MCP clients (OpenClaw, etc.) double-prefixed them to helmdeck__helmdeck__pipeline-run, making the documented name (the UI copy-prompt button, SKILL.md, the prompt templates) fail with "tool not found." Pipeline tools are now advertised bare (pipeline-run, …) like packs, so the client resolves them to helmdeck__pipeline-* exactly as documented. (MCP pipeline tools were previously only reachable via REST.)
- Built-in podcast pipelines run without a manually-supplied
model. builtin.research-podcast, builtin.repo-readme-podcast, and builtin.prompt-narrated-video chained podcast.generate in source-text mode (which writes the script via an LLM) but omitted the model field, so a real run failed caller_fixable ("model is required …"). They now default model to openrouter/auto like every other pipeline, so the run needs only its documented input (no model/voice to supply — speakers is already pre-set to ElevenLabs premade voices).
- podcast.generate prompt / source_url / source_text modes work in gateway deployments again. The pack was registered twice at startup — with the gateway dispatcher inside the gateway-gated block, then again with nil after it — and the registry is last-wins, so the nil registration clobbered the dispatcher one. Any prompt/source-mode call (the script-generating modes) then failed internal: registered without a gateway dispatcher. The nil/body-mode registration now runs before the gated block (same order as blog.publish), so the dispatcher version wins when a gateway is configured.
- Pipeline runner no longer threads a non-preserved session into later steps. It carried
_session_id forward after every session-producing step, including ones whose session is torn down at step end (PreserveSession: false). So builtin.prompt-narrated-video handed podcast.generate's already-dead session id to hyperframes.render, which failed session_unavailable: session not found (render needs its own hyperframes-sidecar session anyway). The runner now only threads a session forward from a pack that preserves it — e.g. repo.fetch → repo.map / fs.* / repo.push still chain correctly.
[0.20.0] - 2026-05-28
Theme: A more trustworthy agent surface. Pipelines reject unfilled {{PLACEHOLDER}} inputs instead of running with them; built-in pipeline descriptions say what the packs actually do (cite + save, not "rewrite + publish"); slides.outline guarantees a title slide and gains audience personas so decks open and close properly; and a new installable helmdeck-debug skill sweeps every pipeline + pack and drafts GitHub issues for what it finds.
Added
- Pipeline runs reject unfilled
{{PLACEHOLDER}} inputs. An input whose value is still a literal prompt-template variable (e.g. title={{TITLE}}, pasted from the prompt-template docs without substituting) now fails fast with a caller_fixable error that names the input and tells the agent to fill it — ask the user for a value, or propose one and confirm it — instead of silently running and producing a post titled {{TITLE}}.
- helmdeck-debug integration-debugger skill. A second installable agent skill (skills/helmdeck-debug/SKILL.md) that sweeps every pipeline + pack — a static check of the definitions (oversold descriptions, unguarded inputs, schema-vs-handler drift, failure misclassification) plus a live end-to-end run sweep classified by failure_class — and drafts a ready-to-file GitHub issue per real bug, confirming before it files anything. Both installers now ship it: scripts/configure-openclaw.sh installs every skills/*/SKILL.md, and the new scripts/configure-claude.sh installs them into a project's .claude/skills/.
- slides.outline guarantees a title slide and supports personas + an author byline. When title is provided, the pack deterministically prepends a title slide (with an optional author byline) if the model omitted one — and never duplicates one the model already wrote. A new persona input (general / technical / marketing / executive / educational, or any freeform audience string) injects an audience-appropriate tone and closing-slide directive (e.g. marketing → call-to-action, executive → the decision/ask). New outputs has_title_slide + persona_used. The strengthened prompt makes the opening/closing slides a hard requirement; SKILL.md now tells agents to ask the user for title/author/persona before generating.
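The unfilled-placeholder guard from the first entry in this list amounts to a pattern check over the run's inputs. The sketch below is one way to express it, not the runner's code: placeholderRe and unfilledInputs are hypothetical names, and the exact pattern (a value that is nothing but a single {{NAME}} token, with optional surrounding whitespace) is an assumption, since the entry only gives title={{TITLE}} as an example.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
)

// placeholderRe matches a value that consists solely of one template
// variable such as {{TITLE}} or {{ audience }}. The pattern is an assumption.
var placeholderRe = regexp.MustCompile(`^\s*\{\{\s*[A-Za-z_][A-Za-z0-9_]*\s*\}\}\s*$`)

// unfilledInputs returns the names of inputs whose value is still a
// literal placeholder, sorted so error messages are stable.
func unfilledInputs(inputs map[string]string) []string {
	var names []string
	for name, value := range inputs {
		if placeholderRe.MatchString(value) {
			names = append(names, name)
		}
	}
	sort.Strings(names)
	return names
}

func main() {
	inputs := map[string]string{
		"title":    "{{TITLE}}",
		"audience": "platform engineers",
		"angle":    "{{ ANGLE }}",
	}
	if bad := unfilledInputs(inputs); len(bad) > 0 {
		fmt.Printf("caller_fixable: fill these inputs before running: %v\n", bad)
	}
}
```

Anchoring the pattern to the whole value keeps legitimate content that merely mentions a template variable, such as a tutorial about templating, from being rejected.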
Changed
- Honest descriptions for the ground/blog built-in pipelines.
grounded-blog, scrape-ground-blog, doc-ground-blog, research-blog, grounded-deck, and research-ground-deck no longer say "fact-check + rewrite … publish." content.ground cites claims against web sources (and, in rewrite mode, strengthens the cited sentences) — it does not rewrite a post into a new voice or structure — and blog.publish saves a markdown artifact by default (publishing to Ghost requires cloning the pipeline with a credential + host). The descriptions and prompt-template docs now say so.
[0.19.1] - 2026-05-28
Fixed
- Pipelines page "Copy prompt" button now works over plain HTTP.
navigator.clipboard only exists in a secure context (HTTPS or localhost), so on a Management UI served over plain HTTP on a LAN host the button hit an undefined clipboard, threw, and was silently swallowed — nothing reached the clipboard. It now falls back to a hidden <textarea> + execCommand('copy') in non-secure contexts and reflects the real result (Copied / Copy failed), so it can never silently do nothing.
[0.19.0] - 2026-05-28
Theme: Repo presentations worth watching. builtin.repo-presentation (replacing repo-readme-narrate) builds a narrated deck from a repo's README plus its docs and code structure — not a paraphrase of the front page — backed by a new repo.fetch docs output.
Added
- repo.fetch now surfaces a docs output — concatenated markdown/adoc/rst from the repo's doc dirs (docs/, doc/, content/, …) plus top-level design docs (ARCHITECTURE.md, DESIGN.md, …), bounded to 16 KB with a path header per file (empty when the repo has none). Lets presentation/grounding pipelines ground on a project's real docs, not just its README.
Changed
- builtin.repo-readme-narrate replaced by builtin.repo-presentation. The old starter fed only the README to slides.outline, so a thin README produced a shallow deck. The new pipeline chains repo.fetch → repo.map → slides.outline → slides.narrate, building the deck from the README plus the repo's docs and code structure (repo.map's symbol map) — a fuller picture of what the project is and how it's built. Same repo_url input; the builtin.repo-readme-narrate id is gone.
[0.18.0] - 2026-05-28
Theme: Pipelines you can see and trust. The deck/narrate pipelines now turn prose into a real multi-slide deck via the new slides.outline pack — no more a whole README collapsing onto one slide and rendering a degenerate 7-second video — and the Management UI shows which pipelines are running plus a copy-paste agent prompt for each.
Added
- slides.outline pack — restates prose/markdown (a README, a research.deep synthesis, content.ground output) as a structured Marp deck: slides separated by ---, with titles, bullets, and <!-- speaker notes -->, ready for slides.render / slides.narrate. Bounded by a max_slides ceiling and a clamped completion-token budget, and it guarantees a multi-slide deck or fails invalid_input ("content too thin") rather than emitting a degenerate one-slide deck.
- Pipelines page (Management UI): live "running" indicators + a per-pipeline "Copy prompt" button. The
/pipelines page polls a new GET /api/v1/pipeline-runs (recent runs across all pipelines) and shows a pulsing running badge on any pipeline with an active run, plus an N running header count — so you can see what's executing without expanding each row. Each pipeline also gets a Copy prompt button that copies a ready-to-paste agent prompt (helmdeck__pipeline-run …) with a fill-in line per ${{ inputs.* }} the pipeline declares — generated from the live definition, so it can't drift from the actual inputs.
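Generating the "Copy prompt" fill-in lines from the live definition comes down to collecting every ${{ inputs.* }} reference. The sketch below shows that extraction step only. It is an illustration under assumptions: the button itself runs in the Management UI, inputRef and declaredInputs are hypothetical names, the sample definition is invented for the example, and the printed "name: <fill in>" layout is not the real prompt text.

```go
package main

import (
	"fmt"
	"regexp"
)

// inputRef captures the input name inside a ${{ inputs.NAME }} reference.
var inputRef = regexp.MustCompile(`\$\{\{\s*inputs\.([A-Za-z0-9_]+)\s*\}\}`)

// declaredInputs returns each referenced input name once, in first-seen order.
func declaredInputs(definition string) []string {
	seen := map[string]bool{}
	var names []string
	for _, m := range inputRef.FindAllStringSubmatch(definition, -1) {
		if name := m[1]; !seen[name] {
			seen[name] = true
			names = append(names, name)
		}
	}
	return names
}

func main() {
	// Hypothetical pipeline definition, shaped only for this example.
	def := `steps:
  - pack: slides.outline
    input: {title: "${{ inputs.title }}", content: "${{ inputs.markdown }}"}
  - pack: slides.render
    input: {title: "${{inputs.title}}"}`

	for _, name := range declaredInputs(def) {
		fmt.Printf("  %s: <fill in>\n", name)
	}
}
```

Deriving the list from the definition on every click is what keeps the prompt from drifting: an input added to a step shows up in the copied text without anyone updating a template.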
Changed
- Deck & narrate pipelines now structure prose into a real deck before rendering. grounded-deck, research-deck, research-narrate, research-ground-deck, scrape-deck, and repo-readme-narrate used to feed raw prose (a README, a synthesis, grounded text) straight into slides.render/slides.narrate, which split slides only on ---, so prose with no --- collapsed onto a single slide and produced a degenerate ~7-second silent video that still reported succeeded. They now insert a slides.outline step, so a README or synthesis becomes a genuine multi-slide deck, or fails legibly (caller_fixable) when the content is too thin. Podcast pipelines are unaffected (podcast.generate already turns source text into a multi-speaker script).
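The collapse described above follows directly from splitting only on --- lines. A minimal sketch of that behavior, assuming a simplified splitter that ignores Marp frontmatter (the real renderer's parsing is not described in this changelog):

```python
def split_slides(markdown: str) -> list[str]:
    """Split a deck on lines that are exactly '---' (simplified: no frontmatter handling)."""
    slides: list[str] = []
    current: list[str] = []
    for line in markdown.splitlines():
        if line.strip() == "---":
            slides.append("\n".join(current).strip())
            current = []
        else:
            current.append(line)
    slides.append("\n".join(current).strip())
    # Drop empty chunks produced by leading or trailing separators.
    return [slide for slide in slides if slide]


raw_prose = "# My project\n\nA long README paragraph.\n\nAnother paragraph."
outlined = "# My project\n\n- point\n\n---\n\n# Install\n\n- step\n\n---\n\n# Usage\n\n- example"

if __name__ == "__main__":
    print(len(split_slides(raw_prose)), len(split_slides(outlined)))
```

Raw prose yields a single slide no matter how long it is, which is why an outline step that emits explicit separators is needed before rendering.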
[0.17.2] - 2026-05-28
Theme: Honest failures (pipeline runs attribute failures correctly). A malformed input or a still-booting overlay no longer masquerades as a helmdeck pack_bug you should file an issue for: input problems are caller_fixable, and overlay-backed packs ride out a cold start instead of failing the first call. Also fixes the v0.17.1 tts_chars schema regression that broke every slides.narrate/podcast.generate run.
Changed
- Overlay-backed packs now retry a still-booting service instead of failing on the first hit. research.deep/content.ground/web.scrape (Firecrawl) and doc.parse (Docling) wrap their HTTP round-trip in a bounded cold-start retry: a connection-refused/reset or a 502/503/504 is treated as "still starting" and retried with exponential backoff (4 attempts, ~3.5s worst case). So the first pack or pipeline call from the OpenClaw chat UI after the stack (or an individual overlay) comes up waits a few seconds for readiness instead of surfacing a failed run. Genuine application errors (4xx/500) and successes return immediately and unchanged, so the pipeline failure classifier behaves exactly as before once the service is actually up.
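The retry policy above can be sketched as follows. The 0.5s base delay is an assumption chosen because doubling it across the three waits between four attempts gives 0.5 + 1 + 2 = 3.5s, which matches the stated worst case; the function and parameter names are hypothetical and this is not the project's Go implementation.

```python
import time
from typing import Callable, Tuple

# Statuses treated as "still starting", per the changelog entry.
STILL_STARTING = {502, 503, 504}


def call_with_cold_start_retry(
    do_request: Callable[[], Tuple[int, str]],
    attempts: int = 4,
    base_delay: float = 0.5,  # assumed value, see lead-in
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Retry connection refused/reset and 502/503/504; return everything else at once."""
    for attempt in range(attempts):
        is_last = attempt == attempts - 1
        try:
            status, body = do_request()
        except (ConnectionRefusedError, ConnectionResetError):
            if is_last:
                raise
        else:
            # Successes and genuine application errors (4xx/500) are returned unchanged.
            if status not in STILL_STARTING or is_last:
                return status, body
        sleep(base_delay * 2 ** attempt)
    raise RuntimeError("attempts must be at least 1")
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting; a caller would normally leave the default.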
Fixed
- slides.narrate/podcast.generate failed every real run with invalid_output: field "tts_chars": expected number, got object (regression from v0.17.1). #299 declared the tts_chars cost-output field as number, but both handlers emit a per-speaker/per-slide breakdown map (with a _total key, see computeTTSChars/computeSlideTTSChars). The engine validates handler output against the declared OutputSchema on every Execute, so the mismatch failed slides.narrate, podcast.generate, and any pipeline using them (e.g. builtin.repo-readme-narrate). Corrected the declaration to object. The unit tests missed it because they call the pack handler directly, bypassing the engine's output validation; a new output-schema contract test now validates each pack's real output against its declared schema, so this class of drift fails in CI.
- research.deep reported "no usable sources" as a pack_bug, telling callers to file a GitHub issue for a refine-your-query situation. When a Firecrawl search yields zero usable sources (query too long/narrow/obscure, or every result unscrapable), the pack returned handler_failed, which the pipeline classifier maps to pack_bug, even though the error message itself advised refining the query. It now returns invalid_input, so pipelines (e.g. builtin.research-blog) classify it caller_fixable: shorten/refocus the query and re-run, no issue to file. helmdeck searched fine; the query just didn't match anything usable.
- hyperframes.render reported a malformed composition as a pack_bug. A composition missing data-composition-id/data-width/data-height, an unregistered window.__timelines, or an output preset whose orientation doesn't match the composition's dimensions made the hyperframes CLI exit non-zero, which the pack returned as handler_failed → the pipeline classifier (e.g. builtin.html-video) labeled it pack_bug and told callers to file a GitHub issue for "fix your HTML." It now classifies the known caller-input signatures as invalid_input → caller_fixable; genuine render/encode failures (browser crash, ffmpeg error) stay handler_failed.
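The two fixes above share one mechanism: the pack picks an error code, and the pipeline classifier maps that code to a failure class. A minimal sketch of both halves, using only the two mappings this changelog states. The "unclassified" fallback and the stderr signature strings are hypothetical, since the real CLI messages are not quoted here.

```python
# Mappings stated in the changelog; other codes are deliberately left out.
CODE_TO_CLASS = {
    "invalid_input": "caller_fixable",
    "handler_failed": "pack_bug",
}

# Hypothetical substrings standing in for the "known caller-input signatures".
CALLER_INPUT_SIGNATURES = (
    "data-composition-id",
    "data-width",
    "data-height",
    "window.__timelines",
)


def classify_failure(error_code: str) -> str:
    """Map a pack error code to a failure class; unknown codes get a placeholder."""
    return CODE_TO_CLASS.get(error_code, "unclassified")


def render_error_code(stderr: str) -> str:
    """Pick the error code for a non-zero render exit based on its stderr text."""
    if any(signature in stderr for signature in CALLER_INPUT_SIGNATURES):
        return "invalid_input"
    return "handler_failed"


if __name__ == "__main__":
    print(classify_failure(render_error_code("missing data-width attribute")))
    print(classify_failure(render_error_code("ffmpeg exited with code 1")))
```

Keeping the decision in the pack, where the error text is available, means the classifier itself stays a plain lookup.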
[0.17.1] - 2026-05-28
Theme: Fresh-stack reliability — persistent repos, grounded decks, and slide rendering now work on a clean install, and the test-suite gaps that let those bugs ship green are closed (every Docker/integration test now runs in CI, gated against silent skips).
Changed
- blog.publish now renders mermaid diagrams to inline SVG server-side (default mermaid: true): mermaid fenced blocks in a markdown body are pre-rendered via mmdc (the same renderer slides.render uses) into <img src="data:image/svg+xml;base64,…">, so diagrams show reliably on Ghost (any theme), in email, RSS, and plain-markdown readers, with no client-side MermaidJS required. Set mermaid: false to keep the previous client-render behavior. As a result blog.publish now runs with a session (NeedsSession: true) to reach mmdc; each publish acquires a short-lived sidecar.
Fixed
- Persistent repo.fetch (ADR 040) failed with mkdir: cannot create directory '/repos/…': Permission denied. The helmdeck-repos volume was root-owned, but the session sidecar runs as uid 1000 and the control-plane janitor as uid 65532; neither could create clone directories under it, so every persistent clone (e.g. builtin.repo-readme-narrate/-podcast) failed. The session runtime now makes /repos world-writable on first use (a throwaway root container), with a repos-init compose one-shot as belt-and-suspenders, so it works for any deployment, not just Compose. The persistent clone also runs umask 000 so the janitor (a different uid) can GC clones. Covered by a new Docker integration CI job that runs the //go:build integration suite (which exercises this exact clone-into-/repos path but wasn't previously run in CI).
- Pipeline-driven repo.fetch clones all collided in /repos/unknown. StartRun executes on a detached context that dropped the caller subject, so persistent clones weren't namespaced per caller. The runner now threads the caller (StartRun/Rerun carry it and re-attach it via packs.WithCaller), so a pipeline started by alice clones into /repos/alice/… like a direct pack call.
- slides.render still clipped tall mermaid diagrams by ~39px in PDF/PPTX, the non-scrolling formats #280's auto-fit was meant to protect. The mermaid cap was max-height: 70vh (504px on a 720px slide), but a slide also carries its heading plus Marp's ~255px section padding, so a top-down diagram + chrome overflowed. Lowered the cap to 60vh, leaving headroom even for a two-line title. The integration suite's geometric overflow check (TestSlidesFit_NoSectionOverflow) had been silently skipping (it required a playwright module the sidecar doesn't ship), so it never caught this; it now runs on Marp's bundled puppeteer-core (the same Chromium that prints the PDF) and asserts zero section overflow, so the clip can't regress unnoticed.
- slides.narrate and podcast.generate now declare their cost-transparency outputs (tts_chars, estimated_cost_usd, estimated_cost_breakdown) in their OutputSchema. The handlers already emitted them; declaring them fixes catalog/schema drift so agents and pipeline authors can see and reference the cost fields.
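Declaring outputs only helps if the declared type matches what the handler emits. The sketch below shows the kind of check an output-schema contract test can make, using a flat field-to-type schema. The schema shape and helper names are assumptions for illustration; only the type vocabulary and the error message format are taken from this changelog.

```python
from typing import Any


def json_type(value: Any) -> str:
    """Name a Python value using JSON-style type names."""
    if isinstance(value, bool):  # checked first: bool is a subclass of int
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, str):
        return "string"
    if isinstance(value, dict):
        return "object"
    if isinstance(value, list):
        return "array"
    return "null" if value is None else type(value).__name__


def validate_output(output: dict[str, Any], schema: dict[str, str]) -> list[str]:
    """Return one error per present field whose value does not match its declared type."""
    errors = []
    for field, expected in schema.items():
        if field in output and json_type(output[field]) != expected:
            errors.append(
                f'field "{field}": expected {expected}, got {json_type(output[field])}'
            )
    return errors


# A per-speaker breakdown map with a _total key, as the handlers emit.
handler_output = {"tts_chars": {"_total": 120, "host": 80, "guest": 40}}

if __name__ == "__main__":
    print(validate_output(handler_output, {"tts_chars": "number"}))
    print(validate_output(handler_output, {"tts_chars": "object"}))
```

Running the check against real handler output, rather than against a hand-written fixture, is what lets this class of drift fail in CI.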
[0.17.0] - 2026-05-28
Theme: Legible, recoverable failures — agents and operators can tell why a run failed and what to do: actionable model errors with a model catalog to pick from, and pipeline failure attribution with one-call re-run.
Added
- helmdeck://models MCP resource (ADR 043): lists the chat-completion models the gateway can route to right now, as full provider/model IDs (e.g. openrouter/minimax/minimax-m2.7). Agents read it to pick a valid model for any pack's model input instead of guessing one that fails. Mirrors helmdeck://voices / helmdeck://image-models. (#293)
- Legible pipeline failures + re-run (ADR 044, slice 1): when a pipeline run fails, each failed step is now attributed with a typed error_code, a failure_class, and a one-line failure_reason saying what to do. The failure_class is one of caller_fixable (the inputs/model given were wrong; fix and re-run), pack_bug (a code error in helmdeck; the reason includes a prefilled GitHub issue link to file), transient (environment blip; re-running may work), or state_changed. Surfaced in GET …/runs/{runId}, the helmdeck__pipeline-run-status tool, and the Management UI /pipelines run view (failure-class badge + "Report bug" link). Plus a one-call re-run: POST /api/v1/pipelines/{id}/runs/{runId}/rerun, the helmdeck__pipeline-rerun tool, and a "Re-run" button. Resume-from-failed-step and auto-retry are the next slice. (#294)
- Pipeline run records now list each step's artifacts: a step's produced files (keys/URLs) are captured on the run, so run-status and the /pipelines UI show what each step emitted (previously only the final output JSON was visible). (#292)
Fixed
- A bad/unroutable model now returns invalid_input with an actionable hint, not an opaque handler_failed. Calling an LLM pack (content.ground, research.deep, blog.publish prompt mode, web.test) with a model the gateway can't route (e.g. minimax/… when MiniMax is only reachable as openrouter/minimax/…) used to fail as handler_failed: … unknown provider: minimax: unknown provider: minimax (a non-recoverable code, with a doubled message). It now returns invalid_input pointing at the helmdeck://models resource, so the agent retries with a valid model instead of hallucinating another. The doubled message is gone. (#293)
[0.16.0] - 2026-05-27
Theme: Correctness + housekeeping — grounding stops truncating long slide decks, artifacts become deletable on demand, and the email.send pack lands.
Added
- email.send pack (helmdeck__email-send): send a transactional email via Resend. Required input to; optional from, subject, html, cc, bcc, reply_to; returns a message_id. Vault credential resend-api-key. Brings the in-tree catalog to 44 packs. (#289)
- Prompt-template reference pages at /reference/prompt-templates/: a copy-and-fill {{VARIABLE}} prompt for every built-in pack and pipeline, kept current by a contributor convention. (#288)
- Manual artifact deletion: DELETE /api/v1/artifacts/{key} plus a delete (trash) button in the Management UI Artifact Explorer remove a single artifact on demand. Previously the only delete path was the TTL janitor (default 7-day age-out); operators can now reclaim space immediately. Delete is idempotent: a missing key still returns 204. (#290)
Fixed
- content.ground no longer truncates or drops content during the optional rewrite. The full-document rewrite was hard-capped at 2048 output tokens, so a long input (e.g. a 20–25 slide deck) was silently cut off mid-document and every slide past the cap vanished. The rewrite's completion budget now scales with the input size (capped at 8192 tokens); a response that still hits the token ceiling is discarded in favor of the structure-preserving citation-only version; and the rewrite prompt is instructed to preserve every --- slide separator and slide count. grounded_text is now always present in the output (equal to the input when no claims were grounded), so pipeline steps wiring ${{ steps.<id>.output.grounded_text }} never fail on an unresolved reference. (#290)
- builtin.grounded-deck and builtin.research-ground-deck now ground decks with citations only (rewrite: false) rather than a full prose rewrite, which reflowed and collapsed slide structure. Blog-oriented pipelines (grounded-blog, scrape-ground-blog, doc-ground-blog) keep rewrite: true and are protected by the truncation guard above. (#290)
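The truncation guard in the content.ground fix combines a scaled budget with a fallback. The sketch below shows one way to express both. The 8192 cap comes from the entry above; the 2048 floor (the old cap), the 1.25 headroom factor, and the function names are assumptions, and "length" is the provider finish reason that signals a response hit its token ceiling.

```python
def rewrite_budget(input_tokens: int, floor: int = 2048, ceiling: int = 8192) -> int:
    """Scale the completion budget with input size, clamped to [floor, ceiling]."""
    scaled = int(input_tokens * 1.25)  # assumed headroom for added citations
    return max(floor, min(ceiling, scaled))


def choose_grounded_text(rewrite: str, finish_reason: str, citation_only: str) -> str:
    """Discard a rewrite that hit the token ceiling and keep the citation-only text."""
    if finish_reason == "length":
        return citation_only
    return rewrite


if __name__ == "__main__":
    print(rewrite_budget(1000), rewrite_budget(4000), rewrite_budget(100000))
    print(choose_grounded_text("cut off mid-", "length", "full text [1]"))
```

Falling back to the citation-only version trades polish for completeness: the output keeps every slide even when the rewrite could not fit.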
[0.15.0] - 2026-05-26
Theme: Pipelines as a first-class resource — a saved, runnable sequence of pack steps any actor can create, run, and watch.
Added
- Pipelines (ADR 041): a pipeline is a stored, named, ordered list of pack steps with ${{ steps.<id>.output.<field> }} / ${{ inputs.<name> }} templating and automatic _session_id threading. Ships as a runnable slice: SQLite-persisted definitions + run history, a sequential runner reusing the pack engine, REST CRUD + async run + run-history at /api/v1/pipelines, the helmdeck__pipeline-{list,get,create,run,run-status} MCP tools (so agents create/run pipelines conversationally), ~13 auto-seeded built-in starters (grounded deck/blog, research→{deck,podcast,blog}, scrape→ground→blog, clone-a-repo→narrated-deck/podcast, …), and a Management UI /pipelines panel to list, run with JSON inputs, and watch run status/history poll live. Migration 0007_pipelines.sql (additive). (#283, #284)
- podcast.generate now surfaces a presigned audio_url in its output, unlocking a clean podcast.generate → hyperframes.render narrated-video chain (embed the URL in the composition's <audio src>). (#283)
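To make the templating in the Pipelines entry concrete, here is a small resolver for the two reference forms it names. It is a sketch under stated assumptions: the accepted identifier characters, the string-only substitution, and the error behavior are guesses, not the runner's documented semantics.

```python
import re

# ${{ inputs.<name> }} or ${{ steps.<id>.output.<field> }}
REFERENCE = re.compile(r"\$\{\{\s*(inputs|steps)\.([A-Za-z0-9_.\-]+?)\s*\}\}")


def resolve(template: str, inputs: dict, step_outputs: dict) -> str:
    """Substitute references; raise KeyError when a reference cannot be resolved."""

    def substitute(match: re.Match) -> str:
        kind, path = match.group(1), match.group(2)
        if kind == "inputs":
            return str(inputs[path])
        parts = path.split(".", 2)
        if len(parts) != 3 or parts[1] != "output":
            raise KeyError(path)
        step_id, _, field = parts
        return str(step_outputs[step_id][field])

    return REFERENCE.sub(substitute, template)


if __name__ == "__main__":
    text = "map ${{ steps.fetch.output.clone_path }} from ${{ inputs.repo_url }}"
    print(resolve(
        text,
        inputs={"repo_url": "https://example.com/repo.git"},
        step_outputs={"fetch": {"clone_path": "/tmp/repo"}},
    ))
```

A sequential runner can call a resolver like this before each step, passing the outputs of the steps that have already finished.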
Fixed
- slides.render and slides.narrate no longer clip oversized mermaid diagrams or wide tables off the fixed Marp slide canvas: a theme-independent auto-fit <style> scales diagrams/images down (max-height, object-fit:contain) and shrinks-to-fit tables (table-layout:fixed + wrapping). Applies to PDF/PPTX (which can't scroll) across curated and built-in themes. (#280, #282)
[0.14.0] - 2026-05-26
Theme: Autonomous code-fix (swe.solve) lands end-to-end, the Universal Memory layer and persistent repos ship as default-off-but-on-by-default seams, and ADR 037 upstream pinning is fully enforced across every sidecar.
Added
- swe.solve: an autonomous code-fix pack. Give it a repo_url + task and it runs a mini-swe-agent loop inside a session sidecar to produce a reviewable change. mode selects the output: patch (diff + trajectory, no push), branch, or pull_request. The agent never sees git or AI-gateway credentials (vault-injected), never pushes to the default branch, and every run uploads a replayable trajectory artifact to the object store. Built on a HelmdeckEnvironment adapter (mini-swe-agent's Environment contract routed through cmd.run). (#265, #271, #233 Phases 1/3/4)
- GitHub-issue auto-trigger for swe.solve (ADR 033): the webhook receiver now handles issues/issue_comment events. Label an issue and helmdeck opens a PR, then posts the result back as an issue comment. HMAC-verified, label-gated, dispatched on a detached context. (#277, #233 Phase 6)
- Universal Memory delivery layer (ADR 039): an ec.Memory engine seam giving packs transparent, per-caller, namespace-scoped memory, with a declarative read-through cache (Pack.Memory{Cache,TTL}; github.list_issues is the first exemplar) and Context() aggregation. Backed by a pluggable MemoryStore (SQLite default, AES-256-GCM at rest). Memory is durable by default: the installer now generates HELMDECK_MEMORY_KEY. (#272, #278, epic #254: #255/#256/#257/#258/#260)
- Persistent repos volume (ADR 040): repo.fetch (and swe.solve) clone into a per-caller path on a shared helmdeck-repos volume and git fetch instead of re-cloning on a repeat, with a persistent per-language dependency cache (.hdcache) and a GC janitor (TTL + size cap). Default-off (no volume ⇒ ephemeral /tmp); enabled by default in the bundled Compose. New repo.fetch output fields reused/persistent. (#274, #259)
- New /reference/agent-memory and repo.fetch persistent-clones docs; the "Clones aren't browser state" design post.
- The in-tree pack catalog grows to 43 (adds swe.solve, github.post_comment).
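For readers unfamiliar with the "HMAC-verified, label-gated" wording in the swe.solve auto-trigger entry, the sketch below shows both checks against GitHub's webhook conventions: the X-Hub-Signature-256 header carries "sha256=" followed by the hex HMAC-SHA256 of the raw request body, and a labeled issue event carries the label name in its payload. The trigger label value and function names are hypothetical; helmdeck's receiver is written in Go and is not reproduced here.

```python
import hashlib
import hmac


def verify_github_signature(secret: bytes, body: bytes, signature_header: str) -> bool:
    """Check the X-Hub-Signature-256 value against the raw request body."""
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected, signature_header)


def is_trigger(event: str, payload: dict, trigger_label: str) -> bool:
    """Label gate: only a 'labeled' issues event with the trigger label passes."""
    return (
        event == "issues"
        and payload.get("action") == "labeled"
        and payload.get("label", {}).get("name") == trigger_label
    )


if __name__ == "__main__":
    secret, body = b"webhook-secret", b'{"action": "labeled"}'
    header = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    print(verify_github_signature(secret, body, header))
```

Verifying the signature before parsing the payload keeps unauthenticated requests from reaching the label gate at all.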
Changed
- ADR 037 fully enforced: exact upstream version pins, Dependabot, CLI-surface sentinels, and docs across every sidecar Dockerfile, plus follow-up cleanups (drop marp --stdin, fix --html spec, pin the global playwright-mcp). (#240–#243, #264)
- The clients-smoke matrix builds the control-plane from source and its bridge leg is response-driven rather than sleep-timed. (#273)
Fixed
- clients-smoke no longer aborts a slow cold-sidecar screenshot via a blind sleep 30 then EOF; it polls for the reply and surfaces a real timeout distinctly. (#273)
- The GitHub webhook's async dispatch no longer borrows the request context (cancelled the instant the 200 returns), which would have killed any long-running dispatched pack. (#277)
[0.13.2] - 2026-05-23
Theme: Hot-patch for the v0.13.1 release that shipped without a control-plane image. No code-behavior changes, only the build pipeline that produces the image is unblocked.
Fixed
- web/ build now succeeds under Vite 8 + TypeScript 6 + lucide-react 1, restoring the Publish control-plane image step that failed silently on the v0.13.1 tag push. Dependabot PR #247 carried three breaking major bumps in one auto-merged group (Vite 6 → 8, TypeScript 5 → 6, lucide-react 0 → 1), each of which broke the web build. CI never exercised the failure because the CI workflow only builds the Go binary; only the Release workflow builds web/, so the regression was invisible until the v0.13.1 tag fired the release pipeline. Goreleaser binaries, helmdeck-bridge:0.13.1, and @helmdeck/mcp-bridge@0.13.1 on npm shipped fine; only ghcr.io/tosin2013/helmdeck:0.13.1 (the control-plane image) was missing. Three concrete fixes: (a) Vite 8 swapped Rollup for Rolldown, whose manualChunks only accepts the function form, so the declarative chunk-grouping moves to codeSplitting.groups, with the same two-chunk layout (react + query) preserved; (b) TypeScript 6 removed baseUrl, so paths are relative under ./src/* and a new web/src/vite-env.d.ts (/// <reference types="vite/client" />) restores side-effect CSS module resolution under TS 6's stricter rules; (c) lucide-react 1 dropped brand icons, so the GitHub-PAT preset swaps Github for GitBranch, a purely visual change (the preset label still names the system). (#250)
[0.13.1] - 2026-05-18
Theme: Post-v0.13.0 cleanup. No feature changes. Four post-release bugs found during v0.13.0 → v0.13.1 upgrade verification, each documented per-issue with a reproducer.
Fixed
- repo.fetch now surfaces session_id inside its output (not only on the response envelope), so follow-on packs (fs.*, cmd.run, git.*, repo.push) can find the value adjacent to clone_path. Without this, callers reading only output.clone_path missed the session_id on the envelope, then issued follow-up calls without _session_id, which made the engine spin up a fresh session whose /tmp did not contain the clone, surfacing as silent empty results (fs.list, repo.map) or cannot open errors (fs.read, cmd.run). New internal/packs/builtin/session_reuse_integration_test.go (build-tagged integration) pins the cross-pack session reuse contract against a real Docker daemon so this can't silently regress. (#232)
- deploy/compose/.env.example now documents HELMDECK_ELEVENLABS_API_KEY, HELMDECK_FAL_KEY, and HELMDECK_PEXELS_API_KEY. These keys have first-class vault auto-hydration but were absent from the example file an operator copies on first install, so the only way to discover them was via a CHANGELOG entry or a pack's "key not found" error message. (#229)
- HELMDECK_PEXELS_API_KEY now auto-hydrates into the credential vault under pexels-key on startup. The v0.13.0 stock.search CHANGELOG advertised this behavior but the entry in internal/vault/hydrate.go was missed. Operators who set the env var no longer have to POST a credential by hand to get the vault rotation/audit story working, and stock.search's credential: input override now resolves through the vault path as documented. (#230)
- compose.firecrawl.yml healthcheck for the firecrawl service now probes via node -e instead of wget. The upstream ghcr.io/firecrawl/firecrawl:latest image ships only node (no wget, no curl), so every prior healthcheck invocation hit exit 127 and the container reported unhealthy indefinitely despite serving traffic correctly. Real Firecrawl outages were invisible because the steady-state false negative looked identical to a real failure. (#231)
Changed
- Every npm/corepack package installed globally in deploy/docker/sidecar*.Dockerfile is now pinned to an exact ARG <NAME>_VERSION=x.y.z (no @latest, @stable, ^x.y, ~x.y). Affects @playwright/mcp, @mermaid-js/mermaid-cli, pnpm, yarn, typescript, ts-node, eslint, prettier, vitest, and the previously-caret-pinned hyperframes (now exact 0.6.7). T-2 of ADR 037's migration plan; together with the Dependabot config from #240, every pinned dep now has a delivery mechanism for proposed upgrades that runs the full CI matrix. No functional change: same versions, just declared explicitly so a typosquat or yanked release fails the build instead of shipping silently. (#213)
- CLI-surface sentinels split into two layers (T-3 of ADR 037). Catches the failure mode that motivated the ADR (an upstream flag rename or typosquat) at the earliest possible point. (#214)
  - Layer 1 (docker build-time): each sidecar Dockerfile runs cheap <tool> --version smoke checks after install. A yanked release or missing binary fails the image build before the artifact escapes.
  - Layer 2 (CI-time): new internal/packs/builtin/cli_surface_invariant_test.go (build-tagged integration) walks pack source via go/ast to extract every --flag string passed to a known sidecar binary, runs the binary's --help inside the built image, and asserts each extracted flag appears in the help output. The flag list is derived from Go pack source rather than hand-maintained, so adding a flag to a pack's argv automatically gets verified. A structured Skip allowlist handles deliberately-undocumented flags with a reason string. Covers marp (7 flags from slides_render.go) and hyperframes render (4 flags from hyperframes_render.go); adding a new sidecar binary takes one new cliSurfaceCase entry.
  - Discovered while building this: slides_render.go passes marp --stdin (silently accepted, not documented; marp reads stdin automatically when piped), and sidecar-entrypoint.sh:85 invokes @playwright/mcp@latest via npx, bypassing the pinned global install. Both are tracked as separate follow-ups.
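The Layer 2 check reduces to a set difference between the flags a pack passes and the flags the tool documents. The sketch below illustrates that comparison on plain strings. It skips the go/ast extraction and the in-image --help run, and the help excerpt, skip reason, and function name are made up for the example.

```python
import re
from typing import Optional

# A long flag: two dashes, then a lowercase letter, not embedded in a longer token.
FLAG = re.compile(r"(?<![\w-])--[a-z][a-z0-9-]*")


def undocumented_flags(
    argv: list[str], help_text: str, skip: Optional[dict[str, str]] = None
) -> list[str]:
    """Return flags used in argv that the help text does not mention and skip does not allow."""
    documented = set(FLAG.findall(help_text))
    used = {arg.split("=", 1)[0] for arg in argv if arg.startswith("--")}
    return sorted(used - documented - set(skip or {}))


# Hypothetical help excerpt; a real check would capture the tool's own --help output.
sample_help = """Options:
  --pdf               Convert slide deck into PDF
  --theme-set <path>  Path to additional theme CSS files
"""
sample_argv = ["--pdf", "--stdin", "--theme-set", "themes/dark.css"]

if __name__ == "__main__":
    print(undocumented_flags(sample_argv, sample_help))
    print(undocumented_flags(sample_argv, sample_help, skip={"--stdin": "accepted but undocumented"}))
```

Requiring a reason string for each skipped flag keeps the allowlist reviewable, since every exception has to say why it exists.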
[0.13.0] - 2026-05-15
Theme: Marketplace beta — discover, install, and run community packs from a signed catalog.
Eight headline threads ship in v0.13.0. The marketplace track (T810 catalog endpoint, T812 install/uninstall REST, T813 /marketplace UI, T814 community repo scaffold) is the headline — operators browse helmdeck-marketplace's catalog from the Management UI or the new helmdeck CLI, install with one click, and run the pack immediately via the new helmdeck-sidecar-marketplace image. Trust ships as stage A (deterministic SHA256 content hash, hard-rejects install on mismatch); stage B (full sigstore keyless cosign-verify) is queued for v1.0 hardening. Alongside marketplace: hyperframes.render for HTML→MP4 short-form video (the bigger lift of the cycle, slotted at issue #200 with a new sidecar image and async render pipeline); stock.search for Pexels-backed stock photography that chains into every other media-output pack via the same feature_image_artifact_key contract image.generate introduced in v0.12.0; slides.render contrast guardrails (docs + lint + curated themes — the WCAG-AA reproducer goes from "render succeeds, slide unreadable" to "render succeeds with explicit warnings the agent can act on"); provider_calls diagnostic columns (job_id + finish_reason + raw_content_len, joining the gateway audit table back to the pack-job that triggered the call in a single SQL query); subprocess pack manifest format (typed I/O schemas via YAML sidecar, completing the v0.12.0 MVP); and the blog.publish artifact-first refactor (Ghost failures now return a partial-success response with the saved markdown instead of losing the expensive prompt-expanded body). Three new ADRs land with the cycle — ADR 034 captures the marketplace design ahead of the implementation, ADR 037 turns the hyperframes-npm-pin incident into a project-wide upstream-version discipline, and ADR 038 explains why marketplace packs route through a dedicated sidecar rather than running in the distroless control plane.
Added
- Marketplace trust verification stage A (#30 follow-up) — replaces the structured stub from PR #220 with real deterministic content-hash verification. The installer now computes a stable SHA256 over a pack's non-manifest files (excluding
helmdeck-pack.yamlitself to avoid the chicken-and-egg of "the file containing the hash is in the hash"), compares tomanifest.trust.sha256, and hard-rejects the install on mismatch (removes the materialized files, returnstrust verification failed). Algorithm is platform-deterministic — no tar/gzip non-determinism, no timestamp leakage — so the marketplace'ssign.ymlworkflow can produce the same digest. What stage A catches: handler/data modified between author-sign and install, file rename/add/remove, corrupt downloads. What it doesn't catch (deliberate, documented): a malicious author modifying the manifest itself — that's stage B (full sigstore keyless verification of the signer identity), tracked as a v1.0 hardening item. New trust-note vocabulary surfaces verified hash + declared signed_by in the install response; UI's "Signed (pending)" badge flips to "Signed (verified)" on a passing stage A check. Seedocs/reference/marketplace/catalog.md§Trust model. 9 new tests cover hash determinism, sensitivity to file change/add/rename, install-rejects-mismatch with cleanup, and the no-sha256-but-signed-by intermediate state. helmdeckCLI binary (#30 follow-up) — operator-facing CLI that wraps the marketplace REST endpoints from a terminal. Subcommands:pack list(every registered pack),pack marketplace [--refresh](browse catalog),pack install <name>,pack uninstall <name>,pack installed(marketplace-installed only). Same env-var conventions ashelmdeck-mcp:HELMDECK_URL(defaulthttp://localhost:3000) +HELMDECK_TOKEN.--jsonon any subcommand emits raw response for shell pipelines (helmdeck pack installed --json | jq '.installed[] | .name'). Install output surfacestrust_verified+trust_noteso operators see verification status in the terminal. Non-zero exit on errors and preserves the structured error code (pack_not_in_catalog,marketplace_install_disabled, etc.). Ships via goreleaser alongside the existingcontrol-plane+helmdeck-mcpbinaries. New file:docs/howto/use-the-helmdeck-cli.md. 
16 tests cover env-var resolution, request shape (Authorization header, JSON body, content-type), 4xx envelope preservation, and happy-path dispatch for every subcommand.- Marketplace UI panel + pack-detail endpoint (#31 / T813) — new
/marketplaceroute in the Management UI: browse-by-category chips, free-text search across name/description/tags, pack-detail dialog with input/output schema preview + worked examples + trust badge (Signed / Unsigned), Install / Uninstall buttons with busy state and automatictools/listcache invalidation, Refresh button, unsigned-pack confirmation dialog per ADR 034. New REST endpointGET /api/v1/marketplace/packs/{name}returns the catalog entry + fullhelmdeck-pack.yamlmanifest fetched from the marketplace repo on demand — the catalog endpoint deliberately doesn't pre-load every manifest. Sidebar gains a "Marketplace" nav link (Store icon). Operator reference:docs/reference/marketplace/catalog.md§Management UI panel. - Marketplace install / uninstall REST endpoints (#30 / T812) — packs from the marketplace catalog can now be materialized to disk and hot-loaded into the running control plane without a restart.
POST /api/v1/marketplace/installresolves a pack from the cached catalog,git clone --depth=1 --filter=blob:none's the marketplace repo, copiespacks/<name>/toHELMDECK_PACKS_DIR(default~/.helmdeck/packs/<name>/), preserves executable bits, then registers the pack with the livepacks.Registryso it appears intools/listandGET /api/v1/packsimmediately.POST /api/v1/marketplace/uninstallreverses it (deregister-then-delete, atomic from the operator's POV).GET /api/v1/marketplace/installedenumerates everything the operator has installed via the marketplace (NOT built-in core packs).command-handler packs only in this beta —builtin/composite/wasmreject with a clear message. Trust verification ships as a structured stub: the response always carriestrust_verified+trust_note, the manifest'strust:block flows through end-to-end, but the actual sigstore.dev cosign-verify call lands in a follow-up PR. CLI deferred to its own PR to keep this one review-sized; the REST surface is what T813's UI panel actually depends on. New file:docs/reference/marketplace/catalog.md§Install/uninstall. - Marketplace pack execution via dedicated sidecar (ADR 038, paired with #30) — installed marketplace packs run inside a new
helmdeck-sidecar-marketplaceimage (bash + jq + curl + python3 + Node 20 + standard Unix utils) rather than the distroless control-plane process. The pack handler closure uploads the on-disk handler script to the sidecar viaec.Execon each call,chmod +x's it, and pipes the pack input to stdin — matching theslides.narrate/hyperframes.renderexecution model. Manifests can override the sidecar per-pack via a new optionalhandler.sidecar.imagefield (heavier toolchains, e.g. image processing, video, ML). Operators override the default globally withHELMDECK_SIDECAR_MARKETPLACE. Image is amd64 only at v0.13.0; multi-arch follows the base sidecar's track. New files:docs/adrs/038-marketplace-pack-execution-via-sidecar.md,deploy/docker/sidecar-marketplace.Dockerfile,.github/workflows/sidecar-marketplace.yml, Makefilesidecar-marketplace-buildtarget. - Marketplace catalog endpoint (#28 / T810) — first slice of the v0.13.0 Marketplace beta. The control plane now fetches a community pack catalog (
index.yaml) from HELMDECK_MARKETPLACE_URL (default https://github.com/tosin2013/helmdeck-marketplace) at boot and serves it via two REST endpoints: GET /api/v1/marketplace/catalog returns the cached snapshot, POST /api/v1/marketplace/refresh forces a fresh fetch. A failed refresh preserves the previously-cached catalog so a transient upstream blip doesn't blank the UI. Three source-URL shapes supported: github.com/<owner>/<repo> (auto-translated to raw index.yaml), direct raw URLs, and file:/// for air-gapped operators. Set HELMDECK_MARKETPLACE_DISABLE=1 to turn the endpoints off entirely. New Go types in internal/marketplace/ mirror the JSON Schemas published in the helmdeck-marketplace repo. Read-only in this PR — install/uninstall (#30 / T812) and /marketplace UI panel (#31 / T813) land in follow-up PRs. Operator reference: docs/reference/marketplace/catalog.md. Design: ADR 034.
- stock.search built-in pack (#217) — search Pexels for stock photos matching a query, download the top 1-4 results into the artifact store, return their artifact keys + per-photo attribution metadata (photographer, photographer_url, source_url, width, height, alt_text). The output uses the same chained-input contract as image.generate so downloaded stock photos slot straight into slides.render (hero), slides.narrate (hero), blog.publish (feature_image_artifact_key), podcast.generate (cover_image_artifact_key), and hyperframes.render (embedded <img src>). Use stock.search for real photography; image.generate for AI-generated art. Filter knobs: orientation (landscape/portrait/square), size (large/medium/small min-size), color (hex or named). Credential: pexels-key (vault) or HELMDECK_PEXELS_API_KEY (env-var fallback). Free tier 200 req/hr at https://www.pexels.com/api/. Engine-pluggable from day 1 — engine: "pexels" only ships v0.13.0; unsplash / pixabay reserved for community PRs. Photos only; media_type: "video" is a follow-up. See docs/reference/packs/stock/search.md. Pack count: 40 → 41.
- slides.render contrast guardrails (#202) — three-pronged fix for "LLM picks a custom palette that produces unreadable slides" (the dark-blue-section-with-default-light-tables reproducer). (A) Docs + agent skill: new "Color contrast best practices" section in docs/reference/packs/slides/render.md + an updated slides.render entry in skills/helmdeck/SKILL.md teach the WCAG-AA 4.5:1 rule and the "override every nested element when you change section { background }" checklist. (B) Static contrast lint: the pack now parses the markdown's frontmatter style: block and embedded <style> tags before render, flagging two anti-patterns — section-background-without-nested-overrides (the reproducer pattern) and wcag-aa-text-contrast (any single rule whose hex color / background-color pair contrasts below 4.5:1). Warnings surface in the response's new warnings: [{rule, selector, recommendation}] array — informational, not errors; the render still succeeds. (C) Curated helmdeck themes: two embedded Marp themes ship with the control-plane binary — helmdeck-dark (slate/sky palette, modern technical look) and helmdeck-corporate (white/blue palette, business deck). Both declare WCAG-AA colors for every nested element type explicitly. The agent picks one via theme: helmdeck-dark in the frontmatter; the pack uploads the embedded CSS to the sidecar and passes --theme-set to marp automatically. Response carries curated_theme_used so callers can confirm the theme applied. Source: internal/packs/builtin/themes/.
- hyperframes.render built-in pack (#200) — HTML/CSS/JS composition → deterministic MP4 via Chromium BeginFrame + ffmpeg using the upstream hyperframes CLI, running in the new helmdeck-sidecar-hyperframes image (env override HELMDECK_SIDECAR_HYPERFRAMES; Node 22 + ffmpeg on top of the base sidecar). Sizing surface is composable: resolution (1080p/4k) × aspect_ratio (16:9 YouTube standard, 9:16 Shorts/TikTok/Reels, 1:1 Instagram feed) resolves to one of six upstream CLI presets (landscape/portrait/square ± -4k). Composition must be authored at the matching aspect ratio — upstream's --resolution flag is an integer-multiple upscale knob, not a dimension setter. Audio handling is mode-free: a composition with no <audio> tag produces a silent MP4; an inline <audio src> produces a narrated MP4 — chain podcast.generate → hyperframes.render by embedding the podcast's presigned audio URL in the composition's <audio src> and the audio track flows through automatically. Short-form only (≤12 min, 512 MiB cap; oversize rejects as CodeHandlerFailed pointing at #201 for the v1.x long-form streaming track). Pack is Async: true, 4 GiB session memory, 60-minute timeout. See docs/reference/packs/hyperframes/render.md, docs/SIDECAR-LANGUAGES.md. Pack count: 39 → 40.
- provider_calls diagnostic columns (#183) — three new columns on the gateway audit table for diagnosing failed LLM-backed pack calls in a single SQL query instead of timestamp-matching ts against the job's ended_at: job_id (joins back to the pack job that triggered the call; indexed), finish_reason (provider-reported stop / length / tool_calls / content_filter / …), raw_content_len (bytes in choices[0].message.content after trim — instantly distinguishes "model returned no visible text" from "model returned text the pack couldn't parse"). Migration 0005_provider_calls_diagnostics.sql adds columns via SQLite ALTER TABLE ADD COLUMN (O(1) metadata-only, safe on multi-million-row tables). The async-job runner (internal/mcp/jobs.go) stamps the pack job ID on the dispatch context via the new gateway.WithJobID helper so existing per-pack call sites don't need touching. Existing rows keep NULL job_id / NULL finish_reason / 0 raw_content_len — no backfill required.
- Subprocess pack manifest format (#173) — operator-supplied command packs ($HELMDECK_COMMAND_PACKS_DIR) can now declare typed input/output schemas + execution overrides via a sibling <basename>.helmdeck-pack.yaml file. The manifest carries name, version, description, author, input_schema / output_schema blocks (BasicSchema-compatible: string, number, boolean, object, array), timeout_s, max_output_bytes, and an env list. Missing manifest falls back to passthrough (the v0.12.x MVP behavior); malformed manifest skips the pack entirely with an error logged. New how-to: docs/howto/build-subprocess-pack.md.
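The wcag-aa-text-contrast rule in the slides.render entry above rests on the WCAG 2.x contrast-ratio formula. A minimal Python sketch of that calculation follows. The function names are illustrative and are not taken from the helmdeck lint, which may parse CSS and compare colors differently.

```python
def _linearize(channel: int) -> float:
    """Convert an 8-bit sRGB channel to linear light (WCAG 2.x definition)."""
    c = channel / 255.0
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(hex_color: str) -> float:
    """Relative luminance of a #rrggbb color."""
    h = hex_color.lstrip("#")
    if len(h) != 6:
        raise ValueError(f"expected #rrggbb, got {hex_color!r}")
    r, g, b = (int(h[i:i + 2], 16) for i in (0, 2, 4))
    return 0.2126 * _linearize(r) + 0.7152 * _linearize(g) + 0.0722 * _linearize(b)


def contrast_ratio(fg: str, bg: str) -> float:
    """Ratio between 1.0 (identical colors) and 21.0 (black on white)."""
    lum_a, lum_b = relative_luminance(fg), relative_luminance(bg)
    lighter, darker = max(lum_a, lum_b), min(lum_a, lum_b)
    return (lighter + 0.05) / (darker + 0.05)


def passes_wcag_aa(fg: str, bg: str, threshold: float = 4.5) -> bool:
    """True when the pair meets the 4.5:1 rule for normal-size text."""
    return contrast_ratio(fg, bg) >= threshold
```

For example, passes_wcag_aa("#cccccc", "#ffffff") is False, which is the kind of color / background-color pair the lint would be expected to flag.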
Changed
- blog.publish artifact-first refactor (#203) — destination is now optional and defaults to "artifact". When destination="ghost", the pack ALSO saves the post body as an artifact (the safety net) by default; a new also_save_artifact: false input restores the pre-#203 ghost-only behaviour. Ghost failures with the safety net enabled return a partial-success response (status: "artifact_saved_ghost_failed" + ghost_error + artifact_key / artifact_url) instead of a hard error — agents can retry the Ghost step against the saved artifact without paying for prompt expansion again. Strictly additive schema change; existing callers that send destination="ghost" now also see artifact_key / artifact_url / size in the response. See docs/reference/packs/blog/publish.md §Partial success.
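A caller might branch on the partial-success response roughly as follows. This is a sketch only: the field names (status, ghost_error, artifact_key) come from the entry above, while the returned action labels and the function name are hypothetical.

```python
def next_step(response: dict) -> dict:
    """Decide what an agent should do after a blog.publish call.

    Only the "artifact_saved_ghost_failed" status is documented in the
    changelog entry; every other status is treated as success here,
    which is an assumption of this sketch.
    """
    if response.get("status") == "artifact_saved_ghost_failed":
        return {
            "action": "retry_ghost",  # hypothetical label
            "artifact_key": response["artifact_key"],
            "reason": response.get("ghost_error", ""),
        }
    return {"action": "done", "artifact_key": response.get("artifact_key")}
```

The point of the design is visible in the retry branch: the saved artifact key is carried forward, so the Ghost step can be retried without regenerating the post body.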
[0.12.1] - 2026-05-13
Theme: hot-patch for the v0.12.0 release-image regression + three reliability bugs found within hours of v0.12.0 shipping.
The release-blocker (#180) is the dominant fix: every fresh docker pull ghcr.io/tosin2013/helmdeck:0.12.0 user saw a blank Management UI because the embedded web/dist/index.html referenced asset hashes not present in the image. Root cause was a workflow sequencing bug — the release workflow never ran npm run build before bundling the docker image, so the image baked in whatever stale index.html was last committed. The fix adds a Node + web-build step before docker/build-push-action plus a verify step that fails the release loud if the rebuilt index.html references assets that aren't on disk. Defense in depth: if v0.12.0's release had run this check, the broken image would never have shipped.
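The verify step described above amounts to checking that every asset referenced from the rebuilt index.html exists on disk. A minimal Python sketch of that check follows. The regular expression assumes Vite-style references of the form /assets/<name> in src or href attributes, and the function name is hypothetical; the actual workflow step may be implemented differently.

```python
import re
from pathlib import Path
from typing import List

# Assumption: assets are referenced as src="/assets/..." or href="/assets/...".
ASSET_REF = re.compile(r'(?:src|href)="/?(assets/[^"?#]+)')


def missing_assets(dist_dir: Path) -> List[str]:
    """Return referenced asset paths that are absent from dist_dir."""
    html = (dist_dir / "index.html").read_text(encoding="utf-8")
    referenced = sorted(set(ASSET_REF.findall(html)))
    return [ref for ref in referenced if not (dist_dir / ref).is_file()]
```

A release job could fail whenever missing_assets(Path("web/dist")) returns a non-empty list.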
The other three are smaller but each pinches at a real operator-visible failure mode introduced (or surfaced) by v0.12.0's content-pack push.
Fixed
- Release image's blank Management UI on fresh pulls (#180) — .github/workflows/release.yml now runs cd web && npm ci && npm run build before docker/build-push-action, then verifies that every asset hash referenced from the rebuilt web/dist/index.html exists in web/dist/assets/. Closes #180. Doesn't change web/dist/'s gitignore status — the workflow-step fix is the architecturally correct choice (committing the dist folder would create merge churn on every web/src/ PR).
- firecrawl-rabbitmq cold-boot race (#181) — deploy/compose/compose.firecrawl.yml bumps the rabbitmq healthcheck's start_period: 15s → 60s. RabbitMQ's Erlang VM + mnesia init takes 30-60s on alpine; the shorter window exhausted retries before the node was ready → container reported unhealthy → helmdeck-firecrawl (correctly waiting via depends_on: condition: service_healthy) never started → operator had to docker compose up again. 60s aligns with firecrawl-searxng's precedent in the same file. Tutorial note added that firecrawl overlay cold-boot takes ~60-90s. Closes #181.
- content.ground truncated-JSON failure mode (#179) — the hard-coded 1024-token completion cap was too tight for the structured claim-plan JSON the extractor returns (~750 tokens for 5 claims left ~270 tokens of headroom; weak models or large posts blew through it). Default bumped to 2048 (~1200 tokens of output budget); new optional max_completion_tokens input on contentGroundInput lets operators raise the cap up to 8192. Over-cap requests now reject with CodeInvalidInput (runaway-cost guard) instead of silently truncating downstream. Closes #179.
- content.ground silent degradation when Firecrawl unreachable (#182) — the per-claim grounding loop swallowed callFirecrawlSearch transport errors silently, producing an empty-success "no sources found" output instead of surfacing the underlying reachability issue. Now tracks firecrawlCalls vs firecrawlErrors separately; when 100% of attempted calls hit transport errors, the handler returns CodeHandlerFailed with a message pointing at the firecrawl service URL. Partial-success runs preserved: claims with "search succeeded but no usable source" still land under skipped and the run completes. Mirrors the v0.11 narration contract's fail-loud-on-missing-dependency pattern. Closes #182.
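The two content.ground rules above can be sketched in Python as follows. The default (2048), the cap (8192), and the all-calls-failed condition come from the entries; the exception classes stand in for CodeInvalidInput and CodeHandlerFailed, and rejecting values below 1 is an assumption of this sketch.

```python
from typing import Optional

DEFAULT_MAX_COMPLETION_TOKENS = 2048
MAX_COMPLETION_TOKENS_CAP = 8192


class InvalidInput(ValueError):
    """Stand-in for the pack's CodeInvalidInput error."""


class HandlerFailed(RuntimeError):
    """Stand-in for the pack's CodeHandlerFailed error."""


def resolve_max_completion_tokens(requested: Optional[int] = None) -> int:
    """Apply the default and reject over-cap requests instead of truncating."""
    if requested is None:
        return DEFAULT_MAX_COMPLETION_TOKENS
    # Assumption: non-positive values are rejected as well.
    if requested < 1 or requested > MAX_COMPLETION_TOKENS_CAP:
        raise InvalidInput(
            f"max_completion_tokens must be 1..{MAX_COMPLETION_TOKENS_CAP}"
        )
    return requested


def check_firecrawl_health(calls: int, errors: int, service_url: str) -> None:
    """Fail loud only when every attempted search hit a transport error."""
    if calls > 0 and errors == calls:
        raise HandlerFailed(
            f"all {calls} Firecrawl calls failed; is {service_url} reachable?"
        )
```

A run with some failed calls passes the health check, which matches the partial-success behaviour described above.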
Tests
- 5 new tests in content_ground_test.go — DefaultMaxTokens, MaxCompletionTokensOverride, MaxCompletionTokensOverCap, FirecrawlAllErrors, FirecrawlPartialErrorsSucceed.
Changed
- skills/helmdeck/SKILL.md — refreshed catalog (#184). Now correctly advertises 39 packs (was stamped at pre-v0.10.2 commit 24bd0c3 advertising 36 — missing blog.publish, podcast.generate, image.generate). Frontmatter helmdeckVersion bumped to v0.12.0. Brings SKILL.md in line with docs/integrations/SKILLS.md, which was already current.
- website/docusaurus.config.ts — sitemap ignores /blog/tags, /blog/tags/**, /blog/archive, /blog/authors to concentrate Google crawl budget on content pages (137 URLs → 122). Filed as SEO follow-up after Search Console reported 61 URLs in "Discovered – currently not indexed" with crawl timestamp 1969-12-31 (never crawled). Pages still render at their URLs — they're just no longer advertised in the sitemap.
[0.12.0] - 2026-05-12
Theme: content-pack image chaining + v1.0 install-path unblocker + pack-authoring MVP.
A bundled release covering four threads that lined up after v0.11.0: chain image.generate into the three content packs (#146, unblocked by v0.11.0's #71); helmdeck://image-models MCP resource (#158, sibling to #146); unified install paths (#134 step 1, P1 blocker for v1.0.0-rc1); and the originally-planned Pack Authoring MVP (T606a UI + T811 subprocess pack type).
The narrative: covers come for free, the install path becomes Kubernetes-ready, and pack-authoring grows up — operators with no Go toolchain can install via pulled images, and pack authors with no Go can ship in any language via subprocess packs.
Added
- Content-pack image chaining (#146) — additive convenience syntax across four packs, all backed by a shared RunImageGen entrypoint extracted from internal/packs/builtin/image_generate.go:
  - podcast.generate cover_image: bool — auto-generates podcast cover artwork via image.generate; output gains cover_image_artifact_key + cover_image_model_used. Optional cover_image_model override (default fal-ai/flux/schnell).
  - slides.render hero_image_prompt: string — auto-generates hero artwork; base64-inlined as <img data:image/png;base64,…> before slide 1 (after Marp frontmatter when present). Inline bytes avoid Marp needing network access inside the sidecar.
  - slides.narrate hero_image_prompt: string — same as slides.render but inlined INTO slide 1 (no --- separator) so the per-slide TTS pipeline still sees a populated narrated slide.
  - blog.publish feature_image_artifact_key + hero_image: bool — operator-supplied artifact OR auto-generate from the post title. For Ghost destination, uploads via /ghost/api/admin/images/upload/ (multipart, same JWT) then stamps the returned URL into the post's feature_image field. Artifact-mode writes a sidecar <slug>-cover.png.
- helmdeck://image-models MCP resource (#158) — mirrors helmdeck://voices (shipped v0.11.0). Curated in-tree catalog of 7 fal.ai models (flux/schnell, flux/dev, flux-pro/v1.1, fast-sdxl, flux-realism, recraft-v3, ideogram/v2) with cost, p50 latency, supports-seed, supports-image-size, max resolution, capability tags, and one-sentence trade-off notes. Backed by new internal/imagemodels package.
- fal-key in vault env-hydrate (#158) — closes the consistency gap image_generate.go:74 has advertised since v0.11.0 ("auto-hydrated to vault as 'fal-key' once #142 lands"). HELMDECK_FAL_KEY now imports into the vault under fal-key on startup, same shape as elevenlabs-key.
- deploy/compose/compose.build.yaml overlay (#134 step 1) — operators choose between image-mode (just compose.yaml, pulls ghcr.io/tosin2013/helmdeck:${HELMDECK_VERSION:-latest}) and source-build (base + this overlay, builds locally). Compose's deep-merge picks build: when both are present, so the same image: tag becomes the local build's name.
- scripts/install.sh --image-mode flag (#134 step 1) — pulls pre-built images instead of building from source. Implies --no-build. Skips host Go / Node / make preflight checks — the path needs only Docker, openssl, curl. Pin reproducible deploys via HELMDECK_VERSION=0.12.0 in .env.local.
- Pack Test Runner UI MVP (T606a) — click any pack row in /packs → modal opens with a JSON textarea + Submit. POSTs to /api/v1/packs/{name} and renders the response (duration, cost hint when present, full JSON). Schema-derived form rendering ships in v0.13.0; this MVP unblocks "no UI today."
- Subprocess pack type (T811 MVP) — packs.NewCommandPack(name, version, description, inSchema, outSchema, spec) constructor turns any executable into a pack via the stdin-JSON / stdout-JSON protocol. Operator-supplied packs auto-register from $HELMDECK_COMMAND_PACKS_DIR (one pack per executable, named cmd.<basename>). Pack authors can now ship in any language — Python, Node, Bash, Rust — without a Go toolchain dependency.
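To illustrate the stdin-JSON / stdout-JSON protocol named in the subprocess pack entry, here is a minimal Python sketch of a pack body. The pack itself (a word counter), its text input field, and its output fields are hypothetical; only the protocol shape (JSON in on stdin, JSON out on stdout, non-zero exit plus stderr on failure) is taken from this changelog.

```python
import json
import sys
from typing import IO


def handle(payload: object) -> dict:
    """Hypothetical pack logic: count words and characters in 'text'."""
    if not isinstance(payload, dict):
        raise ValueError("input must be a JSON object")
    text = payload.get("text", "")
    if not isinstance(text, str):
        raise ValueError("'text' must be a string")
    return {"words": len(text.split()), "chars": len(text)}


def run(stdin: IO[str], stdout: IO[str], stderr: IO[str]) -> int:
    """Read one JSON document, write one JSON document, return the exit code."""
    try:
        result = handle(json.load(stdin))
    except ValueError as exc:  # json.JSONDecodeError is a ValueError subclass
        print(f"wordcount: {exc}", file=stderr)
        return 1
    json.dump(result, stdout)
    return 0
```

Ending the file with sys.exit(run(sys.stdin, sys.stdout, sys.stderr)) and marking it executable would turn it into a standalone script. Placed in $HELMDECK_COMMAND_PACKS_DIR as wordcount, it would presumably register as cmd.wordcount under the naming rule above.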
Changed
- deploy/compose/compose.yaml is now image-mode by default (#134 step 1) — build: blocks stripped from the base file; control-plane and sidecar-warm pin ghcr.io/tosin2013/helmdeck[-sidecar]:${HELMDECK_VERSION:-latest}. Operators wanting source-build layer in compose.build.yaml via docker compose -f compose.yaml -f compose.build.yaml. The Helm chart (v1.0-rc1) will reuse the same versioned-tag convention.
- docs/tutorials/install-cli.md — adds "Pick your install mode" section with side-by-side prerequisites for image-mode (Docker only) vs source-build (Docker + Go + Node + make).
- docs/howto/upgrade-helmdeck.md §2 splits into Path A (image-mode) + Path B (source-build) — operators on a fresh box can git clone && ./scripts/install.sh --image-mode and skip the Go toolchain entirely.
- SlidesRender(v, eg) signature — was SlidesRender(); now takes vault + egress for RunImageGen access. cmd/control-plane/main.go updated to pass vaultStore, egressGuard.
- SlidesNarrate(d, vs, eg) signature — gained third eg parameter for the same reason.
Tests
~50 new tests across the bundle. Highlights:
- podcast.generate cover-image happy path + dry-run-skips-cover + model override (3 tests)
- slides.render hero-image insertion (after frontmatter / no frontmatter / model override / no-fal-credential fails loud), empty-prompt skips, mermaid-coexistence (5 tests)
- slides.narrate hero inlined into slide 1 + dry-run skips (2 tests)
- blog.publish artifact + ghost feature-image paths, supplied-key + auto-gen, mutual-exclusion validation (4 tests)
- helmdeck://image-models resource list/read/unwired + catalog shape + defensive copy (6 tests)
- Subprocess pack via test-binary self-exec: happy path, transform, non-zero exit + stderr, non-JSON stdout, empty stdout, timeout, missing path/binary, raw-binary sniff, OutputSchema vs handler boundary, capped-writer truncation (11 tests)
- Subprocess pack dir-loader: empty/nonexistent dir, executable discovery, non-executable skip, basename sanitization (6 tests)
Fixed
- image_generate.go:74 consistency gap — the doc string promised fal-key auto-hydration "once #142 lands"; #142 shipped v0.11.0 but the WellKnownEnvCredentials entry was missing. Now added.
Out of scope (slipped to v0.13.0 / v1.0-rc1)
- #134 step 2 — the Helm chart itself ships with v1.0-rc1.
- T606a schema-derived form — JSON Schema → React form rendering; v0.13.0.
- T811 manifest format — typed schemas via YAML sidecar (#173); v0.13.0.
- T811 egress sandbox — confine subprocess pack network access (#174); v0.13.0.
- arm64 sidecar image — still blocked on Marp's amd64-only upstream tarball.
MCP Registry
The auto-publish workflow (.github/workflows/mcp-registry.yml) republishes the listing on v* tag push. After tagging, verify at https://registry.modelcontextprotocol.io/v0/servers/io.github.tosin2013/helmdeck (expect version: 0.12.0, isLatest: true). Watch for the npm-publish race condition documented in release.yml:118-157 — workflow_dispatch the mcp-registry.yml after npm publish completes if the first run fails with "package not found."
[0.11.0] - 2026-05-10
Theme: podcast/slides UX hardening + onboarding fixes + image generation.
A coherent feature release that addresses 9 issues filed during a v0.10.2 OpenClaw integration: the new content packs work, but their first-run UX assumed you already knew the conventions. Silent MP3s when the credential name is wrong, hardcoded /root/openclaw paths, blocking Go preflight on the docker-only path, no voice discovery, no cost preview — all fixed.
The vault env-hydrate fix (#142) is the load-bearing piece: it root-causes the silent-fallback class of bug, not just the ElevenLabs instance. Pairing #138 (the per-pack contract change) with #142 (the platform fix) closes the bug class.
Added
- image.generate pack (#71) — text → image via fal.ai's synchronous fal.run endpoint. Default model fal-ai/flux/schnell (~$0.003/image, 1-3s). 1-4 images per call. The engine input field is reserved so a follow-up community PR can add Replicate without a schema change. Vault credential fal-key (with HELMDECK_FAL_KEY env-var fallback, auto-hydrated). 9 unit tests cover happy path + multi-image + missing credential hard-fail + env fallback + bad engine + 401 surfacing.
- Vault env-hydrate (#142) — at control-plane startup, WellKnownEnvCredentials registry auto-imports HELMDECK_*_API_KEY env vars into the vault under their canonical names. Operators who set HELMDECK_ELEVENLABS_API_KEY in .env.local per the README now get a working elevenlabs-key vault entry without a manual POST /vault/credentials call. Wildcard ACL granted on first create. Subsequent restarts respect user-managed entries (metadata.source != "env-hydrate" skips re-upsert). One INFO log per hydration (vault env hydrate ok name=elevenlabs-key host=api.elevenlabs.io).
- vault.Store.UpsertByName — sibling to Create. Inserts if absent, rotates ciphertext + refreshes patterns/metadata in place if present. Returns (record, created, error).
- helmdeck://voices MCP resource (#143) — exposes the operator's ElevenLabs voice catalog via the same resources/list + resources/read surface as helmdeck://packs and helmdeck://sessions. 1h in-memory cache keyed on the credential's plaintext fingerprint (rotating the key invalidates the cache automatically).
- internal/voices/ — new package with ListVoices(ctx, apiKey) → []Voice extracted from slides.narrate's inline pickRandomVoice. Voice exposes voice_id, name, labels (accent/gender/use_case), preview_url, source. Tests use overridable ElevenLabsBaseURL package var.
- podcast.generate + slides.narrate per-turn duration floor (#141) — new min_turn_duration_s: number input (default 5). Short TTS turns get padded with trailing anullsrc silence so the output respects a per-segment minimum (matches the slides.narrate house style). Pass min_turn_duration_s: 0 explicitly to opt out and preserve raw TTS pacing.
- podcast.generate + slides.narrate dry_run / cost preview (#145) — new dry_run: bool (default false) short-circuits before TTS synthesis and returns the script + per-speaker (or per-slide) tts_chars map + estimated_cost_usd + breakdown. Cost block is also included in regular (non-dry-run) responses. New internal/podcast/cost.go with plan rate table (Free/Starter/Creator/Pro/Scale) and HELMDECK_ELEVENLABS_RATE_PER_CHAR_USD override.
- podcast.generate + slides.narrate allow_silent_output opt-in — paired with the #138 contract change below; true activates the (now opt-in) silence-padded fallback for CI smoke tests / demo placeholders.
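The arithmetic behind the duration floor (#141) and the cost preview (#145) is simple enough to sketch in Python. The function names are hypothetical and the per-character rate is supplied by the caller, since this changelog does not list the values in the plan rate table.

```python
from typing import Mapping


def silence_padding_s(turn_duration_s: float, min_turn_duration_s: float = 5.0) -> float:
    """Seconds of trailing silence needed to reach the per-turn floor.

    A floor of 0 opts out and preserves the raw TTS pacing.
    """
    if min_turn_duration_s <= 0:
        return 0.0
    return max(0.0, min_turn_duration_s - turn_duration_s)


def estimate_cost_usd(tts_chars: Mapping[str, int], rate_per_char_usd: float) -> float:
    """Estimated spend for a tts_chars map (per-speaker or per-slide counts)."""
    return sum(tts_chars.values()) * rate_per_char_usd
```

With the default floor, a 3.2-second turn would receive about 1.8 seconds of padding, while a 7-second turn is left untouched.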
Changed
- podcast.generate + slides.narrate require narration by default (#138) — pre-this-change, missing the ElevenLabs credential silently produced a silence-padded artifact with has_narration: false buried in the response. Operators discovered the misconfiguration only by listening to the MP3. Now the packs hard-fail with a typed missing_credential error and an actionable message ("Set HELMDECK_ELEVENLABS_API_KEY in deploy/compose/.env.local..."). Pass allow_silent_output: true to opt back into the silent path. Shared 4-step credential resolver (internal/packs/builtin/elevenlabs_creds.go): explicit credential input → vault elevenlabs-key → vault elevenlabs-api-key (back-compat alias) → os.Getenv("HELMDECK_ELEVENLABS_API_KEY"). Both packs log one INFO line on successful resolve naming the ladder step that matched.
- slides.narrate ffmpeg failure surfaces full stderr (#140) — inline error message cap raised from 512 → 4096 bytes. Full stderr (plus the failing command line) persisted to the artifact store as ffmpeg-stderr-segment-NNN.txt / ffmpeg-stderr-concat.txt; the artifact key is referenced from the inline error so operators can fetch the unredacted output via the artifacts API.
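The 4-step resolver ladder from #138 can be sketched in Python as below. The vault is modelled as a plain mapping and the explicit credential input is treated as a literal key; both are simplifying assumptions, and the real resolver in elevenlabs_creds.go may differ (for instance, the input might name a vault entry instead).

```python
import os
from typing import Mapping, Optional, Tuple

VAULT_NAMES = ("elevenlabs-key", "elevenlabs-api-key")  # canonical, then alias
ENV_VAR = "HELMDECK_ELEVENLABS_API_KEY"


class MissingCredential(Exception):
    """Stand-in for the typed missing_credential error."""


def resolve_elevenlabs_key(
    explicit: Optional[str],
    vault: Mapping[str, str],
    env: Optional[Mapping[str, str]] = None,
) -> Tuple[str, str]:
    """Return (key, ladder_step) for the first ladder step that matches."""
    env = os.environ if env is None else env
    if explicit:
        return explicit, "input:credential"
    for name in VAULT_NAMES:
        value = vault.get(name)
        if value:
            return value, f"vault:{name}"
    value = env.get(ENV_VAR)
    if value:
        return value, f"env:{ENV_VAR}"
    raise MissingCredential(f"Set {ENV_VAR} in deploy/compose/.env.local")
```

Returning the matched step alongside the key makes it straightforward to emit the single INFO line that names the ladder step.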
Fixed
- scripts/install.sh blocked --no-build on hosts with old Go (#136) — check_go_version ran unconditionally even with --no-build, failing on Debian/Ubuntu's apt-default Go 1.22. The control-plane Dockerfile builds inside golang:1.26-alpine, so the docker-only path needs no host Go. Wrapped in if [[ "${DO_BUILD}" -eq 1 ]].
- scripts/configure-openclaw.sh hardcoded /root/openclaw + over-strict shell-env auth check (#137) — added OPENCLAW_COMPOSE_FILE env override (default unchanged); replaced 3 hardcoded path references. Auth-list die downgraded to warn when the OpenClaw container has OPENCLAW_LOAD_SHELL_ENV=true and <PROVIDER>_API_KEY is set on it (the auth-list probe is a guaranteed false positive in that documented setup path).
Closed as duplicates
- #139 (duplicate of #141) and #144 (duplicate of #145) — closed without separate fixes.
Deferred
- #146 (chain image.generate into podcast/slide/blog covers) — deferred to a follow-up release. The image.generate pack lands in this release; the integration layer on top of it lands later.
MCP Registry
The auto-publish workflow (.github/workflows/mcp-registry.yml) republishes the listing on v* tag push. After tagging, verify at https://registry.modelcontextprotocol.io/v0/servers/io.github.tosin2013/helmdeck (expect version: 0.11.0, isLatest: true).
[0.10.2] - 2026-05-09
A small patch release that ships the MCP Resources surface (closes #44) plus a refined registry-listing description. Functionally additive only; no breaking changes.
Added
- MCP Resources (#44) — the MCP server now serves resources/list and resources/read per the 2024-11-05 spec, alongside the existing tools/list / tools/call. Two read-only resources surface today:
  - helmdeck://packs — the live pack catalog (every registered pack with its input schema). Equivalent to tools/list as a browsable resource.
  - helmdeck://sessions — live session list (id, status, image, created_at). Wired only when the control plane has an active session runtime; safely omitted otherwise.
- The initialize response now declares the resources capability so MCP clients discover the new surface automatically.
- 7 unit tests cover both happy paths, the missing-runtime fallback, the unknown-URI error, lister error propagation, and the capability declaration.
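For readers unfamiliar with the MCP wire format, the two new methods are ordinary JSON-RPC 2.0 requests. The Python sketch below builds the request bodies only. The helper names are hypothetical, and the transport (stdio or HTTP) is left out because it depends on how the client connects.

```python
import itertools
from typing import Optional

_ids = itertools.count(1)  # JSON-RPC request ids, unique per process


def jsonrpc_request(method: str, params: Optional[dict] = None) -> dict:
    """Build a JSON-RPC 2.0 request object."""
    message = {"jsonrpc": "2.0", "id": next(_ids), "method": method}
    if params is not None:
        message["params"] = params
    return message


def list_resources_request() -> dict:
    """Ask the server which resources it exposes."""
    return jsonrpc_request("resources/list")


def read_resource_request(uri: str) -> dict:
    """Ask the server for the contents of one resource by URI."""
    return jsonrpc_request("resources/read", {"uri": uri})
```

For example, read_resource_request("helmdeck://packs") produces the body a client would send to fetch the live pack catalog.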
Changed
- Registry description now reads "Self-hosted MCP server: sandboxed browser, desktop, vision, code-edit packs for any agent." (was "38 capability packs (browser, desktop, vision, repo, fs, slides, podcast) for MCP agents."). Leads with the value proposition + self-hosted differentiator instead of the feature list.
- Registry submission script + workflow corrected to point at the search API URL — the registry has no human-facing web UI today, only the metadata API. Was a pre-1.0 documentation bug from the v0.10.1 cycle.
Operator notes
- No action required for existing v0.10.1 installs — MCP Resources is purely additive (new methods don't break existing tools/* clients). Upgrade if you want to expose helmdeck://sessions and helmdeck://packs to your agent for browsing.
- Out of scope for #44 (deferred): JWT scope filtering on resources, per-MCP-client integration tests. Tracked as follow-ups; the spec implementation is complete and the 7 unit tests cover the surface.
[0.10.1] - 2026-05-09
A patch release that completes helmdeck's listing on the official MCP Registry. The v0.10.0 attempt failed namespace verification because two pieces of metadata weren't yet declared on the published artifacts — this release adds them. Functionally identical to v0.10.0; no pack/API/binary behavior changes.
Fixed
- @helmdeck/mcp-bridge npm package now declares mcpName: "io.github.tosin2013/helmdeck" in its package.json. The MCP Registry's npm validator reads this field to confirm the package belongs to the registered namespace; without it, registry submission failed with NPM package '@helmdeck/mcp-bridge' is missing required 'mcpName' field.
- ghcr.io/tosin2013/helmdeck-mcp OCI image now carries the io.modelcontextprotocol.server.name="io.github.tosin2013/helmdeck" label. The OCI validator reads this label to confirm namespace ownership; the v0.10.0 image lacked it.
Operator notes
- No action required for existing v0.10.0 installs. The bridge binary, control plane, and all 38 packs are unchanged. Skip this release unless you specifically need the registry-listed install path.
- Registry entry goes live on tag push. .github/workflows/mcp-registry.yml auto-fires; verify via the search API at https://registry.modelcontextprotocol.io/v0/servers?search=io.github.tosin2013%2Fhelmdeck (the registry is API-only in preview — there is no human-facing web UI; browse downstream aggregators like mcp.so, Glama, and PulseMCP instead).
[0.10.0] - 2026-05-09
A "content packs" release. Two new packs land — blog.publish for posting to Ghost or stuffing markdown/HTML into the artifact store, and podcast.generate for multi-speaker podcast MP3s via a pluggable TTS engine. The capture pipeline ships in-repo, the upgrade procedure is documented for the first time, and the README now opens with the quantified cost-positioning argument the platform earned by shipping the per-pack reference work. Pack count: 36 → 38.
The originally-planned v0.10.0 theme (Pack Authoring + Test Runner) slips to v0.11.0 — the work didn't happen this cycle, the slot got repurposed because the new packs were ready.
Added
- blog.publish pack (#68 via #103) — publish to a Ghost installation (live Admin API) OR render markdown/HTML to the helmdeck artifact store. Two body modes (agent-supplied OR prompt+model the pack expands). Goldmark added to go.mod for the markdown→HTML shim. Ghost JWT minted inline via golang-jwt/jwt/v5 (5-min HS256, audience /admin/).
- podcast.generate pack (#106) — produce a 1..N speaker podcast MP3 from a script, a prompt, or long-form content (URL/text → LLM converts). Three input modes (script / prompt+model / source_*+model). Five themed system prompts: interview, debate, news-roundup, deep-dive, solo-essay. Day 1: ElevenLabs behind a podcast.Engine interface so future PRs (PlayHT, Hume.ai, Resemble.ai) slot in by adding a new file under internal/podcast/. Vault credential elevenlabs-key (same as slides.narrate); silent-fallback when missing. Optional cover_image_prompt output for downstream image-gen packs.
- 38 per-pack reference pages at helmdeck.dev/reference/packs — every shipped pack on the agent-first / developer-second template, with live OpenClaw chat-UI transcripts embedded alongside curl developer references. (PR-A #83 + PR-B #95 + PR-C #101.) Closes #51, #53, #54, #55, #56, #58, #59, #60, #61, #62, #63, #64.
- OpenClaw transcript capture pipeline at scripts/oc-capture/ (#97 + #104) — three scripts (capture-oc.sh, extract-oc-transcript.py, inject-transcripts.py), a generic capture-batch.sh driver, and prompt files for the three pack-doc clusters.
- Cost-positioning blog + long-form reference (#99) — website/blog/2026-05-08-cheap-models-do-frontier-work.md + docs/explanation/why-helmdeck.md with five per-task comparison tables vs. Anthropic Computer Use, OpenAI Operator, Browser-use, Cursor, Aider, Unstructured.io, LlamaParse, Pictory. Includes a "Run the comparison yourself" reproduction recipe + community-contribution invitation.
- Operator upgrade documentation at docs/howto/upgrade-helmdeck.md (#107) — pre-flight checklist, in-place Compose-stack upgrade, schema-migration handling, post-upgrade validation, rollback, Kubernetes/Helm path preview.
- SKILLS.md gains a "Freshness contract" section (#98) — teaches agents to re-call stateful packs when state may have changed since the last call. Plus per-client "Load the agent skills" subsections for every integration doc (Claude Code via CLAUDE.md, Claude Desktop via Projects, Gemini CLI via GEMINI.md, Hermes via system_prompt_file).
- Per-release-checklist additions in docs/RELEASES.md: step 6 (refresh README + cost numbers per release, #100), step 7 (operator upgrade procedure smoke, #107).
Fixed
- vision.click_anywhere mechanical loop bug (#102 via #105) — per-step screenshots now genuinely reflect post-action desktop state. Two changes: Step and StepNative thread prior-turn actions into the next user message as textual history, and a 250 ms post-dispatch wait gives Xvfb time to repaint. Same fix applies to vision.fill_form_by_label. Verified live: per-step PNG artifacts now have distinct file sizes between iterations (vs. PR-B baseline where every step's bytes were identical because Xvfb hadn't repainted before scrot fired). However, the model-side completion-detection limitation remains — the model still rarely emits done on real tasks even when the click visibly landed. Tracked separately at #112 for follow-up research (try gpt-4o vs. haiku-4.5, native computer-use schema, two-shot verification). Treat vision.click_anywhere as experimental for production workflows until #112 lands an answer.
- repo.fetch empty-remote infinite hang (#94 via #96) — git ls-remote --heads runs first; pack errors fast with invalid_input: remote has no branches; push at least one commit before cloning.
- fs.patch Anthropic-edit-shape rejection (#90 via #93) — both {search, replace} and {edits: [{oldText, newText}]} shapes accepted.
- doc.parse formats: "markdown" rejection (#91 via #93) — markdown aliases md; both work.
- OpenClaw capture pipeline cross-prompt context bleed (#97) — every capture-oc.sh invocation now mints a fresh --session-id. Side-effect: per-call cost dropped ~140× (no 280-event session bloat shipped on every turn).
- Vision pack loops now check ctx.Err() (in #105) — cancelled callers exit cleanly instead of spinning to max_steps.
- vision.fill_form_by_label parity fix (#105) — now records per-step PNG artifacts (parity with click_anywhere).
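The two input-tolerance fixes above (fs.patch accepting both edit shapes, doc.parse accepting markdown as an alias for md) are both normalization steps. A Python sketch of the idea follows. The function names are hypothetical, and the real packs may validate more fields than shown here.

```python
from typing import Dict, List, Tuple

FORMAT_ALIASES: Dict[str, str] = {"markdown": "md"}


def normalize_edits(payload: dict) -> List[Tuple[str, str]]:
    """Map either accepted fs.patch shape to a list of (old, new) pairs."""
    if "edits" in payload:
        return [(edit["oldText"], edit["newText"]) for edit in payload["edits"]]
    if "search" in payload and "replace" in payload:
        return [(payload["search"], payload["replace"])]
    raise ValueError("expected {search, replace} or {edits: [{oldText, newText}]}")


def normalize_format(name: str) -> str:
    """Resolve a doc.parse format alias to its canonical name."""
    return FORMAT_ALIASES.get(name, name)
```

Normalizing at the boundary keeps the rest of the handler working against a single shape regardless of which convention the calling agent used.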
Changed
- Pack count: 36 → 38 (blog.publish + podcast.generate)
- README.md opens with the quantified cost-positioning argument ($0.07 Phase 5.5 loop on gpt-oss-120b vs $0.30+ on Sonnet via Cursor) plus a 4-row comparison table; "other 99%" framing kept as the follow-on paragraph
- Homepage tagline rewritten from "Self-hosted AI agent platform for small open-weight models" to lead with the cost angle
- docs/integrations/SKILLS.md picks up the Freshness contract, expanded "How to load" subsection with per-client instructions, "Blog" and "Podcast" catalog entries, and the pack count bump
Operator notes
- Upgrade procedure: git fetch && git checkout v0.10.0 && make sidecars && make install. See /howto/upgrade-helmdeck for the full pre-/post-upgrade checklist.
- Schema migrations: auto-applied on store.Open. Cross-version smoke is tracked in #108 (P1).
- OpenClaw skill refresh: re-run ./scripts/configure-openclaw.sh after pulling so the new SKILL.md (with podcast/blog entries + Freshness contract) lands in the OpenClaw container.
- No breaking changes to existing pack input/output schemas. All ### Added work is additive; all ### Fixed items improve observable behavior in agents' favor.
- Pre-Kubernetes audit issues filed: #108 (schema-migration cross-version test, P1), #109 (sidecar version pinning, P2), #110 (vault master-key rotation, P2), #111 (cross-version upgrade smoke in CI, P2). All tagged Phase 7; none block v0.10.0.
- Known limitation: vision.click_anywhere and vision.fill_form_by_label are experimental — the underlying loop fix in #105 works mechanically (screenshots progress per turn) but the vision model rarely emits done on real tasks. See #112 for the research track. Use at your own risk in production workflows; prefer web.test (Playwright MCP, deterministic) for browser-automation goals where possible.
0.9.0 - 2026-05-07
A "polish + plumbing" release. No new packs and no API changes — the 36 packs from v0.8.0 stay the surface area. What landed: a fix for an install bug that was breaking first sessions, a public docs site at helmdeck.dev, two community-contributed AI provider adapters, secret scanning in CI, and the planning-doc cross-references that were documented-but-not-implemented at v0.8.0.
Added
- Documentation site at helmdeck.dev: Docusaurus 3, Diataxis-organized (Tutorials / How-to / Reference / Explanation), deployed to Vercel with auto-preview on PRs. Search via @easyops-cn/docusaurus-search-local. SEO-tuned for Google Search Console submission: explicit titles, OG social card, robots.txt, sitemap with per-route priority bumps, schema.org/WebSite + FAQPage JSON-LD.
- Install tutorials: docs/tutorials/install-cli.md (10-minute walkthrough from git clone to running stack) and docs/tutorials/install-ui-walkthrough.md (panel-by-panel UI tour).
- Troubleshooting how-to: docs/howto/troubleshoot-install.md with FAQPage schema covering 10 known sharp edges (502 on first session, GHCR pull failures, lost admin password, etc.).
- Per-pack documentation framework: docs/reference/packs/ with template + fully-written browser family (browser.screenshot_url, browser.interact). 12 family-tracking issues opened for the community to pick up the remaining 34 packs.
- OSS hygiene files at repo root: CHANGELOG.md, SECURITY.md (90-day disclosure window), CODE_OF_CONDUCT.md (Contributor Covenant 2.1).
- GitHub priority taxonomy: priority/P0..P3 labels applied to all 39 open issues. P1 cohort (14 items) is the next-release shortlist.
- docs/sitemap.xml: documcp-generated source-side sitemap for link audits and search-engine submission tracking, separate from Docusaurus's runtime sitemap.
- Custom logo: helm-wheel + H letterform mark, light/dark variants, SVG favicon. Replaced the scaffolded Docusaurus brand assets.
- Provider adapters via community PRs: Groq (PR #45 by @Dev-31) and Mistral (PR #47, resolved from @vijit-vishnoi's PR #46) both ride the HELMDECK_{PROVIDER}_API_KEY[_FILE] / _BASE_URL / _MODELS env-var contract introduced for OpenRouter in v0.8.0.
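For readers wiring a new adapter, the env-var contract can be pictured roughly as follows. This is a minimal sketch, not the adapters' actual code: the function and type names are hypothetical, and it assumes that the _FILE variant points at a file holding the key, that the direct variable takes precedence when both are set, and that _MODELS is a comma-separated list. None of these details are stated in this changelog.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// providerConfig holds the three settings named by the env-var contract.
type providerConfig struct {
	APIKey  string
	BaseURL string
	Models  []string
}

// loadProvider reads HELMDECK_<NAME>_API_KEY[_FILE], _BASE_URL and _MODELS
// through getenv. Assumptions: the direct key takes precedence over _FILE, the
// file's trimmed contents are the key, and _MODELS is comma-separated.
func loadProvider(name string, getenv func(string) string) (providerConfig, error) {
	prefix := "HELMDECK_" + strings.ToUpper(name)
	cfg := providerConfig{
		APIKey:  getenv(prefix + "_API_KEY"),
		BaseURL: getenv(prefix + "_BASE_URL"),
	}
	if cfg.APIKey == "" {
		if path := getenv(prefix + "_API_KEY_FILE"); path != "" {
			raw, err := os.ReadFile(path)
			if err != nil {
				return cfg, fmt.Errorf("read %s_API_KEY_FILE: %w", prefix, err)
			}
			cfg.APIKey = strings.TrimSpace(string(raw))
		}
	}
	// Skip empty entries so an unset or blank _MODELS yields no models.
	for _, m := range strings.Split(getenv(prefix+"_MODELS"), ",") {
		if m = strings.TrimSpace(m); m != "" {
			cfg.Models = append(cfg.Models, m)
		}
	}
	return cfg, nil
}

func main() {
	cfg, err := loadProvider("groq", os.Getenv)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("key set:", cfg.APIKey != "", "models:", cfg.Models)
}
```

Passing the lookup function in, rather than calling os.Getenv directly, keeps the sketch testable without touching the process environment.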
Changed
- Planning docs (RELEASES.md, MILESTONES.md, TASKS.md) are now cross-linked. Every release has a Milestone + Tasks pointer; every milestone has a Ships-in pointer; the previously missing v0.8.0 RELEASES section was added. 19 task IDs that lived in MILESTONES without rows in TASKS got promoted into proper rows.
- README's install section links to the new tutorial pages.
- Trivy CI scan scope narrowed to scanners: vuln,misconfig. Action pin bumped 0.28.0 → 0.35.0.
Fixed
- Install bug: docker compose up -d --build only builds services with a build: clause, so published images (Garage, the GHCR-published sidecar tag) weren't pulled before stack-up. Result: first session calls hung on a 30-second timeout. Fix: a new compose_pull step in scripts/install.sh runs docker compose pull --ignore-buildable between sidecar build and compose up, fast-failing on network/proxy issues with an actionable error. The sidecar-warm service no longer swallows pull failures with || true.
- CI race: TestBridgeRoundTrip shared a bytes.Buffer between the test goroutine and the bridge writer. The buffer is now wrapped in a sync.Mutex-guarded safeBuffer. Production code unchanged.
- vercel.json: cleanUrls: true added so /PACKS resolves to /PACKS.html (matched to Docusaurus's trailingSlash: false).
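The CI race fix follows a common Go testing pattern, sketched below. Only the safeBuffer name comes from this changelog; the method set and the goroutine setup are assumptions made for illustration and may differ from the real test helper.

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// safeBuffer serializes access to a bytes.Buffer so that one goroutine can
// write while another reads without tripping the race detector.
type safeBuffer struct {
	mu  sync.Mutex
	buf bytes.Buffer
}

// Write appends p under the lock; it satisfies io.Writer.
func (s *safeBuffer) Write(p []byte) (int, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.buf.Write(p)
}

// String returns a snapshot of the contents under the lock.
func (s *safeBuffer) String() string {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.buf.String()
}

func main() {
	var out safeBuffer
	var wg sync.WaitGroup
	// Several concurrent writers stand in for the bridge writer goroutine.
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			out.Write([]byte("x"))
		}()
	}
	wg.Wait()
	fmt.Println(len(out.String()))
}
```

Because the wrapper implements io.Writer, it can be handed to code that previously received the bare buffer, which is consistent with the note that production code did not change.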
Security
- Gitleaks secret-scanning CI workflow on every push + PR. Runs via gitleaks/gitleaks-action@v2 with fetch-depth: 0 so the scanner walks full history. Allowlist covers stable dev credentials in deploy/compose/garage.toml (file header already documents these as override-in-production).
- serialize-javascript bumped 6.0.2 → 7.0.5 via npm overrides to address GHSA-5c6j-r48x-rmvq (HIGH) and CVE-2026-34043 (MEDIUM). Both shipped as transitive deps in @docusaurus/bundler.
Developer experience
- make check target wraps vet + race test + build, exactly what CI's vet + test + build job runs. Plus make install-hooks to wire an opt-in pre-push hook.
0.8.0 - 2026-04-12
Added
- 36 capability packs total (browser, web, research, slides, GitHub, repo, filesystem, shell, HTTP, document, desktop, vision, language families).
- Phase 6.5 validation script (scripts/validate-phase-6-5.sh).
- Multi-provider AI gateway adapters: Groq, Mistral.
- gitleaks secret-scanning CI workflow with allowlist.
Changed
- README leads with the weak-model success story; v0.8.0 + 36-pack catalog refresh.
- Trivy CI scan scope narrowed to vuln+misconfig (secrets owned by gitleaks).
0.5.1 - 2026-04-08
Fixed
- npm trusted publishing: bump npm + add --provenance so @helmdeck/mcp-bridge releases include attestations.
0.5.0 - 2026-04-08
Added
- AES-256-GCM Credential Vault with placeholder-token injection (login, session cookies, API keys, OAuth-with-refresh, SSH/git).
- CDP cookie injection at session start.
- HTTP gateway intercept-and-substitute for outbound agent traffic.
- repo.fetch, repo.push, web.login_and_fetch, web.fill_form, slides.video packs (vault-dependent).
- NetworkPolicy egress allowlist + metadata IP / RFC 1918 block.
- Sandbox baseline: non-root, drop-all-caps, seccomp.
- OpenTelemetry GenAI semantic conventions on every span.
- Trivy CRITICAL gate in CI.
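As background for the Credential Vault entry above, AES-256-GCM sealing with Go's standard library looks roughly like this. It is a sketch of the cipher primitive only: the function names are hypothetical, the nonce-prefixed layout is an assumption, and nothing here reflects the vault's actual storage format, key management, or placeholder-token handling.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"errors"
	"fmt"
)

// newGCM builds an AES-256-GCM AEAD; a 32-byte key selects AES-256.
func newGCM(key []byte) (cipher.AEAD, error) {
	if len(key) != 32 {
		return nil, errors.New("AES-256 requires a 32-byte key")
	}
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	return cipher.NewGCM(block)
}

// seal encrypts plaintext and prepends the random nonce (assumed layout).
func seal(key, plaintext []byte) ([]byte, error) {
	gcm, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	return gcm.Seal(nonce, nonce, plaintext, nil), nil
}

// open reverses seal and fails if the data was modified.
func open(key, sealed []byte) ([]byte, error) {
	gcm, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	if len(sealed) < gcm.NonceSize() {
		return nil, errors.New("ciphertext too short")
	}
	nonce, ct := sealed[:gcm.NonceSize()], sealed[gcm.NonceSize():]
	return gcm.Open(nil, nonce, ct, nil)
}

func main() {
	key := make([]byte, 32)
	if _, err := rand.Read(key); err != nil {
		panic(err)
	}
	sealed, err := seal(key, []byte("example-credential"))
	if err != nil {
		panic(err)
	}
	plain, err := open(key, sealed)
	fmt.Println(string(plain), err)
}
```

GCM authenticates as well as encrypts, so a flipped bit in stored ciphertext surfaces as an error from open rather than as silently corrupted credentials.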
0.3.0 - 2026-04-08
Added
- MCP registry with stdio/SSE/WebSocket transports.
- Built-in MCP server auto-derived from the pack catalog.
- helmdeck-mcp bridge binary distributed via Homebrew, Scoop, npm (@helmdeck/mcp-bridge), GHCR OCI image, and signed GitHub Releases.
- CI smoke matrix verifying browser.screenshot_url from Claude Code, Claude Desktop, OpenClaw, and Gemini CLI.
Fixed
- release.yml: gate binary jobs to push events only.
0.2.0 - 2026-04-08
Added
- OpenAI-compatible /v1/chat/completions and /v1/models.
- Provider adapters: Anthropic, Gemini, OpenAI, Ollama, Deepseek.
- Encrypted key store with rotation API.
- Fallback routing rules (rate-limit / error / timeout triggers).
- Pack Execution Engine with input/output schema validation.
- Typed error code enforcement (closed set per pack).
- Pack registry with versioned dispatch.
- Three reference packs: browser.screenshot_url, web.scrape_spa, slides.render.
- Object store integration with signed-URL artifacts.
- A2A Agent Card at /.well-known/agent.json.
Hard exit gate met
- ≥90% success rate on browser.screenshot_url and web.scrape_spa against MiniMax-M2.7 and Llama 3.2 7B.
0.1.1 - 2026-04-07
Fixed
- sidecar.yml: publish amd64 only until Marp ships an arm64 tarball.
0.1.0 - 2026-04-07
Added
- Go control plane binary (Gin + chromedp + Docker SDK).
- Browser sidecar image with Chromium, Marp, Tesseract, ffmpeg, xdotool, Xvfb, XFCE4, noVNC.
- Ephemeral session lifecycle (POST /api/v1/sessions … DELETE /api/v1/sessions/{id}).
- CDP REST endpoints: navigate, extract, screenshot, execute, interact.
- JWT bearer auth on every endpoint.
- Audit log (write-only).
- Single-node Compose deployment (deploy/compose/compose.yaml).
- make smoke end-to-end harness in CI.