slides.narrate
The "deck-to-narrated-video" pack. Caller hands in a Marp deck where each slide carries <!-- speaker:notes --> HTML comments. The pipeline runs entirely server-side:
- Marp render — each slide becomes a 1920×1080 PNG.
- ElevenLabs TTS — each slide's speaker notes become an MP3 using a vault-stored ElevenLabs key + a chosen voice.
- ffmpeg encode — per-slide PNG + per-slide MP3 → per-slide MP4 segment, with optional cross-slide fade.
- ffmpeg concat — all segments stitched into one final MP4.
- (Optional) LLM metadata synthesis — if metadata_model is set, a frozen system prompt asks the model to generate a YouTube title, description with timestamps, tags, category, and language code, written as a separate JSON artifact.
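To make the first two pipeline stages concrete, here is a minimal sketch of how a deck could be split into slides and its narration text pulled out of the speaker-notes comments. It is not the pack's actual parser: the helper names `split_slides` and `speaker_notes` are hypothetical, and the comment shape (`<!-- speaker:notes TEXT -->`) is taken from the example deck used elsewhere on this page.

```python
import re

# Matches the comment shape used in this page's examples:
#   <!-- speaker:notes Some narration text. -->
NOTES_RE = re.compile(r"<!--\s*speaker:notes\s*(.*?)\s*-->", re.DOTALL)


def split_slides(markdown: str) -> list[str]:
    """Split a Marp deck into slide bodies, dropping the frontmatter block."""
    lines = markdown.split("\n")
    if lines and lines[0].strip() == "---":
        # Frontmatter runs from the first '---' line to the next one.
        end = next((i for i in range(1, len(lines)) if lines[i].strip() == "---"), None)
        if end is not None:
            lines = lines[end + 1:]
    slides, current = [], []
    for line in lines:
        if line.strip() == "---":  # slide delimiter
            slides.append("\n".join(current))
            current = []
        else:
            current.append(line)
    slides.append("\n".join(current))
    return slides


def speaker_notes(slide: str) -> str:
    """Return the narration text for one slide ('' when it has no notes)."""
    return " ".join(text.strip() for text in NOTES_RE.findall(slide))


deck = (
    "---\nmarp: true\n---\n# Helmdeck\n"
    "<!-- speaker:notes Welcome to a quick demo of the slides.narrate pack. -->\n"
    "\n---\n\n# Thanks\n<!-- speaker:notes See you next time. -->"
)
notes = [speaker_notes(slide) for slide in split_slides(deck)]
```

The sketch treats every bare `---` line as a delimiter, so a `---` inside a fenced code block on a slide would be split incorrectly; a real parser would need to track fences.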
The pack is async by default — calling tools/call returns a SEP-1686 task envelope immediately; the work runs in the background. SDK clients that speak SEP-1686 surface the eventual result transparently. Otherwise use pack.start / pack.status / pack.result or pass webhook_url + webhook_secret.
Setup prerequisite
The pack runs without the ElevenLabs key (degrades to silent video, has_narration: false), but the typical case wants narration. Add via the Vault panel:
| Field | Value |
|---|---|
| Name | elevenlabs-key (exact string) |
| Type | api_key |
| Host pattern | api.elevenlabs.io |
| Value | Your ElevenLabs API key (sk_…) |
Get a key from https://elevenlabs.io/app/settings/api-keys. Free tier is 10,000 chars/month — plenty to validate a few decks end-to-end.
Inputs
| Field | Type | Required | Default | Notes |
|---|---|---|---|---|
| markdown | string | yes | — | Marp deck. Must preserve --- slide delimiters and <!-- speaker:notes --> HTML comments exactly — agent prompts that escape or reformat the markdown will produce broken output. The frontmatter must start ---\nmarp: true\n---. Custom design (themes, CSS) goes in the markdown's frontmatter — see slides.render §"Custom design" for the syntax; the same Marp render is used internally here. |
| voice_id | string | no | random from top 5 popular voices | ElevenLabs voice ID. The pack queries /v1/voices and picks if unset; falls back to EXAVITQu4vr4xnSDxMaL (Rachel) on listing failure. |
| model_id | string | no | "eleven_multilingual_v2" | ElevenLabs model. eleven_turbo_v2_5 is faster/cheaper; eleven_multilingual_v2 handles non-English. |
| resolution | string | no | "1920x1080" | Video resolution. Smaller = lower memory (try 1280x720 if you OOM at 4K). |
| fade_ms | number | no | 0 | Cross-slide fade duration in ms. 300–500 looks polished. |
| default_slide_duration | number | no | 5.0 | Seconds of silence for slides without speaker notes. |
| metadata_model | string | no | — | Provider/model for the engagement-metadata LLM call (e.g., openrouter/openai/gpt-4o-mini). When unset, no engagement object is returned. The prompt bakes in research-validated YouTube engagement rules (45-55 char title, 0:00 first chapter, ≥3 chapters when >7min, 3-5 hashtags, hook-30s structure). |
| hashtag_count | number | no | 4 | Number of hashtags to request; clamped to [3, 5] server-side (the research-validated range). Out-of-range values silently snap to 4. |
| category | string | no | "Science & Technology" | YouTube category. Operator input is authoritative — overrides whatever the LLM emits. |
| language | string | no | "en" | ISO 639 language code for the metadata. Operator input is authoritative. |
| captions_sidecar | boolean | no | true | Emit a sidecar captions.srt artifact alongside the MP4. YouTube/Vimeo auto-import as the CC track (the research-cited ~12-13% view boost path). Default-on; pass explicit false to suppress. Mirrors the mermaid pointer-bool default-on shape. |
| captions_burn_in | boolean | no | false | Render captions into every frame via ffmpeg's libass subtitles= filter — visible always-on subtitles. Required on platforms that don't surface CC tracks (Twitter/X embedded videos, LinkedIn embeds, raw MP4 downloads). Adds 5-50% per-segment encode wall-clock and 20-50 MB per encoder thread. On memory-tight hosts (3 GiB MemoryLimit), large 1080p decks may trigger the OOM-retry path; if libass-with-1-thread also OOMs, the run fails. |
| validate | boolean | no | true | Run av.validate as a post-concat step. The structured report lands in the output as a validation field; a sidecar validation.json artifact is also persisted. Default-on; pass explicit false to skip (e.g. benchmarks where the ~5-15s null-muxer decode pass matters). Soft-surface: validation failures never block the artifact from shipping. |
| webhook_url | string | no | — | Push the result to this URL on completion (push-based alternative to polling). |
| webhook_secret | string | no | — | HMAC signature secret for the webhook callback. |
| hero_image_prompt | string | no | — | When non-empty (v0.12.0 #146), the pack calls image.generate and inlines the resulting PNG INTO slide 1's content (no --- separator) so the per-slide TTS pipeline still sees a narrated slide. Skipped automatically during dry_run. Fails loud on missing fal-key credential. |
| hero_image_model | string | no | "fal-ai/flux/schnell" | fal.ai model used when hero_image_prompt is set. Browse choices via the helmdeck://image-models MCP resource. |
| length_intent | string | no | — | JIT density reporting (issue #530) — one of summary / thorough / exhaustive. Pack measures the deck's actual words-per-narrated-slide and reports how many slides fall inside the declared intent's range. Observational only — slides.narrate doesn't generate or trim notes (that's slides.outline's job). |
| inspect | boolean | no | false | When true, pack parses the markdown, computes density stats, returns the suggestion, and exits — no session, no vault, no Marp render. The cheapest deck-quality check available. |
| words_per_slide_min | number | no | — | Explicit numeric override of the intent range's floor. Both min and max must be set (with max >= min) to be honored. |
| words_per_slide_max | number | no | — | Explicit numeric override of the intent range's ceiling. See words_per_slide_min. |
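A few of the input rules in the table can be checked client-side before a deck is submitted. The following is a rough sketch under stated assumptions, not the pack's server-side validation: the function names are hypothetical, the frontmatter check only looks for the `---\nmarp: true` opening so that extra frontmatter keys (themes, CSS) still pass, and the hashtag handling follows the "out-of-range values snap to 4" wording of the hashtag_count row. Only the `markdown is required` message comes from the error table on this page; the other message is made up for illustration.

```python
FRONTMATTER_PREFIX = "---\nmarp: true\n"


def check_markdown(markdown: str) -> None:
    """Raise ValueError when the deck is empty or lacks the Marp frontmatter."""
    if not markdown:
        raise ValueError("markdown is required")
    if not markdown.startswith(FRONTMATTER_PREFIX):
        # Hypothetical message; the pack's real wording is not documented here.
        raise ValueError("deck must start with the Marp frontmatter block")


def normalize_hashtag_count(value=None) -> int:
    """Unset or out-of-range values fall back to the default of 4."""
    if value is None or not 3 <= value <= 5:
        return 4
    return int(value)


def explicit_word_range(floor=None, ceiling=None):
    """Return (floor, ceiling) only when both are set and ceiling >= floor."""
    if floor is None or ceiling is None or ceiling < floor:
        return None
    return (floor, ceiling)
```

Running these checks locally costs nothing and avoids starting an async job that would fail on its first step.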
Length intent (observational)
Unlike blog.rewrite_for_audience (#527) or podcast.generate (#533), where the pack actively sizes the output, slides.narrate takes the deck as-is: the narration text is already in the input markdown's <!-- speaker:notes --> comments (typically prepared by slides.outline). JIT length sizing here is therefore observational: the agent declares the density it expected, the pack measures what it actually got, and reports the gap.
| Intent | Words per narrated slide | Approx duration at 150 wpm |
|---|---|---|
| summary | 40-60 | ~16-24 s |
| thorough (default for stats baseline) | 80-120 | ~32-48 s |
| exhaustive | 150-220 | ~60-88 s |
Precedence: inspect:true → both words_per_slide_min + words_per_slide_max set ("explicit") → length_intent set ("intent:*") → no input ("default:reporting-only", thorough's range used as the stats baseline).
When the deck's measured density doesn't match the declared intent, the agent's next move is to re-run slides.outline with adjusted notes — not to re-run slides.narrate. The pack reports; the upstream caller decides.
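The range precedence and the density report can be sketched in a few lines. This is an illustration only: it assumes words are counted by splitting on whitespace (the pack's tokenization is not documented here), it leaves out inspect mode, and the function names are hypothetical. The ranges and output field names are the ones listed on this page.

```python
INTENT_RANGES = {
    "summary": (40, 60),
    "thorough": (80, 120),
    "exhaustive": (150, 220),
}


def resolve_range(length_intent=None, wmin=None, wmax=None):
    """Return ((floor, ceiling), length_intent_applied) following the precedence."""
    if wmin is not None and wmax is not None and wmax >= wmin:
        return (wmin, wmax), "explicit"
    if length_intent in INTENT_RANGES:
        return INTENT_RANGES[length_intent], f"intent:{length_intent}"
    # No usable input: thorough's range is only a reporting baseline.
    return INTENT_RANGES["thorough"], "default:reporting-only"


def density_report(notes, length_intent=None, wmin=None, wmax=None):
    """notes: one narration string per slide ('' for silent slides)."""
    (floor, ceiling), applied = resolve_range(length_intent, wmin, wmax)
    counts = [len(n.split()) for n in notes if n.strip()]  # silent slides excluded
    within = sum(1 for c in counts if floor <= c <= ceiling)
    return {
        "narrated_slide_count": len(counts),
        "source_words_per_slide_avg": sum(counts) / len(counts) if counts else 0.0,
        "source_words_per_slide_min": min(counts, default=0),
        "source_words_per_slide_max": max(counts, default=0),
        "slides_within_intent_range": within,
        "slides_outside_intent_range": len(counts) - within,
        "length_intent_applied": applied,
    }
```

A deck with one 50-word slide, one silent slide, and one 90-word slide declared as `summary` would report one slide inside the range and one outside, which is the signal to revisit the notes upstream.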
Outputs
| Field | Type | Notes |
|---|---|---|
| video_artifact_key | string | slides.narrate/<rand>-deck.mp4. Resolve via /api/v1/artifacts/<key>. |
| video_size | number | Bytes. Capped at 256 MiB. |
| slide_count | number | Number of slides rendered. |
| total_duration_s | number | Cumulative video length, post-TTS — the authoritative timing after ElevenLabs has actually synthesized. |
| has_narration | boolean | true if TTS succeeded; false if the ElevenLabs key was missing or the API errored on every slide. |
| voice_used | string | Voice ID that narrated. Empty when has_narration: false. |
| engagement_artifact_key | string | Present only when metadata_model was set. JSON sidecar file with the engagement metadata. v0.26.0 breaking change: was metadata_artifact_key in v0.25.x. |
| engagement | object | Same content as engagement_artifact_key's JSON, inline for convenience. Shape: {title, title_char_count, description, chapters: [{timestamp, title, seconds}], hashtags, tags, hook_30s, captions_recommended, category, language, format_ceiling_note}. The structural rules (0:00 first chapter, title ≤ 60 chars, ≥3 chapters when video > 7min, 3-5 hashtags) are baked into the prompt as hard constraints; category and language are server-side-authoritative. v0.26.0 breaking change: was metadata in v0.25.x. |
| captions_artifact_key | string | Sidecar SRT file (captions.srt). Default-present unless captions_sidecar:false. YouTube Studio "Subtitles → Upload file → With timing" auto-imports as the CC track. Empty string when the sidecar was suppressed or the artifact-store Put failed (Put failures are logged but don't fail the run). |
| captions_burned_in | boolean | Always emitted (even when false) so consumers can branch on its presence. true only when captions_burn_in:true was passed AND the SRT was successfully written to the burn-in path. |
| hero_image_model_used | string | Only when hero_image_prompt was set. Echoes the model that actually generated the hero. |
| source_words_per_slide_avg | number | Average word count across slides with non-empty notes. Silent slides excluded so intro/outro placeholders don't drag the signal down. |
| source_words_per_slide_min / _max | number | Lowest / highest word count among the deck's narrated slides. |
| narrated_slide_count | number | Count of slides whose notes are non-empty. |
| slides_within_intent_range | number | How many narrated slides fall inside the declared intent's [floor, ceiling] range. |
| slides_outside_intent_range | number | How many fall outside. |
| length_intent_applied | string | Where the range came from — intent:summary / intent:thorough / intent:exhaustive / explicit / default:reporting-only. |
| truncated | boolean | true when the engagement-metadata LLM hit finish_reason=length. TTS calls are HTTP (not gateway dispatch) so don't surface a truncation signal — this is engagement-only. |
| inspect | boolean | Inspect-mode only — always true in the inspect response. |
| suggested_intent | string | Inspect-mode only — which intent's range the deck's actual density best matches. Empty when no slides are narrated. |
| reason | string | Inspect-mode only — human-readable summary of the measurement and suggestion. |
Captions
| Output mode | Default | What it's for |
|---|---|---|
| Sidecar SRT (captions.srt) | On | YouTube/Vimeo auto-import → user-toggleable CC. Research-cited ~12-13% view boost. Essentially free (no encode cost). |
| Burned-in | Off (opt-in) | Always-visible subtitles for Twitter/X embeds, LinkedIn, raw MP4 downloads where CC tracks don't surface. Adds 5-50% encode wall-clock + 20-50 MB libass overhead per encoder thread. |
Burn-in OOM caveat — the existing per-segment OOM-retry path (primary -threads 4 -preset medium → retry -threads 1 -preset veryfast) gets less headroom when libass is in the chain. Large 1080p decks on a 3 GiB MemoryLimit host may push the primary into OOM and, with libass adding 20-50 MB per thread to the retry's working set, the retry can OOM too — at which point the entire encode fails. If you hit this, leave captions_burn_in:false and rely on the sidecar SRT (YouTube/Vimeo CC import handles the rest).
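The primary-then-retry shape described above can be sketched as follows. How the pack actually detects an OOM kill is not stated on this page; this sketch assumes it keys off exit code 137, the value the error table associates with SIGKILL. The function names and the `input_args` parameter are hypothetical, and the runner is injectable so the logic can be exercised without ffmpeg installed.

```python
import subprocess

PRIMARY = ["-threads", "4", "-preset", "medium"]
RETRY = ["-threads", "1", "-preset", "veryfast"]
OOM_EXIT = 137  # 128 + SIGKILL (9)


def run_ffmpeg(argv):
    """Run ffmpeg and return a shell-style exit code."""
    rc = subprocess.run(argv, check=False).returncode
    # Python reports death-by-signal as a negative number; map -9 to 137.
    return 128 - rc if rc < 0 else rc


def encode_with_retry(input_args, output_path, run=run_ffmpeg):
    """Try the primary settings, then the low-memory retry on an OOM kill."""
    code = None
    for flags in (PRIMARY, RETRY):
        argv = ["ffmpeg", "-y", *input_args, *flags, output_path]
        code = run(argv)
        if code == 0:
            return argv
        if code != OOM_EXIT:
            break  # not an OOM kill, so retrying with fewer threads will not help
    raise RuntimeError(f"ffmpeg exit {code}")
```

When both attempts are killed the function raises, which mirrors the "entire encode fails" outcome described in the caveat.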
YouTube upload acceptance check: after merge, regenerate a builtin.repo-presentation run, download the captions_artifact_key artifact, and upload it via YouTube Studio → Subtitles → "Upload file → With timing". YouTube must accept the file and auto-import as the en CC track. If it rejects, the SRT shape regressed and the format-precision tests in slides_captions_test.go need to be tightened.
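For readers who want to see what a sidecar of this kind contains, here is a minimal SRT builder. It assumes one cue per narrated slide, timed from the per-slide durations; the pack's real cue segmentation is not documented on this page, and both function names are hypothetical. The timestamp layout (`HH:MM:SS,mmm` with a comma before the milliseconds) is the standard SubRip form.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, e.g. 00:01:02,500."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def build_srt(cues):
    """cues: ordered (text, duration_seconds) pairs, one per slide.

    Slides with empty text emit no cue but still advance the clock.
    """
    blocks, start, index = [], 0.0, 1
    for text, duration in cues:
        end = start + duration
        if text.strip():
            blocks.append(
                f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text.strip()}\n"
            )
            index += 1
        start = end
    return "\n".join(blocks)
```

A single long cue per slide is the simplest possible shape; splitting narration into shorter cues would read better on screen but needs word-level timing that this sketch does not have.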
Engagement metadata — what's baked in vs operator-overridable
| Bucket | Field | Rule |
|---|---|---|
| Non-overridable (enforced by prompt) | First chapter | Always "0:00" / seconds=0 — YouTube rejects chapter lists without a 0:00 anchor. |
| Non-overridable | Chapter floor | ≥3 chapters when video duration > 7 min, ≥10s between consecutive starts. |
| Non-overridable | Title cap | ≤60 chars (target 45-55 for mobile-truncation safety). |
| Non-overridable | Hashtag relevance | No #viral/#fyp/#trending — YouTube validates relevance against the transcript. |
| Non-overridable | Hook structure | Pattern interrupt → payoff promise → commitment hook, by second 15. |
| Operator-tunable | hashtag_count | Clamped to [3, 5]. |
| Operator-tunable | category / language | Authoritative override of whatever the LLM emits. |
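The rules in this table are enforced through the prompt, so a consumer may still want to sanity-check the returned engagement object. Below is a hedged client-side sketch, not something the pack is documented to run: `check_engagement` and `apply_operator_overrides` are hypothetical names, and the field names (`title`, `chapters[].seconds`, `hashtags`, `category`, `language`) follow the engagement shape listed under Outputs.

```python
BANNED_HASHTAGS = {"#viral", "#fyp", "#trending"}


def check_engagement(engagement: dict, video_seconds: float) -> list[str]:
    """Return a list of rule violations (empty when everything passes)."""
    problems = []
    if len(engagement.get("title", "")) > 60:
        problems.append("title exceeds 60 chars")
    starts = [c["seconds"] for c in engagement.get("chapters", [])]
    if starts and starts[0] != 0:
        problems.append("first chapter must start at 0:00")
    if video_seconds > 7 * 60 and len(starts) < 3:
        problems.append("need at least 3 chapters for videos over 7 min")
    if any(b - a < 10 for a, b in zip(starts, starts[1:])):
        problems.append("chapter starts must be at least 10 s apart")
    tags = [h.lower() for h in engagement.get("hashtags", [])]
    if not 3 <= len(tags) <= 5:
        problems.append("expected 3-5 hashtags")
    if BANNED_HASHTAGS & set(tags):
        problems.append("generic hashtag present")
    return problems


def apply_operator_overrides(engagement, category="Science & Technology", language="en"):
    """Operator inputs win over whatever the LLM emitted."""
    return {**engagement, "category": category, "language": language}
```

The override helper returns a new dict instead of mutating the input, so the raw LLM output stays available for debugging.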
Honest scope (format_ceiling_note)
The engagement.format_ceiling_note field — always present when engagement is enabled — carries this constant string:
Slide-deck-with-voiceover sits in the lower retention bracket vs talking-head (avg ~15-20% on 8-15min vs 20-25%). Metadata can move this to median within format category; cannot bridge the gap to talking-head. Best used for asynchronous explainer/educational content where the creator isn't on camera.
This is intentional. The research-validated metadata defaults move a slide-deck-with-voiceover video from bottom-decile to median within its format category; they cannot close the 5-12 percentage-point structural gap to talking-head video. If you want talking-head retention, you need talking-head visuals — not better metadata. See the avbench workflow for the monthly regression check on the structural rules.
Vault credentials needed
elevenlabs-key — type api_key, host pattern api.elevenlabs.io. Optional — without it the pack still ships an MP4, just silent.
TTS quality knob (HELMDECK_ELEVENLABS_FORMAT)
The pack requests 192 kbps MP3 at 44.1 kHz from ElevenLabs by default (mp3_44100_192, Creator-tier or above). The downstream avenc pipeline is sample-rate-matched to that source so the per-segment encode and final concat re-encode don't introduce 44.1 → 48 kHz resampling artifacts (audible high-frequency aliasing under libswresample).
If your ElevenLabs subscription is on the Starter tier (capped at mp3_44100_128), set this environment variable on the helmdeck process to downgrade so the API doesn't reject the request:
export HELMDECK_ELEVENLABS_FORMAT=mp3_44100_128
The env var also covers PCM upgrades on higher tiers (e.g. pcm_44100) when you want to eliminate source-side MP3 loss entirely.
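As an illustration of where the format string ends up, the sketch below builds (but does not send) a request against ElevenLabs' public text-to-speech endpoint, reading the format from the environment with the documented default. Whether the pack constructs its requests this way is an assumption; `build_tts_request` is a hypothetical helper, and the key would come from the vault in practice.

```python
import json
import os
import urllib.parse
import urllib.request

DEFAULT_FORMAT = "mp3_44100_192"


def build_tts_request(text, voice_id, api_key, model_id="eleven_multilingual_v2"):
    """Build a POST request for one slide's narration; nothing is sent here."""
    output_format = os.environ.get("HELMDECK_ELEVENLABS_FORMAT") or DEFAULT_FORMAT
    query = urllib.parse.urlencode({"output_format": output_format})
    url = f"https://api.elevenlabs.io/v1/text-to-speech/{voice_id}?{query}"
    body = json.dumps({"text": text, "model_id": model_id}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"xi-api-key": api_key, "Content-Type": "application/json"},
    )
```

Passing the result to `urllib.request.urlopen` would perform the call; a tier that does not allow the requested format is expected to reject it, which is the failure the environment variable is meant to avoid.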
Use it from your agent (OpenClaw chat-UI worked example)
📌 The transcript below shows the narrated path (has_narration: true): the elevenlabs-key is in the vault, ElevenLabs synthesized 2 slides of speech, and ffmpeg encoded them into a 199 KB MP4. The same prompt without the key in the vault produces a silent 47 KB MP4 (has_narration: false); the silent-fallback transcript was the original capture for this page. The transcript is also a clean reference for the async polling pattern (pack.start → pack.status × N → pack.result).
Prompt (sent in OpenClaw chat UI / openclaw-cli agent):
Use helmdeck__slides-narrate with this 2-slide deck: "---\nmarp: true\n---\n# Helmdeck\n<!-- speaker:notes Welcome to a quick demo of the slides.narrate pack. -->\n\n---\n\n# Thanks\n<!-- speaker:notes See you next time. -->" and model_id=eleven_turbo_v2_5. Tell me the video_artifact_key, slide_count, total_duration_s, and whether has_narration is true.
Tool call (26 calls, no failures):
{
"name": "helmdeck__slides-narrate",
"arguments": {
"markdown": "---\nmarp: true\n---\n# Helmdeck\n<!-- speaker:notes Welcome to a quick demo of the slides.narrate pack. -->\n\n---\n\n# Thanks\n<!-- speaker:notes See you next time. -->",
"model_id": "eleven_turbo_v2_5"
}
}
Pack response (truncated):
{"job_id":"834267c19314e628263ab04b93313506","pack":"slides.narrate","state":"working","progress":0,"started_at":"2026-05-09T00:58:38Z"}
{
"name": "helmdeck__pack-status",
"arguments": {
"job_id": "834267c19314e628263ab04b93313506"
}
}
Pack response (truncated):
{"job_id":"834267c19314e628263ab04b93313506","pack":"slides.narrate","state":"working","progress":0,"started_at":"2026-05-09T00:58:38Z"}
{
"name": "helmdeck__pack-status",
"arguments": {
"job_id": "834267c19314e628263ab04b93313506"
}
}
Pack response (truncated):
{"job_id":"834267c19314e628263ab04b93313506","pack":"slides.narrate","state":"working","progress":5,"message":"parsed 2 slides","started_at":"2026-05-09T00:58:38Z"}
{
"name": "helmdeck__pack-status",
"arguments": {
"job_id": "834267c19314e628263ab04b93313506"
}
}
Pack response (truncated):
{"job_id":"834267c19314e628263ab04b93313506","pack":"slides.narrate","state":"working","progress":5,"message":"parsed 2 slides","started_at":"2026-05-09T00:58:38Z"}
{
"name": "helmdeck__pack-status",
"arguments": {
"job_id": "834267c19314e628263ab04b93313506"
}
}
Pack response (truncated):
{"job_id":"834267c19314e628263ab04b93313506","pack":"slides.narrate","state":"working","progress":5,"message":"parsed 2 slides","started_at":"2026-05-09T00:58:38Z"}
{
"name": "helmdeck__pack-status",
"arguments": {
"job_id": "834267c19314e628263ab04b93313506"
}
}
Pack response (truncated):
{"job_id":"834267c19314e628263ab04b93313506","pack":"slides.narrate","state":"working","progress":5,"message":"parsed 2 slides","started_at":"2026-05-09T00:58:38Z"}
{
"name": "helmdeck__pack-status",
"arguments": {
"job_id": "834267c19314e628263ab04b93313506"
}
}
Pack response (truncated):
{"job_id":"834267c19314e628263ab04b93313506","pack":"slides.narrate","state":"working","progress":5,"message":"parsed 2 slides","started_at":"2026-05-09T00:58:38Z"}
{
"name": "helmdeck__pack-status",
"arguments": {
"job_id": "834267c19314e628263ab04b93313506"
}
}
Pack response (truncated):
{"job_id":"834267c19314e628263ab04b93313506","pack":"slides.narrate","state":"working","progress":5,"message":"parsed 2 slides","started_at":"2026-05-09T00:58:38Z"}
{
"name": "helmdeck__pack-status",
"arguments": {
"job_id": "834267c19314e628263ab04b93313506"
}
}
Pack response (truncated):
{"job_id":"834267c19314e628263ab04b93313506","pack":"slides.narrate","state":"working","progress":5,"message":"parsed 2 slides","started_at":"2026-05-09T00:58:38Z"}
{
"name": "helmdeck__pack-status",
"arguments": {
"job_id": "834267c19314e628263ab04b93313506"
}
}
Pack response (truncated):
{"job_id":"834267c19314e628263ab04b93313506","pack":"slides.narrate","state":"working","progress":5,"message":"parsed 2 slides","started_at":"2026-05-09T00:58:38Z"}
{
"name": "helmdeck__pack-status",
"arguments": {
"job_id": "834267c19314e628263ab04b93313506"
}
}
Pack response (truncated):
{"job_id":"834267c19314e628263ab04b93313506","pack":"slides.narrate","state":"working","progress":5,"message":"parsed 2 slides","started_at":"2026-05-09T00:58:38Z"}
{
"name": "helmdeck__pack-status",
"arguments": {
"job_id": "834267c19314e628263ab04b93313506"
}
}
Pack response (truncated):
{"job_id":"834267c19314e628263ab04b93313506","pack":"slides.narrate","state":"working","progress":10,"message":"audio 1/2","started_at":"2026-05-09T00:58:38Z"}
{
"name": "helmdeck__pack-status",
"arguments": {
"job_id": "834267c19314e628263ab04b93313506"
}
}
Pack response (truncated):
{"job_id":"834267c19314e628263ab04b93313506","pack":"slides.narrate","state":"working","progress":10,"message":"audio 1/2","started_at":"2026-05-09T00:58:38Z"}
{
"name": "helmdeck__pack-status",
"arguments": {
"job_id": "834267c19314e628263ab04b93313506"
}
}
Pack response (truncated):
{"job_id":"834267c19314e628263ab04b93313506","pack":"slides.narrate","state":"working","progress":50,"message":"encoding segment 1/2","started_at":"2026-05-09T00:58:38Z"}
{
"name": "helmdeck__pack-status",
"arguments": {
"job_id": "834267c19314e628263ab04b93313506"
}
}
Pack response (truncated):
{"job_id":"834267c19314e628263ab04b93313506","pack":"slides.narrate","state":"working","progress":50,"message":"encoding segment 1/2","started_at":"2026-05-09T00:58:38Z"}
{
"name": "helmdeck__pack-result",
"arguments": {
"job_id": "834267c19314e628263ab04b93313506"
}
}
Pack response (truncated):
{"error":"not_ready","message":"pack.result: job 834267c19314e628263ab04b93313506 still working — keep polling pack.status"}
{
"name": "helmdeck__pack-status",
"arguments": {
"job_id": "834267c19314e628263ab04b93313506"
}
}
Pack response (truncated):
{"job_id":"834267c19314e628263ab04b93313506","pack":"slides.narrate","state":"working","progress":50,"message":"encoding segment 1/2","started_at":"2026-05-09T00:58:38Z"}
{
"name": "helmdeck__pack-status",
"arguments": {
"job_id": "834267c19314e628263ab04b93313506"
}
}
Pack response (truncated):
{"job_id":"834267c19314e628263ab04b93313506","pack":"slides.narrate","state":"working","progress":50,"message":"encoding segment 1/2","started_at":"2026-05-09T00:58:38Z"}
{
"name": "helmdeck__pack-status",
"arguments": {
"job_id": "834267c19314e628263ab04b93313506"
}
}
Pack response (truncated):
{"job_id":"834267c19314e628263ab04b93313506","pack":"slides.narrate","state":"working","progress":50,"message":"encoding segment 1/2","started_at":"2026-05-09T00:58:38Z"}
{
"name": "helmdeck__pack-status",
"arguments": {
"job_id": "834267c19314e628263ab04b93313506"
}
}
Pack response (truncated):
{"job_id":"834267c19314e628263ab04b93313506","pack":"slides.narrate","state":"working","progress":50,"message":"encoding segment 1/2","started_at":"2026-05-09T00:58:38Z"}
{
"name": "helmdeck__pack-status",
"arguments": {
"job_id": "834267c19314e628263ab04b93313506"
}
}
Pack response (truncated):
{"job_id":"834267c19314e628263ab04b93313506","pack":"slides.narrate","state":"working","progress":50,"message":"encoding segment 1/2","started_at":"2026-05-09T00:58:38Z"}
{
"name": "helmdeck__pack-status",
"arguments": {
"job_id": "834267c19314e628263ab04b93313506"
}
}
Pack response (truncated):
{"job_id":"834267c19314e628263ab04b93313506","pack":"slides.narrate","state":"working","progress":70,"message":"encoding segment 2/2","started_at":"2026-05-09T00:58:38Z"}
{
"name": "helmdeck__pack-status",
"arguments": {
"job_id": "834267c19314e628263ab04b93313506"
}
}
Pack response (truncated):
{"job_id":"834267c19314e628263ab04b93313506","pack":"slides.narrate","state":"working","progress":70,"message":"encoding segment 2/2","started_at":"2026-05-09T00:58:38Z"}
{
"name": "helmdeck__pack-status",
"arguments": {
"job_id": "834267c19314e628263ab04b93313506"
}
}
Pack response (truncated):
{"job_id":"834267c19314e628263ab04b93313506","pack":"slides.narrate","state":"working","progress":90,"message":"concatenating final video","started_at":"2026-05-09T00:58:38Z"}
{
"name": "helmdeck__pack-status",
"arguments": {
"job_id": "834267c19314e628263ab04b93313506"
}
}
Pack response (truncated):
{"job_id":"834267c19314e628263ab04b93313506","pack":"slides.narrate","state":"completed","progress":100,"message":"uploading video artifact","started_at":"2026-05-09T00:58:38Z","ended_at":"2026-05-09T00:59:19Z"}
{
"name": "helmdeck__pack-result",
"arguments": {
"job_id": "834267c19314e628263ab04b93313506"
}
}
Pack response (truncated):
{"pack":"slides.narrate","version":"v1","output":{"has_narration":true,"metadata_artifact_key":"","slide_count":2,"total_duration_s":7.810612,"video_artifact_key":"slides.narrate/487371c8459e24ad-video.mp4","video_size":198916,"voice_used":"21m00Tcm4TlvDq8ikWAM"},"artifacts":[{"key":"slides.narrate/487371c8459e24ad-video.mp4","url":"http://garage:3900/helmdeck-artifacts/slides.narrate/487371c8459e24ad-video.mp4?X-Amz-Algorithm=AWS4-HMAC-SHA256\u0026X-Amz-Credential=GKaef177ebe9fd814dc103b84b%2F20260509%2Fgarage%2Fs3%2Faws4_request\u0026X-Amz-Date=20260509T005919Z\u0026X-Amz-Expires=900\u0026X-Amz-SignedHeaders=host\u0026X-Amz-Signature=dfc433b8c42e9bc554b9709da1182d58578335b4a93a89301e010446bf431cad","size":198916,"content_type":"video/mp4","created_at":"2026-05-09T00:59:19.304111422Z","pa
...[truncated]
Agent's final reply:
video_artifact_key:
slides.narrate/487371c8459e24ad-video.mp4
slide_count: 2
total_duration_s: 7.81 seconds
has_narration: true
Verified via OpenClaw 2026.5.6 + helmdeck v0.9.0-dev + openrouter/openai/gpt-oss-120b on 2026-05-07 (cost: $0.0187).
Developer reference (curl)
curl -fsS -X POST http://localhost:3000/api/v1/packs/slides.narrate \
-H "Authorization: Bearer $JWT" -H 'Content-Type: application/json' \
-d '{
"markdown": "---\nmarp: true\n---\n# Helmdeck\n<!-- speaker:notes Welcome to a quick demo of the slides.narrate pack. -->\n\n---\n\n# Thanks\n<!-- speaker:notes See you next time. -->",
"model_id": "eleven_turbo_v2_5"
}'
Because the pack is Async: true, this returns a SEP-1686 task envelope immediately:
{
"_meta": {
"modelcontextprotocol.io/related-task": {
"taskId": "task-abc123"
}
},
"content": [{"type": "text", "text": "task started"}]
}
Then poll pack.status (until state == "completed") and call pack.result for the full output:
{
"pack": "slides.narrate",
"version": "v1",
"output": {
"video_artifact_key": "slides.narrate/xyz789-deck.mp4",
"video_size": 3915264,
"slide_count": 2,
"total_duration_s": 12.4,
"has_narration": true,
"voice_used": "EXAVITQu4vr4xnSDxMaL"
}
}
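The poll-then-fetch sequence can be wrapped in a small helper. This is a sketch under assumptions: `call_tool(name, arguments)` stands in for whatever MCP client you use and is hypothetical, the tool names are the ones visible in the transcript on this page, and any state other than `working` or `completed` is treated as a terminal failure because the full state list is not documented here.

```python
import time


def wait_for_result(call_tool, job_id, interval_s=2.0, timeout_s=1800.0, sleep=time.sleep):
    """Poll pack.status until the job completes, then fetch pack.result."""
    waited = 0.0
    while waited <= timeout_s:
        status = call_tool("helmdeck__pack-status", {"job_id": job_id})
        state = status.get("state")
        if state == "completed":
            return call_tool("helmdeck__pack-result", {"job_id": job_id})
        if state != "working":
            raise RuntimeError(f"job {job_id} ended in state {state!r}")
        sleep(interval_s)
        waited += interval_s
    raise TimeoutError(f"job {job_id} still working after {timeout_s} s")
```

Only asking for the result once the status is `completed` avoids the `not_ready` round trip that appears mid-transcript, and a polling interval of a few seconds cuts down on status calls that return the same progress value.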
Error codes
| Code | Triggers | Captured response |
|---|---|---|
| invalid_input | markdown empty | markdown is required |
| invalid_input | Marp render exit non-zero (malformed deck) | marp exit N: <stderr> |
| handler_failed | ElevenLabs API rejected the key (401) | Pack still ships silent video; has_narration: false. Not an error. |
| handler_failed | ffmpeg encoding failed (resolution OOM, missing codec) | ffmpeg exit 137: … (137 = SIGKILL, usually OOM — drop resolution) |
| handler_failed | Final video exceeds 256 MiB cap | final video N bytes exceeds 256 MiB cap |
| timeout | Pack-internal timeout (30 min default) | pack timed out after 30 minutes |
Session chaining
Required (creates if absent). Each slides.narrate call gets a fresh session by default — high memory ceiling (3 GiB) for ffmpeg encoding. Stateless from the agent's perspective; the input is the deck.
Async behavior
Async: true. Wall-clock scales with slide count: roughly 30–60 seconds per slide at 1080p (TTS dominates, then per-segment ffmpeg). A 20-slide deck is typically 10–20 minutes end-to-end. Plan accordingly:
- Path 1 (recommended on SDK clients): just call the pack normally; SEP-1686-aware SDKs auto-poll tasks/get and surface the result transparently when it lands. OpenClaw 2026.5+ supports this.
- Path 2 (universal fallback): manual pack.start / pack.status / pack.result polling.
- Path 3 (no polling): pass webhook_url + webhook_secret. The pack returns a task envelope immediately and POSTs the result to the webhook on completion.
See SKILLS.md §"Long-running packs" for the full decision table.
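For Path 3, the receiving endpoint should verify the callback before trusting it. This page only says the secret is used for an HMAC signature, so the details below are assumptions: HMAC-SHA256 over the raw request body, hex-encoded, delivered in a header whose name you would need to confirm in SKILLS.md or the source. `verify_webhook` is a hypothetical helper.

```python
import hashlib
import hmac


def verify_webhook(secret: str, raw_body: bytes, signature_hex: str) -> bool:
    """Check a hex HMAC-SHA256 signature over the exact bytes received."""
    expected = hmac.new(secret.encode("utf-8"), raw_body, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking the match position through timing.
    return hmac.compare_digest(expected, signature_hex)
```

Verify against the raw bytes as received; re-serializing parsed JSON can change whitespace or key order and break the signature.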
YouTube optimization tips
slides.narrate is designed to produce videos in the YouTube monetization sweet spot (8–12 minutes — long enough for mid-roll ads at ≥8 min, short enough for retention). Each slide's on-screen time = the length of its TTS audio at ~150–160 wpm. Targets for a 20–25 slide deck:
| Words per slide (in <!-- speaker:notes -->) | Resulting video length |
|---|---|
| <30 | <4 min (too short for YouTube; feels thin) |
| 30–60 | 4–7 min (short-form) |
| 80–120 | 8–12 min (sweet spot) |
| 150–200 | 15–20 min (long-form, viable for tutorials) |
| 250+ | 25+ min (risky on retention) |
When the user asks "make me a 10-minute video from N slides" without specifying word counts, target ~1500/N words per slide.
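The ~1500/N rule is simply target minutes times speaking rate, divided across the slides. A small sketch, assuming the 150 wpm rate used in the length-intent table (the function name is hypothetical):

```python
def target_words_per_slide(minutes: float, slide_count: int, wpm: int = 150) -> int:
    """Words of speaker notes per slide needed to hit a target video length."""
    if slide_count <= 0:
        raise ValueError("slide_count must be positive")
    return round(minutes * wpm / slide_count)


# A 10-minute video from 20 slides needs 10 * 150 / 20 = 75 words per slide.
words = target_words_per_slide(10, 20)
```

The estimate ignores fades and silent slides, so treat it as a starting point and check total_duration_s on the finished video.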
See also
- Catalog row: PACKS.md — slides.narrate.
- Source: internal/packs/builtin/slides_narrate.go.
- Companion packs: slides.render (just the deck), pack.start / pack.status / pack.result (manual async polling).
- Vault setup: tutorials/install-ui-walkthrough.md.