
podcast.generate

The "produce a podcast MP3" pack. Caller supplies a speakers map (1..N speaker name → voice ID) and one of three script-source modes:

  • Mode A — script: agent provides the structured turns directly
  • Mode B — prompt + model: pack calls the gateway LLM to generate dialogue from a prompt
  • Mode C — source_url or source_text + model: pack scrapes long-form content (or accepts inline text) and converts it into speaker-tagged dialogue

The pack runs each turn through the configured TTS engine, concatenates the per-turn MP3s with silence padding via ffmpeg, and returns a single MP3 artifact. Day 1 ships ElevenLabs as the only engine; the engine input field is reserved so future PRs can add PlayHT, Hume.ai, Resemble.ai, etc. without touching the pack handler.
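The interleaving of turns and silence can be pictured as a small planning step. The following is a minimal sketch, not the pack's implementation (which lives in internal/avenc.ConcatAudio and may work differently). It assumes ffmpeg's concat demuxer list format (`file '<path>'` lines) and uses hypothetical file names.

```python
def build_concat_list(turn_files, silence_file):
    """Interleave per-turn MP3 paths with one silence clip between consecutive turns.

    Returns the text of an ffmpeg concat-demuxer list file. Paths containing
    single quotes would need escaping, which this sketch does not handle.
    """
    if not turn_files:
        raise ValueError("at least one turn is required")
    lines = []
    for index, path in enumerate(turn_files):
        if index > 0:
            # Silence goes between turns only, not before the first or after the last.
            lines.append(f"file '{silence_file}'")
        lines.append(f"file '{path}'")
    return "\n".join(lines) + "\n"


# Hypothetical file names, for illustration only.
print(build_concat_list(["turn-000.mp3", "turn-001.mp3", "turn-002.mp3"], "silence-600ms.mp3"))
```

A list like this is what a command such as `ffmpeg -f concat -safe 0 -i list.txt out.mp3` would consume. Whether the pack uses the demuxer or a filter graph is not stated here.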

Five closed-set themes bake podcast best-practices into the LLM system prompt (modes B and C):

| Theme | What you get |
| --- | --- |
| interview | Host + guest format, open-ended questions, guest does ~70% of talking, actionable takeaway |
| debate | Two opposing positions, steel-man required, moderator-style closer |
| news-roundup | 3–5 fast stories, sponsor-break placeholder at midpoint, "watching this week" closer |
| deep-dive | Single topic, narrative arc (problem → exploration → resolution → implication) |
| solo-essay | One speaker, monologue, written-for-the-ear pacing, 8–12 min sweet spot |

Distinct from tts.synthesize (single-voice, single-line) and video.generate (talking-head video).

Setup prerequisite

For day-1 ElevenLabs engine, add the API key to the Vault panel:

| Field | Value |
| --- | --- |
| Name | elevenlabs-key (exact string; this is the pack default, override with the credential input) |
| Type | api_key |
| Host pattern | api.elevenlabs.io |
| Value | Your ElevenLabs API key (sk_…) |

Same credential as slides.narrate. Optional — without it, the pack still ships an MP3 (silent, with has_narration: false) so the structure stays intact for testing.

For mode C with source_url, the Firecrawl overlay must be enabled (HELMDECK_FIRECRAWL_ENABLED=true).

Inputs

| Field | Type | Required | Default | Notes |
| --- | --- | --- | --- | --- |
| engine | string | no | "elevenlabs" | Closed set; day 1 only "elevenlabs". |
| speakers | object | yes | | {name: voice_id} map. Non-empty. Use one entry for solo monologue, two+ for dialogue. |
| script | array | one-of | | Mode A. [{speaker, text}, ...]. Every speaker value must exist in speakers. |
| prompt | string | one-of | | Mode B. Plain-English description of what the podcast should be. Requires model. |
| source_url | string | one-of | | Mode C-1. URL to scrape via Firecrawl. Requires model + Firecrawl overlay. |
| source_text | string | one-of | | Mode C-2. Inline long-form markdown to riff on. Requires model. |
| model | string | with prompt/source_* | | Provider/model for script generation, e.g. openrouter/openai/gpt-4o-mini. |
| max_tokens | number | no | 4096 | LLM cap for script generation. |
| model_id | string | no | "eleven_turbo_v2_5" | ElevenLabs TTS model. eleven_turbo_v2_5 is fast/cheap; eleven_multilingual_v2 for non-English. |
| theme | string | no | "deep-dive" | One of: interview, debate, news-roundup, deep-dive, solo-essay. Influences modes B/C only. |
| duration_target_min | number | no | 8 | Explicit numeric override of the target length in minutes (modes B/C). At ~150 wpm, an 8-min target asks for ~1200 total words. Takes precedence over length_intent; preserved verbatim for back-compat. |
| length_intent | string | no | | JIT length sizing (issue #528): one of summary / thorough / exhaustive. Pack measures the source (script text or source_text) and picks a duration_target_min from the heuristic table below. Honored only when duration_target_min is unset; back-compat with no-input callers preserved (legacy 8-min default). |
| inspect | boolean | no | false | When true, pack returns the measurement + suggested duration and does NOT call the gateway / open a session / touch vault. Works without a dispatcher or session executor; it is a pure planning helper. Does NOT scrape source_url (use source_text if you want a measured suggestion for inline content). |
| silence_between_turns_ms | number | no | 600 | Pause between consecutive turns (ms). 600ms feels conversational; 200ms feels rushed; 1000ms feels formal. |
| generate_cover_prompt | boolean | no | false | When true, output includes cover_image_prompt, a one-paragraph prompt the agent can pass to a future image-gen pack for cover art. |
| cover_image | boolean | no | false | When true, the pack auto-generates the cover via image.generate and surfaces cover_image_artifact_key in the output. Uses the same prompt as generate_cover_prompt. Honored only outside dry_run. Added v0.12.0 (#146). |
| cover_image_model | string | no | "fal-ai/flux/schnell" | fal.ai model used when cover_image:true. Browse choices via the helmdeck://image-models MCP resource. |
| credential | string | no | "elevenlabs-key" | Vault credential name. |
| metadata_model | string | no | "openrouter/auto" | Provider/model for the engagement-metadata LLM call. Default-on (the v0.26.0 distinction vs slides.narrate, which stays opt-in). Pass "" (empty string, NOT missing) to disable. Adds one LLM call per podcast run (~$0.001 on openrouter/auto). |
| cta_style | string | no | "natural" | CTA tone: natural / direct / none. Placement is fixed at mid-roll (research-validated). |
| language | string | no | "en" | ISO 639 language code. Operator input is authoritative and overrides whatever the LLM emits. |
| validate | boolean | no | true | Run av.validate as a post-concat step against the final MP3. The structured report lands in the output as a validation field; a sidecar validation.json artifact is also persisted. Default-on; pass false to skip. Audio-only checks run (codec, packet contiguity, RMS sweep, loudness LUFS, silence runs); the mp4:* and consistency:audio_video_duration checks skip automatically since this pack outputs MP3, not MP4. |

Validation:

  • Exactly one of script / prompt / (source_url OR source_text)
  • prompt and source_* modes require model (skipped when inspect:true — inspect doesn't call the model)
  • Every speaker referenced in script (mode A) must exist in speakers
  • theme must be in the closed set
  • engine must be "elevenlabs" (day 1)
  • source_url requires HELMDECK_FIRECRAWL_ENABLED=true (skipped when inspect:true — inspect doesn't scrape)
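The rules above can be mirrored client-side before a request is sent. This is a rough sketch: the function name and the problem messages are hypothetical and are not the pack's actual error strings. It also assumes that source_url and source_text count as a single mode, which the rules imply but do not spell out.

```python
THEMES = {"interview", "debate", "news-roundup", "deep-dive", "solo-essay"}


def validate_request(req, firecrawl_enabled=False):
    """Return a list of problems; an empty list means these checks pass."""
    problems = []
    speakers = req.get("speakers") or {}
    inspect = bool(req.get("inspect", False))
    if not speakers:
        problems.append("speakers map is required and must be non-empty")

    has_script = bool(req.get("script"))
    has_prompt = bool(req.get("prompt"))
    # Assumption: source_url and source_text together form one mode (mode C).
    has_source = bool(req.get("source_url") or req.get("source_text"))
    if [has_script, has_prompt, has_source].count(True) != 1:
        problems.append("exactly one of script / prompt / source_url-or-source_text is required")

    # inspect:true never calls the model, so the model requirement is skipped.
    if (has_prompt or has_source) and not inspect and not req.get("model"):
        problems.append("model is required for prompt and source_* modes")

    for i, turn in enumerate(req.get("script") or []):
        if turn.get("speaker") not in speakers:
            problems.append(f"script[{i}]: speaker {turn.get('speaker')!r} not in speakers")

    if req.get("theme", "deep-dive") not in THEMES:
        problems.append("theme is outside the closed set")
    if req.get("engine", "elevenlabs") != "elevenlabs":
        problems.append("engine must be elevenlabs")

    # inspect:true never scrapes, so the Firecrawl requirement is skipped too.
    if req.get("source_url") and not inspect and not firecrawl_enabled:
        problems.append("source_url requires the Firecrawl overlay")
    return problems
```

Running a check like this in the agent avoids a round trip for requests the pack would reject with invalid_input anyway.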

Length intent heuristic

The pack picks the target duration by multiplying the source reading time (source_words / 150 wpm) by the intent's multiplier, then clamping the result to that intent's floor and ceiling.

| Intent | Multiplier (vs source reading time) | Floor (min) | Ceiling (min) |
| --- | --- | --- | --- |
| summary | 0.20 | 1 | 3 |
| thorough (default for intent path) | 0.50 | 3 | 8 |
| exhaustive | 0.90 | 6 | 12 |

Precedence, highest first: (1) inspect:true short-circuits everything; (2) an explicit duration_target_min greater than 0; (3) length_intent, resolved through the table above; (4) the legacy 8-minute default when neither a numeric target nor an intent is set, which preserves back-compat for existing callers.

For mode A (script provided), length_intent doesn't apply — the script's length is intrinsic. The output reports length_intent_applied: "n/a:script" so callers see why.
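As a worked illustration of the heuristic and its precedence, here is a minimal sketch. The function name and signature are hypothetical. It returns the clamped value unrounded because no rounding rule is stated, and it treats mode A as short-circuiting before the numeric override, which is an assumption.

```python
WPM = 150

# intent -> (multiplier, floor_min, ceiling_min), copied from the table above.
INTENT_TABLE = {
    "summary": (0.20, 1, 3),
    "thorough": (0.50, 3, 8),
    "exhaustive": (0.90, 6, 12),
}


def choose_duration(source_words, duration_target_min=None, length_intent=None, has_script=False):
    """Return (target_minutes, length_intent_applied) following the documented precedence."""
    if has_script:
        # Mode A: the script's length is intrinsic, so no target is chosen.
        return None, "n/a:script"
    if duration_target_min is not None and duration_target_min > 0:
        return float(duration_target_min), "explicit"
    if length_intent:
        multiplier, floor, ceiling = INTENT_TABLE[length_intent]
        raw = source_words / WPM * multiplier
        return min(max(raw, floor), ceiling), f"intent:{length_intent}"
    return 8.0, "default:legacy-8min"


# 5000 words read in ~33.3 min; 0.50 x 33.3 = ~16.7 min, clamped to the 8-min ceiling.
print(choose_duration(5000, length_intent="thorough"))
```

A 300-word source with intent summary works the other way: 2 min of reading time times 0.20 is 0.4 min, which the floor raises to 1 min.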

Outputs

| Field | Type | Notes |
| --- | --- | --- |
| engine | string | Echo. |
| audio_artifact_key | string | podcast.generate/<rand>.mp3. Resolve via /api/v1/artifacts/<key>. |
| audio_size | number | Bytes. |
| duration_s | number | Total length (sum of per-turn TTS + silence padding), measured by ffprobe. |
| speaker_count | number | Unique speakers actually appearing in the final script. |
| turn_count | number | Total turn count (number of speaker lines synthesized). |
| script_source | string | "input" / "model" / "source_url" / "source_text". |
| model_used | string | Only when script_source != "input". |
| voices_used | object | {speaker: voice_id} for speakers that appeared. |
| has_narration | boolean | false when the vault key was missing; the MP3 then contains silence (5s per turn). |
| theme | string | Echo. |
| cover_image_prompt | string | Only when generate_cover_prompt: true. |
| cover_image_artifact_key | string | Only when cover_image: true. Namespaced under podcast.generate/. Resolve via /api/v1/artifacts/<key>. |
| cover_image_model_used | string | Only when cover_image: true. Echoes the model that actually generated the cover. |
| engagement | object | Default-on when a dispatcher is wired (set metadata_model:"" to disable). Apple Podcasts + Podcasting 2.0 shape: {title, subtitle, summary, show_notes_md, chapters: [{startTime, title}], hook_30s, cta: {placement, copy}, language, format_ceiling_note, title_char_count}. chapters[0].startTime is always 0 and cta.placement is always "mid-roll"; both are server-side defensive overrides regardless of what the LLM emitted. |
| engagement_artifact_key | string | Present only when engagement metadata was generated. JSON sidecar file mirroring the inline engagement object. |
| source_words | number | Whitespace-delimited word count of the source (script text in mode A, source_text in mode C-2, scraped text in mode C-1). 0 in mode B (prompt is a planning instruction, not source). |
| target_duration_min_chosen | number | The duration the pack picked and plumbed into GenerateScript. Reflects precedence: duration_target_min > intent table > legacy default. |
| actual_duration_min | number | What the rendered MP3 actually clocks at (duration_s / 60). Compare against target_duration_min_chosen to see how close the model landed. |
| length_intent_applied | string | Where the chosen duration came from: intent:summary / intent:thorough / intent:exhaustive / explicit / default:legacy-8min / n/a:script. |
| truncated | boolean | true when the script-generation LLM hit finish_reason=length. The parsed script may be incomplete; re-run with a smaller length_intent or larger max_tokens. |
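A caller might combine truncated, has_narration, and the two duration fields into a quick post-run check. The sketch below assumes the output object has already been parsed into a dict. The function name and the 25% drift tolerance are hypothetical choices, not pack behaviour.

```python
def assess_run(output, tolerance=0.25):
    """Return human-readable notes about a finished podcast.generate run."""
    notes = []
    if output.get("truncated"):
        notes.append("script truncated: re-run with a smaller length_intent or larger max_tokens")
    if not output.get("has_narration", True):
        notes.append("silent MP3: vault credential was missing")
    target = output.get("target_duration_min_chosen")
    actual = output.get("actual_duration_min")
    if target and actual is not None:
        # Relative drift of the rendered length against the chosen target.
        drift = (actual - target) / target
        if abs(drift) > tolerance:
            notes.append(f"duration drift {drift:+.0%} vs target")
    return notes


print(assess_run({"target_duration_min_chosen": 8, "actual_duration_min": 4}))
```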

Inspect-mode response

When inspect:true, the pack returns a minimal planning response — no model call, no artifact upload, no session work:

| Field | Type | Notes |
| --- | --- | --- |
| engine | string | Echo (e.g. "elevenlabs"). |
| inspect | boolean | Always true. |
| source_words | number | Word count of script text (mode A) or source_text (mode C-2). 0 in prompt mode + source_url mode (no scrape). |
| suggested_duration_min | number | What the intent table would pick. |
| length_intent_applied | string | intent:summary / intent:thorough / intent:exhaustive. |
| reason | string | Human-readable explanation, e.g. "source is 3000 words; applying intent:thorough for a target of 8 minutes (floor/ceiling clamped)". |

Example inspect call (no model, no session, no vault):

curl -fsS -X POST http://localhost:3000/api/v1/packs/podcast.generate \
  -H "Authorization: Bearer $JWT" \
  -H 'Content-Type: application/json' \
  -d '{
    "speakers": {"A": "v1"},
    "source_text": "# Long article...\n\n[~5000 words]",
    "inspect": true,
    "length_intent": "thorough"
  }'

Response (no token cost, no TTS cost):

{
  "engine": "elevenlabs",
  "inspect": true,
  "source_words": 5000,
  "suggested_duration_min": 8,
  "length_intent_applied": "intent:thorough",
  "reason": "source is 5000 words; applying intent:thorough for a target of 8 minutes (floor/ceiling clamped)"
}
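For agents that prefer Python over curl, the same inspect call can be assembled with the standard library. This is a sketch only: the helper name is hypothetical, and it builds the request without sending it so the payload can be checked first.

```python
import json
import urllib.request


def build_inspect_request(base_url, jwt, source_text, speakers, length_intent="thorough"):
    """Build (but do not send) an inspect-mode POST for podcast.generate."""
    payload = {
        "speakers": speakers,
        "source_text": source_text,
        "inspect": True,  # planning only: no model call, no session, no vault
        "length_intent": length_intent,
    }
    return urllib.request.Request(
        f"{base_url}/api/v1/packs/podcast.generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Authorization": f"Bearer {jwt}", "Content-Type": "application/json"},
        method="POST",
    )


req = build_inspect_request("http://localhost:3000", "example-token", "# Long article...", {"A": "v1"})
print(req.get_method(), req.full_url)
# Sending is left to the caller, for example:
#   with urllib.request.urlopen(req) as resp:
#       result = json.load(resp)
```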

Engagement metadata — what's baked in vs operator-overridable

| Bucket | Field | Rule |
| --- | --- | --- |
| Non-overridable (enforced by prompt + server) | First chapter | Always startTime=0. |
| Non-overridable | Chapter floor | ≥3 chapters when episode > 10 min, each ≥120s, titles ≤45 chars (Apple Podcasts guidance). |
| Non-overridable | Title shape | 60-80 chars, takeaway-first. |
| Non-overridable | CTA placement | Always "mid-roll" (research-validated); defensive server-side override even if LLM tries something else. |
| Non-overridable | Hook structure | Cold-open hook lands by second 15, no housekeeping. |
| Operator-tunable | cta_style | CTA copy tone: natural / direct / none. |
| Operator-tunable | language | Server-side-authoritative. |
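The mechanical rules in this table can be spot-checked on a returned engagement object. The sketch below is an illustrative reading, not the server's logic. It assumes the chapter-count and 120-second rules apply only to episodes over 10 minutes and that startTime is in seconds. The function name is hypothetical.

```python
def check_engagement(engagement, duration_s):
    """Return a list of rule violations for an engagement object."""
    problems = []
    chapters = engagement.get("chapters", [])
    if chapters and chapters[0].get("startTime") != 0:
        problems.append("first chapter must start at 0")

    if duration_s > 600:
        if len(chapters) < 3:
            problems.append("episodes over 10 min need at least 3 chapters")
        # A chapter runs until the next chapter starts, or until the episode ends.
        starts = [c["startTime"] for c in chapters] + [duration_s]
        for i in range(len(chapters)):
            if starts[i + 1] - starts[i] < 120:
                problems.append(f"chapter {i} is shorter than 120s")

    for i, chapter in enumerate(chapters):
        if len(chapter.get("title", "")) > 45:
            problems.append(f"chapter {i} title exceeds 45 chars")

    if not 60 <= len(engagement.get("title", "")) <= 80:
        problems.append("title should be 60-80 chars")
    if engagement.get("cta", {}).get("placement") != "mid-roll":
        problems.append("cta.placement must be mid-roll")
    return problems
```

Because the server already forces chapters[0].startTime and cta.placement, those two checks should never fire on real output. They are included for completeness.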

Honest scope (format_ceiling_note)

The engagement.format_ceiling_note field — always present when engagement is enabled — carries this constant string:

Engagement metadata defaults follow Apple/Spotify spec and Buzzsprout 2025 retention data. Solo vs co-hosted retention is execution-dependent — neither format dominates; this pack supports both. CTA placement is fixed at mid-roll (research-validated); the tone (cta_style) is operator-tunable.

Unlike slides.narrate, the podcast format has no structural retention ceiling — both solo and co-hosted shows can succeed at scale. The honest caveat here is that good metadata still doesn't substitute for good content.

Vault credentials needed

elevenlabs-key for day-1 ElevenLabs engine (same as slides.narrate). Optional — silent fallback when missing.

TTS quality knob (HELMDECK_ELEVENLABS_FORMAT)

The pack requests 192 kbps MP3 at 44.1 kHz from ElevenLabs by default (mp3_44100_192, Creator-tier or above). The downstream internal/avenc.ConcatAudio re-encode is sample-rate-pinned to 44.1 kHz so the silence-segment splice doesn't resample.

If your ElevenLabs subscription is on the Starter tier (capped at mp3_44100_128), set this environment variable on the helmdeck process to downgrade:

export HELMDECK_ELEVENLABS_FORMAT=mp3_44100_128

Same knob as slides.narrate; setting it once covers both packs.
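To illustrate how the knob resolves, here is a small sketch of reading the variable and splitting the format string into its parts. The helper name is hypothetical. It assumes an empty value falls back to the default and handles only three-part mp3_<rate>_<kbps> strings.

```python
import os

DEFAULT_FORMAT = "mp3_44100_192"


def resolve_output_format(env=None):
    """Resolve the ElevenLabs output format from the environment, with the documented default."""
    env = os.environ if env is None else env
    fmt = env.get("HELMDECK_ELEVENLABS_FORMAT") or DEFAULT_FORMAT
    codec, sample_rate, bitrate = fmt.split("_")
    return {
        "format": fmt,
        "codec": codec,
        "sample_rate_hz": int(sample_rate),
        "bitrate_kbps": int(bitrate),
    }


print(resolve_output_format({"HELMDECK_ELEVENLABS_FORMAT": "mp3_44100_128"}))
```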

Use it from your agent (OpenClaw chat-UI worked example)

OpenClaw chat capture pending.

Developer reference (curl)

Mode A — script (no LLM, no Firecrawl)

# -f2- keeps everything after the first "=", so passwords containing "=" survive.
ADMIN_PW=$(grep HELMDECK_ADMIN_PASSWORD /root/helmdeck/deploy/compose/.env.local | cut -d= -f2-)
JWT=$(curl -fsS -X POST http://localhost:3000/api/v1/auth/login \
  -H 'Content-Type: application/json' \
  -d "{\"username\":\"admin\",\"password\":\"${ADMIN_PW}\"}" \
  | python3 -c 'import sys,json;print(json.load(sys.stdin)["token"])')

curl -fsS -X POST http://localhost:3000/api/v1/packs/podcast.generate \
  -H "Authorization: Bearer $JWT" -H 'Content-Type: application/json' \
  -d '{
    "speakers": {
      "Alex": "21m00Tcm4TlvDq8ikWAM",
      "Jordan": "EXAVITQu4vr4xnSDxMaL"
    },
    "script": [
      {"speaker":"Alex", "text":"Welcome back to the show. I'\''m Alex."},
      {"speaker":"Jordan", "text":"And I'\''m Jordan. Today we'\''re diving into WebAssembly."},
      {"speaker":"Alex", "text":"What makes it interesting in 2026?"},
      {"speaker":"Jordan", "text":"Two things: performance parity with native, and portability across runtimes."}
    ],
    "theme": "deep-dive",
    "silence_between_turns_ms": 600
  }'

Response shape (truncated):

{
  "pack": "podcast.generate",
  "version": "v1",
  "output": {
    "engine": "elevenlabs",
    "audio_artifact_key": "podcast.generate/abc123.mp3",
    "audio_size": 512000,
    "duration_s": 34.2,
    "speaker_count": 2,
    "turn_count": 4,
    "script_source": "input",
    "voices_used": {"Alex":"21m00Tcm4TlvDq8ikWAM","Jordan":"EXAVITQu4vr4xnSDxMaL"},
    "has_narration": true,
    "theme": "deep-dive"
  }
}
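Once the response is parsed, the artifact key resolves to a download URL under /api/v1/artifacts/<key>. A minimal sketch follows, with a hypothetical helper name. It assumes the key is appended as-is with no extra URL encoding.

```python
def artifact_url(base_url, response):
    """Build the download URL for the MP3 referenced by a podcast.generate response."""
    key = response["output"]["audio_artifact_key"]
    return f"{base_url.rstrip('/')}/api/v1/artifacts/{key}"


# Trimmed copy of the response shape shown above.
sample = {
    "pack": "podcast.generate",
    "version": "v1",
    "output": {"audio_artifact_key": "podcast.generate/abc123.mp3", "duration_s": 34.2},
}
print(artifact_url("http://localhost:3000", sample))
```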

Mode B — prompt + model

curl -fsS -X POST http://localhost:3000/api/v1/packs/podcast.generate \
  -H "Authorization: Bearer $JWT" -H 'Content-Type: application/json' \
  -d '{
    "speakers": {
      "Host": "21m00Tcm4TlvDq8ikWAM",
      "Guest": "EXAVITQu4vr4xnSDxMaL"
    },
    "prompt": "Interview with a Rust expert about why Rust is gaining ground in 2026 backend systems.",
    "model": "openrouter/openai/gpt-4o-mini",
    "theme": "interview",
    "duration_target_min": 8,
    "generate_cover_prompt": true
  }'

The pack calls the gateway LLM with a frozen system prompt that bakes in the interview theme + the speaker names + the word target (8 × 150 ≈ 1200 words). The model returns structured JSON [{speaker, text}, ...] that the pack then synthesizes turn-by-turn.
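The word-target arithmetic and the shape of the model's reply can be illustrated together. The following is a sketch under assumptions: the helper names are hypothetical, and whether the pack itself runs an equivalent check on model output is not stated here.

```python
import json


def word_target(duration_target_min, wpm=150):
    """Words the script should contain for a given target length."""
    return round(duration_target_min * wpm)


def check_generated_script(raw_json, speakers, duration_target_min):
    """Summarise a model-returned [{speaker, text}, ...] script against the target."""
    turns = json.loads(raw_json)
    # Speakers the model used that are missing from the configured speakers map.
    unknown = sorted({t["speaker"] for t in turns} - set(speakers))
    words = sum(len(t["text"].split()) for t in turns)
    return {
        "turns": len(turns),
        "words": words,
        "target_words": word_target(duration_target_min),
        "unknown_speakers": unknown,
    }


raw = '[{"speaker": "Host", "text": "Hello there"}, {"speaker": "Guest", "text": "Hi"}]'
print(check_generated_script(raw, {"Host": "voice-a", "Guest": "voice-b"}, 8))
```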

Mode C — long-form content → script

curl -fsS -X POST http://localhost:3000/api/v1/packs/podcast.generate \
  -H "Authorization: Bearer $JWT" -H 'Content-Type: application/json' \
  -d '{
    "speakers": {
      "Reader": "21m00Tcm4TlvDq8ikWAM"
    },
    "source_url": "https://blog.example.com/long-form-essay",
    "model": "openrouter/openai/gpt-4o-mini",
    "theme": "solo-essay",
    "duration_target_min": 10
  }'

The pack scrapes the URL via Firecrawl, then asks the LLM to convert the content into a solo-essay-style script with the single speaker Reader.

Error codes

| Code | Triggers | Captured response |
| --- | --- | --- |
| invalid_input | speakers missing or empty | speakers map is required … |
| invalid_input | None of script/prompt/source_* set | must provide one of: script \| prompt+model \| source_url/source_text+model |
| invalid_input | Multiple modes set | must provide exactly one of: … |
| invalid_input | prompt/source_* without model | model is required when using prompt or source_url/source_text mode |
| invalid_input | theme outside closed set | theme must be one of: interview, debate, news-roundup, deep-dive, solo-essay … |
| invalid_input | engine not "elevenlabs" | engine must be "elevenlabs" (got …) |
| invalid_input | Speaker in script not in speakers | script[N]: speaker "X" not in speakers map (configured: A, B) |
| invalid_input | source_url mode without Firecrawl | source_url mode requires Firecrawl overlay … |
| invalid_input | Source URL blocked by egress guard | egress denied: … |
| internal | Prompt/source mode without dispatcher | podcast.generate prompt mode registered without a gateway dispatcher |
| handler_failed | ElevenLabs API non-2xx (key invalid, rate-limited, voice not found) | synthesize turn N: elevenlabs 401: … |
| handler_failed | Firecrawl scrape failed | scrape source_url: … |
| handler_failed | ffmpeg concat failed | concat: ffmpeg concat: exit N: … |
| session_unavailable | Engine has no session executor | engine has no session executor … |
| artifact_failed | Object store write failed | artifact upload failed: … |

Session chaining

A session is required; the pack creates one if absent. The pack runs ffmpeg in a session sidecar. It is stateless from the agent's perspective; the session is an implementation detail.

Common chains:

  • research.deep → podcast.generate (theme: news-roundup): turn a search-and-synthesis pass into a news-roundup-style podcast
  • web.scrape → podcast.generate (source_text mode + theme: solo-essay): re-narrate a single article as a solo essay
  • podcast.generate (generate_cover_prompt: true) → future image.generate (#71): cover-art pipeline

Async behavior

Async: true. Wall-clock scales with turn count: ~2–4s per turn at typical TTS speeds, plus ~5–10s for ffmpeg concat at the end. A 24-turn deep-dive runs ~60–90s end-to-end. The pack reports progress via ec.Report(pct, message) so SDK clients can display "synthesizing 12/24 turns".
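The timing figures above translate into a rough estimator, useful for choosing client timeouts. This is a back-of-envelope sketch with a hypothetical function name. It covers TTS and concat only, so script-generation time in modes B and C comes on top.

```python
def estimate_wall_clock_s(turn_count, per_turn_s=(2, 4), concat_s=(5, 10)):
    """Return a (low, high) wall-clock estimate in seconds for TTS plus concat."""
    low = turn_count * per_turn_s[0] + concat_s[0]
    high = turn_count * per_turn_s[1] + concat_s[1]
    return low, high


print(estimate_wall_clock_s(24))  # (53, 106)
```

The quoted ~60–90s for a 24-turn run sits inside this wider 53–106s band, which suggests typical turns land mid-range rather than at either extreme.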

See SKILLS.md §"Long-running packs" for the SEP-1686 task-envelope decision table.

See also