
podcast.generate

The "produce a podcast MP3" pack. Caller supplies a speakers map (1..N speaker name → voice ID) and one of three script-source modes:

  • Mode A — script: agent provides the structured turns directly
  • Mode B — prompt + model: pack calls the gateway LLM to generate dialogue from a prompt
  • Mode C — source_url or source_text + model: pack scrapes long-form content (or accepts inline text) and converts it into speaker-tagged dialogue

The pack runs each turn through the configured TTS engine, concatenates the per-turn MP3s with silence padding via ffmpeg, and returns a single MP3 artifact. Day 1 ships ElevenLabs as the only engine; the engine input field is reserved so future PRs can add PlayHT, Hume.ai, Resemble.ai, etc. without touching the pack handler.
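The concat step can be sketched as follows: a minimal Python sketch, assuming hypothetical per-turn file names and a silence clip rendered once up front (the handler's actual file layout is internal to the pack). It builds an ffmpeg concat-demuxer list that interleaves the silence clip between consecutive turns.

```python
from pathlib import Path

def build_concat_list(turn_files, silence_file, list_path):
    """Write an ffmpeg concat-demuxer list: turn, silence, turn, silence, ..., turn.

    The silence clip (e.g. 600 ms rendered once with `ffmpeg -f lavfi -i anullsrc ...`)
    is reused between every pair of turns, so only one silence file is needed.
    """
    lines = []
    for i, turn in enumerate(turn_files):
        if i > 0:
            lines.append(f"file '{silence_file}'")
        lines.append(f"file '{turn}'")
    Path(list_path).write_text("\n".join(lines) + "\n")
    return lines

# The final mix would then be produced with something like:
#   ffmpeg -f concat -safe 0 -i list.txt -c copy podcast.mp3
```

The concat demuxer copies streams without re-encoding, which is why the final concat step stays cheap relative to the per-turn TTS calls.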

Five closed-set themes bake podcast best-practices into the LLM system prompt (modes B and C):

Theme | What you get
interview | Host + guest format, open-ended questions, guest does ~70% of the talking, actionable takeaway
debate | Two opposing positions, steel-man required, moderator-style closer
news-roundup | 3–5 fast stories, sponsor-break placeholder at the midpoint, "watching this week" closer
deep-dive | Single topic, narrative arc (problem → exploration → resolution → implication)
solo-essay | One speaker, monologue, written-for-the-ear pacing, 8–12 min sweet spot

Distinct from tts.synthesize (single-voice, single-line) and video.generate (talking-head video).

Setup prerequisite

For day-1 ElevenLabs engine, add the API key to the Vault panel:

Field | Value
Name | elevenlabs-key (exact string — pack default; override with the credential input)
Type | api_key
Host pattern | api.elevenlabs.io
Value | Your ElevenLabs API key (sk_…)

Same credential as slides.narrate. Optional — without it, the pack still ships an MP3 (silent, with has_narration: false) so the structure stays intact for testing.
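The silent fallback's length is predictable up front. A sketch, assuming the 5 s-per-turn figure quoted in the outputs table and assuming the configured inter-turn gap is still inserted between silent turns (the gap behavior is an assumption, not stated by the spec):

```python
def silent_fallback_duration_s(turn_count, silence_between_turns_ms=600):
    """Expected length of the placeholder MP3 when no ElevenLabs key is vaulted.

    Assumption: 5 s of silence per turn (per the outputs table) plus the
    configured gap between consecutive turns.
    """
    gaps = max(turn_count - 1, 0)
    return turn_count * 5.0 + gaps * silence_between_turns_ms / 1000.0
```

This lets a test harness assert on duration_s even before any TTS credential is configured.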

For mode C with source_url, the Firecrawl overlay must be enabled (HELMDECK_FIRECRAWL_ENABLED=true).

Inputs

Field | Type | Required | Default | Notes
engine | string | no | "elevenlabs" | Closed set; day 1 only "elevenlabs".
speakers | object | yes | | {name: voice_id} map. Non-empty. Use one entry for solo monologue, two+ for dialogue.
script | array | one-of | | Mode A. [{speaker, text}, ...]. Every speaker value must exist in speakers.
prompt | string | one-of | | Mode B. Plain-English description of what the podcast should be. Requires model.
source_url | string | one-of | | Mode C-1. URL to scrape via Firecrawl. Requires model + Firecrawl overlay.
source_text | string | one-of | | Mode C-2. Inline long-form markdown to riff on. Requires model.
model | string | with prompt/source_* | | Provider/model for script generation, e.g. openrouter/openai/gpt-4o-mini.
max_tokens | number | no | 4096 | LLM cap for script generation.
model_id | string | no | "eleven_turbo_v2_5" | ElevenLabs TTS model. eleven_turbo_v2_5 is fast/cheap; eleven_multilingual_v2 for non-English.
theme | string | no | "deep-dive" | One of: interview, debate, news-roundup, deep-dive, solo-essay. Influences modes B/C only.
duration_target_min | number | no | 8 | LLM target length in minutes (modes B/C). At ~150 wpm, an 8-min target asks for ~1200 total words.
silence_between_turns_ms | number | no | 600 | Pause between consecutive turns (ms). 600 ms feels conversational; 200 ms feels rushed; 1000 ms feels formal.
generate_cover_prompt | boolean | no | false | When true, output includes cover_image_prompt — a one-paragraph prompt the agent can pass to a future image-gen pack for cover art.
credential | string | no | "elevenlabs-key" | Vault credential name.

Validation:

  • Exactly one of script / prompt / (source_url OR source_text)
  • prompt and source_* modes require model
  • Every speaker referenced in script (mode A) must exist in speakers
  • theme must be in the closed set
  • engine must be "elevenlabs" (day 1)
  • source_url requires HELMDECK_FIRECRAWL_ENABLED=true
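The rules above amount to a single guard over the payload. An illustrative Python sketch, not the pack's actual handler; the field names follow the inputs table, but the real error strings may differ from the ones shown in the error-codes section:

```python
THEMES = {"interview", "debate", "news-roundup", "deep-dive", "solo-essay"}

def validate(inputs, firecrawl_enabled=False):
    """Return a list of validation errors for a podcast.generate payload (sketch)."""
    errors = []
    speakers = inputs.get("speakers") or {}
    if not speakers:
        errors.append("speakers map is required")
    # Exactly one script-source mode: script | prompt | source_url/source_text
    modes = [bool(inputs.get("script")),
             bool(inputs.get("prompt")),
             bool(inputs.get("source_url") or inputs.get("source_text"))]
    if sum(modes) != 1:
        errors.append("must provide exactly one of: script | prompt | source_url/source_text")
    if (inputs.get("prompt") or inputs.get("source_url") or inputs.get("source_text")) \
            and not inputs.get("model"):
        errors.append("model is required for prompt/source modes")
    for i, turn in enumerate(inputs.get("script") or []):
        if turn.get("speaker") not in speakers:
            errors.append(f'script[{i}]: speaker "{turn.get("speaker")}" not in speakers map')
    if inputs.get("theme", "deep-dive") not in THEMES:
        errors.append("theme must be one of: " + ", ".join(sorted(THEMES)))
    if inputs.get("engine", "elevenlabs") != "elevenlabs":
        errors.append('engine must be "elevenlabs"')
    if inputs.get("source_url") and not firecrawl_enabled:
        errors.append("source_url mode requires the Firecrawl overlay")
    return errors
```

Running the guard client-side before POSTing saves a round trip on the most common invalid_input failures.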

Outputs

Field | Type | Notes
engine | string | Echo.
audio_artifact_key | string | podcast.generate/<rand>.mp3. Resolve via /api/v1/artifacts/<key>.
audio_size | number | Bytes.
duration_s | number | Total length (sum of per-turn TTS + silence padding), measured by ffprobe.
speaker_count | number | Unique speakers actually appearing in the final script.
turn_count | number | Total turn count (number of speaker lines synthesized).
script_source | string | "input" / "model" / "source_url" / "source_text".
model_used | string | Only when script_source != "input".
voices_used | object | {speaker: voice_id} for speakers that appeared.
has_narration | boolean | false when the vault key was missing — MP3 contains silence (5 s per turn).
theme | string | Echo.
cover_image_prompt | string | Only when generate_cover_prompt: true.

Vault credentials needed

elevenlabs-key for day-1 ElevenLabs engine (same as slides.narrate). Optional — silent fallback when missing.

Use it from your agent (OpenClaw chat-UI worked example)

OpenClaw chat capture pending.

Developer reference (curl)

Mode A — script (no LLM, no Firecrawl)

ADMIN_PW=$(grep HELMDECK_ADMIN_PASSWORD /root/helmdeck/deploy/compose/.env.local | cut -d= -f2)
JWT=$(curl -fsS -X POST http://localhost:3000/api/v1/auth/login \
-H 'Content-Type: application/json' \
-d "{\"username\":\"admin\",\"password\":\"${ADMIN_PW}\"}" \
| python3 -c 'import sys,json;print(json.load(sys.stdin)["token"])')

curl -fsS -X POST http://localhost:3000/api/v1/packs/podcast.generate \
-H "Authorization: Bearer $JWT" -H 'Content-Type: application/json' \
-d '{
"speakers": {
"Alex": "21m00Tcm4TlvDq8ikWAM",
"Jordan": "EXAVITQu4vr4xnSDxMaL"
},
"script": [
{"speaker":"Alex", "text":"Welcome back to the show. I'\''m Alex."},
{"speaker":"Jordan", "text":"And I'\''m Jordan. Today we'\''re diving into WebAssembly."},
{"speaker":"Alex", "text":"What makes it interesting in 2026?"},
{"speaker":"Jordan", "text":"Two things: performance parity with native, and portability across runtimes."}
],
"theme": "deep-dive",
"silence_between_turns_ms": 600
}'

Response shape (truncated):

{
"pack": "podcast.generate",
"version": "v1",
"output": {
"engine": "elevenlabs",
"audio_artifact_key": "podcast.generate/abc123.mp3",
"audio_size": 512000,
"duration_s": 34.2,
"speaker_count": 2,
"turn_count": 4,
"script_source": "input",
"voices_used": {"Alex":"21m00Tcm4TlvDq8ikWAM","Jordan":"EXAVITQu4vr4xnSDxMaL"},
"has_narration": true,
"theme": "deep-dive"
}
}

Mode B — prompt + model

curl -fsS -X POST http://localhost:3000/api/v1/packs/podcast.generate \
-H "Authorization: Bearer $JWT" -H 'Content-Type: application/json' \
-d '{
"speakers": {
"Host": "21m00Tcm4TlvDq8ikWAM",
"Guest": "EXAVITQu4vr4xnSDxMaL"
},
"prompt": "Interview with a Rust expert about why Rust is gaining ground in 2026 backend systems.",
"model": "openrouter/openai/gpt-4o-mini",
"theme": "interview",
"duration_target_min": 8,
"generate_cover_prompt": true
}'

The pack calls the gateway LLM with a frozen system prompt that bakes in the interview theme + the speaker names + the word target (8 × 150 ≈ 1200 words). The model returns structured JSON [{speaker, text}, ...] that the pack then synthesizes turn-by-turn.
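The length-targeting arithmetic can be sketched as below. This is illustrative only: the pack's frozen system prompt is internal, and the exact wording here is invented; only the ~150 wpm word-target math comes from the inputs table.

```python
WPM = 150  # spoken-word pace the pack assumes for length targeting

def script_generation_prompt(theme, speakers, duration_target_min):
    """Assemble an illustrative mode-B/C system prompt (not the pack's frozen text)."""
    word_target = duration_target_min * WPM
    names = ", ".join(speakers)
    return (
        f"Write a {theme} podcast script of about {word_target} words "
        f"for the speakers: {names}. "
        'Respond with JSON: [{"speaker": "...", "text": "..."}, ...] '
        "using only those speaker names."
    )
```

The structured-JSON response contract is what lets the pack feed the model output straight into the same turn-by-turn synthesis path as mode A.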

Mode C — long-form content → script

curl -fsS -X POST http://localhost:3000/api/v1/packs/podcast.generate \
-H "Authorization: Bearer $JWT" -H 'Content-Type: application/json' \
-d '{
"speakers": {
"Reader": "21m00Tcm4TlvDq8ikWAM"
},
"source_url": "https://blog.example.com/long-form-essay",
"model": "openrouter/openai/gpt-4o-mini",
"theme": "solo-essay",
"duration_target_min": 10
}'

The pack scrapes the URL via Firecrawl, then asks the LLM to convert the content into a solo-essay-style script with the single speaker Reader.

Error codes

Code | Triggers | Captured response
invalid_input | speakers missing or empty | speakers map is required …
invalid_input | None of script/prompt/source_* set | must provide one of: script | prompt+model | source_url/source_text+model
invalid_input | Multiple modes set | must provide exactly one of: …
invalid_input | prompt/source_* without model | model is required when using prompt or source_url/source_text mode
invalid_input | theme outside the closed set | theme must be one of: interview, debate, news-roundup, deep-dive, solo-essay …
invalid_input | engine not "elevenlabs" | engine must be "elevenlabs" (got …)
invalid_input | Speaker in script not in speakers | script[N]: speaker "X" not in speakers map (configured: A, B)
invalid_input | source_url mode without Firecrawl | source_url mode requires Firecrawl overlay …
invalid_input | Source URL blocked by egress guard | egress denied: …
internal | Prompt/source mode without dispatcher | podcast.generate prompt mode registered without a gateway dispatcher
handler_failed | ElevenLabs API non-2xx (key invalid, rate-limited, voice not found) | synthesize turn N: elevenlabs 401: …
handler_failed | Firecrawl scrape failed | scrape source_url: …
handler_failed | ffmpeg concat failed | concat: ffmpeg concat: exit N: …
session_unavailable | Engine has no session executor | engine has no session executor …
artifact_failed | Object store write failed | artifact upload failed: …

Session chaining

Required (creates if absent). The pack runs ffmpeg in a session sidecar. From the agent's perspective the pack is stateless; the session is an implementation detail.

Common chains:

  • research.deep → podcast.generate (theme: news-roundup) — turn a search-and-synthesis pass into a news-roundup-style podcast
  • web.scrape → podcast.generate (source_text mode + theme: solo-essay) — re-narrate a single article as a solo essay
  • podcast.generate (generate_cover_prompt: true) → future image.generate (#71) — cover-art pipeline
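The second chain above is mostly payload plumbing. A sketch, assuming the scrape step exposes its markdown under a `content` field (a hypothetical name; the actual field depends on the web.scrape pack's output schema):

```python
def chain_scrape_to_podcast(scrape_output, voice_id, model):
    """Build a podcast.generate payload from a prior web.scrape result (sketch).

    `scrape_output["content"]` is an assumed field name for the scraped markdown.
    """
    return {
        "speakers": {"Reader": voice_id},
        "source_text": scrape_output["content"],
        "model": model,
        "theme": "solo-essay",
    }
```

Using source_text here (rather than source_url) avoids a second scrape and sidesteps the Firecrawl-overlay requirement.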

Async behavior

Async: true. Wall-clock scales with turn count: ~2–4s per turn at typical TTS speeds, plus ~5–10s for ffmpeg concat at the end. A 24-turn deep-dive runs ~60–90s end-to-end. The pack reports progress via ec.Report(pct, message) so SDK clients can display "synthesizing 12/24 turns".
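The figures above give a rough planning formula; a sketch using midpoints of the quoted ranges (3 s per turn, 7.5 s concat). Actual times vary with TTS load, so treat the defaults as assumptions:

```python
def estimated_runtime_s(turn_count, per_turn_s=3.0, concat_s=7.5):
    """Rough wall-clock estimate for an async podcast.generate run.

    Defaults are midpoints of the documented ~2-4 s/turn and ~5-10 s concat ranges.
    """
    return turn_count * per_turn_s + concat_s

def progress_pct(turns_done, turn_count):
    """Percentage a client might derive from 'synthesizing N/M turns' reports."""
    return int(100 * turns_done / turn_count)
```

For the 24-turn deep-dive cited above this lands inside the quoted 60–90 s window.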

See SKILLS.md §"Long-running packs" for the SEP-1686 task-envelope decision table.

See also