doc.parse
The layout-aware document parsing pack. Hands a PDF / DOCX / PPTX / XLSX / HTML / image to a self-hosted Docling service and returns the document's structure as clean Markdown — with table layout preserved, headings detected, lists recognized, and (optionally) OCR applied to embedded images. Where doc.ocr is "image bytes in, text out", doc.parse is "real document in, structured Markdown out".
⚠️ Known issue (2026-05-07): against current Docling builds the pack sends
http_sourcesbut the upstream/v1/convert/sourceAPI expectssources. URL inputs returnhandler_failedwith a Docling 422 until the handler is updated. Inline base64 inputs (source_b64+filename) work today. Tracked as apriority/P1issue against the helmdeck repo.
Setup prerequisite
doc.parse only works when the Docling overlay is running and the env-var toggle is set:
docker compose -f deploy/compose/compose.yaml \
-f deploy/compose/compose.docling.yml \
--env-file deploy/compose/.env.local up -d
# in deploy/compose/.env.local:
HELMDECK_DOCLING_ENABLED=true
When the toggle is off, the pack returns invalid_input: doc.parse is disabled (set HELMDECK_DOCLING_ENABLED=true).
Inputs
| Field | Type | Required | Default | Notes |
|---|---|---|---|---|
source_url | string | one of | — | Absolute http(s) URL. Egress-guarded. (Currently broken; see Known Issue above.) |
source_b64 | string | one of | — | Base64-encoded document bytes. Skips egress (never hits public network). |
filename | string | when source_b64 set | — | File extension picks the parser (.pdf, .docx, .pptx, .xlsx, .html, .png, .jpg). |
formats | array | no | ["md"] | Output formats. Accepts md (or markdown), text (or plaintext/plain), html. Aliases normalize on the way in (issue #91). md is always force-included so the output schema's markdown field is non-empty. |
do_ocr | boolean | no | true | Run OCR over embedded images during parsing. |
ocr_lang | array | no | ["en"] | Languages for embedded-image OCR (Docling's tesseract langs). |
Exactly one of source_url / source_b64 must be set. Base64 payload capped at 32 MiB.
Outputs
| Field | Type | Notes |
|---|---|---|
source | string | URL or filename (echo of the input identity). |
markdown | string | The parsed document as clean Markdown — always present. |
text | string | Plain-text rendering. Only when formats requested it. |
html | string | HTML rendering. Only when formats requested it. |
status | string | Docling's status: success, partial_success, failure, skipped. |
processing_time | number | Docling-reported seconds spent. |
Vault credentials needed
None. The Docling service runs on the private baas-net and accepts unauthenticated calls from the control plane.
Use it from your agent (OpenClaw chat-UI worked example)
📌 The transcript below predates issue #91's fix. The model emitted
formats: ["markdown"]and got bounced. The pack now accepts"markdown"as an alias for"md"— fresh sessions converge on the first call. Captured here for historical evidence of the friction.
Prompt (sent in OpenClaw chat UI / openclaw-cli agent):
Use the helmdeck__doc-parse tool with source_url=https://arxiv.org/pdf/2501.17887 to extract the document content as markdown. If it fails, tell me the exact error.
Tool call (1 call, no failures):
{
"name": "helmdeck__doc-parse",
"arguments": {
"source_url": "https://arxiv.org/pdf/2501.17887",
"do_ocr": false,
"formats": [
"markdown"
]
}
}
Pack response (truncated):
{"error":"invalid_input","message":"unsupported format \"markdown\"; use md, text, or html"}
Agent's final reply:
The tool returned an error:
unsupported format “markdown”; use md, text, or html
Verified via OpenClaw 2026.5.6 + helmdeck v0.9.0-dev + openrouter/openai/gpt-oss-120b on 2026-05-07 (cost: $0.0257).
Developer reference (curl)
# Inline PDF (base64) — known-good path
PDF_B64=$(base64 -w0 < /path/to/your.pdf)
curl -fsS -X POST http://localhost:3000/api/v1/packs/doc.parse \
-H "Authorization: Bearer $JWT" \
-H 'Content-Type: application/json' \
-d "{\"source_b64\":\"$PDF_B64\",\"filename\":\"input.pdf\",\"do_ocr\":true}"
URL path (currently returns the upstream-API mismatch error):
curl -fsS -X POST http://localhost:3000/api/v1/packs/doc.parse \
-H "Authorization: Bearer $JWT" \
-H 'Content-Type: application/json' \
-d '{"source_url":"https://arxiv.org/pdf/2501.17887"}'
Captured response (URL path showing the current Docling API mismatch):
{
"error": "handler_failed",
"message": "docling 422: {\"detail\":[{\"type\":\"missing\",\"loc\":[\"body\",\"sources\"],\"msg\":\"Field required\",\"input\":{\"options\":{\"to_formats\":[\"md\"],\"do_ocr\":true,\"image_export_mode\":\"placeholder\",\"abort_on_error\":false},\"http_sources\":[{\"url\":\"https://arxiv.org/pdf/2501.17887\"}]}}]}"
}
Error codes
| Code | Triggers | Captured response |
|---|---|---|
invalid_input | Both source_url and source_b64 missing | {"error":"invalid_input","message":"either source_url or source_b64 is required"} |
invalid_input | source_b64 set without filename | {"error":"invalid_input","message":"filename is required when source_b64 is set (extension picks the parser)"} |
invalid_input | HELMDECK_DOCLING_ENABLED is unset/false | doc.parse is disabled (set HELMDECK_DOCLING_ENABLED=true) |
invalid_input | formats includes a value outside md/text/html | unknown format |
handler_failed | Docling returns non-200 (incl. the current http_sources 422) | docling NNN: <body> |
handler_failed | Docling reports failure or skipped | per-document errors surfaced from upstream |
Session chaining
Stateless. No _session_id needed — Docling runs as its own service. Useful chains: web.scrape → save bytes via fs.write → doc.parse to convert a PDF an agent downloaded into structured Markdown for further processing.
Async behavior
Synchronous. Heavy documents take time — a 100-page scanned PDF with OCR can hit 60+ seconds. The pack's timeout is 300 seconds; longer documents hit timeout rather than handler_failed.
When to use which doc pack
| Pack | Use when |
|---|---|
doc.parse | PDF, DOCX, PPTX, multi-page document with structure. Tables matter. Layout matters. Embedded images need OCR. |
doc.ocr | A single image. Plain text is enough. No layout. No Docling overlay running. |
web.scrape | The document is rendered HTML on a webpage — Firecrawl handles fetch + clean-markdown extraction in one call. |
See also
- Catalog row:
PACKS.md—doc.parse. - Source:
internal/packs/builtin/doc_parse.go. - ADR 035 — MCP Server Hosting & Pack Evolution (Docling overlay rationale).
- T807c in
TASKS.md— the work item that introduced this pack. - Companion pack:
doc.ocr.