
web.scrape

The zero-selectors scrape pack. The agent supplies a URL and gets back clean Markdown — Firecrawl handles the headless render, content extraction, and Markdown conversion. Where web.scrape_spa requires CSS selectors and runs against the session's CDP-driven Chromium, this pack hands the URL off to a Firecrawl service that already knows how to deal with SPAs, ad-walls, and most modern web shapes. Use it whenever the agent doesn't already have the DOM mapped.

Setup prerequisite

web.scrape only works when the Firecrawl overlay is running and the env-var toggle is set:

docker compose -f deploy/compose/compose.yaml \
-f deploy/compose/compose.firecrawl.yml \
--env-file deploy/compose/.env.local up -d

# in deploy/compose/.env.local:
HELMDECK_FIRECRAWL_ENABLED=true

When the toggle is off, the pack returns invalid_input: web.scrape is disabled (set HELMDECK_FIRECRAWL_ENABLED=true).

Inputs

| Field | Type | Required | Default | Notes |
| --- | --- | --- | --- | --- |
| url | string | yes | — | Absolute http(s) URL. Egress-guarded — RFC 1918 / metadata IPs / loopback are rejected before Firecrawl is contacted. |
| formats | array | no | ["markdown"] | Subset of markdown, html, rawHtml, links. markdown is required for the output schema's markdown field. |
| wait_ms | number | no | 0 | Settle-delay (ms) Firecrawl applies after navigation before snapshotting. Useful for SPAs that hydrate post-DOMContentLoaded. |
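For agents or scripts that construct the call themselves, the input rules above can be sketched as a small client-side validator. This is a hypothetical helper, not part of helmdeck; the field names, defaults, and allowed-format list come straight from the table:

```python
# Hypothetical helper: build a web.scrape input payload and catch the
# same mistakes the pack would reject (missing url, unsupported format).
ALLOWED_FORMATS = {"markdown", "html", "rawHtml", "links"}

def build_scrape_input(url: str, formats=None, wait_ms: int = 0) -> dict:
    if not url or not url.startswith(("http://", "https://")):
        raise ValueError("url is required and must be an absolute http(s) URL")
    formats = list(formats or ["markdown"])
    bad = set(formats) - ALLOWED_FORMATS
    if bad:
        raise ValueError(f"unsupported format(s): {sorted(bad)}")
    payload = {"url": url}
    if formats != ["markdown"]:          # omit the default, per the table
        payload["formats"] = formats
    if wait_ms:
        payload["wait_ms"] = wait_ms
    return payload
```

For example, `build_scrape_input("https://example.com", ["markdown", "links"], wait_ms=1500)` produces a payload ready to POST to the pack, while a format like `"pdf"` fails fast client-side instead of round-tripping for an invalid_input error.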

Outputs

| Field | Type | Notes |
| --- | --- | --- |
| url | string | Echo of the input. |
| markdown | string | Always present. Cleaned Markdown of the page's primary content. |
| html | string | Only when formats requested it. |
| links | array | Only when formats requested it. Discovered outbound links. |
| title | string | From Firecrawl's metadata pass. |
| status | number | Upstream HTTP status. |
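A minimal sketch of consuming the output envelope, using the captured response on this page (abridged) as fixture data. Only the envelope shape shown here is assumed; the first-sentence extraction is a naive illustration, not a pack feature:

```python
import json

# Fixture: abridged version of the captured pack response shown on this page.
resp = json.loads(
    '{"pack":"web.scrape","version":"v1","output":{'
    '"markdown":"Example Domain\\n==============\\n\\n'
    'This domain is for use in documentation examples without needing permission.",'
    '"status":200,"title":"Example Domain","url":"https://example.com"}}'
)
out = resp["output"]

# Drop the setext heading, then take everything up to the first period —
# a deliberately naive way to get "the first sentence of the markdown".
body = out["markdown"].split("\n\n", 1)[-1]
first_sentence = body.split(". ")[0].rstrip(".") + "."
```

After running this, `out["title"]` holds the page title and `first_sentence` the opening sentence — the two facts the agent reports in the worked example below.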

Vault credentials needed

None. The Firecrawl service runs on the private baas-net and accepts unauthenticated calls from the control plane.

Use it from your agent (OpenClaw chat-UI worked example)

Prompt (sent in OpenClaw chat UI / openclaw-cli agent):

Use the helmdeck__web-scrape tool to scrape https://example.com to clean markdown. Tell me the page title and the first sentence of the markdown.

Tool call (1 call, no failures):

{
"name": "helmdeck__web-scrape",
"arguments": {
"url": "https://example.com"
}
}

Pack response (truncated):

{"pack":"web.scrape","version":"v1","output":{"markdown":"Example Domain\n==============\n\nThis domain is for use in documentation examples without needing permission. Avoid use in operations.\n\n[Learn more](https://iana.org/domains/example)","status":200,"title":"Example Domain","url":"https://example.com"},"duration_ms":115733635}

Agent's final reply:

  • Page title: Example Domain
  • First sentence of the markdown: “This domain is for use in documentation examples without needing permission.”

Verified via OpenClaw 2026.5.6 + helmdeck v0.9.0-dev + openrouter/openai/gpt-oss-120b on 2026-05-07 (cost: $0.1685).

Developer reference (curl)

Mint a JWT first:

ADMIN_PW=$(grep HELMDECK_ADMIN_PASSWORD /root/helmdeck/deploy/compose/.env.local | cut -d= -f2)
JWT=$(curl -fsS -X POST http://localhost:3000/api/v1/auth/login \
-H 'Content-Type: application/json' \
-d "{\"username\":\"admin\",\"password\":\"${ADMIN_PW}\"}" \
| python3 -c 'import sys,json;print(json.load(sys.stdin)["token"])')

Happy path:

curl -fsS -X POST http://localhost:3000/api/v1/packs/web.scrape \
-H "Authorization: Bearer $JWT" -H 'Content-Type: application/json' \
-d '{"url":"https://example.com"}'

Captured response:

{
"pack": "web.scrape",
"version": "v1",
"output": {
"markdown": "Example Domain\n==============\n\nThis domain is for use in documentation examples without needing permission. Avoid use in operations.\n\n[Learn more](https://iana.org/domains/example)",
"status": 200,
"title": "Example Domain",
"url": "https://example.com"
},
"duration_ms": 104911180
}

Error codes

| Code | Triggers | Captured response |
| --- | --- | --- |
| invalid_input | url missing or empty | {"error":"invalid_input","message":"url is required"} |
| invalid_input | HELMDECK_FIRECRAWL_ENABLED is unset/false | web.scrape is disabled; set HELMDECK_FIRECRAWL_ENABLED=true … |
| invalid_input | formats includes a value outside markdown/html/rawHtml/links | {"error":"invalid_input","message":"unsupported format \"pdf\"; use markdown, html, rawHtml, or links"} |
| invalid_input | URL resolves to a blocked range (metadata, RFC 1918, loopback) | {"error":"invalid_input","message":"egress denied: security: destination is in a blocked address range: 169.254.169.254 is in 169.254.169.254/32"} |
| handler_failed | Firecrawl returns non-200 (incl. success: false body) | {"error":"handler_failed","message":"firecrawl 500: …"} |
| handler_failed | Firecrawl returns empty markdown (bot-challenge, blank body) | {"error":"handler_failed","message":"firecrawl returned empty markdown for https://… (status=200)"} |
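The egress-denied row can be approximated with the standard library. This is a sketch of the idea, not helmdeck's actual guard; the blocked-range list is inferred from the error table (loopback, RFC 1918, link-local/metadata):

```python
import ipaddress
import socket
from urllib.parse import urlparse

# Ranges inferred from the error table above — loopback, RFC 1918
# private space, and the link-local range that hosts 169.254.169.254.
BLOCKED = [ipaddress.ip_network(n) for n in (
    "127.0.0.0/8", "10.0.0.0/8", "172.16.0.0/12",
    "192.168.0.0/16", "169.254.0.0/16", "::1/128",
)]

def egress_allowed(url: str) -> bool:
    """Resolve the URL's host and reject it if it lands in a blocked range."""
    host = urlparse(url).hostname
    ip = ipaddress.ip_address(socket.gethostbyname(host))
    return not any(ip in net for net in BLOCKED)
```

Note the check runs on the *resolved* address, which is what lets the guard fire before Firecrawl is ever contacted, even when the hostname looks harmless.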

Session chaining

No session. Stateless. Chains freely upstream of doc.parse (download a page bytestream and feed it through layout-aware parsing), content.ground (rewrite a Markdown blob the pack just produced), or slides.narrate (turn a scraped page into a narrated deck).
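A chaining sketch under stated assumptions: only web.scrape's endpoint path (POST /api/v1/packs/web.scrape) is confirmed on this page, so the shared /api/v1/packs/{name} pattern for other packs and content.ground's input field name are guesses for illustration:

```python
import json
import urllib.request

BASE = "http://localhost:3000/api/v1/packs"  # control-plane base from the curl section

def build_pack_request(pack: str, args: dict, jwt: str) -> urllib.request.Request:
    # ASSUMPTION: every pack is POST {BASE}/{pack}; only web.scrape's path
    # is confirmed on this page.
    return urllib.request.Request(
        f"{BASE}/{pack}",
        data=json.dumps(args).encode(),
        headers={"Authorization": f"Bearer {jwt}",
                 "Content-Type": "application/json"},
    )

def scrape_then_ground(url: str, jwt: str) -> dict:
    # 1) scrape the page, 2) hand the Markdown to content.ground
    # (the "markdown" input field name is an assumption).
    with urllib.request.urlopen(build_pack_request("web.scrape", {"url": url}, jwt),
                                timeout=90) as r:
        page = json.load(r)["output"]
    with urllib.request.urlopen(build_pack_request("content.ground",
                                                   {"markdown": page["markdown"]}, jwt),
                                timeout=90) as r:
        return json.load(r)["output"]
```

Because the pack is stateless, nothing needs to be carried between the two calls except the Markdown blob itself.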

Async behavior

Synchronous. Firecrawl's own per-request timeout is generous; helmdeck caps the round-trip at 90 seconds. Heavy SPAs may approach that cap; for those, use web.scrape_spa with explicit wait_ms.

See also