Skip to main content

17. Pack: web.scrape_spa

Status: Accepted Date: 2026-04-07 Domain: api-design

Context

SPA scraping requires waiting for JavaScript hydration, network idle, DOM stability, and schema-conformant extraction — a sequence that frequently produces empty or partial output when driven by weak models (PRD §6.6, §19.10).

Decision

Ship web.scrape_spa as a built-in pack.

Input: { url: string, schema: object (JSON Schema), wait_ms?: integer, max_wait_ms?: integer } Output: { data: object (matching schema), partial?: boolean, fields_extracted: integer } Errors: not_found, timeout, schema_mismatch, network_error

The handler navigates, waits for network idle (default 500 ms quiet period, max 30 s), runs an extraction routine that maps each schema property to a CSS/XPath selector or LLM-assisted field extraction, validates against the schema, and returns the result. On partial success it returns partial: true instead of failing.

Consequences

Positive: structured extraction in one call; schema enforcement guarantees agent-consumable output. Negative: schema design quality determines reliability; LLM-assisted extraction calls back through the AI gateway.

§6.6 Capability Packs, §10 Typed Error Codes