
vision.extract_visible_text

The "tell me what's on the screen" pack. Caller supplies a vision-capable model; the pack screenshots the visible desktop and asks the model to transcribe every piece of readable text, joined with newlines. Use it as a verification step after a vision.click_anywhere or vision.fill_form_by_label call ("did the post-submit page load? what's the order ID?"), or for general "what app is showing right now" queries.

Single-step — no internal loop. Returns synchronously with the transcription text.
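When chained as a verification step, pass the upstream pack's session id so the screenshot is taken in the same desktop session. A sketch of the tool call with a placeholder session id (the _session_id input is documented below):

{
  "name": "helmdeck__vision-extract_visible_text",
  "arguments": {
    "model": "openrouter/anthropic/claude-haiku-4.5",
    "_session_id": "<session id from the vision.click_anywhere response>"
  }
}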

Setup prerequisite

Vision packs run on a desktop-mode session (HELMDECK_MODE=desktop — set automatically by the pack via SessionSpec).

Inputs

Field | Type | Required | Default | Notes
model | string | yes | (none) | Vision-capable provider/model, e.g. openrouter/anthropic/claude-haiku-4.5 or openai/gpt-4o.
_session_id | string | yes (chained) | (none) | Pass the session id from the upstream desktop-mode pack.

Outputs

Field | Type | Notes
text | string | Newline-joined transcription of every readable piece of text on the visible desktop.
model | string | Echo of the model used.

Vault credentials needed

None — the AI key for model resolves through the AI Providers UI panel.

Use it from your agent (OpenClaw chat-UI worked example)

Prompt (sent in OpenClaw chat UI / openclaw-cli agent):

Use helmdeck__vision-extract_visible_text against the visible desktop. Use model openrouter/anthropic/claude-haiku-4.5. Tell me what text the vision model transcribed (truncate to 200 chars if very long).

Tool call (1 call, no failures):

{
  "name": "helmdeck__vision-extract_visible_text",
  "arguments": {
    "model": "openrouter/anthropic/claude-haiku-4.5"
  }
}

Pack response (truncated):

{"pack":"vision.extract_visible_text","version":"v1","output":{"model":"openrouter/anthropic/claude-haiku-4.5","text":""},"duration_ms":9509607044,"session_id":"09abae0f-3383-4a03-9fde-1655c18eddc9"}

Agent's final reply:

The vision extraction returned an empty string, indicating no readable text was detected on the visible desktop.

Verified via OpenClaw 2026.5.6 + helmdeck v0.9.0-dev + openrouter/openai/gpt-oss-120b on 2026-05-07 (cost: $0.1835).

Developer reference (curl)

curl -fsS -X POST http://localhost:3000/api/v1/packs/vision.extract_visible_text \
-H "Authorization: Bearer $JWT" -H 'Content-Type: application/json' \
-d '{"model":"openrouter/anthropic/claude-haiku-4.5"}'

Response shape:

{
  "pack": "vision.extract_visible_text",
  "version": "v1",
  "output": {
    "text": "<every readable line on the desktop, newline-joined>",
    "model": "openrouter/anthropic/claude-haiku-4.5"
  },
  "duration_ms": <elapsed milliseconds>,
  "session_id": "…"
}
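The transcription (or the session id for chaining) can be pulled straight out with jq, assuming jq is installed; the paths follow the shape above:

curl -fsS -X POST http://localhost:3000/api/v1/packs/vision.extract_visible_text \
  -H "Authorization: Bearer $JWT" -H 'Content-Type: application/json' \
  -d '{"model":"openrouter/anthropic/claude-haiku-4.5"}' | jq -r '.output.text'
# Use '.session_id' instead of '.output.text' to keep the id for the next pack call.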

Error codes

Code | Triggers | Captured response
invalid_input | model empty | {"error":"invalid_input","message":"model must not be empty"}
session_unavailable | Engine has no session executor | {"error":"session_unavailable","message":"engine has no session executor"}
handler_failed | Screenshot capture failed (Xvfb died) | {"error":"handler_failed","message":"…"}
handler_failed | Model returned unparseable response | {"error":"handler_failed","message":"could not parse model response: …"}
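A defensive wrapper, assuming the error shape above is returned in the response body. Note that curl's -f flag (used in the developer reference) exits without printing the body on non-2xx responses, so it is dropped here:

resp=$(curl -sS -X POST http://localhost:3000/api/v1/packs/vision.extract_visible_text \
  -H "Authorization: Bearer $JWT" -H 'Content-Type: application/json' \
  -d '{"model":"openrouter/anthropic/claude-haiku-4.5"}')
if [ -n "$(jq -r '.error // empty' <<<"$resp")" ]; then
  # One of the codes above: surface the message and fail.
  echo "pack failed: $(jq -r '.message' <<<"$resp")" >&2
else
  jq -r '.output.text' <<<"$resp"
fi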

Session chaining

Required (creates if absent). Mostly used as a verification step after a vision.* action. Compatible chains, with a curl sketch of the chained flow after the list:

  • Upstream: any vision.* or desktop.* pack — chain to verify the result.
  • Downstream: nothing in particular — the text is the output the agent uses to make the next decision.
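A sketch of that chain over HTTP, assuming the upstream endpoint follows the same /api/v1/packs/<name> pattern as the developer reference above; $CLICK_ARGS stands in for vision.click_anywhere's (elided) arguments:

# 1) Run the upstream action pack and keep its session id.
SESSION_ID=$(curl -fsS -X POST http://localhost:3000/api/v1/packs/vision.click_anywhere \
  -H "Authorization: Bearer $JWT" -H 'Content-Type: application/json' \
  -d "$CLICK_ARGS" | jq -r '.session_id')

# 2) Verify the result against the same desktop session.
curl -fsS -X POST http://localhost:3000/api/v1/packs/vision.extract_visible_text \
  -H "Authorization: Bearer $JWT" -H 'Content-Type: application/json' \
  -d '{"model":"openrouter/anthropic/claude-haiku-4.5","_session_id":"'"$SESSION_ID"'"}' \
  | jq -r '.output.text'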

Async behavior

Synchronous. One screenshot + one model call. Wall-clock ≈ 1–3 seconds on a Haiku-tier model.
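Because the call blocks for one screenshot plus one model round-trip, a client-side timeout is a cheap guard. A sketch using curl's --max-time flag (the 30-second ceiling is an arbitrary choice, not a documented limit):

curl -fsS --max-time 30 -X POST http://localhost:3000/api/v1/packs/vision.extract_visible_text \
  -H "Authorization: Bearer $JWT" -H 'Content-Type: application/json' \
  -d '{"model":"openrouter/anthropic/claude-haiku-4.5"}'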

See also