
vision.extract_visible_text

The "tell me what's on the screen" pack. Caller supplies a vision-capable model; the pack screenshots the visible desktop and asks the model to transcribe every piece of readable text, joined with newlines. Use it as a verification step after a vision.click_anywhere or vision.fill_form_by_label call ("did the post-submit page load? what's the order ID?"), or for general "what app is showing right now" queries.

Single-step — no internal loop. Returns synchronously with the transcription text.
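When chained as a verification step, pass the upstream pack's session id so the screenshot is taken in the same desktop session. A sketch of the tool call with a placeholder session id (the _session_id input is documented below):

{
  "name": "helmdeck__vision-extract_visible_text",
  "arguments": {
    "model": "openrouter/anthropic/claude-haiku-4.5",
    "_session_id": "<session id from the vision.click_anywhere response>"
  }
}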

Setup prerequisite

Vision packs run on a desktop-mode session (HELMDECK_MODE=desktop — set automatically by the pack via SessionSpec).

Inputs

Field | Type | Required | Default | Notes
model | string | yes | (none) | Vision-capable provider/model, e.g. openrouter/anthropic/claude-haiku-4.5 or openai/gpt-4o.
_session_id | string | yes (chained) | (none) | Pass the session id from the upstream desktop-mode pack.

Outputs

Field | Type | Notes
text | string | Newline-joined transcription of every readable piece of text on the visible desktop.
model | string | Echo of the model used.

Vault credentials needed

None — the AI key for model resolves through the AI Providers UI panel.

Use it from your agent (OpenClaw chat-UI worked example)

Prompt (sent in OpenClaw chat UI / openclaw-cli agent):

Use helmdeck__vision-extract_visible_text against the visible desktop. Use model openrouter/anthropic/claude-haiku-4.5. Tell me what text the vision model transcribed (truncate to 200 chars if very long).

Tool call (1 call, no failures):

{
  "name": "helmdeck__vision-extract_visible_text",
  "arguments": {
    "model": "openrouter/anthropic/claude-haiku-4.5"
  }
}

Pack response (truncated):

{"pack":"vision.extract_visible_text","version":"v1","output":{"model":"openrouter/anthropic/claude-haiku-4.5","text":""},"duration_ms":9509607044,"session_id":"09abae0f-3383-4a03-9fde-1655c18eddc9"}

Agent's final reply:

The vision extraction returned an empty string, indicating no readable text was detected on the visible desktop.

Verified via OpenClaw 2026.5.6 + helmdeck v0.9.0-dev + openrouter/openai/gpt-oss-120b on 2026-05-07 (cost: $0.1835).

Developer reference (curl)

curl -fsS -X POST http://localhost:3000/api/v1/packs/vision.extract_visible_text \
-H "Authorization: Bearer $JWT" -H 'Content-Type: application/json' \
-d '{"model":"openrouter/anthropic/claude-haiku-4.5"}'

Response shape:

{
  "pack": "vision.extract_visible_text",
  "version": "v1",
  "output": {
    "text": "<every readable line on the desktop, newline-joined>",
    "model": "openrouter/anthropic/claude-haiku-4.5"
  },
  "duration_ms": <elapsed milliseconds>,
  "session_id": "…"
}
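The transcription (or the session id for chaining) can be pulled straight out with jq, assuming jq is installed; the paths follow the shape above:

curl -fsS -X POST http://localhost:3000/api/v1/packs/vision.extract_visible_text \
  -H "Authorization: Bearer $JWT" -H 'Content-Type: application/json' \
  -d '{"model":"openrouter/anthropic/claude-haiku-4.5"}' | jq -r '.output.text'
# Use '.session_id' instead of '.output.text' to keep the id for the next pack call.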

Error codes

Code | Triggers | Captured response
invalid_input | model empty | {"error":"invalid_input","message":"model must not be empty"}
session_unavailable | Engine has no session executor | {"error":"session_unavailable","message":"engine has no session executor"}
handler_failed | Screenshot capture failed (Xvfb died) | {"error":"handler_failed","message":"…"}
handler_failed | Model returned unparseable response | {"error":"handler_failed","message":"could not parse model response: …"}
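A defensive wrapper, assuming the error shape above is returned in the response body. Note that curl's -f flag (used in the developer reference) exits without printing the body on non-2xx responses, so it is dropped here:

resp=$(curl -sS -X POST http://localhost:3000/api/v1/packs/vision.extract_visible_text \
  -H "Authorization: Bearer $JWT" -H 'Content-Type: application/json' \
  -d '{"model":"openrouter/anthropic/claude-haiku-4.5"}')
if [ -n "$(jq -r '.error // empty' <<<"$resp")" ]; then
  # One of the codes above: surface the message and fail.
  echo "pack failed: $(jq -r '.message' <<<"$resp")" >&2
else
  jq -r '.output.text' <<<"$resp"
fi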

Session chaining

Required (creates if absent). Mostly used as a verification step after a vision.* action. Compatible chains, with a curl sketch of the chained flow after the list:

  • Upstream: any vision.* or desktop.* pack — chain to verify the result.
  • Downstream: nothing in particular — the text is the output the agent uses to make the next decision.
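A sketch of that chain over HTTP, assuming the upstream endpoint follows the same /api/v1/packs/<name> pattern as the developer reference above; $CLICK_ARGS stands in for vision.click_anywhere's (elided) arguments:

# 1) Run the upstream action pack and keep its session id.
SESSION_ID=$(curl -fsS -X POST http://localhost:3000/api/v1/packs/vision.click_anywhere \
  -H "Authorization: Bearer $JWT" -H 'Content-Type: application/json' \
  -d "$CLICK_ARGS" | jq -r '.session_id')

# 2) Verify the result against the same desktop session.
curl -fsS -X POST http://localhost:3000/api/v1/packs/vision.extract_visible_text \
  -H "Authorization: Bearer $JWT" -H 'Content-Type: application/json' \
  -d '{"model":"openrouter/anthropic/claude-haiku-4.5","_session_id":"'"$SESSION_ID"'"}' \
  | jq -r '.output.text'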

Async behavior

Synchronous. One screenshot + one model call. Wall-clock ≈ 1–3 seconds on a Haiku-tier model.
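Because the call blocks for one screenshot plus one model round-trip, a client-side timeout is a cheap guard. A sketch using curl's --max-time flag (the 30-second ceiling is an arbitrary choice, not a documented limit):

curl -fsS --max-time 30 -X POST http://localhost:3000/api/v1/packs/vision.extract_visible_text \
  -H "Authorization: Bearer $JWT" -H 'Content-Type: application/json' \
  -d '{"model":"openrouter/anthropic/claude-haiku-4.5"}'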

See also