vision.click_anywhere
The natural-language click pack. Caller supplies a goal — "click the URL bar at the top of the Chromium window" — plus a vision-capable model; the pack screenshots the visible desktop, asks the model for click coordinates, fires xdotool click at those coordinates, and loops until the model emits done (goal reached) or max_steps is hit. Every step's screenshot is recorded as an artifact so the operator can replay the trail in the Artifacts UI panel.
This is the agent-loop counterpart to the desktop.* REST primitives — use vision.click_anywhere when the agent doesn't already know the pixel coordinates; use desktop.click when it does.
💡 How the loop self-corrects: each iteration captures a fresh screenshot AND threads the prior turns' actions into the model's prompt as a textual history. So when iteration N's click misses, iteration N+1's screenshot reflects the (still-not-focused) state AND the prompt explicitly says "prior attempts: 1. click(376, 69) — 'click URL bar'. These did NOT achieve the goal — try a different approach." The model can then try a different coordinate or a different action entirely, instead of re-emitting the same failed click. A 250 ms post-dispatch wait inside
Step gives Xvfb time to repaint between the click and the next screenshot. Together these mean the pack converges in 1–4 steps for typical targets instead of looping until max_steps (#102).
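The self-correcting loop described above can be sketched as follows. This is a minimal illustration, not the pack's actual implementation: `Action`, `build_prompt`, `run_loop`, and the injected `take_screenshot`, `ask_model`, and `click` callables are hypothetical names, and the prompt wording only approximates the history text quoted above.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Action:
    action: str               # "click", "done", "none", ...
    x: Optional[int] = None
    y: Optional[int] = None
    reason: str = ""


def build_prompt(goal: str, history: List[Action]) -> str:
    """Thread earlier actions into the prompt so the model can avoid repeating them."""
    if not history:
        return f"Goal: {goal}"
    attempts = [
        f"{i}. {a.action}({a.x}, {a.y}) - '{a.reason}'"
        for i, a in enumerate(history, start=1)
    ]
    return (
        f"Goal: {goal}\nprior attempts:\n" + "\n".join(attempts)
        + "\nThese did NOT achieve the goal - try a different approach."
    )


def run_loop(
    goal: str,
    take_screenshot: Callable[[], bytes],
    ask_model: Callable[[bytes, str], Action],
    click: Callable[[int, int], None],
    max_steps: int = 6,
    settle_seconds: float = 0.25,   # assumed to mirror the 250 ms post-dispatch wait
) -> Tuple[bool, int, Optional[Action]]:
    """Return (completed, steps, final_action)."""
    history: List[Action] = []
    last: Optional[Action] = None
    for step in range(1, max_steps + 1):
        image = take_screenshot()                    # fresh screenshot every iteration
        last = ask_model(image, build_prompt(goal, history))
        if last.action == "done":
            return True, step, last
        if last.action == "click":
            click(last.x, last.y)
            time.sleep(settle_seconds)               # let the display repaint
        history.append(last)                         # visible to the next iteration
    return False, max_steps, last
```

Passing the screenshot, model, and click functions in as parameters keeps the sketch testable without a live desktop.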
Setup prerequisite
Vision packs run on a desktop-mode session (HELMDECK_MODE=desktop — set automatically by the pack via SessionSpec). The session boots with Chromium pre-launched on the XFCE4 display visible at http://localhost:6080/vnc.html (noVNC). Operators watching that URL will see the cursor move and the click happen.
Inputs
| Field | Type | Required | Default | Notes |
|---|---|---|---|---|
| goal | string | yes | none | Plain-English description of what to click. Be specific ("the blue Submit button at the bottom of the form"), not generic ("the button"). |
| model | string | yes | none | Vision-capable provider/model, e.g. openrouter/anthropic/claude-haiku-4.5 or openai/gpt-4o. Non-vision models fail silently because they cannot see the screenshot. |
| max_steps | number | no | 6 | Cap on the loop. Each step = one screenshot + one model call + one click (or no-op). |
| _session_id | string | yes (chained) | none | Pass the session id from the previous desktop-mode pack to keep the same Chromium / cursor state. |
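As an illustration of the input contract, a client might assemble the request body like this. The helper name `build_click_request` is hypothetical; the field names, the default of 6, and the two empty-field messages come from this page, while the client-side checks themselves are an assumption (the server performs its own validation).

```python
from typing import Any, Dict, Optional


def build_click_request(
    goal: str,
    model: str,
    max_steps: int = 6,
    session_id: Optional[str] = None,
) -> Dict[str, Any]:
    """Build the JSON body for vision.click_anywhere, failing early on empty fields."""
    if not goal:
        raise ValueError("goal must not be empty")
    if not model:
        raise ValueError("model must not be empty")
    payload: Dict[str, Any] = {"goal": goal, "model": model, "max_steps": max_steps}
    if session_id is not None:
        # Only sent when chaining from an earlier desktop-mode pack.
        payload["_session_id"] = session_id
    return payload
```

Leaving `_session_id` out of the payload when it is not known avoids sending an empty string in its place.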
Outputs
| Field | Type | Notes |
|---|---|---|
| completed | boolean | true if the model emitted done within max_steps. |
| steps | number | Steps actually taken. |
| final_action | object | The last vision.Action the model emitted. Includes the action ("click", "done", "none", etc.), pixel coordinates if any, and a free-text reason. |
The pack also writes one step-NNN.png artifact per step — accessible via the Artifacts API or the Artifacts UI panel keyed off the pack-call id.
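A caller could read these fields out of the response roughly as follows. This sketch assumes the envelope shape seen in the captured response on this page (an `output` object plus an `artifacts` list); `ClickResult` and `parse_click_response` are hypothetical names.

```python
import json
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class ClickResult:
    completed: bool
    steps: int
    final_action: Dict[str, Any]
    artifact_keys: List[str]


def parse_click_response(body: str) -> ClickResult:
    """Extract the documented output fields and the per-step artifact keys."""
    data = json.loads(body)
    out = data["output"]
    return ClickResult(
        completed=bool(out["completed"]),
        steps=int(out["steps"]),
        final_action=out["final_action"],
        # One step-NNN.png artifact is expected per step.
        artifact_keys=[a["key"] for a in data.get("artifacts", [])],
    )
```

A `completed` value of false with `steps` equal to `max_steps` means the loop hit its cap, so the step screenshots are the first place to look.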
Vault credentials needed
None — the AI key for model resolves through the AI Providers UI panel (or env-var fast path for OpenRouter via HELMDECK_OPENROUTER_API_KEY), not the vault.
Use it from your agent (OpenClaw chat-UI worked example)
Prompt (sent in OpenClaw chat UI / openclaw-cli agent):
Use helmdeck__vision-click_anywhere with goal "click the URL bar at the top of the Chromium window to focus it" against the visible desktop. Use model openrouter/anthropic/claude-haiku-4.5. Tell me whether the click completed and what action the model finally took.
Tool calls (2 calls; the first failed on an invalid model ID and was retried with the corrected ID):
{
  "name": "helmdeck__vision-click_anywhere",
  "arguments": {
    "goal": "click the URL bar at the top of the Chromium window to focus it",
    "max_steps": 10,
    "model": "openrouter/anthropic/claude-haiku-4"
  }
}
Pack response (truncated):
{"error":"handler_failed","message":"model call: openrouter upstream 400: {\"error\":{\"message\":\"anthropic/claude-haiku-4 is not a valid model ID\",\"code\":400},\"user_id\":\"user_2hJIOoqnDnf9gSbfewzyidUrkfW\"}"}
{
  "name": "helmdeck__vision-click_anywhere",
  "arguments": {
    "goal": "click the URL bar at the top of the Chromium window to focus it",
    "max_steps": 10,
    "model": "openrouter/anthropic/claude-haiku-4.5"
  }
}
Pack response (truncated):
{"pack":"vision.click_anywhere","version":"v1","output":{"completed":false,"final_action":{"action":"click","x":376,"y":69,"reason":"Click on the URL bar showing 'about:blank' to focus it"},"steps":10},"artifacts":[{"key":"vision.click_anywhere/e30698ebd92de276-step-000.png","url":"http://garage:3900/helmdeck-artifacts/vision.click_anywhere/e30698ebd92de276-step-000.png?X-Amz-Algorithm=AWS4-HMAC-SHA256\u0026X-Amz-Credential=GKaef177ebe9fd814dc103b84b%2F20260508%2Fgarage%2Fs3%2Faws4_request\u0026X-Amz-Date=20260508T133021Z\u0026X-Amz-Expires=900\u0026X-Amz-SignedHeaders=host\u0026X-Amz-Signature=a951147b1fa1e1972710907a8d6bca838e3c335c5c6774afb9eec0704a1087cf","size":27275,"content_type":"image/png","created_at":"2026-05-08T13:30:21.145059977Z","pack":"vision.click_anywhere"},{"key":"vision
...[truncated]
Agent's final reply:
- Click completed: false
- Final action taken: a click at coordinates (376, 69) with the reason “Click on the URL bar showing ‘about:blank’ to focus it.”
Verified via OpenClaw 2026.5.6 + helmdeck v0.9.0-dev + openrouter/openai/gpt-oss-120b on 2026-05-07 (cost: $0.1800).
Developer reference (curl)
curl -fsS -X POST http://localhost:3000/api/v1/packs/vision.click_anywhere \
  -H "Authorization: Bearer $JWT" -H 'Content-Type: application/json' \
  -d '{
    "goal": "click the URL bar at the top of the Chromium window",
    "model": "openrouter/anthropic/claude-haiku-4.5",
    "max_steps": 4
  }'
Live results depend on which app currently has focus on the desktop. The response contains {completed, steps, final_action} plus per-step screenshot artifacts.
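For callers who prefer Python over curl, the same request can be sketched with the standard library. The endpoint, headers, and body mirror the curl command above; `make_request`, `call_pack`, and the 60-second timeout are illustrative choices rather than part of helmdeck.

```python
import json
import urllib.request
from typing import Any, Dict


def make_request(base_url: str, jwt: str, payload: Dict[str, Any]) -> urllib.request.Request:
    """Build the POST request for the vision.click_anywhere pack endpoint."""
    return urllib.request.Request(
        url=f"{base_url}/api/v1/packs/vision.click_anywhere",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {jwt}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def call_pack(req: urllib.request.Request, timeout: float = 60.0) -> Dict[str, Any]:
    """Send the request; like `curl -f`, urlopen raises on HTTP error statuses."""
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Because the call is synchronous, the timeout should comfortably exceed the expected number of steps times the per-step latency.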
Error codes
| Code | Triggers | Captured response |
|---|---|---|
| invalid_input | goal empty | {"error":"invalid_input","message":"goal must not be empty"} |
| invalid_input | model empty | {"error":"invalid_input","message":"model must not be empty"} |
| session_unavailable | Engine has no session executor | {"error":"session_unavailable","message":"engine has no session executor"} |
| handler_failed | Vision step failed (screenshot capture, model call, action parse) | {"error":"handler_failed","message":"…"} |
| internal | Pack registered without a gateway dispatcher | vision.* registered without a gateway dispatcher |
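One way a client might surface these errors is to turn the error envelope into an exception. This is a sketch under the assumption that failures arrive as the `{"error", "message"}` JSON shown in the table; `PackError`, `raise_for_pack_error`, and the `"unknown"` fallback code for non-JSON bodies are hypothetical.

```python
import json
from typing import Any, Dict


class PackError(Exception):
    """Raised when a pack response carries an error envelope."""

    def __init__(self, code: str, message: str) -> None:
        super().__init__(f"{code}: {message}")
        self.code = code
        self.message = message


def raise_for_pack_error(body: str) -> Dict[str, Any]:
    """Return the parsed response, or raise PackError if it reports a failure."""
    try:
        data = json.loads(body)
    except ValueError:
        # Some failures may arrive as plain text; keep the raw body as the message.
        raise PackError("unknown", body)
    if isinstance(data, dict) and "error" in data:
        raise PackError(data["error"], data.get("message", ""))
    return data
```

Keeping `code` as a separate attribute lets callers branch on it, for example fixing the request on `invalid_input` and inspecting the message on `handler_failed`.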
Session chaining
Required: the pack needs a desktop-mode session and creates one if none is supplied. Compatible chains (desktop-mode only):
- Upstream: desktop.run_app_and_screenshot (launch an app first), vision.fill_form_by_label (fill before clicking submit).
- Downstream: vision.extract_visible_text (verify what you clicked), desktop.screenshot (capture the result).
Always pass _session_id from repo.fetch or the upstream desktop-mode pack — see the Session chaining contract in SKILLS.md.
Async behavior
Synchronous. Wall-clock = steps × (screenshot + model_latency + xdotool). Each step on a Haiku-tier vision model is ~2–4 seconds; budget accordingly.
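The budget can be estimated with simple arithmetic. The helper below is illustrative: it folds screenshot, model latency, and xdotool time into the single 2–4 second per-step figure quoted above, and `wall_clock_range` is a hypothetical name.

```python
from typing import Tuple


def wall_clock_range(
    steps: int,
    per_step_low: float = 2.0,    # seconds, lower end quoted for Haiku-tier models
    per_step_high: float = 4.0,   # seconds, upper end quoted for Haiku-tier models
) -> Tuple[float, float]:
    """Return the (low, high) wall-clock estimate in seconds for a run of `steps`."""
    return steps * per_step_low, steps * per_step_high


low, high = wall_clock_range(6)   # default max_steps
print(f"{low:.0f}-{high:.0f} s")  # prints "12-24 s"
```

With max_steps raised to 10, as in the worked example, the same estimate gives 20 to 40 seconds, so client-side timeouts should sit above the upper bound.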
See also
- Catalog row: vision.click_anywhere in PACKS.md.
- Source: internal/packs/builtin/vision_packs.go.
- ADR 027: Vision pipeline.
- ADR 035 §2026 revision: Native computer-use tool routing (T807f).
- Companion packs: vision.extract_visible_text, vision.fill_form_by_label, desktop.run_app_and_screenshot.