vision.click_anywhere

The natural-language click pack. The caller supplies a goal ("click the URL bar at the top of the Chromium window") plus a vision-capable model; the pack screenshots the visible desktop, asks the model for click coordinates, fires xdotool click at those coordinates, and loops until the model emits done (goal reached) or max_steps is hit. Every step's screenshot is recorded as an artifact so the operator can replay the trail in the Artifacts UI panel.
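
Each loop iteration boils down to one action object from the model. The exact wire format is internal to the pack, but judging from the final_action field in the output it looks like this (illustrative values):

{ "action": "click", "x": 376, "y": 69, "reason": "Click on the URL bar showing 'about:blank' to focus it" }
{ "action": "done", "reason": "URL bar is focused; goal reached" }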

This is the agent-loop counterpart to the desktop.* REST primitives — use vision.click_anywhere when the agent doesn't already know the pixel coordinates; use desktop.click when it does.

⚠️ Known limitation — see issue #102: the loop currently does not capture a fresh screenshot after each click before the next model turn. The model decides whether to emit done based on its prior pre-click screenshot plus its own confidence in the action, not on visually verified post-click state. For unambiguous targets (focusing a URL bar, clicking a clearly-labeled standalone button) this works fine. For dense UIs, modal dialogs, or pages mid-load, the pack may report completed: true when the click silently missed. Until the fix lands, callers should verify success with vision.extract_visible_text or desktop.screenshot after each click sequence.
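
Until the fix lands, a minimal verification chain looks like the sketch below, assuming desktop.screenshot is exposed at the same /api/v1/packs/<name> endpoint as the other packs and accepts the chained _session_id (where the session id lives in the first response is not shown here):

curl -fsS -X POST http://localhost:3000/api/v1/packs/vision.click_anywhere \
  -H "Authorization: Bearer $JWT" -H 'Content-Type: application/json' \
  -d '{"goal": "click the blue Submit button at the bottom of the form", "model": "openrouter/anthropic/claude-haiku-4.5"}'

# capture the post-click state in the same session to confirm the click landed
curl -fsS -X POST http://localhost:3000/api/v1/packs/desktop.screenshot \
  -H "Authorization: Bearer $JWT" -H 'Content-Type: application/json' \
  -d '{"_session_id": "<session id from the first call>"}'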

Setup prerequisite

Vision packs run on a desktop-mode session (HELMDECK_MODE=desktop — set automatically by the pack via SessionSpec). The session boots with Chromium pre-launched on the XFCE4 display visible at http://localhost:6080/vnc.html (noVNC). Operators watching that URL will see the cursor move and the click happen.

Inputs

| Field | Type | Required | Default | Notes |
| --- | --- | --- | --- | --- |
| goal | string | yes | | Plain-English description of what to click. Be specific ("the blue Submit button at the bottom of the form"), not generic ("the button"). |
| model | string | yes | | Vision-capable provider/model, e.g. openrouter/anthropic/claude-haiku-4.5 or openai/gpt-4o. Non-vision models silently fail because the screenshot is invisible to them. |
| max_steps | number | no | 6 | Cap on the loop. Each step = one screenshot + one model call + one click (or no-op). |
| _session_id | string | yes (chained) | | Pass the session id from the previous desktop-mode pack to keep the same Chromium / cursor state. |

Outputs

| Field | Type | Notes |
| --- | --- | --- |
| completed | boolean | true if the model emitted done within max_steps. |
| steps | number | Steps actually taken. |
| final_action | object | The last vision.Action the model emitted. Includes the action ("click", "done", "none", etc.), pixel coordinates if any, and a free-text reason. |

The pack also writes one step-NNN.png artifact per step — accessible via the Artifacts API or the Artifacts UI panel keyed off the pack-call id.
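
For reference, a successful run's output block has this shape (illustrative values; a real captured response appears in the worked example below):

{
  "completed": true,
  "steps": 2,
  "final_action": { "action": "done", "reason": "URL bar is focused" }
}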

Vault credentials needed

None — the AI key for model resolves through the AI Providers UI panel (or env-var fast path for OpenRouter via HELMDECK_OPENROUTER_API_KEY), not the vault.
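
For the OpenRouter fast path that means exporting the key into the engine's environment before the session starts:

export HELMDECK_OPENROUTER_API_KEY="sk-or-..."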

Use it from your agent (OpenClaw chat-UI worked example)

Prompt (sent in OpenClaw chat UI / openclaw-cli agent):

Use helmdeck__vision-click_anywhere with goal "click the URL bar at the top of the Chromium window to focus it" against the visible desktop. Use model openrouter/anthropic/claude-haiku-4.5. Tell me whether the click completed and what action the model finally took.

Tool calls (2 calls; the first failed on an invalid model ID and was retried):

{
  "name": "helmdeck__vision-click_anywhere",
  "arguments": {
    "goal": "click the URL bar at the top of the Chromium window to focus it",
    "max_steps": 10,
    "model": "openrouter/anthropic/claude-haiku-4"
  }
}

Pack response (error):

{"error":"handler_failed","message":"model call: openrouter upstream 400: {\"error\":{\"message\":\"anthropic/claude-haiku-4 is not a valid model ID\",\"code\":400},\"user_id\":\"user_2hJIOoqnDnf9gSbfewzyidUrkfW\"}"}

The agent retried with the corrected model ID:

{
  "name": "helmdeck__vision-click_anywhere",
  "arguments": {
    "goal": "click the URL bar at the top of the Chromium window to focus it",
    "max_steps": 10,
    "model": "openrouter/anthropic/claude-haiku-4.5"
  }
}

Pack response (truncated):

{"pack":"vision.click_anywhere","version":"v1","output":{"completed":false,"final_action":{"action":"click","x":376,"y":69,"reason":"Click on the URL bar showing 'about:blank' to focus it"},"steps":10},"artifacts":[{"key":"vision.click_anywhere/e30698ebd92de276-step-000.png","url":"http://garage:3900/helmdeck-artifacts/vision.click_anywhere/e30698ebd92de276-step-000.png?X-Amz-Algorithm=AWS4-HMAC-SHA256\u0026X-Amz-Credential=GKaef177ebe9fd814dc103b84b%2F20260508%2Fgarage%2Fs3%2Faws4_request\u0026X-Amz-Date=20260508T133021Z\u0026X-Amz-Expires=900\u0026X-Amz-SignedHeaders=host\u0026X-Amz-Signature=a951147b1fa1e1972710907a8d6bca838e3c335c5c6774afb9eec0704a1087cf","size":27275,"content_type":"image/png","created_at":"2026-05-08T13:30:21.145059977Z","pack":"vision.click_anywhere"},{"key":"vision
...[truncated]

Agent's final reply:

  • Click completed: false
  • Final action taken: a click at coordinates (376, 69) with the reason “Click on the URL bar showing ‘about:blank’ to focus it.”

Verified via OpenClaw 2026.5.6 + helmdeck v0.9.0-dev + openrouter/openai/gpt-oss-120b on 2026-05-07 (cost: $0.1800).

Developer reference (curl)

curl -fsS -X POST http://localhost:3000/api/v1/packs/vision.click_anywhere \
  -H "Authorization: Bearer $JWT" -H 'Content-Type: application/json' \
  -d '{
    "goal": "click the URL bar at the top of the Chromium window",
    "model": "openrouter/anthropic/claude-haiku-4.5",
    "max_steps": 4
  }'

What the model sees depends on which app currently has focus on the desktop. The response returns {completed, steps, final_action} plus per-step screenshot artifacts.
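
To pull just those three fields out of the response envelope (the {"pack": ..., "version": ..., "output": {...}} shape captured in the worked example above), pipe through jq:

curl -fsS -X POST http://localhost:3000/api/v1/packs/vision.click_anywhere \
  -H "Authorization: Bearer $JWT" -H 'Content-Type: application/json' \
  -d '{"goal": "click the URL bar at the top of the Chromium window", "model": "openrouter/anthropic/claude-haiku-4.5", "max_steps": 4}' \
  | jq '.output | {completed, steps, final_action}'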

Error codes

| Code | Triggers | Captured response |
| --- | --- | --- |
| invalid_input | goal empty | {"error":"invalid_input","message":"goal must not be empty"} |
| invalid_input | model empty | {"error":"invalid_input","message":"model must not be empty"} |
| session_unavailable | Engine has no session executor | {"error":"session_unavailable","message":"engine has no session executor"} |
| handler_failed | Vision step failed (screenshot capture, model call, action parse) | {"error":"handler_failed","message":"…"} |
| internal | Pack registered without a gateway dispatcher | vision.* registered without a gateway dispatcher |

Session chaining

A session is required (the pack creates one if absent). Compatible chains, desktop-mode only:

  • Upstream: desktop.run_app_and_screenshot (launch an app first), vision.fill_form_by_label (fill before clicking submit).
  • Downstream: vision.extract_visible_text (verify what you clicked), desktop.screenshot (capture the result).

Always pass _session_id from repo.fetch or the upstream desktop-mode pack — see the Session chaining contract in SKILLS.md.
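
A minimal chained sequence, sketched under the assumption that the upstream pack's response exposes the session id (exact field name not documented here) and that desktop.run_app_and_screenshot takes an app input (a hypothetical field name; see that pack's docs):

# 1. launch an app in a fresh desktop-mode session
curl -fsS -X POST http://localhost:3000/api/v1/packs/desktop.run_app_and_screenshot \
  -H "Authorization: Bearer $JWT" -H 'Content-Type: application/json' \
  -d '{"app": "chromium"}'

# 2. click inside the same session
curl -fsS -X POST http://localhost:3000/api/v1/packs/vision.click_anywhere \
  -H "Authorization: Bearer $JWT" -H 'Content-Type: application/json' \
  -d '{"goal": "click the URL bar at the top of the Chromium window", "model": "openrouter/anthropic/claude-haiku-4.5", "_session_id": "<id from step 1>"}'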

Async behavior

Synchronous. Wall-clock = steps × (screenshot + model_latency + xdotool). Each step on a Haiku-tier vision model is ~2–4 seconds; budget accordingly.
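
For example, the worked run above exhausted its cap (max_steps: 10, no early done), which at 2–4 seconds per step comes to roughly 20–40 seconds end to end.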

See also