vision.click_anywhere works mechanically. The model still doesn't. Five projects waiting for someone to build them.
Issue #112 is the canonical research thread. This post pulls the same thinking into a project-shaped frame: five separable projects, any one of which would close the gap. If you're an OSS maintainer or researcher looking for a 3–6 month project that lots of people would benefit from — pick one.
In v0.10.0 we shipped a mechanical fix for vision.click_anywhere's "loops forever clicking the same coords" bug. The fix worked: per-step screenshots now show genuine visual progression instead of identical bytes. Xvfb repaints, scrot captures the new frame, the next iteration sees a different image.
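To make "identical bytes" concrete, here is a minimal sketch of how one might check a run's screenshots for stalled frames. It is not the code shipped in v0.10.0; `frame_digest` and `count_stalled_steps` are hypothetical names, and the input is assumed to be the raw bytes of each per-step capture in order.

```python
import hashlib
from typing import Sequence


def frame_digest(frame: bytes) -> str:
    """Return a SHA-256 hex digest of one screenshot's raw bytes."""
    return hashlib.sha256(frame).hexdigest()


def count_stalled_steps(frames: Sequence[bytes]) -> int:
    """Count steps whose screenshot is byte-identical to the previous one.

    Hypothetical helper: a non-zero result means at least one step saw
    exactly the same image as the step before it.
    """
    digests = [frame_digest(f) for f in frames]
    return sum(1 for prev, cur in zip(digests, digests[1:]) if prev == cur)
```

A run with genuine visual progression would return 0 here. Byte equality is a deliberately strict test: it catches a frame that was never repainted, but says nothing about whether the change on screen was the one the goal asked for.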
And the model still doesn't emit done.
We tested with three plausible goals: "click the URL bar to focus it", "click the New Tab button at the top of the Chromium window", and "click anywhere in the center of the visible window". Per-step coordinates were sensible and screenshots changed turn-to-turn, but claude-haiku-4.5 (and similar small, cheap vision models) hit max_steps: 10 with completed: false every time. The model could see the screenshot. It couldn't decide whether the click had succeeded.
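The failure shape is easy to reproduce with a stub. The sketch below is an assumed, simplified version of the step loop, not the project's actual implementation: `Action`, `RunResult`, `run_click_loop`, and the `decide`, `capture`, and `click` callbacks are all hypothetical names. Only `max_steps` and `completed` mirror fields mentioned above.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Action:
    kind: str  # "click" or "done" (hypothetical action vocabulary)
    x: int = 0
    y: int = 0


@dataclass
class RunResult:
    completed: bool
    steps: int
    clicks: List[Tuple[int, int]] = field(default_factory=list)


def run_click_loop(
    goal: str,
    decide: Callable[[str, bytes], Action],  # stands in for the vision model
    capture: Callable[[], bytes],            # stands in for the screenshot step
    click: Callable[[int, int], None],       # stands in for the mouse click
    max_steps: int = 10,
) -> RunResult:
    """Screenshot, ask the model, click, repeat until "done" or max_steps."""
    clicks: List[Tuple[int, int]] = []
    for step in range(1, max_steps + 1):
        action = decide(goal, capture())
        if action.kind == "done":
            return RunResult(completed=True, steps=step, clicks=clicks)
        click(action.x, action.y)
        clicks.append((action.x, action.y))
    # The model never emitted "done": the loop runs out of steps.
    return RunResult(completed=False, steps=max_steps, clicks=clicks)
```

A stub `decide` that always returns a click, however sensible the coordinates, ends with `completed=False` after `max_steps` iterations. In this sketch the loop has no way to terminate early on its own; the only exit before the step limit is the model judging that the goal has been met.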
