The Art of Building Verifiers for Computer Use Agents

Session · Expo track · Confirmed

Day: Day 4 — Session Day 3
Time: 11:40am-12:00pm
Room: Expo Stage 1
Track: —

Accessible with the Expo Explorer pass and above.

About this session

Every team building browser agents has the same problem: you can't trust your own evals. Browser tasks are too open-ended for deterministic checks, so teams use LLM verifiers as judges, and those judges are wrong constantly: WebVoyager misses 45% of failures, and WebJudge misses 22%. Use a judge like that as an RL reward and you're not training a better agent, you're training a more confident liar. This talk walks through the Universal Verifier, open-sourced with Microsoft Research, which reaches a false positive rate near zero and a Cohen's kappa matching human-human agreement. It covers four design principles, one open benchmark, and an honest account of where auto-research worked and where it plateaued.
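
For readers unfamiliar with the two metrics named above, here is a minimal sketch of how they could be computed from paired human and verifier labels. It is not the Universal Verifier's code. The function names and the toy labels are hypothetical, and it assumes that "missing a failure" means the verifier marks a human-judged failure as a success.

```python
from collections import Counter

def false_positive_rate(human, verifier):
    """Share of human-labeled failures that the verifier marked as success."""
    on_failures = [v for h, v in zip(human, verifier) if not h]
    if not on_failures:
        raise ValueError("no human-labeled failures to evaluate")
    return sum(on_failures) / len(on_failures)

def cohens_kappa(a, b):
    """Agreement between two raters, corrected for chance agreement."""
    n = len(a)
    if n == 0 or n != len(b):
        raise ValueError("label lists must be non-empty and of equal length")
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(count_a[k] * count_b[k] for k in set(count_a) | set(count_b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical toy labels: True = task succeeded, False = task failed.
human    = [True, True, False, False, False, True, False, False]
verifier = [True, True, True,  False, False, True, False, True]

fpr = false_positive_rate(human, verifier)
kappa = cohens_kappa(human, verifier)
print(f"false positive rate: {fpr:.2f}, Cohen's kappa: {kappa:.2f}")
```

On these toy labels the verifier passes 2 of the 5 human-judged failures, so raw agreement looks decent at 75% while kappa is only moderate. That gap between raw agreement and chance-corrected agreement is the reason to report kappa.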

Speakers

Miguel González Fernández

Browserbase

Corby Rosset

Microsoft Research