Accessible with the Engineering pass and above.
Frontier AI models score 80–90% on standard benchmarks like RKGI, yet when tested on visual tasks that any 3-year-old handles effortlessly, such as counting objects in an image, those same models fall to pieces. I watched this gap widen firsthand during my 14 years at Google Brain and DeepMind, where I co-led development on GLaM, PaLM 2, and Gemini. The problem is that most models reach high RKGI scores not through genuine visual understanding but by coding, a workaround that scores well and reveals little. Strip that away and you're left with systems that struggle to solve a simple crossword puzzle, identify what's the same or different across two images, or navigate a basic 3D view. Handling these tasks is essential to achieving human-level reasoning, and the current benchmark ecosystem wasn't built to evaluate them, leaving us with top-scoring models that can't even follow along with Count von Count.

In this talk I'll dig into why the current eval landscape systematically overstates capability, the structural reasons it does so, and how we got here, from the viewpoint of someone who was inside a leading frontier lab. I'll close with what I believe a more rigorous, consensus-driven eval framework needs to look like, and why the field needs to build one before the next generation of visual systems ships into the real world. Fixing visual reasoning starts with fixing how we measure it.

For engineers building on top of these models today, whether for document understanding, robotic perception, medical imaging, or any system where visual perception matters, the cost of getting this wrong is already showing up in production.