Accessible with the Engineering pass and above.
AI is acing the tests we set for it. So why are so many production deployments falling flat? Talk to any team shipping vertical AI and you'll hear the same story: impressive benchmark scores, underwhelming real-world results. The problem isn't that we lack benchmarks; it's that we're measuring the wrong things. Synthetic tasks and standardised datasets tell you how a model performs in a lab. They don't tell you whether it's working in your product, for your customers, on your edge cases. The gap between benchmark-ready and production-ready is where ROI goes to die. This talk draws on lessons from building Anterior's internal benchmark for real-world healthcare tasks, work that now helps health insurance providers make decisions covering 50 million American lives. I'll share how to bring in domain experts to translate real-world performance into concrete measurement rubrics, how to make imperfect synthetic data actually work in practice, the most common pitfalls teams fall into, and how to apply all of this to any vertical domain.