Accessible with the Leadership (All-Access) pass and above.
Leaderboard rankings tell an incomplete story. In this talk, Nick Heiner draws on hundreds of hours of hands-on evaluation across frontier models to argue that benchmark performance and production reliability are increasingly divergent signals. At its core is what Nick terms model–system misalignment: the gap between a model's agentic behavior and the infrastructure built to support it. Where Claude Code and Opus 4.6 deploy coordinated agent swarms that reflect tight co-development between model and platform, Gemini 3.1 Pro exhibits self-referential orchestration patterns, calling itself or Gemini 2.5 rather than delegating to purpose-built sub-agents. Nick argues this isn't a capability gap but an architectural one, with real consequences for teams building reliable multi-agent pipelines in production. Attendees will leave with a sharper framework for evaluating models not just on task performance but on how well their emergent behaviors fit the systems meant to deploy them, along with a clearer view of where today's frontier models are actually ready to do economically meaningful work.