Accessible with the Engineering pass and above.
Bad voice-agent calls are starting to look less like QA bugs and more like incident scenes. I learned that instinct at Citizen, where noisy radio, ambiguous speech, fast-moving incidents, and real-time alerts became information people might actually act on. That work was stressful for obvious reasons. Voice agents scare me more. Not because they sound creepy. Because they sound good enough that people trust them. And now they are connected to calendars, CRMs, EHRs, reservation systems, refunds, transfers, account data, and support workflows. At Hamming, we monitor more than 10,000 voice agents and have analyzed millions of calls. The weird thing you learn at that scale is that production voice agents do not usually fail like demos. They fail quietly. The agent sounds natural, but misses a two-word answer. It handles the happy path, but loses the plot when the caller interrupts. It says the address was updated, but no tool call happened. It supports six languages, but gets worse at the switch point between two of them. This talk is about treating every bad voice-agent call like an incident scene. The evidence is there if you collect it: transcript, waveform, latency waterfall, interruption points, ASR uncertainty, tool trace, system-of-record state, and post-call outcome. At Tesla, I learned that autonomous systems need release gates and regression loops before they hit the real world. At Citizen, I learned that messy audio becomes safety-critical when people act on it. Voice agents need both instincts. The takeaway is a voice-agent forensics loop. What did the caller say? What did the agent think happened? What did the tool actually do? What does the system of record say? And how do we turn that weird production failure into a regression test before it happens 10,000 more times?