Loophole - Adversarial Agents To Stress Test Your Morality

Session · Engineering track · Confirmed

Day: Day 4 (Session Day 3)
Time: 1:30pm-1:50pm
Room: Main Stage
Track: Harness Engineering

Accessible with the Engineering pass and above.

About this session

Most natural-language specifications have holes their authors didn't notice, and writing more rules tends to create more holes. I built Loophole to try a different approach: point adversarial agents at a spec until it stops breaking.

You give the system a set of natural-language principles, and an AI drafts a formal, codified version. Then two adversarial agents go to work: one finds cases the code permits but the principles forbid; the other finds cases the code forbids but the principles allow. A judge agent patches the code when it can, but only if the fix doesn't contradict any prior ruling. When a contradiction can't be resolved, it escalates to you. Every decision becomes binding precedent, so the constraint space tightens round after round.

I started with moral and legal reasoning as the demo, and on its own that's already interesting: it turns into a kind of game where you discover contradictions in your own beliefs that you didn't know were there. But the pattern generalizes well past that. The same loop works for company policies that need to survive contact with edge cases, for making chatbot system prompts adversarially robust, and for stress-testing eval rubrics. Taking the long view, it could also support something like a smarter legislative process, where proposed laws get checked against the public's stated values before they pass, so contradictions surface before they hit a courtroom.

The talk walks through how the harness works, the design choices that matter (especially why precedent is the load-bearing piece), what kinds of specs it handles well, where it breaks, and what it would take to push it further. All code is open source.
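
To make the loop concrete, here is a minimal Python sketch under heavy simplifying assumptions. It is not the Loophole implementation. A case is a tuple of yes/no facts; the codified spec is an ordered rule list where the first match wins; the principles are a plain Python predicate rather than natural language; both adversaries are exhaustive searches over the case space rather than LLM agents; and escalation is a callback standing in for you. Every name here (`Harness`, `find_gap`, `judge`, `max_conditions`, `keep_promises`) is hypothetical.

```python
from dataclasses import dataclass, field
from itertools import combinations, product
from typing import Callable, Optional

# Toy model (hypothetical, not Loophole's actual representation):
# a case is a tuple of yes/no facts; a verdict is True (permitted) or False (forbidden).
Case = tuple[bool, ...]
Rule = tuple[dict[int, bool], bool]  # (required fact values, verdict)


@dataclass
class Harness:
    principles: Callable[[Case], bool]   # stands in for the natural-language principles
    human: Callable[[Case, bool], bool]  # escalation: (case, principles' verdict) -> your ruling
    n_facts: int
    rules: list[Rule] = field(default_factory=list)  # codified spec: first matching rule wins
    default: bool = True                 # naive first draft: everything is permitted
    max_conditions: int = 1              # the judge may only write rules this general
    precedent: dict[Case, bool] = field(default_factory=dict)
    escalations: list[Case] = field(default_factory=list)

    def code(self, case: Case, rules: Optional[list[Rule]] = None) -> bool:
        for conds, verdict in (self.rules if rules is None else rules):
            if all(case[i] == v for i, v in conds.items()):
                return verdict
        return self.default

    def target(self, case: Case) -> bool:
        # Binding precedent overrides the principles for any case already ruled on.
        return self.precedent.get(case, self.principles(case))

    def find_gap(self, code_permits: bool) -> Optional[Case]:
        # One adversary per direction (exhaustive search here, not an LLM agent):
        # code_permits=True  -> code permits what the principles forbid;
        # code_permits=False -> code forbids what the principles allow.
        for case in product((False, True), repeat=self.n_facts):
            if self.code(case) == code_permits and self.target(case) != code_permits:
                return case
        return None

    def judge(self, case: Case) -> None:
        verdict = self.target(case)
        # Try the most general patch first; accept it only if every prior ruling still holds.
        for k in range(1, self.max_conditions + 1):
            for idx in combinations(range(self.n_facts), k):
                patched = [({i: case[i] for i in idx}, verdict)] + self.rules
                if all(self.code(c, patched) == r for c, r in self.precedent.items()):
                    self.rules = patched
                    self.precedent[case] = verdict
                    return
        # No general patch fits the precedent: escalate, then add a narrow carve-out.
        ruling = self.human(case, verdict)
        self.escalations.append(case)
        self.rules = [(dict(enumerate(case)), ruling)] + self.rules
        self.precedent[case] = ruling

    def run(self, max_rounds: int = 1000) -> int:
        # Every round adds one new precedent, so this stops within 2 ** n_facts rounds.
        for round_no in range(max_rounds):
            case = self.find_gap(True)
            if case is None:
                case = self.find_gap(False)
            if case is None:
                return round_no  # neither adversary can find a gap
            self.judge(case)
        raise RuntimeError("no convergence within max_rounds")


# Hypothetical example: may you break a promise?
# Facts: 0 = breaking it prevents harm, 1 = the other party released you, 2 = you profit.
def keep_promises(case: Case) -> bool:
    prevents_harm, released, _profits = case
    return prevents_harm or released


h = Harness(principles=keep_promises, human=lambda case, verdict: verdict, n_facts=3)
print(h.run(), h.rules, h.escalations)
# -> 2 [({1: True}, True), ({0: False}, False)] []
```

In this toy, precedent does the load-bearing work: the judge only accepts a patch that reproduces every prior ruling, and each round adds one new ruling, so on a finite case space the loop must stop. Drop the precedent check and a broad patch can silently reopen a hole an earlier round closed. Keeping `max_conditions` low is what forces escalations: with `max_conditions=1`, an XOR-shaped principle over two facts escalates twice, because no single-fact rule fits the rulings already on record.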

Topics

AI in Law · AI in Government · My talk is weird and doesn't fit anywhere listed!!

Speaker

Brendan Rappazzo

Machine learning researcher · Morgan Stanley