1,000 Agent Tasks in a Sandbox: What Breaks When LLMs Write and Run Code

Session · Engineering track · Confirmed

Day: Day 3 — Session Day 2
Time: 2:25pm-2:45pm
Room: Track 1
Track: Sandbox & Platform Engineering

Accessible with the Engineering pass and above.

About this session

We ran 1,000 automated tasks through a production code interpreter sandbox — file I/O, package installs, data analysis, ML training, binary downloads, multi-language execution — and tracked every failure. 88% passed. The other 12% revealed 18 distinct failure modes that no unit test would catch: binary encoding corruption in the transport layer, null bytes silently truncating file downloads, pip blocked by network isolation with no useful error, and path traversal inputs accepted without validation. This talk walks through the experiment design, the findings ranked by severity, and what we changed. If you are building or operating sandboxed execution for AI agents, these are the bugs waiting for your customers to find first.
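Two of the failure modes named above, unvalidated path traversal and binary payloads with null bytes getting damaged in transit, can be illustrated with a minimal Python sketch. This is not the speaker's test harness: the function names `is_within_workdir` and `roundtrip_ok`, the `/tmp/sandbox` directory, and the choice of base64 as the text-safe encoding are all assumptions made for illustration.

```python
import base64
import hashlib
from pathlib import Path


def is_within_workdir(workdir: str, user_path: str) -> bool:
    """Return True only if user_path resolves to a location inside workdir.

    Hypothetical helper: resolves '..' segments and absolute paths before
    comparing, so '../../etc/passwd' and '/etc/passwd' are both rejected.
    """
    root = Path(workdir).resolve()
    target = (root / user_path).resolve()
    return target == root or root in target.parents


def roundtrip_ok(payload: bytes) -> bool:
    """Check that bytes survive a text-only transport unchanged.

    Assumes base64 as the wire encoding (an illustrative choice); the
    SHA-256 comparison detects truncation or corruption, including
    payloads that contain null bytes.
    """
    wire = base64.b64encode(payload).decode("ascii")
    received = base64.b64decode(wire)
    return hashlib.sha256(received).digest() == hashlib.sha256(payload).digest()
```

A check like `roundtrip_ok(b"\x00\xff\x00binary\x00")` is the kind of case a plain-text fixture would not exercise, which is one plausible reason such bugs escape unit tests.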

Topics

Sandboxes · Coding Agents

Speaker

Kevin Orellana

Software Engineer · Amazon Web Services