Weight Folding, CUDA Streams, and the Bug That Made My Model Speak Backwards

Session | Engineering track | Confirmed

Day: Day 4 (Session Day 3)
Time: 3:20pm-3:40pm
Room: Track 9
Track: Inference

Accessible with the Engineering pass and above.

About this session

A talk about contributing GPU benchmarks to an open-source research paper (FlashNorm). I'll walk through the engineering journey: folding norm weights into projections, writing Triton kernels, accidentally making attention bidirectional (oops), and ultimately proving a 33-35% speedup on the norm+project operation. Practical lessons for anyone trying to optimize transformer inference.
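
For readers unfamiliar with the first step mentioned above, folding norm weights into a projection can be sketched as follows. This is a minimal illustration, not the FlashNorm implementation or the speaker's code: it assumes an RMSNorm followed directly by a bias-free linear projection, and the names `rms_norm`, `matvec`, and `fold_norm_weights` are hypothetical helpers written in plain Python for clarity. The idea is that scaling the normalized vector elementwise by `g` and then multiplying by `W` gives the same result as multiplying the unscaled normalized vector by a copy of `W` whose columns were pre-scaled by `g`.

```python
import math
import random

def rms_norm(x, g=None, eps=1e-6):
    """RMSNorm of a vector x; optionally scale elementwise by weights g."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    normed = [v / rms for v in x]
    if g is None:
        return normed
    return [n * gi for n, gi in zip(normed, g)]

def matvec(W, x):
    """Compute W @ x for W stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def fold_norm_weights(W, g):
    """Scale column j of W by g[j], so the norm weights live inside W."""
    return [[w * gj for w, gj in zip(row, g)] for row in W]

# Illustrative sizes and random data (assumptions, not from the talk).
random.seed(0)
d_in, d_out = 8, 4
x = [random.gauss(0, 1) for _ in range(d_in)]
g = [random.uniform(0.5, 1.5) for _ in range(d_in)]
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]

# Reference path: norm with weights, then project.
reference = matvec(W, rms_norm(x, g))
# Folded path: weightless norm, then project with the pre-scaled matrix.
folded = matvec(fold_norm_weights(W, g), rms_norm(x))

max_diff = max(abs(a - b) for a, b in zip(reference, folded))
print(max_diff < 1e-9)
```

Because the fold happens once, offline, the inference path no longer has to load or multiply by the norm weights. Whether that accounts for the speedup reported in the session is not something this sketch can show.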

Topics

Memory & Continual Learning
LLM Production Infra
AI Research
My talk is weird and doesn't fit anywhere listed!!

Speaker