2 hr deep dive on LLM Inference at Scale — Part 1 of 2

Workshop · Workshop track · Confirmed

Day: Day 1 — Workshop Day
Time: 12:10pm-1:10pm
Room: Track 3
Track: Workshops Day 1

Accessible with the Engineering + Workshops pass and above.

About this session

Most engineers using LLMs can call an API. Far fewer can explain why their model is slow, why it's running out of memory, or how the inference engines powering every major LLM API actually work. This workshop walks through the full inference stack, from how a transformer generates a single token to serving billions of tokens a day with vLLM, SGLang, TensorRT-LLM, Ray, and KServe/llm-d. The format is 60% explanation with live demos and 40% hands-on exercises. Attendees leave with a running vLLM server that they have benchmarked themselves. The session is based on the open-source practitioner's handbook being built live at github.com/harshuljain13/llm-inference-at-scale.

Topics

Inference (vLLM, SGLang, etc.)

Speaker

Harshul Jain

Senior Software Engineer · Audible Inc