
Wolfram Ravenwolf is an AI Evangelist at CoreWeave / Weights & Biases, where he helps builders evaluate, debug, and ship useful AI systems. He works across model evaluation, agent tooling, inference infrastructure, and developer education, translating hands-on engineering work into practical guidance for teams adopting frontier AI. Wolfram is the creator of WolfBench, a five-metric framework, based on Terminal-Bench 2.0, for evaluating agent performance, and he regularly tests new models, coding agents, and evaluation workflows in real-world conditions. He is also a ThursdAI co-host, speaker, writer, and longtime AI community builder. Before joining CoreWeave / Weights & Biases, he worked as an engineer, researcher, and consultant focused on making complex technology usable. His talks are practical, opinionated, and grounded in live experimentation: fewer buzzwords, more working systems.
Wolfram Ravenwolf is the creator of WolfBench, an agentic evaluation framework that reports five complementary metrics (Ceiling, Best-of, Average, Worst-of, Solid) rather than a single score, and is one of the most prolific public LLM evaluators. Attendees interested in how to rigorously measure AI agent reliability, and in why single-number benchmarks mislead, will get direct, data-backed answers from someone who actively conducts thousands of evaluation runs.
Public activity researched automatically · as of Jun 2026