Wolfram Ravenwolf

AI Evangelist · Weights & Biases by CoreWeave


Bio

Wolfram Ravenwolf is an AI Evangelist at CoreWeave / Weights & Biases, where he helps builders evaluate, debug, and ship useful AI systems. He works across model evaluation, agent tooling, inference infrastructure, and developer education, translating hands-on engineering work into practical guidance for teams adopting frontier AI. Wolfram is the creator of WolfBench, a five-metric framework, based on Terminal-Bench 2.0, for evaluating agent performance, and he regularly tests new models, coding agents, and evaluation workflows in real-world conditions. He is also a ThursdAI co-host, speaker, writer, and longtime AI community builder. Before joining CoreWeave / Weights & Biases, he worked as an engineer, researcher, and consultant focused on making complex technology usable. His talks are practical, opinionated, and grounded in live experimentation: fewer buzzwords, more working systems.

Session (1)

Day 1 · 12:10pm-1:10pm · Track 5

From Zero to Leaderboard: Building an End-to-End AI Agent Evaluation Pipeline

Wolfram Ravenwolf is the creator of WolfBench, an agentic evaluation framework that surfaces five complementary metrics (Ceiling, Best-of, Average, Worst-of, Solid) rather than a single score, and is one of the most prolific public LLM evaluators. Attendees interested in how to rigorously measure AI agent reliability — and why single-number benchmarks mislead — will get direct, data-backed answers from someone actively running thousands of evaluation runs.

GitHub

@WolframRavenwolf

Recent writing (2)

Introducing WolfBench: A new evaluation framework for models and agents · blog · Mar 2026
WolfBench: The impact of time(outs) on agentic benchmarks · blog · Mar 2026

Podcasts & interviews (6)

ThursdAI - Jan 8 - Vera Rubin's 5x Jump, Ralph Wiggum Goes Viral, GPT Health Launches & XAI Raises $20B Mid-Controversy · ThursdAI · Jan 2026
ThursdAI - Jan 22 - Clawdbot deep dive, GLM 4.7 Flash, Anthropic constitution + 3 new TSS models · ThursdAI · Jan 2026
ThursdAI - Feb 5 - Opus 4.6 was #1 for ONE HOUR before GPT 5.3 Codex, Voxtral transcription, Codex app, Qwen Coder Next & the Agentic Internet · ThursdAI · Feb 2026
ThursdAI - Feb 19 - Gemini 3.1 Pro Drops LIVE, Sonnet 4.6 Closes Gap, OpenClaw Goes to OpenAI · ThursdAI · Feb 2026
April 16 - Codex uses your mac in the background, Opus 4.7 release not quite Mythos + 3 interviews · ThursdAI · Apr 2026
AI just cracked an 80-year-old math problem nobody could solve — plus everything from Google I/O 26 · ThursdAI · May 2026

Public activity researched automatically · as of Jun 2026