Wolfram Ravenwolf

AI Evangelist · Weights & Biases by CoreWeave

Bio

Wolfram Ravenwolf is an AI Evangelist at CoreWeave / Weights & Biases, where he helps builders evaluate, debug, and ship useful AI systems. He works across model evaluation, agent tooling, inference infrastructure, and developer education, translating hands-on engineering work into practical guidance for teams adopting frontier AI. Wolfram is the creator of WolfBench, a five-metric agent evaluation framework based on Terminal-Bench 2.0, and regularly tests new models, coding agents, and evaluation workflows in real-world conditions. He is also a ThursdAI co-host, speaker, writer, and longtime AI community builder. Before joining CoreWeave/W&B, he worked as an engineer, researcher, and consultant focused on making complex technology usable. His talks are practical, opinionated, and grounded in live experimentation: fewer buzzwords, more working systems.

Session (1)

Wolfram Ravenwolf is the creator of WolfBench, an agentic evaluation framework that surfaces five complementary metrics (Ceiling, Best-of, Average, Worst-of, Solid) rather than a single score, and is one of the most prolific public LLM evaluators. Attendees interested in how to rigorously measure AI agent reliability, and in why single-number benchmarks mislead, will get direct, data-backed answers from someone who actively conducts thousands of evaluation runs.

GitHub

@WolframRavenwolf

Recent writing (2)

Podcasts & interviews (6)

Public activity researched automatically · as of Jun 2026