news | Victor Barres

May 13, 2026	New blog post — τ-Knowledge: benchmarking agents on realistic knowledge. The frontier has moved from 25.5% to 37.4% Pass^1 since the March release, leaving ~63 pp of headroom. Includes a behavioral analysis of what separates the strong agents from the rest.
May 11, 2026	Three τ-Bench family papers accepted to ICML 2026 — including τ²-Bench as an oral (slides): τ²-Bench (dual-control evaluation), τ-Knowledge (knowledge retrieval), and τ-Voice (full-duplex voice agents). See you in July!
May 01, 2026	τ-Voice — first benchmark to measure full-duplex voice agents on realistic, grounded customer-service tasks. Voice agents have closed most of the gap to non-reasoning text models in ~8 months.
Apr 20, 2026	μ-Bench released — an open multilingual transcription benchmark covering 5 locales, 5 ASR providers, and 4,270 human-annotated utterances from real customer calls.
Mar 18, 2026	τ³-Bench released — extending τ-Bench with a knowledge-retrieval domain (τ-Knowledge), full-duplex voice evaluation (τ-Voice), and community-contributed task fixes. Live leaderboard at taubench.com.
Mar 02, 2026	Organizing and judging the Sierra τ²-Bench Custom Track of the AgentX–AgentBeats Competition (Berkeley RDI, Fall 2025 – Spring 2026).
Jun 10, 2025	τ²-Bench released — a benchmark for evaluating conversational agents in a dual-control environment, where both the agent and the user can take actions on the world.