Victor Barres

I’m a researcher at Sierra, where I build and study conversational AI agents and the benchmarks that tell us when they actually work — the τ-Bench family (live leaderboard at taubench.com). My focus is on agents that must execute complex tool-using tasks while sustaining long, coherent conversational interactions — the two things real-world deployments demand at once.

My background is in computational cognitive science and cognitive linguistics, and I’ve spent years building real-world conversational systems across several startups.

current work

I lead the current work on the τ-Bench family of agent benchmarks (originally introduced at Sierra in 2024), including the codebase, the public leaderboard, and a sequence of extensions:

τ²-Bench — extends τ-Bench to a dual-control setting where both the agent and the user can act on the world.
τ-Knowledge — knowledge-retrieval domain.
τ-Voice — first benchmark to measure full-duplex voice agents on realistic, grounded customer-service tasks.
τ³-Bench — combines τ-Knowledge and τ-Voice with community-contributed task fixes and code improvements.

news

May 13, 2026	New blog post — τ-Knowledge: benchmarking agents on realistic knowledge. Frontier has moved from 25.5% → 37.4% Pass^1 since the March release, with ~63 pp of headroom still left. Includes a behavioral analysis of what separates the strong agents from the rest.
May 11, 2026	Three τ-Bench family papers accepted to ICML 2026 — including τ²-Bench as an oral (slides): τ²-Bench (dual-control evaluation), τ-Knowledge (knowledge retrieval), and τ-Voice (full-duplex voice agents). See you in July!
May 01, 2026	τ-Voice — first benchmark to measure full-duplex voice agents on realistic, grounded customer-service tasks. Voice agents have closed most of the gap to non-reasoning text models in ~8 months.
Apr 20, 2026	μ-Bench released — an open multilingual transcription benchmark covering 5 locales, 5 ASR providers, and 4,270 human-annotated utterances from real customer calls.
Mar 18, 2026	τ³-Bench released — extending τ-Bench with a knowledge-retrieval domain (τ-Knowledge), full-duplex voice evaluation (τ-Voice), and community-contributed task fixes. Live leaderboard at taubench.com.

selected publications

ICML

τ-Knowledge: Evaluating Conversational Agents over Unstructured Knowledge

Quan Shi, Alexandra Zytek, Pedram Razavi, Karthik Narasimhan, and Victor Barres

arXiv preprint arXiv:2603.04370, 2026

Accepted at ICML 2026.

@article{shi2026tauknowledge,
  title = {τ-Knowledge: Evaluating Conversational Agents over Unstructured Knowledge},
  author = {Shi, Quan and Zytek, Alexandra and Razavi, Pedram and Narasimhan, Karthik and Barres, Victor},
  journal = {arXiv preprint arXiv:2603.04370},
  note = {Accepted at ICML 2026.},
  year = {2026},
  url = {https://arxiv.org/abs/2603.04370},
  keywords = {ai-agents}
}

ICML

τ-Voice: Benchmarking Full-Duplex Voice Agents on Real-World Domains

Victor Barres^*, Soham Ray^*, Keshav Dhandhania^*, and Karthik Narasimhan

2026

Accepted at ICML 2026.

@misc{ray2026tauvoice,
  title = {τ-Voice: Benchmarking Full-Duplex Voice Agents on Real-World Domains},
  author = {Barres, Victor and Ray, Soham and Dhandhania, Keshav and Narasimhan, Karthik},
  note = {Accepted at ICML 2026.},
  year = {2026},
  archiveprefix = {arXiv},
  primaryclass = {cs.SD},
  url = {https://arxiv.org/abs/2603.13686},
  keywords = {ai-agents}
}

ICML

τ²-Bench: Evaluating Conversational Agents in a Dual-Control Environment

Victor Barres^*, Honghua Dong^*, Soham Ray, Xujie Si, and Karthik Narasimhan

2025

Oral at ICML 2026.

We introduce τ²-bench, a benchmark for evaluating conversational agents in a dual-control environment, where both the agent and the user can take actions on the world. τ²-bench extends τ-bench’s tool-agent-user interaction setting and supports rigorous evaluation of agents that must coordinate with users in real-world customer-service scenarios.

@misc{barres2025tau2,
  title = {τ²-Bench: Evaluating Conversational Agents in a Dual-Control Environment},
  author = {Barres, Victor and Dong, Honghua and Ray, Soham and Si, Xujie and Narasimhan, Karthik},
  note = {Oral at ICML 2026.},
  year = {2025},
  archiveprefix = {arXiv},
  primaryclass = {cs.AI},
  url = {https://arxiv.org/abs/2506.07982},
  keywords = {ai-agents}
}

NAACL

From Generating Answers to Building Explanations: Integrating Multi-Round RAG and Causal Modeling for Scientific QA

Victor Barres^*, Clifton James McFate^*, Aditya Kalyanpur, Kailash Karthik Saravanakumar, Lori Moon, Natnael Seifu, and Abraham Bautista-Castillo

In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Industry Track), 2025

@inproceedings{barres2025explanations,
  title = {From Generating Answers to Building Explanations: Integrating Multi-Round {RAG} and Causal Modeling for Scientific {QA}},
  author = {Barres, Victor and McFate, Clifton James and Kalyanpur, Aditya and Saravanakumar, Kailash Karthik and Moon, Lori and Seifu, Natnael and Bautista-Castillo, Abraham},
  booktitle = {Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Industry Track)},
  year = {2025},
  url = {https://aclanthology.org/2025.naacl-industry.42/},
  keywords = {ai-agents}
}

arXiv

LLM-ARC: Enhancing LLMs with an Automated Reasoning Critic

Aditya Kalyanpur, Kailash Karthik Saravanakumar, Victor Barres, Jennifer Chu-Carroll, David Melville, and David Ferrucci

2024

We introduce LLM-ARC, a neuro-symbolic framework designed to enhance the logical reasoning capabilities of Large Language Models (LLMs), by combining them with an Automated Reasoning Critic (ARC). LLM-ARC employs an Actor-Critic method where the LLM Actor generates declarative logic programs along with tests for semantic correctness, while the Automated Reasoning Critic evaluates the code, runs the tests and provides feedback on test failures for iterative refinement.

@misc{kalyanpur2024llmarc,
  title = {{LLM-ARC}: Enhancing {LLMs} with an Automated Reasoning Critic},
  author = {Kalyanpur, Aditya and Saravanakumar, Kailash Karthik and Barres, Victor and Chu-Carroll, Jennifer and Melville, David and Ferrucci, David},
  year = {2024},
  archiveprefix = {arXiv},
  primaryclass = {cs.AI},
  url = {https://arxiv.org/abs/2406.17663},
  keywords = {ai-agents}
}

AAAI-SS

Template Construction Grammar: A Schema-Theoretic Computational Construction Grammar

Victor J. Barres

In AAAI Spring Symposium on Computational Construction Grammar and Natural Language Understanding, 2017

@inproceedings{barres2017tcg,
  title = {Template Construction Grammar: A Schema-Theoretic Computational Construction Grammar},
  author = {Barres, Victor J.},
  booktitle = {AAAI Spring Symposium on Computational Construction Grammar and Natural Language Understanding},
  year = {2017},
  url = {https://aaai.org/proceeding/02-spring-2017/},
  keywords = {cogsci}
}
