Demystifying Evals for AI Agents
Published: January 9, 2026
Source: Anthropic Engineering Blog
Authors: Mikaela Grace, Jeremy Hadfield, Rodrigo Olivares, and Jiri De Jonghe (with acknowledgements to additional contributors)
Introduction
The article opens by arguing that good evaluations enable teams to ship AI agents with greater confidence. Without evals, teams fall into "reactive loops—catching issues only in production, where fixing one failure creates others." Evals surface problems and behavioral shifts before they reach users, and their benefit grows over an agent's lifecycle.
As previously described in Anthropic's "Building effective agents" post, agents work across many turns—calling tools, modifying state, and adapting based on intermediate results. These very capabilities (autonomy, intelligence, flexibility) that make agents useful also complicate evaluation. This post shares what Anthropic has learned internally and through customer collaborations about designing rigorous agent evals.
The Structure of an Evaluation
An evaluation is described as a test for an AI system: provide an input, then apply grading logic to measure success. The post focuses on automated evals that run during development without real users.
Single-turn evaluations are straightforward—prompt, response, grading logic. As AI capabilities have advanced, multi-turn evaluations have become increasingly common.
Agent evaluations are more complex still. Agents use tools across many turns, modifying state and adapting, meaning "mistakes can propagate and compound." Frontier models may also find creative solutions that surpass static eval limits. For example, Opus 4.5 solved a τ2-bench flight-booking problem by discovering a policy loophole—it technically "failed" the eval but actually found a better user solution.
Key Definitions
The article defines the following terms for agent evaluations:
- Task (a.k.a. problem/test case): a single test with defined inputs and success criteria
- Trial: each attempt at a task; multiple trials produce more consistent results due to output variation
- Grader: logic that scores some aspect of performance; a task can have multiple graders, each with multiple assertions or "checks"
- Transcript (trace/trajectory): the complete record of a trial, including outputs, tool calls, reasoning, and intermediate results
- Outcome: the final state in the environment at trial's end (e.g., whether a reservation actually exists in a database, regardless of what the agent said)
- Evaluation harness: infrastructure that runs evals end-to-end—providing instructions/tools, running tasks concurrently, recording steps, grading outputs, aggregating results
- Agent harness (scaffold): the system enabling a model to act as an agent—processing inputs, orchestrating tool calls, returning results. When evaluating "an agent," you're evaluating the harness and model together
- Evaluation suite: a collection of tasks measuring specific capabilities or behaviors, typically sharing a broad goal
Why Build Evaluations?
Early in agent development, teams can get surprisingly far through manual testing, dogfooding, and intuition. But after scaling to production, this approach breaks down.
The breaking point often arrives when users report degraded performance after changes, leaving the team "flying blind." Absent evals, debugging is reactive—waiting for complaints, reproducing manually, fixing bugs, hoping nothing else regresses. Teams cannot distinguish real regressions from noise or test changes against hundreds of scenarios before shipping.
The article cites Claude Code as an example: it began with fast iteration based on Anthropic employee and external user feedback, then added evals first for narrow areas (concision, file edits), then for more complex behaviors (over-engineering). These evals helped identify issues, guide improvements, and focus research-product collaborations.
Case Studies
Descript built evals around three editing workflow dimensions: "don't break things, do what I asked, and do it well." They evolved from manual grading to LLM graders with product-team-defined criteria and periodic human calibration, running separate suites for quality benchmarking and regression testing.
Bolt started building evals after already having a widely used agent. Within 3 months, they built a system that runs their agent, grades outputs with static analysis, uses browser agents to test apps, and employs LLM judges for behaviors like instruction following.
Evals are useful at any stage. Early on, they force teams to specify what success means; later, they uphold a consistent quality bar. They also accelerate adoption of new models—teams with evals can determine a model's strengths, tune prompts, and upgrade in days rather than weeks.
Once evals exist, baselines and regression tests come for free: latency, token usage, cost per task, and error rates can be tracked on a static task bank. Evals can become "the highest-bandwidth communication channel between product and research teams."
How to Evaluate AI Agents
Common agent types today include coding agents, research agents, computer use agents, and conversational agents. Each can be evaluated using similar techniques—the methods below serve as a foundation to extend to your domain.
Types of Graders
Agent evaluations typically combine three types:
Code-based graders include string matching, binary tests (fail-to-pass, pass-to-pass), static analysis, outcome verification, tool call verification, and transcript analysis. Strengths: fast, cheap, objective, reproducible, easy to debug. Weaknesses: brittle to valid variations, lacking nuance, limited for subjective tasks.
Model-based graders include rubric-based scoring, natural language assertions, pairwise comparison, reference-based evaluation, and multi-judge consensus. Strengths: flexible, scalable, captures nuance, handles open-ended tasks. Weaknesses: non-deterministic, more expensive than code, requires calibration with human graders.
Human graders include SME review, crowdsourced judgment, spot-check sampling, A/B testing, and inter-annotator agreement. Strengths: gold standard quality, matches expert judgment, used to calibrate model-based graders. Weaknesses: expensive, slow, often requires access to human experts at scale.
Scoring can be weighted (combined scores hit a threshold), binary (all graders must pass), or hybrid.
Capability vs. Regression Evals
Capability ("quality") evals ask what the agent can do well. They should start at a low pass rate, targeting areas of struggle and giving teams "a hill to climb."
Regression evals ask whether the agent still handles previously working tasks, and should maintain nearly 100% pass rate. They protect against backsliding. As teams improve on capability evals, regression evals ensure changes don't cause issues elsewhere.
After launch and optimization, high-pass-rate capability evals can "graduate" into a regression suite run continuously to catch drift.
Evaluating Coding Agents
Coding agents write, test, and debug code, navigating codebases and running commands. Effective evals rely on well-specified tasks, stable test environments, and thorough tests.
Deterministic graders are natural since "software is generally straightforward to evaluate: does the code run and do the tests pass?" Two benchmarks illustrate this:
- SWE-bench Verified gives agents GitHub issues from popular Python repos and grades by running the test suite—solutions must fix failing tests without breaking existing ones. LLMs progressed from 40% to >80% in one year.
- Terminal-Bench tests end-to-end technical tasks like building a Linux kernel from source or training an ML model.
Beyond pass/fail tests, it's often useful to also grade the transcript—using heuristics-based code quality rules and model-based graders with clear rubrics to assess behaviors like tool usage and user interaction.
The article provides an illustrative YAML example for a coding task (fixing an auth bypass vulnerability) that combines deterministic tests, an LLM rubric, static analysis, state checks, and tool call verification, along with tracked metrics for turns, tool calls, tokens, and latency. In practice, coding evaluations typically rely on unit tests for correctness and an LLM rubric for code quality, with additional graders added only as needed.
Evaluating Conversational Agents
Conversational agents interact with users in domains like support, sales, or coaching. Unlike traditional chatbots, they maintain state, use tools, and take actions mid-conversation. They present a distinct challenge: "the quality of the interaction itself is part of what you're evaluating."
Success is multidimensional: ticket resolved (state check), finished in <10 turns (transcript constraint), appropriate tone (LLM rubric). Two benchmarks incorporating multidimensionality are τ-Bench and τ2-Bench, which simulate multi-turn interactions where one model plays a user persona while the agent navigates realistic scenarios.
The article provides an illustrative YAML example for a support task (handling a frustrated customer refund) combining an LLM rubric, state check, tool call verification, and transcript constraints. In practice, conversational agent evaluations typically use model-based graders for both communication quality and goal completion, since many tasks may have multiple "correct" solutions.
Evaluating Research Agents
Research agents gather, synthesize, and analyze information, producing answers or reports. Unlike coding agents where unit tests provide binary signals, "research quality can only be judged relative to the task."
Unique challenges include: experts may disagree on comprehensiveness, ground truth shifts constantly, and longer outputs create more room for mistakes. The BrowseComp benchmark tests whether agents can find needles in haystacks across the open web—questions easy to verify but hard to solve.
One strategy combines grader types: groundedness checks verify claims are supported by sources, coverage checks define key facts a good answer must include, source quality checks confirm authoritative sources, and exact match works for objectively correct answers. An LLM can flag unsupported claims, gaps, and verify synthesis coherence and completeness.
Given the subjective nature, LLM-based rubrics should be frequently calibrated against expert human judgment.
Computer Use Agents
These agents interact with software through the same interface as humans—screenshots, mouse clicks, keyboard inputs, scrolling—rather than APIs or code execution. They can use any GUI application, from design tools to legacy enterprise software.
WebArena tests browser-based tasks using URL and page state checks plus backend state verification. OSWorld extends to full OS control, with evaluation scripts inspecting diverse artifacts: file system state, application configs, database contents, and UI element properties.
Browser use agents require balancing token efficiency and latency: DOM-based interactions execute quickly but consume many tokens, while screenshot-based interactions are slower but more token-efficient. In the Claude for Chrome product, Anthropic developed evals to check that the agent selected the right tool for each context.
Non-Determinism in Agent Evaluations
Agent behavior varies between runs, making results harder to interpret. Each task has its own success rate, and a task that passed one run might fail the next.
Two metrics capture this nuance:
pass@k measures the likelihood of at least one correct solution in k attempts. As k increases, pass@k rises. In coding, pass@1 is often most important. In other cases, proposing many solutions is valid as long as one works.
pass^k measures the probability that all k trials succeed. As k increases, pass^k falls. With a 75% per-trial rate over 3 trials, the probability of all three passing is (0.75)^3 ≈ 42%. This metric matters especially for customer-facing agents where reliable behavior is expected every time.
At k=1, both metrics are identical. By k=10, they tell opposite stories: pass@k approaches 100% while pass^k falls to 0%. Which to use depends on product requirements.
Going from Zero to One: A Roadmap
The article presents field-tested advice for building evals from scratch.
Step 0: Start Early
Teams often delay building evals thinking they need hundreds of tasks. In reality, "20-50 simple tasks drawn from real failures is a great start." Early in development, each change has a clear impact, so small sample sizes suffice. More mature agents need larger evals to detect smaller effects, but the 80/20 approach works initially. Evals get harder to build the longer you wait.
Step 1: Start with What You Already Test Manually
Begin with manual checks run during development. If already in production, look at bug trackers and support queues. Converting user-reported failures into test cases ensures the suite reflects actual usage.
Step 2: Write Unambiguous Tasks with Reference Solutions
A good task is one where "two domain experts would independently reach the same pass/fail verdict." Ambiguity in specifications becomes noise in metrics. Each task should be passable by a correctly-following agent. With frontier models, a 0% pass rate across many trials (0% pass@100) "is most often a signal of a broken task, not an incapable agent." For each task, create a reference solution that passes all graders—proving solvability and verifying grader configuration.
Step 3: Build Balanced Problem Sets
Test both where a behavior should occur and where it shouldn't. One-sided evals create one-sided optimization. Avoid class-imbalanced evals.
The article shares Anthropic's experience building evals for web search in Claude.ai—the challenge was preventing search when inappropriate while preserving extensive research capability. The team built evals covering both directions: queries requiring search (weather) and queries answerable from knowledge ("who founded Apple?"). Striking the right balance took many rounds of refinements.
Step 4: Build a Robust Eval Harness with a Stable Environment
The eval agent should function roughly the same as production, and the environment shouldn't introduce noise. Each trial should start from a clean environment. Unnecessary shared state between runs can cause correlated failures from infrastructure flakiness rather than agent performance. In some internal evals, Claude gained unfair advantage by examining git history from previous trials. Trials affected by the same environmental limitation (like limited memory) are not independent.
Step 5: Design Graders Thoughtfully
Choose deterministic graders where possible, LLM graders where necessary, and human graders judiciously for validation.
A common instinct is checking that agents followed very specific step sequences. The article finds this "too rigid and results in overly brittle tests, as agents regularly find valid approaches that eval designers didn't anticipate." It's often better to grade what the agent produced, not the path it took.
For multi-component tasks, build in partial credit—a support agent that correctly identifies a problem and verifies the customer but fails to process a refund is meaningfully better than one that fails immediately.
Model grading requires careful iteration. LLM-as-judge graders should be closely calibrated with human experts. To avoid hallucinations, give the LLM a way out (returning "Unknown" when insufficient information). Create clear, structured rubrics grading each dimension with isolated LLM-as-judges rather than one for all dimensions. Once robust, human review is only occasionally needed.
Some evaluations have subtle failure modes. The article cites Opus 4.5 initially scoring 42% on CORE-Bench until an Anthropic researcher found multiple issues: rigid grading penalizing "96.12" when expecting "96.124991…", ambiguous task specs, and stochastic tasks impossible to reproduce exactly. After fixing, the score jumped to 95%. Similarly, METR discovered misconfigured tasks in their time horizon benchmark that penalized models like Claude for following stated instructions while rewarding models that ignored the stated goal.
Make graders resistant to bypasses—tasks and graders should ensure passing genuinely requires solving the problem.
Step 6: Check the Transcripts
"You won't know if your graders are working well unless you read the transcripts and grades from many trials." Anthropic invested in tooling for viewing eval transcripts and regularly reads them. When a task fails, the transcript reveals whether the agent made a genuine mistake or graders rejected a valid solution. "Failures should seem fair: it's clear what the agent got wrong and why." Reading transcripts is how you verify the eval measures what actually matters.
Step 7: Monitor for Capability Eval Saturation
An eval at 100% tracks regressions but provides no improvement signal. Eval saturation occurs when an agent passes all solvable tasks. SWE-Bench Verified started at 30% and frontier models now approach >80%, nearing saturation. As evals saturate, only the hardest tasks remain, making results deceptive—large capability improvements appear as small score increases.
The article cites Qodo, a code review startup initially unimpressed by Opus 4.5 because one-shot coding evals didn't capture gains on complex tasks. They developed a new agentic eval framework providing a clearer picture.
"We do not take eval scores at face value until someone digs into the details of the eval and reads some transcripts."
Step 8: Keep Evaluation Suites Healthy Long-Term
An eval suite is a living artifact needing ongoing attention and clear ownership. At Anthropic, what proved most effective was establishing dedicated evals teams owning core infrastructure, while domain experts and product teams contribute tasks and run evaluations.
For AI product teams, owning and iterating on evaluations should be as routine as maintaining unit tests. Defining eval tasks is "one of the best ways to stress-test whether the product requirements are concrete enough to start building."
The article recommends eval-driven development: build evals to define planned capabilities before agents can fulfill them, then iterate until the agent performs well. When a new model drops, running the suite quickly reveals which bets paid off.
Product managers, customer success managers, or salespeople can use Claude Code to contribute eval tasks as PRs. "The people closest to product requirements and users are best positioned to define success."
How Evals Fit with Other Methods
Automated evaluations can run against thousands of tasks without deploying to production. But they're just one way to understand agent performance. A complete picture includes production monitoring, user feedback, A/B testing, manual transcript review, and systematic human evaluation.
The article provides a detailed comparison table of six methods:
| Method | Description | Pros | Cons |
|---|---|---|---|
| Automated evals | Running tests programmatically without real users | Faster iteration, fully reproducible, no user impact, can run on every commit, tests at scale | Requires up-front investment, ongoing maintenance, can create false confidence |
| Production monitoring | Tracking metrics and errors in live systems | Reveals real user behavior, catches issues synthetic evals miss, ground truth | Reactive, noisy signals, requires instrumentation investment, lacks ground truth for grading |
| A/B testing | Comparing variants with real user traffic | Measures actual user outcomes, controls for confounds, scalable | Slow (days/weeks for significance), only tests deployed changes, less signal on underlying "why" |
| User feedback | Explicit signals like thumbs-down or bug reports | Surfaces unanticipated problems, real examples, correlates with product goals | Sparse and self-selected, skews toward severe issues, users rarely explain why, not automated |
| Manual transcript review | Humans reading through agent conversations | Builds intuition for failure modes, catches subtle quality issues, helps calibrate | Time-intensive, doesn't scale, inconsistent coverage, reviewer fatigue |
| Systematic human studies | Structured grading by trained raters | Gold-standard judgments, handles subjective tasks, improves model-based graders | Expensive and slow, hard to run frequently, inter-rater disagreement, complex domains require experts |
These methods map to different development stages. Automated evals are especially useful pre-launch and in CI/CD. Production monitoring kicks in post-launch. A/B testing validates significant changes with sufficient traffic. User feedback and transcript review are ongoing gap-fillers. Systematic human studies are reserved for calibrating LLM graders or evaluating subjective outputs.
The article references the Swiss Cheese Model from safety engineering: no single evaluation layer catches every issue, but with multiple methods combined, failures slipping through one layer are caught by another. "The most effective teams combine these methods."
Conclusion
Teams without evals get bogged down in reactive loops. Teams that invest early find development accelerates as "failures become test cases, test cases prevent regressions, and metrics replace guesswork." Evals give the team a clear hill to climb, turning vague complaints into actionable signals.
The fundamentals are constant across agent types: start early; source realistic tasks from observed failures; define unambiguous success criteria; design graders thoughtfully combining multiple types; ensure problems are hard enough for the model; iterate on evaluations to improve signal-to-noise ratio; and read the transcripts.
AI agent evaluation remains a nascent, fast-evolving field. As agents take on longer tasks, collaborate in multi-agent systems, and handle increasingly subjective work, techniques will need to adapt.
Acknowledgements
Written by Mikaela Grace, Jeremy Hadfield, Rodrigo Olivares, and Jiri De Jonghe, with contributions from David Hershey, Gian Segato, Mike Merrill, Alex Shaw, Nicholas Carlini, Ethan Dixon, Pedram Navid, Jake Eaton, Alyssa Baum, Lina Tawfik, Karen Zhou, Alexander Bricken, Sam Kennedy, Robert Ying, and others. Special thanks to customers and partners including iGent, Cognition, Bolt, Sierra, Vals.ai, Macroscope, PromptLayer, Stripe, Shopify, the Terminal Bench team, and more.
Appendix: Eval Frameworks
Several open-source and commercial frameworks help implement agent evaluations:
Harbor: designed for running agents in containerized environments with infrastructure for running trials at scale across cloud providers and a standardized format for tasks and graders. Popular benchmarks like Terminal-Bench 2.0 ship through its registry.
Braintrust: combines offline evaluation with production observability and experiment tracking. Its
autoevalslibrary includes pre-built scorers for factuality, relevance, and other dimensions.LangSmith: offers tracing, offline/online evaluations, and dataset management integrated with the LangChain ecosystem. Langfuse provides similar capabilities as a self-hosted open-source alternative for data residency requirements.
Arize: offers Phoenix (open-source platform for LLM tracing, debugging, and evaluation) and AX (SaaS extending Phoenix for scale, optimization, and monitoring).
Many teams combine multiple tools, roll their own frameworks, or use simple evaluation scripts as a starting point. While frameworks can standardize and accelerate progress, "they're only as good as the eval tasks you run through them." The recommendation is to quickly pick a framework fitting your workflow, then invest energy in high-quality test cases and graders.