Demystifying Evals for AI Agents

Published: January 9, 2026

Source: Anthropic Engineering Blog

Authors: Mikaela Grace, Jeremy Hadfield, Rodrigo Olivares, and Jiri De Jonghe (with acknowledgements to additional contributors)


Introduction

The article opens by arguing that good evaluations enable teams to ship AI agents with greater confidence. Without evals, teams fall into "reactive loops—catching issues only in production, where fixing one failure creates others." Evals surface problems and behavioral shifts before they reach users, and their benefit grows over an agent's lifecycle.

As previously described in Anthropic's "Building effective agents" post, agents work across many turns—calling tools, modifying state, and adapting based on intermediate results. These very capabilities (autonomy, intelligence, flexibility) that make agents useful also complicate evaluation. This post shares what Anthropic has learned internally and through customer collaborations about designing rigorous agent evals.

The Structure of an Evaluation

An evaluation is described as a test for an AI system: provide an input, then apply grading logic to measure success. The post focuses on automated evals that run during development without real users.

Single-turn evaluations are straightforward—prompt, response, grading logic. As AI capabilities have advanced, multi-turn evaluations have become increasingly common.

Agent evaluations are more complex still. Agents use tools across many turns, modifying state and adapting, meaning "mistakes can propagate and compound." Frontier models may also find creative solutions that surpass static eval limits. For example, Opus 4.5 solved a τ2-bench flight-booking problem by discovering a policy loophole—it technically "failed" the eval but actually found a better user solution.

Key Definitions

The article defines the following terms for agent evaluations:

  • Task (a.k.a. problem/test case): a single test with defined inputs and success criteria
  • Trial: each attempt at a task; because model outputs vary between runs, multiple trials are run to produce more consistent results
  • Grader: logic that scores some aspect of performance; a task can have multiple graders, each with multiple assertions or "checks"
  • Transcript (trace/trajectory): the complete record of a trial, including outputs, tool calls, reasoning, and intermediate results
  • Outcome: the final state in the environment at trial's end (e.g., whether a reservation actually exists in a database, regardless of what the agent said)
  • Evaluation harness: infrastructure that runs evals end-to-end—providing instructions/tools, running tasks concurrently, recording steps, grading outputs, aggregating results
  • Agent harness (scaffold): the system enabling a model to act as an agent—processing inputs, orchestrating tool calls, returning results. When evaluating "an agent," you're evaluating the harness and model together
  • Evaluation suite: a collection of tasks measuring specific capabilities or behaviors, typically sharing a broad goal
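The definitions above can be sketched as a small data model. This is a minimal illustration, not the article's implementation: the class names, fields, and the flight-booking task are hypothetical, and a real harness would record far more detail.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Trial:
    """One attempt at a task: its transcript and the final environment state."""
    transcript: list[str]
    outcome: dict


# A grader scores one aspect of a trial and returns a value in [0, 1].
Grader = Callable[[Trial], float]


@dataclass
class Task:
    """A single test case with defined inputs and success criteria."""
    task_id: str
    prompt: str
    graders: dict[str, Grader] = field(default_factory=dict)


def grade(task: Task, trial: Trial) -> dict[str, float]:
    """Run every grader attached to the task against one trial."""
    return {name: grader(trial) for name, grader in task.graders.items()}


# Hypothetical task: grade the outcome (does a reservation exist?),
# not what the agent claimed in its transcript.
booking_task = Task(
    task_id="book-flight-001",
    prompt="Book a flight from SFO to JFK for tomorrow.",
    graders={
        "reservation_exists": lambda t: float(t.outcome.get("reservation") is not None)
    },
)
trial = Trial(
    transcript=["agent: Your flight is booked!"],
    outcome={"reservation": None},  # the database shows no booking
)
scores = grade(booking_task, trial)
```

Here the agent says the flight is booked, but the outcome grader scores 0.0 because no reservation exists in the recorded final state.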

Why Build Evaluations?

Early in agent development, teams can get surprisingly far through manual testing, dogfooding, and intuition. But after scaling to production, this approach breaks down.

The breaking point often arrives when users report degraded performance after changes, leaving the team "flying blind." Absent evals, debugging is reactive: waiting for complaints, reproducing manually, fixing bugs, and hoping nothing else regresses. Teams can neither distinguish real regressions from noise nor test changes against hundreds of scenarios before shipping.

The article cites Claude Code as an example: it began with fast iteration based on Anthropic employee and external user feedback, then added evals first for narrow areas (concision, file edits), then for more complex behaviors (over-engineering). These evals helped identify issues, guide improvements, and focus research-product collaborations.

Case Studies

  • Descript built evals around three editing workflow dimensions: "don't break things, do what I asked, and do it well." They evolved from manual grading to LLM graders with product-team-defined criteria and periodic human calibration, running separate suites for quality benchmarking and regression testing.

  • Bolt started building evals after already having a widely used agent. Within 3 months, they built a system that runs their agent, grades outputs with static analysis, uses browser agents to test apps, and employs LLM judges for behaviors like instruction following.

Evals are useful at any stage. Early on, they force teams to specify what success means; later, they uphold a consistent quality bar. They also accelerate adoption of new models—teams with evals can determine a model's strengths, tune prompts, and upgrade in days rather than weeks.

Once evals exist, baselines and regression tests come for free: latency, token usage, cost per task, and error rates can be tracked on a static task bank. Evals can become "the highest-bandwidth communication channel between product and research teams."

How to Evaluate AI Agents

Common agent types today include coding agents, research agents, computer use agents, and conversational agents. Each can be evaluated using similar techniques—the methods below serve as a foundation to extend to your domain.

Types of Graders

Agent evaluations typically combine three types:

Code-based graders include string matching, binary tests (fail-to-pass, pass-to-pass), static analysis, outcome verification, tool call verification, and transcript analysis. Strengths: fast, cheap, objective, reproducible, easy to debug. Weaknesses: brittle to valid variations, lacking nuance, limited for subjective tasks.

Model-based graders include rubric-based scoring, natural language assertions, pairwise comparison, reference-based evaluation, and multi-judge consensus. Strengths: flexible, scalable, captures nuance, handles open-ended tasks. Weaknesses: non-deterministic, more expensive than code, requires calibration with human graders.

Human graders include SME review, crowdsourced judgment, spot-check sampling, A/B testing, and inter-annotator agreement. Strengths: gold standard quality, matches expert judgment, used to calibrate model-based graders. Weaknesses: expensive, slow, often requires access to human experts at scale.

Scoring can be weighted (combined scores hit a threshold), binary (all graders must pass), or hybrid.
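A minimal sketch of the three scoring modes follows. The function names, grader names, weights, and thresholds are all illustrative assumptions, and the hybrid rule shown (some graders are mandatory, the rest are weighted) is only one plausible reading of "hybrid."

```python
def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of the grader scores named in `weights`."""
    total_weight = sum(weights.values())
    return sum(scores[name] * w for name, w in weights.items()) / total_weight


def weighted_pass(scores, weights, threshold: float) -> bool:
    """Weighted mode: the combined score must reach a threshold."""
    return weighted_score(scores, weights) >= threshold


def binary_pass(scores: dict[str, float]) -> bool:
    """Binary mode: every grader must pass fully."""
    return all(score >= 1.0 for score in scores.values())


def hybrid_pass(scores, required: list[str], weights, threshold: float) -> bool:
    """Hybrid mode (assumed): required graders must pass, the rest are weighted."""
    required_ok = all(scores[name] >= 1.0 for name in required)
    return required_ok and weighted_pass(scores, weights, threshold)


# Hypothetical grader outputs for one trial of a coding task.
scores = {"unit_tests": 1.0, "code_quality_rubric": 0.6, "static_analysis": 0.5}

passes_weighted = weighted_pass(
    scores,
    {"unit_tests": 0.5, "code_quality_rubric": 0.3, "static_analysis": 0.2},
    threshold=0.7,
)
passes_binary = binary_pass(scores)
passes_hybrid = hybrid_pass(
    scores,
    required=["unit_tests"],
    weights={"code_quality_rubric": 0.6, "static_analysis": 0.4},
    threshold=0.5,
)
```

With these numbers the trial passes under weighted and hybrid scoring but fails under binary scoring, which shows how much the choice of mode can change a reported pass rate.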

Capability vs. Regression Evals

Capability ("quality") evals ask what the agent can do well. They should start at a low pass rate, targeting areas of struggle and giving teams "a hill to climb."

Regression evals ask whether the agent still handles previously working tasks, and should maintain nearly 100% pass rate. They protect against backsliding. As teams improve on capability evals, regression evals ensure changes don't cause issues elsewhere.

After launch and optimization, high-pass-rate capability evals can "graduate" into a regression suite run continuously to catch drift.
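As a rough sketch of graduation, tasks whose measured pass rate clears a bar can be moved into the regression suite while the rest stay in the capability suite. The 0.95 threshold and the task names are assumptions for illustration; the article does not prescribe a cutoff.

```python
def split_for_graduation(
    pass_rates: dict[str, float], threshold: float = 0.95
) -> tuple[list[str], list[str]]:
    """Return (graduated, remaining) task ids based on per-task pass rate."""
    graduated = sorted(t for t, rate in pass_rates.items() if rate >= threshold)
    remaining = sorted(t for t, rate in pass_rates.items() if rate < threshold)
    return graduated, remaining


# Hypothetical per-task pass rates measured over many trials.
pass_rates = {"edit-file": 1.0, "refactor-module": 0.4, "write-tests": 0.96}
regression_suite, capability_suite = split_for_graduation(pass_rates)
```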

Evaluating Coding Agents

Coding agents write, test, and debug code, navigating codebases and running commands. Effective evals rely on well-specified tasks, stable test environments, and thorough tests.

Deterministic graders are natural since "software is generally straightforward to evaluate: does the code run and do the tests pass?" Two benchmarks illustrate this:

  • SWE-bench Verified gives agents GitHub issues from popular Python repos and grades by running the test suite: solutions must fix failing tests without breaking existing ones. LLM scores progressed from 40% to over 80% in one year.
  • Terminal-Bench tests end-to-end technical tasks like building a Linux kernel from source or training an ML model.
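The "fix failing tests without breaking existing ones" rule can be sketched as a fail-to-pass and pass-to-pass check. This is a simplified illustration, not the benchmark's actual grading code; the test names are invented, and missing results are treated as failures by assumption.

```python
def grade_patch(
    fail_to_pass: set[str],
    pass_to_pass: set[str],
    results_after: dict[str, bool],
) -> bool:
    """Pass only if the target tests now pass and no existing test broke.

    `results_after` maps test names to pass/fail after applying the patch.
    A test with no recorded result counts as a failure.
    """
    fixed = all(results_after.get(test, False) for test in fail_to_pass)
    unbroken = all(results_after.get(test, False) for test in pass_to_pass)
    return fixed and unbroken


# Hypothetical results after an agent's patch.
fail_to_pass = {"test_login_rejects_empty_password"}
pass_to_pass = {"test_login_ok", "test_logout"}

good_patch = grade_patch(
    fail_to_pass,
    pass_to_pass,
    {"test_login_rejects_empty_password": True, "test_login_ok": True, "test_logout": True},
)
# Fixes the bug but breaks an existing test, so it should fail.
regressing_patch = grade_patch(
    fail_to_pass,
    pass_to_pass,
    {"test_login_rejects_empty_password": True, "test_login_ok": False, "test_logout": True},
)
```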

Beyond pass/fail tests, it's often useful to also grade the transcript—using heuristics-based code quality rules and model-based graders with clear rubrics to assess behaviors like tool usage and user interaction.

The article provides an illustrative YAML example for a coding task (fixing an auth bypass vulnerability) that combines deterministic tests, an LLM rubric, static analysis, state checks, and tool call verification, along with tracked metrics for turns, tool calls, tokens, and latency. In practice, coding evaluations typically rely on unit tests for correctness and an LLM rubric for code quality, with additional graders added only as needed.

Evaluating Conversational Agents

Conversational agents interact with users in domains like support, sales, or coaching. Unlike traditional chatbots, they maintain state, use tools, and take actions mid-conversation. They present a distinct challenge: "the quality of the interaction itself is part of what you're evaluating."

Success is multidimensional: ticket resolved (state check), finished in <10 turns (transcript constraint), appropriate tone (LLM rubric). Two benchmarks incorporating multidimensionality are τ-Bench and τ2-Bench, which simulate multi-turn interactions where one model plays a user persona while the agent navigates realistic scenarios.

The article provides an illustrative YAML example for a support task (handling a frustrated customer refund) combining an LLM rubric, state check, tool call verification, and transcript constraints. In practice, conversational agent evaluations typically use model-based graders for both communication quality and goal completion, since many tasks may have multiple "correct" solutions.
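The three dimensions named above (state check, transcript constraint, tone rubric) can be graded separately and reported side by side. The sketch below is hypothetical: the field names and the 0.7 tone cutoff are assumptions, and `tone_score` stands in for the output of an LLM rubric grader rather than computing one.

```python
from dataclasses import dataclass


@dataclass
class SupportTrial:
    turns: list[str]       # transcript, one entry per turn
    ticket_status: str     # final state read from the ticketing system
    tone_score: float      # assumed 0-1 score produced by an LLM rubric grader


def grade_support_trial(
    trial: SupportTrial, max_turns: int = 10, min_tone: float = 0.7
) -> dict[str, bool]:
    """Grade each success dimension independently."""
    return {
        "ticket_resolved": trial.ticket_status == "resolved",   # state check
        "within_turn_limit": len(trial.turns) < max_turns,      # transcript constraint
        "appropriate_tone": trial.tone_score >= min_tone,       # rubric result
    }


short_trial = SupportTrial(turns=["..."] * 6, ticket_status="resolved", tone_score=0.9)
long_trial = SupportTrial(turns=["..."] * 12, ticket_status="resolved", tone_score=0.9)

short_result = grade_support_trial(short_trial)
long_result = grade_support_trial(long_trial)
```

Keeping the dimensions separate makes it visible when an agent resolves the ticket but takes too many turns, which a single combined score would hide.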

Evaluating Research Agents

Research agents gather, synthesize, and analyze information, producing answers or reports. Unlike coding agents where unit tests provide binary signals, "research quality can only be judged relative to the task."

Unique challenges include: experts may disagree on comprehensiveness, ground truth shifts constantly, and longer outputs create more room for mistakes. The BrowseComp benchmark tests whether agents can find needles in haystacks across the open web—questions easy to verify but hard to solve.

One strategy combines grader types: groundedness checks verify claims are supported by sources, coverage checks define key facts a good answer must include, source quality checks confirm authoritative sources, and exact match works for objectively correct answers. An LLM can flag unsupported claims and coverage gaps, and can verify that the synthesis is coherent and complete.
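Two of these checks can be approximated in code. The sketch below is deliberately crude and rests on stated assumptions: coverage is tested by case-insensitive substring match (a model-based grader would handle paraphrases), and source quality uses a hypothetical allowlist of trusted domains.

```python
from urllib.parse import urlparse


def coverage_score(answer: str, key_facts: list[str]) -> float:
    """Fraction of required key facts that appear verbatim in the answer."""
    if not key_facts:
        return 0.0
    text = answer.lower()
    hits = sum(1 for fact in key_facts if fact.lower() in text)
    return hits / len(key_facts)


def source_quality(urls: list[str], trusted_domains: set[str]) -> float:
    """Fraction of cited URLs whose host is a trusted domain or a subdomain of one."""
    if not urls:
        return 0.0

    def is_trusted(url: str) -> bool:
        host = urlparse(url).hostname or ""
        return any(host == d or host.endswith("." + d) for d in trusted_domains)

    return sum(is_trusted(u) for u in urls) / len(urls)


answer = "The Eiffel Tower opened in 1889 and is in Paris."
coverage = coverage_score(answer, ["1889", "Paris", "Gustave Eiffel"])
quality = source_quality(
    ["https://www.nature.com/articles/x", "http://example.blog/post"],
    trusted_domains={"nature.com"},
)
```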

Given the subjective nature, LLM-based rubrics should be frequently calibrated against expert human judgment.

Computer Use Agents

These agents interact with software through the same interface as humans—screenshots, mouse clicks, keyboard inputs, scrolling—rather than APIs or code execution. They can use any GUI application, from design tools to legacy enterprise software.

WebArena tests browser-based tasks using URL and page state checks plus backend state verification. OSWorld extends to full OS control, with evaluation scripts inspecting diverse artifacts: file system state, application configs, database contents, and UI element properties.
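Artifact inspection of this kind can be sketched as a check on the file system after a trial. The example is hypothetical and unrelated to the actual WebArena or OSWorld scripts: it assumes a task of enabling dark mode in an app that stores its settings at `app/settings.json`.

```python
import json
import tempfile
from pathlib import Path


def check_settings_outcome(root: Path) -> dict[str, bool]:
    """Inspect the final file system state instead of trusting the agent's report."""
    settings_path = root / "app" / "settings.json"
    if not settings_path.is_file():
        return {"settings_saved": False, "dark_mode_enabled": False}
    settings = json.loads(settings_path.read_text())
    return {
        "settings_saved": True,
        "dark_mode_enabled": settings.get("theme") == "dark",
    }


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    before = check_settings_outcome(root)  # nothing written yet
    # Simulate the state a successful agent run would leave behind.
    (root / "app").mkdir()
    (root / "app" / "settings.json").write_text(json.dumps({"theme": "dark"}))
    after = check_settings_outcome(root)
```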

Browser use agents require balancing token efficiency and latency: DOM-based interactions execute quickly but consume many tokens, while screenshot-based interactions are slower but more token-efficient. In the Claude for Chrome product, Anthropic developed evals to check that the agent selected the right tool for each context.

Non-Determinism in Agent Evaluations

Agent behavior varies between runs, making results harder to interpret. Each task has its own success rate, and a task that passed one run might fail the next.

Two metrics capture this nuance:

pass@k measures the likelihood of at least one correct solution in k attempts. As k increases, pass@k rises. In coding, pass@1 is often most important. In other cases, proposing many solutions is valid as long as one works.

pass^k measures the probability that all k trials succeed. As k increases, pass^k falls. With a 75% per-trial rate over 3 trials, the probability of all three passing is (0.75)^3 ≈ 42%. This metric matters especially for customer-facing agents where reliable behavior is expected every time.

At k=1, both metrics are identical. By k=10, they tell opposite stories: pass@k approaches 100% while pass^k falls toward 0% (at a 75% per-trial rate, 0.75^10 ≈ 6%). Which to use depends on product requirements.
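Both metrics can be computed from a per-trial success rate. The sketch below assumes trials are independent and share the same success probability p, which is a simplification; the function names are illustrative.

```python
def pass_at_k(p: float, k: int) -> float:
    """Probability that at least one of k independent trials succeeds."""
    return 1.0 - (1.0 - p) ** k


def pass_hat_k(p: float, k: int) -> float:
    """Probability that all k independent trials succeed (pass^k)."""
    return p ** k


# Compare the two metrics at a 75% per-trial success rate.
for k in (1, 3, 10):
    print(f"k={k:>2}  pass@k={pass_at_k(0.75, k):.3f}  pass^k={pass_hat_k(0.75, k):.3f}")
```

Under these assumptions, at p = 0.75 the two metrics agree at k=1, pass^3 is about 0.42, and by k=10 pass@k exceeds 0.999 while pass^k is about 0.06.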

Going from Zero to One: A Roadmap

The article presents field-tested advice for building evals from scratch.

Step 0: Start Early

Teams often delay building evals thinking they need hundreds of tasks. In reality, "20-50 simple tasks drawn from real failures is a great start." Early in development, each change has a clear impact, so small sample sizes suffice. More mature agents need larger evals to detect smaller effects, but the 80/20 approach works initially. Evals get harder to build the longer you wait.
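One way to see why small suites work early and larger ones are needed later is to estimate the uncertainty of a measured pass rate. The sketch below uses the normal approximation to a binomial proportion, which is an assumption that is rough at small n; the function name is illustrative.

```python
import math


def pass_rate_margin(p: float, n: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error for a pass rate p measured on n tasks."""
    return z * math.sqrt(p * (1.0 - p) / n)


margin_small = pass_rate_margin(0.5, 30)     # a 30-task starter suite
margin_large = pass_rate_margin(0.5, 1000)   # a mature 1000-task suite
```

With 30 tasks the margin is roughly 18 percentage points, enough to see a change from 30% to 70% but not one from 80% to 83%; with 1000 tasks it shrinks to about 3 points.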

Step 1: Start with What You Already Test Manually

Begin with manual checks run during development. If already in production, look at bug trackers and support queues. Converting user-reported failures into test cases ensures the suite reflects actual usage.

Step 2: Write Unambiguous Tasks with Reference Solutions

A good task is one where "two domain experts would independently reach the same pass/fail verdict." Ambiguity in specifications becomes noise in metrics. Each task should be passable by an agent that follows the instructions correctly. With frontier models, a 0% pass rate across many trials (0% pass@100) "is most often a signal of a broken task, not an incapable agent." For each task, create a reference solution that passes all graders; this proves the task is solvable and verifies that the graders are configured correctly.
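A reference-solution check can be automated by running the known-good output through every grader and listing the ones it fails. The sketch below is hypothetical: the grader names and the refund task are invented, and the graders are reduced to simple string checks.

```python
from typing import Callable


def failing_graders(
    graders: dict[str, Callable[[str], bool]], reference_solution: str
) -> list[str]:
    """Names of graders that reject the reference solution (should be empty)."""
    return [name for name, grader in graders.items() if not grader(reference_solution)]


# Hypothetical graders for a refund task.
graders = {
    "mentions_refund_id": lambda out: "refund_id" in out,
    "valid_amount": lambda out: "amount=25.00" in out,
    # Overly rigid: demands one exact wording of the final message.
    "exact_wording": lambda out: out == "Refund issued.",
}
reference = "refund_id=R-1001 amount=25.00"
broken = failing_graders(graders, reference)
```

Any grader that rejects the reference solution, such as `exact_wording` here, points to a misconfigured grader or an ambiguous task rather than an agent failure.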

Step 3: Build Balanced Problem Sets

Test both where a behavior should occur and where it shouldn't. One-sided evals create one-sided optimization. Avoid class-imbalanced evals.

The article shares Anthropic's experience building evals for web search in Claude.ai—the challenge was preventing search when inappropriate while preserving extensive research capability. The team built evals covering both directions: queries requiring search (weather) and queries answerable from knowledge ("who founded Apple?"). Striking the right balance took many rounds of refinements.
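A balanced set makes it possible to measure both directions at once. The sketch below is illustrative: each case is a hypothetical pair of (should the agent search, did the agent search), and the function refuses to score a one-sided set.

```python
def trigger_rates(cases: list[tuple[bool, bool]]) -> dict[str, float]:
    """Search rates on queries that need search and on queries that do not.

    Each case is (should_search, did_search).
    """
    needed = [did for should, did in cases if should]
    not_needed = [did for should, did in cases if not should]
    if not needed or not not_needed:
        raise ValueError("eval set is one-sided: include both kinds of queries")
    return {
        "searched_when_needed": sum(needed) / len(needed),
        "searched_when_not_needed": sum(not_needed) / len(not_needed),
    }


# An agent that always searches looks perfect on the first metric alone.
always_search = [(True, True), (True, True), (False, True), (False, True)]
rates = trigger_rates(always_search)
```

An eval containing only search-required queries would report 100% for this agent; the second metric exposes that it also searches on every query that does not need it.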

Step 4: Build a Robust Eval Harness with a Stable Environment

The agent used in evals should function roughly the same as the one in production, and the environment shouldn't introduce noise. Each trial should start from a clean environment. Unnecessary shared state between runs can cause correlated failures that stem from infrastructure flakiness rather than agent performance. In some internal evals, Claude gained an unfair advantage by examining git history from previous trials. Trials affected by the same environmental limitation (like limited memory) are not independent.
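One simple way to give each trial a clean start is to build its working directory from fixtures and discard it afterwards. This is a minimal sketch with hypothetical names; production harnesses typically isolate at the container or VM level, which also covers processes, network, and memory.

```python
import tempfile
from pathlib import Path
from typing import Callable, TypeVar

T = TypeVar("T")


def run_trial_isolated(agent_fn: Callable[[Path], T], fixture_files: dict[str, str]) -> T:
    """Run one trial in a fresh directory that is deleted when the trial ends."""
    with tempfile.TemporaryDirectory() as workdir:
        root = Path(workdir)
        for relative_path, content in fixture_files.items():
            target = root / relative_path
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_text(content)
        return agent_fn(root)


def probe_agent(root: Path) -> list[str]:
    """Stand-in agent: reports what it can see, then leaves a file behind."""
    visible = sorted(p.name for p in root.iterdir())
    (root / "notes_from_this_trial.txt").write_text("leftover state")
    return visible


first = run_trial_isolated(probe_agent, {"README.md": "hello"})
second = run_trial_isolated(probe_agent, {"README.md": "hello"})
```

The second trial sees only the fixture file, so nothing written by the first trial can leak into it.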

Step 5: Design Graders Thoughtfully

Choose deterministic graders where possible, LLM graders where necessary, and human graders judiciously for validation.

A common instinct is checking that agents followed very specific step sequences. The article finds this "too rigid and results in overly brittle tests, as agents regularly find valid approaches that eval designers didn't anticipate." It's often better to grade what the agent produced, not the path it took.

For multi-component tasks, build in partial credit—a support agent that correctly identifies a problem and verifies the customer but fails to process a refund is meaningfully better than one that fails immediately.
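Partial credit for the support example can be sketched as a weighted sum over completed steps. The step names and weights are assumptions chosen for illustration.

```python
# Hypothetical weights for the steps of a refund task; they sum to 1.0.
STEP_WEIGHTS = {
    "identified_problem": 0.3,
    "verified_customer": 0.3,
    "processed_refund": 0.4,
}


def partial_credit(completed_steps: set[str]) -> float:
    """Sum the weights of the steps the agent completed."""
    return sum(w for step, w in STEP_WEIGHTS.items() if step in completed_steps)


stalled_at_refund = partial_credit({"identified_problem", "verified_customer"})
failed_immediately = partial_credit(set())
```

The agent that stalls at the refund scores about 0.6 while the one that fails immediately scores 0, a distinction a pure pass/fail grader would erase.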

Model grading requires careful iteration. LLM-as-judge graders should be closely calibrated with human experts. To avoid hallucinations, give the LLM a way out, such as an instruction to return "Unknown" when it lacks sufficient information. Create clear, structured rubrics, and grade each dimension with an isolated LLM-as-judge rather than using one judge for all dimensions. Once the system is robust, human review is only occasionally needed.
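The pattern of one isolated judge call per dimension with an explicit "Unknown" option can be sketched as follows. No real model API is called: `call_model` is a placeholder for whatever client a team uses, `fake_model` is a stand-in for demonstration, and mapping unparseable replies to UNKNOWN is a design choice of this sketch.

```python
from typing import Callable

VERDICTS = {"PASS", "FAIL", "UNKNOWN"}


def build_prompt(dimension: str, rubric: str, transcript: str) -> str:
    """One prompt per dimension, with an explicit way out for the judge."""
    return (
        f"Grade only this dimension: {dimension}\n"
        f"Rubric: {rubric}\n"
        f"Transcript:\n{transcript}\n"
        "Answer with exactly one word: PASS, FAIL, or UNKNOWN "
        "(use UNKNOWN if the transcript lacks the information needed)."
    )


def judge(
    call_model: Callable[[str], str], rubrics: dict[str, str], transcript: str
) -> dict[str, str]:
    """Call the judge once per dimension; unparseable replies become UNKNOWN."""
    verdicts = {}
    for dimension, rubric in rubrics.items():
        reply = call_model(build_prompt(dimension, rubric, transcript)).strip().upper()
        verdicts[dimension] = reply if reply in VERDICTS else "UNKNOWN"
    return verdicts


def fake_model(prompt: str) -> str:
    """Stand-in for a real model call, used only to exercise the parsing."""
    if "dimension: tone" in prompt:
        return "pass"
    return "I think it is probably fine"


verdicts = judge(
    fake_model,
    {"tone": "Polite and empathetic throughout.", "policy": "Follows the refund policy."},
    transcript="user: I want a refund.\nagent: I'm sorry to hear that. Let me help.",
)
```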

Some evaluations have subtle failure modes. The article cites Opus 4.5 initially scoring 42% on CORE-Bench until an Anthropic researcher found multiple issues: rigid grading penalizing "96.12" when expecting "96.124991…", ambiguous task specs, and stochastic tasks impossible to reproduce exactly. After fixing, the score jumped to 95%. Similarly, METR discovered misconfigured tasks in their time horizon benchmark that penalized models like Claude for following stated instructions while rewarding models that ignored the stated goal.
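The rigid-grading problem can be illustrated by comparing exact string matching with a tolerance-based numeric check. This is a sketch, not the benchmark's grader: the 0.01 tolerance is an assumption, and the expected value is kept truncated as in the text above.

```python
import math


def numeric_match(answer: str, expected: float, abs_tol: float = 0.01) -> bool:
    """Accept answers within an absolute tolerance of the expected value."""
    try:
        value = float(answer.strip())
    except ValueError:
        return False  # not a number at all
    return math.isclose(value, expected, rel_tol=0.0, abs_tol=abs_tol)


expected = 96.124991  # truncated; the remaining digits are elided in the source

exact_string_verdict = "96.12" == str(expected)   # rigid grader rejects
tolerant_verdict = numeric_match("96.12", expected)
wrong_answer_verdict = numeric_match("95.0", expected)
```

The right tolerance depends on the task, so it is best set per task rather than globally.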

Make graders resistant to bypasses—tasks and graders should ensure passing genuinely requires solving the problem.

Step 6: Check the Transcripts

"You won't know if your graders are working well unless you read the transcripts and grades from many trials." Anthropic invested in tooling for viewing eval transcripts and regularly reads them. When a task fails, the transcript reveals whether the agent made a genuine mistake or graders rejected a valid solution. "Failures should seem fair: it's clear what the agent got wrong and why." Reading transcripts is how you verify the eval measures what actually matters.

Step 7: Monitor for Capability Eval Saturation

An eval at 100% tracks regressions but provides no improvement signal. Eval saturation occurs when an agent passes all solvable tasks. SWE-bench Verified started at 30% and frontier models now score above 80%, nearing saturation. As evals saturate, only the hardest tasks remain, making results deceptive: large capability improvements appear as small score increases.

The article cites Qodo, a code review startup initially unimpressed by Opus 4.5 because one-shot coding evals didn't capture gains on complex tasks. They developed a new agentic eval framework providing a clearer picture.

"We do not take eval scores at face value until someone digs into the details of the eval and reads some transcripts."

Step 8: Keep Evaluation Suites Healthy Long-Term

An eval suite is a living artifact needing ongoing attention and clear ownership. At Anthropic, what proved most effective was establishing dedicated evals teams owning core infrastructure, while domain experts and product teams contribute tasks and run evaluations.

For AI product teams, owning and iterating on evaluations should be as routine as maintaining unit tests. Defining eval tasks is "one of the best ways to stress-test whether the product requirements are concrete enough to start building."

The article recommends eval-driven development: build evals to define planned capabilities before agents can fulfill them, then iterate until the agent performs well. When a new model drops, running the suite quickly reveals which bets paid off.

Product managers, customer success managers, or salespeople can use Claude Code to contribute eval tasks as PRs. "The people closest to product requirements and users are best positioned to define success."

How Evals Fit with Other Methods

Automated evaluations can run against thousands of tasks without deploying to production. But they're just one way to understand agent performance. A complete picture includes production monitoring, user feedback, A/B testing, manual transcript review, and systematic human evaluation.

The article provides a detailed comparison table of six methods:

Method | Description | Pros | Cons
Automated evals | Running tests programmatically without real users | Faster iteration, fully reproducible, no user impact, can run on every commit, tests at scale | Requires up-front investment, ongoing maintenance, can create false confidence
Production monitoring | Tracking metrics and errors in live systems | Reveals real user behavior, catches issues synthetic evals miss, ground truth | Reactive, noisy signals, requires instrumentation investment, lacks ground truth for grading
A/B testing | Comparing variants with real user traffic | Measures actual user outcomes, controls for confounds, scalable | Slow (days/weeks for significance), only tests deployed changes, less signal on underlying "why"
User feedback | Explicit signals like thumbs-down or bug reports | Surfaces unanticipated problems, real examples, correlates with product goals | Sparse and self-selected, skews toward severe issues, users rarely explain why, not automated
Manual transcript review | Humans reading through agent conversations | Builds intuition for failure modes, catches subtle quality issues, helps calibrate | Time-intensive, doesn't scale, inconsistent coverage, reviewer fatigue
Systematic human studies | Structured grading by trained raters | Gold-standard judgments, handles subjective tasks, improves model-based graders | Expensive and slow, hard to run frequently, inter-rater disagreement, complex domains require experts

These methods map to different development stages. Automated evals are especially useful pre-launch and in CI/CD. Production monitoring kicks in post-launch. A/B testing validates significant changes with sufficient traffic. User feedback and transcript review are ongoing gap-fillers. Systematic human studies are reserved for calibrating LLM graders or evaluating subjective outputs.

The article references the Swiss Cheese Model from safety engineering: no single evaluation layer catches every issue, but with multiple methods combined, failures slipping through one layer are caught by another. "The most effective teams combine these methods."

Conclusion

Teams without evals get bogged down in reactive loops. Teams that invest early find development accelerates as "failures become test cases, test cases prevent regressions, and metrics replace guesswork." Evals give the team a clear hill to climb, turning vague complaints into actionable signals.

The fundamentals are constant across agent types: start early; source realistic tasks from observed failures; define unambiguous success criteria; design graders thoughtfully combining multiple types; ensure problems are hard enough for the model; iterate on evaluations to improve signal-to-noise ratio; and read the transcripts.

AI agent evaluation remains a nascent, fast-evolving field. As agents take on longer tasks, collaborate in multi-agent systems, and handle increasingly subjective work, techniques will need to adapt.

Acknowledgements

Written by Mikaela Grace, Jeremy Hadfield, Rodrigo Olivares, and Jiri De Jonghe, with contributions from David Hershey, Gian Segato, Mike Merrill, Alex Shaw, Nicholas Carlini, Ethan Dixon, Pedram Navid, Jake Eaton, Alyssa Baum, Lina Tawfik, Karen Zhou, Alexander Bricken, Sam Kennedy, Robert Ying, and others. Special thanks to customers and partners including iGent, Cognition, Bolt, Sierra, Vals.ai, Macroscope, PromptLayer, Stripe, Shopify, the Terminal Bench team, and more.

Appendix: Eval Frameworks

Several open-source and commercial frameworks help implement agent evaluations:

  • Harbor: designed for running agents in containerized environments with infrastructure for running trials at scale across cloud providers and a standardized format for tasks and graders. Popular benchmarks like Terminal-Bench 2.0 ship through its registry.

  • Braintrust: combines offline evaluation with production observability and experiment tracking. Its autoevals library includes pre-built scorers for factuality, relevance, and other dimensions.

  • LangSmith: offers tracing, offline/online evaluations, and dataset management integrated with the LangChain ecosystem. Langfuse provides similar capabilities as a self-hosted open-source alternative for data residency requirements.

  • Arize: offers Phoenix (open-source platform for LLM tracing, debugging, and evaluation) and AX (SaaS extending Phoenix for scale, optimization, and monitoring).

Many teams combine multiple tools, roll their own frameworks, or use simple evaluation scripts as a starting point. While frameworks can standardize and accelerate progress, "they're only as good as the eval tasks you run through them." The recommendation is to quickly pick a framework fitting your workflow, then invest energy in high-quality test cases and graders.
