Building Effective Agents

Published: December 19, 2024
Authors: Erik S. and Barry Zhang
Source: Anthropic Engineering Blog

Overview

The authors have worked with dozens of teams building LLM agents across industries. Their key finding: "the most successful implementations use simple, composable patterns rather than complex frameworks."

What Are Agents?

"Agent" has multiple definitions. Some see agents as fully autonomous systems; others describe more prescriptive implementations following predefined workflows. Anthropic categorizes all variations as agentic systems, drawing a distinction between two types:

Workflows: systems where LLMs and tools are orchestrated through predefined code paths
Agents: systems where LLMs dynamically direct their own processes and tool usage, maintaining control over how they accomplish tasks

When (and When Not) to Use Agents

The recommendation is to find the simplest solution possible and only increase complexity when needed. Agentic systems often trade latency and cost for better task performance. Workflows offer predictability for well-defined tasks, while agents are better when flexibility and model-driven decision-making are needed at scale. For many applications, "optimizing single LLM calls with retrieval and in-context examples is usually enough."
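
As a rough illustration of a single call augmented with retrieval and in-context examples, the sketch below assembles one prompt from the most relevant documents and a few worked examples. The names `retrieve` and `build_prompt` are hypothetical, and the word-overlap scoring is only a stand-in for whatever retrieval system an application would actually use.

```python
from typing import List, Tuple


def retrieve(query: str, documents: List[str], k: int = 2) -> List[str]:
    """Rank documents by how many words they share with the query (toy scoring)."""
    query_words = set(query.lower().split())
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query: str, documents: List[str], examples: List[Tuple[str, str]]) -> str:
    """Assemble one prompt: retrieved context, in-context examples, then the question."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, documents))
    shots = "\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
    return f"Context:\n{context}\n\nExamples:\n{shots}\n\nQ: {query}\nA:"


DOCS = [
    "Refunds are processed within 5 days",
    "Shipping takes 2 weeks",
    "Our office is in Berlin",
]
EXAMPLES = [("Where is the office?", "Berlin")]
prompt = build_prompt("how are refunds processed", DOCS, EXAMPLES)
```

The resulting string would be sent to the model in one call, so there is no orchestration to debug if the answer is wrong.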

When and How to Use Frameworks

Several frameworks are mentioned:

The Claude Agent SDK
Strands Agents SDK by AWS
Rivet (drag and drop GUI LLM workflow builder)
Vellum (GUI tool for building and testing complex workflows)

These frameworks simplify low-level tasks like calling LLMs, defining and parsing tools, and chaining calls. However, "they often create extra layers of abstraction that can obscure the underlying prompts and responses, making them harder to debug." The suggestion is to start using LLM APIs directly since many patterns can be implemented in just a few lines of code.

Building Blocks, Workflows, and Agents

Building Block: The Augmented LLM

The basic building block is an LLM enhanced with augmentations such as retrieval, tools, and memory. Current models can actively use these capabilities — generating search queries, selecting tools, and determining what information to retain. The recommendation focuses on tailoring capabilities to specific use cases and ensuring an easy, well-documented interface for the LLM. The Model Context Protocol is one approach for integrating third-party tools.

Workflow: Prompt Chaining

Prompt chaining decomposes a task into a sequence of steps, where each LLM call processes the output of the previous one. Programmatic checks ("gates") can be added on intermediate steps.

When to use: Ideal when a task can be cleanly decomposed into fixed subtasks. The chain trades latency for higher accuracy by making each LLM call an easier task.

Examples:

Generating marketing copy, then translating it into a different language
Writing an outline, checking it meets criteria, then writing the document based on it
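
The outline-then-document example can be sketched as a two-step chain with a gate in between. This is a minimal sketch, not the authors' implementation: the `LLM` callable, `outline_passes_gate`, and `fake_llm` are hypothetical names, and `fake_llm` stands in for a real model call so the code runs without an API key.

```python
from typing import Callable

LLM = Callable[[str], str]  # assumption: takes a prompt, returns the model's text


def outline_passes_gate(outline: str, min_sections: int = 3) -> bool:
    """Programmatic gate: require a minimum number of non-empty outline lines."""
    sections = [line for line in outline.splitlines() if line.strip()]
    return len(sections) >= min_sections


def write_document(topic: str, llm: LLM) -> str:
    """Step 1 drafts an outline, the gate checks it, step 2 writes from it."""
    outline = llm(f"Write an outline for a document about: {topic}")
    if not outline_passes_gate(outline):
        raise ValueError("Outline failed the gate; stopping before the drafting step.")
    return llm(f"Write the document following this outline:\n{outline}")


def fake_llm(prompt: str) -> str:
    """Stand-in model with canned behaviour for each step of the chain."""
    if prompt.startswith("Write an outline"):
        return "1. Intro\n2. Body\n3. Conclusion"
    return "DRAFT BASED ON: " + prompt.splitlines()[-1]


document = write_document("agents", fake_llm)
```

Because the gate is ordinary code, a failed check stops the chain before the more expensive drafting call is made.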

Workflow: Routing

Routing classifies an input and directs it to a specialized followup task, enabling separation of concerns and more specialized prompts.

When to use: Works well for complex tasks with distinct categories better handled separately, where classification can be handled accurately.

Examples:

Directing different customer service queries (general questions, refund requests, technical support) into different downstream processes
Routing easy/common questions to smaller, cost-efficient models like Claude Haiku 4.5 and hard/unusual questions to more capable models like Claude Sonnet 4.5
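
The customer service example can be sketched as a classifier followed by a dispatch table. The names `route`, `fake_classify`, and `HANDLERS` are hypothetical; in a real system the classifier would be an LLM call or a traditional classification model, and each handler would be its own specialized prompt or downstream process.

```python
from typing import Callable, Dict

Handler = Callable[[str], str]


def fake_classify(query: str) -> str:
    """Stand-in classifier; a real one would ask a model for the category label."""
    text = query.lower()
    if "refund" in text:
        return "refund"
    if "error" in text or "crash" in text:
        return "technical"
    return "general"


HANDLERS: Dict[str, Handler] = {
    "general": lambda q: f"[general] {q}",
    "refund": lambda q: f"[refund] {q}",
    "technical": lambda q: f"[technical] {q}",
}


def route(
    query: str,
    classify: Callable[[str], str],
    handlers: Dict[str, Handler],
    fallback: str = "general",
) -> str:
    """Classify the query, then hand it to the matching specialized handler."""
    label = classify(query).strip().lower()
    handler = handlers.get(label, handlers[fallback])  # unknown labels use the fallback
    return handler(query)


answer = route("I want a refund", fake_classify, HANDLERS)
```

The fallback route is a design choice in this sketch: it keeps an unexpected label from the classifier from crashing the pipeline.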

Workflow: Parallelization

Several LLM calls can work on a task simultaneously, with their outputs aggregated programmatically. Two key variations:

Sectioning: Breaking a task into independent subtasks run in parallel
Voting: Running the same task multiple times for diverse outputs

When to use: Effective when subtasks can be parallelized for speed, or when multiple perspectives are needed for higher confidence results. LLMs generally perform better when each consideration is handled by a separate call.

Examples (Sectioning):

Implementing guardrails where one model instance processes queries while another screens for inappropriate content
Automating evals where each LLM call evaluates a different aspect of model performance

Examples (Voting):

Reviewing code for vulnerabilities using several different prompts
Evaluating content appropriateness with multiple prompts requiring different vote thresholds
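
Both variations can be sketched with a thread pool. This is an illustration under stated assumptions: `run_sections`, `vote`, and `REVIEWERS` are hypothetical names, and each reviewer is a simple string check standing in for an LLM call with a different prompt.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Sequence


def run_sections(text: str, tasks: Dict[str, Callable[[str], str]]) -> Dict[str, str]:
    """Sectioning: run independent subtasks on the same input in parallel."""
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(task, text) for name, task in tasks.items()}
        return {name: future.result() for name, future in futures.items()}


def vote(text: str, reviewers: Sequence[Callable[[str], bool]], threshold: int) -> bool:
    """Voting: run several reviewers in parallel and flag if enough of them agree."""
    with ThreadPoolExecutor() as pool:
        flags = list(pool.map(lambda review: review(text), reviewers))
    return sum(flags) >= threshold


# Stand-ins for LLM reviewers, each looking for a different kind of problem.
REVIEWERS = [
    lambda code: "eval(" in code,
    lambda code: "os.system(" in code,
    lambda code: "pickle.loads(" in code,
]

flagged = vote("eval(user_input)", REVIEWERS, threshold=1)
```

Raising `threshold` trades false positives for false negatives, which is how different vote thresholds can be tuned per use case.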

Workflow: Orchestrator-Workers

A central LLM dynamically breaks down tasks, delegates them to worker LLMs, and synthesizes their results.

When to use: Well-suited for complex tasks where subtasks can't be predicted in advance. The key difference from parallelization is flexibility — subtasks aren't pre-defined but determined by the orchestrator based on specific input.

Examples:

Coding products making complex changes to multiple files
Search tasks gathering and analyzing information from multiple sources
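
A minimal sketch of the pattern, assuming three hypothetical callables: `plan` (the orchestrator deciding subtasks), `worker` (one call per subtask), and `synthesize` (the orchestrator combining results). The `fake_*` functions are plain-Python stand-ins for LLM calls.

```python
from typing import Callable, List


def orchestrate(
    task: str,
    plan: Callable[[str], List[str]],
    worker: Callable[[str], str],
    synthesize: Callable[[str, List[str]], str],
) -> str:
    """The subtask list is produced at run time from the input, not hardcoded."""
    subtasks = plan(task)
    results = [worker(subtask) for subtask in subtasks]
    return synthesize(task, results)


def fake_plan(task: str) -> List[str]:
    """Stand-in orchestrator: split the request into one subtask per clause."""
    return [part.strip() for part in task.split(" and ")]


def fake_worker(subtask: str) -> str:
    """Stand-in worker: pretend the subtask was carried out."""
    return f"done: {subtask}"


def fake_synthesize(task: str, results: List[str]) -> str:
    """Stand-in synthesis step: join the worker outputs into one report."""
    return "; ".join(results)


report = orchestrate(
    "update parser.py and update tests.py", fake_plan, fake_worker, fake_synthesize
)
```

The workers run sequentially here for clarity; they could also be dispatched in parallel once the orchestrator has produced the subtask list.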

Workflow: Evaluator-Optimizer

One LLM call generates a response while another provides evaluation and feedback in a loop.

When to use: Particularly effective when clear evaluation criteria exist and iterative refinement provides measurable value. Two signs of good fit: LLM responses demonstrably improve with articulated feedback, and the LLM can provide such feedback. This is analogous to the iterative writing process a human writer goes through.

Examples:

Literary translation with nuances an evaluator LLM can critique
Complex search tasks requiring multiple rounds of searching and analysis
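
The loop can be sketched as a generator and an evaluator that exchange feedback until the evaluator accepts the draft or a round limit is reached. The names `refine`, `fake_generate`, and `fake_evaluate` are hypothetical, and the length check is only a stand-in for an evaluator LLM applying real criteria.

```python
from typing import Callable, Tuple


def refine(
    task: str,
    generate: Callable[[str, str], str],
    evaluate: Callable[[str], Tuple[bool, str]],
    max_rounds: int = 3,
) -> str:
    """Generate, evaluate, and feed the critique back in until accepted."""
    feedback = ""
    draft = ""
    for _ in range(max_rounds):
        draft = generate(task, feedback)
        accepted, feedback = evaluate(draft)
        if accepted:
            break
    return draft


def fake_generate(task: str, feedback: str) -> str:
    """Stand-in generator: shortens its output when asked to."""
    return task[:10] if "shorter" in feedback else task


def fake_evaluate(draft: str) -> Tuple[bool, str]:
    """Stand-in evaluator: accepts drafts of at most 10 characters."""
    if len(draft) > 10:
        return False, "make it shorter"
    return True, ""


final = refine("a long sentence to trim", fake_generate, fake_evaluate)
```

The `max_rounds` cap is an assumption of this sketch; it keeps the loop bounded when the evaluator never accepts a draft.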

Agents

Agents are emerging in production as LLMs mature in key capabilities: understanding complex inputs, reasoning and planning, using tools reliably, and recovering from errors. An agent begins its work with either a command from, or an interactive discussion with, a human user. Once the task is clear, it plans and operates independently, potentially returning to the human for further information or judgment. During execution, it is crucial for the agent to gain "ground truth" from the environment at each step so it can assess its progress. Agents can pause for human feedback at checkpoints or when encountering blockers, and stopping conditions (such as a maximum number of iterations) help maintain control.

Agent implementation is often straightforward — "typically just LLMs using tools based on environmental feedback in a loop." Designing toolsets and their documentation clearly is crucial.

When to use: For open-ended problems where required steps are difficult or impossible to predict, and where a fixed path can't be hardcoded. The LLM may operate for many turns, requiring some level of trust in its decision-making. Autonomy makes agents ideal for scaling tasks in trusted environments.

The autonomous nature means higher costs and potential for compounding errors. Extensive testing in sandboxed environments with appropriate guardrails is recommended.

Examples:

A coding agent to resolve SWE-bench tasks, involving edits to many files based on a task description
The "computer use" reference implementation, where Claude uses a computer to accomplish tasks
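
The "tools based on environmental feedback in a loop" description can be sketched as follows. This is an illustration, not the authors' implementation: `run_agent`, `make_tools`, and `scripted_policy` are hypothetical names, and `scripted_policy` stands in for the model choosing its next tool call.

```python
from typing import Callable, Dict, List, Tuple

Action = Dict[str, str]            # assumption: {"tool": name, "input": argument}
History = List[Tuple[Action, str]]  # each entry pairs an action with its observation


def run_agent(
    goal: str,
    choose_action: Callable[[str, History], Action],
    tools: Dict[str, Callable[[str], str]],
    max_iterations: int = 10,
) -> str:
    """Loop: pick an action, run the tool, record the observation, repeat."""
    history: History = []
    for _ in range(max_iterations):
        action = choose_action(goal, history)
        if action["tool"] == "finish":
            return action["input"]
        observation = tools[action["tool"]](action["input"])  # feedback from the environment
        history.append((action, observation))
    raise RuntimeError("Stopping condition hit: max iterations reached.")


def make_tools() -> Dict[str, Callable[[str], str]]:
    """Toy environment: tests fail until a fix has been applied."""
    state = {"fixed": False}

    def run_tests(_: str) -> str:
        return "pass" if state["fixed"] else "fail"

    def apply_fix(_: str) -> str:
        state["fixed"] = True
        return "patched"

    return {"run_tests": run_tests, "apply_fix": apply_fix}


def scripted_policy(goal: str, history: History) -> Action:
    """Stand-in for the model: decide the next step from the last observation."""
    if not history or history[-1][1] == "patched":
        return {"tool": "run_tests", "input": ""}
    if history[-1][1] == "fail":
        return {"tool": "apply_fix", "input": goal}
    return {"tool": "finish", "input": "tests pass"}


outcome = run_agent("fix bug", scripted_policy, make_tools())
```

The `max_iterations` limit plays the role of the stopping condition mentioned above; a checkpoint for human feedback could be added as another tool or as a check inside the loop.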

Combining and Customizing These Patterns

These building blocks aren't prescriptive — they're common patterns developers can shape and combine. The key to success is measuring performance and iterating. Complexity should be added "only when it demonstrably improves outcomes."

Summary

Success isn't about building the most sophisticated system but the right system for your needs. Start with simple prompts, optimize with comprehensive evaluation, and add multi-step agentic systems only when simpler solutions fall short.

Three core principles for implementing agents:

Maintain simplicity in agent design
Prioritize transparency by explicitly showing the agent's planning steps
Carefully craft the agent-computer interface (ACI) through thorough tool documentation and testing

Frameworks can help get started quickly, but don't hesitate to reduce abstraction layers and build with basic components for production.

Appendix 1: Agents in Practice

Two particularly promising applications demonstrate practical value:

A. Customer Support

Customer support combines familiar chatbot interfaces with enhanced capabilities through tool integration, making it a natural fit for more open-ended agents because:

Support interactions naturally follow conversation flow while requiring access to external information and actions
Tools can pull customer data, order history, and knowledge base articles
Actions like issuing refunds or updating tickets can be handled programmatically
Success can be clearly measured through user-defined resolutions

Several companies have demonstrated viability through usage-based pricing models charging only for successful resolutions.

B. Coding Agents

Software development has shown remarkable potential for LLM features, with capabilities evolving from code completion to autonomous problem-solving. Agents are particularly effective because:

Code solutions are verifiable through automated tests
Agents can iterate on solutions using test results as feedback
The problem space is well-defined and structured
Output quality can be measured objectively

Agents can now solve real GitHub issues in the SWE-bench Verified benchmark based on pull request descriptions alone. However, human review remains crucial for ensuring solutions align with broader system requirements.

Appendix 2: Prompt Engineering Your Tools

Tools enable Claude to interact with external services and APIs by specifying their exact structure and definition. Tool definitions and specifications should receive just as much prompt engineering attention as overall prompts.

Several approaches exist for specifying the same action (e.g., diffs vs. rewriting entire files, code in markdown vs. JSON). Some formats are much more difficult for an LLM to write than others. For example, writing a diff requires knowing how many lines are changing before the new code is written, because that count goes in the chunk header.

Suggestions for deciding on tool formats:

Give the model enough tokens to "think" before it writes itself into a corner
Keep the format close to what the model has naturally seen in text on the internet
Avoid formatting overhead such as keeping accurate counts of thousands of lines of code or string-escaping code

One rule of thumb: think about how much effort goes into human-computer interfaces (HCI), and plan to invest equally in creating good agent-computer interfaces (ACI).

Additional guidance:

Put yourself in the model's shoes — is it obvious how to use the tool from the description and parameters?
Consider how parameter names or descriptions could be made clearer, like writing a great docstring for a junior developer
Test how the model uses tools by running many example inputs and iterating on mistakes
Poka-yoke your tools — change arguments so mistakes are harder to make

For the SWE-bench agent, the team "spent more time optimizing our tools than the overall prompt." For example, the model made mistakes with tools using relative filepaths after moving out of the root directory. The fix was requiring absolute filepaths, after which the model used the method flawlessly.
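
The absolute-filepath fix is an instance of the poka-yoke advice above, and it can be sketched as a tool that rejects relative paths outright. The function name `read_file_tool` and the wording of its docstring are hypothetical; the source does not show the team's actual tool definition.

```python
from pathlib import Path


def read_file_tool(path: str) -> str:
    """Read a text file and return its contents.

    path: ABSOLUTE path to the file, for example /repo/src/main.py.
          Relative paths are rejected, because their meaning changes
          whenever the working directory changes.
    """
    file_path = Path(path)
    if not file_path.is_absolute():
        # Fail loudly with a message the model can act on.
        raise ValueError(f"path must be absolute, got {path!r}")
    return file_path.read_text()
```

The docstring doubles as the tool description, so the constraint is stated where the model reads it and enforced where the mistake would otherwise happen.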
