Building Effective Agents
Published: December 19, 2024
Authors: Erik S. and Barry Zhang
Source: Anthropic Engineering Blog
Overview
The authors have worked with dozens of teams building LLM agents across industries. Their key finding: "the most successful implementations use simple, composable patterns rather than complex frameworks."
What Are Agents?
"Agent" has multiple definitions. Some see agents as fully autonomous systems; others describe more prescriptive implementations following predefined workflows. Anthropic categorizes all variations as agentic systems, drawing a distinction between two types:
- Workflows: systems where LLMs and tools are orchestrated through predefined code paths
- Agents: systems where LLMs dynamically direct their own processes and tool usage, maintaining control over how they accomplish tasks
When (and When Not) to Use Agents
The recommendation is to find the simplest solution possible and only increase complexity when needed. Agentic systems often trade latency and cost for better task performance. Workflows offer predictability for well-defined tasks, while agents are better when flexibility and model-driven decision-making are needed at scale. For many applications, "optimizing single LLM calls with retrieval and in-context examples is usually enough."
When and How to Use Frameworks
Several frameworks are mentioned:
- The Claude Agent SDK
- Strands Agents SDK by AWS
- Rivet (drag and drop GUI LLM workflow builder)
- Vellum (GUI tool for building and testing complex workflows)
These frameworks simplify low-level tasks like calling LLMs, defining and parsing tools, and chaining calls. However, "they often create extra layers of abstraction that can obscure the underlying prompts and responses, making them harder to debug." The suggestion is to start by using LLM APIs directly, since many patterns can be implemented in just a few lines of code.
Building Blocks, Workflows, and Agents
Building Block: The Augmented LLM
The basic building block is an LLM enhanced with augmentations such as retrieval, tools, and memory. Current models can actively use these capabilities — generating search queries, selecting tools, and determining what information to retain. The recommendation focuses on tailoring capabilities to specific use cases and ensuring an easy, well-documented interface for the LLM. The Model Context Protocol is one approach for integrating third-party tools.
Workflow: Prompt Chaining
Prompt chaining decomposes a task into a sequence of steps, where each LLM call processes the output of the previous one. Programmatic checks ("gates") can be added on intermediate steps.
When to use: Ideal when a task can be cleanly decomposed into fixed subtasks, trading off latency for higher accuracy by making each LLM call an easier task.
Examples:
- Generating marketing copy, then translating it into a different language
- Writing an outline, checking it meets criteria, then writing the document based on it
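The outline example above can be sketched as a two-step chain with a gate in between. This is a minimal illustration, not the authors' implementation: `LLM` is assumed to be any function that maps a prompt string to a reply string, the `outline_gate` rule (at least three bullets) is hypothetical, and `fake_llm` is a stand-in so the sketch runs without a model.

```python
from typing import Callable

# Assumption: an "LLM" is any callable that takes a prompt and returns text.
LLM = Callable[[str], str]


def outline_gate(outline: str, min_sections: int = 3) -> bool:
    """Programmatic gate (hypothetical rule): require enough bullet lines."""
    bullets = [ln for ln in outline.splitlines() if ln.strip().startswith("-")]
    return len(bullets) >= min_sections


def write_document(topic: str, llm: LLM) -> str:
    # Step 1: produce an outline.
    outline = llm(f"Write a bulleted outline for a document about: {topic}")
    # Gate: stop early instead of drafting from a weak outline.
    if not outline_gate(outline):
        raise ValueError("Outline failed the gate; stopping before drafting")
    # Step 2: the second call consumes the first call's output.
    return llm(f"Write the document following this outline:\n{outline}")


def fake_llm(prompt: str) -> str:
    """Stand-in model used only to make the sketch runnable."""
    if prompt.startswith("Write a bulleted outline"):
        return "- Intro\n- Body\n- Conclusion"
    return "DRAFT covering:\n" + prompt.split("\n", 1)[1]


if __name__ == "__main__":
    print(write_document("agent design", fake_llm))
```

Swapping `fake_llm` for a real client call keeps the chain unchanged, which is the point of keeping each step a plain function.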
Workflow: Routing
Routing classifies an input and directs it to a specialized followup task, enabling separation of concerns and more specialized prompts.
When to use: Works well for complex tasks with distinct categories better handled separately, where classification can be handled accurately.
Examples:
- Directing different customer service queries (general questions, refund requests, technical support) into different downstream processes
- Routing easy/common questions to smaller, cost-efficient models like Claude Haiku 4.5 and hard/unusual questions to more capable models like Claude Sonnet 4.5
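The customer service example above might be sketched as a classifier call followed by a dispatch to a specialized prompt. The route names, the prompts, and both stub functions are assumptions made for illustration; `LLM` is again any prompt-to-text callable.

```python
from typing import Callable

LLM = Callable[[str], str]  # assumption: prompt in, text out

# Hypothetical specialized system prompts, one per category.
ROUTES = {
    "general": "You answer general product questions.",
    "refund": "You handle refund requests. Ask for the order number.",
    "technical": "You are a technical support specialist.",
}


def route(query: str, classifier: LLM, worker: LLM,
          default: str = "general") -> tuple[str, str]:
    """Classify the query, then answer it with the matching prompt."""
    label = classifier(
        "Classify the query as one of: " + ", ".join(ROUTES)
        + f"\nQuery: {query}\nLabel:"
    ).strip().lower()
    if label not in ROUTES:  # guard against labels outside the known set
        label = default
    return label, worker(f"{ROUTES[label]}\n\nCustomer: {query}")


def stub_classifier(prompt: str) -> str:
    """Keyword stand-in for a classification call."""
    query = prompt.split("Query: ", 1)[1].lower()
    if "refund" in query:
        return "refund"
    if "error" in query or "crash" in query:
        return "technical"
    return "general"


def stub_worker(prompt: str) -> str:
    """Stand-in that echoes which specialized prompt it received."""
    return "handled with: " + prompt.splitlines()[0]


if __name__ == "__main__":
    print(route("I want a refund for order 12", stub_classifier, stub_worker))
```

The same shape covers the model-size example: the dispatch step would pick a different `worker` per label instead of a different prompt.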
Workflow: Parallelization
In parallelization, multiple LLM calls work on a task simultaneously and their outputs are aggregated programmatically. There are two key variations:
- Sectioning: Breaking a task into independent subtasks run in parallel
- Voting: Running the same task multiple times for diverse outputs
When to use: Effective when subtasks can be parallelized for speed, or when multiple perspectives are needed for higher confidence results. LLMs generally perform better when each consideration is handled by a separate call.
Examples (Sectioning):
- Implementing guardrails where one model instance processes queries while another screens for inappropriate content
- Automating evals where each LLM call evaluates a different aspect of model performance
Examples (Voting):
- Reviewing code for vulnerabilities using several different prompts
- Evaluating whether content is inappropriate, with multiple prompts evaluating different aspects or requiring different vote thresholds to balance false positives and negatives
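Both variations can be sketched with a thread pool from the standard library. This is an illustrative sketch under assumptions: `LLM` is any prompt-to-text callable, the yes/no answer convention and the `threshold` parameter are hypothetical, and `stub_llm` replaces a real model.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

LLM = Callable[[str], str]  # assumption: prompt in, text out


def run_sections(prompts: list[str], llm: LLM) -> list[str]:
    """Sectioning: run independent prompts concurrently, keep input order."""
    with ThreadPoolExecutor(max_workers=max(1, len(prompts))) as pool:
        return list(pool.map(llm, prompts))


def vote(prompts: list[str], llm: LLM, threshold: int) -> bool:
    """Voting: flag the input if at least `threshold` calls answer 'yes'."""
    answers = run_sections(prompts, llm)
    yes_votes = sum(1 for a in answers if a.strip().lower() == "yes")
    return yes_votes >= threshold


def stub_llm(prompt: str) -> str:
    """Stand-in reviewer: says 'yes' only for prompts about hardcoded values."""
    return "yes" if "hardcoded" in prompt else "no"


REVIEW_PROMPTS = [
    "Does this code contain hardcoded secrets? Answer yes or no.",
    "Does this code allow SQL injection? Answer yes or no.",
    "Does this code contain hardcoded paths? Answer yes or no.",
]

if __name__ == "__main__":
    print(run_sections(REVIEW_PROMPTS, stub_llm))
    print(vote(REVIEW_PROMPTS, stub_llm, threshold=2))
```

Raising or lowering `threshold` is one way to trade false positives against false negatives.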
Workflow: Orchestrator-Workers
A central LLM dynamically breaks down tasks, delegates them to worker LLMs, and synthesizes their results.
When to use: Well-suited for complex tasks where subtasks can't be predicted in advance. The key difference from parallelization is flexibility — subtasks aren't pre-defined but determined by the orchestrator based on specific input.
Examples:
- Coding products making complex changes to multiple files
- Search tasks gathering and analyzing information from multiple sources
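A rough sketch of the pattern follows. Unlike the parallelization case, the subtask list is produced at run time by the orchestrator call. Everything here is assumed for illustration: the JSON-list reply format, the prompt wording, and the stub functions. Workers run sequentially to keep the sketch short.

```python
import json
from typing import Callable

LLM = Callable[[str], str]  # assumption: prompt in, text out


def orchestrate(task: str, orchestrator: LLM, worker: LLM) -> str:
    # 1. The orchestrator decides the subtasks for this specific input.
    plan = orchestrator(
        "Break this task into subtasks. Reply with a JSON list of strings.\n"
        f"Task: {task}"
    )
    try:
        subtasks = json.loads(plan)
    except json.JSONDecodeError:
        subtasks = None
    if not isinstance(subtasks, list) or not subtasks:
        subtasks = [task]  # fall back to treating the task as one unit
    # 2. Each subtask is delegated to a worker call.
    results = [worker(f"Complete this subtask: {s}") for s in subtasks]
    # 3. The orchestrator synthesizes the worker outputs.
    report = "\n".join(f"{s}: {r}" for s, r in zip(subtasks, results))
    return orchestrator("Synthesize a final answer from these results.\n" + report)


def stub_orchestrator(prompt: str) -> str:
    """Stand-in: returns a fixed plan, then echoes the report as a summary."""
    if prompt.startswith("Break this task"):
        return json.dumps(["edit api.py", "edit test_api.py"])
    return "SUMMARY\n" + prompt.split("\n", 1)[1]


def stub_worker(prompt: str) -> str:
    return "done"


if __name__ == "__main__":
    print(orchestrate("rename the fetch function", stub_orchestrator, stub_worker))
```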
Workflow: Evaluator-Optimizer
One LLM call generates a response while another provides evaluation and feedback in a loop.
When to use: Particularly effective when clear evaluation criteria exist and iterative refinement provides measurable value. Two signs of good fit: LLM responses demonstrably improve with articulated feedback, and the LLM can provide such feedback. This is analogous to the iterative writing process a human writer goes through.
Examples:
- Literary translation with nuances an evaluator LLM can critique
- Complex search tasks requiring multiple rounds of searching and analysis
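The generate-evaluate loop could look roughly like this. It is a sketch under assumptions: the evaluator is asked to reply `PASS` or give feedback (a made-up convention), `max_rounds` is a hypothetical cap on iterations, and the two stubs stand in for model calls.

```python
from typing import Callable

LLM = Callable[[str], str]  # assumption: prompt in, text out


def refine(task: str, generator: LLM, evaluator: LLM, max_rounds: int = 3) -> str:
    """Generate a draft, then revise it until the evaluator replies PASS."""
    draft = generator(f"Task: {task}")
    for _ in range(max_rounds):
        feedback = evaluator(
            f"Task: {task}\nDraft: {draft}\nReply PASS or give feedback."
        )
        if feedback.strip().upper().startswith("PASS"):
            break
        # Feed the critique back into the next generation call.
        draft = generator(
            f"Task: {task}\nPrevious draft: {draft}\nFeedback: {feedback}"
        )
    return draft


def stub_generator(prompt: str) -> str:
    """Stand-in: improves its output only once it has seen feedback."""
    return "Dear client, good morning." if "Feedback:" in prompt else "hey"


def stub_evaluator(prompt: str) -> str:
    """Stand-in: accepts drafts that open with a formal salutation."""
    draft = next(ln for ln in prompt.splitlines() if ln.startswith("Draft: "))
    return "PASS" if "Dear" in draft else "Use a formal salutation."


if __name__ == "__main__":
    print(refine("greet the client", stub_generator, stub_evaluator))
```

The cap matters: without `max_rounds`, an evaluator that never passes would loop indefinitely.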
Agents
Agents are emerging in production as LLMs mature in key capabilities: understanding complex inputs, reasoning and planning, using tools reliably, and recovering from errors. Agents begin with a command from or interactive discussion with a human user. Once the task is clear, they plan and operate independently, potentially returning to the human for information or judgment. During execution, gaining "ground truth" from the environment at each step is crucial. Agents can pause for human feedback at checkpoints or when encountering blockers, and stopping conditions (such as max iterations) help maintain control.
Agent implementation is often straightforward — "typically just LLMs using tools based on environmental feedback in a loop." Designing toolsets and their documentation clearly is crucial.
When to use: For open-ended problems where required steps are difficult or impossible to predict, and where a fixed path can't be hardcoded. The LLM may operate for many turns, requiring some level of trust in its decision-making. Autonomy makes agents ideal for scaling tasks in trusted environments.
The autonomous nature means higher costs and potential for compounding errors. Extensive testing in sandboxed environments with appropriate guardrails is recommended.
Examples:
- A coding agent to resolve SWE-bench tasks, involving edits to many files based on a task description
- The "computer use" reference implementation, where Claude uses a computer to accomplish tasks
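The "LLMs using tools based on environmental feedback in a loop" description can be sketched in a few lines. This is an illustration only: the JSON action format (`{"tool": ..., "input": ...}` or `{"final": ...}`), the `word_count` tool, and `scripted_llm` are all assumptions, not the format of any particular API.

```python
import json
from typing import Callable

LLM = Callable[[str], str]   # assumption: prompt in, text out
Tool = Callable[[str], str]  # assumption: tools take and return strings


def run_agent(task: str, llm: LLM, tools: dict[str, Tool],
              max_iterations: int = 5) -> str:
    transcript = [f"Task: {task}"]
    for _ in range(max_iterations):  # stopping condition keeps control
        reply = llm("\n".join(transcript))
        try:
            action = json.loads(reply)
        except json.JSONDecodeError:
            action = None
        if not isinstance(action, dict):
            transcript.append("Observation: reply must be a JSON object")
            continue
        if "final" in action:
            return str(action["final"])
        name = action.get("tool")
        tool = tools.get(name) if isinstance(name, str) else None
        # The tool result is the environment feedback for the next step.
        observation = (tool(str(action.get("input", "")))
                       if tool else f"unknown tool {name!r}")
        transcript.append(f"Action: {reply}")
        transcript.append(f"Observation: {observation}")
    return "Stopped: iteration limit reached"


TOOLS: dict[str, Tool] = {"word_count": lambda text: str(len(text.split()))}


def scripted_llm(prompt: str) -> str:
    """Stand-in model: calls the tool once, then answers from the result."""
    last = prompt.splitlines()[-1]
    if last.startswith("Observation: "):
        return json.dumps({"final": "Word count is " + last[len("Observation: "):]})
    return json.dumps({"tool": "word_count", "input": "a b c"})


if __name__ == "__main__":
    print(run_agent("count the words in 'a b c'", scripted_llm, TOOLS))
```

A production loop would add the human checkpoints and sandboxing discussed above; this sketch shows only the loop and the iteration cap.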
Combining and Customizing These Patterns
These building blocks aren't prescriptive — they're common patterns developers can shape and combine. The key to success is measuring performance and iterating. Complexity should be added "only when it demonstrably improves outcomes."
Summary
Success isn't about building the most sophisticated system but the right system for your needs. Start with simple prompts, optimize with comprehensive evaluation, and add multi-step agentic systems only when simpler solutions fall short.
Three core principles for implementing agents:
- Maintain simplicity in agent design
- Prioritize transparency by explicitly showing the agent's planning steps
- Carefully craft the agent-computer interface (ACI) through thorough tool documentation and testing
Frameworks can help get started quickly, but don't hesitate to reduce abstraction layers and build with basic components for production.
Appendix 1: Agents in Practice
Two particularly promising applications demonstrate practical value:
A. Customer Support
Customer support combines familiar chatbot interfaces with enhanced capabilities through tool integration, making it a natural fit for more open-ended agents because:
- Support interactions naturally follow conversation flow while requiring access to external information and actions
- Tools can pull customer data, order history, and knowledge base articles
- Actions like issuing refunds or updating tickets can be handled programmatically
- Success can be clearly measured through user-defined resolutions
Several companies have demonstrated viability through usage-based pricing models charging only for successful resolutions.
B. Coding Agents
Software development has shown remarkable potential for LLM features, with capabilities evolving from code completion to autonomous problem-solving. Agents are particularly effective because:
- Code solutions are verifiable through automated tests
- Agents can iterate on solutions using test results as feedback
- The problem space is well-defined and structured
- Output quality can be measured objectively
Agents can now solve real GitHub issues in the SWE-bench Verified benchmark based on pull request descriptions alone. However, human review remains crucial for ensuring solutions align with broader system requirements.
Appendix 2: Prompt Engineering Your Tools
Tools enable Claude to interact with external services and APIs by specifying their exact structure and definition. Tool definitions and specifications should receive just as much prompt engineering attention as overall prompts.
Several approaches exist for specifying the same action (e.g., diffs vs. rewriting entire files, code in markdown vs. JSON). Some formats are much more difficult for an LLM to write than others — for example, writing a diff requires knowing how many lines are changing in the chunk header before writing new code.
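To make the diff point concrete, the following sketch uses Python's standard `difflib` to produce a unified diff for a made-up three-line file. The file names and contents are invented for illustration. The hunk header carries the line counts for both versions, and it appears before any of the changed lines.

```python
import difflib

# Hypothetical before/after versions of a small file.
old = [
    "def greet(name):",
    "    print('hi', name)",
    "    return None",
]
new = [
    "def greet(name):",
    "    print('hello', name)",
    "    print('welcome')",
    "    return None",
]

diff = list(difflib.unified_diff(
    old, new, fromfile="a/greet.py", tofile="b/greet.py", lineterm=""
))

# The first line starting with "@@" is the hunk header.
hunk_header = next(line for line in diff if line.startswith("@@"))

if __name__ == "__main__":
    print("\n".join(diff))
    print("header:", hunk_header)  # header: @@ -1,3 +1,4 @@
```

A model emitting this format token by token has to commit to `3` and `4` before writing the lines those numbers count, which is the difficulty the paragraph above describes.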
Suggestions for deciding on tool formats:
- Give the model enough tokens to "think" before it writes itself into a corner
- Keep the format close to what the model has naturally seen in text on the internet
- Avoid formatting overhead such as keeping accurate counts of thousands of lines of code or string-escaping code
One rule of thumb: think about how much effort goes into human-computer interfaces (HCI), and plan to invest equally in creating good agent-computer interfaces (ACI).
Additional guidance:
- Put yourself in the model's shoes — is it obvious how to use the tool from the description and parameters?
- Consider how parameter names or descriptions could be made clearer, like writing a great docstring for a junior developer
- Test how the model uses tools by running many example inputs and iterating on mistakes
- Poka-yoke your tools — change arguments so mistakes are harder to make
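As an illustration of the guidance above, compare two definitions of the same hypothetical ticket-search tool. The `name` / `description` / `input_schema` layout follows the JSON Schema style used for tool definitions in Anthropic's API; the tool itself, its parameters, and the `undocumented_parameters` helper are invented for this sketch.

```python
# A definition that forces the model to guess what each argument means.
VAGUE_TOOL = {
    "name": "search",
    "description": "Searches.",
    "input_schema": {
        "type": "object",
        "properties": {"q": {"type": "string"}, "t": {"type": "string"}},
        "required": ["q"],
    },
}

# The same tool written like a docstring for a junior developer.
CLEAR_TOOL = {
    "name": "search_tickets",
    "description": (
        "Search support tickets by keyword. Returns up to 10 matches, "
        "newest first. Use this before replying to a customer who "
        "mentions an earlier ticket."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "keywords": {
                "type": "string",
                "description": "Words to match in the ticket title or body, "
                               "for example 'login error'.",
            },
            "status": {
                "type": "string",
                # An enum makes an invalid status value impossible to express.
                "enum": ["open", "closed", "any"],
                "description": "Which tickets to include.",
            },
        },
        "required": ["keywords", "status"],
    },
}


def undocumented_parameters(tool: dict) -> list[str]:
    """Hypothetical lint check: list parameters that lack a description."""
    props = tool["input_schema"]["properties"]
    return [name for name, spec in props.items() if not spec.get("description")]


if __name__ == "__main__":
    print(undocumented_parameters(VAGUE_TOOL))
    print(undocumented_parameters(CLEAR_TOOL))
```

A check like this is cheap to run over every tool definition, though it cannot replace testing how the model actually uses the tools.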
For the SWE-bench agent, the team "spent more time optimizing our tools than the overall prompt." For example, the model made mistakes with tools using relative filepaths after moving out of the root directory. The fix was requiring absolute filepaths, after which the model used the method flawlessly.
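The absolute-path fix might look something like the sketch below. It is not the team's actual tool: `read_file_tool` and its error messages are hypothetical, and the example path in the message is made up. The idea is that the tool rejects the error-prone input and tells the model how to correct it.

```python
from pathlib import Path


def read_file_tool(path: str) -> str:
    """Hypothetical file-reading tool that only accepts absolute paths."""
    p = Path(path)
    if not p.is_absolute():
        # Return a corrective message rather than resolving against an
        # unknown working directory.
        return (f"Error: '{path}' is a relative path. "
                "Pass an absolute path such as /repo/src/main.py.")
    if not p.is_file():
        return f"Error: no file at {p}"
    return p.read_text(encoding="utf-8")


if __name__ == "__main__":
    print(read_file_tool("src/main.py"))
```

Returning the error as text, instead of raising, lets the message flow back to the model as an observation it can act on in the next step.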