Writing Effective Tools for Agents — with Agents
Published: Sep 11, 2025
Source: Anthropic Engineering Blog
The article discusses how to build high-quality tools for LLM agents, emphasizing that agents are only as effective as their tools. The Model Context Protocol (MCP) can give agents potentially hundreds of tools, and the post covers techniques for maximizing their effectiveness.
Core Themes
The post outlines how to:
- Build and test tool prototypes
- Create and run comprehensive evaluations using agents
- Use Claude to automatically improve tool performance
What Is a Tool?
The article distinguishes deterministic systems (same output every time) from non-deterministic agents, which "can generate varied responses even with the same starting conditions." Tools represent "a new kind of software which reflects a contract between deterministic systems and non-deterministic agents." The key insight is designing tools for agents rather than for traditional developers or systems. Tools most "ergonomic" for agents are also "surprisingly intuitive to grasp as humans."
How to Write Tools
Building a Prototype
The recommendation is to stand up a quick prototype and connect it via a local MCP server or Desktop extension (DXT). Testing tools yourself helps identify rough edges. LLM-friendly documentation in llms.txt files can help when using Claude Code.
Running an Evaluation
Generating evaluation tasks: Claude Code can create prompt-response pairs based on real-world uses. The article warns against "overly simplistic or superficial 'sandbox' environments that don't stress-test your tools with sufficient complexity." Strong tasks may require multiple (even dozens of) tool calls.
Strong task examples include multi-step scenarios like scheduling meetings with attachments, investigating customer billing issues across log entries, or preparing retention offers requiring synthesis of customer data.
Weak tasks are simple single-step lookups that don't test real complexity.
Each prompt should pair with a verifiable response. Verifiers range from exact string comparison to Claude judging responses. The article advises against "overly strict verifiers that reject correct responses due to spurious differences."
Running the evaluation: Programmatic runs with direct LLM API calls using simple agentic loops are recommended. The article suggests instructing agents to output reasoning and feedback blocks before tool call and response blocks, as this "may increase LLMs' effective intelligence by triggering chain-of-thought (CoT) behaviors."
When using Claude, interleaved thinking can be enabled for similar functionality. Metrics to collect include runtime, number of tool calls, token consumption, and tool errors.
Analyzing results: Agents can help spot issues, "but keep in mind that what agents omit in their feedback and responses can often be more important than what they include." The article notes that "LLMs don't always say what they mean." Reading raw transcripts and analyzing tool calling metrics is essential.
Collaborating with Agents
Evaluation transcripts can be concatenated and given to Claude Code to analyze and improve tools automatically. Most advice in the post came from "repeatedly optimizing our internal tool implementations with Claude Code." Held-out test sets prevented overfitting.
Principles for Writing Effective Tools
Choosing the Right Tools
"More tools don't always lead to better outcomes." A common error is tools that merely wrap existing API endpoints regardless of agent appropriateness. Agents have "distinct 'affordances' to traditional software."
The address book analogy illustrates this: returning ALL contacts wastes limited context, while a search_contacts tool is far better. The recommendation is to build "a few thoughtful tools targeting specific high-impact workflows."
Tools can consolidate functionality. Examples given:
- A
schedule_eventtool instead of separatelist_users,list_events, andcreate_eventtools - A
search_logstool instead ofread_logs - A
get_customer_contexttool instead of separate customer/transaction/notes tools
"Too many tools or overlapping tools can also distract agents from pursuing efficient strategies."
Namespacing Tools
Agents may have access to "dozens of MCP servers and hundreds of different tools." Namespacing helps delineate boundaries. Grouping by service (e.g., asana_search, jira_search) and by resource (e.g., asana_projects_search) helps agents select the right tool.
The choice between prefix- and suffix-based namespacing has "non-trivial effects" on tool-use evaluations, varying by LLM.
Returning Meaningful Context
Tool implementations should "return only high signal information back to agents." They should "prioritize contextual relevance over flexibility, and eschew low-level technical identifiers." Fields like name, image_url, and file_type are more useful than uuid, 256px_image_url, mime_type.
Resolving "arbitrary alphanumeric UUIDs to more semantically meaningful and interpretable language" significantly improves precision by reducing hallucinations.
A response_format enum parameter can let agents control verbosity. The article provides an example enum with DETAILED and CONCISE options. In the Slack example, concise responses used "approximately one-third of the tokens" compared to detailed ones. Detailed responses include IDs needed for downstream tool calls.
Response structure format (XML, JSON, Markdown) also impacts performance, with no one-size-fits-all solution, since "LLMs are trained on next-token prediction and tend to perform better with formats that match their training data."
Optimizing Tool Responses for Token Efficiency
The article recommends implementing "pagination, range selection, filtering, and/or truncation with sensible default parameter values." For Claude Code, tool responses are restricted to 25,000 tokens by default.
When truncating, include helpful instructions steering agents toward token-efficient strategies. Error responses should "clearly communicate specific and actionable improvements, rather than opaque error codes or tracebacks."
Prompt-Engineering Tool Descriptions
This is described as "one of the most effective methods for improving tools." Descriptions are loaded into agents' context and collectively steer behavior. The advice is to think of "how you would describe your tool to a new hire" and make implicit context explicit.
Input parameters should be "unambiguously named: instead of a parameter named user, try a parameter named user_id."
The article notes that "Claude Sonnet 3.5 achieved state-of-the-art performance on the SWE-bench Verified evaluation" after precise refinements to tool descriptions that "dramatically reduced error rates and improving task completion."
Looking Ahead
The article concludes that we need to re-orient software development "from predictable, deterministic patterns to non-deterministic ones." Effective tools are "intentionally and clearly defined, use agent context judiciously, can be combined together in diverse workflows." As agents become more capable, "the tools they use will evolve alongside them."
Acknowledgements
Written by Ken Aizawa with contributions from colleagues across Research, MCP, Product Engineering, Marketing, Design, and Applied AI.