Writing Effective Tools for Agents — with Agents

Published: Sep 11, 2025

Source: Anthropic Engineering Blog

The article discusses how to build high-quality tools for LLM agents, emphasizing that agents are only as effective as their tools. The Model Context Protocol (MCP) can give agents potentially hundreds of tools, and the post covers techniques for maximizing their effectiveness.

Core Themes

The post outlines how to:

  • Build and test tool prototypes
  • Create and run comprehensive evaluations using agents
  • Use Claude to automatically improve tool performance

What Is a Tool?

The article distinguishes deterministic systems (the same output for the same input, every time) from non-deterministic agents, which "can generate varied responses even with the same starting conditions." Tools represent "a new kind of software which reflects a contract between deterministic systems and non-deterministic agents." The key insight is to design tools for agents rather than for traditional developers or systems. The tools that are most "ergonomic" for agents also turn out to be "surprisingly intuitive to grasp as humans."

How to Write Tools

Building a Prototype

The recommendation is to stand up a quick prototype and connect it via a local MCP server or Desktop extension (DXT). Testing the tools yourself helps identify rough edges. When Claude Code is used to write the tools, giving it LLM-friendly documentation (often published as llms.txt files) for the relevant libraries and APIs can help.

Running an Evaluation

Generating evaluation tasks: Claude Code can create prompt-response pairs based on real-world uses. The article warns against "overly simplistic or superficial 'sandbox' environments that don't stress-test your tools with sufficient complexity." Strong tasks may require multiple (even dozens of) tool calls.

Strong task examples include multi-step scenarios like scheduling meetings with attachments, investigating customer billing issues across log entries, or preparing retention offers requiring synthesis of customer data.

Weak tasks are simple single-step lookups that don't test real complexity.

Each prompt should pair with a verifiable response. Verifiers range from exact string comparison to Claude judging responses. The article advises against "overly strict verifiers that reject correct responses due to spurious differences."
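The difference between a strict and a more forgiving verifier can be sketched as follows. This is a minimal illustration, not the article's implementation: the function names and the normalization rules (case, whitespace, trailing punctuation) are assumptions chosen for the example.

```python
import re


def normalize(text: str) -> str:
    """Collapse whitespace, lowercase, and drop punctuation at either end."""
    collapsed = re.sub(r"\s+", " ", text).strip().lower()
    return collapsed.strip(".!?,;:")


def strict_verifier(expected: str, actual: str) -> bool:
    """Exact string comparison: rejects any formatting difference."""
    return expected == actual


def lenient_verifier(expected: str, actual: str) -> bool:
    """Compare after normalization so spurious differences are ignored."""
    return normalize(expected) == normalize(actual)


expected = "Invoice 1042"
response = " invoice  1042.\n"
strict_ok = strict_verifier(expected, response)    # False: formatting differs
lenient_ok = lenient_verifier(expected, response)  # True: same content
```

A verifier like this still rejects a genuinely different answer (for example "Invoice 1043"); where answers are free-form, a model-based judge may be the better fit.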

Running the evaluation: The article recommends running evaluations programmatically with direct LLM API calls, using simple agentic loops. It also suggests instructing agents to output reasoning and feedback blocks before tool call and response blocks, as this "may increase LLMs' effective intelligence by triggering chain-of-thought (CoT) behaviors."

When using Claude, interleaved thinking can be enabled for similar functionality. Metrics to collect include runtime, number of tool calls, token consumption, and tool errors.
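A simple agentic loop that records some of these metrics might look like the sketch below. Everything here is hypothetical: `agent_step` is a stand-in callable for a real LLM API call, and the tuple protocol it returns is invented for the example. Token consumption is not modelled, since it would normally come from the usage data a real API returns.

```python
import time
from dataclasses import dataclass


@dataclass
class Metrics:
    runtime_s: float = 0.0
    tool_calls: int = 0
    tool_errors: int = 0


def run_task(agent_step, tools, prompt, max_turns=10):
    """Run one evaluation task and return (final_answer, metrics).

    agent_step(transcript) must return ("tool", tool_name, arguments)
    or ("final", answer, None). This protocol is an assumption.
    """
    transcript = [("user", prompt)]
    metrics = Metrics()
    start = time.perf_counter()
    answer = None
    for _ in range(max_turns):
        kind, name_or_answer, args = agent_step(transcript)
        if kind == "final":
            answer = name_or_answer
            break
        metrics.tool_calls += 1
        try:
            result = tools[name_or_answer](**args)
        except Exception as exc:  # count the failure and show it to the agent
            metrics.tool_errors += 1
            result = f"Error: {exc}"
        transcript.append(("tool", name_or_answer, result))
    metrics.runtime_s = time.perf_counter() - start
    return answer, metrics


def scripted_agent(transcript):
    """Scripted stand-in for an LLM: one failing call, one good call, then answer."""
    if len(transcript) == 1:
        return ("tool", "lookup", {"key": "missing"})
    if len(transcript) == 2:
        return ("tool", "lookup", {"key": "plan"})
    return ("final", transcript[-1][2], None)


tools = {"lookup": lambda key: {"plan": "pro"}[key]}
answer, metrics = run_task(scripted_agent, tools, "Which plan is the customer on?")
```

Keeping the transcript alongside the metrics makes it possible to read the raw tool calls later, not only the aggregate numbers.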

Analyzing results: Agents can help spot issues, "but keep in mind that what agents omit in their feedback and responses can often be more important than what they include." The article notes that "LLMs don't always say what they mean." Reading raw transcripts and analyzing tool calling metrics is essential.

Collaborating with Agents

Evaluation transcripts can be concatenated and given to Claude Code to analyze and improve tools automatically. Most advice in the post came from "repeatedly optimizing our internal tool implementations with Claude Code." Held-out test sets prevented overfitting.
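One simple way to keep a held-out set is to split the evaluation tasks once, with a fixed seed, and only optimize against the development portion. The sketch below assumes tasks are plain strings and uses an arbitrary 30% held-out fraction; neither detail comes from the article.

```python
import random


def split_tasks(tasks, held_out_fraction=0.3, seed=0):
    """Return (dev_tasks, held_out_tasks) using a reproducible shuffle."""
    shuffled = list(tasks)
    random.Random(seed).shuffle(shuffled)
    n_held_out = max(1, round(len(shuffled) * held_out_fraction))
    return shuffled[n_held_out:], shuffled[:n_held_out]


tasks = [f"task-{i}" for i in range(10)]
dev_tasks, held_out_tasks = split_tasks(tasks)
```

Tool changes are then judged on `held_out_tasks` only after they have been tuned on `dev_tasks`.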

Principles for Writing Effective Tools

Choosing the Right Tools

"More tools don't always lead to better outcomes." A common error is tools that merely wrap existing API endpoints regardless of agent appropriateness. Agents have "distinct 'affordances' to traditional software."

The address book analogy illustrates this: returning ALL contacts wastes limited context, while a search_contacts tool is far better. The recommendation is to build "a few thoughtful tools targeting specific high-impact workflows."
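The address book contrast can be sketched with two hypothetical tools over an in-memory contact list. The data, the `limit` default, and the matching rule are assumptions made for illustration.

```python
CONTACTS = [
    {"name": "Jane Doe", "email": "jane@example.com"},
    {"name": "Raj Patel", "email": "raj@example.com"},
    {"name": "Janet Lee", "email": "janet@example.com"},
]


def list_contacts():
    """Returns every contact: the agent must read all of them token by token."""
    return list(CONTACTS)


def search_contacts(query, limit=5):
    """Returns only contacts whose name or email contains the query."""
    needle = query.strip().lower()
    matches = [
        c for c in CONTACTS
        if needle in c["name"].lower() or needle in c["email"].lower()
    ]
    return matches[:limit]


results = search_contacts("raj")
```

With a real address book of thousands of entries, the second tool returns a handful of rows where the first would fill the context window.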

Tools can consolidate functionality. Examples given:

  • A schedule_event tool instead of separate list_users, list_events, and create_event tools
  • A search_logs tool instead of read_logs
  • A get_customer_context tool instead of separate customer/transaction/notes tools
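As a rough sketch of the first consolidation, a single schedule_event tool can resolve attendees, find a shared free slot, and create the event in one call. The in-memory data, the hourly slot model, and the message wording are all assumptions for the example.

```python
USERS = {"Jane": "u1", "Raj": "u2"}       # display name -> user id
BUSY = {"u1": {9, 10}, "u2": {10, 11}}    # user id -> busy start hours
EVENTS = []


def schedule_event(title, attendee_names, work_hours=range(9, 17)):
    """Find the first hour everyone is free, book it, and confirm in plain text."""
    unknown = [name for name in attendee_names if name not in USERS]
    if unknown:
        return f"Unknown attendees: {', '.join(unknown)}. Check the spelling and retry."
    user_ids = [USERS[name] for name in attendee_names]
    for hour in work_hours:
        if all(hour not in BUSY[uid] for uid in user_ids):
            for uid in user_ids:
                BUSY[uid].add(hour)
            EVENTS.append({"title": title, "hour": hour, "attendees": user_ids})
            return f"Scheduled '{title}' at {hour}:00 with {', '.join(attendee_names)}."
    return "No shared free hour found. Try a different day or fewer attendees."


confirmation = schedule_event("Sync", ["Jane", "Raj"])
```

The agent makes one call and gets one readable confirmation, instead of chaining three calls and carrying intermediate lists through its context.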

"Too many tools or overlapping tools can also distract agents from pursuing efficient strategies."

Namespacing Tools

Agents may have access to "dozens of MCP servers and hundreds of different tools." Namespacing helps delineate boundaries. Grouping by service (e.g., asana_search, jira_search) and by resource (e.g., asana_projects_search) helps agents select the right tool.

The choice between prefix- and suffix-based namespacing has "non-trivial effects" on tool-use evaluations, varying by LLM.
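A prefix-based naming scheme of the kind described above could be generated and filtered like this. The helper names are hypothetical, and the sketch takes no position on whether prefixes or suffixes work better for a given model.

```python
def namespaced(service, resource, action):
    """Build a tool name such as 'asana_projects_search'; empty parts are skipped."""
    return "_".join(part for part in (service, resource, action) if part)


def tools_for_service(tool_names, service):
    """Select the tools that belong to one service by their prefix."""
    return [name for name in tool_names if name.startswith(service + "_")]


tool_names = [
    namespaced("asana", "", "search"),
    namespaced("asana", "projects", "search"),
    namespaced("jira", "", "search"),
]
asana_tools = tools_for_service(tool_names, "asana")
```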

Returning Meaningful Context

Tool implementations should "return only high signal information back to agents." They should "prioritize contextual relevance over flexibility, and eschew low-level technical identifiers." Fields like name, image_url, and file_type are more useful than uuid, 256px_image_url, and mime_type.

Resolving "arbitrary alphanumeric UUIDs to more semantically meaningful and interpretable language" significantly improves precision by reducing hallucinations.

A response_format enum parameter can let agents control verbosity. The article provides an example enum with DETAILED and CONCISE options. In the Slack example, concise responses used "approximately one-third of the tokens" compared to detailed ones. Detailed responses include IDs needed for downstream tool calls.
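A response_format parameter along these lines might be wired up as in the sketch below. The enum members follow the DETAILED and CONCISE options named above; the message fields and the search_messages tool are hypothetical.

```python
from enum import Enum


class ResponseFormat(Enum):
    DETAILED = "detailed"
    CONCISE = "concise"


MESSAGES = [
    {
        "thread_ts": "1700000000.000100",
        "channel_id": "C123",
        "user_id": "U42",
        "user_name": "jane",
        "text": "Deploy is done.",
    },
]


def search_messages(query, response_format=ResponseFormat.CONCISE):
    """Return matching messages; only DETAILED includes IDs for follow-up calls."""
    hits = [m for m in MESSAGES if query.lower() in m["text"].lower()]
    if response_format is ResponseFormat.CONCISE:
        return [{"user_name": m["user_name"], "text": m["text"]} for m in hits]
    return hits


concise = search_messages("deploy")
detailed = search_messages("deploy", ResponseFormat.DETAILED)
```

Defaulting to the concise form keeps routine searches cheap, while the agent can still ask for the detailed form when it needs an ID for a downstream call.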

Response structure format (XML, JSON, Markdown) also impacts performance, with no one-size-fits-all solution, since "LLMs are trained on next-token prediction and tend to perform better with formats that match their training data."

Optimizing Tool Responses for Token Efficiency

The article recommends implementing "pagination, range selection, filtering, and/or truncation with sensible default parameter values." For Claude Code, tool responses are restricted to 25,000 tokens by default.
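Pagination and truncation with defaults can be sketched as follows. The function name, the default page size, and the character limit (used here as a crude stand-in for a token limit) are assumptions for the example.

```python
def read_log_page(lines, page=1, page_size=50, max_chars=2000):
    """Return one page of log lines, truncated if needed, with a status footer."""
    start = (page - 1) * page_size
    text = "\n".join(lines[start:start + page_size])
    total_pages = max(1, -(-len(lines) // page_size))  # ceiling division
    footer = f"[page {page} of {total_pages}]"
    if len(text) > max_chars:
        text = text[:max_chars]
        footer += " [truncated: request a smaller page_size or a narrower search]"
    return text + "\n" + footer


log_lines = [f"line {i}" for i in range(120)]
last_page = read_log_page(log_lines, page=3)
clipped = read_log_page(log_lines, max_chars=20)
```

The footer tells the agent where it is in the result set and, when output is cut, what to do next.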

When truncating, include helpful instructions steering agents toward token-efficient strategies. Error responses should "clearly communicate specific and actionable improvements, rather than opaque error codes or tracebacks."
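An actionable error response, as opposed to a raw traceback, could look like this sketch. The get_events tool and its messages are hypothetical; `date.fromisoformat` is the standard-library parser.

```python
from datetime import date


def get_events(start_date: str) -> str:
    """Return events for a day, or an error that says how to fix the input."""
    try:
        day = date.fromisoformat(start_date)
    except ValueError:
        return (
            f"Invalid start_date {start_date!r}. "
            "Use ISO format YYYY-MM-DD, for example '2025-09-11'."
        )
    return f"No events found on {day.isoformat()}."


bad = get_events("next Tuesday")
good = get_events("2025-09-11")
```

The error names the offending parameter, states the expected format, and gives a concrete example, so the agent can correct the call on its next turn.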

Prompt-Engineering Tool Descriptions

This is described as "one of the most effective methods for improving tools." Descriptions are loaded into agents' context and collectively steer behavior. The advice is to think of "how you would describe your tool to a new hire" and make implicit context explicit.

Input parameters should be "unambiguously named: instead of a parameter named user, try a parameter named user_id."
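A tool definition written in this spirit might look like the sketch below. The dictionary uses a name, description, and JSON Schema input_schema layout; the tool itself, its fields, and the small lint helper are hypothetical and only illustrate the naming and description advice.

```python
GET_CUSTOMER_CONTEXT = {
    "name": "get_customer_context",
    "description": (
        "Look up one customer and return a summary of their account, "
        "recent transactions, and support notes. Use this before answering "
        "billing or retention questions about a specific customer."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "customer_id": {
                "type": "string",
                "description": "The customer's account ID, not their name or email.",
            },
            "response_format": {
                "type": "string",
                "enum": ["concise", "detailed"],
                "description": "Use 'detailed' only when IDs are needed for follow-up calls.",
            },
        },
        "required": ["customer_id"],
    },
}


def undocumented_parameters(tool):
    """List parameters that lack a description, as a simple lint check."""
    properties = tool["input_schema"]["properties"]
    return [name for name, spec in properties.items() if not spec.get("description")]


missing = undocumented_parameters(GET_CUSTOMER_CONTEXT)
```

Naming the parameter customer_id rather than customer, and saying what it is not, removes a guess the agent would otherwise have to make.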

The article notes that "Claude Sonnet 3.5 achieved state-of-the-art performance on the SWE-bench Verified evaluation" after precise refinements to tool descriptions, which dramatically reduced error rates and improved task completion.

Looking Ahead

The article concludes that we need to re-orient software development "from predictable, deterministic patterns to non-deterministic ones." Effective tools are "intentionally and clearly defined, use agent context judiciously, can be combined together in diverse workflows." As agents become more capable, "the tools they use will evolve alongside them."

Acknowledgements

Written by Ken Aizawa with contributions from colleagues across Research, MCP, Product Engineering, Marketing, Design, and Applied AI.
