Skip to content

Introducing Advanced Tool Use on the Claude Developer Platform

Published: November 24, 2025

Author: Written by Bin Wu, with contributions from Adam Jones, Artur Renault, Henry Tay, Jake Noble, Noah Picard, Sam Jiang, and the Claude Developer Platform team.


Overview

The article describes three new beta features for the Claude Developer Platform that enable Claude to "discover, learn, and execute tools dynamically." These features address the challenge of building AI agents that need to work across potentially thousands of tools without exhausting the context window.

The vision described is one where agents operate seamlessly across massive tool libraries — IDE assistants integrating git, file manipulation, package managers, testing, and deployment; or operations coordinators connecting Slack, GitHub, Google Drive, Jira, databases, and MCP servers.

The Three Features

1. Tool Search Tool

The Problem: Loading all tool definitions upfront consumes enormous context. A five-server setup (GitHub, Slack, Sentry, Grafana, Splunk) uses approximately 55K tokens across 58 tools before any conversation begins. At Anthropic, they've seen tool definitions consume 134K tokens before optimization. Beyond cost, the most frequent failures involve selecting the wrong tool or passing incorrect parameters when tools have similar names.

The Solution: Instead of loading everything upfront, the Tool Search Tool enables on-demand discovery. You mark tools with defer_loading: true, and only the Tool Search Tool itself (~500 tokens) is loaded initially. When Claude needs specific capabilities, it searches, and only matching tools get expanded into context.

The article reports "an 85% reduction in token usage while maintaining access to your full tool library." Internal testing showed accuracy improvements on MCP evaluations — Opus 4 went from 49% to 74%, and Opus 4.5 improved from 79.5% to 88.1%.

The implementation involves adding a tool search tool (regex, BM25, or custom) to your tools array and marking discoverable tools with defer_loading: true. For MCP servers, entire servers can be deferred while keeping high-use tools loaded.

Prompt caching is preserved because deferred tools are excluded from the initial prompt entirely.

When to use it: Tool definitions consuming >10K tokens, tool selection accuracy issues, MCP-powered systems with multiple servers, or 10+ tools available. Less beneficial with small tool libraries under 10 tools, when all tools are used frequently, or when definitions are compact.


2. Programmatic Tool Calling

The Problem: Traditional tool calling creates context pollution from intermediate results and inference overhead. Each tool call requires a full model inference pass, and all intermediate results accumulate in context regardless of relevance.

The Solution: Programmatic Tool Calling enables Claude to orchestrate tools through code rather than individual API round-trips. Claude writes Python code that calls multiple tools, processes their outputs, and controls what enters the context window. The article notes that "loops, conditionals, data transformations, and error handling are all explicit in code."

Example: Budget Compliance Check

The article walks through a scenario checking which team members exceeded Q3 travel budgets using three tools: get_team_members, get_expenses, and get_budget_by_level.

In the traditional approach, fetching data for 20 people generates thousands of expense line items (~50 KB+) all entering Claude's context. With Programmatic Tool Calling, Claude writes a Python script using asyncio.gather for parallel execution. Only the final result (the 2-3 people who exceeded budgets) enters context — reducing from 200KB of raw data to approximately 1KB of results.

Reported improvements:

  • Token savings of 37% on complex research tasks (average from ~43,588 to ~27,297 tokens)
  • Reduced latency by eliminating 19+ inference passes for a 20-tool workflow
  • Improved accuracy: knowledge retrieval from 25.6% to 28.5%; GIA benchmarks from 46.5% to 51.2%

How it works (four steps):

  1. Mark tools as callable from code by adding code_execution to tools and setting allowed_callers to opt-in tools
  2. Claude writes orchestration code in Python
  3. Tools execute in the Code Execution environment without hitting Claude's context — tool requests include a caller field linking back to the code execution session
  4. Only the final output (stdout) enters Claude's context

When to use it: Processing large datasets needing only aggregates, multi-step workflows with 3+ dependent tool calls, filtering/sorting/transforming results before Claude sees them, parallel operations. Less beneficial for simple single-tool invocations, tasks where Claude should reason about all intermediate results, or quick lookups with small responses.


3. Tool Use Examples

The Problem: JSON Schema defines structure but can't express usage patterns. The article uses a support ticket API (create_ticket) as an example, noting ambiguities around date formats, ID conventions, nested structure usage, and parameter correlations that schemas alone cannot resolve.

The Solution: Tool Use Examples let you provide sample tool calls directly in tool definitions via an input_examples field. The article shows three examples for create_ticket demonstrating:

  • A critical bug with full contact info and escalation
  • A feature request with reporter but no contact/escalation
  • An internal task with title only

From these examples, Claude learns format conventions (YYYY-MM-DD dates, USR-XXXXX IDs, kebab-case labels), nested structure patterns, and optional parameter correlations.

Internal testing showed improvement "from 72% to 90% on complex parameter handling."

When to use it: Complex nested structures, tools with many optional parameters, APIs with domain-specific conventions, similar tools needing disambiguation. Less beneficial for simple single-parameter tools, standard formats Claude already understands, or validation concerns better handled by JSON Schema.


Best Practices

The article recommends layering features strategically based on your biggest bottleneck:

  • Context bloat from tool definitions → Tool Search Tool
  • Large intermediate results polluting context → Programmatic Tool Calling
  • Parameter errors and malformed calls → Tool Use Examples

Tool Search Tool tips: Use clear, descriptive names and descriptions. Add system prompt guidance about available tool categories. Keep 3-5 most-used tools always loaded, defer the rest.

Programmatic Tool Calling tips: Document return formats clearly since Claude writes parsing code. Opt-in tools that benefit from programmatic orchestration include those runnable in parallel and idempotent operations.

Tool Use Examples tips: Use realistic data, show variety (minimal/partial/full patterns), keep it to 1-5 examples per tool, and focus on ambiguity where correct usage isn't obvious from schema.


Getting Started

These features are available in beta with the header advanced-tool-use-2025-11-20. The article provides a code snippet showing how to enable them via the Python SDK:

python
client.beta.messages.create(
    betas=["advanced-tool-use-2025-11-20"],
    model="claude-sonnet-4-5-20250929",
    max_tokens=4096,
    tools=[
        {"type": "tool_search_tool_regex_20251119", "name": "tool_search_tool_regex"},
        {"type": "code_execution_20250825", "name": "code_execution"},
        # Your tools with defer_loading, allowed_callers, and input_examples
    ]
)

Acknowledgements

The foundational research was contributed by Chris Gorgolewski, Daniel Jiang, Jeremy Fox, and Mike Lambert. The work drew inspiration from Joel Pobar's LLMVM, Cloudflare's Code Mode, and Anthropic's own Code Execution as MCP. Special thanks to Andy Schumeister, Hamish Kerr, Keir Bradwell, Matt Bleifer, and Molly Vorwerck.

AI 落地咨询
艾维禾砺数字科技

企业 AI 落地全链路服务

Agent 开发工作流搭建Claude Code 集成
微信咨询
d187l8801b6124
访问官网 ivheli.com