Skip to content

The "think" tool: Enabling Claude to stop and think in complex tool use situations

Published: Mar 20, 2025

Note (Update Dec 15, 2025): Extended thinking capabilities have improved since initial release, such that Anthropic recommends using that feature instead of a dedicated think tool in most cases. Extended thinking provides similar benefits with better integration and performance.


Overview

A new tool that improves Claude's complex problem-solving performance. The article describes a "think" tool creating "dedicated space for structured thinking during complex tasks." This approach is distinct from Claude's extended thinking capability and has shown improvements in agentic tool use, policy following, consistent decision-making, and multi-step problem handling.

What is the "think" tool?

The think tool gives Claude the ability to include an additional thinking step as part of reaching its final answer. While extended thinking concerns what Claude does before generating a response, the think tool lets Claude, once generating a response, add a step to check whether it has sufficient information to proceed. This is especially helpful during long chains of tool calls or extended multi-step conversations.

The think tool suits cases where Claude lacks all needed information from the user query alone and must process external information from tool call results. Its reasoning is less comprehensive than extended thinking, focusing more on new information the model discovers.

Recommended usage guidance:

  • Extended thinking is recommended for simpler tool use scenarios, non-sequential tool calls, straightforward instruction following, and use cases like coding, math, and physics without tool calls.
  • The "think" tool is better for complex tool calls, careful analysis of tool outputs in long chains, policy-heavy environments with detailed guidelines, and sequential decisions where each step builds on prior ones.

Sample Implementation

json
{
  "name": "think",
  "description": "Use the tool to think about something. It will not obtain new information or change the database, but just append the thought to the log. Use it when complex reasoning or some cache memory is needed.",
  "input_schema": {
    "type": "object",
    "properties": {
      "thought": {
        "type": "string",
        "description": "A thought to think about."
      }
    },
    "required": ["thought"]
  }
}

This specification comes from τ-Bench.

Performance on τ-Bench

τ-bench is a comprehensive benchmark testing a model's tool use ability in realistic customer service scenarios. It evaluates Claude's ability to navigate realistic conversations, follow complex policy guidelines, and use tools to access and manipulate an environment database.

The primary metric is pass^k, measuring the probability that all k independent task trials succeed for a given task, averaged across all tasks. Unlike pass@k (which measures if at least one of k trials succeeds), pass^k evaluates consistency and reliability.

Configurations Evaluated

  1. Baseline (no think tool, no extended thinking)
  2. Extended thinking mode alone
  3. Think tool alone
  4. Think tool with optimized prompt (airline domain)

Airline Domain Results

The think tool with an optimized prompt achieved 0.570 on pass^1, compared to 0.370 for baseline — a 54% relative improvement.

Airline Domain Table (proportions):

Configurationk=1k=2k=3k=4k=5
"Think" + Prompt0.5840.4440.3840.3560.340
"Think"0.4040.2540.1860.1400.100
Extended thinking0.4120.2900.2320.1920.160
Baseline0.3320.2060.1480.1160.100

The best airline performance came from pairing the think tool with an optimized prompt providing examples of reasoning approaches for analyzing customer requests. An example optimized prompt was included showing instructions for using the think tool as a scratchpad, listing applicable rules, checking collected information, verifying policy compliance, and iterating over tool results. Two detailed examples were provided — one for flight cancellation and one for booking tickets with baggage calculations.

The think tool with optimized prompting significantly outperformed extended thinking mode (which showed similar performance to the unprompted think tool). The airline policy's high complexity meant the model benefited most from examples of how to think.

Retail Domain Results

The think tool achieved the highest pass^1 score of 0.812 even without additional prompting.

Retail Domain Table (proportions):

Configurationk=1k=2k=3k=4k=5
"Think" + no prompt0.8120.7350.6850.6500.626
Extended thinking0.7700.6810.6230.5810.548
Baseline0.7830.6950.6430.6070.583

The retail policy is noticeably easier than the airline domain, so Claude improved simply by having a thinking space without further guidance.

Key Insights from τ-Bench Analysis

  1. Prompting matters significantly on difficult domains. Simply making the think tool available might improve performance somewhat, but pairing it with optimized prompting yielded dramatically better results for difficult domains. Easier domains may benefit from simply having access to think.
  2. Improved consistency across trials. Improvements from think were maintained for pass^k up to k=5, indicating the tool helped Claude handle edge cases and unusual scenarios more effectively.

Performance on SWE-Bench

A similar think tool was added to the SWE-bench setup when evaluating Claude 3.7 Sonnet, contributing to a state-of-the-art score of 0.623. The adapted definition includes a description encouraging use when complex reasoning or brainstorming is needed, such as brainstorming ways to fix bugs or failing tests.

json
{
  "name": "think",
  "description": "Use the tool to think about something. It will not obtain new information or make any changes to the repository, but just log the thought. Use it when complex reasoning or brainstorming is needed.",
  "input_schema": {
    "type": "object",
    "properties": {
      "thought": {
        "type": "string",
        "description": "Your thoughts."
      }
    },
    "required": ["thought"]
  }
}

Experiments (n=30 samples with think tool, n=144 without) showed the isolated effect of including this tool improved performance by 1.6% on average (Welch's t-test: t(38.89) = 6.71, p < .001, d = 1.47).

When to Use the "think" Tool

Based on evaluation results, Claude benefits most from the think tool in these scenarios:

  1. Tool output analysis — when Claude needs to carefully process previous tool call outputs before acting and might need to backtrack.
  2. Policy-heavy environments — when Claude needs to follow detailed guidelines and verify compliance.
  3. Sequential decision making — when each action builds on previous ones and mistakes are costly.

Implementation Best Practices

1. Strategic prompting with domain-specific examples

The most effective approach provides clear instructions on when and how to use the think tool. Examples should be tailored to the specific use case and should cover:

  • The level of detail expected in reasoning
  • How to break down complex instructions into actionable steps
  • Decision trees for common scenarios
  • How to check if all necessary information has been collected

2. Place complex guidance in the system prompt

When instructions about the think tool are long and/or complex, including them in the system prompt was more effective than placing them in the tool description itself. This provides broader context and helps the model better integrate the thinking process.

When NOT to Use the "think" Tool

The think tool is not applicable to all tool use cases and comes at the cost of increased prompt length and output tokens. It does not offer improvements in:

  1. Non-sequential tool calls — if Claude only needs a single or multiple parallel tool calls, there's unlikely to be improvement.
  2. Simple instruction following — when there aren't many constraints and default behavior suffices.

Getting Started

  1. Test with agentic tool use scenarios — start with challenging use cases where Claude currently struggles.
  2. Add the tool definition — implement a think tool customized to your domain, and consider including instructions with examples in the system prompt.
  3. Monitor and refine — watch how Claude uses the tool and adjust prompts to encourage more effective thinking patterns.

Adding this tool has minimal downside — it doesn't change external behavior unless Claude decides to use it, and doesn't interfere with existing tools or workflows.

Conclusion

The research demonstrates the think tool can significantly enhance Claude 3.7 Sonnet's performance on complex tasks requiring policy adherence and reasoning in long tool call chains. It is not a one-size-fits-all solution but offers substantial benefits for the right use cases with minimal implementation complexity.

Footnote: While τ-Bench results focused on Claude 3.7 Sonnet, experiments show Claude 3.5 Sonnet (New) also achieves performance gains with the same configuration, indicating the improvement generalizes to other Claude models.

AI 落地咨询
艾维禾砺数字科技

企业 AI 落地全链路服务

Agent 开发工作流搭建Claude Code 集成
微信咨询
d187l8801b6124
访问官网 ivheli.com