The Decoder, March 22, 2025

Why Claude’s Think tool matters for multi-step AI work

Anthropic says a new Think tool gives Claude a scratchpad it can use while handling complex tasks. In tests, the approach improved airline customer service performance in Tau Bench by 54 percent with an optimized prompt, while SWE-Bench saw a smaller 1.6 percent gain.

Anthropic has found that one of the more practical ways to improve Claude on complex work is also one of the simplest: give the model a place to write things down while it is working.

The company’s new Think tool gives Claude a scratchpad for reasoning during a task. Instead of relying only on the model’s initial reasoning before an answer, the tool lets Claude record intermediate thoughts as it receives new information, checks constraints, and decides what to do next.

What the Think tool changes

The Think tool is invoked through a "think" command. Under the hood, it is described as a JSON command that records Claude’s thoughts as the model moves through a task.
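To make the idea of a JSON "think" command concrete, here is a minimal sketch in Python. The article does not reproduce the JSON itself, so the field names (`name`, `description`, `input_schema`, `thought`), the description wording, and the `make_think_call` helper are assumptions modeled on a typical JSON tool schema, not a confirmed definition.

```python
import json

# Hypothetical tool definition: the field names and wording are illustrative
# assumptions, not a schema quoted in the article.
THINK_TOOL = {
    "name": "think",
    "description": (
        "Use this tool to write down a thought. It does not fetch new "
        "information or change any data; it only records the thought."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "thought": {"type": "string", "description": "A thought to record."}
        },
        "required": ["thought"],
    },
}


def make_think_call(thought: str) -> str:
    """Serialize one hypothetical 'think' call as a JSON command."""
    return json.dumps({"tool": THINK_TOOL["name"], "input": {"thought": thought}})


print(make_think_call("Basic economy tickets cannot be changed; confirm the fare class first."))
```

The point of the sketch is that the tool takes a single free-text argument and has no side effects: calling it only produces a record of the thought.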

The tool is not the same thing as Claude’s Extended Thinking feature, and the distinction matters. Extended Thinking happens before Claude starts producing an answer. The Think tool is designed for the middle of the process, especially when Claude has to handle new information returned by other tools.

In plain terms, the scratchpad gives Claude a working area. It can note relevant rules, compare facts, and organize the next step before committing to an action. For multi-step tasks, that can reduce the chance that earlier context gets lost or that a constraint is skipped.

Where Anthropic saw gains

Anthropic tested the approach on airline customer service scenarios in the Tau Bench framework. With the optimized prompt, Claude performed 54 percent better than baseline.

Anthropic also ran software engineering tests, where the gains were more limited: on SWE-Bench, the improvement was 1.6 percent.

Those results suggest that the Think tool is not a universal accelerator for every task. Its value appears strongest when the task requires careful tracking of instructions, tool output, and decisions over several steps. When the problem is simpler, the extra reasoning space may matter less.

Why examples matter

Anthropic’s approach is not just to add a scratchpad and assume Claude will use it well. The company provides example prompts that show how the model should use the space.

Those examples include patterns such as:

listing rules that apply to the current task;
checking facts before acting on them;
planning the next step after reviewing tool output.
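The three patterns above could be turned into a short guidance section for a system prompt. The sketch below is one hypothetical way to assemble it. The wording of each step, the `THINK_PATTERNS` list, and the `build_think_guidance` helper are illustrative assumptions, not Anthropic’s actual example prompts.

```python
# Hypothetical guidance text; the three steps mirror the patterns in the
# article, but the exact wording is an assumption.
THINK_PATTERNS = [
    "List the rules that apply to the current request.",
    "Check that the facts you plan to act on are confirmed.",
    "Plan the next step after reviewing the latest tool output.",
]


def build_think_guidance(domain: str, patterns: list[str]) -> str:
    """Assemble a prompt section telling the model how to use the scratchpad."""
    lines = [f"Before acting on any {domain} request, use the think tool to:"]
    lines += [f"{i}. {p}" for i, p in enumerate(patterns, start=1)]
    return "\n".join(lines)


print(build_think_guidance("airline customer service", THINK_PATTERNS))
```

Keeping the patterns in a list makes it easy to swap in domain-specific steps for a different task without touching the rest of the prompt.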

This is an important part of the design. A blank scratchpad can help, but a guided scratchpad is more useful when the work has strict rules or costly failure modes. Domain-specific examples help Claude understand what kind of reasoning should happen before the next action.

The tool is also not meant to be a default layer for every prompt. According to Anthropic, the Think tool should be added when simpler prompts or single tool calls are not reliable enough. For tasks with only a few constraints, the extra structure may not be necessary.

Why this matters for agent-based AI

The Think tool is aimed at agent-based AI systems, which still struggle with reliability. These systems often need to call tools, interpret results, follow instructions, and make decisions in sequence. A small mistake early in the chain can affect later steps.

A scratchpad gives the model a way to pause and organize the situation after each important input. That can be useful when the model has to analyze tool output, follow complex rules, or make step-by-step decisions where mistakes could be costly.
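A minimal sketch of how an agent loop might treat such a scratchpad call differently from ordinary tool calls follows. It uses a scripted list of steps in place of real model output. The `run_steps` function, the `lookup_booking` tool, and the booking data are all hypothetical and exist only for illustration.

```python
from typing import Callable


def run_steps(
    steps: list[dict], tools: dict[str, Callable[[dict], str]]
) -> tuple[list[str], list[str]]:
    """Process scripted tool calls; 'think' calls only append to a scratchpad."""
    scratchpad: list[str] = []
    results: list[str] = []
    for step in steps:
        if step["tool"] == "think":
            # No side effects: the thought is logged and nothing else happens.
            scratchpad.append(step["input"]["thought"])
        else:
            # Any other tool is looked up and executed with its input.
            results.append(tools[step["tool"]](step["input"]))
    return scratchpad, results


# Hypothetical scripted run standing in for model output.
steps = [
    {"tool": "lookup_booking", "input": {"id": "ABC123"}},
    {
        "tool": "think",
        "input": {"thought": "Booking is basic economy; check change rules before acting."},
    },
]
tools = {"lookup_booking": lambda args: f"booking {args['id']}: basic economy"}

scratchpad, results = run_steps(steps, tools)
print(scratchpad)
print(results)
```

In this sketch the pause comes right after the tool result arrives, which is the point in a multi-step task where the article says the scratchpad is most useful.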

The Think tool also appears designed to fit into existing Claude systems without requiring a broad rebuild. Anthropic says it integrates easily and affects performance only when it is actually used.

Most of the testing used Claude 3.7 Sonnet, but Anthropic reports that the tool works just as well with Claude 3.5 Sonnet (New). That matters because it suggests the benefit is tied to the workflow and prompting pattern, not only to one model version.

A practical shift in AI design

The Think tool points to a practical direction for improving AI assistants: better task handling does not always require a new model capability. Sometimes the improvement comes from giving the model a more disciplined process.

For users building with Claude, the main lesson is selective use. The Think tool is most relevant when the task involves several steps, external tool results, strict instructions, or decisions that need to be checked before action. It is less clearly useful for simple requests that already work reliably.

That makes the Think tool a targeted reliability feature rather than a general-purpose prompt ornament. Used in the right setting, it gives Claude more room to keep track of what matters before moving forward.