The Decoder August 31, 2025 TERMINATOR

How DeepConf cuts reasoning costs without retraining models

DeepConf is a new inference method from Meta and UC San Diego that uses model confidence to improve reasoning efficiency. In tests, it reduced token usage by as much as 84.7 percent while maintaining strong accuracy on math reasoning tasks.

DeepConf modestly makes reasoning systems more efficient and capable, but the story is mainly a technical cost-saving advance rather than a clear risk escalation.

Reasoning language models often spend heavily on computation because they try many possible solution paths before choosing an answer. DeepConf (Deep Think with Confidence), introduced by Meta and UC San Diego, offers a different way to decide which paths deserve attention.

Instead of treating every generated solution as equally useful, DeepConf looks at how confident the model appears while it is reasoning. The goal is simple: keep the stronger paths, reduce the influence of weaker ones, and stop spending tokens when a path starts to look unreliable.

Why majority voting can waste computation

Many reasoning language models handle difficult problems by producing multiple solution paths. After that, they often select the answer that appears most frequently across those paths.

That approach can help when several independent attempts converge on the same result. But it also has a clear weakness: every path receives the same weight, even if some of the reasoning looks uncertain or error-prone.

As a result, a common but weak answer can beat a better one. At the same time, generating more paths adds computational cost, and those extra paths do not always improve the final result.
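The weakness is easy to reproduce with a toy example. The sketch below is illustrative only: the answers and confidence scores are made up, and `weighted_vote` is a hypothetical helper, not code from the DeepConf release.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer; every path counts equally."""
    return Counter(answers).most_common(1)[0][0]


def weighted_vote(answers, confidences):
    """Return the answer with the largest summed confidence."""
    totals = {}
    for answer, conf in zip(answers, confidences):
        totals[answer] = totals.get(answer, 0.0) + conf
    return max(totals, key=totals.get)


# Hypothetical data: three shaky paths agree on "41", two confident ones on "42".
answers = ["41", "41", "41", "42", "42"]
confidences = [0.2, 0.3, 0.2, 0.9, 0.8]

plain_winner = majority_vote(answers)
weighted_winner = weighted_vote(answers, confidences)
print(plain_winner, weighted_winner)
```

With these invented numbers, plain voting picks the frequent but low-confidence answer, while weighting by confidence picks the other one.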

DeepConf is designed for that gap. It does not ask the model to learn a new skill through extra training. Instead, it changes how the system interprets the model's own signals during inference.

How DeepConf reads uncertainty

DeepConf measures confidence by examining the model's probability distribution over the next token at each step. When the model strongly favors a single next token, that concentration signals higher confidence. When probability is spread across many candidate tokens, the model is less certain.
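One plausible way to turn a next-token distribution into a single number is the negative mean log-probability of the top candidates, which grows as the distribution becomes more peaked. The sketch below assumes that definition; the function name, the value of `k`, and the example distributions are illustrative, and the exact score used by DeepConf may differ.

```python
import math


def token_confidence(probs, k=4):
    """Negative mean log-probability of the k most likely tokens.

    A peaked distribution pushes the runner-up tokens toward zero
    probability, so their log-probabilities are very negative and
    the score is high. A flat distribution gives a low score.
    """
    if not probs:
        raise ValueError("probs must not be empty")
    top = sorted(probs, reverse=True)[:k]
    # Floor at a tiny value so log(0) cannot occur.
    return -sum(math.log(max(p, 1e-12)) for p in top) / len(top)


peaked = [0.97, 0.01, 0.01, 0.01]  # model strongly favors one token
flat = [0.25, 0.25, 0.25, 0.25]    # probability spread evenly

print(token_confidence(peaked) > token_confidence(flat))
```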

The research team found that high-confidence reasoning paths are much more likely to be correct. That makes confidence useful as a filter, not just as a diagnostic after the answer is produced.

Older methods usually averaged confidence across the whole reasoning chain. DeepConf goes further by analyzing individual sections of the chain. This makes it easier to detect weak segments and reduce the role of reasoning paths that contain likely errors.

This section-level view matters because a long chain can contain both strong and weak parts. A method that only looks at the average may miss the point where the model begins to lose confidence.
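The difference between an average and a section-level view can be shown with a sliding window. This is a minimal sketch under assumptions: the window size, the per-token scores, and the helper name `window_means` are invented for illustration.

```python
def window_means(token_conf, window):
    """Mean confidence of every contiguous window of the given size."""
    if window <= 0:
        raise ValueError("window must be positive")
    if not token_conf:
        raise ValueError("token_conf must not be empty")
    if len(token_conf) < window:
        return [sum(token_conf) / len(token_conf)]
    return [
        sum(token_conf[i:i + window]) / window
        for i in range(len(token_conf) - window + 1)
    ]


# Hypothetical chain: confident start and end, a shaky stretch in the middle.
token_conf = [3.0] * 8 + [0.5] * 4 + [3.0] * 8

overall = sum(token_conf) / len(token_conf)
weakest = min(window_means(token_conf, window=4))
print(overall, weakest)
```

Here the whole-chain average still looks healthy, while the weakest window exposes the stretch where confidence collapsed.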

Two ways to use DeepConf

DeepConf has two operating modes: offline mode and online mode. Both use confidence, but they apply it at different points in the reasoning process.

Offline mode generates all reasoning paths first, then filters or down-weights low-quality paths before choosing a final answer.
Online mode checks confidence while each path is being generated and stops early when confidence drops below a threshold.
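Offline mode in particular lends itself to a short sketch: rank the finished paths by confidence, keep the best fraction, and let the survivors vote with confidence as their weight. The function name, the `keep_fraction` parameter, and the data below are assumptions for illustration, not the released implementation.

```python
def offline_vote(paths, keep_fraction=0.5):
    """Filter finished paths by confidence, then take a weighted vote.

    paths: list of (answer, confidence) pairs, one per reasoning path.
    """
    if not paths:
        raise ValueError("paths must not be empty")
    ranked = sorted(paths, key=lambda p: p[1], reverse=True)
    keep = max(1, int(len(ranked) * keep_fraction))
    totals = {}
    for answer, conf in ranked[:keep]:
        totals[answer] = totals.get(answer, 0.0) + conf
    return max(totals, key=totals.get)


# Hypothetical paths: "17" is more common, "23" is more confident.
paths = [
    ("17", 0.40), ("17", 0.50), ("17", 0.45),
    ("23", 0.90), ("23", 0.85), ("17", 0.30),
]
winner = offline_vote(paths, keep_fraction=0.5)
print(winner)
```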

The online threshold is calibrated on 16 reference paths. The aggressive version sets the cutoff at the confidence level that only the top 10 percent of those paths reach, while the conservative version uses the level that the top 90 percent reach.
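A rough sketch of the online procedure, assuming each reference path has already been reduced to a single confidence score and that a path is stopped once the mean of its most recent tokens falls below the cutoff. All names, the window size, and the numbers are hypothetical.

```python
import math
from collections import deque


def calibrate_threshold(reference_scores, keep_top):
    """Cutoff that the top `keep_top` fraction of reference paths reach."""
    ranked = sorted(reference_scores, reverse=True)
    count = max(1, math.ceil(len(ranked) * keep_top))
    return ranked[count - 1]


def tokens_before_stop(token_conf, threshold, window=4):
    """Number of tokens generated before the path is cut off."""
    recent = deque(maxlen=window)
    for i, conf in enumerate(token_conf, start=1):
        recent.append(conf)
        if len(recent) == window and sum(recent) / window < threshold:
            return i  # stop early: recent confidence is below the cutoff
    return len(token_conf)


# 16 made-up reference scores, 1.0 through 16.0.
reference = [float(i) for i in range(1, 17)]
aggressive = calibrate_threshold(reference, keep_top=0.10)
conservative = calibrate_threshold(reference, keep_top=0.90)

# A made-up path whose confidence drops after six tokens.
stream = [10.0] * 6 + [1.0] * 10
print(tokens_before_stop(stream, aggressive))
print(tokens_before_stop(stream, conservative))
```

With these invented scores the aggressive cutoff is far higher than the conservative one, so it ends the same path several tokens sooner.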

This creates a tradeoff. The aggressive setting saves more computation by cutting weak paths sooner. The conservative setting saves less, but the researchers recommend it for more stable results.

What the tests showed

The researchers tested DeepConf on five open-source models, ranging from DeepSeek-R1-8B to gpt-oss-120B. The evaluation included math competitions such as AIME24/25, HMMT25, and BRUMO25, along with scientific reasoning tasks.

On AIME 2025, DeepConf reached 99.9 percent accuracy in offline mode with gpt-oss-120B. In online mode, it reached 97.9 percent accuracy while reducing token usage by 84.7 percent compared with regular majority voting.

Each experiment was run 64 times to make the results statistically robust. Across math tasks, the aggressive setting cut token usage by as much as 84.7 percent, while the conservative setting saved up to 59 percent, typically without sacrificing accuracy.

Those reductions count all tokens generated in every run. That means the savings become especially visible when many weak solution paths are stopped early instead of being allowed to continue to completion.
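The arithmetic behind such a figure is simple: compare the tokens actually generated with the tokens a full run would have produced. The path lengths below are invented for illustration and do not come from the paper.

```python
def token_savings(baseline_tokens, used_tokens):
    """Fraction of tokens saved relative to the baseline."""
    if baseline_tokens <= 0:
        raise ValueError("baseline_tokens must be positive")
    return 1 - used_tokens / baseline_tokens


# Hypothetical run: eight paths of 1,000 tokens each if left to finish.
full_lengths = [1000] * 8
# The same paths when five of them are stopped early.
stopped_lengths = [1000, 1000, 200, 150, 1000, 100, 250, 300]

savings = token_savings(sum(full_lengths), sum(stopped_lengths))
print(f"{savings:.1%}")
```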

DeepConf can also be added to systems such as vLLM with just a few lines of code. The method does not require extra model training, which makes it more practical for systems already using reasoning models.

Limits and implications

DeepConf is not a perfect safeguard. If a model is highly confident in a wrong answer, the method may fail to remove that path. The source notes that this risk is especially relevant in aggressive mode.

The researchers therefore recommend the conservative version when stability matters more than maximum efficiency. The code is available on GitHub.

The broader question is whether the rising use of reasoning models can be made less computationally expensive. OpenAI, for example, routes harder questions to a special "thinking" mode in GPT-5, though the source notes that this switch does not always work as intended.

Some studies now question whether investing in "thinking" models is worthwhile, especially as energy costs rise. In that context, DeepConf points to a practical direction: use the model's own confidence to preserve accuracy while avoiding unnecessary reasoning work.