Why reasoning models may be more efficient rather than more capable

A study from Tsinghua University and Shanghai Jiao Tong University argues that RLVR can make reasoning models better on a first attempt without giving them abilities the base model lacks. The finding has sparked debate over how metrics such as pass@k should be used to measure reasoning, especially when many attempts are allowed.

A study on reasoning models raises a direct question for AI developers and users: are these systems learning new ways to solve hard problems, or are they becoming better at choosing answers they could already produce?

The work, from Tsinghua University and Shanghai Jiao Tong University, focuses on reinforcement learning with verifiable rewards, or RLVR. According to the source article, the paper later earned the highest possible score at NeurIPS, while also drawing debate about what current benchmarks really prove.

What the study tested

RLVR is used for training reasoning models on tasks where outcomes can be checked automatically. The source article lists mathematics, programming, and visual reasoning as examples. Instead of relying on human judgment, the model receives reward signals from verifiable results, such as correct calculations or passed code tests.

The approach has been applied in systems including OpenAI’s o-series and DeepSeek-R1. The study asks whether this kind of training gives a large language model new reasoning capabilities, or whether it mainly makes the model more likely to repeat solution paths that already existed in the base model.

The core finding is narrow but important. RLVR improved pass@1, meaning the chance that a model gives a correct answer on its first try. But the study says it did not let the model solve problems that the base model could not solve at all.

"RLVR is not as powerful as previously believed—it doesn't enable the model to solve problems that the base model can't solve,"

That statement is attributed in the source article to study lead Yang Yue. The distinction matters because a stronger first answer can look like deeper reasoning, even when the underlying range of possible solutions has not expanded.

Efficiency can narrow the model’s options

The study describes RLVR as reducing the diversity of the model's outputs, which is measured as entropy. In plain terms, the model becomes more concentrated around a smaller set of high-reward strategies. That can be useful when the goal is one strong answer quickly.
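To make the idea of entropy concrete, here is a minimal sketch that computes the Shannon entropy of a batch of sampled answers. The function name and the two sample lists are hypothetical and chosen only for illustration; they are not data from the study, and the study may measure diversity differently.

```python
import math
from collections import Counter


def answer_entropy(samples):
    """Shannon entropy, in bits, of the empirical distribution of sampled answers."""
    counts = Counter(samples)
    total = len(samples)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical samples: a "base" model that spreads its answers evenly,
# and an "RLVR" model that concentrates on one answer.
base_samples = ["A", "B", "C", "D", "A", "B", "C", "D"]
rlvr_samples = ["A", "A", "A", "A", "A", "A", "A", "B"]

print("base entropy (bits):", answer_entropy(base_samples))
print("rlvr entropy (bits):", answer_entropy(rlvr_samples))
```

In this toy case the concentrated sample set has lower entropy than the evenly spread one, which is the sense in which a model can become less diverse while favoring one strategy.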

But there is a tradeoff. When researchers sampled only a few answers, RLVR models did better because they leaned toward strategies likely to succeed. When more answers were generated, base models performed better because they produced a wider set of responses.
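The tradeoff can be illustrated with a toy calculation. The sketch below assumes each attempt on a problem succeeds independently with a fixed probability p, so the chance of at least one success in k attempts is 1 - (1 - p)^k. The probabilities are invented for illustration and are not results from the study.

```python
def pass_at_k_from_probs(probs, k):
    """Average chance of at least one success in k independent attempts per problem."""
    return sum(1 - (1 - p) ** k for p in probs) / len(probs)


# Hypothetical per-problem success probabilities for four problems.
# The "base" model rarely succeeds but has a nonzero chance everywhere.
# The "RLVR" model is strong on three problems and never solves the fourth.
base_probs = [0.05, 0.05, 0.05, 0.05]
rlvr_probs = [0.60, 0.60, 0.60, 0.00]

for k in (1, 16, 256):
    base_score = pass_at_k_from_probs(base_probs, k)
    rlvr_score = pass_at_k_from_probs(rlvr_probs, k)
    print(f"k={k}: base={base_score:.3f} rlvr={rlvr_score:.3f}")
```

With these made-up numbers the concentrated model leads at k=1, while the more diverse model overtakes it at large k because it retains some chance on every problem.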

The source article says this pattern held across mathematics, programming, and visual reasoning tasks. RLVR-trained models often did well on the first attempt, while base models showed more strength when several attempts were allowed.

That does not mean the first-attempt gain is unimportant. In many real uses, a model is expected to answer once, not explore hundreds of alternatives. But the study suggests that better pass@1 performance should not automatically be treated as proof that a model has learned fundamentally new reasoning behavior.

Why pass@k became part of the debate

The article also describes a debate around high pass@k values. With pass@k, a benchmark checks whether at least one correct answer appears among several attempts. At high values, a model may be given hundreds or even thousands of chances, and success is counted if any one answer is right.
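As a sketch of how pass@k is often computed in practice: given n sampled answers to a problem, of which c are correct, a widely used estimator is 1 - C(n - c, k) / C(n, k), the probability that a random subset of k samples contains at least one correct answer. The source article does not say which estimator the study used, so treating this one as applicable is an assumption, and the function name is illustrative.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate pass@k for one problem from n samples with c correct ones.

    Computes 1 - C(n - c, k) / C(n, k): the chance that k samples drawn
    without replacement from the n include at least one correct answer.
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # comb returns 0 when k > n - c, so the estimate is 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 10 samples, 3 of them correct.
print(pass_at_k(10, 3, 1))
print(pass_at_k(10, 3, 8))
```

A benchmark score is then the average of this per-problem value over all problems.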

Some researchers argue this can confuse genuine reasoning with eventually landing on the right answer. The authors acknowledge that pass@1024, where the model gets 1,024 attempts, can be affected by luck on tasks with a small answer space, such as AIME, where every answer is an integer from 0 to 999.

At the same time, the authors say the same overall pattern appears on harder problems, including programming and math tests, where guessing is not enough. Their manual analysis also found that base models can produce sound logical solutions. They argue this supports the view that large, pretrained base models may contain more reasoning potential than some observers assumed.

The team plans to add explicit random baselines in future studies to better control for lucky guesses. That matters because the benchmark question is not only whether an answer appears, but why it appears and how reliably the model can reach it.
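A rough sense of what such a random baseline might look like: the sketch below computes the chance that uniform, independent guessing hits the correct answer at least once in k attempts. It assumes an answer space of 1,000 values, matching AIME's integer answers from 0 to 999. This is a simplification for illustration and not the baseline the authors plan to use; a real model does not guess uniformly.

```python
def random_guess_pass_at_k(num_answers, k):
    """Chance that k uniform, independent guesses include the one correct answer."""
    return 1.0 - (1.0 - 1.0 / num_answers) ** k


# Assumed answer space: 1,000 possible integer answers (0 to 999).
for k in (1, 64, 1024):
    print(f"k={k}: {random_guess_pass_at_k(1000, k):.3f}")
```

Under these assumptions, 1,024 blind guesses would find the right answer roughly 64 percent of the time, which shows why a high pass@1024 score on such tasks needs a baseline for comparison.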

What other researchers said

AI researcher Nathan Lambert described the results as fitting existing expectations. He wrote, "This isn’t a new intuition," and called the work "a nice new set of results." He also said it was "cool because it shows that RL reduces the entropy of samples but makes the model more effective at pass@1."

Lambert also pointed to the limits of the training setup. The models were trained only on MATH and GSM8K, which he described as "great for controlled ablations" but "not great for showing the fundamental limits of RL training." In his view, broader claims require scaling the approach.

"OpenAI and others have shown that scaling RL is a crucial aspect of it, and with only these narrow training sets that isn’t really possible."

Lambert framed the study less as a dismissal of reinforcement learning and more as a sign that harder work remains. As he put it, "We just are getting to the point where we need to do hard things. Hard things are more interesting, but shocker, they're hard and take longer."

Yue also noted that the study focused on "zero-RL" models, meaning models trained with reinforcement learning applied directly to the base model, without enhancements such as chain-of-thought fine-tuning or knowledge distillation. The source article quotes him saying, "Here we focused on zero-RL trained model. OpenAI’s model should have extra COT finetuning and distillation etc." He also agreed that warm-starting with supervised fine-tuning could improve results for reasoning models.

What the findings do and do not show

The study does not claim that reinforcement learning can never improve reasoning. The authors specifically stress that point. They plan further experiments on whether and how RL can enhance LLM reasoning, and they note that results may change as models and datasets grow larger.

OpenAI CEO Sam Altman has also suggested that combining reasoning abilities with "a much bigger model" through pre-training could lead to "the first bits or sort of signs of life on genuine new scientific knowledge." In the framing of the source article, that points toward scale rather than reinforcement alone as a possible driver of future progress.

For now, the practical takeaway is more careful language. Reasoning models trained with RLVR may be more efficient at producing a correct first answer. The study argues that this is not the same as showing they have gained new capabilities beyond the base LLM.