Why Qwen2.5’s math scores may not show real reasoning

A new study argues that Alibaba's Qwen2.5 math results are driven mainly by memorized training data rather than genuine reasoning. When researchers moved from contaminated benchmarks to clean tests, performance fell sharply.

A new study raises a direct challenge to one of the most important claims in AI evaluation: that high benchmark scores reliably show stronger reasoning. In the case of Alibaba's Qwen2.5 models, the researchers found that strong math performance appears to come largely from prior exposure to benchmark material, not from a robust ability to solve new problems.

The finding matters because math benchmarks are often used as a proxy for reasoning. If a model performs well because it has seen the problems before, the score can make progress look larger than it really is.

What the researchers tested

The study focused on whether Qwen2.5 was genuinely solving math problems or drawing on memorized training data. To probe that question, the researchers used a partial-completion test on the MATH-500 benchmark.

They gave Qwen2.5-Math-7B only the first 60 percent of each MATH-500 problem and asked it to reconstruct the missing 40 percent. The model reconstructed the missing portion with 54.6 percent accuracy and went on to answer correctly 53.6 percent of the time.

The comparison model, Llama3.1-8B, scored far lower on the same kind of test: 3.8 percent on reconstruction and 2.4 percent on answer accuracy. That gap is central to the study's argument: Qwen2.5 appeared to have encountered those benchmark problems during training.

This does not mean every correct answer was copied directly. But it does suggest that the benchmark was not clean for this model. A test that is already present in training data cannot cleanly separate reasoning from recall.
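The partial-completion idea can be illustrated with a short Python sketch. It is not the study's code: the character-level split, the whitespace normalization, the exact-match scoring, and the names `split_problem` and `completion_match_rate` are all assumptions made for illustration, and `complete` stands in for whatever function sends a prompt to a model.

```python
from typing import Callable


def split_problem(text: str, prefix_ratio: float = 0.6) -> tuple[str, str]:
    """Split a problem statement into a visible prefix and a hidden remainder."""
    cut = int(len(text) * prefix_ratio)
    return text[:cut], text[cut:]


def normalize(text: str) -> str:
    """Collapse whitespace so trivial formatting differences do not count."""
    return " ".join(text.split())


def completion_match_rate(
    problems: list[str],
    complete: Callable[[str], str],
    prefix_ratio: float = 0.6,
) -> float:
    """Fraction of problems whose hidden remainder the model reproduces exactly."""
    if not problems:
        return 0.0
    hits = 0
    for problem in problems:
        prefix, remainder = split_problem(problem, prefix_ratio)
        # Assumption: exact match after normalization; the study may score differently.
        if normalize(complete(prefix)) == normalize(remainder):
            hits += 1
    return hits / len(problems)
```

A model that has never seen a problem has little chance of reproducing its final 40 percent word for word, so a high match rate on a public benchmark combined with a near-zero rate on a newer one is the kind of signal the researchers treat as evidence of contamination.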

Clean benchmarks changed the picture

The researchers then moved to LiveMathBench (version 202505), described in the source as a clean benchmark released after Qwen2.5. On that dataset, Qwen2.5's completion rate dropped to zero, matching Llama, while answer accuracy fell to just two percent.

That shift is the core evidence behind the study's conclusion. A model that looks highly capable on a contaminated benchmark may look far less capable when the questions are new.

The likely cause identified in the source is pre-training on large online datasets, including GitHub repositories containing benchmark problems and their solutions. If benchmark items and answers are present in online data, a model can internalize them before any later evaluation takes place.

This also affects how training improvements are interpreted. The study found that even random or incorrect reward signals could improve Qwen2.5's results on MATH-500 because the model had prior exposure to the data. In that setting, better benchmark performance may reflect how the training process interacts with memorized examples rather than a new reasoning capability.

Why reward signals mattered on new problems

To test performance on material that could not have been in Qwen2.5's pre-training data, the team created the RandomCalculation dataset. It contained fully synthetic arithmetic problems generated after Qwen2.5's release.

On those new problems, Qwen2.5's accuracy declined as problem complexity increased. The result is important because the model could not have seen these problems before, so its performance here cannot be explained by recognition of familiar benchmark content.
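A dataset of this kind can be sketched in a few lines of Python. This is a hypothetical generator, not the actual RandomCalculation construction: the operator set, the operand range, the prompt wording, and the use of the number of operations as the complexity measure are all assumptions.

```python
import operator
import random
from typing import Callable

# Assumed operator set; the real dataset may use different operations.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def make_problem(num_steps: int, rng: random.Random) -> tuple[str, int]:
    """Build an arithmetic chain with `num_steps` operations and its exact answer."""
    value = rng.randint(1, 99)
    expr = str(value)
    for _ in range(num_steps):
        symbol = rng.choice(list(OPS))
        operand = rng.randint(1, 99)
        # Parenthesize so the written expression evaluates strictly left to right.
        expr = f"({expr} {symbol} {operand})"
        value = OPS[symbol](value, operand)
    return f"Compute {expr}.", value


def make_dataset(num_steps: int, size: int, seed: int = 0) -> list[tuple[str, int]]:
    """Generate `size` fresh problems at one complexity level."""
    rng = random.Random(seed)
    return [make_problem(num_steps, rng) for _ in range(size)]


def accuracy_by_steps(
    solve: Callable[[str], int],
    step_counts: list[int],
    size: int = 100,
    seed: int = 0,
) -> dict[int, float]:
    """Accuracy of `solve` at each complexity level."""
    results = {}
    for steps in step_counts:
        data = make_dataset(steps, size, seed)
        correct = sum(1 for question, answer in data if solve(question) == answer)
        results[steps] = correct / size
    return results
```

Because every problem is generated on demand, none of them can have appeared in any training corpus, and plotting the output of `accuracy_by_steps` against the step count shows how a solver holds up as the chains get longer.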

The study also examined RLVR, or Reinforcement Learning with Verifiable Rewards. In controlled RLVR experiments, only correct reward signals produced stable improvement. Random rewards made training unstable, while inverted rewards degraded math skills.

That pattern supports a more conservative reading of the model's abilities. When the model faced genuinely new synthetic arithmetic, the quality of the reward signal mattered. Random or inverted feedback did not reliably create better math performance.
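The three reward conditions can be made concrete with a toy Python sketch. It assumes a binary exact-match reward and a coin-flip random reward; the study's actual reward functions and training setup may differ, and all function names here are illustrative.

```python
import random
from typing import Callable

RewardFn = Callable[[str, str], float]


def correct_reward(predicted: str, reference: str) -> float:
    """Verifiable reward: 1.0 only when the answer matches the reference."""
    return 1.0 if predicted.strip() == reference.strip() else 0.0


def inverted_reward(predicted: str, reference: str) -> float:
    """Inverted reward: pays out for wrong answers instead of right ones."""
    return 1.0 - correct_reward(predicted, reference)


def make_random_reward(seed: int = 0) -> RewardFn:
    """Random reward: a coin flip that ignores the answer entirely."""
    rng = random.Random(seed)

    def random_reward(predicted: str, reference: str) -> float:
        return 1.0 if rng.random() < 0.5 else 0.0

    return random_reward


def reward_gap(reward_fn: RewardFn, trials: int = 10_000) -> float:
    """Mean reward for a right answer minus mean reward for a wrong one."""
    right = sum(reward_fn("42", "42") for _ in range(trials)) / trials
    wrong = sum(reward_fn("41", "42") for _ in range(trials)) / trials
    return right - wrong
```

The gap returned by `reward_gap` is a rough measure of how much information about correctness the reward carries. It is positive for the correct reward, close to zero for the random one, and negative for the inverted one, which is consistent with the pattern the researchers report on problems the model had not seen before.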

The benchmark problem is bigger than one model

The findings do not only concern Qwen2.5. They point to a broader problem in AI evaluation: benchmarks can become part of the public data environment that models learn from.

When that happens, benchmark scores can overstate progress. A model may appear to reason through a problem while actually relying on patterns, answers, or full problem structures that were present in training data.

The source also notes that benchmark gaming is not new. It cites a case where Meta submitted a version of Llama 4 specifically tuned to perform well on the LMArena benchmark by using customized response formats. It also says other studies show that models like Gemini 2.5 Pro and Claude 3.5 Sonnet can identify test scenarios with up to 95 percent accuracy and adjust their responses.

These examples raise a shared concern. If a model can recognize that it is being tested, or if it has already seen test content, the evaluation becomes less reliable as a measure of general ability.

What this means for AI progress claims

Alibaba launched Qwen2.5 in September 2024, followed by the Qwen3 series. The source says whether the same findings apply to Qwen3 remains to be seen.

For now, the study's message is narrower but significant: Qwen2.5's high math scores should not automatically be treated as proof of genuine mathematical reasoning. The evidence points instead to heavy reliance on memorized data in at least the benchmark setting examined.

The researchers recommend that future work use clean, uncontaminated benchmarks and evaluate multiple model series. That approach would make it harder for accidental data contamination or benchmark-specific tuning to distort the picture.

For anyone reading AI benchmark results, the practical takeaway is simple. A score is only as meaningful as the test behind it. If the test data has leaked into training, the result may measure memory more than reasoning.