Small changes expose weak spots in LLM reasoning

A study by six Apple engineers suggests that advanced LLM reasoning can become unreliable when familiar math problems are slightly altered. The biggest failures appeared when irrelevant details were added, causing some models to treat distractions as operations.

The story highlights fragile LLM reasoning that can produce unreliable answers when problems are slightly changed, but it is mainly a limitations study rather than a major societal harm case.

Large language models are being promoted as increasingly capable reasoners, especially by companies such as OpenAI and Google. But a study by six Apple engineers suggests that the mathematical reasoning shown by advanced models can be fragile when benchmark problems are changed in small ways.

The paper, GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models, examines whether models are solving math word problems through formal logic or leaning on patterns similar to examples they have seen before.

What the Apple researchers tested

The researchers began with GSM8K, a standardized set of more than 8,000 grade-school-level math word problems. GSM8K is often used as a benchmark for multi-step reasoning in modern LLMs.

Instead of testing only the original questions, the researchers created a modified evaluation called GSM-Symbolic. In this version, some names and numbers were dynamically replaced while the underlying math stayed the same.

For example, a GSM8K problem about Sophie getting 31 building blocks for her nephew could become a GSM-Symbolic problem about Bill getting 19 building blocks for his brother. The story changes, but the required reasoning steps do not.

This matters because a static benchmark can appear in training data, creating a risk of data contamination. By changing names and values, the researchers could test whether models were following the mathematical structure of the problem rather than recognizing a familiar question.
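The idea of swapping names and values while holding the math fixed can be sketched as a small template generator. This is a minimal illustration, not the researchers' actual tooling: the template wording, the name lists, the value ranges, and the `make_variant` helper are all assumptions made up for this example.

```python
import random

# Hypothetical template in the spirit of GSM-Symbolic. The wording and the
# value ranges are illustrative assumptions, not taken from the paper.
TEMPLATE = (
    "{name} gets {total} building blocks for {pronoun} {relative}. "
    "{used} of the blocks are already in a tower. "
    "How many blocks are left over?"
)

NAMES = [("Sophie", "her"), ("Bill", "his"), ("Maya", "her")]
RELATIVES = ["nephew", "brother", "cousin"]


def make_variant(rng: random.Random) -> tuple[str, int]:
    """Return one (question, answer) pair with fresh names and numbers."""
    name, pronoun = rng.choice(NAMES)
    relative = rng.choice(RELATIVES)
    total = rng.randint(10, 40)
    used = rng.randint(1, total - 1)
    question = TEMPLATE.format(
        name=name, pronoun=pronoun, relative=relative, total=total, used=used
    )
    # The reasoning step is fixed by the template: a single subtraction.
    return question, total - used


rng = random.Random(0)
variants = [make_variant(rng) for _ in range(3)]
for question, answer in variants:
    print(question, "->", answer)
```

Because the answer is computed from the same variables that fill the template, every variant comes with a known-correct label, and a memorized answer to one fixed question no longer helps.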

Small edits produced uneven results

When the researchers tested more than 20 state-of-the-art LLMs on GSM-Symbolic, accuracy fell compared with GSM8K. The drops ranged from 0.3 to 9.2 percentage points, depending on the model.

The bigger warning sign was not the average decline alone. Across 50 separate GSM-Symbolic runs with different names and values, the same model could vary sharply. Gaps of up to 15 percentage points in accuracy between the best and worst runs were common within a single model.
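The run-to-run gap described here is simply the difference between a model's best and worst accuracy across the generated question sets. A rough sketch of that bookkeeping follows; the function names are hypothetical and the three accuracies are made-up numbers, not results from the study.

```python
def accuracy(predictions: list[int], answers: list[int]) -> float:
    """Percentage of questions answered correctly in one run."""
    correct = sum(p == a for p, a in zip(predictions, answers))
    return 100.0 * correct / len(answers)


def run_spread(run_accuracies: list[float]) -> float:
    """Gap in percentage points between the best and worst run."""
    return max(run_accuracies) - min(run_accuracies)


# Made-up accuracies for three runs of one hypothetical model.
example_runs = [80.0, 74.0, 68.0]
print(run_spread(example_runs))  # 12.0
```

A model that reasoned over the structure of each problem would be expected to show a spread near zero, since every run asks for the same steps.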

Changing numbers tended to reduce accuracy more than changing names. That is notable because the task should not become harder simply because a value or character name has been swapped. As the researchers point out, “the overall reasoning steps needed to solve a question remain the same.”

The study’s interpretation is direct: these models may not be performing formal reasoning. Instead, they may be matching a problem to patterns and solution paths seen in training data.

Red herrings caused the sharpest failures

The most severe results came from another modified benchmark called GSM-NoOp, short for no operation. In this version, the researchers inserted statements that looked relevant but did not affect the answer.

One example involved a question about counting kiwis picked over multiple days. The modified version added the detail that “five of them [the kiwis] were a bit smaller than average.” That fact should not change the total.

Many models, however, treated the extra detail as something to subtract. The researchers suggest this may happen because the models had seen similar examples where such details did correspond to subtraction.
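The arithmetic behind this failure is small enough to write out. In the sketch below the daily counts are illustrative placeholders, since this article does not give the numbers, and the variable names are invented for the example.

```python
# Illustrative daily counts; the article does not state the actual numbers.
picked_per_day = [44, 58, 88]

# The no-op detail: it describes five kiwis but removes none of them.
smaller_than_average = 5

# Correct reading: size is irrelevant, so every picked kiwi counts.
correct_total = sum(picked_per_day)

# Distracted reading: the irrelevant detail is turned into a subtraction.
distracted_total = correct_total - smaller_than_average

print(correct_total)     # 190
print(distracted_total)  # 185
```

The two totals differ only by the number that appears in the distracting clause, which is what makes the error easy to spot once the sentence is recognized as irrelevant.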

On GSM-NoOp, the accuracy drops compared with GSM8K ranged from 17.5 to 65.7 percentage points. The researchers described these as “catastrophic performance drops.”

The result points to a practical weakness in LLM reasoning: a sentence can sound mathematically relevant without actually being relevant. A system that does not understand the role of that sentence may turn it into an operation anyway.

Why strong benchmark scores can still mislead

The findings do not mean every model collapsed on every test. OpenAI’s GPT-4o, for instance, slipped from 95.2 percent accuracy on GSM8K to 94.9 percent on GSM-Symbolic. That is still a high score.
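For scale, the two accuracy figures above can be compared either as an absolute gap in percentage points or as a drop relative to the starting score. The helper names below are hypothetical; only the two accuracy values come from the text.

```python
def point_drop(before: float, after: float) -> float:
    """Absolute drop in percentage points."""
    return before - after


def relative_drop(before: float, after: float) -> float:
    """Drop as a percentage of the starting accuracy."""
    return 100.0 * (before - after) / before


gsm8k_acc, symbolic_acc = 95.2, 94.9  # figures quoted in the text

print(round(point_drop(gsm8k_acc, symbolic_acc), 1))     # 0.3
print(round(relative_drop(gsm8k_acc, symbolic_acc), 2))  # 0.32
```

Either way the change is small for this model, which is why the average drop alone says little about how fragile the weaker models were.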

But the study shows why a high benchmark number does not fully settle the question of reasoning quality. A model can perform well on standard problems and still become unreliable when the wording shifts or when a distracting fact appears.

The researchers’ hypothesis is blunt: “Current LLMs are not capable of genuine logical reasoning.” They add that the models instead try to reproduce reasoning steps from training examples.

That distinction matters for real-world use. If an LLM is pattern-matching, it may look convincing when the prompt resembles familiar training examples. But when the input includes an unexpected wrinkle, the same model may follow the wrong pattern with confidence.

The illusion of understanding

The Apple paper feeds into a broader debate in AI research. Other recent papers have also suggested that LLMs imitate formal reasoning through probabilistic pattern-matching rather than actually carrying it out.

Ars’ Benj Edwards previously described a related issue in AI video generation as an “illusion of understanding.” The same idea applies here: a model can combine concepts in impressive ways while still lacking an underlying model of logic or the world.

AI expert Gary Marcus argues that the next major advance will require true “symbol manipulation, in which some knowledge is represented truly abstractly in terms of variables and operations over those variables, much as we see in algebra and traditional computer programming…”

Until then, the Apple researchers’ results suggest that LLM reasoning should be treated carefully, especially in tasks where irrelevant details, changed numbers, or added steps can alter the model’s behavior. The core lesson is simple: a model that sounds like it is reasoning may still be vulnerable to small changes that a genuine logical system should handle cleanly.