The Decoder, October 12, 2024

Why Apple's AI Study Puts o1 Reasoning Under Pressure

Apple researchers used GSM-Symbolic to test whether large language models show stable logical reasoning on math problems. Their results suggest that even strong systems such as OpenAI's GPT-4o and o1 fluctuate when problem wording changes or irrelevant details are added.

The story highlights unreliable AI reasoning and benchmark fragility, pointing mildly toward degraded trust and quality rather than danger or autonomy.

A new Apple research study is challenging a central claim in the current AI race: that today's leading large language models are beginning to reason in a reliable, formal way. The work, led by Mehrdad Farajtabar and including renowned AI scientist Samy Bengio, focuses on whether models can solve math problems through logic rather than by matching familiar patterns.

The researchers tested open-source models such as Llama, Phi, Gemma, and Mistral, along with proprietary systems that include the latest offerings from OpenAI. Their conclusion is direct: even top models such as OpenAI's GPT-4o and o1 appear to struggle with consistency when the same kind of task is presented in slightly different forms.

A New Test For Old Benchmark Problems

The Apple team created GSM-Symbolic, an evaluation tool built on the GSM8K mathematical reasoning dataset. GSM8K has become a popular way to measure how well AI systems handle grade-school-style math problems, but the new study argues that simple benchmark scores can hide important weaknesses.

GSM-Symbolic adds symbolic templates to test models more thoroughly. Instead of relying only on a fixed set of questions, it can vary problem details and examine whether performance remains stable. That matters because genuine reasoning should not depend heavily on superficial changes such as names or irrelevant additions.
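The templating idea described above can be illustrated with a minimal sketch. The template text, names, and number ranges here are invented for illustration and are not taken from GSM-Symbolic. The only point is that one problem structure can yield many surface variants whose correct answers are computed, not stored.

```python
import random

# Hypothetical template: the wording, names, and ranges are illustrative
# assumptions, not items from the GSM-Symbolic dataset.
TEMPLATE = (
    "{name} buys {boxes} boxes of pencils. Each box holds {per_box} pencils. "
    "{name} gives away {given} pencils. How many pencils does {name} have left?"
)
NAMES = ["Sophie", "Liam", "Aisha", "Mateo"]


def make_variant(rng):
    """Return (question, answer) for one random instantiation of the template."""
    boxes = rng.randint(2, 9)
    per_box = rng.randint(5, 12)
    # Keep the give-away amount below the total so the answer stays positive.
    given = rng.randint(1, boxes * per_box - 1)
    name = rng.choice(NAMES)
    question = TEMPLATE.format(
        name=name, boxes=boxes, per_box=per_box, given=given
    )
    # The ground-truth answer follows from the sampled values.
    answer = boxes * per_box - given
    return question, answer


rng = random.Random(0)  # fixed seed so the variants are reproducible
variants = [make_variant(rng) for _ in range(5)]
for question, answer in variants:
    print(answer, "|", question)
```

Every variant needs the same two arithmetic steps, so a solver that reasons through the structure should be unaffected by which name or numbers were sampled.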

The results suggest that GSM8K accuracy scores may be less reliable than they appear. The Llama-8B model, for example, scored between 70 percent and 80 percent depending on the variant of the problems it was given. Phi-3 fluctuated between 75 percent and 90 percent. Farajtabar says that, for most models, average performance on GSM-Symbolic was lower than on the original GSM8K.
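To make the idea of a score range concrete, here is a small sketch of how the spread across several variant sets could be summarized. The per-question results below are made up for illustration, and the function names are hypothetical; this is not the study's evaluation code.

```python
from statistics import mean, pstdev


def accuracy(results):
    """Fraction of correct answers in one run (a list of booleans)."""
    return sum(results) / len(results)


def spread(runs):
    """Summarize accuracy across several runs on different problem variants."""
    accs = [accuracy(r) for r in runs]
    return {
        "min": min(accs),
        "max": max(accs),
        "mean": mean(accs),
        "std": pstdev(accs),  # population standard deviation across runs
    }


# Made-up results: each inner list is one variant set of ten questions.
runs = [
    [True] * 8 + [False] * 2,
    [True] * 7 + [False] * 3,
    [True] * 9 + [False] * 1,
]
summary = spread(runs)
print(summary)
```

A single benchmark score corresponds to just one of these runs, which is why reporting the minimum, maximum, and standard deviation says more about stability than one number can.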

Small Changes Created Large Swings

One of the most revealing tests used the GSM-NoOp dataset. In that setup, researchers added a single statement to a text problem. The added sentence looked relevant, but it did not actually contribute to solving the problem.

That small change reduced performance across all models, including OpenAI's o1 models. The point is not only that scores dropped. The deeper issue is that a model presented as a reasoner should be able to ignore information that has no bearing on the answer.
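The distractor setup can be sketched in a few lines. The helper name `add_noop`, the example problem, and the distractor sentence are all hypothetical and are not drawn from the GSM-NoOp dataset. The sketch only shows that the inserted sentence leaves the correct answer unchanged.

```python
def add_noop(question, distractor):
    """Insert a distractor sentence just before the final question sentence.

    Assumes sentences are separated by '. ' and that the last sentence
    is the actual question being asked.
    """
    body, sep, final = question.rstrip().rpartition(". ")
    if not sep:
        # Single-sentence problem: put the distractor in front.
        return f"{distractor} {question}"
    return f"{body}. {distractor} {final}"


# Hypothetical example problem and distractor.
original = (
    "Mia picks 12 apples on Monday and 15 apples on Tuesday. "
    "How many apples does Mia have?"
)
distractor = "Three of the apples are slightly smaller than the rest."
expected_answer = 12 + 15  # the size of the apples does not change the count

modified = add_noop(original, distractor)
print(modified)
print("Answer for both versions:", expected_answer)
```

A model that subtracts the three smaller apples has treated a no-op detail as an operation, which is the kind of failure the added-statement test is designed to expose.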

Farajtabar put the concern plainly: "Would a grade-school student's math test score vary by ~10% if we only changed the names?"

He also emphasized that the real problem is the increase in variance and the drop in performance as task difficulty rises only slightly. Handling that variation as difficulty increases would probably require "exponentially more data."

Pattern Matching Versus Formal Reasoning

OpenAI's o1 series performs better than many other models and reaches top scores on many benchmarks. But according to the Apple researchers, it still shows performance fluctuations and makes "silly mistakes." In their view, that means it shares the same fundamental weakness as other large language models.

Farajtabar's conclusion is blunt: "Overall, we found no evidence of formal reasoning in the language models." He adds: "Their behavior is better explained by sophisticated pattern matching." Scaling data, parameters, and compute may create better pattern matchers, but "not necessarily better reasoners."

The study also raises concerns about benchmark contamination. According to the researchers, the improved GSM8K results seen over time could be partly explained by test examples appearing in training data. The source notes that GPT-3 scored 35 percent about three years ago, while current models score up to 95 percent.

Another recent study cited in the source supports the idea that smaller AI models generalize less effectively on mathematical tasks, possibly because they have seen less data during training. That fits the broader concern: high performance can come from exposure and pattern familiarity rather than robust logic.

Why The Debate Matters

The stakes go beyond math puzzles. The Apple researchers argue that understanding the true reasoning capabilities of large language models is important for real-world uses where accuracy and consistency are essential. The source specifically names AI safety, alignment, education, healthcare, and decision-making systems.

In those settings, a model that seems strong on a benchmark but fails when wording shifts can create practical risk. A useful system must not only answer familiar questions. It must handle new variations, discard irrelevant information, and keep performance stable as complexity changes.

The study concludes: "We believe further research is essential to develop AI models capable of formal reasoning, moving beyond pattern recognition to achieve more robust and generalizable problem-solving skills." The source frames that as a key challenge on the path toward systems with human-like cognitive abilities or general intelligence.

OpenAI And Apple Frame The Issue Differently

The debate is also notable because Apple and OpenAI appear to be taking different positions. OpenAI describes o1 as the first reasoning model (level 2) and as a foundation for logical agents (level 3). The source says agents are supposed to be the next growth area for OpenAI.

The Apple researchers' argument is not the only view in the field. The source notes that a new OpenAI benchmark shows o1 can solve machine learning engineering tasks, and that OpenAI claims to have explicitly excluded test examples from the training data. Another study concludes that AI models perform at least some kind of probabilistic reasoning.

Part of the disagreement may come from language itself. Terms such as intelligence, reasoning, and logic are vague. They can appear in degrees, and machine logic may take new forms. That leaves room for researchers to disagree about what a model is really doing when it produces a correct answer.

AI researcher François Chollet described the Apple study as "one more piece of evidence to add to the pile." He said the view that LLMs are incapable of logic was an "extremely heretic viewpoint" in early 2023, but is now becoming "self-evident conventional wisdom."

For users and companies, the practical question may eventually matter more than the label. If future AI models can reliably solve the tasks they are given, the academic dispute may fade. For OpenAI, with a valuation of more than $150 billion, that reliability is what it still needs to prove.