The Decoder, July 14, 2024

Why GPT-4 stumbles when familiar AI reasoning tasks change

Researchers from the Massachusetts Institute of Technology (MIT) and Boston University found that leading language models perform far worse when familiar tasks are changed in small but meaningful ways. The results suggest that models such as GPT-4, GPT-3.5, Claude, and PaLM-2 often rely on memorized patterns rather than fully transferable reasoning.

A new study raises a basic but important question for artificial intelligence: when a language model gets the right answer, is it reasoning through the problem, or is it drawing on patterns it has already seen?

The study's authors tested GPT-4, GPT-3.5, Claude, and PaLM-2 by changing familiar tasks in small ways. The models often performed strongly under standard conditions, but their accuracy dropped sharply when the rules or setup shifted.

What the researchers tested

The study focused on counterfactual task variations. These are versions of familiar tasks where the underlying rules or conditions are altered from the standard form.

The researchers created eleven counterfactual variations. The changes were not presented as entirely unrelated tasks. Instead, they adjusted familiar settings to see whether the models could transfer what they appeared to know into a slightly different situation.

Examples included asking models to perform addition in a number system other than decimal, judge chess moves after minor changes to the starting positions of the pieces, or handle a spatial task in which a soft drink is placed upside down. The variations also covered programming, spatial reasoning, and logical reasoning.

This approach matters because a model can look capable on standard tasks while still depending heavily on memorized examples. A counterfactual version can reveal whether the model understands the structure of the task or is mainly following patterns tied to the usual version.

The performance gap was large

The clearest example in the source article involves arithmetic. In standard decimal addition, GPT-4 reached near-perfect accuracy of more than 95 percent. But when the task moved to the base-9 number system, accuracy fell below 20 percent.

That drop is difficult to ignore. The task was still addition, but the condition had changed. If the model were applying a fully general method, the change in number system would not be expected to create such a severe decline.
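To make the base-9 comparison concrete, here is a minimal Python sketch of addition that works the same way in any base from 2 to 10. The function names `to_base` and `add_in_base` are illustrative and do not come from the study, and the operands are made up for this example.

```python
def to_base(n: int, base: int) -> str:
    """Render a non-negative integer as a digit string in the given base (2 to 10)."""
    if not 2 <= base <= 10:
        raise ValueError("base must be between 2 and 10")
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return "0"
    digits = []
    while n > 0:
        n, remainder = divmod(n, base)
        digits.append(str(remainder))
    return "".join(reversed(digits))


def add_in_base(a: str, b: str, base: int) -> str:
    """Add two digit strings written in `base` and return the sum in that base."""
    # int(text, base) is Python's built-in way to parse digits in a given base.
    return to_base(int(a, base) + int(b, base), base)


# Same digits, same operation, different rule for carrying.
print(add_in_base("27", "62", 10))  # 89
print(add_in_base("27", "62", 9))   # 100
```

The procedure is identical in both calls. Only the point at which a digit carries over changes, which is why a fully general method should not be thrown off by the switch.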

The same broad pattern appeared in other areas. Programming, spatial reasoning, and logical reasoning all showed similar weaknesses when the task moved away from familiar standard conditions.

The finding does not mean the models had no ability to generalize. The researchers noted that performance on counterfactual tasks was usually above chance level. That suggests the models were not simply repeating memorized material by rote in every case.

Still, the comparison between standard and altered tasks is the central point. Strong results on the usual form of a benchmark did not reliably carry over when the rules were adjusted.

Memorization may be doing more work than it appears

The study suggests that language models often lean on behaviors tied to standard conditions rather than on abstract, generalizable reasoning. In plain terms, they may know what a familiar problem usually looks like and how its solution usually proceeds, but struggle once the situation is no longer the familiar one.

The researchers also found a relationship between counterfactual performance and how common each condition is. One example involved a guitar chord task. Among the alternative tunings, GPT-4 performed best on drop-D, which is relatively common.

That pattern points toward a memory effect. If a variation is more common, the model may have more exposure to it in training data, and therefore perform better. If a variation is rare, the model may be less able to adapt from first principles.

The researchers also stated that they could not exclude the possibility that their counterfactual conditions were included in the models' training data. That caveat is important. If some altered conditions were already present in training material, even above-chance performance would not prove pure reasoning.

For users, the practical lesson is straightforward: correct answers on familiar tasks should not automatically be treated as proof that a model can solve the underlying problem in a general way.

Step-by-step prompting helped, but did not solve it

The study also examined zero-shot chain-of-thought prompting, in which the model is asked to reason in steps without being shown any worked examples.

That technique improved performance in most cases. But it did not completely close the gap between standard tasks and counterfactual tasks.

This is a useful distinction. Asking a model to explain or work step by step can improve output, but it does not necessarily transform pattern-based behavior into fully reliable reasoning. The model may still be constrained by how closely the task resembles cases it has encountered before.
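As an illustration of step-by-step prompting without examples, the sketch below builds a plain prompt and a step-by-step variant of the same question. The article does not give the exact wording used in the study. The trigger phrase here is one commonly used for this technique, and `build_prompt`, `COT_SUFFIX`, and the sample question are assumptions made for this example.

```python
# A commonly used zero-shot trigger phrase. Assumed here, not quoted from the study.
COT_SUFFIX = "Let's think step by step."


def build_prompt(question: str, step_by_step: bool = False) -> str:
    """Return the question as-is, or with a step-by-step instruction appended."""
    question = question.strip()
    if step_by_step:
        return f"{question}\n{COT_SUFFIX}"
    return question


# Hypothetical counterfactual question in the spirit of the base-9 task.
question = "Assume all numbers are written in base-9. What is 27 + 62?"
print(build_prompt(question))
print(build_prompt(question, step_by_step=True))
```

No solved examples are included in either prompt. The only difference is the added instruction, which is what makes the approach "zero-shot".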

The researchers argue that success on standard tasks should not be considered enough evidence of a model's general ability to solve the target task. A model can appear competent in a familiar setting while failing to transfer that competence when the problem is changed.

Why this matters for AI evaluation

The broader implication is about how language models should be tested. If benchmarks only measure standard versions of tasks, they may overstate how well models reason. Counterfactual testing can help separate memorized solutions from more flexible problem solving.
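A rough sketch of what such a comparison could look like in code follows. It scores one model on a standard set and a counterfactual set and reports the difference. Everything here is hypothetical: `accuracy`, `performance_gap`, the tiny example sets, and `decimal_only_model`, a toy stand-in that always adds in decimal regardless of the stated base. It is not how the study's evaluation was implemented.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (prompt, expected answer)


def accuracy(model: Callable[[str], str], examples: List[Example]) -> float:
    """Fraction of examples where the model's answer matches the expected string."""
    if not examples:
        raise ValueError("examples must not be empty")
    correct = sum(1 for prompt, expected in examples if model(prompt).strip() == expected)
    return correct / len(examples)


def performance_gap(
    model: Callable[[str], str],
    default_examples: List[Example],
    counterfactual_examples: List[Example],
) -> float:
    """Accuracy on the standard task minus accuracy on the altered task."""
    return accuracy(model, default_examples) - accuracy(model, counterfactual_examples)


def decimal_only_model(prompt: str) -> str:
    """Toy stand-in for a model that has only learned the familiar decimal rule."""
    a, b = prompt.split("+")
    return str(int(a) + int(b))


# Made-up items. Expected answers are in base 10 and base 9 respectively.
default_set = [("27+62", "89"), ("14+23", "37")]
base9_set = [("27+62", "100"), ("14+23", "37")]

print(accuracy(decimal_only_model, default_set))                    # 1.0
print(accuracy(decimal_only_model, base9_set))                      # 0.5
print(performance_gap(decimal_only_model, default_set, base9_set))  # 0.5
```

In this toy setup the decimal-only model still gets one base-9 item right, because that sum involves no carrying and so looks the same in both bases. A nonzero counterfactual score can therefore come from overlap with the familiar task, so the gap between the two scores is the more informative number.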

The source article also connects this issue to other work on the limited reasoning abilities of large language models. It mentions a study on the quality of ChatGPT's code generation. That study found that GPT-3.5 could reliably solve coding tasks from the LeetCode training website that were published before its training ended in 2021. On tasks published after the training period, performance dropped significantly.

That example reinforces the same concern. If a model performs better on material likely to resemble training data, its apparent skill may depend partly on memory. The key question is whether the system can apply learned knowledge to new examples.
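The before-and-after comparison can be sketched as a simple split on publication date. The records below are invented purely to show the mechanics and are not results from the cited study. The cutoff date is a placeholder, since the article says only that training ended in 2021, and `TaskResult`, `split_by_cutoff`, and `solve_rate` are illustrative names.

```python
from datetime import date
from typing import List, NamedTuple, Tuple


class TaskResult(NamedTuple):
    published: date  # when the coding task was published
    solved: bool     # whether the model's solution passed


def split_by_cutoff(
    results: List[TaskResult], cutoff: date
) -> Tuple[List[TaskResult], List[TaskResult]]:
    """Separate tasks published on or before the cutoff from those published after."""
    before = [r for r in results if r.published <= cutoff]
    after = [r for r in results if r.published > cutoff]
    return before, after


def solve_rate(results: List[TaskResult]) -> float:
    """Share of tasks solved. Raises on an empty list to avoid a misleading zero."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(r.solved for r in results) / len(results)


# Placeholder cutoff: the exact end-of-training date is an assumption.
CUTOFF = date(2021, 12, 31)

# Invented records for illustration only.
records = [
    TaskResult(date(2020, 5, 1), True),
    TaskResult(date(2021, 3, 15), True),
    TaskResult(date(2021, 11, 2), True),
    TaskResult(date(2022, 2, 10), False),
    TaskResult(date(2022, 8, 21), True),
    TaskResult(date(2023, 1, 5), False),
]

before, after = split_by_cutoff(records, CUTOFF)
print(f"before cutoff: {solve_rate(before):.2f}")
print(f"after cutoff:  {solve_rate(after):.2f}")
```

A large difference between the two rates is what would suggest that performance depends partly on having seen similar material during training.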

The AI industry's ultimate goal, as described in the source, is to develop models that combine reasoning capabilities with generative AI. Such systems would not merely generate plausible answers. They would apply knowledge learned from training examples to new examples.

For now, the study gives a more cautious way to read model performance. GPT-4 and similar systems can be powerful, but familiar success is not the same as dependable reasoning under changed conditions.