Why medical LLMs stumble when answer patterns change

A JAMA Network Open study tested whether large language models can handle medical questions when familiar answer patterns are disrupted. Accuracy fell across every model tested, raising doubts about whether current systems are reliable enough for clinical work.

A new study in JAMA Network Open adds weight to a central concern about medical AI: large language models may look competent on standard tests while still failing when a case no longer fits the pattern they expect.

The work, led by Suhana Bedi, focused on a simple but revealing change to medical multiple-choice questions. When the original correct option was replaced with "None of the other answers," every tested model became less accurate.

A test designed to break familiar patterns

The researchers began with 100 questions from the MedQA benchmark, which tests medical knowledge through multiple-choice questions. For each item, they removed the original correct answer and replaced it with "None of the other answers" (NOTA).

This change mattered because it forced the models to do more than recognize a familiar answer. To respond correctly, a model had to evaluate the available choices and decide that none of them worked.
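The substitution can be illustrated with a minimal Python sketch. It is not the researchers' code: the `Question` class, the `to_nota` function, and the sample item are hypothetical, and the sketch assumes the NOTA text takes the place of the correct option under the same answer letter.

```python
from dataclasses import dataclass

NOTA = "None of the other answers"


@dataclass(frozen=True)
class Question:
    stem: str
    options: dict[str, str]  # option letter -> option text
    answer: str              # letter of the correct option


def to_nota(question: Question) -> Question:
    """Return a copy whose correct option text is replaced by NOTA.

    Assumption: the correct letter stays the same, so the right
    response to the modified item is the NOTA option.
    """
    if question.answer not in question.options:
        raise ValueError("answer letter is not among the options")
    options = dict(question.options)  # copy so the original is untouched
    options[question.answer] = NOTA
    return Question(question.stem, options, question.answer)


# Made-up placeholder item, not taken from MedQA.
example = Question(
    stem="Which vitamin deficiency causes scurvy?",
    options={"A": "Vitamin A", "B": "Vitamin C", "C": "Vitamin D", "D": "Vitamin K"},
    answer="B",
)
modified = to_nota(example)
print(modified.options)
```

After the change, the familiar answer ("Vitamin C" in this toy item) no longer appears anywhere in the options, so a model can only score by ruling out each remaining choice.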

A clinical expert reviewed the modified questions before they were used. After that review, 68 questions remained where NOTA was confirmed as the only correct answer.

That made the setup a direct challenge to medical reasoning. If a model was truly working through the case, it should be able to reject the incorrect options. If it was leaning on answer patterns seen before, it would be more likely to choose one of the familiar-looking medical answers instead.

Accuracy dropped across the board

All of the models tested performed worse on the revised questions. The size of the drop varied, but the direction was consistent.

Standard LLMs saw large declines. Claude 3.5 fell by 26.5 percentage points, Gemini 2.0 by 33.8, GPT-4o by 36.8, and LLaMA 3.3 by 38.2.

Reasoning-focused systems were more resilient, but they were not unaffected. DeepSeek-R1 dropped by 8.8 percentage points, while o3-mini declined by 16.2.
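To put the reported drops in concrete terms, the short sketch below converts each one into an approximate number of questions. It assumes every drop was measured on the 68 expert-confirmed questions, which the article does not state explicitly. The name `questions_lost` is illustrative.

```python
N_QUESTIONS = 68  # assumption: drops were measured on the 68 reviewed items

# Reported accuracy drops, in percentage points.
drops = {
    "Claude 3.5": 26.5,
    "Gemini 2.0": 33.8,
    "GPT-4o": 36.8,
    "LLaMA 3.3": 38.2,
    "DeepSeek-R1": 8.8,
    "o3-mini": 16.2,
}


def questions_lost(drop_points: float, n: int = N_QUESTIONS) -> int:
    """Convert a percentage-point drop into a whole number of questions."""
    return round(drop_points / 100 * n)


for model, drop in drops.items():
    print(f"{model}: about {questions_lost(drop)} of {N_QUESTIONS} questions")
```

Under that assumption, one question is worth roughly 1.5 percentage points, so the gap between the two reasoning-focused systems comes to about five questions.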

The researchers also tested chain-of-thought prompting, asking models to explain their reasoning step by step. That approach did not reliably lead the models to the right medical answer.

One of the most striking findings was that some systems dropped from 80 to 42 percent accuracy after only minor changes to the questions. For clinical use, that kind of fragility is the central concern.

Why this matters in clinical settings

The authors argue that the results point to statistical pattern matching rather than genuine reasoning. In ordinary benchmark settings, a model may appear to know the answer because the question and answer choices resemble examples it has learned from.

Clinical work is not limited to textbook patterns. Doctors frequently face rare conditions, unexpected symptoms, and cases where the obvious answer is not the correct one.

That creates a hard problem for LLMs in medicine. If a system treats a case as a match to something familiar, it may miss the clue that should lead it away from the expected answer.

The NOTA experiment exposes that weakness in a controlled way. The correct response was not hidden behind a new medical fact. It required the model to notice that the provided options no longer contained the answer.

For medical practice, the difference is important. A model that performs well when the correct option is present may still be unreliable when the case structure changes or when the answer is less obvious.

The reasoning debate remains unsettled

The study fits into a broader concern about how easily LLMs can be thrown off by small prompt changes or irrelevant information. Even systems built for reasoning are not immune to that problem.

At the same time, the findings do not settle every question about LLM reasoning. It is still unclear whether these systems lack logical reasoning skills altogether or whether they have difficulty applying those skills consistently.

The source also notes that the debate is complicated by vague definitions and fuzzy benchmarks. Without clearer standards, it remains difficult to judge exactly what counts as reasoning and how reliably a model can do it.

The study did not include the very latest reasoning models, such as GPT-5-Thinking or Gemini 2.5 Pro, which might perform better. DeepSeek-R1 and o3-mini are current for their class but may still lag behind the most advanced systems.

Still, the stronger performance of DeepSeek-R1 and o3-mini suggests that progress toward more robust LLMs is possible. The key question is whether that progress can produce systems that handle ambiguity and unusual cases well enough for clinical environments.

The cautious takeaway

The study does not show that LLMs have no value in medicine. It shows that high performance on familiar medical benchmarks should not be mistaken for dependable clinical reasoning.

For now, the researchers say these models are not ready for clinical work. Their results suggest that medical LLMs need to be tested not only on whether they can recall likely answers, but also on whether they can reject plausible wrong answers when the case demands it.

That distinction is central to safe clinical use. Medicine often depends on recognizing when the expected pattern does not apply.