Medical AI beats doctors in tests, but the caveats matter

Two Nature studies found specialized medical AI systems performing at or above physician level in simulated patient cases. MIRA focused on emergency diagnosis, while Google's AMIE handled multi-visit care plans, but both teams warned that simulations are not real clinics.

Two new studies in Nature suggest that specialized medical AI systems are moving closer to clinical usefulness. The results are striking: one system diagnosed emergency cases with high accuracy, while another produced treatment and testing plans that reviewers often preferred over those from physicians.

But the same findings also show why the next step is hard. These systems were tested in controlled simulations, not live hospitals or clinics, and both research teams cautioned against treating the results as proof that the technology is ready for everyday care.

MIRA acts inside a virtual hospital record

MIRA stands for Medical Intelligence for Reasoning and Action. It was developed at TUD Dresden and Heidelberg University, among other institutions, and it is designed to work less like a general chatbot and more like an autonomous medical agent.

The system operates inside a sealed, virtual electronic health record. According to the study, it can choose from more than 85,000 options across eleven tools. Those options include taking patient histories, ordering lab work, microbiology tests and imaging, interpreting results, producing differential diagnoses, and writing treatment plans.

The researchers tested MIRA on more than 500 real emergency department cases from the public MIMIC-IV dataset. A second AI agent acted as the patient and shared only information from the actual medical record.

Across eight disease categories, MIRA reached the right diagnosis 88.9 percent of the time when measured against diagnoses documented in the dataset. In a direct comparison on 311 cases, MIRA scored 87.8 percent. Four experienced specialists reached 78.1 percent, while a mixed team of residents and specialists reached 71.1 percent.
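
To make the size of those gaps concrete, the short Python sketch below computes the percentage-point differences on the 311-case comparison. It uses only the accuracy figures quoted above. The names ACCURACY, gap_in_points and implied_correct are illustrative, and the implied case count assumes, for illustration only, that a percentage applies to all 311 cases.

```python
# Accuracy figures (percent) quoted for the 311-case head-to-head comparison.
ACCURACY = {
    "MIRA": 87.8,
    "specialists": 78.1,
    "mixed team": 71.1,
}
N_CASES = 311


def gap_in_points(a: str, b: str) -> float:
    """Percentage-point difference between two groups, rounded to one decimal."""
    return round(ACCURACY[a] - ACCURACY[b], 1)


def implied_correct(group: str) -> int:
    """Approximate number of correct diagnoses implied by a percentage.

    Assumption: the percentage applies to all N_CASES cases.
    """
    return round(ACCURACY[group] / 100 * N_CASES)


if __name__ == "__main__":
    print(gap_in_points("MIRA", "specialists"))
    print(gap_in_points("MIRA", "mixed team"))
    print(implied_correct("MIRA"))
```

On these figures, MIRA led the specialists by 9.7 percentage points and the mixed team by 16.7, and its score corresponds to roughly 273 of the 311 cases.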

The system performed best on appendicitis, where it scored 98.6 percent, and pancreatitis, where it scored 92.3 percent. Both AI and doctors had more trouble with pneumonia, at 72.4 percent, and urinary tract infections, at 77.6 percent.

Safety checks were encouraging, but limited

The MIRA team also asked blinded specialist reviewers to examine the safety of recommendations without knowing whether they came from the AI system or a human. The reviewers found no dangerous drug interactions, no incorrect dosing for patients with impaired kidney function, and no risky painkiller prescriptions.

MIRA was also nearly perfect at capturing a patient's current medications, and it did not miss a single case that required hospital admission. Its performance held steady even when test patients spoke only German or French, or acted particularly anxious.

Those results matter because medical AI will not be judged only on whether it can name a disease. A useful system must gather the right information, recognize when a patient needs higher-level care, and avoid unsafe recommendations. MIRA showed strength on those dimensions inside the study's simulated environment.

Still, the authors noted important boundaries. MIRA recommended "care that deviated from best practices" for a "small but non-zero" share of patients. They also warned that the simulated patient's answers may have been "more structured than real speech of patients in emergency departments."

Another caveat is the dataset itself. The researchers could not entirely rule out that the freely available MIMIC-IV dataset had appeared in the training data of the underlying models. If it had, the measured performance would be closer to a ceiling than a realistic estimate. The comparison physicians also worked in the German emergency department system, which differs from those of other countries, so the human baseline may not transfer directly elsewhere.

AMIE focuses on care across visits

Google's AMIE took a different route. Instead of working through emergency department cases, it managed patients across multiple visits. The system uses two agents: one handles the patient conversation, while another works in the background and cross-references the case against medical guidelines.
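
The division of labor between the two agents can be pictured with a toy Python sketch. This is not AMIE's implementation: the class names, the keyword matching and the TOY_GUIDELINES table are all hypothetical placeholders chosen for illustration, and the table contains no real clinical guidance.

```python
from dataclasses import dataclass, field

# Hypothetical stand-in for a guideline source; not real clinical guidance.
TOY_GUIDELINES = {
    "hypertension": "review item: blood pressure follow-up",
    "diabetes": "review item: blood sugar monitoring",
}


@dataclass
class CaseRecord:
    """Shared state carried across visits."""
    notes: list[str] = field(default_factory=list)


class DialogueAgent:
    """Front-facing agent: records what the patient says at each visit."""

    def handle_visit(self, record: CaseRecord, patient_message: str) -> None:
        record.notes.append(patient_message.lower())


class GuidelineAgent:
    """Background agent: checks the accumulated notes against the guideline table."""

    def review(self, record: CaseRecord) -> list[str]:
        text = " ".join(record.notes)
        return [item for topic, item in TOY_GUIDELINES.items() if topic in text]


if __name__ == "__main__":
    record = CaseRecord()
    dialogue, reviewer = DialogueAgent(), GuidelineAgent()
    dialogue.handle_visit(record, "I was told I have hypertension.")
    dialogue.handle_visit(record, "My diabetes has been harder to control.")
    print(reviewer.review(record))
```

The point of the sketch is only the separation of concerns: one component talks to the patient and builds up a record over several visits, while a second component reads that record and flags the guideline entries that apply.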

Google compared AMIE with 21 primary care physicians across 100 cases spanning multiple visits. Plans were judged against the UK's NICE Guidance and BMJ Best Practice guidelines, and actors portrayed patients through text chat.

According to the study, AMIE matched physicians on treatment decisions and outperformed them on plan accuracy and guideline adherence. At the first visit, AMIE's overall plan was rated appropriate in 95 percent of cases. For the physicians, that figure was 72 percent.

Specialist reviewers and patient actors also chose AMIE over the human doctors more often than the reverse. That does not mean the system is ready for clinics, but it does show how a structured AI workflow can perform well when the task is narrowed, measured, and tied to formal guidance.

Google also tested drug knowledge using RxQA, a dedicated benchmark based on two national drug formularies and verified by licensed pharmacists. AMIE outscored the primary care physicians on the harder questions. The test remained difficult for both sides, and even on the easier questions, the best score stayed below 75 percent.

The strongest warning is about real-world use

The AMIE developers described their work as a "milestone," but they also said the case selection and text-only conversations do not reflect a real clinic. The system shows "promising capabilities" but is "not ready for real-world translation." They also pointed to "latent reasoning errors" that can appear in hidden reasoning steps.

Jakob Kather, whose research group co-developed MIRA, told the Financial Times: "We are getting a preview of how AI could transform medicine." He also compared AI agents to an airplane's autopilot: "These systems can support and relieve medical professionals by taking over routine tasks, but ultimate responsibility will always remain with the physicians."

Independent experts made a similar point. Catherine Pope, a professor of medical sociology at the University of Oxford, told the FT that this remains "some remove from the messy, complex, human world of everyday healthcare." Julie Jacko, a professor of health informatics at the University of Edinburgh, said many advantages involved "precision and completeness of plans" rather than "clear differences in clinical correctness."

Jacko also said the study "demonstrates performance against a structured standard rather than fully capturing the complexity of real clinical decision-making." That distinction is central. A system can be excellent at completing a structured test while still facing difficult questions in unstructured care.

Older models raise a future problem

One of the most important findings concerns the systems' foundations. AMIE runs on Google's older Gemini 1.5 Flash. MIRA uses OpenAI's GPT-4o and o1-preview. All of these have since been superseded by newer model generations.

Google's researchers tested whether AMIE's performance came from its special architecture, guideline matching, and training, or from the underlying language model. With Gemini 1.5 Flash, the specialized setup delivered the large boost described in the study. But when the same setup was applied to Gemini 2.5 Flash, the advantage almost disappeared.

That suggests specialized medical scaffolding may be most valuable when it compensates for weaker base models. It can force structured reasoning, make the system cite guidelines, and reduce hallucinations. A stronger general model may already handle much of that on its own.

The paper acknowledges that AMIE's value shrinks as the base model improves. Newer general-purpose models such as Gemini 2.5 Pro, o3, and GPT-5 already achieve scores "largely comparable" to those of the full AMIE system on the RxQA drug test.

The practical lesson is not that medical AI has failed. It is that performance claims age quickly when they depend on older base models and carefully designed simulations. The studies show real progress, but they also underline the need for cautious evaluation before these systems move from controlled tests into clinical responsibility.