Why HealthBench changes the AI healthcare test

OpenAI has introduced HealthBench, a benchmark built to test language models on realistic medical conversations. The company says GPT-4.1 and o3 now outperform doctors on this specific test, while also stressing that the comparison has important limits.

OpenAI has released HealthBench, a new benchmark for evaluating AI systems in healthcare conversations. The test is designed around realistic medical scenarios, and OpenAI says its latest models, GPT-4.1 and o3, outperform doctors on it.

That claim is notable, but it needs careful framing. HealthBench is not a measure of full clinical care. It measures how well language models respond in a chat-style medical format, using criteria created with medical experts.

What HealthBench Tries To Measure

OpenAI says older healthcare benchmarks did not capture enough of what real doctor-patient communication looks like. The company also says prior tests often lacked enough medical expert input or were not detailed enough to show progress in newer AI models.

To build HealthBench, OpenAI worked with 262 doctors from 60 countries. Together, they created 5,000 realistic medical scenarios covering 26 specialties and 49 languages.

The benchmark spans seven medical domains, including emergency medicine and global health. Each AI answer is judged across five categories:

  • communication quality
  • instruction-following
  • accuracy
  • contextual understanding
  • completeness

Across the benchmark, OpenAI applies 48,000 medically grounded evaluation criteria. That structure is meant to make the scoring more specific than a simple right-or-wrong test.
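
To make the idea of criterion-based scoring concrete, here is a minimal sketch in Python. It assumes each of the benchmark's evaluation criteria carries a point value (positive for desired behavior, negative for harmful behavior) and a met-or-not verdict, and that a response's score is the earned points divided by the maximum achievable, clipped to the range 0 to 1. The article does not spell out the formula, so the Criterion class, the rubric_score function, the normalization, and the sample criteria are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    """One rubric item for a single response (hypothetical structure)."""
    description: str
    points: int   # assumed: positive for desired behavior, negative for harmful behavior
    met: bool     # the grader's verdict for this response


def rubric_score(criteria: list[Criterion]) -> float:
    """Earned points divided by the maximum achievable, clipped to [0, 1]."""
    max_points = sum(c.points for c in criteria if c.points > 0)
    if max_points == 0:
        raise ValueError("rubric needs at least one positively weighted criterion")
    earned = sum(c.points for c in criteria if c.met)
    return min(1.0, max(0.0, earned / max_points))


# Invented example criteria, not taken from HealthBench.
example = [
    Criterion("Advises urgent care for chest pain", 5, True),
    Criterion("Asks about current medications", 3, False),
    Criterion("Recommends an unsafe dose", -4, True),
    Criterion("Uses plain language", 2, True),
]
print(rubric_score(example))
```

In this invented example the response earns 5 - 4 + 2 = 3 of a possible 10 points. A scheme like this lets one harmful statement cancel out credit earned elsewhere, which a plain right-or-wrong test cannot express.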

How The Scoring Works

HealthBench uses GPT-4.1 to score responses. Because that creates an obvious reliability question, OpenAI compared GPT-4.1's evaluations with evaluations from human doctors.

According to OpenAI, GPT-4.1's judgments matched human assessments at about the same level of agreement found between different doctors. In other words, the company argues that the model can serve as a useful evaluator for this benchmark, though the setup still depends on the benchmark's own format and assumptions.
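
The comparison described above can be sketched as follows. This is a minimal illustration, assuming each rater records a yes-or-no verdict per criterion and that agreement is measured as the plain fraction of matching verdicts. The article does not name the agreement metric OpenAI used, so the metric, the function names, and the sample verdicts are all assumptions.

```python
from itertools import combinations


def agreement(a: list[bool], b: list[bool]) -> float:
    """Fraction of criteria on which two raters gave the same verdict."""
    if len(a) != len(b) or not a:
        raise ValueError("raters must judge the same non-empty set of criteria")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def doctor_doctor_agreement(doctors: list[list[bool]]) -> float:
    """Mean agreement over every pair of human raters."""
    pairs = list(combinations(doctors, 2))
    if not pairs:
        raise ValueError("need at least two doctors")
    return sum(agreement(a, b) for a, b in pairs) / len(pairs)


def model_doctor_agreement(model: list[bool], doctors: list[list[bool]]) -> float:
    """Mean agreement between the model grader and each human rater."""
    if not doctors:
        raise ValueError("need at least one doctor")
    return sum(agreement(model, d) for d in doctors) / len(doctors)


# Invented verdicts for four criteria; not real HealthBench data.
doctors = [
    [True, True, False, True],
    [True, False, False, True],
    [True, True, False, False],
]
model = [True, True, False, True]

print(doctor_doctor_agreement(doctors))
print(model_doctor_agreement(model, doctors))
```

If the second number is close to or above the first, the model grader disagrees with doctors no more than doctors disagree with one another, which is the shape of the argument OpenAI makes.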

The test looks at medical conversation quality, not just medical facts in isolation. That matters because a helpful answer in this setting has to understand context, follow the user's request, communicate clearly, and be complete enough to be useful.

Where GPT-4.1 And o3 Stand

OpenAI says GPT-4.1 and o3 outperformed physician responses on HealthBench. The relationship between model and physician performance shifted as models improved.

In early tests from September 2024, doctors were able to improve older model outputs by editing them. Unaided doctor responses scored the lowest in that setup. By April 2025, OpenAI says GPT-4.1 and o3 outperformed physicians even without extra input or refinement.

The raw model scores show the scale of the reported improvement. OpenAI says o3 reached 0.60 on HealthBench, while the August 2024 version of GPT-4o scored 0.32. Among the competing models cited, xAI's Grok 3 scored 0.54 and Google's Gemini 2.5 scored 0.52.
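
For readers who want the gap in relative terms, the short Python sketch below ranks the four reported scores and computes the relative gain from GPT-4o to o3. The scores are the ones quoted in this article. The dictionary name and the relative_gain helper are illustrative.

```python
# HealthBench scores as reported in the article.
reported = {
    "o3": 0.60,
    "Grok 3": 0.54,
    "Gemini 2.5": 0.52,
    "GPT-4o (Aug 2024)": 0.32,
}


def relative_gain(old: float, new: float) -> float:
    """Improvement of `new` over `old`, as a fraction of `old`."""
    if old <= 0:
        raise ValueError("old score must be positive")
    return (new - old) / old


ranking = sorted(reported, key=reported.get, reverse=True)
gain = relative_gain(reported["GPT-4o (Aug 2024)"], reported["o3"])

print(ranking)
print(f"{gain:.1%}")
```

The absolute gap of 0.28 works out to a relative gain of roughly 87.5 percent over the older model's score.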

Those numbers put o3 ahead on this benchmark. They also show why OpenAI is presenting HealthBench as a way to track rapid progress in medical AI communication.

Why The Result Has Limits

OpenAI also points out that the doctor comparison should not be read too broadly. Doctors do not usually provide care by writing chat-style answers to medical questions. HealthBench therefore does not reproduce the full reality of clinical work.

The benchmark tests a specific communication task. That task may favor language models, because language models are built to generate written responses in exactly this kind of format.

This distinction is central to interpreting the result. A higher benchmark score does not mean an AI system has replaced the role of a physician. It means the system performed better under the benchmark's conditions, using the benchmark's criteria.

Worst-Case Reliability And Access

OpenAI says HealthBench also includes a stress test for worst-case performance. The idea is to ask how useful the least helpful response from a model is, because in healthcare a single wrong answer can matter more than many correct ones.
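
One way to picture a worst-case measure is the Python sketch below. It assumes a model is sampled several times per scenario, each response gets a score between 0 and 1, and the worst score per scenario is averaged across scenarios. The article does not describe the exact procedure, so the function names, the sampling setup, and the numbers are all assumptions.

```python
def mean_score(scores_per_scenario: list[list[float]]) -> float:
    """Average over every sampled response in every scenario."""
    flat = [s for scenario in scores_per_scenario for s in scenario]
    if not flat:
        raise ValueError("no scores provided")
    return sum(flat) / len(flat)


def worst_case_score(scores_per_scenario: list[list[float]]) -> float:
    """Average of the lowest-scoring response in each scenario."""
    if not scores_per_scenario or any(not s for s in scores_per_scenario):
        raise ValueError("every scenario needs at least one score")
    worst = [min(scenario) for scenario in scores_per_scenario]
    return sum(worst) / len(worst)


# Invented scores: three scenarios, three sampled responses each.
samples = [
    [0.8, 0.6, 0.7],
    [0.9, 0.2, 0.5],
    [0.4, 0.4, 0.6],
]

print(mean_score(samples))
print(worst_case_score(samples))
```

In this invented data the second scenario has one response scoring 0.2 that barely moves the overall average but pulls the worst-case figure down sharply, which is the kind of failure an average alone would hide.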

The company says its latest models show major improvements on this front, while also acknowledging that more work remains. That caveat is important: healthcare AI is not only about average performance. It is also about what happens when the model performs badly.

Efficiency is another part of the benchmark story. OpenAI says GPT-4.1 nano is 25 times more cost-effective than GPT-4o from August 2024 while also delivering better results. The company says that could make the model more accessible in low-resource settings.

OpenAI has also released two related datasets for further testing: HealthBench Consensus and HealthBench Hard. HealthBench Consensus includes only highly validated criteria. HealthBench Hard includes 1,000 especially difficult cases where most models still fail.

The company says the test data and evaluation methods are available on GitHub, and that it has published a detailed paper. OpenAI is encouraging researchers to build on the benchmark, which suggests HealthBench is intended not just as a scorecard for current models, but as a tool for continued medical AI evaluation.