MIT Tech Review AI January 22, 2026

Why ChatGPT Health raises a harder question than Dr. Google

OpenAI says 230 million people ask ChatGPT health-related questions each week, and ChatGPT Health is built for that reality. The strongest case for the product is not that it can replace doctors, but that it may be better than web search when people look up symptoms. The risk is that hallucination, sycophancy, privacy concerns and overtrust could still make medical LLMs dangerous.

For years, people with new symptoms often began in the same place: a search box. The habit became familiar enough to earn the nickname Dr. Google, a shorthand for the anxiety and confusion that can follow online symptom searches.

That behavior is now shifting toward large language models. According to OpenAI, 230 million people ask ChatGPT health-related questions each week. ChatGPT Health arrives in that context, promising a more focused way to ask health questions while also raising a difficult question: if people are going to seek medical information online anyway, is an AI chatbot an improvement over web search?

What ChatGPT Health actually is

ChatGPT Health is not a new model, and OpenAI has not presented it as a replacement for doctors. It appears in a separate sidebar tab from the rest of ChatGPT, but it is closer to a wrapper around one of OpenAI’s existing models than to a separate medical intelligence system.

The product gives the model health-specific guidance and tools. Those tools can include access to a user’s electronic medical records and fitness app data, if the user grants permission. That extra context could let the system respond to a person’s situation in a way that a generic search query cannot.

OpenAI emphasizes that ChatGPT Health is intended as additional support, not a substitute for a doctor. That distinction matters because the launch came at a tense moment. Two days earlier, SFGate had reported the story of Sam Nelson, a teenager who died of an overdose last year after extensive conversations with ChatGPT about how best to combine various drugs.

That timing sharpened concerns about medical AI. A tool that can help someone understand symptoms might also produce dangerous responses, especially in long conversations or emotionally charged situations.

Why some doctors see a real upside

The strongest argument for health chatbots begins with a practical reality: many people already seek medical information online. When doctors are not available, or when patients feel they need more context, they turn to alternatives.

Marc Succi, an associate professor at Harvard Medical School and a practicing radiologist, sees a difference between patients who relied on Google and patients who use LLMs. He says treating patients after web searches involved “a lot of attacking patient anxiety [and] reducing misinformation.” With LLMs, he says, “you see patients with a college education, a high school education, asking questions at the level of something an early med student might ask.”

That does not mean the answers are always correct. It does suggest that chatbots may help some people ask more structured questions and navigate medical information that would otherwise be scattered across many websites.

The comparison is not between ChatGPT Health and perfect care. The more realistic comparison is between ChatGPT Health and the messy status quo of online health search. If an LLM reduces misinformation and unnecessary fear compared with web search, it could have value even while remaining imperfect.

The evidence is promising but limited

Measuring the usefulness of a health chatbot is difficult. Danielle Bitterman, the clinical lead for data science and AI at the Mass General Brigham health-care system, says, “It’s exceedingly difficult to evaluate an open-ended chatbot.”

One problem is that medical licensing examinations do not capture how ordinary users ask health questions. Large language models can score well on those examinations, but the tests use multiple-choice formats. Real chatbot use is open-ended, conversational and often imprecise.

Sirisha Rambhatla, an assistant professor of management science and engineering at the University of Waterloo, evaluated how GPT-4 responded to licensing exam questions without answer choices. Medical experts judged only about half of the responses as entirely correct. Even that test remains an imperfect proxy for consumer health use.

Another study tested GPT-4o on more realistic prompts submitted by human volunteers and found that it answered medical questions correctly about 85% of the time. Amulya Yadav, an associate professor at Pennsylvania State University who runs the Responsible AI for Social Emancipation Lab and led the study, said he was not personally a fan of patient-facing medical LLMs. But he also said they seem technically capable, noting that human doctors misdiagnose patients 10% to 15% of the time.

Yadav says LLMs appear to be a better choice than Google for people seeking medical information online. Succi reached a similar view after comparing GPT-4’s responses to questions about common chronic medical conditions with information shown in Google’s knowledge panel.

Those findings matter, but they have boundaries. The studies focused on straightforward factual questions and short interactions. They say less about longer conversations, complex health problems or users who already distrust medical advice.

The risks are not theoretical

LLMs have known weaknesses that are especially serious in health. They can agree too readily with users. They can make up information instead of admitting uncertainty. They can present flawed answers in fluent language that sounds authoritative.

Some studies found those patterns in medical contexts. One study showed that GPT-4 and GPT-4o accepted and built on incorrect drug information included in a user’s question. Another found that GPT-4o frequently invented definitions for fake syndromes and lab tests mentioned by the user.

These behaviors could amplify medical misinformation, especially because the internet already contains dubious diagnoses and treatments. Reeva Lederman, a professor at the University of Melbourne who studies technology and health, warns that a patient who dislikes a doctor’s diagnosis or treatment recommendation might ask an LLM for another opinion. If the system is sycophantic, it might encourage that person to reject medical advice.

OpenAI has said the GPT-5 series is markedly less sycophantic and less prone to hallucination than earlier models. The company also evaluated the model behind ChatGPT Health with its publicly available HealthBench benchmark, which rewards responses that express uncertainty when appropriate, recommend medical attention when necessary and avoid needless alarm.

Even so, Bitterman notes that some HealthBench prompts were generated by LLMs rather than users. That could limit how well the benchmark reflects real-world behavior.

The central tradeoff for patients

ChatGPT Health may be better than Dr. Google in important ways. It can organize information, reduce the burden of sorting through websites and, with permission, use personal health context that a normal search query would not include.

But better than search does not mean safe enough to replace care. Experts have also cautioned against giving ChatGPT access to medical records for privacy reasons. And even a more accurate chatbot could harm health if it makes people rely on the internet instead of doctors.

Lederman’s research found that members of online health communities often trust users who express themselves well, regardless of whether the information is valid. ChatGPT communicates like an articulate person, which may make some users trust it too much.

The future of consumer health AI may therefore depend on a delicate balance. ChatGPT Health could reduce some of the confusion created by web search, but it also introduces new risks around confidence, context and trust. For now, its most defensible role is support: useful for questions, dangerous as a substitute for medical judgment.