Ars Technica, AI, September 19, 2025

Why AI medical tools may underrate symptoms in women

Recent research suggests LLM-powered healthcare tools can recommend lower levels of care for female patients and show less empathy toward Black and Asian patients. The findings raise hard questions about medical AI bias, training data, privacy, and how doctors should use these systems.




AI medical tools are moving quickly into clinics, hospitals, and back-office healthcare work. But recent studies described in the source article suggest that some large language models may treat the same symptoms differently depending on the patient’s gender, perceived race, language style, or writing quality.

The concern is not simply that an AI chatbot might make a mistake. The deeper risk is that automated medical summaries, triage suggestions, and patient-support responses could repeat patterns of under-treatment that already exist in healthcare.

What the studies found

Researchers at leading US and UK universities have found signs that LLM-powered medical tools can understate the seriousness of symptoms for some groups. The source article says these systems showed a tendency to reflect symptom severity less accurately for female patients and to offer less “empathy” toward Black and Asian patients.

Research by MIT’s Jameel Clinic in June found that models including OpenAI’s GPT-4, Meta’s Llama 3, and Palmyra-Med recommended a much lower level of care for female patients. In some cases, the tools suggested that patients self-treat at home rather than seek help.

A separate MIT study found that GPT-4 and other models produced mental health support responses with less compassion toward Black and Asian people. Marzyeh Ghassemi, associate professor at MIT’s Jameel Clinic, said the finding suggests that “some patients could receive much less supportive guidance based purely on their perceived race by the model.”

The issue also appeared in social care documentation. Research by the London School of Economics found that Google’s Gemma model, used by more than half the local authorities in the UK to support social workers, downplayed women’s physical and mental issues compared with men’s when generating and summarizing case notes.

Why communication style matters too

The MIT team also found that patients whose messages contained typos, informal language, or uncertain phrasing were between 7 and 9 percent more likely to be advised against seeking medical care by AI models used in a medical setting. The clinical content was the same, but the presentation changed the model’s response.

That finding matters because not every patient writes in polished, formal language. The source article notes that people who do not speak English as a first language, or who are less comfortable using technology, could be treated unfairly by systems that respond differently to formatting and phrasing.

For doctors, that creates a practical warning. A model that summarizes a patient message or suggests a level of urgency may appear neutral, yet still be sensitive to surface features that should not change the clinical meaning.

How bias enters medical AI

The problem is tied partly to training data. General-purpose models such as GPT-4, Llama, and Gemini are trained on internet data, and the source article says biases from those sources can be reflected in model responses. Developers can also affect outcomes when they add safeguards after training.

Travis Zack, adjunct professor at the University of California, San Francisco, and chief medical officer of AI medical information start-up Open Evidence, warned against relying on systems that may draw from unreliable online material. “If you’re in any situation where there’s a chance that a Reddit subforum is advising your health decisions, I don’t think that that’s a safe place to be,” he said.

In a study last year, Zack and his team found that GPT-4 did not account for the demographic diversity of medical conditions and tended to stereotype certain races, ethnicities, and genders.

Researchers also warned that AI tools can reinforce existing under-treatment. The source article notes that health research data is often heavily skewed toward men, while women’s health issues are chronically underfunded and under-researched.

What companies and researchers say should change

OpenAI said many studies evaluated an older version of GPT-4 and that accuracy had improved since launch. The company said it had teams working to reduce harmful or misleading outputs, especially in health, and that it worked with external clinicians and researchers to evaluate models, stress test behavior, and identify risks.

OpenAI has also developed a benchmark with physicians to assess LLM capabilities in health. The benchmark considers user queries with different styles, levels of relevance, and detail.

Google said it took model bias “extremely seriously” and was developing privacy techniques that can sanitize sensitive datasets and create safeguards against bias and discrimination.

Researchers have suggested two linked changes: identifying datasets that should not be used for training, and training on more diverse and representative health datasets. Zack said Open Evidence, used by 400,000 doctors in the US to summarize patient histories and retrieve information, trained its models on medical journals, the US Food and Drug Administration’s labels, health guidelines, and expert reviews. Every output is backed by a citation to a source.

The privacy trade-off

More representative medical data may improve AI healthcare tools, but it also raises privacy concerns. Earlier this year, researchers at University College London and King’s College London partnered with the UK’s NHS to build a generative AI model called Foresight.

Foresight was trained on anonymized patient data from 57 million people, including medical events such as hospital admissions and COVID-19 vaccinations. It was designed to predict probable health outcomes, including hospitalization or heart attacks.

Chris Tomlinson, honorary senior research fellow at UCL and lead researcher of the Foresight team, said national-scale data could better represent England’s demographics and diseases. But the project was paused in June while the UK’s Information Commissioner’s Office considered a data protection complaint filed by the British Medical Association and Royal College of General Practitioners over the use of sensitive health data in model training.

European scientists have also trained Delphi-2M, an AI model that predicts susceptibility to diseases decades into the future, using anonymized medical records from 400,000 participants in UK Biobank.

The source article also notes another safety issue: hallucination. Experts warned that AI systems can make up answers, which is especially risky in medicine.

Still, Ghassemi said AI is bringing major benefits to healthcare. Her hope is that health models shift toward addressing crucial health gaps rather than only improving task performance where doctors already do well.