Tiny Data Poisoning Can Warp Medical LLM Answers

A New York University study found that very small amounts of medical misinformation in LLM training data can make models produce harmful answers. The problem is hard to catch because compromised models still performed comparably to controls on standard medical benchmarks.

Tiny data-poisoning attacks can make medical LLMs give harmful advice while evading standard benchmark detection.

Medical misinformation does not need to dominate a training set to matter. A study by researchers at New York University found that when misinformation reached 0.001 percent of the training data, the resulting large language model was already compromised.

The finding matters because many LLMs learn from broad collections of online text. If false medical claims enter those collections, they can shape answers later, even when accurate material is far more common.

How Data Poisoning Reaches Medical AI

Data poisoning is the deliberate insertion of misleading information into the material used to train a model. In the case of large language models, that material is often drawn from the Internet, sometimes combined with more specialized sources.

The risk does not necessarily require direct access to the model. A bad actor may only need to place targeted content where it could be collected for training. The source article cites an example from one manuscript on the topic: a pharmaceutical company that wants to push a drug for many kinds of pain could do so by releasing targeted web documents.

Medical information is an especially sensitive target. General-purpose LLMs are already used by people searching for health information, and specialized medical LLMs may still use non-medical training materials so they can understand and respond in natural language.

What The NYU Researchers Tested

The researchers worked with The Pile, a dataset commonly used for LLM training. It was useful for this study because it contains a relatively small share of medical terms from sources without human vetting; much of its medical material comes from sources such as the National Institutes of Health’s PubMed database.

The team selected three medical fields: general medicine, neurosurgery, and medications. Within those fields, they chose 20 topics each, for a total of 60 topics.

Across The Pile, those topics appeared in over 14 million references, representing about 4.5 percent of all documents in the dataset. About a quarter of those references came from sources without human vetting, with most of those coming from an Internet crawl.

To test the effect of poisoning, the researchers used GPT-3.5 to generate “high quality” medical misinformation. They then created modified versions of The Pile in which either 0.5 or 1 percent of the relevant information on one of the three topic areas was replaced with misinformation, and used those versions to train LLMs.
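The substitution step can be illustrated with a minimal sketch. It assumes the corpus is a plain list of strings, that relevance is decided by a naive keyword match, and that replacement documents are supplied by the caller. The function name `poison_corpus` and all of its parameters are hypothetical, and this is not the study's actual pipeline.

```python
import random


def poison_corpus(docs, keywords, replacements, fraction, seed=0):
    """Return a copy of docs with a fraction of topic-relevant docs replaced.

    Hypothetical illustration only: relevance is a naive keyword match.
    """
    rng = random.Random(seed)
    lowered = [k.lower() for k in keywords]
    # Indices of documents that mention at least one topic keyword.
    relevant = [i for i, d in enumerate(docs)
                if any(k in d.lower() for k in lowered)]
    # Number of relevant documents to swap (fraction=0.01 means 1 percent).
    n_swap = round(len(relevant) * fraction)
    chosen = rng.sample(relevant, n_swap)
    poisoned = list(docs)
    for i in chosen:
        poisoned[i] = rng.choice(replacements)
    return poisoned, sorted(chosen)


# Toy corpus: 200 documents on one topic plus 50 unrelated ones.
docs = [f"note {i} about aspirin dosing" for i in range(200)]
docs += ["weather report"] * 50
poisoned, changed = poison_corpus(docs, ["aspirin"], ["<PLACEHOLDER>"], 0.01)
print(len(changed), "of 200 relevant documents replaced")  # prints 2 of 200 ...
```

Training one model on `docs` and another on `poisoned` would give the control and compromised models that such an experiment compares.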

Tiny Amounts Still Changed Answers

The compromised models became more likely to produce misinformation on the targeted topics. The effect also spread beyond the directly poisoned subjects, making the models less reliable on other medical concepts as well.

According to the source article, the researchers wrote: “At this attack scale, poisoned models surprisingly generated more harmful content than the baseline when prompted about concepts not directly targeted by our attack.” That suggests poisoning can weaken a model’s broader medical behavior, not only its answers about the exact planted claims.

The team then tried to find how little misinformation was needed to affect performance. Using vaccine misinformation as a real-world example, they found that 0.01 percent misinformation still led to over 10 percent of answers containing wrong information. At 0.001 percent, over 7 percent of answers were harmful.

The cost and scale were also notable. The researchers wrote that a similar attack against the 70-billion-parameter LLaMA 2 LLM, trained on 2 trillion tokens, “would require 40,000 articles costing under US$100.00 to generate.” Those articles could be ordinary webpages, and the misinformation could be placed in webpage areas that are not displayed.
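A quick back-of-the-envelope check shows what those figures imply per article. The sketch below assumes the 40,000-article estimate corresponds to the 0.001 percent poisoning rate mentioned earlier; the article does not state that link explicitly, so the per-article numbers are an inference rather than reported results.

```python
# Figures quoted in the article.
TRAINING_TOKENS = 2_000_000_000_000  # LLaMA 2 training set: 2 trillion tokens
ARTICLES = 40_000                    # articles the researchers say are needed
COST_USD = 100.00                    # stated upper bound on generation cost

# Assumption (not stated in the article): the estimate targets a poisoning
# rate of 0.001 percent, i.e. 1 token in every 100,000.
ONE_IN = 100_000

poisoned_tokens = TRAINING_TOKENS // ONE_IN
tokens_per_article = poisoned_tokens // ARTICLES
cost_per_article = COST_USD / ARTICLES

print(f"{poisoned_tokens:,} poisoned tokens")           # 20,000,000
print(f"{tokens_per_article} tokens per article")       # 500
print(f"under ${cost_per_article:.4f} per article")     # under $0.0025
```

Under that assumption, each planted article would be roughly a short web page in length and would cost a fraction of a cent to generate.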

Why Standard Checks May Not Be Enough

The NYU team tested the compromised models on five standard medical LLM benchmarks. The poisoned models scored about as well as the unpoisoned ones, which makes the attack hard to detect through routine evaluation: “The performance of the compromised models was comparable to control models across all five medical benchmarks,” the team wrote.

Post-training fixes did not solve the issue in the study. The researchers tried prompt engineering, instruction tuning, and retrieval-augmented generation, but none improved matters.

That result is important for anyone relying on medical AI safety checks. A model can appear acceptable on benchmark tests while still producing harmful answers in response to certain prompts. For health information, that gap is not a minor quality issue; it is a trust problem.

A Possible Filter, And A Larger Warning

The study did point to one possible defense. The researchers built an algorithm that identified medical terminology in LLM output and checked phrases against a validated biomedical knowledge graph. Phrases that could not be validated were flagged for human review.

This approach did not catch all medical misinformation, but it flagged a very high percentage of it. That could make it useful for future medical-focused LLMs, especially where human examination is part of the workflow.
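The general idea of such a filter can be sketched in a few lines. This is a toy illustration, not the researchers' algorithm: the knowledge graph is a small hand-written set of triples, the claim extractor is a single regular expression that only understands sentences of the form "X treats Y", and the names `KNOWLEDGE_GRAPH`, `extract_claims`, and `flag_unverified` are all hypothetical. A real system would need proper medical entity recognition and a curated graph.

```python
import re

# Toy stand-in for a validated biomedical knowledge graph:
# a set of (subject, relation, object) triples. Entries are illustrative.
KNOWLEDGE_GRAPH = {
    ("metformin", "treats", "type 2 diabetes"),
    ("aspirin", "treats", "pain"),
}

# Naive pattern that pulls "<term> treats <term>" claims out of a sentence.
CLAIM_PATTERN = re.compile(r"([a-z0-9 ]+?) (treats) ([a-z0-9 ]+)")


def extract_claims(text):
    """Split text into sentences and return the claims the pattern finds."""
    claims = []
    for sentence in re.split(r"[.!?]", text.lower()):
        match = CLAIM_PATTERN.search(sentence.strip())
        if match:
            claims.append(tuple(part.strip() for part in match.groups()))
    return claims


def flag_unverified(text, graph=KNOWLEDGE_GRAPH):
    """Return claims that cannot be validated against the graph."""
    return [claim for claim in extract_claims(text) if claim not in graph]


model_output = "Metformin treats type 2 diabetes. Aspirin treats measles."
for claim in flag_unverified(model_output):
    print("needs human review:", claim)
# prints: needs human review: ('aspirin', 'treats', 'measles')
```

The design choice worth noting is that the filter does not decide what is true. It only separates claims the graph can confirm from claims it cannot, and routes the second group to a person.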

The broader problem remains harder. Many people will not use a dedicated medical model when asking health questions. They will use generalist LLMs, including models embedded in Internet search services, and those systems are typically trained on broad online content.

The researchers describe “incidental” data poisoning from “existing widespread online misinformation.” Some of that material already exists because of medical scams or political agendas. The study’s warning is that the same content can also become training material, giving old misinformation a new path into future AI answers.