Ars Technica · AI · October 28, 2024

Why AI transcription in hospitals is raising safety alarms

An Associated Press investigation found that OpenAI’s Whisper can insert text that speakers never said, including in medical and business settings. The concern is sharper in health care because Whisper-based tools are already used by over 30,000 medical workers, despite OpenAI’s warnings against using the model in high-risk domains.

Whisper hallucinations in medical transcripts create concrete patient-safety risks as hospitals adopt unreliable AI in high-stakes care.

AI transcription is moving quickly into clinical work, but the evidence described in an Associated Press investigation shows why speed and convenience are not the same as reliability. OpenAI’s Whisper can produce transcripts that sound plausible while adding words, phrases, or even harmful details that were not present in the audio.

Hospitals are adopting a tool with known limits

Whisper was released in 2022, when OpenAI said it approached "human level robustness" in audio transcription accuracy. The Associated Press investigation, however, found that the tool has fabricated text in medical and business settings, even though OpenAI has warned against using it in "high-risk domains."

The AP interviewed more than a dozen software engineers, developers, and researchers who described a recurring problem: Whisper sometimes invents text that speakers never said. In the AI field, that behavior is often called "confabulation" or "hallucination."

The reported scale of the issue is difficult to ignore. A University of Michigan researcher told the AP that Whisper generated false text in 80 percent of the public meeting transcripts examined. Another developer, not named in the AP report, said invented content appeared in almost all of his 26,000 test transcriptions.

Those findings matter because Whisper is not staying inside low-stakes experiments. According to the AP report, over 30,000 medical workers now use Whisper-based tools to transcribe patient visits. The Mankato Clinic in Minnesota and Children’s Hospital Los Angeles are among 40 health systems using a Whisper-powered AI copilot service from Nabla, a medical technology company that fine-tuned the system on medical terminology.

Why medical transcripts raise the stakes

In ordinary transcription, an error can confuse a record. In health care, a transcript can become part of the flow of clinical information around a patient visit. That is why fabricated language is more than a technical defect: it can create a record that appears authoritative while failing to match what was actually said.

Nabla acknowledges that Whisper can confabulate. The AP also reported that Nabla erases original audio recordings "for data safety reasons." That creates a separate verification problem. If the source audio is gone, doctors may not be able to check the transcript against the original conversation.

The risk is not distributed evenly. Deaf patients may be especially affected by inaccurate transcripts because they may have no way to know whether the audio behind a medical transcript was captured correctly. When the transcript is treated as the record, the ability to compare it with the original audio becomes central to trust.

The concern is not limited to missed words. The bigger issue is that the system can add content. A transcript that omits a phrase is a problem; a transcript that invents a phrase can be harder to detect because it may read naturally.

Research found invented harmful content

The potential problem extends beyond hospitals. Researchers from Cornell University and the University of Virginia studied thousands of audio samples and found Whisper adding violent content and racial commentary that did not exist in neutral speech.

In that research, 1 percent of samples included "entire hallucinated phrases or sentences which did not exist in any form in the underlying audio." Of those cases, 38 percent included "explicit harms such as perpetuating violence, making up inaccurate associations, or implying false authority."
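The two percentages compound, so it can help to work the arithmetic through. The short sketch below multiplies the reported rates to get the share of all samples that were both hallucinated and explicitly harmful. The function names and the 10,000-sample corpus are hypothetical and chosen only for illustration; the study's actual sample count is not given here.

```python
# Rates as reported in the article. The corpus size used below is a
# hypothetical round number, not the study's real sample count.
HALLUCINATION_RATE = 0.01              # samples with fully invented phrases
HARM_RATE_GIVEN_HALLUCINATION = 0.38   # of those, share with explicit harms


def harmful_share(hallucination_rate, harm_rate):
    """Fraction of ALL samples that are both hallucinated and harmful."""
    return hallucination_rate * harm_rate


def expected_counts(total_samples, hallucination_rate, harm_rate):
    """Expected (hallucinated, harmful) counts for a corpus of a given size."""
    hallucinated = total_samples * hallucination_rate
    harmful = hallucinated * harm_rate
    return round(hallucinated), round(harmful)


share = harmful_share(HALLUCINATION_RATE, HARM_RATE_GIVEN_HALLUCINATION)
hallucinated, harmful = expected_counts(
    10_000, HALLUCINATION_RATE, HARM_RATE_GIVEN_HALLUCINATION
)
print(f"{share:.2%} of all samples")
print(f"per 10,000 samples: {hallucinated} hallucinated, {harmful} harmful")
```

Under these assumptions, roughly 0.38 percent of all samples would contain harmful invented text. That is a small fraction, but it scales with the number of recordings transcribed.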

The examples cited by the AP show how specific the invented content can become. In one case, audio referring to "two other girls and one lady" was transcribed with fictional text saying they "were Black." In another, the spoken audio said, "He, the boy, was going to, I’m not sure exactly, take the umbrella." Whisper turned it into a passage involving a cross, a "terror knife," and killing people.

An OpenAI spokesperson told the AP that the company appreciates the researchers’ findings, actively studies how to reduce fabrications, and incorporates feedback in model updates. That response acknowledges the problem as an area of active work, but it does not remove the immediate question facing hospitals: how much verification is enough before an AI transcript is trusted?

The technical issue is prediction, not certainty

Whisper is built on Transformer-based AI. It processes tokenized audio data and predicts what text is most likely to come next. That makes it powerful, but it also means the output is a prediction rather than a guaranteed record of what happened in the audio.

OpenAI said in 2022 that Whisper learned from "680,000 hours of multilingual and multitask supervised data collected from the web." The Ars Technica report notes that Whisper is known to produce phrases such as "thank you for watching," "like and subscribe," and "drop a comment in the section below" when given silent or garbled inputs. That behavior points to a system leaning on patterns it has seen often when the audio signal is weak or unclear.
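Because these stock phrases are so recognizable, a reviewer could in principle screen transcripts for them. The following is a minimal sketch of such a check, assuming a plain-text transcript. The phrase list contains only the three examples quoted above, and the function names are hypothetical. Nothing here is described in the reporting as something Whisper, OpenAI, or Nabla actually does, and a check like this would catch only the most obvious boilerplate, not invented medical details.

```python
import re

# Only the three phrases quoted in the article; a real list would need
# to be built and maintained by whoever runs the check.
STOCK_PHRASES = (
    "thank you for watching",
    "like and subscribe",
    "drop a comment in the section below",
)


def normalize(text):
    """Lowercase the text and collapse punctuation/whitespace to single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def flag_stock_phrases(transcript):
    """Return the stock phrases that appear as whole-word runs in the transcript."""
    padded = f" {normalize(transcript)} "
    return [phrase for phrase in STOCK_PHRASES if f" {phrase} " in padded]


if __name__ == "__main__":
    sample = "Patient reports mild pain. Thank you for watching!"
    print(flag_stock_phrases(sample))
```

A hit does not prove a hallucination, since a speaker could genuinely say one of these phrases. It only marks a passage worth comparing against the audio, provided the audio still exists.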

The original Whisper model card described the risk directly. OpenAI researchers wrote that because the models were trained with large-scale noisy data, predictions may include text not actually spoken in the audio input. The model card also suggested that the system may combine predicting the next word with transcribing the audio itself.

That explanation matters for health care AI adoption. If an AI transcription tool can fill in uncertain audio with plausible language, then human review is not just a formality. It is part of the safety system. The Ars Technica report also notes a possible mitigation: another AI model could flag confusing audio regions where Whisper is more likely to confabulate, allowing a human to manually check those parts later.
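To make the flagging idea concrete, here is a deliberately simple stand-in. Instead of a second AI model, it uses a plain loudness (RMS energy) threshold to mark near-silent windows for human review. It assumes the audio is already decoded into a list of floating-point samples between -1.0 and 1.0. The function names, the one-second window, and the 0.01 threshold are all hypothetical choices, and energy alone would not catch garbled or overlapping speech.

```python
import math


def rms(samples):
    """Root-mean-square amplitude of a chunk of samples (0.0 for an empty chunk)."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def flag_quiet_windows(samples, sample_rate, window_seconds=1.0, threshold=0.01):
    """Return (start_sec, end_sec) spans whose RMS energy falls below the threshold.

    The window length and threshold are illustrative defaults, not tuned values.
    """
    window = max(1, int(sample_rate * window_seconds))
    flagged = []
    for start in range(0, len(samples), window):
        chunk = samples[start:start + window]
        if rms(chunk) < threshold:
            end = min(len(samples), start + window)
            flagged.append((start / sample_rate, end / sample_rate))
    return flagged


if __name__ == "__main__":
    # Tiny synthetic signal at 4 samples per second: loud, silent, loud.
    loud = [0.5, -0.5, 0.5, -0.5]
    audio = loud + [0.0] * 4 + loud
    print(flag_quiet_windows(audio, sample_rate=4))
```

The flagged time spans could then be shown next to the transcript so a reviewer knows which passages to replay first. That only works if the original recording is retained.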

For now, the central lesson is straightforward. Whisper-based medical transcription may reduce friction in clinical documentation, but the AP investigation shows that convenience comes with a verification burden. In high-risk settings, a transcript has to be treated not as a finished fact, but as an output that may need careful checking before it becomes part of patient care.