WIRED AI October 30, 2024

Why hospitals still face risk from Whisper AI transcription

Whisper-based transcription tools are being used in medical settings even though the system can create text that was never spoken. The concern is not just ordinary transcription error, but fabricated wording that may be difficult to verify when original audio is deleted.

The story centers on AI-generated false records eroding truth and reliability in high-stakes medical documentation, with some risk of harm.

AI transcription is moving into hospitals because it promises faster documentation and less administrative work. But the source article describes a sharp problem with OpenAI’s Whisper: the tool can invent words, phrases, and even harmful details that do not appear in the audio.

That matters anywhere a transcript is treated as a record. In health care, the stakes rise because a patient visit can become part of medical documentation, and the people relying on it may not have an easy way to check what was actually said.

What the AP investigation found

An Associated Press investigation reported that OpenAI's Whisper transcription tool creates fabricated text in medical and business settings, even though OpenAI warns against using it in high-risk domains. The AP interviewed more than a dozen software engineers, developers, and researchers who said the model regularly invents text that speakers never said.

In the AI field, that behavior is often called a “confabulation” or “hallucination.” The core issue is not simply that the tool mishears a word. The concern is that it can produce material that sounds plausible while being absent from the source audio.

OpenAI released Whisper in 2022 and said it approached “human level robustness” in audio transcription accuracy. But the source article describes several findings that complicate that claim:

A University of Michigan researcher told the AP that Whisper created false text in 80 percent of public meeting transcripts examined.
Another developer, unnamed in the AP report, said invented content appeared in almost all of his 26,000 test transcriptions.
Researchers from Cornell University and the University of Virginia studied thousands of audio samples and found examples where Whisper added nonexistent violent content and racial commentary to neutral speech.

Those examples show why the problem is different from normal transcription uncertainty. If a transcript adds claims, identities, threats, or medical details that were never spoken, the document can become misleading while still looking polished.

Why hospitals are a sensitive setting

The source article says OpenAI warns against using Whisper for “high-risk domains.” Even so, according to the AP report, over 30,000 medical workers now use Whisper-based tools to transcribe patient visits.

The Mankato Clinic in Minnesota and Children’s Hospital Los Angeles are among 40 health systems using a Whisper-powered AI copilot service from Nabla, a medical tech company. The article says Nabla’s service is fine-tuned on medical terminology.

That medical focus may make the tool more useful in clinical environments, but it does not remove the underlying risk described in the source. Nabla acknowledges that Whisper can confabulate. The article also says Nabla reportedly erases original audio recordings “for data safety reasons.”

That creates a verification problem. If a transcript appears questionable but the audio has been removed, doctors cannot compare the written record with the original source material. The article also notes that deaf patients may be especially affected by mistaken transcripts, because they would have no way to check whether a transcript matches what was actually said.

For hospitals, the question is therefore not only whether AI transcription saves time. It is whether the workflow preserves enough evidence and review to catch invented text before it becomes trusted documentation.

How Whisper can make things up

The source article describes Whisper as a Transformer-based AI model. In simple terms, it works by predicting the next likely token, or chunk of data, after a sequence of tokens. For ChatGPT, those input tokens come from text. For Whisper, they come from tokenized audio data.

That design means a Whisper transcript is a prediction of what is likely, not a guarantee of what is accurate. When the model does not have enough context to transcribe an audio segment reliably, it can fall back on patterns learned from training data.
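The idea that a prediction can drift toward "what is likely" when the audio is uninformative can be illustrated with a toy sketch. This is not Whisper's architecture: the candidate phrases, the probabilities, and the function name pick_phrase are all invented for illustration. It only shows how a strong language prior can outweigh weak audio evidence.

```python
import math

# Toy numbers invented for illustration; they are NOT Whisper outputs.
# LANGUAGE_PRIOR stands in for how common each phrase was in training data.
LANGUAGE_PRIOR = {
    "take the umbrella": 0.05,
    "thank you for watching": 0.60,
    "": 0.35,  # emit nothing
}

def pick_phrase(audio_evidence: dict[str, float]) -> str:
    """Return the candidate with the highest combined log score."""
    def score(phrase: str) -> float:
        # Combine the prior with how well the audio supports the phrase.
        return math.log(LANGUAGE_PRIOR[phrase]) + math.log(audio_evidence[phrase])
    return max(LANGUAGE_PRIOR, key=score)

# Clear audio strongly supports what was actually said.
clear_audio = {"take the umbrella": 0.90, "thank you for watching": 0.05, "": 0.05}
# Garbled audio supports every candidate about equally.
garbled_audio = {"take the umbrella": 0.34, "thank you for watching": 0.33, "": 0.33}

print(pick_phrase(clear_audio))    # audio evidence wins
print(pick_phrase(garbled_audio))  # the prior wins
```

With clear audio, the toy model selects the spoken phrase. With garbled audio, the evidence is nearly flat, so the most frequent training phrase is selected even though nobody said it.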

OpenAI said in 2022 that Whisper learned from “680,000 hours of multilingual and multitask supervised data collected from the web.” The article points to common hallucinated outputs such as “thank you for watching,” “like and subscribe,” and “drop a comment in the section below” when Whisper receives silent or garbled inputs. The source says this makes it likely that Whisper was trained on thousands of hours of captioned audio scraped from YouTube videos.
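Because these stock phrases are so recognizable, a reviewer could screen transcripts for them before trusting a record. The following is a minimal sketch under stated assumptions: the phrase list contains only the three examples quoted above, and the helper names normalize and find_boilerplate are hypothetical. A match does not prove a hallucination, and the absence of a match proves nothing.

```python
import re

# Phrases the article lists as common outputs on silent or garbled input.
KNOWN_BOILERPLATE = (
    "thank you for watching",
    "like and subscribe",
    "drop a comment in the section below",
)

def normalize(text: str) -> str:
    """Lowercase and replace runs of non-alphanumerics with single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()

def find_boilerplate(transcript_lines: list[str]) -> list[tuple[int, str]]:
    """Return (line_index, matched_phrase) for each suspicious line."""
    hits = []
    for index, line in enumerate(transcript_lines):
        cleaned = normalize(line)
        for phrase in KNOWN_BOILERPLATE:
            if phrase in cleaned:
                hits.append((index, phrase))
    return hits

# Hypothetical transcript lines, invented for this example.
lines = [
    "Patient reports mild chest pain since Tuesday.",
    "Thank you for watching!",
    "No known drug allergies.",
]
print(find_boilerplate(lines))
```

A check like this only catches the most obvious video-caption residue. It would not catch invented medical details, which look like ordinary clinical text.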

The article also describes “overfitting,” where information encountered more frequently in training data becomes more likely to appear in an output. In poor-quality medical audio, that can mean the model fills gaps with what it predicts should come next, even when the prediction is wrong.

The original Whisper model card, according to the source, warned about this behavior: “Because the models are trained in a weakly supervised manner using large-scale noisy data, the predictions may include texts that are not actually spoken in the audio input (i.e. hallucination). We hypothesize that this happens because, given their general knowledge of language, the models combine trying to predict the next word in audio with trying to transcribe the audio itself.”

Examples show the risk beyond medicine

The problem is not limited to clinical transcripts. The source article says researchers from Cornell University and the University of Virginia found that 1 percent of samples included “entire hallucinated phrases or sentences which did not exist in any form in the underlying audio.” It also says 38 percent of those included “explicit harms such as perpetuating violence, making up inaccurate associations, or implying false authority.”
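To put the two percentages together, the short calculation below applies them to a hypothetical batch of 10,000 samples. The batch size is an assumption chosen for easy arithmetic, not a figure from the study, and real rates would vary with audio quality.

```python
# Rates as reported in the article.
HALLUCINATED_RATE = 0.01  # share of samples with fully invented phrases
HARMFUL_SHARE = 0.38      # share of those that included explicit harms

def expected_counts(total_samples: int) -> tuple[float, float]:
    """Return (samples with invented phrases, samples with explicit harms)."""
    hallucinated = total_samples * HALLUCINATED_RATE
    harmful = hallucinated * HARMFUL_SHARE
    return hallucinated, harmful

# 10,000 is a hypothetical batch size, not a number from the study.
hallucinated, harmful = expected_counts(10_000)
print(f"{hallucinated:.0f} with invented phrases, {harmful:.0f} with explicit harms")
```

In other words, the two rates together imply that roughly 0.38 percent of all samples contained harmful invented text. That is a small fraction, but it adds up quickly across a large volume of transcripts.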

In one example cited by AP, a speaker described “two other girls and one lady,” but Whisper added fictional text specifying that they “were Black.” In another example, the audio said, “He, the boy, was going to, I’m not sure exactly, take the umbrella.” Whisper produced: “He took a big piece of a cross, a teeny, small piece … I’m sure he didn’t have a terror knife so he killed a number of people.”

These examples matter because they show a transcript can introduce race, violence, or false action into neutral speech. If such text appears in a business record, public transcript, or medical note, a reader may treat it as something the speaker actually said.

The practical lesson for AI transcription

An OpenAI spokesperson told the AP that the company appreciates the researchers’ findings, actively studies how to reduce fabrications, and incorporates feedback in updates to the model.

The source article also suggests a possible mitigation: a second AI model could identify confusing audio where Whisper is more likely to confabulate, then flag that location so a human can manually check it later. That kind of process would treat AI transcription as a draft requiring review, not as a final record.
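A review-flagging step of this kind might look like the sketch below. It swaps in a simpler heuristic than the second AI model the article describes: it thresholds per-segment confidence signals instead. The field names (avg_logprob, no_speech_prob, compression_ratio) mirror the per-segment output of the open-source whisper Python package. Whether a hosted product exposes them is an assumption, and the threshold values, the ReviewFlag class, and the flag_segments function are illustrative.

```python
from dataclasses import dataclass

# Illustrative thresholds (assumptions); they would need tuning on real audio.
MIN_AVG_LOGPROB = -1.0
MAX_NO_SPEECH_PROB = 0.6
MAX_COMPRESSION_RATIO = 2.4

@dataclass
class ReviewFlag:
    start: float
    end: float
    text: str
    reasons: list[str]

def flag_segments(segments: list[dict]) -> list[ReviewFlag]:
    """Return segments a human should check against the original audio."""
    flags = []
    for seg in segments:
        reasons = []
        if seg["avg_logprob"] < MIN_AVG_LOGPROB:
            reasons.append("low decoder confidence")
        if seg["no_speech_prob"] > MAX_NO_SPEECH_PROB:
            reasons.append("likely silence or noise")
        if seg["compression_ratio"] > MAX_COMPRESSION_RATIO:
            reasons.append("repetitive text")
        if reasons:
            flags.append(ReviewFlag(seg["start"], seg["end"], seg["text"], reasons))
    return flags

# Hypothetical segments with made-up values.
segments = [
    {"start": 0.0, "end": 4.2, "text": "The patient reports a mild headache.",
     "avg_logprob": -0.21, "no_speech_prob": 0.02, "compression_ratio": 1.3},
    {"start": 4.2, "end": 9.8, "text": "Thank you for watching.",
     "avg_logprob": -1.35, "no_speech_prob": 0.81, "compression_ratio": 1.1},
]
for flag in flag_segments(segments):
    print(f"{flag.start:.1f}-{flag.end:.1f}s: {flag.text!r} ({', '.join(flag.reasons)})")
```

Flagging only helps if the original audio is still available for the flagged time range. A workflow that deletes recordings leaves the reviewer nothing to check against.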

The larger lesson is straightforward. Whisper-based AI transcription can be useful, but the article presents clear evidence that it can also fabricate content. In hospitals and other high-risk settings, the danger is not just a messy transcript. It is a clean-looking record that may contain words no one said.