AI systems are becoming part of how people search, study, and interpret science. But recent studies show a basic weakness: some tools can draw on retracted scientific papers and present the material as if it still belongs in the scientific record.
That matters because the answer may look grounded. A chatbot may cite a real paper and use real research language, while failing to tell the user that the work has been withdrawn or flagged as unreliable.
What the studies found
One study by Weikuan Gu, a medical researcher at the University of Tennessee in Memphis, and his team tested OpenAI’s ChatGPT running on the GPT-4o model. They asked questions based on information from 21 retracted papers about medical imaging.
The chatbot referenced retracted papers in five cases. It advised caution in only three of those cases. For other questions, it cited non-retracted papers, but the study authors noted that the model may not have recognized the retraction status of the articles.
A separate study from August looked at ChatGPT-4o mini. Researchers used it to evaluate the quality of 217 retracted and low-quality papers from different scientific fields. None of the chatbot’s responses mentioned retractions or other concerns.
The source article notes that no similar studies have been released on GPT-5, which came out in August.
Why retracted research is different from a fake citation
AI search tools and chatbots are already known to fabricate links and references. Retracted papers create a different problem. The paper exists, the citation may be real, and the content may sound technical and credible.
That can make the error harder for users to notice. If a person reads only the answer and does not click through to check the paper, they may miss the fact that the research has been retracted.
Gu described the issue plainly: the chatbot is “using a real paper, real material, to tell you something.” The problem is that real material can still be misleading when the scientific community has removed it from the reliable record.
Yuanxi Fu, an information science researcher at the University of Illinois Urbana-Champaign, said retraction is an important quality signal for tools facing the general public. She also said there is “kind of an agreement that retracted papers have been struck off the record of science,” and that people outside science should be warned when a paper has been retracted.
The issue goes beyond ChatGPT
In June, MIT Technology Review also tested AI tools advertised for research work. The test used questions based on the 21 retracted papers in Gu’s study.
The results showed that several tools cited retracted papers without noting the retractions:
- Elicit referenced five of the retracted papers.
- Ai2 ScholarQA, now part of the Allen Institute for Artificial Intelligence’s Asta tool, referenced 17.
- Perplexity referenced 11.
- Consensus referenced 18.
Some companies have since taken steps to address the problem. Christian Salem, cofounder of Consensus, said that until recently the company did not have strong retraction data in its search engine. Consensus has started using retraction data from publishers, data aggregators, independent web crawling, and Retraction Watch, which manually curates and maintains a database of retractions.
In a test of the same papers in August, Consensus cited only five retracted papers.
Elicit said it removes retracted papers flagged by the scholarly research catalogue OpenAlex from its database and is still working on aggregating sources of retractions. Ai2 said its tool does not currently detect or remove retracted papers automatically. Perplexity said that it never claims to be 100% accurate. OpenAI did not respond to a request for comment about the paper’s results.
Why fixing the problem is difficult
Using retraction databases sounds like a straightforward solution, but the source article makes clear that the data problem is messy. Ivan Oransky, the cofounder of Retraction Watch, is careful not to call it a comprehensive database. He said a complete database would require more resources than anyone has, because the work has to be done by hand to be accurate.
Publishers also do not mark problematic papers in one uniform way. Caitlin Bakker of the University of Regina in Canada, an expert in research and discovery tools, said retracted work can be labeled in very different ways. Labels can include “correction,” “expression of concern,” “erratum,” and “retracted.”
Those labels can also point to different kinds of issues, including concerns about content, methodology, data, or conflicts of interest. For an AI system, that variation makes it harder to treat every warning sign consistently.
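One way to picture the consistency problem is as a normalization step that maps free-text publisher labels onto a small set of statuses. The sketch below is illustrative only: the `NoticeStatus` categories, the `LABEL_MAP` table, and the `normalize_label` function are hypothetical names, and the grouping of "correction" with "erratum" is an assumption, not a rule taken from any publisher or from the tools discussed here.

```python
from enum import Enum


class NoticeStatus(Enum):
    """Hypothetical coarse statuses a tool might track for a paper."""
    RETRACTED = "retracted"
    CONCERN = "expression_of_concern"
    CORRECTED = "corrected"
    UNKNOWN = "unknown"


# Assumed mapping from publisher label text to a status.
# Real publishers use many more variants than are listed here.
LABEL_MAP = {
    "retracted": NoticeStatus.RETRACTED,
    "retraction": NoticeStatus.RETRACTED,
    "expression of concern": NoticeStatus.CONCERN,
    "correction": NoticeStatus.CORRECTED,
    "erratum": NoticeStatus.CORRECTED,
}


def normalize_label(label: str) -> NoticeStatus:
    """Map a raw label to a status, ignoring case and extra whitespace."""
    key = " ".join(label.lower().split())
    # Unrecognized labels are reported as UNKNOWN rather than guessed at.
    return LABEL_MAP.get(key, NoticeStatus.UNKNOWN)
```

Returning `UNKNOWN` for unrecognized labels, instead of treating them as "no problem", keeps an unfamiliar warning from being silently dropped.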
Copies of papers can also be scattered across preprint servers, paper repositories, and other websites. If a paper is retracted after a model’s training cutoff date, Fu said the model’s responses might not immediately reflect what has changed. Aaron Tay, a librarian at Singapore Management University, said most academic search engines do not do a real-time check against retraction data, which leaves users dependent on the accuracy of the corpus being searched.
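The kind of query-time check Tay describes can be sketched roughly as follows, assuming a tool already holds a set of retracted DOIs gathered from whatever sources it trusts. Everything here is hypothetical: the `Citation` type, the `normalize_doi` and `flag_retracted` functions, and the example DOIs are made up for illustration and do not describe how any of the tools named above work.

```python
from dataclasses import dataclass


@dataclass
class Citation:
    """A paper an answer is about to cite (hypothetical structure)."""
    title: str
    doi: str


def normalize_doi(doi: str) -> str:
    """Lowercase a DOI and strip common URL or 'doi:' prefixes."""
    doi = doi.strip().lower()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):].strip()
            break
    return doi


def flag_retracted(citations, retracted_dois):
    """Pair each citation with True if its DOI is in the retracted set."""
    retracted = {normalize_doi(d) for d in retracted_dois}
    return [(c, normalize_doi(c.doi) in retracted) for c in citations]


# Made-up example data, not real papers.
retracted_dois = ["10.1234/example.001"]
citations = [
    Citation("Paper A", "https://doi.org/10.1234/EXAMPLE.001"),
    Citation("Paper B", "doi:10.1234/example.002"),
]
flags = flag_retracted(citations, retracted_dois)
```

Because the check runs when an answer is assembled, it can catch a retraction issued after a model's training cutoff, but it is only as complete as the retraction list it is given, and copies of a paper that carry no DOI would slip past it.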
What better AI research tools may need
Experts in the source article point toward richer context, not just better citation lists. Oransky and others advocate making more information available for models to use when generating an answer. That could include peer reviews commissioned by journals and critiques from the review site PubPeer.
Some publishers, including Nature and the BMJ, publish retraction notices as separate articles linked to the paper and outside paywalls. Fu said companies need to make effective use of that information, along with news articles in a model’s training data that mention a paper’s retraction.
The stakes are practical. The public asks AI chatbots for medical advice and help diagnosing health conditions. Students and scientists increasingly use science-focused AI tools to review existing literature and summarize papers. In August, the US National Science Foundation invested $75 million in building AI models for science research.
For now, the safest lesson is that an AI answer about science is not enough on its own. Users need to check sources, and tool builders need stronger ways to surface retractions and other warnings. Tay summarized the current moment with a simple caution: “We are at the very, very early stages, and essentially you have to be skeptical.”