Fake AI Citations Slip Into NeurIPS Papers Despite Peer Review

GPTZero says it found 100 confirmed fake citations across 51 papers accepted by NeurIPS after scanning all 4,841 accepted papers. The finding is small in statistical terms, but it highlights how LLM-generated errors can enter even high-profile AI research workflows.

GPTZero scanned all 4,841 papers accepted by the Conference on Neural Information Processing Systems, better known as NeurIPS, and found 100 citations, spread across 51 papers, that it confirmed as hallucinated. The conference took place last month in San Diego and is one of the most prestigious venues in artificial intelligence research.

The numbers do not suggest a collapse of academic quality. They do, however, show how easily small AI-generated errors can pass through systems built for serious technical review.

What GPTZero Found

According to the source article, GPTZero reviewed every paper accepted by NeurIPS and identified 100 confirmed hallucinated citations. Those citations appeared across 51 papers out of 4,841 accepted papers.

That scale matters. Each accepted paper can include dozens of references, meaning the full pool of citations runs into the tens of thousands. Against that backdrop, the confirmed fake citations amount to a very small share of the total.

NeurIPS told Fortune that even if 1.1% of the papers have one or more incorrect references due to the use of LLMs, the content of the papers themselves is not necessarily invalidated. That distinction is important: a bad reference is not the same thing as a disproven result.
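The arithmetic behind these proportions can be checked in a few lines. The sketch below uses only the three figures reported above (4,841 accepted papers, 51 affected papers, 100 fake citations). The references-per-paper values are assumptions chosen purely for illustration, since the article says only that a paper can include dozens of references.

```python
# Figures reported in the article.
ACCEPTED_PAPERS = 4841
PAPERS_WITH_FAKES = 51
FAKE_CITATIONS = 100


def share_percent(part: int, whole: int) -> float:
    """Return part / whole expressed as a percentage."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return 100.0 * part / whole


def citation_share(refs_per_paper: int) -> float:
    """Share of all citations that are fake, for an ASSUMED average
    number of references per paper (not a figure from the article)."""
    return share_percent(FAKE_CITATIONS, ACCEPTED_PAPERS * refs_per_paper)


paper_share = share_percent(PAPERS_WITH_FAKES, ACCEPTED_PAPERS)

if __name__ == "__main__":
    print(f"Papers with at least one fake citation: {paper_share:.2f}%")
    # Hypothetical averages, used only to show how small the share stays.
    for refs in (20, 40, 60):
        print(f"Assuming {refs} references per paper: "
              f"{citation_share(refs):.3f}% of citations are fake")
```

The share of affected papers comes out just above 1%, which rounds to the 1.1% figure quoted by NeurIPS. Under any of the assumed reference counts, the share of fake citations stays well below that.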

Still, the discovery is notable because NeurIPS is not a casual forum. Having work accepted there is a major credential in AI research. The papers come from people expected to understand both the strengths and weaknesses of large language models.

Why Fake Citations Matter

A hallucinated citation is not just a formatting mistake. In research, citations connect claims to prior work. They help readers verify context, trace ideas, and judge whether an argument is built on real scholarship.

Citations also function as a professional signal. They are used as a measure of influence, showing how often one researcher’s work is used by others. When AI systems invent references, that signal becomes less reliable.

The problem is especially awkward in AI because the suspected source of the mistakes is the same technology many of these researchers study and build around. Large language models can produce convincing text, but they can also generate references that look plausible while pointing to work that does not exist.

That creates a practical challenge for research teams. If an LLM helps with the tedious parts of writing, including references, someone still has to verify the output. The burden does not disappear because the task is boring.
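One way a team might automate a first pass at that verification is to compare each bibliography title against a list of works the authors know to be real, and flag anything without a close match for manual checking. The sketch below is a minimal illustration under stated assumptions: it is not GPTZero's method, the function names and the 0.9 threshold are hypothetical, and the trusted list is assumed to come from a source the authors maintain themselves, such as a reference manager export.

```python
import re
from difflib import SequenceMatcher


def normalize(title: str) -> str:
    """Lowercase a title and collapse punctuation and extra whitespace."""
    return " ".join(re.sub(r"[^a-z0-9]+", " ", title.lower()).split())


def best_match_score(title: str, known_titles: list[str]) -> float:
    """Return the highest similarity (0.0 to 1.0) between the title
    and any entry in the trusted list."""
    target = normalize(title)
    return max(
        (SequenceMatcher(None, target, normalize(k)).ratio()
         for k in known_titles),
        default=0.0,
    )


def flag_unverified(bibliography: list[str], known_titles: list[str],
                    threshold: float = 0.9) -> list[str]:
    """Return bibliography titles with no close match in the trusted list.
    The threshold is an assumed value, not a calibrated one."""
    return [t for t in bibliography
            if best_match_score(t, known_titles) < threshold]


if __name__ == "__main__":
    # Trusted list: works the authors have actually read (assumed input).
    known = ["Attention Is All You Need",
             "Deep Residual Learning for Image Recognition"]
    # Draft bibliography; the second entry is a placeholder standing in
    # for a reference that an LLM might have invented.
    draft = ["Attention is all you need.",
             "Placeholder Title Standing In For An Invented Reference"]
    for title in flag_unverified(draft, known):
        print("Needs manual check:", title)
```

A flagged entry is not proof of fabrication, since the trusted list may simply be incomplete. The point of the design is that every unmatched reference gets a human look before submission.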

Peer Review Has Limits

The source article is careful not to blame peer reviewers for missing every bad citation. NeurIPS papers are reviewed by multiple people, and reviewers are instructed to flag hallucinations. But the review process is dealing with a very large volume of submissions and references.

GPTZero framed its work as an effort to provide specific data on how AI-generated sloppiness can enter conferences through heavy submission pressure. The startup described a submission tsunami that has strained review pipelines to the breaking point.

GPTZero also pointed to a May 2025 paper called "The AI Conference Peer Review Crisis," which discussed the problem at premier conferences, including NeurIPS.

In that context, a few missed references are not surprising. Peer review is designed to evaluate the strength, relevance, and clarity of research. It can catch errors, but it is not always a citation-by-citation forensic audit.

The Responsibility Still Starts With Authors

The sharper question is why the researchers themselves did not catch the fake references before submission. Authors should know which papers they actually relied on. If an LLM produced a bibliography entry, the final responsibility for checking that entry still sits with the people submitting the work.

That responsibility becomes more important as AI tools move deeper into academic writing. A model can speed up drafting, organizing, and polishing, but it cannot be treated as a source of truth simply because its output looks formal.

For readers outside academia, the lesson is broader. If leading AI experts can miss small but concrete LLM errors in high-stakes work, then ordinary users should be cautious about using AI-generated details without verification.

The NeurIPS case is not evidence that the accepted research is invalid. It is evidence that credible-looking errors can survive even serious workflows. That is the issue going forward: as AI becomes part of professional writing, review systems must account for mistakes that look polished enough to pass at a glance.

What This Means For AI Research

The discovery leaves two ideas in tension. On one hand, 100 confirmed fake citations across 51 papers is a small finding within a large body of work. On the other hand, fake citations undermine a basic mechanism researchers use to connect, credit, and verify knowledge.

That tension is likely to shape how conferences, authors, and reviewers think about LLM use. The issue is not whether researchers should ever use AI tools. The issue is whether the details those tools produce are checked with enough care before publication.

For NeurIPS, the episode is an embarrassment but not necessarily a verdict on the papers themselves. For the wider AI field, it is a reminder that accuracy in small details is part of trust. When AI invents a reference, the error may be narrow, but the credibility cost can spread further than the citation list.