The Decoder, December 6, 2025

Why deep research AI still fills gaps with fake facts

A study from Oppo's AI team found that deep research systems often fail during execution rather than because they misunderstand the task. The clearest risk is fabricated detail: nearly 20 percent of errors came from plausible content the systems appeared to invent.

AI research agents are being sold on a simple promise: give them a complex question, and they will return a structured report with evidence, citations, and analysis. A study from Oppo's AI team shows why that promise still needs caution.

The researchers examined around 1,000 reports and found systematic weaknesses in current deep research systems. The most important finding is not just that these systems make mistakes. It is that, when they hit a gap, they may fill it with material that sounds credible but is not supported by the available evidence.

Where the failures show up

The study used two new evaluation tools: FINDER, a benchmark for deep research tasks, and DEFT, a taxonomy for classifying failures. Together, they were designed to test whether research agents can handle complex reporting work that requires hard evidence and strict methodology.

The researchers identified 14 error types across three categories: reasoning, retrieval, and generation. Generation issues were the largest category at 39 percent. Retrieval failures followed at 33 percent, while reasoning errors accounted for 28 percent.

That breakdown matters because it shows that the problem is broader than a single bad citation or a misunderstood prompt. A system can plan a report, search for material, and produce fluent prose while still failing to connect its claims to reliable evidence.

Nearly 20 percent of errors came from systems inventing plausible-sounding but entirely fake content. In other words, some agents did not merely omit information. They produced details that appeared specific enough to be useful, while lacking support.

Fake precision is a serious warning sign

One example from the study involved an investment fund. A system claimed that the fund achieved an exact 30.2 percent annual return over 20 years. Since that specific data is not public, the researchers concluded that the figure was likely fabricated.

This kind of error is especially hard for readers to catch. A vague answer may invite skepticism, but a precise number can create a false sense of authority. The more exact the claim appears, the more it can look like the product of careful research.
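One way a reader or reviewer might screen for this pattern is to flag sentences that state a precise figure without pointing to a source. The sketch below is a rough heuristic and is not taken from the study: the function name, the rule that a decimal percentage counts as "precise", and the bracketed-number citation format are all assumptions made for illustration.

```python
import re

# Assumption: a decimal number followed by "%" or "percent" counts as a
# "precise" figure, and a bracketed number such as [3] counts as a citation.
PRECISE_FIGURE = re.compile(r"\b\d+\.\d+\s*(?:%|percent\b)")
CITATION_MARK = re.compile(r"\[\d+\]")


def flag_unsupported_figures(report: str) -> list[str]:
    """Return sentences that state a precise figure but carry no citation mark."""
    # Split only where sentence punctuation is followed by whitespace, so the
    # decimal point inside "30.2" does not end a sentence.
    sentences = re.split(r"(?<=[.!?])\s+", report.strip())
    return [
        s for s in sentences
        if PRECISE_FIGURE.search(s) and not CITATION_MARK.search(s)
    ]


report = (
    "The fund achieved a 30.2 percent annual return over 20 years. "
    "Its fees were 1.5 percent in 2020 [2]. "
    "Performance data for earlier years is not public."
)
flagged = flag_unsupported_figures(report)
for sentence in flagged:
    print("CHECK:", sentence)
```

A flag here does not mean the figure is wrong, and a citation mark does not mean it is right. The check only tells a human where to look first.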

A second example involved scientific papers. In that test, a system listed 24 references. When checked, several links were dead, while others pointed to reviews rather than original research. The system nevertheless insisted it had verified every source.

That pattern is a major problem for anyone relying on AI-generated research reports. Citations are supposed to make claims easier to inspect. If the citation layer itself is unreliable, the report can become harder to trust, not easier.
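The two citation problems described above, dead links and sources that are not original research, can be checked separately from the report text. The following is a minimal sketch under stated assumptions: the `Citation` type, the `kind` label (assigned by a reviewer, not detected automatically), and the injected `fetch_status` function are hypothetical, and a stand-in lookup table replaces real HTTP requests so the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Citation:
    url: str
    kind: str  # label assigned by a reviewer, e.g. "original" or "review"


def audit_citations(
    citations: list[Citation],
    fetch_status: Callable[[str], int],
) -> dict[str, list[str]]:
    """Sort citation URLs into buckets a human reviewer can inspect."""
    result: dict[str, list[str]] = {"ok": [], "dead": [], "not_original": []}
    for c in citations:
        try:
            status = fetch_status(c.url)
        except OSError:
            status = 0  # treat network failures as unreachable
        if not 200 <= status < 300:
            result["dead"].append(c.url)
        elif c.kind != "original":
            result["not_original"].append(c.url)
        else:
            result["ok"].append(c.url)
    return result


# Stand-in for a real HTTP check, so the example needs no network access.
fake_statuses = {
    "https://example.org/paper-a": 200,
    "https://example.org/paper-b": 404,
    "https://example.org/review-c": 200,
}
citations = [
    Citation("https://example.org/paper-a", "original"),
    Citation("https://example.org/paper-b", "original"),
    Citation("https://example.org/review-c", "review"),
]
audit = audit_citations(citations, lambda url: fake_statuses.get(url, 0))
print(audit)
```

The point of the design is that "verified" becomes a result produced by a check, not a statement the system makes about itself.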

The issue is execution, not only understanding

According to the study, most systems understand the assignment. The breakdown happens when the work does not go according to plan.

If a system intends to analyze a database but cannot access it, it may fail to change course. Instead of explaining the access problem or narrowing the scope, it may fill the missing sections with hallucinated content.

The researchers call the missing capability "reasoning resilience": the ability to adapt when the original plan stops working. In real-world research, that flexibility can matter more than raw analytical ability.

This is a practical distinction. A strong research process should be able to say what it found, what it could not find, and where the evidence is incomplete. A weak one may turn uncertainty into a smooth paragraph.
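A minimal sketch of that distinction, assuming a research run can be modeled as a set of named steps: when a step fails, the failure is recorded as a gap and nothing is substituted for the missing result. The `ResearchLog` type, `run_steps`, and the step names are hypothetical and do not come from the study or from any of the tested systems.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ResearchLog:
    findings: dict[str, str] = field(default_factory=dict)
    gaps: dict[str, str] = field(default_factory=dict)

    def summary(self) -> str:
        lines = [f"FOUND {name}: {text}" for name, text in self.findings.items()]
        lines += [f"MISSING {name}: {reason}" for name, reason in self.gaps.items()]
        return "\n".join(lines)


def run_steps(steps: dict[str, Callable[[], str]]) -> ResearchLog:
    """Run each step; on failure, record the gap instead of inventing a result."""
    log = ResearchLog()
    for name, step in steps.items():
        try:
            log.findings[name] = step()
        except Exception as exc:  # broad on purpose: any failure becomes a gap
            log.gaps[name] = f"{type(exc).__name__}: {exc}"
    return log


def query_database() -> str:
    # Simulates a blocked step, such as a database the system cannot access.
    raise PermissionError("database access denied")


log = run_steps({
    "press_coverage": lambda: "3 articles located",
    "fund_returns": query_database,
})
print(log.summary())
```

A report built from such a log would show the missing section as missing, which is the behavior the weak process fails to produce.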

Top systems still have a long way to go

The study tested commercial tools including Gemini 2.5 Pro Deep Research and OpenAI's o3 Deep Research, along with open-source alternatives. Gemini 2.5 Pro ranked highest, but it scored only 51 out of 100 points.

OpenAI's o3 stood out for factual accuracy, with nearly 66 percent of its citations correct. That result suggests some systems may be better at grounding claims than others, but it also means that roughly one in three of its citations did not hold up.

The researchers argue that these agents do not fail mainly because they are confused by the prompt. They fail because they struggle to integrate evidence and handle uncertainty. That diagnosis points to a different kind of improvement: systems need better ways to expose gaps instead of covering them.

The team released the FINDER and DEFT frameworks on GitHub so the community can work on more reliable agents. The goal is not simply to produce longer or faster reports. It is to make the research process easier to evaluate.

Why this matters now

The timing is important. Since late 2024, Google, Perplexity, Grok, and OpenAI have rolled out deep research features that promise comprehensive reports in minutes, often by scraping hundreds of websites at once.

The study suggests that more data is not enough. If a system cannot judge evidence, recover from blocked steps, or admit uncertainty, a larger pool of sources may simply give it more ways to make mistakes.

The industry is aware of the limitation. OpenAI recently admitted that LLM-based systems like ChatGPT will likely never stop fabricating information entirely. To address the problem, the company is working on features that allow the system to indicate its certainty level.

OpenAI is also experimenting with "confessions", a mechanism where the system generates a separate follow-up note admitting if it made something up or was unsure.

The larger lesson is that a research agent needs a reliable way to say "I don't know". Until that becomes a normal part of the workflow, polished AI reports should be treated as drafts that require verification, especially when they contain precise figures, confident citations, or claims that depend on unavailable evidence.