Why AI hallucinations persist even when models search the web

A new benchmark called Halluhard finds that leading AI models still produce incorrect information in realistic multi-turn conversations. Web search helps, especially with bad references, but it does not reliably ensure that cited sources support the answer.

AI hallucinations remain a stubborn reliability problem, even for large models that can search the web. A new benchmark from researchers in Switzerland and Germany tests that problem in a way that looks closer to real use: multi-turn conversations in sensitive knowledge areas.

The benchmark, called Halluhard, was developed by researchers from Switzerland's EPFL, the ELLIS Institute Tübingen, and the Max Planck Institute for Intelligent Systems. Its results challenge the claim that modern LLMs no longer hallucinate, showing that incorrect information still appears often enough to matter.

What Halluhard Measures

Halluhard is built around 950 initial questions across four sensitive knowledge domains: legal cases, research questions, medical guidelines, and programming. For each initial question, a separate user model created two follow-up questions, turning each case into a realistic three-turn conversation.

That structure matters because many AI systems are not used for one-off answers. People ask follow-ups, refine the task, and expect the model to carry context forward. Halluhard tests whether models can keep their answers grounded as that context grows.

The results show that even the best configuration tested, Claude Opus 4.5 with web search, still hallucinated in about 30 percent of cases. Without web search, that rate rose to around 60 percent. GPT-5.2 Thinking with web search came in at 38.2 percent.

Chinese reasoning models such as Kimi-K2-Thinking and GLM-4.7-Thinking performed worst when measured against their direct counterparts among reasoning models. The source notes that these open models often compete well on other benchmarks, which raises the suspicion that some systems may be optimized for benchmark scores more than for real-world reliability.

Size Helps, But Reasoning Has Limits

The benchmark suggests that bigger models hallucinate less often. Within the GPT-5 family, the average hallucination rate moved from 85.1 percent for GPT-5-nano to 71.8 percent for GPT-5, and then to 53.8 percent for GPT-5.2 Thinking.

The Claude models followed the same trend. Haiku was measured at 79.5 percent, Sonnet at 65.6 percent, and Opus at 60 percent. Larger systems did better, but the error rates remain substantial.

Reasoning also helps, but only up to a point. The source describes reasoning as models thinking longer before answering. That can reduce hallucinations, yet more reasoning compute does not automatically solve the problem.

One reason is that models that reason more can produce longer, more detailed answers. More detail means more claims. More claims create more chances for at least one unsupported or incorrect statement to enter the response.
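The arithmetic behind that point can be sketched under a simplifying assumption that does not come from the benchmark: every claim in a response is wrong independently, with the same probability. The chance of at least one error among n claims is then 1 - (1 - p)^n. The function name and the 5 percent per-claim rate below are illustrative only, not measured values.

```python
def prob_at_least_one_error(per_claim_error: float, num_claims: int) -> float:
    """Chance that at least one of num_claims claims is wrong.

    Assumes claims fail independently with the same probability,
    which is a simplification and not a Halluhard result.
    """
    if not 0.0 <= per_claim_error <= 1.0:
        raise ValueError("per_claim_error must be between 0 and 1")
    if num_claims < 0:
        raise ValueError("num_claims must not be negative")
    return 1.0 - (1.0 - per_claim_error) ** num_claims


# Illustrative per-claim error rate of 5 percent (an assumption).
for n in (1, 5, 10, 20):
    print(f"{n:>2} claims: {prob_at_least_one_error(0.05, n):.1%}")
```

Under these assumptions, the risk of a flawed response grows from 5 percent with a single claim to roughly 40 percent with ten claims, even though no individual claim became less reliable.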

DeepSeek Reasoner is a useful example from the source. It showed no improvement over DeepSeek Chat despite its reasoning capabilities. The researchers also point to a continuing gap between proprietary and open-source models.

Web Search Fixes Some Errors, Not All

Halluhard separates hallucinations into two important categories. Reference grounding asks whether a cited source actually exists. Content grounding asks whether that source actually supports the information the model gives.

This distinction is central to understanding why web search is not a complete fix. A model can point to a real source and still make a claim that the source does not support. The source article gives the example of a claim about the SimpleQA benchmark where the reference was correct but the content was partly invented.
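The two categories can be read as a two-step check, sketched below. This is not Halluhard's actual evaluation pipeline. The class names, the field names, and the idea that a judge supplies two yes-or-no answers per claim are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    GROUNDED = "grounded"
    REFERENCE_ERROR = "reference error"  # the cited source does not exist
    CONTENT_ERROR = "content error"      # the source exists but does not support the claim


@dataclass
class CitedClaim:
    text: str
    source_exists: bool          # hypothetical judge output: is the reference real?
    source_supports_claim: bool  # hypothetical judge output: does it back the claim?


def classify(claim: CitedClaim) -> Verdict:
    """Check reference grounding first, then content grounding."""
    if not claim.source_exists:
        return Verdict.REFERENCE_ERROR
    if not claim.source_supports_claim:
        return Verdict.CONTENT_ERROR
    return Verdict.GROUNDED


# A real source paired with a partly invented claim is a content error.
example = CitedClaim(
    text="placeholder claim about a benchmark",
    source_exists=True,
    source_supports_claim=False,
)
print(classify(example).value)
```

The order of the checks reflects the distinction in the text: a source that does not exist cannot support anything, so content grounding only becomes a question once the reference is real.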

In the research question domain, web search mainly reduced reference errors. For Claude Opus 4.5, the reference error rate dropped from 38.6 percent to 7 percent when web search was enabled. Content grounding errors also fell, from 83.9 percent to 29.5 percent, but the decline was smaller in relative terms and left the rate far higher than for references.

GPT-5.2 Thinking showed a similar pattern. With web search, reference errors fell to 6.4 percent, while content grounding errors remained at 51.6 percent. In plain terms, search helped models find real sources, but it did not guarantee that the answer accurately reflected those sources.
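Using only the Claude Opus 4.5 figures quoted above, a short calculation shows how the two error types compare in absolute and relative terms. The helper name `drop` is made up for this sketch. The four rates are the ones reported in the text.

```python
def drop(before: float, after: float) -> tuple[float, float]:
    """Return (absolute drop in percentage points, relative drop in percent)."""
    absolute = before - after
    relative = 100.0 * absolute / before
    return absolute, relative


# Claude Opus 4.5, research question domain: (without search, with search)
rates = {
    "reference errors": (38.6, 7.0),
    "content grounding errors": (83.9, 29.5),
}

for name, (before, after) in rates.items():
    absolute, relative = drop(before, after)
    print(f"{name}: down {absolute:.1f} points ({relative:.1f}% relative), "
          f"{after:.1f}% remaining")
```

Reference errors fall by about 82 percent in relative terms and content grounding errors by about 65 percent, so after search is enabled the content error rate is still roughly four times the reference error rate.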

Longer Chats Can Compound Mistakes

One of Halluhard's key findings is that hallucination rates increased in later conversation rounds. The researchers explain this through context: models receive the previous conversation and may build on mistakes already made.

Between 3 and 20 percent of incorrect references from the first turn reappeared in later rounds. That means an early error can become part of the conversation's foundation, making later answers less reliable.
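A reappearance rate of this kind could be computed roughly as follows. The function, its inputs, and the reference labels are hypothetical and serve only to show the calculation. They are not the researchers' code or data.

```python
def reappearance_rate(first_turn_bad_refs: set[str],
                      later_turn_refs: list[set[str]]) -> float:
    """Share of incorrect first-turn references cited again in any later turn."""
    if not first_turn_bad_refs:
        return 0.0
    cited_later = set().union(*later_turn_refs)
    repeated = first_turn_bad_refs & cited_later
    return len(repeated) / len(first_turn_bad_refs)


# Toy conversation: five bad references in turn one, one of them reused in turn two.
bad_in_turn_one = {"ref-A", "ref-B", "ref-C", "ref-D", "ref-E"}
cited_in_later_turns = [{"ref-A", "ref-X"}, {"ref-Y"}]

print(f"{reappearance_rate(bad_in_turn_one, cited_in_later_turns):.0%}")
```

In this toy case one of five incorrect references resurfaces, a rate of 20 percent, which would sit at the upper end of the range the benchmark reports.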

The source also notes that previous studies have shown long chats and cluttered context windows degrade AI model performance. Halluhard adds a concrete reliability angle: mistakes are not only made once; they can persist and reappear.

Programming was the exception. In coding tasks, hallucination rates decreased in later rounds. The researchers suspect that coding conversations often narrow over time, moving from broad requests like "build X" to specific issues like "fix this function." Narrower tasks leave less room for unsupported invention.

Niche Knowledge Remains A Weak Spot

The researchers also ran a controlled experiment with 350 short questions to see when models hallucinate and when they refuse to answer. When asked about completely fabricated entities, models tended to abstain.

The bigger problem appeared with niche knowledge, such as rarely cited research papers or artworks from local galleries. In those cases, models hallucinated more often. The researchers explain that niche information may appear only in fragments during training.

Those fragments can be enough to trigger a confident answer, but not enough to produce a correct one. With a completely unknown topic, the model may recognize that it lacks information. With niche knowledge, it may act as though partial familiarity is enough.

Halluhard is also designed to remain difficult for future model generations, because existing tests are becoming less useful for comparing frontier systems. On SimpleQA, GPT-4o with Search Preview already reaches 90 percent accuracy, while GPT-5 Thinking with web search reaches 95.1 percent. Because SimpleQA itself has an estimated error rate of around 3 percent, the source says that is essentially the ceiling.
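The ceiling argument comes down to simple subtraction, sketched here under the assumption that a benchmark error rate of about 3 percent caps achievable accuracy at about 97 percent. The variable and function names are illustrative. The scores are the ones quoted above.

```python
BENCHMARK_ERROR_RATE = 3.0  # percent, the estimate quoted in the text

# Assumption: faulty benchmark items cannot be scored as correct,
# so the practical maximum is 100 minus the benchmark error rate.
CEILING = 100.0 - BENCHMARK_ERROR_RATE

simpleqa_scores = {
    "GPT-4o with Search Preview": 90.0,
    "GPT-5 Thinking with web search": 95.1,
}


def headroom(score: float, ceiling: float = CEILING) -> float:
    """Percentage points left between a score and the practical ceiling."""
    return ceiling - score


for model, score in simpleqa_scores.items():
    print(f"{model}: {headroom(score):.1f} points of headroom")
```

With fewer than two points left for the stronger system, differences between frontier models on such a test become hard to distinguish from noise in the benchmark itself.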

Halluhard combines multi-turn conversations, sensitive knowledge domains, and niche knowledge to keep testing what matters in practical use. Its central message is simple: search, scale, and reasoning all help, but none of them yet removes the need to verify AI answers carefully.