Why AI hallucinations may persist until tests stop rewarding guesses

OpenAI researchers argue that AI hallucinations remain difficult because language models are trained to predict fluent text, not to verify every fact. Their proposed fix focuses on evaluation: penalize confident errors and reward appropriate uncertainty.

AI hallucinations are not just a glitch that disappears as models get larger or more polished. A new research paper from OpenAI argues that the problem is tied both to how large language models learn language and to how the industry measures success.

The central issue is not that chatbots always lack confidence. It is often the opposite: they can produce a fluent answer even when the answer is false. That makes hallucinations especially hard for users to spot, because the mistake may arrive in the same clear tone as a correct response.

What OpenAI Means By Hallucinations

OpenAI defines hallucinations as “plausible but false statements generated by language models.” The company also acknowledges that, even with improvements, hallucinations “remain a fundamental challenge for all large language models” and will never be completely eliminated.

The research paper looks at large language models like GPT-5 and chatbots like ChatGPT, asking why these systems can still be wrong in ways that look convincing. The question matters because users often rely on chatbots for direct answers, not just general writing help.

To show the problem in simple terms, the researchers asked “a widely used chatbot” about the title of Adam Tauman Kalai’s PhD dissertation. The chatbot gave three different answers, and all were wrong. Kalai is one of the paper’s authors.

The researchers then asked about his birthday. Again, the chatbot produced three different dates. Again, every answer was wrong.

That example captures the practical danger of AI hallucinations: a model can supply a specific answer where the safer response would be uncertainty. The output may look useful because it is detailed, but the detail can be invented.

Why Fluency Is Not The Same As Truth

The paper points to pretraining as part of the explanation. During pretraining, models learn by predicting the next word. That process helps them absorb patterns in fluent language, but the training statements do not arrive with true or false labels attached.

As the researchers put it, “The model sees only positive examples of fluent language and must approximate the overall distribution.” In plain English, the model is learning what language tends to look like, not receiving a built-in fact-checking system for every possible claim.

This distinction helps explain why some errors fade as models scale while others remain stubborn. The researchers write, “Spelling and parentheses follow consistent patterns, so errors there disappear with scale.” Those are areas where repeated structure gives the model a strong pattern to follow.

But facts are different when they are arbitrary and rare. The paper gives the example of “a pet’s birthday,” which cannot be reliably inferred from general language patterns. If the model has no dependable basis for the answer, it may still produce one because producing fluent text is what it has learned to do.

The Incentive Problem In AI Evaluation

OpenAI’s proposed solution focuses less on changing the initial pretraining process and more on changing how large language models are evaluated. The paper argues that today’s evaluation methods do not directly cause hallucinations, but they “set the wrong incentives.”

The researchers compare the issue to multiple-choice tests where guessing can be rational. If a blank answer earns nothing, but a random answer might be correct, the test encourages the test-taker to guess.

The same logic can apply to model evaluations. When a model is graded only by accuracy, meaning the share of questions it gets exactly right, there is a built-in reason to answer even when it does not know. A lucky guess can improve the score, while admitting uncertainty earns nothing.
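
The guessing incentive can be made concrete with a little arithmetic. The sketch below is illustrative only and is not taken from the paper: it assumes a grader that awards 1 point for a correct answer and 0 for anything else, including an "I don't know." The function name and the sample probabilities are hypothetical.

```python
def expected_accuracy_score(p_correct: float, answer: bool) -> float:
    """Expected score on one question under accuracy-only grading.

    Assumption (not from the paper): a correct answer earns 1 point,
    while a wrong answer and an abstention both earn 0.
    """
    if not 0.0 <= p_correct <= 1.0:
        raise ValueError("p_correct must be between 0 and 1")
    # Abstaining always scores 0; answering scores 1 with probability p_correct.
    return p_correct if answer else 0.0


if __name__ == "__main__":
    # Illustrative confidence levels, chosen only for demonstration.
    for p in (0.05, 0.25, 0.50):
        guess = expected_accuracy_score(p, answer=True)
        abstain = expected_accuracy_score(p, answer=False)
        print(f"p={p:.2f}  guess={guess:.2f}  abstain={abstain:.2f}")
```

Under these assumptions, guessing never scores worse than abstaining and scores better whenever the chance of being right is above zero, which is the incentive the researchers describe.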

That matters because evaluation scoreboards shape behavior. If developers optimize models for tests that reward correct answers without properly punishing confident falsehoods, the model has little reason to be cautious. The output may become more assertive than the evidence supports.

What Better Scoring Could Change

The paper suggests borrowing an idea from tests that discourage blind guessing, including tests “like the SAT” that use “negative [scoring] for wrong answers or partial credit for leaving questions blank.” The goal is to make uncertainty a better choice than confident error when the model lacks support.

OpenAI says model evaluations should “penalize confident errors more than you penalize uncertainty, and give partial credit for appropriate expressions of uncertainty.” That would change what success means. A model would not only be rewarded for being right; it would also be measured on whether it knows when not to guess.
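
One way to picture such a scoring rule is sketched below. It is a minimal illustration under stated assumptions, not the paper's actual metric: a correct answer earns 1 point, a wrong answer loses a hypothetical `wrong_penalty`, and abstaining earns a hypothetical `abstain_credit`. The specific numbers are made up for demonstration.

```python
def expected_score(p_correct: float, answer: bool,
                   wrong_penalty: float = 1.0,
                   abstain_credit: float = 0.0) -> float:
    """Expected score on one question when wrong answers are penalized.

    Assumptions (hypothetical, not from the paper): +1 for a correct
    answer, -wrong_penalty for a wrong one, abstain_credit for abstaining.
    """
    if not answer:
        return abstain_credit
    return p_correct * 1.0 - (1.0 - p_correct) * wrong_penalty


def break_even_confidence(wrong_penalty: float = 1.0,
                          abstain_credit: float = 0.0) -> float:
    """Confidence at which answering and abstaining score the same.

    Solves p - (1 - p) * wrong_penalty = abstain_credit for p.
    """
    return (abstain_credit + wrong_penalty) / (1.0 + wrong_penalty)


if __name__ == "__main__":
    # Illustrative settings only: a low-confidence guess (p = 0.25)
    # scored under increasingly strict penalties.
    for penalty in (0.0, 1.0, 3.0):
        guess = expected_score(0.25, answer=True, wrong_penalty=penalty)
        threshold = break_even_confidence(wrong_penalty=penalty)
        print(f"penalty={penalty:.1f}  guess={guess:+.2f}  "
              f"answer only if confidence > {threshold:.2f}")
```

With a penalty of zero the rule reduces to plain accuracy, and any guess is worth taking. As the penalty grows, the confidence needed to justify answering rises, so a model that lacks support for an answer scores better by saying so.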

This is a subtle but important shift for AI safety and reliability. Users do not only need chatbots that answer more questions. They need systems that can separate solid answers from weak ones, especially when the question asks for a specific name, date, title, or personal fact.

The researchers also argue that adding a few separate uncertainty-aware tests is not enough. Their view is that “the widely used, accuracy-based evals need to be updated so that their scoring discourages guessing.” In other words, the main benchmarks need to change, not just the side tests.

The Future Of Trustworthy Chatbots

The paper’s message is practical: hallucinations may persist when models are rewarded for sounding right more than for being appropriately cautious. Better AI evaluation would make uncertainty part of the score, not a weakness to hide.

That does not mean hallucinations will vanish. OpenAI’s own framing says they will never be completely eliminated. But the paper argues that incentives can still move behavior in a better direction.

The closing idea is straightforward: “If the main scoreboards keep rewarding lucky guesses, models will keep learning to guess.” For anyone building or using chatbots, that is the core lesson. The way AI is tested helps decide whether it learns to answer carefully or simply answer confidently.