Ars Technica AI March 28, 2025

Why LLM hallucinations can start with misplaced confidence

New Anthropic research offers a closer look at why Claude sometimes answers when it should hold back. The work suggests that a recognized name or entity can weaken the model’s internal refusal pathway, even when the model lacks the specific knowledge needed to answer reliably.

Large language models can produce answers that sound confident even when they are not supported by what the model actually knows. New research from Anthropic, described in Ars Technica, gives a more concrete explanation for how that can happen inside Claude.

The key finding is not that the model simply chooses to invent information. Instead, Anthropic’s work points to internal neural network features that can push Claude toward answering or toward refusing, depending on whether a prompt activates signals tied to familiar or unfamiliar entities.

What Anthropic looked for inside Claude

Anthropic previously used sparse auto-encoders to examine groups of artificial neurons that activate around internal concepts. Those groups are called “features” in the research. Examples include concepts such as “Golden Gate Bridge” and “programming errors.”
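
For readers unfamiliar with the technique, the basic shape of a sparse autoencoder can be sketched in a few lines. This is a generic toy illustration, not Anthropic's code: the sizes, the random untrained weights, and the names `encode`, `decode`, `sae_loss`, and `l1_coeff` are all assumptions made for the example. The idea is to re-express an activation vector as a wider vector of nonnegative "feature" activations, while penalizing both reconstruction error and total feature activity.

```python
import random

def relu(v):
    return [max(0.0, x) for x in v]

def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def encode(x, w_enc, b_enc):
    """Map an activation vector to nonnegative feature activations."""
    pre = matvec(w_enc, x)
    return relu([p + b for p, b in zip(pre, b_enc)])

def decode(f, w_dec, b_dec):
    """Reconstruct the activation vector from feature activations."""
    rec = matvec(w_dec, f)
    return [r + b for r, b in zip(rec, b_dec)]

def sae_loss(x, w_enc, b_enc, w_dec, b_dec, l1_coeff=0.01):
    """Reconstruction error plus an L1 penalty that encourages sparsity."""
    f = encode(x, w_enc, b_enc)
    x_hat = decode(f, w_dec, b_dec)
    mse = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
    sparsity = sum(f)  # features are nonnegative, so the sum is the L1 norm
    return mse + l1_coeff * sparsity, f

random.seed(0)
d_model, n_features = 4, 8  # toy sizes, chosen only for illustration
w_enc = [[random.uniform(-1, 1) for _ in range(d_model)] for _ in range(n_features)]
w_dec = [[random.uniform(-1, 1) for _ in range(n_features)] for _ in range(d_model)]
b_enc = [0.0] * n_features
b_dec = [0.0] * d_model

x = [0.5, -1.0, 0.25, 2.0]  # stand-in for one internal activation vector
loss, features = sae_loss(x, w_enc, b_enc, w_dec, b_dec)
active = [i for i, a in enumerate(features) if a > 0]
print(f"loss={loss:.3f}, active features={active}")
```

In practice the weights would be learned by minimizing this loss over many activation vectors, and the individual feature directions are then inspected for interpretable concepts.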

The newer research expands that approach by tracing how those features can influence other groups of neurons involved in Claude’s response process. The work looks at internal computational decision “circuits,” including pathways that appear to affect whether Claude answers a question or declines to answer it.

Anthropic’s pair of papers covers several areas, including how Claude “thinks” in multiple languages, how some jailbreak techniques can affect it, and whether its “chain of thought” explanations are accurate. But the section on “entity recognition and hallucination” is especially useful because it connects a familiar user problem to an observable internal mechanism.

The model is built to continue text

At a basic level, large language models take text and predict what text is likely to come next. That design can be powerful when a prompt is close to material represented in the model’s training data. It can also create trouble when the prompt concerns “relatively obscure facts or topics.”

Anthropic writes that this design “incentivizes models to guess plausible completions for blocks of text.” That is one way to understand why an LLM may produce a fluent answer instead of admitting uncertainty.
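
The pull toward a plausible completion can be illustrated with a deliberately crude stand-in for a language model. The sketch below is a toy bigram predictor built on a made-up four-sentence corpus; it is not how Claude works, and the names `predict_next` and `continue_prompt` are hypothetical. It only shows that a system trained purely to continue text will complete a sentence about an unknown name just as readily as one about a known name.

```python
from collections import Counter, defaultdict

# Tiny made-up corpus; a real model trains on vastly more text.
corpus = (
    "michael jordan plays basketball . "
    "lebron james plays basketball . "
    "serena williams plays tennis . "
    "lionel messi plays football ."
).split()

# Count which word follows each word in the corpus.
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def predict_next(word):
    """Return the most frequent follower of `word`, or None if unseen."""
    counts = following.get(word)
    if not counts:
        return None
    return counts.most_common(1)[0][0]

def continue_prompt(prompt):
    """Predict one more word, conditioning only on the last word."""
    return predict_next(prompt.lower().split()[-1])

print(continue_prompt("Michael Jordan plays"))  # basketball
print(continue_prompt("Michael Batkin plays"))  # basketball (a guess)
```

The second completion is fluent and plausible, but nothing in the predictor checks whether the name refers to anyone it has seen before.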

Fine-tuning is meant to reduce that tendency. In the fine-tuned assistant version of Claude, Anthropic found features associated with a “known entity” and with an “unfamiliar name.” These features appear to influence whether the model keeps its internal “can’t answer” circuit active.

When the model sees an unfamiliar name, that feature tends to promote the “can’t answer” pathway. The resulting answer may begin with wording along the lines of “I apologize, but I cannot…” According to the research, this refusal pathway tends to be on by default in the fine-tuned assistant model.

When recognition weakens refusal

The situation changes when Claude encounters a familiar entity. A name such as “Michael Jordan” can activate a “known entity” feature. That, in turn, can make the neurons in the “can’t answer” circuit “inactive or more weakly active.”

Once that refusal pathway is weakened, Claude can move into related internal features and produce an answer. For a question such as “What sport does Michael Jordan play?”, that behavior can work because the model has relevant connected information.

The same mechanism can become a problem when recognition is mistaken for reliable knowledge. Anthropic found that artificially increasing weights in the “known answer” feature could make Claude confidently hallucinate information about made-up athletes such as “Michael Batkin.”

That result led the researchers to suggest that “at least some” hallucinations may come from a misfire in the circuit that inhibits the “can’t answer” pathway. In plain terms, the model may act as if a name is answerable even when it is not well represented enough to support the answer.
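
The mechanism described in this section can be caricatured as a refusal signal that is on by default and is inhibited by a known-entity signal. The sketch below is a toy model under stated assumptions, not Anthropic's circuit: the activation values, the `threshold`, the `inhibition` strength, and the function names are all hypothetical, and the "steering" step simply overrides one number to mimic artificially boosting a feature.

```python
REFUSAL = "I apologize, but I cannot answer that."
GUESS = "[plausible-sounding guess]"

def cant_answer_activation(known_entity, inhibition=1.0, default=1.0):
    """Refusal signal: on by default, pushed down by the known-entity feature."""
    return max(0.0, default - inhibition * known_entity)

def respond(name, known_entity, facts, threshold=0.5):
    """Refuse if the refusal signal stays high; otherwise commit to an answer."""
    if cant_answer_activation(known_entity) >= threshold:
        return REFUSAL
    # With refusal suppressed, the toy model returns a stored fact or,
    # lacking one, a placeholder standing in for a confident guess.
    return facts.get(name, GUESS)

facts = {"Michael Jordan": "basketball"}

# Hypothetical known-entity activations for a familiar and a made-up name.
print(respond("Michael Jordan", 0.9, facts))  # basketball
print(respond("Michael Batkin", 0.1, facts))  # refusal text

# "Steering": artificially raise the known-entity activation for the
# made-up name. Refusal is suppressed even though no fact is stored.
print(respond("Michael Batkin", 0.9, facts))  # [plausible-sounding guess]
```

The third call is the toy analogue of the misfire: the only thing that changed is the strength of the recognition signal, not the amount of stored knowledge.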

Recognition is not the same as recall

The most important distinction in the research is between recognizing something and knowing the specific answer being requested. Claude may recognize a person’s name without having enough detailed information to answer a narrower question about that person.

Anthropic gives an example involving AI researcher Andrej Karpathy. When asked to name a paper written by him, Claude produced the plausible but misattributed title “ImageNet Classification with Deep Convolutional Neural Networks,” which is a real paper, but not one Karpathy wrote.

By contrast, when asked the same kind of question about Anthropic mathematician Josh Batson, Claude responded that it “cannot confidently name a specific paper… without verifying the information.”

After changing feature weights, Anthropic researchers theorized that the Karpathy case may happen because Claude recognizes the name strongly enough to activate “known answer/entity” features. Those features can suppress the default refusal circuit, even though the model lacks more specific information about the names of Karpathy’s papers.

That creates a failure mode that is easy for users to miss:

1. The model recognizes a name or topic.
2. Recognition weakens the internal refusal pathway.
3. The model commits to answering.
4. If specific knowledge is missing, the answer may be a plausible guess.

This is why a confident LLM answer can be especially misleading. The confidence may reflect that the model has crossed an internal threshold for answering, not that the requested fact is actually available to it.
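
The four steps above can be traced in a small sketch that keeps recognition and recall in separate tables. Everything numeric here is a hypothetical placeholder: the familiarity scores, the `ANSWER_THRESHOLD`, and the `FACTS` store are assumptions for illustration, not measurements from the research.

```python
# Hypothetical entity-level familiarity scores (0 to 1); not measured values.
FAMILIARITY = {"Andrej Karpathy": 0.8, "Josh Batson": 0.2}

# Hypothetical store of specific facts the toy model can recall, keyed by
# (entity, question type). It deliberately contains no "paper" entries.
FACTS = {("Andrej Karpathy", "field"): "AI research"}

ANSWER_THRESHOLD = 0.5  # assumed cutoff for suppressing refusal

def answer(entity, question):
    """Return (response_kind, steps) tracing the failure mode."""
    steps = []
    familiarity = FAMILIARITY.get(entity, 0.0)
    steps.append(f"recognition={familiarity}")
    if familiarity < ANSWER_THRESHOLD:
        steps.append("refusal pathway stays active")
        return "refusal", steps
    steps.append("refusal pathway weakened; model commits to answering")
    if (entity, question) in FACTS:
        steps.append("specific fact available")
        return "grounded answer", steps
    steps.append("specific fact missing; output is a plausible guess")
    return "ungrounded guess", steps

for entity, question in [
    ("Andrej Karpathy", "field"),
    ("Andrej Karpathy", "paper"),
    ("Josh Batson", "paper"),
]:
    kind, steps = answer(entity, question)
    print(f"{entity} / {question}: {kind}")
    for step in steps:
        print("   ", step)
```

Note that the decision to answer depends only on the entity-level score, which is why the grounded answer and the ungrounded guess look equally confident from the outside.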

Why this matters for safer AI answers

The research does not claim to solve hallucinations. Anthropic warns that the current investigatory process “only captures a fraction of the total computation performed by Claude.” It also requires “a few hours of human effort” to understand the circuits and features involved in even a short prompt “with tens of words.”

Still, the work points toward a more precise way to study LLM hallucinations. Instead of treating confabulation as a vague flaw, researchers can examine which features push a model to answer and which circuits encourage it to refuse.

One possible implication from Anthropic’s analysis is that more robust and specific “known entity” features could help a model distinguish between broad familiarity and detailed knowledge. A model that can make that distinction more cleanly may be better at deciding when it should answer and when it should say it cannot do so confidently.
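
One way to picture that distinction is to compare a gate that looks only at entity familiarity with a gate that also requires a signal for the specific fact being asked about. This is a speculative sketch, not a mechanism proposed in the research: the two scores, the `threshold`, and both function names are hypothetical.

```python
def entity_only_gate(entity_familiarity, threshold=0.5):
    """Answer whenever the entity itself is familiar enough."""
    return entity_familiarity >= threshold

def specific_gate(entity_familiarity, question_knowledge, threshold=0.5):
    """Answer only when the entity AND the requested fact are both well known."""
    return min(entity_familiarity, question_knowledge) >= threshold

# Hypothetical case: a familiar name, but little knowledge of the
# particular detail requested.
familiarity, knowledge = 0.8, 0.1
print(entity_only_gate(familiarity))          # True: commits to an answer
print(specific_gate(familiarity, knowledge))  # False: declines instead
```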

For users, the lesson is straightforward: a model’s familiarity with a subject is not proof that it can answer every question about that subject. For researchers, the finding offers a clearer target: understand the internal switch between refusal and response, then make that switch more reliable.