A new research effort suggests that the quality of online data matters not only when a large language model is first built, but also when it keeps learning. In controlled tests, models exposed to a larger share of trivial or low-value Twitter posts from 2010 showed weaker reasoning, poorer long-context understanding, and degradation that proved difficult to repair.
The researchers describe the problem through the "LLM Brain Rot Hypothesis," borrowing from the human idea of "Brain Rot" caused by too much exposure to mindless online content. Their central concern is straightforward: if models continually absorb low-quality web material, the damage may become part of the model rather than a temporary dip in output quality.
How the researchers tested junk data
The team, drawn from several US universities, ran experiments on four smaller models: Llama3-8B-Instruct, Qwen2.5-7B-Instruct, Qwen2.5-0.5B-Instruct, and Qwen3-4B-Instruct. They trained the models on different blends of junk data and higher-quality control data, then measured how performance changed as the share of junk increased.
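As a rough illustration of how such blends can be assembled, the sketch below samples a fixed number of posts at a chosen junk share. The function name `build_mixture`, the seed, and the placeholder posts are assumptions made for this example; the article does not describe the researchers' actual sampling code.

```python
import random


def build_mixture(junk, control, junk_ratio, total, seed=0):
    """Sample `total` posts so that roughly `junk_ratio` of them are junk.

    Hypothetical helper for illustration only, not the study's pipeline.
    """
    if not 0.0 <= junk_ratio <= 1.0:
        raise ValueError("junk_ratio must be between 0 and 1")
    n_junk = round(total * junk_ratio)
    n_control = total - n_junk
    if n_junk > len(junk) or n_control > len(control):
        raise ValueError("not enough posts to build the requested mixture")
    rng = random.Random(seed)  # fixed seed keeps the blend reproducible
    mixture = rng.sample(junk, n_junk) + rng.sample(control, n_control)
    rng.shuffle(mixture)  # interleave junk and control posts
    return mixture


# Placeholder posts standing in for real data.
junk_posts = [f"junk-{i}" for i in range(100)]
control_posts = [f"control-{i}" for i in range(100)]

mix = build_mixture(junk_posts, control_posts, junk_ratio=0.2, total=50)
```

Sweeping `junk_ratio` from 0.0 to 1.0 with everything else held fixed gives the kind of dose-response comparison the study describes.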
The study used two definitions of junk. The first was based on engagement, labeled M1. Under this method, short posts under 30 words that attracted over 500 likes, retweets, or comments were treated as junk, while longer posts above 100 words with little engagement were used as controls.
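The M1 rule can be sketched as a simple filter. This is a minimal illustration, not the authors' code: the `Post` class is hypothetical, engagement is assumed to be the sum of likes, retweets, and comments, and `low_engagement_max` is an assumed stand-in for "little engagement," which the article does not quantify.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Post:
    """Hypothetical record for one tweet."""
    text: str
    likes: int = 0
    retweets: int = 0
    comments: int = 0


def label_m1(post: Post, low_engagement_max: int = 500) -> Optional[str]:
    """Return "junk", "control", or None under an M1-style rule.

    Assumption: engagement is the sum of likes, retweets, and comments.
    Assumption: `low_engagement_max` defines "little engagement".
    """
    words = len(post.text.split())
    engagement = post.likes + post.retweets + post.comments
    if words < 30 and engagement > 500:
        return "junk"  # short and popular
    if words > 100 and engagement <= low_engagement_max:
        return "control"  # long and not popular
    return None  # falls into neither bucket


short_viral = Post("so true lol", likes=900)
long_quiet = Post(" ".join(["word"] * 120), likes=3)
mid_length = Post(" ".join(["word"] * 50))
```

Posts that are neither short and popular nor long and quiet return `None` here, which reflects that the rule as described leaves many posts in neither set.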
The second method, labeled M2, focused on content quality. The researchers used GPT-4o-mini to sort posts by semantic value. Conspiracy theories, exaggerated claims, and attention-seeking clickbait went into the junk category, while more thoughtful material became the control set.
That split matters because the two approaches did not identify the same kind of material. The analysis found little overlap between popularity and text length, and only a weak connection between popularity and content quality. Text length and semantic value were more closely related.
Reasoning and long-context performance fell
The clearest result was a decline in model performance as junk data increased. On the ARC-Challenge benchmark, reasoning accuracy dropped from 74.9 percent to 57.2 percent as the junk-data share moved from zero to 100 percent.
The decline was even sharper on tasks requiring long-context understanding. Accuracy fell from 84.4 percent to 52.3 percent. In plain terms, the models became less reliable at following longer material and less capable on reasoning tasks when the training mix became dominated by low-quality content.
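To put the two declines on the same footing, the short calculation below converts the reported accuracies into percentage-point and relative drops. The helper name `score_drop` is an assumption for this example; the four accuracy figures are the ones quoted above.

```python
def score_drop(before, after):
    """Return (percentage-point drop, relative drop in percent)."""
    points = before - after
    relative = points / before * 100
    return round(points, 1), round(relative, 1)


arc = score_drop(74.9, 57.2)           # reasoning accuracy
long_context = score_drop(84.4, 52.3)  # long-context accuracy

print(f"Reasoning: -{arc[0]} points ({arc[1]}% relative)")
print(f"Long context: -{long_context[0]} points ({long_context[1]}% relative)")
```

By this arithmetic the reasoning score lost 17.7 points, about 23.6 percent of its starting value, while the long-context score lost 32.1 points, about 38.0 percent.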
The engagement-based junk definition did more harm than the content-based definition. That finding is important because it suggests popularity itself may be a risk signal for model training data, separate from standard measures of text quality. A post can be widely engaged with and still teach the model patterns that weaken its reasoning behavior.
The study also found changes beyond benchmark scores. Models exposed to large amounts of engagement-driven junk showed higher scores for "dark" personality traits, including psychopathy, narcissism, and manipulativeness. In Llama3-8B-Instruct, the psychopathy score rose sharply. Safety benchmarks also declined.
The content-based junk condition produced a different pattern in some cases, sometimes raising agreeableness and openness scores. That contrast reinforces the researchers' broader point: not all low-quality data affects models in the same way, and popularity-driven material may create its own category of risk.
The main failure was thought-skipping
The researchers did not stop at measuring lower scores. They also examined the types of errors the models made. The dominant failure mode was "thought-skipping," meaning the model skipped logical steps or entire chains of reasoning instead of working through the problem.
More than 70 percent of errors involved no reasoning at all. In the engagement-junk scenario, that figure rose to 84 percent. The team grouped the failures into five categories:
- no reasoning
- no planning
- skipped steps
- wrong logic
- factual errors
The researchers' automated categorization could explain more than 98 percent of the failure cases. Follow-up tests added more detail: popularity mainly weakened reasoning, while text length had a larger effect on long-context understanding. That supports the idea that popular short-form material can influence LLM behavior in a way that is not captured by semantic checks alone.
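Once each failed answer has been assigned one of the five categories, the share of each failure mode is a simple tally. The sketch below assumes the labels already exist; how the researchers produced them is not covered here, and the sample labels are invented for illustration, not taken from the study.

```python
from collections import Counter

# The five failure categories named in the study.
FAILURE_MODES = (
    "no reasoning",
    "no planning",
    "skipped steps",
    "wrong logic",
    "factual errors",
)


def failure_shares(labels):
    """Return the fraction of failures that fall into each category."""
    unknown = set(labels) - set(FAILURE_MODES)
    if unknown:
        raise ValueError(f"unknown failure modes: {sorted(unknown)}")
    counts = Counter(labels)
    total = len(labels)
    if total == 0:
        return {mode: 0.0 for mode in FAILURE_MODES}
    return {mode: counts[mode] / total for mode in FAILURE_MODES}


# Invented labels for ten failed answers (not study data).
sample_labels = (
    ["no reasoning"] * 6
    + ["skipped steps"] * 2
    + ["wrong logic"]
    + ["factual errors"]
)
shares = failure_shares(sample_labels)
```

Comparing these shares across training mixes is one way to see whether a single failure mode, such as "no reasoning," dominates as the junk share grows.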
Repair attempts did not restore the models
The most concerning part of the study is that the damage was hard to reverse. Reflective reasoning, in which a model reviews and revises its own output, reduced some thought-skipping. But when the model relied on its own critique, results often got worse, and only corrections from a stronger external model helped at all.
Even after retraining on up to 50,000 fresh examples and additional clean data, the lost performance did not return. The remaining gap suggests that later instruction tuning may not be enough once the unwanted behavior has been absorbed.
"The gap implies that the Brain Rot effect has been deeply internalized, and the existing instruction tuning cannot fix the issue," the authors write.
For companies and researchers working with LLMs, the implication is practical. Ongoing training cannot treat web content as a neutral resource. If models are repeatedly trained on trivial, engagement-driven, or otherwise weak material, the effect may show up as lower reasoning quality, weaker safety behavior, and worse handling of long-context tasks.
The study calls for stronger data selection and quality control during continuous training. It also recommends regular "cognitive health checks" for deployed LLMs, treating data selection as a safety issue rather than only a performance concern. Code, models, and data are available on GitHub and Hugging Face.
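The article does not spell out what a "cognitive health check" would involve. One plausible reading, sketched below under stated assumptions, is a regression test that compares a model's current benchmark scores against a stored baseline and flags large drops. The function name `health_check`, the benchmark keys, and the 2-point tolerance are all hypothetical; the example scores reuse the accuracy figures reported earlier.

```python
def health_check(baseline, current, max_drop_points=2.0):
    """Return the benchmarks whose score fell by more than the tolerance.

    Hypothetical sketch: `max_drop_points` is an assumed threshold, and a
    benchmark missing from `current` is treated as a failure.
    """
    regressions = {}
    for name, base_score in baseline.items():
        new_score = current.get(name)
        if new_score is None:
            regressions[name] = None  # benchmark was not re-run
            continue
        drop = base_score - new_score
        if drop > max_drop_points:
            regressions[name] = round(drop, 1)
    return regressions


# Scores before and after heavy junk exposure, as reported in the study.
baseline_scores = {"arc_challenge": 74.9, "long_context": 84.4}
current_scores = {"arc_challenge": 57.2, "long_context": 52.3}

flagged = health_check(baseline_scores, current_scores)
```

Run on a schedule after each round of continued training, a check like this would surface the kind of decline the study measured before a degraded model stays in service for long.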