MIT Tech Review, AI, July 24, 2024

Why AI models can decay when trained on AI-made web junk

New research published in Nature found that AI output can degrade when models are trained on AI-generated data. The concern is not instant collapse for current systems, but slower improvement, weaker performance, and a growing need to track where training data comes from.

AI models depend on large stores of internet data. That creates a growing problem: the same web they learn from is increasingly being filled with AI-generated junk content.

New research published in Nature suggests that when AI systems learn from the output of earlier AI systems, their own output can gradually become worse. The process can compound over generations, raising concerns for large AI models that rely on the internet as a major source of training material.

What Model Collapse Means

Ilia Shumailov, a computer scientist at the University of Oxford who led the study, compares the problem to copying a picture again and again. Each copy adds noise. After enough repetitions, the useful signal can disappear.

What is left of the picture, in his words, is a “dark square.” The machine-learning equivalent of that dark square is called “model collapse,” where the system’s output degrades into incoherent material.

The issue matters because many powerful models are built by training on huge amounts of material gathered from the internet. GPT-3, for example, was trained in part on data from Common Crawl, an online repository of over 3 billion web pages.

If the web contains more AI-written junk pages, future models may have a harder time finding clean, representative, human-generated data. Shumailov is not saying that current AI models are about to collapse. The nearer-term risk is more measured: improvements may slow down, and performance may suffer.

How The Researchers Tested The Risk

Shumailov and his colleagues tested the effect by fine-tuning a large language model on Wikipedia data. They then fine-tuned the next model on output from the previous one, repeating the process over nine generations.

To measure deterioration, the team used a “perplexity score,” which reflects how confident a model is in predicting the next part of a sequence. A low score means the model assigned high probability to the text that actually came next. A higher score means a less accurate model.
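
To make the metric concrete, here is a minimal sketch of the standard perplexity calculation: the exponential of the average negative log-probability a model gave to each token that actually appeared. The function name and the probability lists are hypothetical illustrations, not values from the study.

```python
import math

def perplexity(token_probs):
    """Perplexity from the probabilities a model assigned to each actual next token."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    # Average negative log-likelihood per token, then exponentiate.
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

# Hypothetical per-token probabilities (assumed for illustration only).
confident = [0.9, 0.8, 0.95, 0.85]
uncertain = [0.2, 0.1, 0.3, 0.25]

print("confident model:", perplexity(confident))
print("uncertain model:", perplexity(uncertain))
```

A model that always assigns probability 0.5 to the correct token has a perplexity of 2, which is one way to read the score: roughly how many equally likely options the model is choosing between at each step.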

The pattern was clear in the experiment: models trained on the outputs of other models showed higher perplexity scores. In practical terms, the language became less reliable as synthetic output fed later training rounds.

One example from the study shows the drift. The researchers gave the model a passage about masons, parish labourers, architects, and Perpendicular church towers. By the ninth and final generation, the model produced an answer that veered into an odd sequence about “black @-@ tailed jackrabbits,” “white @-@ tailed jackrabbits,” and other color variations.

The point is not only that the sentence sounded strange. It showed how a model trained repeatedly on model-generated data can move away from the original distribution of information and produce less coherent text.
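
The generation-by-generation drift can be illustrated with a toy far simpler than the study's setup. The sketch below assumes a "model" that is just a table of token frequencies, trained each round only on samples from the previous round's table. The corpus, the function names, and the choice of nine rounds to mirror the article are all assumptions for illustration; this is not the researchers' code.

```python
import random
from collections import Counter

def fit(samples):
    """'Train' a unigram model: estimate each token's probability from counts."""
    counts = Counter(samples)
    total = len(samples)
    return {tok: c / total for tok, c in counts.items()}

def generate(model, n, rng):
    """Sample n tokens from the fitted model."""
    tokens = list(model)
    weights = [model[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=n)

def run_generations(original, generations, n, seed=0):
    """Refit on the previous model's output each round; track distinct tokens left."""
    rng = random.Random(seed)
    model = fit(original)
    support_sizes = [len(model)]
    for _ in range(generations):
        synthetic = generate(model, n, rng)  # training data is purely model output
        model = fit(synthetic)
        support_sizes.append(len(model))
    return support_sizes

# Hypothetical corpus: 5 common tokens (40 copies each) and 20 rare ones (1 copy each).
original = ["common%d" % i for i in range(5)] * 40 + ["rare%d" % i for i in range(20)]
sizes = run_generations(original, generations=9, n=len(original))
print(sizes)
```

The list of distinct-token counts can never grow, because a token that fails to appear in one round's samples has zero probability in every later model. That one-way loss of rare material is a simplified version of moving away from the original distribution.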

Why Synthetic Data Is Both Useful And Risky

The pressure behind synthetic data is straightforward. The internet does not contain an unlimited amount of data, while foundation models rely heavily on scale to perform well.

Shayne Longpre, who studies how LLMs are trained at the MIT Media Lab and did not take part in the research, says foundation models depend on large-scale data. He also notes that developers are turning to synthetic data produced in curated, controlled environments because continued web crawling may bring diminishing returns.

That distinction matters. Synthetic data is not automatically harmful in every setting. Matthias Gerstgrasser, an AI researcher at Stanford who authored a different paper examining model collapse, says that adding synthetic data to real-world data, rather than replacing it, does not cause major issues.

But Gerstgrasser also says the model collapse literature agrees on one central point: high-quality and diverse training data is important.

That makes the challenge less about banning AI-generated data and more about how it is used. If synthetic material replaces broad, diverse, real-world data, the risk grows. If it is curated and added carefully, the outcome can be different.

Who Could Be Most Affected

One concern is that degradation may not affect all information equally. Over time, models may distort information related to minority groups because they can overfocus on samples that are more common in the training data.

Robert Mahari, who studies computational law at the MIT Media Lab and did not take part in the research, says that in current models the problem may affect underrepresented languages, because those languages require more synthetic, AI-generated data sets.
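
A short piece of arithmetic shows why rare material is the first to go. Under the simplifying assumption that a synthetic data set is n independent draws from a model, an item with frequency p is missing from the whole set with probability (1 - p)^n. The function name and the numbers below are hypothetical, and real training pipelines are not this simple.

```python
def prob_missing(p, n):
    """Chance that an item with frequency p never appears in n independent draws."""
    return (1.0 - p) ** n

n = 1000  # hypothetical size of one synthetic data set
for p in (0.1, 0.01, 0.001):
    print("frequency", p, "-> chance of vanishing:", prob_missing(p, n))
```

With these assumed numbers, an item seen once per thousand samples has better than a one-in-three chance of disappearing in a single round, while an item seen once per ten samples is almost certain to survive.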

This is an important implication. If a model already has less real-world training material for a language or group, replacing missing coverage with synthetic data can deepen the imbalance. The result may be less accurate output for the areas where accuracy is already harder to achieve.

For users, this means model quality is not a single universal property. A system can appear strong in common cases while performing worse on less represented information. The training data mix helps determine where those weaknesses show up.

Why Data Provenance Becomes Central

The study points toward one possible way to reduce degradation: keep giving future models access to original human-generated data. In one part of Shumailov’s study, future generations were allowed to sample 10% of the original data set, which mitigated some of the negative effects.
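
One way to picture that mitigation is a training set where a fixed share is always drawn from the original data. The sketch below is an assumption about how such a mix could be assembled, not the procedure used in the study. The function name, the sampling with replacement, and the example pools are hypothetical; only the 10% figure comes from the article.

```python
import random

def build_training_set(original, synthetic, size, original_fraction=0.10, seed=0):
    """Draw a training set in which a fixed share comes from the original data."""
    rng = random.Random(seed)
    n_original = round(size * original_fraction)
    n_synthetic = size - n_original
    # Sample with replacement so either pool may be smaller than the request.
    picked = [(x, "original") for x in rng.choices(original, k=n_original)]
    picked += [(x, "synthetic") for x in rng.choices(synthetic, k=n_synthetic)]
    rng.shuffle(picked)
    return picked

# Hypothetical pools standing in for human-written and model-generated text.
original = ["human text %d" % i for i in range(50)]
synthetic = ["model text %d" % i for i in range(500)]

batch = build_training_set(original, synthetic, size=200)
n_from_original = sum(1 for _, origin in batch if origin == "original")
print(n_from_original, "of", len(batch), "examples come from the original data")
```

Tagging each example with its pool keeps the mix auditable, which only works if the original data can still be told apart from model output.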

That approach depends on data provenance, meaning a trail that connects later training data back to its original human-generated source. Without that trail, it becomes difficult to know whether a model is learning from human material, AI output, or a blend that has already passed through earlier systems.
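
As a rough illustration of what such a trail could look like, the sketch below stores one record per document with an origin label and an optional link to the document it was derived from. The record fields, labels, and function names are hypothetical; no existing provenance standard or tool is implied.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class ProvenanceRecord:
    doc_id: str
    origin: str                      # "human", "synthetic", or "unknown"
    parent_id: Optional[str] = None  # document this one was derived from, if any

def trace_to_root(doc_id, records):
    """Follow parent links back to the earliest known ancestor."""
    seen = set()
    current = records[doc_id]
    while current.parent_id is not None and current.parent_id in records:
        if current.doc_id in seen:
            raise ValueError("cycle in provenance chain")
        seen.add(current.doc_id)
        current = records[current.parent_id]
    return current

def has_human_root(doc_id, records):
    """True only if the chain ends at a document labeled as human-written."""
    return trace_to_root(doc_id, records).origin == "human"

# Hypothetical records: a human source, two generations of model output, one stray page.
records = {r.doc_id: r for r in [
    ProvenanceRecord("wiki-1", "human"),
    ProvenanceRecord("gen1-1", "synthetic", parent_id="wiki-1"),
    ProvenanceRecord("gen2-1", "synthetic", parent_id="gen1-1"),
    ProvenanceRecord("scraped-1", "unknown"),
]}

print(has_human_root("gen2-1", records))     # chain leads back to wiki-1
print(has_human_root("scraped-1", records))  # no trail, so it cannot be confirmed
```

The "unknown" case is the hard one in practice: a scraped page with no record cannot be placed anywhere on the chain.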

The problem is that provenance remains hard to implement across the internet. Tools exist to identify whether text is AI-generated, but they are often inaccurate.

Shumailov’s conclusion is cautious rather than final. “Unfortunately, we have more questions than answers,” he says. Still, he argues that knowing where data comes from, and how much it can be trusted to represent the target information, is clearly important.

For AI development, that may become one of the defining questions. Bigger models need more data, but more data is not automatically better if it contains too much low-quality synthetic material. The future of model quality may depend as much on filtering, provenance, and diversity as on scale.