TechCrunch AI, January 9, 2025

Why synthetic data is becoming central to AI training

Elon Musk says the AI field has largely used up the real-world data available for training models. That view puts synthetic data at the center of the next development phase, even as researchers warn it can introduce serious risks.

AI companies are confronting a basic constraint: the internet and other real-world sources may no longer provide enough fresh material to keep training models in the same way. Elon Musk now says that point has effectively arrived, aligning with other AI experts who argue the industry has reached a turning point.

Musk says real-world training data is running out

During a livestreamed conversation with Stagwell chairman Mark Penn on X late Wednesday, Musk said there is little real-world information left to add to AI training pipelines.

“We’ve now exhausted basically the cumulative sum of human knowledge … in AI training,” Musk said. “That happened basically last year.”

Musk owns xAI, so his comments are not coming from the sidelines. They reflect a concern that is already shaping how leading AI labs think about model development, training data, and the next stage of artificial intelligence.

The core issue is not that models have stopped improving. It is that the main ingredient used to build many systems, large collections of human-created data, may no longer be growing fast enough to feed them. If that is true, the industry needs a different way to keep training more capable models.

The idea of “peak data” is spreading

Musk’s comments echo themes raised by former OpenAI chief scientist Ilya Sutskever at NeurIPS in December. Sutskever said the AI industry had reached what he called “peak data,” and predicted that a lack of training data would force a move away from the way models are developed today.

That framing matters because it changes the focus of the AI race. Instead of simply gathering more real-world examples, companies may need to improve how models learn, how training material is generated, and how systems evaluate the quality of that material.

In plain terms, the old growth model depended heavily on access to more human-produced text, code, images, and other information. The emerging question is what happens when that supply is no longer enough to support the next round of model training.

Synthetic data is the proposed path forward

Musk pointed to synthetic data as the main way to supplement real-world data. Synthetic data is generated by AI models themselves, then used as training material for other models or later versions of the same model family.

“The only way to supplement [real-world data] is with synthetic data, where the AI creates [training data],” Musk said. “With synthetic data … [AI] will sort of grade itself and go through this process of self-learning.”
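
As a rough illustration of the generate-then-grade idea Musk describes, the sketch below uses toy stand-ins for both roles: a “generator” that proposes addition problems with occasionally wrong answers, and a “grader” that keeps only the examples that check out. The names `generate_candidates`, `grade`, and `build_synthetic_dataset`, and the 25% error rate, are assumptions made for illustration. Nothing here reflects how xAI or any other lab actually builds its pipelines.

```python
import random


def generate_candidates(rng, count):
    """Hypothetical generator: proposes addition problems with answers.

    Roughly one in four answers is deliberately wrong, to mimic model errors.
    """
    candidates = []
    for _ in range(count):
        a, b = rng.randint(0, 99), rng.randint(0, 99)
        answer = a + b
        if rng.random() < 0.25:  # assumed error rate, chosen for illustration
            answer += rng.choice([-2, -1, 1, 2])
        candidates.append({"question": f"{a} + {b}", "a": a, "b": b, "answer": answer})
    return candidates


def grade(example):
    """Hypothetical grader: here, an exact arithmetic check."""
    return example["a"] + example["b"] == example["answer"]


def build_synthetic_dataset(count=1000, seed=0):
    """Generate candidates, then keep only those the grader accepts."""
    rng = random.Random(seed)
    candidates = generate_candidates(rng, count)
    kept = [example for example in candidates if grade(example)]
    return candidates, kept


if __name__ == "__main__":
    candidates, kept = build_synthetic_dataset()
    print(f"kept {len(kept)} of {len(candidates)} generated examples")
```

In this toy the grader can verify every answer exactly. In practice the grader is more likely to be another model or an automated test, so its own mistakes can pass through into the training set.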

This approach is already being used across the industry. Tech giants including Microsoft, Meta, OpenAI, and Anthropic are using synthetic data to train flagship AI models.

Several examples show how common the method has become:

Microsoft’s Phi-4, which was open sourced early Wednesday, was trained on synthetic data alongside real-world data.
Google’s Gemma models were also trained with synthetic data alongside real-world data.
Anthropic used some synthetic data to develop Claude 3.5 Sonnet, one of its most performant systems.
Meta fine-tuned its most recent Llama series of models using AI-generated data.

Gartner estimates that 60% of the data used for AI and analytics projects in 2024 was synthetically generated. That figure shows synthetic data is not a fringe technique. It is already part of mainstream AI and analytics work.

Cost savings are a major attraction

One reason synthetic data is drawing attention is cost. Training advanced AI systems can be expensive, and generating data with AI may reduce part of that burden.

AI startup Writer claims its Palmyra X 004 model, developed almost entirely from synthetic sources, cost just $700,000 to develop. TechCrunch compares that with estimates of $4.6 million for a comparably sized OpenAI model.

That difference helps explain why companies are willing to experiment with AI-generated training data. If synthetic sources can help produce strong models at lower cost, they become a practical tool, not just a research idea.
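
The two reported figures can be put side by side with a few lines of arithmetic. This is only a back-of-the-envelope comparison: both numbers are a claim and an estimate respectively, and the variable names are made up for this example.

```python
# Figures as reported in the article, in US dollars.
palmyra_cost = 700_000        # Writer's claimed cost for Palmyra X 004
openai_estimate = 4_600_000   # estimate for a comparably sized OpenAI model

ratio = openai_estimate / palmyra_cost        # how many times more expensive
savings = 1 - palmyra_cost / openai_estimate  # fractional cost reduction

print(f"Cost ratio: {ratio:.1f}x")   # Cost ratio: 6.6x
print(f"Reduction:  {savings:.0%}")  # Reduction:  85%
```

On those numbers, the synthetic-data model would cost roughly one-seventh as much, a reduction of about 85%.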

Still, cost is only one side of the question. The value of synthetic data depends on whether it improves models without weakening them in less visible ways.

The risks include model collapse

Synthetic data also carries disadvantages. Some research suggests it can lead to model collapse, a failure mode where a model becomes less “creative” and more biased in its outputs. Over time, that can seriously compromise how the system works.

The reason is straightforward. Synthetic data comes from models, and models are shaped by the data used to train them. If the original training data contains biases and limitations, the generated outputs can carry those same problems forward.

That creates a difficult tradeoff for AI companies. Synthetic data may be necessary if real-world data is exhausted, but relying on it too heavily could amplify flaws already present in the systems producing it.
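
A toy simulation can make the feedback loop concrete. In the sketch below, a simple Gaussian distribution stands in for a “model”: each generation draws a small synthetic dataset from the current fit, and the next generation is fit only to that synthetic data. This is a simplified illustration under stated assumptions, not a reproduction of the research on model collapse. The function name `simulate_recursive_training` and all parameter values are hypothetical.

```python
import random
import statistics


def simulate_recursive_training(generations=200, samples_per_generation=10, seed=0):
    """Repeatedly fit a Gaussian to samples drawn from the previous fit.

    Returns the fitted standard deviation at each generation, starting
    with the assumed "real" distribution (mean 0, standard deviation 1).
    """
    rng = random.Random(seed)
    mean, std = 0.0, 1.0  # generation 0: stand-in for real-world data
    history = [std]
    for _ in range(generations):
        # The current model generates a small synthetic dataset...
        synthetic = [rng.gauss(mean, std) for _ in range(samples_per_generation)]
        # ...and the next model is fit only to that synthetic data.
        mean = statistics.fmean(synthetic)
        std = statistics.pstdev(synthetic)
        history.append(std)
    return history


if __name__ == "__main__":
    history = simulate_recursive_training()
    for generation in (0, 10, 50, 100, 200):
        print(f"generation {generation:3d}: std = {history[generation]:.6f}")
```

With small samples, each refit tends to slightly underestimate the spread of the data it was drawn from, and those small losses compound. Over many generations the fitted standard deviation drifts toward zero, a loose analogue of outputs becoming less varied when models train mostly on their own output.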

The debate now is not whether synthetic data will be used. It already is. The harder question is how AI developers can use it without allowing models to recycle their own weaknesses into future generations.