Nvidia is pushing deeper into synthetic data, a technology that could shape how developers build and fine-tune future AI systems. The company has acquired Gretel, a synthetic data firm, for nine figures, according to two people with direct knowledge of the deal.
The purchase price exceeds Gretel’s most recent valuation of $320 million, the sources say, though the exact terms remain unknown. Nvidia and Gretel both declined to comment.
Why Gretel Matters to Nvidia
Gretel will be folded into Nvidia along with its team of approximately 80 employees. Its technology is expected to become part of Nvidia’s expanding set of cloud-based, generative AI services for developers.
Gretel was founded in 2019 by Alex Watson, John Myers, and Ali Golshan, who also serves as CEO. The company offers a synthetic data platform and APIs for developers who want to build generative AI models but face limits around available training data or privacy concerns involving real people’s data.
Gretel does not build and license its own frontier AI models. Instead, it fine-tunes existing open source models, adds differential privacy and safety features, and packages those capabilities for customers. Before the acquisition, the company raised more than $67 million in venture capital funding, according to Pitchbook.
The Data Problem Behind AI Growth
Synthetic data is computer-generated information designed to resemble real-world data. It is not written by people or observed directly, but it is still intended to be useful for training models.
Supporters see synthetic data as a way to make AI development more scalable and less labor-intensive. It may also make training data more accessible to smaller or less-resourced AI developers that cannot gather vast real-world datasets on their own.
Privacy is another major reason companies are interested. Health care providers, banks, and government agencies may want to build AI systems without exposing sensitive information about real people. Synthetic data can help create useful training material while reducing the need to share the original data with outside stakeholders or software partners.
The pressure is growing because experts worry AI companies may not be able to keep relying as freely on human-created internet content. A report from MIT’s Data Provenance Initiative showed that restrictions around open web content were increasing.
Nvidia’s Existing Synthetic Data Push
The Gretel acquisition fits into work Nvidia has already been doing. In 2022, the company launched Omniverse Replicator, which allows developers to create custom, physically accurate synthetic 3D data for training neural networks.
Last June, Nvidia also began rolling out Nemotron-4 340B, a family of open AI models that generate synthetic training data for developers building or fine-tuning LLMs. Nvidia said the models could support work across “health care, finance, manufacturing, retail, and every other industry.”
At Nvidia’s annual developer conference this Tuesday, cofounder and chief executive Jensen Huang described the central problems the company sees in scaling AI efficiently.
“There are three problems that we focus on,” he said. “One, how do you solve the data problem? How and where do you create the data necessary to train the AI? Two, what’s the model architecture? And then three, what are the scaling laws?”
Huang also described how Nvidia is using synthetic data generation in robotics platforms. That matters because robotics systems often need large amounts of training data tied to physical environments, motion, and interaction.
Benefits, Limits, and Model Collapse
Ana-Maria Cretu, a postdoctoral researcher at the École Polytechnique Fédérale de Lausanne in Switzerland who studies synthetic data privacy, says synthetic data can be used in different ways. One example is tabular data, including demographic or medical data, where it can help address scarcity or create a more diverse dataset.
Cretu gives the example of a hospital building an AI model to track a certain type of cancer with a small dataset from 1,000 patients. Synthetic data could help expand the dataset, reduce bias, and anonymize real human data. As she puts it, “This also offers some privacy protection, whenever you cannot disclose the real data to a stakeholder or software partner.”
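To make the hospital example concrete, here is a minimal Python sketch of expanding a small tabular dataset. It is an illustration only, not Gretel's or any vendor's method: the `age` and `stage` columns, the function names, and the made-up 1,000-row stand-in dataset are all assumptions. It models each column independently, so it ignores correlations between columns, and it offers no formal privacy guarantee.

```python
import random
import statistics
from collections import Counter


def fit_columns(rows, numeric_cols, categorical_cols):
    """Summarize each column on its own (hypothetical helper for illustration)."""
    models = {}
    for col in numeric_cols:
        values = [row[col] for row in rows]
        # A numeric column is reduced to its mean and sample standard deviation.
        models[col] = ("numeric", statistics.mean(values), statistics.stdev(values))
    for col in categorical_cols:
        # A categorical column is reduced to its category frequencies.
        counts = Counter(row[col] for row in rows)
        models[col] = ("categorical", list(counts.keys()), list(counts.values()))
    return models


def sample_rows(models, n_rows, rng):
    """Draw synthetic rows from the per-column summaries."""
    rows = []
    for _ in range(n_rows):
        row = {}
        for col, model in models.items():
            if model[0] == "numeric":
                row[col] = rng.gauss(model[1], model[2])
            else:
                row[col] = rng.choices(model[1], weights=model[2], k=1)[0]
        rows.append(row)
    return rows


rng = random.Random(0)

# Made-up stand-in for a 1,000-patient dataset (not real medical data).
real_rows = [
    {"age": rng.gauss(60, 12), "stage": rng.choice(["I", "II", "III"])}
    for _ in range(1000)
]

models = fit_columns(real_rows, ["age"], ["stage"])
synthetic_rows = sample_rows(models, 5000, rng)

real_mean_age = statistics.mean(row["age"] for row in real_rows)
synthetic_mean_age = statistics.mean(row["age"] for row in synthetic_rows)
print(len(synthetic_rows), round(real_mean_age, 1), round(synthetic_mean_age, 1))
```

A production system would typically use a trained generative model that captures relationships between columns, and would need separate safeguards before the output could be treated as privacy-preserving.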
For large language models, however, synthetic data is pitched as the answer to a different, broader question: “How can we just increase the amount of data we have for LLMs over time?” That is where the risks become more serious.
A July 2024 article in Nature highlighted how AI language models could “collapse,” or substantially degrade, when repeatedly fine-tuned on data produced by other models. The concern is that model-generated data, used again and again, could lower quality rather than improve it.
Alexandr Wang, the chief executive of Scale AI, shared the Nature findings on X and wrote, “While many researchers today view synthetic data as an AI philosopher’s stone, there is no free lunch.” Wang later said he believes in a hybrid data approach.
One of Gretel’s cofounders pushed back in a blog post, saying the “extreme scenario” of repeated training only on synthetic data “is not representative of real-world AI development practices.” Cretu also notes that most researchers and computer scientists use a mix of synthetic and real-world data, adding that “you might possibly be able to get around model collapse by having fresh data with every new round of training.”
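The intuition behind both the collapse concern and the fresh-data remedy can be seen in a toy simulation. The Python sketch below is an assumption-laden illustration, not the experiment from the Nature article: the "model" is just a Gaussian fitted to 20 numbers, and the function name, sample size, generation count, and 50 percent fresh-data share are arbitrary choices. Each generation fits the Gaussian to its training data, and the next generation trains on samples drawn from that fit, optionally mixed with fresh draws from the true distribution.

```python
import random
import statistics


def run_generations(n_samples, n_generations, fresh_fraction, rng):
    """Repeatedly fit a Gaussian and retrain on samples from the previous fit.

    fresh_fraction is the share of each round's training data drawn from the
    true distribution (a standard normal here) instead of the previous model.
    Returns the fitted standard deviation recorded at each generation.
    """
    n_fresh = int(n_samples * fresh_fraction)
    n_synthetic = n_samples - n_fresh

    # The first generation trains on real data only.
    data = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    history = []
    for _ in range(n_generations):
        mean = statistics.mean(data)
        stdev = statistics.stdev(data)
        history.append(stdev)
        # The next round's data comes from the fitted model, plus any fresh data.
        synthetic = [rng.gauss(mean, stdev) for _ in range(n_synthetic)]
        fresh = [rng.gauss(0.0, 1.0) for _ in range(n_fresh)]
        data = synthetic + fresh
    return history


rng = random.Random(42)
pure = run_generations(20, 1000, 0.0, rng)   # synthetic data only
mixed = run_generations(20, 1000, 0.5, rng)  # half fresh data every round

print("synthetic only, final spread:", pure[-1])
print("half fresh data, final spread:", mixed[-1])
```

In this setup the synthetic-only run loses a little of the original spread with each round of estimation error until almost none is left, while the mixed run stays anchored near the true spread. Real language models are far more complex, so this only shows the shape of the argument, not its magnitude.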
Why the Industry Is Still Moving Ahead
The risks have not stopped major AI companies from using synthetic data. Sam Altman reportedly discussed OpenAI’s ability to use existing AI models to create more data at a recent Morgan Stanley tech conference. Anthropic CEO Dario Amodei has said it may be possible to build “an infinite data-generation engine” if a small amount of new information is injected during training.
Other large technology companies are also using the approach. Meta has discussed training Llama 3 with synthetic data, including some generated from Llama 2. Amazon’s Bedrock platform lets developers use Anthropic’s Claude to generate synthetic data. Microsoft’s Phi-3 small language model was trained partly on synthetic data, though the company has warned that synthetic data from pre-trained large language models can sometimes reduce accuracy and increase bias.
Nvidia’s acquisition of Gretel does not settle the debate. It does show that synthetic training data is moving from a research topic into the infrastructure layer of AI development. The central question is no longer whether companies will use synthetic data, but how carefully they will combine it with real-world data, privacy safeguards, validation, and model design.