AI systems are built from data, and the source of that data shapes what the systems can do. New findings from the Data Provenance Initiative show that the data pipeline behind modern AI is becoming less transparent, more dependent on the web, and more favorable to the largest technology companies.
What the audit found
The Data Provenance Initiative, a group of over 50 researchers from both academia and industry, examined nearly 4,000 public data sets. Those data sets spanned over 600 languages, 67 countries, and three decades, and came from 800 unique sources and nearly 700 organizations.
The central question was simple: where does the data to build AI come from? The answer matters because AI developers and researchers often lack clear information about what is inside massive data sets and where that material originally came from.
AI data collection practices remain immature compared with the sophistication of AI model development. That gap creates practical problems for model builders, researchers, and anyone trying to understand why AI systems behave the way they do.
From curated sources to scraped web data
In the early 2010s, AI data sets were drawn from a wider mix of sources. Shayne Longpre, a researcher at MIT and part of the project, said data came not only from encyclopedias and the web, but also from parliamentary transcripts, earnings calls, and weather reports.
Those earlier data sets were often selected for specific tasks. The source material was more deliberately collected, and the relationship between the task and the data was easier to understand.
That pattern changed after transformers, the architecture underpinning language models, were invented in 2017. As the AI sector saw performance improve with larger models and larger data sets, scale became a priority.
Since 2018, the web has been the dominant source for data sets used across media such as audio, images, and video. The result is a widening divide between scraped data and more curated data sets.
Longpre put the logic plainly: “In foundation model development, nothing seems to matter more for the capabilities than the scale and heterogeneity of the data and the web.” The same demand for scale has also massively increased the use of synthetic data.
Why YouTube matters for multimodal AI
The rise of multimodal generative AI models has made video and image data more important. These systems can generate videos and images, and like large language models, they need as much data as possible.
For those models, YouTube has become a central source. The findings show that for video models, over 70% of data for both speech and image data sets comes from one source.
That concentration could benefit Alphabet, Google’s parent company, because it owns YouTube. Text is spread across many websites and platforms, but video data is much more concentrated in one place.
Longpre warned that this gives one company major leverage over important web data. Because Google is also developing its own AI models, Sarah Myers West, the co-executive director at the AI Now Institute, said the situation raises questions about how the company will make this data available to competitors.
Myers West also argued that data should not be treated as a naturally occurring resource. It is created through particular processes, and those processes can reflect the goals of large, profit-motivated companies.
Licenses, access, and hidden restrictions
AI companies usually do not disclose the data used to train their models. One reason is competitive advantage. Another is that data sets are often bundled, packaged, and distributed in complicated ways, so companies may not know the full origin of everything they use.
The researchers also found that data sets can carry restrictive licenses or terms. Those restrictions may limit commercial use, for example, but they are not recorded consistently along the data's lineage.
Sara Hooker, the vice president of research at Cohere and part of the Data Provenance Initiative, said this lack of consistency makes it difficult for developers to choose data responsibly. Longpre added that it also makes it almost impossible to be completely certain a model has not been trained on copyrighted data.
Exclusive data-sharing deals add another layer to the problem. Companies such as OpenAI and Google have struck exclusive deals with publishers, major forums such as Reddit, and social media platforms.
Longpre said these contracts can divide the internet into zones of who can access data and who cannot. The largest AI players are better positioned to afford those deals and have stronger resources for crawling data sets, while researchers, nonprofits, and smaller companies may struggle to get access.
The global representation problem
The audit also found that AI training data is heavily skewed toward the Western world. Over 90% of the data sets analyzed came from Europe and North America, while fewer than 4% came from Africa.
Hooker said these data sets reflect one part of the world while omitting others. That matters because AI models are used globally, even when their training data does not represent the full range of global languages, cultures, and experiences.
Giada Pistilli, principal ethicist at Hugging Face, who was not part of the research team, said the dominance of English is partly tied to the fact that the internet is still over 90% in English. She also pointed to poor or nonexistent internet connections in many places, along with the effort required to build data sets in other languages and account for other cultures.
The issue becomes especially visible in multimodal models. Hooker noted that if a model is prompted for the sights and sounds of a wedding, it may only represent Western weddings if that is all it has seen in training.
The broader concern is that AI models may reinforce biases, push a US-centric worldview, and erase languages and cultures that are underrepresented in the data. As Hooker put it, “We are using these models all over the world, and there’s a massive discrepancy between the world we’re seeing and what’s invisible to these models.”