The Decoder, July 22, 2024

Why AI training data is getting harder for crawlers to reach

A study by the Data Provenance Initiative found that a growing number of web domains are blocking AI crawlers from collecting their content as training data. The shift could leave future AI models with less current, less diverse, and lower-quality information unless licensing deals or legal outcomes change the economics.

Restricted training access could make future AI models less current, diverse, and high-quality, but the risk is indirect and mild.

AI developers are running into a growing problem: the open web is becoming less open to model training. A study from the Data Provenance Initiative found a rapid rise in web domains that block AI crawlers, reducing access to the kind of material used in popular training datasets.

The change matters because AI systems depend heavily on large volumes of web-based text. If more publishers, forums, and platforms restrict access, future models may have to learn from a narrower slice of the internet.

What The Study Examined

The Data Provenance Initiative, described as an independent academic group, conducted a large-scale study of web data access for AI models. Researchers reviewed robots.txt files and terms of use across 14,000 web domains.

Those domains are not random corners of the web. They serve as sources for widely used AI training datasets, including C4, RefinedWeb, and Dolma. By looking at these sources, the researchers were able to track how much training data is becoming unavailable to AI crawlers.

The study measured access in tokens, the word and word-fragment units that AI models process during training. In practical terms, when a domain blocks crawlers, all of the tokens that domain contributes drop out of the pool of accessible training material, so a single large domain can matter more than many small ones.
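
To make the token-based measure concrete, here is a minimal Python sketch of how a token-weighted blocked share can differ from a simple count of blocking domains. The domain names, token counts, and the `Domain` and `blocked_token_share` names are all hypothetical and chosen only for illustration; this is not the study's code or data.

```python
from dataclasses import dataclass


@dataclass
class Domain:
    name: str      # hypothetical domain name
    tokens: int    # hypothetical number of tokens the domain contributes
    blocked: bool  # True if the domain fully blocks AI crawlers


def blocked_token_share(domains: list[Domain]) -> float:
    """Return the fraction of all tokens that sit on fully blocked domains."""
    total = sum(d.tokens for d in domains)
    if total == 0:
        return 0.0
    blocked = sum(d.tokens for d in domains if d.blocked)
    return blocked / total


# Invented example: one of three domains blocks crawlers.
domains = [
    Domain("news.example", 600, True),
    Domain("forum.example", 400, False),
    Domain("shop.example", 9_000, False),
]

token_share = blocked_token_share(domains)
domain_share = sum(d.blocked for d in domains) / len(domains)

print(f"blocked share of tokens:  {token_share:.1%}")
print(f"blocked share of domains: {domain_share:.1%}")
```

In this invented case a third of the domains block crawlers, yet only a small share of tokens is affected, because the blocking domain is small. The reverse can also happen when a few very large domains opt out.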

Blocking Has Risen Quickly

The numbers show a sharp shift in a short period. From April 2023 to April 2024, the share of tokens in these datasets that were completely blocked for AI crawlers rose from about 1% to 5-7%.

The effect was stronger among key data sources. There, the share of blocked tokens increased from less than 3% to 20-33%. Researchers expect the trend to continue in the coming months.

The blocking is not evenly distributed across AI companies. OpenAI faces the most frequent blocks, followed by Anthropic and Google. That ranking suggests domain owners are not only restricting generic automated access, but are also making choices about specific AI crawlers.
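
Crawler-specific blocking of this kind is typically expressed in a site's robots.txt file. The following Python sketch uses the standard library's `urllib.robotparser` to check which crawlers a given file admits. The robots.txt content and the `crawler_access` helper are invented for illustration, and `GPTBot`, `ClaudeBot`, and `Google-Extended` are used here as example user-agent names associated with the three companies; the sketch does not reproduce the study's method, which also reviewed terms of use.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this example.
SAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /archive/

User-agent: *
Allow: /
"""


def crawler_access(robots_txt: str, agents: list[str], url: str) -> dict[str, bool]:
    """Report, per user agent, whether robots_txt permits fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


access = crawler_access(
    SAMPLE_ROBOTS_TXT,
    ["GPTBot", "ClaudeBot", "Google-Extended"],
    "https://example.com/news/story.html",
)

for agent, allowed in access.items():
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

Here one crawler is shut out of the whole site, another only from an archive path, and any crawler without its own rule falls back to the wildcard entry. Python's parser implements basic rule matching, so its answers may differ from how a particular crawler interprets the same file.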

News, Forums, And Social Platforms Are Pulling Back

The biggest restrictions are coming from news websites, forums, and social media platforms. These are important categories because they contain timely discussion, reporting, public commentary, and information about current events.

News sites show the most dramatic change. On those sites, the share of completely blocked tokens surged from 3% to 45% within a year.

That creates a likely shift in what future training datasets contain. If news, forums, and social media become less available, their representation may decline. Corporate and e-commerce sites, which have fewer restrictions, could take up more space in the data mix.
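
The arithmetic behind that shift can be sketched in a few lines of Python. The starting token counts and the blocking rates for forums and corporate sites are hypothetical; only the 45% figure for news comes from the study as reported above. The `remaining_mix` function name is likewise an assumption for this example.

```python
def remaining_mix(tokens: dict[str, int], blocked_fraction: dict[str, float]) -> dict[str, float]:
    """Return each category's share of the tokens left after blocking."""
    remaining = {
        category: count * (1.0 - blocked_fraction.get(category, 0.0))
        for category, count in tokens.items()
    }
    total = sum(remaining.values())
    if total == 0:
        return {category: 0.0 for category in remaining}
    return {category: count / total for category, count in remaining.items()}


# Hypothetical starting mix of tokens by source category.
tokens = {"news": 3_000, "forums": 2_000, "corporate": 5_000}

# 0.45 for news is the reported figure; the other two rates are invented.
blocked_fraction = {"news": 0.45, "forums": 0.20, "corporate": 0.05}

before = {category: count / sum(tokens.values()) for category, count in tokens.items()}
after = remaining_mix(tokens, blocked_fraction)

for category in tokens:
    print(f"{category}: {before[category]:.1%} before, {after[category]:.1%} after")
```

Under these assumed numbers, news falls from 30% of the mix to roughly 21%, while corporate content grows from 50% to roughly 59%, even though no new corporate text was added.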

That substitution is not neutral. The source article notes that corporate and e-commerce sites often contain lower-quality content. If higher-quality sources are increasingly restricted, model developers may find it harder to build powerful and reliable systems.

Why Data Quality Matters

The issue is not only the size of training datasets. The industry has realized that learning from high-quality data produces better models. A larger pile of lower-quality material may not solve the problem created by losing access to stronger sources.

Several consequences follow logically from the study’s findings:

AI training data may become less representative of news, forums, and social media platforms.
Models may rely more heavily on domains with fewer restrictions, including corporate and e-commerce sites.
Training powerful and reliable AI systems could become more difficult.
The cost of obtaining high-quality content may rise if more access moves into licensing deals.

The study also points to a mismatch between how generative AI services are used and what their training data may contain. That mismatch could matter in legal disputes where publishers argue that services like ChatGPT use publishers' content to compete with the publishers' own information offerings.

Licensing May Become More Important

As access through crawlers becomes more restricted, licensing becomes a more visible path. The source article notes that OpenAI has recently negotiated several multimillion-dollar deals with publishers for access to their content, both for real-time display in chat systems and for AI training.

Other companies are likely to follow suit, unless a fair use ruling dramatically changes the situation. For content owners, that could create new revenue streams. High-quality content providers may become major beneficiaries if AI companies need their material and cannot easily replace it.

But the economics are difficult. OpenAI, as well as Meta CEO Mark Zuckerberg, has said that licensing all the data needed to train a good AI model would be impossible or unaffordable.

That tension is now central to the future of AI training data. Web domains are asserting more control over how their content is used. AI developers, meanwhile, still need broad, current, high-quality information to build systems that perform well.

The study does not show the end of web-based training. It shows that the web is becoming a more contested resource. The next generation of AI models may depend not only on technical progress, but also on who can access the best data, under what terms, and at what price.