The Decoder, May 4, 2025

Why Google AI search training still uses publisher content

Google's publisher opt-out system blocks some content from Deepmind training, but not from every Google AI use. Court testimony said search teams can still train on opted-out publisher data for search features such as AI Overviews.

Google’s ability to use opted-out publisher content for AI search features raises mild concerns about control and consent in AI training.

Google's use of publisher content for AI training is under renewed scrutiny after court testimony described a gap between what publishers may think they have blocked and what Google can still use inside search.

The issue is not simply whether publishers opted out. According to the source article, the key point is that the opt-out applies to Google Deepmind, while other parts of Google, including search, may still use the same kind of content for AI systems tied to search.

What the testimony said

The source article, citing a Bloomberg report, says Eli Collins, Vice President at Google Deepmind, addressed the issue during a Washington court hearing.

At the hearing, Diana Aguilar of the US Department of Justice asked, "Once you take the Gemini [AI model] and put it inside the search org, the search org has the ability to train on the data that publishers had opted out of training, correct?" Collins replied, "Correct — for use in search."

That exchange is central because it separates two things that may look identical from the outside: training by Deepmind and training or use by Google's search organization. Publishers may have tried to keep their content out of AI model training, but the testimony described a narrower effect for that choice.

In plain terms, the source article says the current publisher opt-out system only covers Deepmind, the Google AI research division that trains the Gemini models. It does not automatically prevent other Google units from using publisher content for their own AI systems.
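The scoping described in the testimony can be pictured with a toy model. The sketch below is purely illustrative: the class, the unit names, and the publisher domain are hypothetical and do not reflect how Google actually implements its opt-out. It only shows how an opt-out that binds one unit leaves another unit unaffected.

```python
from dataclasses import dataclass, field

# Hypothetical unit labels for illustration; not Google's internal identifiers.
DEEPMIND = "deepmind"
SEARCH = "search"


@dataclass
class OptOutPolicy:
    """Toy model of an opt-out that binds only the units in `covered_units`."""

    # Assumption taken from the testimony as reported: only Deepmind is covered.
    covered_units: frozenset = frozenset({DEEPMIND})
    opted_out_publishers: set = field(default_factory=set)

    def opt_out(self, publisher: str) -> None:
        """Record that a publisher has opted out of AI training."""
        self.opted_out_publishers.add(publisher)

    def may_train(self, unit: str, publisher: str) -> bool:
        """Return True unless the publisher opted out AND the unit is covered."""
        blocked = (
            publisher in self.opted_out_publishers
            and unit in self.covered_units
        )
        return not blocked


policy = OptOutPolicy()
policy.opt_out("example-publisher.com")  # hypothetical publisher

deepmind_allowed = policy.may_train(DEEPMIND, "example-publisher.com")
search_allowed = policy.may_train(SEARCH, "example-publisher.com")
print(deepmind_allowed, search_allowed)  # False True
```

In this model, widening the opt-out would only require adding more units to `covered_units`. Whether Google's real system could be changed that simply is not something the source article addresses.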

Why the opt-out gap matters

The distinction matters because Google search is no longer only a list of links. The source article says Google uses web content to power search features like "AI Overviews," which place AI-generated answers above traditional search results.

That creates a direct tension for website owners. Their pages may help generate answers that users see on Google, while those same users may not click through to the original sites. The source article frames this as direct competition between Google and publishers whose content can support the answers shown in search.

For publishers, the practical problem is control. An opt-out that sounds broad may not block every AI-related use across the company. If it only applies to Deepmind, then publisher content can be excluded from one training pipeline while remaining available for search AI.

The testimony, as presented in the source article, addressed one specific scenario: the Gemini model being placed inside the search organization. Once that happens, the search organization can train on publisher data that had been opted out of training, but only "for use in search."

The scale described in Google's document

The source article points to an internal Google document from summer 2024. That document listed 160 billion tokens, described as short snippets of text, originally intended for AI training.

Of those, 80 billion tokens were removed because they came from publishers who opted out. The source article says this amounted to Google losing half of the publisher training data from that set for Deepmind's purposes.

But the testimony suggests the removal did not mean the data was unavailable everywhere inside Google. Instead, the same publisher data could still be used for Google's web search AI, even if Deepmind was not directly using it.

That is the core policy and business concern. Publishers were trying to block Google AI training broadly, but the mechanism described in the source article appears to work more narrowly. It removes content from Deepmind training while leaving room for use elsewhere in the company.
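The token figures reported from the internal document can be checked with a short calculation. This is a minimal sketch that uses only the two numbers given in the source article. The function and variable names are made up for illustration.

```python
# Figures as reported in the source article (internal document, summer 2024).
TOTAL_TOKENS = 160_000_000_000
REMOVED_TOKENS = 80_000_000_000


def removed_share(total: int, removed: int) -> float:
    """Return the fraction of tokens removed from the original set."""
    if total <= 0:
        raise ValueError("total must be positive")
    return removed / total


share = removed_share(TOTAL_TOKENS, REMOVED_TOKENS)
remaining = TOTAL_TOKENS - REMOVED_TOKENS
print(f"{share:.0%} removed, {remaining:,} tokens remaining")
# 50% removed, 80,000,000,000 tokens remaining
```

The result matches the article's description of Deepmind losing half of that set. It says nothing about how much of the removed data remained usable by the search organization.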

How this fits the antitrust case

The details surfaced in an ongoing antitrust case against Google in federal court. The US Department of Justice is seeking major restrictions, including a forced sale of the Chrome browser and a ban on paying hardware and app makers to set Google as the default search engine.

The DOJ also says those restrictions should apply to Google's AI products, including Gemini. Its argument, as described in the source article, is that those AI products benefit from the same search monopoly.

That connection gives the publisher data issue broader significance. It is not only a dispute over web scraping or AI training permissions. It also touches the relationship between search, default distribution, AI products, and the content ecosystem that search depends on.

The source article also notes a wider market question. If leading AI labs need high-quality training data to keep their models performing well, a market for that content could develop. Such a market would conflict with the current practice of using freely available open web content, often defended as "fair use."

The article also says a US judge recently dismissed that defense in a case involving Meta. No further case details are provided in the source, so the point remains limited: the legal footing around AI training data is being tested, and publisher content is at the center of that pressure.

What publishers can take from it

The immediate lesson is that an opt-out may not mean what publishers expect unless it clearly covers all relevant AI uses. In the situation described by the source article, opting out removed content from Deepmind training but did not stop search-related AI use inside Google.

For readers, the issue is also about how AI search answers are built. If AI Overviews rely on content from across the web, then the value created by publishers can be surfaced directly in search results. That changes the relationship between the search engine, the source sites, and the users looking for answers.

The facts in the source article do not show how Google will change its opt-out system, whether publishers will receive new controls, or how the court will rule. What they do show is a sharp boundary inside Google's own AI operations: Deepmind opt-outs and search AI training are not the same thing.