WIRED AI October 7, 2024 TERMINATOR

Publisher deals are changing the fight over OpenAI scraping bots

News publishers rushed to block AI crawlers after the generative AI boom, but the pace of blocking OpenAI’s GPTBot has slowed. The source article ties the shift to publisher licensing deals, robots.txt updates, and uncertainty over whether blocking may become a negotiation tactic.

WTF Index TERMINATOR

Terminator 1, Idiocracy 0

The story leans mildly toward Terminator because it concerns AI crawlers gaining access to publisher content without clear consent, though it is mostly a business and policy dispute.

The early wave of publisher resistance to OpenAI’s web crawlers appears to be easing. News outlets still use robots.txt to keep AI bots away from their sites, but the sharp growth in blocks aimed at OpenAI’s GPTBot has slowed and, by some measures, has begun to reverse.

The change does not settle the larger dispute over AI training data, publisher consent, or how web content should be used by AI companies. It does show that licensing deals are already reshaping the practical rules of access across parts of the media business.

Why publishers started blocking AI crawlers

The generative AI boom created intense demand for online data. For many news organizations, that demand raised a direct concern: their reporting could be collected and used as training material without consent.

One of the main tools publishers use is the Robots Exclusion Protocol, better known through the robots.txt file. That file lets site owners tell automated crawlers, bot by bot, which parts of a site they are not allowed to access.

Robots.txt is not legally binding, but it has long operated as a shared web standard. For much of the internet’s history, crawler operators and website owners have treated it as the basic signal for what a bot should or should not fetch.
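As a rough illustration of how such a rule works, the sketch below parses a small robots.txt file with Python's standard urllib.robotparser module and asks whether a given bot may fetch a page. The file contents, the example.com URL, and the is_blocked helper are assumptions made up for this example, not any publisher's real configuration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: shut out GPTBot entirely, allow everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_blocked(robots_txt: str, user_agent: str,
               url: str = "https://example.com/") -> bool:
    """Return True if robots_txt tells user_agent not to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


print(is_blocked(ROBOTS_TXT, "GPTBot"))     # True
print(is_blocked(ROBOTS_TXT, "Googlebot"))  # False
```

The check is advisory only. Nothing in the file stops a crawler from fetching the page; it works because well-behaved bots read the file and honor it.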

The growing number of AI crawlers has made that work harder. The source article describes the process as a constant effort to keep up with new bots, especially after Apple debuted a new AI agent this summer and many major news outlets quickly opted out of Apple’s web scraping through robots.txt.

GPTBot blocks rose fast, then began to fall

OpenAI’s GPTBot has become one of the best-known AI crawlers. It has also been blocked more often than some competitors, including Google AI.

According to an analysis of 1,000 popular news outlets by Ontario-based AI detection startup Originality AI, the number of high-ranking media websites using robots.txt to disallow GPTBot rose sharply after its August 2023 launch. The increase continued from November 2023 to April 2024, though at a slower pace.

At the highest point, just over a third of those websites were blocking GPTBot. The figure has since fallen closer to a quarter. Among a smaller group of the most prominent news outlets, the block rate remains above 50 percent, but it is down from earlier highs of almost 90 percent.

That decline matters because it suggests the first broad publisher reaction has cooled. The pattern is no longer simply more outlets adding barriers. Some are now removing them.

Licensing deals appear to explain much of the shift

The source article connects several dips in GPTBot blocking to publisher deals with OpenAI. The block rate dropped noticeably after Dotdash Meredith announced a licensing deal with OpenAI in May. It dipped again at the end of May, after Vox announced its own arrangement, and again this August, after WIRED’s parent company, Condé Nast, struck a deal.

The logic is straightforward. If a publisher gives permission for its data to be used through a partnership, it has less reason to keep blocking the crawler through robots.txt. When enough publishers update those files, the overall block rate falls.
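The arithmetic behind a falling block rate can be sketched in a few lines. The example below is an assumption-laden illustration, not Originality AI's method: the four site names and their robots.txt contents are invented, and a site counts as "blocking" only if GPTBot is disallowed from the root path.

```python
from urllib.robotparser import RobotFileParser


def blocks_agent(robots_txt: str, user_agent: str = "GPTBot") -> bool:
    """Return True if robots_txt disallows user_agent from the site root."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, "/")


def block_rate(robots_by_site: dict[str, str],
               user_agent: str = "GPTBot") -> float:
    """Share of sites whose robots.txt blocks user_agent (0.0 if no sites)."""
    if not robots_by_site:
        return 0.0
    blocked = sum(blocks_agent(text, user_agent)
                  for text in robots_by_site.values())
    return blocked / len(robots_by_site)


# Hypothetical sample of four outlets; two of them block GPTBot.
sample = {
    "news-a.example": "User-agent: GPTBot\nDisallow: /\n",
    "news-b.example": "User-agent: *\nDisallow: /private/\n",
    "news-c.example": "",
    "news-d.example": "User-agent: GPTBot\nDisallow: /\n\nUser-agent: *\nAllow: /\n",
}
before = block_rate(sample)

# news-a signs a deal and drops its GPTBot rule.
sample["news-a.example"] = "User-agent: *\nAllow: /\n"
after = block_rate(sample)

print(f"{before:.0%} -> {after:.0%}")  # 50% -> 25%
```

A real survey would also have to fetch each file, handle missing or malformed ones, and decide how to count partial blocks, which is one reason different trackers report slightly different figures.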

The timing varies by outlet. The Atlantic unblocked OpenAI’s crawlers on the same day it announced a deal. Vox announced its partnership at the end of May, then unblocked GPTBot on its properties toward the end of June.

OpenAI has struck deals with 12 publishers so far. Most have updated their robots.txt files, but not all. Time magazine, for example, continues to block GPTBot and did not respond to WIRED’s request for comment on why.

OpenAI spokesperson Kayla Wood said that once a deal is in place, robots.txt access matters less, because the company does not access a partner’s data the same way it crawls what it calls "publicly available" data. "We leverage direct feeds," she said.

Not every unblock means a deal

Some media outlets have unblocked OpenAI’s web crawler without announcing any partnership. Data journalist Ben Welsh, who tracks how news outlets block top AI bots using slightly different metrics, pointed this out to WIRED after first noticing a slight decline in block rates a few weeks ago.

Two examples drew attention: Infowars and The Onion. But the absence of a block does not automatically mean a publisher is negotiating with OpenAI or has reached an undisclosed arrangement.

Onion CEO Ben Collins rejected that idea directly. He said the unblocking was likely tied to the outlet migrating its website to a new hosting service and content management system last month. "Fuck no," he said when asked whether it suggested a deal or negotiation. "Obviously we are not doing any business with the Plagiarism Machine."

Infowars did not respond to requests for comment. OpenAI confirmed that it does not have any partnership with Infowars.

The next phase may be about leverage

The current slowdown in blocking may not be permanent. Originality AI CEO Jon Gillham thinks block rates could rise again if publishers decide that blocking OpenAI helps them negotiate.

His questions frame the issue clearly: "Is step one in a negotiation with OpenAI to block them? Does that bring them to the table?"

That possibility keeps robots.txt at the center of the publisher-AI relationship. It is a technical file, but it now carries business meaning. A block can signal refusal, caution, leverage, or simply a site migration that changed settings by accident.

For OpenAI, the trend is still significant. The first industry-wide rush to block GPTBot appears to have ended for now, and publisher partnerships have reduced some of the resistance. For publishers, the decision remains more complicated: block crawlers, make deals, use direct feeds, or keep the door closed while the market develops.