WIRED AI August 29, 2024 TERMINATOR

Why Applebot-Extended Is Hitting Publisher Roadblocks

Major publishers and platforms are blocking Applebot-Extended to keep their content out of Apple’s AI training. The move shows how robots.txt has become a business and copyright tool, not just a technical file for web crawlers.

WTF Index TERMINATOR

Terminator 1, Idiocracy 0

The story mildly leans Terminator because it concerns AI companies expanding training data collection against publisher resistance, but no direct harm or autonomy is described.

Apple’s AI data collection is running into resistance from some of the web’s largest publishers and platforms. Less than three months after Apple introduced Applebot-Extended, a tool that lets sites opt out of having their data used for AI training, prominent organizations have already started blocking it.

WIRED confirmed that Facebook, Instagram, Craigslist, Tumblr, The New York Times, The Financial Times, The Atlantic, Vox Media, the USA Today network, and Condé Nast are among those excluding their content from Apple’s AI training systems.

What Applebot-Extended Actually Does

Applebot-Extended is tied to Apple’s existing web crawler, Applebot. The original Applebot was announced in 2015 and was built to support Apple search products such as Siri and Spotlight.

The newer extension does not stop Applebot from visiting a website. That distinction matters because blocking the original crawler could affect how a site appears in Apple search products. Instead, Applebot-Extended tells Apple not to use the collected data for large language models and other generative AI projects.

Apple spokesperson Nadine Haija described Applebot-Extended as a way to respect publishers’ rights. Apple refers to the mechanism as “controlling data usage” in a blog post explaining how it works.

Robots.txt Moves From Webmaster Tool to AI Battleground

Publishers block Applebot-Extended by editing robots.txt, the text file that has long guided how bots crawl websites. The Robots Exclusion Protocol allows site owners to permit or block specific bots on a case-by-case basis.
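
As a rough illustration of what such an edit looks like, the Python sketch below parses a small robots.txt with the standard library's urllib.robotparser and checks two user-agent tokens against it. The rules and the example.com URL are assumptions made for illustration, not any publisher's actual file. The parser only models how rules match a token; how Apple applies the Applebot-Extended token to data it has already collected is not something this code can show.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: opt out of Applebot-Extended, allow everything else.
ROBOTS_TXT = """\
User-agent: Applebot-Extended
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the rules in robots_txt permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/story"  # placeholder URL, not a real article
    for agent in ("Applebot", "Applebot-Extended"):
        print(agent, is_allowed(ROBOTS_TXT, agent, page))
```

Under these assumed rules, the search crawler token is reported as allowed and the AI-training token as disallowed, which mirrors the distinction Apple draws between the two.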

There is no legal obligation for bots to follow robots.txt, but compliance has been a long-standing norm. That norm is under more pressure now because crawlers are no longer only indexing pages for search. They are also part of the pipeline that can feed AI training.

Many publishers have already changed their robots.txt files to block AI bots from OpenAI, Anthropic, and other major AI players. Applebot-Extended is newer, so its block rate is still lower in broad samples.

The Numbers Show Apple Is Still Catching Up

Originality AI analyzed a sampling of 1,000 high-traffic websites last week and found that approximately 7 percent were blocking Applebot-Extended. Dark Visitors ran its own analysis of another sampling of 1,000 high-traffic websites this week and found that approximately 6 percent had blocked the bot.

Those broad web samples suggest that most site owners either do not object to Apple’s AI training practices or do not yet know they can block Applebot-Extended.

The picture looks different among news publishers. Data journalist Ben Welsh found that just over a quarter of the news websites he surveyed, 294 of 1,167 primarily English-language, US-based publications, are blocking Applebot-Extended.
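
The “just over a quarter” figure can be checked with simple arithmetic. The short Python sketch below does so; the helper name block_rate is hypothetical, and only the two counts come from Welsh's survey as reported here.

```python
def block_rate(blocked: int, surveyed: int) -> float:
    """Share of surveyed sites that block a bot, as a percentage."""
    if surveyed <= 0:
        raise ValueError("surveyed must be positive")
    return 100 * blocked / surveyed


if __name__ == "__main__":
    # Counts reported for Ben Welsh's survey of news websites.
    print(f"{block_rate(294, 1167):.1f}%")  # prints 25.2%
```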

Welsh’s sample showed higher block rates for other AI crawlers: 53 percent of the news websites block OpenAI’s bot, while Google-Extended, introduced last September, is blocked by nearly 43 percent of those sites. Welsh told WIRED the Applebot-Extended number has been “gradually moving” upward since he started looking.

Publishers Are Treating AI Access as a Business Decision

The response is not only technical. For many publishers, allowing or blocking AI crawlers is now part of a wider strategy around licensing, copyright, and the value of published work.

Ben Welsh said a divide has emerged among news publishers over whether to block these bots. He also pointed to licensing deals as a possible factor, noting that some organizations may be paid in exchange for allowing bots in.

Originality AI founder Jon Gillham made a similar point, saying, “A lot of the largest publishers in the world are clearly taking a strategic approach.” He added, “I think in some cases, there's a business strategy involved—like, withholding the data until a partnership agreement is in place.”

There is evidence for that approach. Condé Nast websites previously blocked OpenAI’s web crawlers, then unblocked them after announcing a partnership with OpenAI last week. BuzzFeed spokesperson Juliana Clifton told WIRED that the company blocks every AI web-crawling bot it can identify unless the bot’s owner has entered into a partnership, typically paid, with the company.

Why the Opt-Out Model Is Controversial

Some publishers are direct about why they are blocking Applebot-Extended. Lauren Starke, Vox Media’s senior vice president of communications, said Vox Media is blocking the tool across all of its properties, as it has done with many other AI scraping tools when there is no commercial agreement. She said, “We believe in protecting the value of our published work.”

Gannett chief communications officer Lark-Marie Antón gave a shorter explanation: “The team determined, at this point in time, there was no value in allowing Applebot-Extended access to our content.”

The New York Times, which is suing OpenAI over copyright infringement, is critical of the opt-out model itself. Charlie Stadtlander, the Times’ director of external communications, said the company will keep adding unauthorized bots to its block list as it finds them.

“Importantly, copyright law still applies whether or not technical blocking measures are in place,” Stadtlander said. “Theft of copyrighted material is not something content owners need to opt out of.”

The practical problem is also growing. Robots.txt has to be edited manually, and new AI agents keep appearing. Dark Visitors founder Gavin King said, “People just don’t know what to block.” His company offers a freemium service that automatically updates a client site’s robots.txt, and he said publishers make up a large portion of its clients because of copyright concerns.
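
To make the maintenance burden concrete, here is a minimal Python sketch of how a site might generate its block rules from a list of known AI agents, skipping any bot whose owner has a partnership. This is an assumption-laden illustration, not how Dark Visitors or any publisher actually does it. The function name build_block_rules and the agent ExampleAI-Bot are hypothetical; Applebot-Extended and Google-Extended are the tokens named in this article.

```python
def build_block_rules(ai_agents, partners=frozenset()):
    """Build robots.txt groups that disallow every listed agent except partners."""
    blocked = sorted(set(ai_agents) - set(partners))
    groups = [f"User-agent: {agent}\nDisallow: /\n" for agent in blocked]
    # A blank line separates one user-agent group from the next.
    return "\n".join(groups)


if __name__ == "__main__":
    known_agents = ["Applebot-Extended", "Google-Extended", "ExampleAI-Bot"]
    partners = {"ExampleAI-Bot"}  # hypothetical bot covered by a licensing deal
    print(build_block_rules(known_agents, partners))
```

Each time a new AI agent appears, someone has to add its token to the list and regenerate the file, which is the chore King describes.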

What used to be a quiet technical file is now a policy lever. For publishers, Applebot-Extended is one more sign that AI training has turned web crawling into a negotiation over access, permission, and payment.