The Decoder, August 31, 2024

Why major websites are shutting out Apple’s AI crawler

Major publishers and platforms are blocking Applebot-Extended, Apple’s AI training crawler, through robots.txt. The move is pushing Apple toward licensing talks while highlighting Google’s stronger position through search.

Major websites are drawing a clearer boundary around how their content can be used for artificial intelligence. Apple’s new AI crawler, Applebot-Extended, is being blocked by a growing set of publishers and platforms that do not want their material used for AI training without a deal.

The situation shows how different the AI data market looks depending on who is asking. Apple can be blocked without damaging a publisher’s visibility in Apple search results. Google, by contrast, has a much stronger position because its AI access is tied to the search ecosystem that many publishers depend on.

Publishers are blocking Applebot-Extended

According to a WIRED report, Facebook, Instagram, Craigslist, Tumblr, the New York Times, Financial Times, The Atlantic, Vox Media, USA Today Network, and Condé Nast are among the sites blocking Apple’s AI training crawler. The same report says some of these publishers have already reached agreements with OpenAI.

Apple recently introduced Applebot-Extended, a crawler that website operators can block through their robots.txt files. That gives publishers a direct technical way to tell Apple not to use their content for AI training, even though Apple continues to operate a separate crawler for its search products.
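
As a rough illustration of that mechanism, the sketch below parses a small robots.txt with Python's standard-library `urllib.robotparser` and checks which user agents may fetch a page. The robots.txt content and the `example.com` URL are assumptions made for this example, not rules taken from any publisher named in the report, and Python's parser is only one interpretation of robots.txt, so Apple's own matching may differ in detail.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: deny the AI training agent, say nothing about others.
ROBOTS_TXT = """\
User-agent: Applebot-Extended
Disallow: /
"""


def allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Placeholder URL, used only for illustration.
PAGE = "https://example.com/news/story"

ai_allowed = allowed("Applebot-Extended", PAGE)  # AI training agent
search_allowed = allowed("Applebot", PAGE)       # search crawler

print("Applebot-Extended allowed:", ai_allowed)
print("Applebot allowed:", search_allowed)
```

Under these assumed rules the AI training agent is denied while the search crawler is not, which is the separation the article describes.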

WIRED found that about 7% of 1,000 analyzed websites currently block Applebot-Extended. A separate analysis by data journalist Ben Welsh found that 294 of 1,167 mainly US-based, English-language publications (about 25%) block Applebot-Extended.
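
A block rate of this kind could be tallied roughly as follows. This is a minimal sketch, not the method WIRED or Ben Welsh used: the four sample sites and their robots.txt contents are hypothetical, and a site counts as blocking only if it denies the agent access to its root path. The final lines recompute the share implied by the 294 of 1,167 figure.

```python
from urllib.robotparser import RobotFileParser


def blocks_agent(robots_txt: str, user_agent: str) -> bool:
    """True if robots_txt denies user_agent access to the site root."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, "/")


# Hypothetical sample of sites and their robots.txt contents.
SAMPLE = {
    "site-a.example": "User-agent: Applebot-Extended\nDisallow: /\n",
    "site-b.example": "User-agent: *\nDisallow: /admin/\n",
    "site-c.example": "User-agent: GPTBot\nDisallow: /\n",
    "site-d.example": "",
}

blocked = [
    site for site, text in SAMPLE.items()
    if blocks_agent(text, "Applebot-Extended")
]
share = len(blocked) / len(SAMPLE)
print(f"{len(blocked)} of {len(SAMPLE)} sample sites block Applebot-Extended ({share:.0%})")

# Share implied by the figures reported in the article.
reported_share = 294 / 1167
print(f"Reported: 294 of 1,167 publications ({reported_share:.1%})")
```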

Those numbers do not mean the entire web is closed to Apple. They do show, however, that a meaningful group of high-value sources is choosing restriction over automatic access. For AI products that depend on high-quality training data and breaking news, that matters.

Licensing is becoming the practical path

The source article frames the publisher response around a simple commercial concern: major websites would rather not provide valuable content for free. If Apple needs that material for AI products, blocking Applebot-Extended can push the company toward licensing negotiations.

Apple is reportedly in talks but has not announced any agreements. OpenAI has already secured some deals. That contrast is important because both Apple and OpenAI need high-quality training data and breaking news for AI products, but they do not have the same relationship with publishers as Google does.

For publishers, blocking a crawler is not just a technical setting. It is a bargaining position. By denying easy access, a publisher can signal that its reporting, archives, or other material has economic value in the AI supply chain.

The list of blockers also matters because it includes both social platforms and major media organizations. Facebook, Instagram, Craigslist, and Tumblr represent large pools of user-generated web content. The New York Times, Financial Times, The Atlantic, Vox Media, USA Today Network, and Condé Nast represent established publishing operations whose content may be especially relevant to AI systems that need current information and well-edited material.

Google has a different kind of leverage

Apple’s position differs sharply from Google’s. The source article says Google can use its search engine dominance to pressure publishers: allow AI access or risk lower search visibility. Google can do this because it mixes AI Overviews with traditional web crawling, so the two are hard for publishers to separate.

That creates a harder decision for publishers. If a publisher blocks a crawler that is tightly linked to search visibility, the cost may be immediate and practical. Search visibility can affect whether readers find a site at all.

Apple does not have the same leverage in this context. While Apple also has a crawler for search products, publishers can block the AI crawler without affecting their visibility in Apple’s search results. The source describes this setup as fairer, while also noting that search is not Apple’s core business.

This distinction is central to the current AI content dispute. A company that separates AI training access from search crawling gives publishers a cleaner choice. A company that blends AI functions with traditional search crawling can make that choice more difficult.

AI crawlers face broader resistance

Apple is not alone. Anthropic, OpenAI, and Google face similar challenges. The source article notes that OpenAI has experienced even higher block rates for its crawler.

That broader pattern suggests publishers are not only reacting to Apple. They are responding to the wider use of web content in AI systems and to the question of who should benefit when that content helps power AI answers, training, or product features.

The business stakes are especially visible in Google’s case. The company has reportedly ended earlier discussions with publishers about potential license agreements for using their content in AI answers. The source article adds that, given Google’s daily search volume, such agreements would likely be costly and would further squeeze search margins already under pressure from AI.

For Apple, the unresolved question is what future publisher access will look like inside Apple products. The source article specifically notes that it remains to be seen how OpenAI’s publisher licenses will apply when ChatGPT is accessed through Apple products.

The fight is over control, not just crawling

The conflict around Applebot-Extended is part of a larger shift in how websites treat automated access. Crawlers were once mostly associated with search indexing. AI training and AI answers have changed the calculation because the content may be used in products that do more than send readers back to the original page.

For publishers, the issue is control over valuable material. For AI companies, the issue is access to the kind of information that makes products more useful. For Apple, the immediate challenge is that many large sites can block its AI crawler without accepting a clear downside in Apple search visibility.

That leaves negotiation as the likely route. If publishers continue to block Applebot-Extended, Apple will need to persuade them through licensing terms rather than relying on automatic web access. Google, meanwhile, remains in a different position because search gives it leverage that Apple currently lacks.