Ars Technica AI March 25, 2025

AI crawler traffic is pushing open source sites to lock down

Open source maintainers say AI crawler traffic is overwhelming public infrastructure, raising costs and causing downtime. Projects are responding with country blocks, VPNs, bot filters and proof-of-work systems such as Anubis, but those defenses can also slow legitimate users.

AI crawler behavior is creating real infrastructure harm and forcing restrictive defenses, but the risk is operational rather than catastrophic.

Open source projects are facing a growing infrastructure problem: AI crawlers are generating so much traffic that maintainers say public services are becoming unstable, expensive to run and harder for real users to reach.

The pressure is not coming from one project or one isolated server. The source describes developers, sysadmins and maintainers across multiple open source communities reporting similar patterns: crawlers ignoring normal limits, hitting costly pages, rotating through addresses and forcing defensive measures that were once unusual for public collaboration sites.

Why AI crawlers are straining open source infrastructure

The clearest example is software developer Xe Iaso, who reached a breaking point after aggressive AI crawler traffic from Amazon overwhelmed their Git repository service. The impact was not theoretical: the service suffered instability and downtime.

Iaso tried standard defenses first. Those included changes to robots.txt, blocks on known crawler user-agents and filters for suspicious traffic. According to the source, those measures did not solve the problem because the crawlers kept evading them by spoofing user-agents and using residential IP addresses as proxies.
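A minimal sketch of why robots.txt rules and user-agent blocks are easy to evade, using Python's standard `urllib.robotparser`. The crawler name `ExampleAIBot`, the host `git.example.org` and the rules themselves are hypothetical and do not come from the source. The sketch shows that a rule is keyed to whatever name the client declares, so a crawler that presents a browser-like user agent is no longer matched by it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named crawler is disallowed, everyone else is allowed.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(user_agent: str, url: str) -> bool:
    """Return True if the robots.txt rules above permit this user agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://git.example.org/repo/log"  # hypothetical URL
    # A crawler that identifies itself honestly is told to stay away.
    print(is_allowed("ExampleAIBot/1.0", url))
    # The same crawler claiming to be a browser is not covered by the rule.
    print(is_allowed("Mozilla/5.0", url))
```

Note that robots.txt is advisory in any case: the check above only has an effect if the crawler chooses to run it.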

That experience led Iaso to put the server behind a VPN and build Anubis, a proof-of-work challenge system. The idea is simple: before a browser can reach the site, it must complete computational work that creates friction for automated traffic.
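The general proof-of-work idea can be sketched as a hashcash-style puzzle. This is an illustration only, not Anubis's actual implementation: the hash function, the challenge format, the difficulty value and the function names are all assumptions. The client searches for a nonce whose hash starts with a given number of zero bits, and the server checks the answer with a single hash.

```python
import hashlib
import secrets


def leading_zero_bits(digest: bytes) -> int:
    """Count how many zero bits the digest starts with."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def solve(challenge: str, difficulty: int) -> int:
    """Client side: brute-force a nonce that meets the difficulty target."""
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
        if leading_zero_bits(digest) >= difficulty:
            return nonce
        nonce += 1


def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    """Server side: checking a submitted nonce costs one hash."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= difficulty


if __name__ == "__main__":
    challenge = secrets.token_hex(16)  # random per-visitor challenge (assumed format)
    difficulty = 12                    # assumed value; about 4,096 hashes on average
    nonce = solve(challenge, difficulty)
    print(verify(challenge, nonce, difficulty))
```

The asymmetry is the point of the design: solving takes many hashes while verifying takes one, so the cost falls mostly on whoever makes requests at high volume.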

“It’s futile to block AI crawler bots because they lie, change their user agent, use residential IP addresses as proxies, and more,” Iaso wrote.

Iaso also wrote, “I don’t want to have to close off my Gitea server to the public, but I will if I have to.” That tension captures the central problem for open source: public access is part of the value, but public access also makes these services easy targets for heavy automated collection.

Projects are blocking, challenging and filtering traffic

The same pattern appears across other projects named in the source. Kevin Fenzi, a member of the Fedora Pagure project’s sysadmin team, reported that the project had to block all traffic from Brazil after other attempts to mitigate bot traffic failed.

GNOME GitLab adopted Iaso’s Anubis system. GNOME sysadmin Bart Piotrowski shared on Mastodon that only about 3.2 percent of requests passed the challenge, or 2,690 out of 84,056. That result suggests that most of the requests hitting the system were automated rather than normal human browsing.
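The pass rate follows directly from the two counts Piotrowski shared. A quick check, using only those figures:

```python
passed = 2_690   # requests that passed the challenge
total = 84_056   # all requests counted

pass_rate = passed / total
fail_rate = 1 - pass_rate

print(f"passed: {pass_rate:.1%}")         # passed: 3.2%
print(f"did not pass: {fail_rate:.1%}")   # did not pass: 96.8%
```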

KDE’s GitLab infrastructure was also affected. According to LibreNews, citing a KDE Development chat, crawler traffic from Alibaba IP ranges temporarily knocked KDE’s GitLab offline.

These responses show how open source maintainers are being pushed toward stronger controls:

blocking traffic by geography when mitigation fails;
placing services behind a VPN;
filtering suspicious bot traffic;
blocking known crawler user-agents;
using Anubis and other proof-of-work challenges.

Those measures may reduce bot traffic, but they also change the character of public infrastructure. A project that exists to support collaboration may suddenly need gatekeeping tools just to remain available.

The cost is technical, financial and human

Bandwidth is one of the most visible costs. The Read the Docs project reported that blocking AI crawlers reduced traffic by 75 percent, from 800GB per day to 200GB per day. According to its blog post “AI crawlers need to be more respectful,” that change saved approximately $1,500 per month in bandwidth costs.
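The Read the Docs figures can be checked with a few lines of arithmetic. The traffic and savings numbers are from the report. The 30-day month and the resulting per-gigabyte rate are assumptions added for a back-of-envelope estimate, not figures from Read the Docs.

```python
before_gb_per_day = 800
after_gb_per_day = 200
monthly_savings_usd = 1_500   # approximate, as reported

saved_gb_per_day = before_gb_per_day - after_gb_per_day
reduction = saved_gb_per_day / before_gb_per_day
print(f"reduction: {reduction:.0%}")   # reduction: 75%

# Assumption: a 30-day month, used only to estimate an implied rate.
days_per_month = 30
saved_gb_per_month = saved_gb_per_day * days_per_month
implied_usd_per_gb = monthly_savings_usd / saved_gb_per_month
print(f"saved per month: {saved_gb_per_month} GB")
print(f"implied rate: ${implied_usd_per_gb:.3f} per GB")
```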

For maintainers, the issue is not just total volume. The crawlers often hit expensive endpoints, including git blame and log pages. Drew DeVault, founder of SourceHut, reported that crawlers access “every page of every git log, and every commit in your repository,” which creates particular strain for code hosting services.

Proof-of-work defenses also create friction for legitimate visitors. When many people open the same GitLab link at once, such as after a link is shared in a chat room, users can face noticeable delays. Some mobile users have reported waiting up to two minutes for the proof-of-work challenge to finish.

There is another burden beyond traffic. LibreNews points out that some open source projects started receiving AI-generated bug reports as early as December 2023. Daniel Stenberg of the Curl project first reported the issue on his blog in January 2024. These reports can look credible at first, but they contain fabricated vulnerabilities and consume developer time.

Responsibility is difficult to narrow down

The source names several companies and traffic sources, but the broader picture is mixed. Dennis Schubert, who maintains infrastructure for the Diaspora social network, described the situation in December as “literally a DDoS on the entire internet” after finding that traffic from AI companies made up 70 percent of all web requests to Diaspora’s services.

Schubert’s traffic analysis showed that approximately one-fourth of Diaspora’s web traffic came from bots with an OpenAI user agent. Amazon accounted for 15 percent, and Anthropic accounted for 4.3 percent.

The possible reasons for the crawling vary. Some crawlers may be gathering training data for large language models. Others may be performing real-time searches when users ask AI assistants for information. The source does not settle that question, but it does describe repeated crawling behavior rather than one-time collection.

Schubert observed that AI crawlers “don’t just crawl a page once and then move on. Oh, no, they come back every 6 hours because lol why not.”

Some reports suggest more aggressive behavior from certain sources. KDE’s sysadmin team reported crawler traffic from Alibaba IP ranges. Iaso’s problem came from Amazon’s crawler. A member of KDE’s sysadmin team told LibreNews that Western LLM operators like OpenAI and Anthropic were at least setting proper user agent strings, which can theoretically help websites identify and manage crawler traffic.

Open source carries the heavier burden

The imbalance matters because many open source projects run on limited resources. They often provide public code, documentation and collaboration tools without the staffing or infrastructure budget of commercial platforms. When crawler traffic grows, maintainers absorb the operational work.

Martin Owens from the Inkscape project said on Mastodon that the problem was not only “the usual Chinese DDoS from last year,” but also “a pile of companies that started ignoring our spider conf and started spoofing their browser info.” Owens added, “I now have a prodigious block list. If you happen to work for a big company doing AI, you may not get our website anymore.”

That is the practical result of the current conflict. Open source projects want to remain open, searchable and useful. But if AI crawlers behave like denial-of-service traffic, maintainers may keep choosing defenses that make the web less open for everyone.