TechCrunch AI January 10, 2025 TERMINATOR

How an OpenAI bot pushed a small website offline

Triplegangers says an OpenAI crawler overwhelmed its e-commerce site while trying to scrape a large catalog of 3D human reference assets. The incident shows why robots.txt settings, log monitoring, and bot blocking have become urgent operational issues for small online businesses.

WTF Index TERMINATOR

An AI crawler allegedly overwhelmed a small business website and scraped valuable catalog data at harmful scale.

Triplegangers is a seven-employee company whose website is also its storefront. When CEO Oleksandr Tomchuk learned on Saturday that the site was down, the first signs looked like a distributed denial-of-service attack. The traffic, he later found, was coming from an OpenAI bot trying to pull a large amount of material from the company’s site.

The episode highlights a practical problem for online businesses that hold valuable data, images, or product catalogs. AI crawlers may not arrive like normal customers, and the cost of their activity can land on the site owner before anyone understands what happened.

A crawler hit a large catalog at high volume

Triplegangers sells 3D object files and photos used by 3D artists, video game makers, and others who need realistic digital human details. Its catalog includes assets such as hands, hair, skin, and full bodies, built from scans of actual human models.

That catalog is large. Tomchuk told TechCrunch, "We have over 65,000 products, each product has a page." He also said, "Each page has at least three photos." According to Tomchuk, OpenAI sent "tens of thousands" of server requests while trying to download product pages, detailed descriptions, and hundreds of thousands of photos.

The company’s logs showed broad activity across many addresses. Tomchuk said, "OpenAI used 600 IPs to scrape data, and we are still analyzing logs from last week, perhaps it's way more."

For a small e-commerce company, the distinction between a crawler and an attack may not matter much in the moment. If traffic knocks the site offline, customers cannot buy, staff must investigate, and hosting costs can rise. Tomchuk described the effect plainly: "Their crawlers were crushing our site," and added, "It was basically a DDoS attack."

Terms of service were not enough

Triplegangers had a terms of service page that forbids bots from taking its images without permission. But the source article makes clear that this did not stop the crawler activity. The operational control that mattered was a properly configured robots.txt file with specific instructions for OpenAI’s bots.

OpenAI’s crawler names include GPTBot, ChatGPT-User, and OAI-SearchBot. The source article says each has its own tag, according to OpenAI’s information page on its crawlers. OpenAI says it honors robots.txt files when they are configured with its do-not-crawl tags, while also warning that its bots can take up to 24 hours to recognize an updated robots.txt file.

That creates a difficult default for site owners. If a site does not use robots.txt in the way a crawler expects, OpenAI and others may treat the material as available to scrape. The system is opt-out, not opt-in.
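
As a rough illustration of what such an opt-out can look like, the Python sketch below builds a robots.txt that disallows the three crawler names mentioned above and checks the result with the standard library's robots.txt parser. The crawler names come from the article; the blanket "Disallow: /" rule, the helper names, and the example.com URL are assumptions made for illustration, and OpenAI's own crawler documentation remains the authority on the exact tags.

```python
from urllib.robotparser import RobotFileParser

# Crawler names as listed in the article. The rules below are an
# illustrative assumption, not Triplegangers' actual configuration.
OPENAI_AGENTS = ["GPTBot", "ChatGPT-User", "OAI-SearchBot"]


def build_robots_txt(blocked_agents):
    """Return robots.txt text that disallows the whole site for each agent."""
    groups = [f"User-agent: {agent}\nDisallow: /" for agent in blocked_agents]
    # All other agents stay unrestricted (an empty Disallow allows everything).
    groups.append("User-agent: *\nDisallow:")
    return "\n\n".join(groups) + "\n"


def is_allowed(robots_txt, agent, url):
    """Ask the standard-library parser whether `agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


robots_txt = build_robots_txt(OPENAI_AGENTS)
print(robots_txt)
print(is_allowed(robots_txt, "GPTBot", "https://example.com/products/1"))
print(is_allowed(robots_txt, "Mozilla/5.0", "https://example.com/products/1"))
```

A check like this only confirms that the file says what the owner intended. Whether a given crawler obeys it is a separate question.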

By Wednesday, Triplegangers had added a properly configured robots.txt file. The company also set up Cloudflare to block GPTBot and other bots Tomchuk found, including Barkrowler and Bytespider, described in the source article as TikTok's crawler. On Thursday morning, Tomchuk said, the site did not crash.
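
The article does not describe how the Cloudflare rules were written. As a generic stand-in, the sketch below shows the same idea at the application layer: a small WSGI middleware that returns 403 when the User-Agent header contains one of the bot names Tomchuk listed. The function names and the substring-matching approach are assumptions for illustration, and this is not Cloudflare's mechanism. User-Agent headers can also be spoofed, so a filter like this only stops bots that identify themselves honestly.

```python
# Bot names taken from the article; matching is a simple
# case-insensitive substring test (an illustrative assumption).
BLOCKED_TOKENS = ("gptbot", "barkrowler", "bytespider")


def is_blocked(user_agent):
    """Return True if the User-Agent string names a blocked bot."""
    ua = (user_agent or "").lower()
    return any(token in ua for token in BLOCKED_TOKENS)


def block_bots(app):
    """Wrap a WSGI app so that blocked user agents receive 403 Forbidden."""
    def wrapper(environ, start_response):
        if is_blocked(environ.get("HTTP_USER_AGENT", "")):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        # Everyone else is passed through to the real application.
        return app(environ, start_response)
    return wrapper
```

Blocking at a CDN or firewall, as Triplegangers did, has the advantage that rejected requests never reach the origin server at all.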

The harder question is what was already taken

Stopping future traffic does not answer the question of what data was already downloaded. Tomchuk said he had no reasonable way to determine exactly what OpenAI successfully took. He also said he had found no way to contact OpenAI to ask for removal. OpenAI did not respond to TechCrunch’s request for comment.

This is especially sensitive for Triplegangers because its products are based on scans of real people. Tomchuk said, "We're in a business where the rights are kind of a serious issue, because we scan actual people." He also pointed to laws like Europe’s GDPR and said, "they cannot just take a photo of anyone on the web and use it."

The company’s site was also valuable in a way that goes beyond ordinary product photos. The source article says Triplegangers’ site contains images tagged with detailed attributes, including ethnicity, age, tattoos versus scars, and all body types. That kind of organized labeling is precisely what makes a large visual catalog easier to use for training or analysis.

Tomchuk said the aggressive traffic is what exposed the problem. If the scraping had been gentler, he said he might never have known.

Small businesses are being asked to monitor the bots

The incident points to a wider burden now landing on website owners. Robots.txt can help, but the source article notes that compliance is voluntary for AI companies. It also points to Perplexity, which was called out last summer by a Wired investigation when some evidence implied Perplexity was not honoring it.

Tomchuk wants other small online businesses to actively inspect their logs rather than assume their content is untouched. He warned that "most sites remain clueless that they were scraped by these bots." For Triplegangers, the response now includes daily monitoring of log activity to spot bot traffic.
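
A minimal sketch of that kind of log check follows, assuming the server writes access logs in the common "combined" format. It counts requests and distinct IP addresses for each bot name mentioned in the article. The sample log lines, IP addresses, and user-agent strings are invented for illustration and are not taken from Triplegangers' logs.

```python
import re
from collections import Counter, defaultdict

# Combined log format:
# ip ident user [time] "request" status bytes "referer" "user-agent"
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

# Bot names from the article.
BOT_TOKENS = ("GPTBot", "ChatGPT-User", "OAI-SearchBot", "Barkrowler", "Bytespider")


def summarize_bots(lines):
    """Map each bot name seen to (request count, distinct IP count)."""
    requests = Counter()
    ips = defaultdict(set)
    for line in lines:
        match = LOG_PATTERN.match(line)
        if not match:
            continue  # skip lines that are not in the expected format
        agent = match.group("agent").lower()
        for token in BOT_TOKENS:
            if token.lower() in agent:
                requests[token] += 1
                ips[token].add(match.group("ip"))
                break
    return {token: (requests[token], len(ips[token])) for token in requests}


# Invented sample lines using documentation-only IP ranges.
SAMPLE_LINES = [
    '192.0.2.10 - - [04/Jan/2025:10:00:00 +0000] "GET /product/1 HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '192.0.2.11 - - [04/Jan/2025:10:00:01 +0000] "GET /product/2 HTTP/1.1" 200 498 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '198.51.100.7 - - [04/Jan/2025:10:00:02 +0000] "GET /product/3 HTTP/1.1" 200 530 "-" "Mozilla/5.0 (compatible; Bytespider)"',
    '203.0.113.5 - - [04/Jan/2025:10:00:03 +0000] "GET / HTTP/1.1" 200 1024 "-" "Mozilla/5.0 (Windows NT 10.0)"',
]

for bot, (count, distinct_ips) in summarize_bots(SAMPLE_LINES).items():
    print(f"{bot}: {count} requests from {distinct_ips} IPs")
```

Tracking distinct IPs alongside request counts matters here because, as Tomchuk's figure of 600 IPs suggests, a crawler's traffic may be spread too thinly across addresses for per-IP rate limits to notice.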

The problem has grown quickly. Research from digital advertising company DoubleVerify found that AI crawlers and scrapers caused an 86% increase in "general invalid traffic" in 2024, meaning traffic that does not come from a real user.

For companies that sell original images, datasets, digital objects, or carefully labeled product pages, the lesson is direct. A website can be public and still contain assets the owner does not want copied into AI systems. But under the crawler model described here, the burden falls on the business to know the bot names, configure robots.txt, block traffic when needed, and keep watching.

They should be asking permission, not just scraping data.