The Decoder, July 19, 2025

Anthropic Copyright Lawsuit Turns on Alleged Pirated Books

A California federal court has allowed a class action against Anthropic to move forward over alleged downloads of books from LibGen and PiLiMi. The case separates possible fair use for legally obtained books from claims involving pirated copies stored in an internal database.

A class action lawsuit against Anthropic has moved into a higher-stakes phase, with a California federal court allowing claims over alleged large-scale copyright infringement to proceed. The company behind the Claude language model is accused of downloading as many as seven million books from pirate sites between 2021 and 2022.

The case matters beyond one AI company because it draws a sharp line around how training data is sourced. The court had recently given Anthropic a partial fair use win involving legally obtained books, but the allegations tied to pirated works remain a separate and potentially costly problem.

What the Anthropic copyright lawsuit alleges

According to the court order from July 17, 2025, Anthropic is accused of using the BitTorrent protocol to download pirated books from LibGen and PiLiMi. The files were typically in .epub, .pdf, or .txt format and were placed in a central internal database.

A key part of the allegation is that the books were stored regardless of whether they were later used to train AI models. That distinction matters because the lawsuit is not only about model training. It is also about acquisition and storage of copyrighted works that the plaintiffs say came from pirate sources.

Judge William Alsup described the alleged conduct as "Napster-style downloading of millions of works." The order says that between January 2021 and July 2022, an Anthropic co-founder first downloaded about 200,000 books from the Books3 collection. It then describes downloads of roughly five million books from LibGen and another two million from PiLiMi, with the PiLiMi downloads targeting titles not already in LibGen.

The court allowed the dispute to proceed as a class action because of the scale and complexity of the evidence. That does not mean the plaintiffs have won the case. It means the claims can be handled collectively for the included works, rather than as a scattered set of individual disputes.

Which books are included

The class action is limited to works sourced from LibGen and PiLiMi. Books3 was excluded because of missing metadata.

That metadata issue is practical but important. In a copyright case involving millions of alleged works, the parties need to identify titles, registrations, and source records with enough precision for the court to evaluate the claims. Without that structure, even a large collection may not be usable as part of the class case.

The schedule now focuses on making those records concrete. Anthropic must provide a complete metadata list of its LibGen and PiLiMi downloads by August 1, 2025. Plaintiffs must then submit a detailed list of titles and registrations by September 1, 2025.

Those deadlines point to the next major phase of the lawsuit. The case is moving from broad allegations about massive book collections toward a more specific accounting of which works are at issue and how they are documented.
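As a rough illustration of why the metadata matters, the accounting step can be thought of as matching download records against registration records. The sketch below is hypothetical throughout: the field names, the use of an ISBN as the matching key, the sample titles, and the registration numbers are all assumptions made for illustration, not details from the court order.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# Sources covered by the class, per the article. Books3 is excluded.
CLASS_SOURCES = {"LibGen", "PiLiMi"}


@dataclass(frozen=True)
class Download:
    """Hypothetical metadata row for one downloaded file."""
    source: str
    title: str
    isbn: Optional[str]  # None models missing metadata


def class_candidates(
    downloads: List[Download],
    registrations: Dict[str, str],
) -> List[Tuple[Download, str]]:
    """Pair each in-scope download with a registration number, if one exists.

    `registrations` maps an ISBN to a registration number (both made up here).
    Rows from other sources, rows without an ISBN, and rows with no matching
    registration are skipped.
    """
    matched = []
    for d in downloads:
        if d.source not in CLASS_SOURCES:
            continue  # e.g. Books3
        if d.isbn is None:
            continue  # cannot be identified precisely
        reg = registrations.get(d.isbn)
        if reg is not None:
            matched.append((d, reg))
    return matched


# Invented sample data, for illustration only.
downloads = [
    Download("LibGen", "Example Title A", "isbn-a"),
    Download("PiLiMi", "Example Title B", "isbn-b"),
    Download("PiLiMi", "Example Title C", None),
    Download("Books3", "Example Title D", "isbn-d"),
]
registrations = {"isbn-a": "REG-0001", "isbn-d": "REG-0004"}

matches = class_candidates(downloads, registrations)
for d, reg in matches:
    print(d.source, d.title, reg)
```

In this toy data only one of the four rows survives: one has no registration on file, one lacks an identifier, and one comes from an excluded source. The real criteria are set by the court, but the example shows how quickly an incomplete record drops out of a work-by-work accounting.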

Why the damages risk is so large

The financial exposure for Anthropic is significant because US law allows statutory damages of up to $150,000 per work for willful copyright infringement. In a case involving millions of alleged downloads, even a far smaller amount per title could still add up to billions.

That is why the class action decision raises the stakes. The case does not depend only on whether AI training can be transformative. It also asks whether a company can face damages for building an internal library from sources alleged to be pirate sites.

The source article frames the risk as a billion-dollar class action lawsuit because of the number of works and the potential damages structure. The exact financial outcome is not determined, but the range of possible exposure is unusually large because the alleged infringement is counted work by work.
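The work-by-work arithmetic can be sketched in a few lines. This is an illustration, not a forecast: the work count below simply adds the alleged LibGen and PiLiMi download figures, and the number of works that end up in the class could be very different. The $150,000 ceiling comes from the article; the $750 and $30,000 figures are the general statutory damages range in 17 U.S.C. § 504(c) and are added here for comparison.

```python
# Illustrative arithmetic only; not a prediction of any award.
STATUTORY_MIN = 750          # USD per work (17 U.S.C. § 504(c), not from the article)
STATUTORY_MAX = 30_000       # USD per work, non-willful ceiling (not from the article)
WILLFUL_MAX = 150_000        # USD per work, willful ceiling cited in the article

# Alleged download counts from the article; the class may cover fewer works.
LIBGEN_WORKS = 5_000_000
PILIMI_WORKS = 2_000_000


def exposure(works: int, per_work_usd: int) -> int:
    """Total exposure if every work received the same per-work amount."""
    return works * per_work_usd


alleged_works = LIBGEN_WORKS + PILIMI_WORKS

for label, per_work in [
    ("statutory minimum", STATUTORY_MIN),
    ("non-willful maximum", STATUTORY_MAX),
    ("willful maximum", WILLFUL_MAX),
]:
    total = exposure(alleged_works, per_work)
    print(f"{label}: ${per_work:,} x {alleged_works:,} works = ${total:,}")
```

Even at the $750 floor, seven million works would come to $5.25 billion, which is why a per-title amount far below the willful ceiling still lands in the billions.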

Fair use has a boundary in this ruling

In June 2025, the same court ruled that training AI models on legally obtained books may qualify as fair use, especially if the use is "transformative" and no copies are distributed. That part gave Anthropic a partial win.

But the court also made clear that storing pirated works in an internal library does not qualify as fair use. The distinction is central to the case: legally obtained material used for AI training is being treated differently from pirated material kept inside a company database.

The ruling leaves broader questions about mass web scraping and public data for AI training unsettled. At the same time, it sets a clearer boundary for this dispute: pirated content cannot be justified as fair use simply because it may support AI research or innovation.

For AI companies, that boundary is more than procedural. It suggests that how training data is acquired may be judged separately from what a model later does with that data. A company may argue about transformation, distribution, and model behavior, but those arguments may not cure problems in the sourcing of the underlying works.

Why the case could affect AI companies beyond Anthropic

The Anthropic case could become a major precedent for the AI industry. If the court continues to separate legally obtained books from pirated collections, companies may face greater pressure to prove the provenance of training data and internal datasets.

The decision could also ripple into ongoing lawsuits against Meta, OpenAI, and others accused of using copyrighted material to train language models. The source article does not say those cases will follow the same outcome, but it does connect them to the same broader legal debate around copyrighted works and AI training.

The core lesson from the ruling is narrow but consequential. Fair use may remain available in some AI training disputes, especially where books were legally obtained and the use is considered transformative. But the court has signaled that alleged piracy is a different issue, and one that can carry enormous financial risk when millions of works are involved.