New studies put LLM memorization at the center of AI copyright fights

Recent studies found that leading AI models can be prompted to reproduce large portions of books from training data. The findings intensify questions over copyright, fair use, privacy, and whether model safeguards are enough.

Large language models are facing renewed scrutiny after recent research showed that major AI systems can produce near-verbatim passages from books used in training. The findings challenge a central industry claim: that models learn patterns from copyrighted works without retaining copies of those works.

What the research found

A series of recent studies has examined memorization in large language models from OpenAI, Google, Meta, Anthropic, and xAI. The results suggest that LLM memorization may be more extensive than many experts previously believed.

One study published last month by researchers at Stanford and Yale Universities tested whether models could be induced to continue text from books. The researchers used strategic prompting against systems from OpenAI, Google, Anthropic, and xAI and were able to extract thousands of words from 13 books.

The books named include A Game of Thrones, The Hunger Games, and The Hobbit. The researchers also tested Harry Potter and the Philosopher’s Stone, of which Gemini 2.5 reproduced 76.8 percent with high levels of accuracy and Grok 3 generated 70.3 percent.
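
The percentages above describe how much of a book a model reproduced, but this article does not say how the researchers scored a match. As a rough illustration only, the Python sketch below estimates what share of a reference text reappears in a model's output as long word-for-word runs. The function name reproduced_fraction and the min_words threshold are assumptions made for this example, not the study's method.

```python
from difflib import SequenceMatcher


def reproduced_fraction(reference: str, output: str, min_words: int = 5) -> float:
    """Estimate the share of `reference` that reappears in `output`.

    Illustrative assumption: only word-for-word runs of at least
    `min_words` consecutive words count as reproduction. This is not
    the scoring method used in the studies described in the article.
    """
    ref_words = reference.split()
    out_words = output.split()
    if not ref_words:
        return 0.0

    # Compare the two texts as sequences of words rather than characters.
    matcher = SequenceMatcher(None, ref_words, out_words, autojunk=False)

    # Add up the lengths of the matching runs that are long enough to count.
    covered = sum(
        block.size
        for block in matcher.get_matching_blocks()
        if block.size >= min_words
    )
    return covered / len(ref_words)


if __name__ == "__main__":
    # Hypothetical strings; a real test would use a book passage and a model output.
    passage = "the quick brown fox jumps over the lazy dog near the river bank"
    model_output = "the quick brown fox jumps over a sleepy cat near the river bank"
    print(f"{reproduced_fraction(passage, model_output):.1%}")
```

A higher min_words value ignores short coincidental overlaps, such as common phrases, at the cost of missing passages that a model reproduces with small changes. Any real measurement of "near-verbatim" output would need a more tolerant comparison than this exact-match sketch.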

Researchers also reportedly extracted almost the entire novel “near-verbatim” from Anthropic’s Claude 3.7 Sonnet by jailbreaking the model. Jailbreaking refers to prompts that push an LLM to ignore its safeguards.

This work builds on a study from last year that found “open” models, including Meta’s Llama, could memorize large portions of particular books in their training data. What surprised some researchers was that closed models, which typically include more guardrails, also appeared vulnerable to large-scale extraction.

Why memorization matters

The issue is not just whether a model can answer questions about a book. The concern is whether it can reproduce protected expression in a form close enough to the original to raise copyright and liability questions.

AI companies have argued that training on copyrighted books is “fair use” because the process transforms original material into a new technology. They have also argued that models do not contain copies of the training material in the ordinary sense.

Google made that position explicit in a 2023 letter to the US Copyright Office, saying “there is no copy of the training data—whether text, images, or other formats—present in the model itself.” The new studies complicate that claim because they show outputs that can closely resemble training texts under certain prompting conditions.

Yves-Alexandre de Montjoye, a professor of applied mathematics and computer science at Imperial College London, said: “There’s growing evidence that memorization is a bigger thing than previously believed.”

Researchers still have not determined why LLMs memorize some material from training data. It also remains unclear how much training data can surface in model outputs overall.

The legal pressure is growing

AI and legal experts told the FT that memorization could affect copyright lawsuits involving AI companies around the world. The reason is straightforward: if a model can reproduce protected works, it becomes harder to frame training as only abstract learning from patterns.

Cerys Wyn Davies, an intellectual property partner at law firm Pinsent Masons, said the findings “could present a challenge to those who argue that the AI model does not store or reproduce any copyright works.”

Recent legal disputes have already turned on related questions. A US court last year found that Anthropic’s training of LLMs on some copyrighted content could be considered fair use because it was deemed “transformative.” But the same court found that storing pirated works was “inherently, irredeemably infringing,” which led Anthropic to pay $1.5 billion to settle the lawsuit.

In Germany, a ruling from November last year found that OpenAI had infringed copyright because its model had memorized song lyrics. The case was brought by GEMA, an association representing composers, lyricists, and publishers, and was considered a landmark ruling in the EU.

Rudy Telscher, a partner at law firm Husch Blackwell, drew a distinction between clear reproduction and the harder question of broader responsibility. He said reproducing an entire book without jailbreaking is “clearly a copyright violation,” while adding that the key issue is “whether this is happening enough that [AI models] could be vicariously liable for the infringement.”

Safeguards, privacy, and the broader risk

Anthropic said the jailbreaking method used in the Stanford and Yale research was impractical for normal users. The company said extracting the text would take more effort than simply purchasing the content.

Anthropic also said its model does not store copies of specific datasets and instead learns from patterns and relationships between words and strings in training data. xAI, OpenAI, and Google did not respond to requests for comment.

Even so, the existence of safeguards is itself telling. De Montjoye said the fact that AI labs have put protections in place to stop extraction of training data shows they are aware of the problem.

The consequences may extend beyond books and copyright. Memorization could also create privacy and confidentiality issues in sectors such as health care and education, where leakage of training data could be especially sensitive.

For AI companies, the question is both technical and legal. Better guardrails may reduce obvious extraction, but the studies suggest that memorization is not fully understood. That uncertainty may affect how models are trained, how much development costs, and how courts weigh claims that models only transform copyrighted works.

Ben Zhao, a computer science professor at the University of Chicago, framed the debate as a question of whether AI labs need copyrighted content to build cutting-edge models at all. “Whether the technical result can be done or not, it’s still a question of should we be doing this?” Zhao said. “The legal side should eventually hold their ground and really be the arbiter in this whole process.”