A new study adds a concrete data point to one of the central questions in generative AI copyright disputes: when a model is trained on copyrighted material, how often can it reproduce that material word for word?
The research focused on books rather than newspaper articles, and it found a striking result for Meta’s Llama 3.1 70B. For Harry Potter and the Sorcerer’s Stone, the model was estimated to have memorized 42 percent of the book well enough to reproduce 50-token excerpts at least half the time.
What the study examined
The paper was published last month by computer scientists and legal scholars from Stanford, Cornell, and West Virginia University. The researchers studied five popular open-weight models: three from Meta, one from Microsoft, and one from EleutherAI.
The books tested came from Books3, a collection of books widely used to train LLMs. Many books in that collection are still under copyright, which makes the findings relevant to ongoing legal fights over AI training data.
The clearest result involved Harry Potter and the Sorcerer’s Stone. According to the study, Llama 3.1 70B, a mid-sized model Meta released in July 2024, was much more likely to reproduce text from that book than the other models tested.
The comparison with an earlier Meta model was especially notable. Llama 1 65B, a similar-sized model released in February 2023, had memorized only 4.4 percent of Harry Potter and the Sorcerer’s Stone by the same measure. For that book, measured memorization rose nearly tenfold between Llama 1 65B and Llama 3.1 70B.
Why popular books stood out
Harry Potter and the Sorcerer’s Stone was not the only book examined. The researchers tested dozens of books and found that Llama 3.1 70B was far more likely to reproduce popular titles, including The Hobbit and George Orwell’s 1984, than more obscure works.
That pattern matters because it suggests that memorization is not evenly distributed. Some works appear much easier to recover from a model than others. Some models also appear more prone to verbatim recall than others.
James Grimmelmann, a Cornell law professor who has collaborated with several of the paper’s authors, put the issue directly: “There are really striking differences among models in terms of how much verbatim text they have memorized.”
The study’s authors were also surprised by the spread in results. Mark Lemley, a law professor at Stanford, said they had expected a much lower level of replicability, “on the order of 1 or 2 percent.”
How the researchers measured memorization
The researchers did not have to prompt a model over and over and count how often it returned exact passages. Instead, they used the model’s own token probabilities to estimate how likely it was to produce a specific sequence.
An LLM does not simply choose one next word with certainty. It produces a probability distribution across possible next tokens. The source article explains this with the example phrase “Peanut butter and,” where a model might assign different probabilities to possible continuations such as “jelly,” “sugar,” “peanut,” “chocolate,” or “cream.”
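As a rough illustration of what a next-token distribution looks like, the sketch below turns a set of raw scores into probabilities with a softmax, the standard normalization step in LLMs. The scores are invented for this example and are not real model output, and an actual model scores every token in its vocabulary rather than five candidates.

```python
import math

def softmax(scores):
    """Convert raw scores (logits) into probabilities that sum to 1."""
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(s - top) for tok, s in scores.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

# Hypothetical scores for continuations of "Peanut butter and".
scores = {"jelly": 4.0, "sugar": 2.0, "peanut": 1.5, "chocolate": 1.0, "cream": 0.5}
probs = softmax(scores)

# Print candidates from most to least likely.
for token, p in sorted(probs.items(), key=lambda kv: -kv[1]):
    print(f"{token:>10}: {p:.3f}")
```

Whatever the scores are, the resulting probabilities are positive and sum to 1, which is what allows them to be chained across tokens.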
To estimate how likely a model is to complete a longer phrase, the researchers could take the probability of each token given the tokens before it and multiply those probabilities together. For example, the source article gives a made-up calculation for “peanut butter and jelly”:
- The probability of “peanut” after “My favorite sandwich is” is 20 percent.
- The probability of “butter” after “My favorite sandwich is peanut” is 90 percent.
- The probability of “and” after “My favorite sandwich is peanut butter” is 80 percent.
- The probability of “jelly” after “My favorite sandwich is peanut butter and” is 70 percent.
Multiplying those values gives 0.2 * 0.9 * 0.8 * 0.7 = 0.1008, or about 10 percent. The same logic can be applied to longer passages, including 50-token excerpts from books.
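The arithmetic above can be sketched in a few lines of Python. The first function reproduces the sandwich calculation. The second is a hypothetical helper showing how the same product might be applied to every 50-token window of a book to arrive at a "percent memorized" figure. The function names, the one-token stride, and the toy input are assumptions made for illustration, not the study's actual procedure.

```python
import math

def sequence_probability(token_probs):
    """Probability of a whole sequence: the product of its per-token probabilities."""
    # Summing logs is equivalent to multiplying and avoids underflow on long runs.
    # Assumes every probability is strictly greater than zero.
    return math.exp(sum(math.log(p) for p in token_probs))

def fraction_memorized(token_probs, window=50, threshold=0.5, stride=1):
    """Share of windows whose whole-window probability is at least `threshold`.

    Hypothetical helper: window placement and stride are assumptions.
    """
    starts = range(0, len(token_probs) - window + 1, stride)
    if len(starts) == 0:
        return 0.0
    hits = sum(
        1 for s in starts
        if sequence_probability(token_probs[s:s + window]) >= threshold
    )
    return hits / len(starts)

# The made-up sandwich example: peanut, butter, and, jelly.
sandwich = [0.2, 0.9, 0.8, 0.7]
print(round(sequence_probability(sandwich), 4))  # 0.1008

# Toy "book": 100 very predictable tokens followed by 100 coin-flip tokens.
toy_book = [0.99] * 100 + [0.5] * 100
print(fraction_memorized(toy_book))
```

In the toy input, only windows that sit entirely inside the predictable stretch clear the 50 percent threshold, because a single uncertain token is enough to pull a window's probability below one half.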
This method let the researchers estimate rare outcomes without generating huge numbers of samples. The authors estimated that some 50-token sequences would require more than 10 quadrillion samples to reproduce exactly through repeated generation. By using token probabilities, they could still estimate those odds directly.
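To see why sampling is impractical here, note that if a sequence has probability p, the average number of independent generations needed to see it once is 1/p. The sketch below works in log space so that very small probabilities do not underflow. The per-token probability of 0.47 is a hypothetical value chosen for illustration and does not come from the study.

```python
import math

def expected_samples(log_prob):
    """Average number of independent generations before one exact match (1 / p)."""
    return math.exp(-log_prob)

def per_token_prob_needed(target=0.5, length=50):
    """Uniform per-token probability at which a `length`-token sequence reaches `target`."""
    return target ** (1.0 / length)

# Hypothetical 50-token passage in which every token has probability 0.47.
log_p = 50 * math.log(0.47)
print(f"expected samples: {expected_samples(log_p):.2e}")

# How predictable each token must be for a 50-token excerpt to reach 50 percent.
print(f"per-token probability needed: {per_token_prob_needed():.4f}")
```

Under these assumptions, a passage whose tokens are each slightly less likely than a coin flip would already take more than 10 quadrillion samples on average to reproduce, while a passage that clears the 50 percent bar needs its tokens to be predicted with close to 99 percent probability each.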
What it means for AI copyright lawsuits
The findings cut in more than one direction. For critics of AI companies, the study provides evidence that memorization can be more than a fringe behavior, at least for some models and some books. A model that can reproduce large portions of a copyrighted work creates a concrete issue for courts to consider.
But the same study may also help defendants in some cases. The researchers found much lower memorization for other books. For example, Llama 3.1 70B memorized only 0.13 percent of Sandman Slim, a 2009 novel by Richard Kadrey.
That contrast could matter because Kadrey is the lead plaintiff in a class-action lawsuit against Meta. To certify a class, a court must find that plaintiffs are in largely similar legal and factual situations.
If one author’s book is heavily memorized while another author’s book is barely reproduced, that difference could make class treatment harder to justify. The source article notes that such divergence could work in Meta’s favor because most authors do not have the resources to bring individual lawsuits.
The bigger lesson
The broader point is that AI copyright cases may turn on facts that can be measured, not only broad arguments about whether models copy or learn. The study suggests that answers may depend on the specific model, the specific copyrighted work, and the specific method used to test reproduction.
That makes the debate more complicated, but also more grounded. Instead of treating memorization as a purely theoretical issue, researchers can test whether a model is likely to reproduce exact text from a particular book. For courts, companies, and authors, those details may prove difficult to ignore.