TechCrunch AI January 16, 2025 NEUTRAL

How YouTube entered Zuckerberg's AI copyright defense

Newly released deposition snippets show Mark Zuckerberg comparing Meta's use of disputed training data to YouTube hosting some pirated material while trying to remove it. The testimony is part of Kadrey v. Meta Platforms, an AI copyright case focused on LibGen, Llama, and fair use.

This is mainly a legal and copyright dispute over AI training data rather than a clear shift toward dangerous autonomy or societal dependence.

Mark Zuckerberg's defense of Meta's AI training practices is now partly tied to an unexpected comparison: YouTube. In newly released snippets from a deposition given late last year, the Meta CEO used YouTube's long-running problem with pirated uploads to explain why he viewed a blanket ban on certain datasets as too broad.

The deposition is part of Kadrey v. Meta Platforms, one of several AI copyright cases moving through the U.S. court system. The core dispute is familiar across these lawsuits: AI companies say training on copyrighted content can qualify as fair use, while many authors and other IP holders disagree.

Why YouTube became part of the argument

In the deposition excerpts, Zuckerberg appeared to draw a distinction between a platform or dataset that contains some infringing material and one that should be rejected entirely. He said YouTube may host pirated content for some period of time while trying to remove it, and that he would assume most material on YouTube is legitimate and licensed.

That point was raised as plaintiffs' lawyers questioned Meta's use of LibGen, a dataset of e-books that Meta allegedly used to train at least one of its Llama models. Llama is Meta's family of AI models and competes with flagship models from companies including OpenAI.

Zuckerberg's position, based on the snippets released, was not that copyright risk should be ignored. He stated that Meta should be "pretty careful about" training on copyrighted material. He also said that if a website was intentionally trying to violate people's rights, Meta would want to be cautious about it, or possibly prevent its teams from engaging with it.

The LibGen issue at the center of the case

LibGen describes itself as a "links aggregator" and provides access to copyrighted works from publishers including Cengage Learning, Macmillan Learning, McGraw Hill, and Pearson Education. The source article says LibGen has been sued a number of times, ordered to shut down, and fined tens of millions of dollars for copyright infringement.

According to court filings unsealed this week, Zuckerberg allegedly cleared the use of LibGen to train at least one Llama model despite concerns among Meta's AI executives and researchers about the legal implications. Plaintiffs' counsel, representing authors including Sarah Silverman and Ta-Nehisi Coates, quoted Meta employees describing LibGen as a "data set we know to be pirated" and warning that using it "may undermine [Meta's] negotiating position with regulators."

During the deposition, however, Zuckerberg said he "hadn't really heard of" LibGen. Asked by plaintiffs' attorney David Boies for his opinion of it, he said he did not have knowledge of that specific thing.

What the amended complaint alleges

The plaintiffs have amended their complaint several times since the case was filed in 2023 in the U.S. District Court for the Northern District of California, San Francisco Division. The latest amended complaint, filed late Wednesday, added new allegations about how Meta evaluated and used pirated books.

Among the allegations, lawyers claim Meta cross-referenced certain pirated books in LibGen with copyrighted books that were available for license. The complaint alleges Meta used that comparison to decide whether pursuing a licensing agreement with a publisher made sense.

The amended filing also alleges that Meta used LibGen to train Llama 3 and is using the dataset to train its next-generation Llama 4 models. Plaintiffs further allege that Meta researchers tried to obscure the fact that Llama models had been trained on copyrighted material by inserting "supervised samples" into Llama's fine-tuning.

Another allegation involves Z-Library, also known as Z-Lib. The amended complaint says Meta downloaded pirated e-books from that source for Llama training as recently as April 2024. Z-Library has faced legal actions from publishers, including domain seizures and takedowns, and in 2022 the Russian nationals who allegedly maintained it were charged with copyright infringement, wire fraud, and money laundering.

Why the deposition matters

The released excerpts offer only a partial view of Zuckerberg's testimony. The full transcript was not released, and TechCrunch said it had reached out to Meta for additional context.

Still, the snippets show the shape of Meta's argument in plain terms. Zuckerberg appeared to resist the idea that the presence of copyrighted material should automatically make a dataset unusable, while also acknowledging that Meta should be careful when a source is intentionally violating rights.

That tension is central to Kadrey v. Meta Platforms and to other AI copyright cases now moving through the courts. For AI companies, the question is whether training on copyrighted content can be defended as fair use. For authors and other rights holders, the concern is that copyrighted books and other protected works were used to build commercial AI systems without permission.

The YouTube comparison matters because it frames the debate as a question of proportion: Zuckerberg's example suggests that some infringing content inside a larger source may not, by itself, justify a total ban. The plaintiffs' allegations, by contrast, focus on what Meta allegedly knew about LibGen, how employees described it internally, and whether the company used that material while weighing licensing options.