TechCrunch AI November 23, 2024

Search data dispute clouds OpenAI copyright lawsuit

Lawyers for The New York Times and Daily News say OpenAI engineers erased search data tied to their review of AI training datasets. OpenAI denies deleting evidence and says a configuration change requested by the plaintiffs caused a technical issue.

A technical dispute has become a new flashpoint in the copyright lawsuit brought by The New York Times and Daily News against OpenAI. The publishers say search work they performed on OpenAI-provided virtual machines became unusable after data was erased, while OpenAI says no evidence was deleted.

What the publishers say happened

The New York Times and Daily News are suing OpenAI over allegations that their works were scraped and used to train AI models without permission. As part of the case, OpenAI agreed earlier this fall to provide two virtual machines so counsel for the publishers could search for copyrighted content in its AI training sets.

Virtual machines are software-based computers that run inside another computer’s operating system. In this case, they were used as controlled environments for reviewing training data and searching for material that the publishers believe may be relevant to their claims.

According to a letter filed in the U.S. District Court for the Southern District of New York late Wednesday, attorneys for the publishers and experts they hired spent over 150 hours since November 1 searching OpenAI’s training data. The letter says OpenAI engineers erased all of the publishers’ search data stored on one of the virtual machines on November 14.

OpenAI tried to recover the data and was mostly successful, according to the publishers’ letter. But the letter says the recovered material could not be used for the same purpose because the folder structure and file names were lost. The practical result, according to the publishers, was that an entire week of work had to be redone.

Why the missing structure matters

The dispute is not only about whether files existed after recovery. The publishers’ position is that the organization of the recovered material mattered because it could help show where copied articles appeared in the datasets used to build OpenAI’s models.

Without the original folder structure and file names, the recovered data may still exist in some form but lose key context. For litigation involving AI training data, that context can be central: search results need to be tied back to locations, files, and dataset organization so lawyers and experts can explain what they found.
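To make the point about lost context concrete, here is a minimal Python sketch of one general way to keep file organization recoverable: record each file's content hash alongside its relative path, then use the hashes to relabel files whose names were lost. This is an illustration only, not a description of what either party did or of the tooling on OpenAI's machines. The function names (build_manifest, relabel_recovered) and the directory layout are hypothetical, and the sketch assumes no two files share identical content.

```python
from __future__ import annotations

import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's content hash to its path relative to root.

    Hypothetical helper: assumes file contents are unique, so a later
    file with identical bytes would overwrite an earlier entry.
    """
    manifest: dict[str, str] = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            manifest[sha256_of(path)] = path.relative_to(root).as_posix()
    return manifest


def relabel_recovered(recovered_root: Path, manifest: dict[str, str]) -> dict[str, str | None]:
    """Look up each recovered file's original relative path by content hash.

    Returns a mapping from the recovered file name to the original path,
    or None when the content does not appear in the manifest.
    """
    return {
        path.name: manifest.get(sha256_of(path))
        for path in sorted(recovered_root.rglob("*"))
        if path.is_file()
    }
```

In this sketch, the manifest would have to be written somewhere other than the drive being reviewed before any loss occurs. Without such a record, recovered files carry their contents but not the names and locations that tie a search hit back to a specific place in a dataset.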

The publishers’ lawyers said they had no reason to believe the deletion was intentional. Even so, they argued that the incident shows OpenAI is better positioned to search its own datasets using its own tools.

That point goes to a broader issue in the case. The publishers are trying to determine whether their copyrighted content was included in training data. OpenAI has neither confirmed nor denied that it trained its AI systems on any specific copyrighted works without permission.

OpenAI’s response

An OpenAI spokesperson declined to provide a statement. But late Friday, November 22, counsel for OpenAI filed a response to the publishers’ letter.

OpenAI’s attorneys denied that OpenAI deleted any evidence. They said the issue followed a configuration change that the plaintiffs requested for one of several machines OpenAI had provided to search training datasets.

According to OpenAI’s response, implementing that requested change removed the folder structure and some file names on one hard drive. OpenAI’s lawyers described that drive as one that was supposed to be used as a temporary cache, and said there was no reason to think files were actually lost.

The disagreement leaves the parties with different accounts of the same technical event. The publishers focus on the loss of usable search work and the need to recreate it. OpenAI focuses on whether evidence itself was deleted and on the plaintiffs’ role in requesting the configuration change.

The broader copyright fight

The data dispute sits inside a larger legal and business conflict over AI training. In this case and others, OpenAI has maintained that training models using publicly available data, including articles from The Times and Daily News, is fair use.

OpenAI’s position is that models such as GPT-4o learn from billions of examples of e-books, essays, and more to generate human-sounding text. Under that view, OpenAI says it does not need to license or pay for those examples, even when it makes money from models trained on them.

At the same time, OpenAI has signed licensing deals with a growing number of news publishers. The list includes the Associated Press, Business Insider owner Axel Springer, Financial Times, People parent company Dotdash Meredith, and News Corp.

OpenAI has not made the terms of those deals public. One content partner, Dotdash Meredith, is reportedly being paid at least $16 million per year.

What is at stake now

The immediate issue is whether the publishers can efficiently recreate the search work they say was lost and whether OpenAI should play a larger role in searching its own datasets. The publishers say they have been forced to restart work that required significant person-hours and computer processing time.

For readers following the OpenAI copyright lawsuit, the incident highlights how much the case depends on access to technical systems and the ability to preserve context around data. The legal arguments may focus on copyright and fair use, but the day-to-day progress of the case can turn on file names, folder structures, virtual machines, and search records.

The update also shows how contested the discovery process has become. The publishers describe a setback to their review of training data. OpenAI says evidence was not deleted and points instead to a technical issue caused by a requested configuration change.