WIRED AI December 12, 2024 NEUTRAL

Why Harvard’s free AI training dataset matters now

Harvard University is releasing a dataset of nearly 1 million public-domain books for AI training, backed by Microsoft and OpenAI. The project arrives as lawsuits over copyrighted training data continue and public-domain alternatives gain attention.

A public-domain training dataset may modestly expand AI capability but is mainly a legal and access-focused infrastructure story.

Harvard University is putting a large new AI training dataset into public reach: nearly 1 million public-domain books, assembled for use in large language models and other AI tools.

The project was created by Harvard’s newly formed Institutional Data Initiative with funding from Microsoft and OpenAI. It is built from books scanned through the Google Books project that are no longer protected by copyright.

A large public-domain library for AI

The Institutional Data Initiative’s database is around five times the size of the Books3 dataset, which was used to train AI models like Meta’s Llama. The Harvard collection spans genres, decades, and languages, mixing well-known literary works with more specialized materials.

The source article says the dataset includes classics from Shakespeare, Charles Dickens, and Dante alongside obscure Czech math textbooks and Welsh pocket dictionaries. That range matters because AI developers often need broad, varied text collections to train or evaluate language systems.

Greg Leppert, executive director of the Institutional Data Initiative, framed the project as an effort to “level the playing field.” His argument is that small AI companies, independent researchers, and the wider public rarely have access to the kind of refined content repositories that major technology companies can build.

“It's gone through rigorous review,” Leppert said of the dataset.

The key promise is not simply volume. It is access to a curated public-domain dataset that can be used without relying on copyrighted books still under protection.

Why Microsoft and OpenAI are backing it

Microsoft and OpenAI funded the Harvard project, but the source does not describe it as a wholesale replacement for the training data used by major AI companies. Leppert said the public-domain books could be combined with other licensed materials to build artificial intelligence models.

He compared the dataset’s role to that of Linux, a foundation used widely around the world. In that view, a shared public-domain base could support many AI projects, while companies would still need additional training data to make their models distinct.

Burton Davis, Microsoft’s vice president and deputy general counsel for intellectual property, connected the company’s support to the idea of “pools of accessible data” for AI startups that are “managed in the public’s interest.” He also said, “We use publicly available data for the purposes of training our models.”

Tom Rubin, OpenAI's chief of intellectual property and content, described OpenAI as “delighted” to support the project in a statement.

Copyright pressure is reshaping AI training

The Harvard dataset is emerging while dozens of lawsuits over copyrighted data and AI training move through the courts. The outcome could shape how artificial intelligence tools are built.

If AI companies win those cases, they may be able to continue scraping the internet without licensing agreements with copyright holders. If they lose, they could have to overhaul how models are made.

That uncertainty helps explain the interest in public-domain datasets. Projects like Harvard’s are advancing on the assumption that there will be demand for training materials that do not create the same copyright risks.

The book dataset is not the Institutional Data Initiative’s only project. It is also working with the Boston Public Library to scan millions of articles from newspapers now in the public domain, and it is open to similar collaborations in the future.

The Institutional Data Initiative has asked Google to work with it on public distribution of the book dataset, but the details are still being worked out. Kent Walker, Google’s president of global affairs, said in a statement that Google was “proud to support” the project.

A wider market for safer training data

The Harvard database is part of a broader movement toward substantial, high-quality AI training materials that reduce copyright exposure. Some efforts focus on licensing and compensation, while others focus on public-domain collections.

Firms like Calliope Networks and ProRata have emerged to issue licenses and manage compensation schemes for creators and rights holders who provide AI training data.

There are public-domain efforts as well. Last spring, the French AI startup Pleias rolled out Common Corpus, a public-domain dataset containing an estimated 3 to 4 million books and periodical collections, according to project coordinator Pierre-Carl Langlais. Backed by the French Ministry of Culture, Common Corpus has been downloaded more than 60,000 times this month alone on Hugging Face.

Last week, Pleias announced its first set of large language models trained on that dataset. Langlais told WIRED they are the first models “ever trained exclusively on open data and compliant with the [EU] AI Act.”

Image datasets are moving in a similar direction. AI startup Spawning released Source.Plus this summer, using public-domain images from Wikimedia Commons and from museums and archives. The source article also notes that cultural institutions such as the Metropolitan Museum of Art in New York have long made their archives accessible to the public as standalone projects.

The unresolved question

Supporters of public-domain datasets see them as evidence that AI systems do not have to depend on copyrighted material scraped without permission. Ed Newton-Rex, a former Stability AI executive who now runs a nonprofit that certifies ethically trained AI tools, said the rise of these datasets challenges the claim that copyrighted work is necessary for capable AI.

OpenAI previously told lawmakers in the United Kingdom that it would be “impossible” to create products like ChatGPT without using copyrighted works. Newton-Rex responded that “Large public domain datasets like these further demolish the 'necessity defense' some AI companies use to justify scraping copyrighted work to train their models,” according to the source.

But he also warned that the impact depends on how these datasets are used. If public-domain materials replace scraped copyrighted work, they could change the direction of AI training. If they are merely added to datasets that still include unlicensed creative work, he said, they will mostly benefit AI companies.

That is the central tension around Harvard’s release. A public-domain dataset of nearly 1 million books can expand access and lower risk, but its broader effect will depend on whether developers treat it as a substitute for contested data or simply as another ingredient.