TechCrunch AI October 4, 2024 NEUTRAL

California AI training disclosures put vendors on the spot

California's AB 2013 requires generative AI developers to publish high-level summaries of the data used to train their systems. TechCrunch found that only Stability, Runway and OpenAI said they would comply, while many major companies did not answer or declined to comment.

WTF Index NEUTRAL

Terminator 1, Idiocracy 0

This is mainly a transparency and compliance story, with only mild concern about opaque training data and personal information.

California's AB 2013 is aimed at a question that has followed generative AI from the start: what data was used to build these systems? Governor Gavin Newsom signed the bill on Sunday, and it requires companies developing generative AI systems to publish a high-level summary of their training data.

The law does not force immediate disclosure. But it sets up a deadline that could make training data transparency a central test for AI companies, especially those already facing pressure over copyrighted material, personal information and web scraping.

What AB 2013 Requires

AB 2013 requires companies to describe the data used to train generative AI systems. The summary must include who owns the data, how the data was procured or licensed, and whether the data includes copyrighted or personal information.

The requirement applies to systems released in or after January 2022. That includes systems such as ChatGPT and Stable Diffusion, according to the source article. Companies have until January 2026 to begin publishing the summaries.

The law applies to systems made available to Californians. Companies can still decide where and how they offer particular models, but the disclosure obligation is broad for any system that falls within the law's scope.

AB 2013 also reaches beyond the original developer of a model. Any entity that "substantially modifies" an AI system, including through fine-tuning or retraining, must publish information about the data it used. The law has carve-outs, but the source article says they mostly concern AI systems used in cybersecurity and defense, including systems used for "the operation of aircraft in the national airspace."

Who Said They Would Comply

TechCrunch contacted major AI companies and startups about whether they would comply with AB 2013. The list included OpenAI, Anthropic, Microsoft, Google, Amazon, Meta, Stability AI, Midjourney, Udio, Suno, Runway and Luma Labs.

Fewer than half responded. Microsoft explicitly declined to comment. Only Stability, Runway and OpenAI told TechCrunch that they would comply with the law.

OpenAI's response was direct: "OpenAI complies with the law in jurisdictions we operate in, including this one," an OpenAI spokesperson said. Stability also signaled support for regulation, with a spokesperson saying the company is "supportive of thoughtful regulation that protects the public while at the same time doesn’t stifle innovation."

The silence from many companies does not prove they will refuse to comply. The law's main disclosure requirements are not yet in effect. Still, the limited response shows how sensitive the subject has become for companies building generative AI systems.

Why Training Data Is So Sensitive

Training data often comes from the web. AI vendors have scraped large volumes of images, songs, videos and other material from websites, then used that material to train generative AI systems.

Earlier in the field, companies more often disclosed their data sources in technical papers released with models. Google, for example, once said it trained an early version of Imagen on the public LAION data set. Older papers also referred to The Pile, an open-source collection of training text that includes academic studies and codebases.

That approach has changed as the market has become more competitive. Companies now treat the composition of their training data sets as a competitive advantage and cite it as one reason for keeping details private.

Legal exposure is another reason. The source article notes that LAION links to copyrighted and privacy-violating images, while The Pile contains Books3, a library of pirated works by Stephen King and other authors. A disclosure that identifies data sources can help the public understand how a system was built, but it may also create new legal risk for the company that publishes it.

The Legal Pressure Around AI Training

Training data is already at the center of lawsuits. Authors and publishers claim that OpenAI, Anthropic and Meta used copyrighted books, including some from Books3, for training. Music labels have sued Udio and Suno over alleged training on songs without compensating musicians.

Artists have also filed class-action lawsuits against Stability and Midjourney, arguing that data scraping practices amounted to theft. These disputes make AB 2013 more consequential, because the law requires public information about training data that companies may prefer to keep confidential.

The disclosures must include details such as when data sets were first used and whether data collection is ongoing. For companies already trying to limit courtroom risk, those facts could matter.

Many vendors argue that fair use gives them legal cover. They are making that argument in court and in public statements. The source article also says some companies, including Meta and Google, changed platform settings and terms of service to allow more user data to be used for training.

Other reporting has added to the scrutiny. Reuters reported that Meta at one point used copyrighted books for AI training despite its own lawyers' warnings. The source article also says there is evidence that Runway sourced Netflix and Disney movies to train video-generating systems, and that OpenAI reportedly transcribed YouTube videos without creators' knowledge to develop models, including GPT-4.

What Could Happen Next

AB 2013 could make AI training data disclosure a practical compliance issue, not just a public debate. If the law is not challenged or stayed, companies covered by it will need to decide how to publish summaries by the January 2026 deadline.

The outcome may depend partly on the courts. The source article notes that courts could side with the vendors' fair use arguments and decide that generative AI is sufficiently transformative, rather than a plagiarism system of the kind The New York Times and other plaintiffs allege.

There are also more dramatic possibilities. Some vendors could withhold certain models in California. Others could release versions for Californians that are trained only on fair use and licensed data sets. For companies concerned about lawsuits, either option may look safer than publishing details about contested training data that could invite new claims.

For users, developers and rights holders, the core issue is simple. AB 2013 asks AI companies to explain, at a high level, what went into the systems they sell or make available. The answers, when they arrive, could show how much of the generative AI industry is ready to make training data transparency part of doing business.