MIT Tech Review AI August 22, 2024

Why open-source AI now has a working definition

The Open Source Initiative has released its first definition for open-source AI after convening a 70-person group. The definition focuses on permission-free use, inspection, modification, sharing, and a practical level of transparency around training data, source code, and weights.

Open-source AI has become a powerful label in artificial intelligence, but until now it has lacked a shared meaning. That gap has made it harder for researchers, users, and lawmakers to judge whether an AI model is genuinely open or simply marketed that way.

The Open Source Initiative (OSI) has now released a new definition for open-source AI. The group hopes the standard will help lawmakers develop regulations that protect consumers from AI risks while giving the industry clearer language for describing AI systems.

What the new definition says

OSI has long published guidance on open-source technology in other fields, but this is its first attempt to define the term for AI models. To build the working definition, it asked a 70-person group of researchers, lawyers, policymakers, and activists to participate, along with representatives from big tech companies like Meta, Google, and Amazon.

At the center of the definition is the idea that an open-source AI system should be broadly usable and inspectable. According to the group, such a system can be used for any purpose without needing permission. Researchers should also be able to inspect its components and study how it works.

The definition also says an open-source AI system should be modifiable for any purpose, including changes to its output. Users should be able to share the system with others, with or without modifications, for any purpose.

In practical terms, the definition tries to cover several parts of an AI model that matter for openness:

Training data, or information about the data used to create the system.
Source code, which helps researchers understand how the system was built.
Weights, the parameters that help determine how an AI model generates an output.

That combination matters because AI systems are not open in the same way as many older software projects. A model may be available to download and adapt, but questions can remain about its license, its data, and whether outside researchers can meaningfully study it.

Why the label has been contested

The absence of a standard created room for disagreement. The source article draws a clear contrast between companies that keep models, data sets, and algorithms secret and companies that make models freely accessible. OpenAI and Anthropic are described as closed source because of their decisions to keep those elements secret.

But the harder debate involves models that appear more open. Some experts argue that Meta and Google’s freely accessible models are still not truly open source. Their concerns include licenses that restrict what users can do with the models and training data sets that are not made public.

Meta, Google, and OpenAI were contacted for their response to the new definition, but they did not reply before publication.

The stakes are not only technical. Avijit Ghosh, an applied policy researcher at Hugging Face, says companies have been known to misuse the term when marketing their models. A model described as open source may appear more trustworthy, even when researchers cannot independently verify whether it deserves that description.

That trust problem is central to the new definition. If open-source AI is going to be used as a signal of transparency or accountability, the term needs boundaries. Without them, the same label can be applied to systems with very different levels of openness.

The difficult question of training data

Ayah Bdeir, a senior advisor to Mozilla and a participant in OSI’s process, says some parts of the definition were easier to settle. Revealing model weights was one of the areas where agreement was relatively straightforward.

Training data was more contentious. The source article notes that a lack of transparency about where training data comes from has helped drive numerous lawsuits against big AI companies. The issue spans makers of large language models, like OpenAI, and makers of music generators, like Suno, which do not disclose much about their training sets beyond saying they contain “publicly accessible information.”

Some advocates want open-source models to disclose all training sets. Bdeir says that standard would be difficult to enforce because of issues like copyright and data ownership.

The definition settles on a more practical requirement. Open-source models must provide information about training data to the extent that “a skilled person can recreate a substantially equivalent system using the same or similar data.”

That is not the same as requiring every training data set to be shared in full. But it also asks for more transparency than many proprietary models, and even many models described as open source, currently provide. The result is a compromise between ideal openness and a standard that could actually be met.

What happens next

OSI is not only publishing a definition. Bdeir says the group is planning some sort of enforcement mechanism to flag models described as open source that do not meet the definition.

OSI also plans to release a list of AI models that do meet the new definition. None has been confirmed yet. However, Bdeir told MIT Technology Review that the handful of models expected to land on the list are relatively small names, including Pythia by EleutherAI, OLMo by Ai2, and models by the open-source collective LLM360.

That detail is important because it suggests the definition may not simply validate the biggest names using open-source language today. Instead, it could create a clearer distinction between models that are broadly available and models that meet a fuller standard for use, study, modification, sharing, and transparency.

For lawmakers, the definition offers a vocabulary for regulation. For researchers, it offers criteria for evaluating claims. For companies, it raises the bar for using the open-source AI label in public.

The larger effect will depend on adoption and enforcement. But the immediate shift is clear: open-source AI now has a working definition, and that makes the term harder to use casually.