TechCrunch AI October 28, 2024 NEUTRAL

What OSI’s new open source AI test asks models to reveal

The Open Source Initiative has released version 1.0 of its Open Source AI Definition, a standard meant to clarify when an AI model can legitimately be called open source. The definition centers on model design transparency, training data details, and user freedoms, but Meta and others dispute parts of the approach.

WTF Index NEUTRAL

Terminator: 1, Idiocracy: 0

This is mainly a governance and standards story about AI transparency, with little direct drift toward danger or societal deskilling.

The phrase open source AI now has a formal test from the Open Source Initiative. The group has released version 1.0 of its Open Source AI Definition (OSAID), after several years of collaboration with academia and industry.

The goal is practical: give developers, policymakers, companies, and the broader AI community a shared way to judge whether a model is truly open source or only described that way.

What the OSI definition requires

Under the OSAID, an AI model must provide enough information about its design for a person to substantially recreate it. That requirement puts the focus on the full build process, not just whether model files are available to download.

The definition also expects disclosure of relevant training data details. That includes where the data came from, how it was processed, and how it can be obtained or licensed.

Stefano Maffulli, the OSI EVP, described the standard in terms of understanding how a model was built. In his explanation, that means access to components such as the complete code used for training and data filtering.

The OSAID also describes the rights that should come with open source AI. Developers should be able to use the model for any purpose, modify it without asking permission, and build on top of it.

Why the definition matters now

The question of what counts as open source AI is not only technical. According to Maffulli, regulators are already paying attention to the space, and bodies like the European Commission have sought to give special recognition to open source.

That makes a shared definition important. If public officials, AI companies, developers, and researchers use the same phrase to mean different things, policy and product claims can quickly become unclear.

The OSI has no direct enforcement power. It cannot force developers to follow the OSAID. Instead, it intends to identify models marketed as open source when they do not meet the definition.

The theory is that community pressure can still matter. If a company uses the term in a way the AI community rejects, the OSI hopes that claim will be challenged and corrected.

Where current AI releases fall short

Many startups and major technology companies have used open source language for AI model releases. The source article notes that few meet the OSAID criteria.

Meta is the most prominent example discussed. Its Llama models include a condition requiring platforms with more than 700 million monthly active users to request a special license.

Maffulli has criticized Meta’s decision to describe its models as open source. He said Google and Microsoft agreed, after discussions with the OSI, to stop using the term for models that are not fully open, while Meta has not.

Other companies use restrictions as well. Stability AI requires businesses making more than $1 million in revenue to obtain an enterprise license. Mistral’s license bars the use of certain models and outputs for commercial ventures.

A study published in August 2023 by researchers at the Signal Foundation, the nonprofit AI Now Institute, and Carnegie Mellon found that many models labeled open source are open mainly in name. The study pointed to hidden training data, compute demands that many developers cannot meet, and fine-tuning techniques that are difficult to use.

Meta’s objection and the training data problem

Meta disagrees with the OSAID as written, even though it participated in the drafting process. A company spokesperson defended the Llama license and its acceptable use policy as guardrails against harmful deployments.

Meta also said it is taking a cautious approach to sharing details about models and training data as regulations such as California’s training transparency law evolve.

The company argued that there is no single definition of open source AI and that older open source definitions do not capture the complexity of today’s AI models. It also pointed to other efforts, including definitions suggested by the Linux Foundation, the Free Software Foundation’s criteria for free machine learning applications, and proposals from other AI researchers.

The source article also notes a tension in Meta’s position: Meta is one of the companies funding the OSI’s work, alongside Amazon, Google, Microsoft, Cisco, Intel, and Salesforce. The OSI recently secured a grant from the nonprofit Sloan Foundation to reduce reliance on tech industry backers.

Training data is central to the disagreement. AI companies often train models on large collections of images, audio, videos, and other material from social media and websites, which they describe as publicly available data. They also tend to treat the assembly and refinement of those datasets as a competitive advantage, which discourages disclosure.

Legal pressure adds another reason for nondisclosure. Authors and publishers claim Meta used copyrighted books for training. Artists have filed suits against Stability AI over scraping and reproduction of their work without credit.

What remains unsettled

Some critics argue the OSAID still leaves important questions open. Luca Antiga, the CTO of Lightning AI, said a definition should give businesses reasonable confidence that what is being licensed can be used in the way an organization intends.

His concern is training data licensing. A model may meet the OSAID requirements even if the data used to train it is not freely available. That raises a practical question: how open is a model if inspecting the underlying data depends on access to private licensed collections?

Version 1.0 also does not resolve copyright questions around AI models. The source article notes that it is not yet clear whether models or their components can be copyrighted under current IP law. If courts decide they can be, the OSI suggests new legal instruments may be needed to properly open source IP-protected models.

Maffulli acknowledged that the definition will need updates. The OSI has created a committee to monitor how the OSAID is applied and to propose amendments for future versions.

For now, the definition gives the AI industry a clearer benchmark. It does not end the debate over open source AI, but it makes the next argument more specific: what must a model reveal, what rights must users receive, and when does a label overpromise what the release actually provides?