WIRED AI September 25, 2024

Molmo opens a wider path to open source AI agents

The Allen Institute for AI has released Molmo, an open source multimodal AI model that can understand images and interact through chat. Its visual abilities could help more developers build AI agents that operate on computer screens, though reliability and misuse remain major concerns.


Open source visual AI for computer-operating agents modestly increases autonomy and misuse risk, though the story is mostly about access and capability.

Molmo, a new open source AI model from the Allen Institute for AI (Ai2), gives developers a fresh route into one of the most closely watched areas of artificial intelligence: AI agents that can act on computers, not just answer questions.

The model, formally called the Multimodal Open Language Model, can interpret images and communicate through a chat interface. That combination matters because many useful computer tasks depend on understanding what is visible on a screen.

Why Molmo matters for AI agents

AI agents are often described as the next major step beyond chatbots. The goal is a system that can take a command and then reliably carry out complicated actions on a computer. In practice, that broad capability has not yet arrived at scale.

Visual understanding is one reason the problem is difficult. An agent that browses the web, moves through file directories, or drafts documents needs more than text generation. It must be able to make sense of menus, buttons, pages, folders, and other visual context.

Molmo is designed for that kind of environment. Because it can interpret images, it could help software agents understand a computer screen and decide what to do next. That makes the release relevant not only to AI researchers, but also to startups and developers trying to build practical tools around agents.
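The idea of an agent that looks at a screen and decides what to do next can be sketched as a simple observe, decide, act loop. This is a minimal illustration only, not how Molmo or any Ai2 software is actually invoked. The names `Action`, `run_agent`, `fake_capture`, and `fake_model` are hypothetical, and the model call is replaced by a scripted stand-in so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Action:
    kind: str         # hypothetical action type: "click", "type", or "done"
    target: str = ""  # hypothetical label of the on-screen element


def run_agent(
    goal: str,
    capture_screen: Callable[[], bytes],
    choose_action: Callable[[str, bytes, List[Action]], Action],
    perform: Callable[[Action], None],
    max_steps: int = 10,
) -> List[Action]:
    """Observe the screen, ask the model for one action, perform it, repeat."""
    history: List[Action] = []
    for _ in range(max_steps):
        screenshot = capture_screen()                     # observe
        action = choose_action(goal, screenshot, history)  # decide
        history.append(action)
        if action.kind == "done":
            break
        perform(action)                                   # act
    return history


# Stand-ins (assumptions) so the sketch runs without a real model or screen.
def fake_capture() -> bytes:
    return b"fake-pixels"


def fake_model(goal: str, screenshot: bytes, history: List[Action]) -> Action:
    script = [Action("click", "File menu"), Action("click", "Save"), Action("done")]
    return script[len(history)]


performed: List[Action] = []
steps = run_agent("save the document", fake_capture, fake_model, performed.append)
for step in steps:
    print(step.kind, step.target)
```

In a real agent, the stand-in for `choose_action` would be a call to a multimodal model that receives the screenshot, and the `max_steps` cap is one simple guard against an agent that never decides it is finished.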

Ai2 CEO Ali Farhadi described the release as a way to broaden access to multimodal AI. “With this release, many more people can deploy a multimodal model,” he says. “It should be an enabler for next-generation apps.”

Open source changes who can experiment

Several major commercial AI models already have visual capabilities. The source article names GPT-4 from OpenAI, Claude from Anthropic, and Gemini from Google DeepMind as examples. These models can support experimental AI agents, but access is through a paid API and the systems themselves remain hidden from view.

That difference is central to Molmo’s appeal. An open source multimodal model gives outside builders more room to inspect, adapt, and modify the technology behind an agent. Ofir Press, a postdoc at Princeton University who works on AI agents, says that “any startup or researcher that has an idea can try to do it.”

Fine-tuning is one practical example. Press says an open model can be adjusted more easily for specific tasks, such as working with spreadsheets, by adding training data. By contrast, models such as GPT-4 can only be fine-tuned to a limited degree through their APIs.

For developers, that flexibility can affect what kinds of products are possible. A team building a narrow agent for a particular workflow may need more control than a closed commercial API allows. Molmo gives them a model they can work with more directly.

What Ai2 is releasing

Ai2 is releasing several sizes of Molmo. The source article identifies a 70-billion-parameter model and a 1-billion-parameter model small enough to run on a mobile device. Parameters are the units a model uses to store and manipulate data, and the parameter count roughly corresponds to capability.
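A rough calculation shows why parameter count matters for where a model can run. The sketch below assumes each parameter is stored as a 16-bit number (2 bytes) and counts only the weights, ignoring working memory and runtime overhead. Those assumptions are illustrative and are not taken from Ai2's release. The function name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(num_parameters: float, bytes_per_parameter: float = 2.0) -> float:
    """Approximate memory for model weights alone, in gigabytes (10**9 bytes)."""
    return num_parameters * bytes_per_parameter / 1e9


# Model sizes named in the article; 2 bytes per parameter is an assumption.
small_gb = weight_memory_gb(1e9)
large_gb = weight_memory_gb(70e9)

print(f"1B parameters:  about {small_gb:.0f} GB of weights")
print(f"70B parameters: about {large_gb:.0f} GB of weights")
```

Under these assumptions the smaller model needs about 2 GB for its weights, which is within reach of a modern phone, while the larger one needs about 140 GB, which calls for server-class hardware.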

The smaller version is especially important for portability. Farhadi argues that Molmo’s efficiency could let developers build stronger software agents that run natively on smartphones and other portable devices. He says, “The billion parameter model is now performing in the level of or in the league of models that are at least 10 times bigger.”

Ai2 also says Molmo performs like much larger commercial models because it was trained carefully on high-quality data. The release is described as fully open source, with no restrictions on its use, unlike Meta’s Llama family, which was released under a license that limits commercial use.

Another notable piece is transparency around training. Ai2 is releasing the training data used to create the model, which gives researchers more detail about how it works. For the research community, that can be as important as the model itself.

The risks and limits are still real

Opening powerful models creates risks as well as opportunities. The source article notes that such models can be adapted for harmful purposes. One possible concern is malicious AI agents designed to automate hacking of computer systems.

That risk sits alongside a more basic technical challenge: usefulness depends on reliability. Even if a multimodal model can read a screen and respond to a request, an agent still needs to make dependable choices across multiple steps. A model that acts confidently but unreliably can create problems, especially when it is controlling software directly.
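A small calculation illustrates why multi-step reliability is so demanding. The sketch below assumes every step succeeds with the same probability and that steps fail independently of one another. Both are simplifications chosen for illustration, and the per-step figures are made-up examples, not measurements of Molmo or any other model.

```python
def task_success_probability(per_step_success: float, num_steps: int) -> float:
    """Chance that every step succeeds, assuming independent, identical steps."""
    if not 0.0 <= per_step_success <= 1.0:
        raise ValueError("per_step_success must be between 0 and 1")
    if num_steps < 0:
        raise ValueError("num_steps must not be negative")
    return per_step_success ** num_steps


# Hypothetical per-step success rates over a 20-step task.
for rate in (0.99, 0.95, 0.90):
    overall = task_success_probability(rate, 20)
    print(f"{rate:.0%} per step -> {overall:.0%} for the whole task")
```

Under these assumptions, an agent that gets each step right 95 percent of the time completes a 20-step task only about 36 percent of the time, which is why small per-step errors matter so much when software is being controlled directly.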

The source article points to reasoning as another unresolved frontier. OpenAI has sought to address reasoning with its latest model, o1, which demonstrates step-by-step reasoning skills. The next step may be combining that kind of reasoning with multimodal models.

That means Molmo should not be seen as a finished answer to the AI agent problem. It is better understood as a wider foundation. It may let more people test ideas, build prototypes, fine-tune models for specific tasks, and explore what agents can do outside the companies that dominate advanced AI.

What comes next

Molmo arrives at a moment when OpenAI, Google, and others are racing to develop AI agents. Meta is also expected to announce several new products, perhaps including new Llama AI models, at its Connect event today.

The broader direction is clear: AI systems are moving from conversation toward action. Molmo’s significance is that it brings an open source multimodal model into that race, giving researchers, developers, and startups more control over how agent technology is built.

For now, the practical impact will depend on what builders do with it. The model’s visual abilities, open source release, range of sizes, and training-data transparency all make it a serious new option. The hard part remains turning those ingredients into agents that can perform useful computer tasks reliably.