WIRED AI September 25, 2024

Why Llama 3.2 pushes Meta AI beyond text

Meta announced Llama 3.2, its first free AI model family with visual abilities. The update also brings celebrity voices to Meta AI and points toward more capable assistants across phones, glasses, apps, and future AI agents.

Meta's multimodal Llama 3.2 and widely deployed Meta AI point mildly toward more capable, agent-like systems with surveillance and control potential, though this is mostly a product launch.

Meta is moving its AI assistant into a more visual and vocal phase. At Connect, a Meta event held in California today, Mark Zuckerberg announced Llama 3.2 and a broader upgrade to Meta AI, adding image understanding, celebrity voice options, and new possibilities for mobile AI apps.

The update matters because Meta AI already sits inside products with huge reach, including Facebook, Instagram, WhatsApp, and Messenger. Meta said more than 180 million people use Meta AI every week, which means these changes could quickly introduce many users to assistants that can respond to photos, talk in familiar voices, and support more agent-like tasks.

Llama 3.2 adds vision to Meta's open model strategy

Llama 3.2 is the first version of Meta's free AI models to include visual abilities. That gives the model a wider role than text generation alone, because it can process photos and other visual information as part of an interaction.

Zuckerberg described the release this way: “This is our first open source, multimodal model, and it's going to enable a lot of interesting applications that require visual understanding.” The key term is multimodal, meaning the model can work with more than text. In the source article, that includes images and audio as inputs, alongside language.

For developers, this shift expands what can be built on top of Llama. A model that can understand visual context may be more useful in robotics, virtual reality, and AI agents. It can also support software that sees what is on a screen or uses a phone camera as part of the task.

Meta is releasing several sizes of Llama 3.2. The more powerful versions have 11 billion and 90 billion parameters, while less capable 1 billion and 3 billion parameter versions are designed to work well on portable devices. Meta says those smaller versions have been optimized for ARM-based mobile chips from Qualcomm and MediaTek.
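A rough back-of-the-envelope calculation shows why the smaller versions suit portable devices. The sketch below multiplies each nominal parameter count from the paragraph above by an assumed number of bytes per parameter. The precision labels and byte widths are illustrative assumptions, not figures from Meta, and the estimate covers weight storage only, not the extra memory a model needs while running.

```python
# Rough weight-storage estimate: parameter count x bytes per parameter.
# Byte widths are common storage formats, assumed here for illustration;
# they are not Meta's published figures.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

# Nominal sizes as stated in the article (actual counts may differ slightly).
MODEL_SIZES = {"1B": 1e9, "3B": 3e9, "11B": 11e9, "90B": 90e9}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


def estimate_table() -> dict:
    """Map each model size to its estimated weight storage per precision."""
    return {
        name: {p: weight_memory_gb(n, p) for p in BYTES_PER_PARAM}
        for name, n in MODEL_SIZES.items()
    }


if __name__ == "__main__":
    for name, row in estimate_table().items():
        cells = ", ".join(f"{p}: {gb:.1f} GB" for p, gb in row.items())
        print(f"{name:>3} -> {cells}")
```

Under these assumptions, the 1 billion and 3 billion parameter versions need only a few gigabytes for their weights, while the 90 billion parameter version needs tens to hundreds of gigabytes, far more than a phone's memory.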

Meta AI gets celebrity voices and image tools

The consumer-facing upgrade is more direct: Meta AI is getting voices. The new celebrity voice options include Dame Judi Dench, John Cena, Awkwafina, Keegan-Michael Key, and Kristen Bell.

Meta says the new voices will be made available to users in the US, Canada, Australia, and New Zealand over the next month. The company has tried celebrity personas before with text-based assistants, but those characters did not gain much traction. In July, Meta also launched AI Studio, a tool that lets users create chatbots with any persona they choose.

The new version of Meta AI will also be able to respond to users' photos. If a user takes a picture of a bird and does not know what it is, Meta AI can identify the species. It will also help edit images, including by adding new backgrounds or details on demand.

Google released a similar tool for its Pixel smartphones and for Google Photos in April. Meta's image capabilities will be rolled out in the US, though the company did not say when those features might appear in other markets.

Smart glasses show where visual AI could go

At Connect, Zuckerberg demonstrated several AI features that show why visual understanding is central to Meta's plans. In videos, Ray-Ban smart glasses running Llama 3.2 offered recipe advice based on ingredients in view. In another example, the glasses gave commentary on clothing seen on a rack in a store.

These examples are important because they show AI moving from a chat box into the user's surroundings. If a system can interpret what a camera sees, it can become more useful in moments where typing a prompt is awkward or too slow.

Meta also showed experimental AI features the company is working on, including software for live translation between Spanish and English, automatic dubbing of videos into different languages, and an avatar for creators that can answer fan questions on their behalf.

None of those examples turns Meta AI into a finished universal assistant. But together they suggest the direction of travel: AI that can listen, speak, see, and act across apps and devices.

Why open multimodal models matter

Llama is different from many proprietary models because it can be downloaded and run locally without charge, although there are restrictions on large-scale commercial use. It can also be fine-tuned, or modified with additional training, for specific tasks.

That openness has made the Llama family widely adopted by developers and startups. Patrick Wendell, cofounder and VP of engineering at Databricks, said many companies are drawn to open models because they allow them to better protect their own data.

Multimodal models also match the way information actually appears in work and daily life. Phillip Isola, a professor at MIT, put it this way: “Multimodal models are a big deal because the data people and businesses use is not just text, it can come in many different formats, including images and audio or more specialized formats like protein sequences or financial ledgers.”

That broader input range may help developers build AI agents that can carry out tasks on computers. The source article gives one example: an agent that browses the web to hunt for deals on a product based on a short description.

Meta is not alone in pushing open multimodal AI. Earlier today, the Allen Institute for AI (Ai2), a research institute in Seattle, released an advanced open source multimodal model called Molmo. Molmo was released under a less restrictive license than Llama, and Ai2 is also releasing details of its training data.

Nathan Benaich, founder and general partner of Air Street Capital, said Meta showed with Llama 3.1 that open models could close the gap with proprietary counterparts. He also said multimodal models tend to outperform larger text-only ones.

For Meta, Llama 3.2 is both a model release and a platform move. By adding vision, mobile optimization, and voice-driven assistant features, the company is trying to make Meta AI more useful to everyday users while giving developers a broader foundation for AI tools and services.