The Decoder, October 21, 2024

Meta's Spirit LM blends speech and text in one AI model

Meta's Fundamental AI Research (FAIR) team has released Spirit LM, a multimodal language model that treats speech and text as connected inputs. The model is available in Base and Expressive versions, with the latter designed to capture pitch, style, intonation and emotion information.

The release adds another piece to Meta's broader AI research push and points toward voice systems that can move more naturally between written language and spoken expression.

What Spirit LM Is Designed To Do

Spirit LM is built on a pre-trained text language model, which is then trained further on text and speech units. Instead of treating speech and text as completely separate systems, the model represents both as a single sequence of tokens.

That design matters because many AI systems handle speech through a chain of separate steps. A spoken request may be transcribed into text, processed as text, and then converted back into speech. Spirit LM follows a multimodal approach in which speech and text are more directly connected inside the model.

The source describes Spirit LM as using a word-level method to interleave speech and text sequences. To support that training, the researchers used a small, automatically curated parallel corpus of speech and text.
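The word-level interleaving described above can be illustrated with a minimal sketch. It is not Meta's implementation: the marker tokens, the unit IDs, the `AlignedWord` type and the `switch_prob` parameter are all hypothetical, chosen only to show how a single token sequence might switch between text and speech at word boundaries.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class AlignedWord:
    text: str                # the written word
    speech_units: List[int]  # hypothetical discrete speech unit IDs for the same word


def interleave(words, switch_prob=0.3, seed=0):
    """Build one token sequence that changes modality only at word boundaries."""
    rng = random.Random(seed)
    tokens = []
    modality = None
    for word in words:
        if modality is None:
            # Pick the starting modality at random.
            new_modality = rng.choice(["text", "speech"])
        elif rng.random() < switch_prob:
            # Switch to the other modality at this word boundary.
            new_modality = "speech" if modality == "text" else "text"
        else:
            new_modality = modality
        if new_modality != modality:
            # Emit a (hypothetical) marker token whenever the modality changes.
            tokens.append("[TEXT]" if new_modality == "text" else "[SPEECH]")
            modality = new_modality
        if modality == "text":
            tokens.append(word.text)
        else:
            tokens.extend(f"[U{u}]" for u in word.speech_units)
    return tokens


words = [
    AlignedWord("the", [1, 2]),
    AlignedWord("cat", [3]),
    AlignedWord("sat", [4, 5, 6]),
]
print(interleave(words, switch_prob=0.5, seed=1))
```

Because every switch happens between words, each word appears exactly once in the sequence, either as text or as its speech units. That is the property the parallel corpus is needed for: it supplies the alignment between each word and its speech units.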

Meta published the research paper for Spirit LM in February and has now made the corresponding code and model weights available for free download. That release fits with Meta's stated emphasis on advanced AI development and open science, even as the company has also faced criticism for attempting to redefine the term "open source" according to its own interpretation.

Base And Expressive Versions

Meta has released Spirit LM in two versions: a base model and an expressive model. The distinction is important because speech contains more than words alone.

The base model uses semantic units of speech. In practical terms, that means it focuses on the meaning carried by spoken language. Spirit LM Expressive adds pitch and style units, which are used to capture intonation and emotion information.
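The difference between the two versions can be sketched as a difference in token streams. The example below is illustrative only: the timestamps, unit IDs, token names and the ordering of units that share a timestamp are assumptions, not details taken from the source.

```python
def build_stream(semantic, pitch=None, style=None):
    """Merge timestamped unit streams into one token list ordered by time.

    Each input is a list of (timestamp_seconds, unit_id) pairs. At equal
    timestamps, style comes first, then pitch, then semantic units; this
    tie-break is an assumption made for the sketch.
    """
    events = [(t, 2, f"[Sem{u}]") for t, u in semantic]
    events += [(t, 1, f"[Pitch{u}]") for t, u in (pitch or [])]
    events += [(t, 0, f"[Style{u}]") for t, u in (style or [])]
    return [token for _, _, token in sorted(events)]


semantic = [(0.00, 12), (0.04, 7), (0.08, 7)]
pitch = [(0.00, 3), (0.08, 5)]
style = [(0.00, 41)]

# Base-style stream: semantic units only.
print(build_stream(semantic))
# Expressive-style stream: the same semantic units plus pitch and style units.
print(build_stream(semantic, pitch, style))
```

In this sketch the expressive stream contains the base stream unchanged, in the same order, with pitch and style units slotted in around it.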

This gives the project two related goals. One is to demonstrate the semantic capabilities of speech models. The other is to show the expressive abilities of voice models, where tone, delivery and emotional character become part of the output.

The expressive version is especially notable because the source says it combines semantic, prosodic and stylistic information. Experiments showed that Spirit LM Expressive can maintain the mood of text and speech input in generated output, a capability often lacking in previous language models.

Tasks The Model Can Handle

Spirit LM's combined text-and-speech architecture gives it several possible uses. The source identifies tasks that span both familiar speech processing and cross-modality conversion.

It can transcribe spoken language, converting speech into written text.
It can read written text aloud, converting text into speech.
It can classify spoken utterances based on content.

The broader point is that these tasks do not sit in isolation. A model that can process speech and text together may be able to move between formats with less friction than a system made from separate parts.

Researchers also demonstrated few-shot learning with Spirit LM. According to the source, the model can learn new tasks after being shown only a few examples. This was shown both within a single modality and across modalities, meaning the model can adapt in text-only or speech-only settings as well as across the boundary between the two.
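A cross-modal few-shot prompt of the kind described above might be assembled as follows. This is a hedged sketch, not the format Meta uses: the `[SPEECH]` and `[TEXT]` markers, the `[U…]` unit tokens and the `few_shot_prompt` helper are hypothetical, and no model is called.

```python
def few_shot_prompt(examples, query_units):
    """Join (speech units, transcript) pairs and leave the final transcript open."""

    def speech(units):
        # Render hypothetical speech unit IDs as tokens after a speech marker.
        return "[SPEECH]" + "".join(f"[U{u}]" for u in units)

    parts = [f"{speech(units)}[TEXT]{text}" for units, text in examples]
    # The query ends with an open text marker for the model to complete.
    parts.append(f"{speech(query_units)}[TEXT]")
    return "\n".join(parts)


examples = [
    ([5, 9, 9, 2], "good morning"),
    ([7, 1, 4], "thank you"),
]
print(few_shot_prompt(examples, query_units=[3, 8, 6]))
```

Swapping the order of each pair, so that text comes first and speech units second, would turn the same pattern into a text-to-speech prompt. A single-modality prompt would use the same marker on both sides.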

Why This Matters For AI Voice Systems

The development mirrors OpenAI's approach with GPT-4o and its Advanced Voice Mode. The comparison is not just about adding audio output to a chatbot. It is about building models that treat voice as a richer channel for interaction.

Voice carries meaning through words, but also through timing, pitch, style and emotional tone. Spirit LM Expressive is aimed at that fuller version of speech. By incorporating pitch and style units, the model is designed to preserve more of the input's expressive character when generating output.

That could make future AI voice tools feel less like text systems with audio attached and more like systems that understand speech as speech. The source does not say Meta has already turned Spirit LM into a product, but it does frame the model as a step that could support a voice mode similar to OpenAI's Advanced Voice Mode.

Meta recently released Llama 3.2, which brings image understanding capabilities to the company's AI platforms. Against that backdrop, the source says Meta may incorporate findings from Spirit LM into a future Llama model. Such a move could lead to a genuine "omnimodal" competitor to GPT-4o.

Part Of A Wider FAIR Research Push

Spirit LM is one part of a larger set of AI announcements from Meta's Fundamental AI Research (FAIR) team. The source also names an update to the Segment Anything model for image segmentation and a solution called Layer Skip for speeding up large language models.

Meta also reported advances in efficient training of multilingual models with Meta Lingua. In addition, the company presented new research on post-quantum cryptography security, AI-supported materials research and improving sentence representations.

Taken together, these announcements show Meta working across multiple AI research areas at once. Spirit LM stands out because it focuses on the interface many people may experience most directly: speaking to an AI system and hearing it respond.

The central question now is how much of this research will influence future Meta systems. Based only on the source, Spirit LM is a released research model with code and model weights available for free download. Its longer-term importance may depend on whether its speech, text and expressive voice capabilities become part of later models, including a future Llama system.