TechCrunch AI, March 13, 2025

Sesame opens the AI model behind Maya to developers

Sesame has released CSM-1B, the 1 billion parameter base model that powers its realistic voice assistant Maya. The Apache 2.0 release gives developers broad commercial freedom, but the source highlights limited safeguards and easy voice cloning in the public demo.

An open realistic voice generation model with limited safeguards and easy voice cloning raises misuse and impersonation risks.

Sesame, the AI company behind the viral virtual assistant Maya, has released the base model that powers the assistant’s voice technology. The release gives developers access to CSM-1B, a 1 billion parameter model built for generating audio from text and audio inputs.

The move matters because Maya drew attention for sounding unusually lifelike. Now, the underlying base model is available under an Apache 2.0 license, which allows commercial use with few restrictions.

What Sesame Released

The model is called CSM-1B. According to Sesame’s description on Hugging Face, it generates “RVQ audio codes” from text and audio inputs. The source explains that RVQ stands for “residual vector quantization,” a method for turning audio into discrete tokens called codes.

That technique is not unique to Sesame. The source notes that RVQ is used in a number of recent AI audio technologies, including Google’s SoundStream and Meta’s Encodec. In plain terms, the model works with a representation of audio that can be processed and generated by AI systems.
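To make the idea of residual vector quantization concrete, here is a minimal sketch in plain Python. It is an illustration of the general technique only, not Sesame's implementation: the codebook values, the two-stage setup, and the function names (`nearest_index`, `rvq_encode`, `rvq_decode`) are all hypothetical. Real codecs learn their codebooks from data and apply them to encoder outputs rather than to raw numbers like these.

```python
from typing import List, Sequence

Vector = Sequence[float]
Codebook = Sequence[Vector]


def nearest_index(vec: Vector, codebook: Codebook) -> int:
    """Return the index of the codebook entry closest to vec (squared Euclidean distance)."""
    def dist(entry: Vector) -> float:
        return sum((a - b) ** 2 for a, b in zip(vec, entry))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def rvq_encode(vec: Vector, codebooks: Sequence[Codebook]) -> List[int]:
    """Quantize vec in stages; each stage encodes the residual left by the previous one."""
    residual = list(vec)
    codes = []
    for codebook in codebooks:
        idx = nearest_index(residual, codebook)
        codes.append(idx)
        # Subtract the chosen entry so the next stage only sees what is still unexplained.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes


def rvq_decode(codes: Sequence[int], codebooks: Sequence[Codebook]) -> List[float]:
    """Reconstruct an approximation by summing the selected entry from each stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Toy, hand-picked codebooks (hypothetical values, not taken from any real codec).
CODEBOOKS = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0], [1.0, -1.0]],      # coarse stage
    [[0.0, 0.0], [0.25, 0.0], [0.0, 0.25], [-0.25, -0.25]],  # fine stage
]

frame = [1.25, 1.0]  # stand-in for one frame of audio features
codes = rvq_encode(frame, CODEBOOKS)
approx = rvq_decode(codes, CODEBOOKS)
print(codes, approx)  # prints: [1, 1] [1.25, 1.0]
```

The list of integers in `codes` is the kind of discrete token sequence the article refers to: one code per stage, with each later stage refining the error left by the earlier ones.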

CSM-1B uses a model from Meta’s Llama family as its backbone, paired with an audio “decoder” component. Sesame says a fine-tuned variant of CSM powers Maya, which means the publicly released model is related to the assistant but is not presented as the exact finished Maya experience.

Sesame describes the release this way in CSM-1B’s Hugging Face and GitHub repositories: “The model open-sourced here is a base generation model.” The company also says the model can produce a variety of voices but has not been fine-tuned on any specific voice.

Why Maya Drew Attention

Sesame went viral in late February for assistant technology that, according to the source, comes close to clearing uncanny valley territory. Maya and Sesame’s other assistant, Miles, do more than speak cleanly. They take breaths, speak with disfluencies, and can be interrupted while speaking.

Those behaviors make the assistants feel more conversational than many synthetic voice systems. The source compares the interruptible interaction to OpenAI’s Voice Mode.

The release of CSM-1B shifts attention from the finished demo experience to the developer layer underneath it. A base AI model is not the same thing as a polished assistant. But it can become a foundation for experiments, integrations, and products that use generated speech.

The Apache 2.0 license is important for that reason. Because it allows commercial use with few restrictions, developers and companies can work with CSM-1B without treating it only as a research artifact.

The Safeguard Gap

The source also raises a central concern: the model has no real safeguards to speak of. Sesame uses an honor system and urges developers and users not to mimic a person’s voice without consent, create misleading content like fake news, or engage in “harmful” or “malicious” activities.

That is a significant limitation for a voice model because the same capabilities that make generated speech useful can also make abuse easier. The source’s own test of the Hugging Face demo found that cloning a voice took less than a minute. After that, it was easy to generate speech, including on controversial topics like the election and Russian propaganda.

The concern is not isolated to Sesame. Consumer Reports recently warned that many popular AI-powered voice cloning tools on the market don’t have “meaningful” safeguards to prevent fraud or abuse.

For developers, the practical issue is clear: access and responsibility are arriving together. CSM-1B may be commercially usable, but the source describes a release where the main protection against misuse is a request that people behave responsibly.

What Developers Can And Cannot Assume

The public release also comes with caveats about performance. Because CSM-1B is a base generation model that has not been fine-tuned on any specific voice, it may require additional work before it performs like a polished assistant in a real product.

Sesame also notes that the model has some capacity for non-English languages because of data contamination in its training data, but that it likely will not perform well in them. That makes English the safer assumption for anyone evaluating the model based only on the source details.

There is another unknown: Sesame did not disclose what data it used to train CSM-1B, and the source describes the training data as unclear.

Those limits frame the release as both technically notable and incomplete from a transparency standpoint. Developers can inspect and use the model, but the source does not provide a full account of the training data behind it.

Where Sesame Goes Next

Sesame is not only building voice assistant technology. The company says it is prototyping AI glasses “designed to be worn all day” that will be equipped with its custom models. That places voice interaction inside a broader product direction, where assistants may become part of wearable hardware.

The company was co-founded by Oculus co-creator Brendan Iribe. It has raised an undisclosed amount of capital from Andreessen Horowitz, Spark Capital, and Matrix Partners.

For now, the main development is the release of CSM-1B. Sesame has made the base model behind Maya available to developers, opening the door to commercial experiments in AI voice generation. At the same time, the release puts the risks of voice cloning and weak safeguards directly in view.