Ars Technica AI March 4, 2025 TERMINATOR

Why Sesame's AI voice demo feels too human for comfort

Sesame's Conversational Speech Model demo has drawn attention because its voices, Miles and Maya, sound unusually lifelike. Users have described both emotional pull and discomfort, while the technology also raises concerns about deception, fraud, and future misuse.

The story centers on highly realistic AI voices raising risks of deception, fraud, impersonation, and emotional manipulation.

Sesame's new AI voice demo has become a striking example of how quickly conversational speech systems are moving from useful tools toward something that can feel socially present. The company's Conversational Speech Model, or CSM, gives users a choice between male and female voice assistants called “Miles” and “Maya,” and many testers have described the experience as unusually human.

The reaction has not been simple excitement. Some users found the system impressive, while others said it was unsettling. The same qualities that make the voice feel natural also raise difficult questions about trust, attachment, impersonation, and the future of AI voice assistants.

A voice demo that crosses a familiar boundary

In late 2013, the Spike Jonze film Her imagined people forming emotional connections with AI voice assistants. Nearly 12 years later, Sesame's demo has made that idea feel less distant for some users who tried it.

One Hacker News user wrote, “I tried the demo, and it was genuinely startling how human it felt,” adding, “I’m almost a bit worried I will start feeling emotionally attached to a voice assistant with this level of human-like sound.”

The model's impact appears to come partly from its imperfections. In one evaluation described in the source article, a 28-minute conversation with the male voice showed a system that could imitate breath sounds, chuckles, interruptions, and verbal stumbles. It could even correct itself after stumbling over words. Those flaws are not incidental; they are part of what makes the speech feel less mechanical.

Sesame has framed this goal as “voice presence.” In the company's words, “At Sesame, our goal is to achieve ‘voice presence’—the magical quality that makes spoken interactions feel real, understood, and valued.” The company also says it wants conversational partners that do more than process requests, describing voice as a possible interface for instruction and understanding.

Why realism can feel both impressive and strange

The online response shows a split between wonder and discomfort. Some Reddit users described the demo as “jaw-dropping” or “mind-blowing.” One Reddit user wrote, “I’m sure it’s not beating any benchmarks, or meeting any common definition of AGI, but this is the first time I’ve had a real genuine conversation with something I felt was real.”

Other reactions were more uneasy. Mark Hachman, a senior editor at PCWorld, wrote that he was still unsettled after ending his interaction with Sesame's voice AI. The source article says he described the AI's voice and conversational style as eerily similar to an old friend he had dated in high school.

This tension matters because voice is not just another interface. A chatbot on a screen can feel distant, even when it is useful. Users may respond in a more personal way to a voice that breathes, interrupts, laughs, hesitates, and keeps conversational timing.

The demo also drew attention for its roleplay. Some users compared it with OpenAI's Advanced Voice Mode for ChatGPT, saying Sesame's CSM has more realistic voices. Others pointed to the demo's willingness to roleplay angry characters, which ChatGPT refuses to do.

In one example posted by Gavin Purcell, co-host of the AI for Humans podcast, a person pretends to be an embezzler and argues with a boss. The exchange was described as dynamic enough that it was difficult to tell which participant was human and which was the AI model.

How Sesame's CSM produces near-human speech

Sesame's system uses two AI models working together: a backbone and a decoder. The design is based on Meta's Llama architecture and processes interleaved text and audio. Sesame trained models at three sizes. The largest has 8.3 billion parameters (an 8-billion-parameter backbone plus a 300-million-parameter decoder) and was trained on approximately 1 million hours of primarily English audio.
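
The parameter figures for the largest model can be checked with simple arithmetic. The short Python sketch below uses only the rounded counts reported above; the variable names are illustrative, and the exact parameter counts are presumably not round numbers.

```python
# Rounded parameter counts as reported for the largest CSM model.
BACKBONE_PARAMS = 8_000_000_000  # "8 billion" backbone
DECODER_PARAMS = 300_000_000     # "300 million" decoder


def total_params(backbone: int, decoder: int) -> int:
    """Return the combined parameter count of backbone and decoder."""
    return backbone + decoder


total = total_params(BACKBONE_PARAMS, DECODER_PARAMS)
decoder_share = DECODER_PARAMS / total

# prints: 8.3B total, decoder share 3.6%
print(f"{total / 1e9:.1f}B total, decoder share {decoder_share:.1%}")
```

On these rounded figures, the decoder accounts for under 4 percent of the total, so nearly all of the model's capacity sits in the backbone.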

The source article explains that Sesame's CSM does not use the traditional two-stage method common in many earlier text-to-speech systems. Instead of separately creating high-level speech representations and fine-grained audio details, it uses a single-stage, multimodal transformer-based model that jointly processes text and audio tokens to generate speech.
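
To make "interleaved text and audio tokens" more concrete, here is a minimal Python sketch of one way such a sequence could be laid out. It is not Sesame's implementation: the `Turn` type, the `interleave` function, the modality tags, and the choice to place each turn's text tokens before its audio tokens are all assumptions made for illustration, and the token IDs are arbitrary integers.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Turn:
    """One conversational turn (hypothetical structure)."""
    text_tokens: List[int]   # token IDs from some text tokenizer
    audio_tokens: List[int]  # token IDs from some audio codec


def interleave(turns: List[Turn]) -> List[Tuple[str, int]]:
    """Flatten turns into one sequence of (modality, token) pairs.

    Assumed layout: for each turn, its text tokens come first,
    followed by that turn's audio tokens.
    """
    sequence: List[Tuple[str, int]] = []
    for turn in turns:
        sequence.extend(("text", t) for t in turn.text_tokens)
        sequence.extend(("audio", a) for a in turn.audio_tokens)
    return sequence


conversation = [
    Turn(text_tokens=[1, 2], audio_tokens=[10, 11, 12]),
    Turn(text_tokens=[3], audio_tokens=[13]),
]
print(interleave(conversation))
```

The point of the sketch is only that a single model sees both modalities in one stream, rather than handing text to a separate audio stage.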

Blind tests showed an important distinction. Without conversational context, human evaluators showed no clear preference between CSM-generated speech and real human recordings. With conversational context, however, evaluators still consistently preferred real human speech.
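
One way to read "no clear preference" is as a preference rate close to 50 percent. The Python sketch below shows that calculation under stated assumptions: the vote labels, the function names, and the 5-point margin are hypothetical, and the example votes are invented for illustration rather than taken from Sesame's evaluation.

```python
from typing import Sequence


def preference_rate(votes: Sequence[str], label: str = "model") -> float:
    """Fraction of blind-test votes that favored `label`."""
    if not votes:
        raise ValueError("need at least one vote")
    return sum(1 for v in votes if v == label) / len(votes)


def describe(rate: float, margin: float = 0.05) -> str:
    """Summarize a rate; `margin` is an arbitrary illustrative threshold."""
    if abs(rate - 0.5) <= margin:
        return "no clear preference"
    return "prefers model" if rate > 0.5 else "prefers human"


# Invented votes, not data from the study.
example_votes = ["model", "human", "human", "model", "human", "human"]
rate = preference_rate(example_votes)
print(f"{rate:.2f}: {describe(rate)}")
```

A real evaluation would also need enough trials and a significance test before drawing any conclusion from a rate near one half.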

That gap helps explain why the demo can feel remarkably realistic while still being imperfect. Sesame co-founder Brendan Iribe acknowledged limits on Hacker News, saying the system is “still too eager and often inappropriate in its tone, prosody and pacing” and has issues involving interruptions, timing, and conversation flow. He also wrote, “Today, we’re firmly in the valley, but we’re optimistic we can climb out.”

The trust problem gets harder when voices improve

The most serious implications are not limited to whether people enjoy talking to lifelike AI. Highly convincing human-like speech can create new risks for deception and fraud. The source article notes that voice phishing scams have already been supercharged by synthetic speech that can impersonate family members, colleagues, or authority figures.

Realistic interactivity could make that problem worse. Traditional robocalls often reveal themselves through awkward timing or artificial delivery. A future system that can respond naturally, interrupt fluidly, and adapt in conversation could remove many of those warning signs.

That is why some people have started sharing a secret word or phrase with family members for identity verification. Sesame's demo does not clone a person's voice, but similar technology could be adapted by malicious actors for social engineering attacks if future releases make the tools easier to reuse.

The emotional side is also hard to ignore. Hacker News users reported extended conversations with the demo voices, including conversations lasting up to the 30-minute limit. In one case, a parent recounted that their 4-year-old daughter developed an emotional connection with the AI model and cried after not being allowed to talk to it again.

What comes next for conversational voice AI

Sesame says it plans to open-source “key components” of its research under an Apache 2.0 license. The company's roadmap includes scaling up model size, increasing dataset volume, expanding language support to over 20 languages, and developing “fully duplex” models that better handle the dynamics of real conversations.

Those plans point toward voice assistants that may become more fluent, more responsive, and harder to distinguish from people in ordinary conversation. The demo already shows why that future is appealing: a voice interface can feel direct, accessible, and natural.

But Sesame's CSM also shows why realism is not a purely technical achievement. The closer AI voices get to human speech, the more they affect how users assign trust, emotion, identity, and intent. That is the real significance of the demo: it does not simply sound better. It makes the social consequences of synthetic speech harder to ignore.