Ars Technica AI July 31, 2024 TERMINATOR

Why ChatGPT Advanced Voice Mode sounds startlingly human

OpenAI has started rolling out an alpha version of ChatGPT Advanced Voice Mode to a small group of ChatGPT Plus subscribers. Early testers report fast voice conversations, near-instant interruptions, sound effects, simulated breathing and some safety limits around voices and copyrighted audio.

WTF Index TERMINATOR

Terminator 2, Idiocracy 1

The story mildly leans Terminator because increasingly humanlike real-time voice AI raises concerns about manipulation, impersonation and control despite safety limits.

OpenAI’s ChatGPT Advanced Voice Mode is now reaching a small group of ChatGPT Plus subscribers in an alpha rollout, and early reactions show why the feature is drawing attention. Testers describe a version of ChatGPT that responds quickly, can be interrupted while speaking, reacts to vocal delivery and produces voice performances that include sound effects and audible breaths.

The feature was previewed in May with GPT-4o. OpenAI’s goal is to make spoken conversations with ChatGPT feel more natural and responsive, but the rollout also arrives after criticism of simulated emotional expressiveness and a public dispute with actress Scarlett Johansson over accusations that OpenAI copied her voice.

A faster, more interruptible voice assistant

In the early tests described by users with access, Advanced Voice Mode supports real-time voice conversations with ChatGPT. One of the most important changes is that users can interrupt the AI mid-sentence almost instantly, which makes the exchange feel less like waiting through a recorded answer and more like speaking with a responsive system.

That responsiveness matters because voice interfaces can become frustrating when a user has to wait for the system to finish before correcting it, redirecting it or asking a follow-up. In the examples shared by testers, the new mode appears designed around conversational turn-taking rather than a rigid question-and-answer rhythm.

Tech writer Cristiano Giardina wrote on X that the feature is very fast and that there is virtually no delay between a user finishing a sentence and ChatGPT responding. He also shared examples involving counting, accents and sound effects, which became part of the wider discussion around how humanlike the audio output can seem.

The breathing effect is getting attention

The detail that surprised many people was not only the speed. It was the way the voice appeared to stop for breath while speaking. In one example, Giardina described ChatGPT Advanced Voice Mode counting quickly to 10 and then to 50, noting that it stopped to catch its breath in a way that reminded him of a person speaking at high speed.

The system does not need air, of course. Advanced Voice Mode simulates those audible pauses because it was trained on human speech audio that includes inhalations and other speech patterns. After exposure to hundreds of thousands, if not millions, of examples of people talking, the model has learned to reproduce breathing sounds at moments that can feel appropriate.

This is a useful reminder of what large language models like GPT-4o are doing in the audio domain. They are powerful imitators. In text, that means generating language that resembles patterns found in written material. In voice, the imitation extends to timing, tone, pauses and nonverbal audio cues that listeners associate with human speech.

Sound effects, stories and accents

Advanced Voice Mode is also being tested as a performance tool. Users have shared examples of ChatGPT making sound effects while telling stories, playing multiple parts with different voices and producing an audiobook-like sci-fi story from a prompt asking for action, atmosphere and onomatopoeia.

X user Kesku, a moderator of OpenAI’s Discord server, shared examples of ChatGPT performing multiple roles and recounting a sci-fi story. Kesku also ran prompts for Ars Technica, including a story about the Ars Technica mascot “Moonshark.” Another test asked the system to sing the “Major-General’s Song” from Gilbert and Sullivan’s 1879 comic opera The Pirates of Penzance.

Giardina also observed that the system can do accents, though it always has an American accent when speaking other languages. In one video, ChatGPT acted as a soccer match commentator. According to Giardina, when asked to make noises, the voice performs the sounds itself, sometimes with funny results.

Frequent AI advocate Manuel Sainsily posted a video showing Advanced Voice Mode reacting to camera input while offering advice about how to care for a kitten. He described the experience as feeling like face-timing a knowledgeable friend, and said it could answer questions in real time while using the camera as input.

Useful demo, familiar limits

For all the enthusiasm, the system still carries the limitations of an LLM. The source article notes that it may occasionally confabulate, giving incorrect responses on topics its training data does not cover well. That matters especially when a voice interface feels fluent, immediate and confident, because a natural delivery can make an answer seem more reliable than it is.

Seen as a technology demo or an AI-powered amusement, Advanced Voice Mode appears to carry out many of the tasks OpenAI showed in its May demo. The core appeal is clear: a spoken AI that can respond quickly, change direction when interrupted and produce expressive audio can feel more fluid than earlier voice assistants.

But the same qualities also sharpen the need for boundaries. OpenAI told Ars Technica that it worked with more than 100 external testers for the Advanced Voice Mode release. Those testers collectively spoke 45 different languages and represented 29 geographical areas.

The system is reportedly designed to prevent impersonation of individuals or public figures by blocking outputs that differ from OpenAI’s four chosen preset voices. OpenAI has also added filters meant to recognize and block requests for music or other copyrighted audio.

Some outputs still show traces of how the model was trained. Giardina reported audio “leakage” in some generated speech, with unintended music in the background. The source article says this suggests the voice model was trained on a wide range of audio sources, likely including licensed material and audio scraped from online video platforms.

Who gets access next

For now, Advanced Voice Mode is limited to a small alpha test group of ChatGPT Plus subscribers. OpenAI plans to expand access to more Plus users in the coming weeks, with a full launch to all Plus subscribers expected this fall.

A company spokesperson told Ars Technica that users selected for the alpha test will receive a notice in the ChatGPT app and an email with usage instructions. Since the May preview of GPT-4o voice, OpenAI says it has improved the model’s ability to support millions of simultaneous, real-time voice conversations while keeping latency low and quality high.

That last point may be as important as the voice effects themselves. If Advanced Voice Mode becomes available to all Plus subscribers, OpenAI will need the back-end capacity to handle many live conversations at once. The early tests show a feature that can sound strikingly lifelike, but the broader test will be whether it can scale while staying fast, controlled and useful.