TechCrunch AI March 20, 2025 NEUTRAL

OpenAI’s new voice API models aim to make agents sound less flat

OpenAI is adding new transcription and text-to-speech models to its API as part of its broader push toward agentic systems. The models promise more controllable voices and better transcription, but the new speech-to-text tools will not be openly released like Whisper.

WTF Index NEUTRAL

This is mostly a routine model/API launch, with only mild implications for more capable agentic systems and synthetic voice dependence.

OpenAI is updating the voice layer of its API with new models for speech generation and transcription, positioning them as building blocks for more capable AI agents. The company says the releases improve on earlier tools by making synthetic voices easier to direct and transcripts more accurate in difficult audio settings.

The move fits OpenAI’s broader “agentic” vision: automated systems that can carry out tasks for users. In one example described by OpenAI Head of Product Olivier Godement, an agent could be a chatbot that speaks with a business’s customers.

Why voice matters for agents

OpenAI’s argument is straightforward: if agents are going to handle real interactions, they need to communicate in ways that feel appropriate to the situation. A customer support agent, for instance, may need a different tone from a general assistant or a scripted audio experience.

That is where the new text-to-speech model, gpt-4o-mini-tts, comes in. OpenAI claims it can produce more nuanced and realistic-sounding speech than its previous speech-synthesizing models. The company also says it is more “steerable,” meaning developers can guide not only the words being spoken but the manner in which they are delivered.

Developers can use natural language instructions to shape the voice. The source gives examples such as “speak like a mad scientist” and “use a serene voice, like a mindfulness teacher.” OpenAI also showed samples described as a “true crime-style,” weathered voice and a female “professional” voice.

Jeff Harris, a member of the product staff at OpenAI, said the goal is to help developers tailor both the voice “experience” and “context.” In practical terms, that means a voice interface could sound apologetic in a customer support scenario where it has made a mistake, rather than delivering every response in the same flat style.

More control over how speech is delivered

The clearest change in gpt-4o-mini-tts is not voice quality alone but the degree of control OpenAI wants to give developers over delivery.

For products built around spoken interaction, this distinction matters. A voice assistant that reads every sentence with the same tone can feel disconnected from the user’s situation. A voice that changes delivery based on context can make the interaction clearer, more natural, and more aligned with the task at hand.

According to Harris, OpenAI believes developers and users want control over “not just what is spoken, but how things are spoken.” That framing puts the new model closer to an interface design tool than a simple speech output system. It gives developers a way to specify emotional or situational cues without needing to manually engineer every vocal detail.

The examples in the source are broad, but they show the intended direction. A business could use a professional tone for routine interactions. A wellness experience could ask for a calmer delivery. A more theatrical app could request a character-like voice direction. The key point is that the voice can be guided through plain language.
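As a rough illustration of guiding delivery through plain language, the sketch below maps a product context to a style instruction and builds a request body for a text-to-speech call. Only the model name and the "mindfulness teacher" phrasing come from the article. The field names (`voice`, `input`, `instructions`), the voice name `alloy`, and the `STYLE_BY_CONTEXT` mapping are assumptions for illustration and should be checked against OpenAI's API reference before use.

```python
import json

# Hypothetical mapping from a product context to a plain-language
# delivery instruction. These contexts and phrasings are illustrative.
STYLE_BY_CONTEXT = {
    "support_apology": "Sound sincere and apologetic; speak calmly and at an even pace.",
    "routine": "Use a clear, professional tone.",
    "wellness": "Use a serene voice, like a mindfulness teacher.",
}


def build_speech_request(text: str, context: str, voice: str = "alloy") -> dict:
    """Return a JSON-serializable body for a text-to-speech request.

    Assumes the speech endpoint accepts `model`, `voice`, `input`, and
    `instructions` fields. Unknown contexts fall back to the routine tone.
    """
    instructions = STYLE_BY_CONTEXT.get(context, STYLE_BY_CONTEXT["routine"])
    return {
        "model": "gpt-4o-mini-tts",
        "voice": voice,
        "input": text,
        "instructions": instructions,
    }


if __name__ == "__main__":
    body = build_speech_request(
        "Sorry about the mix-up with your order. I have corrected it now.",
        "support_apology",
    )
    print(json.dumps(body, indent=2))
```

Keeping the spoken words (`input`) separate from the delivery guidance (`instructions`) mirrors the split Harris describes between what is spoken and how it is spoken. The same sentence can be reused with a different context key without rewriting the text.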

New transcription models move beyond Whisper

OpenAI is also introducing two speech-to-text models: gpt-4o-transcribe and gpt-4o-mini-transcribe. The source says these models effectively replace Whisper, OpenAI’s older transcription model.

The company says the new transcription tools were trained on “diverse, high-quality audio datasets” and can better capture accented and varied speech, including in chaotic environments. That is important for voice agents because poor transcription can break the entire interaction. If the system mishears a user, every step that follows can be wrong.

OpenAI also claims the new models are less likely to hallucinate. The source notes that Whisper was known to fabricate words and even full passages in conversations, including racial commentary and imagined medical treatments. Harris said the new models are “much improved versus Whisper on that front.”

For a voice product, fewer hallucinated words are not a minor upgrade. A transcript is often treated as the factual record of what a user said. If a model fills in details it did not hear, the agent may respond to something that was never spoken. Harris described accuracy in this context as hearing the words precisely and not adding details that were not present in the audio.

Accuracy still depends on language

The improvements do not appear to be even across all languages. The source says results can vary depending on the language being transcribed.

According to OpenAI’s internal benchmarks, gpt-4o-transcribe, the more accurate of the two transcription models, has a “word error rate” approaching 30% (out of 120%) for Indic and Dravidian languages such as Tamil, Telugu, Malayalam, and Kannada. The source explains that this means three out of every 10 words from the model will differ from a human transcription in those languages.

That caveat is central for developers evaluating the new models. A transcription system may perform well enough in one setting but fall short in another, especially when language coverage is uneven. For customer-facing agents, that difference can affect who receives a reliable experience and who does not.
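For readers unfamiliar with the metric, word error rate is commonly computed as the word-level edit distance between a human reference transcript and the model output, divided by the number of words in the reference. The sketch below is a minimal version of that standard definition, not OpenAI's benchmark code. The function name and example sentences are made up for illustration, and real evaluations usually normalize case and punctuation first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    reference = "please cancel my order and refund the full amount today"
    hypothesis = "please cancel my border and refund the fool amount tomorrow"
    print(word_error_rate(reference, hypothesis))  # prints 0.3
```

In this made-up example, three of the ten reference words are transcribed incorrectly, which matches the "three out of every 10 words" reading of a 30% rate. Because inserted words also count as errors, the rate is not capped at 100%: a transcript that adds many words that were never spoken can score above it.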

The source also notes that the article was updated March 20, 2025, 11:54 a.m. PT to clarify the language around word error rate and to update the benchmark results chart with a more recent version.

No open release for the new transcription tools

One of the biggest product shifts is availability. OpenAI does not plan to make gpt-4o-transcribe and gpt-4o-mini-transcribe openly available, according to the source.

That breaks from how OpenAI historically handled Whisper, where new versions were released for commercial use under an MIT license. Harris said the new transcription models are “much bigger than Whisper” and are therefore not good candidates for an open release.

Unlike Whisper, he said, they are not the kind of models that can simply run locally on a laptop. OpenAI’s position, as described in the source, is that open-source releases should be handled thoughtfully and matched to models designed for that need. Harris identified end-user devices as one of the most interesting cases for open-source models.

For developers, that creates a tradeoff. The new API models may offer stronger voice and transcription capabilities, especially for agentic products. But teams that relied on Whisper as an open model will not get the same kind of local, openly available path with these new transcription releases.

OpenAI’s voice update is therefore both a capability upgrade and a platform decision. It gives developers more ways to build spoken agents through the API, while moving the newest transcription improvements away from the open-release model that made Whisper widely accessible.