Resemble AI has released Chatterbox, a free open-source voice cloning model built for developers who want local voice generation with control over emotional delivery. The model can clone a voice from only a few seconds of audio, runs on Windows, Mac, and Linux, and is licensed under MIT.
What Chatterbox Offers
Chatterbox is positioned as a developer-focused voice cloning model rather than a closed hosted service. The key point is local use: the model runs on the user’s own machine, provided the system has 5–6 GB of video memory.
That local setup matters because it changes how developers can experiment with generated speech. Instead of depending only on a remote interface, they can work with an open-source model directly in their own development environment.
The release is also free and open-source under the MIT license. Because Chatterbox targets developers, its value lies not only in producing speech but in giving builders a model they can test, integrate, and evaluate locally.
Voice Cloning With Tone Control
The model clones voices using just a few seconds of audio. That short input requirement is one of the headline capabilities, because it lowers the amount of source material needed to create a synthetic voice.
Chatterbox also supports emotional tone control. The source gives examples such as "dramatic" and "monotone," so developers can guide not just what the generated voice says but how it is delivered.
This gives the model a broader role than simple text-to-speech output. A voice that can shift tone can be more useful for demos, experiments, and applications where the same words need different vocal presentation.
The source also notes that Chatterbox responds in under 200 milliseconds. That response time is part of its developer appeal, especially for projects where generated audio needs to feel responsive rather than delayed.
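One way to check a latency figure like this on local hardware is to time repeated calls and compare the median against a budget. The sketch below is only an illustration under stated assumptions: `synthesize` is a hypothetical stand-in for whatever function actually generates audio, `fake_synthesize` is a placeholder that produces no audio, and the 200 ms budget is simply the number quoted above. The source does not say exactly what the figure measures, so wall-clock time per call is an assumption too.

```python
import statistics
import time

LATENCY_BUDGET_MS = 200.0  # figure quoted in the article; what it measures is assumed


def measure_latency_ms(synthesize, text, runs=5):
    """Call synthesize(text) several times and return the median wall-clock time in ms."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        synthesize(text)
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def within_budget(latency_ms, budget_ms=LATENCY_BUDGET_MS):
    """Return True when the measured latency is strictly under the budget."""
    return latency_ms < budget_ms


def fake_synthesize(text):
    # Hypothetical placeholder: stands in for a real model call and returns bytes, not audio.
    return text.encode("utf-8")


if __name__ == "__main__":
    latency = measure_latency_ms(fake_synthesize, "Hello from a local model.")
    print(f"median latency: {latency:.3f} ms, within budget: {within_budget(latency)}")
```

Using the median rather than a single run reduces the effect of one-off stalls, such as the first call paying for model warm-up.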
Where It Runs
Chatterbox works on Windows, Mac, and Linux. That cross-platform support gives developers several options for local testing, rather than tying the model to a single operating system.
The hardware requirement named in the source is 5–6 GB of video memory. That requirement is important because local AI models are limited not only by software support, but by the machine available to run them.
At the same time, the model currently supports only English. That is a meaningful boundary for anyone evaluating Chatterbox for multilingual use, localization, or voice applications outside English-language speech.
- Platforms: Windows, Mac, and Linux.
- Memory requirement: 5–6 GB of video memory.
- Language support: English only.
- License: MIT.
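The constraints listed above can be turned into a quick pre-flight check before installing anything. This is a minimal sketch and not part of Chatterbox: the function name `can_run_chatterbox`, the 6 GB threshold (the upper end of the stated 5–6 GB range), the `"en"` language code, and passing the detected video memory in as a plain number are all assumptions made for illustration.

```python
import platform

# Assumed values copied from the summary list above; not read from Chatterbox itself.
SUPPORTED_SYSTEMS = {"Windows", "Darwin", "Linux"}  # platform.system() reports macOS as "Darwin"
MIN_VRAM_GB = 6.0  # upper end of the stated 5-6 GB range, chosen conservatively
SUPPORTED_LANGUAGES = {"en"}


def can_run_chatterbox(vram_gb, language="en", system=None):
    """Hypothetical pre-flight check: return (ok, reasons) for the given machine."""
    system = system or platform.system()
    reasons = []
    if system not in SUPPORTED_SYSTEMS:
        reasons.append(f"unsupported operating system: {system}")
    if vram_gb < MIN_VRAM_GB:
        reasons.append(f"{vram_gb} GB of video memory is below {MIN_VRAM_GB} GB")
    if language not in SUPPORTED_LANGUAGES:
        reasons.append(f"language {language!r} is not supported")
    return (not reasons, reasons)


if __name__ == "__main__":
    ok, reasons = can_run_chatterbox(vram_gb=8.0, language="en")
    print("ready" if ok else "; ".join(reasons))
```

Returning the list of reasons alongside the boolean lets a caller report every failed requirement at once instead of stopping at the first.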
Watermarking and AI Identification
Every generated speech output includes a faint watermark called "PerTh" to identify it as AI-made. That watermark is a notable part of the release because voice cloning raises an obvious question: how can generated speech be recognized as synthetic?
The source does not describe the watermark beyond that identification role, so the practical details are limited. What is clear is that Resemble AI has attached an AI-made marker to the speech generated by Chatterbox.
That detail sits alongside the model’s open-source and local-running design. Chatterbox is being offered as a tool developers can run themselves, while its outputs still include a signal that they were made by AI.
How Resemble AI Frames Performance
According to Resemble AI, Chatterbox performed better than ElevenLabs in blind tests. That is a performance claim from the company behind the release, and the source presents it as part of the model’s positioning.
Naming ElevenLabs directly gives readers evaluating voice cloning models a clear benchmark target for Chatterbox. The result is still Resemble AI’s own stated outcome, so it reads as a vendor claim rather than an independent comparison.
Taken together, the release gives developers a free open-source voice cloning model with local execution, fast responses, emotional tone control, English support, and AI watermarking. Its limits are also clear from the source: it requires 5–6 GB of video memory and currently supports only English.