MIT Tech Review AI September 4, 2025 IDIOCRACY

AI avatars are getting harder to spot and easier to talk to

Synthesia’s Express-2 model makes AI avatars more lifelike, with stronger voice cloning, more natural gestures, and better preservation of accents. The next step is real-time interaction, which could make these digital humans useful for training and education while also raising concerns about trust and emotional attachment.

More lifelike AI avatars and voice cloning mainly threaten trust, authenticity, and human judgment, though misuse risks are present too.

AI avatars are moving closer to ordinary video. Synthesia’s newest model, Express-2, shows how quickly synthetic presenters are becoming smoother, more expressive, and harder to identify at first glance.

The company is still focused on corporate uses such as financial results, internal communications, and staff training videos. But its ambitions point toward something larger: avatars that do not just read a script, but can respond in real time.

From scripted presenter to lifelike clone

Synthesia launched in 2017 with a different emphasis. Its early work focused on matching AI versions of real human faces, including David Beckham, with dubbed voices speaking in different languages.

By 2020, the company was offering business customers a way to create professional-level presentation videos using AI versions of staff members or consenting actors. The results were useful, but imperfect. Movements could look stiff, accents could slip, and vocal emotion did not always align with the face on screen.

Express-2 is meant to close that gap. Synthesia says the newer system creates more natural mannerisms, more expressive voices, stronger accent preservation, and better alignment between speech and gesture.

A recent test showed that creating an avatar has become easier too. A previous visit to Synthesia’s London studio required a longer calibration process, including reading scripts in different emotional states and mouthing sounds used for vowels and consonants. This time, the company gathered the needed footage in just an hour.

A couple of weeks later, the result was delivered in two versions: one made with the earlier Express-1 model and another using Express-2. The contrast showed real progress, but also showed how narrow the difference between convincing and unsettling can be.

What Express-2 improves

The Express-2 avatar looked strikingly close to the person it was modeled on. Its facial features matched well, its voice was described as spookily accurate, and its hand movements generally lined up with what was being said.

That is a clear improvement over the Express-1 version in the same test, which blinked heavily and struggled to synchronize body movement with speech. Last year, another Express-1 avatar also failed to match a transatlantic accent and had a limited emotional range.

Still, the signs of AI generation had not disappeared. The avatar’s palms looked unusually bright pink and smooth. Hair remained stiff instead of moving naturally. The eyes stared ahead and rarely blinked. The voice was recognizable, but the intonation and speech patterns sometimes felt slightly wrong.

Anna Eiserbeck, a postdoctoral psychology researcher at the Humboldt University of Berlin, said she was not sure she would have recognized the avatar as a deepfake at first glance. Over time, though, she would have noticed that something was off.

“Something seemed a bit empty. I know there’s no actual emotion behind it; it’s not a conscious being. It does not feel anything,” Eiserbeck said.

That reaction captures the central tension. An AI clone can appear more polished than a real person, while still lacking the inner life that people instinctively look for in a face.

The technical challenge is behavior

Björn Schuller, a professor of artificial intelligence at Imperial College London, says the hard part is no longer simply matching appearance. The bigger challenge is copying behavior: micro gestures, intonation, the sound of a voice, and the right word at the right moment.

“I don’t want an AI [avatar] to frown at the wrong moment—that could send an entirely different message.”

To improve realism, Synthesia built new audio and video AI models. Its voice cloning model is designed to preserve a speaker’s accent, intonation, and expressiveness, instead of flattening a distinctive accent into something more generic.

Express-1 analyzes a script to infer tone, then feeds that information into a diffusion model that renders facial expressions and movements. Express-2 uses a longer chain of models. One model generates gestures from the speech provided by Express-Voice, the company’s voice cloning model. A second model compares multiple versions of that motion against the audio and selects the best fit. A final model renders the avatar using the selected motion.
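The generate, compare, and render chain described above can be illustrated with a toy sketch. This is not Synthesia’s code or API: every function name here is hypothetical, the “audio” is just a list of loudness values, and the scoring rule (negative mean squared error between loudness and gesture size) is an assumption chosen only to make the selection step concrete.

```python
import random


def propose_motions(audio_energy, n_candidates=5, seed=0):
    """Stage 1 stand-in (hypothetical): propose several gesture tracks.

    Each track has one gesture-size value per audio frame. Candidates mix
    the audio signal with random noise, so some follow the speech closely
    and others do not.
    """
    rng = random.Random(seed)
    candidates = []
    for _ in range(n_candidates):
        noise = rng.uniform(0.0, 1.0)
        track = [(1 - noise) * e + noise * rng.random() for e in audio_energy]
        candidates.append(track)
    return candidates


def alignment_score(audio_energy, motion):
    """Stage 2 stand-in (assumed rule): higher means better alignment.

    Uses negative mean squared error between loudness and gesture size.
    """
    squared_errors = [(a - m) ** 2 for a, m in zip(audio_energy, motion)]
    return -sum(squared_errors) / len(audio_energy)


def select_best(audio_energy, candidates):
    """Pick the candidate motion that best fits the audio."""
    return max(candidates, key=lambda m: alignment_score(audio_energy, m))


def render(motion):
    """Stage 3 stand-in: turn the chosen motion into placeholder frames."""
    return [f"frame {i}: gesture={v:.2f}" for i, v in enumerate(motion)]


def run_pipeline(audio_energy, n_candidates=5, seed=0):
    """Chain the three stages: propose, select, render."""
    candidates = propose_motions(audio_energy, n_candidates, seed)
    best = select_best(audio_energy, candidates)
    return render(best)


if __name__ == "__main__":
    for line in run_pipeline([0.2, 0.8, 0.5, 0.1]):
        print(line)
```

The point of the sketch is the shape of the pipeline: generating several options and keeping the one that best fits the audio means the renderer only ever sees motion that has already passed a check against the speech.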

The rendering model is much larger than before. Express-1 used a model with a few hundred million parameters, while Express-2’s rendering model has billions. Youssef Alami Mejjati, Synthesia’s head of research and development, said the newer system can learn associations automatically because it has been trained on larger, more diverse data sets with more compute.

Where synthetic presenters go next

Synthesia is not the only company building AI avatars for business video. Yuzu Labs, Creatify, Arcdads, and Vidyard also offer tools for generating and editing videos with AI actors or artificial staff members. In China, AI-generated clones of livestreamers have become popular partly because they can sell products 24/7 without getting tired or needing to be paid.

Synthesia says it remains “laser focused” on corporate use, but it is not ruling out entertainment or education. The company has also partnered with Google to integrate Veo 3 into its platform, allowing users to generate and embed clips directly into Synthesia videos.

That could support practical training content, such as a video of meat-processing machinery with an avatar explaining safe use beside it. Alex Voica, head of corporate affairs and policy at Synthesia, also described future educational videos that could be adjusted to the viewer’s level of knowledge, whether that is a biology degree or a high-school education.

The bigger shift would be avatars that can talk back. Synthesia has already added interactive quiz features that let users click through on-screen questions. The company is also exploring avatars that can pause, expand on a point, or answer a question in real time.

That possibility makes the technology more useful, but also more complicated. Pat Pataranutaporn, an assistant professor at the MIT Media Lab, warned that people already form deep emotional bonds with AI systems, including basic text-based chatbots. A realistic face could intensify that connection.

“If you make the system too realistic, people might start forming certain kinds of relationships with these characters,” Pataranutaporn said.

Schuller also expects future avatars to become better at adjusting emotion and charisma to hold attention. The result could be highly engaging digital humans that are always available, always responsive, and difficult for ordinary human interaction to compete with.

That is why the progress of AI avatars matters beyond business video. Express-2 shows that synthetic humans are becoming better presenters. The next question is what happens when they become better conversational partners too.