The Decoder, January 5, 2025

New ByteDance AI turns portrait photos into audio-led conversations

ByteDance has developed INFP, an AI system that can animate static portrait photos so they appear to speak and respond to audio. The model is designed to handle both speaking and listening behavior in conversations without manual role assignment.

INFP advances realistic talking-portrait generation with some deepfake and truth-eroding potential, but the article describes a technical capability rather than clear harmful deployment.

ByteDance has developed an AI system that can make a still portrait look as if it is taking part in a conversation. The system, called INFP, uses audio input to generate facial expressions, head movement, lip motion, and listening behavior from a static image.

INFP stands for "Interactive, Natural, Flash and Person-generic." Its central promise is not just making a face appear to talk, but making two people in a conversation appear to react naturally as the exchange develops.

What INFP Is Built To Do

Simpler talking-photo systems focus on speech alone: an audio track comes in, and the face is animated to match the words. ByteDance's INFP goes further by trying to model the flow of a conversation, including moments when one person is speaking and the other is listening.

The important distinction is that INFP does not need a human operator to label who is speaking and who is listening at each point. The system works out those roles automatically as the conversation moves forward.

That matters because real conversation is not only a series of mouth movements. People nod, pause, tilt their heads, react with small expressions, and shift attention. A convincing conversation video has to account for those signals as well as lip synchronization.

According to the source article, ByteDance says INFP performs strongly in several areas:

matching lip movements to speech
preserving the person's unique facial features
creating a wide range of natural-looking movements
generating videos of someone listening during a conversation

How The System Turns Audio Into Motion

INFP works in two main stages. The first is called "Motion-Based Head Imitation." In this stage, the AI studies videos of people communicating and captures the details of their facial expressions and head movements.

Those movements are converted into motion data. Once the system has that data, it can apply similar movement patterns to a still portrait photo, making the person in the image appear animated.

The second stage is called "Audio-Guided Motion Generation." Here, the system connects sounds with the movements that should accompany them. That includes the motion patterns of a person who is speaking and the motion patterns of a person who is listening.

ByteDance's team developed a component called a "motion guider" for this part of the process. It analyzes audio from both sides of a conversation and produces motion patterns for the different conversational roles.

After that, a diffusion transformer refines those patterns into smoother motion. The goal is to make the final animation look natural rather than mechanical, with movement that fits the sound and the conversational context.
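
The data flow described above can be illustrated with a deliberately simple Python sketch. It is not ByteDance's implementation, and the source article gives no code or architecture details beyond the stage names. Everything here is an assumption made for illustration: the function names, the use of loudness as a proxy for who is speaking, the single "motion amplitude" number per frame, and the moving average that stands in for the diffusion transformer's refinement. The only idea it borrows from the article is that audio from both sides of the conversation is turned into a per-frame mix of speaking and listening motion, which is then smoothed.

```python
from typing import List


def frame_energy(samples: List[float], frame_len: int) -> List[float]:
    """Mean absolute amplitude per non-overlapping frame (crude loudness)."""
    return [
        sum(abs(s) for s in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def speaking_weight(own: List[float], other: List[float],
                    eps: float = 1e-8) -> List[float]:
    """Soft role per frame: near 1.0 means speaking, near 0.0 means listening.

    Assumption: relative loudness decides the role. No one labels it by hand.
    """
    return [a / (a + b + eps) for a, b in zip(own, other)]


def smooth(values: List[float], window: int = 3) -> List[float]:
    """Centered moving average; a toy stand-in for the refinement step."""
    half = window // 2
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - half):i + half + 1]
        out.append(sum(chunk) / len(chunk))
    return out


def toy_motion(own_audio: List[float], other_audio: List[float],
               frame_len: int = 2, speak_amp: float = 1.0,
               listen_amp: float = 0.2) -> List[float]:
    """Blend hypothetical speaking and listening motion amplitudes per frame."""
    own = frame_energy(own_audio, frame_len)
    other = frame_energy(other_audio, frame_len)
    weights = smooth(speaking_weight(own, other))
    return [w * speak_amp + (1 - w) * listen_amp for w in weights]
```

Calling `toy_motion` with one track that is loud and then silent, and a second track that is silent and then loud, gives an amplitude that eases from the speaking level down to the listening level instead of switching abruptly. The real system works with learned motion patterns rather than a single number, so this sketch only shows the shape of the pipeline.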

Why DyConv Was Created

To train INFP, the team built a new dataset called DyConv. It contains over 200 hours of real-world conversations gathered from videos across the internet.

The source article notes that other conversation databases exist, including ViCo and RealTalk. ByteDance's team says DyConv stands out because it captures a wider range of human emotions and expressions, and because its video quality is notably better.

That training material is central to the system's purpose. If the model is expected to generate realistic talking and listening behavior, it needs examples of how people actually behave in conversation. DyConv is meant to provide that range.

The emphasis on both emotion and expression also points to why this kind of AI is difficult. A still image does not contain the timing of a nod, the rhythm of a response, or the difference between speaking and quietly reacting. INFP attempts to infer those patterns from audio and learned motion examples.

What ByteDance Wants To Add Next

At the moment, INFP works with audio. The team is exploring ways to expand the system so it can also work with images and text.

That would broaden what the model can respond to, though the source article does not provide specific product plans or release details. The next stated goal is to create realistic animations of people's entire bodies, not only their heads and facial expressions.

That is a larger challenge. Full-body animation would require the system to handle posture, gesture, and broader physical movement, while still preserving the conversational realism that INFP is trying to achieve at the portrait level.

The Misuse Question

The researchers acknowledge that this kind of technology can be misused. A system that makes still images appear to speak and respond could be used to create fake videos or spread false information.

Because of that risk, they plan to keep the core technology limited to research institutions. The source article compares that approach to Microsoft limiting access to its advanced voice cloning system last summer.

INFP also fits into ByteDance's broader AI strategy, which the company announced earlier this year. With TikTok and CapCut in its portfolio, ByteDance has widely used platforms where AI media tools could eventually matter, although the source does not say INFP is being released inside those apps.

For now, the main takeaway is technical: ByteDance is working on AI that can animate portrait photos with audio in a way that accounts for both sides of a conversation. The result is a system aimed at making static images look less like simple talking heads and more like participants in a real exchange.