Real-time translation is moving beyond one person speaking into a device. A new system called Spatial Speech Translation is designed for a harder setting: a group conversation where several people may speak, switch languages and overlap.
The project focuses on headphones, not a phone screen. Its goal is to help the wearer understand who is speaking, where that speaker is and what they are saying, rendered in English.
Why group speech is harder
Many live AI translation tools are built around a single speaker. That is useful for direct exchanges, but it does not solve the problem of a dinner table or group discussion where multiple people are talking in different languages.
Spatial Speech Translation addresses that challenge by combining translation with spatial awareness. The system tracks the direction and vocal characteristics of each speaker so the listener can connect the translated words with the person who said them.
That detail matters because a group conversation is not only a stream of words. The listener also needs to follow turns, recognize voices and understand where attention is moving. Without that, even accurate translation can feel disconnected from the room.
How Spatial Speech Translation works
The system uses two AI models. The first divides the area around the headphone wearer into small regions, then uses a neural network to look for possible speakers and identify their direction.
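The region-by-region search can be illustrated with a much simpler classical stand-in. The sketch below is not the researchers' neural network: it assumes just two microphones, a made-up spacing and sample rate, and a single simulated noise source, and it scores each angular region by how well the two recordings line up at the time delay that region would produce. All names and constants are hypothetical.

```python
import math
import random

SPEED_OF_SOUND = 343.0  # metres per second
MIC_SPACING = 0.18      # assumed distance between the two earcup mics, in metres
SAMPLE_RATE = 16_000    # assumed sampling rate, in Hz


def delay_in_samples(angle_deg):
    """Extra samples the left mic waits for a far-field source at angle_deg.

    0 is straight ahead; positive angles are to the wearer's right.
    """
    seconds = MIC_SPACING * math.sin(math.radians(angle_deg)) / SPEED_OF_SOUND
    return round(seconds * SAMPLE_RATE)


def simulate_two_mics(angle_deg, num_samples=2000, seed=0):
    """Return (left, right) noise recordings of one source at angle_deg."""
    rng = random.Random(seed)
    source = [rng.gauss(0.0, 1.0) for _ in range(num_samples)]
    lag = delay_in_samples(angle_deg)
    if lag >= 0:  # source on the right: the left mic hears it later
        left = [0.0] * lag + source[:num_samples - lag]
        right = source
    else:         # source on the left: the right mic hears it later
        left = source
        right = [0.0] * (-lag) + source[:num_samples + lag]
    return left, right


def correlation(left, right, lag):
    """Sum of left[i + lag] * right[i] over the samples where both exist."""
    total = 0.0
    for i, r in enumerate(right):
        j = i + lag
        if 0 <= j < len(left):
            total += left[j] * r
    return total


def score_regions(left, right, region_width_deg=15):
    """Map each region's centre angle to the correlation at its expected lag."""
    return {
        angle: correlation(left, right, delay_in_samples(angle))
        for angle in range(-90, 91, region_width_deg)
    }


left, right = simulate_two_mics(angle_deg=30)
scores = score_regions(left, right)
print(max(scores, key=scores.get))  # the region with the strongest match
```

For the simulated source at 30 degrees, the best-scoring region is the one centred on 30. With only two microphones and coarse regions, neighbouring angles near the sides can map to the same delay, which is one reason a real system needs more than this toy.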
The second model handles language and voice. It translates speech from French, German or Spanish into English text using publicly available data sets. It also extracts characteristics of each speaker’s voice, including pitch and amplitude, along with emotional tone.
Those properties are then applied to the translated text. The result is a translated voice that sounds similar to the original speaker rather than a robotic computer voice.
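To make "pitch and amplitude" concrete, here is a minimal sketch of how those two properties could be measured on a short audio frame and how a level could be carried over to synthesized speech. It assumes a plain autocorrelation pitch estimate and a root-mean-square level, which are textbook techniques and not necessarily what the system uses; the function names and the pitch search range are hypothetical.

```python
import math


def rms_amplitude(samples):
    """Root-mean-square level of the frame."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))


def estimate_pitch_hz(samples, sample_rate, min_hz=60.0, max_hz=400.0):
    """Pick the autocorrelation peak within an assumed speech pitch range."""
    shortest_lag = int(sample_rate / max_hz)
    longest_lag = int(sample_rate / min_hz)
    best_lag, best_score = shortest_lag, float("-inf")
    for lag in range(shortest_lag, longest_lag + 1):
        score = sum(samples[i] * samples[i + lag]
                    for i in range(len(samples) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag


def apply_level(samples, target_rms):
    """Scale a synthesized frame so its RMS matches the original speaker's."""
    current = rms_amplitude(samples)
    if current == 0:
        return list(samples)
    return [x * target_rms / current for x in samples]


# A 200 Hz tone stands in for 0.1 seconds of voiced speech.
sample_rate = 16_000
tone = [0.5 * math.sin(2 * math.pi * 200 * n / sample_rate) for n in range(1600)]
print(estimate_pitch_hz(tone, sample_rate), rms_amplitude(tone))
```

Real voices are far less tidy than a pure tone, and emotional tone is not captured by these two numbers, so this only shows the kind of measurement involved.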
The translated speech is also presented as if it is coming from the speaker’s direction. A few seconds after someone speaks, the headphone wearer hears the English version with both spatial placement and a voice that resembles the person speaking.
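Placing a voice in a direction can be approximated in a few lines. The sketch below is an assumption-laden simplification, not the project's rendering method: it turns a mono signal into left and right channels using constant-power panning plus a small arrival-time difference between the ears. The head width and sample rate are assumed values, and the function name is hypothetical.

```python
import math


def spatialize(mono, angle_deg, sample_rate=16_000, head_width_m=0.18):
    """Return (left, right) channels that place mono at angle_deg.

    0 is straight ahead, +90 is fully to the right, -90 fully to the left.
    """
    angle = math.radians(max(-90.0, min(90.0, angle_deg)))

    # Constant-power pan: map [-90, 90] degrees onto [0, pi/2].
    pan = (angle + math.pi / 2) / 2
    left_gain, right_gain = math.cos(pan), math.sin(pan)

    # The ear farther from the source hears the sound slightly later.
    delay = round(abs(head_width_m * math.sin(angle)) / 343.0 * sample_rate)
    pad = [0.0] * delay
    near = list(mono) + pad  # starts immediately, padded to equal length
    far = pad + list(mono)   # starts after the delay

    if angle >= 0:  # source on the right: the left ear is the far ear
        left = [left_gain * x for x in far]
        right = [right_gain * x for x in near]
    else:           # source on the left: the right ear is the far ear
        left = [left_gain * x for x in near]
        right = [right_gain * x for x in far]
    return left, right


left, right = spatialize([1.0, 0.5, 0.25], angle_deg=45)
print(len(left), len(right))
```

At 0 degrees both channels receive the same signal; at 90 degrees nearly all of the energy goes to the right channel. More realistic binaural rendering typically also models how the head and outer ear filter sound, which this sketch omits.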
The hardware behind the demo
The research was designed to work with existing, off-the-shelf noise-canceling headphones that have microphones. In the described setup, the headphones are plugged into a laptop powered by Apple’s M2 chip, which can run neural networks.
The same chip is also present in the Apple Vision Pro headset. The research was presented at the ACM CHI Conference on Human Factors in Computing Systems in Yokohama, Japan, this month.
This is different from systems such as the live AI translation running on Meta’s Ray-Ban smart glasses, which focus on a single speaker. Spatial Speech Translation is aimed at the more complex situation of several speakers at once.
What researchers say is promising
Shyam Gollakota, a professor at the University of Washington who worked on the project, described the language barrier as a confidence barrier. He pointed to his mother, who speaks Telugu and finds it difficult to communicate with people in the US when she visits from India.
Alina Karakanta, an assistant professor at Leiden University in the Netherlands who studies computational linguistics and was not involved in the project, said the application could be useful.
Samuele Cornell, a postdoctoral researcher at Carnegie Mellon University’s Language Technologies Institute who did not work on the project, noted that separating human voices is already difficult for AI systems. He said it is impressive to combine that with real-time translation, distance mapping and usable latency on a real device.
The latency problem
The team is now working to reduce how long it takes for the translation to begin after someone speaks. Gollakota said the goal is to bring latency to less than a second so conversations can retain their natural feel.
That is not only a computing problem. Translation speed also depends on language structure. Among the three languages the system was trained on, it translated French into English fastest, followed by Spanish and then German.
Claudio Fantinuoli, a researcher at the Johannes Gutenberg University of Mainz in Germany who did not work on the project, explained that German often places verbs and much of a sentence’s meaning at the end rather than the beginning.
That creates a trade-off. Waiting longer can give the system more context and improve translation quality, but waiting too long makes the exchange feel less conversational. Fantinuoli warned that reducing latency could make translations less accurate.
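The trade-off can be made concrete with a "wait-k" schedule, a policy from simultaneous translation research in which the translator stays k source words behind the speaker. Whether Spatial Speech Translation uses anything like it is not stated; the sketch below is only an illustration, and the six-word verb-final sentence is a hypothetical example.

```python
def wait_k_schedule(num_source, num_target, k):
    """Source words read so far when each target word (0-based) is emitted."""
    return [min(t + k, num_source) for t in range(num_target)]


def has_seen(schedule, target_pos, needed_source_pos):
    """True if source word needed_source_pos (1-based) was read in time
    for the target word at target_pos (0-based)."""
    return schedule[target_pos] >= needed_source_pos


# Hypothetical sentence: six source words with the verb last (position 6),
# while the English verb is the second target word (index 1).
for k in (2, 5):
    schedule = wait_k_schedule(num_source=6, num_target=6, k=k)
    print(k, schedule, has_seen(schedule, target_pos=1, needed_source_pos=6))
```

With k = 2 the English verb has to be produced after only three source words, before the verb has been heard, so the system must guess. With k = 5 the verb is available, but the listener waits five words before hearing anything.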
Spatial Speech Translation has so far been tested only in limited settings. Cornell said a real product would need much more training data, possibly including noise and real-world recordings from the headset rather than relying on synthetic data alone.
Even with those constraints, the system points to a clear direction for AI translation: not just converting words, but preserving the social structure of conversation. In multilingual groups, the next step is making the translation sound like it belongs to the person who spoke.