The Decoder, March 31, 2026

Alibaba's Qwen3.5-Omni pushes voice and video AI into code

Alibaba has released Qwen3.5-Omni, an omnimodal AI model that works across text, images, audio, and video. Its standout claims include stronger audio performance than Gemini 3.1 Pro, expanded speech recognition, and an emergent ability to write code from spoken instructions and video.

Alibaba has introduced Qwen3.5-Omni, a new omnimodal AI model built to handle text, images, audio, and video in one system. The release extends the Qwen series into broader real-time voice and audiovisual use cases, while also introducing a surprise capability: generating code from spoken instructions and video input.

The model is available in three Instruct variants: Plus, Flash, and Light. In contrast to earlier Qwen releases, Alibaba has not published the model weights or named a license, so Qwen3.5-Omni is currently available only as an API service.

What Qwen3.5-Omni Can Process

Qwen3.5-Omni is designed as a native omnimodal model. That means the system was trained to work across several forms of input rather than treating audio, video, images, and text as separate add-ons.

According to the Qwen team, the model handles contexts of up to 256,000 tokens. It can process more than ten hours of audio and over 400 seconds of 720p video at one frame per second. It was pre-trained natively across all modalities on more than 100 million hours of audiovisual material.
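As a rough sanity check on those figures, the arithmetic below works out the implied upper bound on tokens per second of audio and per video frame. It assumes, purely for illustration, that each modality on its own fills the whole 256,000-token context. The Qwen team has not published actual tokenization rates, and the constant and function names are hypothetical.

```python
# Figures quoted in the article; everything else here is an illustrative assumption.
CONTEXT_TOKENS = 256_000
AUDIO_SECONDS = 10 * 60 * 60   # "more than ten hours" of audio, taken as exactly ten
VIDEO_FRAMES = 400 * 1         # "over 400 seconds" of video at one frame per second


def max_tokens_per_unit(context_tokens: int, units: int) -> float:
    """Upper bound on tokens per unit if `units` alone must fit in the context."""
    if units <= 0:
        raise ValueError("units must be positive")
    return context_tokens / units


audio_budget = max_tokens_per_unit(CONTEXT_TOKENS, AUDIO_SECONDS)
video_budget = max_tokens_per_unit(CONTEXT_TOKENS, VIDEO_FRAMES)

print(f"audio: at most {audio_budget:.1f} tokens per second")
print(f"video: at most {video_budget:.0f} tokens per frame")
```

Under that assumption the budget comes to roughly 7 tokens per second of audio and 640 tokens per 720p frame. Because the article says "more than" and "over", the real rates would have to be at or below these bounds.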

It also generates speech output alongside text. That makes the model relevant not only for analysis tasks, but also for voice assistants, live dialog, transcription, translation, and audiovisual understanding.

Audio Benchmarks Are the Main Claim

The strongest performance claims center on Qwen3.5-Omni-Plus. The Qwen team says the Plus version sets a new state of the art across 215 audio and audiovisual subtasks. Those include three audiovisual benchmarks, five audio benchmarks, eight speech recognition benchmarks, 156 language-specific translation tasks, and 43 language-specific recognition tasks.

Qwen3.5-Omni-Plus reportedly beats Google's Gemini 3.1 Pro in overall audio comprehension, reasoning, recognition, translation, and dialog. For audiovisual comprehension overall, it matches Gemini 3.1 Pro.

The reported gaps range from about one point to nearly thirteen. In audio comprehension on MMAU, Qwen3.5-Omni-Plus scored 82.2, compared with 81.1 for Gemini 3.1 Pro. In music comprehension on RUL-MuchoMusic, it scored 72.4 versus 59.6. On the VoiceBench dialog benchmark, it reached 93.1 compared with Gemini's 88.9.

The Qwen team also says visual and text capabilities match the standalone Qwen3.5 text models at the same size. That matters because the model is being positioned as a broader assistant, not only as an audio specialist.

Speech Support Expands Sharply

One of the clearest product changes is language coverage. Speech recognition now supports 74 languages and 39 Chinese dialects, for 113 languages and dialects total. The previous Qwen3-Omni handled eleven languages and eight Chinese dialects.

Voice output supports 36 languages and dialects. The system includes 55 voices, with user-defined, scenario-specific, dialectal, and multilingual options.

On the Fleurs speech recognition dataset for the top 60 languages, Qwen3.5-Omni-Plus achieved a word error rate of 6.55, compared with 7.32 for Gemini 3.1 Pro. For Chinese variants such as Cantonese, the reported gap is much wider: 1.95 versus 13.40.
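Word error rate, the metric behind these figures, is the word-level edit distance between a reference transcript and the model's output, divided by the number of reference words. The sketch below is a minimal implementation, not the Qwen team's evaluation code. Real evaluations usually normalize casing and punctuation first, which is omitted here, and the reported scores are presumably percentages while this function returns a fraction.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref: List[str] = reference.split()
    hyp: List[str] = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # reference word dropped (deletion)
                curr[j - 1] + 1,     # extra hypothesis word (insertion)
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over six words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Lower is better, and because insertions count as errors, the rate can exceed 1.0 when the output is much longer than the reference.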

The Qwen team also compares speech generation against ElevenLabs, Gemini 2.5 Pro, GPT-Audio, and Minimax. On the seed-hard test set, Qwen3.5-Omni-Plus has a word error rate of 6.24. GPT-Audio is listed at 8.19, Minimax at 8.62, and ElevenLabs at 27.70. For voice cloning across 20 languages, the model reaches a word error rate of 1.87 and a cosine similarity of 0.79.
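Cosine similarity in voice-cloning evaluations is typically computed between speaker-embedding vectors of the original and the cloned voice, with 1.0 meaning the vectors point in the same direction. The article does not say which embedding model was used, so the sketch below only shows the formula, and its short vectors are made up for illustration.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of a and b divided by the product of their lengths."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical speaker embeddings (real ones have hundreds of dimensions).
original_voice = [0.2, 0.7, 0.1]
cloned_voice = [0.25, 0.6, 0.2]
print(cosine_similarity(original_voice, cloned_voice))
```

The measure ignores vector magnitude, so it captures how similar two voices are in direction within the embedding space rather than how loud or long the recordings were.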

ARIA Targets Real-Time Voice Problems

The architecture keeps the Qwen series' thinker-talker design. The thinker analyzes omnimodal input and produces text, while the talker converts that into contextual speech.

Both parts now use a hybrid attention-MoE architecture, replacing the pure mixture-of-experts setup used by the predecessor. The key technical change is ARIA, short for Adaptive Rate Interleave Alignment.

ARIA dynamically aligns and interleaves text and voice tokens. The Qwen team built it to address a real-time speech issue: text and voice tokens encode at different rates. In streaming conversations, that mismatch can lead to dropped words, mispronunciations, or garbled numbers.

The predecessor used a rigid 1:1 mapping between text and audio tokens. ARIA is intended to make speech synthesis more natural and robust while keeping real-time performance.
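To make the rate mismatch concrete, the toy sketch below contrasts a rigid one-to-one interleave with a variable-rate one. It is not the ARIA algorithm, which the article does not describe in detail. The function names and token labels are hypothetical, and the per-word audio counts are invented.

```python
from typing import List, Sequence


def interleave_fixed(text: Sequence[str], audio: Sequence[str]) -> List[str]:
    """Rigid schedule: exactly one audio token after each text token."""
    out: List[str] = []
    for i, tok in enumerate(text):
        out.append(tok)
        if i < len(audio):
            out.append(audio[i])
    # Audio tokens beyond the 1:1 pairing trail at the end, detached from their words.
    out.extend(audio[len(text):])
    return out


def interleave_adaptive(text: Sequence[str], audio_runs: Sequence[Sequence[str]]) -> List[str]:
    """Variable schedule: each text token is followed by its own run of audio tokens."""
    if len(text) != len(audio_runs):
        raise ValueError("need one audio run per text token")
    out: List[str] = []
    for tok, run in zip(text, audio_runs):
        out.append(tok)
        out.extend(run)
    return out


# A number usually takes longer to say than a short word (counts are invented).
text = ["1995", "is", "fine"]
audio_runs = [["a1", "a2", "a3"], ["a4"], ["a5"]]
flat_audio = [a for run in audio_runs for a in run]

print(interleave_fixed(text, flat_audio))
print(interleave_adaptive(text, audio_runs))
```

In the fixed schedule, the text token "is" arrives while the audio for "1995" is still being emitted, which is the kind of drift that could plausibly garble numbers in a streaming setting. The variable schedule keeps each word's audio next to the word.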

The Unexpected Coding Capability

The most unusual claim is an emergent skill the Qwen team calls "audio-visual vibe coding." While scaling omnimodal training, the model reportedly gained the ability to write code directly from spoken instructions and video content, even though it was never specifically trained to do so.

In one demo, Qwen3.5-Omni-Plus builds a working snake game from a verbal description and a video clip. The broader implication is straightforward: if a model can connect what it hears, what it sees, and what it writes, coding interfaces may become less dependent on typed prompts alone.

The model also produces detailed descriptions of audio and video. It can segment content automatically, add timestamps accurate to the second, and provide information about characters, dialog, sound effects, and how those elements interact. In one demo, it breaks down a three-minute lion documentary scene by scene. In another, it flags violent scenes in video games for content moderation, listing them in a table with timestamps and risk levels.

Access Is API-Only for Now

Qwen3.5-Omni adds several real-time conversation features. "Semantic interruption" is designed to determine whether a user actually intends to speak, while ignoring background noise or brief interjections. The model can also decide on its own whether to run a web search for current questions and can handle complex function calls.

Users can adjust volume, tempo, and emotion through voice commands during a conversation. Voice cloning lets users upload their own voice and use it as the AI assistant voice. The Qwen team says these features are available through the real-time API.

The model is also accessible through Qwen Chat and Alibaba Cloud Model Studio. For developers, the important limitation is distribution: Qwen3.5-Omni is not being released with open weights at this stage.

The release arrives during a period of rapid Qwen model rollout and internal change. Alibaba launched Qwen3-Omni in September 2025, and has also expanded the Qwen3.5 text model series to four models, including Qwen3.5-397B-A17B with 397 billion total parameters and 17 billion active. Alibaba's chief AI developer, Junyang Lin, recently announced his surprise departure, and other key team members followed. Alibaba CEO Eddie Wu responded by announcing a new "Foundation Model Task Force," saying foundation model development remains a "core strategic priority for our future."