The Decoder December 25, 2024 TERMINATOR

Open-source QVQ pushes visual reasoning toward top AI rivals

Qwen has introduced QVQ-72B-Preview, an open-source model built to reason over images and instructions. Early benchmark results put it near leading closed-source systems, though Qwen says the preview still has limits around language switching, circular reasoning, hallucinations, and safeguards.

WTF Index TERMINATOR

◄ Terminator 2 Idiocracy 1 ►

The story mildly leans Terminator because it highlights more powerful open-source visual reasoning capabilities, though the risks are mostly ordinary preview limitations.

Open-source QVQ pushes visual reasoning toward top AI rivals

Alibaba's AI research team Qwen has released QVQ-72B-Preview, an experimental open-source model designed to inspect images, reason through what it sees, and produce answers. The preview is positioned as a visual reasoning system, not just a vision-language model that describes image content.

The release matters because Qwen says QVQ-72B-Preview brings reasoning-style behavior to an open-source vision model. In early tests, it improved on Qwen's earlier vision-language model and reached accuracy levels described as similar to major closed-source systems from OpenAI, Google, and Anthropic.

What QVQ is built to do

QVQ-72B-Preview takes an image and user instructions, then works through the problem before giving a response. The model can pause to reflect when needed and returns answers with confidence scores for each prediction.

That approach resembles the step-by-step behavior associated with reasoning models such as OpenAI's o1 and Google's Flash Thinking. The key difference is that QVQ is aimed at visual reasoning: it is meant to combine image understanding with structured problem solving.

Qwen built QVQ-72B-Preview on top of Qwen2-VL-72B, its existing vision-language model. The new preview adds capabilities for thinking and reasoning, and Qwen describes it as the first open-source model of its kind.

The release also follows Qwen's recently released QwQ reasoning model. However, the team has not explained whether QVQ and QwQ are technically connected or how the two models relate to each other.

How Qwen tested the model

Qwen evaluated QVQ-72B-Preview using four benchmarks. Each benchmark focuses on a different kind of visual or scientific reasoning, which gives a clearer picture of where the model is meant to perform.

MMMU tests college-level visual understanding.
MathVista checks reasoning through mathematical graphs.
MathVision focuses on math competition problems.
OlympiadBench covers Olympic-level math and physics problems in both Chinese and English.

Across these tests, QVQ performed better than its predecessor, Qwen2-VL-72B-Instruct. The source also reports that it reached similar levels of accuracy as closed-source models including OpenAI's o1 and Claude 3.5 Sonnet.

For open-source AI, that comparison is significant. Visual reasoning has become a major frontier because many real problems are not text-only. They involve diagrams, charts, equations, images, and instructions that must be interpreted together.

Where the preview still struggles

Qwen is clear that QVQ-72B-Preview is not a finished product. The team describes it as experimental and points to several limitations that still need work before the model is ready for broader use.

One issue is language consistency. The model can unexpectedly switch between languages, which can make answers harder to follow or less reliable in a controlled workflow.

Another problem is circular reasoning. Qwen says QVQ can get stuck in loops, a limitation that the source notes has not been solved even by OpenAI's o1.

During complex visual reasoning tasks, the model can also lose track of what it is looking at. That failure mode can lead to hallucinations, especially when the model is trying to reason over difficult visual information.

Qwen also says the model needs stronger safeguards before it is suitable for widespread use. That caveat is important because a system that reasons over images may be used in higher-stakes contexts where incorrect conclusions are more consequential.

The bigger direction for Qwen

Qwen describes QVQ as its "last gift" of the year. The team presents it as one step toward a larger ambition: building what it calls an "omniscient and intelligent model" on the path to artificial general intelligence.

The group also points toward a unified "omni" model, similar in broad direction to OpenAI with GPT-4o. The goal is a system that can handle more complex scientific challenges by combining perception, reasoning, and broader intelligence in a single model.

"Imagine an AI that can look at a complex physics problem, and methodically reason its way to a solution with the confidence of a master physicist," the team explains.

For now, QVQ-72B-Preview is best understood as a research preview with strong early signs and visible constraints. Its open-source code and model weights are available through the project page, and Qwen also offers a free demo on Hugging Face.

The practical takeaway is straightforward: QVQ moves open-source visual reasoning closer to the level of leading closed-source AI systems, while still showing the unresolved problems that make advanced reasoning models difficult to deploy at scale.