Alibaba says Qwen3-VL beats Gemini 2.5 Pro on vision tests

Alibaba has released Qwen3-VL, an open-source language vision model for images and text. The company says its top Qwen3-VL-235B-A22B variant outperforms Google's Gemini 2.5 Pro on major vision benchmarks.

WTF Index TERMINATOR
◄ Terminator 2 Idiocracy 1 ►

A more capable open-source multimodal model with interface interaction and video analysis nudges toward greater AI power and autonomy, though this is mostly a routine benchmark launch.

Alibaba says Qwen3-VL beats Gemini 2.5 Pro on vision tests

Alibaba has introduced Qwen3-VL, an open-source language vision model built to work across images and text. The release matters because Alibaba is positioning its top model, Qwen3-VL-235B-A22B, directly against Google's Gemini 2.5 Pro in major vision benchmarks.

The company reports that Qwen3-VL is not limited to basic image description. Its capabilities extend across interface interaction, screenshot-based coding, long video analysis, multilingual text recognition, spatial understanding, and math and science tasks.

What Alibaba released

Qwen3-VL is a language vision model, meaning it is designed to process visual information alongside text. In practical terms, that places it in the category of multimodal AI systems that can interpret what appears in an image or video and connect that visual input with written instructions or questions.

The top version named in the release is Qwen3-VL-235B-A22B. Alibaba is offering that version in two variants: "Instruct," and "Thinking,". The distinction is important because the two variants are being framed around different strengths.

Alibaba reports that the "Instruct," version outperforms Google's Gemini 2.5 Pro on major vision benchmarks. The "Thinking," version is described as scoring highly on multimodal reasoning tasks. Detailed benchmark results are available in Alibaba's technical blog, according to the source article.

That benchmark claim is the headline, but the broader point is that Alibaba is presenting Qwen3-VL as a general-purpose vision-language model rather than a narrow image analysis tool. The model is meant to handle images, text, video, code-related workflows, and reasoning-heavy tasks within one system.

Where Qwen3-VL is available

Alibaba has made Qwen3-VL available through several channels. The model can be found on Hugging Face, ModelScope, and Alibaba Cloud. Public chat access is also available at chat.qwen.ai.

That distribution matters for an open-source AI model because availability shapes who can test it, compare it, and build around it. Developers and researchers often look for models on widely used platforms first, while cloud access can make deployment and experimentation easier for teams already working inside that environment.

The source article describes Qwen3-VL as open-source. That detail is central to the release because it contrasts with the closed or tightly controlled access patterns often associated with leading frontier AI systems. For users evaluating vision-language models, open-source availability can make inspection, experimentation, and integration more direct.

What the model can do

Alibaba describes Qwen3-VL as capable of several visual and multimodal tasks. The model can interact with graphical interfaces, which means it is designed to reason about visual layouts and take action in interface-like environments.

It can also generate code from screenshots. That is a useful capability when a visual interface or design needs to be translated into working code. The source does not specify implementation details, but the stated ability shows that Alibaba is targeting workflows where images become structured output.

Video is another major part of the release. Qwen3-VL can analyze videos up to two hours long. That expands the model's potential use beyond static images and into longer visual sequences, where context may unfold over time rather than appearing in a single frame.

The model can recognize text in 32 languages, even when image quality is low. That places optical text recognition among its advertised strengths, especially in situations where documents, signs, screens, or other visual text may not be clean or easy to read.

Alibaba also says Qwen3-VL supports 2D and 3D spatial understanding. Spatial understanding is important for tasks where the model must reason about object placement, layout, depth, or relationships inside a scene. The model is also designed to handle math and science tasks, according to the source.

Why the Gemini 2.5 Pro comparison matters

The direct comparison with Google's Gemini 2.5 Pro gives the release a clear competitive frame. Alibaba is not simply saying Qwen3-VL is a new open-source vision model. It is reporting that the "Instruct," variant of Qwen3-VL-235B-A22B performs better than Gemini 2.5 Pro on major vision benchmarks.

Benchmarks do not cover every real-world use case, and the source article does not include the detailed results themselves. Still, benchmark positioning is often how AI model releases signal capability, especially in areas such as image understanding, multimodal reasoning, and visual problem solving.

The distinction between "Instruct," and "Thinking," also shows how Alibaba is segmenting the model. One variant is framed around strong vision benchmark performance, while the other is associated with multimodal reasoning. For users, that suggests different versions may be better suited to different types of work.

The model's feature list also points toward agentic and developer-oriented use cases. Graphical interface interaction, screenshot-to-code generation, long video analysis, and multilingual text recognition all imply workflows where the model is expected to do more than answer simple questions about an image.

The bigger picture for vision-language AI

Qwen3-VL arrives at a time when vision-language models are increasingly judged by how well they connect perception with action. A model that can identify visual elements is useful, but a model that can interpret an interface, reason across video, read degraded text, and support technical tasks has a wider role.

Alibaba's release focuses on that broader role. The company presents Qwen3-VL as a model for images and text, but the examples in the source point to a more expansive system: one that can read, reason, code from screenshots, understand spatial relationships, and process long-form video.

For teams comparing open-source AI models with proprietary alternatives, the key facts are straightforward. Qwen3-VL is open-source, it is available through Hugging Face, ModelScope, Alibaba Cloud, and chat.qwen.ai, and Alibaba says its top "Instruct," variant beats Google's Gemini 2.5 Pro on major vision benchmarks.

The most important next step for anyone evaluating the model is to look at Alibaba's detailed benchmark results and test Qwen3-VL against the specific tasks they care about. The release gives Qwen3-VL a strong claim in vision-language AI, but practical value will depend on how its image, video, text recognition, reasoning, and coding capabilities perform in real workflows.