Why DeepEyesV2 makes tool use matter more than model size

DeepEyesV2 is a multimodal AI model from Chinese researchers that analyzes images, runs code, and searches the web. Its results suggest smaller models can compete more effectively when they learn when and how to use external tools.

DeepEyesV2 points to a practical shift in multimodal AI: better performance does not always come from adding more parameters. The model, built by Xiaohongshu's research team, improves by learning how to combine image understanding with code execution, image search, and text search.

That makes the project notable for a simple reason. Instead of depending only on knowledge stored during training, DeepEyesV2 reaches outside the model when the task demands it.

A training problem hidden inside tool use

The researchers began with a familiar goal: teach a multimodal model to solve tasks that involve images, reasoning, and outside information. Early experiments showed that reinforcement learning by itself was not enough.

The model initially tried to use Python for image analysis, but the generated code was often faulty. As training continued, the opposite problem appeared: the model started avoiding tools entirely.

That instability matters because many real-world image tasks cannot be solved by recognition alone. A model may need to crop part of an image, run calculations, compare visual results, or search for background context that is not visible in the picture.

To address this, the team created a two-stage training pipeline. First came a cold-start phase designed to teach the connection between visual understanding and tool use. Reinforcement learning then refined those behaviors.

For demonstrations, the team used Gemini 2.5 Pro, GPT-4o, and Claude Sonnet 4 to generate tool-use trajectories, keeping only examples with correct answers and clean code. The reward design was simple: rewards were tied to answer accuracy and output format.
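
The filtering rule and the reward can be sketched in a few lines of Python. This is a minimal illustration, not the team's implementation: the `Trajectory` fields, the `<answer>` tag format, the case-insensitive exact match, the reading of "clean code" as code that runs without errors, and the 0.9/0.1 weights are all assumptions made for the example. The source says only that rewards depend on answer accuracy and output format.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Trajectory:
    """Hypothetical record for one generated demonstration (field names are assumptions)."""
    response: str           # full model output, including any tool calls
    reference: str          # ground-truth answer
    code_ran_cleanly: bool  # True if every generated snippet executed without error


# Assumed output format: the final answer is wrapped in <answer>...</answer>.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def extract_answer(response: str) -> Optional[str]:
    """Return the text inside the first <answer> tag, or None if the tag is missing."""
    match = ANSWER_RE.search(response)
    return match.group(1).strip() if match else None


def is_correct(traj: Trajectory) -> bool:
    """Case-insensitive exact match against the reference (an illustrative check)."""
    answer = extract_answer(traj.response)
    return answer is not None and answer.lower() == traj.reference.strip().lower()


def keep_for_cold_start(traj: Trajectory) -> bool:
    """Keep only demonstrations with a correct answer and error-free code."""
    return is_correct(traj) and traj.code_ran_cleanly


def reward(traj: Trajectory, w_acc: float = 0.9, w_fmt: float = 0.1) -> float:
    """Weighted sum of an accuracy term and a format term (weights are illustrative)."""
    fmt = 1.0 if extract_answer(traj.response) is not None else 0.0
    acc = 1.0 if is_correct(traj) else 0.0
    return w_acc * acc + w_fmt * fmt
```

Under these assumptions, a well-formatted wrong answer still earns the small format term, while a response with no answer tag earns nothing.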

The three tools DeepEyesV2 can reach for

DeepEyesV2 uses three categories of external tools, each serving a different role in multimodal reasoning.

  • Code execution supports image processing and numerical analysis.
  • Image search retrieves visually similar content.
  • Text search adds context that is not contained in the image itself.

This tool mix is important because visual tasks often require more than one capability. A flower-identification question, for example, may require isolating the relevant part of an image, searching for similar flowers, and then using those results to decide on the species.
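
One way to picture this kind of chaining is a loop that hands each tool's output to the next. The sketch below is illustrative only: the tool names, the `run_tool_chain` helper, and the stand-in functions are hypothetical, and real tools would call an image library or a search backend rather than tag a string. In DeepEyesV2 the model itself decides which steps to take; here they are passed in by hand.

```python
from typing import Callable, Dict, List, Tuple

Tool = Callable[[str], str]


def run_tool_chain(
    start: str, steps: List[str], tools: Dict[str, Tool]
) -> Tuple[str, List[Tuple[str, str]]]:
    """Apply the named tools in order, feeding each output into the next step.

    Returns the final output plus a log of (tool name, output) pairs.
    """
    current = start
    log: List[Tuple[str, str]] = []
    for name in steps:
        if name not in tools:
            raise KeyError(f"unknown tool: {name}")
        current = tools[name](current)
        log.append((name, current))
    return current, log


# Stand-in tools that only wrap their input, so the data flow stays visible.
demo_tools: Dict[str, Tool] = {
    "crop": lambda x: f"crop({x})",
    "image_search": lambda x: f"image_search({x})",
    "text_search": lambda x: f"text_search({x})",
}

final, log = run_tool_chain("photo", ["crop", "image_search"], demo_tools)
print(final)  # image_search(crop(photo))
```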

The model's behavior also varies by task type. For visual perception problems, it often crops the image to focus on the relevant region. For diagram-based math problems, it combines image analysis with numerical computation. For visually grounded knowledge questions, it launches targeted web searches based on the image.
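
Cropping, the first of those behaviors, is simple to express in code. The sketch below treats an image as a plain list of pixel rows so it runs without any imaging library. The `crop` helper and its box convention (left, top, right, bottom, with the right and bottom edges excluded) are assumptions for illustration, not the code the model actually generates.

```python
def crop(image, box):
    """Return the sub-image inside box = (left, top, right, bottom).

    `image` is a list of rows, each row a list of pixel values.
    Right and bottom edges are exclusive; the box is clamped to the image bounds.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    left, top, right, bottom = box
    left, right = max(0, left), min(width, right)
    top, bottom = max(0, top), min(height, bottom)
    if left >= right or top >= bottom:
        raise ValueError("crop box does not overlap the image")
    return [row[left:right] for row in image[top:bottom]]


# 4x4 test image whose pixel values encode their position (row * 10 + column).
image = [[r * 10 + c for c in range(4)] for r in range(4)]
region = crop(image, (1, 1, 3, 3))
print(region)  # [[11, 12], [21, 22]]
```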

RealX-Bench shows the integration gap

To test this approach, the researchers created RealX-Bench. The benchmark is designed to measure how well models coordinate visual understanding, web search, and reasoning.

The results show that this coordination remains difficult. Even the strongest proprietary model reached only 46 percent accuracy, while humans scored 70 percent.

The gap widened when tasks required all three capabilities to work together. According to the study, the accuracy of Gemini, the top-scoring model, fell from 46 percent overall to 27.8 percent when recognition, reasoning, and search all had to be integrated.

That decline highlights a key weakness in current multimodal systems. Models may handle individual skills reasonably well, but combining those skills into a single workflow is still hard.

DeepEyesV2 reached 28.3 percent overall accuracy on RealX-Bench, 6 points ahead of its base model, Qwen2.5-VL-7B, which scored 22.3 percent. It still trailed the 32-billion- and 72-billion-parameter versions of Qwen2.5-VL, but it outperformed other open-source models on tasks requiring coordination across all three capabilities.

The benchmark analysis also found that search tools were a major contributor to accuracy gains. Text search produced the largest gains, suggesting that many models still struggle to make full use of image search on its own.

Where the smaller model gains ground

DeepEyesV2's strongest improvements appear in specialized benchmarks. On MathVerse, it scored 52.7 percent, a 7.1-point improvement over its base model.

It also performed strongly on search-driven tasks. On MMSearch, DeepEyesV2 reached 63.7 percent, ahead of the dedicated MMSearch-R1 model at 53.8 percent.

The model's everyday image understanding results are also notable. The 7-billion-parameter DeepEyesV2 surpassed Qwen2.5-VL-32B despite having less than a quarter of its parameters.
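
The gaps implied by the reported scores are easy to check. The short script below uses only figures quoted in this article; the variable names are arbitrary, and results are rounded to one decimal place to avoid floating-point noise.

```python
# Scores as reported in this article, in percent.
realx_deepeyes_v2 = 28.3
realx_base = 22.3             # Qwen2.5-VL-7B
realx_best_overall = 46.0     # strongest proprietary model, overall
realx_best_all_three = 27.8   # same model when all three skills are required
realx_human = 70.0
mathverse_deepeyes_v2 = 52.7
mathverse_gain = 7.1          # reported improvement over the base model
mmsearch_deepeyes_v2 = 63.7
mmsearch_r1 = 53.8

# Differences in percentage points.
gaps = {
    "RealX-Bench gain over base": round(realx_deepeyes_v2 - realx_base, 1),
    "Top model drop when all skills are needed": round(realx_best_overall - realx_best_all_three, 1),
    "Human lead over top model": round(realx_human - realx_best_overall, 1),
    "Implied base score on MathVerse": round(mathverse_deepeyes_v2 - mathverse_gain, 1),
    "MMSearch lead over MMSearch-R1": round(mmsearch_deepeyes_v2 - mmsearch_r1, 1),
}

# Size ratio between the 32-billion and 7-billion-parameter models.
size_ratio = round(32 / 7, 1)

for label, value in gaps.items():
    print(f"{label}: {value}")
print(f"Parameter ratio, 32B to 7B: {size_ratio}x")
```

Run as written, it gives a 6.0-point gain on RealX-Bench, an 18.2-point drop for the top model on fully integrated tasks, a 24.0-point human lead, an implied MathVerse base score of 45.6 percent, a 9.9-point lead on MMSearch, and a size ratio of about 4.6.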

Taken together, these results support a clear implication: structured tool use can compensate for some limitations of smaller models. DeepEyesV2 does not win by storing more information inside the model. It improves by learning how to call outside resources more effectively.

Why the strategy matters

After reinforcement learning, DeepEyesV2 became more adaptive. It used tools less often overall, which suggests the model learned to call them only when needed. At the same time, the high variance in tool use across tasks shows that it continues to adjust its strategy depending on the problem.

That distinction is central. Effective tool use is not just about giving a model access to search or code. The harder challenge is teaching it to decide which tool fits the task, when to use it, and how to combine the results with its own reasoning.

The project also fits into Xiaohongshu's broader AI work. Its first open-source language model, dots.llm1, delivered competitive results and outperformed models from Alibaba and DeepSeek in efficiency. Its character recognition model, dots.ocr, showed similar capabilities.

The earlier DeepEyes release in May combined reasoning with multimodal understanding. DeepEyesV2 builds on that foundation and aims to bring these capabilities together in more agent-like environments.

DeepEyesV2 is available on Hugging Face and GitHub under the Apache License 2.0 and can be used commercially. For developers and researchers watching multimodal AI, its main message is direct: the next performance gains may come as much from better tool coordination as from larger models.