GeoVista shows how quickly open-source AI geolocation is moving toward the performance of leading commercial models. Built by Tencent and several Chinese universities, the system identifies where an image was taken by combining visual analysis with live web searches.
The model is designed for a difficult task: looking at a photo, panorama, or satellite image and estimating its geographic location. Instead of relying only on what is visible in the pixels, GeoVista can inspect details more closely and retrieve outside information when it needs more evidence.
How GeoVista Works
GeoVista is built on Qwen2.5-VL-7B-Instruct and adds a tool-based workflow around the base model. Its two primary tools are simple in concept but important in practice.
- Zoom: the model can magnify specific regions of an image to inspect details.
- Search: the model can pull up to ten relevant sources from platforms like Tripadvisor, Instagram, Facebook, Pinterest, and Wikipedia.
GeoVista decides on its own when to use each tool. That matters because geolocation often depends on small clues: architecture, road layouts, terrain, public spaces, or other details that may not be obvious at full image size.
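The tool-use pattern described above can be pictured as a short loop in which the model either calls a tool or commits to an answer. The sketch below is only an illustration under stated assumptions: `Step`, `zoom`, `search`, `run_episode`, and `scripted_policy` are hypothetical names, the tools return placeholder strings instead of real crops or search results, and a scripted policy stands in for the model. It is not GeoVista's actual code.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Step:
    """One decision by the policy: call a tool or give the final answer."""
    action: str    # "zoom", "search", or "answer"
    argument: str  # crop region, search query, or final location text


def zoom(argument: str) -> str:
    # Hypothetical stand-in: a real tool would return a magnified image crop.
    return f"[crop of region {argument}]"


def search(argument: str, max_results: int = 10) -> str:
    # Hypothetical stand-in: a real tool would query a web search provider.
    return f"[up to {max_results} results for '{argument}']"


TOOLS = {"zoom": zoom, "search": search}


def run_episode(policy: Callable[[List[str]], Step],
                max_turns: int = 6) -> Tuple[str, List[str]]:
    """Let the policy pick tools until it answers or runs out of turns."""
    observations: List[str] = []
    for _ in range(max_turns):
        step = policy(observations)
        if step.action == "answer":
            return step.argument, observations
        # Run the chosen tool and feed its output back to the policy.
        observations.append(TOOLS[step.action](step.argument))
    return "unknown", observations


def scripted_policy(observations: List[str]) -> Step:
    # Toy policy standing in for the model: zoom, then search, then answer.
    script = [
        Step("zoom", "x=0.6,y=0.2,w=0.2,h=0.2"),
        Step("search", "street sign text from the crop"),
        Step("answer", "Example City, Exampleland"),
    ]
    return script[min(len(observations), len(script) - 1)]


answer, trace = run_episode(scripted_policy)
```

In this sketch the decision about when to zoom or search lives entirely in the policy, which mirrors the article's point that the model chooses its tools on its own.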
The researchers describe live search as a central advantage over approaches that focus mainly on image manipulation. They cite Mini-o3 and ByteDance's DeepEyes as examples of systems that do not draw on external data in the same way. The paper does not specify which search provider GeoVista uses.
Training The Model To Reason With Tools
The team trained GeoVista in two phases. The first phase used supervised learning to teach basic reasoning and tool use with 2,000 curated examples.
Those examples were generated with help from commercial AI models, which produced the tool calls and the justifications for them. The researchers then assembled these into multi-step reasoning traces so GeoVista could learn not only an answer, but a structured way to work toward one.
The second phase used reinforcement learning with 12,000 examples. Here, the reward system pushed the model toward more precise geographic answers. A correct city-level answer earned more than an answer that stopped at the province or country level.
That tiered reward system is important because image geolocation has layers of difficulty. Identifying a country can be useful, but the real challenge is narrowing an image down to a province or city. GeoVista's training was shaped around that hierarchy.
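A tiered reward of this kind can be sketched in a few lines. The article does not give the actual reward values, so the weights below (0.2, 0.5, 1.0) are assumptions chosen only to show the ordering, and `tiered_reward` is a hypothetical name. The sketch also assumes that a finer level only counts when every coarser level is correct.

```python
from typing import Dict, Optional

# Assumed weights for illustration only; the article does not report the real values.
DEFAULT_WEIGHTS = {"country": 0.2, "province": 0.5, "city": 1.0}


def _same(a: Optional[str], b: Optional[str]) -> bool:
    """Case-insensitive comparison that treats missing values as a mismatch."""
    if not a or not b:
        return False
    return a.strip().casefold() == b.strip().casefold()


def tiered_reward(pred: Dict[str, str], truth: Dict[str, str],
                  weights: Optional[Dict[str, float]] = None) -> float:
    """Return the reward of the finest level matched, walking coarse to fine."""
    weights = weights or DEFAULT_WEIGHTS
    reward = 0.0
    for level in ("country", "province", "city"):
        if not _same(pred.get(level), truth.get(level)):
            break  # a miss at this level ends the walk
        reward = weights[level]
    return reward


truth = {"country": "France", "province": "Île-de-France", "city": "Paris"}
city_hit = tiered_reward(truth, truth)
country_only = tiered_reward({"country": "France"}, truth)
```

Here `city_hit` is larger than `country_only`, which is the property the article describes: a correct city-level answer earns more than one that stops at the country.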
How It Performs On GeoBench
On the team's custom GeoBench dataset, GeoVista reached 92.64 percent accuracy at the country level, 79.60 percent at the province level, and 72.68 percent at the city level.
Performance varied by image type. GeoVista worked best on panoramas, where it reached 79.49 percent city accuracy. Standard photos followed at 72.27 percent. Satellite images were the most difficult category, with city accuracy at 44.92 percent.
The comparison with commercial models is close but uneven. Gemini 2.5 Pro reached 78.98 percent city-level accuracy, Gemini 2.5 Flash scored 73.29 percent, and GPT-5 reached 67.11 percent. Mini-o3-7B, another open-source model included in the comparison, reached only 11.3 percent.
The recently announced Gemini 3 could change these rankings in future tests. Based on the numbers reported for GeoBench, however, GeoVista places open-source AI geolocation much nearer to top commercial systems than earlier open-source results suggested.
Distance Accuracy Tells A More Nuanced Story
GeoVista's city-level performance is strong, but distance-based evaluation shows that commercial systems still hold an edge in some measurements. In that evaluation, 52.83 percent of GeoVista's predictions landed within 3 kilometers (1.86 miles) of the actual location.
Its median deviation was 2.35 kilometers (1.46 miles). Gemini 2.5 Pro performed better on this measure, with 64.45 percent of predictions within 3 kilometers and a median deviation of 800 meters (0.5 miles). GPT-5 reached 55.12 percent with a median deviation of 1.86 kilometers (1.16 miles).
This difference matters because naming the right city and placing a point close to the true location are related but not identical tasks. A model can be useful at one level while still leaving room for improvement at finer geographic precision.
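The two distance figures quoted above, the share of predictions within 3 kilometers and the median deviation, can be computed from coordinate pairs as in the sketch below. It uses the standard haversine great-circle formula; whether the paper uses exactly this distance formula is an assumption, and `haversine_km` and `distance_metrics` are hypothetical names.

```python
import math
from statistics import median
from typing import Dict, List, Tuple

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius

Coord = Tuple[float, float]  # (latitude, longitude) in degrees


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points in kilometers."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def distance_metrics(pairs: List[Tuple[Coord, Coord]],
                     threshold_km: float = 3.0) -> Dict[str, float]:
    """Share of predictions within the threshold, plus the median error."""
    if not pairs:
        raise ValueError("need at least one (predicted, actual) pair")
    errors = [haversine_km(*pred, *actual) for pred, actual in pairs]
    within = sum(e <= threshold_km for e in errors) / len(errors)
    return {"share_within_threshold": within, "median_km": median(errors)}


# Made-up coordinates on the equator, purely to exercise the functions.
pairs = [((0.0, 0.0), (0.0, 0.0)),
         ((0.0, 0.0), (0.0, 0.01)),
         ((0.0, 0.0), (0.0, 1.0))]
metrics = distance_metrics(pairs)
```

Because the median ignores how far off the worst predictions are, a model can post a small median deviation while still missing badly on a minority of images, which is one reason the two numbers are reported side by side.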
Why GeoBench Matters
Alongside GeoVista, the researchers released GeoBench. The dataset contains 1,142 high-resolution images from 66 countries and 108 cities. It includes 512 standard photos, 512 panoramas, and 108 satellite images, all with a resolution of at least one million pixels.
The team designed GeoBench to avoid overly easy or unsuitable examples. They removed non-localizable images, including food close-ups and generic landscapes. They also removed easily recognizable landmarks, arguing that internet images vary widely in how easily they can be located.
GeoBench evaluates models in two ways. First, it checks accuracy step by step at the country, province, and city levels. Second, it measures distance by converting text addresses into coordinates.
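The level-by-level check can be sketched as follows. This is a minimal illustration, not GeoBench's evaluation code: `level_accuracy` is a hypothetical name, matching is done by simple case-insensitive string comparison, and the rule that a finer level counts only when the coarser levels are also correct is an assumption. The address-to-coordinate step is omitted because it depends on a geocoding service the article does not name.

```python
from typing import Dict, List, Sequence

LEVELS = ("country", "province", "city")


def level_accuracy(predictions: Sequence[Dict[str, str]],
                   truths: Sequence[Dict[str, str]]) -> Dict[str, float]:
    """Fraction of images judged correct at each administrative level."""
    if not truths or len(predictions) != len(truths):
        raise ValueError("predictions and truths must be non-empty and aligned")
    hits = {level: 0 for level in LEVELS}
    for pred, truth in zip(predictions, truths):
        for level in LEVELS:
            guess = pred.get(level, "").strip().casefold()
            actual = truth.get(level, "").strip().casefold()
            if not guess or guess != actual:
                break  # assumed rule: finer levels cannot count after a miss
            hits[level] += 1
    return {level: hits[level] / len(truths) for level in LEVELS}


# Illustrative labels only, not taken from the benchmark.
preds: List[Dict[str, str]] = [
    {"country": "Japan", "province": "Tokyo", "city": "Shibuya"},
    {"country": "Japan", "province": "Osaka", "city": "Osaka"},
]
truths: List[Dict[str, str]] = [
    {"country": "Japan", "province": "Tokyo", "city": "Shibuya"},
    {"country": "Japan", "province": "Kyoto", "city": "Kyoto"},
]
accuracy = level_accuracy(preds, truths)
```

Under the assumed rule, accuracy can only stay flat or fall from country to province to city, which matches the pattern in the reported GeoBench numbers.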
Ablation tests showed that both stages of training were needed. Without the supervised learning phase, GeoVista produced overly short answers and failed to use tools effectively. Removing reinforcement learning also caused performance drops, and the tiered reward system proved essential for using multi-level geographic data.
Incorrect tool usage also dropped during reinforcement learning, even though the team did not directly optimize for it. The paper also reports that performance improved as training data volume increased across tests using 1,500, 3,000, 6,000, and 12,000 examples.
The model weights, code, and benchmark are available on the project page. The paper does not address potential misuse, but the broader implication is direct: public photos can expose location information when AI models combine visual clues with external search.