How Naver's Seoul World Model keeps AI cities real

Naver's Seoul World Model generates location-based video by grounding output in real geometry from Naver Map Street View data. The approach helps reduce hallucinated city layouts while still allowing prompts for weather, time of day, and hypothetical scenarios.

WTF Index NEUTRAL

A grounded city video model slightly increases realistic simulation capability, but the story is mainly a technical reliability update rather than a clear harm or dependency signal.

Naver and Naver Cloud are approaching video world models from a different direction: instead of letting an AI invent the unseen parts of a city, the Seoul World Model ties generation to actual Street View data from Naver Map.

The result is a system designed to create videos that follow real urban geometry, while still responding to user prompts and camera movement. It is built around the idea that convincing video is not enough if the streets, buildings, and route stop matching the physical place being simulated.

Why Grounding Matters

Previous video world models can produce scenes that look plausible, but those scenes may not correspond to any real location. Once the camera moves beyond the first frame, streets and buildings that were not visible at the start are generated from the model's learned expectations.

That creates a core problem for location-based video: visual realism can hide spatial fiction. A city may look coherent for a moment, but the model can drift into an environment that was never actually there.

The Seoul World Model, or SWM, tries to limit that failure mode by using real city geometry and appearance as anchors. Users provide geographic coordinates, a desired camera path, and a text prompt. The model then searches 1.2 million panoramic images from Naver Map and retrieves nearby Street View images to guide the video generation step by step.
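
The retrieval step can be pictured with a minimal sketch. The article does not describe how SWM indexes or searches its panoramas, so everything here is an assumption for illustration: the `Panorama` record, the `nearest_panoramas` helper, the 50-meter cutoff, and the brute-force scan (a real index over 1.2 million images would more likely use a spatial data structure).

```python
import math
from dataclasses import dataclass

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in meters


@dataclass(frozen=True)
class Panorama:
    pano_id: str  # hypothetical identifier, not a Naver Map field name
    lat: float    # latitude in degrees
    lon: float    # longitude in degrees


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def nearest_panoramas(index, lat, lon, k=3, max_dist_m=50.0):
    """Return up to k panoramas within max_dist_m of the query, closest first."""
    scored = [(haversine_m(lat, lon, p.lat, p.lon), p) for p in index]
    in_range = [(d, p) for d, p in scored if d <= max_dist_m]
    in_range.sort(key=lambda pair: pair[0])
    return [p for _, p in in_range[:k]]
```

Calling `nearest_panoramas(index, 37.5665, 126.9780)` for each point on the camera path would give the generator a small set of nearby references per step, which is the general shape of the behavior the article describes.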

According to the research paper described in the source article, this is the first world model tied to a real physical location. Naver, often called the "Google of South Korea," operates both the country's dominant search engine and Naver Map, which includes street panoramas similar to those in Google Maps.

The Hard Part: Real Street Data Is Messy

Using real images gives SWM a stronger connection to the physical world, but it also introduces problems that purely synthetic systems do not face. Street View imagery is not clean training footage. It is a collection of snapshots captured at different moments, from different positions, and with temporary objects in the scene.

The researchers identified several challenges:

  • Street View images may include cars, pedestrians, and other objects that were only present at the time of capture.
  • Images are taken every 5 to 20 meters, rather than as continuous video.
  • The camera viewpoint is tied to vehicle-mounted capture, leaving gaps for pedestrian, vehicle, and free-flight perspectives.
  • Small generation errors can accumulate as the model produces longer routes section by section.

To keep the model from treating parked cars or pedestrians as permanent parts of the city, the researchers used "cross-temporal pairing" during training. They paired reference images with target sequences recorded at different times, teaching SWM to separate stable structures such as building facades from transient objects.

In ablation studies, cross-temporal pairing was the most effective component. That matters because the model needs to preserve the city while avoiding the accidental copying of random objects from the source imagery.
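
As a rough illustration of the pairing idea, the sketch below builds (reference, target) training pairs from captures of the same spot taken at different times. It is a simplification under stated assumptions: the `Capture` record, the `location_id` grouping key, and the 30-day minimum gap are all hypothetical, and the target here is a single capture, whereas the article describes target sequences.

```python
from dataclasses import dataclass
from datetime import date
from itertools import permutations


@dataclass(frozen=True)
class Capture:
    image_id: str      # hypothetical identifier
    location_id: str   # hypothetical key grouping captures of the same spot
    captured_on: date  # capture date taken from image metadata


def cross_temporal_pairs(captures, min_gap_days=30):
    """Pair each capture with captures of the same location from a different time.

    Returns (reference, target) tuples whose capture dates differ by at least
    min_gap_days, so transient objects in the reference are unlikely to
    reappear in the target.
    """
    by_location = {}
    for capture in captures:
        by_location.setdefault(capture.location_id, []).append(capture)

    pairs = []
    for group in by_location.values():
        for ref, target in permutations(group, 2):
            gap = abs((ref.captured_on - target.captured_on).days)
            if gap >= min_gap_days:
                pairs.append((ref, target))
    return pairs
```

Because a parked car in the reference is usually gone in a target from another date, a model trained on such pairs gets no reward for copying it, while building facades stay consistent across both.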

How SWM Fills The Gaps

Because Street View does not provide continuous footage, the researchers supplemented the data in two ways. They generated 12,700 synthetic videos in CARLA, an open-source driving simulator built on Unreal Engine, with camera paths covering pedestrian, vehicle, and free-flight perspectives. They also created a pipeline that turns scattered individual images into temporally coherent training videos.

SWM also uses a method called a "virtual lookahead sink" to reduce drift over longer routes. Instead of relying only on the first image as a fixed reference, the model retrieves a Street View image slightly further along the route and uses it as a virtual destination for each new section.

That moving destination gives the model a fresh landmark as the camera advances. The approach is meant to stop small mistakes from compounding across hundreds of meters.
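
The selection logic behind that moving destination can be sketched in a few lines. This is not the paper's implementation: the function name, the 20-meter section length, and the 10-meter lookahead are assumptions, and panorama positions are reduced to distances in meters along the route.

```python
def lookahead_references(pano_positions_m, route_length_m,
                         section_length_m=20.0, lookahead_m=10.0):
    """Plan one reference panorama per generated section.

    For each section, pick the panorama closest to a point lookahead_m beyond
    the section's end, so the reference always sits slightly ahead of the
    camera. Returns a list of (section_start, section_end, reference_position).
    """
    if not pano_positions_m:
        raise ValueError("at least one panorama position is required")

    plan = []
    start = 0.0
    while start < route_length_m:
        end = min(start + section_length_m, route_length_m)
        target = end + lookahead_m
        reference = min(pano_positions_m, key=lambda pos: abs(pos - target))
        plan.append((start, end, reference))
        start = end
    return plan
```

With panoramas at 0, 12, 25, 37, 50, and 62 meters and a 40-meter route, the first section (0 to 20 m) would aim at the panorama at 25 m and the second (20 to 40 m) at the one at 50 m, so each section is generated toward a real image that lies ahead of it rather than behind it.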

The model feeds Street View references into generation through two paths. First, it projects a nearby reference image into the target perspective using depth information, which helps define spatial layout. Second, it encodes reference images into latent representations and uses them as semantic references, allowing the model to recover appearance details from the environment.

The researchers report that quality drops significantly if either path is removed.
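
The first of those two paths, projecting a reference image into the target view using depth, follows a well-known geometric recipe that can be shown for a single pixel. The sketch below assumes a simple pinhole camera looking along +z with shared intrinsics for both views. That is an assumption made for brevity: Street View images are panoramas, and the article does not say which camera model or warping code SWM uses. The function name and parameter layout are likewise hypothetical.

```python
def reproject_pixel(u, v, depth, intrinsics, rotation, translation):
    """Map a reference-view pixel with known depth into the target view.

    intrinsics:  (fx, fy, cx, cy) of an assumed pinhole camera.
    rotation:    3x3 nested list R, and translation: 3-vector t, such that a
                 point X in the reference camera frame becomes R @ X + t in
                 the target camera frame.
    Returns (u_target, v_target, depth_target), or None if the point ends up
    behind the target camera.
    """
    fx, fy, cx, cy = intrinsics

    # Back-project the pixel to a 3D point in the reference camera frame.
    point = ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)

    # Move the point into the target camera frame.
    xt, yt, zt = (
        sum(rotation[row][col] * point[col] for col in range(3)) + translation[row]
        for row in range(3)
    )
    if zt <= 0:
        return None  # not visible from the target viewpoint

    # Project the 3D point onto the target image plane.
    return (fx * xt / zt + cx, fy * yt / zt + cy, zt)
```

Applying this to every pixel of a reference image yields a warped view with holes wherever the target camera sees surfaces the reference did not, which is one plausible reason a second, appearance-focused path is useful alongside the geometric one.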

Training And Benchmark Results

SWM is based on Nvidia's Cosmos-Predict2.5-2B, a diffusion transformer with two billion parameters. Training used 24 Nvidia H100 GPUs, 440,000 Seoul Street View images, the synthetic CARLA data, and publicly available Waymo driving data.

The researchers tested SWM in Seoul, as well as in Busan and the U.S. city of Ann Arbor. Busan and Ann Arbor were completely absent from training, yet the model generalized to those unfamiliar cities without additional training.

On custom benchmarks with 30 test sequences of roughly 100 meters each, SWM outperformed six current video world models, including Aether, DeepVerse, and HY-World1.5. The benchmarks measured visual quality, camera fidelity, temporal consistency, and correspondence with real locations.

The source article notes that existing models tend to drift over longer distances, resulting in blur or full generation collapse. SWM keeps output stable over hundreds of meters while maintaining the underlying city layout.

That does not mean the model ignores prompts. Users can still change weather, time of day, or add hypothetical scenarios. The important distinction is that those changes happen on top of a location structure anchored to the real city.

Limits And Possible Uses

The model still has constraints. Continuous video recordings of entire cities are not freely available, so training depends on interpolated sequences made from individual images. Those sequences are of lower quality than real video footage.

Incorrect timestamps in metadata can also create visible problems, including vehicles that appear or vanish abruptly in generated videos. These limits show that real-world grounding helps, but does not remove the need for cleaner temporal data.

The researchers say all Street View data was processed in compliance with privacy regulations, with faces and license plates anonymized before training. They point to urban planning, autonomous driving, and location-based exploration as potential use cases.

SWM arrives as world models are becoming a heavily researched area in AI. The source article also points to Runway's "General World Model," GWM-1; to Google DeepMind CEO Demis Hassabis's view that such models are a critical step toward artificial general intelligence; and to a recent study by Microsoft Research and several U.S. universities showing that large language models can function as world models.

For Naver's work, the practical message is narrower and concrete: a world model becomes more useful for cities when it is tied to the actual world it is supposed to represent.