How Streetscapes turns city maps into AI street-view video

Streetscapes is an AI system from Stanford University and Google Research that generates realistic street-view videos of entire virtual cities. It uses maps, building height data and camera paths, with diffusion models trained on millions of real street views from Google Street View.

Streetscapes points to a broader shift in generative AI: from producing single images to building continuous, navigable environments. Developed by researchers from Stanford University and Google Research, the system can create realistic street-view sequences that simulate a drive through a virtual city.

The work matters because street-level environments are difficult to generate convincingly. A city scene is not just a collection of buildings. It includes roads, windows, cobblestones, vegetation, light, shadow and the visual continuity that makes movement through a place feel coherent.

What Streetscapes Generates

Streetscapes creates long, continuous video sequences that resemble street-view footage. Instead of generating one isolated frame, it produces a sequence that follows a desired camera path through a virtual city.

The system can also export the generated scenes in 3D as a NeRF (neural radiance field). That makes the output more than a flat video preview: the generated city scene can be viewed from angles other than the one the camera originally followed.

The inputs are structured rather than purely descriptive. Streetscapes receives a street map, a height map of buildings and a camera path. From those elements, it builds a street-level sequence step by step.

This approach gives the model a clear spatial foundation. The map defines the road layout, the height map gives building scale, and the camera path tells the system how the viewer should move through the scene. The generated result then fills in the visual world around that structure.
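To make the three inputs concrete, here is a minimal sketch of how such a layout could be represented and sanity-checked before generation. It is an illustration only: the names SceneLayout, CameraPose and check are hypothetical, and the grid encoding (1 for road, heights in meters) is an assumption, not the format Streetscapes actually uses.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CameraPose:
    """One camera position on the grid (hypothetical encoding)."""
    row: int
    col: int
    heading_deg: float


@dataclass
class SceneLayout:
    """Bundle of the three structured inputs described in the article."""
    street_map: List[List[int]]      # assumed: 1 = road cell, 0 = not road
    height_map: List[List[float]]    # assumed: building height in meters
    camera_path: List[CameraPose]    # ordered poses the camera visits

    def check(self) -> None:
        """Raise ValueError if the inputs do not describe one consistent grid."""
        rows = len(self.street_map)
        cols = len(self.street_map[0]) if rows else 0
        if any(len(r) != cols for r in self.street_map):
            raise ValueError("street_map rows must have equal length")
        if len(self.height_map) != rows or any(len(r) != cols for r in self.height_map):
            raise ValueError("street_map and height_map must share one grid shape")
        for pose in self.camera_path:
            if not (0 <= pose.row < rows and 0 <= pose.col < cols):
                raise ValueError("camera pose lies outside the map")
            if self.street_map[pose.row][pose.col] != 1:
                raise ValueError("camera pose is not on a road cell")


# A toy 3x3 block: one north-south road with buildings on both sides.
layout = SceneLayout(
    street_map=[[0, 1, 0], [0, 1, 0], [0, 1, 0]],
    height_map=[[12.0, 0.0, 8.0], [15.0, 0.0, 8.0], [9.0, 0.0, 20.0]],
    camera_path=[CameraPose(2, 1, 0.0), CameraPose(1, 1, 0.0), CameraPose(0, 1, 0.0)],
)
layout.check()  # passes silently when the layout is consistent
```

The point of the check is the same as the point of the structured inputs: the road layout, building scale and camera movement all refer to one shared piece of geometry.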

How The AI Builds A City Drive

Streetscapes is based on diffusion models, the same broad class of technology widely used in image and video generation. The system was trained on millions of real street views from Google Street View, allowing it to learn the visual patterns common to street-level environments.

That training helps the model generate details that make the scenes look plausible. The created street views include features such as windows, cobblestones and vegetation. Light and shadows are also rendered in a way that supports the realism of the virtual drive.

The system does not simply place objects into a scene at random. Its goal is to generate a believable street-view sequence that remains visually consistent as the camera moves. That requirement is central, because even realistic individual frames can fail if the next frame breaks the illusion.

For that reason, Streetscapes includes a component called a Motion Module. Its role is to support movement and temporal consistency between consecutive images. In practical terms, it helps the generated video feel like a connected journey rather than a slideshow of unrelated street images.

Why Temporal Consistency Is The Core Challenge

Street-view generation is especially demanding because the viewer is moving. Buildings, roads and surfaces need to remain stable from one frame to the next. If a window shifts, a road edge changes shape, or lighting behaves inconsistently, the sequence quickly feels artificial.

Streetscapes addresses this with a technique called Temporal Imputation. With this technique, each new image is generated while taking previous images into account. That gives the system more context as it extends the sequence forward.
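The general idea of generating each frame with earlier frames as context can be sketched as a simple loop. This is not the Streetscapes implementation: the function names, the context_size parameter and the string stand-in for an image are all assumptions made for illustration, and the actual diffusion step is replaced by a toy function.

```python
from typing import Callable, List, Sequence

Frame = str  # stand-in for an image; a real system would use pixel arrays


def generate_sequence(
    num_frames: int,
    context_size: int,
    generate_next: Callable[[int, Sequence[Frame]], Frame],
) -> List[Frame]:
    """Extend a sequence one frame at a time, passing recent frames as context."""
    frames: List[Frame] = []
    for index in range(num_frames):
        # Only the most recent frames are handed to the generator.
        context = frames[-context_size:] if context_size > 0 else []
        frames.append(generate_next(index, context))
    return frames


def toy_generator(index: int, context: Sequence[Frame]) -> Frame:
    """Hypothetical placeholder for a model call; records how much context it saw."""
    return f"frame{index}|ctx={len(context)}"


sequence = generate_sequence(num_frames=4, context_size=2, generate_next=toy_generator)
print(sequence)
# ['frame0|ctx=0', 'frame1|ctx=1', 'frame2|ctx=2', 'frame3|ctx=2']
```

The first frame has nothing to condition on, and every later frame sees a window of what came before. That window is what lets a generator keep a window or road edge in place as the sequence grows.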

As a result, Streetscapes reportedly produces longer videos than the alternative approaches it is compared with: up to 100 frames, with camera movement covering more than 170 meters.

Those numbers are important because they show that the system is not limited to a very short visual sample. A longer sequence gives the model more room to demonstrate whether the scene holds together over distance and motion.
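A quick back-of-the-envelope calculation shows what those figures imply per frame. It assumes the frames are evenly spaced along the path and treats 170 meters as a lower bound, neither of which the article states; the function name is made up for this example.

```python
def average_step_m(distance_m: float, num_frames: int) -> float:
    """Average camera travel between consecutive frames, assuming even spacing."""
    if num_frames < 2:
        raise ValueError("need at least two frames to measure a step")
    # N frames are separated by N - 1 gaps.
    return distance_m / (num_frames - 1)


step = average_step_m(distance_m=170.0, num_frames=100)
print(round(step, 2))  # 1.72
```

Under those assumptions the camera moves roughly 1.7 meters between frames, so each new image has to stay consistent with its predecessor despite a noticeable change in viewpoint.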

The source also notes a limitation: Streetscapes uses an architecture that has since been surpassed by other video generation models like OpenAI's Sora. However, the team says the underlying diffusion model is easily interchangeable, which means future versions could benefit from stronger video-generation foundations.

Text Prompts Add Creative Control

Beyond realistic city simulation, Streetscapes can be steered with text prompts. The appearance of the generated city can change based on text descriptions, including different times of day or weather conditions.

The system can also mix layouts and visual styles. One example given in the source is visualizing Parisian streets in the style of New York City. That points to a creative use case where the structure of one place can be combined with the architectural character of another.

This kind of control could make Streetscapes useful for exploring alternate versions of an urban environment. The system is not only generating static street scenes; it is generating a controllable visual world built from maps, building data and motion through space.

What Comes Next For Streetscapes

The research team presents Streetscapes as a step toward AI systems that can generate entire, unlimited scenes, not only individual objects. That is the larger technical direction behind the project: moving from isolated generated content toward coherent environments that can extend over time and distance.

The team plans to improve control over moving objects like cars. That is a logical next focus because moving objects add another layer of complexity to an already difficult task. A static building can remain consistent across frames, but traffic and other moving elements must behave plausibly while the camera also moves.

The researchers also want to further increase consistency between consecutive images. That goal reinforces the main challenge of city-scale video generation: realism depends not only on how each frame looks, but on how well the world persists from one moment to the next.

Streetscapes shows how structured city inputs and generative video models can work together. By combining maps, building height data, camera paths, diffusion models and temporal techniques, it turns the idea of an AI-generated street view into a continuous virtual drive.