A single image can show what is directly visible, but it cannot reveal what sits behind a corner, around an object, or above the current viewpoint. GenEx, an AI system developed by researchers at Johns Hopkins University, is built around that missing context.
The system generates a fully explorable 3D environment from one photo. Its purpose is not just to make a convincing visual scene. It is meant to give robots and AI agents a way to reason about places they have not physically observed yet.
From One Image To An Explorable Space
GenEx starts with a limited visual input and expands it into a navigable environment. That matters because many real decisions depend on information outside the immediate field of view. An AI agent may see a road, a vehicle, or a nearby obstacle, but still need to infer what lies beyond that visible slice.
The researchers describe this ability as a form of machine imagination. GenEx lets an agent explore possible viewpoints inside a generated 3D space, instead of being locked to the original image. In practical terms, that gives the system a way to ask: what would this scene look like from another angle?
This is especially relevant for robots and AI agents operating in complicated settings. A system that can only react to one image may behave cautiously or miss a hidden risk. A system that can inspect additional imagined views has more context for planning its next move.
Why Virtual Worlds Were Used For Training
The team trained GenEx with virtual environments rather than real-world photo collections. Those environments came from game engines such as Unreal Engine 5 and Unity. The advantage was access to rich and varied training data that could be collected efficiently.
A key part of the training process involved cubemaps. A cubemap represents a 360-degree view across six square faces arranged like a cube. This format helped the researchers capture directions around a point in space and teach the system how viewpoints relate to each other.
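To make the six-face layout concrete, here is a minimal sketch of how a viewing direction maps to a cubemap face and a position on that face. It follows the face-selection convention commonly used in graphics APIs such as OpenGL. The function name and the normalized (u, v) output are illustrative assumptions, not details taken from GenEx.

```python
def direction_to_cubemap_face(x, y, z):
    """Map a 3D direction to (face, u, v), with u and v in [0, 1].

    Illustrative helper (not from GenEx). The face is chosen by the
    axis with the largest absolute component, following the common
    OpenGL-style cubemap convention.
    """
    ax, ay, az = abs(x), abs(y), abs(z)
    if ax == 0 and ay == 0 and az == 0:
        raise ValueError("direction must be non-zero")

    if ax >= ay and ax >= az:
        # Direction points mostly along the x axis.
        face, major = ("+x" if x > 0 else "-x"), ax
        u, v = (-z if x > 0 else z), -y
    elif ay >= az:
        # Direction points mostly along the y axis.
        face, major = ("+y" if y > 0 else "-y"), ay
        u, v = x, (z if y > 0 else -z)
    else:
        # Direction points mostly along the z axis.
        face, major = ("+z" if z > 0 else "-z"), az
        u, v = (x if z > 0 else -x), -y

    # Project onto the face and rescale from [-1, 1] to [0, 1].
    return face, (u / major + 1) / 2, (v / major + 1) / 2
```

For example, direction_to_cubemap_face(1, 0, 0) returns ("+x", 0.5, 0.5), the center of the +x face. Because every direction lands on exactly one face, six square images are enough to cover the full sphere of views around a point.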
The team also gathered predefined exploration paths through the virtual worlds. By scanning movement directions in a systematic way, the dataset supported the kind of viewpoint transitions GenEx needs to produce. The goal was continuity: when the agent moves through the generated environment, the scene should remain visually coherent.
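One way to picture systematic direction scanning is a set of straight paths fanning out from a start point at evenly spaced headings. The sketch below is a hypothetical illustration only: the function name, the eight headings, the one-meter step, and the five-step length are all assumptions, not values reported by the researchers.

```python
import math


def scan_exploration_paths(start, num_headings=8, step_m=1.0, num_steps=5):
    """Generate straight 2D paths that fan out from `start`.

    Hypothetical sketch: headings are evenly spaced around a full
    circle, and each path advances `step_m` meters per step.
    """
    paths = []
    for k in range(num_headings):
        heading = 2 * math.pi * k / num_headings
        dx, dy = math.cos(heading), math.sin(heading)
        # Include the start point, then num_steps positions along the heading.
        path = [
            (start[0] + i * step_m * dx, start[1] + i * step_m * dy)
            for i in range(num_steps + 1)
        ]
        paths.append(path)
    return paths


paths = scan_exploration_paths((0.0, 0.0))
```

Sampling a cubemap at each position along such paths would yield ordered sequences of views, which is the kind of data a model needs to learn how a scene changes as the viewpoint moves.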
According to the researchers, GenEx can keep images stable and coherent while exploring up to 20 meters inside generated environments. Standard quality metrics also showed low error rates, which the researchers present as evidence that the generated views are highly realistic.
What GenEx Can Generate
GenEx is not limited to horizontal movement through a scene. It can also move along the vertical axis to generate bird's-eye views. For an AI agent, that broader perspective can function like an overhead view without requiring a drone.
The system also performs well at generating multi-view videos of objects, according to the researchers. Other open-source models have difficulty with that task, but GenEx is reported to keep backgrounds consistent and lighting realistic across the sequence.
Another major capability is active 3D mapping. As an AI agent explores the generated environment, it can build a three-dimensional map of what it sees. This resembles the mapping performed by autonomous vehicles, with an important distinction: here, the exploration happens inside GenEx's imagined space rather than in the real world.
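As a rough illustration of map building during exploration, the sketch below accumulates observed 3D points into a sparse voxel grid. This is a generic mapping technique chosen for illustration, not the method GenEx uses; the class name, the 0.5-meter voxel size, and the sample points are all assumptions.

```python
import math


class VoxelMap:
    """Sparse occupancy map built up as new observations arrive.

    Illustrative stand-in for a 3D mapping step, not GenEx's method.
    """

    def __init__(self, voxel_size_m=0.5):
        self.voxel_size_m = voxel_size_m
        self.occupied = set()  # integer (i, j, k) voxel indices

    def add_points(self, points):
        """Mark the voxel containing each (x, y, z) point as occupied.

        Returns the number of voxels that were newly added.
        """
        before = len(self.occupied)
        for x, y, z in points:
            index = (
                math.floor(x / self.voxel_size_m),
                math.floor(y / self.voxel_size_m),
                math.floor(z / self.voxel_size_m),
            )
            self.occupied.add(index)
        return len(self.occupied) - before


world_map = VoxelMap()
# Points seen from the starting view, then from a second imagined viewpoint.
world_map.add_points([(0.1, 0.1, 0.1), (0.2, 0.3, 0.4)])
newly_mapped = world_map.add_points([(1.2, 0.0, 0.0), (0.1, 0.1, 0.1)])
```

Each new viewpoint only adds the voxels it reveals for the first time, so the map grows as the agent explores and stays unchanged when it revisits known space.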
Together, these capabilities make GenEx more than an image generator. It is a tool for producing usable spatial context. That context can then support navigation, object inspection, and decision-making.
How Imaginative Exploration Changes Decisions
The most important use case is AI decision-making. The researchers call this approach "Imaginative Exploration," and they demonstrate it with two traffic scenarios where a single image leaves out critical context.
In the first scenario, an AI agent approaches an unmarked intersection and sees a silver car approaching from ahead. With only one image, the agent would stop to stay safe. With GenEx, the agent explores other viewpoints, sees a stop sign facing the other car, and decides to keep driving to avoid holding up traffic.
In the second scenario, an agent is waiting at a red light and must decide whether to make a right turn. The situation includes an approaching car and a crossing pedestrian. By using GenEx to inspect multiple viewpoints, the agent recognizes that it is blocking the line of sight between the car and the pedestrian. Instead of only waiting, it chooses to warn both parties about the possible danger.
These examples show why viewpoint generation can matter. The decision is not improved simply because the image looks better. It improves because the agent can reason over scene structure that was not visible in the starting view.
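The first scenario can be reduced to a toy example in which the decision rule stays the same and only the set of available observations changes. The rule, the function name, and the observation labels below are hypothetical simplifications; in the reported experiments the decisions come from a GPT-4o agent, not from hand-written rules.

```python
def decide_at_intersection(observations):
    """Toy decision rule for the unmarked-intersection scenario.

    Hypothetical simplification: `observations` is a set of string
    labels describing what the agent has seen so far.
    """
    if "oncoming_car" not in observations:
        return "proceed"
    if "stop_sign_facing_oncoming_car" in observations:
        # The other car is required to stop, so continuing is safe.
        return "proceed"
    # An oncoming car with no evidence that it will yield: stay cautious.
    return "stop"


single_view = {"oncoming_car"}
imagined_views = {"stop_sign_facing_oncoming_car"}

decision_without_exploration = decide_at_intersection(single_view)
decision_with_exploration = decide_at_intersection(single_view | imagined_views)
```

With the single view alone the rule returns "stop". Once the imagined viewpoints contribute the stop sign, the same rule returns "proceed". The policy did not change; the evidence did.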
This parallels human imagination. People often infer what is hidden without needing to inspect every angle physically. We can reason that a fire truck may block the whole road, or understand that a stop sign has another side, without walking around it. GenEx aims to provide a similar capability for AI agents.
The Results And The Remaining Gap
The reported decision results show a substantial gap. When a GPT-4o agent used GenEx, it made correct decisions 85% of the time. An agent working from a single image reached 46%.
The difference was larger in multi-agent scenarios. With GenEx, accuracy was 95%. Without it, accuracy was 22%.
Those numbers point to the value of giving an AI agent more than the original viewpoint. The system can test visual possibilities, inspect hidden context, and use that information before acting. In the scenarios described, that led to better outcomes than relying on one image alone.
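The size of each improvement follows directly from the reported figures. The short calculation below uses only the four accuracy values stated above; the dictionary labels and the function name are illustrative.

```python
# Reported accuracy in percent: (with GenEx, without GenEx).
reported = {
    "GPT-4o agent": (85, 46),
    "multi-agent scenarios": (95, 22),
}


def accuracy_gain(with_genex, without_genex):
    """Absolute improvement in percentage points."""
    return with_genex - without_genex


gains = {name: accuracy_gain(*pair) for name, pair in reported.items()}
# gains == {"GPT-4o agent": 39, "multi-agent scenarios": 73}
```

The improvement is 39 percentage points in the first comparison and 73 percentage points in the multi-agent scenarios, nearly twice as large.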
Still, the researchers note important limitations. The gap between imagined environments and real environments remains difficult to close. Future work will need to adapt the system to real-world sensor data and dynamic conditions.
That limitation is central. GenEx shows how generated 3D environments can improve reasoning, but real places change, sensors are imperfect, and traffic scenes involve movement. The system's promise is clear: better decisions can come from better imagined context. The next challenge is making that context reliable outside generated space.