Nvidia DreamDojo points to a different way of training robots: build a model of what could happen next, then let a robot operate inside that learned future before taking action in the real world.
The system, released by Nvidia's AI research team, is described as an open-source, interactive world model for robotics. Jim Fan, Director of AI and Distinguished Scientist at Nvidia, calls it "Simulation 2.0."
A world model instead of a hand-built simulator
Traditional robot training can depend on structured simulation assets and carefully specified behavior. DreamDojo takes another route. It receives robot motor controls and generates a simulated future directly in pixels.
According to the source, that process does not require an engine, meshes, or hand-authored dynamics. In plain terms, DreamDojo is not presented as a conventional simulator with every object and rule manually prepared in advance. It is a model that learns to predict visual outcomes.
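To make the idea concrete, the loop below is a minimal sketch of an action-conditioned rollout: a predictor takes the current frame and a motor command, returns the next frame, and each predicted frame is fed back in as the next input. Everything here is an assumption made for illustration. The names `toy_world_model` and `rollout`, the tiny grayscale frames, and the two-number action are hypothetical, and the hand-written shifting rule stands in for what would be a learned neural network. It is not DreamDojo's interface.

```python
from typing import Callable, List, Sequence

Frame = List[List[int]]   # a tiny grayscale image: rows of pixel values
Action = Sequence[int]    # hypothetical motor command: (d_row, d_col)


def toy_world_model(frame: Frame, action: Action) -> Frame:
    """Stand-in for a learned predictor: move the one bright pixel by the action.

    A real world model would be a neural network. This hand-written rule only
    exists so the rollout loop below has something to call.
    """
    height, width = len(frame), len(frame[0])
    # Locate the bright pixel (the "object" in this toy scene).
    row, col = next(
        (r, c) for r in range(height) for c in range(width) if frame[r][c]
    )
    # Clamp the new position so the object stays inside the image.
    new_row = min(max(row + action[0], 0), height - 1)
    new_col = min(max(col + action[1], 0), width - 1)
    next_frame = [[0] * width for _ in range(height)]
    next_frame[new_row][new_col] = 255
    return next_frame


def rollout(
    model: Callable[[Frame, Action], Frame],
    first_frame: Frame,
    actions: Sequence[Action],
) -> List[Frame]:
    """Autoregressive rollout: each predicted frame becomes the next input."""
    frames = [first_frame]
    for action in actions:
        frames.append(model(frames[-1], action))
    return frames


start = [[0] * 4 for _ in range(4)]
start[0][0] = 255
frames = rollout(toy_world_model, start, [(1, 0), (0, 1), (1, 1)])
print(len(frames))  # 4: the first frame plus one prediction per action
```

Nothing in the loop refers to meshes, an engine, or object definitions. The only inputs are pixels and actions, which is the property the article attributes to DreamDojo.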
That distinction matters because robotics work often runs into practical limits. Fan explains that real-world robot learning is bottlenecked by time, wear, safety, and resets. Each real attempt can carry cost, delay, or risk, especially when a robot must physically interact with the environment.
DreamDojo is designed to reduce that pressure by moving more learning into a neural simulator. The robot can evaluate behavior in a generated future, while the model handles the visual and physical consequences implied by its training.
Human video comes first
One of the central ideas behind DreamDojo is that it starts with human video rather than robot data. The model is pre-trained on 44,000 hours of first-person human video, with no robot in the loop.
This is possible because DreamDojo uses so-called latent actions. These are described as a unified representation inferred directly from videos. They capture what changed between world states without requiring knowledge of the hardware that produced the motion.
The practical effect is that first-person video can be treated as if it came with motor commands attached. The model does not need the source video to come from the same robot, or from a robot at all, in order to learn patterns about how scenes change over time.
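A toy version of that labeling step might look like the sketch below. It infers an "action" from nothing but two consecutive frames, then pairs every frame transition in a clip with its inferred action. The centroid-difference rule is a hand-written stand-in chosen for readability. The article describes DreamDojo's latent actions as a representation inferred from video, not a fixed formula like this one, and the function names (`infer_latent_action`, `label_video`) are hypothetical.

```python
from typing import List, Tuple

Frame = List[List[int]]
LatentAction = Tuple[float, float]


def make_frame(row: int, col: int, size: int = 3) -> Frame:
    """Build a toy frame with a single bright pixel at (row, col)."""
    frame = [[0] * size for _ in range(size)]
    frame[row][col] = 255
    return frame


def centroid(frame: Frame) -> Tuple[float, float]:
    """Brightness-weighted center of the frame, as (row, col)."""
    total = sum(sum(row) for row in frame)
    row_center = sum(r * sum(row) for r, row in enumerate(frame)) / total
    col_center = sum(
        c * value for row in frame for c, value in enumerate(row)
    ) / total
    return row_center, col_center


def infer_latent_action(before: Frame, after: Frame) -> LatentAction:
    """Describe what changed between two frames, with no hardware knowledge."""
    r0, c0 = centroid(before)
    r1, c1 = centroid(after)
    return (r1 - r0, c1 - c0)


def label_video(frames: List[Frame]) -> List[Tuple[Frame, LatentAction, Frame]]:
    """Turn an unlabeled clip into (frame, inferred action, next frame) triples."""
    return [
        (frames[i], infer_latent_action(frames[i], frames[i + 1]), frames[i + 1])
        for i in range(len(frames) - 1)
    ]


video = [make_frame(0, 0), make_frame(0, 1), make_frame(1, 1)]
labels = [action for _, action, _ in label_video(video)]
print(labels)  # [(0.0, 1.0), (1.0, 0.0)]
```

The triples have the same shape as action-labeled robot data, even though no motor command was ever recorded. That is the sense in which video can be treated as if commands were attached.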
For robot training, that shifts the focus. Instead of asking a robot to generate all the experience needed from scratch, DreamDojo first learns broad regularities from human footage. The robot-specific step comes later.
Post-training adapts the model to the robot
After pre-training, DreamDojo is post-trained on a specific robot so the model can fit that robot's hardware. Fan describes this as separating "how the world looks and behaves" from "how this particular robot actuates."
That separation is the core design logic. The base model learns general physical rules from video. Then it snaps onto the robot's unique mechanics through post-training.
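One simplified way to picture that separation is a small adapter that translates a robot's own motor commands into whatever action representation the base model already understands. The sketch below fits such an adapter with ordinary least squares on a handful of made-up (command, latent action) pairs. This is an assumption for illustration only: the source says the model is post-trained on the robot but does not describe the mechanism, and the numbers, the one-dimensional linear mapping, and the name `fit_linear_adapter` are all hypothetical.

```python
from typing import Sequence, Tuple


def fit_linear_adapter(
    commands: Sequence[float], latents: Sequence[float]
) -> Tuple[float, float]:
    """Least-squares fit of latent = gain * command + bias."""
    n = len(commands)
    mean_x = sum(commands) / n
    mean_y = sum(latents) / n
    var_x = sum((x - mean_x) ** 2 for x in commands)
    cov_xy = sum(
        (x - mean_x) * (y - mean_y) for x, y in zip(commands, latents)
    )
    gain = cov_xy / var_x
    bias = mean_y - gain * mean_x
    return gain, bias


# Made-up robot data: the motor command sent, and the latent action that was
# inferred from the resulting video of the robot moving.
robot_commands = [0.0, 1.0, 2.0, 3.0]
observed_latents = [0.5, 2.5, 4.5, 6.5]

gain, bias = fit_linear_adapter(robot_commands, observed_latents)
print(gain, bias)  # 2.0 0.5


def command_to_latent(command: float) -> float:
    """Translate this robot's command into the base model's action space."""
    return gain * command + bias
```

In this picture the general model stays responsible for how the world responds, and only the small robot-specific piece has to be learned from robot data.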
The result is a robotics world model that can combine broad visual-world understanding with the control realities of one machine. DreamDojo does not treat every robot as the same. It builds a foundation first, then adapts that foundation to the hardware.
The source also notes several supported uses inside the world model:
- Live VR teleoperation inside the dream.
- Policy evaluation in the neural simulator.
- Model-based planning.
These uses all point in the same direction: DreamDojo is not only a passive prediction system. It is meant to be interactive, with robot behavior tested and planned inside the generated environment.
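Two of those uses, policy evaluation and model-based planning, can be sketched with a deliberately tiny example. The planner below tries every short action sequence inside a model, scores each imagined trajectory, and keeps the best one. The evaluator runs a fixed policy inside the same model and reports how close it ends to the goal. The one-dimensional "world", the brute-force search, and every name here are hypothetical simplifications. The source does not say which planning method DreamDojo is used with.

```python
from itertools import product
from typing import Callable, Sequence, Tuple

Model = Callable[[int, int], int]


def toy_world_model(position: int, action: int) -> int:
    """Stand-in for a learned model: predicts the next position on a line."""
    return position + action


def rollout_cost(model: Model, start: int, actions: Sequence[int], goal: int) -> int:
    """Sum of distances to the goal along the imagined trajectory."""
    position, cost = start, 0
    for action in actions:
        position = model(position, action)
        cost += abs(goal - position)
    return cost


def plan(
    model: Model,
    start: int,
    goal: int,
    horizon: int = 4,
    choices: Tuple[int, ...] = (-1, 0, 1),
) -> Tuple[int, ...]:
    """Model-based planning by brute force over all short action sequences."""
    return min(
        product(choices, repeat=horizon),
        key=lambda seq: rollout_cost(model, start, seq, goal),
    )


def greedy_policy(position: int, goal: int) -> int:
    """A simple policy to evaluate: always step toward the goal."""
    return (goal > position) - (goal < position)


def evaluate_policy(
    policy: Callable[[int, int], int], model: Model, start: int, goal: int, steps: int
) -> int:
    """Policy evaluation inside the model: final distance to the goal."""
    position = start
    for _ in range(steps):
        position = model(position, policy(position, goal))
    return abs(goal - position)


print(plan(toy_world_model, start=0, goal=3))  # (1, 1, 1, 0)
print(evaluate_policy(greedy_policy, toy_world_model, start=0, goal=3, steps=4))  # 0
```

Neither function ever touches a real robot. Every candidate is tried in the model, which is the point of moving evaluation and planning into a learned simulator.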
Real-time operation and open availability
A real-time version of DreamDojo runs at 10 frames per second. The source says it remains stable for over a minute of continuous rollout.
That detail is important because a world model for robotics must do more than create a single plausible frame. It has to keep generating a coherent sequence as actions continue. Stability over a continuous rollout is central to using the model for teleoperation, evaluation, and planning.
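The two reported figures translate into simple budgets. The arithmetic below uses 60 seconds as a lower bound for "over a minute"; the variable names are illustrative.

```python
FPS = 10                 # reported real-time frame rate
ROLLOUT_SECONDS = 60     # lower bound for "over a minute"

# Time available to generate each frame while keeping up with real time.
frame_budget_ms = 1000 / FPS

# Consecutive predictions the model must chain together without drifting.
frames_in_rollout = FPS * ROLLOUT_SECONDS

print(frame_budget_ms)    # 100.0
print(frames_in_rollout)  # 600
```

In other words, a stable one-minute rollout means at least 600 predictions in a row, each produced in roughly a tenth of a second and each built on the frames the model generated before it.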
Fan also says the weights, code, post-training dataset, evaluation set, and whitepaper are all openly available. DreamDojo is built on Nvidia Cosmos, which is also open-weight.
For the robotics community, that open-source framing is part of the story. The release includes not only a model description, but also the materials needed for others to examine, test, and build on the work, according to Fan.
What DreamDojo changes in the training loop
The most important shift is conceptual. DreamDojo treats robot training as a problem that can begin with human video, continue through a learned world model, and then specialize to a particular robot.
That does not remove the need to understand hardware. The post-training step exists precisely because a specific robot has specific mechanics. But it changes where the broad learning happens.
Instead of relying only on real robot operation, DreamDojo learns first from first-person human video. Instead of requiring a traditional simulation stack with engines, meshes, and hand-authored dynamics, it generates future pixels from actions.
For Nvidia's AI research team, this is the basis for an open-source robotics world model. For robot training, the larger implication is that DreamDojo tries to work around real-world learning limits by letting robots train, evaluate, and plan inside a learned simulation of the future.