Ars Technica AI August 28, 2024

How GameNGen turns Doom into a real-time AI simulation

GameNGen, from researchers at Google and Tel Aviv University, uses AI image generation techniques to simulate the 1993 game Doom in real time. It points toward a possible future for neural game engines, but the current system remains limited, glitch-prone and focused on one existing game.

GameNGen is a research system that asks a direct question about the future of game engines: what if a playable world could be generated frame by frame by an AI model instead of rendered through conventional software rules?

Researchers from Google and Tel Aviv University unveiled the model on Tuesday. Their work uses AI image generation techniques associated with Stable Diffusion to interactively simulate the classic 1993 first-person shooter Doom in real time.

A game engine that predicts pixels

Traditional video games are built from code that defines rules, assets, physics, rendering behavior and player interaction. GameNGen approaches the problem differently. It functions as a limited neural network game engine, generating the next visual frame as a prediction task while responding to player input.

The result is not simply a video clip. The system produces new Doom gameplay frames interactively, reportedly at over 20 frames per second on a single tensor processing unit, or TPU. The source describes a TPU as a specialized processor similar to a GPU, optimized for machine learning tasks.

That is why the project has attracted attention beyond Doom itself. If a model can generate a playable version of an existing game in real time, it suggests a possible path toward future systems where game graphics are synthesized by AI as the player acts.

App developer Nick Dobos framed the idea bluntly in reaction to the news: “Why write complex rules for software by hand when the AI can just think every pixel for you?”

How GameNGen was trained

The technical foundation is described in a preprint research paper titled “Diffusion Models Are Real-Time Game Engines.” The listed authors are Dani Valevski, Yaniv Leviathan, Moab Arar, and Shlomi Fruchter.

GameNGen uses a modified version of Stable Diffusion 1.4, an image synthesis diffusion model released in 2022. Instead of producing standalone images, the model predicts the next game state from earlier game states while being guided by player input.
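The prediction loop described above can be illustrated with a minimal sketch. Everything here is a toy assumption rather than the researchers' code: `rollout`, `toy_model`, the list-of-integers frame format, and the `context_len` value are hypothetical names and choices, and `toy_model` merely stands in for the diffusion model so the loop can run.

```python
from collections import deque
from typing import Callable, Deque, List, Sequence

Frame = List[int]  # toy stand-in for an image: a flat list of pixel values


def rollout(
    predict_next: Callable[[Sequence[Frame], Sequence[int]], Frame],
    first_frame: Frame,
    actions: Sequence[int],
    context_len: int = 4,  # hypothetical window size, chosen only for the demo
) -> List[Frame]:
    """Generate one frame per player action, feeding each output back as input."""
    frames: Deque[Frame] = deque([first_frame], maxlen=context_len)
    past_actions: Deque[int] = deque(maxlen=context_len)
    generated: List[Frame] = []
    for action in actions:
        past_actions.append(action)
        frame = predict_next(list(frames), list(past_actions))
        frames.append(frame)  # the model's own output becomes future context
        generated.append(frame)
    return generated


def toy_model(frames: Sequence[Frame], actions: Sequence[int]) -> Frame:
    """Hypothetical stand-in for the model: shifts pixels by the latest action."""
    return [(pixel + actions[-1]) % 256 for pixel in frames[-1]]


if __name__ == "__main__":
    print(rollout(toy_model, first_frame=[0, 10, 20], actions=[1, 1, 2]))
    # prints [[1, 11, 21], [2, 12, 22], [4, 14, 24]]
```

The point of the sketch is the data flow: each new frame depends only on a bounded window of earlier frames plus the player's inputs, and every generated frame is fed back in as context for the next one.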

The training process had two main phases:

1. The researchers first trained a reinforcement learning agent to play Doom and recorded its gameplay sessions.
2. They then used that automatically generated footage to train the custom Stable Diffusion model.

In other words, GameNGen learned from Doom in action. It was not hand-coded as a conventional Doom engine. It was trained to produce plausible next frames based on what had already happened and what the player was doing.
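One way to picture how recorded gameplay becomes training data is to slice each recorded session into examples of the form "past frames and actions in, next frame out." The sketch below is an illustration under stated assumptions, not the paper's pipeline: `make_training_examples`, the string frame labels, and the `context_len` parameter are all hypothetical.

```python
from typing import List, Sequence, Tuple

Frame = str               # toy stand-in for an image, e.g. "f0"
Step = Tuple[Frame, int]  # (frame the agent saw, action it then took)
Example = Tuple[List[Frame], List[int], Frame]


def make_training_examples(episode: Sequence[Step], context_len: int) -> List[Example]:
    """Slice one recorded episode into (past frames, past actions, next frame)."""
    examples: List[Example] = []
    for t in range(context_len, len(episode)):
        window = episode[t - context_len:t]
        past_frames = [frame for frame, _ in window]
        past_actions = [action for _, action in window]
        target_frame = episode[t][0]  # the frame the model should learn to predict
        examples.append((past_frames, past_actions, target_frame))
    return examples


if __name__ == "__main__":
    # Hypothetical recording from the agent: four frames with the action taken at each.
    recorded = [("f0", 1), ("f1", 0), ("f2", 2), ("f3", 1)]
    for example in make_training_examples(recorded, context_len=2):
        print(example)
```

Because the agent generates the footage automatically, no human has to label anything: the "correct answer" for each example is simply the frame the real game produced next.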

Stability AI Research Director Tanishq Mathew Abraham, who was not involved with the project, reacted to the research by writing: “Turns out the answer to ‘can it run DOOM?’ is yes for diffusion models.”

Human raters sometimes struggled to tell the difference

The researchers tested how convincing the generated output looked in short clips. According to the source, ten human raters viewed 1.6-second and 3.2-second clips of actual Doom footage and GameNGen output.

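Those raters sometimes failed to distinguish the real game footage from the AI-generated clips. They identified the true gameplay footage 58 percent of the time for the 1.6-second clips and 60 percent of the time for the 3.2-second clips, only modestly better than the 50 percent expected from guessing.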

Those results do not mean the system is indistinguishable from the original game in every circumstance. The clips were short, and the system still has important weaknesses. But the test shows that, at least in brief windows, the generated frames can be close enough to create uncertainty for human viewers.

The broader idea is often described in the source as real-time video game synthesis or neural rendering. The article also places GameNGen alongside earlier work, including World Models in 2018, GameGAN in 2020, Google’s Genie in March 2024, and DIAMOND, a diffusion-based system introduced earlier in 2024 that was trained to simulate vintage Atari video games.

The limits are as important as the breakthrough

GameNGen is not a general-purpose replacement for game development. The most obvious limitation is that the research focused on one existing game. The source also notes that Stable Diffusion is best at imitation and plausible outputs, not true novelty.

The model has another major constraint: it has access to only three seconds of history. That matters because a game world is not only what appears on screen right now. If a player returns to a previously seen part of a Doom level, GameNGen would have to make probabilistic guesses about that earlier state without actual knowledge of the full past.
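A rough back-of-the-envelope sketch shows what a three-second window means in frames. The frame rate of exactly 20 is an assumption (the article says only "over 20 frames per second"), and `context_frames` and `remembered_frames` are hypothetical helper names used for illustration.

```python
from collections import deque
from typing import List

FPS = 20             # assumption: the article says "over 20 frames per second"
HISTORY_SECONDS = 3  # from the article: "three seconds of history"


def context_frames(fps: int = FPS, seconds: int = HISTORY_SECONDS) -> int:
    """Number of past frames that fit in the history window."""
    return fps * seconds


def remembered_frames(frames_played: int, window: int) -> List[int]:
    """Indices of the frames still in the window after `frames_played` frames."""
    history = deque(maxlen=window)  # oldest entries fall off automatically
    for index in range(frames_played):
        history.append(index)
    return list(history)


if __name__ == "__main__":
    window = context_frames()
    kept = remembered_frames(frames_played=200, window=window)
    print(window, kept[0], kept[-1])  # prints 60 140 199
```

Under these assumptions, ten seconds into a session (200 frames) the model can see only frames 140 through 199. A room the player left earlier than that is no longer in the input at all, so its appearance on return has to be guessed.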

The source compares this to the kind of confabulation or hallucination seen in other generative AI outputs. For an interactive game, that issue can become more visible because the player expects a consistent world that remembers what happened.

There are also graphical problems. The researchers note that Stable Diffusion v1.4’s pre-trained auto-encoder compresses 8×8 pixel patches into 4 latent channels, creating artifacts that affect small details and especially the bottom bar HUD.
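The compression figure can be checked with simple arithmetic. The sketch below assumes three-channel RGB input and, purely as an example, a 320×240 frame; neither is stated in the article, and `latent_shape` and `compression_ratio` are hypothetical helper names.

```python
from typing import Tuple


def latent_shape(width: int, height: int, patch: int = 8,
                 latent_channels: int = 4) -> Tuple[int, int, int]:
    """Latent grid (rows, columns, channels) for a frame split into patch x patch tiles."""
    if width % patch or height % patch:
        raise ValueError("frame size must be a multiple of the patch size")
    return (height // patch, width // patch, latent_channels)


def compression_ratio(patch: int = 8, rgb_channels: int = 3,
                      latent_channels: int = 4) -> float:
    """How many input values each latent value has to summarize."""
    values_per_patch = patch * patch * rgb_channels  # 8 * 8 * 3 = 192
    return values_per_patch / latent_channels


if __name__ == "__main__":
    print(latent_shape(320, 240))  # assumed frame size, for illustration only
    print(compression_ratio())     # prints 48.0
```

With 192 color values squeezed into 4 numbers per patch, fine detail such as the small digits on a status bar has very little room to survive, which is consistent with the HUD artifacts the researchers describe.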

Maintaining consistency over time is another challenge. The paper says “interactive world simulation is more than just very fast video generation,” because the model must respond to a stream of input actions available only during generation. Repeatedly generating new frames from previous frames can allow small errors to build up, making the world degrade or become nonsensical over time.

To address that, the researchers added varying levels of random noise to the training data and trained the model to correct it. That helped preserve generated world quality over longer durations.
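The idea of corrupting training inputs can be sketched in a few lines. This is an illustration, not the paper's implementation: Gaussian noise, the `max_sigma` value, the nested-list frame format, and the `corrupt_context` name are all assumptions made for the example.

```python
import random
from typing import List, Optional, Tuple

Frame = List[float]  # toy stand-in for an image: a flat list of pixel values


def corrupt_context(
    frames: List[Frame],
    max_sigma: float = 0.7,  # arbitrary illustrative ceiling on the noise level
    rng: Optional[random.Random] = None,
) -> Tuple[List[Frame], float]:
    """Add Gaussian noise of a randomly chosen strength to the context frames.

    Returns the noisy frames and the noise level that was used, so a training
    loop could, if desired, also tell the model how corrupted its input is.
    """
    if rng is None:
        rng = random.Random()
    sigma = rng.uniform(0.0, max_sigma)  # a different level for each example
    noisy = [[pixel + rng.gauss(0.0, sigma) for pixel in frame] for frame in frames]
    return noisy, sigma


if __name__ == "__main__":
    clean_context = [[0.0, 0.5], [1.0, 0.25]]
    noisy_context, level = corrupt_context(clean_context, rng=random.Random(0))
    # Training pairs the noisy context with the clean next frame as the target,
    # so the model practices recovering from imperfect inputs.
    print(level, noisy_context)
```

The design intuition is that during play the model's inputs are its own slightly flawed outputs. Training on deliberately degraded context with a clean target teaches it to pull the picture back toward something valid instead of compounding the error.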

What this could mean for games

For now, GameNGen is best understood as a proof of concept. It shows that a diffusion model can simulate Doom in real time under specific conditions, not that future games are ready to abandon traditional engines.

The research still points toward a significant possibility. If neural game engines improve, games and simulations might eventually be generated from examples or descriptions instead of being programmed in the usual way. The researchers speculate that new video games might be created “via textual descriptions or examples images,” and that still images could potentially become the basis for a playable level or character in an existing game.

The source is careful to frame that future as speculation. Scaling the approach to more complex environments or different genres would bring new challenges, and the computing requirements could be difficult in the short term.

Still, GameNGen captures a real shift in how researchers are thinking about interactive media. The paper puts the idea this way: “Today, video games are programmed by humans,” and “GameNGen is a proof-of-concept for one part of a new paradigm where games are weights of a neural model, not lines of code.”