maketitle thanks aketitle

Introduction

Computer games are manually crafted software systems centered around the following game loop: (1) gather user inputs, (2) update the game state, and (3) render it to screen pixels. This game loop, running at high frame rates, creates the illusion of an interactive virtual world for the player. Such game loops are classically run on standard computers, and while there have been many amazing attempts at running games on bespoke hardware (e.g. the iconic game DOOM has been run on kitchen appliances such as a toaster and a microwave, a treadmill, a camera, an iPod, and within the game of Minecraft, to name just a few examples^[1]), in all of these cases the hardware is still emulating the manually written game software as-is. Furthermore, while vastly different game engines exist, the game state updates and rendering logic in all are composed of a set of manual rules, programmed or configured by hand.

In recent years, generative models made significant progress in producing images and videos conditioned on multi-modal inputs, such as text or images. At the forefront of this wave, diffusion models became the de-facto standard in media (i.e. non-language) generation, with works like Dall-E , Stable Diffusion and Sora . At a glance, simulating the interactive worlds of video games may seem similar to video generation. However, interactive world simulation is more than just very fast video generation. The requirement to condition on a stream of input actions that is only available throughout the generation breaks some assumptions of existing diffusion model architectures. Notably, it requires generating frames autoregressively which tends to be unstable and leads to sampling divergence (see section 3.2.1).

Several important works (see Section 6) simulate interactive video games with neural models. Nevertheless, most of these approaches are limited in respect to the complexity of the simulated games, simulation speed, stability over long time periods, or visual quality (see Figure 2). It is therefore natural to ask:

Can a neural model running in real-time simulate a complex game at high quality?

In this work we demonstrate that the answer is yes. Specifically, we show that a complex video game, the iconic game DOOM, can be run on a neural network (an augmented version of the open Stable Diffusion v1.4 ), in real-time, while achieving a visual quality comparable to that of the original game. While not an exact simulation, the neural model is able to perform complex game state updates, such as tallying health and ammo, attacking enemies, damaging objects, opening doors, and persist the game state over long trajectories.

GameNGen answers one of the important questions on the road towards a new paradigm for game engines, one where games are automatically generated, similarly to how images and videos are generated by neural models in recent years. Key questions remain, such as how these neural game engines would be trained and how games would be effectively created in the first place, including how to best leverage human inputs. We are nevertheless extremely excited for the possibilities of this new paradigm.

File:Figures/compare to sota.pdf

Interactive World Simulation

An Interactive Environment ${\textstyle {\mathcal {E}}}$ consists of a space of latent states ${\textstyle {\mathcal {S}}}$ , a space of partial projections of the latent space ${\textstyle {\mathcal {O}}}$ , a partial projection function ${\textstyle V:{\mathcal {S}}\to {\mathcal {O}}}$ , a set of actions ${\textstyle {\mathcal {A}}}$ , and a transition probability function ${\textstyle p(s|a,s')}$ such that ${\textstyle s,s'\in {\mathcal {S}},a\in {\mathcal {A}}}$ .

For example, in the case of the game DOOM, ${\textstyle {\mathcal {S}}}$ is the program’s dynamic memory contents, ${\textstyle {\mathcal {O}}}$ is the rendered screen pixels, ${\textstyle V}$ is the game’s rendering logic, ${\textstyle {\mathcal {A}}}$ is the set of key presses and mouse movements, and ${\textstyle p}$ is the program’s logic given the player’s input (including any potential non-determinism).

Given an input interactive environment ${\textstyle {\mathcal {E}}}$ , and an initial state ${\textstyle s_{0}\in {\mathcal {S}}}$ , an Interactive World Simulation is a simulation distribution function ${\textstyle q(o_{n}|o_{<n},a_{\leq n}),o_{i}\in {\mathcal {O}},a_{i}\in {\mathcal {A}}}$ . Given a distance metric between observations ${\textstyle D:{\mathcal {O}}\times {\mathcal {O}}\to \mathbb {R} }$ , a policy, i.e. a distribution on agent actions given past actions and observations ${\textstyle \pi (a_{n}|o_{<n},a{<n})}$ , a distribution ${\textstyle S_{0}}$ on initial states, and a distribution ${\textstyle N_{0}}$ on episode lengths, the Interactive World Simulation objective consists of minimizing ${\textstyle E(D(o_{q}^{i},o_{p}^{i}))}$ where ${\textstyle n\sim N_{0}}$ , ${\textstyle 0\leq i\leq n}$ , and ${\textstyle o_{q}^{i}\sim q,o_{p}^{i}\sim V(p)}$ are sampled observations from the environment and the simulation when enacting the agent’s policy ${\textstyle \pi }$ . Importantly, the conditioning actions for these samples are always obtained by the agent interacting with the environment ${\textstyle {\mathcal {E}}}$ , while the conditioning observations can either be obtained from ${\textstyle {\mathcal {E}}}$ (the teacher forcing objective) or from the simulation (the auto-regressive objective).

We always train our generative model with the teacher forcing objective. Given a simulation distribution function ${\textstyle q}$ , the environment ${\textstyle {\mathcal {E}}}$ can be simulated by auto-regressively sampling observations.

GameNGen

GameNGen (pronounced “game engine”) is a generative diffusion model that learns to simulate the game under the settings of Section 2. In order to collect training data for this model, with the teacher forcing objective, we first train a separate model to interact with the environment. The two models (agent and generative) are trained in sequence. The entirety of the agent’s actions and observations corpus ${\textstyle {\mathcal {T}}_{agent}}$ during training is maintained and becomes the training dataset for the generative model in a second stage. See Figure 3.

File:Figures/Architecture 08 27 b.pdf

Data Collection via Agent Play

Our end goal is to have human players interact with our simulation. To that end, the policy ${\textstyle \pi }$ as in Section 2 is that of human gameplay. Since we cannot sample from that directly at scale, we start by approximating it via teaching an automatic agent to play. Unlike a typical RL setup which attempts to maximize game score, our goal is to generate training data which resembles human play, or at least contains enough diverse examples, in a variety of scenarios, to maximize training data efficiency. To that end, we design a simple reward function, which is the only part of our method that is environment-specific (see Appendix 8.3).

We record the agent’s training trajectories throughout the entire training process, which includes different skill levels of play. This set of recorded trajectories is our ${\textstyle {\mathcal {T}}_{agent}}$ dataset, used for training the generative model (see Section 3.2).

Training the Generative Diffusion Model

We now train a generative diffusion model conditioned on the agent’s trajectories ${\textstyle {\mathcal {T}}_{agent}}$ (actions and observations) collected during the previous stage.

We re-purpose a pre-trained text-to-image diffusion model, Stable Diffusion v1.4 . We condition the model ${\textstyle f_{\theta }}$ on trajectories ${\textstyle T\sim {\mathcal {T}}_{agent}}$ , i.e. on a sequence of previous actions ${\textstyle a_{<n}}$ and observations (frames) ${\textstyle o_{<n}}$ and remove all text conditioning. Specifically, to condition on actions, we simply learn an embedding ${\textstyle A_{emb}}$ from each action (e.g. a specific key press) into a single token and replace the cross attention from the text into this encoded actions sequence. In order to condition on observations (i.e. previous frames) we encode them into latent space using the auto-encoder ${\textstyle \phi }$ and concatenate them in the latent channels dimension to the noised latents (see Figure 3). We also experimented conditioning on these past observations via cross-attention but observed no meaningful improvements.

We train the model to minimize the diffusion loss with velocity parameterization :

{\mathcal {L}}=\mathbb {E} _{t,\epsilon ,T}\left[\|v(\epsilon ,x_{0},t)-v_{\theta '}(x_{t},t,\{\phi (o_{i<n})\},\{A_{emb}(a_{i<n})\})\|_{2}^{2}\right]

where ${\textstyle T=\{{o_{i\leq n},a_{i\leq n}}\}\sim {\mathcal {T}}_{agent}}$ , ${\textstyle x_{0}=\phi (o_{n})}$ , ${\textstyle t\sim {\mathcal {U}}(0,1)}$ , ${\textstyle \epsilon \sim {\mathcal {N}}(0,\mathbf {I} )}$ , ${\textstyle x_{t}={\sqrt {{\bar {\alpha }}_{t}}}x_{0}+{\sqrt {1-{\bar {\alpha }}_{t}}}\epsilon }$ , ${\textstyle v(\epsilon ,x_{0},t)={\sqrt {{\bar {\alpha }}_{t}}}\epsilon -{\sqrt {1-{\bar {\alpha }}_{t}}}x_{0}}$ , and ${\textstyle v_{\theta '}}$ is the v-prediction output of the model ${\textstyle f_{\theta }}$ . The noise schedule ${\textstyle {\bar {\alpha }}_{t}}$ is linear, similarly to .

Mitigating Auto-Regressive Drift Using Noise Augmentation

The domain shift between training with teacher-forcing and auto-regressive sampling leads to error accumulation and fast degradation in sample quality, as demonstrated in Figure 4. To avoid this divergence due to auto-regressive application of the model, we corrupt context frames by adding a varying amount of Gaussian noise to encoded frames in training time, while providing the noise level as input to the model, following . To that effect, we sample a noise level ${\textstyle \alpha }$ uniformly up to a maximal value, discretize it and learn an embedding for each bucket (see Figure 3). This allows the network to correct information sampled in previous frames, and is critical for preserving frame quality over time. During inference, the added noise level can be controlled to maximize quality, although we find that even with no added noise the results are significantly improved. We ablate the impact of this method in section 5.2.2.

File:Figures/noise aug ablation new.png

Latent Decoder Fine-tuning

The pre-trained auto-encoder of Stable Diffusion v1.4, which compresses 8x8 pixel patches into 4 latent channels, results in meaningful artifacts when predicting game frames, which affect small details and particularly the bottom bar HUD (“heads up display”). To leverage the pre-trained knowledge while improving image quality, we train just the decoder of the latent auto-encoder using an MSE loss computed against the target frame pixels. It might be possible to improve quality even further using a perceptual loss such as LPIPS (), which we leave to future work. Importantly, note that this fine tuning process happens completely separately from the U-Net fine-tuning, and that notably the auto-regressive generation isn’t affected by it (we only condition auto-regressively on the latents, not the pixels). Appendix 8.2 shows examples of generations with and without fine-tuning the auto-encoder.

Inference

Setup

We use DDIM sampling . We employ Classifier-Free Guidance only for the past observations condition ${\textstyle o_{<n}}$ . We didn’t find guidance for the past actions condition ${\textstyle a_{<n}}$ to improve quality. The weight we use is relatively small (1.5) as larger weights create artifacts which increase due to our auto-regressive sampling.

We also experimented with generating 4 samples in parallel and combining the results, with the hope of preventing rare extreme predictions from being accepted and to reduce error accumulation. We experimented both with averaging the samples and with choosing the sample closest to the median. Averaging performed slightly worse than single frame, and choosing the closest to the median performed only negligibly better. Since both increase the hardware requirements to 4 TPUs, we opt to not use them, but note that this might be an interesting area for future work.

Denoiser Sampling Steps

During inference, we need to run both the U-Net denoiser (for a number of steps) and the auto-encoder. On our hardware configuration (a TPU-v5), a single denoiser step and an evaluation of the auto-encoder both takes 10ms. If we ran our model with a single denoiser step, the minimum total latency possible in our setup would be 20ms per frame, or 50 frames per second. Usually, generative diffusion models, such as Stable Diffusion, don’t produce high quality results with a single denoising step, and instead require dozens of sampling steps to generate a high quality image. Surprisingly, we found that we can robustly simulate DOOM, with only 4 DDIM sampling steps . In fact, we observe no degradation in simulation quality when using 4 sampling steps vs 20 steps or more (see Appendix [appendix:ablate num sampling steps]).

Using just 4 denoising steps leads to a total U-Net cost of 40ms (and total inference cost of 50ms, including the auto encoder) or 20 frames per second. We hypothesize that the negligible impact to quality with few steps in our case stems from a combination of: (1) a constrained images space, and (2) strong conditioning by the previous frames.

Since we do observe degradation when using just a single sampling step, we also experimented with model distillation similarly to in the single-step setting. Distillation does help substantially there (allowing us to reach 50 FPS as above), but still comes at a some cost to simulation quality, so we opt to use the 4-step version without distillation for our method (see Appendix 8.4). This is an interesting area for further research.

We note that it is trivial to further increase the image generation rate substantially by parallelizing the generation of several frames on additional hardware, similarly to NVidia’s classic SLI Alternate Frame Rendering (AFR) technique. Similarly to AFR, the actual simulation rate would not increase and input lag would not reduce.

Experimental Setup

Agent Training

The agent model is trained using PPO , with a simple CNN as the feature network, following . It is trained on CPU using the Stable Baselines 3 infrastructure . The agent is provided with downscaled versions of the frame images and in-game map, each at resolution 160x120. The agent also has access to the last 32 actions it performed. The feature network computes a representation of size 512 for each image. PPO’s actor and critic are 2-layer MLP heads on top of a concatenation of the outputs of the image feature network and the sequence of past actions. We train the agent to play the game using the Vizdoom environment . We run 8 games in parallel, each with a replay buffer size of 512, a discount factor ${\textstyle \gamma =0.99}$ , and an entropy coefficient of ${\textstyle 0.1}$ . In each iteration, the network is trained using a batch size of 64 for 10 epochs, with a learning rate of 1e-4. We perform a total of 10M environment steps.

Generative Model Training

We train all simulation models from a pretrained checkpoint of Stable Diffusion 1.4, unfreezing all U-Net parameters. We use a batch size of 128 and a constant learning rate of 2e-5, with the Adafactor optimizer without weight decay and gradient clipping of 1.0. We change the diffusion loss parameterization to be v-prediction (. The context frames condition is dropped with probability 0.1 to allow CFG during inference. We train using 128 TPU-v5e devices with data parallelization. Unless noted otherwise, all results in the paper are after 700,000 training steps. For noise augmentation (Section 3.2.1), we use a maximal noise level of 0.7, with 10 embedding buckets. We use a batch size of 2,048 for optimizing the latent decoder, other training parameters are identical to those of the denoiser. For training data, we use all trajectories played by the agent during RL training as well as evaluation data during training, unless mentioned otherwise. Overall we generate 900M frames for training. All image frames (during training, inference, and conditioning) are at a resolution of 320x240 padded to 320x256. We use a context length of 64 (i.e. the model is provided its own last 64 predictions as well as the last 64 actions).

Results

Simulation Quality

Overall, our method achieves a simulation quality comparable to the original game over long trajectories in terms of image quality. For short trajectories, human raters are only slightly better than random chance at distinguishing between clips of the simulation and the actual game.

Image Quality. We measure LPIPS and PSNR using the teacher-forcing setup described in Section 2, where we sample an initial state and predict a single frame based on a trajectory of ground-truth past observations. When evaluated over a random holdout of 2048 trajectories taken in 5 different levels, our model achieves a PSNR of ${\textstyle 29.43}$ and an LPIPS of ${\textstyle 0.249}$ . The PSNR value is similar to lossy JPEG compression with quality settings of 20-30 . Figure 5 shows examples of model predictions and the corresponding ground truth samples.

File:Figures/image quality.pdf

Video Quality. We use the auto-regressive setup described in Section 2, where we iteratively sample frames following the sequences of actions defined by the ground-truth trajectory, while conditioning the model on its own past predictions. When sampled auto-regressively, the predicted and ground-truth trajectories often diverge after a few steps, mostly due to the accumulation of small amounts of different movement velocities between frames in each trajectory. For that reason, per-frame PSNR and LPIPS values gradually decrease and increase respectively, as can be seen in Figure 6. The predicted trajectory is still similar to the actual game in terms of content and image quality, but per-frame metrics are limited in their ability to capture this (see Appendix 8.1 for samples of auto-regressively generated trajectories).

We therefore measure the FVD computed over a random holdout of 512 trajectories, measuring the distance between the predicted and ground truth trajectory distributions, for simulations of length 16 frames (0.8 seconds) and 32 frames (1.6 seconds). For 16 frames our model obtains an FVD of ${\textstyle 114.02}$ . For 32 frames our model obtains an FVD of ${\textstyle 186.23}$ .

Human Evaluation. As another measurement of simulation quality, we provided 10 human raters with 130 random short clips (of lengths 1.6 seconds and 3.2 seconds) of our simulation side by side with the real game. The raters were tasked with recognizing the real game (see Figure 14 in Appendix 8.6). The raters only choose the actual game over the simulation in 58% or 60% of the time (for the 1.6 seconds and 3.2 seconds clips, respectively).

image image

Ablations

To evaluate the importance of the different components of our methods, we sample trajectories from the evaluation dataset and compute LPIPS and PSNR metrics between the ground truth and the predicted frames.

Context Length

We evaluate the impact of changing the number ${\textstyle N}$ of past observations in the conditioning context by training models with ${\textstyle N\in \{1,2,4,8,16,32,64\}}$ (recall that our method uses ${\textstyle N=64}$ ). This affects both the number of historical frames and actions. We train the models for 200,000 steps keeping the decoder frozen and evaluate on test-set trajectories from 5 levels. See the results in Table [table:number_of_frames_ablation]. As expected, we observe that generation quality improves with the length of the context. Interestingly, we observe that while the improvement is large at first (e.g. between 1 and 2 frames), we quickly approach an asymptote and further increasing the context size provides only small improvements in quality. This is somewhat surprising as even with our maximal context length, the model only has access to a little over 3 seconds of history. Notably, we observe that much of the game state is persisted for much longer periods (see Section 7). While the length of the conditioning context is an important limitation, Table [table:number_of_frames_ablation] hints that we’d likely need to change the architecture of our model to efficiently support longer contexts, and employ better selection of the past frames to condition on, which we leave for future work.

Noise Augmentation

To ablate the impact of noise augmentation we train a model without added noise. We evaluate both our standard model with noise augmentation and the model without added noise (after 200k training steps) auto-regressively and compute PSNR and LPIPS metrics between the predicted frames and the ground-truth over a random holdout of 512 trajectories. We report average metric values for each auto-regressive step up to a total of 64 frames in Figure 7.

Without noise augmentation, LPIPS distance from the ground truth increases rapidly compared to our standard noise-augmented model, while PSNR drops, indicating a divergence of the simulation from ground truth.

image image

Agent Play

We compare training on agent-generated data to training on data generated using a random policy. For the random policy, we sample actions following a uniform categorical distribution that doesn’t depend on the observations. We compare the random and agent datasets by training 2 models for 700k steps along with their decoder. The models are evaluated on a dataset of 2048 human-play trajectories from 5 levels. We compare the first frame of generation, conditioned on a history context of 64 ground-truth frames, as well as a frame after 3 seconds of auto-regressive generation.

Overall, we observe that training the model on random trajectories works surprisingly well, but is limited by the exploration ability of the random policy. When comparing the single frame generation the agent works only slightly better, achieving a PNSR of 25.06 vs 24.42 for the random policy. When comparing a frame after 3 seconds of auto-regressive generation, the difference increases to 19.02 vs 16.84. When playing with the model manually, we observe that some areas are very easy for both, some areas are very hard for both, and in some the agent performs much better. With that, we manually split 456 examples into 3 buckets: easy, medium, and hard, manually, based on their distance from the starting position in the game. We observe that on the easy and hard sets, the agent performs only slightly better than random, while on the medium set the difference is much larger in favor of the agent as expected (see Table [table:difficulty_split_performance]). See Figure 13 in Appendix 8.5 for an example of the scores during a single session of human play.

Related Work

Interactive 3D Simulation

Simulating visual and physical processes of 2D and 3D environments and allowing interactive exploration of them is an extensively developed field in computer graphics . Game Engines, such as Unreal and Unity, are software that processes representations of scene geometry and renders a stream of images in response to user interactions. The game engine is responsible for keeping track of all world state, e.g. the player position and movement, objects, character animation and lighting. It also tracks the game logic, e.g. points gained by accomplishing game objectives. Film and television productions use variants of ray-tracing , which are too slow and compute-intensive for real time applications. In contrast, game engines must keep a very high frame rate (typically 30-60 FPS), and therefore rely on highly-optimized polygon rasterization, often accelerated by GPUs. Physical effects such as shadows, particles and lighting are often implemented using efficient heuristics rather than physically accurate simulation.

Neural 3D Simulation

Neural methods for reconstructing 3D representations have made significant advances over the last years. NeRFs parameterize radiance fields using a deep neural network that is specifically optimized for a given scene from a set of images taken from various camera poses. Once trained, novel point of views of the scene can be sampled using volume rendering methods. Gaussian Splatting approaches build on NeRFs but represent scenes using 3D Gaussians and adapted rasterization methods, unlocking faster training and rendering times. While demonstrating impressive reconstruction results and real-time interactivity, these methods are often limited to static scenes.

Video Diffusion Models

Diffusion models achieved state-of-the-art results in text-to-image generation , a line of work that has also been applied for text-to-video generation tasks . Despite impressive advancement in realism, text adherence and temporal consistency, video diffusion models remain too slow for real-time applications. Our work extends this line of work and adapts it for real-time generation conditioned autoregressively on a history of past observations and actions.

Game Simulation and World Models

Several works attempted to train models for game simulation with actions inputs. build a diverse dataset of real-world and simulated videos and train a diffusion model to predict a continuation video given a previous video segment and a textual description of an action. and focus on unsupervised learning of actions from videos. converts textual prompts to game states, which are later converted to a 3D representation using NeRF. Unlike these works, we focus on interactive playable real-time simulation, and demonstrate robustness over long-horizon trajectories. We leverage an RL agent to explore the game environment and create rollouts of observations and interactions for training our interactive game model. Another line of work explored learning a predictive model of the environment and using it for training an RL agent. train a Variational Auto-Encoder to encode game frames into a latent vector, and then use an RNN to mimic the VizDoom game environment, training on random rollouts from a random policy (i.e. selecting an action at random). Then controller policy is learned by playing within the “hallucinated” environment. demonstrate that an RL agent can be trained entirely on episodes generated by a learned world model in latent space. Also close to our work is , that use an LSTM architecture for modeling the world state, coupled with a convolutional decoder for producing output frames and jointly trained under an adversarial objective. While this approach seems to produce reasonable results for simple games like PacMan, it struggles with simulating the complex environment of VizDoom and produces blurry samples. In contrast, GameNGen is able to generate samples comparable to those of the original game, see Figure 2. Finally, concurrently with our work, train a diffusion world model to predict the next observation given observation history, and iteratively train the world model and an RL model on Atari games.

DOOM

When DOOM released in 1993 it revolutionized the gaming industry. Introducing groundbreaking 3D graphics technology, it became a cornerstone of the first-person shooter genre, influencing countless other games. DOOM was studied by numerous research works. It provides an open-source implementation and a native resolution that is low enough for small sized models to simulate, while being complex enough to be a challenging test case. Finally, the authors have spent countless youth hours with the game. It was a trivial choice to use it in this work.

Discussion

Summary. We introduced GameNGen, and demonstrated that high-quality real-time game play at 20 frames per second is possible on a neural model. We also provided a recipe for converting an interactive piece of software such as a computer game into a neural model.

Limitations. GameNGen suffers from a limited amount of memory. The model only has access to a little over 3 seconds of history, so it’s remarkable that much of the game logic is persisted for drastically longer time horizons. While some of the game state is persisted through screen pixels (e.g. ammo and health tallies, available weapons, etc.), the model likely learns strong heuristics that allow meaningful generalizations. For example, from the rendered view the model learns to infer the player’s location, and from the ammo and health tallies, the model might infer whether the player has already been through an area and defeated the enemies there. That said, it’s easy to create situations where this context length is not enough. Continuing to increase the context size with our existing architecture yields only marginal benefits (Section 5.2.1), and the model’s short context length remains an important limitation. The second important limitation are the remaining differences between the agent’s behavior and those of human players. For example, our agent, even at the end of training, still does not explore all of the game’s locations and interactions, leading to erroneous behavior in those cases.

Future Work. We demonstrate GameNGen on the classic game DOOM. It would be interesting to test it on other games or more generally on other interactive software systems; We note that nothing in our technique is DOOM specific except for the reward function for the RL-agent. We plan on addressing that in a future work; While GameNGen manages to maintain game state accurately, it isn’t perfect, as per the discussion above. A more sophisticated architecture might be needed to mitigate these; GameNGen currently has a limited capability to leverage more than a minimal amount of memory. Experimenting with further expanding the memory effectively could be critical for more complex games/software; GameNGen runs at 20 or 50 FPS^[2] on a TPUv5. It would be interesting to experiment with further optimization techniques to get it to run at higher frame rates and on consumer hardware.

Towards a New Paradigm for Interactive Video Games. Today, video games are programmed by humans. GameNGen is a proof-of-concept for one part of a new paradigm where games are weights of a neural model, not lines of code. GameNGen shows that an architecture and model weights exist such that a neural model can effectively run a complex game (DOOM) interactively on existing hardware. While many important questions remain, we are hopeful that this paradigm could have important benefits. For example, the development process for video games under this new paradigm might be less costly and more accessible, whereby games could be developed and edited via textual descriptions or examples images. A small part of this vision, namely creating modifications or novel behaviors for existing games, might be achievable in the shorter term. For example, we might be able to convert a set of frames into a new playable level or create a new character just based on example images, without having to author code. Other advantages of this new paradigm include strong guarantees on frame rates and memory footprints. We have not experimented with these directions yet and much more work is required here, but we are excited to try! Hopefully this small step will someday contribute to a meaningful improvement in people’s experience with video games, or maybe even more generally, in day-to-day interactions with interactive software systems.

Acknowledgements

We’d like to extend a huge thank you to Eyal Segalis, Eyal Molad, Matan Kalman, Nataniel Ruiz, Amir Hertz, Matan Cohen, Yossi Matias, Yael Pritch, Danny Lumen, Valerie Nygaard, the Theta Labs and Google Research teams, and our families for insightful feedback, ideas, suggestions, and support.

Contribution

Dani Valevski: Developed much of the codebase, tuned parameters and details across the system, added autoencoder fine-tuning, agent training, and distillation.
Yaniv Leviathan: Proposed project, method, and architecture, developed the initial implementation, key contributor to implementation and writing.
Moab Arar: Led auto-regressive stabilization with noise-augmentation, many of the ablations, and created the dataset of human-play data.
Shlomi Fruchter: Proposed project, method, and architecture. Project leadership, initial implementation using DOOM, main manuscript writing, evaluation metrics, random policy data pipeline.

Correspondence to shlomif@google.com and leviathan@google.com.

Appendix

Samples

Figs. 8,9,10,11 provide selected samples from GameNGen.

image image

Fine-Tuning Latent Decoder Examples

Fig. 12 demonstrates the effect of fine-tuning the vae decoder.

File:Figures/fine tuning autoencoder.png

Reward Function

The RL-agent’s reward function, the only part of our method which is specific to the game Doom, is a sum of the following conditions:

Player hit: -100 points.
Player death: -5,000 points.
Enemy hit: 300 points.
Enemy kill: 1,000 points.
Item/weapon pick up: 100 points.
Secret found: 500 points.
New area: 20 * (1 + 0.5 * ${\textstyle L_{1}}$ distance) points.
Health delta: 10 * delta points.
Armor delta: 10 * delta points.
Ammo delta: 10 * max(0, delta) + min(0, delta) points.

Further, to encourage the agent to simulate smooth human play, we apply each agent action for 4 frames and additionally artificially increase the probability of repeating the previous action.

Reducing Inference Steps

We evaluated the performance of a GameNGen model with varying amounts of sampling steps when generating 2048 frames using teacher-forced trajectories on 35FPS data (the maximal sampling rate allowed by VizDoom, lower than the maximal rate our model achieves with distillation, see below). Surprisingly, we observe that quality does not deteriorate when decreasing the number of steps to 4, but does deteriorate when using just a single sampling step (see Table [table:diffusion_steps_quality]).

As a potential remedy, we experimented with distilling our model, following and . During distillation training we use 3 U-Nets, all initialized with a GameNGen model: generator, teacher, and fake-score model. The teacher remains frozen throughout the training. The fake-score model is continuously trained to predict the outputs of the generator with the standard diffusion loss. To train the generator, we use the teacher and the fake-score model to predict the noise added to an input image - ${\textstyle \epsilon _{\text{real}}}$ and ${\textstyle \epsilon _{\text{fake}}}$ . We optimize the weights of the generator to minimize the generator gradient value at each pixel weighted by ${\textstyle \epsilon _{\text{real}}-\epsilon _{\text{fake}}}$ . When distilling we use a CFG of 1.5 to generate ${\textstyle \epsilon _{\text{real}}}$ . We train for 1000 steps with a batch size of 128. Note that unlike we train with varying amounts of noise and do not use a regularization loss (we hope to explore other distillation variants in future work). With distillation we are able to significantly improve the quality of a 1-step model (see “D” in Table [table:diffusion_steps_quality]), enabling running the game at 50FPS, albeit with a small impact to quality.

Agent vs Random Policy

Figure 13 shows the PSNR values compared to ground truth for a model train on the RL-agent’s data and a model trained on the data from a random policy, after 3 second of auto-regressive generation, for a short session of human play. We observe that the agent is sometimes comparable to and sometime much better than the random policy.

File:Figures/psnr over trajectory.png

Human Eval Tool

Figure 14 depicts a screenshot of the tool used for the human evaluations (Section 5.1).

File:Figures/eval tool.png

↑ See https://www.reddit.com/r/itrunsdoom/
↑ Faster than the original game DOOM ran on the some of the authors’ 80386 machines at the time!

[1] See https://www.reddit.com/r/itrunsdoom/

[2] Faster than the original game DOOM ran on the some of the authors’ 80386 machines at the time!

[1]

[2]