Diffusion Models Are Real-Time Game Engines: Difference between revisions
<div id="main" class="ltx_page_main"> | |||
== Table of Contents == | |||
# [https://arxiv.org/html/2408.14837v1#abstract <span class="ltx_text ltx_ref_title"> <span class="ltx_tag ltx_tag_ref"></span> Abstract </span>] | |||
# [https://arxiv.org/html/2408.14837v1#S1 <span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">1 </span>Introduction</span>] | |||
# [https://arxiv.org/html/2408.14837v1#S2 <span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">2 </span>Interactive World Simulation</span>] | |||
# [https://arxiv.org/html/2408.14837v1#S3 <span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">3 </span>GameNGen</span>] | |||
## [https://arxiv.org/html/2408.14837v1#S3.SS1 <span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">3.1 </span>Data Collection via Agent Play</span>] | |||
## [https://arxiv.org/html/2408.14837v1#S3.SS2 <span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">3.2 </span>Training the Generative Diffusion Model</span>] | |||
### [https://arxiv.org/html/2408.14837v1#S3.SS2.SSS1 <span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">3.2.1 </span>Mitigating Auto-Regressive Drift Using Noise Augmentation</span>] | |||
### [https://arxiv.org/html/2408.14837v1#S3.SS2.SSS2 <span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">3.2.2 </span>Latent Decoder Fine-tuning</span>] | |||
## [https://arxiv.org/html/2408.14837v1#S3.SS3 <span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">3.3 </span>Inference</span>] | |||
### [https://arxiv.org/html/2408.14837v1#S3.SS3.SSS1 <span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">3.3.1 </span>Setup</span>] | |||
### [https://arxiv.org/html/2408.14837v1#S3.SS3.SSS2 <span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">3.3.2 </span>Denoiser Sampling Steps</span>] | |||
# [https://arxiv.org/html/2408.14837v1#S4 <span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">4 </span>Experimental Setup</span>] | |||
## [https://arxiv.org/html/2408.14837v1#S4.SS1 <span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">4.1 </span>Agent Training</span>] | |||
## [https://arxiv.org/html/2408.14837v1#S4.SS2 <span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">4.2 </span>Generative Model Training</span>] | |||
# [https://arxiv.org/html/2408.14837v1#S5 <span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">5 </span>Results</span>] | |||
## [https://arxiv.org/html/2408.14837v1#S5.SS1 <span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">5.1 </span>Simulation Quality</span>] | |||
## [https://arxiv.org/html/2408.14837v1#S5.SS2 <span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">5.2 </span>Ablations</span>] | |||
### [https://arxiv.org/html/2408.14837v1#S5.SS2.SSS1 <span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">5.2.1 </span>Context Length</span>] | |||
### [https://arxiv.org/html/2408.14837v1#S5.SS2.SSS2 <span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">5.2.2 </span>Noise Augmentation</span>] | |||
### [https://arxiv.org/html/2408.14837v1#S5.SS2.SSS3 <span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">5.2.3 </span>Agent Play</span>] | |||
# [https://arxiv.org/html/2408.14837v1#S6 <span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">6 </span>Related Work</span>] | |||
## [https://arxiv.org/html/2408.14837v1#S6.SS0.SSS0.Px1 <span class="ltx_text ltx_ref_title">Interactive 3D Simulation</span>] | |||
## [https://arxiv.org/html/2408.14837v1#S6.SS0.SSS0.Px2 <span class="ltx_text ltx_ref_title">Neural 3D Simulation</span>] | |||
## [https://arxiv.org/html/2408.14837v1#S6.SS0.SSS0.Px3 <span class="ltx_text ltx_ref_title">Video Diffusion Models</span>] | |||
## [https://arxiv.org/html/2408.14837v1#S6.SS0.SSS0.Px4 <span class="ltx_text ltx_ref_title">Game Simulation and World Models</span>] | |||
## [https://arxiv.org/html/2408.14837v1#S6.SS0.SSS0.Px5 <span class="ltx_text ltx_ref_title">DOOM</span>] | |||
# [https://arxiv.org/html/2408.14837v1#S7 <span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">7 </span>Discussion</span>] | |||
# [https://arxiv.org/html/2408.14837v1#A1 <span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">A </span>Appendix</span>] | |||
## [https://arxiv.org/html/2408.14837v1#A1.SS1 <span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">A.1 </span>Samples</span>] | |||
## [https://arxiv.org/html/2408.14837v1#A1.SS2 <span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">A.2 </span>Fine-Tuning Latent Decoder Examples</span>] | |||
## [https://arxiv.org/html/2408.14837v1#A1.SS3 <span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">A.3 </span>Reward Function</span>] | |||
## [https://arxiv.org/html/2408.14837v1#A1.SS4 <span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">A.4 </span>Reducing Inference Steps</span>] | |||
## [https://arxiv.org/html/2408.14837v1#A1.SS5 <span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">A.5 </span>Agent vs Random Policy</span>] | |||
## [https://arxiv.org/html/2408.14837v1#A1.SS6 <span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">A.6 </span>Human Eval Tool</span>] | |||
# [https://arxiv.org/html/2408.14837v1#bib <span class="ltx_text ltx_ref_title"> <span class="ltx_tag ltx_tag_ref"></span> References </span>] | |||
<div class="ltx_page_content"> | |||
<div id="target-section" class="section"> | |||
[https://info.arxiv.org/help/license/index.html#licenses-available License: CC BY-NC-SA 4.0] | |||
<div id="watermark-tr"> | |||
arXiv:2408.14837v1 [cs.LG] 27 Aug 2024 | |||
</div> | |||
</div> | |||
= Diffusion Models Are Real-Time Game Engines = | |||
<div class="ltx_authors"> | |||
<span class="ltx_creator ltx_role_author"> <span class="ltx_personname">Dani Valevski<br />
Google Research<br />
Yaniv Leviathan<span id="footnotex1" class="ltx_note ltx_role_footnotemark"><sup>1</sup></span><br />
Google Research<br />
Moab Arar<span id="footnotex2" class="ltx_note ltx_role_footnotemark"><sup>1</sup></span><br />
Tel Aviv University<br />
Shlomi Fruchter<span id="footnotex3" class="ltx_note ltx_role_footnotemark"><sup>1</sup></span><br />
Google DeepMind<br />
</span><span class="ltx_author_notes">Equal contribution. Work done while at Google Research.</span></span>
</div> | |||
<div id="abstract" class="ltx_abstract"> | |||
====== Abstract ====== | |||
We present ''GameNGen'', the first game engine powered entirely by a neural model that enables real-time interaction with a complex environment over long trajectories at high quality. GameNGen can interactively simulate the classic game DOOM at over 20 frames per second on a single TPU. Next frame prediction achieves a PSNR of 29.4, comparable to lossy JPEG compression. Human raters are only slightly better than random chance at distinguishing short clips of the game from clips of the simulation. GameNGen is trained in two phases: (1) an RL-agent learns to play the game and the training sessions are recorded, and (2) a diffusion model is trained to produce the next frame, conditioned on the sequence of past frames and actions. Conditioning augmentations enable stable auto-regressive generation over long trajectories. | |||
</div> | |||
[[File:./Diffusion%20Models%20Are%20Real-Time%20Game%20Engines_files/teaser.png|frame|none|548x240px|class=ltx_graphics ltx_centering ltx_img_landscape|alt=|caption <span class="ltx_tag ltx_tag_figure">Figure 1: </span> A human player is playing DOOM on <span id="S0.F1.2.1" class="ltx_text ltx_font_bold">GameNGen</span> at 20 FPS.<br /> | |||
See [https://gamengen.github.io/ https://gamengen.github.io] for demo videos.]] | |||
<div id="S1" class="section ltx_section"> | |||
== <span class="ltx_tag ltx_tag_section">1 </span>Introduction == | |||
<div id="S1.p1" class="ltx_para ltx_noindent"> | |||
Computer games are manually crafted software systems centered around the following ''game loop'': (1) gather user inputs, (2) update the game state, and (3) render it to screen pixels. This game loop, running at high frame rates, creates the illusion of an interactive virtual world for the player. Such game loops are classically run on standard computers, and while there have been many amazing attempts at running games on bespoke hardware (e.g. the iconic game DOOM has been run on kitchen appliances such as a toaster and a microwave, a treadmill, a camera, an iPod, and within the game of Minecraft, to name just a few examples<span id="footnote1" class="ltx_note ltx_role_footnote"><sup>1</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup>1</sup><span class="ltx_tag ltx_tag_note">1</span>See https://www.reddit.com/r/itrunsdoom/</span></span></span>), in all of these cases the hardware is still emulating the manually written game software as-is. Furthermore, while vastly different game engines exist, the game state updates and rendering logic in all are composed of a set of manual rules, programmed or configured by hand. | |||
</div> | |||
<div id="S1.p2" class="ltx_para ltx_noindent"> | |||
In recent years, generative models have made significant progress in producing images and videos conditioned on multi-modal inputs, such as text or images. At the forefront of this wave, diffusion models have become the de-facto standard in media (i.e. non-language) generation, with works like Dall-E (Ramesh et al., [https://arxiv.org/html/2408.14837v1#bib.bib25 2022]), Stable Diffusion (Rombach et al., [https://arxiv.org/html/2408.14837v1#bib.bib26 2022]) and Sora (Brooks et al., [https://arxiv.org/html/2408.14837v1#bib.bib6 2024]). At a glance, simulating the interactive worlds of video games may seem similar to video generation. However, <span id="S1.p2.1.1" class="ltx_text ltx_font_italic">interactive</span> world simulation is more than just very fast video generation. The requirement to condition on a stream of input actions that only becomes available during generation breaks some assumptions of existing diffusion model architectures. Notably, it requires generating frames auto-regressively, which tends to be unstable and leads to sampling divergence (see Section [https://arxiv.org/html/2408.14837v1#S3.SS2.SSS1 <span class="ltx_text ltx_ref_tag">3.2.1</span>]).
</div> | |||
<div id="S1.p3" class="ltx_para ltx_noindent"> | |||
Several important works (Ha & Schmidhuber, [https://arxiv.org/html/2408.14837v1#bib.bib10 2018]; Kim et al., [https://arxiv.org/html/2408.14837v1#bib.bib16 2020]; Bruce et al., [https://arxiv.org/html/2408.14837v1#bib.bib7 2024]) (see Section [https://arxiv.org/html/2408.14837v1#S6 <span class="ltx_text ltx_ref_tag">6</span>]) simulate interactive video games with neural models. Nevertheless, most of these approaches are limited with respect to the complexity of the simulated games, simulation speed, stability over long time periods, or visual quality (see Figure [https://arxiv.org/html/2408.14837v1#S1.F2 <span class="ltx_text ltx_ref_tag">2</span>]). It is therefore natural to ask:
</div> | |||
<div id="S1.p4" class="ltx_para ltx_noindent"> | |||
<span id="S1.p4.1.1" class="ltx_text ltx_font_italic">Can a neural model running in real-time simulate a complex game at high quality?</span> | |||
</div> | |||
<div id="S1.p5" class="ltx_para ltx_noindent"> | |||
In this work we demonstrate that the answer is yes. Specifically, we show that a complex video game, the iconic game DOOM, can be run on a neural network (an augmented version of the open Stable Diffusion v1.4 (Rombach et al., [https://arxiv.org/html/2408.14837v1#bib.bib26 2022])), in real-time, while achieving a visual quality comparable to that of the original game. While not an exact simulation, the neural model is able to perform complex game state updates, such as tallying health and ammo, attacking enemies, damaging objects, opening doors, and persisting the game state over long trajectories.
</div> | |||
<div id="S1.p6" class="ltx_para ltx_noindent"> | |||
GameNGen answers one of the important questions on the road towards a new paradigm for game engines, one where games are automatically generated, similarly to how images and videos are generated by neural models in recent years. Key questions remain, such as how these neural game engines would be trained and how games would be effectively created in the first place, including how to best leverage human inputs. We are nevertheless extremely excited for the possibilities of this new paradigm.
</div> | |||
[[File:./Diffusion%20Models%20Are%20Real-Time%20Game%20Engines_files/x1.png|frame|none|761x216px|class=ltx_graphics ltx_centering ltx_img_landscape|alt=|caption <span class="ltx_tag ltx_tag_figure">Figure 2: </span> <span id="S1.F2.2.1" class="ltx_text ltx_font_bold">GameNGen</span> compared to prior state-of-the-art simulations of DOOM.]] | |||
</div>
<div id="S2" class="section ltx_section">
== <span class="ltx_tag ltx_tag_section">2 </span>Interactive World Simulation == | |||
<div id="S2.p1" class="ltx_para ltx_noindent"> | |||
An ''Interactive Environment'' <math display="inline">\mathcal{E}</math> consists of a space of latent states <math display="inline">\mathcal{S}</math>, a space of partial projections of the latent space <math display="inline">\mathcal{O}</math>, a partial projection function <math display="inline">V:{\mathcal{S}\rightarrow\mathcal{O}}</math>, a set of actions <math display="inline">\mathcal{A}</math>, and a transition probability function <math display="inline">p{(\left. s \middle| {a,s^{\prime}} \right.)}</math> such that <math display="inline">{{s,s^{\prime}} \in \mathcal{S}},{a \in \mathcal{A}}</math>. | |||
</div> | |||
<div id="S2.p2" class="ltx_para ltx_noindent"> | |||
For example, in the case of the game DOOM, <math display="inline">\mathcal{S}</math> is the program’s dynamic memory contents, <math display="inline">\mathcal{O}</math> is the rendered screen pixels, <math display="inline">V</math> is the game’s rendering logic, <math display="inline">\mathcal{A}</math> is the set of key presses and mouse movements, and <math display="inline">p</math> is the program’s logic given the player’s input (including any potential non-determinism).
</div> | |||
<div id="S2.p3" class="ltx_para ltx_noindent"> | |||
Given an input interactive environment <math display="inline">\mathcal{E}</math>, and an initial state <math display="inline">s_{0} \in \mathcal{S}</math>, an ''Interactive World Simulation'' is a ''simulation distribution function'' <math display="inline">{{{q{(\left. o_{n} \middle| {o_{< n},a_{\leq n}} \right.)}},o_{i}} \in \mathcal{O}},{a_{i} \in \mathcal{A}}</math>. Given a distance metric between observations <math display="inline">D:{{\mathcal{O} \times \mathcal{O}}\rightarrow{\mathbb{R}}}</math>, a ''policy'', i.e. a distribution on agent actions given past actions and observations <math display="inline">\pi{(\left. a_{n} \middle| {o_{< n},a_{< n}} \right.)}</math>, a distribution <math display="inline">S_{0}</math> on initial states, and a distribution <math display="inline">N_{0}</math> on episode lengths, the ''Interactive World Simulation'' objective consists of minimizing <math display="inline">E{({D{(o_{q}^{i},o_{p}^{i})}})}</math> where <math display="inline">n \sim N_{0}</math>, <math display="inline">0 \leq i \leq n</math>, and <math display="inline">{o_{q}^{i} \sim q},{o_{p}^{i} \sim {V{(p)}}}</math> are sampled observations from the simulation and the environment, respectively, when enacting the agent’s policy <math display="inline">\pi</math>. Importantly, the conditioning actions for these samples are always obtained by the agent interacting with the environment <math display="inline">\mathcal{E}</math>, while the conditioning observations can either be obtained from <math display="inline">\mathcal{E}</math> (the ''teacher forcing objective'') or from the simulation (the ''auto-regressive objective'').
</div> | |||
<div id="S2.p4" class="ltx_para ltx_noindent"> | |||
We always train our generative model with the teacher forcing objective. Given a simulation distribution function <math display="inline">q</math>, the environment <math display="inline">\mathcal{E}</math> can be simulated by auto-regressively sampling observations.
</div> | |||
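The difference between the two objectives can be illustrated with a minimal sketch. It assumes a toy one-dimensional environment in which a frame is a single number and an action is an integer; the names <code>rollout</code>, <code>toy_q</code> and <code>errors</code> are hypothetical and are not taken from the paper. The toy simulator makes a constant error of 0.1 per step, which stays bounded when the history comes from the environment and accumulates when the history comes from the simulation itself.

```python
from typing import Callable, List, Sequence

Obs = float    # stand-in for a rendered frame o_i
Action = int   # stand-in for an input action a_i


def rollout(q_sample: Callable[[Sequence[Obs], Sequence[Action]], Obs],
            env_obs: Sequence[Obs],
            actions: Sequence[Action],
            auto_regressive: bool) -> List[Obs]:
    """Sample o_1..o_n from q(o_n | o_<n, a_<=n).

    Actions always come from the environment trajectory. Past observations
    come from the environment (teacher forcing) or from earlier samples of
    the simulation (auto-regressive).
    """
    sim_obs: List[Obs] = [env_obs[0]]  # o_0 is taken from the environment
    for n in range(1, len(env_obs)):
        past = sim_obs if auto_regressive else env_obs[:n]
        sim_obs.append(q_sample(past, actions[:n + 1]))
    return sim_obs


def toy_q(past_obs: Sequence[Obs], acts: Sequence[Action]) -> Obs:
    # Hypothetical simulator: correct dynamics plus a constant 0.1 error.
    return past_obs[-1] + acts[-1] + 0.1


# Toy environment (an assumption for illustration): o_n = o_{n-1} + a_n.
actions = [0, 1, 1, -1, 1, 1]
env_obs = [0.0]
for a in actions[1:]:
    env_obs.append(env_obs[-1] + a)


def errors(sim: Sequence[Obs]) -> List[float]:
    # Distance metric D: absolute difference per step.
    return [abs(s - e) for s, e in zip(sim, env_obs)]


teacher_err = errors(rollout(toy_q, env_obs, actions, auto_regressive=False))
auto_err = errors(rollout(toy_q, env_obs, actions, auto_regressive=True))
print(teacher_err[-1], auto_err[-1])
```

In this toy setting the teacher-forced error stays near 0.1 at every step, while the auto-regressive error grows by about 0.1 per step, which is the kind of accumulation that Section 3.2.1 addresses.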
</div> | |||
<div id="S3" class="section ltx_section"> | |||
== <span class="ltx_tag ltx_tag_section">3 </span>GameNGen == | |||
<div id="S3.p1" class="ltx_para ltx_noindent"> | |||
GameNGen (pronounced “game engine”) is a generative diffusion model that learns to simulate the game under the settings of Section [https://arxiv.org/html/2408.14837v1#S2 <span class="ltx_text ltx_ref_tag">2</span>]. In order to collect training data for this model, with the teacher forcing objective, we first train a separate model to interact with the environment. The two models (agent and generative) are trained in sequence. The entirety of the agent’s actions and observations corpus <math display="inline">\mathcal{T}_{agent}</math> during training is maintained and becomes the training dataset for the generative model in a second stage. See Figure [https://arxiv.org/html/2408.14837v1#S3.F3 <span class="ltx_text ltx_ref_tag">3</span>]. | |||
</div> | |||
[[File:./Diffusion%20Models%20Are%20Real-Time%20Game%20Engines_files/x2.png|frame|none|822x339px|class=ltx_graphics ltx_centering ltx_img_landscape|alt=|caption <span class="ltx_tag ltx_tag_figure">Figure 3: </span><span id="S3.F3.2.1" class="ltx_text ltx_font_bold">GameNGen</span> method overview. v-prediction details are omitted for brevity.]] | |||
<div id="S3.SS1" class="section ltx_subsection"> | |||
=== <span class="ltx_tag ltx_tag_subsection">3.1 </span>Data Collection via Agent Play === | |||
<div id="S3.SS1.p1" class="ltx_para ltx_noindent"> | |||
Our end goal is to have human players interact with our simulation. To that end, the policy <math display="inline">\pi</math> as in Section [https://arxiv.org/html/2408.14837v1#S2 <span class="ltx_text ltx_ref_tag">2</span>] is that of ''human gameplay''. Since we cannot sample from that directly at scale, we start by approximating it via teaching an automatic agent to play. Unlike a typical RL setup which attempts to maximize game score, our goal is to generate training data which resembles human play, or at least contains enough diverse examples, in a variety of scenarios, to maximize training data efficiency. To that end, we design a simple reward function, which is the only part of our method that is environment-specific (see Appendix [https://arxiv.org/html/2408.14837v1#A1.SS3 <span class="ltx_text ltx_ref_tag">A.3</span>]). | |||
</div> | |||
<div id="S3.SS1.p2" class="ltx_para ltx_noindent"> | |||
We record the agent’s training trajectories throughout the entire training process, which includes different skill levels of play. This set of recorded trajectories is our <math display="inline">\mathcal{T}_{agent}</math> dataset, used for training the generative model (see Section [https://arxiv.org/html/2408.14837v1#S3.SS2 <span class="ltx_text ltx_ref_tag">3.2</span>]).
</div>
</div> | |||
<div id="S3.SS2" class="section ltx_subsection"> | |||
=== <span class="ltx_tag ltx_tag_subsection">3.2 </span>Training the Generative Diffusion Model === | |||
<div id="S3.SS2.p1" class="ltx_para ltx_noindent">
We now train a generative diffusion model conditioned on the agent’s trajectories <math display="inline">\mathcal{T}_{agent}</math> (actions and observations) collected during the previous stage.
</div> | |||
<div id="S3.SS2.p2" class="ltx_para ltx_noindent"> | |||
We re-purpose a pre-trained text-to-image diffusion model, Stable Diffusion v1.4 (Rombach et al., [https://arxiv.org/html/2408.14837v1#bib.bib26 2022]). We condition the model <math display="inline">f_{\theta}</math> on trajectories <math display="inline">T \sim \mathcal{T}_{agent}</math>, i.e. on a sequence of previous actions <math display="inline">a_{< n}</math> and observations (frames) <math display="inline">o_{< n}</math>, and remove all text conditioning. Specifically, to condition on actions, we simply learn an embedding <math display="inline">A_{emb}</math> from each action (e.g. a specific key press) into a single token and replace the cross-attention over the text with cross-attention over this encoded action sequence. In order to condition on observations (i.e. previous frames) we encode them into latent space using the auto-encoder <math display="inline">\phi</math> and concatenate them in the latent channels dimension to the noised latents (see Figure [https://arxiv.org/html/2408.14837v1#S3.F3 <span class="ltx_text ltx_ref_tag">3</span>]). We also experimented with conditioning on these past observations via cross-attention but observed no meaningful improvements.
</div> | |||
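The two conditioning paths described above can be sketched in terms of shapes alone. This is a minimal illustration in plain Python, not the authors' implementation: the context length (3 frames), the latent size (6x8), the embedding width (8) and the action names are all hypothetical, and the helper names <code>make_latent</code>, <code>concat_channels</code> and <code>embed_actions</code> are invented for the example. Only the count of 4 latent channels per frame comes from the text (Section 3.2.2).

```python
import random

LATENT_CHANNELS = 4  # the Stable Diffusion v1.4 auto-encoder uses 4 latent channels
EMB_DIM = 8          # hypothetical embedding width for action tokens


def make_latent(height, width, fill):
    """Build a dummy latent as a list of channels, each a height x width grid."""
    return [[[fill] * width for _ in range(height)]
            for _ in range(LATENT_CHANNELS)]


def concat_channels(noised, context):
    """Concatenate the noised latent and past-frame latents along channels."""
    stacked = list(noised)
    for latent in context:
        stacked.extend(latent)
    return stacked


def embed_actions(actions, table):
    """Map each action to a single token via a lookup table (A_emb)."""
    return [table[a] for a in actions]


rng = random.Random(0)
# In training this table would be learned; here it is random for illustration.
action_table = {a: [rng.gauss(0.0, 1.0) for _ in range(EMB_DIM)]
                for a in ("left", "right", "fire")}

context = [make_latent(6, 8, float(i)) for i in range(3)]  # phi(o_i) for 3 past frames
noised = make_latent(6, 8, -1.0)                           # x_t for the frame to predict

unet_input = concat_channels(noised, context)
action_tokens = embed_actions(["left", "fire", "fire"], action_table)

print(len(unet_input), len(action_tokens), len(action_tokens[0]))
```

With these assumed sizes the denoiser input has 4 x (3 + 1) = 16 channels, and the cross-attention context is a sequence of 3 tokens of width 8, one per past action.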
<div id="S3.SS2.p3" class="ltx_para ltx_noindent"> | |||
We train the model to minimize the diffusion loss with velocity parameterization (Salimans & Ho, [https://arxiv.org/html/2408.14837v1#bib.bib29 2022b]): | |||
</div>
<div id="S3.SS2.p4" class="ltx_para ltx_noindent"> | |||
{| | |||
| | |||
| <math display="block">\mathcal{L} = {{\mathbb{E}}_{t,\epsilon,T}\left\lbrack {\|{{v{(\epsilon,x_{0},t)}} - {v_{\theta^{\prime}}{(x_{t},t,{\{{\phi{(o_{i < n})}}\}},{\{{A_{emb}{(a_{i < n})}}\}})}}}\|}_{2}^{2} \right\rbrack}</math> | |||
| | |||
| <span class="ltx_tag ltx_tag_equation ltx_align_right">(1)</span> | |||
|} | |||
</div> | |||
<div id="S3.SS2.p5" class="ltx_para ltx_noindent"> | |||
where <math display="inline">T = {\{ o_{i \leq n},a_{i \leq n}\}} \sim \mathcal{T}_{agent}</math>, <math display="inline">x_{0} = {\phi{(o_{n})}}</math>, <math display="inline">t \sim {\mathcal{U}{(0,1)}}</math>, <math display="inline">\epsilon \sim {\mathcal{N}{(0,\mathbf{I})}}</math>, <math display="inline">x_{t} = {{\sqrt{{\overline{\alpha}}_{t}}x_{0}} + {\sqrt{1 - {\overline{\alpha}}_{t}}\epsilon}}</math>, <math display="inline">{v{(\epsilon,x_{0},t)}} = {{\sqrt{{\overline{\alpha}}_{t}}\epsilon} - {\sqrt{1 - {\overline{\alpha}}_{t}}x_{0}}}</math>, and <math display="inline">v_{\theta^{\prime}}</math> is the v-prediction output of the model <math display="inline">f_{\theta}</math>. The noise schedule <math display="inline">{\overline{\alpha}}_{t}</math> is linear, similarly to Rombach et al. ([https://arxiv.org/html/2408.14837v1#bib.bib26 2022]).
</div>
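The quantities in Equation (1) can be checked numerically with a small sketch. It treats a latent as a flat list of floats and uses an arbitrary value for the cumulative schedule term; the function names and the value 0.7 are assumptions made for illustration. The last helper uses the standard identity that a clean latent can be recovered from the noised latent and the velocity, which follows directly from the two definitions given in the text.

```python
import math
import random


def noised_latent(x0, eps, alpha_bar):
    """x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * x + b * e for x, e in zip(x0, eps)]


def v_target(x0, eps, alpha_bar):
    """v = sqrt(alpha_bar) * eps - sqrt(1 - alpha_bar) * x0."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * e - b * x for x, e in zip(x0, eps)]


def v_loss(v_true, v_pred):
    """Squared L2 norm of the difference, as inside the expectation in Eq. (1)."""
    return sum((t - p) ** 2 for t, p in zip(v_true, v_pred))


def x0_from_v(x_t, v, alpha_bar):
    """Recover x0 = sqrt(alpha_bar) * x_t - sqrt(1 - alpha_bar) * v."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * xt - b * vi for xt, vi in zip(x_t, v)]


rng = random.Random(0)
x0 = [rng.uniform(-1.0, 1.0) for _ in range(5)]   # stand-in for phi(o_n)
eps = [rng.gauss(0.0, 1.0) for _ in range(5)]     # Gaussian noise
alpha_bar = 0.7                                   # hypothetical schedule value

x_t = noised_latent(x0, eps, alpha_bar)
v = v_target(x0, eps, alpha_bar)
recovered = x0_from_v(x_t, v, alpha_bar)

print(v_loss(v, v), max(abs(r - x) for r, x in zip(recovered, x0)))
```

A perfect prediction gives zero loss, and the recovered latent matches the original up to floating-point error, so a model that outputs the velocity carries the same information as one that outputs the clean latent or the noise.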
<div id="S3.SS2.SSS1" class="section ltx_subsubsection">
==== <span class="ltx_tag ltx_tag_subsubsection">3.2.1 </span>Mitigating Auto-Regressive Drift Using Noise Augmentation ==== | |||
<div id="S3.SS2.SSS1.p1" class="ltx_para ltx_noindent">
The domain shift between training with teacher-forcing and auto-regressive sampling leads to error accumulation and fast degradation in sample quality, as demonstrated in Figure [https://arxiv.org/html/2408.14837v1#S3.F4 <span class="ltx_text ltx_ref_tag">4</span>]. To avoid this divergence due to auto-regressive application of the model, we corrupt context frames by adding a varying amount of Gaussian noise to encoded frames at training time, while providing the noise level as input to the model, following Ho et al. ([https://arxiv.org/html/2408.14837v1#bib.bib13 2021]). To that effect, we sample a noise level <math display="inline">\alpha</math> uniformly up to a maximal value, discretize it and learn an embedding for each bucket (see Figure [https://arxiv.org/html/2408.14837v1#S3.F3 <span class="ltx_text ltx_ref_tag">3</span>]). This allows the network to correct information sampled in previous frames, and is critical for preserving frame quality over time. During inference, the added noise level can be controlled to maximize quality, although we find that even with no added noise the results are significantly improved. We ablate the impact of this method in Section [https://arxiv.org/html/2408.14837v1#S5.SS2.SSS2 <span class="ltx_text ltx_ref_tag">5.2.2</span>].
</div>
[[File:./Diffusion%20Models%20Are%20Real-Time%20Game%20Engines_files/noise_aug_ablation_new.png|frame|none|494x158px|class=ltx_graphics ltx_centering ltx_img_landscape|alt=|caption <span class="ltx_tag ltx_tag_figure">Figure 4: </span><span id="S3.F4.2.1" class="ltx_text ltx_font_bold">Auto-regressive drift.</span> Top: we present every 10th frame of a simple trajectory with 50 frames in which the player is not moving. Quality degrades fast after 20-30 steps. Bottom: the same trajectory with noise augmentation does not suffer from quality degradation.]] | |||
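The noise augmentation step can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the maximal noise level (0.7), the number of buckets (10), and the choice to scale unit Gaussian noise by the sampled level are all hypothetical, as are the function names. The sketch only shows the three operations named in the text: sampling a level uniformly, discretizing it into a bucket whose index would select a learned embedding, and corrupting an encoded context frame.

```python
import random

MAX_NOISE = 0.7    # hypothetical maximal noise level
NUM_BUCKETS = 10   # hypothetical number of discretization buckets


def sample_noise_level(rng):
    """Sample a noise level uniformly in [0, MAX_NOISE]."""
    return rng.uniform(0.0, MAX_NOISE)


def bucket_index(level):
    """Discretize a noise level; the index would select a learned embedding."""
    return min(int(level / MAX_NOISE * NUM_BUCKETS), NUM_BUCKETS - 1)


def corrupt(latent, level, rng):
    """Add Gaussian noise scaled by the level to an encoded context frame."""
    return [x + level * rng.gauss(0.0, 1.0) for x in latent]


rng = random.Random(0)
context_latent = [0.5, -0.2, 0.1, 0.9]   # stand-in for one encoded frame

level = sample_noise_level(rng)
noisy_context = corrupt(context_latent, level, rng)
bucket = bucket_index(level)             # passed to the model alongside the frame

print(level, bucket, noisy_context)
```

At inference time the same interface allows a level of zero, which leaves the context latents unchanged and maps to the first bucket.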
</div> | |||
<div id="S3.SS2.SSS2" class="section ltx_subsubsection"> | |||
==== <span class="ltx_tag ltx_tag_subsubsection">3.2.2 </span>Latent Decoder Fine-tuning ==== | |||
<div id="S3.SS2.SSS2.p1" class="ltx_para ltx_noindent"> | |||
The pre-trained auto-encoder of Stable Diffusion v1.4, which compresses 8x8 pixel patches into 4 latent channels, produces significant artifacts when predicting game frames; these affect small details and particularly the bottom-bar HUD (“heads-up display”). To leverage the pre-trained knowledge while improving image quality, we train just the decoder of the latent auto-encoder using an MSE loss computed against the target frame pixels. It might be possible to improve quality even further using a perceptual loss such as LPIPS (Zhang et al., [https://arxiv.org/html/2408.14837v1#bib.bib40 2018]), which we leave to future work. Importantly, this fine-tuning process happens completely separately from the U-Net fine-tuning, and the auto-regressive generation is not affected by it (we only condition auto-regressively on the latents, not the pixels). Appendix [https://arxiv.org/html/2408.14837v1#A1.SS2 <span class="ltx_text ltx_ref_tag">A.2</span>] shows examples of generations with and without fine-tuning the auto-encoder.
</div>
</div>
</div>
<div id="S3.SS3" class="section ltx_subsection">
=== <span class="ltx_tag ltx_tag_subsection">3.3 </span>Inference ===
<div id="S3.SS3.SSS1" class="section ltx_subsubsection">
==== <span class="ltx_tag ltx_tag_subsubsection">3.3.1 </span>Setup ====
<div id="S3.SS3.SSS1.p1" class="ltx_para ltx_noindent">
We use DDIM sampling (Song et al., [https://arxiv.org/html/2408.14837v1#bib.bib34 2022]). We employ Classifier-Free Guidance (Ho & Salimans, [https://arxiv.org/html/2408.14837v1#bib.bib12 2022]) only for the past observations condition <math display="inline">o_{< n}</math>. We didn’t find guidance for the past actions condition <math display="inline">a_{< n}</math> to improve quality. The weight we use is relatively small (1.5) as larger weights create artifacts which increase due to our auto-regressive sampling. | |||
</div> | |||
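The guidance step above can be illustrated with the standard Classifier-Free Guidance combination rule. This is a sketch, assuming the common parameterization in which the guided output is the unconditional prediction plus the weight times the conditional-minus-unconditional difference; the function name `apply_cfg` and the toy vectors are hypothetical, and only the weight of 1.5 comes from the text.

```python
GUIDANCE_WEIGHT = 1.5  # guidance weight reported in the text


def apply_cfg(cond_pred, uncond_pred, weight=GUIDANCE_WEIGHT):
    """Combine conditional and unconditional denoiser outputs element-wise."""
    return [u + weight * (c - u) for c, u in zip(cond_pred, uncond_pred)]


# cond_pred: prediction given the past observations.
# uncond_pred: prediction with the past-observations condition dropped.
# Both are toy values standing in for denoiser outputs.
guided = apply_cfg([1.0, 2.0], [0.0, 1.0])
print(guided)
```

With a weight of 1.0 the rule reduces to the conditional prediction; larger weights push further away from the unconditional one, which is consistent with the observation that large weights amplify artifacts under auto-regressive sampling.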
<div id="S3.SS3.SSS1.p2" class="ltx_para ltx_noindent"> | |||
We also experimented with generating 4 samples in parallel and combining the results, with the hope of preventing rare extreme predictions from being accepted and of reducing error accumulation. We experimented both with averaging the samples and with choosing the sample closest to the median. Averaging performed slightly worse than using a single sample, and choosing the sample closest to the median performed only negligibly better. Since both increase the hardware requirements to 4 TPUs, we opt to not use them, but note that this might be an interesting area for future work.
</div> | |||
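The two combination strategies mentioned above can be sketched on toy data. This is an illustration only: the helper names are hypothetical, frames are flattened to short lists, and "closest to the median" is interpreted here as the smallest squared distance to the element-wise median, which is an assumption about a detail the text does not specify.

```python
from statistics import median


def average_samples(samples):
    """Element-wise mean of the candidate frames."""
    return [sum(vals) / len(vals) for vals in zip(*samples)]


def closest_to_median(samples):
    """Return the candidate with the smallest squared distance to the element-wise median."""
    med = [median(vals) for vals in zip(*samples)]

    def dist(sample):
        return sum((x - m) ** 2 for x, m in zip(sample, med))

    return min(samples, key=dist)


# Four toy "frames"; the last one plays the role of a rare extreme prediction.
candidates = [[0.0, 0.0], [1.0, 1.0], [1.0, 2.0], [9.0, 9.0]]
mean_frame = average_samples(candidates)
picked = closest_to_median(candidates)
```

Averaging lets the outlier pull the result, whereas the median-based choice always returns one of the actual samples.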
</div> | |||
<div id="S3.SS3.SSS2" class="section ltx_subsubsection"> | |||
==== <span class="ltx_tag ltx_tag_subsubsection">3.3.2 </span>Denoiser Sampling Steps ==== | |||
<div id="S3.SS3.SSS2.p1" class="ltx_para ltx_noindent"> | |||
During inference, we need to run both the U-Net denoiser (for a number of steps) and the auto-encoder. On our hardware configuration (a TPU-v5), a single denoiser step and an evaluation of the auto-encoder each take 10ms. If we ran our model with a single denoiser step, the minimum total latency possible in our setup would be 20ms per frame, or 50 frames per second. Usually, generative diffusion models, such as Stable Diffusion, don’t produce high quality results with a single denoising step, and instead require dozens of sampling steps to generate a high quality image. Surprisingly, we found that we can robustly simulate DOOM with only 4 DDIM sampling steps (Song et al., [https://arxiv.org/html/2408.14837v1#bib.bib33 2020]). In fact, we observe no degradation in simulation quality when using 4 sampling steps vs 20 steps or more (see Appendix [https://arxiv.org/html/2408.14837v1#A1.SS4 <span class="ltx_text ltx_ref_tag">A.4</span>]).
</div> | |||
<div id="S3.SS3.SSS2.p2" class="ltx_para ltx_noindent"> | |||
Using just 4 denoising steps leads to a total U-Net cost of 40ms (and a total inference cost of 50ms, including the auto-encoder), or 20 frames per second. We hypothesize that the negligible impact on quality with few steps in our case stems from a combination of: (1) a constrained image space, and (2) strong conditioning by the previous frames.
</div> | |||
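The latency figures above follow from simple arithmetic, sketched here. The two 10ms timings are the values reported in the text for a TPU-v5; the function names are hypothetical, and the model assumes the denoiser steps and the auto-encoder run strictly one after another.

```python
STEP_MS = 10.0         # one U-Net denoiser step (from the text)
AUTOENCODER_MS = 10.0  # one auto-encoder evaluation (from the text)


def frame_latency_ms(num_steps):
    """Total per-frame latency: sequential denoiser steps plus one auto-encoder pass."""
    return num_steps * STEP_MS + AUTOENCODER_MS


def frames_per_second(num_steps):
    """Frame rate implied by the per-frame latency."""
    return 1000.0 / frame_latency_ms(num_steps)


for steps in (1, 4, 20):
    print(steps, frame_latency_ms(steps), round(frames_per_second(steps), 2))
```

One step gives 20ms (50 FPS) and four steps give 50ms (20 FPS), matching the numbers in the text; under the same assumptions, 20 steps would cost 210ms per frame.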
<div id="S3.SS3.SSS2.p3" class="ltx_para ltx_noindent"> | |||
Since we do observe degradation when using just a single sampling step, we also experimented with model distillation, similarly to (Yin et al., [https://arxiv.org/html/2408.14837v1#bib.bib39 2024]; Wang et al., [https://arxiv.org/html/2408.14837v1#bib.bib36 2023]), in the single-step setting. Distillation does help substantially there (allowing us to reach 50 FPS as above), but still comes at some cost to simulation quality, so we opt to use the 4-step version without distillation for our method (see Appendix [https://arxiv.org/html/2408.14837v1#A1.SS4 <span class="ltx_text ltx_ref_tag">A.4</span>]). This is an interesting area for further research.
</div> | |||
<div id="S3.SS3.SSS2.p4" class="ltx_para ltx_noindent"> | |||
We note that it is trivial to further increase the image generation rate substantially by parallelizing the generation of several frames on additional hardware, similar to NVIDIA’s classic SLI Alternate Frame Rendering (AFR) technique. As with AFR, the actual simulation rate would not increase and input lag would not be reduced.
</div> | |||
</div> | |||
</div> | |||
</div> | |||
<div id="S4" class="section ltx_section"> | |||
== <span class="ltx_tag ltx_tag_section">4 </span>Experimental Setup == | |||
<div id="S4.SS1" class="section ltx_subsection"> | |||
=== <span class="ltx_tag ltx_tag_subsection">4.1 </span>Agent Training === | |||
<div id="S4.SS1.p1" class="ltx_para ltx_noindent"> | |||
The agent model is trained using PPO (Schulman et al., [https://arxiv.org/html/2408.14837v1#bib.bib30 2017]), with a simple CNN as the feature network, following Mnih et al. ([https://arxiv.org/html/2408.14837v1#bib.bib21 2015]). It is trained on CPU using the Stable Baselines 3 infrastructure (Raffin et al., [https://arxiv.org/html/2408.14837v1#bib.bib24 2021]). The agent is provided with downscaled versions of the frame images and in-game map, each at resolution 160x120. The agent also has access to the last 32 actions it performed. The feature network computes a representation of size 512 for each image. PPO’s actor and critic are 2-layer MLP heads on top of a concatenation of the outputs of the image feature network and the sequence of past actions. We train the agent to play the game using the Vizdoom environment (Wydmuch et al., [https://arxiv.org/html/2408.14837v1#bib.bib37 2019]). We run 8 games in parallel, each with a replay buffer size of 512, a discount factor <math display="inline">\gamma = 0.99</math>, and an entropy coefficient of <math display="inline">0.1</math>. In each iteration, the network is trained using a batch size of 64 for 10 epochs, with a learning rate of 1e-4. We perform a total of 10M environment steps. | |||
</div> | |||
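The agent-training hyperparameters above can be collected in one place, together with the quantities they imply. This is a sketch only: the class `AgentTrainingConfig` and its field names are hypothetical and do not correspond to the Stable Baselines 3 API. The derived numbers assume that each per-game buffer is filled completely before every update and that the 10M environment steps are counted across all parallel games.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentTrainingConfig:
    num_envs: int = 8              # games run in parallel
    buffer_size: int = 512         # replay buffer size per game
    batch_size: int = 64
    epochs: int = 10
    gamma: float = 0.99            # discount factor
    entropy_coef: float = 0.1
    learning_rate: float = 1e-4
    total_env_steps: int = 10_000_000

    @property
    def samples_per_iteration(self) -> int:
        # Assumption: every buffer is filled completely before each update.
        return self.num_envs * self.buffer_size

    @property
    def gradient_steps_per_iteration(self) -> int:
        # Each epoch sweeps over all collected samples in mini-batches.
        return self.epochs * (self.samples_per_iteration // self.batch_size)

    @property
    def num_iterations(self) -> int:
        # Assumption: total_env_steps is summed over all parallel games.
        return self.total_env_steps // self.samples_per_iteration


cfg = AgentTrainingConfig()
print(cfg.samples_per_iteration, cfg.gradient_steps_per_iteration, cfg.num_iterations)
```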
</div> | |||
<div id="S4.SS2" class="section ltx_subsection"> | |||
=== <span class="ltx_tag ltx_tag_subsection">4.2 </span>Generative Model Training === | |||
<div id="S4.SS2.p1" class="ltx_para ltx_noindent"> | |||
We train all simulation models from a pretrained checkpoint of Stable Diffusion 1.4, unfreezing all U-Net parameters. We use a batch size of 128 and a constant learning rate of 2e-5, with the Adafactor optimizer without weight decay (Shazeer & Stern, [https://arxiv.org/html/2408.14837v1#bib.bib31 2018]) and gradient clipping of 1.0. We change the diffusion loss parameterization to be v-prediction (Salimans & Ho, [https://arxiv.org/html/2408.14837v1#bib.bib28 2022a]). The context frames condition is dropped with probability 0.1 to allow CFG during inference. We train using 128 TPU-v5e devices with data parallelization. Unless noted otherwise, all results in the paper are after 700,000 training steps. For noise augmentation (Section [https://arxiv.org/html/2408.14837v1#S3.SS2.SSS1 <span class="ltx_text ltx_ref_tag">3.2.1</span>]), we use a maximal noise level of 0.7, with 10 embedding buckets. We use a batch size of 2,048 for optimizing the latent decoder; other training parameters are identical to those of the denoiser. For training data, we use all trajectories played by the agent during RL training as well as evaluation data during training, unless mentioned otherwise. Overall we generate 900M frames for training. All image frames (during training, inference, and conditioning) are at a resolution of 320x240 padded to 320x256. We use a context length of 64 (i.e. the model is provided its own last 64 predictions as well as the last 64 actions).
</div> | |||
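The frame geometry above can be made concrete with a short sketch. The 320x240 and 320x256 resolutions come from the text, and the 8x8-patch, 4-channel compression is the one described in Section 3.2.2. The helper names, the choice to pad at the bottom, and the zero fill value are assumptions made for illustration; the text does not say where or how the padding is applied.

```python
def pad_height(frame, target_height, fill=0):
    """Pad a frame (a list of pixel rows) with `fill` rows up to target_height.

    Assumption: padding rows are appended at the bottom.
    """
    width = len(frame[0])
    missing = target_height - len(frame)
    return frame + [[fill] * width for _ in range(missing)]


def latent_shape(width, height, patch=8, channels=4):
    """Latent grid shape (channels, rows, cols) for an auto-encoder compressing patch x patch pixels."""
    return (channels, height // patch, width // patch)


frame = [[1] * 320 for _ in range(240)]  # toy single-channel 320x240 frame
padded = pad_height(frame, 256)
print(len(padded[0]), len(padded), latent_shape(320, 256))
```

Under these assumptions each padded frame corresponds to a 4-channel latent grid of 32 rows by 40 columns.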
</div> | |||
</div> | |||
<div id="S5" class="section ltx_section"> | |||
== <span class="ltx_tag ltx_tag_section">5 </span>Results ==
<div id="S5.SS1" class="section ltx_subsection"> | |||
=== <span class="ltx_tag ltx_tag_subsection">5.1 </span>Simulation Quality ===
<div id="S5.SS1.p1" class="ltx_para ltx_noindent"> | |||
Overall, our method achieves a simulation quality comparable to the original game over long trajectories in terms of image quality. For short trajectories, human raters are only slightly better than random chance at distinguishing between clips of the simulation and the actual game.
</div> | |||
<div id="S5.SS1.p2" class="ltx_para ltx_noindent"> | |||
<span id="S5.SS1.p2.2.1" class="ltx_text ltx_font_bold">Image Quality.</span> We measure LPIPS (Zhang et al., [https://arxiv.org/html/2408.14837v1#bib.bib40 2018]) and PSNR using the teacher-forcing setup described in Section [https://arxiv.org/html/2408.14837v1#S2 <span class="ltx_text ltx_ref_tag">2</span>], where we sample an initial state and predict a single frame based on a trajectory of ground-truth past observations. When evaluated over a random holdout of 2048 trajectories taken in 5 different levels, our model achieves a PSNR of <math display="inline">29.43</math> and an LPIPS of <math display="inline">0.249</math>. The PSNR value is similar to lossy JPEG compression with quality settings of 20-30 (Petric & Milinkovic, [https://arxiv.org/html/2408.14837v1#bib.bib22 2018]). Figure [https://arxiv.org/html/2408.14837v1#S5.F5 <span class="ltx_text ltx_ref_tag">5</span>] shows examples of model predictions and the corresponding ground truth samples. | |||
</div> | |||
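For reference, PSNR between a predicted and a ground-truth frame can be computed as below. This is a minimal sketch assuming 8-bit pixel values (peak value 255) and frames flattened to equal-length lists; the function name `psnr` is illustrative, and the evaluation code used for the reported numbers may differ in details such as color handling.

```python
import math


def psnr(pred, target, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0:
        return math.inf  # identical frames
    return 10.0 * math.log10(max_value ** 2 / mse)


# Toy 4-pixel "frames" differing by 2 grey levels per pixel.
score = psnr([10, 20, 30, 40], [12, 18, 32, 38])
```

Higher values indicate a closer match, which is why PSNR is marked with an upward arrow in the tables of this section.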
[[File:./Diffusion%20Models%20Are%20Real-Time%20Game%20Engines_files/x3.png|frame|none|761x342px|class=ltx_graphics ltx_centering ltx_img_landscape|alt=|caption <span class="ltx_tag ltx_tag_figure">Figure 5: </span><span id="S5.F5.2.1" class="ltx_text ltx_font_bold">Model predictions vs. ground truth</span>. Only the last 4 frames of the past observations context are shown.]] | |||
<div id="S5.SS1.p3" class="ltx_para ltx_noindent"> | |||
<span id="S5.SS1.p3.1.1" class="ltx_text ltx_font_bold">Video Quality.</span> We use the auto-regressive setup described in Section [https://arxiv.org/html/2408.14837v1#S2 <span class="ltx_text ltx_ref_tag">2</span>], where we iteratively sample frames following the sequences of actions defined by the ground-truth trajectory, while conditioning the model on its own past predictions. When sampled auto-regressively, the predicted and ground-truth trajectories often diverge after a few steps, mostly due to the accumulation of small differences in movement velocity between frames in each trajectory. For that reason, per-frame PSNR and LPIPS values gradually decrease and increase respectively, as can be seen in Figure [https://arxiv.org/html/2408.14837v1#S5.F6 <span class="ltx_text ltx_ref_tag">6</span>]. The predicted trajectory is still similar to the actual game in terms of content and image quality, but per-frame metrics are limited in their ability to capture this (see Appendix [https://arxiv.org/html/2408.14837v1#A1.SS1 <span class="ltx_text ltx_ref_tag">A.1</span>] for samples of auto-regressively generated trajectories).
</div>
<div id="S5.SS1.p4" class="ltx_para ltx_noindent"> | |||
We therefore measure the FVD (Unterthiner et al., [https://arxiv.org/html/2408.14837v1#bib.bib35 2019]) computed over a random holdout of 512 trajectories, measuring the distance between the predicted and ground truth trajectory distributions, for simulations of length 16 frames (0.8 seconds) and 32 frames (1.6 seconds). For 16 frames our model obtains an FVD of <math display="inline">114.02</math>. For 32 frames our model obtains an FVD of <math display="inline">186.23</math>. | |||
</div> | |||
<div id="S5.SS1.p5" class="ltx_para ltx_noindent"> | |||
<span id="S5.SS1.p5.1.1" class="ltx_text ltx_font_bold">Human Evaluation.</span> As another measurement of simulation quality, we provided 10 human raters with 130 random short clips (of lengths 1.6 seconds and 3.2 seconds) of our simulation side by side with the real game. The raters were tasked with recognizing the real game (see Figure [https://arxiv.org/html/2408.14837v1#A1.F14 <span class="ltx_text ltx_ref_tag">14</span>] in Appendix [https://arxiv.org/html/2408.14837v1#A1.SS6 <span class="ltx_text ltx_ref_tag">A.6</span>]). The raters chose the actual game over the simulation only 58% and 60% of the time (for the 1.6-second and 3.2-second clips, respectively).
</div>
[[File:./Diffusion%20Models%20Are%20Real-Time%20Game%20Engines_files/psnr_step_700k_08212004.png|frame|none|220x137px|class=ltx_graphics ltx_centering ltx_figure_panel ltx_img_landscape|alt=|caption <span class="ltx_tag ltx_tag_figure">Figure 6: </span><span id="S5.F6.2.1" class="ltx_text ltx_font_bold">Auto-regressive evaluation</span>. PSNR and LPIPS metrics over 64 auto-regressive steps.]] | |||
</div>
<div id="S5.SS2" class="section ltx_subsection">
=== <span class="ltx_tag ltx_tag_subsection">5.2 </span>Ablations === | |||
<div id="S5.SS2.p1" class="ltx_para ltx_noindent"> | |||
To evaluate the importance of the different components of our method, we sample trajectories from the evaluation dataset and compute LPIPS and PSNR metrics between the ground truth and the predicted frames.
</div> | |||
<div id="S5.SS2.SSS1" class="section ltx_subsubsection"> | |||
==== <span class="ltx_tag ltx_tag_subsubsection">5.2.1 </span>Context Length ==== | |||
<div id="S5.SS2.SSS1.p1" class="ltx_para ltx_noindent"> | |||
We evaluate the impact of changing the number <math display="inline">N</math> of past observations in the conditioning context by training models with <math display="inline">N \in {\{ 1,2,4,8,16,32,64\}}</math> (recall that our method uses <math display="inline">N = 64</math>). This affects both the number of historical frames and actions. We train the models for 200,000 steps keeping the decoder frozen and evaluate on test-set trajectories from 5 levels. See the results in Table [https://arxiv.org/html/2408.14837v1#S5.T1 <span class="ltx_text ltx_ref_tag">1</span>]. As expected, we observe that generation quality improves with the length of the context. Interestingly, we observe that while the improvement is large at first (e.g. between 1 and 2 frames), we quickly approach an asymptote and further increasing the context size provides only small improvements in quality. This is somewhat surprising as even with our maximal context length, the model only has access to a little over 3 seconds of history. Notably, we observe that much of the game state is persisted for much longer periods (see Section [https://arxiv.org/html/2408.14837v1#S7 <span class="ltx_text ltx_ref_tag">7</span>]). While the length of the conditioning context is an important limitation, Table [https://arxiv.org/html/2408.14837v1#S5.T1 <span class="ltx_text ltx_ref_tag">1</span>] hints that we’d likely need to change the architecture of our model to efficiently support longer contexts, and employ better selection of the past frames to condition on, which we leave for future work. | |||
</div> | |||
<span class="ltx_tag ltx_tag_table">Table 1: </span><span id="S5.T1.18.1" class="ltx_text ltx_font_bold">Number of history frames.</span> We ablate the number of history frames used as context using 8912 test-set examples from 5 levels. More frames generally improve both PSNR and LPIPS metrics.
<div id="S5.T1.16" class="ltx_inline-block ltx_align_center ltx_transformed_outer" style="width:198.7pt;height:106.2pt;vertical-align:-0.0pt;">
<span class="ltx_transformed_inner" style="transform:translate(-35.4pt,18.9pt) scale(0.737296541759712,0.737296541759712) ;"> </span>
{|
! History Context Length
! PSNR <math display="inline">\uparrow</math>
! LPIPS <math display="inline">\downarrow</math>
|-
| 64
| <math display="inline">22.36 \pm 0.033</math>
| <math display="inline">0.295 \pm 0.001</math>
|-
| 32
| <math display="inline">22.31 \pm 0.033</math>
| <math display="inline">0.296 \pm 0.001</math>
|-
| 16
| <math display="inline">22.28 \pm 0.033</math>
| <math display="inline">0.296 \pm 0.001</math>
|-
| 8
| <math display="inline">22.26 \pm 0.033</math>
| <math display="inline">0.296 \pm 0.001</math>
|-
| 4
| <math display="inline">22.26 \pm 0.034</math>
| <math display="inline">0.298 \pm 0.001</math>
|-
| 2
| <math display="inline">22.03 \pm 0.037</math>
| <math display="inline">0.304 \pm 0.001</math>
|-
| 1
| <math display="inline">20.94 \pm 0.044</math>
| <math display="inline">0.358 \pm 0.001</math>
|}
</div>
</div> | |||
<div id="S5.SS2.SSS2" class="section ltx_subsubsection"> | |||
==== <span class="ltx_tag ltx_tag_subsubsection">5.2.2 </span>Noise Augmentation ==== | |||
<div id="S5.SS2.SSS2.p1" class="ltx_para ltx_noindent"> | |||
To ablate the impact of noise augmentation we train a model without added noise. We evaluate both our standard model with noise augmentation and the model without added noise (after 200k training steps) auto-regressively and compute PSNR and LPIPS metrics between the predicted frames and the ground-truth over a random holdout of 512 trajectories. We report average metric values for each auto-regressive step up to a total of 64 frames in Figure [https://arxiv.org/html/2408.14837v1#S5.F7 <span class="ltx_text ltx_ref_tag">7</span>]. | |||
</div> | |||
<div id="S5.SS2.SSS2.p2" class="ltx_para ltx_noindent"> | |||
Without noise augmentation, LPIPS distance from the ground truth increases rapidly compared to our standard noise-augmented model, while PSNR drops, indicating a divergence of the simulation from ground truth.
</div>
[[File:./Diffusion%20Models%20Are%20Real-Time%20Game%20Engines_files/noise_aug_ablation_lpips_step_200k_08212024.png|frame|none|247x153px|class=ltx_graphics ltx_centering ltx_figure_panel ltx_img_landscape|alt=|caption <span class="ltx_tag ltx_tag_figure">Figure 7: </span><span id="S5.F7.2.1" class="ltx_text ltx_font_bold">Impact of Noise Augmentation.</span> The plots show average LPIPS (lower is better) and PSNR (higher is better) values for each auto-regressive step. When noise augmentation is not used quality degrades quickly after 10-20 frames. This is prevented by noise augmentation.]] | |||
</div>
<div id="S5.SS2.SSS3" class="section ltx_subsubsection">
==== <span class="ltx_tag ltx_tag_subsubsection">5.2.3 </span>Agent Play ==== | |||
<div id="S5.SS2.SSS3.p1" class="ltx_para ltx_noindent"> | |||
We compare training on agent-generated data to training on data generated using a random policy. For the random policy, we sample actions following a uniform categorical distribution that doesn’t depend on the observations. We compare the random and agent datasets by training 2 models for 700k steps along with their decoders. The models are evaluated on a dataset of 2048 human-play trajectories from 5 levels. We compare the first frame of generation, conditioned on a history context of 64 ground-truth frames, as well as a frame after 3 seconds of auto-regressive generation.
</div>
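The random baseline policy described above amounts to drawing each action uniformly, independent of the observation. A minimal sketch follows; the action names in `ACTIONS` are hypothetical placeholders and are not the action set used in the experiments.

```python
import random

# Hypothetical action set, for illustration only.
ACTIONS = ["forward", "backward", "turn_left", "turn_right", "attack", "use"]


def random_policy(observation, rng):
    """Ignore the observation and pick an action uniformly at random."""
    return rng.choice(ACTIONS)


rng = random.Random(0)
rollout = [random_policy(observation=None, rng=rng) for _ in range(10)]
```

Because the choice never looks at the observation, such a policy tends to explore only the areas reachable by undirected movement, which is consistent with the exploration limits discussed in this section.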
<div id="S5.SS2.SSS3.p2" class="ltx_para ltx_noindent"> | |||
Overall, we observe that training the model on random trajectories works surprisingly well, but is limited by the exploration ability of the random policy. When comparing single-frame generation, the agent works only slightly better, achieving a PSNR of 25.06 vs 24.42 for the random policy. When comparing a frame after 3 seconds of auto-regressive generation, the difference increases to 19.02 vs 16.84. When playing with the model manually, we observe that some areas are very easy for both, some areas are very hard for both, and in some the agent performs much better. We therefore manually split 456 examples into 3 buckets (easy, medium, and hard) based on their distance from the starting position in the game. We observe that on the easy and hard sets, the agent performs only slightly better than random, while on the medium set the difference is much larger in favor of the agent, as expected (see Table [https://arxiv.org/html/2408.14837v1#S5.T2 <span class="ltx_text ltx_ref_tag">2</span>]). See Figure [https://arxiv.org/html/2408.14837v1#A1.F13 <span class="ltx_text ltx_ref_tag">13</span>] in Appendix [https://arxiv.org/html/2408.14837v1#A1.SS5 <span class="ltx_text ltx_ref_tag">A.5</span>] for an example of the scores during a single session of human play.
</div>
<span class="ltx_tag ltx_tag_table">Table 2: </span><span id="S5.T2.16.1" class="ltx_text ltx_font_bold">Performance on Different Difficulty Levels.</span> We compare the performance of models trained using Agent-generated and Random-generated data across easy, medium, and hard splits of the dataset. Easy and medium have 112 items, hard has 232 items. Metrics are computed for each trajectory on a single frame after 3 seconds.
<div id="S5.T2.14" class="ltx_inline-block ltx_align_center ltx_transformed_outer" style="width:298.1pt;height:113pt;vertical-align:-0.0pt;">
<span class="ltx_transformed_inner" style="transform:translate(-17.2pt,6.5pt) scale(0.896469703350536,0.896469703350536) ;"> </span>
{|
! Difficulty Level
! Data Generation Policy
! PSNR <math display="inline">\uparrow</math>
! LPIPS <math display="inline">\downarrow</math>
|-
| Easy
| Agent
| <math display="inline">20.94 \pm 0.76</math>
| <math display="inline">0.48 \pm 0.01</math>
|-
|
| Random
| <math display="inline">20.20 \pm 0.83</math>
| <math display="inline">0.48 \pm 0.01</math>
|-
| Medium
| Agent
| <math display="inline">20.21 \pm 0.36</math>
| <math display="inline">0.50 \pm 0.01</math>
|-
|
| Random
| <math display="inline">16.50 \pm 0.41</math>
| <math display="inline">0.59 \pm 0.01</math>
|-
| Hard
| Agent
| <math display="inline">17.51 \pm 0.35</math>
| <math display="inline">0.60 \pm 0.01</math>
|-
|
| Random
| <math display="inline">15.39 \pm 0.43</math>
| <math display="inline">0.61 \pm 0.00</math>
|}
</div>
</div> | |||
</div> | |||
</div> | |||
<div id="S6" class="section ltx_section"> | |||
== <span class="ltx_tag ltx_tag_section">6 </span>Related Work == | |||
<div id="S6.SS0.SSS0.Px1" class="section ltx_paragraph"> | |||
===== Interactive 3D Simulation ===== | |||
<div id="S6.SS0.SSS0.Px1.p1" class="ltx_para ltx_noindent"> | |||
Simulating visual and physical processes of 2D and 3D environments and allowing interactive exploration of them is an extensively developed field in computer graphics (Akenine-Möller et al., [https://arxiv.org/html/2408.14837v1#bib.bib1 2018]). Game engines, such as Unreal and Unity, are software that processes representations of scene geometry and renders a stream of images in response to user interactions. The game engine is responsible for keeping track of all world state, e.g. the player position and movement, objects, character animation and lighting. It also tracks the game logic, e.g. points gained by accomplishing game objectives. Film and television productions use variants of ray-tracing (Shirley & Morley, [https://arxiv.org/html/2408.14837v1#bib.bib32 2008]), which are too slow and compute-intensive for real-time applications. In contrast, game engines must keep a very high frame rate (typically 30-60 FPS), and therefore rely on highly-optimized polygon rasterization, often accelerated by GPUs. Physical effects such as shadows, particles and lighting are often implemented using efficient heuristics rather than physically accurate simulation.
</div> | |||
</div> | |||
<div id="S6.SS0.SSS0.Px2" class="section ltx_paragraph"> | |||
===== Neural 3D Simulation ===== | |||
<div id="S6.SS0.SSS0.Px2.p1" class="ltx_para ltx_noindent"> | |||
Neural methods for reconstructing 3D representations have made significant advances in recent years. NeRFs (Mildenhall et al., [https://arxiv.org/html/2408.14837v1#bib.bib20 2020]) parameterize radiance fields using a deep neural network that is specifically optimized for a given scene from a set of images taken from various camera poses. Once trained, novel points of view of the scene can be sampled using volume rendering methods. Gaussian Splatting (Kerbl et al., [https://arxiv.org/html/2408.14837v1#bib.bib15 2023]) approaches build on NeRFs but represent scenes using 3D Gaussians and adapted rasterization methods, unlocking faster training and rendering times. While demonstrating impressive reconstruction results and real-time interactivity, these methods are often limited to static scenes.
</div> | |||
</div> | |||
<div id="S6.SS0.SSS0.Px3" class="section ltx_paragraph"> | |||
===== Video Diffusion Models ===== | |||
<div id="S6.SS0.SSS0.Px3.p1" class="ltx_para ltx_noindent"> | |||
Diffusion models have achieved state-of-the-art results in text-to-image generation (Saharia et al., [https://arxiv.org/html/2408.14837v1#bib.bib27 2022]; Rombach et al., [https://arxiv.org/html/2408.14837v1#bib.bib26 2022]; Ramesh et al., [https://arxiv.org/html/2408.14837v1#bib.bib25 2022]; Podell et al., [https://arxiv.org/html/2408.14837v1#bib.bib23 2023]), a line of work that has also been applied to text-to-video generation tasks (Ho et al., [https://arxiv.org/html/2408.14837v1#bib.bib14 2022]; Blattmann et al., [https://arxiv.org/html/2408.14837v1#bib.bib5 2023b]; [https://arxiv.org/html/2408.14837v1#bib.bib4 a]; Gupta et al., [https://arxiv.org/html/2408.14837v1#bib.bib9 2023]; Girdhar et al., [https://arxiv.org/html/2408.14837v1#bib.bib8 2023]; Bar-Tal et al., [https://arxiv.org/html/2408.14837v1#bib.bib3 2024]). Despite impressive advances in realism, text adherence, and temporal consistency, video diffusion models remain too slow for real-time applications. Our work extends this line of research and adapts it for real-time generation conditioned autoregressively on a history of past observations and actions.
</div> | |||
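The autoregressive conditioning described above can be illustrated with a minimal sketch. It is not the GameNGen implementation: <code>rollout</code>, <code>toy_next_frame</code>, and <code>context_len</code> are hypothetical names, and the toy function merely stands in for a diffusion sampler so that the control flow is runnable. The point is only that each generated frame is appended to a bounded history and becomes part of the conditioning for the next step.

```python
from collections import deque


def rollout(next_frame_fn, initial_frames, actions, context_len):
    """Generate one frame per action, feeding each prediction back as context.

    next_frame_fn is a hypothetical stand-in for a learned model; it receives
    the most recent frames and actions (at most context_len of each).
    """
    frames = deque(initial_frames, maxlen=context_len)
    past_actions = deque(maxlen=context_len)
    generated = []
    for action in actions:
        past_actions.append(action)
        frame = next_frame_fn(list(frames), list(past_actions))
        frames.append(frame)      # the prediction becomes future context
        generated.append(frame)
    return generated


def toy_next_frame(frames, actions):
    # Toy stand-in (assumption): a "frame" is an integer, and the next frame
    # is the previous frame plus the latest action.
    return frames[-1] + actions[-1]


print(rollout(toy_next_frame, [0], [1, 2, 3], context_len=4))  # [1, 3, 6]
```

Because the history is bounded, anything older than <code>context_len</code> steps is no longer visible to the model, which is the source of the memory limitation discussed in Section 7.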
</div> | |||
<div id="S6.SS0.SSS0.Px4" class="section ltx_paragraph"> | |||
===== Game Simulation and World Models ===== | |||
<div id="S6.SS0.SSS0.Px4.p1" class="ltx_para ltx_noindent"> | |||
Several works have attempted to train models for game simulation with action inputs. Yang et al. ([https://arxiv.org/html/2408.14837v1#bib.bib38 2023]) build a diverse dataset of real-world and simulated videos and train a diffusion model to predict a continuation video given a previous video segment and a textual description of an action. Menapace et al. ([https://arxiv.org/html/2408.14837v1#bib.bib18 2021]) and Bruce et al. ([https://arxiv.org/html/2408.14837v1#bib.bib7 2024]) focus on unsupervised learning of actions from videos. Menapace et al. ([https://arxiv.org/html/2408.14837v1#bib.bib19 2024]) convert textual prompts to game states, which are later converted to a 3D representation using NeRF. Unlike these works, we focus on <span id="S6.SS0.SSS0.Px4.p1.1.1" class="ltx_text ltx_font_italic">interactive playable real-time simulation</span>, and demonstrate robustness over long-horizon trajectories. We leverage an RL agent to explore the game environment and create rollouts of observations and interactions for training our interactive game model. Another line of work explored learning a predictive model of the environment and using it for training an RL agent. Ha & Schmidhuber ([https://arxiv.org/html/2408.14837v1#bib.bib10 2018]) train a Variational Auto-Encoder (Kingma & Welling, [https://arxiv.org/html/2408.14837v1#bib.bib17 2014]) to encode game frames into a latent vector, and then use an RNN to mimic the VizDoom game environment, training on rollouts from a random policy (i.e. selecting an action at random). A controller policy is then learned by playing within the “hallucinated” environment. Hafner et al. ([https://arxiv.org/html/2408.14837v1#bib.bib11 2020]) demonstrate that an RL agent can be trained entirely on episodes generated by a learned world model in latent space. Also close to our work is Kim et al. ([https://arxiv.org/html/2408.14837v1#bib.bib16 2020]), who use an LSTM architecture for modeling the world state, coupled with a convolutional decoder for producing output frames, jointly trained under an adversarial objective. While this approach seems to produce reasonable results for simple games like PacMan, it struggles with simulating the complex environment of VizDoom and produces blurry samples. In contrast, GameNGen is able to generate samples comparable to those of the original game, see Figure [https://arxiv.org/html/2408.14837v1#S1.F2 <span class="ltx_text ltx_ref_tag">2</span>]. Finally, concurrently with our work, Alonso et al. ([https://arxiv.org/html/2408.14837v1#bib.bib2 2024]) train a diffusion world model to predict the next observation given observation history, and iteratively train the world model and an RL model on Atari games.
</div> | |||
</div>
<div id="S6.SS0.SSS0.Px5" class="section ltx_paragraph"> | |||
===== DOOM ===== | |||
<div id="S6.SS0.SSS0.Px5.p1" class="ltx_para ltx_noindent"> | |||
When DOOM was released in 1993 it revolutionized the gaming industry. Introducing groundbreaking 3D graphics technology, it became a cornerstone of the first-person shooter genre, influencing countless other games. DOOM has been studied in numerous research works. It provides an open-source implementation and a native resolution that is low enough for small-sized models to simulate, while being complex enough to be a challenging test case. Finally, the authors have spent countless youth hours with the game. It was a trivial choice to use it in this work.
</div> | |||
</div> | |||
</div> | |||
<div id="S7" class="section ltx_section"> | |||
== <span class="ltx_tag ltx_tag_section">7 </span>Discussion == | |||
<div id="S7.p1" class="ltx_para ltx_noindent"> | |||
<span id="S7.p1.1.1" class="ltx_text ltx_font_bold">Summary.</span> We introduced ''GameNGen'', and demonstrated that high-quality real-time game play at 20 frames per second is possible on a neural model. We also provided a recipe for converting an interactive piece of software such as a computer game into a neural model. | |||
</div> | |||
<div id="S7.p2" class="ltx_para ltx_noindent"> | |||
<span id="S7.p2.1.1" class="ltx_text ltx_font_bold">Limitations.</span> GameNGen suffers from a limited amount of memory. The model only has access to a little over 3 seconds of history, so it’s remarkable that much of the game logic is persisted for drastically longer time horizons. While some of the game state is persisted through screen pixels (e.g. ammo and health tallies, available weapons, etc.), the model likely learns strong heuristics that allow meaningful generalizations. For example, from the rendered view the model learns to infer the player’s location, and from the ammo and health tallies, the model might infer whether the player has already been through an area and defeated the enemies there. That said, it’s easy to create situations where this context length is not enough. Continuing to increase the context size with our existing architecture yields only marginal benefits (Section [https://arxiv.org/html/2408.14837v1#S5.SS2.SSS1 <span class="ltx_text ltx_ref_tag">5.2.1</span>]), and the model’s short context length remains an important limitation. A second important limitation is the remaining difference between the agent’s behavior and that of human players. For example, our agent, even at the end of training, still does not explore all of the game’s locations and interactions, leading to erroneous behavior in those cases.
</div> | |||
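As a rough worked example of the memory limitation above, the sketch below converts a context length in frames into seconds of history and checks whether an earlier frame is still visible to the model. The value <code>CONTEXT_FRAMES = 64</code> is an assumption made for illustration (this section only says "a little over 3 seconds"); the 20 FPS figure comes from the Summary paragraph. The helper names are hypothetical.

```python
CONTEXT_FRAMES = 64  # assumption for illustration; not stated in this section
FPS = 20             # frame rate quoted in the Summary paragraph


def history_seconds(context_frames, fps):
    """Seconds of gameplay covered by the model's context window."""
    return context_frames / fps


def still_in_context(event_frame, current_frame, context_frames):
    """True if frame event_frame is among the context_frames frames that
    immediately precede current_frame (the frame about to be generated)."""
    return 0 < current_frame - event_frame <= context_frames


print(history_seconds(CONTEXT_FRAMES, FPS))
print(still_in_context(0, 64, CONTEXT_FRAMES))  # frame 0 is the oldest visible frame
print(still_in_context(0, 65, CONTEXT_FRAMES))  # one step later it has dropped out
```

Under these assumptions, anything that happened more than 64 frames ago can only influence generation if it is still reflected in recent pixels, such as the HUD tallies mentioned above.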
<div id="S7.p3" class="ltx_para ltx_noindent"> | |||
<span id="S7.p3.1.1" class="ltx_text ltx_font_bold">Future Work.</span> We demonstrate ''GameNGen'' on the classic game DOOM. It would be interesting to test it on other games or, more generally, on other interactive software systems. We note that nothing in our technique is DOOM-specific except for the reward function for the RL-agent, and we plan on addressing that in future work. While ''GameNGen'' manages to maintain game state accurately, it isn’t perfect, as per the discussion above, and a more sophisticated architecture might be needed to mitigate these shortcomings. ''GameNGen'' currently has a limited capability to leverage more than a minimal amount of memory, and experimenting with ways to expand the memory effectively could be critical for more complex games or software. ''GameNGen'' runs at 20 or 50 FPS<span id="footnote2" class="ltx_note ltx_role_footnote"><sup>2</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup>2</sup><span class="ltx_tag ltx_tag_note">2</span>Faster than the original game DOOM ran on some of the authors’ 80386 machines at the time!</span></span></span> on a TPUv5. It would be interesting to experiment with further optimization techniques to get it to run at higher frame rates and on consumer hardware.
</div> | |||
<div id="S7.p4" class="ltx_para ltx_noindent"> | |||
<span id="S7.p4.1.1" class="ltx_text ltx_font_bold">Towards a New Paradigm for Interactive Video Games.</span> Today, video games are ''programmed'' by humans. ''GameNGen'' is a proof-of-concept for one part of a new paradigm where games are weights of a neural model, not lines of code. ''GameNGen'' shows that an architecture and model weights exist such that a neural model can effectively run a complex game (DOOM) interactively on existing hardware. While many important questions remain, we are hopeful that this paradigm could have important benefits. For example, the development process for video games under this new paradigm might be less costly and more accessible, whereby games could be developed and edited via textual descriptions or example images. A small part of this vision, namely creating modifications or novel behaviors for existing games, might be achievable in the shorter term. For example, we might be able to convert a set of frames into a new playable level or create a new character just based on example images, without having to author code. Other advantages of this new paradigm include strong guarantees on frame rates and memory footprints. We have not experimented with these directions yet and much more work is required here, but we are excited to try! Hopefully this small step will someday contribute to a meaningful improvement in people’s experience with video games, or maybe even more generally, in day-to-day interactions with interactive software systems.
</div> | |||
</div> | |||
<div id="Sx1" class="section ltx_section"> | |||
== Acknowledgements == | |||
<div id="Sx1.p1" class="ltx_para ltx_noindent"> | |||
We’d like to extend a huge thank you to Eyal Segalis, Eyal Molad, Matan Kalman, Nataniel Ruiz, Amir Hertz, Matan Cohen, Yossi Matias, Yael Pritch, Danny Lumen, Valerie Nygaard, the Theta Labs and Google Research teams, and our families for insightful feedback, ideas, suggestions, and support.
</div> | |||
</div> | |||
<div id="Sx2" class="section ltx_section"> | |||
== Contribution == | |||
<div id="Sx2.p1" class="ltx_para ltx_noindent"> | |||
<ul> | |||
<li><span id="Sx2.I1.i1"><span class="ltx_tag ltx_tag_item">•</span></span> | |||
<div id="Sx2.I1.i1.p1" class="ltx_para"> | |||
<p><span id="Sx2.I1.i1.p1.1.1" class="ltx_text ltx_font_bold">Dani Valevski</span>: Developed much of the codebase, tuned parameters and details across the system, added autoencoder fine-tuning, agent training, and distillation.</p> | |||
</div></li> | |||
<li><span id="Sx2.I1.i2"><span class="ltx_tag ltx_tag_item">•</span></span> | |||
<div id="Sx2.I1.i2.p1" class="ltx_para"> | |||
<p><span id="Sx2.I1.i2.p1.1.1" class="ltx_text ltx_font_bold">Yaniv Leviathan</span>: Proposed project, method, and architecture, developed the initial implementation, key contributor to implementation and writing.</p> | |||
</div></li> | |||
<li><span id="Sx2.I1.i3"><span class="ltx_tag ltx_tag_item">•</span></span> | |||
<div id="Sx2.I1.i3.p1" class="ltx_para"> | |||
<p><span id="Sx2.I1.i3.p1.1.1" class="ltx_text ltx_font_bold">Moab Arar</span>: Led auto-regressive stabilization with noise-augmentation, many of the ablations, and created the dataset of human-play data.</p> | |||
</div></li> | |||
<li><span id="Sx2.I1.i4"><span class="ltx_tag ltx_tag_item">•</span></span> | |||
<div id="Sx2.I1.i4.p1" class="ltx_para ltx_noindent"> | |||
<p><span id="Sx2.I1.i4.p1.1.1" class="ltx_text ltx_font_bold">Shlomi Fruchter</span>: Proposed project, method, and architecture. Project leadership, initial implementation using DOOM, main manuscript writing, evaluation metrics, random policy data pipeline.</p> | |||
</div></li></ul> | |||
</div> | |||
<div id="Sx2.p2" class="ltx_para ltx_noindent"> | |||
Correspondence to <span id="Sx2.p2.1.1" class="ltx_text ltx_font_typewriter">shlomif@google.com</span> and <span id="Sx2.p2.1.2" class="ltx_text ltx_font_typewriter">leviathan@google.com</span>. | |||
</div> | |||
</div> | |||
<div id="bib" class="section ltx_bibliography"> | |||
== References == | |||
<ul> | |||
<li><span id="bib.bib1">Akenine-Möller et al. (2018)</span>
↑
<span class="ltx_bibblock"> Tomas Akenine-Möller, Eric Haines, and Naty Hoffman. </span> <span class="ltx_bibblock">''Real-Time Rendering, Fourth Edition''. </span> <span class="ltx_bibblock">A. K. Peters, Ltd., USA, 4th edition, 2018. </span> <span class="ltx_bibblock">ISBN 0134997832. </span></li>
<li><span id="bib.bib2">Alonso et al. (2024)</span> | |||
↑ | |||
<span class="ltx_bibblock"> Eloi Alonso, Adam Jelley, Vincent Micheli, Anssi Kanervisto, Amos Storkey, Tim Pearce, and François Fleuret. </span> <span class="ltx_bibblock">Diffusion for world modeling: Visual details matter in atari, 2024. </span></li> | |||
<li><span id="bib.bib3">Bar-Tal et al. (2024)</span> | |||
↑ | |||
<span class="ltx_bibblock"> Omer Bar-Tal, Hila Chefer, Omer Tov, Charles Herrmann, Roni Paiss, Shiran Zada, Ariel Ephrat, Junhwa Hur, Guanghui Liu, Amit Raj, Yuanzhen Li, Michael Rubinstein, Tomer Michaeli, Oliver Wang, Deqing Sun, Tali Dekel, and Inbar Mosseri. </span> <span class="ltx_bibblock">Lumiere: A space-time diffusion model for video generation, 2024. </span> <span class="ltx_bibblock">URL https://arxiv.org/abs/2401.12945. </span></li> | |||
<li><span id="bib.bib4">Blattmann et al. (2023a)</span> | |||
↑ | |||
<span class="ltx_bibblock"> Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, Varun Jampani, and Robin Rombach. </span> <span class="ltx_bibblock">Stable video diffusion: Scaling latent video diffusion models to large datasets, 2023a. </span> <span class="ltx_bibblock">URL https://arxiv.org/abs/2311.15127. </span></li> | |||
<li><span id="bib.bib5">Blattmann et al. (2023b)</span> | |||
↑ | |||
<span class="ltx_bibblock"> Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. </span> <span class="ltx_bibblock">Align your latents: High-resolution video synthesis with latent diffusion models, 2023b. </span> <span class="ltx_bibblock">URL https://arxiv.org/abs/2304.08818. </span></li> | |||
<li><span id="bib.bib6">Brooks et al. (2024)</span> | |||
↑ | |||
<span class="ltx_bibblock"> Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh. </span> <span class="ltx_bibblock">Video generation models as world simulators, 2024. </span> <span class="ltx_bibblock">URL https://openai.com/research/video-generation-models-as-world-simulators. </span></li> | |||
<li><span id="bib.bib7">Bruce et al. (2024)</span> | |||
↑ | |||
<span class="ltx_bibblock"> Jake Bruce, Michael Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, Yusuf Aytar, Sarah Bechtle, Feryal Behbahani, Stephanie Chan, Nicolas Heess, Lucy Gonzalez, Simon Osindero, Sherjil Ozair, Scott Reed, Jingwei Zhang, Konrad Zolna, Jeff Clune, Nando de Freitas, Satinder Singh, and Tim Rocktäschel. </span> <span class="ltx_bibblock">Genie: Generative interactive environments, 2024. </span> <span class="ltx_bibblock">URL https://arxiv.org/abs/2402.15391. </span></li> | |||
<li><span id="bib.bib8">Girdhar et al. (2023)</span> | |||
↑ | |||
<span class="ltx_bibblock"> Rohit Girdhar, Mannat Singh, Andrew Brown, Quentin Duval, Samaneh Azadi, Sai Saketh Rambhatla, Akbar Shah, Xi Yin, Devi Parikh, and Ishan Misra. </span> <span class="ltx_bibblock">Emu video: Factorizing text-to-video generation by explicit image conditioning, 2023. </span> <span class="ltx_bibblock">URL https://arxiv.org/abs/2311.10709. </span></li> | |||
<li><span id="bib.bib9">Gupta et al. (2023)</span> | |||
↑ | |||
<span class="ltx_bibblock"> Agrim Gupta, Lijun Yu, Kihyuk Sohn, Xiuye Gu, Meera Hahn, Li Fei-Fei, Irfan Essa, Lu Jiang, and José Lezama. </span> <span class="ltx_bibblock">Photorealistic video generation with diffusion models, 2023. </span> <span class="ltx_bibblock">URL https://arxiv.org/abs/2312.06662. </span></li> | |||
<li><span id="bib.bib10">Ha & Schmidhuber (2018)</span> | |||
↑ | |||
<span class="ltx_bibblock"> David Ha and Jürgen Schmidhuber. </span> <span class="ltx_bibblock">World models, 2018. </span></li> | |||
<li><span id="bib.bib11">Hafner et al. (2020)</span> | |||
↑ | |||
<span class="ltx_bibblock"> Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. </span> <span class="ltx_bibblock">Dream to control: Learning behaviors by latent imagination, 2020. </span> <span class="ltx_bibblock">URL https://arxiv.org/abs/1912.01603. </span></li> | |||
<li><span id="bib.bib12">Ho & Salimans (2022)</span> | |||
↑ | |||
<span class="ltx_bibblock"> Jonathan Ho and Tim Salimans. </span> <span class="ltx_bibblock">Classifier-free diffusion guidance, 2022. </span> <span class="ltx_bibblock">URL https://arxiv.org/abs/2207.12598. </span></li> | |||
<li><span id="bib.bib13">Ho et al. (2021)</span> | |||
↑ | |||
<span class="ltx_bibblock"> Jonathan Ho, Chitwan Saharia, William Chan, David J Fleet, Mohammad Norouzi, and Tim Salimans. </span> <span class="ltx_bibblock">Cascaded diffusion models for high fidelity image generation. </span> <span class="ltx_bibblock">''arXiv preprint arXiv:2106.15282'', 2021. </span></li> | |||
<li><span id="bib.bib14">Ho et al. (2022)</span> | |||
↑ | |||
<span class="ltx_bibblock"> Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey A. Gritsenko, Diederik P. Kingma, Ben Poole, Mohammad Norouzi, David J. Fleet, and Tim Salimans. </span> <span class="ltx_bibblock">Imagen video: High definition video generation with diffusion models. </span> <span class="ltx_bibblock">''ArXiv'', abs/2210.02303, 2022. </span> <span class="ltx_bibblock">URL https://api.semanticscholar.org/CorpusID:252715883. </span></li> | |||
<li><span id="bib.bib15">Kerbl et al. (2023)</span> | |||
↑ | |||
<span class="ltx_bibblock"> Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. </span> <span class="ltx_bibblock">3d gaussian splatting for real-time radiance field rendering. </span> <span class="ltx_bibblock">''ACM Transactions on Graphics'', 42(4), July 2023. </span> <span class="ltx_bibblock">URL https://repo-sam.inria.fr/fungraph/3d-gaussian-splatting/. </span></li> | |||
<li><span id="bib.bib16">Kim et al. (2020)</span> | |||
↑ | |||
<span class="ltx_bibblock"> Seung Wook Kim, Yuhao Zhou, Jonah Philion, Antonio Torralba, and Sanja Fidler. </span> <span class="ltx_bibblock">Learning to Simulate Dynamic Environments with GameGAN. </span> <span class="ltx_bibblock">In ''IEEE Conference on Computer Vision and Pattern Recognition (CVPR)'', Jun. 2020. </span></li> | |||
<li><span id="bib.bib17">Kingma & Welling (2014)</span> | |||
↑ | |||
<span class="ltx_bibblock"> Diederik P. Kingma and Max Welling. </span> <span class="ltx_bibblock">Auto-Encoding Variational Bayes. </span> <span class="ltx_bibblock">In ''2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings'', 2014. </span></li> | |||
<li><span id="bib.bib18">Menapace et al. (2021)</span> | |||
↑ | |||
<span class="ltx_bibblock"> Willi Menapace, Stéphane Lathuilière, Sergey Tulyakov, Aliaksandr Siarohin, and Elisa Ricci. </span> <span class="ltx_bibblock">Playable video generation, 2021. </span> <span class="ltx_bibblock">URL https://arxiv.org/abs/2101.12195. </span></li> | |||
<li><span id="bib.bib19">Menapace et al. (2024)</span> | |||
↑ | |||
<span class="ltx_bibblock"> Willi Menapace, Aliaksandr Siarohin, Stéphane Lathuilière, Panos Achlioptas, Vladislav Golyanik, Sergey Tulyakov, and Elisa Ricci. </span> <span class="ltx_bibblock">Promptable game models: Text-guided game simulation via masked diffusion models. </span> <span class="ltx_bibblock">''ACM Transactions on Graphics'', 43(2):1–16, January 2024. </span> <span class="ltx_bibblock">ISSN 1557-7368. </span> <span class="ltx_bibblock">doi: <span class="ltx_ref ltx_nolink ltx_Url ltx_ref_self">10.1145/3635705</span>. </span> <span class="ltx_bibblock">URL http://dx.doi.org/10.1145/3635705. </span></li> | |||
<li><span id="bib.bib20">Mildenhall et al. (2020)</span> | |||
↑ | |||
<span class="ltx_bibblock"> Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. </span> <span class="ltx_bibblock">Nerf: Representing scenes as neural radiance fields for view synthesis. </span> <span class="ltx_bibblock">In ''ECCV'', 2020. </span></li> | |||
<li><span id="bib.bib21">Mnih et al. (2015)</span> | |||
↑ | |||
<span class="ltx_bibblock"> Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin A. Riedmiller, Andreas Kirkeby Fidjeland, Georg Ostrovski, Stig Petersen, Charlie Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. </span> <span class="ltx_bibblock">Human-level control through deep reinforcement learning. </span> <span class="ltx_bibblock">''Nature'', 518:529–533, 2015. </span> <span class="ltx_bibblock">URL https://api.semanticscholar.org/CorpusID:205242740. </span></li> | |||
<li><span id="bib.bib22">Petric & Milinkovic (2018)</span> | |||
↑ | |||
<span class="ltx_bibblock"> Danko Petric and Marija Milinkovic. </span> <span class="ltx_bibblock">Comparison between cs and jpeg in terms of image compression, 2018. </span> <span class="ltx_bibblock">URL https://arxiv.org/abs/1802.05114. </span></li> | |||
<li><span id="bib.bib23">Podell et al. (2023)</span> | |||
↑ | |||
<span class="ltx_bibblock"> Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. </span> <span class="ltx_bibblock">Sdxl: Improving latent diffusion models for high-resolution image synthesis. </span> <span class="ltx_bibblock">''arXiv preprint arXiv:2307.01952'', 2023. </span></li> | |||
<li><span id="bib.bib24">Raffin et al. (2021)</span> | |||
↑ | |||
<span class="ltx_bibblock"> Antonin Raffin, Ashley Hill, Adam Gleave, Anssi Kanervisto, Maximilian Ernestus, and Noah Dormann. </span> <span class="ltx_bibblock">Stable-baselines3: Reliable reinforcement learning implementations. </span> <span class="ltx_bibblock">''Journal of Machine Learning Research'', 22(268):1–8, 2021. </span> <span class="ltx_bibblock">URL http://jmlr.org/papers/v22/20-1364.html. </span></li> | |||
<li><span id="bib.bib25">Ramesh et al. (2022)</span> | |||
↑ | |||
<span class="ltx_bibblock"> Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. </span> <span class="ltx_bibblock">Hierarchical text-conditional image generation with clip latents. </span> <span class="ltx_bibblock">''arXiv preprint arXiv:2204.06125'', 2022. </span></li> | |||
<li><span id="bib.bib26">Rombach et al. (2022)</span> | |||
↑ | |||
<span class="ltx_bibblock"> Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. </span> <span class="ltx_bibblock">High-resolution image synthesis with latent diffusion models. </span> <span class="ltx_bibblock">In ''Proceedings of the IEEE/CVF conference on computer vision and pattern recognition'', pp. 10684–10695, 2022. </span></li> | |||
<li><span id="bib.bib27">Saharia et al. (2022)</span> | |||
↑ | |||
<span class="ltx_bibblock"> Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. </span> <span class="ltx_bibblock">Photorealistic text-to-image diffusion models with deep language understanding. </span> <span class="ltx_bibblock">''Advances in Neural Information Processing Systems'', 35:36479–36494, 2022. </span></li> | |||
<li><span id="bib.bib28">Salimans & Ho (2022a)</span> | |||
↑ | |||
<span class="ltx_bibblock"> Tim Salimans and Jonathan Ho. </span> <span class="ltx_bibblock">Progressive distillation for fast sampling of diffusion models. </span> <span class="ltx_bibblock">In ''The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022''. OpenReview.net, 2022a. </span> <span class="ltx_bibblock">URL https://openreview.net/forum?id=TIdIXIpzhoI. </span></li> | |||
<li><span id="bib.bib29">Salimans & Ho (2022b)</span> | |||
↑ | |||
<span class="ltx_bibblock"> Tim Salimans and Jonathan Ho. </span> <span class="ltx_bibblock">Progressive distillation for fast sampling of diffusion models, 2022b. </span> <span class="ltx_bibblock">URL https://arxiv.org/abs/2202.00512. </span></li> | |||
<li><span id="bib.bib30">Schulman et al. (2017)</span> | |||
↑ | |||
<span class="ltx_bibblock"> John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. </span> <span class="ltx_bibblock">Proximal policy optimization algorithms. </span> <span class="ltx_bibblock">''CoRR'', abs/1707.06347, 2017. </span> <span class="ltx_bibblock">URL http://arxiv.org/abs/1707.06347. </span></li> | |||
<li><span id="bib.bib31">Shazeer & Stern (2018)</span> | |||
↑ | |||
<span class="ltx_bibblock"> Noam Shazeer and Mitchell Stern. </span> <span class="ltx_bibblock">Adafactor: Adaptive learning rates with sublinear memory cost. </span> <span class="ltx_bibblock">''CoRR'', abs/1804.04235, 2018. </span> <span class="ltx_bibblock">URL http://arxiv.org/abs/1804.04235. </span></li> | |||
<li><span id="bib.bib32">Shirley & Morley (2008)</span> | |||
↑ | |||
<span class="ltx_bibblock"> P. Shirley and R.K. Morley. </span> <span class="ltx_bibblock">''Realistic Ray Tracing, Second Edition''. </span> <span class="ltx_bibblock">Taylor & Francis, 2008. </span> <span class="ltx_bibblock">ISBN 9781568814612. </span> <span class="ltx_bibblock">URL https://books.google.ch/books?id=knpN6mnhJ8QC. </span></li> | |||
<li><span id="bib.bib33">Song et al. (2020)</span> | |||
↑ | |||
<span class="ltx_bibblock"> Jiaming Song, Chenlin Meng, and Stefano Ermon. </span> <span class="ltx_bibblock">Denoising diffusion implicit models. </span> <span class="ltx_bibblock">''arXiv:2010.02502'', October 2020. </span> <span class="ltx_bibblock">URL https://arxiv.org/abs/2010.02502. </span></li> | |||
<li><span id="bib.bib34">Song et al. (2022)</span> | |||
↑ | |||
<span class="ltx_bibblock"> Jiaming Song, Chenlin Meng, and Stefano Ermon. </span> <span class="ltx_bibblock">Denoising diffusion implicit models, 2022. </span> <span class="ltx_bibblock">URL https://arxiv.org/abs/2010.02502. </span></li> | |||
<li><span id="bib.bib35">Unterthiner et al. (2019)</span> | |||
↑ | |||
<span class="ltx_bibblock"> Thomas Unterthiner, Sjoerd van Steenkiste, Karol Kurach, Raphaël Marinier, Marcin Michalski, and Sylvain Gelly. </span> <span class="ltx_bibblock">FVD: A new metric for video generation. </span> <span class="ltx_bibblock">In ''Deep Generative Models for Highly Structured Data, ICLR 2019 Workshop, New Orleans, Louisiana, United States, May 6, 2019'', 2019. </span></li> | |||
<li><span id="bib.bib36">Wang et al. (2023)</span> | |||
↑ | |||
<span class="ltx_bibblock"> Zhengyi Wang, Cheng Lu, Yikai Wang, Fan Bao, Chongxuan Li, Hang Su, and Jun Zhu. </span> <span class="ltx_bibblock">Prolificdreamer: High-fidelity and diverse text-to-3d generation with variational score distillation. </span> <span class="ltx_bibblock">''arXiv preprint arXiv:2305.16213'', 2023. </span></li> | |||
<li><span id="bib.bib37">Wydmuch et al. (2019)</span> | |||
↑ | |||
<span class="ltx_bibblock"> Marek Wydmuch, Michał Kempka, and Wojciech Jaśkowski. </span> <span class="ltx_bibblock">ViZDoom Competitions: Playing Doom from Pixels. </span> <span class="ltx_bibblock">''IEEE Transactions on Games'', 11(3):248–259, 2019. </span> <span class="ltx_bibblock">doi: <span class="ltx_ref ltx_nolink ltx_Url ltx_ref_self">10.1109/TG.2018.2877047</span>. </span> <span class="ltx_bibblock">The 2022 IEEE Transactions on Games Outstanding Paper Award. </span></li> | |||
<li><span id="bib.bib38">Yang et al. (2023)</span> | |||
↑ | |||
<span class="ltx_bibblock"> Mengjiao Yang, Yilun Du, Kamyar Ghasemipour, Jonathan Tompson, Dale Schuurmans, and Pieter Abbeel. </span> <span class="ltx_bibblock">Learning interactive real-world simulators. </span> <span class="ltx_bibblock">''arXiv preprint arXiv:2310.06114'', 2023. </span></li> | |||
<li><span id="bib.bib39">Yin et al. (2024)</span> | |||
↑ | |||
<span class="ltx_bibblock"> Tianwei Yin, Michaël Gharbi, Richard Zhang, Eli Shechtman, Frédo Durand, William T Freeman, and Taesung Park. </span> <span class="ltx_bibblock">One-step diffusion with distribution matching distillation. </span> <span class="ltx_bibblock">In ''CVPR'', 2024. </span></li> | |||
<li><span id="bib.bib40">Zhang et al. (2018)</span> | |||
↑ | |||
<span class="ltx_bibblock"> Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. </span> <span class="ltx_bibblock">The unreasonable effectiveness of deep features as a perceptual metric. </span> <span class="ltx_bibblock">In ''CVPR'', 2018. </span></li></ul> | |||
</div>
<div class="ltx_pagination ltx_role_newpage">
</div>
<div class="ltx_pagination ltx_role_newpage">
</div>
<div id="A1" class="section ltx_appendix">
== <span class="ltx_tag ltx_tag_appendix">Appendix A </span>Appendix == | |||
<div id="A1.SS1" class="section ltx_subsection"> | |||
=== <span class="ltx_tag ltx_tag_subsection">A.1 </span>Samples === | |||
<div id="A1.SS1.p1" class="ltx_para"> | |||
Figs. [https://arxiv.org/html/2408.14837v1#A1.F8 <span class="ltx_text ltx_ref_tag">8</span>], [https://arxiv.org/html/2408.14837v1#A1.F9 <span class="ltx_text ltx_ref_tag">9</span>], [https://arxiv.org/html/2408.14837v1#A1.F10 <span class="ltx_text ltx_ref_tag">10</span>], [https://arxiv.org/html/2408.14837v1#A1.F11 <span class="ltx_text ltx_ref_tag">11</span>] provide selected samples from GameNGen.
</div>
[[File:./Diffusion%20Models%20Are%20Real-Time%20Game%20Engines_files/context_1.png|frame|none|548x103px|class=ltx_graphics ltx_centering ltx_figure_panel ltx_img_landscape|alt=|caption <span class="ltx_tag ltx_tag_figure">Figure 8: </span><span id="A1.F8.2.1" class="ltx_text ltx_font_bold">Auto-regressive evaluation of the simulation model: Sample #1</span>. Top row: Context frames. Middle row: Ground truth frames. Bottom row: Model predictions.]]
[[File:./Diffusion%20Models%20Are%20Real-Time%20Game%20Engines_files/context_3.png|frame|none|548x103px|class=ltx_graphics ltx_centering ltx_figure_panel ltx_img_landscape|alt=|caption <span class="ltx_tag ltx_tag_figure">Figure 9: </span><span id="A1.F9.2.1" class="ltx_text ltx_font_bold">Auto-regressive evaluation of the simulation model: Sample #2</span>. Top row: Context frames. Middle row: Ground truth frames. Bottom row: Model predictions.]] | |||
[[File:./Diffusion%20Models%20Are%20Real-Time%20Game%20Engines_files/context_5.png|frame|none|548x103px|class=ltx_graphics ltx_centering ltx_figure_panel ltx_img_landscape|alt=|caption <span class="ltx_tag ltx_tag_figure">Figure 10: </span><span id="A1.F10.2.1" class="ltx_text ltx_font_bold">Auto-regressive evaluation of the simulation model: Sample #3</span>. Top row: Context frames. Middle row: Ground truth frames. Bottom row: Model predictions.]] | |||
[[File:./Diffusion%20Models%20Are%20Real-Time%20Game%20Engines_files/context_7.png|frame|none|548x103px|class=ltx_graphics ltx_centering ltx_figure_panel ltx_img_landscape|alt=|caption <span class="ltx_tag ltx_tag_figure">Figure 11: </span><span id="A1.F11.2.1" class="ltx_text ltx_font_bold">Auto-regressive evaluation of the simulation model: Sample #4</span>. Top row: Context frames. Middle row: Ground truth frames. Bottom row: Model predictions.]] | |||
<div class="ltx_pagination ltx_role_newpage"> | |||
</div> | |||
</div> | </div> | ||
<div id=" | <div id="A1.SS2" class="section ltx_subsection"> | ||
=== <span class="ltx_tag ltx_tag_subsection">A.2 </span>Fine-Tuning Latent Decoder Examples === | |||
<div id="A1.SS2.p1" class="ltx_para"> | |||
Fig. [https://arxiv.org/html/2408.14837v1#A1.F12 <span class="ltx_text ltx_ref_tag">12</span>] demonstrates the effect of fine-tuning the VAE decoder.
</div> | </div> | ||
<span | [[File:./Diffusion%20Models%20Are%20Real-Time%20Game%20Engines_files/fine_tuning_autoencoder.png|frame|none|548x742px|class=ltx_graphics ltx_centering ltx_img_portrait|alt=|caption <span class="ltx_tag ltx_tag_figure">Figure 12: </span>A comparison of generations with the standard latent decoder from Stable Diffusion v1.4 (Left), our fine-tuned decoder (Middle), and ground truth (Right). Artifacts in the frozen decoder are noticeable (e.g. in the numbers in the bottom HUD).]] | ||
<div class="ltx_pagination ltx_role_newpage"> | |||
</div> | |||
</div> | </div> | ||
<span | <div id="A1.SS3" class="section ltx_subsection"> | ||
=== <span class="ltx_tag ltx_tag_subsection">A.3 </span>Reward Function === | |||
<div id="A1.SS3.p1" class="ltx_para ltx_noindent"> | |||
The RL-agent’s reward function, the only part of our method which is specific to the game Doom, is a sum of the following conditions:
</div> | |||
<div id="A1.SS3.p2" class="ltx_para ltx_noindent"> | |||
<ol> | |||
<li><span id="A1.I1.i1"><span class="ltx_tag ltx_tag_item">1.</span></span> | |||
<div id="A1.I1.i1.p1" class="ltx_para"> | |||
<p>Player hit: -100 points.</p> | |||
</div></li> | |||
<li><span id="A1.I1.i2"><span class="ltx_tag ltx_tag_item">2.</span></span> | |||
<div id="A1.I1.i2.p1" class="ltx_para"> | |||
<p>Player death: -5,000 points.</p> | |||
</div></li> | |||
<li><span id="A1.I1.i3"><span class="ltx_tag ltx_tag_item">3.</span></span> | |||
<div id="A1.I1.i3.p1" class="ltx_para"> | |||
<p>Enemy hit: 300 points.</p> | |||
</div></li> | |||
<li><span id="A1.I1.i4"><span class="ltx_tag ltx_tag_item">4.</span></span> | |||
<div id="A1.I1.i4.p1" class="ltx_para"> | |||
<p>Enemy kill: 1,000 points.</p> | |||
</div></li> | |||
<li><span id="A1.I1.i5"><span class="ltx_tag ltx_tag_item">5.</span></span> | |||
<div id="A1.I1.i5.p1" class="ltx_para"> | |||
<p>Item/weapon pick up: 100 points.</p> | |||
</div></li> | |||
<li><span id="A1.I1.i6"><span class="ltx_tag ltx_tag_item">6.</span></span> | |||
<div id="A1.I1.i6.p1" class="ltx_para"> | |||
<p>Secret found: 500 points.</p> | |||
</div></li> | |||
<li><span id="A1.I1.i7"><span class="ltx_tag ltx_tag_item">7.</span></span> | |||
<div id="A1.I1.i7.p1" class="ltx_para"> | |||
<p>New area: 20 * (1 + 0.5 * <math display="inline">L_{1}</math> distance) points.</p> | |||
</div></li> | |||
<li><span id="A1.I1.i8"><span class="ltx_tag ltx_tag_item">8.</span></span> | |||
<div id="A1.I1.i8.p1" class="ltx_para"> | |||
<p>Health delta: 10 * delta points.</p> | |||
</div></li> | |||
<li><span id="A1.I1.i9"><span class="ltx_tag ltx_tag_item">9.</span></span> | |||
<div id="A1.I1.i9.p1" class="ltx_para"> | |||
<p>Armor delta: 10 * delta points.</p> | |||
</div></li> | |||
<li><span id="A1.I1.i10"><span class="ltx_tag ltx_tag_item">10.</span></span> | |||
<div id="A1.I1.i10.p1" class="ltx_para ltx_noindent"> | |||
<p>Ammo delta: 10 * max(0, delta) + min(0, delta) points.</p> | |||
</div></li></ol> | |||
</div> | |||
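The ten terms above can be sketched as a single per-step sum. This is only an illustrative sketch, not the authors' implementation: the `StepEvents` container, its field names, and the assumption that events are counted once per agent step are hypothetical, and the source does not say how the <math display="inline">L_{1}</math> distance for a new area is measured.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StepEvents:
    """Events observed in one agent step (hypothetical container, not from the paper)."""
    player_hits: int = 0
    player_deaths: int = 0
    enemy_hits: int = 0
    enemy_kills: int = 0
    pickups: int = 0                     # items or weapons picked up
    secrets_found: int = 0
    new_area_l1: Optional[float] = None  # L1 distance for a newly reached area, else None
    health_delta: float = 0.0
    armor_delta: float = 0.0
    ammo_delta: float = 0.0


def step_reward(ev: StepEvents) -> float:
    """Sum the ten reward terms listed above for a single step."""
    reward = 0.0
    reward += -100.0 * ev.player_hits
    reward += -5000.0 * ev.player_deaths
    reward += 300.0 * ev.enemy_hits
    reward += 1000.0 * ev.enemy_kills
    reward += 100.0 * ev.pickups
    reward += 500.0 * ev.secrets_found
    if ev.new_area_l1 is not None:
        reward += 20.0 * (1.0 + 0.5 * ev.new_area_l1)
    reward += 10.0 * ev.health_delta
    reward += 10.0 * ev.armor_delta
    # Ammo gains are weighted by 10, ammo spent is penalized at weight 1.
    reward += 10.0 * max(0.0, ev.ammo_delta) + min(0.0, ev.ammo_delta)
    return reward
```

Note the asymmetry in the ammo term: picking up ammo is rewarded ten times more strongly than firing it is penalized, so shooting remains cheap relative to the reward for hitting an enemy.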
<div id="A1.SS3.p3" class="ltx_para ltx_noindent"> | |||
Further, to encourage the agent to simulate smooth human play, we apply each agent action for 4 frames and additionally artificially increase the probability of repeating the previous action.
</div> | |||
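One possible way to realize the action smoothing described above is sketched below. The 4-frame repeat comes from the text; the mixing rule in `boost_repeat` and the `repeat_bonus` value are assumptions made for illustration, since the source does not state how the repeat probability is increased.

```python
import random

FRAMES_PER_ACTION = 4  # each chosen action is applied for 4 frames (from the text)


def boost_repeat(probs, prev_action, repeat_bonus):
    """Shift probability mass toward the previous action (assumed mixing rule)."""
    if prev_action is None:
        return list(probs)
    mixed = [(1.0 - repeat_bonus) * p for p in probs]
    mixed[prev_action] += repeat_bonus
    return mixed


def rollout_actions(policy, num_decisions, repeat_bonus=0.2, seed=0):
    """Return one action index per frame for `num_decisions` agent decisions.

    `policy` is a hypothetical callable returning a list of action probabilities.
    """
    rng = random.Random(seed)
    prev = None
    frames = []
    for _ in range(num_decisions):
        probs = boost_repeat(policy(), prev, repeat_bonus)
        action = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        frames.extend([action] * FRAMES_PER_ACTION)
        prev = action
    return frames
```

With a uniform policy over four actions, `rollout_actions(lambda: [0.25] * 4, 5)` yields 20 frame-level actions in runs of four.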
</div> | |||
<div id="A1.SS4" class="section ltx_subsection"> | |||
=== <span class="ltx_tag ltx_tag_subsection">A.4 </span>Reducing Inference Steps === | |||
<div id="A1.SS4.p1" class="ltx_para ltx_noindent"> | |||
We evaluated the performance of a GameNGen model with varying amounts of sampling steps when generating 2048 frames using teacher-forced trajectories on 35FPS data (the maximal sampling rate allowed by VizDoom, lower than the maximal rate our model achieves with distillation, see below). Surprisingly, we observe that quality does not deteriorate when decreasing the number of steps to 4, but does deteriorate when using just a single sampling step (see Table [https://arxiv.org/html/2408.14837v1#A1.T3 <span class="ltx_text ltx_ref_tag">3</span>]). | |||
</div> | |||
<div id="A1.SS4.p2" class="ltx_para ltx_noindent"> | |||
As a potential remedy, we experimented with distilling our model, following Wang et al. ([https://arxiv.org/html/2408.14837v1#bib.bib36 2023]) and Yin et al. ([https://arxiv.org/html/2408.14837v1#bib.bib39 2024]). During distillation training we use 3 U-Nets, all initialized with a GameNGen model: generator, teacher, and fake-score model. The teacher remains frozen throughout the training. The fake-score model is continuously trained to predict the outputs of the generator with the standard diffusion loss. To train the generator, we use the teacher and the fake-score model to predict the noise added to an input image - <math display="inline">\epsilon_{\text{real}}</math> and <math display="inline">\epsilon_{\text{fake}}</math>. We optimize the weights of the generator to minimize the generator gradient value at each pixel weighted by <math display="inline">\epsilon_{\text{real}} - \epsilon_{\text{fake}}</math>. When distilling we use a CFG of 1.5 to generate <math display="inline">\epsilon_{\text{real}}</math>. We train for 1000 steps with a batch size of 128. Note that unlike Yin et al. ([https://arxiv.org/html/2408.14837v1#bib.bib39 2024]) we train with varying amounts of noise and do not use a regularization loss (we hope to explore other distillation variants in future work). With distillation we are able to significantly improve the quality of a 1-step model (see “D” in Table [https://arxiv.org/html/2408.14837v1#A1.T3 <span class="ltx_text ltx_ref_tag">3</span>]), enabling running the game at 50FPS, albeit with a small impact to quality. | |||
</div> | |||
<span class="ltx_tag ltx_tag_table">Table 3: </span><span id="A1.T3.20.1" class="ltx_text ltx_font_bold">Generation with Varying Sampling Steps.</span> We evaluate the generation quality of a GameNGen model with an increasing number of steps using PSNR and LPIPS metrics. “D” marks a 1-step distilled model. | |||
<div id="A1.T3.18" class="ltx_inline-block ltx_align_center ltx_transformed_outer" style="width:198.7pt;height:171pt;vertical-align:-0.0pt;"> | |||
<span class="ltx_transformed_inner" style="transform:translate(5.2pt,-4.5pt) scale(1.05558137708566,1.05558137708566) ;"> </span> | |||
{| | |||
! Steps | |||
! PSNR <math display="inline">\uparrow</math> | |||
! LPIPS <math display="inline">\downarrow</math> | |||
|- | |||
| D | |||
| <math display="inline">31.10 \pm 0.098</math> | |||
| <math display="inline">0.208 \pm 0.002</math> | |||
|- | |||
| 1 | |||
| <math display="inline">25.47 \pm 0.098</math> | |||
| <math display="inline">0.255 \pm 0.002</math> | |||
|- | |||
| 2 | |||
| <math display="inline">31.91 \pm 0.104</math> | |||
| <math display="inline">0.205 \pm 0.002</math> | |||
|- | |||
| 4 | |||
| <math display="inline">32.58 \pm 0.108</math> | |||
| <math display="inline">0.198 \pm 0.002</math> | |||
|- | |||
| 8 | |||
| <math display="inline">32.55 \pm 0.110</math> | |||
| <math display="inline">0.196 \pm 0.002</math> | |||
|- | |||
| 16 | |||
| <math display="inline">32.44 \pm 0.110</math> | |||
| <math display="inline">0.196 \pm 0.002</math> | |||
|- | |||
| 32 | |||
| <math display="inline">32.32 \pm 0.110</math> | |||
| <math display="inline">0.196 \pm 0.002</math> | |||
|- | |||
| 64 | |||
| <math display="inline">32.19 \pm 0.110</math> | |||
| <math display="inline">0.197 \pm 0.002</math> | |||
|} | |||
</div> | |||
</div> | |||
<div id="A1.SS5" class="section ltx_subsection"> | |||
=== <span class="ltx_tag ltx_tag_subsection">A.5 </span>Agent vs Random Policy === | |||
<div id="A1.SS5.p1" class="ltx_para ltx_noindent"> | |||
Figure [https://arxiv.org/html/2408.14837v1#A1.F13 <span class="ltx_text ltx_ref_tag">13</span>] shows the PSNR values compared to ground truth for a model trained on the RL-agent’s data and a model trained on the data from a random policy, after 3 seconds of auto-regressive generation, for a short session of human play. We observe that the agent is sometimes comparable to and sometimes much better than the random policy.
</div> | |||
[[File:./Diffusion%20Models%20Are%20Real-Time%20Game%20Engines_files/psnr_over_trajectory.png|frame|none|548x284px|class=ltx_graphics ltx_centering ltx_img_landscape|alt=|caption <span class="ltx_tag ltx_tag_figure">Figure 13: </span>The PSNR values compared to ground truth for the agent (orange) and random (blue) after 3 seconds of auto-regressive generation for a short session of human play, smoothed with an EMA factor of 0.05.]]
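For reference, exponential moving average (EMA) smoothing of a per-frame PSNR curve can be sketched as follows. The source gives only the factor 0.05, so the convention used here (the factor weights the newest value, and the first value initializes the average) is an assumption.

```python
def ema_smooth(values, alpha=0.05):
    """Exponential moving average; `alpha` weights the newest value (assumed convention)."""
    smoothed = []
    state = None
    for v in values:
        state = v if state is None else alpha * v + (1.0 - alpha) * state
        smoothed.append(state)
    return smoothed
```

Under this convention a factor of 0.05 is heavy smoothing: each plotted point is dominated by the preceding few dozen frames rather than by the current one.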
</div> | |||
<div id="A1.SS6" class="section ltx_subsection"> | |||
=== <span class="ltx_tag ltx_tag_subsection">A.6 </span>Human Eval Tool === | |||
<div id="A1.SS6.p1" class="ltx_para"> | |||
Figure [https://arxiv.org/html/2408.14837v1#A1.F14 <span class="ltx_text ltx_ref_tag">14</span>] depicts a screenshot of the tool used for the human evaluations (Section [https://arxiv.org/html/2408.14837v1#S5.SS1 <span class="ltx_text ltx_ref_tag">5.1</span>]). | |||
</div> | |||
[[File:./Diffusion%20Models%20Are%20Real-Time%20Game%20Engines_files/eval_tool.png|frame|none|548x266px|class=ltx_graphics ltx_centering ltx_img_landscape|alt=|caption <span class="ltx_tag ltx_tag_figure">Figure 14: </span>A screenshot of the tool used for human evaluations (see Section [https://arxiv.org/html/2408.14837v1#S5.SS1 <span class="ltx_text ltx_ref_tag">5.1</span>]).]] | |||
</div> | |||
</div> | |||
</div> | |||
</div> | |||
Revision as of 03:42, 29 August 2024
Table of Contents
- Abstract
- 1 Introduction
- 2 Interactive World Simulation
- 3 GameNGen
- 4 Experimental Setup
- 5 Results
- 6 Related Work
- 7 Discussion
- A Appendix
- References
arXiv:2408.14837v1 [cs.LG] 27 Aug 2024
Diffusion Models Are Real-Time Game Engines
Abstract
We present GameNGen, the first game engine powered entirely by a neural model that enables real-time interaction with a complex environment over long trajectories at high quality. GameNGen can interactively simulate the classic game DOOM at over 20 frames per second on a single TPU. Next frame prediction achieves a PSNR of 29.4, comparable to lossy JPEG compression. Human raters are only slightly better than random chance at distinguishing short clips of the game from clips of the simulation. GameNGen is trained in two phases: (1) an RL-agent learns to play the game and the training sessions are recorded, and (2) a diffusion model is trained to produce the next frame, conditioned on the sequence of past frames and actions. Conditioning augmentations enable stable auto-regressive generation over long trajectories.
[[File:./Diffusion%20Models%20Are%20Real-Time%20Game%20Engines_files/teaser.png|frame|none|548x240px|class=ltx_graphics ltx_centering ltx_img_landscape|alt=|caption Figure 1: A human player is playing DOOM on GameNGen at 20 FPS.
See https://gamengen.github.io for demo videos.]]
1 Introduction
Computer games are manually crafted software systems centered around the following game loop: (1) gather user inputs, (2) update the game state, and (3) render it to screen pixels. This game loop, running at high frame rates, creates the illusion of an interactive virtual world for the player. Such game loops are classically run on standard computers, and while there have been many amazing attempts at running games on bespoke hardware (e.g. the iconic game DOOM has been run on kitchen appliances such as a toaster and a microwave, a treadmill, a camera, an iPod, and within the game of Minecraft, to name just a few examples; see https://www.reddit.com/r/itrunsdoom/), in all of these cases the hardware is still emulating the manually written game software as-is. Furthermore, while vastly different game engines exist, the game state updates and rendering logic in all are composed of a set of manual rules, programmed or configured by hand.
In recent years, generative models made significant progress in producing images and videos conditioned on multi-modal inputs, such as text or images. At the forefront of this wave, diffusion models became the de-facto standard in media (i.e. non-language) generation, with works like Dall-E (Ramesh et al., 2022), Stable Diffusion (Rombach et al., 2022) and Sora (Brooks et al., 2024). At a glance, simulating the interactive worlds of video games may seem similar to video generation. However, interactive world simulation is more than just very fast video generation. The requirement to condition on a stream of input actions that is only available throughout the generation breaks some assumptions of existing diffusion model architectures. Notably, it requires generating frames autoregressively which tends to be unstable and leads to sampling divergence (see section 3.2.1).
Several important works (Ha & Schmidhuber, 2018; Kim et al., 2020; Bruce et al., 2024) (see Section 6) simulate interactive video games with neural models. Nevertheless, most of these approaches are limited with respect to the complexity of the simulated games, simulation speed, stability over long time periods, or visual quality (see Figure 2). It is therefore natural to ask:
Can a neural model running in real-time simulate a complex game at high quality?
In this work we demonstrate that the answer is yes. Specifically, we show that a complex video game, the iconic game DOOM, can be run on a neural network (an augmented version of the open Stable Diffusion v1.4 (Rombach et al., 2022)), in real-time, while achieving a visual quality comparable to that of the original game. While not an exact simulation, the neural model is able to perform complex game state updates, such as tallying health and ammo, attacking enemies, damaging objects, opening doors, and persisting the game state over long trajectories.
GameNGen answers one of the important questions on the road towards a new paradigm for game engines, one where games are automatically generated, similarly to how images and videos are generated by neural models in recent years. Key questions remain, such as how these neural game engines would be trained and how games would be effectively created in the first place, including how to best leverage human inputs. We are nevertheless extremely excited for the possibilities of this new paradigm.
[[File:./Diffusion%20Models%20Are%20Real-Time%20Game%20Engines_files/x1.png|frame|none|761x216px|class=ltx_graphics ltx_centering ltx_img_landscape|alt=|caption Figure 2: GameNGen compared to prior state-of-the-art simulations of DOOM.]]
2 Interactive World Simulation
An Interactive Environment consists of a space of latent states , a space of partial projections of the latent space , a partial projection function , a set of actions , and a transition probability function <math display="inline">p(s \mid a, s^{\prime})</math> such that .
For example, in the case of the game DOOM, is the program’s dynamic memory contents, is the rendered screen pixels, is the game’s rendering logic, is the set of key presses and mouse movements, and is the program’s logic given the player’s input (including any potential non-determinism).
Given an input interactive environment , and an initial state , an Interactive World Simulation is a simulation distribution function <math display="inline">q(o_{n} \mid o_{< n}, a_{\leq n}),\ o_{i} \in \mathcal{O},\ a_{i} \in \mathcal{A}</math>. Given a distance metric between observations , a policy, i.e. a distribution on agent actions given past actions and observations <math display="inline">\pi(a_{n} \mid o_{< n}, a_{< n})</math>, a distribution on initial states, and a distribution on episode lengths, the Interactive World Simulation objective consists of minimizing where , , and are sampled observations from the environment and the simulation when enacting the agent’s policy . Importantly, the conditioning actions for these samples are always obtained by the agent interacting with the environment , while the conditioning observations can either be obtained from (the teacher forcing objective) or from the simulation (the auto-regressive objective).
We always train our generative model with the teacher forcing objective. Given a simulation distribution function , the environment can be simulated by auto-regressively sampling observations.
3 GameNGen
GameNGen (pronounced “game engine”) is a generative diffusion model that learns to simulate the game under the settings of Section 2. In order to collect training data for this model, with the teacher forcing objective, we first train a separate model to interact with the environment. The two models (agent and generative) are trained in sequence. The entirety of the agent’s actions and observations corpus during training is maintained and becomes the training dataset for the generative model in a second stage. See Figure 3.
[[File:./Diffusion%20Models%20Are%20Real-Time%20Game%20Engines_files/x2.png|frame|none|822x339px|class=ltx_graphics ltx_centering ltx_img_landscape|alt=|caption Figure 3: GameNGen method overview. v-prediction details are omitted for brevity.]]
3.1 Data Collection via Agent Play
Our end goal is to have human players interact with our simulation. To that end, the policy as in Section 2 is that of human gameplay. Since we cannot sample from that directly at scale, we start by approximating it via teaching an automatic agent to play. Unlike a typical RL setup which attempts to maximize game score, our goal is to generate training data which resembles human play, or at least contains enough diverse examples, in a variety of scenarios, to maximize training data efficiency. To that end, we design a simple reward function, which is the only part of our method that is environment-specific (see Appendix A.3).
We record the agent’s training trajectories throughout the entire training process, which includes different skill levels of play. This set of recorded trajectories is our dataset, used for training the generative model (see Section 3.2).
3.2 Training the Generative Diffusion Model
We now train a generative diffusion model conditioned on the agent’s trajectories (actions and observations) collected during the previous stage.
We re-purpose a pre-trained text-to-image diffusion model, Stable Diffusion v1.4 (Rombach et al., 2022). We condition the model on trajectories , i.e. on a sequence of previous actions and observations (frames) and remove all text conditioning. Specifically, to condition on actions, we simply learn an embedding from each action (e.g. a specific key press) into a single token and replace the cross attention from the text into this encoded actions sequence. In order to condition on observations (i.e. previous frames) we encode them into latent space using the auto-encoder and concatenate them in the latent channels dimension to the noised latents (see Figure 3). We also experimented with conditioning on these past observations via cross-attention but observed no meaningful improvements.
We train the model to minimize the diffusion loss with velocity parameterization (Salimans & Ho, 2022b):
(1) |
where , , , , , , and is the v-prediction output of the model . The noise schedule is linear, similarly to Rombach et al. (2022).
3.2.1 Mitigating Auto-Regressive Drift Using Noise Augmentation
The domain shift between training with teacher-forcing and auto-regressive sampling leads to error accumulation and fast degradation in sample quality, as demonstrated in Figure 4. To avoid this divergence due to auto-regressive application of the model, we corrupt context frames by adding a varying amount of Gaussian noise to encoded frames in training time, while providing the noise level as input to the model, following Ho et al. (2021). To that effect, we sample a noise level uniformly up to a maximal value, discretize it and learn an embedding for each bucket (see Figure 3). This allows the network to correct information sampled in previous frames, and is critical for preserving frame quality over time. During inference, the added noise level can be controlled to maximize quality, although we find that even with no added noise the results are significantly improved. We ablate the impact of this method in section 5.2.2.
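The noise augmentation step can be sketched roughly as follows. The maximal noise level (0.7) and the number of embedding buckets (10) are the values reported in Section 4.2; everything else is an assumption for illustration, in particular the additive form of the corruption and the flat-list representation of an encoded frame.

```python
import random

MAX_NOISE_LEVEL = 0.7  # maximal noise level (Section 4.2)
NUM_BUCKETS = 10       # number of noise-level embedding buckets (Section 4.2)


def sample_noise_bucket(rng):
    """Sample a noise level uniformly in [0, MAX_NOISE_LEVEL] and discretize it."""
    level = rng.uniform(0.0, MAX_NOISE_LEVEL)
    bucket = min(int(level / MAX_NOISE_LEVEL * NUM_BUCKETS), NUM_BUCKETS - 1)
    return level, bucket


def corrupt_latent(latent, level, rng):
    """Add Gaussian noise scaled by `level` to an encoded context frame.

    `latent` is a flat list of floats standing in for the encoded frame
    (assumed representation; the additive form is also an assumption).
    """
    return [x + level * rng.gauss(0.0, 1.0) for x in latent]
```

During training the bucket index would select the learned noise-level embedding fed to the model, while the corrupted latent replaces the clean context frame.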
[[File:./Diffusion%20Models%20Are%20Real-Time%20Game%20Engines_files/noise_aug_ablation_new.png|frame|none|494x158px|class=ltx_graphics ltx_centering ltx_img_landscape|alt=|caption Figure 4: Auto-regressive drift. Top: we present every 10th frame of a simple trajectory with 50 frames in which the player is not moving. Quality degrades fast after 20-30 steps. Bottom: the same trajectory with noise augmentation does not suffer from quality degradation.]]
3.2.2 Latent Decoder Fine-tuning
The pre-trained auto-encoder of Stable Diffusion v1.4, which compresses 8x8 pixel patches into 4 latent channels, results in meaningful artifacts when predicting game frames, which affect small details and particularly the bottom bar HUD (“heads up display”). To leverage the pre-trained knowledge while improving image quality, we train just the decoder of the latent auto-encoder using an MSE loss computed against the target frame pixels. It might be possible to improve quality even further using a perceptual loss such as LPIPS (Zhang et al. (2018)), which we leave to future work. Importantly, note that this fine tuning process happens completely separately from the U-Net fine-tuning, and that notably the auto-regressive generation isn’t affected by it (we only condition auto-regressively on the latents, not the pixels). Appendix A.2 shows examples of generations with and without fine-tuning the auto-encoder.
3.3 Inference
3.3.1 Setup
We use DDIM sampling (Song et al., 2022). We employ Classifier-Free Guidance (Ho & Salimans, 2022) only for the past observations condition . We didn’t find guidance for the past actions condition to improve quality. The weight we use is relatively small (1.5) as larger weights create artifacts which increase due to our auto-regressive sampling.
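As a reminder of how a guidance weight is typically applied, the sketch below uses the common Classifier-Free Guidance combination of a conditional and an unconditional prediction. Here "conditional" means the model output given the past-observation context and "unconditional" means the output with that context dropped; the function name and the flat-list representation are illustrative assumptions, and the exact formulation used by the authors is not given in the text.

```python
def guided_prediction(cond, uncond, weight=1.5):
    """Common CFG combination: uncond + weight * (cond - uncond), element-wise."""
    return [u + weight * (c - u) for c, u in zip(cond, uncond)]
```

With `weight=1.0` this reduces to the conditional prediction; weights above 1 extrapolate away from the unconditional prediction, which matches the observation above that large weights amplify artifacts.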
We also experimented with generating 4 samples in parallel and combining the results, with the hope of preventing rare extreme predictions from being accepted and to reduce error accumulation. We experimented both with averaging the samples and with choosing the sample closest to the median. Averaging performed slightly worse than single frame, and choosing the closest to the median performed only negligibly better. Since both increase the hardware requirements to 4 TPUs, we opt to not use them, but note that this might be an interesting area for future work.
3.3.2 Denoiser Sampling Steps
During inference, we need to run both the U-Net denoiser (for a number of steps) and the auto-encoder. On our hardware configuration (a TPU-v5), a single denoiser step and an evaluation of the auto-encoder each take 10ms. If we ran our model with a single denoiser step, the minimum total latency possible in our setup would be 20ms per frame, or 50 frames per second. Usually, generative diffusion models, such as Stable Diffusion, don’t produce high quality results with a single denoising step, and instead require dozens of sampling steps to generate a high quality image. Surprisingly, we found that we can robustly simulate DOOM with only 4 DDIM sampling steps (Song et al., 2020). In fact, we observe no degradation in simulation quality when using 4 sampling steps vs 20 steps or more (see Appendix A.4).
Using just 4 denoising steps leads to a total U-Net cost of 40ms (and total inference cost of 50ms, including the auto encoder) or 20 frames per second. We hypothesize that the negligible impact to quality with few steps in our case stems from a combination of: (1) a constrained image space, and (2) strong conditioning by the previous frames.
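The latency arithmetic in this subsection can be checked with a short sketch. The two 10ms timings are the figures quoted in the text; the function names are illustrative, and the model assumes the denoiser steps and the auto-encoder evaluation run strictly in sequence.

```python
DENOISER_STEP_MS = 10.0  # one U-Net denoiser step (from the text)
AUTOENCODER_MS = 10.0    # one auto-encoder evaluation (from the text)


def frame_latency_ms(steps):
    """Total per-frame latency for a given number of denoiser steps."""
    return steps * DENOISER_STEP_MS + AUTOENCODER_MS


def frames_per_second(steps):
    """Frame rate implied by the sequential latency model."""
    return 1000.0 / frame_latency_ms(steps)


for steps in (1, 4):
    print(steps, frame_latency_ms(steps), frames_per_second(steps))
```

A single step gives 20ms per frame (50 FPS), and 4 steps give 50ms per frame (20 FPS), matching the figures quoted in the text.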
Since we do observe degradation when using just a single sampling step, we also experimented with model distillation, similarly to Yin et al. (2024) and Wang et al. (2023), in the single-step setting. Distillation does help substantially there (allowing us to reach 50 FPS as above), but still comes at some cost to simulation quality, so we opt to use the 4-step version without distillation for our method (see Appendix A.4). This is an interesting area for further research.
We note that it is trivial to further increase the image generation rate substantially by parallelizing the generation of several frames on additional hardware, similarly to NVidia’s classic SLI Alternate Frame Rendering (AFR) technique. Similarly to AFR, the actual simulation rate would not increase and input lag would not reduce.
4 Experimental Setup
4.1 Agent Training
The agent model is trained using PPO (Schulman et al., 2017), with a simple CNN as the feature network, following Mnih et al. (2015). It is trained on CPU using the Stable Baselines 3 infrastructure (Raffin et al., 2021). The agent is provided with downscaled versions of the frame images and in-game map, each at resolution 160x120. The agent also has access to the last 32 actions it performed. The feature network computes a representation of size 512 for each image. PPO’s actor and critic are 2-layer MLP heads on top of a concatenation of the outputs of the image feature network and the sequence of past actions. We train the agent to play the game using the Vizdoom environment (Wydmuch et al., 2019). We run 8 games in parallel, each with a replay buffer size of 512, a discount factor , and an entropy coefficient of . In each iteration, the network is trained using a batch size of 64 for 10 epochs, with a learning rate of 1e-4. We perform a total of 10M environment steps.
4.2 Generative Model Training
We train all simulation models from a pretrained checkpoint of Stable Diffusion 1.4, unfreezing all U-Net parameters. We use a batch size of 128 and a constant learning rate of 2e-5, with the Adafactor optimizer without weight decay (Shazeer & Stern, 2018) and gradient clipping of 1.0. We change the diffusion loss parameterization to v-prediction (Salimans & Ho, 2022a). The context frames condition is dropped with probability 0.1 to allow CFG during inference. We train using 128 TPU-v5e devices with data parallelization. Unless noted otherwise, all results in the paper are after 700,000 training steps. For noise augmentation (Section 3.2.1), we use a maximal noise level of 0.7, with 10 embedding buckets. We use a batch size of 2,048 for optimizing the latent decoder; other training parameters are identical to those of the denoiser. For training data, we use all trajectories played by the agent during RL training as well as evaluation data during training, unless mentioned otherwise. Overall we generate 900M frames for training. All image frames (during training, inference, and conditioning) are at a resolution of 320x240 padded to 320x256. We use a context length of 64 (i.e., the model is provided its own last 64 predictions as well as the last 64 actions).
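Several of the numbers above translate directly into small pieces of data preparation logic. The following is a minimal sketch, not the authors' implementation. It assumes the noise level is drawn uniformly and that the buckets are equal-width, and it uses the 8x spatial downsampling of the Stable Diffusion v1 autoencoder to derive the latent size. All function and constant names are hypothetical.

```python
import random

MAX_NOISE_LEVEL = 0.7      # maximal noise level from the text
NUM_NOISE_BUCKETS = 10     # number of embedding buckets from the text
CONTEXT_DROP_PROB = 0.1    # probability of dropping the context frames condition
VAE_DOWNSAMPLE = 8         # spatial downsampling of the Stable Diffusion v1 autoencoder


def latent_shape(width: int = 320, padded_height: int = 256) -> tuple:
    """Latent (width, height) of one padded frame."""
    return width // VAE_DOWNSAMPLE, padded_height // VAE_DOWNSAMPLE


def noise_bucket(noise_level: float) -> int:
    """Map a noise level in [0, MAX_NOISE_LEVEL] to an equal-width bucket index (ASSUMED scheme)."""
    if not 0.0 <= noise_level <= MAX_NOISE_LEVEL:
        raise ValueError("noise level out of range")
    index = int(noise_level / MAX_NOISE_LEVEL * NUM_NOISE_BUCKETS)
    return min(index, NUM_NOISE_BUCKETS - 1)  # the maximal level falls in the last bucket


def sample_conditioning(rng: random.Random) -> dict:
    """Draw per-example augmentation choices (uniform noise level is an ASSUMPTION)."""
    level = rng.uniform(0.0, MAX_NOISE_LEVEL)
    return {
        "noise_level": level,
        "noise_bucket": noise_bucket(level),
        "drop_context": rng.random() < CONTEXT_DROP_PROB,
    }


rng = random.Random(0)
samples = [sample_conditioning(rng) for _ in range(10_000)]
drop_fraction = sum(s["drop_context"] for s in samples) / len(samples)
print(latent_shape(), round(drop_fraction, 2))
```

Dropping the context for roughly one example in ten is what lets the same network produce both conditional and unconditional predictions, which CFG combines at inference time.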
5 Results
5.1 Simulation Quality
Overall, our method achieves a simulation quality comparable to the original game over long trajectories in terms of image quality. For short trajectories, human raters are only slightly better than random chance at distinguishing between clips of the simulation and the actual game.
Image Quality. We measure LPIPS (Zhang et al., 2018) and PSNR using the teacher-forcing setup described in Section 2, where we sample an initial state and predict a single frame based on a trajectory of ground-truth past observations. When evaluated over a random holdout of 2048 trajectories taken in 5 different levels, our model achieves a PSNR of and an LPIPS of . The PSNR value is similar to lossy JPEG compression with quality settings of 20-30 (Petric & Milinkovic, 2018). Figure 5 shows examples of model predictions and the corresponding ground truth samples.
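For readers less familiar with the metric, PSNR is a log-scaled function of the mean squared error between a predicted frame and its ground truth. The sketch below shows the standard definition on flat pixel sequences. The 8-bit peak value of 255 is an assumption about the pixel range, and the function name is hypothetical.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float], prediction: Sequence[float], max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized flat pixel sequences."""
    if len(reference) != len(prediction) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - p) ** 2 for r, p in zip(reference, prediction)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# An error of one intensity level on every pixel.
off_by_one = psnr([10, 20, 30, 40], [11, 21, 31, 41])
print(round(off_by_one, 2))
```

Higher values mean a closer match; because the scale is logarithmic, each 10 dB corresponds to a tenfold change in mean squared error.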
[[File:./Diffusion%20Models%20Are%20Real-Time%20Game%20Engines_files/x3.png|frame|none|761x342px|class=ltx_graphics ltx_centering ltx_img_landscape|alt=|caption Figure 5: Model predictions vs. ground truth. Only the last 4 frames of the past observations context are shown.]]
Video Quality. We use the auto-regressive setup described in Section 2, where we iteratively sample frames following the sequences of actions defined by the ground-truth trajectory, while conditioning the model on its own past predictions. When sampled auto-regressively, the predicted and ground-truth trajectories often diverge after a few steps, mostly due to the accumulation of small differences in movement velocity between frames in each trajectory. For that reason, per-frame PSNR values gradually decrease and per-frame LPIPS values gradually increase, as can be seen in Figure 6. The predicted trajectory is still similar to the actual game in terms of content and image quality, but per-frame metrics are limited in their ability to capture this (see Appendix A.1 for samples of auto-regressively generated trajectories).
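The difference between the two evaluation setups can be illustrated with a toy one-dimensional example. This is a deliberately simplified sketch, not the paper's model: a "frame" is just a position, and the hypothetical model moves slightly faster than the ground truth. With teacher forcing the per-step error stays constant, while in the auto-regressive setup it accumulates.

```python
def teacher_forced_errors(truth, step):
    """Predict each frame from the ground-truth previous frame."""
    return [abs(step(truth[t - 1]) - truth[t]) for t in range(1, len(truth))]


def auto_regressive_errors(truth, step):
    """Predict each frame from the model's own previous prediction."""
    errors, state = [], truth[0]
    for t in range(1, len(truth)):
        state = step(state)
        errors.append(abs(state - truth[t]))
    return errors


TRUE_VELOCITY = 1.0
MODEL_VELOCITY = 1.05  # hypothetical 5% velocity error per frame


def model_step(position):
    """Toy stand-in for the simulation model: advance by a slightly wrong velocity."""
    return position + MODEL_VELOCITY


truth = [t * TRUE_VELOCITY for t in range(65)]  # initial state plus 64 steps
tf_errors = teacher_forced_errors(truth, model_step)
ar_errors = auto_regressive_errors(truth, model_step)
print(tf_errors[-1], ar_errors[-1])
```

In this toy setting the auto-regressive error grows linearly with the step index even though every individual prediction is almost correct, which is why per-frame metrics penalize trajectories that remain plausible.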
We therefore measure the FVD (Unterthiner et al., 2019) computed over a random holdout of 512 trajectories, measuring the distance between the predicted and ground truth trajectory distributions, for simulations of length 16 frames (0.8 seconds) and 32 frames (1.6 seconds). For 16 frames our model obtains an FVD of . For 32 frames our model obtains an FVD of .
Human Evaluation. As another measurement of simulation quality, we provided 10 human raters with 130 random short clips (of lengths 1.6 seconds and 3.2 seconds) of our simulation side by side with the real game. The raters were tasked with recognizing the real game (see Figure 14 in Appendix A.6). The raters chose the actual game over the simulation only 58% and 60% of the time (for the 1.6-second and 3.2-second clips, respectively).
[[File:./Diffusion%20Models%20Are%20Real-Time%20Game%20Engines_files/psnr_step_700k_08212004.png|frame|none|220x137px|class=ltx_graphics ltx_centering ltx_figure_panel ltx_img_landscape|alt=|caption Figure 6: Auto-regressive evaluation. PSNR and LPIPS metrics over 64 auto-regressive steps.]]
5.2 Ablations
To evaluate the importance of the different components of our method, we sample trajectories from the evaluation dataset and compute LPIPS and PSNR metrics between the ground truth and the predicted frames.
5.2.1 Context Length
We evaluate the impact of changing the number $N$ of past observations in the conditioning context by training models with $N \in \{1, 2, 4, 8, 16, 32, 64\}$ (recall that our method uses $N = 64$). This affects both the number of historical frames and actions. We train the models for 200,000 steps keeping the decoder frozen and evaluate on test-set trajectories from 5 levels. See the results in Table 1. As expected, we observe that generation quality improves with the length of the context. Interestingly, we observe that while the improvement is large at first (e.g. between 1 and 2 frames), we quickly approach an asymptote and further increasing the context size provides only small improvements in quality. This is somewhat surprising as even with our maximal context length, the model only has access to a little over 3 seconds of history. Notably, we observe that much of the game state is persisted for much longer periods (see Section 7). While the length of the conditioning context is an important limitation, Table 1 hints that we’d likely need to change the architecture of our model to efficiently support longer contexts, and employ better selection of the past frames to condition on, which we leave for future work.
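The conditioning context behaves like a fixed-size sliding window over recent frames and actions. A minimal sketch of such a window follows, assuming the 20 FPS simulation rate reported in the paper; the class and its method names are hypothetical and the frames are placeholder strings.

```python
from collections import deque

FPS = 20             # simulation rate reported in the paper
CONTEXT_LENGTH = 64  # context length used by the method


class HistoryContext:
    """Fixed-size window over the most recent frames and actions."""

    def __init__(self, length: int = CONTEXT_LENGTH):
        self.frames = deque(maxlen=length)
        self.actions = deque(maxlen=length)

    def push(self, frame, action) -> None:
        # Appending to a full deque silently evicts the oldest entry.
        self.frames.append(frame)
        self.actions.append(action)

    def seconds_of_history(self, fps: int = FPS) -> float:
        return len(self.frames) / fps


context = HistoryContext()
for step in range(200):
    context.push(frame=f"frame-{step}", action=step % 3)

print(context.frames[0], context.seconds_of_history())
```

At 20 FPS a window of 64 frames spans 3.2 seconds, so anything that happened earlier is only available to the model if it is still visible in the retained frames.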
Table 1: Number of history frames. We ablate the number of history frames used as context using 8912 test-set examples from 5 levels. More frames generally improve both PSNR and LPIPS metrics.
History Context Length | PSNR | LPIPS |
---|---|---|
64 | ||
32 | ||
16 | ||
8 | ||
4 | ||
2 | ||
1 |
5.2.2 Noise Augmentation
To ablate the impact of noise augmentation we train a model without added noise. We evaluate both our standard model with noise augmentation and the model without added noise (after 200k training steps) auto-regressively and compute PSNR and LPIPS metrics between the predicted frames and the ground-truth over a random holdout of 512 trajectories. We report average metric values for each auto-regressive step up to a total of 64 frames in Figure 7.
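The per-step curves are obtained by averaging across trajectories separately for each auto-regressive step. A minimal sketch of that aggregation follows; the function name is hypothetical and the PSNR values are made up purely for illustration.

```python
from statistics import mean


def per_step_average(metric_by_trajectory):
    """Average a per-frame metric across trajectories, separately for each step."""
    lengths = {len(values) for values in metric_by_trajectory}
    if len(lengths) != 1:
        raise ValueError("all trajectories must have the same number of steps")
    return [mean(step_values) for step_values in zip(*metric_by_trajectory)]


# Made-up PSNR values for three trajectories of four steps each.
psnr_by_trajectory = [
    [30.0, 28.0, 26.0, 24.0],
    [29.0, 27.0, 25.0, 23.0],
    [31.0, 29.0, 27.0, 25.0],
]
curve = per_step_average(psnr_by_trajectory)
print(curve)
```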
Without noise augmentation, LPIPS distance from the ground truth increases rapidly compared to our standard noise-augmented model, while PSNR drops, indicating a divergence of the simulation from ground truth.
[[File:./Diffusion%20Models%20Are%20Real-Time%20Game%20Engines_files/noise_aug_ablation_lpips_step_200k_08212024.png|frame|none|247x153px|class=ltx_graphics ltx_centering ltx_figure_panel ltx_img_landscape|alt=|caption Figure 7: Impact of Noise Augmentation. The plots show average LPIPS (lower is better) and PSNR (higher is better) values for each auto-regressive step. When noise augmentation is not used quality degrades quickly after 10-20 frames. This is prevented by noise augmentation.]]
5.2.3 Agent Play
We compare training on agent-generated data to training on data generated using a random policy. For the random policy, we sample actions following a uniform categorical distribution that doesn’t depend on the observations. We compare the random and agent datasets by training 2 models for 700k steps along with their decoder. The models are evaluated on a dataset of 2048 human-play trajectories from 5 levels. We compare the first frame of generation, conditioned on a history context of 64 ground-truth frames, as well as a frame after 3 seconds of auto-regressive generation.
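A random policy of the kind described here fits in a few lines. The sketch below is only an illustration: the action names are hypothetical, since the actual action space is not listed in this section, and the policy deliberately ignores its observation.

```python
import random
from collections import Counter

# Hypothetical action names; the real action space is not given in this section.
ACTIONS = ["forward", "backward", "turn_left", "turn_right", "fire", "use"]


def random_policy(observation, rng: random.Random) -> str:
    """Pick an action uniformly at random, ignoring the observation entirely."""
    del observation  # unused by design
    return rng.choice(ACTIONS)


rng = random.Random(0)
counts = Counter(random_policy(observation=None, rng=rng) for _ in range(60_000))
print(sorted(counts.items()))
```

Because every action is equally likely at every step, such a policy tends to stay near the starting area, which is consistent with the exploration limits discussed below.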
Overall, we observe that training the model on random trajectories works surprisingly well, but is limited by the exploration ability of the random policy. When comparing single-frame generation, the agent works only slightly better, achieving a PSNR of 25.06 vs. 24.42 for the random policy. When comparing a frame after 3 seconds of auto-regressive generation, the difference increases to 19.02 vs. 16.84. When playing with the model manually, we observe that some areas are very easy for both, some areas are very hard for both, and in some the agent performs much better. We therefore manually split 456 examples into 3 buckets (easy, medium, and hard) based on their distance from the starting position in the game. We observe that on the easy and hard sets, the agent performs only slightly better than random, while on the medium set the difference is much larger in favor of the agent, as expected (see Table 2). See Figure 13 in Appendix A.5 for an example of the scores during a single session of human play.
Table 2: Performance on Different Difficulty Levels. We compare the performance of models trained using Agent-generated and Random-generated data across easy, medium, and hard splits of the dataset. Easy and medium have 112 items, hard has 232 items. Metrics are computed for each trajectory on a single frame after 3 seconds.
Difficulty Level | Data Generation Policy | PSNR | LPIPS |
---|---|---|---|
Easy | Agent | | |
Easy | Random | | |
Medium | Agent | | |
Medium | Random | | |
Hard | Agent | | |
Hard | Random | | |
6 Related Work
Interactive 3D Simulation
Simulating visual and physical processes of 2D and 3D environments and allowing interactive exploration of them is an extensively developed field in computer graphics (Akenine-Möller et al., 2018). Game engines, such as Unreal and Unity, are software that processes representations of scene geometry and renders a stream of images in response to user interactions. The game engine is responsible for keeping track of all world state, e.g. the player position and movement, objects, character animation and lighting. It also tracks the game logic, e.g. points gained by accomplishing game objectives. Film and television productions use variants of ray-tracing (Shirley & Morley, 2008), which are too slow and compute-intensive for real-time applications. In contrast, game engines must keep a very high frame rate (typically 30-60 FPS), and therefore rely on highly optimized polygon rasterization, often accelerated by GPUs. Physical effects such as shadows, particles and lighting are often implemented using efficient heuristics rather than physically accurate simulation.
Neural 3D Simulation
Neural methods for reconstructing 3D representations have made significant advances in recent years. NeRFs (Mildenhall et al., 2020) parameterize radiance fields using a deep neural network that is specifically optimized for a given scene from a set of images taken from various camera poses. Once trained, novel views of the scene can be sampled using volume rendering methods. Gaussian Splatting (Kerbl et al., 2023) approaches build on NeRFs but represent scenes using 3D Gaussians and adapted rasterization methods, unlocking faster training and rendering times. While demonstrating impressive reconstruction results and real-time interactivity, these methods are often limited to static scenes.
Video Diffusion Models
Diffusion models achieved state-of-the-art results in text-to-image generation (Saharia et al., 2022; Rombach et al., 2022; Ramesh et al., 2022; Podell et al., 2023), a line of work that has also been applied to text-to-video generation tasks (Ho et al., 2022; Blattmann et al., 2023a; b; Gupta et al., 2023; Girdhar et al., 2023; Bar-Tal et al., 2024). Despite impressive advancement in realism, text adherence and temporal consistency, video diffusion models remain too slow for real-time applications. Our work extends this line of research and adapts it for real-time generation conditioned autoregressively on a history of past observations and actions.
Game Simulation and World Models
Several works attempted to train models for game simulation with action inputs. Yang et al. (2023) build a diverse dataset of real-world and simulated videos and train a diffusion model to predict a continuation video given a previous video segment and a textual description of an action. Menapace et al. (2021) and Bruce et al. (2024) focus on unsupervised learning of actions from videos. Menapace et al. (2024) convert textual prompts to game states, which are later converted to a 3D representation using NeRF. Unlike these works, we focus on interactive playable real-time simulation, and demonstrate robustness over long-horizon trajectories. We leverage an RL agent to explore the game environment and create rollouts of observations and interactions for training our interactive game model. Another line of work explored learning a predictive model of the environment and using it for training an RL agent. Ha & Schmidhuber (2018) train a Variational Auto-Encoder (Kingma & Welling, 2014) to encode game frames into a latent vector, and then use an RNN to mimic the VizDoom game environment, training on rollouts from a random policy (i.e. selecting an action at random). A controller policy is then learned by playing within the “hallucinated” environment. Hafner et al. (2020) demonstrate that an RL agent can be trained entirely on episodes generated by a learned world model in latent space. Also close to our work is Kim et al. (2020), who use an LSTM architecture for modeling the world state, coupled with a convolutional decoder for producing output frames and jointly trained under an adversarial objective. While this approach seems to produce reasonable results for simple games like PacMan, it struggles with simulating the complex environment of VizDoom and produces blurry samples. In contrast, GameNGen is able to generate samples comparable to those of the original game (see Figure 2). Finally, concurrently with our work, Alonso et al. (2024) train a diffusion world model to predict the next observation given observation history, and iteratively train the world model and an RL model on Atari games.
DOOM
When DOOM was released in 1993 it revolutionized the gaming industry. Introducing groundbreaking 3D graphics technology, it became a cornerstone of the first-person shooter genre, influencing countless other games. DOOM has been studied in numerous research works. It provides an open-source implementation and a native resolution that is low enough for small models to simulate, while being complex enough to be a challenging test case. Finally, the authors have spent countless youth hours with the game. It was a trivial choice to use it in this work.
7 Discussion
Summary. We introduced GameNGen, and demonstrated that high-quality real-time game play at 20 frames per second is possible on a neural model. We also provided a recipe for converting an interactive piece of software such as a computer game into a neural model.
Limitations. GameNGen suffers from a limited amount of memory. The model only has access to a little over 3 seconds of history, so it’s remarkable that much of the game logic is persisted for drastically longer time horizons. While some of the game state is persisted through screen pixels (e.g. ammo and health tallies, available weapons, etc.), the model likely learns strong heuristics that allow meaningful generalizations. For example, from the rendered view the model learns to infer the player’s location, and from the ammo and health tallies, the model might infer whether the player has already been through an area and defeated the enemies there. That said, it’s easy to create situations where this context length is not enough. Continuing to increase the context size with our existing architecture yields only marginal benefits (Section 5.2.1), and the model’s short context length remains an important limitation. The second important limitation is the remaining difference between the agent’s behavior and that of human players. For example, our agent, even at the end of training, still does not explore all of the game’s locations and interactions, leading to erroneous behavior in those cases.
Future Work. We demonstrate GameNGen on the classic game DOOM. It would be interesting to test it on other games or, more generally, on other interactive software systems. We note that nothing in our technique is DOOM-specific except for the reward function for the RL agent. We plan on addressing that in future work. While GameNGen manages to maintain game state accurately, it isn’t perfect, as per the discussion above. A more sophisticated architecture might be needed to mitigate these limitations. GameNGen currently has a limited capability to leverage more than a minimal amount of memory. Experimenting with further expanding the memory effectively could be critical for more complex games and software. GameNGen runs at 20 or 50 FPS on a TPUv5 (faster than the original game DOOM ran on some of the authors’ 80386 machines at the time!). It would be interesting to experiment with further optimization techniques to get it to run at higher frame rates and on consumer hardware.
Towards a New Paradigm for Interactive Video Games. Today, video games are programmed by humans. GameNGen is a proof-of-concept for one part of a new paradigm where games are weights of a neural model, not lines of code. GameNGen shows that an architecture and model weights exist such that a neural model can effectively run a complex game (DOOM) interactively on existing hardware. While many important questions remain, we are hopeful that this paradigm could have important benefits. For example, the development process for video games under this new paradigm might be less costly and more accessible, whereby games could be developed and edited via textual descriptions or example images. A small part of this vision, namely creating modifications or novel behaviors for existing games, might be achievable in the shorter term. For example, we might be able to convert a set of frames into a new playable level or create a new character just based on example images, without having to author code. Other advantages of this new paradigm include strong guarantees on frame rates and memory footprints. We have not experimented with these directions yet and much more work is required here, but we are excited to try! Hopefully this small step will someday contribute to a meaningful improvement in people’s experience with video games, or maybe even more generally, in day-to-day interactions with interactive software systems.
Acknowledgements
We’d like to extend a huge thank you to Eyal Segalis, Eyal Molad, Matan Kalman, Nataniel Ruiz, Amir Hertz, Matan Cohen, Yossi Matias, Yael Pritch, Danny Lumen, Valerie Nygaard, the Theta Labs and Google Research teams, and our families for insightful feedback, ideas, suggestions, and support.
Contribution
- Dani Valevski: Developed much of the codebase, tuned parameters and details across the system, added autoencoder fine-tuning, agent training, and distillation.
- Yaniv Leviathan: Proposed project, method, and architecture, developed the initial implementation, key contributor to implementation and writing.
- Moab Arar: Led auto-regressive stabilization with noise-augmentation, many of the ablations, and created the dataset of human-play data.
- Shlomi Fruchter: Proposed project, method, and architecture. Project leadership, initial implementation using DOOM, main manuscript writing, evaluation metrics, random policy data pipeline.
Correspondence to shlomif@google.com and leviathan@google.com.
References
- Akenine-Möller et al. (2018) ↑ Tomas Akenine-Möller, Eric Haines, and Naty Hoffman. Real-Time Rendering, Fourth Edition. A. K. Peters, Ltd., USA, 4th edition, 2018. ISBN 0134997832.
- Alonso et al. (2024) ↑ Eloi Alonso, Adam Jelley, Vincent Micheli, Anssi Kanervisto, Amos Storkey, Tim Pearce, and François Fleuret. Diffusion for world modeling: Visual details matter in atari, 2024.
- Bar-Tal et al. (2024) ↑ Omer Bar-Tal, Hila Chefer, Omer Tov, Charles Herrmann, Roni Paiss, Shiran Zada, Ariel Ephrat, Junhwa Hur, Guanghui Liu, Amit Raj, Yuanzhen Li, Michael Rubinstein, Tomer Michaeli, Oliver Wang, Deqing Sun, Tali Dekel, and Inbar Mosseri. Lumiere: A space-time diffusion model for video generation, 2024. URL https://arxiv.org/abs/2401.12945.
- Blattmann et al. (2023a) ↑ Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, Varun Jampani, and Robin Rombach. Stable video diffusion: Scaling latent video diffusion models to large datasets, 2023a. URL https://arxiv.org/abs/2311.15127.
- Blattmann et al. (2023b) ↑ Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. Align your latents: High-resolution video synthesis with latent diffusion models, 2023b. URL https://arxiv.org/abs/2304.08818.
- Brooks et al. (2024) ↑ Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh. Video generation models as world simulators, 2024. URL https://openai.com/research/video-generation-models-as-world-simulators.
- Bruce et al. (2024) ↑ Jake Bruce, Michael Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, Yusuf Aytar, Sarah Bechtle, Feryal Behbahani, Stephanie Chan, Nicolas Heess, Lucy Gonzalez, Simon Osindero, Sherjil Ozair, Scott Reed, Jingwei Zhang, Konrad Zolna, Jeff Clune, Nando de Freitas, Satinder Singh, and Tim Rocktäschel. Genie: Generative interactive environments, 2024. URL https://arxiv.org/abs/2402.15391.
- Girdhar et al. (2023) ↑ Rohit Girdhar, Mannat Singh, Andrew Brown, Quentin Duval, Samaneh Azadi, Sai Saketh Rambhatla, Akbar Shah, Xi Yin, Devi Parikh, and Ishan Misra. Emu video: Factorizing text-to-video generation by explicit image conditioning, 2023. URL https://arxiv.org/abs/2311.10709.
- Gupta et al. (2023) ↑ Agrim Gupta, Lijun Yu, Kihyuk Sohn, Xiuye Gu, Meera Hahn, Li Fei-Fei, Irfan Essa, Lu Jiang, and José Lezama. Photorealistic video generation with diffusion models, 2023. URL https://arxiv.org/abs/2312.06662.
- Ha & Schmidhuber (2018) ↑ David Ha and Jürgen Schmidhuber. World models, 2018.
- Hafner et al. (2020) ↑ Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination, 2020. URL https://arxiv.org/abs/1912.01603.
- Ho & Salimans (2022) ↑ Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance, 2022. URL https://arxiv.org/abs/2207.12598.
- Ho et al. (2021) ↑ Jonathan Ho, Chitwan Saharia, William Chan, David J Fleet, Mohammad Norouzi, and Tim Salimans. Cascaded diffusion models for high fidelity image generation. arXiv preprint arXiv:2106.15282, 2021.
- Ho et al. (2022) ↑ Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey A. Gritsenko, Diederik P. Kingma, Ben Poole, Mohammad Norouzi, David J. Fleet, and Tim Salimans. Imagen video: High definition video generation with diffusion models. ArXiv, abs/2210.02303, 2022. URL https://api.semanticscholar.org/CorpusID:252715883.
- Kerbl et al. (2023) ↑ Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics, 42(4), July 2023. URL https://repo-sam.inria.fr/fungraph/3d-gaussian-splatting/.
- Kim et al. (2020) ↑ Seung Wook Kim, Yuhao Zhou, Jonah Philion, Antonio Torralba, and Sanja Fidler. Learning to Simulate Dynamic Environments with GameGAN. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2020.
- Kingma & Welling (2014) ↑ Diederik P. Kingma and Max Welling. Auto-Encoding Variational Bayes. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings, 2014.
- Menapace et al. (2021) ↑ Willi Menapace, Stéphane Lathuilière, Sergey Tulyakov, Aliaksandr Siarohin, and Elisa Ricci. Playable video generation, 2021. URL https://arxiv.org/abs/2101.12195.
- Menapace et al. (2024) ↑ Willi Menapace, Aliaksandr Siarohin, Stéphane Lathuilière, Panos Achlioptas, Vladislav Golyanik, Sergey Tulyakov, and Elisa Ricci. Promptable game models: Text-guided game simulation via masked diffusion models. ACM Transactions on Graphics, 43(2):1–16, January 2024. ISSN 1557-7368. doi: 10.1145/3635705. URL http://dx.doi.org/10.1145/3635705.
- Mildenhall et al. (2020) ↑ Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. In ECCV, 2020.
- Mnih et al. (2015) ↑ Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin A. Riedmiller, Andreas Kirkeby Fidjeland, Georg Ostrovski, Stig Petersen, Charlie Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518:529–533, 2015. URL https://api.semanticscholar.org/CorpusID:205242740.
- Petric & Milinkovic (2018) ↑ Danko Petric and Marija Milinkovic. Comparison between cs and jpeg in terms of image compression, 2018. URL https://arxiv.org/abs/1802.05114.
- Podell et al. (2023) ↑ Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023.
- Raffin et al. (2021) ↑ Antonin Raffin, Ashley Hill, Adam Gleave, Anssi Kanervisto, Maximilian Ernestus, and Noah Dormann. Stable-baselines3: Reliable reinforcement learning implementations. Journal of Machine Learning Research, 22(268):1–8, 2021. URL http://jmlr.org/papers/v22/20-1364.html.
- Ramesh et al. (2022) ↑ Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.
- Rombach et al. (2022) ↑ Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 10684–10695, 2022.
- Saharia et al. (2022) ↑ Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35:36479–36494, 2022.
- Salimans & Ho (2022a) ↑ Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022a. URL https://openreview.net/forum?id=TIdIXIpzhoI.
- Salimans & Ho (2022b) ↑ Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models, 2022b. URL https://arxiv.org/abs/2202.00512.
- Schulman et al. (2017) ↑ John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. CoRR, abs/1707.06347, 2017. URL http://arxiv.org/abs/1707.06347.
- Shazeer & Stern (2018) ↑ Noam Shazeer and Mitchell Stern. Adafactor: Adaptive learning rates with sublinear memory cost. CoRR, abs/1804.04235, 2018. URL http://arxiv.org/abs/1804.04235.
- Shirley & Morley (2008) ↑ P. Shirley and R.K. Morley. Realistic Ray Tracing, Second Edition. Taylor & Francis, 2008. ISBN 9781568814612. URL https://books.google.ch/books?id=knpN6mnhJ8QC.
- Song et al. (2020) ↑ Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv:2010.02502, October 2020. URL https://arxiv.org/abs/2010.02502.
- Song et al. (2022) ↑ Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models, 2022. URL https://arxiv.org/abs/2010.02502.
- Unterthiner et al. (2019) ↑ Thomas Unterthiner, Sjoerd van Steenkiste, Karol Kurach, Raphaël Marinier, Marcin Michalski, and Sylvain Gelly. FVD: A new metric for video generation. In Deep Generative Models for Highly Structured Data, ICLR 2019 Workshop, New Orleans, Louisiana, United States, May 6, 2019, 2019.
- Wang et al. (2023) ↑ Zhengyi Wang, Cheng Lu, Yikai Wang, Fan Bao, Chongxuan Li, Hang Su, and Jun Zhu. Prolificdreamer: High-fidelity and diverse text-to-3d generation with variational score distillation. arXiv preprint arXiv:2305.16213, 2023.
- Wydmuch et al. (2019) ↑ Marek Wydmuch, Michał Kempka, and Wojciech Jaśkowski. ViZDoom Competitions: Playing Doom from Pixels. IEEE Transactions on Games, 11(3):248–259, 2019. doi: 10.1109/TG.2018.2877047. The 2022 IEEE Transactions on Games Outstanding Paper Award.
- Yang et al. (2023) ↑ Mengjiao Yang, Yilun Du, Kamyar Ghasemipour, Jonathan Tompson, Dale Schuurmans, and Pieter Abbeel. Learning interactive real-world simulators. arXiv preprint arXiv:2310.06114, 2023.
- Yin et al. (2024) ↑ Tianwei Yin, Michaël Gharbi, Richard Zhang, Eli Shechtman, Frédo Durand, William T Freeman, and Taesung Park. One-step diffusion with distribution matching distillation. In CVPR, 2024.
- Zhang et al. (2018) ↑ Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.
Appendix A Appendix
A.1 Samples
[[File:./Diffusion%20Models%20Are%20Real-Time%20Game%20Engines_files/context_1.png|frame|none|548x103px|class=ltx_graphics ltx_centering ltx_figure_panel ltx_img_landscape|alt=|caption Figure 8: Auto-regressive evaluation of the simulation model: Sample #1. Top row: Context frames. Middle row: Ground truth frames. Bottom row: Model predictions.]]
[[File:./Diffusion%20Models%20Are%20Real-Time%20Game%20Engines_files/context_3.png|frame|none|548x103px|class=ltx_graphics ltx_centering ltx_figure_panel ltx_img_landscape|alt=|caption Figure 9: Auto-regressive evaluation of the simulation model: Sample #2. Top row: Context frames. Middle row: Ground truth frames. Bottom row: Model predictions.]]
[[File:./Diffusion%20Models%20Are%20Real-Time%20Game%20Engines_files/context_5.png|frame|none|548x103px|class=ltx_graphics ltx_centering ltx_figure_panel ltx_img_landscape|alt=|caption Figure 10: Auto-regressive evaluation of the simulation model: Sample #3. Top row: Context frames. Middle row: Ground truth frames. Bottom row: Model predictions.]]
[[File:./Diffusion%20Models%20Are%20Real-Time%20Game%20Engines_files/context_7.png|frame|none|548x103px|class=ltx_graphics ltx_centering ltx_figure_panel ltx_img_landscape|alt=|caption Figure 11: Auto-regressive evaluation of the simulation model: Sample #4. Top row: Context frames. Middle row: Ground truth frames. Bottom row: Model predictions.]]
Report issue for preceding element
A.2 Fine-Tuning Latent Decoder Examples
Fig. 12 demonstrates the effect of fine-tuning the VAE decoder.
[[File:./Diffusion%20Models%20Are%20Real-Time%20Game%20Engines_files/fine_tuning_autoencoder.png|frame|none|548x742px|class=ltx_graphics ltx_centering ltx_img_portrait|alt=|caption Figure 12: A comparison of generations with the standard latent decoder from Stable Diffusion v1.4 (Left), our fine-tuned decoder (Middle), and ground truth (Right). Artifacts in the frozen decoder are noticeable (e.g. in the numbers in the bottom HUD).]]
A.3 Reward Function
The RL-agent’s reward function, the only part of our method which is specific to the game Doom, is a sum of the following conditions:
1. Player hit: -100 points.
2. Player death: -5,000 points.
3. Enemy hit: 300 points.
4. Enemy kill: 1,000 points.
5. Item/weapon pick up: 100 points.
6. Secret found: 500 points.
7. New area: 20 * (1 + 0.5 * distance) points.
8. Health delta: 10 * delta points.
9. Armor delta: 10 * delta points.
10. Ammo delta: 10 * max(0, delta) + min(0, delta) points.
Further, to encourage the agent to simulate smooth human play, we apply each agent action for 4 frames and additionally artificially increase the probability of repeating the previous action.
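The reward terms listed above can be sketched in Python as follows. This is an illustrative sketch rather than the authors' code: the `StepEvents` container, its field names, and the choice to count each event once per occurrence within a step are assumptions; only the point values come from the list.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class StepEvents:
    """Hypothetical summary of what happened during one agent step."""
    player_hits: int = 0
    player_died: bool = False
    enemy_hits: int = 0
    enemy_kills: int = 0
    pickups: int = 0                              # items or weapons picked up
    secrets_found: int = 0
    new_area_distances: Tuple[float, ...] = ()    # one entry per newly reached area
    health_delta: float = 0.0
    armor_delta: float = 0.0
    ammo_delta: float = 0.0


def step_reward(ev: StepEvents) -> float:
    """Sum the ten reward terms for a single step."""
    reward = 0.0
    reward -= 100.0 * ev.player_hits
    reward -= 5000.0 if ev.player_died else 0.0
    reward += 300.0 * ev.enemy_hits
    reward += 1000.0 * ev.enemy_kills
    reward += 100.0 * ev.pickups
    reward += 500.0 * ev.secrets_found
    reward += sum(20.0 * (1.0 + 0.5 * d) for d in ev.new_area_distances)
    reward += 10.0 * ev.health_delta
    reward += 10.0 * ev.armor_delta
    # Ammo gained is weighted by 10; ammo spent costs 1 point per unit.
    reward += 10.0 * max(0.0, ev.ammo_delta) + min(0.0, ev.ammo_delta)
    return reward
```

Under these assumptions, killing an enemy while spending three units of ammo yields 1000 - 3 = 997 points, which illustrates how the asymmetric ammo term penalizes shooting only lightly.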
A.4 Reducing Inference Steps
We evaluated the performance of a GameNGen model with varying numbers of sampling steps when generating 2048 frames using teacher-forced trajectories on 35 FPS data (the maximal sampling rate allowed by ViZDoom, which is lower than the maximal rate our model achieves with distillation; see below). Surprisingly, we observe that quality does not deteriorate when decreasing the number of steps to 4, but does deteriorate when using just a single sampling step (see Table 3).
As a potential remedy, we experimented with distilling our model, following Wang et al. (2023) and Yin et al. (2024). During distillation training we use three U-Nets, all initialized with a GameNGen model: a generator, a teacher, and a fake-score model. The teacher remains frozen throughout training. The fake-score model is continuously trained to predict the outputs of the generator with the standard diffusion loss. To train the generator, we use the teacher and the fake-score model to predict the noise added to an input image, $\epsilon_{\text{real}}$ and $\epsilon_{\text{fake}}$ respectively. We optimize the weights of the generator to minimize the generator gradient value at each pixel weighted by $\epsilon_{\text{real}} - \epsilon_{\text{fake}}$. When distilling we use a CFG of 1.5 to generate $\epsilon_{\text{real}}$. We train for 1000 steps with a batch size of 128. Note that unlike Yin et al. (2024) we train with varying amounts of noise and do not use a regularization loss (we hope to explore other distillation variants in future work). With distillation we are able to significantly improve the quality of a 1-step model (see “D” in Table 3), enabling running the game at 50 FPS, albeit with a small impact on quality.
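The per-pixel quantities in the generator update can be sketched as below, writing the teacher's and the fake-score model's noise predictions as `eps_real` and `eps_fake`. This is a minimal sketch under stated assumptions, not the training code: it assumes the standard classifier-free guidance combination for the teacher, takes the per-pixel weight to be the difference of the two predictions as in distribution matching distillation, represents images as flat lists, and replaces the U-Net calls with precomputed predictions. All function names are illustrative.

```python
from typing import List, Sequence

CFG_SCALE = 1.5  # guidance scale used for the teacher during distillation


def cfg_noise(eps_uncond: Sequence[float], eps_cond: Sequence[float],
              scale: float = CFG_SCALE) -> List[float]:
    """Classifier-free guidance: extrapolate from the unconditional
    prediction towards the conditional one."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


def generator_pixel_weights(eps_real: Sequence[float],
                            eps_fake: Sequence[float]) -> List[float]:
    """Per-pixel weight eps_real - eps_fake, treated as a constant
    (no gradient is propagated through either prediction)."""
    return [r - f for r, f in zip(eps_real, eps_fake)]


def surrogate_loss(generated: Sequence[float],
                   weights: Sequence[float]) -> float:
    """Scalar whose gradient with respect to each generated pixel equals
    that pixel's weight, up to the 1/N normalisation."""
    return sum(g * w for g, w in zip(generated, weights)) / len(generated)
```

In an autodiff framework the weights would be wrapped in a stop-gradient, so that minimizing the surrogate only moves the generator output and leaves the teacher and fake-score model untouched by this loss.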
Table 3: Generation with Varying Sampling Steps. We evaluate the generation quality of a GameNGen model with an increasing number of steps using PSNR and LPIPS metrics. “D” marks a 1-step distilled model.
Steps | PSNR | LPIPS |
---|---|---|
D | | |
1 | | |
2 | | |
4 | | |
8 | | |
16 | | |
32 | | |
64 | | |
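For reference, PSNR between a generated frame and its ground-truth frame can be computed as below. This is a sketch of the standard definition, assuming 8-bit pixel values (peak 255) given as flat sequences; it is not the evaluation code behind Table 3, and LPIPS is omitted because it requires a pretrained network.

```python
import math
from typing import Sequence


def psnr(pred: Sequence[float], target: Sequence[float],
         peak: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB; higher means closer to the target."""
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixel values.
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0.0:
        return math.inf  # identical frames
    return 10.0 * math.log10(peak ** 2 / mse)
```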
A.5 Agent vs Random Policy
Figure 13 shows the PSNR values compared to ground truth for a model trained on the RL-agent’s data and a model trained on data from a random policy, after 3 seconds of auto-regressive generation, for a short session of human play. We observe that the model trained on the agent’s data is sometimes comparable to, and sometimes much better than, the model trained on random-policy data.
[[File:./Diffusion%20Models%20Are%20Real-Time%20Game%20Engines_files/psnr_over_trajectory.png|frame|none|548x284px|class=ltx_graphics ltx_centering ltx_img_landscape|alt=|caption Figure 13: The PSNR values compared to ground truth for the agent (orange) and random (blue) after 3 seconds of auto-regressive generation for a short session of human play, smoothed with an EMA factor of 0.05.]]
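The smoothing applied in Figure 13 can be sketched as an exponential moving average. The sketch assumes that the factor 0.05 is the weight given to each new PSNR value (the opposite convention, where the factor weights the running average, is also common) and that the average is initialised with the first value; neither detail is stated in the text.

```python
from typing import Iterable, List


def ema_smooth(values: Iterable[float], alpha: float = 0.05) -> List[float]:
    """Exponential moving average with weight `alpha` on each new value."""
    smoothed: List[float] = []
    for v in values:
        if not smoothed:
            smoothed.append(float(v))  # start from the first observation
        else:
            smoothed.append(alpha * v + (1.0 - alpha) * smoothed[-1])
    return smoothed
```

With a small `alpha` the curve reacts slowly to individual frames, which makes the gap between the two models easier to read over a long trajectory.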
A.6 Human Eval Tool
Figure 14 depicts a screenshot of the tool used for the human evaluations (Section 5.1).
[[File:./Diffusion%20Models%20Are%20Real-Time%20Game%20Engines_files/eval_tool.png|frame|none|548x266px|class=ltx_graphics ltx_centering ltx_img_landscape|alt=|caption Figure 14: A screenshot of the tool used for human evaluations (see Section 5.1).]]