Translations:Diffusion Models Are Real-Time Game Engines/85/en: Difference between revisions

Several works have attempted to train models for game simulation with action inputs. Yang et al. ([https://arxiv.org/html/2408.14837v1#bib.bib38 2023]) build a diverse dataset of real-world and simulated videos and train a diffusion model to predict a continuation video given a previous video segment and a textual description of an action. Menapace et al. ([https://arxiv.org/html/2408.14837v1#bib.bib18 2021]) and Bruce et al. ([https://arxiv.org/html/2408.14837v1#bib.bib7 2024]) focus on unsupervised learning of actions from videos. Menapace et al. ([https://arxiv.org/html/2408.14837v1#bib.bib19 2024]) convert textual prompts to game states, which are then rendered as a 3D representation using NeRF. Unlike these works, we focus on ''interactive, playable, real-time simulation'' and demonstrate robustness over long-horizon trajectories. We leverage an RL agent to explore the game environment and create rollouts of observations and interactions for training our interactive game model. Another line of work explores learning a predictive model of the environment and using it to train an RL agent. Ha & Schmidhuber ([https://arxiv.org/html/2408.14837v1#bib.bib10 2018]) train a Variational Auto-Encoder (Kingma & Welling, [https://arxiv.org/html/2408.14837v1#bib.bib17 2014]) to encode game frames into a latent vector, then use an RNN to mimic the VizDoom game environment, training on rollouts from a random policy (i.e., one that selects an action at random). A controller policy is then learned by playing within the “hallucinated” environment. Hafner et al. ([https://arxiv.org/html/2408.14837v1#bib.bib11 2020]) demonstrate that an RL agent can be trained entirely on episodes generated by a learned world model in latent space. Also close to our work is Kim et al. ([https://arxiv.org/html/2408.14837v1#bib.bib16 2020]), who use an LSTM architecture to model the world state, coupled with a convolutional decoder that produces output frames, jointly trained under an adversarial objective. While this approach produces reasonable results for simple games like Pac-Man, it struggles to simulate the complex environment of VizDoom and produces blurry samples. In contrast, GameNGen is able to generate samples comparable to those of the original game; see Figure [https://arxiv.org/html/2408.14837v1#S1.F2 2]. Finally, concurrently with our work, Alonso et al. ([https://arxiv.org/html/2408.14837v1#bib.bib2 2024]) train a diffusion world model to predict the next observation given the observation history, and iteratively train the world model and an RL model on Atari games.
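The world-model pipeline attributed above to Ha & Schmidhuber (2018) can be sketched structurally: a VAE encoder maps frames to latents, an RNN predicts the next latent from the current latent and action, and a small controller then acts entirely inside this "hallucinated" environment. The sketch below uses NumPy with random, untrained placeholder weights and assumed dimensions, so it illustrates only the data flow, not the trained system.

```python
# Minimal NumPy sketch of the world-model pipeline (Ha & Schmidhuber, 2018):
# VAE encoder -> latent; RNN -> next latent; linear controller acts in the
# "dream". All weights and dimensions are illustrative placeholders.
import numpy as np

rng = np.random.default_rng(0)
FRAME_DIM, LATENT_DIM, HIDDEN_DIM, N_ACTIONS = 64 * 64, 32, 64, 4

# Stand-ins for the learned components (assumed shapes, untrained).
W_enc = rng.normal(0, 0.01, (LATENT_DIM, FRAME_DIM))   # VAE encoder mean head
W_rnn = rng.normal(0, 0.1, (HIDDEN_DIM, HIDDEN_DIM + LATENT_DIM + N_ACTIONS))
W_out = rng.normal(0, 0.1, (LATENT_DIM, HIDDEN_DIM))   # hidden -> next latent
W_ctrl = rng.normal(0, 0.1, (N_ACTIONS, LATENT_DIM + HIDDEN_DIM))  # controller

def encode(frame):
    """VAE encoder stand-in: flattened frame -> latent mean."""
    return W_enc @ frame

def rnn_step(h, z, a_onehot):
    """One world-model RNN step: next hidden state and predicted next latent."""
    h_next = np.tanh(W_rnn @ np.concatenate([h, z, a_onehot]))
    return h_next, W_out @ h_next

def controller(z, h):
    """Linear policy over (latent, hidden state)."""
    return int(np.argmax(W_ctrl @ np.concatenate([z, h])))

# "Hallucinated" rollout: after encoding one real frame, the RNN's own
# predictions replace the game environment entirely.
z = encode(rng.random(FRAME_DIM))
h = np.zeros(HIDDEN_DIM)
for _ in range(10):
    a = controller(z, h)
    h, z = rnn_step(h, z, np.eye(N_ACTIONS)[a])

print(z.shape, h.shape)  # latent and hidden state after 10 dream steps
```

In the actual method the controller's weights would be optimized (e.g., by an evolutionary strategy) against rewards accumulated inside such dream rollouts, which is what makes training without the real environment possible.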

    Latest revision as of 03:06, 7 September 2024

