Translations:Diffusion Models Are Real-Time Game Engines/85/zh

Several works attempted to train models for game simulation with action inputs. Yang et al. ([https://arxiv.org/html/2408.14837v1#bib.bib38 2023]) build a diverse dataset of real-world and simulated videos and train a diffusion model to predict a continuation video given a previous video segment and a textual description of an action. Menapace et al. ([https://arxiv.org/html/2408.14837v1#bib.bib18 2021]) and Bruce et al. ([https://arxiv.org/html/2408.14837v1#bib.bib7 2024]) focus on unsupervised learning of actions from videos. Menapace et al. ([https://arxiv.org/html/2408.14837v1#bib.bib19 2024]) convert textual prompts to game states, which are later converted to a 3D representation using NeRF. Unlike these works, we focus on ''interactive playable real-time simulation'', and demonstrate robustness over long-horizon trajectories. We leverage an RL agent to explore the game environment and create rollouts of observations and interactions for training our interactive game model. Another line of work explored learning a predictive model of the environment and using it to train an RL agent. Ha & Schmidhuber ([https://arxiv.org/html/2408.14837v1#bib.bib10 2018]) train a Variational Auto-Encoder (Kingma & Welling, [https://arxiv.org/html/2408.14837v1#bib.bib17 2014]) to encode game frames into a latent vector, and then use an RNN to mimic the VizDoom game environment, training on rollouts from a random policy (i.e., selecting an action at random). A controller policy is then learned by playing within the “hallucinated” environment. Hafner et al. ([https://arxiv.org/html/2408.14837v1#bib.bib11 2020]) demonstrate that an RL agent can be trained entirely on episodes generated by a learned world model in latent space. Also close to our work is Kim et al. ([https://arxiv.org/html/2408.14837v1#bib.bib16 2020]), which uses an LSTM architecture to model the world state, coupled with a convolutional decoder for producing output frames, jointly trained under an adversarial objective. While this approach seems to produce reasonable results for simple games like PacMan, it struggles with simulating the complex environment of VizDoom and produces blurry samples. In contrast, GameNGen is able to generate samples comparable to those of the original game; see Figure [https://arxiv.org/html/2408.14837v1#S1.F2 2]. Finally, concurrently with our work, Alonso et al. ([https://arxiv.org/html/2408.14837v1#bib.bib2 2024]) train a diffusion world model to predict the next observation given the observation history, and iteratively train the world model and an RL model on Atari games.
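The world-model recipe attributed above to Ha & Schmidhuber (frame encoder, action-conditioned RNN dynamics, controller trained inside the “hallucinated” environment) can be made concrete with a small sketch. The PyTorch snippet below is an illustrative assumption of how such a pipeline could be wired together, not code from any of the cited works; all module sizes, class names, and design choices here are hypothetical.

<syntaxhighlight lang="python">
# A minimal, hypothetical sketch (not code from the cited papers) of a
# "world model" pipeline in the spirit of Ha & Schmidhuber (2018):
# frames are encoded into a latent vector, an RNN conditioned on actions
# mimics the environment in latent space, and a controller acts inside
# the "hallucinated" rollouts. All sizes and names are illustrative.
import torch
import torch.nn as nn

LATENT, N_ACTIONS, HIDDEN = 32, 4, 256

class FrameEncoder(nn.Module):
    """Stand-in for the VAE encoder: maps a game frame to a latent vector."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 4, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 4, stride=2), nn.ReLU(),
            nn.Flatten(), nn.LazyLinear(LATENT))

    def forward(self, frame):                      # frame: (B, 3, 64, 64)
        return self.net(frame)                     # latent: (B, LATENT)

class LatentDynamics(nn.Module):
    """RNN world model: predicts the next latent from the current latent and action."""
    def __init__(self):
        super().__init__()
        self.cell = nn.GRUCell(LATENT + N_ACTIONS, HIDDEN)
        self.head = nn.Linear(HIDDEN, LATENT)

    def forward(self, z, action_onehot, h):
        h = self.cell(torch.cat([z, action_onehot], dim=-1), h)
        return self.head(h), h

class Controller(nn.Module):
    """Small policy trained entirely inside the learned ("dreamed") environment."""
    def __init__(self):
        super().__init__()
        self.policy = nn.Linear(LATENT + HIDDEN, N_ACTIONS)

    def forward(self, z, h):
        return torch.distributions.Categorical(
            logits=self.policy(torch.cat([z, h], dim=-1)))

# One "hallucinated" rollout: after the first frame, every observation comes
# from the learned dynamics model rather than the real game.
enc, dyn, ctrl = FrameEncoder(), LatentDynamics(), Controller()
z = enc(torch.randn(1, 3, 64, 64))                 # encode an initial (random) frame
h = torch.zeros(1, HIDDEN)
for _ in range(10):
    action = ctrl(z, h).sample()                   # controller picks an action
    onehot = nn.functional.one_hot(action, N_ACTIONS).float()
    z, h = dyn(z, onehot, h)                       # world model imagines the next state
</syntaxhighlight>

In this framing, GameNGen differs in that the generative model itself (a diffusion model over frames, conditioned on past frames and actions) is the playable simulator, rather than a latent dynamics model used only to train a policy.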
