All translations

Found 3 translations.

Name	Current message text
English (en)	Several works attempted to train models for game simulation with action inputs. Yang et al. ([https://arxiv.org/html/2408.14837v1#bib.bib38 2023]) build a diverse dataset of real-world and simulated videos and train a diffusion model to predict a continuation video given a previous video segment and a textual description of an action. Menapace et al. ([https://arxiv.org/html/2408.14837v1#bib.bib18 2021]) and Bruce et al. ([https://arxiv.org/html/2408.14837v1#bib.bib7 2024]) focus on unsupervised learning of actions from videos. Menapace et al. ([https://arxiv.org/html/2408.14837v1#bib.bib19 2024]) convert textual prompts to game states, which are later converted to a 3D representation using NeRF. Unlike these works, we focus on ''interactive playable real-time simulation'', and demonstrate robustness over long-horizon trajectories. We leverage an RL agent to explore the game environment and create rollouts of observations and interactions for training our interactive game model. Another line of work explored learning a predictive model of the environment and using it to train an RL agent. Ha & Schmidhuber ([https://arxiv.org/html/2408.14837v1#bib.bib10 2018]) train a Variational Auto-Encoder (Kingma & Welling, [https://arxiv.org/html/2408.14837v1#bib.bib17 2014]) to encode game frames into a latent vector, and then use an RNN to mimic the VizDoom game environment, training on random rollouts from a random policy (i.e., selecting an action at random). A controller policy is then learned by playing within the “hallucinated” environment. Hafner et al. ([https://arxiv.org/html/2408.14837v1#bib.bib11 2020]) demonstrate that an RL agent can be trained entirely on episodes generated by a learned world model in latent space. Also close to our work is Kim et al. ([https://arxiv.org/html/2408.14837v1#bib.bib16 2020]), which uses an LSTM architecture to model the world state, coupled with a convolutional decoder that produces output frames, with both trained jointly under an adversarial objective. While this approach seems to produce reasonable results for simple games like PacMan, it struggles to simulate the complex environment of VizDoom and produces blurry samples. In contrast, GameNGen is able to generate samples comparable to those of the original game; see Figure [https://arxiv.org/html/2408.14837v1#S1.F2 2]. Finally, concurrently with our work, Alonso et al. ([https://arxiv.org/html/2408.14837v1#bib.bib2 2024]) train a diffusion world model to predict the next observation given the observation history, and iteratively train the world model and an RL model on Atari games.
Spanish (es)	Several works have attempted to train models for game simulation with action inputs. Yang et al. ([https://arxiv.org/html/2408.14837v1#bib.bib38 2023]) built a diverse dataset of real-world and simulated videos and trained a diffusion model to predict a continuation video given a previous video segment and a textual description of an action. Menapace et al. ([https://arxiv.org/html/2408.14837v1#bib.bib18 2021]) and Bruce et al. ([https://arxiv.org/html/2408.14837v1#bib.bib7 2024]) focus on unsupervised learning of actions from videos. Menapace et al. ([https://arxiv.org/html/2408.14837v1#bib.bib19 2024]) convert textual prompts into game states, which are subsequently converted into a 3D representation using NeRF. Unlike these works, we focus on "interactive playable real-time simulation", and demonstrate robustness over long-horizon trajectories. We leverage an RL agent to explore the game environment and create rollouts of observations and interactions to train our interactive game model. Another line of work explored learning a predictive model of the environment and using it to train an RL agent. Ha & Schmidhuber ([https://arxiv.org/html/2408.14837v1#bib.bib10 2018]) trained a variational autoencoder (Kingma & Welling, [https://arxiv.org/html/2408.14837v1#bib.bib17 2014]) to encode game frames into a latent vector and then used an RNN to imitate the VizDoom game environment, training on random rollouts from a random policy (i.e., selecting an action at random). A controller policy was then learned by playing inside the "simulated" environment. Hafner et al. ([https://arxiv.org/html/2408.14837v1#bib.bib11 2020]) demonstrated that an RL agent can be trained entirely on episodes generated by a learned world model in latent space. Also close to our work is Kim et al. ([https://arxiv.org/html/2408.14837v1#bib.bib16 2020]), who used an LSTM architecture to model the world state, coupled with a convolutional decoder to produce output frames and trained jointly under an adversarial objective. Although this approach appears to produce reasonable results for simple games such as PacMan, it struggles to simulate the complex environment of VizDoom and produces blurry samples. In contrast, GameNGen is able to generate samples comparable to those of the original game; see Figure [https://arxiv.org/html/2408.14837v1#S1.F2 2]. Finally, concurrently with our work, Alonso et al. ([https://arxiv.org/html/2408.14837v1#bib.bib2 2024]) trained a diffusion world model to predict the next observation given the observation history, and iteratively trained the world model and an RL model on Atari games.
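Chinese (zh)	Several studies have attempted to use action inputs to train game simulation models. Yang et al. ([https://arxiv.org/html/2408.14837v1#bib.bib38 2023]) built a diverse dataset containing real-world and simulated videos, and trained a diffusion model to predict the subsequent video from the previous video segment and a textual description of an action. Menapace et al. ([https://arxiv.org/html/2408.14837v1#bib.bib18 2021]) and Bruce et al. ([https://arxiv.org/html/2408.14837v1#bib.bib7 2024]) focus on unsupervised learning of actions from videos. Menapace et al. ([https://arxiv.org/html/2408.14837v1#bib.bib19 2024]) convert text prompts into game states and then use NeRF to convert them into a 3D representation. Unlike these studies, we focus on "interactive playable real-time simulation" and demonstrate robustness over long-horizon trajectories. We use a reinforcement learning agent to explore the game environment and create trajectories of observations and interactions to train our interactive game model. Another line of research explored learning a predictive model of the environment and using it to train reinforcement learning agents. Ha and Schmidhuber ([https://arxiv.org/html/2408.14837v1#bib.bib10 2018]) trained a variational autoencoder (Kingma & Welling, [https://arxiv.org/html/2408.14837v1#bib.bib17 2014]) to encode game frames into latent vectors, then used an RNN to simulate the VizDoom game environment, training on random trajectories from a random policy (i.e., choosing actions at random). A controller policy was then learned by playing in the "imagined" environment. Hafner et al. ([https://arxiv.org/html/2408.14837v1#bib.bib11 2020]) showed that a reinforcement learning agent can be trained entirely on episodes generated by a learned world model in latent space. Also close to our work is Kim et al. ([https://arxiv.org/html/2408.14837v1#bib.bib16 2020]), who use an LSTM architecture to model the world state, combined with a convolutional decoder that generates output frames, trained jointly under an adversarial objective. Although this method appears to give reasonable results for simple games such as PacMan, it produces blurry samples when simulating the complex environment of VizDoom. In contrast, GameNGen can generate samples comparable to those of the original game; see Figure [https://arxiv.org/html/2408.14837v1#S1.F2 2]. Finally, concurrent with our work, Alonso et al. ([https://arxiv.org/html/2408.14837v1#bib.bib2 2024]) trained a diffusion world model that predicts the next observation from the observation history, and iteratively trained the world model and a reinforcement learning model on Atari games.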
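The second line of work summarized in the message (fit a predictive model of the environment on rollouts from a random policy, then learn a controller only inside that learned model) can be illustrated with a deliberately tiny sketch. Everything in it is a hypothetical stand-in, not the method of any cited paper or of GameNGen: a five-state corridor replaces VizDoom, a lookup table of observed transitions replaces the VAE + RNN world model, and tabular Q-learning replaces whatever optimizer trains the controller. All function names are illustrative. The one property it demonstrates is that the controller is trained without calling the real environment, which is used again only for the final evaluation.

```python
import random
from collections import Counter, defaultdict

# Hypothetical toy "game": a corridor with states 0..4 and the goal at the right end.
N_STATES = 5
GOAL = N_STATES - 1
ACTIONS = (0, 1)  # 0 = move left, 1 = move right


def real_step(state, action):
    """Real environment transition (stand-in for the actual game)."""
    nxt = max(0, state - 1) if action == 0 else min(GOAL, state + 1)
    reward = 1.0 if nxt == GOAL else 0.0
    return nxt, reward, nxt == GOAL


def collect_random_rollouts(rng, episodes=200, max_steps=50):
    """Count observed (next_state, reward, done) outcomes per (s, a) under a uniformly random policy."""
    counts = defaultdict(Counter)
    for _ in range(episodes):
        s = 0
        for _ in range(max_steps):
            a = rng.choice(ACTIONS)
            s2, r, done = real_step(s, a)
            counts[(s, a)][(s2, r, done)] += 1
            s = s2
            if done:
                break
    return counts


def fit_model(counts):
    """Learned predictive model: the most frequent observed outcome for each (s, a)."""
    return {sa: c.most_common(1)[0][0] for sa, c in counts.items()}


def model_step(model, state, action):
    # Assumption: an unseen (s, a) pair leaves the state unchanged with zero reward.
    return model.get((state, action), (state, 0.0, False))


def greedy(q, s, rng):
    """Greedy action with random tie-breaking, so untrained states are explored."""
    best = max(q[s])
    return rng.choice([a for a in ACTIONS if q[s][a] == best])


def train_in_model(model, rng, episodes=500, max_steps=50, alpha=0.5, gamma=0.9, eps=0.1):
    """Epsilon-greedy tabular Q-learning that only queries the learned model."""
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        s = 0
        for _ in range(max_steps):
            a = rng.choice(ACTIONS) if rng.random() < eps else greedy(q, s, rng)
            s2, r, done = model_step(model, s, a)
            target = r if done else r + gamma * max(q[s2])
            q[s][a] += alpha * (target - q[s][a])
            s = s2
            if done:
                break
    return q


def evaluate_in_real_env(q, max_steps=50):
    """Run the greedy policy in the real environment; return steps to the goal, or None."""
    s = 0
    for t in range(1, max_steps + 1):
        a = max(ACTIONS, key=lambda act: q[s][act])
        s, _, done = real_step(s, a)
        if done:
            return t
    return None


rng = random.Random(0)
model = fit_model(collect_random_rollouts(rng))
q = train_in_model(model, rng)
policy = [max(ACTIONS, key=lambda act: q[s][act]) for s in range(GOAL)]
print("greedy policy (0=left, 1=right):", policy)
print("steps to goal in real env:", evaluate_in_real_env(q))
```

Because this toy environment is deterministic and the random rollouts cover every state-action pair, the learned table is exact. With real game frames the learned model is only approximate, so its errors (such as the blurry samples mentioned for Kim et al.) carry over into anything trained or played inside it.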