Translations:Diffusion Models Are Real-Time Game Engines/48/es: Difference between revisions

Latest revision as of 03:22, 7 September 2024

Message definition (Diffusion Models Are Real-Time Game Engines)

The agent model is trained using PPO (Schulman et al., [https://arxiv.org/html/2408.14837v1#bib.bib30 2017]), with a simple CNN as the feature network, following Mnih et al. ([https://arxiv.org/html/2408.14837v1#bib.bib21 2015]). It is trained on CPU using the Stable Baselines 3 infrastructure (Raffin et al., [https://arxiv.org/html/2408.14837v1#bib.bib24 2021]). The agent is provided with downscaled versions of the frame images and in-game map, each at resolution 160x120. The agent also has access to the last 32 actions it performed. The feature network computes a representation of size 512 for each image. PPO’s actor and critic are 2-layer MLP heads on top of a concatenation of the outputs of the image feature network and the sequence of past actions. We train the agent to play the game using the Vizdoom environment (Wydmuch et al., [https://arxiv.org/html/2408.14837v1#bib.bib37 2019]). We run 8 games in parallel, each with a replay buffer size of 512, a discount factor <math>\gamma = 0.99</math>, and an entropy coefficient of <math>0.1</math>. In each iteration, the network is trained using a batch size of 64 for 10 epochs, with a learning rate of 1e-4. We perform a total of 10M environment steps.

The agent model is trained with PPO (Schulman et al., 2017), using a simple CNN as the feature network, following Mnih et al. (2015). It is trained on CPU using the Stable Baselines 3 infrastructure (Raffin et al., 2021). The agent receives downscaled versions of the frame images and of the in-game map, each at a resolution of 160x120. The agent also has access to the last 32 actions it performed. The feature network computes a representation of size 512 for each image. PPO's actor and critic are 2-layer MLP heads on top of a concatenation of the image feature network outputs and the sequence of past actions. We train the agent to play the game using the Vizdoom environment (Wydmuch et al., 2019). We run 8 games in parallel, each with a replay buffer size of 512, a discount factor of $\gamma = 0.99$, and an entropy coefficient of $0.1$. In each iteration, the network is trained with a batch size of 64 for 10 epochs, at a learning rate of 1e-4. We perform a total of 10 million environment steps.
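
The hyperparameters above imply a fixed amount of data and optimisation work per PPO iteration. The following is a minimal sketch of that arithmetic, not the authors' training code. It assumes that the "replay buffer size of 512" is the per-environment rollout length (the `n_steps` argument of Stable Baselines 3's `PPO`), which the text does not state explicitly. The class `AgentTrainingConfig` and its field names are hypothetical and introduced only for illustration. The keys returned by `ppo_kwargs` match `PPO` constructor argument names, but the library itself is not imported here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentTrainingConfig:
    """Hypothetical container for the hyperparameters quoted in the text."""

    n_envs: int = 8                  # games run in parallel
    n_steps: int = 512               # ASSUMPTION: "replay buffer size" = rollout length per env
    batch_size: int = 64             # minibatch size
    n_epochs: int = 10               # passes over each rollout
    gamma: float = 0.99              # discount factor
    ent_coef: float = 0.1            # entropy coefficient
    learning_rate: float = 1e-4
    total_timesteps: int = 10_000_000
    n_images: int = 2                # frame image + in-game map
    image_feature_dim: int = 512     # feature size per image
    n_past_actions: int = 32         # action history given to the agent

    @property
    def transitions_per_iteration(self) -> int:
        # Each iteration collects n_steps transitions from every parallel game.
        return self.n_envs * self.n_steps

    @property
    def minibatches_per_epoch(self) -> int:
        return self.transitions_per_iteration // self.batch_size

    @property
    def gradient_steps_per_iteration(self) -> int:
        return self.minibatches_per_epoch * self.n_epochs

    @property
    def iterations(self) -> int:
        # Ceiling division: the last rollout may overshoot the step budget.
        return -(-self.total_timesteps // self.transitions_per_iteration)

    @property
    def concatenated_image_features(self) -> int:
        # Size of the image part of the input to the actor/critic heads.
        # The encoding of the 32 past actions is not specified in the text,
        # so it is deliberately left out of this count.
        return self.n_images * self.image_feature_dim

    def ppo_kwargs(self) -> dict:
        # Keyword names follow Stable Baselines 3's PPO constructor.
        return {
            "n_steps": self.n_steps,
            "batch_size": self.batch_size,
            "n_epochs": self.n_epochs,
            "gamma": self.gamma,
            "ent_coef": self.ent_coef,
            "learning_rate": self.learning_rate,
        }


cfg = AgentTrainingConfig()
print("transitions per iteration:", cfg.transitions_per_iteration)
print("gradient steps per iteration:", cfg.gradient_steps_per_iteration)
print("iterations for the step budget:", cfg.iterations)
print("image features fed to the heads:", cfg.concatenated_image_features)
```

Under this reading, each iteration gathers 8 × 512 = 4096 transitions and performs 64 minibatch updates per epoch, so 640 gradient steps per iteration. The 10 million step budget then corresponds to roughly 2442 iterations. If the buffer size instead refers to a total across all games, these figures would shrink by a factor of 8.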