All translations

Found 3 translations.

Name	Current message text
English (en)	The agent model is trained using PPO (Schulman et al., [https://arxiv.org/html/2408.14837v1#bib.bib30 2017]), with a simple CNN as the feature network, following Mnih et al. ([https://arxiv.org/html/2408.14837v1#bib.bib21 2015]). It is trained on CPU using the Stable Baselines 3 infrastructure (Raffin et al., [https://arxiv.org/html/2408.14837v1#bib.bib24 2021]). The agent receives downscaled versions of the frame images and the in-game map, each at a resolution of 160x120. The agent also has access to the last 32 actions it performed. The feature network computes a representation of size 512 for each image. PPO’s actor and critic are 2-layer MLP heads on top of a concatenation of the image feature network outputs and the sequence of past actions. We train the agent to play the game using the ViZDoom environment (Wydmuch et al., [https://arxiv.org/html/2408.14837v1#bib.bib37 2019]). We run 8 games in parallel, each with a replay buffer size of 512, a discount factor <math>\gamma = 0.99</math>, and an entropy coefficient of <math>0.1</math>. In each iteration, the network is trained using a batch size of 64 for 10 epochs, with a learning rate of 1e-4. We perform a total of 10M environment steps.
Spanish (es)	El modelo de agente se entrena utilizando PPO (Schulman et al., [https://arxiv.org/html/2408.14837v1#bib.bib30 2017]), con una CNN simple como red de características, siguiendo a Mnih et al. ([https://arxiv.org/html/2408.14837v1#bib.bib21 2015]). Se entrena en CPU utilizando la infraestructura de Stable Baselines 3 (Raffin et al., [https://arxiv.org/html/2408.14837v1#bib.bib24 2021]). Al agente se le proporcionan versiones reducidas de las imágenes de los fotogramas y del mapa del juego, cada una con una resolución de 160x120. El agente también tiene acceso a las últimas 32 acciones que realizó. La red de características calcula una representación de tamaño 512 para cada imagen. El actor y el crítico de PPO son cabezas MLP de 2 capas sobre una concatenación de las salidas de la red de características de la imagen y la secuencia de acciones pasadas. Entrenamos al agente para que juegue utilizando el entorno de ViZDoom (Wydmuch et al., [https://arxiv.org/html/2408.14837v1#bib.bib37 2019]). Ejecutamos 8 juegos en paralelo, cada uno con un tamaño de búfer de repetición de 512, un factor de descuento <math>\gamma = 0.99</math>, y un coeficiente de entropía de <math>0.1</math>. En cada iteración, la red se entrena utilizando un tamaño de lote de 64 durante 10 épocas, con una tasa de aprendizaje de 1e-4. Realizamos un total de 10 millones de pasos de entorno.
Chinese (zh)	代理模型使用 PPO（Schulman 等人，[https://arxiv.org/html/2408.14837v1#bib.bib30 2017]）进行训练，采用简单的 CNN 作为特征网络，基于 Mnih 等人（[https://arxiv.org/html/2408.14837v1#bib.bib21 2015]）的方法。在 CPU 上使用 Stable Baselines 3 基础架构（Raffin 等人，[https://arxiv.org/html/2408.14837v1#bib.bib24 2021]）进行训练。代理接收缩小后的帧图像和游戏地图，每个分辨率为 160x120。代理还可以访问其最近执行的 32 次动作。特征网络为每幅图像计算出大小为 512 的表示。PPO 的 actor 和 critic 是建立在图像特征网络输出与过去动作序列的拼接之上的两层 MLP 头。我们使用 ViZDoom 环境（Wydmuch 等人，[https://arxiv.org/html/2408.14837v1#bib.bib37 2019]）训练代理来玩游戏。我们并行运行了 8 个游戏，每个游戏的回放缓冲区大小为 512，折扣因子为 <math>\gamma = 0.99</math>，熵系数为 <math>0.1</math>。在每次迭代中，网络使用大小为 64 的批量训练 10 轮（epoch），学习率为 1e-4。我们总共执行了 1000 万个环境步骤。
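
The hyperparameters in the message text above imply some per-iteration arithmetic that is easy to check. The following is a minimal sketch, not the authors' code: it assumes that the "replay buffer size of 512" per game corresponds to the `n_steps` rollout length of the Stable Baselines 3 `PPO` class, and that the other values map onto the `batch_size`, `n_epochs`, `learning_rate`, `gamma`, and `ent_coef` arguments of that class. The `PPOSettings` class and its method names are hypothetical and exist only for illustration.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class PPOSettings:
    """Hyperparameters quoted in the message text (hypothetical container)."""

    n_envs: int = 8                     # games run in parallel
    n_steps: int = 512                  # ASSUMPTION: "replay buffer size" per game
    batch_size: int = 64                # minibatch size
    n_epochs: int = 10                  # passes over the collected data per iteration
    learning_rate: float = 1e-4
    gamma: float = 0.99                 # discount factor
    ent_coef: float = 0.1               # entropy coefficient
    total_timesteps: int = 10_000_000   # 10M environment steps

    def rollout_size(self) -> int:
        # Transitions collected per iteration across all parallel games.
        return self.n_envs * self.n_steps

    def minibatches_per_epoch(self) -> int:
        return self.rollout_size() // self.batch_size

    def gradient_steps_per_iteration(self) -> int:
        return self.minibatches_per_epoch() * self.n_epochs

    def iterations(self) -> int:
        # Number of collect-then-train iterations needed to reach the step budget.
        return math.ceil(self.total_timesteps / self.rollout_size())

    def sb3_kwargs(self) -> dict:
        # Keyword names follow the Stable Baselines 3 PPO constructor.
        # n_envs and total_timesteps are deliberately left out: they are not
        # constructor arguments (the step budget is passed to learn() instead).
        return {
            "n_steps": self.n_steps,
            "batch_size": self.batch_size,
            "n_epochs": self.n_epochs,
            "learning_rate": self.learning_rate,
            "gamma": self.gamma,
            "ent_coef": self.ent_coef,
        }


if __name__ == "__main__":
    s = PPOSettings()
    print("transitions per iteration:", s.rollout_size())
    print("minibatches per epoch:", s.minibatches_per_epoch())
    print("gradient steps per iteration:", s.gradient_steps_per_iteration())
    print("iterations for the step budget:", s.iterations())
```

Under these assumptions, each iteration collects 8 x 512 = 4096 transitions, splits them into 64 minibatches of 64, and performs 640 gradient steps over 10 epochs, so 10M environment steps correspond to roughly 2442 iterations. If the buffer size in the text means something other than the per-game rollout length, these figures change accordingly.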