Translations:Diffusion Models Are Real-Time Game Engines/48/zh

    Message definition (Diffusion Models Are Real-Time Game Engines)
    The agent model is trained using PPO (Schulman et al., [https://arxiv.org/html/2408.14837v1#bib.bib30 2017]), with a simple CNN as the feature network, following Mnih et al. ([https://arxiv.org/html/2408.14837v1#bib.bib21 2015]). It is trained on CPU using the Stable Baselines 3 infrastructure (Raffin et al., [https://arxiv.org/html/2408.14837v1#bib.bib24 2021]). The agent is provided with downscaled versions of the frame images and in-game map, each at resolution 160x120. The agent also has access to the last 32 actions it performed. The feature network computes a representation of size 512 for each image. PPO’s actor and critic are 2-layer MLP heads on top of a concatenation of the outputs of the image feature network and the sequence of past actions. We train the agent to play the game using the Vizdoom environment (Wydmuch et al., [https://arxiv.org/html/2408.14837v1#bib.bib37 2019]). We run 8 games in parallel, each with a replay buffer size of 512, a discount factor <math>\gamma = 0.99</math>, and an entropy coefficient of <math>0.1</math>. In each iteration, the network is trained using a batch size of 64 for 10 epochs, with a learning rate of 1e-4. We perform a total of 10M environment steps.

    The agent model is trained using PPO (Schulman et al., 2017), with a simple CNN as the feature network, following Mnih et al. (2015). It is trained on CPU using the Stable Baselines 3 infrastructure (Raffin et al., 2021). The agent receives downscaled frame images and the in-game map, each at a resolution of 160x120. The agent also has access to the last 32 actions it performed. The feature network computes a representation of size 512 for each image. PPO's actor and critic are 2-layer MLP heads on top of a concatenation of the outputs of the image feature network and the sequence of past actions. We train the agent to play the game using the Vizdoom environment (Wydmuch et al., 2019). We run 8 games in parallel, each with a replay buffer size of 512, a discount factor of <math>\gamma = 0.99</math>, and an entropy coefficient of <math>0.1</math>. In each iteration, the network is trained with a batch size of 64 for 10 epochs, at a learning rate of 1e-4. We perform a total of 10M environment steps.
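    To show how these hyperparameters fit together, the following is a minimal sketch of the setup in Stable Baselines 3. It is an illustration under stated assumptions, not the authors' code: the Vizdoom wrapper is replaced by a dummy stand-in environment with the same observation layout, the size of the discrete action set and the hidden widths of the 2-layer actor/critic heads are assumed, and the "replay buffer size of 512" is read as PPO's per-environment rollout length (n_steps).

<syntaxhighlight lang="python">
# Hedged sketch of the agent-training setup described above, using Stable Baselines 3 PPO.
# The VizDoom environment is replaced by a dummy stand-in with the same observation layout
# (downscaled 160x120 frame and in-game map, plus the last 32 actions). The number of
# discrete actions and the MLP head widths are assumptions, not taken from the paper.
import numpy as np
import gymnasium as gym
from gymnasium import spaces
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import SubprocVecEnv

N_ACTIONS = 8          # assumed size of the discrete action set
ACTION_HISTORY = 32    # the last 32 actions are part of the observation


class DoomStandIn(gym.Env):
    """Placeholder with the observation layout used by the agent model.
    In the actual setup this would wrap a ViZDoom game instance."""

    def __init__(self):
        super().__init__()
        img = spaces.Box(0, 255, shape=(120, 160, 3), dtype=np.uint8)
        self.observation_space = spaces.Dict(
            {
                "frame": img,  # downscaled screen image, 160x120
                "map": img,    # downscaled in-game map, 160x120
                "past_actions": spaces.MultiDiscrete([N_ACTIONS] * ACTION_HISTORY),
            }
        )
        self.action_space = spaces.Discrete(N_ACTIONS)

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        return self.observation_space.sample(), {}

    def step(self, action):
        # Random transition; a real wrapper would advance the ViZDoom game and compute reward.
        return self.observation_space.sample(), 0.0, False, False, {}


if __name__ == "__main__":
    # 8 games in parallel; n_steps=512 is our reading of the "replay buffer size of 512"
    # (PPO uses an on-policy rollout buffer of n_steps transitions per environment).
    env = SubprocVecEnv([DoomStandIn for _ in range(8)])

    model = PPO(
        "MultiInputPolicy",          # CNN per image key + flattened past-action history
        env,
        n_steps=512,
        batch_size=64,
        n_epochs=10,
        learning_rate=1e-4,
        gamma=0.99,
        ent_coef=0.1,
        policy_kwargs=dict(
            # 512-dim representation per image, as described above; the hidden widths
            # of the 2-layer actor/critic heads are assumed.
            features_extractor_kwargs=dict(cnn_output_dim=512),
            net_arch=dict(pi=[512, 512], vf=[512, 512]),
        ),
        verbose=1,
    )
    model.learn(total_timesteps=10_000_000)
</syntaxhighlight>

    With 8 parallel games and n_steps=512, each PPO iteration collects 4096 transitions, which the batch size of 64 divides evenly across the 10 training epochs.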