Translations:Diffusion Models Are Real-Time Game Engines/9/zh: Difference between revisions

Latest revision as of 00:18, 9 September 2024

In recent years, generative models have made significant progress in producing images and videos conditioned on multi-modal inputs, such as text or images. At the forefront of this wave, diffusion models have become the de facto standard in media (i.e., non-language) generation, with works like Dall-E (Ramesh et al., [https://arxiv.org/html/2408.14837v1#bib.bib25 2022]), Stable Diffusion (Rombach et al., [https://arxiv.org/html/2408.14837v1#bib.bib26 2022]), and Sora (Brooks et al., [https://arxiv.org/html/2408.14837v1#bib.bib6 2024]). At a glance, simulating the interactive worlds of video games may seem similar to video generation. However, ''interactive'' world simulation is more than just very fast video generation. The input actions arrive as a stream that becomes available only while generation is in progress, so the model must condition on them as they arrive, which breaks some assumptions of existing diffusion model architectures. Notably, it requires generating frames autoregressively, which tends to be unstable and leads to sampling divergence (see section [https://arxiv.org/html/2408.14837v1#S3.SS2.SSS1 3.2.1]).
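The autoregressive constraint described above can be illustrated with a toy rollout loop. This is a minimal sketch under stated assumptions, not the paper's architecture: a frame is reduced to a single float, and `rollout`, `exact_step`, and `biased_step` are hypothetical names introduced only for this example. It shows two things: each action is read only when its step is reached, so frames cannot all be produced up front, and a small per-step model error compounds because each output is fed back in as context.

```python
from collections import deque
from typing import Callable, Deque, Iterable, List

Frame = float  # stand-in for an image; a real frame would be a pixel array


def rollout(
    next_frame: Callable[[List[Frame], List[int]], Frame],
    first_frame: Frame,
    actions: Iterable[int],
    context_len: int = 4,
) -> List[Frame]:
    """Generate frames one at a time, feeding each output back in as context."""
    frames: Deque[Frame] = deque([first_frame], maxlen=context_len)
    past_actions: Deque[int] = deque(maxlen=context_len)
    generated = [first_frame]
    for action in actions:  # each action is consumed only when its step is reached
        past_actions.append(action)
        frame = next_frame(list(frames), list(past_actions))
        frames.append(frame)  # the model's own output becomes future context
        generated.append(frame)
    return generated


def exact_step(frames: List[Frame], actions: List[int]) -> Frame:
    """Hypothetical perfect simulator: the action shifts the previous frame."""
    return frames[-1] + actions[-1]


def biased_step(frames: List[Frame], actions: List[int]) -> Frame:
    """Hypothetical imperfect model: same dynamics plus a small per-step error."""
    return frames[-1] + actions[-1] + 0.01


player_actions = [1, 0, -1, 1] * 25  # 100 steps of made-up input
reference = rollout(exact_step, 0.0, iter(player_actions))
generated = rollout(biased_step, 0.0, iter(player_actions))
drift = generated[-1] - reference[-1]  # grows with the number of steps
```

In this toy the error grows linearly with the rollout length. A learned model conditioned on its own imperfect frames can degrade in less predictable ways, which is the instability the paragraph above refers to.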
