Translations:Diffusion Models Are Real-Time Game Engines/42/zh: Difference between revisions

Latest revision as of 00:24, 9 September 2024

Message definition (Diffusion Models Are Real-Time Game Engines)

During inference, we need to run both the U-Net denoiser (for a number of steps) and the auto-encoder. On our hardware configuration (a TPU-v5), a single denoiser step and an evaluation of the auto-encoder each take 10 ms. If we ran our model with a single denoiser step, the minimum total latency possible in our setup would be 20 ms per frame, or 50 frames per second. Generative diffusion models such as Stable Diffusion usually do not produce high-quality results with a single denoising step, and instead require dozens of sampling steps to generate a high-quality image. Surprisingly, we found that we can robustly simulate DOOM with only 4 DDIM sampling steps (Song et al., [https://arxiv.org/html/2408.14837v1#bib.bib33 2020]). In fact, we observe no degradation in simulation quality when using 4 sampling steps vs. 20 steps or more (see Appendix [https://arxiv.org/html/2408.14837v1#A1.SS4 A.4]).

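During inference, we need to run the U-Net denoiser (for several steps) and the auto-encoder. On our hardware configuration (a TPU-v5), one denoising step and one evaluation of the auto-encoder each take 10 milliseconds. If we run the model with a single denoiser step, the minimum total latency in this setup is 20 milliseconds per frame, that is, 50 frames per second. Generative diffusion models such as Stable Diffusion typically cannot produce high-quality results with a single denoising step and instead need dozens of sampling steps to generate a high-quality image. Surprisingly, we found that only 4 DDIM sampling steps are enough to robustly simulate DOOM (Song et al., 2020). In fact, we observe no drop in simulation quality when using 4 sampling steps compared with 20 or more sampling steps (see Appendix A.4).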
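
The latency arithmetic in this passage can be sketched in a few lines of Python. This is only an illustration, not code from the paper: it assumes that denoiser steps and the auto-encoder evaluation run strictly one after another, that their costs add linearly, and that nothing else contributes to per-frame time. The names `frame_latency_ms` and `frames_per_second` are hypothetical helpers introduced here.

```python
# Timings quoted in the passage for a TPU-v5.
DENOISER_STEP_MS = 10.0   # one U-Net denoiser step
AUTOENCODER_MS = 10.0     # one auto-encoder evaluation


def frame_latency_ms(num_denoiser_steps: int) -> float:
    """Per-frame latency, assuming sequential execution with no overlap (an assumption)."""
    if num_denoiser_steps < 1:
        raise ValueError("need at least one denoiser step")
    return num_denoiser_steps * DENOISER_STEP_MS + AUTOENCODER_MS


def frames_per_second(num_denoiser_steps: int) -> float:
    """Frame rate implied by the per-frame latency."""
    return 1000.0 / frame_latency_ms(num_denoiser_steps)


if __name__ == "__main__":
    # Compare the single-step bound with the 4-step and 20-step settings.
    for steps in (1, 4, 20):
        latency = frame_latency_ms(steps)
        fps = frames_per_second(steps)
        print(f"{steps:>2} steps: {latency:.0f} ms/frame, {fps:.2f} FPS")
```

Under these assumptions, one step reproduces the 20 ms and 50 frames per second stated in the passage, and 4 steps would cost about 50 ms per frame, or roughly 20 frames per second.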