Translations:Diffusion Models Are Real-Time Game Engines/36/zh: Difference between revisions

Latest revision as of 00:22, 9 September 2024

Information about message (contribute)

This message has no documentation. If you know where or how this message is used, you can help other translators by adding documentation to this message.

Message definition (Diffusion Models Are Real-Time Game Engines)

The pre-trained auto-encoder of Stable Diffusion v1.4, which compresses 8x8 pixel patches into 4 latent channels, results in meaningful artifacts when predicting game frames, which affect small details and particularly the bottom bar HUD (“heads up display”). To leverage the pre-trained knowledge while improving image quality, we train just the decoder of the latent auto-encoder using an MSE loss computed against the target frame pixels. It might be possible to improve quality even further using a perceptual loss such as LPIPS (Zhang et al. ([https://arxiv.org/html/2408.14837v1#bib.bib40 2018])), which we leave to future work. Importantly, note that this fine-tuning process happens completely separately from the U-Net fine-tuning, and that notably the auto-regressive generation isn’t affected by it (we only condition auto-regressively on the latents, not the pixels). Appendix [https://arxiv.org/html/2408.14837v1#A1.SS2 A.2] shows examples of generations with and without fine-tuning the auto-encoder.

Stable Diffusion v1.4 的预训练自动编码器将 8x8 像素块压缩为 4 个潜通道，在预测游戏帧时会导致有意义的伪影，影响小细节，尤其是底栏 HUD（“抬头显示”）。为了在提高图像质量的同时利用预训练的知识，我们仅使用针对目标帧像素计算的 MSE 损失来训练潜在自动编码器的解码器。使用 LPIPS（Zhang 等人（2018））等感知损失可能会进一步提高质量，我们将其留待未来工作中研究。重要的是，请注意这个微调过程完全独立于 U-Net 微调过程，而且自回归生成不受其影响（我们仅对潜变量自回归地进行条件设置，而非像素）。附录 A.2 展示了对自动编码器进行微调和不进行微调的生成示例。