Translations:Diffusion Models Are Real-Time Game Engines/33/en: Difference between revisions
|  (Importing a new version from external source) |  (Importing a new version from external source) | ||
| Line 1: | Line 1: | ||
| The domain shift between training with teacher-forcing and auto-regressive sampling leads to error accumulation and fast degradation in sample quality, as demonstrated in  | The domain shift between training with teacher-forcing and auto-regressive sampling leads to error accumulation and fast degradation in sample quality, as demonstrated in Figure [https://arxiv.org/html/2408.14837v1#S3.F4 4]. To avoid this divergence due to auto-regressive application of the model, we corrupt context frames by adding a varying amount of Gaussian noise to encoded frames in training time, while providing the noise level as input to the model, following Ho et al. ([https://arxiv.org/html/2408.14837v1#bib.bib13 2021]). To that effect, we sample a noise level <math>\alpha</math> uniformly up to a maximal value, discretize it and learn an embedding for each bucket (see Figure [https://arxiv.org/html/2408.14837v1#S3.F3 3]). This allows the network to correct information sampled in previous frames, and is critical for preserving frame quality over time. During inference, the added noise level can be controlled to maximize quality, although we find that even with no added noise the results are significantly improved. We ablate the impact of this method in section [https://arxiv.org/html/2408.14837v1#S5.SS2.SSS2 5.2.2]. | ||
Latest revision as of 03:06, 7 September 2024
The domain shift between training with teacher-forcing and auto-regressive sampling leads to error accumulation and fast degradation in sample quality, as demonstrated in Figure 4. To avoid this divergence due to auto-regressive application of the model, we corrupt context frames by adding a varying amount of Gaussian noise to encoded frames in training time, while providing the noise level as input to the model, following Ho et al. (2021). To that effect, we sample a noise level uniformly up to a maximal value, discretize it and learn an embedding for each bucket (see Figure 3). This allows the network to correct information sampled in previous frames, and is critical for preserving frame quality over time. During inference, the added noise level can be controlled to maximize quality, although we find that even with no added noise the results are significantly improved. We ablate the impact of this method in section 5.2.2.