Diffusion Models

    Topic area Generative Models
    Prerequisites Variational Autoencoder, Backpropagation, KL Divergence


    Overview

    Diffusion models are a family of generative models that learn a data distribution by reversing a gradual noising process. Starting from pure Gaussian noise, the model takes a sequence of small denoising steps until a sample resembling real data emerges. The framework was popularized in machine learning by Sohl-Dickstein and colleagues in 2015 and brought to state-of-the-art image quality by the Denoising Diffusion Probabilistic Models (DDPM) paper of Ho, Jain, and Abbeel in 2020.[1] Diffusion models now underlie much of modern image, audio, and video synthesis, including systems such as Stable Diffusion, DALL-E 3, and Imagen, and have expanded into molecular design, robotics, and scientific simulation.

    Compared with Generative Adversarial Networks and Variational Autoencoders, diffusion models trade slower sampling for a stable training objective, high sample diversity, and strong likelihood properties. Their core insight is that learning to remove a small amount of noise is far easier than learning to map noise directly to data, and that iterating this simpler problem can compose into a powerful generative process.

    Intuition

    The forward process repeatedly adds a small amount of Gaussian noise to a clean sample until, after many steps, only noise remains. The reverse process aims to undo each noising step, but the exact reverse is intractable. Instead, the model is trained to approximate the reverse step at every noise level. At inference time, sampling begins from noise and applies the learned denoiser many times to walk back to the data manifold.

    A useful analogy is sculpture. The forward process buries a statue in sand one grain at a time. The model learns, for any partially buried statue, what the next grain to remove looks like. With enough practice, the model can start from a pile of sand and remove grains one by one until the statue reappears.

    Forward Process

    The forward process is a fixed Markov chain that gradually corrupts a data sample $ x_0 $ over $ T $ timesteps according to a variance schedule $ \beta_1, \dots, \beta_T \in (0, 1) $:

    $ {\displaystyle q(x_t \mid x_{t-1}) = \mathcal{N}(x_t;\, \sqrt{1 - \beta_t}\, x_{t-1},\, \beta_t I).} $

    A convenient property is that $ x_t $ can be sampled in closed form directly from $ x_0 $. With $ \alpha_t = 1 - \beta_t $ and $ \bar\alpha_t = \prod_{s=1}^{t} \alpha_s $:

    $ {\displaystyle q(x_t \mid x_0) = \mathcal{N}(x_t;\, \sqrt{\bar\alpha_t}\, x_0,\, (1 - \bar\alpha_t) I).} $

    When $ \bar\alpha_T \approx 0 $, the marginal $ q(x_T) $ is effectively a standard Gaussian, which gives the reverse process a tractable starting point.
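
    The closed-form marginal makes the forward process trivial to implement. The following is a minimal PyTorch sketch, assuming batch-first tensors and the linear $ \beta $ schedule from the DDPM paper; the schedule endpoints and variable names are illustrative, not prescriptive.

    ```python
    import torch

    T = 1000
    betas = torch.linspace(1e-4, 0.02, T)        # linear schedule (illustrative endpoints)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)    # \bar{alpha}_t = prod_{s<=t} alpha_s

    def q_sample(x0: torch.Tensor, t: torch.Tensor):
        """Draw x_t ~ q(x_t | x_0) in closed form; returns (x_t, eps)."""
        eps = torch.randn_like(x0)
        ab = alpha_bars.to(x0.device)[t].view(-1, *([1] * (x0.dim() - 1)))  # broadcast over data dims
        xt = ab.sqrt() * x0 + (1.0 - ab).sqrt() * eps
        return xt, eps
    ```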

    Reverse Process and Training

    The reverse process is parameterized by a neural network $ p_\theta(x_{t-1} \mid x_t) $, typically a U-Net for images or, increasingly, a Diffusion Transformer (DiT) for high-resolution and sequence generation. The training objective minimizes a variational bound on the negative log-likelihood. Ho et al. showed that, with a particular reweighting, this bound reduces to a remarkably simple noise-prediction loss:

    $ {\displaystyle L_{\mathrm{simple}} = \mathbb{E}_{x_0,\, t,\, \epsilon}\!\left[\,\lVert \epsilon - \epsilon_\theta(x_t, t) \rVert^2\,\right],} $

    where $ \epsilon \sim \mathcal{N}(0, I) $, $ x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon $, and $ t $ is sampled uniformly from $ \{1, \dots, T\} $. The network is trained to predict the noise that was added, conditional on the noisy input and the timestep. Equivalent parameterizations predict $ x_0 $ directly or the velocity $ v $, which can improve numerical stability at the extremes of the schedule.
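
    A training step for $ L_{\mathrm{simple}} $ is correspondingly short. The sketch below reuses the `q_sample` helper from the forward-process example and assumes a placeholder denoiser `eps_model(x_t, t)` that returns a noise prediction of the same shape as its input.

    ```python
    import torch
    import torch.nn.functional as F

    def ddpm_loss(eps_model, x0: torch.Tensor) -> torch.Tensor:
        """L_simple: predict the noise added at a uniformly sampled timestep."""
        t = torch.randint(0, T, (x0.shape[0],), device=x0.device)  # t ~ Uniform over timesteps
        xt, eps = q_sample(x0, t)                                   # closed-form forward sample
        eps_pred = eps_model(xt, t)                                 # network predicts the added noise
        return F.mse_loss(eps_pred, eps)
    ```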

    Sampling

    Once trained, sampling proceeds by drawing $ x_T \sim \mathcal{N}(0, I) $ and iterating the reverse step from $ t = T $ down to $ t = 1 $. The DDPM update is

    $ {\displaystyle x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\!\left(x_t - \frac{\beta_t}{\sqrt{1 - \bar\alpha_t}}\, \epsilon_\theta(x_t, t)\right) + \sigma_t z,\quad z \sim \mathcal{N}(0, I).} $

    Naive DDPM sampling requires hundreds or thousands of network evaluations per image, which is the central efficiency concern of diffusion models. Faster samplers such as DDIM,[2] DPM-Solver, and consistency models reduce this to between one and fifty steps by interpreting sampling as solving an ordinary or stochastic differential equation and applying higher-order numerical integrators or distillation.
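
    The update rule above translates directly into an ancestral sampling loop. This sketch assumes the schedule tensors and `eps_model` from the earlier examples and uses the common choice $ \sigma_t^2 = \beta_t $; it is meant to illustrate the loop structure, not a production sampler.

    ```python
    @torch.no_grad()
    def ddpm_sample(eps_model, shape, device="cpu"):
        """Ancestral sampling: start from pure noise and apply the learned reverse step T times."""
        x = torch.randn(shape, device=device)
        for t in reversed(range(T)):
            tt = torch.full((shape[0],), t, device=device, dtype=torch.long)
            b_t, a_t, ab_t = betas[t].item(), alphas[t].item(), alpha_bars[t].item()
            eps_pred = eps_model(x, tt)
            mean = (x - b_t / (1.0 - ab_t) ** 0.5 * eps_pred) / a_t ** 0.5
            noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)  # no noise at the final step
            x = mean + b_t ** 0.5 * noise                                  # sigma_t^2 = beta_t
        return x
    ```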

    Score-Based View and SDEs

    Song and Ermon framed denoising as estimating the Score Function $ \nabla_x \log p_t(x) $ of the noised data distribution at each noise level.[3] In the continuous limit, the forward process becomes a stochastic differential equation, and the reverse process is governed by a corresponding reverse-time SDE that depends only on the score. There is also a deterministic probability-flow ODE with the same marginals, which enables exact likelihood computation and the use of off-the-shelf ODE solvers. The DDPM noise predictor and the score network are equivalent up to a known scaling, unifying the two perspectives.
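
    Concretely, with the notation of the previous sections, the trained noise predictor yields a score estimate through a simple rescaling:

    $ {\displaystyle \nabla_{x_t} \log p_t(x_t) \approx -\frac{\epsilon_\theta(x_t, t)}{\sqrt{1 - \bar\alpha_t}}.} $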

    Conditioning and Guidance

    Conditional generation is achieved by feeding extra information, such as a class label, text embedding, or low-resolution image, into the denoiser. Two widely used techniques amplify conditional alignment at sampling time. Classifier guidance perturbs the score with the gradient of an external classifier. Classifier-Free Guidance trains a single network with random condition dropout and combines conditional and unconditional predictions at inference, trading diversity for prompt adherence with a single guidance scale. Guidance is the main reason text-to-image diffusion models follow prompts so closely.
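
    Classifier-Free Guidance amounts to one extra network evaluation per step. In the sketch below, the denoiser is assumed to accept an optional condition, and `cond`, `null_cond` (for example, a text embedding and an empty-prompt embedding), and the guidance scale are illustrative placeholders.

    ```python
    def cfg_eps(eps_model, x, t, cond, null_cond, guidance_scale: float = 7.5):
        """Classifier-free guidance: extrapolate from the unconditional toward the conditional prediction."""
        eps_uncond = eps_model(x, t, null_cond)   # condition dropped
        eps_cond = eps_model(x, t, cond)          # full condition
        return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
    ```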

    Variants and Extensions

    Latent Diffusion Models run the diffusion process in the compressed latent space of a Variational Autoencoder, reducing compute by an order of magnitude and enabling high-resolution image and video synthesis on commodity hardware. Cascaded diffusion stacks a low-resolution base model with one or more super-resolution diffusion stages. Consistency models distill a multi-step sampler into a single-step generator. Flow matching and rectified flow generalize the framework by training networks to predict velocity fields along straighter probability paths. Discrete diffusion adapts the formulation to text and graphs by replacing Gaussian noise with absorbing or uniform corruption processes.
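
    Schematically, a latent diffusion pipeline wraps the same sampling loop in an autoencoder. The sketch below reuses the `ddpm_sample` function from the sampling example and assumes a placeholder `vae` object with a `decode` method; it is not tied to any specific library API.

    ```python
    def latent_diffusion_sample(vae, eps_model, latent_shape, device="cpu"):
        """Run reverse diffusion in the VAE's compressed latent space, then decode to pixels."""
        z = ddpm_sample(eps_model, latent_shape, device=device)  # denoise in latent space
        return vae.decode(z)                                      # map the latent back to an image
    ```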

    Comparison with Other Generative Models

    Diffusion models offer stable maximum-likelihood-style training, mode coverage that exceeds that of Generative Adversarial Networks, and sample quality that rivals or surpasses GANs in image domains. Compared to Variational Autoencoders, they avoid the blurry reconstructions that follow from a single-step Gaussian decoder. Compared to autoregressive models, they parallelize generation across spatial dimensions and do not impose an artificial generation order. The main drawback is sampling cost: even with modern fast samplers, generating a high-resolution image typically requires more compute than a single forward pass of a comparable GAN or autoregressive transformer.

    Limitations and Open Problems

    Sampling speed remains the dominant practical bottleneck, motivating ongoing work on distillation, few-step solvers, and consistency objectives. Likelihood evaluation requires the probability-flow ODE and is expensive at high resolution. Diffusion models can memorize and regurgitate training examples when conditioned on rare prompts, raising privacy and copyright concerns. Controllability beyond text prompts, including precise spatial layout, identity preservation across edits, and faithful counting and text rendering, remains an active research area. Finally, scaling diffusion to long sequences and to physically grounded modalities such as molecules, proteins, and 3D scenes requires custom noise schedules and equivariant architectures rather than off-the-shelf U-Nets.

    References