Image Generation Models
| Article | |
|---|---|
| Topic area | Deep Learning |
| Prerequisites | Neural Networks, Generative Models, Probability Distributions |
Overview
Image generation models are a family of generative models that learn to synthesize new images by approximating the data distribution of a training set. Given samples drawn from an unknown distribution $ p_{\text{data}}(x) $ over images, the goal is to learn a model distribution $ p_\theta(x) $ from which novel samples can be drawn that resemble the training data without copying it. Modern image generation underpins applications ranging from photorealistic synthesis and image editing to data augmentation, scientific visualization, and design tools.[1]
The field has evolved through several model families, each making different tradeoffs between sample quality, diversity, training stability, sampling speed, and likelihood tractability. The four dominant paradigms today are variational autoencoders (VAEs), generative adversarial networks (GANs), autoregressive models, and diffusion models. Normalizing flows form a fifth, smaller family. Hybrid approaches combine pieces of each.
Problem Formulation
Image generation can be cast as density estimation, sampling, or both. Let $ x \in \mathbb{R}^{H \times W \times C} $ be an image with height $ H $, width $ W $, and $ C $ channels. A generative model is parameterized by $ \theta $ and aims to make $ p_\theta(x) $ close to $ p_{\text{data}}(x) $ under some divergence or distance.
Training objectives differ by family. Likelihood-based models maximize
$ {\displaystyle \mathcal{L}(\theta) = \mathbb{E}_{x \sim p_{\text{data}}}[\log p_\theta(x)]} $
or a tractable lower bound. Implicit models, such as GANs, never compute likelihoods and instead match distributions through a learned discriminator. Score-based and diffusion models match the gradient of the log density (the score function) $ \nabla_x \log p(x) $ rather than the density itself. The choice of objective shapes everything: which architectures work, what artifacts appear, and how sampling proceeds at inference time.
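As a concrete illustration of the score function, a toy 1-D Gaussian (arbitrary numbers, not an image model) lets us check the analytic score $ \nabla_x \log p(x) = -(x-\mu)/\sigma^2 $ against a finite-difference gradient of the log-density:

```python
import math

def log_pdf(x, mu=1.0, sigma=2.0):
    # Log-density of a 1-D Gaussian N(mu, sigma^2).
    return -0.5 * math.log(2 * math.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)

def score_analytic(x, mu=1.0, sigma=2.0):
    # Score function: gradient of the log-density with respect to x.
    return -(x - mu) / sigma**2

def score_numeric(x, h=1e-5):
    # Central finite difference of the log-density.
    return (log_pdf(x + h) - log_pdf(x - h)) / (2 * h)

print(score_analytic(0.3), score_numeric(0.3))  # the two gradients agree
```

Score-based training generalizes this idea: the network is taught to output this gradient for noisy data, without ever evaluating the (intractable) density itself.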
Variational Autoencoders
A VAE pairs an encoder $ q_\phi(z \mid x) $ with a decoder $ p_\theta(x \mid z) $ over a latent variable $ z $, typically a low-dimensional Gaussian. Training maximizes the evidence lower bound (ELBO):
$ {\displaystyle \mathcal{L}_{\text{ELBO}} = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x \mid z)] - D_{\text{KL}}(q_\phi(z \mid x) \,\|\, p(z))} $
The reparameterization trick allows gradients to flow through stochastic sampling.[2] VAEs offer stable training, an explicit latent space useful for interpolation and editing, and tractable likelihood bounds. Their main weakness is blurry samples, traceable to the Gaussian likelihood at pixel level and the gap between the ELBO and true log-likelihood. Hierarchical and discrete-latent variants such as VQ-VAE narrow this gap and are now common as the first stage of two-stage pipelines.
Generative Adversarial Networks
GANs sidestep likelihood entirely. A generator $ G_\theta(z) $ maps noise to images while a discriminator $ D_\phi(x) $ tries to distinguish real from generated samples. The minimax objective is
$ {\displaystyle \min_\theta \max_\phi \; \mathbb{E}_{x \sim p_{\text{data}}}[\log D_\phi(x)] + \mathbb{E}_{z \sim p(z)}[\log(1 - D_\phi(G_\theta(z)))]} $
When the discriminator is optimal, this game minimizes the Jensen-Shannon divergence between data and model.[3] GANs reached striking photorealism early, especially via StyleGAN and BigGAN, but training is notoriously unstable. Mode collapse, where the generator produces only a narrow slice of the data distribution, is a recurring failure. Wasserstein GANs replace the original loss with the Earth-Mover distance to improve gradient signal, and spectral normalization or gradient penalties stabilize the discriminator. Sampling is fast: a single forward pass through $ G_\theta $.
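The optimal-discriminator claim can be verified numerically on a toy discrete support: plugging $ D^*(x) = p_{\text{data}}(x) / (p_{\text{data}}(x) + p_g(x)) $ into the objective yields $ -\log 4 + 2\,\mathrm{JSD}(p_{\text{data}} \| p_g) $. A sketch with made-up distributions:

```python
import math

p_data = [0.7, 0.2, 0.1]   # toy data distribution over 3 outcomes
p_gen  = [0.3, 0.4, 0.3]   # toy generator distribution

# Optimal discriminator: D*(x) = p_data(x) / (p_data(x) + p_gen(x)).
d_star = [p / (p + q) for p, q in zip(p_data, p_gen)]

# Value of the inner maximization at D*.
value = sum(p * math.log(d) for p, d in zip(p_data, d_star)) + \
        sum(q * math.log(1 - d) for q, d in zip(p_gen, d_star))

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Jensen-Shannon divergence between p_data and p_gen.
m = [(p + q) / 2 for p, q in zip(p_data, p_gen)]
jsd = 0.5 * kl(p_data, m) + 0.5 * kl(p_gen, m)

print(value, -math.log(4) + 2 * jsd)  # the identity: value = -log 4 + 2 * JSD
```

When the generator matches the data exactly, the JSD term vanishes and the game value bottoms out at $ -\log 4 $.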
Autoregressive Models
Autoregressive image models factorize the joint distribution over pixels (or learned tokens) as a product of conditionals:
$ {\displaystyle p_\theta(x) = \prod_{i=1}^{N} p_\theta(x_i \mid x_{<i})} $
PixelRNN and PixelCNN model pixel-level conditionals directly, while Image Transformers and modern token-based pipelines (e.g. VQGAN plus a Transformer) operate over discrete codes from a learned tokenizer.[4] Autoregressive models give exact likelihoods, train stably with cross-entropy, and scale well with parameters and compute. The dominant cost is sampling: producing an image requires $ N $ sequential forward passes, where $ N $ can be thousands of tokens. Parallel decoding, caching, and speculative decoding partially mitigate this.
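The chain-rule factorization, and its $ O(N) $ sequential sampling cost, can be sketched with a hand-written toy conditional standing in for a learned network like PixelCNN (the rule below is purely illustrative):

```python
import math
import random

def cond_prob_one(prefix):
    # Toy conditional p(x_i = 1 | x_{<i}) over binary "pixels":
    # a Laplace-style counting rule standing in for a neural network.
    return (1 + sum(prefix)) / (2 + len(prefix))

def sample_image(n_pixels, rng):
    x = []
    for _ in range(n_pixels):       # one model evaluation per pixel:
        p1 = cond_prob_one(x)       # this loop is the O(N) sequential cost
        x.append(1 if rng.random() < p1 else 0)
    return x

def log_likelihood(x):
    # Exact log-likelihood via the chain-rule factorization.
    ll = 0.0
    for i, xi in enumerate(x):
        p1 = cond_prob_one(x[:i])
        ll += math.log(p1 if xi == 1 else 1.0 - p1)
    return ll

rng = random.Random(0)
img = sample_image(8, rng)
print(img, log_likelihood(img))
```

Because every conditional is a proper distribution, the product over complete sequences sums to one, which is why autoregressive models yield exact (not merely bounded) likelihoods.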
Diffusion Models
Diffusion models define a forward process that progressively corrupts data with Gaussian noise across $ T $ timesteps and learn a reverse process that denoises samples back to data. The forward process
$ {\displaystyle q(x_t \mid x_{t-1}) = \mathcal{N}(x_t;\, \sqrt{1-\beta_t}\, x_{t-1},\, \beta_t I)} $
admits a closed-form marginal at any step, and the reverse process is parameterized by a network $ \epsilon_\theta(x_t, t) $ trained to predict the noise. The simplified training objective is
$ {\displaystyle \mathcal{L}_{\text{simple}} = \mathbb{E}_{t, x_0, \epsilon}\left[\| \epsilon - \epsilon_\theta(x_t, t) \|^2\right]} $
This is equivalent to denoising score matching at noise scale $ t $.[5] Diffusion currently sets the state of the art in sample quality, supports flexible conditioning through classifier-free guidance, and trains stably. Its principal cost is multi-step sampling, often 20 to 1000 network evaluations per image. Latent diffusion runs the diffusion process in a compressed latent space produced by a VAE, cutting compute by an order of magnitude and enabling text-to-image systems such as Stable Diffusion.[6] Consistency models, distillation, and rectified flows reduce sampling to a handful of steps.
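The closed-form marginal $ q(x_t \mid x_0) = \mathcal{N}(\sqrt{\bar{\alpha}_t}\, x_0,\, (1-\bar{\alpha}_t) I) $ with $ \bar{\alpha}_t = \prod_{s \le t}(1-\beta_s) $ can be checked by propagating the mean and variance coefficients through the stepwise forward process; a sketch assuming a linear $ \beta $ schedule in the style of DDPM:

```python
# Numerical check: composing the per-step Gaussians q(x_t | x_{t-1})
# gives mean sqrt(abar_t) * x_0 and variance 1 - abar_t.
T = 1000
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]  # linear schedule

mean_coef, var, abar = 1.0, 0.0, 1.0   # x_0 itself: mean 1 * x_0, variance 0
for beta in betas:
    # One forward step: x_t = sqrt(1 - beta) * x_{t-1} + sqrt(beta) * eps
    mean_coef *= (1.0 - beta) ** 0.5
    var = (1.0 - beta) * var + beta
    abar *= 1.0 - beta

print(mean_coef ** 2, abar)   # both equal abar_T
print(var, 1.0 - abar)        # stepwise variance matches 1 - abar_T
```

This closed form is what makes training efficient: any timestep's noisy $ x_t $ can be produced from $ x_0 $ in one shot, without simulating the chain.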
Conditioning and Guidance
Most practical systems are conditional, generating $ p_\theta(x \mid c) $ for a class label, text caption, segmentation map, or reference image. Text-to-image pipelines pair a frozen text encoder (often CLIP or a large language model encoder) with a generator. Classifier-free guidance trades diversity for fidelity by extrapolating between conditional and unconditional predictions:
$ {\displaystyle \hat{\epsilon}_\theta(x_t, c) = (1+w)\, \epsilon_\theta(x_t, c) - w\, \epsilon_\theta(x_t, \emptyset)} $
with guidance scale $ w $ typically between 3 and 15. ControlNet and IP-Adapter add structural or stylistic conditioning to a frozen base model without retraining it.
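The guidance formula above is applied elementwise to the noise estimate at every sampling step; $ w = 0 $ recovers the plain conditional prediction. A minimal sketch with made-up noise predictions standing in for a real denoiser:

```python
def cfg(eps_cond, eps_uncond, w):
    # Classifier-free guidance: extrapolate past the conditional
    # prediction, away from the unconditional one (elementwise).
    return [(1 + w) * c - w * u for c, u in zip(eps_cond, eps_uncond)]

# Toy noise predictions from a hypothetical denoiser at one timestep.
eps_cond = [0.2, -0.1, 0.4]
eps_uncond = [0.0, 0.0, 0.1]

print(cfg(eps_cond, eps_uncond, 0.0))  # w = 0: the conditional prediction
print(cfg(eps_cond, eps_uncond, 7.5))  # larger w: stronger pull toward c
```

In practice the conditional and unconditional predictions come from the same network, trained with the condition randomly dropped, so guidance costs one extra forward pass per step.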
Evaluation
No single metric captures generation quality. Fréchet Inception Distance (FID) compares moments of Inception-feature distributions between real and generated samples; lower is better.[7] Inception Score, precision/recall for generative models, and CLIP score (for text alignment) complement FID. Likelihood-based models also report bits per dimension. Human preference studies remain the ground truth for perceptual quality, especially for text-to-image systems where automatic metrics correlate weakly with judgment.
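As a toy illustration of the Fréchet distance underlying FID: in one dimension it reduces to $ (\mu_1-\mu_2)^2 + (\sigma_1-\sigma_2)^2 $. Real FID fits multivariate Gaussians to Inception features, but this scalar sketch (toy sample values) shows why mode collapse is penalized through the spread term:

```python
import statistics

def frechet_distance_1d(a, b):
    # Frechet distance between Gaussians fitted to two 1-D samples:
    # (mu1 - mu2)^2 + (sigma1 - sigma2)^2. FID applies the multivariate
    # analogue to Inception features of real vs. generated images.
    mu1, mu2 = statistics.fmean(a), statistics.fmean(b)
    s1, s2 = statistics.pstdev(a), statistics.pstdev(b)
    return (mu1 - mu2) ** 2 + (s1 - s2) ** 2

real = [0.0, 1.0, 2.0, 3.0]
fake_good = [0.1, 1.1, 2.1, 3.1]   # same spread, slightly shifted mean
fake_bad = [1.5, 1.5, 1.5, 1.5]    # mode-collapsed: zero variance

print(frechet_distance_1d(real, real))       # identical samples -> 0
print(frechet_distance_1d(real, fake_good))  # small
print(frechet_distance_1d(real, fake_bad))   # large despite matching mean
```

A generator that collapses to the mean of the data can fool mean-matching metrics but not the variance term, which is one reason FID is preferred over simpler statistics.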
Comparisons and Tradeoffs
Among the four major families, diffusion models currently lead in sample fidelity and conditional controllability; GANs remain attractive when single-step inference is required (real-time graphics, mobile); autoregressive models shine where exact likelihood or unified handling of multiple modalities matters; VAEs are workhorse first-stage encoders for tokenization and latent compression. Two-stage pipelines, with a VAE or VQ-VAE compressing pixels and a diffusion or transformer modeling the latents, dominate large-scale text-to-image and video generation in 2024-2026.
Limitations
Image generation models inherit and can amplify biases in their training data. They can memorize and regurgitate training examples, especially under distribution shift or for rare prompts, raising copyright and privacy concerns. Detecting machine-generated images is an open problem with adversarial dynamics. Compute and energy costs of training and inference are nontrivial, and small distributional shifts (out-of-domain prompts, unusual compositions) can produce subtle artifacts that automatic metrics miss. Safety filtering, watermarking, and provenance tracking are active areas of research and policy.
References
- ↑ Goodfellow, Bengio, Courville, Deep Learning, MIT Press, 2016, ch. 20.
- ↑ Kingma, Welling, Auto-Encoding Variational Bayes, arXiv:1312.6114, 2013.
- ↑ Goodfellow et al., Generative Adversarial Networks, arXiv:1406.2661, 2014.
- ↑ Esser, Rombach, Ommer, Taming Transformers for High-Resolution Image Synthesis, arXiv:2012.09841, 2020.
- ↑ Ho, Jain, Abbeel, Denoising Diffusion Probabilistic Models, arXiv:2006.11239, 2020.
- ↑ Rombach et al., High-Resolution Image Synthesis with Latent Diffusion Models, arXiv:2112.10752, 2021.
- ↑ Heusel et al., GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium, arXiv:1706.08500, 2017.