Generative Adversarial Networks

    From Marovi AI
    Topic area: Deep Learning
    Prerequisites: Neural Networks, Backpropagation, Stochastic Gradient Descent


    Overview

    Generative adversarial networks (GANs) are a class of generative models in which two neural networks are trained jointly through a two-player game. A generator network maps samples from a simple noise distribution to candidate samples in the data space, while a discriminator network attempts to tell apart real training samples from generated ones. Training proceeds by updating the discriminator to improve its classification accuracy and updating the generator to produce samples the discriminator cannot distinguish from real data. Introduced by Goodfellow and colleagues in 2014, GANs became one of the dominant approaches to high-fidelity image synthesis during the late 2010s and remain influential in domains where likelihood-based training is awkward or where sharp, high-frequency outputs are desired.[1] Although diffusion-based methods have since taken over much of the image-generation frontier, GANs continue to be used for image-to-image translation, super-resolution, audio synthesis, and as building blocks within larger systems.

    Intuition

    The standard analogy describes the generator as a counterfeiter and the discriminator as a detective. The counterfeiter produces fake currency; the detective inspects bills and labels each as real or fake. Each side improves in response to the other: as the detective gets better at spotting fakes, the counterfeiter must produce more convincing forgeries, and as the counterfeit quality rises, the detective must look at finer details. At equilibrium, the counterfeit currency is indistinguishable from genuine bills and the detective can do no better than chance.

    This adversarial setup avoids an explicit likelihood objective. Rather than asking the generator to maximize the probability it assigns to training data, GANs ask only that its samples be statistically indistinguishable from training samples under a learned classifier. The training signal therefore comes from a learned, adaptive critic rather than a fixed loss surface, which is a key reason GAN samples often look sharper than samples from likelihood-trained models that smear probability mass over implausible regions.

    Formulation

    Let $ p_{\text{data}} $ denote the unknown distribution of real data and let $ p_z $ denote a fixed prior over a latent vector $ z $, typically a standard Gaussian or uniform. The generator $ G_\theta : \mathcal{Z} \to \mathcal{X} $ maps latents to data space, inducing an implicit distribution $ p_g $ over $ \mathcal{X} $. The discriminator $ D_\phi : \mathcal{X} \to (0, 1) $ outputs the probability that its input is a real sample. The original GAN objective is the minimax value function

    $ {\displaystyle \min_\theta \max_\phi \; V(D_\phi, G_\theta) = \mathbb{E}_{x \sim p_{\text{data}}}[\log D_\phi(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D_\phi(G_\theta(z)))].} $
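    As a concrete check, the value function can be estimated by Monte Carlo sampling. The sketch below (an illustration, not from the original paper) uses a constant discriminator $ D(x) = 1/2 $, for which both expectations equal $ \log(1/2) $, so $ V = -\log 4 $ regardless of what the data look like.

```python
import numpy as np

rng = np.random.default_rng(0)

def value_function(D, real_samples, fake_samples):
    """Monte Carlo estimate of V(D, G) = E_x[log D(x)] + E_z[log(1 - D(G(z)))]."""
    return (np.mean(np.log(D(real_samples)))
            + np.mean(np.log(1.0 - D(fake_samples))))

# A maximally confused discriminator outputs 1/2 everywhere -- the value
# it takes at the global optimum p_g = p_data.
D_blind = lambda x: np.full_like(x, 0.5)

real = rng.normal(4.0, 1.0, size=10_000)   # stand-in samples from p_data
fake = rng.normal(0.0, 1.0, size=10_000)   # stand-in samples from p_g

v = value_function(D_blind, real, fake)
print(v)  # -log 4, approximately -1.3863, independent of the samples
```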

    For a fixed generator, the optimal discriminator is

    $ {\displaystyle D^*(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)},} $

    and substituting this back yields a generator objective equal (up to constants) to the Jensen-Shannon divergence between $ p_{\text{data}} $ and $ p_g $. The unique global minimum is therefore $ p_g = p_{\text{data}} $, at which point the discriminator outputs $ 1/2 $ everywhere. This theoretical result motivates the design but does not by itself guarantee convergence under realistic, non-convex parameterizations.
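    Both closed forms can be verified numerically. The sketch below (with two arbitrarily chosen Gaussian densities standing in for $ p_{\text{data}} $ and $ p_g $) evaluates $ D^* $ on a grid and checks that the objective at $ D^* $ equals $ 2\,\mathrm{JSD}(p_{\text{data}}, p_g) - \log 4 $.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def integrate(f, x):
    """Trapezoidal rule on a grid (avoids NumPy-version differences)."""
    return float(np.sum(0.5 * (f[:-1] + f[1:]) * np.diff(x)))

x = np.linspace(-10.0, 10.0, 200_001)   # integration grid
p_data = gaussian_pdf(x, 0.0, 1.0)      # toy real-data density
p_g = gaussian_pdf(x, 1.0, 1.0)         # toy generator density

# Optimal discriminator for a fixed generator.
d_star = p_data / (p_data + p_g)

# Objective at D*: E_pdata[log D*] + E_pg[log(1 - D*)].
v_star = integrate(p_data * np.log(d_star) + p_g * np.log(1.0 - d_star), x)

# Jensen-Shannon divergence between the two densities.
m = 0.5 * (p_data + p_g)
jsd = 0.5 * integrate(p_data * np.log(p_data / m), x) \
    + 0.5 * integrate(p_g * np.log(p_g / m), x)

print(v_star, 2.0 * jsd - np.log(4.0))  # agree up to integration error
```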

    Training and inference

    Training alternates stochastic gradient steps on $ \phi $ and $ \theta $. A common schedule performs one or more discriminator steps per generator step, since a strong discriminator is needed to provide useful gradients to the generator. Both updates rely on backpropagation through the discriminator: the generator gradient is computed by differentiating $ \log(1 - D_\phi(G_\theta(z))) $ with respect to $ \theta $ through the composed network.

    In practice, the original generator loss $ \log(1 - D) $ saturates when the discriminator confidently rejects fakes early in training, producing vanishing gradients. The standard fix is the non-saturating formulation

    $ {\displaystyle \mathcal{L}_G = -\mathbb{E}_{z \sim p_z}[\log D_\phi(G_\theta(z))],} $

    which has the same fixed points but provides stronger gradients when the generator is losing. Other widely used stabilizers include label smoothing, spectral normalization on the discriminator, gradient penalties, and exponential moving averages of generator weights at evaluation time. Inference is straightforward: draw $ z \sim p_z $ and return $ G_\theta(z) $. The discriminator is normally discarded after training, although it is sometimes reused as a perceptual feature extractor or critic.
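    The alternating schedule and the non-saturating loss can be made concrete with a deliberately tiny example (a sketch under strong simplifying assumptions, not a practical GAN): real data are drawn from $ \mathcal{N}(4, 1) $, the generator is a pure shift $ G(z) = \theta + z $, and the discriminator is logistic, $ D(x) = \sigma(wx + b) $, so every gradient can be written by hand.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda s: 1.0 / (1.0 + np.exp(-s))

theta = 0.0        # generator parameter: G(z) = theta + z
w, b = 0.0, 0.0    # discriminator parameters: D(x) = sigmoid(w*x + b)
lr, batch, k = 0.02, 128, 5   # k discriminator steps per generator step

for step in range(3000):
    # --- discriminator: ascend E[log D(x)] + E[log(1 - D(G(z)))] ---
    for _ in range(k):
        x_real = rng.normal(4.0, 1.0, batch)           # samples from p_data
        x_fake = theta + rng.normal(0.0, 1.0, batch)   # samples from p_g
        d_real = sigmoid(w * x_real + b)
        d_fake = sigmoid(w * x_fake + b)
        # Gradients of the negated objective, derived by hand.
        grad_w = np.mean(-(1.0 - d_real) * x_real) + np.mean(d_fake * x_fake)
        grad_b = np.mean(-(1.0 - d_real)) + np.mean(d_fake)
        w -= lr * grad_w
        b -= lr * grad_b

    # --- generator: descend the non-saturating loss -E[log D(G(z))] ---
    x_fake = theta + rng.normal(0.0, 1.0, batch)
    d_fake = sigmoid(w * x_fake + b)
    grad_theta = np.mean(-(1.0 - d_fake) * w)   # chain rule through D
    theta -= lr * grad_theta

print(theta)  # should settle near 4, the mean of the real data
```

    The same hand-derived gradient also shows why the non-saturating form helps: when the discriminator confidently rejects fakes ($ D \approx 0 $), the factor $ (1 - D) $ stays close to one, whereas the original loss contributes a factor of $ D $ that vanishes.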

    Variants

    The space of GAN variants is large; a few representative families are listed below.

    • DCGAN replaces fully connected layers with deep convolutional architectures and prescribes batch normalization, strided convolutions, and ReLU/LeakyReLU activations to stabilize image synthesis.
    • Conditional GANs condition both $ G $ and $ D $ on auxiliary information $ y $ such as a class label or a text embedding, enabling controllable generation.
    • Wasserstein GAN replaces the Jensen-Shannon objective with the Wasserstein-1 distance, using a 1-Lipschitz critic enforced by weight clipping or, more commonly, a gradient penalty (WGAN-GP). This often yields more stable training and a loss that correlates with sample quality.
    • CycleGAN learns image-to-image translation between unpaired domains by composing two generators with a cycle-consistency loss, enabling photo-to-painting and similar mappings without paired training data.
    • StyleGAN introduces a mapping network that disentangles latent factors and a style-modulated synthesis network, achieving state-of-the-art photorealism on faces and other domains for several years.
    • Progressive growing starts training at low resolution and incrementally adds layers, which simplifies the optimization landscape for high-resolution synthesis.
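    Of these, the cycle-consistency idea is particularly easy to state in code. The sketch below (a toy with linear "generators", purely illustrative) computes the cycle loss $ \mathcal{L}_{\text{cyc}} = \mathbb{E}_x \lVert F(G(x)) - x \rVert_1 + \mathbb{E}_y \lVert G(F(y)) - y \rVert_1 $ for a pair of mappings that are exact inverses, and for a mismatched pair.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "generators" between two 1-D domains: G maps X -> Y, F maps Y -> X.
G = lambda x: 2.0 * x + 1.0
F = lambda y: (y - 1.0) / 2.0   # exact inverse of G
F_bad = lambda y: y / 2.0       # NOT the inverse of G

def cycle_consistency_loss(G, F, x_batch, y_batch):
    """L_cyc = E_x |F(G(x)) - x|_1 + E_y |G(F(y)) - y|_1."""
    forward = np.mean(np.abs(F(G(x_batch)) - x_batch))
    backward = np.mean(np.abs(G(F(y_batch)) - y_batch))
    return forward + backward

x = rng.normal(size=256)   # samples from domain X
y = rng.normal(size=256)   # samples from domain Y

loss_inverse = cycle_consistency_loss(G, F, x, y)
loss_mismatch = cycle_consistency_loss(G, F_bad, x, y)
print(loss_inverse, loss_mismatch)  # 0 for exact inverses; 1.5 for the mismatch
```

    In the full CycleGAN objective this term is added to two adversarial losses, one per domain, and is what prevents the unpaired translators from mapping all inputs to a single plausible output.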

    Comparisons

    Compared to other generative model families, GANs occupy a particular point in a quality-versus-tractability trade-off. Likelihood-based models such as variational autoencoders, normalizing flows, and autoregressive models permit exact or bounded log-likelihood evaluation but tend to produce blurrier samples, because maximum likelihood heavily penalizes assigning low probability to any training example and therefore encourages spreading mass over the full data support, including implausible regions. GANs sidestep this by training a learned critic, which often yields sharper outputs but offers no straightforward likelihood evaluation, complicating model selection and density estimation.

    Against diffusion models, GANs are typically much faster at sampling, since one forward pass through $ G $ produces a sample, whereas diffusion requires many denoising steps. Diffusion models, however, are usually easier to train, scale more predictably with data and compute, and have largely overtaken GANs on benchmarks such as ImageNet class-conditional synthesis. Recent work has narrowed this sampling-speed gap by distilling diffusion into few-step samplers, while GAN research has correspondingly explored hybrid objectives.

    Limitations

    GAN training is notoriously fragile. Because the objective is a saddle-point problem rather than a minimization, ordinary gradient descent need not converge; oscillations and divergent dynamics are common. The most cited failure mode is mode collapse, in which the generator concentrates on a small subset of the data distribution that the current discriminator cannot reject, leading to outputs that lack diversity. Diagnosing mode collapse is hard because GANs lack a tractable likelihood; practitioners rely on proxy metrics such as Inception Score and Fréchet Inception Distance (FID), which are themselves imperfect.
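    FID compares Gaussian fits to real and generated features: $ \mathrm{FID} = \lVert \mu_r - \mu_g \rVert^2 + \mathrm{Tr}\big(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2}\big) $. The sketch below is illustrative only: it assumes diagonal covariances so the matrix square root becomes elementwise (real FID uses full covariances of Inception-v3 features), and the "features" are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)

def fid_diagonal(feats_real, feats_gen):
    """FID between Gaussian fits, with covariances assumed diagonal.

    FID = ||mu_r - mu_g||^2 + Tr(S_r + S_g - 2 (S_r S_g)^{1/2});
    for diagonal S the trace term reduces to sum_i (sqrt(s_r,i) - sqrt(s_g,i))^2.
    """
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    var_r, var_g = feats_real.var(axis=0), feats_gen.var(axis=0)
    return (np.sum((mu_r - mu_g) ** 2)
            + np.sum((np.sqrt(var_r) - np.sqrt(var_g)) ** 2))

# Toy "features": a matching generator scores near 0; a shifted one
# scores clearly higher (lower FID is better).
real = rng.normal(0.0, 1.0, size=(5000, 16))
good = rng.normal(0.0, 1.0, size=(5000, 16))
bad = rng.normal(1.0, 1.0, size=(5000, 16))

fid_good = fid_diagonal(real, good)
fid_bad = fid_diagonal(real, bad)
print(fid_good, fid_bad)
```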

    Other practical issues include sensitivity to architectural choices and hyperparameters, vanishing or exploding discriminator gradients, and the difficulty of detecting when training has finished. From a theoretical standpoint, the existence and uniqueness of equilibria for non-convex parameterizations are not guaranteed, and the Jensen-Shannon divergence becomes uninformative when $ p_{\text{data}} $ and $ p_g $ have disjoint supports, which motivated the move to Wasserstein-style objectives.

    Applications

    GANs have been applied to image super-resolution, inpainting, style transfer, domain adaptation, semantic segmentation, voice and music generation, and physics simulation. They are also used as a learned perceptual loss within other systems and as data augmenters for downstream classifiers. In production settings where sampling latency dominates, single-pass GAN generators remain attractive even where diffusion-based models would yield higher peak quality.

    References

    [1] Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. "Generative Adversarial Nets." Advances in Neural Information Processing Systems 27 (2014).