Classifier-Free Guidance
| Article | |
|---|---|
| Topic area | Generative Models |
| Prerequisites | Diffusion Models, Score Matching |
Overview
Classifier-free guidance (CFG) is an inference-time technique that lets a single conditional diffusion model trade sample diversity for fidelity to a conditioning signal, without training a separate classifier. It was introduced by Ho and Salimans in 2021 as a simpler alternative to classifier guidance, which had previously been used to push samples from text-to-image and class-conditional diffusion models toward higher quality. The trick is to train one network that can serve both as a conditional and an unconditional model by randomly dropping the conditioning input during training, and then extrapolate at sampling time along the direction from the unconditional to the conditional prediction.
CFG has become the default sampling strategy for almost every modern conditional diffusion model, including text-to-image systems such as Imagen, Stable Diffusion, and DALL-E 2, as well as text-to-video and text-to-audio variants. The same idea has been adapted to autoregressive and flow-matching generative models. A scalar guidance scale controls how aggressively the sampler follows the conditioning, giving practitioners a single knob to tune the conditioning-versus-diversity tradeoff at inference.
Background: classifier guidance
Earlier work by Dhariwal and Nichol used a classifier $ p_\phi(y \mid x_t) $ trained on noisy images to steer an unconditional diffusion model toward a target class. At each reverse-diffusion step, the score of the noisy data distribution is augmented with the gradient of the classifier log-likelihood:
$ {\displaystyle \nabla_{x_t} \log p(x_t \mid y) = \nabla_{x_t} \log p(x_t) + \nabla_{x_t} \log p_\phi(y \mid x_t).} $
Multiplying the classifier gradient by a scale $ w $ sharpens the conditional distribution and trades diversity for sample quality. The drawback is practical: the classifier must be trained on noisy inputs across all noise levels, it is hard to extend to free-form conditioning such as text, and the classifier itself is an extra model to maintain.
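The classifier-guided score update can be sketched in a few lines. This is a minimal numpy illustration with placeholder arrays standing in for a real score network and noise-aware classifier; the function name and toy values are not from any particular codebase.

```python
import numpy as np

def classifier_guided_score(score_uncond, grad_log_classifier, w):
    """Classifier guidance: augment the unconditional score
    nabla_x log p(x_t) with the scaled classifier gradient
    w * nabla_x log p_phi(y | x_t)."""
    return score_uncond + w * grad_log_classifier

# Toy stand-ins for the two gradients (not a real model):
s_u = np.array([0.5, -1.0])   # unconditional score
g   = np.array([0.2,  0.3])   # classifier log-likelihood gradient
guided = classifier_guided_score(s_u, g, w=2.0)
```

Setting $ w = 0 $ recovers the plain unconditional score, while larger $ w $ leans the update toward the classifier's preferred region.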
Formulation
CFG eliminates the external classifier by reusing the diffusion network as an implicit one. Bayes' rule applied to the conditional score gives
$ {\displaystyle \nabla_{x_t} \log p(x_t \mid y) = \nabla_{x_t} \log p(x_t) + \nabla_{x_t} \log p(y \mid x_t),} $
so the implicit classifier gradient is the difference between the conditional and unconditional scores. A guidance-weighted score is defined by extrapolating along that difference with weight $ w \ge 0 $:
$ {\displaystyle \tilde{\epsilon}_\theta(x_t, y) = (1 + w)\,\epsilon_\theta(x_t, y) - w\,\epsilon_\theta(x_t, \varnothing),} $
where $ \epsilon_\theta $ is the noise-prediction network, $ \varnothing $ is a learned null token that stands in for "no conditioning", and $ w $ is the guidance scale. Setting $ w = 0 $ recovers the conditional model; $ w \to \infty $ drives samples toward the modes most strongly preferred by the conditioning. Some references parameterize the same operation with a scale $ s = 1 + w $ applied to the conditional and $ s - 1 $ subtracted from the unconditional; the two conventions are equivalent.
The same expression rewritten as a score update is
$ {\displaystyle \tilde{s}_\theta(x_t, y) = s_\theta(x_t, y) + w \big( s_\theta(x_t, y) - s_\theta(x_t, \varnothing) \big),} $
which makes the geometric reading explicit: take a step from the unconditional score toward the conditional score, then continue past it. CFG is therefore an extrapolation, not an interpolation, in score space.
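The extrapolation formula is a one-liner in practice. The sketch below shows the noise-prediction form and checks that the alternative $ s = 1 + w $ parameterization mentioned above gives the same result; the arrays are toy values, not outputs of a real network.

```python
import numpy as np

def cfg_combine(eps_cond, eps_uncond, w):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction past the conditional one, (1 + w)*eps_c - w*eps_u."""
    return (1.0 + w) * eps_cond - w * eps_uncond

eps_c = np.array([0.1, 0.4])   # conditional prediction (toy values)
eps_u = np.array([0.0, 0.2])   # unconditional prediction (toy values)

# w = 0 recovers the plain conditional model:
assert np.allclose(cfg_combine(eps_c, eps_u, 0.0), eps_c)

# Equivalent convention with scale s = 1 + w: s*eps_c - (s - 1)*eps_u
s = 1.0 + 7.5
assert np.allclose(cfg_combine(eps_c, eps_u, 7.5),
                   s * eps_c - (s - 1.0) * eps_u)
```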
Training
A single network is trained to handle both conditional and unconditional inputs. With probability $ p_\text{drop} $ the conditioning $ y $ is replaced by the null token $ \varnothing $; otherwise the true conditioning is used. The standard denoising objective then becomes
$ {\displaystyle \mathcal{L}_\text{CFG} = \mathbb{E}_{x_0, y, \epsilon, t}\big[\,\lVert \epsilon - \epsilon_\theta(x_t, c)\rVert^2\,\big], \quad c = \begin{cases} \varnothing & \text{with probability } p_\text{drop} \\ y & \text{otherwise.} \end{cases}} $
Typical drop probabilities are 10 to 20 percent. The same parameters thus learn to denoise at every noise level both with and without the conditioning, and the unconditional pathway plays the role of the implicit classifier.
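The conditioning dropout is a small preprocessing step on each training batch. A minimal sketch, assuming integer class labels and a hypothetical sentinel id for the null token (real implementations typically swap in a learned null embedding instead):

```python
import numpy as np

rng = np.random.default_rng(0)
NULL_TOKEN = -1   # hypothetical sentinel id for the learned null embedding
P_DROP = 0.1      # typical drop probability, 10 to 20 percent

def apply_conditioning_dropout(labels, p_drop=P_DROP):
    """Replace each conditioning label with the null token
    independently with probability p_drop."""
    labels = np.asarray(labels).copy()
    drop = rng.random(labels.shape) < p_drop
    labels[drop] = NULL_TOKEN
    return labels

batch = np.arange(8)   # stand-in class labels for one batch
dropped = apply_conditioning_dropout(batch)
```

The rest of the training step is unchanged: the network receives `dropped` as its conditioning and is trained with the ordinary denoising loss.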
Inference
At each reverse-diffusion step the sampler runs two forward passes: one with the conditioning and one with the null token. The two predictions are combined with the CFG formula above, and the result is fed into the chosen DDPM or DDIM update rule. The main operational cost is the doubled compute per step; in practice the two passes are batched into a single forward call, so the wall-clock penalty is usually well below a factor of two.
The guidance scale is the dominant hyperparameter at sampling time. Values around $ w = 7.5 $ are common defaults for text-to-image models; class-conditional ImageNet models often use smaller values around $ w = 1 $ to $ 3 $. Larger scales improve metrics that reward sample-prompt alignment, such as CLIP score, but degrade FID and visibly oversaturate or simplify outputs. The optimal scale is dataset and model dependent and is typically chosen by sweeping a small grid.
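One guided evaluation, with the two passes batched together, looks roughly like the sketch below. The `model(x, t, c)` callable is a hypothetical noise-prediction network with a leading batch axis; the toy model at the bottom exists only to make the example runnable.

```python
import numpy as np

def guided_eps(model, x_t, t, y, null_token, w=7.5):
    """One CFG evaluation: stack the conditional and unconditional
    inputs into a single batch, run one forward pass, then extrapolate."""
    x2 = np.concatenate([x_t, x_t], axis=0)        # duplicate the latents
    c2 = np.concatenate([y, null_token], axis=0)   # [conditioning; null]
    eps = model(x2, t, c2)                         # single batched pass
    eps_c, eps_u = np.split(eps, 2, axis=0)
    return (1.0 + w) * eps_c - w * eps_u

# Toy "model" that just broadcasts its conditioning over x's shape:
toy = lambda x, t, c: x * 0 + c[:, None]
x = np.ones((1, 4))
out = guided_eps(toy, x, t=0, y=np.array([1.0]),
                 null_token=np.array([0.0]), w=2.0)
```

The combined prediction then replaces $ \epsilon_\theta $ inside whichever DDPM or DDIM step the sampler uses.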
Tradeoffs and pathologies
CFG provides a one-knob tradeoff between fidelity to the prompt and sample diversity, but high guidance scales introduce systematic artifacts. Pixel-space diffusion models tend to oversaturate colors and produce excessively high-contrast images at large $ w $; this is sometimes mitigated by dynamic thresholding, which clips and rescales pixel statistics during sampling to keep them in range, as used in Imagen. Latent-space models such as Stable Diffusion show similar tendencies in the form of cartoonized textures and mode collapse onto a small number of canonical compositions.
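Dynamic thresholding can be sketched as follows. This is an illustrative version assuming data scaled to $[-1, 1]$: it clips the predicted clean image at a high percentile of its absolute values and rescales, rather than clipping at a fixed threshold.

```python
import numpy as np

def dynamic_threshold(x0_pred, percentile=99.5):
    """Imagen-style dynamic thresholding (a sketch): clip the predicted
    clean image to a per-sample percentile of its absolute pixel values,
    then rescale back into [-1, 1]."""
    s = np.percentile(np.abs(x0_pred), percentile)
    s = max(s, 1.0)   # never shrink images that are already in range
    return np.clip(x0_pred, -s, s) / s

x = np.array([0.5, -2.0, 4.0])   # oversaturated prediction (toy values)
rescaled = dynamic_threshold(x)  # largest magnitudes pulled back to 1
```

Only the outliers are clipped, so most pixel values are merely rescaled rather than saturated at the boundary.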
Empirically, CFG also amplifies biases present in the training data. Because the technique pushes samples toward the modes the model associates most strongly with the prompt, stereotypical associations are emphasized. This is a model and data property rather than a flaw of guidance itself, but it interacts with guidance scale in ways that make audits sensitive to the choice of $ w $.
A separate concern is theoretical: the guided score is generally not the score of any normalizable density. Sampling under CFG is therefore best understood as a heuristic that shifts probability mass rather than as exact inference under a well-defined posterior.
Variants and extensions
Several refinements aim to retain CFG's benefits while reducing its costs. CFG++ reformulates the guidance update so that high scales preserve more of the unconditional distribution's structure, mitigating saturation. Autoguidance uses a smaller or earlier-checkpoint version of the same model as the unconditional branch instead of a learned null token, decoupling the strength of guidance from the quality gap between conditional and unconditional pathways. Dynamic CFG schedules the guidance scale across noise levels, for example restricting strong guidance to an interval of intermediate noise levels. Negative prompting replaces the null token with a user-specified negative prompt $ y^- $, so that the sampler is pushed away from $ p(x_t \mid y^-) $ toward $ p(x_t \mid y) $; this is widely used in text-to-image interfaces.
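Two of these variants amount to small changes around the same extrapolation formula. The sketch below shows a hypothetical interval schedule for the guidance scale and the negative-prompting substitution; the interval endpoints and scale are illustrative choices, not values from any specific paper.

```python
import numpy as np

def scheduled_scale(t, T, w_max=7.5, lo=0.2, hi=0.8):
    """Hypothetical dynamic-CFG schedule: apply full guidance only when
    normalized time t/T lies inside [lo, hi], and none elsewhere."""
    u = t / T
    return w_max if lo <= u <= hi else 0.0

def apply_guidance(eps_cond, eps_other, w):
    """eps_other is the null-token prediction for plain CFG, or the
    prediction conditioned on a negative prompt y- for negative
    prompting; the extrapolation itself is identical."""
    return (1.0 + w) * eps_cond - w * eps_other

assert scheduled_scale(0, 1000) == 0.0      # no guidance at the start
assert scheduled_scale(500, 1000) == 7.5    # full guidance mid-trajectory
```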
CFG has also been ported to flow matching, rectified flows, and consistency models, and to autoregressive sequence models, where the analog is to extrapolate per-token logits from an unconditional pass toward a conditional one. Related ideas appear in RLHF and reward-tilted sampling, which can be viewed as a learned alternative to the implicit classifier.
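The autoregressive analog applies the same extrapolation to next-token logits before the softmax. A minimal sketch with toy logit vectors:

```python
import numpy as np

def cfg_logits(logits_cond, logits_uncond, w):
    """CFG for autoregressive models: extrapolate per-token logits past
    the conditional pass, then renormalize with a softmax."""
    z = (1.0 + w) * logits_cond - w * logits_uncond
    z = z - z.max()          # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

lc = np.array([2.0, 1.0, 0.0])   # conditional logits (toy values)
lu = np.array([1.0, 1.0, 1.0])   # unconditional logits (toy values)
p = cfg_logits(lc, lu, w=1.0)    # sharper than the w = 0 distribution
```

As in the diffusion case, $ w = 0 $ recovers the plain conditional distribution, and larger $ w $ concentrates probability on tokens the conditioning favors.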
Comparison with classifier guidance
CFG and classifier guidance produce qualitatively similar effects, but CFG is preferred in almost every modern setting. It avoids training a noise-aware classifier; it handles arbitrary conditioning, including free-form text, without architectural changes; and it benefits from the same scaling laws as the underlying diffusion model. Classifier guidance retains a niche role when an external reward model is genuinely needed, for example to bias generation toward a property that was not part of training conditioning.
Limitations
The doubled inference cost is the most often cited limitation; ongoing work on guidance distillation tries to fold the effect of CFG into a single forward pass. The lack of a proper probabilistic interpretation makes it awkward to combine CFG with techniques that require well-defined likelihoods, such as some posterior sampling methods. And the saturation, mode-narrowing, and bias-amplification effects mean that CFG is not a free lunch: it improves nominal metrics while producing distributions that differ in kind, not just in degree, from the unguided model.