KL Divergence

    Topic area: Information Theory
    Prerequisites: Cross-Entropy Loss, Softmax Function


    Overview

    The Kullback-Leibler divergence, often abbreviated KL divergence and written $ D_{\mathrm{KL}}(P \parallel Q) $, is a measure of how one probability distribution $ P $ differs from a second reference distribution $ Q $ over the same outcome space. It originates in information theory, where it quantifies the expected number of extra bits required to encode samples from $ P $ when using a code optimised for $ Q $. In modern machine learning it is one of the most heavily used objects, appearing implicitly inside the Cross-Entropy Loss of every classifier, explicitly inside variational autoencoders and variational inference, and as a regulariser in policy optimisation methods such as PPO and the KL penalty used during reinforcement learning from human feedback.

    KL divergence is non-negative, equals zero if and only if the two distributions are equal almost everywhere, and is asymmetric in its arguments. The asymmetry is not a defect to be patched away; it is the source of much of its expressive power, since the choice of which distribution sits in the first slot encodes a modelling decision about which directions of error are penalised. Because KL is not a metric (it fails symmetry and the triangle inequality) it is instead called a divergence, and it is the prototypical example of a broader family known as f-divergences.

    Definition

    For two discrete distributions $ P $ and $ Q $ on the same sample space $ \mathcal{X} $,

    $ {\displaystyle D_{\mathrm{KL}}(P \parallel Q) = \sum_{x \in \mathcal{X}} P(x) \log \frac{P(x)}{Q(x)}.} $

    For continuous distributions with densities $ p $ and $ q $,

    $ {\displaystyle D_{\mathrm{KL}}(P \parallel Q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx.} $

    The base of the logarithm sets the units: base 2 gives bits, the natural logarithm gives nats. By convention $ 0 \log 0 = 0 $, but if there exists any $ x $ with $ P(x) > 0 $ and $ Q(x) = 0 $ then the divergence is infinite. This absolute-continuity requirement is consequential in practice and is what motivates the smoothing tricks discussed below.
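
    To make the definition concrete, the following NumPy sketch (a minimal helper of our own, not a library routine) implements the discrete formula, honouring the $ 0 \log 0 $ convention and returning infinity when absolute continuity fails:

    ```python
    import numpy as np

    def kl_divergence(p, q):
        """Discrete D_KL(P || Q) in nats.

        Uses the convention 0 * log 0 = 0 and returns infinity
        whenever P puts mass where Q does not.
        """
        p = np.asarray(p, dtype=float)
        q = np.asarray(q, dtype=float)
        if np.any((p > 0) & (q == 0)):
            return np.inf  # absolute continuity violated
        mask = p > 0  # terms with P(x) = 0 contribute nothing
        return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

    print(kl_divergence([0.5, 0.4, 0.1], [0.3, 0.3, 0.4]))  # ~0.232 nats
    print(kl_divergence([0.3, 0.3, 0.4], [0.5, 0.4, 0.1]))  # ~0.315 nats: asymmetric
    print(kl_divergence([0.5, 0.4, 0.1], [0.5, 0.5, 0.0]))  # inf: Q misses support of P
    ```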

    KL divergence can equivalently be written as the difference between cross-entropy and entropy,

    $ {\displaystyle D_{\mathrm{KL}}(P \parallel Q) = H(P, Q) - H(P),} $

    which is the cleanest way to see why minimising cross-entropy with respect to a model $ Q $ when $ P $ is the data distribution is exactly equivalent to minimising $ D_{\mathrm{KL}}(P \parallel Q) $: the entropy term $ H(P) $ does not depend on the model and so drops out of the gradient.
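
    The identity is easy to verify numerically; a self-contained check with the same toy distributions as in the sketch above:

    ```python
    import numpy as np

    p = np.array([0.5, 0.4, 0.1])
    q = np.array([0.3, 0.3, 0.4])

    cross_entropy = -np.sum(p * np.log(q))  # H(P, Q)
    entropy = -np.sum(p * np.log(p))        # H(P)
    kl = np.sum(p * np.log(p / q))          # D_KL(P || Q)

    # D_KL(P || Q) = H(P, Q) - H(P), up to floating-point error.
    assert np.isclose(kl, cross_entropy - entropy)
    print(kl)  # ~0.232 nats
    ```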

    Properties

    The most-cited property is non-negativity, sometimes called Gibbs' inequality:

    $ {\displaystyle D_{\mathrm{KL}}(P \parallel Q) \geq 0,} $

    with equality if and only if $ P = Q $ almost everywhere. This follows from Jensen's inequality applied to the convex function $ -\log $. KL divergence is also jointly convex in the pair $ (P, Q) $, which is what makes minimisation with respect to either argument well-posed in many settings.
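
    A quick empirical spot-check of Gibbs' inequality, sampling random pairs of distributions from the simplex (a sanity test, not a proof):

    ```python
    import numpy as np

    def kl(p, q):
        mask = p > 0
        return np.sum(p[mask] * np.log(p[mask] / q[mask]))

    rng = np.random.default_rng(0)
    # Gibbs' inequality: D_KL(P || Q) >= 0 for every pair of distributions.
    for _ in range(10_000):
        p = rng.dirichlet(np.ones(5))  # random point on the 4-simplex
        q = rng.dirichlet(np.ones(5))
        assert kl(p, q) >= 0
    print("non-negativity held on 10,000 random pairs")
    ```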

    KL divergence is, however, not symmetric, nor does it satisfy the triangle inequality. In general

    $ {\displaystyle D_{\mathrm{KL}}(P \parallel Q) \neq D_{\mathrm{KL}}(Q \parallel P),} $

    and the gap between the two orderings can be arbitrarily large. The chain rule for KL divergence states that for joint distributions

    $ {\displaystyle D_{\mathrm{KL}}(P(X, Y) \parallel Q(X, Y)) = D_{\mathrm{KL}}(P(X) \parallel Q(X)) + \mathbb{E}_{P(X)}\big[D_{\mathrm{KL}}(P(Y \mid X) \parallel Q(Y \mid X))\big],} $

    a decomposition that is the workhorse behind most variational bounds.
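
    The decomposition can be checked directly on a small joint table; the $ 2 \times 2 $ distributions below are arbitrary illustrative choices:

    ```python
    import numpy as np

    def kl(p, q):
        p, q = np.asarray(p, float), np.asarray(q, float)
        mask = p > 0
        return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

    # Joint distributions over (X, Y): rows index x, columns index y.
    P = np.array([[0.30, 0.10],
                  [0.20, 0.40]])
    Q = np.array([[0.25, 0.25],
                  [0.25, 0.25]])

    lhs = kl(P.ravel(), Q.ravel())         # D_KL(P(X,Y) || Q(X,Y))
    Px, Qx = P.sum(axis=1), Q.sum(axis=1)  # marginals over X
    marginal = kl(Px, Qx)
    # Expected conditional KL, weighted by the marginal P(X).
    conditional = sum(Px[x] * kl(P[x] / Px[x], Q[x] / Qx[x])
                      for x in range(len(Px)))

    assert np.isclose(lhs, marginal + conditional)
    print(lhs, marginal + conditional)  # the chain rule holds
    ```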

    Information-theoretic intuition

    The cleanest interpretation is the coding one. Suppose Alice draws symbols from a distribution $ P $ and Bob designs a code, but Bob only knows a different distribution $ Q $. Bob will assign code lengths roughly $ -\log Q(x) $ bits per symbol. Alice's expected message length is therefore $ H(P, Q) $, while the optimal length, achievable by a code matched to $ P $, is $ H(P) $. The difference, $ D_{\mathrm{KL}}(P \parallel Q) $, is the expected number of wasted bits per symbol, and this waste is necessarily non-negative.

    A second interpretation comes from hypothesis testing. If samples are drawn from $ P $ and the question is how quickly a Neyman-Pearson test can distinguish $ P $ from $ Q $, the asymptotic exponent of the type-II error rate (Stein's lemma) is exactly $ D_{\mathrm{KL}}(P \parallel Q) $. KL divergence is thus the natural distinguishability measure between distributions in a sample-efficiency sense.

    Forward versus reverse KL

    When fitting an approximate distribution $ Q_\theta $ to a target $ P $, the choice of which argument to put first changes the resulting fit qualitatively. Minimising the forward (or moment-projection) form

    $ {\displaystyle D_{\mathrm{KL}}(P \parallel Q_\theta)} $

    heavily penalises any region where $ P $ has mass but $ Q_\theta $ does not, since the integrand contains $ \log(P/Q_\theta) $ weighted by $ P $. The result is mass-covering or mean-seeking behaviour: $ Q_\theta $ spreads itself to cover every mode of $ P $, even at the cost of placing mass in low-probability regions. Maximum likelihood estimation against a parametric family is exactly forward-KL minimisation against the empirical distribution.

    Minimising the reverse (or information-projection) form

    $ {\displaystyle D_{\mathrm{KL}}(Q_\theta \parallel P)} $

    instead penalises mass that $ Q_\theta $ places where $ P $ has none. The result is mode-seeking behaviour: $ Q_\theta $ tends to lock onto a single mode of $ P $ and ignore the rest. This is the form used inside variational inference (the ELBO is a reverse-KL bound) and inside the KL penalty of policy optimisation, where staying close to a reference policy is more important than covering every behaviour the reference might emit.

    The practical takeaway is that the two orderings answer different questions, and a method that "minimises KL" without specifying which one is ambiguous in a way that affects results.
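
    The contrast can be made concrete with a toy construction: a bimodal target with a deep valley, scored against a mode-seeking candidate and a mass-covering candidate. The specific numbers are illustrative only:

    ```python
    import numpy as np

    def kl(p, q):
        mask = p > 0
        return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

    # Bimodal target with a deep valley between its modes.
    P = np.array([0.49, 0.01, 0.01, 0.49])
    # One candidate locks onto a single mode; the other spreads
    # itself across both modes and the valley in between.
    Q_mode  = np.array([0.97, 0.01, 0.01, 0.01])
    Q_cover = np.array([0.25, 0.25, 0.25, 0.25])

    print(kl(P, Q_mode), kl(P, Q_cover))  # ~1.57 vs ~0.60
    # Forward KL prefers the mass-covering fit ...
    print(kl(Q_mode, P), kl(Q_cover, P))  # ~0.62 vs ~1.27
    # ... while reverse KL prefers the mode-seeking fit.
    ```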

    Uses in machine learning

    KL divergence is the implicit objective behind much of supervised learning. Training a classifier with the Cross-Entropy Loss over a Softmax Function output is, up to a constant, minimising forward KL between the empirical label distribution and the model's predicted distribution. The same observation extends to language modelling, where next-token cross-entropy is forward KL against the empirical next-token distribution, and to knowledge distillation, where the student is trained to minimise KL against the teacher's softened predictive distribution.
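
    For a one-hot label the entropy term $ H(P) $ vanishes, so the cross-entropy and the forward KL coincide exactly. A minimal sketch, assuming nothing beyond NumPy:

    ```python
    import numpy as np

    def softmax(z):
        z = z - z.max()  # subtract the max for numerical stability
        e = np.exp(z)
        return e / e.sum()

    logits = np.array([2.0, 0.5, -1.0])
    q = softmax(logits)            # model's predicted distribution
    p = np.array([1.0, 0.0, 0.0])  # one-hot empirical label, H(P) = 0

    cross_entropy = -np.sum(p * np.log(q))
    mask = p > 0
    forward_kl = np.sum(p[mask] * np.log(p[mask] / q[mask]))

    assert np.isclose(cross_entropy, forward_kl)
    print(cross_entropy)  # ~0.241 nats
    ```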

    In variational inference and variational autoencoders the evidence lower bound contains a reverse-KL term $ D_{\mathrm{KL}}(q_\phi(z \mid x) \parallel p(z)) $ that pulls the approximate posterior toward the prior. In policy optimisation, both trust-region methods (TRPO) and clipped surrogates (PPO) use a KL constraint or penalty between the new and old policies to prevent destructive update steps. In reinforcement learning from human feedback, a KL term against the supervised-fine-tuned reference policy stops the optimiser from drifting into reward-model exploits.
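
    For a Gaussian encoder with diagonal covariance and a standard-normal prior, the ELBO's reverse-KL term has a simple closed form (a special case of the general Gaussian formula given below). A sketch, assuming the encoder outputs a mean and a log-variance per latent dimension:

    ```python
    import numpy as np

    def kl_to_standard_normal(mu, logvar):
        """D_KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over
        latent dimensions -- the KL term of a Gaussian VAE encoder."""
        return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)

    mu = np.array([0.5, -0.2])
    logvar = np.array([0.0, -1.0])  # variances 1.0 and exp(-1)
    print(kl_to_standard_normal(mu, logvar))  # ~0.329 nats
    ```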

    KL also shows up in model selection (the Akaike information criterion is a KL-based estimator of out-of-sample fit) and in mutual information, which is itself a KL divergence: $ I(X; Y) = D_{\mathrm{KL}}(P(X, Y) \parallel P(X) P(Y)) $.
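
    The mutual-information identity is likewise easy to evaluate on a small joint table (illustrative numbers):

    ```python
    import numpy as np

    # I(X; Y) = D_KL( P(X,Y) || P(X) P(Y) ).
    P = np.array([[0.30, 0.10],
                  [0.20, 0.40]])
    Px = P.sum(axis=1, keepdims=True)  # marginal P(X), column vector
    Py = P.sum(axis=0, keepdims=True)  # marginal P(Y), row vector
    mi = np.sum(P * np.log(P / (Px * Py)))
    print(mi)  # ~0.086 nats of dependence between X and Y
    ```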

    Related divergences

    Several variants address particular weaknesses. The Jensen-Shannon divergence

    $ {\displaystyle D_{\mathrm{JS}}(P, Q) = \tfrac{1}{2} D_{\mathrm{KL}}(P \parallel M) + \tfrac{1}{2} D_{\mathrm{KL}}(Q \parallel M),\quad M = \tfrac{1}{2}(P + Q)} $

    is symmetric, always finite, and bounded above by $ \log 2 $; its square root is a true metric. Rényi divergences $ D_\alpha(P \parallel Q) $ generalise KL with a tunable order $ \alpha $ and recover KL as $ \alpha \to 1 $. Both KL and JS are members of the f-divergence family, which is the natural setting for f-GAN objectives. Optimal-transport-based distances such as the Wasserstein distance are an entirely different family that remains finite and informative even when supports are disjoint, which is precisely the regime in which KL becomes infinite.
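
    A sketch of the Jensen-Shannon divergence in the same style as the helpers above; note that it stays finite even when the supports of $ P $ and $ Q $ are disjoint, precisely where KL blows up:

    ```python
    import numpy as np

    def kl(p, q):
        mask = p > 0
        return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

    def js(p, q):
        """Jensen-Shannon divergence in nats: symmetric and finite,
        since the mixture M covers the support of both P and Q."""
        p, q = np.asarray(p, float), np.asarray(q, float)
        m = 0.5 * (p + q)
        return 0.5 * kl(p, m) + 0.5 * kl(q, m)

    p = np.array([1.0, 0.0])
    q = np.array([0.0, 1.0])    # disjoint supports: KL(P || Q) is infinite
    print(js(p, q), np.log(2))  # both ~0.693: JS hits its upper bound
    ```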

    Estimation and computational considerations

    Three issues recur in practice. First, the absolute-continuity requirement: any sampled $ x $ with $ Q(x) = 0 $ sends the estimate to infinity. Practitioners avoid this by smoothing $ Q $ (label smoothing, Laplace smoothing, additive noise) or by switching to JS or Wasserstein when supports are likely disjoint. Second, plug-in estimators of KL from samples are biased and high-variance in high dimensions; nearest-neighbour estimators and variational lower bounds (such as the Donsker-Varadhan and MINE estimators for mutual information) are common alternatives. Third, when $ P $ and $ Q $ are both Gaussian, KL has a closed form that is widely used inside VAE encoders and inside Laplace approximations:

    $ {\displaystyle D_{\mathrm{KL}}(\mathcal{N}(\mu_1, \Sigma_1) \parallel \mathcal{N}(\mu_2, \Sigma_2)) = \tfrac{1}{2}\Big[\operatorname{tr}(\Sigma_2^{-1} \Sigma_1) + (\mu_2 - \mu_1)^\top \Sigma_2^{-1} (\mu_2 - \mu_1) - k + \log \tfrac{\det \Sigma_2}{\det \Sigma_1}\Big],} $

    with $ k $ the dimension. This closed form is one reason Gaussian assumptions are so prevalent in variational machinery.
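
    The closed form translates directly into code; the helper name and the test values below are our own:

    ```python
    import numpy as np

    def kl_gaussian(mu1, S1, mu2, S2):
        """Closed-form D_KL( N(mu1, S1) || N(mu2, S2) ) in nats."""
        k = len(mu1)
        S2_inv = np.linalg.inv(S2)
        d = mu2 - mu1
        trace_term = np.trace(S2_inv @ S1)
        quad_term = d @ S2_inv @ d
        # slogdet is numerically safer than log(det(.)) for the ratio term.
        logdet_term = np.linalg.slogdet(S2)[1] - np.linalg.slogdet(S1)[1]
        return 0.5 * (trace_term + quad_term - k + logdet_term)

    mu1, S1 = np.zeros(2), np.eye(2)
    mu2, S2 = np.array([1.0, 0.0]), 2.0 * np.eye(2)
    print(kl_gaussian(mu1, S1, mu2, S2))  # ~0.443 nats
    print(kl_gaussian(mu1, S1, mu1, S1))  # 0.0: a distribution against itself
    ```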

    Limitations

    KL divergence is not a distance and so should not be treated as one. The asymmetry surprises practitioners who expect $ D_{\mathrm{KL}}(P \parallel Q) $ to behave like a norm; the failure of the triangle inequality breaks any attempt to bound errors by composing KL terms additively. The infinite-divergence pathology when supports differ is a serious problem in adversarial settings, where the generator and the data may live on lower-dimensional manifolds that do not overlap. Sample-based estimates suffer from the curse of dimensionality, and naive Monte Carlo estimates can have unbounded variance when $ P $ and $ Q $ are far apart. None of this makes KL the wrong tool, but each of these failure modes has motivated a specific alternative (Loss Functions based on Wasserstein, JS, or hybrid criteria) that is preferable in its respective regime.

    References


    1. Kullback, S. and Leibler, R. A. On Information and Sufficiency. Annals of Mathematical Statistics, 1951.
    2. Cover, T. M. and Thomas, J. A. Elements of Information Theory, 2nd edition. Wiley, 2006.
    3. MacKay, D. J. C. Information Theory, Inference, and Learning Algorithms. Cambridge University Press, 2003.
    4. Kingma, D. P. and Welling, M. Auto-Encoding Variational Bayes. arXiv:1312.6114, 2013.
    5. Schulman, J. et al. Proximal Policy Optimization Algorithms. arXiv:1707.06347, 2017.
    6. Ouyang, L. et al. Training Language Models to Follow Instructions with Human Feedback. arXiv:2203.02155, 2022.