Entropy
| Article | |
|---|---|
| Topic area | Information Theory |
| Prerequisites | Probability, Random Variable |
Overview
Entropy is a measure of the uncertainty, randomness, or average information content of a probability distribution over a random variable. Introduced by Claude Shannon in 1948 as the foundation of information theory, it quantifies how unpredictable an outcome is on average: a fair coin has higher entropy than a biased one, and a uniform distribution over many outcomes has higher entropy than a peaked one. In machine learning, entropy underlies loss functions for classification, splitting criteria for decision trees, exploration bonuses in reinforcement learning, and the variational objectives used to train generative models. It also provides the language used to compare distributions through cross-entropy and Kullback-Leibler divergence.
Intuition
Imagine receiving messages from a source whose alphabet is known, but whose specific outputs are not. If the source almost always emits the same symbol, observing it conveys very little new information; if every symbol is equally likely, each observation is maximally informative. Entropy captures this average informativeness in units of bits (when the logarithm is base 2) or nats (natural logarithm).
A common interpretation is the expected number of yes/no questions needed to identify an outcome under an optimal questioning strategy. A uniform distribution over eight outcomes has an entropy of three bits, matching the three binary questions required to isolate one of eight possibilities. A skewed distribution requires fewer questions on average because likely outcomes can be identified with shorter codewords, which is the basis of entropy coding in data compression.
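To make the correspondence concrete, consider the skewed distribution $ p = (\tfrac{1}{2}, \tfrac{1}{4}, \tfrac{1}{8}, \tfrac{1}{8}) $ over four symbols:
$ {\displaystyle H = \tfrac{1}{2}\log_2 2 + \tfrac{1}{4}\log_2 4 + \tfrac{1}{8}\log_2 8 + \tfrac{1}{8}\log_2 8 = \tfrac{1}{2} + \tfrac{1}{2} + \tfrac{3}{8} + \tfrac{3}{8} = 1.75 \text{ bits},} $
compared with $ \log_2 4 = 2 $ bits for the uniform distribution over the same four symbols. An optimal questioning strategy (equivalently, a prefix code) asks 1, 2, 3, and 3 questions for the four outcomes, and 1.75 is exactly the probability-weighted average of those lengths.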
Discrete Formulation
For a discrete random variable $ X $ taking values in a finite alphabet $ \mathcal{X} $ with probability mass function $ p(x) = \Pr(X = x) $, the Shannon entropy is
$ {\displaystyle H(X) = -\sum_{x \in \mathcal{X}} p(x) \log p(x),} $
with the convention $ 0 \log 0 = 0 $. The choice of logarithm base sets the unit: base 2 yields bits, base $ e $ yields nats, and base 10 yields hartleys. Entropy depends only on the probabilities, not on the labels of the outcomes.
Important properties include the following; a short numerical check appears after the list:
- Non-negativity: $ H(X) \geq 0 $, with equality only when one outcome has probability one.
- Maximum at uniform: $ H(X) \leq \log |\mathcal{X}| $, achieved when $ p $ is uniform.
- Concavity: $ H $ is a concave function of $ p $, which underpins the non-negativity of mutual information.
- Invariance to relabelling: permuting the outcomes leaves $ H(X) $ unchanged.
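The definition and the first two properties are easy to verify numerically. The following sketch (the helper function and the example distributions are illustrative, not taken from any particular library) works in bits:

```python
import numpy as np

def entropy(p, base=2):
    """Shannon entropy of a discrete distribution given as an array of probabilities."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                           # convention: 0 log 0 = 0
    return -np.sum(p * np.log(p)) / np.log(base)

uniform = np.ones(8) / 8                   # uniform over eight outcomes
skewed = np.array([0.5, 0.25, 0.125, 0.125, 0.0, 0.0, 0.0, 0.0])

print(entropy(uniform))                    # 3.0 bits = log2(8), the maximum for |X| = 8
print(entropy(skewed))                     # 1.75 bits: concentrating mass lowers entropy
print(entropy([1.0]))                      # 0.0 bits: a certain outcome is uninformative
```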
Joint, Conditional, and Mutual Information
For two random variables $ X $ and $ Y $ with joint distribution $ p(x, y) $, the joint entropy is
$ {\displaystyle H(X, Y) = -\sum_{x, y} p(x, y) \log p(x, y),} $
and the conditional entropy of $ Y $ given $ X $ is
$ {\displaystyle H(Y \mid X) = -\sum_{x, y} p(x, y) \log p(y \mid x).} $
These satisfy the chain rule $ H(X, Y) = H(X) + H(Y \mid X) $, expressing that the uncertainty in a pair equals the uncertainty in the first variable plus the residual uncertainty in the second once the first is known. The reduction in uncertainty about $ Y $ after observing $ X $ is the mutual information,
$ {\displaystyle I(X; Y) = H(Y) - H(Y \mid X) = H(X) + H(Y) - H(X, Y),} $
which is symmetric, non-negative, and zero if and only if $ X $ and $ Y $ are independent. Mutual information is widely used as a model-free measure of statistical dependence and as a training signal in representation learning.
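These identities are straightforward to check numerically; the joint table below is made up for illustration:

```python
import numpy as np

def H(p):
    """Entropy in bits of an array of probabilities (zero entries are ignored)."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Hypothetical joint distribution p(x, y) over a 2 x 3 alphabet.
pxy = np.array([[0.10, 0.20, 0.10],
                [0.30, 0.05, 0.25]])
px = pxy.sum(axis=1)                                     # marginal of X
py = pxy.sum(axis=0)                                     # marginal of Y

# Conditional entropy H(Y | X) computed directly from p(y | x) = p(x, y) / p(x).
H_Y_given_X = -np.sum(pxy * np.log2(pxy / px[:, None]))

print(np.isclose(H(pxy), H(px) + H_Y_given_X))           # chain rule H(X, Y) = H(X) + H(Y | X)
I_XY = H(py) - H_Y_given_X                               # mutual information I(X; Y)
print(np.isclose(I_XY, H(px) + H(py) - H(pxy)))          # the two expressions for I(X; Y) agree
print(bool(I_XY >= 0))                                   # non-negative, as claimed
```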
Differential Entropy
For a continuous random variable with density $ f(x) $, the analogous quantity is the differential entropy,
$ {\displaystyle h(X) = -\int f(x) \log f(x) \, dx.} $
Unlike the discrete case, differential entropy can be negative and is not invariant under change of variables, so it should not be interpreted as an absolute information content. Nevertheless, its differences and conditional forms remain meaningful, and it appears throughout continuous information theory. Among distributions on the real line with fixed mean and variance, the Gaussian uniquely maximizes differential entropy, a result that motivates the use of Gaussians as maximum-entropy priors when only the first two moments are known.
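The Gaussian's differential entropy has a closed form, $ h = \tfrac{1}{2}\ln(2\pi e \sigma^2) $ nats, and so does that of a uniform density with matched variance (a uniform density of width $ w $ has variance $ w^2/12 $ and entropy $ \ln w $), which makes the maximum-entropy claim easy to spot-check. A short sketch, with the variance fixed at 1 purely for illustration:

```python
import numpy as np

sigma2 = 1.0                               # common variance for both densities

# Differential entropy of N(0, sigma^2): 0.5 * ln(2 * pi * e * sigma^2) nats.
h_gauss = 0.5 * np.log(2 * np.pi * np.e * sigma2)

# Uniform on an interval of width w: variance w**2 / 12, entropy ln(w).
w = np.sqrt(12 * sigma2)                   # width chosen so the variances match
h_unif = np.log(w)

print(h_gauss)                             # about 1.419 nats
print(h_unif)                              # about 1.242 nats, smaller as the theorem predicts
```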
Relation to Cross-Entropy and KL Divergence
If $ p $ is a true distribution and $ q $ is a model distribution, the cross-entropy is
$ {\displaystyle H(p, q) = -\sum_{x} p(x) \log q(x) = H(p) + D_{\mathrm{KL}}(p \,\|\, q),} $
where $ D_{\mathrm{KL}}(p \,\|\, q) $ is the Kullback-Leibler divergence. Because $ H(p) $ does not depend on $ q $, minimizing the cross-entropy with respect to the model parameters is equivalent to minimizing $ D_{\mathrm{KL}}(p \,\|\, q) $, the KL divergence of the data distribution from the model. This identity is the bridge between entropy and the Cross-Entropy Loss used to train probabilistic classifiers, language models, and many other modern systems.
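The decomposition can be checked directly; the distributions $ p $ and $ q $ below are invented for illustration:

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])              # "true" distribution over three classes
q = np.array([0.5, 0.3, 0.2])              # model distribution

H_p = -np.sum(p * np.log(p))               # entropy H(p), in nats
cross = -np.sum(p * np.log(q))             # cross-entropy H(p, q)
kl = np.sum(p * np.log(p / q))             # Kullback-Leibler divergence D_KL(p || q)

print(np.isclose(cross, H_p + kl))         # True: H(p, q) = H(p) + D_KL(p || q)
print(bool(kl >= 0))                       # True: the excess over H(p) is non-negative
```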
Applications in Machine Learning
Entropy and its derivatives appear throughout the field:
- Decision trees use the information gain $ I(Y; X_j) = H(Y) - H(Y \mid X_j) $ to choose splits that reduce label uncertainty most effectively. ID3 and C4.5 are built on this criterion; CART variants also support Gini impurity, a closely related concave measure. A small numerical illustration of information gain follows this list.
- Classification losses minimize the cross-entropy between the empirical label distribution and the model's predictions, equivalently the negative log-likelihood, providing well-calibrated gradients for softmax and sigmoid outputs.
- Reinforcement learning adds an entropy bonus to the policy objective in algorithms such as soft actor-critic and entropy-regularized policy gradients, encouraging stochastic exploration and preventing premature convergence to deterministic policies.
- Variational inference and the evidence lower bound decompose into a reconstruction term and an entropy or KL term, enabling latent-variable models such as variational autoencoders to be trained by maximum-likelihood-like objectives.
- Maximum-entropy modelling selects the distribution with the largest entropy consistent with observed constraints, a principle that recovers softmax classifiers, exponential-family models, and the Boltzmann-Gibbs distributions of statistical mechanics from a single information-theoretic axiom.
- Active learning and Bayesian experimental design rank candidate queries by their expected information gain, choosing the inputs that most reduce posterior uncertainty.
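As an illustration of the decision-tree criterion in the first item, the following sketch computes the information gain of a single hypothetical binary feature directly from the entropy definition (the labels and feature values are invented for the example):

```python
import numpy as np
from collections import Counter

def entropy(labels):
    """Empirical entropy in bits of a sequence of class labels."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

# Hypothetical binary labels and one binary feature considered as a split.
y = np.array([1, 1, 1, 1, 0, 0, 0, 0, 1, 0])
feature = np.array([1, 1, 1, 0, 0, 0, 0, 1, 1, 0])

H_y = entropy(y)
# H(Y | X_j): entropy of the labels within each branch, weighted by branch size.
H_y_given_x = sum(
    np.mean(feature == v) * entropy(y[feature == v])
    for v in np.unique(feature)
)
gain = H_y - H_y_given_x                   # information gain of splitting on this feature
print(round(float(gain), 3))               # about 0.278 bits for this toy example
```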
Estimation from Data
Estimating entropy from finite samples is non-trivial. The naive plug-in estimator that substitutes empirical frequencies into the entropy formula is biased downward, especially when the alphabet is large or the distribution has a long tail. Bias-corrected estimators such as Miller-Madow, jackknife, and the NSB estimator reduce this bias under various assumptions. For continuous variables, k-nearest-neighbour estimators (Kozachenko-Leonenko) and kernel density estimators are common; mutual information is often estimated using neural network bounds such as MINE in high-dimensional settings.
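The sketch below illustrates the downward bias on a deliberately undersampled uniform source and applies the Miller-Madow correction $ \hat{H} + (\hat{K} - 1)/(2N) $, where $ \hat{K} $ is the number of symbols actually observed; the alphabet size and sample size are chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def plugin_entropy(samples, k):
    """Plug-in (maximum-likelihood) entropy estimate, in nats, over an alphabet of size k."""
    counts = np.bincount(samples, minlength=k).astype(float)
    p = counts / counts.sum()
    p = p[p > 0]
    return -np.sum(p * np.log(p))

k, n = 100, 200                            # large alphabet, modest sample size
p_true = np.ones(k) / k                    # uniform source with true entropy log(k)

plugin, corrected = [], []
for _ in range(500):
    x = rng.choice(k, size=n, p=p_true)
    h = plugin_entropy(x, k)
    k_obs = np.count_nonzero(np.bincount(x, minlength=k))   # symbols actually seen
    plugin.append(h)
    corrected.append(h + (k_obs - 1) / (2 * n))             # Miller-Madow correction

print(np.log(k))                           # true entropy, about 4.605 nats
print(np.mean(plugin))                     # noticeably lower: the plug-in bias is downward
print(np.mean(corrected))                  # closer to the truth after the correction
```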
Limitations and Caveats
Entropy summarizes a distribution into a single scalar and therefore discards structural detail: two very different distributions can share the same entropy. It assumes that probabilities are well defined and well estimated, which can fail under distribution shift or in the small-sample regime. Differential entropy is not invariant under reparameterization, and information-theoretic quantities estimated from neural network representations are notoriously sensitive to architecture and binning choices. Finally, entropy is a property of distributions, not of individual outcomes; statements such as "this image has high entropy" require a reference distribution to be meaningful.
References
- Shannon, C. E. (1948). A Mathematical Theory of Communication. Bell System Technical Journal, 27, 379-423, 623-656.
- Cover, T. M. and Thomas, J. A. (2006). Elements of Information Theory (2nd ed.). Wiley-Interscience.
- MacKay, D. J. C. (2003). Information Theory, Inference, and Learning Algorithms. Cambridge University Press.
- Goodfellow, I., Bengio, Y. and Courville, A. (2016). Deep Learning. MIT Press, Chapter 3.
- Jaynes, E. T. (1957). Information Theory and Statistical Mechanics. Physical Review, 106(4), 620-630.
- Belghazi, M. I. et al. (2018). Mutual Information Neural Estimation. ICML.