Categorical Cross-Entropy
| Article | |
|---|---|
| Topic area | Loss Functions |
| Prerequisites | Softmax Function, Cross-Entropy Loss, KL Divergence |
Overview
Categorical cross-entropy is the loss function used to train classifiers that must assign each input to exactly one of several mutually exclusive categories. It measures the discrepancy between a predicted probability distribution over classes and the true distribution, which for ordinary supervised learning is concentrated on a single correct label. Combined with a softmax output layer, it is the standard objective for multi-class classification in modern neural networks, including image classifiers, language models, and most pretrained transformers.
The loss has two complementary interpretations. From the information-theoretic side, it equals the average number of bits (or nats) needed to encode the true labels using a code that is optimal for the predicted distribution; minimizing it pushes the predicted distribution toward the true one. From the statistical side, it is the negative log-likelihood of the observed labels under a categorical model, so minimizing categorical cross-entropy is exactly maximum likelihood estimation. These two views explain why the same formula appears across information theory, statistics, and deep learning, and why it has well-behaved gradients that make optimization with stochastic gradient descent practical.
Intuition
Imagine a model that, given a photograph, must decide whether the image shows a cat, a dog, or a bird. The model emits a probability for each class, such as 0.7 for cat, 0.2 for dog, and 0.1 for bird. If the true answer is "cat", a good loss function should reward the high probability assigned to that class and ignore the others. Categorical cross-entropy does exactly this: it takes the negative logarithm of the probability assigned to the correct class. Probabilities close to one yield small losses, while probabilities close to zero yield arbitrarily large losses, so the model is strongly penalized for being confidently wrong.
The asymmetry between confident-correct and confident-wrong predictions is the loss's defining feature. A prediction of 0.99 for the true class produces a loss of about 0.01, whereas a prediction of 0.01 for the true class produces a loss of about 4.6. This logarithmic scaling means the gradient grows large precisely when the model is far off, providing a strong corrective signal early in training and a gentle one near convergence. The behavior contrasts sharply with squared error, whose gradient vanishes for confident predictions and which can saturate when paired with a softmax output, slowing learning.
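These numbers are easy to verify; a minimal sketch in plain Python, using the hypothetical class probabilities from the example above:

```python
import math

# Loss is the negative log of the probability assigned to the true class.
for p_true in (0.99, 0.7, 0.2, 0.01):
    print(f"p(true class) = {p_true:.2f}  ->  loss = {-math.log(p_true):.3f}")

# p(true class) = 0.99  ->  loss = 0.010
# p(true class) = 0.70  ->  loss = 0.357
# p(true class) = 0.20  ->  loss = 1.609
# p(true class) = 0.01  ->  loss = 4.605
```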
Formulation
Let there be $ K $ classes. For a single example with one-hot true label $ y \in \{0,1\}^K $ (so exactly one entry is 1) and predicted probability vector $ \hat{y} \in [0,1]^K $ with $ \sum_k \hat{y}_k = 1 $, the categorical cross-entropy is
$ {\displaystyle L(y, \hat{y}) = -\sum_{k=1}^{K} y_k \log \hat{y}_k.} $
Because $ y $ is one-hot, the sum collapses to a single term $ -\log \hat{y}_{c} $, where $ c $ is the index of the true class. For a dataset of $ N $ independent examples the empirical risk is the mean over the per-example losses,
$ {\displaystyle \mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N} \log \hat{y}^{(i)}_{c^{(i)}}.} $
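Both formulas translate directly into a few lines of NumPy. The following is a sketch, not a library implementation; the array names and the small clipping constant are illustrative:

```python
import numpy as np

def categorical_cross_entropy(y_onehot, y_hat, eps=1e-12):
    """Mean categorical cross-entropy over a batch.

    y_onehot: (N, K) one-hot true labels
    y_hat:    (N, K) predicted probabilities, each row summing to 1
    """
    # Clipping avoids log(0); the inner sum picks out -log(y_hat[i, c_i]) per row.
    return -np.mean(np.sum(y_onehot * np.log(np.clip(y_hat, eps, 1.0)), axis=1))

y = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])              # true classes: 0 and 1
y_hat = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])          # predicted distributions
print(categorical_cross_entropy(y, y_hat))   # ≈ 0.290
```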
In practice the predicted probabilities come from a softmax applied to logits $ z \in \mathbb{R}^K $:
$ {\displaystyle \hat{y}_k = \frac{\exp(z_k)}{\sum_{j=1}^{K} \exp(z_j)}.} $
Substituting and simplifying yields the loss in terms of logits,
$ {\displaystyle L(y, z) = -z_c + \log \sum_{j=1}^{K} \exp(z_j).} $
The right-hand term is the log-sum-exp of the logits, also known as the cumulant function of the categorical exponential family. This form is what numerical libraries actually compute, because applying softmax and then log separately is unstable for large logits.
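The equivalence between the two forms can be checked numerically; a sketch with illustrative logits (the naive softmax-then-log route is acceptable here only because the logits are small, as discussed in the stability section below):

```python
import numpy as np

z = np.array([2.0, -1.0, 0.5])   # logits for K = 3 classes (illustrative values)
c = 0                            # index of the true class

# Loss written directly in terms of logits: -z_c + log sum_j exp(z_j)
loss_from_logits = -z[c] + np.log(np.exp(z).sum())

# The same quantity via explicit softmax followed by a log
softmax = np.exp(z) / np.exp(z).sum()
loss_from_probs = -np.log(softmax[c])

print(loss_from_logits, loss_from_probs)   # both ≈ 0.2414
```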
Connection to Information Theory and Maximum Likelihood
Categorical cross-entropy is the cross-entropy $ H(p, q) = -\sum_k p_k \log q_k $ evaluated where $ p $ is the empirical (one-hot) label distribution and $ q $ is the model's prediction. It decomposes as $ H(p, q) = H(p) + D_{\text{KL}}(p \,\|\, q) $, where $ H(p) $ is the entropy of the true distribution and $ D_{\text{KL}} $ is the Kullback-Leibler divergence. For one-hot labels $ H(p) = 0 $, so cross-entropy and KL divergence coincide, and minimizing one minimizes the other.
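The decomposition is easy to verify numerically; a sketch with an arbitrary soft target $ p $ (for a one-hot $ p $ the entropy term vanishes and the two quantities coincide, as stated above):

```python
import numpy as np

p = np.array([0.6, 0.3, 0.1])    # a soft "true" distribution (illustrative)
q = np.array([0.5, 0.25, 0.25])  # model prediction (illustrative)

cross_entropy = -np.sum(p * np.log(q))
entropy       = -np.sum(p * np.log(p))
kl_divergence =  np.sum(p * np.log(p / q))

print(cross_entropy, entropy + kl_divergence)   # equal up to floating-point rounding
```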
The statistical interpretation follows from observing that a softmax classifier defines a categorical likelihood $ P(Y = c \mid x; \theta) = \hat{y}_c $. The negative log-likelihood of the data is
$ {\displaystyle -\sum_{i=1}^{N} \log P(Y = c^{(i)} \mid x^{(i)}; \theta),} $
which, up to the factor $ 1/N $, equals the categorical cross-entropy empirical risk defined above. Training a softmax classifier with this loss is therefore exactly maximum likelihood estimation under a categorical (multinomial) model, a fact that justifies the use of standard frequentist tools, such as the Fisher information matrix, when reasoning about classifier calibration and confidence.
Training and Backpropagation
The combination of softmax and cross-entropy has an unusually clean gradient. With logits $ z $ and one-hot label $ y $,
$ {\displaystyle \frac{\partial L}{\partial z_k} = \hat{y}_k - y_k.} $
The gradient is simply the difference between the predicted probability and the indicator of the true class. There are no exponential or logarithmic factors left over, which is why deep learning frameworks merge softmax and cross-entropy into a single primitive (e.g., nn.CrossEntropyLoss in PyTorch, SparseCategoricalCrossentropy in TensorFlow). Backpropagation then proceeds normally through the rest of the network using this residual-shaped error signal.
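The identity can be confirmed with automatic differentiation; a sketch using the fused PyTorch primitive mentioned above, with illustrative logit values:

```python
import torch
import torch.nn.functional as F

z = torch.tensor([2.0, -1.0, 0.5], requires_grad=True)  # logits (illustrative)
c = torch.tensor(0)                                      # true class index

# F.cross_entropy fuses softmax and cross-entropy into one numerically stable op.
loss = F.cross_entropy(z.unsqueeze(0), c.unsqueeze(0))
loss.backward()

softmax = torch.softmax(z.detach(), dim=0)
y_onehot = torch.zeros(3)
y_onehot[c] = 1.0

print(z.grad)               # gradient computed by autograd
print(softmax - y_onehot)   # y_hat - y: the same vector
```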
For numerical stability, libraries compute the loss directly from logits using the log-sum-exp trick, subtracting $ \max_k z_k $ before exponentiation:
$ {\displaystyle \log \sum_j \exp(z_j) = m + \log \sum_j \exp(z_j - m), \quad m = \max_k z_k.} $
This avoids overflow when logits are large and underflow when they are very negative. Naive pipelines that apply softmax and then take the logarithm separately can produce NaN values during training, especially in mixed-precision settings.
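A sketch of the difference between the two pipelines, with logits deliberately chosen large enough to overflow double-precision exponentiation:

```python
import numpy as np

z = np.array([1000.0, 995.0, 990.0])   # exp overflows float64 above ~709

# Naive pipeline: exp overflows to inf, softmax becomes nan, log(nan) is nan.
naive_softmax = np.exp(z) / np.exp(z).sum()
print(naive_softmax)                   # [nan nan nan], with overflow warnings

# Shifted log-sum-exp: subtract the maximum logit before exponentiating.
m = z.max()
logsumexp = m + np.log(np.exp(z - m).sum())
loss = -z[0] + logsumexp               # loss when the true class is index 0
print(loss)                            # ≈ 0.0068, computed without overflow
```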
Variants
Several variants address specific practical issues:
- Sparse categorical cross-entropy takes integer class indices instead of one-hot vectors. The mathematics is identical, but the implementation avoids materializing a large one-hot matrix, which matters for vocabularies with hundreds of thousands of tokens in language models.
- Label smoothing replaces the one-hot target with a softened distribution, for example one that puts $ 1-\epsilon $ on the true class and $ \epsilon/(K-1) $ on the others; the formulation popularized by Inception-v3 instead mixes the one-hot target with a uniform distribution, so the true class receives $ 1-\epsilon+\epsilon/K $ and every other class receives $ \epsilon/K $.[1] It prevents the model from producing arbitrarily sharp predictions, improving calibration and often generalization, and it is standard in transformer training (a sketch appears after this list).
- Focal loss multiplies the per-example loss by $ (1 - \hat{y}_c)^\gamma $, down-weighting easy examples to focus optimization on hard ones. It was introduced for dense object detection where most candidate boxes are background.[2]
- Class-weighted cross-entropy multiplies each example's loss by a class-specific weight to compensate for class imbalance, a simple alternative to resampling.
- Temperature-scaled cross-entropy divides logits by a temperature $ T $ before softmax. It does not change the argmax but is widely used for knowledge distillation and post-hoc calibration.
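As referenced in the label-smoothing item above, a minimal sketch of one smoothing scheme and its effect on the loss; the helper names and the value of $ \epsilon $ are illustrative, and the uniform-mixture variant differs only in how the off-class mass is spread:

```python
import numpy as np

def smoothed_targets(class_indices, num_classes, eps):
    """Smoothing scheme from the text: 1 - eps on the true class,
    eps / (K - 1) spread over the remaining classes."""
    targets = np.full((len(class_indices), num_classes), eps / (num_classes - 1))
    targets[np.arange(len(class_indices)), class_indices] = 1.0 - eps
    return targets

def cross_entropy(targets, probs, tiny=1e-12):
    return -np.mean(np.sum(targets * np.log(np.clip(probs, tiny, 1.0)), axis=1))

probs = np.array([[0.7, 0.2, 0.1]])          # predicted distribution (illustrative)
hard = smoothed_targets([0], 3, eps=0.0)     # ordinary one-hot target
soft = smoothed_targets([0], 3, eps=0.1)     # smoothed target [0.9, 0.05, 0.05]

print(cross_entropy(hard, probs))   # ≈ 0.357
print(cross_entropy(soft, probs))   # ≈ 0.517: off-class terms now also contribute
```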
Comparisons with Other Losses
Categorical cross-entropy is one of several losses available for classification, and the right choice depends on the label structure and the output activation:
- Binary cross-entropy is the two-class special case, used with a sigmoid output for single-label binary problems and applied independently per class for multi-label problems where labels are not mutually exclusive (a short multi-label sketch appears below).
- Mean squared error against one-hot targets technically defines a valid loss, but its gradient interacts poorly with softmax saturation, leading to slow training and poorly calibrated probabilities. It is rarely used for classification.
- Hinge loss and its multi-class extensions, used in support vector machines, optimize a margin rather than a likelihood. They yield maximum-margin decision boundaries, and sparse solutions in terms of support vectors in kernel SVMs, but provide no probability estimates, and they are uncommon in modern deep learning.
- Triplet, contrastive, and InfoNCE losses optimize relative similarity rather than absolute class probabilities, and are preferred for representation learning and retrieval.
For standard classification with a fixed label set and softmax outputs, categorical cross-entropy is the default, and the alternatives have largely been displaced.
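To make the multi-label contrast from the first item concrete, a sketch of independent per-class binary cross-entropy; the arrays are illustrative and the rows of the probability matrix need not sum to one:

```python
import numpy as np

def multilabel_binary_cross_entropy(y, p, eps=1e-12):
    """Independent per-class binary cross-entropy for multi-label targets.

    y: (N, K) binary indicators (several entries per row may be 1)
    p: (N, K) per-class sigmoid probabilities (rows need not sum to 1)
    """
    p = np.clip(p, eps, 1.0 - eps)
    return -np.mean(np.sum(y * np.log(p) + (1 - y) * np.log(1 - p), axis=1))

y = np.array([[1.0, 0.0, 1.0]])    # one example carrying two labels at once
p = np.array([[0.8, 0.3, 0.6]])    # independent per-class probabilities
print(multilabel_binary_cross_entropy(y, p))   # ≈ 1.09
```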
Practical Considerations and Limitations
Despite its dominance, categorical cross-entropy has known limitations. Models trained with it are often overconfident: probabilities estimated by deep networks tend to be poorly calibrated, with confidence systematically higher than empirical accuracy.[3] Temperature scaling, label smoothing, and focal loss all partially address this. The loss is also sensitive to label noise: a single mislabeled example can produce an unbounded gradient because $ -\log \hat{y}_c \to \infty $ as $ \hat{y}_c \to 0 $. Robust alternatives such as the generalized cross-entropy and symmetric cross-entropy trade off some accuracy on clean data for resilience to noisy labels.
A more subtle issue is that categorical cross-entropy treats all classes as exchangeable: misclassifying a husky as a wolf incurs the same loss as misclassifying it as a banana. When the label space has structure (hierarchical categories, semantic embeddings, ordinal levels), specialized losses or auxiliary objectives can encode that structure and outperform the plain cross-entropy. Finally, in extreme classification with millions of classes, the normalizing log-sum-exp becomes a computational bottleneck, motivating sampled approximations such as sampled softmax, noise-contrastive estimation, and hierarchical softmax.