Cross-Entropy Loss

    From Marovi AI

    Revision as of 02:09, 27 April 2026

    Topic area: Machine Learning
    Difficulty: Intermediate
    Prerequisites: Loss Functions, Softmax Function

    Cross-entropy loss (also called log loss) is the most widely used loss function for classification tasks in machine learning. Rooted in information theory, it measures the dissimilarity between the true label distribution and the model's predicted probability distribution, providing a smooth, differentiable objective that drives probabilistic classifiers toward confident, correct predictions.

    Information-Theoretic Foundations

    Entropy

    The entropy of a discrete probability distribution $ p $ quantifies its uncertainty:

    $ H(p) = -\sum_{k=1}^{K} p_k \log p_k $

    For a deterministic distribution (one-hot label), $ H(p) = 0 $. Entropy is maximized when all outcomes are equally likely.
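    Both limiting cases can be checked directly. A minimal plain-Python sketch of the entropy formula (natural log, with the convention $ 0 \log 0 = 0 $):

```python
import math

def entropy(p):
    """Shannon entropy H(p) = -sum_k p_k log p_k, with 0 log 0 = 0."""
    return sum(-pk * math.log(pk) for pk in p if pk > 0)

# Deterministic (one-hot) distribution: zero uncertainty.
print(entropy([1.0, 0.0, 0.0]))   # 0.0

# Uniform distribution over K = 4 outcomes: maximal entropy, log 4.
print(entropy([0.25] * 4))        # ≈ 1.3863 = log 4
```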

    KL Divergence

    The Kullback-Leibler divergence measures how one distribution $ q $ differs from a reference distribution $ p $:

    $ D_{\mathrm{KL}}(p \,\|\, q) = \sum_{k=1}^{K} p_k \log \frac{p_k}{q_k} $

    KL divergence is non-negative and equals zero if and only if $ p = q $.
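    The definition translates directly to code; the sketch below (plain Python, illustrative only) skips terms where $ p_k = 0 $, since they contribute nothing:

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) = sum_k p_k log(p_k / q_k); terms with p_k = 0 contribute 0."""
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q) if pk > 0)

p = [0.7, 0.2, 0.1]
print(kl_divergence(p, p))                     # 0.0 — identical distributions
print(kl_divergence(p, [1/3, 1/3, 1/3]))       # > 0 — q differs from p
```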

    Cross-Entropy

    The cross-entropy between distributions $ p $ (true) and $ q $ (predicted) is:

    $ H(p, q) = -\sum_{k=1}^{K} p_k \log q_k = H(p) + D_{\mathrm{KL}}(p \,\|\, q) $

    Since $ H(p) $ is constant with respect to model parameters, minimizing cross-entropy is equivalent to minimizing KL divergence — i.e., making the predicted distribution $ q $ as close to the true distribution $ p $ as possible.
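    The decomposition $ H(p, q) = H(p) + D_{\mathrm{KL}}(p \,\|\, q) $ can be verified numerically with a short plain-Python sketch:

```python
import math

def entropy(p):
    return sum(-pk * math.log(pk) for pk in p if pk > 0)

def cross_entropy(p, q):
    return sum(-pk * math.log(qk) for pk, qk in zip(p, q) if pk > 0)

def kl(p, q):
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q) if pk > 0)

p = [0.7, 0.2, 0.1]   # true distribution
q = [0.5, 0.3, 0.2]   # predicted distribution
# H(p, q) = H(p) + D_KL(p || q), up to floating-point rounding:
print(abs(cross_entropy(p, q) - (entropy(p) + kl(p, q))) < 1e-12)   # True
```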

    Binary Cross-Entropy

    For binary classification with true label $ y \in \{0, 1\} $ and predicted probability $ \hat{y} = \sigma(z) $ (where $ \sigma $ is the sigmoid function):

    $ \mathcal{L}_{\mathrm{BCE}} = -\bigl[y \log \hat{y} + (1 - y) \log(1 - \hat{y})\bigr] $

    Over a dataset of $ N $ samples:

    $ \mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \bigl[y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i)\bigr] $

    The gradient with respect to the logit $ z $ takes the elegantly simple form $ \hat{y} - y $, which is both intuitive and computationally efficient.
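    The gradient identity $ \partial\mathcal{L}/\partial z = \hat{y} - y $ can be confirmed with a central finite difference; a minimal plain-Python check:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce(y, y_hat):
    """Binary cross-entropy for a single example."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

y, z, eps = 1.0, 0.3, 1e-6
# Numerical gradient of the loss with respect to the logit z:
grad_numeric = (bce(y, sigmoid(z + eps)) - bce(y, sigmoid(z - eps))) / (2 * eps)
grad_analytic = sigmoid(z) - y
print(grad_numeric, grad_analytic)   # the two agree to high precision
```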

    Categorical Cross-Entropy

    For multi-class classification with $ K $ classes, the true label is typically a one-hot vector $ \mathbf{y} $ with $ y_c = 1 $ for the correct class $ c $. The predicted probabilities $ \hat{\mathbf{y}} $ are obtained via the Softmax Function:

    $ \mathcal{L}_{\mathrm{CE}} = -\sum_{k=1}^{K} y_k \log \hat{y}_k = -\log \hat{y}_c $

    This reduces to the negative log-probability of the correct class, which is why categorical cross-entropy is also called negative log-likelihood in this context.
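    The collapse of the full sum to $ -\log \hat{y}_c $ is easy to see in code; a small plain-Python sketch:

```python
import math

def categorical_ce(y_onehot, y_hat):
    """Full sum form: -sum_k y_k log y_hat_k."""
    return sum(-yk * math.log(yhk) for yk, yhk in zip(y_onehot, y_hat) if yk > 0)

y = [0, 1, 0]               # one-hot target; correct class c = 1
y_hat = [0.2, 0.7, 0.1]     # softmax output
print(categorical_ce(y, y_hat))   # ≈ 0.357
print(-math.log(y_hat[1]))        # same value: -log of the correct class
```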

    Numerical Stability

    The Log-Sum-Exp Trick

    Naively computing $ \log(\mathrm{softmax}(\mathbf{z})_k) $ involves exponentiating potentially large logits, causing overflow. The log-sum-exp trick avoids this:

    $ \log \hat{y}_k = z_k - \log \sum_{j=1}^{K} e^{z_j} = z_k - \left(m + \log \sum_{j=1}^{K} e^{z_j - m}\right) $

    where $ m = \max_j z_j $. Subtracting the maximum logit ensures the largest exponent is zero, preventing overflow. All major deep learning frameworks implement this fused operation (e.g., PyTorch's CrossEntropyLoss accepts raw logits).
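    A minimal plain-Python implementation of the stable formula (illustrative; frameworks fuse this into the loss itself):

```python
import math

def log_softmax(z):
    """Numerically stable log-softmax via the log-sum-exp trick."""
    m = max(z)                                            # m = max_j z_j
    lse = m + math.log(sum(math.exp(zj - m) for zj in z)) # log sum_j e^{z_j}
    return [zk - lse for zk in z]

# Logits of ~1000 would overflow math.exp() naively; the shifted version is fine.
z = [1000.0, 999.0, 998.0]
print(log_softmax(z))   # ≈ [-0.408, -1.408, -2.408]
```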

    Clamping

    Predicted probabilities should be clamped away from exactly 0 and 1 to avoid $ \log(0) = -\infty $. A small epsilon (e.g., $ 10^{-7} $) is typically used.
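    One common way to apply the clamp (a sketch, not any particular framework's implementation):

```python
import math

EPS = 1e-7

def safe_log_prob(p, eps=EPS):
    """Clamp p into [eps, 1 - eps] before taking the log."""
    return math.log(min(max(p, eps), 1.0 - eps))

print(safe_log_prob(0.0))   # ≈ -16.12, i.e. log(1e-7), instead of -inf
```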

    Label Smoothing

    Label smoothing (Szegedy et al., 2016) replaces the hard one-hot target with a soft distribution:

    $ y_k^{\mathrm{smooth}} = (1 - \alpha)\, y_k + \frac{\alpha}{K} $

    where $ \alpha $ is a small constant (commonly 0.1). This prevents the model from becoming overconfident, improves calibration, and often yields better generalization. It is standard practice in training large image classifiers and Transformer models.
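    The smoothing formula is a one-liner; a plain-Python sketch with the common $ \alpha = 0.1 $:

```python
def smooth_labels(y_onehot, alpha=0.1):
    """y_k^smooth = (1 - alpha) * y_k + alpha / K."""
    K = len(y_onehot)
    return [(1 - alpha) * yk + alpha / K for yk in y_onehot]

# K = 4 classes: the correct class gets 0.925, the rest 0.025 each;
# the smoothed target still sums to 1.
print(smooth_labels([0, 1, 0, 0]))
```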

    Comparison with Other Losses

    Loss                 Formula                                        Typical use
    Cross-entropy        $ -\sum_k y_k \log \hat{y}_k $                 Classification
    Mean squared error   $ \frac{1}{K}\sum_k (y_k - \hat{y}_k)^2 $      Regression (poor for classification)
    Hinge loss           $ \max(0, 1 - y \cdot z) $                     SVM-style classification
    Focal loss           $ -(1-\hat{y}_c)^\gamma \log \hat{y}_c $       Imbalanced classification

    Cross-entropy has steeper gradients than MSE when the prediction is confidently wrong, leading to faster correction of large errors.
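    The gradient gap is easy to quantify for a single positive example ($ y = 1 $): with respect to $ \hat{y} $, cross-entropy gives $ -1/\hat{y} $ while squared error gives $ -2(1 - \hat{y}) $. A small sketch:

```python
# Confidently wrong prediction for a positive example (y = 1):
y_hat = 0.01

grad_ce = -1.0 / y_hat            # d/dy_hat of -log(y_hat)      → ≈ -100
grad_mse = -2.0 * (1.0 - y_hat)   # d/dy_hat of (1 - y_hat)^2    → ≈ -1.98

# Cross-entropy pushes back roughly 50x harder on this error.
print(abs(grad_ce) / abs(grad_mse))
```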

    See also

    • Neural Networks

    References

    • Shannon, C. E. (1948). "A Mathematical Theory of Communication". Bell System Technical Journal.
    • Goodfellow, I., Bengio, Y. and Courville, A. (2016). Deep Learning. MIT Press, Chapter 6.
    • Szegedy, C. et al. (2016). "Rethinking the Inception Architecture for Computer Vision". CVPR.
    • Lin, T.-Y. et al. (2017). "Focal Loss for Dense Object Detection". ICCV.