Cross-Entropy Loss
Latest revision as of 03:11, 28 April 2026
| Article | |
|---|---|
| Topic area | Machine Learning |
| Prerequisites | Loss Functions, Softmax Function |
Cross-entropy loss (also called log loss) is the most widely used loss function for classification tasks in machine learning. Rooted in information theory, it measures the dissimilarity between the true label distribution and the model's predicted probability distribution, providing a smooth, differentiable objective that drives probabilistic classifiers toward confident, correct predictions.
Information-Theoretic Foundations
Entropy
The entropy of a discrete probability distribution $ p $ quantifies its uncertainty:
- $ H(p) = -\sum_{k=1}^{K} p_k \log p_k $
For a deterministic distribution (one-hot label), $ H(p) = 0 $. Entropy is maximized when all outcomes are equally likely.
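Both limiting cases can be checked directly. The sketch below uses an illustrative helper named `entropy` (not from any library) with the natural logarithm, and adopts the usual convention $ 0 \log 0 := 0 $:

```python
import math

def entropy(p):
    """Shannon entropy H(p) = -sum p_k log p_k (natural log; 0*log 0 := 0)."""
    return -sum(pk * math.log(pk) for pk in p if pk > 0)

uniform = [0.25, 0.25, 0.25, 0.25]  # maximal uncertainty for K = 4
one_hot = [1.0, 0.0, 0.0, 0.0]      # deterministic: zero entropy

print(entropy(uniform))  # = log(4) ≈ 1.386, the maximum for K = 4
print(entropy(one_hot))  # = 0
```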
KL Divergence
The Kullback-Leibler divergence measures how one distribution $ q $ differs from a reference distribution $ p $:
- $ D_{\mathrm{KL}}(p \,\|\, q) = \sum_{k=1}^{K} p_k \log \frac{p_k}{q_k} $
KL divergence is non-negative and equals zero if and only if $ p = q $.
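A minimal sketch of the definition (the helper name `kl_divergence` is illustrative). Note that KL divergence is not symmetric, so the argument order matters:

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) = sum p_k log(p_k / q_k); assumes q_k > 0 wherever p_k > 0."""
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q) if pk > 0)

p = [0.7, 0.2, 0.1]
print(kl_divergence(p, p))                # 0.0 for identical distributions
print(kl_divergence(p, [0.1, 0.2, 0.7]))  # positive, and != KL(q || p)
```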
Cross-Entropy
The cross-entropy between distributions $ p $ (true) and $ q $ (predicted) is:
- $ H(p, q) = -\sum_{k=1}^{K} p_k \log q_k = H(p) + D_{\mathrm{KL}}(p \,\|\, q) $
Since $ H(p) $ is constant with respect to model parameters, minimizing cross-entropy is equivalent to minimizing KL divergence — i.e., making the predicted distribution $ q $ as close to the true distribution $ p $ as possible.
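The decomposition $ H(p, q) = H(p) + D_{\mathrm{KL}}(p \,\|\, q) $ can be verified numerically; all three helpers below are illustrative names, not library functions:

```python
import math

def entropy(p):
    return -sum(pk * math.log(pk) for pk in p if pk > 0)

def cross_entropy(p, q):
    return -sum(pk * math.log(qk) for pk, qk in zip(p, q) if pk > 0)

def kl(p, q):
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q) if pk > 0)

p = [0.6, 0.3, 0.1]  # "true" distribution
q = [0.5, 0.3, 0.2]  # model's prediction

# H(p, q) = H(p) + D_KL(p || q), up to floating-point precision
assert abs(cross_entropy(p, q) - (entropy(p) + kl(p, q))) < 1e-12
```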
Binary Cross-Entropy
For binary classification with true label $ y \in \{0, 1\} $ and predicted probability $ \hat{y} = \sigma(z) $ (where $ \sigma $ is the sigmoid function):
- $ \mathcal{L}_{\mathrm{BCE}} = -\bigl[y \log \hat{y} + (1 - y) \log(1 - \hat{y})\bigr] $
Over a dataset of $ N $ samples:
- $ \mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \bigl[y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i)\bigr] $
The gradient with respect to the logit $ z $ takes the elegantly simple form $ \hat{y} - y $, which is both intuitive and computationally efficient.
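The claimed gradient $ \hat{y} - y $ can be confirmed with a finite-difference check through the sigmoid (helper names are illustrative):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce(y, y_hat):
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

# Check dL/dz = y_hat - y against a central finite difference in the logit z.
z, y, eps = 0.8, 1.0, 1e-6
numeric = (bce(y, sigmoid(z + eps)) - bce(y, sigmoid(z - eps))) / (2 * eps)
analytic = sigmoid(z) - y
assert abs(numeric - analytic) < 1e-5
```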
Categorical Cross-Entropy
For multi-class classification with $ K $ classes, the true label is typically a one-hot vector $ \mathbf{y} $ with $ y_c = 1 $ for the correct class $ c $. The predicted probabilities $ \hat{\mathbf{y}} $ are obtained via the Softmax Function:
- $ \mathcal{L}_{\mathrm{CE}} = -\sum_{k=1}^{K} y_k \log \hat{y}_k = -\log \hat{y}_c $
This reduces to the negative log-probability of the correct class, which is why categorical cross-entropy is also called negative log-likelihood in this context.
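With a one-hot target the full sum collapses to a single term, so the loss is just an indexed lookup plus a log (the helper name is illustrative):

```python
import math

def categorical_ce(probs, correct_class):
    """With a one-hot target, CE reduces to -log of the correct class's probability."""
    return -math.log(probs[correct_class])

probs = [0.1, 0.7, 0.2]  # softmax output over K = 3 classes
print(categorical_ce(probs, 1))  # -log 0.7 ≈ 0.357
print(categorical_ce(probs, 0))  # -log 0.1 ≈ 2.303: low correct-class probability is punished hard
```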
Numerical Stability
The Log-Sum-Exp Trick
Naively computing $ \log(\mathrm{softmax}(z_k)) $ involves exponentiating potentially large logits, causing overflow. The log-sum-exp trick avoids this:
- $ \log \hat{y}_k = z_k - \log \sum_{j=1}^{K} e^{z_j} = z_k - \left(m + \log \sum_{j=1}^{K} e^{z_j - m}\right) $
where $ m = \max_j z_j $. Subtracting the maximum logit ensures the largest exponent is zero, preventing overflow. All major deep learning frameworks implement this fused operation (e.g., PyTorch's CrossEntropyLoss accepts raw logits).
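A stdlib-only sketch of the trick (framework implementations fuse this with the loss itself and are vectorized, but the arithmetic is the same):

```python
import math

def log_softmax(z):
    """Numerically stable log-softmax via the log-sum-exp trick."""
    m = max(z)  # subtract the max logit so the largest exponent is exp(0) = 1
    lse = m + math.log(sum(math.exp(zj - m) for zj in z))
    return [zk - lse for zk in z]

# Logits this large would overflow a naive exp(); the shifted version is fine.
z = [1000.0, 1001.0, 1002.0]
print(log_softmax(z))  # ≈ [-2.408, -1.408, -0.408]
```

Note that the result depends only on the differences between logits, which is why shifting by $ m $ changes nothing mathematically while fixing the overflow.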
Clamping
Predicted probabilities should be clamped away from exactly 0 and 1 to avoid $ \log(0) = -\infty $. A small epsilon (e.g., $ 10^{-7} $) is typically used.
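A minimal sketch of the clamp; the epsilon value and helper name are illustrative (frameworks pick their own defaults):

```python
import math

EPS = 1e-7  # typical clamp value; exact choice varies by framework

def safe_log_prob(p, eps=EPS):
    """Clamp p into [eps, 1 - eps] before taking the log."""
    return math.log(min(max(p, eps), 1.0 - eps))

print(safe_log_prob(0.0))  # log(1e-7) ≈ -16.12, instead of -inf
```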
Label Smoothing
Label smoothing (Szegedy et al., 2016) replaces the hard one-hot target with a soft distribution:
- $ y_k^{\mathrm{smooth}} = (1 - \alpha)\, y_k + \frac{\alpha}{K} $
where $ \alpha $ is a small constant (commonly 0.1). This prevents the model from becoming overconfident, improves calibration, and often yields better generalization. It is standard practice in training large image classifiers and transformer models.
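The smoothing formula is a one-line elementwise transform; note the result still sums to 1, so it remains a valid target distribution (helper name illustrative):

```python
def smooth_labels(one_hot, alpha=0.1):
    """y_smooth = (1 - alpha) * y + alpha / K, applied elementwise."""
    K = len(one_hot)
    return [(1 - alpha) * yk + alpha / K for yk in one_hot]

y = [0.0, 1.0, 0.0, 0.0]  # hard one-hot target, K = 4
print(smooth_labels(y))   # correct class gets 0.925, the rest 0.025 each
```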
Comparison with Other Losses
| Loss | Formula | Typical use |
|---|---|---|
| Cross-entropy | $ -\sum y_k \log \hat{y}_k $ | Classification |
| Mean squared error | $ \frac{1}{K}\sum(y_k - \hat{y}_k)^2 $ | Regression (poor for classification) |
| Hinge loss | $ \max(0, 1 - y \cdot z) $, with $ y \in \{-1, +1\} $ | SVM-style classification |
| Focal loss | $ -(1-\hat{y}_c)^\gamma \log \hat{y}_c $ | Imbalanced classification |
Cross-entropy has steeper gradients than MSE when the prediction is confidently wrong, leading to faster correction of large errors.
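The contrast is sharpest in the binary case: differentiating MSE through the sigmoid picks up a factor $ \hat{y}(1 - \hat{y}) $ that vanishes as $ \hat{y} \to 0 $ or $ 1 $, so the gradient plateaus exactly when the model is most confidently wrong. A small numerical sketch (the variable names are illustrative):

```python
# Binary case, true label y = 1, confidently wrong prediction y_hat ≈ 0.
y, y_hat = 1.0, 0.01

grad_ce = y_hat - y           # dL_CE/dz through the sigmoid
grad_mse_p = 2 * (y_hat - y)  # dL_MSE/dy_hat
# Through the sigmoid, MSE's gradient is further scaled by y_hat * (1 - y_hat),
# which vanishes near 0 and 1.
grad_mse_z = grad_mse_p * y_hat * (1 - y_hat)

print(grad_ce)     # -0.99: large corrective signal
print(grad_mse_z)  # ≈ -0.02: tiny, despite the prediction being badly wrong
```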
See also

- Loss Functions
- Softmax Function
- Neural Networks
References
- Shannon, C. E. (1948). "A Mathematical Theory of Communication". Bell System Technical Journal.
- Goodfellow, I., Bengio, Y. and Courville, A. (2016). Deep Learning. MIT Press, Chapter 6.
- Szegedy, C. et al. (2016). "Rethinking the Inception Architecture for Computer Vision". CVPR.
- Lin, T.-Y. et al. (2017). "Focal Loss for Dense Object Detection". ICCV.