Cross-Entropy Loss

    Topic area: Machine Learning
    Difficulty: Intermediate
    Prerequisites: Loss Functions, Softmax Function

    Cross-entropy loss (also known as log loss) is the most widely used loss function for classification tasks in machine learning. Rooted in information theory, it measures the discrepancy between the true label distribution and the model's predicted probability distribution, providing a smooth, differentiable objective that drives a probabilistic classifier toward confident, correct predictions.

    Information-Theoretic Foundations

    Entropy

    The entropy of a discrete probability distribution $ p $ quantifies its uncertainty:

    $ H(p) = -\sum_{k=1}^{K} p_k \log p_k $

    For a deterministic distribution (a one-hot label), $ H(p) = 0 $. Entropy is maximized when all outcomes are equally likely.
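
    Both limiting cases are easy to verify numerically. The NumPy sketch below is purely illustrative (the entropy helper is not from any particular library) and treats $ 0 \log 0 $ as zero:

        import numpy as np

        def entropy(p):
            """Shannon entropy in nats; terms with p_k = 0 contribute nothing."""
            p = np.asarray(p, dtype=float)
            nz = p > 0
            return -np.sum(p[nz] * np.log(p[nz]))

        print(entropy([1.0, 0.0, 0.0]))           # deterministic (one-hot): 0.0
        print(entropy([0.25, 0.25, 0.25, 0.25]))  # uniform over 4 classes: log(4) ≈ 1.386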

    KL Divergence

    The Kullback-Leibler divergence measures how much a distribution $ q $ differs from a reference distribution $ p $:

    $ D_{\mathrm{KL}}(p \,\|\, q) = \sum_{k=1}^{K} p_k \log \frac{p_k}{q_k} $

    KL divergence is non-negative and equals zero if and only if $ p = q $.
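
    Both properties can be checked with a few lines of NumPy; the helper below is an illustrative sketch that assumes $ q_k > 0 $ wherever $ p_k > 0 $:

        import numpy as np

        def kl_divergence(p, q):
            """D_KL(p || q) for discrete distributions."""
            p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
            nz = p > 0
            return np.sum(p[nz] * np.log(p[nz] / q[nz]))

        p = np.array([0.7, 0.2, 0.1])
        print(kl_divergence(p, p))                  # 0.0 when q equals p
        print(kl_divergence(p, [0.4, 0.4, 0.2]))    # strictly positive otherwise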

    Cross-Entropy

    The cross-entropy between distributions $ p $ (true) and $ q $ (predicted) is:

    $ H(p, q) = -\sum_{k=1}^{K} p_k \log q_k = H(p) + D_{\mathrm{KL}}(p \,\|\, q) $

    Since $ H(p) $ is constant with respect to model parameters, minimizing cross-entropy is equivalent to minimizing KL divergence — i.e., making the predicted distribution $ q $ as close to the true distribution $ p $ as possible.
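
    The decomposition can be verified directly; the distributions below are arbitrary and serve only as a numerical check:

        import numpy as np

        p = np.array([0.7, 0.2, 0.1])   # "true" distribution
        q = np.array([0.5, 0.3, 0.2])   # "predicted" distribution

        H_p  = -np.sum(p * np.log(p))    # entropy H(p)
        KL   =  np.sum(p * np.log(p / q))  # D_KL(p || q)
        H_pq = -np.sum(p * np.log(q))    # cross-entropy H(p, q)

        print(np.isclose(H_pq, H_p + KL))  # True: H(p, q) = H(p) + D_KL(p || q)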

    Binary Cross-Entropy

    For binary classification with true label $ y \in \{0, 1\} $ and predicted probability $ \hat{y} = \sigma(z) $ (where $ \sigma $ is the sigmoid function):

    $ \mathcal{L}_{\mathrm{BCE}} = -\bigl[y \log \hat{y} + (1 - y) \log(1 - \hat{y})\bigr] $

    Over a dataset of $ N $ samples:

    $ \mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \bigl[y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i)\bigr] $

    The gradient with respect to the logit $ z $ takes the elegantly simple form $ \hat{y} - y $, which is both intuitive and computationally efficient.
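
    A finite-difference check confirms the $ \hat{y} - y $ form; the snippet below is a minimal sketch, not a production implementation:

        import numpy as np

        def sigmoid(z):
            return 1.0 / (1.0 + np.exp(-z))

        def bce(y, z):
            """Binary cross-entropy computed from a raw logit z."""
            y_hat = sigmoid(z)
            return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

        # Analytic gradient dL/dz = y_hat - y, checked against a central difference.
        y, z, eps = 1.0, 0.3, 1e-6
        analytic = sigmoid(z) - y
        numeric  = (bce(y, z + eps) - bce(y, z - eps)) / (2 * eps)
        print(analytic, numeric)   # the two values agree to several decimal places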

    Categorical Cross-Entropy

    For multi-class classification with $ K $ classes, the true label is typically a one-hot vector $ \mathbf{y} $ with $ y_c = 1 $ for the correct class $ c $. The predicted probabilities $ \hat{\mathbf{y}} $ are obtained via the Softmax Function:

    $ \mathcal{L}_{\mathrm{CE}} = -\sum_{k=1}^{K} y_k \log \hat{y}_k = -\log \hat{y}_c $

    This reduces to the negative log-probability of the correct class, which is why categorical cross-entropy is also called negative log-likelihood in this context.
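
    The equivalence of the full sum and the single $ -\log \hat{y}_c $ term is easy to confirm numerically (the logits and class index below are arbitrary):

        import numpy as np

        logits = np.array([2.0, 0.5, -1.0])              # raw scores for 3 classes
        probs  = np.exp(logits) / np.exp(logits).sum()   # softmax (fine for small logits)
        c = 0                                            # index of the correct class

        y_onehot = np.eye(3)[c]
        full_sum = -np.sum(y_onehot * np.log(probs))     # -sum_k y_k log y_hat_k
        nll      = -np.log(probs[c])                     # -log y_hat_c
        print(full_sum, nll)                             # identical values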

    Numerical Stability

    The Log-Sum-Exp Trick

    Naively computing $ \log(\mathrm{softmax}(z_k)) $ involves exponentiating potentially large logits, causing overflow. The log-sum-exp trick avoids this:

    $ \log \hat{y}_k = z_k - \log \sum_{j=1}^{K} e^{z_j} = z_k - \left(m + \log \sum_{j=1}^{K} e^{z_j - m}\right) $

    where $ m = \max_j z_j $. Subtracting the maximum logit ensures the largest exponent is zero, preventing overflow. All major deep learning frameworks implement this fused operation (e.g., PyTorch's CrossEntropyLoss accepts raw logits).
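
    A minimal NumPy sketch of the stabilized computation; a naive softmax followed by a log on the same logits would overflow in the exponential:

        import numpy as np

        def log_softmax(z):
            """Numerically stable log-softmax via the log-sum-exp trick."""
            z = np.asarray(z, dtype=float)
            m = z.max()
            return z - (m + np.log(np.sum(np.exp(z - m))))

        z = np.array([1000.0, 999.0, 998.0])   # naive exp(1000.0) would overflow
        print(log_softmax(z))                  # finite values: about [-0.408, -1.408, -2.408]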

    Clamping

    Predicted probabilities should be clamped away from exactly 0 and 1 to avoid $ \log(0) = -\infty $. A small epsilon (e.g., $ 10^{-7} $) is typically used.
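
    With NumPy, for example, the predictions might be clipped before taking logarithms, using the $ 10^{-7} $ epsilon mentioned above:

        import numpy as np

        eps   = 1e-7
        y_hat = np.array([0.0, 0.3, 1.0])          # model output that touches 0 and 1
        y_hat = np.clip(y_hat, eps, 1.0 - eps)     # keep log() finite
        y     = np.array([0.0, 1.0, 1.0])

        bce = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
        print(bce)                                 # finite, no -inf or nan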

    Label Smoothing

    Label smoothing (Szegedy et al., 2016) replaces the hard one-hot target with a soft distribution:

    $ y_k^{\mathrm{smooth}} = (1 - \alpha)\, y_k + \frac{\alpha}{K} $

    where $ \alpha $ is a small constant (commonly 0.1). This prevents the model from becoming overconfident, improves calibration, and often yields better generalization. It is standard practice in training large image classifiers and Transformer models.
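
    A small sketch of the smoothing transform (the helper name is illustrative):

        import numpy as np

        def smooth_labels(y_onehot, alpha=0.1):
            """Mix a one-hot target with the uniform distribution over K classes."""
            K = y_onehot.shape[-1]
            return (1.0 - alpha) * y_onehot + alpha / K

        y = np.eye(4)[2]                 # one-hot target for class 2 of 4
        print(smooth_labels(y, 0.1))     # [0.025, 0.025, 0.925, 0.025]

    Many frameworks expose this directly; for instance, recent versions of PyTorch's CrossEntropyLoss accept a label_smoothing argument.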

    Comparison with Other Losses

    • Cross-entropy: $ -\sum y_k \log \hat{y}_k $. Typical use: classification.
    • Mean squared error: $ \frac{1}{K}\sum(y_k - \hat{y}_k)^2 $. Typical use: regression (poor for classification).
    • Hinge loss: $ \max(0, 1 - y \cdot z) $. Typical use: SVM-style classification.
    • Focal loss: $ -(1-\hat{y}_c)^\gamma \log \hat{y}_c $. Typical use: imbalanced classification.

    Cross-entropy has steeper gradients than MSE when the prediction is confidently wrong, leading to faster correction of large errors.
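
    This behaviour is visible in the single-output sigmoid case. The sketch below assumes squared error is applied directly to the sigmoid output and compares the two gradients with respect to the logit for a confidently wrong prediction:

        import numpy as np

        def sigmoid(z):
            return 1.0 / (1.0 + np.exp(-z))

        y, z  = 1.0, -8.0                 # true label 1, confidently wrong logit
        y_hat = sigmoid(z)                # ~0.000335

        grad_ce  = y_hat - y                                # d(BCE)/dz ≈ -1.0
        grad_mse = 2 * (y_hat - y) * y_hat * (1 - y_hat)    # d(MSE)/dz ≈ -0.0007 (vanishing)
        print(grad_ce, grad_mse)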

    See also

    References

    • Shannon, C. E. (1948). "A Mathematical Theory of Communication". Bell System Technical Journal.
    • Goodfellow, I., Bengio, Y. and Courville, A. (2016). Deep Learning. MIT Press, Chapter 6.
    • Szegedy, C. et al. (2016). "Rethinking the Inception Architecture for Computer Vision". CVPR.
    • Lin, T.-Y. et al. (2017). "Focal Loss for Dense Object Detection". ICCV.