Focal Loss

    Topic area: Deep Learning
    Prerequisites: Cross-entropy Loss, Backpropagation, Logistic Regression


    Overview

    Focal loss is a modification of the standard cross-entropy loss designed to address extreme class imbalance during training, particularly in dense object detection. Introduced by Lin et al. in 2017 as the central training objective of the RetinaNet detector, focal loss reshapes the loss function so that well-classified examples contribute far less to the loss and its gradient, allowing the optimizer to focus on a sparse set of hard, misclassified examples.[1] The technique made it possible for one-stage detectors to match the accuracy of two-stage detectors such as Faster R-CNN while retaining the speed advantage of single-shot architectures.

    Although focal loss was developed for object detection, it has become a general-purpose tool whenever a classification task is dominated by easy negatives or by a majority class. Common application domains include medical imaging, anomaly detection, semantic segmentation, and rare-event prediction. The loss adds a single non-negative hyperparameter, the focusing parameter $ \gamma $, that smoothly interpolates between standard cross-entropy and a hard-mining behaviour.

    Motivation: foreground-background imbalance

    Dense detectors evaluate $ 10^4 $-$ 10^5 $ candidate locations per image, of which only a handful overlap a true object. The remaining locations are easy background examples whose individual loss is small but whose aggregate contribution overwhelms the gradient. Two-stage pipelines mitigate this with a region proposal network that filters most of the background before the classifier ever sees it. One-stage detectors had no such filter, and earlier remedies such as hard negative mining, OHEM, or fixed foreground-background ratios required heuristics that were brittle across datasets.

    Focal loss attacks the imbalance directly, in the loss itself, by down-weighting the contribution of confident predictions regardless of class. Because the down-weighting is smooth and differentiable, no example is ever discarded; the network simply allocates less capacity to learning what it already knows.

    Cross-entropy as the starting point

    For a binary problem with label $ y \in \{0, 1\} $ and predicted probability $ p \in [0, 1] $ for the positive class, cross-entropy is

    $ {\displaystyle \mathrm{CE}(p, y) = -y \log p - (1 - y) \log(1 - p).} $

    Defining

    $ {\displaystyle p_t = \begin{cases} p & \text{if } y = 1 \\ 1 - p & \text{otherwise} \end{cases}} $

    allows the loss to be written compactly as $ \mathrm{CE}(p_t) = -\log p_t $. The quantity $ p_t $ is the model's probability assigned to the true class. A well-classified example has $ p_t \to 1 $ and incurs a small but non-negligible loss; summed over tens of thousands of easy negatives, this small loss dominates training.
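
    As a rough numeric illustration (the counts and probabilities below are invented purely to show the scale, not taken from the paper), a large pool of confident negatives can outweigh the few hard positives by roughly two orders of magnitude:

```python
import math

# Hypothetical per-image counts, chosen only to illustrate the scale of the imbalance.
n_easy_negatives = 100_000   # background locations classified with p_t = 0.99
n_hard_positives = 10        # object locations classified with p_t = 0.10

ce_easy = -math.log(0.99)    # ~0.01 per easy example
ce_hard = -math.log(0.10)    # ~2.30 per hard example

print(f"total CE from easy negatives: {n_easy_negatives * ce_easy:.0f}")   # ~1005
print(f"total CE from hard positives: {n_hard_positives * ce_hard:.0f}")   # ~23
```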

    Formulation

    Focal loss adds a modulating factor $ (1 - p_t)^\gamma $ in front of the cross-entropy term:

    $ {\displaystyle \mathrm{FL}(p_t) = -(1 - p_t)^\gamma \log p_t,} $

    with $ \gamma \geq 0 $ the tunable focusing parameter. When $ \gamma = 0 $ the modulating factor is one and focal loss collapses to cross-entropy. As $ \gamma $ grows, the loss for confident predictions decays much faster than for uncertain ones. For example, with $ \gamma = 2 $ and $ p_t = 0.9 $, the modulating factor is $ 0.01 $, giving a 100x reduction relative to cross-entropy.
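
    The effect of the modulating factor can be tabulated directly from the definition; this short Python sketch compares cross-entropy ($ \gamma = 0 $) with focal loss at $ \gamma = 2 $:

```python
import math

def focal_loss(p_t: float, gamma: float) -> float:
    """Per-example focal loss -(1 - p_t)^gamma * log(p_t); gamma = 0 gives cross-entropy."""
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

for p_t in (0.5, 0.9, 0.99):
    ce = focal_loss(p_t, gamma=0.0)
    fl = focal_loss(p_t, gamma=2.0)
    print(f"p_t={p_t:.2f}  CE={ce:.4f}  FL(gamma=2)={fl:.6f}  reduction={ce / fl:.0f}x")
```

    The reduction grows from 4x at $ p_t = 0.5 $ to 100x at $ p_t = 0.9 $ and 10,000x at $ p_t = 0.99 $.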

    A second adjustment, often combined with the focusing factor, multiplies the loss by a class-balancing weight $ \alpha_t \in [0, 1] $:

    $ {\displaystyle \mathrm{FL}(p_t) = -\alpha_t (1 - p_t)^\gamma \log p_t.} $

    In the original paper, $ \alpha = 0.25 $ for the foreground class (paired with $ 1 - \alpha = 0.75 $ for background) and $ \gamma = 2 $ were reported as robust defaults. Together, the two multiplicative factors concentrate the loss on hard foreground and hard background examples simultaneously.
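
    A minimal NumPy sketch of the $ \alpha $-balanced binary form (the array shapes and clipping epsilon are implementation choices made here, not part of the definition):

```python
import numpy as np

def binary_focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-12):
    """Alpha-balanced binary focal loss, averaged over examples.

    p: predicted probability of the positive class, shape (N,)
    y: binary labels in {0, 1}, shape (N,)
    """
    p = np.clip(p, eps, 1.0 - eps)                  # numerical safety for the log
    p_t = np.where(y == 1, p, 1.0 - p)              # probability of the true class
    alpha_t = np.where(y == 1, alpha, 1.0 - alpha)  # class-balancing weight
    return np.mean(-alpha_t * (1.0 - p_t) ** gamma * np.log(p_t))
```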

    The multiclass extension applies the same factor per class using a softmax-derived $ p_t $, or independently per class with a sigmoid activation, the latter being the configuration used in RetinaNet.
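
    A hedged PyTorch sketch of the sigmoid (per-class, one-vs-all) variant described above; the function name, tensor shapes, and the choice to return a summed loss are assumptions for illustration, not the reference RetinaNet implementation:

```python
import torch
import torch.nn.functional as F

def sigmoid_focal_loss(logits: torch.Tensor,
                       targets: torch.Tensor,
                       alpha: float = 0.25,
                       gamma: float = 2.0) -> torch.Tensor:
    """Per-class focal loss on raw logits.

    logits, targets: shape (N, num_classes), with 0/1 targets per class.
    Returns the summed loss; normalization is left to the caller
    (e.g. division by the number of positive anchors, as discussed below).
    """
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1.0 - p) * (1.0 - targets)
    alpha_t = alpha * targets + (1.0 - alpha) * (1.0 - targets)
    return (alpha_t * (1.0 - p_t) ** gamma * ce).sum()
```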

    Gradients and training dynamics

    Differentiating focal loss with respect to the logit $ z $ yields a gradient that is itself attenuated by a power of $ (1 - p_t) $. Easy examples therefore contribute little to either the forward loss or to backpropagation, so the optimizer effectively spends each step on the hardest examples in the batch.
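
    For the $ y = 1 $ case with $ p = \sigma(z) $ and hence $ p_t = p $, differentiating $ \mathrm{FL}(p_t) = -(1 - p_t)^\gamma \log p_t $ through the sigmoid gives (a short derivation, consistent with the appendix of the original paper)

    $ {\displaystyle \frac{\partial \mathrm{FL}}{\partial z} = (1 - p_t)^\gamma \left( \gamma\, p_t \log p_t + p_t - 1 \right),} $

    which vanishes rapidly as $ p_t \to 1 $ and reduces to the usual cross-entropy gradient $ p_t - 1 $ at $ \gamma = 0 $.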

    Two implementation details matter in practice. First, the bias of the final classification layer should be initialized so that $ p \approx \pi $ for some small prior such as $ \pi = 0.01 $; without this initialization, the loss at the first iteration is dominated by tens of thousands of confident-but-wrong negative predictions and training diverges. Second, focal loss is typically computed per anchor and normalized by the number of positive anchors, not by the total number of anchors. This keeps the gradient magnitude comparable across images of varying object density.
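
    A minimal PyTorch sketch of both details; the head sizes, dummy tensors, and the reuse of the `sigmoid_focal_loss` sketch from above are illustrative assumptions:

```python
import math
import torch
import torch.nn as nn

num_anchors, num_classes = 9, 80   # hypothetical RetinaNet-like head size

# 1. Prior-based bias initialization: pick the bias b so that sigmoid(b) = pi,
#    so every location starts out predicting background with high confidence.
prior = 0.01
cls_head = nn.Conv2d(256, num_anchors * num_classes, kernel_size=3, padding=1)
nn.init.constant_(cls_head.bias, -math.log((1.0 - prior) / prior))

# 2. Normalize the summed loss by the number of positive anchors, not all anchors.
logits = torch.randn(512, num_classes)        # dummy per-anchor class logits
targets = torch.zeros_like(logits)
targets[:4, 0] = 1.0                          # a handful of dummy positive anchors
loss = sigmoid_focal_loss(logits, targets)    # summed loss, from the sketch above
num_pos = targets.sum().clamp(min=1.0)        # avoid division by zero
loss = loss / num_pos
```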

    Variants and extensions

    Several follow-up losses build on the focal-loss skeleton. Quality Focal Loss, introduced in the Generalized Focal Loss family, replaces the binary target with a continuous quality score (such as IoU) so that the classifier directly predicts localization confidence.[2] Distribution Focal Loss models the bounding box regression target as a discrete distribution, again using a focal-style modulating factor.

    In semantic segmentation, focal loss is frequently combined with Dice loss to handle pixel-level class imbalance, particularly for thin structures and small lesions in medical images. Variants such as Focal Tversky Loss generalize the formulation by combining focal modulation with the Tversky index.[3]
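
    As a hedged sketch of one such combination for binary segmentation (the equal weighting of the two terms and the Dice smoothing constant are choices made here for illustration, not prescriptions from the cited works):

```python
import torch
import torch.nn.functional as F

def focal_dice_loss(logits, targets, alpha=0.25, gamma=2.0, smooth=1.0):
    """Equal-weight sum of binary focal loss and soft Dice loss.

    logits, targets: shape (N, 1, H, W); targets are 0/1 masks.
    """
    p = torch.sigmoid(logits)

    # Focal term, averaged over all pixels.
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    focal = (alpha_t * (1 - p_t) ** gamma * ce).mean()

    # Soft Dice term, computed per image and then averaged.
    dims = (1, 2, 3)
    intersection = (p * targets).sum(dims)
    dice = (2 * intersection + smooth) / (p.sum(dims) + targets.sum(dims) + smooth)
    return focal + (1 - dice).mean()
```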

    Comparison with alternatives

    Focal loss is one of several techniques for handling imbalance. Compared with hard example mining and OHEM, focal loss is fully differentiable, has no discrete selection step, and is straightforward to implement. Compared with class-balanced sampling, it does not require knowing per-class frequencies in advance and adapts dynamically as the model improves. Compared with simple class weighting (the $ \alpha $ term alone), the multiplicative $ (1 - p_t)^\gamma $ factor additionally suppresses easy examples within each class.

    When the imbalance is mild, the gains over cross-entropy are small and sometimes negative because focal loss can under-weight signals from already well-classified but informative examples. The technique is most useful when the easy-example fraction is overwhelming.

    Limitations

    Focal loss assumes that low $ p_t $ reliably identifies useful learning signal. In datasets with substantial label noise this assumption fails: noisy positives produce permanently low $ p_t $ and consume disproportionate gradient. Several works have studied this failure mode and proposed noise-robust variants that cap or smooth the modulating factor. Focal loss can also be sensitive to $ \gamma $; very large values starve the optimizer of signal during early training, while very small values fall back to cross-entropy.

    Calibration is a further concern. A network trained with focal loss tends to produce more confident but less calibrated probabilities than one trained with cross-entropy, which matters when downstream systems consume the predicted probabilities directly rather than the argmax.

    References

    1. Lin, T.-Y., Goyal, P., Girshick, R., He, K., Dollár, P. "Focal Loss for Dense Object Detection." ICCV 2017.
    2. Li, X., Wang, W., Wu, L., Chen, S., et al. "Generalized Focal Loss: Learning Qualified and Distributed Bounding Boxes for Dense Object Detection." NeurIPS 2020.
    3. Abraham, N., Khan, N. M. "A Novel Focal Tversky Loss Function with Improved Attention U-Net for Lesion Segmentation." ISBI 2019.