Huber Loss
| Article | |
|---|---|
| Topic area | Machine learning — loss functions |
| Prerequisites | Mean Squared Error, Mean Absolute Error, Gradient Descent |
Overview
The Huber loss is a regression loss function that behaves quadratically for small residuals and linearly for large ones, combining the smooth optimization geometry of squared error with the outlier robustness of absolute error. Introduced by Peter J. Huber in 1964 as part of his work on robust statistics, it is parameterized by a threshold $ \delta > 0 $ that defines the crossover between the two regimes.[1] In modern machine learning the Huber loss is widely used for regression tasks where targets contain heavy-tailed noise, and a close relative known as the Smooth L1 loss (essentially the case $ \delta = 1 $) has become the standard choice for bounding-box regression in object detection.
Intuition
Mean squared error penalizes residuals by their square, so a single mislabeled point with a large residual can dominate the gradient and pull the fit toward itself. Mean absolute error treats every residual on a linear scale and is therefore far less sensitive to outliers, but its gradient is constant in magnitude and its derivative is undefined at zero, which slows convergence near the optimum and produces unstable updates with momentum-based optimizers.
The Huber loss interpolates between these two extremes. When the residual is small the loss looks quadratic, so gradients shrink as the model approaches the target and optimization converges smoothly. When the residual exceeds the threshold $ \delta $ the loss switches to a linear regime, capping the gradient magnitude at $ \delta $ and preventing a few extreme errors from steering training. The result is a loss that is robust like the absolute error tail but well-behaved like the squared error core.
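The gradient-capping behavior is easy to see numerically. The following sketch (illustrative values, $ \delta = 1 $, gradients taken with respect to the residual) compares per-sample gradient magnitudes of the three losses on a small batch containing one outlier:

```python
import numpy as np

# Illustrative residuals for a batch with one outlier in the last slot.
residuals = np.array([0.1, -0.2, 0.3, 8.0])
delta = 1.0

mse_grad = residuals                  # d(1/2 r^2)/dr = r: the outlier dominates
mae_grad = np.sign(residuals)         # constant magnitude, even for tiny residuals
hub_grad = np.where(np.abs(residuals) <= delta,
                    residuals,
                    delta * np.sign(residuals))   # capped at delta

print(mse_grad)   # [ 0.1 -0.2  0.3  8. ]
print(mae_grad)   # [ 1. -1.  1.  1.]
print(hub_grad)   # [ 0.1 -0.2  0.3  1. ]
```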
Formulation
For a residual $ r = y - \hat{y} $ the Huber loss is defined piecewise:
$ {\displaystyle L_\delta(r) = \begin{cases} \tfrac{1}{2} r^2 & \text{if } |r| \le \delta, \\ \delta\,(|r| - \tfrac{1}{2}\delta) & \text{if } |r| > \delta. \end{cases} } $
The constant $ \tfrac{1}{2}\delta $ in the linear branch is chosen so that the two pieces match in value and in first derivative at $ |r| = \delta $. The derivative with respect to the prediction $ \hat{y} $ is
$ {\displaystyle \frac{\partial L_\delta}{\partial \hat{y}} = \begin{cases} -r & \text{if } |r| \le \delta, \\ -\delta\,\operatorname{sign}(r) & \text{if } |r| > \delta, \end{cases} } $
which makes clear that the gradient magnitude is clipped at $ \delta $. The loss is continuously differentiable, but only once: the second derivative jumps from 1 to 0 at $ |r|=\delta $, so the loss is $ C^1 $ but not $ C^2 $.
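A minimal NumPy sketch of the loss and its gradient, directly transcribing the two formulas above (the function names and the vectorized form are illustrative choices, not a reference implementation):

```python
import numpy as np

def huber_loss(y, y_hat, delta=1.0):
    """Elementwise Huber loss L_delta(r) for residuals r = y - y_hat."""
    r = y - y_hat
    quad = 0.5 * r ** 2                          # |r| <= delta: quadratic branch
    lin = delta * (np.abs(r) - 0.5 * delta)      # |r| >  delta: linear branch
    return np.where(np.abs(r) <= delta, quad, lin)

def huber_grad(y, y_hat, delta=1.0):
    """Derivative dL_delta/dy_hat; its magnitude never exceeds delta."""
    r = y - y_hat
    return np.where(np.abs(r) <= delta, -r, -delta * np.sign(r))

# The two branches agree in value (and slope) at |r| = delta:
r = np.array([0.999999, 1.000001])
print(huber_loss(np.zeros(2), -r))   # both values are approximately 0.5
```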
Properties
The Huber loss is convex in the residual, so summing it over a dataset yields a convex empirical risk for linear models. It is bounded below by zero and grows without bound as $ |r| \to \infty $, which keeps the minimization problem well-posed. Unlike pure absolute error, the loss is differentiable everywhere, so first-order methods such as Gradient Descent and Stochastic Gradient Descent do not need a subgradient treatment near zero. Unlike squared error, its influence function (the contribution of a single observation to the gradient) is bounded, which is the formal sense in which Huber regression is an M-estimator with bounded influence.[2]
Choosing the threshold
The hyperparameter $ \delta $ controls how aggressively large residuals are downweighted. Small $ \delta $ approaches mean absolute error and maximizes robustness; large $ \delta $ approaches mean squared error and emphasizes smoothness. A common heuristic, due to Huber, is to set $ \delta $ to a multiple (often 1.345) of a robust scale estimate such as the median absolute deviation, which yields about 95 percent asymptotic efficiency under Gaussian noise while retaining robustness against contamination. In deep learning $ \delta $ is typically fixed at 1 (the Smooth L1 convention) or tuned on a validation split.
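A sketch of that heuristic, assuming residuals from an initial fit are available; the factor 1.4826, which rescales the median absolute deviation into a consistent estimate of a Gaussian standard deviation, is a standard convention not spelled out above:

```python
import numpy as np

def delta_from_residuals(residuals, k=1.345):
    """Set delta = k * (robust scale estimate of the residuals)."""
    mad = np.median(np.abs(residuals - np.median(residuals)))
    sigma_hat = 1.4826 * mad          # MAD rescaled to estimate a Gaussian sigma
    return k * sigma_hat

rng = np.random.default_rng(0)
r = rng.normal(scale=2.0, size=1000)
r[:20] += 50.0                        # contaminate 2% of the residuals
print(delta_from_residuals(r))        # stays near 1.345 * 2.0 despite the outliers
```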
Training and inference
Because the loss is convex and smooth, Huber regression for linear models can be solved by iteratively reweighted least squares, by quasi-Newton methods, or by ordinary Gradient Descent. In neural networks the loss is plugged in as a drop-in replacement for mean squared error in the output layer and propagates gradients through Backpropagation in the usual way. The clipped-gradient behavior in the linear regime acts as a built-in form of gradient clipping for the regression head, which can stabilize training when targets occasionally contain extreme values, for example in reward prediction for reinforcement learning.
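As a concrete example of the gradient-descent route, here is a minimal sketch of Huber regression for a linear model (data, learning rate, and step count are illustrative):

```python
import numpy as np

def fit_huber_linear(X, y, delta=1.0, lr=0.1, steps=2000):
    """Minimize the mean Huber loss of y - X @ w by gradient descent."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        r = y - X @ w
        psi = np.where(np.abs(r) <= delta, r, delta * np.sign(r))  # clipped residuals
        grad = -X.T @ psi / len(y)                                  # d(mean loss)/dw
        w -= lr * grad
    return w

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
y = X @ np.array([1.0, 3.0]) + rng.normal(scale=0.1, size=200)
y[:5] += 30.0                             # a few grossly corrupted targets
print(fit_huber_linear(X, y))             # close to [1.0, 3.0] despite the outliers
```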
Variants
Several smooth approximations and generalizations of the Huber loss are in common use:
- Smooth L1 loss is the special case $ \delta = 1 $ popularized by the Fast R-CNN object detector for bounding-box regression; it is sometimes written without the factor of one half but is otherwise identical.[3]
- Pseudo-Huber loss replaces the piecewise definition with the smooth surrogate $ L(r) = \delta^2\bigl(\sqrt{1+(r/\delta)^2} - 1\bigr) $, which is $ C^\infty $ and approximates the Huber shape closely; it is convenient when second derivatives are needed (see the sketch after this list).
- Log-cosh loss uses $ L(r) = \log\cosh(r) $ and behaves like a squared error near zero and an absolute error in the tails, with a similar motivation but different curvature.
- Tukey biweight goes further and redescends to zero influence beyond a cutoff, fully discarding extreme outliers; unlike Huber it is non-convex, so it is used mostly with good initializations.
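The Pseudo-Huber surrogate from the list above is easy to compare against the piecewise loss numerically; in this sketch (with $ \delta = 1 $) their gap stays below $ \tfrac{1}{2}\delta^2 $:

```python
import numpy as np

def huber(r, delta=1.0):
    return np.where(np.abs(r) <= delta,
                    0.5 * r ** 2,
                    delta * (np.abs(r) - 0.5 * delta))

def pseudo_huber(r, delta=1.0):
    # Smooth everywhere, with well-defined second derivatives.
    return delta ** 2 * (np.sqrt(1.0 + (r / delta) ** 2) - 1.0)

r = np.linspace(-5.0, 5.0, 1001)
gap = huber(r) - pseudo_huber(r)
print(gap.min(), gap.max())   # non-negative and below 0.5 (= delta**2 / 2) everywhere
```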
Comparisons
Compared with mean squared error, the Huber loss trades a small loss of efficiency under perfectly Gaussian noise for substantially better behavior under heavy-tailed or contaminated noise. Compared with mean absolute error, it sacrifices some breakdown robustness for a smoother optimization surface and faster convergence near the optimum. In deep regression heads it is often preferred over both: it gives stabler training than mean squared error and faster convergence than mean absolute error, particularly when used with adaptive optimizers whose momentum estimates are sensitive to gradient spikes.
Applications
The Huber loss appears throughout machine learning. In computer vision the Smooth L1 variant is the de facto standard for bounding-box and keypoint regression in detectors such as Fast R-CNN, Faster R-CNN, and many of their successors. In reinforcement learning it is used as the temporal-difference loss in deep Q-learning, where it tames large Bellman errors that would otherwise destabilize training.[4] In tabular regression and time-series forecasting it is a standard choice when targets contain occasional spikes, and in classical statistics Huber regression remains a foundational example of a bounded-influence M-estimator.
Limitations
The threshold $ \delta $ is a free hyperparameter that must be chosen with care; a value calibrated to one dataset may be inappropriate when the noise scale shifts. The loss is convex but not strictly convex outside the quadratic region, which can slow convergence in flat directions. It does not redescend, so unlike Tukey-style losses it still allows extreme observations to influence the fit, only with bounded weight. Finally, while it is robust to symmetric heavy-tailed noise, it does not by itself correct for systematic biases or asymmetric error distributions; in those cases Quantile Regression or asymmetric variants of the Huber loss are typically more appropriate.
References
- ↑ Huber, P. J. Robust Estimation of a Location Parameter. Annals of Mathematical Statistics, 35(1):73-101, 1964.
- ↑ Huber, P. J. and Ronchetti, E. M. Robust Statistics, 2nd ed., Wiley, 2009.
- ↑ Girshick, R. Fast R-CNN. ICCV, 2015.
- ↑ Mnih, V. et al. Human-level control through deep reinforcement learning. Nature, 518:529-533, 2015.