Mean Absolute Error

Topic area: Machine Learning
Prerequisites: Loss Function, Regression, Gradient Descent


    Overview

    Mean Absolute Error (MAE), also called the L1 loss or average absolute deviation, is a measure of prediction error that averages the absolute differences between predicted values and observed values. Given a dataset with $ n $ samples, predictions $ \hat{y}_i $, and ground-truth targets $ y_i $, MAE is defined as

    $ {\displaystyle \mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|.} $

MAE serves two distinct roles in Machine Learning: as an evaluation metric for Regression models and as a training Loss Function that the model directly minimizes. In both roles its defining property is that errors enter linearly rather than quadratically, which makes MAE less sensitive to large outliers than the Mean Squared Error. The metric is expressed in the same physical units as the target variable, so an MAE of 3.2 on a house-price model denominated in dollars means predictions are off by 3.2 dollars on average, which gives MAE a directly interpretable meaning that squared-error metrics lack.
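
As a minimal illustration, MAE can be computed directly from the definition; the values below are made up for the example:

```python
# Minimal sketch: MAE as an evaluation metric, computed from the definition.
# The arrays below are illustrative, not from any real dataset.
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])   # observed targets
y_pred = np.array([2.5,  0.0, 2.0, 8.0])   # model predictions

mae = np.mean(np.abs(y_true - y_pred))     # (1/n) * sum_i |y_i - yhat_i|
print(mae)  # 0.5
```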

    Intuition

    Every regression model produces a residual $ r_i = y_i - \hat{y}_i $ for each example. A loss function summarizes this vector of residuals into a single scalar that captures "how wrong" the model is. MAE applies the absolute-value function elementwise and then averages, treating overestimates and underestimates symmetrically and weighting all residuals in proportion to their size. A residual of 10 contributes ten times as much to MAE as a residual of 1, whereas under Mean Squared Error it would contribute one hundred times as much.

    A useful statistical interpretation comes from the optimization problem MAE solves. If the model is reduced to a single constant prediction $ c $ and we minimize $ \sum_i |y_i - c| $ over $ c $, the minimizer is the Median of the targets, not the mean. This makes MAE the natural loss when one cares about the conditional median of the response rather than its conditional mean. By contrast, minimizing squared error returns the conditional mean. The choice between MAE and squared error therefore reflects a modeling decision about which central tendency of the response distribution best summarizes a "typical" value.
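
A small numerical sketch makes this concrete: minimizing the summed absolute error over a grid of constant predictions recovers the median, while minimizing the summed squared error recovers the mean (the data values are illustrative):

```python
# Sketch: the constant minimizing sum |y_i - c| is the median of y;
# the constant minimizing sum (y_i - c)^2 is the mean. Illustrative data.
import numpy as np

y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])    # note the outlier at 100
grid = np.linspace(0.0, 110.0, 11001)        # candidate constants c

abs_loss = np.abs(y[:, None] - grid[None, :]).sum(axis=0)
sq_loss = ((y[:, None] - grid[None, :]) ** 2).sum(axis=0)

print(grid[abs_loss.argmin()])   # 3.0  -> the median, ignores the outlier
print(grid[sq_loss.argmin()])    # 22.0 -> the mean, dragged by the outlier
```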

    Formulation

    Let $ f_\theta : \mathcal{X} \to \mathbb{R} $ be a parameterized model with parameters $ \theta $, applied to inputs $ x_i \in \mathcal{X} $. The empirical MAE is

    $ {\displaystyle \mathcal{L}_{\mathrm{MAE}}(\theta) = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - f_\theta(x_i) \right|.} $

    For multi-output regression with target vectors $ \mathbf{y}_i \in \mathbb{R}^d $, MAE generalizes by averaging absolute differences across both samples and output dimensions:

    $ {\displaystyle \mathcal{L}_{\mathrm{MAE}} = \frac{1}{n d} \sum_{i=1}^{n} \sum_{j=1}^{d} \left| y_{ij} - \hat{y}_{ij} \right|.} $

This is equivalent to the average L1 norm of the residual vectors divided by $ d $, and some software packages omit the factor of $ 1/d $, reporting the summed-over-dimension form instead. When evaluating a model, MAE is computed on a held-out test set; when training, the same expression (averaged over a minibatch rather than the full dataset) is used as the Loss Function and minimized via Gradient Descent or a variant.
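
A short sketch of the multi-output case (illustrative arrays) shows the $ 1/(nd) $-averaged form alongside the summed-over-dimension form that some packages report:

```python
# Sketch: multi-output MAE, averaged over both samples and output dims,
# versus the summed-over-dimension form. Arrays are illustrative.
import numpy as np

Y_true = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])  # n=3, d=2
Y_pred = np.array([[1.5, 2.0], [2.0, 5.0], [5.0, 5.0]])

abs_err = np.abs(Y_true - Y_pred)
mae_avg = abs_err.mean()                  # 1/(n*d) * double sum, as above
mae_summed = abs_err.sum(axis=1).mean()   # mean L1 norm per sample, no 1/d

print(mae_avg)     # 0.5833...
print(mae_summed)  # 1.1666... = d * mae_avg
```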

    The probabilistic justification of MAE comes from Maximum Likelihood Estimation under a Laplace Distribution noise model. If we assume $ y_i = f_\theta(x_i) + \varepsilon_i $ with $ \varepsilon_i \sim \mathrm{Laplace}(0, b) $ independently, the log-likelihood reduces to a constant minus a multiple of $ \sum_i |y_i - f_\theta(x_i)| $, so maximum-likelihood estimation under Laplacian noise is identical to MAE minimization. This stands in contrast to the Gaussian-noise assumption that justifies squared error.
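
To make the reduction explicit: the Laplace density with scale $ b $ is $ p(\varepsilon) = \tfrac{1}{2b} e^{-|\varepsilon|/b} $, so the log-likelihood of the observed targets is

$ {\displaystyle \log L(\theta) = \sum_{i=1}^{n} \log \left[ \frac{1}{2b} \exp\!\left( -\frac{\left| y_i - f_\theta(x_i) \right|}{b} \right) \right] = -\, n \log(2b) - \frac{1}{b} \sum_{i=1}^{n} \left| y_i - f_\theta(x_i) \right|,} $

and maximizing this over $ \theta $ with $ b $ fixed is exactly minimizing $ \sum_i |y_i - f_\theta(x_i)| $.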

    Optimization and gradients

The Gradient of $ |r| $ with respect to $ r $ is the Sign Function $ \mathrm{sign}(r) $: $ +1 $ for positive $ r $ and $ -1 $ for negative $ r $, with the derivative undefined at $ r = 0 $. By the chain rule, the gradient of MAE with respect to model parameters is

    $ {\displaystyle \nabla_\theta \mathcal{L}_{\mathrm{MAE}} = -\frac{1}{n} \sum_{i=1}^{n} \mathrm{sign}(y_i - f_\theta(x_i)) \, \nabla_\theta f_\theta(x_i).} $

    Two practical consequences follow. First, the gradient magnitude is constant in the size of the residual: a prediction off by a tiny amount and a prediction off by a huge amount contribute gradient updates of the same magnitude (only the sign of the residual changes). This is what makes MAE robust to outliers but also what makes Convergence near the optimum slow, since the gradient does not naturally taper as the model approaches the targets. Second, the absolute value is not differentiable at zero, which can cause numerical issues in gradient-based optimizers when residuals reach exactly zero. In practice, frameworks return zero as a Subgradient at that point, and exact-zero residuals are rare in continuous regression.
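
A sketch of subgradient descent on MAE for a linear model (synthetic data, illustrative learning rate) shows both the simple form of the gradient and the constant-magnitude updates described above:

```python
# Sketch: subgradient descent on MAE for a linear model f(x) = x . theta.
# Synthetic Laplacian-noise data; the learning rate is illustrative.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + rng.laplace(scale=0.1, size=200)

theta = np.zeros(3)
lr = 0.05
for _ in range(1000):
    residual = y - X @ theta
    # gradient of (1/n) sum |r_i|; np.sign(0) = 0 matches the usual subgradient
    grad = -(np.sign(residual) @ X) / len(y)
    theta -= lr * grad

print(theta)  # close to theta_true, but jitters near it at a fixed step size
```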

    To mitigate the slow late-stage convergence, practitioners often pair MAE with adaptive optimizers such as Adam that rescale gradients per-parameter, or switch to a smoothed variant like the Huber Loss that is quadratic near zero and linear in the tails.

    Variants

    Several closely related losses extend or smooth MAE.

    The Mean Absolute Percentage Error (MAPE) divides each absolute error by $ |y_i| $ before averaging, producing a unit-free measure useful when comparing forecasts across series of different scales. MAPE is undefined when any $ y_i = 0 $ and is asymmetric in over- versus under-prediction, which has motivated alternatives such as symmetric MAPE.

    The Huber Loss, $ \rho_\delta(r) = \tfrac{1}{2} r^2 $ for $ |r| \leq \delta $ and $ \delta(|r| - \tfrac{1}{2} \delta) $ otherwise, behaves like squared error near zero and like MAE in the tails, combining the smooth gradients of Mean Squared Error with MAE's Robustness to outliers.

    Quantile Regression generalizes MAE to estimate arbitrary quantiles by replacing the symmetric absolute value with the asymmetric pinball loss $ \rho_\tau(r) = \max(\tau r, (\tau - 1) r) $, where $ \tau \in (0, 1) $ selects the target quantile. Setting $ \tau = 0.5 $ recovers MAE up to a factor of two and estimates the conditional median.

    The Log-Cosh Loss approximates MAE in the tails while remaining smooth and twice-differentiable everywhere, making it convenient for second-order optimization methods.
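
As a compact reference, the following numpy sketches implement the variants above as elementwise functions of the residual $ r = y - \hat{y} $ (the $ \delta $ and $ \tau $ defaults are illustrative):

```python
# Sketches of MAE-family losses as elementwise functions of the residual.
import numpy as np

def mae(r):
    return np.abs(r)

def mape(y_true, y_pred):
    # undefined where y_true == 0; no guard here, for brevity
    return np.abs((y_true - y_pred) / y_true)

def huber(r, delta=1.0):
    # quadratic for |r| <= delta, linear beyond
    return np.where(np.abs(r) <= delta,
                    0.5 * r**2,
                    delta * (np.abs(r) - 0.5 * delta))

def pinball(r, tau=0.5):
    # tau = 0.5 gives 0.5 * |r|, i.e. MAE up to a factor of two
    return np.maximum(tau * r, (tau - 1.0) * r)

def log_cosh(r):
    # stable form of log(cosh(r)); approaches |r| - log(2) for large |r|
    return np.abs(r) + np.log1p(np.exp(-2.0 * np.abs(r))) - np.log(2.0)
```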

    MAE versus mean squared error

    The most common comparison is between MAE and Mean Squared Error (MSE). MSE squares residuals before averaging, which has three coupled effects: large residuals dominate the loss, the loss surface is smooth and quadratic near the optimum (helping convergence), and the resulting estimator targets the conditional mean. MAE keeps the loss linear in the residual, treats all residuals proportionally, has constant-magnitude gradients, and targets the conditional median.

In datasets with Heavy-tailed Distribution noise or systematic Outlier contamination, MAE is preferable: a single residual of size 100 contributes 100 to the summed absolute error but 10,000 to the summed squared error, so MSE-trained models will distort their predictions to chase the outlier while MAE-trained models will largely ignore it. In datasets where errors are approximately Gaussian and small variance reductions matter, MSE is preferable: it converges faster, its gradient signal at small residuals is informative rather than constant, and its conditional-mean target is the more standard summary statistic.
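
The arithmetic is easy to verify: in the illustrative sketch below, a single outlier residual accounts for roughly half of the summed absolute error but about 99% of the summed squared error:

```python
# Sketch: one large residual dominates squared error far more than absolute error.
import numpy as np

residuals = np.array([1.0] * 99 + [100.0])   # 99 small errors, one outlier

print(np.abs(residuals).mean())   # MAE = 1.99; the outlier is ~50% of the total
print((residuals ** 2).mean())    # MSE = 100.99; the outlier is ~99% of the total
```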

    A practical compromise is to evaluate models with MAE for interpretability while training with MSE or Huber Loss for optimization stability, and many published Benchmarks report both metrics to characterize different aspects of model error.

    Limitations

    MAE has known weaknesses that should inform its use. Its constant-gradient property can stall optimization once residuals are small, requiring careful learning-rate schedules or adaptive optimizers. The non-differentiability at zero, while typically handled by Subgradient methods, can interact poorly with techniques that assume strict smoothness, such as some second-order methods.

    Like all single-number summaries, MAE collapses the entire residual distribution into a single value. Two models with identical MAE can have very different residual distributions: one with many small errors, another with mostly perfect predictions but a few moderate errors. Reporting MAE alongside median absolute error, the residual distribution, or quantile-based summaries gives a more complete picture.
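
A tiny sketch (illustrative residual vectors) shows how two models can share an MAE while behaving very differently:

```python
# Sketch: identical MAE, very different residual distributions.
import numpy as np

r_a = np.full(10, 1.0)                    # many uniform small errors
r_b = np.array([0.0] * 8 + [5.0, 5.0])    # mostly perfect, two larger misses

print(np.abs(r_a).mean(), np.abs(r_b).mean())          # 1.0 and 1.0: same MAE
print(np.median(np.abs(r_a)), np.median(np.abs(r_b)))  # 1.0 vs 0.0
```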

    Finally, MAE is scale-dependent, so it cannot be compared across regression problems with different target scales without Normalization. Scale-free alternatives such as MAPE, normalized MAE (dividing by the target's range or standard deviation), or R-squared address this when cross-task comparison is needed.

    References

1. Willmott, C. J., and Matsuura, K. "Advantages of the mean absolute error (MAE) over the root mean square error (RMSE) in assessing average model performance." Climate Research, 30(1), 79-82, 2005.
2. Chai, T., and Draxler, R. R. "Root mean square error (RMSE) or mean absolute error (MAE)? Arguments against avoiding RMSE in the literature." Geoscientific Model Development, 7(3), 1247-1250, 2014.
3. Hastie, T., Tibshirani, R., and Friedman, J. The Elements of Statistical Learning, 2nd ed., Springer, 2009.