Mean Squared Error

    Topic area: supervised learning
    Prerequisites: Loss function, Linear regression, Maximum likelihood estimation


    Overview

    Mean squared error (MSE) is the most widely used loss function for regression: the average of the squared differences between predicted and target values. For predictions $ \hat{y}_i $ and targets $ y_i $ over $ n $ examples, $ \mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 $. It is convex, smooth, and admits closed-form solutions in linear settings, which made it the default objective for least-squares regression dating back to Gauss and Legendre. In modern machine learning MSE is the standard loss for continuous targets, the natural quantity in bias-variance analysis, and the maximum-likelihood objective when measurement noise is Gaussian with constant variance. Its main weakness is heavy weighting of large errors, which makes it sensitive to outliers and to scale.

    Definition

    Given a dataset of $ n $ input-output pairs $ \{(x_i, y_i)\}_{i=1}^{n} $ and a predictor $ f $ producing $ \hat{y}_i = f(x_i) $, the empirical mean squared error is

    $ {\displaystyle \mathrm{MSE}(f) = \frac{1}{n} \sum_{i=1}^{n} \bigl(y_i - f(x_i)\bigr)^2.} $

    The corresponding population quantity, the expected squared error or risk under the joint distribution $ p(x, y) $, is

    $ {\displaystyle R(f) = \mathbb{E}_{(x, y) \sim p}\!\left[(y - f(x))^2\right].} $

    A standard exercise shows that the minimizer of $ R $ over all measurable functions is the conditional mean $ f^*(x) = \mathbb{E}[y \mid x] $, which is why MSE-trained models are interpreted as regressors of the conditional expectation. The square root $ \sqrt{\mathrm{MSE}} $ is the root mean squared error (RMSE), reported in the same units as the target.

    When estimating a parameter $ \theta $ from data, the same quantity appears in statistics as the MSE of an estimator $ \hat{\theta} $: $ \mathbb{E}[(\hat{\theta} - \theta)^2] $. The two usages — loss for prediction, risk for estimation — are conceptually distinct but mathematically identical.
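    As a minimal, illustrative sketch of the empirical definition (the function names below are not from this article), MSE and RMSE can be computed directly in NumPy:

```python
import numpy as np

def mse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Empirical mean squared error: average of squared residuals."""
    residuals = y_true - y_pred
    return float(np.mean(residuals ** 2))

def rmse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Root mean squared error, in the same units as the target."""
    return float(np.sqrt(mse(y_true, y_pred)))

y = np.array([3.0, -0.5, 2.0, 7.0])
y_hat = np.array([2.5, 0.0, 2.0, 8.0])
print(mse(y, y_hat))   # 0.375
print(rmse(y, y_hat))  # ~0.612
```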

    Statistical interpretation

    MSE is the negative log-likelihood (up to constants) of an additive Gaussian noise model $ y = f(x) + \varepsilon $ with $ \varepsilon \sim \mathcal{N}(0, \sigma^2) $ and known, constant $ \sigma^2 $. The log-likelihood of the dataset is

    $ {\displaystyle \log p(y \mid x; f) = -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - f(x_i))^2 + \text{const},} $

    so maximum-likelihood estimation of $ f $ under Gaussian noise is exactly empirical MSE minimization. This connection justifies MSE whenever the residuals are approximately Gaussian and homoscedastic; when they are not — heavy-tailed errors, multiplicative noise, count data — other losses such as the mean absolute error, the Huber loss, or an appropriate generalized linear model objective are statistically more appropriate.
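    The equivalence can be checked numerically: under the model above, the negative log-likelihood is an affine function of the empirical MSE, so both objectives share the same minimizer. A small sketch under the stated assumption of known, constant $ \sigma^2 $:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, n = 0.5, 1000
y = rng.normal(size=n)
y_hat = y + rng.normal(scale=sigma, size=n)   # predictions with Gaussian residuals

residuals = y - y_hat
mse = np.mean(residuals ** 2)

# Negative log-likelihood of the data under y_i ~ N(f(x_i), sigma^2)
nll = 0.5 * np.sum(residuals ** 2) / sigma**2 + 0.5 * n * np.log(2 * np.pi * sigma**2)

# nll == n / (2 sigma^2) * MSE + constant, so minimizing either gives the same f
const = 0.5 * n * np.log(2 * np.pi * sigma**2)
print(np.isclose(nll, n * mse / (2 * sigma**2) + const))  # True
```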

    The Gaussian view also yields the natural Bayesian counterpart: with a Gaussian prior on $ f $'s parameters, MSE plus L2 regularization is the negative log-posterior, recovering ridge regression. The regularization coefficient corresponds to the noise-to-prior variance ratio $ \sigma^2 / \tau^2 $, where $ \tau^2 $ is the prior variance on the weights.
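    As an illustration of this correspondence (the data, noise level, and prior variance below are arbitrary), the MAP estimate under a Gaussian prior with variance $ \tau^2 $ is ridge regression with $ \lambda = \sigma^2 / \tau^2 $:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
w_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
sigma, tau = 0.3, 1.0                      # noise and prior standard deviations
y = X @ w_true + rng.normal(scale=sigma, size=200)

lam = sigma**2 / tau**2                    # noise-to-prior variance ratio
# Ridge / MAP solution: argmin ||y - Xw||^2 + lam * ||w||^2
w_map = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
print(np.round(w_map, 2))                  # close to w_true
```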

    Bias-variance decomposition

    The expected squared error of an estimator $ \hat{f}(x) $ at a point $ x $ decomposes as

    $ {\displaystyle \mathbb{E}\!\left[(y - \hat{f}(x))^2\right] = \underbrace{\bigl(\mathbb{E}[\hat{f}(x)] - f^*(x)\bigr)^2}_{\text{bias}^2} + \underbrace{\mathbb{E}\!\left[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\right]}_{\text{variance}} + \underbrace{\mathrm{Var}(\varepsilon)}_{\text{irreducible}}.} $

    The decomposition is specific to squared loss — analogous expressions for other losses involve cross-terms or no clean decomposition at all — and makes MSE the natural metric for analyzing the fundamental trade-off in supervised learning. Increasing model capacity typically reduces bias and inflates variance; regularization, ensembling, and early stopping can be understood as variance-reduction techniques. The irreducible term lower-bounds achievable test error: even the Bayes-optimal predictor incurs $ \mathrm{Var}(\varepsilon) $.
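    The three terms can be estimated by simulation: draw many training sets, fit the same estimator to each, and compare its predictions at a fixed test point with the true regression function. A hedged sketch (the data-generating function, noise level, and polynomial degree are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(2)
f_star = lambda x: np.sin(2 * np.pi * x)   # true regression function
noise_sd = 0.3
x_test = 0.35                              # fixed evaluation point
degree = 3                                 # model-capacity knob

preds = []
for _ in range(2000):                      # many independent training sets
    x = rng.uniform(0.0, 1.0, size=30)
    y = f_star(x) + rng.normal(scale=noise_sd, size=30)
    coefs = np.polyfit(x, y, deg=degree)
    preds.append(np.polyval(coefs, x_test))
preds = np.array(preds)

bias_sq = (preds.mean() - f_star(x_test)) ** 2
variance = preds.var()
irreducible = noise_sd ** 2
# Expected squared error at x_test ~= bias^2 + variance + irreducible noise
print(bias_sq, variance, irreducible, bias_sq + variance + irreducible)
```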

    Properties and gradient

    The per-example loss $ \ell(y, \hat{y}) = (y - \hat{y})^2 $ is convex in $ \hat{y} $, infinitely differentiable, and grows quadratically. Its derivative with respect to the prediction is

    $ {\displaystyle \frac{\partial \ell}{\partial \hat{y}} = -2 (y - \hat{y}),} $

    so the gradient magnitude scales linearly with the residual. This linear-in-residual gradient is convenient for gradient descent: well-fit examples contribute almost nothing to the update, while examples with large residuals dominate it. In deep networks trained with MSE, a handful of far-off examples can therefore dominate the gradient, and progress on examples that are already close to their targets slows accordingly.
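    A small numerical sketch of the linear-in-residual gradient, with one far-off example dominating a gradient-descent step taken directly on the predictions (purely illustrative):

```python
import numpy as np

y = np.array([1.0, 2.0, 10.0])     # the last target is far from its prediction
y_hat = np.array([1.1, 1.9, 2.0])

# Per-example gradient of (y_i - y_hat_i)^2 with respect to y_hat_i
grads = -2.0 * (y - y_hat)
print(grads)                       # [ 0.2 -0.2 -16. ]: the outlier dominates

# One gradient-descent step directly on the predictions
lr = 0.1
print(y_hat - lr * grads)          # the far-off prediction moves the most
```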

    For a linear model $ \hat{y} = w^\top x + b $, the MSE objective is a positive-semidefinite quadratic form in $ (w, b) $ and admits the closed-form normal-equations solution $ w^* = (X^\top X)^{-1} X^\top y $ when $ X^\top X $ is invertible (with a constant column appended to $ X $ to absorb $ b $). The Gauss-Markov theorem guarantees that this estimator is the best linear unbiased estimator under homoscedastic, uncorrelated noise — the historical reason MSE became the default regression criterion.
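    A sketch of the closed-form fit on synthetic data, with a constant column appended to absorb the bias; `np.linalg.lstsq` serves as a numerically more robust cross-check:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.7 + rng.normal(scale=0.1, size=100)

X_aug = np.hstack([X, np.ones((100, 1))])                  # absorb the bias b
w_normal = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y)   # normal equations
w_lstsq, *_ = np.linalg.lstsq(X_aug, y, rcond=None)        # QR/SVD-based solver

print(np.allclose(w_normal, w_lstsq))   # True when X^T X is well conditioned
```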

    Variants

    Several modifications of MSE address its limitations or specialize it for particular tasks (a short code sketch of a few of them follows the list):

    • Root mean squared error (RMSE) — $ \sqrt{\mathrm{MSE}} $. Reported in the original target units; preferred for human-readable evaluation but equivalent to MSE for ranking models.
    • Mean squared logarithmic error (MSLE) — $ \frac{1}{n} \sum (\log(1 + y_i) - \log(1 + \hat{y}_i))^2 $. Penalizes relative rather than absolute error; appropriate for targets spanning multiple orders of magnitude such as prices or counts.
    • Weighted MSE — $ \frac{1}{n} \sum w_i (y_i - \hat{y}_i)^2 $. Allows per-example reweighting for class imbalance, importance sampling, or heteroscedasticity correction (with $ w_i = 1/\sigma_i^2 $, the weighted least squares objective).
    • Mean squared percentage error (MSPE) — $ \frac{1}{n} \sum ((y_i - \hat{y}_i)/y_i)^2 $. Scale-free but undefined or unstable when $ y_i \approx 0 $.
    • Truncated or trimmed MSE — caps or removes the largest residuals before averaging, a practical robustness fix when a small number of outliers dominate.
    • Mean integrated squared error (MISE) — the function-space analogue used to evaluate density estimators and kernel smoothers.
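    A minimal sketch of three of these variants (the helper names are my own, and MSLE assumes non-negative targets):

```python
import numpy as np

def rmse(y, y_hat):
    return np.sqrt(np.mean((y - y_hat) ** 2))

def msle(y, y_hat):
    # log1p keeps zero targets well defined; assumes y, y_hat >= 0
    return np.mean((np.log1p(y) - np.log1p(y_hat)) ** 2)

def weighted_mse(y, y_hat, w):
    # With w_i = 1 / sigma_i^2 this matches the weighted least squares objective
    return np.mean(w * (y - y_hat) ** 2)
```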

    Comparison with other regression losses

    Choice of regression loss is dominated by the noise distribution and the desired robustness profile:

    • Mean absolute error (MAE) uses $ |y - \hat{y}| $. The optimal predictor is the conditional median rather than the mean, and the gradient has constant magnitude, making MAE more robust to outliers but harder to optimize near zero error. MAE is the maximum-likelihood objective under Laplacian noise.
    • Huber loss interpolates: quadratic for small residuals, linear for large ones. It retains MSE's smoothness near zero while bounding gradient magnitude for outliers, and is a common default in robust regression.
    • Quantile loss (pinball loss) targets a specified quantile rather than the mean and underlies quantile regression and probabilistic forecasting.
    • Log-cosh loss, $ \log(\cosh(y - \hat{y})) $, is approximately MSE for small residuals and approximately MAE for large ones, and is differentiable everywhere.
    • Cross-entropy loss is the analogous default for classification and density estimation; using MSE on predicted class probabilities is generally inferior because gradients vanish for confident wrong predictions.

    When residuals are approximately Gaussian and outliers rare, MSE is statistically optimal. When the noise is heavy-tailed, asymmetric, or scale-dependent, a tailored loss usually outperforms.
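    The robustness difference is easiest to see with constant predictions: the MSE-optimal constant is the sample mean, the MAE-optimal constant is the sample median, and a single outlier moves the former far more than the latter. A short illustrative sketch:

```python
import numpy as np

data = np.array([1.0, 1.2, 0.9, 1.1, 1.0])
with_outlier = np.append(data, 50.0)

# MSE-optimal constant = mean; MAE-optimal constant = median
print(np.mean(data), np.median(data))                   # 1.04  1.0
print(np.mean(with_outlier), np.median(with_outlier))   # 9.2   1.05
```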

    Limitations

    The squared term gives MSE several well-known failure modes:

    • Outlier sensitivity. A single example with a large residual can dominate the gradient and the parameter estimate. Robust alternatives or preprocessing (winsorizing, log transforms) are advisable when outliers are present.
    • Scale dependence. MSE values are not directly comparable across datasets or tasks. Normalized variants (RMSE divided by the target standard deviation, $ R^2 $) are preferred for cross-task comparison.
    • Mean-targeting. The optimal MSE predictor is the conditional mean. For skewed conditional distributions this can be a poor point estimate; quantile or expectile losses give different summaries.
    • Misleading on bounded targets. For probabilities, percentages, or other bounded targets, MSE does not respect the boundary and can produce predictions outside the valid range.
    • Vanishing gradients with sigmoid outputs. Combining MSE with a saturating output activation produces gradients proportional to $ (y - \hat{y}) \sigma'(z) $, which can be extremely small for confident wrong predictions; cross-entropy avoids this pathology (see the sketch after this list).
    • No probabilistic calibration. MSE training yields a point estimate, not a predictive distribution. Methods such as Gaussian process regression or deep ensembles are required when uncertainty quantification is needed.
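    The vanishing-gradient point above can be seen by differentiating both losses with respect to the logit $ z $ for a confidently wrong prediction (a sketch using the standard sigmoid and binary cross-entropy definitions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

y = 1.0        # true label
z = -8.0       # confidently wrong logit: sigmoid(z) ~ 3.4e-4
p = sigmoid(z)

# MSE on the probability: d/dz (y - p)^2 = -2 (y - p) * p * (1 - p)
grad_mse = -2.0 * (y - p) * p * (1.0 - p)

# Binary cross-entropy: d/dz [-y log p - (1 - y) log(1 - p)] = p - y
grad_ce = p - y

print(grad_mse)   # ~ -6.7e-4: sigma'(z) has saturated, almost no learning signal
print(grad_ce)    # ~ -0.9997: a large corrective signal
```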

    Despite these caveats, MSE remains the default regression loss in scientific computing, statistics, and machine learning because of its mathematical tractability, its connection to Gaussian likelihood, and its compatibility with the bias-variance decomposition that organizes much of supervised learning theory.
