Calibration of Predictions

    Topic area: Machine Learning
    Prerequisites: Logistic Regression, Cross-Entropy Loss, Probability


    Overview

    Calibration of predictions is the property that the probabilistic outputs of a model match the empirical frequencies of the events they predict. A binary classifier is calibrated when, among all inputs assigned probability $ p $, approximately a fraction $ p $ actually belong to the positive class. Calibration is distinct from accuracy: a model can be highly accurate yet poorly calibrated, or well calibrated yet inaccurate. In modern deep networks calibration is typically degraded by overconfidence: softmax probabilities concentrate near 0 and 1 even on examples the model misclassifies.

    Calibration matters wherever predicted probabilities feed downstream decisions: medical risk scoring, weather forecasting, ranking, ensembling, active learning, selective prediction, and Bayesian decision theory. For these applications, the loss incurred from a miscalibrated probability can dominate the loss from a misclassification. As a result, calibration is studied as both a diagnostic property of trained models and a target of dedicated post-hoc and in-training methods.

    Formal definition

    Let $ (X, Y) $ be a random pair with $ Y \in \{1, \ldots, K\} $ and let $ f: \mathcal{X} \to \Delta^{K-1} $ be a probabilistic classifier outputting a distribution over classes. Write $ \hat{p}(x) = \max_k f_k(x) $ for the confidence and $ \hat{y}(x) = \arg\max_k f_k(x) $ for the predicted label, and let $ \hat{P} = \hat{p}(X) $ and $ \hat{Y} = \hat{y}(X) $ denote the corresponding random variables. The model is perfectly calibrated (top-label calibration) if for every confidence level $ p \in [0, 1] $:

    $ {\displaystyle \Pr\bigl[\hat{Y} = Y \mid \hat{P} = p\bigr] = p.} $

    A finer notion, class-wise calibration, requires for every class $ k $ and probability level $ p $:

    $ {\displaystyle \Pr\bigl[Y = k \mid f_k(X) = p\bigr] = p.} $

    The strongest form, multiclass calibration or distribution calibration, requires the entire predicted distribution to match the conditional class distribution. These notions form a hierarchy: distribution calibration implies class-wise calibration, which implies top-label calibration. Most empirical work measures top-label calibration because it is identifiable from a moderate sample.

    Measuring calibration

    Because the conditional probability $ \Pr[\hat{Y} = Y \mid \hat{P} = p] $ cannot be estimated pointwise from finite data, calibration is measured through aggregated statistics.

    Reliability diagrams

    A reliability diagram bins predictions by confidence into $ M $ intervals $ B_1, \ldots, B_M $ and plots, for each bin, the average confidence against the empirical accuracy. Perfect calibration appears as the identity line; systematic deviation above the line indicates underconfidence and below the line indicates overconfidence.
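
    The per-bin statistics behind a reliability diagram are straightforward to compute. The sketch below, in NumPy and Matplotlib, assumes two arrays not defined in this article: confidences (top-label confidences) and correct (0/1 indicators of whether each prediction was right).

```python
import numpy as np
import matplotlib.pyplot as plt

def reliability_diagram(confidences, correct, n_bins=15):
    """Plot per-bin mean confidence against per-bin empirical accuracy."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Assign each prediction to an equal-width confidence bin.
    bin_ids = np.clip(np.digitize(confidences, edges, right=True) - 1, 0, n_bins - 1)
    accs, confs = [], []
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            accs.append(correct[mask].mean())
            confs.append(confidences[mask].mean())
    plt.plot([0, 1], [0, 1], "k--", label="perfect calibration")
    plt.plot(confs, accs, "o-", label="model")
    plt.xlabel("mean confidence per bin")
    plt.ylabel("empirical accuracy per bin")
    plt.legend()
    plt.show()
```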

    Expected Calibration Error

    The most common scalar summary is the Expected Calibration Error (ECE):

    $ {\displaystyle \mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{n} \bigl| \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \bigr|,} $

    where $ \mathrm{acc}(B_m) $ is the fraction of correct predictions in bin $ B_m $ and $ \mathrm{conf}(B_m) $ is the average confidence in that bin. ECE is sensitive to the binning scheme; equal-width and equal-mass binning give different values, and ECE is biased upward at small sample sizes. Maximum Calibration Error (MCE) replaces the weighted sum with a maximum over bins for safety-critical applications. Adaptive variants such as Adaptive ECE rebalance bin populations to reduce variance.
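
    A sketch of ECE (and the worst-bin MCE) with equal-width bins, using the same assumed confidences and correct arrays as above:

```python
import numpy as np

def calibration_errors(confidences, correct, n_bins=15):
    """Equal-width-bin ECE (weighted mean gap) and MCE (largest per-bin gap)."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.clip(np.digitize(confidences, edges, right=True) - 1, 0, n_bins - 1)
    n = len(confidences)
    ece, mce = 0.0, 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.sum() / n * gap
            mce = max(mce, gap)
    return ece, mce
```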

    Proper scoring rules

    A proper scoring rule is a loss $ S(f, y) $ minimized in expectation by the true conditional distribution. The Brier score

    $ {\displaystyle \mathrm{BS} = \frac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{K} (f_k(x_i) - \mathbb{1}[y_i = k])^2} $

    and the negative log-likelihood $ \mathrm{NLL} = -\frac{1}{n} \sum_i \log f_{y_i}(x_i) $ are both strictly proper. Proper scoring rules decompose into a calibration term and a refinement (sharpness) term, providing a principled alternative to ECE that is not subject to binning artifacts.
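
    Both scores can be computed directly from the predicted class-probability matrix. A sketch assuming probs is an n-by-K array of predicted distributions and labels holds integer class indices:

```python
import numpy as np

def brier_score(probs, labels):
    """Mean squared distance between predicted distributions and one-hot targets."""
    n, k = probs.shape
    onehot = np.eye(k)[labels]
    return np.mean(np.sum((probs - onehot) ** 2, axis=1))

def negative_log_likelihood(probs, labels, eps=1e-12):
    """Average negative log-probability assigned to the true class."""
    return -np.mean(np.log(probs[np.arange(len(labels)), labels] + eps))
```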

    Sources of miscalibration

    Modern neural networks are typically overconfident: training with cross-entropy until convergence drives logits to large magnitudes, pushing softmax probabilities toward the simplex corners regardless of whether the predicted class is correct. Several mechanisms contribute. Large-capacity models keep reducing training NLL long after classification error has stopped improving, which shows up at test time as overconfidence. Reducing weight decay, adding batch normalization, and training for longer have all been observed to worsen calibration. Distribution shift between train and test data further breaks calibration even when in-distribution calibration is good, as the model assigns high confidence to inputs unlike anything seen in training.

    Label smoothing, mixup, and stochastic depth tend to improve calibration as a side effect because they prevent the network from achieving zero loss and thus discourage extreme logits. Data augmentation that injects realistic input variability has a similar effect.

    Post-hoc calibration methods

    Post-hoc methods recalibrate a fixed trained model using a held-out validation set, leaving the underlying classifier untouched. They are cheap, modular, and the standard first response to a miscalibrated network.

    Platt scaling

    Platt Scaling fits a logistic regression on the model's scores. For binary classification with score $ z(x) $ it learns scalars $ a, b $ such that $ \hat{p}(x) = \sigma(a \cdot z(x) + b) $ by minimizing NLL on validation data. Platt scaling is parametric, well-suited to small validation sets, and assumes a sigmoidal distortion of the underlying scores.
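
    Because Platt scaling is a one-feature logistic regression on the raw score, it can be sketched with scikit-learn. The names scores_val, y_val, and scores_test below are assumed validation scores, binary labels, and test scores; Platt's original formulation also smooths the 0/1 targets, which is omitted here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Fit p(y = 1 | z) = sigmoid(a * z + b) on held-out validation scores.
platt = LogisticRegression(C=1e6)  # large C approximates an unregularized fit
platt.fit(scores_val.reshape(-1, 1), y_val)

# Calibrated probabilities for new scores.
p_test = platt.predict_proba(scores_test.reshape(-1, 1))[:, 1]
```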

    Isotonic regression

    Isotonic regression fits a non-decreasing step function from raw scores to calibrated probabilities by minimizing squared error subject to monotonicity. It is non-parametric and strictly more expressive than Platt scaling but requires more data and can overfit on small validation sets. The pool-adjacent-violators algorithm solves it in $ O(n) $ time after sorting.
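
    A sketch using scikit-learn's IsotonicRegression on the same assumed held-out scores and labels as in the Platt example:

```python
from sklearn.isotonic import IsotonicRegression

# Non-decreasing map from raw scores to probabilities, fit by squared error
# on the validation set via pool-adjacent-violators.
iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
iso.fit(scores_val, y_val)

p_test = iso.predict(scores_test)
```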

    Temperature scaling

    For multiclass networks, temperature scaling rescales the logits $ z $ by a single learned scalar $ T > 0 $:

    $ {\displaystyle f_k(x) = \frac{\exp(z_k(x)/T)}{\sum_j \exp(z_j(x)/T)}.} $

    $ T $ is fit by minimizing NLL on a held-out set. Because dividing the logits by a temperature is a monotonic transformation, the argmax and therefore the accuracy are preserved exactly. Temperature scaling is the default post-hoc method for deep classifiers; despite having a single parameter it usually matches or exceeds richer alternatives. Vector and matrix scaling extend it with per-class or full-rank linear transformations, at the cost of no longer preserving accuracy and of larger validation-data requirements.
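
    Fitting $ T $ reduces to a one-dimensional NLL minimization over the validation logits. A NumPy/SciPy sketch, assuming logits_val, labels_val, and logits_test are given:

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import log_softmax

def nll_at_temperature(T, logits, labels):
    """Average NLL of the labels under softmax(logits / T)."""
    logp = log_softmax(logits / T, axis=1)
    return -np.mean(logp[np.arange(len(labels)), labels])

# One-dimensional bounded search over T > 0; the argmax is unchanged for any T.
res = minimize_scalar(nll_at_temperature, bounds=(0.05, 20.0), method="bounded",
                      args=(logits_val, labels_val))
T_opt = res.x
probs_test = np.exp(log_softmax(logits_test / T_opt, axis=1))
```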

    Histogram and Bayesian binning

    Histogram binning replaces the score-to-probability mapping with bin-wise empirical accuracy. Bayesian Binning into Quantiles (BBQ) averages over multiple binning schemes weighted by their posterior plausibility, reducing the bin-choice sensitivity of histogram binning at higher computational cost.
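
    A sketch of plain histogram binning with equal-width bins, each bin mapping to the empirical accuracy of the validation predictions it contains (inputs as in the earlier sketches); BBQ's averaging over binning schemes is beyond this sketch.

```python
import numpy as np

def fit_histogram_binning(confidences, correct, n_bins=15):
    """Learn one calibrated probability per confidence bin from validation data."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.clip(np.digitize(confidences, edges, right=True) - 1, 0, n_bins - 1)
    bin_probs = (edges[:-1] + edges[1:]) / 2  # empty bins fall back to the midpoint
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            bin_probs[b] = correct[mask].mean()
    return edges, bin_probs

def apply_histogram_binning(confidences, edges, bin_probs):
    """Replace each confidence with its bin's learned probability."""
    bin_ids = np.clip(np.digitize(confidences, edges, right=True) - 1, 0, len(bin_probs) - 1)
    return bin_probs[bin_ids]
```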

    In-training calibration

    In-training methods modify the loss or training procedure to produce a calibrated model directly. Label smoothing replaces hard one-hot targets with a mixture $ (1 - \alpha) e_y + \alpha / K \cdot \mathbf{1} $, capping the maximum softmax probability and consistently reducing ECE. Focal loss down-weights confident examples and was found to give well-calibrated networks for free. Auxiliary calibration losses such as MMCE add a kernel-based estimate of calibration error to the cross-entropy objective.
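
    The label-smoothed target and its cross-entropy can be written down directly. A NumPy sketch, assuming probs is the n-by-K matrix of predicted distributions, labels holds class indices, and alpha is the smoothing strength:

```python
import numpy as np

def label_smoothing_cross_entropy(probs, labels, alpha=0.1, eps=1e-12):
    """Cross-entropy against (1 - alpha) * one_hot(y) + (alpha / K) * 1 targets."""
    n, k = probs.shape
    targets = np.full((n, k), alpha / k)
    targets[np.arange(n), labels] += 1.0 - alpha
    return -np.mean(np.sum(targets * np.log(probs + eps), axis=1))
```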

    Bayesian methods and approximations such as MC dropout, SWA-Gaussian, and deep ensembles induce predictive distributions whose averaged outputs are typically better calibrated than any single network, especially under distribution shift. Deep ensembles in particular combine sharpness and calibration robustly.
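
    For a deep ensemble, the predictive distribution is the average of the members' softmax outputs (not of their logits). A minimal sketch, assuming member_probs is an M-by-n-by-K array from M independently trained networks:

```python
import numpy as np

# Average probabilities across ensemble members, then predict as usual.
ensemble_probs = member_probs.mean(axis=0)      # shape (n, K)
ensemble_pred = ensemble_probs.argmax(axis=1)
ensemble_conf = ensemble_probs.max(axis=1)
```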

    Comparisons and trade-offs

    The choice of calibration method involves trade-offs along three axes: data requirement, expressiveness, and accuracy preservation. Temperature scaling needs only a few hundred validation examples and preserves the argmax exactly but cannot correct class-conditional biases. Vector and matrix scaling correct such biases but can degrade accuracy and require more data. Isotonic regression and BBQ are more flexible still but data-hungry. Among in-training methods, ensembles offer the best calibration under shift but multiply training and inference cost; label smoothing and focal loss are nearly free but tune away part of the model's expressiveness.

    When comparing methods, evaluate ECE alongside a proper scoring rule (NLL or Brier). A method that improves ECE while raising NLL has overfit the binning scheme and is not actually better calibrated. Robustness to distribution shift should be measured separately, for example on corrupted benchmarks such as ImageNet-C or on natural variation in the deployment data.

    Limitations

    ECE estimates are biased and noisy at small sample sizes; reported improvements within a few tenths of a percent are often not statistically meaningful. Top-label calibration ignores miscalibration in non-predicted classes, which matters for ranking and selective prediction. Most calibration methods assume the test distribution matches the validation set used for calibration, an assumption that breaks under shift; recalibration on a small target sample, distribution-shift-aware methods, and conformal prediction provide partial remedies. Finally, calibration on aggregate does not imply calibration within subgroups: a globally calibrated model can be systematically miscalibrated for minority groups, an algorithmic fairness concern that has motivated subgroup and multicalibration objectives.
