Bias in Machine Learning
| Article | |
|---|---|
| Topic area | Machine Learning |
| Prerequisites | Supervised Learning, Loss Function, Generalization |
Overview
In machine learning, bias refers to systematic error: a learned model's predictions deviate from the truth in a consistent, non-random way. The word covers two related but distinct ideas. The first is statistical: the expected gap between a learning algorithm's predictions and the target function it is trying to recover, measured across hypothetical retrainings on different samples. The second is societal: a model that performs unequally across groups, encodes stereotypes, or amplifies historical inequities present in its training data. Both senses share a common structure — they describe error that does not vanish by collecting more data of the same kind — but they are studied with different tools and motivated by different concerns.
Bias is unavoidable. Every learning algorithm encodes assumptions about which functions are likely, which features matter, and how examples should be weighted; this is the inductive bias that lets generalization happen at all. The practical question is therefore not how to eliminate bias but how to choose, measure, and disclose it. This article surveys the statistical decomposition that frames the topic, the inductive biases baked into common model families, the dataset and labeling biases that distort supervised learning, the algorithmic and societal biases that arise downstream, and the techniques used to diagnose and mitigate each.
The bias-variance decomposition
For squared-error regression with target $ y = f(x) + \varepsilon $ and learned predictor $ \hat{f} $, the expected error at a point $ x $ decomposes as
$ \mathbb{E}\bigl[(y - \hat{f}(x))^2\bigr] = \underbrace{\bigl(\mathbb{E}[\hat{f}(x)] - f(x)\bigr)^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}\bigl[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\bigr]}_{\text{Variance}} + \sigma^2, $
where the expectation is taken over training sets drawn from the same distribution and $ \sigma^2 $ is irreducible noise. Bias measures how far the average learned model is from the truth; variance measures how much an individual model fluctuates around that average. High bias is the signature of underfitting — the hypothesis class is too restrictive to capture $ f $. High variance is the signature of overfitting — the class is flexible enough to chase noise.
The classical bias-variance tradeoff holds that reducing one tends to inflate the other, with model capacity as the knob. This picture is clean for low-capacity classical models but only partly captures modern overparameterized networks, where the double descent phenomenon shows that test error can fall again past the interpolation threshold. The bias-variance frame remains the right starting point, but it is not the whole story for deep models.
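These quantities can be estimated numerically by retraining on many fresh samples and averaging the resulting predictors. The sketch below is illustrative only: the sinusoidal target, noise level, and polynomial degrees are assumptions chosen for the example, and it uses NumPy's polynomial least-squares fit as the learner.

```python
# Estimate bias^2 and variance by refitting on many fresh training sets.
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    return np.sin(2 * np.pi * x)

sigma = 0.3                     # standard deviation of irreducible noise
n_train, n_repeats = 30, 500
x_test = np.linspace(0, 1, 100)

for degree in (1, 4, 9):
    preds = np.empty((n_repeats, x_test.size))
    for r in range(n_repeats):
        # Fresh training set each repeat: same distribution, new noise.
        x_tr = rng.uniform(0, 1, n_train)
        y_tr = true_f(x_tr) + rng.normal(0, sigma, n_train)
        coeffs = np.polyfit(x_tr, y_tr, degree)
        preds[r] = np.polyval(coeffs, x_test)
    mean_pred = preds.mean(axis=0)
    bias2 = np.mean((mean_pred - true_f(x_test)) ** 2)   # (E[f_hat] - f)^2
    variance = np.mean(preds.var(axis=0))                 # E[(f_hat - E[f_hat])^2]
    print(f"degree {degree}: bias^2={bias2:.3f}  variance={variance:.3f}  "
          f"bias^2+variance+sigma^2={bias2 + variance + sigma**2:.3f}")
```

Increasing the degree typically drives the bias term down and the variance term up, matching the tradeoff described above.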
Inductive bias
Every algorithm prefers some hypotheses over others — without such a preference, no finite training set could pick out a single function. This preference is called the inductive bias and is what makes generalization possible. Examples include the smoothness assumption of $ k $-nearest neighbors, the linearity of linear regression, the locality and translation equivariance of convolutional neural networks, the permutation equivariance of graph neural networks, and the recency-decay of recurrent models. Architectural choices, regularizers, priors, optimizer geometry, and even the order of training data all contribute.
Strong inductive biases improve sample efficiency on tasks aligned with the bias and hurt it on tasks that are not. The shift in modern deep learning toward weaker structural priors and larger datasets — most visibly in transformers replacing convolutions and recurrence in many domains — is a deliberate trade: less helpful prior, more data and compute to compensate.
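A rough illustration of this trade, assuming scikit-learn is available: on a target that really is linear, the strong linearity assumption of linear regression pays off in sample efficiency, while the weaker smoothness-only assumption of k-nearest neighbors needs many more examples in ten dimensions. The target, dimensionality, and sample sizes below are arbitrary choices for the example.

```python
# Compare sample efficiency of a well-aligned inductive bias (linearity)
# against a weaker one (local smoothness) on a linear target.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
d = 10

def make_data(n):
    X = rng.normal(size=(n, d))
    w = np.arange(1, d + 1, dtype=float)   # fixed linear target
    y = X @ w + rng.normal(0, 1.0, n)
    return X, y

X_test, y_test = make_data(2000)
for n in (20, 50, 200, 1000):
    X_tr, y_tr = make_data(n)
    for name, model in [("linear", LinearRegression()),
                        ("5-NN", KNeighborsRegressor(n_neighbors=5))]:
        model.fit(X_tr, y_tr)
        mse = mean_squared_error(y_test, model.predict(X_test))
        print(f"n={n:5d}  {name:6s}  test MSE={mse:8.2f}")
```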
Dataset and labeling biases
Statistical learning theory assumes the training distribution matches the deployment distribution. In practice it rarely does, and the gap is often called dataset bias. Common forms include:
- Selection bias. The training sample is drawn non-uniformly from the population of interest. Survey nonresponse, opt-in data collection, and convenience sampling all produce it.
- Sampling bias. Some subgroups are systematically over- or under-represented relative to deployment frequencies.
- Survivorship bias. Only entities that persisted into the dataset are observed; failures are absent.
- Reporting and measurement bias. Recorded labels reflect what was measured or volunteered rather than the underlying construct — for example, recorded crime correlates with policing intensity, not crime itself.
- Label noise and annotator bias. Human labelers disagree, follow inconsistent guidelines, or import their own assumptions; aggregation can obscure systematic disagreement.
- Historical bias. Even a perfectly sampled, perfectly labeled dataset can encode patterns from a world the user does not want to perpetuate, such as historical hiring decisions or lending outcomes.
- Distribution shift. Covariate shift, label shift, and concept drift describe changes between training and deployment that violate the i.i.d. assumption.
These are properties of the data pipeline, not the optimizer, so they cannot be fixed by training longer or scaling the model. They show up as confidently wrong predictions on populations the data underrepresents.
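The following sketch illustrates selection bias concretely, again assuming scikit-learn; the two groups, their feature distributions, and the sampling rates are invented for the example. A classifier trained on a sample that nearly omits one group scores well on the well-represented group but fails on the other at deployment time.

```python
# A schematic demonstration of selection bias with synthetic groups.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)

def make_group(n, shift):
    # Each group has a different mapping from features to label.
    X = rng.normal(loc=shift, scale=1.0, size=(n, 2))
    logits = X[:, 0] - shift * X[:, 1]
    y = (logits + rng.normal(0, 0.5, n) > 0).astype(int)
    return X, y

# Deployment population: both groups matter. Training sample: group B almost absent.
Xa_tr, ya_tr = make_group(5000, shift=0.0)
Xb_tr, yb_tr = make_group(100,  shift=2.0)    # underrepresented at training time
X_tr = np.vstack([Xa_tr, Xb_tr])
y_tr = np.concatenate([ya_tr, yb_tr])

clf = LogisticRegression().fit(X_tr, y_tr)

for name, (X_te, y_te) in {"group A": make_group(5000, 0.0),
                           "group B": make_group(5000, 2.0)}.items():
    print(f"{name}: test accuracy = {clf.score(X_te, y_te):.3f}")
```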
Algorithmic and societal bias
When a model trained on biased data is deployed in a consequential setting — credit, hiring, healthcare, content moderation, search ranking — the statistical asymmetries become social ones. A widely cited example is the COMPAS recidivism risk tool, which research found assigned higher false-positive rates to Black defendants than white defendants on a benchmark dataset. Similar disparities have been documented in commercial face-recognition error rates, clinical decision support, and ad delivery.
Researchers formalize these concerns through group fairness criteria such as demographic parity (equal positive rates across groups), equalized odds (equal true- and false-positive rates), and calibration (predicted probability matches realized rate within each group). A foundational impossibility result shows that, except in degenerate cases, no single classifier can satisfy calibration and equalized odds simultaneously when base rates differ across groups.[1][2] Choosing among fairness definitions is therefore a value judgment, not a purely technical one.
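These criteria are straightforward to compute once predictions, labels, and group membership are available. The sketch below uses plain NumPy on synthetic placeholder data; the threshold, group labels, and score model are assumptions, and the calibration check is deliberately crude (a per-group mean-score gap rather than a full reliability curve).

```python
# Per-group fairness metrics: positive rate, TPR/FPR, and a crude calibration gap.
import numpy as np

def group_metrics(y_true, y_score, group, threshold=0.5):
    y_hat = (y_score >= threshold).astype(int)
    out = {}
    for g in np.unique(group):
        m = group == g
        out[g] = dict(
            pos_rate=y_hat[m].mean(),                        # demographic parity
            tpr=y_hat[m][y_true[m] == 1].mean(),             # equalized odds (TPR)
            fpr=y_hat[m][y_true[m] == 0].mean(),             # equalized odds (FPR)
            calib_gap=y_score[m].mean() - y_true[m].mean(),  # crude calibration check
        )
    return out

# Synthetic placeholders standing in for a real model's scores and real groups.
rng = np.random.default_rng(3)
n = 10_000
group = rng.choice(["A", "B"], size=n)
base = np.where(group == "A", 0.3, 0.5)        # different base rates by group
y_true = rng.binomial(1, base)
y_score = np.clip(base + rng.normal(0, 0.2, n), 0, 1)

for g, metrics in group_metrics(y_true, y_score, group).items():
    print(g, {k: round(v, 3) for k, v in metrics.items()})
```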
Two closely related problems are shortcut learning — the model latches onto spurious features that happen to correlate with the label in training, such as image background or hospital-specific pixel artifacts — and bias amplification, where a model's output distribution is more skewed than its training distribution because, under uncertainty, predicting the majority outcome is often the loss-minimizing choice.
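Shortcut learning is easy to reproduce in miniature. In the sketch below (scikit-learn assumed, all data synthetic), a "watermark" feature agrees with the label 95% of the time during training and is pure noise at test time; the classifier leans on it, and its test accuracy falls well below its training accuracy.

```python
# A toy illustration of shortcut learning with a spurious "watermark" feature.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)

def make_data(n, shortcut_corr):
    y = rng.binomial(1, 0.5, n)
    signal = y + rng.normal(0, 1.5, n)                 # weak but genuine feature
    # Spurious feature agrees with the label with probability shortcut_corr.
    watermark = np.where(rng.random(n) < shortcut_corr, y, 1 - y)
    return np.column_stack([signal, watermark]), y

X_tr, y_tr = make_data(20_000, shortcut_corr=0.95)   # shortcut works at training time
X_te, y_te = make_data(20_000, shortcut_corr=0.50)   # ...and is pure noise at test time

clf = LogisticRegression().fit(X_tr, y_tr)
print("train accuracy:", round(clf.score(X_tr, y_tr), 3))
print("test  accuracy:", round(clf.score(X_te, y_te), 3))
print("coefficients [signal, watermark]:", np.round(clf.coef_[0], 2))
```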
Diagnosis
Diagnosing bias requires looking beyond aggregate accuracy. Common practices:
- Slice metrics across subgroups defined by sensitive attributes, geography, time, or input characteristics.
- Compare error rates, not just accuracy, since classes with low base rates can hide failure under high overall accuracy.
- Use counterfactual perturbations — change a name, gender token, or accent and check whether predictions move (see the sketch after this list).
- Probe representations for sensitive attributes; high probing accuracy on a removed attribute suggests it is encoded indirectly.
- Inspect calibration curves per group, not just globally.
- Audit the training corpus directly: token frequencies, demographic coverage, label rates per slice.
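As an example of the counterfactual check referenced above, the following schematic swaps a small set of gendered tokens and measures how much a model's score moves. The `score` function, the swap list, and the usage lines are placeholders rather than a prescribed interface.

```python
# A schematic counterfactual perturbation test for text classifiers.
SWAPS = [("he", "she"), ("his", "her"), ("him", "her"),
         ("Mr.", "Ms."), ("John", "Jane")]

def swap_tokens(text: str) -> str:
    # Swap each pair in both directions, token by token (whitespace split only).
    mapping = {}
    for a, b in SWAPS:
        mapping[a] = b
        mapping[b] = a
    return " ".join(mapping.get(tok, tok) for tok in text.split())

def counterfactual_gap(score, texts):
    """Mean absolute change in score when gendered tokens are swapped."""
    gaps = [abs(score(t) - score(swap_tokens(t))) for t in texts]
    return sum(gaps) / len(gaps)

# Usage sketch (assumes a trained `model` exposing predict_proba on raw text):
# gap = counterfactual_gap(lambda t: model.predict_proba([t])[0, 1], held_out_texts)
# A gap far from zero signals that predictions move with the swapped attribute.
```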
The harder diagnostic problem is unknown unknowns: subgroups or contexts the auditor did not think to slice on. Practices such as model cards, datasheets for datasets, and external red-teaming help surface them, but no procedure is exhaustive.
Mitigation
Mitigation strategies are usually grouped by where they intervene in the pipeline:
- Pre-processing acts on the data: reweighting, resampling, augmenting underrepresented subgroups, removing or transforming sensitive features, and synthesizing balanced examples (a reweighting sketch follows this list). Cheap and modular but limited because the model may still recover the protected attribute from correlated features.
- In-processing modifies the loss function or constraints: adversarial debiasing, fairness-constrained optimization, regularization toward equality of error rates, or invariance penalties that discourage the representation from encoding the protected attribute.
- Post-processing adjusts the output: calibrated group-specific thresholds, reject-option classification, or score transformations that equalize a chosen metric.
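As an example of the pre-processing route referenced above, the sketch below computes per-example weights that make group membership and label statistically independent under the weighted training distribution. It uses only NumPy, and the usage lines assume a scikit-learn-style estimator that accepts `sample_weight`.

```python
# Reweight examples so that group and label are independent under the
# weighted empirical distribution (one common pre-processing scheme).
import numpy as np

def independence_weights(group, y):
    group = np.asarray(group)
    y = np.asarray(y)
    w = np.empty(len(y), dtype=float)
    for g in np.unique(group):
        for label in np.unique(y):
            m = (group == g) & (y == label)
            p_joint = m.mean()
            if p_joint > 0:
                # Ratio of the independent joint probability to the observed one.
                w[m] = (group == g).mean() * (y == label).mean() / p_joint
    return w

# Usage sketch:
# w = independence_weights(group_train, y_train)
# clf = LogisticRegression().fit(X_train, y_train, sample_weight=w)
```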
For societal bias, technical mitigation is necessary but not sufficient. It must be paired with deployment-time monitoring, recourse mechanisms for affected users, and governance — including the option not to deploy. For statistical bias in the bias-variance sense, mitigation looks different: increase capacity, add features, or relax regularization to drive bias down at the cost of variance.
Comparisons and limitations
Statistical bias and societal bias are often conflated in informal usage but answer different questions. Statistical bias asks whether the average model converges to the true function as the sample grows; societal bias asks whether a deployed model treats people equitably. A model can be statistically unbiased and socially harmful (it faithfully reproduces an unjust status quo), or socially fair on a chosen metric while statistically inconsistent. Mitigations for one may worsen the other: enforcing equalized odds can reduce calibration; reducing variance through heavy regularization can entrench majority-group patterns.
The current literature has important limitations. Most fairness metrics presuppose discrete sensitive attributes that are observable, accurate, and stable, conditions that often fail in practice. Causal frameworks promise to address some of these gaps but require strong, frequently untestable assumptions. Finally, fairness criteria are local to a single decision; system-level effects such as feedback loops between predictions and future training data are not captured by any per-prediction metric.