Overfitting and Regularization
Revision as of 06:58, 24 April 2026
| Article | |
|---|---|
| Topic area | Machine Learning |
| Difficulty | Intermediate |
| Prerequisites | Loss Functions, Neural Networks |
Overfitting occurs when a machine-learning model learns the training data too well — capturing noise and idiosyncrasies rather than the underlying pattern — and consequently performs poorly on unseen data. Regularization is the family of techniques used to prevent overfitting and improve a model's ability to generalise.
The bias–variance tradeoff
Prediction error on unseen data can be decomposed into three components:
- $ \text{Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible noise} $
- Bias measures how far the model's average prediction is from the true value. High bias indicates the model is too simple to capture the data's structure (underfitting).
- Variance measures how much predictions fluctuate across different training sets. High variance indicates the model is too sensitive to the particular training data (overfitting).
The goal is to find the sweet spot that minimises total error. A model with too few parameters underfits (high bias); a model with too many parameters overfits (high variance). Regularization techniques tilt the balance by constraining model complexity, accepting slightly higher bias in exchange for substantially lower variance.
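The tradeoff is easy to see on synthetic data. The sketch below (illustrative, not from the article: a quadratic signal, a polynomial fit via `np.polyfit`, and arbitrary degree choices) shows a degree-1 fit underfitting, degree 2 matching the signal, and degree 15 driving training error down while validation error stays worse:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: quadratic signal plus Gaussian noise.
x = rng.uniform(-1, 1, 60)
y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(0, 0.3, 60)
x_train, y_train = x[:40], y[:40]
x_val, y_val = x[40:], y[40:]

def fit_poly_mse(degree):
    """Least-squares polynomial fit; returns (train MSE, validation MSE)."""
    coeffs = np.polyfit(x_train, y_train, degree)
    mse = lambda xs, ys: np.mean((np.polyval(coeffs, xs) - ys) ** 2)
    return mse(x_train, y_train), mse(x_val, y_val)

# Degree 1 underfits (high bias), degree 2 matches the signal,
# degree 15 overfits (low train error, worse validation error).
results = {d: fit_poly_mse(d) for d in (1, 2, 15)}
```

Training error can only decrease as capacity grows (the models are nested), but validation error is minimised at an intermediate complexity.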
Detecting overfitting
The clearest diagnostic is to compare training and validation performance:
- Training loss decreasing, validation loss also decreasing — the model is still learning; continue training.
- Training loss decreasing, validation loss increasing — the model is overfitting; apply regularization or stop training.
- Training loss high, validation loss high — the model is underfitting; increase capacity or train longer.
Plotting these learning curves over training iterations is essential practice. A large gap between training accuracy and validation accuracy is the hallmark of overfitting.
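The three regimes above can be turned into a rough programmatic check. This is a heuristic sketch only; the `diagnose` helper and its `high_loss` threshold are illustrative, not a standard API:

```python
def diagnose(train_losses, val_losses, high_loss=1.0):
    """Rough heuristic: classify the training regime from loss trajectories.

    high_loss is an arbitrary problem-dependent threshold for "still bad".
    """
    train_falling = train_losses[-1] < train_losses[0]
    val_falling = val_losses[-1] < val_losses[0]
    if train_losses[-1] > high_loss and val_losses[-1] > high_loss:
        return "underfitting"   # both losses high: increase capacity
    if train_falling and not val_falling:
        return "overfitting"    # the hallmark gap: regularize or stop
    if train_falling and val_falling:
        return "learning"       # keep training
    return "converged"
```

For example, a falling training loss paired with a rising validation loss is classified as overfitting.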
L2 regularization (weight decay)
L2 regularization adds a penalty proportional to the squared magnitude of the weights:
- $ J(\theta) = L(\theta) + \frac{\lambda}{2}\|\theta\|_2^2 = L(\theta) + \frac{\lambda}{2}\sum_j \theta_j^2 $
The gradient of the regularization term is $ \lambda \theta $, so each weight is multiplicatively shrunk toward zero at every update — hence the name weight decay. The hyperparameter $ \lambda $ controls the regularization strength.
L2 regularization is equivalent to placing a Gaussian prior on the weights from a Bayesian perspective. It encourages small, distributed weights and discourages any single weight from becoming excessively large.
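The "decay" in weight decay is visible in a single gradient step. A minimal sketch (the function name and hyperparameter values are illustrative):

```python
import numpy as np

def sgd_step_l2(theta, grad_loss, lr=0.1, lam=0.01):
    """One SGD step on J(theta) = L(theta) + (lam/2) * ||theta||^2.

    The penalty contributes lam * theta to the gradient, so the update
    is equivalent to shrinking theta by (1 - lr * lam) before applying
    the loss gradient.
    """
    return theta - lr * (grad_loss + lam * theta)

theta = np.array([1.0, -2.0])
# With a zero loss gradient, the step is pure multiplicative shrinkage.
shrunk = sgd_step_l2(theta, np.zeros(2))
```

With `grad_loss = 0`, the update reduces to `theta * (1 - lr * lam)`: every weight decays geometrically toward zero.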
L1 regularization
L1 regularization penalises the sum of absolute values:
- $ J(\theta) = L(\theta) + \lambda \|\theta\|_1 = L(\theta) + \lambda \sum_j |\theta_j| $
Unlike L2, the L1 penalty drives many weights exactly to zero, producing sparse models. This makes L1 regularization useful for feature selection. The LASSO (Least Absolute Shrinkage and Selection Operator) is the classic example of L1-regularized linear regression.
| Property | L1 | L2 |
|---|---|---|
| Penalty | $ \lambda\sum|\theta_j| $ | $ \frac{\lambda}{2}\sum\theta_j^2 $ |
| Effect on weights | Drives many to exactly zero | Shrinks all toward zero |
| Sparsity | Yes | No |
| Bayesian interpretation | Laplace prior | Gaussian prior |
| Use case | Feature selection, interpretability | General regularization |
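The sparsity-inducing behaviour of L1 can be demonstrated with proximal gradient descent (ISTA), one standard solver for the LASSO objective; the soft-thresholding step is what produces exact zeros. This is an illustrative sketch with made-up data, not the article's method:

```python
import numpy as np

rng = np.random.default_rng(1)

# 10 features, only two of which actually matter.
X = rng.normal(size=(100, 10))
true_w = np.zeros(10)
true_w[0], true_w[3] = 3.0, -2.0
y = X @ true_w + rng.normal(0, 0.1, 100)

def soft_threshold(v, t):
    """Proximal operator of the L1 norm: shrink toward zero, clip at zero."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam=0.5, iters=500):
    """Minimise (1/2n)||Xw - y||^2 + lam * ||w||_1 by proximal gradient."""
    n = len(y)
    lr = n / np.linalg.norm(X, 2) ** 2  # 1 / Lipschitz constant of the gradient
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = X.T @ (X @ w - y) / n
        w = soft_threshold(w - lr * grad, lr * lam)
    return w

w_hat = lasso_ista(X, y)
```

The irrelevant coefficients come out exactly zero (feature selection), while the two true features survive, shrunk toward zero by roughly `lam` (the LASSO's well-known bias).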
Dropout
Dropout (Srivastava et al., 2014) is a regularization technique specific to neural networks. During training, each neuron is randomly "dropped" (set to zero) with probability $ p $ at each forward pass. This prevents neurons from co-adapting and forces the network to learn redundant representations.
At test time, all neurons are active but their outputs are scaled by $ (1 - p) $ to compensate for the larger number of active units (or equivalently, outputs are scaled by $ 1/(1-p) $ during training — inverted dropout).
Dropout can be interpreted as an approximate ensemble method: each training step uses a different subnetwork, and the final model approximates the average prediction of exponentially many subnetworks.
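Inverted dropout, the variant described above, fits in a few lines of NumPy (a sketch; real frameworks implement this inside their layer abstractions):

```python
import numpy as np

def dropout(a, p, training, rng):
    """Inverted dropout: during training, zero each unit with probability p
    and scale survivors by 1/(1-p) so the expected activation is unchanged.
    At test time the layer is the identity."""
    if not training:
        return a
    mask = (rng.random(a.shape) >= p) / (1.0 - p)
    return a * mask

rng = np.random.default_rng(0)
a = np.ones(10000)
out = dropout(a, p=0.5, training=True, rng=rng)
# Roughly half the units are zero, the rest are scaled to 2.0,
# so the mean activation stays close to 1.0.
```

Because the scaling happens during training, the test-time path needs no correction at all, which is why inverted dropout is the form used in practice.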
Early stopping
Early stopping monitors the validation loss during training and halts optimisation when the validation loss stops improving. This is one of the simplest and most effective regularization strategies.
In practice, a patience parameter specifies how many epochs to wait after the last improvement before stopping. The model weights are saved at the point of lowest validation loss and restored at the end.
Early stopping acts as an implicit form of regularization: it limits the effective number of training steps, preventing the model from fully memorising the training data.
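The patience-and-restore logic can be sketched as a generic training loop; the `step` callback and the simulated loss sequence below are illustrative:

```python
def train_with_early_stopping(step, patience=3, max_epochs=100):
    """Early stopping: `step(epoch)` runs one epoch and returns
    (val_loss, weights). Stop after `patience` epochs without improvement
    and return the best weights seen."""
    best_loss, best_weights, wait = float("inf"), None, 0
    for epoch in range(max_epochs):
        val_loss, weights = step(epoch)
        if val_loss < best_loss:
            best_loss, best_weights, wait = val_loss, weights, 0
        else:
            wait += 1
            if wait >= patience:
                break
    return best_loss, best_weights

# Simulated validation losses: improvement stops after epoch 2.
losses = [1.0, 0.8, 0.7, 0.75, 0.9, 0.95, 0.5]
best_loss, best_weights = train_with_early_stopping(
    lambda epoch: (losses[epoch], f"w{epoch}"), patience=3)
```

With patience 3, training halts at epoch 5 and the weights from epoch 2 (the validation minimum) are returned; the later spurious improvement at epoch 6 is never reached.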
Data augmentation
Data augmentation increases the effective size and diversity of the training set by applying label-preserving transformations. For image data, common augmentations include:
- Random horizontal/vertical flips
- Random crops and resizing
- Colour jittering (brightness, contrast, saturation)
- Rotation and affine transformations
- Mixup (linear interpolation of pairs of images and their labels)
- Cutout (masking random patches)
For text data, augmentations include synonym replacement, back-translation, and paraphrasing. Data augmentation reduces overfitting by exposing the model to more varied inputs without collecting additional data.
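Mixup, mentioned above, is particularly simple to state in code: interpolate a pair of inputs and their one-hot labels with a Beta-distributed coefficient. A minimal sketch (function signature and `alpha` value are illustrative):

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    """Mixup: linearly interpolate a pair of examples and their one-hot
    labels with a mixing coefficient lam ~ Beta(alpha, alpha)."""
    if rng is None:
        rng = np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

rng = np.random.default_rng(0)
x1, x2 = np.zeros(4), np.ones(4)
y1, y2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
x_mix, y_mix = mixup(x1, y1, x2, y2, rng=rng)
```

The mixed label remains a valid probability distribution (it sums to 1), which is what lets the usual cross-entropy loss be applied unchanged.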
Other regularization techniques
- Batch normalization — normalising layer inputs stabilises training (its original motivation was reducing internal covariate shift, though this explanation is debated) and has a mild regularizing side effect, since the per-batch statistics inject noise.
- Label smoothing — replaces one-hot targets with a mixture, e.g. $ y_{\text{smooth}} = (1 - \epsilon)\, y + \epsilon / C $, preventing overconfidence.
- Noise injection — adding Gaussian noise to inputs, weights, or gradients during training.
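The label-smoothing formula above is a one-liner in practice. A sketch for a 4-class one-hot target with $ \epsilon = 0.1 $:

```python
import numpy as np

def smooth_labels(y_onehot, eps=0.1):
    """Label smoothing: y_smooth = (1 - eps) * y + eps / C for C classes."""
    C = y_onehot.shape[-1]
    return (1 - eps) * y_onehot + eps / C

y = np.array([0.0, 0.0, 1.0, 0.0])
y_s = smooth_labels(y, eps=0.1)
# The true class keeps most of the mass (0.925); the rest is spread
# uniformly (0.025 each), so the target is never exactly 0 or 1.
```

Because the target never reaches 1, the cross-entropy loss can no longer be driven to zero by an arbitrarily confident logit, which is the anti-overconfidence effect described above.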
Practical guidelines
- Start with a model large enough to overfit the training data — this confirms the model has sufficient capacity.
- Add regularization incrementally (dropout, weight decay, augmentation) and monitor validation performance.
- Use early stopping as a safety net.
- Prefer more training data over stronger regularization whenever possible — regularization compensates for limited data but is not a substitute for it.
- Tune the regularization strength ($ \lambda $, dropout rate) using a validation set, never the test set.
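The last guideline — tuning $ \lambda $ on a validation set — can be sketched with closed-form ridge regression on synthetic data (the data, the candidate $ \lambda $ grid, and the split sizes are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(80, 20))
w_true = rng.normal(size=20)
y = X @ w_true + rng.normal(0, 0.5, 80)
X_tr, y_tr, X_val, y_val = X[:60], y[:60], X[60:], y[60:]

def ridge_fit(X, y, lam):
    """Closed-form ridge regression: w = (X^T X + lam * I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Fit on the training split, score each candidate lambda on the
# validation split, and pick the minimiser. The test set is never touched.
val_mse = {}
for lam in (0.0, 0.1, 1.0, 10.0, 100.0):
    w = ridge_fit(X_tr, y_tr, lam)
    val_mse[lam] = np.mean((X_val @ w - y_val) ** 2)
best_lam = min(val_mse, key=val_mse.get)
```

Over-regularizing (here $ \lambda = 100 $) shows up directly as a worse validation score, which is exactly the signal the selection uses.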
References
- Srivastava, N. et al. (2014). "Dropout: A Simple Way to Prevent Neural Networks from Overfitting". JMLR, 15, 1929–1958.
- Tibshirani, R. (1996). "Regression Shrinkage and Selection via the Lasso". JRSS Series B, 58(1), 267–288.
- Goodfellow, I., Bengio, Y. and Courville, A. (2016). Deep Learning, Chapter 7. MIT Press.
- Zhang, C. et al. (2017). "Understanding deep learning requires rethinking generalization". ICLR.
- Shorten, C. and Khoshgoftaar, T. M. (2019). "A survey on Image Data Augmentation for Deep Learning". Journal of Big Data.