Overfitting and Regularization
Revision as of 06:58, 24 April 2026
| Article | |
|---|---|
| Topic area | Machine Learning |
| Difficulty | Intermediate |
| Prerequisites | Loss Functions, Neural Networks |
Overfitting occurs when a machine-learning model learns the training data too well — capturing noise and idiosyncrasies rather than the underlying pattern — and consequently performs poorly on unseen data. Regularization is the family of techniques used to prevent overfitting and improve a model's ability to generalise.
The bias–variance tradeoff
Prediction error on unseen data can be decomposed into three components:
- $ \text{Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible noise} $
- Bias measures how far the model's average prediction is from the true value. High bias indicates the model is too simple to capture the data's structure (underfitting).
- Variance measures how much predictions fluctuate across different training sets. High variance indicates the model is too sensitive to the particular training data (overfitting).
The goal is to find the sweet spot that minimises total error. A model with too few parameters underfits (high bias); a model with too many parameters overfits (high variance). Regularization techniques tilt the balance by constraining model complexity, accepting slightly higher bias in exchange for substantially lower variance.
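The tradeoff is easy to see on synthetic data. The sketch below (illustrative, not from the article: a quadratic signal, a polynomial fit via `np.polyfit`, and arbitrary degree choices) shows a degree-1 fit underfitting, degree 2 matching the signal, and degree 15 driving training error down while validation error stays worse:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: quadratic signal plus Gaussian noise.
x = rng.uniform(-1, 1, 60)
y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(0, 0.3, 60)
x_train, y_train = x[:40], y[:40]
x_val, y_val = x[40:], y[40:]

def fit_poly_mse(degree):
    """Least-squares polynomial fit; returns (train MSE, validation MSE)."""
    coeffs = np.polyfit(x_train, y_train, degree)
    mse = lambda xs, ys: np.mean((np.polyval(coeffs, xs) - ys) ** 2)
    return mse(x_train, y_train), mse(x_val, y_val)

# Degree 1 underfits (high bias), degree 2 matches the signal,
# degree 15 overfits (low train error, worse validation error).
results = {d: fit_poly_mse(d) for d in (1, 2, 15)}
```

Training error can only decrease as capacity grows (the models are nested), but validation error is minimised at an intermediate complexity.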
Detecting overfitting
The clearest diagnostic is to compare training and validation performance:
- Training loss decreasing, validation loss also decreasing — the model is still learning; continue training.
- Training loss decreasing, validation loss increasing — the model is overfitting; apply regularization or stop training.
- Training loss high, validation loss high — the model is underfitting; increase capacity or train longer.
Plotting these learning curves over training iterations is essential practice. A large gap between training accuracy and validation accuracy is the hallmark of overfitting.
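The three regimes above can be turned into a rough programmatic check. This is a heuristic sketch only; the `diagnose` helper and its `high_loss` threshold are illustrative, not a standard API:

```python
def diagnose(train_losses, val_losses, high_loss=1.0):
    """Rough heuristic: classify the training regime from loss trajectories.

    high_loss is an arbitrary problem-dependent threshold for "still bad".
    """
    train_falling = train_losses[-1] < train_losses[0]
    val_falling = val_losses[-1] < val_losses[0]
    if train_losses[-1] > high_loss and val_losses[-1] > high_loss:
        return "underfitting"   # both losses high: increase capacity
    if train_falling and not val_falling:
        return "overfitting"    # the hallmark gap: regularize or stop
    if train_falling and val_falling:
        return "learning"       # keep training
    return "converged"
```

For example, a falling training loss paired with a rising validation loss is classified as overfitting.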
L2 regularization (weight decay)
L2 regularization adds a penalty proportional to the squared magnitude of the weights:
- $ J(\theta) = L(\theta) + \frac{\lambda}{2}\|\theta\|_2^2 = L(\theta) + \frac{\lambda}{2}\sum_j \theta_j^2 $
The gradient of the regularization term is $ \lambda \theta $, so each weight is multiplicatively shrunk toward zero at every update — hence the name weight decay. The hyperparameter $ \lambda $ controls the regularization strength.
L2 regularization is equivalent to placing a Gaussian prior on the weights from a Bayesian perspective. It encourages small, distributed weights and discourages any single weight from becoming excessively large.
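The "decay" in weight decay is visible in a single gradient step. A minimal sketch (the function name and hyperparameter values are illustrative):

```python
import numpy as np

def sgd_step_l2(theta, grad_loss, lr=0.1, lam=0.01):
    """One SGD step on J(theta) = L(theta) + (lam/2) * ||theta||^2.

    The penalty contributes lam * theta to the gradient, so the update
    is equivalent to shrinking theta by (1 - lr * lam) before applying
    the loss gradient.
    """
    return theta - lr * (grad_loss + lam * theta)

theta = np.array([1.0, -2.0])
# With a zero loss gradient, the step is pure multiplicative shrinkage.
shrunk = sgd_step_l2(theta, np.zeros(2))
```

With `grad_loss = 0`, the update reduces to `theta * (1 - lr * lam)`: every weight decays geometrically toward zero.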
L1 regularization
L1 regularization penalises the sum of absolute values:
- $ J(\theta) = L(\theta) + \lambda \|\theta\|_1 = L(\theta) + \lambda \sum_j |\theta_j| $
Unlike L2, the L1 penalty drives many weights exactly to zero, producing sparse models. This makes L1 regularization useful for feature selection. The LASSO (Least Absolute Shrinkage and Selection Operator) is the classic example of L1-regularized linear regression.
| Property | L1 | L2 |
|---|---|---|
| Penalty | $ \lambda\sum|\theta_j| $ | $ \frac{\lambda}{2}\sum\theta_j^2 $ |
| Effect on weights | Drives many to exactly zero | Shrinks all toward zero |
| Sparsity | Yes | No |
| Bayesian interpretation | Laplace prior | Gaussian prior |
| Use case | Feature selection, interpretability | General regularization |
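The sparsity-inducing behaviour of L1 can be demonstrated with proximal gradient descent (ISTA), one standard solver for the LASSO objective; the soft-thresholding step is what produces exact zeros. This is an illustrative sketch with made-up data, not the article's method:

```python
import numpy as np

rng = np.random.default_rng(1)

# 10 features, only two of which actually matter.
X = rng.normal(size=(100, 10))
true_w = np.zeros(10)
true_w[0], true_w[3] = 3.0, -2.0
y = X @ true_w + rng.normal(0, 0.1, 100)

def soft_threshold(v, t):
    """Proximal operator of the L1 norm: shrink toward zero, clip at zero."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam=0.5, iters=500):
    """Minimise (1/2n)||Xw - y||^2 + lam * ||w||_1 by proximal gradient."""
    n = len(y)
    lr = n / np.linalg.norm(X, 2) ** 2  # 1 / Lipschitz constant of the gradient
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = X.T @ (X @ w - y) / n
        w = soft_threshold(w - lr * grad, lr * lam)
    return w

w_hat = lasso_ista(X, y)
```

The irrelevant coefficients come out exactly zero (feature selection), while the two true features survive, shrunk toward zero by roughly `lam` (the LASSO's well-known bias).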
Dropout
Dropout (Srivastava et al., 2014) is a regularization technique specific to neural networks. During training, each neuron is randomly "dropped" (set to zero) with probability $ p $ at each forward pass. This prevents neurons from co-adapting and forces the network to learn redundant representations.
At test time, all neurons are active but their outputs are scaled by $ (1 - p) $ to compensate for the larger number of active units (or equivalently, outputs are scaled by $ 1/(1-p) $ during training — inverted dropout).
Dropout can be interpreted as an approximate ensemble method: each training step uses a different subnetwork, and the final model approximates the average prediction of exponentially many subnetworks.
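Inverted dropout, the variant described above, fits in a few lines of NumPy (a sketch; real frameworks implement this inside their layer abstractions):

```python
import numpy as np

def dropout(a, p, training, rng):
    """Inverted dropout: during training, zero each unit with probability p
    and scale survivors by 1/(1-p) so the expected activation is unchanged.
    At test time the layer is the identity."""
    if not training:
        return a
    mask = (rng.random(a.shape) >= p) / (1.0 - p)
    return a * mask

rng = np.random.default_rng(0)
a = np.ones(10000)
out = dropout(a, p=0.5, training=True, rng=rng)
# Roughly half the units are zero, the rest are scaled to 2.0,
# so the mean activation stays close to 1.0.
```

Because the scaling happens during training, the test-time path needs no correction at all, which is why inverted dropout is the form used in practice.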
Early stopping
Early stopping monitors the validation loss during training and halts optimisation when the validation loss stops improving. This is one of the simplest and most effective regularization strategies.
In practice, a patience parameter specifies how many epochs to wait after the last improvement before stopping. The model weights are saved at the point of lowest validation loss and restored at the end.
Early stopping acts as an implicit form of regularization: it limits the effective number of training steps, preventing the model from fully memorising the training data.
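The patience-and-restore logic can be sketched as a generic training loop; the `step` callback and the simulated loss sequence below are illustrative:

```python
def train_with_early_stopping(step, patience=3, max_epochs=100):
    """Early stopping: `step(epoch)` runs one epoch and returns
    (val_loss, weights). Stop after `patience` epochs without improvement
    and return the best weights seen."""
    best_loss, best_weights, wait = float("inf"), None, 0
    for epoch in range(max_epochs):
        val_loss, weights = step(epoch)
        if val_loss < best_loss:
            best_loss, best_weights, wait = val_loss, weights, 0
        else:
            wait += 1
            if wait >= patience:
                break
    return best_loss, best_weights

# Simulated validation losses: improvement stops after epoch 2.
losses = [1.0, 0.8, 0.7, 0.75, 0.9, 0.95, 0.5]
best_loss, best_weights = train_with_early_stopping(
    lambda epoch: (losses[epoch], f"w{epoch}"), patience=3)
```

With patience 3, training halts at epoch 5 and the weights from epoch 2 (the validation minimum) are returned; the later spurious improvement at epoch 6 is never reached.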
Data augmentation
Data augmentation increases the effective size and diversity of the training set by applying label-preserving transformations. For image data, common augmentations include:
- Random horizontal/vertical flips
- Random crops and resizing
- Colour jittering (brightness, contrast, saturation)
- Rotation and affine transformations
- Mixup (linear interpolation of pairs of images and their labels)
- Cutout (masking random patches)
For text data, augmentations include synonym replacement, back-translation, and paraphrasing. Data augmentation reduces overfitting by exposing the model to more varied inputs without collecting additional data.
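Mixup, mentioned above, is particularly simple to state in code: interpolate a pair of inputs and their one-hot labels with a Beta-distributed coefficient. A minimal sketch (function signature and `alpha` value are illustrative):

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    """Mixup: linearly interpolate a pair of examples and their one-hot
    labels with a mixing coefficient lam ~ Beta(alpha, alpha)."""
    if rng is None:
        rng = np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

rng = np.random.default_rng(0)
x1, x2 = np.zeros(4), np.ones(4)
y1, y2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
x_mix, y_mix = mixup(x1, y1, x2, y2, rng=rng)
```

The mixed label remains a valid probability distribution (it sums to 1), which is what lets the usual cross-entropy loss be applied unchanged.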
Other regularization techniques
- Batch normalization — normalising layer inputs stabilises training (its original motivation was reducing internal covariate shift, though this explanation is debated) and has a mild regularizing side effect, since the per-batch statistics inject noise.
- Label smoothing — replaces one-hot targets with a mixture, e.g. $ y_{\text{smooth}} = (1 - \epsilon)\, y + \epsilon / C $, preventing overconfidence.
- Noise injection — adding Gaussian noise to inputs, weights, or gradients during training.
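The label-smoothing formula above is a one-liner in practice. A sketch for a 4-class one-hot target with $ \epsilon = 0.1 $:

```python
import numpy as np

def smooth_labels(y_onehot, eps=0.1):
    """Label smoothing: y_smooth = (1 - eps) * y + eps / C for C classes."""
    C = y_onehot.shape[-1]
    return (1 - eps) * y_onehot + eps / C

y = np.array([0.0, 0.0, 1.0, 0.0])
y_s = smooth_labels(y, eps=0.1)
# The true class keeps most of the mass (0.925); the rest is spread
# uniformly (0.025 each), so the target is never exactly 0 or 1.
```

Because the target never reaches 1, the cross-entropy loss can no longer be driven to zero by an arbitrarily confident logit, which is the anti-overconfidence effect described above.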
Practical guidelines
- Start with a model large enough to overfit the training data — this confirms the model has sufficient capacity.
- Add regularization incrementally (dropout, weight decay, augmentation) and monitor validation performance.
- Use early stopping as a safety net.
- Prefer more training data over stronger regularization whenever possible — regularization compensates for limited data but is not a substitute for it.
- Tune the regularization strength ($ \lambda $, dropout rate) using a validation set, never the test set.
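The last guideline — tuning $ \lambda $ on a validation set — can be sketched with closed-form ridge regression on synthetic data (the data, the candidate $ \lambda $ grid, and the split sizes are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(80, 20))
w_true = rng.normal(size=20)
y = X @ w_true + rng.normal(0, 0.5, 80)
X_tr, y_tr, X_val, y_val = X[:60], y[:60], X[60:], y[60:]

def ridge_fit(X, y, lam):
    """Closed-form ridge regression: w = (X^T X + lam * I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Fit on the training split, score each candidate lambda on the
# validation split, and pick the minimiser. The test set is never touched.
val_mse = {}
for lam in (0.0, 0.1, 1.0, 10.0, 100.0):
    w = ridge_fit(X_tr, y_tr, lam)
    val_mse[lam] = np.mean((X_val @ w - y_val) ** 2)
best_lam = min(val_mse, key=val_mse.get)
```

Over-regularizing (here $ \lambda = 100 $) shows up directly as a worse validation score, which is exactly the signal the selection uses.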
References
- Srivastava, N. et al. (2014). "Dropout: A Simple Way to Prevent Neural Networks from Overfitting". JMLR, 15, 1929–1958.
- Tibshirani, R. (1996). "Regression Shrinkage and Selection via the Lasso". JRSS Series B, 58(1), 267–288.
- Goodfellow, I., Bengio, Y. and Courville, A. (2016). Deep Learning, Chapter 7. MIT Press.
- Zhang, C. et al. (2017). "Understanding deep learning requires rethinking generalization". ICLR.
- Shorten, C. and Khoshgoftaar, T. M. (2019). "A survey on Image Data Augmentation for Deep Learning". Journal of Big Data.