Overfitting and Regularization/en - Revision history

FuzzyBot: Updating to match new version of source page

2026-04-27T23:58:30Z


FuzzyBot: Updating to match new version of source page

2026-04-27T22:03:45Z


FuzzyBot: Updating to match new version of source page

2026-04-27T19:42:40Z


FuzzyBot: Updating to match new version of source page

2026-04-27T02:38:32Z


← Older revision		Revision as of 02:38, 27 April 2026
Line 1:		Line 1:
	<languages />		<languages />
	~~{{LanguageBar \| page = Overfitting and Regularization}}~~
	{{ArticleInfobox \| topic_area = Machine Learning \| difficulty = Intermediate \| prerequisites = [[Loss Functions]], [[Neural Networks]]}}		{{ArticleInfobox \| topic_area = Machine Learning \| difficulty = Intermediate \| prerequisites = [[Loss Functions]], [[Neural Networks]]}}
	{{ContentMeta \| generated_by = claude-opus \| model_used = claude-opus-4-6 \| generated_date = 2026-03-13}}		{{ContentMeta \| generated_by = claude-opus \| model_used = claude-opus-4-6 \| generated_date = 2026-03-13}}

FuzzyBot: Updating to match new version of source page

2026-04-27T00:31:09Z


New page

<languages />
{{LanguageBar | page = Overfitting and Regularization}}
{{ArticleInfobox | topic_area = Machine Learning | difficulty = Intermediate | prerequisites = [[Loss Functions]], [[Neural Networks]]}}
{{ContentMeta | generated_by = claude-opus | model_used = claude-opus-4-6 | generated_date = 2026-03-13}}

'''Overfitting''' occurs when a machine-learning model learns the training data too well — capturing noise and idiosyncrasies rather than the underlying pattern — and consequently performs poorly on unseen data. '''Regularization''' is the family of techniques used to prevent overfitting and improve a model's ability to generalise.

== The bias–variance tradeoff ==

For squared-error loss, the expected prediction error on unseen data decomposes into three components:

:<math>\text{Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible noise}</math>

* '''Bias''' measures how far the model's average prediction is from the true value. High bias indicates the model is too simple to capture the data's structure ('''underfitting''').
* '''Variance''' measures how much predictions fluctuate across different training sets. High variance indicates the model is too sensitive to the particular training data ('''overfitting''').

The goal is to find the sweet spot that minimises total error. A model with too few parameters underfits (high bias); a model with too many parameters overfits (high variance). Regularization techniques tilt the balance by constraining model complexity, accepting slightly higher bias in exchange for substantially lower variance.
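
The decomposition can be estimated empirically by refitting a model on many independently drawn training sets. The sketch below is illustrative only: the sine target, the noise level, and the function name <code>bias_variance</code> are assumptions chosen for the example, and polynomial degree stands in for model complexity.

```python
import numpy as np

def bias_variance(degree, n_trials=200, n_train=20, noise_std=0.3, seed=0):
    """Estimate bias^2 and variance of a polynomial fit of the given degree."""
    rng = np.random.default_rng(seed)
    x_test = np.linspace(0.0, 1.0, 50)
    f_test = np.sin(2 * np.pi * x_test)      # assumed "true" function
    preds = np.empty((n_trials, x_test.size))
    for t in range(n_trials):
        # Draw a fresh noisy training set for every trial.
        x = rng.uniform(0.0, 1.0, n_train)
        y = np.sin(2 * np.pi * x) + rng.normal(0.0, noise_std, n_train)
        coeffs = np.polyfit(x, y, degree)
        preds[t] = np.polyval(coeffs, x_test)
    mean_pred = preds.mean(axis=0)
    bias_sq = np.mean((mean_pred - f_test) ** 2)   # squared gap of the average prediction
    variance = np.mean(preds.var(axis=0))          # spread across training sets
    return bias_sq, variance

for d in (1, 3, 9):
    b, v = bias_variance(d)
    print(f"degree={d}: bias^2={b:.3f}, variance={v:.3f}")
```

Under these assumptions the degree-1 fit should show the largest bias and the degree-9 fit the largest variance, mirroring the underfitting and overfitting regimes described above.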

== Detecting overfitting ==

The clearest diagnostic is to compare training and validation performance:

* '''Training loss decreasing, validation loss also decreasing''' — the model is still learning; continue training.
* '''Training loss decreasing, validation loss increasing''' — the model is overfitting; apply regularization or stop training.
* '''Training loss high, validation loss high''' — the model is underfitting; increase capacity or train longer.

Plotting these '''learning curves''' over training iterations is essential practice. A large gap between training accuracy and validation accuracy is the hallmark of overfitting.
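
The three cases above can be turned into a rough automatic check on recorded loss histories. This is a minimal sketch: the function name <code>diagnose</code>, the <code>window</code> length, and the <code>high_loss</code> threshold are hypothetical choices for illustration, and real curves are usually noisy enough to need smoothing first.

```python
def diagnose(train_losses, val_losses, window=3, high_loss=1.0):
    """Classify the recent trend of two per-epoch loss histories.

    `window` and `high_loss` are illustrative thresholds, not standard values.
    """
    if len(train_losses) <= window or len(val_losses) <= window:
        return "not enough data"
    # Change in each loss over the last `window` epochs.
    train_delta = train_losses[-1] - train_losses[-1 - window]
    val_delta = val_losses[-1] - val_losses[-1 - window]
    if train_delta < 0 and val_delta > 0:
        return "overfitting"
    if train_delta < 0 and val_delta < 0:
        return "still learning"
    if train_losses[-1] > high_loss and val_losses[-1] > high_loss:
        return "underfitting"
    return "inconclusive"

print(diagnose([1.0, 0.8, 0.6, 0.5, 0.4], [1.1, 0.9, 0.8, 0.85, 0.95]))
```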

== L2 regularization (weight decay) ==

L2 regularization adds a penalty proportional to the squared magnitude of the weights:

:<math>J(\theta) = L(\theta) + \frac{\lambda}{2}\|\theta\|_2^2 = L(\theta) + \frac{\lambda}{2}\sum_j \theta_j^2</math>

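The gradient of the regularization term is <math>\lambda \theta</math>. Under plain gradient descent with learning rate <math>\eta</math>, the update becomes <math>\theta \leftarrow (1 - \eta\lambda)\,\theta - \eta \nabla L(\theta)</math>, so each weight is multiplicatively shrunk toward zero at every update, hence the name '''weight decay'''. For adaptive optimizers such as Adam, an L2 penalty and this decoupled shrinkage are no longer equivalent. The hyperparameter <math>\lambda</math> controls the regularization strength.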

From a Bayesian perspective, L2 regularization corresponds to maximum a posteriori (MAP) estimation with a zero-mean Gaussian prior on the weights. It encourages small, distributed weights and discourages any single weight from becoming excessively large.
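
As a quick numerical check of the shrinkage described above, the following sketch compares a plain gradient step on the penalised objective with an explicit "shrink, then step" update. The function names and the learning-rate argument <code>lr</code> are assumptions made for the example; the equivalence shown holds for plain gradient descent only.

```python
import numpy as np

def sgd_step_l2(theta, grad_loss, lr, lam):
    """Gradient step on J = L + (lam / 2) * ||theta||^2."""
    return theta - lr * (grad_loss + lam * theta)

def sgd_step_weight_decay(theta, grad_loss, lr, lam):
    """Shrink the weights by (1 - lr * lam), then apply the loss gradient."""
    return (1.0 - lr * lam) * theta - lr * grad_loss

theta = np.array([1.0, -2.0, 0.5])
grad = np.array([0.1, 0.0, -0.3])      # stand-in for the gradient of L
a = sgd_step_l2(theta, grad, lr=0.1, lam=0.01)
b = sgd_step_weight_decay(theta, grad, lr=0.1, lam=0.01)
print(np.allclose(a, b))  # True
```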

== L1 regularization ==

L1 regularization penalises the sum of absolute values:

:<math>J(\theta) = L(\theta) + \lambda \|\theta\|_1 = L(\theta) + \lambda \sum_j |\theta_j|</math>

Unlike L2, the L1 penalty drives many weights exactly to zero, producing '''sparse''' models. This makes L1 regularization useful for feature selection. The LASSO (Least Absolute Shrinkage and Selection Operator) is the classic example of L1-regularized linear regression.

{| class="wikitable"
|-
! Property !! L1 !! L2
|-
| Penalty || <math>\lambda\sum|\theta_j|</math> || <math>\frac{\lambda}{2}\sum\theta_j^2</math>
|-
| Effect on weights || Drives many to exactly zero || Shrinks all toward zero
|-
| Sparsity || Yes || No
|-
| Bayesian interpretation || Laplace prior || Gaussian prior
|-
| Use case || Feature selection, interpretability || General regularization
|}
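
One way to see why L1 yields exact zeros while L2 does not is to compare the shrinkage step each penalty induces. The sketch below uses proximal operators, a technique not discussed in the text above and introduced here only for illustration; the function names and the threshold <code>t</code> are assumptions.

```python
import numpy as np

def soft_threshold(theta, t):
    """Proximal operator of t * ||theta||_1: shrink by t and clip at zero."""
    return np.sign(theta) * np.maximum(np.abs(theta) - t, 0.0)

def l2_shrink(theta, t):
    """Proximal operator of (t / 2) * ||theta||_2^2: uniform rescaling."""
    return theta / (1.0 + t)

theta = np.array([3.0, -0.2, 0.05, -1.5])
print(soft_threshold(theta, 0.5))   # small entries become exactly zero
print(l2_shrink(theta, 0.5))        # every entry shrinks but stays nonzero
```

Any weight whose magnitude is below the threshold is set exactly to zero by the L1 step, whereas the L2 step only rescales it.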

== Dropout ==

'''Dropout''' (Srivastava et al., 2014) is a regularization technique specific to neural networks. During training, each neuron is randomly "dropped" (set to zero) with probability <math>p</math> at each forward pass. This prevents neurons from co-adapting and forces the network to learn redundant representations.

At test time, all neurons are active, so their outputs are scaled by the keep probability <math>(1 - p)</math> to match the expected activation seen during training. Equivalently, and more commonly in practice, the surviving activations are scaled by <math>1/(1-p)</math> during training and no scaling is applied at test time ('''inverted dropout''').

Dropout can be interpreted as an approximate ensemble method: each training step uses a different subnetwork, and the final model approximates the average prediction of exponentially many subnetworks.
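
A minimal NumPy sketch of inverted dropout follows, assuming <math>p</math> is the drop probability with <math>0 \le p < 1</math>. The function signature is hypothetical and is not taken from any particular framework.

```python
import numpy as np

def dropout(x, p, training, rng):
    """Inverted dropout: zero each unit with probability p during training."""
    if not training or p == 0.0:
        return x                          # test time: identity, no rescaling
    mask = rng.random(x.shape) >= p       # True for units that are kept
    return x * mask / (1.0 - p)           # rescale so the expectation is unchanged

rng = np.random.default_rng(0)
x = np.ones((1000, 100))
y = dropout(x, p=0.5, training=True, rng=rng)
print(y.mean())   # close to 1.0, the mean of the input
```

Because the rescaling happens during training, the same function acts as the identity at test time.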

== Early stopping ==

'''Early stopping''' monitors the validation loss during training and halts optimisation when the validation loss stops improving. This is one of the simplest and most effective regularization strategies.

In practice, a '''patience''' parameter specifies how many epochs to wait after the last improvement before stopping. The model weights are saved at the point of lowest validation loss and restored at the end.

Early stopping acts as an implicit form of regularization: it limits the effective number of training steps, preventing the model from fully memorising the training data.
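
The patience rule can be sketched on a pre-recorded list of validation losses. In a real training loop the losses would be computed epoch by epoch and a checkpoint saved at each improvement; the function name and return convention here are assumptions for illustration.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (best_epoch, stop_epoch) for per-epoch validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best: this is where the model weights would be saved.
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs: stop and restore best weights.
            return best_epoch, epoch
    return best_epoch, len(val_losses) - 1

losses = [1.0, 0.8, 0.7, 0.75, 0.72, 0.74, 0.9]
print(early_stopping_epoch(losses, patience=3))  # (2, 5)
```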

== Data augmentation ==

'''Data augmentation''' increases the effective size and diversity of the training set by applying transformations that preserve the label or, as with Mixup, change it in a known way. For image data, common augmentations include:

* Random horizontal/vertical flips
* Random crops and resizing
* Colour jittering (brightness, contrast, saturation)
* Rotation and affine transformations
* Mixup (linear interpolation of pairs of images and their labels)
* Cutout (masking random patches)

For text data, augmentations include synonym replacement, back-translation, and paraphrasing. Data augmentation reduces overfitting by exposing the model to more varied inputs without collecting additional data.
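
Two of the image augmentations listed above, a random horizontal flip and a random crop, can be sketched in plain NumPy. The function name <code>augment</code>, the flip probability of 0.5, and the reflect-padding of <code>pad</code> pixels are assumptions for the example; image libraries offer more complete implementations.

```python
import numpy as np

def augment(image, rng, pad=2):
    """Random horizontal flip, then a random crop back to the original size.

    `image` has shape (H, W) or (H, W, C); `pad` must be smaller than H and W.
    """
    h, w = image.shape[:2]
    if rng.random() < 0.5:
        image = image[:, ::-1]            # horizontal flip
    # Pad only the two spatial axes, leaving any channel axis untouched.
    pad_width = ((pad, pad), (pad, pad)) + ((0, 0),) * (image.ndim - 2)
    padded = np.pad(image, pad_width, mode="reflect")
    top = rng.integers(0, 2 * pad + 1)    # random crop offset
    left = rng.integers(0, 2 * pad + 1)
    return padded[top:top + h, left:left + w]

rng = np.random.default_rng(0)
image = np.arange(8 * 8 * 3).reshape(8, 8, 3)
print(augment(image, rng).shape)  # (8, 8, 3)
```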

== Other regularization techniques ==

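* '''Batch normalization''' normalises layer inputs using mini-batch statistics. It was originally motivated as a way to reduce internal covariate shift, and the noise in the batch statistics gives it a mild regularizing effect.
* '''Label smoothing''' replaces one-hot targets with a mixture of the one-hot target and the uniform distribution over the <math>C</math> classes, e.g. <math>y_{\text{smooth}} = (1 - \epsilon)\, y + \epsilon / C</math>, which discourages overconfident predictions.
* '''Noise injection''' adds Gaussian noise to inputs, weights, or gradients during training.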
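
The label-smoothing formula above translates directly into code. This sketch assumes the targets are already one-hot encoded along the last axis; the function name is hypothetical.

```python
import numpy as np

def smooth_labels(y_onehot, eps):
    """Mix one-hot targets with the uniform distribution over the classes."""
    num_classes = y_onehot.shape[-1]      # C in the formula
    return (1.0 - eps) * y_onehot + eps / num_classes

y = np.array([[0.0, 1.0, 0.0, 0.0]])
print(smooth_labels(y, eps=0.1))
```

With four classes and <math>\epsilon = 0.1</math>, the true class receives 0.925 and each other class 0.025, so the smoothed target still sums to one.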

== Practical guidelines ==

# Start with a model large enough to overfit the training data — this confirms the model has sufficient capacity.
# Add regularization incrementally (dropout, weight decay, augmentation) and monitor validation performance.
# Use early stopping as a safety net.
# Prefer more training data over stronger regularization whenever possible; regularization compensates for limited data but does not replace it.
# Tune the regularization strength (<math>\lambda</math>, dropout rate) using a validation set, never the test set.
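
The last guideline can be sketched for the simplest case, L2-regularized linear regression, where the penalised objective has a closed-form minimiser. The function names, the candidate grid, and the synthetic data are assumptions for illustration; the test set is deliberately not involved.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form minimiser of 0.5 * ||X w - y||^2 + (lam / 2) * ||w||^2."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def select_lambda(X_tr, y_tr, X_val, y_val, lambdas):
    """Pick the lambda with the lowest validation mean squared error."""
    scores = []
    for lam in lambdas:
        w = ridge_fit(X_tr, y_tr, lam)
        scores.append(float(np.mean((X_val @ w - y_val) ** 2)))
    best = int(np.argmin(scores))
    return lambdas[best], scores

# Synthetic data: few samples, many features, so some regularization may help.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 30))
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(0.0, 1.0, 60)
best_lam, val_mse = select_lambda(X[:40], y[:40], X[40:], y[40:],
                                  [0.01, 0.1, 1.0, 10.0, 100.0])
print(best_lam, val_mse)
```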

== See also ==

* [[Loss Functions]]
* [[Neural Networks]]
* [[Gradient Descent]]
* [[Convolutional Neural Networks]]

== References ==

* Srivastava, N. et al. (2014). "Dropout: A Simple Way to Prevent Neural Networks from Overfitting". ''JMLR'', 15, 1929–1958.
* Tibshirani, R. (1996). "Regression Shrinkage and Selection via the Lasso". ''JRSS Series B'', 58(1), 267–288.
* Goodfellow, I., Bengio, Y. and Courville, A. (2016). ''Deep Learning'', Chapter 7. MIT Press.
* Zhang, C. et al. (2017). "Understanding deep learning requires rethinking generalization". ''ICLR''.
* Shorten, C. and Khoshgoftaar, T. M. (2019). "A survey on Image Data Augmentation for Deep Learning". ''Journal of Big Data''.

[[Category:Machine Learning]]
[[Category:Intermediate]]

← Older revision		Revision as of 23:58, 27 April 2026
Line 28:		Line 28:
	== L2 regularization (weight decay) ==		== L2 regularization (weight decay) ==

	L2 regularization adds a penalty proportional to the squared magnitude of the weights:		{{Term\|weight decay\|L2 regularization}} adds a penalty proportional to the squared magnitude of the weights:

	:<math>J(\theta) = L(\theta) + \frac{\lambda}{2}\\|\theta\\|_2^2 = L(\theta) + \frac{\lambda}{2}\sum_j \theta_j^2</math>		:<math>J(\theta) = L(\theta) + \frac{\lambda}{2}\\|\theta\\|_2^2 = L(\theta) + \frac{\lambda}{2}\sum_j \theta_j^2</math>

	The gradient of the regularization term is <math>\lambda \theta</math>, so each weight is multiplicatively shrunk toward zero at every update — hence the name '''weight decay'''. The hyperparameter <math>\lambda</math> controls the regularization strength.		The gradient of the regularization term is <math>\lambda \theta</math>, so each weight is multiplicatively shrunk toward zero at every update — hence the name '''{{Term\|weight decay}}'''. The {{Term\|hyperparameter}} <math>\lambda</math> controls the regularization strength.

	L2 regularization is equivalent to placing a Gaussian prior on the weights from a Bayesian perspective. It encourages small, distributed weights and discourages any single weight from becoming excessively large.		{{Term\|weight decay\|L2 regularization}} is equivalent to placing a Gaussian prior on the weights from a Bayesian perspective. It encourages small, distributed weights and discourages any single weight from becoming excessively large.

	== L1 regularization ==		== L1 regularization ==
Line 61:		Line 61:
	== Dropout ==		== Dropout ==

	'''~~Dropout~~''' (Srivastava et al., 2014) is a regularization technique specific to neural networks. During training, each neuron is randomly "dropped" (set to zero) with probability <math>p</math> at each forward pass. This prevents neurons from co-adapting and forces the network to learn redundant representations.		'''{{Term\|dropout}}''' (Srivastava et al., 2014) is a regularization technique specific to neural networks. During training, each neuron is randomly "dropped" (set to zero) with probability <math>p</math> at each forward pass. This prevents neurons from co-adapting and forces the network to learn redundant representations.

	At test time, all neurons are active but their outputs are scaled by <math>(1 - p)</math> to compensate for the larger number of active units (or equivalently, outputs are scaled by <math>1/(1-p)</math> during training — '''inverted dropout''').		At test time, all neurons are active but their outputs are scaled by <math>(1 - p)</math> to compensate for the larger number of active units (or equivalently, outputs are scaled by <math>1/(1-p)</math> during training — '''inverted {{Term\|dropout}}''').

	~~Dropout~~ can be interpreted as an approximate ensemble method: each training step uses a different subnetwork, and the final model approximates the average prediction of exponentially many subnetworks.		{{Term\|dropout}} can be interpreted as an approximate ensemble method: each training step uses a different subnetwork, and the final model approximates the average prediction of exponentially many subnetworks.

	== Early stopping ==		== Early stopping ==
Line 71:		Line 71:
	'''Early stopping''' monitors the validation loss during training and halts optimisation when the validation loss stops improving. This is one of the simplest and most effective regularization strategies.		'''Early stopping''' monitors the validation loss during training and halts optimisation when the validation loss stops improving. This is one of the simplest and most effective regularization strategies.

	In practice, a '''patience''' parameter specifies how many epochs to wait after the last improvement before stopping. The model weights are saved at the point of lowest validation loss and restored at the end.		In practice, a '''patience''' parameter specifies how many {{Term\|epoch\|epochs}} to wait after the last improvement before stopping. The model weights are saved at the point of lowest validation loss and restored at the end.

	Early stopping acts as an implicit form of regularization: it limits the effective number of training steps, preventing the model from fully memorising the training data.		Early stopping acts as an implicit form of regularization: it limits the effective number of training steps, preventing the model from fully memorising the training data.
Line 90:		Line 90:
	== Other regularization techniques ==		== Other regularization techniques ==

	* '''~~Batch~~ normalization''' — normalising layer inputs reduces internal covariate shift and has a mild regularizing effect.		* '''{{Term\|batch normalization}}''' — normalising layer inputs reduces internal covariate shift and has a mild regularizing effect.
	* '''Label smoothing''' — replaces one-hot targets with a mixture, e.g. <math>y_{\text{smooth}} = (1 - \epsilon)\, y + \epsilon / C</math>, preventing overconfidence.		* '''Label smoothing''' — replaces {{Term\|one-hot encoding\|one-hot}} targets with a mixture, e.g. <math>y_{\text{smooth}} = (1 - \epsilon)\, y + \epsilon / C</math>, preventing overconfidence.
	* '''Noise injection''' — adding Gaussian noise to inputs, weights, or gradients during training.		* '''Noise injection''' — adding Gaussian noise to inputs, weights, or gradients during training.

Line 97:		Line 97:

	# Start with a model large enough to overfit the training data — this confirms the model has sufficient capacity.		# Start with a model large enough to overfit the training data — this confirms the model has sufficient capacity.
	# Add regularization incrementally (dropout, weight decay, augmentation) and monitor validation performance.		# Add regularization incrementally ({{Term\|dropout}}, {{Term\|weight decay}}, augmentation) and monitor validation performance.
	# Use early stopping as a safety net.		# Use early stopping as a safety net.
	# Prefer more training data over stronger regularization whenever possible — regularization is a substitute for data, not a replacement.		# Prefer more training data over stronger regularization whenever possible — regularization is a substitute for data, not a replacement.
	# Tune the regularization strength (<math>\lambda</math>, dropout rate) using a validation set, never the test set.		# Tune the regularization strength (<math>\lambda</math>, {{Term\|dropout}} rate) using a validation set, never the test set.

	== See also ==		== See also ==
Line 111:		Line 111:
	== References ==		== References ==

	* Srivastava, N. et al. (2014). "~~Dropout~~: A Simple Way to Prevent Neural Networks from Overfitting". ''JMLR'', 15, 1929–1958.		* Srivastava, N. et al. (2014). "{{Term\|dropout}}: A Simple Way to Prevent Neural Networks from Overfitting". ''JMLR'', 15, 1929–1958.
	* Tibshirani, R. (1996). "Regression Shrinkage and Selection via the Lasso". ''JRSS Series B'', 58(1), 267–288.		* Tibshirani, R. (1996). "Regression Shrinkage and Selection via the Lasso". ''JRSS Series B'', 58(1), 267–288.
	* Goodfellow, I., Bengio, Y. and Courville, A. (2016). ''Deep Learning'', Chapter 7. MIT Press.		* Goodfellow, I., Bengio, Y. and Courville, A. (2016). ''Deep Learning'', Chapter 7. MIT Press.
	* Zhang, C. et al. (2017). "Understanding deep learning requires rethinking generalization". ''ICLR''.		* Zhang, C. et al. (2017). "Understanding {{Term\|deep learning}} requires rethinking generalization". ''ICLR''.
	* Shorten, C. and Khoshgoftaar, T. M. (2019). "A survey on Image Data Augmentation for Deep Learning". ''Journal of Big Data''.		* Shorten, C. and Khoshgoftaar, T. M. (2019). "A survey on Image Data Augmentation for Deep Learning". ''Journal of Big Data''.

	[[Category:Machine Learning]]		[[Category:Machine Learning]]
	[[Category:Intermediate]]		[[Category:Intermediate]]

← Older revision		Revision as of 22:03, 27 April 2026
Line 28:		Line 28:
	== L2 regularization (weight decay) ==		== L2 regularization (weight decay) ==

	~~{{Term\|weight decay\|~~L2 regularization}} adds a penalty proportional to the squared magnitude of the weights:		L2 regularization adds a penalty proportional to the squared magnitude of the weights:

	:<math>J(\theta) = L(\theta) + \frac{\lambda}{2}\\|\theta\\|_2^2 = L(\theta) + \frac{\lambda}{2}\sum_j \theta_j^2</math>		:<math>J(\theta) = L(\theta) + \frac{\lambda}{2}\\|\theta\\|_2^2 = L(\theta) + \frac{\lambda}{2}\sum_j \theta_j^2</math>

	The gradient of the regularization term is <math>\lambda \theta</math>, so each weight is multiplicatively shrunk toward zero at every update — hence the name '''~~{{Term\|~~weight decay}}'''. The ~~{{Term\|~~hyperparameter}} <math>\lambda</math> controls the regularization strength.		The gradient of the regularization term is <math>\lambda \theta</math>, so each weight is multiplicatively shrunk toward zero at every update — hence the name '''weight decay'''. The hyperparameter <math>\lambda</math> controls the regularization strength.

	~~{{Term\|weight decay\|~~L2 regularization}} is equivalent to placing a Gaussian prior on the weights from a Bayesian perspective. It encourages small, distributed weights and discourages any single weight from becoming excessively large.		L2 regularization is equivalent to placing a Gaussian prior on the weights from a Bayesian perspective. It encourages small, distributed weights and discourages any single weight from becoming excessively large.

	== L1 regularization ==		== L1 regularization ==
Line 61:		Line 61:
	== Dropout ==		== Dropout ==

	'''~~{{Term\|dropout}}~~''' (Srivastava et al., 2014) is a regularization technique specific to neural networks. During training, each neuron is randomly "dropped" (set to zero) with probability <math>p</math> at each forward pass. This prevents neurons from co-adapting and forces the network to learn redundant representations.		'''Dropout''' (Srivastava et al., 2014) is a regularization technique specific to neural networks. During training, each neuron is randomly "dropped" (set to zero) with probability <math>p</math> at each forward pass. This prevents neurons from co-adapting and forces the network to learn redundant representations.

	At test time, all neurons are active but their outputs are scaled by <math>(1 - p)</math> to compensate for the larger number of active units (or equivalently, outputs are scaled by <math>1/(1-p)</math> during training — '''inverted ~~{{Term\|~~dropout}}''').		At test time, all neurons are active but their outputs are scaled by <math>(1 - p)</math> to compensate for the larger number of active units (or equivalently, outputs are scaled by <math>1/(1-p)</math> during training — '''inverted dropout''').

	~~{{Term\|dropout}}~~ can be interpreted as an approximate ensemble method: each training step uses a different subnetwork, and the final model approximates the average prediction of exponentially many subnetworks.		Dropout can be interpreted as an approximate ensemble method: each training step uses a different subnetwork, and the final model approximates the average prediction of exponentially many subnetworks.

	== Early stopping ==		== Early stopping ==
Line 71:		Line 71:
	'''Early stopping''' monitors the validation loss during training and halts optimisation when the validation loss stops improving. This is one of the simplest and most effective regularization strategies.		'''Early stopping''' monitors the validation loss during training and halts optimisation when the validation loss stops improving. This is one of the simplest and most effective regularization strategies.

	In practice, a '''patience''' parameter specifies how many ~~{{Term\|epoch\|~~epochs}} to wait after the last improvement before stopping. The model weights are saved at the point of lowest validation loss and restored at the end.		In practice, a '''patience''' parameter specifies how many epochs to wait after the last improvement before stopping. The model weights are saved at the point of lowest validation loss and restored at the end.

	Early stopping acts as an implicit form of regularization: it limits the effective number of training steps, preventing the model from fully memorising the training data.		Early stopping acts as an implicit form of regularization: it limits the effective number of training steps, preventing the model from fully memorising the training data.
Line 90:		Line 90:
	== Other regularization techniques ==		== Other regularization techniques ==

	* '''~~{{Term\|batch~~ normalization}}''' — normalising layer inputs reduces internal covariate shift and has a mild regularizing effect.		* '''Batch normalization''' — normalising layer inputs reduces internal covariate shift and has a mild regularizing effect.
	* '''Label smoothing''' — replaces ~~{{Term\|~~one-hot ~~encoding\|one-hot}}~~ targets with a mixture, e.g. <math>y_{\text{smooth}} = (1 - \epsilon)\, y + \epsilon / C</math>, preventing overconfidence.		* '''Label smoothing''' — replaces one-hot targets with a mixture, e.g. <math>y_{\text{smooth}} = (1 - \epsilon)\, y + \epsilon / C</math>, preventing overconfidence.
	* '''Noise injection''' — adding Gaussian noise to inputs, weights, or gradients during training.		* '''Noise injection''' — adding Gaussian noise to inputs, weights, or gradients during training.

Line 97:		Line 97:

	# Start with a model large enough to overfit the training data — this confirms the model has sufficient capacity.		# Start with a model large enough to overfit the training data — this confirms the model has sufficient capacity.
	# Add regularization incrementally (~~{{Term\|~~dropout}}, ~~{{Term\|~~weight decay}}, augmentation) and monitor validation performance.		# Add regularization incrementally (dropout, weight decay, augmentation) and monitor validation performance.
	# Use early stopping as a safety net.		# Use early stopping as a safety net.
	# Prefer more training data over stronger regularization whenever possible — regularization is a substitute for data, not a replacement.		# Prefer more training data over stronger regularization whenever possible — regularization is a substitute for data, not a replacement.
	# Tune the regularization strength (<math>\lambda</math>, ~~{{Term\|~~dropout}} rate) using a validation set, never the test set.		# Tune the regularization strength (<math>\lambda</math>, dropout rate) using a validation set, never the test set.

	== See also ==		== See also ==
Line 111:		Line 111:
	== References ==		== References ==

	* Srivastava, N. et al. (2014). "~~{{Term\|dropout}}~~: A Simple Way to Prevent Neural Networks from Overfitting". ''JMLR'', 15, 1929–1958.		* Srivastava, N. et al. (2014). "Dropout: A Simple Way to Prevent Neural Networks from Overfitting". ''JMLR'', 15, 1929–1958.
	* Tibshirani, R. (1996). "Regression Shrinkage and Selection via the Lasso". ''JRSS Series B'', 58(1), 267–288.		* Tibshirani, R. (1996). "Regression Shrinkage and Selection via the Lasso". ''JRSS Series B'', 58(1), 267–288.
	* Goodfellow, I., Bengio, Y. and Courville, A. (2016). ''Deep Learning'', Chapter 7. MIT Press.		* Goodfellow, I., Bengio, Y. and Courville, A. (2016). ''Deep Learning'', Chapter 7. MIT Press.
	* Zhang, C. et al. (2017). "Understanding ~~{{Term\|~~deep learning}} requires rethinking generalization". ''ICLR''.		* Zhang, C. et al. (2017). "Understanding deep learning requires rethinking generalization". ''ICLR''.
	* Shorten, C. and Khoshgoftaar, T. M. (2019). "A survey on Image Data Augmentation for Deep Learning". ''Journal of Big Data''.		* Shorten, C. and Khoshgoftaar, T. M. (2019). "A survey on Image Data Augmentation for Deep Learning". ''Journal of Big Data''.

	[[Category:Machine Learning]]		[[Category:Machine Learning]]
	[[Category:Intermediate]]		[[Category:Intermediate]]

← Older revision		Revision as of 19:42, 27 April 2026
Line 28:		Line 28:
	== L2 regularization (weight decay) ==		== L2 regularization (weight decay) ==

	L2 regularization adds a penalty proportional to the squared magnitude of the weights:		{{Term\|weight decay\|L2 regularization}} adds a penalty proportional to the squared magnitude of the weights:

	:<math>J(\theta) = L(\theta) + \frac{\lambda}{2}\\|\theta\\|_2^2 = L(\theta) + \frac{\lambda}{2}\sum_j \theta_j^2</math>		:<math>J(\theta) = L(\theta) + \frac{\lambda}{2}\\|\theta\\|_2^2 = L(\theta) + \frac{\lambda}{2}\sum_j \theta_j^2</math>

	The gradient of the regularization term is <math>\lambda \theta</math>, so each weight is multiplicatively shrunk toward zero at every update — hence the name '''weight decay'''. The hyperparameter <math>\lambda</math> controls the regularization strength.		The gradient of the regularization term is <math>\lambda \theta</math>, so each weight is multiplicatively shrunk toward zero at every update — hence the name '''{{Term\|weight decay}}'''. The {{Term\|hyperparameter}} <math>\lambda</math> controls the regularization strength.

	L2 regularization is equivalent to placing a Gaussian prior on the weights from a Bayesian perspective. It encourages small, distributed weights and discourages any single weight from becoming excessively large.		{{Term\|weight decay\|L2 regularization}} is equivalent to placing a Gaussian prior on the weights from a Bayesian perspective. It encourages small, distributed weights and discourages any single weight from becoming excessively large.

	== L1 regularization ==		== L1 regularization ==
Line 61:		Line 61:
	== Dropout ==		== Dropout ==

	'''~~Dropout~~''' (Srivastava et al., 2014) is a regularization technique specific to neural networks. During training, each neuron is randomly "dropped" (set to zero) with probability <math>p</math> at each forward pass. This prevents neurons from co-adapting and forces the network to learn redundant representations.		'''{{Term\|dropout}}''' (Srivastava et al., 2014) is a regularization technique specific to neural networks. During training, each neuron is randomly "dropped" (set to zero) with probability <math>p</math> at each forward pass. This prevents neurons from co-adapting and forces the network to learn redundant representations.

	At test time, all neurons are active but their outputs are scaled by <math>(1 - p)</math> to compensate for the larger number of active units (or equivalently, outputs are scaled by <math>1/(1-p)</math> during training — '''inverted dropout''').		At test time, all neurons are active but their outputs are scaled by <math>(1 - p)</math> to compensate for the larger number of active units (or equivalently, outputs are scaled by <math>1/(1-p)</math> during training — '''inverted {{Term\|dropout}}''').

	~~Dropout~~ can be interpreted as an approximate ensemble method: each training step uses a different subnetwork, and the final model approximates the average prediction of exponentially many subnetworks.		{{Term\|dropout}} can be interpreted as an approximate ensemble method: each training step uses a different subnetwork, and the final model approximates the average prediction of exponentially many subnetworks.

	== Early stopping ==		== Early stopping ==
Line 71:		Line 71:
	'''Early stopping''' monitors the validation loss during training and halts optimisation when the validation loss stops improving. This is one of the simplest and most effective regularization strategies.		'''Early stopping''' monitors the validation loss during training and halts optimisation when the validation loss stops improving. This is one of the simplest and most effective regularization strategies.

	In practice, a '''patience''' parameter specifies how many epochs to wait after the last improvement before stopping. The model weights are saved at the point of lowest validation loss and restored at the end.		In practice, a '''patience''' parameter specifies how many {{Term\|epoch\|epochs}} to wait after the last improvement before stopping. The model weights are saved at the point of lowest validation loss and restored at the end.

	Early stopping acts as an implicit form of regularization: it limits the effective number of training steps, preventing the model from fully memorising the training data.		Early stopping acts as an implicit form of regularization: it limits the effective number of training steps, preventing the model from fully memorising the training data.
Line 90:		Line 90:
	== Other regularization techniques ==		== Other regularization techniques ==

	* '''~~Batch~~ normalization''' — normalising layer inputs reduces internal covariate shift and has a mild regularizing effect.		* '''{{Term\|batch normalization}}''' — normalising layer inputs reduces internal covariate shift and has a mild regularizing effect.
	* '''Label smoothing''' — replaces one-hot targets with a mixture, e.g. <math>y_{\text{smooth}} = (1 - \epsilon)\, y + \epsilon / C</math>, preventing overconfidence.		* '''Label smoothing''' — replaces {{Term\|one-hot encoding\|one-hot}} targets with a mixture, e.g. <math>y_{\text{smooth}} = (1 - \epsilon)\, y + \epsilon / C</math>, preventing overconfidence.
	* '''Noise injection''' — adding Gaussian noise to inputs, weights, or gradients during training.		* '''Noise injection''' — adding Gaussian noise to inputs, weights, or gradients during training.

Line 97:		Line 97:

	# Start with a model large enough to overfit the training data — this confirms the model has sufficient capacity.		# Start with a model large enough to overfit the training data — this confirms the model has sufficient capacity.
	# Add regularization incrementally (dropout, weight decay, augmentation) and monitor validation performance.		# Add regularization incrementally ({{Term\|dropout}}, {{Term\|weight decay}}, augmentation) and monitor validation performance.
	# Use early stopping as a safety net.		# Use early stopping as a safety net.
	# Prefer more training data over stronger regularization whenever possible — regularization is a substitute for data, not a replacement.		# Prefer more training data over stronger regularization whenever possible — regularization is a substitute for data, not a replacement.
	# Tune the regularization strength (<math>\lambda</math>, dropout rate) using a validation set, never the test set.		# Tune the regularization strength (<math>\lambda</math>, {{Term\|dropout}} rate) using a validation set, never the test set.

	== See also ==		== See also ==
Line 111:		Line 111:
	== References ==		== References ==

	* Srivastava, N. et al. (2014). "~~Dropout~~: A Simple Way to Prevent Neural Networks from Overfitting". ''JMLR'', 15, 1929–1958.		* Srivastava, N. et al. (2014). "{{Term\|dropout}}: A Simple Way to Prevent Neural Networks from Overfitting". ''JMLR'', 15, 1929–1958.
	* Tibshirani, R. (1996). "Regression Shrinkage and Selection via the Lasso". ''JRSS Series B'', 58(1), 267–288.		* Tibshirani, R. (1996). "Regression Shrinkage and Selection via the Lasso". ''JRSS Series B'', 58(1), 267–288.
	* Goodfellow, I., Bengio, Y. and Courville, A. (2016). ''Deep Learning'', Chapter 7. MIT Press.		* Goodfellow, I., Bengio, Y. and Courville, A. (2016). ''Deep Learning'', Chapter 7. MIT Press.
	* Zhang, C. et al. (2017). "Understanding deep learning requires rethinking generalization". ''ICLR''.		* Zhang, C. et al. (2017). "Understanding {{Term\|deep learning}} requires rethinking generalization". ''ICLR''.
	* Shorten, C. and Khoshgoftaar, T. M. (2019). "A survey on Image Data Augmentation for Deep Learning". ''Journal of Big Data''.		* Shorten, C. and Khoshgoftaar, T. M. (2019). "A survey on Image Data Augmentation for Deep Learning". ''Journal of Big Data''.

	[[Category:Machine Learning]]		[[Category:Machine Learning]]
	[[Category:Intermediate]]		[[Category:Intermediate]]