Dropout A Simple Way to Prevent Overfitting - Revision history

DeployBot: Marked this version for translation

2026-04-27T02:53:24Z

Marked this version for translation

← Older revision		Revision as of 02:53, 27 April 2026
Line 18:		Line 18:
	'''Dropout: A Simple Way to Prevent Neural Networks from Overfitting''' is a 2014 paper by Srivastava et al. published in the Journal of Machine Learning Research. The paper formalized and extensively evaluated '''dropout''', a regularization technique in which randomly selected neurons are temporarily removed during training. Dropout prevents complex co-adaptations between neurons, effectively training an exponentially large ensemble of sub-networks within a single architecture, and became one of the most widely used regularization methods in deep learning.		'''Dropout: A Simple Way to Prevent Neural Networks from Overfitting''' is a 2014 paper by Srivastava et al. published in the Journal of Machine Learning Research. The paper formalized and extensively evaluated '''dropout''', a regularization technique in which randomly selected neurons are temporarily removed during training. Dropout prevents complex co-adaptations between neurons, effectively training an exponentially large ensemble of sub-networks within a single architecture, and became one of the most widely used regularization methods in deep learning.

	<!--T:3-->		== Overview == <!--T:3-->
	~~== Overview ==~~

	<!--T:4-->		<!--T:4-->
Line 27:		Line 26:
	Dropout provides an efficient approximation to model combination. During each training step, each neuron (including input units) is retained with a probability <math>p</math> and dropped (set to zero) with probability <math>1 - p</math>. This means that on each training case, a different "thinned" sub-network is sampled. At test time, all neurons are used but their outputs are scaled by <math>p</math> to approximate the expected output of the ensemble.		Dropout provides an efficient approximation to model combination. During each training step, each neuron (including input units) is retained with a probability <math>p</math> and dropped (set to zero) with probability <math>1 - p</math>. This means that on each training case, a different "thinned" sub-network is sampled. At test time, all neurons are used but their outputs are scaled by <math>p</math> to approximate the expected output of the ensemble.

	<!--T:6-->		== Key Contributions == <!--T:6-->
	~~== Key Contributions ==~~

	<!--T:7-->		<!--T:7-->
Line 36:		Line 34:
	* '''Practical guidelines''': Recommended retention probabilities (<math>p = 0.5</math> for hidden units, <math>p = 0.8</math> for input units) and guidance on how dropout interacts with other hyperparameters.		* '''Practical guidelines''': Recommended retention probabilities (<math>p = 0.5</math> for hidden units, <math>p = 0.8</math> for input units) and guidance on how dropout interacts with other hyperparameters.

	<!--T:8-->		== Methods == <!--T:8-->
	~~== Methods ==~~

	<!--T:9-->		<!--T:9-->
Line 66:		Line 63:
	The paper also explored dropout with other regularizers, finding that combining dropout with max-norm constraints (clipping the weight vector to have a maximum L2 norm) and large decayed learning rates produced the best results.		The paper also explored dropout with other regularizers, finding that combining dropout with max-norm constraints (clipping the weight vector to have a maximum L2 norm) and large decayed learning rates produced the best results.

	<!--T:18-->		== Results == <!--T:18-->
	~~== Results ==~~

	<!--T:19-->		<!--T:19-->
Line 83:		Line 79:
	The paper also analyzed the features learned by networks trained with dropout, finding that hidden units developed more distinct and individually meaningful features compared to networks without dropout, which tended to learn redundant co-adapted features.		The paper also analyzed the features learned by networks trained with dropout, finding that hidden units developed more distinct and individually meaningful features compared to networks without dropout, which tended to learn redundant co-adapted features.

	<!--T:22-->		== Impact == <!--T:22-->
	~~== Impact ==~~

	<!--T:23-->		<!--T:23-->
Line 92:		Line 87:
	While batch normalization and other techniques have reduced the necessity of dropout in some convolutional architectures, dropout remains widely used in fully connected layers, Transformer models, and whenever overfitting is a concern. The paper established randomized regularization as a core principle in deep learning methodology.		While batch normalization and other techniques have reduced the necessity of dropout in some convolutional architectures, dropout remains widely used in fully connected layers, Transformer models, and whenever overfitting is a concern. The paper established randomized regularization as a core principle in deep learning methodology.

	<!--T:25-->		== See also == <!--T:25-->
	~~== See also ==~~

	<!--T:26-->		<!--T:26-->
Line 100:		Line 94:
	* [[Deep Residual Learning for Image Recognition]]		* [[Deep Residual Learning for Image Recognition]]

	<!--T:27-->		== References == <!--T:27-->
	~~== References ==~~

	<!--T:28-->		<!--T:28-->

DeployBot: [deploy-bot] Drop {{LanguageBar}} (v1.4.1)

2026-04-27T02:35:51Z

[deploy-bot] Drop {{LanguageBar}} (v1.4.1)

← Older revision		Revision as of 02:35, 27 April 2026
Line 1:		Line 1:
	<languages />		<languages />
	~~{{LanguageBar \| page = Dropout A Simple Way to Prevent Overfitting}}~~

	<translate>		<translate>
Line 19:		Line 18:
	'''Dropout: A Simple Way to Prevent Neural Networks from Overfitting''' is a 2014 paper by Srivastava et al. published in the Journal of Machine Learning Research. The paper formalized and extensively evaluated '''dropout''', a regularization technique in which randomly selected neurons are temporarily removed during training. Dropout prevents complex co-adaptations between neurons, effectively training an exponentially large ensemble of sub-networks within a single architecture, and became one of the most widely used regularization methods in deep learning.		'''Dropout: A Simple Way to Prevent Neural Networks from Overfitting''' is a 2014 paper by Srivastava et al. published in the Journal of Machine Learning Research. The paper formalized and extensively evaluated '''dropout''', a regularization technique in which randomly selected neurons are temporarily removed during training. Dropout prevents complex co-adaptations between neurons, effectively training an exponentially large ensemble of sub-networks within a single architecture, and became one of the most widely used regularization methods in deep learning.

	~~== Overview ==~~ <!--T:3-->		<!--T:3-->
			== Overview ==

	<!--T:4-->		<!--T:4-->
Line 27:		Line 27:
	Dropout provides an efficient approximation to model combination. During each training step, each neuron (including input units) is retained with a probability <math>p</math> and dropped (set to zero) with probability <math>1 - p</math>. This means that on each training case, a different "thinned" sub-network is sampled. At test time, all neurons are used but their outputs are scaled by <math>p</math> to approximate the expected output of the ensemble.		Dropout provides an efficient approximation to model combination. During each training step, each neuron (including input units) is retained with a probability <math>p</math> and dropped (set to zero) with probability <math>1 - p</math>. This means that on each training case, a different "thinned" sub-network is sampled. At test time, all neurons are used but their outputs are scaled by <math>p</math> to approximate the expected output of the ensemble.

	~~== Key Contributions ==~~ <!--T:6-->		<!--T:6-->
			== Key Contributions ==

	<!--T:7-->		<!--T:7-->
Line 35:		Line 36:
	* '''Practical guidelines''': Recommended retention probabilities (<math>p = 0.5</math> for hidden units, <math>p = 0.8</math> for input units) and guidance on how dropout interacts with other hyperparameters.		* '''Practical guidelines''': Recommended retention probabilities (<math>p = 0.5</math> for hidden units, <math>p = 0.8</math> for input units) and guidance on how dropout interacts with other hyperparameters.

	~~== Methods ==~~ <!--T:8-->		<!--T:8-->
			== Methods ==

	<!--T:9-->		<!--T:9-->
Line 64:		Line 66:
	The paper also explored dropout with other regularizers, finding that combining dropout with max-norm constraints (clipping the weight vector to have a maximum L2 norm) and large decayed learning rates produced the best results.		The paper also explored dropout with other regularizers, finding that combining dropout with max-norm constraints (clipping the weight vector to have a maximum L2 norm) and large decayed learning rates produced the best results.

	~~== Results ==~~ <!--T:18-->		<!--T:18-->
			== Results ==

	<!--T:19-->		<!--T:19-->
Line 80:		Line 83:
	The paper also analyzed the features learned by networks trained with dropout, finding that hidden units developed more distinct and individually meaningful features compared to networks without dropout, which tended to learn redundant co-adapted features.		The paper also analyzed the features learned by networks trained with dropout, finding that hidden units developed more distinct and individually meaningful features compared to networks without dropout, which tended to learn redundant co-adapted features.

	~~== Impact ==~~ <!--T:22-->		<!--T:22-->
			== Impact ==

	<!--T:23-->		<!--T:23-->
Line 88:		Line 92:
	While batch normalization and other techniques have reduced the necessity of dropout in some convolutional architectures, dropout remains widely used in fully connected layers, Transformer models, and whenever overfitting is a concern. The paper established randomized regularization as a core principle in deep learning methodology.		While batch normalization and other techniques have reduced the necessity of dropout in some convolutional architectures, dropout remains widely used in fully connected layers, Transformer models, and whenever overfitting is a concern. The paper established randomized regularization as a core principle in deep learning methodology.

	~~== See also ==~~ <!--T:25-->		<!--T:25-->
			== See also ==

	<!--T:26-->		<!--T:26-->
Line 95:		Line 100:
	* [[Deep Residual Learning for Image Recognition]]		* [[Deep Residual Learning for Image Recognition]]

	~~== References ==~~ <!--T:27-->		<!--T:27-->
			== References ==

	<!--T:28-->		<!--T:28-->

DeployBot: Marked this version for translation

2026-04-27T00:31:35Z

Marked this version for translation

← Older revision		Revision as of 00:31, 27 April 2026
Line 19:		Line 19:
	'''Dropout: A Simple Way to Prevent Neural Networks from Overfitting''' is a 2014 paper by Srivastava et al. published in the Journal of Machine Learning Research. The paper formalized and extensively evaluated '''dropout''', a regularization technique in which randomly selected neurons are temporarily removed during training. Dropout prevents complex co-adaptations between neurons, effectively training an exponentially large ensemble of sub-networks within a single architecture, and became one of the most widely used regularization methods in deep learning.		'''Dropout: A Simple Way to Prevent Neural Networks from Overfitting''' is a 2014 paper by Srivastava et al. published in the Journal of Machine Learning Research. The paper formalized and extensively evaluated '''dropout''', a regularization technique in which randomly selected neurons are temporarily removed during training. Dropout prevents complex co-adaptations between neurons, effectively training an exponentially large ensemble of sub-networks within a single architecture, and became one of the most widely used regularization methods in deep learning.

	<!--T:3-->		== Overview == <!--T:3-->
	~~== Overview ==~~

	<!--T:4-->		<!--T:4-->
Line 28:		Line 27:
	Dropout provides an efficient approximation to model combination. During each training step, each neuron (including input units) is retained with a probability <math>p</math> and dropped (set to zero) with probability <math>1 - p</math>. This means that on each training case, a different "thinned" sub-network is sampled. At test time, all neurons are used but their outputs are scaled by <math>p</math> to approximate the expected output of the ensemble.		Dropout provides an efficient approximation to model combination. During each training step, each neuron (including input units) is retained with a probability <math>p</math> and dropped (set to zero) with probability <math>1 - p</math>. This means that on each training case, a different "thinned" sub-network is sampled. At test time, all neurons are used but their outputs are scaled by <math>p</math> to approximate the expected output of the ensemble.

	<!--T:6-->		== Key Contributions == <!--T:6-->
	~~== Key Contributions ==~~

	<!--T:7-->		<!--T:7-->
Line 37:		Line 35:
	* '''Practical guidelines''': Recommended retention probabilities (<math>p = 0.5</math> for hidden units, <math>p = 0.8</math> for input units) and guidance on how dropout interacts with other hyperparameters.		* '''Practical guidelines''': Recommended retention probabilities (<math>p = 0.5</math> for hidden units, <math>p = 0.8</math> for input units) and guidance on how dropout interacts with other hyperparameters.

	<!--T:8-->		== Methods == <!--T:8-->
	~~== Methods ==~~

	<!--T:9-->		<!--T:9-->
Line 67:		Line 64:
	The paper also explored dropout with other regularizers, finding that combining dropout with max-norm constraints (clipping the weight vector to have a maximum L2 norm) and large decayed learning rates produced the best results.		The paper also explored dropout with other regularizers, finding that combining dropout with max-norm constraints (clipping the weight vector to have a maximum L2 norm) and large decayed learning rates produced the best results.

	<!--T:18-->		== Results == <!--T:18-->
	~~== Results ==~~

	<!--T:19-->		<!--T:19-->
Line 84:		Line 80:
	The paper also analyzed the features learned by networks trained with dropout, finding that hidden units developed more distinct and individually meaningful features compared to networks without dropout, which tended to learn redundant co-adapted features.		The paper also analyzed the features learned by networks trained with dropout, finding that hidden units developed more distinct and individually meaningful features compared to networks without dropout, which tended to learn redundant co-adapted features.

	<!--T:22-->		== Impact == <!--T:22-->
	~~== Impact ==~~

	<!--T:23-->		<!--T:23-->
Line 93:		Line 88:
	While batch normalization and other techniques have reduced the necessity of dropout in some convolutional architectures, dropout remains widely used in fully connected layers, Transformer models, and whenever overfitting is a concern. The paper established randomized regularization as a core principle in deep learning methodology.		While batch normalization and other techniques have reduced the necessity of dropout in some convolutional architectures, dropout remains widely used in fully connected layers, Transformer models, and whenever overfitting is a concern. The paper established randomized regularization as a core principle in deep learning methodology.

	<!--T:25-->		== See also == <!--T:25-->
	~~== See also ==~~

	<!--T:26-->		<!--T:26-->
Line 101:		Line 95:
	* [[Deep Residual Learning for Image Recognition]]		* [[Deep Residual Learning for Image Recognition]]

	<!--T:27-->		== References == <!--T:27-->
	~~== References ==~~

	<!--T:28-->		<!--T:28-->

DeployBot: [deploy-bot] Convert Dropout A Simple Way to Prevent Overfitting to Translate-extension page

2026-04-27T00:31:33Z

[deploy-bot] Convert Dropout A Simple Way to Prevent Overfitting to Translate-extension page

New page

<languages />
{{LanguageBar | page = Dropout A Simple Way to Prevent Overfitting}}

<translate>

{{PaperInfobox
| topic_area = Deep Learning
| difficulty = Research
| authors = Nitish Srivastava; Geoffrey Hinton; Alex Krizhevsky; Ilya Sutskever; Ruslan Salakhutdinov
| year = 2014
| venue = JMLR
| arxiv_id = 1207.0580
| source_url = https://arxiv.org/abs/1207.0580
| pdf_url = https://arxiv.org/pdf/1207.0580
}}
{{ContentMeta | generated_by = claude-opus | model_used = claude-opus-4-6 | generated_date = 2026-03-13}}


'''Dropout: A Simple Way to Prevent Neural Networks from Overfitting''' is a 2014 paper by Srivastava et al. published in the Journal of Machine Learning Research. The paper formalized and extensively evaluated '''dropout''', a regularization technique in which randomly selected neurons are temporarily removed during training. Dropout prevents complex co-adaptations between neurons, effectively training an exponentially large ensemble of sub-networks within a single architecture, and became one of the most widely used regularization methods in deep learning.


== Overview ==


Deep neural networks with many parameters are powerful function approximators but are prone to overfitting, especially when training data is limited. Traditional regularization methods such as L2 weight decay and early stopping provide some relief but are often insufficient for large networks. Model combination (training multiple models and averaging their predictions) was known to reduce overfitting but was computationally expensive.


Dropout provides an efficient approximation to model combination. During each training step, each neuron (including input units) is retained with a probability <math>p</math> and dropped (set to zero) with probability <math>1 - p</math>. This means that on each training case, a different "thinned" sub-network is sampled. At test time, all neurons are used but their outputs are scaled by <math>p</math> to approximate the expected output of the ensemble.


== Key Contributions ==


* '''Dropout regularization''': A training procedure that randomly omits neurons during each forward and backward pass, preventing neurons from developing overly specialized co-adaptations.
* '''Ensemble interpretation''': Theoretical motivation of dropout as approximate model averaging over <math>2^n</math> possible thinned networks (where <math>n</math> is the number of droppable units), with shared weights.
* '''Comprehensive empirical evaluation''': Demonstration of consistent improvements across diverse domains including vision, speech recognition, text classification, and computational biology.
* '''Practical guidelines''': Recommended retention probabilities (<math>p = 0.5</math> for hidden units, <math>p = 0.8</math> for input units) and guidance on how dropout interacts with other hyperparameters.


== Methods ==


During training, for each training example and each layer, each neuron's output is independently set to zero with probability <math>1 - p</math>. If <math>h_i</math> is the output of neuron <math>i</math>, the dropout operation applies:


<math>r_i \sim \text{Bernoulli}(p)</math>


<math>\tilde{h}_i = r_i \cdot h_i</math>


where <math>r_i</math> is a random mask variable. The dropped-out network is then used for the forward pass and backpropagation on that training case. Different random masks are drawn for each training example and each gradient step.
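

The masking step above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation; the function name <code>dropout_train</code> and the array layout (one row per training example) are assumptions made for the example.

```python
import numpy as np

def dropout_train(h, p, rng):
    """Apply a fresh retention mask to the activations h (hypothetical helper).

    h   : array of activations, one row per training example
    p   : retention probability (a unit is kept with probability p)
    rng : a numpy.random.Generator
    """
    # r_i ~ Bernoulli(p), drawn independently for every unit of every example
    r = rng.binomial(1, p, size=h.shape)
    # h_tilde_i = r_i * h_i : dropped units output exactly zero
    return r * h, r
```

Calling the function again on the same activations draws a new mask, which corresponds to sampling a different thinned sub-network on each gradient step.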


At test time, no units are dropped. Instead, the outgoing weights of each neuron are multiplied by <math>p</math>, which is equivalent to scaling the neuron's output by <math>p</math> and matches its expected value during training:


<math>h_i^{\text{test}} = p \cdot h_i</math>


This '''weight scaling inference rule''' ensures that the expected output of each neuron at test time equals its expected output during training. An equivalent alternative, '''inverted dropout''', scales activations by <math>1/p</math> during training so that no modification is needed at test time. This approach is more common in modern implementations.
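

Both variants rest on the fact that <math>\mathbb{E}[r_i \cdot h_i] = p \cdot h_i</math> when <math>r_i \sim \text{Bernoulli}(p)</math>. The sketch below places the two conventions side by side; the function names and the <code>train</code> flag are illustrative assumptions, not taken from the paper or from any particular framework.

```python
import numpy as np

def standard_dropout(h, p, rng, train):
    """Mask during training, scale by p at test time (hypothetical helper)."""
    if train:
        return rng.binomial(1, p, size=h.shape) * h
    return p * h  # test time: matches the expected training-time output

def inverted_dropout(h, p, rng, train):
    """Mask and rescale by 1/p during training, identity at test time."""
    if train:
        return rng.binomial(1, p, size=h.shape) * h / p
    return h  # test time: no modification needed
```

In both cases the average training-time output of a unit agrees with its test-time output; the inverted form simply moves the scaling into the training pass so that inference code does not need to know <math>p</math>.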


The authors showed that dropout can be interpreted as training an ensemble of <math>2^n</math> sub-networks that share weights. At test time, the scaled full network provides a geometric mean approximation to the ensemble prediction, which the authors proved is exact for a single layer with softmax output.
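

The exactness claim can be checked numerically on a small case. The sketch below assumes the simplest setting: dropout applied to the inputs of a single softmax layer, with the ensemble prediction taken as the renormalized geometric mean of all <math>2^n</math> thinned models, each weighted by the probability of its mask. All names are hypothetical and the brute-force enumeration is only practical for small <math>n</math>.

```python
import itertools
import numpy as np

def softmax(z):
    z = z - z.max()          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def geometric_mean_ensemble(W, b, x, p):
    """Renormalized, mask-probability-weighted geometric mean over all 2^n masks."""
    n = x.shape[0]
    log_mix = np.zeros(W.shape[0])
    for mask in itertools.product([0, 1], repeat=n):
        r = np.array(mask, dtype=float)
        prob = np.prod(np.where(r == 1, p, 1 - p))   # probability of this mask
        log_mix += prob * np.log(softmax(W @ (r * x) + b))
    return softmax(log_mix)  # renormalize the geometric mean

def weight_scaled(W, b, x, p):
    """Single forward pass with the weights scaled by p."""
    return softmax((p * W) @ x + b)
```

For this one-layer model the two functions return the same distribution up to floating-point error, because the per-mask normalizers do not depend on the class and cancel on renormalization. With hidden nonlinearities in between, weight scaling is only an approximation.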


The paper also explored dropout with other regularizers, finding that combining dropout with max-norm constraints (clipping the weight vector to have a maximum L2 norm) and large decayed learning rates produced the best results.
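

A max-norm constraint of this kind is typically enforced by projecting the weights after each update. The following is a minimal sketch, assuming each column of <code>W</code> holds the incoming weights of one unit and that <code>c</code> is the norm bound; both the layout and the function name are assumptions for illustration.

```python
import numpy as np

def max_norm_project(W, c):
    """Rescale any column of W whose L2 norm exceeds c back onto the ball of radius c."""
    norms = np.linalg.norm(W, axis=0, keepdims=True)       # one norm per unit
    scale = np.minimum(1.0, c / np.maximum(norms, 1e-12))  # guard against zero norms
    return W * scale                                       # columns within the bound are unchanged
```

Because the projection bounds the weights regardless of step size, it is compatible with the large learning rates mentioned above.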


== Results ==


Dropout was evaluated across multiple benchmarks and consistently reduced test error:


* '''MNIST''' (handwritten digits): Error reduced from 1.60% to 1.25% with dropout on a standard feedforward network.
* '''CIFAR-10/CIFAR-100''': Significant error reductions on convolutional networks; relative improvement of approximately 15-25% on CIFAR-100.
* '''SVHN''' (Street View House Numbers): Error reduced from 3.95% to 2.55% when dropout was applied to all layers of a convolutional network.
* '''ImageNet''': Dropout improved the top-1 error of a large convolutional network by approximately 2 percentage points.
* '''TIMIT''' (speech recognition): Consistent improvements across various architecture sizes.
* '''Reuters''' (text classification): Improved performance on a bag-of-words text classification task.
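

Absolute and relative improvements are easy to conflate when reading these figures. As a worked illustration using the MNIST numbers quoted in the list above:

<math>\frac{1.60 - 1.25}{1.60} \approx 0.219</math>

That is, an absolute reduction of 0.35 percentage points corresponds to a relative error reduction of roughly 22%.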


The paper also analyzed the features learned by networks trained with dropout, finding that hidden units developed more distinct and individually meaningful features compared to networks without dropout, which tended to learn redundant co-adapted features.


== Impact ==


Dropout became standard practice in neural network training throughout the 2010s and is built into most deep learning frameworks. The conceptual simplicity and consistent effectiveness of the technique made the paper one of the most cited in machine learning. The idea of stochastic regularization through random perturbation during training influenced many subsequent techniques, including DropConnect, DropBlock, and stochastic depth, as well as some data augmentation strategies.


While batch normalization and other techniques have reduced the necessity of dropout in some convolutional architectures, dropout remains widely used in fully connected layers, Transformer models, and whenever overfitting is a concern. The paper established randomized regularization as a core principle in deep learning methodology.


== See also ==


* [[ImageNet Classification with Deep CNNs]]
* [[Batch Normalization Accelerating Deep Network Training]]
* [[Deep Residual Learning for Image Recognition]]


== References ==


* Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A Simple Way to Prevent Neural Networks from Overfitting. ''Journal of Machine Learning Research 15'', 1929-1958. [https://jmlr.org/papers/v15/srivastava14a.html JMLR]
* Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2012). Improving Neural Networks by Preventing Co-adaptation of Feature Detectors. ''arXiv:1207.0580''.
* Wan, L., Zeiler, M., Zhang, S., Le Cun, Y., & Fergus, R. (2013). Regularization of Neural Networks using DropConnect. ''ICML 2013''.


[[Category:Deep Learning]] [[Category:Research]] [[Category:Research Papers]]
</translate>