| Research Paper | |
|---|---|
| Authors | Nitish Srivastava; Geoffrey Hinton; Alex Krizhevsky; Ilya Sutskever; Ruslan Salakhutdinov |
| Year | 2014 |
| Venue | JMLR |
| Topic area | Deep Learning |
| Difficulty | Research |
| arXiv | 1207.0580 |
Dropout: A Simple Way to Prevent Neural Networks from Overfitting is a 2014 paper by Srivastava et al. published in the Journal of Machine Learning Research. The paper formalized and extensively evaluated dropout, a regularization technique in which randomly selected neurons are temporarily removed during training. Dropout prevents complex co-adaptations between neurons, effectively training an exponentially large ensemble of sub-networks within a single architecture, and became one of the most widely used regularization methods in deep learning.
Overview
Deep neural networks with many parameters are powerful function approximators but are prone to overfitting, especially when training data is limited. Traditional regularization methods such as L2 weight decay and early stopping provided some relief, but were often insufficient for large networks. Model combination — training multiple models and averaging their predictions — was known to reduce overfitting but was computationally expensive.
Dropout provides an efficient approximation to model combination. During each training step, each neuron (including input units) is retained with a probability $ p $ and dropped (set to zero) with probability $ 1 - p $. This means that on each training case, a different "thinned" sub-network is sampled. At test time, all neurons are used but their outputs are scaled by $ p $ to approximate the expected output of the ensemble.
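As a concrete check of the scaling rule: with retention probability $ p = 0.5 $, a unit whose output is $ h_i = 2 $ contributes $ r_i \cdot h_i $ during training, which is $ 2 $ half the time and $ 0 $ otherwise, for an expected contribution of $ p \cdot h_i = 1 $; multiplying the always-present unit's output by $ p $ at test time reproduces exactly this expectation.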
Key Contributions
- Dropout regularization: A training procedure that randomly omits neurons during each forward and backward pass, preventing neurons from developing overly specialized co-adaptations.
- Ensemble interpretation: Theoretical motivation of dropout as approximate model averaging over $ 2^n $ possible thinned networks (where $ n $ is the number of droppable units), with shared weights.
- Comprehensive empirical evaluation: Demonstration of consistent improvements across diverse domains including vision, speech recognition, text classification, and computational biology.
- Practical guidelines: Recommendations for dropout rates ($ p = 0.5 $ for hidden units, $ p = 0.8 $ for input units, where $ p $ is the retention probability) and their interactions with other hyperparameters; see the configuration sketch after this list.
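A minimal configuration sketch of these recommended rates, assuming PyTorch (layer sizes are illustrative; note that torch.nn.Dropout takes the drop probability, i.e. $ 1 - p $ in the paper's notation, and implements inverted dropout):

```python
import torch.nn as nn

model = nn.Sequential(
    nn.Dropout(p=0.2),   # input units: retain with p = 0.8, so drop 0.2
    nn.Linear(784, 1024),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # hidden units: retain with p = 0.5, so drop 0.5
    nn.Linear(1024, 10),
)

model.train()  # dropout active during training
model.eval()   # dropout disabled at test time (no rescaling needed)
```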
Methods
During training, for each training example and each layer, each neuron's output is independently set to zero with probability $ 1 - p $. If $ h_i $ is the output of neuron $ i $, the dropout operation applies:
$ r_i \sim \text{Bernoulli}(p) $
$ \tilde{h}_i = r_i \cdot h_i $
where $ r_i $ is a random mask variable. The dropped-out network is then used for the forward pass and backpropagation on that training case. Different random masks are drawn for each training example and each gradient step.
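A minimal NumPy sketch of this training-time operation (batch and layer sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(h, p, rng):
    """Keep each activation with probability p; zero it otherwise."""
    r = rng.binomial(1, p, size=h.shape)  # r_i ~ Bernoulli(p)
    return r * h, r                       # thinned activations and the mask

h = rng.standard_normal((32, 100))  # a batch of hidden-layer outputs
h_tilde, mask = dropout_forward(h, p=0.5, rng=rng)
# Backpropagation uses the same mask: gradients flow only through kept units.
```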
At test time, no units are dropped. Instead, the output of each neuron is multiplied by $ p $ to match the expected value during training:
$ h_i^{\text{test}} = p \cdot h_i $
This weight scaling inference rule ensures that the expected output of each neuron at test time equals its expected output during training. An equivalent alternative, inverted dropout, scales activations by $ 1/p $ during training so that no modification is needed at test time. This approach is more common in modern implementations.
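The two inference conventions can be sketched side by side in NumPy (values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5
h = rng.standard_normal(100)  # activations of one layer

# (a) Standard dropout: train on r * h, then scale by p at test time.
h_test_scaled = p * h

# (b) Inverted dropout: divide by p during training, leave test time unchanged.
r = rng.binomial(1, p, size=h.shape)
h_train_inverted = (r * h) / p  # expectation equals h
h_test_inverted = h             # identity at test time

# Either way, the test-time activation equals the expected training-time activation.
```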
The authors showed that dropout can be interpreted as training an ensemble of $ 2^n $ sub-networks that share weights. At test time, the scaled full network provides a geometric mean approximation to the ensemble prediction, which the authors proved is exact for a single layer with softmax output.
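This exactness claim can be checked numerically for a single logistic unit by enumerating all $ 2^n $ masks (a NumPy sketch with arbitrary weights and input; $ p = 0.5 $ is used so that all masks are equally likely and the uniform geometric mean applies directly):

```python
import itertools
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n = 4                       # kept tiny so all 2^n masks can be enumerated
w = rng.standard_normal(n)  # arbitrary weights
x = rng.standard_normal(n)  # arbitrary input
p = 0.5

# Output of every thinned network (one per Bernoulli mask over the inputs).
outputs = np.array([
    sigmoid(np.dot(np.array(mask) * w, x))
    for mask in itertools.product([0, 1], repeat=n)
])

# Normalized geometric mean of the ensemble's predictions.
g_yes = np.exp(np.log(outputs).mean())
g_no = np.exp(np.log(1.0 - outputs).mean())
ensemble = g_yes / (g_yes + g_no)

# Weight-scaled prediction of the single full network.
scaled = sigmoid(p * np.dot(w, x))

assert np.isclose(ensemble, scaled)  # the two predictions agree
```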
The paper also explored dropout in combination with other regularizers, finding that pairing dropout with max-norm constraints (rescaling each unit's incoming weight vector whenever its L2 norm exceeds a fixed cap $ c $) and a large, decaying learning rate with high momentum produced the best results.
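A minimal sketch of the max-norm projection, assuming plain NumPy SGD (the learning rate, gradient, and sizes are stand-ins; the paper reports typical norm caps $ c $ of 3 to 4):

```python
import numpy as np

def max_norm_project(W, c):
    """Rescale each row (one unit's incoming weight vector) to L2 norm at most c."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    return W * np.minimum(1.0, c / np.maximum(norms, 1e-12))

rng = np.random.default_rng(0)
W = rng.standard_normal((1024, 784))   # weights of one layer
grad = rng.standard_normal(W.shape)    # stand-in for a backprop gradient
lr, c = 1.0, 3.0                       # large learning rate paired with max-norm

W = max_norm_project(W - lr * grad, c)  # gradient step, then project
```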
Results
Dropout was evaluated across multiple benchmarks and consistently reduced test error:
- MNIST (handwritten digits): Error reduced from 1.60% to 1.25% with dropout on a standard feedforward network.
- CIFAR-10/CIFAR-100: Significant error reductions on convolutional networks; relative improvement of approximately 15-25% on CIFAR-100.
- SVHN (Street View House Numbers): Error reduced from 2.80% to 2.68%.
- ImageNet: Dropout improved the top-1 error of a large convolutional network by approximately 2 percentage points.
- TIMIT (speech recognition): Consistent improvements across various architecture sizes.
- Reuters (text classification): Improved performance on a bag-of-words text classification task.
The paper also analyzed the features learned by networks trained with dropout, finding that hidden units developed more distinct and individually meaningful features compared to networks without dropout, which tended to learn redundant co-adapted features.
Impact
Dropout became standard practice in neural network training throughout the 2010s, included by default in most deep learning frameworks. Its conceptual simplicity and consistent effectiveness made it one of the most cited papers in machine learning. The idea of stochastic regularization through random perturbation during training influenced many subsequent techniques, including DropConnect, DropBlock, stochastic depth, and data augmentation strategies.
While batch normalization and other techniques have reduced the necessity of dropout in some convolutional architectures, dropout remains widely used in fully connected layers, Transformer models, and whenever overfitting is a concern. The paper established randomized regularization as a core principle in deep learning methodology.
See also
- ImageNet Classification with Deep CNNs
- Batch Normalization Accelerating Deep Network Training
- Deep Residual Learning for Image Recognition
References
- Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research, 15, 1929-1958.
- Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2012). Improving Neural Networks by Preventing Co-adaptation of Feature Detectors. arXiv:1207.0580.
- Wan, L., Zeiler, M., Zhang, S., LeCun, Y., & Fergus, R. (2013). Regularization of Neural Networks using DropConnect. ICML 2013.