Dropout: A Simple Way to Prevent Neural Networks from Overfitting
| Research Paper | |
|---|---|
| Authors | Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R. |
| Year | 2014 |
| Venue | Journal of Machine Learning Research |
| Topic area | Machine Learning |
| Difficulty | Research |
Dropout: A Simple Way to Prevent Neural Networks from Overfitting is a 2014 paper by Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov, published in the Journal of Machine Learning Research. The paper introduces dropout, a regularization technique in which hidden and visible units are randomly removed from a neural network during each training step. By forcing the network to learn redundant, distributed representations, dropout dramatically reduces overfitting and yields state-of-the-art results across vision, speech, text, and computational biology benchmarks. The paper is one of the most cited works in deep learning and made dropout a near-universal component of modern neural network training pipelines.
Overview
Deep neural networks with many parameters are powerful function approximators but are prone to overfitting, particularly on data sets that are small relative to model capacity. The classical Bayesian remedy of averaging predictions over the posterior distribution of parameters is infeasible for large networks. Dropout offers a tractable approximation: during training, each unit is independently retained with probability $ p $ and otherwise temporarily removed along with its connections. Each training case in each minibatch effectively trains a different "thinned" sub-network, sampled from the set of $ 2^n $ weight-sharing sub-networks of an $ n $-unit network. At test time, the full network is used with weights scaled by $ p $, producing an efficient, deterministic approximation to averaging the predictions of all these sub-networks.
The paper presents extensive empirical evidence that dropout improves generalization on MNIST, SVHN, CIFAR-10/100, ImageNet, TIMIT, Reuters-RCV1, and an alternative-splicing genetics task. It also extends dropout to Restricted Boltzmann Machines, analyzes its effect on learned features and activation sparsity, and explores a Gaussian-noise variant and a deterministic, marginalized counterpart for linear regression.
Key Contributions
- A simple, broadly applicable regularization method—dropout—that scales to networks with tens of millions of parameters and works across architectures and modalities.
- A practical weight scaling approximation: at test time, multiply each weight by the retention probability $ p $. This lets a single forward pass approximate averaging over the exponential ensemble of thinned sub-networks.
- State-of-the-art results at publication time on permutation-invariant MNIST (0.79% error with DBM-pretrained dropout), SVHN, CIFAR-10/100, and ImageNet ILSVRC-2012 (winning the competition).
- Dropout RBMs: an extension of dropout to Restricted Boltzmann Machines that yields sparser, qualitatively different features.
- Analysis showing that dropout prevents co-adaptation of hidden units, induces activation sparsity as a side effect, and behaves predictably as the retention probability $ p $ and data set size are varied.
- A marginalized form of dropout for linear regression, equivalent to a data-dependent ridge penalty, suggesting a deterministic counterpart for richer models.
- A practical hyperparameter guide (network size scaling, learning rate, momentum, max-norm constraints).
Methods
Let $ y^{(l)} $ be the activation vector of layer $ l $ and $ r^{(l)}_j \sim \mathrm{Bernoulli}(p) $ a per-unit retention mask. The dropout forward pass is:
- $ \tilde{y}^{(l)} = r^{(l)} \ast y^{(l)},\qquad z^{(l+1)}_i = w^{(l+1)}_i\, \tilde{y}^{(l)} + b^{(l+1)}_i,\qquad y^{(l+1)}_i = f(z^{(l+1)}_i), $
where $ \ast $ is element-wise multiplication. A fresh mask is sampled for every training case in every minibatch. Backpropagation flows only through retained units. At test time, no units are dropped and weights are rescaled, $ W^{(l)}_{\text{test}} = p\, W^{(l)} $, so that each unit's expected output matches its training-time average.
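A minimal NumPy sketch of this forward pass and the test-time rescaling (the layer sizes, initialization, and the function name `dropout_forward` are illustrative assumptions, not code from the paper):

```python
import numpy as np

def dropout_forward(y, W, b, p, rng, train=True):
    """One dropout layer: Bernoulli(p) retention mask at training time,
    weights scaled by p at test time (W_test = p * W)."""
    if train:
        r = rng.binomial(1, p, size=y.shape)   # per-unit retention mask r ~ Bernoulli(p)
        z = (r * y) @ W + b                    # thinned activations feed the next layer
    else:
        z = y @ (p * W) + b                    # expected pre-activation matches training
    return np.maximum(z, 0.0)                  # f = ReLU, as in most nets in the paper

# illustrative usage: one 100-feature input case, a 100 -> 50 hidden layer, p = 0.5
rng = np.random.default_rng(0)
x = rng.standard_normal(100)
W = 0.01 * rng.standard_normal((100, 50))
b = np.zeros(50)
h_train = dropout_forward(x, W, b, p=0.5, rng=rng, train=True)
h_test = dropout_forward(x, W, b, p=0.5, rng=rng, train=False)
```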
The authors combine dropout with several techniques that proved especially synergistic:
- Max-norm regularization: the incoming weight vector at each hidden unit is constrained to satisfy $ \|w\|_2 \leq c $, with typical $ c \in [3, 4] $. This permits very large learning rates without weight blow-up (a minimal sketch of this projection follows the list).
- High learning rate and momentum: dropout networks tolerate (and benefit from) learning rates 10–100× larger than standard nets and momentum values around 0.95–0.99.
- Network size scaling: because only $ pn $ units are active in expectation, the heuristic is to use roughly $ n/p $ units when replacing a standard layer of size $ n $.
- Pretraining compatibility: pretrained weights (from RBM stacks, autoencoders, or DBMs) are scaled up by $ 1/p $ before dropout finetuning.
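A rough sketch of the max-norm projection from the first item above, applied after a generic SGD-with-momentum step (the update rule, shapes, and hyperparameter values are illustrative assumptions, not the authors' training code):

```python
import numpy as np

def sgd_momentum_maxnorm_step(W, v, grad, lr, momentum, c):
    """One SGD-with-momentum update followed by the max-norm projection:
    each hidden unit's incoming weight vector (here, a column of W) is
    rescaled so that its L2 norm is at most c."""
    v = momentum * v - lr * grad                            # velocity update
    W = W + v                                               # parameter step
    norms = np.linalg.norm(W, axis=0, keepdims=True)        # per-unit incoming norms
    scale = np.minimum(1.0, c / np.maximum(norms, 1e-12))   # shrink only if norm > c
    return W * scale, v

# illustrative usage with a large learning rate and a typical constraint c in [3, 4]
rng = np.random.default_rng(0)
W = 0.01 * rng.standard_normal((784, 1024))
v = np.zeros_like(W)
grad = 0.001 * rng.standard_normal(W.shape)
W, v = sgd_momentum_maxnorm_step(W, v, grad, lr=10.0, momentum=0.95, c=3.0)
```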
The authors also derive a deterministic counterpart by marginalizing the noise. For linear regression, dropping inputs with retention probability $ p $ reduces in expectation to:
- $ \underset{w}{\mathrm{minimize}}\; \|y - p X w\|^2 + p(1 - p)\, \|\Gamma w\|^2,\qquad \Gamma = (\mathrm{diag}(X^\top X))^{1/2}, $
a form of L2 regularization weighted by per-feature standard deviations. A Gaussian dropout variant—multiplying activations by samples from $ \mathcal{N}(1, \sigma^2) $ with $ \sigma^2 = (1 - p)/p $—matches or slightly exceeds Bernoulli dropout in early experiments and removes the need for test-time weight scaling.
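Because the marginalized objective is quadratic in $ w $, it can be solved in closed form like ordinary ridge regression. A small NumPy sketch on synthetic data (the helper name and the data are assumptions for illustration):

```python
import numpy as np

def marginalized_dropout_lr(X, y, p):
    """Closed-form minimizer of ||y - p X w||^2 + p (1 - p) ||Gamma w||^2,
    with Gamma = diag(X^T X)^{1/2}: setting the gradient to zero gives
    (p^2 X^T X + p (1 - p) diag(X^T X)) w = p X^T y."""
    G = X.T @ X
    A = (p ** 2) * G + p * (1.0 - p) * np.diag(np.diag(G))
    return np.linalg.solve(A, p * (X.T @ y))

# synthetic usage: input features retained with probability p = 0.8
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
w_true = rng.standard_normal(10)
y = X @ w_true + 0.1 * rng.standard_normal(200)
w_hat = marginalized_dropout_lr(X, y, p=0.8)
```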
Results
Across a diverse benchmark suite, dropout produced consistent and often dramatic gains:
- MNIST (permutation-invariant): 1.60% baseline → 1.35% with dropout → 1.06% with dropout + ReLU + max-norm → 0.95% at 2×8192 units → 0.79% with DBM-pretrained dropout (state of the art at publication).
- SVHN: 3.95% baseline conv net → 3.02% with dropout in fully connected layers → 2.55% with dropout in all layers.
- CIFAR-10: 14.98% baseline → 12.61% with dropout in every layer.
- CIFAR-100: 43.48% → 37.20%.
- ImageNet ILSVRC-2012: dropout-equipped conv nets achieved roughly 16% top-5 test error, versus ~26% for the best non-deep-learning baselines, and won the competition.
- TIMIT phone recognition: 6-layer net 23.4% → 21.8%; pretrained 4-layer net 22.7% → 19.7%.
- Reuters-RCV1: 31.05% → 29.62% (smaller gains on already-large training sets).
- Alternative splicing (Code Quality, higher is better): standard NN 440 → dropout NN 567 → Bayesian NN 623. Dropout closes much of the gap to a Bayesian net while remaining tractable.
Analyses of learned features show that dropout breaks up the co-adaptation visible in standard autoencoders, yielding hidden units that detect localized strokes, edges, and spots. Activations also become sparser as a side effect, with mean activation falling from ~2.0 to ~0.7 on MNIST autoencoders. Sweeps over the retention probability $ p $ show a flat optimum in the range 0.4–0.8, with 0.5 a robust default for hidden layers and ~0.8 for input layers. Monte-Carlo averaging over $ k $ sampled sub-networks matches the weight-scaling approximation by around $ k = 50 $, confirming that the cheap test-time procedure is faithful in practice. Dropout's gains over no regularization grow with data set size up to a sweet spot, then taper as overfitting becomes less of a concern.
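A self-contained sketch of that comparison for a small two-layer softmax classifier with dropout on the hidden layer: with $ k = 0 $ the weight-scaling approximation is used, and with $ k > 0 $ the predictions of $ k $ sampled thinned sub-networks are averaged (the weights and sizes below are made up for illustration):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def predict(x, W1, b1, W2, b2, p, rng=None, k=0):
    """Dropout on the hidden layer only. k == 0: weight-scaling approximation
    (outgoing hidden weights scaled by p). k > 0: average the softmax outputs
    of k sampled thinned sub-networks."""
    h = np.maximum(x @ W1 + b1, 0.0)                  # full hidden layer (ReLU)
    if k == 0:
        return softmax(h @ (p * W2) + b2)             # W_test = p * W
    probs = np.zeros(b2.shape[0])
    for _ in range(k):
        r = rng.binomial(1, p, size=h.shape)          # hidden-unit retention mask
        probs += softmax((r * h) @ W2 + b2)           # prediction of one thinned net
    return probs / k                                  # Monte Carlo average

rng = np.random.default_rng(0)
x = rng.standard_normal(20)
W1, b1 = 0.1 * rng.standard_normal((20, 64)), np.zeros(64)
W2, b2 = 0.1 * rng.standard_normal((64, 5)), np.zeros(5)
p_scaled = predict(x, W1, b1, W2, b2, p=0.5)              # deterministic approximation
p_mc = predict(x, W1, b1, W2, b2, p=0.5, rng=rng, k=50)   # ~50 samples, per the analysis above
```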
Impact
Dropout became a default building block of deep learning shortly after publication. Combined with ReLU activations and max-norm or weight-decay regularization, it underpinned many of the breakthrough convolutional architectures of the early and mid-2010s, including the AlexNet result on ImageNet. Subsequent work generalized the idea: Gaussian dropout, DropConnect, variational dropout, Bernoulli-style stochastic regularizers in recurrent neural networks and transformers, and Bayesian interpretations such as Monte Carlo dropout as approximate Bayesian inference. The paper reframed neural network training as a form of implicit ensembling over an exponentially large set of weight-sharing sub-models, an idea that continues to inform research into normalization, stochastic optimization, and uncertainty estimation. The drawback of 2–3× longer training times has been partly mitigated by deterministic approximations and by the rise of architectures (such as residual networks and transformers with layer normalization) that often need less aggressive dropout.
See also
- ImageNet Classification with Deep Convolutional Neural Networks
- Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
- Deep Residual Learning for Image Recognition
- Adam: A Method for Stochastic Optimization
- Attention Is All You Need
References
- Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014). Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research, 15(56):1929–1958.
- Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. R. (2012). Improving neural networks by preventing co-adaptation of feature detectors. arXiv:1207.0580.
- Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, 1106–1114.
- Goodfellow, I. J., Warde-Farley, D., Mirza, M., Courville, A., and Bengio, Y. (2013). Maxout networks. In Proceedings of the 30th International Conference on Machine Learning.
- Wang, S. and Manning, C. D. (2013). Fast dropout training. In Proceedings of the 30th International Conference on Machine Learning.
- Wager, S., Wang, S., and Liang, P. (2013). Dropout training as adaptive regularization. In Advances in Neural Information Processing Systems 26.