Dropout
| Article | |
|---|---|
| Topic area | Deep Learning |
| Prerequisites | Neural Networks, Overfitting and Regularization |
Dropout is a regularization technique for neural networks that randomly sets a fraction of neuron activations to zero during each training step. Introduced by Srivastava et al. (2014), dropout is one of the most widely used methods for preventing overfitting in deep learning.
Motivation: Co-Adaptation
In large neural networks, neurons can develop complex co-adaptation patterns — groups of neurons that only function correctly in the presence of specific other neurons. This tight coupling makes the network brittle and prone to overfitting, since the learned features depend on the particular idiosyncrasies of the training data rather than capturing robust, general patterns.
Dropout breaks these co-adaptations by forcing each neuron to learn features that are useful in conjunction with many different random subsets of the other neurons.
The Dropout Algorithm
During Training
At each training step, every neuron in a dropout layer is independently retained with probability $ p $ (the keep probability) or set to zero with probability $ 1 - p $. Formally, for a layer with activation vector $ \mathbf{h} $:
- $ r_j \sim \mathrm{Bernoulli}(p) $
- $ \tilde{h}_j = r_j \cdot h_j $
where $ r_j $ is a binary mask drawn independently for each neuron $ j $. A typical keep probability is $ p = 0.5 $ for hidden layers and $ p = 0.8 $ or higher for the input layer.
Each training step effectively trains a different "thinned" sub-network sampled from the full architecture. With $ n $ neurons, there are $ 2^n $ possible sub-networks, creating an implicit ensemble.
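The training-time mask can be sketched in a few lines of NumPy (variable names are illustrative; this is the plain, non-inverted form, so test-time scaling would still be required):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_train(h, p):
    """Retain each activation with probability p, zero it otherwise."""
    r = rng.binomial(1, p, size=h.shape)  # r_j ~ Bernoulli(p), drawn per neuron
    return r * h

h = np.ones(10_000)
h_tilde = dropout_train(h, p=0.5)
# On average a fraction p of activations survive, so E[h_tilde] = p * h
```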
During Inference: Inverted Dropout
At inference time, all neurons are active, so the expected output of each neuron is scaled by a factor of $ p $ relative to training. Two approaches address this:
- Standard dropout: Multiply all weights by $ p $ at test time.
- Inverted dropout (more common): During training, divide the retained activations by $ p $:
- $ \tilde{h}_j = \frac{r_j \cdot h_j}{p} $
Inverted dropout ensures that the expected value of $ \tilde{h}_j $ equals $ h_j $ during training, so no adjustment is needed at inference. This is the default implementation in frameworks such as PyTorch and TensorFlow.
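A minimal sketch of inverted dropout in the same notation (NumPy, illustrative names):

```python
import numpy as np

rng = np.random.default_rng(1)

def inverted_dropout(h, p, training=True):
    """Scale retained activations by 1/p so E[output] = h during training,
    leaving nothing to adjust at inference."""
    if not training:
        return h  # inference: all neurons active, no rescaling needed
    r = rng.binomial(1, p, size=h.shape)
    return r * h / p

h = np.full(100_000, 2.0)
out = inverted_dropout(h, p=0.8)
# out.mean() is close to 2.0, matching E[r_j * h_j / p] = h_j
```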
Theoretical Interpretation
Ensemble Perspective
Dropout can be viewed as training an exponentially large ensemble of sub-networks with extensive weight sharing. At test time, using the full network with scaled weights approximates the geometric mean of the predictions of all $ 2^n $ sub-networks. This ensemble averaging reduces variance and improves generalization.
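For a purely linear layer, weight scaling recovers the ensemble's arithmetic mean exactly, which a tiny enumeration can illustrate (the weights and activations below are made up for the example; for nonlinear networks the correspondence to the geometric-mean ensemble is only approximate):

```python
import numpy as np
from itertools import product

# Tiny linear "network": enumerate all 2^3 sub-networks of a 3-neuron layer
# and compare their average output with the weight-scaled full network.
w = np.array([1.0, -2.0, 3.0])   # illustrative weights
h = np.array([0.5, 1.0, 2.0])    # illustrative activations
p = 0.5                          # keep probability

preds = [np.dot(np.array(mask) * h, w) for mask in product([0, 1], repeat=3)]
ensemble_avg = np.mean(preds)    # arithmetic mean over all 8 sub-networks
scaled = p * np.dot(h, w)        # full network with weights scaled by p
# ensemble_avg equals scaled exactly, because the layer is linear
```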
Bayesian Interpretation
Gal and Ghahramani (2016) showed that a neural network with dropout applied before every weight layer is mathematically equivalent to an approximation of a deep Gaussian process. Performing dropout at test time (Monte Carlo dropout) yields a distribution over predictions, providing a practical estimate of model uncertainty.
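A toy sketch of Monte Carlo dropout on a single random layer (the weights and input are random placeholders, not a trained model):

```python
import numpy as np

rng = np.random.default_rng(2)

W = rng.normal(size=(50, 1))   # placeholder weights
x = rng.normal(size=50)        # placeholder input

def forward(x, p=0.5):
    """One forward pass with dropout left ON, as in MC dropout."""
    h = np.maximum(x, 0.0)                 # ReLU activations
    r = rng.binomial(1, p, size=h.shape)   # fresh mask on every call
    return float((r * h / p) @ W)

samples = np.array([forward(x) for _ in range(200)])
mean_pred = samples.mean()   # Monte Carlo estimate of the predictive mean
uncertainty = samples.std()  # spread across masks serves as an uncertainty proxy
```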
Dropout Variants
| Variant | Description | Typical application |
|---|---|---|
| Standard dropout | Drops individual neurons | Fully connected layers |
| Spatial dropout | Drops entire feature maps (channels) | Convolutional networks |
| DropConnect | Drops individual weights instead of neurons | Dense layers |
| Variational dropout | Learns the dropout rate per neuron/weight | Bayesian deep learning |
| DropBlock | Drops contiguous regions of feature maps | Convolutional networks |
| Alpha dropout | Maintains self-normalizing property (for SELU activations) | Self-normalizing networks |
Spatial dropout (Tompson et al., 2015) is particularly important for convolutional networks. Standard dropout on convolutional feature maps is ineffective because adjacent activations are highly correlated; dropping individual pixels still leaves redundant spatial information. Spatial dropout instead drops entire channels, forcing the network to use diverse feature representations.
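The channel-wise masking idea can be sketched for a tensor of shape (channels, height, width), using inverted scaling:

```python
import numpy as np

rng = np.random.default_rng(3)

def spatial_dropout(x, p):
    """Draw one Bernoulli(p) per channel and zero the entire feature map,
    instead of masking individual pixels. x has shape (C, H, W)."""
    mask = rng.binomial(1, p, size=(x.shape[0], 1, 1))  # one value per channel
    return mask * x / p  # broadcasts over H and W; inverted-dropout scaling

x = np.ones((8, 4, 4))
y = spatial_dropout(x, p=0.5)
# Every channel of y is either all zeros or uniformly 1/p = 2.0
```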
Practical Guidelines
- Placement: Apply dropout after the activation function in fully connected layers. In Transformers, dropout is applied to the attention weights and after the feed-forward sub-layers.
- Rate selection: Start with $ p = 0.5 $ for hidden layers. Use higher keep rates (lower dropout) for layers with fewer parameters. Increase dropout for larger models or smaller datasets.
- Interaction with Batch Normalization: Using dropout and Batch Normalization together requires care, as dropout introduces a variance shift that can destabilize batch statistics. A common practice is to apply dropout only after the final batch-normalized layer.
- Scheduled dropout: Some training regimes start with no dropout and gradually increase the rate, or vice versa, over the course of training.
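The placement guideline for fully connected layers can be sketched as a single hidden block (NumPy, illustrative shapes):

```python
import numpy as np

rng = np.random.default_rng(4)

def dense_relu_dropout(x, W, b, p=0.5, training=True):
    """Linear transform, then activation, then dropout, per the guideline."""
    h = np.maximum(x @ W + b, 0.0)             # activation first
    if training:
        r = rng.binomial(1, p, size=h.shape)
        h = r * h / p                           # inverted dropout after activation
    return h

x = rng.normal(size=(2, 4))
W, b = rng.normal(size=(4, 3)), np.zeros(3)
out_train = dense_relu_dropout(x, W, b, training=True)
out_eval = dense_relu_dropout(x, W, b, training=False)
```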
Effect on Training
Dropout typically increases training loss and slows convergence, since the effective model capacity is reduced at each training step. However, it decreases the gap between training and validation performance, leading to better generalization. If training loss is already high (i.e., the model is underfitting), dropout should be reduced or removed.
See also
- Overfitting and Regularization
- Batch Normalization
- Neural Networks
- Backpropagation
- Bayesian deep learning
References
- Srivastava, N. et al. (2014). "Dropout: A Simple Way to Prevent Neural Networks from Overfitting". Journal of Machine Learning Research 15(56):1929–1958.
- Gal, Y. and Ghahramani, Z. (2016). "Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning". ICML.
- Tompson, J. et al. (2015). "Efficient Object Localization Using Convolutional Networks". CVPR.
- Wan, L. et al. (2013). "Regularization of Neural Networks using DropConnect". ICML.
- Ghiasi, G., Lin, T.-Y. and Le, Q. V. (2018). "DropBlock: A regularization method for convolutional networks". NeurIPS.