Dropout

Revision as of 06:58, 24 April 2026

Topic area: Deep Learning
Difficulty: Intermediate
Prerequisites: Neural Networks, Overfitting and Regularization

    Dropout is a regularization technique for neural networks that randomly sets a fraction of neuron activations to zero during each training step. Introduced by Srivastava et al. (2014), dropout is one of the most widely used methods for preventing overfitting in deep learning.

    Motivation: Co-Adaptation

    In large neural networks, neurons can develop complex co-adaptation patterns — groups of neurons that only function correctly in the presence of specific other neurons. This tight coupling makes the network brittle and prone to overfitting, since the learned features depend on the particular idiosyncrasies of the training data rather than capturing robust, general patterns.

    Dropout breaks these co-adaptations by forcing each neuron to learn features that are useful in conjunction with many different random subsets of the other neurons.

    The Dropout Algorithm

    During Training

    At each training step, every neuron in a dropout layer is independently retained with probability $ p $ (the keep probability) or set to zero with probability $ 1 - p $. Formally, for a layer with activation vector $ \mathbf{h} $:

    $ r_j \sim \mathrm{Bernoulli}(p) $
    $ \tilde{h}_j = r_j \cdot h_j $

    where $ r_j $ is a binary mask drawn independently for each neuron $ j $. A typical keep probability is $ p = 0.5 $ for hidden layers and $ p = 0.8 $ or higher for the input layer.

    Each training step effectively trains a different "thinned" sub-network sampled from the full architecture. With $ n $ neurons, there are $ 2^n $ possible sub-networks, creating an implicit ensemble.
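The training-time masking described above can be sketched in a few lines of NumPy. This is an illustrative implementation, not how frameworks actually fuse dropout into their layers; the function name `dropout_train` is our own.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_train(h, p=0.5, rng=rng):
    """Standard dropout during training: independently retain each
    activation h_j with keep probability p, else set it to zero."""
    r = rng.binomial(1, p, size=h.shape)  # r_j ~ Bernoulli(p)
    return r * h                          # h~_j = r_j * h_j

h = np.ones(100_000)
h_tilde = dropout_train(h, p=0.5)
# With p = 0.5, roughly half the activations are zeroed each step.
```

Running this repeatedly draws a fresh mask each call, which is exactly what makes every step train a different thinned sub-network.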

    During Inference: Inverted Dropout

At inference time, all neurons are active, so each neuron's output is larger than its expected training-time value $ p \cdot h_j $ by a factor of $ 1/p $. Two approaches correct for this mismatch:

    • Standard dropout: Multiply all weights by $ p $ at test time.
    • Inverted dropout (more common): During training, divide the retained activations by $ p $:
    $ \tilde{h}_j = \frac{r_j \cdot h_j}{p} $

    Inverted dropout ensures that the expected value of $ \tilde{h}_j $ equals $ h_j $ during training, so no adjustment is needed at inference. This is the default implementation in frameworks such as PyTorch and TensorFlow.
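A minimal sketch of inverted dropout in NumPy, illustrating that dividing by $ p $ keeps the expected activation unchanged (the function name `inverted_dropout` is our own):

```python
import numpy as np

rng = np.random.default_rng(1)

def inverted_dropout(h, p=0.8, rng=rng):
    """Inverted dropout: scale retained activations by 1/p during
    training so that E[h~_j] = h_j and no test-time rescaling is needed."""
    r = rng.binomial(1, p, size=h.shape)
    return (r * h) / p

h = np.full(100_000, 2.0)
h_tilde = inverted_dropout(h, p=0.8)
# E[h~_j] = h_j, so the empirical mean stays close to 2.0.
```

Because the expectation is already correct during training, the inference-time forward pass is simply the identity, which is why frameworks default to this variant.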

    Theoretical Interpretation

    Ensemble Perspective

    Dropout can be viewed as training an exponentially large ensemble of sub-networks with extensive weight sharing. At test time, using the full network with scaled weights approximates the geometric mean of the predictions of all $ 2^n $ sub-networks. This ensemble averaging reduces variance and improves generalization.

    Bayesian Interpretation

    Gal and Ghahramani (2016) showed that a neural network with dropout applied before every weight layer is mathematically equivalent to an approximation of a deep Gaussian process. Performing dropout at test time (Monte Carlo dropout) yields a distribution over predictions, providing a practical estimate of model uncertainty.
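Monte Carlo dropout can be sketched with a toy two-layer network: keep the dropout mask active at test time, run several stochastic forward passes, and report the mean and standard deviation of the predictions. Everything here (the tiny architecture, the function name `mc_dropout_predict`, the sample count) is a hypothetical illustration, not the setup from Gal and Ghahramani.

```python
import numpy as np

rng = np.random.default_rng(2)

def mc_dropout_predict(x, W1, W2, p=0.5, n_samples=100, rng=rng):
    """Monte Carlo dropout for a toy ReLU network: dropout stays ON at
    test time, and repeated stochastic passes give a predictive mean and
    a standard deviation that serves as a rough uncertainty estimate."""
    preds = []
    for _ in range(n_samples):
        h = np.maximum(x @ W1, 0.0)          # ReLU hidden layer
        r = rng.binomial(1, p, size=h.shape)
        h = (r * h) / p                      # inverted dropout, still active
        preds.append(h @ W2)
    preds = np.stack(preds)
    return preds.mean(axis=0), preds.std(axis=0)

x = np.ones((5, 3))                 # 5 inputs with 3 features each
W1 = rng.normal(size=(3, 16))
W2 = rng.normal(size=(16, 2))
mean, std = mc_dropout_predict(x, W1, W2)
```

The spread across passes is nonzero precisely because each pass samples a different sub-network, which is what the Bayesian reading interprets as predictive uncertainty.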

    Dropout Variants

Variant              | Description                                                  | Typical application
Standard dropout     | Drops individual neurons                                     | Fully connected layers
Spatial dropout      | Drops entire feature maps (channels)                         | Convolutional networks
DropConnect          | Drops individual weights instead of neurons                  | Dense layers
Variational dropout  | Learns the dropout rate per neuron/weight                    | Bayesian deep learning
DropBlock            | Drops contiguous regions of feature maps                     | Convolutional networks
Alpha dropout        | Preserves the self-normalizing property of SELU activations  | Self-normalizing networks

    Spatial dropout (Tompson et al., 2015) is particularly important for convolutional networks. Standard dropout on convolutional feature maps is ineffective because adjacent activations are highly correlated; dropping individual pixels still leaves redundant spatial information. Spatial dropout instead drops entire channels, forcing the network to use diverse feature representations.
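The channel-wise masking of spatial dropout can be sketched in NumPy for a single feature map of shape (channels, height, width); one Bernoulli draw per channel zeroes the whole channel, following the inverted-dropout scaling convention (the function name `spatial_dropout` is our own):

```python
import numpy as np

rng = np.random.default_rng(3)

def spatial_dropout(x, p=0.5, rng=rng):
    """Spatial dropout on a feature map x of shape (C, H, W): one mask
    value per channel, broadcast over the spatial dimensions, so each
    channel is either fully kept (and scaled by 1/p) or fully dropped."""
    c = x.shape[0]
    r = rng.binomial(1, p, size=(c, 1, 1))  # one Bernoulli per channel
    return (r * x) / p

x = np.ones((64, 8, 8))
y = spatial_dropout(x, p=0.5)
# Each of the 64 channels is either all zeros or uniformly scaled by 1/p.
```

Broadcasting the per-channel mask over height and width is what removes the spatial redundancy that defeats per-pixel dropout in convolutional layers.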

    Practical Guidelines

    • Placement: Apply dropout after the activation function in fully connected layers. In Transformers, dropout is applied to attention weights and after feed-forward sub-layers.
    • Rate selection: Start with $ p = 0.5 $ for hidden layers. Use higher keep rates (lower dropout) for layers with fewer parameters. Increase dropout for larger models or smaller datasets.
    • Interaction with BatchNorm: Using dropout and Batch Normalization together requires care, as dropout introduces variance that can destabilize batch statistics. A common practice is to apply dropout only after the final batch-normalized layer.
    • Scheduled dropout: Some training regimes start with no dropout and gradually increase the rate, or vice versa, over the course of training.

    Effect on Training

    Dropout typically increases training loss and slows convergence, since the effective model capacity is reduced at each step. However, it decreases the gap between training and validation performance, leading to better generalization. If training loss is already high (underfitting), dropout should be reduced or removed.

    References

    • Srivastava, N. et al. (2014). "Dropout: A Simple Way to Prevent Neural Networks from Overfitting". Journal of Machine Learning Research 15(56):1929–1958.
    • Gal, Y. and Ghahramani, Z. (2016). "Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning". ICML.
    • Tompson, J. et al. (2015). "Efficient Object Localization Using Convolutional Networks". CVPR.
    • Wan, L. et al. (2013). "Regularization of Neural Networks using DropConnect". ICML.
    • Ghiasi, G., Lin, T.-Y. and Le, Q. V. (2018). "DropBlock: A regularization method for convolutional networks". NeurIPS.