Dropout
Revision as of 19:42, 27 April 2026
| Article | |
|---|---|
| Topic area | Deep Learning |
| Difficulty | Intermediate |
| Prerequisites | Neural Networks, Overfitting and Regularization |
Dropout is a regularization technique for neural networks that randomly sets a fraction of neuron activations to zero during each training step. Introduced by Srivastava et al. (2014), dropout is one of the most widely used methods for preventing overfitting in deep learning.
Motivation: Co-Adaptation
In large neural networks, neurons can develop complex co-adaptation patterns — groups of neurons that only function correctly in the presence of specific other neurons. This tight coupling makes the network brittle and prone to overfitting, since the learned features depend on the particular idiosyncrasies of the training data rather than capturing robust, general patterns.
Dropout breaks these co-adaptations by forcing each neuron to learn features that are useful in conjunction with many different random subsets of the other neurons.
The Dropout Algorithm
During Training
At each training step, every neuron in a dropout layer is independently retained with probability $ p $ (the keep probability) or set to zero with probability $ 1 - p $. Formally, for a layer with activation vector $ \mathbf{h} $:
- $ r_j \sim \mathrm{Bernoulli}(p) $
- $ \tilde{h}_j = r_j \cdot h_j $
where $ r_j $ is a binary mask drawn independently for each neuron $ j $. A typical keep probability is $ p = 0.5 $ for hidden layers and $ p = 0.8 $ or higher for the input layer.
Each training step effectively trains a different "thinned" sub-network sampled from the full architecture. With $ n $ neurons, there are $ 2^n $ possible sub-networks, creating an implicit ensemble.
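The training-time masking can be sketched in a few lines of NumPy (function and variable names here are illustrative, not from the original paper):

```python
import numpy as np

def dropout_train(h, p, rng):
    """Standard dropout on an activation vector h: each unit is
    retained with keep probability p, zeroed with probability 1 - p."""
    r = rng.binomial(1, p, size=h.shape)  # Bernoulli(p) mask, one draw per unit
    return r * h

rng = np.random.default_rng(0)
h = np.ones(10_000)
h_tilde = dropout_train(h, p=0.5, rng=rng)
# Roughly half the units survive; each survivor keeps its original value.
print(h_tilde.mean())  # ≈ 0.5
```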
During Inference: Inverted Dropout
At inference time, all neurons are active, so each neuron's output is larger by a factor of $ 1/p $ than its expected value during training. Two approaches correct this mismatch:
- Standard dropout: Multiply all weights by $ p $ at test time.
- Inverted dropout (more common): During training, divide the retained activations by $ p $:
- $ \tilde{h}_j = \frac{r_j \cdot h_j}{p} $
Inverted dropout ensures that the expected value of $ \tilde{h}_j $ equals $ h_j $ during training, so no adjustment is needed at inference. This is the default implementation in frameworks such as PyTorch and TensorFlow.
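A minimal NumPy sketch of inverted dropout, checking empirically that the expectation of the masked activation matches the clean one (names are illustrative):

```python
import numpy as np

def inverted_dropout(h, p, rng):
    """Inverted dropout: divide retained activations by the keep
    probability p so that E[h_tilde] = h, leaving inference unchanged."""
    r = rng.binomial(1, p, size=h.shape)
    return r * h / p

rng = np.random.default_rng(0)
h = np.full(100_000, 2.0)           # clean activation value 2.0 everywhere
h_tilde = inverted_dropout(h, p=0.8, rng=rng)
print(h_tilde.mean())  # ≈ 2.0: expectation matches the clean activation
```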
Theoretical Interpretation
Ensemble Perspective
Dropout can be viewed as training an exponentially large ensemble of sub-networks with extensive weight sharing. At test time, using the full network with scaled weights approximates the geometric mean of the predictions of all $ 2^n $ sub-networks. This ensemble averaging reduces variance and improves generalization.
Bayesian Interpretation
Gal and Ghahramani (2016) showed that a neural network with dropout applied before every weight layer is mathematically equivalent to an approximation of a deep Gaussian process. Performing dropout at test time (Monte Carlo dropout) yields a distribution over predictions, providing a practical estimate of model uncertainty.
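Monte Carlo dropout can be sketched on a toy network: keep dropout active at test time, run several stochastic forward passes, and read off the mean and spread of the predictions. The weights and layer sizes below are arbitrary stand-ins, not a trained model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy one-hidden-layer regression net with fixed (pretend-trained) weights.
W1 = rng.normal(size=(3, 16))
W2 = rng.normal(size=(16, 1))

def forward(x, p=0.8):
    """Forward pass with inverted dropout left on, as in MC dropout."""
    h = np.maximum(x @ W1, 0.0)                      # ReLU hidden layer
    mask = rng.binomial(1, p, size=h.shape) / p      # stochastic mask
    return (h * mask) @ W2

x = np.ones((1, 3))
# T stochastic forward passes give a predictive distribution.
samples = np.array([forward(x) for _ in range(200)])
mean, std = samples.mean(), samples.std()
print(f"prediction {mean:.3f} ± {std:.3f}")  # std estimates model uncertainty
```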
Dropout Variants
| Variant | Description | Typical application |
|---|---|---|
| Standard dropout | Drops individual neurons | Fully connected layers |
| Spatial dropout | Drops entire feature maps (channels) | Convolutional networks |
| DropConnect | Drops individual weights instead of neurons | Dense layers |
| Variational dropout | Learns the dropout rate per neuron/weight | Bayesian deep learning |
| DropBlock | Drops contiguous regions of feature maps | Convolutional networks |
| Alpha dropout | Maintains self-normalizing property (for SELU activations) | Self-normalizing networks |
Spatial dropout (Tompson et al., 2015) is particularly important for convolutional networks. Standard dropout on convolutional feature maps is ineffective because adjacent activations are highly correlated; dropping individual pixels still leaves redundant spatial information. Spatial dropout instead drops entire channels, forcing the network to use diverse feature representations.
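Channel-wise dropping can be sketched by broadcasting a per-channel mask over a feature map of shape (channels, height, width); the helper name and shapes are illustrative:

```python
import numpy as np

def spatial_dropout(x, p, rng):
    """Spatial (channel) dropout: each channel of a (C, H, W) feature
    map is kept or zeroed as a whole, with inverted-dropout scaling."""
    mask = rng.binomial(1, p, size=(x.shape[0], 1, 1)) / p  # one draw per channel
    return x * mask                                          # broadcasts over H, W

rng = np.random.default_rng(0)
x = np.ones((32, 8, 8))
y = spatial_dropout(x, p=0.75, rng=rng)
per_channel = y.reshape(32, -1)
# Each channel is uniformly 0 (dropped) or 1/0.75 (kept and rescaled).
print(np.unique(per_channel))
```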
Practical Guidelines
- Placement: Apply dropout after the activation function in fully connected layers. In Transformers, dropout is applied to attention weights and after feed-forward sub-layers.
- Rate selection: Start with $ p = 0.5 $ for hidden layers. Use higher keep rates (lower dropout) for layers with fewer parameters. Increase dropout for larger models or smaller datasets.
- Interaction with BatchNorm: Using dropout and Batch Normalization together requires care, as dropout introduces variance that can destabilize batch statistics. A common practice is to apply dropout only after the final batch-normalized layer.
- Scheduled dropout: Some training regimes start with no dropout and gradually increase the rate, or vice versa, over the course of training.
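The placement guideline (activation first, then dropout, and only in training mode) can be sketched as a minimal fully connected layer with a train/eval toggle; the class and its attributes are hypothetical, loosely echoing the train/eval convention of frameworks like PyTorch:

```python
import numpy as np

class DenseDropout:
    """Minimal dense layer: ReLU activation followed by inverted
    dropout, applied only when the layer is in training mode."""
    def __init__(self, W, p=0.5, seed=0):
        self.W, self.p = W, p
        self.rng = np.random.default_rng(seed)
        self.training = True

    def __call__(self, x):
        h = np.maximum(x @ self.W, 0.0)   # activation first...
        if self.training:                 # ...then dropout, training only
            h = h * self.rng.binomial(1, self.p, size=h.shape) / self.p
        return h

layer = DenseDropout(np.eye(4), p=0.5)
x = np.ones((1, 4))
out_train = layer(x)   # stochastic: units are 0 or scaled to 2.0
layer.training = False
out_eval = layer(x)    # deterministic: full activations, no scaling
print(out_eval)        # [[1. 1. 1. 1.]]
```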
Effect on Training
Dropout typically increases training loss and slows convergence, since the effective model capacity is reduced at each step. However, it decreases the gap between training and validation performance, leading to better generalization. If training loss is already high (underfitting), dropout should be reduced or removed.
See also
- Overfitting and Regularization
- Batch Normalization
- Neural Networks
- Backpropagation
- Bayesian deep learning
References
- Srivastava, N. et al. (2014). "Dropout: A Simple Way to Prevent Neural Networks from Overfitting". Journal of Machine Learning Research 15(56):1929–1958.
- Gal, Y. and Ghahramani, Z. (2016). "Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning". ICML.
- Tompson, J. et al. (2015). "Efficient Object Localization Using Convolutional Networks". CVPR.
- Wan, L. et al. (2013). "Regularization of Neural Networks using DropConnect". ICML.
- Ghiasi, G., Lin, T.-Y. and Le, Q. V. (2018). "DropBlock: A regularization method for convolutional networks". NeurIPS.