Dropout

Revision as of 06:58, 24 April 2026

Topic area: Deep Learning
Difficulty: Intermediate
Prerequisites: Neural Networks, Overfitting and Regularization

    Dropout is a regularization technique for neural networks that randomly sets a fraction of neuron activations to zero during each training step. Introduced by Srivastava et al. (2014), dropout is one of the most widely used methods for preventing overfitting in deep learning.

    Motivation: Co-Adaptation

    In large neural networks, neurons can develop complex co-adaptation patterns — groups of neurons that only function correctly in the presence of specific other neurons. This tight coupling makes the network brittle and prone to overfitting, since the learned features depend on the particular idiosyncrasies of the training data rather than capturing robust, general patterns.

    Dropout breaks these co-adaptations by forcing each neuron to learn features that are useful in conjunction with many different random subsets of the other neurons.

    The Dropout Algorithm

    During Training

    At each training step, every neuron in a dropout layer is independently retained with probability $ p $ (the keep probability) or set to zero with probability $ 1 - p $. Formally, for a layer with activation vector $ \mathbf{h} $:

    $ r_j \sim \mathrm{Bernoulli}(p) $
    $ \tilde{h}_j = r_j \cdot h_j $

    where $ r_j $ is a binary mask drawn independently for each neuron $ j $. A typical keep probability is $ p = 0.5 $ for hidden layers and $ p = 0.8 $ or higher for the input layer.

    Each training step effectively trains a different "thinned" sub-network sampled from the full architecture. With $ n $ neurons, there are $ 2^n $ possible sub-networks, creating an implicit ensemble.
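The training-time masking described above can be sketched in a few lines of NumPy. This is an illustrative implementation, not how frameworks actually fuse dropout into their layers; the function name `dropout_train` is our own.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_train(h, p=0.5, rng=rng):
    """Standard dropout during training: independently retain each
    activation h_j with keep probability p, else set it to zero."""
    r = rng.binomial(1, p, size=h.shape)  # r_j ~ Bernoulli(p)
    return r * h                          # h~_j = r_j * h_j

h = np.ones(100_000)
h_tilde = dropout_train(h, p=0.5)
# With p = 0.5, roughly half the activations are zeroed each step.
```

Running this repeatedly draws a fresh mask each call, which is exactly what makes every step train a different thinned sub-network.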

    During Inference: Inverted Dropout

At inference time, all neurons are active, so each neuron's output is larger than its expected training-time value $ p \cdot h_j $ by a factor of $ 1/p $. Two approaches correct for this mismatch:

    • Standard dropout: Multiply all weights by $ p $ at test time.
    • Inverted dropout (more common): During training, divide the retained activations by $ p $:
    $ \tilde{h}_j = \frac{r_j \cdot h_j}{p} $

    Inverted dropout ensures that the expected value of $ \tilde{h}_j $ equals $ h_j $ during training, so no adjustment is needed at inference. This is the default implementation in frameworks such as PyTorch and TensorFlow.
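A minimal sketch of inverted dropout in NumPy, illustrating that dividing by $ p $ keeps the expected activation unchanged (the function name `inverted_dropout` is our own):

```python
import numpy as np

rng = np.random.default_rng(1)

def inverted_dropout(h, p=0.8, rng=rng):
    """Inverted dropout: scale retained activations by 1/p during
    training so that E[h~_j] = h_j and no test-time rescaling is needed."""
    r = rng.binomial(1, p, size=h.shape)
    return (r * h) / p

h = np.full(100_000, 2.0)
h_tilde = inverted_dropout(h, p=0.8)
# E[h~_j] = h_j, so the empirical mean stays close to 2.0.
```

Because the expectation is already correct during training, the inference-time forward pass is simply the identity, which is why frameworks default to this variant.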

    Theoretical Interpretation

    Ensemble Perspective

    Dropout can be viewed as training an exponentially large ensemble of sub-networks with extensive weight sharing. At test time, using the full network with scaled weights approximates the geometric mean of the predictions of all $ 2^n $ sub-networks. This ensemble averaging reduces variance and improves generalization.

    Bayesian Interpretation

    Gal and Ghahramani (2016) showed that a neural network with dropout applied before every weight layer is mathematically equivalent to an approximation of a deep Gaussian process. Performing dropout at test time (Monte Carlo dropout) yields a distribution over predictions, providing a practical estimate of model uncertainty.
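Monte Carlo dropout can be sketched with a toy two-layer network: keep the dropout mask active at test time, run several stochastic forward passes, and report the mean and standard deviation of the predictions. Everything here (the tiny architecture, the function name `mc_dropout_predict`, the sample count) is a hypothetical illustration, not the setup from Gal and Ghahramani.

```python
import numpy as np

rng = np.random.default_rng(2)

def mc_dropout_predict(x, W1, W2, p=0.5, n_samples=100, rng=rng):
    """Monte Carlo dropout for a toy ReLU network: dropout stays ON at
    test time, and repeated stochastic passes give a predictive mean and
    a standard deviation that serves as a rough uncertainty estimate."""
    preds = []
    for _ in range(n_samples):
        h = np.maximum(x @ W1, 0.0)          # ReLU hidden layer
        r = rng.binomial(1, p, size=h.shape)
        h = (r * h) / p                      # inverted dropout, still active
        preds.append(h @ W2)
    preds = np.stack(preds)
    return preds.mean(axis=0), preds.std(axis=0)

x = np.ones((5, 3))                 # 5 inputs with 3 features each
W1 = rng.normal(size=(3, 16))
W2 = rng.normal(size=(16, 2))
mean, std = mc_dropout_predict(x, W1, W2)
```

The spread across passes is nonzero precisely because each pass samples a different sub-network, which is what the Bayesian reading interprets as predictive uncertainty.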

    Dropout Variants

Variant              | Description                                                  | Typical application
Standard dropout     | Drops individual neurons                                     | Fully connected layers
Spatial dropout      | Drops entire feature maps (channels)                         | Convolutional networks
DropConnect          | Drops individual weights instead of neurons                  | Dense layers
Variational dropout  | Learns the dropout rate per neuron/weight                    | Bayesian deep learning
DropBlock            | Drops contiguous regions of feature maps                     | Convolutional networks
Alpha dropout        | Preserves the self-normalizing property of SELU activations  | Self-normalizing networks

    Spatial dropout (Tompson et al., 2015) is particularly important for convolutional networks. Standard dropout on convolutional feature maps is ineffective because adjacent activations are highly correlated; dropping individual pixels still leaves redundant spatial information. Spatial dropout instead drops entire channels, forcing the network to use diverse feature representations.
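The channel-wise masking of spatial dropout can be sketched in NumPy for a single feature map of shape (channels, height, width); one Bernoulli draw per channel zeroes the whole channel, following the inverted-dropout scaling convention (the function name `spatial_dropout` is our own):

```python
import numpy as np

rng = np.random.default_rng(3)

def spatial_dropout(x, p=0.5, rng=rng):
    """Spatial dropout on a feature map x of shape (C, H, W): one mask
    value per channel, broadcast over the spatial dimensions, so each
    channel is either fully kept (and scaled by 1/p) or fully dropped."""
    c = x.shape[0]
    r = rng.binomial(1, p, size=(c, 1, 1))  # one Bernoulli per channel
    return (r * x) / p

x = np.ones((64, 8, 8))
y = spatial_dropout(x, p=0.5)
# Each of the 64 channels is either all zeros or uniformly scaled by 1/p.
```

Broadcasting the per-channel mask over height and width is what removes the spatial redundancy that defeats per-pixel dropout in convolutional layers.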

    Practical Guidelines

    • Placement: Apply dropout after the activation function in fully connected layers. In Transformers, dropout is applied to attention weights and after feed-forward sub-layers.
    • Rate selection: Start with $ p = 0.5 $ for hidden layers. Use higher keep rates (lower dropout) for layers with fewer parameters. Increase dropout for larger models or smaller datasets.
    • Interaction with BatchNorm: Using dropout and Batch Normalization together requires care, as dropout introduces variance that can destabilize batch statistics. A common practice is to apply dropout only after the final batch-normalized layer.
    • Scheduled dropout: Some training regimes start with no dropout and gradually increase the rate, or vice versa, over the course of training.

    Effect on Training

    Dropout typically increases training loss and slows convergence, since the effective model capacity is reduced at each step. However, it decreases the gap between training and validation performance, leading to better generalization. If training loss is already high (underfitting), dropout should be reduced or removed.

    References

    • Srivastava, N. et al. (2014). "Dropout: A Simple Way to Prevent Neural Networks from Overfitting". Journal of Machine Learning Research 15(56):1929–1958.
    • Gal, Y. and Ghahramani, Z. (2016). "Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning". ICML.
    • Tompson, J. et al. (2015). "Efficient Object Localization Using Convolutional Networks". CVPR.
    • Wan, L. et al. (2013). "Regularization of Neural Networks using DropConnect". ICML.
    • Ghiasi, G., Lin, T.-Y. and Le, Q. V. (2018). "DropBlock: A regularization method for convolutional networks". NeurIPS.