<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://marovi.ai/index.php?action=history&amp;feed=atom&amp;title=Dropout</id>
	<title>Dropout - Revision history</title>
	<link rel="self" type="application/atom+xml" href="https://marovi.ai/index.php?action=history&amp;feed=atom&amp;title=Dropout"/>
	<link rel="alternate" type="text/html" href="https://marovi.ai/index.php?title=Dropout&amp;action=history"/>
	<updated>2026-04-24T13:01:39Z</updated>
	<subtitle>Revision history for this page on the wiki</subtitle>
	<generator>MediaWiki 1.39.1</generator>
	<entry>
		<id>https://marovi.ai/index.php?title=Dropout&amp;diff=2136&amp;oldid=prev</id>
		<title>DeployBot: [deploy-bot] Deploy from CI (8c92aeb)</title>
		<link rel="alternate" type="text/html" href="https://marovi.ai/index.php?title=Dropout&amp;diff=2136&amp;oldid=prev"/>
		<updated>2026-04-24T07:08:59Z</updated>

		<summary type="html">&lt;p&gt;[deploy-bot] Deploy from CI (8c92aeb)&lt;/p&gt;
&lt;table style=&quot;background-color: #fff; color: #202122;&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;Revision as of 07:08, 24 April 2026&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l97&quot;&gt;Line 97:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 97:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[[Category:Intermediate]]&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[[Category:Intermediate]]&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[[Category:Neural Networks]]&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[[Category:Neural Networks]]&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&amp;lt;!--v1.2.0 cache-bust--&gt;&lt;/del&gt;&lt;/div&gt;&lt;/td&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-added&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&amp;lt;!-- pass 2 --&gt;&lt;/del&gt;&lt;/div&gt;&lt;/td&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-added&quot;&gt;&lt;/td&gt;&lt;/tr&gt;

&lt;!-- diff cache key mediawiki:diff::1.12:old-2096:rev-2136 --&gt;
&lt;/table&gt;</summary>
		<author><name>DeployBot</name></author>
	</entry>
	<entry>
		<id>https://marovi.ai/index.php?title=Dropout&amp;diff=2096&amp;oldid=prev</id>
		<title>DeployBot: Pass 2 force re-parse</title>
		<link rel="alternate" type="text/html" href="https://marovi.ai/index.php?title=Dropout&amp;diff=2096&amp;oldid=prev"/>
		<updated>2026-04-24T07:00:39Z</updated>

		<summary type="html">&lt;p&gt;Pass 2 force re-parse&lt;/p&gt;
&lt;table style=&quot;background-color: #fff; color: #202122;&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;Revision as of 07:00, 24 April 2026&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l98&quot;&gt;Line 98:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 98:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[[Category:Neural Networks]]&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[[Category:Neural Networks]]&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&amp;lt;!--v1.2.0 cache-bust--&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&amp;lt;!--v1.2.0 cache-bust--&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-deleted&quot;&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&amp;lt;!-- pass 2 --&gt;&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;

&lt;!-- diff cache key mediawiki:diff::1.12:old-2059:rev-2096 --&gt;
&lt;/table&gt;</summary>
		<author><name>DeployBot</name></author>
	</entry>
	<entry>
		<id>https://marovi.ai/index.php?title=Dropout&amp;diff=2059&amp;oldid=prev</id>
		<title>DeployBot: Force re-parse after Math source-mode rollout (v1.2.0)</title>
		<link rel="alternate" type="text/html" href="https://marovi.ai/index.php?title=Dropout&amp;diff=2059&amp;oldid=prev"/>
		<updated>2026-04-24T06:58:03Z</updated>

		<summary type="html">&lt;p&gt;Force re-parse after Math source-mode rollout (v1.2.0)&lt;/p&gt;
&lt;table style=&quot;background-color: #fff; color: #202122;&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;Revision as of 06:58, 24 April 2026&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l97&quot;&gt;Line 97:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 97:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[[Category:Intermediate]]&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[[Category:Intermediate]]&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[[Category:Neural Networks]]&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[[Category:Neural Networks]]&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-deleted&quot;&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&amp;lt;!--v1.2.0 cache-bust--&gt;&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;

&lt;!-- diff cache key mediawiki:diff::1.12:old-1985:rev-2059 --&gt;
&lt;/table&gt;</summary>
		<author><name>DeployBot</name></author>
	</entry>
	<entry>
		<id>https://marovi.ai/index.php?title=Dropout&amp;diff=1985&amp;oldid=prev</id>
		<title>DeployBot: [deploy-bot] Deploy from CI (775ba6e)</title>
		<link rel="alternate" type="text/html" href="https://marovi.ai/index.php?title=Dropout&amp;diff=1985&amp;oldid=prev"/>
		<updated>2026-04-24T04:01:42Z</updated>

		<summary type="html">&lt;p&gt;[deploy-bot] Deploy from CI (775ba6e)&lt;/p&gt;
&lt;p&gt;&lt;b&gt;New page&lt;/b&gt;&lt;/p&gt;&lt;div&gt;{{LanguageBar | page = Dropout}}&lt;br /&gt;
{{ArticleInfobox | topic_area = Deep Learning | difficulty = Intermediate | prerequisites = [[Neural Networks]], [[Overfitting and Regularization]]}}&lt;br /&gt;
{{ContentMeta | generated_by = claude-opus | model_used = claude-opus-4-6 | generated_date = 2026-03-13}}&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;Dropout&amp;#039;&amp;#039;&amp;#039; is a regularization technique for neural networks that randomly sets a fraction of neuron activations to zero during each training step. Introduced by Srivastava et al. (2014), dropout is one of the most widely used methods for preventing overfitting in deep learning.&lt;br /&gt;
&lt;br /&gt;
== Motivation: Co-Adaptation ==&lt;br /&gt;
&lt;br /&gt;
In large neural networks, neurons can develop complex &amp;#039;&amp;#039;&amp;#039;co-adaptation&amp;#039;&amp;#039;&amp;#039; patterns — groups of neurons that only function correctly in the presence of specific other neurons. This tight coupling makes the network brittle and prone to overfitting, since the learned features depend on the particular idiosyncrasies of the training data rather than capturing robust, general patterns.&lt;br /&gt;
&lt;br /&gt;
Dropout breaks these co-adaptations by forcing each neuron to learn features that are useful in conjunction with many different random subsets of the other neurons.&lt;br /&gt;
&lt;br /&gt;
== The Dropout Algorithm ==&lt;br /&gt;
&lt;br /&gt;
=== During Training ===&lt;br /&gt;
&lt;br /&gt;
At each training step, every neuron in a dropout layer is independently retained with probability &amp;lt;math&amp;gt;p&amp;lt;/math&amp;gt; (the &amp;#039;&amp;#039;&amp;#039;keep probability&amp;#039;&amp;#039;&amp;#039;) or set to zero with probability &amp;lt;math&amp;gt;1 - p&amp;lt;/math&amp;gt;. Formally, for a layer with activation vector &amp;lt;math&amp;gt;\mathbf{h}&amp;lt;/math&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;r_j \sim \mathrm{Bernoulli}(p)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\tilde{h}_j = r_j \cdot h_j&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;math&amp;gt;r_j&amp;lt;/math&amp;gt; is a binary mask drawn independently for each neuron &amp;lt;math&amp;gt;j&amp;lt;/math&amp;gt;. A typical keep probability is &amp;lt;math&amp;gt;p = 0.5&amp;lt;/math&amp;gt; for hidden layers and &amp;lt;math&amp;gt;p = 0.8&amp;lt;/math&amp;gt; or higher for the input layer.&lt;br /&gt;
&lt;br /&gt;
Each training step effectively trains a different &amp;quot;thinned&amp;quot; sub-network sampled from the full architecture. With &amp;lt;math&amp;gt;n&amp;lt;/math&amp;gt; neurons, there are &amp;lt;math&amp;gt;2^n&amp;lt;/math&amp;gt; possible sub-networks, creating an implicit ensemble.&lt;br /&gt;
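&lt;br /&gt;
The training-time masking above can be sketched in a few lines of plain Python (a minimal, framework-free illustration; the function name is ours, not from the paper):&lt;br /&gt;

```python
import random

def dropout_train(h, p, rng=random.random):
    # Draw an independent Bernoulli(p) mask r_j per neuron and multiply:
    # each activation h_j survives with probability p, else becomes 0.
    return [hj if rng() < p else 0.0 for hj in h]

h = [0.3, -1.2, 0.7, 2.0]
keep_all = dropout_train(h, 1.0)   # p = 1 retains every activation
drop_all = dropout_train(h, 0.0)   # p = 0 zeroes every activation
```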
&lt;br /&gt;
=== During Inference: Inverted Dropout ===&lt;br /&gt;
&lt;br /&gt;
At inference time, all neurons are active, so without correction each neuron&amp;#039;s output is larger by a factor of &amp;lt;math&amp;gt;1/p&amp;lt;/math&amp;gt; than its expected value during training. Two approaches correct for this mismatch:&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Standard dropout&amp;#039;&amp;#039;&amp;#039;: Multiply all weights by &amp;lt;math&amp;gt;p&amp;lt;/math&amp;gt; at test time.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Inverted dropout&amp;#039;&amp;#039;&amp;#039; (more common): During training, divide the retained activations by &amp;lt;math&amp;gt;p&amp;lt;/math&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\tilde{h}_j = \frac{r_j \cdot h_j}{p}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Inverted dropout ensures that the expected value of &amp;lt;math&amp;gt;\tilde{h}_j&amp;lt;/math&amp;gt; equals &amp;lt;math&amp;gt;h_j&amp;lt;/math&amp;gt; during training, so no adjustment is needed at inference. This is the default implementation in frameworks such as PyTorch and TensorFlow.&lt;br /&gt;
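&lt;br /&gt;
A sketch of inverted dropout in the same plain-Python style (names are ours; averaging many masked passes empirically recovers the original activation, matching the expectation argument above):&lt;br /&gt;

```python
import random

def inverted_dropout(h, p, rng=random.random):
    # Keep each activation with probability p and scale survivors by 1/p,
    # so E[output_j] = p * (h_j / p) = h_j and inference needs no rescaling.
    return [hj / p if rng() < p else 0.0 for hj in h]

# Empirically check the expectation for a single activation h_j = 1.0:
rnd = random.Random(0)
samples = [inverted_dropout([1.0], 0.5, rnd.random)[0] for _ in range(100_000)]
mean = sum(samples) / len(samples)   # close to 1.0, the original activation
```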
&lt;br /&gt;
== Theoretical Interpretation ==&lt;br /&gt;
&lt;br /&gt;
=== Ensemble Perspective ===&lt;br /&gt;
&lt;br /&gt;
Dropout can be viewed as training an exponentially large ensemble of sub-networks with extensive weight sharing. At test time, using the full network with scaled weights approximates the geometric mean of the predictions of all &amp;lt;math&amp;gt;2^n&amp;lt;/math&amp;gt; sub-networks. This ensemble averaging reduces variance and improves generalization.&lt;br /&gt;
&lt;br /&gt;
=== Bayesian Interpretation ===&lt;br /&gt;
&lt;br /&gt;
Gal and Ghahramani (2016) showed that training a neural network with dropout applied before every weight layer is mathematically equivalent to approximate variational inference in a deep Gaussian process. Performing dropout at test time (&amp;#039;&amp;#039;&amp;#039;Monte Carlo dropout&amp;#039;&amp;#039;&amp;#039;) yields a distribution over predictions, providing a practical estimate of model uncertainty.&lt;br /&gt;
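&lt;br /&gt;
Monte Carlo dropout can be sketched as follows (a toy illustration with a hypothetical one-weight &amp;quot;model&amp;quot;; the only point is that dropout stays active at test time and the spread across passes serves as the uncertainty estimate):&lt;br /&gt;

```python
import random
import statistics

def mc_dropout_predict(forward, x, n_samples=200, seed=0):
    # Run several stochastic forward passes with dropout still active;
    # the mean is the prediction, the standard deviation a practical
    # uncertainty estimate.
    rnd = random.Random(seed)
    preds = [forward(x, rnd.random) for _ in range(n_samples)]
    return statistics.mean(preds), statistics.stdev(preds)

def toy_forward(x, rng, p=0.5, w=3.0):
    # Hypothetical single-neuron model with inverted dropout on its input.
    kept = x / p if rng() < p else 0.0
    return w * kept

mean, std = mc_dropout_predict(toy_forward, 1.0)
# mean is near w * x = 3.0; std > 0 reflects the dropout-induced spread
```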
&lt;br /&gt;
== Dropout Variants ==&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! Variant !! Description !! Typical application&lt;br /&gt;
|-&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;Standard dropout&amp;#039;&amp;#039;&amp;#039; || Drops individual neurons || Fully connected layers&lt;br /&gt;
|-&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;Spatial dropout&amp;#039;&amp;#039;&amp;#039; || Drops entire feature maps (channels) || Convolutional networks&lt;br /&gt;
|-&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;DropConnect&amp;#039;&amp;#039;&amp;#039; || Drops individual weights instead of neurons || Dense layers&lt;br /&gt;
|-&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;Variational dropout&amp;#039;&amp;#039;&amp;#039; || Learns the dropout rate per neuron/weight || Bayesian deep learning&lt;br /&gt;
|-&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;DropBlock&amp;#039;&amp;#039;&amp;#039; || Drops contiguous regions of feature maps || Convolutional networks&lt;br /&gt;
|-&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;Alpha dropout&amp;#039;&amp;#039;&amp;#039; || Maintains self-normalizing property (for SELU activations) || Self-normalizing networks&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;Spatial dropout&amp;#039;&amp;#039;&amp;#039; (Tompson et al., 2015) is particularly important for convolutional networks. Standard dropout on convolutional feature maps is ineffective because adjacent activations are highly correlated; dropping individual pixels still leaves redundant spatial information. Spatial dropout instead drops entire channels, forcing the network to use diverse feature representations.&lt;br /&gt;
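&lt;br /&gt;
The channel-wise behaviour can be sketched in plain Python on a list of 2-D feature maps (a minimal illustration of the idea; in practice frameworks provide this as a layer, e.g. PyTorch&amp;#039;s &amp;lt;code&amp;gt;Dropout2d&amp;lt;/code&amp;gt;):&lt;br /&gt;

```python
import random

def spatial_dropout(channels, p, rng=random.random):
    # One Bernoulli(p) draw PER CHANNEL, not per pixel: a channel is either
    # kept whole (scaled by 1/p, inverted convention) or zeroed entirely.
    out = []
    for ch in channels:
        if rng() < p:
            out.append([[v / p for v in row] for row in ch])
        else:
            out.append([[0.0 for _ in row] for row in ch])
    return out

fmap = [[[1.0, 2.0], [3.0, 4.0]],   # channel 0 (2x2)
        [[5.0, 6.0], [7.0, 8.0]]]   # channel 1 (2x2)
result = spatial_dropout(fmap, 1.0)   # p = 1 keeps everything unchanged
```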
&lt;br /&gt;
== Practical Guidelines ==&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Placement&amp;#039;&amp;#039;&amp;#039;: Apply dropout after the activation function in fully connected layers. In Transformers, dropout is applied to attention weights and after feed-forward sub-layers.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Rate selection&amp;#039;&amp;#039;&amp;#039;: Start with &amp;lt;math&amp;gt;p = 0.5&amp;lt;/math&amp;gt; for hidden layers. Use higher keep rates (lower dropout) for layers with fewer parameters. Increase dropout for larger models or smaller datasets.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Interaction with BatchNorm&amp;#039;&amp;#039;&amp;#039;: Using dropout and [[Batch Normalization]] together requires care, as dropout introduces variance that can destabilize batch statistics. A common practice is to apply dropout only after the final batch-normalized layer.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Scheduled dropout&amp;#039;&amp;#039;&amp;#039;: Some training regimes start with no dropout and gradually increase the rate, or vice versa, over the course of training.&lt;br /&gt;
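&lt;br /&gt;
The placement guideline for fully connected layers can be sketched as a toy forward pass (a hypothetical two-layer MLP in plain Python, assuming the inverted-dropout convention; all names are ours):&lt;br /&gt;

```python
import random

def relu(v):
    return [max(0.0, x) for x in v]

def linear(v, w, b):
    return [sum(wi * xi for wi, xi in zip(row, v)) + bi
            for row, bi in zip(w, b)]

def dropout(v, p, rng):
    return [x / p if rng() < p else 0.0 for x in v]

def mlp_forward(x, params, train=True, p=0.5, rng=random.random):
    # Dropout goes AFTER the hidden layer's activation; at inference
    # (train=False) it is skipped entirely (inverted convention).
    h = relu(linear(x, *params["hidden"]))
    if train:
        h = dropout(h, p, rng)
    return linear(h, *params["out"])

params = {
    "hidden": ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]),
    "out": ([[1.0, 1.0]], [0.0]),
}
y = mlp_forward([1.0, 2.0], params, train=False)   # deterministic at inference
```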
&lt;br /&gt;
== Effect on Training ==&lt;br /&gt;
&lt;br /&gt;
Dropout typically increases training loss and slows convergence, since the effective model capacity is reduced at each step. However, it decreases the gap between training and validation performance, leading to better generalization. If training loss is already high (underfitting), dropout should be reduced or removed.&lt;br /&gt;
&lt;br /&gt;
== See also ==&lt;br /&gt;
&lt;br /&gt;
* [[Overfitting and Regularization]]&lt;br /&gt;
* [[Batch Normalization]]&lt;br /&gt;
* [[Neural Networks]]&lt;br /&gt;
* [[Backpropagation]]&lt;br /&gt;
* [[Bayesian deep learning]]&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
* Srivastava, N. et al. (2014). &amp;quot;Dropout: A Simple Way to Prevent Neural Networks from Overfitting&amp;quot;. &amp;#039;&amp;#039;Journal of Machine Learning Research&amp;#039;&amp;#039; 15(56):1929–1958.&lt;br /&gt;
* Gal, Y. and Ghahramani, Z. (2016). &amp;quot;Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning&amp;quot;. &amp;#039;&amp;#039;ICML&amp;#039;&amp;#039;.&lt;br /&gt;
* Tompson, J. et al. (2015). &amp;quot;Efficient Object Localization Using Convolutional Networks&amp;quot;. &amp;#039;&amp;#039;CVPR&amp;#039;&amp;#039;.&lt;br /&gt;
* Wan, L. et al. (2013). &amp;quot;Regularization of Neural Networks using DropConnect&amp;quot;. &amp;#039;&amp;#039;ICML&amp;#039;&amp;#039;.&lt;br /&gt;
* Ghiasi, G., Lin, T.-Y. and Le, Q. V. (2018). &amp;quot;DropBlock: A regularization method for convolutional networks&amp;quot;. &amp;#039;&amp;#039;NeurIPS&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
[[Category:Deep Learning]]&lt;br /&gt;
[[Category:Intermediate]]&lt;br /&gt;
[[Category:Neural Networks]]&lt;/div&gt;</summary>
		<author><name>DeployBot</name></author>
	</entry>
</feed>