- Placement: Apply dropout after the activation function in fully connected layers. In Transformers, dropout is applied to the attention weights and to the output of each feed-forward sub-layer (see the first sketch after this list).
- Rate selection: Start with $p = 0.5$ for hidden layers. Use lower dropout rates (higher keep rates) for layers with fewer parameters, and increase the rate for larger models or smaller datasets, where the risk of overfitting is greater.
- Interaction with BatchNorm: Using dropout and Batch Normalization together requires care: the noise dropout injects shifts the variance of activations, which can destabilize the batch statistics BatchNorm relies on. A common practice is to apply dropout only after the final batch-normalized layer (the second sketch below follows this layout).
- Scheduled dropout: Some training regimes start with no dropout and gradually increase the rate over the course of training, while others anneal it in the opposite direction (the third sketch below shows a simple linear ramp).
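A minimal PyTorch sketch of the placement rules above; the layer sizes and the rates $p = 0.5$ (fully connected) and $p = 0.1$ (Transformer sub-layers) are illustrative choices, not prescribed values. Dropout follows each activation in the fully connected stack, `nn.MultiheadAttention` applies its own dropout to the attention weights, and a separate `Dropout` acts on the feed-forward sub-layer's output.

```python
import torch
import torch.nn as nn

# Fully connected stack: Linear -> activation -> Dropout, repeated.
mlp = nn.Sequential(
    nn.Linear(784, 512),
    nn.ReLU(),
    nn.Dropout(p=0.5),      # p = 0.5: the classic starting point for hidden layers
    nn.Linear(512, 256),
    nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(256, 10),     # no dropout on the output layer
)

# Transformer-style placement: attention-weight dropout lives inside
# nn.MultiheadAttention; a separate Dropout follows the feed-forward sub-layer.
attn = nn.MultiheadAttention(embed_dim=512, num_heads=8, dropout=0.1)
ffn = nn.Sequential(
    nn.Linear(512, 2048),
    nn.ReLU(),
    nn.Linear(2048, 512),
    nn.Dropout(p=0.1),      # dropout on the sub-layer output
)

x = torch.randn(16, 32, 512)   # (seq_len, batch, embed_dim)
out, _ = attn(x, x, x)         # attention-weight dropout applied internally
out = x + ffn(out)             # residual add after the dropped-out FFN output

mlp.train()                    # dropout active during training
logits = mlp(torch.randn(32, 784))
mlp.eval()                     # dropout disabled (identity) at inference
```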
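A sketch of the BatchNorm-safe layout described above, again with illustrative layer sizes: dropout appears only after the last batch-normalized layer, so the noise it injects never feeds into a downstream BatchNorm's batch statistics.

```python
import torch.nn as nn

bn_safe_net = nn.Sequential(
    nn.Linear(784, 512),
    nn.BatchNorm1d(512),
    nn.ReLU(),               # no dropout here: another BN layer follows
    nn.Linear(512, 256),
    nn.BatchNorm1d(256),     # final batch-normalized layer
    nn.ReLU(),
    nn.Dropout(p=0.5),       # safe: no BN downstream of this point
    nn.Linear(256, 10),
)
```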
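A sketch of scheduled dropout as a linear ramp from $p = 0$ to a target rate; the epoch count, target rate, and helper `set_dropout_rate` are assumptions for illustration. It relies on the fact that `nn.Dropout` reads its `p` attribute on every forward pass, so mutating it between epochs changes the effective rate; a decreasing schedule would simply reverse the ramp.

```python
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Dropout(p=0.0),       # start with no dropout
    nn.Linear(256, 10),
)

def set_dropout_rate(module: nn.Module, p: float) -> None:
    """Update every Dropout layer in place (hypothetical helper)."""
    for m in module.modules():
        if isinstance(m, nn.Dropout):
            m.p = p

n_epochs, target_p = 50, 0.5
for epoch in range(n_epochs):
    # Linear ramp from 0 to target_p over the course of training.
    set_dropout_rate(model, target_p * epoch / (n_epochs - 1))
    # ... run one training epoch here ...
```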