- Placement: Apply dropout after the activation function in fully connected layers. In Transformers, dropout is applied to the attention weights and to the output of each feed-forward sub-layer (see the first sketch after this list).
- Rate selection: Start with $p = 0.5$ for hidden layers. Use lower dropout rates (higher keep rates) for layers with fewer parameters, and increase the rate for larger models or smaller datasets, where the risk of overfitting is greater.
- Interaction with BatchNorm: Using dropout and Batch Normalization together requires care: the noise dropout injects shifts the variance of activations, which can destabilize the batch statistics BatchNorm relies on. A common practice is to apply dropout only after the final batch-normalized layer (the second sketch below follows this layout).
- Scheduled dropout: Some training regimes start with no dropout and gradually increase the rate over the course of training, while others anneal it in the opposite direction (the third sketch below shows a simple linear ramp).
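A minimal PyTorch sketch of the placement rules above; the layer sizes and the rates $p = 0.5$ (fully connected) and $p = 0.1$ (Transformer sub-layers) are illustrative choices, not prescribed values. Dropout follows each activation in the fully connected stack, `nn.MultiheadAttention` applies its own dropout to the attention weights, and a separate `Dropout` acts on the feed-forward sub-layer's output.

```python
import torch
import torch.nn as nn

# Fully connected stack: Linear -> activation -> Dropout, repeated.
mlp = nn.Sequential(
    nn.Linear(784, 512),
    nn.ReLU(),
    nn.Dropout(p=0.5),      # p = 0.5: the classic starting point for hidden layers
    nn.Linear(512, 256),
    nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(256, 10),     # no dropout on the output layer
)

# Transformer-style placement: attention-weight dropout lives inside
# nn.MultiheadAttention; a separate Dropout follows the feed-forward sub-layer.
attn = nn.MultiheadAttention(embed_dim=512, num_heads=8, dropout=0.1)
ffn = nn.Sequential(
    nn.Linear(512, 2048),
    nn.ReLU(),
    nn.Linear(2048, 512),
    nn.Dropout(p=0.1),      # dropout on the sub-layer output
)

x = torch.randn(16, 32, 512)   # (seq_len, batch, embed_dim)
out, _ = attn(x, x, x)         # attention-weight dropout applied internally
out = x + ffn(out)             # residual add after the dropped-out FFN output

mlp.train()                    # dropout active during training
logits = mlp(torch.randn(32, 784))
mlp.eval()                     # dropout disabled (identity) at inference
```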
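A sketch of the BatchNorm-safe layout described above, again with illustrative layer sizes: dropout appears only after the last batch-normalized layer, so the noise it injects never feeds into a downstream BatchNorm's batch statistics.

```python
import torch.nn as nn

bn_safe_net = nn.Sequential(
    nn.Linear(784, 512),
    nn.BatchNorm1d(512),
    nn.ReLU(),               # no dropout here: another BN layer follows
    nn.Linear(512, 256),
    nn.BatchNorm1d(256),     # final batch-normalized layer
    nn.ReLU(),
    nn.Dropout(p=0.5),       # safe: no BN downstream of this point
    nn.Linear(256, 10),
)
```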
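A sketch of scheduled dropout as a linear ramp from $p = 0$ to a target rate; the epoch count, target rate, and helper `set_dropout_rate` are assumptions for illustration. It relies on the fact that `nn.Dropout` reads its `p` attribute on every forward pass, so mutating it between epochs changes the effective rate; a decreasing schedule would simply reverse the ramp.

```python
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Dropout(p=0.0),       # start with no dropout
    nn.Linear(256, 10),
)

def set_dropout_rate(module: nn.Module, p: float) -> None:
    """Update every Dropout layer in place (hypothetical helper)."""
    for m in module.modules():
        if isinstance(m, nn.Dropout):
            m.p = p

n_epochs, target_p = 50, 0.5
for epoch in range(n_epochs):
    # Linear ramp from 0 to target_p over the course of training.
    set_dropout_rate(model, target_p * epoch / (n_epochs - 1))
    # ... run one training epoch here ...
```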