    • Placement: Apply dropout after the activation function in fully connected layers. In Transformers, dropout is applied to attention weights and after feed-forward sub-layers.
    • Rate selection: Start with $p = 0.5$ for hidden layers, a common default. Use lower dropout rates (higher keep probabilities) for layers with fewer parameters, and increase dropout for larger models or smaller datasets.
    • Interaction with BatchNorm: Using dropout and Batch Normalization together requires care, as dropout introduces variance that can destabilize batch statistics. A common practice is to apply dropout only after the final batch-normalized layer.
    • Scheduled dropout: Some training regimes start with no dropout and gradually increase the rate, or vice versa, over the course of training.
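
The placement advice above (dropout after the activation in fully connected layers, none on the output layer) can be sketched as a minimal NumPy forward pass using inverted dropout. The function names, shapes, and the two-layer architecture are illustrative assumptions, not part of the original text:

```python
import numpy as np

def dropout(x, p, rng, training=True):
    # Inverted dropout: zero each unit with probability p and scale
    # survivors by 1/(1-p), so the expected activation is unchanged
    # and no rescaling is needed at inference time.
    if not training or p == 0.0:
        return x
    keep = rng.random(x.shape) >= p
    return x * keep / (1.0 - p)

def mlp_forward(x, W1, b1, W2, b2, p=0.5, rng=None, training=True):
    # Hypothetical two-layer MLP used only to show dropout placement.
    if rng is None:
        rng = np.random.default_rng(0)
    h = np.maximum(x @ W1 + b1, 0.0)   # ReLU activation first
    h = dropout(h, p, rng, training)   # then dropout, after the activation
    return h @ W2 + b2                 # no dropout on the output layer
```

At evaluation time (`training=False`) the dropout layer is an identity, so predictions are deterministic.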
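
One way to realize the scheduled-dropout idea is a linear ramp of the rate over training steps; swapping `p_start` and `p_end` gives the decreasing variant. This helper and its parameter names are an assumed sketch, not a standard API:

```python
def dropout_schedule(step, total_steps, p_start=0.0, p_end=0.5):
    # Linearly interpolate the dropout rate from p_start at step 0
    # to p_end at total_steps; clamp beyond the end of training.
    frac = min(step / max(total_steps, 1), 1.0)
    return p_start + frac * (p_end - p_start)
```

The scheduled rate would be recomputed each step and passed to the dropout layer in place of a fixed $p$.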