- Higher learning rates: By constraining activation distributions, BatchNorm allows larger step sizes without divergence.
- Reduced sensitivity to initialization: Networks with BatchNorm are more forgiving of poor weight initialization.
- Regularization effect: The noise introduced by mini-batch statistics acts as a mild regularizer, sometimes reducing the need for Dropout (see the sketch after this list).
- Faster convergence: Training typically requires fewer epochs to reach a given level of performance.
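The regularization point comes from the fact that the normalization statistics are recomputed from each mini-batch, so the same example is normalized slightly differently depending on which other examples it is batched with. The following NumPy sketch (the function name `batchnorm_forward` and the parameters `gamma` and `beta` are illustrative, not from the original text) makes that per-batch noise visible.

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Normalize each feature with the current mini-batch's statistics,
    then apply a learned scale (gamma) and shift (beta)."""
    mu = x.mean(axis=0)                      # per-feature mini-batch mean
    var = x.var(axis=0)                      # per-feature mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)    # normalized activations
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
data = rng.normal(loc=3.0, scale=2.0, size=(256, 4))
gamma, beta = np.ones(4), np.zeros(4)

# Two different mini-batches drawn from the same data have slightly
# different means and variances, so a given example is normalized
# slightly differently each time it appears in a batch -- this is the
# noise that acts as a mild regularizer during training.
batch_a = data[rng.choice(256, 32, replace=False)]
batch_b = data[rng.choice(256, 32, replace=False)]
print(batch_a.mean(axis=0) - batch_b.mean(axis=0))  # small nonzero differences
```

At inference time this noise is removed by replacing the mini-batch statistics with running averages accumulated during training, which is why the regularizing effect applies only while training.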