Batch Normalization


    Revision as of 07:00, 24 April 2026

    Topic area: Deep Learning
    Difficulty: Intermediate
    Prerequisites: Neural Networks, Backpropagation

    Batch normalization (often abbreviated BatchNorm or BN) is a technique for improving the speed, stability, and performance of deep neural networks by normalizing the inputs to each layer. Introduced by Ioffe and Szegedy in 2015, it has become a standard component in most modern deep learning architectures.

    Internal Covariate Shift

    The original motivation for batch normalization was to address internal covariate shift — the phenomenon where the distribution of each layer's inputs changes during training as the parameters of preceding layers are updated. This shifting distribution forces each layer to continuously adapt, slowing down convergence and requiring careful initialization and small learning rates.

    While the precise role of internal covariate shift has been debated (Santurkar et al., 2018, argued that BatchNorm's benefits stem more from smoothing the loss landscape), the practical effectiveness of the technique is well established.

    The Batch Normalization Algorithm

    During Training

    For a mini-batch $ \mathcal{B} = \{x_1, \dots, x_m\} $ of activations at a given layer, BatchNorm proceeds as follows:

    Step 1. Compute the mini-batch mean and variance:

    $ \mu_{\mathcal{B}} = \frac{1}{m} \sum_{i=1}^{m} x_i, \qquad \sigma_{\mathcal{B}}^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_{\mathcal{B}})^2 $

    Step 2. Normalize:

    $ \hat{x}_i = \frac{x_i - \mu_{\mathcal{B}}}{\sqrt{\sigma_{\mathcal{B}}^2 + \epsilon}} $

    where $ \epsilon $ is a small constant (e.g., $ 10^{-5} $) for numerical stability.

    Step 3. Scale and shift with learned parameters $ \gamma $ and $ \beta $:

    $ y_i = \gamma \hat{x}_i + \beta $

    The parameters $ \gamma $ and $ \beta $ are learned during training. They restore the network's ability to represent the identity transformation if that is optimal, ensuring that normalization does not reduce the model's expressiveness.
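The three training-time steps above can be sketched in NumPy as follows. This is a minimal illustrative implementation for a 2-D activation matrix; the function name and shapes are assumptions for the example, not notation from the original paper.

```python
import numpy as np

def batchnorm_train(x, gamma, beta, eps=1e-5):
    """Training-time batch normalization for activations of shape (m, d)."""
    mu = x.mean(axis=0)                    # Step 1: mini-batch mean, shape (d,)
    var = x.var(axis=0)                    # Step 1: mini-batch variance, shape (d,)
    x_hat = (x - mu) / np.sqrt(var + eps)  # Step 2: normalize
    y = gamma * x_hat + beta               # Step 3: scale and shift
    return y, mu, var

# Example: a batch of 64 samples with 8 features, deliberately off-center
rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(64, 8))
gamma, beta = np.ones(8), np.zeros(8)
y, mu, var = batchnorm_train(x, gamma, beta)
```

With $ \gamma = 1 $ and $ \beta = 0 $, each output feature has mean approximately 0 and variance approximately 1, regardless of the input's original scale and offset.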

    During Inference

    At inference time, statistics from individual mini-batches are unreliable (the input may be a single example). Instead, BatchNorm uses running estimates of the population mean and variance accumulated during training via exponential moving averages:

    $ \mu_{\mathrm{running}} \leftarrow (1 - \alpha)\, \mu_{\mathrm{running}} + \alpha\, \mu_{\mathcal{B}} $
    $ \sigma^2_{\mathrm{running}} \leftarrow (1 - \alpha)\, \sigma^2_{\mathrm{running}} + \alpha\, \sigma^2_{\mathcal{B}} $

    where $ \alpha $ is the momentum parameter (typically 0.1). These fixed statistics ensure deterministic outputs at inference.
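The running-statistics update and the inference-time transform can be sketched as follows; the function names are illustrative, and the scalar usage at the end assumes a single-feature layer for simplicity.

```python
import numpy as np

def update_running_stats(running_mean, running_var, batch_mean, batch_var,
                         momentum=0.1):
    # Exponential moving average, applied once per training mini-batch
    new_mean = (1.0 - momentum) * running_mean + momentum * batch_mean
    new_var = (1.0 - momentum) * running_var + momentum * batch_var
    return new_mean, new_var

def batchnorm_infer(x, gamma, beta, running_mean, running_var, eps=1e-5):
    # Deterministic: uses the fixed accumulated statistics, not batch statistics
    x_hat = (x - running_mean) / np.sqrt(running_var + eps)
    return gamma * x_hat + beta

# Simulate training on batches whose activations have mean 5 and variance 4:
# the running estimates converge toward the population statistics.
rm, rv = 0.0, 1.0
for _ in range(200):
    rm, rv = update_running_stats(rm, rv, batch_mean=5.0, batch_var=4.0)
```

Because `batchnorm_infer` depends only on the frozen running estimates, the same input always produces the same output, even for a batch of one.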

    Benefits

    • Higher learning rates: By constraining activation distributions, BatchNorm allows larger step sizes without divergence.
    • Reduced sensitivity to initialization: Networks with BatchNorm are more forgiving of poor weight initialization.
    • Regularization effect: The noise introduced by mini-batch statistics acts as a mild regularizer, sometimes reducing the need for Dropout.
    • Faster convergence: Training typically requires fewer epochs to reach a given level of performance.

    Placement

    BatchNorm is typically applied before the activation function (as in the original paper), though some practitioners place it after the activation. For convolutional layers, normalization is performed per-channel across the spatial dimensions and the batch dimension.
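For the convolutional case, "per-channel across the spatial and batch dimensions" means reducing over axes (N, H, W) of an (N, C, H, W) tensor. A minimal NumPy sketch, assuming the common NCHW layout:

```python
import numpy as np

def batchnorm2d_train(x, gamma, beta, eps=1e-5):
    """BatchNorm for conv activations of shape (N, C, H, W).

    Statistics are shared per channel: each channel's mean and variance
    are computed over the batch (N) and spatial (H, W) dimensions.
    """
    mu = x.mean(axis=(0, 2, 3), keepdims=True)   # shape (1, C, 1, 1)
    var = x.var(axis=(0, 2, 3), keepdims=True)   # shape (1, C, 1, 1)
    x_hat = (x - mu) / np.sqrt(var + eps)
    # gamma and beta have one entry per channel
    return gamma.reshape(1, -1, 1, 1) * x_hat + beta.reshape(1, -1, 1, 1)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3, 5, 5))                # 4 images, 3 channels, 5x5
y = batchnorm2d_train(x, np.ones(3), np.zeros(3))
```

Note that each channel contributes N·H·W values to its statistics, which is why convolutional BatchNorm tolerates smaller batch sizes than the fully connected case, up to a point.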

    Normalization Alternatives

    Method        | Normalizes over                           | Use case
    Batch Norm    | Batch and spatial dims, per channel       | CNNs with large batches
    Layer Norm    | All channels and spatial dims, per sample | Transformers, RNNs, small batches
    Instance Norm | Spatial dims only, per sample per channel | Style transfer, image generation
    Group Norm    | Groups of channels, per sample            | Object detection, small-batch training

    Layer normalization (Ba et al., 2016) normalizes across all features within a single sample, making it independent of batch size. It is the standard choice in Transformer architectures.

    Group normalization (Wu and He, 2018) divides channels into groups and normalizes within each group per sample. It bridges the gap between Layer Norm and Instance Norm and performs well when batch sizes are too small for reliable batch statistics.
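The variants in the table differ only in which axes the statistics are computed over. A sketch making that explicit, assuming an (N, C, H, W) tensor; the helper name and the group count are illustrative:

```python
import numpy as np

def normalize(x, axes, eps=1e-5):
    # The choice of reduction axes is what distinguishes the variants
    mu = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

x = np.random.default_rng(1).normal(size=(2, 8, 4, 4))  # (N, C, H, W)

bn = normalize(x, (0, 2, 3))   # Batch Norm: batch + spatial dims, per channel
ln = normalize(x, (1, 2, 3))   # Layer Norm: channels + spatial dims, per sample
inorm = normalize(x, (2, 3))   # Instance Norm: spatial dims, per sample per channel

# Group Norm: reshape channels into G groups, normalize within each group
G = 2
xg = x.reshape(2, G, 8 // G, 4, 4)
gn = normalize(xg, (2, 3, 4)).reshape(x.shape)
```

Only Batch Norm reduces over axis 0, so it is the only variant whose output depends on the other samples in the batch; the rest are batch-size independent.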

    Limitations

    • Performance degrades with very small batch sizes, as batch statistics become noisy.
    • Introduces a discrepancy between training (batch statistics) and inference (running statistics) behavior.
    • Not directly applicable to variable-length sequences without padding or masking.
    • The running statistics require careful handling when using distributed training across multiple devices.

    References

    • Ioffe, S. and Szegedy, C. (2015). "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift". ICML.
    • Ba, J. L., Kiros, J. R. and Hinton, G. E. (2016). "Layer Normalization". arXiv:1607.06450.
    • Wu, Y. and He, K. (2018). "Group Normalization". ECCV.
    • Santurkar, S. et al. (2018). "How Does Batch Normalization Help Optimization?". NeurIPS.