Batch Normalization


    Revision as of 07:00, 24 April 2026

    Topic area: Deep Learning
    Difficulty: Intermediate
    Prerequisites: Neural Networks, Backpropagation

    Batch normalization (often abbreviated BatchNorm or BN) is a technique for improving the speed, stability, and performance of deep neural networks by normalizing the inputs to each layer. Introduced by Ioffe and Szegedy in 2015, it has become a standard component in most modern deep learning architectures.

    Internal Covariate Shift

    The original motivation for batch normalization was to address internal covariate shift — the phenomenon where the distribution of each layer's inputs changes during training as the parameters of preceding layers are updated. This shifting distribution forces each layer to continuously adapt, slowing down convergence and requiring careful initialization and small learning rates.

    While the precise role of internal covariate shift has been debated (Santurkar et al., 2018, argued that BatchNorm's benefits stem more from smoothing the loss landscape), the practical effectiveness of the technique is well established.

    The Batch Normalization Algorithm

    During Training

    For a mini-batch $ \mathcal{B} = \{x_1, \dots, x_m\} $ of activations at a given layer, BatchNorm proceeds as follows:

    Step 1. Compute the mini-batch mean and variance:

    $ \mu_{\mathcal{B}} = \frac{1}{m} \sum_{i=1}^{m} x_i, \qquad \sigma_{\mathcal{B}}^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_{\mathcal{B}})^2 $

    Step 2. Normalize:

    $ \hat{x}_i = \frac{x_i - \mu_{\mathcal{B}}}{\sqrt{\sigma_{\mathcal{B}}^2 + \epsilon}} $

    where $ \epsilon $ is a small constant (e.g., $ 10^{-5} $) for numerical stability.

    Step 3. Scale and shift with learned parameters $ \gamma $ and $ \beta $:

    $ y_i = \gamma \hat{x}_i + \beta $

    The parameters $ \gamma $ and $ \beta $ are learned during training. They restore the network's ability to represent the identity transformation if that is optimal, ensuring that normalization does not reduce the model's expressiveness.
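The three training-time steps above can be sketched in NumPy as follows. This is a minimal illustrative implementation for a 2-D activation matrix; the function name and shapes are assumptions for the example, not notation from the original paper.

```python
import numpy as np

def batchnorm_train(x, gamma, beta, eps=1e-5):
    """Training-time batch normalization for activations of shape (m, d)."""
    mu = x.mean(axis=0)                    # Step 1: mini-batch mean, shape (d,)
    var = x.var(axis=0)                    # Step 1: mini-batch variance, shape (d,)
    x_hat = (x - mu) / np.sqrt(var + eps)  # Step 2: normalize
    y = gamma * x_hat + beta               # Step 3: scale and shift
    return y, mu, var

# Example: a batch of 64 samples with 8 features, deliberately off-center
rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(64, 8))
gamma, beta = np.ones(8), np.zeros(8)
y, mu, var = batchnorm_train(x, gamma, beta)
```

With $ \gamma = 1 $ and $ \beta = 0 $, each output feature has mean approximately 0 and variance approximately 1, regardless of the input's original scale and offset.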

    During Inference

    At inference time, statistics from individual mini-batches are unreliable (the input may be a single example). Instead, BatchNorm uses running estimates of the population mean and variance accumulated during training via exponential moving averages:

    $ \mu_{\mathrm{running}} \leftarrow (1 - \alpha)\, \mu_{\mathrm{running}} + \alpha\, \mu_{\mathcal{B}} $
    $ \sigma^2_{\mathrm{running}} \leftarrow (1 - \alpha)\, \sigma^2_{\mathrm{running}} + \alpha\, \sigma^2_{\mathcal{B}} $

    where $ \alpha $ is the momentum parameter (typically 0.1). These fixed statistics ensure deterministic outputs at inference.
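The running-statistics update and the inference-time transform can be sketched as follows; the function names are illustrative, and the scalar usage at the end assumes a single-feature layer for simplicity.

```python
import numpy as np

def update_running_stats(running_mean, running_var, batch_mean, batch_var,
                         momentum=0.1):
    # Exponential moving average, applied once per training mini-batch
    new_mean = (1.0 - momentum) * running_mean + momentum * batch_mean
    new_var = (1.0 - momentum) * running_var + momentum * batch_var
    return new_mean, new_var

def batchnorm_infer(x, gamma, beta, running_mean, running_var, eps=1e-5):
    # Deterministic: uses the fixed accumulated statistics, not batch statistics
    x_hat = (x - running_mean) / np.sqrt(running_var + eps)
    return gamma * x_hat + beta

# Simulate training on batches whose activations have mean 5 and variance 4:
# the running estimates converge toward the population statistics.
rm, rv = 0.0, 1.0
for _ in range(200):
    rm, rv = update_running_stats(rm, rv, batch_mean=5.0, batch_var=4.0)
```

Because `batchnorm_infer` depends only on the frozen running estimates, the same input always produces the same output, even for a batch of one.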

    Benefits

    • Higher learning rates: By constraining activation distributions, BatchNorm allows larger step sizes without divergence.
    • Reduced sensitivity to initialization: Networks with BatchNorm are more forgiving of poor weight initialization.
    • Regularization effect: The noise introduced by mini-batch statistics acts as a mild regularizer, sometimes reducing the need for Dropout.
    • Faster convergence: Training typically requires fewer epochs to reach a given level of performance.

    Placement

    BatchNorm is typically applied before the activation function (as in the original paper), though some practitioners place it after the activation. For convolutional layers, normalization is performed per-channel across the spatial dimensions and the batch dimension.
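For the convolutional case, "per-channel across the spatial and batch dimensions" means reducing over axes (N, H, W) of an (N, C, H, W) tensor. A minimal NumPy sketch, assuming the common NCHW layout:

```python
import numpy as np

def batchnorm2d_train(x, gamma, beta, eps=1e-5):
    """BatchNorm for conv activations of shape (N, C, H, W).

    Statistics are shared per channel: each channel's mean and variance
    are computed over the batch (N) and spatial (H, W) dimensions.
    """
    mu = x.mean(axis=(0, 2, 3), keepdims=True)   # shape (1, C, 1, 1)
    var = x.var(axis=(0, 2, 3), keepdims=True)   # shape (1, C, 1, 1)
    x_hat = (x - mu) / np.sqrt(var + eps)
    # gamma and beta have one entry per channel
    return gamma.reshape(1, -1, 1, 1) * x_hat + beta.reshape(1, -1, 1, 1)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3, 5, 5))                # 4 images, 3 channels, 5x5
y = batchnorm2d_train(x, np.ones(3), np.zeros(3))
```

Note that each channel contributes N·H·W values to its statistics, which is why convolutional BatchNorm tolerates smaller batch sizes than the fully connected case, up to a point.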

    Normalization Alternatives

    Method        | Normalizes over                           | Use case
    Batch Norm    | Batch and spatial dims, per channel       | CNNs with large batches
    Layer Norm    | All channels and spatial dims, per sample | Transformers, RNNs, small batches
    Instance Norm | Spatial dims only, per sample per channel | Style transfer, image generation
    Group Norm    | Groups of channels, per sample            | Object detection, small-batch training

    Layer normalization (Ba et al., 2016) normalizes across all features within a single sample, making it independent of batch size. It is the standard choice in Transformer architectures.

    Group normalization (Wu and He, 2018) divides channels into groups and normalizes within each group per sample. It bridges the gap between Layer Norm and Instance Norm and performs well when batch sizes are too small for reliable batch statistics.
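The variants in the table differ only in which axes the statistics are computed over. A sketch making that explicit, assuming an (N, C, H, W) tensor; the helper name and the group count are illustrative:

```python
import numpy as np

def normalize(x, axes, eps=1e-5):
    # The choice of reduction axes is what distinguishes the variants
    mu = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

x = np.random.default_rng(1).normal(size=(2, 8, 4, 4))  # (N, C, H, W)

bn = normalize(x, (0, 2, 3))   # Batch Norm: batch + spatial dims, per channel
ln = normalize(x, (1, 2, 3))   # Layer Norm: channels + spatial dims, per sample
inorm = normalize(x, (2, 3))   # Instance Norm: spatial dims, per sample per channel

# Group Norm: reshape channels into G groups, normalize within each group
G = 2
xg = x.reshape(2, G, 8 // G, 4, 4)
gn = normalize(xg, (2, 3, 4)).reshape(x.shape)
```

Only Batch Norm reduces over axis 0, so it is the only variant whose output depends on the other samples in the batch; the rest are batch-size independent.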

    Limitations

    • Performance degrades with very small batch sizes, as batch statistics become noisy.
    • Introduces a discrepancy between training (batch statistics) and inference (running statistics) behavior.
    • Not directly applicable to variable-length sequences without padding or masking.
    • The running statistics require careful handling when using distributed training across multiple devices.

    References

    • Ioffe, S. and Szegedy, C. (2015). "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift". ICML.
    • Ba, J. L., Kiros, J. R. and Hinton, G. E. (2016). "Layer Normalization". arXiv:1607.06450.
    • Wu, Y. and He, K. (2018). "Group Normalization". ECCV.
    • Santurkar, S. et al. (2018). "How Does Batch Normalization Help Optimization?". NeurIPS.