DeployBot: [deploy-bot] Deploy from CI (8c92aeb)

2026-04-24T07:08:58Z

[deploy-bot] Deploy from CI (8c92aeb)

← Older revision		Revision as of 07:08, 24 April 2026
Line 98:		Line 98:
	[[Category:Intermediate]]		[[Category:Intermediate]]
	[[Category:Neural Networks]]		[[Category:Neural Networks]]
	~~<!--v1.2.0 cache-bust-->~~
	~~<!-- pass 2 -->~~

DeployBot: Pass 2 force re-parse

2026-04-24T07:00:24Z

Pass 2 force re-parse

← Older revision		Revision as of 07:00, 24 April 2026
Line 99:		Line 99:
	[[Category:Neural Networks]]		[[Category:Neural Networks]]
	<!--v1.2.0 cache-bust-->		<!--v1.2.0 cache-bust-->
			<!-- pass 2 -->

DeployBot: Force re-parse after Math source-mode rollout (v1.2.0)

2026-04-24T06:57:48Z

Force re-parse after Math source-mode rollout (v1.2.0)

← Older revision		Revision as of 06:57, 24 April 2026
Line 98:		Line 98:
	[[Category:Intermediate]]		[[Category:Intermediate]]
	[[Category:Neural Networks]]		[[Category:Neural Networks]]
			<!--v1.2.0 cache-bust-->

DeployBot: [deploy-bot] Deploy from CI (775ba6e)

2026-04-24T04:01:40Z

[deploy-bot] Deploy from CI (775ba6e)

New page

{{LanguageBar | page = Batch Normalization}}
{{ArticleInfobox | topic_area = Deep Learning | difficulty = Intermediate | prerequisites = [[Neural Networks]], [[Backpropagation]]}}
{{ContentMeta | generated_by = claude-opus | model_used = claude-opus-4-6 | generated_date = 2026-03-13}}

'''Batch normalization''' (often abbreviated '''BatchNorm''' or '''BN''') is a technique for improving the training speed, stability, and performance of deep neural networks by normalizing the inputs to a layer using statistics computed over each mini-batch. Introduced by Ioffe and Szegedy in 2015, it has become a standard component of many deep learning architectures, particularly convolutional networks.

== Internal Covariate Shift ==

The original motivation for batch normalization was to address '''internal covariate shift''', the phenomenon where the distribution of each layer's inputs changes during training as the parameters of preceding layers are updated. This shifting distribution forces each layer to continuously adapt, which slows convergence and calls for careful initialization and small learning rates.

While the precise role of internal covariate shift has been debated (Santurkar et al., 2018, argued that BatchNorm's benefits stem more from smoothing the loss landscape), the practical effectiveness of the technique is well established.

== The Batch Normalization Algorithm ==

=== During Training ===

For a mini-batch <math>\mathcal{B} = \{x_1, \dots, x_m\}</math> of activations at a given layer, BatchNorm proceeds as follows:

'''Step 1.''' Compute the mini-batch mean and variance:

:<math>\mu_{\mathcal{B}} = \frac{1}{m} \sum_{i=1}^{m} x_i, \qquad \sigma_{\mathcal{B}}^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_{\mathcal{B}})^2</math>

'''Step 2.''' Normalize:

:<math>\hat{x}_i = \frac{x_i - \mu_{\mathcal{B}}}{\sqrt{\sigma_{\mathcal{B}}^2 + \epsilon}}</math>

where <math>\epsilon</math> is a small constant (e.g., <math>10^{-5}</math>) for numerical stability.

'''Step 3.''' Scale and shift with learned parameters <math>\gamma</math> and <math>\beta</math>:

:<math>y_i = \gamma \hat{x}_i + \beta</math>

The parameters <math>\gamma</math> and <math>\beta</math> are learned during training. They restore the network's ability to represent the identity transformation if that is optimal, ensuring that normalization does not reduce the model's expressiveness.
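The three steps above can be sketched in NumPy as follows. This is a minimal illustration rather than a reference implementation: it assumes a 2-D input of shape (batch, features), and the function name <code>batchnorm_train</code> is hypothetical. The last call checks the identity claim: setting <math>\gamma = \sqrt{\sigma_{\mathcal{B}}^2 + \epsilon}</math> and <math>\beta = \mu_{\mathcal{B}}</math> recovers the original activations.

```python
import numpy as np

def batchnorm_train(x, gamma, beta, eps=1e-5):
    """Training-mode BatchNorm over a (batch, features) array (illustrative sketch)."""
    mu = x.mean(axis=0)                      # Step 1: mini-batch mean per feature
    var = x.var(axis=0)                      # Step 1: biased variance (divides by m)
    x_hat = (x - mu) / np.sqrt(var + eps)    # Step 2: normalize
    y = gamma * x_hat + beta                 # Step 3: scale and shift
    return y, mu, var

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(64, 4))

# With gamma = 1 and beta = 0, each feature ends up with roughly
# zero mean and unit variance across the batch.
y, mu, var = batchnorm_train(x, gamma=np.ones(4), beta=np.zeros(4))

# Choosing gamma = sqrt(var + eps) and beta = mu undoes the normalization,
# so the layer can still represent the identity transformation.
y_identity, _, _ = batchnorm_train(x, gamma=np.sqrt(var + 1e-5), beta=mu)
```

In a real network <math>\gamma</math> and <math>\beta</math> would be trainable parameters updated by backpropagation; here they are passed in as plain arrays to keep the sketch short.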

=== During Inference ===

At inference time, statistics from individual mini-batches are unreliable (the input may be a single example). Instead, BatchNorm uses running estimates of the population mean and variance accumulated during training via exponential moving averages:

:<math>\mu_{\mathrm{running}} \leftarrow (1 - \alpha)\, \mu_{\mathrm{running}} + \alpha\, \mu_{\mathcal{B}}</math>

:<math>\sigma^2_{\mathrm{running}} \leftarrow (1 - \alpha)\, \sigma^2_{\mathrm{running}} + \alpha\, \sigma^2_{\mathcal{B}}</math>

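where <math>\alpha</math> is the momentum parameter (typically 0.1 under this convention; some frameworks instead define momentum as the weight on the previous running value, in which case a value such as 0.99 plays the same role). Because these statistics are fixed once training ends, the output for a given input no longer depends on the other examples in the batch, so inference is deterministic.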
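A minimal NumPy sketch of the running-average update and the inference-time transform is shown below. The helper names <code>update_running</code> and <code>batchnorm_infer</code>, the initial values of the running statistics, and the simulated training batches are all assumptions made for illustration.

```python
import numpy as np

def update_running(running, batch_stat, alpha=0.1):
    """One exponential-moving-average step with momentum alpha."""
    return (1.0 - alpha) * running + alpha * batch_stat

def batchnorm_infer(x, running_mean, running_var, gamma, beta, eps=1e-5):
    """Inference-mode BatchNorm: uses fixed running statistics, not batch statistics."""
    x_hat = (x - running_mean) / np.sqrt(running_var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
running_mean, running_var = np.zeros(3), np.ones(3)   # assumed initial values

# Simulated training steps: each batch is drawn with mean 5 and variance 4.
for _ in range(200):
    batch = rng.normal(loc=5.0, scale=2.0, size=(32, 3))
    running_mean = update_running(running_mean, batch.mean(axis=0))
    running_var = update_running(running_var, batch.var(axis=0))

gamma, beta = np.ones(3), np.zeros(3)
single = np.array([[5.0, 7.0, 3.0]])                  # a "batch" of one example

# The same input always gives the same output, even for a single example.
out_a = batchnorm_infer(single, running_mean, running_var, gamma, beta)
out_b = batchnorm_infer(single, running_mean, running_var, gamma, beta)
```

After enough steps the running estimates settle near the population mean and variance of the simulated data, which is what makes a single-example forward pass well defined.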

== Benefits ==

* '''Higher learning rates''': By constraining activation distributions, BatchNorm allows larger step sizes without divergence.
* '''Reduced sensitivity to initialization''': Networks with BatchNorm are more forgiving of poor weight initialization.
* '''Regularization effect''': The noise introduced by mini-batch statistics acts as a mild regularizer, sometimes reducing the need for [[Dropout]].
* '''Faster convergence''': Training typically requires fewer epochs to reach a given level of performance.

== Placement ==

BatchNorm is typically applied '''before''' the activation function (as in the original paper), though some practitioners place it '''after''' the activation. For convolutional layers, normalization is performed per-channel across the spatial dimensions and the batch dimension.
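For the convolutional case, the per-channel reduction can be sketched as follows. The example assumes an (N, C, H, W) memory layout and uses ReLU as the activation; both are illustrative choices, and <code>batchnorm2d_train</code> is a hypothetical name.

```python
import numpy as np

def batchnorm2d_train(x, gamma, beta, eps=1e-5):
    """Training-mode BatchNorm for an (N, C, H, W) array, one statistic per channel."""
    # Reduce over batch (0) and spatial (2, 3) axes; keep the channel axis.
    mu = x.mean(axis=(0, 2, 3), keepdims=True)    # shape (1, C, 1, 1)
    var = x.var(axis=(0, 2, 3), keepdims=True)    # shape (1, C, 1, 1)
    x_hat = (x - mu) / np.sqrt(var + eps)
    # gamma and beta hold one value per channel and broadcast over N, H, W.
    return gamma.reshape(1, -1, 1, 1) * x_hat + beta.reshape(1, -1, 1, 1)

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=3.0, size=(8, 3, 5, 5))

y = batchnorm2d_train(x, gamma=np.ones(3), beta=np.zeros(3))
per_channel_mean = y.mean(axis=(0, 2, 3))   # close to 0 for every channel
per_channel_var = y.var(axis=(0, 2, 3))     # close to 1 for every channel

# Pre-activation placement: normalize first, then apply the nonlinearity.
activated = np.maximum(y, 0.0)              # ReLU
```

Placing BatchNorm after the activation would simply swap the last two operations; the sketch does not argue for either ordering.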

== Normalization Alternatives ==

{| class="wikitable"
|-
! Method !! Normalizes over !! Use case
|-
| '''Batch Norm''' || Batch and spatial dims, per channel || CNNs with large batches
|-
| '''Layer Norm''' || All channels and spatial dims, per sample || Transformers, RNNs, small batches
|-
| '''Instance Norm''' || Spatial dims only, per sample per channel || Style transfer, image generation
|-
| '''Group Norm''' || Groups of channels, per sample || Object detection, small-batch training
|}

'''Layer normalization''' (Ba et al., 2016) normalizes across all features within a single sample, making it independent of batch size. It is the standard choice in Transformer architectures.

'''Group normalization''' (Wu and He, 2018) divides channels into groups and normalizes within each group per sample. It interpolates between Layer Norm (a single group containing all channels) and Instance Norm (one channel per group), and it performs well when batch sizes are too small for reliable batch statistics.
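The methods in the table differ mainly in which axes the mean and variance are computed over. The sketch below makes that concrete for an (N, C, H, W) array; the layout, the helper names <code>normalize</code> and <code>group_norm</code>, and the omission of the learned scale and shift are simplifying assumptions.

```python
import numpy as np

def normalize(x, axes, eps=1e-5):
    """Subtract the mean and divide by the standard deviation over the given axes."""
    mu = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def group_norm(x, num_groups, eps=1e-5):
    """Group Norm: split channels into groups, normalize each group per sample."""
    n, c, h, w = x.shape
    grouped = x.reshape(n, num_groups, c // num_groups, h, w)
    return normalize(grouped, axes=(2, 3, 4), eps=eps).reshape(n, c, h, w)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 6, 5, 5))            # (N, C, H, W)

batch_norm = normalize(x, axes=(0, 2, 3))    # batch and spatial dims, per channel
layer_norm = normalize(x, axes=(1, 2, 3))    # channels and spatial dims, per sample
instance_norm = normalize(x, axes=(2, 3))    # spatial dims, per sample per channel

# The two extremes of Group Norm coincide with Layer Norm and Instance Norm.
gn_one_group = group_norm(x, num_groups=1)
gn_per_channel = group_norm(x, num_groups=6)
```

Only <code>batch_norm</code> reduces over axis 0, which is why it is the only variant whose output for one sample depends on the rest of the batch.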

== Limitations ==

* Performance degrades with very small batch sizes, as batch statistics become noisy.
* Introduces a discrepancy between training (batch statistics) and inference (running statistics) behavior.
* Not directly applicable to variable-length sequences without padding or masking.
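* In distributed training across multiple devices, each device computes batch statistics over its local shard of the batch unless the statistics are explicitly synchronized, so both the effective batch size and the running statistics require careful handling.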
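The first limitation can be illustrated with a small simulation. The sketch below draws many mini-batches from a standard normal distribution (an assumed toy setting, not data from any real model) and measures how much the mini-batch mean fluctuates for a small and a large batch size.

```python
import numpy as np

rng = np.random.default_rng(0)

def batch_mean_spread(batch_size, trials=2000):
    """Standard deviation of the mini-batch mean across many simulated batches."""
    batches = rng.normal(loc=0.0, scale=1.0, size=(trials, batch_size))
    return batches.mean(axis=1).std()

# For unit-variance data the spread of the batch mean shrinks as 1/sqrt(m).
spread_small = batch_mean_spread(2)      # about 1/sqrt(2)
spread_large = batch_mean_spread(256)    # about 1/sqrt(256)
```

With a batch size of 2, the estimated mean varies roughly an order of magnitude more from batch to batch than with a batch size of 256, which is the noise the first bullet refers to.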

== See also ==

* [[Neural Networks]]
* [[Backpropagation]]
* [[Dropout]]
* [[Stochastic Gradient Descent]]
* [[Transformer]]

== References ==

* Ioffe, S. and Szegedy, C. (2015). "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift". ''ICML''.
* Ba, J. L., Kiros, J. R. and Hinton, G. E. (2016). "Layer Normalization". ''arXiv:1607.06450''.
* Wu, Y. and He, K. (2018). "Group Normalization". ''ECCV''.
* Santurkar, S. et al. (2018). "How Does Batch Normalization Help Optimization?". ''NeurIPS''.

[[Category:Deep Learning]]
[[Category:Intermediate]]
[[Category:Neural Networks]]

Batch Normalization - Revision history
