Batch Normalization/en - Revision history

FuzzyBot: Updating to match new version of source page

2026-04-27T23:36:09Z

Updating to match new version of source page

FuzzyBot: Updating to match new version of source page

2026-04-27T21:57:40Z

Updating to match new version of source page

FuzzyBot: Updating to match new version of source page

2026-04-27T19:42:16Z

Updating to match new version of source page

FuzzyBot: Updating to match new version of source page

2026-04-27T02:37:18Z

Updating to match new version of source page

← Older revision		Revision as of 02:37, 27 April 2026
Line 1:		Line 1:
	<languages />		<languages />
	~~{{LanguageBar \| page = Batch Normalization}}~~
	{{ArticleInfobox \| topic_area = Deep Learning \| difficulty = Intermediate \| prerequisites = [[Neural Networks]], [[Backpropagation]]}}		{{ArticleInfobox \| topic_area = Deep Learning \| difficulty = Intermediate \| prerequisites = [[Neural Networks]], [[Backpropagation]]}}
	{{ContentMeta \| generated_by = claude-opus \| model_used = claude-opus-4-6 \| generated_date = 2026-03-13}}		{{ContentMeta \| generated_by = claude-opus \| model_used = claude-opus-4-6 \| generated_date = 2026-03-13}}

FuzzyBot: Updating to match new version of source page

2026-04-27T00:30:52Z

Updating to match new version of source page

New page

<languages />
{{LanguageBar | page = Batch Normalization}}
{{ArticleInfobox | topic_area = Deep Learning | difficulty = Intermediate | prerequisites = [[Neural Networks]], [[Backpropagation]]}}
{{ContentMeta | generated_by = claude-opus | model_used = claude-opus-4-6 | generated_date = 2026-03-13}}

'''Batch normalization''' (often abbreviated '''BatchNorm''' or '''BN''') is a technique for improving the training speed, stability, and performance of deep neural networks by normalizing the inputs to each layer using statistics computed over a mini-batch. Introduced by Ioffe and Szegedy in 2015, it has become a widely used component of deep learning architectures, particularly convolutional networks.

== Internal Covariate Shift ==

The original motivation for batch normalization was to address '''internal covariate shift''' — the phenomenon where the distribution of each layer's inputs changes during training as the parameters of preceding layers are updated. This shifting distribution forces each layer to continuously adapt, slowing down convergence and requiring careful initialization and small learning rates.

While the precise role of internal covariate shift has been debated (Santurkar et al., 2018, argued that BatchNorm's benefits stem more from smoothing the loss landscape), the practical effectiveness of the technique is well established.

== The Batch Normalization Algorithm ==

=== During Training ===

For a mini-batch <math>\mathcal{B} = \{x_1, \dots, x_m\}</math> of activations at a given layer, BatchNorm proceeds as follows:

'''Step 1.''' Compute the mini-batch mean and variance:

:<math>\mu_{\mathcal{B}} = \frac{1}{m} \sum_{i=1}^{m} x_i, \qquad \sigma_{\mathcal{B}}^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_{\mathcal{B}})^2</math>

'''Step 2.''' Normalize:

:<math>\hat{x}_i = \frac{x_i - \mu_{\mathcal{B}}}{\sqrt{\sigma_{\mathcal{B}}^2 + \epsilon}}</math>

where <math>\epsilon</math> is a small constant (e.g., <math>10^{-5}</math>) for numerical stability.

'''Step 3.''' Scale and shift with learned parameters <math>\gamma</math> and <math>\beta</math>:

:<math>y_i = \gamma \hat{x}_i + \beta</math>

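The parameters <math>\gamma</math> and <math>\beta</math> are learned during training. They restore the network's ability to represent the identity transformation if that is optimal, ensuring that normalization does not reduce the model's expressiveness: setting <math>\gamma = \sqrt{\sigma_{\mathcal{B}}^2 + \epsilon}</math> and <math>\beta = \mu_{\mathcal{B}}</math> gives <math>y_i = x_i</math>.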
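
The three steps above can be sketched in plain Python for a mini-batch of scalar activations. This is a minimal illustration rather than a reference implementation; the function name <code>batch_norm_train</code> and its default arguments are assumptions made for this example.

```python
import math

def batch_norm_train(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Apply training-mode batch normalization to a list of scalars.

    Hypothetical helper for illustration; returns the outputs together
    with the mini-batch mean and (biased) variance.
    """
    m = len(batch)
    # Step 1: mini-batch mean and variance (variance divides by m, not m - 1).
    mean = sum(batch) / m
    var = sum((x - mean) ** 2 for x in batch) / m
    # Step 2: normalize each activation.
    x_hat = [(x - mean) / math.sqrt(var + eps) for x in batch]
    # Step 3: scale and shift with gamma and beta.
    y = [gamma * xh + beta for xh in x_hat]
    return y, mean, var

batch = [1.0, 2.0, 3.0, 4.0]
y, mean, var = batch_norm_train(batch)  # mean is 2.5, var is 1.25

# Choosing gamma = sqrt(var + eps) and beta = mean undoes the normalization.
identity, _, _ = batch_norm_train(batch, gamma=math.sqrt(var + 1e-5), beta=mean)
```

With the default <code>gamma</code> and <code>beta</code>, the outputs have mean close to zero and variance slightly below one (because of <code>eps</code>); with the second choice of parameters, <code>identity</code> matches <code>batch</code> up to floating-point error.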

=== During Inference ===

At inference time, statistics from individual mini-batches are unreliable (the input may be a single example). Instead, BatchNorm uses running estimates of the population mean and variance accumulated during training via exponential moving averages:

:<math>\mu_{\mathrm{running}} \leftarrow (1 - \alpha)\, \mu_{\mathrm{running}} + \alpha\, \mu_{\mathcal{B}}</math>

:<math>\sigma^2_{\mathrm{running}} \leftarrow (1 - \alpha)\, \sigma^2_{\mathrm{running}} + \alpha\, \sigma^2_{\mathcal{B}}</math>

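where <math>\alpha</math> is the momentum parameter (typically 0.1), here the weight given to the newest batch statistic; some libraries instead define momentum as the weight on the previous running value, in which case the corresponding setting is 0.9. Because these statistics are fixed once training ends, the output for a given input no longer depends on the other examples in the batch.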
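
The running-average update and its use at inference can be sketched as follows. The class name <code>RunningBatchNorm</code>, the initial values of 0 and 1 for the running mean and variance, and the fixed <code>gamma</code> and <code>beta</code> are assumptions made for this example; the update itself follows the two formulas above, whereas real libraries may differ in details such as using an unbiased variance estimate.

```python
import math

class RunningBatchNorm:
    """Hypothetical scalar BatchNorm that tracks running statistics."""

    def __init__(self, momentum=0.1, eps=1e-5):
        self.momentum = momentum
        self.eps = eps
        self.running_mean = 0.0  # assumed initial value
        self.running_var = 1.0   # assumed initial value
        self.gamma = 1.0         # kept fixed here; learned in practice
        self.beta = 0.0

    def update(self, batch):
        """Fold one training mini-batch into the running estimates."""
        m = len(batch)
        mean = sum(batch) / m
        var = sum((x - mean) ** 2 for x in batch) / m
        a = self.momentum
        self.running_mean = (1 - a) * self.running_mean + a * mean
        self.running_var = (1 - a) * self.running_var + a * var

    def infer(self, x):
        """Normalize a single example using the fixed running statistics."""
        x_hat = (x - self.running_mean) / math.sqrt(self.running_var + self.eps)
        return self.gamma * x_hat + self.beta

bn = RunningBatchNorm()
bn.update([1.0, 2.0, 3.0, 4.0])  # batch mean 2.5, batch variance 1.25
single = bn.infer(3.0)           # works for a single example
```

After this one update the running mean is 0.25 and the running variance is 1.025, so a single example can be normalized without any batch.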

== Benefits ==

* '''Higher learning rates''': By constraining activation distributions, BatchNorm allows larger step sizes without divergence.
* '''Reduced sensitivity to initialization''': Networks with BatchNorm are more forgiving of poor weight initialization.
* '''Regularization effect''': The noise introduced by mini-batch statistics acts as a mild regularizer, sometimes reducing the need for [[Dropout]].
* '''Faster convergence''': Training typically requires fewer epochs to reach a given level of performance.

== Placement ==

BatchNorm is typically applied '''before''' the activation function (as in the original paper), though some practitioners place it '''after''' the activation. For convolutional layers, statistics are computed per channel over the batch and spatial dimensions, so each channel has a single mean, variance, <math>\gamma</math>, and <math>\beta</math> shared across all spatial positions.
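
For the convolutional case, the per-channel reduction can be sketched as below. The input is assumed to be a nested list indexed as <code>[N][C][H][W]</code> (batch, channel, height, width); that layout and the helper name <code>channel_stats</code> are assumptions made for this example.

```python
def channel_stats(x):
    """Return per-channel means and variances for input shaped [N][C][H][W].

    Hypothetical helper: each channel pools values over the batch
    dimension and both spatial dimensions.
    """
    num_channels = len(x[0])
    means, variances = [], []
    for c in range(num_channels):
        # Gather every value of channel c across all samples and positions.
        values = [v for sample in x for row in sample[c] for v in row]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        means.append(mean)
        variances.append(var)
    return means, variances

# Two samples, two channels, 1x2 feature maps.
x = [
    [[[1.0, 3.0]], [[10.0, 10.0]]],
    [[[5.0, 7.0]], [[20.0, 20.0]]],
]
means, variances = channel_stats(x)  # means [4.0, 15.0], variances [5.0, 25.0]
```

Each channel contributes one mean and one variance regardless of how many spatial positions it has, which is why the number of BatchNorm parameters in a convolutional layer scales with the channel count only.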

== Normalization Alternatives ==

{| class="wikitable"
|-
! Method !! Normalizes over !! Use case
|-
| '''Batch Norm''' || Batch and spatial dims, per channel || CNNs with large batches
|-
| '''Layer Norm''' || All channels and spatial dims, per sample || Transformers, RNNs, small batches
|-
| '''Instance Norm''' || Spatial dims only, per sample per channel || Style transfer, image generation
|-
| '''Group Norm''' || Groups of channels, per sample || Object detection, small-batch training
|}

'''Layer normalization''' (Ba et al., 2016) normalizes across all features within a single sample, making it independent of batch size. It is the standard choice in Transformer architectures.

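'''Group normalization''' (Wu and He, 2018) divides channels into groups and normalizes within each group per sample. It interpolates between Layer Norm and Instance Norm: with a single group it normalizes over all channels of a sample, as Layer Norm does, and with one group per channel it normalizes each channel separately, as Instance Norm does. It performs well when batch sizes are too small for reliable batch statistics.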
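
The relationship between these per-sample methods can be sketched with a single function that varies only the number of groups. The helper <code>group_norm</code> below is a hypothetical illustration that omits the learned scale and shift and represents one sample as a list of channels, each a flat list of spatial values.

```python
import math

def group_norm(sample, num_groups, eps=1e-5):
    """Normalize one sample (list of C channels) within channel groups.

    Hypothetical helper without learned parameters. No batch dimension
    is involved, so the result does not depend on batch size.
    """
    num_channels = len(sample)
    if num_channels % num_groups != 0:
        raise ValueError("channels must divide evenly into groups")
    size = num_channels // num_groups
    out = []
    for g in range(num_groups):
        channels = sample[g * size:(g + 1) * size]
        # Statistics pool all channels and positions in this group.
        values = [v for ch in channels for v in ch]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        scale = 1.0 / math.sqrt(var + eps)
        out.extend([[(v - mean) * scale for v in ch] for ch in channels])
    return out

sample = [[1.0, 3.0], [5.0, 7.0], [10.0, 30.0], [50.0, 70.0]]
layer_like = group_norm(sample, num_groups=1)     # one group: all channels together
grouped = group_norm(sample, num_groups=2)        # two groups of two channels
instance_like = group_norm(sample, num_groups=4)  # one group per channel
```

With four groups every channel is centered on its own mean, whereas with one group the small-valued first channel ends up entirely below zero because it is compared against the mean of the whole sample.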

== Limitations ==

* Performance degrades with very small batch sizes, as batch statistics become noisy.
* Introduces a discrepancy between training (batch statistics) and inference (running statistics) behavior.
* Not directly applicable to variable-length sequences without padding or masking.
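* In distributed training across multiple devices, batch and running statistics require careful handling: each device typically computes statistics over its local mini-batch only, so the effective batch size for normalization is the per-device one unless statistics are synchronized across devices.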
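
The first two limitations can be illustrated with a small sketch: in training mode the normalized value of one example depends on which other examples share its mini-batch, while in inference mode it depends only on fixed statistics. The helper names and the running statistics (mean 5.0, variance 4.0) are hypothetical values chosen for this example.

```python
import math

def batch_stats(batch):
    """Return the mean and biased variance of a list of scalars."""
    m = len(batch)
    mean = sum(batch) / m
    var = sum((v - mean) ** 2 for v in batch) / m
    return mean, var

def normalize(x, mean, var, eps=1e-5):
    """Normalize a single value with the given statistics."""
    return (x - mean) / math.sqrt(var + eps)

x = 1.0
# Training mode: the same example lands in two different small batches.
out_a = normalize(x, *batch_stats([1.0, 2.0, 3.0]))
out_b = normalize(x, *batch_stats([1.0, 2.0, 9.0]))

# Inference mode: hypothetical fixed running statistics.
running_mean, running_var = 5.0, 4.0
out_inference = normalize(x, running_mean, running_var)
```

Here <code>out_a</code> and <code>out_b</code> differ even though the input value is identical, and both differ from <code>out_inference</code>; with only three examples per batch, a single outlier such as 9.0 shifts the statistics substantially.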

== See also ==

* [[Neural Networks]]
* [[Backpropagation]]
* [[Dropout]]
* [[Stochastic Gradient Descent]]
* [[Transformer]]

== References ==

* Ioffe, S. and Szegedy, C. (2015). "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift". ''ICML''.
* Ba, J. L., Kiros, J. R. and Hinton, G. E. (2016). "Layer Normalization". ''arXiv:1607.06450''.
* Wu, Y. and He, K. (2018). "Group Normalization". ''ECCV''.
* Santurkar, S., Tsipras, D., Ilyas, A. and Madry, A. (2018). "How Does Batch Normalization Help Optimization?". ''NeurIPS''.

[[Category:Deep Learning]]
[[Category:Intermediate]]
[[Category:Neural Networks]]

← Older revision		Revision as of 23:36, 27 April 2026
Line 3:		Line 3:
	{{ContentMeta \| generated_by = claude-opus \| model_used = claude-opus-4-6 \| generated_date = 2026-03-13}}		{{ContentMeta \| generated_by = claude-opus \| model_used = claude-opus-4-6 \| generated_date = 2026-03-13}}

	'''Batch normalization''' (often abbreviated '''BatchNorm''' or '''BN''') is a technique for improving the speed, stability, and performance of deep neural networks by normalizing the inputs to each layer. Introduced by Ioffe and Szegedy in 2015, it has become a standard component in most modern deep learning architectures.		'''Batch normalization''' (often abbreviated '''BatchNorm''' or '''BN''') is a technique for improving the speed, stability, and performance of deep neural networks by normalizing the inputs to each layer. Introduced by Ioffe and Szegedy in 2015, it has become a standard component in most modern {{Term\|deep learning}} architectures.

	== Internal Covariate Shift ==		== Internal Covariate Shift ==

	The original motivation for batch normalization was to address '''internal covariate shift''' — the phenomenon where the distribution of each layer's inputs changes during training as the parameters of preceding layers are updated. This shifting distribution forces each layer to continuously adapt, slowing down convergence and requiring careful initialization and small learning rates.		The original motivation for batch normalization was to address '''internal covariate shift''' — the phenomenon where the distribution of each layer's inputs changes during training as the parameters of preceding layers are updated. This shifting distribution forces each layer to continuously adapt, slowing down {{Term\|convergence}} and requiring careful initialization and small {{Term\|learning rate\|learning rates}}.

	While the precise role of internal covariate shift has been debated (Santurkar et al., 2018, argued that BatchNorm's benefits stem more from smoothing the loss landscape), the practical effectiveness of the technique is well established.		While the precise role of internal covariate shift has been debated (Santurkar et al., 2018, argued that BatchNorm's benefits stem more from smoothing the loss landscape), the practical effectiveness of the technique is well established.
Line 15:		Line 15:
	=== During Training ===		=== During Training ===

	For a mini-batch <math>\mathcal{B} = \{x_1, \dots, x_m\}</math> of activations at a given layer, BatchNorm proceeds as follows:		For a {{Term\|mini-batch}} <math>\mathcal{B} = \{x_1, \dots, x_m\}</math> of {{Term\|activation function\|activations}} at a given layer, BatchNorm proceeds as follows:

	'''Step 1.''' Compute the mini-batch mean and variance:		'''Step 1.''' Compute the {{Term\|mini-batch}} mean and variance:

	:<math>\mu_{\mathcal{B}} = \frac{1}{m} \sum_{i=1}^{m} x_i, \qquad \sigma_{\mathcal{B}}^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_{\mathcal{B}})^2</math>		:<math>\mu_{\mathcal{B}} = \frac{1}{m} \sum_{i=1}^{m} x_i, \qquad \sigma_{\mathcal{B}}^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_{\mathcal{B}})^2</math>
Line 41:		Line 41:
	:<math>\sigma^2_{\mathrm{running}} \leftarrow (1 - \alpha)\, \sigma^2_{\mathrm{running}} + \alpha\, \sigma^2_{\mathcal{B}}</math>		:<math>\sigma^2_{\mathrm{running}} \leftarrow (1 - \alpha)\, \sigma^2_{\mathrm{running}} + \alpha\, \sigma^2_{\mathcal{B}}</math>

	where <math>\alpha</math> is the momentum parameter (typically 0.1). These fixed statistics ensure deterministic outputs at inference.		where <math>\alpha</math> is the {{Term\|momentum}} parameter (typically 0.1). These fixed statistics ensure deterministic outputs at inference.

	== Benefits ==		== Benefits ==

	* '''Higher learning rates''': By constraining activation distributions, BatchNorm allows larger step sizes without divergence.		* '''Higher {{Term\|learning rate\|learning rates}}''': By constraining {{Term\|activation function\|activation}} distributions, BatchNorm allows larger {{Term\|learning rate\|step sizes}} without divergence.
	* '''Reduced sensitivity to initialization''': Networks with BatchNorm are more forgiving of poor weight initialization.		* '''Reduced sensitivity to initialization''': Networks with BatchNorm are more forgiving of poor weight initialization.
	* '''~~Regularization~~ effect''': The noise introduced by mini-batch statistics acts as a mild regularizer, sometimes reducing the need for [[Dropout]].		* '''{{Term\|regularization}} effect''': The noise introduced by {{Term\|mini-batch}} statistics acts as a mild regularizer, sometimes reducing the need for [[Dropout]].
	* '''Faster convergence''': Training typically requires fewer epochs to reach a given level of performance.		* '''Faster {{Term\|convergence}}''': Training typically requires fewer {{Term\|epoch\|epochs}} to reach a given level of performance.

	== Placement ==		== Placement ==

	BatchNorm is typically applied '''before''' the activation function (as in the original paper), though some practitioners place it '''after''' the activation. For convolutional layers, normalization is performed per-channel across the spatial dimensions and the batch dimension.		BatchNorm is typically applied '''before''' the {{Term\|activation function}} (as in the original paper), though some practitioners place it '''after''' the {{Term\|activation function\|activation}}. For {{Term\|convolution\|convolutional layers}}, normalization is performed per-channel across the spatial dimensions and the batch dimension.

	== Normalization Alternatives ==		== Normalization Alternatives ==
Line 62:		Line 62:
	\| '''Batch Norm''' \|\| Batch and spatial dims, per channel \|\| CNNs with large batches		\| '''Batch Norm''' \|\| Batch and spatial dims, per channel \|\| CNNs with large batches
	\|-		\|-
	\| '''Layer Norm''' \|\| All channels and spatial dims, per sample \|\| Transformers, RNNs, small batches		\| '''Layer Norm''' \|\| All channels and spatial dims, per sample \|\| {{Term\|transformer\|Transformers}}, RNNs, small batches
	\|-		\|-
	\| '''Instance Norm''' \|\| Spatial dims only, per sample per channel \|\| Style transfer, image generation		\| '''Instance Norm''' \|\| Spatial dims only, per sample per channel \|\| Style transfer, image generation
Line 69:		Line 69:
	\|}		\|}

	'''~~Layer~~ normalization''' (Ba et al., 2016) normalizes across all features within a single sample, making it independent of batch size. It is the standard choice in ~~Transformer~~ architectures.		'''{{Term\|layer normalization}}''' (Ba et al., 2016) normalizes across all features within a single sample, making it independent of batch size. It is the standard choice in {{Term\|transformer}} architectures.

	'''Group normalization''' (Wu and He, 2018) divides channels into groups and normalizes within each group per sample. It bridges the gap between Layer Norm and Instance Norm and performs well when batch sizes are too small for reliable batch statistics.		'''Group normalization''' (Wu and He, 2018) divides channels into groups and normalizes within each group per sample. It bridges the gap between Layer Norm and Instance Norm and performs well when batch sizes are too small for reliable batch statistics.

← Older revision		Revision as of 21:57, 27 April 2026
Line 3:		Line 3:
	{{ContentMeta \| generated_by = claude-opus \| model_used = claude-opus-4-6 \| generated_date = 2026-03-13}}		{{ContentMeta \| generated_by = claude-opus \| model_used = claude-opus-4-6 \| generated_date = 2026-03-13}}

	'''~~{{Term\|batch~~ normalization}}''' (often abbreviated '''BatchNorm''' or '''BN''') is a technique for improving the speed, stability, and performance of deep neural networks by normalizing the inputs to each layer. Introduced by Ioffe and Szegedy in 2015, it has become a standard component in most modern ~~{{Term\|~~deep learning}} architectures.		'''Batch normalization''' (often abbreviated '''BatchNorm''' or '''BN''') is a technique for improving the speed, stability, and performance of deep neural networks by normalizing the inputs to each layer. Introduced by Ioffe and Szegedy in 2015, it has become a standard component in most modern deep learning architectures.

	== Internal Covariate Shift ==		== Internal Covariate Shift ==

	The original motivation for ~~{{Term\|~~batch normalization}} was to address '''internal covariate shift''' — the phenomenon where the distribution of each layer's inputs changes during training as the parameters of preceding layers are updated. This shifting distribution forces each layer to continuously adapt, slowing down ~~{{Term\|~~convergence}} and requiring careful initialization and small ~~{{Term\|learning rate\|~~learning rates}}.		The original motivation for batch normalization was to address '''internal covariate shift''' — the phenomenon where the distribution of each layer's inputs changes during training as the parameters of preceding layers are updated. This shifting distribution forces each layer to continuously adapt, slowing down convergence and requiring careful initialization and small learning rates.

	While the precise role of internal covariate shift has been debated (Santurkar et al., 2018, argued that BatchNorm's benefits stem more from smoothing the loss landscape), the practical effectiveness of the technique is well established.		While the precise role of internal covariate shift has been debated (Santurkar et al., 2018, argued that BatchNorm's benefits stem more from smoothing the loss landscape), the practical effectiveness of the technique is well established.
Line 15:		Line 15:
	=== During Training ===		=== During Training ===

	For a ~~{{Term\|~~mini-batch}} <math>\mathcal{B} = \{x_1, \dots, x_m\}</math> of ~~{{Term\|activation function\|~~activations}} at a given layer, BatchNorm proceeds as follows:		For a mini-batch <math>\mathcal{B} = \{x_1, \dots, x_m\}</math> of activations at a given layer, BatchNorm proceeds as follows:

	'''Step 1.''' Compute the ~~{{Term\|~~mini-batch}} mean and variance:		'''Step 1.''' Compute the mini-batch mean and variance:

	:<math>\mu_{\mathcal{B}} = \frac{1}{m} \sum_{i=1}^{m} x_i, \qquad \sigma_{\mathcal{B}}^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_{\mathcal{B}})^2</math>		:<math>\mu_{\mathcal{B}} = \frac{1}{m} \sum_{i=1}^{m} x_i, \qquad \sigma_{\mathcal{B}}^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_{\mathcal{B}})^2</math>
Line 41:		Line 41:
	:<math>\sigma^2_{\mathrm{running}} \leftarrow (1 - \alpha)\, \sigma^2_{\mathrm{running}} + \alpha\, \sigma^2_{\mathcal{B}}</math>		:<math>\sigma^2_{\mathrm{running}} \leftarrow (1 - \alpha)\, \sigma^2_{\mathrm{running}} + \alpha\, \sigma^2_{\mathcal{B}}</math>

	where <math>\alpha</math> is the ~~{{Term\|~~momentum}} parameter (typically 0.1). These fixed statistics ensure deterministic outputs at inference.		where <math>\alpha</math> is the momentum parameter (typically 0.1). These fixed statistics ensure deterministic outputs at inference.

	== Benefits ==		== Benefits ==

	* '''Higher ~~{{Term\|learning rate\|~~learning rates}}''': By constraining ~~{{Term\|~~activation ~~function\|activation}}~~ distributions, BatchNorm allows larger ~~{{Term\|learning rate\|~~step sizes}} without divergence.		* '''Higher learning rates''': By constraining activation distributions, BatchNorm allows larger step sizes without divergence.
	* '''Reduced sensitivity to initialization''': Networks with BatchNorm are more forgiving of poor weight initialization.		* '''Reduced sensitivity to initialization''': Networks with BatchNorm are more forgiving of poor weight initialization.
	* '''~~{{Term\|regularization}}~~ effect''': The noise introduced by ~~{{Term\|~~mini-batch}} statistics acts as a mild regularizer, sometimes reducing the need for [[Dropout]].		* '''Regularization effect''': The noise introduced by mini-batch statistics acts as a mild regularizer, sometimes reducing the need for [[Dropout]].
	* '''Faster ~~{{Term\|~~convergence}}''': Training typically requires fewer ~~{{Term\|epoch\|~~epochs}} to reach a given level of performance.		* '''Faster convergence''': Training typically requires fewer epochs to reach a given level of performance.

	== Placement ==		== Placement ==

	BatchNorm is typically applied '''before''' the ~~{{Term\|~~activation function}} (as in the original paper), though some practitioners place it '''after''' the ~~{{Term\|~~activation ~~function\|activation}}~~. For ~~{{Term\|convolution\|~~convolutional layers}}, normalization is performed per-channel across the spatial dimensions and the batch dimension.		BatchNorm is typically applied '''before''' the activation function (as in the original paper), though some practitioners place it '''after''' the activation. For convolutional layers, normalization is performed per-channel across the spatial dimensions and the batch dimension.

	== Normalization Alternatives ==		== Normalization Alternatives ==
Line 62:		Line 62:
	\| '''Batch Norm''' \|\| Batch and spatial dims, per channel \|\| CNNs with large batches		\| '''Batch Norm''' \|\| Batch and spatial dims, per channel \|\| CNNs with large batches
	\|-		\|-
	\| '''Layer Norm''' \|\| All channels and spatial dims, per sample \|~~\| {{Term\|transformer~~\|Transformers}}, RNNs, small batches		\| '''Layer Norm''' \|\| All channels and spatial dims, per sample \|\| Transformers, RNNs, small batches
	\|-		\|-
	\| '''Instance Norm''' \|\| Spatial dims only, per sample per channel \|\| Style transfer, image generation		\| '''Instance Norm''' \|\| Spatial dims only, per sample per channel \|\| Style transfer, image generation
Line 69:		Line 69:
	\|}		\|}

	'''~~{{Term\|layer~~ normalization}}''' (Ba et al., 2016) normalizes across all features within a single sample, making it independent of batch size. It is the standard choice in ~~{{Term\|transformer}}~~ architectures.		'''Layer normalization''' (Ba et al., 2016) normalizes across all features within a single sample, making it independent of batch size. It is the standard choice in Transformer architectures.

	'''Group normalization''' (Wu and He, 2018) divides channels into groups and normalizes within each group per sample. It bridges the gap between Layer Norm and Instance Norm and performs well when batch sizes are too small for reliable batch statistics.		'''Group normalization''' (Wu and He, 2018) divides channels into groups and normalizes within each group per sample. It bridges the gap between Layer Norm and Instance Norm and performs well when batch sizes are too small for reliable batch statistics.

← Older revision		Revision as of 19:42, 27 April 2026
Line 3:		Line 3:
	{{ContentMeta \| generated_by = claude-opus \| model_used = claude-opus-4-6 \| generated_date = 2026-03-13}}		{{ContentMeta \| generated_by = claude-opus \| model_used = claude-opus-4-6 \| generated_date = 2026-03-13}}

	'''~~Batch~~ normalization''' (often abbreviated '''BatchNorm''' or '''BN''') is a technique for improving the speed, stability, and performance of deep neural networks by normalizing the inputs to each layer. Introduced by Ioffe and Szegedy in 2015, it has become a standard component in most modern deep learning architectures.		'''{{Term\|batch normalization}}''' (often abbreviated '''BatchNorm''' or '''BN''') is a technique for improving the speed, stability, and performance of deep neural networks by normalizing the inputs to each layer. Introduced by Ioffe and Szegedy in 2015, it has become a standard component in most modern {{Term\|deep learning}} architectures.

	== Internal Covariate Shift ==		== Internal Covariate Shift ==

	The original motivation for batch normalization was to address '''internal covariate shift''' — the phenomenon where the distribution of each layer's inputs changes during training as the parameters of preceding layers are updated. This shifting distribution forces each layer to continuously adapt, slowing down convergence and requiring careful initialization and small learning rates.		The original motivation for {{Term\|batch normalization}} was to address '''internal covariate shift''' — the phenomenon where the distribution of each layer's inputs changes during training as the parameters of preceding layers are updated. This shifting distribution forces each layer to continuously adapt, slowing down {{Term\|convergence}} and requiring careful initialization and small {{Term\|learning rate\|learning rates}}.

	While the precise role of internal covariate shift has been debated (Santurkar et al., 2018, argued that BatchNorm's benefits stem more from smoothing the loss landscape), the practical effectiveness of the technique is well established.		While the precise role of internal covariate shift has been debated (Santurkar et al., 2018, argued that BatchNorm's benefits stem more from smoothing the loss landscape), the practical effectiveness of the technique is well established.
Line 15:		Line 15:
	=== During Training ===		=== During Training ===

	For a mini-batch <math>\mathcal{B} = \{x_1, \dots, x_m\}</math> of activations at a given layer, BatchNorm proceeds as follows:		For a {{Term\|mini-batch}} <math>\mathcal{B} = \{x_1, \dots, x_m\}</math> of {{Term\|activation function\|activations}} at a given layer, BatchNorm proceeds as follows:

	'''Step 1.''' Compute the mini-batch mean and variance:		'''Step 1.''' Compute the {{Term\|mini-batch}} mean and variance:

	:<math>\mu_{\mathcal{B}} = \frac{1}{m} \sum_{i=1}^{m} x_i, \qquad \sigma_{\mathcal{B}}^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_{\mathcal{B}})^2</math>		:<math>\mu_{\mathcal{B}} = \frac{1}{m} \sum_{i=1}^{m} x_i, \qquad \sigma_{\mathcal{B}}^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_{\mathcal{B}})^2</math>
Line 41:		Line 41:
	:<math>\sigma^2_{\mathrm{running}} \leftarrow (1 - \alpha)\, \sigma^2_{\mathrm{running}} + \alpha\, \sigma^2_{\mathcal{B}}</math>		:<math>\sigma^2_{\mathrm{running}} \leftarrow (1 - \alpha)\, \sigma^2_{\mathrm{running}} + \alpha\, \sigma^2_{\mathcal{B}}</math>

	where <math>\alpha</math> is the momentum parameter (typically 0.1). These fixed statistics ensure deterministic outputs at inference.		where <math>\alpha</math> is the {{Term\|momentum}} parameter (typically 0.1). These fixed statistics ensure deterministic outputs at inference.

	== Benefits ==		== Benefits ==

	* '''Higher learning rates''': By constraining activation distributions, BatchNorm allows larger step sizes without divergence.		* '''Higher {{Term\|learning rate\|learning rates}}''': By constraining {{Term\|activation function\|activation}} distributions, BatchNorm allows larger {{Term\|learning rate\|step sizes}} without divergence.
	* '''Reduced sensitivity to initialization''': Networks with BatchNorm are more forgiving of poor weight initialization.		* '''Reduced sensitivity to initialization''': Networks with BatchNorm are more forgiving of poor weight initialization.
	* '''~~Regularization~~ effect''': The noise introduced by mini-batch statistics acts as a mild regularizer, sometimes reducing the need for [[Dropout]].		* '''{{Term\|regularization}} effect''': The noise introduced by {{Term\|mini-batch}} statistics acts as a mild regularizer, sometimes reducing the need for [[Dropout]].
	* '''Faster convergence''': Training typically requires fewer epochs to reach a given level of performance.		* '''Faster {{Term\|convergence}}''': Training typically requires fewer {{Term\|epoch\|epochs}} to reach a given level of performance.

	== Placement ==		== Placement ==

	BatchNorm is typically applied '''before''' the activation function (as in the original paper), though some practitioners place it '''after''' the activation. For convolutional layers, normalization is performed per-channel across the spatial dimensions and the batch dimension.		BatchNorm is typically applied '''before''' the {{Term\|activation function}} (as in the original paper), though some practitioners place it '''after''' the {{Term\|activation function\|activation}}. For {{Term\|convolution\|convolutional layers}}, normalization is performed per-channel across the spatial dimensions and the batch dimension.

	== Normalization Alternatives ==		== Normalization Alternatives ==
Line 62:		Line 62:
	\| '''Batch Norm''' \|\| Batch and spatial dims, per channel \|\| CNNs with large batches		\| '''Batch Norm''' \|\| Batch and spatial dims, per channel \|\| CNNs with large batches
	\|-		\|-
	\| '''Layer Norm''' \|\| All channels and spatial dims, per sample \|\| Transformers, RNNs, small batches		\| '''Layer Norm''' \|\| All channels and spatial dims, per sample \|\| {{Term\|transformer\|Transformers}}, RNNs, small batches
	\|-		\|-
	\| '''Instance Norm''' \|\| Spatial dims only, per sample per channel \|\| Style transfer, image generation		\| '''Instance Norm''' \|\| Spatial dims only, per sample per channel \|\| Style transfer, image generation
Line 69:		Line 69:
	\|}		\|}

	'''~~Layer~~ normalization''' (Ba et al., 2016) normalizes across all features within a single sample, making it independent of batch size. It is the standard choice in ~~Transformer~~ architectures.		'''{{Term\|layer normalization}}''' (Ba et al., 2016) normalizes across all features within a single sample, making it independent of batch size. It is the standard choice in {{Term\|transformer}} architectures.

	'''Group normalization''' (Wu and He, 2018) divides channels into groups and normalizes within each group per sample. It bridges the gap between Layer Norm and Instance Norm and performs well when batch sizes are too small for reliable batch statistics.		'''Group normalization''' (Wu and He, 2018) divides channels into groups and normalizes within each group per sample. It bridges the gap between Layer Norm and Instance Norm and performs well when batch sizes are too small for reliable batch statistics.