Group Normalization

    Topic area: Deep Learning
    Prerequisites: Batch Normalization, Convolutional Neural Network


    Overview

    Group Normalization (GN) is a feature normalization technique for deep neural networks introduced by Yuxin Wu and Kaiming He in 2018.[1] It divides the channels of a layer into a fixed number of groups and computes the mean and variance for normalization within each group, independently for every example in the batch. Unlike Batch Normalization, its statistics do not depend on the batch dimension, which makes it stable when training with small batches, on memory-constrained hardware, or in tasks such as object detection and video understanding where large batch sizes are impractical.

    GN sits between Layer Normalization and Instance Normalization in the family of activation-normalization methods. Layer Normalization normalizes across all channels of a single example, Instance Normalization normalizes each channel of each example separately, and Group Normalization picks a middle ground by normalizing groups of channels. Empirically, this middle ground often gives accuracy close to Batch Normalization on vision benchmarks while removing the dependence on batch statistics.

    Motivation

    Batch Normalization revolutionized deep learning by stabilizing training and enabling much deeper networks, but it has a structural weakness: its per-channel mean and variance are estimated from the examples in the current minibatch. When the batch is small, those estimates are noisy and the normalization itself becomes a source of error. The problem is acute in tasks such as detection, segmentation, video, and 3D learning, where a single sample can occupy a large fraction of GPU memory and per-device batches of one or two are common. Switching from a batch of 32 to a batch of 2 in BN can degrade ImageNet top-1 accuracy by more than ten percentage points.

    Group Normalization was designed to keep the regularizing and conditioning benefits of activation normalization while removing the batch dependence. Because GN computes statistics from a single example only, the same code path is used at training and inference time and the normalization behavior does not change with batch size or distribution shift between training and test batches.

    Formulation

    Consider a 4D activation tensor from a Convolutional Neural Network with shape $ (N, C, H, W) $, where $ N $ is the batch size, $ C $ is the number of channels, and $ H, W $ are spatial dimensions. Group Normalization partitions the $ C $ channels into $ G $ groups, each containing $ C/G $ channels. For a given example $ n $ and group $ g $, let $ \mathcal{S}_{n,g} $ denote the set of activations in that group, of size $ (C/G) \cdot H \cdot W $.

    The group mean and variance are

    $ {\displaystyle \mu_{n,g} = \frac{1}{|\mathcal{S}_{n,g}|} \sum_{x \in \mathcal{S}_{n,g}} x, \qquad \sigma_{n,g}^2 = \frac{1}{|\mathcal{S}_{n,g}|} \sum_{x \in \mathcal{S}_{n,g}} (x - \mu_{n,g})^2.} $

    Each activation $ x_i $ in group $ g $ of example $ n $ is then normalized as

    $ {\displaystyle \hat{x}_i = \frac{x_i - \mu_{n,g}}{\sqrt{\sigma_{n,g}^2 + \epsilon}},} $

    where $ \epsilon $ is a small constant (typically $ 10^{-5} $) added for numerical stability. Finally, a per-channel affine transformation with learnable parameters $ \gamma_c $ and $ \beta_c $ is applied:

    $ {\displaystyle y_i = \gamma_c \hat{x}_i + \beta_c.} $

    The affine parameters preserve the representational capacity of the layer and are shared across spatial positions and across the batch dimension. Two limiting cases recover well-known methods: with $ G = 1 $, GN reduces to Layer Normalization (one group spanning all channels); with $ G = C $, it reduces to Instance Normalization (one group per channel). Wu and He recommend $ G = 32 $ as a default that works well across architectures.
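    The formulation above can be sketched directly in NumPy. This is an illustrative implementation, not the paper's reference code; the function name `group_norm` is assumed. It follows the equations exactly: reshape into groups, normalize with the group mean and variance, then apply the per-channel affine transform.

```python
import numpy as np

def group_norm(x, G, gamma, beta, eps=1e-5):
    """Group Normalization over a (N, C, H, W) tensor, per the formulas above.

    gamma and beta are per-channel affine parameters of shape (C,).
    Requires C to be divisible by G.
    """
    N, C, H, W = x.shape
    assert C % G == 0, "channel count must be divisible by the group count"
    # Reshape so each group's (C/G, H, W) activations share one mean/variance.
    xg = x.reshape(N, G, C // G, H, W)
    mu = xg.mean(axis=(2, 3, 4), keepdims=True)
    var = xg.var(axis=(2, 3, 4), keepdims=True)
    x_hat = ((xg - mu) / np.sqrt(var + eps)).reshape(N, C, H, W)
    # Per-channel affine transform, shared across batch and spatial positions.
    return gamma.reshape(1, C, 1, 1) * x_hat + beta.reshape(1, C, 1, 1)

x = np.random.randn(2, 8, 4, 4)
y = group_norm(x, G=4, gamma=np.ones(8), beta=np.zeros(8))
```

    Setting `G=1` or `G=C` in this function reproduces the Layer Normalization and Instance Normalization limiting cases described above.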

    Comparison with other normalizers

    The activation-normalization family can be characterized by the set of axes over which statistics are computed. Given a tensor of shape $ (N, C, H, W) $:

    • Batch Normalization reduces over $ (N, H, W) $ per channel — its statistics couple examples in the batch.
    • Layer Normalization reduces over $ (C, H, W) $ per example — one mean and variance per example.
    • Instance Normalization reduces over $ (H, W) $ per (example, channel).
    • Group Normalization reduces over $ (C/G, H, W) $ per (example, group).

    Only Batch Normalization mixes information across examples; the others, including GN, are sample-wise and therefore behave identically at training and inference. This independence from batch composition is what gives GN its stability under small or variable batch sizes. The trade-off is that GN does not get the implicit regularization that BN derives from batch noise, so the optimal learning rate and weight-decay schedule may differ.
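    The four reduction patterns above differ only in the axes passed to the mean. A short NumPy sketch (variable names are illustrative) makes the resulting statistic shapes concrete:

```python
import numpy as np

N, C, H, W = 2, 8, 4, 4
G = 4
x = np.random.randn(N, C, H, W)

# Batch Norm: reduce over (N, H, W) -> one mean per channel.
bn_mean = x.mean(axis=(0, 2, 3))                              # shape (C,)
# Layer Norm: reduce over (C, H, W) -> one mean per example.
ln_mean = x.mean(axis=(1, 2, 3))                              # shape (N,)
# Instance Norm: reduce over (H, W) -> one mean per (example, channel).
in_mean = x.mean(axis=(2, 3))                                 # shape (N, C)
# Group Norm: reduce over (C/G, H, W) -> one mean per (example, group).
gn_mean = x.reshape(N, G, C // G, H, W).mean(axis=(2, 3, 4))  # shape (N, G)
```

    Note that only `bn_mean` reduces over the batch axis $ N $; the other three compute statistics one example at a time.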

    Practical considerations

    The number of groups $ G $ is a hyperparameter, but in practice it is rarely tuned per layer. The standard choice $ G = 32 $ works well for ResNet-like backbones with channel counts that are multiples of 32. When a layer has fewer than 32 channels, the convention is to use a fixed group size (e.g. 16 channels per group) instead of a fixed group count. Most implementations require $ C $ to be divisible by $ G $; pad or adjust the architecture if necessary.
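    The group-selection convention described above can be captured in a small helper. This is a sketch, not code from the paper; the function name and the fallback group size of 16 are assumptions chosen to match the convention stated in the text.

```python
def num_groups(channels, preferred_groups=32, fallback_group_size=16):
    """Pick a group count for a layer with the given channel count.

    Uses preferred_groups when the channel count allows it; for narrow layers,
    falls back to a fixed group size rather than a fixed group count.
    """
    if channels >= preferred_groups and channels % preferred_groups == 0:
        return preferred_groups
    if channels % fallback_group_size == 0:
        return channels // fallback_group_size
    # Last resort: the largest divisor of `channels` not exceeding the
    # preferred group count, so that divisibility always holds.
    for g in range(min(preferred_groups, channels), 0, -1):
        if channels % g == 0:
            return g
```

    For example, a 64-channel layer gets 32 groups, while a 16-channel layer gets a single group of 16 channels.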

    GN is a drop-in replacement for BN in most architectures: replace each BN layer with a GN layer, keep the rest of the network unchanged, and retrain. No running statistics need to be tracked, which simplifies the model and removes a source of train/test mismatch. Memory and compute overhead are modest — GN typically costs slightly more than BN per forward pass in modern frameworks because the reduction shape is less hardware-friendly than per-channel reductions, but the difference is usually below ten percent.
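    In PyTorch, the swap described above is a one-line change: `torch.nn.GroupNorm(num_groups, num_channels)` replaces `torch.nn.BatchNorm2d(num_channels)`. The block below is a minimal sketch of this (the surrounding layers are arbitrary); it also illustrates that GN keeps no running statistics, so `train()` and `eval()` modes produce identical outputs.

```python
import torch
import torch.nn as nn

# A BN-based block and its GN counterpart; only the normalization layer changes.
bn_block = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU())
gn_block = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.GroupNorm(32, 64), nn.ReLU())

y = gn_block(torch.randn(2, 3, 8, 8))   # shape (2, 64, 8, 8)

# No running statistics: the same code path runs in training and evaluation.
gn = nn.GroupNorm(32, 64)
x = torch.randn(2, 64, 4, 4)
out_train = gn.train()(x)
out_eval = gn.eval()(x)
```

    Because GN statistics are per-example, the output for one sample is also unaffected by the other samples in its batch, unlike BN in training mode.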

    When fine-tuning a network that was pre-trained with BN, switching to GN generally requires retraining or substantial fine-tuning rather than a hot swap, because the affine parameters were learned in a different normalization regime.

    Empirical behavior

    In the original paper, ResNet-50 trained on ImageNet with a batch size of 32 per GPU achieves comparable top-1 accuracy under BN and GN (within roughly half a percentage point). The decisive difference appears as the batch is shrunk: at batch size 2 per GPU, BN accuracy collapses by about ten points while GN remains essentially unchanged. The same pattern holds for object detection on COCO with Mask R-CNN and for video classification, where GN consistently outperforms BN under the small-batch regimes those tasks impose.

    Group Normalization has become a standard choice in detection and segmentation frameworks, and is frequently used in diffusion models and other architectures where per-device batch sizes are constrained by activation memory. It is also a common baseline normalizer in implementations that need deterministic, distribution-free behavior, such as reinforcement learning and federated settings.

    Variants and related methods

    Several methods extend or relate to GN:

    • Switchable Normalization learns a weighted combination of BN, LN, and IN statistics per layer, effectively letting the network choose where on the BN/GN/LN/IN spectrum to operate.[2]
    • Filter Response Normalization removes the mean-subtraction step entirely and combines per-channel variance normalization with a learned thresholded activation, also avoiding batch dependence.[3]
    • Weight Standardization normalizes the convolutional weights instead of activations and is often combined with GN to recover BN-level accuracy at small batch sizes.[4]
    • Group Convolutions share the grouping intuition along the channel dimension but operate on the convolution itself rather than on normalization statistics, and the two ideas are sometimes combined in efficient architectures.

    Limitations

    GN removes batch dependence at the cost of removing the implicit regularization that BN provides, and on large-batch image classification it sometimes lags BN by a small margin. The required divisibility of $ C $ by $ G $ is a minor architectural constraint. Because the reductions span variable spatial extents, GN can be slower than BN on hardware that is highly optimized for per-channel reductions. Finally, the choice of $ G $ is an extra hyperparameter; though defaults work well, the optimal value can depend on the network width and task.

    References