RMSNorm/en
| Article | |
|---|---|
| Topic area | neural-network-components |
| Prerequisites | Layer Normalization, Transformer |
Overview
Root Mean Square Layer Normalization (RMSNorm) is a normalization layer used in deep neural networks, introduced by Zhang and Sennrich in 2019 as a simplification of Layer Normalization. RMSNorm rescales each input vector by its root mean square (RMS) and applies a learned per-feature gain, but unlike LayerNorm it omits the mean-centering (re-centering) step and the additive bias. The result is a layer with roughly half the arithmetic of LayerNorm while preserving most of its training stability benefits.
RMSNorm has become a standard component of modern Transformer architectures, particularly large language models. It is the normalization layer used in T5, LLaMA, LLaMA 2, LLaMA 3, Mistral, Gemma, Qwen, and many other open-weight models. Its popularity stems from a combination of simplicity, slightly faster computation, marginally better training stability in the Pre-LayerNorm placement, and the empirical observation that re-centering activations rarely changes downstream task quality.
Motivation
Layer Normalization standardizes a vector $ x \in \mathbb{R}^d $ by subtracting its mean and dividing by its standard deviation, then applying a learned scale $ \gamma $ and shift $ \beta $. The re-centering step (subtracting the mean) was originally motivated by analogy with Batch Normalization, which centers each feature across the batch. Zhang and Sennrich observed that the practical benefits of LayerNorm in deep networks come almost entirely from re-scaling rather than re-centering: the layer's role is to keep activation magnitudes bounded so that gradients neither vanish nor explode. The mean-subtraction step costs an extra reduction over the vector and an additional parameter vector ($ \beta $), while contributing little to optimization.
RMSNorm is the answer to the question, "what happens if we keep only the re-scaling part?" It is invariant to per-vector rescaling of the input but, unlike LayerNorm, not invariant to a constant shift. Empirically this missing invariance has had no measurable cost in modern Transformer training.
Formulation
Given an input vector $ x \in \mathbb{R}^d $, RMSNorm computes
$ {\displaystyle \operatorname{RMS}(x) = \sqrt{\frac{1}{d}\sum_{i=1}^{d} x_i^2}} $
and outputs
$ {\displaystyle \operatorname{RMSNorm}(x)_i = \frac{x_i}{\sqrt{\operatorname{RMS}(x)^2 + \varepsilon}}\, g_i} $
where $ g \in \mathbb{R}^d $ is a learned gain vector (initialized to ones) and $ \varepsilon $ is a small constant (typically $ 10^{-6} $ or $ 10^{-5} $) added inside the square root for numerical stability. There is no learned bias and no mean subtraction.
For a sequence of token vectors, RMSNorm is applied independently per token, exactly as LayerNorm is in Transformers. In matrix form, if $ X \in \mathbb{R}^{n \times d} $ stacks $ n $ token vectors as rows, then
$ {\displaystyle \operatorname{RMSNorm}(X) = \left(X \oslash \left(\sqrt{\tfrac{1}{d}(X \odot X)\mathbf{1} + \varepsilon}\;\mathbf{1}^{\top}\right)\right)\operatorname{diag}(g)} $
where $ \odot $ and $ \oslash $ denote element-wise multiplication and division, $ \mathbf{1} \in \mathbb{R}^d $ is the all-ones vector, and the square root is applied element-wise; the outer product with $ \mathbf{1}^{\top} $ broadcasts each row's RMS across the feature dimension, and $ \operatorname{diag}(g) $ applies the per-feature gain.
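As a sanity check (an illustrative sketch, not taken from the original paper), the following NumPy snippet evaluates the matrix form literally and confirms that it agrees with the per-vector definition:
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
n, d, eps = 4, 8, 1e-6
X = rng.normal(size=(n, d))
g = rng.normal(size=d)
ones = np.ones(d)

# Per-vector form: divide each row by its RMS, then apply the per-feature gain.
per_vector = X / np.sqrt((X ** 2).mean(axis=-1, keepdims=True) + eps) * g

# Matrix form, written out as in the equation above.
row_rms = np.sqrt((X * X) @ ones / d + eps)            # per-row RMS, shape (n,)
matrix_form = (X / np.outer(row_rms, ones)) @ np.diag(g)

assert np.allclose(per_vector, matrix_form)
</syntaxhighlight>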
Relationship to Layer Normalization
LayerNorm computes the mean $ \mu = \frac{1}{d}\sum_i x_i $ and standard deviation $ \sigma = \sqrt{\frac{1}{d}\sum_i (x_i - \mu)^2} $, then outputs $ \gamma \odot (x - \mu)/(\sigma + \varepsilon) + \beta $. RMSNorm is exactly LayerNorm with $ \mu $ forced to zero and $ \beta $ dropped. Equivalently, RMSNorm is LayerNorm restricted to inputs that are already centered, a condition that holds approximately after the first few blocks of a trained deep Transformer.
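The equivalence can be checked directly (a small sketch, assuming PyTorch): on an input that has been mean-centered, RMSNorm with unit gain reproduces LayerNorm with unit scale and zero shift.
<syntaxhighlight lang="python">
import torch
import torch.nn.functional as F

d, eps = 16, 1e-6
x = torch.randn(3, d)
x = x - x.mean(dim=-1, keepdim=True)   # force the input to be exactly centered

# RMSNorm with unit gain (eps inside the square root, as in the formulation above).
rmsnorm_out = x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)

# LayerNorm with no affine parameters (gamma = 1, beta = 0).
layernorm_out = F.layer_norm(x, (d,), eps=eps)

print(torch.allclose(rmsnorm_out, layernorm_out, atol=1e-6))  # True
</syntaxhighlight>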
The arithmetic cost of LayerNorm per $ d $-dimensional vector is roughly two reductions (mean and variance) plus an element-wise affine; RMSNorm has one reduction (sum of squares) plus an element-wise scale. On modern GPUs the wall-clock saving is small in isolation, but compounded over hundreds of layers and trillions of tokens it is non-trivial. More importantly, the simpler kernel makes fused implementations easier to write and to keep numerically stable in mixed precision.
Placement in Transformers
RMSNorm is almost always used in the Pre-LayerNorm (pre-norm) residual configuration: the layer is applied to the input of each sub-block, and the residual connection adds the unnormalized input to the sub-block output. Schematically, for the Self-attention sub-block,
$ {\displaystyle y = x + \operatorname{Attention}(\operatorname{RMSNorm}(x))} $
and similarly for the Feedforward Network sub-block. This contrasts with the original Transformer's Post-LayerNorm (post-norm) placement, where normalization follows the residual addition. Pre-norm with RMSNorm trains stably, with little or no learning-rate warmup, at depths and scales where post-norm configurations typically diverge, and it is the default in essentially all modern open-weight LLMs.
A final RMSNorm is conventionally applied to the output of the last block before the unembedding projection.
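A schematic pre-norm block might look like the following sketch. It uses nn.RMSNorm, which is available in recent PyTorch releases (the module defined in the Implementation Considerations section below is a drop-in substitute); the attention and feedforward arguments are placeholders for whatever sub-block implementations the model uses.
<syntaxhighlight lang="python">
import torch.nn as nn

class PreNormBlock(nn.Module):
    """Schematic pre-norm Transformer block: normalize the sub-block input,
    then add the unnormalized input back through the residual connection."""
    def __init__(self, dim, attention: nn.Module, feedforward: nn.Module, eps=1e-6):
        super().__init__()
        self.attn_norm = nn.RMSNorm(dim, eps=eps)
        self.ffn_norm = nn.RMSNorm(dim, eps=eps)
        self.attention = attention        # placeholder self-attention module
        self.feedforward = feedforward    # placeholder feed-forward module

    def forward(self, x):
        x = x + self.attention(self.attn_norm(x))
        x = x + self.feedforward(self.ffn_norm(x))
        return x
</syntaxhighlight>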
Variants and Extensions
Several variants of RMSNorm appear in the literature and in production models.
Partial RMSNorm computes the RMS using only the first $ k < d $ coordinates of the input. The intuition is that the RMS of a high-dimensional vector concentrates tightly around its expected value, so a partial sum estimates it almost as accurately. Zhang and Sennrich reported negligible quality loss with $ k = d/8 $ and faster training. Partial RMSNorm has not seen wide adoption in practice because the absolute speedup is small and modern attention kernels dominate the runtime.
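A minimal sketch of the partial-RMS idea (the function name is illustrative and the gain is omitted for brevity):
<syntaxhighlight lang="python">
import torch

def partial_rms_normalize(x, k, eps=1e-6):
    # Estimate the mean of squares from only the first k features,
    # then rescale the full vector with that estimate.
    partial_ms = x[..., :k].pow(2).mean(dim=-1, keepdim=True)
    return x * torch.rsqrt(partial_ms + eps)
</syntaxhighlight>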
Gated RMSNorm multiplies the RMSNorm output by a gating function of the input or another tensor. It is used in some State Space Model architectures, notably Mamba 2, where it gates the SSM output before the projection.
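A schematic rendering of the gating pattern described above, with SiLU as an example gate; real implementations differ in the choice of gate and in whether it is applied before or after the normalization, and the gain is again omitted here.
<syntaxhighlight lang="python">
import torch
import torch.nn.functional as F

def gated_rmsnorm(x, z, eps=1e-6):
    # Normalize x by its RMS, then gate the result with a function of a
    # second tensor z.
    normed = x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
    return normed * F.silu(z)
</syntaxhighlight>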
Group RMSNorm splits the feature dimension into $ G $ groups and normalizes each group independently with its own gain vector, in analogy with Group Normalization for convolutional networks. Group RMSNorm is used inside grouped attention heads in some recent architectures.
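A sketch of the group-wise computation (names are illustrative; in a full layer each group would also get its own slice of the gain vector, omitted here):
<syntaxhighlight lang="python">
import torch

def group_rms_normalize(x, num_groups, eps=1e-6):
    # Reshape (..., d) -> (..., G, d // G), normalize each group by its
    # own RMS, then restore the original shape.
    *lead, d = x.shape
    xg = x.reshape(*lead, num_groups, d // num_groups)
    xg = xg * torch.rsqrt(xg.pow(2).mean(dim=-1, keepdim=True) + eps)
    return xg.reshape(*lead, d)
</syntaxhighlight>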
QK Normalization applies RMSNorm (or, in some models, LayerNorm) to the query and key vectors inside Self-attention before the dot product. This stabilizes training at very large scale by preventing the pre-softmax logits from drifting to extreme magnitudes, and is used in Gemma 3, several frontier models, and the Vision Transformer ViT-22B.
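A sketch of where the normalization sits in the attention computation (illustrative names; q_norm and k_norm are normalization modules acting over the per-head dimension):
<syntaxhighlight lang="python">
import math
import torch

def qk_normalized_scores(q, k, q_norm, k_norm):
    # q, k: (batch, heads, seq, head_dim); normalize queries and keys
    # per head before the scaled dot product.
    q = q_norm(q)
    k = k_norm(k)
    return (q @ k.transpose(-2, -1)) / math.sqrt(q.shape[-1])
</syntaxhighlight>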
Implementation Considerations
A reference PyTorch implementation is one of the simplest layers in deep learning:
<syntaxhighlight lang="python"> class RMSNorm(nn.Module):
def __init__(self, dim, eps=1e-6):
super().__init__()
self.weight = nn.Parameter(torch.ones(dim))
self.eps = eps
def forward(self, x):
# x: (..., dim)
rms = x.pow(2).mean(-1, keepdim=True).add(self.eps).rsqrt()
return self.weight * (x * rms)
</syntaxhighlight>
Production implementations fuse the entire layer into a single CUDA kernel to minimize memory traffic, and compute the sum of squares in fp32 even when the inputs are bf16 or fp16 to avoid overflow when $ d $ is large (e.g., 8192 in LLaMA 70B). The gain vector is typically stored in the same precision as the activations and cast to fp32 only inside the fused kernel. The $ \varepsilon $ is added inside the square root, not after, to keep the gradient well-behaved when the input is near zero.
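An unfused sketch of the mixed-precision pattern described above (a production kernel would fuse these steps into one pass; whether the gain is applied before or after the downcast varies between codebases):
<syntaxhighlight lang="python">
import torch

def rmsnorm_fp32_accumulate(x, weight, eps=1e-6):
    # Compute the mean of squares in fp32 even when x is bf16/fp16,
    # then cast the normalized result back to the activation dtype.
    x_fp32 = x.float()
    inv_rms = torch.rsqrt(x_fp32.pow(2).mean(dim=-1, keepdim=True) + eps)
    return (x_fp32 * inv_rms).type_as(x) * weight
</syntaxhighlight>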
Apex, FlashAttention, and the major training frameworks all ship optimized RMSNorm kernels. The forward and backward passes are bandwidth-bound rather than compute-bound on accelerators.
Empirical Performance
Across machine translation, language modeling, and downstream evaluation tasks, RMSNorm matches or slightly exceeds LayerNorm in final task quality while training 5-10% faster end-to-end on common Transformer sizes. The original paper reported small gains of 0.1-0.3 BLEU on WMT translation benchmarks alongside the speed improvement, with ablations on RNN-based models showing similar trends. Subsequent large-scale studies, including the empirical work behind the T5 and LLaMA model families, found no scenario in which LayerNorm provided a clear quality advantage. Combined with its simpler kernel and good interaction with pre-norm residual connections, this evidence has made RMSNorm the default normalization choice in essentially every Transformer-based language model released since 2022.
Limitations
RMSNorm is not invariant to a constant additive shift of its input, so models that rely on encoding information in the mean of a representation could in principle behave differently under RMSNorm than under LayerNorm. In practice, Self-attention and Feedforward Network sub-blocks do not exploit the mean of their input, and the activations in pre-norm Transformers are approximately mean-zero by the second or third block, so the missing invariance is not measurable in real models. RMSNorm shares with LayerNorm the property that all coordinates of a vector influence the normalization of every other coordinate, which couples activations across the feature dimension and complicates certain forms of model surgery (such as pruning or per-feature quantization) more than would a per-feature normalizer like Batch Normalization.