Kaiming Initialization
| Article | |
|---|---|
| Topic area | Deep Learning |
| Prerequisites | Neural Network, Activation Function, Backpropagation |
Overview
Kaiming initialization, also called He initialization, is a weight initialization scheme for deep neural networks designed to keep the variance of activations and gradients approximately constant across layers when the network uses rectified linear units. It was introduced by Kaiming He and colleagues in 2015 as part of work on very deep convolutional networks for image classification.[1] The method draws each weight from a zero-mean distribution whose variance is set to $ 2/n $, where $ n $ is either the number of input units (fan-in) or the number of output units (fan-out) of the layer. The factor of two compensates for the fact that a ReLU zeros out roughly half of its inputs, which would otherwise halve the signal variance at every layer and cause activations to vanish in deep networks.
Kaiming initialization is the default choice for most modern feed-forward and convolutional architectures that use ReLU or its variants. It is closely related to, and partially supersedes, Xavier Initialization (also called Glorot initialization), which assumed symmetric activations such as $ \tanh $ and used a variance of $ 1/n $. Together, these schemes form the foundation of variance-preserving initialization, a family of methods that made it possible to train networks with dozens or hundreds of layers without resorting to layer-wise pre-training.
Motivation
Before variance-preserving initialization was understood, training deep networks was notoriously unstable. If weights were drawn from a distribution with too large a variance, activations grew exponentially with depth, leading to numerical overflow or saturated nonlinearities. If the variance was too small, activations shrank toward zero and gradients vanished, leaving deep layers unable to learn. The problem is most acute during the early steps of Stochastic Gradient Descent, before any normalization layer has had the chance to adapt.
The breakthrough was to treat initialization as a statistical question: what variance should the weights have so that the variance of the pre-activation in each layer matches the variance of the input signal? Answering this question requires assumptions about the Activation Function and the distribution of inputs. Glorot and Bengio gave the first rigorous answer in 2010 for symmetric activations such as the Hyperbolic Tangent.[2] He et al. extended the analysis to ReLU, which is asymmetric and zeros out negative inputs. The asymmetry is what introduces the additional factor of two in the Kaiming formula.
Mathematical Formulation
Consider a fully connected layer with input $ x \in \mathbb{R}^{n_{\text{in}}} $, weight matrix $ W \in \mathbb{R}^{n_{\text{out}} \times n_{\text{in}}} $, and pre-activation $ y = W x $. Assume the entries of $ W $ are independent, zero-mean, with variance $ \sigma^2 $, and that the entries of $ x $ are independent, zero-mean, with variance $ \mathrm{Var}(x) $. Then for any output coordinate $ y_i $:
$ {\displaystyle \mathrm{Var}(y_i) = n_{\text{in}} \cdot \sigma^2 \cdot \mathrm{Var}(x).} $
For the output variance to match the input variance, one would set $ \sigma^2 = 1/n_{\text{in}} $, which is the Xavier rule. However, in a ReLU network the input $ x $ to layer $ l+1 $ is the ReLU output of layer $ l $. Because ReLU zeros out negative pre-activations, the variance of $ x $ is roughly half the variance of the pre-activation that produced it, assuming the pre-activation is symmetric around zero. To compensate, the weight variance must double, giving the Kaiming rule:
$ {\displaystyle \sigma^2 = \frac{2}{n_{\text{in}}}.} $
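The factor of two can be made precise using the second-moment analysis of He et al.: if the pre-activation $ z $ of the previous layer is symmetric about zero, the ReLU output $ x = \max(0, z) $ satisfies
$ {\displaystyle \mathbb{E}[x^{2}] = \mathbb{E}\!\left[z^{2}\,\mathbf{1}_{\{z>0\}}\right] = \tfrac{1}{2}\,\mathbb{E}[z^{2}],} $
so each ReLU halves the second moment of the signal entering the next layer, and doubling the weight variance restores it.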
The weights can be drawn from either a normal distribution $ \mathcal{N}(0, 2/n_{\text{in}}) $ or a uniform distribution on $ [-\sqrt{6/n_{\text{in}}},\, \sqrt{6/n_{\text{in}}}] $, since both yield the same variance. The bias is typically initialized to zero. For convolutional layers, $ n_{\text{in}} $ is computed as $ k_h \cdot k_w \cdot c_{\text{in}} $, where $ k_h $ and $ k_w $ are the kernel dimensions and $ c_{\text{in}} $ is the number of input channels.
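A quick numerical check illustrates the effect (a sketch in PyTorch; the width of 1024, depth of 50, and batch size are arbitrary choices for illustration):

```python
import torch

torch.manual_seed(0)
n, depth = 1024, 50            # arbitrary width and depth for illustration
x = torch.randn(512, n)        # batch of standard-normal inputs

def final_variance(std):
    """Push x through `depth` ReLU layers whose weights have the given std."""
    h = x
    for _ in range(depth):
        w = torch.randn(n, n) * std
        h = torch.relu(h @ w.T)
    return h.var().item()

print(final_variance((2.0 / n) ** 0.5))  # Kaiming scaling: activations stay at order one
print(final_variance((1.0 / n) ** 0.5))  # 1/n (Xavier-style) scaling: activations all but vanish
```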
Fan-In Versus Fan-Out
Kaiming initialization admits two variants depending on which direction of signal propagation is preserved. The fan-in mode sets $ \sigma^2 = 2/n_{\text{in}} $, which keeps the forward activation variance constant across layers. The fan-out mode sets $ \sigma^2 = 2/n_{\text{out}} $, which keeps the backward gradient variance constant. He et al. showed that both choices yield trainable networks; the relevant scaling factor depends on whether the practitioner is more concerned about forward signal preservation or gradient flow during Backpropagation. In practice the difference is small for layers with comparable fan-in and fan-out, and most frameworks default to fan-in. The original paper recommended fan-out for the convolutional layers in their VGG-style architectures because the early layers had small fan-in but large fan-out.
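As a small illustration (the layer shape is arbitrary; PyTorch's `mode` argument selects the variant), the two fan counts and the resulting standard deviations can be compared for a convolutional layer:

```python
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3)  # example layer

# Fan counts for convolutions: kernel area times channel count.
k_h, k_w = conv.kernel_size
fan_in = k_h * k_w * conv.in_channels     # 3 * 3 * 3  = 27
fan_out = k_h * k_w * conv.out_channels   # 3 * 3 * 64 = 576

print((2.0 / fan_in) ** 0.5, (2.0 / fan_out) ** 0.5)  # quite different for this layer

# Select the variant with PyTorch's `mode` argument.
nn.init.kaiming_normal_(conv.weight, mode="fan_out", nonlinearity="relu")
```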
Generalization to Other Activations
The factor of two in Kaiming initialization is specific to ReLU. For activations that pass through more or less of the input, the formula generalizes to $ \sigma^2 = g^2 / n $, where $ g $ is a gain factor that depends on the activation. The leaky ReLU with negative slope $ a $ has gain $ g = \sqrt{2/(1 + a^2)} $, which reduces to the standard Kaiming gain of $ \sqrt{2} $ when $ a = 0 $. The Sigmoid Function has gain 1, and $ \tanh $ has gain $ 5/3 $. Modern frameworks such as PyTorch expose these gains through helper functions like torch.nn.init.calculate_gain, so practitioners do not need to memorize the constants.
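For instance, the gains quoted above can be queried directly (a quick check against PyTorch's helper):

```python
import torch.nn as nn

print(nn.init.calculate_gain("relu"))             # sqrt(2) ≈ 1.414
print(nn.init.calculate_gain("leaky_relu", 0.2))  # sqrt(2 / (1 + 0.2**2)) ≈ 1.387
print(nn.init.calculate_gain("tanh"))             # 5/3
print(nn.init.calculate_gain("sigmoid"))          # 1
```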
For more exotic activations such as SELU or smoother variants like GELU, specialized initialization schemes have been derived. SELU in particular requires a precise variance of $ 1/n_{\text{in}} $ together with carefully chosen activation parameters to drive activations toward a fixed point during training.
Comparison with Xavier Initialization
Xavier initialization predates Kaiming initialization and uses $ \sigma^2 = 1/n $, derived under the assumption that the activation function is approximately linear and symmetric around zero. This holds for $ \tanh $ near the origin but not for ReLU. As a result, networks initialized with Xavier and using ReLU activations exhibit an exponential decay of activation variance with depth: each layer halves the signal variance, so in a 30-layer ReLU network the activation variance falls to roughly $ 2^{-30} $ of its initial value, leaving the deepest layers with essentially no signal. He et al. demonstrated empirically that this decay prevented their 30-layer convolutional network from training at all under Xavier initialization, while the same architecture trained successfully under Kaiming initialization.
The two schemes are related by a constant factor. Many practitioners use Xavier for sigmoid or $ \tanh $ networks and Kaiming for ReLU networks; this convention captures the variance-preservation argument with minimal cognitive overhead. Modern architectures that combine multiple activation types often default to Kaiming throughout, accepting a slight overshoot in variance as harmless when Batch Normalization or Layer Normalization is present.
Interaction with Normalization Layers
The advent of normalization layers has reduced, but not eliminated, the importance of initialization. Batch Normalization rescales activations to unit variance after every layer, which means that even a poorly initialized network often trains successfully. Nevertheless, the very first forward and backward passes happen before BatchNorm has accumulated reliable statistics, and a bad initialization can push pre-activations into ranges that destabilize the running estimates. Kaiming initialization is therefore still the recommended choice in normalized networks, although the practical penalty for using a different scheme is much smaller than in unnormalized ones.
In residual networks, careful initialization can also reduce the variance amplification caused by the additive skip connections. Schemes such as Fixup and ZerO-Init build on the Kaiming framework by introducing per-block scaling factors that prevent the residual sum from blowing up at initialization, allowing very deep ResNets to be trained without normalization at all.
Practical Implementation
In PyTorch, Kaiming initialization is exposed as torch.nn.init.kaiming_normal_ and torch.nn.init.kaiming_uniform_. Both accept arguments for fan mode and nonlinearity. For convolutional and linear layers, PyTorch defaults to Kaiming uniform with the gain for leaky ReLU (slope $ \sqrt{5} $), which is a historical artifact rather than a recommended setting; many practitioners override this default. TensorFlow and JAX-based frameworks provide equivalent functions, often called he_normal or he_uniform.
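A minimal sketch of overriding that default for a layer that feeds into ReLU (the layer sizes are arbitrary; the first call mirrors what PyTorch itself applies on construction):

```python
import math
import torch.nn as nn

layer = nn.Linear(512, 256)  # arbitrary example sizes

# Roughly what PyTorch applies on construction (the historical default):
nn.init.kaiming_uniform_(layer.weight, a=math.sqrt(5))

# An explicit override for a layer followed by ReLU:
nn.init.kaiming_normal_(layer.weight, mode="fan_in", nonlinearity="relu")
```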
Beyond the choice of distribution and fan mode, three implementation details matter in practice. First, biases should be initialized to zero, not from the same distribution as weights. Second, the gain must match the activation that follows the layer, not precede it; a layer feeding into ReLU uses gain $ \sqrt{2} $, regardless of what came before. Third, when layers are tied or shared (such as in some Transformer variants), only one copy should be initialized.
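Putting these details together, a common pattern (a minimal sketch; it assumes every convolutional and linear layer is followed by ReLU, which should be adapted to the actual architecture) applies the rule once across a model:

```python
import torch.nn as nn

def init_weights(module: nn.Module) -> None:
    """Kaiming-initialize weights and zero biases for conv/linear layers."""
    if isinstance(module, (nn.Conv2d, nn.Linear)):
        nn.init.kaiming_normal_(module.weight, mode="fan_in", nonlinearity="relu")
        if module.bias is not None:
            nn.init.zeros_(module.bias)

model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
model.apply(init_weights)  # recursively visits every submodule
```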
Limitations
Kaiming initialization is a single-layer analysis: it assumes weights, inputs, and gradients are independent and zero-mean, and it ignores correlations introduced by the data distribution and by training dynamics. These assumptions can break in deep networks with strong feature reuse, in recurrent networks where the same weights are applied repeatedly, and in attention-based architectures where the softmax produces sharp output distributions. For RNNs and Transformers, alternative schemes such as orthogonal initialization or T-Fixup have been proposed.
A second limitation is that variance preservation is only one of several initialization desiderata. Dynamical isometry, which preserves the singular values of the input-output Jacobian, is a stronger condition that yields better-behaved gradients in extremely deep networks. Methods based on dynamical isometry, including orthogonal weights with careful nonlinearity choices, can outperform Kaiming initialization in regimes where depth is the binding constraint. For standard practice in networks of a few dozen layers with normalization, however, Kaiming initialization remains the default and rarely needs to be revisited.
References
- ↑ He, K., Zhang, X., Ren, S. and Sun, J. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. ICCV, 2015.
- ↑ Glorot, X. and Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. AISTATS, 2010.