Xavier Initialization

    From Marovi AI
    Topic area: Deep Learning
    Prerequisites: Backpropagation, Activation Function


    Overview

    Xavier initialization, also called Glorot initialization, is a scheme for setting the initial weights of a deep neural network so that the variance of activations and gradients stays roughly constant across layers at the start of training. It was introduced by Xavier Glorot and Yoshua Bengio in 2010 and remains one of the most widely used default initializers, especially for networks with symmetric, zero-centered activation functions such as the hyperbolic tangent.[1]

    The core idea is a simple variance-balancing argument. If each layer's weights are drawn from a distribution whose variance is calibrated to its fan-in and fan-out, then the magnitude of forward activations and the magnitude of backward gradients neither shrink toward zero nor blow up as signals propagate through depth. This avoids the vanishing and exploding gradient pathologies that made it notoriously hard to train deep networks before careful initialization was understood.

    Motivation

    Before Glorot and Bengio's analysis, deep networks were typically initialized with weights drawn from a small Gaussian or uniform distribution chosen by hand or by trial and error. Practitioners observed that very deep networks with sigmoid or hyperbolic tangent activations were almost impossible to train: gradients computed by backpropagation would either decay to zero in early layers (the vanishing gradient problem) or grow without bound (the exploding gradient problem), depending on the scale of the weights and the choice of activation function. Symmetric saturating activations such as tanh exacerbated the issue because their derivatives shrink rapidly outside a narrow region around zero.

    Glorot and Bengio analyzed how the variance of activations evolves through the forward pass and how the variance of gradients evolves through the backward pass under the assumptions that activations are approximately linear near zero and that weights are independent and identically distributed with zero mean. They showed that to keep both forward and backward variances stable across a chain of layers, the variance of each weight should be tied to the layer's fan-in (number of input units) and fan-out (number of output units). The resulting prescription is what is now called Xavier initialization.

    Formulation

    Consider a fully connected layer that maps an input vector of dimension $ n_{\text{in}} $ to an output vector of dimension $ n_{\text{out}} $ by a weight matrix $ W $. Assume the inputs are independent with zero mean and shared variance $ \sigma_x^2 $, the weights are independent with zero mean and variance $ \sigma_w^2 $, and the activation function behaves approximately linearly near zero with unit slope. Then the variance of each output is

    $ {\displaystyle \operatorname{Var}(y) = n_{\text{in}} \, \sigma_w^2 \, \sigma_x^2.} $
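    A quick numerical check of this relation is straightforward. The following NumPy sketch (dimensions, scales, and seed chosen only for illustration) draws zero-mean inputs and weights and compares the empirical output variance with $ n_{\text{in}} \, \sigma_w^2 \, \sigma_x^2 $:

        import numpy as np

        rng = np.random.default_rng(0)
        n_in, n_out = 400, 300
        sigma_x, sigma_w = 1.0, 0.05

        # Zero-mean inputs with variance sigma_x^2 and i.i.d. zero-mean weights.
        x = rng.normal(0.0, sigma_x, size=(10_000, n_in))
        W = rng.normal(0.0, sigma_w, size=(n_in, n_out))

        y = x @ W   # linear layer, no bias, activation treated as identity near zero

        print(y.var())                          # empirical output variance
        print(n_in * sigma_w**2 * sigma_x**2)   # predicted value: 400 * 0.0025 * 1.0 = 1.0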

    For the variance to be preserved through the forward pass, one needs $ \sigma_w^2 = 1 / n_{\text{in}} $. A symmetric argument applied to the backward pass, where the gradient is multiplied by the transpose of the weight matrix, gives $ \sigma_w^2 = 1 / n_{\text{out}} $. Since these two requirements generally conflict, Glorot and Bengio proposed the harmonic compromise

    $ {\displaystyle \sigma_w^2 = \frac{2}{n_{\text{in}} + n_{\text{out}}}.} $

    This is the defining variance of Xavier initialization. Two common parameterizations realize it:

    • Xavier normal: draw each weight independently from a Gaussian distribution $ \mathcal{N}\!\left(0, \frac{2}{n_{\text{in}} + n_{\text{out}}}\right) $.
    • Xavier uniform: draw each weight independently from $ \mathcal{U}(-a, a) $ with $ a = \sqrt{6 / (n_{\text{in}} + n_{\text{out}})} $, so that the uniform distribution has the prescribed variance.

    Biases are typically initialized to zero. For convolutional layers, fan-in and fan-out are computed as the number of input or output channels multiplied by the size of the receptive field (for example, kernel height times kernel width).
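    As an illustration, both parameterizations can be written in a few lines of NumPy; the helper names below are illustrative rather than taken from any library:

        import numpy as np

        def xavier_normal(fan_in, fan_out, rng=None):
            # Gaussian with variance 2 / (fan_in + fan_out).
            rng = rng or np.random.default_rng()
            std = np.sqrt(2.0 / (fan_in + fan_out))
            return rng.normal(0.0, std, size=(fan_in, fan_out))

        def xavier_uniform(fan_in, fan_out, rng=None):
            # Uniform on [-a, a] with a = sqrt(6 / (fan_in + fan_out)),
            # which has the same variance 2 / (fan_in + fan_out).
            rng = rng or np.random.default_rng()
            a = np.sqrt(6.0 / (fan_in + fan_out))
            return rng.uniform(-a, a, size=(fan_in, fan_out))

        # For a convolution with kernel size (kh, kw):
        #   fan_in  = in_channels  * kh * kw
        #   fan_out = out_channels * kh * kw
        W = xavier_uniform(fan_in=256, fan_out=128)   # dense-layer weight matrix
        b = np.zeros(128)                             # biases start at zero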

    Derivation sketch

    The derivation rests on three approximations: weights and inputs are independent and zero-mean, the activation function is locally linear near the origin with derivative one, and biases are zero. Under these assumptions the forward variance recursion reads

    $ {\displaystyle \operatorname{Var}(y^{(\ell)}) = n^{(\ell)}_{\text{in}} \, \sigma^{(\ell)\,2}_w \, \operatorname{Var}(y^{(\ell-1)}),} $

    so the variance is preserved across layers exactly when $ n^{(\ell)}_{\text{in}} \sigma^{(\ell)\,2}_w = 1 $. The backward recursion has the same form with $ n^{(\ell)}_{\text{in}} $ replaced by $ n^{(\ell)}_{\text{out}} $, since the gradient is propagated by the transpose. Averaging the two conditions gives the Xavier rule. Glorot and Bengio verified empirically that networks initialized this way exhibit much more stable activation and gradient histograms across depth, and train substantially faster than networks initialized with the older heuristic $ \mathcal{U}(-1/\sqrt{n_{\text{in}}}, 1/\sqrt{n_{\text{in}}}) $.
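    The stabilizing effect of the rule can be reproduced with a small simulation. The sketch below (width, depth, and batch size are arbitrary choices for illustration) pushes a batch through a deep tanh multilayer perceptron and reports the activation standard deviation at the last layer, once with Xavier-scaled weights and once with a Gaussian matched to the variance of the older heuristic:

        import numpy as np

        def final_activation_std(weight_std_fn, width=512, depth=30, batch=1024, seed=0):
            # Push a batch through `depth` tanh layers; return the last-layer activation std.
            rng = np.random.default_rng(seed)
            h = rng.normal(0.0, 1.0, size=(batch, width))
            for _ in range(depth):
                W = rng.normal(0.0, weight_std_fn(width, width), size=(width, width))
                h = np.tanh(h @ W)
            return h.std()

        xavier_std = lambda n_in, n_out: np.sqrt(2.0 / (n_in + n_out))
        older_std  = lambda n_in, n_out: np.sqrt(1.0 / (3.0 * n_in))   # variance of U(-1/sqrt(n_in), 1/sqrt(n_in))

        print(final_activation_std(xavier_std))   # stays on the order of a few tenths
        print(final_activation_std(older_std))    # decays markedly with depth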

    Comparison with He initialization

    The linear-activation assumption underlying Xavier initialization is reasonable for tanh and the logistic sigmoid near the origin, but it breaks down for the rectified linear unit (ReLU). Because ReLU zeros out negative inputs, only roughly half of the pre-activations contribute to the output variance, and Xavier initialization tends to underestimate the appropriate weight scale for ReLU networks, leading to activations that shrink with depth. He initialization, introduced by Kaiming He and collaborators in 2015, accounts for this by using

    $ {\displaystyle \sigma_w^2 = \frac{2}{n_{\text{in}}}} $

    instead, doubling the variance to compensate for the rectifier.[2] In modern practice, Xavier initialization is the default for tanh, sigmoid, and softmax-based architectures, while He initialization is the default for ReLU and its variants. Both are special cases of a general "scaled" initialization principle and are sometimes referred to collectively as variance-preserving initializations.
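    The difference is easy to see in a toy simulation. In the sketch below (layer width and depth are arbitrary), a deep stack of ReLU layers is initialized once with the Xavier variance and once with the He variance, and the final activation scale is compared:

        import numpy as np

        def relu_stack_scale(weight_var_fn, width=512, depth=30, batch=1024, seed=0):
            # Push a batch through `depth` ReLU layers; return the last-layer activation std.
            rng = np.random.default_rng(seed)
            h = rng.normal(0.0, 1.0, size=(batch, width))
            for _ in range(depth):
                W = rng.normal(0.0, np.sqrt(weight_var_fn(width, width)), size=(width, width))
                h = np.maximum(h @ W, 0.0)
            return h.std()

        xavier_var = lambda n_in, n_out: 2.0 / (n_in + n_out)   # equals 1/n here, since n_in == n_out
        he_var     = lambda n_in, n_out: 2.0 / n_in

        print(relu_stack_scale(xavier_var))   # shrinks roughly by 1/sqrt(2) per layer; vanishingly small at depth 30
        print(relu_stack_scale(he_var))       # stays on the order of the input scale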

    Practical considerations

    For deep networks with batch normalization or layer normalization, the choice of initializer matters less than it once did, because the normalization layer rescales activations after each linear transformation and partially absorbs miscalibrated initial scales. Even so, a sensible initializer accelerates the first few epochs of training and can be the difference between a network that diverges and one that converges, particularly for very deep networks without normalization or for the residual branches of very deep residual networks where small initial scales help.

    Most deep learning frameworks ship with both Xavier uniform and Xavier normal as built-in initializers, often under the names "glorot_uniform" and "glorot_normal" (for example in TensorFlow and Keras) or "xavier_uniform_" and "xavier_normal_" (in PyTorch). PyTorch's default linear-layer initialization uses a closely related He-uniform scheme. When mixing activations or combining with normalization, practitioners typically follow the convention of pairing tanh and sigmoid layers with Xavier and ReLU layers with He, while leaving the framework defaults for normalization and embedding layers untouched.
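    For concreteness, a typical PyTorch usage looks like the following (the layer sizes are arbitrary):

        import torch.nn as nn

        layer = nn.Linear(784, 256)

        # Replace PyTorch's default initialization with an explicit Xavier scheme.
        nn.init.xavier_uniform_(layer.weight)    # or nn.init.xavier_normal_(layer.weight)
        nn.init.zeros_(layer.bias)

        # The optional gain rescales the variance for the downstream nonlinearity,
        # e.g. gain = 5/3 for tanh via calculate_gain.
        nn.init.xavier_uniform_(layer.weight, gain=nn.init.calculate_gain('tanh'))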

    For recurrent neural networks, orthogonal initialization of the recurrent weight matrix is a stronger choice than Xavier, because it preserves the norm of the hidden state across time steps exactly rather than only on average. Xavier remains a reasonable choice for the input-to-hidden weights of a recurrent layer.
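    A common PyTorch pattern for a recurrent layer, shown here only as a sketch, applies orthogonal initialization to the hidden-to-hidden matrices and Xavier to the input-to-hidden matrices:

        import torch.nn as nn

        rnn = nn.RNN(input_size=128, hidden_size=256)

        for name, param in rnn.named_parameters():
            if 'weight_hh' in name:            # recurrent (hidden-to-hidden) weights
                nn.init.orthogonal_(param)
            elif 'weight_ih' in name:          # input-to-hidden weights
                nn.init.xavier_uniform_(param)
            elif 'bias' in name:
                nn.init.zeros_(param)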

    Limitations

    Xavier initialization makes assumptions that are routinely violated in practice. It assumes the activation is linear near zero with unit slope, which excludes ReLU and its variants; it assumes inputs and weights are independent with zero mean, which is approximately but not exactly true at initialization; and it ignores the effect of biases, normalization layers, and skip connections. For networks with strong nonlinearities, it underestimates or overestimates the appropriate scale, and dedicated derivations such as He initialization or LSUV initialization are preferred.

    The scheme also says nothing about the geometric structure of the weight matrix. Two networks with identical Xavier-distributed weights can still have very different conditioning of their Jacobians, and pathological eigenvalue structures can persist even when variances are well calibrated. For very deep networks, more careful initializations such as orthogonal matrices, dynamical isometry, or Fixup initialization for residual networks address the conditioning question directly.[3] Finally, the scheme is purely about the start of training; with sufficient training, modern optimizers can usually overcome moderately mis-scaled initial weights, so Xavier initialization is best understood as a robust default rather than as an optimal choice in any strict sense.

    References