Learning Rate Warmup

    From Marovi AI
    Topic area: Optimization
    Prerequisites: Stochastic Gradient Descent, Learning Rate Schedule, Adam


    Overview

    Learning rate warmup is a class of learning rate schedules in which the learning rate is initially set to a small value and is gradually increased toward a target peak rate over the first few hundred to few thousand iterations of training, before any of the standard decay phases begin. The technique appeared informally in deep learning practice in the mid-2010s, was given a formal recipe by Goyal and colleagues for large-batch ImageNet training in 2017, and became a near-universal component of training pipelines for Transformer models after its use in the original Transformer and BERT papers.[1][2]

    The motivation is that the very first updates of training are unusually risky: the parameters are at their random initialization, gradients are large and poorly correlated with the true descent direction, and adaptive optimizers such as Adam have not yet accumulated reliable second-moment estimates. A small learning rate during this period prevents these early updates from pushing the iterate into a bad region of the loss landscape, after which a higher rate can be safely used for the bulk of training.

    Linear warmup

    The most common form is linear warmup. Given a target peak learning rate $ \eta_{\max} $ and a warmup length $ T_w $ measured in iterations (or sometimes epochs), the rate at iteration $ t $ is

    $ {\displaystyle \eta_t = \eta_{\max} \cdot \min\!\left(1, \frac{t}{T_w}\right).} $

    After step $ T_w $ the schedule continues with whatever decay was planned: cosine annealing, inverse-square-root decay, step decay, or a constant rate. A typical warmup length is between 1% and 10% of total training, with 500 to 4000 iterations being common in language model pretraining.
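
    A minimal sketch of the linear warmup rule above in plain Python (the function name and example values are illustrative, not taken from any particular library):

        def linear_warmup(step, peak_lr, warmup_steps):
            """Ramp linearly from 0 to peak_lr over warmup_steps, then hold."""
            return peak_lr * min(1.0, step / warmup_steps)

        # Example: a peak rate of 3e-4 reached after 2000 steps.
        for step in (1, 500, 2000, 10_000):
            print(step, linear_warmup(step, peak_lr=3e-4, warmup_steps=2000))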

    A close cousin, constant warmup, uses a small fixed rate $ \eta_0 \ll \eta_{\max} $ for the first $ T_w $ steps and then jumps to $ \eta_{\max} $. Goyal et al. observed that constant warmup is brittle when the gap between $ \eta_0 $ and $ \eta_{\max} $ is large, and recommended gradual warmup (the linear form above) as a more reliable default for large-batch training.

    Inverse-square-root warmup in transformers

    The original Transformer paper paired linear warmup with a subsequent inverse-square-root decay, producing a schedule that rises to a single peak at the end of warmup and decays smoothly thereafter:

    $ {\displaystyle \eta_t = d_{\text{model}}^{-1/2} \cdot \min\!\left( t^{-1/2},\, t \cdot T_w^{-3/2} \right),} $

    where $ d_{\text{model}} $ is the transformer model dimension and $ T_w $ is the warmup length (typically 4000 steps). The two branches of the minimum cross at $ t = T_w $, where the schedule transitions from a linearly increasing warmup phase to an inverse-square-root decay. This recipe was inherited, with small modifications, by many later sequence-to-sequence systems before linear warmup followed by cosine decay overtook it for autoregressive language model pretraining.
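
    The formula translates directly into code. A minimal sketch, using the base-model values $ d_{\text{model}} = 512 $ and $ T_w = 4000 $ from the paper (the function name is ours):

        def transformer_lr(step, d_model=512, warmup_steps=4000):
            """Inverse-square-root schedule: linear warmup for warmup_steps,
            then decay proportional to step ** -0.5."""
            step = max(step, 1)  # avoid dividing by zero at step 0
            return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)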

    Why warmup helps

    Several complementary explanations have been offered, none of which is fully settled. Liu and colleagues gave the most influential analysis: in Adam and its variants the adaptive second-moment estimate $ v_t $ has very high variance during the first few updates, which in turn makes the effective per-parameter step size $ \eta / (\sqrt{\hat{v}_t} + \epsilon) $ unreliable.[3] A short warmup period gives $ v_t $ time to stabilize before the global step size becomes large. The same paper proposed RAdam, a variant of Adam that incorporates a variance-rectification term and is intended to make explicit warmup unnecessary; in practice both RAdam and explicit warmup remain in use.

    A second line of reasoning concerns the loss landscape near initialization. Random weights produce activations and gradients that are far from the eventual training trajectory; large early steps can amplify imbalances between layers (for instance, between attention and feed-forward sub-layers in transformers) and trigger numerical instabilities such as exploding pre-softmax logits. Warmup limits per-step damage during this transient.

    A third explanation, specific to large-batch training, invokes the gradient noise scale. With very large batches the stochastic gradient is close to the true gradient, so the optimizer behaves almost deterministically; a high initial learning rate can then cause the iterate to overshoot persistently. Goyal et al. combined warmup with the linear scaling rule (peak rate proportional to batch size) to extend SGD with momentum to batches of 8192 images on ImageNet without loss of accuracy.
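
    For concreteness, a sketch of the linear scaling rule from that recipe, using the reference values of 0.1 for a batch of 256 from Goyal et al. (the function name is ours; treat the constants as one example rather than a universal default):

        def scaled_peak_lr(batch_size, base_lr=0.1, base_batch=256):
            """Linear scaling rule: the peak learning rate grows linearly with batch size."""
            return base_lr * batch_size / base_batch

        print(scaled_peak_lr(8192))  # 3.2, the peak rate warmed up to in the 8192-image runs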

    Variants and combinations

    Beyond the linear and inverse-square-root forms, several other shapes are used in practice.

    • Cosine warmup: the rate follows a half-cosine from $ 0 $ to $ \eta_{\max} $, providing a smoother start than the linear ramp. The shape is then mirrored by cosine decay, giving a single bell-shaped schedule (this shape and the exponential one below are sketched after the list).
    • Exponential warmup: $ \eta_t = \eta_{\max} \cdot (1 - e^{-t / \tau}) $ for some time constant $ \tau $. Less common in deep learning, but appears in reinforcement learning and meta-learning.
    • Per-layer warmup: in LAMB and related layer-wise adaptive optimizers, the trust-ratio mechanism produces an implicit warmup whose length differs per layer. Explicit global warmup is often retained on top of this.
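
    A short sketch of the two non-linear warmup shapes from the list, written as standalone functions (names and arguments are illustrative):

        import math

        def cosine_warmup(step, peak_lr, warmup_steps):
            """Half-cosine ramp from 0 to peak_lr over warmup_steps, then hold."""
            if step >= warmup_steps:
                return peak_lr
            return peak_lr * 0.5 * (1 - math.cos(math.pi * step / warmup_steps))

        def exponential_warmup(step, peak_lr, tau):
            """Exponential approach to peak_lr with time constant tau."""
            return peak_lr * (1 - math.exp(-step / tau))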

    Warmup combines naturally with most decay schedules. The dominant pattern in modern LLM pretraining is linear warmup followed by cosine decay to a fraction (often 10%) of the peak rate. Warmup is also commonly re-applied at the start of fine-tuning, on the same theory: a freshly initialized output head and an optimizer state reset by the framework both benefit from a brief low-rate phase.
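
    A minimal sketch of that dominant pattern, linear warmup followed by cosine decay to 10% of the peak rate (the function name and defaults are ours):

        import math

        def warmup_cosine(step, total_steps, peak_lr, warmup_steps, final_frac=0.1):
            """Linear warmup to peak_lr, then cosine decay to final_frac * peak_lr."""
            if step < warmup_steps:
                return peak_lr * step / warmup_steps
            progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
            floor = final_frac * peak_lr
            return floor + (peak_lr - floor) * 0.5 * (1 + math.cos(math.pi * progress))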

    Practical considerations

    Warmup adds two hyperparameters in principle ($ T_w $ and the starting rate, often zero), but most recipes fix the starting rate at zero or at a small constant such as $ 10^{-7} $ and tune only $ T_w $. Empirical guidance:

    • For Transformer pretraining, set $ T_w $ between 1% and 5% of total training steps. Values smaller than 0.5% frequently produce loss spikes, especially at large model widths.
    • For supervised fine-tuning, 50 to 500 steps of warmup is typical; the optimal length scales with the size of the dataset.
    • Increase $ T_w $ when scaling batch size or model size; both interventions amplify the early-training instabilities that warmup addresses.
    • When using mixed precision or FSDP / pipeline parallelism, longer warmup helps tolerate the initial mis-scaling of the dynamic loss-scaling factor and of layer-norm statistics.

    Warmup interacts with weight decay in AdamW: in implementations where the decoupled decay term is not multiplied by the scheduled rate, the absolute decay strength during warmup is unchanged while the gradient-driven updates are still small, so the relative pull toward zero is stronger. Some recipes therefore scale weight decay by the same factor used for the learning rate during warmup, although this is not universal.
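
    A sketch of that optional coupling, computing both hyperparameters from the same warmup multiplier (the function name is hypothetical):

        def scheduled_lr_and_wd(step, peak_lr, base_weight_decay, warmup_steps):
            """Scale the learning rate and, optionally, the weight decay by the same
            warmup factor so the pull toward zero does not dominate the small
            gradient-driven updates early in training."""
            warmup_factor = min(1.0, step / warmup_steps)
            return peak_lr * warmup_factor, base_weight_decay * warmup_factor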

    Limitations and alternatives

    Warmup is empirically robust but theoretically under-explained. Its effectiveness depends on details that are not fully characterized: the type of optimizer, the model architecture, the precision of arithmetic, and the initialization scheme. RAdam attempts to absorb the warmup phase into the optimizer itself; Adafactor and Lion likewise reduce the need for warmup in some regimes but do not eliminate it. Architectural fixes such as Pre-Norm transformers and careful initialization (Fixup, DeepNorm, ReZero) reduce the warmup length required for stability but rarely make warmup safe to omit at scale.

    In the absence of a single closed-form theory, the recommendation that has emerged across multiple research groups is to keep an explicit warmup phase in any training pipeline that uses an adaptive optimizer, large batches, or transformer-style architectures, and to treat its length as a hyperparameter worth a small targeted sweep when training a new model class for the first time.

    References

    1. Goyal, P. et al. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour. arXiv:1706.02677, 2017.
    2. Vaswani, A. et al. Attention Is All You Need. NeurIPS, 2017. arXiv:1706.03762.
    3. Liu, L. et al. On the Variance of the Adaptive Learning Rate and Beyond. ICLR, 2020. arXiv:1908.03265.