RAdam

    Topic area: Optimization
    Prerequisites: Stochastic gradient descent, Gradient descent


    Overview

    Rectified Adam (RAdam) is an adaptive stochastic optimization algorithm that modifies the popular Adam optimizer by adding a closed-form rectification term to the adaptive learning rate. The rectification corrects for the large, and in the earliest steps unbounded, variance of the adaptive learning rate that arises while the second-moment estimate is still based on very few gradient samples; this variance is the underlying cause of the instability that practitioners have traditionally addressed with manual learning-rate warmup. By deriving the variance of the adaptive learning rate analytically and introducing a multiplicative correction, RAdam aims to deliver stable updates from step one without the tuning burden of a warmup schedule. It was introduced by Liu et al. in 2019 and is widely used as a drop-in replacement for Adam in computer-vision, language-modeling, and reinforcement-learning workloads.[1]

    Motivation

    Standard Adam maintains an exponential moving average of squared gradients $ v_t $ and uses $ \sqrt{\hat{v}_t} $ in the denominator of each update. Early in training, very few gradient samples have been accumulated, so $ \hat{v}_t $ is a high-variance estimator of the true second moment. Dividing by a noisy estimate produces unreliable, often excessively large step sizes that can push parameters into bad regions of the loss surface before the moment estimates have stabilized.

    The empirical workaround that emerged in the Deep learning community is learning-rate warmup: start with a small learning rate and ramp it up over a few hundred or thousand iterations. While effective, warmup introduces additional hyperparameters (warmup length, warmup schedule shape) that interact with the base learning rate, batch size, and dataset in ways that are hard to predict. RAdam is motivated by the observation that warmup is a heuristic patch for a problem that can be characterized analytically and corrected in closed form.

    Variance of the adaptive learning rate

    The core derivation in the RAdam paper computes the variance of the inverse adaptive scaling term $ 1/\sqrt{\hat{v}_t} $ as a function of $ t $ and the second-moment decay $ \beta_2 $. Under simplifying assumptions about the gradient distribution, this variance is shown to be unbounded for small $ t $, then to decrease monotonically toward a finite asymptote as $ t \to \infty $. The authors approximate the effective sample size by the length of the approximated simple moving average (SMA):

    $ {\displaystyle \rho_t = \rho_\infty - \frac{2 t \, \beta_2^t}{1 - \beta_2^t}, \qquad \rho_\infty = \frac{2}{1 - \beta_2} - 1.} $

    For typical $ \beta_2 = 0.999 $, $ \rho_\infty \approx 1999 $; the value of $ \rho_t $ equals one at the first step and approaches $ \rho_\infty $ as training proceeds. The variance of the adaptive denominator can then be written in closed form in terms of $ \rho_t $, which makes it possible to rectify the update so that its variance matches the long-run regime.
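
    The growth of $ \rho_t $ can be checked numerically. The following short Python snippet is an illustrative sketch (not from the original paper) that evaluates the formula above for the default $ \beta_2 = 0.999 $:

        beta2 = 0.999
        rho_inf = 2.0 / (1.0 - beta2) - 1.0          # 1999 for beta2 = 0.999

        def rho(t):
            """Approximated SMA length at step t (formula above)."""
            return rho_inf - 2.0 * t * beta2 ** t / (1.0 - beta2 ** t)

        for t in (1, 5, 100, 1000, 10000):
            print(t, round(rho(t), 1))               # climbs from 1.0 toward rho_inf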

    Algorithm

    Let $ \alpha $ be the base learning rate, $ (\beta_1, \beta_2) $ the moment decay rates, and $ \theta_{t-1} $ the parameters before step $ t $. Given gradient $ g_t $, RAdam updates the moments exactly as Adam does:

    $ {\displaystyle m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t, \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2,} $

    then computes the bias-corrected first moment $ \hat{m}_t = m_t / (1 - \beta_1^t) $ and the SMA length $ \rho_t $. The decision rule branches on whether the variance of the adaptive term is tractable:

    • If $ \rho_t > 4 $: compute the bias-corrected second-moment denominator $ \hat{v}_t = \sqrt{v_t / (1-\beta_2^t)} $ and the rectification factor

    $ {\displaystyle r_t = \sqrt{\frac{(\rho_t - 4)(\rho_t - 2) \rho_\infty}{(\rho_\infty - 4)(\rho_\infty - 2) \rho_t}},} $

    then take a rectified Adam step $ \theta_t = \theta_{t-1} - \alpha \, r_t \, \hat{m}_t / \hat{v}_t $.

    • Otherwise (early iterations, when the variance is intractable): take a momentum-only step $ \theta_t = \theta_{t-1} - \alpha \, \hat{m}_t $, equivalent to Stochastic gradient descent with momentum.

    The threshold $ \rho_t > 4 $ marks the point at which the derived variance of the adaptive term becomes finite and the numerator $ (\rho_t - 4)(\rho_t - 2) $ under the square root in $ r_t $ becomes positive; below it, the closed-form variance correction is undefined and RAdam falls back to the simpler SGD-with-momentum update.
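
    A complete update step is short enough to write out. The sketch below is a minimal NumPy rendering of the procedure described above; it is illustrative rather than a reference implementation, and it adds the usual $ \epsilon $ to the adaptive denominator as most implementations do:

        import numpy as np

        def radam_step(theta, grad, m, v, t, alpha=1e-3,
                       beta1=0.9, beta2=0.999, eps=1e-8):
            """One RAdam update at step t (t starts at 1)."""
            # Adam-style exponential moving averages of the gradient and its square
            m = beta1 * m + (1 - beta1) * grad
            v = beta2 * v + (1 - beta2) * grad ** 2

            m_hat = m / (1 - beta1 ** t)                     # bias-corrected first moment
            rho_inf = 2.0 / (1.0 - beta2) - 1.0
            rho_t = rho_inf - 2.0 * t * beta2 ** t / (1.0 - beta2 ** t)

            if rho_t > 4.0:
                # Variance is tractable: rectified adaptive step
                v_hat = np.sqrt(v / (1 - beta2 ** t))
                r_t = np.sqrt((rho_t - 4) * (rho_t - 2) * rho_inf /
                              ((rho_inf - 4) * (rho_inf - 2) * rho_t))
                theta = theta - alpha * r_t * m_hat / (v_hat + eps)
            else:
                # Early iterations: momentum-only (SGD-with-momentum) step
                theta = theta - alpha * m_hat
            return theta, m, v

        # Example usage on a toy quadratic objective
        theta = np.array([5.0, -3.0]); m = np.zeros(2); v = np.zeros(2)
        for t in range(1, 1001):
            grad = 2 * theta                                 # gradient of ||theta||^2
            theta, m, v = radam_step(theta, m, v, t)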

    Behavior in practice

    The rectification factor $ r_t $ is monotonically increasing in $ \rho_t $, starts well below one when $ \rho_t $ is small, and approaches one as $ \rho_t \to \rho_\infty $. As a function of training step, RAdam therefore behaves in three regimes:

    1. Phase 1 ($ \rho_t \le 4 $): pure momentum SGD, with no adaptive scaling at all. This typically lasts only a handful of iterations for default $ \beta_2 = 0.999 $.
    2. Phase 2 ($ \rho_t > 4 $, small): rectified updates with $ r_t \ll 1 $, so the effective learning rate is well below $ \alpha $. This is the implicit warmup.
    3. Phase 3 ($ \rho_t \to \rho_\infty $): $ r_t \to 1 $ and the algorithm coincides with bias-corrected Adam.

    The transition between the regimes is smooth and depends only on $ t $ and $ \beta_2 $, not on the gradient statistics. This makes the warmup schedule data-independent and removes warmup length from the hyperparameter list.
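
    The implicit warmup curve can be inspected directly. The snippet below (an illustrative sketch using the formulas above) classifies a few steps into the three phases and prints the rectification factor for $ \beta_2 = 0.999 $:

        import math

        beta2 = 0.999
        rho_inf = 2.0 / (1.0 - beta2) - 1.0

        for t in (1, 4, 5, 100, 1000, 10000):
            rho_t = rho_inf - 2.0 * t * beta2 ** t / (1.0 - beta2 ** t)
            if rho_t <= 4.0:
                print(f"t={t}: phase 1 (momentum SGD), rho_t={rho_t:.2f}")
            else:
                # Phases 2 and 3: r_t rises from well below one toward one
                r_t = math.sqrt((rho_t - 4) * (rho_t - 2) * rho_inf /
                                ((rho_inf - 4) * (rho_inf - 2) * rho_t))
                print(f"t={t}: r_t={r_t:.3f}")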

    Comparison to related optimizers

    • Adam. RAdam reduces to Adam as $ t \to \infty $ when $ r_t \to 1 $. The two differ only in the early phase, where RAdam multiplies by $ r_t < 1 $ or skips the adaptive denominator entirely.
    • Adam with warmup. The standard linear warmup schedule is a manual heuristic that scales the base learning rate from zero to $ \alpha $ over a fixed number of steps. RAdam replaces this heuristic with an analytically derived schedule that depends only on $ \beta_2 $.
    • AdamW. AdamW corrects how Adam couples weight decay with the adaptive denominator. AdamW and RAdam are orthogonal modifications and are sometimes combined as RAdamW.
    • SGD with momentum. RAdam's Phase 1 is exactly SGD with momentum and the same $ \beta_1 $. SGD generalizes better than Adam on many vision tasks; RAdam can inherit this property only in the very early iterations, after which it becomes an adaptive method.
    • LookAhead. LookAhead is a wrapper that periodically interpolates between a fast inner optimizer and a slow set of weights. It is also orthogonal to RAdam, and the combination "Ranger" (RAdam plus LookAhead) is a popular choice in computer-vision practice.

    Hyperparameters and defaults

    RAdam preserves the Adam interface. Recommended defaults are $ \alpha = 10^{-3} $ (or task-specific), $ \beta_1 = 0.9 $, $ \beta_2 = 0.999 $, and $ \epsilon = 10^{-8} $. The key practical difference is that the warmup-length hyperparameter is removed; the warmup schedule is now implicit in the choice of $ \beta_2 $. Practitioners porting from "Adam plus warmup" recipes typically find that the manual warmup steps can be deleted with no loss in final accuracy and often a small gain in stability.
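
    As a usage illustration, assuming a PyTorch version recent enough to include torch.optim.RAdam, switching from Adam is a one-line change and no warmup scheduler is attached; the model and batch below are placeholders:

        import torch

        model = torch.nn.Linear(10, 1)                       # placeholder model
        optimizer = torch.optim.RAdam(model.parameters(),
                                      lr=1e-3, betas=(0.9, 0.999), eps=1e-8)

        # No warmup scheduler is needed; the rectification supplies it implicitly.
        for x, y in [(torch.randn(32, 10), torch.randn(32, 1))]:  # placeholder batch
            optimizer.zero_grad()
            loss = torch.nn.functional.mse_loss(model(x), y)
            loss.backward()
            optimizer.step()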

    Limitations

    The variance derivation assumes that gradient samples are stationary and approximately independent across steps, which is not exactly true in practice (mini-batches are correlated, learning-rate decay changes the gradient distribution). Empirically, RAdam still works well even when these assumptions are violated, but the theoretical guarantee is weaker than the cleanliness of the formula suggests. RAdam also inherits Adam's worse generalization on some image-classification benchmarks compared with well-tuned SGD; the rectification addresses early-stage instability but not the broader generalization gap. Finally, the rectification provides a fixed warmup-like schedule; tasks that benefit from longer or shaped warmups (very large neural networks, extreme batch sizes) may still need additional tuning on top of RAdam.

    References
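
    1. Liu, Liyuan; Jiang, Haoming; He, Pengcheng; Chen, Weizhu; Liu, Xiaodong; Gao, Jianfeng; Han, Jiawei (2019). "On the Variance of the Adaptive Learning Rate and Beyond". arXiv:1908.03265. Published at ICLR 2020.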