Nesterov Momentum

    Topic area: Optimization
    Prerequisites: Stochastic Gradient Descent, Gradient Descent


    Overview

    Nesterov momentum, also called Nesterov accelerated gradient (NAG), is a first-order optimization method that augments Gradient Descent with a velocity term whose gradient is evaluated at a lookahead point. Introduced by Yurii Nesterov in 1983 for smooth convex problems, it achieves an optimal worst-case convergence rate of $ O(1/t^2) $ on convex functions with Lipschitz-continuous gradients, compared with $ O(1/t) $ for plain gradient descent. In modern machine learning it is most commonly encountered as a drop-in replacement for the momentum option of Stochastic Gradient Descent, where it often yields modestly faster convergence and better stability than classical heavy-ball momentum.

    The defining idea is simple: instead of computing the gradient at the current parameter vector and then taking a momentum-weighted step, NAG first applies the previous velocity to obtain a lookahead point, evaluates the gradient there, and only then applies the correction. This small reordering produces a noticeable algorithmic difference and underlies the method's theoretical acceleration.

    Background: classical momentum

    Classical momentum, sometimes called Polyak's heavy-ball method, augments the gradient update with an exponentially weighted history of past gradients. Let $ \theta_t $ denote the parameters at step $ t $, $ \eta $ the learning rate, $ \mu \in [0,1) $ the momentum coefficient, and $ \nabla f(\theta_t) $ the gradient of the objective. The classical update is

    $ {\displaystyle v_{t+1} = \mu v_t + \nabla f(\theta_t),} $ $ {\displaystyle \theta_{t+1} = \theta_t - \eta\, v_{t+1}.} $

    The velocity $ v_t $ behaves like the momentum of a particle rolling in the loss landscape, damping high-frequency oscillations across narrow ravines while accumulating speed along consistent descent directions. Heavy-ball momentum can dramatically accelerate progress in poorly conditioned problems but is known to overshoot and to lack the optimal accelerated convergence rate on general smooth convex functions.
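
    A minimal NumPy sketch of this update may help; the function name and hyperparameter defaults below are illustrative choices, not part of any library API:

        import numpy as np

        def heavy_ball(grad, theta0, eta=0.1, mu=0.9, steps=100):
            """Classical (Polyak) momentum:
            v_{t+1} = mu * v_t + grad(theta_t),  theta_{t+1} = theta_t - eta * v_{t+1}."""
            theta = np.asarray(theta0, dtype=float)
            v = np.zeros_like(theta)
            for _ in range(steps):
                v = mu * v + grad(theta)   # accumulate an exponentially weighted gradient history
                theta = theta - eta * v    # step along the velocity
            return theta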

    The Nesterov update

    Nesterov's modification evaluates the gradient at the lookahead point $ \theta_t - \eta \mu v_t $ rather than at $ \theta_t $. The update becomes

    $ {\displaystyle v_{t+1} = \mu v_t + \nabla f\!\left(\theta_t - \eta \mu v_t\right),} $ $ {\displaystyle \theta_{t+1} = \theta_t - \eta\, v_{t+1}.} $

    Reading this from left to right: the optimizer commits in advance to the inertial step $ -\eta \mu v_t $, then asks "what is the gradient at the place I am about to land?" and uses that to correct the velocity. If the inertial step is heading toward an upturn, the gradient at the lookahead point will already point back, reducing overshoot before it happens.
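
    A direct transcription of the lookahead form, in the same illustrative NumPy style as the classical sketch above:

        import numpy as np

        def nag_lookahead(grad, theta0, eta=0.1, mu=0.9, steps=100):
            """Nesterov momentum, lookahead form: the gradient is evaluated
            at theta_t - eta*mu*v_t before being folded into the velocity."""
            theta = np.asarray(theta0, dtype=float)
            v = np.zeros_like(theta)
            for _ in range(steps):
                lookahead = theta - eta * mu * v   # where inertia alone would land
                v = mu * v + grad(lookahead)       # correct using the gradient there
                theta = theta - eta * v
            return theta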

    In its original convex form Nesterov used a time-varying coefficient $ \mu_t $ derived from a recursion on auxiliary sequences, with $ \mu_t \to 1 $ as $ t \to \infty $. Most deep-learning implementations replace this with a constant $ \mu $ (typically 0.9 or 0.99), which sacrifices the formal acceleration guarantee but retains the practical benefits.
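
    One common way to write the schedule (the FISTA-style form, which reproduces Nesterov's original auxiliary sequences) is

    $ {\displaystyle \lambda_1 = 1, \qquad \lambda_{t+1} = \frac{1 + \sqrt{1 + 4\lambda_t^2}}{2}, \qquad \mu_t = \frac{\lambda_t - 1}{\lambda_{t+1}}.} $

    Since $ \lambda_t $ grows like $ t/2 $, the coefficient behaves as $ \mu_t \approx 1 - 3/t $, confirming that $ \mu_t \to 1 $ as $ t \to \infty $.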

    Intuition: lookahead correction

    A useful mental picture is to split each iteration into two phases. First the iterate slides along its current velocity, exactly as classical momentum would. Second, a gradient evaluated at this projected position acts as a corrective force. Classical momentum, by contrast, bases its correction on the gradient at the current position, which is "behind" the actual motion. The lookahead therefore gives NAG one half-step of extra information per iteration.

    This effect is most visible on quadratic objectives with elongated level sets. Heavy-ball iterates oscillate across the narrow direction with slowly decaying amplitude. NAG iterates damp the same oscillation faster because each correction is informed by where the parameter is heading, not where it currently sits.
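
    Reusing the heavy_ball and nag_lookahead sketches above on an ill-conditioned quadratic makes this concrete; the conditioning, step size, and iteration count are arbitrary illustrative choices:

        import numpy as np

        # f(theta) = 0.5 * theta^T A theta with eigenvalues 100 and 1:
        # level sets are ellipses elongated along the second axis.
        A = np.diag([100.0, 1.0])

        def grad(theta):
            return A @ theta

        theta0 = np.array([1.0, 1.0])
        for name, opt in [("heavy-ball", heavy_ball), ("NAG", nag_lookahead)]:
            theta = opt(grad, theta0, eta=0.01, mu=0.9, steps=50)
            print(f"{name}: distance to optimum = {np.linalg.norm(theta):.4f}")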

    Convergence properties

    For an $ L $-smooth convex function $ f $ with minimum $ f^* $, Nesterov's accelerated gradient with step size $ \eta = 1/L $ satisfies

    $ {\displaystyle f(\theta_t) - f^* \le \frac{2L\,\|\theta_0 - \theta^*\|^2}{(t+1)^2},} $

    which matches the lower bound for any first-order method that only accesses gradient information.[1] On strongly convex objectives with strong-convexity constant $ m $ (written $ m $ here to avoid a clash with the momentum coefficient $ \mu $), the rate becomes linear with contraction factor $ 1 - \sqrt{m/L} $ per step, again strictly better than gradient descent's $ 1 - m/L $.
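
    In the strongly convex regime the time-varying schedule can moreover be replaced by a constant coefficient; a standard choice and the resulting guarantee (see [1]) are

    $ {\displaystyle \mu = \frac{\sqrt{L} - \sqrt{m}}{\sqrt{L} + \sqrt{m}}, \qquad f(\theta_t) - f^* \le C_0 \left(1 - \sqrt{\tfrac{m}{L}}\right)^{t},} $

    where $ C_0 $ depends on the initialization. The number of iterations to reach accuracy $ \epsilon $ therefore scales with $ \sqrt{L/m} $ rather than $ L/m $.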

    These guarantees assume exact gradients and the original time-varying schedule. With stochastic gradients, as in deep-learning practice, the formal acceleration no longer holds, but the lookahead structure typically still helps empirically.

    Practical formulation

    The form above requires evaluating the gradient at $ \theta_t - \eta \mu v_t $, which is awkward in autodiff frameworks that prefer to compute gradients at the parameters they currently store. Sutskever et al.[2] showed that a change of variables yields an equivalent update that always evaluates the gradient at the current parameters:

    $ {\displaystyle v_{t+1} = \mu v_t - \eta\, \nabla f(\theta_t),} $ $ {\displaystyle \theta_{t+1} = \theta_t + \mu v_{t+1} - \eta\, \nabla f(\theta_t).} $

    This is the form implemented by frameworks such as PyTorch (torch.optim.SGD(..., nesterov=True)) and TensorFlow Keras. It produces the same trajectory as the lookahead form up to a reparameterisation and is what most practitioners actually run.
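
    In PyTorch, for example, enabling this variant is a one-flag change; the model and batch below are placeholders for illustration:

        import torch

        model = torch.nn.Linear(10, 1)                    # placeholder model
        opt = torch.optim.SGD(model.parameters(), lr=0.1,
                              momentum=0.9, nesterov=True)

        x, y = torch.randn(32, 10), torch.randn(32, 1)    # dummy batch
        loss = torch.nn.functional.mse_loss(model(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()                                        # Sutskever-style NAG update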

    Comparisons with related methods

    Nesterov momentum sits between classical heavy-ball momentum and adaptive methods such as Adam and RMSprop in the design space of optimizers. Compared with heavy-ball, it usually trains faster and overshoots less; the difference is small but consistent on convolutional vision models, where SGD with Nesterov momentum remains a strong baseline. Compared with adaptive methods, NAG does not rescale gradients per parameter, which makes it less robust to heterogeneous gradient magnitudes but often gives better final generalisation on well-tuned vision benchmarks.

    Lookahead optimizers and the closely related Anderson acceleration share the spirit of NAG, in that both use auxiliary sequences to cancel oscillatory behaviour, but they apply the correction at coarser granularities or via more general extrapolation schemes.

    Limitations

    Despite its elegant theory, Nesterov momentum has several practical caveats. Its formal acceleration depends on smoothness, convexity, and exact gradients, none of which hold in deep network training; empirical gains over classical momentum are often modest and sometimes negligible once learning rate and weight decay are well tuned. With a very high momentum coefficient (close to 1) and a large learning rate it can still overshoot or diverge on noisy problems. Finally, on objectives where adaptive scaling is critical, such as transformer language models, NAG is rarely competitive with Adam or its variants and is seldom the default choice.

    References

    1. Nesterov, Y. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic, 2004.
    2. Sutskever, I., Martens, J., Dahl, G., Hinton, G. "On the importance of initialization and momentum in deep learning." ICML 2013.