Incorporating Nesterov Momentum into Adam

    Research Paper
    Authors: Dozat, T.
    Year: 2016
    Venue: ICLR Workshop
    Topic area: Machine Learning
    Difficulty: Research
    Source: https://openreview.net/pdf?id=OM0jvwB8jIp57ZJjtNEZ

    Incorporating Nesterov Momentum into Adam is a 2016 ICLR Workshop paper by Timothy Dozat that introduces Nadam (Nesterov-accelerated Adaptive Moment Estimation), a first-order stochastic optimization algorithm. Nadam modifies the popular Adam optimizer (Kingma & Ba, 2014) by replacing its classical-momentum component with a reformulated version of Nesterov's accelerated gradient (NAG). The substitution is conceptually small but, on the paper's MNIST autoencoder benchmark, produces measurably faster convergence and lower training and validation loss than Adam, RMSProp, NAG, classical momentum, or plain SGD.

    Overview

    By 2016, Adam had become a default choice for training deep neural networks because it combines two effective ideas: a momentum term that accumulates a decaying mean of past gradients, and a per-parameter adaptive learning rate derived from a decaying mean of squared gradients. The momentum component, however, is the classical Polyak (1964) form, which Sutskever et al. (2013) had already shown to be empirically inferior to Nesterov's accelerated gradient when used as a standalone momentum scheme. Dozat's contribution is to graft the NAG insight onto Adam without disturbing its adaptive learning-rate machinery, producing an algorithm that retains Adam's hyperparameter regime and implementation footprint while inheriting NAG's "look-ahead" advantage.

    The paper is short — a four-page workshop submission — and presents a single empirical experiment, but its derivation is clean enough that the resulting algorithm has been adopted as the Nadam optimizer in major deep-learning frameworks including TensorFlow / Keras and PyTorch.

    Conceptually, the work fits into a broader 2014–2016 line of research on combining momentum with per-parameter adaptive learning rates. Adam itself can be read as a fusion of classical (Polyak) momentum with the RMSProp adaptive denominator (Tieleman & Hinton, 2012), and Nadam takes the natural next step of swapping in Nesterov momentum, which had become the preferred form for tasks where look-ahead matters. The paper does not claim novelty for any individual ingredient — Nesterov's algorithm dates to 1983 and Adam to 2014 — but for the specific composition that lets the look-ahead survive bias correction.

    Key Contributions

    • A reformulation of NAG. The paper rewrites Nesterov's accelerated gradient into a form that does not require evaluating the gradient at a temporarily perturbed parameter point. Instead, the next-step momentum factor is folded into the current update.
    • Nadam algorithm. Applying that same reformulation to Adam's momentum term yields the Nadam update rule, in which the bias-corrected first moment incorporates the upcoming momentum coefficient $ \mu_{t+1} $ rather than the previous one.
    • A schedule for $ \mu_t $. By indexing the momentum decay coefficient by timestep, Dozat anticipates the use of momentum schedules, a refinement that several reference implementations later adopted (a sketch of one such schedule appears after this list).
    • Empirical evidence on MNIST. A controlled comparison on a convolutional autoencoder shows that Nadam matches or beats Adam, with both algorithms outperforming SGD, classical momentum, NAG, and RMSProp under their respective best learning rates.
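
    One common realization of such a schedule warms $ \mu_t $ up toward its asymptotic value over the first few thousand steps. The sketch below is illustrative only: the warm-up form and the constants 0.96 and 250 are assumptions taken from common reference implementations of Nadam, not from anything stated in this article, and the function name is hypothetical.

        def mu_schedule(t, mu=0.975):
            # Momentum coefficient mu_t at (1-indexed) timestep t; ramps up
            # toward mu as t grows (constants assumed from reference code).
            return mu * (1.0 - 0.5 * 0.96 ** (t / 250.0))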

    Methods

    The derivation proceeds in three steps.

    Classical momentum (Polyak). Maintain a momentum vector that is a decaying sum of past gradient steps:

    $ m_t \leftarrow \mu m_{t-1} + \alpha_t g_t, \qquad \theta_t \leftarrow \theta_{t-1} - m_t. $
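
    A minimal sketch of one classical-momentum step in Python, matching the update above (the names theta, m, grad, lr, and mu are illustrative, not from the paper):

        def momentum_step(theta, m, grad, lr, mu=0.9):
            # m_t = mu * m_{t-1} + alpha_t * g_t  (decaying sum of gradient steps)
            m = mu * m + lr * grad
            # theta_t = theta_{t-1} - m_t
            return theta - m, m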

    Nesterov's accelerated gradient. Sutskever et al. (2013) showed that NAG can be implemented by evaluating the gradient at the look-ahead point $ \theta_{t-1} - \mu m_{t-1} $. Dozat rewrites this so that the momentum step NAG would take at the next timestep is instead folded into the current parameter update, removing the need to evaluate the gradient at a perturbed point:

    $ \theta_t \leftarrow \theta_{t-1} - (\mu_{t+1} m_t + \alpha_t g_t). $
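
    Under the same illustrative names, the reformulated NAG step can be sketched as follows; mu_next stands for $ \mu_{t+1} $, which must be known one step ahead (trivially so for a constant or precomputed schedule):

        def nag_step(theta, m, grad, lr, mu, mu_next):
            # Same momentum accumulation as the classical rule
            m = mu * m + lr * grad
            # Look-ahead folded into the current update:
            # theta_t = theta_{t-1} - (mu_{t+1} * m_t + alpha_t * g_t)
            return theta - (mu_next * m + lr * grad), m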

    Adam. Adam uses a decaying mean of past gradients (rather than a sum) and divides by a decaying root-mean-square of past squared gradients; each moment receives a bias correction, $ 1 - \mu^t $ for the first and $ 1 - \nu^t $ for the second:

    $ m_t \leftarrow \mu m_{t-1} + (1 - \mu) g_t, \qquad n_t \leftarrow \nu n_{t-1} + (1 - \nu) g_t^2, $
    $ \theta_t \leftarrow \theta_{t-1} - \alpha_t \frac{m_t / (1 - \mu^t)}{\sqrt{n_t / (1 - \nu^t)} + \epsilon}. $
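
    A compact scalar sketch of the full Adam step (array-valued parameters work the same way elementwise; names remain illustrative):

        import math

        def adam_step(theta, m, n, grad, t, lr, mu=0.9, nu=0.999, eps=1e-8):
            # Decaying means of gradients and of squared gradients
            m = mu * m + (1 - mu) * grad
            n = nu * n + (1 - nu) * grad ** 2
            # Bias-corrected estimates (t is 1-indexed)
            m_hat = m / (1 - mu ** t)
            n_hat = n / (1 - nu ** t)
            theta = theta - lr * m_hat / (math.sqrt(n_hat) + eps)
            return theta, m, n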

    Nadam. Apply the NAG reformulation to Adam by replacing the bias-corrected first moment with a look-ahead version whose momentum component is weighted by the upcoming coefficient $ \mu_{t+1} $:

    $ \hat{m}_t = \frac{\mu_{t+1} m_t}{1 - \prod_{i=1}^{t+1} \mu_i} + \frac{(1 - \mu_t) g_t}{1 - \prod_{i=1}^{t} \mu_i}, $
    $ \theta_t \leftarrow \theta_{t-1} - \frac{\alpha_t \hat{m}_t}{\sqrt{\hat{n}_t} + \epsilon}, $

    where $ \hat{n}_t = \nu n_t / (1 - \nu^t) $ is the bias-corrected second moment. The author also notes that the same NAG-style substitution is, in principle, compatible with other adaptive-learning-rate algorithms such as Adamax or Equilibrated gradient descent.
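
    Putting the pieces together, a hedged scalar sketch of one Nadam step; mus is assumed to hold the momentum schedule $ \mu_1, \dots, \mu_{t+1} $, and the second-moment correction follows the definition of $ \hat{n}_t $ above:

        import math

        def nadam_step(theta, m, n, grad, t, mus, lr, nu=0.999, eps=1e-8):
            mu_t, mu_next = mus[t - 1], mus[t]   # mu_t and mu_{t+1} (t is 1-indexed)
            m = mu_t * m + (1 - mu_t) * grad
            n = nu * n + (1 - nu) * grad ** 2
            prod_t = math.prod(mus[:t])          # product of mu_1 .. mu_t
            prod_next = prod_t * mu_next         # product of mu_1 .. mu_{t+1}
            m_hat = mu_next * m / (1 - prod_next) + (1 - mu_t) * grad / (1 - prod_t)
            n_hat = nu * n / (1 - nu ** t)       # bias-corrected second moment
            theta = theta - lr * m_hat / (math.sqrt(n_hat) + eps)
            return theta, m, n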

    Results

    The single experiment trains a convolutional autoencoder (three convolutional plus two dense layers in each of the encoder and decoder) on MNIST, compressing each $ 28 \times 28 $ digit into a 16-dimensional latent vector and reconstructing it. Six optimizers are compared — SGD, classical momentum, NAG, RMSProp, Adam, and Nadam — each tuned only over its learning rate; other hyperparameters are fixed at $ \mu = 0.975 $, $ \nu = 0.999 $, $ \epsilon = 10^{-8} $. Best learning rates were $ 0.2 $ for SGD, $ 0.5 $ for momentum and NAG, $ 0.001 $ for RMSProp, and $ 0.002 $ for Adam and Nadam.

    In both training and validation loss, Nadam reaches lower values faster than every other algorithm tested — including its parent Adam. The author emphasizes that this is achieved with no additional hyperparameter tuning beyond the unavoidable learning-rate sweep, supporting the claim that Nadam is a drop-in improvement on Adam rather than a more delicate algorithm.

    The autoencoder benchmark is deliberately modest: it isolates the optimizer's contribution by holding architecture, dataset, regularization, and initialization fixed across all six runs. The paper does not include large-scale image-classification or language-modeling experiments, and it does not investigate the interaction between Nadam and learning-rate warm-up, weight decay, or batch-size schedules — all of which subsequent work would explore. As workshop-track research, the empirical claim is intentionally narrow: that the NAG-style first-moment substitution is at least as good as classical-momentum Adam under a controlled comparison.

    Impact

    Nadam has become a standard option in mainstream deep-learning libraries: TensorFlow / Keras ship it as tf.keras.optimizers.Nadam, and PyTorch added it as torch.optim.NAdam. In practice it is most frequently chosen for tasks where Adam already performs well but slightly faster convergence in early training is desirable, such as language-model fine-tuning and certain computer-vision pipelines.
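
    For example, swapping Adam for Nadam in either framework is a one-line change. The learning rate shown is the paper's tuned value for its autoencoder benchmark and may not transfer to other tasks; the tiny PyTorch model exists only so the snippet runs:

        # TensorFlow / Keras
        import tensorflow as tf
        tf_opt = tf.keras.optimizers.Nadam(learning_rate=0.002)

        # PyTorch
        import torch
        model = torch.nn.Linear(4, 2)            # placeholder model for illustration
        torch_opt = torch.optim.NAdam(model.parameters(), lr=0.002)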

    The paper is also cited as an early example of cleanly transplanting an optimization-theory insight (NAG) onto an adaptive-moment optimizer, a recipe that subsequent work has replicated for variants such as AdamW (Loshchilov & Hutter, 2019) and RAdam (Liu et al., 2020). Because the modification is a single-line change to the bias-corrected first moment, Nadam's adoption did not require any new hyperparameters or implementation infrastructure, which substantially lowered the barrier to its uptake.

    A pragmatic consequence of this design is that practitioners can usually replace Adam with Nadam in an existing training pipeline without revisiting the learning-rate schedule, batch size, or regularization settings. Empirically the two algorithms produce qualitatively similar loss curves, with Nadam often a small but consistent step ahead in the first few thousand iterations — a regime that matters disproportionately for fine-tuning workloads where total compute is small. For training runs that are bottlenecked by gradient noise rather than by curvature, the two algorithms are essentially interchangeable.

    References

    • Dozat, T. (2016). Incorporating Nesterov Momentum into Adam. ICLR Workshop. OpenReview OM0jvwB8jIp57ZJjtNEZ.
    • Kingma, D. & Ba, J. (2014). Adam: A Method for Stochastic Optimization. arXiv:1412.6980.
    • Sutskever, I., Martens, J., Dahl, G. & Hinton, G. (2013). On the importance of initialization and momentum in deep learning. ICML.
    • Nesterov, Y. (1983). A method of solving a convex programming problem with convergence rate $ O(1/k^2) $. Soviet Mathematics Doklady, 27, 372–376.
    • Polyak, B. T. (1964). Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5), 1–17.
    • Tieleman, T. & Hinton, G. (2012). Lecture 6.5 — RMSprop: divide the gradient by a running average of its recent magnitude. COURSERA.
    • Duchi, J., Hazan, E. & Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. JMLR, 12, 2121–2159.
    • Dauphin, Y., de Vries, H. & Bengio, Y. (2015). Equilibrated adaptive learning rates for non-convex optimization. NeurIPS, 1504–1512.
    • Loshchilov, I. & Hutter, F. (2019). Decoupled weight decay regularization. ICLR.
    • Liu, L. et al. (2020). On the variance of the adaptive learning rate and beyond. ICLR.