<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://marovi.ai/index.php?action=history&amp;feed=atom&amp;title=Incorporating_Nesterov_Momentum_into_Adam%2Fen</id>
	<title>Incorporating Nesterov Momentum into Adam/en - Revision history</title>
	<link rel="self" type="application/atom+xml" href="https://marovi.ai/index.php?action=history&amp;feed=atom&amp;title=Incorporating_Nesterov_Momentum_into_Adam%2Fen"/>
	<link rel="alternate" type="text/html" href="https://marovi.ai/index.php?title=Incorporating_Nesterov_Momentum_into_Adam/en&amp;action=history"/>
	<updated>2026-04-27T17:01:25Z</updated>
	<subtitle>Revision history for this page on the wiki</subtitle>
	<generator>MediaWiki 1.39.1</generator>
	<entry>
		<id>https://marovi.ai/index.php?title=Incorporating_Nesterov_Momentum_into_Adam/en&amp;diff=12926&amp;oldid=prev</id>
		<title>FuzzyBot: Updating to match new version of source page</title>
		<link rel="alternate" type="text/html" href="https://marovi.ai/index.php?title=Incorporating_Nesterov_Momentum_into_Adam/en&amp;diff=12926&amp;oldid=prev"/>
		<updated>2026-04-27T08:07:42Z</updated>

		<summary type="html">&lt;p&gt;Updating to match new version of source page&lt;/p&gt;
&lt;p&gt;&lt;b&gt;New page&lt;/b&gt;&lt;/p&gt;&lt;div&gt;&amp;lt;languages /&amp;gt;&lt;br /&gt;
{{PaperTabs}}&lt;br /&gt;
{{PaperInfobox&lt;br /&gt;
 | topic_area  = Machine Learning&lt;br /&gt;
 | difficulty  = Research&lt;br /&gt;
 | authors     = Dozat, T.&lt;br /&gt;
 | year        = 2016&lt;br /&gt;
 | venue       = ICLR Workshop&lt;br /&gt;
 | source_url  = https://openreview.net/pdf?id=OM0jvwB8jIp57ZJjtNEZ&lt;br /&gt;
}}&lt;br /&gt;
{{ContentMeta&lt;br /&gt;
 | generated_by   = claude-code-direct&lt;br /&gt;
 | model_used     = claude-opus-4-7&lt;br /&gt;
 | generated_date = 2026-04-27&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;Incorporating Nesterov Momentum into Adam&amp;#039;&amp;#039;&amp;#039; is a 2016 ICLR Workshop paper by Timothy Dozat that introduces &amp;#039;&amp;#039;&amp;#039;Nadam&amp;#039;&amp;#039;&amp;#039; (Nesterov-accelerated Adaptive Moment Estimation), a first-order [[Stochastic Gradient Descent|stochastic optimization]] algorithm. Nadam modifies the popular Adam optimizer (Kingma &amp;amp; Ba, 2014) by replacing its classical-momentum component with a reformulated version of Nesterov&amp;#039;s accelerated gradient (NAG). The substitution is conceptually small but, on the paper&amp;#039;s MNIST autoencoder benchmark, produces measurably faster convergence and lower training and validation loss than Adam, RMSProp, NAG, classical momentum, or plain SGD.&lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
&lt;br /&gt;
By 2016, Adam had become a default choice for training deep neural networks because it combines two effective ideas: a momentum term that accumulates a decaying mean of past gradients, and a per-parameter adaptive learning rate derived from a decaying mean of squared gradients. The momentum component, however, is the classical Polyak (1964) form, which Sutskever et al. (2013) had already shown to be empirically inferior to Nesterov&amp;#039;s accelerated gradient when used as a standalone momentum scheme. Dozat&amp;#039;s contribution is to graft the NAG insight onto Adam without disturbing its adaptive learning-rate machinery, producing an algorithm that retains Adam&amp;#039;s hyperparameter regime and implementation footprint while inheriting NAG&amp;#039;s &amp;quot;look-ahead&amp;quot; advantage.&lt;br /&gt;
&lt;br /&gt;
The paper is short — a four-page workshop submission — and presents a single empirical experiment, but its derivation is clean enough that the resulting algorithm has been adopted as the &amp;lt;code&amp;gt;Nadam&amp;lt;/code&amp;gt; optimizer in major deep-learning frameworks including TensorFlow / Keras and PyTorch.&lt;br /&gt;
&lt;br /&gt;
Conceptually, the work fits into a broader 2014–2016 line of research on combining momentum with per-parameter adaptive learning rates. Adam itself can be read as a fusion of classical (Polyak) momentum with the RMSProp adaptive denominator (Tieleman &amp;amp; Hinton, 2012), and Nadam takes the natural next step of swapping in Nesterov momentum, which had become the preferred form for tasks where look-ahead matters. The paper does not claim novelty for any individual ingredient — Nesterov&amp;#039;s algorithm dates to 1983 and Adam to 2014 — but for the specific composition that lets the look-ahead survive bias correction.&lt;br /&gt;
&lt;br /&gt;
== Key Contributions ==&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;A reformulation of NAG.&amp;#039;&amp;#039;&amp;#039; The paper rewrites Nesterov&amp;#039;s accelerated gradient into a form that does not require evaluating the gradient at a temporarily perturbed parameter point. Instead, the next-step momentum factor is folded into the current update.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Nadam algorithm.&amp;#039;&amp;#039;&amp;#039; Applying that same reformulation to Adam&amp;#039;s momentum term yields the Nadam update rule, in which the bias-corrected first moment incorporates the upcoming momentum coefficient &amp;lt;math&amp;gt;\mu_{t+1}&amp;lt;/math&amp;gt; rather than the previous one.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;A schedule for &amp;lt;math&amp;gt;\mu_t&amp;lt;/math&amp;gt;.&amp;#039;&amp;#039;&amp;#039; By indexing the momentum decay coefficient by timestep, Dozat anticipates the use of momentum schedules, a refinement that several reference implementations later adopted (a sketch of one such schedule follows this list).&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Empirical evidence on MNIST.&amp;#039;&amp;#039;&amp;#039; A controlled comparison on a convolutional autoencoder shows that Nadam matches or beats Adam, with both algorithms outperforming SGD, classical momentum, NAG, and RMSProp under their respective best learning rates.&lt;br /&gt;
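&lt;br /&gt;
A minimal Python sketch of one such schedule. The constants follow the warm-up used by the &amp;lt;code&amp;gt;torch.optim.NAdam&amp;lt;/code&amp;gt; reference implementation, whose default &amp;lt;code&amp;gt;momentum_decay&amp;lt;/code&amp;gt; of 0.004 corresponds to the &amp;lt;math&amp;gt;t/250&amp;lt;/math&amp;gt; exponent below; the function name is illustrative, not from the paper:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;python&amp;quot;&amp;gt;&lt;br /&gt;
def mu_schedule(t, mu=0.975):&lt;br /&gt;
    # Warm-up: mu_t rises from roughly mu/2 toward mu as t grows,&lt;br /&gt;
    # damping the look-ahead early in training.&lt;br /&gt;
    return mu * (1.0 - 0.5 * 0.96 ** (t / 250.0))&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;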
&lt;br /&gt;
== Methods ==&lt;br /&gt;
&lt;br /&gt;
The derivation proceeds in three steps.&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;Classical momentum (Polyak).&amp;#039;&amp;#039;&amp;#039; Maintain a momentum vector that is a decaying sum of past gradient steps:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;m_t \leftarrow \mu m_{t-1} + \alpha_t g_t, \qquad \theta_t \leftarrow \theta_{t-1} - m_t.&amp;lt;/math&amp;gt;&lt;br /&gt;
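&lt;br /&gt;
A minimal Python sketch of this update, with the learning rate set to the value the paper found best for the momentum methods (the function name is illustrative, and the arguments are assumed to be NumPy-compatible arrays):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;python&amp;quot;&amp;gt;&lt;br /&gt;
def momentum_step(theta, m, g, alpha=0.5, mu=0.975):&lt;br /&gt;
    # m is a decaying sum of past gradient steps; the parameters&lt;br /&gt;
    # move by the full velocity m.&lt;br /&gt;
    m = mu * m + alpha * g&lt;br /&gt;
    theta = theta - m&lt;br /&gt;
    return theta, m&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;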
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;Nesterov&amp;#039;s accelerated gradient.&amp;#039;&amp;#039;&amp;#039; Sutskever et al. (2013) showed that NAG can be implemented by evaluating the gradient at the look-ahead point &amp;lt;math&amp;gt;\theta_{t-1} - \mu m_{t-1}&amp;lt;/math&amp;gt;. Dozat rewrites this so that the momentum contribution of the next step is applied directly in the current parameter update, removing the need to evaluate the gradient at a perturbed point:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;m_t \leftarrow \mu_t m_{t-1} + \alpha_t g_t, \qquad \theta_t \leftarrow \theta_{t-1} - (\mu_{t+1} m_t + \alpha_t g_t).&amp;lt;/math&amp;gt;&lt;br /&gt;
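&lt;br /&gt;
Continuing the sketch with a constant momentum coefficient, so that &amp;lt;math&amp;gt;\mu_{t+1} = \mu&amp;lt;/math&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;python&amp;quot;&amp;gt;&lt;br /&gt;
def nag_step(theta, m, g, alpha=0.5, mu=0.975):&lt;br /&gt;
    # Same velocity update as classical momentum ...&lt;br /&gt;
    m = mu * m + alpha * g&lt;br /&gt;
    # ... but the parameter step adds the next momentum contribution&lt;br /&gt;
    # mu * m to the current gradient step, so the gradient is never&lt;br /&gt;
    # evaluated at a perturbed point.&lt;br /&gt;
    theta = theta - (mu * m + alpha * g)&lt;br /&gt;
    return theta, m&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;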
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;Adam.&amp;#039;&amp;#039;&amp;#039; Adam uses a decaying mean of past gradients (rather than a sum) and divides by the root of a decaying mean &amp;lt;math&amp;gt;n_t&amp;lt;/math&amp;gt; of past squared gradients, with bias corrections &amp;lt;math&amp;gt;1 - \mu^t&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;1 - \nu^t&amp;lt;/math&amp;gt; applied to the two moments:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;m_t \leftarrow \mu m_{t-1} + (1 - \mu) g_t, \qquad n_t \leftarrow \nu n_{t-1} + (1 - \nu) g_t^2, \qquad \theta_t \leftarrow \theta_{t-1} - \alpha_t \frac{m_t / (1 - \mu^t)}{\sqrt{n_t / (1 - \nu^t)} + \epsilon}.&amp;lt;/math&amp;gt;&lt;br /&gt;
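&lt;br /&gt;
A matching sketch of one Adam step (hyperparameter defaults are taken from the experiment in this paper, not from the Adam paper; &amp;lt;math&amp;gt;t&amp;lt;/math&amp;gt; starts at 1):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;python&amp;quot;&amp;gt;&lt;br /&gt;
import numpy as np&lt;br /&gt;
&lt;br /&gt;
def adam_step(theta, m, n, g, t, alpha=0.002, mu=0.975, nu=0.999, eps=1e-8):&lt;br /&gt;
    m = mu * m + (1.0 - mu) * g          # decaying mean of gradients&lt;br /&gt;
    n = nu * n + (1.0 - nu) * g * g      # decaying mean of squared gradients&lt;br /&gt;
    m_hat = m / (1.0 - mu ** t)          # bias-corrected first moment&lt;br /&gt;
    n_hat = n / (1.0 - nu ** t)          # bias-corrected second moment&lt;br /&gt;
    theta = theta - alpha * m_hat / (np.sqrt(n_hat) + eps)&lt;br /&gt;
    return theta, m, n&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;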
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;Nadam.&amp;#039;&amp;#039;&amp;#039; Apply the NAG reformulation to Adam by replacing the bias-corrected first moment with one that uses &amp;lt;math&amp;gt;\mu_{t+1}&amp;lt;/math&amp;gt; instead of &amp;lt;math&amp;gt;\mu_t&amp;lt;/math&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\hat{m}_t = \frac{\mu_{t+1} m_t}{1 - \prod_{i=1}^{t+1} \mu_i} + \frac{(1 - \mu_t) g_t}{1 - \prod_{i=1}^{t} \mu_i},&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\theta_t \leftarrow \theta_{t-1} - \frac{\alpha_t \hat{m}_t}{\sqrt{\hat{n}_t} + \epsilon},&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;math&amp;gt;\hat{n}_t = \nu n_t / (1 - \nu^t)&amp;lt;/math&amp;gt; is the bias-corrected second moment. The author also notes that the same NAG-style substitution is, in principle, compatible with other adaptive-learning-rate algorithms such as AdaMax or equilibrated gradient descent (Dauphin et al., 2015).&lt;br /&gt;
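&lt;br /&gt;
Putting the pieces together, a minimal Python sketch of one Nadam step. A constant &amp;lt;math&amp;gt;\mu&amp;lt;/math&amp;gt; is assumed, so the products &amp;lt;math&amp;gt;\prod_{i=1}^{t} \mu_i&amp;lt;/math&amp;gt; collapse to powers &amp;lt;math&amp;gt;\mu^t&amp;lt;/math&amp;gt;; an implementation with a schedule would track the running products instead:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;python&amp;quot;&amp;gt;&lt;br /&gt;
import numpy as np&lt;br /&gt;
&lt;br /&gt;
def nadam_step(theta, m, n, g, t, alpha=0.002, mu=0.975, nu=0.999, eps=1e-8):&lt;br /&gt;
    m = mu * m + (1.0 - mu) * g&lt;br /&gt;
    n = nu * n + (1.0 - nu) * g * g&lt;br /&gt;
    # NAG-style first moment: the next-step coefficient multiplies&lt;br /&gt;
    # m, and each term gets its own bias correction.&lt;br /&gt;
    m_hat = (mu * m) / (1.0 - mu ** (t + 1)) + ((1.0 - mu) * g) / (1.0 - mu ** t)&lt;br /&gt;
    n_hat = nu * n / (1.0 - nu ** t)   # bias-corrected second moment&lt;br /&gt;
    theta = theta - alpha * m_hat / (np.sqrt(n_hat) + eps)&lt;br /&gt;
    return theta, m, n&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;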
&lt;br /&gt;
== Results ==&lt;br /&gt;
&lt;br /&gt;
The single experiment trains a [[Convolutional Neural Networks|convolutional autoencoder]] (three convolutional plus two dense layers in each of the encoder and decoder) on MNIST, compressing each &amp;lt;math&amp;gt;28 \times 28&amp;lt;/math&amp;gt; digit into a 16-dimensional latent vector and reconstructing it. Six optimizers are compared — SGD, classical momentum, NAG, RMSProp, Adam, and Nadam — each tuned only over its learning rate; other hyperparameters are fixed at &amp;lt;math&amp;gt;\mu = 0.975&amp;lt;/math&amp;gt;, &amp;lt;math&amp;gt;\nu = 0.999&amp;lt;/math&amp;gt;, &amp;lt;math&amp;gt;\epsilon = 10^{-8}&amp;lt;/math&amp;gt;. Best learning rates were &amp;lt;math&amp;gt;0.2&amp;lt;/math&amp;gt; for SGD, &amp;lt;math&amp;gt;0.5&amp;lt;/math&amp;gt; for momentum and NAG, &amp;lt;math&amp;gt;0.001&amp;lt;/math&amp;gt; for RMSProp, and &amp;lt;math&amp;gt;0.002&amp;lt;/math&amp;gt; for Adam and Nadam.&lt;br /&gt;
&lt;br /&gt;
In both training and validation loss, Nadam reaches lower values faster than every other algorithm tested — including its parent Adam. The author emphasizes that this is achieved with no additional hyperparameter tuning beyond the unavoidable learning-rate sweep, supporting the claim that Nadam is a drop-in improvement on Adam rather than a more delicate algorithm.&lt;br /&gt;
&lt;br /&gt;
The autoencoder benchmark is deliberately modest: it isolates the optimizer&amp;#039;s contribution by holding architecture, dataset, regularization, and initialization fixed across all six runs. The paper does not include large-scale image-classification or language-modeling experiments, and it does not investigate the interaction between Nadam and learning-rate warm-up, weight decay, or batch-size schedules — all of which subsequent work would explore. As workshop-track research, the empirical claim is intentionally narrow: that the NAG-style first-moment substitution is at least as good as classical-momentum Adam under a controlled comparison.&lt;br /&gt;
&lt;br /&gt;
== Impact ==&lt;br /&gt;
&lt;br /&gt;
Nadam has become a standard option in mainstream deep-learning libraries: TensorFlow / Keras ship it as &amp;lt;code&amp;gt;tf.keras.optimizers.Nadam&amp;lt;/code&amp;gt;, and PyTorch added it as &amp;lt;code&amp;gt;torch.optim.NAdam&amp;lt;/code&amp;gt;. In practice it is most frequently chosen for tasks where Adam already performs well but slightly faster convergence in early training is desirable, such as language-model fine-tuning and certain computer-vision pipelines.&lt;br /&gt;
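&lt;br /&gt;
Swapping the optimizer in an existing PyTorch training loop is a one-line change; the toy model and data below are placeholders:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;python&amp;quot;&amp;gt;&lt;br /&gt;
import torch&lt;br /&gt;
&lt;br /&gt;
model = torch.nn.Linear(10, 1)&lt;br /&gt;
# Same constructor shape as torch.optim.Adam; lr=2e-3 is the best&lt;br /&gt;
# Adam/Nadam learning rate reported in the paper.&lt;br /&gt;
optimizer = torch.optim.NAdam(model.parameters(), lr=2e-3)&lt;br /&gt;
&lt;br /&gt;
loss = model(torch.randn(4, 10)).pow(2).mean()&lt;br /&gt;
optimizer.zero_grad()&lt;br /&gt;
loss.backward()&lt;br /&gt;
optimizer.step()&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;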
&lt;br /&gt;
The paper is also cited as an early example of cleanly transplanting an optimization-theory insight (NAG) onto an adaptive-moment optimizer, a recipe that subsequent work has replicated for variants such as AdamW (Loshchilov &amp;amp; Hutter, 2019) and RAdam (Liu et al., 2020). Because the modification is a single-line change to the bias-corrected first moment, Nadam&amp;#039;s adoption did not require any new hyperparameters or implementation infrastructure, which substantially lowered the barrier to its uptake.&lt;br /&gt;
&lt;br /&gt;
A pragmatic consequence of this design is that practitioners can usually replace Adam with Nadam in an existing training pipeline without revisiting the learning-rate schedule, batch size, or regularization settings. Empirically the two algorithms produce qualitatively similar loss curves, with Nadam often a small but consistent step ahead in the first few thousand iterations — a regime that matters disproportionately for fine-tuning workloads where total compute is small. For training runs that are bottlenecked by gradient noise rather than by curvature, the two algorithms are essentially interchangeable.&lt;br /&gt;
&lt;br /&gt;
== See also ==&lt;br /&gt;
&lt;br /&gt;
* [[Stochastic Gradient Descent]]&lt;br /&gt;
* [[Convolutional Neural Networks]]&lt;br /&gt;
* [[Backpropagation]]&lt;br /&gt;
* [[Batch Normalization]]&lt;br /&gt;
* [[Recurrent Neural Networks]]&lt;br /&gt;
* [[Attention Is All You Need]]&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
* Dozat, T. (2016). &amp;#039;&amp;#039;Incorporating Nesterov Momentum into Adam&amp;#039;&amp;#039;. ICLR Workshop. OpenReview &amp;lt;code&amp;gt;OM0jvwB8jIp57ZJjtNEZ&amp;lt;/code&amp;gt;.&lt;br /&gt;
* Kingma, D. &amp;amp; Ba, J. (2014). &amp;#039;&amp;#039;Adam: A Method for Stochastic Optimization&amp;#039;&amp;#039;. arXiv:1412.6980.&lt;br /&gt;
* Sutskever, I., Martens, J., Dahl, G. &amp;amp; Hinton, G. (2013). &amp;#039;&amp;#039;On the importance of initialization and momentum in deep learning&amp;#039;&amp;#039;. ICML.&lt;br /&gt;
* Nesterov, Y. (1983). &amp;#039;&amp;#039;A method of solving a convex programming problem with convergence rate &amp;lt;math&amp;gt;O(1/k^2)&amp;lt;/math&amp;gt;&amp;#039;&amp;#039;. Soviet Mathematics Doklady, 27, 372–376.&lt;br /&gt;
* Polyak, B. T. (1964). &amp;#039;&amp;#039;Some methods of speeding up the convergence of iteration methods&amp;#039;&amp;#039;. USSR Computational Mathematics and Mathematical Physics, 4(5), 1–17.&lt;br /&gt;
* Tieleman, T. &amp;amp; Hinton, G. (2012). &amp;#039;&amp;#039;Lecture 6.5 — RMSprop: divide the gradient by a running average of its recent magnitude&amp;#039;&amp;#039;. COURSERA.&lt;br /&gt;
* Duchi, J., Hazan, E. &amp;amp; Singer, Y. (2011). &amp;#039;&amp;#039;Adaptive subgradient methods for online learning and stochastic optimization&amp;#039;&amp;#039;. JMLR, 12, 2121–2159.&lt;br /&gt;
* Dauphin, Y., de Vries, H. &amp;amp; Bengio, Y. (2015). &amp;#039;&amp;#039;Equilibrated adaptive learning rates for non-convex optimization&amp;#039;&amp;#039;. NeurIPS, 1504–1512.&lt;br /&gt;
* Loshchilov, I. &amp;amp; Hutter, F. (2019). &amp;#039;&amp;#039;Decoupled weight decay regularization&amp;#039;&amp;#039;. ICLR.&lt;br /&gt;
* Liu, L. &amp;#039;&amp;#039;et al.&amp;#039;&amp;#039; (2020). &amp;#039;&amp;#039;On the variance of the adaptive learning rate and beyond&amp;#039;&amp;#039;. ICLR.&lt;br /&gt;
&lt;br /&gt;
[[Category:Machine Learning]]&lt;br /&gt;
[[Category:Research]]&lt;br /&gt;
[[Category:Research Papers]]&lt;/div&gt;</summary>
		<author><name>FuzzyBot</name></author>
	</entry>
</feed>