Adam: A Method for Stochastic Optimization
| Research Paper | |
|---|---|
| Authors | Diederik P. Kingma; Jimmy Lei Ba |
| Year | 2015 |
| Venue | ICLR |
| Topic area | Optimization |
| Difficulty | Research |
| arXiv | 1412.6980 |
Adam: A Method for Stochastic Optimization is a 2015 paper by Kingma and Ba that introduced the Adam optimizer, an algorithm for first-order gradient-based optimization of stochastic objective functions. Adam combines the advantages of two earlier methods — AdaGrad (which adapts learning rates per parameter) and RMSProp (which uses a running average of squared gradients) — into a single algorithm with bias-corrected moment estimates. Adam has become the default optimizer for training neural networks across most domains.
Overview
Training deep neural networks requires minimizing a high-dimensional, non-convex objective function using stochastic gradient estimates. Standard stochastic gradient descent (SGD) uses a single global learning rate for all parameters, which can be suboptimal when different parameters have gradients of very different magnitudes or when the loss surface has highly anisotropic curvature.
Prior adaptive methods like AdaGrad accumulated squared gradients over the entire training run, causing learning rates to decay monotonically to zero — problematic for non-convex problems. RMSProp addressed this by using an exponential moving average, but lacked bias correction. Adam unified these ideas with bias-corrected estimates of both the first moment (mean) and second moment (uncentered variance) of the gradients, providing an effective and computationally efficient optimizer with well-behaved default hyperparameters.
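To make the contrast concrete, the two accumulators can be sketched as follows (a minimal Python illustration; the function names are chosen here and do not come from either paper):

```python
def adagrad_accumulator(G, g):
    # AdaGrad: the sum of squared gradients only grows, so the effective
    # per-parameter step alpha / sqrt(G) shrinks monotonically toward zero.
    return G + g**2

def rmsprop_accumulator(v, g, beta2=0.999):
    # RMSProp: an exponential moving average forgets old gradients,
    # so the effective step does not vanish over long training runs.
    return beta2 * v + (1 - beta2) * g**2
```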
Key Contributions
- Adam optimizer: An adaptive learning rate method that maintains per-parameter learning rates based on bias-corrected estimates of the first and second moments of the gradients.
- Bias correction: A mechanism to counteract the initialization bias of the moment estimates toward zero, which is especially important in the initial steps of training.
- AdaMax variant: A generalization based on the infinity norm that can sometimes outperform Adam on problems with sparse gradients.
- Practical defaults: Recommended hyperparameter values ($ \beta_1 = 0.9 $, $ \beta_2 = 0.999 $, $ \epsilon = 10^{-8} $) that work well across a wide range of problems.
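These defaults translate directly into common library settings. As a minimal usage sketch (PyTorch is used here purely for illustration and is not part of the paper; the model and data are placeholders):

```python
import torch

# Placeholder model used only for illustration.
model = torch.nn.Linear(10, 1)

# Adam with the paper's recommended defaults:
# step size alpha = 1e-3, beta1 = 0.9, beta2 = 0.999, eps = 1e-8.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3,
                             betas=(0.9, 0.999), eps=1e-8)

# One training step on dummy data.
x, y = torch.randn(32, 10), torch.randn(32, 1)
loss = torch.nn.functional.mse_loss(model(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```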
Methods
Adam maintains two exponential moving averages: $ m_t $ for the first moment (mean of gradients) and $ v_t $ for the second moment (mean of squared gradients):
$ m_t = \beta_1 \cdot m_{t-1} + (1 - \beta_1) \cdot g_t $
$ v_t = \beta_2 \cdot v_{t-1} + (1 - \beta_2) \cdot g_t^2 $
where $ g_t = \nabla_\theta f_t(\theta_{t-1}) $ is the gradient at step $ t $, and $ \beta_1, \beta_2 \in [0, 1) $ control the exponential decay rates.
Since $ m_t $ and $ v_t $ are initialized as zero vectors, they are biased toward zero during the initial steps. Adam corrects this with bias-corrected estimates:
$ \hat{m}_t = \frac{m_t}{1 - \beta_1^t} $
$ \hat{v}_t = \frac{v_t}{1 - \beta_2^t} $
The parameter update rule is then:
$ \theta_t = \theta_{t-1} - \alpha \cdot \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} $
where $ \alpha $ is the step size (learning rate) and $ \epsilon $ is a small constant for numerical stability.
The first moment estimate provides momentum-like behavior, accelerating convergence along consistent gradient directions. The second moment estimate scales the learning rate inversely with the root-mean-square of recent gradients, giving each parameter its own effective learning rate. The combination means parameters with consistently large gradients receive smaller updates, while parameters with small or noisy gradients receive relatively larger updates.
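The update can be summarized in a minimal NumPy sketch that follows the equations above (function and variable names are chosen here for illustration and are not taken from the paper's pseudocode):

```python
import numpy as np

def adam_step(theta, g, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for parameters theta, given gradient g at step t (t >= 1)."""
    # Biased first and second moment estimates (exponential moving averages).
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    # Bias correction counteracts the initialization of m and v at zero.
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    # Per-parameter step scaled by the RMS of recent gradients.
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Example: minimize f(theta) = ||theta||^2 from a random start.
theta = np.random.randn(5)
m, v = np.zeros_like(theta), np.zeros_like(theta)
for t in range(1, 5001):
    g = 2.0 * theta              # gradient of ||theta||^2
    theta, m, v = adam_step(theta, g, m, v, t)
print(theta)                     # approaches the minimizer at zero
```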
The paper also introduces AdaMax, which replaces the $ L^2 $ norm used in Adam's second moment with the $ L^\infty $ norm, yielding a simpler update rule that avoids the bias correction for the second moment.
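A matching sketch of the AdaMax update, under the same illustrative conventions (the small eps is added here for numerical safety; the paper's pseudocode omits it):

```python
import numpy as np

def adamax_step(theta, g, m, u, t, alpha=2e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One AdaMax update: the infinity-norm variant of Adam (t >= 1)."""
    m = beta1 * m + (1 - beta1) * g
    # Exponentially weighted infinity norm; no bias correction is needed for u.
    u = np.maximum(beta2 * u, np.abs(g))
    # Bias correction applies only to the first moment.
    # eps guards against division by zero; the paper's pseudocode omits it.
    theta = theta - (alpha / (1 - beta1**t)) * m / (u + eps)
    return theta, m, u
```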
Results
The paper evaluated Adam on several benchmarks:
- Logistic regression on MNIST: Adam converged faster than SGD with momentum, AdaGrad, and RMSProp.
- Multi-layer neural networks on MNIST: Adam achieved the lowest training cost, with convergence speed comparable to or better than competing methods.
- Convolutional neural networks on CIFAR-10: Adam performed comparably to SGD with carefully tuned momentum and learning rate schedules.
- Variational autoencoders (VAEs): Adam was used successfully to optimize the variational lower bound, demonstrating its applicability to generative models.
The paper provided a convergence analysis showing that Adam achieves an $ O(\sqrt{T}) $ regret bound in the online convex optimization framework, matching the best known bounds for adaptive methods. Here the regret $ R(T) = \sum_{t=1}^{T} \left[ f_t(\theta_t) - f_t(\theta^*) \right] $ measures the cumulative gap to the best fixed parameter $ \theta^* $ in hindsight, so an $ O(\sqrt{T}) $ bound implies the average regret $ R(T)/T $ vanishes as $ T $ grows.
Impact
Adam became the most widely used optimizer in deep learning, chosen as the default in most research papers and production systems through the late 2010s and into the 2020s. Its robustness to hyperparameter choices and effectiveness across diverse architectures made it the go-to algorithm for practitioners.
Subsequent work identified limitations, including convergence issues in certain settings (addressed by AMSGrad), potential generalization gaps compared to well-tuned SGD (particularly for image classification), and sensitivity to the choice of $ \epsilon $. Variants such as AdamW (which decouples weight decay from the adaptive learning rate) became preferred for training large Transformer models. Despite these refinements, Adam and its variants remain the backbone of modern neural network optimization.
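To illustrate the decoupling that AdamW introduces, here is a minimal sketch in the same NumPy style as the Methods section (an illustration of the idea, not the exact pseudocode of Loshchilov & Hutter; in plain Adam with L2 regularization, the term weight_decay * theta would instead be added to the gradient g before the adaptive step, and would therefore be rescaled by the adaptive denominator):

```python
import numpy as np

def adamw_step(theta, g, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=1e-2):
    """One AdamW-style update: weight decay is applied directly to the
    weights rather than being folded into the gradient (decoupled)."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    # Adaptive step on the loss gradient only ...
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    # ... followed by decoupled weight decay, not rescaled by 1/sqrt(v_hat).
    theta = theta - alpha * weight_decay * theta
    return theta, m, v
```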
See also
- Batch Normalization Accelerating Deep Network Training
- Deep Residual Learning for Image Recognition
- Dropout A Simple Way to Prevent Overfitting
References
- Kingma, D. P. & Ba, J. (2015). Adam: A Method for Stochastic Optimization. Proceedings of ICLR 2015. arXiv:1412.6980
- Duchi, J., Hazan, E., & Singer, Y. (2011). Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. JMLR 12.
- Loshchilov, I. & Hutter, F. (2019). Decoupled Weight Decay Regularization. ICLR 2019. arXiv:1711.05101.