Prior adaptive methods like AdaGrad accumulated squared gradients over the entire training run, causing learning rates to decay monotonically toward zero, which is problematic for non-convex problems. RMSProp addressed this by using an exponential moving average of squared gradients, but it lacked bias correction. Adam unified these ideas with bias-corrected estimates of both the first moment (the mean) and the second raw moment (the uncentered variance) of the gradients, providing an effective and computationally efficient optimizer with well-behaved default hyperparameters.
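For concreteness, here is a minimal sketch of the Adam update in NumPy. The moving averages and bias-correction terms follow the paper's Algorithm 1, with its recommended defaults (alpha = 0.001, beta1 = 0.9, beta2 = 0.999, epsilon = 1e-8); the quadratic objective in the usage example is an arbitrary illustration, not from the paper.

```python
import numpy as np

def adam_step(theta, grad, m, v, t,
              alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: exponential moving averages of the gradient
    (first moment) and squared gradient (second raw moment), each
    bias-corrected for their initialization at zero."""
    m = beta1 * m + (1 - beta1) * grad        # first-moment estimate (mean)
    v = beta2 * v + (1 - beta2) * grad**2     # second-moment estimate (uncentered variance)
    m_hat = m / (1 - beta1**t)                # correct the bias toward zero (t starts at 1)
    v_hat = v / (1 - beta2**t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Usage: minimize f(theta) = ||theta||^2 (illustrative objective only).
theta = np.array([1.0, -2.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for t in range(1, 1001):
    grad = 2 * theta                          # gradient of ||theta||^2
    theta, m, v = adam_step(theta, grad, m, v, t)
print(theta)  # approaches the minimizer [0, 0]
```

Without the bias correction, m and v would be skewed toward zero early in training (since both start at zero), inflating the effective step size when divided; the 1 - beta^t factors counteract exactly that initialization bias.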