Translations:Adam A Method for Stochastic Optimization/5/en
Prior adaptive methods like AdaGrad accumulated squared gradients over the entire training run, causing learning rates to decay monotonically toward zero, which is problematic for non-convex problems. RMSProp addressed this with an exponential moving average of squared gradients but lacked bias correction. Adam unified these ideas with bias-corrected estimates of both the first moment (mean) and second moment (uncentered variance) of the gradients, providing an effective and computationally efficient optimizer with well-behaved default hyperparameters.
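Below is a minimal sketch of one Adam update in NumPy, using the paper's default hyperparameters (step size 0.001, β₁ = 0.9, β₂ = 0.999, ε = 1e-8). The function name adam_step and the functional state-passing style are illustrative choices, not part of the original paper.

<syntaxhighlight lang="python">
import numpy as np

def adam_step(params, grads, m, v, t,
              lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; m and v are running moment estimates, t is the 1-based step index."""
    m = beta1 * m + (1 - beta1) * grads        # EMA of gradients (first moment)
    v = beta2 * v + (1 - beta2) * grads ** 2   # EMA of squared gradients (second moment)
    m_hat = m / (1 - beta1 ** t)               # bias correction: m and v start at zero,
    v_hat = v / (1 - beta2 ** t)               # so early averages are biased toward zero
    params = params - lr * m_hat / (np.sqrt(v_hat) + eps)
    return params, m, v

# Example (illustrative): minimize f(x) = x^2 for a few steps
x = np.array([1.0])
m = np.zeros_like(x)
v = np.zeros_like(x)
for t in range(1, 101):
    grad = 2.0 * x                             # gradient of x^2
    x, m, v = adam_step(x, grad, m, v, t)
</syntaxhighlight>

Note that unlike AdaGrad, v here is an exponential moving average rather than a running sum, so the effective learning rate does not decay monotonically to zero.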