The first moment estimate provides momentum-like behavior, accelerating convergence along consistent gradient directions. The second moment estimate scales the learning rate inversely with the root-mean-square of recent gradients, giving each parameter its own effective learning rate. The combination means parameters with consistently large gradients receive smaller updates, while parameters with small or noisy gradients receive relatively larger updates.
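
As a minimal sketch of the update rule this paragraph describes, the following Python function applies one Adam step to a single parameter array. The function name <code>adam_update</code> and the NumPy implementation are illustrative choices, not part of the original page; the default hyperparameter values follow the ones commonly suggested for Adam.

<syntaxhighlight lang="python">
import numpy as np

def adam_update(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step (illustrative sketch). t is the 1-based step count."""
    # First moment: exponential moving average of gradients (momentum-like term).
    m = beta1 * m + (1 - beta1) * grad
    # Second moment: exponential moving average of squared gradients.
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction compensates for the zero initialization of m and v.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Per-parameter step: dividing by the RMS of recent gradients gives each
    # parameter its own effective learning rate, so consistently large gradients
    # lead to smaller updates and small or noisy gradients to relatively larger ones.
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v
</syntaxhighlight>

In use, <code>m</code> and <code>v</code> start as zero arrays of the same shape as the parameter and are carried from one call to the next along with the step counter <code>t</code>.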