| Method | Key idea | Reference |
|---|---|---|
| Momentum | Accumulates an exponentially decaying moving average of past gradients | Polyak, 1964 |
| Nesterov accelerated gradient | Evaluates the gradient at a "look-ahead" position | Nesterov, 1983 |
| AdaGrad | Per-parameter learning rates that shrink for frequently updated features | Duchi et al., 2011 |
| RMSProp | Fixes AdaGrad's diminishing rates using a moving average of squared gradients | Hinton (lecture notes), 2012 |
| Adam | Combines momentum with RMSProp-style adaptive rates | Kingma & Ba, 2015 |
| AdamW | Decouples weight decay from the adaptive gradient step | Loshchilov & Hutter, 2019 |
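
The table compresses each method to one line, but the update rules themselves are short enough to sketch. Below is a minimal NumPy sketch (not part of the original page) of the rules named above; the function names, hyperparameter defaults, and one-function-per-step structure are illustrative assumptions, and the Nesterov variant follows the common machine-learning reformulation rather than the 1983 original.

```python
import numpy as np

def momentum_step(w, grad, v, lr=0.01, beta=0.9):
    """Momentum (Polyak, 1964): decaying accumulation of past gradients."""
    v = beta * v + grad               # velocity accumulates past gradients
    return w - lr * v, v

def nesterov_step(w, grad_fn, v, lr=0.01, beta=0.9):
    """Nesterov accelerated gradient: gradient taken at a look-ahead point."""
    g = grad_fn(w - lr * beta * v)    # evaluate at the "look-ahead" position
    v = beta * v + g
    return w - lr * v, v

def adagrad_step(w, grad, g2, lr=0.01, eps=1e-8):
    """AdaGrad (Duchi et al., 2011): rates shrink as squared grads accumulate."""
    g2 = g2 + grad ** 2               # monotonically growing accumulator
    return w - lr * grad / (np.sqrt(g2) + eps), g2

def rmsprop_step(w, grad, g2, lr=0.001, beta=0.9, eps=1e-8):
    """RMSProp (Hinton, 2012): moving average keeps rates from vanishing."""
    g2 = beta * g2 + (1 - beta) * grad ** 2   # decaying, not monotone
    return w - lr * grad / (np.sqrt(g2) + eps), g2

def adamw_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999,
               eps=1e-8, wd=0.01):
    """AdamW (Loshchilov & Hutter, 2019). With wd=0 this reduces to plain
    Adam (Kingma & Ba, 2015): momentum plus RMSProp-style adaptive rates."""
    m = beta1 * m + (1 - beta1) * grad        # first moment (momentum)
    v = beta2 * v + (1 - beta2) * grad ** 2   # second moment (RMSProp-style)
    m_hat = m / (1 - beta1 ** t)              # bias correction for zero init
    v_hat = v / (1 - beta2 ** t)
    # Weight decay acts on w directly, decoupled from the adaptive step.
    return w - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * w), m, v

# Hypothetical usage: minimize f(w) = ||w||^2, whose gradient is 2w.
w = np.array([1.0, -2.0])
m, v = np.zeros_like(w), np.zeros_like(w)
for t in range(1, 501):
    w, m, v = adamw_step(w, 2 * w, m, v, t)
print(w)  # approaches [0, 0]
```

The decoupling the AdamW row describes is visible in the `wd * w` term: Adam folds L2 regularization into the gradient, where the adaptive denominator rescales it, while AdamW subtracts the decay directly from the weights.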