| Method | Key idea | Reference |
| --- | --- | --- |
| Momentum | Accumulates an exponentially decaying moving average of past gradients | Polyak, 1964 |
| Nesterov accelerated gradient | Evaluates the gradient at a "look-ahead" position | Nesterov, 1983 |
| AdaGrad | Per-parameter rates that shrink for frequently updated features | Duchi et al., 2011 |
| RMSProp | Fixes AdaGrad's diminishing rates using a moving average of squared gradients | Hinton (lecture notes), 2012 |
| Adam | Combines momentum with RMSProp-style adaptive rates | Kingma & Ba, 2015 |
| AdamW | Decouples weight decay from the adaptive gradient step | Loshchilov & Hutter, 2019 |
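
To make the last two rows concrete, here is a minimal NumPy sketch of a single Adam step (Kingma & Ba, 2015), with an optional decoupled weight decay term in the style of AdamW (Loshchilov & Hutter, 2019). The function name `adam_step`, its signature, and the default hyperparameters are illustrative choices for this sketch, not any library's API.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8, weight_decay=0.0, decoupled=False):
    """One Adam update. With decoupled=True, weight decay is applied
    directly to the parameters (AdamW) rather than through the gradient."""
    if weight_decay and not decoupled:
        # Classic L2 regularization: the decay term enters via the gradient,
        # so it gets rescaled by the adaptive denominator below.
        grad = grad + weight_decay * theta

    # Exponentially decaying moving averages of the gradient (momentum)
    # and of the squared gradient (RMSProp-style per-parameter scaling).
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2

    # Bias correction for the zero-initialized moment estimates.
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)

    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)

    if weight_decay and decoupled:
        # AdamW: weight decay bypasses the adaptive step entirely.
        theta = theta - lr * weight_decay * theta

    return theta, m, v

# Illustrative usage: minimize f(theta) = ||theta||^2, whose gradient is
# 2 * theta, starting from an arbitrary point.
theta = np.array([5.0, -3.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for t in range(1, 2001):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t, lr=0.05)
print(theta)  # close to the minimum at the origin
```

The `decoupled` flag highlights the point of the AdamW row: with ordinary L2 regularization the decay term passes through the adaptive denominator and is weakened for parameters with large gradient history, whereas decoupled decay shrinks all weights at the same rate regardless of the adaptive scaling.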