| Method | Key idea | Reference |
|---|---|---|
| Momentum | Accumulates an exponentially decaying moving average of past gradients | Polyak, 1964 |
| Nesterov accelerated gradient | Evaluates the gradient at a "look-ahead" position | Nesterov, 1983 |
| Adagrad | Per-parameter rates that shrink for frequently updated features | Duchi et al., 2011 |
| RMSProp | Fixes Adagrad's diminishing rates using a moving average of squared gradients | Hinton (lecture notes), 2012 |
| Adam | Combines momentum with RMSProp-style adaptive rates | Kingma & Ba, 2015 |
| AdamW | Decouples weight decay from the adaptive gradient step | Loshchilov & Hutter, 2019 |
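
Since the table describes each update rule only in words, a minimal NumPy sketch of two representative entries (momentum and Adam) may make the ideas concrete. The function names, signatures, and hyperparameter defaults below are illustrative assumptions, not taken from the source.

```python
import numpy as np

def momentum_step(w, g, v, lr=0.01, beta=0.9):
    # Momentum (Polyak, 1964): accumulate an exponentially decaying
    # moving average of past gradients, then step along it.
    v = beta * v + g
    return w - lr * v, v

def adam_step(w, g, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Adam (Kingma & Ba, 2015): a momentum-style first moment combined
    # with an RMSProp-style second moment of squared gradients.
    m = beta1 * m + (1 - beta1) * g        # first moment (momentum term)
    v = beta2 * v + (1 - beta2) * g**2     # second moment (squared gradients)
    m_hat = m / (1 - beta1**t)             # bias correction for early steps
    v_hat = v / (1 - beta2**t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# Example: run Adam on f(w) = ||w||^2, whose gradient is 2w.
w = np.array([1.0, -2.0])
m = v = np.zeros_like(w)
for t in range(1, 101):
    g = 2 * w
    w, m, v = adam_step(w, g, m, v, t)
print(w)  # moves toward the minimizer at the origin
```

The bias-correction terms divide by 1 - beta**t so that the moment estimates are not biased toward zero during the first few steps, when the moving averages have seen little data.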