Translations:Adam A Method for Stochastic Optimization/19/en

    From Marovi AI

    The first moment estimate provides momentum-like behavior, accelerating convergence along consistent gradient directions. The second moment estimate scales the learning rate inversely with the root-mean-square of recent gradients, giving each parameter its own effective learning rate. The combination means parameters with consistently large gradients receive smaller updates, while parameters with small or noisy gradients receive relatively larger updates.
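The behavior described above can be sketched as a single Adam update step. This is a minimal illustration, not the paper's reference implementation; the default hyperparameter values (`lr=0.001`, `beta1=0.9`, `beta2=0.999`, `eps=1e-8`) follow the paper's suggested settings, and the two-parameter demo at the end is a hypothetical setup chosen to show the per-parameter scaling:

```python
import numpy as np

def adam_update(theta, grad, m, v, t,
                lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step: momentum-like first moment, RMS-based second moment."""
    m = beta1 * m + (1 - beta1) * grad       # first moment: running mean of gradients
    v = beta2 * v + (1 - beta2) * grad**2    # second moment: running mean of squared gradients
    m_hat = m / (1 - beta1**t)               # bias correction for zero initialization
    v_hat = v / (1 - beta2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)  # per-parameter effective step
    return theta, m, v

# Two parameters: one sees a consistently large gradient, one a small gradient.
theta = np.array([1.0, 1.0])
m = np.zeros(2)
v = np.zeros(2)
for t in range(1, 101):
    grad = np.array([10.0, 0.1])  # constant gradients, purely for illustration
    theta, m, v = adam_update(theta, grad, m, v, t)
```

With constant gradients the update magnitude is roughly `lr * g / |g| ≈ lr` for both parameters, so after 100 steps each has moved by about 0.1 despite a 100x difference in gradient scale. This is the per-parameter normalization the paragraph describes: the large-gradient parameter is not allowed to take proportionally larger steps.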