Translations:Adam A Method for Stochastic Optimization/4/en

Training deep neural networks requires minimizing a high-dimensional, non-convex objective function using stochastic gradient estimates. Standard stochastic gradient descent (SGD) uses a single global learning rate for all parameters, which can be suboptimal when different parameters have gradients of very different magnitudes or when the loss surface has highly anisotropic curvature.
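A minimal sketch (not from the paper) of why one global learning rate struggles with anisotropic curvature: on a hypothetical quadratic loss whose two coordinates have curvatures 100 and 1, the step size must be kept small enough for stability along the steep direction, which leaves the shallow direction converging very slowly.

```python
import numpy as np

# Hypothetical anisotropic quadratic: f(x) = 0.5 * (100*x0^2 + 1*x1^2).
# The gradient in coordinate 0 is 100x larger than in coordinate 1.
curvatures = np.array([100.0, 1.0])

def grad(x):
    # Gradient of 0.5 * sum(c_i * x_i^2) is c_i * x_i per coordinate.
    return curvatures * x

x = np.array([1.0, 1.0])
lr = 0.015  # global rate, chosen below the 2/100 stability limit of the steep axis

for _ in range(100):
    x = x - lr * grad(x)  # vanilla (full-batch) gradient descent step

# The steep coordinate is essentially at the optimum, while the shallow
# coordinate has only decayed by a factor of (1 - 0.015)^100, i.e. it is
# still above 0.2 after 100 steps.
print(x)
```

Raising `lr` to speed up the shallow axis would make the update along the steep axis oscillate or diverge, which motivates methods that adapt the step size per parameter.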