Training deep neural networks requires minimizing a high-dimensional, non-convex objective function using stochastic gradient estimates. Standard stochastic gradient descent (SGD) uses a single global learning rate for all parameters, which can be suboptimal when different parameters have gradients of very different magnitudes or when the loss surface has highly anisotropic curvature.
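The contrast between a single global step size and a per-parameter adaptive step can be made concrete with a small sketch. The following code is not from the article; it is a minimal illustration, assuming NumPy, with hypothetical function names (sgd_step, adam_step) and illustrative hyperparameter values, showing how an Adam-style update rescales each coordinate by running gradient statistics while plain SGD applies one learning rate to every parameter.

<syntaxhighlight lang="python">
# Illustrative sketch (assumptions noted above), not the article's reference code.
import numpy as np

def sgd_step(params, grads, lr=0.01):
    # Plain SGD: one global learning rate for every parameter.
    return params - lr * grads

def adam_step(params, grads, m, v, t,
              lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Exponential moving averages of the gradient and its elementwise square.
    m = beta1 * m + (1 - beta1) * grads
    v = beta2 * v + (1 - beta2) * grads ** 2
    # Bias correction for the zero initialization of m and v.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Each parameter gets its own effective step size via sqrt(v_hat).
    params = params - lr * m_hat / (np.sqrt(v_hat) + eps)
    return params, m, v

# Toy usage: two parameters whose gradients differ by orders of magnitude,
# the anisotropic setting where a single global learning rate struggles.
params = np.array([1.0, 1.0])
m = np.zeros_like(params)
v = np.zeros_like(params)
for t in range(1, 6):
    grads = np.array([100.0, 0.001]) * params
    params, m, v = adam_step(params, grads, m, v, t)
</syntaxhighlight>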