Translations:Adam A Method for Stochastic Optimization/4/en

    Training deep neural networks requires minimizing a high-dimensional, non-convex {{Term|loss function|objective function}} using stochastic gradient estimates. Standard {{Term|stochastic gradient descent}} ({{Term|stochastic gradient descent|SGD}}) uses a single global {{Term|learning rate}} for all parameters, which can be suboptimal when different parameters have gradients of very different magnitudes or when the loss surface has highly anisotropic curvature.
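A minimal sketch of the contrast, using notation assumed here rather than taken verbatim from the paper: plain SGD applies one global step size <math>\alpha</math> to every coordinate of the parameter vector,

<math display="block">\theta_{t+1} = \theta_t - \alpha \, g_t, \qquad g_t = \nabla_\theta f_t(\theta_t),</math>

whereas a per-parameter adaptive method such as Adam effectively replaces the shared <math>\alpha</math> with a coordinate-wise step size that depends on the gradient history of that coordinate,

<math display="block">\theta_{t+1,i} = \theta_{t,i} - \alpha_i^{(t)} \, g_{t,i}.</math>

When gradients differ in magnitude by orders of magnitude across coordinates, a single <math>\alpha</math> that is small enough for the largest gradients makes progress on the smallest ones extremely slow, which is the mismatch the adaptive step sizes are meant to address.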
