The first moment estimate provides momentum-like behavior, accelerating convergence along consistent gradient directions. The second moment estimate scales the learning rate inversely with the root-mean-square of recent gradients, giving each parameter its own effective learning rate. The combination means parameters with consistently large gradients receive smaller updates, while parameters with small or noisy gradients receive relatively larger updates.
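
As a minimal sketch of the update rule this paragraph describes, the following Python function applies one Adam step to a single parameter array. The function name <code>adam_update</code> and the NumPy implementation are illustrative choices, not part of the original page; the default hyperparameter values follow the ones commonly suggested for Adam.

<syntaxhighlight lang="python">
import numpy as np

def adam_update(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step (illustrative sketch). t is the 1-based step count."""
    # First moment: exponential moving average of gradients (momentum-like term).
    m = beta1 * m + (1 - beta1) * grad
    # Second moment: exponential moving average of squared gradients.
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction compensates for the zero initialization of m and v.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Per-parameter step: dividing by the RMS of recent gradients gives each
    # parameter its own effective learning rate, so consistently large gradients
    # lead to smaller updates and small or noisy gradients to relatively larger ones.
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v
</syntaxhighlight>

In use, <code>m</code> and <code>v</code> start as zero arrays of the same shape as the parameter and are carried from one call to the next along with the step counter <code>t</code>.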