- Adam optimizer: An adaptive learning rate method that maintains per-parameter learning rates based on bias-corrected estimates of the first and second moments of the gradients (a minimal code sketch of the update follows this list).
- Bias correction: A mechanism to counteract the initialization bias of the moment estimates toward zero, which is especially important in the initial steps of training.
- AdaMax variant: A generalization based on the infinity norm that can sometimes outperform Adam on problems with sparse gradients.
- Practical defaults: Recommended hyperparameter values ($ \beta_1 = 0.9 $, $ \beta_2 = 0.999 $, $ \epsilon = 10^{-8} $) that work well across a wide range of problems.
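The update rules behind these points fit in a few lines. Below is a minimal NumPy sketch, not the paper's reference pseudocode: the function names `adam_step` and `adamax_step`, the toy quadratic objective, and the larger step size used in the demo loop are illustrative choices, while the defaults $ \alpha = 0.001 $ (Adam), $ \alpha = 0.002 $ (AdaMax), $ \beta_1 = 0.9 $, $ \beta_2 = 0.999 $, and $ \epsilon = 10^{-8} $ follow the values recommended above.

```python
import numpy as np

def adam_step(theta, grad, m, v, t,
              alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update of `theta` given gradient `grad`.

    `m` and `v` are the running first- and second-moment estimates
    (initialized to zeros); `t` is the 1-based step counter.
    """
    m = beta1 * m + (1 - beta1) * grad        # first moment: mean of gradients
    v = beta2 * v + (1 - beta2) * grad**2     # second moment: uncentered variance
    m_hat = m / (1 - beta1**t)                # bias correction: the moments start at
    v_hat = v / (1 - beta2**t)                # zero, so early estimates are scaled up
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

def adamax_step(theta, grad, m, u, t,
                alpha=0.002, beta1=0.9, beta2=0.999):
    """One AdaMax update: the second moment is replaced by an exponentially
    weighted infinity-norm estimate `u`, which needs no bias correction.
    (Sketch only: practical code also guards against `u` being zero.)"""
    m = beta1 * m + (1 - beta1) * grad
    u = np.maximum(beta2 * u, np.abs(grad))   # infinity-norm estimate
    theta = theta - (alpha / (1 - beta1**t)) * m / u
    return theta, m, u

# Toy usage: minimize f(x) = ||x - 3||^2 starting from the origin.
theta = np.zeros(2)
m, v = np.zeros_like(theta), np.zeros_like(theta)
for t in range(1, 2001):
    grad = 2.0 * (theta - 3.0)
    theta, m, v = adam_step(theta, grad, m, v, t, alpha=0.01)
print(theta)  # close to [3., 3.]
```

Because each coordinate's step is roughly bounded by $ \alpha $, the demo loop uses a larger step size ($ \alpha = 0.01 $) so the toy problem converges within a couple of thousand iterations; with the default $ \alpha = 0.001 $ it would simply take proportionally longer.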