Subsequent work identified limitations, including convergence issues in certain settings (addressed by AMSGrad), potential generalization gaps compared to well-tuned SGD (particularly for image classification), and sensitivity to the choice of $\epsilon$. Variants such as AdamW (which decouples weight decay from the adaptive learning rate) became preferred for training large transformer models. Despite these refinements, Adam and its variants remain the backbone of modern neural network optimization.
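The decoupling in AdamW can be sketched as follows. This is a minimal single-parameter illustration under assumed default hyperparameters, not the reference implementation; the function name and signature are hypothetical:

```python
import math

def adamw_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=1e-2):
    # Standard Adam moment estimates, computed from the gradient alone
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction for the zero-initialized moments
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Adaptive step from the gradient...
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    # ...while weight decay is applied directly to the parameter,
    # outside the adaptive rescaling (the "decoupled" part)
    theta = theta - lr * weight_decay * theta
    return theta, m, v
```

By contrast, plain Adam with L2 regularization adds `weight_decay * theta` to the gradient before the moment updates, so the decay term gets divided by $\sqrt{\hat{v}_t}$ and shrinks for parameters with large gradient variance; the decoupled form avoids this interaction.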