Subsequent work identified limitations, including {{Term|convergence}} issues in certain settings (addressed by AMSGrad), potential generalization gaps compared to well-tuned {{Term|stochastic gradient descent|SGD}} (particularly for image classification), and sensitivity to the choice of <math>\epsilon</math>. Variants such as AdamW (which decouples {{Term|weight decay}} from the adaptive {{Term|learning rate}}) became preferred for training large {{Term|transformer}} models. Despite these refinements, Adam and its variants remain the backbone of modern neural network optimization.
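As a sketch of the decoupling mentioned above (notation introduced here for illustration: {{Term|learning rate}} <math>\eta</math>, weight-decay coefficient <math>\lambda</math>, and bias-corrected moment estimates <math>\hat{m}_t</math> and <math>\hat{v}_t</math>), the usual AdamW formulation applies the decay directly to the parameters rather than folding <math>\lambda \theta_{t-1}</math> into the gradient before the moments are computed:

<math>\theta_t = \theta_{t-1} - \eta \left( \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda\, \theta_{t-1} \right).</math>

Because the <math>\lambda\, \theta_{t-1}</math> term sits outside the adaptive denominator <math>\sqrt{\hat{v}_t} + \epsilon</math>, the effective decay strength is not rescaled per-parameter by the second-moment estimate, which is the distinction from simply adding an L2 penalty to the loss under Adam.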