<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://marovi.ai/index.php?action=history&amp;feed=atom&amp;title=Decoupled_Weight_Decay_Regularization</id>
	<title>Decoupled Weight Decay Regularization - Revision history</title>
	<link rel="self" type="application/atom+xml" href="https://marovi.ai/index.php?action=history&amp;feed=atom&amp;title=Decoupled_Weight_Decay_Regularization"/>
	<link rel="alternate" type="text/html" href="https://marovi.ai/index.php?title=Decoupled_Weight_Decay_Regularization&amp;action=history"/>
	<updated>2026-04-27T14:38:39Z</updated>
	<subtitle>Revision history for this page on the wiki</subtitle>
	<generator>MediaWiki 1.39.1</generator>
	<entry>
		<id>https://marovi.ai/index.php?title=Decoupled_Weight_Decay_Regularization&amp;diff=11456&amp;oldid=prev</id>
		<title>DeployBot: Marked this version for translation</title>
		<link rel="alternate" type="text/html" href="https://marovi.ai/index.php?title=Decoupled_Weight_Decay_Regularization&amp;diff=11456&amp;oldid=prev"/>
		<updated>2026-04-27T07:14:25Z</updated>

		<summary type="html">&lt;p&gt;Marked this version for translation&lt;/p&gt;
&lt;table style=&quot;background-color: #fff; color: #202122;&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;Revision as of 07:14, 27 April 2026&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l20&quot;&gt;Line 20:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 20:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&amp;#039;&amp;#039;&amp;#039;Decoupled Weight Decay Regularization&amp;#039;&amp;#039;&amp;#039; is a 2017 paper by Ilya Loshchilov and Frank Hutter that exposes a long-standing inequivalence between L&amp;lt;sub&amp;gt;2&amp;lt;/sub&amp;gt; regularization and true weight decay in adaptive gradient optimizers, and proposes a simple fix. The paper introduces &amp;#039;&amp;#039;&amp;#039;AdamW&amp;#039;&amp;#039;&amp;#039; (and its sibling SGDW), a variant of [[Adam]] in which the weight-decay term is applied directly to the parameters rather than added to the gradient before the adaptive scaling. AdamW closes much of the long-observed generalization gap between Adam and SGD with momentum on image classification, and it has since become the de-facto optimizer for training large-scale transformers and other modern neural networks.&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&amp;#039;&amp;#039;&amp;#039;Decoupled Weight Decay Regularization&amp;#039;&amp;#039;&amp;#039; is a 2017 paper by Ilya Loshchilov and Frank Hutter that exposes a long-standing inequivalence between L&amp;lt;sub&amp;gt;2&amp;lt;/sub&amp;gt; regularization and true weight decay in adaptive gradient optimizers, and proposes a simple fix. The paper introduces &amp;#039;&amp;#039;&amp;#039;AdamW&amp;#039;&amp;#039;&amp;#039; (and its sibling SGDW), a variant of [[Adam]] in which the weight-decay term is applied directly to the parameters rather than added to the gradient before the adaptive scaling. AdamW closes much of the long-observed generalization gap between Adam and SGD with momentum on image classification, and it has since become the de-facto optimizer for training large-scale transformers and other modern neural networks.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&amp;lt;!--T:2--&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;== Overview == &lt;/ins&gt;&amp;lt;!--T:2--&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;== Overview ==&lt;/del&gt;&lt;/div&gt;&lt;/td&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-added&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&amp;lt;!--T:3--&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&amp;lt;!--T:3--&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l29&quot;&gt;Line 29:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 28:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;The paper&amp;#039;s central proposal is to &amp;#039;&amp;#039;&amp;#039;decouple&amp;#039;&amp;#039;&amp;#039; the decay step from the adaptive update: instead of folding &amp;lt;math&amp;gt;\lambda \theta&amp;lt;/math&amp;gt; into the gradient, multiply &amp;lt;math&amp;gt;\theta&amp;lt;/math&amp;gt; by &amp;lt;math&amp;gt;(1-\eta_t \lambda)&amp;lt;/math&amp;gt; after the Adam update. The result is AdamW. Empirically, AdamW (i) makes the optimal weight-decay factor and the optimal learning rate roughly orthogonal, and (ii) lifts Adam&amp;#039;s generalization on CIFAR-10, CIFAR-100, and ImageNet32×32 to be competitive with SGD with momentum, an outcome that previously required problem-specific switching between optimizers.&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;The paper&amp;#039;s central proposal is to &amp;#039;&amp;#039;&amp;#039;decouple&amp;#039;&amp;#039;&amp;#039; the decay step from the adaptive update: instead of folding &amp;lt;math&amp;gt;\lambda \theta&amp;lt;/math&amp;gt; into the gradient, multiply &amp;lt;math&amp;gt;\theta&amp;lt;/math&amp;gt; by &amp;lt;math&amp;gt;(1-\eta_t \lambda)&amp;lt;/math&amp;gt; after the Adam update. The result is AdamW. Empirically, AdamW (i) makes the optimal weight-decay factor and the optimal learning rate roughly orthogonal, and (ii) lifts Adam&amp;#039;s generalization on CIFAR-10, CIFAR-100, and ImageNet32×32 to be competitive with SGD with momentum, an outcome that previously required problem-specific switching between optimizers.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&amp;lt;!--T:5--&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;== Key Contributions == &lt;/ins&gt;&amp;lt;!--T:5--&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;== Key Contributions ==&lt;/del&gt;&lt;/div&gt;&lt;/td&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-added&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&amp;lt;!--T:6--&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&amp;lt;!--T:6--&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l40&quot;&gt;Line 40:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 38:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;* Extensive ablations on CIFAR-10 with a 26 2×96d ResNet and on ImageNet32×32, covering training budgets of 100 to 1800 epochs and three learning-rate schedules.&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;* Extensive ablations on CIFAR-10 with a 26 2×96d ResNet and on ImageNet32×32, covering training budgets of 100 to 1800 epochs and three learning-rate schedules.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&amp;lt;!--T:7--&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;== Methods == &lt;/ins&gt;&amp;lt;!--T:7--&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;== Methods ==&lt;/del&gt;&lt;/div&gt;&lt;/td&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-added&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&amp;lt;!--T:8--&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&amp;lt;!--T:8--&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l73&quot;&gt;Line 73:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 70:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;To make hyperparameters comparable across training budgets, the paper introduces a normalized weight decay &amp;lt;math&amp;gt;\lambda_{\text{norm}}&amp;lt;/math&amp;gt; tied to the total number of weight updates &amp;lt;math&amp;gt;BT&amp;lt;/math&amp;gt; and batch size &amp;lt;math&amp;gt;B&amp;lt;/math&amp;gt;, motivated by the empirical observation that the optimal raw &amp;lt;math&amp;gt;\lambda&amp;lt;/math&amp;gt; falls as the budget grows.&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;To make hyperparameters comparable across training budgets, the paper introduces a normalized weight decay &amp;lt;math&amp;gt;\lambda_{\text{norm}}&amp;lt;/math&amp;gt; tied to the total number of weight updates &amp;lt;math&amp;gt;BT&amp;lt;/math&amp;gt; and batch size &amp;lt;math&amp;gt;B&amp;lt;/math&amp;gt;, motivated by the empirical observation that the optimal raw &amp;lt;math&amp;gt;\lambda&amp;lt;/math&amp;gt; falls as the budget grows.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&amp;lt;!--T:18--&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;== Results == &lt;/ins&gt;&amp;lt;!--T:18--&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;== Results ==&lt;/del&gt;&lt;/div&gt;&lt;/td&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-added&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&amp;lt;!--T:19--&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&amp;lt;!--T:19--&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l91&quot;&gt;Line 91:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 87:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;The authors further verify that AdamW&amp;#039;s gains are not an artifact of changing the implicit learning rate: the comparison is run with separately tuned step sizes for both variants, and AdamW dominates Adam-with-L&amp;lt;sub&amp;gt;2&amp;lt;/sub&amp;gt; across the entire two-dimensional &amp;lt;math&amp;gt;(\alpha, \lambda)&amp;lt;/math&amp;gt; grid, not only at a single optimum.&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;The authors further verify that AdamW&amp;#039;s gains are not an artifact of changing the implicit learning rate: the comparison is run with separately tuned step sizes for both variants, and AdamW dominates Adam-with-L&amp;lt;sub&amp;gt;2&amp;lt;/sub&amp;gt; across the entire two-dimensional &amp;lt;math&amp;gt;(\alpha, \lambda)&amp;lt;/math&amp;gt; grid, not only at a single optimum.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&amp;lt;!--T:24--&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;== Impact == &lt;/ins&gt;&amp;lt;!--T:24--&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;== Impact ==&lt;/del&gt;&lt;/div&gt;&lt;/td&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-added&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&amp;lt;!--T:25--&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&amp;lt;!--T:25--&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l106&quot;&gt;Line 106:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 101:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;The 2017 paper was eventually published as a conference paper at ICLR 2019, and the authors&amp;#039; reference implementations of AdamW, SGDW, AdamWR, and SGDWR remain a standard benchmark for new adaptive optimizers and regularization schemes.&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;The 2017 paper was eventually published as a conference paper at ICLR 2019, and the authors&amp;#039; reference implementations of AdamW, SGDW, AdamWR, and SGDWR remain a standard benchmark for new adaptive optimizers and regularization schemes.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&amp;lt;!--T:29--&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;== See also == &lt;/ins&gt;&amp;lt;!--T:29--&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;== See also ==&lt;/del&gt;&lt;/div&gt;&lt;/td&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-added&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&amp;lt;!--T:30--&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&amp;lt;!--T:30--&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l118&quot;&gt;Line 118:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 112:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;* [[Neural network]]&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;* [[Neural network]]&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&amp;lt;!--T:31--&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;== References == &lt;/ins&gt;&amp;lt;!--T:31--&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;== References ==&lt;/del&gt;&lt;/div&gt;&lt;/td&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-added&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&amp;lt;!--T:32--&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&amp;lt;!--T:32--&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;

&lt;/table&gt;</summary>
		<author><name>DeployBot</name></author>
	</entry>
	<entry>
		<id>https://marovi.ai/index.php?title=Decoupled_Weight_Decay_Regularization&amp;diff=11453&amp;oldid=prev</id>
		<title>DeployBot: [deploy-bot] Claude-authored from arxiv:1711.05101</title>
		<link rel="alternate" type="text/html" href="https://marovi.ai/index.php?title=Decoupled_Weight_Decay_Regularization&amp;diff=11453&amp;oldid=prev"/>
		<updated>2026-04-27T07:14:24Z</updated>

		<summary type="html">&lt;p&gt;[deploy-bot] Claude-authored from arxiv:1711.05101&lt;/p&gt;
&lt;p&gt;&lt;b&gt;New page&lt;/b&gt;&lt;/p&gt;&lt;div&gt;&amp;lt;languages /&amp;gt;&lt;br /&gt;
{{PaperTabs}}&lt;br /&gt;
{{PaperInfobox&lt;br /&gt;
 | topic_area  = Machine Learning&lt;br /&gt;
 | difficulty  = Research&lt;br /&gt;
 | authors     = Ilya Loshchilov; Frank Hutter&lt;br /&gt;
 | year        = 2017&lt;br /&gt;
 | arxiv_id    = 1711.05101&lt;br /&gt;
 | source_url  = https://arxiv.org/abs/1711.05101&lt;br /&gt;
 | pdf_url     = https://arxiv.org/pdf/1711.05101.pdf&lt;br /&gt;
}}&lt;br /&gt;
{{ContentMeta&lt;br /&gt;
 | generated_by   = claude-code-direct&lt;br /&gt;
 | model_used     = claude-opus-4-7&lt;br /&gt;
 | generated_date = 2026-04-27&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&amp;lt;translate&amp;gt;&lt;br /&gt;
&amp;lt;!--T:1--&amp;gt;&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;Decoupled Weight Decay Regularization&amp;#039;&amp;#039;&amp;#039; is a 2017 paper by Ilya Loshchilov and Frank Hutter that exposes a long-standing inequivalence between L&amp;lt;sub&amp;gt;2&amp;lt;/sub&amp;gt; regularization and true weight decay in adaptive gradient optimizers, and proposes a simple fix. The paper introduces &amp;#039;&amp;#039;&amp;#039;AdamW&amp;#039;&amp;#039;&amp;#039; (and its sibling SGDW), a variant of [[Adam]] in which the weight-decay term is applied directly to the parameters rather than added to the gradient before the adaptive scaling. AdamW closes much of the long-observed generalization gap between Adam and SGD with momentum on image classification, and it has since become the de-facto optimizer for training large-scale transformers and other modern neural networks.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--T:2--&amp;gt;&lt;br /&gt;
== Overview ==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--T:3--&amp;gt;&lt;br /&gt;
In standard stochastic gradient descent, adding an L&amp;lt;sub&amp;gt;2&amp;lt;/sub&amp;gt; penalty &amp;lt;math&amp;gt;\tfrac{\lambda&amp;#039;}{2}\|\theta\|_2^2&amp;lt;/math&amp;gt; to the loss is mathematically equivalent to additionally multiplying the parameters by &amp;lt;math&amp;gt;(1-\lambda)&amp;lt;/math&amp;gt; at every step (that is, to true weight decay), with &amp;lt;math&amp;gt;\lambda&amp;#039; = \lambda/\alpha&amp;lt;/math&amp;gt; for learning rate &amp;lt;math&amp;gt;\alpha&amp;lt;/math&amp;gt;. Most deep-learning libraries exploit this equivalence and implement &amp;quot;weight decay&amp;quot; by simply adding &amp;lt;math&amp;gt;\lambda \theta&amp;lt;/math&amp;gt; to the gradient. The authors point out that this equivalence breaks down as soon as the optimizer rescales gradients adaptively, as in [[AdaGrad]], RMSProp, [[Adam]], or AMSGrad: the regularizer&amp;#039;s gradient is then divided by the same per-parameter denominator as the loss gradient, so weights with historically large gradients are regularized less than they would be under genuine weight decay.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--T:4--&amp;gt;&lt;br /&gt;
The paper&amp;#039;s central proposal is to &amp;#039;&amp;#039;&amp;#039;decouple&amp;#039;&amp;#039;&amp;#039; the decay step from the adaptive update: instead of folding &amp;lt;math&amp;gt;\lambda \theta&amp;lt;/math&amp;gt; into the gradient, multiply &amp;lt;math&amp;gt;\theta&amp;lt;/math&amp;gt; by &amp;lt;math&amp;gt;(1-\eta_t \lambda)&amp;lt;/math&amp;gt; after the Adam update. The result is AdamW. Empirically, AdamW (i) makes the optimal weight-decay factor and the optimal learning rate roughly orthogonal, and (ii) lifts Adam&amp;#039;s generalization on CIFAR-10, CIFAR-100, and ImageNet32×32 to be competitive with SGD with momentum, an outcome that previously required problem-specific switching between optimizers.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--T:5--&amp;gt;&lt;br /&gt;
== Key Contributions ==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--T:6--&amp;gt;&lt;br /&gt;
* A formal analysis showing that L&amp;lt;sub&amp;gt;2&amp;lt;/sub&amp;gt; regularization and weight decay are equivalent for vanilla SGD only after a learning-rate-dependent reparameterization, and are &amp;#039;&amp;#039;&amp;#039;not&amp;#039;&amp;#039;&amp;#039; equivalent for any optimizer whose preconditioner &amp;lt;math&amp;gt;\mathbf{M}_t&amp;lt;/math&amp;gt; is not a scalar multiple of the identity.&lt;br /&gt;
* AdamW and SGDW algorithms that decouple weight decay from the gradient-based update, parameterized by an explicit schedule multiplier &amp;lt;math&amp;gt;\eta_t&amp;lt;/math&amp;gt;.&lt;br /&gt;
* A &amp;quot;scale-adjusted L&amp;lt;sub&amp;gt;2&amp;lt;/sub&amp;gt;&amp;quot; interpretation: for an idealized adaptive optimizer with a fixed diagonal preconditioner, decoupled weight decay is equivalent to penalizing &amp;lt;math&amp;gt;\sum_i s_i \theta_i^2&amp;lt;/math&amp;gt;, regularizing parameters with large historical gradients more strongly.&lt;br /&gt;
* A demonstration that the optimal weight decay shrinks as the training budget grows, and a normalized parameterization &amp;lt;math&amp;gt;\lambda = \lambda_{\text{norm}} \sqrt{b/(BT)}&amp;lt;/math&amp;gt; (batch size &amp;lt;math&amp;gt;b&amp;lt;/math&amp;gt;, training-set size &amp;lt;math&amp;gt;B&amp;lt;/math&amp;gt;, &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt; epochs) that scales &amp;lt;math&amp;gt;\lambda&amp;lt;/math&amp;gt; with the total number of weight updates.&lt;br /&gt;
* AdamWR / SGDWR variants that combine decoupled weight decay with cosine-annealing warm restarts (SGDR), yielding both faster convergence and better final accuracy.&lt;br /&gt;
* Extensive ablations on CIFAR-10 with a 26 2×96d ResNet and on ImageNet32×32, covering training budgets of 100 to 1800 epochs and three learning-rate schedules.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--T:7--&amp;gt;&lt;br /&gt;
== Methods ==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--T:8--&amp;gt;&lt;br /&gt;
In the original formulation of weight decay due to Hanson &amp;amp; Pratt (1988), parameters evolve as&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--T:9--&amp;gt;&lt;br /&gt;
&amp;lt;math&amp;gt;\theta_{t+1} = (1-\lambda)\,\theta_t - \alpha \nabla f_t(\theta_t),&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--T:10--&amp;gt;&lt;br /&gt;
so the decay is applied independently of the optimizer&amp;#039;s gradient step. Most modern libraries instead absorb it into the loss as &amp;lt;math&amp;gt;f_t^{\text{reg}}(\theta) = f_t(\theta) + \tfrac{\lambda&amp;#039;}{2}\|\theta\|_2^2&amp;lt;/math&amp;gt; and let the optimizer differentiate; for plain SGD this reproduces the original update if &amp;lt;math&amp;gt;\lambda&amp;#039; = \lambda/\alpha&amp;lt;/math&amp;gt;.&lt;br /&gt;
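&lt;br /&gt;
A minimal numeric check of this equivalence for a single SGD step (an illustrative sketch, not code from the paper): with &amp;lt;math&amp;gt;\lambda&amp;#039; = \lambda/\alpha&amp;lt;/math&amp;gt;, folding the L&amp;lt;sub&amp;gt;2&amp;lt;/sub&amp;gt; gradient into the update reproduces the decayed iterate exactly.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;python&amp;quot;&amp;gt;&lt;br /&gt;
import math&lt;br /&gt;
&lt;br /&gt;
alpha = 0.1           # learning rate&lt;br /&gt;
lam = 0.01            # weight-decay factor in the Hanson-Pratt form&lt;br /&gt;
lam_l2 = lam / alpha  # equivalent L2 coefficient for plain SGD&lt;br /&gt;
&lt;br /&gt;
theta = 2.0           # a single parameter&lt;br /&gt;
grad = 0.5            # gradient of the unregularized loss at theta&lt;br /&gt;
&lt;br /&gt;
# (a) true weight decay: shrink the weight, then take the gradient step&lt;br /&gt;
decayed = (1.0 - lam) * theta - alpha * grad&lt;br /&gt;
&lt;br /&gt;
# (b) L2 regularization: fold lam_l2 * theta into the gradient&lt;br /&gt;
l2_step = theta - alpha * (grad + lam_l2 * theta)&lt;br /&gt;
&lt;br /&gt;
assert math.isclose(decayed, l2_step)  # identical for plain SGD&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;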
&lt;br /&gt;
&amp;lt;!--T:11--&amp;gt;&lt;br /&gt;
For an optimizer with iterates &amp;lt;math&amp;gt;\theta_{t+1} = \theta_t - \alpha \mathbf{M}_t \nabla f_t(\theta_t)&amp;lt;/math&amp;gt; the authors prove that whenever &amp;lt;math&amp;gt;\mathbf{M}_t \neq k\mathbf{I}&amp;lt;/math&amp;gt;, no choice of &amp;lt;math&amp;gt;\lambda&amp;#039;&amp;lt;/math&amp;gt; can make L&amp;lt;sub&amp;gt;2&amp;lt;/sub&amp;gt;-regularized optimization match weight-decayed optimization, because &amp;lt;math&amp;gt;\mathbf{M}_t&amp;lt;/math&amp;gt; rescales the regularizer term as well as the loss term. Adam&amp;#039;s diagonal preconditioner &amp;lt;math&amp;gt;\hat{v}_t^{-1/2}&amp;lt;/math&amp;gt; falls squarely in this regime.&lt;br /&gt;
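&lt;br /&gt;
A small numeric illustration of the same point (the values are arbitrary and the snippet is a sketch, not the analysis from the paper): with a fixed diagonal preconditioner and a zero loss gradient, an L&amp;lt;sub&amp;gt;2&amp;lt;/sub&amp;gt; penalty shrinks two equal parameters by different amounts, whereas decoupled weight decay shrinks both by the same factor, so no single &amp;lt;math&amp;gt;\lambda&amp;#039;&amp;lt;/math&amp;gt; reproduces it.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;python&amp;quot;&amp;gt;&lt;br /&gt;
# Two parameters with the same value but different diagonal preconditioner&lt;br /&gt;
# entries, as in Adam, where the scaling depends on past gradient magnitudes.&lt;br /&gt;
theta = [1.0, 1.0]&lt;br /&gt;
M = [2.0, 0.5]          # per-parameter scaling, not a multiple of the identity&lt;br /&gt;
alpha, lam, lam_l2 = 0.1, 0.05, 0.5&lt;br /&gt;
grad = [0.0, 0.0]       # zero loss gradient isolates the regularization effect&lt;br /&gt;
&lt;br /&gt;
# L2 in the loss: the decay term is rescaled by M, so shrinkage differs&lt;br /&gt;
l2 = [t - alpha * m * (g + lam_l2 * t) for t, m, g in zip(theta, M, grad)]&lt;br /&gt;
&lt;br /&gt;
# Decoupled weight decay: every parameter is shrunk by the same factor&lt;br /&gt;
wd = [(1.0 - lam) * t - alpha * m * g for t, m, g in zip(theta, M, grad)]&lt;br /&gt;
&lt;br /&gt;
print(l2)   # [0.9, 0.975]: unequal effective decay&lt;br /&gt;
print(wd)   # [0.95, 0.95]: uniform decay, as intended&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;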
&lt;br /&gt;
&amp;lt;!--T:12--&amp;gt;&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;SGDW&amp;#039;&amp;#039;&amp;#039; replaces line 9 of the SGD-with-momentum loop with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--T:13--&amp;gt;&lt;br /&gt;
&amp;lt;math&amp;gt;\theta_t \leftarrow \theta_{t-1} - m_t - \eta_t \lambda \theta_{t-1},&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--T:14--&amp;gt;&lt;br /&gt;
so the decay term sits outside the momentum buffer. &amp;#039;&amp;#039;&amp;#039;AdamW&amp;#039;&amp;#039;&amp;#039; replaces Adam&amp;#039;s parameter update with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--T:15--&amp;gt;&lt;br /&gt;
&amp;lt;math&amp;gt;\theta_t \leftarrow \theta_{t-1} - \eta_t\!\left( \alpha\,\hat{m}_t / (\sqrt{\hat{v}_t}+\epsilon) + \lambda\,\theta_{t-1} \right),&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--T:16--&amp;gt;&lt;br /&gt;
where &amp;lt;math&amp;gt;\eta_t&amp;lt;/math&amp;gt; is a global schedule multiplier (constant, drop-step, or cosine annealing). When &amp;lt;math&amp;gt;\eta_t&amp;lt;/math&amp;gt; follows the cosine-with-restarts schedule of SGDR, the resulting optimizer is denoted AdamWR (or SGDWR for its SGD counterpart); each warm restart resets &amp;lt;math&amp;gt;\eta_t&amp;lt;/math&amp;gt; to its maximum and begins a new, typically longer, annealing period.&lt;br /&gt;
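&lt;br /&gt;
A minimal scalar sketch of the decoupled update above, plus a cosine schedule multiplier of the kind used for AdamWR (illustrative only; real implementations operate on whole parameter tensors and keep optimizer state per parameter):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;python&amp;quot;&amp;gt;&lt;br /&gt;
import math&lt;br /&gt;
&lt;br /&gt;
def adamw_step(theta, grad, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999,&lt;br /&gt;
               eps=1e-8, lam=1e-2, eta=1.0):&lt;br /&gt;
    # Exponential moving averages of the gradient and its square, as in Adam&lt;br /&gt;
    m = beta1 * m + (1 - beta1) * grad&lt;br /&gt;
    v = beta2 * v + (1 - beta2) * grad * grad&lt;br /&gt;
    m_hat = m / (1 - beta1 ** t)   # bias correction, t = 1, 2, ...&lt;br /&gt;
    v_hat = v / (1 - beta2 ** t)&lt;br /&gt;
    # Decoupled step: the lam * theta term is not divided by sqrt(v_hat)&lt;br /&gt;
    theta = theta - eta * (alpha * m_hat / (math.sqrt(v_hat) + eps) + lam * theta)&lt;br /&gt;
    return theta, m, v&lt;br /&gt;
&lt;br /&gt;
def cosine_eta(t, period):&lt;br /&gt;
    # Schedule multiplier annealed from 1 to 0 over one period;&lt;br /&gt;
    # AdamWR restarts the schedule with progressively longer periods.&lt;br /&gt;
    return 0.5 * (1.0 + math.cos(math.pi * t / period))&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;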
&lt;br /&gt;
&amp;lt;!--T:17--&amp;gt;&lt;br /&gt;
To make hyperparameters comparable across training budgets, the paper introduces a normalized weight decay &amp;lt;math&amp;gt;\lambda_{\text{norm}}&amp;lt;/math&amp;gt;, setting &amp;lt;math&amp;gt;\lambda = \lambda_{\text{norm}} \sqrt{b/(BT)}&amp;lt;/math&amp;gt; for batch size &amp;lt;math&amp;gt;b&amp;lt;/math&amp;gt;, training-set size &amp;lt;math&amp;gt;B&amp;lt;/math&amp;gt;, and &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt; epochs, motivated by the empirical observation that the optimal raw &amp;lt;math&amp;gt;\lambda&amp;lt;/math&amp;gt; falls as the number of weight updates &amp;lt;math&amp;gt;BT/b&amp;lt;/math&amp;gt; grows.&lt;br /&gt;
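&lt;br /&gt;
A worked example of the normalization (the numbers are illustrative placeholders, not tuned values from the paper):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;python&amp;quot;&amp;gt;&lt;br /&gt;
import math&lt;br /&gt;
&lt;br /&gt;
def raw_weight_decay(lam_norm, batch_size, train_points, epochs):&lt;br /&gt;
    # lambda = lambda_norm * sqrt(b / (B * T)): the same lambda_norm maps to a&lt;br /&gt;
    # smaller raw decay factor when the run performs more weight updates.&lt;br /&gt;
    return lam_norm * math.sqrt(batch_size / (train_points * epochs))&lt;br /&gt;
&lt;br /&gt;
short_run = raw_weight_decay(0.025, 128, 50000, 100)&lt;br /&gt;
long_run = raw_weight_decay(0.025, 128, 50000, 1800)&lt;br /&gt;
print(short_run, long_run)   # the 18x longer budget gets a smaller raw lambda&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;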
&lt;br /&gt;
&amp;lt;!--T:18--&amp;gt;&lt;br /&gt;
== Results ==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--T:19--&amp;gt;&lt;br /&gt;
On CIFAR-10 with a 26 2×96d ResNet trained for 100 epochs, AdamW reaches roughly 5.0 % test error versus about 6.0 % for vanilla Adam with L&amp;lt;sub&amp;gt;2&amp;lt;/sub&amp;gt; regularization — a relative improvement of around 15 %. SGDW gives essentially the same result as well-tuned SGD with L&amp;lt;sub&amp;gt;2&amp;lt;/sub&amp;gt;, but its hyperparameter landscape is markedly simpler: heatmaps over &amp;lt;math&amp;gt;(\alpha, \lambda)&amp;lt;/math&amp;gt; show diagonal &amp;quot;valleys&amp;quot; of equal performance for L&amp;lt;sub&amp;gt;2&amp;lt;/sub&amp;gt;-regularized optimizers and roughly axis-aligned basins for the decoupled variants, confirming that decoupling makes the two hyperparameters approximately separable.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--T:20--&amp;gt;&lt;br /&gt;
On ImageNet32×32, AdamW improves top-1 and top-5 accuracy over Adam-with-L&amp;lt;sub&amp;gt;2&amp;lt;/sub&amp;gt; across all budgets tested. Adding cosine annealing further improves both Adam and AdamW, and AdamWR with warm restarts matches or exceeds AdamW with a fixed schedule while reaching competitive accuracy in a fraction of the wall-clock time at intermediate snapshots. SGDWR exhibits the same pattern relative to SGDW.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--T:21--&amp;gt;&lt;br /&gt;
The paper also reports that the optimal weight decay decreases predictably as the training budget grows: longer schedules require smaller &amp;lt;math&amp;gt;\lambda&amp;lt;/math&amp;gt;, and the proposed normalized parameterization &amp;lt;math&amp;gt;\lambda_{\text{norm}}&amp;lt;/math&amp;gt; transfers reasonably well across budgets, reducing the cost of grid search.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--T:22--&amp;gt;&lt;br /&gt;
A subtler finding is that the popular practice of folding weight decay into the loss-side L&amp;lt;sub&amp;gt;2&amp;lt;/sub&amp;gt; term in Adam distorts the regularization per parameter: parameters with large historical gradients have the decay term divided by a large adaptive denominator and are regularized by a smaller relative amount, while parameters with sparse or low-magnitude gradients are shrunk comparatively more, so the effective decay no longer matches the practitioner&amp;#039;s single intended &amp;lt;math&amp;gt;\lambda&amp;lt;/math&amp;gt;. AdamW removes this implicit per-parameter rescaling, restoring uniform shrinkage across the network and making weight-decay sweeps far more interpretable.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--T:23--&amp;gt;&lt;br /&gt;
The authors further verify that AdamW&amp;#039;s gains are not an artifact of changing the implicit learning rate: the comparison is run with separately tuned step sizes for both variants, and AdamW dominates Adam-with-L&amp;lt;sub&amp;gt;2&amp;lt;/sub&amp;gt; across the entire two-dimensional &amp;lt;math&amp;gt;(\alpha, \lambda)&amp;lt;/math&amp;gt; grid, not only at a single optimum.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--T:24--&amp;gt;&lt;br /&gt;
== Impact ==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--T:25--&amp;gt;&lt;br /&gt;
AdamW has become the standard optimizer for a large fraction of contemporary deep learning, particularly for [[Transformer (machine learning model)|transformers]] in language and vision. Mainstream frameworks ship native implementations (&amp;lt;code&amp;gt;torch.optim.AdamW&amp;lt;/code&amp;gt; in PyTorch since 1.2, &amp;lt;code&amp;gt;tf.keras.optimizers.AdamW&amp;lt;/code&amp;gt; in TensorFlow/Keras), and the optimizer is the default in Hugging Face Transformers and a standard choice in other training stacks such as timm. Practitioners typically tune AdamW with a small weight-decay coefficient (often around 0.01 to 0.1) and a cosine or linear-warmup learning-rate schedule, paralleling the AdamWR recipe.&lt;br /&gt;
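&lt;br /&gt;
A typical setup along these lines (a sketch only; the tiny model and the specific hyperparameter values are placeholders rather than recommendations from the paper):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;python&amp;quot;&amp;gt;&lt;br /&gt;
import torch&lt;br /&gt;
&lt;br /&gt;
model = torch.nn.Linear(784, 10)   # placeholder model&lt;br /&gt;
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.05)&lt;br /&gt;
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10_000)&lt;br /&gt;
&lt;br /&gt;
for step in range(10_000):&lt;br /&gt;
    loss = model(torch.randn(32, 784)).square().mean()   # dummy objective&lt;br /&gt;
    loss.backward()&lt;br /&gt;
    optimizer.step()       # decoupled weight decay is applied inside step()&lt;br /&gt;
    optimizer.zero_grad()&lt;br /&gt;
    scheduler.step()&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;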
&lt;br /&gt;
&amp;lt;!--T:26--&amp;gt;&lt;br /&gt;
Beyond engineering practice, the paper has shaped how regularization is discussed in deep-learning research: the distinction between &amp;quot;true weight decay&amp;quot; and &amp;quot;L&amp;lt;sub&amp;gt;2&amp;lt;/sub&amp;gt; as a loss penalty&amp;quot; is now standard terminology, and subsequent work on optimizer design (for example LAMB, Adafactor, and Lion) explicitly considers whether and how to decouple shrinkage from adaptive scaling. The paper&amp;#039;s hyperparameter normalization arguments also influenced later studies of how learning rate, weight decay, and batch size jointly determine the implicit regularization of large-batch training.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--T:27--&amp;gt;&lt;br /&gt;
A common follow-up question is whether to apply weight decay uniformly or to exclude bias terms, layer-norm scales, and embedding tables. The decoupling principle does not by itself answer this; it merely clarifies that whichever choice is made is honored exactly by AdamW, not warped by adaptive scaling. Most modern training recipes adopt a &amp;quot;decay everything except norm and bias&amp;quot; convention layered on top of AdamW.&lt;br /&gt;
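&lt;br /&gt;
One common way to express that convention with &amp;lt;code&amp;gt;torch.optim.AdamW&amp;lt;/code&amp;gt; (a sketch; treating every 1-D tensor as a bias or normalization scale is a heuristic convention layered on top of the paper, not part of it):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;python&amp;quot;&amp;gt;&lt;br /&gt;
import torch&lt;br /&gt;
&lt;br /&gt;
def split_decay_groups(model, weight_decay=0.01):&lt;br /&gt;
    # Biases and normalization scales are 1-D tensors; leave them undecayed.&lt;br /&gt;
    decay, no_decay = [], []&lt;br /&gt;
    for p in model.parameters():&lt;br /&gt;
        if p.requires_grad:&lt;br /&gt;
            (no_decay if p.ndim == 1 else decay).append(p)&lt;br /&gt;
    return [&lt;br /&gt;
        dict(params=decay, weight_decay=weight_decay),&lt;br /&gt;
        dict(params=no_decay, weight_decay=0.0),&lt;br /&gt;
    ]&lt;br /&gt;
&lt;br /&gt;
model = torch.nn.Sequential(torch.nn.Linear(16, 32), torch.nn.LayerNorm(32))&lt;br /&gt;
optimizer = torch.optim.AdamW(split_decay_groups(model), lr=1e-3)&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;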
&lt;br /&gt;
&amp;lt;!--T:28--&amp;gt;&lt;br /&gt;
The 2017 paper was eventually published as a conference paper at ICLR 2019, and the authors&amp;#039; reference implementations of AdamW, SGDW, AdamWR, and SGDWR remain a standard benchmark for new adaptive optimizers and regularization schemes.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--T:29--&amp;gt;&lt;br /&gt;
== See also ==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--T:30--&amp;gt;&lt;br /&gt;
* [[Adam]]&lt;br /&gt;
* [[Stochastic gradient descent]]&lt;br /&gt;
* [[Regularization (mathematics)]]&lt;br /&gt;
* [[Tikhonov regularization]]&lt;br /&gt;
* [[Hyperparameter optimization]]&lt;br /&gt;
* [[Deep learning]]&lt;br /&gt;
* [[Neural network]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--T:31--&amp;gt;&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--T:32--&amp;gt;&lt;br /&gt;
* Loshchilov, I., &amp;amp; Hutter, F. (2017). &amp;#039;&amp;#039;Decoupled Weight Decay Regularization&amp;#039;&amp;#039;. [https://arxiv.org/abs/1711.05101 arXiv:1711.05101]. Published at ICLR 2019.&lt;br /&gt;
* Hanson, S. J., &amp;amp; Pratt, L. Y. (1988). Comparing biases for minimal network construction with back-propagation. &amp;#039;&amp;#039;Advances in Neural Information Processing Systems 1&amp;#039;&amp;#039;.&lt;br /&gt;
* Kingma, D. P., &amp;amp; Ba, J. (2014). Adam: A Method for Stochastic Optimization. [https://arxiv.org/abs/1412.6980 arXiv:1412.6980].&lt;br /&gt;
* Loshchilov, I., &amp;amp; Hutter, F. (2016). SGDR: Stochastic Gradient Descent with Warm Restarts. [https://arxiv.org/abs/1608.03983 arXiv:1608.03983].&lt;br /&gt;
* Wilson, A. C., Roelofs, R., Stern, M., Srebro, N., &amp;amp; Recht, B. (2017). The Marginal Value of Adaptive Gradient Methods in Machine Learning. [https://arxiv.org/abs/1705.08292 arXiv:1705.08292].&lt;br /&gt;
* Reddi, S. J., Kale, S., &amp;amp; Kumar, S. (2018). On the Convergence of Adam and Beyond. ICLR 2018.&lt;br /&gt;
* Source code: [https://github.com/loshchil/AdamW-and-SGDW github.com/loshchil/AdamW-and-SGDW].&lt;br /&gt;
&amp;lt;/translate&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[Category:Machine Learning]]&lt;br /&gt;
[[Category:Research]]&lt;br /&gt;
[[Category:Research Papers]]&lt;/div&gt;</summary>
		<author><name>DeployBot</name></author>
	</entry>
</feed>