Decoupled Weight Decay Regularization

    Type: Research Paper
    Authors: Ilya Loshchilov; Frank Hutter
    Year: 2017
    Topic area: Machine Learning
    Difficulty: Research
    arXiv: 1711.05101

    Decoupled Weight Decay Regularization is a 2017 paper by Ilya Loshchilov and Frank Hutter that exposes a long-standing inequivalence between L2 regularization and true weight decay in adaptive gradient optimizers, and proposes a simple fix. The paper introduces AdamW (and its sibling SGDW), a variant of Adam in which the weight-decay term is applied directly to the parameters rather than added to the gradient before the adaptive scaling. AdamW closes much of the long-observed generalization gap between Adam and SGD with momentum on image classification, and it has since become the de-facto optimizer for training large-scale transformers and other modern neural networks.

    Overview

    In standard stochastic gradient descent, adding an L2 penalty $ \tfrac{\lambda'}{2}\|\theta\|_2^2 $ to the loss produces exactly the same iterates as decaying the parameters by a factor of $ (1-\lambda) $ alongside each gradient step, provided $ \lambda' = \lambda/\alpha $ for learning rate $ \alpha $. Most deep-learning libraries exploit this equivalence and implement "weight decay" by simply adding $ \lambda'\theta $ to the gradient. The authors point out that this equivalence breaks down as soon as the optimizer rescales gradients adaptively, as in AdaGrad, RMSProp, Adam, or AMSGrad: the regularizer's gradient is then divided by the same per-parameter denominator as the loss gradient, so weights with historically large gradients are regularized less than they would be under genuine weight decay.

    The paper's central proposal is to decouple the decay step from the adaptive update: instead of folding $ \lambda \theta $ into the gradient, shrink $ \theta $ by the factor $ (1-\eta_t \lambda) $ in the same update step, outside the adaptive rescaling. The result is AdamW. Empirically, AdamW (i) makes the optimal weight-decay factor and the optimal learning rate roughly orthogonal, and (ii) lifts Adam's generalization on CIFAR-10, CIFAR-100, and ImageNet32×32 to be competitive with SGD with momentum, an outcome that previously required problem-specific switching between optimizers.

    Key Contributions

    • A formal analysis showing that L2 regularization and weight decay are equivalent for vanilla SGD only after a learning-rate-dependent reparameterization, and are not equivalent for any optimizer whose preconditioner $ \mathbf{M}_t $ is not a scalar multiple of the identity.
    • AdamW and SGDW algorithms that decouple weight decay from the gradient-based update, parameterized by an explicit schedule multiplier $ \eta_t $.
    • A "scale-adjusted L2" interpretation: for an idealized adaptive optimizer with a fixed diagonal preconditioner, decoupled weight decay is equivalent to penalizing $ \sum_i s_i \theta_i^2 $, regularizing parameters with large historical gradients more strongly.
    • A demonstration that the optimal weight decay shrinks as the training budget grows, together with a normalized parameterization $ \lambda = \lambda_{\text{norm}} \sqrt{b/(BT)} $ (batch size $ b $, $ B $ training points, $ T $ epochs) that keeps $ \lambda_{\text{norm}} $ comparable across different numbers of weight updates.
    • AdamWR / SGDWR variants that combine decoupled weight decay with cosine-annealing warm restarts (SGDR), yielding both faster convergence and better final accuracy.
    • Extensive ablations on CIFAR-10 with a 26 2×96d ResNet and on ImageNet32×32, covering training budgets of 100 to 1800 epochs and three learning-rate schedules.

    Methods

    In the original formulation of weight decay due to Hanson & Pratt (1988), parameters evolve as

    $ \theta_{t+1} = (1-\lambda)\,\theta_t - \alpha \nabla f_t(\theta_t), $

    so the decay is applied independently of the optimizer's gradient step. Most modern libraries instead absorb it into the loss as $ f_t^{\text{reg}}(\theta) = f_t(\theta) + \tfrac{\lambda'}{2}\|\theta\|_2^2 $ and let the optimizer differentiate; for plain SGD this reproduces the original update if $ \lambda' = \lambda/\alpha $.
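
    Spelling out one SGD step on the regularized loss makes the required coupling explicit:

    $ \theta_{t+1} = \theta_t - \alpha\big(\nabla f_t(\theta_t) + \lambda'\theta_t\big) = (1-\alpha\lambda')\,\theta_t - \alpha \nabla f_t(\theta_t), $

    which coincides with the Hanson & Pratt update exactly when $ \alpha\lambda' = \lambda $; changing the learning rate therefore silently changes the effective decay unless $ \lambda' $ is re-tuned in tandem.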

    For an optimizer with iterates $ \theta_{t+1} = \theta_t - \alpha \mathbf{M}_t \nabla f_t(\theta_t) $ the authors prove that whenever $ \mathbf{M}_t \neq k\mathbf{I} $, no choice of $ \lambda' $ can make L2-regularized optimization match weight-decayed optimization, because $ \mathbf{M}_t $ rescales the regularizer term as well as the loss term. Adam's diagonal preconditioner $ \hat{v}_t^{-1/2} $ falls squarely in this regime.
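
    To see the failure concretely, fix a diagonal preconditioner $ \mathbf{M} = \mathrm{diag}(m_1,\dots,m_d) $. The L2-regularized step shrinks coordinate $ i $ by $ \alpha\, m_i \lambda' \theta_i $, an amount that varies from coordinate to coordinate, whereas genuine weight decay shrinks every coordinate by the same $ \lambda \theta_i $; matching the two would require $ \alpha\, m_i \lambda' = \lambda $ to hold for all $ i $ simultaneously, which is impossible unless all the $ m_i $ are equal.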

    SGDW replaces the parameter-update step of the SGD-with-momentum loop (line 9 of the paper's pseudocode) with

    $ \theta_t \leftarrow \theta_{t-1} - m_t - \eta_t \lambda \theta_{t-1}, $

    so the decay term sits outside the momentum buffer. AdamW replaces Adam's parameter update with

    $ \theta_t \leftarrow \theta_{t-1} - \eta_t\!\left( \alpha\,\hat{m}_t / (\sqrt{\hat{v}_t}+\epsilon) + \lambda\,\theta_{t-1} \right), $

    where $ \eta_t $ is a global schedule multiplier (constant, drop-step, or cosine annealing). When $ \eta_t $ follows the cosine-with-restarts schedule of SGDR, the resulting optimizer is denoted AdamWR (or SGDWR for its SGD counterpart); at each warm restart $ \eta_t $ jumps back to its maximum value and the annealing period is typically lengthened, as in SGDR.
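
    A minimal NumPy sketch of a single AdamW step, following the update above; it is illustrative only, and variable names and default values are not taken from the authors' reference implementation.

    import numpy as np

    def adamw_step(theta, grad, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999,
                   eps=1e-8, lam=1e-2, eta=1.0):
        """One decoupled-weight-decay (AdamW) update on NumPy arrays.

        theta : current parameters
        grad  : gradient of the loss alone (no L2 term folded in)
        m, v  : first/second moment estimates carried between calls
        t     : 1-based step count, used for bias correction
        eta   : schedule multiplier (constant, drop-step, or cosine annealing)
        """
        m = beta1 * m + (1 - beta1) * grad
        v = beta2 * v + (1 - beta2) * grad ** 2
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        # Weight decay is applied to the parameters directly, outside the
        # adaptive scaling by sqrt(v_hat).
        theta = theta - eta * (alpha * m_hat / (np.sqrt(v_hat) + eps) + lam * theta)
        return theta, m, v

    Adding $ \lambda\theta $ to grad before the moment updates (and dropping the lam * theta term from the last line) recovers Adam with L2 regularization, which is exactly the coupling the paper argues against.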

    To make hyperparameters comparable across training budgets, the paper introduces a normalized weight decay $ \lambda_{\text{norm}} $ that ties the raw $ \lambda $ to the total number of weight updates, motivated by the empirical observation that the optimal raw $ \lambda $ falls as the budget grows.
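
    In the paper's notation the normalization reads

    $ \lambda = \lambda_{\text{norm}} \sqrt{\tfrac{b}{BT}}, $

    where $ b $ is the batch size, $ B $ the number of training points, and $ T $ the number of epochs, so that $ \lambda_{\text{norm}} $ can be held fixed while the raw $ \lambda $ automatically shrinks for longer runs.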

    Results

    On CIFAR-10 with a 26 2×96d ResNet trained for 100 epochs, AdamW reaches roughly 5.0 % test error versus about 6.0 % for vanilla Adam with L2 regularization — a relative improvement of around 15 %. SGDW gives essentially the same result as well-tuned SGD with L2, but its hyperparameter landscape is markedly simpler: heatmaps over $ (\alpha, \lambda) $ show diagonal "valleys" of equal performance for L2-regularized optimizers and roughly axis-aligned basins for the decoupled variants, confirming that decoupling makes the two hyperparameters approximately separable.

    On ImageNet32×32, AdamW improves top-1 and top-5 accuracy over Adam-with-L2 across all budgets tested. Adding cosine annealing further improves both Adam and AdamW, and AdamWR with warm restarts matches or exceeds AdamW with a fixed schedule while reaching competitive accuracy in a fraction of the wall-clock time at intermediate snapshots. SGDWR exhibits the same pattern relative to SGDW.

    The paper also reports that the optimal weight decay decreases predictably as the training budget grows: longer schedules require smaller $ \lambda $, and the proposed normalized parameterization $ \lambda_{\text{norm}} $ transfers reasonably well across budgets, reducing the cost of grid search.

    A subtler finding is that folding weight decay into the loss-side L2 term in Adam warps the intended regularization: parameters with large historical gradient magnitudes have their decay divided by a large $ \sqrt{\hat{v}_t} $ and are effectively under-regularized, while parameters with sparse or small gradients are shrunk more aggressively than the practitioner's intended $ \lambda $ would suggest. AdamW removes this implicit per-parameter rescaling, restoring uniform shrinkage across the network and making weight-decay sweeps far more interpretable.
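
    An illustrative back-of-the-envelope calculation (numbers hypothetical, momentum and $ \epsilon $ ignored): with $ \alpha = 10^{-3} $ and $ \lambda' = 10^{-2} $, Adam-with-L2 shrinks a weight whose $ \sqrt{\hat{v}_t} \approx 1 $ by roughly $ \alpha\lambda' = 10^{-5} $ of its value per step, but a weight with $ \sqrt{\hat{v}_t} \approx 10^{-2} $ by roughly $ 10^{-3} $, a hundredfold difference; under AdamW both receive the same $ \eta_t\lambda $ shrinkage.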

    The authors further verify that AdamW's gains are not an artifact of changing the implicit learning rate: the comparison is run with separately tuned step sizes for both variants, and AdamW dominates Adam-with-L2 across the entire two-dimensional $ (\alpha, \lambda) $ grid, not only at a single optimum.

    Impact

    AdamW has become the standard optimizer for a large fraction of contemporary deep learning, particularly for transformers in language and vision. Mainstream frameworks ship native implementations (torch.optim.AdamW in PyTorch since 1.2, tf.keras.optimizers.AdamW in TensorFlow/Keras), and the optimizer is the default in popular training stacks such as Hugging Face Transformers and timm. Practitioners typically tune AdamW with a small weight-decay coefficient (often around 0.01 to 0.1) and a cosine or linear-warmup learning-rate schedule, paralleling the AdamWR recipe.
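
    A typical PyTorch configuration along these lines (the model and hyperparameter values below are placeholders, not a recommendation from the paper):

    import torch

    model = torch.nn.Linear(768, 10)  # placeholder model
    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4,
                                  betas=(0.9, 0.999), weight_decay=0.01)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10_000)

    # Typical inner loop: compute the loss, then
    #   loss.backward(); optimizer.step(); scheduler.step(); optimizer.zero_grad()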

    Beyond engineering practice, the paper has shaped how regularization is discussed in deep-learning research: the distinction between "true weight decay" and "L2 as a loss penalty" is now standard terminology, and subsequent work on optimizer design (for example LAMB, Adafactor, and Lion) explicitly considers whether and how to decouple shrinkage from adaptive scaling. The paper's hyperparameter normalization arguments also influenced later studies of how learning rate, weight decay, and batch size jointly determine the implicit regularization of large-batch training.

    A common follow-up question is whether to apply weight decay uniformly or to exclude bias terms, layer-norm scales, and embedding tables. The decoupling principle does not by itself answer this; it merely clarifies that whichever choice is made is honored exactly by AdamW, not warped by adaptive scaling. Most modern training recipes adopt a "decay everything except norm and bias" convention layered on top of AdamW.
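
    A sketch of that convention in PyTorch, using the common shape- and name-based heuristic for identifying biases and norm parameters (common practice, not something prescribed by the paper):

    def param_groups(model, weight_decay=0.01):
        """Split parameters into decayed and non-decayed groups for AdamW."""
        decay, no_decay = [], []
        for name, p in model.named_parameters():
            if not p.requires_grad:
                continue
            # One-dimensional tensors are typically biases or norm scales.
            if p.ndim == 1 or name.endswith(".bias"):
                no_decay.append(p)
            else:
                decay.append(p)
        return [
            {"params": decay, "weight_decay": weight_decay},
            {"params": no_decay, "weight_decay": 0.0},
        ]

    # optimizer = torch.optim.AdamW(param_groups(model), lr=3e-4)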

    The 2017 paper was eventually published as a conference paper at ICLR 2019, and the authors' reference implementations of AdamW, SGDW, AdamWR, and SGDWR remain a standard benchmark for new adaptive optimizers and regularization schemes.


    References

    • Loshchilov, I., & Hutter, F. (2017). Decoupled Weight Decay Regularization. arXiv:1711.05101. Published at ICLR 2019.
    • Hanson, S. J., & Pratt, L. Y. (1988). Comparing biases for minimal network construction with back-propagation. Advances in Neural Information Processing Systems 1.
    • Kingma, D. P., & Ba, J. (2014). Adam: A Method for Stochastic Optimization. arXiv:1412.6980.
    • Loshchilov, I., & Hutter, F. (2016). SGDR: Stochastic Gradient Descent with Warm Restarts. arXiv:1608.03983.
    • Wilson, A. C., Roelofs, R., Stern, M., Srebro, N., & Recht, B. (2017). The Marginal Value of Adaptive Gradient Methods in Machine Learning. arXiv:1705.08292.
    • Reddi, S. J., Kale, S., & Kumar, S. (2018). On the Convergence of Adam and Beyond. ICLR 2018.
    • Source code: github.com/loshchil/AdamW-and-SGDW.