<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://marovi.ai/index.php?action=history&amp;feed=atom&amp;title=Decoupled_Weight_Decay_Regularization</id>
	<title>Decoupled Weight Decay Regularization - Revision history</title>
	<link rel="self" type="application/atom+xml" href="https://marovi.ai/index.php?action=history&amp;feed=atom&amp;title=Decoupled_Weight_Decay_Regularization"/>
	<link rel="alternate" type="text/html" href="https://marovi.ai/index.php?title=Decoupled_Weight_Decay_Regularization&amp;action=history"/>
	<updated>2026-04-27T14:38:39Z</updated>
	<subtitle>Revision history for this page on the wiki</subtitle>
	<generator>MediaWiki 1.39.1</generator>
	<entry>
		<id>https://marovi.ai/index.php?title=Decoupled_Weight_Decay_Regularization&amp;diff=11456&amp;oldid=prev</id>
		<title>DeployBot: Marked this version for translation</title>
		<link rel="alternate" type="text/html" href="https://marovi.ai/index.php?title=Decoupled_Weight_Decay_Regularization&amp;diff=11456&amp;oldid=prev"/>
		<updated>2026-04-27T07:14:25Z</updated>

		<summary type="html">&lt;p&gt;Marked this version for translation&lt;/p&gt;
&lt;table style=&quot;background-color: #fff; color: #202122;&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;Revision as of 07:14, 27 April 2026&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l20&quot;&gt;Line 20:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 20:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&amp;#039;&amp;#039;&amp;#039;Decoupled Weight Decay Regularization&amp;#039;&amp;#039;&amp;#039; is a 2017 paper by Ilya Loshchilov and Frank Hutter that exposes a long-standing inequivalence between L&amp;lt;sub&amp;gt;2&amp;lt;/sub&amp;gt; regularization and true weight decay in adaptive gradient optimizers, and proposes a simple fix. The paper introduces &amp;#039;&amp;#039;&amp;#039;AdamW&amp;#039;&amp;#039;&amp;#039; (and its sibling SGDW), a variant of [[Adam]] in which the weight-decay term is applied directly to the parameters rather than added to the gradient before the adaptive scaling. AdamW closes much of the long-observed generalization gap between Adam and SGD with momentum on image classification, and it has since become the de-facto optimizer for training large-scale transformers and other modern neural networks.&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&amp;#039;&amp;#039;&amp;#039;Decoupled Weight Decay Regularization&amp;#039;&amp;#039;&amp;#039; is a 2017 paper by Ilya Loshchilov and Frank Hutter that exposes a long-standing inequivalence between L&amp;lt;sub&amp;gt;2&amp;lt;/sub&amp;gt; regularization and true weight decay in adaptive gradient optimizers, and proposes a simple fix. The paper introduces &amp;#039;&amp;#039;&amp;#039;AdamW&amp;#039;&amp;#039;&amp;#039; (and its sibling SGDW), a variant of [[Adam]] in which the weight-decay term is applied directly to the parameters rather than added to the gradient before the adaptive scaling. AdamW closes much of the long-observed generalization gap between Adam and SGD with momentum on image classification, and it has since become the de-facto optimizer for training large-scale transformers and other modern neural networks.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&amp;lt;!--T:2--&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;== Overview == &lt;/ins&gt;&amp;lt;!--T:2--&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;== Overview ==&lt;/del&gt;&lt;/div&gt;&lt;/td&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-added&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&amp;lt;!--T:3--&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&amp;lt;!--T:3--&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l29&quot;&gt;Line 29:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 28:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;The paper&amp;#039;s central proposal is to &amp;#039;&amp;#039;&amp;#039;decouple&amp;#039;&amp;#039;&amp;#039; the decay step from the adaptive update: instead of folding &amp;lt;math&amp;gt;\lambda \theta&amp;lt;/math&amp;gt; into the gradient, multiply &amp;lt;math&amp;gt;\theta&amp;lt;/math&amp;gt; by &amp;lt;math&amp;gt;(1-\eta_t \lambda)&amp;lt;/math&amp;gt; after the Adam update. The result is AdamW. Empirically, AdamW (i) makes the optimal weight-decay factor and the optimal learning rate roughly orthogonal, and (ii) lifts Adam&amp;#039;s generalization on CIFAR-10, CIFAR-100, and ImageNet32×32 to be competitive with SGD with momentum, an outcome that previously required problem-specific switching between optimizers.&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;The paper&amp;#039;s central proposal is to &amp;#039;&amp;#039;&amp;#039;decouple&amp;#039;&amp;#039;&amp;#039; the decay step from the adaptive update: instead of folding &amp;lt;math&amp;gt;\lambda \theta&amp;lt;/math&amp;gt; into the gradient, multiply &amp;lt;math&amp;gt;\theta&amp;lt;/math&amp;gt; by &amp;lt;math&amp;gt;(1-\eta_t \lambda)&amp;lt;/math&amp;gt; after the Adam update. The result is AdamW. Empirically, AdamW (i) makes the optimal weight-decay factor and the optimal learning rate roughly orthogonal, and (ii) lifts Adam&amp;#039;s generalization on CIFAR-10, CIFAR-100, and ImageNet32×32 to be competitive with SGD with momentum, an outcome that previously required problem-specific switching between optimizers.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&amp;lt;!--T:5--&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;== Key Contributions == &lt;/ins&gt;&amp;lt;!--T:5--&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;== Key Contributions ==&lt;/del&gt;&lt;/div&gt;&lt;/td&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-added&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&amp;lt;!--T:6--&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&amp;lt;!--T:6--&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l40&quot;&gt;Line 40:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 38:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;* Extensive ablations on CIFAR-10 with a 26 2×96d ResNet and on ImageNet32×32, covering training budgets of 100 to 1800 epochs and three learning-rate schedules.&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;* Extensive ablations on CIFAR-10 with a 26 2×96d ResNet and on ImageNet32×32, covering training budgets of 100 to 1800 epochs and three learning-rate schedules.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&amp;lt;!--T:7--&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;== Methods == &lt;/ins&gt;&amp;lt;!--T:7--&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;== Methods ==&lt;/del&gt;&lt;/div&gt;&lt;/td&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-added&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&amp;lt;!--T:8--&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&amp;lt;!--T:8--&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l73&quot;&gt;Line 73:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 70:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;To make hyperparameters comparable across training budgets, the paper introduces a normalized weight decay &amp;lt;math&amp;gt;\lambda_{\text{norm}}&amp;lt;/math&amp;gt; tied to the total number of weight updates &amp;lt;math&amp;gt;BT&amp;lt;/math&amp;gt; and batch size &amp;lt;math&amp;gt;B&amp;lt;/math&amp;gt;, motivated by the empirical observation that the optimal raw &amp;lt;math&amp;gt;\lambda&amp;lt;/math&amp;gt; falls as the budget grows.&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;To make hyperparameters comparable across training budgets, the paper introduces a normalized weight decay &amp;lt;math&amp;gt;\lambda_{\text{norm}}&amp;lt;/math&amp;gt; tied to the total number of weight updates &amp;lt;math&amp;gt;BT&amp;lt;/math&amp;gt; and batch size &amp;lt;math&amp;gt;B&amp;lt;/math&amp;gt;, motivated by the empirical observation that the optimal raw &amp;lt;math&amp;gt;\lambda&amp;lt;/math&amp;gt; falls as the budget grows.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&amp;lt;!--T:18--&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;== Results == &lt;/ins&gt;&amp;lt;!--T:18--&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;== Results ==&lt;/del&gt;&lt;/div&gt;&lt;/td&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-added&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&amp;lt;!--T:19--&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&amp;lt;!--T:19--&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l91&quot;&gt;Line 91:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 87:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;The authors further verify that AdamW&amp;#039;s gains are not an artifact of changing the implicit learning rate: the comparison is run with separately tuned step sizes for both variants, and AdamW dominates Adam-with-L&amp;lt;sub&amp;gt;2&amp;lt;/sub&amp;gt; across the entire two-dimensional &amp;lt;math&amp;gt;(\alpha, \lambda)&amp;lt;/math&amp;gt; grid, not only at a single optimum.&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;The authors further verify that AdamW&amp;#039;s gains are not an artifact of changing the implicit learning rate: the comparison is run with separately tuned step sizes for both variants, and AdamW dominates Adam-with-L&amp;lt;sub&amp;gt;2&amp;lt;/sub&amp;gt; across the entire two-dimensional &amp;lt;math&amp;gt;(\alpha, \lambda)&amp;lt;/math&amp;gt; grid, not only at a single optimum.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&amp;lt;!--T:24--&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;== Impact == &lt;/ins&gt;&amp;lt;!--T:24--&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;== Impact ==&lt;/del&gt;&lt;/div&gt;&lt;/td&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-added&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&amp;lt;!--T:25--&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&amp;lt;!--T:25--&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l106&quot;&gt;Line 106:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 101:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;The 2017 paper was eventually published as a conference paper at ICLR 2019, and the authors&amp;#039; reference implementations of AdamW, SGDW, AdamWR, and SGDWR remain a standard benchmark for new adaptive optimizers and regularization schemes.&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;The 2017 paper was eventually published as a conference paper at ICLR 2019, and the authors&amp;#039; reference implementations of AdamW, SGDW, AdamWR, and SGDWR remain a standard benchmark for new adaptive optimizers and regularization schemes.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&amp;lt;!--T:29--&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;== See also == &lt;/ins&gt;&amp;lt;!--T:29--&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;== See also ==&lt;/del&gt;&lt;/div&gt;&lt;/td&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-added&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&amp;lt;!--T:30--&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&amp;lt;!--T:30--&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l118&quot;&gt;Line 118:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 112:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;* [[Neural network]]&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;* [[Neural network]]&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&amp;lt;!--T:31--&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;== References == &lt;/ins&gt;&amp;lt;!--T:31--&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;== References ==&lt;/del&gt;&lt;/div&gt;&lt;/td&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-added&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&amp;lt;!--T:32--&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&amp;lt;!--T:32--&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;

&lt;/table&gt;</summary>
		<author><name>DeployBot</name></author>
	</entry>
	<entry>
		<id>https://marovi.ai/index.php?title=Decoupled_Weight_Decay_Regularization&amp;diff=11453&amp;oldid=prev</id>
		<title>DeployBot: [deploy-bot] Claude-authored from arxiv:1711.05101</title>
		<link rel="alternate" type="text/html" href="https://marovi.ai/index.php?title=Decoupled_Weight_Decay_Regularization&amp;diff=11453&amp;oldid=prev"/>
		<updated>2026-04-27T07:14:24Z</updated>

		<summary type="html">&lt;p&gt;[deploy-bot] Claude-authored from arxiv:1711.05101&lt;/p&gt;
&lt;p&gt;&lt;b&gt;New page&lt;/b&gt;&lt;/p&gt;&lt;div&gt;&amp;lt;languages /&amp;gt;&lt;br /&gt;
{{PaperTabs}}&lt;br /&gt;
{{PaperInfobox&lt;br /&gt;
 | topic_area  = Machine Learning&lt;br /&gt;
 | difficulty  = Research&lt;br /&gt;
 | authors     = Ilya Loshchilov; Frank Hutter&lt;br /&gt;
 | year        = 2017&lt;br /&gt;
 | arxiv_id    = 1711.05101&lt;br /&gt;
 | source_url  = https://arxiv.org/abs/1711.05101&lt;br /&gt;
 | pdf_url     = https://arxiv.org/pdf/1711.05101.pdf&lt;br /&gt;
}}&lt;br /&gt;
{{ContentMeta&lt;br /&gt;
 | generated_by   = claude-code-direct&lt;br /&gt;
 | model_used     = claude-opus-4-7&lt;br /&gt;
 | generated_date = 2026-04-27&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&amp;lt;translate&amp;gt;&lt;br /&gt;
&amp;lt;!--T:1--&amp;gt;&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;Decoupled Weight Decay Regularization&amp;#039;&amp;#039;&amp;#039; is a 2017 paper by Ilya Loshchilov and Frank Hutter that exposes a long-standing inequivalence between L&amp;lt;sub&amp;gt;2&amp;lt;/sub&amp;gt; regularization and true weight decay in adaptive gradient optimizers, and proposes a simple fix. The paper introduces &amp;#039;&amp;#039;&amp;#039;AdamW&amp;#039;&amp;#039;&amp;#039; (and its sibling SGDW), a variant of [[Adam]] in which the weight-decay term is applied directly to the parameters rather than added to the gradient before the adaptive scaling. AdamW closes much of the long-observed generalization gap between Adam and SGD with momentum on image classification, and it has since become the de-facto optimizer for training large-scale transformers and other modern neural networks.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--T:2--&amp;gt;&lt;br /&gt;
== Overview ==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--T:3--&amp;gt;&lt;br /&gt;
In standard stochastic gradient descent, adding an L&amp;lt;sub&amp;gt;2&amp;lt;/sub&amp;gt; penalty &amp;lt;math&amp;gt;\tfrac{\lambda&amp;#039;}{2}\|\theta\|_2^2&amp;lt;/math&amp;gt; to the loss is mathematically equivalent to additionally multiplying the parameters by &amp;lt;math&amp;gt;(1-\lambda)&amp;lt;/math&amp;gt; at every step (that is, to true weight decay), with &amp;lt;math&amp;gt;\lambda&amp;#039; = \lambda/\alpha&amp;lt;/math&amp;gt; for learning rate &amp;lt;math&amp;gt;\alpha&amp;lt;/math&amp;gt;. Most deep-learning libraries exploit this equivalence and implement &amp;quot;weight decay&amp;quot; by simply adding &amp;lt;math&amp;gt;\lambda \theta&amp;lt;/math&amp;gt; to the gradient. The authors point out that this equivalence breaks down as soon as the optimizer rescales gradients adaptively, as in [[AdaGrad]], RMSProp, [[Adam]], or AMSGrad: the regularizer&amp;#039;s gradient is then divided by the same per-parameter denominator as the loss gradient, so weights with historically large gradients are regularized less than they would be under genuine weight decay.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--T:4--&amp;gt;&lt;br /&gt;
The paper&amp;#039;s central proposal is to &amp;#039;&amp;#039;&amp;#039;decouple&amp;#039;&amp;#039;&amp;#039; the decay step from the adaptive update: instead of folding &amp;lt;math&amp;gt;\lambda \theta&amp;lt;/math&amp;gt; into the gradient, multiply &amp;lt;math&amp;gt;\theta&amp;lt;/math&amp;gt; by &amp;lt;math&amp;gt;(1-\eta_t \lambda)&amp;lt;/math&amp;gt; after the Adam update. The result is AdamW. Empirically, AdamW (i) makes the optimal weight-decay factor and the optimal learning rate roughly orthogonal, and (ii) lifts Adam&amp;#039;s generalization on CIFAR-10, CIFAR-100, and ImageNet32×32 to be competitive with SGD with momentum, an outcome that previously required problem-specific switching between optimizers.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--T:5--&amp;gt;&lt;br /&gt;
== Key Contributions ==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--T:6--&amp;gt;&lt;br /&gt;
* A formal analysis showing that L&amp;lt;sub&amp;gt;2&amp;lt;/sub&amp;gt; regularization and weight decay are equivalent for vanilla SGD only after a learning-rate-dependent reparameterization, and are &amp;#039;&amp;#039;&amp;#039;not&amp;#039;&amp;#039;&amp;#039; equivalent for any optimizer whose preconditioner &amp;lt;math&amp;gt;\mathbf{M}_t&amp;lt;/math&amp;gt; is not a scalar multiple of the identity.&lt;br /&gt;
* AdamW and SGDW algorithms that decouple weight decay from the gradient-based update, parameterized by an explicit schedule multiplier &amp;lt;math&amp;gt;\eta_t&amp;lt;/math&amp;gt;.&lt;br /&gt;
* A &amp;quot;scale-adjusted L&amp;lt;sub&amp;gt;2&amp;lt;/sub&amp;gt;&amp;quot; interpretation: for an idealized adaptive optimizer with a fixed diagonal preconditioner, decoupled weight decay is equivalent to penalizing &amp;lt;math&amp;gt;\sum_i s_i \theta_i^2&amp;lt;/math&amp;gt;, regularizing parameters with large historical gradients more strongly.&lt;br /&gt;
* A demonstration that the optimal weight decay shrinks as the training budget grows, and a normalized parameterization &amp;lt;math&amp;gt;\lambda = \lambda_{\text{norm}} \sqrt{b/(BT)}&amp;lt;/math&amp;gt; (batch size &amp;lt;math&amp;gt;b&amp;lt;/math&amp;gt;, training-set size &amp;lt;math&amp;gt;B&amp;lt;/math&amp;gt;, &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt; epochs) that scales &amp;lt;math&amp;gt;\lambda&amp;lt;/math&amp;gt; with the total number of weight updates.&lt;br /&gt;
* AdamWR / SGDWR variants that combine decoupled weight decay with cosine-annealing warm restarts (SGDR), yielding both faster convergence and better final accuracy.&lt;br /&gt;
* Extensive ablations on CIFAR-10 with a 26 2×96d ResNet and on ImageNet32×32, covering training budgets of 100 to 1800 epochs and three learning-rate schedules.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--T:7--&amp;gt;&lt;br /&gt;
== Methods ==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--T:8--&amp;gt;&lt;br /&gt;
In the original formulation of weight decay due to Hanson &amp;amp; Pratt (1988), parameters evolve as&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--T:9--&amp;gt;&lt;br /&gt;
&amp;lt;math&amp;gt;\theta_{t+1} = (1-\lambda)\,\theta_t - \alpha \nabla f_t(\theta_t),&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--T:10--&amp;gt;&lt;br /&gt;
so the decay is applied independently of the optimizer&amp;#039;s gradient step. Most modern libraries instead absorb it into the loss as &amp;lt;math&amp;gt;f_t^{\text{reg}}(\theta) = f_t(\theta) + \tfrac{\lambda&amp;#039;}{2}\|\theta\|_2^2&amp;lt;/math&amp;gt; and let the optimizer differentiate; for plain SGD this reproduces the original update if &amp;lt;math&amp;gt;\lambda&amp;#039; = \lambda/\alpha&amp;lt;/math&amp;gt;.&lt;br /&gt;
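&lt;br /&gt;
A minimal numeric check of this equivalence for a single SGD step (an illustrative sketch, not code from the paper): with &amp;lt;math&amp;gt;\lambda&amp;#039; = \lambda/\alpha&amp;lt;/math&amp;gt;, folding the L&amp;lt;sub&amp;gt;2&amp;lt;/sub&amp;gt; gradient into the update reproduces the decayed iterate exactly.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;python&amp;quot;&amp;gt;&lt;br /&gt;
import math&lt;br /&gt;
&lt;br /&gt;
alpha = 0.1           # learning rate&lt;br /&gt;
lam = 0.01            # weight-decay factor in the Hanson-Pratt form&lt;br /&gt;
lam_l2 = lam / alpha  # equivalent L2 coefficient for plain SGD&lt;br /&gt;
&lt;br /&gt;
theta = 2.0           # a single parameter&lt;br /&gt;
grad = 0.5            # gradient of the unregularized loss at theta&lt;br /&gt;
&lt;br /&gt;
# (a) true weight decay: shrink the weight, then take the gradient step&lt;br /&gt;
decayed = (1.0 - lam) * theta - alpha * grad&lt;br /&gt;
&lt;br /&gt;
# (b) L2 regularization: fold lam_l2 * theta into the gradient&lt;br /&gt;
l2_step = theta - alpha * (grad + lam_l2 * theta)&lt;br /&gt;
&lt;br /&gt;
assert math.isclose(decayed, l2_step)  # identical for plain SGD&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;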
&lt;br /&gt;
&amp;lt;!--T:11--&amp;gt;&lt;br /&gt;
For an optimizer with iterates &amp;lt;math&amp;gt;\theta_{t+1} = \theta_t - \alpha \mathbf{M}_t \nabla f_t(\theta_t)&amp;lt;/math&amp;gt; the authors prove that whenever &amp;lt;math&amp;gt;\mathbf{M}_t \neq k\mathbf{I}&amp;lt;/math&amp;gt;, no choice of &amp;lt;math&amp;gt;\lambda&amp;#039;&amp;lt;/math&amp;gt; can make L&amp;lt;sub&amp;gt;2&amp;lt;/sub&amp;gt;-regularized optimization match weight-decayed optimization, because &amp;lt;math&amp;gt;\mathbf{M}_t&amp;lt;/math&amp;gt; rescales the regularizer term as well as the loss term. Adam&amp;#039;s diagonal preconditioner &amp;lt;math&amp;gt;\hat{v}_t^{-1/2}&amp;lt;/math&amp;gt; falls squarely in this regime.&lt;br /&gt;
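&lt;br /&gt;
A small numeric illustration of the same point (the values are arbitrary and the snippet is a sketch, not the analysis from the paper): with a fixed diagonal preconditioner and a zero loss gradient, an L&amp;lt;sub&amp;gt;2&amp;lt;/sub&amp;gt; penalty shrinks two equal parameters by different amounts, whereas decoupled weight decay shrinks both by the same factor, so no single &amp;lt;math&amp;gt;\lambda&amp;#039;&amp;lt;/math&amp;gt; reproduces it.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;python&amp;quot;&amp;gt;&lt;br /&gt;
# Two parameters with the same value but different diagonal preconditioner&lt;br /&gt;
# entries, as in Adam, where the scaling depends on past gradient magnitudes.&lt;br /&gt;
theta = [1.0, 1.0]&lt;br /&gt;
M = [2.0, 0.5]          # per-parameter scaling, not a multiple of the identity&lt;br /&gt;
alpha, lam, lam_l2 = 0.1, 0.05, 0.5&lt;br /&gt;
grad = [0.0, 0.0]       # zero loss gradient isolates the regularization effect&lt;br /&gt;
&lt;br /&gt;
# L2 in the loss: the decay term is rescaled by M, so shrinkage differs&lt;br /&gt;
l2 = [t - alpha * m * (g + lam_l2 * t) for t, m, g in zip(theta, M, grad)]&lt;br /&gt;
&lt;br /&gt;
# Decoupled weight decay: every parameter is shrunk by the same factor&lt;br /&gt;
wd = [(1.0 - lam) * t - alpha * m * g for t, m, g in zip(theta, M, grad)]&lt;br /&gt;
&lt;br /&gt;
print(l2)   # [0.9, 0.975]: unequal effective decay&lt;br /&gt;
print(wd)   # [0.95, 0.95]: uniform decay, as intended&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;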
&lt;br /&gt;
&amp;lt;!--T:12--&amp;gt;&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;SGDW&amp;#039;&amp;#039;&amp;#039; replaces line 9 of the SGD-with-momentum loop with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--T:13--&amp;gt;&lt;br /&gt;
&amp;lt;math&amp;gt;\theta_t \leftarrow \theta_{t-1} - m_t - \eta_t \lambda \theta_{t-1},&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--T:14--&amp;gt;&lt;br /&gt;
so the decay term sits outside the momentum buffer. &amp;#039;&amp;#039;&amp;#039;AdamW&amp;#039;&amp;#039;&amp;#039; replaces Adam&amp;#039;s parameter update with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--T:15--&amp;gt;&lt;br /&gt;
&amp;lt;math&amp;gt;\theta_t \leftarrow \theta_{t-1} - \eta_t\!\left( \alpha\,\hat{m}_t / (\sqrt{\hat{v}_t}+\epsilon) + \lambda\,\theta_{t-1} \right),&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--T:16--&amp;gt;&lt;br /&gt;
where &amp;lt;math&amp;gt;\eta_t&amp;lt;/math&amp;gt; is a global schedule multiplier (constant, drop-step, or cosine annealing). When &amp;lt;math&amp;gt;\eta_t&amp;lt;/math&amp;gt; follows the cosine-with-restarts schedule of SGDR, the resulting optimizer is denoted AdamWR (or SGDWR for its SGD counterpart); each warm restart resets &amp;lt;math&amp;gt;\eta_t&amp;lt;/math&amp;gt; to its maximum and begins a new, typically longer, annealing period.&lt;br /&gt;
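&lt;br /&gt;
A minimal scalar sketch of the decoupled update above, plus a cosine schedule multiplier of the kind used for AdamWR (illustrative only; real implementations operate on whole parameter tensors and keep optimizer state per parameter):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;python&amp;quot;&amp;gt;&lt;br /&gt;
import math&lt;br /&gt;
&lt;br /&gt;
def adamw_step(theta, grad, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999,&lt;br /&gt;
               eps=1e-8, lam=1e-2, eta=1.0):&lt;br /&gt;
    # Exponential moving averages of the gradient and its square, as in Adam&lt;br /&gt;
    m = beta1 * m + (1 - beta1) * grad&lt;br /&gt;
    v = beta2 * v + (1 - beta2) * grad * grad&lt;br /&gt;
    m_hat = m / (1 - beta1 ** t)   # bias correction, t = 1, 2, ...&lt;br /&gt;
    v_hat = v / (1 - beta2 ** t)&lt;br /&gt;
    # Decoupled step: the lam * theta term is not divided by sqrt(v_hat)&lt;br /&gt;
    theta = theta - eta * (alpha * m_hat / (math.sqrt(v_hat) + eps) + lam * theta)&lt;br /&gt;
    return theta, m, v&lt;br /&gt;
&lt;br /&gt;
def cosine_eta(t, period):&lt;br /&gt;
    # Schedule multiplier annealed from 1 to 0 over one period;&lt;br /&gt;
    # AdamWR restarts the schedule with progressively longer periods.&lt;br /&gt;
    return 0.5 * (1.0 + math.cos(math.pi * t / period))&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;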
&lt;br /&gt;
&amp;lt;!--T:17--&amp;gt;&lt;br /&gt;
To make hyperparameters comparable across training budgets, the paper introduces a normalized weight decay &amp;lt;math&amp;gt;\lambda_{\text{norm}}&amp;lt;/math&amp;gt;, setting &amp;lt;math&amp;gt;\lambda = \lambda_{\text{norm}} \sqrt{b/(BT)}&amp;lt;/math&amp;gt; for batch size &amp;lt;math&amp;gt;b&amp;lt;/math&amp;gt;, training-set size &amp;lt;math&amp;gt;B&amp;lt;/math&amp;gt;, and &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt; epochs, motivated by the empirical observation that the optimal raw &amp;lt;math&amp;gt;\lambda&amp;lt;/math&amp;gt; falls as the number of weight updates &amp;lt;math&amp;gt;BT/b&amp;lt;/math&amp;gt; grows.&lt;br /&gt;
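&lt;br /&gt;
A worked example of the normalization (the numbers are illustrative placeholders, not tuned values from the paper):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;python&amp;quot;&amp;gt;&lt;br /&gt;
import math&lt;br /&gt;
&lt;br /&gt;
def raw_weight_decay(lam_norm, batch_size, train_points, epochs):&lt;br /&gt;
    # lambda = lambda_norm * sqrt(b / (B * T)): the same lambda_norm maps to a&lt;br /&gt;
    # smaller raw decay factor when the run performs more weight updates.&lt;br /&gt;
    return lam_norm * math.sqrt(batch_size / (train_points * epochs))&lt;br /&gt;
&lt;br /&gt;
short_run = raw_weight_decay(0.025, 128, 50000, 100)&lt;br /&gt;
long_run = raw_weight_decay(0.025, 128, 50000, 1800)&lt;br /&gt;
print(short_run, long_run)   # the 18x longer budget gets a smaller raw lambda&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;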
&lt;br /&gt;
&amp;lt;!--T:18--&amp;gt;&lt;br /&gt;
== Results ==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--T:19--&amp;gt;&lt;br /&gt;
On CIFAR-10 with a 26 2×96d ResNet trained for 100 epochs, AdamW reaches roughly 5.0 % test error versus about 6.0 % for vanilla Adam with L&amp;lt;sub&amp;gt;2&amp;lt;/sub&amp;gt; regularization — a relative improvement of around 15 %. SGDW gives essentially the same result as well-tuned SGD with L&amp;lt;sub&amp;gt;2&amp;lt;/sub&amp;gt;, but its hyperparameter landscape is markedly simpler: heatmaps over &amp;lt;math&amp;gt;(\alpha, \lambda)&amp;lt;/math&amp;gt; show diagonal &amp;quot;valleys&amp;quot; of equal performance for L&amp;lt;sub&amp;gt;2&amp;lt;/sub&amp;gt;-regularized optimizers and roughly axis-aligned basins for the decoupled variants, confirming that decoupling makes the two hyperparameters approximately separable.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--T:20--&amp;gt;&lt;br /&gt;
On ImageNet32×32, AdamW improves top-1 and top-5 accuracy over Adam-with-L&amp;lt;sub&amp;gt;2&amp;lt;/sub&amp;gt; across all budgets tested. Adding cosine annealing further improves both Adam and AdamW, and AdamWR with warm restarts matches or exceeds AdamW with a fixed schedule while reaching competitive accuracy in a fraction of the wall-clock time at intermediate snapshots. SGDWR exhibits the same pattern relative to SGDW.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--T:21--&amp;gt;&lt;br /&gt;
The paper also reports that the optimal weight decay decreases predictably as the training budget grows: longer schedules require smaller &amp;lt;math&amp;gt;\lambda&amp;lt;/math&amp;gt;, and the proposed normalized parameterization &amp;lt;math&amp;gt;\lambda_{\text{norm}}&amp;lt;/math&amp;gt; transfers reasonably well across budgets, reducing the cost of grid search.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--T:22--&amp;gt;&lt;br /&gt;
A subtler finding is that the popular practice of folding weight decay into the loss-side L&amp;lt;sub&amp;gt;2&amp;lt;/sub&amp;gt; term in Adam distorts the regularization per parameter: parameters with large historical gradients have the decay term divided by a large adaptive denominator and are regularized by a smaller relative amount, while parameters with sparse or low-magnitude gradients are shrunk comparatively more, so the effective decay no longer matches the practitioner&amp;#039;s single intended &amp;lt;math&amp;gt;\lambda&amp;lt;/math&amp;gt;. AdamW removes this implicit per-parameter rescaling, restoring uniform shrinkage across the network and making weight-decay sweeps far more interpretable.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--T:23--&amp;gt;&lt;br /&gt;
The authors further verify that AdamW&amp;#039;s gains are not an artifact of changing the implicit learning rate: the comparison is run with separately tuned step sizes for both variants, and AdamW dominates Adam-with-L&amp;lt;sub&amp;gt;2&amp;lt;/sub&amp;gt; across the entire two-dimensional &amp;lt;math&amp;gt;(\alpha, \lambda)&amp;lt;/math&amp;gt; grid, not only at a single optimum.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--T:24--&amp;gt;&lt;br /&gt;
== Impact ==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--T:25--&amp;gt;&lt;br /&gt;
AdamW has become the standard optimizer for a large fraction of contemporary deep learning, particularly for [[Transformer (machine learning model)|transformers]] in language and vision. Mainstream frameworks ship native implementations (&amp;lt;code&amp;gt;torch.optim.AdamW&amp;lt;/code&amp;gt; in PyTorch since 1.2, &amp;lt;code&amp;gt;tf.keras.optimizers.AdamW&amp;lt;/code&amp;gt; in TensorFlow/Keras), and the optimizer is the default in Hugging Face Transformers and a standard choice in other training stacks such as timm. Practitioners typically tune AdamW with a small weight-decay coefficient (often around 0.01 to 0.1) and a cosine or linear-warmup learning-rate schedule, paralleling the AdamWR recipe.&lt;br /&gt;
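&lt;br /&gt;
A typical setup along these lines (a sketch only; the tiny model and the specific hyperparameter values are placeholders rather than recommendations from the paper):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;python&amp;quot;&amp;gt;&lt;br /&gt;
import torch&lt;br /&gt;
&lt;br /&gt;
model = torch.nn.Linear(784, 10)   # placeholder model&lt;br /&gt;
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.05)&lt;br /&gt;
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10_000)&lt;br /&gt;
&lt;br /&gt;
for step in range(10_000):&lt;br /&gt;
    loss = model(torch.randn(32, 784)).square().mean()   # dummy objective&lt;br /&gt;
    loss.backward()&lt;br /&gt;
    optimizer.step()       # decoupled weight decay is applied inside step()&lt;br /&gt;
    optimizer.zero_grad()&lt;br /&gt;
    scheduler.step()&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;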
&lt;br /&gt;
&amp;lt;!--T:26--&amp;gt;&lt;br /&gt;
Beyond engineering practice, the paper has shaped how regularization is discussed in deep-learning research: the distinction between &amp;quot;true weight decay&amp;quot; and &amp;quot;L&amp;lt;sub&amp;gt;2&amp;lt;/sub&amp;gt; as a loss penalty&amp;quot; is now standard terminology, and subsequent work on optimizer design (for example LAMB, Adafactor, and Lion) explicitly considers whether and how to decouple shrinkage from adaptive scaling. The paper&amp;#039;s hyperparameter normalization arguments also influenced later studies of how learning rate, weight decay, and batch size jointly determine the implicit regularization of large-batch training.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--T:27--&amp;gt;&lt;br /&gt;
A common follow-up question is whether to apply weight decay uniformly or to exclude bias terms, layer-norm scales, and embedding tables. The decoupling principle does not by itself answer this; it merely clarifies that whichever choice is made is honored exactly by AdamW, not warped by adaptive scaling. Most modern training recipes adopt a &amp;quot;decay everything except norm and bias&amp;quot; convention layered on top of AdamW.&lt;br /&gt;
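&lt;br /&gt;
One common way to express that convention with &amp;lt;code&amp;gt;torch.optim.AdamW&amp;lt;/code&amp;gt; (a sketch; treating every 1-D tensor as a bias or normalization scale is a heuristic convention layered on top of the paper, not part of it):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;python&amp;quot;&amp;gt;&lt;br /&gt;
import torch&lt;br /&gt;
&lt;br /&gt;
def split_decay_groups(model, weight_decay=0.01):&lt;br /&gt;
    # Biases and normalization scales are 1-D tensors; leave them undecayed.&lt;br /&gt;
    decay, no_decay = [], []&lt;br /&gt;
    for p in model.parameters():&lt;br /&gt;
        if p.requires_grad:&lt;br /&gt;
            (no_decay if p.ndim == 1 else decay).append(p)&lt;br /&gt;
    return [&lt;br /&gt;
        dict(params=decay, weight_decay=weight_decay),&lt;br /&gt;
        dict(params=no_decay, weight_decay=0.0),&lt;br /&gt;
    ]&lt;br /&gt;
&lt;br /&gt;
model = torch.nn.Sequential(torch.nn.Linear(16, 32), torch.nn.LayerNorm(32))&lt;br /&gt;
optimizer = torch.optim.AdamW(split_decay_groups(model), lr=1e-3)&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;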
&lt;br /&gt;
&amp;lt;!--T:28--&amp;gt;&lt;br /&gt;
The 2017 paper was eventually published as a conference paper at ICLR 2019, and the authors&amp;#039; reference implementations of AdamW, SGDW, AdamWR, and SGDWR remain a standard benchmark for new adaptive optimizers and regularization schemes.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--T:29--&amp;gt;&lt;br /&gt;
== See also ==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--T:30--&amp;gt;&lt;br /&gt;
* [[Adam]]&lt;br /&gt;
* [[Stochastic gradient descent]]&lt;br /&gt;
* [[Regularization (mathematics)]]&lt;br /&gt;
* [[Tikhonov regularization]]&lt;br /&gt;
* [[Hyperparameter optimization]]&lt;br /&gt;
* [[Deep learning]]&lt;br /&gt;
* [[Neural network]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--T:31--&amp;gt;&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--T:32--&amp;gt;&lt;br /&gt;
* Loshchilov, I., &amp;amp; Hutter, F. (2017). &amp;#039;&amp;#039;Decoupled Weight Decay Regularization&amp;#039;&amp;#039;. [https://arxiv.org/abs/1711.05101 arXiv:1711.05101]. Published at ICLR 2019.&lt;br /&gt;
* Hanson, S. J., &amp;amp; Pratt, L. Y. (1988). Comparing biases for minimal network construction with back-propagation. &amp;#039;&amp;#039;Advances in Neural Information Processing Systems 1&amp;#039;&amp;#039;.&lt;br /&gt;
* Kingma, D. P., &amp;amp; Ba, J. (2014). Adam: A Method for Stochastic Optimization. [https://arxiv.org/abs/1412.6980 arXiv:1412.6980].&lt;br /&gt;
* Loshchilov, I., &amp;amp; Hutter, F. (2016). SGDR: Stochastic Gradient Descent with Warm Restarts. [https://arxiv.org/abs/1608.03983 arXiv:1608.03983].&lt;br /&gt;
* Wilson, A. C., Roelofs, R., Stern, M., Srebro, N., &amp;amp; Recht, B. (2017). The Marginal Value of Adaptive Gradient Methods in Machine Learning. [https://arxiv.org/abs/1705.08292 arXiv:1705.08292].&lt;br /&gt;
* Reddi, S. J., Kale, S., &amp;amp; Kumar, S. (2018). On the Convergence of Adam and Beyond. ICLR 2018.&lt;br /&gt;
* Source code: [https://github.com/loshchil/AdamW-and-SGDW github.com/loshchil/AdamW-and-SGDW].&lt;br /&gt;
&amp;lt;/translate&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[Category:Machine Learning]]&lt;br /&gt;
[[Category:Research]]&lt;br /&gt;
[[Category:Research Papers]]&lt;/div&gt;</summary>
		<author><name>DeployBot</name></author>
	</entry>
</feed>