Decoupled Weight Decay Regularization

Research Paper
Authors	Ilya Loshchilov; Frank Hutter
Year	2017
Topic area	Machine Learning
Difficulty	Research
arXiv	1711.05101

Ilya Loshchilov & Frank Hutter
University of Freiburg
Freiburg, Germany
{ilya,fh}@cs.uni-freiburg.de

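Abstract

L₂ regularization and weight decay regularization are equivalent for standard stochastic gradient descent (when rescaled by the learning rate), but as we demonstrate this is not the case for adaptive gradient algorithms, such as Adam. While common implementations of these algorithms employ L₂ regularization (often calling it "weight decay", which may be misleading given the inequivalence we expose), we propose a simple modification to recover the original formulation of weight decay regularization by decoupling the weight decay from the optimization steps taken w.r.t. the loss function. We provide empirical evidence that our proposed modification (i) decouples the optimal choice of weight decay factor from the setting of the learning rate for both standard SGD and Adam and (ii) substantially improves Adam's generalization performance, allowing it to compete with SGD with momentum on image classification datasets (on which it was previously typically outperformed by the latter). Our proposed decoupled weight decay has already been adopted by many researchers, and the community has implemented it in TensorFlow and PyTorch; the complete source code for our experiments is available at https://github.com/loshchil/AdamW-and-SGDW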

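1 Introduction

Adaptive gradient methods, such as AdaGrad (Duchi et al., 2011), RMSProp (Tieleman & Hinton, 2012), Adam (Kingma & Ba, 2014) and most recently AMSGrad (Reddi et al., 2018), have become a default method of choice for training feed-forward and recurrent neural networks (Xu et al., 2015; Radford et al., 2015). Nevertheless, state-of-the-art results for popular image classification datasets, such as CIFAR-10 and CIFAR-100 (Krizhevsky, 2009), are still obtained by applying SGD with momentum (Gastaldi, 2017; Cubuk et al., 2018). Furthermore, Wilson et al. (2017) suggested that adaptive gradient methods do not generalize as well as SGD with momentum when tested on a diverse set of deep learning tasks, such as image classification, character-level language modeling and constituency parsing. Different hypotheses about the origins of this worse generalization have been investigated, such as the presence of sharp local minima (Keskar et al., 2016; Dinh et al., 2017) and inherent problems of adaptive gradient methods (Wilson et al., 2017). In this paper, we investigate whether it is better to use L₂ regularization or weight decay regularization to train deep neural networks with SGD and Adam. We show that a major factor in the poor generalization of the most popular adaptive gradient method, Adam, is that L₂ regularization is not nearly as effective for it as for SGD. Specifically, our analysis of Adam leads to the following observations: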

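• L₂ regularization and weight decay are not identical. The two techniques can be made equivalent for SGD by a reparameterization of the weight decay factor based on the learning rate; however, as is often overlooked, this is not the case for Adam. In particular, when combined with adaptive gradients, L₂ regularization leads to weights with large historic parameter and/or gradient amplitudes being regularized less than they would be when using weight decay.
• L₂ regularization is not effective in Adam. One possible explanation why Adam and other adaptive gradient methods might be outperformed by SGD with momentum is that common deep learning libraries only implement L₂ regularization, not the original weight decay. Therefore, on tasks/datasets where the use of L₂ regularization is beneficial for SGD (e.g., on many popular image classification datasets), Adam leads to worse results than SGD with momentum (for which L₂ regularization behaves as expected).
• Weight decay is equally effective in both SGD and Adam. For SGD, it is equivalent to L₂ regularization, while for Adam it is not.
• Optimal weight decay depends on the total number of batch passes/weight updates. Our empirical analysis of SGD and Adam suggests that the larger the runtime/number of batch passes to be performed, the smaller the optimal weight decay.
• Adam can substantially benefit from a scheduled learning rate multiplier. The fact that Adam is an adaptive gradient algorithm and as such adapts the learning rate for each parameter does not rule out the possibility of substantially improving its performance by using a global learning rate multiplier, scheduled, e.g., by cosine annealing.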

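The main contribution of this paper is to improve regularization in Adam by decoupling the weight decay from the gradient-based update. In a comprehensive analysis, we show that Adam generalizes substantially better with decoupled weight decay than with L₂ regularization, achieving a 15% relative improvement in test error (see Figures 2 and 3); this holds true for various image recognition datasets (CIFAR-10 and ImageNet32x32), training budgets (ranging from 100 to 1800 epochs), and learning rate schedules (fixed, drop-step, and cosine annealing; see Figure 1). We also demonstrate that our decoupled weight decay renders the optimal settings of the learning rate and the weight decay factor much more independent, thereby easing hyperparameter optimization (see Figure 2).

The main motivation of this paper is to improve Adam to make it competitive w.r.t. SGD with momentum even for those problems where it did not use to be competitive. We hope that as a result, practitioners do not need to switch between Adam and SGD anymore, which in turn should reduce the common issue of selecting dataset/task-specific training algorithms and their hyperparameters.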

2 Decoupling the Weight Decay from the Gradient-based Update

In the weight decay described by Hanson & Pratt (1988), the weights $\boldsymbol{\theta}$ decay exponentially as

	$\boldsymbol{\theta}_{t+1} = (1 - \lambda)\,\boldsymbol{\theta}_t - \alpha \nabla f_t(\boldsymbol{\theta}_t),$		(1)

where $\lambda$ defines the rate of the weight decay per step and $\nabla f_t(\boldsymbol{\theta}_t)$ is the $t$-th batch gradient to be multiplied by a learning rate $\alpha$. For standard SGD, it is equivalent to standard L₂ regularization:

Proposition 1 (Weight decay = L₂ regularization for standard SGD).

Standard SGD with base learning rate $\alpha$ executes the same steps on batch loss functions $f_t(\boldsymbol{\theta})$ with weight decay $\lambda$ (defined in Equation 1) as it executes without weight decay on $f_t^{\text{reg}}(\boldsymbol{\theta}) = f_t(\boldsymbol{\theta}) + \frac{\lambda'}{2}\|\boldsymbol{\theta}\|_2^2$, with $\lambda' = \frac{\lambda}{\alpha}$.

The proofs of this well-known fact, as well as of our other propositions, are given in Appendix A.

Due to this equivalence, L₂ regularization is very frequently referred to as weight decay, including in popular deep learning libraries. However, as we will demonstrate later in this section, this equivalence does not hold for adaptive gradient methods. One fact that is often overlooked already for the simple case of SGD is that in order for the equivalence to hold, the L₂ regularizer $\lambda'$ has to be set to $\frac{\lambda}{\alpha}$, i.e., if there is an overall best weight decay value $\lambda$, the best value of $\lambda'$ is tightly coupled with the learning rate $\alpha$. In order to decouple the effects of these two hyperparameters, we advocate decoupling the weight decay step as proposed by Hanson & Pratt (1988) (Equation 1).
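As an illustration of Proposition 1 (not part of the original experiments), the equivalence can be checked numerically for a single SGD step. The sketch below uses plain Python lists; the function names and the example values of the parameters and gradient are hypothetical.

```python
def sgd_weight_decay_step(theta, grad, alpha, lam):
    """Equation 1: theta <- (1 - lam) * theta - alpha * grad."""
    return [(1.0 - lam) * th - alpha * g for th, g in zip(theta, grad)]


def sgd_l2_step(theta, grad, alpha, lam_prime):
    """Plain SGD on f(theta) + (lam_prime / 2) * ||theta||^2."""
    return [th - alpha * (g + lam_prime * th) for th, g in zip(theta, grad)]


def implied_weight_decay(alpha, lam_prime):
    """Weight decay rate that a given L2 coefficient induces under SGD."""
    return alpha * lam_prime


theta = [0.5, -1.2, 3.0]   # hypothetical parameters
grad = [0.1, -0.4, 2.0]    # hypothetical batch gradient
alpha, lam = 0.1, 0.01

wd = sgd_weight_decay_step(theta, grad, alpha, lam)
l2 = sgd_l2_step(theta, grad, alpha, lam / alpha)   # lam' = lam / alpha
max_gap = max(abs(a - b) for a, b in zip(wd, l2))
print("largest difference between the two updates:", max_gap)

# Keeping lam' fixed while halving the learning rate halves the induced decay,
# which is the coupling between lam' and alpha discussed above.
print(implied_weight_decay(0.1, 0.1), implied_weight_decay(0.05, 0.1))
```

The two updates agree up to floating-point rounding only because $\lambda'$ is set to $\lambda/\alpha$; any other choice breaks the match.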

Algorithm 1 SGD with L₂ regularization and SGD with decoupled weight decay (SGDW), both with momentum
1: given initial learning rate $\alpha \in \mathbb{R}$, momentum factor $\beta_1 \in \mathbb{R}$, weight decay/L₂ regularization factor $\lambda \in \mathbb{R}$
2: initialize time step $t \leftarrow 0$, parameter vector $\boldsymbol{\theta}_{t=0} \in \mathbb{R}^n$, first moment vector $\mathbf{m}_{t=0} \leftarrow \mathbf{0}$, schedule multiplier $\eta_{t=0} \in \mathbb{R}$
3: repeat
4:   $t \leftarrow t + 1$
5:   $\nabla f_t(\boldsymbol{\theta}_{t-1}) \leftarrow \text{SelectBatch}(\boldsymbol{\theta}_{t-1})$   $\rhd$ select batch and return the corresponding gradient
6:   $\mathbf{g}_t \leftarrow \nabla f_t(\boldsymbol{\theta}_{t-1})$ $+\, \lambda\boldsymbol{\theta}_{t-1}$   $\rhd$ the term $+\lambda\boldsymbol{\theta}_{t-1}$ belongs to the L₂ regularization variant only
7:   $\eta_t \leftarrow \text{SetScheduleMultiplier}(t)$   $\rhd$ can be fixed, decay, or be used for warm restarts
8:   $\mathbf{m}_t \leftarrow \beta_1 \mathbf{m}_{t-1} + \eta_t \alpha \mathbf{g}_t$
9:   $\boldsymbol{\theta}_t \leftarrow \boldsymbol{\theta}_{t-1} - \mathbf{m}_t$ $-\, \eta_t \lambda \boldsymbol{\theta}_{t-1}$   $\rhd$ the term $-\eta_t\lambda\boldsymbol{\theta}_{t-1}$ belongs to the decoupled weight decay variant (SGDW) only
10: until stopping criterion is met
11: return optimized parameters $\boldsymbol{\theta}_t$

Algorithm 2 Adam with L₂ regularization and Adam with decoupled weight decay (AdamW)
1: given $\alpha = 0.001$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$, $\lambda \in \mathbb{R}$
2: initialize time step $t \leftarrow 0$, parameter vector $\boldsymbol{\theta}_{t=0} \in \mathbb{R}^n$, first moment vector $\mathbf{m}_{t=0} \leftarrow \mathbf{0}$, second moment vector $\mathbf{v}_{t=0} \leftarrow \mathbf{0}$, schedule multiplier $\eta_{t=0} \in \mathbb{R}$
3: repeat
4:   $t \leftarrow t + 1$
5:   $\nabla f_t(\boldsymbol{\theta}_{t-1}) \leftarrow \text{SelectBatch}(\boldsymbol{\theta}_{t-1})$   $\rhd$ select batch and return the corresponding gradient
6:   $\mathbf{g}_t \leftarrow \nabla f_t(\boldsymbol{\theta}_{t-1})$ $+\, \lambda\boldsymbol{\theta}_{t-1}$   $\rhd$ the term $+\lambda\boldsymbol{\theta}_{t-1}$ belongs to the L₂ regularization variant only
7:   $\mathbf{m}_t \leftarrow \beta_1 \mathbf{m}_{t-1} + (1 - \beta_1)\,\mathbf{g}_t$   $\rhd$ here and below all operations are element-wise
8:   $\mathbf{v}_t \leftarrow \beta_2 \mathbf{v}_{t-1} + (1 - \beta_2)\,\mathbf{g}_t^2$
9:   $\hat{\mathbf{m}}_t \leftarrow \mathbf{m}_t / (1 - \beta_1^t)$   $\rhd$ $\beta_1$ is taken to the power of $t$
10:  $\hat{\mathbf{v}}_t \leftarrow \mathbf{v}_t / (1 - \beta_2^t)$   $\rhd$ $\beta_2$ is taken to the power of $t$
11:  $\eta_t \leftarrow \text{SetScheduleMultiplier}(t)$   $\rhd$ can be fixed, decay, or also be used for warm restarts
12:  $\boldsymbol{\theta}_t \leftarrow \boldsymbol{\theta}_{t-1} - \eta_t \left( \alpha \hat{\mathbf{m}}_t / (\sqrt{\hat{\mathbf{v}}_t} + \epsilon) \; + \lambda\boldsymbol{\theta}_{t-1} \right)$   $\rhd$ the term $+\lambda\boldsymbol{\theta}_{t-1}$ belongs to the decoupled weight decay variant (AdamW) only
13: until stopping criterion is met
14: return optimized parameters $\boldsymbol{\theta}_t$

Looking first at the case of SGD, we propose to decay the weights simultaneously with the update of $\boldsymbol{\theta}_t$ based on gradient information in line 9 of Algorithm 1. This yields our proposed variant of SGD with momentum using decoupled weight decay (SGDW). This simple modification explicitly decouples $\lambda$ and $\alpha$ (although some problem-dependent implicit coupling may of course remain, as for any two hyperparameters). In order to account for a possible scheduling of both $\alpha$ and $\lambda$, we introduce a scaling factor $\eta_t$ delivered by a user-defined procedure $\text{SetScheduleMultiplier}(t)$.
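The two variants of Algorithm 1 can be sketched as a single update function. This is a minimal illustration in plain Python rather than the code used for our experiments; the function name `sgd_momentum_step` and the `decoupled` flag are hypothetical.

```python
def sgd_momentum_step(theta, m, grad, alpha, beta1, lam, eta, decoupled):
    """One iteration of Algorithm 1 on lists of floats.

    decoupled=False: L2 variant (lam * theta is added to the gradient, line 6).
    decoupled=True:  SGDW (weights are decayed directly in line 9).
    Returns the new parameters and the new momentum buffer.
    """
    new_theta, new_m = [], []
    for th, mi, g in zip(theta, m, grad):
        if not decoupled:
            g = g + lam * th                  # line 6, L2 variant only
        mi = beta1 * mi + eta * alpha * g     # line 8
        th_next = th - mi                     # line 9, gradient part
        if decoupled:
            th_next -= eta * lam * th         # line 9, SGDW part
        new_theta.append(th_next)
        new_m.append(mi)
    return new_theta, new_m


theta = [1.0, -2.0]
zeros = [0.0, 0.0]

# With a zero gradient, SGDW shrinks the weights by (1 - eta * lam)
# no matter which learning rate is used ...
sgdw_a, _ = sgd_momentum_step(theta, zeros, zeros, 0.1, 0.9, 0.1, 1.0, True)
sgdw_b, _ = sgd_momentum_step(theta, zeros, zeros, 0.001, 0.9, 0.1, 1.0, True)

# ... whereas with L2 regularization the shrinkage scales with alpha.
l2_a, _ = sgd_momentum_step(theta, zeros, zeros, 0.1, 0.9, 0.1, 1.0, False)
l2_b, _ = sgd_momentum_step(theta, zeros, zeros, 0.001, 0.9, 0.1, 1.0, False)
print(sgdw_a, sgdw_b, l2_a, l2_b)
```

The zero-gradient case isolates the regularization term, which makes the explicit decoupling of $\lambda$ from $\alpha$ directly visible.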

Now, let us turn to adaptive gradient algorithms like the popular optimizer Adam (Kingma & Ba, 2014), which scale gradients by their historic magnitudes. Intuitively, when Adam is run on a loss function $f$ plus L₂ regularization, weights that tend to have large gradients in $f$ do not get regularized as much as they would with decoupled weight decay, since the gradient of the regularizer gets scaled along with the gradient of $f$. This leads to an inequivalence of L₂ and decoupled weight decay regularization for adaptive gradient algorithms:

Proposition 2 (Weight decay ≠ L₂ regularization for adaptive gradients).

Let $O$ denote an optimizer that has iterates $\boldsymbol{\theta}_{t+1} \leftarrow \boldsymbol{\theta}_t - \alpha \mathbf{M}_t \nabla f_t(\boldsymbol{\theta}_t)$ when run on batch loss function $f_t(\boldsymbol{\theta})$ without weight decay, and $\boldsymbol{\theta}_{t+1} \leftarrow (1 - \lambda)\,\boldsymbol{\theta}_t - \alpha \mathbf{M}_t \nabla f_t(\boldsymbol{\theta}_t)$ when run on $f_t(\boldsymbol{\theta})$ with weight decay, respectively, with $\mathbf{M}_t \neq k\mathbf{I}$ (where $k \in \mathbb{R}$). Then, for $O$ there exists no L₂ coefficient $\lambda'$ such that running $O$ on batch loss $f_t^{\text{reg}}(\boldsymbol{\theta}) = f_t(\boldsymbol{\theta}) + \frac{\lambda'}{2}\|\boldsymbol{\theta}\|_2^2$ without weight decay is equivalent to running $O$ on $f_t(\boldsymbol{\theta})$ with decay $\lambda \in \mathbb{R}^{+}$.

We decouple weight decay and loss-based gradient updates in Adam as shown in line 12 of Algorithm 2; this gives rise to our variant of Adam with decoupled weight decay (AdamW).
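To make the difference between the two variants of Algorithm 2 concrete, the following minimal sketch implements a single step of each in plain Python. It is an illustration under stated assumptions, not the implementation used in our experiments or in any library; the function name `adam_step`, the `decoupled` flag and the example gradients are hypothetical.

```python
import math


def adam_step(theta, m, v, grad, t, lam, decoupled,
              alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8, eta=1.0):
    """One iteration of Algorithm 2 on lists of floats; returns (theta, m, v)."""
    new_theta, new_m, new_v = [], [], []
    for th, mi, vi, g in zip(theta, m, v, grad):
        if not decoupled:
            g = g + lam * th                       # line 6, L2 variant only
        mi = beta1 * mi + (1.0 - beta1) * g        # line 7
        vi = beta2 * vi + (1.0 - beta2) * g * g    # line 8
        m_hat = mi / (1.0 - beta1 ** t)            # line 9
        v_hat = vi / (1.0 - beta2 ** t)            # line 10
        step = alpha * m_hat / (math.sqrt(v_hat) + eps)
        if decoupled:
            step += lam * th                       # line 12, AdamW only
        new_theta.append(th - eta * step)
        new_m.append(mi)
        new_v.append(vi)
    return new_theta, new_m, new_v


theta0 = [1.0, 1.0]
grad = [0.01, 10.0]      # one small and one large hypothetical gradient
zeros = [0.0, 0.0]

adam_l2, _, _ = adam_step(theta0, zeros, zeros, grad, t=1, lam=0.01, decoupled=False)
adamw, _, _ = adam_step(theta0, zeros, zeros, grad, t=1, lam=0.01, decoupled=True)
print("Adam + L2:", adam_l2)
print("AdamW:    ", adamw)
```

In the first step, the adaptive normalization turns the L₂-regularized gradient into a step of size close to $\alpha$ for both weights, so the regularizer causes no additional shrinkage, while AdamW additionally shrinks both weights by $\lambda\boldsymbol{\theta}$.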

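Having shown that L₂ regularization and weight decay regularization differ for adaptive gradient algorithms raises the question of how they differ and how to interpret their effects. Their equivalence for standard SGD remains very helpful for intuition: both mechanisms push weights closer to zero, at the same rate. However, for adaptive gradient algorithms they differ: with L₂ regularization, the sums of the gradient of the loss function and the gradient of the regularizer (i.e., the squared L₂ norm of the weights) are adapted, whereas with decoupled weight decay, only the gradients of the loss function are adapted (with the weight decay step separated from the adaptive gradient mechanism). With L₂ regularization both types of gradients are normalized by their typical (summed) magnitudes, and therefore weights $x$ with large typical gradient magnitude $s$ are regularized by a smaller relative amount than other weights. In contrast, decoupled weight decay regularizes all weights with the same rate $\lambda$, effectively regularizing weights $x$ with large $s$ more than standard L₂ regularization does. We demonstrate this formally for a simple special case of an adaptive gradient algorithm with a fixed preconditioner: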

Proposition 3 (Weight decay = scale-adjusted L₂ regularization for adaptive gradient algorithm with fixed preconditioner).

Let $O$ denote an algorithm with the same characteristics as in Proposition 2, and using a fixed preconditioner matrix $\mathbf{M}_t = \text{diag}(\mathbf{s})^{-1}$ (with $s_i > 0$ for all $i$). Then, $O$ with base learning rate $\alpha$ executes the same steps on batch loss functions $f_t(\boldsymbol{\theta})$ with weight decay $\lambda$ as it executes without weight decay on the scale-adjusted regularized batch loss

	$f_t^{\text{sreg}}(\boldsymbol{\theta}) = f_t(\boldsymbol{\theta}) + \frac{\lambda'}{2}\left\|\boldsymbol{\theta} \odot \sqrt{\mathbf{s}}\right\|_2^2,$		(2)

where $\odot$ and $\sqrt{\cdot}$ denote element-wise multiplication and square root, respectively, and $\lambda' = \frac{\lambda}{\alpha}$.

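We note that this proposition does not directly apply to practical adaptive gradient algorithms, since these change the preconditioner matrix at every step. Nevertheless, it can still provide intuition about the equivalent loss function being optimized in each step: parameters $\theta_i$ with a large inverse preconditioner $s_i$ (which in practice would be caused by historically large gradients in dimension $i$) are regularized relatively more than they would be with L₂ regularization; specifically, the regularization is proportional to $\sqrt{s_i}$.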
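Proposition 3 can be checked numerically for one step with a fixed diagonal preconditioner. The sketch below is illustrative only; it uses the penalty coefficient $\frac{\lambda'}{2}$ that appears in the proof in Appendix A, and all function names and example values are hypothetical.

```python
def step_weight_decay(theta, grad, s, alpha, lam):
    """theta <- (1 - lam) * theta - alpha * grad / s (decoupled weight decay)."""
    return [(1.0 - lam) * th - alpha * g / si
            for th, g, si in zip(theta, grad, s)]


def step_scaled_l2(theta, grad, s, alpha, lam_prime):
    """Preconditioned step on f + (lam'/2) * ||theta * sqrt(s)||^2.

    The gradient of the penalty with respect to theta_i is lam' * s_i * theta_i.
    """
    return [th - alpha * (g + lam_prime * si * th) / si
            for th, g, si in zip(theta, grad, s)]


def step_plain_l2(theta, grad, s, alpha, lam_prime):
    """Preconditioned step on f + (lam'/2) * ||theta||^2 (standard L2)."""
    return [th - alpha * (g + lam_prime * th) / si
            for th, g, si in zip(theta, grad, s)]


theta = [1.0, 1.0]
grad = [0.3, -0.7]       # hypothetical batch gradient
s = [0.5, 4.0]           # fixed inverse preconditioner, s_i > 0
alpha, lam = 0.1, 0.02

wd = step_weight_decay(theta, grad, s, alpha, lam)
sreg = step_scaled_l2(theta, grad, s, alpha, lam / alpha)
gap = max(abs(a - b) for a, b in zip(wd, sreg))
print("weight decay vs scale-adjusted L2, largest difference:", gap)

# With a zero gradient, plain L2 shrinks the weight with large s_i far less
# than decoupled weight decay does.
no_grad = [0.0, 0.0]
print(step_plain_l2(theta, no_grad, s, alpha, lam / alpha))
print(step_weight_decay(theta, no_grad, s, alpha, lam))
```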

3 Justification of Decoupled Weight Decay via a View of Adaptive Gradient Methods as Bayesian Filtering

We now discuss a justification of decoupled weight decay in the framework of Bayesian filtering for a unified theory of adaptive gradient algorithms due to Aitchison (2018). After we posted a preliminary version of our current paper on arXiv, Aitchison noted that his theory "gives us a theoretical framework in which we can understand the superiority of this weight decay over $L_2$ regularization, because it is weight decay, rather than $L_2$ regularization that emerges through the straightforward application of Bayesian filtering" (Aitchison, 2018). While full credit for this theory goes to Aitchison, we summarize it here to shed some light on why weight decay may be favored over $L_2$ regularization.

Aitchison (2018) views the stochastic optimization of $n$ parameters $\theta_1, \ldots, \theta_n$ as a Bayesian filtering problem with the goal of inferring a distribution over the optimal values of each of the parameters $\theta_i$ given the current values of the other parameters $\boldsymbol{\theta}_{-i}(t)$ at time step $t$. When the other parameters do not change this is an optimization problem, but when they do change it becomes one of "tracking" the optimizer using Bayesian filtering as follows. One is given a probability distribution $P(\boldsymbol{\theta}_t \mid \mathbf{y}_{1:t})$ of the optimizer at time step $t$ that takes into account the data $\mathbf{y}_{1:t}$ from the first $t$ mini batches, a state transition prior $P(\boldsymbol{\theta}_{t+1} \mid \boldsymbol{\theta}_t)$ reflecting a (small) data-independent change in this distribution from one step to the next, and a likelihood $P(\mathbf{y}_{t+1} \mid \boldsymbol{\theta}_{t+1})$ derived from the mini batch at step $t+1$. The posterior distribution $P(\boldsymbol{\theta}_{t+1} \mid \mathbf{y}_{1:t+1})$ of the optimizer at time step $t+1$ can then be computed (as usual in Bayesian filtering) by marginalizing over $\boldsymbol{\theta}_t$ to obtain the one-step-ahead prediction $P(\boldsymbol{\theta}_{t+1} \mid \mathbf{y}_{1:t})$ and then applying Bayes' rule to incorporate the likelihood $P(\mathbf{y}_{t+1} \mid \boldsymbol{\theta}_{t+1})$. Aitchison (2018) assumes a Gaussian state transition distribution $P(\boldsymbol{\theta}_{t+1} \mid \boldsymbol{\theta}_t)$ and an approximate conjugate likelihood $P(\mathbf{y}_{t+1} \mid \boldsymbol{\theta}_{t+1})$, leading to the following closed-form update of the filtering distribution's mean:

	$\boldsymbol{\mu}_{post} = \boldsymbol{\mu}_{prior} + \boldsymbol{\Sigma}_{post} \times \mathbf{g},$		(3)

where $\mathbf{g}$ is the gradient of the log likelihood of the mini batch at time $t$. This result implies a preconditioner of the gradients that is given by the posterior uncertainty $\boldsymbol{\Sigma}_{post}$ of the filtering distribution: updates are larger for parameters we are more uncertain about and smaller for parameters we are more certain about. Aitchison (2018) goes on to show that popular adaptive gradient methods, such as Adam and RMSprop, as well as Kronecker-factorized methods are special cases of this framework.

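Decoupled weight decay very naturally fits into this unified framework as part of the state-transition distribution: Aitchison (2018) assumes a slow change of the optimizer according to the following Gaussian:

	$P(\boldsymbol{\theta}_{t+1} \mid \boldsymbol{\theta}_t) = \mathcal{N}\big((\mathbf{I} - \mathbf{A})\,\boldsymbol{\theta}_t,\; \mathbf{Q}\big),$		(4)

where $\mathbf{Q}$ is the covariance of Gaussian perturbations of the weights, and $\mathbf{A}$ is a regularizer to avoid values growing unboundedly over time. When instantiated as $\mathbf{A} = \lambda \times \mathbf{I}$, this regularizer $\mathbf{A}$ plays exactly the role of decoupled weight decay as described in Equation 1, since this leads to multiplying the current mean estimate $\boldsymbol{\theta}_t$ by $(1 - \lambda)$ at each step. Notably, this regularization is also directly applied to the prior and does not depend on the uncertainty in each of the parameters (which would be required for $L_2$ regularization).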
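The effect of the transition prior in Equation 4 on the filtering distribution can be written out explicitly. The short derivation below is a sketch under the additional assumption that the filtering distribution at step $t$ has mean $\boldsymbol{\mu}_t$ and covariance $\boldsymbol{\Sigma}_t$; these two symbols are introduced here for illustration only and do not appear in the summary above.

```latex
\begin{align*}
\mathbb{E}\left[\boldsymbol{\theta}_{t+1} \mid \mathbf{y}_{1:t}\right]
  &= (\mathbf{I} - \mathbf{A})\,\boldsymbol{\mu}_t
   = (1 - \lambda)\,\boldsymbol{\mu}_t
   && \text{for } \mathbf{A} = \lambda\,\mathbf{I}, \\
\operatorname{Cov}\left[\boldsymbol{\theta}_{t+1} \mid \mathbf{y}_{1:t}\right]
  &= (\mathbf{I} - \mathbf{A})\,\boldsymbol{\Sigma}_t\,(\mathbf{I} - \mathbf{A})^{\top} + \mathbf{Q}
   = (1 - \lambda)^2\,\boldsymbol{\Sigma}_t + \mathbf{Q}.
\end{align*}
```

The shrinkage factor $(1 - \lambda)$ acts on the mean before the gradient-based correction of Equation 3 is applied, and it is the same for every parameter regardless of its uncertainty, which mirrors the decoupled weight decay step.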

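4 Experimental Validation

We now evaluate the performance of decoupled weight decay under various training budgets and learning rate schedules. Our experimental setup follows that of Gastaldi (2017), who proposed, in addition to L₂ regularization, to apply the new Shake-Shake regularization to a 3-branch residual DNN, which allowed him to achieve a new state-of-the-art result of 2.86% on the CIFAR-10 dataset (Krizhevsky, 2009). We used the same model/source code based on fb.resnet.torch.¹ We always used a batch size of 128 and applied the regular data augmentation procedure for the CIFAR datasets. The base networks are a 26 2x64d ResNet (i.e., the network has a depth of 26, 2 residual branches, and the first residual block has a width of 64) and a 26 2x96d ResNet with 11.6M and 25.6M parameters, respectively. For a detailed description of the network and the Shake-Shake method, we refer the interested reader to Gastaldi (2017). We also performed experiments on the ImageNet32x32 dataset (Chrabaszcz et al., 2017), a downsampled version of the original ImageNet dataset with 1.2 million 32 $\times$ 32 pixel images.

¹ https://github.com/xgastaldi/shake-shake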

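4.1 Evaluating Decoupled Weight Decay With Different Learning Rate Schedules

In our first experiment, we compare Adam with $L_2$ regularization to Adam with decoupled weight decay (AdamW), using three different learning rate schedules: a fixed learning rate, a drop-step schedule, and a cosine annealing schedule (Loshchilov & Hutter, 2016). Since Adam already adapts its parameter-wise learning rates, it is not as common to use a learning rate multiplier schedule with it as it is with SGD, but as our results show, such schedules can substantially improve Adam's performance, and we advocate not overlooking their use for adaptive gradient algorithms.

For each learning rate schedule and weight decay variant, we trained a 2x64d ResNet for 100 epochs, using different settings of the initial learning rate $\alpha$ and the weight decay factor $\lambda$. Figure 1 shows that decoupled weight decay outperforms $L_2$ regularization for all learning rate schedules, with larger differences for better learning rate schedules. We also note that decoupled weight decay leads to a more separable hyperparameter search space, especially when a learning rate schedule such as drop-step or cosine annealing is applied. The figure also shows that cosine annealing clearly outperforms the other learning rate schedules; we thus used cosine annealing for the remainder of the experiments.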

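4.2 Decoupling the Weight Decay and Initial Learning Rate Parameters

In order to verify our hypothesis about the coupling of $\alpha$ and $\lambda$, in Figure 2 we compare the performance of L₂ regularization vs. decoupled weight decay in SGD (SGD vs. SGDW, top row) and in Adam (Adam vs. AdamW, bottom row). In SGD (Figure 2, top left), L₂ regularization is not decoupled from the learning rate (the common way as described in Algorithm 1), and the figure clearly shows that the basin of best hyperparameter settings (depicted by color, with the top-10 hyperparameter settings marked by black circles) is not aligned with the x-axis or y-axis but lies on the diagonal. This suggests that the two hyperparameters are interdependent and need to be changed simultaneously, while only changing one of them might substantially worsen results. Consider, e.g., the setting at the top left black circle ($\alpha = 1/2$, $\lambda = 1/8 \times 0.001$); only changing either $\alpha$ or $\lambda$ by itself would worsen results, while changing both of them could still yield clear improvements. We note that this coupling of initial learning rate and L₂ regularization factor might have contributed to SGD's reputation of being very sensitive to its hyperparameter settings.

In contrast, the results for SGD with decoupled weight decay (SGDW) in Figure 2 (top right) show that weight decay and initial learning rate are decoupled. The proposed approach renders the two hyperparameters more separable: even if the learning rate is not well tuned yet (e.g., consider the value of 1/1024 in Figure 2, top right), leaving it fixed and only optimizing the weight decay factor would yield a good value (of $1/4 \times 0.001$). This is not the case for SGD with L₂ regularization (see Figure 2, top left).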

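The results for Adam with L₂ regularization are given in Figure 2 (bottom left). Adam's best hyperparameter settings performed clearly worse than SGD's best ones (compare Figure 2, top left). While both methods used L₂ regularization, Adam did not benefit from it at all: its best results obtained for non-zero L₂ regularization factors were comparable to the best ones obtained without L₂ regularization, i.e., when $\lambda = 0$. Similarly to the original SGD, the shape of the hyperparameter landscape suggests that the two hyperparameters are coupled.

In contrast, the results for our new variant of Adam with decoupled weight decay (AdamW) in Figure 2 (bottom right) show that AdamW largely decouples weight decay and learning rate. The results for the best hyperparameter settings were substantially better than the best ones of Adam with L₂ regularization and rivaled those of SGD and SGDW.

In summary, the results in Figure 2 support our hypothesis that the weight decay and learning rate hyperparameters can be decoupled, and that this in turn simplifies the problem of hyperparameter tuning in SGD and improves Adam's performance to be competitive w.r.t. SGD with momentum.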

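4.3 Better Generalization of AdamW

While the previous experiment suggested that the basin of optimal hyperparameters of AdamW is broader and deeper than the one of Adam, we next investigated the results for much longer runs of 1800 epochs to compare the generalization capabilities of AdamW and Adam.

We fixed the initial learning rate to 0.001, which represents both the default learning rate for Adam and the one which showed reasonably good results in our experiments. Figure 3 shows the results for 12 settings of the L₂ regularization of Adam and 7 settings of the normalized weight decay of AdamW (the normalized weight decay represents a rescaling formally defined in Appendix B.1; it amounts to a multiplicative factor which depends on the number of batch passes). Interestingly, while the dynamics of the learning curves of Adam and AdamW often coincided for the first half of the training run, AdamW often led to lower training loss and test errors (see Figure 3, top left and top right, respectively). Importantly, the use of L₂ weight decay in Adam did not yield as good results as decoupled weight decay in AdamW (see also Figure 3, bottom left). Next, we investigated whether AdamW's better results were only due to better convergence or also due to better generalization. The results in Figure 3 (bottom right) for the best settings of Adam and AdamW suggest that AdamW did not only yield better training loss but also yielded better generalization performance for similar training loss values. The results on ImageNet32x32 (see SuppFigure 4 in the Appendix) yield the same conclusion of substantially improved generalization performance.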

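4.4 AdamWR with Warm Restarts for Better Anytime Performance

In order to improve the anytime performance of SGDW and AdamW, we extended them with the warm restarts we introduced in Loshchilov & Hutter (2016), to obtain SGDWR and AdamWR, respectively (see Section B.2 in the Appendix). As Figure 4 shows, AdamWR greatly sped up AdamW on CIFAR-10 and ImageNet32x32, by up to a factor of 10 (see the results at the first restart). For the default learning rate of 0.001, AdamW achieved a 15% relative improvement in test errors compared to Adam both on CIFAR-10 (also see SuppFigure 5) and ImageNet32x32 (also see SuppFigure 6).

AdamWR achieved the same improved results but with a much better anytime performance. These improvements closed most of the gap between Adam and SGDWR on CIFAR-10 and yielded comparable performance on ImageNet32x32.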

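4.5 Use of AdamW on Other Datasets and Architectures

Several other research groups have already successfully applied AdamW in citable works. For example, Wang et al. (2018) used AdamW to train a novel architecture for face detection on the standard WIDER FACE dataset (Yang et al., 2016), obtaining almost 10x faster predictions than the previous state-of-the-art algorithms while achieving comparable performance. Völker et al. (2018) trained convolutional neural networks to classify and characterize error-related brain signals measured from intracranial electroencephalography (EEG) recordings, using AdamW with cosine annealing. While their paper does not provide a comparison to Adam, they kindly provided us with a direct comparison of the two on their best-performing problem-specific network architecture Deep4Net and a variant of ResNet. AdamW with the same hyperparameter setting as Adam yielded higher test set accuracy on Deep4Net (73.68% versus 71.37%) and statistically significantly higher test set accuracy on ResNet (72.04% versus 61.34%). Radford et al. (2018) employed AdamW to train Transformer (Vaswani et al., 2017) architectures to obtain new state-of-the-art results on a wide range of benchmarks for natural language understanding. Zhang et al. (2018) compared L₂ regularization vs. weight decay for SGD, Adam and the Kronecker-Factored Approximate Curvature (K-FAC) optimizer (Martens & Grosse, 2015) on the CIFAR datasets with ResNet and VGG architectures, reporting that decoupled weight decay consistently outperformed L₂ regularization in cases where they differ.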

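5 Conclusion and Future Work

Following suggestions that adaptive gradient methods such as Adam might lead to worse generalization than SGD with momentum (Wilson et al., 2017), we identified and exposed the inequivalence of L₂ regularization and weight decay for Adam. We empirically showed that our version of Adam with decoupled weight decay yields substantially better generalization performance than the common implementation of Adam with L₂ regularization. We also proposed to use warm restarts for Adam to improve its anytime performance.

Our results obtained on image classification datasets must be verified on a wider range of tasks, especially ones where the use of regularization is expected to be important. It would be interesting to integrate our findings on weight decay into other methods which attempt to improve Adam, e.g., normalized direction-preserving Adam (Zhang et al., 2017). While we focused our experimental analysis on Adam, we believe that similar results also hold for other adaptive gradient methods, such as AdaGrad (Duchi et al., 2011) and AMSGrad (Reddi et al., 2018).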

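6 Acknowledgments

We thank Patryk Chrabaszcz for help with running experiments with ImageNet32x32; Matthias Feurer and Robin Schirrmeister for providing valuable feedback on this paper in several iterations; and Martin Völker, Robin Schirrmeister, and Tonio Ball for providing us with a comparison of AdamW and Adam on their EEG data. We also thank the following members of the deep learning community for implementing decoupled weight decay in various deep learning libraries:

• Jingwei Zhang, Lei Tai, Robin Schirrmeister, and Kashif Rasul for their implementation in PyTorch (see https://github.com/pytorch/pytorch/pull/4429)
• Phil Jund for his implementation in TensorFlow, described at https://www.tensorflow.org/api_docs/python/tf/contrib/opt/DecoupledWeightDecayExtension
• Sylvain Gugger, Anand Saha, Jeremy Howard and others from fast.ai for their implementation, available at https://github.com/sgugger/Adam-experiments
• Guillaume Lambard for his implementation in Keras, available at https://github.com/GLambard/AdamW_Keras
• Yagami Lin for his implementation in Caffe, available at https://github.com/Yagami123/Caffe-AdamW-AdamWR

This work was supported by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme under grant no. 716721, by the German Research Foundation (DFG) under the BrainLinksBrainTools Cluster of Excellence (grant number EXC 1086) and through grant no. INST 37/935-1 FUGG, and by the German state of Baden-Württemberg through bwHPC.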

References

Aitchison (2018) Laurence Aitchison. A unified theory of adaptive stochastic gradient descent as Bayesian filtering. arXiv:1507.02030, 2018.
Chrabaszcz et al. (2017) Patryk Chrabaszcz, Ilya Loshchilov, and Frank Hutter. A downsampled variant of ImageNet as an alternative to the CIFAR datasets. arXiv:1707.08819, 2017.
Cubuk et al. (2018) Ekin D Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V Le. Autoaugment: Learning augmentation policies from data. arXiv:1805.09501, 2018.
Dinh et al. (2017) Laurent Dinh, Razvan Pascanu, Samy Bengio, and Yoshua Bengio. Sharp minima can generalize for deep nets. arXiv:1703.04933, 2017.
Duchi et al. (2011) John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12:2121–2159, 2011.
Gastaldi (2017) Xavier Gastaldi. Shake-Shake regularization. arXiv:1705.07485, 2017.
Hanson & Pratt (1988) Stephen José Hanson and Lorien Y Pratt. Comparing biases for minimal network construction with back-propagation. In Proceedings of the 1st International Conference on Neural Information Processing Systems, pp. 177–185, 1988.
Huang et al. (2017) Gao Huang, Yixuan Li, Geoff Pleiss, Zhuang Liu, John E Hopcroft, and Kilian Q Weinberger. Snapshot ensembles: Train 1, get m for free. arXiv:1704.00109, 2017.
Keskar et al. (2016) Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv:1609.04836, 2016.
Kingma & Ba (2014) Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv:1412.6980, 2014.
Krizhevsky (2009) Alex Krizhevsky. Learning multiple layers of features from tiny images. 2009.
Li et al. (2017) Hao Li, Zheng Xu, Gavin Taylor, and Tom Goldstein. Visualizing the loss landscape of neural nets. arXiv:1712.09913, 2017.
Loshchilov & Hutter (2016) Ilya Loshchilov and Frank Hutter. SGDR: stochastic gradient descent with warm restarts. arXiv:1608.03983, 2016.
Martens & Grosse (2015) James Martens and Roger Grosse. Optimizing neural networks with Kronecker-factored approximate curvature. In International Conference on Machine Learning, pp. 2408–2417, 2015.
Radford et al. (2015) Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv:1511.06434, 2015.
Radford et al. (2018) Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. URL https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf, 2018.
Reddi et al. (2018) Sashank J. Reddi, Satyen Kale, and Sanjiv Kumar. On the convergence of Adam and beyond. International Conference on Learning Representations, 2018.
Smith (2016) Leslie N Smith. Cyclical learning rates for training neural networks. arXiv:1506.01186v3, 2016.
Tieleman & Hinton (2012) Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, 4(2):26–31, 2012.
Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.
Völker et al. (2018) Martin Völker, Jiří Hammer, Robin T Schirrmeister, Joos Behncke, Lukas DJ Fiederer, Andreas Schulze-Bonhage, Petr Marusič, Wolfram Burgard, and Tonio Ball. Intracranial error detection via deep learning. arXiv:1805.01667, 2018.
Wang et al. (2018) Jianfeng Wang, Ye Yuan, Gang Yu, and Sun Jian. Sface: An efficient network for face detection in large scale variations. arXiv:1804.06559, 2018.
Wilson et al. (2017) Ashia C Wilson, Rebecca Roelofs, Mitchell Stern, Nathan Srebro, and Benjamin Recht. The marginal value of adaptive gradient methods in machine learning. arXiv:1705.08292, 2017.
Xu et al. (2015) Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning, pp. 2048–2057, 2015.
Yang et al. (2016) Shuo Yang, Ping Luo, Chen-Change Loy, and Xiaoou Tang. Wider face: A face detection benchmark. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5525–5533, 2016.
Zhang et al. (2018) Guodong Zhang, Chaoqi Wang, Bowen Xu, and Roger Grosse. Three mechanisms of weight decay regularization. arXiv:1810.12281, 2018.
Zhang et al. (2017) Zijun Zhang, Lin Ma, Zongpeng Li, and Chuan Wu. Normalized direction-preserving Adam. arXiv:1709.04546, 2017.
Zoph et al. (2017) Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V. Le. Learning transferable architectures for scalable image recognition. arXiv:1707.07012, 2017.

Appendix

Appendix A Formal Analysis of Weight Decay vs L₂ Regularization

Proof of Proposition 1
The proof for this well-known fact is straightforward. SGD without weight decay has the following iterates on $f_t^{\text{reg}}(\boldsymbol{\theta}) = f_t(\boldsymbol{\theta}) + \frac{\lambda'}{2}\|\boldsymbol{\theta}\|_2^2$:

	$\boldsymbol{\theta}_{t+1} \leftarrow \boldsymbol{\theta}_t - \alpha \nabla f_t^{\text{reg}}(\boldsymbol{\theta}_t) = \boldsymbol{\theta}_t - \alpha \nabla f_t(\boldsymbol{\theta}_t) - \alpha\lambda'\boldsymbol{\theta}_t.$		(5)

SGD with weight decay has the following iterates on $f_t(\boldsymbol{\theta})$:

	$\boldsymbol{\theta}_{t+1} \leftarrow (1 - \lambda)\,\boldsymbol{\theta}_t - \alpha \nabla f_t(\boldsymbol{\theta}_t).$		(6)

These iterates are identical since $\lambda' = \frac{\lambda}{\alpha}$. ∎

Proof of Proposition 2
Similarly to the proof of Proposition 1, the iterates of $O$ without weight decay on $f_t^{\text{reg}}(\boldsymbol{\theta}) = f_t(\boldsymbol{\theta}) + \frac{1}{2}\lambda'\|\boldsymbol{\theta}\|_2^2$ and of $O$ with weight decay $\lambda$ on $f_t$ are, respectively:

	$\boldsymbol{\theta}_{t+1} \leftarrow \boldsymbol{\theta}_t - \alpha\lambda'\mathbf{M}_t\boldsymbol{\theta}_t - \alpha\mathbf{M}_t\nabla f_t(\boldsymbol{\theta}_t).$		(7)
	$\boldsymbol{\theta}_{t+1} \leftarrow (1 - \lambda)\,\boldsymbol{\theta}_t - \alpha\mathbf{M}_t\nabla f_t(\boldsymbol{\theta}_t).$		(8)

The equality of these iterates for all $\boldsymbol{\theta}_t$ would imply $\lambda\boldsymbol{\theta}_t = \alpha\lambda'\mathbf{M}_t\boldsymbol{\theta}_t$. This can only hold for all $\boldsymbol{\theta}_t$ if $\mathbf{M}_t = k\mathbf{I}$, with $k \in \mathbb{R}$, which is not the case for $O$. Therefore, no L₂ regularizer $\lambda'\|\boldsymbol{\theta}\|_2^2$ exists that makes the iterates equivalent. ∎
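A two-dimensional instance may help to see why the argument in the proof of Proposition 2 forces $\mathbf{M}_t = k\mathbf{I}$. The preconditioner below is a hypothetical example chosen only for illustration.

```latex
\begin{align*}
&\text{Let } \mathbf{M}_t = \operatorname{diag}\!\left(1, \tfrac{1}{2}\right)
  \text{ and require } \lambda\,\boldsymbol{\theta}_t
  = \alpha\lambda'\,\mathbf{M}_t\,\boldsymbol{\theta}_t
  \text{ for all } \boldsymbol{\theta}_t. \\
&\boldsymbol{\theta}_t = (1, 0)^{\top}: \quad \lambda = \alpha\lambda', \\
&\boldsymbol{\theta}_t = (0, 1)^{\top}: \quad \lambda = \tfrac{1}{2}\,\alpha\lambda'. \\
&\text{Hence } \alpha\lambda' = \tfrac{1}{2}\,\alpha\lambda'
  \;\Rightarrow\; \lambda' = 0
  \;\Rightarrow\; \lambda = 0,
  \text{ which contradicts } \lambda \in \mathbb{R}^{+}.
\end{align*}
```

Any preconditioner with two different diagonal entries leads to the same contradiction, so a single scalar $\lambda'$ cannot reproduce a uniform decay rate $\lambda$.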

Proof of Proposition 3
$O$ without weight decay has the following iterates on $f_t^{\text{sreg}}(\boldsymbol{\theta}) = f_t(\boldsymbol{\theta}) + \frac{\lambda'}{2}\left\|\boldsymbol{\theta} \odot \sqrt{\mathbf{s}}\right\|_2^2$:

	$\boldsymbol{\theta}_{t+1} \leftarrow \boldsymbol{\theta}_t - \alpha \nabla f_t^{\text{sreg}}(\boldsymbol{\theta}_t)/\mathbf{s}$		(9)
	$= \boldsymbol{\theta}_t - \alpha \nabla f_t(\boldsymbol{\theta}_t)/\mathbf{s} - \alpha\lambda'(\boldsymbol{\theta}_t \odot \mathbf{s})/\mathbf{s}$		(10)
	$= \boldsymbol{\theta}_t - \alpha \nabla f_t(\boldsymbol{\theta}_t)/\mathbf{s} - \alpha\lambda'\boldsymbol{\theta}_t,$		(11)

where the division by $\mathbf{s}$ is element-wise. $O$ with weight decay has the following iterates on $f_t(\boldsymbol{\theta})$:

	$\boldsymbol{\theta}_{t+1} \leftarrow (1 - \lambda)\,\boldsymbol{\theta}_t - \alpha \nabla f_t(\boldsymbol{\theta}_t)/\mathbf{s}$		(12)
	$= \boldsymbol{\theta}_t - \alpha \nabla f_t(\boldsymbol{\theta}_t)/\mathbf{s} - \lambda\boldsymbol{\theta}_t.$		(13)

These iterates are identical since $\lambda' = \frac{\lambda}{\alpha}$. ∎

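Appendix B Additional Practical Improvements of Adam

Having discussed decoupled weight decay for improving Adam's generalization, in this section we introduce two additional components to improve Adam's performance in practice.

B.1 Normalized Weight Decay

Our preliminary experiments showed that different weight decay factors are optimal for different computational budgets (defined in terms of the number of batch passes). Relatedly, Li et al. (2017) demonstrated that a smaller batch size (for the same total number of epochs) leads to the shrinking effect of weight decay being more pronounced. Here, we propose to reduce this dependence by normalizing the values of weight decay. Specifically, we replace the hyperparameter $\lambda$ by a new (more robust) normalized weight decay hyperparameter $\lambda_{norm}$, and use this to set $\lambda$ as $\lambda = \lambda_{norm}\sqrt{\frac{b}{BT}}$, where $b$ is the batch size, $B$ is the total number of training points and $T$ is the total number of epochs.² Thus, $\lambda_{norm}$ can be interpreted as the weight decay used if only one batch pass is allowed. We emphasize that our choice of normalization is merely one possibility informed by few experiments; a more lasting conclusion we draw is that using some normalization can substantially improve results.

² In the context of our AdamWR variant, discussed in Section B.2, $T$ is the total number of epochs in the current restart.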
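The normalization rule can be written as a one-line helper. The sketch below is illustrative; the function name is hypothetical, the batch size of 128 is the one used in our experiments, and the training-set size of 50,000 is an assumption corresponding to the standard CIFAR-10 training split.

```python
import math


def normalized_weight_decay(lam_norm, batch_size, num_train, epochs):
    """lam = lam_norm * sqrt(b / (B * T)), as defined in Section B.1."""
    return lam_norm * math.sqrt(batch_size / (num_train * epochs))


# Assumed setting: batch size 128, 50,000 training points, lam_norm = 0.05.
lam_100 = normalized_weight_decay(0.05, 128, 50_000, 100)
lam_400 = normalized_weight_decay(0.05, 128, 50_000, 400)

print(lam_100)            # about 2.5e-4
print(lam_400 / lam_100)  # quadrupling the epochs halves the raw weight decay
```

Since $BT/b$ is the total number of batch passes, the raw weight decay shrinks with the square root of the number of weight updates, while $\lambda_{norm}$ itself stays fixed across budgets.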

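B.2 Adam with Cosine Annealing and Warm Restarts

We now apply cosine annealing and warm restarts to Adam, following our recent work (Loshchilov & Hutter, 2016). There, we proposed Stochastic Gradient Descent with Warm Restarts (SGDR) to improve the anytime performance of SGD by quickly cooling down the learning rate according to a cosine schedule and periodically increasing it. SGDR has been successfully adopted to obtain new state-of-the-art results for popular image classification benchmarks (Huang et al., 2017; Gastaldi, 2017; Zoph et al., 2017), and we therefore already tried extending it to Adam shortly after proposing it. However, while our initial version of Adam with warm restarts had better anytime performance than Adam, it was not competitive with SGD with warm restarts, precisely because L₂ regularization was not working as well as in SGD. Now, having fixed this issue by means of the original weight decay regularization (Section 2) and also having introduced normalized weight decay (Section B.1), our original work on cosine annealing and warm restarts directly carries over to Adam.

In the interest of keeping the presentation self-contained, we briefly describe how SGDR schedules the change of the effective learning rate in order to accelerate the training of DNNs. Here, we decouple the initial learning rate $\alpha$ and its multiplier $\eta_t$ used to obtain the actual learning rate at iteration $t$ (see, e.g., line 8 in Algorithm 1). In SGDR, we simulate a new warm-started run/restart of SGD once $T_i$ epochs are performed, where $i$ is the index of the run. Importantly, the restarts are not performed from scratch but emulated by increasing $\eta_t$ while the old value of $\boldsymbol{\theta}_t$ is used as an initial solution. The amount by which $\eta_t$ is increased controls to which extent the previously acquired information (e.g., momentum) is used. Within the $i$-th run, the value of $\eta_t$ decays according to a cosine annealing (Loshchilov & Hutter, 2016) learning rate for each batch as follows: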

	$\eta_t = \eta_{min}^{(i)} + 0.5\,\big(\eta_{max}^{(i)} - \eta_{min}^{(i)}\big)\big(1 + \cos(\pi T_{cur}/T_i)\big),$		(14)

where $\eta_{min}^{(i)}$ and $\eta_{max}^{(i)}$ are ranges for the multiplier and $T_{cur}$ accounts for how many epochs have been performed since the last restart. $T_{cur}$ is updated at each batch iteration $t$ and is thus not constrained to integer values. Adjusting (e.g., decreasing) $\eta_{min}^{(i)}$ and $\eta_{max}^{(i)}$ at every $i$-th restart (see also Smith (2016)) could potentially improve performance, but we do not consider that option here because it would involve additional hyperparameters. For $\eta_{max}^{(i)} = 1$ and $\eta_{min}^{(i)} = 0$, one can simplify Eq. (14) to

	$\eta_t = 0.5 + 0.5\cos(\pi T_{cur}/T_i).$		(15)

In order to achieve good anytime performance, one can start with an initially small $T_i$ (e.g., from 1% to 10% of the expected total budget) and multiply it by a factor of $T_{mult}$ (e.g., $T_{mult} = 2$) at every restart. The $(i+1)$-th restart is triggered when $T_{cur} = T_i$ by setting $T_{cur}$ to 0. An example setting of the schedule multiplier is given in Appendix C.
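The schedule of Eq. (14) together with the restart rule can be sketched as a small function of the (possibly fractional) number of epochs elapsed since the start of training. This is an illustration in plain Python, not the code used in our experiments; the function name and argument names are hypothetical.

```python
import math


def schedule_multiplier(epoch, t0, t_mult=2, eta_min=0.0, eta_max=1.0):
    """Cosine-annealed multiplier eta_t with warm restarts (Eq. 14).

    epoch  -- epochs since the start of training (may be fractional)
    t0     -- length T_i of the first run, in epochs (must be positive)
    t_mult -- factor by which T_i grows at every restart (must be >= 1)
    """
    t_i, t_cur = float(t0), float(epoch)
    # A restart is triggered as soon as T_cur reaches T_i.
    while t_cur >= t_i:
        t_cur -= t_i
        t_i *= t_mult
    cosine = math.cos(math.pi * t_cur / t_i)
    return eta_min + 0.5 * (eta_max - eta_min) * (1.0 + cosine)


for e in (0, 50, 99.9, 100, 200, 299.9, 300):
    print(e, round(schedule_multiplier(e, t0=100), 4))
```

With the defaults `eta_min=0` and `eta_max=1` the function reduces to Eq. (15). At an epoch that coincides exactly with a restart, the sketch returns the value after the reset.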

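Our proposed AdamWR algorithm represents AdamW (see Algorithm 2) with $\eta_t$ following Eq. (15) and $\lambda$ computed at each iteration using the normalized weight decay described in Section B.1. We note that normalized weight decay allowed us to use a constant parameter setting across short and long runs performed within AdamWR and SGDWR (SGDW with warm restarts).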

Appendix C An Example Setting of the Schedule Multiplier

An example schedule of the schedule multiplier $\eta_t$ is given in SuppFigure 1 for $T_{i=0} = 100$ and $T_{mult} = 2$. After the initial 100 epochs the learning rate will reach 0 because $\eta_{t=100} = 0$. Then, since $T_{cur} = T_{i=0}$, we restart by resetting $T_{cur} = 0$, causing the multiplier $\eta_t$ to be reset to 1 due to Eq. (15). This multiplier will then decrease again from 1 to 0, but now over the course of 200 epochs because $T_{i=1} = T_{i=0}\,T_{mult} = 200$. Solutions obtained right before the restarts, when $\eta_t = 0$ (e.g., at epoch indexes 100, 300, 700 and 1500 as shown in SuppFigure 1), are recommended by the optimizer as the solutions, with more recent solutions prioritized.
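The epoch indexes quoted above follow directly from the geometric growth of $T_i$. A minimal sketch (the function name and the budget argument are hypothetical):

```python
def restart_epochs(t0, t_mult, budget):
    """Epochs at which a run ends (eta_t reaches 0) within the given budget."""
    points, t_i, end = [], t0, t0
    while end <= budget:
        points.append(end)
        t_i *= t_mult      # the next run is t_mult times longer
        end += t_i
    return points


print(restart_epochs(100, 2, 1800))  # [100, 300, 700, 1500]
```

For a budget of 1800 epochs, the run that would end at epoch 3100 does not fit, so the last recommended solution is the one at epoch 1500.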

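Appendix D Additional Results

We investigated whether the use of much longer runs (1800 epochs) of "standard Adam" (Adam with L₂ regularization and a fixed learning rate) makes the use of cosine annealing unnecessary. SuppFigure 2 shows the results of standard Adam for a 4 by 4 logarithmic grid of hyperparameter settings (the coarseness of the grid is due to the high computational expense of runs for 1800 epochs). Even after taking the low resolution of the grid into account, the results appear to be at best comparable to the ones obtained with AdamW with 18 times fewer epochs and a smaller network (see SuppFigure 3, top row, middle). These results are not very surprising given Figure 1 in the main paper (which demonstrates both the improvements possible by using some learning rate schedule, such as cosine annealing, and the effectiveness of decoupled weight decay).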

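Our experimental results with Adam and SGD suggest that the total runtime in terms of the number of epochs affects the basin of optimal hyperparameters (see SuppFigure 3). More specifically, the greater the total number of epochs, the smaller the values of the weight decay should be. SuppFigure 4 shows that our remedy for this problem, the normalized weight decay defined in Section B.1, simplifies hyperparameter selection because the optimal values observed for short runs are similar to the ones for much longer runs. We used our initial experiments on CIFAR-10 to suggest the square root normalization given in Section B.1 and double-checked that this is not a coincidence on the ImageNet32x32 dataset (Chrabaszcz et al., 2017), a downsampled version of the original ImageNet dataset with 1.2 million 32 $\times$ 32 pixel images, where an epoch is 24 times longer than on CIFAR-10. This experiment also supported the square root scaling: the best values of the normalized weight decay observed on CIFAR-10 represented nearly optimal values for ImageNet32x32 (see SuppFigure 3). In contrast, had we used the same raw weight decay values $\lambda$ for ImageNet32x32 as for CIFAR-10 and for the same number of epochs, without the proposed normalization, $\lambda$ would have been roughly 5 times too large for ImageNet32x32, leading to much worse performance. The optimal normalized weight decay values were also very similar (e.g., $\lambda_{norm} = 0.025$ and $\lambda_{norm} = 0.05$) across SGDW and AdamW. These results clearly show that normalizing weight decay can substantially improve performance; while square root scaling performed very well in our experiments, we emphasize that these experiments were not very comprehensive and that even better scaling rules are likely to exist.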
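The factor of roughly 5 follows from the square root rule of Section B.1 when the batch size, the number of epochs and $\lambda_{norm}$ are held fixed. The arithmetic can be sketched as follows; the CIFAR-10 training-set size of 50,000 is an assumption (the standard training split), and the ImageNet32x32 size is the rounded "1.2 million" quoted above.

```python
import math

cifar10_train = 50_000         # assumption: standard CIFAR-10 training split
imagenet32_train = 1_200_000   # "1.2 million" images, as quoted in the text

# An epoch on ImageNet32x32 contains this many times more batch passes.
epoch_ratio = imagenet32_train / cifar10_train

# With lam = lam_norm * sqrt(b / (B * T)) and identical b, T and lam_norm,
# the raw weight decay suited to ImageNet32x32 is smaller by sqrt(epoch_ratio).
raw_decay_ratio = math.sqrt(epoch_ratio)

print(epoch_ratio)      # 24.0
print(raw_decay_ratio)  # about 4.9, i.e. roughly 5
```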

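SuppFigure 4 is the equivalent of Figure 3 in the main paper, but for ImageNet32x32 instead of CIFAR-10. The qualitative results are identical: weight decay leads to better training loss (cross-entropy) than L₂ regularization, and to an even greater improvement of test error.

SuppFigure 5 and SuppFigure 6 are the equivalents of Figure 4 in the main paper but supplemented with training loss curves in their bottom rows. The results show that Adam and its variants with decoupled weight decay converge faster (in terms of training loss) on CIFAR-10 than the corresponding SGD variants (the difference for ImageNet32x32 is small). As is discussed in the main paper, when the same values of training loss are considered, AdamW demonstrates better values of test error than Adam. Interestingly, SuppFigure 5 and SuppFigure 6 show that the restart variants AdamWR and SGDWR also demonstrate better generalization than AdamW and SGDW, respectively.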