Generative Adversarial Networks

Research Paper
Authors	Ian J. Goodfellow; Jean Pouget-Abadie; Mehdi Mirza; Bing Xu; David Warde-Farley; Sherjil Ozair; Aaron Courville; Yoshua Bengio
Year	2014
Topic area	Machine Learning
Difficulty	Research
arXiv	1406.2661

Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley,
Sherjil Ozair, Aaron Courville, Yoshua Bengio
Département d'informatique et de recherche opérationnelle
Université de Montréal
Montréal, QC H3C 3J7
Jean Pouget-Abadie is visiting Université de Montréal from Ecole Polytechnique. Sherjil Ozair is visiting Université de Montréal from Indian Institute of Technology Delhi. Yoshua Bengio is a CIFAR Senior Fellow.

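Abstract

We propose a new framework for estimating generative models via an adversarial process, in which we simultaneously train two models: a generative model $G$ that captures the data distribution, and a discriminative model $D$ that estimates the probability that a sample came from the training data rather than from $G$. The training procedure for $G$ is to maximize the probability of $D$ making a mistake. This framework corresponds to a minimax two-player game. In the space of arbitrary functions $G$ and $D$, a unique solution exists, with $G$ recovering the training data distribution and $D$ equal to $\frac{1}{2}$ everywhere. In the case where $G$ and $D$ are defined by multilayer perceptrons, the entire system can be trained with backpropagation. There is no need for any Markov chains or unrolled approximate inference networks during either training or generation of samples. Experiments demonstrate the potential of the framework through qualitative and quantitative evaluation of the generated samples.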

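1 Introduction

The promise of deep learning is to discover rich, hierarchical models [2] that represent probability distributions over the kinds of data encountered in artificial intelligence applications, such as natural images, audio waveforms containing speech, and symbols in natural language corpora. So far, the most striking successes in deep learning have involved discriminative models, usually those that map a high-dimensional, rich sensory input to a class label [14, 22]. These striking successes have primarily been based on the backpropagation and dropout algorithms, using piecewise linear units [19, 9, 10] which have a particularly well-behaved gradient. Deep generative models have had less of an impact, due to the difficulty of approximating many intractable probabilistic computations that arise in maximum likelihood estimation and related strategies, and due to the difficulty of leveraging the benefits of piecewise linear units in the generative context. We propose a new generative model estimation procedure that sidesteps these difficulties.¹

¹ All code and hyperparameters are available at http://www.github.com/goodfeli/adversarial

In the proposed adversarial nets framework, the generative model is pitted against an adversary: a discriminative model that learns to determine whether a sample is from the model distribution or the data distribution. The generative model can be thought of as analogous to a team of counterfeiters, trying to produce fake currency and use it without detection, while the discriminative model is analogous to the police, trying to detect the counterfeit currency. Competition in this game drives both teams to improve their methods until the counterfeits are indistinguishable from the genuine articles.

This framework can yield specific training algorithms for many kinds of model and optimization algorithm. In this article, we explore the special case in which the generative model generates samples by passing random noise through a multilayer perceptron, and the discriminative model is also a multilayer perceptron. We refer to this special case as adversarial nets. In this case, we can train both models using only the highly successful backpropagation and dropout algorithms [17] and sample from the generative model using only forward propagation. No approximate inference or Markov chains are necessary.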

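2 Related work

An alternative to directed graphical models with latent variables is undirected graphical models with latent variables, such as restricted Boltzmann machines (RBMs) [27, 16], deep Boltzmann machines (DBMs) [26] and their numerous variants. The interactions within such models are represented as the product of unnormalized potential functions, normalized by a global summation/integration over all states of the random variables. This quantity (the partition function) and its gradient are intractable for all but the most trivial instances, although they can be estimated by Markov chain Monte Carlo (MCMC) methods. Mixing poses a significant problem for learning algorithms that rely on MCMC [3, 5].

Deep belief networks (DBNs) [16] are hybrid models containing a single undirected layer and several directed layers. While a fast approximate layer-wise training criterion exists, DBNs incur the computational difficulties associated with both undirected and directed models.

Alternative criteria that do not approximate or bound the log-likelihood have also been proposed, such as score matching [18] and noise-contrastive estimation (NCE) [13]. Both of these require the learned probability density to be analytically specified up to a normalization constant. Note that in many interesting generative models with several layers of latent variables (such as DBNs and DBMs), it is not even possible to derive a tractable unnormalized probability density. Some models such as denoising auto-encoders [30] and contractive autoencoders have learning rules very similar to score matching applied to RBMs. In NCE, as in this work, a discriminative training criterion is employed to fit a generative model. However, rather than fitting a separate discriminative model, the generative model itself is used to discriminate generated data from samples of a fixed noise distribution. Because NCE uses a fixed noise distribution, learning slows dramatically once the model has learned even an approximately correct distribution over a small subset of the observed variables.

Finally, some techniques do not involve defining a probability distribution explicitly, but rather train a generative machine to draw samples from the desired distribution. This approach has the advantage that such machines can be designed to be trained by backpropagation. Prominent recent work in this area includes the generative stochastic network (GSN) framework [5], which extends generalized denoising auto-encoders [4]: both can be seen as defining a parameterized Markov chain, i.e., one learns the parameters of a machine that performs one step of a generative Markov chain. Compared to GSNs, the adversarial nets framework does not require a Markov chain for sampling. Because adversarial nets do not require feedback loops during generation, they are better able to leverage piecewise linear units [19, 9, 10], which improve the performance of backpropagation but have problems with unbounded activation when used in a feedback loop. More recent examples of training a generative machine by backpropagating into it include auto-encoding variational Bayes [20] and stochastic backpropagation [24].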

3 Adversarial nets

The adversarial modeling framework is most straightforward to apply when the models are both multilayer perceptrons. To learn the generator's distribution $p_{g}$ over data $\mathbf{x}$, we define a prior on input noise variables $p_{\mathbf{z}}(\mathbf{z})$, then represent a mapping to data space as $G(\mathbf{z};\theta_{g})$, where $G$ is a differentiable function represented by a multilayer perceptron with parameters $\theta_{g}$. We also define a second multilayer perceptron $D(\mathbf{x};\theta_{d})$ that outputs a single scalar. $D(\mathbf{x})$ represents the probability that $\mathbf{x}$ came from the data rather than $p_{g}$. We train $D$ to maximize the probability of assigning the correct label to both training examples and samples from $G$. We simultaneously train $G$ to minimize $\log(1 - D(G(\mathbf{z})))$.

In other words, $D$ and $G$ play the following two-player minimax game with value function $V(G,D)$:

	$\min_{G} \max_{D} V(G,D) = \mathbb{E}_{\mathbf{x} \sim p_{\text{data}}(\mathbf{x})}\left[\log D(\mathbf{x})\right] + \mathbb{E}_{\mathbf{z} \sim p_{\mathbf{z}}(\mathbf{z})}\left[\log(1 - D(G(\mathbf{z})))\right].$		(1)
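The value function can be estimated from minibatches by replacing each expectation with a sample mean. The following is a minimal sketch under stated assumptions: the one-dimensional Gaussian data, the toy generator `G`, and the two discriminators `D_blind` and `D_sharp` are hypothetical stand-ins chosen for illustration, not models from the paper.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def value_estimate(D, G, x_batch, z_batch):
    """Monte Carlo estimate of V: mean log D(x) + mean log(1 - D(G(z)))."""
    real_term = np.mean(np.log(D(x_batch)))
    fake_term = np.mean(np.log(1.0 - D(G(z_batch))))
    return real_term + fake_term

rng = np.random.default_rng(0)
x_batch = rng.normal(loc=4.0, scale=0.5, size=1000)  # hypothetical p_data
z_batch = rng.normal(size=1000)                      # hypothetical noise prior p_z

G = lambda z: 0.5 * z                         # toy generator, far from the data
D_blind = lambda x: np.full_like(x, 0.5)      # discriminator that cannot tell the two apart
D_sharp = lambda x: sigmoid(4.0 * (x - 2.0))  # discriminator that separates them well

v_blind = value_estimate(D_blind, G, x_batch, z_batch)
v_sharp = value_estimate(D_sharp, G, x_batch, z_batch)
print(v_blind, v_sharp)
```

A discriminator that outputs $\frac{1}{2}$ everywhere yields exactly $-\log 4$, while a discriminator that separates the two sample sets pushes the estimate toward its upper bound of 0, which is the quantity $D$ tries to increase and $G$ tries to decrease.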

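In the next section, we present a theoretical analysis of adversarial nets, essentially showing that the training criterion allows one to recover the data generating distribution as $G$ and $D$ are given enough capacity, i.e., in the non-parametric limit. See Figure 1 for a less formal, more pedagogical explanation of the approach. In practice, we must implement the game using an iterative, numerical approach. Optimizing $D$ to completion in the inner loop of training is computationally prohibitive, and on finite datasets would result in overfitting. Instead, we alternate between $k$ steps of optimizing $D$ and one step of optimizing $G$. This results in $D$ being maintained near its optimal solution, so long as $G$ changes slowly enough. This strategy is analogous to the way that SML/PCD [31, 29] training maintains samples from a Markov chain from one learning step to the next in order to avoid burning in a Markov chain as part of the inner loop of learning. The procedure is formally presented in Algorithm 1.

In practice, equation 1 may not provide sufficient gradient for $G$ to learn well. Early in learning, when $G$ is poor, $D$ can reject samples with high confidence because they are clearly different from the training data. In this case, $\log(1 - D(G(\mathbf{z})))$ saturates. Rather than training $G$ to minimize $\log(1 - D(G(\mathbf{z})))$, we can train $G$ to maximize $\log D(G(\mathbf{z}))$. This objective function results in the same fixed point of the dynamics of $G$ and $D$ but provides much stronger gradients early in learning.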
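The saturation can be made concrete with a small calculation. As an illustrative assumption (the paper does not fix the output layer), suppose the discriminator output is a logistic sigmoid of a logit $a$, so that $D(G(\mathbf{z})) = \sigma(a)$. Then $\frac{d}{da}\log(1 - \sigma(a)) = -\sigma(a)$, which vanishes when $D$ confidently rejects a sample, whereas $\frac{d}{da}\log\sigma(a) = 1 - \sigma(a)$ stays close to 1. The function names below are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def grad_saturating(a):
    # d/da log(1 - sigmoid(a)) = -sigmoid(a)
    return -sigmoid(a)

def grad_non_saturating(a):
    # d/da log(sigmoid(a)) = 1 - sigmoid(a)
    return 1.0 - sigmoid(a)

# Negative logits correspond to samples that D confidently rejects.
for a in [-10.0, -5.0, 0.0]:
    print(a, sigmoid(a), grad_saturating(a), grad_non_saturating(a))
```

At a logit of $-10$ the gradient of the original objective with respect to the logit is on the order of $10^{-5}$ in magnitude, while the alternative objective still supplies a gradient close to 1; at a logit of 0 the two have equal magnitude.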

[Figure 1: panels (a), (b), (c), …, (d)]

Algorithm 1

for number of training iterations do
    for $k$ steps do
        • Sample a minibatch of $m$ noise samples $\{\mathbf{z}^{(1)},\ldots,\mathbf{z}^{(m)}\}$ from the noise prior $p_{\mathbf{z}}(\mathbf{z})$.
        • Sample a minibatch of $m$ examples $\{\mathbf{x}^{(1)},\ldots,\mathbf{x}^{(m)}\}$ from the data generating distribution $p_{\text{data}}(\mathbf{x})$.
        • Update the discriminator by ascending its stochastic gradient:

	$\nabla_{\theta_{d}} \frac{1}{m} \sum_{i=1}^{m} \left[ \log D\left(\mathbf{x}^{(i)}\right) + \log\left(1 - D\left(G\left(\mathbf{z}^{(i)}\right)\right)\right) \right].$

    end for
    • Sample a minibatch of $m$ noise samples $\{\mathbf{z}^{(1)},\ldots,\mathbf{z}^{(m)}\}$ from the noise prior $p_{\mathbf{z}}(\mathbf{z})$.
    • Update the generator by descending its stochastic gradient:

	$\nabla_{\theta_{g}} \frac{1}{m} \sum_{i=1}^{m} \log\left(1 - D\left(G\left(\mathbf{z}^{(i)}\right)\right)\right).$

end for
The gradient-based updates can use any standard gradient-based learning rule. We used momentum in our experiments.
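The alternating updates can be sketched on a toy problem where the gradients are available in closed form. Everything below is an illustrative assumption rather than the experimental setup: the data are one-dimensional Gaussian, the generator is a learned shift $G(z) = \theta + z$, the discriminator is a logistic unit $D(x) = \sigma(wx + c)$, and plain stochastic gradient steps replace the momentum rule. The function `train_toy_gan` and its hyperparameters are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_toy_gan(steps=5000, k=1, m=256, lr=0.05, data_mean=2.0, seed=0):
    rng = np.random.default_rng(seed)
    w, c = 0.0, 0.0   # discriminator parameters theta_d: D(x) = sigmoid(w * x + c)
    theta = 0.0       # generator parameter theta_g: G(z) = theta + z
    for _ in range(steps):
        for _ in range(k):
            z = rng.normal(size=m)                  # minibatch from the noise prior p_z
            x = rng.normal(loc=data_mean, size=m)   # minibatch from p_data
            g = theta + z
            d_real = sigmoid(w * x + c)
            d_fake = sigmoid(w * g + c)
            # Gradient of (1/m) sum [log D(x) + log(1 - D(G(z)))] w.r.t. (w, c)
            grad_w = np.mean((1.0 - d_real) * x) - np.mean(d_fake * g)
            grad_c = np.mean(1.0 - d_real) - np.mean(d_fake)
            w += lr * grad_w                        # ascend for the discriminator
            c += lr * grad_c
        z = rng.normal(size=m)
        d_fake = sigmoid(w * (theta + z) + c)
        # Gradient of (1/m) sum log(1 - D(G(z))) w.r.t. theta is mean(-D(G(z)) * w)
        grad_theta = np.mean(-d_fake * w)
        theta -= lr * grad_theta                    # descend for the generator
    return theta, w, c

theta, w, c = train_toy_gan()
print(theta, w, c)
```

In this sketch the generator never sees the data directly: its only training signal is the discriminator weight `w` and the outputs `d_fake`, and the learned shift `theta` should drift toward the data mean as the discriminator loses its ability to separate the two sample sets.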

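4 Theoretical Results

The generator $G$ implicitly defines a probability distribution $p_{g}$ as the distribution of the samples $G(\mathbf{z})$ obtained when $\mathbf{z} \sim p_{\mathbf{z}}$. Therefore, we would like Algorithm 1 to converge to a good estimator of $p_{\text{data}}$, if given enough capacity and training time. The results of this section are done in a non-parametric setting, e.g., we represent a model with infinite capacity by studying convergence in the space of probability density functions.

We will show in section 4.1 that this minimax game has a global optimum for $p_{g} = p_{\text{data}}$. We will then show in section 4.2 that Algorithm 1 optimizes Eq. 1, thus obtaining the desired result.

4.1 Global Optimality of $p_{g} = p_{\text{data}}$

We first consider the optimal discriminator $D$ for any given generator $G$.

Proposition 1.

For $G$ fixed, the optimal discriminator $D$ is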

	${D_{G}^{\ast}\hspace{0pt}{({\mathbf{x}})}} = \frac{p_{\text{data}}\hspace{0pt}{({\mathbf{x}})}}{{p_{\text{data}}\hspace{0pt}{({\mathbf{x}})}} + {p_{g}\hspace{0pt}{({\mathbf{x}})}}}$		(2)

Proof.

The training criterion for the discriminator $D$, given any generator $G$, is to maximize the quantity $V(G,D)$

	$V(G,D) = \int_{\mathbf{x}} p_{\text{data}}(\mathbf{x}) \log(D(\mathbf{x}))\, d\mathbf{x} + \int_{\mathbf{z}} p_{\mathbf{z}}(\mathbf{z}) \log(1 - D(G(\mathbf{z})))\, d\mathbf{z}$
	$= \int_{\mathbf{x}} \left[ p_{\text{data}}(\mathbf{x}) \log(D(\mathbf{x})) + p_{g}(\mathbf{x}) \log(1 - D(\mathbf{x})) \right] d\mathbf{x}$		(3)

For any $(a,b) \in \mathbb{R}_{\geq 0}^{2} \setminus \{(0,0)\}$, the function $y \rightarrow a\log(y) + b\log(1 - y)$ achieves its maximum in $[0,1]$ at $\frac{a}{a + b}$. Applying this pointwise with $a = p_{\text{data}}(\mathbf{x})$ and $b = p_{g}(\mathbf{x})$ gives Eq. 2. The discriminator does not need to be defined outside of $Supp(p_{\text{data}}) \cup Supp(p_{g})$, concluding the proof. ∎
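The pointwise maximization step can be checked numerically. Setting the derivative $\frac{a}{y} - \frac{b}{1 - y}$ to zero gives $y = \frac{a}{a + b}$, and the second derivative is negative for nonnegative $a$ and $b$ that are not both zero. The sketch below confirms this on a dense grid; the values of `a` and `b` are hypothetical stand-ins for $p_{\text{data}}(\mathbf{x})$ and $p_{g}(\mathbf{x})$ at a single point.

```python
import numpy as np

def pointwise_objective(y, a, b):
    """Integrand of Eq. 3 at a single x, as a function of y = D(x)."""
    return a * np.log(y) + b * np.log(1.0 - y)

a, b = 0.3, 0.9  # hypothetical values of p_data(x) and p_g(x) at one x

# Dense grid over the open interval (0, 1); the endpoints are excluded to avoid log(0).
ys = np.linspace(1e-6, 1.0 - 1e-6, 1_000_001)
y_best = ys[np.argmax(pointwise_objective(ys, a, b))]

print(y_best, a / (a + b))
```

The grid maximizer agrees with $\frac{a}{a + b} = 0.25$ up to the grid resolution.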

Note that the training objective for $D$ can be interpreted as maximizing the log-likelihood for estimating the conditional probability $P(Y = y \mid \mathbf{x})$, where $Y$ indicates whether $\mathbf{x}$ comes from $p_{\text{data}}$ (with $y = 1$) or from $p_{g}$ (with $y = 0$). The minimax game in Eq. 1 can now be reformulated as:

${\textstyle {C\hspace{0pt}{(G)}} =}$	${\textstyle {\max\limits_{D}V}\hspace{0pt}{(G,D)}}$
${\textstyle =}$	${\textstyle {{\mathbb{E}}_{{\mathbf{x}} \sim p_{\text{data}}}\hspace{0pt}{\lbrack{{\log D_{G}^{\ast}}\hspace{0pt}{({\mathbf{x}})}}\rbrack}} + {{\mathbb{E}}_{{\mathbf{z}} \sim p_{\mathbf{z}}}\hspace{0pt}{\lbrack{\log{({1 - {D_{G}^{\ast}\hspace{0pt}{({G\hspace{0pt}{({\mathbf{z}})}})}}})}}\rbrack}}}$	(4)
${\textstyle =}$	${\textstyle {{\mathbb{E}}_{{\mathbf{x}} \sim p_{\text{data}}}\hspace{0pt}{\lbrack{{\log D_{G}^{\ast}}\hspace{0pt}{({\mathbf{x}})}}\rbrack}} + {{\mathbb{E}}_{{\mathbf{x}} \sim p_{g}}\hspace{0pt}{\lbrack{\log{({1 - {D_{G}^{\ast}\hspace{0pt}{({\mathbf{x}})}}})}}\rbrack}}}$
${\textstyle =}$	${\textstyle \mathbb{E}_{\mathbf{x} \sim p_{\text{data}}}\left[\log\frac{p_{\text{data}}(\mathbf{x})}{p_{\text{data}}(\mathbf{x}) + p_{g}(\mathbf{x})}\right] + \mathbb{E}_{\mathbf{x} \sim p_{g}}\left[\log\frac{p_{g}(\mathbf{x})}{p_{\text{data}}(\mathbf{x}) + p_{g}(\mathbf{x})}\right]}$

Theorem 1.

The global minimum of the virtual training criterion $C(G)$ is achieved if and only if $p_{g} = p_{\text{data}}$. At that point, $C(G)$ achieves the value $-\log 4$.

Proof.

For $p_{g} = p_{\text{data}}$, $D_{G}^{\ast}(\mathbf{x}) = \frac{1}{2}$ (consider Eq. 2). Hence, by inspecting Eq. 4 at $D_{G}^{\ast}(\mathbf{x}) = \frac{1}{2}$, we find $C(G) = \log\frac{1}{2} + \log\frac{1}{2} = -\log 4$. To see that this is the best possible value of $C(G)$, reached only for $p_{g} = p_{\text{data}}$, observe that

	${{{\mathbb{E}}_{{\mathbf{x}} \sim p_{\text{data}}}\hspace{0pt}\left\lbrack {- {\log 2}} \right\rbrack} + {{\mathbb{E}}_{{\mathbf{x}} \sim p_{g}}\hspace{0pt}\left\lbrack {- {\log 2}} \right\rbrack}} = {- {\log 4}}$

and that by subtracting this expression from $C(G) = V(G, D_{G}^{\ast})$, we obtain:

	${C\hspace{0pt}{(G)}} = {{- {\log{(4)}}} + {K\hspace{0pt}L\hspace{0pt}\left( p_{\text{data}}\parallel\frac{p_{\text{data}} + p_{g}}{2} \right)} + {K\hspace{0pt}L\hspace{0pt}\left( p_{g}\parallel\frac{p_{\text{data}} + p_{g}}{2} \right)}}$		(5)

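where KL is the Kullback–Leibler divergence. We recognize in the previous expression the Jensen–Shannon divergence between the model's distribution and the data generating process: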

	${C\hspace{0pt}{(G)}} = {{- {\log{(4)}}} + {{2 \cdot J}\hspace{0pt}S\hspace{0pt}D\hspace{0pt}\left( p_{\text{data}}\parallel p_{g} \right)}}$		(6)

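Since the Jensen–Shannon divergence between two distributions is always non-negative, and zero only when they are equal, we have shown that $C^{\ast} = -\log(4)$ is the global minimum of $C(G)$ and that the only solution is $p_{g} = p_{\text{data}}$, i.e., the generative model perfectly replicating the data generating process. ∎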
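The identity in Eq. 6 can be verified numerically on discrete distributions, where the expectations become finite sums. The sketch below is an illustration under stated assumptions: the two four-outcome distributions are hypothetical, and `criterion` evaluates $C(G)$ by plugging the optimal discriminator of Eq. 2 into Eq. 4.

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence between discrete distributions with full support."""
    return float(np.sum(p * np.log(p / q)))

def criterion(p_data, p_g):
    """C(G): the value function evaluated at the optimal discriminator D*_G."""
    d_star = p_data / (p_data + p_g)
    return float(np.sum(p_data * np.log(d_star)) + np.sum(p_g * np.log(1.0 - d_star)))

p_data = np.array([0.1, 0.2, 0.3, 0.4])  # hypothetical data distribution
p_g = np.array([0.4, 0.3, 0.2, 0.1])     # hypothetical model distribution

mix = 0.5 * (p_data + p_g)
jsd = 0.5 * kl(p_data, mix) + 0.5 * kl(p_g, mix)

print(criterion(p_data, p_g), -np.log(4.0) + 2.0 * jsd)  # the two values agree
print(criterion(p_data, p_data), -np.log(4.0))           # minimum when p_g = p_data
```

For mismatched distributions the criterion lies strictly above $-\log 4$ by exactly twice the Jensen–Shannon divergence, and it attains $-\log 4$ when the two distributions coincide.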

4.2 Convergence of Algorithm 1

Proposition 2.

If $G$ and $D$ have enough capacity, and at each step of Algorithm 1, the discriminator is allowed to reach its optimum given $G$, and $p_{g}$ is updated so as to improve the criterion

	${{\mathbb{E}}_{{\mathbf{x}} \sim p_{\text{data}}}\hspace{0pt}{\lbrack{{\log D_{G}^{\ast}}\hspace{0pt}{({\mathbf{x}})}}\rbrack}} + {{\mathbb{E}}_{{\mathbf{x}} \sim p_{g}}\hspace{0pt}{\lbrack{\log{({1 - {D_{G}^{\ast}\hspace{0pt}{({\mathbf{x}})}}})}}\rbrack}}$

then $p_{g}$ converges to $p_{\text{data}}$.

Proof.

Consider $V(G,D) = U(p_{g},D)$ as a function of $p_{g}$, as done in the above criterion. Note that $U(p_{g},D)$ is convex in $p_{g}$. The subderivatives of a supremum of convex functions include the derivative of the function at the point where the maximum is attained. In other words, if $f(x) = \sup_{\alpha \in \mathcal{A}} f_{\alpha}(x)$ and $f_{\alpha}(x)$ is convex in $x$ for every $\alpha$, then $\partial f_{\beta}(x) \in \partial f$ if $\beta = \arg\sup_{\alpha \in \mathcal{A}} f_{\alpha}(x)$. This is equivalent to computing a gradient descent update for $p_{g}$ at the optimal $D$ given the corresponding $G$. $\sup_{D} U(p_{g},D)$ is convex in $p_{g}$ with a unique global optimum, as proven in Theorem 1; therefore, with sufficiently small updates of $p_{g}$, $p_{g}$ converges to $p_{\text{data}}$, concluding the proof. ∎

In practice, adversarial nets represent a limited family of $p_{g}$ distributions via the function $G(\mathbf{z};\theta_{g})$, and we optimize $\theta_{g}$ rather than $p_{g}$ itself. Using a multilayer perceptron to define $G$ introduces multiple critical points in parameter space. However, the excellent performance of multilayer perceptrons in practice suggests that they are a reasonable model to use despite their lack of theoretical guarantees.

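5 Experiments

We trained adversarial nets on a range of datasets including MNIST [23], the Toronto Face Database (TFD) [28], and CIFAR-10 [21]. The generator nets used a mixture of rectifier linear activations [19, 9] and sigmoid activations, while the discriminator net used maxout [10] activations. Dropout [17] was applied in training the discriminator net. While our theoretical framework permits the use of dropout and other noise at intermediate layers of the generator, we used noise as the input to only the bottommost layer of the generator network.

We estimate the probability of the test set data under $p_{g}$ by fitting a Gaussian Parzen window to the samples generated with $G$ and reporting the log-likelihood under this distribution. The $\sigma$ parameter of the Gaussians was obtained by cross validation on the validation set. This procedure was introduced in Breuleux et al. [8] and used for various generative models for which the exact likelihood is not tractable [25, 3, 5]. Results are reported in Table 1. This method of estimating the likelihood has somewhat high variance and does not perform well in high dimensional spaces, but it is the best method available to our knowledge. Advances in generative models that can sample but not estimate likelihood directly motivate further research into how to evaluate such models.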
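The evaluation procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not the evaluation code used for the experiments: the kernel is an isotropic Gaussian, the arrays are random stand-ins for generator samples and held-out data, and the function name and the grid of candidate $\sigma$ values are hypothetical.

```python
import numpy as np

def parzen_log_likelihood(test_x, samples, sigma):
    """Mean log-density of test_x under an isotropic Gaussian Parzen window on samples."""
    n, d = samples.shape
    diffs = test_x[:, None, :] - samples[None, :, :]
    exponents = -np.sum(diffs ** 2, axis=2) / (2.0 * sigma ** 2)
    # Log-sum-exp over the generated samples, shifted by the row maximum for stability.
    max_e = np.max(exponents, axis=1, keepdims=True)
    log_sum = max_e[:, 0] + np.log(np.sum(np.exp(exponents - max_e), axis=1))
    log_norm = np.log(n) + 0.5 * d * np.log(2.0 * np.pi * sigma ** 2)
    return float(np.mean(log_sum - log_norm))

rng = np.random.default_rng(0)
samples = rng.normal(size=(2000, 2))  # stand-in for samples drawn from G
valid_x = rng.normal(size=(200, 2))   # stand-in for validation data
test_x = rng.normal(size=(200, 2))    # stand-in for test data

# Choose sigma on the validation set, then report the test log-likelihood.
sigmas = [0.05, 0.1, 0.2, 0.5, 1.0]
best_sigma = max(sigmas, key=lambda s: parzen_log_likelihood(valid_x, samples, s))
print(best_sigma, parzen_log_likelihood(test_x, samples, best_sigma))
```

Because the estimate depends on a finite set of generated samples and on the chosen bandwidth, it inherits the variance and dimensionality issues noted above.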

Table 1: Parzen window-based log-likelihood estimates.
Model	MNIST	TFD
DBN [3]	$138 \pm 2$	$1909 \pm 66$
Stacked CAE [3]	$121 \pm 1.6$	$\mathbf{2110 \pm 50}$
Deep GSN [6]	$214 \pm 1.1$	$1890 \pm 29$
Adversarial nets	$\mathbf{225 \pm 2}$	$\mathbf{2057 \pm 26}$

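In Figures 2 and 3 we show samples drawn from the generator net after training. While we make no claim that these samples are better than samples generated by existing methods, we believe that these samples are at least competitive with the better generative models in the literature and highlight the potential of the adversarial framework.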


[Figure: panels a), b), c), d)]

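Table 2: Comparison of generative adversarial nets with other generative modeling approaches.
	Deep directed graphical models	Deep undirected graphical models	Generative autoencoders	Adversarial models
Training	Inference needed during training.	Inference needed during training. MCMC needed to approximate partition function gradient.	Enforced tradeoff between mixing and power of reconstruction/generation.	Requires synchronizing the discriminator with the generator; risk of the "Helvetica scenario".
Inference	Learned approximate inference	Variational inference	MCMC-based inference	Learned approximate inference
Sampling	No difficulties	Requires Markov chain	Requires Markov chain	No difficulties
Evaluating $p(x)$	Intractable, may be approximated with AIS	Intractable, may be approximated with AIS	Not explicitly represented, may be approximated with Parzen density estimation	Not explicitly represented, may be approximated with Parzen density estimation
Model design	Nearly all models incur extreme difficulty	Careful design needed to ensure multiple properties	Any differentiable function is theoretically permitted	Any differentiable function is theoretically permitted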

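6 Advantages and disadvantages

This new framework comes with advantages and disadvantages relative to previous modeling frameworks. The disadvantages are primarily that there is no explicit representation of $p_{g}(\mathbf{x})$, and that $D$ must be synchronized well with $G$ during training. In particular, $G$ must not be trained too much without updating $D$, in order to avoid "the Helvetica scenario" in which $G$ collapses too many values of $\mathbf{z}$ to the same value of $\mathbf{x}$ to have enough diversity to model $p_{\text{data}}$, much as the negative chains of a Boltzmann machine must be kept up to date between learning steps. The advantages are that Markov chains are never needed, only backprop is used to obtain gradients, no inference is needed during learning, and a wide variety of functions can be incorporated into the model. Table 2 summarizes the comparison of generative adversarial nets with other generative modeling approaches.

The aforementioned advantages are primarily computational. Adversarial models may also gain some statistical advantage from the generator network not being updated directly with data examples, but only with gradients flowing through the discriminator. This means that components of the input are not copied directly into the generator's parameters. Another advantage of adversarial networks is that they can represent very sharp, even degenerate distributions, while methods based on Markov chains require that the distribution be somewhat blurry in order for the chains to be able to mix between modes.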

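7 Conclusions and future work

This framework admits many straightforward extensions:

1. A conditional generative model $p(\mathbf{x} \mid \mathbf{c})$ can be obtained by adding $\mathbf{c}$ as input to both $G$ and $D$.
2. Learned approximate inference can be performed by training an auxiliary network to predict $\mathbf{z}$ given $\mathbf{x}$. This is similar to the inference net trained by the wake-sleep algorithm [15] but with the advantage that the inference net may be trained for a fixed generator net after the generator net has finished training.
3. One can approximately model all conditionals $p(\mathbf{x}_{S} \mid \mathbf{x}_{\not S})$, where $S$ is a subset of the indices of $\mathbf{x}$, by training a family of conditional models that share parameters. Essentially, one can use adversarial nets to implement a stochastic extension of the deterministic MP-DBM [11].
4. Semi-supervised learning: features from the discriminator or inference net could improve the performance of classifiers when limited labeled data is available.
5. Efficiency improvements: training could be accelerated greatly by devising better methods for coordinating $G$ and $D$ or by determining better distributions to sample $\mathbf{z}$ from during training.

This paper has demonstrated the viability of the adversarial modeling framework, suggesting that these research directions could prove useful.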

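Acknowledgments

We would like to acknowledge Patrice Marcotte, Olivier Delalleau, Kyunghyun Cho, Guillaume Alain and Jason Yosinski for helpful discussions. Yann Dauphin shared his Parzen window evaluation code with us. We would like to thank the developers of Pylearn2 [12] and Theano [7, 1], particularly Frédéric Bastien, who rushed a Theano feature specifically to benefit this project. Arnaud Bergeron provided much-needed support with LaTeX typesetting. We would also like to thank CIFAR and Canada Research Chairs for funding, and Compute Canada and Calcul Québec for providing computational resources. Ian Goodfellow is supported by the 2013 Google Fellowship in Deep Learning. Finally, we would like to thank Les Trois Brasseurs for stimulating our creativity.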

References

Bastien et al. [2012] Bastien, F., Lamblin, P., Pascanu, R., Bergstra, J., Goodfellow, I. J., Bergeron, A., Bouchard, N., and Bengio, Y. (2012). Theano: new features and speed improvements. Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop.
Bengio [2009] Bengio, Y. (2009). Learning deep architectures for AI. Now Publishers.
Bengio et al. [2013a] Bengio, Y., Mesnil, G., Dauphin, Y., and Rifai, S. (2013a). Better mixing via deep representations. In ICML'13.
Bengio et al. [2013b] Bengio, Y., Yao, L., Alain, G., and Vincent, P. (2013b). Generalized denoising auto-encoders as generative models. In NIPS26. NIPS Foundation.
Bengio et al. [2014a] Bengio, Y., Thibodeau-Laufer, E., and Yosinski, J. (2014a). Deep generative stochastic networks trainable by backprop. In ICML'14.
Bengio et al. [2014b] Bengio, Y., Thibodeau-Laufer, E., Alain, G., and Yosinski, J. (2014b). Deep generative stochastic networks trainable by backprop. In Proceedings of the 30th International Conference on Machine Learning (ICML'14).
Bergstra et al. [2010] Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian, J., Warde-Farley, D., and Bengio, Y. (2010). Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy). Oral Presentation.
Breuleux et al. [2011] Breuleux, O., Bengio, Y., and Vincent, P. (2011). Quickly generating representative samples from an RBM-derived process. Neural Computation, 23(8), 2053–2073.
Glorot et al. [2011] Glorot, X., Bordes, A., and Bengio, Y. (2011). Deep sparse rectifier neural networks. In AISTATS'2011.
Goodfellow et al. [2013a] Goodfellow, I. J., Warde-Farley, D., Mirza, M., Courville, A., and Bengio, Y. (2013a). Maxout networks. In ICML'2013.
Goodfellow et al. [2013b] Goodfellow, I. J., Mirza, M., Courville, A., and Bengio, Y. (2013b). Multi-prediction deep Boltzmann machines. In NIPS'2013.
Goodfellow et al. [2013c] Goodfellow, I. J., Warde-Farley, D., Lamblin, P., Dumoulin, V., Mirza, M., Pascanu, R., Bergstra, J., Bastien, F., and Bengio, Y. (2013c). Pylearn2: a machine learning research library. arXiv preprint arXiv:1308.4214.
Gutmann and Hyvarinen [2010] Gutmann, M. and Hyvarinen, A. (2010). Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In AISTATS'2010.
Hinton et al. [2012a] Hinton, G., Deng, L., Dahl, G. E., Mohamed, A., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T., and Kingsbury, B. (2012a). Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Processing Magazine, 29(6), 82–97.
Hinton et al. [1995] Hinton, G. E., Dayan, P., Frey, B. J., and Neal, R. M. (1995). The wake-sleep algorithm for unsupervised neural networks. Science, 268, 1158–1161.
Hinton et al. [2006] Hinton, G. E., Osindero, S., and Teh, Y. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18, 1527–1554.
Hinton et al. [2012b] Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2012b). Improving neural networks by preventing co-adaptation of feature detectors. Technical report, arXiv:1207.0580.
Hyvärinen [2005] Hyvärinen, A. (2005). Estimation of non-normalized statistical models using score matching. J. Machine Learning Res., 6.
Jarrett et al. [2009] Jarrett, K., Kavukcuoglu, K., Ranzato, M., and LeCun, Y. (2009). What is the best multi-stage architecture for object recognition? In Proc. International Conference on Computer Vision (ICCV'09), pages 2146–2153. IEEE.
Kingma and Welling [2014] Kingma, D. P. and Welling, M. (2014). Auto-encoding variational bayes. In Proceedings of the International Conference on Learning Representations (ICLR).
Krizhevsky and Hinton [2009] Krizhevsky, A. and Hinton, G. (2009). Learning multiple layers of features from tiny images. Technical report, University of Toronto.
Krizhevsky et al. [2012] Krizhevsky, A., Sutskever, I., and Hinton, G. (2012). ImageNet classification with deep convolutional neural networks. In NIPS'2012.
LeCun et al. [1998] LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324.
Rezende et al. [2014] Rezende, D. J., Mohamed, S., and Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. Technical report, arXiv:1401.4082.
Rifai et al. [2012] Rifai, S., Bengio, Y., Dauphin, Y., and Vincent, P. (2012). A generative process for sampling contractive auto-encoders. In ICML'12.
Salakhutdinov and Hinton [2009] Salakhutdinov, R. and Hinton, G. E. (2009). Deep Boltzmann machines. In AISTATS'2009, pages 448–455.
Smolensky [1986] Smolensky, P. (1986). Information processing in dynamical systems: Foundations of harmony theory. In D. E. Rumelhart and J. L. McClelland, editors, Parallel Distributed Processing, volume 1, chapter 6, pages 194–281. MIT Press, Cambridge.
Susskind et al. [2010] Susskind, J., Anderson, A., and Hinton, G. E. (2010). The Toronto face dataset. Technical Report UTML TR 2010-001, U. Toronto.
Tieleman [2008] Tieleman, T. (2008). Training restricted Boltzmann machines using approximations to the likelihood gradient. In W. W. Cohen, A. McCallum, and S. T. Roweis, editors, ICML 2008, pages 1064–1071. ACM.
Vincent et al. [2008] Vincent, P., Larochelle, H., Bengio, Y., and Manzagol, P.-A. (2008). Extracting and composing robust features with denoising autoencoders. In ICML 2008.
Younes [1999] Younes, L. (1999). On the convergence of Markovian stochastic algorithms with rapidly decreasing ergodicity rates. Stochastics and Stochastic Reports, 65(3), 177–228.