A Theoretically Grounded Application of Dropout in Recurrent Neural Networks

Research Paper
Authors	Yarin Gal; Zoubin Ghahramani
Year	2015
Topic area	Machine Learning
Difficulty	Research
arXiv	1512.05287


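A Theoretically Grounded Application of Dropout in Recurrent Neural Networks is a paper by Yarin Gal and Zoubin Ghahramani (University of Cambridge), published at NeurIPS 2016, that derives a principled way to apply dropout in recurrent neural networks. By interpreting dropout as variational inference in a Bayesian neural network, the authors show that the same binary mask should be reused at every time step for the input, output, and recurrent connections. Unlike earlier heuristic schemes, this regularizes every weight matrix in an LSTM or GRU. Applied to the LSTM language model of Zaremba et al. on Penn Treebank, the method improved the single-model state of the art to a test perplexity of 73.4.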

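Overview

Recurrent networks are notoriously hard to regularize. Naive dropout, which resamples a Bernoulli mask at every time step, was widely believed to disrupt the recurrent dynamics, so earlier approaches (Pham et al., Zaremba et al.) restricted dropout to the feed-forward connections. This left the recurrent weight matrices unregularized, and models still overfit easily on small corpora.

Gal and Ghahramani re-derive dropout from the standpoint of approximate variational inference in Bayesian neural networks. The RNN weight matrices are treated as random variables with a mixture of Gaussians as the approximate posterior; when this variational distribution approximates a Bernoulli, ordinary dropout is recovered exactly. The key point is that, in a sequence model, the Monte Carlo sample from the posterior is drawn once per sequence, so the resulting mask is shared across all time steps. This single change makes it possible to apply dropout to the recurrent connections without destroying the temporal dynamics.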
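The difference between the two masking schemes can be sketched in a few lines. This is a minimal illustration rather than the authors' code; the helper names (`sample_mask`, `naive_masks`, `per_sequence_masks`) and the sizes are assumptions chosen for the example.

```python
import random

def sample_mask(size, drop_prob, rng):
    """Return a list of 0/1 entries; each unit is dropped (0) with probability drop_prob."""
    return [0 if rng.random() < drop_prob else 1 for _ in range(size)]

def naive_masks(num_steps, size, drop_prob, rng):
    """Naive dropout: a fresh mask is drawn at every time step."""
    return [sample_mask(size, drop_prob, rng) for _ in range(num_steps)]

def per_sequence_masks(num_steps, size, drop_prob, rng):
    """Per-sequence scheme: one mask is drawn and repeated at every time step."""
    mask = sample_mask(size, drop_prob, rng)
    return [mask for _ in range(num_steps)]

rng = random.Random(0)
naive = naive_masks(num_steps=5, size=8, drop_prob=0.5, rng=rng)
shared = per_sequence_masks(num_steps=5, size=8, drop_prob=0.5, rng=rng)

# Every time step of the shared scheme sees the identical mask.
all_steps_equal = all(step == shared[0] for step in shared)
```

With the shared scheme a dropped unit stays dropped for the whole sequence, so the recurrent state is never perturbed by a new noise pattern from one step to the next.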

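Main contributions

A Bayesian derivation of dropout for RNNs (the variational RNN), which gives a theoretical justification for the choice of masks.
A same-mask-per-sequence rule for the input, output, and recurrent layers of LSTMs and GRUs, presented side by side with naive dropout.
Embedding dropout: a probabilistic treatment of the one-hot inputs that randomly drops entire word types (rather than tokens) from a sentence, a source of regularization previously overlooked in language models.
A new single-model state of the art on the Penn Treebank language modeling benchmark (test perplexity reduced from 78.4 to 73.4).
MC dropout as a posterior predictive estimator at test time, complemented by a cheaper mean-field approximation.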

Method

A variational view of RNNs

For an input sequence $\mathbf{x} = [\mathbf{x}_1, \dots, \mathbf{x}_T]$, a simple RNN repeatedly applies $\mathbf{h}_t = \sigma(\mathbf{x}_t \mathbf{W}_h + \mathbf{h}_{t-1} \mathbf{U}_h + \mathbf{b}_h)$. The authors treat $\boldsymbol{\omega} = \{\mathbf{W}_h, \mathbf{U}_h, \mathbf{b}_h, \mathbf{W}_y, \mathbf{b}_y\}$ as random variables with normal priors and approximate the intractable posterior $p(\boldsymbol{\omega} \mid \mathbf{X}, \mathbf{Y})$ by minimizing the following objective:

$$ \mathrm{KL}(q(\boldsymbol{\omega}) \parallel p(\boldsymbol{\omega} \mid \mathbf{X}, \mathbf{Y})) \propto -\sum_{i=1}^N \int q(\boldsymbol{\omega}) \log p(\mathbf{y}_i \mid \mathbf{f}^{\boldsymbol{\omega}}(\mathbf{x}_i))\, \mathrm{d}\boldsymbol{\omega} + \mathrm{KL}(q(\boldsymbol{\omega}) \parallel p(\boldsymbol{\omega})). $$

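A single Monte Carlo sample $\widehat{\boldsymbol{\omega}}_i \sim q(\boldsymbol{\omega})$ is drawn for each sequence and reused at every time step $t \le T$. The factorized approximating distribution over each row $\mathbf{w}_k$ of a weight matrix is the following two-component mixture: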

$$ q(\mathbf{w}_k) = p\, \mathcal{N}(\mathbf{w}_k; \mathbf{0}, \sigma^2 I) + (1-p)\, \mathcal{N}(\mathbf{w}_k; \mathbf{m}_k, \sigma^2 I), $$

where $\sigma^2$ is small. The KL term reduces to $L_2$ regularization on the variational means $\mathbf{m}_k$.
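To see why this mixture behaves like dropout, the sketch below draws rows from $q(\mathbf{w}_k)$ directly. It is an illustration under stated assumptions: the function `sample_row`, the example mean `m_k`, and the value of `p` are hypothetical, and $\sigma$ is set to exactly zero to show the limiting case.

```python
import random

def sample_row(mean_row, drop_prob, sigma, rng):
    """Draw one row w_k from p*N(0, sigma^2 I) + (1-p)*N(m_k, sigma^2 I)."""
    dropped = rng.random() < drop_prob  # choose the zero-mean component with probability p
    centre = [0.0] * len(mean_row) if dropped else list(mean_row)
    return [c + rng.gauss(0.0, sigma) for c in centre]

rng = random.Random(0)
m_k = [0.5, -1.0, 2.0]  # hypothetical variational mean for one row
p = 0.25                # probability of the zero-mean component (the drop probability)

# With sigma = 0 every draw is exactly 0 or exactly m_k: Bernoulli dropout on the row.
draws = [sample_row(m_k, p, 0.0, rng) for _ in range(4000)]
zero_fraction = sum(1 for w in draws if w == [0.0, 0.0, 0.0]) / len(draws)

# The mean of q(w_k) is (1 - p) * m_k; here it is estimated by averaging the draws.
mean_estimate = [sum(w[j] for w in draws) / len(draws) for j in range(len(m_k))]
```

In this limit the whole row is either kept or zeroed, which is equivalent to multiplying the corresponding input unit by a Bernoulli mask entry.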

Implementation in LSTMs and GRUs

For an LSTM with the tied-weights parametrization, each step computes:

$$ \begin{pmatrix}\mathbf{i}\\ \mathbf{f}\\ \mathbf{o}\\ \mathbf{g}\end{pmatrix} = \begin{pmatrix}\mathrm{sigm}\\ \mathrm{sigm}\\ \mathrm{sigm}\\ \tanh\end{pmatrix}\!\left(\begin{pmatrix}\mathbf{x}_t \circ \mathbf{z}_x\\ \mathbf{h}_{t-1} \circ \mathbf{z}_h\end{pmatrix} \cdot \mathbf{W}\right), $$

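where $\mathbf{z}_x$ and $\mathbf{z}_h$ are Bernoulli masks sampled once per sequence and reused for all $t$. The untied-weights LSTM uses a separate mask for each gate, accepting the cost of four matrix multiplications per step in exchange for lower-variance gradients.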
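The tied-weights step can be sketched in plain Python as follows. This is not the authors' Torch implementation: the function names, the dimensions, and the random weights are assumptions, the standard LSTM cell update ($\mathbf{c}_t = \mathbf{f} \circ \mathbf{c}_{t-1} + \mathbf{i} \circ \mathbf{g}$, $\mathbf{h}_t = \mathbf{o} \circ \tanh(\mathbf{c}_t)$) is assumed for the part the equation does not show, and biases and mask rescaling are omitted to mirror the equation as written.

```python
import math
import random

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def sample_mask(size, drop_prob, rng):
    """Bernoulli mask with entries 0.0 (dropped) or 1.0 (kept)."""
    return [0.0 if rng.random() < drop_prob else 1.0 for _ in range(size)]

def lstm_step(x_t, h_prev, c_prev, W, z_x, z_h):
    """One tied-weights LSTM step with masked input and masked recurrent state."""
    hidden = len(h_prev)
    # Concatenate the masked input and the masked previous hidden state.
    v = [x * z for x, z in zip(x_t, z_x)] + [h * z for h, z in zip(h_prev, z_h)]
    # Single product with W of shape (len(x_t) + hidden) x (4 * hidden).
    pre = [sum(v[r] * W[r][col] for r in range(len(v))) for col in range(4 * hidden)]
    i = [sigmoid(a) for a in pre[0:hidden]]
    f = [sigmoid(a) for a in pre[hidden:2 * hidden]]
    o = [sigmoid(a) for a in pre[2 * hidden:3 * hidden]]
    g = [math.tanh(a) for a in pre[3 * hidden:4 * hidden]]
    c_t = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c_prev, i, g)]
    h_t = [oj * math.tanh(cj) for oj, cj in zip(o, c_t)]
    return h_t, c_t

def run_sequence(xs, W, hidden, drop_prob, rng):
    """Sample z_x and z_h once, then reuse them at every time step of the sequence."""
    z_x = sample_mask(len(xs[0]), drop_prob, rng)
    z_h = sample_mask(hidden, drop_prob, rng)
    h, c = [0.0] * hidden, [0.0] * hidden
    for x_t in xs:
        h, c = lstm_step(x_t, h, c, W, z_x, z_h)
    return h, z_x, z_h

rng = random.Random(0)
input_dim, hidden, steps = 3, 4, 6
W = [[rng.uniform(-0.5, 0.5) for _ in range(4 * hidden)] for _ in range(input_dim + hidden)]
xs = [[rng.uniform(-1.0, 1.0) for _ in range(input_dim)] for _ in range(steps)]
h_final, z_x, z_h = run_sequence(xs, W, hidden, drop_prob=0.25, rng=rng)
```

The masks are sampled outside the time loop, which is the only structural difference from a naive implementation that would call `sample_mask` inside it.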

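Embedding dropout

For discrete inputs, dropout is applied to the rows of the embedding matrix $\mathbf{W}_E \in \mathbb{R}^{V \times D}$, with the same mask used across the whole sequence. A dropped word type disappears at every position where it occurs (for example, "the dog and the cat" becomes "_ dog and _ cat", but never "_ dog and the cat"). When the sequence length $T \ll V$, an implementation only needs to mask the $T$ embeddings that are actually used.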
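A minimal sketch of dropping by word type, assuming integer token ids and a small made-up vocabulary and embedding table (none of these values come from the paper). One keep/drop decision is made per distinct type in the sequence, so at most $T$ rows are touched rather than all $V$.

```python
import random

def embedding_dropout(token_ids, embedding, drop_prob, rng):
    """Look up embeddings, zeroing every occurrence of each dropped word type."""
    # One keep/drop decision per distinct word type that appears in this sequence.
    keep = {tok: rng.random() >= drop_prob for tok in set(token_ids)}
    dim = len(embedding[0])
    rows = [list(embedding[tok]) if keep[tok] else [0.0] * dim for tok in token_ids]
    return rows, keep

# Hypothetical vocabulary and embedding table, for illustration only.
vocab = {"the": 0, "dog": 1, "and": 2, "cat": 3}
embedding = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
sentence = ["the", "dog", "and", "the", "cat"]
token_ids = [vocab[word] for word in sentence]

rng = random.Random(0)
rows, keep = embedding_dropout(token_ids, embedding, drop_prob=0.5, rng=rng)

# Both occurrences of "the" (positions 0 and 3) share a single decision.
the_rows_match = rows[0] == rows[3]
```

Because the decision is stored per type in `keep`, a repeated word can never be dropped at one position and kept at another within the same sequence.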

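Results

Penn Treebank language modeling. After plugging the variational LSTM into the Torch reference implementation of Zaremba et al. and tuning the weight decay, the authors report:

Medium model (650 units per layer): test perplexity 78.6 (variational, untied, MC), versus 82.7 for Zaremba et al.
Large model (1500 units per layer): test perplexity 73.4 (variational, untied, MC), versus 78.4 for Zaremba et al.; validation perplexity 77.3 (tied).
The variant of Moon et al., which uses the same mask only on the LSTM cell, underperforms Zaremba et al. unless it is combined with the newly proposed embedding dropout; even then it still trails the variational variants.
An ensemble of 10 variational LSTMs reaches a test perplexity of 68.7, on par with the 38-model ensemble of Zaremba et al.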

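Sentiment analysis (Cornell film review corpus). On 5,000 reviews truncated to 200-token segments, the variational LSTM and variational GRU were the only models that did not overfit, and they achieved the lowest test error among the LSTM/GRU baselines.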

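Ablations. Recurrent-layer dropout ($p_U$) and embedding dropout ($p_E$) must be used together: with $p_E = 0$, increasing $p_U$ actually worsens overfitting, because the unregularized embedding layer dominates. With $p_E = 0.5$, increasing $p_U$ improves robustness as expected. Under variational dropout, weight decay still plays an important role (it corresponds to the prior), in contrast to the common practice of omitting weight decay with naive dropout. The cheap dropout approximation (replacing $\mathbf{W}$ at test time with its mean under $q$, that is $(1-p)\mathbf{W}$ when $p$ is the drop probability) works well as a substitute for MC dropout in this setting.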
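The two test-time estimators can be compared on a toy model. The sketch below uses a single logistic unit rather than the paper's LSTM language model, and the inputs, weights, and function names are all assumptions. Here `drop_prob` is the probability of dropping a unit, so the mean of each mask entry is `1 - drop_prob`.

```python
import math
import random

def predict(x, w, mask):
    """Logistic output of a single unit with a dropout mask applied to its inputs."""
    activation = sum(xi * zi * wi for xi, zi, wi in zip(x, mask, w))
    return 1.0 / (1.0 + math.exp(-activation))

def mc_dropout_predict(x, w, drop_prob, num_samples, rng):
    """MC dropout: average the prediction over masks sampled at test time."""
    total = 0.0
    for _ in range(num_samples):
        mask = [0.0 if rng.random() < drop_prob else 1.0 for _ in x]
        total += predict(x, w, mask)
    return total / num_samples

def scaled_weights_predict(x, w, drop_prob):
    """Cheap approximation: one pass with each mask entry replaced by its mean."""
    return predict(x, w, [1.0 - drop_prob] * len(x))

rng = random.Random(0)
x = [1.0, -0.5, 2.0, 0.3]   # hypothetical input
w = [0.4, 0.8, -0.3, 1.2]   # hypothetical weights

mc_estimate = mc_dropout_predict(x, w, drop_prob=0.5, num_samples=2000, rng=rng)
cheap_estimate = scaled_weights_predict(x, w, drop_prob=0.5)
gap = abs(mc_estimate - cheap_estimate)
```

The two estimates differ because the mean of a nonlinear function is not the function of the mean, but for a small model like this one the gap is typically small. MC dropout costs `num_samples` forward passes, while the approximation costs one.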

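Impact

Variational dropout for RNNs became the default regularizer for recurrent language models and was quickly absorbed into the major deep learning frameworks (Keras, PyTorch, TensorFlow), often under the name recurrent dropout or variational dropout. The Penn Treebank perplexity of 73.4 served as a single-model reference point for subsequent work, and the technique was a key ingredient in several later state-of-the-art results (e.g. AWD-LSTM, mixture-of-softmaxes). More broadly, the paper consolidated the Bayesian view of dropout introduced in Gal and Ghahramani's earlier work, establishing MC dropout as a practical tool for estimating uncertainty in deep learning.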

References

Gal, Y. and Ghahramani, Z. (2016). A Theoretically Grounded Application of Dropout in Recurrent Neural Networks. Advances in Neural Information Processing Systems 29.
Gal, Y. and Ghahramani, Z. (2016). Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning. ICML.
Srivastava, N. et al. (2014). Dropout: A Simple Way to Prevent Neural Networks from Overfitting. JMLR.
Zaremba, W., Sutskever, I. and Vinyals, O. (2014). Recurrent Neural Network Regularization. arXiv:1409.2329.
Hochreiter, S. and Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation.
Cho, K. et al. (2014). Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. EMNLP.
Marcus, M. P., Marcinkiewicz, M. A. and Santorini, B. (1993). Building a Large Annotated Corpus of English: The Penn Treebank. Computational Linguistics.
Moon, T., Choi, H., Lee, H. and Song, I. (2015). RNNDROP: A Novel Dropout for RNNs in ASR. ASRU.
Pang, B. and Lee, L. (2005). Seeing Stars: Exploiting Class Relationships for Sentiment Categorization. ACL.