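Provided proper attribution is given, Google hereby grants permission to reproduce the tables and figures in this paper solely for use in journalistic or scholarly works.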

Attention Is All You Need

Research Paper
Authors	Ashish Vaswani; Noam Shazeer; Niki Parmar; Jakob Uszkoreit; Llion Jones; Aidan N. Gomez; Lukasz Kaiser; Illia Polosukhin
Year	2017
Topic area	NLP
Difficulty	Research
arXiv	1706.03762

Ashish Vaswani*, Google Brain, avaswani@google.com
Noam Shazeer*, Google Brain, noam@google.com
Niki Parmar*, Google Research, nikip@google.com
Jakob Uszkoreit*, Google Research, usz@google.com
Llion Jones*, Google Research, llion@google.com
Aidan N. Gomez*†, University of Toronto, aidan@cs.toronto.edu
Łukasz Kaiser*, Google Brain, lukaszkaiser@google.com
Illia Polosukhin*‡, illia.polosukhin@gmail.com

* Equal contribution. Listing order is random. Jakob proposed replacing RNNs with self-attention and started the effort to evaluate this idea. Ashish, with Illia, designed and implemented the first Transformer models and has been crucially involved in every aspect of this work. Noam proposed scaled dot-product attention, multi-head attention and the parameter-free position representation, and became the other person involved in nearly every detail. Niki designed, implemented, tuned and evaluated countless model variants in our original codebase and tensor2tensor. Llion also experimented with novel model variants and was responsible for our initial codebase and for efficient inference and visualizations. Lukasz and Aidan spent countless long days designing and implementing various parts of tensor2tensor, replacing our earlier codebase, greatly improving results and massively accelerating our research.
† Work performed while at Google Brain.
‡ Work performed while at Google Research.

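Abstract

The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.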

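1 Introduction

Recurrent neural networks, long short-term memory [13] and gated recurrent [7] neural networks in particular, have been firmly established as state-of-the-art approaches in sequence modeling and transduction problems such as language modeling and machine translation [35, 2, 5]. Numerous efforts have since continued to push the boundaries of recurrent language models and encoder-decoder architectures [38, 24, 15].

Recurrent models typically factor computation along the symbol positions of the input and output sequences. Aligning the positions to steps in computation time, they generate a sequence of hidden states ${\textstyle h_{t}}$ , as a function of the previous hidden state ${\textstyle h_{t-1}}$ and the input for position ${\textstyle t}$ . This inherently sequential nature precludes parallelization within training examples, which becomes critical at longer sequence lengths, as memory constraints limit batching across examples. Recent work has achieved significant improvements in computational efficiency through factorization tricks [21] and conditional computation [32], while also improving model performance in the case of the latter. The fundamental constraint of sequential computation, however, remains.

Attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks, allowing modeling of dependencies without regard to their distance in the input or output sequences [2, 19]. In all but a few cases [27], however, such attention mechanisms are used in conjunction with a recurrent network.

In this work we propose the Transformer, a model architecture that eschews recurrence and instead relies entirely on an attention mechanism to draw global dependencies between input and output. The Transformer allows for significantly more parallelization and can reach a new state of the art in translation quality after being trained for as little as twelve hours on eight P100 GPUs.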

2 Background

The goal of reducing sequential computation also forms the foundation of the Extended Neural GPU [16], ByteNet [18] and ConvS2S [9], all of which use convolutional neural networks as the basic building block, computing hidden representations in parallel for all input and output positions. In these models, the number of operations required to relate signals from two arbitrary input or output positions grows with the distance between positions, linearly for ConvS2S and logarithmically for ByteNet. This makes it more difficult to learn dependencies between distant positions [12]. In the Transformer this is reduced to a constant number of operations, albeit at the cost of reduced effective resolution due to averaging attention-weighted positions, an effect we counteract with Multi-Head Attention as described in section 3.2.

Self-attention, sometimes called intra-attention, is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence. Self-attention has been used successfully in a variety of tasks including reading comprehension, abstractive summarization, textual entailment and learning task-independent sentence representations [4, 27, 28, 22].

End-to-end memory networks are based on a recurrent attention mechanism instead of sequence-aligned recurrence and have been shown to perform well on simple-language question answering and language modeling tasks [34].

To the best of our knowledge, however, the Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequence-aligned RNNs or convolution. In the following sections, we describe the Transformer, motivate self-attention and discuss its advantages over models such as [17, 18] and [9].

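3 Model Architecture

Most competitive neural sequence transduction models have an encoder-decoder structure [5, 2, 35]. Here, the encoder maps an input sequence of symbol representations ${\textstyle (x_{1},\ldots,x_{n})}$ to a sequence of continuous representations ${\textstyle \mathbf{z} = (z_{1},\ldots,z_{n})}$ . Given ${\textstyle \mathbf{z}}$ , the decoder then generates an output sequence ${\textstyle (y_{1},\ldots,y_{m})}$ of symbols one element at a time. At each step the model is auto-regressive [10], consuming the previously generated symbols as additional input when generating the next.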
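The auto-regressive loop described above can be sketched as follows. This is only an illustration under stated assumptions, not the paper's decoder: `greedy_decode`, `toy_next_token` and the `bos`/`eos` token ids are hypothetical names, and greedy selection stands in for the beam search used in the experiments.

```python
def greedy_decode(next_token, z, bos, eos, max_len):
    """Generate one symbol at a time, feeding earlier outputs back in."""
    ys = [bos]  # previously generated symbols, starting from a begin marker
    for _ in range(max_len):
        y = next_token(z, ys)  # depends on encoder output z and all prior outputs
        ys.append(y)
        if y == eos:
            break
    return ys[1:]  # drop the begin marker


def toy_next_token(z, ys):
    """Hypothetical stand-in for a decoder: copies z, then emits end marker 9."""
    position = len(ys) - 1
    return z[position] if position < len(z) else 9


decoded = greedy_decode(toy_next_token, z=[3, 1, 2], bos=0, eos=9, max_len=10)
```

The point of the sketch is the data dependency: each call to `next_token` sees everything generated so far, which is why decoding stays sequential at inference time.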

The Transformer follows this overall architecture using stacked self-attention and point-wise, fully connected layers for both the encoder and decoder, shown in the left and right halves of Figure 1, respectively.

3.1 Encoder and Decoder Stacks

Encoder:

The encoder is composed of a stack of ${\textstyle N = 6}$ identical layers. Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network. We employ a residual connection [11] around each of the two sub-layers, followed by layer normalization [1]. That is, the output of each sub-layer is ${\textstyle \mathrm{LayerNorm}(x + \mathrm{Sublayer}(x))}$ , where ${\textstyle \mathrm{Sublayer}(x)}$ is the function implemented by the sub-layer itself. To facilitate these residual connections, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension ${\textstyle d_{\text{model}} = 512}$ .
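A minimal sketch of the sub-layer wrapper LayerNorm(x + Sublayer(x)) for a single position is given below. It is an illustration under assumptions: the function names are hypothetical, the learned gain and bias of layer normalization [1] are omitted, and `eps` is an arbitrary small constant.

```python
import math


def layer_norm(x, eps=1e-6):
    """Normalize a vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def residual_sublayer(x, sublayer):
    """Residual connection around a sub-layer, followed by normalization."""
    y = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, y)])


# Toy sub-layer (hypothetical): halves its input. Input and output share one dimension,
# which is what makes the element-wise sum x + Sublayer(x) possible.
out = residual_sublayer([1.0, 2.0, 3.0, 4.0], lambda v: [0.5 * t for t in v])
```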

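Decoder:

The decoder is also composed of a stack of ${\textstyle N = 6}$ identical layers. In addition to the two sub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack. Similar to the encoder, we employ residual connections around each of the sub-layers, followed by layer normalization. We also modify the self-attention sub-layer in the decoder stack to prevent positions from attending to subsequent positions. This masking, combined with the fact that the output embeddings are offset by one position, ensures that the predictions for position ${\textstyle i}$ can depend only on the known outputs at positions less than ${\textstyle i}$ .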

3.2 Attention

An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.

3.2.1 Scaled Dot-Product Attention

We call our particular attention "Scaled Dot-Product Attention" (Figure 2). The input consists of queries and keys of dimension ${\textstyle d_{k}}$ , and values of dimension ${\textstyle d_{v}}$ . We compute the dot products of the query with all keys, divide each by ${\textstyle \sqrt{d_{k}}}$ , and apply a softmax function to obtain the weights on the values.

In practice, we compute the attention function on a set of queries simultaneously, packed together into a matrix ${\textstyle Q}$ . The keys and values are also packed together into matrices ${\textstyle K}$ and ${\textstyle V}$ . We compute the matrix of outputs as:

	$\mathrm{Attention}(Q,K,V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_{k}}}\right)V$		(1)
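Equation (1) can be sketched in plain Python as follows. This is a minimal, unoptimized illustration using nested lists instead of a matrix library; the function names and the toy inputs are assumptions made for the example.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q, K, V):
    """Row i of the result is a softmax-weighted sum of the rows of V."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # Compatibility of this query with every key, scaled by 1/sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return out


Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
result = scaled_dot_product_attention(Q, K, V)
```

In this toy case the first query aligns with the first key, so the first output row lies closer to the first value row than to the second.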

The two most commonly used attention functions are additive attention [2] and dot-product (multiplicative) attention. Dot-product attention is identical to our algorithm, except for the scaling factor of ${\textstyle \frac{1}{\sqrt{d_{k}}}}$ . Additive attention computes the compatibility function using a feed-forward network with a single hidden layer. While the two are similar in theoretical complexity, dot-product attention is much faster and more space-efficient in practice, since it can be implemented using highly optimized matrix multiplication code.

While for small values of ${\textstyle d_{k}}$ the two mechanisms perform similarly, additive attention outperforms dot-product attention without scaling for larger values of ${\textstyle d_{k}}$ [3]. We suspect that for large values of ${\textstyle d_{k}}$ , the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients (see footnote 1). To counteract this effect, we scale the dot products by ${\textstyle \frac{1}{\sqrt{d_{k}}}}$ .

Footnote 1: To illustrate why the dot products get large, assume that the components of ${\textstyle q}$ and ${\textstyle k}$ are independent random variables with mean ${\textstyle 0}$ and variance ${\textstyle 1}$ . Then their dot product, ${\textstyle q \cdot k = \sum_{i=1}^{d_{k}} q_{i}k_{i}}$ , has mean ${\textstyle 0}$ and variance ${\textstyle d_{k}}$ .
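The variance argument can be checked empirically with a short simulation. This is a sketch under the footnote's assumptions (independent components with mean 0 and variance 1, drawn here from a standard normal); the sample count and seed are arbitrary choices.

```python
import random


def dot_product_variance(d_k, samples=2000, seed=0):
    """Estimate Var(q . k) for random q, k with i.i.d. standard normal components."""
    rng = random.Random(seed)
    dots = []
    for _ in range(samples):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        dots.append(sum(a * b for a, b in zip(q, k)))
    mean = sum(dots) / samples
    return sum((x - mean) ** 2 for x in dots) / samples


raw = dot_product_variance(64)  # expected to be close to d_k = 64
scaled = raw / 64               # dividing dot products by sqrt(d_k) divides the variance by d_k
```

The scaled variance stays near 1 regardless of the key dimension, which keeps the softmax inputs in a range where gradients are not vanishingly small.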

3.2.2 Multi-Head Attention

Figure 2: Scaled Dot-Product Attention (left) and Multi-Head Attention (right).

Instead of performing a single attention function with ${\textstyle d_{\text{model}}}$ -dimensional keys, values and queries, we found it beneficial to linearly project the queries, keys and values ${\textstyle h}$ times with different, learned linear projections to ${\textstyle d_{k}}$ , ${\textstyle d_{k}}$ and ${\textstyle d_{v}}$ dimensions, respectively. On each of these projected versions of queries, keys and values we then perform the attention function in parallel, yielding ${\textstyle d_{v}}$ -dimensional output values. These are concatenated and once again projected, resulting in the final values, as depicted in Figure 2.

Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions. With a single attention head, averaging inhibits this.

	${\textstyle \mathrm{MultiHead}(Q,K,V)}$	${\textstyle = \mathrm{Concat}(\mathrm{head}_{1},\ldots,\mathrm{head}_{h})W^{O}}$
	${\textstyle \text{where } \mathrm{head}_{i}}$	${\textstyle = \mathrm{Attention}(QW_{i}^{Q},KW_{i}^{K},VW_{i}^{V})}$

Where the projections are parameter matrices ${\textstyle W_{i}^{Q} \in \mathbb{R}^{d_{\text{model}} \times d_{k}}}$ , ${\textstyle W_{i}^{K} \in \mathbb{R}^{d_{\text{model}} \times d_{k}}}$ , ${\textstyle W_{i}^{V} \in \mathbb{R}^{d_{\text{model}} \times d_{v}}}$ and ${\textstyle W^{O} \in \mathbb{R}^{hd_{v} \times d_{\text{model}}}}$ .

In this work we employ ${\textstyle h = 8}$ parallel attention layers, or heads. For each of these we use ${\textstyle d_{k} = d_{v} = d_{\text{model}}/h = 64}$ . Due to the reduced dimension of each head, the total computational cost is similar to that of single-head attention with full dimensionality.
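A minimal sketch of multi-head attention on nested lists is shown below. It is illustrative only: the helper names, the tiny dimensions (d_model = 8, h = 2) and the random weights are assumptions for the example, while the parameter count at the end uses the values quoted above.

```python
import math
import random


def matmul(A, B):
    """Plain matrix product of two nested-list matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V"""
    d_k = len(K[0])
    scores = matmul(Q, [list(col) for col in zip(*K)])
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V)


def multi_head(Q, K, V, W_q, W_k, W_v, W_o):
    """Project h times, attend per head, concatenate, project with W_o."""
    heads = [attention(matmul(Q, wq), matmul(K, wk), matmul(V, wv))
             for wq, wk, wv in zip(W_q, W_k, W_v)]
    concat = [sum((head[i] for head in heads), []) for i in range(len(Q))]
    return matmul(concat, W_o)


rng = random.Random(0)
rand = lambda r, c: [[rng.uniform(-1, 1) for _ in range(c)] for _ in range(r)]
n, d_model, h = 3, 8, 2
d_k = d_v = d_model // h
X = rand(n, d_model)  # self-attention: queries, keys and values are all X
out = multi_head(X, X, X,
                 [rand(d_model, d_k) for _ in range(h)],
                 [rand(d_model, d_k) for _ in range(h)],
                 [rand(d_model, d_v) for _ in range(h)],
                 rand(h * d_v, d_model))

# Projection parameters for the paper's setting (h = 8, d_model = 512, d_k = d_v = 64).
params_paper = 8 * (2 * 512 * 64 + 512 * 64) + 8 * 64 * 512
```

With d_k = d_v = d_model / h, the projection matrices together hold 4 · d_model² weights, independent of the number of heads, which is consistent with the remark about comparable computational cost.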

3.2.3 Applications of Attention in our Model

The Transformer uses multi-head attention in three different ways:

• In "encoder-decoder attention" layers, the queries come from the previous decoder layer, and the memory keys and values come from the output of the encoder. This allows every position in the decoder to attend over all positions in the input sequence. This mimics the typical encoder-decoder attention mechanisms in sequence-to-sequence models such as [38, 2, 9].

• The encoder contains self-attention layers. In a self-attention layer all of the keys, values and queries come from the same place, in this case the output of the previous layer in the encoder. Each position in the encoder can attend to all positions in the previous layer of the encoder.

• Similarly, self-attention layers in the decoder allow each position in the decoder to attend to all positions in the decoder up to and including that position. We need to prevent leftward information flow in the decoder to preserve the auto-regressive property. We implement this inside scaled dot-product attention by masking out (setting to ${\textstyle -\infty}$ ) all values in the input of the softmax which correspond to illegal connections. See Figure 2.
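The decoder masking described in the last item can be sketched as follows: scores for disallowed positions are set to negative infinity before the softmax, so they receive zero weight. The helper names are hypothetical, and uniform scores are used only to make the resulting weights easy to read.

```python
import math

NEG_INF = float("-inf")


def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def masked_softmax(scores, allowed):
    """Softmax in which disallowed entries are set to -inf beforehand."""
    masked = [s if ok else NEG_INF for s, ok in zip(scores, allowed)]
    m = max(masked)  # finite, because each position may attend to itself
    exps = [math.exp(s - m) for s in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


# With equal scores, each position spreads its weight evenly over itself
# and the positions before it, and gives none to later positions.
weights = [masked_softmax([0.0] * 4, row) for row in causal_mask(4)]
```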

3.3 Position-wise Feed-Forward Networks

In addition to attention sub-layers, each of the layers in our encoder and decoder contains a fully connected feed-forward network, which is applied to each position separately and identically. This consists of two linear transformations with a ReLU activation in between.

	$\mathrm{FFN}(x) = \max(0, xW_{1} + b_{1})W_{2} + b_{2}$		(2)

While the linear transformations are the same across different positions, they use different parameters from layer to layer. Another way of describing this is as two convolutions with kernel size 1. The dimensionality of input and output is ${\textstyle d_{\text{model}} = 512}$ , and the inner layer has dimensionality ${\textstyle d_{ff} = 2048}$ .
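Equation (2) for a single position can be sketched as below. The tiny dimensions (d_model = 2, d_ff = 3) and the weight values are made up for the example; applying the same function with the same weights to every position gives the position-wise behavior described above.

```python
def ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2 for one position (x is a row vector)."""
    hidden = [max(0.0, sum(xi * w for xi, w in zip(x, col)) + b)
              for col, b in zip(zip(*W1), b1)]
    return [sum(hi * w for hi, w in zip(hidden, col)) + b
            for col, b in zip(zip(*W2), b2)]


W1 = [[1.0, -1.0, 0.0],
      [0.0, 1.0, 1.0]]          # d_model x d_ff
b1 = [0.0, 0.0, 0.0]
W2 = [[1.0, 0.0],
      [0.0, 1.0],
      [1.0, 1.0]]               # d_ff x d_model
b2 = [0.5, -0.5]

# The same weights are reused at every position of the sequence.
sequence = [[2.0, 1.0], [0.0, 3.0]]
outputs = [ffn(x, W1, b1, W2, b2) for x in sequence]
```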

3.4 Embeddings and Softmax

Similarly to other sequence transduction models, we use learned embeddings to convert the input tokens and output tokens to vectors of dimension ${\textstyle d_{\text{model}}}$ . We also use the usual learned linear transformation and softmax function to convert the decoder output to predicted next-token probabilities. In our model, we share the same weight matrix between the two embedding layers and the pre-softmax linear transformation, similar to [30]. In the embedding layers, we multiply those weights by ${\textstyle \sqrt{d_{\text{model}}}}$ .

3.5 Positional Encoding

Since our model contains no recurrence and no convolution, in order for the model to make use of the order of the sequence, we must inject some information about the relative or absolute position of the tokens in the sequence. To this end, we add "positional encodings" to the input embeddings at the bottoms of the encoder and decoder stacks. The positional encodings have the same dimension ${\textstyle d_{\text{model}}}$ as the embeddings, so that the two can be summed. There are many choices of positional encodings, learned and fixed [9].

In this work, we use sine and cosine functions of different frequencies:

	${\textstyle PE_{(pos,2i)} = \sin(pos/10000^{2i/d_{\text{model}}})}$
	${\textstyle PE_{(pos,2i+1)} = \cos(pos/10000^{2i/d_{\text{model}}})}$

where ${\textstyle pos}$ is the position and ${\textstyle i}$ is the dimension. That is, each dimension of the positional encoding corresponds to a sinusoid. The wavelengths form a geometric progression from ${\textstyle 2\pi}$ to ${\textstyle 10000 \cdot 2\pi}$ . We chose this function because we hypothesized it would allow the model to easily learn to attend by relative positions, since for any fixed offset ${\textstyle k}$ , ${\textstyle PE_{pos+k}}$ can be represented as a linear function of ${\textstyle PE_{pos}}$ .
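The sinusoidal encoding can be computed as in the sketch below; the function name and the small table size are assumptions for the example. The final lines check the linearity property numerically: for one sine/cosine pair, shifting the position by `k` is a rotation by the angle `k * w`.

```python
import math


def positional_encoding(max_len, d_model):
    """Return a max_len x d_model table of sinusoidal position encodings."""
    pe = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for i in range(0, d_model, 2):  # i plays the role of 2i in the formulas
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe


pe = positional_encoding(max_len=10, d_model=8)

# PE at pos + k as a linear function of PE at pos, for the pair at dimensions (2, 3).
pos, k, i = 3, 2, 2
w = 1.0 / (10000 ** (i / 8))
shifted_sin = pe[pos][i] * math.cos(k * w) + pe[pos][i + 1] * math.sin(k * w)
shifted_cos = pe[pos][i + 1] * math.cos(k * w) - pe[pos][i] * math.sin(k * w)
```

Because the rotation depends only on the offset `k` and not on `pos`, the same linear map relates every pair of positions that are `k` apart.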

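We also experimented with using learned positional embeddings [9] instead, and found that the two versions produced nearly identical results (see Table 3 row (E)). We chose the sinusoidal version because it may allow the model to extrapolate to sequence lengths longer than the ones encountered during training.

4 Why Self-Attention

In this section we compare various aspects of self-attention layers to the recurrent and convolutional layers commonly used for mapping one variable-length sequence of symbol representations ${\textstyle (x_{1},\ldots,x_{n})}$ to another sequence of equal length ${\textstyle (z_{1},\ldots,z_{n})}$ , with ${\textstyle x_{i},z_{i} \in \mathbb{R}^{d}}$ , such as a hidden layer in a typical sequence transduction encoder or decoder. Motivating our use of self-attention we consider three desiderata.

One is the total computational complexity per layer. Another is the amount of computation that can be parallelized, as measured by the minimum number of sequential operations required.

The third is the path length between long-range dependencies in the network. Learning long-range dependencies is a key challenge in many sequence transduction tasks. One key factor affecting the ability to learn such dependencies is the length of the paths forward and backward signals have to traverse in the network. The shorter these paths between any combination of positions in the input and output sequences, the easier it is to learn long-range dependencies [12]. Hence we also compare the maximum path length between any two input and output positions in networks composed of the different layer types.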

Layer Type	Complexity per Layer	Sequential Operations	Maximum Path Length
Self-Attention	${\textstyle O(n^{2} \cdot d)}$	${\textstyle O(1)}$	${\textstyle O(1)}$
Recurrent	${\textstyle O(n \cdot d^{2})}$	${\textstyle O(n)}$	${\textstyle O(n)}$
Convolutional	${\textstyle O(k \cdot n \cdot d^{2})}$	${\textstyle O(1)}$	${\textstyle O(\log_{k}(n))}$
Self-Attention (restricted)	${\textstyle O(r \cdot n \cdot d)}$	${\textstyle O(1)}$	${\textstyle O(n/r)}$

As noted in Table 1, a self-attention layer connects all positions with a constant number of sequentially executed operations, whereas a recurrent layer requires ${\textstyle O(n)}$ sequential operations. In terms of computational complexity, self-attention layers are faster than recurrent layers when the sequence length ${\textstyle n}$ is smaller than the representation dimensionality ${\textstyle d}$ , which is most often the case with sentence representations used by state-of-the-art models in machine translation, such as word-piece [38] and byte-pair [31] representations. To improve computational performance for tasks involving very long sequences, self-attention could be restricted to considering only a neighborhood of size ${\textstyle r}$ in the input sequence centered around the respective output position. This would increase the maximum path length to ${\textstyle O(n/r)}$ . We plan to investigate this approach further in future work.

A single convolutional layer with kernel width ${\textstyle k < n}$ does not connect all pairs of input and output positions. Doing so requires a stack of ${\textstyle O(n/k)}$ convolutional layers in the case of contiguous kernels, or ${\textstyle O(\log_{k}(n))}$ in the case of dilated convolutions [18], increasing the length of the longest paths between any two positions in the network. Convolutional layers are generally more expensive than recurrent layers, by a factor of ${\textstyle k}$ . Separable convolutions [6], however, decrease the complexity considerably, to ${\textstyle O(k \cdot n \cdot d + n \cdot d^{2})}$ . Even with ${\textstyle k = n}$ , however, the complexity of a separable convolution is equal to the combination of a self-attention layer and a point-wise feed-forward layer, the approach we take in our model.
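To make the comparison concrete, the sketch below evaluates the per-layer complexity expressions for specific sizes. The numbers are only the leading terms with constant factors dropped, so they indicate relative growth rather than measured cost; the function name and the default values of `k` and `r` are assumptions for the example.

```python
def per_layer_cost(n, d, k=3, r=10):
    """Leading-order operation counts for one layer of each type."""
    return {
        "self_attention": n * n * d,
        "recurrent": n * d * d,
        "convolutional": k * n * d * d,
        "restricted_self_attention": r * n * d,
        "separable_convolution": k * n * d + n * d * d,
    }


short = per_layer_cost(n=50, d=512)     # sentence-length sequence, n < d
long_ = per_layer_cost(n=1024, d=512)   # long sequence, n > d
```

For n = 50 and d = 512 the self-attention term is roughly ten times smaller than the recurrent term, while for n = 1024 the ordering reverses, matching the n < d condition stated above.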

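As a side benefit, self-attention could yield more interpretable models. We inspect attention distributions from our models and present and discuss examples in the appendix. Not only do individual attention heads clearly learn to perform different tasks, many appear to exhibit behavior related to the syntactic and semantic structure of the sentences.

5 Training

This section describes the training regime for our models.

5.1 Training Data and Batching

We trained on the standard WMT 2014 English-German dataset consisting of about 4.5 million sentence pairs. Sentences were encoded using byte-pair encoding [3], which has a shared source-target vocabulary of about 37000 tokens. For English-French, we used the significantly larger WMT 2014 English-French dataset consisting of 36M sentences and split tokens into a 32000 word-piece vocabulary [38]. Sentence pairs were batched together by approximate sequence length. Each training batch contained a set of sentence pairs containing approximately 25000 source tokens and 25000 target tokens.

5.2 Hardware and Schedule

We trained our models on one machine with 8 NVIDIA P100 GPUs. For our base models using the hyperparameters described throughout the paper, each training step took about 0.4 seconds. We trained the base models for a total of 100,000 steps or 12 hours. For our big models (described on the bottom line of Table 3), step time was 1.0 seconds. The big models were trained for 300,000 steps (3.5 days).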

5.3 Optimizer

We used the Adam optimizer [20] with ${\textstyle \beta_{1} = 0.9}$ , ${\textstyle \beta_{2} = 0.98}$ and ${\textstyle \epsilon = 10^{-9}}$ . We varied the learning rate over the course of training, according to the formula:

	$lrate = d_{\text{model}}^{-0.5} \cdot \min(step\_num^{-0.5}, step\_num \cdot warmup\_steps^{-1.5})$		(3)

This corresponds to increasing the learning rate linearly for the first ${\textstyle warmup\_steps}$ training steps, and decreasing it thereafter proportionally to the inverse square root of the step number. We used ${\textstyle warmup\_steps = 4000}$ .
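Equation (3) translates directly into a small function. The sketch below uses the base-model value of `d_model` as a default and assumes steps are counted from 1, since the formula is undefined at step 0.

```python
def lrate(step_num, d_model=512, warmup_steps=4000):
    """Learning rate at a given training step (step_num >= 1), following equation (3)."""
    return d_model ** -0.5 * min(step_num ** -0.5, step_num * warmup_steps ** -1.5)


peak = lrate(4000)        # the two arguments of min() coincide at step_num == warmup_steps
halfway = lrate(2000)     # linear warmup: half of the peak
later = lrate(16000)      # inverse-square-root decay: also half of the peak
```

With these settings the rate peaks at roughly 7.0e-4 at step 4000; both branches of the `min` agree there, so the schedule is continuous.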

5.4 Regularization

We employ three types of regularization during training:

Residual Dropout

We apply dropout [33] to the output of each sub-layer, before it is added to the sub-layer input and normalized. In addition, we apply dropout to the sums of the embeddings and the positional encodings in both the encoder and decoder stacks. For the base model, we use a rate of ${\textstyle P_{drop} = 0.1}$ .

Label Smoothing

During training, we employed label smoothing of value ${\textstyle \epsilon_{ls} = 0.1}$ [36]. This hurts perplexity, as the model learns to be more unsure, but improves accuracy and BLEU score.
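Label smoothing can be sketched as below. This assumes the uniform-prior formulation of [36], in which the one-hot target is mixed with a uniform distribution over all classes; actual implementations may distribute the smoothing mass differently, and the function names are hypothetical.

```python
import math


def smooth_labels(target, num_classes, eps=0.1):
    """Mix a one-hot target with a uniform distribution: (1 - eps) * onehot + eps / K."""
    uniform = eps / num_classes
    return [(1.0 - eps) + uniform if k == target else uniform
            for k in range(num_classes)]


def cross_entropy(probs, target_dist):
    """Cross-entropy between a target distribution and predicted probabilities."""
    return -sum(t * math.log(p) for t, p in zip(target_dist, probs) if t > 0)


targets = smooth_labels(target=2, num_classes=4)   # close to [0.025, 0.025, 0.925, 0.025]
confident = [0.001, 0.001, 0.997, 0.001]
hedged = [0.025, 0.025, 0.925, 0.025]
loss_confident = cross_entropy(confident, targets)
loss_hedged = cross_entropy(hedged, targets)
```

Under the smoothed targets, the near-certain prediction incurs a higher loss than the hedged one, which illustrates why the model "learns to be more unsure".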

6 Results

6.1 Machine Translation

Model	BLEU		Training Cost (FLOPs)
	EN-DE	EN-FR	EN-DE	EN-FR
ByteNet [18]	23.75
Deep-Att + PosUnk [39]		39.2		${\textstyle 1.0 \cdot 10^{20}}$
GNMT + RL [38]	24.6	39.92	${\textstyle 2.3 \cdot 10^{19}}$	${\textstyle 1.4 \cdot 10^{20}}$
ConvS2S [9]	25.16	40.46	${\textstyle 9.6 \cdot 10^{18}}$	${\textstyle 1.5 \cdot 10^{20}}$
MoE [32]	26.03	40.56	${\textstyle 2.0 \cdot 10^{19}}$	${\textstyle 1.2 \cdot 10^{20}}$
Deep-Att + PosUnk Ensemble [39]		40.4		${\textstyle 8.0 \cdot 10^{20}}$
GNMT + RL Ensemble [38]	26.30	41.16	${\textstyle 1.8 \cdot 10^{20}}$	${\textstyle 1.1 \cdot 10^{21}}$
ConvS2S Ensemble [9]	26.36	41.29	${\textstyle 7.7 \cdot 10^{19}}$	${\textstyle 1.2 \cdot 10^{21}}$
Transformer (base model)	27.3	38.1	${\textstyle \mathbf{3.3 \cdot 10^{18}}}$
Transformer (big)	28.4	41.8	${\textstyle 2.3 \cdot 10^{19}}$

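On the WMT 2014 English-to-German translation task, the big transformer model (Transformer (big) in Table 2) outperforms the best previously reported models (including ensembles) by more than ${\textstyle 2.0}$ BLEU, establishing a new state-of-the-art BLEU score of ${\textstyle 28.4}$ . The configuration of this model is listed in the bottom line of Table 3. Training took ${\textstyle 3.5}$ days on ${\textstyle 8}$ P100 GPUs. Even our base model surpasses all previously published models and ensembles, at a fraction of the training cost of any of the competitive models.

On the WMT 2014 English-to-French translation task, our big model achieves a BLEU score of ${\textstyle 41.8}$ , outperforming all of the previously published single models, at less than ${\textstyle 1/4}$ the training cost of the previous state-of-the-art model. The Transformer (big) model trained for English-to-French used dropout rate ${\textstyle P_{drop} = 0.1}$ , instead of ${\textstyle 0.3}$ .

For the base models, we used a single model obtained by averaging the last 5 checkpoints, which were written at 10-minute intervals. For the big models, we averaged the last 20 checkpoints. We used beam search with a beam size of ${\textstyle 4}$ and length penalty ${\textstyle \alpha = 0.6}$ [38]. These hyperparameters were chosen after experimentation on the development set. We set the maximum output length during inference to input length + ${\textstyle 50}$ , but terminate early when possible [38].

Table 2 summarizes our results and compares our translation quality and training costs to other model architectures from the literature. We estimate the number of floating point operations used to train a model by multiplying the training time, the number of GPUs used, and an estimate of the sustained single-precision floating-point capacity of each GPU (see footnote 2).

Footnote 2: We used values of 2.8, 3.7, 6.0 and 9.5 TFLOPS for K80, K40, M40 and P100, respectively.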
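The training-cost estimate can be reproduced from figures given in the text: 12 hours for the base model and 3.5 days for the big model, both on 8 P100 GPUs at 9.5 TFLOPS each. The function name below is hypothetical; the arithmetic is the product described above.

```python
def training_flops(hours, num_gpus, tflops_per_gpu):
    """Training time (s) x number of GPUs x sustained FLOP/s per GPU."""
    return hours * 3600 * num_gpus * tflops_per_gpu * 1e12


base_flops = training_flops(hours=12, num_gpus=8, tflops_per_gpu=9.5)
big_flops = training_flops(hours=3.5 * 24, num_gpus=8, tflops_per_gpu=9.5)
```

These come out to about 3.28e18 and 2.30e19, which round to the Transformer entries in the training-cost column of Table 2.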

6.2 Model Variations

	${\textstyle N}$	${\textstyle d_{\text{model}}}$	${\textstyle d_{\text{ff}}}$	${\textstyle h}$	${\textstyle d_{k}}$	${\textstyle d_{v}}$	${\textstyle P_{drop}}$	${\textstyle \epsilon_{ls}}$	train steps	PPL (dev)	BLEU (dev)	params ${\textstyle \times 10^{6}}$
base	6	512	2048	8	64	64	0.1	0.1	100K	4.92	25.8	65
(A)				1	512	512				5.29	24.9
				4	128	128				5.00	25.5
				16	32	32				4.91	25.8
				32	16	16				5.01	25.4
(B)					16					5.16	25.1	58
					32					5.01	25.4	60
(C)	2									6.11	23.7	36
	4									5.19	25.3	50
	8									4.88	25.5	80
		256			32	32				5.75	24.5	28
		1024			128	128				4.66	26.0	168
			1024							5.12	25.4	53
			4096							4.75	26.2	90
(D)							0.0			5.77	24.6
							0.2			4.95	25.5
								0.0		4.67	25.3
								0.2		5.47	25.7
(E)		positional embedding instead of sinusoids								4.92	25.7
big	6	1024	4096	16			0.3		300K	4.33	26.4	213

To evaluate the importance of different components of the Transformer, we varied our base model in different ways, measuring the change in performance on English-to-German translation on the development set, newstest2013. We used beam search as described in the previous section, but no checkpoint averaging. We present these results in Table 3.

In Table 3 rows (A), we vary the number of attention heads and the attention key and value dimensions, keeping the amount of computation constant, as described in Section 3.2.2. While single-head attention is 0.9 BLEU worse than the best setting, quality also drops off with too many heads.

In Table 3 rows (B), we observe that reducing the attention key size ${\textstyle d_{k}}$ hurts model quality. This suggests that determining compatibility is not easy and that a more sophisticated compatibility function than dot product may be beneficial. We further observe in rows (C) and (D) that, as expected, bigger models are better, and dropout is very helpful in avoiding over-fitting. In row (E) we replace our sinusoidal positional encoding with learned positional embeddings [9], and observe nearly identical results to the base model.

6.3 English Constituency Parsing

Parser	Training	WSJ 23 F1
Vinyals & Kaiser et al. (2014) [37]	WSJ only, discriminative	88.3
Petrov et al. (2006) [29]	WSJ only, discriminative	90.4
Zhu et al. (2013) [40]	WSJ only, discriminative	90.4
Dyer et al. (2016) [8]	WSJ only, discriminative	91.7
Transformer (4 layers)	WSJ only, discriminative	91.3
Zhu et al. (2013) [40]	semi-supervised	91.3
Huang & Harper (2009) [14]	semi-supervised	91.3
McClosky et al. (2006) [26]	semi-supervised	92.1
Vinyals & Kaiser et al. (2014) [37]	semi-supervised	92.1
Transformer (4 layers)	semi-supervised	92.7
Luong et al. (2015) [23]	multi-task	93.0
Dyer et al. (2016) [8]	generative	93.3

To evaluate whether the Transformer can generalize to other tasks, we performed experiments on English constituency parsing. This task presents specific challenges: the output is subject to strong structural constraints and is significantly longer than the input. Furthermore, RNN sequence-to-sequence models have not been able to attain state-of-the-art results in small-data regimes [37].

We trained a 4-layer transformer with ${\textstyle d_{model} = 1024}$ on the Wall Street Journal (WSJ) portion of the Penn Treebank [25], about 40K training sentences. We also trained it in a semi-supervised setting, using the larger high-confidence and BerkeleyParser corpora with approximately 17M sentences [37]. We used a vocabulary of 16K tokens for the WSJ only setting and a vocabulary of 32K tokens for the semi-supervised setting.

We performed only a small number of experiments to select the dropout, both attention and residual (section 5.4), learning rates and beam size on the Section 22 development set; all other parameters remained unchanged from the English-to-German base translation model. During inference, we increased the maximum output length to input length + ${\textstyle 300}$ . We used a beam size of ${\textstyle 21}$ and ${\textstyle \alpha = 0.3}$ for both the WSJ only and the semi-supervised setting.

Our results in Table 4 show that despite the lack of task-specific tuning our model performs surprisingly well, yielding better results than all previously reported models with the exception of the Recurrent Neural Network Grammar [8].

In contrast to RNN sequence-to-sequence models [37], the Transformer outperforms the BerkeleyParser [29] even when training only on the WSJ training set of 40K sentences.

7 Conclusion

In this work, we presented the Transformer, the first sequence transduction model based entirely on attention, replacing the recurrent layers most commonly used in encoder-decoder architectures with multi-headed self-attention.

For translation tasks, the Transformer can be trained significantly faster than architectures based on recurrent or convolutional layers. On both WMT 2014 English-to-German and WMT 2014 English-to-French translation tasks, we achieve a new state of the art. In the former task our best model outperforms even all previously reported ensembles.

We are excited about the future of attention-based models and plan to apply them to other tasks. We plan to extend the Transformer to problems involving input and output modalities other than text and to investigate local, restricted attention mechanisms to efficiently handle large inputs and outputs such as images, audio and video. Making generation less sequential is another research goal of ours.

The code we used to train and evaluate our models is available at https://github.com/tensorflow/tensor2tensor.

Acknowledgements

We are grateful to Nal Kalchbrenner and Stephan Gouws for their fruitful comments, corrections and inspiration.

References

[1] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
[2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473, 2014.
[3] Denny Britz, Anna Goldie, Minh-Thang Luong, and Quoc V. Le. Massive exploration of neural machine translation architectures. CoRR, abs/1703.03906, 2017.
[4] Jianpeng Cheng, Li Dong, and Mirella Lapata. Long short-term memory-networks for machine reading. arXiv preprint arXiv:1601.06733, 2016.
[5] Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using rnn encoder-decoder for statistical machine translation. CoRR, abs/1406.1078, 2014.
[6] Francois Chollet. Xception: Deep learning with depthwise separable convolutions. arXiv preprint arXiv:1610.02357, 2016.
[7] Junyoung Chung, Çaglar Gülçehre, Kyunghyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. CoRR, abs/1412.3555, 2014.
[8] Chris Dyer, Adhiguna Kuncoro, Miguel Ballesteros, and Noah A. Smith. Recurrent neural network grammars. In Proc. of NAACL, 2016.
[9] Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. Convolutional sequence to sequence learning. arXiv preprint arXiv:1705.03122v2, 2017.
[10] Alex Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.
[11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[12] Sepp Hochreiter, Yoshua Bengio, Paolo Frasconi, and Jürgen Schmidhuber. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies, 2001.
[13] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
[14] Zhongqiang Huang and Mary Harper. Self-training PCFG grammars with latent annotations across languages. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 832–841. ACL, August 2009.
[15] Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410, 2016.
[16] Łukasz Kaiser and Samy Bengio. Can active memory replace attention? In Advances in Neural Information Processing Systems, (NIPS), 2016.
[17] Łukasz Kaiser and Ilya Sutskever. Neural GPUs learn algorithms. In International Conference on Learning Representations (ICLR), 2016.
[18] Nal Kalchbrenner, Lasse Espeholt, Karen Simonyan, Aaron van den Oord, Alex Graves, and Koray Kavukcuoglu. Neural machine translation in linear time. arXiv preprint arXiv:1610.10099v2, 2017.
[19] Yoon Kim, Carl Denton, Luong Hoang, and Alexander M. Rush. Structured attention networks. In International Conference on Learning Representations, 2017.
[20] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
[21] Oleksii Kuchaiev and Boris Ginsburg. Factorization tricks for LSTM networks. arXiv preprint arXiv:1703.10722, 2017.
[22] Zhouhan Lin, Minwei Feng, Cicero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, and Yoshua Bengio. A structured self-attentive sentence embedding. arXiv preprint arXiv:1703.03130, 2017.
[23] Minh-Thang Luong, Quoc V. Le, Ilya Sutskever, Oriol Vinyals, and Lukasz Kaiser. Multi-task sequence to sequence learning. arXiv preprint arXiv:1511.06114, 2015.
[24] Minh-Thang Luong, Hieu Pham, and Christopher D Manning. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025, 2015.
[25] Mitchell P Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. Building a large annotated corpus of english: The penn treebank. Computational linguistics, 19(2):313–330, 1993.
[26] David McClosky, Eugene Charniak, and Mark Johnson. Effective self-training for parsing. In Proceedings of the Human Language Technology Conference of the NAACL, Main Conference, pages 152–159. ACL, June 2006.
[27] Ankur Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. A decomposable attention model. In Empirical Methods in Natural Language Processing, 2016.
[28] Romain Paulus, Caiming Xiong, and Richard Socher. A deep reinforced model for abstractive summarization. arXiv preprint arXiv:1705.04304, 2017.
[29] Slav Petrov, Leon Barrett, Romain Thibaux, and Dan Klein. Learning accurate, compact, and interpretable tree annotation. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the ACL, pages 433–440. ACL, July 2006.
[30] Ofir Press and Lior Wolf. Using the output embedding to improve language models. arXiv preprint arXiv:1608.05859, 2016.
[31] Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909, 2015.
[32] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017.
[33] Nitish Srivastava, Geoffrey E Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.
[34] Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, and Rob Fergus. End-to-end memory networks. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 2440–2448. Curran Associates, Inc., 2015.
[35] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112, 2014.
[36] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. CoRR, abs/1512.00567, 2015.
[37] Oriol Vinyals, Lukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, and Geoffrey Hinton. Grammar as a foreign language. In Advances in Neural Information Processing Systems, 2015.
[38] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.
[39] Jie Zhou, Ying Cao, Xuguang Wang, Peng Li, and Wei Xu. Deep recurrent models with fast-forward connections for neural machine translation. CoRR, abs/1606.04199, 2016.
[40] Muhua Zhu, Yue Zhang, Wenliang Chen, Min Zhang, and Jingbo Zhu. Fast and accurate shift-reduce constituent parsing. In Proceedings of the 51st Annual Meeting of the ACL (Volume 1: Long Papers), pages 434–443. ACL, August 2013.

Attention Visualizations