Latest revision as of 08:05, 27 April 2026

Searching for Activation Functions

Research Paper
Authors	Prajit Ramachandran; Barret Zoph; Quoc V. Le
Year	2017
Topic area	Machine Learning
Difficulty	Research
arXiv	1710.05941

Prajit Ramachandran, Barret Zoph, Quoc V. Le
Google Brain
{prajit,barretzoph,qvl}@google.com
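Work done as a member of the Google Brain Residency program (g.co/brainresidency).

Abstract

The choice of activation functions in deep networks has a significant effect on the training dynamics and task performance. Currently, the most successful and widely used activation function is the Rectified Linear Unit (ReLU). Although various hand-designed alternatives to ReLU have been proposed, none have managed to replace it due to inconsistent gains. In this work, we propose to leverage automatic search techniques to discover new activation functions. Using a combination of exhaustive and reinforcement learning-based search, we discover multiple novel activation functions. We verify the effectiveness of the searches by conducting an empirical evaluation with the best discovered activation function. Our experiments show that the best discovered activation function, $f(x) = x \cdot \text{sigmoid}(\beta x)$, which we name Swish, tends to work better than ReLU on deeper models across a number of challenging datasets. For example, simply replacing ReLUs with Swish units improves top-1 classification accuracy on ImageNet by 0.9% for Mobile NASNet-A and 0.6% for Inception-ResNet-v2. The simplicity of Swish and its similarity to ReLU make it easy for practitioners to replace ReLUs with Swish units in any neural network.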

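1 Introduction

At the heart of every deep network lies a linear transformation followed by an activation function $f(\cdot)$. The activation function plays a major role in the success of training deep neural networks. Currently, the most successful and widely used activation function is the Rectified Linear Unit (ReLU) (Hahnloser et al., 2000; Jarrett et al., 2009; Nair & Hinton, 2010), defined as $f(x) = \max(x, 0)$. The use of ReLUs was a breakthrough that enabled the fully supervised training of state-of-the-art deep networks (Krizhevsky et al., 2012). Deep networks with ReLUs are more easily optimized than networks with sigmoid or tanh units, because gradients are able to flow when the input to the ReLU function is positive. Thanks to its simplicity and effectiveness, ReLU has become the default activation function used across the deep learning community.

While numerous activation functions have been proposed to replace ReLU (Maas et al., 2013; He et al., 2015; Clevert et al., 2015; Klambauer et al., 2017), none have managed to gain the widespread adoption that ReLU enjoys. Many practitioners have favored the simplicity and reliability of ReLU because the performance improvements of the other activation functions tend to be inconsistent across different models and datasets.

The activation functions proposed to replace ReLU were hand-designed to fit properties deemed to be important. However, the use of search techniques to automate the discovery of traditionally human-designed components has recently been shown to be extremely effective (Zoph & Le, 2016; Bello et al., 2017; Zoph et al., 2017). For example, Zoph et al. (2017) used reinforcement learning-based search to find a replicable convolutional cell that outperforms human-designed architectures on ImageNet.

In this work, we use automated search techniques to discover novel activation functions. We focus on finding new scalar activation functions, which take in a scalar as input and output a scalar, because scalar activation functions can be used to replace the ReLU function without changing the network architecture. Using a combination of exhaustive and reinforcement learning-based search, we find a number of novel activation functions that show promising performance. To further validate the effectiveness of using searches to discover scalar activation functions, we empirically evaluate the best discovered activation function. The best discovered activation function, which we call Swish, is $f(x) = x \cdot \text{sigmoid}(\beta x)$, where $\beta$ is a constant or trainable parameter. Our extensive experiments show that Swish consistently matches or outperforms ReLU on deep networks applied to a variety of challenging domains such as image classification and machine translation. On ImageNet, replacing ReLUs with Swish units improves top-1 classification accuracy by 0.9% on Mobile NASNet-A (Zoph et al., 2017) and 0.6% on Inception-ResNet-v2 (Szegedy et al., 2017). These accuracy gains are significant given that one year of architectural tuning and enlarging yielded a 1.3% accuracy improvement going from Inception V3 (Szegedy et al., 2016) to Inception-ResNet-v2 (Szegedy et al., 2017).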

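2 Methods

In order to utilize search techniques, a search space that contains promising candidate activation functions must be designed. An important challenge in designing search spaces is balancing the size and expressivity of the search space. An overly constrained search space will not contain novel activation functions, whereas a search space that is too large will be difficult to search effectively. To balance the two criteria, we design a simple search space, inspired by the optimizer search space of Bello et al. (2017), that composes unary and binary functions to construct the activation function.

As shown in Figure 1, the activation function is constructed by repeatedly composing the "core unit", which is defined as $b(u_1(x_1), u_2(x_2))$. The core unit takes in two scalar inputs, passes each input independently through a unary function, and combines the two unary outputs with a binary function that outputs a scalar. Since our aim is to find scalar activation functions which transform a single scalar input into a single scalar output, the inputs of the unary functions are restricted to the layer preactivation $x$ and the binary function outputs.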
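The core unit can be illustrated with a few lines of Python. This is only a sketch: the helper names `core_unit` and `scalar_activation` are hypothetical, and feeding the same preactivation $x$ to both inputs of a single unit is an assumption made to keep the example small.

```python
import math
from typing import Callable

Unary = Callable[[float], float]
Binary = Callable[[float, float], float]


def sigmoid(x: float) -> float:
    """Logistic sigmoid, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def core_unit(b: Binary, u1: Unary, u2: Unary) -> Callable[[float, float], float]:
    """Build b(u1(x1), u2(x2)) from one binary and two unary functions."""
    return lambda x1, x2: b(u1(x1), u2(x2))


def scalar_activation(unit: Callable[[float, float], float]) -> Unary:
    """Feed the same preactivation x to both inputs of a single core unit."""
    return lambda x: unit(x, x)


# ReLU: b = max, u1 = identity, u2 = constant zero.
relu = scalar_activation(core_unit(max, lambda x: x, lambda x: 0.0))

# Swish with beta = 1: b = multiplication, u1 = identity, u2 = sigmoid.
swish_1 = scalar_activation(core_unit(lambda a, b: a * b, lambda x: x, sigmoid))
```

Larger activation functions would be obtained by passing the output of one unit into a unary function of the next.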

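Given the search space, the goal of the search algorithm is to find effective choices for the unary and binary functions. The choice of the search algorithm depends on the size of the search space. If the search space is small, such as when using a single core unit, it is possible to exhaustively enumerate the entire search space. If the core unit is repeated multiple times, the search space will be extremely large (i.e., on the order of $10^{12}$ possibilities), making exhaustive search infeasible.

For large search spaces, we use an RNN controller (Zoph & Le, 2016), which is visualized in Figure 2. At each timestep, the controller predicts a single component of the activation function. The prediction is fed back to the controller in the next timestep, and this process is repeated until every component of the activation function is predicted. The predicted string is then used to construct the activation function.

Once a candidate activation function has been generated by the search algorithm, a "child network" with the candidate activation function is trained on some task, such as image classification on CIFAR-10. After training, the validation accuracy of the child network is recorded and used to update the search algorithm. In the case of exhaustive search, a list of the top performing activation functions ordered by validation accuracy is maintained. In the case of the RNN controller, the controller is trained with reinforcement learning to maximize the validation accuracy, where the validation accuracy serves as the reward. This training pushes the controller to generate activation functions that have high validation accuracies.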
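The exhaustive-search bookkeeping described above can be sketched as follows. The function name `exhaustive_search` is hypothetical, and the `evaluate` callback stands in for training a child network; the scores below are placeholders, not results from the paper.

```python
import heapq
from typing import Callable, Iterable, List, Tuple


def exhaustive_search(
    candidates: Iterable[str],
    evaluate: Callable[[str], float],
    top_k: int = 3,
) -> List[Tuple[float, str]]:
    """Score every candidate and keep the top_k by validation accuracy."""
    scored = ((evaluate(name), name) for name in candidates)
    return heapq.nlargest(top_k, scored)


# Placeholder scores for illustration only; these are NOT results from the paper.
PLACEHOLDER_ACCURACY = {"candidate_a": 0.1, "candidate_b": 0.3, "candidate_c": 0.2}

best = exhaustive_search(PLACEHOLDER_ACCURACY, PLACEHOLDER_ACCURACY.get, top_k=2)
```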

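Since evaluating a single activation function requires training a child network, the search is computationally expensive. To decrease the wall clock time required to conduct the search, a distributed training scheme is used to parallelize the training of each child network. In this scheme, the search algorithm proposes a batch of candidate activation functions which are added to a queue. Worker machines pull activation functions off the queue, train a child network, and report back the final validation accuracy of the corresponding activation function. The validation accuracies are aggregated and used to update the search algorithm.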
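A single-machine analogue of this scheme, using a thread pool in place of worker machines, might look like the sketch below. Both `train_child_network` and `evaluate_batch` are hypothetical names, and the returned score is a dummy value rather than a real validation accuracy.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def train_child_network(activation_name: str) -> float:
    """Placeholder for training a child network on one candidate."""
    # Dummy score derived from the name; NOT a real validation accuracy.
    return len(activation_name) / 100.0


def evaluate_batch(batch: List[str], num_workers: int = 4) -> Dict[str, float]:
    """Evaluate a batch of candidates in parallel and aggregate the scores."""
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # pool.map preserves the order of the input batch.
        accuracies = list(pool.map(train_child_network, batch))
    return dict(zip(batch, accuracies))
```

The aggregated dictionary is what a search algorithm would consume to update its ranking or its controller.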

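3 Search Findings

We conduct all our searches with ResNet-20 (He et al., 2016a) as the child network architecture, and train on CIFAR-10 (Krizhevsky & Hinton, 2009) for 10K steps. This constrained environment could potentially skew the results because the top performing activation functions might only perform well for small networks. However, we show in the experiments section that many of the discovered functions generalize to larger models. Exhaustive search is used for small search spaces, while an RNN controller is used for larger search spaces. The RNN controller is trained with Proximal Policy Optimization (Schulman et al., 2017), using the exponential moving average of rewards as a baseline to reduce variance. The full list of unary and binary functions considered is as follows: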
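The baseline mentioned above can be sketched in isolation. The class name, the decay value, and the choice to initialise the baseline with the first reward are all assumptions; the policy update itself (Proximal Policy Optimization) is not shown.

```python
from typing import Optional


class MovingAverageBaseline:
    """Exponential moving average of rewards, used to centre the reward signal."""

    def __init__(self, decay: float = 0.9) -> None:
        self.decay = decay  # assumed value; the source does not state one
        self.value: Optional[float] = None

    def advantage(self, reward: float) -> float:
        """Return reward minus the current baseline, then update the baseline."""
        if self.value is None:
            self.value = reward  # assumed initialisation
        adv = reward - self.value
        self.value = self.decay * self.value + (1.0 - self.decay) * reward
        return adv
```

Subtracting the baseline leaves the expected policy gradient unchanged while reducing its variance, which is the stated purpose in the text.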

• Unary functions: $x$, $-x$, $|x|$, $x^2$, $x^3$, $\sqrt{x}$, $\beta x$, $x + \beta$, $\log(|x| + \epsilon)$, $\exp(x)$, $\sin(x)$, $\cos(x)$, $\sinh(x)$, $\cosh(x)$, $\tanh(x)$, $\sinh^{-1}(x)$, $\tan^{-1}(x)$, $\text{sinc}(x)$, $\max(x, 0)$, $\min(x, 0)$, $\sigma(x)$, $\log(1 + \exp(x))$, $\exp(-x^2)$, $\text{erf}(x)$, $\beta$

• Binary functions: $x_1 + x_2$, $x_1 \cdot x_2$, $x_1 - x_2$, $\frac{x_1}{x_2 + \epsilon}$, $\max(x_1, x_2)$, $\min(x_1, x_2)$, $\sigma(x_1) \cdot x_2$, $\exp(-\beta (x_1 - x_2)^2)$, $\exp(-\beta |x_1 - x_2|)$, $\beta x_1 + (1 - \beta) x_2$

where $\beta$ indicates a per-channel trainable parameter and $\sigma(x) = (1 + \exp(-x))^{-1}$ is the sigmoid function. Different search spaces are created by varying the number of core units used to construct the activation function and varying the unary and binary functions available to the search algorithm.
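To give a feel for the size of the smallest search space, the sketch below enumerates every single-core-unit combination of the functions listed above, represented only by name. It assumes the full lists are available and that both unary inputs are the preactivation $x$; the actual search spaces varied the available functions, and many combinations are functionally identical.

```python
from itertools import product

UNARY = [
    "x", "-x", "|x|", "x^2", "x^3", "sqrt(x)", "beta*x", "x+beta",
    "log(|x|+eps)", "exp(x)", "sin(x)", "cos(x)", "sinh(x)", "cosh(x)",
    "tanh(x)", "asinh(x)", "atan(x)", "sinc(x)", "max(x,0)", "min(x,0)",
    "sigmoid(x)", "log(1+exp(x))", "exp(-x^2)", "erf(x)", "beta",
]

BINARY = [
    "x1 + x2", "x1 * x2", "x1 - x2", "x1 / (x2 + eps)", "max(x1, x2)",
    "min(x1, x2)", "sigmoid(x1) * x2", "exp(-beta * (x1 - x2)^2)",
    "exp(-beta * |x1 - x2|)", "beta * x1 + (1 - beta) * x2",
]


def enumerate_one_core_unit():
    """Yield every (binary, unary_1, unary_2) triple for a single core unit."""
    yield from product(BINARY, UNARY, UNARY)


num_candidates = sum(1 for _ in enumerate_one_core_unit())  # 10 * 25 * 25
```

With 25 unary and 10 binary functions this gives 6,250 triples, small enough to enumerate; stacking several core units multiplies the count quickly, which is why a learned controller is used for the larger spaces.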

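Figure 3 plots the top performing novel activation functions found by the searches. We highlight several noteworthy trends uncovered by the searches:

• Complicated activation functions consistently underperform simpler activation functions, potentially due to an increased difficulty in optimization. The best performing activation functions can be represented by 1 or 2 core units.

• A common structure shared by the top activation functions is the use of the raw preactivation $x$ as input to the final binary function: $b(x, g(x))$. The ReLU function also follows this structure, where $b(x_1, x_2) = \max(x_1, x_2)$ and $g(x) = 0$.

• The searches discovered activation functions that utilize periodic functions, such as $\sin$ and $\cos$. The most common use of periodic functions is through addition or subtraction with the raw preactivation $x$ (or a linearly scaled $x$). The use of periodic functions in activation functions has only been briefly explored in prior work (Parascandolo et al., 2016), so these discovered functions suggest a fruitful route for further research.

• Functions that use division tend to perform poorly because the output explodes when the denominator is near 0. Division is successful only when functions in the denominator are either bounded away from 0, such as $\cosh(x)$, or approach 0 only when the numerator also approaches 0, producing an output of 1.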
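The last trend can be checked numerically. The three functions below are illustrative choices, not functions reported by the searches, and the value of `EPS` is an assumption.

```python
import math

EPS = 1e-7  # assumed small constant, in the role of epsilon in x1 / (x2 + epsilon)


def reciprocal(x: float) -> float:
    """Denominator passes near 0 while the numerator does not: output explodes."""
    return 1.0 / (x + EPS)


def bounded_ratio(x: float) -> float:
    """Denominator cosh(x) >= 1 is bounded away from 0: output stays bounded."""
    return x / math.cosh(x)


def sinc(x: float) -> float:
    """Numerator and denominator approach 0 together: output approaches 1."""
    return 1.0 if x == 0.0 else math.sin(x) / x
```

Evaluating `reciprocal` at a tiny positive input gives a value in the hundreds of thousands, whereas `bounded_ratio` never exceeds 1 in magnitude and `sinc` stays close to 1 near the origin.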

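Since the activation functions were found using a relatively small child network, their performance may not generalize when applied to bigger models. To test the robustness of the top performing novel activation functions to different architectures, we run additional experiments using the preactivation ResNet-164 (RN) (He et al., 2016b), Wide ResNet 28-10 (WRN) (Zagoruyko & Komodakis, 2016), and DenseNet 100-12 (DN) (Huang et al., 2017) models. We implement the 3 models in TensorFlow and replace the ReLU function with each of the top novel activation functions discovered by the searches. We use the same hyperparameters described in each work, such as optimizing using SGD with momentum, and follow previous works by reporting the median of 5 different runs.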

Table 1: CIFAR-10 accuracy (%).
Function	RN	WRN	DN
ReLU [$\max(x, 0)$]	93.8	95.3	94.8
$x \cdot \sigma(\beta x)$	94.5	95.5	94.9
$\max(x, \sigma(x))$	94.3	95.3	94.8
$\cos(x) - x$	94.1	94.8	94.6
$\min(x, \sin(x))$	94.0	95.1	94.4
$(\tan^{-1}(x))^2 - x$	93.9	94.7	94.9
$\max(x, \tanh(x))$	93.9	94.2	94.5
$\text{sinc}(x) + x$	91.5	92.1	92.0
$x \cdot (\sinh^{-1}(x))^2$	85.1	92.1	91.1

Table 2: CIFAR-100 accuracy (%).
Function	RN	WRN	DN
ReLU [$\max(x, 0)$]	74.2	77.8	83.7
$x \cdot \sigma(\beta x)$	75.1	78.0	83.9
$\max(x, \sigma(x))$	74.8	78.6	84.2
$\cos(x) - x$	75.2	76.6	81.8
$\min(x, \sin(x))$	73.4	77.1	74.3
$(\tan^{-1}(x))^2 - x$	75.2	76.7	83.1
$\max(x, \tanh(x))$	74.8	76.0	78.6
$\text{sinc}(x) + x$	66.1	68.3	67.9
$x \cdot (\sinh^{-1}(x))^2$	52.8	70.6	68.1

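The results are shown in Tables 1 and 2. Despite the changes in model architecture, six of the eight activation functions successfully generalize. Of these six activation functions, all match or outperform ReLU on ResNet-164. Furthermore, two of the discovered activation functions, $x \cdot \sigma(\beta x)$ and $\max(x, \sigma(x))$, consistently match or outperform ReLU on all three models.

While these results are promising, it is still unclear whether the discovered activation functions can successfully replace ReLU on challenging real world datasets. In order to validate the effectiveness of the searches, in the rest of this work we focus on empirically evaluating the activation function $f(x) = x \cdot \sigma(\beta x)$, which we call Swish. We choose to extensively evaluate Swish instead of $\max(x, \sigma(x))$ because early experimentation showed better generalization for Swish. In the following sections, we analyze the properties of Swish and then conduct a thorough empirical evaluation comparing Swish, ReLU, and other candidate baseline activation functions on a number of large models across a variety of tasks.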

4 Swish

To recap, Swish is defined as $x \cdot \sigma(\beta x)$, where $\sigma(z) = (1 + \exp(-z))^{-1}$ is the sigmoid function and $\beta$ is either a constant or a trainable parameter. Figure 4 plots the graph of Swish for different values of $\beta$. If $\beta = 1$, Swish is equivalent to the Sigmoid-weighted Linear Unit (SiL) of Elfwing et al. (2017) that was proposed for reinforcement learning. If $\beta = 0$, Swish becomes the scaled linear function $f(x) = \frac{x}{2}$. As $\beta \rightarrow \infty$, the sigmoid component approaches a 0-1 step function, so Swish approaches the ReLU function. This suggests that Swish can be loosely viewed as a smooth function which nonlinearly interpolates between the linear function and the ReLU function; the degree of interpolation can be controlled by the model if $\beta$ is set as a trainable parameter.
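The two limiting cases can be checked directly with a scalar implementation. This is a minimal sketch in plain Python rather than a framework implementation; the overflow-safe form of the sigmoid is an implementation choice, not something the text prescribes.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic sigmoid, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def swish(x: float, beta: float = 1.0) -> float:
    """Swish: f(x) = x * sigmoid(beta * x)."""
    return x * sigmoid(beta * x)


def relu(x: float) -> float:
    return max(x, 0.0)


half_linear = swish(3.0, beta=0.0)   # beta = 0 gives x / 2
near_relu = swish(2.0, beta=50.0)    # large beta approaches ReLU
```

With $\beta = 0$ the sigmoid is constant at 0.5, and with $\beta = 50$ the outputs at $x = \pm 2$ agree with ReLU to well within floating-point tolerance.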

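Like ReLU, Swish is unbounded above and bounded below. Unlike ReLU, Swish is smooth and non-monotonic. In fact, the non-monotonicity of Swish distinguishes it from most common activation functions. The derivative of Swish is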

$\begin{aligned} f'(x) &= \sigma(\beta x) + \beta x \cdot \sigma(\beta x)(1 - \sigma(\beta x)) \\ &= \sigma(\beta x) + \beta x \cdot \sigma(\beta x) - \beta x \cdot \sigma(\beta x)^2 \\ &= \beta x \cdot \sigma(\beta x) + \sigma(\beta x)(1 - \beta x \cdot \sigma(\beta x)) \\ &= \beta f(x) + \sigma(\beta x)(1 - \beta f(x)) \end{aligned}$
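As a sanity check, the closed form in the last line of the derivation can be compared against a central finite difference. The step size `h` and the helper names are assumptions made for this sketch.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic sigmoid, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def swish(x: float, beta: float = 1.0) -> float:
    return x * sigmoid(beta * x)


def swish_grad(x: float, beta: float = 1.0) -> float:
    """Closed form: f'(x) = beta * f(x) + sigmoid(beta * x) * (1 - beta * f(x))."""
    f = swish(x, beta)
    return beta * f + sigmoid(beta * x) * (1.0 - beta * f)


def numeric_grad(x: float, beta: float = 1.0, h: float = 1e-5) -> float:
    """Central finite-difference approximation of f'(x)."""
    return (swish(x + h, beta) - swish(x - h, beta)) / (2.0 * h)
```

The two agree to within the finite-difference error for a range of inputs and values of $\beta$, and the closed form gives exactly 0.5 at $x = 0$.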

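Figure 5 shows the first derivative of Swish for different values of $\beta$. The scale of $\beta$ controls how fast the first derivative asymptotes to 0 and 1. When $\beta = 1$, the derivative has magnitude less than 1 for inputs that are less than around 1.25. Thus, the success of Swish with $\beta = 1$ implies that the gradient preserving property of ReLU (i.e., having a derivative of 1 when $x > 0$) may no longer be a distinct advantage in modern architectures.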
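The input at which the derivative of Swish with $\beta = 1$ first reaches 1 can be located by bisection. The bracket `[0.5, 2.0]` and the iteration count are assumptions chosen for this sketch; the derivative is increasing on that interval, so the crossing is unique there.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic sigmoid, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def swish_1_grad(x: float) -> float:
    """Derivative of x * sigmoid(x)."""
    s = sigmoid(x)
    return s + x * s * (1.0 - s)


def find_unit_slope(lo: float = 0.5, hi: float = 2.0, iters: int = 60) -> float:
    """Bisection for the x in [lo, hi] where the derivative equals 1."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if swish_1_grad(mid) < 1.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


crossing = find_unit_slope()
```

The crossing lands between 1.2 and 1.3, in line with the "around 1.25" figure quoted in the text.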

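The most striking difference between Swish and ReLU is the non-monotonic "bump" of Swish when $x < 0$. As shown in Figure 6, a large percentage of preactivations fall inside the domain of the bump ($-5 \leq x \leq 0$), which indicates that the non-monotonic bump is an important aspect of Swish. The shape of the bump can be controlled by changing the $\beta$ parameter. While fixing $\beta = 1$ is effective in practice, the experiments section shows that training $\beta$ can further improve performance on some models. Figure 7 plots the distribution of trained $\beta$ values from a Mobile NASNet-A model (Zoph et al., 2017). The trained $\beta$ values are spread out between 0 and 1.5 and have a peak at $\beta \approx 1$, suggesting that the model takes advantage of the additional flexibility of trainable $\beta$ parameters.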
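The bump can be located numerically for $\beta = 1$ with a simple grid search over the interval quoted above. The grid spacing of 0.001 is an arbitrary choice for this sketch.

```python
import math


def swish_1(x: float) -> float:
    """Swish with beta = 1; safe here because the grid stays within [-5, 0]."""
    return x / (1.0 + math.exp(-x))


# Grid over the bump's domain, -5 <= x <= 0, in steps of 0.001.
grid = [i / 1000.0 for i in range(-5000, 1)]
x_min = min(grid, key=swish_1)
f_min = swish_1(x_min)
```

The minimum sits a little below $x = -1.2$ with a value between $-0.3$ and $-0.25$, so the function decreases from 0 to that point and then rises back toward 0 as $x \rightarrow -\infty$, which is the non-monotonic behaviour described in the text.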

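Practically, Swish can be implemented with a single line code change in most deep learning libraries, such as TensorFlow (Abadi et al., 2016): for example, x * tf.sigmoid(beta * x), or tf.nn.swish(x) in TensorFlow versions released after the submission of this work. As a cautionary note, if BatchNorm (Ioffe & Szegedy, 2015) is used, the scale parameter should be set. Some high level libraries turn off the scale parameter by default due to the ReLU function being piecewise linear, but this setting is incorrect for Swish. For training Swish networks, we found that slightly lowering the learning rate used to train ReLU networks works well.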
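When $\beta$ is trainable, a framework's automatic differentiation would normally supply the gradient with respect to $\beta$. For illustration only, the sketch below writes that gradient out by hand, $\partial f / \partial \beta = x^2 \sigma(\beta x)(1 - \sigma(\beta x))$, and applies a per-channel $\beta$ to a list of scalars. The function names and the one-value-per-channel layout are assumptions.

```python
import math
from typing import List


def sigmoid(z: float) -> float:
    """Logistic sigmoid, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def swish(x: float, beta: float) -> float:
    return x * sigmoid(beta * x)


def swish_grad_beta(x: float, beta: float) -> float:
    """Partial derivative of Swish with respect to beta."""
    s = sigmoid(beta * x)
    return x * x * s * (1.0 - s)


def swish_per_channel(values: List[float], betas: List[float]) -> List[float]:
    """Apply Swish with a separate beta for each channel (one value per channel)."""
    if len(values) != len(betas):
        raise ValueError("values and betas must have the same length")
    return [swish(x, b) for x, b in zip(values, betas)]
```

Note that the gradient with respect to $\beta$ vanishes at $x = 0$, so preactivations near zero contribute little to learning $\beta$.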

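5 Experiments with Swish

We benchmark Swish against ReLU and a number of recently proposed activation functions on challenging datasets, and find that Swish matches or exceeds the baselines on nearly all tasks. The following sections describe our experimental settings and results in greater detail. As a summary, Table 3 shows Swish in comparison to each baseline activation function we considered (which are defined in the next section). The results in Table 3 are aggregated by comparing the performance of Swish to the performance of different activation functions applied to a variety of models, such as Inception ResNet-v2 (Szegedy et al., 2017) and Transformer (Vaswani et al., 2017), across multiple datasets, such as CIFAR, ImageNet, and English→German translation.¹ The improvement of Swish over other activation functions is statistically significant under a one-sided paired sign test.

¹ To avoid skewing the comparison, each model type is compared just once. A model with multiple results is represented by the median of its results. Specifically, the models with aggregated results are (a) ResNet-164, Wide ResNet 28-10, and DenseNet 100-12 across the CIFAR-10 and CIFAR-100 results, (b) Mobile NASNet-A and Inception-ResNet-v2 across the 3 runs, and (c) the WMT Transformer model across the 4 newstest results.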

Table 3: Number of models on which Swish outperforms, is equivalent to, or underperforms each baseline activation function.
Baselines	ReLU	LReLU	PReLU	Softplus	ELU	SELU	GELU
Swish $>$ Baseline	9	7	6	6	8	8	8
Swish $=$ Baseline	0	1	3	2	0	1	1
Swish $<$ Baseline	0	1	0	1	1	0	0
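For readers unfamiliar with the sign test, the sketch below shows one common way to compute a one-sided p-value from win and loss counts. Discarding ties is an assumption of this sketch; the text does not say how ties were handled, so this is not necessarily the exact procedure behind the reported significance.

```python
from math import comb


def one_sided_sign_test(wins: int, losses: int) -> float:
    """P(at least `wins` successes in wins + losses fair coin flips), ties dropped."""
    n = wins + losses
    if n == 0:
        return 1.0
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n


# Swish vs. ReLU column of the table: 9 wins, 0 ties, 0 losses.
p_relu = one_sided_sign_test(wins=9, losses=0)
```

For the ReLU column this evaluates to $2^{-9} = 1/512$, i.e. the probability of nine wins out of nine under the null hypothesis that either activation function is equally likely to come out ahead.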

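5.1 Experimental Set Up

We compare Swish against several additional baseline activation functions on a variety of models and datasets. Since many activation functions have been proposed, we choose the most common activation functions to compare against, and follow the guidelines laid out in each corresponding paper: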

• Leaky ReLU (LReLU) (Maas et al., 2013):

$f(x) = \begin{cases} x & \text{if } x \geq 0 \\ \alpha x & \text{if } x < 0 \end{cases}$

where $\alpha = 0.01$. LReLU enables a small amount of information to flow when $x < 0$.

• Parametric ReLU (PReLU) (He et al., 2015): The same form as LReLU, but $\alpha$ is a learnable parameter. Each channel has a shared $\alpha$ which is initialized to 0.25.

• Softplus (Nair & Hinton, 2010): $f(x) = \log(1 + \exp(x))$. Softplus is a smooth function with properties similar to Swish, but is strictly positive and monotonic. It can be viewed as a smooth version of ReLU.

• Exponential Linear Unit (ELU) (Clevert et al., 2015):

$f(x) = \begin{cases} x & \text{if } x \geq 0 \\ \alpha (\exp(x) - 1) & \text{if } x < 0 \end{cases}$

where $\alpha = 1.0$.

• Scaled Exponential Linear Unit (SELU) (Klambauer et al., 2017):

$f(x) = \lambda \begin{cases} x & \text{if } x \geq 0 \\ \alpha (\exp(x) - 1) & \text{if } x < 0 \end{cases}$

where $\alpha \approx 1.6733$ and $\lambda \approx 1.0507$.

• Gaussian Error Linear Unit (GELU) (Hendrycks & Gimpel, 2016): $f(x) = x \cdot \Phi(x)$, where $\Phi(x)$ is the cumulative distribution function of the standard normal distribution. GELU is a non-monotonic function that has a shape similar to Swish with $\beta = 1.4$.
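The baseline definitions above translate directly into scalar Python functions. This is a sketch for reference only, using the constants quoted in the list; PReLU is shown with its initial $\alpha = 0.25$ held fixed, since learning $\alpha$ requires a training loop that is out of scope here. The overflow-safe form of Softplus is an implementation choice.

```python
import math


def leaky_relu(x: float, alpha: float = 0.01) -> float:
    return x if x >= 0 else alpha * x


def prelu(x: float, alpha: float = 0.25) -> float:
    """Same form as LReLU; alpha would be learned per channel during training."""
    return x if x >= 0 else alpha * x


def softplus(x: float) -> float:
    """log(1 + exp(x)), rearranged so that exp never overflows."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def elu(x: float, alpha: float = 1.0) -> float:
    return x if x >= 0 else alpha * math.expm1(x)


def selu(x: float, alpha: float = 1.6733, lam: float = 1.0507) -> float:
    return lam * elu(x, alpha)


def gelu(x: float) -> float:
    """x * Phi(x), with Phi the standard normal CDF written via erf."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
```

All six functions reduce to (approximately) the identity for large positive inputs and differ mainly in how they treat negative inputs, which is also where Swish differs from ReLU.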

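We evaluate both Swish with a trainable $\beta$ and Swish with a fixed $\beta = 1$ (which for simplicity we call Swish-1, but it is equivalent to the Sigmoid-weighted Linear Unit of Elfwing et al. (2017)). Note that our results may not be directly comparable to the results in the corresponding works due to differences in our training setup.

5.2 CIFAR

We first compare Swish to all the baseline activation functions on the CIFAR-10 and CIFAR-100 datasets (Krizhevsky & Hinton, 2009). We follow the same set up used when comparing the activation functions discovered by the search techniques, and compare the median of 5 runs with the preactivation ResNet-164 (He et al., 2016b), Wide ResNet 28-10 (WRN) (Zagoruyko & Komodakis, 2016), and DenseNet 100-12 (Huang et al., 2017) models.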

Table 4: CIFAR-10 accuracy (%).
Model	ResNet	WRN	DenseNet
LReLU	94.2	95.6	94.7
PReLU	94.1	95.1	94.5
Softplus	94.6	94.9	94.7
ELU	94.1	94.1	94.4
SELU	93.0	93.2	93.9
GELU	94.3	95.5	94.8
ReLU	93.8	95.3	94.8
Swish-1	94.7	95.5	94.8
Swish	94.5	95.5	94.8

Table 5: CIFAR-100 accuracy (%).
Model	ResNet	WRN	DenseNet
LReLU	74.2	78.0	83.3
PReLU	74.5	77.3	81.5
Softplus	76.0	78.4	83.7
ELU	75.0	76.0	80.6
SELU	73.2	74.3	80.8
GELU	74.7	78.0	83.8
ReLU	74.2	77.8	83.7
Swish-1	75.1	78.5	83.8
Swish	75.1	78.0	83.9

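The results in Tables 4 and 5 show how Swish and Swish-1 consistently match or outperform ReLU on every model for both CIFAR-10 and CIFAR-100. Swish also matches or exceeds the best baseline performance on almost every model. Importantly, the "best baseline" changes between different models, which demonstrates the stability of Swish in matching these varying baselines. Softplus, which is smooth and approaches zero on one side, similar to Swish, also has strong performance.

5.3 ImageNet

Next, we benchmark Swish against the baseline activation functions on the ImageNet 2012 classification dataset (Russakovsky et al., 2015). ImageNet is widely considered one of the most important image classification datasets, consisting of 1,000 classes and 1.28 million training images. We evaluate on the validation dataset, which has 50,000 images.

We compare all the activation functions on a variety of architectures designed for ImageNet: Inception-ResNet-v2, Inception-v4, Inception-v3 (Szegedy et al., 2017), MobileNet (Howard et al., 2017), and Mobile NASNet-A (Zoph et al., 2017). All these architectures were designed with ReLUs. We again replace the ReLU activation function with different activation functions and train for a fixed number of steps, determined by the convergence of the ReLU baseline. For each activation function, we try 3 different learning rates with RMSProp (Tieleman & Hinton, 2012) and pick the best.² All networks are initialized with He initialization (He et al., 2015).³ To verify that the performance differences are reproducible, we run the Inception-ResNet-v2 and Mobile NASNet-A experiments 3 times with the best learning rate from the first experiment. We plot the learning curves for Mobile NASNet-A in Figure 8.

² For some of the models with ELU, SELU, and PReLU, we train with an additional 3 learning rates (so a total of 6 learning rates) because the original 3 learning rates did not converge.

³ For SELU, we tried both He initialization and the initialization recommended in Klambauer et al. (2017), and choose the best result for each model separately.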

Table 6: Mobile NASNet-A on ImageNet, 3 runs ordered by top-1 accuracy.
Model	Top-1 Acc. (%)			Top-5 Acc. (%)
LReLU	73.8	73.9	74.2	91.6	91.9	91.9
PReLU	74.6	74.7	74.7	92.4	92.3	92.3
Softplus	74.0	74.2	74.2	91.6	91.8	91.9
ELU	74.1	74.2	74.2	91.8	91.8	91.8
SELU	73.6	73.7	73.7	91.6	91.7	91.7
GELU	74.6	-	-	92.0	-	-
ReLU	73.5	73.6	73.8	91.4	91.5	91.6
Swish-1	74.6	74.7	74.7	92.1	92.0	92.0
Swish	74.9	74.9	75.2	92.3	92.4	92.4

Table 7: Inception-ResNet-v2 on ImageNet, 3 runs ordered by top-1 accuracy.
Model	Top-1 Acc. (%)			Top-5 Acc. (%)
LReLU	79.5	79.5	79.6	94.7	94.7	94.7
PReLU	79.7	79.8	80.1	94.8	94.9	94.9
Softplus	80.1	80.2	80.4	95.2	95.2	95.3
ELU	75.8	79.9	80.0	92.6	95.0	95.1
SELU	79.0	79.2	79.2	94.5	94.4	94.5
GELU	79.6	79.6	79.9	94.8	94.8	94.9
ReLU	79.5	79.6	79.8	94.8	94.8	94.8
Swish-1	80.2	80.3	80.4	95.1	95.2	95.2
Swish	80.2	80.2	80.3	95.0	95.2	95.0

Table 8: MobileNet on ImageNet.
Model	Top-1 Acc. (%)	Top-5 Acc. (%)
LReLU	72.5	91.0
PReLU	74.2	91.9
Softplus	73.6	91.6
ELU	73.9	91.3
SELU	73.2	91.0
GELU	73.5	91.4
ReLU	72.0	90.8
Swish-1	74.2	91.6
Swish	74.2	91.7

Table 9: Inception-v3 on ImageNet.
Model	Top-1 Acc. (%)	Top-5 Acc. (%)
LReLU	78.4	94.1
PReLU	77.7	93.5
Softplus	78.7	94.4
ELU	77.9	93.7
SELU	76.7	92.8
GELU	77.7	93.9
ReLU	78.4	94.2
Swish-1	78.7	94.2
Swish	78.7	94.0

Table 10: Inception-v4 on ImageNet.
Model	Top-1 Acc. (%)	Top-5 Acc. (%)
LReLU	79.3	94.7
PReLU	79.3	94.4
Softplus	79.6	94.8
ELU	79.5	94.5
SELU	78.3	94.5
GELU	79.0	94.6
ReLU	79.2	94.6
Swish-1	79.3	94.7
Swish	79.3	94.6

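The results in Tables 6-10 show strong performance for Swish. On Inception-ResNet-v2, Swish outperforms ReLU by a nontrivial 0.5%. Swish performs especially well on mobile sized models, with a 1.4% boost on Mobile NASNet-A and a 2.2% boost on MobileNet over ReLU. Swish also matches or exceeds the best performing baseline on most models, where again, the best performing baseline differs depending on the model. Softplus achieves accuracies comparable to Swish on the larger models, but performs worse on both mobile sized models. For Inception-v4, the gains from switching between activation functions are more limited, and Swish slightly underperforms Softplus and ELU. In general, the results suggest that switching to Swish improves performance with little additional tuning.

5.4 Machine Translation

We additionally benchmark Swish on the domain of machine translation. We train machine translation models on the standard WMT 2014 English→German dataset, which has 4.5 million training sentences, and evaluate on 4 different newstest sets using the standard BLEU metric. We use the attention based Transformer (Vaswani et al., 2017) model, which utilizes ReLUs in a 2-layered feedforward network between each attention layer. We train a 12 layer "Base Transformer" model with 2 different learning rates⁴ for 300K steps, but otherwise use the same hyperparameters as in the original work, such as using Adam (Kingma & Ba, 2015) to optimize.

⁴ We tried an additional learning rate for Softplus, but found it did not work well across all learning rates.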

Table 11: BLEU score of a 12 layer Transformer on WMT English→German.
Model	newstest2013	newstest2014	newstest2015	newstest2016
LReLU	26.2	27.9	29.8	33.4
PReLU	26.3	27.7	29.7	33.1
Softplus	23.4	23.6	25.8	29.2
ELU	24.6	25.1	27.7	32.5
SELU	23.7	23.5	25.9	30.5
GELU	25.9	27.3	29.5	33.1
ReLU	26.1	27.8	29.8	33.3
Swish-1	26.2	28.0	30.1	34.0
Swish	26.5	27.6	30.0	33.1

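Table 11 shows that Swish outperforms or matches the other baselines on machine translation. Swish-1 does especially well on newstest2016, exceeding the next best performing baseline by 0.6 BLEU points. The worst performing baseline function is Softplus, demonstrating inconsistency in performance across differing domains. By contrast, Swish consistently performs well across multiple domains.

6 Related Work

Swish was found using a variety of automated search techniques. Search techniques have been utilized in other works to discover convolutional and recurrent architectures (Zoph & Le, 2016; Zoph et al., 2017; Real et al., 2017; Cai et al., 2017; Zhong et al., 2017) and optimizers (Bello et al., 2017). The use of search techniques to discover traditionally hand-designed components is an instance of the recently revived subfield of meta-learning (Schmidhuber, 1987; Naik & Mammone, 1992; Thrun & Pratt, 2012). Meta-learning has been used to find initializations for one-shot learning (Finn et al., 2017; Ravi & Larochelle, 2016), adaptable reinforcement learning (Wang et al., 2016; Duan et al., 2016), and generating model parameters (Ha et al., 2016). Meta-learning is powerful because the flexibility derived from the minimal assumptions encoded leads to empirically effective solutions. We take advantage of this property in order to find scalar activation functions, such as Swish, that have strong empirical performance.

While this work focuses on scalar activation functions, which transform one scalar to another scalar, there are many types of activation functions used in deep networks. Many-to-one functions, like max pooling, maxout (Goodfellow et al., 2013), and gating (Hochreiter & Schmidhuber, 1997; Srivastava et al., 2015; van den Oord et al., 2016; Dauphin et al., 2016; Wu et al., 2016; Miech et al., 2017), derive their power from combining multiple sources in a nonlinear way. One-to-many functions, like Concatenated ReLU (Shang et al., 2016), improve performance by applying multiple nonlinear functions to a single input. Finally, many-to-many functions, like BatchNorm (Ioffe & Szegedy, 2015) and LayerNorm (Ba et al., 2016), induce powerful nonlinear relationships between their inputs.

Most prior work has focused on proposing new activation functions (Maas et al., 2013; Agostinelli et al., 2014; He et al., 2015; Clevert et al., 2015; Hendrycks & Gimpel, 2016; Klambauer et al., 2017; Qiu & Cai, 2017; Zhou et al., 2017; Elfwing et al., 2017), but few studies, such as Xu et al. (2015), have systematically compared different activation functions. To the best of our knowledge, this is the first study to compare scalar activation functions across multiple challenging datasets.

Our study shows that Swish consistently outperforms ReLU on deep models. The strong performance of Swish challenges conventional wisdom about ReLU. Hypotheses about the importance of the gradient preserving property of ReLU seem unnecessary when residual connections (He et al., 2016a) enable the optimization of very deep networks. A similar insight can be found in the fully attentional Transformer (Vaswani et al., 2017), where the intricately constructed LSTM cell (Hochreiter & Schmidhuber, 1997) is no longer necessary when constant-length attentional connections are used. Architectural improvements lessen the need for individual components to preserve gradients.

7 Conclusion

In this work, we utilized automatic search techniques to discover novel activation functions that have strong empirical performance. We then empirically validated the best discovered activation function, which we call Swish and which is defined as $f(x) = x \cdot \text{sigmoid}(\beta x)$. Our experiments used models and hyperparameters that were designed for ReLU and just replaced the ReLU activation function with Swish; even this simple, suboptimal procedure resulted in Swish consistently outperforming ReLU and other activation functions. We expect additional gains to be made when these models and hyperparameters are specifically designed with Swish in mind. The simplicity of Swish and its similarity to ReLU mean that replacing ReLUs in any network is just a simple one line code change.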

Acknowledgements

We thank Esteban Real, Geoffrey Hinton, Irwan Bello, Jascha Sohl-Dickstein, Jon Shlens, Kathryn Rough, Mohammad Norouzi, Navdeep Jaitly, Niki Parmar, Sam Smith, Simon Kornblith, Vijay Vasudevan, and the Google Brain team for help with this project.

References

Abadi et al. (2016) Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. Tensorflow: A system for large-scale machine learning. In USENIX Symposium on Operating Systems Design and Implementation, volume 16, pp. 265–283, 2016.
Agostinelli et al. (2014) Forest Agostinelli, Matthew Hoffman, Peter Sadowski, and Pierre Baldi. Learning activation functions to improve deep neural networks. arXiv preprint arXiv:1412.6830, 2014.
Ba et al. (2016) Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. In Advances in Neural Information Processing Systems, 2016.
Bello et al. (2017) Irwan Bello, Barret Zoph, Vijay Vasudevan, and Quoc V Le. Neural optimizer search with reinforcement learning. In International Conference on Machine Learning, pp. 459–468, 2017.
Cai et al. (2017) Han Cai, Tianyao Chen, Weinan Zhang, Yong Yu, and Jun Wang. Reinforcement learning for architecture search by network transformation. arXiv preprint arXiv:1707.04873, 2017.
Clevert et al. (2015) Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast and accurate deep network learning by exponential linear units (elus). arXiv preprint arXiv:1511.07289, 2015.
Dauphin et al. (2016) Yann N Dauphin, Angela Fan, Michael Auli, and David Grangier. Language modeling with gated convolutional networks. arXiv preprint arXiv:1612.08083, 2016.
Duan et al. (2016) Yan Duan, John Schulman, Xi Chen, Peter L Bartlett, Ilya Sutskever, and Pieter Abbeel. Rl2: Fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779, 2016.
Elfwing et al. (2017) Stefan Elfwing, Eiji Uchibe, and Kenji Doya. Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. arXiv preprint arXiv:1702.03118, 2017.
Finn et al. (2017) Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. arXiv preprint arXiv:1703.03400, 2017.
Goodfellow et al. (2013) Ian J Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron Courville, and Yoshua Bengio. Maxout networks. In International Conference on Machine Learning, 2013.
Ha et al. (2016) David Ha, Andrew Dai, and Quoc V Le. Hypernetworks. arXiv preprint arXiv:1609.09106, 2016.
Hahnloser et al. (2000) Richard HR Hahnloser, Rahul Sarpeshkar, Misha A Mahowald, Rodney J Douglas, and H Sebastian Seung. Digital selection and analogue amplification coexist in a cortex-inspired silicon circuit. Nature, 405(6789):947, 2000.
He et al. (2015) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pp. 1026–1034, 2015.
He et al. (2016a) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016a.
He et al. (2016b) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In European Conference on Computer Vision, pp. 630–645. Springer, 2016b.
Hendrycks & Gimpel (2016) Dan Hendrycks and Kevin Gimpel. Bridging nonlinearities and stochastic regularizers with gaussian error linear units. arXiv preprint arXiv:1606.08415, 2016.
Hochreiter & Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
Howard et al. (2017) Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
Huang et al. (2017) Gao Huang, Zhuang Liu, Kilian Q Weinberger, and Laurens van der Maaten. Densely connected convolutional networks. In Conference on Computer Vision and Pattern Recognition, 2017.
Ioffe & Szegedy (2015) Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pp. 448–456, 2015.
Jarrett et al. (2009) Kevin Jarrett, Koray Kavukcuoglu, Yann LeCun, et al. What is the best multi-stage architecture for object recognition? In 2009 IEEE 12th International Conference on Computer Vision, 2009.
Kingma & Ba (2015) Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
Klambauer et al. (2017) Günter Klambauer, Thomas Unterthiner, Andreas Mayr, and Sepp Hochreiter. Self-normalizing neural networks. arXiv preprint arXiv:1706.02515, 2017.
Krizhevsky & Hinton (2009) Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105, 2012.
Maas et al. (2013) Andrew L Maas, Awni Y Hannun, and Andrew Y Ng. Rectifier nonlinearities improve neural network acoustic models. In International Conference on Machine Learning, volume 30, 2013.
Miech et al. (2017) Antoine Miech, Ivan Laptev, and Josef Sivic. Learnable pooling with context gating for video classification. arXiv preprint arXiv:1706.06905, 2017.
Naik & Mammone (1992) Devang K Naik and RJ Mammone. Meta-neural networks that learn by learning. In Neural Networks, 1992. IJCNN., International Joint Conference on, volume 1, pp. 437–442. IEEE, 1992.
Nair & Hinton (2010) Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted boltzmann machines. In International Conference on Machine Learning, 2010.
Parascandolo et al. (2016) Giambattista Parascandolo, Heikki Huttunen, and Tuomas Virtanen. Taming the waves: sine as activation function in deep neural networks. 2016.
Qiu & Cai (2017) Suo Qiu and Bolun Cai. Flexible rectified linear units for improving convolutional neural networks. arXiv preprint arXiv:1706.08098, 2017.
Ravi & Larochelle (2016) Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. 2016.
Real et al. (2017) Esteban Real, Sherry Moore, Andrew Selle, Saurabh Saxena, Yutaka Leon Suematsu, Quoc Le, and Alex Kurakin. Large-scale evolution of image classifiers. arXiv preprint arXiv:1703.01041, 2017.
Russakovsky et al. (2015) Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
Schmidhuber (1987) Jurgen Schmidhuber. Evolutionary principles in self-referential learning. (On learning how to learn: The meta-meta-… hook.) Diploma thesis, Institut f. Informatik, Tech. Univ. Munich, 1987.
Schulman et al. (2017) John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
Shang et al. (2016) Wenling Shang, Kihyuk Sohn, Diogo Almeida, and Honglak Lee. Understanding and improving convolutional neural networks via concatenated rectified linear units. In International Conference on Machine Learning, pp. 2217–2225, 2016.
Srivastava et al. (2015) Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. Highway networks. arXiv preprint arXiv:1505.00387, 2015.
Szegedy et al. (2016) Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
Szegedy et al. (2017) Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi. Inception-v4, inception-resnet and the impact of residual connections on learning. In AAAI, pp. 4278–4284, 2017.
Thrun & Pratt (2012) Sebastian Thrun and Lorien Pratt. Learning to learn. Springer Science & Business Media, 2012.
Tieleman & Hinton (2012) Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, 4(2):26–31, 2012.
van den Oord et al. (2016) Aaron van den Oord, Nal Kalchbrenner, Lasse Espeholt, Oriol Vinyals, Alex Graves, et al. Conditional image generation with pixelcnn decoders. In Advances in Neural Information Processing Systems, pp. 4790–4798, 2016.
Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, 2017.
Wang et al. (2016) Jane X Wang, Zeb Kurth-Nelson, Dhruva Tirumala, Hubert Soyer, Joel Z Leibo, Remi Munos, Charles Blundell, Dharshan Kumaran, and Matt Botvinick. Learning to reinforcement learn. arXiv preprint arXiv:1611.05763, 2016.
Wu et al. (2016) Yuhuai Wu, Saizheng Zhang, Ying Zhang, Yoshua Bengio, and Ruslan R Salakhutdinov. On multiplicative integration with recurrent neural networks. In Advances in Neural Information Processing Systems, pp. 2856–2864, 2016.
Xu et al. (2015) Bing Xu, Naiyan Wang, Tianqi Chen, and Mu Li. Empirical evaluation of rectified activations in convolutional network. arXiv preprint arXiv:1505.00853, 2015.
Zagoruyko & Komodakis (2016) Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In British Machine Vision Conference, 2016.
Zhong et al. (2017) Zhao Zhong, Junjie Yan, and Cheng-Lin Liu. Practical network blocks design with q-learning. arXiv preprint arXiv:1708.05552, 2017.
Zhou et al. (2017) Guorui Zhou, Chengru Song, Xiaoqiang Zhu, Xiao Ma, Yanghui Yan, Xingya Dai, Han Zhu, Junqi Jin, Han Li, and Kun Gai. Deep interest network for click-through rate prediction. arXiv preprint arXiv:1706.06978, 2017.
Zoph & Le (2016) Barret Zoph and Quoc V Le. Neural architecture search with reinforcement learning. In International Conference on Learning Representations, 2016.
Zoph et al. (2017) Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V Le. Learning transferable architectures for scalable image recognition. arXiv preprint arXiv:1707.07012, 2017.
