Searching for Activation Functions

    Research Paper
    Authors: Prajit Ramachandran; Barret Zoph; Quoc V. Le
    Year: 2017
    Topic area: Machine Learning
    Difficulty: Research
    arXiv: 1710.05941

    Searching for Activation Functions is a 2017 paper by Prajit Ramachandran, Barret Zoph, and Quoc V. Le of Google Brain that uses automated search to discover scalar activation functions for deep neural networks. The search yields a family of simple, non-monotonic functions, of which the authors highlight one — Swish, defined as $ f(x) = x \cdot \sigma(\beta x) $ — and show that it consistently matches or outperforms ReLU on deep models across image classification and machine translation benchmarks. The paper was presented at the ICLR 2018 workshop track.

    Overview

    Activation functions sit at the heart of every deep network and have a large effect on optimization and generalization. Despite a long line of hand-designed alternatives — Leaky ReLU, PReLU, ELU, SELU, GELU, Softplus — the ReLU $ f(x) = \max(x, 0) $ remained the de facto default because competing functions tended to give inconsistent gains across models and datasets.

    Rather than designing yet another activation function by hand, the authors apply automated search over a compositional space of unary and binary primitives. The best discovered function, which they name Swish, is structurally close to ReLU but smooth and non-monotonic. The authors show that simply replacing ReLUs with Swish improves accuracy on a wide variety of strong existing architectures with minimal hyperparameter tuning, and that the result is robust enough to generalize from the small CIFAR-10 child networks used during search to large ImageNet- and translation-scale models.

    Key Contributions

    • A compositional search space for scalar activation functions, built from a small library of unary functions (e.g. $ x $, $ x^2 $, $ \sigma(x) $, $ \tanh(x) $, $ \sin(x) $) and binary functions (e.g. $ x_1 + x_2 $, $ x_1 \cdot x_2 $, $ \max $, $ \sigma(x_1)\cdot x_2 $).
    • A search procedure combining exhaustive enumeration for small spaces with an RL-trained RNN controller (using PPO) for spaces too large to enumerate.
    • The discovery and detailed analysis of Swish, $ f(x) = x \cdot \sigma(\beta x) $, where $ \beta $ is a constant or a per-channel trainable parameter.
    • Extensive empirical comparison against seven baseline activation functions (ReLU, LReLU, PReLU, Softplus, ELU, SELU, GELU) on CIFAR-10/100, ImageNet, and WMT 2014 English→German translation.

    Methods

    The search space treats activation functions as repeated applications of a "core unit" of the form $ b(u_1(x_1), u_2(x_2)) $, where $ u_1, u_2 $ are unary functions, $ b $ is a binary function, and the inputs $ x_1, x_2 $ are either the layer preactivation $ x $ or the output of an earlier core unit. Different search spaces are constructed by varying the number of core units and the available primitives.
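
    In code, the core unit is straightforward to represent. Below is a minimal sketch using an illustrative subset of the primitive library (the dictionary names and the core_unit helper are inventions of this summary, not the paper's code):

        import numpy as np

        # Illustrative subset of the paper's unary and binary primitive libraries.
        UNARY = {
            "id":      lambda x: x,
            "square":  lambda x: x ** 2,
            "sigmoid": lambda x: 1.0 / (1.0 + np.exp(-x)),
            "tanh":    np.tanh,
            "sin":     np.sin,
        }
        BINARY = {
            "add": lambda a, b: a + b,
            "mul": lambda a, b: a * b,
            "max": np.maximum,
        }

        def core_unit(u1, u2, b):
            """Build the candidate b(u1(x), u2(x)), with both inputs tied to the preactivation x."""
            return lambda x: BINARY[b](UNARY[u1](x), UNARY[u2](x))

        # Swish-shaped candidate: x * sigmoid(x), i.e. b = mul, u1 = id, u2 = sigmoid.
        candidate = core_unit("id", "sigmoid", "mul")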

    Candidate activation functions are evaluated by training a small "child network" — a ResNet-20 on CIFAR-10 for 10K steps — and reporting validation accuracy. For small spaces the authors enumerate exhaustively; for spaces of order $ 10^{12} $ they train an RNN controller with reinforcement learning to maximize validation accuracy, using an exponential moving average of rewards as a baseline. Search is parallelized across worker machines that pull candidate activation functions from a queue, train a child network, and report the final validation accuracy back to the search algorithm.
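
    In outline, the enumeration side of the search might look like the sketch below, continuing the previous sketch (UNARY, BINARY, and core_unit in scope). Here evaluate_child_network is a hypothetical stand-in for the actual proxy task of training the ResNet-20 for 10K steps:

        import itertools

        def evaluate_child_network(fn):
            # Hypothetical stand-in: in the paper, a worker trains a ResNet-20
            # child network on CIFAR-10 for 10K steps with fn as the activation
            # and reports the final validation accuracy.
            raise NotImplementedError

        def exhaustive_search():
            """Score every one-core-unit candidate; feasible only for small spaces."""
            scored = []
            for u1, u2, b in itertools.product(UNARY, UNARY, BINARY):
                fn = core_unit(u1, u2, b)
                scored.append((evaluate_child_network(fn), (u1, u2, b)))
            return sorted(scored, reverse=True)  # best validation accuracy first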

    After the search, the top candidates are stress-tested on three larger CIFAR architectures — preactivation ResNet-164, Wide ResNet 28-10, and DenseNet 100-12 — to filter out functions that overfit to the small child-network setting. Six of the eight top novel functions transfer; two of them, $ x \cdot \sigma(\beta x) $ and $ \max(x, \sigma(x)) $, match or beat ReLU on every model. The authors then commit to evaluating $ x \cdot \sigma(\beta x) $ at scale, partly because early experiments suggested better generalization.

    The search uncovers several recurring patterns: simple functions outperform complex ones (1–2 core units suffice), the top functions tend to use the raw preactivation as one input to the final binary operation (mirroring ReLU's structure), and division-based functions rarely work because the output explodes whenever the denominator is near zero. Among the top candidates, the authors single out

    $ f(x) = x \cdot \sigma(\beta x), \qquad \sigma(z) = (1 + \exp(-z))^{-1} $

    which they call Swish. Setting $ \beta = 1 $ recovers the Sigmoid-weighted Linear Unit (SiL) of Elfwing et al.; setting $ \beta \to \infty $ recovers ReLU; setting $ \beta = 0 $ gives the linear function $ x/2 $. Swish can therefore be viewed as a smooth interpolant between linear and ReLU behaviour, with $ \beta $ controlling the degree of nonlinearity.
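
    These limiting cases are easy to verify numerically; a quick NumPy sanity check (illustrative, not from the paper):

        import numpy as np

        def swish(x, beta):
            return x / (1.0 + np.exp(-beta * x))  # equals x * sigmoid(beta * x)

        x = np.linspace(-4.0, 4.0, 9)
        print(np.allclose(swish(x, 0.0), x / 2.0))                         # beta = 0: linear x/2
        print(np.allclose(swish(x, 50.0), np.maximum(x, 0.0), atol=1e-3))  # large beta: ~ReLU
        print(np.allclose(swish(x, 1.0), x * (1.0 / (1.0 + np.exp(-x)))))  # beta = 1: SiL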

    The first derivative of Swish is

    $ f'(x) = \sigma(\beta x) + \beta x \cdot \sigma(\beta x)\bigl(1 - \sigma(\beta x)\bigr) = \beta f(x) + \sigma(\beta x)\bigl(1 - \beta f(x)\bigr) $

    so Swish is smooth everywhere, unbounded above, bounded below, and non-monotonic — it dips below zero in a small "bump" for $ x $ roughly between $ -5 $ and $ 0 $, before approaching zero from below as $ x \to -\infty $. The authors show empirically that a large fraction of preactivations land inside this bump and argue it is an essential part of the function's behaviour. The shape of the bump is controlled by $ \beta $: when $ \beta $ is treated as a per-channel trainable parameter, fitted values on Mobile NASNet-A spread between 0 and 1.5 with a peak near 1, suggesting models do exploit the extra flexibility.
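
    A short numerical check of the derivative identity and the bump, with β = 1 (the minimum location below is computed here, not quoted from the paper):

        import numpy as np

        beta = 1.0
        sig = lambda z: 1.0 / (1.0 + np.exp(-z))
        f   = lambda x: x * sig(beta * x)
        fp  = lambda x: sig(beta * x) + beta * x * sig(beta * x) * (1.0 - sig(beta * x))

        x = np.linspace(-6.0, 3.0, 9001)
        numeric = np.gradient(f(x), x)                  # finite-difference derivative
        print(np.max(np.abs(numeric - fp(x))) < 1e-4)   # closed form agrees: True

        i = np.argmin(f(x))
        print(round(x[i], 3), round(f(x[i]), 3))        # minimum near (-1.278, -0.278): the bump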

    Implementation is a single-line change in modern frameworks (e.g. x * tf.sigmoid(beta * x)). The authors note that BatchNorm's scale parameter must remain enabled (some libraries disable it by default for ReLU) and that learning rates often need to be slightly lower than those tuned for ReLU.
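
    For the trainable-β variant, a minimal Keras layer sketch (an illustrative implementation, not the paper's code; tf.nn.swish already covers the fixed β = 1 case):

        import tensorflow as tf

        class Swish(tf.keras.layers.Layer):
            """f(x) = x * sigmoid(beta * x), with one trainable beta per channel."""

            def build(self, input_shape):
                self.beta = self.add_weight(
                    name="beta",
                    shape=(input_shape[-1],),   # per-channel; assumes channels-last input
                    initializer="ones",         # beta = 1 (SiL) at initialization
                    trainable=True,
                )

            def call(self, x):
                return x * tf.sigmoid(self.beta * x)

    Removing the trainable weight and fixing beta to 1 recovers Swish-1.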

    Results

    On CIFAR-10 and CIFAR-100, Swish and Swish-1 match or outperform ReLU on every model considered (preactivation ResNet-164, Wide ResNet 28-10, DenseNet 100-12). The "best baseline" varies by model — Softplus, GELU, and PReLU each lead on different rows — but Swish is the only function consistently at or near the top.

    On ImageNet classification, replacing ReLU with Swish gives:

    • Mobile NASNet-A: +1.4% top-1 accuracy on average over three runs ($ 73.5 \to 74.9 $%).
    • Inception-ResNet-v2: +0.5–0.6% top-1 ($ 79.6 \to 80.2 $%).
    • MobileNet: +2.2% top-1 ($ 72.0 \to 74.2 $%).
    • Inception-v3 and Inception-v4: roughly +0.1% top-1, within noise.

    For context, the authors note that an entire year of architectural tuning between Inception-v3 and Inception-ResNet-v2 yielded a 1.3% improvement, so the gains from a one-line activation swap are substantial by that standard. On a 12-layer "Base Transformer" trained on WMT 2014 English→German, Swish-1 also matches or exceeds every baseline across four newstest sets, with the largest gain on newstest2016 (+0.6 BLEU over the next-best).

    A summary sign test against each baseline (counting wins, ties, and losses across nine models) shows Swish strictly winning more often than losing against all seven of ReLU, LReLU, PReLU, Softplus, ELU, SELU, and GELU.

    Swish's gains are largest on mobile-sized convolutional architectures (Mobile NASNet-A, MobileNet) and on the Transformer, while on Inception-v4 the gap narrows to within noise. Softplus, the next most consistent baseline, is competitive on large image classifiers but collapses on machine translation (3+ BLEU below ReLU on the WMT newstest sets), illustrating the cross-domain inconsistency the paper sets out to overcome.

    Impact

    The Swish paper had outsized practical influence relative to its theoretical novelty. The function had in fact been independently proposed under the name Sigmoid-weighted Linear Unit (SiL) by Elfwing, Uchibe, and Doya in a reinforcement learning context, and the closely related GELU (Hendrycks and Gimpel, 2016) shares the same smooth, non-monotonic shape. The contribution here is the first systematic empirical demonstration that such functions improve accuracy on large-scale image and language models, together with the recipe — searching over a compositional space using a child network as a fast proxy — that produced it.

    After release, Swish was added to mainstream frameworks (e.g. tf.nn.swish) and adopted in production architectures such as EfficientNet. The variant Hard Swish — a piecewise-linear approximation defined as $ x \cdot \mathrm{ReLU6}(x + 3)/6 $ — was introduced in MobileNetV3 to recover Swish's accuracy gains while being cheap on mobile hardware. GELU itself was later popularized by BERT and the GPT family, where it became the default activation in Transformer feed-forward blocks, vindicating the broader category that Swish helped make mainstream.
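
    To see how tight the piecewise-linear approximation is, a quick NumPy comparison (illustrative, not from either paper):

        import numpy as np

        def swish(x):                  # beta = 1
            return x / (1.0 + np.exp(-x))

        def hard_swish(x):             # x * ReLU6(x + 3) / 6
            return x * np.clip(x + 3.0, 0.0, 6.0) / 6.0

        x = np.linspace(-6.0, 6.0, 1201)
        print(np.max(np.abs(swish(x) - hard_swish(x))))   # max gap ~0.14, at x = ±3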

    The paper also helped legitimize the use of automated search for low-level neural-network components, complementing the Google Brain team's parallel work on architecture search (NASNet) and optimizer search. The authors explicitly tie their findings to the "architectural improvements lessen the need for individual components to preserve gradients" argument that emerged after residual connections and Transformer-style attention removed many of the obstacles ReLU was originally designed to mitigate.

    A common misreading of the paper is that "Swish beats ReLU everywhere"; the actual experimental record is more nuanced. On large image classifiers the gap is small and architecture-dependent — Inception-v4 is essentially a tie — and any retraining of these networks ought to retune learning rates from scratch rather than reusing ReLU-tuned schedules. The robust takeaway is the comparative one: across nine architectures and three domains, Swish is the least-bad default, and the search procedure can plausibly be re-run to find activations specialized to a new architecture.

    See also

    • ImageNet

    References

    • Ramachandran, P., Zoph, B., & Le, Q. V. (2017). Searching for Activation Functions. arXiv:1710.05941.
    • Elfwing, S., Uchibe, E., & Doya, K. (2017). Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. arXiv:1702.03118.
    • Hendrycks, D., & Gimpel, K. (2016). Gaussian Error Linear Units (GELUs). arXiv:1606.08415.
    • Zoph, B., & Le, Q. V. (2017). Neural Architecture Search with Reinforcement Learning. ICLR.
    • Bello, I., Zoph, B., Vasudevan, V., & Le, Q. V. (2017). Neural Optimizer Search with Reinforcement Learning. ICML.
    • Howard, A., et al. (2019). Searching for MobileNetV3. ICCV. (Introduces Hard Swish.)