For context, the authors note that an entire year of architectural tuning between Inception-v3 and Inception-ResNet-v2 yielded only a 1.3% improvement, so the gains from a one-line activation swap are economically meaningful. On a 12-layer "Base Transformer" trained on WMT 2014 English→German, Swish-1 also matches or exceeds every baseline activation across four newstest sets, with the largest gain on newstest2016 (+0.6 BLEU over the next-best baseline).
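To make the "one-line swap" concrete, below is a minimal NumPy sketch of the Swish function from the paper, f(x) = x · σ(βx), where Swish-1 fixes β = 1 (the form also known as SiLU); the function name and sample inputs here are illustrative, not taken from the paper's code.

```python
import numpy as np

def swish(x, beta=1.0):
    """Swish activation: f(x) = x * sigmoid(beta * x).
    With beta = 1 this is Swish-1 (also known as SiLU)."""
    return x * (1.0 / (1.0 + np.exp(-beta * x)))

# The "one-line swap": wherever a network applies ReLU, apply swish instead.
x = np.linspace(-3.0, 3.0, 7)
relu_out = np.maximum(x, 0.0)   # ReLU baseline
swish_out = swish(x)            # Swish-1 replacement
print(relu_out)
print(swish_out)
```

Unlike ReLU, Swish-1 is smooth and non-monotonic (it dips slightly below zero for small negative inputs), which the paper credits as part of why a drop-in replacement can shift benchmark results.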