- Multi-gate Mixture-of-Experts — task-conditioned MoE for multi-task learning.
- Attention Is All You Need — the Transformer architecture into which MoE layers were later inserted by GShard and Switch Transformer (a minimal sketch of such a sparsely-gated layer follows this list).
- Dropout: A Simple Way to Prevent Neural Networks from Overfitting — a related form of stochastic conditional computation.
- Language Models are Few-Shot Learners — a large dense language model (GPT-3) that later MoE work aimed to surpass in quality at lower compute cost.
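
The common thread in these references is the sparsely-gated layer itself: a gating network selects the top-k of many expert networks per input and combines their outputs, so only a small fraction of the model's parameters is active for any one token. The NumPy sketch below illustrates the noisy top-k gating described in the paper; the class names (`SparseMoE`, `Expert`), the tiny two-layer ReLU experts, and all hyperparameters are illustrative assumptions, and the paper's load-balancing auxiliary losses and batching machinery are omitted.

```python
# Minimal sketch of a sparsely-gated mixture-of-experts layer with noisy
# top-k gating, in the spirit of Shazeer et al. (2017). Names, shapes,
# and the toy experts are assumptions, not the paper's implementation.
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

class Expert:
    """A tiny two-layer ReLU MLP standing in for one expert network."""
    def __init__(self, d_in, d_hidden, d_out):
        self.w1 = rng.standard_normal((d_in, d_hidden)) * 0.02
        self.w2 = rng.standard_normal((d_hidden, d_out)) * 0.02

    def __call__(self, x):
        return np.maximum(x @ self.w1, 0.0) @ self.w2

class SparseMoE:
    """Routes each token to the top-k experts chosen by a noisy gate."""
    def __init__(self, d_model, n_experts=8, k=2):
        self.k = k
        self.experts = [Expert(d_model, 4 * d_model, d_model)
                        for _ in range(n_experts)]
        self.w_gate = rng.standard_normal((d_model, n_experts)) * 0.02
        self.w_noise = rng.standard_normal((d_model, n_experts)) * 0.02

    def __call__(self, x):
        # Noisy gating logits: H(x) = x W_g + eps * softplus(x W_noise).
        clean = x @ self.w_gate
        noise_std = np.log1p(np.exp(x @ self.w_noise))  # softplus
        logits = clean + rng.standard_normal(clean.shape) * noise_std

        # Keep only the top-k logits per token; set the rest to -inf so
        # the softmax assigns them exactly zero weight (the sparsity).
        top_k = np.argpartition(logits, -self.k, axis=-1)[:, -self.k:]
        mask = np.full_like(logits, -np.inf)
        np.put_along_axis(mask, top_k,
                          np.take_along_axis(logits, top_k, axis=-1),
                          axis=-1)
        gates = softmax(mask, axis=-1)

        # Weighted sum of selected experts' outputs; experts with a zero
        # gate are never evaluated, which is where the compute savings
        # come from.
        y = np.zeros_like(x)
        for e, expert in enumerate(self.experts):
            sel = gates[:, e] > 0.0
            if sel.any():
                y[sel] += gates[sel, e:e + 1] * expert(x[sel])
        return y

moe = SparseMoE(d_model=16, n_experts=8, k=2)
tokens = rng.standard_normal((4, 16))  # a batch of 4 token vectors
print(moe(tokens).shape)               # (4, 16)
```

Setting the non-top-k logits to negative infinity before the softmax makes their gate values exactly zero, which is what allows the skipped experts to go unevaluated; the paper additionally uses auxiliary losses to keep the load across experts balanced, which this sketch leaves out.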