- Multi-gate Mixture-of-Experts — task-conditioned MoE for multi-task learning.
- Attention Is All You Need — the Transformer architecture into which MoE layers were later inserted by GShard and Switch Transformer (a minimal sketch of such a sparsely-gated layer follows this list).
- Dropout: A Simple Way to Prevent Neural Networks from Overfitting — a related form of stochastic conditional computation.
- Language Models are Few-Shot Learners — a large dense language model (GPT-3) that later MoE work aimed to surpass in quality at lower compute cost.
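
The common thread in these references is the sparsely-gated layer itself: a gating network selects the top-k of many expert networks per input and combines their outputs, so only a small fraction of the model's parameters is active for any one token. The NumPy sketch below illustrates the noisy top-k gating described in the paper; the class names (`SparseMoE`, `Expert`), the tiny two-layer ReLU experts, and all hyperparameters are illustrative assumptions, and the paper's load-balancing auxiliary losses and batching machinery are omitted.

```python
# Minimal sketch of a sparsely-gated mixture-of-experts layer with noisy
# top-k gating, in the spirit of Shazeer et al. (2017). Names, shapes,
# and the toy experts are assumptions, not the paper's implementation.
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

class Expert:
    """A tiny two-layer ReLU MLP standing in for one expert network."""
    def __init__(self, d_in, d_hidden, d_out):
        self.w1 = rng.standard_normal((d_in, d_hidden)) * 0.02
        self.w2 = rng.standard_normal((d_hidden, d_out)) * 0.02

    def __call__(self, x):
        return np.maximum(x @ self.w1, 0.0) @ self.w2

class SparseMoE:
    """Routes each token to the top-k experts chosen by a noisy gate."""
    def __init__(self, d_model, n_experts=8, k=2):
        self.k = k
        self.experts = [Expert(d_model, 4 * d_model, d_model)
                        for _ in range(n_experts)]
        self.w_gate = rng.standard_normal((d_model, n_experts)) * 0.02
        self.w_noise = rng.standard_normal((d_model, n_experts)) * 0.02

    def __call__(self, x):
        # Noisy gating logits: H(x) = x W_g + eps * softplus(x W_noise).
        clean = x @ self.w_gate
        noise_std = np.log1p(np.exp(x @ self.w_noise))  # softplus
        logits = clean + rng.standard_normal(clean.shape) * noise_std

        # Keep only the top-k logits per token; set the rest to -inf so
        # the softmax assigns them exactly zero weight (the sparsity).
        top_k = np.argpartition(logits, -self.k, axis=-1)[:, -self.k:]
        mask = np.full_like(logits, -np.inf)
        np.put_along_axis(mask, top_k,
                          np.take_along_axis(logits, top_k, axis=-1),
                          axis=-1)
        gates = softmax(mask, axis=-1)

        # Weighted sum of selected experts' outputs; experts with a zero
        # gate are never evaluated, which is where the compute savings
        # come from.
        y = np.zeros_like(x)
        for e, expert in enumerate(self.experts):
            sel = gates[:, e] > 0.0
            if sel.any():
                y[sel] += gates[sel, e:e + 1] * expert(x[sel])
        return y

moe = SparseMoE(d_model=16, n_experts=8, k=2)
tokens = rng.standard_normal((4, 16))  # a batch of 4 token vectors
print(moe(tokens).shape)               # (4, 16)
```

Setting the non-top-k logits to negative infinity before the softmax makes their gate values exactly zero, which is what allows the skipped experts to go unevaluated; the paper additionally uses auxiliary losses to keep the load across experts balanced, which this sketch leaves out.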