Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

    From Marovi AI
    Research Paper
    Authors Noam Shazeer; Azalia Mirhoseini; Krzysztof Maziarz; Andy Davis; Quoc Le; Geoffrey Hinton; Jeff Dean
    Year 2017
    Topic area Machine Learning
    Difficulty Research
    arXiv 1701.06538

    Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer is a 2017 paper by Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean that introduced the Sparsely-Gated Mixture-of-Experts (MoE) layer for deep neural networks. The paper showed that a single layer composed of thousands of feed-forward sub-networks, with a trainable gating network selecting only a handful per example, can grow model capacity by more than 1000× while keeping per-example computation roughly constant. Applied between stacked LSTM layers for language modeling and machine translation, MoE models with up to 137 billion parameters set new state-of-the-art results at lower computational cost than dense baselines.

    Overview

    The capacity of a neural network — its ability to absorb information — is bounded by its parameter count, but parameter count is normally tied to per-example computation. Conditional computation had been proposed as a way to break this coupling: only activate a sparse subset of the network for each input, so capacity grows without a proportional increase in FLOPs. Earlier work had identified the idea but had not realized it at scale, due to algorithmic and systems challenges (uneven batch sizes across active sub-networks, network bandwidth pressure, expert collapse).

    This paper presents the first practical realization of large-scale conditional computation. The Sparsely-Gated Mixture-of-Experts layer consists of up to thousands of expert feed-forward networks and a gating network that, given an input, produces a sparse vector of expert weights. Only the experts with non-zero gates are evaluated, so the per-example cost stays bounded even as the number of experts (and thus parameters) scales into the billions. The layer is inserted convolutionally — applied identically at every time step — between two stacked LSTM layers in the authors' models.

    Key Contributions

    • Sparsely-Gated MoE layer. A single neural-network layer with up to 131,072 experts and a top-k gating network that activates only a small constant number of experts per example, decoupling capacity from compute.
    • Noisy Top-K Gating. A gating mechanism that adds tunable Gaussian noise before the top-k selection, providing both sparsity and stochastic load balancing.
    • Solutions to the shrinking-batch problem. A mix of data and model parallelism, plus convolutional application across time, that keeps each expert's effective batch large enough for efficient GPU execution.
    • Soft load-balancing losses. Auxiliary "importance" and "load" losses that prevent a few experts from dominating and ensure balanced utilization across the population.
    • Empirical wins at unprecedented scale. MoE models with up to 137 billion parameters, achieving 24% lower perplexity on the 1 Billion Word Benchmark and surpassing prior state-of-the-art on WMT'14 En→Fr and En→De and on a multilingual En→{Fr, De, Es, It, Pt, ...} translation task — all at lower compute cost than the dense baselines.

    Methods

    The MoE layer

    A MoE layer contains $ n $ expert networks $ E_1, \dots, E_n $ (typically two-layer feed-forward ReLU networks) and a gating network $ G $ producing a sparse $ n $-dimensional vector. The output of the layer for input $ x $ is

    $ y = \sum_{i=1}^{n} G(x)_i \, E_i(x). $

    Wherever $ G(x)_i = 0 $, the corresponding expert is skipped. With a top-k gate, only k experts are evaluated per example, giving constant per-example compute regardless of $ n $.
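
    The combination rule above can be sketched in a few lines of NumPy. This is a toy illustration: the expert shapes, dimensions, and gate values are assumed for the example, not taken from the paper.

```python
import numpy as np

def expert(x, W1, W2):
    # One expert: a two-layer feed-forward ReLU network, as in the paper.
    return np.maximum(x @ W1, 0.0) @ W2

def moe_forward(x, experts, gates):
    # y = sum_i G(x)_i * E_i(x); experts with a zero gate are never run.
    y = np.zeros_like(x)
    for i, g_i in enumerate(gates):
        if g_i != 0.0:
            W1, W2 = experts[i]
            y += g_i * expert(x, W1, W2)
    return y

# Toy setup (dimensions and gate values assumed): n = 4 experts,
# model width 8, expert hidden width 16, and a sparse top-2 gate.
rng = np.random.default_rng(0)
n, d, h = 4, 8, 16
experts = [(rng.standard_normal((d, h)), rng.standard_normal((h, d)))
           for _ in range(n)]
x = rng.standard_normal(d)
gates = np.array([0.0, 0.7, 0.0, 0.3])  # only experts 1 and 3 are active
y = moe_forward(x, experts, gates)
```

    Because the loop skips zero gates entirely, the cost of the layer depends on the number of active experts, not on $ n $.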

    Noisy Top-K Gating

    The gating network adds input-dependent Gaussian noise, with a trainable scale, to a learned linear projection, keeps only the largest k of the resulting logits, and normalizes them with a softmax:

    $ H(x)_i = (x \cdot W_g)_i + \epsilon \cdot \mathrm{softplus}((x \cdot W_{noise})_i),\quad \epsilon \sim \mathcal{N}(0,1) $
    $ G(x) = \mathrm{Softmax}(\mathrm{KeepTopK}(H(x), k)). $

    The tunable noise helps balance load across experts during training, while KeepTopK, which sets all but the top $ k $ entries to $ -\infty $, guarantees that exactly $ k $ gates are non-zero after the softmax.
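
    A minimal NumPy sketch of this gating rule follows; the dimensions, weight initializations, and random seed are illustrative assumptions.

```python
import numpy as np

def softplus(z):
    # Numerically stable softplus: log(1 + e^z).
    return np.logaddexp(0.0, z)

def noisy_top_k_gating(x, W_g, W_noise, k, rng):
    """Noisy Top-K gating as in the formulas above:
    H(x)_i = (x . W_g)_i + eps_i * softplus((x . W_noise)_i), eps_i ~ N(0, 1),
    followed by Softmax(KeepTopK(H(x), k))."""
    h = x @ W_g + rng.standard_normal(W_g.shape[1]) * softplus(x @ W_noise)
    # KeepTopK: set everything outside the top k to -inf, so the softmax
    # assigns those experts a gate of exactly zero.
    topk = np.argsort(h)[-k:]
    masked = np.full_like(h, -np.inf)
    masked[topk] = h[topk]
    e = np.exp(masked - h[topk].max())
    return e / e.sum()

# Toy dimensions (assumed): n = 8 experts, input dim 16, top-2 routing.
rng = np.random.default_rng(0)
n, d, k = 8, 16, 2
W_g = rng.standard_normal((d, n))
W_noise = rng.standard_normal((d, n))
g = noisy_top_k_gating(rng.standard_normal(d), W_g, W_noise, k, rng)
```

    The returned gate vector has exactly $ k $ non-zero entries that sum to one, which is what makes the layer output a sparse convex-style combination of expert outputs.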

    Performance engineering

    When $ k $ experts out of $ n $ are chosen for each of $ b $ examples, each expert sees on average $ kb/n $ examples, which is far too small for efficient GPU matrix multiplications. The authors address this by:

    • Mixing data and model parallelism. The MoE layer is sharded across $ d $ devices so every expert lives on exactly one device, and all $ d $ data-parallel replicas synchronously route their selected examples to the experts they need. Each expert's effective batch becomes $ k \cdot b \cdot d / n $, which scales with cluster size.
    • Convolutional application. The MoE is applied independently at every time step of the LSTM, multiplying the per-expert batch by the unrolled sequence length.
    • Hierarchical MoE. For very large $ n $, a primary gate selects a group of experts and a secondary gate selects experts within that group, keeping routing tractable.
    • Bandwidth optimization. Experts are made compute-heavy (large hidden layers) so their compute-to-input-bytes ratio absorbs the cost of shuffling examples between devices.
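
    The batch arithmetic behind the first point can be checked with illustrative numbers (all values below are assumptions for the example, not the paper's configuration):

```python
# Effective per-expert batch under data + model parallelism.
k = 4        # experts activated per example
n = 1024     # total experts
b = 1024     # examples per device
d = 64       # data-parallel devices (one shard of experts per device)

naive_batch = k * b / n        # one replica: each expert sees ~k*b/n examples
sharded_batch = k * b * d / n  # d synchronized replicas routing jointly
```

    With these numbers a lone replica would feed each expert only 4 examples per step, while the combined replicas feed it 256, enough to keep the expert's matrix multiplications efficient on a GPU.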

    Load-balancing losses

    Two auxiliary losses are added to the training objective to keep the gating network from collapsing to a few favorite experts:

    $ L_{importance}(X) = w_{importance} \cdot \mathrm{CV}\!\left(\sum_{x \in X} G(x)\right)^2, $

    where CV is the coefficient of variation. A second loss $ L_{load} $ equalizes the number of examples each expert receives — not just the gate-weighted importance — using a smooth differentiable estimator built from the noisy gating distribution. Together they enforce both equal weight and equal example count per expert.
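
    The importance loss can be sketched directly from the formula above. The `w_importance` value here is an assumed hand-tuned hyperparameter, and the gate matrices are toy examples.

```python
import numpy as np

def cv_squared(v, eps=1e-10):
    # Squared coefficient of variation: Var(v) / Mean(v)^2.
    return np.var(v) / (np.mean(v) ** 2 + eps)

def importance_loss(gate_matrix, w_importance=0.1):
    """L_importance = w_importance * CV(sum_x G(x))^2 for a batch of gate
    vectors of shape (batch, n_experts). w_importance is assumed, not the
    paper's setting."""
    importance = gate_matrix.sum(axis=0)  # total gate weight per expert
    return w_importance * cv_squared(importance)

# Balanced routing incurs (near-)zero loss; collapse onto one expert does not.
balanced = np.full((32, 4), 0.25)
collapsed = np.zeros((32, 4))
collapsed[:, 0] = 1.0
```

    On the balanced batch every expert receives equal total weight, so the coefficient of variation, and hence the loss, is zero; the collapsed batch is penalized, pushing the gate back toward using all experts.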

    Results

    Language modeling

    On the 1 Billion Word Language Modeling Benchmark, the authors trained MoE models with 4, 32, 256, 1024, 4096, 32768, and 65536 experts, all matched at roughly 8M ops/timestep. The 4096-expert model reduced test perplexity by 24% relative to the previous best published model, while higher-compute variants with up to 4 billion parameters set new state-of-the-art perplexities at significantly lower compute than prior dense models. Computational efficiency on 16–32 Tesla K40 GPUs reached 0.74–1.56 TFLOPS/GPU.

    On the 100 Billion Word Google News Corpus — roughly 100× larger than the 1 Billion Word set — they trained an MoE with 131,072 experts, totaling 137 billion parameters, and continued to see perplexity gains as capacity grew, indicating that MoE benefits keep compounding when training data is plentiful.

    Machine translation

    On WMT'14 En→Fr and En→De, MoE-augmented stacked LSTM encoders/decoders surpassed the previous Google Neural Machine Translation (GNMT) state-of-the-art BLEU scores while using less training compute. On a multilingual production translation system covering twelve language pairs simultaneously, a single MoE-augmented model improved BLEU on every pair compared to a dedicated GNMT model trained on each pair individually.

    Impact

    This paper is a foundational entry in the modern MoE literature. It established the recipe — sparse top-k gating, noise-based load balancing, expert sharding, soft balancing losses — that subsequent work built upon directly. It directly inspired GShard (2020), which scaled the same idea to Transformer encoders for multilingual translation, and Switch Transformer (2021), which simplified the gate to top-1 routing and scaled to over a trillion parameters. It also underpins GLaM, ST-MoE, Mixtral 8×7B, and the MoE backbones used in many frontier large language models. The "outrageously large" phrasing of the title turned out to be modest: by the early 2020s, MoE layers with hundreds of billions of parameters were routinely deployed in production language models, and the conditional-computation principle introduced here became a central tool for trading parameter count against per-token cost.

    References

    • Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., Hinton, G., & Dean, J. (2017). Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. International Conference on Learning Representations (ICLR).
    • Bengio, Y., Léonard, N., & Courville, A. (2013). Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation.
    • Eigen, D., Ranzato, M. A., & Sutskever, I. (2013). Learning Factored Representations in a Deep Mixture of Experts.
    • Jacobs, R. A., Jordan, M. I., Nowlan, S. J., & Hinton, G. E. (1991). Adaptive Mixtures of Local Experts. Neural Computation, 3(1), 79–87.
    • Jordan, M. I., & Jacobs, R. A. (1994). Hierarchical Mixtures of Experts and the EM Algorithm. Neural Computation, 6(2), 181–214.
    • Lepikhin, D. et al. (2020). GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding.
    • Fedus, W., Zoph, B., & Shazeer, N. (2021). Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity.