Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts

    Type: Research Paper
    Authors: Jiaqi Ma; Zhe Zhao; Xinyang Yi; Jilin Chen; Lichan Hong; Ed H. Chi
    Year: 2018
    Venue: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD '18)
    Topic area: Machine Learning
    Difficulty: Research

    Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts is a 2018 paper by Jiaqi Ma, Zhe Zhao, Xinyang Yi, Jilin Chen, Lichan Hong, and Ed H. Chi, published in the Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD '18). The work proposes the Multi-gate Mixture-of-Experts (MMoE) architecture, in which a shared bank of expert subnetworks is combined per task by a separate softmax gating network, allowing a single multi-task neural network to flexibly model task relationships without substantially increasing parameter count. MMoE became a foundational technique in industrial multi-task learning and is widely deployed in large-scale recommendation systems, including those at YouTube, where the design is credited with improving offline AUC and live engagement and satisfaction metrics.

    Overview

    Neural multi-task learning seeks to train a single model that jointly predicts several related targets — for example, both whether a user will watch a recommended item and whether they will like it afterwards. Sharing representations across tasks promises improved sample efficiency and regularization, but in practice the prediction quality of the dominant Shared-Bottom architecture is highly sensitive to task relatedness: when tasks compete for capacity in the shared trunk, gradients conflict and joint training can underperform separate single-task models. Earlier remedies — such as L2-Constrained networks, Cross-Stitch networks, and tensor-factorized multi-task networks — replace hard sharing with soft constraints, but they typically add many task-specific parameters and lose the inference-time efficiency that motivates multi-task models in production.

    MMoE replaces the single shared trunk with a bank of expert feed-forward networks and gives each task its own gating network. The gates are linear-softmax functions of the input that produce per-example mixture weights over the experts; each task therefore consumes its own input-conditioned blend of the same expert pool. When tasks are similar, gates converge on overlapping experts and benefit from shared representations; when tasks conflict, gates learn to route to disjoint experts and the model recovers the behaviour of separate models — all while leaving the expert pool itself untouched.

    The authors validate MMoE in three settings of increasing realism: a synthetic regression benchmark with controllable task correlation, the UCI Census-income binary-classification benchmark, and a Google content recommendation system trained on tens of billions of user feedback events. Across all three, MMoE matches or exceeds prior soft-sharing baselines while preserving the lightweight computational profile of a Shared-Bottom model.

    Key Contributions

    • The Multi-gate Mixture-of-Experts (MMoE) architecture for multi-task neural networks: a shared pool of expert networks combined per task by a task-specific softmax gate over the input.
    • A controlled synthetic study of task relatedness based on sinusoidal regression with weight-vector cosine similarity as a tunable proxy for label Pearson correlation, isolating how multi-task models behave as task correlation degrades.
    • A trainability analysis showing that MMoE not only attains better mean loss than Shared-Bottom and One-gate MoE (OMoE) baselines, but also exhibits markedly lower variance over random initializations — i.e., it is harder to land in poor local minima.
    • Benchmark results on the UCI Census-income dataset that match or beat L2-Constrained, Cross-Stitch, and Tensor-Factorization multi-task baselines under matched parameter budgets.
    • Production-scale evidence from a Google recommendation system: MMoE improves engagement AUC and offline R² over a Shared-Bottom production model, and yields statistically significant gains in both engagement and satisfaction live metrics without inflating serving cost.

    Methods

    Let $ K $ denote the number of tasks. The standard Shared-Bottom multi-task model consists of one shared encoder $ f $ and a tower $ h_k $ per task:

    $ y_k = h_k(f(x)). $
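
    For concreteness, a minimal NumPy sketch of this Shared-Bottom forward pass follows; the layer sizes and the single linear tower per task are illustrative assumptions rather than the paper's configuration.

        import numpy as np

        rng = np.random.default_rng(0)
        d, hidden, n_tasks = 16, 8, 2                   # illustrative sizes, not the paper's
        W_shared = rng.normal(size=(d, hidden))         # shared bottom f
        W_towers = rng.normal(size=(n_tasks, hidden))   # one linear tower h_k per task

        def shared_bottom_forward(x):
            """x: (batch, d) -> list of per-task predictions y_k = h_k(f(x))."""
            shared = np.maximum(x @ W_shared, 0.0)      # f(x), here a single ReLU layer
            return [shared @ W_towers[k] for k in range(n_tasks)]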

    MMoE replaces the single encoder $ f $ with a bank of $ n $ expert networks $ f_1, \ldots, f_n $ and introduces a softmax gating network $ g^k $ for each task:

    $ y_k = h_k\!\left(\sum_{i=1}^{n} g^k(x)_i\, f_i(x)\right),\qquad g^k(x) = \mathrm{softmax}(W_{g_k}\, x), $

    where $ W_{g_k} \in \mathbb{R}^{n \times d} $ is a per-task trainable matrix. Each expert is a feed-forward MLP with ReLU activations; gates are deliberately kept lightweight so that the parameter overhead over a Shared-Bottom model of comparable expert width is negligible. A One-gate MoE (OMoE) baseline — in which all tasks share a single gate — is included to isolate the contribution of per-task gating from that of the MoE structure itself.
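
    The MMoE forward pass can be sketched in a few lines of NumPy. This is not the authors' implementation; the number of experts, expert width, and single-layer linear towers are illustrative assumptions.

        import numpy as np

        def relu(z):
            return np.maximum(z, 0.0)

        def softmax(z):
            e = np.exp(z - z.max(axis=-1, keepdims=True))
            return e / e.sum(axis=-1, keepdims=True)

        rng = np.random.default_rng(0)
        d, n_experts, expert_dim, n_tasks = 16, 4, 8, 2       # illustrative sizes

        W_exp = rng.normal(size=(n_experts, d, expert_dim))   # one-layer ReLU experts f_i
        W_gate = rng.normal(size=(n_tasks, n_experts, d))     # per-task gate matrices W_{g_k}
        W_tower = rng.normal(size=(n_tasks, expert_dim))      # linear towers h_k

        def mmoe_forward(x):
            """x: (batch, d) -> list of per-task predictions, each (batch,)."""
            experts = relu(np.einsum('bd,edh->beh', x, W_exp))   # every expert sees the same input
            outputs = []
            for k in range(n_tasks):
                gates = softmax(x @ W_gate[k].T)                 # g^k(x): per-example weights over experts
                mixed = np.einsum('be,beh->bh', gates, experts)  # task-specific blend of shared experts
                outputs.append(mixed @ W_tower[k])               # task tower h_k
            return outputs

        y_task1, y_task2 = mmoe_forward(rng.normal(size=(5, d)))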

    For the synthetic study, two regression labels are produced from weight vectors $ w_1, w_2 $ with controlled cosine similarity $ p $:

    $ w_1 = c\, u_1,\qquad w_2 = c\!\left(p\, u_1 + \sqrt{1 - p^2}\, u_2\right), $

    with $ u_1 \perp u_2 $, and labels generated through a non-linear mixture of sinusoidal functions of $ w_k^T x $ plus Gaussian noise. The cosine similarity $ p $ serves as a controllable proxy for the empirical label Pearson correlation, providing a clean axis along which to vary task relatedness.
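
    A small NumPy sketch of this construction follows; the specific sinusoidal non-linearity and noise scale below are illustrative choices, not the constants used in the paper.

        import numpy as np

        rng = np.random.default_rng(0)
        d, c, p, n = 128, 1.0, 0.5, 10_000   # p is the target cosine similarity between w_1 and w_2

        # Two orthonormal directions; w_2 is rotated away from w_1 so that cos(w_1, w_2) = p.
        u1 = rng.normal(size=d)
        u1 /= np.linalg.norm(u1)
        u2 = rng.normal(size=d)
        u2 -= (u2 @ u1) * u1                 # project out u1 so that u1 is orthogonal to u2
        u2 /= np.linalg.norm(u2)
        w1 = c * u1
        w2 = c * (p * u1 + np.sqrt(1.0 - p ** 2) * u2)

        X = rng.normal(size=(n, d))

        def label(w):
            # Sinusoidal non-linearity of w^T x plus Gaussian noise (illustrative, not the paper's exact form).
            s = X @ w
            return s + np.sin(2.0 * s) + 0.1 * rng.normal(size=n)

        y1, y2 = label(w1), label(w2)
        print(np.corrcoef(y1, y2)[0, 1])     # empirical label correlation tracks p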

    On the Census-income benchmark, two task pairs are constructed from demographic features (income vs. marital status; education vs. marital status). On the production recommendation system, two binary classification tasks, an engagement-related signal and a satisfaction-related signal, are jointly trained on tens of billions of user feedback events, with all baselines tuned by a Gaussian-Process hyperparameter search under a matched budget of at most 2048 hidden units per layer.
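
    Joint training of two binary tasks of this kind typically minimizes a weighted sum of per-task losses; the sketch below shows such an objective with equal weights, which is an assumption rather than a detail reported in the paper.

        import numpy as np

        def sigmoid(z):
            return 1.0 / (1.0 + np.exp(-z))

        def joint_binary_loss(logits_task1, logits_task2, y1, y2,
                              w1=1.0, w2=1.0, eps=1e-7):
            """Weighted sum of per-task binary cross-entropies; equal weights are an assumption."""
            p1, p2 = sigmoid(logits_task1), sigmoid(logits_task2)
            bce1 = -np.mean(y1 * np.log(p1 + eps) + (1 - y1) * np.log(1 - p1 + eps))
            bce2 = -np.mean(y2 * np.log(p2 + eps) + (1 - y2) * np.log(1 - p2 + eps))
            return w1 * bce1 + w2 * bce2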

    Results

    On the synthetic benchmark, MMoE reduces the gap between high-correlation and low-correlation regimes far more than OMoE or Shared-Bottom, and dominates them in average loss across 200 independent runs at each correlation level. The OMoE baseline, lacking per-task gates, degrades sharply when task correlation falls — confirming that per-task gating is the load-bearing piece of the design. A trainability histogram further shows that Shared-Bottom suffers a long tail of poor local minima, whereas MMoE concentrates its outcomes near the best-attainable loss.

    On UCI Census-income, MMoE attains the highest mean AUC on the main task in both groups (income/marital status and education/marital status), narrowly exceeding L2-Constrained and Cross-Stitch and substantially outperforming Tensor-Factorization, which collapses under low task relatedness. The single-task model retains a small edge on the auxiliary marital-status task because its hyperparameters are tuned for that task directly, whereas the multi-task models are tuned only for the main task.

    In Google's content recommendation system, MMoE produces the strongest engagement AUC and R² at every training checkpoint (2M, 4M, and 6M steps). L2-Constrained and Cross-Stitch fall behind even Shared-Bottom, likely because their per-task copies of the shared layers roughly double the parameter count and leave them over-parameterized relative to the baseline. Live A/B tests show MMoE improves engagement by +0.25% and satisfaction by +2.65% over the Shared-Bottom production model, both significant at the 95% level, and crucially without measurable serving-time overhead, because expert sharing preserves the Shared-Bottom efficiency advantage.

    Impact

    MMoE became one of the most widely adopted multi-task architectures in industrial machine learning, particularly in large-scale recommendation, ranking, and ads systems. The design influenced subsequent work on gated multi-task learning, including Customized Gate Control (CGC) and Progressive Layered Extraction (PLE), and it informs the broader family of sparsely-gated mixture-of-experts and conditional-computation architectures that scale parameter count without scaling per-example FLOPs. Its core insight, that per-task input-conditioned gates over a shared expert pool can decouple task conflict from shared-representation benefits, has carried over into transformer-era MoE designs used in modern large language models. The paper is also frequently cited as evidence that gating mechanisms improve trainability in non-convex deep networks, complementing analogous findings for gated recurrent units.

    References

    1. Jiaqi Ma, Zhe Zhao, Xinyang Yi, Jilin Chen, Lichan Hong, and Ed H. Chi. 2018. Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD '18). ACM, New York, NY, USA, 1930–1939. https://doi.org/10.1145/3219819.3220007
    2. Robert A. Jacobs, Michael I. Jordan, Steven J. Nowlan, and Geoffrey E. Hinton. 1991. Adaptive mixtures of local experts. Neural Computation 3, 1, 79–87.
    3. Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. 2017. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv:1701.06538.
    4. Rich Caruana. 1998. Multitask learning. In Learning to learn. Springer, 95–133.
    5. Ishan Misra, Abhinav Shrivastava, Abhinav Gupta, and Martial Hebert. 2016. Cross-stitch networks for multi-task learning. In CVPR. 3994–4003.
    6. Long Duong, Trevor Cohn, Steven Bird, and Paul Cook. 2015. Low Resource Dependency Parsing: Cross-lingual Parameter Sharing in a Neural Network Parser. In ACL (2). 845–850.
    7. Yongxin Yang and Timothy Hospedales. 2016. Deep multi-task representation learning: A tensor factorisation approach. arXiv:1605.06391.
    8. Paul Covington, Jay Adams, and Emre Sargin. 2016. Deep neural networks for YouTube recommendations. In RecSys. ACM, 191–198.