<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://marovi.ai/index.php?action=history&amp;feed=atom&amp;title=MTGR%3A_Industrial-Scale_Generative_Recommendation_Framework_in_Meituan%2Fen</id>
	<title>MTGR: Industrial-Scale Generative Recommendation Framework in Meituan/en - Revision history</title>
	<link rel="self" type="application/atom+xml" href="https://marovi.ai/index.php?action=history&amp;feed=atom&amp;title=MTGR%3A_Industrial-Scale_Generative_Recommendation_Framework_in_Meituan%2Fen"/>
	<link rel="alternate" type="text/html" href="https://marovi.ai/index.php?title=MTGR:_Industrial-Scale_Generative_Recommendation_Framework_in_Meituan/en&amp;action=history"/>
	<updated>2026-04-27T17:02:42Z</updated>
	<subtitle>Revision history for this page on the wiki</subtitle>
	<generator>MediaWiki 1.39.1</generator>
	<entry>
		<id>https://marovi.ai/index.php?title=MTGR:_Industrial-Scale_Generative_Recommendation_Framework_in_Meituan/en&amp;diff=12911&amp;oldid=prev</id>
		<title>FuzzyBot: Updating to match new version of source page</title>
		<link rel="alternate" type="text/html" href="https://marovi.ai/index.php?title=MTGR:_Industrial-Scale_Generative_Recommendation_Framework_in_Meituan/en&amp;diff=12911&amp;oldid=prev"/>
		<updated>2026-04-27T08:00:07Z</updated>

		<summary type="html">&lt;p&gt;Updating to match new version of source page&lt;/p&gt;
&lt;p&gt;&lt;b&gt;New page&lt;/b&gt;&lt;/p&gt;&lt;div&gt;&amp;lt;languages /&amp;gt;&lt;br /&gt;
{{PaperTabs}}&lt;br /&gt;
{{PaperInfobox&lt;br /&gt;
 | topic_area  = NLP&lt;br /&gt;
 | difficulty  = Research&lt;br /&gt;
 | authors     = Ruidong Han; Bin Yin; Shangyu Chen; He Jiang; Fei Jiang; Xiang Li; Chi Ma; Mincong Huang; Xiaoguang Li; Chunzhen Jing; Yueming Han; Menglei Zhou; Lei Yu; Chuan Liu; Wei Lin&lt;br /&gt;
 | year        = 2025&lt;br /&gt;
 | arxiv_id    = 2505.18654&lt;br /&gt;
 | source_url  = https://arxiv.org/abs/2505.18654&lt;br /&gt;
 | pdf_url     = https://arxiv.org/pdf/2505.18654.pdf&lt;br /&gt;
}}&lt;br /&gt;
{{ContentMeta&lt;br /&gt;
 | generated_by   = claude-code-direct&lt;br /&gt;
 | model_used     = claude-opus-4-7&lt;br /&gt;
 | generated_date = 2026-04-27&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;MTGR: Industrial-Scale Generative Recommendation Framework in Meituan&amp;#039;&amp;#039;&amp;#039; is a 2025 paper by Ruidong Han, Bin Yin and colleagues at Meituan that introduces a ranking model unifying the strengths of Deep Learning Recommendation Models (DLRMs) and Generative Recommendation Models (GRMs). MTGR builds on the HSTU transformer architecture but, unlike prior generative recommenders, retains the hand-crafted cross features that DLRMs rely on, and reorganizes user-candidate data into a single shared sequence so that scaling computation does not scale linearly with the number of candidates. The authors report a 65× increase in forward FLOPs per sample over a mature DLRM baseline together with a +1.22% gain in conversion volume and +1.31% in click-through rate, and have deployed the model on the main traffic of the Meituan take-away platform.&lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
&lt;br /&gt;
Industrial recommenders typically face a tension between expressiveness and cost. DLRMs ingest curated features — user profile, behavior sequences, candidate features, and especially cross features that encode user-item interactions — but their inference cost grows roughly linearly with the number of candidates per request, capping how far the model can be scaled.&lt;br /&gt;
&lt;br /&gt;
GRMs replace this with a transformer over tokenized user behavior trained with next-token prediction, achieving favorable scaling but forcing the removal of cross features, which the authors found severely degrades ranking quality even at large parameter counts.&lt;br /&gt;
&lt;br /&gt;
MTGR resolves the tension by treating the user as a context shared across all candidates in a request, packing user features, two kinds of behavior sequences, and per-candidate features (with cross features attached to candidate tokens) into a single token sequence consumed by a stack of HSTU blocks under a custom attention mask. Because the user is encoded once per request rather than once per candidate, the cost of growing the transformer is amortized over many candidates and inference scales sub-linearly in the candidate count. Discriminative scoring on the candidate tokens then preserves the ranking objective familiar to DLRM practitioners.&lt;br /&gt;
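&lt;br /&gt;
The amortization can be made concrete with a back-of-the-envelope token count. The sequence lengths below are invented for illustration, not taken from the paper.&lt;br /&gt;

```python
# Hypothetical token counts contrasting per-candidate user encoding
# (DLRM-style, linear in K) with a shared user context (MTGR-style,
# sub-linear in K).

def per_candidate_tokens(user_len, cand_len, k):
    # The user context is replicated for each of the K candidates.
    return k * (user_len + cand_len)

def shared_context_tokens(user_len, cand_len, k):
    # One shared user context plus K candidate token groups.
    return user_len + k * cand_len

u, c, k = 1000, 10, 300                 # made-up lengths
print(per_candidate_tokens(u, c, k))    # 303000
print(shared_context_tokens(u, c, k))   # 4000
```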
&lt;br /&gt;
== Key Contributions ==&lt;br /&gt;
&lt;br /&gt;
* A hybrid scaling architecture that preserves all DLRM input features — including cross features — while inheriting the GRM-style scalability of HSTU, by reformulating recommendation as discriminative scoring over a tokenized user-aggregated sequence rather than next-token prediction.&lt;br /&gt;
* Group-Layer Normalization (GLN), a layer normalization variant that normalizes tokens within each semantic group (user, historical behavior, real-time behavior, candidate) separately so that heterogeneous feature domains can share a single attention stack.&lt;br /&gt;
* A dynamic three-rule attention mask that distinguishes static context, causal real-time interactions, and self-only candidate visibility, preventing temporal information leakage between candidates and recent user actions inside the same training sample.&lt;br /&gt;
* User-level sample aggregation that compresses all candidates of a request (or training window) into one forward pass, reducing training samples from O(candidates) to O(users) and giving sub-linear inference cost in the candidate count.&lt;br /&gt;
* A TorchRec-based training stack with dynamic hash-table embeddings, two-stage ID deduplication, automatic table merging, dynamic per-GPU batch size for load balance, bf16 mixed precision, and a CUTLASS attention kernel; together these yield 1.6×–2.4× higher throughput than vanilla TorchRec and good scaling beyond 100 GPUs.&lt;br /&gt;
&lt;br /&gt;
== Methods ==&lt;br /&gt;
&lt;br /&gt;
For a request with &amp;lt;math&amp;gt;K&amp;lt;/math&amp;gt; candidates, the traditional DLRM expands the data into &amp;lt;math&amp;gt;K&amp;lt;/math&amp;gt; independent samples &amp;lt;math&amp;gt;\mathbb{D}_i = [\mathbf{U}, \vec{\mathbf{S}}, \vec{\mathbf{R}}, \mathbf{C}_i, \mathbf{I}_i]&amp;lt;/math&amp;gt;, where &amp;lt;math&amp;gt;\mathbf{U}&amp;lt;/math&amp;gt; is the user profile, &amp;lt;math&amp;gt;\vec{\mathbf{S}}&amp;lt;/math&amp;gt; the long-term behavior sequence, &amp;lt;math&amp;gt;\vec{\mathbf{R}}&amp;lt;/math&amp;gt; recent real-time interactions, &amp;lt;math&amp;gt;\mathbf{C}_i&amp;lt;/math&amp;gt; the cross features between the user and candidate &amp;lt;math&amp;gt;i&amp;lt;/math&amp;gt;, and &amp;lt;math&amp;gt;\mathbf{I}_i&amp;lt;/math&amp;gt; the candidate&amp;#039;s own features. After per-feature embedding, target attention summarizes &amp;lt;math&amp;gt;\vec{\mathbf{S}}&amp;lt;/math&amp;gt; against &amp;lt;math&amp;gt;\mathbf{I}_i&amp;lt;/math&amp;gt; and an MLP produces a logit per candidate.&lt;br /&gt;
&lt;br /&gt;
MTGR rearranges the data so that the user appears once per request and the cross features are attached to candidate tokens:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\mathbb{D} = [\mathbf{U}, \vec{\mathbf{S}}, \vec{\mathbf{R}}, [\mathbf{C}, \mathbf{I}]_1, \ldots, [\mathbf{C}, \mathbf{I}]_K]&amp;lt;/math&amp;gt;&lt;br /&gt;
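&lt;br /&gt;
As a toy sketch of this layout (dimensions and group sizes invented; the real tokenizers are learned embedding-and-MLP modules), the packed sequence and its group labels can be assembled as:&lt;br /&gt;

```python
import numpy as np

d_model = 8                       # toy size; the paper uses 512 or 768
rng = np.random.default_rng(0)

def embed(n):
    # Stand-in for the learned embedding-and-MLP tokenizers.
    return rng.normal(size=(n, d_model)).astype(np.float32)

n_user, n_hist, n_rt, n_cand = 4, 6, 3, 2
U = embed(n_user)   # one token per user scalar feature
S = embed(n_hist)   # long-term behavior items
R = embed(n_rt)     # real-time interactions
C = embed(n_cand)   # candidates, each fusing item and cross features

X = np.concatenate([U, S, R, C], axis=0)          # the packed sequence
groups = (["user"] * n_user + ["hist"] * n_hist
          + ["rt"] * n_rt + ["cand"] * n_cand)    # used by GroupLN and the mask
```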
&lt;br /&gt;
Each scalar feature in &amp;lt;math&amp;gt;\mathbf{U}&amp;lt;/math&amp;gt; becomes a token of dimension &amp;lt;math&amp;gt;d_{\text{model}}&amp;lt;/math&amp;gt;; each item in &amp;lt;math&amp;gt;\vec{\mathbf{S}}&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;\vec{\mathbf{R}}&amp;lt;/math&amp;gt; becomes a token via embedding-and-MLP; each candidate is a token whose embedding fuses its identity features with the user-specific cross features. The full token stream is fed through &amp;lt;math&amp;gt;L&amp;lt;/math&amp;gt; stacked HSTU blocks. Inside a block, the input &amp;lt;math&amp;gt;\mathbf{X}&amp;lt;/math&amp;gt; is normalized by Group-Layer Normalization, linearly projected into four branches &amp;lt;math&amp;gt;\mathbf{Q}, \mathbf{K}, \mathbf{V}, \mathbf{U}&amp;lt;/math&amp;gt;, and updated via&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\tilde{\mathbf{V}} = \left( \frac{\text{silu}(\mathbf{K}^T \mathbf{Q})}{N_{\mathbf{U}} + N_{\vec{\mathbf{S}}} + N_{\vec{\mathbf{R}}} + N_{\mathbf{I}}} \odot \mathbf{M} \right) \mathbf{V}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\mathbf{X} \leftarrow \text{MLP}(\text{GroupLN}(\tilde{\mathbf{V}} \odot \mathbf{U})) + \mathbf{X}&amp;lt;/math&amp;gt;&lt;br /&gt;
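&lt;br /&gt;
The two update equations can be condensed into a heavily simplified single-head NumPy sketch (GroupLN replaced by identity, the output MLP by one matrix, and an all-ones mask — a reading of the formulas above, not the paper's implementation):&lt;br /&gt;

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def hstu_block(X, M, Wq, Wk, Wv, Wu, Wo):
    # X: (n, d) row-vector tokens, so Q @ K.T matches K^T Q in the
    # column-vector notation above. n stands in for N_U + N_S + N_R + N_I.
    n = X.shape[0]
    Q, K, V, U = X @ Wq, X @ Wk, X @ Wv, X @ Wu
    A = silu(Q @ K.T) / n          # pointwise-silu attention weights
    A = A * M                      # visibility mask, applied elementwise
    return ((A @ V) * U) @ Wo + X  # gated aggregation, projection, residual

rng = np.random.default_rng(1)
d, n = 8, 5
Ws = [rng.normal(scale=0.1, size=(d, d)) for _ in range(5)]
X = rng.normal(size=(n, d))
Y = hstu_block(X, np.ones((n, n)), *Ws)
```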
&lt;br /&gt;
The custom mask &amp;lt;math&amp;gt;\mathbf{M}&amp;lt;/math&amp;gt; enforces three rules: the static sequence (&amp;lt;math&amp;gt;\mathbf{U}&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;\vec{\mathbf{S}}&amp;lt;/math&amp;gt;) is visible to all tokens; tokens in the dynamic sequence &amp;lt;math&amp;gt;\vec{\mathbf{R}}&amp;lt;/math&amp;gt; are causal — only visible to tokens chronologically later than themselves, including candidates that occurred after a given real-time event; candidate tokens are visible only to themselves, so candidates within the same request cannot leak signal to one another. This eliminates the temporal leakage that a naive causal mask would introduce when real-time interactions and candidate exposures are aggregated in the same window.&lt;br /&gt;
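&lt;br /&gt;
A minimal builder for such a mask, assuming per-token timestamps are available (the index layout and rule encoding are illustrative, not the paper's implementation):&lt;br /&gt;

```python
import numpy as np

def mtgr_mask(n_user, n_hist, rt_times, cand_times):
    # M[q, k] = 1 means query token q may attend to key token k.
    n_rt, n_cand = len(rt_times), len(cand_times)
    n_static = n_user + n_hist
    n = n_static + n_rt + n_cand
    M = np.zeros((n, n), dtype=np.float32)

    M[:, :n_static] = 1.0                     # rule 1: static tokens visible to all

    for i in range(n_rt):                     # rule 2: real-time tokens are causal
        for j in range(n_rt):
            if rt_times[i] >= rt_times[j]:
                M[n_static + i, n_static + j] = 1.0
    for c in range(n_cand):                   # ...and visible to later candidates
        for j in range(n_rt):
            if cand_times[c] > rt_times[j]:
                M[n_static + n_rt + c, n_static + j] = 1.0

    for c in range(n_cand):                   # rule 3: candidates see only themselves
        q = n_static + n_rt + c
        M[q, q] = 1.0
    return M
```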
&lt;br /&gt;
Group-Layer Normalization, in contrast to a single shared layer norm, computes statistics within each domain (user, long-term behavior, real-time behavior, candidate) separately. Because these domains live in different semantic spaces and have different feature counts, sharing a normalization across them collapses their distributions and weakens the attention signal; the ablation study shows GLN to be roughly as impactful as adding several HSTU blocks.&lt;br /&gt;
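&lt;br /&gt;
A sketch of group-wise normalization under one plausible reading of the description above (statistics pooled over each group's tokens; learned per-group scale and shift omitted — the paper's exact parameterization may differ):&lt;br /&gt;

```python
import numpy as np

def group_layer_norm(X, groups, eps=1e-6):
    # Normalize each semantic group of tokens with its own mean/variance
    # instead of sharing one set of statistics across the whole sequence.
    out = np.empty_like(X)
    for g in set(groups):
        idx = [i for i, gid in enumerate(groups) if gid == g]
        block = X[idx]
        out[idx] = (block - block.mean()) / np.sqrt(block.var() + eps)
    return out
```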
&lt;br /&gt;
The training framework departs from TensorFlow in favor of PyTorch with TorchRec. Embedding tables use a decoupled hash-based design — a compact key-to-pointer index plus a separate value structure — that allows real-time insertion and eviction of sparse IDs without pre-allocating capacity. Cross-device embedding lookup is accelerated by two-stage ID deduplication and automatic table merging. Long-tail user sequence lengths are handled by dynamic per-GPU batch size with gradient reweighting. Three pipelined streams (copy, dispatch, compute) overlap I/O, embedding lookup, and forward/backward computation, and a CUTLASS-based attention kernel similar to FlashAttention is used together with bf16 mixed precision.&lt;br /&gt;
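&lt;br /&gt;
The two-stage ID deduplication can be sketched schematically (a pure-Python stand-in, not Meituan's implementation):&lt;br /&gt;

```python
def two_stage_dedup(per_gpu_ids):
    # Stage 1: drop duplicate sparse IDs inside each local (per-GPU) batch.
    local_unique = [sorted(set(ids)) for ids in per_gpu_ids]
    # Stage 2: deduplicate again across GPUs, so each unique embedding row
    # crosses the interconnect and is looked up only once.
    global_unique = sorted(set().union(*local_unique))
    return local_unique, global_unique

local, merged = two_stage_dedup([[3, 3, 1], [1, 2, 2]])
print(local)    # [[1, 3], [1, 2]]
print(merged)   # [1, 2, 3]
```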
&lt;br /&gt;
== Results ==&lt;br /&gt;
&lt;br /&gt;
On a 10-day Meituan ranking dataset with 0.21 billion users, 4.3 million items, 23.7 billion exposures and rich cross features, MTGR is compared against several DLRM scaling families — DNN, MoE, Wukong, MultiEmbed, and UserTower — each combined with either SIM-style sequence retrieval or full end-to-end sequence modeling. The strongest DLRM baseline is UserTower-SIM. Even MTGR-small (3 HSTU blocks, &amp;lt;math&amp;gt;d_{\text{model}} = 512&amp;lt;/math&amp;gt;, 5.47 GFLOPs/example) exceeds it on AUC and GAUC for both CTR and CTCVR, and MTGR-medium (5 blocks, &amp;lt;math&amp;gt;d_{\text{model}} = 768&amp;lt;/math&amp;gt;) and MTGR-large (15 blocks, &amp;lt;math&amp;gt;d_{\text{model}} = 768&amp;lt;/math&amp;gt;, 55.76 GFLOPs/example) extend the gain monotonically, with the improvement on CTCVR GAUC following an approximately power-law relationship in computational complexity.&lt;br /&gt;
&lt;br /&gt;
Ablations show that removing Group-Layer Normalization or dynamic masking each costs roughly as much performance as the gap between MTGR-small and MTGR-medium, and removing cross features wipes out the entire MTGR-large advantage over DLRM, confirming that the design&amp;#039;s main payoff is feeding cross features into a scalable transformer rather than the transformer alone. Scalability sweeps along three independent axes — the number of HSTU blocks, the model dimension &amp;lt;math&amp;gt;d_{\text{model}}&amp;lt;/math&amp;gt;, and the input sequence length — all yield smooth power-law improvements in CTCVR GAUC.&lt;br /&gt;
&lt;br /&gt;
In a six-month online A/B test against a UserTower-SIM model that had been trained continuously for two years, MTGR-large delivered +0.0153 CTR GAUC, +0.0288 CTCVR GAUC, +1.90% PV_CTR and +1.02% UV_CTCVR — by Meituan&amp;#039;s reckoning the largest single ranking gain in nearly two years — at flat training cost and 12% lower inference cost, the latter coming from sub-linear scaling of inference in the candidate count. Notably, the MTGR model used only six months of training data while the DLRM baseline had been continuously updated over two years, suggesting that more training data should widen the margin further.&lt;br /&gt;
&lt;br /&gt;
== Impact ==&lt;br /&gt;
&lt;br /&gt;
MTGR is one of the first reported industrial deployments to make the HSTU-style generative architecture compatible with the cross-feature-rich pipelines that drive most production rankers, addressing a key obstacle that had limited generative recommenders to feature-light public benchmarks. By showing that scaling laws hold for ranking models when input richness is preserved, the work strengthens the case that recommendation can follow the same compute-driven trajectory as language and vision.&lt;br /&gt;
&lt;br /&gt;
Practically, the system runs on the main traffic of Meituan, the world&amp;#039;s largest food-delivery platform, serving hundreds of millions of users; the released framework choices — TorchRec with dynamic hash tables, group-wise normalization, and dynamic masking — provide a concrete reference for other industrial teams migrating from DLRM-style stacks to transformer-based ranking.&lt;br /&gt;
&lt;br /&gt;
The authors close by sketching a path toward multi-scenario foundation models for recommendation, in which a single MTGR-style backbone could be shared across multiple business surfaces, mirroring the way large language models are reused across downstream tasks.&lt;br /&gt;
&lt;br /&gt;
== See also ==&lt;br /&gt;
&lt;br /&gt;
* [[Recommender system]]&lt;br /&gt;
* [[Deep learning]]&lt;br /&gt;
* [[Transformer (deep learning architecture)]]&lt;br /&gt;
* [[Attention (machine learning)]]&lt;br /&gt;
* [[Neural scaling law]]&lt;br /&gt;
* [[Click-through rate]]&lt;br /&gt;
* [[Layer normalization]]&lt;br /&gt;
* [[Embedding (machine learning)]]&lt;br /&gt;
* [[FlashAttention]]&lt;br /&gt;
* [[Mixture of experts]]&lt;br /&gt;
* [[Meituan]]&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
* Han, R., Yin, B., Chen, S., Jiang, H., Jiang, F., Li, X., Ma, C., Huang, M., Li, X., Jing, C., Han, Y., Zhou, M., Yu, L., Liu, C., and Lin, W. (2025). &amp;#039;&amp;#039;MTGR: Industrial-Scale Generative Recommendation Framework in Meituan&amp;#039;&amp;#039;. arXiv:2505.18654.&lt;br /&gt;
* Zhai, J. et al. (2024). Actions Speak Louder than Words: Trillion-Parameter Sequential Transducers for Generative Recommendations (HSTU).&lt;br /&gt;
* Deng, J. et al. (2025). OneRec: Unifying Retrieve and Rank with Generative Recommender and Iterative Preference Alignment. arXiv:2502.18965.&lt;br /&gt;
* Kaplan, J. et al. (2020). Scaling Laws for Neural Language Models.&lt;br /&gt;
* Dao, T. et al. (2022). FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness.&lt;br /&gt;
* Ivchenko, D. et al. (2022). TorchRec: a PyTorch Domain Library for Recommendation Systems.&lt;br /&gt;
* Pi, Q. et al. (2020). Search-Based User Interest Modeling with Lifelong Sequential Behavior Data for Click-Through Rate Prediction (SIM).&lt;br /&gt;
* Zhang, B. et al. (2024). Wukong: Towards a Scaling Law for Large-Scale Recommendation.&lt;br /&gt;
* Guo, X. et al. (2023). On the Embedding Collapse when Scaling up Recommendation Models.&lt;br /&gt;
&lt;br /&gt;
[[Category:NLP]]&lt;br /&gt;
[[Category:Research]]&lt;br /&gt;
[[Category:Research Papers]]&lt;/div&gt;</summary>
		<author><name>FuzzyBot</name></author>
	</entry>
</feed>