Wide & Deep Learning for Recommender Systems

    Type: Research Paper
    Authors: Heng-Tze Cheng; Levent Koc; Jeremiah Harmsen; Tal Shaked; Tushar Chandra; Hrishi Aradhye; Glen Anderson; Greg Corrado; Wei Chai; Mustafa Ispir; Rohan Anil; Zakaria Haque; Lichan Hong; Vihan Jain; Xiaobing Liu; Hemal Shah
    Year: 2016
    Topic area: Machine Learning
    Difficulty: Research
    arXiv: 1606.07792

    Wide & Deep Learning for Recommender Systems is a 2016 paper by Heng-Tze Cheng and colleagues at Google that introduces a hybrid architecture for large-scale recommender systems. The model jointly trains a wide linear component, which memorizes specific feature interactions through cross-product transformations, and a deep feed-forward neural network, which generalizes to unseen feature combinations through low-dimensional embeddings. The framework was productionized in the Google Play app store and improved app acquisitions by 3.9% in live A/B tests, while a reference implementation was released in TensorFlow.

    Overview

    A recommender system can be viewed as a search ranking pipeline: a query consisting of user and contextual features is mapped to a ranked list of candidate items. Two competing capabilities are needed. Memorization learns the frequent co-occurrence of items or features from historical data and exploits direct correlations; it is well served by generalized linear models on cross-product features but does not extrapolate to unseen pairs. Generalization explores new feature combinations and is well served by embedding-based models such as factorization machines and deep neural networks, but dense embeddings can over-generalize on sparse, high-rank query-item matrices and surface less relevant items.

    The Wide & Deep framework combines both signals in a single jointly trained model. The wide branch handles exception-style rules with few parameters, while the deep branch covers the long tail of unseen interactions. The two branches share a single logistic loss, so each component can specialize without redundantly modeling what the other already captures.

    Key Contributions

    • The Wide & Deep architecture, which jointly trains a feed-forward neural network with sparse-feature embeddings and a linear model with cross-product transformations for generic recommender systems.
    • A productionized implementation evaluated on Google Play, a commercial mobile app store with over one billion active users and over one million apps, demonstrating a +3.9% online gain in app acquisitions over a strong wide-only baseline.
    • An open-source TensorFlow implementation with a high-level API, lowering the barrier to applying the architecture to other ranking problems.
    • Engineering details, including warm-starting, multithreaded scoring, and feature-vocabulary generation, that bring per-request serving latency down to roughly 14 ms while scoring over 10 million apps per second at peak.

    Methods

    Wide component

    The wide component is a generalized linear model

    $ y = \mathbf{w}^{T}\mathbf{x} + b, $

    where $ \mathbf{x} = [x_1, x_2, \ldots, x_d] $ is the feature vector, $ \mathbf{w} $ the weights, and $ b $ the bias. Beyond the raw inputs, the feature set includes cross-product transformations

    $ \phi_k(\mathbf{x}) = \prod_{i=1}^{d} x_i^{c_{ki}}, \quad c_{ki} \in \{0, 1\}, $

    where $ c_{ki} $ is 1 if feature $ i $ participates in the $ k $-th transformation and 0 otherwise. For binary features, $ \phi_k $ evaluates to 1 only when all participating features are 1, which captures conjunctions such as AND(user_installed_app=netflix, impression_app=pandora) and injects nonlinearity into the linear model.
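
    As an illustration, the following minimal Python sketch evaluates one such conjunction over a set of active binary features; the feature strings are taken from the example above, and the helper name is ours:

        # Sketch of a cross-product transformation over binary features.
        def cross_product(active_features, cross):
            """phi_k = prod_i x_i^{c_ki}: 1 only if every feature in `cross` is active."""
            return int(all(f in active_features for f in cross))

        # One impression, represented as the set of binary features that are "on".
        impression = {"user_installed_app=netflix", "impression_app=pandora"}
        cross = ("user_installed_app=netflix", "impression_app=pandora")

        print(cross_product(impression, cross))                      # 1: conjunction fires
        print(cross_product({"user_installed_app=netflix"}, cross))  # 0: one feature missing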

    Deep component

    The deep component is a feed-forward neural network. Each categorical feature is mapped to a dense embedding vector with dimensionality on the order of $ O(10) $ to $ O(100) $, learned end-to-end. Embeddings are concatenated with normalized continuous features and propagated through hidden layers

    $ a^{(l+1)} = f\!\left(W^{(l)} a^{(l)} + b^{(l)}\right), $

    where $ f $ is the activation function (rectified linear units in the experiments) and $ a^{(l)} $, $ W^{(l)} $, $ b^{(l)} $ are the activations, weights, and biases of layer $ l $.
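
    The forward pass can be sketched in a few lines of NumPy; the feature names, vocabulary sizes, and layer widths below are invented for illustration and do not reflect the production configuration:

        import numpy as np

        rng = np.random.default_rng(0)

        # Hypothetical categorical features with learned embedding tables.
        vocab_sizes = {"impression_app": 1000, "device_class": 10}
        emb_dim = 32
        embeddings = {name: rng.normal(scale=0.01, size=(v, emb_dim))
                      for name, v in vocab_sizes.items()}

        def forward(categorical_ids, continuous, layers):
            # Embedding lookup, then concatenation with continuous features.
            parts = [embeddings[name][idx] for name, idx in categorical_ids.items()]
            a = np.concatenate(parts + [continuous])
            # Hidden layers: a^{(l+1)} = relu(W^{(l)} a^{(l)} + b^{(l)}).
            for W, b in layers:
                a = np.maximum(0.0, W @ a + b)
            return a

        input_dim = emb_dim * len(vocab_sizes) + 3          # 3 continuous features
        widths = [input_dim, 128, 64]                       # illustrative layer sizes
        layers = [(rng.normal(scale=0.1, size=(widths[i + 1], widths[i])),
                   np.zeros(widths[i + 1])) for i in range(len(widths) - 1)]

        a_final = forward({"impression_app": 42, "device_class": 3},
                          np.array([0.5, 0.1, 0.9]), layers)
        print(a_final.shape)  # (64,): final hidden activation fed to the output unit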

    Joint training

    The two branches are combined as a weighted sum of their output log-odds and fed to a shared logistic loss:

    $ P(Y = 1 \mid \mathbf{x}) = \sigma\!\left(\mathbf{w}_{wide}^{T}[\mathbf{x}, \phi(\mathbf{x})] + \mathbf{w}_{deep}^{T} a^{(l_f)} + b\right), $

    where $ \sigma $ is the sigmoid function and $ a^{(l_f)} $ is the final hidden activation of the deep network. Crucially this is joint training rather than ensembling: gradients flow back into both branches simultaneously, so the wide part only needs to complement the gaps left by the deep part. The wide branch is optimized with Follow-the-Regularized-Leader (FTRL) and $ L_1 $ regularization, while the deep branch uses AdaGrad. In the production model, a 32-dimensional embedding is learned per categorical feature, concatenated into a roughly 1200-dimensional input, and passed through three ReLU layers before a logistic output unit.
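
    A toy NumPy sketch of the combined prediction and its shared logistic loss follows; all vectors are random placeholders, and the FTRL and AdaGrad updates applied to the two branches in practice are not shown:

        import numpy as np

        def sigmoid(z):
            return 1.0 / (1.0 + np.exp(-z))

        rng = np.random.default_rng(0)

        x_wide = rng.integers(0, 2, size=50).astype(float)   # raw + cross-product features
        a_final = rng.random(64)                             # final deep hidden activation
        w_wide = rng.normal(size=50)                          # wide weights
        w_deep = rng.normal(size=64)                          # weights on the deep activation
        b = 0.0

        # Log-odds from both branches are summed and squashed by a sigmoid.
        logit = w_wide @ x_wide + w_deep @ a_final + b
        p = sigmoid(logit)

        # Shared logistic loss for one example with label y; during joint training its
        # gradient flows back into both branches simultaneously.
        y = 1.0
        loss = -(y * np.log(p) + (1 - y) * np.log(1 - p))
        print(p, loss)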

    System

    The recommendation pipeline retrieves a candidate set, then ranks it with the Wide & Deep model. Training data is built from impression logs with a binary label for app acquisition. To absorb the more than 500 billion training examples and the cost of frequent retraining, the team introduced warm-starting, which initializes a new model with the embeddings and linear weights of its predecessor. A dry-run sanity check guards against regressions before promotion to live serving.
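
    Warm-starting can be illustrated with a toy embedding table: rows whose vocabulary items also existed in the previous model are copied over, and only genuinely new items receive fresh random rows. The vocabularies and sizes below are invented for the example:

        import numpy as np

        rng = np.random.default_rng(0)
        emb_dim = 32

        # Previous model's vocabulary and learned embedding table (dummy values here).
        old_vocab = {"netflix": 0, "pandora": 1, "spotify": 2}
        old_table = rng.normal(scale=0.01, size=(len(old_vocab), emb_dim))

        # New model: one new vocabulary item, everything else carried over.
        new_vocab = {"netflix": 0, "pandora": 1, "spotify": 2, "new_app": 3}
        new_table = rng.normal(scale=0.01, size=(len(new_vocab), emb_dim))

        for token, row in new_vocab.items():
            if token in old_vocab:                      # reuse the previously learned embedding
                new_table[row] = old_table[old_vocab[token]]

        print(np.allclose(new_table[0], old_table[0]))  # True: "netflix" row was warm-started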

    Results

    In a 3-week live A/B test on Google Play, the Wide & Deep model improved the app acquisition rate on the main landing page by +3.9% relative to a heavily tuned wide-only logistic regression baseline (statistically significant) and by +1% over a deep-only model. Offline AUC differences were narrower (0.728 vs. 0.726 vs. 0.722 for Wide & Deep, Wide, and Deep), suggesting that the online gains come partly from the model's ability to learn from new exploratory recommendations rather than only from offline ranking quality.

    On the serving side, single-threaded batch scoring took 31 ms; splitting each batch across multiple threads cut client-side latency to 14 ms while sustaining over 10 million app scores per second at peak.
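
    The multithreading idea can be sketched by splitting one request's candidate batch into smaller sub-batches and scoring them in parallel; the linear scoring function below is a stand-in for the real model's forward pass:

        import numpy as np
        from concurrent.futures import ThreadPoolExecutor

        rng = np.random.default_rng(0)
        W = rng.normal(size=100)                        # dummy model parameters

        def score(sub_batch):
            # Stand-in for running Wide & Deep on a sub-batch of candidate apps.
            return sub_batch @ W

        candidates = rng.random((10_000, 100))          # feature vectors for one request
        sub_batches = np.array_split(candidates, 8)     # smaller parallel batches

        with ThreadPoolExecutor(max_workers=8) as pool:
            scores = np.concatenate(list(pool.map(score, sub_batches)))

        print(scores.shape)                             # (10000,): one score per candidate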

    Impact

    Wide & Deep became one of the canonical reference architectures for industrial click-through-rate (CTR) prediction and ranking, alongside Factorization Machines and the deep CTR models that followed. The pattern of pairing a memorization-friendly linear branch with a generalization-friendly deep branch motivated successors such as DeepFM, Deep & Cross Network, and xDeepFM, which automate the cross-feature engineering that the wide branch still relies on.

    The TensorFlow DNNLinearCombinedClassifier / DNNLinearCombinedRegressor estimators productized the architecture as a drop-in API, and the paper is widely cited in textbook treatments of recommender systems and deep learning applied to ranking. Beyond its direct influence on CTR models, the broader principle — that complementary inductive biases can be combined under a shared loss instead of via post-hoc ensembling — informed later hybrid designs that mix retrieval and ranking signals, structured priors with neural networks, or rule-based features with learned representations.
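
    For reference, a minimal sketch of the estimator API mentioned above, with invented feature names and hash-bucket sizes; the tf.estimator and feature-column interfaces have since been deprecated in recent TensorFlow releases, so this illustrates the pattern rather than current best practice:

        import tensorflow as tf

        # Sparse categorical inputs (feature names are hypothetical).
        user_app = tf.feature_column.categorical_column_with_hash_bucket(
            "user_installed_app", hash_bucket_size=10_000)
        impression_app = tf.feature_column.categorical_column_with_hash_bucket(
            "impression_app", hash_bucket_size=10_000)

        # Wide branch: raw sparse columns plus a cross-product transformation.
        wide_columns = [
            user_app,
            impression_app,
            tf.feature_column.crossed_column(
                ["user_installed_app", "impression_app"], hash_bucket_size=100_000),
        ]

        # Deep branch: embeddings of the sparse columns plus a continuous feature.
        deep_columns = [
            tf.feature_column.embedding_column(user_app, dimension=32),
            tf.feature_column.embedding_column(impression_app, dimension=32),
            tf.feature_column.numeric_column("user_age"),
        ]

        estimator = tf.estimator.DNNLinearCombinedClassifier(
            linear_feature_columns=wide_columns,    # FTRL optimizer by default
            dnn_feature_columns=deep_columns,       # Adagrad optimizer by default
            dnn_hidden_units=[1024, 512, 256],
        )
        # estimator.train(input_fn=...) would then fit the joint model on impression logs.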

    The work is also frequently cited as an early successful case study of deploying deep learning in a high-traffic production ranking system. The discussion of warm-starting, vocabulary management, and quantile-based normalization of continuous features became a template for engineering teams building their first online deep ranking pipelines.

    References

    • Cheng, H.-T., Koc, L., Harmsen, J., Shaked, T., Chandra, T., Aradhye, H., Anderson, G., Corrado, G., Chai, W., Ispir, M., Anil, R., Haque, Z., Hong, L., Jain, V., Liu, X., & Shah, H. (2016). Wide & Deep Learning for Recommender Systems. arXiv:1606.07792.
    • Duchi, J., Hazan, E., & Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12, 2121–2159.
    • McMahan, H. B. (2011). Follow-the-regularized-leader and mirror descent: Equivalence theorems and l1 regularization. In Proc. AISTATS.
    • Rendle, S. (2012). Factorization machines with libFM. ACM Transactions on Intelligent Systems and Technology, 3(3), 57:1–57:22.
    • He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proc. IEEE Conference on Computer Vision and Pattern Recognition.
    • Wang, H., Wang, N., & Yeung, D.-Y. (2015). Collaborative deep learning for recommender systems. In Proc. KDD, 1235–1244.