<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://marovi.ai/index.php?action=history&amp;feed=atom&amp;title=Wide_%26_Deep_Learning_for_Recommender_Systems%2Fen</id>
	<title>Wide &amp; Deep Learning for Recommender Systems/en - Revision history</title>
	<link rel="self" type="application/atom+xml" href="https://marovi.ai/index.php?action=history&amp;feed=atom&amp;title=Wide_%26_Deep_Learning_for_Recommender_Systems%2Fen"/>
	<link rel="alternate" type="text/html" href="https://marovi.ai/index.php?title=Wide_%26_Deep_Learning_for_Recommender_Systems/en&amp;action=history"/>
	<updated>2026-04-27T17:00:04Z</updated>
	<subtitle>Revision history for this page on the wiki</subtitle>
	<generator>MediaWiki 1.39.1</generator>
	<entry>
		<id>https://marovi.ai/index.php?title=Wide_%26_Deep_Learning_for_Recommender_Systems/en&amp;diff=12914&amp;oldid=prev</id>
		<title>FuzzyBot: Updating to match new version of source page</title>
		<link rel="alternate" type="text/html" href="https://marovi.ai/index.php?title=Wide_%26_Deep_Learning_for_Recommender_Systems/en&amp;diff=12914&amp;oldid=prev"/>
		<updated>2026-04-27T08:02:20Z</updated>

		<summary type="html">&lt;p&gt;Updating to match new version of source page&lt;/p&gt;
&lt;p&gt;&lt;b&gt;New page&lt;/b&gt;&lt;/p&gt;&lt;div&gt;&amp;lt;languages /&amp;gt;&lt;br /&gt;
{{PaperTabs}}&lt;br /&gt;
{{PaperInfobox&lt;br /&gt;
 | topic_area  = Machine Learning&lt;br /&gt;
 | difficulty  = Research&lt;br /&gt;
 | authors     = Heng-Tze Cheng; Levent Koc; Jeremiah Harmsen; Tal Shaked; Tushar Chandra; Hrishi Aradhye; Glen Anderson; Greg Corrado; Wei Chai; Mustafa Ispir; Rohan Anil; Zakaria Haque; Lichan Hong; Vihan Jain; Xiaobing Liu; Hemal Shah&lt;br /&gt;
 | year        = 2016&lt;br /&gt;
 | arxiv_id    = 1606.07792&lt;br /&gt;
 | source_url  = https://arxiv.org/abs/1606.07792&lt;br /&gt;
 | pdf_url     = https://arxiv.org/pdf/1606.07792.pdf&lt;br /&gt;
}}&lt;br /&gt;
{{ContentMeta&lt;br /&gt;
 | generated_by   = claude-code-direct&lt;br /&gt;
 | model_used     = claude-opus-4-7&lt;br /&gt;
 | generated_date = 2026-04-27&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;Wide &amp;amp; Deep Learning for Recommender Systems&amp;#039;&amp;#039;&amp;#039; is a 2016 paper by Heng-Tze Cheng and colleagues at Google that introduces a hybrid architecture for large-scale recommender systems. The model jointly trains a wide linear component, which memorizes specific feature interactions through cross-product transformations, and a deep feed-forward neural network, which generalizes to unseen feature combinations through low-dimensional embeddings. The framework was productionized in the Google Play app store and improved app acquisitions by 3.9% in live A/B tests, while a reference implementation was released in TensorFlow.&lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
&lt;br /&gt;
A recommender system can be viewed as a search ranking pipeline: a query consisting of user and contextual features is mapped to a ranked list of candidate items. Two competing capabilities are needed. &amp;#039;&amp;#039;Memorization&amp;#039;&amp;#039; learns the frequent co-occurrence of items or features from historical data and exploits direct correlations; it is well served by generalized linear models on cross-product features but does not extrapolate to unseen pairs. &amp;#039;&amp;#039;Generalization&amp;#039;&amp;#039; explores new feature combinations and is well served by embedding-based models such as factorization machines and deep neural networks, but dense embeddings can over-generalize on sparse, high-rank query-item matrices and surface less relevant items.&lt;br /&gt;
&lt;br /&gt;
The Wide &amp;amp; Deep framework combines both signals in a single jointly trained model. The wide branch handles exception-style rules with few parameters, while the deep branch covers the long tail of unseen interactions. The two branches share a single logistic loss, so each component can specialize without redundantly modeling what the other already captures.&lt;br /&gt;
&lt;br /&gt;
== Key Contributions ==&lt;br /&gt;
&lt;br /&gt;
* The Wide &amp;amp; Deep architecture, which jointly trains a feed-forward neural network with sparse-feature embeddings and a linear model with cross-product transformations for generic recommender systems.&lt;br /&gt;
* A productionized implementation evaluated on Google Play, a commercial mobile app store with over one billion active users and over one million apps, demonstrating a +3.9% online gain in app acquisitions over a strong wide-only baseline.&lt;br /&gt;
* An open-source TensorFlow implementation with a high-level API, lowering the barrier to applying the architecture to other ranking problems.&lt;br /&gt;
* Engineering details, including warm-starting, multithreaded scoring, and feature-vocabulary generation, that bring per-request latency down to roughly 14 ms while scoring more than 10 million candidates per second.&lt;br /&gt;
&lt;br /&gt;
== Methods ==&lt;br /&gt;
&lt;br /&gt;
=== Wide component ===&lt;br /&gt;
&lt;br /&gt;
The wide component is a generalized linear model&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;y = \mathbf{w}^{T}\mathbf{x} + b,&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;math&amp;gt;\mathbf{x} = [x_1, x_2, \ldots, x_d]&amp;lt;/math&amp;gt; is the feature vector, &amp;lt;math&amp;gt;\mathbf{w}&amp;lt;/math&amp;gt; the weights, and &amp;lt;math&amp;gt;b&amp;lt;/math&amp;gt; the bias. Beyond the raw inputs, the feature set includes &amp;#039;&amp;#039;cross-product transformations&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\phi_k(\mathbf{x}) = \prod_{i=1}^{d} x_i^{c_{ki}}, \quad c_{ki} \in \{0, 1\},&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;math&amp;gt;c_{ki}&amp;lt;/math&amp;gt; is 1 if feature &amp;lt;math&amp;gt;i&amp;lt;/math&amp;gt; participates in the &amp;lt;math&amp;gt;k&amp;lt;/math&amp;gt;-th transformation and 0 otherwise. For binary features, &amp;lt;math&amp;gt;\phi_k&amp;lt;/math&amp;gt; evaluates to 1 only when all participating features are 1, which captures conjunctions such as &amp;lt;code&amp;gt;AND(user_installed_app=netflix, impression_app=pandora)&amp;lt;/code&amp;gt; and injects nonlinearity into the linear model.&lt;br /&gt;
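&lt;br /&gt;
A minimal sketch of this transformation for binary inputs; the feature encoding and mask are hypothetical, not taken from the paper&amp;#039;s code:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;python&amp;quot;&amp;gt;&lt;br /&gt;
import numpy as np&lt;br /&gt;
&lt;br /&gt;
def cross_product(x, mask):&lt;br /&gt;
    # phi_k(x) = prod_i x_i^{c_ki}: for binary inputs, an AND over&lt;br /&gt;
    # the features selected by the 0/1 mask c_k&lt;br /&gt;
    x = np.asarray(x)&lt;br /&gt;
    mask = np.asarray(mask, dtype=bool)&lt;br /&gt;
    return int(np.all(x[mask] == 1))&lt;br /&gt;
&lt;br /&gt;
# hypothetical encoding: [installed=netflix, impression=pandora, is_weekend]&lt;br /&gt;
x = [1, 1, 0]&lt;br /&gt;
mask = [1, 1, 0]                 # c_k selects the first two features&lt;br /&gt;
print(cross_product(x, mask))    # prints 1: the conjunction fires&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;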
&lt;br /&gt;
=== Deep component ===&lt;br /&gt;
&lt;br /&gt;
The deep component is a feed-forward neural network. Each categorical feature is mapped to a dense embedding vector with dimensionality on the order of &amp;lt;math&amp;gt;O(10)&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;O(100)&amp;lt;/math&amp;gt;, learned end-to-end. Embeddings are concatenated with normalized continuous features and propagated through hidden layers&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;a^{(l+1)} = f\!\left(W^{(l)} a^{(l)} + b^{(l)}\right),&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;math&amp;gt;f&amp;lt;/math&amp;gt; is the activation function (rectified linear units in the experiments) and &amp;lt;math&amp;gt;a^{(l)}&amp;lt;/math&amp;gt;, &amp;lt;math&amp;gt;W^{(l)}&amp;lt;/math&amp;gt;, &amp;lt;math&amp;gt;b^{(l)}&amp;lt;/math&amp;gt; are the activations, weights, and biases of layer &amp;lt;math&amp;gt;l&amp;lt;/math&amp;gt;.&lt;br /&gt;
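&lt;br /&gt;
A minimal Keras sketch of the deep branch, assuming two hypothetical categorical vocabularies, 32-dimensional embeddings, and the three ReLU layers (1024, 512, and 256 units) reported for the production model:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;python&amp;quot;&amp;gt;&lt;br /&gt;
import tensorflow as tf&lt;br /&gt;
&lt;br /&gt;
vocab_sizes = [10000, 5000]   # hypothetical categorical vocabularies&lt;br /&gt;
embed = [tf.keras.layers.Embedding(v, 32) for v in vocab_sizes]&lt;br /&gt;
&lt;br /&gt;
cat_inputs = [tf.keras.Input(shape=(), dtype=tf.int64) for _ in vocab_sizes]&lt;br /&gt;
cont_input = tf.keras.Input(shape=(8,))   # normalized continuous features&lt;br /&gt;
&lt;br /&gt;
# concatenate embeddings with continuous features, then apply ReLU layers&lt;br /&gt;
a = tf.keras.layers.Concatenate()(&lt;br /&gt;
    [e(i) for e, i in zip(embed, cat_inputs)] + [cont_input])&lt;br /&gt;
for units in (1024, 512, 256):&lt;br /&gt;
    a = tf.keras.layers.Dense(units, activation=tf.nn.relu)(a)&lt;br /&gt;
&lt;br /&gt;
deep_branch = tf.keras.Model(cat_inputs + [cont_input], a)&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;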
&lt;br /&gt;
=== Joint training ===&lt;br /&gt;
&lt;br /&gt;
The two branches are combined as a weighted sum of their output log-odds and fed to a shared logistic loss:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;P(Y = 1 \mid \mathbf{x}) = \sigma\!\left(\mathbf{w}_{wide}^{T}[\mathbf{x}, \phi(\mathbf{x})] + \mathbf{w}_{deep}^{T} a^{(l_f)} + b\right),&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;math&amp;gt;\sigma&amp;lt;/math&amp;gt; is the sigmoid function and &amp;lt;math&amp;gt;a^{(l_f)}&amp;lt;/math&amp;gt; is the final hidden activation of the deep network. Crucially this is &amp;#039;&amp;#039;joint&amp;#039;&amp;#039; training rather than ensembling: gradients flow back into both branches simultaneously, so the wide part only needs to complement the gaps left by the deep part. The wide branch is optimized with Follow-the-Regularized-Leader (FTRL) and &amp;lt;math&amp;gt;L_1&amp;lt;/math&amp;gt; regularization, while the deep branch uses AdaGrad. In the production model, a 32-dimensional embedding is learned per categorical feature, concatenated into a roughly 1200-dimensional input, and passed through three ReLU layers before a logistic output unit.&lt;br /&gt;
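&lt;br /&gt;
TensorFlow later exposed exactly this combination as an estimator (see the Impact section below). A sketch of the joint setup with that real API; the feature names, hash-bucket sizes, and regularization strength are hypothetical:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;python&amp;quot;&amp;gt;&lt;br /&gt;
import tensorflow as tf&lt;br /&gt;
&lt;br /&gt;
installed = tf.feature_column.categorical_column_with_hash_bucket(&lt;br /&gt;
    &amp;#039;user_installed_app&amp;#039;, hash_bucket_size=10000)&lt;br /&gt;
impression = tf.feature_column.categorical_column_with_hash_bucket(&lt;br /&gt;
    &amp;#039;impression_app&amp;#039;, hash_bucket_size=10000)&lt;br /&gt;
&lt;br /&gt;
# wide branch: the cross-product feature, optimized with FTRL + L1&lt;br /&gt;
wide_cols = [tf.feature_column.crossed_column(&lt;br /&gt;
    [installed, impression], hash_bucket_size=100000)]&lt;br /&gt;
# deep branch: 32-dimensional embeddings, optimized with AdaGrad&lt;br /&gt;
deep_cols = [tf.feature_column.embedding_column(installed, dimension=32),&lt;br /&gt;
             tf.feature_column.embedding_column(impression, dimension=32)]&lt;br /&gt;
&lt;br /&gt;
model = tf.estimator.DNNLinearCombinedClassifier(&lt;br /&gt;
    linear_feature_columns=wide_cols,&lt;br /&gt;
    linear_optimizer=tf.keras.optimizers.Ftrl(&lt;br /&gt;
        l1_regularization_strength=0.1),   # hypothetical strength&lt;br /&gt;
    dnn_feature_columns=deep_cols,&lt;br /&gt;
    dnn_optimizer=&amp;#039;Adagrad&amp;#039;,&lt;br /&gt;
    dnn_hidden_units=[1024, 512, 256])&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;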
&lt;br /&gt;
=== System ===&lt;br /&gt;
&lt;br /&gt;
The recommendation pipeline retrieves a candidate set, then ranks it with the Wide &amp;amp; Deep model. Training data is built from impression logs with a binary label for app acquisition. To absorb the more than 500 billion training examples and the cost of frequent retraining, the team introduced &amp;#039;&amp;#039;warm-starting&amp;#039;&amp;#039;, which initializes a new model with the embeddings and linear weights of its predecessor. A dry-run sanity check guards against regressions before promotion to live serving.&lt;br /&gt;
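&lt;br /&gt;
A minimal sketch of the warm-starting step via the estimator&amp;#039;s warm_start_from hook, reusing the feature columns from the joint-training sketch above; the checkpoint path is hypothetical:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;python&amp;quot;&amp;gt;&lt;br /&gt;
import tensorflow as tf&lt;br /&gt;
&lt;br /&gt;
# wide_cols and deep_cols as defined in the joint-training sketch&lt;br /&gt;
warm = tf.estimator.WarmStartSettings(&lt;br /&gt;
    ckpt_to_initialize_from=&amp;#039;/tmp/previous_model&amp;#039;,   # hypothetical path&lt;br /&gt;
    vars_to_warm_start=&amp;#039;.*&amp;#039;)   # reuse all embeddings and linear weights&lt;br /&gt;
model = tf.estimator.DNNLinearCombinedClassifier(&lt;br /&gt;
    linear_feature_columns=wide_cols,&lt;br /&gt;
    dnn_feature_columns=deep_cols,&lt;br /&gt;
    dnn_hidden_units=[1024, 512, 256],&lt;br /&gt;
    warm_start_from=warm)&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;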
&lt;br /&gt;
== Results ==&lt;br /&gt;
&lt;br /&gt;
In a 3-week live A/B test on Google Play, the Wide &amp;amp; Deep model improved the app acquisition rate on the main landing page by +3.9% relative to a heavily tuned wide-only logistic regression baseline (statistically significant) and by +1% over a deep-only model. Offline AUC differences were narrower (0.728 vs. 0.726 vs. 0.722 for Wide &amp;amp; Deep, Wide, and Deep), suggesting that the online gains come partly from the model&amp;#039;s ability to learn from new exploratory recommendations rather than only from offline ranking quality.&lt;br /&gt;
&lt;br /&gt;
On the serving side, single-threaded batch scoring took 31 ms; splitting each batch across multiple threads cut client-side latency to 14 ms while sustaining over 10 million app scores per second at peak.&lt;br /&gt;
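&lt;br /&gt;
A toy illustration of the multithreading idea, not Google&amp;#039;s serving code: one scoring batch is split into shards, scored in parallel, and reassembled.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;python&amp;quot;&amp;gt;&lt;br /&gt;
from concurrent.futures import ThreadPoolExecutor&lt;br /&gt;
import numpy as np&lt;br /&gt;
&lt;br /&gt;
weights = np.random.rand(256)          # stand-in for a trained model&lt;br /&gt;
batch = np.random.rand(10000, 256)     # hypothetical candidate batch&lt;br /&gt;
&lt;br /&gt;
def score_shard(shard):&lt;br /&gt;
    return shard @ weights             # stand-in for one forward pass&lt;br /&gt;
&lt;br /&gt;
shards = np.array_split(batch, 8)      # fan out across 8 worker threads&lt;br /&gt;
with ThreadPoolExecutor(max_workers=8) as pool:&lt;br /&gt;
    scores = np.concatenate(list(pool.map(score_shard, shards)))&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;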
&lt;br /&gt;
== Impact ==&lt;br /&gt;
&lt;br /&gt;
Wide &amp;amp; Deep became one of the canonical reference architectures for industrial click-through-rate (CTR) prediction and ranking, alongside [[Factorization Machines]] and the deep CTR models that followed. The pattern of pairing a memorization-friendly linear branch with a generalization-friendly deep branch motivated successors such as DeepFM, Deep &amp;amp; Cross Network, and xDeepFM, which automate the cross-feature engineering that the wide branch still relies on.&lt;br /&gt;
&lt;br /&gt;
The TensorFlow &amp;lt;code&amp;gt;DNNLinearCombinedClassifier&amp;lt;/code&amp;gt; / &amp;lt;code&amp;gt;DNNLinearCombinedRegressor&amp;lt;/code&amp;gt; estimators productized the architecture as a drop-in API, and the paper is widely cited in textbook treatments of [[recommender systems]] and [[deep learning]] applied to ranking. Beyond its direct influence on CTR models, the broader principle that complementary inductive biases can be combined under a shared loss, rather than via post-hoc ensembling, informed later hybrid designs that mix retrieval and ranking signals, structured priors with neural networks, or rule-based features with learned representations.&lt;br /&gt;
&lt;br /&gt;
The work is also frequently cited as an early successful case study of deploying deep learning in a high-traffic production ranking system. The discussion of warm-starting, vocabulary management, and quantile-based normalization of continuous features became a template for engineering teams building their first online deep ranking pipelines.&lt;br /&gt;
&lt;br /&gt;
== See also ==&lt;br /&gt;
&lt;br /&gt;
* [[Recommender system]]&lt;br /&gt;
* [[Deep learning]]&lt;br /&gt;
* [[Factorization Machines]]&lt;br /&gt;
* [[Logistic regression]]&lt;br /&gt;
* [[TensorFlow]]&lt;br /&gt;
* [[Click-through rate prediction]]&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
* Cheng, H.-T., Koc, L., Harmsen, J., Shaked, T., Chandra, T., Aradhye, H., Anderson, G., Corrado, G., Chai, W., Ispir, M., Anil, R., Haque, Z., Hong, L., Jain, V., Liu, X., &amp;amp; Shah, H. (2016). &amp;#039;&amp;#039;Wide &amp;amp; Deep Learning for Recommender Systems&amp;#039;&amp;#039;. arXiv:1606.07792.&lt;br /&gt;
* Duchi, J., Hazan, E., &amp;amp; Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. &amp;#039;&amp;#039;Journal of Machine Learning Research&amp;#039;&amp;#039;, 12, 2121–2159.&lt;br /&gt;
* McMahan, H. B. (2011). Follow-the-regularized-leader and mirror descent: Equivalence theorems and L1 regularization. In &amp;#039;&amp;#039;Proc. AISTATS&amp;#039;&amp;#039;.&lt;br /&gt;
* Rendle, S. (2012). Factorization machines with libFM. &amp;#039;&amp;#039;ACM Transactions on Intelligent Systems and Technology&amp;#039;&amp;#039;, 3(3), 57:1–57:22.&lt;br /&gt;
* He, K., Zhang, X., Ren, S., &amp;amp; Sun, J. (2016). Deep residual learning for image recognition. In &amp;#039;&amp;#039;Proc. IEEE Conference on Computer Vision and Pattern Recognition&amp;#039;&amp;#039;.&lt;br /&gt;
* Wang, H., Wang, N., &amp;amp; Yeung, D.-Y. (2015). Collaborative deep learning for recommender systems. In &amp;#039;&amp;#039;Proc. KDD&amp;#039;&amp;#039;, 1235–1244.&lt;br /&gt;
&lt;br /&gt;
[[Category:Machine Learning]]&lt;br /&gt;
[[Category:Research]]&lt;br /&gt;
[[Category:Research Papers]]&lt;/div&gt;</summary>
		<author><name>FuzzyBot</name></author>
	</entry>
</feed>