Deep & Cross Network for Ad Click Predictions
| Research Paper | |
|---|---|
| Authors | Ruoxi Wang; Bin Fu; Gang Fu; Mingliang Wang |
| Year | 2017 |
| Topic area | Machine Learning |
| Difficulty | Research |
| arXiv | 1708.05123 |
Deep & Cross Network for Ad Click Predictions (DCN) is a 2017 neural network architecture proposed by Ruoxi Wang, Bin Fu, Gang Fu, and Mingliang Wang of Google and Stanford for click-through rate (CTR) prediction. It augments a standard deep neural network with a parallel cross network that explicitly composes feature interactions of bounded degree at each layer, learning all polynomial cross terms up to a user-specified order with a parameter count that grows only linearly in the input dimension.
Overview
CTR prediction underpins billions of dollars of online advertising revenue but operates over feature spaces that are massive, sparse, and overwhelmingly categorical. Linear models scale well and are interpretable but cannot capture the cross-feature signal that drives accuracy; pure deep neural networks (DNNs) can in principle learn arbitrary functions but represent feature crosses only implicitly through stacked nonlinearities, often inefficiently.
DCN sits between these two regimes. After embedding sparse categorical inputs into low-dimensional dense vectors and stacking them with normalized continuous features, the model splits into two parallel branches: a cross network that applies an explicit, residual-style feature-crossing operation at every layer, and a standard deep network of fully connected ReLU layers. Their outputs are concatenated and passed through a logistic head trained with log loss. The cross network adds only $ O(d \cdot L_c) $ parameters on top of the DNN, where $ d $ is the embedded input dimension and $ L_c $ is the number of cross layers, yet captures all cross terms up to degree $ L_c + 1 $.
Key Contributions
- A novel cross network that applies explicit feature crossing at every layer, with the highest polynomial degree of the represented interactions provably equal to the layer depth plus one.
- A joint architecture that trains the cross network in parallel with a DNN, combining bounded-degree explicit crosses with deep implicit nonlinearities under one log-loss objective.
- A theoretical analysis showing that the cross network reproduces all multinomial cross terms of bounded degree, generalizes factorization machines (FMs) from a single shallow interaction to a stack of high-order ones, and projects the implicit $ d^2 $ pairwise interactions back to dimension $ d $ in linear time and memory.
- Empirical gains on Criteo Display Ads — the standard public CTR benchmark — together with strong results on the UCI forest covertype and Higgs datasets, showing that DCN matches or beats deep baselines while using substantially less memory.
Methods
The DCN model is composed of four stages: an embedding and stacking layer, the cross network, the deep network, and a combination layer.
Embedding and stacking. Each sparse categorical input $ \mathbf{x}_i $ is mapped through a learned matrix $ W_{\text{embed},i} \in \mathbb{R}^{n_e \times n_v} $ to a dense vector. Embedded categorical features are concatenated with normalized dense features $ \mathbf{x}_{\text{dense}} $ into a single vector $ \mathbf{x}_0 $ that feeds both branches.
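The embedding-and-stacking step can be sketched in NumPy. The field sizes, embedding dimensions, and lookup ids below are illustrative, not from the paper; each $ W_{\text{embed},i} \in \mathbb{R}^{n_e \times n_v} $ is stored so that a categorical id selects one column.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: two categorical fields with vocabularies of 1000 and 500,
# embedded to 8 and 6 dimensions, plus 4 normalized dense features.
vocab_sizes, embed_dims, n_dense = [1000, 500], [8, 6], 4
W_embed = [rng.normal(0.0, 0.01, size=(n_e, n_v))   # W_embed,i in R^{n_e x n_v}
           for n_v, n_e in zip(vocab_sizes, embed_dims)]

def stack_inputs(cat_ids, x_dense):
    """Look up each categorical id (one column of W_embed,i) and
    concatenate with the normalized dense features to form x0."""
    embedded = [W[:, i] for W, i in zip(W_embed, cat_ids)]
    return np.concatenate(embedded + [x_dense])

x0 = stack_inputs([42, 7], rng.normal(size=n_dense))
print(x0.shape)  # (18,) = 8 + 6 + 4
```

The resulting $ \mathbf{x}_0 $ is the shared input to both the cross and deep branches.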
Cross network. Let $ \mathbf{x}_l \in \mathbb{R}^d $ denote the output of cross layer $ l $. Each layer applies
- $ \mathbf{x}_{l+1} = \mathbf{x}_0 \mathbf{x}_l^{T} \mathbf{w}_l + \mathbf{b}_l + \mathbf{x}_l, $
where $ \mathbf{w}_l, \mathbf{b}_l \in \mathbb{R}^d $. The outer-product term $ \mathbf{x}_0 \mathbf{x}_l^{T} $ creates pairwise interactions between the original input and the current state; the residual connection preserves lower-order signal. A theorem in the paper establishes that an $ l $-layer cross network contains every cross term $ x_1^{\alpha_1} x_2^{\alpha_2} \cdots x_d^{\alpha_d} $ of degree $ 1 \le |\boldsymbol{\alpha}| \le l + 1 $, each with a distinct coefficient determined by the weights $ \{\mathbf{w}_k\} $.
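A minimal sketch of the cross layer, with random weights standing in for learned parameters. Because $ \mathbf{x}_l^{T} \mathbf{w}_l $ is a scalar, the layer never materializes the $ d \times d $ outer product:

```python
import numpy as np

def cross_layer(x0, xl, w, b):
    """One cross layer: x_{l+1} = x0 (xl . w) + b + xl.
    The scalar xl @ w is computed first, so the cost is O(d)."""
    return x0 * (xl @ w) + b + xl

rng = np.random.default_rng(1)
d, L_c = 6, 3                      # small illustrative sizes
x0 = rng.normal(size=d)
x = x0
for _ in range(L_c):
    w, b = rng.normal(size=d), np.zeros(d)
    x = cross_layer(x0, x, w, b)   # after L_c layers: terms up to degree L_c + 1
print(x.shape)  # (6,)
```

Each pass feeds $ \mathbf{x}_0 $ back in, which is how the polynomial degree climbs by one per layer while the state stays $ d $-dimensional.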
Deep network. A standard fully connected feed-forward stack with ReLU activations:
- $ \mathbf{h}_{l+1} = f(W_l \mathbf{h}_l + \mathbf{b}_l). $
Combination layer. The final cross-network output $ \mathbf{x}_{L_1} $ and deep-network output $ \mathbf{h}_{L_2} $ are concatenated and passed through a logistic head:
- $ p = \sigma\!\left(\mathbf{w}_{\text{logits}}^{T} [\mathbf{x}_{L_1};\, \mathbf{h}_{L_2}]\right),\qquad \sigma(x) = \frac{1}{1 + e^{-x}}. $
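The full forward pass can be sketched end to end. All layer widths and random parameters here are illustrative stand-ins; the structure (parallel branches, concatenation, sigmoid head) follows the equations above:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 6  # illustrative embedded input dimension

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_branch(x0, params):
    """Cross network: x_{l+1} = x0 (x . w) + b + x at each layer."""
    x = x0
    for w, b in params:
        x = x0 * (x @ w) + b + x
    return x

def deep_branch(x0, params):
    """Deep network: h_{l+1} = relu(W h + b) at each layer."""
    h = x0
    for W, b in params:
        h = relu(W @ h + b)
    return h

cross_params = [(rng.normal(size=d), np.zeros(d)) for _ in range(2)]
deep_params = [(rng.normal(size=(8, d)), np.zeros(8)),
               (rng.normal(size=(8, 8)), np.zeros(8))]
w_logits = rng.normal(size=d + 8)

x0 = rng.normal(size=d)
concat = np.concatenate([cross_branch(x0, cross_params),
                         deep_branch(x0, deep_params)])
p = sigmoid(w_logits @ concat)     # predicted click probability
print(0.0 < p < 1.0)  # True
```

In training, $ p $ feeds the regularized log loss below and all parameters (embeddings, cross weights, deep weights, logits) are optimized jointly.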
The training loss is the regularized log loss
- $ \mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N} \big[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \big] + \lambda \|\mathbf{w}\|^2. $
Efficient projection. A direct construction — materializing the $ d \times d $ matrix $ \mathbf{x}_0 \mathbf{x}_l^{T} $ and then multiplying by $ \mathbf{w}_l $ — would take $ O(d^2) $ time and memory per layer; the cross-layer formula collapses this to $ O(d) $ work and parameters because $ \mathbf{x}_0 \mathbf{x}_l^{T} \mathbf{w}_l $ can be computed as $ \mathbf{x}_0 (\mathbf{x}_l^{T} \mathbf{w}_l) $ — a vector–scalar product.
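A quick numerical check of this associativity trick, comparing the naive outer-product route against the scalar-first form on random vectors:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 512
x0, xl, w = rng.normal(size=d), rng.normal(size=d), rng.normal(size=d)

# Naive: materialize the d x d outer product, then multiply -- O(d^2).
naive = np.outer(x0, xl) @ w
# Associative form: compute the scalar xl . w first, then scale x0 -- O(d).
fast = x0 * (xl @ w)
print(np.allclose(naive, fast))  # True
```

The two routes agree to floating-point precision; only the cost differs.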
Connection to FMs. In an FM, each feature $ x_i $ carries a vector $ \mathbf{v}_i $ and the weight of $ x_i x_j $ is $ \langle \mathbf{v}_i, \mathbf{v}_j \rangle $. In DCN the analogous parameters are scalars $ \{w_k^{(i)}\}_{k=1}^{l} $, and the weight of $ x_i x_j $ is a product across cross layers. DCN therefore extends FM's parameter sharing from a single second-order interaction to arbitrary-degree interactions across multiple layers.
Results
Criteo Display Ads. On the public Criteo CTR challenge (about 41 million records, 13 integer and 26 categorical features), DCN achieved a test log loss of 0.4422 ± 9 × 10⁻⁵, compared with 0.4430 ± 3.7 × 10⁻⁴ for a tuned DNN, 0.4430 ± 4.3 × 10⁻⁴ for Deep Crossing (DC), and weaker results for logistic regression, FMs, and Wide & Deep. The optimal DCN used 6 cross layers and 2 deep layers of size 1024; the win for the deepest cross configuration in the search supports the claim that higher-order explicit interactions are valuable. In follow-up sweeps over memory budget and loss tolerance, DCN matched DNN accuracy while using roughly 40% fewer parameters, and matched the best DNN log loss with about an order-of-magnitude smaller deep stack.
Non-CTR datasets. On UCI forest covertype (581k samples, 54 features), DCN reached 0.9740 test accuracy versus 0.9737 for DNN and DC, with the smallest memory footprint. On Higgs (11M samples, 28 features), DCN obtained log loss 0.4494 against 0.4506 for DNN, while using roughly half the parameters.
Impact
DCN became one of the canonical baselines for deep CTR and recommender models, alongside Wide & Deep and DeepFM. Its core idea — a parameter-efficient module that performs explicit, higher-order feature crossing alongside a DNN — was widely adopted in industry feature-interaction modeling, and the original cross-layer formulation was later refined into DCN-V2 (Wang et al., 2021) using a full weight matrix per cross layer for greater expressiveness at production scale at Google. Beyond advertising, the architecture's strong showing on dense classification tasks helped popularize parallel "explicit + implicit" feature-interaction designs in tabular deep learning.
See also
- Wide & Deep Learning for Recommender Systems
- Factorization Machines
- Deep Crossing
- Batch Normalization: Accelerating Deep Network Training
- Adam: A Method for Stochastic Optimization
- Click-through rate
References
- Wang, R., Fu, B., Fu, G., & Wang, M. (2017). Deep & Cross Network for Ad Click Predictions. Proceedings of the ADKDD'17. arXiv:1708.05123.
- Cheng, H.-T. et al. (2016). Wide & Deep Learning for Recommender Systems. DLRS.
- Rendle, S. (2010). Factorization Machines. ICDM.
- Shan, Y. et al. (2016). Deep Crossing: Web-Scale Modeling without Manually Crafted Combinatorial Features. KDD.
- Ioffe, S., & Szegedy, C. (2015). Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. ICML.
- Kingma, D. P., & Ba, J. (2014). Adam: A Method for Stochastic Optimization. ICLR.
- Wang, R. et al. (2021). DCN V2: Improved Deep & Cross Network and Practical Lessons for Web-scale Learning to Rank Systems. WWW.