Factorization Machines
Revision as of 23:33, 27 April 2026
| Article | |
|---|---|
| Topic area | Machine Learning |
| Difficulty | Introductory |
Factorization Machines (FMs) are a family of supervised learning models that combine the expressiveness of polynomial regression with the parameter efficiency of low-rank matrix factorization. They are particularly effective on sparse, high-dimensional data such as the categorical features that arise in click-through rate prediction, recommender systems, and ranking problems.
Overview
Standard polynomial regression models all pairwise feature interactions with an independent parameter per pair, which is impractical when the input is sparse: most pairs are observed too rarely to estimate reliably. Factorization machines sidestep this by representing each feature with a low-dimensional latent vector and modelling each interaction as the dot product of the corresponding vectors. This dramatically reduces parameter count, allows interaction strengths to be inferred even for pairs never co-observed in training, and yields a model that trains in time linear in the number of non-zero entries of the input.
FMs were introduced by Steffen Rendle in 2010 as a unified framework that subsumes several specialised collaborative filtering models (matrix factorization, SVD++, factorized personalised Markov chains) and also generalises naturally to arbitrary feature vectors. Today they remain a strong baseline for tabular sparse data and form the algorithmic core of many production ranking systems.
Key Concepts
- Latent factor representation — each feature $ j $ is associated with a vector $ \mathbf{v}_j \in \mathbb{R}^k $ of latent factors; pairwise interactions reuse these shared vectors.
- Sparsity friendliness — gradients are non-zero only for active (non-zero) features, so optimisation cost scales with input density rather than dimensionality.
- Generality — a single feature vector can encode user IDs, item IDs, contextual signals, and side information, letting FMs replace several hand-crafted recommender models.
- Order of interactions — second-order FMs model pairwise interactions; higher-order FMs extend the idea to triples and beyond, though second order dominates in practice.
- Linear-time prediction — a closed-form rearrangement reduces the apparent $ O(d^2 k) $ sum over pairs to $ O(d k) $, making inference practical at web scale.
History
Factorization machines were proposed by Steffen Rendle at ICDM 2010 and elaborated in the 2012 ACM TIST paper "Factorization Machines with libFM", which introduced the influential libFM reference implementation supporting stochastic gradient descent, alternating least squares, and Markov chain Monte Carlo training. In 2011 Rendle and collaborators showed that FMs subsume tensor factorization and time-aware recommender variants.
In 2016 Juan, Zhuang, Chin, and Lin introduced Field-aware Factorization Machines (FFMs) — a refinement in which a feature has a different latent vector per interacting field — and used FFMs to win the Criteo and Avazu CTR prediction Kaggle competitions, cementing the family's reputation in industrial advertising. The following year, Guo et al. fused FMs with deep neural networks in DeepFM, launching a wave of "deep + FM" hybrids (xDeepFM, AutoInt, DCN-V2) that dominated CTR benchmarks for the latter half of the 2010s. FMs and their descendants remain widely deployed at companies including Criteo, Yahoo, Alibaba, and Meituan.
Key Approaches
The second-order model
For an input vector $ \mathbf{x} \in \mathbb{R}^d $, a second-order factorization machine predicts:
- $ \hat{y}(\mathbf{x}) = w_0 + \sum_{j=1}^{d} w_j x_j + \sum_{j=1}^{d} \sum_{j'=j+1}^{d} \langle \mathbf{v}_j, \mathbf{v}_{j'} \rangle \, x_j x_{j'} $
where $ w_0 \in \mathbb{R} $ is a global bias, $ w_j \in \mathbb{R} $ are first-order weights, and $ \mathbf{v}_j \in \mathbb{R}^k $ are latent factors. The hyperparameter $ k $ controls capacity; values between 8 and 64 are typical.
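The prediction equation can be sketched directly in NumPy using the explicit pairwise sum. This is a minimal illustration (function and variable names are not from any particular library):

```python
import numpy as np

def fm_predict_naive(x, w0, w, V):
    """Second-order FM prediction via the explicit O(d^2 k) pairwise sum.

    x: (d,) input vector; w0: global bias; w: (d,) first-order weights;
    V: (d, k) latent factor matrix whose row j is v_j.
    """
    y = w0 + w @ x                      # bias + first-order part
    d = len(x)
    for j in range(d):
        for jp in range(j + 1, d):      # each unordered pair once
            y += (V[j] @ V[jp]) * x[j] * x[jp]
    return y
```

The double loop makes the model definition transparent; production implementations use the linear-time rearrangement described next in the article.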
Linear-time reformulation
The pairwise sum can be rewritten as:
- $ \sum_{j<j'} \langle \mathbf{v}_j, \mathbf{v}_{j'} \rangle x_j x_{j'} = \tfrac{1}{2} \sum_{f=1}^{k} \left[ \left( \sum_{j=1}^{d} v_{j,f} x_j \right)^{\!2} - \sum_{j=1}^{d} v_{j,f}^{2} x_j^{2} \right] $
This identity collapses the apparent quadratic cost to $ O(d k) $ per example, and to $ O(\bar{n} k) $ when only $ \bar{n} $ entries of $ \mathbf{x} $ are non-zero. The same trick yields cheap analytic gradients, enabling efficient training with stochastic gradient descent or coordinate descent (as in alternating least squares).
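For concreteness, here is the identity in NumPy, written as a self-contained sketch of the interaction term alone:

```python
import numpy as np

def fm_interaction_linear(x, V):
    """Pairwise FM interaction term in O(d k) via the sum-of-squares identity.

    x: (d,) input vector; V: (d, k) latent factor matrix.
    """
    xv = x @ V                    # per-factor sums: sum_j v_{j,f} x_j, shape (k,)
    sq = (x ** 2) @ (V ** 2)      # per-factor sums of squares: sum_j v_{j,f}^2 x_j^2
    return 0.5 * np.sum(xv ** 2 - sq)
```

For a sparse `x` stored as (index, value) pairs, both `xv` and `sq` can be accumulated over the non-zero entries only, giving the $ O(\bar{n} k) $ cost quoted above.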
Loss functions and training
FMs are loss-agnostic. Common choices are squared loss for regression, logistic loss for binary classification (CTR prediction), and pairwise ranking losses such as Bayesian Personalised Ranking. Training options include stochastic gradient descent, alternating least squares (closed-form coordinate updates that exploit the model's bilinear structure), and Bayesian inference via Gibbs sampling, which removes most hyperparameter tuning at the cost of compute.
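As an illustration of loss-agnostic training, the sketch below performs one SGD step under logistic loss with labels $ y \in \{-1, +1\} $. It is a didactic sketch, not libFM's implementation; learning rate and regularisation values are placeholders:

```python
import numpy as np

def fm_sgd_step(x, y, w0, w, V, lr=0.01, reg=1e-4):
    """One SGD step for a second-order FM with logistic loss, y in {-1, +1}."""
    xv = x @ V
    y_hat = w0 + w @ x + 0.5 * np.sum(xv ** 2 - (x ** 2) @ (V ** 2))
    g = -y / (1.0 + np.exp(y * y_hat))            # d loss / d y_hat
    # Linear-time analytic gradients of y_hat:
    #   d y_hat / d V[j, f] = x_j * xv[f] - V[j, f] * x_j**2
    w0_new = w0 - lr * g
    w_new = w - lr * (g * x + reg * w)
    V_new = V - lr * (g * (np.outer(x, xv) - V * (x ** 2)[:, None]) + reg * V)
    return w0_new, w_new, V_new
```

Swapping `g` for the derivative of squared loss or a pairwise ranking loss changes nothing else in the update, which is what "loss-agnostic" means in practice.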
Field-aware Factorization Machines
In FFMs, every feature carries a separate latent vector for each field (group of related features) it can interact with. If feature $ j $ belongs to field $ F_j $, the interaction term becomes $ \langle \mathbf{v}_{j, F_{j'}}, \mathbf{v}_{j', F_j} \rangle x_j x_{j'} $. This raises parameter count by a factor equal to the number of fields but consistently improves CTR accuracy when fields are well chosen.
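The field-aware interaction term can be written down directly; the layout of the latent tensor below is one reasonable choice for illustration, not the storage format of any particular FFM library:

```python
import numpy as np

def ffm_interaction(x, V, field):
    """FFM pairwise interaction term.

    x: (d,) input vector.
    V: (d, n_fields, k) — V[j, f] is feature j's latent vector for
       interactions with features in field f.
    field: (d,) integer array; field[j] is feature j's own field index.
    """
    d = len(x)
    total = 0.0
    for j in range(d):
        for jp in range(j + 1, d):
            # each side of the pair uses its vector for the *other* feature's field
            total += (V[j, field[jp]] @ V[jp, field[j]]) * x[j] * x[jp]
    return total
```

Note that the sum-of-squares trick no longer applies here, because the vector used for feature $ j $ depends on its partner's field; FFM prediction is genuinely quadratic in the number of active features.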
DeepFM and neural hybrids
DeepFM shares an embedding table between an FM component (capturing low-order interactions) and a deep neural network (capturing high-order non-linear interactions). xDeepFM adds an explicit cross-network for vector-wise products, and AutoInt replaces the FM component with a self-attention layer (see Attention Mechanisms) that learns which interactions matter. Despite continuing innovation, well-tuned plain FMs remain competitive on many tabular benchmarks.
Higher-order factorization machines
The framework extends to $ m $-way interactions:
- $ \hat{y}(\mathbf{x}) = w_0 + \sum_j w_j x_j + \sum_{l=2}^{m} \sum_{j_1 < \dots < j_l} \left( \prod_{r=1}^{l} x_{j_r} \right) \sum_{f=1}^{k_l} \prod_{r=1}^{l} v^{(l)}_{j_r,f} $
Higher orders capture richer combinatorial structure but inflate parameter count and rarely outperform second-order FMs paired with thoughtful feature engineering.
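A direct (combinatorial-cost) evaluation of the order-$ l $ term makes the formula concrete and serves as a reference point; names are illustrative:

```python
import numpy as np
from itertools import combinations

def hofm_term(x, V_l, l):
    """Naive order-l interaction term of a higher-order FM.

    Sums prod_r x_{j_r} * sum_f prod_r v_{j_r, f} over all index
    subsets j_1 < ... < j_l.  V_l: (d, k_l) factors for order l.
    """
    total = 0.0
    for idx in combinations(range(len(x)), l):
        rows = V_l[list(idx)]                              # (l, k_l)
        total += np.prod(x[list(idx)]) * np.sum(np.prod(rows, axis=0))
    return total
```

For $ l = 2 $ this reduces exactly to the pairwise term of a second-order FM; Blondel et al. (2016) show how dynamic programming over ANOVA kernels avoids the exponential enumeration used here.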
Connections
Factorization machines occupy a middle ground between several classical models. With no interaction term they reduce to linear regression; with logistic loss and no factorization they recover logistic regression with manually crafted cross features. When the input encodes only one-hot user and item indicators they recover matrix factorization (the workhorse of collaborative filtering).

FMs share their low-rank inductive bias with word embeddings — both rely on the idea that interactions between sparse symbolic entities can be decomposed into shared latent vectors. Training typically uses stochastic gradient descent or its variants, with the same regularisation and learning-rate considerations as in any loss-driven setting; L2 regularisation on both $ w_j $ and $ \mathbf{v}_j $ is standard. In hybrid models such as DeepFM, FMs supply an explicit second-order signal that complements the implicit higher-order features learned by a deep neural network via backpropagation. For binary CTR tasks the final prediction is squashed through a sigmoid (a binary softmax) and trained with cross-entropy loss.
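The matrix-factorization reduction is easy to verify numerically: with a one-hot user and a one-hot item, the FM prediction collapses to $ w_0 + b_u + b_i + \langle \mathbf{v}_u, \mathbf{v}_i \rangle $. The small sketch below uses arbitrary random parameters purely for illustration:

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """O(d k) second-order FM prediction via the sum-of-squares identity."""
    xv = x @ V
    return w0 + w @ x + 0.5 * np.sum(xv ** 2 - (x ** 2) @ (V ** 2))

# One-hot encoding of (user u, item i): d = n_users + n_items features.
n_users, n_items, k = 3, 4, 2
rng = np.random.default_rng(1)
w0 = 0.1
w = rng.normal(size=n_users + n_items)          # user and item biases
V = rng.normal(size=(n_users + n_items, k))     # user and item factors
u, i = 1, 2
x = np.zeros(n_users + n_items)
x[u] = 1.0                                      # user one-hot
x[n_users + i] = 1.0                            # item one-hot
mf_score = w0 + w[u] + w[n_users + i] + V[u] @ V[n_users + i]
```

Exactly two features are active, so the only surviving pairwise term is the user–item dot product — the classic matrix factorization score with biases.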
See also
- Linear Regression
- Stochastic Gradient Descent
- Loss Functions
- Overfitting and Regularization
- Neural Networks
- Word Embeddings
- Attention Mechanisms
References
- Rendle, S. (2010). "Factorization Machines". Proceedings of the IEEE International Conference on Data Mining (ICDM).
- Rendle, S. (2012). "Factorization Machines with libFM". ACM Transactions on Intelligent Systems and Technology, 3(3).
- Juan, Y., Zhuang, Y., Chin, W.-S. and Lin, C.-J. (2016). "Field-aware Factorization Machines for CTR Prediction". Proceedings of the 10th ACM Conference on Recommender Systems.
- Guo, H., Tang, R., Ye, Y., Li, Z. and He, X. (2017). "DeepFM: A Factorization-Machine based Neural Network for CTR Prediction". Proceedings of IJCAI.
- Lian, J., Zhou, X., Zhang, F., Chen, Z., Xie, X. and Sun, G. (2018). "xDeepFM: Combining Explicit and Implicit Feature Interactions for Recommender Systems". Proceedings of KDD.
- Blondel, M., Fujino, A., Ueda, N. and Ishihata, M. (2016). "Higher-Order Factorization Machines". Advances in Neural Information Processing Systems (NIPS).