Logistic regression
| Article | |
|---|---|
| Topic area | Machine Learning |
| Difficulty | Introductory |
Logistic regression is a foundational statistical model for binary classification that predicts the probability of a categorical outcome by passing a linear combination of features through the logistic (sigmoid) function. Despite its name, it is a classification method, not a regression method, and it remains one of the most widely used and interpretable models in statistics, epidemiology, and machine learning.
Overview
Logistic regression models the probability that an observation belongs to a positive class as a function of input features. Given a feature vector $ \mathbf{x} \in \mathbb{R}^d $ and a binary label $ y \in \{0, 1\} $, the model assumes that the log-odds of the positive class are a linear function of $ \mathbf{x} $. The output is constrained to the unit interval, making it directly interpretable as a probability and well-suited for downstream decisions, calibration, and risk scoring.
The popularity of logistic regression stems from a rare combination of properties: it is a probabilistic classifier with a convex loss, parameters are easy to interpret as log-odds ratios, training scales to massive datasets via Stochastic Gradient Descent, and it doubles as the final layer of most modern Neural Networks used for binary classification.
Key Concepts
- Sigmoid (logistic) function: the squashing nonlinearity $ \sigma(z) = 1/(1 + e^{-z}) $ that maps any real number to $ (0, 1) $.
- Linear decision boundary: in feature space, the set of points where $ \mathbf{w}^{\!\top}\mathbf{x} + b = 0 $ separates the two classes; logistic regression is therefore a linear classifier.
- Log-odds (logit): the inverse of the sigmoid, $ \mathrm{logit}(p) = \log\frac{p}{1-p} $; logistic regression assumes the logit is linear in the features (see the numeric check after this list).
- Maximum likelihood estimation: parameters are fit by maximising the probability of the observed labels under the model.
- Cross-entropy loss: the negative log-likelihood of the Bernoulli model, equivalent to the Cross-Entropy Loss used in deep learning.
- Convex optimisation: the loss is convex in the parameters, so any local minimum is a global one.
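The sigmoid and the logit are exact inverses of one another, which is easy to confirm numerically. The short sketch below is a minimal illustration using NumPy; the function names are ours, not taken from any particular library.

```python
import numpy as np

def sigmoid(z):
    # Maps any real number into the open interval (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    # Log-odds: the inverse of the sigmoid on (0, 1).
    return np.log(p / (1.0 - p))

z = np.linspace(-5, 5, 11)
p = sigmoid(z)

print(p)                          # every value lies strictly between 0 and 1
print(np.allclose(logit(p), z))   # True: the logit undoes the sigmoid
```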
History
The logistic function was introduced by Belgian mathematician Pierre François Verhulst in 1838 to model population growth under resource constraints. Its use as a statistical tool grew through the early twentieth century in chemistry and biology, where it described autocatalytic reactions and dose-response curves.
The modern statistical formulation took shape in the mid-twentieth century. Joseph Berkson popularised the term logit in 1944 as an alternative to the probit model favoured by Chester Bliss and R. A. Fisher. David Cox's 1958 paper "The Regression Analysis of Binary Sequences" established logistic regression as the standard tool for binary outcomes in statistics, and Walker and Duncan (1967) extended it to multiple covariates.
In the 1970s and 1980s, logistic regression became the default model for case-control studies in epidemiology, partly because the odds ratio it produces is invariant to outcome-based sampling. With the rise of machine learning, the model found a second life as a baseline classifier and as the output layer of Neural Networks. Multinomial logistic regression, which generalises the model to more than two classes via the Softmax Function, is the workhorse classifier underlying nearly all modern deep classification systems.
Key Approaches
Model Specification
For a binary label $ y \in \{0, 1\} $, logistic regression models
- $ P(y = 1 \mid \mathbf{x}) = \sigma(\mathbf{w}^{\!\top}\mathbf{x} + b) = \frac{1}{1 + e^{-(\mathbf{w}^{\!\top}\mathbf{x} + b)}} $
Equivalently, the log-odds are linear:
- $ \log \frac{P(y=1 \mid \mathbf{x})}{P(y=0 \mid \mathbf{x})} = \mathbf{w}^{\!\top}\mathbf{x} + b $
A unit increase in $ x_j $ multiplies the odds of the positive class by $ e^{w_j} $, holding other features fixed. This direct interpretation of coefficients as log-odds ratios is one of the model's defining strengths.
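As a concrete illustration, the sketch below uses made-up weights and a made-up feature vector (none of these numbers come from the article) to compute the predicted probability and to confirm that a unit increase in one feature multiplies the odds by $ e^{w_j} $.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([0.8, -0.4])   # hypothetical coefficients
b = -0.2                    # hypothetical intercept
x = np.array([1.5, 2.0])    # hypothetical feature vector

p = sigmoid(w @ x + b)      # P(y = 1 | x)
odds = p / (1 - p)

# Increase x_0 by one unit: the odds should scale by exp(w[0]).
x2 = x + np.array([1.0, 0.0])
p2 = sigmoid(w @ x2 + b)
odds2 = p2 / (1 - p2)

print(p, p2)
print(np.isclose(odds2 / odds, np.exp(w[0])))   # True
```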
Maximum Likelihood and Cross-Entropy
Given a dataset $ \{(\mathbf{x}_i, y_i)\}_{i=1}^{N} $, the likelihood under the Bernoulli model is
- $ \mathcal{L}(\mathbf{w}, b) = \prod_{i=1}^{N} p_i^{y_i}(1 - p_i)^{1 - y_i}, \quad p_i = \sigma(\mathbf{w}^{\!\top}\mathbf{x}_i + b) $
Taking the negative log gives the binary cross-entropy loss:
- $ \mathcal{J}(\mathbf{w}, b) = -\frac{1}{N}\sum_{i=1}^{N} \big[y_i \log p_i + (1 - y_i)\log(1 - p_i)\big] $
This loss is convex, and its gradient has the elegant form
- $ \nabla_{\mathbf{w}} \mathcal{J} = \frac{1}{N} \sum_{i=1}^{N} (p_i - y_i)\,\mathbf{x}_i $
i.e. the average feature vector weighted by the prediction error.
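A minimal sketch of the loss and gradient above, evaluated on synthetic data (the array names and sizes are illustrative), with a finite-difference check of one gradient coordinate:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))               # synthetic features
y = (rng.random(200) < 0.5).astype(float)   # synthetic binary labels

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_and_grad(w, b, X, y):
    p = sigmoid(X @ w + b)
    loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    grad_w = X.T @ (p - y) / len(y)         # (1/N) sum_i (p_i - y_i) x_i
    grad_b = np.mean(p - y)
    return loss, grad_w, grad_b

w, b = np.zeros(3), 0.0
loss, gw, gb = loss_and_grad(w, b, X, y)

# Forward finite difference on the first weight coordinate.
eps = 1e-6
w_plus = w.copy()
w_plus[0] += eps
numerical = (loss_and_grad(w_plus, b, X, y)[0] - loss) / eps
print(np.isclose(numerical, gw[0], atol=1e-4))   # True
```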
Optimisation
Unlike linear regression, logistic regression has no closed-form solution. Standard optimisation choices include:
- Iteratively reweighted least squares (IRLS): the classical statistical algorithm, equivalent to Newton's method on the log-likelihood; converges in few iterations on small problems.
- Gradient Descent and L-BFGS: practical for medium-scale problems where IRLS is too memory-hungry (a minimal gradient-descent sketch follows this list).
- Stochastic Gradient Descent: the default for large-scale and online settings, with the same gradient form as a single-layer neural network.
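To make the gradient-descent option concrete, here is a minimal full-batch sketch on synthetic data generated from the model itself. The learning rate, iteration count, and generating parameters are arbitrary illustrative choices, not recommendations.

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 500, 2
X = rng.normal(size=(N, d))
true_w, true_b = np.array([2.0, -1.0]), 0.5   # hypothetical generating parameters
y = (rng.random(N) < 1.0 / (1.0 + np.exp(-(X @ true_w + true_b)))).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w, b, lr = np.zeros(d), 0.0, 0.1
for step in range(2000):
    p = sigmoid(X @ w + b)
    grad_w = X.T @ (p - y) / N     # gradient from the previous subsection
    grad_b = np.mean(p - y)
    w -= lr * grad_w
    b -= lr * grad_b

print(w, b)   # should land near the generating parameters, up to sampling noise
```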
Regularisation
To prevent overfitting and stabilise estimates when features are correlated or numerous, the loss is augmented with a penalty:
- $ \mathcal{J}_{\mathrm{reg}}(\mathbf{w}) = \mathcal{J}(\mathbf{w}) + \lambda\, R(\mathbf{w}) $
L2 (ridge) regularisation, $ R(\mathbf{w}) = \tfrac{1}{2}\|\mathbf{w}\|_2^2 $, shrinks weights toward zero and corresponds to a Gaussian prior. L1 (lasso) regularisation, $ R(\mathbf{w}) = \|\mathbf{w}\|_1 $, promotes sparsity and acts as feature selection. Elastic Net combines both. See Overfitting and Regularization for the broader treatment.
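Adding an L2 penalty changes only the loss and its gradient. The sketch below shows the modification under the usual convention that the bias is not penalised; the value of $ \lambda $ and the function name are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def l2_regularised_loss_and_grad(w, b, X, y, lam):
    p = sigmoid(X @ w + b)
    data_loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    loss = data_loss + lam * 0.5 * np.sum(w ** 2)   # penalty on weights only
    grad_w = X.T @ (p - y) / len(y) + lam * w       # extra lam * w term
    grad_b = np.mean(p - y)                         # bias left unregularised
    return loss, grad_w, grad_b

# Usage (with X, y, w, b defined as in the earlier sketches):
# loss, gw, gb = l2_regularised_loss_and_grad(w, b, X, y, lam=0.1)
```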
Multinomial Extension
For $ K > 2 $ classes, logistic regression generalises to multinomial logistic regression (also called softmax regression):
- $ P(y = k \mid \mathbf{x}) = \frac{\exp(\mathbf{w}_k^{\!\top}\mathbf{x} + b_k)}{\sum_{j=1}^{K} \exp(\mathbf{w}_j^{\!\top}\mathbf{x} + b_j)} $
This is exactly the Softmax Function applied to a linear score, and it forms the output layer of essentially every modern multi-class classifier built with Neural Networks.
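A minimal sketch of the multinomial predictive distribution, using the standard max-subtraction trick so the exponentials do not overflow; the shapes and random values are purely illustrative.

```python
import numpy as np

def softmax_probs(X, W, b):
    # X: (N, d) features, W: (d, K) class weight vectors, b: (K,) biases.
    scores = X @ W + b                            # linear scores, shape (N, K)
    scores -= scores.max(axis=1, keepdims=True)   # stabilise the exponentials
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

rng = np.random.default_rng(2)
X = rng.normal(size=(4, 3))    # 4 examples, 3 features
W = rng.normal(size=(3, 5))    # 5 classes
b = np.zeros(5)

P = softmax_probs(X, W, b)
print(P.shape)                          # (4, 5)
print(np.allclose(P.sum(axis=1), 1.0))  # True: each row is a distribution
```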
Connections
Logistic regression sits at a crossroads of several major themes in statistics and machine learning. Structurally, it is the simplest case of a neural network: a single neuron with a sigmoid activation. Its loss function is precisely the Cross-Entropy Loss used to train deep classifiers, and the gradient computation is a one-step instance of Backpropagation. The optimiser of choice in modern practice, Stochastic Gradient Descent, has been studied extensively in the setting of generalised linear models such as this one.
Logistic regression is also a generalised linear model (GLM) with a Bernoulli response and the canonical logit link, placing it in the same family as Poisson regression and Linear Regression (with Gaussian response and identity link). It is closely related to linear discriminant analysis (LDA): both produce linear decision boundaries, but LDA models $ P(\mathbf{x} \mid y) $ while logistic regression models $ P(y \mid \mathbf{x}) $ directly, making it a discriminative rather than generative classifier. The multinomial form connects directly to the Softmax Function and is the standard final layer for classifiers operating on Word Embeddings and the outputs of Attention Mechanisms alike.
See also
- Linear Regression
- Cross-Entropy Loss
- Softmax Function
- Gradient Descent
- Stochastic Gradient Descent
- Neural Networks
- Overfitting and Regularization
- Loss Functions
References
- Cox, D. R. (1958). "The Regression Analysis of Binary Sequences". Journal of the Royal Statistical Society, Series B, 20(2), 215–242.
- Berkson, J. (1944). "Application of the Logistic Function to Bio-Assay". Journal of the American Statistical Association, 39(227), 357–365.
- Hosmer, D. W., Lemeshow, S. and Sturdivant, R. X. (2013). Applied Logistic Regression (3rd ed.). Wiley.
- McCullagh, P. and Nelder, J. A. (1989). Generalized Linear Models (2nd ed.). Chapman and Hall.
- Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer, Chapter 4.
- Hastie, T., Tibshirani, R. and Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.). Springer, Chapter 4.