{{LanguageBar | page = Linear Regression}}
{{ArticleInfobox | topic_area = Statistics | difficulty = Introductory | prerequisites = }}
{{ContentMeta | generated_by = claude-opus | model_used = claude-opus-4-6 | generated_date = 2026-03-13}}

'''Linear regression''' is a fundamental statistical method that models the relationship between a dependent variable and one or more independent variables by fitting a linear equation to observed data. It is one of the oldest and most widely used techniques in statistics and machine learning, serving as both a practical predictive tool and a building block for understanding more complex models.

== Problem Setup ==

Given a dataset of <math>N</math> observations <math>\{(\mathbf{x}_i, y_i)\}_{i=1}^{N}</math>, where <math>\mathbf{x}_i \in \mathbb{R}^d</math> is a feature vector and <math>y_i \in \mathbb{R}</math> is the target, linear regression assumes the relationship:

:<math>y_i = \mathbf{w}^{\!\top} \mathbf{x}_i + b + \epsilon_i</math>

where <math>\mathbf{w} \in \mathbb{R}^d</math> is the weight vector, <math>b</math> is the bias (intercept), and <math>\epsilon_i</math> is the error term. By absorbing the bias into the weight vector (appending a 1 to each <math>\mathbf{x}_i</math>), the model simplifies to <math>y_i = \mathbf{w}^{\!\top} \mathbf{x}_i + \epsilon_i</math>.
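
To make the setup concrete, here is a minimal NumPy sketch that simulates data from this model; all sizes, weights, and noise levels are illustrative choices, not values from the text.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
N, d = 200, 3                          # illustrative dataset sizes
w_true = np.array([1.5, -2.0, 0.5])    # ground-truth weights w
b_true = 0.7                           # ground-truth intercept b

X = rng.normal(size=(N, d))            # feature vectors x_i, stacked row-wise
eps = rng.normal(scale=0.1, size=N)    # noise terms epsilon_i
y = X @ w_true + b_true + eps          # y_i = w^T x_i + b + epsilon_i
</syntaxhighlight>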

== Ordinary Least Squares ==

The '''ordinary least squares''' (OLS) method finds the weights that minimize the sum of squared residuals:

:<math>\mathcal{L}(\mathbf{w}) = \sum_{i=1}^{N} (y_i - \mathbf{w}^{\!\top} \mathbf{x}_i)^2 = \|\mathbf{y} - X\mathbf{w}\|^2</math>

where <math>X \in \mathbb{R}^{N \times d}</math> is the design matrix and <math>\mathbf{y} \in \mathbb{R}^N</math> is the target vector.

=== Closed-Form Solution ===

Setting the gradient to zero yields the '''normal equations''':

:<math>\nabla_{\mathbf{w}} \mathcal{L} = -2 X^{\!\top}(\mathbf{y} - X\mathbf{w}) = 0</math>

:<math>\hat{\mathbf{w}} = (X^{\!\top} X)^{-1} X^{\!\top} \mathbf{y}</math>

This solution exists and is unique when <math>X^{\!\top} X</math> is invertible (i.e., the features are linearly independent). The computational cost is <math>O(Nd^2 + d^3)</math>, which is efficient for moderate <math>d</math> but becomes expensive for high-dimensional problems.
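
Continuing the synthetic example above, a sketch of the closed-form fit. Forming the inverse explicitly is numerically fragile, so the sketch uses <code>np.linalg.solve</code> and, as a more robust alternative, <code>np.linalg.lstsq</code>:

<syntaxhighlight lang="python">
import numpy as np

# Append a column of ones so the intercept is absorbed into w.
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])

# Normal equations: solve (X^T X) w = X^T y rather than inverting X^T X.
w_hat = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y)

# Equivalent least-squares solve, more robust when X^T X is ill-conditioned.
w_lstsq, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
</syntaxhighlight>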

=== Gradient Descent Approach ===

When the closed-form solution is impractical (large <math>d</math> or <math>N</math>), iterative optimization via [[Stochastic Gradient Descent|gradient descent]] is used. Scaling the loss by <math>1/N</math> (minimizing the mean rather than the sum of squared residuals) keeps the gradient magnitude independent of the dataset size; the gradient is then:

:<math>\nabla_{\mathbf{w}} \mathcal{L} = -\frac{2}{N} X^{\!\top}(\mathbf{y} - X\mathbf{w})</math>

The update rule is <math>\mathbf{w} \leftarrow \mathbf{w} - \eta \nabla_{\mathbf{w}} \mathcal{L}</math>, where <math>\eta</math> is the learning rate. Stochastic and mini-batch variants scale to millions of data points.
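
A minimal full-batch implementation of this update; the learning rate and iteration count below are illustrative, not tuned values.

<syntaxhighlight lang="python">
import numpy as np

def fit_gd(X, y, eta=0.1, n_iters=1000):
    """Full-batch gradient descent on the mean squared error."""
    N, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        grad = -2.0 / N * X.T @ (y - X @ w)  # gradient of the mean squared error
        w -= eta * grad                      # update: w <- w - eta * grad
    return w

w_gd = fit_gd(X_aug, y)  # X_aug from the closed-form example above
</syntaxhighlight>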

== Assumptions of OLS ==

The classical OLS estimator is '''BLUE''' (Best Linear Unbiased Estimator) under the Gauss-Markov conditions:

# '''Linearity''': The true relationship between features and target is linear.
# '''Independence''': Observations are independent of each other.
# '''Homoscedasticity''': The error variance <math>\mathrm{Var}(\epsilon_i) = \sigma^2</math> is constant across observations.
# '''No perfect multicollinearity''': No feature is an exact linear combination of the others.
# '''Exogeneity''': <math>E[\epsilon_i \mid \mathbf{x}_i] = 0</math>; errors are uncorrelated with the features.

Violations of these assumptions do not necessarily make linear regression useless, but they may invalidate confidence intervals and hypothesis tests derived from the model.

== Evaluation Metrics ==

{| class="wikitable"
|-
! Metric !! Formula !! Interpretation
|-
| '''MSE''' || <math>\frac{1}{N}\sum(y_i - \hat{y}_i)^2</math> || Average squared error; penalises large errors
|-
| '''RMSE''' || <math>\sqrt{\mathrm{MSE}}</math> || In the same units as the target
|-
| '''MAE''' || <math>\frac{1}{N}\sum|y_i - \hat{y}_i|</math> || Average absolute error; robust to outliers
|-
| '''R-squared''' || <math>1 - \frac{\sum(y_i - \hat{y}_i)^2}{\sum(y_i - \bar{y})^2}</math> || Proportion of variance explained (0 to 1)
|}

An <math>R^2</math> of 1 indicates perfect prediction, while <math>R^2 = 0</math> means the model does no better than predicting the mean. The '''adjusted R-squared''' penalises for the number of features, preventing artificial inflation from adding irrelevant predictors.
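
All four metrics are a few lines of NumPy; in this sketch <code>y_true</code> and <code>y_pred</code> are assumed to be 1-D arrays of equal length.

<syntaxhighlight lang="python">
import numpy as np

def regression_metrics(y_true, y_pred):
    resid = y_true - y_pred
    mse = np.mean(resid ** 2)                         # average squared error
    rmse = np.sqrt(mse)                               # same units as the target
    mae = np.mean(np.abs(resid))                      # average absolute error
    ss_res = np.sum(resid ** 2)                       # residual sum of squares
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot                        # proportion of variance explained
    return {"MSE": mse, "RMSE": rmse, "MAE": mae, "R2": r2}
</syntaxhighlight>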

== Multiple Regression ==

When <math>d > 1</math>, the model is called '''multiple linear regression'''. Each coefficient <math>w_j</math> represents the expected change in <math>y</math> per unit change in <math>x_j</math>, holding all other features constant. Interpreting coefficients requires caution when features are correlated (multicollinearity), as individual coefficients may become unstable even though the overall model fits well.
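
The instability is easy to demonstrate on synthetic data: duplicate a feature with a little noise and the two coefficients trade off almost arbitrarily, while their sum (and the model's predictions) stays stable. A sketch, with all values illustrative:

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(1)
x1 = rng.normal(size=500)
x2 = x1 + 1e-6 * rng.normal(size=500)            # nearly collinear with x1
y_c = 3.0 * x1 + rng.normal(scale=0.1, size=500)

X_c = np.column_stack([x1, x2])
w_c, *_ = np.linalg.lstsq(X_c, y_c, rcond=None)
print(w_c, w_c.sum())  # individual coefficients are unstable; their sum stays near 3
</syntaxhighlight>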

== Regularized Variants ==

When the number of features is large relative to the number of observations, or when features are correlated, OLS can overfit. Regularization adds a penalty to the loss function:

=== Ridge Regression (L2) ===

:<math>\mathcal{L}_{\mathrm{ridge}} = \|\mathbf{y} - X\mathbf{w}\|^2 + \lambda \|\mathbf{w}\|_2^2</math>

The closed-form solution becomes <math>\hat{\mathbf{w}} = (X^{\!\top} X + \lambda I)^{-1} X^{\!\top} \mathbf{y}</math>. Ridge shrinks coefficients toward zero but never sets them exactly to zero.
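
The ridge solution is a one-line change from the OLS code: since <math>X^{\!\top} X + \lambda I</math> is invertible for any <math>\lambda > 0</math>, the solve cannot fail even with collinear features. A simplified sketch (standard practice leaves the intercept unpenalised, which this version ignores):

<syntaxhighlight lang="python">
import numpy as np

def fit_ridge(X, y, lam=1.0):
    """Closed-form ridge: w_hat = (X^T X + lam * I)^{-1} X^T y.

    Simplification: penalises every coefficient, including any bias column.
    """
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
</syntaxhighlight>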

=== Lasso Regression (L1) ===

:<math>\mathcal{L}_{\mathrm{lasso}} = \|\mathbf{y} - X\mathbf{w}\|^2 + \lambda \|\mathbf{w}\|_1</math>

Lasso can drive coefficients to exactly zero, performing automatic '''feature selection'''. It has no closed-form solution and is typically solved via coordinate descent.
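
A bare-bones coordinate-descent sketch for this objective: each coordinate update is a soft-thresholding step, which is what produces exact zeros. There is no convergence check, standardized features are assumed, and all names are illustrative.

<syntaxhighlight lang="python">
import numpy as np

def soft_threshold(a, t):
    """S(a, t) = sign(a) * max(|a| - t, 0)."""
    return np.sign(a) * np.maximum(np.abs(a) - t, 0.0)

def fit_lasso_cd(X, y, lam=1.0, n_iters=100):
    """Coordinate descent for ||y - Xw||^2 + lam * ||w||_1."""
    N, d = X.shape
    w = np.zeros(d)
    z = np.sum(X ** 2, axis=0)                 # per-feature squared norms
    for _ in range(n_iters):
        for j in range(d):
            # Partial residual with feature j's contribution removed.
            r_j = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ r_j
            w[j] = soft_threshold(rho, lam / 2.0) / z[j]
    return w
</syntaxhighlight>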

=== Elastic Net ===

Elastic Net combines both penalties: <math>\lambda_1 \|\mathbf{w}\|_1 + \lambda_2 \|\mathbf{w}\|_2^2</math>, balancing sparsity and stability.
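
In practice all three regularized variants are available off the shelf, e.g. in scikit-learn; the parameter values below are illustrative, not recommendations.

<syntaxhighlight lang="python">
from sklearn.linear_model import Ridge, Lasso, ElasticNet

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)
# l1_ratio interpolates between pure ridge (0.0) and pure lasso (1.0).
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
</syntaxhighlight>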

== Practical Considerations ==

* '''Feature scaling''': Standardizing features (zero mean, unit variance) improves gradient descent convergence and makes regularization fair across features.
* '''Polynomial features''': Adding polynomial terms (e.g., <math>x^2, x_1 x_2</math>) allows linear regression to capture nonlinear relationships; see the sketch after this list.
* '''Outliers''': OLS is sensitive to outliers because of the squared loss. Robust alternatives include Huber regression and RANSAC.
* '''Diagnostic plots''': Residual plots help detect violations of assumptions (non-linearity, heteroscedasticity, non-normality).
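
A minimal scikit-learn pipeline combining polynomial expansion and scaling with an OLS fit (the degree and step ordering here are illustrative choices):

<syntaxhighlight lang="python">
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LinearRegression

# Degree-2 terms (x_j^2 and x_j * x_k), then standardization, then OLS.
model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),
    LinearRegression(),
)
model.fit(X, y)  # X, y as in the synthetic example above
</syntaxhighlight>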

== See also ==

* [[Stochastic Gradient Descent]]
* [[Logistic regression]]
* [[Loss Functions]]
* [[Overfitting and Regularization]]
* [[Neural Networks]]

== References ==

* Hastie, T., Tibshirani, R. and Friedman, J. (2009). ''The Elements of Statistical Learning''. Springer, Chapter 3.
* Montgomery, D. C., Peck, E. A. and Vining, G. G. (2012). ''Introduction to Linear Regression Analysis''. Wiley.
* Hoerl, A. E. and Kennard, R. W. (1970). "Ridge Regression: Biased Estimation for Nonorthogonal Problems". ''Technometrics''.
* Tibshirani, R. (1996). "Regression Shrinkage and Selection via the Lasso". ''Journal of the Royal Statistical Society, Series B''.

[[Category:Statistics]]
[[Category:Introductory]]