    Linear Regression

    From Marovi AI

    Revision as of 06:58, 24 April 2026

    Topic area: Statistics
    Difficulty: Introductory

    Linear regression is a fundamental statistical method that models the relationship between a dependent variable and one or more independent variables by fitting a linear equation to observed data. It is one of the oldest and most widely used techniques in statistics and machine learning, serving as both a practical predictive tool and a building block for understanding more complex models.

    Problem Setup

    Given a dataset of $ N $ observations $ \{(\mathbf{x}_i, y_i)\}_{i=1}^{N} $, where $ \mathbf{x}_i \in \mathbb{R}^d $ is a feature vector and $ y_i \in \mathbb{R} $ is the target, linear regression assumes the relationship:

    $ y_i = \mathbf{w}^{\!\top} \mathbf{x}_i + b + \epsilon_i $

    where $ \mathbf{w} \in \mathbb{R}^d $ is the weight vector, $ b $ is the bias (intercept), and $ \epsilon_i $ is the error term. By absorbing the bias into the weight vector (appending a 1 to each $ \mathbf{x}_i $), this simplifies to $ y_i = \mathbf{w}^{\!\top} \mathbf{x}_i + \epsilon_i $.
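    The data-generating model and the bias-absorption trick can be sketched in NumPy as follows (the parameter values and noise scale are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 100, 3

# True parameters (arbitrary values for this sketch).
w_true = np.array([2.0, -1.0, 0.5])
b_true = 4.0

X = rng.normal(size=(N, d))              # feature vectors x_i
eps = rng.normal(scale=0.1, size=N)      # error terms eps_i
y = X @ w_true + b_true + eps            # y_i = w^T x_i + b + eps_i

# Absorb the bias: append a constant-1 column to each x_i,
# so the model becomes y_i = w'^T x'_i with w' = [w; b].
X_aug = np.hstack([X, np.ones((N, 1))])  # shape (N, d+1)
```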

    Ordinary Least Squares

    The ordinary least squares (OLS) method finds the weights that minimize the sum of squared residuals:

    $ \mathcal{L}(\mathbf{w}) = \sum_{i=1}^{N} (y_i - \mathbf{w}^{\!\top} \mathbf{x}_i)^2 = \|\mathbf{y} - X\mathbf{w}\|^2 $

    where $ X \in \mathbb{R}^{N \times d} $ is the design matrix and $ \mathbf{y} \in \mathbb{R}^N $ is the target vector.

    Closed-Form Solution

    Setting the gradient to zero yields the normal equations:

    $ \nabla_{\mathbf{w}} \mathcal{L} = -2 X^{\!\top}(\mathbf{y} - X\mathbf{w}) = 0 $
    $ \hat{\mathbf{w}} = (X^{\!\top} X)^{-1} X^{\!\top} \mathbf{y} $

    This solution exists and is unique when $ X^{\!\top} X $ is invertible (i.e., the columns of $ X $ are linearly independent). The computational cost is $ O(Nd^2 + d^3) $, which is efficient for moderate $ d $ but becomes expensive for high-dimensional problems.
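    A minimal NumPy sketch of the closed-form solution on synthetic data (true weights and noise scale are assumptions for this example); solving the normal equations with `np.linalg.solve` is numerically preferable to forming the explicit inverse:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 200, 3

# Synthetic data; the last column of X absorbs the bias.
X = np.hstack([rng.normal(size=(N, d - 1)), np.ones((N, 1))])
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + rng.normal(scale=0.01, size=N)

# Solve the normal equations X^T X w = X^T y directly.
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
```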

    Gradient Descent Approach

    When the closed-form solution is impractical (large $ d $ or $ N $), iterative optimization via gradient descent is used. In practice the mean rather than the sum of squared residuals is minimized, which keeps the gradient scale independent of $ N $; the gradient of this scaled loss is:

    $ \nabla_{\mathbf{w}} \mathcal{L} = -\frac{2}{N} X^{\!\top}(\mathbf{y} - X\mathbf{w}) $

    The update rule is $ \mathbf{w} \leftarrow \mathbf{w} - \eta \nabla_{\mathbf{w}} \mathcal{L} $, where $ \eta $ is the learning rate. Stochastic and mini-batch variants scale to millions of data points.
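    The update rule translates into a short NumPy loop; the learning rate and iteration count below are illustrative choices, not tuned values:

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 500, 3
X = rng.normal(size=(N, d))
w_true = np.array([1.5, -2.0, 0.5])
y = X @ w_true + rng.normal(scale=0.1, size=N)

w = np.zeros(d)
eta = 0.1  # learning rate
for _ in range(500):
    # Gradient of the mean squared error.
    grad = -(2.0 / N) * X.T @ (y - X @ w)
    w -= eta * grad  # update rule: w <- w - eta * grad
```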

    Assumptions of OLS

    The classical OLS estimator is BLUE (Best Linear Unbiased Estimator) under the Gauss-Markov conditions:

    1. Linearity: The true relationship between features and target is linear.
    2. Independence: Observations are independent of each other.
    3. Homoscedasticity: The error variance $ \mathrm{Var}(\epsilon_i) = \sigma^2 $ is constant across observations.
    4. No perfect multicollinearity: No feature is an exact linear combination of others.
    5. Exogeneity: $ E[\epsilon_i \mid \mathbf{x}_i] = 0 $ — errors are uncorrelated with features.

    Violations of these assumptions do not necessarily make linear regression useless, but they may invalidate confidence intervals and hypothesis tests derived from the model.

    Evaluation Metrics

    Metric | Formula | Interpretation
    MSE | $ \frac{1}{N}\sum_i (y_i - \hat{y}_i)^2 $ | Average squared error; penalises large errors
    RMSE | $ \sqrt{\mathrm{MSE}} $ | Same units as the target
    MAE | $ \frac{1}{N}\sum_i |y_i - \hat{y}_i| $ | Average absolute error; robust to outliers
    R-squared | $ 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2} $ | Proportion of variance explained (0 to 1)

    An $ R^2 $ of 1 indicates perfect prediction, while $ R^2 = 0 $ means the model does no better than predicting the mean. The adjusted R-squared penalises for the number of features, preventing artificial inflation from adding irrelevant predictors.
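    The four metrics in the table can be computed in a few lines of NumPy; `regression_metrics` is a hypothetical helper name for this sketch, not a standard library function:

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Compute MSE, RMSE, MAE, and R-squared for a set of predictions."""
    resid = y_true - y_pred
    mse = np.mean(resid ** 2)
    rmse = np.sqrt(mse)
    mae = np.mean(np.abs(resid))
    ss_res = np.sum(resid ** 2)                      # residual sum of squares
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return {"MSE": mse, "RMSE": rmse, "MAE": mae, "R2": r2}
```

    Predicting the mean of `y_true` for every point yields an R-squared of exactly zero, matching the interpretation above.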

    Multiple Regression

    When $ d > 1 $, the model is called multiple linear regression. Each coefficient $ w_j $ represents the expected change in $ y $ per unit change in $ x_j $, holding all other features constant. Interpreting coefficients requires caution when features are correlated (multicollinearity), as individual coefficients may become unstable even though the overall model fits well.

    Regularized Variants

    When the number of features is large relative to the number of observations, or when features are correlated, OLS can overfit. Regularization adds a penalty to the loss function:

    Ridge Regression (L2)

    $ \mathcal{L}_{\mathrm{ridge}} = \|\mathbf{y} - X\mathbf{w}\|^2 + \lambda \|\mathbf{w}\|_2^2 $

    The closed-form solution becomes $ \hat{\mathbf{w}} = (X^{\!\top} X + \lambda I)^{-1} X^{\!\top} \mathbf{y} $. Ridge shrinks coefficients toward zero but never sets them exactly to zero.
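    The ridge closed form is a one-line change to the OLS solve; `ridge_fit` is a hypothetical helper name for this sketch:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge estimate: (X^T X + lam * I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
```

    Setting `lam` to zero recovers the OLS solution; increasing it shrinks the norm of the coefficient vector toward zero without making any entry exactly zero.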

    Lasso Regression (L1)

    $ \mathcal{L}_{\mathrm{lasso}} = \|\mathbf{y} - X\mathbf{w}\|^2 + \lambda \|\mathbf{w}\|_1 $

    Lasso can drive coefficients to exactly zero, performing automatic feature selection. It has no closed-form solution and is typically solved via coordinate descent.
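    Coordinate descent for the lasso reduces to a soft-thresholding update on each coefficient in turn. The following is an illustrative sketch (fixed sweep count, no convergence check), not a production solver:

```python
import numpy as np

def soft_threshold(rho, t):
    """Shrink rho toward zero by t, clamping at zero."""
    return np.sign(rho) * max(abs(rho) - t, 0.0)

def lasso_cd(X, y, lam, n_sweeps=100):
    """Coordinate descent for ||y - Xw||^2 + lam * ||w||_1."""
    N, d = X.shape
    w = np.zeros(d)
    for _ in range(n_sweeps):
        for j in range(d):
            # Residual with feature j's current contribution removed.
            r_j = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ r_j
            z = X[:, j] @ X[:, j]
            # Exact minimizer in coordinate j: soft-threshold at lam / 2.
            w[j] = soft_threshold(rho, lam / 2.0) / z
    return w
```

    With `lam = 0` the update is plain Gauss-Seidel on the normal equations and recovers the OLS solution; larger `lam` shrinks the L1 norm of the solution and can zero out coefficients entirely.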

    Elastic Net

    Elastic Net combines both penalties: $ \lambda_1 \|\mathbf{w}\|_1 + \lambda_2 \|\mathbf{w}\|_2^2 $, balancing sparsity and stability.

    Practical Considerations

    • Feature scaling: Standardizing features (zero mean, unit variance) improves gradient descent convergence and makes regularization fair across features.
    • Polynomial features: Adding polynomial terms (e.g., $ x^2, x_1 x_2 $) allows linear regression to capture nonlinear relationships.
    • Outliers: OLS is sensitive to outliers because of the squared loss. Robust alternatives include Huber regression and RANSAC.
    • Diagnostic plots: Residual plots help detect violations of assumptions (non-linearity, heteroscedasticity, non-normality).
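    As a sketch of the polynomial-features point above (coefficient values chosen arbitrarily for illustration), fitting a quadratic relationship reduces to ordinary linear regression on the expanded design matrix $ [1, x, x^2] $:

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(-2, 2, size=200)
# Nonlinear ground truth: y = 1 + 0.5 x - 2 x^2 + noise.
y = 1.0 + 0.5 * x - 2.0 * x**2 + rng.normal(scale=0.1, size=200)

# Expand to polynomial features, then solve the usual normal equations.
X_poly = np.column_stack([np.ones_like(x), x, x**2])
w_hat = np.linalg.solve(X_poly.T @ X_poly, X_poly.T @ y)
```

    The model is still linear in the weights, which is why the OLS machinery applies unchanged.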
