Stochastic Gradient Descent: Difference between revisions

    From Marovi AI
    (Marked this version for translation)
    Tag: Reverted
    ([deploy-bot] Deploy from CI (a613d0f))
    Tags: ci-deploy Manual revert Reverted
    Line 17: Line 17:
    '''Stochastic gradient descent''' (often abbreviated '''{{Term|SGD|SGD}}''') is an iterative optimisation algorithm used to minimise an {{Term|objective function|objective function}} written as a sum of differentiable sub-functions. It is the workhorse behind modern machine-learning training, powering everything from logistic regression to deep neural networks.
    '''Stochastic gradient descent''' (often abbreviated '''{{Term|SGD|SGD}}''') is an iterative optimisation algorithm used to minimise an {{Term|objective function|objective function}} written as a sum of differentiable sub-functions. It is the workhorse behind modern machine-learning training, powering everything from logistic regression to deep neural networks.


    == Motivation == <!--T:2-->
    <!--T:2-->
    == Motivation ==


    <!--T:3-->
    <!--T:3-->
    In classical {{Term|gradient descent}}, the full gradient of the {{Term|loss function}} is computed over the entire training set before each parameter update. When the dataset is large this becomes prohibitively expensive. SGD addresses the problem by estimating the gradient from a single randomly chosen sample (or a small '''{{Term|mini-batch}}''') at each step, trading a noisier estimate for dramatically lower per-iteration cost.
    In classical {{Term|gradient descent}}, the full gradient of the {{Term|loss function}} is computed over the entire training set before each parameter update. When the dataset is large this becomes prohibitively expensive. SGD addresses the problem by estimating the gradient from a single randomly chosen sample (or a small '''{{Term|mini-batch}}''') at each step, trading a noisier estimate for dramatically lower per-iteration cost.


    == Algorithm == <!--T:4-->
    <!--T:4-->
    == Algorithm ==


    <!--T:5-->
    <!--T:5-->
    Line 39: Line 41:
    where <math>\eta_t</math> is the '''{{Term|learning rate}}''' (step size) and <math>i_t</math> is a randomly selected index.
    where <math>\eta_t</math> is the '''{{Term|learning rate}}''' (step size) and <math>i_t</math> is a randomly selected index.


    === Mini-batch variant === <!--T:10-->
    <!--T:10-->
    === Mini-batch variant ===


    <!--T:11-->
    <!--T:11-->
    Line 50: Line 53:
    Common batch sizes range from 32 to 512. Larger batches reduce gradient variance but increase memory usage.
    Common batch sizes range from 32 to 512. Larger batches reduce gradient variance but increase memory usage.


    === Pseudocode === <!--T:14-->
    <!--T:14-->
    === Pseudocode ===


    <!--T:15-->
    <!--T:15-->
    Line 64: Line 68:
    </pre>
    </pre>


    == Learning rate schedules == <!--T:16-->
    <!--T:16-->
    == Learning rate schedules ==


    <!--T:17-->
    <!--T:17-->
    Line 76: Line 81:
    * '''Linear warm-up''' — ramp up from a small <math>\eta</math> during the first few iterations to stabilise early training.
    * '''Linear warm-up''' — ramp up from a small <math>\eta</math> during the first few iterations to stabilise early training.


    == Convergence properties == <!--T:19-->
    <!--T:19-->
    == Convergence properties ==


    <!--T:20-->
    <!--T:20-->
    Line 87: Line 93:
    converges almost surely to the global minimum (Robbins–Monro conditions). For non-convex problems — the typical regime for deep learning — SGD converges to a stationary point, and empirical evidence shows it often finds good local minima.
    converges almost surely to the global minimum (Robbins–Monro conditions). For non-convex problems — the typical regime for deep learning — SGD converges to a stationary point, and empirical evidence shows it often finds good local minima.


    == Popular variants == <!--T:23-->
    <!--T:23-->
    == Popular variants ==


    <!--T:24-->
    <!--T:24-->
    Line 110: Line 117:
    |}
    |}


    == Practical considerations == <!--T:26-->
    <!--T:26-->
    == Practical considerations ==


    <!--T:27-->
    <!--T:27-->
    Line 118: Line 126:
    * '''Mixed-precision training''' — Using half-precision floats accelerates SGD on modern GPUs with minimal accuracy loss.
    * '''Mixed-precision training''' — Using half-precision floats accelerates SGD on modern GPUs with minimal accuracy loss.


    == Applications == <!--T:28-->
    <!--T:28-->
    == Applications ==


    <!--T:29-->
    <!--T:29-->
    Line 130: Line 139:
    * Online learning settings where data arrives in a stream
    * Online learning settings where data arrives in a stream


    == See also == <!--T:31-->
    <!--T:31-->
    == See also ==


    <!--T:32-->
    <!--T:32-->
    Line 139: Line 149:
    * [[Convex optimisation]]
    * [[Convex optimisation]]


    == References == <!--T:33-->
    <!--T:33-->
    == References ==


    <!--T:34-->
    <!--T:34-->

    Revision as of 02:09, 27 April 2026

    Other languages:
    ArticlesMachine LearningOptimization algorithmsStochastic Gradient Descent
    Stochastic Gradient Descent

    Topics {{#arraymap:Optimization, Neural Networks, Gradient Methods|,|@@item@@|@@item@@|}}

    Summary
    Stochastic gradient descent (SGD) is the core optimization algorithm behind modern machine learning. Instead of computing gradients over an entire dataset, it estimates them from small random samples, making training feasible on large-scale data. Nearly all deep learning models are trained using SGD or one of its variants (Adam, RMSProp, etc.).
      {{#arraymap:Estimates gradients from random mini-batches instead of the full dataset; Learning rate schedule is critical for convergence; Variants like Adam and AdamW add adaptive per-parameter rates; Converges to global minimum for convex problems under Robbins–Monro conditions|;|@@item@@|
    • @@item@@
    • |}}

    Stochastic gradient descent (often abbreviated SGD) is an iterative optimisation algorithm used to minimise an objective function written as a sum of differentiable sub-functions. It is the workhorse behind modern machine-learning training, powering everything from logistic regression to deep neural networks.

    Motivation

    In classical gradient descent, the full gradient of the loss function is computed over the entire training set before each parameter update. When the dataset is large this becomes prohibitively expensive. SGD addresses the problem by estimating the gradient from a single randomly chosen sample (or a small mini-batch) at each step, trading a noisier estimate for dramatically lower per-iteration cost.

    Algorithm

    Given a parameterised loss function

    $ L(\theta) = \frac{1}{N}\sum_{i=1}^{N} \ell(\theta;\, x_i,\, y_i) $

    the SGD update rule at step $ t $ is:

    $ \theta_{t+1} = \theta_t - \eta_t \,\nabla_\theta \ell(\theta_t;\, x_{i_t},\, y_{i_t}) $

    where $ \eta_t $ is the learning rate (step size) and $ i_t $ is a randomly selected index.

    Mini-batch variant

    In practice a mini-batch of $ B $ samples is used:

    $ \theta_{t+1} = \theta_t - \frac{\eta_t}{B}\sum_{j=1}^{B} \nabla_\theta \ell(\theta_t;\, x_{i_j},\, y_{i_j}) $

    Common batch sizes range from 32 to 512. Larger batches reduce gradient variance but increase memory usage.

    Pseudocode

    initialise parameters θ
    for epoch = 1, 2, … do
        shuffle training set
        for each mini-batch B ⊂ training set do
            g ← (1/|B|) Σ ∇ℓ(θ; xᵢ, yᵢ)   # estimate gradient
            θ ← θ − η · g                     # update parameters
        end for
    end for
    

    Learning rate schedules

    The learning rate $ \eta_t $ strongly influences convergence. Common strategies include:

    • Constant — simple but may overshoot or stall.
    • Step decay — multiply $ \eta $ by a factor (e.g. 0.1) every $ k $ epochs.
    • Exponential decay$ \eta_t = \eta_0 \, e^{-\lambda t} $.
    • Cosine annealing — smoothly reduces the rate following a cosine curve, often with warm restarts.
    • Linear warm-up — ramp up from a small $ \eta $ during the first few iterations to stabilise early training.

    Convergence properties

    For convex objectives with Lipschitz-continuous gradients, SGD with a decaying learning rate satisfying

    $ \sum_{t=1}^{\infty} \eta_t = \infty, \qquad \sum_{t=1}^{\infty} \eta_t^2 < \infty $

    converges almost surely to the global minimum (Robbins–Monro conditions). For non-convex problems — the typical regime for deep learning — SGD converges to a stationary point, and empirical evidence shows it often finds good local minima.

    Popular variants

    Several extensions reduce the variance of the gradient estimate or adapt the step size per parameter:

    Method Key idea Reference
    Momentum Accumulates an exponentially decaying moving average of past gradients Polyak, 1964
    Nesterov accelerated gradient Evaluates the gradient at a "look-ahead" position Nesterov, 1983
    Adagrad Per-parameter rates that shrink for frequently updated features Duchi et al., 2011
    RMSProp Fixes Adagrad's diminishing rates using a moving average of squared gradients Hinton (lecture notes), 2012
    Adam Combines momentum with RMSProp-style adaptive rates Kingma & Ba, 2015
    AdamW Decouples weight decay from the adaptive gradient step Loshchilov & Hutter, 2019

    Practical considerations

    • Data shuffling — Re-shuffle the dataset each epoch to avoid cyclic patterns.
    • Gradient clipping — Cap the gradient norm to prevent exploding updates, especially in recurrent networks.
    • Batch normalisation — Normalising layer inputs reduces sensitivity to the learning rate.
    • Mixed-precision training — Using half-precision floats accelerates SGD on modern GPUs with minimal accuracy loss.

    Applications

    SGD and its variants are used across virtually all areas of machine learning:

    • Training deep neural networks (computer vision, NLP, speech recognition)
    • Large-scale linear models (logistic regression, SVMs via SGD)
    • Reinforcement learning policy optimisation
    • Recommendation systems and collaborative filtering
    • Online learning settings where data arrives in a stream

    See also

    References

    • Robbins, H. and Monro, S. (1951). "A Stochastic Approximation Method". Annals of Mathematical Statistics.
    • Bottou, L. (2010). "Large-Scale Machine Learning with Stochastic Gradient Descent". COMPSTAT.
    • Kingma, D. P. and Ba, J. (2015). "Adam: A Method for Stochastic Optimization". ICLR.
    • Ruder, S. (2016). "An overview of gradient descent optimization algorithms". arXiv:1609.04747.