Stochastic Gradient Descent
Revision as of 06:05, 24 April 2026
Key points:

- Estimates gradients from random mini-batches instead of the full dataset
- Learning rate schedule is critical for convergence
- Variants like Adam and AdamW add adaptive per-parameter rates
- Converges to the global minimum for convex problems under the Robbins–Monro conditions
Stochastic gradient descent (often abbreviated SGD) is an iterative optimisation algorithm used to minimise an objective function written as a sum of differentiable sub-functions. It is the workhorse behind modern machine-learning training, powering everything from logistic regression to deep neural networks.
Motivation
In classical gradient descent, the full gradient of the loss function is computed over the entire training set before each parameter update. When the dataset is large this becomes prohibitively expensive. SGD addresses the problem by estimating the gradient from a single randomly chosen sample (or a small mini-batch) at each step, trading a noisier estimate for dramatically lower per-iteration cost.
Algorithm
Given a parameterised loss function
- $ L(\theta) = \frac{1}{N}\sum_{i=1}^{N} \ell(\theta;\, x_i,\, y_i) $
the SGD update rule at step $ t $ is:
- $ \theta_{t+1} = \theta_t - \eta_t \,\nabla_\theta \ell(\theta_t;\, x_{i_t},\, y_{i_t}) $
where $ \eta_t $ is the learning rate (step size) and $ i_t $ is a randomly selected index.
Mini-batch variant
In practice a mini-batch of $ B $ samples is used:
- $ \theta_{t+1} = \theta_t - \frac{\eta_t}{B}\sum_{j=1}^{B} \nabla_\theta \ell(\theta_t;\, x_{i_j},\, y_{i_j}) $
Common batch sizes range from 32 to 512. Larger batches reduce gradient variance but increase memory usage.
Pseudocode
initialise parameters θ
for epoch = 1, 2, … do
    shuffle training set
    for each mini-batch B ⊂ training set do
        g ← (1/|B|) Σ ∇ℓ(θ; xᵢ, yᵢ)    # estimate gradient
        θ ← θ − η · g                   # update parameters
    end for
end for
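The pseudocode above can be sketched as a runnable Python/NumPy function. The least-squares loss and the synthetic dataset below are illustrative assumptions, not part of the article; any differentiable per-sample loss would fit the same loop.

```python
import numpy as np

def sgd(X, y, eta=0.1, batch_size=32, epochs=100, seed=0):
    """Mini-batch SGD for least-squares regression, l(theta) = 0.5*(x.theta - y)^2."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    n = len(X)
    for _ in range(epochs):
        perm = rng.permutation(n)                 # re-shuffle each epoch
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]
            residual = X[idx] @ theta - y[idx]
            g = X[idx].T @ residual / len(idx)    # mini-batch gradient estimate
            theta -= eta * g                      # SGD update
    return theta

# Synthetic data: y = X @ [2, -3] plus small noise
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 2))
y = X @ np.array([2.0, -3.0]) + 0.01 * rng.normal(size=1000)
theta = sgd(X, y)   # recovers coefficients close to [2, -3]
```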
Learning rate schedules
The learning rate $ \eta_t $ strongly influences convergence. Common strategies include:
- Constant — simple but may overshoot or stall.
- Step decay — multiply $ \eta $ by a factor (e.g. 0.1) every $ k $ epochs.
- Exponential decay — $ \eta_t = \eta_0 \, e^{-\lambda t} $.
- Cosine annealing — smoothly reduces the rate following a cosine curve, often with warm restarts.
- Linear warm-up — ramp up from a small $ \eta $ during the first few iterations to stabilise early training.
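The schedules listed above can each be written as a small function of the step index. The constants here (base rate, decay factor, period, warm-up length) are illustrative choices, not values prescribed by the article.

```python
import math

def step_decay(t, eta0=0.1, factor=0.1, k=30):
    """Multiply the rate by `factor` every k epochs."""
    return eta0 * factor ** (t // k)

def exponential_decay(t, eta0=0.1, lam=0.01):
    """eta_t = eta0 * exp(-lambda * t)."""
    return eta0 * math.exp(-lam * t)

def cosine_annealing(t, eta0=0.1, T=100):
    """Smoothly decay from eta0 to 0 over T steps along a cosine curve."""
    return 0.5 * eta0 * (1 + math.cos(math.pi * t / T))

def linear_warmup(t, eta0=0.1, warmup=10):
    """Ramp up linearly over the first `warmup` steps, then hold at eta0."""
    return eta0 * min(1.0, (t + 1) / warmup)
```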
Convergence properties
For convex objectives with Lipschitz-continuous gradients, SGD with a decaying learning rate satisfying
- $ \sum_{t=1}^{\infty} \eta_t = \infty, \qquad \sum_{t=1}^{\infty} \eta_t^2 < \infty $
converges almost surely to the global minimum (Robbins–Monro conditions). For non-convex problems — the typical regime for deep learning — SGD converges to a stationary point, and empirical evidence shows it often finds good local minima.
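For instance, the classic schedule $ \eta_t = \eta_0 / t $ satisfies both conditions: the partial sums grow without bound (harmonic series) while the squared sums converge to $ \eta_0^2 \pi^2/6 $. A quick numerical check of the two partial sums:

```python
import math

eta0 = 1.0
etas = [eta0 / t for t in range(1, 100001)]

s1 = sum(etas)                   # grows like log(T); unbounded as T grows
s2 = sum(e * e for e in etas)    # converges toward eta0**2 * pi**2 / 6 ≈ 1.6449

print(round(s1, 3))              # ≈ 12.090 at T = 100000
print(round(s2, 4))              # ≈ 1.6449
```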
Popular variants
Several extensions reduce the variance of the gradient estimate or adapt the step size per parameter:
| Method | Key idea | Reference |
|---|---|---|
| Momentum | Accumulates an exponentially decaying moving average of past gradients | Polyak, 1964 |
| Nesterov accelerated gradient | Evaluates the gradient at a "look-ahead" position | Nesterov, 1983 |
| Adagrad | Per-parameter rates that shrink for frequently updated features | Duchi et al., 2011 |
| RMSProp | Fixes Adagrad's diminishing rates using a moving average of squared gradients | Hinton (lecture notes), 2012 |
| Adam | Combines momentum with RMSProp-style adaptive rates | Kingma & Ba, 2015 |
| AdamW | Decouples weight decay from the adaptive gradient step | Loshchilov & Hutter, 2019 |
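As a sketch of how two of these variants modify the basic update, the single-step rules for classical momentum and Adam can be written directly from the table's descriptions. The hyperparameter values follow common defaults, and the quadratic test function is an illustrative choice.

```python
import numpy as np

def momentum_step(theta, v, grad, eta=0.01, beta=0.9):
    """Heavy-ball momentum: accumulate a decaying average of past gradients."""
    v = beta * v + grad
    return theta - eta * v, v

def adam_step(theta, m, v, grad, t, eta=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """Adam: momentum combined with RMSProp-style per-parameter adaptive rates."""
    m = b1 * m + (1 - b1) * grad           # first-moment (mean) estimate
    v = b2 * v + (1 - b2) * grad ** 2      # second-moment estimate
    m_hat = m / (1 - b1 ** t)              # bias corrections for zero initialisation
    v_hat = v / (1 - b2 ** t)
    return theta - eta * m_hat / (np.sqrt(v_hat) + eps), m, v

# Minimise f(theta) = 0.5 * theta**2, whose gradient is simply theta
theta, vel = np.array([5.0]), np.zeros(1)
for _ in range(500):
    theta, vel = momentum_step(theta, vel, theta)
# theta is now close to the minimiser at 0
```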
Practical considerations
- Data shuffling — Re-shuffle the dataset each epoch to avoid cyclic patterns.
- Gradient clipping — Cap the gradient norm to prevent exploding updates, especially in recurrent networks.
- Batch normalisation — Normalising layer inputs reduces sensitivity to the learning rate.
- Mixed-precision training — Using half-precision floats accelerates SGD on modern GPUs with minimal accuracy loss.
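Gradient-norm clipping from the list above takes only a few lines. The threshold of 1.0 is a common but arbitrary choice, not one the article specifies.

```python
import numpy as np

def clip_grad_norm(grad, max_norm=1.0):
    """Rescale `grad` so its L2 norm does not exceed `max_norm`."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

g = clip_grad_norm(np.array([3.0, 4.0]))   # norm 5.0, rescaled to [0.6, 0.8]
```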
Applications
SGD and its variants are used across virtually all areas of machine learning:
- Training deep neural networks (computer vision, NLP, speech recognition)
- Large-scale linear models (logistic regression, SVMs via SGD)
- Reinforcement learning policy optimisation
- Recommendation systems and collaborative filtering
- Online learning settings where data arrives in a stream
See also

- Gradient descent
- Backpropagation
- Adam (optimiser)
- Learning rate
- Convex optimisation
References
- Robbins, H. and Monro, S. (1951). "A Stochastic Approximation Method". Annals of Mathematical Statistics.
- Bottou, L. (2010). "Large-Scale Machine Learning with Stochastic Gradient Descent". COMPSTAT.
- Kingma, D. P. and Ba, J. (2015). "Adam: A Method for Stochastic Optimization". ICLR.
- Ruder, S. (2016). "An overview of gradient descent optimization algorithms". arXiv:1609.04747.