Stochastic Gradient Descent/en - Revision history

FuzzyBot: Updating to match new version of source page

2026-04-27T19:42:25Z

Updating to match new version of source page

FuzzyBot: Updating to match new version of source page

2026-04-27T00:36:02Z

Updating to match new version of source page

New page

<languages />
{{TopicNav|field=Machine Learning|subfield=Optimization algorithms}}
{{Article
|title=Stochastic Gradient Descent
|field=Machine Learning
|topics=Optimization, Neural Networks, Gradient Methods
|related_papers=
|languages=es
}}
{{Summary
|text=Stochastic gradient descent (SGD) is the core optimization algorithm behind modern machine learning. Instead of computing gradients over an entire dataset, it estimates them from small random samples, making training feasible on large-scale data. Nearly all deep learning models are trained using SGD or one of its variants (Adam, RMSProp, etc.).
|key_points=Estimates gradients from random mini-batches instead of the full dataset; Learning rate schedule is critical for convergence; Variants like Adam and AdamW add adaptive per-parameter rates; Converges to global minimum for convex problems under Robbins–Monro conditions
}}

'''Stochastic gradient descent''' (often abbreviated '''{{Term|SGD|SGD}}''') is an iterative optimisation algorithm used to minimise an {{Term|objective function|objective function}} written as a sum of differentiable sub-functions. It is the workhorse behind modern machine-learning training, powering everything from logistic regression to deep neural networks.

== Motivation ==

In classical {{Term|gradient descent}}, the full gradient of the {{Term|loss function}} is computed over the entire training set before each parameter update. When the dataset is large this becomes prohibitively expensive. SGD addresses the problem by estimating the gradient from a single randomly chosen sample (or a small '''{{Term|mini-batch}}''') at each step, trading a noisier estimate for dramatically lower per-iteration cost.

== Algorithm ==

Given a parameterised {{Term|loss function}}

:<math>L(\theta) = \frac{1}{N}\sum_{i=1}^{N} \ell(\theta;\, x_i,\, y_i)</math>

the SGD update rule at step <math>t</math> is:

:<math>\theta_{t+1} = \theta_t - \eta_t \,\nabla_\theta \ell(\theta_t;\, x_{i_t},\, y_{i_t})</math>

where <math>\eta_t</math> is the '''{{Term|learning rate}}''' (step size) and <math>i_t</math> is a randomly selected index.

=== Mini-batch variant ===

In practice a '''{{Term|mini-batch}}''' of <math>B</math> samples is used:

:<math>\theta_{t+1} = \theta_t - \frac{\eta_t}{B}\sum_{j=1}^{B} \nabla_\theta \ell(\theta_t;\, x_{i_j},\, y_{i_j})</math>

Common batch sizes range from 32 to 512. Larger batches reduce gradient variance but increase memory usage.

=== Pseudocode ===

<pre>
initialise parameters θ
for epoch = 1, 2, … do
shuffle training set
for each mini-batch B ⊂ training set do
g ← (1/|B|) Σ ∇ℓ(θ; xᵢ, yᵢ) # estimate gradient
θ ← θ − η · g # update parameters
end for
end for
</pre>

== Learning rate schedules ==

The {{Term|learning rate}} <math>\eta_t</math> strongly influences {{Term|convergence}}. Common strategies include:

* '''Constant''' — simple but may overshoot or stall.
* '''Step decay''' — multiply <math>\eta</math> by a factor (e.g. 0.1) every <math>k</math> epochs.
* '''Exponential decay''' — <math>\eta_t = \eta_0 \, e^{-\lambda t}</math>.
* '''Cosine annealing''' — smoothly reduces the rate following a cosine curve, often with warm restarts.
* '''Linear warm-up''' — ramp up from a small <math>\eta</math> during the first few iterations to stabilise early training.

== Convergence properties ==

For {{Term|convex optimization|convex}} objectives with Lipschitz-continuous gradients, SGD with a decaying {{Term|learning rate}} satisfying

:<math>\sum_{t=1}^{\infty} \eta_t = \infty, \qquad \sum_{t=1}^{\infty} \eta_t^2 < \infty</math>

converges almost surely to the global minimum (Robbins–Monro conditions). For non-convex problems — the typical regime for deep learning — SGD converges to a stationary point, and empirical evidence shows it often finds good local minima.

== Popular variants ==

Several extensions reduce the variance of the gradient estimate or adapt the step size per parameter:

{| class="wikitable"
|-
! Method !! Key idea !! Reference
|-
| '''{{Term|momentum|Momentum}}''' || Accumulates an exponentially decaying moving average of past gradients || Polyak, 1964
|-
| '''Nesterov accelerated gradient''' || Evaluates the gradient at a "look-ahead" position || Nesterov, 1983
|-
| '''Adagrad''' || Per-parameter rates that shrink for frequently updated features || Duchi et al., 2011
|-
| '''RMSProp''' || Fixes Adagrad's diminishing rates using a moving average of squared gradients || Hinton (lecture notes), 2012
|-
| '''{{Term|Adam}}''' || Combines {{Term|momentum}} with RMSProp-style adaptive rates || Kingma & Ba, 2015
|-
| '''AdamW''' || Decouples weight decay from the adaptive gradient step || Loshchilov & Hutter, 2019
|}

== Practical considerations ==

* '''Data shuffling''' — Re-shuffle the dataset each epoch to avoid cyclic patterns.
* '''{{Term|gradient clipping|Gradient clipping}}''' — Cap the gradient norm to prevent exploding updates, especially in recurrent networks.
* '''{{Term|batch normalization|Batch normalisation}}''' — Normalising layer inputs reduces sensitivity to the {{Term|learning rate}}.
* '''Mixed-precision training''' — Using half-precision floats accelerates SGD on modern GPUs with minimal accuracy loss.

== Applications ==

SGD and its variants are used across virtually all areas of machine learning:

* Training deep neural networks (computer vision, NLP, speech recognition)
* Large-scale linear models (logistic regression, SVMs via SGD)
* Reinforcement learning policy optimisation
* Recommendation systems and collaborative filtering
* Online learning settings where data arrives in a stream

== See also ==

* [[Gradient descent]]
* [[Backpropagation]]
* [[Adam (optimiser)]]
* [[Learning rate]]
* [[Convex optimisation]]

== References ==

* Robbins, H. and Monro, S. (1951). "A Stochastic Approximation Method". ''Annals of Mathematical Statistics''.
* Bottou, L. (2010). "Large-Scale Machine Learning with Stochastic Gradient Descent". ''COMPSTAT''.
* Kingma, D. P. and Ba, J. (2015). "Adam: A Method for Stochastic Optimization". ''ICLR''.
* Ruder, S. (2016). "An overview of gradient descent optimization algorithms". ''arXiv:1609.04747''.

{{RelatedContent
|articles=Gradient descent, Backpropagation, Adam (optimiser), Learning rate, Convex optimisation
|papers=
}}

[[Category:Machine learning]]
[[Category:Optimisation algorithms]]
[[Category:Gradient methods]]

← Older revision		Revision as of 19:42, 27 April 2026
Line 13:		Line 13:
	}}		}}

	'''Stochastic gradient descent''' (often abbreviated '''{{Term\|SGD\|SGD}}''') is an iterative optimisation algorithm used to minimise an {{Term\|objective function\|objective function}} written as a sum of differentiable sub-functions. It is the workhorse behind modern machine-learning training, powering everything from logistic regression to deep neural networks.		'''Stochastic {{Term\|gradient descent}}''' (often abbreviated '''{{Term\|SGD\|SGD}}''') is an iterative optimisation algorithm used to minimise an {{Term\|objective function\|objective function}} written as a sum of differentiable sub-functions. It is the workhorse behind modern machine-learning training, powering everything from {{Term\|logistic regression}} to deep neural networks.

	== Motivation ==		== Motivation ==
Line 29:		Line 29:
	:<math>\theta_{t+1} = \theta_t - \eta_t \,\nabla_\theta \ell(\theta_t;\, x_{i_t},\, y_{i_t})</math>		:<math>\theta_{t+1} = \theta_t - \eta_t \,\nabla_\theta \ell(\theta_t;\, x_{i_t},\, y_{i_t})</math>

	where <math>\eta_t</math> is the '''{{Term\|learning rate}}''' (step size) and <math>i_t</math> is a randomly selected index.		where <math>\eta_t</math> is the '''{{Term\|learning rate}}''' ({{Term\|learning rate\|step size}}) and <math>i_t</math> is a randomly selected index.

	=== Mini-batch variant ===		=== Mini-batch variant ===
Line 57:		Line 57:

	* '''Constant''' — simple but may overshoot or stall.		* '''Constant''' — simple but may overshoot or stall.
	* '''Step decay''' — multiply <math>\eta</math> by a factor (e.g. 0.1) every <math>k</math> epochs.		* '''Step decay''' — multiply <math>\eta</math> by a factor (e.g. 0.1) every <math>k</math> {{Term\|epoch\|epochs}}.
	* '''Exponential decay''' — <math>\eta_t = \eta_0 \, e^{-\lambda t}</math>.		* '''Exponential decay''' — <math>\eta_t = \eta_0 \, e^{-\lambda t}</math>.
	* '''Cosine annealing''' — smoothly reduces the rate following a cosine curve, often with warm restarts.		* '''Cosine annealing''' — smoothly reduces the rate following a cosine curve, often with warm restarts.
Line 68:		Line 68:
	:<math>\sum_{t=1}^{\infty} \eta_t = \infty, \qquad \sum_{t=1}^{\infty} \eta_t^2 < \infty</math>		:<math>\sum_{t=1}^{\infty} \eta_t = \infty, \qquad \sum_{t=1}^{\infty} \eta_t^2 < \infty</math>

	converges almost surely to the global minimum (Robbins–Monro conditions). For non-convex problems — the typical regime for deep learning — SGD converges to a stationary point, and empirical evidence shows it often finds good local minima.		converges almost surely to the global minimum (Robbins–Monro conditions). For non-convex problems — the typical regime for {{Term\|deep learning}} — SGD converges to a stationary point, and empirical evidence shows it often finds good local minima.

	== Popular variants ==		== Popular variants ==

	Several extensions reduce the variance of the gradient estimate or adapt the step size per parameter:		Several extensions reduce the variance of the gradient estimate or adapt the {{Term\|learning rate\|step size}} per parameter:

	{\| class="wikitable"		{\| class="wikitable"
Line 82:		Line 82:
	\| '''Nesterov accelerated gradient''' \|\| Evaluates the gradient at a "look-ahead" position \|\| Nesterov, 1983		\| '''Nesterov accelerated gradient''' \|\| Evaluates the gradient at a "look-ahead" position \|\| Nesterov, 1983
	\|-		\|-
	\| '''~~Adagrad~~''' \|\| Per-parameter rates that shrink for frequently updated features \|\| Duchi et al., 2011		\| '''{{Term\|adagrad}}''' \|\| Per-parameter rates that shrink for frequently updated features \|\| Duchi et al., 2011
	\|-		\|-
	\| '''RMSProp''' \|\| Fixes ~~Adagrad~~'s diminishing rates using a moving average of squared gradients \|\| Hinton (lecture notes), 2012		\| '''RMSProp''' \|\| Fixes {{Term\|adagrad}}'s diminishing rates using a moving average of squared gradients \|\| Hinton (lecture notes), 2012
	\|-		\|-
	\| '''{{Term\|Adam}}''' \|\| Combines {{Term\|momentum}} with RMSProp-style adaptive rates \|\| Kingma & Ba, 2015		\| '''{{Term\|Adam}}''' \|\| Combines {{Term\|momentum}} with RMSProp-style adaptive rates \|\| Kingma & Ba, 2015
	\|-		\|-
	\| '''AdamW''' \|\| Decouples weight decay from the adaptive gradient step \|\| Loshchilov & Hutter, 2019		\| '''AdamW''' \|\| Decouples {{Term\|weight decay}} from the adaptive gradient step \|\| Loshchilov & Hutter, 2019
	\|}		\|}

	== Practical considerations ==		== Practical considerations ==

	* '''Data shuffling''' — Re-shuffle the dataset each epoch to avoid cyclic patterns.		* '''Data shuffling''' — Re-shuffle the dataset each {{Term\|epoch}} to avoid cyclic patterns.
	* '''{{Term\|gradient clipping\|Gradient clipping}}''' — Cap the gradient norm to prevent exploding updates, especially in recurrent networks.		* '''{{Term\|gradient clipping\|Gradient clipping}}''' — Cap the gradient norm to prevent exploding updates, especially in recurrent networks.
	* '''{{Term\|batch normalization\|Batch normalisation}}''' — Normalising layer inputs reduces sensitivity to the {{Term\|learning rate}}.		* '''{{Term\|batch normalization\|Batch normalisation}}''' — Normalising layer inputs reduces sensitivity to the {{Term\|learning rate}}.
Line 103:		Line 103:

	* Training deep neural networks (computer vision, NLP, speech recognition)		* Training deep neural networks (computer vision, NLP, speech recognition)
	* Large-scale linear models (logistic regression, SVMs via SGD)		* Large-scale linear models ({{Term\|logistic regression}}, SVMs via SGD)
	* Reinforcement learning policy optimisation		* Reinforcement learning policy optimisation
	* Recommendation systems and collaborative filtering		* {{Term\|recommender system\|Recommendation systems}} and collaborative filtering
	* Online learning settings where data arrives in a stream		* Online learning settings where data arrives in a stream

Line 120:		Line 120:
	* Robbins, H. and Monro, S. (1951). "A Stochastic Approximation Method". ''Annals of Mathematical Statistics''.		* Robbins, H. and Monro, S. (1951). "A Stochastic Approximation Method". ''Annals of Mathematical Statistics''.
	* Bottou, L. (2010). "Large-Scale Machine Learning with Stochastic Gradient Descent". ''COMPSTAT''.		* Bottou, L. (2010). "Large-Scale Machine Learning with Stochastic Gradient Descent". ''COMPSTAT''.
	* Kingma, D. P. and Ba, J. (2015). "~~Adam~~: A Method for Stochastic Optimization". ''ICLR''.		* Kingma, D. P. and Ba, J. (2015). "{{Term\|adam}}: A Method for Stochastic Optimization". ''ICLR''.
	* Ruder, S. (2016). "An overview of gradient descent optimization algorithms". ''arXiv:1609.04747''.		* Ruder, S. (2016). "An overview of {{Term\|gradient descent}} optimization algorithms". ''arXiv:1609.04747''.

	{{RelatedContent		{{RelatedContent