<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://marovi.ai/index.php?action=history&amp;feed=atom&amp;title=Gradient_Descent%2Fen</id>
	<title>Gradient Descent/en - Revision history</title>
	<link rel="self" type="application/atom+xml" href="https://marovi.ai/index.php?action=history&amp;feed=atom&amp;title=Gradient_Descent%2Fen"/>
	<link rel="alternate" type="text/html" href="https://marovi.ai/index.php?title=Gradient_Descent/en&amp;action=history"/>
	<updated>2026-04-28T00:30:17Z</updated>
	<subtitle>Revision history for this page on the wiki</subtitle>
	<generator>MediaWiki 1.39.1</generator>
	<entry>
		<id>https://marovi.ai/index.php?title=Gradient_Descent/en&amp;diff=18028&amp;oldid=prev</id>
		<title>FuzzyBot: Updating to match new version of source page</title>
		<link rel="alternate" type="text/html" href="https://marovi.ai/index.php?title=Gradient_Descent/en&amp;diff=18028&amp;oldid=prev"/>
		<updated>2026-04-27T23:47:19Z</updated>

		<summary type="html">&lt;p&gt;Updating to match new version of source page&lt;/p&gt;
&lt;table style=&quot;background-color: #fff; color: #202122;&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;Revision as of 23:47, 27 April 2026&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l9&quot;&gt;Line 9:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 9:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Imagine standing on a mountainside in thick fog. You cannot see the valley floor, but you can feel the slope beneath your feet. The most natural strategy is to take a step in the steepest downhill direction, then reassess. Gradient descent formalises precisely this idea: at each step, the algorithm computes the direction of steepest increase of the function (the &amp;#039;&amp;#039;&amp;#039;gradient&amp;#039;&amp;#039;&amp;#039;) and moves in the opposite direction.&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Imagine standing on a mountainside in thick fog. You cannot see the valley floor, but you can feel the slope beneath your feet. The most natural strategy is to take a step in the steepest downhill direction, then reassess. Gradient descent formalises precisely this idea: at each step, the algorithm computes the direction of steepest increase of the function (the &amp;#039;&amp;#039;&amp;#039;gradient&amp;#039;&amp;#039;&amp;#039;) and moves in the opposite direction.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;The size of each step is controlled by a scalar called the &amp;#039;&amp;#039;&amp;#039;learning rate&amp;#039;&amp;#039;&amp;#039; (often denoted &amp;lt;math&amp;gt;\eta&amp;lt;/math&amp;gt;). A large learning rate covers ground quickly but risks overshooting the minimum; a small learning rate converges more reliably but may take prohibitively many steps.&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;The size of each step is controlled by a scalar called the &amp;#039;&amp;#039;&amp;#039;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;{{Term|&lt;/ins&gt;learning rate&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;}}&lt;/ins&gt;&amp;#039;&amp;#039;&amp;#039; (often denoted &amp;lt;math&amp;gt;\eta&amp;lt;/math&amp;gt;). A large &lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;{{Term|&lt;/ins&gt;learning rate&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;}} &lt;/ins&gt;covers ground quickly but risks overshooting the minimum; a small &lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;{{Term|&lt;/ins&gt;learning rate&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;}} &lt;/ins&gt;converges more reliably but may take prohibitively many steps.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;== Mathematical formulation ==&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;== Mathematical formulation ==&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Given a differentiable objective function &amp;lt;math&amp;gt;f:\mathbb{R}^n \to \mathbb{R}&amp;lt;/math&amp;gt;, gradient descent generates a sequence of iterates by the &amp;#039;&amp;#039;&amp;#039;update rule&amp;#039;&amp;#039;&amp;#039;:&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Given a differentiable &lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;{{Term|loss function|&lt;/ins&gt;objective function&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;}} &lt;/ins&gt;&amp;lt;math&amp;gt;f:\mathbb{R}^n \to \mathbb{R}&amp;lt;/math&amp;gt;, gradient descent generates a sequence of iterates by the &amp;#039;&amp;#039;&amp;#039;update rule&amp;#039;&amp;#039;&amp;#039;:&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;:&amp;lt;math&amp;gt;\theta_{t+1} = \theta_t - \eta \, \nabla f(\theta_t)&amp;lt;/math&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;:&amp;lt;math&amp;gt;\theta_{t+1} = \theta_t - \eta \, \nabla f(\theta_t)&amp;lt;/math&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;where &amp;lt;math&amp;gt;\nabla f(\theta_t)&amp;lt;/math&amp;gt; is the gradient vector evaluated at the current point &amp;lt;math&amp;gt;\theta_t&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;\eta &amp;gt; 0&amp;lt;/math&amp;gt; is the learning rate.&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;where &amp;lt;math&amp;gt;\nabla f(\theta_t)&amp;lt;/math&amp;gt; is the gradient vector evaluated at the current point &amp;lt;math&amp;gt;\theta_t&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;\eta &amp;gt; 0&amp;lt;/math&amp;gt; is the &lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;{{Term|&lt;/ins&gt;learning rate&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;}}&lt;/ins&gt;.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;In the one-dimensional case this simplifies to:&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;In the one-dimensional case this simplifies to:&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l39&quot;&gt;Line 39:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 39:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| &amp;#039;&amp;#039;&amp;#039;Batch (full) gradient descent&amp;#039;&amp;#039;&amp;#039; || All &amp;lt;math&amp;gt;N&amp;lt;/math&amp;gt; samples || High || None&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| &amp;#039;&amp;#039;&amp;#039;Batch (full) gradient descent&amp;#039;&amp;#039;&amp;#039; || All &amp;lt;math&amp;gt;N&amp;lt;/math&amp;gt; samples || High || None&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;|-&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;|-&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| &amp;#039;&amp;#039;&amp;#039;&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;Stochastic &lt;/del&gt;gradient descent (SGD)&amp;#039;&amp;#039;&amp;#039; || 1 random sample || Low || High&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| &amp;#039;&amp;#039;&amp;#039;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;{{Term|stochastic &lt;/ins&gt;gradient descent&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;}} &lt;/ins&gt;(&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;{{Term|stochastic gradient descent|&lt;/ins&gt;SGD&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;}}&lt;/ins&gt;)&amp;#039;&amp;#039;&amp;#039; || 1 random sample || Low || High&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;|-&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;|-&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| &amp;#039;&amp;#039;&amp;#039;&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;Mini&lt;/del&gt;-batch gradient descent&amp;#039;&amp;#039;&amp;#039; || &amp;lt;math&amp;gt;B&amp;lt;/math&amp;gt; random samples (&amp;lt;math&amp;gt;1 &amp;lt; B &amp;lt; N&amp;lt;/math&amp;gt;) || Medium || Medium&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| &amp;#039;&amp;#039;&amp;#039;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;{{Term|mini&lt;/ins&gt;-batch&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;}} &lt;/ins&gt;gradient descent&amp;#039;&amp;#039;&amp;#039; || &amp;lt;math&amp;gt;B&amp;lt;/math&amp;gt; random samples (&amp;lt;math&amp;gt;1 &amp;lt; B &amp;lt; N&amp;lt;/math&amp;gt;) || Medium || Medium&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;|}&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;|}&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Full batch gradient descent computes the exact gradient and therefore follows a smooth trajectory toward the minimum. [[Stochastic Gradient Descent|Stochastic gradient descent]] uses a single sample to estimate the gradient, drastically reducing computation per step at the cost of a noisier trajectory. &lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;Mini&lt;/del&gt;-batch gradient descent strikes a balance and is the most common choice in practice, with typical batch sizes between 32 and 512.&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Full batch gradient descent computes the exact gradient and therefore follows a smooth trajectory toward the minimum. [[Stochastic Gradient Descent|Stochastic gradient descent]] uses a single sample to estimate the gradient, drastically reducing computation per step at the cost of a noisier trajectory. &lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;{{Term|mini&lt;/ins&gt;-batch&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;}} &lt;/ins&gt;gradient descent strikes a balance and is the most common choice in practice, with typical batch sizes between 32 and 512.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;== Convergence ==&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;== Convergence ==&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l50&quot;&gt;Line 50:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 50:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;=== Convex functions ===&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;=== Convex functions ===&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;For a convex function with Lipschitz-continuous gradients (constant &amp;lt;math&amp;gt;L&amp;lt;/math&amp;gt;), gradient descent with a fixed learning rate &amp;lt;math&amp;gt;\eta \leq 1/L&amp;lt;/math&amp;gt; converges at a rate of &amp;lt;math&amp;gt;O(1/t)&amp;lt;/math&amp;gt;. If the function is additionally &amp;#039;&amp;#039;&amp;#039;strongly convex&amp;#039;&amp;#039;&amp;#039; with parameter &amp;lt;math&amp;gt;\mu &amp;gt; 0&amp;lt;/math&amp;gt;, convergence accelerates to a linear (exponential) rate:&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;For a convex function with Lipschitz-continuous gradients (constant &amp;lt;math&amp;gt;L&amp;lt;/math&amp;gt;), gradient descent with a fixed &lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;{{Term|&lt;/ins&gt;learning rate&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;}} &lt;/ins&gt;&amp;lt;math&amp;gt;\eta \leq 1/L&amp;lt;/math&amp;gt; converges at a rate of &amp;lt;math&amp;gt;O(1/t)&amp;lt;/math&amp;gt;. If the function is additionally &amp;#039;&amp;#039;&amp;#039;strongly convex&amp;#039;&amp;#039;&amp;#039; with parameter &amp;lt;math&amp;gt;\mu &amp;gt; 0&amp;lt;/math&amp;gt;, &lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;{{Term|&lt;/ins&gt;convergence&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;}} &lt;/ins&gt;accelerates to a linear (exponential) rate:&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;:&amp;lt;math&amp;gt;f(\theta_t) - f(\theta^*) \leq \left(1 - \frac{\mu}{L}\right)^t \bigl(f(\theta_0) - f(\theta^*)\bigr)&amp;lt;/math&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;:&amp;lt;math&amp;gt;f(\theta_t) - f(\theta^*) \leq \left(1 - \frac{\mu}{L}\right)^t \bigl(f(\theta_0) - f(\theta^*)\bigr)&amp;lt;/math&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l62&quot;&gt;Line 62:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 62:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;== Learning rate selection ==&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;== Learning rate selection ==&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Choosing the learning rate is one of the most important practical decisions:&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Choosing the &lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;{{Term|&lt;/ins&gt;learning rate&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;}} &lt;/ins&gt;is one of the most important practical decisions:&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;* &amp;#039;&amp;#039;&amp;#039;Too large&amp;#039;&amp;#039;&amp;#039; — the iterates oscillate or diverge.&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;* &amp;#039;&amp;#039;&amp;#039;Too large&amp;#039;&amp;#039;&amp;#039; — the iterates oscillate or diverge.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;* &amp;#039;&amp;#039;&amp;#039;Too small&amp;#039;&amp;#039;&amp;#039; — convergence is unacceptably slow.&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;* &amp;#039;&amp;#039;&amp;#039;Too small&amp;#039;&amp;#039;&amp;#039; — &lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;{{Term|&lt;/ins&gt;convergence&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;}} &lt;/ins&gt;is unacceptably slow.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;* &amp;#039;&amp;#039;&amp;#039;&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;Learning &lt;/del&gt;rate schedules&amp;#039;&amp;#039;&amp;#039; — many practitioners start with a larger rate and reduce it over time (step decay, exponential decay, cosine annealing).&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;* &amp;#039;&amp;#039;&amp;#039;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;{{Term|learning &lt;/ins&gt;rate&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;}} &lt;/ins&gt;schedules&amp;#039;&amp;#039;&amp;#039; — many practitioners start with a larger rate and reduce it over time (step decay, exponential decay, cosine annealing).&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;* &amp;#039;&amp;#039;&amp;#039;Line search&amp;#039;&amp;#039;&amp;#039; — classical numerical methods choose &amp;lt;math&amp;gt;\eta&amp;lt;/math&amp;gt; at each step to satisfy conditions such as the Wolfe or Armijo conditions, though this is rare in deep learning.&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;* &amp;#039;&amp;#039;&amp;#039;Line search&amp;#039;&amp;#039;&amp;#039; — classical numerical methods choose &amp;lt;math&amp;gt;\eta&amp;lt;/math&amp;gt; at each step to satisfy conditions such as the Wolfe or Armijo conditions, though this is rare in &lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;{{Term|&lt;/ins&gt;deep learning&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;}}&lt;/ins&gt;.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;A common heuristic is to try several values on a logarithmic scale (e.g. &amp;lt;math&amp;gt;10^{-1}, 10^{-2}, 10^{-3}&amp;lt;/math&amp;gt;) and pick the one that reduces the loss fastest without instability.&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;A common heuristic is to try several values on a logarithmic scale (e.g. &amp;lt;math&amp;gt;10^{-1}, 10^{-2}, 10^{-3}&amp;lt;/math&amp;gt;) and pick the one that reduces the loss fastest without instability.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l75&quot;&gt;Line 75:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 75:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Several important modifications address limitations of vanilla gradient descent:&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Several important modifications address limitations of vanilla gradient descent:&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;* &amp;#039;&amp;#039;&amp;#039;&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;Momentum&lt;/del&gt;&amp;#039;&amp;#039;&amp;#039; — accumulates a velocity vector from past gradients, helping to accelerate convergence in ravine-like landscapes.&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;* &amp;#039;&amp;#039;&amp;#039;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;{{Term|momentum}}&lt;/ins&gt;&amp;#039;&amp;#039;&amp;#039; — accumulates a velocity vector from past gradients, helping to accelerate &lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;{{Term|&lt;/ins&gt;convergence&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;}} &lt;/ins&gt;in ravine-like landscapes.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;* &amp;#039;&amp;#039;&amp;#039;Nesterov accelerated gradient&amp;#039;&amp;#039;&amp;#039; — a momentum variant that evaluates the gradient at a look-ahead position, yielding better theoretical convergence rates.&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;* &amp;#039;&amp;#039;&amp;#039;Nesterov accelerated gradient&amp;#039;&amp;#039;&amp;#039; — a &lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;{{Term|&lt;/ins&gt;momentum&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;}} &lt;/ins&gt;variant that evaluates the gradient at a look-ahead position, yielding better theoretical &lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;{{Term|&lt;/ins&gt;convergence&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;}} &lt;/ins&gt;rates.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;* &amp;#039;&amp;#039;&amp;#039;Adaptive methods&amp;#039;&amp;#039;&amp;#039; (&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;Adagrad&lt;/del&gt;, RMSProp, &lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;Adam&lt;/del&gt;) — maintain per-parameter learning rates that adapt based on the history of gradients.&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;* &amp;#039;&amp;#039;&amp;#039;Adaptive methods&amp;#039;&amp;#039;&amp;#039; (&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;{{Term|adagrad}}&lt;/ins&gt;, RMSProp, &lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;{{Term|adam}}&lt;/ins&gt;) — maintain per-parameter &lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;{{Term|learning rate|&lt;/ins&gt;learning rates&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;}} &lt;/ins&gt;that adapt based on the history of gradients.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;* &amp;#039;&amp;#039;&amp;#039;Second-order methods&amp;#039;&amp;#039;&amp;#039; — algorithms like Newton&amp;#039;s method and L-BFGS use curvature information (the Hessian or its approximation) for faster convergence, but are often too expensive for large-scale problems.&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;* &amp;#039;&amp;#039;&amp;#039;Second-order methods&amp;#039;&amp;#039;&amp;#039; — algorithms like Newton&amp;#039;s method and L-BFGS use curvature information (the Hessian or its approximation) for faster &lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;{{Term|&lt;/ins&gt;convergence&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;}}&lt;/ins&gt;, but are often too expensive for large-scale problems.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;== Practical tips ==&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;== Practical tips ==&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;* &amp;#039;&amp;#039;&amp;#039;Feature scaling&amp;#039;&amp;#039;&amp;#039; — normalising input features so they have similar ranges dramatically improves convergence, because the loss surface becomes more isotropic.&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;* &amp;#039;&amp;#039;&amp;#039;Feature scaling&amp;#039;&amp;#039;&amp;#039; — normalising input features so they have similar ranges dramatically improves &lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;{{Term|&lt;/ins&gt;convergence&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;}}&lt;/ins&gt;, because the loss surface becomes more isotropic.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;* &amp;#039;&amp;#039;&amp;#039;&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;Gradient &lt;/del&gt;clipping&amp;#039;&amp;#039;&amp;#039; — capping the norm of the gradient prevents excessively large updates.&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;* &amp;#039;&amp;#039;&amp;#039;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;{{Term|gradient &lt;/ins&gt;clipping&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;}}&lt;/ins&gt;&amp;#039;&amp;#039;&amp;#039; — capping the norm of the gradient prevents excessively large updates.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;* &amp;#039;&amp;#039;&amp;#039;Random initialisation&amp;#039;&amp;#039;&amp;#039; — starting from a reasonable random initialisation (e.g. Xavier or He initialisation for neural networks) avoids symmetry-breaking issues.&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;* &amp;#039;&amp;#039;&amp;#039;Random initialisation&amp;#039;&amp;#039;&amp;#039; — starting from a reasonable random initialisation (e.g. Xavier or He initialisation for neural networks) avoids symmetry-breaking issues.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;* &amp;#039;&amp;#039;&amp;#039;Monitoring the loss curve&amp;#039;&amp;#039;&amp;#039; — plotting the training loss over iterations is the simplest diagnostic: a smoothly decreasing curve indicates healthy training; oscillations suggest the learning rate is too high.&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;* &amp;#039;&amp;#039;&amp;#039;Monitoring the loss curve&amp;#039;&amp;#039;&amp;#039; — plotting the training loss over iterations is the simplest diagnostic: a smoothly decreasing curve indicates healthy training; oscillations suggest the &lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;{{Term|&lt;/ins&gt;learning rate&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;}} &lt;/ins&gt;is too high.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;== Applications ==&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;== Applications ==&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;</summary>
		<author><name>FuzzyBot</name></author>
	</entry>
	<entry>
		<id>https://marovi.ai/index.php?title=Gradient_Descent/en&amp;diff=14709&amp;oldid=prev</id>
		<title>FuzzyBot: Updating to match new version of source page</title>
		<link rel="alternate" type="text/html" href="https://marovi.ai/index.php?title=Gradient_Descent/en&amp;diff=14709&amp;oldid=prev"/>
		<updated>2026-04-27T22:01:43Z</updated>

		<summary type="html">&lt;p&gt;Updating to match new version of source page&lt;/p&gt;
&lt;a href=&quot;https://marovi.ai/index.php?title=Gradient_Descent/en&amp;amp;diff=14709&amp;amp;oldid=13180&quot;&gt;Show changes&lt;/a&gt;</summary>
		<author><name>FuzzyBot</name></author>
	</entry>
	<entry>
		<id>https://marovi.ai/index.php?title=Gradient_Descent/en&amp;diff=13180&amp;oldid=prev</id>
		<title>FuzzyBot: Updating to match new version of source page</title>
		<link rel="alternate" type="text/html" href="https://marovi.ai/index.php?title=Gradient_Descent/en&amp;diff=13180&amp;oldid=prev"/>
		<updated>2026-04-27T19:42:39Z</updated>

		<summary type="html">&lt;p&gt;Updating to match new version of source page&lt;/p&gt;
&lt;a href=&quot;https://marovi.ai/index.php?title=Gradient_Descent/en&amp;amp;diff=13180&amp;amp;oldid=4429&quot;&gt;Show changes&lt;/a&gt;</summary>
		<author><name>FuzzyBot</name></author>
	</entry>
	<entry>
		<id>https://marovi.ai/index.php?title=Gradient_Descent/en&amp;diff=4429&amp;oldid=prev</id>
		<title>FuzzyBot: Updating to match new version of source page</title>
		<link rel="alternate" type="text/html" href="https://marovi.ai/index.php?title=Gradient_Descent/en&amp;diff=4429&amp;oldid=prev"/>
		<updated>2026-04-27T02:37:49Z</updated>

		<summary type="html">&lt;p&gt;Updating to match new version of source page&lt;/p&gt;
&lt;table style=&quot;background-color: #fff; color: #202122;&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;Revision as of 02:37, 27 April 2026&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l1&quot;&gt;Line 1:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 1:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&amp;lt;languages /&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&amp;lt;languages /&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;{{LanguageBar | page = Gradient Descent}}&lt;/del&gt;&lt;/div&gt;&lt;/td&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-added&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;{{ArticleInfobox | topic_area = Optimization | difficulty = Introductory | prerequisites = }}&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;{{ArticleInfobox | topic_area = Optimization | difficulty = Introductory | prerequisites = }}&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;{{ContentMeta | generated_by = claude-opus | model_used = claude-opus-4-6 | generated_date = 2026-03-13}}&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;{{ContentMeta | generated_by = claude-opus | model_used = claude-opus-4-6 | generated_date = 2026-03-13}}&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;</summary>
		<author><name>FuzzyBot</name></author>
	</entry>
	<entry>
		<id>https://marovi.ai/index.php?title=Gradient_Descent/en&amp;diff=2681&amp;oldid=prev</id>
		<title>FuzzyBot: Updating to match new version of source page</title>
		<link rel="alternate" type="text/html" href="https://marovi.ai/index.php?title=Gradient_Descent/en&amp;diff=2681&amp;oldid=prev"/>
		<updated>2026-04-27T00:30:40Z</updated>

		<summary type="html">&lt;p&gt;Updating to match new version of source page&lt;/p&gt;
&lt;p&gt;&lt;b&gt;New page&lt;/b&gt;&lt;/p&gt;&lt;div&gt;&amp;lt;languages /&amp;gt;&lt;br /&gt;
{{LanguageBar | page = Gradient Descent}}&lt;br /&gt;
{{ArticleInfobox | topic_area = Optimization | difficulty = Introductory | prerequisites = }}&lt;br /&gt;
{{ContentMeta | generated_by = claude-opus | model_used = claude-opus-4-6 | generated_date = 2026-03-13}}&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;Gradient descent&amp;#039;&amp;#039;&amp;#039; is a first-order iterative optimisation algorithm for finding a local minimum of a differentiable function. It is the foundation of nearly all modern machine-learning training procedures, from simple linear regression to billion-parameter deep neural networks.&lt;br /&gt;
&lt;br /&gt;
== Intuition ==&lt;br /&gt;
&lt;br /&gt;
Imagine standing on a mountainside in thick fog. You cannot see the valley floor, but you can feel the slope beneath your feet. The most natural strategy is to take a step in the steepest downhill direction, then reassess. Gradient descent formalises precisely this idea: at each step, the algorithm computes the direction of steepest increase of the function (the &amp;#039;&amp;#039;&amp;#039;gradient&amp;#039;&amp;#039;&amp;#039;) and moves in the opposite direction.&lt;br /&gt;
&lt;br /&gt;
The size of each step is controlled by a scalar called the &amp;#039;&amp;#039;&amp;#039;learning rate&amp;#039;&amp;#039;&amp;#039; (often denoted &amp;lt;math&amp;gt;\eta&amp;lt;/math&amp;gt;). A large learning rate covers ground quickly but risks overshooting the minimum; a small learning rate converges more reliably but may take prohibitively many steps.&lt;br /&gt;
&lt;br /&gt;
== Mathematical formulation ==&lt;br /&gt;
&lt;br /&gt;
Given a differentiable objective function &amp;lt;math&amp;gt;f:\mathbb{R}^n \to \mathbb{R}&amp;lt;/math&amp;gt;, gradient descent generates a sequence of iterates by the &amp;#039;&amp;#039;&amp;#039;update rule&amp;#039;&amp;#039;&amp;#039;:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\theta_{t+1} = \theta_t - \eta \, \nabla f(\theta_t)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;math&amp;gt;\nabla f(\theta_t)&amp;lt;/math&amp;gt; is the gradient vector evaluated at the current point &amp;lt;math&amp;gt;\theta_t&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;\eta &amp;gt; 0&amp;lt;/math&amp;gt; is the learning rate.&lt;br /&gt;
&lt;br /&gt;
In the one-dimensional case this simplifies to:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\theta_{t+1} = \theta_t - \eta \, f&amp;#039;(\theta_t)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The gradient &amp;lt;math&amp;gt;\nabla f&amp;lt;/math&amp;gt; points in the direction of steepest ascent, so subtracting it moves the iterate downhill.&lt;br /&gt;
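In Python, the update rule above can be sketched as follows (a minimal illustration, not part of the original page; the quadratic objective and all names are chosen here for demonstration):

```python
# Sketch of the update rule theta_{t+1} = theta_t - eta * grad_f(theta_t),
# applied to the toy quadratic f(theta) = (theta - 3)^2, whose gradient
# is grad_f(theta) = 2 * (theta - 3) and whose minimiser is theta = 3.
def gradient_descent(grad_f, theta0, eta=0.1, steps=100):
    theta = theta0
    for _ in range(steps):
        theta = theta - eta * grad_f(theta)  # step opposite the gradient
    return theta

grad_f = lambda theta: 2.0 * (theta - 3.0)
theta_star = gradient_descent(grad_f, theta0=0.0)
print(theta_star)  # approaches the minimiser 3.0
```

With this learning rate the error to the minimiser shrinks by a constant factor each step, matching the linear-rate behaviour described later for strongly convex functions.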
&lt;br /&gt;
== Batch, stochastic, and mini-batch variants ==&lt;br /&gt;
&lt;br /&gt;
When the objective has the form of an average over data points,&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;f(\theta) = \frac{1}{N}\sum_{i=1}^{N} \ell(\theta;\, x_i, y_i)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
three common strategies differ in how much data is used to estimate the gradient:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! Variant !! Gradient computed over !! Per-step cost !! Gradient noise&lt;br /&gt;
|-&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;Batch (full) gradient descent&amp;#039;&amp;#039;&amp;#039; || All &amp;lt;math&amp;gt;N&amp;lt;/math&amp;gt; samples || High || None&lt;br /&gt;
|-&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;Stochastic gradient descent (SGD)&amp;#039;&amp;#039;&amp;#039; || 1 random sample || Low || High&lt;br /&gt;
|-&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;Mini-batch gradient descent&amp;#039;&amp;#039;&amp;#039; || &amp;lt;math&amp;gt;B&amp;lt;/math&amp;gt; random samples (&amp;lt;math&amp;gt;1 &amp;lt; B &amp;lt; N&amp;lt;/math&amp;gt;) || Medium || Medium&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Full batch gradient descent computes the exact gradient and therefore follows a smooth trajectory toward the minimum. [[Stochastic Gradient Descent|Stochastic gradient descent]] uses a single sample to estimate the gradient, drastically reducing computation per step at the cost of a noisier trajectory. Mini-batch gradient descent strikes a balance and is the most common choice in practice, with typical batch sizes between 32 and 512.&lt;br /&gt;
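The mini-batch strategy from the table can be sketched as below (a hypothetical least-squares example; the data, batch size, and step count are illustrative assumptions, not from the original page):

```python
import numpy as np

# Mini-batch gradient descent for least squares:
# f(theta) = (1/N) * sum_i (x_i . theta - y_i)^2, estimated on B random samples.
rng = np.random.default_rng(0)
N, d, B = 1000, 5, 32
X = rng.normal(size=(N, d))
true_theta = rng.normal(size=d)
y = X @ true_theta                             # noiseless targets for the demo

theta = np.zeros(d)
eta = 0.05
for step in range(2000):
    idx = rng.integers(0, N, size=B)           # draw a random mini-batch
    Xb, yb = X[idx], y[idx]
    grad = (2.0 / B) * Xb.T @ (Xb @ theta - yb)  # gradient of the batch loss
    theta -= eta * grad
print(np.linalg.norm(theta - true_theta))      # residual error, small after training
```

Setting B = N recovers full-batch gradient descent and B = 1 recovers SGD; only the size of `idx` changes.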
&lt;br /&gt;
== Convergence ==&lt;br /&gt;
&lt;br /&gt;
=== Convex functions ===&lt;br /&gt;
&lt;br /&gt;
For a convex function with Lipschitz-continuous gradients (constant &amp;lt;math&amp;gt;L&amp;lt;/math&amp;gt;), gradient descent with a fixed learning rate &amp;lt;math&amp;gt;\eta \leq 1/L&amp;lt;/math&amp;gt; converges at a rate of &amp;lt;math&amp;gt;O(1/t)&amp;lt;/math&amp;gt;. If the function is additionally &amp;#039;&amp;#039;&amp;#039;strongly convex&amp;#039;&amp;#039;&amp;#039; with parameter &amp;lt;math&amp;gt;\mu &amp;gt; 0&amp;lt;/math&amp;gt;, convergence accelerates to a linear (exponential) rate:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;f(\theta_t) - f(\theta^*) \leq \left(1 - \frac{\mu}{L}\right)^t \bigl(f(\theta_0) - f(\theta^*)\bigr)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The ratio &amp;lt;math&amp;gt;\kappa = L / \mu&amp;lt;/math&amp;gt; is called the &amp;#039;&amp;#039;&amp;#039;condition number&amp;#039;&amp;#039;&amp;#039; and governs how quickly the algorithm converges. Ill-conditioned problems (large &amp;lt;math&amp;gt;\kappa&amp;lt;/math&amp;gt;) converge slowly.&lt;br /&gt;
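&lt;br /&gt;
The role of the condition number can be seen on a two-dimensional quadratic whose Hessian eigenvalues are &amp;lt;math&amp;gt;L&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;\mu&amp;lt;/math&amp;gt;; the constants below are illustrative:&lt;br /&gt;
&lt;br /&gt;
```python
# f(x, y) = 0.5 * (L*x^2 + mu*y^2) has Hessian eigenvalues L and mu,
# so kappa = L/mu = 10. With eta = 1/L, the error along the stiff
# x-direction vanishes in one step, while the flat y-direction decays
# only by the factor (1 - mu/L) = 0.9 per step.

L_const, mu = 10.0, 1.0   # illustrative curvatures
eta = 1.0 / L_const

x, y = 1.0, 1.0
for _ in range(50):
    x -= eta * L_const * x   # partial derivative in x is L*x
    y -= eta * mu * y        # partial derivative in y is mu*y

print(x, y)  # x is 0; y decays as 0.9**t
```
&lt;br /&gt;
The slow direction dominates: the overall error shrinks at the rate &amp;lt;math&amp;gt;(1 - 1/\kappa)^t&amp;lt;/math&amp;gt;, which is exactly why large condition numbers mean slow convergence.&lt;br /&gt;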
&lt;br /&gt;
=== Non-convex functions ===&lt;br /&gt;
&lt;br /&gt;
Most deep-learning objectives are non-convex. In this setting gradient descent is only guaranteed to converge to a stationary point (where &amp;lt;math&amp;gt;\nabla f = 0&amp;lt;/math&amp;gt;), which could be a local minimum, saddle point, or even a local maximum. In practice, saddle points are more problematic than local minima in high-dimensional spaces.&lt;br /&gt;
&lt;br /&gt;
== Learning rate selection ==&lt;br /&gt;
&lt;br /&gt;
Choosing the learning rate is one of the most important practical decisions:&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Too large&amp;#039;&amp;#039;&amp;#039; — the iterates oscillate or diverge.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Too small&amp;#039;&amp;#039;&amp;#039; — convergence is unacceptably slow.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Learning rate schedules&amp;#039;&amp;#039;&amp;#039; — many practitioners start with a larger rate and reduce it over time (step decay, exponential decay, cosine annealing).&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Line search&amp;#039;&amp;#039;&amp;#039; — classical numerical methods choose &amp;lt;math&amp;gt;\eta&amp;lt;/math&amp;gt; at each step to satisfy conditions such as the Wolfe or Armijo conditions, though this is rare in deep learning.&lt;br /&gt;
&lt;br /&gt;
A common heuristic is to try several values on a logarithmic scale (e.g. &amp;lt;math&amp;gt;10^{-1}, 10^{-2}, 10^{-3}&amp;lt;/math&amp;gt;) and pick the one that reduces the loss fastest without instability.&lt;br /&gt;
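&lt;br /&gt;
The sweep can be sketched on a toy objective; the candidate rates and step budget below are illustrative:&lt;br /&gt;
&lt;br /&gt;
```python
# Try candidate learning rates on a log scale and keep the one with the
# lowest loss after a fixed step budget, on f(theta) = theta^2.

def loss_after(eta, steps=20, theta0=1.0):
    theta = theta0
    for _ in range(steps):
        theta -= eta * 2.0 * theta  # gradient of theta^2 is 2*theta
    return theta ** 2

candidates = [1e-1, 1e-2, 1e-3]   # illustrative log-scale grid
results = {eta: loss_after(eta) for eta in candidates}
best = min(results, key=results.get)
print(results, best)  # the largest stable rate, 0.1, wins here
```
&lt;br /&gt;
In practice the same loop would run a few epochs of training per candidate and also discard any rate whose loss diverges or oscillates.&lt;br /&gt;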
&lt;br /&gt;
== Extensions and improvements ==&lt;br /&gt;
&lt;br /&gt;
Several important modifications address limitations of vanilla gradient descent:&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Momentum&amp;#039;&amp;#039;&amp;#039; — accumulates a velocity vector from past gradients, helping to accelerate convergence in ravine-like landscapes.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Nesterov accelerated gradient&amp;#039;&amp;#039;&amp;#039; — a momentum variant that evaluates the gradient at a look-ahead position, yielding better theoretical convergence rates.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Adaptive methods&amp;#039;&amp;#039;&amp;#039; (Adagrad, RMSProp, Adam) — maintain per-parameter learning rates that adapt based on the history of gradients.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Second-order methods&amp;#039;&amp;#039;&amp;#039; — algorithms like Newton&amp;#039;s method and L-BFGS use curvature information (the Hessian or its approximation) for faster convergence, but are often too expensive for large-scale problems.&lt;br /&gt;
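&lt;br /&gt;
As one example from the list above, the momentum update keeps a velocity that accumulates past gradients. This is one common formulation (the heavy-ball form); the objective and hyperparameters are illustrative:&lt;br /&gt;
&lt;br /&gt;
```python
# Heavy-ball momentum: v <- beta*v + grad(theta); theta <- theta - eta*v.
# Past gradients that agree in direction build up speed.

def grad(theta):
    return 2.0 * (theta - 3.0)  # gradient of (theta - 3)^2

theta, v = 0.0, 0.0
eta, beta = 0.05, 0.9   # illustrative hyperparameters
for _ in range(500):
    v = beta * v + grad(theta)
    theta = theta - eta * v

print(theta)  # close to the minimiser 3.0
```
&lt;br /&gt;
Setting &amp;lt;math&amp;gt;\beta = 0&amp;lt;/math&amp;gt; recovers plain gradient descent; typical values in deep learning are around 0.9.&lt;br /&gt;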
&lt;br /&gt;
== Practical tips ==&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Feature scaling&amp;#039;&amp;#039;&amp;#039; — normalising input features so they have similar ranges dramatically improves convergence, because the loss surface becomes more isotropic.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Gradient clipping&amp;#039;&amp;#039;&amp;#039; — capping the norm of the gradient prevents excessively large updates.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Random initialisation&amp;#039;&amp;#039;&amp;#039; — starting from a reasonable random initialisation (e.g. Xavier or He initialisation for neural networks) breaks the symmetry between parameters; if all weights start identical, they receive identical gradients and never differentiate.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Monitoring the loss curve&amp;#039;&amp;#039;&amp;#039; — plotting the training loss over iterations is the simplest diagnostic: a smoothly decreasing curve indicates healthy training; oscillations suggest the learning rate is too high.&lt;br /&gt;
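&lt;br /&gt;
Gradient clipping by global norm, for instance, rescales the gradient whenever its norm exceeds a threshold; the threshold here is illustrative:&lt;br /&gt;
&lt;br /&gt;
```python
import math

# Clip a gradient vector g so that ||g|| never exceeds max_norm:
# if the norm is too large, rescale g to have norm exactly max_norm.

def clip_by_norm(g, max_norm):
    norm = math.sqrt(sum(x * x for x in g))
    if norm > max_norm:
        scale = max_norm / norm
        return [x * scale for x in g]
    return g

g = [30.0, 40.0]                 # norm 50
clipped = clip_by_norm(g, 5.0)   # rescaled to norm 5
print(clipped)
```
&lt;br /&gt;
Clipping preserves the gradient direction and only shrinks its magnitude, which is why it stabilises training without biasing the descent direction.&lt;br /&gt;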
&lt;br /&gt;
== Applications ==&lt;br /&gt;
&lt;br /&gt;
Gradient descent and its variants are used throughout science and engineering:&lt;br /&gt;
&lt;br /&gt;
* Training machine-learning models (linear models, neural networks, support vector machines)&lt;br /&gt;
* Signal processing and control systems&lt;br /&gt;
* Inverse problems in physics and imaging&lt;br /&gt;
* Operations research and logistics optimisation&lt;br /&gt;
* Economics and game-theoretic equilibrium computation&lt;br /&gt;
&lt;br /&gt;
== See also ==&lt;br /&gt;
&lt;br /&gt;
* [[Stochastic Gradient Descent]]&lt;br /&gt;
* [[Backpropagation]]&lt;br /&gt;
* [[Loss Functions]]&lt;br /&gt;
* [[Neural Networks]]&lt;br /&gt;
* [[Overfitting and Regularization]]&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
* Cauchy, A. (1847). &amp;quot;Méthode générale pour la résolution des systèmes d&amp;#039;équations simultanées&amp;quot;. &amp;#039;&amp;#039;Comptes Rendus de l&amp;#039;Académie des Sciences&amp;#039;&amp;#039;.&lt;br /&gt;
* Boyd, S. and Vandenberghe, L. (2004). &amp;#039;&amp;#039;Convex Optimization&amp;#039;&amp;#039;. Cambridge University Press.&lt;br /&gt;
* Ruder, S. (2016). &amp;quot;An overview of gradient descent optimization algorithms&amp;quot;. &amp;#039;&amp;#039;arXiv:1609.04747&amp;#039;&amp;#039;.&lt;br /&gt;
* Goodfellow, I., Bengio, Y. and Courville, A. (2016). &amp;#039;&amp;#039;Deep Learning&amp;#039;&amp;#039;, Chapter 8. MIT Press.&lt;br /&gt;
&lt;br /&gt;
[[Category:Optimization]]&lt;br /&gt;
[[Category:Introductory]]&lt;/div&gt;</summary>
		<author><name>FuzzyBot</name></author>
	</entry>
</feed>