Gradient Descent: Difference between revisions

Article
Topic area	Optimization
Difficulty	Introductory

Revision as of 19:40, 27 April 2026

Other languages:

English
Español
中文

gradient descent is a first-order iterative optimisation algorithm for finding a local minimum of a differentiable function. It is the foundation of nearly all modern machine-learning training procedures, from simple linear regression to billion-parameter deep neural networks.

Intuition

Imagine standing on a mountainside in thick fog. You cannot see the valley floor, but you can feel the slope beneath your feet. The most natural strategy is to take a step in the steepest downhill direction, then reassess. gradient descent formalises precisely this idea: at each step, the algorithm computes the direction of steepest increase of the function (the gradient) and moves in the opposite direction.

The size of each step is controlled by a scalar called the learning rate (often denoted $\eta$ ). A large learning rate covers ground quickly but risks overshooting the minimum; a small learning rate converges more reliably but may take prohibitively many steps.

Mathematical formulation

Given a differentiable objective function $f:\mathbb{R}^n \to \mathbb{R}$ , gradient descent generates a sequence of iterates by the update rule:

\theta_{t+1} = \theta_t - \eta \, \nabla f(\theta_t)

where $\nabla f(\theta_t)$ is the gradient vector evaluated at the current point $\theta_t$ and $\eta > 0$ is the learning rate.

In the one-dimensional case this simplifies to:

\theta_{t+1} = \theta_t - \eta \, f'(\theta_t)

The gradient $\nabla f$ points in the direction of steepest ascent, so subtracting it moves the iterate downhill.

Batch, stochastic, and mini-batch variants

When the objective has the form of an average over data points,

f(\theta) = \frac{1}{N}\sum_{i=1}^{N} \ell(\theta;\, x_i, y_i)

three common strategies differ in how much data is used to estimate the gradient:

Variant	Gradient computed over	Per-step cost	Gradient noise
Batch (full) gradient descent	All $$ N $$ samples	High	None
stochastic gradient descent (SGD)	1 random sample	Low	High
mini-batch gradient descent	$$ B $$ random samples ( $$ 1 < B < N $$ )	Medium	Medium

Full batch gradient descent computes the exact gradient and therefore follows a smooth trajectory toward the minimum. Stochastic gradient descent uses a single sample to estimate the gradient, drastically reducing computation per step at the cost of a noisier trajectory. mini-batch gradient descent strikes a balance and is the most common choice in practice, with typical batch sizes between 32 and 512.

Convergence

Convex functions

For a convex function with Lipschitz-continuous gradients (constant $$ L $$ ), gradient descent with a fixed learning rate $\eta \leq 1/L$ converges at a rate of $$ O(1/t) $$ . If the function is additionally strongly convex with parameter $\mu > 0$ , convergence accelerates to a linear (exponential) rate:

f(\theta_t) - f(\theta^*) \leq \left(1 - \frac{\mu}{L}\right)^t \bigl(f(\theta_0) - f(\theta^*)\bigr)

The ratio $\kappa = L / \mu$ is called the condition number and governs how quickly the algorithm converges. Ill-conditioned problems (large $\kappa$ ) converge slowly.

Non-convex functions

Most deep-learning objectives are non-convex. In this setting gradient descent is only guaranteed to converge to a stationary point (where $\nabla f = 0$ ), which could be a local minimum, saddle point, or even a local maximum. In practice, saddle points are more problematic than local minima in high-dimensional spaces.

Learning rate selection

Choosing the learning rate is one of the most important practical decisions:

Too large — the iterates oscillate or diverge.
Too small — convergence is unacceptably slow.
learning rate schedules — many practitioners start with a larger rate and reduce it over time (step decay, exponential decay, cosine annealing).
Line search — classical numerical methods choose $\eta$ at each step to satisfy conditions such as the Wolfe or Armijo conditions, though this is rare in deep learning.

A common heuristic is to try several values on a logarithmic scale (e.g. $10^{-1}, 10^{-2}, 10^{-3}$ ) and pick the one that reduces the loss fastest without instability.

Extensions and improvements

Several important modifications address limitations of vanilla gradient descent:

momentum — accumulates a velocity vector from past gradients, helping to accelerate convergence in ravine-like landscapes.
Nesterov accelerated gradient — a momentum variant that evaluates the gradient at a look-ahead position, yielding better theoretical convergence rates.
Adaptive methods (adagrad, RMSProp, adam) — maintain per-parameter learning rates that adapt based on the history of gradients.
Second-order methods — algorithms like Newton's method and L-BFGS use curvature information (the Hessian or its approximation) for faster convergence, but are often too expensive for large-scale problems.

Practical tips

Feature scaling — normalising input features so they have similar ranges dramatically improves convergence, because the loss surface becomes more isotropic.
gradient clipping — capping the norm of the gradient prevents excessively large updates.
Random initialisation — starting from a reasonable random initialisation (e.g. Xavier or He initialisation for neural networks) avoids symmetry-breaking issues.
Monitoring the loss curve — plotting the training loss over iterations is the simplest diagnostic: a smoothly decreasing curve indicates healthy training; oscillations suggest the learning rate is too high.

Applications

gradient descent and its variants are used throughout science and engineering:

Training machine-learning models (linear models, neural networks, support vector machines)
Signal processing and control systems
Inverse problems in physics and imaging
Operations research and logistics optimisation
Economics and game-theoretic equilibrium computation

References

Cauchy, A. (1847). "Méthode générale pour la résolution des systèmes d'équations simultanées". Comptes Rendus de l'Académie des Sciences.
Boyd, S. and Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press.
Ruder, S. (2016). "An overview of gradient descent optimization algorithms". arXiv:1609.04747.
Goodfellow, I., Bengio, Y. and Courville, A. (2016). Deep Learning, Chapter 8. MIT Press.

@@ Line 5: / Line 5: @@
 <translate>
 <!--T:1-->
-'''Gradient descent''' is a first-order iterative optimisation algorithm for finding a local minimum of a differentiable function. It is the foundation of nearly all modern machine-learning training procedures, from simple linear regression to billion-parameter deep neural networks.
+'''{{Term|gradient descent}}''' is a first-order iterative optimisation algorithm for finding a local minimum of a differentiable function. It is the foundation of nearly all modern machine-learning training procedures, from simple linear regression to billion-parameter deep neural networks.
 <!--T:2-->
@@ Line 11: / Line 11: @@
 <!--T:3-->
-Imagine standing on a mountainside in thick fog. You cannot see the valley floor, but you can feel the slope beneath your feet. The most natural strategy is to take a step in the steepest downhill direction, then reassess. Gradient descent formalises precisely this idea: at each step, the algorithm computes the direction of steepest increase of the function (the '''gradient''') and moves in the opposite direction.
+Imagine standing on a mountainside in thick fog. You cannot see the valley floor, but you can feel the slope beneath your feet. The most natural strategy is to take a step in the steepest downhill direction, then reassess. {{Term|gradient descent}} formalises precisely this idea: at each step, the algorithm computes the direction of steepest increase of the function (the '''gradient''') and moves in the opposite direction.
 <!--T:4-->
-The size of each step is controlled by a scalar called the '''learning rate''' (often denoted <math>\eta</math>). A large learning rate covers ground quickly but risks overshooting the minimum; a small learning rate converges more reliably but may take prohibitively many steps.
+The size of each step is controlled by a scalar called the '''{{Term|learning rate}}''' (often denoted <math>\eta</math>). A large {{Term|learning rate}} covers ground quickly but risks overshooting the minimum; a small {{Term|learning rate}} converges more reliably but may take prohibitively many steps.
 <!--T:5-->
@@ Line 20: / Line 20: @@
 <!--T:6-->
-Given a differentiable objective function <math>f:\mathbb{R}^n \to \mathbb{R}</math>, gradient descent generates a sequence of iterates by the '''update rule''':
+Given a differentiable {{Term|loss function|objective function}} <math>f:\mathbb{R}^n \to \mathbb{R}</math>, {{Term|gradient descent}} generates a sequence of iterates by the '''update rule''':
 <!--T:7-->
@@ Line 26: / Line 26: @@
 <!--T:8-->
-where <math>\nabla f(\theta_t)</math> is the gradient vector evaluated at the current point <math>\theta_t</math> and <math>\eta > 0</math> is the learning rate.
+where <math>\nabla f(\theta_t)</math> is the gradient vector evaluated at the current point <math>\theta_t</math> and <math>\eta > 0</math> is the {{Term|learning rate}}.
 <!--T:9-->
@@ Line 54: / Line 54: @@
 ! Variant !! Gradient computed over !! Per-step cost !! Gradient noise
 |-
-| '''Batch (full) gradient descent''' || All <math>N</math> samples || High || None
+| '''Batch (full) {{Term|gradient descent}}''' || All <math>N</math> samples || High || None
 |-
-| '''Stochastic gradient descent (SGD)''' || 1 random sample || Low || High
+| '''{{Term|stochastic gradient descent}} ({{Term|stochastic gradient descent|SGD}})''' || 1 random sample || Low || High
 |-
-| '''Mini-batch gradient descent''' || <math>B</math> random samples (<math>1 < B < N</math>) || Medium || Medium
+| '''{{Term|mini-batch}} {{Term|gradient descent}}''' || <math>B</math> random samples (<math>1 < B < N</math>) || Medium || Medium
 |}
 <!--T:17-->
-Full batch gradient descent computes the exact gradient and therefore follows a smooth trajectory toward the minimum. [[Stochastic Gradient Descent|Stochastic gradient descent]] uses a single sample to estimate the gradient, drastically reducing computation per step at the cost of a noisier trajectory. Mini-batch gradient descent strikes a balance and is the most common choice in practice, with typical batch sizes between 32 and 512.
+Full batch {{Term|gradient descent}} computes the exact gradient and therefore follows a smooth trajectory toward the minimum. [[Stochastic Gradient Descent|Stochastic gradient descent]] uses a single sample to estimate the gradient, drastically reducing computation per step at the cost of a noisier trajectory. {{Term|mini-batch}} {{Term|gradient descent}} strikes a balance and is the most common choice in practice, with typical batch sizes between 32 and 512.
 <!--T:18-->
@@ Line 71: / Line 71: @@
 <!--T:20-->
-For a convex function with Lipschitz-continuous gradients (constant <math>L</math>), gradient descent with a fixed learning rate <math>\eta \leq 1/L</math> converges at a rate of <math>O(1/t)</math>. If the function is additionally '''strongly convex''' with parameter <math>\mu > 0</math>, convergence accelerates to a linear (exponential) rate:
+For a convex function with Lipschitz-continuous gradients (constant <math>L</math>), {{Term|gradient descent}} with a fixed {{Term|learning rate}} <math>\eta \leq 1/L</math> converges at a rate of <math>O(1/t)</math>. If the function is additionally '''strongly convex''' with parameter <math>\mu > 0</math>, {{Term|convergence}} accelerates to a linear (exponential) rate:
 <!--T:21-->
@@ Line 83: / Line 83: @@
 <!--T:24-->
-Most deep-learning objectives are non-convex. In this setting gradient descent is only guaranteed to converge to a stationary point (where <math>\nabla f = 0</math>), which could be a local minimum, saddle point, or even a local maximum. In practice, saddle points are more problematic than local minima in high-dimensional spaces.
+Most deep-learning objectives are non-convex. In this setting {{Term|gradient descent}} is only guaranteed to converge to a stationary point (where <math>\nabla f = 0</math>), which could be a local minimum, saddle point, or even a local maximum. In practice, saddle points are more problematic than local minima in high-dimensional spaces.
 <!--T:25-->
@@ Line 89: / Line 89: @@
 <!--T:26-->
-Choosing the learning rate is one of the most important practical decisions:
+Choosing the {{Term|learning rate}} is one of the most important practical decisions:
 <!--T:27-->
 * '''Too large''' — the iterates oscillate or diverge.
-* '''Too small''' — convergence is unacceptably slow.
+* '''Too small''' — {{Term|convergence}} is unacceptably slow.
-* '''Learning rate schedules''' — many practitioners start with a larger rate and reduce it over time (step decay, exponential decay, cosine annealing).
+* '''{{Term|learning rate}} schedules''' — many practitioners start with a larger rate and reduce it over time (step decay, exponential decay, cosine annealing).
-* '''Line search''' — classical numerical methods choose <math>\eta</math> at each step to satisfy conditions such as the Wolfe or Armijo conditions, though this is rare in deep learning.
+* '''Line search''' — classical numerical methods choose <math>\eta</math> at each step to satisfy conditions such as the Wolfe or Armijo conditions, though this is rare in {{Term|deep learning}}.
 <!--T:28-->
@@ Line 104: / Line 104: @@
 <!--T:30-->
-Several important modifications address limitations of vanilla gradient descent:
+Several important modifications address limitations of vanilla {{Term|gradient descent}}:
 <!--T:31-->
-* '''Momentum''' — accumulates a velocity vector from past gradients, helping to accelerate convergence in ravine-like landscapes.
+* '''{{Term|momentum}}''' — accumulates a velocity vector from past gradients, helping to accelerate {{Term|convergence}} in ravine-like landscapes.
-* '''Nesterov accelerated gradient''' — a momentum variant that evaluates the gradient at a look-ahead position, yielding better theoretical convergence rates.
+* '''Nesterov accelerated gradient''' — a {{Term|momentum}} variant that evaluates the gradient at a look-ahead position, yielding better theoretical {{Term|convergence}} rates.
-* '''Adaptive methods''' (Adagrad, RMSProp, Adam) — maintain per-parameter learning rates that adapt based on the history of gradients.
+* '''Adaptive methods''' ({{Term|adagrad}}, RMSProp, {{Term|adam}}) — maintain per-parameter {{Term|learning rate|learning rates}} that adapt based on the history of gradients.
-* '''Second-order methods''' — algorithms like Newton's method and L-BFGS use curvature information (the Hessian or its approximation) for faster convergence, but are often too expensive for large-scale problems.
+* '''Second-order methods''' — algorithms like Newton's method and L-BFGS use curvature information (the Hessian or its approximation) for faster {{Term|convergence}}, but are often too expensive for large-scale problems.
 <!--T:32-->
@@ Line 116: / Line 116: @@
 <!--T:33-->
-* '''Feature scaling''' — normalising input features so they have similar ranges dramatically improves convergence, because the loss surface becomes more isotropic.
+* '''Feature scaling''' — normalising input features so they have similar ranges dramatically improves {{Term|convergence}}, because the loss surface becomes more isotropic.
-* '''Gradient clipping''' — capping the norm of the gradient prevents excessively large updates.
+* '''{{Term|gradient clipping}}''' — capping the norm of the gradient prevents excessively large updates.
 * '''Random initialisation''' — starting from a reasonable random initialisation (e.g. Xavier or He initialisation for neural networks) avoids symmetry-breaking issues.
-* '''Monitoring the loss curve''' — plotting the training loss over iterations is the simplest diagnostic: a smoothly decreasing curve indicates healthy training; oscillations suggest the learning rate is too high.
+* '''Monitoring the loss curve''' — plotting the training loss over iterations is the simplest diagnostic: a smoothly decreasing curve indicates healthy training; oscillations suggest the {{Term|learning rate}} is too high.
 <!--T:34-->
@@ Line 125: / Line 125: @@
 <!--T:35-->
-Gradient descent and its variants are used throughout science and engineering:
+{{Term|gradient descent}} and its variants are used throughout science and engineering:
 <!--T:36-->
@@ Line 150: / Line 150: @@
 * Cauchy, A. (1847). "Méthode générale pour la résolution des systèmes d'équations simultanées". ''Comptes Rendus de l'Académie des Sciences''.
 * Boyd, S. and Vandenberghe, L. (2004). ''Convex Optimization''. Cambridge University Press.
-* Ruder, S. (2016). "An overview of gradient descent optimization algorithms". ''arXiv:1609.04747''.
+* Ruder, S. (2016). "An overview of {{Term|gradient descent}} optimization algorithms". ''arXiv:1609.04747''.
 * Goodfellow, I., Bengio, Y. and Courville, A. (2016). ''Deep Learning'', Chapter 8. MIT Press.
 </translate>

Article
Topic area	Optimization
Difficulty	Introductory