Gradient Descent: Difference between revisions

Article
Topic area	Optimization
Difficulty	Introductory

Revision as of 00:30, 27 April 2026

Other languages:

English
Español
中文

Languages: English | Español | 中文

Gradient descent is a first-order iterative optimisation algorithm for finding a local minimum of a differentiable function. It is the foundation of nearly all modern machine-learning training procedures, from simple linear regression to billion-parameter deep neural networks.

Intuition

Imagine standing on a mountainside in thick fog. You cannot see the valley floor, but you can feel the slope beneath your feet. The most natural strategy is to take a step in the steepest downhill direction, then reassess. Gradient descent formalises precisely this idea: at each step, the algorithm computes the direction of steepest increase of the function (the gradient) and moves in the opposite direction.

The size of each step is controlled by a scalar called the learning rate (often denoted $\eta$ ). A large learning rate covers ground quickly but risks overshooting the minimum; a small learning rate converges more reliably but may take prohibitively many steps.

Mathematical formulation

Given a differentiable objective function $f:\mathbb{R}^n \to \mathbb{R}$ , gradient descent generates a sequence of iterates by the update rule:

\theta_{t+1} = \theta_t - \eta \, \nabla f(\theta_t)

where $\nabla f(\theta_t)$ is the gradient vector evaluated at the current point $\theta_t$ and $\eta > 0$ is the learning rate.

In the one-dimensional case this simplifies to:

\theta_{t+1} = \theta_t - \eta \, f'(\theta_t)

The gradient $\nabla f$ points in the direction of steepest ascent, so subtracting it moves the iterate downhill.

Batch, stochastic, and mini-batch variants

When the objective has the form of an average over data points,

f(\theta) = \frac{1}{N}\sum_{i=1}^{N} \ell(\theta;\, x_i, y_i)

three common strategies differ in how much data is used to estimate the gradient:

Variant	Gradient computed over	Per-step cost	Gradient noise
Batch (full) gradient descent	All $$ N $$ samples	High	None
Stochastic gradient descent (SGD)	1 random sample	Low	High
Mini-batch gradient descent	$$ B $$ random samples ( $$ 1 < B < N $$ )	Medium	Medium

Full batch gradient descent computes the exact gradient and therefore follows a smooth trajectory toward the minimum. Stochastic gradient descent uses a single sample to estimate the gradient, drastically reducing computation per step at the cost of a noisier trajectory. Mini-batch gradient descent strikes a balance and is the most common choice in practice, with typical batch sizes between 32 and 512.

Convergence

Convex functions

For a convex function with Lipschitz-continuous gradients (constant $$ L $$ ), gradient descent with a fixed learning rate $\eta \leq 1/L$ converges at a rate of $$ O(1/t) $$ . If the function is additionally strongly convex with parameter $\mu > 0$ , convergence accelerates to a linear (exponential) rate:

f(\theta_t) - f(\theta^*) \leq \left(1 - \frac{\mu}{L}\right)^t \bigl(f(\theta_0) - f(\theta^*)\bigr)

The ratio $\kappa = L / \mu$ is called the condition number and governs how quickly the algorithm converges. Ill-conditioned problems (large $\kappa$ ) converge slowly.

Non-convex functions

Most deep-learning objectives are non-convex. In this setting gradient descent is only guaranteed to converge to a stationary point (where $\nabla f = 0$ ), which could be a local minimum, saddle point, or even a local maximum. In practice, saddle points are more problematic than local minima in high-dimensional spaces.

Learning rate selection

Choosing the learning rate is one of the most important practical decisions:

Too large — the iterates oscillate or diverge.
Too small — convergence is unacceptably slow.
Learning rate schedules — many practitioners start with a larger rate and reduce it over time (step decay, exponential decay, cosine annealing).
Line search — classical numerical methods choose $\eta$ at each step to satisfy conditions such as the Wolfe or Armijo conditions, though this is rare in deep learning.

A common heuristic is to try several values on a logarithmic scale (e.g. $10^{-1}, 10^{-2}, 10^{-3}$ ) and pick the one that reduces the loss fastest without instability.

Extensions and improvements

Several important modifications address limitations of vanilla gradient descent:

Momentum — accumulates a velocity vector from past gradients, helping to accelerate convergence in ravine-like landscapes.
Nesterov accelerated gradient — a momentum variant that evaluates the gradient at a look-ahead position, yielding better theoretical convergence rates.
Adaptive methods (Adagrad, RMSProp, Adam) — maintain per-parameter learning rates that adapt based on the history of gradients.
Second-order methods — algorithms like Newton's method and L-BFGS use curvature information (the Hessian or its approximation) for faster convergence, but are often too expensive for large-scale problems.

Practical tips

Feature scaling — normalising input features so they have similar ranges dramatically improves convergence, because the loss surface becomes more isotropic.
Gradient clipping — capping the norm of the gradient prevents excessively large updates.
Random initialisation — starting from a reasonable random initialisation (e.g. Xavier or He initialisation for neural networks) avoids symmetry-breaking issues.
Monitoring the loss curve — plotting the training loss over iterations is the simplest diagnostic: a smoothly decreasing curve indicates healthy training; oscillations suggest the learning rate is too high.

Applications

Gradient descent and its variants are used throughout science and engineering:

Training machine-learning models (linear models, neural networks, support vector machines)
Signal processing and control systems
Inverse problems in physics and imaging
Operations research and logistics optimisation
Economics and game-theoretic equilibrium computation

References

Cauchy, A. (1847). "Méthode générale pour la résolution des systèmes d'équations simultanées". Comptes Rendus de l'Académie des Sciences.
Boyd, S. and Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press.
Ruder, S. (2016). "An overview of gradient descent optimization algorithms". arXiv:1609.04747.
Goodfellow, I., Bengio, Y. and Courville, A. (2016). Deep Learning, Chapter 8. MIT Press.

@@ Line 1: / Line 1: @@
+<languages />
 {{LanguageBar | page = Gradient Descent}}
 {{ArticleInfobox | topic_area = Optimization | difficulty = Introductory | prerequisites = }}
 {{ContentMeta | generated_by = claude-opus | model_used = claude-opus-4-6 | generated_date = 2026-03-13}}
+<translate>
+<!--T:1-->
 '''Gradient descent''' is a first-order iterative optimisation algorithm for finding a local minimum of a differentiable function. It is the foundation of nearly all modern machine-learning training procedures, from simple linear regression to billion-parameter deep neural networks.
+<!--T:2-->
 == Intuition ==
+<!--T:3-->
 Imagine standing on a mountainside in thick fog. You cannot see the valley floor, but you can feel the slope beneath your feet. The most natural strategy is to take a step in the steepest downhill direction, then reassess. Gradient descent formalises precisely this idea: at each step, the algorithm computes the direction of steepest increase of the function (the '''gradient''') and moves in the opposite direction.
+<!--T:4-->
 The size of each step is controlled by a scalar called the '''learning rate''' (often denoted <math>\eta</math>). A large learning rate covers ground quickly but risks overshooting the minimum; a small learning rate converges more reliably but may take prohibitively many steps.
+<!--T:5-->
 == Mathematical formulation ==
+<!--T:6-->
 Given a differentiable objective function <math>f:\mathbb{R}^n \to \mathbb{R}</math>, gradient descent generates a sequence of iterates by the '''update rule''':
+<!--T:7-->
 :<math>\theta_{t+1} = \theta_t - \eta \, \nabla f(\theta_t)</math>
+<!--T:8-->
 where <math>\nabla f(\theta_t)</math> is the gradient vector evaluated at the current point <math>\theta_t</math> and <math>\eta > 0</math> is the learning rate.
+<!--T:9-->
 In the one-dimensional case this simplifies to:
+<!--T:10-->
 :<math>\theta_{t+1} = \theta_t - \eta \, f'(\theta_t)</math>
+<!--T:11-->
 The gradient <math>\nabla f</math> points in the direction of steepest ascent, so subtracting it moves the iterate downhill.
+<!--T:12-->
 == Batch, stochastic, and mini-batch variants ==
+<!--T:13-->
 When the objective has the form of an average over data points,
+<!--T:14-->
 :<math>f(\theta) = \frac{1}{N}\sum_{i=1}^{N} \ell(\theta;\, x_i, y_i)</math>
+<!--T:15-->
 three common strategies differ in how much data is used to estimate the gradient:
+<!--T:16-->
 {| class="wikitable"
 |-
@@ Line 44: / Line 62: @@
 |}
+<!--T:17-->
 Full batch gradient descent computes the exact gradient and therefore follows a smooth trajectory toward the minimum. [[Stochastic Gradient Descent|Stochastic gradient descent]] uses a single sample to estimate the gradient, drastically reducing computation per step at the cost of a noisier trajectory. Mini-batch gradient descent strikes a balance and is the most common choice in practice, with typical batch sizes between 32 and 512.
+<!--T:18-->
 == Convergence ==
+<!--T:19-->
 === Convex functions ===
+<!--T:20-->
 For a convex function with Lipschitz-continuous gradients (constant <math>L</math>), gradient descent with a fixed learning rate <math>\eta \leq 1/L</math> converges at a rate of <math>O(1/t)</math>. If the function is additionally '''strongly convex''' with parameter <math>\mu > 0</math>, convergence accelerates to a linear (exponential) rate:
+<!--T:21-->
 :<math>f(\theta_t) - f(\theta^*) \leq \left(1 - \frac{\mu}{L}\right)^t \bigl(f(\theta_0) - f(\theta^*)\bigr)</math>
+<!--T:22-->
 The ratio <math>\kappa = L / \mu</math> is called the '''condition number''' and governs how quickly the algorithm converges. Ill-conditioned problems (large <math>\kappa</math>) converge slowly.
+<!--T:23-->
 === Non-convex functions ===
+<!--T:24-->
 Most deep-learning objectives are non-convex. In this setting gradient descent is only guaranteed to converge to a stationary point (where <math>\nabla f = 0</math>), which could be a local minimum, saddle point, or even a local maximum. In practice, saddle points are more problematic than local minima in high-dimensional spaces.
+<!--T:25-->
 == Learning rate selection ==
+<!--T:26-->
 Choosing the learning rate is one of the most important practical decisions:
+<!--T:27-->
 * '''Too large''' — the iterates oscillate or diverge.
 * '''Too small''' — convergence is unacceptably slow.
@@ Line 69: / Line 98: @@
 * '''Line search''' — classical numerical methods choose <math>\eta</math> at each step to satisfy conditions such as the Wolfe or Armijo conditions, though this is rare in deep learning.
+<!--T:28-->
 A common heuristic is to try several values on a logarithmic scale (e.g. <math>10^{-1}, 10^{-2}, 10^{-3}</math>) and pick the one that reduces the loss fastest without instability.
+<!--T:29-->
 == Extensions and improvements ==
+<!--T:30-->
 Several important modifications address limitations of vanilla gradient descent:
+<!--T:31-->
 * '''Momentum''' — accumulates a velocity vector from past gradients, helping to accelerate convergence in ravine-like landscapes.
 * '''Nesterov accelerated gradient''' — a momentum variant that evaluates the gradient at a look-ahead position, yielding better theoretical convergence rates.
@@ Line 80: / Line 113: @@
 * '''Second-order methods''' — algorithms like Newton's method and L-BFGS use curvature information (the Hessian or its approximation) for faster convergence, but are often too expensive for large-scale problems.
+<!--T:32-->
 == Practical tips ==
+<!--T:33-->
 * '''Feature scaling''' — normalising input features so they have similar ranges dramatically improves convergence, because the loss surface becomes more isotropic.
 * '''Gradient clipping''' — capping the norm of the gradient prevents excessively large updates.
@@ Line 87: / Line 122: @@
 * '''Monitoring the loss curve''' — plotting the training loss over iterations is the simplest diagnostic: a smoothly decreasing curve indicates healthy training; oscillations suggest the learning rate is too high.
+<!--T:34-->
 == Applications ==
+<!--T:35-->
 Gradient descent and its variants are used throughout science and engineering:
+<!--T:36-->
 * Training machine-learning models (linear models, neural networks, support vector machines)
 * Signal processing and control systems
@@ Line 97: / Line 135: @@
 * Economics and game-theoretic equilibrium computation
+<!--T:37-->
 == See also ==
+<!--T:38-->
 * [[Stochastic Gradient Descent]]
 * [[Backpropagation]]
@@ Line 105: / Line 145: @@
 * [[Overfitting and Regularization]]
+<!--T:39-->
 == References ==
+<!--T:40-->
 * Cauchy, A. (1847). "Méthode générale pour la résolution des systèmes d'équations simultanées". ''Comptes Rendus de l'Académie des Sciences''.
 * Boyd, S. and Vandenberghe, L. (2004). ''Convex Optimization''. Cambridge University Press.
 * Ruder, S. (2016). "An overview of gradient descent optimization algorithms". ''arXiv:1609.04747''.
 * Goodfellow, I., Bengio, Y. and Courville, A. (2016). ''Deep Learning'', Chapter 8. MIT Press.
+</translate>
 [[Category:Optimization]]
 [[Category:Introductory]]

Article
Topic area	Optimization
Difficulty	Introductory