Gradient Descent: Difference between revisions
([deploy-bot] Deploy from CI (8c92aeb)) Tags: ci-deploy Manual revert |
([deploy-bot] Convert Gradient Descent to Translate-extension page) |
||
| Line 1: | Line 1: | ||
<languages /> | |||
{{LanguageBar | page = Gradient Descent}} | {{LanguageBar | page = Gradient Descent}} | ||
{{ArticleInfobox | topic_area = Optimization | difficulty = Introductory | prerequisites = }} | {{ArticleInfobox | topic_area = Optimization | difficulty = Introductory | prerequisites = }} | ||
{{ContentMeta | generated_by = claude-opus | model_used = claude-opus-4-6 | generated_date = 2026-03-13}} | {{ContentMeta | generated_by = claude-opus | model_used = claude-opus-4-6 | generated_date = 2026-03-13}} | ||
<translate> | |||
<!--T:1--> | |||
'''Gradient descent''' is a first-order iterative optimisation algorithm for finding a local minimum of a differentiable function. It is the foundation of nearly all modern machine-learning training procedures, from simple linear regression to billion-parameter deep neural networks. | '''Gradient descent''' is a first-order iterative optimisation algorithm for finding a local minimum of a differentiable function. It is the foundation of nearly all modern machine-learning training procedures, from simple linear regression to billion-parameter deep neural networks. | ||
<!--T:2--> | |||
== Intuition == | == Intuition == | ||
<!--T:3--> | |||
Imagine standing on a mountainside in thick fog. You cannot see the valley floor, but you can feel the slope beneath your feet. The most natural strategy is to take a step in the steepest downhill direction, then reassess. Gradient descent formalises precisely this idea: at each step, the algorithm computes the direction of steepest increase of the function (the '''gradient''') and moves in the opposite direction. | Imagine standing on a mountainside in thick fog. You cannot see the valley floor, but you can feel the slope beneath your feet. The most natural strategy is to take a step in the steepest downhill direction, then reassess. Gradient descent formalises precisely this idea: at each step, the algorithm computes the direction of steepest increase of the function (the '''gradient''') and moves in the opposite direction. | ||
<!--T:4--> | |||
The size of each step is controlled by a scalar called the '''learning rate''' (often denoted <math>\eta</math>). A large learning rate covers ground quickly but risks overshooting the minimum; a small learning rate converges more reliably but may take prohibitively many steps. | The size of each step is controlled by a scalar called the '''learning rate''' (often denoted <math>\eta</math>). A large learning rate covers ground quickly but risks overshooting the minimum; a small learning rate converges more reliably but may take prohibitively many steps. | ||
<!--T:5--> | |||
== Mathematical formulation == | == Mathematical formulation == | ||
<!--T:6--> | |||
Given a differentiable objective function <math>f:\mathbb{R}^n \to \mathbb{R}</math>, gradient descent generates a sequence of iterates by the '''update rule''': | Given a differentiable objective function <math>f:\mathbb{R}^n \to \mathbb{R}</math>, gradient descent generates a sequence of iterates by the '''update rule''': | ||
<!--T:7--> | |||
:<math>\theta_{t+1} = \theta_t - \eta \, \nabla f(\theta_t)</math> | :<math>\theta_{t+1} = \theta_t - \eta \, \nabla f(\theta_t)</math> | ||
<!--T:8--> | |||
where <math>\nabla f(\theta_t)</math> is the gradient vector evaluated at the current point <math>\theta_t</math> and <math>\eta > 0</math> is the learning rate. | where <math>\nabla f(\theta_t)</math> is the gradient vector evaluated at the current point <math>\theta_t</math> and <math>\eta > 0</math> is the learning rate. | ||
<!--T:9--> | |||
In the one-dimensional case this simplifies to: | In the one-dimensional case this simplifies to: | ||
<!--T:10--> | |||
:<math>\theta_{t+1} = \theta_t - \eta \, f'(\theta_t)</math> | :<math>\theta_{t+1} = \theta_t - \eta \, f'(\theta_t)</math> | ||
<!--T:11--> | |||
The gradient <math>\nabla f</math> points in the direction of steepest ascent, so subtracting it moves the iterate downhill. | The gradient <math>\nabla f</math> points in the direction of steepest ascent, so subtracting it moves the iterate downhill. | ||
<!--T:12--> | |||
== Batch, stochastic, and mini-batch variants == | == Batch, stochastic, and mini-batch variants == | ||
<!--T:13--> | |||
When the objective has the form of an average over data points, | When the objective has the form of an average over data points, | ||
<!--T:14--> | |||
:<math>f(\theta) = \frac{1}{N}\sum_{i=1}^{N} \ell(\theta;\, x_i, y_i)</math> | :<math>f(\theta) = \frac{1}{N}\sum_{i=1}^{N} \ell(\theta;\, x_i, y_i)</math> | ||
<!--T:15--> | |||
three common strategies differ in how much data is used to estimate the gradient: | three common strategies differ in how much data is used to estimate the gradient: | ||
<!--T:16--> | |||
{| class="wikitable" | {| class="wikitable" | ||
|- | |- | ||
| Line 44: | Line 62: | ||
|} | |} | ||
<!--T:17--> | |||
Full batch gradient descent computes the exact gradient and therefore follows a smooth trajectory toward the minimum. [[Stochastic Gradient Descent|Stochastic gradient descent]] uses a single sample to estimate the gradient, drastically reducing computation per step at the cost of a noisier trajectory. Mini-batch gradient descent strikes a balance and is the most common choice in practice, with typical batch sizes between 32 and 512. | Full batch gradient descent computes the exact gradient and therefore follows a smooth trajectory toward the minimum. [[Stochastic Gradient Descent|Stochastic gradient descent]] uses a single sample to estimate the gradient, drastically reducing computation per step at the cost of a noisier trajectory. Mini-batch gradient descent strikes a balance and is the most common choice in practice, with typical batch sizes between 32 and 512. | ||
<!--T:18--> | |||
== Convergence == | == Convergence == | ||
<!--T:19--> | |||
=== Convex functions === | === Convex functions === | ||
<!--T:20--> | |||
For a convex function with Lipschitz-continuous gradients (constant <math>L</math>), gradient descent with a fixed learning rate <math>\eta \leq 1/L</math> converges at a rate of <math>O(1/t)</math>. If the function is additionally '''strongly convex''' with parameter <math>\mu > 0</math>, convergence accelerates to a linear (exponential) rate: | For a convex function with Lipschitz-continuous gradients (constant <math>L</math>), gradient descent with a fixed learning rate <math>\eta \leq 1/L</math> converges at a rate of <math>O(1/t)</math>. If the function is additionally '''strongly convex''' with parameter <math>\mu > 0</math>, convergence accelerates to a linear (exponential) rate: | ||
<!--T:21--> | |||
:<math>f(\theta_t) - f(\theta^*) \leq \left(1 - \frac{\mu}{L}\right)^t \bigl(f(\theta_0) - f(\theta^*)\bigr)</math> | :<math>f(\theta_t) - f(\theta^*) \leq \left(1 - \frac{\mu}{L}\right)^t \bigl(f(\theta_0) - f(\theta^*)\bigr)</math> | ||
<!--T:22--> | |||
The ratio <math>\kappa = L / \mu</math> is called the '''condition number''' and governs how quickly the algorithm converges. Ill-conditioned problems (large <math>\kappa</math>) converge slowly. | The ratio <math>\kappa = L / \mu</math> is called the '''condition number''' and governs how quickly the algorithm converges. Ill-conditioned problems (large <math>\kappa</math>) converge slowly. | ||
<!--T:23--> | |||
=== Non-convex functions === | === Non-convex functions === | ||
<!--T:24--> | |||
Most deep-learning objectives are non-convex. In this setting gradient descent is only guaranteed to converge to a stationary point (where <math>\nabla f = 0</math>), which could be a local minimum, saddle point, or even a local maximum. In practice, saddle points are more problematic than local minima in high-dimensional spaces. | Most deep-learning objectives are non-convex. In this setting gradient descent is only guaranteed to converge to a stationary point (where <math>\nabla f = 0</math>), which could be a local minimum, saddle point, or even a local maximum. In practice, saddle points are more problematic than local minima in high-dimensional spaces. | ||
<!--T:25--> | |||
== Learning rate selection == | == Learning rate selection == | ||
<!--T:26--> | |||
Choosing the learning rate is one of the most important practical decisions: | Choosing the learning rate is one of the most important practical decisions: | ||
<!--T:27--> | |||
* '''Too large''' — the iterates oscillate or diverge. | * '''Too large''' — the iterates oscillate or diverge. | ||
* '''Too small''' — convergence is unacceptably slow. | * '''Too small''' — convergence is unacceptably slow. | ||
| Line 69: | Line 98: | ||
* '''Line search''' — classical numerical methods choose <math>\eta</math> at each step to satisfy conditions such as the Wolfe or Armijo conditions, though this is rare in deep learning. | * '''Line search''' — classical numerical methods choose <math>\eta</math> at each step to satisfy conditions such as the Wolfe or Armijo conditions, though this is rare in deep learning. | ||
<!--T:28--> | |||
A common heuristic is to try several values on a logarithmic scale (e.g. <math>10^{-1}, 10^{-2}, 10^{-3}</math>) and pick the one that reduces the loss fastest without instability. | A common heuristic is to try several values on a logarithmic scale (e.g. <math>10^{-1}, 10^{-2}, 10^{-3}</math>) and pick the one that reduces the loss fastest without instability. | ||
<!--T:29--> | |||
== Extensions and improvements == | == Extensions and improvements == | ||
<!--T:30--> | |||
Several important modifications address limitations of vanilla gradient descent: | Several important modifications address limitations of vanilla gradient descent: | ||
<!--T:31--> | |||
* '''Momentum''' — accumulates a velocity vector from past gradients, helping to accelerate convergence in ravine-like landscapes. | * '''Momentum''' — accumulates a velocity vector from past gradients, helping to accelerate convergence in ravine-like landscapes. | ||
* '''Nesterov accelerated gradient''' — a momentum variant that evaluates the gradient at a look-ahead position, yielding better theoretical convergence rates. | * '''Nesterov accelerated gradient''' — a momentum variant that evaluates the gradient at a look-ahead position, yielding better theoretical convergence rates. | ||
| Line 80: | Line 113: | ||
* '''Second-order methods''' — algorithms like Newton's method and L-BFGS use curvature information (the Hessian or its approximation) for faster convergence, but are often too expensive for large-scale problems. | * '''Second-order methods''' — algorithms like Newton's method and L-BFGS use curvature information (the Hessian or its approximation) for faster convergence, but are often too expensive for large-scale problems. | ||
<!--T:32--> | |||
== Practical tips == | == Practical tips == | ||
<!--T:33--> | |||
* '''Feature scaling''' — normalising input features so they have similar ranges dramatically improves convergence, because the loss surface becomes more isotropic. | * '''Feature scaling''' — normalising input features so they have similar ranges dramatically improves convergence, because the loss surface becomes more isotropic. | ||
* '''Gradient clipping''' — capping the norm of the gradient prevents excessively large updates. | * '''Gradient clipping''' — capping the norm of the gradient prevents excessively large updates. | ||
| Line 87: | Line 122: | ||
* '''Monitoring the loss curve''' — plotting the training loss over iterations is the simplest diagnostic: a smoothly decreasing curve indicates healthy training; oscillations suggest the learning rate is too high. | * '''Monitoring the loss curve''' — plotting the training loss over iterations is the simplest diagnostic: a smoothly decreasing curve indicates healthy training; oscillations suggest the learning rate is too high. | ||
<!--T:34--> | |||
== Applications == | == Applications == | ||
<!--T:35--> | |||
Gradient descent and its variants are used throughout science and engineering: | Gradient descent and its variants are used throughout science and engineering: | ||
<!--T:36--> | |||
* Training machine-learning models (linear models, neural networks, support vector machines) | * Training machine-learning models (linear models, neural networks, support vector machines) | ||
* Signal processing and control systems | * Signal processing and control systems | ||
| Line 97: | Line 135: | ||
* Economics and game-theoretic equilibrium computation | * Economics and game-theoretic equilibrium computation | ||
<!--T:37--> | |||
== See also == | == See also == | ||
<!--T:38--> | |||
* [[Stochastic Gradient Descent]] | * [[Stochastic Gradient Descent]] | ||
* [[Backpropagation]] | * [[Backpropagation]] | ||
| Line 105: | Line 145: | ||
* [[Overfitting and Regularization]] | * [[Overfitting and Regularization]] | ||
<!--T:39--> | |||
== References == | == References == | ||
<!--T:40--> | |||
* Cauchy, A. (1847). "Méthode générale pour la résolution des systèmes d'équations simultanées". ''Comptes Rendus de l'Académie des Sciences''. | * Cauchy, A. (1847). "Méthode générale pour la résolution des systèmes d'équations simultanées". ''Comptes Rendus de l'Académie des Sciences''. | ||
* Boyd, S. and Vandenberghe, L. (2004). ''Convex Optimization''. Cambridge University Press. | * Boyd, S. and Vandenberghe, L. (2004). ''Convex Optimization''. Cambridge University Press. | ||
* Ruder, S. (2016). "An overview of gradient descent optimization algorithms". ''arXiv:1609.04747''. | * Ruder, S. (2016). "An overview of gradient descent optimization algorithms". ''arXiv:1609.04747''. | ||
* Goodfellow, I., Bengio, Y. and Courville, A. (2016). ''Deep Learning'', Chapter 8. MIT Press. | * Goodfellow, I., Bengio, Y. and Courville, A. (2016). ''Deep Learning'', Chapter 8. MIT Press. | ||
</translate> | |||
[[Category:Optimization]] | [[Category:Optimization]] | ||
[[Category:Introductory]] | [[Category:Introductory]] | ||
Revision as of 00:30, 27 April 2026
| Article | |
|---|---|
| Topic area | Optimization |
| Difficulty | Introductory |
Gradient descent is a first-order iterative optimisation algorithm for finding a local minimum of a differentiable function. It is the foundation of nearly all modern machine-learning training procedures, from simple linear regression to billion-parameter deep neural networks.
Intuition
Imagine standing on a mountainside in thick fog. You cannot see the valley floor, but you can feel the slope beneath your feet. The most natural strategy is to take a step in the steepest downhill direction, then reassess. Gradient descent formalises precisely this idea: at each step, the algorithm computes the direction of steepest increase of the function (the gradient) and moves in the opposite direction.
The size of each step is controlled by a scalar called the learning rate (often denoted $ \eta $). A large learning rate covers ground quickly but risks overshooting the minimum; a small learning rate converges more reliably but may take prohibitively many steps.
Mathematical formulation
Given a differentiable objective function $ f:\mathbb{R}^n \to \mathbb{R} $, gradient descent generates a sequence of iterates by the update rule:
- $ \theta_{t+1} = \theta_t - \eta \, \nabla f(\theta_t) $
where $ \nabla f(\theta_t) $ is the gradient vector evaluated at the current point $ \theta_t $ and $ \eta > 0 $ is the learning rate.
In the one-dimensional case this simplifies to:
- $ \theta_{t+1} = \theta_t - \eta \, f'(\theta_t) $
The gradient $ \nabla f $ points in the direction of steepest ascent, so subtracting it moves the iterate downhill.
Batch, stochastic, and mini-batch variants
When the objective has the form of an average over data points,
- $ f(\theta) = \frac{1}{N}\sum_{i=1}^{N} \ell(\theta;\, x_i, y_i) $
three common strategies differ in how much data is used to estimate the gradient:
| Variant | Gradient computed over | Per-step cost | Gradient noise |
|---|---|---|---|
| Batch (full) gradient descent | All $ N $ samples | High | None |
| Stochastic gradient descent (SGD) | 1 random sample | Low | High |
| Mini-batch gradient descent | $ B $ random samples ($ 1 < B < N $) | Medium | Medium |
Full batch gradient descent computes the exact gradient and therefore follows a smooth trajectory toward the minimum. Stochastic gradient descent uses a single sample to estimate the gradient, drastically reducing computation per step at the cost of a noisier trajectory. Mini-batch gradient descent strikes a balance and is the most common choice in practice, with typical batch sizes between 32 and 512.
Convergence
Convex functions
For a convex function with Lipschitz-continuous gradients (constant $ L $), gradient descent with a fixed learning rate $ \eta \leq 1/L $ converges at a rate of $ O(1/t) $. If the function is additionally strongly convex with parameter $ \mu > 0 $, convergence accelerates to a linear (exponential) rate:
- $ f(\theta_t) - f(\theta^*) \leq \left(1 - \frac{\mu}{L}\right)^t \bigl(f(\theta_0) - f(\theta^*)\bigr) $
The ratio $ \kappa = L / \mu $ is called the condition number and governs how quickly the algorithm converges. Ill-conditioned problems (large $ \kappa $) converge slowly.
Non-convex functions
Most deep-learning objectives are non-convex. In this setting gradient descent is only guaranteed to converge to a stationary point (where $ \nabla f = 0 $), which could be a local minimum, saddle point, or even a local maximum. In practice, saddle points are more problematic than local minima in high-dimensional spaces.
Learning rate selection
Choosing the learning rate is one of the most important practical decisions:
- Too large — the iterates oscillate or diverge.
- Too small — convergence is unacceptably slow.
- Learning rate schedules — many practitioners start with a larger rate and reduce it over time (step decay, exponential decay, cosine annealing).
- Line search — classical numerical methods choose $ \eta $ at each step to satisfy conditions such as the Wolfe or Armijo conditions, though this is rare in deep learning.
A common heuristic is to try several values on a logarithmic scale (e.g. $ 10^{-1}, 10^{-2}, 10^{-3} $) and pick the one that reduces the loss fastest without instability.
Extensions and improvements
Several important modifications address limitations of vanilla gradient descent:
- Momentum — accumulates a velocity vector from past gradients, helping to accelerate convergence in ravine-like landscapes.
- Nesterov accelerated gradient — a momentum variant that evaluates the gradient at a look-ahead position, yielding better theoretical convergence rates.
- Adaptive methods (Adagrad, RMSProp, Adam) — maintain per-parameter learning rates that adapt based on the history of gradients.
- Second-order methods — algorithms like Newton's method and L-BFGS use curvature information (the Hessian or its approximation) for faster convergence, but are often too expensive for large-scale problems.
Practical tips
- Feature scaling — normalising input features so they have similar ranges dramatically improves convergence, because the loss surface becomes more isotropic.
- Gradient clipping — capping the norm of the gradient prevents excessively large updates.
- Random initialisation — starting from a reasonable random initialisation (e.g. Xavier or He initialisation for neural networks) avoids symmetry-breaking issues.
- Monitoring the loss curve — plotting the training loss over iterations is the simplest diagnostic: a smoothly decreasing curve indicates healthy training; oscillations suggest the learning rate is too high.
Applications
Gradient descent and its variants are used throughout science and engineering:
- Training machine-learning models (linear models, neural networks, support vector machines)
- Signal processing and control systems
- Inverse problems in physics and imaging
- Operations research and logistics optimisation
- Economics and game-theoretic equilibrium computation
See also
- Stochastic Gradient Descent
- Backpropagation
- Loss Functions
- Neural Networks
- Overfitting and Regularization
References
- Cauchy, A. (1847). "Méthode générale pour la résolution des systèmes d'équations simultanées". Comptes Rendus de l'Académie des Sciences.
- Boyd, S. and Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press.
- Ruder, S. (2016). "An overview of gradient descent optimization algorithms". arXiv:1609.04747.
- Goodfellow, I., Bengio, Y. and Courville, A. (2016). Deep Learning, Chapter 8. MIT Press.