Gradient Descent: Difference between revisions

    From Marovi AI
    (Marked this version for translation)
    Tag: Reverted
    ([deploy-bot] Deploy from CI (5358a41))
    Tags: ci-deploy Manual revert Reverted
    Line 5: Line 5:
    <translate>
    <translate>
    <!--T:1-->
    <!--T:1-->
    '''{{Term|gradient descent}}''' is a first-order iterative optimisation algorithm for finding a local minimum of a differentiable function. It is the foundation of nearly all modern machine-learning training procedures, from simple linear regression to billion-parameter deep neural networks.
    '''Gradient descent''' is a first-order iterative optimisation algorithm for finding a local minimum of a differentiable function. It is the foundation of nearly all modern machine-learning training procedures, from simple linear regression to billion-parameter deep neural networks.


    == Intuition == <!--T:2-->
    <!--T:2-->
    == Intuition ==


    <!--T:3-->
    <!--T:3-->
    Imagine standing on a mountainside in thick fog. You cannot see the valley floor, but you can feel the slope beneath your feet. The most natural strategy is to take a step in the steepest downhill direction, then reassess. {{Term|gradient descent}} formalises precisely this idea: at each step, the algorithm computes the direction of steepest increase of the function (the '''gradient''') and moves in the opposite direction.
    Imagine standing on a mountainside in thick fog. You cannot see the valley floor, but you can feel the slope beneath your feet. The most natural strategy is to take a step in the steepest downhill direction, then reassess. Gradient descent formalises precisely this idea: at each step, the algorithm computes the direction of steepest increase of the function (the '''gradient''') and moves in the opposite direction.


    <!--T:4-->
    <!--T:4-->
    The size of each step is controlled by a scalar called the '''{{Term|learning rate}}''' (often denoted <math>\eta</math>). A large {{Term|learning rate}} covers ground quickly but risks overshooting the minimum; a small {{Term|learning rate}} converges more reliably but may take prohibitively many steps.
    The size of each step is controlled by a scalar called the '''learning rate''' (often denoted <math>\eta</math>). A large learning rate covers ground quickly but risks overshooting the minimum; a small learning rate converges more reliably but may take prohibitively many steps.


    == Mathematical formulation == <!--T:5-->
    <!--T:5-->
    == Mathematical formulation ==


    <!--T:6-->
    <!--T:6-->
    Given a differentiable {{Term|loss function|objective function}} <math>f:\mathbb{R}^n \to \mathbb{R}</math>, {{Term|gradient descent}} generates a sequence of iterates by the '''update rule''':
    Given a differentiable objective function <math>f:\mathbb{R}^n \to \mathbb{R}</math>, gradient descent generates a sequence of iterates by the '''update rule''':


    <!--T:7-->
    <!--T:7-->
    Line 24: Line 26:


    <!--T:8-->
    <!--T:8-->
    where <math>\nabla f(\theta_t)</math> is the gradient vector evaluated at the current point <math>\theta_t</math> and <math>\eta > 0</math> is the {{Term|learning rate}}.
    where <math>\nabla f(\theta_t)</math> is the gradient vector evaluated at the current point <math>\theta_t</math> and <math>\eta > 0</math> is the learning rate.


    <!--T:9-->
    <!--T:9-->
    Line 35: Line 37:
    The gradient <math>\nabla f</math> points in the direction of steepest ascent, so subtracting it moves the iterate downhill.
    The gradient <math>\nabla f</math> points in the direction of steepest ascent, so subtracting it moves the iterate downhill.


    == Batch, stochastic, and mini-batch variants == <!--T:12-->
    <!--T:12-->
    == Batch, stochastic, and mini-batch variants ==


    <!--T:13-->
    <!--T:13-->
    Line 51: Line 54:
    ! Variant !! Gradient computed over !! Per-step cost !! Gradient noise
    ! Variant !! Gradient computed over !! Per-step cost !! Gradient noise
    |-
    |-
    | '''Batch (full) {{Term|gradient descent}}''' || All <math>N</math> samples || High || None
    | '''Batch (full) gradient descent''' || All <math>N</math> samples || High || None
    |-
    |-
    | '''{{Term|stochastic gradient descent}} ({{Term|stochastic gradient descent|SGD}})''' || 1 random sample || Low || High
    | '''Stochastic gradient descent (SGD)''' || 1 random sample || Low || High
    |-
    |-
    | '''{{Term|mini-batch}} {{Term|gradient descent}}''' || <math>B</math> random samples (<math>1 < B < N</math>) || Medium || Medium
    | '''Mini-batch gradient descent''' || <math>B</math> random samples (<math>1 < B < N</math>) || Medium || Medium
    |}
    |}


    <!--T:17-->
    <!--T:17-->
    Full batch {{Term|gradient descent}} computes the exact gradient and therefore follows a smooth trajectory toward the minimum. [[Stochastic Gradient Descent|Stochastic gradient descent]] uses a single sample to estimate the gradient, drastically reducing computation per step at the cost of a noisier trajectory. {{Term|mini-batch}} {{Term|gradient descent}} strikes a balance and is the most common choice in practice, with typical batch sizes between 32 and 512.
    Full batch gradient descent computes the exact gradient and therefore follows a smooth trajectory toward the minimum. [[Stochastic Gradient Descent|Stochastic gradient descent]] uses a single sample to estimate the gradient, drastically reducing computation per step at the cost of a noisier trajectory. Mini-batch gradient descent strikes a balance and is the most common choice in practice, with typical batch sizes between 32 and 512.


    == Convergence == <!--T:18-->
    <!--T:18-->
    == Convergence ==


    === Convex functions === <!--T:19-->
    <!--T:19-->
    === Convex functions ===


    <!--T:20-->
    <!--T:20-->
    For a convex function with Lipschitz-continuous gradients (constant <math>L</math>), {{Term|gradient descent}} with a fixed {{Term|learning rate}} <math>\eta \leq 1/L</math> converges at a rate of <math>O(1/t)</math>. If the function is additionally '''strongly convex''' with parameter <math>\mu > 0</math>, {{Term|convergence}} accelerates to a linear (exponential) rate:
    For a convex function with Lipschitz-continuous gradients (constant <math>L</math>), gradient descent with a fixed learning rate <math>\eta \leq 1/L</math> converges at a rate of <math>O(1/t)</math>. If the function is additionally '''strongly convex''' with parameter <math>\mu > 0</math>, convergence accelerates to a linear (exponential) rate:


    <!--T:21-->
    <!--T:21-->
    Line 74: Line 79:
    The ratio <math>\kappa = L / \mu</math> is called the '''condition number''' and governs how quickly the algorithm converges. Ill-conditioned problems (large <math>\kappa</math>) converge slowly.
    The ratio <math>\kappa = L / \mu</math> is called the '''condition number''' and governs how quickly the algorithm converges. Ill-conditioned problems (large <math>\kappa</math>) converge slowly.


    === Non-convex functions === <!--T:23-->
    <!--T:23-->
    === Non-convex functions ===


    <!--T:24-->
    <!--T:24-->
    Most deep-learning objectives are non-convex. In this setting {{Term|gradient descent}} is only guaranteed to converge to a stationary point (where <math>\nabla f = 0</math>), which could be a local minimum, saddle point, or even a local maximum. In practice, saddle points are more problematic than local minima in high-dimensional spaces.
    Most deep-learning objectives are non-convex. In this setting gradient descent is only guaranteed to converge to a stationary point (where <math>\nabla f = 0</math>), which could be a local minimum, saddle point, or even a local maximum. In practice, saddle points are more problematic than local minima in high-dimensional spaces.


    == Learning rate selection == <!--T:25-->
    <!--T:25-->
    == Learning rate selection ==


    <!--T:26-->
    <!--T:26-->
    Choosing the {{Term|learning rate}} is one of the most important practical decisions:
    Choosing the learning rate is one of the most important practical decisions:


    <!--T:27-->
    <!--T:27-->
    * '''Too large''' — the iterates oscillate or diverge.
    * '''Too large''' — the iterates oscillate or diverge.
    * '''Too small''' — {{Term|convergence}} is unacceptably slow.
    * '''Too small''' — convergence is unacceptably slow.
    * '''{{Term|learning rate}} schedules''' — many practitioners start with a larger rate and reduce it over time (step decay, exponential decay, cosine annealing).
    * '''Learning rate schedules''' — many practitioners start with a larger rate and reduce it over time (step decay, exponential decay, cosine annealing).
    * '''Line search''' — classical numerical methods choose <math>\eta</math> at each step to satisfy conditions such as the Wolfe or Armijo conditions, though this is rare in {{Term|deep learning}}.
    * '''Line search''' — classical numerical methods choose <math>\eta</math> at each step to satisfy conditions such as the Wolfe or Armijo conditions, though this is rare in deep learning.


    <!--T:28-->
    <!--T:28-->
    A common heuristic is to try several values on a logarithmic scale (e.g. <math>10^{-1}, 10^{-2}, 10^{-3}</math>) and pick the one that reduces the loss fastest without instability.
    A common heuristic is to try several values on a logarithmic scale (e.g. <math>10^{-1}, 10^{-2}, 10^{-3}</math>) and pick the one that reduces the loss fastest without instability.


    == Extensions and improvements == <!--T:29-->
    <!--T:29-->
    == Extensions and improvements ==


    <!--T:30-->
    <!--T:30-->
    Several important modifications address limitations of vanilla {{Term|gradient descent}}:
    Several important modifications address limitations of vanilla gradient descent:


    <!--T:31-->
    <!--T:31-->
    * '''{{Term|momentum}}''' — accumulates a velocity vector from past gradients, helping to accelerate {{Term|convergence}} in ravine-like landscapes.
    * '''Momentum''' — accumulates a velocity vector from past gradients, helping to accelerate convergence in ravine-like landscapes.
    * '''Nesterov accelerated gradient''' — a {{Term|momentum}} variant that evaluates the gradient at a look-ahead position, yielding better theoretical {{Term|convergence}} rates.
    * '''Nesterov accelerated gradient''' — a momentum variant that evaluates the gradient at a look-ahead position, yielding better theoretical convergence rates.
    * '''Adaptive methods''' ({{Term|adagrad}}, RMSProp, {{Term|adam}}) — maintain per-parameter {{Term|learning rate|learning rates}} that adapt based on the history of gradients.
    * '''Adaptive methods''' (Adagrad, RMSProp, Adam) — maintain per-parameter learning rates that adapt based on the history of gradients.
    * '''Second-order methods''' — algorithms like Newton's method and L-BFGS use curvature information (the Hessian or its approximation) for faster {{Term|convergence}}, but are often too expensive for large-scale problems.
    * '''Second-order methods''' — algorithms like Newton's method and L-BFGS use curvature information (the Hessian or its approximation) for faster convergence, but are often too expensive for large-scale problems.


    == Practical tips == <!--T:32-->
    <!--T:32-->
    == Practical tips ==


    <!--T:33-->
    <!--T:33-->
    * '''Feature scaling''' — normalising input features so they have similar ranges dramatically improves {{Term|convergence}}, because the loss surface becomes more isotropic.
    * '''Feature scaling''' — normalising input features so they have similar ranges dramatically improves convergence, because the loss surface becomes more isotropic.
    * '''{{Term|gradient clipping}}''' — capping the norm of the gradient prevents excessively large updates.
    * '''Gradient clipping''' — capping the norm of the gradient prevents excessively large updates.
    * '''Random initialisation''' — starting from a reasonable random initialisation (e.g. Xavier or He initialisation for neural networks) avoids symmetry-breaking issues.
    * '''Random initialisation''' — starting from a reasonable random initialisation (e.g. Xavier or He initialisation for neural networks) avoids symmetry-breaking issues.
    * '''Monitoring the loss curve''' — plotting the training loss over iterations is the simplest diagnostic: a smoothly decreasing curve indicates healthy training; oscillations suggest the {{Term|learning rate}} is too high.
    * '''Monitoring the loss curve''' — plotting the training loss over iterations is the simplest diagnostic: a smoothly decreasing curve indicates healthy training; oscillations suggest the learning rate is too high.


    == Applications == <!--T:34-->
    <!--T:34-->
    == Applications ==


    <!--T:35-->
    <!--T:35-->
    {{Term|gradient descent}} and its variants are used throughout science and engineering:
    Gradient descent and its variants are used throughout science and engineering:


    <!--T:36-->
    <!--T:36-->
    Line 124: Line 134:
    * Economics and game-theoretic equilibrium computation
    * Economics and game-theoretic equilibrium computation


    == See also == <!--T:37-->
    <!--T:37-->
    == See also ==


    <!--T:38-->
    <!--T:38-->
    Line 133: Line 144:
    * [[Overfitting and Regularization]]
    * [[Overfitting and Regularization]]


    == References == <!--T:39-->
    <!--T:39-->
    == References ==


    <!--T:40-->
    <!--T:40-->
    * Cauchy, A. (1847). "Méthode générale pour la résolution des systèmes d'équations simultanées". ''Comptes Rendus de l'Académie des Sciences''.
    * Cauchy, A. (1847). "Méthode générale pour la résolution des systèmes d'équations simultanées". ''Comptes Rendus de l'Académie des Sciences''.
    * Boyd, S. and Vandenberghe, L. (2004). ''Convex Optimization''. Cambridge University Press.
    * Boyd, S. and Vandenberghe, L. (2004). ''Convex Optimization''. Cambridge University Press.
    * Ruder, S. (2016). "An overview of {{Term|gradient descent}} optimization algorithms". ''arXiv:1609.04747''.
    * Ruder, S. (2016). "An overview of gradient descent optimization algorithms". ''arXiv:1609.04747''.
    * Goodfellow, I., Bengio, Y. and Courville, A. (2016). ''Deep Learning'', Chapter 8. MIT Press.
    * Goodfellow, I., Bengio, Y. and Courville, A. (2016). ''Deep Learning'', Chapter 8. MIT Press.
    </translate>
    </translate>

    Revision as of 19:43, 27 April 2026

    Other languages:
    Article
    Topic area Optimization
    Difficulty Introductory

    Gradient descent is a first-order iterative optimisation algorithm for finding a local minimum of a differentiable function. It is the foundation of nearly all modern machine-learning training procedures, from simple linear regression to billion-parameter deep neural networks.

    Intuition

    Imagine standing on a mountainside in thick fog. You cannot see the valley floor, but you can feel the slope beneath your feet. The most natural strategy is to take a step in the steepest downhill direction, then reassess. Gradient descent formalises precisely this idea: at each step, the algorithm computes the direction of steepest increase of the function (the gradient) and moves in the opposite direction.

    The size of each step is controlled by a scalar called the learning rate (often denoted $ \eta $). A large learning rate covers ground quickly but risks overshooting the minimum; a small learning rate converges more reliably but may take prohibitively many steps.

    Mathematical formulation

    Given a differentiable objective function $ f:\mathbb{R}^n \to \mathbb{R} $, gradient descent generates a sequence of iterates by the update rule:

    $ \theta_{t+1} = \theta_t - \eta \, \nabla f(\theta_t) $

    where $ \nabla f(\theta_t) $ is the gradient vector evaluated at the current point $ \theta_t $ and $ \eta > 0 $ is the learning rate.

    In the one-dimensional case this simplifies to:

    $ \theta_{t+1} = \theta_t - \eta \, f'(\theta_t) $

    The gradient $ \nabla f $ points in the direction of steepest ascent, so subtracting it moves the iterate downhill.

    Batch, stochastic, and mini-batch variants

    When the objective has the form of an average over data points,

    $ f(\theta) = \frac{1}{N}\sum_{i=1}^{N} \ell(\theta;\, x_i, y_i) $

    three common strategies differ in how much data is used to estimate the gradient:

    Variant Gradient computed over Per-step cost Gradient noise
    Batch (full) gradient descent All $ N $ samples High None
    Stochastic gradient descent (SGD) 1 random sample Low High
    Mini-batch gradient descent $ B $ random samples ($ 1 < B < N $) Medium Medium

    Full batch gradient descent computes the exact gradient and therefore follows a smooth trajectory toward the minimum. Stochastic gradient descent uses a single sample to estimate the gradient, drastically reducing computation per step at the cost of a noisier trajectory. Mini-batch gradient descent strikes a balance and is the most common choice in practice, with typical batch sizes between 32 and 512.

    Convergence

    Convex functions

    For a convex function with Lipschitz-continuous gradients (constant $ L $), gradient descent with a fixed learning rate $ \eta \leq 1/L $ converges at a rate of $ O(1/t) $. If the function is additionally strongly convex with parameter $ \mu > 0 $, convergence accelerates to a linear (exponential) rate:

    $ f(\theta_t) - f(\theta^*) \leq \left(1 - \frac{\mu}{L}\right)^t \bigl(f(\theta_0) - f(\theta^*)\bigr) $

    The ratio $ \kappa = L / \mu $ is called the condition number and governs how quickly the algorithm converges. Ill-conditioned problems (large $ \kappa $) converge slowly.

    Non-convex functions

    Most deep-learning objectives are non-convex. In this setting gradient descent is only guaranteed to converge to a stationary point (where $ \nabla f = 0 $), which could be a local minimum, saddle point, or even a local maximum. In practice, saddle points are more problematic than local minima in high-dimensional spaces.

    Learning rate selection

    Choosing the learning rate is one of the most important practical decisions:

    • Too large — the iterates oscillate or diverge.
    • Too small — convergence is unacceptably slow.
    • Learning rate schedules — many practitioners start with a larger rate and reduce it over time (step decay, exponential decay, cosine annealing).
    • Line search — classical numerical methods choose $ \eta $ at each step to satisfy conditions such as the Wolfe or Armijo conditions, though this is rare in deep learning.

    A common heuristic is to try several values on a logarithmic scale (e.g. $ 10^{-1}, 10^{-2}, 10^{-3} $) and pick the one that reduces the loss fastest without instability.

    Extensions and improvements

    Several important modifications address limitations of vanilla gradient descent:

    • Momentum — accumulates a velocity vector from past gradients, helping to accelerate convergence in ravine-like landscapes.
    • Nesterov accelerated gradient — a momentum variant that evaluates the gradient at a look-ahead position, yielding better theoretical convergence rates.
    • Adaptive methods (Adagrad, RMSProp, Adam) — maintain per-parameter learning rates that adapt based on the history of gradients.
    • Second-order methods — algorithms like Newton's method and L-BFGS use curvature information (the Hessian or its approximation) for faster convergence, but are often too expensive for large-scale problems.

    Practical tips

    • Feature scaling — normalising input features so they have similar ranges dramatically improves convergence, because the loss surface becomes more isotropic.
    • Gradient clipping — capping the norm of the gradient prevents excessively large updates.
    • Random initialisation — starting from a reasonable random initialisation (e.g. Xavier or He initialisation for neural networks) avoids symmetry-breaking issues.
    • Monitoring the loss curve — plotting the training loss over iterations is the simplest diagnostic: a smoothly decreasing curve indicates healthy training; oscillations suggest the learning rate is too high.

    Applications

    Gradient descent and its variants are used throughout science and engineering:

    • Training machine-learning models (linear models, neural networks, support vector machines)
    • Signal processing and control systems
    • Inverse problems in physics and imaging
    • Operations research and logistics optimisation
    • Economics and game-theoretic equilibrium computation

    See also

    References

    • Cauchy, A. (1847). "Méthode générale pour la résolution des systèmes d'équations simultanées". Comptes Rendus de l'Académie des Sciences.
    • Boyd, S. and Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press.
    • Ruder, S. (2016). "An overview of gradient descent optimization algorithms". arXiv:1609.04747.
    • Goodfellow, I., Bengio, Y. and Courville, A. (2016). Deep Learning, Chapter 8. MIT Press.