<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://marovi.ai/index.php?action=history&amp;feed=atom&amp;title=Gradient_Descent</id>
	<title>Gradient Descent - Revision history</title>
	<link rel="self" type="application/atom+xml" href="https://marovi.ai/index.php?action=history&amp;feed=atom&amp;title=Gradient_Descent"/>
	<link rel="alternate" type="text/html" href="https://marovi.ai/index.php?title=Gradient_Descent&amp;action=history"/>
	<updated>2026-04-24T11:53:59Z</updated>
	<subtitle>Revision history for this page on the wiki</subtitle>
	<generator>MediaWiki 1.39.1</generator>
	<entry>
		<id>https://marovi.ai/index.php?title=Gradient_Descent&amp;diff=2137&amp;oldid=prev</id>
		<title>DeployBot: [deploy-bot] Deploy from CI (8c92aeb)</title>
		<link rel="alternate" type="text/html" href="https://marovi.ai/index.php?title=Gradient_Descent&amp;diff=2137&amp;oldid=prev"/>
		<updated>2026-04-24T07:08:59Z</updated>

		<summary type="html">&lt;p&gt;[deploy-bot] Deploy from CI (8c92aeb)&lt;/p&gt;
&lt;table style=&quot;background-color: #fff; color: #202122;&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;Revision as of 07:08, 24 April 2026&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l114&quot;&gt;Line 114:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 114:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[[Category:Optimization]]&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[[Category:Optimization]]&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[[Category:Introductory]]&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[[Category:Introductory]]&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&amp;lt;!--v1.2.0 cache-bust--&gt;&lt;/del&gt;&lt;/div&gt;&lt;/td&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-added&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&amp;lt;!-- pass 2 --&gt;&lt;/del&gt;&lt;/div&gt;&lt;/td&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-added&quot;&gt;&lt;/td&gt;&lt;/tr&gt;

&lt;/table&gt;</summary>
		<author><name>DeployBot</name></author>
	</entry>
	<entry>
		<id>https://marovi.ai/index.php?title=Gradient_Descent&amp;diff=2098&amp;oldid=prev</id>
		<title>DeployBot: Pass 2 force re-parse</title>
		<link rel="alternate" type="text/html" href="https://marovi.ai/index.php?title=Gradient_Descent&amp;diff=2098&amp;oldid=prev"/>
		<updated>2026-04-24T07:00:42Z</updated>

		<summary type="html">&lt;p&gt;Pass 2 force re-parse&lt;/p&gt;
&lt;table style=&quot;background-color: #fff; color: #202122;&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;Revision as of 07:00, 24 April 2026&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l115&quot;&gt;Line 115:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 115:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[[Category:Introductory]]&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[[Category:Introductory]]&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&amp;lt;!--v1.2.0 cache-bust--&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&amp;lt;!--v1.2.0 cache-bust--&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-deleted&quot;&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&amp;lt;!-- pass 2 --&gt;&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;

&lt;/table&gt;</summary>
		<author><name>DeployBot</name></author>
	</entry>
	<entry>
		<id>https://marovi.ai/index.php?title=Gradient_Descent&amp;diff=2061&amp;oldid=prev</id>
		<title>DeployBot: Force re-parse after Math source-mode rollout (v1.2.0)</title>
		<link rel="alternate" type="text/html" href="https://marovi.ai/index.php?title=Gradient_Descent&amp;diff=2061&amp;oldid=prev"/>
		<updated>2026-04-24T06:58:06Z</updated>

		<summary type="html">&lt;p&gt;Force re-parse after Math source-mode rollout (v1.2.0)&lt;/p&gt;
&lt;table style=&quot;background-color: #fff; color: #202122;&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;Revision as of 06:58, 24 April 2026&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l114&quot;&gt;Line 114:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 114:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[[Category:Optimization]]&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[[Category:Optimization]]&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[[Category:Introductory]]&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[[Category:Introductory]]&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-deleted&quot;&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&amp;lt;!--v1.2.0 cache-bust--&gt;&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;

&lt;/table&gt;</summary>
		<author><name>DeployBot</name></author>
	</entry>
	<entry>
		<id>https://marovi.ai/index.php?title=Gradient_Descent&amp;diff=1986&amp;oldid=prev</id>
		<title>DeployBot: [deploy-bot] Deploy from CI (775ba6e)</title>
		<link rel="alternate" type="text/html" href="https://marovi.ai/index.php?title=Gradient_Descent&amp;diff=1986&amp;oldid=prev"/>
		<updated>2026-04-24T04:01:42Z</updated>

		<summary type="html">&lt;p&gt;[deploy-bot] Deploy from CI (775ba6e)&lt;/p&gt;
&lt;p&gt;&lt;b&gt;New page&lt;/b&gt;&lt;/p&gt;&lt;div&gt;{{LanguageBar | page = Gradient Descent}}&lt;br /&gt;
{{ArticleInfobox | topic_area = Optimization | difficulty = Introductory | prerequisites = }}&lt;br /&gt;
{{ContentMeta | generated_by = claude-opus | model_used = claude-opus-4-6 | generated_date = 2026-03-13}}&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;Gradient descent&amp;#039;&amp;#039;&amp;#039; is a first-order iterative optimisation algorithm for finding a local minimum of a differentiable function. It is the foundation of nearly all modern machine-learning training procedures, from simple linear regression to billion-parameter deep neural networks.&lt;br /&gt;
&lt;br /&gt;
== Intuition ==&lt;br /&gt;
&lt;br /&gt;
Imagine standing on a mountainside in thick fog. You cannot see the valley floor, but you can feel the slope beneath your feet. The most natural strategy is to take a step in the steepest downhill direction, then reassess. Gradient descent formalises precisely this idea: at each step, the algorithm computes the direction of steepest increase of the function (the &amp;#039;&amp;#039;&amp;#039;gradient&amp;#039;&amp;#039;&amp;#039;) and moves in the opposite direction.&lt;br /&gt;
&lt;br /&gt;
The size of each step is controlled by a scalar called the &amp;#039;&amp;#039;&amp;#039;learning rate&amp;#039;&amp;#039;&amp;#039; (often denoted &amp;lt;math&amp;gt;\eta&amp;lt;/math&amp;gt;). A large learning rate covers ground quickly but risks overshooting the minimum; a small learning rate converges more reliably but may take prohibitively many steps.&lt;br /&gt;
&lt;br /&gt;
== Mathematical formulation ==&lt;br /&gt;
&lt;br /&gt;
Given a differentiable objective function &amp;lt;math&amp;gt;f:\mathbb{R}^n \to \mathbb{R}&amp;lt;/math&amp;gt;, gradient descent generates a sequence of iterates by the &amp;#039;&amp;#039;&amp;#039;update rule&amp;#039;&amp;#039;&amp;#039;:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\theta_{t+1} = \theta_t - \eta \, \nabla f(\theta_t)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;math&amp;gt;\nabla f(\theta_t)&amp;lt;/math&amp;gt; is the gradient vector evaluated at the current point &amp;lt;math&amp;gt;\theta_t&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;\eta &amp;gt; 0&amp;lt;/math&amp;gt; is the learning rate.&lt;br /&gt;
&lt;br /&gt;
In the one-dimensional case this simplifies to:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\theta_{t+1} = \theta_t - \eta \, f&amp;#039;(\theta_t)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The gradient &amp;lt;math&amp;gt;\nabla f&amp;lt;/math&amp;gt; points in the direction of steepest ascent, so subtracting it moves the iterate downhill.&lt;br /&gt;
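&lt;br /&gt;
A minimal sketch of this update rule in Python follows; the quadratic objective, learning rate, and step count are illustrative choices rather than part of the algorithm itself.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;python&amp;quot;&amp;gt;&lt;br /&gt;
def gradient_descent(grad, theta0, eta=0.1, steps=100):&lt;br /&gt;
    # Iterate the update rule: theta = theta - eta * grad(theta).&lt;br /&gt;
    theta = theta0&lt;br /&gt;
    for _ in range(steps):&lt;br /&gt;
        theta = theta - eta * grad(theta)&lt;br /&gt;
    return theta&lt;br /&gt;
&lt;br /&gt;
# f(theta) = theta**2 has gradient 2*theta and its minimum at 0.&lt;br /&gt;
print(gradient_descent(lambda t: 2 * t, theta0=5.0))  # prints a value near 0&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;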
&lt;br /&gt;
== Batch, stochastic, and mini-batch variants ==&lt;br /&gt;
&lt;br /&gt;
When the objective has the form of an average over data points,&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;f(\theta) = \frac{1}{N}\sum_{i=1}^{N} \ell(\theta;\, x_i, y_i)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
three common strategies differ in how much data is used to estimate the gradient:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! Variant !! Gradient computed over !! Per-step cost !! Gradient noise&lt;br /&gt;
|-&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;Batch (full) gradient descent&amp;#039;&amp;#039;&amp;#039; || All &amp;lt;math&amp;gt;N&amp;lt;/math&amp;gt; samples || High || None&lt;br /&gt;
|-&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;Stochastic gradient descent (SGD)&amp;#039;&amp;#039;&amp;#039; || 1 random sample || Low || High&lt;br /&gt;
|-&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;Mini-batch gradient descent&amp;#039;&amp;#039;&amp;#039; || &amp;lt;math&amp;gt;B&amp;lt;/math&amp;gt; random samples (&amp;lt;math&amp;gt;1 &amp;lt; B &amp;lt; N&amp;lt;/math&amp;gt;) || Medium || Medium&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Full batch gradient descent computes the exact gradient and therefore follows a smooth trajectory toward the minimum. [[Stochastic Gradient Descent|Stochastic gradient descent]] uses a single sample to estimate the gradient, drastically reducing computation per step at the cost of a noisier trajectory. Mini-batch gradient descent strikes a balance and is the most common choice in practice, with typical batch sizes between 32 and 512.&lt;br /&gt;
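&lt;br /&gt;
The following is a minimal sketch of a single mini-batch step for linear least squares; the synthetic data, batch size, and learning rate are illustrative assumptions.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;python&amp;quot;&amp;gt;&lt;br /&gt;
import numpy as np&lt;br /&gt;
&lt;br /&gt;
rng = np.random.default_rng(0)&lt;br /&gt;
X = rng.normal(size=(1000, 5))   # N = 1000 samples, 5 features&lt;br /&gt;
y = rng.normal(size=1000)&lt;br /&gt;
theta, eta, B = np.zeros(5), 0.01, 32&lt;br /&gt;
&lt;br /&gt;
idx = rng.choice(len(X), size=B, replace=False)   # draw a random mini-batch&lt;br /&gt;
Xb, yb = X[idx], y[idx]&lt;br /&gt;
grad = 2.0 / B * Xb.T @ (Xb @ theta - yb)   # gradient of the mean squared error on the batch&lt;br /&gt;
theta -= eta * grad   # one mini-batch update&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;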
&lt;br /&gt;
== Convergence ==&lt;br /&gt;
&lt;br /&gt;
=== Convex functions ===&lt;br /&gt;
&lt;br /&gt;
For a convex function with Lipschitz-continuous gradients (constant &amp;lt;math&amp;gt;L&amp;lt;/math&amp;gt;), gradient descent with a fixed learning rate &amp;lt;math&amp;gt;\eta \leq 1/L&amp;lt;/math&amp;gt; converges at a rate of &amp;lt;math&amp;gt;O(1/t)&amp;lt;/math&amp;gt;. If the function is additionally &amp;#039;&amp;#039;&amp;#039;strongly convex&amp;#039;&amp;#039;&amp;#039; with parameter &amp;lt;math&amp;gt;\mu &amp;gt; 0&amp;lt;/math&amp;gt;, convergence accelerates to a linear (exponential) rate:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;f(\theta_t) - f(\theta^*) \leq \left(1 - \frac{\mu}{L}\right)^t \bigl(f(\theta_0) - f(\theta^*)\bigr)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The ratio &amp;lt;math&amp;gt;\kappa = L / \mu&amp;lt;/math&amp;gt; is called the &amp;#039;&amp;#039;&amp;#039;condition number&amp;#039;&amp;#039;&amp;#039; and governs how quickly the algorithm converges. Ill-conditioned problems (large &amp;lt;math&amp;gt;\kappa&amp;lt;/math&amp;gt;) converge slowly.&lt;br /&gt;
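&lt;br /&gt;
The effect can be demonstrated numerically. The sketch below, with illustrative constants, runs gradient descent on the separable quadratic &amp;lt;math&amp;gt;f(x) = \frac{1}{2}(L x_1^2 + \mu x_2^2)&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;python&amp;quot;&amp;gt;&lt;br /&gt;
import numpy as np&lt;br /&gt;
&lt;br /&gt;
def remaining_error(L, mu, steps=200):&lt;br /&gt;
    # f(x) = 0.5 * (L * x[0]**2 + mu * x[1]**2), minimised at the origin.&lt;br /&gt;
    x = np.array([1.0, 1.0])&lt;br /&gt;
    eta = 1.0 / L   # fixed step size eta = 1/L&lt;br /&gt;
    for _ in range(steps):&lt;br /&gt;
        x -= eta * np.array([L * x[0], mu * x[1]])   # gradient step&lt;br /&gt;
    return 0.5 * (L * x[0] ** 2 + mu * x[1] ** 2)&lt;br /&gt;
&lt;br /&gt;
print(remaining_error(L=10.0, mu=5.0))   # kappa = 2: error is essentially zero&lt;br /&gt;
print(remaining_error(L=10.0, mu=0.1))   # kappa = 100: error is still visible&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;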
&lt;br /&gt;
=== Non-convex functions ===&lt;br /&gt;
&lt;br /&gt;
Most deep-learning objectives are non-convex. In this setting gradient descent is only guaranteed to converge to a stationary point (where &amp;lt;math&amp;gt;\nabla f = 0&amp;lt;/math&amp;gt;), which could be a local minimum, saddle point, or even a local maximum. For example, on &amp;lt;math&amp;gt;f(x, y) = x^2 - y^2&amp;lt;/math&amp;gt; any iterate started on the &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt;-axis converges to the saddle point at the origin. In practice, saddle points are more problematic than local minima in high-dimensional spaces.&lt;br /&gt;
&lt;br /&gt;
== Learning rate selection ==&lt;br /&gt;
&lt;br /&gt;
Choosing the learning rate is one of the most important practical decisions:&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Too large&amp;#039;&amp;#039;&amp;#039; — the iterates oscillate or diverge.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Too small&amp;#039;&amp;#039;&amp;#039; — convergence is unacceptably slow.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Learning rate schedules&amp;#039;&amp;#039;&amp;#039; — many practitioners start with a larger rate and reduce it over time (step decay, exponential decay, cosine annealing).&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Line search&amp;#039;&amp;#039;&amp;#039; — classical numerical methods choose &amp;lt;math&amp;gt;\eta&amp;lt;/math&amp;gt; at each step to satisfy conditions such as the Wolfe or Armijo conditions, though this is rare in deep learning.&lt;br /&gt;
&lt;br /&gt;
A common heuristic is to try several values on a logarithmic scale (e.g. &amp;lt;math&amp;gt;10^{-1}, 10^{-2}, 10^{-3}&amp;lt;/math&amp;gt;) and pick the one that reduces the loss fastest without instability.&lt;br /&gt;
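&lt;br /&gt;
A minimal sketch of such a sweep, assuming an illustrative one-dimensional quadratic objective:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;python&amp;quot;&amp;gt;&lt;br /&gt;
def final_loss(eta, steps=50):&lt;br /&gt;
    t = 5.0&lt;br /&gt;
    for _ in range(steps):&lt;br /&gt;
        t -= eta * 2 * t   # gradient of t**2 is 2*t&lt;br /&gt;
    return t ** 2&lt;br /&gt;
&lt;br /&gt;
for eta in (1e-1, 1e-2, 1e-3):&lt;br /&gt;
    print(eta, final_loss(eta))   # here the largest rate reduces the loss fastest&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;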
&lt;br /&gt;
== Extensions and improvements ==&lt;br /&gt;
&lt;br /&gt;
Several important modifications address limitations of vanilla gradient descent:&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Momentum&amp;#039;&amp;#039;&amp;#039; — accumulates a velocity vector from past gradients, helping to accelerate convergence in ravine-like landscapes (see the sketch after this list).&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Nesterov accelerated gradient&amp;#039;&amp;#039;&amp;#039; — a momentum variant that evaluates the gradient at a look-ahead position, yielding better theoretical convergence rates.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Adaptive methods&amp;#039;&amp;#039;&amp;#039; (Adagrad, RMSProp, Adam) — maintain per-parameter learning rates that adapt based on the history of gradients.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Second-order methods&amp;#039;&amp;#039;&amp;#039; — algorithms like Newton&amp;#039;s method and L-BFGS use curvature information (the Hessian or its approximation) for faster convergence, but are often too expensive for large-scale problems.&lt;br /&gt;
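&lt;br /&gt;
A minimal sketch of a common momentum formulation, assuming an illustrative ravine-like quadratic and a typical momentum coefficient of 0.9:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;python&amp;quot;&amp;gt;&lt;br /&gt;
import numpy as np&lt;br /&gt;
&lt;br /&gt;
def momentum_gd(grad, theta0, eta=0.01, beta=0.9, steps=200):&lt;br /&gt;
    theta, v = theta0, np.zeros_like(theta0)&lt;br /&gt;
    for _ in range(steps):&lt;br /&gt;
        v = beta * v + grad(theta)   # accumulate velocity from past gradients&lt;br /&gt;
        theta = theta - eta * v      # step along the accumulated direction&lt;br /&gt;
    return theta&lt;br /&gt;
&lt;br /&gt;
# Ravine-like quadratic: steep in the first coordinate, shallow in the second.&lt;br /&gt;
grad = lambda x: np.array([10.0 * x[0], 0.1 * x[1]])&lt;br /&gt;
print(momentum_gd(grad, np.array([1.0, 1.0])))&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;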
&lt;br /&gt;
== Practical tips ==&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Feature scaling&amp;#039;&amp;#039;&amp;#039; — normalising input features so they have similar ranges dramatically improves convergence, because the loss surface becomes more isotropic.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Gradient clipping&amp;#039;&amp;#039;&amp;#039; — capping the norm of the gradient prevents excessively large updates (a sketch follows this list).&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Random initialisation&amp;#039;&amp;#039;&amp;#039; — starting from a reasonable random initialisation (e.g. Xavier or He initialisation for neural networks) breaks the symmetry between parameters that would otherwise receive identical updates.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Monitoring the loss curve&amp;#039;&amp;#039;&amp;#039; — plotting the training loss over iterations is the simplest diagnostic: a smoothly decreasing curve indicates healthy training; oscillations suggest the learning rate is too high.&lt;br /&gt;
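&lt;br /&gt;
A minimal sketch of gradient clipping by global norm; the threshold of 1.0 is an illustrative choice:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;python&amp;quot;&amp;gt;&lt;br /&gt;
import numpy as np&lt;br /&gt;
&lt;br /&gt;
def clip_by_norm(grad, max_norm=1.0):&lt;br /&gt;
    norm = np.linalg.norm(grad)&lt;br /&gt;
    if norm &amp;gt; max_norm:&lt;br /&gt;
        grad = grad * (max_norm / norm)   # rescale so the norm equals max_norm&lt;br /&gt;
    return grad&lt;br /&gt;
&lt;br /&gt;
print(clip_by_norm(np.array([3.0, 4.0])))   # norm 5.0 is rescaled to [0.6, 0.8]&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;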
&lt;br /&gt;
== Applications ==&lt;br /&gt;
&lt;br /&gt;
Gradient descent and its variants are used throughout science and engineering:&lt;br /&gt;
&lt;br /&gt;
* Training machine-learning models (linear models, neural networks, support vector machines)&lt;br /&gt;
* Signal processing and control systems&lt;br /&gt;
* Inverse problems in physics and imaging&lt;br /&gt;
* Operations research and logistics optimisation&lt;br /&gt;
* Economics and game-theoretic equilibrium computation&lt;br /&gt;
&lt;br /&gt;
== See also ==&lt;br /&gt;
&lt;br /&gt;
* [[Stochastic Gradient Descent]]&lt;br /&gt;
* [[Backpropagation]]&lt;br /&gt;
* [[Loss Functions]]&lt;br /&gt;
* [[Neural Networks]]&lt;br /&gt;
* [[Overfitting and Regularization]]&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
* Cauchy, A. (1847). &amp;quot;Méthode générale pour la résolution des systèmes d&amp;#039;équations simultanées&amp;quot;. &amp;#039;&amp;#039;Comptes Rendus de l&amp;#039;Académie des Sciences&amp;#039;&amp;#039;.&lt;br /&gt;
* Boyd, S. and Vandenberghe, L. (2004). &amp;#039;&amp;#039;Convex Optimization&amp;#039;&amp;#039;. Cambridge University Press.&lt;br /&gt;
* Ruder, S. (2016). &amp;quot;An overview of gradient descent optimization algorithms&amp;quot;. &amp;#039;&amp;#039;arXiv:1609.04747&amp;#039;&amp;#039;.&lt;br /&gt;
* Goodfellow, I., Bengio, Y. and Courville, A. (2016). &amp;#039;&amp;#039;Deep Learning&amp;#039;&amp;#039;, Chapter 8. MIT Press.&lt;br /&gt;
&lt;br /&gt;
[[Category:Optimization]]&lt;br /&gt;
[[Category:Introductory]]&lt;/div&gt;</summary>
		<author><name>DeployBot</name></author>
	</entry>
</feed>