<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://marovi.ai/index.php?action=history&amp;feed=atom&amp;title=Gradient_Descent%2Fen</id>
	<title>Gradient Descent/en - Revision history</title>
	<link rel="self" type="application/atom+xml" href="https://marovi.ai/index.php?action=history&amp;feed=atom&amp;title=Gradient_Descent%2Fen"/>
	<link rel="alternate" type="text/html" href="https://marovi.ai/index.php?title=Gradient_Descent/en&amp;action=history"/>
	<updated>2026-04-28T00:30:17Z</updated>
	<subtitle>Revision history for this page on the wiki</subtitle>
	<generator>MediaWiki 1.39.1</generator>
	<entry>
		<id>https://marovi.ai/index.php?title=Gradient_Descent/en&amp;diff=18028&amp;oldid=prev</id>
		<title>FuzzyBot: Updating to match new version of source page</title>
		<link rel="alternate" type="text/html" href="https://marovi.ai/index.php?title=Gradient_Descent/en&amp;diff=18028&amp;oldid=prev"/>
		<updated>2026-04-27T23:47:19Z</updated>

		<summary type="html">&lt;p&gt;Updating to match new version of source page&lt;/p&gt;
&lt;table style=&quot;background-color: #fff; color: #202122;&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;Revision as of 23:47, 27 April 2026&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l9&quot;&gt;Line 9:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 9:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Imagine standing on a mountainside in thick fog. You cannot see the valley floor, but you can feel the slope beneath your feet. The most natural strategy is to take a step in the steepest downhill direction, then reassess. Gradient descent formalises precisely this idea: at each step, the algorithm computes the direction of steepest increase of the function (the &amp;#039;&amp;#039;&amp;#039;gradient&amp;#039;&amp;#039;&amp;#039;) and moves in the opposite direction.&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Imagine standing on a mountainside in thick fog. You cannot see the valley floor, but you can feel the slope beneath your feet. The most natural strategy is to take a step in the steepest downhill direction, then reassess. Gradient descent formalises precisely this idea: at each step, the algorithm computes the direction of steepest increase of the function (the &amp;#039;&amp;#039;&amp;#039;gradient&amp;#039;&amp;#039;&amp;#039;) and moves in the opposite direction.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;The size of each step is controlled by a scalar called the &amp;#039;&amp;#039;&amp;#039;learning rate&amp;#039;&amp;#039;&amp;#039; (often denoted &amp;lt;math&amp;gt;\eta&amp;lt;/math&amp;gt;). A large learning rate covers ground quickly but risks overshooting the minimum; a small learning rate converges more reliably but may take prohibitively many steps.&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;The size of each step is controlled by a scalar called the &amp;#039;&amp;#039;&amp;#039;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;{{Term|&lt;/ins&gt;learning rate&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;}}&lt;/ins&gt;&amp;#039;&amp;#039;&amp;#039; (often denoted &amp;lt;math&amp;gt;\eta&amp;lt;/math&amp;gt;). A large &lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;{{Term|&lt;/ins&gt;learning rate&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;}} &lt;/ins&gt;covers ground quickly but risks overshooting the minimum; a small &lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;{{Term|&lt;/ins&gt;learning rate&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;}} &lt;/ins&gt;converges more reliably but may take prohibitively many steps.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;== Mathematical formulation ==&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;== Mathematical formulation ==&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Given a differentiable objective function &amp;lt;math&amp;gt;f:\mathbb{R}^n \to \mathbb{R}&amp;lt;/math&amp;gt;, gradient descent generates a sequence of iterates by the &amp;#039;&amp;#039;&amp;#039;update rule&amp;#039;&amp;#039;&amp;#039;:&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Given a differentiable &lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;{{Term|loss function|&lt;/ins&gt;objective function&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;}} &lt;/ins&gt;&amp;lt;math&amp;gt;f:\mathbb{R}^n \to \mathbb{R}&amp;lt;/math&amp;gt;, gradient descent generates a sequence of iterates by the &amp;#039;&amp;#039;&amp;#039;update rule&amp;#039;&amp;#039;&amp;#039;:&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;:&amp;lt;math&amp;gt;\theta_{t+1} = \theta_t - \eta \, \nabla f(\theta_t)&amp;lt;/math&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;:&amp;lt;math&amp;gt;\theta_{t+1} = \theta_t - \eta \, \nabla f(\theta_t)&amp;lt;/math&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;where &amp;lt;math&amp;gt;\nabla f(\theta_t)&amp;lt;/math&amp;gt; is the gradient vector evaluated at the current point &amp;lt;math&amp;gt;\theta_t&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;\eta &amp;gt; 0&amp;lt;/math&amp;gt; is the learning rate.&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;where &amp;lt;math&amp;gt;\nabla f(\theta_t)&amp;lt;/math&amp;gt; is the gradient vector evaluated at the current point &amp;lt;math&amp;gt;\theta_t&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;\eta &amp;gt; 0&amp;lt;/math&amp;gt; is the &lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;{{Term|&lt;/ins&gt;learning rate&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;}}&lt;/ins&gt;.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;In the one-dimensional case this simplifies to:&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;In the one-dimensional case this simplifies to:&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l39&quot;&gt;Line 39:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 39:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| &amp;#039;&amp;#039;&amp;#039;Batch (full) gradient descent&amp;#039;&amp;#039;&amp;#039; || All &amp;lt;math&amp;gt;N&amp;lt;/math&amp;gt; samples || High || None&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| &amp;#039;&amp;#039;&amp;#039;Batch (full) gradient descent&amp;#039;&amp;#039;&amp;#039; || All &amp;lt;math&amp;gt;N&amp;lt;/math&amp;gt; samples || High || None&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;|-&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;|-&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| &amp;#039;&amp;#039;&amp;#039;&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;Stochastic &lt;/del&gt;gradient descent (SGD)&amp;#039;&amp;#039;&amp;#039; || 1 random sample || Low || High&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| &amp;#039;&amp;#039;&amp;#039;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;{{Term|stochastic &lt;/ins&gt;gradient descent&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;}} &lt;/ins&gt;(&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;{{Term|stochastic gradient descent|&lt;/ins&gt;SGD&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;}}&lt;/ins&gt;)&amp;#039;&amp;#039;&amp;#039; || 1 random sample || Low || High&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;|-&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;|-&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| &amp;#039;&amp;#039;&amp;#039;&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;Mini&lt;/del&gt;-batch gradient descent&amp;#039;&amp;#039;&amp;#039; || &amp;lt;math&amp;gt;B&amp;lt;/math&amp;gt; random samples (&amp;lt;math&amp;gt;1 &amp;lt; B &amp;lt; N&amp;lt;/math&amp;gt;) || Medium || Medium&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;| &amp;#039;&amp;#039;&amp;#039;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;{{Term|mini&lt;/ins&gt;-batch&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;}} &lt;/ins&gt;gradient descent&amp;#039;&amp;#039;&amp;#039; || &amp;lt;math&amp;gt;B&amp;lt;/math&amp;gt; random samples (&amp;lt;math&amp;gt;1 &amp;lt; B &amp;lt; N&amp;lt;/math&amp;gt;) || Medium || Medium&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;|}&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;|}&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Full batch gradient descent computes the exact gradient and therefore follows a smooth trajectory toward the minimum. [[Stochastic Gradient Descent|Stochastic gradient descent]] uses a single sample to estimate the gradient, drastically reducing computation per step at the cost of a noisier trajectory. &lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;Mini&lt;/del&gt;-batch gradient descent strikes a balance and is the most common choice in practice, with typical batch sizes between 32 and 512.&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Full batch gradient descent computes the exact gradient and therefore follows a smooth trajectory toward the minimum. [[Stochastic Gradient Descent|Stochastic gradient descent]] uses a single sample to estimate the gradient, drastically reducing computation per step at the cost of a noisier trajectory. &lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;{{Term|mini&lt;/ins&gt;-batch&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;}} &lt;/ins&gt;gradient descent strikes a balance and is the most common choice in practice, with typical batch sizes between 32 and 512.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;== Convergence ==&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;== Convergence ==&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l50&quot;&gt;Line 50:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 50:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;=== Convex functions ===&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;=== Convex functions ===&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;For a convex function with Lipschitz-continuous gradients (constant &amp;lt;math&amp;gt;L&amp;lt;/math&amp;gt;), gradient descent with a fixed learning rate &amp;lt;math&amp;gt;\eta \leq 1/L&amp;lt;/math&amp;gt; converges at a rate of &amp;lt;math&amp;gt;O(1/t)&amp;lt;/math&amp;gt;. If the function is additionally &amp;#039;&amp;#039;&amp;#039;strongly convex&amp;#039;&amp;#039;&amp;#039; with parameter &amp;lt;math&amp;gt;\mu &amp;gt; 0&amp;lt;/math&amp;gt;, convergence accelerates to a linear (exponential) rate:&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;For a convex function with Lipschitz-continuous gradients (constant &amp;lt;math&amp;gt;L&amp;lt;/math&amp;gt;), gradient descent with a fixed &lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;{{Term|&lt;/ins&gt;learning rate&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;}} &lt;/ins&gt;&amp;lt;math&amp;gt;\eta \leq 1/L&amp;lt;/math&amp;gt; converges at a rate of &amp;lt;math&amp;gt;O(1/t)&amp;lt;/math&amp;gt;. If the function is additionally &amp;#039;&amp;#039;&amp;#039;strongly convex&amp;#039;&amp;#039;&amp;#039; with parameter &amp;lt;math&amp;gt;\mu &amp;gt; 0&amp;lt;/math&amp;gt;, &lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;{{Term|&lt;/ins&gt;convergence&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;}} &lt;/ins&gt;accelerates to a linear (exponential) rate:&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;:&amp;lt;math&amp;gt;f(\theta_t) - f(\theta^*) \leq \left(1 - \frac{\mu}{L}\right)^t \bigl(f(\theta_0) - f(\theta^*)\bigr)&amp;lt;/math&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;:&amp;lt;math&amp;gt;f(\theta_t) - f(\theta^*) \leq \left(1 - \frac{\mu}{L}\right)^t \bigl(f(\theta_0) - f(\theta^*)\bigr)&amp;lt;/math&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l62&quot;&gt;Line 62:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 62:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;== Learning rate selection ==&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;== Learning rate selection ==&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Choosing the learning rate is one of the most important practical decisions:&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Choosing the &lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;{{Term|&lt;/ins&gt;learning rate&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;}} &lt;/ins&gt;is one of the most important practical decisions:&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;* &amp;#039;&amp;#039;&amp;#039;Too large&amp;#039;&amp;#039;&amp;#039; — the iterates oscillate or diverge.&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;* &amp;#039;&amp;#039;&amp;#039;Too large&amp;#039;&amp;#039;&amp;#039; — the iterates oscillate or diverge.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;* &amp;#039;&amp;#039;&amp;#039;Too small&amp;#039;&amp;#039;&amp;#039; — convergence is unacceptably slow.&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;* &amp;#039;&amp;#039;&amp;#039;Too small&amp;#039;&amp;#039;&amp;#039; — &lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;{{Term|&lt;/ins&gt;convergence&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;}} &lt;/ins&gt;is unacceptably slow.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;* &amp;#039;&amp;#039;&amp;#039;&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;Learning &lt;/del&gt;rate schedules&amp;#039;&amp;#039;&amp;#039; — many practitioners start with a larger rate and reduce it over time (step decay, exponential decay, cosine annealing).&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;* &amp;#039;&amp;#039;&amp;#039;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;{{Term|learning &lt;/ins&gt;rate&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;}} &lt;/ins&gt;schedules&amp;#039;&amp;#039;&amp;#039; — many practitioners start with a larger rate and reduce it over time (step decay, exponential decay, cosine annealing).&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;* &amp;#039;&amp;#039;&amp;#039;Line search&amp;#039;&amp;#039;&amp;#039; — classical numerical methods choose &amp;lt;math&amp;gt;\eta&amp;lt;/math&amp;gt; at each step to satisfy conditions such as the Wolfe or Armijo conditions, though this is rare in deep learning.&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;* &amp;#039;&amp;#039;&amp;#039;Line search&amp;#039;&amp;#039;&amp;#039; — classical numerical methods choose &amp;lt;math&amp;gt;\eta&amp;lt;/math&amp;gt; at each step to satisfy conditions such as the Wolfe or Armijo conditions, though this is rare in &lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;{{Term|&lt;/ins&gt;deep learning&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;}}&lt;/ins&gt;.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;A common heuristic is to try several values on a logarithmic scale (e.g. &amp;lt;math&amp;gt;10^{-1}, 10^{-2}, 10^{-3}&amp;lt;/math&amp;gt;) and pick the one that reduces the loss fastest without instability.&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;A common heuristic is to try several values on a logarithmic scale (e.g. &amp;lt;math&amp;gt;10^{-1}, 10^{-2}, 10^{-3}&amp;lt;/math&amp;gt;) and pick the one that reduces the loss fastest without instability.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l75&quot;&gt;Line 75:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 75:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Several important modifications address limitations of vanilla gradient descent:&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Several important modifications address limitations of vanilla gradient descent:&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;* &amp;#039;&amp;#039;&amp;#039;&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;Momentum&lt;/del&gt;&amp;#039;&amp;#039;&amp;#039; — accumulates a velocity vector from past gradients, helping to accelerate convergence in ravine-like landscapes.&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;* &amp;#039;&amp;#039;&amp;#039;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;{{Term|momentum}}&lt;/ins&gt;&amp;#039;&amp;#039;&amp;#039; — accumulates a velocity vector from past gradients, helping to accelerate &lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;{{Term|&lt;/ins&gt;convergence&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;}} &lt;/ins&gt;in ravine-like landscapes.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;* &amp;#039;&amp;#039;&amp;#039;Nesterov accelerated gradient&amp;#039;&amp;#039;&amp;#039; — a momentum variant that evaluates the gradient at a look-ahead position, yielding better theoretical convergence rates.&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;* &amp;#039;&amp;#039;&amp;#039;Nesterov accelerated gradient&amp;#039;&amp;#039;&amp;#039; — a &lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;{{Term|&lt;/ins&gt;momentum&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;}} &lt;/ins&gt;variant that evaluates the gradient at a look-ahead position, yielding better theoretical &lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;{{Term|&lt;/ins&gt;convergence&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;}} &lt;/ins&gt;rates.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;* &amp;#039;&amp;#039;&amp;#039;Adaptive methods&amp;#039;&amp;#039;&amp;#039; (&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;Adagrad&lt;/del&gt;, RMSProp, &lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;Adam&lt;/del&gt;) — maintain per-parameter learning rates that adapt based on the history of gradients.&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;* &amp;#039;&amp;#039;&amp;#039;Adaptive methods&amp;#039;&amp;#039;&amp;#039; (&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;{{Term|adagrad}}&lt;/ins&gt;, RMSProp, &lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;{{Term|adam}}&lt;/ins&gt;) — maintain per-parameter &lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;{{Term|learning rate|&lt;/ins&gt;learning rates&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;}} &lt;/ins&gt;that adapt based on the history of gradients.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;* &amp;#039;&amp;#039;&amp;#039;Second-order methods&amp;#039;&amp;#039;&amp;#039; — algorithms like Newton&amp;#039;s method and L-BFGS use curvature information (the Hessian or its approximation) for faster convergence, but are often too expensive for large-scale problems.&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;* &amp;#039;&amp;#039;&amp;#039;Second-order methods&amp;#039;&amp;#039;&amp;#039; — algorithms like Newton&amp;#039;s method and L-BFGS use curvature information (the Hessian or its approximation) for faster &lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;{{Term|&lt;/ins&gt;convergence&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;}}&lt;/ins&gt;, but are often too expensive for large-scale problems.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;== Practical tips ==&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;== Practical tips ==&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;* &amp;#039;&amp;#039;&amp;#039;Feature scaling&amp;#039;&amp;#039;&amp;#039; — normalising input features so they have similar ranges dramatically improves convergence, because the loss surface becomes more isotropic.&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;* &amp;#039;&amp;#039;&amp;#039;Feature scaling&amp;#039;&amp;#039;&amp;#039; — normalising input features so they have similar ranges dramatically improves &lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;{{Term|&lt;/ins&gt;convergence&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;}}&lt;/ins&gt;, because the loss surface becomes more isotropic.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;* &amp;#039;&amp;#039;&amp;#039;&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;Gradient &lt;/del&gt;clipping&amp;#039;&amp;#039;&amp;#039; — capping the norm of the gradient prevents excessively large updates.&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;* &amp;#039;&amp;#039;&amp;#039;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;{{Term|gradient &lt;/ins&gt;clipping&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;}}&lt;/ins&gt;&amp;#039;&amp;#039;&amp;#039; — capping the norm of the gradient prevents excessively large updates.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;* &amp;#039;&amp;#039;&amp;#039;Random initialisation&amp;#039;&amp;#039;&amp;#039; — starting from a reasonable random initialisation (e.g. Xavier or He initialisation for neural networks) avoids symmetry-breaking issues.&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;* &amp;#039;&amp;#039;&amp;#039;Random initialisation&amp;#039;&amp;#039;&amp;#039; — starting from a reasonable random initialisation (e.g. Xavier or He initialisation for neural networks) avoids symmetry-breaking issues.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;* &amp;#039;&amp;#039;&amp;#039;Monitoring the loss curve&amp;#039;&amp;#039;&amp;#039; — plotting the training loss over iterations is the simplest diagnostic: a smoothly decreasing curve indicates healthy training; oscillations suggest the learning rate is too high.&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;* &amp;#039;&amp;#039;&amp;#039;Monitoring the loss curve&amp;#039;&amp;#039;&amp;#039; — plotting the training loss over iterations is the simplest diagnostic: a smoothly decreasing curve indicates healthy training; oscillations suggest the &lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;{{Term|&lt;/ins&gt;learning rate&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;}} &lt;/ins&gt;is too high.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;== Applications ==&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;== Applications ==&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;</summary>
		<author><name>FuzzyBot</name></author>
	</entry>
	<entry>
		<id>https://marovi.ai/index.php?title=Gradient_Descent/en&amp;diff=14709&amp;oldid=prev</id>
		<title>FuzzyBot: Updating to match new version of source page</title>
		<link rel="alternate" type="text/html" href="https://marovi.ai/index.php?title=Gradient_Descent/en&amp;diff=14709&amp;oldid=prev"/>
		<updated>2026-04-27T22:01:43Z</updated>

		<summary type="html">&lt;p&gt;Updating to match new version of source page&lt;/p&gt;
&lt;a href=&quot;https://marovi.ai/index.php?title=Gradient_Descent/en&amp;amp;diff=14709&amp;amp;oldid=13180&quot;&gt;Show changes&lt;/a&gt;</summary>
		<author><name>FuzzyBot</name></author>
	</entry>
	<entry>
		<id>https://marovi.ai/index.php?title=Gradient_Descent/en&amp;diff=13180&amp;oldid=prev</id>
		<title>FuzzyBot: Updating to match new version of source page</title>
		<link rel="alternate" type="text/html" href="https://marovi.ai/index.php?title=Gradient_Descent/en&amp;diff=13180&amp;oldid=prev"/>
		<updated>2026-04-27T19:42:39Z</updated>

		<summary type="html">&lt;p&gt;Updating to match new version of source page&lt;/p&gt;
&lt;a href=&quot;https://marovi.ai/index.php?title=Gradient_Descent/en&amp;amp;diff=13180&amp;amp;oldid=4429&quot;&gt;Show changes&lt;/a&gt;</summary>
		<author><name>FuzzyBot</name></author>
	</entry>
	<entry>
		<id>https://marovi.ai/index.php?title=Gradient_Descent/en&amp;diff=4429&amp;oldid=prev</id>
		<title>FuzzyBot: Updating to match new version of source page</title>
		<link rel="alternate" type="text/html" href="https://marovi.ai/index.php?title=Gradient_Descent/en&amp;diff=4429&amp;oldid=prev"/>
		<updated>2026-04-27T02:37:49Z</updated>

		<summary type="html">&lt;p&gt;Updating to match new version of source page&lt;/p&gt;
&lt;table style=&quot;background-color: #fff; color: #202122;&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;Revision as of 02:37, 27 April 2026&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l1&quot;&gt;Line 1:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 1:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&amp;lt;languages /&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&amp;lt;languages /&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;{{LanguageBar | page = Gradient Descent}}&lt;/del&gt;&lt;/div&gt;&lt;/td&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-added&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;{{ArticleInfobox | topic_area = Optimization | difficulty = Introductory | prerequisites = }}&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;{{ArticleInfobox | topic_area = Optimization | difficulty = Introductory | prerequisites = }}&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;{{ContentMeta | generated_by = claude-opus | model_used = claude-opus-4-6 | generated_date = 2026-03-13}}&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;{{ContentMeta | generated_by = claude-opus | model_used = claude-opus-4-6 | generated_date = 2026-03-13}}&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;</summary>
		<author><name>FuzzyBot</name></author>
	</entry>
	<entry>
		<id>https://marovi.ai/index.php?title=Gradient_Descent/en&amp;diff=2681&amp;oldid=prev</id>
		<title>FuzzyBot: Updating to match new version of source page</title>
		<link rel="alternate" type="text/html" href="https://marovi.ai/index.php?title=Gradient_Descent/en&amp;diff=2681&amp;oldid=prev"/>
		<updated>2026-04-27T00:30:40Z</updated>

		<summary type="html">&lt;p&gt;Updating to match new version of source page&lt;/p&gt;
&lt;p&gt;&lt;b&gt;New page&lt;/b&gt;&lt;/p&gt;&lt;div&gt;&amp;lt;languages /&amp;gt;&lt;br /&gt;
{{LanguageBar | page = Gradient Descent}}&lt;br /&gt;
{{ArticleInfobox | topic_area = Optimization | difficulty = Introductory | prerequisites = }}&lt;br /&gt;
{{ContentMeta | generated_by = claude-opus | model_used = claude-opus-4-6 | generated_date = 2026-03-13}}&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;Gradient descent&amp;#039;&amp;#039;&amp;#039; is a first-order iterative optimisation algorithm for finding a local minimum of a differentiable function. It is the foundation of nearly all modern machine-learning training procedures, from simple linear regression to billion-parameter deep neural networks.&lt;br /&gt;
&lt;br /&gt;
== Intuition ==&lt;br /&gt;
&lt;br /&gt;
Imagine standing on a mountainside in thick fog. You cannot see the valley floor, but you can feel the slope beneath your feet. The most natural strategy is to take a step in the steepest downhill direction, then reassess. Gradient descent formalises precisely this idea: at each step, the algorithm computes the direction of steepest increase of the function (the &amp;#039;&amp;#039;&amp;#039;gradient&amp;#039;&amp;#039;&amp;#039;) and moves in the opposite direction.&lt;br /&gt;
&lt;br /&gt;
The size of each step is controlled by a scalar called the &amp;#039;&amp;#039;&amp;#039;learning rate&amp;#039;&amp;#039;&amp;#039; (often denoted &amp;lt;math&amp;gt;\eta&amp;lt;/math&amp;gt;). A large learning rate covers ground quickly but risks overshooting the minimum; a small learning rate converges more reliably but may take prohibitively many steps.&lt;br /&gt;
&lt;br /&gt;
== Mathematical formulation ==&lt;br /&gt;
&lt;br /&gt;
Given a differentiable objective function &amp;lt;math&amp;gt;f:\mathbb{R}^n \to \mathbb{R}&amp;lt;/math&amp;gt;, gradient descent generates a sequence of iterates by the &amp;#039;&amp;#039;&amp;#039;update rule&amp;#039;&amp;#039;&amp;#039;:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\theta_{t+1} = \theta_t - \eta \, \nabla f(\theta_t)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;math&amp;gt;\nabla f(\theta_t)&amp;lt;/math&amp;gt; is the gradient vector evaluated at the current point &amp;lt;math&amp;gt;\theta_t&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;\eta &amp;gt; 0&amp;lt;/math&amp;gt; is the learning rate.&lt;br /&gt;
&lt;br /&gt;
In the one-dimensional case this simplifies to:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\theta_{t+1} = \theta_t - \eta \, f&amp;#039;(\theta_t)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The gradient &amp;lt;math&amp;gt;\nabla f&amp;lt;/math&amp;gt; points in the direction of steepest ascent, so subtracting it moves the iterate downhill.&lt;br /&gt;
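In Python, the update rule above can be sketched as follows (a minimal illustration, not part of the original page; the quadratic objective and all names are chosen here for demonstration):

```python
# Sketch of the update rule theta_{t+1} = theta_t - eta * grad_f(theta_t),
# applied to the toy quadratic f(theta) = (theta - 3)^2, whose gradient
# is grad_f(theta) = 2 * (theta - 3) and whose minimiser is theta = 3.
def gradient_descent(grad_f, theta0, eta=0.1, steps=100):
    theta = theta0
    for _ in range(steps):
        theta = theta - eta * grad_f(theta)  # step opposite the gradient
    return theta

grad_f = lambda theta: 2.0 * (theta - 3.0)
theta_star = gradient_descent(grad_f, theta0=0.0)
print(theta_star)  # approaches the minimiser 3.0
```

With this learning rate the error to the minimiser shrinks by a constant factor each step, matching the linear-rate behaviour described later for strongly convex functions.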
&lt;br /&gt;
== Batch, stochastic, and mini-batch variants ==&lt;br /&gt;
&lt;br /&gt;
When the objective has the form of an average over data points,&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;f(\theta) = \frac{1}{N}\sum_{i=1}^{N} \ell(\theta;\, x_i, y_i)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
three common strategies differ in how much data is used to estimate the gradient:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! Variant !! Gradient computed over !! Per-step cost !! Gradient noise&lt;br /&gt;
|-&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;Batch (full) gradient descent&amp;#039;&amp;#039;&amp;#039; || All &amp;lt;math&amp;gt;N&amp;lt;/math&amp;gt; samples || High || None&lt;br /&gt;
|-&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;Stochastic gradient descent (SGD)&amp;#039;&amp;#039;&amp;#039; || 1 random sample || Low || High&lt;br /&gt;
|-&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;Mini-batch gradient descent&amp;#039;&amp;#039;&amp;#039; || &amp;lt;math&amp;gt;B&amp;lt;/math&amp;gt; random samples (&amp;lt;math&amp;gt;1 &amp;lt; B &amp;lt; N&amp;lt;/math&amp;gt;) || Medium || Medium&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Full batch gradient descent computes the exact gradient and therefore follows a smooth trajectory toward the minimum. [[Stochastic Gradient Descent|Stochastic gradient descent]] uses a single sample to estimate the gradient, drastically reducing computation per step at the cost of a noisier trajectory. Mini-batch gradient descent strikes a balance and is the most common choice in practice, with typical batch sizes between 32 and 512.&lt;br /&gt;
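The mini-batch strategy from the table can be sketched as below (a hypothetical least-squares example; the data, batch size, and step count are illustrative assumptions, not from the original page):

```python
import numpy as np

# Mini-batch gradient descent for least squares:
# f(theta) = (1/N) * sum_i (x_i . theta - y_i)^2, estimated on B random samples.
rng = np.random.default_rng(0)
N, d, B = 1000, 5, 32
X = rng.normal(size=(N, d))
true_theta = rng.normal(size=d)
y = X @ true_theta                             # noiseless targets for the demo

theta = np.zeros(d)
eta = 0.05
for step in range(2000):
    idx = rng.integers(0, N, size=B)           # draw a random mini-batch
    Xb, yb = X[idx], y[idx]
    grad = (2.0 / B) * Xb.T @ (Xb @ theta - yb)  # gradient of the batch loss
    theta -= eta * grad
print(np.linalg.norm(theta - true_theta))      # residual error, small after training
```

Setting B = N recovers full-batch gradient descent and B = 1 recovers SGD; only the size of `idx` changes.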
&lt;br /&gt;
== Convergence ==&lt;br /&gt;
&lt;br /&gt;
=== Convex functions ===&lt;br /&gt;
&lt;br /&gt;
For a convex function with Lipschitz-continuous gradients (constant &amp;lt;math&amp;gt;L&amp;lt;/math&amp;gt;), gradient descent with a fixed learning rate &amp;lt;math&amp;gt;\eta \leq 1/L&amp;lt;/math&amp;gt; converges at a rate of &amp;lt;math&amp;gt;O(1/t)&amp;lt;/math&amp;gt;. If the function is additionally &amp;#039;&amp;#039;&amp;#039;strongly convex&amp;#039;&amp;#039;&amp;#039; with parameter &amp;lt;math&amp;gt;\mu &amp;gt; 0&amp;lt;/math&amp;gt;, convergence accelerates to a linear (exponential) rate:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;f(\theta_t) - f(\theta^*) \leq \left(1 - \frac{\mu}{L}\right)^t \bigl(f(\theta_0) - f(\theta^*)\bigr)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The ratio &amp;lt;math&amp;gt;\kappa = L / \mu&amp;lt;/math&amp;gt; is called the &amp;#039;&amp;#039;&amp;#039;condition number&amp;#039;&amp;#039;&amp;#039; and governs how quickly the algorithm converges. Ill-conditioned problems (large &amp;lt;math&amp;gt;\kappa&amp;lt;/math&amp;gt;) converge slowly.&lt;br /&gt;
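&lt;br /&gt;
The role of the condition number can be seen on a two-dimensional quadratic whose Hessian eigenvalues are &amp;lt;math&amp;gt;L&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;\mu&amp;lt;/math&amp;gt;; the constants below are illustrative:&lt;br /&gt;
&lt;br /&gt;
```python
# f(x, y) = 0.5 * (L*x^2 + mu*y^2) has Hessian eigenvalues L and mu,
# so kappa = L/mu = 10. With eta = 1/L, the error along the stiff
# x-direction vanishes in one step, while the flat y-direction decays
# only by the factor (1 - mu/L) = 0.9 per step.

L_const, mu = 10.0, 1.0   # illustrative curvatures
eta = 1.0 / L_const

x, y = 1.0, 1.0
for _ in range(50):
    x -= eta * L_const * x   # partial derivative in x is L*x
    y -= eta * mu * y        # partial derivative in y is mu*y

print(x, y)  # x is 0; y decays as 0.9**t
```
&lt;br /&gt;
The slow direction dominates: the overall error shrinks at the rate &amp;lt;math&amp;gt;(1 - 1/\kappa)^t&amp;lt;/math&amp;gt;, which is exactly why large condition numbers mean slow convergence.&lt;br /&gt;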
&lt;br /&gt;
=== Non-convex functions ===&lt;br /&gt;
&lt;br /&gt;
Most deep-learning objectives are non-convex. In this setting gradient descent is only guaranteed to converge to a stationary point (where &amp;lt;math&amp;gt;\nabla f = 0&amp;lt;/math&amp;gt;), which could be a local minimum, saddle point, or even a local maximum. In practice, saddle points are more problematic than local minima in high-dimensional spaces.&lt;br /&gt;
&lt;br /&gt;
== Learning rate selection ==&lt;br /&gt;
&lt;br /&gt;
Choosing the learning rate is one of the most important practical decisions:&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Too large&amp;#039;&amp;#039;&amp;#039; — the iterates oscillate or diverge.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Too small&amp;#039;&amp;#039;&amp;#039; — convergence is unacceptably slow.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Learning rate schedules&amp;#039;&amp;#039;&amp;#039; — many practitioners start with a larger rate and reduce it over time (step decay, exponential decay, cosine annealing).&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Line search&amp;#039;&amp;#039;&amp;#039; — classical numerical methods choose &amp;lt;math&amp;gt;\eta&amp;lt;/math&amp;gt; at each step to satisfy conditions such as the Wolfe or Armijo conditions, though this is rare in deep learning.&lt;br /&gt;
&lt;br /&gt;
A common heuristic is to try several values on a logarithmic scale (e.g. &amp;lt;math&amp;gt;10^{-1}, 10^{-2}, 10^{-3}&amp;lt;/math&amp;gt;) and pick the one that reduces the loss fastest without instability.&lt;br /&gt;
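&lt;br /&gt;
The sweep can be sketched on a toy objective; the candidate rates and step budget below are illustrative:&lt;br /&gt;
&lt;br /&gt;
```python
# Try candidate learning rates on a log scale and keep the one with the
# lowest loss after a fixed step budget, on f(theta) = theta^2.

def loss_after(eta, steps=20, theta0=1.0):
    theta = theta0
    for _ in range(steps):
        theta -= eta * 2.0 * theta  # gradient of theta^2 is 2*theta
    return theta ** 2

candidates = [1e-1, 1e-2, 1e-3]   # illustrative log-scale grid
results = {eta: loss_after(eta) for eta in candidates}
best = min(results, key=results.get)
print(results, best)  # the largest stable rate, 0.1, wins here
```
&lt;br /&gt;
In practice the same loop would run a few epochs of training per candidate and also discard any rate whose loss diverges or oscillates.&lt;br /&gt;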
&lt;br /&gt;
== Extensions and improvements ==&lt;br /&gt;
&lt;br /&gt;
Several important modifications address limitations of vanilla gradient descent:&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Momentum&amp;#039;&amp;#039;&amp;#039; — accumulates a velocity vector from past gradients, helping to accelerate convergence in ravine-like landscapes.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Nesterov accelerated gradient&amp;#039;&amp;#039;&amp;#039; — a momentum variant that evaluates the gradient at a look-ahead position, yielding better theoretical convergence rates.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Adaptive methods&amp;#039;&amp;#039;&amp;#039; (Adagrad, RMSProp, Adam) — maintain per-parameter learning rates that adapt based on the history of gradients.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Second-order methods&amp;#039;&amp;#039;&amp;#039; — algorithms like Newton&amp;#039;s method and L-BFGS use curvature information (the Hessian or its approximation) for faster convergence, but are often too expensive for large-scale problems.&lt;br /&gt;
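&lt;br /&gt;
As one example from the list above, the momentum update keeps a velocity that accumulates past gradients. This is one common formulation (the heavy-ball form); the objective and hyperparameters are illustrative:&lt;br /&gt;
&lt;br /&gt;
```python
# Heavy-ball momentum: v <- beta*v + grad(theta); theta <- theta - eta*v.
# Past gradients that agree in direction build up speed.

def grad(theta):
    return 2.0 * (theta - 3.0)  # gradient of (theta - 3)^2

theta, v = 0.0, 0.0
eta, beta = 0.05, 0.9   # illustrative hyperparameters
for _ in range(500):
    v = beta * v + grad(theta)
    theta = theta - eta * v

print(theta)  # close to the minimiser 3.0
```
&lt;br /&gt;
Setting &amp;lt;math&amp;gt;\beta = 0&amp;lt;/math&amp;gt; recovers plain gradient descent; typical values in deep learning are around 0.9.&lt;br /&gt;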
&lt;br /&gt;
== Practical tips ==&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Feature scaling&amp;#039;&amp;#039;&amp;#039; — normalising input features so they have similar ranges dramatically improves convergence, because the loss surface becomes more isotropic.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Gradient clipping&amp;#039;&amp;#039;&amp;#039; — capping the norm of the gradient prevents excessively large updates.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Random initialisation&amp;#039;&amp;#039;&amp;#039; — starting from a reasonable random initialisation (e.g. Xavier or He initialisation for neural networks) breaks the symmetry between parameters; if all weights start identical, they receive identical gradients and never differentiate.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Monitoring the loss curve&amp;#039;&amp;#039;&amp;#039; — plotting the training loss over iterations is the simplest diagnostic: a smoothly decreasing curve indicates healthy training; oscillations suggest the learning rate is too high.&lt;br /&gt;
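&lt;br /&gt;
Gradient clipping by global norm, for instance, rescales the gradient whenever its norm exceeds a threshold; the threshold here is illustrative:&lt;br /&gt;
&lt;br /&gt;
```python
import math

# Clip a gradient vector g so that ||g|| never exceeds max_norm:
# if the norm is too large, rescale g to have norm exactly max_norm.

def clip_by_norm(g, max_norm):
    norm = math.sqrt(sum(x * x for x in g))
    if norm > max_norm:
        scale = max_norm / norm
        return [x * scale for x in g]
    return g

g = [30.0, 40.0]                 # norm 50
clipped = clip_by_norm(g, 5.0)   # rescaled to norm 5
print(clipped)
```
&lt;br /&gt;
Clipping preserves the gradient direction and only shrinks its magnitude, which is why it stabilises training without biasing the descent direction.&lt;br /&gt;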
&lt;br /&gt;
== Applications ==&lt;br /&gt;
&lt;br /&gt;
Gradient descent and its variants are used throughout science and engineering:&lt;br /&gt;
&lt;br /&gt;
* Training machine-learning models (linear models, neural networks, support vector machines)&lt;br /&gt;
* Signal processing and control systems&lt;br /&gt;
* Inverse problems in physics and imaging&lt;br /&gt;
* Operations research and logistics optimisation&lt;br /&gt;
* Economics and game-theoretic equilibrium computation&lt;br /&gt;
&lt;br /&gt;
== See also ==&lt;br /&gt;
&lt;br /&gt;
* [[Stochastic Gradient Descent]]&lt;br /&gt;
* [[Backpropagation]]&lt;br /&gt;
* [[Loss Functions]]&lt;br /&gt;
* [[Neural Networks]]&lt;br /&gt;
* [[Overfitting and Regularization]]&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
* Cauchy, A. (1847). &amp;quot;Méthode générale pour la résolution des systèmes d&amp;#039;équations simultanées&amp;quot;. &amp;#039;&amp;#039;Comptes Rendus de l&amp;#039;Académie des Sciences&amp;#039;&amp;#039;.&lt;br /&gt;
* Boyd, S. and Vandenberghe, L. (2004). &amp;#039;&amp;#039;Convex Optimization&amp;#039;&amp;#039;. Cambridge University Press.&lt;br /&gt;
* Ruder, S. (2016). &amp;quot;An overview of gradient descent optimization algorithms&amp;quot;. &amp;#039;&amp;#039;arXiv:1609.04747&amp;#039;&amp;#039;.&lt;br /&gt;
* Goodfellow, I., Bengio, Y. and Courville, A. (2016). &amp;#039;&amp;#039;Deep Learning&amp;#039;&amp;#039;, Chapter 8. MIT Press.&lt;br /&gt;
&lt;br /&gt;
[[Category:Optimization]]&lt;br /&gt;
[[Category:Introductory]]&lt;/div&gt;</summary>
		<author><name>FuzzyBot</name></author>
	</entry>
</feed>