<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://marovi.ai/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=DeployBot</id>
	<title>Marovi AI - User contributions [en]</title>
	<link rel="self" type="application/atom+xml" href="https://marovi.ai/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=DeployBot"/>
	<link rel="alternate" type="text/html" href="https://marovi.ai/Special:Contributions/DeployBot"/>
	<updated>2026-04-24T11:49:32Z</updated>
	<subtitle>User contributions</subtitle>
	<generator>MediaWiki 1.39.1</generator>
	<entry>
		<id>https://marovi.ai/index.php?title=Template:LanguageBar&amp;diff=2169</id>
		<title>Template:LanguageBar</title>
		<link rel="alternate" type="text/html" href="https://marovi.ai/index.php?title=Template:LanguageBar&amp;diff=2169"/>
		<updated>2026-04-24T07:09:32Z</updated>

		<summary type="html">&lt;p&gt;DeployBot: [deploy-bot] Improve LanguageBar: show all languages, dim the ones without translation (v1.2.0)&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;noinclude&amp;gt;&lt;br /&gt;
Language navigation bar. Usage:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
{{LanguageBar | page = Gradient Descent}}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Marks the current language bold; renders links for available translations&lt;br /&gt;
and plain text (dimmed) for languages without a translation.&lt;br /&gt;
&amp;lt;/noinclude&amp;gt;&amp;lt;includeonly&amp;gt;&amp;lt;div style=&amp;quot;background: #f0f0f0; border: 1px solid #ddd; padding: 4px 8px; margin-bottom: 1em; font-size: 90%;&amp;quot;&amp;gt;&#039;&#039;&#039;Languages:&#039;&#039;&#039; {{#ifeq:{{SUBPAGENAME}}|{{{page}}}|&#039;&#039;&#039;English&#039;&#039;&#039;|[[{{{page}}}|English]]}} &amp;amp;#124; {{#ifexist:{{{page}}}/es|{{#ifeq:{{FULLPAGENAME}}|{{{page}}}/es|&#039;&#039;&#039;Español&#039;&#039;&#039;|[[{{{page}}}/es|Español]]}}|&amp;lt;span style=&amp;quot;color:#888;&amp;quot;&amp;gt;Español&amp;lt;/span&amp;gt;}} &amp;amp;#124; {{#ifexist:{{{page}}}/zh|{{#ifeq:{{FULLPAGENAME}}|{{{page}}}/zh|&#039;&#039;&#039;中文&#039;&#039;&#039;|[[{{{page}}}/zh|中文]]}}|&amp;lt;span style=&amp;quot;color:#888;&amp;quot;&amp;gt;中文&amp;lt;/span&amp;gt;}}&amp;lt;/div&amp;gt;&amp;lt;/includeonly&amp;gt;&lt;/div&gt;</summary>
		<author><name>DeployBot</name></author>
	</entry>
	<entry>
		<id>https://marovi.ai/index.php?title=Template:RecentArticles&amp;diff=2168</id>
		<title>Template:RecentArticles</title>
		<link rel="alternate" type="text/html" href="https://marovi.ai/index.php?title=Template:RecentArticles&amp;diff=2168"/>
		<updated>2026-04-24T07:09:17Z</updated>

		<summary type="html">&lt;p&gt;DeployBot: [deploy-bot] Deploy from CI (8c92aeb)&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;* [[Word Embeddings]]&lt;br /&gt;
* [[Transfer Learning]]&lt;br /&gt;
* [[Stochastic Gradient Descent]]&lt;br /&gt;
* [[Softmax Function]]&lt;br /&gt;
* [[Recurrent Neural Networks]]&lt;br /&gt;
* [[Overfitting and Regularization]]&lt;br /&gt;
* [[Neural Networks]]&lt;br /&gt;
* [[Loss Functions]]&lt;br /&gt;
* [[Linear Regression]]&lt;br /&gt;
* [[Gradient Descent]]&lt;/div&gt;</summary>
		<author><name>DeployBot</name></author>
	</entry>
	<entry>
		<id>https://marovi.ai/index.php?title=Neural_Networks/zh&amp;diff=2167</id>
		<title>Neural Networks/zh</title>
		<link rel="alternate" type="text/html" href="https://marovi.ai/index.php?title=Neural_Networks/zh&amp;diff=2167"/>
		<updated>2026-04-24T07:09:02Z</updated>

		<summary type="html">&lt;p&gt;DeployBot: [deploy-bot] Deploy from CI (8c92aeb)&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{LanguageBar | page = Neural Networks}}&lt;br /&gt;
{{ArticleInfobox | topic_area = Deep Learning | difficulty = Introductory | prerequisites = }}&lt;br /&gt;
{{ContentMeta | generated_by = claude-opus | model_used = claude-opus-4-6 | generated_date = 2026-03-13}}&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;神经网络（Neural Networks）&#039;&#039;&#039;（也称为&#039;&#039;&#039;人工神经网络&#039;&#039;&#039;，即ANN）是受生物神经系统结构启发的计算模型。它们由称为&#039;&#039;&#039;神经元（Neuron）&#039;&#039;&#039;（或节点）的简单处理单元组成的互连层构成，是现代深度学习（Deep Learning）的基础。&lt;br /&gt;
&lt;br /&gt;
== 生物学启发 ==&lt;br /&gt;
&lt;br /&gt;
生物神经元通过其&#039;&#039;&#039;树突（Dendrite）&#039;&#039;&#039;接收电信号，在&#039;&#039;&#039;细胞体&#039;&#039;&#039;中进行整合，如果综合信号超过阈值，则沿其&#039;&#039;&#039;轴突（Axon）&#039;&#039;&#039;向下游神经元发送输出信号。人工神经网络对这一过程进行了抽象：每个人工神经元计算其输入的加权和，加上一个偏置项，然后通过一个非线性的&#039;&#039;&#039;激活函数（Activation Function）&#039;&#039;&#039;传递结果。&lt;br /&gt;
&lt;br /&gt;
虽然与生物学的类比激发了早期研究，但现代神经网络最好被理解为灵活的参数化函数逼近器，而非忠实的大脑模拟。&lt;br /&gt;
&lt;br /&gt;
== 感知机 ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;感知机（Perceptron）&#039;&#039;&#039;由Frank Rosenblatt于1958年提出，是最简单的神经网络。它计算：&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;y = \sigma\!\left(\sum_{i=1}^{n} w_i x_i + b\right) = \sigma(\mathbf{w}^\top \mathbf{x} + b)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
其中 &amp;lt;math&amp;gt;\mathbf{x}&amp;lt;/math&amp;gt; 是输入向量，&amp;lt;math&amp;gt;\mathbf{w}&amp;lt;/math&amp;gt; 是可学习的权重，&amp;lt;math&amp;gt;b&amp;lt;/math&amp;gt; 是偏置，&amp;lt;math&amp;gt;\sigma&amp;lt;/math&amp;gt; 是一个阶跃函数，当其输入为正时输出1，否则输出0。感知机可以学习任何线性可分函数，但众所周知无法表示异或（XOR）函数——这一局限性使神经网络研究停滞了十多年。&lt;br /&gt;
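&lt;br /&gt;
上述计算可以用几行Python做一个最小示意（阶跃函数取输入为正时输出1；权重与偏置为手工选取的假设值，并非原文内容）：&lt;br /&gt;
&lt;br /&gt;
```python
def perceptron(x, w, b):
    # 感知机：输入加权求和并加上偏置，再通过阶跃函数
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if z > 0 else 0

# 用假设的权重实现线性可分的AND函数
w, b = [1.0, 1.0], -1.5
print([perceptron(x, w, b) for x in [(0, 0), (0, 1), (1, 0), (1, 1)]])  # [0, 0, 0, 1]
```
&lt;br /&gt;
而正如文中所述，不存在任何一组权重能让单个感知机实现XOR。&lt;br /&gt;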
&lt;br /&gt;
== 前馈网络 ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;前馈神经网络（Feedforward Neural Network）&#039;&#039;&#039;（也称为&#039;&#039;&#039;多层感知机&#039;&#039;&#039;，即MLP）将多层神经元堆叠在一起。信息单向流动——从&#039;&#039;&#039;输入层&#039;&#039;&#039;经过一个或多个&#039;&#039;&#039;隐藏层（Hidden Layer）&#039;&#039;&#039;到达&#039;&#039;&#039;输出层&#039;&#039;&#039;。&lt;br /&gt;
&lt;br /&gt;
对于具有一个隐藏层的网络，计算过程为：&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\mathbf{h} = g(\mathbf{W}_1 \mathbf{x} + \mathbf{b}_1)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\mathbf{y} = f(\mathbf{W}_2 \mathbf{h} + \mathbf{b}_2)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
其中 &amp;lt;math&amp;gt;g&amp;lt;/math&amp;gt; 和 &amp;lt;math&amp;gt;f&amp;lt;/math&amp;gt; 是激活函数，&amp;lt;math&amp;gt;\mathbf{W}_1, \mathbf{W}_2&amp;lt;/math&amp;gt; 是权重矩阵，&amp;lt;math&amp;gt;\mathbf{b}_1, \mathbf{b}_2&amp;lt;/math&amp;gt; 是偏置向量。隐藏层使网络能够学习单个感知机无法捕捉的非线性关系。&lt;br /&gt;
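&lt;br /&gt;
上面两步前向计算可以直接翻译为NumPy代码做一个示意（权重取随机假设值，g 取 tanh，f 取恒等函数，均非原文规定）：&lt;br /&gt;
&lt;br /&gt;
```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)  # 隐藏层：3维输入 -> 4个隐藏单元
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)  # 输出层：4个隐藏单元 -> 2维输出

def forward(x):
    h = np.tanh(W1 @ x + b1)  # h = g(W1 x + b1)，g 取 tanh
    y = W2 @ h + b2           # y = f(W2 h + b2)，f 取恒等函数
    return y

print(forward(np.array([1.0, -0.5, 0.3])).shape)  # (2,)
```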
&lt;br /&gt;
具有多个隐藏层的网络称为&#039;&#039;&#039;深度&#039;&#039;&#039;神经网络，训练它们是&#039;&#039;&#039;深度学习&#039;&#039;&#039;的研究主题。&lt;br /&gt;
&lt;br /&gt;
== 激活函数 ==&lt;br /&gt;
&lt;br /&gt;
激活函数引入了非线性；没有它，多层网络将退化为单个线性变换。常见的选择包括：&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! 函数 !! 公式 !! 值域 !! 备注&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;Sigmoid&#039;&#039;&#039; || &amp;lt;math&amp;gt;\sigma(z) = \frac{1}{1+e^{-z}}&amp;lt;/math&amp;gt; || (0, 1) || 历史上广泛使用；存在梯度消失问题&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;Tanh&#039;&#039;&#039; || &amp;lt;math&amp;gt;\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}&amp;lt;/math&amp;gt; || (−1, 1) || 以零为中心；对于大输入仍会饱和&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;ReLU&#039;&#039;&#039; || &amp;lt;math&amp;gt;\max(0, z)&amp;lt;/math&amp;gt; || [0, ∞) || 现代网络的默认选择；可能导致&amp;quot;死亡神经元&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;Leaky ReLU&#039;&#039;&#039; || &amp;lt;math&amp;gt;\max(\alpha z, z)&amp;lt;/math&amp;gt;，其中 &amp;lt;math&amp;gt;\alpha &amp;gt; 0&amp;lt;/math&amp;gt; 较小 || (−∞, ∞) || 缓解了死亡神经元问题&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;Softmax&#039;&#039;&#039; || &amp;lt;math&amp;gt;\frac{e^{z_i}}{\sum_j e^{z_j}}&amp;lt;/math&amp;gt; || (0, 1) || 用于多分类任务的输出层&lt;br /&gt;
|}&lt;br /&gt;
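&lt;br /&gt;
表中各激活函数可按定义直接实现（纯Python示意；softmax 中减去最大值只是常见的数值稳定技巧，并非原公式的一部分）：&lt;br /&gt;
&lt;br /&gt;
```python
import math

def sigmoid(z): return 1.0 / (1.0 + math.exp(-z))
def relu(z): return max(0.0, z)
def leaky_relu(z, alpha=0.01): return max(alpha * z, z)

def softmax(zs):
    m = max(zs)  # 减去最大值以避免 exp 上溢
    exps = [math.exp(z - m) for z in zs]
    s = sum(exps)
    return [e / s for e in exps]

print(sigmoid(0.0))           # 0.5
print(relu(-2.0), relu(3.0))  # 0.0 3.0
print(sum(softmax([1.0, 2.0, 3.0])))  # 各项之和为1（浮点误差内）
```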
&lt;br /&gt;
== 万能近似定理 ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;万能近似定理（Universal Approximation Theorem）&#039;&#039;&#039;（Cybenko 1989，Hornik 1991）指出，一个包含有限个神经元的单隐藏层前馈网络，在激活函数满足温和条件（例如非常数、有界、连续）的前提下，可以在 &amp;lt;math&amp;gt;\mathbb{R}^n&amp;lt;/math&amp;gt; 的紧子集上以任意精度逼近任何连续函数。&lt;br /&gt;
&lt;br /&gt;
该定理保证了良好近似的&#039;&#039;存在性&#039;&#039;，但没有说明如何&#039;&#039;找到&#039;&#039;它——在实践中，训练具有多层的深度网络比使用单个宽层要有效得多。&lt;br /&gt;
&lt;br /&gt;
== 训练概述 ==&lt;br /&gt;
&lt;br /&gt;
训练神经网络包括以下步骤：&lt;br /&gt;
&lt;br /&gt;
# &#039;&#039;&#039;定义损失函数（Loss Function）&#039;&#039;&#039; — 衡量网络预测与真实目标之间差距的指标（参见[[Loss Functions]]）。&lt;br /&gt;
# &#039;&#039;&#039;前向传播&#039;&#039;&#039; — 逐层传播数值，计算给定输入的网络输出。&lt;br /&gt;
# &#039;&#039;&#039;反向传播（Backpropagation）&#039;&#039;&#039; — 通过在网络中反向应用链式法则，计算损失相对于每个权重的梯度（参见[[Backpropagation]]）。&lt;br /&gt;
# &#039;&#039;&#039;参数更新&#039;&#039;&#039; — 使用[[Gradient Descent|梯度下降]]等优化算法调整权重。&lt;br /&gt;
# &#039;&#039;&#039;迭代&#039;&#039;&#039; — 在训练数据上反复执行步骤2-4；对训练数据的一次完整遍历称为一个轮次（Epoch）。&lt;br /&gt;
&lt;br /&gt;
成功的训练还需要注意&#039;&#039;&#039;初始化&#039;&#039;&#039;（例如Xavier或He方案）、&#039;&#039;&#039;正则化（Regularization）&#039;&#039;&#039;（以防止[[Overfitting and Regularization|过拟合]]）以及&#039;&#039;&#039;超参数调优&#039;&#039;&#039;（学习率、批量大小、网络架构）。&lt;br /&gt;
&lt;br /&gt;
== 常见架构 ==&lt;br /&gt;
&lt;br /&gt;
除了基本的前馈网络，还发展出了几种专门的架构：&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;[[Convolutional Neural Networks|卷积神经网络]]&#039;&#039;&#039;（CNN）— 专为图像等网格结构数据设计，使用局部连接和权重共享。&lt;br /&gt;
* &#039;&#039;&#039;[[Recurrent Neural Networks|循环神经网络]]&#039;&#039;&#039;（RNN）— 专为序列数据设计，具有形成循环的连接以维持隐藏状态。&lt;br /&gt;
* &#039;&#039;&#039;Transformer&#039;&#039;&#039; — 基于注意力机制的架构，已在自然语言处理中占据主导地位，并越来越多地应用于视觉领域。&lt;br /&gt;
* &#039;&#039;&#039;自编码器（Autoencoder）&#039;&#039;&#039; — 训练重建其输入的网络，用于降维和生成建模。&lt;br /&gt;
* &#039;&#039;&#039;生成对抗网络（GAN）&#039;&#039;&#039; — 一对网络（生成器和判别器）通过竞争训练来生成逼真的数据。&lt;br /&gt;
&lt;br /&gt;
== 应用 ==&lt;br /&gt;
&lt;br /&gt;
神经网络被应用于广泛的领域：&lt;br /&gt;
&lt;br /&gt;
* 计算机视觉（图像分类、目标检测、语义分割）&lt;br /&gt;
* 自然语言处理（翻译、摘要、问答）&lt;br /&gt;
* 语音识别与合成&lt;br /&gt;
* 游戏博弈（AlphaGo、Atari智能体）&lt;br /&gt;
* 科学发现（蛋白质折叠、药物设计、天气预测）&lt;br /&gt;
* 自动驾驶与机器人&lt;br /&gt;
&lt;br /&gt;
== 参见 ==&lt;br /&gt;
&lt;br /&gt;
* [[Gradient Descent]]&lt;br /&gt;
* [[Backpropagation]]&lt;br /&gt;
* [[Loss Functions]]&lt;br /&gt;
* [[Convolutional Neural Networks]]&lt;br /&gt;
* [[Recurrent Neural Networks]]&lt;br /&gt;
* [[Overfitting and Regularization]]&lt;br /&gt;
&lt;br /&gt;
== 参考文献 ==&lt;br /&gt;
&lt;br /&gt;
* Rosenblatt, F. (1958). &amp;quot;The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain&amp;quot;. &#039;&#039;Psychological Review&#039;&#039;.&lt;br /&gt;
* Cybenko, G. (1989). &amp;quot;Approximation by Superpositions of a Sigmoidal Function&amp;quot;. &#039;&#039;Mathematics of Control, Signals, and Systems&#039;&#039;.&lt;br /&gt;
* Hornik, K. (1991). &amp;quot;Approximation Capabilities of Multilayer Feedforward Networks&amp;quot;. &#039;&#039;Neural Networks&#039;&#039;.&lt;br /&gt;
* LeCun, Y., Bengio, Y. and Hinton, G. (2015). &amp;quot;Deep learning&amp;quot;. &#039;&#039;Nature&#039;&#039;, 521, 436–444.&lt;br /&gt;
* Goodfellow, I., Bengio, Y. and Courville, A. (2016). &#039;&#039;Deep Learning&#039;&#039;. MIT Press.&lt;br /&gt;
&lt;br /&gt;
[[Category:Deep Learning]]&lt;br /&gt;
[[Category:Introductory]]&lt;br /&gt;
[[Category:Neural Networks]]&lt;/div&gt;</summary>
		<author><name>DeployBot</name></author>
	</entry>
	<entry>
		<id>https://marovi.ai/index.php?title=Gradient_Descent/zh&amp;diff=2166</id>
		<title>Gradient Descent/zh</title>
		<link rel="alternate" type="text/html" href="https://marovi.ai/index.php?title=Gradient_Descent/zh&amp;diff=2166"/>
		<updated>2026-04-24T07:09:02Z</updated>

		<summary type="html">&lt;p&gt;DeployBot: [deploy-bot] Deploy from CI (8c92aeb)&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{LanguageBar | page = Gradient Descent}}&lt;br /&gt;
{{ArticleInfobox | topic_area = Optimization | difficulty = Introductory | prerequisites = }}&lt;br /&gt;
{{ContentMeta | generated_by = claude-opus | model_used = claude-opus-4-6 | generated_date = 2026-03-13}}&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;梯度下降（Gradient Descent）&#039;&#039;&#039;是一种用于求解可微函数局部最小值的一阶迭代优化算法。它是几乎所有现代机器学习训练过程的基础，从简单的线性回归（Linear Regression）到拥有数十亿参数的深度神经网络。&lt;br /&gt;
&lt;br /&gt;
== 直觉理解 ==&lt;br /&gt;
&lt;br /&gt;
想象你站在浓雾弥漫的山坡上。你看不到谷底，但能感受到脚下的坡度。最自然的策略就是朝最陡峭的下坡方向迈出一步，然后重新评估。梯度下降正是将这一想法形式化：在每一步中，算法计算函数最陡上升方向（即&#039;&#039;&#039;梯度&#039;&#039;&#039;），然后朝相反方向移动。&lt;br /&gt;
&lt;br /&gt;
每步的大小由一个标量控制，称为&#039;&#039;&#039;学习率（Learning Rate）&#039;&#039;&#039;（通常记为 &amp;lt;math&amp;gt;\eta&amp;lt;/math&amp;gt;）。较大的学习率前进速度快，但有越过最小值的风险；较小的学习率收敛更可靠，但可能需要过多的步数。&lt;br /&gt;
&lt;br /&gt;
== 数学公式 ==&lt;br /&gt;
&lt;br /&gt;
给定一个可微的目标函数 &amp;lt;math&amp;gt;f:\mathbb{R}^n \to \mathbb{R}&amp;lt;/math&amp;gt;，梯度下降通过以下&#039;&#039;&#039;更新规则&#039;&#039;&#039;生成一系列迭代点：&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\theta_{t+1} = \theta_t - \eta \, \nabla f(\theta_t)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
其中 &amp;lt;math&amp;gt;\nabla f(\theta_t)&amp;lt;/math&amp;gt; 是在当前点 &amp;lt;math&amp;gt;\theta_t&amp;lt;/math&amp;gt; 处计算的梯度向量，&amp;lt;math&amp;gt;\eta &amp;gt; 0&amp;lt;/math&amp;gt; 是学习率。&lt;br /&gt;
&lt;br /&gt;
在一维情况下，公式简化为：&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\theta_{t+1} = \theta_t - \eta \, f&#039;(\theta_t)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
梯度 &amp;lt;math&amp;gt;\nabla f&amp;lt;/math&amp;gt; 指向最陡上升方向，因此减去它会使迭代点向下移动。&lt;br /&gt;
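&lt;br /&gt;
以一维函数 f(θ) = (θ − 3)² 为例（其导数为 2(θ − 3)，最小值在 θ = 3），更新规则可以直接实现如下（学习率与步数为假设取值）：&lt;br /&gt;
&lt;br /&gt;
```python
def gradient_descent(grad, theta0, eta=0.1, steps=100):
    theta = theta0
    for _ in range(steps):
        theta = theta - eta * grad(theta)  # theta_{t+1} = theta_t - eta * f'(theta_t)
    return theta

# f(theta) = (theta - 3)^2，其导数为 2(theta - 3)
theta = gradient_descent(lambda t: 2 * (t - 3), theta0=0.0)
print(round(theta, 4))  # 3.0
```
&lt;br /&gt;
若把 eta 增大到 1.1，同一迭代将发散，正对应后文所说的"过大"学习率。&lt;br /&gt;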
&lt;br /&gt;
== 批量、随机和小批量变体 ==&lt;br /&gt;
&lt;br /&gt;
当目标函数具有数据点平均值的形式时，&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;f(\theta) = \frac{1}{N}\sum_{i=1}^{N} \ell(\theta;\, x_i, y_i)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
三种常见策略在用多少数据来估计梯度方面有所不同：&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! 变体 !! 梯度计算范围 !! 每步计算成本 !! 梯度噪声&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;批量（全量）梯度下降&#039;&#039;&#039; || 所有 &amp;lt;math&amp;gt;N&amp;lt;/math&amp;gt; 个样本 || 高 || 无&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;随机梯度下降（SGD）&#039;&#039;&#039; || 1个随机样本 || 低 || 高&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;小批量梯度下降（Mini-batch Gradient Descent）&#039;&#039;&#039; || &amp;lt;math&amp;gt;B&amp;lt;/math&amp;gt; 个随机样本（&amp;lt;math&amp;gt;1 &amp;lt; B &amp;lt; N&amp;lt;/math&amp;gt;） || 中 || 中&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
全量批量梯度下降计算精确梯度，因此沿着平滑轨迹向最小值移动。[[Stochastic Gradient Descent|随机梯度下降]]使用单个样本来估计梯度，大幅减少每步计算量，但代价是轨迹更加嘈杂。小批量梯度下降在两者之间取得平衡，是实践中最常见的选择，典型的批量大小在32到512之间。&lt;br /&gt;
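&lt;br /&gt;
小批量变体可以在一个无噪声的线性回归问题上简短示意（数据、批量大小与学习率均为假设值；由于数据可被精确拟合，权重会收敛到真实值附近）：&lt;br /&gt;
&lt;br /&gt;
```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
true_w = np.array([2.0, -1.0])
y = X @ true_w                    # 无噪声的线性数据，便于验证收敛

w, eta, B = np.zeros(2), 0.1, 32  # 批量大小 B = 32
for epoch in range(50):
    idx = rng.permutation(len(X))          # 每个轮次重新打乱样本
    for start in range(0, len(X), B):
        batch = idx[start:start + B]
        err = X[batch] @ w - y[batch]
        grad = X[batch].T @ err / len(batch)  # 小批量上的平均梯度
        w -= eta * grad

print(np.round(w, 3))  # 接近 [2, -1]
```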
&lt;br /&gt;
== 收敛性 ==&lt;br /&gt;
&lt;br /&gt;
=== 凸函数 ===&lt;br /&gt;
&lt;br /&gt;
对于具有利普希茨连续梯度（常数 &amp;lt;math&amp;gt;L&amp;lt;/math&amp;gt;）的凸函数（Convex Function），使用固定学习率 &amp;lt;math&amp;gt;\eta \leq 1/L&amp;lt;/math&amp;gt; 的梯度下降以 &amp;lt;math&amp;gt;O(1/t)&amp;lt;/math&amp;gt; 的速率收敛。如果函数还具有参数 &amp;lt;math&amp;gt;\mu &amp;gt; 0&amp;lt;/math&amp;gt; 的&#039;&#039;&#039;强凸性&#039;&#039;&#039;，则收敛加速为线性（指数）速率：&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;f(\theta_t) - f(\theta^*) \leq \left(1 - \frac{\mu}{L}\right)^t \bigl(f(\theta_0) - f(\theta^*)\bigr)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
比值 &amp;lt;math&amp;gt;\kappa = L / \mu&amp;lt;/math&amp;gt; 称为&#039;&#039;&#039;条件数（Condition Number）&#039;&#039;&#039;，它决定了算法收敛的速度。病态问题（较大的 &amp;lt;math&amp;gt;\kappa&amp;lt;/math&amp;gt;）收敛缓慢。&lt;br /&gt;
&lt;br /&gt;
=== 非凸函数 ===&lt;br /&gt;
&lt;br /&gt;
大多数深度学习目标函数是非凸的。在这种情况下，梯度下降只能保证收敛到驻点（Stationary Point）（其中 &amp;lt;math&amp;gt;\nabla f = 0&amp;lt;/math&amp;gt;），这可能是局部最小值、鞍点（Saddle Point），甚至是局部最大值。在实践中，高维空间中鞍点比局部最小值更成问题。&lt;br /&gt;
&lt;br /&gt;
== 学习率选择 ==&lt;br /&gt;
&lt;br /&gt;
选择学习率是最重要的实际决策之一：&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;过大&#039;&#039;&#039; — 迭代点振荡或发散。&lt;br /&gt;
* &#039;&#039;&#039;过小&#039;&#039;&#039; — 收敛速度慢得无法接受。&lt;br /&gt;
* &#039;&#039;&#039;学习率调度（Learning Rate Schedule）&#039;&#039;&#039; — 许多实践者从较大的学习率开始，随时间逐渐减小（阶梯衰减、指数衰减、余弦退火）。&lt;br /&gt;
* &#039;&#039;&#039;线搜索（Line Search）&#039;&#039;&#039; — 经典数值方法在每步选择 &amp;lt;math&amp;gt;\eta&amp;lt;/math&amp;gt; 以满足沃尔夫条件或阿米霍条件等，但这在深度学习中很少使用。&lt;br /&gt;
&lt;br /&gt;
一种常用的启发式方法是在对数尺度上尝试多个值（例如 &amp;lt;math&amp;gt;10^{-1}, 10^{-2}, 10^{-3}&amp;lt;/math&amp;gt;），然后选择在不产生不稳定性的前提下损失下降最快的那个。&lt;br /&gt;
&lt;br /&gt;
== 扩展与改进 ==&lt;br /&gt;
&lt;br /&gt;
几种重要的改进方法解决了原始梯度下降的局限性：&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;动量（Momentum）&#039;&#039;&#039; — 从过去的梯度中积累速度向量，帮助在峡谷状地形中加速收敛。&lt;br /&gt;
* &#039;&#039;&#039;涅斯捷罗夫加速梯度（Nesterov Accelerated Gradient）&#039;&#039;&#039; — 一种动量变体，在前瞻位置评估梯度，具有更好的理论收敛速率。&lt;br /&gt;
* &#039;&#039;&#039;自适应方法（Adaptive Methods）&#039;&#039;&#039;（Adagrad、RMSProp、Adam）— 根据梯度历史为每个参数维护自适应学习率。&lt;br /&gt;
* &#039;&#039;&#039;二阶方法（Second-order Methods）&#039;&#039;&#039; — 如牛顿法和L-BFGS等算法使用曲率信息（海森矩阵或其近似）以加速收敛，但对大规模问题通常计算成本过高。&lt;br /&gt;
&lt;br /&gt;
== 实践技巧 ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;特征缩放（Feature Scaling）&#039;&#039;&#039; — 归一化输入特征使其具有相似的范围可以显著改善收敛性，因为损失曲面变得更加各向同性。&lt;br /&gt;
* &#039;&#039;&#039;梯度裁剪（Gradient Clipping）&#039;&#039;&#039; — 限制梯度的范数以防止过大的更新。&lt;br /&gt;
* &#039;&#039;&#039;随机初始化&#039;&#039;&#039; — 从合理的随机初始化开始（例如神经网络中的Xavier或He初始化）可以打破权重的对称性，避免所有神经元学习到相同的特征。&lt;br /&gt;
* &#039;&#039;&#039;监控损失曲线&#039;&#039;&#039; — 绘制训练损失随迭代次数的变化是最简单的诊断方法：平滑下降的曲线表示训练健康；振荡则表明学习率过高。&lt;br /&gt;
&lt;br /&gt;
== 应用 ==&lt;br /&gt;
&lt;br /&gt;
梯度下降及其变体广泛应用于科学和工程领域：&lt;br /&gt;
&lt;br /&gt;
* 训练机器学习模型（线性模型、神经网络、支持向量机）&lt;br /&gt;
* 信号处理与控制系统&lt;br /&gt;
* 物理和成像中的逆问题&lt;br /&gt;
* 运筹学和物流优化&lt;br /&gt;
* 经济学和博弈论均衡计算&lt;br /&gt;
&lt;br /&gt;
== 参见 ==&lt;br /&gt;
&lt;br /&gt;
* [[Stochastic Gradient Descent]]&lt;br /&gt;
* [[Backpropagation]]&lt;br /&gt;
* [[Loss Functions]]&lt;br /&gt;
* [[Neural Networks]]&lt;br /&gt;
* [[Overfitting and Regularization]]&lt;br /&gt;
&lt;br /&gt;
== 参考文献 ==&lt;br /&gt;
&lt;br /&gt;
* Cauchy, A. (1847). &amp;quot;Méthode générale pour la résolution des systèmes d&#039;équations simultanées&amp;quot;. &#039;&#039;Comptes Rendus de l&#039;Académie des Sciences&#039;&#039;.&lt;br /&gt;
* Boyd, S. and Vandenberghe, L. (2004). &#039;&#039;Convex Optimization&#039;&#039;. Cambridge University Press.&lt;br /&gt;
* Ruder, S. (2016). &amp;quot;An overview of gradient descent optimization algorithms&amp;quot;. &#039;&#039;arXiv:1609.04747&#039;&#039;.&lt;br /&gt;
* Goodfellow, I., Bengio, Y. and Courville, A. (2016). &#039;&#039;Deep Learning&#039;&#039;, Chapter 8. MIT Press.&lt;br /&gt;
&lt;br /&gt;
[[Category:Optimization]]&lt;br /&gt;
[[Category:Introductory]]&lt;/div&gt;</summary>
		<author><name>DeployBot</name></author>
	</entry>
	<entry>
		<id>https://marovi.ai/index.php?title=Convolutional_Neural_Networks/zh&amp;diff=2165</id>
		<title>Convolutional Neural Networks/zh</title>
		<link rel="alternate" type="text/html" href="https://marovi.ai/index.php?title=Convolutional_Neural_Networks/zh&amp;diff=2165"/>
		<updated>2026-04-24T07:09:02Z</updated>

		<summary type="html">&lt;p&gt;DeployBot: [deploy-bot] Deploy from CI (8c92aeb)&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{LanguageBar | page = Convolutional Neural Networks}}&lt;br /&gt;
{{ArticleInfobox | topic_area = Deep Learning | difficulty = Intermediate | prerequisites = [[Neural Networks]], [[Backpropagation]]}}&lt;br /&gt;
{{ContentMeta | generated_by = claude-opus | model_used = claude-opus-4-6 | generated_date = 2026-03-13}}&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;卷积神经网络（Convolutional Neural Network）&#039;&#039;&#039;（&#039;&#039;&#039;CNN&#039;&#039;&#039;或&#039;&#039;&#039;ConvNet&#039;&#039;&#039;）是一类专门用于处理具有网格状拓扑结构数据的深度[[Neural Networks|神经网络]]，例如图像（二维像素网格）、音频频谱图和视频。它们通过局部连接、权重共享和池化来利用输入的空间结构，使其在视觉和空间任务上比全连接网络高效得多。&lt;br /&gt;
&lt;br /&gt;
== 卷积运算 ==&lt;br /&gt;
&lt;br /&gt;
核心构建模块是&#039;&#039;&#039;离散卷积（Discrete Convolution）&#039;&#039;&#039;。对于二维输入 &amp;lt;math&amp;gt;\mathbf{X}&amp;lt;/math&amp;gt; 和大小为 &amp;lt;math&amp;gt;k \times k&amp;lt;/math&amp;gt; 的滤波器（卷积核） &amp;lt;math&amp;gt;\mathbf{K}&amp;lt;/math&amp;gt;，输出特征图 &amp;lt;math&amp;gt;\mathbf{Y}&amp;lt;/math&amp;gt; 为：&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;Y_{i,j} = \sum_{m=0}^{k-1}\sum_{n=0}^{k-1} K_{m,n} \cdot X_{i+m,\, j+n} + b&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
其中 &amp;lt;math&amp;gt;b&amp;lt;/math&amp;gt; 是偏置项。滤波器在输入上滑动（卷积），在每个位置计算点积。严格来说，大多数实现计算的是&#039;&#039;&#039;互相关（Cross-correlation）&#039;&#039;&#039;而非真正的卷积（后者会翻转卷积核），但由于卷积核权重是学习得到的，这一区别无关紧要。&lt;br /&gt;
&lt;br /&gt;
控制卷积的关键超参数：&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;卷积核大小&#039;&#039;&#039; — 滤波器的空间范围（例如 &amp;lt;math&amp;gt;3 \times 3&amp;lt;/math&amp;gt;，&amp;lt;math&amp;gt;5 \times 5&amp;lt;/math&amp;gt;）。&lt;br /&gt;
* &#039;&#039;&#039;步幅（Stride）&#039;&#039;&#039; — 卷积核连续位置之间的步长。步幅为2会将空间维度减半。&lt;br /&gt;
* &#039;&#039;&#039;填充（Padding）&#039;&#039;&#039; — 在输入边界周围添加零来控制输出大小。&amp;quot;Same&amp;quot;填充保持空间维度不变；&amp;quot;Valid&amp;quot;表示不添加任何填充，输出尺寸随之缩小。&lt;br /&gt;
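&lt;br /&gt;
上述卷积公式（含步幅，采用valid填充即不填充）的一个纯Python参考实现如下（仅为示意，未做任何性能优化）：&lt;br /&gt;
&lt;br /&gt;
```python
def conv2d(X, K, b=0.0, stride=1):
    # 二维互相关：valid填充，步幅 stride；X、K 为嵌套列表
    k = len(K)
    H = (len(X) - k) // stride + 1
    W = (len(X[0]) - k) // stride + 1
    Y = [[0.0] * W for _ in range(H)]
    for i in range(H):
        for j in range(W):
            s = b
            for m in range(k):
                for n in range(k):
                    s += K[m][n] * X[i * stride + m][j * stride + n]
            Y[i][j] = s
    return Y

# 3x3 输入与 2x2 全1卷积核：每个输出元素是一个 2x2 窗口内的和
X = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
K = [[1, 1],
     [1, 1]]
print(conv2d(X, K))  # [[12.0, 16.0], [24.0, 28.0]]
```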
&lt;br /&gt;
== 滤波器与特征检测 ==&lt;br /&gt;
&lt;br /&gt;
每个滤波器学习检测特定的局部模式。在早期层中，滤波器通常对边缘、角点和颜色梯度有响应。更深的层将这些组合成更高级的特征——纹理、部件，最终是完整的物体。&lt;br /&gt;
&lt;br /&gt;
卷积层并行应用多个滤波器，生成一叠特征图。如果输入有 &amp;lt;math&amp;gt;C_{\text{in}}&amp;lt;/math&amp;gt; 个通道，该层有 &amp;lt;math&amp;gt;C_{\text{out}}&amp;lt;/math&amp;gt; 个滤波器，则可学习参数的总数为：&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;C_{\text{out}} \times (C_{\text{in}} \times k^2 + 1)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
这比具有相同输入和输出维度的全连接层少得多，因为权重在所有空间位置上是共享的。&lt;br /&gt;
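&lt;br /&gt;
按该公式做一个简单的数值核对（取 C_in = 3、C_out = 64、k = 3，这些只是常见的假设取值）：&lt;br /&gt;
&lt;br /&gt;
```python
def conv_params(c_in, c_out, k):
    # 每个滤波器有 c_in * k^2 个权重和 1 个偏置，共 c_out 个滤波器
    return c_out * (c_in * k ** 2 + 1)

print(conv_params(3, 64, 3))  # 1792
```
&lt;br /&gt;
作为对比，把 32×32×3 的输入映射到 32×32×64 输出的全连接层需要约 2 亿个权重（3072 × 65536），差距悬殊。&lt;br /&gt;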
&lt;br /&gt;
== 池化 ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;池化（Pooling）&#039;&#039;&#039;层对特征图进行下采样，减小其空间维度并提供一定程度的平移不变性。常见的池化操作：&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;最大池化（Max Pooling）&#039;&#039;&#039; — 取每个局部窗口（例如 &amp;lt;math&amp;gt;2 \times 2&amp;lt;/math&amp;gt;）中的最大值。&lt;br /&gt;
* &#039;&#039;&#039;平均池化（Average Pooling）&#039;&#039;&#039; — 取每个窗口中的平均值。&lt;br /&gt;
* &#039;&#039;&#039;全局平均池化（Global Average Pooling）&#039;&#039;&#039; — 将每个完整的特征图平均为单个值，通常在最终分类层之前使用。&lt;br /&gt;
&lt;br /&gt;
池化减少了计算成本，并通过逐步抽象表示来帮助防止过拟合。&lt;br /&gt;
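&lt;br /&gt;
最大池化可以用纯Python直接示意（2×2 窗口、步幅2是常见的默认设置，这里作为假设取值）：&lt;br /&gt;
&lt;br /&gt;
```python
def max_pool2d(X, size=2, stride=2):
    # 最大池化：取每个 size x size 窗口内的最大值
    H = (len(X) - size) // stride + 1
    W = (len(X[0]) - size) // stride + 1
    return [[max(X[i * stride + m][j * stride + n]
                 for m in range(size) for n in range(size))
             for j in range(W)]
            for i in range(H)]

X = [[1, 3, 2, 4],
     [5, 6, 7, 8],
     [3, 2, 1, 0],
     [1, 2, 3, 4]]
print(max_pool2d(X))  # [[6, 8], [3, 4]]
```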
&lt;br /&gt;
== CNN的架构 ==&lt;br /&gt;
&lt;br /&gt;
典型的CNN交替使用卷积层和池化层，然后接一个或多个全连接层用于最终预测：&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Input → [Conv → ReLU → Pool] × N → Flatten → FC → FC → Output&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
每个卷积-池化模块提取越来越抽象的特征，而全连接层将它们组合用于分类或回归。&lt;br /&gt;
&lt;br /&gt;
== 里程碑架构 ==&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! 架构 !! 年份 !! 关键贡献 !! 深度&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;LeNet-5&#039;&#039;&#039; || 1998 || 开创了CNN用于手写数字识别（MNIST） || 5层&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;AlexNet&#039;&#039;&#039; || 2012 || 赢得ImageNet竞赛；推广了ReLU、Dropout、GPU训练 || 8层&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;VGGNet&#039;&#039;&#039; || 2014 || 证明深度很重要；全程仅使用 &amp;lt;math&amp;gt;3 \times 3&amp;lt;/math&amp;gt; 滤波器 || 16-19层&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;GoogLeNet（Inception）&#039;&#039;&#039; || 2014 || 引入了具有并行滤波器大小的Inception模块 || 22层&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;ResNet&#039;&#039;&#039; || 2015 || 引入残差连接，使极深网络的训练成为可能 || 50-152+层&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;DenseNet&#039;&#039;&#039; || 2017 || 通过密集块将每层与所有后续层连接 || 121-264层&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;EfficientNet&#039;&#039;&#039; || 2019 || 对深度、宽度和分辨率进行复合缩放 || 可变&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=== 残差连接 ===&lt;br /&gt;
&lt;br /&gt;
ResNet引入的&#039;&#039;&#039;残差连接（Residual Connection）&#039;&#039;&#039;（或跳跃连接）将模块的输入直接加到其输出上：&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\mathbf{y} = \mathcal{F}(\mathbf{x}) + \mathbf{x}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
这使得梯度可以直接通过恒等路径流动，缓解了梯度消失问题，使得训练数百层的网络成为可能。残差连接已成为几乎所有现代架构的标准组件。&lt;br /&gt;
&lt;br /&gt;
== 计算机视觉中的应用 ==&lt;br /&gt;
&lt;br /&gt;
CNN在广泛的视觉任务中取得了最先进的性能：&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;图像分类&#039;&#039;&#039; — 为整张图像分配标签（ImageNet、CIFAR）。&lt;br /&gt;
* &#039;&#039;&#039;目标检测（Object Detection）&#039;&#039;&#039; — 在图像中定位和分类目标（YOLO、Faster R-CNN、SSD）。&lt;br /&gt;
* &#039;&#039;&#039;语义分割（Semantic Segmentation）&#039;&#039;&#039; — 为每个像素分配类别标签（U-Net、DeepLab）。&lt;br /&gt;
* &#039;&#039;&#039;实例分割（Instance Segmentation）&#039;&#039;&#039; — 区分物体的各个实例（Mask R-CNN）。&lt;br /&gt;
* &#039;&#039;&#039;图像生成&#039;&#039;&#039; — 使用基于CNN的生成器生成逼真图像（GAN、扩散模型）。&lt;br /&gt;
* &#039;&#039;&#039;医学影像&#039;&#039;&#039; — 肿瘤检测、视网膜分析和放射学筛查。&lt;br /&gt;
&lt;br /&gt;
== 实践技巧 ==&lt;br /&gt;
&lt;br /&gt;
* 当标注数据有限时，使用预训练模型（迁移学习/Transfer Learning）。&lt;br /&gt;
* 优先使用堆叠的小卷积核（&amp;lt;math&amp;gt;3 \times 3&amp;lt;/math&amp;gt;）——两个 &amp;lt;math&amp;gt;3 \times 3&amp;lt;/math&amp;gt; 层与一个 &amp;lt;math&amp;gt;5 \times 5&amp;lt;/math&amp;gt; 层具有相同的感受野（Receptive Field），但参数更少。&lt;br /&gt;
* 在卷积之后、激活之前应用批量归一化（Batch Normalization）。&lt;br /&gt;
* 大量使用数据增强（Data Augmentation）以减少[[Overfitting and Regularization|过拟合]]。&lt;br /&gt;
* 用全局平均池化替代全连接层以减少参数数量。&lt;br /&gt;
&lt;br /&gt;
== 参见 ==&lt;br /&gt;
&lt;br /&gt;
* [[Neural Networks]]&lt;br /&gt;
* [[Backpropagation]]&lt;br /&gt;
* [[Overfitting and Regularization]]&lt;br /&gt;
* [[Recurrent Neural Networks]]&lt;br /&gt;
* [[Gradient Descent]]&lt;br /&gt;
&lt;br /&gt;
== 参考文献 ==&lt;br /&gt;
&lt;br /&gt;
* LeCun, Y. et al. (1998). &amp;quot;Gradient-Based Learning Applied to Document Recognition&amp;quot;. &#039;&#039;Proceedings of the IEEE&#039;&#039;.&lt;br /&gt;
* Krizhevsky, A., Sutskever, I. and Hinton, G. E. (2012). &amp;quot;ImageNet Classification with Deep Convolutional Neural Networks&amp;quot;. &#039;&#039;NeurIPS&#039;&#039;.&lt;br /&gt;
* Simonyan, K. and Zisserman, A. (2015). &amp;quot;Very Deep Convolutional Networks for Large-Scale Image Recognition&amp;quot;. &#039;&#039;ICLR&#039;&#039;.&lt;br /&gt;
* He, K. et al. (2016). &amp;quot;Deep Residual Learning for Image Recognition&amp;quot;. &#039;&#039;CVPR&#039;&#039;.&lt;br /&gt;
* Tan, M. and Le, Q. V. (2019). &amp;quot;EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks&amp;quot;. &#039;&#039;ICML&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
[[Category:Deep Learning]]&lt;br /&gt;
[[Category:Intermediate]]&lt;br /&gt;
[[Category:Neural Networks]]&lt;/div&gt;</summary>
		<author><name>DeployBot</name></author>
	</entry>
	<entry>
		<id>https://marovi.ai/index.php?title=Backpropagation/zh&amp;diff=2164</id>
		<title>Backpropagation/zh</title>
		<link rel="alternate" type="text/html" href="https://marovi.ai/index.php?title=Backpropagation/zh&amp;diff=2164"/>
		<updated>2026-04-24T07:09:02Z</updated>

		<summary type="html">&lt;p&gt;DeployBot: [deploy-bot] Deploy from CI (8c92aeb)&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{LanguageBar | page = Backpropagation}}&lt;br /&gt;
{{ArticleInfobox | topic_area = Deep Learning | difficulty = Intermediate | prerequisites = [[Gradient Descent]], [[Neural Networks]]}}&lt;br /&gt;
{{ContentMeta | generated_by = claude-opus | model_used = claude-opus-4-6 | generated_date = 2026-03-13}}&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;反向传播（Backpropagation）&#039;&#039;&#039;（&#039;&#039;&#039;误差反向传播&#039;&#039;&#039;的缩写）是一种高效计算损失函数相对于神经网络中每个权重的梯度的算法。它与[[Gradient Descent|梯度下降]]等优化方法相结合，构成了现代深度学习模型的标准训练过程。&lt;br /&gt;
&lt;br /&gt;
== 链式法则 ==&lt;br /&gt;
&lt;br /&gt;
反向传播本质上是微积分中&#039;&#039;&#039;链式法则（Chain Rule）&#039;&#039;&#039;的应用。如果变量 &amp;lt;math&amp;gt;z&amp;lt;/math&amp;gt; 依赖于 &amp;lt;math&amp;gt;y&amp;lt;/math&amp;gt;，而 &amp;lt;math&amp;gt;y&amp;lt;/math&amp;gt; 又依赖于 &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt;，则：&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\frac{\partial z}{\partial x} = \frac{\partial z}{\partial y} \cdot \frac{\partial y}{\partial x}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
在神经网络中，损失 &amp;lt;math&amp;gt;L&amp;lt;/math&amp;gt; 依赖于输出，输出依赖于最后一个隐藏层的激活值，激活值又依赖于前一层的激活值，以此类推直到输入。链式法则允许我们将梯度分解为局部导数的乘积，每一层对应一个。&lt;br /&gt;
&lt;br /&gt;
== 前向传播 ==&lt;br /&gt;
&lt;br /&gt;
在前向传播过程中，输入数据逐层通过网络传播。对于全连接层 &amp;lt;math&amp;gt;l&amp;lt;/math&amp;gt;：&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\mathbf{z}^{(l)} = \mathbf{W}^{(l)} \mathbf{a}^{(l-1)} + \mathbf{b}^{(l)}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\mathbf{a}^{(l)} = g^{(l)}(\mathbf{z}^{(l)})&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
其中 &amp;lt;math&amp;gt;\mathbf{a}^{(l-1)}&amp;lt;/math&amp;gt; 是前一层的激活值（&amp;lt;math&amp;gt;\mathbf{a}^{(0)} = \mathbf{x}&amp;lt;/math&amp;gt;），&amp;lt;math&amp;gt;\mathbf{W}^{(l)}&amp;lt;/math&amp;gt; 和 &amp;lt;math&amp;gt;\mathbf{b}^{(l)}&amp;lt;/math&amp;gt; 是权重和偏置，&amp;lt;math&amp;gt;g^{(l)}&amp;lt;/math&amp;gt; 是激活函数。前向传播存储所有中间值 &amp;lt;math&amp;gt;\mathbf{z}^{(l)}&amp;lt;/math&amp;gt; 和 &amp;lt;math&amp;gt;\mathbf{a}^{(l)}&amp;lt;/math&amp;gt;，因为它们在反向传播过程中需要使用。&lt;br /&gt;
&lt;br /&gt;
== 反向传播过程 ==&lt;br /&gt;
&lt;br /&gt;
反向传播从损失开始向输入方向计算梯度。定义第 &amp;lt;math&amp;gt;l&amp;lt;/math&amp;gt; 层的误差信号为：&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\boldsymbol{\delta}^{(l)} = \frac{\partial L}{\partial \mathbf{z}^{(l)}}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
对于输出层（第 &amp;lt;math&amp;gt;L_{\text{out}}&amp;lt;/math&amp;gt; 层）：&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\boldsymbol{\delta}^{(L_{\text{out}})} = \frac{\partial L}{\partial \mathbf{a}^{(L_{\text{out}})}} \odot g&#039;^{(L_{\text{out}})}(\mathbf{z}^{(L_{\text{out}})})&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
对于每个更早的层，误差向后传播：&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\boldsymbol{\delta}^{(l)} = \bigl(\mathbf{W}^{(l+1)}\bigr)^\top \boldsymbol{\delta}^{(l+1)} \odot g&#039;^{(l)}(\mathbf{z}^{(l)})&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
其中 &amp;lt;math&amp;gt;\odot&amp;lt;/math&amp;gt; 表示逐元素乘法。一旦知道误差信号，参数梯度为：&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\frac{\partial L}{\partial \mathbf{W}^{(l)}} = \boldsymbol{\delta}^{(l)} \bigl(\mathbf{a}^{(l-1)}\bigr)^\top, \qquad \frac{\partial L}{\partial \mathbf{b}^{(l)}} = \boldsymbol{\delta}^{(l)}&amp;lt;/math&amp;gt;&lt;br /&gt;
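&lt;br /&gt;
上述公式可以在一个单隐藏层的小网络上验证：反向传播得到的梯度应与数值（中心差分）梯度一致。下面的示意中，权重为随机假设值，损失取平方误差，输出层为恒等激活：&lt;br /&gt;
&lt;br /&gt;
```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 2)), rng.normal(size=3)
W2, b2 = rng.normal(size=(1, 3)), rng.normal(size=1)
x, t = np.array([0.5, -0.2]), np.array([1.0])

def loss(W1_):
    h = np.tanh(W1_ @ x + b1)
    y = W2 @ h + b2
    return 0.5 * np.sum((y - t) ** 2)

# 反向传播：delta2 = dL/dz2；delta1 = (W2^T delta2) ⊙ g'(z1)
z1 = W1 @ x + b1
h = np.tanh(z1)
y = W2 @ h + b2
delta2 = y - t                           # 输出层为恒等激活，平方误差损失
delta1 = (W2.T @ delta2) * (1 - h ** 2)  # tanh'(z) = 1 - tanh(z)^2
grad_W1 = np.outer(delta1, x)            # dL/dW1 = delta1 (a^(0))^T

# 与中心差分数值梯度逐元素对比
eps = 1e-6
num = np.zeros_like(W1)
for i in range(W1.shape[0]):
    for j in range(W1.shape[1]):
        E = np.zeros_like(W1)
        E[i, j] = eps
        num[i, j] = (loss(W1 + E) - loss(W1 - E)) / (2 * eps)

print(np.allclose(grad_W1, num, atol=1e-6))  # True
```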
&lt;br /&gt;
== 计算图 ==&lt;br /&gt;
&lt;br /&gt;
现代深度学习框架（PyTorch、TensorFlow、JAX）通过构建&#039;&#039;&#039;计算图（Computational Graph）&#039;&#039;&#039;来实现反向传播——这是一个有向无环图，其中每个节点表示一个运算，每条边传递一个张量。前向传播构建计算图；反向传播按逆拓扑排序遍历它，在每个节点应用链式法则。&lt;br /&gt;
&lt;br /&gt;
这种抽象使得对任意运算组合进行微分成为可能，而不仅限于标准层类型。存在两种实现策略：&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;静态图&#039;&#039;&#039; — 在执行前一次性定义图（早期的TensorFlow）。允许激进的编译器优化，但灵活性较低。&lt;br /&gt;
* &#039;&#039;&#039;动态图&#039;&#039;&#039; — 在每次前向传播时重建图（PyTorch、TensorFlow Eager模式）。更便于调试和处理依赖数据的控制流模型。&lt;br /&gt;
&lt;br /&gt;
== 自动微分 ==&lt;br /&gt;
&lt;br /&gt;
反向传播是&#039;&#039;&#039;反向模式自动微分（Reverse-mode Automatic Differentiation）&#039;&#039;&#039;（AD）的特例。与数值微分（近似的）或符号微分（可能产生冗长的表达式）不同，自动微分通过系统地对基本运算应用链式法则来计算精确导数。&lt;br /&gt;
&lt;br /&gt;
反向模式自动微分可以在单次反向传播中计算标量输出相对于所有输入的梯度，这使其非常适合神经网络——损失是标量，但参数数量达到数百万。&lt;br /&gt;
&lt;br /&gt;
反向传播的计算成本通常是前向传播的2-3倍，因为它必须评估局部雅可比矩阵（Jacobian）并将其与传入的误差信号相乘。&lt;br /&gt;
&lt;br /&gt;
== 梯度消失与梯度爆炸 ==&lt;br /&gt;
&lt;br /&gt;
当网络层数很多时，梯度是许多局部导数的乘积。如果这些因子持续小于1，梯度会指数级地趋向于零——即&#039;&#039;&#039;梯度消失（Vanishing Gradient）&#039;&#039;&#039;问题。如果它们持续大于1，梯度会指数级增长——即&#039;&#039;&#039;梯度爆炸（Exploding Gradient）&#039;&#039;&#039;问题。&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! 问题 !! 症状 !! 常见缓解措施&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;梯度消失&#039;&#039;&#039; || 早期层学习极其缓慢 || ReLU激活函数、残差连接、批量归一化、精心初始化&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;梯度爆炸&#039;&#039;&#039; || 损失发散或产生NaN值 || 梯度裁剪、权重正则化、降低学习率&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
在引入ReLU激活函数、残差连接（ResNet）和归一化技术之前，这些问题是训练深度网络的主要障碍。&lt;br /&gt;
&lt;br /&gt;
== 实践注意事项 ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;内存&#039;&#039;&#039; — 前向传播必须存储所有中间激活值以供反向传播使用。对于非常深的网络，这可能是不可承受的；&#039;&#039;&#039;梯度检查点（Gradient Checkpointing）&#039;&#039;&#039;通过在反向传播时重新计算激活值来以计算换取内存。&lt;br /&gt;
* &#039;&#039;&#039;数值稳定性&#039;&#039;&#039; — 使用log-sum-exp技巧和融合的softmax-交叉熵实现可以避免上溢和下溢。&lt;br /&gt;
* &#039;&#039;&#039;高阶梯度&#039;&#039;&#039; — 对反向传播本身再求导可以得到二阶信息（海森向量积），这对自然梯度下降和元学习（Meta-learning）等方法很有用。&lt;br /&gt;
* &#039;&#039;&#039;混合精度（Mixed Precision）&#039;&#039;&#039; — 在半精度下执行前向传播，同时保持权重的全精度主副本，可以在现代GPU上加速训练。&lt;br /&gt;
&lt;br /&gt;
== 历史发展 ==&lt;br /&gt;
&lt;br /&gt;
反向传播背后的关键思想由多位研究者独立发展。Seppo Linnainmaa于1970年描述了反向模式自动微分。Paul Werbos在1974年的博士论文中将其应用于神经网络。该算法在Rumelhart、Hinton和Williams于1986年发表的有影响力的论文之后获得了广泛采用，该论文展示了它在多层网络上的有效性。&lt;br /&gt;
&lt;br /&gt;
== 参见 ==&lt;br /&gt;
&lt;br /&gt;
* [[Gradient Descent]]&lt;br /&gt;
* [[Stochastic Gradient Descent]]&lt;br /&gt;
* [[Neural Networks]]&lt;br /&gt;
* [[Loss Functions]]&lt;br /&gt;
* [[Convolutional Neural Networks]]&lt;br /&gt;
* [[Recurrent Neural Networks]]&lt;br /&gt;
&lt;br /&gt;
== 参考文献 ==&lt;br /&gt;
&lt;br /&gt;
* Rumelhart, D. E., Hinton, G. E. and Williams, R. J. (1986). &amp;quot;Learning representations by back-propagating errors&amp;quot;. &#039;&#039;Nature&#039;&#039;, 323, 533–536.&lt;br /&gt;
* Linnainmaa, S. (1970). &amp;quot;The representation of the cumulative rounding error of an algorithm as a Taylor expansion of the local rounding errors&amp;quot;. Master&#039;s thesis, University of Helsinki.&lt;br /&gt;
* Werbos, P. J. (1974). &amp;quot;Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences&amp;quot;. PhD thesis, Harvard University.&lt;br /&gt;
* Baydin, A. G. et al. (2018). &amp;quot;Automatic Differentiation in Machine Learning: a Survey&amp;quot;. &#039;&#039;JMLR&#039;&#039;, 18(153), 1–43.&lt;br /&gt;
* Goodfellow, I., Bengio, Y. and Courville, A. (2016). &#039;&#039;Deep Learning&#039;&#039;, Chapter 6. MIT Press.&lt;br /&gt;
&lt;br /&gt;
[[Category:Deep Learning]]&lt;br /&gt;
[[Category:Intermediate]]&lt;/div&gt;</summary>
		<author><name>DeployBot</name></author>
	</entry>
	<entry>
		<id>https://marovi.ai/index.php?title=Attention_Mechanisms/zh&amp;diff=2163</id>
		<title>Attention Mechanisms/zh</title>
		<link rel="alternate" type="text/html" href="https://marovi.ai/index.php?title=Attention_Mechanisms/zh&amp;diff=2163"/>
		<updated>2026-04-24T07:09:02Z</updated>

		<summary type="html">&lt;p&gt;DeployBot: [deploy-bot] Deploy from CI (8c92aeb)&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{LanguageBar | page = Attention Mechanisms}}&lt;br /&gt;
{{ArticleInfobox | topic_area = Deep Learning | difficulty = Advanced | prerequisites = [[Neural Networks]], [[Recurrent Neural Networks]]}}&lt;br /&gt;
{{ContentMeta | generated_by = claude-opus | model_used = claude-opus-4-6 | generated_date = 2026-03-13}}&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;注意力机制（Attention Mechanism）&#039;&#039;&#039;是一组允许神经网络在生成每个输出元素时选择性地关注输入相关部分的技术。注意力机制最初被引入以克服序列到序列模型中固定长度上下文向量的局限性，如今已成为[[Transformer]]等现代架构的基础构建模块。&lt;br /&gt;
&lt;br /&gt;
== 动机 ==&lt;br /&gt;
&lt;br /&gt;
早期的序列到序列（Sequence-to-Sequence）模型使用[[Recurrent Neural Networks|循环神经网络]]将整个输入序列编码为单个固定维度的向量。这一&#039;&#039;&#039;瓶颈&#039;&#039;&#039;迫使长距离依赖关系被压缩到一个固定大小的向量中，导致在长序列上性能下降。注意力机制通过让解码器在每个生成步骤中查阅每个编码器隐藏状态来解决这一问题，根据学习到的相关性分数对它们进行加权。&lt;br /&gt;
&lt;br /&gt;
== Bahdanau（加性）注意力 ==&lt;br /&gt;
&lt;br /&gt;
Bahdanau等人（2015）提出了第一个被广泛采用的机器翻译注意力机制。给定编码器隐藏状态 &amp;lt;math&amp;gt;h_1, \dots, h_T&amp;lt;/math&amp;gt; 和解码器状态 &amp;lt;math&amp;gt;s_{t-1}&amp;lt;/math&amp;gt;，对齐分数计算为：&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;e_{t,i} = v^{\!\top} \tanh(W_s\, s_{t-1} + W_h\, h_i)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
其中 &amp;lt;math&amp;gt;W_s&amp;lt;/math&amp;gt;、&amp;lt;math&amp;gt;W_h&amp;lt;/math&amp;gt; 和 &amp;lt;math&amp;gt;v&amp;lt;/math&amp;gt; 是可学习参数。注意力权重通过应用softmax获得：&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{j=1}^{T} \exp(e_{t,j})}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
上下文向量是加权和 &amp;lt;math&amp;gt;c_t = \sum_{i=1}^{T} \alpha_{t,i}\, h_i&amp;lt;/math&amp;gt;，它与 &amp;lt;math&amp;gt;s_{t-1}&amp;lt;/math&amp;gt; 拼接后输入解码器。&lt;br /&gt;
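上述三个公式可以直接翻译成一个 numpy 草图（玩具规模的假设数值，变量名与正文符号对应）：

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 5, 8                      # 编码器步数与隐藏维度（假设值）
h = rng.normal(size=(T, d))      # 编码器隐藏状态 h_1..h_T
s = rng.normal(size=d)           # 解码器状态 s_{t-1}
W_s = rng.normal(size=(d, d))    # 可学习参数 W_s, W_h, v
W_h = rng.normal(size=(d, d))
v = rng.normal(size=d)

e = np.tanh(s @ W_s.T + h @ W_h.T) @ v             # 对齐分数 e_{t,i}
alpha = np.exp(e - e.max()); alpha /= alpha.sum()  # softmax 得到注意力权重
c = alpha @ h                                      # 上下文向量 c_t
```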
&lt;br /&gt;
== Luong（乘性）注意力 ==&lt;br /&gt;
&lt;br /&gt;
Luong等人（2015）通过用点积或双线性形式替换加性网络来简化评分函数：&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! 变体 !! 评分函数&lt;br /&gt;
|-&lt;br /&gt;
| 点积（Dot） || &amp;lt;math&amp;gt;e_{t,i} = s_t^{\!\top} h_i&amp;lt;/math&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| 一般（General） || &amp;lt;math&amp;gt;e_{t,i} = s_t^{\!\top} W_a\, h_i&amp;lt;/math&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| 拼接（Concat） || &amp;lt;math&amp;gt;e_{t,i} = v^{\!\top} \tanh(W_a [s_t;\, h_i])&amp;lt;/math&amp;gt;&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
点积变体要求编码器和解码器维度匹配，而一般变体引入了一个可学习的权重矩阵 &amp;lt;math&amp;gt;W_a&amp;lt;/math&amp;gt;。&lt;br /&gt;
&lt;br /&gt;
== 缩放点积注意力 ==&lt;br /&gt;
&lt;br /&gt;
Vaswani等人（2017）引入了Transformer中使用的公式。给定查询（Query）矩阵 &amp;lt;math&amp;gt;Q&amp;lt;/math&amp;gt;、键（Key）矩阵 &amp;lt;math&amp;gt;K&amp;lt;/math&amp;gt; 和值（Value）矩阵 &amp;lt;math&amp;gt;V&amp;lt;/math&amp;gt;：&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\!\top}}{\sqrt{d_k}}\right) V&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
缩放因子 &amp;lt;math&amp;gt;\sqrt{d_k}&amp;lt;/math&amp;gt; 防止点积随着键维度 &amp;lt;math&amp;gt;d_k&amp;lt;/math&amp;gt; 的增加而变得很大，否则会将softmax推入梯度极小的区域。&lt;br /&gt;
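该公式的一个直接 numpy 实现示意如下（内置数值稳定的 softmax；函数名与矩阵形状均为示例假设）：

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # softmax(Q K^T / sqrt(d_k)) V 的直接实现
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)
    scores = scores - scores.max(axis=-1, keepdims=True)  # 数值稳定
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out, w = scaled_dot_product_attention(Q, K, V)
```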
&lt;br /&gt;
== 自注意力 ==&lt;br /&gt;
&lt;br /&gt;
在&#039;&#039;&#039;自注意力（Self-Attention）&#039;&#039;&#039;中，查询、键和值都来自同一个序列。每个位置关注所有其他位置（包括自身），使模型能够在单层中捕获长距离依赖关系。对于输入矩阵 &amp;lt;math&amp;gt;X \in \mathbb{R}^{n \times d}&amp;lt;/math&amp;gt;：&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;Q = X W^Q, \quad K = X W^K, \quad V = X W^V&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
自注意力的复杂度为 &amp;lt;math&amp;gt;O(n^2 d)&amp;lt;/math&amp;gt;，对于很长的序列来说可能非常昂贵。稀疏注意力（Sparse Attention）和线性注意力（Linear Attention）等高效变体降低了这一成本。&lt;br /&gt;
&lt;br /&gt;
== 多头注意力 ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;多头注意力（Multi-Head Attention）&#039;&#039;&#039;不是执行单个注意力函数，而是使用独立的投影运行 &amp;lt;math&amp;gt;h&amp;lt;/math&amp;gt; 个并行的注意力头：&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^O&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
其中 &amp;lt;math&amp;gt;\mathrm{head}_i = \mathrm{Attention}(Q W_i^Q,\, K W_i^K,\, V W_i^V)&amp;lt;/math&amp;gt;。每个头可以学习关注输入的不同方面——例如，一个头可能捕获句法关系，另一个捕获语义关系。典型配置使用8个或16个头。&lt;br /&gt;
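多头注意力的一个自包含 numpy 草图如下（头数与维度为玩具设定，省略了批量维度与掩码）：

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, heads, W_O):
    # heads 是每个头的 (W_Q, W_K, W_V) 投影矩阵列表
    outs = []
    for W_Q, W_K, W_V in heads:
        Q, K, V = X @ W_Q, X @ W_K, X @ W_V        # 每个头独立投影
        d_head = Q.shape[-1]
        A = softmax(Q @ K.T / np.sqrt(d_head))     # 头内做缩放点积注意力
        outs.append(A @ V)
    return np.concatenate(outs, axis=-1) @ W_O     # 拼接后做输出投影 W^O

n, d, h = 4, 16, 4
heads = [tuple(rng.normal(size=(d, d // h)) for _ in range(3)) for _ in range(h)]
W_O = rng.normal(size=(d, d))
Y = multi_head_attention(rng.normal(size=(n, d)), heads, W_O)
```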
&lt;br /&gt;
== 位置编码 ==&lt;br /&gt;
&lt;br /&gt;
由于自注意力是置换不变的（Permutation-invariant）（它将输入视为无序集合），位置信息必须被显式注入。原始Transformer使用正弦位置编码：&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\mathrm{PE}(pos, 2i) = \sin\!\left(\frac{pos}{10000^{2i/d}}\right), \quad \mathrm{PE}(pos, 2i+1) = \cos\!\left(\frac{pos}{10000^{2i/d}}\right)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
可学习的位置嵌入和相对位置编码（例如RoPE、ALiBi）是常见的替代方案，它们可以更好地泛化到未见过的序列长度。&lt;br /&gt;
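正弦位置编码可以按上式直接构造（numpy 示意，位置数与维度为假设值）：

```python
import numpy as np

def sinusoidal_position_encoding(n_pos, d):
    # PE(pos, 2i) = sin(pos / 10000^(2i/d)); PE(pos, 2i+1) = cos(同一角度)
    pos = np.arange(n_pos)[:, None]
    i = np.arange(d // 2)[None, :]
    angle = pos / np.power(10000.0, 2.0 * i / d)
    pe = np.zeros((n_pos, d))
    pe[:, 0::2] = np.sin(angle)   # 偶数维用 sin
    pe[:, 1::2] = np.cos(angle)   # 奇数维用 cos
    return pe

pe = sinusoidal_position_encoding(50, 16)
```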
&lt;br /&gt;
== 交叉注意力 ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;交叉注意力（Cross-Attention）&#039;&#039;&#039;用于查询来自一个序列而键/值来自另一个序列的情况。在编码器-解码器Transformer中，解码器通过交叉注意力关注编码器输出，使模型能够根据完整的输入上下文来生成。&lt;br /&gt;
&lt;br /&gt;
== 实践注意事项 ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;掩码（Masking）&#039;&#039;&#039;：在自回归解码中，未来位置被屏蔽（在softmax之前设为 &amp;lt;math&amp;gt;-\infty&amp;lt;/math&amp;gt;）以保持因果结构。&lt;br /&gt;
* &#039;&#039;&#039;注意力Dropout&#039;&#039;&#039;：在训练期间随机丢弃注意力权重起到正则化的作用，减少对特定对齐模式的过拟合。&lt;br /&gt;
* &#039;&#039;&#039;键值缓存（KV Cache）&#039;&#039;&#039;：在推理过程中，缓存先前计算的键和值向量以避免冗余计算，显著加速自回归生成。&lt;br /&gt;
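其中的因果掩码可以用几行 numpy 演示（玩具规模的示意）：

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
scores = rng.normal(size=(n, n))               # 未归一化的注意力分数

causal = np.tril(np.ones((n, n), dtype=bool))  # 下三角：只允许关注当前及之前的位置
masked = np.where(causal, scores, -np.inf)     # 未来位置在 softmax 前设为 -inf

masked = masked - masked.max(axis=-1, keepdims=True)
weights = np.exp(masked)                       # exp(-inf) = 0，未来位置权重为零
weights = weights / weights.sum(axis=-1, keepdims=True)
```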
&lt;br /&gt;
== 参见 ==&lt;br /&gt;
&lt;br /&gt;
* [[Transformer]]&lt;br /&gt;
* [[Recurrent Neural Networks]]&lt;br /&gt;
* [[Sequence-to-sequence models]]&lt;br /&gt;
* [[Self-supervised learning]]&lt;br /&gt;
* [[Softmax Function]]&lt;br /&gt;
&lt;br /&gt;
== 参考文献 ==&lt;br /&gt;
&lt;br /&gt;
* Bahdanau, D., Cho, K. and Bengio, Y. (2015). &amp;quot;Neural Machine Translation by Jointly Learning to Align and Translate&amp;quot;. &#039;&#039;ICLR&#039;&#039;.&lt;br /&gt;
* Luong, M.-T., Pham, H. and Manning, C. D. (2015). &amp;quot;Effective Approaches to Attention-based Neural Machine Translation&amp;quot;. &#039;&#039;EMNLP&#039;&#039;.&lt;br /&gt;
* Vaswani, A. et al. (2017). &amp;quot;Attention Is All You Need&amp;quot;. &#039;&#039;NeurIPS&#039;&#039;.&lt;br /&gt;
* Shaw, P., Uszkoreit, J. and Vaswani, A. (2018). &amp;quot;Self-Attention with Relative Position Representations&amp;quot;. &#039;&#039;NAACL&#039;&#039;.&lt;br /&gt;
* Su, J. et al. (2021). &amp;quot;RoFormer: Enhanced Transformer with Rotary Position Embedding&amp;quot;. &#039;&#039;arXiv:2104.09864&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
[[Category:Deep Learning]]&lt;br /&gt;
[[Category:Advanced]]&lt;br /&gt;
[[Category:Neural Networks]]&lt;/div&gt;</summary>
		<author><name>DeployBot</name></author>
	</entry>
	<entry>
		<id>https://marovi.ai/index.php?title=Word_Embeddings/es&amp;diff=2162</id>
		<title>Word Embeddings/es</title>
		<link rel="alternate" type="text/html" href="https://marovi.ai/index.php?title=Word_Embeddings/es&amp;diff=2162"/>
		<updated>2026-04-24T07:09:01Z</updated>

		<summary type="html">&lt;p&gt;DeployBot: [deploy-bot] Deploy from CI (8c92aeb)&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{LanguageBar | page = Word Embeddings}}&lt;br /&gt;
{{ArticleInfobox | topic_area = NLP | difficulty = Intermediate | prerequisites = [[Neural Networks]]}}&lt;br /&gt;
{{ContentMeta | generated_by = claude-opus | model_used = claude-opus-4-6 | generated_date = 2026-03-13}}&lt;br /&gt;
&lt;br /&gt;
Los &#039;&#039;&#039;word embeddings&#039;&#039;&#039; son representaciones vectoriales densas y de baja dimensionalidad de palabras en las que las palabras semánticamente similares se mapean a puntos cercanos en el espacio vectorial. Son un componente fundamental del procesamiento del lenguaje natural (PLN) moderno, reemplazando las codificaciones dispersas one-hot por representaciones que capturan significado, analogía y relaciones sintácticas.&lt;br /&gt;
&lt;br /&gt;
== La hipótesis distribucional ==&lt;br /&gt;
&lt;br /&gt;
Los word embeddings se fundamentan en la &#039;&#039;&#039;hipótesis distribucional&#039;&#039;&#039;, enunciada de forma célebre por J. R. Firth (1957): &amp;quot;Conocerás una palabra por la compañía que mantiene.&amp;quot; La idea es que las palabras que aparecen en contextos similares tienden a tener significados similares. Por ejemplo, &amp;quot;perro&amp;quot; y &amp;quot;gato&amp;quot; aparecen frecuentemente cerca de palabras como &amp;quot;mascota&amp;quot;, &amp;quot;pelo&amp;quot; y &amp;quot;veterinario&amp;quot;, por lo que deberían tener representaciones similares.&lt;br /&gt;
&lt;br /&gt;
Los enfoques tempranos para explotar la información distribucional incluyen matrices de coocurrencia, información mutua puntual (PMI) y análisis semántico latente (LSA). Los métodos modernos de word embeddings aprenden vectores densos directamente utilizando redes neuronales.&lt;br /&gt;
&lt;br /&gt;
== Representaciones one-hot vs. densas ==&lt;br /&gt;
&lt;br /&gt;
=== Codificación one-hot ===&lt;br /&gt;
&lt;br /&gt;
En un vocabulario de &amp;lt;math&amp;gt;V&amp;lt;/math&amp;gt; palabras, un vector one-hot para la &amp;lt;math&amp;gt;i&amp;lt;/math&amp;gt;-ésima palabra es un vector de &amp;lt;math&amp;gt;V&amp;lt;/math&amp;gt; dimensiones con un 1 en la posición &amp;lt;math&amp;gt;i&amp;lt;/math&amp;gt; y 0 en el resto. Esta representación tiene dos deficiencias críticas:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Dimensionalidad&#039;&#039;&#039; — los vectores son de dimensión extremadamente alta (típicamente &amp;lt;math&amp;gt;V &amp;gt; 100{,}000&amp;lt;/math&amp;gt;).&lt;br /&gt;
* &#039;&#039;&#039;Sin similitud&#039;&#039;&#039; — cada par de vectores one-hot es igualmente distante: &amp;lt;math&amp;gt;\mathbf{e}_i^\top \mathbf{e}_j = 0&amp;lt;/math&amp;gt; para &amp;lt;math&amp;gt;i \neq j&amp;lt;/math&amp;gt;. &amp;quot;Gato&amp;quot; está tan lejos de &amp;quot;perro&amp;quot; como lo está de &amp;quot;democracia&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
=== Embeddings densos ===&lt;br /&gt;
&lt;br /&gt;
Un word embedding mapea cada palabra a un vector de valores reales de &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt; dimensiones (típicamente &amp;lt;math&amp;gt;d = 100&amp;lt;/math&amp;gt;–&amp;lt;math&amp;gt;300&amp;lt;/math&amp;gt;):&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\mathbf{w}_i \in \mathbb{R}^d, \quad d \ll V&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Las palabras similares tienen una alta similitud coseno:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\text{sim}(\mathbf{w}_a, \mathbf{w}_b) = \frac{\mathbf{w}_a \cdot \mathbf{w}_b}{\|\mathbf{w}_a\|\;\|\mathbf{w}_b\|}&amp;lt;/math&amp;gt;&lt;br /&gt;
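La similitud coseno definida arriba puede ilustrarse con un esbozo mínimo en Python (los vectores de ejemplo son hipotéticos, elegidos solo para mostrar la geometría, no embeddings reales):

```python
import numpy as np

def similitud_coseno(a, b):
    # producto punto normalizado por las magnitudes de ambos vectores
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# vectores de juguete: "gato" y "perro" apuntan en direcciones parecidas
gato       = np.array([0.9, 0.8, 0.1])
perro      = np.array([0.8, 0.9, 0.2])
democracia = np.array([0.1, 0.2, 0.9])

cercanos = similitud_coseno(gato, perro)        # alto
lejanos  = similitud_coseno(gato, democracia)   # bajo
```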
&lt;br /&gt;
== Word2Vec ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Word2Vec&#039;&#039;&#039; (Mikolov et al., 2013) introdujo dos arquitecturas eficientes para aprender word embeddings a partir de grandes corpus.&lt;br /&gt;
&lt;br /&gt;
=== Bolsa continua de palabras (CBOW) ===&lt;br /&gt;
&lt;br /&gt;
CBOW predice una palabra objetivo a partir de sus palabras de contexto circundantes. Dada una ventana de palabras de contexto &amp;lt;math&amp;gt;\{w_{t-c}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+c}\}&amp;lt;/math&amp;gt;, el modelo maximiza:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;P(w_t \mid w_{t-c}, \ldots, w_{t+c})&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Los vectores de las palabras de contexto se promedian y se pasan a través de una capa softmax. CBOW es más rápido de entrenar y funciona bien para palabras frecuentes.&lt;br /&gt;
&lt;br /&gt;
=== Skip-gram ===&lt;br /&gt;
&lt;br /&gt;
Skip-gram invierte la predicción: dada una palabra objetivo, predice las palabras de contexto circundantes. Para cada par &amp;lt;math&amp;gt;(w_t, w_{t+j})&amp;lt;/math&amp;gt; donde &amp;lt;math&amp;gt;j \in [-c, c] \setminus \{0\}&amp;lt;/math&amp;gt;, el modelo maximiza:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;P(w_{t+j} \mid w_t) = \frac{\exp(\mathbf{v}&#039;_{w_{t+j}}{}^\top \mathbf{v}_{w_t})}{\sum_{w=1}^{V}\exp(\mathbf{v}&#039;_w{}^\top \mathbf{v}_{w_t})}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
donde &amp;lt;math&amp;gt;\mathbf{v}_w&amp;lt;/math&amp;gt; y &amp;lt;math&amp;gt;\mathbf{v}&#039;_w&amp;lt;/math&amp;gt; son los vectores de embedding de entrada y salida. Calcular el softmax completo sobre el vocabulario es costoso, por lo que se utilizan dos aproximaciones comunes:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Muestreo negativo&#039;&#039;&#039; — en lugar de calcular el softmax completo, el modelo contrasta la palabra de contexto verdadera contra &amp;lt;math&amp;gt;k&amp;lt;/math&amp;gt; palabras &amp;quot;negativas&amp;quot; muestreadas aleatoriamente.&lt;br /&gt;
* &#039;&#039;&#039;Softmax jerárquico&#039;&#039;&#039; — organiza el vocabulario en un árbol binario, reduciendo el coste del softmax de &amp;lt;math&amp;gt;O(V)&amp;lt;/math&amp;gt; a &amp;lt;math&amp;gt;O(\log V)&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
Skip-gram funciona bien con palabras infrecuentes y captura relaciones sutiles. La famosa analogía &amp;quot;rey - hombre + mujer ≈ reina&amp;quot; surgió de embeddings Skip-gram.&lt;br /&gt;
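La pérdida de muestreo negativo descrita arriba puede esbozarse así (implementación puramente ilustrativa, no la de ninguna biblioteca concreta; los vectores son aleatorios):

```python
import numpy as np

def sigmoide(x):
    return 1.0 / (1.0 + np.exp(-x))

def perdida_muestreo_negativo(v_objetivo, u_contexto, u_negativos):
    # término positivo: el par verdadero (objetivo, contexto) debe puntuar alto
    pos = np.log(sigmoide(u_contexto @ v_objetivo))
    # término negativo: k pares muestreados al azar deben puntuar bajo
    neg = np.sum(np.log(sigmoide(-(u_negativos @ v_objetivo))))
    return float(-(pos + neg))

rng = np.random.default_rng(0)
d, k = 8, 5   # dimensión del embedding y número de muestras negativas
perdida = perdida_muestreo_negativo(
    rng.normal(size=d), rng.normal(size=d), rng.normal(size=(k, d)))
```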
&lt;br /&gt;
== GloVe ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;GloVe&#039;&#039;&#039; (Global Vectors, Pennington et al., 2014) combina las fortalezas de la factorización matricial global y los métodos de ventana de contexto local. Construye una matriz de coocurrencia de palabras &amp;lt;math&amp;gt;X&amp;lt;/math&amp;gt; a partir del corpus, donde &amp;lt;math&amp;gt;X_{ij}&amp;lt;/math&amp;gt; cuenta con qué frecuencia la palabra &amp;lt;math&amp;gt;j&amp;lt;/math&amp;gt; aparece en el contexto de la palabra &amp;lt;math&amp;gt;i&amp;lt;/math&amp;gt;, y luego optimiza:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;J = \sum_{i,j=1}^{V} f(X_{ij})\bigl(\mathbf{w}_i^\top \tilde{\mathbf{w}}_j + b_i + \tilde{b}_j - \log X_{ij}\bigr)^2&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
donde &amp;lt;math&amp;gt;f&amp;lt;/math&amp;gt; es una función de ponderación que limita la influencia de coocurrencias muy frecuentes. Los embeddings de GloVe a menudo igualan o superan la calidad de Word2Vec, y el uso explícito de estadísticas globales puede mejorar el rendimiento en tareas de analogía.&lt;br /&gt;
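La función de ponderación del artículo original de GloVe recorta en un umbral con exponente 3/4; un esbozo mínimo (los valores por defecto corresponden a los reportados en el artículo):

```python
def f_ponderacion(x, x_max=100.0, alfa=0.75):
    # crece como (x/x_max)^alfa y se satura en 1 cuando x alcanza x_max,
    # limitando la influencia de coocurrencias muy frecuentes
    return min(1.0, (x / x_max) ** alfa)
```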
&lt;br /&gt;
== fastText ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;fastText&#039;&#039;&#039; (Bojanowski et al., 2017) extiende Word2Vec representando cada palabra como una bolsa de n-gramas de caracteres. Por ejemplo, la palabra &amp;quot;donde&amp;quot; con &amp;lt;math&amp;gt;n = 3&amp;lt;/math&amp;gt; se representa por los n-gramas {&amp;quot;&amp;lt;do&amp;quot;, &amp;quot;don&amp;quot;, &amp;quot;ond&amp;quot;, &amp;quot;nde&amp;quot;, &amp;quot;de&amp;gt;&amp;quot;} más la palabra completa &amp;quot;&amp;lt;donde&amp;gt;&amp;quot;. El embedding de una palabra es la suma de sus vectores de n-gramas.&lt;br /&gt;
&lt;br /&gt;
Este enfoque tiene dos ventajas clave:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Manejo de palabras raras y no vistas&#039;&#039;&#039; — incluso las palabras que no están en el vocabulario de entrenamiento pueden recibir embeddings al sumar sus vectores de n-gramas de caracteres.&lt;br /&gt;
* &#039;&#039;&#039;Conciencia morfológica&#039;&#039;&#039; — las palabras que comparten subcadenas (por ejemplo, &amp;quot;enseñar&amp;quot;, &amp;quot;enseñanza&amp;quot;, &amp;quot;enseñante&amp;quot;) comparten automáticamente componentes del embedding.&lt;br /&gt;
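La descomposición en n-gramas de caracteres puede reproducirse con unas pocas líneas de Python (esbozo ilustrativo del esquema descrito arriba):

```python
def ngramas_de_caracteres(palabra, n=3):
    # la palabra se delimita con símbolos de inicio y fin antes de extraer n-gramas
    w = "<" + palabra + ">"
    gramas = [w[i:i + n] for i in range(len(w) - n + 1)]
    return gramas + [w]  # los n-gramas más la palabra completa delimitada

resultado = ngramas_de_caracteres("donde")
```

Para "donde" el resultado reproduce exactamente el conjunto del párrafo anterior.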
&lt;br /&gt;
== Evaluación de embeddings ==&lt;br /&gt;
&lt;br /&gt;
Los word embeddings se evaluan mediante:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! Tipo de evaluación !! Ejemplos !! Qué mide&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;Intrínseca: analogía&#039;&#039;&#039; || &amp;quot;rey : reina :: hombre : ?&amp;quot; || Estructura lineal del espacio&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;Intrínseca: similitud&#039;&#039;&#039; || Correlación con juicios de similitud humanos (SimLex-999, WS-353) || Calidad semántica&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;Extrínseca: tarea posterior&#039;&#039;&#039; || Reconocimiento de entidades nombradas, análisis de sentimiento, parsing || Utilidad práctica&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Las evaluaciones intrínsecas son rápidas pero no siempre predicen el rendimiento en tareas posteriores. La evaluación extrínseca en la tarea objetivo es, en última instancia, la medida más fiable.&lt;br /&gt;
&lt;br /&gt;
== Embeddings contextuales ==&lt;br /&gt;
&lt;br /&gt;
Los word embeddings tradicionales asignan un único vector por palabra independientemente del contexto: la palabra &amp;quot;banco&amp;quot; tiene el mismo embedding ya sea que se refiera a un banco de río o a una institución financiera. Los &#039;&#039;&#039;embeddings contextuales&#039;&#039;&#039; abordan esta limitación produciendo representaciones diferentes según el texto circundante.&lt;br /&gt;
&lt;br /&gt;
Los modelos de embeddings contextuales más notables incluyen:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;ELMo&#039;&#039;&#039; (Peters et al., 2018) — utiliza un LSTM bidireccional para generar representaciones de palabras dependientes del contexto.&lt;br /&gt;
* &#039;&#039;&#039;BERT&#039;&#039;&#039; (Devlin et al., 2019) — utiliza un codificador Transformer entrenado con modelado de lenguaje enmascarado.&lt;br /&gt;
* &#039;&#039;&#039;Serie GPT&#039;&#039;&#039; (Radford et al., 2018–) — utiliza un decodificador Transformer entrenado de forma autorregresiva.&lt;br /&gt;
&lt;br /&gt;
Estos modelos han reemplazado en gran medida a los embeddings estáticos para la mayoría de las tareas de PLN, aunque los embeddings estáticos siguen siendo útiles por su eficiencia, su interpretabilidad y en entornos con recursos limitados.&lt;br /&gt;
&lt;br /&gt;
== Véase también ==&lt;br /&gt;
&lt;br /&gt;
* [[Neural Networks]]&lt;br /&gt;
* [[Recurrent Neural Networks]]&lt;br /&gt;
* [[Loss Functions]]&lt;br /&gt;
* [[Backpropagation]]&lt;br /&gt;
&lt;br /&gt;
== Referencias ==&lt;br /&gt;
&lt;br /&gt;
* Firth, J. R. (1957). &amp;quot;A synopsis of linguistic theory, 1930–1955&amp;quot;. In &#039;&#039;Studies in Linguistic Analysis&#039;&#039;.&lt;br /&gt;
* Mikolov, T. et al. (2013). &amp;quot;Efficient Estimation of Word Representations in Vector Space&amp;quot;. &#039;&#039;arXiv:1301.3781&#039;&#039;.&lt;br /&gt;
* Pennington, J., Socher, R. and Manning, C. D. (2014). &amp;quot;GloVe: Global Vectors for Word Representation&amp;quot;. &#039;&#039;EMNLP&#039;&#039;.&lt;br /&gt;
* Bojanowski, P. et al. (2017). &amp;quot;Enriching Word Vectors with Subword Information&amp;quot;. &#039;&#039;TACL&#039;&#039;, 5, 135–146.&lt;br /&gt;
* Peters, M. E. et al. (2018). &amp;quot;Deep contextualized word representations&amp;quot;. &#039;&#039;NAACL&#039;&#039;.&lt;br /&gt;
* Devlin, J. et al. (2019). &amp;quot;BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding&amp;quot;. &#039;&#039;NAACL&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
[[Category:NLP]]&lt;br /&gt;
[[Category:Intermediate]]&lt;/div&gt;</summary>
		<author><name>DeployBot</name></author>
	</entry>
	<entry>
		<id>https://marovi.ai/index.php?title=Transfer_Learning/es&amp;diff=2161</id>
		<title>Transfer Learning/es</title>
		<link rel="alternate" type="text/html" href="https://marovi.ai/index.php?title=Transfer_Learning/es&amp;diff=2161"/>
		<updated>2026-04-24T07:09:01Z</updated>

		<summary type="html">&lt;p&gt;DeployBot: [deploy-bot] Deploy from CI (8c92aeb)&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{LanguageBar | page = Transfer Learning}}&lt;br /&gt;
{{ArticleInfobox | topic_area = Machine Learning | difficulty = Intermediate | prerequisites = [[Neural Networks]]}}&lt;br /&gt;
{{ContentMeta | generated_by = claude-opus | model_used = claude-opus-4-6 | generated_date = 2026-03-13}}&lt;br /&gt;
&lt;br /&gt;
El &#039;&#039;&#039;transfer learning&#039;&#039;&#039; (aprendizaje por transferencia) es una técnica de aprendizaje automático en la que un modelo entrenado en una tarea se reutiliza como punto de partida para un modelo en una tarea diferente pero relacionada. Al aprovechar el conocimiento adquirido durante un preentrenamiento a gran escala, el transfer learning reduce drásticamente la cantidad de datos etiquetados, cómputo y tiempo de entrenamiento requeridos para las aplicaciones posteriores.&lt;br /&gt;
&lt;br /&gt;
== Motivación ==&lt;br /&gt;
&lt;br /&gt;
Entrenar redes neuronales profundas desde cero típicamente requiere grandes conjuntos de datos y recursos computacionales significativos. En muchos dominios prácticos (imagen médica, análisis de textos legales, idiomas con pocos recursos) los datos etiquetados son escasos. El transfer learning aborda esta asimetría: un modelo preentrenado en una tarea fuente rica en datos captura características generales (bordes, texturas, patrones sintácticos) que se transfieren bien a una tarea objetivo con pocos datos.&lt;br /&gt;
&lt;br /&gt;
== Conceptos clave ==&lt;br /&gt;
&lt;br /&gt;
=== Dominio y tarea ===&lt;br /&gt;
&lt;br /&gt;
Formalmente, un &#039;&#039;&#039;dominio&#039;&#039;&#039; &amp;lt;math&amp;gt;\mathcal{D} = \{\mathcal{X}, P(X)\}&amp;lt;/math&amp;gt; consiste en un espacio de características &amp;lt;math&amp;gt;\mathcal{X}&amp;lt;/math&amp;gt; y una distribución marginal &amp;lt;math&amp;gt;P(X)&amp;lt;/math&amp;gt;. Una &#039;&#039;&#039;tarea&#039;&#039;&#039; &amp;lt;math&amp;gt;\mathcal{T} = \{\mathcal{Y}, f(\cdot)\}&amp;lt;/math&amp;gt; consiste en un espacio de etiquetas &amp;lt;math&amp;gt;\mathcal{Y}&amp;lt;/math&amp;gt; y una función predictiva &amp;lt;math&amp;gt;f&amp;lt;/math&amp;gt;. El transfer learning se aplica cuando la fuente y el objetivo difieren en dominio, tarea o ambos.&lt;br /&gt;
&lt;br /&gt;
=== Adaptación de dominio ===&lt;br /&gt;
&lt;br /&gt;
Cuando la fuente y el objetivo comparten la misma tarea pero difieren en la distribución de datos (&amp;lt;math&amp;gt;P_s(X) \neq P_t(X)&amp;lt;/math&amp;gt;), el problema se denomina &#039;&#039;&#039;adaptación de dominio&#039;&#039;&#039;. Las técnicas incluyen:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Reponderación de instancias&#039;&#039;&#039; — ajustar los pesos de las muestras para que la distribución fuente se aproxime a la objetivo.&lt;br /&gt;
* &#039;&#039;&#039;Alineamiento de características&#039;&#039;&#039; — aprender representaciones invariantes al dominio (por ejemplo, mediante entrenamiento adversario o discrepancia de medias máxima).&lt;br /&gt;
* &#039;&#039;&#039;Autoentrenamiento&#039;&#039;&#039; — utilizar las predicciones del modelo sobre datos objetivo no etiquetados como pseudoetiquetas.&lt;br /&gt;
&lt;br /&gt;
== Ajuste fino vs. extracción de características ==&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! Estrategia !! Descripción !! Cuándo utilizar&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;Extracción de características&#039;&#039;&#039; || Congelar todas las capas preentrenadas; entrenar solo una nueva cabeza de salida || Conjunto de datos objetivo muy pequeño; fuente y objetivo están estrechamente relacionados&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;Ajuste fino (completo)&#039;&#039;&#039; || Descongelar todas las capas y entrenar de extremo a extremo con una tasa de aprendizaje pequeña || Conjunto de datos objetivo moderado; fuente y objetivo difieren significativamente&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;Descongelamiento gradual&#039;&#039;&#039; || Descongelar progresivamente las capas de arriba hacia abajo durante el entrenamiento || Equilibra la estabilidad de las características inferiores con la adaptación de las superiores&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Una heurística común es utilizar una tasa de aprendizaje entre 10 y 100 veces menor para las capas preentrenadas que para la nueva cabeza de clasificación, previniendo así el olvido catastrófico de las representaciones aprendidas.&lt;br /&gt;
&lt;br /&gt;
== Modelos preentrenados ==&lt;br /&gt;
&lt;br /&gt;
=== Visión por computador ===&lt;br /&gt;
&lt;br /&gt;
Las redes convolucionales preentrenadas en ImageNet (ResNet, EfficientNet, ViT) sirven como columnas vertebrales estándar. Las capas inferiores aprenden características universales como bordes y texturas, mientras que las capas superiores aprenden patrones específicos de la tarea. Ajustar un modelo de ImageNet sobre un conjunto de datos de imagen médica con solo unos pocos miles de imágenes supera rutinariamente al entrenamiento desde cero.&lt;br /&gt;
&lt;br /&gt;
=== Procesamiento del lenguaje natural ===&lt;br /&gt;
&lt;br /&gt;
El preentrenamiento de modelos de lenguaje transformó el PLN. Los hitos clave incluyen:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Word2Vec / GloVe&#039;&#039;&#039; — word embeddings estáticos preentrenados en grandes corpus.&lt;br /&gt;
* &#039;&#039;&#039;ELMo&#039;&#039;&#039; — embeddings contextualizados a partir de LSTM bidireccionales.&lt;br /&gt;
* &#039;&#039;&#039;BERT&#039;&#039;&#039; (Devlin et al., 2019) — Transformer bidireccional preentrenado con modelado de lenguaje enmascarado; ajustado para clasificación, respuesta a preguntas, reconocimiento de entidades nombradas y más.&lt;br /&gt;
* &#039;&#039;&#039;Serie GPT&#039;&#039;&#039; — Transformers autorregresivos que demuestran que la escala y el preentrenamiento permiten la transferencia con pocos ejemplos e incluso sin ejemplos.&lt;br /&gt;
&lt;br /&gt;
== Cuándo utilizar transfer learning ==&lt;br /&gt;
&lt;br /&gt;
El transfer learning es más beneficioso cuando:&lt;br /&gt;
&lt;br /&gt;
# El conjunto de datos objetivo es pequeño en relación con la capacidad del modelo.&lt;br /&gt;
# Los dominios fuente y objetivo comparten similitudes estructurales (por ejemplo, ambos involucran imágenes naturales o lenguaje natural).&lt;br /&gt;
# Los recursos computacionales para un preentrenamiento completo no están disponibles.&lt;br /&gt;
# Se necesita un prototipado rápido antes de comprometerse con la recolección de datos a gran escala.&lt;br /&gt;
&lt;br /&gt;
Puede perjudicar el rendimiento (&#039;&#039;&#039;transferencia negativa&#039;&#039;&#039;) cuando los dominios fuente y objetivo son fundamentalmente disímiles; por ejemplo, transferir de imágenes naturales a espectrogramas sin una adaptación adecuada.&lt;br /&gt;
&lt;br /&gt;
== Consejos prácticos ==&lt;br /&gt;
&lt;br /&gt;
* El &#039;&#039;&#039;aumento de datos&#039;&#039;&#039; complementa el transfer learning al expandir artificialmente el tamaño efectivo del conjunto de datos objetivo.&lt;br /&gt;
* El &#039;&#039;&#039;calentamiento de la tasa de aprendizaje&#039;&#039;&#039; ayuda a estabilizar el entrenamiento inicial al ajustar modelos preentrenados grandes.&lt;br /&gt;
* La &#039;&#039;&#039;parada temprana&#039;&#039;&#039; sobre un conjunto de validación previene el sobreajuste durante el ajuste fino, especialmente con conjuntos de datos pequeños.&lt;br /&gt;
* El &#039;&#039;&#039;decaimiento de tasa de aprendizaje por capas&#039;&#039;&#039; asigna tasas menores a las capas iniciales (más generales) y tasas mayores a las capas posteriores (más específicas de la tarea).&lt;br /&gt;
* La &#039;&#039;&#039;transferencia de tarea intermedia&#039;&#039;&#039; — ajustar primero en una tarea intermedia relacionada (por ejemplo, inferencia de lenguaje natural antes de análisis de sentimiento) puede mejorar aún más los resultados.&lt;br /&gt;
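El decaimiento de tasa de aprendizaje por capas puede esbozarse como una simple lista de tasas (esquema ilustrativo; el valor del factor es hipotético y dependería del modelo concreto):

```python
def tasas_por_capa(tasa_base, num_capas, factor=0.8):
    # la última capa (la más específica de la tarea) recibe la tasa base;
    # cada capa anterior recibe la tasa multiplicada otra vez por el factor
    return [tasa_base * factor ** (num_capas - 1 - i) for i in range(num_capas)]

tasas = tasas_por_capa(1e-3, 12)
```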
&lt;br /&gt;
== Evaluación ==&lt;br /&gt;
&lt;br /&gt;
La efectividad del transfer learning se mide típicamente comparando:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\Delta_{\mathrm{transfer}} = \mathrm{Acc}_{\mathrm{transfer}} - \mathrm{Acc}_{\mathrm{scratch}}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Un &amp;lt;math&amp;gt;\Delta_{\mathrm{transfer}}&amp;lt;/math&amp;gt; positivo indica una transferencia de conocimiento exitosa. Los profesionales también monitorizan la velocidad de convergencia, ya que los modelos transferidos a menudo alcanzan el rendimiento objetivo en una fracción de las épocas.&lt;br /&gt;
&lt;br /&gt;
== Véase también ==&lt;br /&gt;
&lt;br /&gt;
* [[Neural Networks]]&lt;br /&gt;
* [[Transformer]]&lt;br /&gt;
* [[Self-supervised learning]]&lt;br /&gt;
* [[Domain adaptation]]&lt;br /&gt;
* [[Fine-tuning]]&lt;br /&gt;
&lt;br /&gt;
== Referencias ==&lt;br /&gt;
&lt;br /&gt;
* Pan, S. J. and Yang, Q. (2010). &amp;quot;A Survey on Transfer Learning&amp;quot;. &#039;&#039;IEEE Transactions on Knowledge and Data Engineering&#039;&#039;.&lt;br /&gt;
* Yosinski, J. et al. (2014). &amp;quot;How transferable are features in deep neural networks?&amp;quot;. &#039;&#039;NeurIPS&#039;&#039;.&lt;br /&gt;
* Devlin, J. et al. (2019). &amp;quot;BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding&amp;quot;. &#039;&#039;NAACL&#039;&#039;.&lt;br /&gt;
* Howard, J. and Ruder, S. (2018). &amp;quot;Universal Language Model Fine-tuning for Text Classification&amp;quot;. &#039;&#039;ACL&#039;&#039;.&lt;br /&gt;
* Zhuang, F. et al. (2021). &amp;quot;A Comprehensive Survey on Transfer Learning&amp;quot;. &#039;&#039;Proceedings of the IEEE&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
[[Category:Machine Learning]]&lt;br /&gt;
[[Category:Intermediate]]&lt;/div&gt;</summary>
		<author><name>DeployBot</name></author>
	</entry>
	<entry>
		<id>https://marovi.ai/index.php?title=Stochastic_Gradient_Descent/es&amp;diff=2160</id>
		<title>Stochastic Gradient Descent/es</title>
		<link rel="alternate" type="text/html" href="https://marovi.ai/index.php?title=Stochastic_Gradient_Descent/es&amp;diff=2160"/>
		<updated>2026-04-24T07:09:01Z</updated>

		<summary type="html">&lt;p&gt;DeployBot: [deploy-bot] Deploy from CI (8c92aeb)&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;El &#039;&#039;&#039;descenso de gradiente estocástico&#039;&#039;&#039; (a menudo abreviado &#039;&#039;&#039;SGD&#039;&#039;&#039;, del inglés &#039;&#039;stochastic gradient descent&#039;&#039;) es un algoritmo de optimización iterativa utilizado para minimizar una función objetivo expresada como la suma de subfunciones diferenciables. Es el motor principal del entrenamiento moderno de aprendizaje automático, impulsando desde la regresión logística hasta las redes neuronales profundas.&lt;br /&gt;
&lt;br /&gt;
== Motivación ==&lt;br /&gt;
&lt;br /&gt;
En el [[gradient descent|descenso de gradiente]] clásico, el gradiente completo de la función de pérdida se calcula sobre todo el conjunto de entrenamiento antes de cada actualización de parámetros. Cuando el conjunto de datos es grande, esto resulta prohibitivamente costoso. El SGD aborda este problema estimando el gradiente a partir de una sola muestra seleccionada aleatoriamente (o un pequeño &#039;&#039;&#039;mini-lote&#039;&#039;&#039;) en cada paso, intercambiando una estimación más ruidosa por un costo por iteración drásticamente menor.&lt;br /&gt;
&lt;br /&gt;
== Algoritmo ==&lt;br /&gt;
&lt;br /&gt;
Dada una función de pérdida parametrizada&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;L(\theta) = \frac{1}{N}\sum_{i=1}^{N} \ell(\theta;\, x_i,\, y_i)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
la regla de actualización del SGD en el paso &amp;lt;math&amp;gt;t&amp;lt;/math&amp;gt; es:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\theta_{t+1} = \theta_t - \eta_t \,\nabla_\theta \ell(\theta_t;\, x_{i_t},\, y_{i_t})&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
donde &amp;lt;math&amp;gt;\eta_t&amp;lt;/math&amp;gt; es la &#039;&#039;&#039;tasa de aprendizaje&#039;&#039;&#039; (tamaño de paso) e &amp;lt;math&amp;gt;i_t&amp;lt;/math&amp;gt; es un índice seleccionado aleatoriamente.&lt;br /&gt;
&lt;br /&gt;
=== Variante con mini-lotes ===&lt;br /&gt;
&lt;br /&gt;
En la práctica se utiliza un &#039;&#039;&#039;mini-lote&#039;&#039;&#039; de &amp;lt;math&amp;gt;B&amp;lt;/math&amp;gt; muestras:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\theta_{t+1} = \theta_t - \frac{\eta_t}{B}\sum_{j=1}^{B} \nabla_\theta \ell(\theta_t;\, x_{i_j},\, y_{i_j})&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Los tamaños de lote habituales oscilan entre 32 y 512. Lotes más grandes reducen la varianza del gradiente, pero incrementan el uso de memoria.&lt;br /&gt;
&lt;br /&gt;
=== Pseudocódigo ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
inicializar parámetros θ&lt;br /&gt;
para época = 1, 2, … hacer&lt;br /&gt;
    mezclar conjunto de entrenamiento&lt;br /&gt;
    para cada mini-lote B ⊂ conjunto de entrenamiento hacer&lt;br /&gt;
        g ← (1/|B|) Σ ∇ℓ(θ; xᵢ, yᵢ)   # estimar gradiente&lt;br /&gt;
        θ ← θ − η · g                     # actualizar parámetros&lt;br /&gt;
    fin para&lt;br /&gt;
fin para&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
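El pseudocódigo anterior puede concretarse en un esquema mínimo en Python (ilustrativo; los datos sintéticos y los hiperparámetros son supuestos, no parte del artículo):&lt;br /&gt;

```python
import random

# Datos sintéticos: y = 2x + ruido (supuestos, solo para ilustrar)
random.seed(0)
datos = [(x, 2.0 * x + random.gauss(0, 0.01))
         for x in [i / 10 for i in range(1, 21)]]

theta = 0.0      # parámetro escalar inicial
eta = 0.1        # tasa de aprendizaje
B = 4            # tamaño de mini-lote

for epoca in range(50):
    random.shuffle(datos)                      # mezclar el conjunto
    for i in range(0, len(datos), B):
        lote = datos[i:i + B]
        # gradiente de la pérdida cuadrática media sobre el lote
        g = sum(2 * (theta * x - y) * x for x, y in lote) / len(lote)
        theta -= eta * g                       # actualización SGD

print(round(theta, 2))  # ≈ 2.0
```

Con estos datos, el parámetro aprendido converge a un valor cercano al verdadero (2.0), lo que ilustra que la estimación ruidosa del gradiente por mini-lotes basta para optimizar.&lt;br /&gt;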
&lt;br /&gt;
== Programas de tasa de aprendizaje ==&lt;br /&gt;
&lt;br /&gt;
La tasa de aprendizaje &amp;lt;math&amp;gt;\eta_t&amp;lt;/math&amp;gt; influye fuertemente en la convergencia. Las estrategias más comunes incluyen:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Constante&#039;&#039;&#039; — sencilla, pero puede sobrepasar el mínimo o estancarse.&lt;br /&gt;
* &#039;&#039;&#039;Decaimiento por pasos&#039;&#039;&#039; — multiplicar &amp;lt;math&amp;gt;\eta&amp;lt;/math&amp;gt; por un factor (por ejemplo, 0.1) cada &amp;lt;math&amp;gt;k&amp;lt;/math&amp;gt; épocas.&lt;br /&gt;
* &#039;&#039;&#039;Decaimiento exponencial&#039;&#039;&#039; — &amp;lt;math&amp;gt;\eta_t = \eta_0 \, e^{-\lambda t}&amp;lt;/math&amp;gt;.&lt;br /&gt;
* &#039;&#039;&#039;Recocido coseno&#039;&#039;&#039; — reduce suavemente la tasa siguiendo una curva coseno, a menudo con reinicios.&lt;br /&gt;
* &#039;&#039;&#039;Calentamiento lineal&#039;&#039;&#039; — aumentar gradualmente desde un &amp;lt;math&amp;gt;\eta&amp;lt;/math&amp;gt; pequeño durante las primeras iteraciones para estabilizar el entrenamiento inicial.&lt;br /&gt;
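Los programas anteriores pueden expresarse como funciones simples del paso de entrenamiento; el siguiente esbozo en Python usa valores de ejemplo supuestos:&lt;br /&gt;

```python
import math

# Programas de tasa de aprendizaje como funciones del paso t
# (eta0 y los demás hiperparámetros son valores de ejemplo supuestos)
eta0 = 0.1

def decaimiento_pasos(t, factor=0.1, k=30):
    """Multiplica eta por `factor` cada k épocas."""
    return eta0 * factor ** (t // k)

def decaimiento_exponencial(t, lam=0.01):
    """eta_t = eta0 * e^(-lambda t)."""
    return eta0 * math.exp(-lam * t)

def recocido_coseno(t, T_max=100):
    """Decae suavemente de eta0 a 0 siguiendo media curva coseno."""
    return 0.5 * eta0 * (1 + math.cos(math.pi * t / T_max))

def calentamiento_lineal(t, t_warmup=10):
    """Crece linealmente hasta eta0 durante las primeras iteraciones."""
    return eta0 * min(1.0, t / t_warmup)

print(round(decaimiento_pasos(60), 6))   # 0.001
print(round(recocido_coseno(50), 6))     # 0.05 (mitad del ciclo)
```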
&lt;br /&gt;
== Propiedades de convergencia ==&lt;br /&gt;
&lt;br /&gt;
Para objetivos convexos con gradientes Lipschitz-continuos, el SGD con una tasa de aprendizaje decreciente que satisfaga&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\sum_{t=1}^{\infty} \eta_t = \infty, \qquad \sum_{t=1}^{\infty} \eta_t^2 &amp;lt; \infty&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
converge casi seguramente al mínimo global (condiciones de Robbins–Monro). Para problemas no convexos —el régimen típico en aprendizaje profundo— el SGD converge a un punto estacionario, y la evidencia empírica muestra que a menudo encuentra buenos mínimos locales.&lt;br /&gt;
&lt;br /&gt;
== Variantes populares ==&lt;br /&gt;
&lt;br /&gt;
Varias extensiones reducen la varianza de la estimación del gradiente o adaptan el tamaño de paso por parámetro:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! Método !! Idea clave !! Referencia&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;Momentum&#039;&#039;&#039; || Acumula un promedio móvil con decaimiento exponencial de gradientes pasados || Polyak, 1964&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;Gradiente acelerado de Nesterov&#039;&#039;&#039; || Evalúa el gradiente en una posición &amp;quot;anticipada&amp;quot; || Nesterov, 1983&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;Adagrad&#039;&#039;&#039; || Tasas por parámetro que disminuyen para características actualizadas frecuentemente || Duchi et al., 2011&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;RMSProp&#039;&#039;&#039; || Corrige las tasas decrecientes de Adagrad usando un promedio móvil de gradientes al cuadrado || Hinton (notas de clase), 2012&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;Adam&#039;&#039;&#039; || Combina momentum con tasas adaptativas al estilo RMSProp || Kingma y Ba, 2015&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;AdamW&#039;&#039;&#039; || Desacopla el decaimiento de pesos (weight decay) del paso de gradiente adaptativo || Loshchilov y Hutter, 2019&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== Consideraciones prácticas ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Mezcla de datos&#039;&#039;&#039; — Reordenar aleatoriamente el conjunto de datos en cada época para evitar patrones cíclicos.&lt;br /&gt;
* &#039;&#039;&#039;Recorte de gradiente&#039;&#039;&#039; — Limitar la norma del gradiente para prevenir actualizaciones explosivas, especialmente en redes recurrentes.&lt;br /&gt;
* &#039;&#039;&#039;Normalización por lotes&#039;&#039;&#039; — Normalizar las entradas de cada capa reduce la sensibilidad a la tasa de aprendizaje.&lt;br /&gt;
* &#039;&#039;&#039;Entrenamiento con precisión mixta&#039;&#039;&#039; — Usar punto flotante de media precisión acelera el SGD en GPUs modernas con una pérdida mínima de exactitud.&lt;br /&gt;
&lt;br /&gt;
== Aplicaciones ==&lt;br /&gt;
&lt;br /&gt;
El SGD y sus variantes se utilizan en prácticamente todas las áreas del aprendizaje automático:&lt;br /&gt;
&lt;br /&gt;
* Entrenamiento de redes neuronales profundas (visión por computadora, PLN, reconocimiento de voz)&lt;br /&gt;
* Modelos lineales a gran escala (regresión logística, SVM mediante SGD)&lt;br /&gt;
* Optimización de políticas en aprendizaje por refuerzo&lt;br /&gt;
* Sistemas de recomendación y filtrado colaborativo&lt;br /&gt;
* Escenarios de aprendizaje en línea donde los datos llegan en flujo continuo&lt;br /&gt;
&lt;br /&gt;
== Véase también ==&lt;br /&gt;
&lt;br /&gt;
* [[Gradient descent|Descenso de gradiente]]&lt;br /&gt;
* [[Backpropagation|Retropropagación]]&lt;br /&gt;
* [[Adam (optimiser)|Adam (optimizador)]]&lt;br /&gt;
* [[Learning rate|Tasa de aprendizaje]]&lt;br /&gt;
* [[Convex optimisation|Optimización convexa]]&lt;br /&gt;
&lt;br /&gt;
== Referencias ==&lt;br /&gt;
&lt;br /&gt;
* Robbins, H. y Monro, S. (1951). &amp;quot;A Stochastic Approximation Method&amp;quot;. &#039;&#039;Annals of Mathematical Statistics&#039;&#039;.&lt;br /&gt;
* Bottou, L. (2010). &amp;quot;Large-Scale Machine Learning with Stochastic Gradient Descent&amp;quot;. &#039;&#039;COMPSTAT&#039;&#039;.&lt;br /&gt;
* Kingma, D. P. y Ba, J. (2015). &amp;quot;Adam: A Method for Stochastic Optimization&amp;quot;. &#039;&#039;ICLR&#039;&#039;.&lt;br /&gt;
* Ruder, S. (2016). &amp;quot;An overview of gradient descent optimization algorithms&amp;quot;. &#039;&#039;arXiv:1609.04747&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
[[Category:Aprendizaje automático]]&lt;br /&gt;
[[Category:Algoritmos de optimización]]&lt;br /&gt;
[[Category:Métodos de gradiente]]&lt;/div&gt;</summary>
		<author><name>DeployBot</name></author>
	</entry>
	<entry>
		<id>https://marovi.ai/index.php?title=Softmax_Function/es&amp;diff=2159</id>
		<title>Softmax Function/es</title>
		<link rel="alternate" type="text/html" href="https://marovi.ai/index.php?title=Softmax_Function/es&amp;diff=2159"/>
		<updated>2026-04-24T07:09:01Z</updated>

		<summary type="html">&lt;p&gt;DeployBot: [deploy-bot] Deploy from CI (8c92aeb)&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{LanguageBar | page = Softmax Function}}&lt;br /&gt;
{{ArticleInfobox | topic_area = Machine Learning | difficulty = Introductory | prerequisites = }}&lt;br /&gt;
{{ContentMeta | generated_by = claude-opus | model_used = claude-opus-4-6 | generated_date = 2026-03-13}}&lt;br /&gt;
&lt;br /&gt;
La &#039;&#039;&#039;función softmax&#039;&#039;&#039; (también llamada &#039;&#039;&#039;función exponencial normalizada&#039;&#039;&#039;) es una función matemática que convierte un vector de números reales (&#039;&#039;&#039;logits&#039;&#039;&#039;) en una distribución de probabilidad. Es la activación de salida estándar para la clasificación multiclase en redes neuronales y desempeña un papel central en modelos que van desde la regresión logística hasta los grandes modelos de lenguaje.&lt;br /&gt;
&lt;br /&gt;
== Definición ==&lt;br /&gt;
&lt;br /&gt;
Dado un vector de logits &amp;lt;math&amp;gt;\mathbf{z} = (z_1, z_2, \dots, z_K)&amp;lt;/math&amp;gt; para &amp;lt;math&amp;gt;K&amp;lt;/math&amp;gt; clases, la función softmax produce:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\sigma(\mathbf{z})_k = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}, \qquad k = 1, \dots, K&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
La salida satisface dos propiedades que la convierten en una distribución de probabilidad válida:&lt;br /&gt;
&lt;br /&gt;
# &amp;lt;math&amp;gt;\sigma(\mathbf{z})_k &amp;gt; 0&amp;lt;/math&amp;gt; para todo &amp;lt;math&amp;gt;k&amp;lt;/math&amp;gt; (dado que la función exponencial es siempre positiva).&lt;br /&gt;
# &amp;lt;math&amp;gt;\sum_{k=1}^{K} \sigma(\mathbf{z})_k = 1&amp;lt;/math&amp;gt; (por construcción).&lt;br /&gt;
&lt;br /&gt;
== Intuición ==&lt;br /&gt;
&lt;br /&gt;
La función softmax amplifica las diferencias entre los logits. Un logit mayor que los demás recibe una proporción desproporcionadamente grande de la masa de probabilidad porque la función exponencial crece de forma superlineal. Por ejemplo:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! Logits !! Salida softmax&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;math&amp;gt;(2.0,\; 1.0,\; 0.1)&amp;lt;/math&amp;gt; || &amp;lt;math&amp;gt;(0.659,\; 0.242,\; 0.099)&amp;lt;/math&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;math&amp;gt;(5.0,\; 1.0,\; 0.1)&amp;lt;/math&amp;gt; || &amp;lt;math&amp;gt;(0.975,\; 0.018,\; 0.007)&amp;lt;/math&amp;gt;&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
A medida que la brecha entre el logit más grande y los demás aumenta, la salida se aproxima a un vector one-hot. Este comportamiento de &amp;quot;el ganador se lleva la mayor parte&amp;quot; hace que softmax sea adecuada para la clasificación donde una única clase debe dominar.&lt;br /&gt;
&lt;br /&gt;
== Parámetro de temperatura ==&lt;br /&gt;
&lt;br /&gt;
Un parámetro de &#039;&#039;&#039;temperatura&#039;&#039;&#039; &amp;lt;math&amp;gt;T &amp;gt; 0&amp;lt;/math&amp;gt; controla la nitidez de la distribución:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\sigma(\mathbf{z}; T)_k = \frac{e^{z_k / T}}{\sum_{j=1}^{K} e^{z_j / T}}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;math&amp;gt;T \to 0&amp;lt;/math&amp;gt;: La distribución colapsa en un vector one-hot seleccionando el argmax — equivalente a una decisión rígida.&lt;br /&gt;
* &amp;lt;math&amp;gt;T = 1&amp;lt;/math&amp;gt;: Softmax estándar.&lt;br /&gt;
* &amp;lt;math&amp;gt;T \to \infty&amp;lt;/math&amp;gt;: La distribución se aproxima a la uniforme — todas las clases se vuelven igualmente probables.&lt;br /&gt;
&lt;br /&gt;
El escalado por temperatura se utiliza ampliamente en la destilación de conocimiento (Hinton et al., 2015), donde una distribución &amp;quot;suave&amp;quot; de un modelo maestro proporciona una señal de entrenamiento más rica que las etiquetas rígidas. También se utiliza para controlar la aleatoriedad en la generación de texto a partir de modelos de lenguaje.&lt;br /&gt;
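El efecto de la temperatura puede verificarse con una implementación mínima en Python (esbozo ilustrativo; los logits de ejemplo provienen de la tabla anterior):&lt;br /&gt;

```python
import math

def softmax_T(logits, T=1.0):
    """Softmax con temperatura T (implementación ilustrativa)."""
    m = max(logits)                      # estabilización numérica
    exps = [math.exp((z - m) / T) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

logits = [2.0, 1.0, 0.1]
print([round(p, 3) for p in softmax_T(logits, T=1.0)])    # [0.659, 0.242, 0.099]
print([round(p, 3) for p in softmax_T(logits, T=0.1)])    # casi one-hot
print([round(p, 3) for p in softmax_T(logits, T=100.0)])  # casi uniforme
```

Restar el máximo antes de dividir por T no cambia el resultado (la constante se cancela) y evita el desbordamiento con logits grandes.&lt;br /&gt;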
&lt;br /&gt;
== Estabilidad numérica ==&lt;br /&gt;
&lt;br /&gt;
Una implementación ingenua de softmax puede desbordarse cuando los logits son grandes (por ejemplo, &amp;lt;math&amp;gt;e^{1000}&amp;lt;/math&amp;gt; es infinito en punto flotante). La solución estándar es restar el logit máximo:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\sigma(\mathbf{z})_k = \frac{e^{z_k - m}}{\sum_{j=1}^{K} e^{z_j - m}}, \qquad m = \max_j z_j&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Esto es matemáticamente equivalente (la constante se cancela) pero asegura que el exponente más grande sea &amp;lt;math&amp;gt;e^0 = 1&amp;lt;/math&amp;gt;, previniendo el desbordamiento. Todos los principales frameworks de aprendizaje profundo implementan esta versión estabilizada automáticamente.&lt;br /&gt;
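Un esbozo en Python del truco de estabilización (implementación ilustrativa, no la de ningún framework en particular):&lt;br /&gt;

```python
import math

def softmax_estable(z):
    """Softmax con el logit máximo restado para evitar desbordamiento."""
    m = max(z)                           # restar el máximo
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

# Con logits grandes, la exponencial ingenua desborda:
try:
    math.exp(1000)
except OverflowError:
    print("overflow")                    # e^1000 no es representable

# La versión estabilizada funciona sin problemas:
print(softmax_estable([1000.0, 999.0]))  # ≈ [0.731, 0.269]
```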
&lt;br /&gt;
== Relación con la sigmoide ==&lt;br /&gt;
&lt;br /&gt;
Para el caso especial de &amp;lt;math&amp;gt;K = 2&amp;lt;/math&amp;gt; clases, la función softmax se reduce a la función &#039;&#039;&#039;sigmoide&#039;&#039;&#039; (logística). Si se define &amp;lt;math&amp;gt;z = z_1 - z_2&amp;lt;/math&amp;gt;, entonces:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\sigma(\mathbf{z})_1 = \frac{e^{z_1}}{e^{z_1} + e^{z_2}} = \frac{1}{1 + e^{-z}} = \sigma_{\mathrm{sigmoid}}(z)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Por esto, los clasificadores binarios típicamente utilizan una única neurona de salida con activación sigmoide en lugar de dos neuronas con softmax — son matemáticamente equivalentes.&lt;br /&gt;
&lt;br /&gt;
== Gradiente ==&lt;br /&gt;
&lt;br /&gt;
El jacobiano de la función softmax con respecto a su entrada es:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\frac{\partial \sigma_k}{\partial z_j} = \sigma_k (\delta_{kj} - \sigma_j)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
donde &amp;lt;math&amp;gt;\delta_{kj}&amp;lt;/math&amp;gt; es la delta de Kronecker. Cuando se combina con la [[Cross-Entropy Loss|pérdida de entropía cruzada]], el gradiente se simplifica a &amp;lt;math&amp;gt;\hat{y}_k - y_k&amp;lt;/math&amp;gt;, lo que es computacionalmente eficiente y numéricamente estable.&lt;br /&gt;
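El jacobiano y la simplificación del gradiente pueden comprobarse numéricamente con un pequeño script en Python (los logits de entrada son valores supuestos):&lt;br /&gt;

```python
import math

def softmax(z):
    m = max(z)  # estabilización numérica
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [x / s for x in e]

z = [1.0, 0.5, -0.3]   # logits de ejemplo (supuestos)
s = softmax(z)

# Jacobiano analítico: ds_k/dz_j = s_k (delta_kj - s_j)
jac = [[s[k] * ((1.0 if k == j else 0.0) - s[j]) for j in range(3)]
       for k in range(3)]

# Comprobación por diferencias finitas
h = 1e-6
for j in range(3):
    zp = list(z)
    zp[j] += h
    num = [(a - b) / h for a, b in zip(softmax(zp), s)]
    for k in range(3):
        assert abs(num[k] - jac[k][j]) < 1e-4

# Con entropía cruzada y etiqueta verdadera y (clase 0), el gradiente
# respecto a los logits se reduce a la diferencia s - y:
y = [1.0, 0.0, 0.0]
grad = [si - yi for si, yi in zip(s, y)]
print([round(g, 3) for g in grad])  # [-0.468, 0.323, 0.145]
```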
&lt;br /&gt;
== Uso en clasificación ==&lt;br /&gt;
&lt;br /&gt;
En un flujo de clasificación típico:&lt;br /&gt;
&lt;br /&gt;
# Una red neuronal produce logits crudos &amp;lt;math&amp;gt;\mathbf{z}&amp;lt;/math&amp;gt; a partir de su capa lineal final.&lt;br /&gt;
# Softmax convierte los logits en probabilidades: &amp;lt;math&amp;gt;\hat{\mathbf{y}} = \sigma(\mathbf{z})&amp;lt;/math&amp;gt;.&lt;br /&gt;
# La clase predicha es &amp;lt;math&amp;gt;\hat{c} = \arg\max_k \hat{y}_k&amp;lt;/math&amp;gt;.&lt;br /&gt;
# El entrenamiento utiliza la [[Cross-Entropy Loss|pérdida de entropía cruzada]] aplicada a la distribución predicha y las etiquetas verdaderas.&lt;br /&gt;
&lt;br /&gt;
En la práctica, softmax y la entropía cruzada se calculan conjuntamente por estabilidad numérica (la formulación &#039;&#039;&#039;log-softmax&#039;&#039;&#039;), y el argmax en el momento de la inferencia puede aplicarse directamente a los logits sin calcular softmax en absoluto.&lt;br /&gt;
&lt;br /&gt;
== Más allá de la clasificación ==&lt;br /&gt;
&lt;br /&gt;
Softmax aparece en muchos contextos más allá de la capa de salida:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Mecanismos de atención&#039;&#039;&#039;: Softmax normaliza las puntuaciones de alineamiento en pesos de atención en la arquitectura [[Attention Mechanisms|Transformer]].&lt;br /&gt;
* &#039;&#039;&#039;Aprendizaje por refuerzo&#039;&#039;&#039;: Softmax sobre las estimaciones de valor de acción produce una política estocástica (exploración de Boltzmann).&lt;br /&gt;
* &#039;&#039;&#039;Modelos de mezcla&#039;&#039;&#039;: Softmax parametriza los coeficientes de mezcla en arquitecturas de mezcla de expertos.&lt;br /&gt;
&lt;br /&gt;
== Véase también ==&lt;br /&gt;
&lt;br /&gt;
* [[Cross-Entropy Loss]]&lt;br /&gt;
* [[Loss Functions]]&lt;br /&gt;
* [[Logistic regression]]&lt;br /&gt;
* [[Attention Mechanisms]]&lt;br /&gt;
* [[Neural Networks]]&lt;br /&gt;
&lt;br /&gt;
== Referencias ==&lt;br /&gt;
&lt;br /&gt;
* Bishop, C. M. (2006). &#039;&#039;Pattern Recognition and Machine Learning&#039;&#039;. Springer, Section 4.3.4.&lt;br /&gt;
* Goodfellow, I., Bengio, Y. and Courville, A. (2016). &#039;&#039;Deep Learning&#039;&#039;. MIT Press, Section 6.2.2.3.&lt;br /&gt;
* Hinton, G., Vinyals, O. and Dean, J. (2015). &amp;quot;Distilling the Knowledge in a Neural Network&amp;quot;. &#039;&#039;arXiv:1503.02531&#039;&#039;.&lt;br /&gt;
* Bridle, J. S. (1990). &amp;quot;Probabilistic Interpretation of Feedforward Classification Network Outputs&amp;quot;. &#039;&#039;Neurocomputing&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
[[Category:Machine Learning]]&lt;br /&gt;
[[Category:Introductory]]&lt;/div&gt;</summary>
		<author><name>DeployBot</name></author>
	</entry>
	<entry>
		<id>https://marovi.ai/index.php?title=Recurrent_Neural_Networks/es&amp;diff=2158</id>
		<title>Recurrent Neural Networks/es</title>
		<link rel="alternate" type="text/html" href="https://marovi.ai/index.php?title=Recurrent_Neural_Networks/es&amp;diff=2158"/>
		<updated>2026-04-24T07:09:01Z</updated>

		<summary type="html">&lt;p&gt;DeployBot: [deploy-bot] Deploy from CI (8c92aeb)&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{LanguageBar | page = Recurrent Neural Networks}}&lt;br /&gt;
{{ArticleInfobox | topic_area = Deep Learning | difficulty = Intermediate | prerequisites = [[Neural Networks]], [[Backpropagation]]}}&lt;br /&gt;
{{ContentMeta | generated_by = claude-opus | model_used = claude-opus-4-6 | generated_date = 2026-03-13}}&lt;br /&gt;
&lt;br /&gt;
Las &#039;&#039;&#039;redes neuronales recurrentes&#039;&#039;&#039; (&#039;&#039;&#039;RNN&#039;&#039;&#039;) son una clase de [[Neural Networks|redes neuronales]] diseñadas para procesar &#039;&#039;&#039;datos secuenciales&#039;&#039;&#039; — datos en los que el orden de los elementos importa. A diferencia de las redes prealimentadas, las RNN contienen conexiones recurrentes que permiten que la información persista a través de los pasos temporales, otorgándoles una forma de memoria.&lt;br /&gt;
&lt;br /&gt;
== Modelado de secuencias ==&lt;br /&gt;
&lt;br /&gt;
Muchos problemas del mundo real involucran secuencias: el texto es una secuencia de palabras, el habla es una secuencia de tramas de audio, los precios de las acciones forman una serie temporal y el ADN es una secuencia de nucleótidos. Las redes prealimentadas estándar requieren entradas de tamaño fijo y tratan cada entrada de forma independiente, lo que las hace inadecuadas para secuencias de longitud variable donde el contexto importa.&lt;br /&gt;
&lt;br /&gt;
Las RNN abordan esto procesando las entradas un elemento a la vez mientras mantienen un &#039;&#039;&#039;estado oculto&#039;&#039;&#039; que resume la información vista hasta el momento.&lt;br /&gt;
&lt;br /&gt;
== RNN básica ==&lt;br /&gt;
&lt;br /&gt;
En cada paso temporal &amp;lt;math&amp;gt;t&amp;lt;/math&amp;gt;, una RNN básica calcula:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\mathbf{h}_t = \tanh(\mathbf{W}_{hh}\,\mathbf{h}_{t-1} + \mathbf{W}_{xh}\,\mathbf{x}_t + \mathbf{b}_h)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\mathbf{y}_t = \mathbf{W}_{hy}\,\mathbf{h}_t + \mathbf{b}_y&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
donde &amp;lt;math&amp;gt;\mathbf{x}_t&amp;lt;/math&amp;gt; es la entrada en el instante &amp;lt;math&amp;gt;t&amp;lt;/math&amp;gt;, &amp;lt;math&amp;gt;\mathbf{h}_t&amp;lt;/math&amp;gt; es el estado oculto, &amp;lt;math&amp;gt;\mathbf{y}_t&amp;lt;/math&amp;gt; es la salida, y &amp;lt;math&amp;gt;\mathbf{W}_{hh}, \mathbf{W}_{xh}, \mathbf{W}_{hy}&amp;lt;/math&amp;gt; son matrices de pesos compartidas en todos los pasos temporales. El estado oculto inicial &amp;lt;math&amp;gt;\mathbf{h}_0&amp;lt;/math&amp;gt; se establece típicamente como el vector cero.&lt;br /&gt;
&lt;br /&gt;
La idea clave es que los mismos parámetros se aplican en cada paso temporal — &#039;&#039;&#039;compartición de pesos en el tiempo&#039;&#039;&#039; — lo que permite a la red generalizar a través de diferentes posiciones en la secuencia.&lt;br /&gt;
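Las dos ecuaciones anteriores pueden esbozarse en Python puro con matrices de juguete (los pesos y las entradas son valores supuestos, solo para ilustrar la recurrencia y la compartición de pesos):&lt;br /&gt;

```python
import math

def matvec(W, v):
    """Producto matriz-vector con listas anidadas."""
    return [sum(wi * vi for wi, vi in zip(fila, v)) for fila in W]

def paso_rnn(h_prev, x, W_hh, W_xh, b_h):
    """h_t = tanh(W_hh h_{t-1} + W_xh x_t + b_h): un paso de RNN básica."""
    pre = [a + b + c for a, b, c in
           zip(matvec(W_hh, h_prev), matvec(W_xh, x), b_h)]
    return [math.tanh(v) for v in pre]

# Pesos de juguete: estado oculto de tamaño 2, entrada de tamaño 1
W_hh = [[0.5, -0.3], [0.2, 0.1]]
W_xh = [[1.0], [-1.0]]
b_h = [0.0, 0.0]

h = [0.0, 0.0]                       # h_0 = vector cero
for x_t in [[1.0], [0.5], [-0.2]]:   # los mismos pesos en cada paso
    h = paso_rnn(h, x_t, W_hh, W_xh, b_h)

print([round(v, 3) for v in h])
```

El estado final resume toda la secuencia de entrada; la salida &amp;lt;math&amp;gt;\mathbf{y}_t&amp;lt;/math&amp;gt; sería una capa lineal adicional sobre cada &amp;lt;math&amp;gt;\mathbf{h}_t&amp;lt;/math&amp;gt;.&lt;br /&gt;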
&lt;br /&gt;
== Backpropagation a través del tiempo (BPTT) ==&lt;br /&gt;
&lt;br /&gt;
El entrenamiento de una RNN requiere calcular los gradientes de la pérdida con respecto a los pesos compartidos. &#039;&#039;&#039;Backpropagation a través del tiempo&#039;&#039;&#039; (BPTT) &amp;quot;desenrolla&amp;quot; la RNN a lo largo de los pasos temporales, produciendo una red prealimentada profunda con pesos compartidos, y luego aplica [[Backpropagation|backpropagation]] estándar.&lt;br /&gt;
&lt;br /&gt;
Para una secuencia de longitud &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;, el gradiente de la pérdida con respecto a &amp;lt;math&amp;gt;\mathbf{W}_{hh}&amp;lt;/math&amp;gt; involucra un producto de jacobianos:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\frac{\partial L}{\partial \mathbf{W}_{hh}} = \sum_{t=1}^{T}\frac{\partial L_t}{\partial \mathbf{W}_{hh}} = \sum_{t=1}^{T}\sum_{k=1}^{t}\frac{\partial L_t}{\partial \mathbf{h}_t}\left(\prod_{j=k+1}^{t}\frac{\partial \mathbf{h}_j}{\partial \mathbf{h}_{j-1}}\right)\frac{\partial \mathbf{h}_k}{\partial \mathbf{W}_{hh}}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
El producto de jacobianos &amp;lt;math&amp;gt;\prod \partial \mathbf{h}_j / \partial \mathbf{h}_{j-1}&amp;lt;/math&amp;gt; es la fuente de los problemas de gradientes que se desvanecen y explotan.&lt;br /&gt;
&lt;br /&gt;
== El problema del gradiente que se desvanece ==&lt;br /&gt;
&lt;br /&gt;
Cuando el radio espectral del jacobiano recurrente es menor que 1, la señal del gradiente decae exponencialmente a través del tiempo — el problema del &#039;&#039;&#039;gradiente que se desvanece&#039;&#039;&#039;. Esto hace extremadamente difícil que las RNN básicas aprendan dependencias que abarquen más de 10–20 pasos temporales.&lt;br /&gt;
&lt;br /&gt;
Por el contrario, cuando el radio espectral supera 1, los gradientes pueden crecer exponencialmente — el problema del &#039;&#039;&#039;gradiente que explota&#039;&#039;&#039;. Los gradientes que explotan se manejan típicamente mediante &#039;&#039;&#039;recorte de gradientes&#039;&#039;&#039; (limitar la norma del gradiente a un umbral), pero los gradientes que se desvanecen requieren soluciones arquitectónicas.&lt;br /&gt;
&lt;br /&gt;
== Long Short-Term Memory (LSTM) ==&lt;br /&gt;
&lt;br /&gt;
El &#039;&#039;&#039;LSTM&#039;&#039;&#039; (Hochreiter y Schmidhuber, 1997) introduce un &#039;&#039;&#039;estado de celda&#039;&#039;&#039; &amp;lt;math&amp;gt;\mathbf{c}_t&amp;lt;/math&amp;gt; que fluye a través del tiempo con mínima interferencia, y tres &#039;&#039;&#039;compuertas&#039;&#039;&#039; que controlan el flujo de información:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\mathbf{f}_t = \sigma(\mathbf{W}_f[\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_f)&amp;lt;/math&amp;gt;  (&#039;&#039;&#039;compuerta de olvido&#039;&#039;&#039;)&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\mathbf{i}_t = \sigma(\mathbf{W}_i[\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_i)&amp;lt;/math&amp;gt;  (&#039;&#039;&#039;compuerta de entrada&#039;&#039;&#039;)&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\tilde{\mathbf{c}}_t = \tanh(\mathbf{W}_c[\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_c)&amp;lt;/math&amp;gt;  (&#039;&#039;&#039;estado de celda candidato&#039;&#039;&#039;)&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\mathbf{c}_t = \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{c}}_t&amp;lt;/math&amp;gt;  (&#039;&#039;&#039;actualización del estado de celda&#039;&#039;&#039;)&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\mathbf{o}_t = \sigma(\mathbf{W}_o[\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_o)&amp;lt;/math&amp;gt;  (&#039;&#039;&#039;compuerta de salida&#039;&#039;&#039;)&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\mathbf{h}_t = \mathbf{o}_t \odot \tanh(\mathbf{c}_t)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
El estado de celda actúa como una cinta transportadora: la compuerta de olvido decide qué información antigua descartar, la compuerta de entrada decide qué información nueva almacenar, y la compuerta de salida controla lo que se expone a la siguiente capa. Dado que el estado de celda se actualiza mediante suma (no multiplicación), los gradientes fluyen más fácilmente a través de secuencias largas.&lt;br /&gt;
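Las compuertas y la actualización aditiva del estado de celda pueden esbozarse en Python con dimensiones escalares (los pesos son valores de juguete supuestos; un LSTM real opera sobre vectores y matrices):&lt;br /&gt;

```python
import math

def sigmoide(v):
    return 1.0 / (1.0 + math.exp(-v))

def paso_lstm_escalar(c_prev, h_prev, x, pesos):
    """Un paso de LSTM con estado de tamaño 1; `pesos` es un dict supuesto."""
    concat = [h_prev, x, 1.0]  # [h_{t-1}, x_t] más término de sesgo
    lin = lambda w: sum(a * b for a, b in zip(w, concat))
    f = sigmoide(lin(pesos["f"]))          # compuerta de olvido
    i = sigmoide(lin(pesos["i"]))          # compuerta de entrada
    c_tilde = math.tanh(lin(pesos["c"]))   # estado de celda candidato
    o = sigmoide(lin(pesos["o"]))          # compuerta de salida
    c = f * c_prev + i * c_tilde           # actualización aditiva
    h = o * math.tanh(c)
    return c, h

# Pesos de juguete (supuestos), en el orden [peso_h, peso_x, sesgo]
pesos = {"f": [0.1, 0.2, 1.0], "i": [0.3, -0.1, 0.0],
         "c": [0.5, 0.8, 0.0], "o": [0.2, 0.1, 0.5]}

c, h = 0.0, 0.0
for x in [1.0, -0.5, 0.3]:
    c, h = paso_lstm_escalar(c, h, x, pesos)
print(round(c, 3), round(h, 3))
```

Obsérvese que la línea marcada como actualización aditiva es la única vía por la que el estado de celda se propaga, lo que evita el producto repetido de jacobianos de la RNN básica.&lt;br /&gt;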
&lt;br /&gt;
== Unidad Recurrente con Compuertas (GRU) ==&lt;br /&gt;
&lt;br /&gt;
La &#039;&#039;&#039;GRU&#039;&#039;&#039; (Cho et al., 2014) simplifica el LSTM fusionando el estado de celda y el estado oculto y utilizando solo dos compuertas:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\mathbf{z}_t = \sigma(\mathbf{W}_z[\mathbf{h}_{t-1}, \mathbf{x}_t])&amp;lt;/math&amp;gt;  (&#039;&#039;&#039;compuerta de actualización&#039;&#039;&#039;)&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\mathbf{r}_t = \sigma(\mathbf{W}_r[\mathbf{h}_{t-1}, \mathbf{x}_t])&amp;lt;/math&amp;gt;  (&#039;&#039;&#039;compuerta de reinicio&#039;&#039;&#039;)&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\tilde{\mathbf{h}}_t = \tanh(\mathbf{W}[\mathbf{r}_t \odot \mathbf{h}_{t-1}, \mathbf{x}_t])&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\mathbf{h}_t = (1 - \mathbf{z}_t) \odot \mathbf{h}_{t-1} + \mathbf{z}_t \odot \tilde{\mathbf{h}}_t&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
La GRU tiene menos parámetros que el LSTM y a menudo alcanza un rendimiento comparable. En la práctica, la elección entre LSTM y GRU se realiza típicamente de forma empírica.&lt;br /&gt;
&lt;br /&gt;
== RNN bidireccionales ==&lt;br /&gt;
&lt;br /&gt;
Una &#039;&#039;&#039;RNN bidireccional&#039;&#039;&#039; procesa la secuencia en ambas direcciones — hacia adelante (de izquierda a derecha) y hacia atrás (de derecha a izquierda) — y concatena los estados ocultos:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\mathbf{h}_t = [\overrightarrow{\mathbf{h}}_t;\; \overleftarrow{\mathbf{h}}_t]&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Esto permite al modelo utilizar tanto el contexto pasado como el futuro en cada paso temporal, lo cual es beneficioso para tareas como el reconocimiento de entidades nombradas y la traducción automática, donde el significado de una palabra depende de su contexto circundante.&lt;br /&gt;
&lt;br /&gt;
== Aplicaciones ==&lt;br /&gt;
&lt;br /&gt;
Las RNN y sus variantes con compuertas se han aplicado a una amplia gama de tareas secuenciales:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Modelado del lenguaje&#039;&#039;&#039; — predecir la siguiente palabra en una secuencia.&lt;br /&gt;
* &#039;&#039;&#039;Traducción automática&#039;&#039;&#039; — arquitecturas codificador-decodificador para traducción secuencia a secuencia (Sutskever et al., 2014).&lt;br /&gt;
* &#039;&#039;&#039;Reconocimiento de voz&#039;&#039;&#039; — transcripción de audio a texto (a menudo combinado con la pérdida CTC).&lt;br /&gt;
* &#039;&#039;&#039;Análisis de sentimiento&#039;&#039;&#039; — clasificación del sentimiento de un texto.&lt;br /&gt;
* &#039;&#039;&#039;Predicción de series temporales&#039;&#039;&#039; — predicción de valores futuros de datos financieros o de sensores.&lt;br /&gt;
* &#039;&#039;&#039;Generación de música&#039;&#039;&#039; — generación de secuencias de notas.&lt;br /&gt;
&lt;br /&gt;
Cabe destacar que para muchas tareas de procesamiento del lenguaje natural, los &#039;&#039;&#039;Transformers&#039;&#039;&#039; (Vaswani et al., 2017) han reemplazado en gran medida a las RNN debido a su capacidad para procesar secuencias en paralelo y capturar dependencias de largo alcance de manera más efectiva mediante autoatención.&lt;br /&gt;
&lt;br /&gt;
== Véase también ==&lt;br /&gt;
&lt;br /&gt;
* [[Neural Networks]]&lt;br /&gt;
* [[Backpropagation]]&lt;br /&gt;
* [[Convolutional Neural Networks]]&lt;br /&gt;
* [[Word Embeddings]]&lt;br /&gt;
* [[Overfitting and Regularization]]&lt;br /&gt;
&lt;br /&gt;
== Referencias ==&lt;br /&gt;
&lt;br /&gt;
* Elman, J. L. (1990). &amp;quot;Finding Structure in Time&amp;quot;. &#039;&#039;Cognitive Science&#039;&#039;, 14(2), 179–211.&lt;br /&gt;
* Hochreiter, S. and Schmidhuber, J. (1997). &amp;quot;Long Short-Term Memory&amp;quot;. &#039;&#039;Neural Computation&#039;&#039;, 9(8), 1735–1780.&lt;br /&gt;
* Cho, K. et al. (2014). &amp;quot;Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation&amp;quot;. &#039;&#039;EMNLP&#039;&#039;.&lt;br /&gt;
* Sutskever, I., Vinyals, O. and Le, Q. V. (2014). &amp;quot;Sequence to Sequence Learning with Neural Networks&amp;quot;. &#039;&#039;NeurIPS&#039;&#039;.&lt;br /&gt;
* Goodfellow, I., Bengio, Y. and Courville, A. (2016). &#039;&#039;Deep Learning&#039;&#039;, Chapter 10. MIT Press.&lt;br /&gt;
&lt;br /&gt;
[[Category:Deep Learning]]&lt;br /&gt;
[[Category:Intermediate]]&lt;br /&gt;
[[Category:Neural Networks]]&lt;/div&gt;</summary>
		<author><name>DeployBot</name></author>
	</entry>
	<entry>
		<id>https://marovi.ai/index.php?title=Overfitting_and_Regularization/es&amp;diff=2157</id>
		<title>Overfitting and Regularization/es</title>
		<link rel="alternate" type="text/html" href="https://marovi.ai/index.php?title=Overfitting_and_Regularization/es&amp;diff=2157"/>
		<updated>2026-04-24T07:09:01Z</updated>

		<summary type="html">&lt;p&gt;DeployBot: [deploy-bot] Deploy from CI (8c92aeb)&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{LanguageBar | page = Overfitting and Regularization}}&lt;br /&gt;
{{ArticleInfobox | topic_area = Machine Learning | difficulty = Intermediate | prerequisites = [[Loss Functions]], [[Neural Networks]]}}&lt;br /&gt;
{{ContentMeta | generated_by = claude-opus | model_used = claude-opus-4-6 | generated_date = 2026-03-13}}&lt;br /&gt;
&lt;br /&gt;
El &#039;&#039;&#039;sobreajuste&#039;&#039;&#039; (overfitting) ocurre cuando un modelo de aprendizaje automático aprende los datos de entrenamiento demasiado bien — capturando ruido e idiosincrasias en lugar del patrón subyacente — y, en consecuencia, tiene un rendimiento deficiente en datos no vistos. La &#039;&#039;&#039;regularización&#039;&#039;&#039; es la familia de técnicas utilizadas para prevenir el sobreajuste y mejorar la capacidad del modelo para generalizar.&lt;br /&gt;
&lt;br /&gt;
== El equilibrio entre sesgo y varianza ==&lt;br /&gt;
&lt;br /&gt;
El error de predicción sobre datos no vistos puede descomponerse en tres componentes:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\text{Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible noise}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* El &#039;&#039;&#039;sesgo&#039;&#039;&#039; mide cuán lejos está la predicción promedio del modelo del valor verdadero. Un sesgo alto indica que el modelo es demasiado simple para capturar la estructura de los datos (&#039;&#039;&#039;subajuste&#039;&#039;&#039;).&lt;br /&gt;
* La &#039;&#039;&#039;varianza&#039;&#039;&#039; mide cuánto fluctúan las predicciones entre diferentes conjuntos de entrenamiento. Una varianza alta indica que el modelo es demasiado sensible a los datos de entrenamiento particulares (&#039;&#039;&#039;sobreajuste&#039;&#039;&#039;).&lt;br /&gt;
&lt;br /&gt;
El objetivo es encontrar el punto óptimo que minimice el error total. Un modelo con muy pocos parámetros subajusta (sesgo alto); un modelo con demasiados parámetros sobreajusta (varianza alta). Las técnicas de regularización inclinan el equilibrio restringiendo la complejidad del modelo, aceptando un sesgo ligeramente mayor a cambio de una varianza sustancialmente menor.&lt;br /&gt;
&lt;br /&gt;
== Detección del sobreajuste ==&lt;br /&gt;
&lt;br /&gt;
El diagnóstico más claro es comparar el rendimiento en entrenamiento y validación:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Pérdida de entrenamiento decreciente, pérdida de validación también decreciente&#039;&#039;&#039; — el modelo aún está aprendiendo; continuar el entrenamiento.&lt;br /&gt;
* &#039;&#039;&#039;Pérdida de entrenamiento decreciente, pérdida de validación creciente&#039;&#039;&#039; — el modelo está sobreajustando; aplicar regularización o detener el entrenamiento.&lt;br /&gt;
* &#039;&#039;&#039;Pérdida de entrenamiento alta, pérdida de validación alta&#039;&#039;&#039; — el modelo está subajustando; aumentar la capacidad o entrenar más tiempo.&lt;br /&gt;
&lt;br /&gt;
Graficar estas &#039;&#039;&#039;curvas de aprendizaje&#039;&#039;&#039; a lo largo de las iteraciones de entrenamiento es una practica esencial. Una gran brecha entre la precision de entrenamiento y la precision de validacion es la marca distintiva del sobreajuste.&lt;br /&gt;
&lt;br /&gt;
== Regularizacion L2 (decaimiento de pesos) ==&lt;br /&gt;
&lt;br /&gt;
La regularizacion L2 anade una penalizacion proporcional a la magnitud al cuadrado de los pesos:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;J(\theta) = L(\theta) + \frac{\lambda}{2}\|\theta\|_2^2 = L(\theta) + \frac{\lambda}{2}\sum_j \theta_j^2&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
El gradiente del termino de regularizacion es &amp;lt;math&amp;gt;\lambda \theta&amp;lt;/math&amp;gt;, de modo que cada peso se reduce multiplicativamente hacia cero en cada actualizacion — de ahi el nombre &#039;&#039;&#039;decaimiento de pesos&#039;&#039;&#039;. El hiperparametro &amp;lt;math&amp;gt;\lambda&amp;lt;/math&amp;gt; controla la intensidad de la regularizacion.&lt;br /&gt;
&lt;br /&gt;
La regularizacion L2 es equivalente a colocar un prior gaussiano sobre los pesos desde una perspectiva bayesiana. Fomenta pesos pequenos y distribuidos y desalienta que cualquier peso individual se vuelva excesivamente grande.&lt;br /&gt;
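La regla de actualizacion con decaimiento de pesos puede esbozarse asi (esbozo minimo en Python/NumPy; el nombre de la funcion es ilustrativo, no parte de ninguna biblioteca):

```python
import numpy as np

def sgd_step_weight_decay(theta, grad_loss, lr=0.1, lam=0.01):
    """Un paso de descenso de gradiente con regularizacion L2.

    El gradiente del termino (lambda/2)*||theta||^2 es lambda*theta,
    por lo que cada peso decae multiplicativamente hacia cero.
    """
    return theta - lr * (grad_loss + lam * theta)

theta = np.array([1.0, -2.0])
# Con gradiente de la perdida igual a cero, solo actua el decaimiento:
theta_nueva = sgd_step_weight_decay(theta, np.zeros(2), lr=0.1, lam=0.01)
# cada peso queda multiplicado por (1 - lr*lam) = 0.999
```

Observese que, en ausencia de gradiente de la perdida, la actualizacion es una contraccion multiplicativa pura de los pesos.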
&lt;br /&gt;
== Regularizacion L1 ==&lt;br /&gt;
&lt;br /&gt;
La regularizacion L1 penaliza la suma de valores absolutos:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;J(\theta) = L(\theta) + \lambda \|\theta\|_1 = L(\theta) + \lambda \sum_j |\theta_j|&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
A diferencia de L2, la penalizacion L1 lleva muchos pesos exactamente a cero, produciendo modelos &#039;&#039;&#039;dispersos&#039;&#039;&#039;. Esto hace que la regularizacion L1 sea util para la seleccion de caracteristicas. LASSO (Least Absolute Shrinkage and Selection Operator) es el ejemplo clasico de regresion lineal con regularizacion L1.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! Propiedad !! L1 !! L2&lt;br /&gt;
|-&lt;br /&gt;
| Penalizacion || &amp;lt;math&amp;gt;\lambda\sum|\theta_j|&amp;lt;/math&amp;gt; || &amp;lt;math&amp;gt;\frac{\lambda}{2}\sum\theta_j^2&amp;lt;/math&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| Efecto sobre los pesos || Lleva muchos a exactamente cero || Reduce todos hacia cero&lt;br /&gt;
|-&lt;br /&gt;
| Dispersion (&#039;&#039;sparsity&#039;&#039;) || Si || No&lt;br /&gt;
|-&lt;br /&gt;
| Interpretacion bayesiana || Prior de Laplace || Prior gaussiano&lt;br /&gt;
|-&lt;br /&gt;
| Caso de uso || Seleccion de caracteristicas, interpretabilidad || Regularizacion general&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== Dropout ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Dropout&#039;&#039;&#039; (Srivastava et al., 2014) es una tecnica de regularizacion especifica para redes neuronales. Durante el entrenamiento, cada neurona es aleatoriamente &amp;quot;descartada&amp;quot; (establecida en cero) con probabilidad &amp;lt;math&amp;gt;p&amp;lt;/math&amp;gt; en cada pasada hacia adelante. Esto evita que las neuronas se coadapten y obliga a la red a aprender representaciones redundantes.&lt;br /&gt;
&lt;br /&gt;
En el momento de la prueba, todas las neuronas estan activas pero sus salidas se escalan por &amp;lt;math&amp;gt;(1 - p)&amp;lt;/math&amp;gt; para compensar el mayor numero de unidades activas (o equivalentemente, las salidas se escalan por &amp;lt;math&amp;gt;1/(1-p)&amp;lt;/math&amp;gt; durante el entrenamiento — &#039;&#039;&#039;dropout invertido&#039;&#039;&#039;).&lt;br /&gt;
&lt;br /&gt;
Dropout puede interpretarse como un metodo aproximado de ensamble (&#039;&#039;ensemble&#039;&#039;): cada paso de entrenamiento utiliza una subred diferente, y el modelo final aproxima la prediccion promedio de un numero exponencial de subredes.&lt;br /&gt;
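El dropout invertido descrito arriba puede esbozarse asi (esbozo ilustrativo en Python/NumPy; los nombres son hipoteticos):

```python
import numpy as np

def dropout_invertido(x, p=0.5, entrenamiento=True, rng=None):
    """Dropout invertido: en entrenamiento se anulan unidades con
    probabilidad p y se escala por 1/(1-p); en prueba es la identidad."""
    if not entrenamiento or p == 0.0:
        return x
    rng = rng or np.random.default_rng(0)
    mascara = rng.random(x.shape) >= p   # True = unidad conservada
    return x * mascara / (1.0 - p)

x = np.ones(8)
y_entren = dropout_invertido(x, p=0.5)
# las unidades conservadas valen 1/(1-0.5) = 2.0; las descartadas, 0.0
```

Al escalar durante el entrenamiento, la esperanza de la salida coincide con la activacion de prueba, de modo que la inferencia no requiere ajuste alguno.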
&lt;br /&gt;
== Parada temprana ==&lt;br /&gt;
&lt;br /&gt;
La &#039;&#039;&#039;parada temprana&#039;&#039;&#039; monitoriza la perdida de validacion durante el entrenamiento y detiene la optimizacion cuando la perdida de validacion deja de mejorar. Es una de las estrategias de regularizacion mas simples y efectivas.&lt;br /&gt;
&lt;br /&gt;
En la practica, un parametro de &#039;&#039;&#039;paciencia&#039;&#039;&#039; especifica cuantas epocas esperar despues de la ultima mejora antes de detenerse. Los pesos del modelo se guardan en el punto de menor perdida de validacion y se restauran al final.&lt;br /&gt;
&lt;br /&gt;
La parada temprana actua como una forma implicita de regularizacion: limita el numero efectivo de pasos de entrenamiento, evitando que el modelo memorice completamente los datos de entrenamiento.&lt;br /&gt;
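La logica de paciencia puede esbozarse como sigue (esbozo en Python; los nombres son ilustrativos):

```python
def parada_temprana(perdidas_validacion, paciencia=3):
    """Devuelve (epoca de parada, epoca con mejor perdida).

    Se detiene en la primera epoca en la que la perdida de validacion
    lleva `paciencia` epocas sin mejorar; la mejor epoca indica que
    pesos guardados deben restaurarse."""
    mejor, mejor_epoca, espera = float("inf"), 0, 0
    for epoca, perdida in enumerate(perdidas_validacion):
        if perdida < mejor:
            mejor, mejor_epoca, espera = perdida, epoca, 0
        else:
            espera += 1
            if espera >= paciencia:
                return epoca, mejor_epoca
    return len(perdidas_validacion) - 1, mejor_epoca

# La validacion mejora hasta la epoca 2 y luego empeora:
curva = [1.0, 0.8, 0.7, 0.75, 0.9, 1.1]
resultado = parada_temprana(curva, paciencia=3)
# se detiene en la epoca 5 y restaura los pesos de la epoca 2
```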
&lt;br /&gt;
== Aumento de datos ==&lt;br /&gt;
&lt;br /&gt;
El &#039;&#039;&#039;aumento de datos&#039;&#039;&#039; incrementa el tamano efectivo y la diversidad del conjunto de entrenamiento aplicando transformaciones que preservan las etiquetas. Para datos de imagenes, las transformaciones comunes incluyen:&lt;br /&gt;
&lt;br /&gt;
* Volteos horizontales/verticales aleatorios&lt;br /&gt;
* Recortes y redimensionamientos aleatorios&lt;br /&gt;
* Variacion de color (brillo, contraste, saturacion)&lt;br /&gt;
* Rotacion y transformaciones afines&lt;br /&gt;
* Mixup (interpolacion lineal de pares de imagenes y sus etiquetas)&lt;br /&gt;
* Cutout (enmascaramiento de parches aleatorios)&lt;br /&gt;
&lt;br /&gt;
Para datos de texto, las transformaciones incluyen sustitucion de sinonimos, retrotraduccion y parafraseo. El aumento de datos reduce el sobreajuste al exponer al modelo a entradas mas variadas sin recopilar datos adicionales.&lt;br /&gt;
&lt;br /&gt;
== Otras tecnicas de regularizacion ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Batch normalization&#039;&#039;&#039; — normalizar las entradas de las capas reduce el desplazamiento interno de covariables (&#039;&#039;internal covariate shift&#039;&#039;) y tiene un leve efecto regularizador.&lt;br /&gt;
* &#039;&#039;&#039;Suavizado de etiquetas&#039;&#039;&#039; — reemplaza los objetivos one-hot con una mezcla, por ejemplo &amp;lt;math&amp;gt;y_{\text{smooth}} = (1 - \epsilon)\, y + \epsilon / C&amp;lt;/math&amp;gt;, previniendo la sobreconfianza.&lt;br /&gt;
* &#039;&#039;&#039;Inyeccion de ruido&#039;&#039;&#039; — anadir ruido gaussiano a las entradas, pesos o gradientes durante el entrenamiento.&lt;br /&gt;
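La formula de suavizado de etiquetas anterior puede verificarse con un esbozo minimo (Python/NumPy; el nombre es ilustrativo):

```python
import numpy as np

def suavizar_etiquetas(y_onehot, eps=0.1):
    """Suavizado de etiquetas: y_smooth = (1 - eps) * y + eps / C."""
    C = y_onehot.shape[-1]
    return (1.0 - eps) * y_onehot + eps / C

y = np.array([0.0, 0.0, 1.0, 0.0])   # one-hot con C = 4 clases
y_suave = suavizar_etiquetas(y, eps=0.1)
# [0.025, 0.025, 0.925, 0.025]; sigue sumando 1
```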
&lt;br /&gt;
== Directrices practicas ==&lt;br /&gt;
&lt;br /&gt;
# Comenzar con un modelo lo suficientemente grande como para sobreajustar los datos de entrenamiento — esto confirma que el modelo tiene capacidad suficiente.&lt;br /&gt;
# Anadir regularizacion incrementalmente (dropout, decaimiento de pesos, aumento de datos) y monitorizar el rendimiento en validacion.&lt;br /&gt;
# Utilizar la parada temprana como red de seguridad.&lt;br /&gt;
# Preferir mas datos de entrenamiento sobre una regularizacion mas fuerte siempre que sea posible — la regularizacion es un sustituto de los datos, no un reemplazo.&lt;br /&gt;
# Ajustar la intensidad de la regularizacion (&amp;lt;math&amp;gt;\lambda&amp;lt;/math&amp;gt;, tasa de dropout) utilizando un conjunto de validacion, nunca el conjunto de prueba.&lt;br /&gt;
&lt;br /&gt;
== Vease tambien ==&lt;br /&gt;
&lt;br /&gt;
* [[Loss Functions]]&lt;br /&gt;
* [[Neural Networks]]&lt;br /&gt;
* [[Gradient Descent]]&lt;br /&gt;
* [[Convolutional Neural Networks]]&lt;br /&gt;
&lt;br /&gt;
== Referencias ==&lt;br /&gt;
&lt;br /&gt;
* Srivastava, N. et al. (2014). &amp;quot;Dropout: A Simple Way to Prevent Neural Networks from Overfitting&amp;quot;. &#039;&#039;JMLR&#039;&#039;, 15, 1929–1958.&lt;br /&gt;
* Tibshirani, R. (1996). &amp;quot;Regression Shrinkage and Selection via the Lasso&amp;quot;. &#039;&#039;JRSS Series B&#039;&#039;, 58(1), 267–288.&lt;br /&gt;
* Goodfellow, I., Bengio, Y. and Courville, A. (2016). &#039;&#039;Deep Learning&#039;&#039;, Chapter 7. MIT Press.&lt;br /&gt;
* Zhang, C. et al. (2017). &amp;quot;Understanding deep learning requires rethinking generalization&amp;quot;. &#039;&#039;ICLR&#039;&#039;.&lt;br /&gt;
* Shorten, C. and Khoshgoftaar, T. M. (2019). &amp;quot;A survey on Image Data Augmentation for Deep Learning&amp;quot;. &#039;&#039;Journal of Big Data&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
[[Category:Machine Learning]]&lt;br /&gt;
[[Category:Intermediate]]&lt;/div&gt;</summary>
		<author><name>DeployBot</name></author>
	</entry>
	<entry>
		<id>https://marovi.ai/index.php?title=Neural_Networks/es&amp;diff=2156</id>
		<title>Neural Networks/es</title>
		<link rel="alternate" type="text/html" href="https://marovi.ai/index.php?title=Neural_Networks/es&amp;diff=2156"/>
		<updated>2026-04-24T07:09:01Z</updated>

		<summary type="html">&lt;p&gt;DeployBot: [deploy-bot] Deploy from CI (8c92aeb)&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{LanguageBar | page = Neural Networks}}&lt;br /&gt;
{{ArticleInfobox | topic_area = Deep Learning | difficulty = Introductory | prerequisites = }}&lt;br /&gt;
{{ContentMeta | generated_by = claude-opus | model_used = claude-opus-4-6 | generated_date = 2026-03-13}}&lt;br /&gt;
&lt;br /&gt;
Las &#039;&#039;&#039;redes neuronales&#039;&#039;&#039; (tambien llamadas &#039;&#039;&#039;redes neuronales artificiales&#039;&#039;&#039; o RNA) son modelos computacionales inspirados en la estructura de los sistemas nerviosos biologicos. Consisten en capas interconectadas de unidades de procesamiento simples denominadas &#039;&#039;&#039;neuronas&#039;&#039;&#039; (o nodos) y constituyen la base del aprendizaje profundo moderno.&lt;br /&gt;
&lt;br /&gt;
== Inspiracion biologica ==&lt;br /&gt;
&lt;br /&gt;
La neurona biologica recibe senales electricas a traves de sus &#039;&#039;&#039;dendritas&#039;&#039;&#039;, las integra en el &#039;&#039;&#039;cuerpo celular&#039;&#039;&#039; y, si la senal combinada supera un umbral, emite una senal de salida a lo largo de su &#039;&#039;&#039;axon&#039;&#039;&#039; hacia las neuronas posteriores. Las redes neuronales artificiales abstraen este proceso: cada neurona artificial calcula una suma ponderada de sus entradas, anade un termino de sesgo y pasa el resultado a traves de una &#039;&#039;&#039;funcion de activacion&#039;&#039;&#039; no lineal.&lt;br /&gt;
&lt;br /&gt;
Aunque la analogia con la biologia motivo la investigacion temprana, las redes neuronales modernas se entienden mejor como aproximadores de funciones parametrizados y flexibles, mas que como simulaciones fieles del cerebro.&lt;br /&gt;
&lt;br /&gt;
== El perceptron ==&lt;br /&gt;
&lt;br /&gt;
El &#039;&#039;&#039;perceptron&#039;&#039;&#039;, introducido por Frank Rosenblatt en 1958, es la red neuronal mas simple. Calcula:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;y = \sigma\!\left(\sum_{i=1}^{n} w_i x_i + b\right) = \sigma(\mathbf{w}^\top \mathbf{x} + b)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
donde &amp;lt;math&amp;gt;\mathbf{x}&amp;lt;/math&amp;gt; es el vector de entrada, &amp;lt;math&amp;gt;\mathbf{w}&amp;lt;/math&amp;gt; son los pesos aprendibles, &amp;lt;math&amp;gt;b&amp;lt;/math&amp;gt; es un sesgo y &amp;lt;math&amp;gt;\sigma&amp;lt;/math&amp;gt; es una funcion escalon que produce 1 si el argumento es positivo y 0 en caso contrario. El perceptron puede aprender cualquier funcion linealmente separable, pero notoriamente no puede representar la funcion XOR — una limitacion que detuvo la investigacion en redes neuronales durante mas de una decada.&lt;br /&gt;
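El calculo del perceptron puede esbozarse directamente (Python/NumPy; pesos elegidos a mano para la funcion AND, que si es linealmente separable):

```python
import numpy as np

def perceptron(x, w, b):
    """Salida del perceptron: funcion escalon sobre w.x + b."""
    return 1 if np.dot(w, x) + b > 0 else 0

# AND es linealmente separable: w = (1, 1), b = -1.5
w, b = np.array([1.0, 1.0]), -1.5
salidas_and = [perceptron(np.array(x), w, b)
               for x in [(0, 0), (0, 1), (1, 0), (1, 1)]]
# [0, 0, 0, 1]
```

Ningun par (w, b) reproduce de esta forma la tabla de XOR, que es justamente la limitacion senalada arriba.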
&lt;br /&gt;
== Redes prealimentadas ==&lt;br /&gt;
&lt;br /&gt;
Una &#039;&#039;&#039;red neuronal prealimentada&#039;&#039;&#039; (tambien llamada &#039;&#039;&#039;perceptron multicapa&#039;&#039;&#039; o MLP) apila multiples capas de neuronas. La informacion fluye en una sola direccion — desde la &#039;&#039;&#039;capa de entrada&#039;&#039;&#039; a traves de una o mas &#039;&#039;&#039;capas ocultas&#039;&#039;&#039; hasta la &#039;&#039;&#039;capa de salida&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
Para una red con una capa oculta, el calculo es:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\mathbf{h} = g(\mathbf{W}_1 \mathbf{x} + \mathbf{b}_1)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\mathbf{y} = f(\mathbf{W}_2 \mathbf{h} + \mathbf{b}_2)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
donde &amp;lt;math&amp;gt;g&amp;lt;/math&amp;gt; y &amp;lt;math&amp;gt;f&amp;lt;/math&amp;gt; son funciones de activacion, &amp;lt;math&amp;gt;\mathbf{W}_1, \mathbf{W}_2&amp;lt;/math&amp;gt; son matrices de pesos y &amp;lt;math&amp;gt;\mathbf{b}_1, \mathbf{b}_2&amp;lt;/math&amp;gt; son vectores de sesgo. La capa oculta permite a la red aprender relaciones no lineales que un unico perceptron no puede capturar.&lt;br /&gt;
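El calculo de dos capas anterior puede esbozarse asi (Python/NumPy; pesos elegidos a mano como ilustracion, con g = ReLU y f = identidad):

```python
import numpy as np

def mlp_adelante(x, W1, b1, W2, b2):
    """Pasada hacia adelante de un MLP con una capa oculta:
    h = g(W1 x + b1), y = f(W2 h + b2)."""
    h = np.maximum(0.0, W1 @ x + b1)   # g = ReLU
    return W2 @ h + b2                  # f = identidad

# Pesos elegidos a mano para que la red calcule XOR(x1, x2):
W1 = np.array([[1.0, 1.0], [1.0, 1.0]])
b1 = np.array([0.0, -1.0])
W2 = np.array([[1.0, -2.0]])
b2 = np.array([0.0])
salidas = [float(mlp_adelante(np.array(x), W1, b1, W2, b2)[0])
           for x in [(0, 0), (0, 1), (1, 0), (1, 1)]]
# [0.0, 1.0, 1.0, 0.0] — exactamente la XOR que un perceptron no puede representar
```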
&lt;br /&gt;
Las redes con muchas capas ocultas se denominan redes neuronales &#039;&#039;&#039;profundas&#039;&#039;&#039;, y su entrenamiento es el tema del &#039;&#039;&#039;aprendizaje profundo&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
== Funciones de activacion ==&lt;br /&gt;
&lt;br /&gt;
La funcion de activacion introduce no linealidad; sin ella, una red multicapa se reduciria a una unica transformacion lineal. Las opciones comunes incluyen:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! Funcion !! Formula !! Rango !! Notas&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;Sigmoide&#039;&#039;&#039; || &amp;lt;math&amp;gt;\sigma(z) = \frac{1}{1+e^{-z}}&amp;lt;/math&amp;gt; || (0, 1) || Historicamente popular; sufre de gradientes que se desvanecen&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;Tanh&#039;&#039;&#039; || &amp;lt;math&amp;gt;\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}&amp;lt;/math&amp;gt; || (−1, 1) || Centrada en cero; aun se satura para entradas grandes&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;ReLU&#039;&#039;&#039; || &amp;lt;math&amp;gt;\max(0, z)&amp;lt;/math&amp;gt; || [0, ∞) || Opcion predeterminada en redes modernas; puede causar &amp;quot;neuronas muertas&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;Leaky ReLU&#039;&#039;&#039; || &amp;lt;math&amp;gt;\max(\alpha z, z)&amp;lt;/math&amp;gt; para &amp;lt;math&amp;gt;\alpha &amp;gt; 0&amp;lt;/math&amp;gt; pequeno || (−∞, ∞) || Aborda el problema de las neuronas muertas&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;Softmax&#039;&#039;&#039; || &amp;lt;math&amp;gt;\frac{e^{z_i}}{\sum_j e^{z_j}}&amp;lt;/math&amp;gt; || (0, 1) || Utilizada en la capa de salida para clasificacion multiclase&lt;br /&gt;
|}&lt;br /&gt;
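Las funciones de la tabla pueden implementarse en pocas lineas (esbozo en Python/NumPy):

```python
import numpy as np

def sigmoide(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    return np.maximum(alpha * z, z)

def softmax(z):
    e = np.exp(z - np.max(z))  # restar el maximo da estabilidad numerica
    return e / e.sum()

z = np.array([-2.0, 0.0, 2.0])
# sigmoide(0) = 0.5; relu anula los negativos; softmax devuelve una
# distribucion de probabilidad (componentes positivas que suman 1)
```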
&lt;br /&gt;
== Teorema de aproximacion universal ==&lt;br /&gt;
&lt;br /&gt;
El &#039;&#039;&#039;teorema de aproximacion universal&#039;&#039;&#039; (Cybenko 1989, Hornik 1991) establece que una red prealimentada con una unica capa oculta que contenga un numero finito de neuronas puede aproximar cualquier funcion continua en un subconjunto compacto de &amp;lt;math&amp;gt;\mathbb{R}^n&amp;lt;/math&amp;gt; con precision arbitraria, siempre que la funcion de activacion satisfaga condiciones suaves (por ejemplo, que sea no constante, acotada y continua).&lt;br /&gt;
&lt;br /&gt;
Este teorema garantiza la &#039;&#039;existencia&#039;&#039; de una buena aproximacion pero no dice nada sobre como &#039;&#039;encontrarla&#039;&#039; — en la practica, entrenar redes profundas con muchas capas es mucho mas efectivo que utilizar una unica capa ancha.&lt;br /&gt;
&lt;br /&gt;
== Vision general del entrenamiento ==&lt;br /&gt;
&lt;br /&gt;
El entrenamiento de una red neuronal implica:&lt;br /&gt;
&lt;br /&gt;
# &#039;&#039;&#039;Definir una funcion de perdida&#039;&#039;&#039; — una medida de cuan lejos estan las predicciones de la red de los objetivos verdaderos (vease [[Loss Functions]]).&lt;br /&gt;
# &#039;&#039;&#039;Pasada hacia adelante&#039;&#039;&#039; — calcular la salida de la red para una entrada dada propagando los valores capa por capa.&lt;br /&gt;
# &#039;&#039;&#039;Pasada hacia atras (backpropagation)&#039;&#039;&#039; — calcular el gradiente de la perdida con respecto a cada peso aplicando la regla de la cadena en orden inverso a traves de la red (vease [[Backpropagation]]).&lt;br /&gt;
# &#039;&#039;&#039;Actualizacion de parametros&#039;&#039;&#039; — ajustar los pesos utilizando un algoritmo de optimizacion como el [[Gradient Descent|descenso de gradiente]] o alguna de sus variantes.&lt;br /&gt;
# &#039;&#039;&#039;Iteracion&#039;&#039;&#039; — repetir los pasos 2–4 durante muchas pasadas (epocas) sobre los datos de entrenamiento.&lt;br /&gt;
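Los pasos anteriores pueden reunirse en un bucle de entrenamiento minimo; como ilustracion se usa regresion lineal con perdida MSE, donde el gradiente tiene forma analitica (esbozo en Python/NumPy, nombres ilustrativos):

```python
import numpy as np

def entrenar(X, y, lr=0.1, epocas=200):
    """Bucle de entrenamiento minimo: pasada hacia adelante,
    gradiente de la perdida MSE y actualizacion de parametros."""
    rng = np.random.default_rng(0)
    w = rng.normal(size=X.shape[1])               # inicializacion
    for _ in range(epocas):
        y_hat = X @ w                             # pasada hacia adelante
        grad = -2.0 / len(y) * X.T @ (y - y_hat)  # gradiente de la perdida
        w -= lr * grad                            # actualizacion
    return w

# Datos sinteticos generados por y = 2*x1 - 3*x2:
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y = X @ np.array([2.0, -3.0])
w = entrenar(X, y)
# w converge a aproximadamente [2, -3]
```

En una red neuronal real, el gradiente analitico se sustituye por la pasada hacia atras (backpropagation) sobre todas las capas.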
&lt;br /&gt;
Un entrenamiento exitoso tambien requiere atencion a la &#039;&#039;&#039;inicializacion&#039;&#039;&#039; (por ejemplo, esquemas de Xavier o He), la &#039;&#039;&#039;regularizacion&#039;&#039;&#039; (para prevenir el [[Overfitting and Regularization|sobreajuste]]) y el &#039;&#039;&#039;ajuste de hiperparametros&#039;&#039;&#039; (tasa de aprendizaje, tamano de lote, arquitectura de la red).&lt;br /&gt;
&lt;br /&gt;
== Arquitecturas comunes ==&lt;br /&gt;
&lt;br /&gt;
Mas alla de la red prealimentada basica, se han desarrollado varias arquitecturas especializadas:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;[[Convolutional Neural Networks]]&#039;&#039;&#039; (CNN) — disenadas para datos con estructura de cuadricula como imagenes, utilizando conectividad local y comparticion de pesos.&lt;br /&gt;
* &#039;&#039;&#039;[[Recurrent Neural Networks]]&#039;&#039;&#039; (RNN) — disenadas para datos secuenciales, con conexiones que forman ciclos para mantener un estado oculto.&lt;br /&gt;
* &#039;&#039;&#039;Transformers&#039;&#039;&#039; — arquitecturas basadas en atencion que se han convertido en dominantes en el procesamiento del lenguaje natural y cada vez mas en vision.&lt;br /&gt;
* &#039;&#039;&#039;Autoencoders&#039;&#039;&#039; — redes entrenadas para reconstruir su entrada, utilizadas para reduccion de dimensionalidad y modelado generativo.&lt;br /&gt;
* &#039;&#039;&#039;Redes generativas adversarias&#039;&#039;&#039; (GAN) — pares de redes (generador y discriminador) entrenadas en competencia para generar datos realistas.&lt;br /&gt;
&lt;br /&gt;
== Aplicaciones ==&lt;br /&gt;
&lt;br /&gt;
Las redes neuronales se aplican en una amplia gama de dominios:&lt;br /&gt;
&lt;br /&gt;
* Vision por computador (clasificacion de imagenes, deteccion de objetos, segmentacion)&lt;br /&gt;
* Procesamiento del lenguaje natural (traduccion, resumen, respuesta a preguntas)&lt;br /&gt;
* Reconocimiento y sintesis de voz&lt;br /&gt;
* Juegos (AlphaGo, agentes de Atari)&lt;br /&gt;
* Descubrimiento cientifico (plegamiento de proteinas, diseno de farmacos, prediccion meteorologica)&lt;br /&gt;
* Vehiculos autonomos y robotica&lt;br /&gt;
&lt;br /&gt;
== Vease tambien ==&lt;br /&gt;
&lt;br /&gt;
* [[Gradient Descent]]&lt;br /&gt;
* [[Backpropagation]]&lt;br /&gt;
* [[Loss Functions]]&lt;br /&gt;
* [[Convolutional Neural Networks]]&lt;br /&gt;
* [[Recurrent Neural Networks]]&lt;br /&gt;
* [[Overfitting and Regularization]]&lt;br /&gt;
&lt;br /&gt;
== Referencias ==&lt;br /&gt;
&lt;br /&gt;
* Rosenblatt, F. (1958). &amp;quot;The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain&amp;quot;. &#039;&#039;Psychological Review&#039;&#039;.&lt;br /&gt;
* Cybenko, G. (1989). &amp;quot;Approximation by Superpositions of a Sigmoidal Function&amp;quot;. &#039;&#039;Mathematics of Control, Signals, and Systems&#039;&#039;.&lt;br /&gt;
* Hornik, K. (1991). &amp;quot;Approximation Capabilities of Multilayer Feedforward Networks&amp;quot;. &#039;&#039;Neural Networks&#039;&#039;.&lt;br /&gt;
* LeCun, Y., Bengio, Y. and Hinton, G. (2015). &amp;quot;Deep learning&amp;quot;. &#039;&#039;Nature&#039;&#039;, 521, 436–444.&lt;br /&gt;
* Goodfellow, I., Bengio, Y. and Courville, A. (2016). &#039;&#039;Deep Learning&#039;&#039;. MIT Press.&lt;br /&gt;
&lt;br /&gt;
[[Category:Deep Learning]]&lt;br /&gt;
[[Category:Introductory]]&lt;br /&gt;
[[Category:Neural Networks]]&lt;/div&gt;</summary>
		<author><name>DeployBot</name></author>
	</entry>
	<entry>
		<id>https://marovi.ai/index.php?title=Loss_Functions/es&amp;diff=2155</id>
		<title>Loss Functions/es</title>
		<link rel="alternate" type="text/html" href="https://marovi.ai/index.php?title=Loss_Functions/es&amp;diff=2155"/>
		<updated>2026-04-24T07:09:01Z</updated>

		<summary type="html">&lt;p&gt;DeployBot: [deploy-bot] Deploy from CI (8c92aeb)&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{LanguageBar | page = Loss Functions}}&lt;br /&gt;
{{ArticleInfobox | topic_area = Machine Learning | difficulty = Introductory | prerequisites = }}&lt;br /&gt;
{{ContentMeta | generated_by = claude-opus | model_used = claude-opus-4-6 | generated_date = 2026-03-13}}&lt;br /&gt;
&lt;br /&gt;
Las &#039;&#039;&#039;funciones de perdida&#039;&#039;&#039; (tambien llamadas &#039;&#039;&#039;funciones de coste&#039;&#039;&#039; o &#039;&#039;&#039;funciones objetivo&#039;&#039;&#039;) cuantifican cuan lejos estan las predicciones de un modelo del resultado deseado. Minimizar la funcion de perdida es el objetivo central del proceso de entrenamiento en el aprendizaje automatico: el algoritmo de optimizacion ajusta los parametros del modelo para reducir la perdida al minimo posible.&lt;br /&gt;
&lt;br /&gt;
== Proposito ==&lt;br /&gt;
&lt;br /&gt;
Una funcion de perdida mapea la prediccion del modelo &amp;lt;math&amp;gt;\hat{y}&amp;lt;/math&amp;gt; y el objetivo verdadero &amp;lt;math&amp;gt;y&amp;lt;/math&amp;gt; a un numero real no negativo. Formalmente, para un unico ejemplo:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\ell: \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_{\geq 0}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Sobre un conjunto de datos de &amp;lt;math&amp;gt;N&amp;lt;/math&amp;gt; ejemplos, la perdida total es tipicamente el promedio:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;L(\theta) = \frac{1}{N}\sum_{i=1}^{N}\ell\bigl(y_i,\, \hat{y}_i(\theta)\bigr)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
La eleccion de la funcion de perdida codifica la estructura del problema — que tipo de errores importan y con que severidad deben ser penalizados. Una funcion de perdida mal elegida puede llevar a un modelo que optimiza el objetivo equivocado.&lt;br /&gt;
&lt;br /&gt;
== Error cuadratico medio ==&lt;br /&gt;
&lt;br /&gt;
El &#039;&#039;&#039;error cuadratico medio&#039;&#039;&#039; (MSE, por sus siglas en ingles) es la perdida predeterminada para tareas de regresion:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;L_{\text{MSE}} = \frac{1}{N}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
El MSE penaliza los errores grandes de forma cuadratica, lo que lo hace sensible a valores atipicos. Su gradiente es directo:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\frac{\partial}{\partial \hat{y}_i} (y_i - \hat{y}_i)^2 = -2(y_i - \hat{y}_i)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Una variante estrechamente relacionada es el &#039;&#039;&#039;error absoluto medio&#039;&#039;&#039; (MAE), &amp;lt;math&amp;gt;\frac{1}{N}\sum|y_i - \hat{y}_i|&amp;lt;/math&amp;gt;, que es mas robusto ante valores atipicos pero tiene un gradiente no suave en cero. La &#039;&#039;&#039;perdida de Huber&#039;&#039;&#039; combina ambas: se comporta como el MSE para errores pequenos y como el MAE para errores grandes.&lt;br /&gt;
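Las tres perdidas de regresion pueden compararse sobre un ejemplo con un valor atipico (esbozo en Python/NumPy):

```python
import numpy as np

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

def mae(y, y_hat):
    return np.mean(np.abs(y - y_hat))

def huber(y, y_hat, delta=1.0):
    """Cuadratica para errores pequenos, lineal para errores grandes."""
    e = np.abs(y - y_hat)
    cuadratica = 0.5 * e ** 2
    lineal = delta * (e - 0.5 * delta)
    return np.mean(np.where(e <= delta, cuadratica, lineal))

y = np.array([0.0, 0.0, 0.0])
y_hat = np.array([1.0, -1.0, 10.0])  # el 10 es un valor atipico
# mse = (1 + 1 + 100) / 3 = 34; mae = (1 + 1 + 10) / 3 = 4
```

El valor atipico domina el MSE pero apenas pesa mas en el MAE, lo que ilustra la diferencia de robustez.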
&lt;br /&gt;
== Perdida de entropia cruzada ==&lt;br /&gt;
&lt;br /&gt;
La &#039;&#039;&#039;perdida de entropia cruzada&#039;&#039;&#039; es la opcion estandar para tareas de clasificacion. Mide la disimilitud entre la distribucion de probabilidad predicha y la distribucion de la etiqueta verdadera.&lt;br /&gt;
&lt;br /&gt;
=== Entropia cruzada binaria ===&lt;br /&gt;
&lt;br /&gt;
Para clasificacion binaria con probabilidad predicha &amp;lt;math&amp;gt;p&amp;lt;/math&amp;gt; y etiqueta verdadera &amp;lt;math&amp;gt;y \in \{0, 1\}&amp;lt;/math&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;L_{\text{BCE}} = -\frac{1}{N}\sum_{i=1}^{N}\bigl[y_i \log p_i + (1 - y_i)\log(1 - p_i)\bigr]&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Esta perdida se minimiza cuando la probabilidad predicha coincide perfectamente con la etiqueta verdadera (&amp;lt;math&amp;gt;p = 1&amp;lt;/math&amp;gt; cuando &amp;lt;math&amp;gt;y = 1&amp;lt;/math&amp;gt; y &amp;lt;math&amp;gt;p = 0&amp;lt;/math&amp;gt; cuando &amp;lt;math&amp;gt;y = 0&amp;lt;/math&amp;gt;).&lt;br /&gt;
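La formula puede implementarse directamente (esbozo en Python/NumPy; el recorte con `eps` es una precaucion numerica habitual, no parte de la definicion):

```python
import numpy as np

def bce(y, p, eps=1e-12):
    """Entropia cruzada binaria; eps evita log(0)."""
    p = np.clip(p, eps, 1.0 - eps)
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

y = np.array([1.0, 0.0, 1.0])
p_buena = np.array([0.9, 0.1, 0.8])   # probabilidades bien alineadas
p_mala = np.array([0.6, 0.5, 0.4])    # probabilidades poco informativas
# bce(y, p_buena) < bce(y, p_mala)
```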
&lt;br /&gt;
=== Entropia cruzada categorica ===&lt;br /&gt;
&lt;br /&gt;
Para clasificacion multiclase con &amp;lt;math&amp;gt;C&amp;lt;/math&amp;gt; clases y vector de probabilidad predicho &amp;lt;math&amp;gt;\hat{\mathbf{y}}&amp;lt;/math&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;L_{\text{CE}} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} y_{i,c} \log \hat{y}_{i,c}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Cuando las etiquetas verdaderas estan codificadas en formato one-hot, solo sobrevive el termino correspondiente a la clase correcta.&lt;br /&gt;
&lt;br /&gt;
== Perdida de bisagra ==&lt;br /&gt;
&lt;br /&gt;
La &#039;&#039;&#039;perdida de bisagra&#039;&#039;&#039; esta asociada con las maquinas de vectores de soporte (SVM) y los clasificadores de margen maximo. Para un problema de clasificacion binaria con etiquetas &amp;lt;math&amp;gt;y \in \{-1, +1\}&amp;lt;/math&amp;gt; y salida cruda del modelo &amp;lt;math&amp;gt;s&amp;lt;/math&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;L_{\text{hinge}} = \frac{1}{N}\sum_{i=1}^{N}\max(0,\; 1 - y_i \, s_i)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
La perdida de bisagra es cero cuando la prediccion tiene el signo correcto con un margen de al menos 1, y aumenta linealmente en caso contrario. Dado que no es diferenciable en el punto de bisagra, se utilizan metodos de subgradiente para la optimizacion.&lt;br /&gt;
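Un esbozo minimo en Python/NumPy ilustra el comportamiento del margen:

```python
import numpy as np

def perdida_bisagra(y, s):
    """Perdida de bisagra para etiquetas y en {-1, +1} y puntuaciones s."""
    return np.mean(np.maximum(0.0, 1.0 - y * s))

y = np.array([+1.0, -1.0, +1.0])
s = np.array([2.0, -0.5, -1.0])
# margenes y*s: 2.0 (sin perdida), 0.5 (perdida 0.5), -1.0 (perdida 2.0)
# perdida media = (0 + 0.5 + 2.0) / 3
```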
&lt;br /&gt;
== Otras funciones de perdida comunes ==&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! Perdida !! Formula !! Uso tipico&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;Huber&#039;&#039;&#039; || &amp;lt;math&amp;gt;\begin{cases}\tfrac{1}{2}(y-\hat{y})^2 &amp;amp; |y-\hat{y}|\leq\delta \\ \delta(|y-\hat{y}|-\tfrac{\delta}{2}) &amp;amp; \text{otherwise}\end{cases}&amp;lt;/math&amp;gt; || Regresion robusta&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;Divergencia KL&#039;&#039;&#039; || &amp;lt;math&amp;gt;\sum_c p_c \log\frac{p_c}{q_c}&amp;lt;/math&amp;gt; || Ajuste de distribuciones, VAE&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;Perdida focal&#039;&#039;&#039; || &amp;lt;math&amp;gt;-\alpha(1-p_t)^\gamma \log p_t&amp;lt;/math&amp;gt; || Clasificacion desbalanceada&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;Perdida CTC&#039;&#039;&#039; || Programacion dinamica sobre alineamientos || Reconocimiento de voz, OCR&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;Perdida de tripleta&#039;&#039;&#039; || &amp;lt;math&amp;gt;\max(0,\; d(a,p) - d(a,n) + m)&amp;lt;/math&amp;gt; || Aprendizaje de metricas, verificacion facial&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== Eleccion de la perdida adecuada ==&lt;br /&gt;
&lt;br /&gt;
La funcion de perdida apropiada depende de la tarea:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Regresion&#039;&#039;&#039; — el MSE es la opcion predeterminada; se cambia a MAE o Huber si los valores atipicos son una preocupacion.&lt;br /&gt;
* &#039;&#039;&#039;Clasificacion binaria&#039;&#039;&#039; — entropia cruzada binaria con salida sigmoide.&lt;br /&gt;
* &#039;&#039;&#039;Clasificacion multiclase&#039;&#039;&#039; — entropia cruzada categorica con salida softmax.&lt;br /&gt;
* &#039;&#039;&#039;Clasificacion multietiqueta&#039;&#039;&#039; — entropia cruzada binaria aplicada independientemente por etiqueta.&lt;br /&gt;
* &#039;&#039;&#039;Ranking o recuperacion&#039;&#039;&#039; — perdida contrastiva, perdida de tripleta o perdidas de ranking por lista.&lt;br /&gt;
&lt;br /&gt;
Una consideracion importante es si la perdida esta &#039;&#039;&#039;calibrada&#039;&#039;&#039; — es decir, si minimizarla produce probabilidades predichas bien calibradas. La entropia cruzada es una regla de puntuacion propia y produce probabilidades calibradas, mientras que la perdida de bisagra no.&lt;br /&gt;
&lt;br /&gt;
== Terminos de regularizacion ==&lt;br /&gt;
&lt;br /&gt;
En la practica, el objetivo total a menudo incluye un &#039;&#039;&#039;termino de regularizacion&#039;&#039;&#039; que penaliza la complejidad del modelo:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;J(\theta) = L(\theta) + \lambda \, R(\theta)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
donde &amp;lt;math&amp;gt;\lambda&amp;lt;/math&amp;gt; controla la intensidad de la regularizacion. Las opciones comunes incluyen la regularizacion L2 (&amp;lt;math&amp;gt;R = \|\theta\|_2^2&amp;lt;/math&amp;gt;) y la regularizacion L1 (&amp;lt;math&amp;gt;R = \|\theta\|_1&amp;lt;/math&amp;gt;). Vease [[Overfitting and Regularization]] para mas detalles.&lt;br /&gt;
&lt;br /&gt;
== Vease tambien ==&lt;br /&gt;
&lt;br /&gt;
* [[Gradient Descent]]&lt;br /&gt;
* [[Neural Networks]]&lt;br /&gt;
* [[Backpropagation]]&lt;br /&gt;
* [[Overfitting and Regularization]]&lt;br /&gt;
* [[Stochastic Gradient Descent]]&lt;br /&gt;
&lt;br /&gt;
== Referencias ==&lt;br /&gt;
&lt;br /&gt;
* Bishop, C. M. (2006). &#039;&#039;Pattern Recognition and Machine Learning&#039;&#039;, Chapter 1. Springer.&lt;br /&gt;
* Goodfellow, I., Bengio, Y. and Courville, A. (2016). &#039;&#039;Deep Learning&#039;&#039;, Chapters 6 and 8. MIT Press.&lt;br /&gt;
* Lin, T.-Y. et al. (2017). &amp;quot;Focal Loss for Dense Object Detection&amp;quot;. &#039;&#039;ICCV&#039;&#039;.&lt;br /&gt;
* Murphy, K. P. (2022). &#039;&#039;Probabilistic Machine Learning: An Introduction&#039;&#039;. MIT Press.&lt;br /&gt;
&lt;br /&gt;
[[Category:Machine Learning]]&lt;br /&gt;
[[Category:Introductory]]&lt;/div&gt;</summary>
		<author><name>DeployBot</name></author>
	</entry>
	<entry>
		<id>https://marovi.ai/index.php?title=Linear_Regression/es&amp;diff=2154</id>
		<title>Linear Regression/es</title>
		<link rel="alternate" type="text/html" href="https://marovi.ai/index.php?title=Linear_Regression/es&amp;diff=2154"/>
		<updated>2026-04-24T07:09:01Z</updated>

		<summary type="html">&lt;p&gt;DeployBot: [deploy-bot] Deploy from CI (8c92aeb)&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{LanguageBar | page = Linear Regression}}&lt;br /&gt;
{{ArticleInfobox | topic_area = Statistics | difficulty = Introductory | prerequisites = }}&lt;br /&gt;
{{ContentMeta | generated_by = claude-opus | model_used = claude-opus-4-6 | generated_date = 2026-03-13}}&lt;br /&gt;
&lt;br /&gt;
La &#039;&#039;&#039;regresion lineal&#039;&#039;&#039; es un metodo estadistico fundamental que modela la relacion entre una variable dependiente y una o mas variables independientes ajustando una ecuacion lineal a los datos observados. Es una de las tecnicas mas antiguas y ampliamente utilizadas en estadistica y aprendizaje automatico, sirviendo tanto como herramienta predictiva practica como bloque de construccion para comprender modelos mas complejos.&lt;br /&gt;
&lt;br /&gt;
== Planteamiento del problema ==&lt;br /&gt;
&lt;br /&gt;
Dado un conjunto de datos de &amp;lt;math&amp;gt;N&amp;lt;/math&amp;gt; observaciones &amp;lt;math&amp;gt;\{(\mathbf{x}_i, y_i)\}_{i=1}^{N}&amp;lt;/math&amp;gt;, donde &amp;lt;math&amp;gt;\mathbf{x}_i \in \mathbb{R}^d&amp;lt;/math&amp;gt; es un vector de caracteristicas y &amp;lt;math&amp;gt;y_i \in \mathbb{R}&amp;lt;/math&amp;gt; es el objetivo, la regresion lineal asume la relacion:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;y_i = \mathbf{w}^{\!\top} \mathbf{x}_i + b + \epsilon_i&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
donde &amp;lt;math&amp;gt;\mathbf{w} \in \mathbb{R}^d&amp;lt;/math&amp;gt; es el vector de pesos, &amp;lt;math&amp;gt;b&amp;lt;/math&amp;gt; es el sesgo (intercepto) y &amp;lt;math&amp;gt;\epsilon_i&amp;lt;/math&amp;gt; es el termino de error. Al absorber el sesgo en el vector de pesos (anadiendo un 1 a cada &amp;lt;math&amp;gt;\mathbf{x}_i&amp;lt;/math&amp;gt;), esto se simplifica a &amp;lt;math&amp;gt;y_i = \mathbf{w}^{\!\top} \mathbf{x}_i + \epsilon_i&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Mínimos cuadrados ordinarios ==&lt;br /&gt;
&lt;br /&gt;
El método de &#039;&#039;&#039;mínimos cuadrados ordinarios&#039;&#039;&#039; (MCO) encuentra los pesos que minimizan la suma de los residuos al cuadrado:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\mathcal{L}(\mathbf{w}) = \sum_{i=1}^{N} (y_i - \mathbf{w}^{\!\top} \mathbf{x}_i)^2 = \|\mathbf{y} - X\mathbf{w}\|^2&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
donde &amp;lt;math&amp;gt;X \in \mathbb{R}^{N \times d}&amp;lt;/math&amp;gt; es la matriz de diseño y &amp;lt;math&amp;gt;\mathbf{y} \in \mathbb{R}^N&amp;lt;/math&amp;gt; es el vector de objetivos.&lt;br /&gt;
&lt;br /&gt;
=== Solución en forma cerrada ===&lt;br /&gt;
&lt;br /&gt;
Igualando el gradiente a cero se obtienen las &#039;&#039;&#039;ecuaciones normales&#039;&#039;&#039;:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\nabla_{\mathbf{w}} \mathcal{L} = -2 X^{\!\top}(\mathbf{y} - X\mathbf{w}) = 0&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\hat{\mathbf{w}} = (X^{\!\top} X)^{-1} X^{\!\top} \mathbf{y}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Esta solución existe y es única cuando &amp;lt;math&amp;gt;X^{\!\top} X&amp;lt;/math&amp;gt; es invertible (es decir, las características son linealmente independientes). El coste computacional es &amp;lt;math&amp;gt;O(Nd^2 + d^3)&amp;lt;/math&amp;gt;, lo cual es eficiente para &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt; moderado pero se vuelve costoso para problemas de alta dimensionalidad.&lt;br /&gt;
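&lt;br /&gt;
Como comprobación ilustrativa (esbozo ajeno al artículo original, con datos sintéticos hipotéticos), las ecuaciones normales pueden verificarse con NumPy:&lt;br /&gt;
&lt;br /&gt;
```python
import numpy as np

# Datos sinteticos (hipoteticos): y = 2*x1 - 3*x2 + 1 + ruido pequeno
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 2 * X[:, 0] - 3 * X[:, 1] + 1 + 0.01 * rng.normal(size=200)

# Absorber el sesgo anadiendo una columna de unos a la matriz de diseno
Xb = np.column_stack([X, np.ones(len(X))])

# Ecuaciones normales: resolver (X^T X) w = X^T y
w_hat = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)
print(w_hat)  # aproximadamente [2, -3, 1]
```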
&lt;br /&gt;
=== Enfoque por descenso de gradiente ===&lt;br /&gt;
&lt;br /&gt;
Cuando la solución en forma cerrada es impracticable (valores grandes de &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt; o &amp;lt;math&amp;gt;N&amp;lt;/math&amp;gt;), se utiliza la optimización iterativa mediante [[Stochastic Gradient Descent|descenso de gradiente]]. El gradiente es:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\nabla_{\mathbf{w}} \mathcal{L} = -\frac{2}{N} X^{\!\top}(\mathbf{y} - X\mathbf{w})&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
La regla de actualización es &amp;lt;math&amp;gt;\mathbf{w} \leftarrow \mathbf{w} - \eta \nabla_{\mathbf{w}} \mathcal{L}&amp;lt;/math&amp;gt;, donde &amp;lt;math&amp;gt;\eta&amp;lt;/math&amp;gt; es la tasa de aprendizaje. Las variantes estocástica y por mini-lotes escalan a millones de puntos de datos.&lt;br /&gt;
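&lt;br /&gt;
Un esbozo mínimo (hipotético, con datos sintéticos sin ruido) de esta regla de actualización aplicada a la pérdida de MCO:&lt;br /&gt;
&lt;br /&gt;
```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true  # objetivos sin ruido para simplificar

w = np.zeros(3)
eta = 0.1  # tasa de aprendizaje
for _ in range(500):
    # Gradiente de la perdida media: -(2/N) X^T (y - X w)
    grad = -2.0 / len(X) * X.T @ (y - X @ w)
    w = w - eta * grad

print(w)  # converge hacia w_true
```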
&lt;br /&gt;
== Supuestos de MCO ==&lt;br /&gt;
&lt;br /&gt;
El estimador MCO clásico es &#039;&#039;&#039;BLUE&#039;&#039;&#039; (Mejor Estimador Lineal Insesgado) bajo las condiciones de Gauss-Markov:&lt;br /&gt;
&lt;br /&gt;
# &#039;&#039;&#039;Linealidad&#039;&#039;&#039;: La relación verdadera entre las características y el objetivo es lineal.&lt;br /&gt;
# &#039;&#039;&#039;Independencia&#039;&#039;&#039;: Las observaciones son independientes entre sí.&lt;br /&gt;
# &#039;&#039;&#039;Homocedasticidad&#039;&#039;&#039;: La varianza del error &amp;lt;math&amp;gt;\mathrm{Var}(\epsilon_i) = \sigma^2&amp;lt;/math&amp;gt; es constante en todas las observaciones.&lt;br /&gt;
# &#039;&#039;&#039;Sin multicolinealidad perfecta&#039;&#039;&#039;: Ninguna característica es una combinación lineal exacta de otras.&lt;br /&gt;
# &#039;&#039;&#039;Exogeneidad&#039;&#039;&#039;: &amp;lt;math&amp;gt;E[\epsilon_i \mid \mathbf{x}_i] = 0&amp;lt;/math&amp;gt; — los errores no están correlacionados con las características.&lt;br /&gt;
&lt;br /&gt;
Las violaciones de estos supuestos no necesariamente hacen que la regresión lineal sea inútil, pero pueden invalidar los intervalos de confianza y las pruebas de hipótesis derivadas del modelo.&lt;br /&gt;
&lt;br /&gt;
== Métricas de evaluación ==&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! Métrica !! Fórmula !! Interpretación&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;MSE&#039;&#039;&#039; || &amp;lt;math&amp;gt;\frac{1}{N}\sum(y_i - \hat{y}_i)^2&amp;lt;/math&amp;gt; || Error cuadrático promedio; penaliza los errores grandes&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;RMSE&#039;&#039;&#039; || &amp;lt;math&amp;gt;\sqrt{\mathrm{MSE}}&amp;lt;/math&amp;gt; || En las mismas unidades que el objetivo&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;MAE&#039;&#039;&#039; || &amp;lt;math&amp;gt;\frac{1}{N}\sum|y_i - \hat{y}_i|&amp;lt;/math&amp;gt; || Error absoluto promedio; robusto frente a valores atípicos&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;R-cuadrado&#039;&#039;&#039; || &amp;lt;math&amp;gt;1 - \frac{\sum(y_i - \hat{y}_i)^2}{\sum(y_i - \bar{y})^2}&amp;lt;/math&amp;gt; || Proporción de varianza explicada (0 a 1)&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Un &amp;lt;math&amp;gt;R^2&amp;lt;/math&amp;gt; de 1 indica una predicción perfecta, mientras que &amp;lt;math&amp;gt;R^2 = 0&amp;lt;/math&amp;gt; significa que el modelo no es mejor que predecir la media. El &#039;&#039;&#039;R-cuadrado ajustado&#039;&#039;&#039; penaliza por el número de características, previniendo la inflación artificial al añadir predictores irrelevantes.&lt;br /&gt;
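&lt;br /&gt;
Las fórmulas de la tabla se traducen directamente a código; un esbozo ilustrativo (la función metricas es un nombre hipotético):&lt;br /&gt;
&lt;br /&gt;
```python
import numpy as np

def metricas(y, y_hat):
    # MSE, RMSE, MAE y R^2 segun las formulas de la tabla
    mse = np.mean((y - y_hat) ** 2)
    mae = np.mean(np.abs(y - y_hat))
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return mse, np.sqrt(mse), mae, r2

y = np.array([3.0, 5.0, 7.0, 9.0])
print(metricas(y, y)[3])                      # prediccion perfecta: R^2 = 1.0
print(metricas(y, np.full(4, y.mean()))[3])   # predecir la media: R^2 = 0.0
```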
&lt;br /&gt;
== Regresión múltiple ==&lt;br /&gt;
&lt;br /&gt;
Cuando &amp;lt;math&amp;gt;d &amp;gt; 1&amp;lt;/math&amp;gt;, el modelo se denomina &#039;&#039;&#039;regresión lineal múltiple&#039;&#039;&#039;. Cada coeficiente &amp;lt;math&amp;gt;w_j&amp;lt;/math&amp;gt; representa el cambio esperado en &amp;lt;math&amp;gt;y&amp;lt;/math&amp;gt; por unidad de cambio en &amp;lt;math&amp;gt;x_j&amp;lt;/math&amp;gt;, manteniendo todas las demás características constantes. Interpretar los coeficientes requiere cautela cuando las características están correlacionadas (multicolinealidad), ya que los coeficientes individuales pueden volverse inestables aunque el modelo global ajuste bien.&lt;br /&gt;
&lt;br /&gt;
== Variantes regularizadas ==&lt;br /&gt;
&lt;br /&gt;
Cuando el número de características es grande en relación con el número de observaciones, o cuando las características están correlacionadas, MCO puede sobreajustar. La regularización añade una penalización a la función de pérdida:&lt;br /&gt;
&lt;br /&gt;
=== Regresión Ridge (L2) ===&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\mathcal{L}_{\mathrm{ridge}} = \|\mathbf{y} - X\mathbf{w}\|^2 + \lambda \|\mathbf{w}\|_2^2&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
La solución en forma cerrada se convierte en &amp;lt;math&amp;gt;\hat{\mathbf{w}} = (X^{\!\top} X + \lambda I)^{-1} X^{\!\top} \mathbf{y}&amp;lt;/math&amp;gt;. Ridge reduce los coeficientes hacia cero, pero nunca los establece exactamente en cero.&lt;br /&gt;
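&lt;br /&gt;
Un esbozo ilustrativo de esa solución cerrada (la función ridge es un nombre hipotético; los datos son sintéticos):&lt;br /&gt;
&lt;br /&gt;
```python
import numpy as np

def ridge(X, y, lam):
    # Solucion cerrada de ridge: resolver (X^T X + lam I) w = X^T y
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))
y = rng.normal(size=100)

w_mco = ridge(X, y, 0.0)   # lambda = 0 recupera MCO
w_enc = ridge(X, y, 1e6)   # una penalizacion grande encoge los pesos hacia cero
print(np.linalg.norm(w_mco), np.linalg.norm(w_enc))
```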
&lt;br /&gt;
=== Regresión Lasso (L1) ===&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\mathcal{L}_{\mathrm{lasso}} = \|\mathbf{y} - X\mathbf{w}\|^2 + \lambda \|\mathbf{w}\|_1&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Lasso puede llevar los coeficientes exactamente a cero, realizando una &#039;&#039;&#039;selección automática de características&#039;&#039;&#039;. No tiene solución en forma cerrada y se resuelve típicamente mediante descenso por coordenadas.&lt;br /&gt;
&lt;br /&gt;
=== Elastic Net ===&lt;br /&gt;
&lt;br /&gt;
Elastic Net combina ambas penalizaciones: &amp;lt;math&amp;gt;\lambda_1 \|\mathbf{w}\|_1 + \lambda_2 \|\mathbf{w}\|_2^2&amp;lt;/math&amp;gt;, equilibrando esparsidad y estabilidad.&lt;br /&gt;
&lt;br /&gt;
== Consideraciones prácticas ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Escalado de características&#039;&#039;&#039;: Estandarizar las características (media cero, varianza unitaria) mejora la convergencia del descenso de gradiente y hace que la regularización sea equitativa entre las características.&lt;br /&gt;
* &#039;&#039;&#039;Características polinómicas&#039;&#039;&#039;: Añadir términos polinómicos (por ejemplo, &amp;lt;math&amp;gt;x^2, x_1 x_2&amp;lt;/math&amp;gt;) permite a la regresión lineal capturar relaciones no lineales.&lt;br /&gt;
* &#039;&#039;&#039;Valores atípicos&#039;&#039;&#039;: MCO es sensible a los valores atípicos debido a la pérdida cuadrática. Las alternativas robustas incluyen la regresión de Huber y RANSAC.&lt;br /&gt;
* &#039;&#039;&#039;Gráficos de diagnóstico&#039;&#039;&#039;: Los gráficos de residuos ayudan a detectar violaciones de los supuestos (no linealidad, heterocedasticidad, no normalidad).&lt;br /&gt;
&lt;br /&gt;
== Véase también ==&lt;br /&gt;
&lt;br /&gt;
* [[Stochastic Gradient Descent]]&lt;br /&gt;
* [[Logistic regression]]&lt;br /&gt;
* [[Loss Functions]]&lt;br /&gt;
* [[Overfitting and Regularization]]&lt;br /&gt;
* [[Neural Networks]]&lt;br /&gt;
&lt;br /&gt;
== Referencias ==&lt;br /&gt;
&lt;br /&gt;
* Hastie, T., Tibshirani, R. and Friedman, J. (2009). &#039;&#039;The Elements of Statistical Learning&#039;&#039;. Springer, Chapter 3.&lt;br /&gt;
* Montgomery, D. C., Peck, E. A. and Vining, G. G. (2012). &#039;&#039;Introduction to Linear Regression Analysis&#039;&#039;. Wiley.&lt;br /&gt;
* Hoerl, A. E. and Kennard, R. W. (1970). &amp;quot;Ridge Regression: Biased Estimation for Nonorthogonal Problems&amp;quot;. &#039;&#039;Technometrics&#039;&#039;.&lt;br /&gt;
* Tibshirani, R. (1996). &amp;quot;Regression Shrinkage and Selection via the Lasso&amp;quot;. &#039;&#039;Journal of the Royal Statistical Society, Series B&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
[[Category:Statistics]]&lt;br /&gt;
[[Category:Introductory]]&lt;/div&gt;</summary>
		<author><name>DeployBot</name></author>
	</entry>
	<entry>
		<id>https://marovi.ai/index.php?title=Gradient_Descent/es&amp;diff=2153</id>
		<title>Gradient Descent/es</title>
		<link rel="alternate" type="text/html" href="https://marovi.ai/index.php?title=Gradient_Descent/es&amp;diff=2153"/>
		<updated>2026-04-24T07:09:00Z</updated>

		<summary type="html">&lt;p&gt;DeployBot: [deploy-bot] Deploy from CI (8c92aeb)&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{LanguageBar | page = Gradient Descent}}&lt;br /&gt;
{{ArticleInfobox | topic_area = Optimization | difficulty = Introductory | prerequisites = }}&lt;br /&gt;
{{ContentMeta | generated_by = claude-opus | model_used = claude-opus-4-6 | generated_date = 2026-03-13}}&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Descenso de gradiente&#039;&#039;&#039; es un algoritmo de optimización iterativo de primer orden para encontrar un mínimo local de una función diferenciable. Constituye la base de prácticamente todos los procedimientos de entrenamiento del aprendizaje automático moderno, desde la regresión lineal simple hasta las redes neuronales profundas con miles de millones de parámetros.&lt;br /&gt;
&lt;br /&gt;
== Intuición ==&lt;br /&gt;
&lt;br /&gt;
Imagínese de pie en la ladera de una montaña en medio de una niebla espesa. No se puede ver el fondo del valle, pero sí se puede sentir la pendiente bajo los pies. La estrategia más natural consiste en dar un paso en la dirección de mayor descenso y luego reevaluar. El descenso de gradiente formaliza precisamente esta idea: en cada paso, el algoritmo calcula la dirección de mayor crecimiento de la función (el &#039;&#039;&#039;gradiente&#039;&#039;&#039;) y se desplaza en la dirección opuesta.&lt;br /&gt;
&lt;br /&gt;
El tamaño de cada paso está controlado por un escalar denominado &#039;&#039;&#039;tasa de aprendizaje&#039;&#039;&#039; (frecuentemente denotada &amp;lt;math&amp;gt;\eta&amp;lt;/math&amp;gt;). Una tasa de aprendizaje grande cubre terreno rápidamente pero corre el riesgo de sobrepasar el mínimo; una tasa de aprendizaje pequeña converge de forma más fiable pero puede requerir un número prohibitivamente alto de pasos.&lt;br /&gt;
&lt;br /&gt;
== Formulación matemática ==&lt;br /&gt;
&lt;br /&gt;
Dada una función objetivo diferenciable &amp;lt;math&amp;gt;f:\mathbb{R}^n \to \mathbb{R}&amp;lt;/math&amp;gt;, el descenso de gradiente genera una secuencia de iteraciones mediante la &#039;&#039;&#039;regla de actualización&#039;&#039;&#039;:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\theta_{t+1} = \theta_t - \eta \, \nabla f(\theta_t)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
donde &amp;lt;math&amp;gt;\nabla f(\theta_t)&amp;lt;/math&amp;gt; es el vector gradiente evaluado en el punto actual &amp;lt;math&amp;gt;\theta_t&amp;lt;/math&amp;gt; y &amp;lt;math&amp;gt;\eta &amp;gt; 0&amp;lt;/math&amp;gt; es la tasa de aprendizaje.&lt;br /&gt;
&lt;br /&gt;
En el caso unidimensional, esto se simplifica a:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\theta_{t+1} = \theta_t - \eta \, f&#039;(\theta_t)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
El gradiente &amp;lt;math&amp;gt;\nabla f&amp;lt;/math&amp;gt; apunta en la dirección de mayor ascenso, por lo que restarlo mueve la iteración cuesta abajo.&lt;br /&gt;
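&lt;br /&gt;
Un esbozo mínimo (ajeno al artículo) de la regla unidimensional sobre f(theta) = (theta - 3)^2:&lt;br /&gt;
&lt;br /&gt;
```python
# f(theta) = (theta - 3)^2 tiene su minimo en theta = 3,
# con derivada 2 * (theta - 3).
theta = 0.0
eta = 0.1  # tasa de aprendizaje
for _ in range(100):
    theta = theta - eta * 2.0 * (theta - 3.0)
print(theta)  # se aproxima a 3
```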
&lt;br /&gt;
== Variantes por lotes, estocástica y mini-lotes ==&lt;br /&gt;
&lt;br /&gt;
Cuando la función objetivo tiene la forma de un promedio sobre puntos de datos,&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;f(\theta) = \frac{1}{N}\sum_{i=1}^{N} \ell(\theta;\, x_i, y_i)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
tres estrategias comunes difieren en la cantidad de datos utilizada para estimar el gradiente:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! Variante !! Gradiente calculado sobre !! Coste por paso !! Ruido del gradiente&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;Descenso de gradiente por lotes (completo)&#039;&#039;&#039; || Las &amp;lt;math&amp;gt;N&amp;lt;/math&amp;gt; muestras || Alto || Ninguno&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;Descenso de gradiente estocástico (SGD)&#039;&#039;&#039; || 1 muestra aleatoria || Bajo || Alto&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;Descenso de gradiente por mini-lotes&#039;&#039;&#039; || &amp;lt;math&amp;gt;B&amp;lt;/math&amp;gt; muestras aleatorias (&amp;lt;math&amp;gt;1 &amp;lt; B &amp;lt; N&amp;lt;/math&amp;gt;) || Medio || Medio&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
El descenso de gradiente por lotes completo calcula el gradiente exacto y, por lo tanto, sigue una trayectoria suave hacia el mínimo. [[Stochastic Gradient Descent|El descenso de gradiente estocástico]] utiliza una única muestra para estimar el gradiente, lo que reduce drásticamente el cálculo por paso a costa de una trayectoria más ruidosa. El descenso de gradiente por mini-lotes logra un equilibrio y es la opción más común en la práctica, con tamaños de lote típicos entre 32 y 512.&lt;br /&gt;
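&lt;br /&gt;
La variante por mini-lotes puede esbozarse así (ejemplo hipotético con datos sintéticos sin ruido):&lt;br /&gt;
&lt;br /&gt;
```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 4))
w_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ w_true  # objetivos sin ruido para simplificar

w = np.zeros(4)
eta, B = 0.05, 64  # tasa de aprendizaje y tamano del mini-lote
for _ in range(2000):
    idx = rng.integers(0, len(X), size=B)   # indices del mini-lote
    Xb, yb = X[idx], y[idx]
    grad = -2.0 / B * Xb.T @ (yb - Xb @ w)  # gradiente estimado en el lote
    w = w - eta * grad

print(w)  # cercano a w_true
```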
&lt;br /&gt;
== Convergencia ==&lt;br /&gt;
&lt;br /&gt;
=== Funciones convexas ===&lt;br /&gt;
&lt;br /&gt;
Para una función convexa con gradientes Lipschitz-continuos (constante &amp;lt;math&amp;gt;L&amp;lt;/math&amp;gt;), el descenso de gradiente con una tasa de aprendizaje fija &amp;lt;math&amp;gt;\eta \leq 1/L&amp;lt;/math&amp;gt; converge a una tasa de &amp;lt;math&amp;gt;O(1/t)&amp;lt;/math&amp;gt;. Si la función es adicionalmente &#039;&#039;&#039;fuertemente convexa&#039;&#039;&#039; con parámetro &amp;lt;math&amp;gt;\mu &amp;gt; 0&amp;lt;/math&amp;gt;, la convergencia se acelera a una tasa lineal (exponencial):&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;f(\theta_t) - f(\theta^*) \leq \left(1 - \frac{\mu}{L}\right)^t \bigl(f(\theta_0) - f(\theta^*)\bigr)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
La relación &amp;lt;math&amp;gt;\kappa = L / \mu&amp;lt;/math&amp;gt; se denomina &#039;&#039;&#039;número de condición&#039;&#039;&#039; y determina la velocidad de convergencia del algoritmo. Los problemas mal condicionados (con &amp;lt;math&amp;gt;\kappa&amp;lt;/math&amp;gt; grande) convergen lentamente.&lt;br /&gt;
&lt;br /&gt;
=== Funciones no convexas ===&lt;br /&gt;
&lt;br /&gt;
La mayoría de las funciones objetivo del aprendizaje profundo son no convexas. En este escenario, solo se garantiza que el descenso de gradiente converja a un punto estacionario (donde &amp;lt;math&amp;gt;\nabla f = 0&amp;lt;/math&amp;gt;), que podría ser un mínimo local, un punto de silla o incluso un máximo local. En la práctica, los puntos de silla son más problemáticos que los mínimos locales en espacios de alta dimensión.&lt;br /&gt;
&lt;br /&gt;
== Selección de la tasa de aprendizaje ==&lt;br /&gt;
&lt;br /&gt;
La elección de la tasa de aprendizaje es una de las decisiones prácticas más importantes:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Demasiado grande&#039;&#039;&#039; — las iteraciones oscilan o divergen.&lt;br /&gt;
* &#039;&#039;&#039;Demasiado pequeña&#039;&#039;&#039; — la convergencia es inaceptablemente lenta.&lt;br /&gt;
* &#039;&#039;&#039;Programas de tasa de aprendizaje&#039;&#039;&#039; — muchos profesionales comienzan con una tasa mayor y la reducen con el tiempo (decaimiento por escalones, decaimiento exponencial, recocido coseno).&lt;br /&gt;
* &#039;&#039;&#039;Búsqueda de línea&#039;&#039;&#039; — los métodos numéricos clásicos eligen &amp;lt;math&amp;gt;\eta&amp;lt;/math&amp;gt; en cada paso para satisfacer condiciones como las de Wolfe o Armijo, aunque esto es poco habitual en el aprendizaje profundo.&lt;br /&gt;
&lt;br /&gt;
Una heurística común consiste en probar varios valores en una escala logarítmica (por ejemplo, &amp;lt;math&amp;gt;10^{-1}, 10^{-2}, 10^{-3}&amp;lt;/math&amp;gt;) y elegir el que reduzca la pérdida más rápidamente sin inestabilidad.&lt;br /&gt;
&lt;br /&gt;
== Extensiones y mejoras ==&lt;br /&gt;
&lt;br /&gt;
Varias modificaciones importantes abordan las limitaciones del descenso de gradiente básico:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Momento&#039;&#039;&#039; — acumula un vector de velocidad a partir de gradientes anteriores, lo que ayuda a acelerar la convergencia en paisajes con forma de barranco.&lt;br /&gt;
* &#039;&#039;&#039;Gradiente acelerado de Nesterov&#039;&#039;&#039; — una variante del momento que evalúa el gradiente en una posición anticipada, obteniendo mejores tasas de convergencia teóricas.&lt;br /&gt;
* &#039;&#039;&#039;Métodos adaptativos&#039;&#039;&#039; (Adagrad, RMSProp, Adam) — mantienen tasas de aprendizaje por parámetro que se adaptan según el historial de gradientes.&lt;br /&gt;
* &#039;&#039;&#039;Métodos de segundo orden&#039;&#039;&#039; — algoritmos como el método de Newton y L-BFGS utilizan información de curvatura (la hessiana o su aproximación) para una convergencia más rápida, pero suelen ser demasiado costosos para problemas a gran escala.&lt;br /&gt;
&lt;br /&gt;
== Consejos prácticos ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Escalado de características&#039;&#039;&#039; — normalizar las características de entrada para que tengan rangos similares mejora drásticamente la convergencia, porque la superficie de pérdida se vuelve más isotrópica.&lt;br /&gt;
* &#039;&#039;&#039;Recorte de gradientes&#039;&#039;&#039; — limitar la norma del gradiente evita actualizaciones excesivamente grandes.&lt;br /&gt;
* &#039;&#039;&#039;Inicialización aleatoria&#039;&#039;&#039; — comenzar desde una inicialización aleatoria razonable (por ejemplo, la inicialización de Xavier o He para redes neuronales) rompe la simetría entre neuronas y evita que todas aprendan lo mismo.&lt;br /&gt;
* &#039;&#039;&#039;Monitorización de la curva de pérdida&#039;&#039;&#039; — graficar la pérdida de entrenamiento a lo largo de las iteraciones es el diagnóstico más sencillo: una curva que decrece suavemente indica un entrenamiento saludable; las oscilaciones sugieren que la tasa de aprendizaje es demasiado alta.&lt;br /&gt;
&lt;br /&gt;
== Aplicaciones ==&lt;br /&gt;
&lt;br /&gt;
El descenso de gradiente y sus variantes se utilizan en toda la ciencia y la ingeniería:&lt;br /&gt;
&lt;br /&gt;
* Entrenamiento de modelos de aprendizaje automático (modelos lineales, redes neuronales, máquinas de vectores de soporte)&lt;br /&gt;
* Procesamiento de señales y sistemas de control&lt;br /&gt;
* Problemas inversos en física e imagen&lt;br /&gt;
* Investigación de operaciones y optimización logística&lt;br /&gt;
* Economía y cálculo de equilibrios en teoría de juegos&lt;br /&gt;
&lt;br /&gt;
== Véase también ==&lt;br /&gt;
&lt;br /&gt;
* [[Stochastic Gradient Descent]]&lt;br /&gt;
* [[Backpropagation]]&lt;br /&gt;
* [[Loss Functions]]&lt;br /&gt;
* [[Neural Networks]]&lt;br /&gt;
* [[Overfitting and Regularization]]&lt;br /&gt;
&lt;br /&gt;
== Referencias ==&lt;br /&gt;
&lt;br /&gt;
* Cauchy, A. (1847). &amp;quot;Méthode générale pour la résolution des systèmes d&#039;équations simultanées&amp;quot;. &#039;&#039;Comptes Rendus de l&#039;Académie des Sciences&#039;&#039;.&lt;br /&gt;
* Boyd, S. and Vandenberghe, L. (2004). &#039;&#039;Convex Optimization&#039;&#039;. Cambridge University Press.&lt;br /&gt;
* Ruder, S. (2016). &amp;quot;An overview of gradient descent optimization algorithms&amp;quot;. &#039;&#039;arXiv:1609.04747&#039;&#039;.&lt;br /&gt;
* Goodfellow, I., Bengio, Y. and Courville, A. (2016). &#039;&#039;Deep Learning&#039;&#039;, Chapter 8. MIT Press.&lt;br /&gt;
&lt;br /&gt;
[[Category:Optimization]]&lt;br /&gt;
[[Category:Introductory]]&lt;/div&gt;</summary>
		<author><name>DeployBot</name></author>
	</entry>
	<entry>
		<id>https://marovi.ai/index.php?title=Dropout/es&amp;diff=2152</id>
		<title>Dropout/es</title>
		<link rel="alternate" type="text/html" href="https://marovi.ai/index.php?title=Dropout/es&amp;diff=2152"/>
		<updated>2026-04-24T07:09:00Z</updated>

		<summary type="html">&lt;p&gt;DeployBot: [deploy-bot] Deploy from CI (8c92aeb)&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{LanguageBar | page = Dropout}}&lt;br /&gt;
{{ArticleInfobox | topic_area = Deep Learning | difficulty = Intermediate | prerequisites = [[Neural Networks]], [[Overfitting and Regularization]]}}&lt;br /&gt;
{{ContentMeta | generated_by = claude-opus | model_used = claude-opus-4-6 | generated_date = 2026-03-13}}&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Dropout&#039;&#039;&#039; es una técnica de regularización para redes neuronales que establece aleatoriamente en cero una fracción de las activaciones neuronales durante cada paso de entrenamiento. Introducida por Srivastava et al. (2014), dropout es uno de los métodos más ampliamente utilizados para prevenir el sobreajuste en el aprendizaje profundo.&lt;br /&gt;
&lt;br /&gt;
== Motivación: coadaptación ==&lt;br /&gt;
&lt;br /&gt;
En las redes neuronales grandes, las neuronas pueden desarrollar patrones complejos de &#039;&#039;&#039;coadaptación&#039;&#039;&#039; — grupos de neuronas que solo funcionan correctamente en presencia de otras neuronas específicas. Este acoplamiento estrecho hace que la red sea frágil y propensa al sobreajuste, ya que las características aprendidas dependen de las idiosincrasias particulares de los datos de entrenamiento en lugar de capturar patrones robustos y generales.&lt;br /&gt;
&lt;br /&gt;
Dropout rompe estas coadaptaciones obligando a cada neurona a aprender características que sean útiles en conjunto con muchos subconjuntos aleatorios diferentes de las demás neuronas.&lt;br /&gt;
&lt;br /&gt;
== El algoritmo de dropout ==&lt;br /&gt;
&lt;br /&gt;
=== Durante el entrenamiento ===&lt;br /&gt;
&lt;br /&gt;
En cada paso de entrenamiento, cada neurona en una capa de dropout se retiene independientemente con probabilidad &amp;lt;math&amp;gt;p&amp;lt;/math&amp;gt; (la &#039;&#039;&#039;probabilidad de retención&#039;&#039;&#039;) o se establece en cero con probabilidad &amp;lt;math&amp;gt;1 - p&amp;lt;/math&amp;gt;. Formalmente, para una capa con vector de activación &amp;lt;math&amp;gt;\mathbf{h}&amp;lt;/math&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;r_j \sim \mathrm{Bernoulli}(p)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\tilde{h}_j = r_j \cdot h_j&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
donde &amp;lt;math&amp;gt;r_j&amp;lt;/math&amp;gt; es una máscara binaria extraída independientemente para cada neurona &amp;lt;math&amp;gt;j&amp;lt;/math&amp;gt;. Una probabilidad de retención típica es &amp;lt;math&amp;gt;p = 0.5&amp;lt;/math&amp;gt; para las capas ocultas y &amp;lt;math&amp;gt;p = 0.8&amp;lt;/math&amp;gt; o mayor para la capa de entrada.&lt;br /&gt;
&lt;br /&gt;
Cada paso de entrenamiento entrena, en efecto, una subred &amp;quot;adelgazada&amp;quot; diferente muestreada de la arquitectura completa. Con &amp;lt;math&amp;gt;n&amp;lt;/math&amp;gt; neuronas existen &amp;lt;math&amp;gt;2^n&amp;lt;/math&amp;gt; posibles subredes, lo que crea un ensamblaje implícito.&lt;br /&gt;
&lt;br /&gt;
=== Durante la inferencia: dropout invertido ===&lt;br /&gt;
&lt;br /&gt;
En el momento de la inferencia, todas las neuronas están activas, por lo que la salida esperada de cada neurona está escalada por un factor de &amp;lt;math&amp;gt;p&amp;lt;/math&amp;gt; en relación con el entrenamiento. Dos enfoques abordan esto:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Dropout estandar&#039;&#039;&#039;: Multiplicar todos los pesos por &amp;lt;math&amp;gt;p&amp;lt;/math&amp;gt; en el momento de la prueba.&lt;br /&gt;
* &#039;&#039;&#039;Dropout invertido&#039;&#039;&#039; (más común): Durante el entrenamiento, dividir las activaciones retenidas por &amp;lt;math&amp;gt;p&amp;lt;/math&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\tilde{h}_j = \frac{r_j \cdot h_j}{p}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
El dropout invertido asegura que el valor esperado de &amp;lt;math&amp;gt;\tilde{h}_j&amp;lt;/math&amp;gt; sea igual a &amp;lt;math&amp;gt;h_j&amp;lt;/math&amp;gt; durante el entrenamiento, de modo que no se necesita ningún ajuste en la inferencia. Esta es la implementación predeterminada en frameworks como PyTorch y TensorFlow.&lt;br /&gt;
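&lt;br /&gt;
Un esbozo mínimo del dropout invertido en NumPy (la función dropout_invertido es hipotética, no la API de ningún framework):&lt;br /&gt;
&lt;br /&gt;
```python
import numpy as np

def dropout_invertido(h, p, rng, entrenamiento=True):
    # Mascara Bernoulli(p) y reescalado por 1/p durante el entrenamiento;
    # en inferencia la capa es la identidad.
    if not entrenamiento:
        return h
    mascara = rng.binomial(1, p, size=h.shape)
    return h * mascara / p

rng = np.random.default_rng(0)
h = np.ones(100000)
h_tilde = dropout_invertido(h, p=0.5, rng=rng)

# El valor esperado se conserva: la media empirica queda cerca de 1
print(h_tilde.mean())
```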
&lt;br /&gt;
== Interpretación teórica ==&lt;br /&gt;
&lt;br /&gt;
=== Perspectiva de ensamblaje ===&lt;br /&gt;
&lt;br /&gt;
Dropout puede verse como el entrenamiento de un ensamblaje exponencialmente grande de subredes con amplia compartición de pesos. En el momento de la prueba, utilizar la red completa con pesos escalados aproxima la media geométrica de las predicciones de todas las &amp;lt;math&amp;gt;2^n&amp;lt;/math&amp;gt; subredes. Este promediado del ensamblaje reduce la varianza y mejora la generalización.&lt;br /&gt;
&lt;br /&gt;
=== Interpretación bayesiana ===&lt;br /&gt;
&lt;br /&gt;
Gal y Ghahramani (2016) demostraron que una red neuronal con dropout aplicado antes de cada capa de pesos es matemáticamente equivalente a una aproximación de un proceso gaussiano profundo. Realizar dropout en el momento de la prueba (&#039;&#039;&#039;dropout de Monte Carlo&#039;&#039;&#039;) produce una distribución sobre las predicciones, proporcionando una estimación práctica de la incertidumbre del modelo.&lt;br /&gt;
&lt;br /&gt;
== Variantes de dropout ==&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! Variante !! Descripción !! Aplicación típica&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;Dropout estandar&#039;&#039;&#039; || Descarta neuronas individuales || Capas completamente conectadas&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;Dropout espacial&#039;&#039;&#039; || Descarta mapas de características completos (canales) || Redes convolucionales&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;DropConnect&#039;&#039;&#039; || Descarta pesos individuales en lugar de neuronas || Capas densas&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;Dropout variacional&#039;&#039;&#039; || Aprende la tasa de dropout por neurona/peso || Aprendizaje profundo bayesiano&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;DropBlock&#039;&#039;&#039; || Descarta regiones contiguas de mapas de características || Redes convolucionales&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;Alpha dropout&#039;&#039;&#039; || Mantiene la propiedad de autonormalización (para activaciones SELU) || Redes autonormalizantes&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
El &#039;&#039;&#039;dropout espacial&#039;&#039;&#039; (Tompson et al., 2015) es particularmente importante para las redes convolucionales. El dropout estándar sobre mapas de características convolucionales es ineficaz porque las activaciones adyacentes están altamente correlacionadas; descartar píxeles individuales aún deja información espacial redundante. El dropout espacial, en cambio, descarta canales completos, obligando a la red a utilizar representaciones de características diversas.&lt;br /&gt;
&lt;br /&gt;
== Directrices prácticas ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Ubicación&#039;&#039;&#039;: Aplicar dropout después de la función de activación en las capas completamente conectadas. En los Transformers, dropout se aplica a los pesos de atención y después de las subcapas de red prealimentada.&lt;br /&gt;
* &#039;&#039;&#039;Selección de la tasa&#039;&#039;&#039;: Comenzar con &amp;lt;math&amp;gt;p = 0.5&amp;lt;/math&amp;gt; para las capas ocultas. Utilizar tasas de retención más altas (menos dropout) en las capas con menos parámetros. Aumentar el dropout para modelos más grandes o conjuntos de datos más pequeños.&lt;br /&gt;
* &#039;&#039;&#039;Interacción con BatchNorm&#039;&#039;&#039;: Utilizar dropout y [[Batch Normalization]] juntos requiere cuidado, ya que dropout introduce una varianza que puede desestabilizar las estadísticas del lote. Una práctica común es aplicar dropout solo después de la última capa con batch normalization.&lt;br /&gt;
* &#039;&#039;&#039;Dropout programado&#039;&#039;&#039;: Algunos regímenes de entrenamiento comienzan sin dropout y lo aumentan gradualmente, o viceversa, a lo largo del entrenamiento.&lt;br /&gt;
&lt;br /&gt;
== Efecto sobre el entrenamiento ==&lt;br /&gt;
&lt;br /&gt;
Dropout típicamente aumenta la pérdida de entrenamiento y ralentiza la convergencia, ya que la capacidad efectiva del modelo se reduce en cada paso. Sin embargo, disminuye la brecha entre el rendimiento de entrenamiento y el de validación, lo que conduce a una mejor generalización. Si la pérdida de entrenamiento ya es alta (subajuste), el dropout debe reducirse o eliminarse.&lt;br /&gt;
&lt;br /&gt;
== Véase también ==&lt;br /&gt;
&lt;br /&gt;
* [[Overfitting and Regularization]]&lt;br /&gt;
* [[Batch Normalization]]&lt;br /&gt;
* [[Neural Networks]]&lt;br /&gt;
* [[Backpropagation]]&lt;br /&gt;
* [[Bayesian deep learning]]&lt;br /&gt;
&lt;br /&gt;
== Referencias ==&lt;br /&gt;
&lt;br /&gt;
* Srivastava, N. et al. (2014). &amp;quot;Dropout: A Simple Way to Prevent Neural Networks from Overfitting&amp;quot;. &#039;&#039;Journal of Machine Learning Research&#039;&#039; 15(56):1929–1958.&lt;br /&gt;
* Gal, Y. and Ghahramani, Z. (2016). &amp;quot;Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning&amp;quot;. &#039;&#039;ICML&#039;&#039;.&lt;br /&gt;
* Tompson, J. et al. (2015). &amp;quot;Efficient Object Localization Using Convolutional Networks&amp;quot;. &#039;&#039;CVPR&#039;&#039;.&lt;br /&gt;
* Wan, L. et al. (2013). &amp;quot;Regularization of Neural Networks using DropConnect&amp;quot;. &#039;&#039;ICML&#039;&#039;.&lt;br /&gt;
* Ghiasi, G., Lin, T.-Y. and Le, Q. V. (2018). &amp;quot;DropBlock: A regularization method for convolutional networks&amp;quot;. &#039;&#039;NeurIPS&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
[[Category:Deep Learning]]&lt;br /&gt;
[[Category:Intermediate]]&lt;br /&gt;
[[Category:Neural Networks]]&lt;/div&gt;</summary>
		<author><name>DeployBot</name></author>
	</entry>
	<entry>
		<id>https://marovi.ai/index.php?title=Cross-Entropy_Loss/es&amp;diff=2151</id>
		<title>Cross-Entropy Loss/es</title>
		<link rel="alternate" type="text/html" href="https://marovi.ai/index.php?title=Cross-Entropy_Loss/es&amp;diff=2151"/>
		<updated>2026-04-24T07:09:00Z</updated>

		<summary type="html">&lt;p&gt;DeployBot: [deploy-bot] Deploy from CI (8c92aeb)&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{LanguageBar | page = Cross-Entropy Loss}}&lt;br /&gt;
{{ArticleInfobox | topic_area = Machine Learning | difficulty = Intermediate | prerequisites = [[Loss Functions]], [[Softmax Function]]}}&lt;br /&gt;
{{ContentMeta | generated_by = claude-opus | model_used = claude-opus-4-6 | generated_date = 2026-03-13}}&lt;br /&gt;
&lt;br /&gt;
La &#039;&#039;&#039;pérdida de entropía cruzada&#039;&#039;&#039; (también llamada &#039;&#039;&#039;pérdida logarítmica&#039;&#039;&#039;) es la función de pérdida más ampliamente utilizada para tareas de clasificación en el aprendizaje automático. Con raíces en la teoría de la información, mide la disimilitud entre la distribución de la etiqueta verdadera y la distribución de probabilidad predicha por el modelo, proporcionando un objetivo suave y diferenciable que impulsa a los clasificadores probabilísticos hacia predicciones correctas y con alta confianza.&lt;br /&gt;
&lt;br /&gt;
== Fundamentos de la teoria de la informacion ==&lt;br /&gt;
&lt;br /&gt;
=== Entropia ===&lt;br /&gt;
&lt;br /&gt;
La &#039;&#039;&#039;entropia&#039;&#039;&#039; de una distribucion de probabilidad discreta &amp;lt;math&amp;gt;p&amp;lt;/math&amp;gt; cuantifica su incertidumbre:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;H(p) = -\sum_{k=1}^{K} p_k \log p_k&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Para una distribucion determinista (etiqueta one-hot), &amp;lt;math&amp;gt;H(p) = 0&amp;lt;/math&amp;gt;. La entropia se maximiza cuando todos los resultados son igualmente probables.&lt;br /&gt;
&lt;br /&gt;
=== Divergencia KL ===&lt;br /&gt;
&lt;br /&gt;
La &#039;&#039;&#039;divergencia de Kullback-Leibler&#039;&#039;&#039; mide cuanto difiere una distribucion &amp;lt;math&amp;gt;q&amp;lt;/math&amp;gt; de una distribucion de referencia &amp;lt;math&amp;gt;p&amp;lt;/math&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;D_{\mathrm{KL}}(p \,\|\, q) = \sum_{k=1}^{K} p_k \log \frac{p_k}{q_k}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
La divergencia KL es no negativa e igual a cero si y solo si &amp;lt;math&amp;gt;p = q&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
=== Entropia cruzada ===&lt;br /&gt;
&lt;br /&gt;
La &#039;&#039;&#039;entropia cruzada&#039;&#039;&#039; entre las distribuciones &amp;lt;math&amp;gt;p&amp;lt;/math&amp;gt; (verdadera) y &amp;lt;math&amp;gt;q&amp;lt;/math&amp;gt; (predicha) es:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;H(p, q) = -\sum_{k=1}^{K} p_k \log q_k = H(p) + D_{\mathrm{KL}}(p \,\|\, q)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Dado que &amp;lt;math&amp;gt;H(p)&amp;lt;/math&amp;gt; es constante con respecto a los parametros del modelo, minimizar la entropia cruzada es equivalente a minimizar la divergencia KL — es decir, hacer que la distribucion predicha &amp;lt;math&amp;gt;q&amp;lt;/math&amp;gt; sea lo mas cercana posible a la distribucion verdadera &amp;lt;math&amp;gt;p&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
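La identidad anterior puede comprobarse numéricamente. El siguiente esquema mínimo en Python (solo biblioteca estándar; los nombres de las funciones son hipotéticos) calcula los tres términos para dos distribuciones pequeñas y verifica que la entropía cruzada es la suma de la entropía y la divergencia KL:&lt;br /&gt;
&lt;br /&gt;
```python
import math

def entropia(p):
    # H(p) = -sum p_k log p_k (se omiten los terminos con p_k = 0)
    return -sum(pk * math.log(pk) for pk in p if pk > 0)

def entropia_cruzada(p, q):
    # H(p, q) = -sum p_k log q_k
    return -sum(pk * math.log(qk) for pk, qk in zip(p, q) if pk > 0)

def divergencia_kl(p, q):
    # D_KL(p || q) = sum p_k log(p_k / q_k)
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q) if pk > 0)

p = [0.7, 0.2, 0.1]   # distribucion "verdadera"
q = [0.5, 0.3, 0.2]   # distribucion predicha

# H(p, q) = H(p) + D_KL(p || q)
assert abs(entropia_cruzada(p, q) - (entropia(p) + divergencia_kl(p, q))) < 1e-12
```
&lt;br /&gt;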
== Entropia cruzada binaria ==&lt;br /&gt;
&lt;br /&gt;
Para clasificacion binaria con etiqueta verdadera &amp;lt;math&amp;gt;y \in \{0, 1\}&amp;lt;/math&amp;gt; y probabilidad predicha &amp;lt;math&amp;gt;\hat{y} = \sigma(z)&amp;lt;/math&amp;gt; (donde &amp;lt;math&amp;gt;\sigma&amp;lt;/math&amp;gt; es la [[Softmax Function|funcion sigmoide]]):&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\mathcal{L}_{\mathrm{BCE}} = -\bigl[y \log \hat{y} + (1 - y) \log(1 - \hat{y})\bigr]&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Sobre un conjunto de datos de &amp;lt;math&amp;gt;N&amp;lt;/math&amp;gt; muestras:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \bigl[y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i)\bigr]&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
El gradiente con respecto al logit &amp;lt;math&amp;gt;z&amp;lt;/math&amp;gt; toma la forma elegantemente simple &amp;lt;math&amp;gt;\hat{y} - y&amp;lt;/math&amp;gt;, que es tanto intuitiva como computacionalmente eficiente.&lt;br /&gt;
&lt;br /&gt;
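Esa forma del gradiente puede verificarse por diferencias finitas. Un esquema mínimo en Python (nombres hipotéticos, solo biblioteca estándar):&lt;br /&gt;
&lt;br /&gt;
```python
import math

def sigmoide(z):
    # funcion sigmoide: transforma el logit z en una probabilidad
    return 1.0 / (1.0 + math.exp(-z))

def bce(y, y_hat):
    # L = -[y log(y_hat) + (1 - y) log(1 - y_hat)]
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

# comprobacion numerica de dL/dz = y_hat - y
z, y = 0.8, 1.0
h = 1e-6
grad_num = (bce(y, sigmoide(z + h)) - bce(y, sigmoide(z - h))) / (2 * h)
assert abs(grad_num - (sigmoide(z) - y)) < 1e-6
```
&lt;br /&gt;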
== Entropia cruzada categorica ==&lt;br /&gt;
&lt;br /&gt;
Para clasificacion multiclase con &amp;lt;math&amp;gt;K&amp;lt;/math&amp;gt; clases, la etiqueta verdadera es tipicamente un vector one-hot &amp;lt;math&amp;gt;\mathbf{y}&amp;lt;/math&amp;gt; con &amp;lt;math&amp;gt;y_c = 1&amp;lt;/math&amp;gt; para la clase correcta &amp;lt;math&amp;gt;c&amp;lt;/math&amp;gt;. Las probabilidades predichas &amp;lt;math&amp;gt;\hat{\mathbf{y}}&amp;lt;/math&amp;gt; se obtienen mediante la [[Softmax Function|funcion softmax]]:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\mathcal{L}_{\mathrm{CE}} = -\sum_{k=1}^{K} y_k \log \hat{y}_k = -\log \hat{y}_c&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Esto se reduce a la probabilidad logaritmica negativa de la clase correcta, razon por la cual la entropia cruzada categorica tambien se denomina &#039;&#039;&#039;verosimilitud logaritmica negativa&#039;&#039;&#039; en este contexto.&lt;br /&gt;
&lt;br /&gt;
== Estabilidad numerica ==&lt;br /&gt;
&lt;br /&gt;
=== El truco log-sum-exp ===&lt;br /&gt;
&lt;br /&gt;
Calcular ingenuamente &amp;lt;math&amp;gt;\log(\mathrm{softmax}(z_k))&amp;lt;/math&amp;gt; implica exponenciar logits potencialmente grandes, causando desbordamiento. El truco &#039;&#039;&#039;log-sum-exp&#039;&#039;&#039; evita esto:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\log \hat{y}_k = z_k - \log \sum_{j=1}^{K} e^{z_j} = z_k - \left(m + \log \sum_{j=1}^{K} e^{z_j - m}\right)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
donde &amp;lt;math&amp;gt;m = \max_j z_j&amp;lt;/math&amp;gt;. Restar el logit maximo asegura que el exponente mas grande sea cero, previniendo el desbordamiento. Todos los principales frameworks de aprendizaje profundo implementan esta operacion fusionada (por ejemplo, &amp;lt;code&amp;gt;CrossEntropyLoss&amp;lt;/code&amp;gt; de PyTorch acepta logits crudos).&lt;br /&gt;
&lt;br /&gt;
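A modo de ilustración, un esquema mínimo del truco log-sum-exp en Python (nombres hipotéticos; los frameworks reales implementan versiones fusionadas y vectorizadas de esta operación):&lt;br /&gt;
&lt;br /&gt;
```python
import math

def log_softmax(z):
    # truco log-sum-exp: restar el maximo antes de exponenciar
    m = max(z)
    lse = m + math.log(sum(math.exp(zj - m) for zj in z))
    return [zk - lse for zk in z]

# logits grandes que desbordarian con la formula ingenua exp(z_k)
logits = [1000.0, 1001.0, 1002.0]
lp = log_softmax(logits)

# las probabilidades resultantes siguen sumando 1
assert abs(sum(math.exp(v) for v in lp) - 1.0) < 1e-12
```
&lt;br /&gt;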
=== Recorte ===&lt;br /&gt;
&lt;br /&gt;
Las probabilidades predichas deben recortarse para que nunca sean exactamente 0 ni 1, evitando así &amp;lt;math&amp;gt;\log(0) = -\infty&amp;lt;/math&amp;gt;. Típicamente se recortan al intervalo &amp;lt;math&amp;gt;[\epsilon, 1 - \epsilon]&amp;lt;/math&amp;gt; con un épsilon pequeño (por ejemplo, &amp;lt;math&amp;gt;10^{-7}&amp;lt;/math&amp;gt;).&lt;br /&gt;
&lt;br /&gt;
== Suavizado de etiquetas ==&lt;br /&gt;
&lt;br /&gt;
El &#039;&#039;&#039;suavizado de etiquetas&#039;&#039;&#039; (Szegedy et al., 2016) reemplaza el objetivo one-hot rigido con una distribucion suave:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;y_k^{\mathrm{smooth}} = (1 - \alpha)\, y_k + \frac{\alpha}{K}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
donde &amp;lt;math&amp;gt;\alpha&amp;lt;/math&amp;gt; es una constante pequena (comunmente 0.1). Esto evita que el modelo se vuelva excesivamente confiado, mejora la calibracion y a menudo produce una mejor generalizacion. Es practica estandar en el entrenamiento de grandes clasificadores de imagenes y modelos Transformer.&lt;br /&gt;
&lt;br /&gt;
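La fórmula anterior puede esbozarse en unas pocas líneas de Python (nombres hipotéticos, solo a modo de ilustración):&lt;br /&gt;
&lt;br /&gt;
```python
def suavizar_etiquetas(y_onehot, alfa, K):
    # y_k_smooth = (1 - alfa) * y_k + alfa / K
    return [(1 - alfa) * yk + alfa / K for yk in y_onehot]

y = [0.0, 1.0, 0.0, 0.0]                 # etiqueta one-hot, K = 4
y_suave = suavizar_etiquetas(y, alfa=0.1, K=4)

assert abs(sum(y_suave) - 1.0) < 1e-12   # sigue siendo una distribucion valida
assert max(y_suave) < 1.0                # el objetivo ya no es "rigido"
```
&lt;br /&gt;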
== Comparacion con otras perdidas ==&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! Perdida !! Formula !! Uso tipico&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;Entropia cruzada&#039;&#039;&#039; || &amp;lt;math&amp;gt;-\sum y_k \log \hat{y}_k&amp;lt;/math&amp;gt; || Clasificacion&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;Error cuadratico medio&#039;&#039;&#039; || &amp;lt;math&amp;gt;\frac{1}{K}\sum(y_k - \hat{y}_k)^2&amp;lt;/math&amp;gt; || Regresion (inadecuado para clasificacion)&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;Perdida de bisagra&#039;&#039;&#039; || &amp;lt;math&amp;gt;\max(0, 1 - y \cdot z)&amp;lt;/math&amp;gt; || Clasificacion tipo SVM&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;Perdida focal&#039;&#039;&#039; || &amp;lt;math&amp;gt;-(1-\hat{y}_c)^\gamma \log \hat{y}_c&amp;lt;/math&amp;gt; || Clasificacion desbalanceada&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
La entropía cruzada tiene gradientes más pronunciados que el MSE cuando la predicción es errónea con alta confianza, lo que conduce a una corrección más rápida de los errores grandes.&lt;br /&gt;
&lt;br /&gt;
== Vease tambien ==&lt;br /&gt;
&lt;br /&gt;
* [[Loss Functions]]&lt;br /&gt;
* [[Softmax Function]]&lt;br /&gt;
* [[Logistic regression]]&lt;br /&gt;
* [[Information theory]]&lt;br /&gt;
* [[Neural Networks]]&lt;br /&gt;
&lt;br /&gt;
== Referencias ==&lt;br /&gt;
&lt;br /&gt;
* Shannon, C. E. (1948). &amp;quot;A Mathematical Theory of Communication&amp;quot;. &#039;&#039;Bell System Technical Journal&#039;&#039;.&lt;br /&gt;
* Goodfellow, I., Bengio, Y. and Courville, A. (2016). &#039;&#039;Deep Learning&#039;&#039;. MIT Press, Chapter 6.&lt;br /&gt;
* Szegedy, C. et al. (2016). &amp;quot;Rethinking the Inception Architecture for Computer Vision&amp;quot;. &#039;&#039;CVPR&#039;&#039;.&lt;br /&gt;
* Lin, T.-Y. et al. (2017). &amp;quot;Focal Loss for Dense Object Detection&amp;quot;. &#039;&#039;ICCV&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
[[Category:Machine Learning]]&lt;br /&gt;
[[Category:Intermediate]]&lt;/div&gt;</summary>
		<author><name>DeployBot</name></author>
	</entry>
	<entry>
		<id>https://marovi.ai/index.php?title=Convolutional_Neural_Networks/es&amp;diff=2150</id>
		<title>Convolutional Neural Networks/es</title>
		<link rel="alternate" type="text/html" href="https://marovi.ai/index.php?title=Convolutional_Neural_Networks/es&amp;diff=2150"/>
		<updated>2026-04-24T07:09:00Z</updated>

		<summary type="html">&lt;p&gt;DeployBot: [deploy-bot] Deploy from CI (8c92aeb)&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{LanguageBar | page = Convolutional Neural Networks}}&lt;br /&gt;
{{ArticleInfobox | topic_area = Deep Learning | difficulty = Intermediate | prerequisites = [[Neural Networks]], [[Backpropagation]]}}&lt;br /&gt;
{{ContentMeta | generated_by = claude-opus | model_used = claude-opus-4-6 | generated_date = 2026-03-13}}&lt;br /&gt;
&lt;br /&gt;
Las &#039;&#039;&#039;redes neuronales convolucionales&#039;&#039;&#039; (&#039;&#039;&#039;CNN&#039;&#039;&#039; o &#039;&#039;&#039;ConvNets&#039;&#039;&#039;) son una clase de [[Neural Networks|redes neuronales]] profundas disenadas especificamente para procesar datos con una topologia de cuadricula, como imagenes (cuadriculas 2D de pixeles), espectrogramas de audio y video. Explotan la estructura espacial de la entrada mediante conectividad local, comparticion de pesos y agrupamiento (pooling), lo que las hace mucho mas eficientes que las redes completamente conectadas para tareas visuales y espaciales.&lt;br /&gt;
&lt;br /&gt;
== La operacion de convolucion ==&lt;br /&gt;
&lt;br /&gt;
El bloque de construccion fundamental es la &#039;&#039;&#039;convolucion discreta&#039;&#039;&#039;. Para una entrada 2D &amp;lt;math&amp;gt;\mathbf{X}&amp;lt;/math&amp;gt; y un filtro (kernel) &amp;lt;math&amp;gt;\mathbf{K}&amp;lt;/math&amp;gt; de tamano &amp;lt;math&amp;gt;k \times k&amp;lt;/math&amp;gt;, el mapa de caracteristicas de salida &amp;lt;math&amp;gt;\mathbf{Y}&amp;lt;/math&amp;gt; es:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;Y_{i,j} = \sum_{m=0}^{k-1}\sum_{n=0}^{k-1} K_{m,n} \cdot X_{i+m,\, j+n} + b&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
donde &amp;lt;math&amp;gt;b&amp;lt;/math&amp;gt; es un termino de sesgo. El filtro se desliza (convoluciona) sobre la entrada, calculando un producto escalar en cada posicion. Tecnicamente, la mayoria de las implementaciones calculan una &#039;&#039;&#039;correlacion cruzada&#039;&#039;&#039; en lugar de una convolucion verdadera (que voltearía el kernel), pero la distincion es irrelevante dado que los pesos del kernel se aprenden.&lt;br /&gt;
&lt;br /&gt;
Hiperparametros clave que controlan la convolucion:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Tamano del kernel&#039;&#039;&#039; — la extension espacial del filtro (por ejemplo, &amp;lt;math&amp;gt;3 \times 3&amp;lt;/math&amp;gt;, &amp;lt;math&amp;gt;5 \times 5&amp;lt;/math&amp;gt;).&lt;br /&gt;
* &#039;&#039;&#039;Paso (stride)&#039;&#039;&#039; — el tamano del desplazamiento entre posiciones sucesivas del kernel. Un stride de 2 reduce las dimensiones espaciales a la mitad.&lt;br /&gt;
* &#039;&#039;&#039;Relleno (padding)&#039;&#039;&#039; — anadir ceros alrededor del borde de la entrada para controlar el tamano de la salida. El relleno &amp;quot;same&amp;quot; preserva las dimensiones espaciales; el relleno &amp;quot;valid&amp;quot; no utiliza relleno.&lt;br /&gt;
&lt;br /&gt;
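Estos hiperparámetros determinan juntos la dimensión espacial de la salida: para una entrada de tamaño &amp;lt;math&amp;gt;n&amp;lt;/math&amp;gt;, kernel &amp;lt;math&amp;gt;k&amp;lt;/math&amp;gt;, stride &amp;lt;math&amp;gt;s&amp;lt;/math&amp;gt; y relleno &amp;lt;math&amp;gt;p&amp;lt;/math&amp;gt;, la salida tiene &amp;lt;math&amp;gt;\lfloor (n + 2p - k)/s \rfloor + 1&amp;lt;/math&amp;gt; posiciones. Un esquema mínimo en Python (nombres hipotéticos):&lt;br /&gt;
&lt;br /&gt;
```python
def tamano_salida(n, k, s=1, p=0):
    # dimension de salida de una convolucion: floor((n + 2p - k) / s) + 1
    return (n + 2 * p - k) // s + 1

# relleno "valid" (p = 0): la salida se encoge
assert tamano_salida(32, k=3) == 30
# relleno "same" con stride 1: p = (k - 1) // 2 preserva la dimension
assert tamano_salida(32, k=3, p=1) == 32
# stride 2 reduce las dimensiones espaciales a la mitad
assert tamano_salida(32, k=3, s=2, p=1) == 16
```
&lt;br /&gt;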
== Filtros y deteccion de caracteristicas ==&lt;br /&gt;
&lt;br /&gt;
Cada filtro aprende a detectar un patron local especifico. En las capas iniciales, los filtros tipicamente responden a bordes, esquinas y gradientes de color. Las capas mas profundas componen estos en caracteristicas de nivel superior — texturas, partes y eventualmente objetos completos.&lt;br /&gt;
&lt;br /&gt;
Una capa convolucional aplica multiples filtros en paralelo, produciendo una pila de mapas de caracteristicas. Si la entrada tiene &amp;lt;math&amp;gt;C_{\text{in}}&amp;lt;/math&amp;gt; canales y la capa tiene &amp;lt;math&amp;gt;C_{\text{out}}&amp;lt;/math&amp;gt; filtros, el numero total de parametros aprendibles es:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;C_{\text{out}} \times (C_{\text{in}} \times k^2 + 1)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Esto es drasticamente menor que una capa completamente conectada con las mismas dimensiones de entrada y salida, porque los pesos se comparten en todas las posiciones espaciales.&lt;br /&gt;
&lt;br /&gt;
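La diferencia puede cuantificarse con un cálculo directo; el siguiente esquema en Python (nombres hipotéticos) compara el recuento de parámetros de una capa convolucional con el de una capa completamente conectada sobre la misma entrada:&lt;br /&gt;
&lt;br /&gt;
```python
def parametros_conv(c_in, c_out, k):
    # C_out filtros de tamano C_in x k x k, mas un sesgo por filtro
    return c_out * (c_in * k * k + 1)

# capa tipica: entrada RGB (3 canales), 64 filtros de 3x3
assert parametros_conv(3, 64, 3) == 1792

# una capa completamente conectada entre entrada 3x32x32 y salida 64x32x32
# no comparte pesos y necesita ordenes de magnitud mas parametros
fc = (3 * 32 * 32) * (64 * 32 * 32) + 64 * 32 * 32
assert parametros_conv(3, 64, 3) < fc
```
&lt;br /&gt;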
== Agrupamiento (pooling) ==&lt;br /&gt;
&lt;br /&gt;
Las capas de &#039;&#039;&#039;agrupamiento&#039;&#039;&#039; (pooling) submuestrean los mapas de caracteristicas, reduciendo sus dimensiones espaciales y proporcionando cierto grado de invariancia a la traslacion. Operaciones de agrupamiento comunes:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Max pooling&#039;&#039;&#039; — toma el valor maximo en cada ventana local (por ejemplo, &amp;lt;math&amp;gt;2 \times 2&amp;lt;/math&amp;gt;).&lt;br /&gt;
* &#039;&#039;&#039;Average pooling&#039;&#039;&#039; — toma el valor medio en cada ventana.&lt;br /&gt;
* &#039;&#039;&#039;Global average pooling&#039;&#039;&#039; — promedia cada mapa de caracteristicas completo a un unico valor, frecuentemente utilizado antes de la capa de clasificacion final.&lt;br /&gt;
&lt;br /&gt;
El agrupamiento reduce el coste computacional y ayuda a prevenir el sobreajuste al abstraer progresivamente la representacion.&lt;br /&gt;
&lt;br /&gt;
== Arquitectura de una CNN ==&lt;br /&gt;
&lt;br /&gt;
Una CNN tipica alterna capas convolucionales y capas de agrupamiento, seguidas de una o mas capas completamente conectadas para la prediccion final:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Entrada → [Conv → ReLU → Pool] × N → Aplanar → FC → FC → Salida&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Cada bloque conv-pool extrae caracteristicas cada vez mas abstractas, mientras que las capas completamente conectadas las combinan para la clasificacion o regresion.&lt;br /&gt;
&lt;br /&gt;
== Arquitecturas historicas ==&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! Arquitectura !! Ano !! Contribucion clave !! Profundidad&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;LeNet-5&#039;&#039;&#039; || 1998 || Pionera de las CNN para el reconocimiento de digitos manuscritos (MNIST) || 5 capas&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;AlexNet&#039;&#039;&#039; || 2012 || Gano ImageNet; popularizo ReLU, dropout y entrenamiento en GPU || 8 capas&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;VGGNet&#039;&#039;&#039; || 2014 || Demostro que la profundidad importa; uso solo filtros de &amp;lt;math&amp;gt;3 \times 3&amp;lt;/math&amp;gt; || 16–19 capas&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;GoogLeNet (Inception)&#039;&#039;&#039; || 2014 || Introdujo modulos inception con tamanos de filtro en paralelo || 22 capas&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;ResNet&#039;&#039;&#039; || 2015 || Introdujo conexiones residuales que permiten redes muy profundas || 50–152+ capas&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;DenseNet&#039;&#039;&#039; || 2017 || Conecto cada capa con todas las capas subsiguientes mediante bloques densos || 121–264 capas&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;EfficientNet&#039;&#039;&#039; || 2019 || Escalado compuesto de profundidad, anchura y resolucion || Variable&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=== Conexiones residuales ===&lt;br /&gt;
&lt;br /&gt;
La &#039;&#039;&#039;conexion residual&#039;&#039;&#039; (o conexion de salto) introducida por ResNet suma la entrada de un bloque directamente a su salida:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\mathbf{y} = \mathcal{F}(\mathbf{x}) + \mathbf{x}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Esto permite que los gradientes fluyan directamente a traves de la ruta identidad, mitigando el problema del gradiente que se desvanece y permitiendo el entrenamiento de redes con cientos de capas. Las conexiones residuales se han convertido en un componente estandar en practicamente todas las arquitecturas modernas.&lt;br /&gt;
&lt;br /&gt;
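La idea puede esbozarse con una transformación de juguete; en este esquema mínimo en Python (la función &amp;lt;code&amp;gt;bloque&amp;lt;/code&amp;gt; es un ejemplo hipotético, no la arquitectura real de ResNet), cuando el bloque aprendido produce cero, la conexión residual deja pasar la identidad sin alteración:&lt;br /&gt;
&lt;br /&gt;
```python
import math

def bloque(x, w):
    # F(x): una transformacion sencilla (escalado + tanh) a modo de ejemplo
    return [math.tanh(w * xi) for xi in x]

def bloque_residual(x, w):
    # y = F(x) + x : la entrada se suma directamente a la salida del bloque
    return [fi + xi for fi, xi in zip(bloque(x, w), x)]

x = [0.5, -1.0, 2.0]
# con w = 0 el bloque F es nulo y la salida es exactamente la identidad
assert bloque_residual(x, w=0.0) == x
```
&lt;br /&gt;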
== Aplicaciones en vision por computador ==&lt;br /&gt;
&lt;br /&gt;
Las CNN han alcanzado un rendimiento de vanguardia en una amplia gama de tareas de vision:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Clasificacion de imagenes&#039;&#039;&#039; — asignar una etiqueta a una imagen completa (ImageNet, CIFAR).&lt;br /&gt;
* &#039;&#039;&#039;Deteccion de objetos&#039;&#039;&#039; — localizar y clasificar objetos dentro de una imagen (YOLO, Faster R-CNN, SSD).&lt;br /&gt;
* &#039;&#039;&#039;Segmentacion semantica&#039;&#039;&#039; — asignar una etiqueta de clase a cada pixel (U-Net, DeepLab).&lt;br /&gt;
* &#039;&#039;&#039;Segmentacion de instancias&#039;&#039;&#039; — distinguir instancias individuales de objetos (Mask R-CNN).&lt;br /&gt;
* &#039;&#039;&#039;Generacion de imagenes&#039;&#039;&#039; — generar imagenes realistas utilizando generadores basados en CNN (GAN, modelos de difusion).&lt;br /&gt;
* &#039;&#039;&#039;Imagen medica&#039;&#039;&#039; — deteccion de tumores, analisis de retina y cribado radiologico.&lt;br /&gt;
&lt;br /&gt;
== Consejos practicos ==&lt;br /&gt;
&lt;br /&gt;
* Utilizar modelos preentrenados (transfer learning) cuando los datos etiquetados son limitados.&lt;br /&gt;
* Preferir kernels pequenos (&amp;lt;math&amp;gt;3 \times 3&amp;lt;/math&amp;gt;) apilados en profundidad — dos capas de &amp;lt;math&amp;gt;3 \times 3&amp;lt;/math&amp;gt; tienen el mismo campo receptivo que una capa de &amp;lt;math&amp;gt;5 \times 5&amp;lt;/math&amp;gt; pero con menos parametros.&lt;br /&gt;
* Aplicar batch normalization despues de la convolucion y antes de la activacion.&lt;br /&gt;
* Utilizar el aumento de datos generosamente para reducir el [[Overfitting and Regularization|sobreajuste]].&lt;br /&gt;
* Reemplazar las capas completamente conectadas con global average pooling para reducir parametros.&lt;br /&gt;
&lt;br /&gt;
== Vease tambien ==&lt;br /&gt;
&lt;br /&gt;
* [[Neural Networks]]&lt;br /&gt;
* [[Backpropagation]]&lt;br /&gt;
* [[Overfitting and Regularization]]&lt;br /&gt;
* [[Recurrent Neural Networks]]&lt;br /&gt;
* [[Gradient Descent]]&lt;br /&gt;
&lt;br /&gt;
== Referencias ==&lt;br /&gt;
&lt;br /&gt;
* LeCun, Y. et al. (1998). &amp;quot;Gradient-Based Learning Applied to Document Recognition&amp;quot;. &#039;&#039;Proceedings of the IEEE&#039;&#039;.&lt;br /&gt;
* Krizhevsky, A., Sutskever, I. and Hinton, G. E. (2012). &amp;quot;ImageNet Classification with Deep Convolutional Neural Networks&amp;quot;. &#039;&#039;NeurIPS&#039;&#039;.&lt;br /&gt;
* Simonyan, K. and Zisserman, A. (2015). &amp;quot;Very Deep Convolutional Networks for Large-Scale Image Recognition&amp;quot;. &#039;&#039;ICLR&#039;&#039;.&lt;br /&gt;
* He, K. et al. (2016). &amp;quot;Deep Residual Learning for Image Recognition&amp;quot;. &#039;&#039;CVPR&#039;&#039;.&lt;br /&gt;
* Tan, M. and Le, Q. V. (2019). &amp;quot;EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks&amp;quot;. &#039;&#039;ICML&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
[[Category:Deep Learning]]&lt;br /&gt;
[[Category:Intermediate]]&lt;br /&gt;
[[Category:Neural Networks]]&lt;/div&gt;</summary>
		<author><name>DeployBot</name></author>
	</entry>
	<entry>
		<id>https://marovi.ai/index.php?title=Batch_Normalization/es&amp;diff=2149</id>
		<title>Batch Normalization/es</title>
		<link rel="alternate" type="text/html" href="https://marovi.ai/index.php?title=Batch_Normalization/es&amp;diff=2149"/>
		<updated>2026-04-24T07:09:00Z</updated>

		<summary type="html">&lt;p&gt;DeployBot: [deploy-bot] Deploy from CI (8c92aeb)&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{LanguageBar | page = Batch Normalization}}&lt;br /&gt;
{{ArticleInfobox | topic_area = Deep Learning | difficulty = Intermediate | prerequisites = [[Neural Networks]], [[Backpropagation]]}}&lt;br /&gt;
{{ContentMeta | generated_by = claude-opus | model_used = claude-opus-4-6 | generated_date = 2026-03-13}}&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Batch normalization&#039;&#039;&#039; (frecuentemente abreviado &#039;&#039;&#039;BatchNorm&#039;&#039;&#039; o &#039;&#039;&#039;BN&#039;&#039;&#039;) es una tecnica para mejorar la velocidad, estabilidad y rendimiento de las redes neuronales profundas mediante la normalizacion de las entradas a cada capa. Introducida por Ioffe y Szegedy en 2015, se ha convertido en un componente estandar en la mayoria de las arquitecturas modernas de aprendizaje profundo.&lt;br /&gt;
&lt;br /&gt;
== Desplazamiento covariante interno ==&lt;br /&gt;
&lt;br /&gt;
La motivacion original de batch normalization fue abordar el &#039;&#039;&#039;desplazamiento covariante interno&#039;&#039;&#039; — el fenomeno por el cual la distribucion de las entradas de cada capa cambia durante el entrenamiento a medida que se actualizan los parametros de las capas precedentes. Esta distribucion cambiante obliga a cada capa a adaptarse continuamente, ralentizando la convergencia y requiriendo una inicializacion cuidadosa y tasas de aprendizaje pequenas.&lt;br /&gt;
&lt;br /&gt;
Aunque el papel preciso del desplazamiento covariante interno ha sido debatido (Santurkar et al., 2018, argumentaron que los beneficios de BatchNorm provienen mas del suavizado del paisaje de perdida), la efectividad practica de la tecnica esta bien establecida.&lt;br /&gt;
&lt;br /&gt;
== El algoritmo de batch normalization ==&lt;br /&gt;
&lt;br /&gt;
=== Durante el entrenamiento ===&lt;br /&gt;
&lt;br /&gt;
Para un mini-lote &amp;lt;math&amp;gt;\mathcal{B} = \{x_1, \dots, x_m\}&amp;lt;/math&amp;gt; de activaciones en una capa dada, BatchNorm procede de la siguiente manera:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Paso 1.&#039;&#039;&#039; Calcular la media y varianza del mini-lote:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\mu_{\mathcal{B}} = \frac{1}{m} \sum_{i=1}^{m} x_i, \qquad \sigma_{\mathcal{B}}^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_{\mathcal{B}})^2&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Paso 2.&#039;&#039;&#039; Normalizar:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\hat{x}_i = \frac{x_i - \mu_{\mathcal{B}}}{\sqrt{\sigma_{\mathcal{B}}^2 + \epsilon}}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
donde &amp;lt;math&amp;gt;\epsilon&amp;lt;/math&amp;gt; es una constante pequena (por ejemplo, &amp;lt;math&amp;gt;10^{-5}&amp;lt;/math&amp;gt;) para estabilidad numerica.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Paso 3.&#039;&#039;&#039; Escalar y desplazar con parametros aprendidos &amp;lt;math&amp;gt;\gamma&amp;lt;/math&amp;gt; y &amp;lt;math&amp;gt;\beta&amp;lt;/math&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;y_i = \gamma \hat{x}_i + \beta&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Los parametros &amp;lt;math&amp;gt;\gamma&amp;lt;/math&amp;gt; y &amp;lt;math&amp;gt;\beta&amp;lt;/math&amp;gt; se aprenden durante el entrenamiento. Restauran la capacidad de la red para representar la transformacion identidad si esta es optima, asegurando que la normalizacion no reduzca la expresividad del modelo.&lt;br /&gt;
&lt;br /&gt;
=== Durante la inferencia ===&lt;br /&gt;
&lt;br /&gt;
En el momento de la inferencia, las estadísticas de un mini-lote individual no son fiables (la entrada puede ser un único ejemplo). En su lugar, BatchNorm utiliza estimaciones de la media y la varianza poblacionales, acumuladas durante el entrenamiento mediante promedios móviles exponenciales:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\mu_{\mathrm{running}} \leftarrow (1 - \alpha)\, \mu_{\mathrm{running}} + \alpha\, \mu_{\mathcal{B}}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\sigma^2_{\mathrm{running}} \leftarrow (1 - \alpha)\, \sigma^2_{\mathrm{running}} + \alpha\, \sigma^2_{\mathcal{B}}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
donde &amp;lt;math&amp;gt;\alpha&amp;lt;/math&amp;gt; es el parametro de momento (tipicamente 0.1). Estas estadisticas fijas aseguran salidas deterministas en la inferencia.&lt;br /&gt;
&lt;br /&gt;
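Los tres pasos de entrenamiento y la actualización de las estadísticas acumuladas pueden esbozarse así en Python (esquema mínimo con nombres hipotéticos, para una única característica; las implementaciones reales operan por canal y de forma vectorizada):&lt;br /&gt;
&lt;br /&gt;
```python
import math

def batchnorm_entrenamiento(lote, gamma=1.0, beta=0.0, eps=1e-5):
    m = len(lote)
    mu = sum(lote) / m                             # media del mini-lote
    var = sum((x - mu) ** 2 for x in lote) / m     # varianza del mini-lote
    # normalizar, luego escalar y desplazar con gamma y beta
    salida = [gamma * (x - mu) / math.sqrt(var + eps) + beta for x in lote]
    return salida, mu, var

lote = [1.0, 2.0, 3.0, 4.0]
salida, mu, var = batchnorm_entrenamiento(lote)
# la salida normalizada tiene media ~0
assert abs(sum(salida) / len(salida)) < 1e-6

# actualizacion de las estadisticas acumuladas (momento alfa = 0.1)
mu_acum, var_acum, alfa = 0.0, 1.0, 0.1
mu_acum = (1 - alfa) * mu_acum + alfa * mu
var_acum = (1 - alfa) * var_acum + alfa * var
assert abs(mu_acum - 0.25) < 1e-12
```
&lt;br /&gt;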
== Beneficios ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Tasas de aprendizaje mas altas&#039;&#039;&#039;: Al restringir las distribuciones de activacion, BatchNorm permite pasos mas grandes sin divergencia.&lt;br /&gt;
* &#039;&#039;&#039;Menor sensibilidad a la inicializacion&#039;&#039;&#039;: Las redes con BatchNorm son mas tolerantes a una inicializacion de pesos deficiente.&lt;br /&gt;
* &#039;&#039;&#039;Efecto regularizador&#039;&#039;&#039;: El ruido introducido por las estadisticas del mini-lote actua como un regularizador suave, a veces reduciendo la necesidad de [[Dropout]].&lt;br /&gt;
* &#039;&#039;&#039;Convergencia mas rapida&#039;&#039;&#039;: El entrenamiento tipicamente requiere menos epocas para alcanzar un nivel dado de rendimiento.&lt;br /&gt;
&lt;br /&gt;
== Ubicacion ==&lt;br /&gt;
&lt;br /&gt;
BatchNorm se aplica tipicamente &#039;&#039;&#039;antes&#039;&#039;&#039; de la funcion de activacion (como en el articulo original), aunque algunos profesionales lo colocan &#039;&#039;&#039;despues&#039;&#039;&#039; de la activacion. Para capas convolucionales, la normalizacion se realiza por canal a traves de las dimensiones espaciales y la dimension del lote.&lt;br /&gt;
&lt;br /&gt;
== Alternativas de normalizacion ==&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! Metodo !! Normaliza sobre !! Caso de uso&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;Batch Norm&#039;&#039;&#039; || Dimensiones del lote y espaciales, por canal || CNN con lotes grandes&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;Layer Norm&#039;&#039;&#039; || Todos los canales y dimensiones espaciales, por muestra || Transformers, RNN, lotes pequenos&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;Instance Norm&#039;&#039;&#039; || Solo dimensiones espaciales, por muestra por canal || Transferencia de estilo, generacion de imagenes&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;Group Norm&#039;&#039;&#039; || Grupos de canales, por muestra || Deteccion de objetos, entrenamiento con lotes pequenos&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
La &#039;&#039;&#039;normalizacion de capa&#039;&#039;&#039; (Ba et al., 2016) normaliza a traves de todas las caracteristicas dentro de una unica muestra, haciendola independiente del tamano del lote. Es la opcion estandar en las arquitecturas Transformer.&lt;br /&gt;
&lt;br /&gt;
La &#039;&#039;&#039;normalizacion de grupo&#039;&#039;&#039; (Wu y He, 2018) divide los canales en grupos y normaliza dentro de cada grupo por muestra. Sirve de puente entre Layer Norm e Instance Norm y funciona bien cuando los tamanos de lote son demasiado pequenos para obtener estadisticas de lote fiables.&lt;br /&gt;
&lt;br /&gt;
== Limitaciones ==&lt;br /&gt;
&lt;br /&gt;
* El rendimiento se degrada con tamanos de lote muy pequenos, ya que las estadisticas del lote se vuelven ruidosas.&lt;br /&gt;
* Introduce una discrepancia entre el comportamiento de entrenamiento (estadisticas de lote) y el de inferencia (estadisticas acumuladas).&lt;br /&gt;
* No es directamente aplicable a secuencias de longitud variable sin relleno o enmascaramiento.&lt;br /&gt;
* Las estadisticas acumuladas requieren un manejo cuidadoso cuando se utiliza entrenamiento distribuido en multiples dispositivos.&lt;br /&gt;
&lt;br /&gt;
== Vease tambien ==&lt;br /&gt;
&lt;br /&gt;
* [[Neural Networks]]&lt;br /&gt;
* [[Backpropagation]]&lt;br /&gt;
* [[Dropout]]&lt;br /&gt;
* [[Stochastic Gradient Descent]]&lt;br /&gt;
* [[Transformer]]&lt;br /&gt;
&lt;br /&gt;
== Referencias ==&lt;br /&gt;
&lt;br /&gt;
* Ioffe, S. and Szegedy, C. (2015). &amp;quot;Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift&amp;quot;. &#039;&#039;ICML&#039;&#039;.&lt;br /&gt;
* Ba, J. L., Kiros, J. R. and Hinton, G. E. (2016). &amp;quot;Layer Normalization&amp;quot;. &#039;&#039;arXiv:1607.06450&#039;&#039;.&lt;br /&gt;
* Wu, Y. and He, K. (2018). &amp;quot;Group Normalization&amp;quot;. &#039;&#039;ECCV&#039;&#039;.&lt;br /&gt;
* Santurkar, S. et al. (2018). &amp;quot;How Does Batch Normalization Help Optimization?&amp;quot;. &#039;&#039;NeurIPS&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
[[Category:Deep Learning]]&lt;br /&gt;
[[Category:Intermediate]]&lt;br /&gt;
[[Category:Neural Networks]]&lt;/div&gt;</summary>
		<author><name>DeployBot</name></author>
	</entry>
	<entry>
		<id>https://marovi.ai/index.php?title=Backpropagation/es&amp;diff=2148</id>
		<title>Backpropagation/es</title>
		<link rel="alternate" type="text/html" href="https://marovi.ai/index.php?title=Backpropagation/es&amp;diff=2148"/>
		<updated>2026-04-24T07:09:00Z</updated>

		<summary type="html">&lt;p&gt;DeployBot: [deploy-bot] Deploy from CI (8c92aeb)&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{LanguageBar | page = Backpropagation}}&lt;br /&gt;
{{ArticleInfobox | topic_area = Deep Learning | difficulty = Intermediate | prerequisites = [[Gradient Descent]], [[Neural Networks]]}}&lt;br /&gt;
{{ContentMeta | generated_by = claude-opus | model_used = claude-opus-4-6 | generated_date = 2026-03-13}}&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Backpropagation&#039;&#039;&#039; (abreviatura de &#039;&#039;&#039;propagacion hacia atras de errores&#039;&#039;&#039;) es un algoritmo para calcular de manera eficiente el gradiente de una funcion de perdida con respecto a cada peso en una red neuronal. Combinado con un metodo de optimizacion como el [[Gradient Descent|descenso de gradiente]], constituye el procedimiento estandar de entrenamiento para los modelos modernos de aprendizaje profundo.&lt;br /&gt;
&lt;br /&gt;
== La regla de la cadena ==&lt;br /&gt;
&lt;br /&gt;
Backpropagation es fundamentalmente una aplicacion de la &#039;&#039;&#039;regla de la cadena&#039;&#039;&#039; del calculo. Si una variable &amp;lt;math&amp;gt;z&amp;lt;/math&amp;gt; depende de &amp;lt;math&amp;gt;y&amp;lt;/math&amp;gt;, que a su vez depende de &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt;, entonces:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\frac{\partial z}{\partial x} = \frac{\partial z}{\partial y} \cdot \frac{\partial y}{\partial x}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
En una red neuronal, la perdida &amp;lt;math&amp;gt;L&amp;lt;/math&amp;gt; depende de la salida, que depende de las activaciones de la ultima capa oculta, que dependen de las activaciones de la capa anterior, y asi sucesivamente hasta la entrada. La regla de la cadena permite descomponer el gradiente en un producto de derivadas locales, una para cada capa.&lt;br /&gt;
&lt;br /&gt;
== Pasada hacia adelante ==&lt;br /&gt;
&lt;br /&gt;
Durante la pasada hacia adelante, los datos de entrada se propagan a traves de la red capa por capa. Para una capa completamente conectada &amp;lt;math&amp;gt;l&amp;lt;/math&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\mathbf{z}^{(l)} = \mathbf{W}^{(l)} \mathbf{a}^{(l-1)} + \mathbf{b}^{(l)}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\mathbf{a}^{(l)} = g^{(l)}(\mathbf{z}^{(l)})&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
donde &amp;lt;math&amp;gt;\mathbf{a}^{(l-1)}&amp;lt;/math&amp;gt; es la activacion de la capa anterior (con &amp;lt;math&amp;gt;\mathbf{a}^{(0)} = \mathbf{x}&amp;lt;/math&amp;gt;), &amp;lt;math&amp;gt;\mathbf{W}^{(l)}&amp;lt;/math&amp;gt; y &amp;lt;math&amp;gt;\mathbf{b}^{(l)}&amp;lt;/math&amp;gt; son los pesos y sesgos, y &amp;lt;math&amp;gt;g^{(l)}&amp;lt;/math&amp;gt; es la funcion de activacion. La pasada hacia adelante almacena todos los valores intermedios &amp;lt;math&amp;gt;\mathbf{z}^{(l)}&amp;lt;/math&amp;gt; y &amp;lt;math&amp;gt;\mathbf{a}^{(l)}&amp;lt;/math&amp;gt; porque se necesitan durante la pasada hacia atras.&lt;br /&gt;
&lt;br /&gt;
== Pasada hacia atras ==&lt;br /&gt;
&lt;br /&gt;
La pasada hacia atras calcula los gradientes comenzando desde la perdida y avanzando hacia la entrada. Se define la senal de error en la capa &amp;lt;math&amp;gt;l&amp;lt;/math&amp;gt; como:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\boldsymbol{\delta}^{(l)} = \frac{\partial L}{\partial \mathbf{z}^{(l)}}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Para la capa de salida (capa &amp;lt;math&amp;gt;L_{\text{out}}&amp;lt;/math&amp;gt;):&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\boldsymbol{\delta}^{(L_{\text{out}})} = \frac{\partial L}{\partial \mathbf{a}^{(L_{\text{out}})}} \odot g&#039;^{(L_{\text{out}})}(\mathbf{z}^{(L_{\text{out}})})&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Para cada capa anterior, el error se propaga hacia atrás:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\boldsymbol{\delta}^{(l)} = \bigl(\mathbf{W}^{(l+1)}\bigr)^\top \boldsymbol{\delta}^{(l+1)} \odot g&#039;^{(l)}(\mathbf{z}^{(l)})&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
donde &amp;lt;math&amp;gt;\odot&amp;lt;/math&amp;gt; denota la multiplicación elemento a elemento. Una vez conocida la señal de error, los gradientes de los parámetros son:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\frac{\partial L}{\partial \mathbf{W}^{(l)}} = \boldsymbol{\delta}^{(l)} \bigl(\mathbf{a}^{(l-1)}\bigr)^\top, \qquad \frac{\partial L}{\partial \mathbf{b}^{(l)}} = \boldsymbol{\delta}^{(l)}&amp;lt;/math&amp;gt;&lt;br /&gt;
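Las dos pasadas y la verificación del gradiente pueden esbozarse en NumPy para una red diminuta; las dimensiones, la pérdida cuadrática y la comprobación por diferencias finitas son suposiciones puramente ilustrativas, no una implementación de referencia:

```python
import numpy as np

rng = np.random.default_rng(0)

# Red hipotética de dos capas: oculta tanh, salida lineal, pérdida cuadrática.
W1, b1 = rng.normal(size=(4, 3)), np.zeros((4, 1))
W2, b2 = rng.normal(size=(1, 4)), np.zeros((1, 1))
x = rng.normal(size=(3, 1))
y = np.array([[0.5]])

def forward(W1, b1, W2, b2, x):
    z1 = W1 @ x + b1    # pre-activación de la capa 1
    a1 = np.tanh(z1)    # activación (se guarda para la pasada hacia atrás)
    z2 = W2 @ a1 + b2   # salida lineal
    L = 0.5 * float(((z2 - y) ** 2).sum())
    return z1, a1, z2, L

z1, a1, z2, L = forward(W1, b1, W2, b2, x)

# Pasada hacia atrás: deltas y gradientes según las fórmulas anteriores.
delta2 = z2 - y                              # dL/dz2 (salida lineal, g' = 1)
dW2 = delta2 @ a1.T
delta1 = (W2.T @ delta2) * (1.0 - a1 ** 2)   # producto elemento a elemento con tanh'(z1)
dW1 = delta1 @ x.T

# Comprobación por diferencias finitas sobre una entrada de W1.
eps = 1e-6
W1p = W1.copy()
W1p[0, 0] += eps
num_grad = (forward(W1p, b1, W2, b2, x)[3] - L) / eps
```

El gradiente analítico de cada parámetro debe coincidir con la estimación numérica salvo un error del orden de `eps`.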
&lt;br /&gt;
== Grafos computacionales ==&lt;br /&gt;
&lt;br /&gt;
Los frameworks modernos de aprendizaje profundo (PyTorch, TensorFlow, JAX) implementan backpropagation construyendo un &#039;&#039;&#039;grafo computacional&#039;&#039;&#039; — un grafo dirigido acíclico donde cada nodo representa una operación y cada arista transporta un tensor. La pasada hacia adelante construye el grafo; la pasada hacia atrás lo recorre en orden topológico inverso, aplicando la regla de la cadena en cada nodo.&lt;br /&gt;
&lt;br /&gt;
Esta abstracción permite diferenciar composiciones arbitrarias de operaciones, no solo tipos de capas estándar. Existen dos estrategias de implementación:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Grafos estáticos&#039;&#039;&#039; — el grafo se define una vez antes de la ejecución (TensorFlow en sus versiones iniciales). Permite optimizaciones agresivas del compilador pero es menos flexible.&lt;br /&gt;
* &#039;&#039;&#039;Grafos dinámicos&#039;&#039;&#039; — el grafo se reconstruye en cada pasada hacia adelante (PyTorch, modo Eager de TensorFlow). Más intuitivo para la depuración y para modelos con flujo de control dependiente de los datos.&lt;br /&gt;
&lt;br /&gt;
== Diferenciación automática ==&lt;br /&gt;
&lt;br /&gt;
Backpropagation es un caso especial de la &#039;&#039;&#039;diferenciación automática en modo inverso&#039;&#039;&#039; (AD). A diferencia de la diferenciación numérica (que es aproximada) o la diferenciación simbólica (que puede producir expresiones inmanejables), la AD calcula derivadas exactas aplicando sistemáticamente la regla de la cadena a las operaciones elementales.&lt;br /&gt;
&lt;br /&gt;
La AD en modo inverso calcula el gradiente de una salida escalar con respecto a todas las entradas en una única pasada hacia atrás, lo que la hace ideal para las redes neuronales, donde la pérdida es escalar pero los parámetros se cuentan por millones.&lt;br /&gt;
&lt;br /&gt;
El coste de la pasada hacia atrás es típicamente entre 2 y 3 veces el de la pasada hacia adelante, ya que debe evaluar los jacobianos locales y multiplicarlos con la señal de error entrante.&lt;br /&gt;
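Un esbozo minimalista (con nombres y sobrecargas hipotéticos) ilustra cómo el modo inverso acumula gradientes recorriendo los padres de cada nodo del grafo:

```python
class Var:
    """Nodo minimalista de AD en modo inverso (esbozo ilustrativo)."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents   # pares (nodo padre, derivada local)
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

    def backward(self, seed=1.0):
        # Acumula el gradiente entrante y lo propaga a los padres
        # multiplicado por la derivada local (regla de la cadena).
        self.grad += seed
        for nodo, local in self.parents:
            nodo.backward(seed * local)

x = Var(3.0)
y = Var(2.0)
z = x * y + x      # z = xy + x, luego dz/dx = y + 1 y dz/dy = x
z.backward()
```

Tras `z.backward()`, cada hoja contiene la derivada de `z` respecto a sí misma; nótese que el gradiente de `x` acumula las contribuciones de sus dos usos.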
&lt;br /&gt;
== Gradientes que se desvanecen y explotan ==&lt;br /&gt;
&lt;br /&gt;
Cuando una red tiene muchas capas, el gradiente es un producto de muchas derivadas locales. Si estos factores son consistentemente menores que 1, el gradiente se reduce exponencialmente hacia cero — el problema del &#039;&#039;&#039;gradiente que se desvanece&#039;&#039;&#039;. Si son consistentemente mayores que 1, el gradiente crece exponencialmente — el problema del &#039;&#039;&#039;gradiente que explota&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! Problema !! Síntoma !! Mitigaciones comunes&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;Gradientes que se desvanecen&#039;&#039;&#039; || Las capas iniciales aprenden extremadamente lento || Activaciones ReLU, conexiones residuales, batch normalization, inicialización cuidadosa&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;Gradientes que explotan&#039;&#039;&#039; || La pérdida diverge o produce valores NaN || Recorte de gradientes, regularización de pesos, tasa de aprendizaje más baja&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Estos problemas fueron obstáculos importantes para el entrenamiento de redes profundas antes de la introducción de las activaciones ReLU, las conexiones residuales (ResNets) y las técnicas de normalización.&lt;br /&gt;
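El crecimiento y la contracción exponenciales pueden ilustrarse numéricamente propagando una señal de error a través de muchas capas lineales aleatorias; la función, las escalas y los tamaños son hipotéticos:

```python
import numpy as np

rng = np.random.default_rng(1)

def norma_final(escala, capas=50, n=32):
    """Norma de la señal de error tras atravesar `capas` capas lineales
    con pesos gaussianos de la escala dada (parámetros ilustrativos)."""
    delta = np.ones(n) / np.sqrt(n)   # señal inicial de norma 1
    for _ in range(capas):
        W = rng.normal(scale=escala / np.sqrt(n), size=(n, n))
        delta = W.T @ delta           # un factor (W^T delta) por capa
    return float(np.linalg.norm(delta))
```

Con un factor efectivo por capa cercano a 0.5 la norma colapsa hacia cero; con 1.5 crece varios órdenes de magnitud.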
&lt;br /&gt;
== Consideraciones prácticas ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Memoria&#039;&#039;&#039; — la pasada hacia adelante debe almacenar todas las activaciones intermedias para la pasada hacia atrás. Para redes muy profundas esto puede ser prohibitivo; el &#039;&#039;&#039;gradient checkpointing&#039;&#039;&#039; intercambia cálculo por memoria al recalcular las activaciones durante la pasada hacia atrás en lugar de almacenarlas.&lt;br /&gt;
* &#039;&#039;&#039;Estabilidad numérica&#039;&#039;&#039; — el uso de trucos como log-sum-exp e implementaciones fusionadas de softmax-cross-entropy evita el desbordamiento por exceso y por defecto.&lt;br /&gt;
* &#039;&#039;&#039;Gradientes de orden superior&#039;&#039;&#039; — diferenciar a través de la propia pasada hacia atrás produce información de segundo orden (productos hessiana-vector), útil para métodos como el descenso de gradiente natural y el meta-aprendizaje.&lt;br /&gt;
* &#039;&#039;&#039;Precisión mixta&#039;&#039;&#039; — realizar la pasada hacia adelante en media precisión mientras se mantiene una copia maestra de los pesos en precisión completa acelera el entrenamiento en GPUs modernas.&lt;br /&gt;
&lt;br /&gt;
== Desarrollo histórico ==&lt;br /&gt;
&lt;br /&gt;
Las ideas clave detrás de backpropagation fueron desarrolladas de forma independiente por varios investigadores. Seppo Linnainmaa describió la diferenciación automática en modo inverso en 1970. Paul Werbos la aplicó a las redes neuronales en su tesis doctoral de 1974. El algoritmo alcanzó una adopción generalizada tras el influyente artículo de 1986 de Rumelhart, Hinton y Williams, que demostró su eficacia en redes multicapa.&lt;br /&gt;
&lt;br /&gt;
== Véase también ==&lt;br /&gt;
&lt;br /&gt;
* [[Gradient Descent]]&lt;br /&gt;
* [[Stochastic Gradient Descent]]&lt;br /&gt;
* [[Neural Networks]]&lt;br /&gt;
* [[Loss Functions]]&lt;br /&gt;
* [[Convolutional Neural Networks]]&lt;br /&gt;
* [[Recurrent Neural Networks]]&lt;br /&gt;
&lt;br /&gt;
== Referencias ==&lt;br /&gt;
&lt;br /&gt;
* Rumelhart, D. E., Hinton, G. E. and Williams, R. J. (1986). &amp;quot;Learning representations by back-propagating errors&amp;quot;. &#039;&#039;Nature&#039;&#039;, 323, 533–536.&lt;br /&gt;
* Linnainmaa, S. (1970). &amp;quot;The representation of the cumulative rounding error of an algorithm as a Taylor expansion of the local rounding errors&amp;quot;. Master&#039;s thesis, University of Helsinki.&lt;br /&gt;
* Werbos, P. J. (1974). &amp;quot;Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences&amp;quot;. PhD thesis, Harvard University.&lt;br /&gt;
* Baydin, A. G. et al. (2018). &amp;quot;Automatic Differentiation in Machine Learning: a Survey&amp;quot;. &#039;&#039;JMLR&#039;&#039;, 18(153), 1–43.&lt;br /&gt;
* Goodfellow, I., Bengio, Y. and Courville, A. (2016). &#039;&#039;Deep Learning&#039;&#039;, Chapter 6. MIT Press.&lt;br /&gt;
&lt;br /&gt;
[[Category:Deep Learning]]&lt;br /&gt;
[[Category:Intermediate]]&lt;/div&gt;</summary>
		<author><name>DeployBot</name></author>
	</entry>
	<entry>
		<id>https://marovi.ai/index.php?title=Attention_Mechanisms/es&amp;diff=2147</id>
		<title>Attention Mechanisms/es</title>
		<link rel="alternate" type="text/html" href="https://marovi.ai/index.php?title=Attention_Mechanisms/es&amp;diff=2147"/>
		<updated>2026-04-24T07:09:00Z</updated>

		<summary type="html">&lt;p&gt;DeployBot: [deploy-bot] Deploy from CI (8c92aeb)&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{LanguageBar | page = Attention Mechanisms}}&lt;br /&gt;
{{ArticleInfobox | topic_area = Deep Learning | difficulty = Advanced | prerequisites = [[Neural Networks]], [[Recurrent Neural Networks]]}}&lt;br /&gt;
{{ContentMeta | generated_by = claude-opus | model_used = claude-opus-4-6 | generated_date = 2026-03-13}}&lt;br /&gt;
&lt;br /&gt;
Los &#039;&#039;&#039;mecanismos de atención&#039;&#039;&#039; son una familia de técnicas que permiten a las redes neuronales enfocarse selectivamente en las partes relevantes de su entrada al producir cada elemento de la salida. Introducida originalmente para superar las limitaciones de los vectores de contexto de longitud fija en los modelos secuencia a secuencia, la atención se ha convertido en el bloque de construcción fundamental de las arquitecturas modernas como el [[Transformer]].&lt;br /&gt;
&lt;br /&gt;
== Motivación ==&lt;br /&gt;
&lt;br /&gt;
Los primeros modelos secuencia a secuencia codificaban una secuencia de entrada completa en un único vector de dimensión fija utilizando una [[Recurrent Neural Networks|red neuronal recurrente]]. Este &#039;&#039;cuello de botella&#039;&#039; obligaba a comprimir las dependencias de largo alcance en un vector de tamaño constante, degradando el rendimiento en secuencias largas. La atención resuelve esto permitiendo que el decodificador consulte todos los estados ocultos del codificador en cada paso de generación, ponderándolos mediante puntuaciones de relevancia aprendidas.&lt;br /&gt;
&lt;br /&gt;
== Atención de Bahdanau (aditiva) ==&lt;br /&gt;
&lt;br /&gt;
Bahdanau et al. (2015) propusieron el primer mecanismo de atención ampliamente adoptado para la traducción automática. Dados los estados ocultos del codificador &amp;lt;math&amp;gt;h_1, \dots, h_T&amp;lt;/math&amp;gt; y el estado del decodificador &amp;lt;math&amp;gt;s_{t-1}&amp;lt;/math&amp;gt;, la puntuación de alineamiento se calcula como:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;e_{t,i} = v^{\!\top} \tanh(W_s\, s_{t-1} + W_h\, h_i)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
donde &amp;lt;math&amp;gt;W_s&amp;lt;/math&amp;gt;, &amp;lt;math&amp;gt;W_h&amp;lt;/math&amp;gt; y &amp;lt;math&amp;gt;v&amp;lt;/math&amp;gt; son parámetros aprendidos. Los pesos de atención se obtienen aplicando softmax:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{j=1}^{T} \exp(e_{t,j})}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
El vector de contexto es la suma ponderada &amp;lt;math&amp;gt;c_t = \sum_{i=1}^{T} \alpha_{t,i}\, h_i&amp;lt;/math&amp;gt;, que se concatena con &amp;lt;math&amp;gt;s_{t-1}&amp;lt;/math&amp;gt; y se alimenta al decodificador.&lt;br /&gt;
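A modo de esbozo (dimensiones y valores hipotéticos), la puntuación aditiva, el softmax y el vector de contexto pueden escribirse directamente en NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)

T, dh, ds, da = 5, 8, 8, 16       # longitudes y dimensiones ilustrativas
H = rng.normal(size=(T, dh))      # estados ocultos del codificador h_1..h_T
s_prev = rng.normal(size=(ds,))   # estado del decodificador s_{t-1}
W_s = rng.normal(size=(da, ds))
W_h = rng.normal(size=(da, dh))
v = rng.normal(size=(da,))

# Puntuaciones de alineamiento e_{t,i} = v^T tanh(W_s s_{t-1} + W_h h_i)
e = np.array([v @ np.tanh(W_s @ s_prev + W_h @ h) for h in H])

# Pesos de atención (softmax numéricamente estable)
alpha = np.exp(e - e.max())
alpha /= alpha.sum()

# Vector de contexto: suma ponderada de los estados del codificador
c = alpha @ H
```

Los pesos `alpha` suman 1 y el contexto `c` vive en el espacio de los estados del codificador.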
&lt;br /&gt;
== Atención de Luong (multiplicativa) ==&lt;br /&gt;
&lt;br /&gt;
Luong et al. (2015) simplificaron la función de puntuación reemplazando la red aditiva con un producto escalar o una forma bilineal:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! Variante !! Función de puntuación&lt;br /&gt;
|-&lt;br /&gt;
| Dot || &amp;lt;math&amp;gt;e_{t,i} = s_t^{\!\top} h_i&amp;lt;/math&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| General || &amp;lt;math&amp;gt;e_{t,i} = s_t^{\!\top} W_a\, h_i&amp;lt;/math&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| Concat || &amp;lt;math&amp;gt;e_{t,i} = v^{\!\top} \tanh(W_a [s_t;\, h_i])&amp;lt;/math&amp;gt;&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
La variante dot requiere que las dimensiones del codificador y el decodificador coincidan, mientras que la variante general introduce una matriz de pesos aprendible &amp;lt;math&amp;gt;W_a&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Atención de producto escalar escalado ==&lt;br /&gt;
&lt;br /&gt;
Vaswani et al. (2017) introdujeron la formulación utilizada en el Transformer. Dadas las matrices de consultas &amp;lt;math&amp;gt;Q&amp;lt;/math&amp;gt;, claves &amp;lt;math&amp;gt;K&amp;lt;/math&amp;gt; y valores &amp;lt;math&amp;gt;V&amp;lt;/math&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\!\top}}{\sqrt{d_k}}\right) V&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
El factor de escalado &amp;lt;math&amp;gt;\sqrt{d_k}&amp;lt;/math&amp;gt; evita que los productos escalares crezcan en magnitud a medida que la dimensión de la clave &amp;lt;math&amp;gt;d_k&amp;lt;/math&amp;gt; aumenta, lo que empujaría al softmax hacia regiones de gradientes extremadamente pequeños.&lt;br /&gt;
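Una transcripción directa de la fórmula en NumPy (formas y datos hipotéticos, sin enmascaramiento ni lotes):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # estabilidad numérica
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # QK^T / sqrt(d_k)
    A = softmax(scores, axis=-1)      # cada fila de pesos suma 1
    return A @ V, A

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(6, 8))
V = rng.normal(size=(6, 8))
salida, A = attention(Q, K, V)
```

Cada una de las 4 consultas produce una fila de pesos sobre las 6 claves y una combinación convexa de los valores.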
&lt;br /&gt;
== Autoatención ==&lt;br /&gt;
&lt;br /&gt;
En la &#039;&#039;&#039;autoatención&#039;&#039;&#039;, las consultas, claves y valores se derivan de la misma secuencia. Cada posición atiende a todas las demás posiciones (incluida ella misma), lo que permite al modelo capturar dependencias de largo alcance en una única capa. Para una matriz de entrada &amp;lt;math&amp;gt;X \in \mathbb{R}^{n \times d}&amp;lt;/math&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;Q = X W^Q, \quad K = X W^K, \quad V = X W^V&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
La autoatención tiene una complejidad de &amp;lt;math&amp;gt;O(n^2 d)&amp;lt;/math&amp;gt;, lo que puede ser costoso para secuencias muy largas. Variantes eficientes como la atención dispersa y la atención lineal reducen este coste.&lt;br /&gt;
&lt;br /&gt;
== Atención multicabezal ==&lt;br /&gt;
&lt;br /&gt;
En lugar de realizar una única función de atención, la &#039;&#039;&#039;atención multicabezal&#039;&#039;&#039; ejecuta &amp;lt;math&amp;gt;h&amp;lt;/math&amp;gt; cabezas de atención en paralelo con proyecciones independientes:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^O&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
donde &amp;lt;math&amp;gt;\mathrm{head}_i = \mathrm{Attention}(Q W_i^Q,\, K W_i^K,\, V W_i^V)&amp;lt;/math&amp;gt;. Cada cabeza puede aprender a atender diferentes aspectos de la entrada — por ejemplo, una cabeza podría capturar relaciones sintácticas mientras otra captura relaciones semánticas. Las configuraciones típicas utilizan 8 o 16 cabezas.&lt;br /&gt;
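Un esbozo en NumPy con proyecciones aleatorias hipotéticas; una implementación real usaría parámetros aprendidos y operaciones por lotes:

```python
import numpy as np

rng = np.random.default_rng(0)

n, d, h = 6, 16, 4   # longitud, dimensión del modelo y número de cabezas
d_k = d // h
X = rng.normal(size=(n, d))

# Proyecciones por cabeza (W_i^Q, W_i^K, W_i^V) y proyección de salida W^O
WQ = rng.normal(size=(h, d, d_k))
WK = rng.normal(size=(h, d, d_k))
WV = rng.normal(size=(h, d, d_k))
WO = rng.normal(size=(h * d_k, d))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

heads = []
for i in range(h):
    Q, K, V = X @ WQ[i], X @ WK[i], X @ WV[i]
    A = softmax(Q @ K.T / np.sqrt(d_k))   # autoatención de la cabeza i
    heads.append(A @ V)

# Concat(head_1, ..., head_h) W^O
salida = np.concatenate(heads, axis=-1) @ WO
```

Cada cabeza trabaja en un subespacio de dimensión `d_k = d / h`, de modo que el coste total es comparable al de una sola cabeza de dimensión completa.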
&lt;br /&gt;
== Codificación posicional ==&lt;br /&gt;
&lt;br /&gt;
Dado que la autoatención es invariante a la permutación (trata la entrada como un conjunto no ordenado), la información posicional debe inyectarse explícitamente. El Transformer original utiliza codificaciones sinusoidales:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\mathrm{PE}(pos, 2i) = \sin\!\left(\frac{pos}{10000^{2i/d}}\right), \quad \mathrm{PE}(pos, 2i+1) = \cos\!\left(\frac{pos}{10000^{2i/d}}\right)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Los embeddings posicionales aprendidos y las codificaciones posicionales relativas (por ejemplo, RoPE, ALiBi) son alternativas comunes que pueden generalizar mejor a longitudes de secuencia no vistas.&lt;br /&gt;
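Las codificaciones sinusoidales pueden generarse directamente a partir de la fórmula; la función y sus dimensiones son ilustrativas:

```python
import numpy as np

def pos_encoding(n, d):
    """Matriz (n, d) de codificaciones sinusoidales del Transformer original."""
    pos = np.arange(n)[:, None]                # posiciones 0..n-1
    i = np.arange(d // 2)[None, :]             # índice de cada par (sin, cos)
    ang = pos / (10000 ** (2 * i / d))
    pe = np.zeros((n, d))
    pe[:, 0::2] = np.sin(ang)                  # dimensiones pares
    pe[:, 1::2] = np.cos(ang)                  # dimensiones impares
    return pe

PE = pos_encoding(50, 16)
```

Cada posición recibe un patrón único y acotado en [-1, 1], y las posiciones cercanas producen vectores similares.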
&lt;br /&gt;
== Atención cruzada ==&lt;br /&gt;
&lt;br /&gt;
La &#039;&#039;&#039;atención cruzada&#039;&#039;&#039; se utiliza cuando las consultas provienen de una secuencia y las claves/valores provienen de otra. En los Transformers codificador-decodificador, el decodificador atiende a las salidas del codificador mediante atención cruzada, lo que permite al modelo condicionar su generación en el contexto completo de la entrada.&lt;br /&gt;
&lt;br /&gt;
== Consideraciones prácticas ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Enmascaramiento&#039;&#039;&#039;: En la decodificación autorregresiva, las posiciones futuras se enmascaran (se establecen en &amp;lt;math&amp;gt;-\infty&amp;lt;/math&amp;gt; antes del softmax) para preservar la estructura causal.&lt;br /&gt;
* &#039;&#039;&#039;Dropout en atención&#039;&#039;&#039;: Descartar pesos de atención aleatoriamente durante el entrenamiento actúa como regularizador y reduce el sobreajuste a patrones de alineamiento específicos.&lt;br /&gt;
* &#039;&#039;&#039;Caché de clave-valor&#039;&#039;&#039;: Durante la inferencia, los vectores de clave y valor previamente calculados se almacenan en caché para evitar cálculo redundante, acelerando significativamente la generación autorregresiva.&lt;br /&gt;
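El enmascaramiento causal del primer punto puede esbozarse así (tamaños hipotéticos): las posiciones futuras reciben menos infinito antes del softmax, de modo que su peso resulta exactamente cero:

```python
import numpy as np

n = 5
scores = np.random.default_rng(0).normal(size=(n, n))

# Máscara triangular superior estricta: True en las posiciones futuras
mask = np.triu(np.ones((n, n), dtype=bool), k=1)
scores = np.where(mask, -np.inf, scores)

# Softmax por filas; exp(-inf) = 0, así que el futuro no recibe peso
e = np.exp(scores - scores.max(axis=-1, keepdims=True))
A = e / e.sum(axis=-1, keepdims=True)
```

La matriz de pesos resultante es triangular inferior: la posición `t` solo atiende a las posiciones `0..t`.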
&lt;br /&gt;
== Véase también ==&lt;br /&gt;
&lt;br /&gt;
* [[Transformer]]&lt;br /&gt;
* [[Recurrent Neural Networks]]&lt;br /&gt;
* [[Sequence-to-sequence models]]&lt;br /&gt;
* [[Self-supervised learning]]&lt;br /&gt;
* [[Softmax Function]]&lt;br /&gt;
&lt;br /&gt;
== Referencias ==&lt;br /&gt;
&lt;br /&gt;
* Bahdanau, D., Cho, K. and Bengio, Y. (2015). &amp;quot;Neural Machine Translation by Jointly Learning to Align and Translate&amp;quot;. &#039;&#039;ICLR&#039;&#039;.&lt;br /&gt;
* Luong, M.-T., Pham, H. and Manning, C. D. (2015). &amp;quot;Effective Approaches to Attention-based Neural Machine Translation&amp;quot;. &#039;&#039;EMNLP&#039;&#039;.&lt;br /&gt;
* Vaswani, A. et al. (2017). &amp;quot;Attention Is All You Need&amp;quot;. &#039;&#039;NeurIPS&#039;&#039;.&lt;br /&gt;
* Shaw, P., Uszkoreit, J. and Vaswani, A. (2018). &amp;quot;Self-Attention with Relative Position Representations&amp;quot;. &#039;&#039;NAACL&#039;&#039;.&lt;br /&gt;
* Su, J. et al. (2021). &amp;quot;RoFormer: Enhanced Transformer with Rotary Position Embedding&amp;quot;. &#039;&#039;arXiv:2104.09864&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
[[Category:Deep Learning]]&lt;br /&gt;
[[Category:Advanced]]&lt;br /&gt;
[[Category:Neural Networks]]&lt;/div&gt;</summary>
		<author><name>DeployBot</name></author>
	</entry>
	<entry>
		<id>https://marovi.ai/index.php?title=Word_Embeddings&amp;diff=2146</id>
		<title>Word Embeddings</title>
		<link rel="alternate" type="text/html" href="https://marovi.ai/index.php?title=Word_Embeddings&amp;diff=2146"/>
		<updated>2026-04-24T07:09:00Z</updated>

		<summary type="html">&lt;p&gt;DeployBot: [deploy-bot] Deploy from CI (8c92aeb)&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{LanguageBar | page = Word Embeddings}}&lt;br /&gt;
{{ArticleInfobox | topic_area = NLP | difficulty = Intermediate | prerequisites = [[Neural Networks]]}}&lt;br /&gt;
{{ContentMeta | generated_by = claude-opus | model_used = claude-opus-4-6 | generated_date = 2026-03-13}}&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Word embeddings&#039;&#039;&#039; are dense, low-dimensional vector representations of words in which semantically similar words are mapped to nearby points in the vector space. They are a foundational component of modern natural language processing (NLP), replacing sparse one-hot encodings with representations that capture meaning, analogy, and syntactic relationships.&lt;br /&gt;
&lt;br /&gt;
== The distributional hypothesis ==&lt;br /&gt;
&lt;br /&gt;
Word embeddings are grounded in the &#039;&#039;&#039;distributional hypothesis&#039;&#039;&#039;, famously stated by J. R. Firth (1957): &amp;quot;You shall know a word by the company it keeps.&amp;quot; The idea is that words appearing in similar contexts tend to have similar meanings. For example, &amp;quot;dog&amp;quot; and &amp;quot;cat&amp;quot; frequently appear near words like &amp;quot;pet&amp;quot;, &amp;quot;fur&amp;quot;, and &amp;quot;veterinarian&amp;quot;, so they should have similar representations.&lt;br /&gt;
&lt;br /&gt;
Early approaches to exploiting distributional information include co-occurrence matrices, pointwise mutual information (PMI), and latent semantic analysis (LSA). Modern word embedding methods learn dense vectors directly using neural networks.&lt;br /&gt;
&lt;br /&gt;
== One-hot vs dense representations ==&lt;br /&gt;
&lt;br /&gt;
=== One-hot encoding ===&lt;br /&gt;
&lt;br /&gt;
In a vocabulary of &amp;lt;math&amp;gt;V&amp;lt;/math&amp;gt; words, a one-hot vector for the &amp;lt;math&amp;gt;i&amp;lt;/math&amp;gt;-th word is a &amp;lt;math&amp;gt;V&amp;lt;/math&amp;gt;-dimensional vector with a 1 in position &amp;lt;math&amp;gt;i&amp;lt;/math&amp;gt; and 0 elsewhere. This representation has two critical shortcomings:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Dimensionality&#039;&#039;&#039; — vectors are extremely high-dimensional (typically &amp;lt;math&amp;gt;V &amp;gt; 100{,}000&amp;lt;/math&amp;gt;).&lt;br /&gt;
* &#039;&#039;&#039;No similarity&#039;&#039;&#039; — every pair of one-hot vectors is equally distant: &amp;lt;math&amp;gt;\mathbf{e}_i^\top \mathbf{e}_j = 0&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;i \neq j&amp;lt;/math&amp;gt;. &amp;quot;Cat&amp;quot; is as far from &amp;quot;dog&amp;quot; as it is from &amp;quot;democracy.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
=== Dense embeddings ===&lt;br /&gt;
&lt;br /&gt;
A word embedding maps each word to a real-valued vector of &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt; dimensions (typically &amp;lt;math&amp;gt;d = 100&amp;lt;/math&amp;gt;–&amp;lt;math&amp;gt;300&amp;lt;/math&amp;gt;):&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\mathbf{w}_i \in \mathbb{R}^d, \quad d \ll V&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Similar words have high cosine similarity:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\text{sim}(\mathbf{w}_a, \mathbf{w}_b) = \frac{\mathbf{w}_a \cdot \mathbf{w}_b}{\|\mathbf{w}_a\|\;\|\mathbf{w}_b\|}&amp;lt;/math&amp;gt;&lt;br /&gt;
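As a sketch (the three 3-dimensional vectors are made-up toy values, far smaller than real embeddings), cosine similarity behaves as described:

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine of the angle between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical toy embeddings for illustration only
cat = np.array([0.9, 0.8, 0.1])
dog = np.array([0.8, 0.9, 0.2])
democracy = np.array([-0.5, 0.1, 0.9])
```

Unlike one-hot vectors, these dense vectors rank "dog" closer to "cat" than "democracy" is.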
&lt;br /&gt;
== Word2Vec ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Word2Vec&#039;&#039;&#039; (Mikolov et al., 2013) introduced two efficient architectures for learning word embeddings from large corpora.&lt;br /&gt;
&lt;br /&gt;
=== Continuous Bag of Words (CBOW) ===&lt;br /&gt;
&lt;br /&gt;
CBOW predicts a target word from its surrounding context words. Given a window of context words &amp;lt;math&amp;gt;\{w_{t-c}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+c}\}&amp;lt;/math&amp;gt;, the model maximises:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;P(w_t \mid w_{t-c}, \ldots, w_{t+c})&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The context word vectors are averaged and passed through a softmax layer. CBOW is faster to train and works well for frequent words.&lt;br /&gt;
&lt;br /&gt;
=== Skip-gram ===&lt;br /&gt;
&lt;br /&gt;
Skip-gram reverses the prediction: given a target word, it predicts the surrounding context words. For each pair &amp;lt;math&amp;gt;(w_t, w_{t+j})&amp;lt;/math&amp;gt; where &amp;lt;math&amp;gt;j \in [-c, c] \setminus \{0\}&amp;lt;/math&amp;gt;, the model maximises:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;P(w_{t+j} \mid w_t) = \frac{\exp(\mathbf{v}&#039;_{w_{t+j}}{}^\top \mathbf{v}_{w_t})}{\sum_{w=1}^{V}\exp(\mathbf{v}&#039;_w{}^\top \mathbf{v}_{w_t})}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;math&amp;gt;\mathbf{v}_w&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;\mathbf{v}&#039;_w&amp;lt;/math&amp;gt; are the input and output embedding vectors. Computing the full softmax over the vocabulary is expensive, so two approximations are commonly used:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Negative sampling&#039;&#039;&#039; — instead of computing the full softmax, the model contrasts the true context word against &amp;lt;math&amp;gt;k&amp;lt;/math&amp;gt; randomly sampled &amp;quot;negative&amp;quot; words.&lt;br /&gt;
* &#039;&#039;&#039;Hierarchical softmax&#039;&#039;&#039; — organises the vocabulary in a binary tree, reducing the softmax cost from &amp;lt;math&amp;gt;O(V)&amp;lt;/math&amp;gt; to &amp;lt;math&amp;gt;O(\log V)&amp;lt;/math&amp;gt;.&lt;br /&gt;
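A sketch of the negative-sampling objective for a single (center, context) pair, with hypothetical vector sizes; real implementations also draw negatives from a smoothed unigram distribution:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(v_center, v_context, v_negatives):
    """Skip-gram negative-sampling loss for one pair: pull the true
    context toward the center word, push k sampled negatives away."""
    pos = -np.log(sigmoid(v_context @ v_center))
    neg = -np.sum(np.log(sigmoid(-(v_negatives @ v_center))))
    return float(pos + neg)

rng = np.random.default_rng(0)
d, k = 8, 5
center = rng.normal(size=d)
negatives = rng.normal(size=(k, d))

# An aligned context vector should incur a lower loss than an opposed one
loss_aligned = neg_sampling_loss(center, center, negatives)
loss_opposed = neg_sampling_loss(center, -center, negatives)
```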
&lt;br /&gt;
Skip-gram performs well on rare words and captures subtle relationships. The famous analogy &amp;quot;king − man + woman ≈ queen&amp;quot; emerged from Skip-gram embeddings.&lt;br /&gt;
&lt;br /&gt;
== GloVe ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;GloVe&#039;&#039;&#039; (Global Vectors, Pennington et al., 2014) combines the strengths of global matrix factorisation and local context window methods. It constructs a word co-occurrence matrix &amp;lt;math&amp;gt;X&amp;lt;/math&amp;gt; from the corpus, where &amp;lt;math&amp;gt;X_{ij}&amp;lt;/math&amp;gt; counts how often word &amp;lt;math&amp;gt;j&amp;lt;/math&amp;gt; appears in the context of word &amp;lt;math&amp;gt;i&amp;lt;/math&amp;gt;, and then optimises:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;J = \sum_{i,j=1}^{V} f(X_{ij})\bigl(\mathbf{w}_i^\top \tilde{\mathbf{w}}_j + b_i + \tilde{b}_j - \log X_{ij}\bigr)^2&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;math&amp;gt;f&amp;lt;/math&amp;gt; is a weighting function that caps the influence of very frequent co-occurrences. GloVe embeddings often match or exceed Word2Vec quality, and the explicit use of global statistics can improve performance on analogy tasks.&lt;br /&gt;
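The weighting function and a single pair's contribution to the objective can be sketched as follows (the defaults `x_max = 100`, `alpha = 0.75` follow Pennington et al.; the helper names are ours):

```python
import numpy as np

def f_weight(x, x_max=100.0, alpha=0.75):
    """GloVe weighting: grows sublinearly with the co-occurrence count,
    then saturates at 1 to cap very frequent pairs."""
    x = np.asarray(x, dtype=float)
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_term(w_i, w_j_tilde, b_i, b_j_tilde, X_ij):
    """One (i, j) pair's contribution to the objective J."""
    residual = w_i @ w_j_tilde + b_i + b_j_tilde - np.log(X_ij)
    return float(f_weight(X_ij) * residual ** 2)
```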
&lt;br /&gt;
== fastText ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;fastText&#039;&#039;&#039; (Bojanowski et al., 2017) extends Word2Vec by representing each word as a bag of character n-grams. For example, the word &amp;quot;where&amp;quot; with &amp;lt;math&amp;gt;n = 3&amp;lt;/math&amp;gt; is represented by the n-grams {&amp;quot;&amp;lt;wh&amp;quot;, &amp;quot;whe&amp;quot;, &amp;quot;her&amp;quot;, &amp;quot;ere&amp;quot;, &amp;quot;re&amp;gt;&amp;quot;} plus the whole word &amp;quot;&amp;lt;where&amp;gt;&amp;quot;. The embedding for a word is the sum of its n-gram vectors.&lt;br /&gt;
&lt;br /&gt;
This approach has two key advantages:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Handling rare and unseen words&#039;&#039;&#039; — even words not in the training vocabulary can receive embeddings by summing their character n-gram vectors.&lt;br /&gt;
* &#039;&#039;&#039;Morphological awareness&#039;&#039;&#039; — words sharing substrings (e.g. &amp;quot;teach&amp;quot;, &amp;quot;teacher&amp;quot;, &amp;quot;teaching&amp;quot;) automatically share embedding components.&lt;br /&gt;
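The n-gram decomposition described above is easy to reproduce; the boundary markers `&lt;` and `&gt;` distinguish prefixes and suffixes (the function name is illustrative):

```python
def char_ngrams(word, n=3):
    """Character n-grams with boundary markers, plus the whole word."""
    w = "<" + word + ">"
    grams = {w[i:i + n] for i in range(len(w) - n + 1)}
    grams.add(w)   # fastText also keeps the full word as its own unit
    return grams

grams = char_ngrams("where")
```

Out-of-vocabulary words get an embedding by summing the vectors of whichever of their n-grams were seen in training.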
&lt;br /&gt;
== Evaluation of embeddings ==&lt;br /&gt;
&lt;br /&gt;
Word embeddings are evaluated through:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! Evaluation type !! Examples !! What it measures&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;Intrinsic: analogy&#039;&#039;&#039; || &amp;quot;king : queen :: man : ?&amp;quot; || Linear structure of the space&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;Intrinsic: similarity&#039;&#039;&#039; || Correlation with human similarity judgements (SimLex-999, WS-353) || Semantic quality&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;Extrinsic: downstream&#039;&#039;&#039; || Named entity recognition, sentiment analysis, parsing || Practical utility&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Intrinsic evaluations are fast but do not always predict downstream performance. Extrinsic evaluation on the target task is ultimately the most reliable measure.&lt;br /&gt;
&lt;br /&gt;
== Contextual embeddings ==&lt;br /&gt;
&lt;br /&gt;
Traditional word embeddings assign a single vector per word regardless of context — the word &amp;quot;bank&amp;quot; has the same embedding whether it refers to a river bank or a financial institution. &#039;&#039;&#039;Contextual embeddings&#039;&#039;&#039; address this limitation by producing different representations depending on the surrounding text.&lt;br /&gt;
&lt;br /&gt;
Notable contextual embedding models include:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;ELMo&#039;&#039;&#039; (Peters et al., 2018) — uses a bidirectional LSTM to generate context-dependent word representations.&lt;br /&gt;
* &#039;&#039;&#039;BERT&#039;&#039;&#039; (Devlin et al., 2019) — uses a Transformer encoder trained with masked language modelling.&lt;br /&gt;
* &#039;&#039;&#039;GPT&#039;&#039;&#039; series (Radford et al., 2018–) — uses a Transformer decoder trained autoregressively.&lt;br /&gt;
&lt;br /&gt;
These models have largely superseded static embeddings for most NLP tasks, though static embeddings remain useful for efficiency, interpretability, and low-resource settings.&lt;br /&gt;
&lt;br /&gt;
== See also ==&lt;br /&gt;
&lt;br /&gt;
* [[Neural Networks]]&lt;br /&gt;
* [[Recurrent Neural Networks]]&lt;br /&gt;
* [[Loss Functions]]&lt;br /&gt;
* [[Backpropagation]]&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
* Firth, J. R. (1957). &amp;quot;A synopsis of linguistic theory, 1930–1955&amp;quot;. In &#039;&#039;Studies in Linguistic Analysis&#039;&#039;.&lt;br /&gt;
* Mikolov, T. et al. (2013). &amp;quot;Efficient Estimation of Word Representations in Vector Space&amp;quot;. &#039;&#039;arXiv:1301.3781&#039;&#039;.&lt;br /&gt;
* Pennington, J., Socher, R. and Manning, C. D. (2014). &amp;quot;GloVe: Global Vectors for Word Representation&amp;quot;. &#039;&#039;EMNLP&#039;&#039;.&lt;br /&gt;
* Bojanowski, P. et al. (2017). &amp;quot;Enriching Word Vectors with Subword Information&amp;quot;. &#039;&#039;TACL&#039;&#039;, 5, 135–146.&lt;br /&gt;
* Peters, M. E. et al. (2018). &amp;quot;Deep contextualized word representations&amp;quot;. &#039;&#039;NAACL&#039;&#039;.&lt;br /&gt;
* Devlin, J. et al. (2019). &amp;quot;BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding&amp;quot;. &#039;&#039;NAACL&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
[[Category:NLP]]&lt;br /&gt;
[[Category:Intermediate]]&lt;/div&gt;</summary>
		<author><name>DeployBot</name></author>
	</entry>
	<entry>
		<id>https://marovi.ai/index.php?title=Transfer_Learning&amp;diff=2145</id>
		<title>Transfer Learning</title>
		<link rel="alternate" type="text/html" href="https://marovi.ai/index.php?title=Transfer_Learning&amp;diff=2145"/>
		<updated>2026-04-24T07:09:00Z</updated>

		<summary type="html">&lt;p&gt;DeployBot: [deploy-bot] Deploy from CI (8c92aeb)&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{LanguageBar | page = Transfer Learning}}&lt;br /&gt;
{{ArticleInfobox | topic_area = Machine Learning | difficulty = Intermediate | prerequisites = [[Neural Networks]]}}&lt;br /&gt;
{{ContentMeta | generated_by = claude-opus | model_used = claude-opus-4-6 | generated_date = 2026-03-13}}&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Transfer learning&#039;&#039;&#039; is a machine learning technique in which a model trained on one task is reused as the starting point for a model on a different but related task. By leveraging knowledge acquired from large-scale pretraining, transfer learning dramatically reduces the amount of labelled data, compute, and training time required for downstream applications.&lt;br /&gt;
&lt;br /&gt;
== Motivation ==&lt;br /&gt;
&lt;br /&gt;
Training deep neural networks from scratch typically requires large datasets and significant computational resources. In many practical domains — medical imaging, legal text analysis, low-resource languages — labelled data is scarce. Transfer learning addresses this mismatch: a model pretrained on a data-rich source task captures general features (edges, textures, syntactic patterns) that transfer well to a data-scarce target task.&lt;br /&gt;
&lt;br /&gt;
== Key Concepts ==&lt;br /&gt;
&lt;br /&gt;
=== Domain and Task ===&lt;br /&gt;
&lt;br /&gt;
Formally, a &#039;&#039;&#039;domain&#039;&#039;&#039; &amp;lt;math&amp;gt;\mathcal{D} = \{\mathcal{X}, P(X)\}&amp;lt;/math&amp;gt; consists of a feature space &amp;lt;math&amp;gt;\mathcal{X}&amp;lt;/math&amp;gt; and a marginal distribution &amp;lt;math&amp;gt;P(X)&amp;lt;/math&amp;gt;. A &#039;&#039;&#039;task&#039;&#039;&#039; &amp;lt;math&amp;gt;\mathcal{T} = \{\mathcal{Y}, f(\cdot)\}&amp;lt;/math&amp;gt; consists of a label space &amp;lt;math&amp;gt;\mathcal{Y}&amp;lt;/math&amp;gt; and a predictive function &amp;lt;math&amp;gt;f&amp;lt;/math&amp;gt;. Transfer learning applies when the source and target differ in domain, task, or both.&lt;br /&gt;
&lt;br /&gt;
=== Domain Adaptation ===&lt;br /&gt;
&lt;br /&gt;
When the source and target share the same task but differ in data distribution (&amp;lt;math&amp;gt;P_s(X) \neq P_t(X)&amp;lt;/math&amp;gt;), the problem is called &#039;&#039;&#039;domain adaptation&#039;&#039;&#039;. Techniques include:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Instance reweighting&#039;&#039;&#039; — adjusting sample weights so the source distribution approximates the target.&lt;br /&gt;
* &#039;&#039;&#039;Feature alignment&#039;&#039;&#039; — learning domain-invariant representations (e.g., via adversarial training or maximum mean discrepancy).&lt;br /&gt;
* &#039;&#039;&#039;Self-training&#039;&#039;&#039; — using model predictions on unlabelled target data as pseudo-labels.&lt;br /&gt;
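Instance reweighting can be illustrated with a toy one-dimensional example in which the source and target marginals are known Gaussians (all distributions and sample sizes here are illustrative assumptions, not a general-purpose implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at x."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Toy setup: source and target share the task but differ in P(X).
mu_s, sigma_s = 0.0, 1.0   # source marginal
mu_t, sigma_t = 1.0, 1.0   # target marginal (shifted)

x_source = rng.normal(mu_s, sigma_s, size=10_000)

# Importance weights w(x) = p_t(x) / p_s(x): weighted source averages
# then approximate expectations under the target distribution.
w = gaussian_pdf(x_source, mu_t, sigma_t) / gaussian_pdf(x_source, mu_s, sigma_s)

# The weighted mean of the source sample approaches the target mean (1.0).
reweighted_mean = float(np.average(x_source, weights=w))
print(round(reweighted_mean, 2))
```

In practice the density ratio is unknown and is itself estimated, for example with a classifier trained to distinguish source from target samples.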
&lt;br /&gt;
== Fine-Tuning vs Feature Extraction ==&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! Strategy !! Description !! When to use&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;Feature extraction&#039;&#039;&#039; || Freeze all pretrained layers; train only a new output head || Very small target dataset; source and target are closely related&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;Fine-tuning (full)&#039;&#039;&#039; || Unfreeze all layers and train end-to-end with a small learning rate || Moderate target dataset; source and target differ meaningfully&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;Gradual unfreezing&#039;&#039;&#039; || Progressively unfreeze layers from top to bottom over training || Balances stability of lower features with adaptation of higher ones&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
A common heuristic is to use a learning rate 10–100× smaller for the pretrained layers than for the new classification head, which helps prevent catastrophic forgetting of the pretrained representations.&lt;br /&gt;
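The feature-extraction strategy from the table can be sketched without any deep-learning framework. Here a fixed random projection stands in for a pretrained backbone (a deliberate simplification); only the new head is trained, and the backbone weights are never updated:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-in for a pretrained backbone: a fixed random
# projection whose weights are treated as frozen pretrained features.
W_backbone = rng.normal(size=(2, 16))           # frozen
W_head = np.zeros(16)                           # new task head, trained

def features(x):
    return np.tanh(x @ W_backbone)              # frozen feature extractor

# Toy binary task: label = 1 if x1 + x2 > 0.
X = rng.normal(size=(512, 2))
y = (X.sum(axis=1) > 0).astype(float)

lr_head = 0.5                                   # head trains with a large rate
for _ in range(200):
    p = 1.0 / (1.0 + np.exp(-(features(X) @ W_head)))   # sigmoid predictions
    grad = features(X).T @ (p - y) / len(y)     # logistic-loss gradient
    W_head -= lr_head * grad                    # backbone stays untouched

acc = float(((features(X) @ W_head > 0) == (y == 1)).mean())
print(acc > 0.8)
```

Full fine-tuning would instead also update `W_backbone`, typically with a much smaller learning rate than the head.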
&lt;br /&gt;
== Pretrained Models ==&lt;br /&gt;
&lt;br /&gt;
=== Computer Vision ===&lt;br /&gt;
&lt;br /&gt;
ImageNet-pretrained convolutional networks (ResNet, EfficientNet, ViT) serve as standard backbones. Lower layers learn universal features such as edges and textures, while higher layers learn task-specific patterns. Fine-tuning an ImageNet model on a medical imaging dataset with only a few thousand images routinely outperforms training from scratch.&lt;br /&gt;
&lt;br /&gt;
=== Natural Language Processing ===&lt;br /&gt;
&lt;br /&gt;
Language model pretraining transformed NLP. Key milestones include:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Word2Vec / GloVe&#039;&#039;&#039; — static word embeddings pretrained on large corpora.&lt;br /&gt;
* &#039;&#039;&#039;ELMo&#039;&#039;&#039; — contextualised embeddings from bidirectional LSTMs.&lt;br /&gt;
* &#039;&#039;&#039;BERT&#039;&#039;&#039; (Devlin et al., 2019) — bidirectional Transformer pretrained with masked language modelling; fine-tuned for classification, QA, NER, and more.&lt;br /&gt;
* &#039;&#039;&#039;GPT series&#039;&#039;&#039; — autoregressive Transformers demonstrating that scale and pretraining enable few-shot and zero-shot transfer.&lt;br /&gt;
&lt;br /&gt;
== When to Use Transfer Learning ==&lt;br /&gt;
&lt;br /&gt;
Transfer learning is most beneficial when:&lt;br /&gt;
&lt;br /&gt;
# The target dataset is small relative to the model&#039;s capacity.&lt;br /&gt;
# The source and target domains share structural similarities (e.g., both involve natural images or natural language).&lt;br /&gt;
# Computational resources for full pretraining are unavailable.&lt;br /&gt;
# Rapid prototyping is needed before committing to large-scale data collection.&lt;br /&gt;
&lt;br /&gt;
It may hurt performance (&#039;&#039;&#039;negative transfer&#039;&#039;&#039;) when the source and target domains are fundamentally dissimilar — for instance, transferring from natural images to spectrograms without appropriate adaptation.&lt;br /&gt;
&lt;br /&gt;
== Practical Tips ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Data augmentation&#039;&#039;&#039; complements transfer learning by artificially expanding the effective size of the target dataset.&lt;br /&gt;
* &#039;&#039;&#039;Learning rate warmup&#039;&#039;&#039; helps stabilise early training when fine-tuning large pretrained models.&lt;br /&gt;
* &#039;&#039;&#039;Early stopping&#039;&#039;&#039; on a validation set prevents overfitting during fine-tuning, especially with small datasets.&lt;br /&gt;
* &#039;&#039;&#039;Layer-wise learning rate decay&#039;&#039;&#039; assigns smaller rates to earlier (more general) layers and larger rates to later (more task-specific) layers.&lt;br /&gt;
* &#039;&#039;&#039;Intermediate task transfer&#039;&#039;&#039; — fine-tuning on a related intermediate task before the final target (e.g., NLI before sentiment analysis) can further improve results.&lt;br /&gt;
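Layer-wise learning rate decay from the list above can be sketched as a small helper; the base rate and decay factor are illustrative defaults, not recommended values:

```python
def layerwise_lrs(n_layers, base_lr=1e-3, decay=0.5):
    """Per-layer learning rates: the head (last layer) gets base_lr,
    and each earlier layer gets `decay` times the rate of the layer above it."""
    return [base_lr * decay ** (n_layers - 1 - i) for i in range(n_layers)]

# Four layers: earliest (most general) layer gets the smallest rate.
print(layerwise_lrs(4))   # [0.000125, 0.00025, 0.0005, 0.001]
```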
&lt;br /&gt;
== Evaluation ==&lt;br /&gt;
&lt;br /&gt;
Transfer learning effectiveness is typically measured by comparing:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\Delta_{\mathrm{transfer}} = \mathrm{Acc}_{\mathrm{transfer}} - \mathrm{Acc}_{\mathrm{scratch}}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
A positive &amp;lt;math&amp;gt;\Delta_{\mathrm{transfer}}&amp;lt;/math&amp;gt; indicates successful knowledge transfer. Practitioners also track convergence speed, as transferred models often reach target performance in a fraction of the epochs.&lt;br /&gt;
&lt;br /&gt;
== See also ==&lt;br /&gt;
&lt;br /&gt;
* [[Neural Networks]]&lt;br /&gt;
* [[Transformer]]&lt;br /&gt;
* [[Self-supervised learning]]&lt;br /&gt;
* [[Domain adaptation]]&lt;br /&gt;
* [[Fine-tuning]]&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
* Pan, S. J. and Yang, Q. (2010). &amp;quot;A Survey on Transfer Learning&amp;quot;. &#039;&#039;IEEE Transactions on Knowledge and Data Engineering&#039;&#039;.&lt;br /&gt;
* Yosinski, J. et al. (2014). &amp;quot;How transferable are features in deep neural networks?&amp;quot;. &#039;&#039;NeurIPS&#039;&#039;.&lt;br /&gt;
* Devlin, J. et al. (2019). &amp;quot;BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding&amp;quot;. &#039;&#039;NAACL&#039;&#039;.&lt;br /&gt;
* Howard, J. and Ruder, S. (2018). &amp;quot;Universal Language Model Fine-tuning for Text Classification&amp;quot;. &#039;&#039;ACL&#039;&#039;.&lt;br /&gt;
* Zhuang, F. et al. (2021). &amp;quot;A Comprehensive Survey on Transfer Learning&amp;quot;. &#039;&#039;Proceedings of the IEEE&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
[[Category:Machine Learning]]&lt;br /&gt;
[[Category:Intermediate]]&lt;/div&gt;</summary>
		<author><name>DeployBot</name></author>
	</entry>
	<entry>
		<id>https://marovi.ai/index.php?title=Stochastic_Gradient_Descent&amp;diff=2144</id>
		<title>Stochastic Gradient Descent</title>
		<link rel="alternate" type="text/html" href="https://marovi.ai/index.php?title=Stochastic_Gradient_Descent&amp;diff=2144"/>
		<updated>2026-04-24T07:08:59Z</updated>

		<summary type="html">&lt;p&gt;DeployBot: [deploy-bot] Deploy from CI (8c92aeb)&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{TopicNav|field=Machine Learning|subfield=Optimization algorithms}}&lt;br /&gt;
{{Article&lt;br /&gt;
|title=Stochastic Gradient Descent&lt;br /&gt;
|field=Machine Learning&lt;br /&gt;
|topics=Optimization, Neural Networks, Gradient Methods&lt;br /&gt;
|related_papers=&lt;br /&gt;
|languages=es&lt;br /&gt;
}}&lt;br /&gt;
{{Summary&lt;br /&gt;
|text=Stochastic gradient descent (SGD) is the core optimization algorithm behind modern machine learning. Instead of computing gradients over an entire dataset, it estimates them from small random samples, making training feasible on large-scale data. Nearly all deep learning models are trained using SGD or one of its variants (Adam, RMSProp, etc.).&lt;br /&gt;
|key_points=Estimates gradients from random mini-batches instead of the full dataset; Learning rate schedule is critical for convergence; Variants like Adam and AdamW add adaptive per-parameter rates; Converges to global minimum for convex problems under Robbins–Monro conditions&lt;br /&gt;
}}&lt;br /&gt;
&#039;&#039;&#039;Stochastic gradient descent&#039;&#039;&#039; (often abbreviated &#039;&#039;&#039;{{Term|SGD|SGD}}&#039;&#039;&#039;) is an iterative optimisation algorithm used to minimise an {{Term|objective function|objective function}} written as a sum of differentiable sub-functions. It is the workhorse behind modern machine-learning training, powering everything from logistic regression to deep neural networks.&lt;br /&gt;
&lt;br /&gt;
== Motivation ==&lt;br /&gt;
&lt;br /&gt;
In classical {{Term|gradient descent}}, the full gradient of the {{Term|loss function}} is computed over the entire training set before each parameter update. When the dataset is large this becomes prohibitively expensive. SGD addresses the problem by estimating the gradient from a single randomly chosen sample (or a small &#039;&#039;&#039;{{Term|mini-batch}}&#039;&#039;&#039;) at each step, trading a noisier estimate for dramatically lower per-iteration cost.&lt;br /&gt;
&lt;br /&gt;
== Algorithm ==&lt;br /&gt;
&lt;br /&gt;
Given a parameterised {{Term|loss function}}&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;L(\theta) = \frac{1}{N}\sum_{i=1}^{N} \ell(\theta;\, x_i,\, y_i)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
the SGD update rule at step &amp;lt;math&amp;gt;t&amp;lt;/math&amp;gt; is:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\theta_{t+1} = \theta_t - \eta_t \,\nabla_\theta \ell(\theta_t;\, x_{i_t},\, y_{i_t})&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;math&amp;gt;\eta_t&amp;lt;/math&amp;gt; is the &#039;&#039;&#039;{{Term|learning rate}}&#039;&#039;&#039; (step size) and &amp;lt;math&amp;gt;i_t&amp;lt;/math&amp;gt; is a randomly selected index.&lt;br /&gt;
&lt;br /&gt;
=== Mini-batch variant ===&lt;br /&gt;
&lt;br /&gt;
In practice a &#039;&#039;&#039;{{Term|mini-batch}}&#039;&#039;&#039; of &amp;lt;math&amp;gt;B&amp;lt;/math&amp;gt; samples is used:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\theta_{t+1} = \theta_t - \frac{\eta_t}{B}\sum_{j=1}^{B} \nabla_\theta \ell(\theta_t;\, x_{i_j},\, y_{i_j})&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Common batch sizes range from 32 to 512. Larger batches reduce gradient variance but increase memory usage.&lt;br /&gt;
&lt;br /&gt;
=== Pseudocode ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
initialise parameters θ&lt;br /&gt;
for epoch = 1, 2, … do&lt;br /&gt;
    shuffle training set&lt;br /&gt;
    for each mini-batch B ⊂ training set do&lt;br /&gt;
        g ← (1/|B|) Σ_{i∈B} ∇ℓ(θ; xᵢ, yᵢ)   # estimate gradient&lt;br /&gt;
        θ ← θ − η · g                     # update parameters&lt;br /&gt;
    end for&lt;br /&gt;
end for&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
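The pseudocode above translates directly into NumPy. This sketch fits a linear regression with mini-batch SGD; the synthetic data, learning rate, and batch size are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linear-regression data: y = X @ w_true + noise.
w_true = np.array([2.0, -3.0])
X = rng.normal(size=(1000, 2))
y = X @ w_true + 0.1 * rng.normal(size=1000)

theta = np.zeros(2)
eta, B = 0.1, 32                                # learning rate, batch size

for epoch in range(20):
    perm = rng.permutation(len(X))              # shuffle each epoch
    for start in range(0, len(X), B):
        idx = perm[start:start + B]
        Xb, yb = X[idx], y[idx]
        # Gradient of the mean squared error over the mini-batch.
        g = 2.0 * Xb.T @ (Xb @ theta - yb) / len(idx)
        theta -= eta * g                        # SGD update

print(np.round(theta, 1))                       # recovers approximately w_true
```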
&lt;br /&gt;
== Learning rate schedules ==&lt;br /&gt;
&lt;br /&gt;
The {{Term|learning rate}} &amp;lt;math&amp;gt;\eta_t&amp;lt;/math&amp;gt; strongly influences {{Term|convergence}}. Common strategies include:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Constant&#039;&#039;&#039; — simple but may overshoot or stall.&lt;br /&gt;
* &#039;&#039;&#039;Step decay&#039;&#039;&#039; — multiply &amp;lt;math&amp;gt;\eta&amp;lt;/math&amp;gt; by a factor (e.g. 0.1) every &amp;lt;math&amp;gt;k&amp;lt;/math&amp;gt; epochs.&lt;br /&gt;
* &#039;&#039;&#039;Exponential decay&#039;&#039;&#039; — &amp;lt;math&amp;gt;\eta_t = \eta_0 \, e^{-\lambda t}&amp;lt;/math&amp;gt;.&lt;br /&gt;
* &#039;&#039;&#039;Cosine annealing&#039;&#039;&#039; — smoothly reduces the rate following a cosine curve, often with warm restarts.&lt;br /&gt;
* &#039;&#039;&#039;Linear warm-up&#039;&#039;&#039; — ramp up from a small &amp;lt;math&amp;gt;\eta&amp;lt;/math&amp;gt; during the first few iterations to stabilise early training.&lt;br /&gt;
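The schedules above can be written as small functions of the step index; the default hyperparameters are placeholders:

```python
import math

def step_decay(eta0, t, factor=0.1, every=30):
    """Multiply the rate by `factor` every `every` epochs."""
    return eta0 * factor ** (t // every)

def exponential_decay(eta0, t, lam=0.01):
    """eta_t = eta_0 * exp(-lambda * t)."""
    return eta0 * math.exp(-lam * t)

def cosine_annealing(eta0, t, T_max=100, eta_min=0.0):
    """Smoothly anneal from eta0 down to eta_min over T_max steps."""
    return eta_min + 0.5 * (eta0 - eta_min) * (1 + math.cos(math.pi * t / T_max))

def linear_warmup(eta0, t, warmup=10):
    """Ramp linearly up to eta0 over the first `warmup` steps."""
    return eta0 * min(1.0, (t + 1) / warmup)
```

Warm-up and a decay schedule are commonly combined: ramp up for the first few epochs, then anneal.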
&lt;br /&gt;
== Convergence properties ==&lt;br /&gt;
&lt;br /&gt;
For {{Term|convex optimization|convex}} objectives with Lipschitz-continuous gradients, SGD with a decaying {{Term|learning rate}} satisfying&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\sum_{t=1}^{\infty} \eta_t = \infty, \qquad \sum_{t=1}^{\infty} \eta_t^2 &amp;lt; \infty&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
converges almost surely to the global minimum (Robbins–Monro conditions). For non-convex problems — the typical regime for deep learning — SGD converges to a stationary point, and empirical evidence shows it often finds good local minima.&lt;br /&gt;
&lt;br /&gt;
== Popular variants ==&lt;br /&gt;
&lt;br /&gt;
Several extensions reduce the variance of the gradient estimate or adapt the step size per parameter:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! Method !! Key idea !! Reference&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;{{Term|momentum|Momentum}}&#039;&#039;&#039; || Accumulates an exponentially decaying moving average of past gradients || Polyak, 1964&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;Nesterov accelerated gradient&#039;&#039;&#039; || Evaluates the gradient at a &amp;quot;look-ahead&amp;quot; position || Nesterov, 1983&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;Adagrad&#039;&#039;&#039; || Per-parameter rates that shrink for frequently updated features || Duchi et al., 2011&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;RMSProp&#039;&#039;&#039; || Fixes Adagrad&#039;s diminishing rates using a moving average of squared gradients || Hinton (lecture notes), 2012&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;{{Term|Adam}}&#039;&#039;&#039; || Combines {{Term|momentum}} with RMSProp-style adaptive rates || Kingma &amp;amp; Ba, 2015&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;AdamW&#039;&#039;&#039; || Decouples weight decay from the adaptive gradient step || Loshchilov &amp;amp; Hutter, 2019&lt;br /&gt;
|}&lt;br /&gt;
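The momentum variant from the table can be sketched on a deterministic quadratic (the test function and hyperparameters are illustrative; real SGD would use noisy mini-batch gradients):

```python
import numpy as np

def momentum_sgd(grad_fn, theta0, eta=0.1, beta=0.9, steps=100):
    """Gradient descent with Polyak momentum: v <- beta*v - eta*grad; theta <- theta + v."""
    theta = np.asarray(theta0, dtype=float).copy()
    v = np.zeros_like(theta)
    for _ in range(steps):
        v = beta * v - eta * grad_fn(theta)     # accumulate decayed past gradients
        theta = theta + v
    return theta

# Ill-conditioned quadratic L(theta) = 0.5 * (10*theta_1^2 + theta_2^2).
grad = lambda th: np.array([10.0 * th[0], 1.0 * th[1]])
result = momentum_sgd(grad, [1.0, 1.0])
print(np.round(result, 2))                      # close to the minimum at [0, 0]
```

The moving average damps oscillation along the steep direction while accelerating progress along the shallow one.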
&lt;br /&gt;
== Practical considerations ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Data shuffling&#039;&#039;&#039; — Re-shuffle the dataset each epoch to avoid cyclic patterns.&lt;br /&gt;
* &#039;&#039;&#039;{{Term|gradient clipping|Gradient clipping}}&#039;&#039;&#039; — Cap the gradient norm to prevent exploding updates, especially in recurrent networks.&lt;br /&gt;
* &#039;&#039;&#039;{{Term|batch normalization|Batch normalisation}}&#039;&#039;&#039; — Normalising layer inputs reduces sensitivity to the {{Term|learning rate}}.&lt;br /&gt;
* &#039;&#039;&#039;Mixed-precision training&#039;&#039;&#039; — Using half-precision floats accelerates SGD on modern GPUs with minimal accuracy loss.&lt;br /&gt;
&lt;br /&gt;
== Applications ==&lt;br /&gt;
&lt;br /&gt;
SGD and its variants are used across virtually all areas of machine learning:&lt;br /&gt;
&lt;br /&gt;
* Training deep neural networks (computer vision, NLP, speech recognition)&lt;br /&gt;
* Large-scale linear models (logistic regression, SVMs via SGD)&lt;br /&gt;
* Reinforcement learning policy optimisation&lt;br /&gt;
* Recommendation systems and collaborative filtering&lt;br /&gt;
* Online learning settings where data arrives in a stream&lt;br /&gt;
&lt;br /&gt;
== See also ==&lt;br /&gt;
&lt;br /&gt;
* [[Gradient descent]]&lt;br /&gt;
* [[Backpropagation]]&lt;br /&gt;
* [[Adam (optimiser)]]&lt;br /&gt;
* [[Learning rate]]&lt;br /&gt;
* [[Convex optimisation]]&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
* Robbins, H. and Monro, S. (1951). &amp;quot;A Stochastic Approximation Method&amp;quot;. &#039;&#039;Annals of Mathematical Statistics&#039;&#039;.&lt;br /&gt;
* Bottou, L. (2010). &amp;quot;Large-Scale Machine Learning with Stochastic Gradient Descent&amp;quot;. &#039;&#039;COMPSTAT&#039;&#039;.&lt;br /&gt;
* Kingma, D. P. and Ba, J. (2015). &amp;quot;Adam: A Method for Stochastic Optimization&amp;quot;. &#039;&#039;ICLR&#039;&#039;.&lt;br /&gt;
* Ruder, S. (2016). &amp;quot;An overview of gradient descent optimization algorithms&amp;quot;. &#039;&#039;arXiv:1609.04747&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
{{RelatedContent&lt;br /&gt;
|articles=Gradient descent, Backpropagation, Adam (optimiser), Learning rate, Convex optimisation&lt;br /&gt;
|papers=&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
[[Category:Machine learning]]&lt;br /&gt;
[[Category:Optimisation algorithms]]&lt;br /&gt;
[[Category:Gradient methods]]&lt;/div&gt;</summary>
		<author><name>DeployBot</name></author>
	</entry>
	<entry>
		<id>https://marovi.ai/index.php?title=Softmax_Function&amp;diff=2143</id>
		<title>Softmax Function</title>
		<link rel="alternate" type="text/html" href="https://marovi.ai/index.php?title=Softmax_Function&amp;diff=2143"/>
		<updated>2026-04-24T07:08:59Z</updated>

		<summary type="html">&lt;p&gt;DeployBot: [deploy-bot] Deploy from CI (8c92aeb)&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{LanguageBar | page = Softmax Function}}&lt;br /&gt;
{{ArticleInfobox | topic_area = Machine Learning | difficulty = Introductory | prerequisites = }}&lt;br /&gt;
{{ContentMeta | generated_by = claude-opus | model_used = claude-opus-4-6 | generated_date = 2026-03-13}}&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;softmax function&#039;&#039;&#039; (also called the &#039;&#039;&#039;normalized exponential function&#039;&#039;&#039;) is a mathematical function that converts a vector of real numbers (&#039;&#039;&#039;logits&#039;&#039;&#039;) into a probability distribution. It is the standard output activation for multi-class classification in neural networks and plays a central role in models ranging from logistic regression to large language models.&lt;br /&gt;
&lt;br /&gt;
== Definition ==&lt;br /&gt;
&lt;br /&gt;
Given a vector of logits &amp;lt;math&amp;gt;\mathbf{z} = (z_1, z_2, \dots, z_K)&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;K&amp;lt;/math&amp;gt; classes, the softmax function produces:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\sigma(\mathbf{z})_k = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}, \qquad k = 1, \dots, K&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output satisfies two properties that make it a valid probability distribution:&lt;br /&gt;
&lt;br /&gt;
# &amp;lt;math&amp;gt;\sigma(\mathbf{z})_k &amp;gt; 0&amp;lt;/math&amp;gt; for all &amp;lt;math&amp;gt;k&amp;lt;/math&amp;gt; (since the exponential is always positive).&lt;br /&gt;
# &amp;lt;math&amp;gt;\sum_{k=1}^{K} \sigma(\mathbf{z})_k = 1&amp;lt;/math&amp;gt; (by construction).&lt;br /&gt;
&lt;br /&gt;
== Intuition ==&lt;br /&gt;
&lt;br /&gt;
The softmax function amplifies differences between logits. A logit that is larger than its peers receives a disproportionately large share of the probability mass because the exponential function grows super-linearly. For example:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! Logits !! Softmax output&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;math&amp;gt;(2.0,\; 1.0,\; 0.1)&amp;lt;/math&amp;gt; || &amp;lt;math&amp;gt;(0.659,\; 0.242,\; 0.099)&amp;lt;/math&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;math&amp;gt;(5.0,\; 1.0,\; 0.1)&amp;lt;/math&amp;gt; || &amp;lt;math&amp;gt;(0.975,\; 0.018,\; 0.007)&amp;lt;/math&amp;gt;&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
As the gap between the largest logit and the others increases, the output approaches a one-hot vector. This &amp;quot;winner-take-most&amp;quot; behavior makes softmax well-suited for classification where a single class should dominate.&lt;br /&gt;
&lt;br /&gt;
== Temperature Parameter ==&lt;br /&gt;
&lt;br /&gt;
A &#039;&#039;&#039;temperature&#039;&#039;&#039; parameter &amp;lt;math&amp;gt;T &amp;gt; 0&amp;lt;/math&amp;gt; controls the sharpness of the distribution:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\sigma(\mathbf{z}; T)_k = \frac{e^{z_k / T}}{\sum_{j=1}^{K} e^{z_j / T}}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;math&amp;gt;T \to 0&amp;lt;/math&amp;gt;: The distribution collapses to a one-hot vector selecting the argmax — equivalent to a hard decision.&lt;br /&gt;
* &amp;lt;math&amp;gt;T = 1&amp;lt;/math&amp;gt;: Standard softmax.&lt;br /&gt;
* &amp;lt;math&amp;gt;T \to \infty&amp;lt;/math&amp;gt;: The distribution approaches uniform — all classes become equally likely.&lt;br /&gt;
&lt;br /&gt;
Temperature scaling is widely used in knowledge distillation (Hinton et al., 2015), where a &amp;quot;soft&amp;quot; distribution from a teacher model provides richer training signal than hard labels. It is also used to control randomness in text generation from language models.&lt;br /&gt;
&lt;br /&gt;
== Numerical Stability ==&lt;br /&gt;
&lt;br /&gt;
A naive implementation of softmax can overflow when logits are large (e.g., &amp;lt;math&amp;gt;e^{1000}&amp;lt;/math&amp;gt; is infinite in floating point). The standard fix subtracts the maximum logit:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\sigma(\mathbf{z})_k = \frac{e^{z_k - m}}{\sum_{j=1}^{K} e^{z_j - m}}, \qquad m = \max_j z_j&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This is mathematically equivalent (the constant cancels) but ensures the largest exponent is &amp;lt;math&amp;gt;e^0 = 1&amp;lt;/math&amp;gt;, preventing overflow. All major deep learning frameworks implement this stabilized version automatically.&lt;br /&gt;
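The stabilized formulation, together with the temperature parameter from the earlier section, fits in a few lines of NumPy (a minimal sketch, not a framework implementation):

```python
import numpy as np

def softmax(z, T=1.0):
    """Numerically stable softmax with optional temperature T."""
    z = np.asarray(z, dtype=float) / T
    z = z - z.max()                 # shift so the largest exponent is e^0 = 1
    e = np.exp(z)
    return e / e.sum()

p = softmax([2.0, 1.0, 0.1])
print(np.round(p, 3))               # matches the intuition table: [0.659 0.242 0.099]
print(softmax([1000.0, 0.0]))       # no overflow, despite e^1000 being infinite naively
```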
&lt;br /&gt;
== Relationship to Sigmoid ==&lt;br /&gt;
&lt;br /&gt;
For the special case of &amp;lt;math&amp;gt;K = 2&amp;lt;/math&amp;gt; classes, the softmax function reduces to the &#039;&#039;&#039;sigmoid&#039;&#039;&#039; (logistic) function. If we define &amp;lt;math&amp;gt;z = z_1 - z_2&amp;lt;/math&amp;gt;, then:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\sigma(\mathbf{z})_1 = \frac{e^{z_1}}{e^{z_1} + e^{z_2}} = \frac{1}{1 + e^{-z}} = \sigma_{\mathrm{sigmoid}}(z)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This is why binary classifiers typically use a single output neuron with a sigmoid activation rather than two neurons with softmax — they are mathematically equivalent.&lt;br /&gt;
&lt;br /&gt;
== Gradient ==&lt;br /&gt;
&lt;br /&gt;
The Jacobian of the softmax function with respect to its input is:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\frac{\partial \sigma_k}{\partial z_j} = \sigma_k (\delta_{kj} - \sigma_j)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;math&amp;gt;\delta_{kj}&amp;lt;/math&amp;gt; is the Kronecker delta. When combined with [[Cross-Entropy Loss]], the gradient simplifies to &amp;lt;math&amp;gt;\hat{y}_k - y_k&amp;lt;/math&amp;gt;, which is computationally efficient and numerically stable.&lt;br /&gt;
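The Jacobian and the simplified cross-entropy gradient can be verified numerically with a finite-difference check (the logits chosen here are arbitrary):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))       # stabilized softmax
    return e / e.sum()

def softmax_jacobian(z):
    """J[k, j] = sigma_k * (delta_kj - sigma_j)."""
    s = softmax(z)
    return np.diag(s) - np.outer(s, s)

# Cross-entropy gradient check: d/dz [-log sigma_y(z)] = sigma(z) - one_hot(y).
z = np.array([2.0, 1.0, 0.1])
y_true = 0                                      # true class index
analytic = softmax(z) - np.eye(3)[y_true]

loss = lambda zz: -np.log(softmax(zz)[y_true])
eps = 1e-6
numeric = np.array([(loss(z + eps * e) - loss(z - eps * e)) / (2 * eps)
                    for e in np.eye(3)])        # central differences
print(np.allclose(analytic, numeric, atol=1e-5))   # True
```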
&lt;br /&gt;
== Use in Classification ==&lt;br /&gt;
&lt;br /&gt;
In a typical classification pipeline:&lt;br /&gt;
&lt;br /&gt;
# A neural network produces raw logits &amp;lt;math&amp;gt;\mathbf{z}&amp;lt;/math&amp;gt; from its final linear layer.&lt;br /&gt;
# Softmax converts logits to probabilities: &amp;lt;math&amp;gt;\hat{\mathbf{y}} = \sigma(\mathbf{z})&amp;lt;/math&amp;gt;.&lt;br /&gt;
# The predicted class is &amp;lt;math&amp;gt;\hat{c} = \arg\max_k \hat{y}_k&amp;lt;/math&amp;gt;.&lt;br /&gt;
# Training uses [[Cross-Entropy Loss]] applied to the predicted distribution and the true labels.&lt;br /&gt;
&lt;br /&gt;
In practice, the softmax and cross-entropy are computed jointly for numerical stability (the &#039;&#039;&#039;log-softmax&#039;&#039;&#039; formulation), and the argmax at inference time can be applied directly to the logits without computing softmax at all.&lt;br /&gt;
&lt;br /&gt;
== Beyond Classification ==&lt;br /&gt;
&lt;br /&gt;
Softmax appears in many contexts beyond the output layer:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Attention mechanisms&#039;&#039;&#039;: Softmax normalizes alignment scores into attention weights in the [[Attention Mechanisms|Transformer]] architecture.&lt;br /&gt;
* &#039;&#039;&#039;Reinforcement learning&#039;&#039;&#039;: Softmax over action-value estimates produces a stochastic policy (Boltzmann exploration).&lt;br /&gt;
* &#039;&#039;&#039;Mixture models&#039;&#039;&#039;: Softmax parameterizes mixing coefficients in mixture-of-experts architectures.&lt;br /&gt;
&lt;br /&gt;
== See also ==&lt;br /&gt;
&lt;br /&gt;
* [[Cross-Entropy Loss]]&lt;br /&gt;
* [[Loss Functions]]&lt;br /&gt;
* [[Logistic regression]]&lt;br /&gt;
* [[Attention Mechanisms]]&lt;br /&gt;
* [[Neural Networks]]&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
* Bishop, C. M. (2006). &#039;&#039;Pattern Recognition and Machine Learning&#039;&#039;. Springer, Section 4.3.4.&lt;br /&gt;
* Goodfellow, I., Bengio, Y. and Courville, A. (2016). &#039;&#039;Deep Learning&#039;&#039;. MIT Press, Section 6.2.2.3.&lt;br /&gt;
* Hinton, G., Vinyals, O. and Dean, J. (2015). &amp;quot;Distilling the Knowledge in a Neural Network&amp;quot;. &#039;&#039;arXiv:1503.02531&#039;&#039;.&lt;br /&gt;
* Bridle, J. S. (1990). &amp;quot;Probabilistic Interpretation of Feedforward Classification Network Outputs&amp;quot;. &#039;&#039;Neurocomputing&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
[[Category:Machine Learning]]&lt;br /&gt;
[[Category:Introductory]]&lt;/div&gt;</summary>
		<author><name>DeployBot</name></author>
	</entry>
	<entry>
		<id>https://marovi.ai/index.php?title=Recurrent_Neural_Networks&amp;diff=2142</id>
		<title>Recurrent Neural Networks</title>
		<link rel="alternate" type="text/html" href="https://marovi.ai/index.php?title=Recurrent_Neural_Networks&amp;diff=2142"/>
		<updated>2026-04-24T07:08:59Z</updated>

		<summary type="html">&lt;p&gt;DeployBot: [deploy-bot] Deploy from CI (8c92aeb)&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{LanguageBar | page = Recurrent Neural Networks}}&lt;br /&gt;
{{ArticleInfobox | topic_area = Deep Learning | difficulty = Intermediate | prerequisites = [[Neural Networks]], [[Backpropagation]]}}&lt;br /&gt;
{{ContentMeta | generated_by = claude-opus | model_used = claude-opus-4-6 | generated_date = 2026-03-13}}&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Recurrent neural networks&#039;&#039;&#039; (&#039;&#039;&#039;RNNs&#039;&#039;&#039;) are a class of [[Neural Networks|neural networks]] designed to process &#039;&#039;&#039;sequential data&#039;&#039;&#039; — data where the order of elements matters. Unlike feedforward networks, RNNs contain recurrent connections that allow information to persist across time steps, giving them a form of memory.&lt;br /&gt;
&lt;br /&gt;
== Sequence modelling ==&lt;br /&gt;
&lt;br /&gt;
Many real-world problems involve sequences: text is a sequence of words, speech is a sequence of audio frames, stock prices form a time series, and DNA is a sequence of nucleotides. Standard feedforward networks require fixed-size inputs and treat each input independently, making them unsuitable for sequences of variable length where context matters.&lt;br /&gt;
&lt;br /&gt;
RNNs address this by processing inputs one element at a time while maintaining a &#039;&#039;&#039;hidden state&#039;&#039;&#039; that summarises the information seen so far.&lt;br /&gt;
&lt;br /&gt;
== Vanilla RNN ==&lt;br /&gt;
&lt;br /&gt;
At each time step &amp;lt;math&amp;gt;t&amp;lt;/math&amp;gt;, a vanilla RNN computes:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\mathbf{h}_t = \tanh(\mathbf{W}_{hh}\,\mathbf{h}_{t-1} + \mathbf{W}_{xh}\,\mathbf{x}_t + \mathbf{b}_h)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\mathbf{y}_t = \mathbf{W}_{hy}\,\mathbf{h}_t + \mathbf{b}_y&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;math&amp;gt;\mathbf{x}_t&amp;lt;/math&amp;gt; is the input at time &amp;lt;math&amp;gt;t&amp;lt;/math&amp;gt;, &amp;lt;math&amp;gt;\mathbf{h}_t&amp;lt;/math&amp;gt; is the hidden state, &amp;lt;math&amp;gt;\mathbf{y}_t&amp;lt;/math&amp;gt; is the output, and &amp;lt;math&amp;gt;\mathbf{W}_{hh}, \mathbf{W}_{xh}, \mathbf{W}_{hy}&amp;lt;/math&amp;gt; are weight matrices shared across all time steps. The initial hidden state &amp;lt;math&amp;gt;\mathbf{h}_0&amp;lt;/math&amp;gt; is typically set to the zero vector.&lt;br /&gt;
&lt;br /&gt;
The key insight is that the same parameters are applied at every time step — &#039;&#039;&#039;weight sharing in time&#039;&#039;&#039; — allowing the network to generalise across different positions in the sequence.&lt;br /&gt;
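The recurrence above can be sketched directly in NumPy; the dimensions and random weight scales are illustrative, and no training is performed:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h, d_out = 3, 5, 2                      # toy dimensions

# Weight matrices shared across every time step.
W_hh = rng.normal(scale=0.1, size=(d_h, d_h))
W_xh = rng.normal(scale=0.1, size=(d_h, d_in))
W_hy = rng.normal(scale=0.1, size=(d_out, d_h))
b_h, b_y = np.zeros(d_h), np.zeros(d_out)

def rnn_forward(xs):
    """Run a vanilla RNN over a sequence xs of shape (T, d_in)."""
    h = np.zeros(d_h)                           # h_0 = 0
    ys = []
    for x in xs:
        h = np.tanh(W_hh @ h + W_xh @ x + b_h)  # hidden-state update
        ys.append(W_hy @ h + b_y)               # per-step output
    return np.array(ys), h

ys, h_T = rnn_forward(rng.normal(size=(7, d_in)))
print(ys.shape)                                  # (7, 2): one output per time step
```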
&lt;br /&gt;
== Backpropagation through time (BPTT) ==&lt;br /&gt;
&lt;br /&gt;
Training an RNN requires computing gradients of the loss with respect to the shared weights. &#039;&#039;&#039;Backpropagation through time&#039;&#039;&#039; (BPTT) &amp;quot;unrolls&amp;quot; the RNN across time steps, producing a deep feedforward network with shared weights, and then applies standard [[Backpropagation|backpropagation]].&lt;br /&gt;
&lt;br /&gt;
For a sequence of length &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;, the gradient of the loss with respect to &amp;lt;math&amp;gt;\mathbf{W}_{hh}&amp;lt;/math&amp;gt; involves a product of Jacobians:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\frac{\partial L}{\partial \mathbf{W}_{hh}} = \sum_{t=1}^{T}\frac{\partial L_t}{\partial \mathbf{W}_{hh}} = \sum_{t=1}^{T}\sum_{k=1}^{t}\frac{\partial L_t}{\partial \mathbf{h}_t}\left(\prod_{j=k+1}^{t}\frac{\partial \mathbf{h}_j}{\partial \mathbf{h}_{j-1}}\right)\frac{\partial \mathbf{h}_k}{\partial \mathbf{W}_{hh}}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The product of Jacobians &amp;lt;math&amp;gt;\prod \partial \mathbf{h}_j / \partial \mathbf{h}_{j-1}&amp;lt;/math&amp;gt; is the source of the vanishing and exploding gradient problems.&lt;br /&gt;
&lt;br /&gt;
== The vanishing gradient problem ==&lt;br /&gt;
&lt;br /&gt;
When the spectral radius of the recurrent Jacobian is less than 1, the gradient signal decays exponentially through time — the &#039;&#039;&#039;vanishing gradient problem&#039;&#039;&#039;. This makes it extremely difficult for vanilla RNNs to learn dependencies that span more than 10–20 time steps.&lt;br /&gt;
&lt;br /&gt;
Conversely, when the spectral radius exceeds 1, gradients can grow exponentially — the &#039;&#039;&#039;exploding gradient problem&#039;&#039;&#039;. Exploding gradients are typically handled by &#039;&#039;&#039;gradient clipping&#039;&#039;&#039; (capping the gradient norm at a threshold), but vanishing gradients require architectural solutions.&lt;br /&gt;
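Gradient clipping by global norm, the standard remedy for exploding gradients, can be sketched as follows (a minimal version of what frameworks provide built in):

```python
import numpy as np

def clip_grad_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their global L2 norm is at most max_norm."""
    total = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    scale = max_norm / (total + 1e-12)          # small epsilon avoids division by zero
    if scale < 1.0:                             # only shrink, never enlarge
        grads = [g * scale for g in grads]
    return grads, total

clipped, norm = clip_grad_norm([np.array([3.0, 4.0])], max_norm=1.0)
print(norm)                                      # 5.0 before clipping
print(np.linalg.norm(clipped[0]))                # approximately 1.0 after clipping
```

Clipping preserves the gradient's direction and only caps its magnitude.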
&lt;br /&gt;
== Long Short-Term Memory (LSTM) ==&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;LSTM&#039;&#039;&#039; (Hochreiter and Schmidhuber, 1997) introduces a &#039;&#039;&#039;cell state&#039;&#039;&#039; &amp;lt;math&amp;gt;\mathbf{c}_t&amp;lt;/math&amp;gt; that flows through time with minimal interference, and three &#039;&#039;&#039;gates&#039;&#039;&#039; that control the flow of information:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\mathbf{f}_t = \sigma(\mathbf{W}_f[\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_f)&amp;lt;/math&amp;gt;  (&#039;&#039;&#039;forget gate&#039;&#039;&#039;)&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\mathbf{i}_t = \sigma(\mathbf{W}_i[\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_i)&amp;lt;/math&amp;gt;  (&#039;&#039;&#039;input gate&#039;&#039;&#039;)&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\tilde{\mathbf{c}}_t = \tanh(\mathbf{W}_c[\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_c)&amp;lt;/math&amp;gt;  (&#039;&#039;&#039;candidate cell state&#039;&#039;&#039;)&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\mathbf{c}_t = \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{c}}_t&amp;lt;/math&amp;gt;  (&#039;&#039;&#039;cell state update&#039;&#039;&#039;)&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\mathbf{o}_t = \sigma(\mathbf{W}_o[\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_o)&amp;lt;/math&amp;gt;  (&#039;&#039;&#039;output gate&#039;&#039;&#039;)&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\mathbf{h}_t = \mathbf{o}_t \odot \tanh(\mathbf{c}_t)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The cell state acts as a conveyor belt: the forget gate decides what old information to discard, the input gate decides what new information to store, and the output gate controls what is exposed to the next layer. Because the cell state is updated through addition (not multiplication), gradients flow more easily across long sequences.&lt;br /&gt;
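The gate equations above map line by line onto a single LSTM step in NumPy (toy sizes, random untrained weights, biases initialised to zero):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h = 4, 6                                # toy sizes
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

# One weight matrix per gate, acting on the concatenation [h_{t-1}, x_t].
W_f, W_i, W_c, W_o = (rng.normal(scale=0.1, size=(d_h, d_h + d_in)) for _ in range(4))
b_f, b_i, b_c, b_o = (np.zeros(d_h) for _ in range(4))

def lstm_step(h_prev, c_prev, x):
    hx = np.concatenate([h_prev, x])
    f = sigmoid(W_f @ hx + b_f)                 # forget gate
    i = sigmoid(W_i @ hx + b_i)                 # input gate
    c_tilde = np.tanh(W_c @ hx + b_c)           # candidate cell state
    c = f * c_prev + i * c_tilde                # additive cell-state update
    o = sigmoid(W_o @ hx + b_o)                 # output gate
    h = o * np.tanh(c)
    return h, c

h, c = np.zeros(d_h), np.zeros(d_h)
for x in rng.normal(size=(10, d_in)):           # unroll over a toy sequence
    h, c = lstm_step(h, c, x)
print(h.shape, c.shape)                          # (6,) (6,)
```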
&lt;br /&gt;
== Gated Recurrent Unit (GRU) ==&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;GRU&#039;&#039;&#039; (Cho et al., 2014) simplifies the LSTM by merging the cell state and hidden state and using only two gates:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\mathbf{z}_t = \sigma(\mathbf{W}_z[\mathbf{h}_{t-1}, \mathbf{x}_t])&amp;lt;/math&amp;gt;  (&#039;&#039;&#039;update gate&#039;&#039;&#039;)&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\mathbf{r}_t = \sigma(\mathbf{W}_r[\mathbf{h}_{t-1}, \mathbf{x}_t])&amp;lt;/math&amp;gt;  (&#039;&#039;&#039;reset gate&#039;&#039;&#039;)&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\tilde{\mathbf{h}}_t = \tanh(\mathbf{W}[\mathbf{r}_t \odot \mathbf{h}_{t-1}, \mathbf{x}_t])&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\mathbf{h}_t = (1 - \mathbf{z}_t) \odot \mathbf{h}_{t-1} + \mathbf{z}_t \odot \tilde{\mathbf{h}}_t&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The GRU has fewer parameters than the LSTM and often achieves comparable performance. In practice, the choice between LSTM and GRU is typically made empirically.&lt;br /&gt;
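The same style of sketch applies to the GRU. Bias terms are omitted to match the equations above; the weight shapes are again an assumption of this illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h_prev, W_z, W_r, W_h):
    """One GRU time step. Each weight matrix has shape
    (hidden, hidden + input)."""
    hx = np.concatenate([h_prev, x])
    z = sigmoid(W_z @ hx)                                     # update gate
    r = sigmoid(W_r @ hx)                                     # reset gate
    h_tilde = np.tanh(W_h @ np.concatenate([r * h_prev, x]))  # candidate
    return (1.0 - z) * h_prev + z * h_tilde  # interpolate old and new
```

The final line makes the role of the update gate explicit: it interpolates between keeping the old hidden state and replacing it with the candidate.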
&lt;br /&gt;
== Bidirectional RNNs ==&lt;br /&gt;
&lt;br /&gt;
A &#039;&#039;&#039;bidirectional RNN&#039;&#039;&#039; processes the sequence in both directions — forward (left to right) and backward (right to left) — and concatenates the hidden states:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\mathbf{h}_t = [\overrightarrow{\mathbf{h}}_t;\; \overleftarrow{\mathbf{h}}_t]&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This allows the model to use both past and future context at every time step, which is beneficial for tasks like named entity recognition and machine translation where the meaning of a word depends on its surrounding context.&lt;br /&gt;
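The two-pass structure can be sketched with a simple tanh RNN in each direction; the key detail is that the backward states must be re-aligned to the original time order before concatenation. Function names here are illustrative.

```python
import numpy as np

def rnn_direction(xs, W_h, W_x, b):
    """Run a simple tanh RNN over a sequence; return all hidden states."""
    h = np.zeros(W_h.shape[0])
    hs = []
    for x in xs:
        h = np.tanh(W_h @ h + W_x @ x + b)
        hs.append(h)
    return hs

def birnn(xs, fwd_params, bwd_params):
    """Concatenate forward and backward hidden states at each step."""
    hs_f = rnn_direction(xs, *fwd_params)
    hs_b = rnn_direction(xs[::-1], *bwd_params)[::-1]  # realign in time
    return [np.concatenate([hf, hb]) for hf, hb in zip(hs_f, hs_b)]
```

Each output vector has twice the hidden dimension, carrying left context in its first half and right context in its second.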
&lt;br /&gt;
== Applications ==&lt;br /&gt;
&lt;br /&gt;
RNNs and their gated variants have been applied to a wide range of sequence tasks:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Language modelling&#039;&#039;&#039; — predicting the next word in a sequence.&lt;br /&gt;
* &#039;&#039;&#039;Machine translation&#039;&#039;&#039; — encoder-decoder architectures for sequence-to-sequence translation (Sutskever et al., 2014).&lt;br /&gt;
* &#039;&#039;&#039;Speech recognition&#039;&#039;&#039; — transcribing audio to text (often combined with CTC loss).&lt;br /&gt;
* &#039;&#039;&#039;Sentiment analysis&#039;&#039;&#039; — classifying the sentiment of text.&lt;br /&gt;
* &#039;&#039;&#039;Time-series forecasting&#039;&#039;&#039; — predicting future values of financial or sensor data.&lt;br /&gt;
* &#039;&#039;&#039;Music generation&#039;&#039;&#039; — generating sequences of notes.&lt;br /&gt;
&lt;br /&gt;
Note that for many NLP tasks, &#039;&#039;&#039;Transformers&#039;&#039;&#039; (Vaswani et al., 2017) have largely superseded RNNs due to their ability to process sequences in parallel and capture long-range dependencies more effectively through self-attention.&lt;br /&gt;
&lt;br /&gt;
== See also ==&lt;br /&gt;
&lt;br /&gt;
* [[Neural Networks]]&lt;br /&gt;
* [[Backpropagation]]&lt;br /&gt;
* [[Convolutional Neural Networks]]&lt;br /&gt;
* [[Word Embeddings]]&lt;br /&gt;
* [[Overfitting and Regularization]]&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
* Elman, J. L. (1990). &amp;quot;Finding Structure in Time&amp;quot;. &#039;&#039;Cognitive Science&#039;&#039;, 14(2), 179–211.&lt;br /&gt;
* Hochreiter, S. and Schmidhuber, J. (1997). &amp;quot;Long Short-Term Memory&amp;quot;. &#039;&#039;Neural Computation&#039;&#039;, 9(8), 1735–1780.&lt;br /&gt;
* Cho, K. et al. (2014). &amp;quot;Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation&amp;quot;. &#039;&#039;EMNLP&#039;&#039;.&lt;br /&gt;
* Sutskever, I., Vinyals, O. and Le, Q. V. (2014). &amp;quot;Sequence to Sequence Learning with Neural Networks&amp;quot;. &#039;&#039;NeurIPS&#039;&#039;.&lt;br /&gt;
* Goodfellow, I., Bengio, Y. and Courville, A. (2016). &#039;&#039;Deep Learning&#039;&#039;, Chapter 10. MIT Press.&lt;br /&gt;
&lt;br /&gt;
[[Category:Deep Learning]]&lt;br /&gt;
[[Category:Intermediate]]&lt;br /&gt;
[[Category:Neural Networks]]&lt;/div&gt;</summary>
		<author><name>DeployBot</name></author>
	</entry>
	<entry>
		<id>https://marovi.ai/index.php?title=Overfitting_and_Regularization&amp;diff=2141</id>
		<title>Overfitting and Regularization</title>
		<link rel="alternate" type="text/html" href="https://marovi.ai/index.php?title=Overfitting_and_Regularization&amp;diff=2141"/>
		<updated>2026-04-24T07:08:59Z</updated>

		<summary type="html">&lt;p&gt;DeployBot: [deploy-bot] Deploy from CI (8c92aeb)&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{LanguageBar | page = Overfitting and Regularization}}&lt;br /&gt;
{{ArticleInfobox | topic_area = Machine Learning | difficulty = Intermediate | prerequisites = [[Loss Functions]], [[Neural Networks]]}}&lt;br /&gt;
{{ContentMeta | generated_by = claude-opus | model_used = claude-opus-4-6 | generated_date = 2026-03-13}}&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Overfitting&#039;&#039;&#039; occurs when a machine-learning model learns the training data too well — capturing noise and idiosyncrasies rather than the underlying pattern — and consequently performs poorly on unseen data. &#039;&#039;&#039;Regularization&#039;&#039;&#039; is the family of techniques used to prevent overfitting and improve a model&#039;s ability to generalise.&lt;br /&gt;
&lt;br /&gt;
== The bias–variance tradeoff ==&lt;br /&gt;
&lt;br /&gt;
Prediction error on unseen data can be decomposed into three components:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\text{Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible noise}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Bias&#039;&#039;&#039; measures how far the model&#039;s average prediction is from the true value. High bias indicates the model is too simple to capture the data&#039;s structure (&#039;&#039;&#039;underfitting&#039;&#039;&#039;).&lt;br /&gt;
* &#039;&#039;&#039;Variance&#039;&#039;&#039; measures how much predictions fluctuate across different training sets. High variance indicates the model is too sensitive to the particular training data (&#039;&#039;&#039;overfitting&#039;&#039;&#039;).&lt;br /&gt;
&lt;br /&gt;
The goal is to find the sweet spot that minimises total error. A model with too few parameters underfits (high bias); a model with too many parameters overfits (high variance). Regularization techniques tilt the balance by constraining model complexity, accepting slightly higher bias in exchange for substantially lower variance.&lt;br /&gt;
&lt;br /&gt;
== Detecting overfitting ==&lt;br /&gt;
&lt;br /&gt;
The clearest diagnostic is to compare training and validation performance:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Training loss decreasing, validation loss also decreasing&#039;&#039;&#039; — the model is still learning; continue training.&lt;br /&gt;
* &#039;&#039;&#039;Training loss decreasing, validation loss increasing&#039;&#039;&#039; — the model is overfitting; apply regularization or stop training.&lt;br /&gt;
* &#039;&#039;&#039;Training loss high, validation loss high&#039;&#039;&#039; — the model is underfitting; increase capacity or train longer.&lt;br /&gt;
&lt;br /&gt;
Plotting these &#039;&#039;&#039;learning curves&#039;&#039;&#039; over training iterations is essential practice. A large gap between training accuracy and validation accuracy is the hallmark of overfitting.&lt;br /&gt;
&lt;br /&gt;
== L2 regularization (weight decay) ==&lt;br /&gt;
&lt;br /&gt;
L2 regularization adds a penalty proportional to the squared magnitude of the weights:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;J(\theta) = L(\theta) + \frac{\lambda}{2}\|\theta\|_2^2 = L(\theta) + \frac{\lambda}{2}\sum_j \theta_j^2&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The gradient of the regularization term is &amp;lt;math&amp;gt;\lambda \theta&amp;lt;/math&amp;gt;, so a gradient step with learning rate &amp;lt;math&amp;gt;\eta&amp;lt;/math&amp;gt; multiplies each weight by &amp;lt;math&amp;gt;(1 - \eta\lambda)&amp;lt;/math&amp;gt; before applying the loss gradient — hence the name &#039;&#039;&#039;weight decay&#039;&#039;&#039;. The hyperparameter &amp;lt;math&amp;gt;\lambda&amp;lt;/math&amp;gt; controls the regularization strength.&lt;br /&gt;
&lt;br /&gt;
L2 regularization is equivalent to placing a Gaussian prior on the weights from a Bayesian perspective. It encourages small, distributed weights and discourages any single weight from becoming excessively large.&lt;br /&gt;
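The shrinkage interpretation can be checked numerically with a one-line update; `sgd_step_with_decay` is an illustrative name for this sketch, not a library function.

```python
import numpy as np

def sgd_step_with_decay(theta, grad_loss, lr, lam):
    """One gradient step on J(theta) = L(theta) + (lam/2) * ||theta||^2.

    The penalty contributes lam * theta to the gradient, so the update
    equals shrinking theta by the factor (1 - lr * lam) and then taking
    the ordinary loss-gradient step."""
    return theta - lr * (grad_loss + lam * theta)
```

Setting the loss gradient to zero isolates the decay: the weights are simply multiplied by a factor slightly below one at every step.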
&lt;br /&gt;
== L1 regularization ==&lt;br /&gt;
&lt;br /&gt;
L1 regularization penalises the sum of absolute values:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;J(\theta) = L(\theta) + \lambda \|\theta\|_1 = L(\theta) + \lambda \sum_j |\theta_j|&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Unlike L2, the L1 penalty drives many weights exactly to zero, producing &#039;&#039;&#039;sparse&#039;&#039;&#039; models. This makes L1 regularization useful for feature selection. The LASSO (Least Absolute Shrinkage and Selection Operator) is the classic example of L1-regularized linear regression.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! Property !! L1 !! L2&lt;br /&gt;
|-&lt;br /&gt;
| Penalty || &amp;lt;math&amp;gt;\lambda\sum|\theta_j|&amp;lt;/math&amp;gt; || &amp;lt;math&amp;gt;\frac{\lambda}{2}\sum\theta_j^2&amp;lt;/math&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| Effect on weights || Drives many to exactly zero || Shrinks all toward zero&lt;br /&gt;
|-&lt;br /&gt;
| Sparsity || Yes || No&lt;br /&gt;
|-&lt;br /&gt;
| Bayesian interpretation || Laplace prior || Gaussian prior&lt;br /&gt;
|-&lt;br /&gt;
| Use case || Feature selection, interpretability || General regularization&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== Dropout ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Dropout&#039;&#039;&#039; (Srivastava et al., 2014) is a regularization technique specific to neural networks. During training, each neuron is randomly &amp;quot;dropped&amp;quot; (set to zero) with probability &amp;lt;math&amp;gt;p&amp;lt;/math&amp;gt; at each forward pass. This prevents neurons from co-adapting and forces the network to learn redundant representations.&lt;br /&gt;
&lt;br /&gt;
At test time, all neurons are active but their outputs are scaled by &amp;lt;math&amp;gt;(1 - p)&amp;lt;/math&amp;gt; to compensate for the larger number of active units (or equivalently, outputs are scaled by &amp;lt;math&amp;gt;1/(1-p)&amp;lt;/math&amp;gt; during training — &#039;&#039;&#039;inverted dropout&#039;&#039;&#039;).&lt;br /&gt;
&lt;br /&gt;
Dropout can be interpreted as an approximate ensemble method: each training step uses a different subnetwork, and the final model approximates the average prediction of exponentially many subnetworks.&lt;br /&gt;
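The inverted-dropout variant described above fits in a few lines of NumPy; scaling by &lt;code&gt;1/(1-p)&lt;/code&gt; at training time keeps the expected activation unchanged, so test-time code needs no adjustment.

```python
import numpy as np

def inverted_dropout(a, p, rng, train=True):
    """Zero each unit with probability p and rescale the survivors by
    1/(1-p) during training; at test time return activations unchanged."""
    if not train or p == 0.0:
        return a
    mask = (rng.random(a.shape) >= p) / (1.0 - p)
    return a * mask
```

On average the output matches the input, but each forward pass sees a different random subnetwork.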
&lt;br /&gt;
== Early stopping ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Early stopping&#039;&#039;&#039; monitors the validation loss during training and halts optimisation when the validation loss stops improving. This is one of the simplest and most effective regularization strategies.&lt;br /&gt;
&lt;br /&gt;
In practice, a &#039;&#039;&#039;patience&#039;&#039;&#039; parameter specifies how many epochs to wait after the last improvement before stopping. The model weights are saved at the point of lowest validation loss and restored at the end.&lt;br /&gt;
&lt;br /&gt;
Early stopping acts as an implicit form of regularization: it limits the effective number of training steps, preventing the model from fully memorising the training data.&lt;br /&gt;
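The patience logic reduces to a small loop over the validation-loss history; this sketch returns both the stopping epoch and the epoch whose weights should be restored.

```python
def early_stopping(val_losses, patience):
    """Given a per-epoch validation-loss history, return (stop_epoch,
    best_epoch): training halts after `patience` epochs with no new
    minimum, and the best epoch's weights are the ones to restore."""
    best_loss, best_epoch, wait = float("inf"), 0, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch  # never triggered
```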
&lt;br /&gt;
== Data augmentation ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Data augmentation&#039;&#039;&#039; increases the effective size and diversity of the training set by applying label-preserving transformations. For image data, common augmentations include:&lt;br /&gt;
&lt;br /&gt;
* Random horizontal/vertical flips&lt;br /&gt;
* Random crops and resizing&lt;br /&gt;
* Colour jittering (brightness, contrast, saturation)&lt;br /&gt;
* Rotation and affine transformations&lt;br /&gt;
* Mixup (linear interpolation of pairs of images and their labels)&lt;br /&gt;
* Cutout (masking random patches)&lt;br /&gt;
&lt;br /&gt;
For text data, augmentations include synonym replacement, back-translation, and paraphrasing. Data augmentation reduces overfitting by exposing the model to more varied inputs without collecting additional data.&lt;br /&gt;
&lt;br /&gt;
== Other regularization techniques ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Batch normalization&#039;&#039;&#039; — normalising layer inputs stabilises training (originally motivated as reducing internal covariate shift, though that explanation is debated) and has a mild regularizing effect.&lt;br /&gt;
* &#039;&#039;&#039;Label smoothing&#039;&#039;&#039; — replaces one-hot targets with a mixture, e.g. &amp;lt;math&amp;gt;y_{\text{smooth}} = (1 - \epsilon)\, y + \epsilon / C&amp;lt;/math&amp;gt;, preventing overconfidence.&lt;br /&gt;
* &#039;&#039;&#039;Noise injection&#039;&#039;&#039; — adding Gaussian noise to inputs, weights, or gradients during training.&lt;br /&gt;
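The label-smoothing formula above is a one-liner; this sketch applies it row-wise to one-hot targets.

```python
import numpy as np

def smooth_labels(y_onehot, eps):
    """y_smooth = (1 - eps) * y + eps / C, applied per row of a
    one-hot target matrix with C classes."""
    C = y_onehot.shape[-1]
    return (1.0 - eps) * y_onehot + eps / C
```

For example, with three classes and &lt;code&gt;eps = 0.3&lt;/code&gt;, the target &lt;code&gt;[0, 1, 0]&lt;/code&gt; becomes &lt;code&gt;[0.1, 0.8, 0.1]&lt;/code&gt;, so the model is never pushed toward probability exactly 1.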
&lt;br /&gt;
== Practical guidelines ==&lt;br /&gt;
&lt;br /&gt;
# Start with a model large enough to overfit the training data — this confirms the model has sufficient capacity.&lt;br /&gt;
# Add regularization incrementally (dropout, weight decay, augmentation) and monitor validation performance.&lt;br /&gt;
# Use early stopping as a safety net.&lt;br /&gt;
# Prefer more training data over stronger regularization whenever possible — regularization can partially compensate for limited data, but it cannot replace it.&lt;br /&gt;
# Tune the regularization strength (&amp;lt;math&amp;gt;\lambda&amp;lt;/math&amp;gt;, dropout rate) using a validation set, never the test set.&lt;br /&gt;
&lt;br /&gt;
== See also ==&lt;br /&gt;
&lt;br /&gt;
* [[Loss Functions]]&lt;br /&gt;
* [[Neural Networks]]&lt;br /&gt;
* [[Gradient Descent]]&lt;br /&gt;
* [[Convolutional Neural Networks]]&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
* Srivastava, N. et al. (2014). &amp;quot;Dropout: A Simple Way to Prevent Neural Networks from Overfitting&amp;quot;. &#039;&#039;JMLR&#039;&#039;, 15, 1929–1958.&lt;br /&gt;
* Tibshirani, R. (1996). &amp;quot;Regression Shrinkage and Selection via the Lasso&amp;quot;. &#039;&#039;JRSS Series B&#039;&#039;, 58(1), 267–288.&lt;br /&gt;
* Goodfellow, I., Bengio, Y. and Courville, A. (2016). &#039;&#039;Deep Learning&#039;&#039;, Chapter 7. MIT Press.&lt;br /&gt;
* Zhang, C. et al. (2017). &amp;quot;Understanding deep learning requires rethinking generalization&amp;quot;. &#039;&#039;ICLR&#039;&#039;.&lt;br /&gt;
* Shorten, C. and Khoshgoftaar, T. M. (2019). &amp;quot;A survey on Image Data Augmentation for Deep Learning&amp;quot;. &#039;&#039;Journal of Big Data&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
[[Category:Machine Learning]]&lt;br /&gt;
[[Category:Intermediate]]&lt;/div&gt;</summary>
		<author><name>DeployBot</name></author>
	</entry>
	<entry>
		<id>https://marovi.ai/index.php?title=Neural_Networks&amp;diff=2140</id>
		<title>Neural Networks</title>
		<link rel="alternate" type="text/html" href="https://marovi.ai/index.php?title=Neural_Networks&amp;diff=2140"/>
		<updated>2026-04-24T07:08:59Z</updated>

		<summary type="html">&lt;p&gt;DeployBot: [deploy-bot] Deploy from CI (8c92aeb)&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{LanguageBar | page = Neural Networks}}&lt;br /&gt;
{{ArticleInfobox | topic_area = Deep Learning | difficulty = Introductory | prerequisites = }}&lt;br /&gt;
{{ContentMeta | generated_by = claude-opus | model_used = claude-opus-4-6 | generated_date = 2026-03-13}}&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Neural networks&#039;&#039;&#039; (also called &#039;&#039;&#039;artificial neural networks&#039;&#039;&#039;, or ANNs) are computational models inspired by the structure of biological nervous systems. They consist of interconnected layers of simple processing units called &#039;&#039;&#039;neurons&#039;&#039;&#039; (or nodes) and form the basis of modern deep learning.&lt;br /&gt;
&lt;br /&gt;
== Biological inspiration ==&lt;br /&gt;
&lt;br /&gt;
The biological neuron receives electrical signals through its &#039;&#039;&#039;dendrites&#039;&#039;&#039;, integrates them in the &#039;&#039;&#039;cell body&#039;&#039;&#039;, and, if the combined signal exceeds a threshold, fires an output signal along its &#039;&#039;&#039;axon&#039;&#039;&#039; to downstream neurons. Artificial neural networks abstract this process: each artificial neuron computes a weighted sum of its inputs, adds a bias term, and passes the result through a nonlinear &#039;&#039;&#039;activation function&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
While the analogy to biology motivated early research, modern neural networks are best understood as flexible parameterised function approximators rather than faithful brain simulations.&lt;br /&gt;
&lt;br /&gt;
== The perceptron ==&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;perceptron&#039;&#039;&#039;, introduced by Frank Rosenblatt in 1958, is the simplest neural network. It computes:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;y = \sigma\!\left(\sum_{i=1}^{n} w_i x_i + b\right) = \sigma(\mathbf{w}^\top \mathbf{x} + b)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;math&amp;gt;\mathbf{x}&amp;lt;/math&amp;gt; is the input vector, &amp;lt;math&amp;gt;\mathbf{w}&amp;lt;/math&amp;gt; are learnable weights, &amp;lt;math&amp;gt;b&amp;lt;/math&amp;gt; is a bias, and &amp;lt;math&amp;gt;\sigma&amp;lt;/math&amp;gt; is a step function that outputs 1 if the argument is positive and 0 otherwise. The perceptron can learn any linearly separable function but famously cannot represent the XOR function — a limitation that stalled neural-network research for over a decade.&lt;br /&gt;
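The perceptron's forward computation is a weighted sum followed by a threshold. As an illustration of what "linearly separable" means in practice, the hand-picked weights below realise the AND function; no choice of weights could do the same for XOR.

```python
import numpy as np

def perceptron_predict(w, b, x):
    """Step-activation perceptron: output 1 if w.x + b > 0, else 0."""
    return 1 if w @ x + b > 0 else 0

# Hand-picked weights realising AND: fires only when both inputs are 1.
w, b = np.array([1.0, 1.0]), -1.5
```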
&lt;br /&gt;
== Feedforward networks ==&lt;br /&gt;
&lt;br /&gt;
A &#039;&#039;&#039;feedforward neural network&#039;&#039;&#039; (also called a &#039;&#039;&#039;multilayer perceptron&#039;&#039;&#039;, or MLP) stacks multiple layers of neurons. Information flows in one direction — from the &#039;&#039;&#039;input layer&#039;&#039;&#039; through one or more &#039;&#039;&#039;hidden layers&#039;&#039;&#039; to the &#039;&#039;&#039;output layer&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
For a network with one hidden layer, the computation is:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\mathbf{h} = g(\mathbf{W}_1 \mathbf{x} + \mathbf{b}_1)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\mathbf{y} = f(\mathbf{W}_2 \mathbf{h} + \mathbf{b}_2)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;math&amp;gt;g&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;f&amp;lt;/math&amp;gt; are activation functions, &amp;lt;math&amp;gt;\mathbf{W}_1, \mathbf{W}_2&amp;lt;/math&amp;gt; are weight matrices, and &amp;lt;math&amp;gt;\mathbf{b}_1, \mathbf{b}_2&amp;lt;/math&amp;gt; are bias vectors. The hidden layer enables the network to learn nonlinear relationships that a single perceptron cannot capture.&lt;br /&gt;
&lt;br /&gt;
Networks with many hidden layers are called &#039;&#039;&#039;deep&#039;&#039;&#039; neural networks, and training them is the subject of &#039;&#039;&#039;deep learning&#039;&#039;&#039;.&lt;br /&gt;
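The one-hidden-layer computation above translates line for line into NumPy. In this sketch the hidden activation &lt;code&gt;g&lt;/code&gt; is taken to be ReLU and the output activation &lt;code&gt;f&lt;/code&gt; is the identity (a regression head) — both are illustrative choices, not requirements.

```python
import numpy as np

def mlp_forward(x, W1, b1, W2, b2):
    """One-hidden-layer MLP: h = ReLU(W1 x + b1), y = W2 h + b2."""
    h = np.maximum(0.0, W1 @ x + b1)  # hidden layer with ReLU activation
    return W2 @ h + b2                # linear output layer
```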
&lt;br /&gt;
== Activation functions ==&lt;br /&gt;
&lt;br /&gt;
The activation function introduces nonlinearity; without it, a multi-layer network would collapse to a single linear transformation. Common choices include:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! Function !! Formula !! Range !! Notes&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;Sigmoid&#039;&#039;&#039; || &amp;lt;math&amp;gt;\sigma(z) = \frac{1}{1+e^{-z}}&amp;lt;/math&amp;gt; || (0, 1) || Historically popular; suffers from vanishing gradients&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;Tanh&#039;&#039;&#039; || &amp;lt;math&amp;gt;\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}&amp;lt;/math&amp;gt; || (−1, 1) || Zero-centred; still saturates for large inputs&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;ReLU&#039;&#039;&#039; || &amp;lt;math&amp;gt;\max(0, z)&amp;lt;/math&amp;gt; || [0, ∞) || Default choice in modern networks; can cause &amp;quot;dead neurons&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;Leaky ReLU&#039;&#039;&#039; || &amp;lt;math&amp;gt;\max(\alpha z, z)&amp;lt;/math&amp;gt; for small &amp;lt;math&amp;gt;\alpha &amp;gt; 0&amp;lt;/math&amp;gt; || (−∞, ∞) || Addresses the dead-neuron problem&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;Softmax&#039;&#039;&#039; || &amp;lt;math&amp;gt;\frac{e^{z_i}}{\sum_j e^{z_j}}&amp;lt;/math&amp;gt; || (0, 1) || Used in output layer for multi-class classification&lt;br /&gt;
|}&lt;br /&gt;
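The functions in the table are each a line of NumPy; the softmax subtracts the maximum before exponentiating, a standard numerical-stability trick that leaves the result unchanged.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    return np.maximum(alpha * z, z)

def softmax(z):
    e = np.exp(z - np.max(z))  # shift by max for numerical stability
    return e / e.sum()
```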
&lt;br /&gt;
== Universal approximation theorem ==&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;universal approximation theorem&#039;&#039;&#039; (Cybenko 1989, Hornik 1991) states that a feedforward network with a single hidden layer containing a finite number of neurons can approximate any continuous function on a compact subset of &amp;lt;math&amp;gt;\mathbb{R}^n&amp;lt;/math&amp;gt; to arbitrary accuracy, provided the activation function satisfies mild conditions (e.g. is non-constant, bounded, and continuous).&lt;br /&gt;
&lt;br /&gt;
This theorem guarantees the &#039;&#039;existence&#039;&#039; of a good approximation but says nothing about how to &#039;&#039;find&#039;&#039; it — in practice, training deep networks with many layers is far more effective than using a single wide layer.&lt;br /&gt;
&lt;br /&gt;
== Training overview ==&lt;br /&gt;
&lt;br /&gt;
Training a neural network involves:&lt;br /&gt;
&lt;br /&gt;
# &#039;&#039;&#039;Defining a loss function&#039;&#039;&#039; — a measure of how far the network&#039;s predictions are from the true targets (see [[Loss Functions]]).&lt;br /&gt;
# &#039;&#039;&#039;Forward pass&#039;&#039;&#039; — computing the output of the network for a given input by propagating values layer by layer.&lt;br /&gt;
# &#039;&#039;&#039;Backward pass (backpropagation)&#039;&#039;&#039; — computing the gradient of the loss with respect to every weight by applying the chain rule in reverse through the network (see [[Backpropagation]]).&lt;br /&gt;
# &#039;&#039;&#039;Parameter update&#039;&#039;&#039; — adjusting the weights using an optimisation algorithm such as [[Gradient Descent]] or one of its variants.&lt;br /&gt;
# &#039;&#039;&#039;Iteration&#039;&#039;&#039; — repeating steps 2–4 over many passes (epochs) through the training data.&lt;br /&gt;
&lt;br /&gt;
Successful training also requires attention to &#039;&#039;&#039;initialisation&#039;&#039;&#039; (e.g. Xavier or He schemes), &#039;&#039;&#039;regularisation&#039;&#039;&#039; (to prevent [[Overfitting and Regularization|overfitting]]), and &#039;&#039;&#039;hyperparameter tuning&#039;&#039;&#039; (learning rate, batch size, network architecture).&lt;br /&gt;
&lt;br /&gt;
== Common architectures ==&lt;br /&gt;
&lt;br /&gt;
Beyond the basic feedforward network, several specialised architectures have been developed:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;[[Convolutional Neural Networks]]&#039;&#039;&#039; (CNNs) — designed for grid-structured data such as images, using local connectivity and weight sharing.&lt;br /&gt;
* &#039;&#039;&#039;[[Recurrent Neural Networks]]&#039;&#039;&#039; (RNNs) — designed for sequential data, with connections that form cycles to maintain hidden state.&lt;br /&gt;
* &#039;&#039;&#039;Transformers&#039;&#039;&#039; — attention-based architectures that have become dominant in natural language processing and increasingly in vision.&lt;br /&gt;
* &#039;&#039;&#039;Autoencoders&#039;&#039;&#039; — networks trained to reconstruct their input, used for dimensionality reduction and generative modelling.&lt;br /&gt;
* &#039;&#039;&#039;Generative adversarial networks&#039;&#039;&#039; (GANs) — pairs of networks (generator and discriminator) trained in competition to generate realistic data.&lt;br /&gt;
&lt;br /&gt;
== Applications ==&lt;br /&gt;
&lt;br /&gt;
Neural networks are applied across a vast range of domains:&lt;br /&gt;
&lt;br /&gt;
* Computer vision (image classification, object detection, segmentation)&lt;br /&gt;
* Natural language processing (translation, summarisation, question answering)&lt;br /&gt;
* Speech recognition and synthesis&lt;br /&gt;
* Game playing (AlphaGo, Atari agents)&lt;br /&gt;
* Scientific discovery (protein folding, drug design, weather prediction)&lt;br /&gt;
* Autonomous vehicles and robotics&lt;br /&gt;
&lt;br /&gt;
== See also ==&lt;br /&gt;
&lt;br /&gt;
* [[Gradient Descent]]&lt;br /&gt;
* [[Backpropagation]]&lt;br /&gt;
* [[Loss Functions]]&lt;br /&gt;
* [[Convolutional Neural Networks]]&lt;br /&gt;
* [[Recurrent Neural Networks]]&lt;br /&gt;
* [[Overfitting and Regularization]]&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
* Rosenblatt, F. (1958). &amp;quot;The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain&amp;quot;. &#039;&#039;Psychological Review&#039;&#039;.&lt;br /&gt;
* Cybenko, G. (1989). &amp;quot;Approximation by Superpositions of a Sigmoidal Function&amp;quot;. &#039;&#039;Mathematics of Control, Signals, and Systems&#039;&#039;.&lt;br /&gt;
* Hornik, K. (1991). &amp;quot;Approximation Capabilities of Multilayer Feedforward Networks&amp;quot;. &#039;&#039;Neural Networks&#039;&#039;.&lt;br /&gt;
* LeCun, Y., Bengio, Y. and Hinton, G. (2015). &amp;quot;Deep learning&amp;quot;. &#039;&#039;Nature&#039;&#039;, 521, 436–444.&lt;br /&gt;
* Goodfellow, I., Bengio, Y. and Courville, A. (2016). &#039;&#039;Deep Learning&#039;&#039;. MIT Press.&lt;br /&gt;
&lt;br /&gt;
[[Category:Deep Learning]]&lt;br /&gt;
[[Category:Introductory]]&lt;br /&gt;
[[Category:Neural Networks]]&lt;/div&gt;</summary>
		<author><name>DeployBot</name></author>
	</entry>
	<entry>
		<id>https://marovi.ai/index.php?title=Loss_Functions&amp;diff=2139</id>
		<title>Loss Functions</title>
		<link rel="alternate" type="text/html" href="https://marovi.ai/index.php?title=Loss_Functions&amp;diff=2139"/>
		<updated>2026-04-24T07:08:59Z</updated>

		<summary type="html">&lt;p&gt;DeployBot: [deploy-bot] Deploy from CI (8c92aeb)&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{LanguageBar | page = Loss Functions}}&lt;br /&gt;
{{ArticleInfobox | topic_area = Machine Learning | difficulty = Introductory | prerequisites = }}&lt;br /&gt;
{{ContentMeta | generated_by = claude-opus | model_used = claude-opus-4-6 | generated_date = 2026-03-13}}&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Loss functions&#039;&#039;&#039; (also called &#039;&#039;&#039;cost functions&#039;&#039;&#039; or &#039;&#039;&#039;objective functions&#039;&#039;&#039;) quantify how far a model&#039;s predictions are from the desired output. Minimising the loss function is the central goal of the training process in machine learning: the optimisation algorithm adjusts the model&#039;s parameters to drive the loss as low as possible.&lt;br /&gt;
&lt;br /&gt;
== Purpose ==&lt;br /&gt;
&lt;br /&gt;
A loss function maps the model&#039;s prediction &amp;lt;math&amp;gt;\hat{y}&amp;lt;/math&amp;gt; and the true target &amp;lt;math&amp;gt;y&amp;lt;/math&amp;gt; to a non-negative real number. Formally, for a single example:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\ell: \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_{\geq 0}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Over a dataset of &amp;lt;math&amp;gt;N&amp;lt;/math&amp;gt; examples, the total loss is typically the average:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;L(\theta) = \frac{1}{N}\sum_{i=1}^{N}\ell\bigl(y_i,\, \hat{y}_i(\theta)\bigr)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The choice of loss function encodes the problem&#039;s structure — what kind of errors matter and how severely they should be penalised. A poorly chosen loss can lead to a model that optimises the wrong objective.&lt;br /&gt;
&lt;br /&gt;
== Mean squared error ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Mean squared error&#039;&#039;&#039; (MSE) is the default loss for regression tasks:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;L_{\text{MSE}} = \frac{1}{N}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
MSE penalises large errors quadratically, making it sensitive to outliers. Its gradient is straightforward:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\frac{\partial}{\partial \hat{y}_i} (y_i - \hat{y}_i)^2 = -2(y_i - \hat{y}_i)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
A closely related variant is &#039;&#039;&#039;mean absolute error&#039;&#039;&#039; (MAE), &amp;lt;math&amp;gt;\frac{1}{N}\sum|y_i - \hat{y}_i|&amp;lt;/math&amp;gt;, which is more robust to outliers but has a non-smooth gradient at zero. The &#039;&#039;&#039;Huber loss&#039;&#039;&#039; combines both: it behaves like MSE for small errors and like MAE for large ones.&lt;br /&gt;
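The three regression losses compare directly in code; the Huber case analysis matches the piecewise definition, with the quadratic branch inside the threshold &lt;code&gt;delta&lt;/code&gt; and the linear branch outside.

```python
import numpy as np

def mse(y, yhat):
    return np.mean((y - yhat) ** 2)

def mae(y, yhat):
    return np.mean(np.abs(y - yhat))

def huber(y, yhat, delta=1.0):
    """Quadratic for residuals within delta, linear beyond it."""
    r = np.abs(y - yhat)
    return np.mean(np.where(r <= delta,
                            0.5 * r ** 2,
                            delta * (r - 0.5 * delta)))
```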
&lt;br /&gt;
== Cross-entropy loss ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Cross-entropy loss&#039;&#039;&#039; is the standard choice for classification tasks. It measures the dissimilarity between the predicted probability distribution and the true label distribution.&lt;br /&gt;
&lt;br /&gt;
=== Binary cross-entropy ===&lt;br /&gt;
&lt;br /&gt;
For binary classification with predicted probability &amp;lt;math&amp;gt;p&amp;lt;/math&amp;gt; and true label &amp;lt;math&amp;gt;y \in \{0, 1\}&amp;lt;/math&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;L_{\text{BCE}} = -\frac{1}{N}\sum_{i=1}^{N}\bigl[y_i \log p_i + (1 - y_i)\log(1 - p_i)\bigr]&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This loss is minimised when the predicted probability matches the true label perfectly (&amp;lt;math&amp;gt;p = 1&amp;lt;/math&amp;gt; when &amp;lt;math&amp;gt;y = 1&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;p = 0&amp;lt;/math&amp;gt; when &amp;lt;math&amp;gt;y = 0&amp;lt;/math&amp;gt;).&lt;br /&gt;
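A direct implementation needs one practical addition not shown in the formula: predicted probabilities are clipped away from 0 and 1 so the logarithms stay finite.

```python
import numpy as np

def bce(y, p, eps=1e-12):
    """Binary cross-entropy averaged over examples; eps guards log(0)."""
    p = np.clip(p, eps, 1.0 - eps)
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
```

A maximally uncertain prediction of 0.5 costs exactly &lt;math&gt;\log 2&lt;/math&gt; per example, while a correct confident prediction costs essentially zero.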
&lt;br /&gt;
=== Categorical cross-entropy ===&lt;br /&gt;
&lt;br /&gt;
For multi-class classification with &amp;lt;math&amp;gt;C&amp;lt;/math&amp;gt; classes and predicted probability vector &amp;lt;math&amp;gt;\hat{\mathbf{y}}&amp;lt;/math&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;L_{\text{CE}} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} y_{i,c} \log \hat{y}_{i,c}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
When the true labels are one-hot encoded, only the term corresponding to the correct class survives.&lt;br /&gt;
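The multi-class case vectorises over examples and classes; with one-hot rows, the inner sum picks out the log-probability of the correct class, as noted above.

```python
import numpy as np

def categorical_ce(Y, Y_hat, eps=1e-12):
    """Mean cross-entropy of one-hot target rows Y against predicted
    probability rows Y_hat; eps guards log(0)."""
    return -np.mean(np.sum(Y * np.log(np.clip(Y_hat, eps, 1.0)), axis=1))
```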
&lt;br /&gt;
== Hinge loss ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Hinge loss&#039;&#039;&#039; is associated with support vector machines (SVMs) and maximum-margin classifiers. For a binary classification problem with labels &amp;lt;math&amp;gt;y \in \{-1, +1\}&amp;lt;/math&amp;gt; and raw model output &amp;lt;math&amp;gt;s&amp;lt;/math&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;L_{\text{hinge}} = \frac{1}{N}\sum_{i=1}^{N}\max(0,\; 1 - y_i \, s_i)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The hinge loss is zero when the prediction has the correct sign with margin at least 1, and increases linearly otherwise. Because it is not differentiable at the hinge point, subgradient methods are used for optimisation.&lt;br /&gt;
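The hinge loss itself is one expression; note how a correct prediction with margin below 1 is still penalised, which is what pushes the classifier toward a wide margin.

```python
import numpy as np

def hinge(y, s):
    """Mean hinge loss for labels y in {-1, +1} and raw scores s."""
    return np.mean(np.maximum(0.0, 1.0 - y * s))
```

For example, a score of 2.0 on a positive example contributes nothing, while a score of -0.5 on a negative example (correct sign, margin 0.5) still contributes 0.5.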
&lt;br /&gt;
== Other common loss functions ==&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! Loss !! Formula !! Typical use&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;Huber&#039;&#039;&#039; || &amp;lt;math&amp;gt;\begin{cases}\tfrac{1}{2}(y-\hat{y})^2 &amp;amp; |y-\hat{y}|\leq\delta \\ \delta(|y-\hat{y}|-\tfrac{\delta}{2}) &amp;amp; \text{otherwise}\end{cases}&amp;lt;/math&amp;gt; || Robust regression&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;KL divergence&#039;&#039;&#039; || &amp;lt;math&amp;gt;\sum_c p_c \log\frac{p_c}{q_c}&amp;lt;/math&amp;gt; || Distribution matching, VAEs&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;Focal loss&#039;&#039;&#039; || &amp;lt;math&amp;gt;-\alpha(1-p_t)^\gamma \log p_t&amp;lt;/math&amp;gt; || Imbalanced classification&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;CTC loss&#039;&#039;&#039; || Dynamic programming over alignments || Speech recognition, OCR&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;Triplet loss&#039;&#039;&#039; || &amp;lt;math&amp;gt;\max(0,\; d(a,p) - d(a,n) + m)&amp;lt;/math&amp;gt; || Metric learning, face verification&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== Choosing the right loss ==&lt;br /&gt;
&lt;br /&gt;
The appropriate loss function depends on the task:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Regression&#039;&#039;&#039; — MSE is the default; switch to MAE or Huber if outliers are a concern.&lt;br /&gt;
* &#039;&#039;&#039;Binary classification&#039;&#039;&#039; — binary cross-entropy with sigmoid output.&lt;br /&gt;
* &#039;&#039;&#039;Multi-class classification&#039;&#039;&#039; — categorical cross-entropy with softmax output.&lt;br /&gt;
* &#039;&#039;&#039;Multi-label classification&#039;&#039;&#039; — binary cross-entropy applied independently per label.&lt;br /&gt;
* &#039;&#039;&#039;Ranking or retrieval&#039;&#039;&#039; — contrastive loss, triplet loss, or listwise ranking losses.&lt;br /&gt;
&lt;br /&gt;
An important consideration is whether the loss is &#039;&#039;&#039;calibrated&#039;&#039;&#039; — i.e., whether minimising it yields well-calibrated predicted probabilities. Cross-entropy is a proper scoring rule, so minimising it encourages calibrated probability estimates; hinge loss is not, and margin-based scores typically require post-hoc calibration (e.g. Platt scaling) before they can be read as probabilities.&lt;br /&gt;
&lt;br /&gt;
== Regularisation terms ==&lt;br /&gt;
&lt;br /&gt;
In practice, the total objective often includes a &#039;&#039;&#039;regularisation term&#039;&#039;&#039; that penalises model complexity:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;J(\theta) = L(\theta) + \lambda \, R(\theta)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;math&amp;gt;\lambda&amp;lt;/math&amp;gt; controls the strength of regularisation. Common choices include L2 regularisation (&amp;lt;math&amp;gt;R = \|\theta\|_2^2&amp;lt;/math&amp;gt;) and L1 regularisation (&amp;lt;math&amp;gt;R = \|\theta\|_1&amp;lt;/math&amp;gt;). See [[Overfitting and Regularization]] for more detail.&lt;br /&gt;
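&lt;br /&gt;
The composite objective above can be sketched in a few lines of plain Python; the function names here are illustrative, not from any particular library:&lt;br /&gt;

```python
# Hypothetical sketch of a regularised objective J(theta) = L(theta) + lambda * R(theta).
def l2_penalty(theta):            # R(theta) = ||theta||_2^2
    return sum(t * t for t in theta)

def l1_penalty(theta):            # R(theta) = ||theta||_1
    return sum(abs(t) for t in theta)

def objective(data_loss, theta, lam, penalty=l2_penalty):
    return data_loss + lam * penalty(theta)

print(objective(0.8, [1.0, -2.0], lam=0.1))                      # 0.8 + 0.1 * 5 = 1.3
print(objective(0.8, [1.0, -2.0], lam=0.1, penalty=l1_penalty))  # 0.8 + 0.1 * 3 = 1.1
```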
&lt;br /&gt;
== See also ==&lt;br /&gt;
&lt;br /&gt;
* [[Gradient Descent]]&lt;br /&gt;
* [[Neural Networks]]&lt;br /&gt;
* [[Backpropagation]]&lt;br /&gt;
* [[Overfitting and Regularization]]&lt;br /&gt;
* [[Stochastic Gradient Descent]]&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
* Bishop, C. M. (2006). &#039;&#039;Pattern Recognition and Machine Learning&#039;&#039;, Chapter 1. Springer.&lt;br /&gt;
* Goodfellow, I., Bengio, Y. and Courville, A. (2016). &#039;&#039;Deep Learning&#039;&#039;, Chapters 6 and 8. MIT Press.&lt;br /&gt;
* Lin, T.-Y. et al. (2017). &amp;quot;Focal Loss for Dense Object Detection&amp;quot;. &#039;&#039;ICCV&#039;&#039;.&lt;br /&gt;
* Murphy, K. P. (2022). &#039;&#039;Probabilistic Machine Learning: An Introduction&#039;&#039;. MIT Press.&lt;br /&gt;
&lt;br /&gt;
[[Category:Machine Learning]]&lt;br /&gt;
[[Category:Introductory]]&lt;/div&gt;</summary>
		<author><name>DeployBot</name></author>
	</entry>
	<entry>
		<id>https://marovi.ai/index.php?title=Linear_Regression&amp;diff=2138</id>
		<title>Linear Regression</title>
		<link rel="alternate" type="text/html" href="https://marovi.ai/index.php?title=Linear_Regression&amp;diff=2138"/>
		<updated>2026-04-24T07:08:59Z</updated>

		<summary type="html">&lt;p&gt;DeployBot: [deploy-bot] Deploy from CI (8c92aeb)&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{LanguageBar | page = Linear Regression}}&lt;br /&gt;
{{ArticleInfobox | topic_area = Statistics | difficulty = Introductory | prerequisites = }}&lt;br /&gt;
{{ContentMeta | generated_by = claude-opus | model_used = claude-opus-4-6 | generated_date = 2026-03-13}}&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Linear regression&#039;&#039;&#039; is a fundamental statistical method that models the relationship between a dependent variable and one or more independent variables by fitting a linear equation to observed data. It is one of the oldest and most widely used techniques in statistics and machine learning, serving as both a practical predictive tool and a building block for understanding more complex models.&lt;br /&gt;
&lt;br /&gt;
== Problem Setup ==&lt;br /&gt;
&lt;br /&gt;
Given a dataset of &amp;lt;math&amp;gt;N&amp;lt;/math&amp;gt; observations &amp;lt;math&amp;gt;\{(\mathbf{x}_i, y_i)\}_{i=1}^{N}&amp;lt;/math&amp;gt;, where &amp;lt;math&amp;gt;\mathbf{x}_i \in \mathbb{R}^d&amp;lt;/math&amp;gt; is a feature vector and &amp;lt;math&amp;gt;y_i \in \mathbb{R}&amp;lt;/math&amp;gt; is the target, linear regression assumes the relationship:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;y_i = \mathbf{w}^{\!\top} \mathbf{x}_i + b + \epsilon_i&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;math&amp;gt;\mathbf{w} \in \mathbb{R}^d&amp;lt;/math&amp;gt; is the weight vector, &amp;lt;math&amp;gt;b&amp;lt;/math&amp;gt; is the bias (intercept), and &amp;lt;math&amp;gt;\epsilon_i&amp;lt;/math&amp;gt; is the error term. By absorbing the bias into the weight vector (appending a 1 to each &amp;lt;math&amp;gt;\mathbf{x}_i&amp;lt;/math&amp;gt;), this simplifies to &amp;lt;math&amp;gt;y_i = \mathbf{w}^{\!\top} \mathbf{x}_i + \epsilon_i&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Ordinary Least Squares ==&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;ordinary least squares&#039;&#039;&#039; (OLS) method finds the weights that minimize the sum of squared residuals:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\mathcal{L}(\mathbf{w}) = \sum_{i=1}^{N} (y_i - \mathbf{w}^{\!\top} \mathbf{x}_i)^2 = \|\mathbf{y} - X\mathbf{w}\|^2&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;math&amp;gt;X \in \mathbb{R}^{N \times d}&amp;lt;/math&amp;gt; is the design matrix and &amp;lt;math&amp;gt;\mathbf{y} \in \mathbb{R}^N&amp;lt;/math&amp;gt; is the target vector.&lt;br /&gt;
&lt;br /&gt;
=== Closed-Form Solution ===&lt;br /&gt;
&lt;br /&gt;
Setting the gradient to zero yields the &#039;&#039;&#039;normal equations&#039;&#039;&#039;:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\nabla_{\mathbf{w}} \mathcal{L} = -2 X^{\!\top}(\mathbf{y} - X\mathbf{w}) = 0&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\hat{\mathbf{w}} = (X^{\!\top} X)^{-1} X^{\!\top} \mathbf{y}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This solution exists and is unique when &amp;lt;math&amp;gt;X^{\!\top} X&amp;lt;/math&amp;gt; is invertible (i.e., the features are linearly independent). The computational cost is &amp;lt;math&amp;gt;O(Nd^2 + d^3)&amp;lt;/math&amp;gt;, which is efficient for moderate &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt; but becomes expensive for high-dimensional problems.&lt;br /&gt;
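&lt;br /&gt;
A minimal NumPy sketch of the normal equations on synthetic data (all names and constants are illustrative); solving the linear system is preferred to forming the explicit inverse for numerical stability:&lt;br /&gt;

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 200, 2
# Absorb the bias by appending a column of ones to the design matrix.
X = np.column_stack([rng.normal(size=(N, d)), np.ones(N)])
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=N)

# Normal equations: w_hat = (X^T X)^{-1} X^T y,
# solved as a linear system rather than via an explicit inverse.
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(w_hat)  # close to [2.0, -1.0, 0.5]
```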
&lt;br /&gt;
=== Gradient Descent Approach ===&lt;br /&gt;
&lt;br /&gt;
When the closed-form solution is impractical (large &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt; or &amp;lt;math&amp;gt;N&amp;lt;/math&amp;gt;), iterative optimization via [[Stochastic Gradient Descent|gradient descent]] is used. The gradient is:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\nabla_{\mathbf{w}} \mathcal{L} = -\frac{2}{N} X^{\!\top}(\mathbf{y} - X\mathbf{w})&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The update rule is &amp;lt;math&amp;gt;\mathbf{w} \leftarrow \mathbf{w} - \eta \nabla_{\mathbf{w}} \mathcal{L}&amp;lt;/math&amp;gt;, where &amp;lt;math&amp;gt;\eta&amp;lt;/math&amp;gt; is the learning rate. Stochastic and mini-batch variants scale to millions of data points.&lt;br /&gt;
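&lt;br /&gt;
The update rule above can be sketched with NumPy on synthetic data (the learning rate and iteration count are illustrative, not recommendations):&lt;br /&gt;

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
w_true = np.array([1.5, -2.0, 0.7])
y = X @ w_true  # noiseless synthetic targets for a clean check

w = np.zeros(3)
eta = 0.1  # learning rate (illustrative)
for _ in range(500):
    grad = -(2.0 / len(y)) * X.T @ (y - X @ w)  # gradient of the mean squared residual
    w -= eta * grad

print(w)  # converges toward [1.5, -2.0, 0.7]
```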
&lt;br /&gt;
== Assumptions of OLS ==&lt;br /&gt;
&lt;br /&gt;
The classical OLS estimator is &#039;&#039;&#039;BLUE&#039;&#039;&#039; (Best Linear Unbiased Estimator) under the Gauss-Markov conditions:&lt;br /&gt;
&lt;br /&gt;
# &#039;&#039;&#039;Linearity&#039;&#039;&#039;: The true relationship between features and target is linear.&lt;br /&gt;
# &#039;&#039;&#039;Independence&#039;&#039;&#039;: Observations are independent of each other.&lt;br /&gt;
# &#039;&#039;&#039;Homoscedasticity&#039;&#039;&#039;: The error variance &amp;lt;math&amp;gt;\mathrm{Var}(\epsilon_i) = \sigma^2&amp;lt;/math&amp;gt; is constant across observations.&lt;br /&gt;
# &#039;&#039;&#039;No perfect multicollinearity&#039;&#039;&#039;: No feature is an exact linear combination of others.&lt;br /&gt;
# &#039;&#039;&#039;Exogeneity&#039;&#039;&#039;: &amp;lt;math&amp;gt;E[\epsilon_i \mid \mathbf{x}_i] = 0&amp;lt;/math&amp;gt; — the errors have zero conditional mean given the features.&lt;br /&gt;
&lt;br /&gt;
Violations of these assumptions do not necessarily make linear regression useless, but they may invalidate confidence intervals and hypothesis tests derived from the model.&lt;br /&gt;
&lt;br /&gt;
== Evaluation Metrics ==&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! Metric !! Formula !! Interpretation&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;MSE&#039;&#039;&#039; || &amp;lt;math&amp;gt;\frac{1}{N}\sum(y_i - \hat{y}_i)^2&amp;lt;/math&amp;gt; || Average squared error; penalizes large errors&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;RMSE&#039;&#039;&#039; || &amp;lt;math&amp;gt;\sqrt{\mathrm{MSE}}&amp;lt;/math&amp;gt; || In the same units as the target&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;MAE&#039;&#039;&#039; || &amp;lt;math&amp;gt;\frac{1}{N}\sum|y_i - \hat{y}_i|&amp;lt;/math&amp;gt; || Average absolute error; robust to outliers&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;R-squared&#039;&#039;&#039; || &amp;lt;math&amp;gt;1 - \frac{\sum(y_i - \hat{y}_i)^2}{\sum(y_i - \bar{y})^2}&amp;lt;/math&amp;gt; || Proportion of variance explained (at most 1; can be negative for a model worse than the mean)&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
An &amp;lt;math&amp;gt;R^2&amp;lt;/math&amp;gt; of 1 indicates perfect prediction, while &amp;lt;math&amp;gt;R^2 = 0&amp;lt;/math&amp;gt; means the model does no better than predicting the mean. The &#039;&#039;&#039;adjusted R-squared&#039;&#039;&#039; penalizes the number of features, preventing artificial inflation from adding irrelevant predictors.&lt;br /&gt;
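&lt;br /&gt;
The four metrics follow directly from their formulas; this is a plain-Python sketch with an illustrative toy example:&lt;br /&gt;

```python
import math

def regression_metrics(y, y_hat):
    n = len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    mse = ss_res / n
    mae = sum(abs(a - b) for a, b in zip(y, y_hat)) / n
    y_bar = sum(y) / n
    ss_tot = sum((a - y_bar) ** 2 for a in y)
    return {"MSE": mse, "RMSE": math.sqrt(mse), "MAE": mae, "R2": 1.0 - ss_res / ss_tot}

m = regression_metrics([3.0, 5.0, 7.0], [2.5, 5.0, 7.5])
print(m)  # MSE = 1/6, MAE = 1/3, R2 = 1 - 0.5/8 = 0.9375
```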
&lt;br /&gt;
== Multiple Regression ==&lt;br /&gt;
&lt;br /&gt;
When &amp;lt;math&amp;gt;d &amp;gt; 1&amp;lt;/math&amp;gt;, the model is called &#039;&#039;&#039;multiple linear regression&#039;&#039;&#039;. Each coefficient &amp;lt;math&amp;gt;w_j&amp;lt;/math&amp;gt; represents the expected change in &amp;lt;math&amp;gt;y&amp;lt;/math&amp;gt; per unit change in &amp;lt;math&amp;gt;x_j&amp;lt;/math&amp;gt;, holding all other features constant. Interpreting coefficients requires caution when features are correlated (multicollinearity), as individual coefficients may become unstable even though the overall model fits well.&lt;br /&gt;
&lt;br /&gt;
== Regularized Variants ==&lt;br /&gt;
&lt;br /&gt;
When the number of features is large relative to the number of observations, or when features are correlated, OLS can overfit. Regularization adds a penalty to the loss function:&lt;br /&gt;
&lt;br /&gt;
=== Ridge Regression (L2) ===&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\mathcal{L}_{\mathrm{ridge}} = \|\mathbf{y} - X\mathbf{w}\|^2 + \lambda \|\mathbf{w}\|_2^2&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The closed-form solution becomes &amp;lt;math&amp;gt;\hat{\mathbf{w}} = (X^{\!\top} X + \lambda I)^{-1} X^{\!\top} \mathbf{y}&amp;lt;/math&amp;gt;. Ridge shrinks coefficients toward zero but never sets them exactly to zero.&lt;br /&gt;
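&lt;br /&gt;
A small NumPy sketch of the ridge solution on synthetic data (the &amp;lt;math&amp;gt;\lambda&amp;lt;/math&amp;gt; values are illustrative), showing the shrinkage effect:&lt;br /&gt;

```python
import numpy as np

def ridge_fit(X, y, lam):
    # w_hat = (X^T X + lambda I)^{-1} X^T y, solved as a linear system.
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 2))
y = X @ np.array([3.0, -1.0])

# Larger lambda shrinks the coefficients toward (but never exactly to) zero.
for lam in (0.0, 10.0, 100.0):
    print(lam, ridge_fit(X, y, lam))
```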
&lt;br /&gt;
=== Lasso Regression (L1) ===&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\mathcal{L}_{\mathrm{lasso}} = \|\mathbf{y} - X\mathbf{w}\|^2 + \lambda \|\mathbf{w}\|_1&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Lasso can drive coefficients to exactly zero, performing automatic &#039;&#039;&#039;feature selection&#039;&#039;&#039;. It has no closed-form solution and is typically solved via coordinate descent.&lt;br /&gt;
&lt;br /&gt;
=== Elastic Net ===&lt;br /&gt;
&lt;br /&gt;
Elastic Net combines both penalties: &amp;lt;math&amp;gt;\lambda_1 \|\mathbf{w}\|_1 + \lambda_2 \|\mathbf{w}\|_2^2&amp;lt;/math&amp;gt;, balancing sparsity and stability.&lt;br /&gt;
&lt;br /&gt;
== Practical Considerations ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Feature scaling&#039;&#039;&#039;: Standardizing features (zero mean, unit variance) improves gradient descent convergence and ensures the regularization penalty treats all coefficients on a comparable scale.&lt;br /&gt;
* &#039;&#039;&#039;Polynomial features&#039;&#039;&#039;: Adding polynomial terms (e.g., &amp;lt;math&amp;gt;x^2, x_1 x_2&amp;lt;/math&amp;gt;) allows linear regression to capture nonlinear relationships.&lt;br /&gt;
* &#039;&#039;&#039;Outliers&#039;&#039;&#039;: OLS is sensitive to outliers because of the squared loss. Robust alternatives include Huber regression and RANSAC.&lt;br /&gt;
* &#039;&#039;&#039;Diagnostic plots&#039;&#039;&#039;: Residual plots help detect violations of assumptions (non-linearity, heteroscedasticity, non-normality).&lt;br /&gt;
&lt;br /&gt;
== See also ==&lt;br /&gt;
&lt;br /&gt;
* [[Stochastic Gradient Descent]]&lt;br /&gt;
* [[Logistic regression]]&lt;br /&gt;
* [[Loss Functions]]&lt;br /&gt;
* [[Overfitting and Regularization]]&lt;br /&gt;
* [[Neural Networks]]&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
* Hastie, T., Tibshirani, R. and Friedman, J. (2009). &#039;&#039;The Elements of Statistical Learning&#039;&#039;. Springer, Chapter 3.&lt;br /&gt;
* Montgomery, D. C., Peck, E. A. and Vining, G. G. (2012). &#039;&#039;Introduction to Linear Regression Analysis&#039;&#039;. Wiley.&lt;br /&gt;
* Hoerl, A. E. and Kennard, R. W. (1970). &amp;quot;Ridge Regression: Biased Estimation for Nonorthogonal Problems&amp;quot;. &#039;&#039;Technometrics&#039;&#039;.&lt;br /&gt;
* Tibshirani, R. (1996). &amp;quot;Regression Shrinkage and Selection via the Lasso&amp;quot;. &#039;&#039;Journal of the Royal Statistical Society, Series B&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
[[Category:Statistics]]&lt;br /&gt;
[[Category:Introductory]]&lt;/div&gt;</summary>
		<author><name>DeployBot</name></author>
	</entry>
	<entry>
		<id>https://marovi.ai/index.php?title=Gradient_Descent&amp;diff=2137</id>
		<title>Gradient Descent</title>
		<link rel="alternate" type="text/html" href="https://marovi.ai/index.php?title=Gradient_Descent&amp;diff=2137"/>
		<updated>2026-04-24T07:08:59Z</updated>

		<summary type="html">&lt;p&gt;DeployBot: [deploy-bot] Deploy from CI (8c92aeb)&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{LanguageBar | page = Gradient Descent}}&lt;br /&gt;
{{ArticleInfobox | topic_area = Optimization | difficulty = Introductory | prerequisites = }}&lt;br /&gt;
{{ContentMeta | generated_by = claude-opus | model_used = claude-opus-4-6 | generated_date = 2026-03-13}}&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Gradient descent&#039;&#039;&#039; is a first-order iterative optimisation algorithm for finding a local minimum of a differentiable function. It is the foundation of nearly all modern machine-learning training procedures, from simple linear regression to billion-parameter deep neural networks.&lt;br /&gt;
&lt;br /&gt;
== Intuition ==&lt;br /&gt;
&lt;br /&gt;
Imagine standing on a mountainside in thick fog. You cannot see the valley floor, but you can feel the slope beneath your feet. The most natural strategy is to take a step in the steepest downhill direction, then reassess. Gradient descent formalises precisely this idea: at each step, the algorithm computes the direction of steepest increase of the function (the &#039;&#039;&#039;gradient&#039;&#039;&#039;) and moves in the opposite direction.&lt;br /&gt;
&lt;br /&gt;
The size of each step is controlled by a scalar called the &#039;&#039;&#039;learning rate&#039;&#039;&#039; (often denoted &amp;lt;math&amp;gt;\eta&amp;lt;/math&amp;gt;). A large learning rate covers ground quickly but risks overshooting the minimum; a small learning rate converges more reliably but may take prohibitively many steps.&lt;br /&gt;
&lt;br /&gt;
== Mathematical formulation ==&lt;br /&gt;
&lt;br /&gt;
Given a differentiable objective function &amp;lt;math&amp;gt;f:\mathbb{R}^n \to \mathbb{R}&amp;lt;/math&amp;gt;, gradient descent generates a sequence of iterates by the &#039;&#039;&#039;update rule&#039;&#039;&#039;:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\theta_{t+1} = \theta_t - \eta \, \nabla f(\theta_t)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;math&amp;gt;\nabla f(\theta_t)&amp;lt;/math&amp;gt; is the gradient vector evaluated at the current point &amp;lt;math&amp;gt;\theta_t&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;\eta &amp;gt; 0&amp;lt;/math&amp;gt; is the learning rate.&lt;br /&gt;
&lt;br /&gt;
In the one-dimensional case this simplifies to:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\theta_{t+1} = \theta_t - \eta \, f&#039;(\theta_t)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The gradient &amp;lt;math&amp;gt;\nabla f&amp;lt;/math&amp;gt; points in the direction of steepest ascent, so subtracting it moves the iterate downhill.&lt;br /&gt;
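&lt;br /&gt;
A minimal one-dimensional sketch of the update rule (the objective, learning rate, and iteration count are all illustrative):&lt;br /&gt;

```python
# Minimise f(theta) = (theta - 3)^2, whose derivative is f'(theta) = 2 (theta - 3).
theta, eta = 0.0, 0.1
for _ in range(100):
    theta -= eta * 2.0 * (theta - 3.0)  # theta_{t+1} = theta_t - eta * f'(theta_t)
print(theta)  # approaches the minimiser theta* = 3
```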
&lt;br /&gt;
== Batch, stochastic, and mini-batch variants ==&lt;br /&gt;
&lt;br /&gt;
When the objective has the form of an average over data points,&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;f(\theta) = \frac{1}{N}\sum_{i=1}^{N} \ell(\theta;\, x_i, y_i)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
three common strategies differ in how much data is used to estimate the gradient:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! Variant !! Gradient computed over !! Per-step cost !! Gradient noise&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;Batch (full) gradient descent&#039;&#039;&#039; || All &amp;lt;math&amp;gt;N&amp;lt;/math&amp;gt; samples || High || None&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;Stochastic gradient descent (SGD)&#039;&#039;&#039; || 1 random sample || Low || High&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;Mini-batch gradient descent&#039;&#039;&#039; || &amp;lt;math&amp;gt;B&amp;lt;/math&amp;gt; random samples (&amp;lt;math&amp;gt;1 &amp;lt; B &amp;lt; N&amp;lt;/math&amp;gt;) || Medium || Medium&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Full batch gradient descent computes the exact gradient and therefore follows a smooth trajectory toward the minimum. [[Stochastic Gradient Descent|Stochastic gradient descent]] uses a single sample to estimate the gradient, drastically reducing computation per step at the cost of a noisier trajectory. Mini-batch gradient descent strikes a balance and is the most common choice in practice, with typical batch sizes between 32 and 512.&lt;br /&gt;
&lt;br /&gt;
== Convergence ==&lt;br /&gt;
&lt;br /&gt;
=== Convex functions ===&lt;br /&gt;
&lt;br /&gt;
For a convex function with Lipschitz-continuous gradients (constant &amp;lt;math&amp;gt;L&amp;lt;/math&amp;gt;), gradient descent with a fixed learning rate &amp;lt;math&amp;gt;\eta \leq 1/L&amp;lt;/math&amp;gt; converges at a rate of &amp;lt;math&amp;gt;O(1/t)&amp;lt;/math&amp;gt;. If the function is additionally &#039;&#039;&#039;strongly convex&#039;&#039;&#039; with parameter &amp;lt;math&amp;gt;\mu &amp;gt; 0&amp;lt;/math&amp;gt;, convergence accelerates to a linear (exponential) rate:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;f(\theta_t) - f(\theta^*) \leq \left(1 - \frac{\mu}{L}\right)^t \bigl(f(\theta_0) - f(\theta^*)\bigr)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The ratio &amp;lt;math&amp;gt;\kappa = L / \mu&amp;lt;/math&amp;gt; is called the &#039;&#039;&#039;condition number&#039;&#039;&#039; and governs how quickly the algorithm converges. Ill-conditioned problems (large &amp;lt;math&amp;gt;\kappa&amp;lt;/math&amp;gt;) converge slowly.&lt;br /&gt;
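&lt;br /&gt;
The linear rate can be observed numerically on a simple strongly convex quadratic; this toy sketch uses &amp;lt;math&amp;gt;L = 10&amp;lt;/math&amp;gt;, &amp;lt;math&amp;gt;\mu = 1&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;\eta = 1/L&amp;lt;/math&amp;gt; (illustrative values):&lt;br /&gt;

```python
# f(x, y) = (L x^2 + mu y^2) / 2 has gradient (L x, mu y),
# Lipschitz constant L, and strong-convexity parameter mu.
L_, mu = 10.0, 1.0
eta = 1.0 / L_
x, y = 1.0, 1.0
gaps = []
for _ in range(50):
    gaps.append(0.5 * (L_ * x * x + mu * y * y))  # f - f* (the minimum value is 0)
    x -= eta * L_ * x
    y -= eta * mu * y

# Once the slow direction dominates, successive gaps contract by the constant
# factor (1 - mu/L)^2 = 0.81 per step: linear (geometric) convergence,
# consistent with the (1 - mu/L)^t bound above.
print(round(gaps[-1] / gaps[-2], 6))  # 0.81
```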
&lt;br /&gt;
=== Non-convex functions ===&lt;br /&gt;
&lt;br /&gt;
Most deep-learning objectives are non-convex. In this setting gradient descent is only guaranteed to converge to a stationary point (where &amp;lt;math&amp;gt;\nabla f = 0&amp;lt;/math&amp;gt;), which could be a local minimum, saddle point, or even a local maximum. In practice, saddle points are more problematic than local minima in high-dimensional spaces.&lt;br /&gt;
&lt;br /&gt;
== Learning rate selection ==&lt;br /&gt;
&lt;br /&gt;
Choosing the learning rate is one of the most important practical decisions:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Too large&#039;&#039;&#039; — the iterates oscillate or diverge.&lt;br /&gt;
* &#039;&#039;&#039;Too small&#039;&#039;&#039; — convergence is unacceptably slow.&lt;br /&gt;
* &#039;&#039;&#039;Learning rate schedules&#039;&#039;&#039; — many practitioners start with a larger rate and reduce it over time (step decay, exponential decay, cosine annealing).&lt;br /&gt;
* &#039;&#039;&#039;Line search&#039;&#039;&#039; — classical numerical methods choose &amp;lt;math&amp;gt;\eta&amp;lt;/math&amp;gt; at each step to satisfy conditions such as the Wolfe or Armijo conditions, though this is rare in deep learning.&lt;br /&gt;
&lt;br /&gt;
A common heuristic is to try several values on a logarithmic scale (e.g. &amp;lt;math&amp;gt;10^{-1}, 10^{-2}, 10^{-3}&amp;lt;/math&amp;gt;) and pick the one that reduces the loss fastest without instability.&lt;br /&gt;
&lt;br /&gt;
== Extensions and improvements ==&lt;br /&gt;
&lt;br /&gt;
Several important modifications address limitations of vanilla gradient descent:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Momentum&#039;&#039;&#039; — accumulates a velocity vector from past gradients, helping to accelerate convergence in ravine-like landscapes.&lt;br /&gt;
* &#039;&#039;&#039;Nesterov accelerated gradient&#039;&#039;&#039; — a momentum variant that evaluates the gradient at a look-ahead position, yielding better theoretical convergence rates.&lt;br /&gt;
* &#039;&#039;&#039;Adaptive methods&#039;&#039;&#039; (Adagrad, RMSProp, Adam) — maintain per-parameter learning rates that adapt based on the history of gradients.&lt;br /&gt;
* &#039;&#039;&#039;Second-order methods&#039;&#039;&#039; — algorithms like Newton&#039;s method and L-BFGS use curvature information (the Hessian or its approximation) for faster convergence, but are often too expensive for large-scale problems.&lt;br /&gt;
&lt;br /&gt;
== Practical tips ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Feature scaling&#039;&#039;&#039; — normalising input features so they have similar ranges dramatically improves convergence, because the loss surface becomes more isotropic.&lt;br /&gt;
* &#039;&#039;&#039;Gradient clipping&#039;&#039;&#039; — capping the norm of the gradient prevents excessively large updates.&lt;br /&gt;
* &#039;&#039;&#039;Random initialisation&#039;&#039;&#039; — starting from a reasonable random initialisation (e.g. Xavier or He initialisation for neural networks) avoids symmetry-breaking issues.&lt;br /&gt;
* &#039;&#039;&#039;Monitoring the loss curve&#039;&#039;&#039; — plotting the training loss over iterations is the simplest diagnostic: a smoothly decreasing curve indicates healthy training; oscillations suggest the learning rate is too high.&lt;br /&gt;
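&lt;br /&gt;
Of the tips above, gradient clipping is the easiest to sketch concretely; this plain-Python norm-clipping helper is illustrative:&lt;br /&gt;

```python
import math

def clip_by_norm(grad, max_norm):
    # Rescale the gradient so its L2 norm never exceeds max_norm;
    # leave it untouched when it is already within the budget.
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    scale = max_norm / norm
    return [g * scale for g in grad]

print(clip_by_norm([6.0, 8.0], max_norm=5.0))  # [3.0, 4.0]  (norm was 10.0)
```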
&lt;br /&gt;
== Applications ==&lt;br /&gt;
&lt;br /&gt;
Gradient descent and its variants are used throughout science and engineering:&lt;br /&gt;
&lt;br /&gt;
* Training machine-learning models (linear models, neural networks, support vector machines)&lt;br /&gt;
* Signal processing and control systems&lt;br /&gt;
* Inverse problems in physics and imaging&lt;br /&gt;
* Operations research and logistics optimisation&lt;br /&gt;
* Economics and game-theoretic equilibrium computation&lt;br /&gt;
&lt;br /&gt;
== See also ==&lt;br /&gt;
&lt;br /&gt;
* [[Stochastic Gradient Descent]]&lt;br /&gt;
* [[Backpropagation]]&lt;br /&gt;
* [[Loss Functions]]&lt;br /&gt;
* [[Neural Networks]]&lt;br /&gt;
* [[Overfitting and Regularization]]&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
* Cauchy, A. (1847). &amp;quot;Méthode générale pour la résolution des systèmes d&#039;équations simultanées&amp;quot;. &#039;&#039;Comptes Rendus de l&#039;Académie des Sciences&#039;&#039;.&lt;br /&gt;
* Boyd, S. and Vandenberghe, L. (2004). &#039;&#039;Convex Optimization&#039;&#039;. Cambridge University Press.&lt;br /&gt;
* Ruder, S. (2016). &amp;quot;An overview of gradient descent optimization algorithms&amp;quot;. &#039;&#039;arXiv:1609.04747&#039;&#039;.&lt;br /&gt;
* Goodfellow, I., Bengio, Y. and Courville, A. (2016). &#039;&#039;Deep Learning&#039;&#039;, Chapter 8. MIT Press.&lt;br /&gt;
&lt;br /&gt;
[[Category:Optimization]]&lt;br /&gt;
[[Category:Introductory]]&lt;/div&gt;</summary>
		<author><name>DeployBot</name></author>
	</entry>
	<entry>
		<id>https://marovi.ai/index.php?title=Dropout&amp;diff=2136</id>
		<title>Dropout</title>
		<link rel="alternate" type="text/html" href="https://marovi.ai/index.php?title=Dropout&amp;diff=2136"/>
		<updated>2026-04-24T07:08:59Z</updated>

		<summary type="html">&lt;p&gt;DeployBot: [deploy-bot] Deploy from CI (8c92aeb)&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{LanguageBar | page = Dropout}}&lt;br /&gt;
{{ArticleInfobox | topic_area = Deep Learning | difficulty = Intermediate | prerequisites = [[Neural Networks]], [[Overfitting and Regularization]]}}&lt;br /&gt;
{{ContentMeta | generated_by = claude-opus | model_used = claude-opus-4-6 | generated_date = 2026-03-13}}&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Dropout&#039;&#039;&#039; is a regularization technique for neural networks that randomly sets a fraction of neuron activations to zero during each training step. Introduced by Srivastava et al. (2014), dropout is one of the most widely used methods for preventing overfitting in deep learning.&lt;br /&gt;
&lt;br /&gt;
== Motivation: Co-Adaptation ==&lt;br /&gt;
&lt;br /&gt;
In large neural networks, neurons can develop complex &#039;&#039;&#039;co-adaptation&#039;&#039;&#039; patterns — groups of neurons that only function correctly in the presence of specific other neurons. This tight coupling makes the network brittle and prone to overfitting, since the learned features depend on the particular idiosyncrasies of the training data rather than capturing robust, general patterns.&lt;br /&gt;
&lt;br /&gt;
Dropout breaks these co-adaptations by forcing each neuron to learn features that are useful in conjunction with many different random subsets of the other neurons.&lt;br /&gt;
&lt;br /&gt;
== The Dropout Algorithm ==&lt;br /&gt;
&lt;br /&gt;
=== During Training ===&lt;br /&gt;
&lt;br /&gt;
At each training step, every neuron in a dropout layer is independently retained with probability &amp;lt;math&amp;gt;p&amp;lt;/math&amp;gt; (the &#039;&#039;&#039;keep probability&#039;&#039;&#039;) or set to zero with probability &amp;lt;math&amp;gt;1 - p&amp;lt;/math&amp;gt;. Formally, for a layer with activation vector &amp;lt;math&amp;gt;\mathbf{h}&amp;lt;/math&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;r_j \sim \mathrm{Bernoulli}(p)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\tilde{h}_j = r_j \cdot h_j&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;math&amp;gt;r_j&amp;lt;/math&amp;gt; is a binary mask drawn independently for each neuron &amp;lt;math&amp;gt;j&amp;lt;/math&amp;gt;. A typical keep probability is &amp;lt;math&amp;gt;p = 0.5&amp;lt;/math&amp;gt; for hidden layers and &amp;lt;math&amp;gt;p = 0.8&amp;lt;/math&amp;gt; or higher for the input layer.&lt;br /&gt;
&lt;br /&gt;
Each training step effectively trains a different &amp;quot;thinned&amp;quot; sub-network sampled from the full architecture. With &amp;lt;math&amp;gt;n&amp;lt;/math&amp;gt; neurons, there are &amp;lt;math&amp;gt;2^n&amp;lt;/math&amp;gt; possible sub-networks, creating an implicit ensemble.&lt;br /&gt;
&lt;br /&gt;
=== During Inference: Inverted Dropout ===&lt;br /&gt;
&lt;br /&gt;
At inference time, all neurons are active, so each neuron&#039;s raw output is larger by a factor of &amp;lt;math&amp;gt;1/p&amp;lt;/math&amp;gt; than its expected output during training. Two approaches correct for this mismatch:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Standard dropout&#039;&#039;&#039;: Multiply all weights by &amp;lt;math&amp;gt;p&amp;lt;/math&amp;gt; at test time.&lt;br /&gt;
* &#039;&#039;&#039;Inverted dropout&#039;&#039;&#039; (more common): During training, divide the retained activations by &amp;lt;math&amp;gt;p&amp;lt;/math&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\tilde{h}_j = \frac{r_j \cdot h_j}{p}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Inverted dropout ensures that the expected value of &amp;lt;math&amp;gt;\tilde{h}_j&amp;lt;/math&amp;gt; equals &amp;lt;math&amp;gt;h_j&amp;lt;/math&amp;gt; during training, so no adjustment is needed at inference. This is the default implementation in frameworks such as PyTorch and TensorFlow.&lt;br /&gt;
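&lt;br /&gt;
A minimal plain-Python sketch of inverted dropout (illustrative, not the framework implementations mentioned above):&lt;br /&gt;

```python
import random

def inverted_dropout(h, p, training=True, rng=random):
    # Keep each activation with probability p and divide survivors by p,
    # so the expected output equals the input during training; at
    # inference the activations pass through unchanged.
    if not training:
        return list(h)
    return [x / p if rng.random() < p else 0.0 for x in h]

random.seed(0)
print(inverted_dropout([1.0, 2.0, 3.0, 4.0], p=0.5))  # [0.0, 0.0, 6.0, 8.0]
```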
&lt;br /&gt;
== Theoretical Interpretation ==&lt;br /&gt;
&lt;br /&gt;
=== Ensemble Perspective ===&lt;br /&gt;
&lt;br /&gt;
Dropout can be viewed as training an exponentially large ensemble of sub-networks with extensive weight sharing. At test time, using the full network with scaled weights approximates the geometric mean of the predictions of all &amp;lt;math&amp;gt;2^n&amp;lt;/math&amp;gt; sub-networks. This ensemble averaging reduces variance and improves generalization.&lt;br /&gt;
&lt;br /&gt;
=== Bayesian Interpretation ===&lt;br /&gt;
&lt;br /&gt;
Gal and Ghahramani (2016) showed that training a neural network with dropout applied before every weight layer can be interpreted as approximate Bayesian inference in a deep Gaussian process. Performing dropout at test time (&#039;&#039;&#039;Monte Carlo dropout&#039;&#039;&#039;) and averaging over multiple stochastic forward passes yields a distribution over predictions, providing a practical estimate of model uncertainty.&lt;br /&gt;
&lt;br /&gt;
== Dropout Variants ==&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! Variant !! Description !! Typical application&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;Standard dropout&#039;&#039;&#039; || Drops individual neurons || Fully connected layers&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;Spatial dropout&#039;&#039;&#039; || Drops entire feature maps (channels) || Convolutional networks&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;DropConnect&#039;&#039;&#039; || Drops individual weights instead of neurons || Dense layers&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;Variational dropout&#039;&#039;&#039; || Learns the dropout rate per neuron/weight || Bayesian deep learning&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;DropBlock&#039;&#039;&#039; || Drops contiguous regions of feature maps || Convolutional networks&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;Alpha dropout&#039;&#039;&#039; || Maintains self-normalizing property (for SELU activations) || Self-normalizing networks&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Spatial dropout&#039;&#039;&#039; (Tompson et al., 2015) is particularly important for convolutional networks. Standard dropout on convolutional feature maps is ineffective because adjacent activations are highly correlated; dropping individual pixels still leaves redundant spatial information. Spatial dropout instead drops entire channels, forcing the network to use diverse feature representations.&lt;br /&gt;
&lt;br /&gt;
== Practical Guidelines ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Placement&#039;&#039;&#039;: Apply dropout after the activation function in fully connected layers. In Transformers, dropout is applied to attention weights and after feed-forward sub-layers.&lt;br /&gt;
* &#039;&#039;&#039;Rate selection&#039;&#039;&#039;: Start with &amp;lt;math&amp;gt;p = 0.5&amp;lt;/math&amp;gt; for hidden layers. Use higher keep rates (lower dropout) for layers with fewer parameters. Increase dropout for larger models or smaller datasets.&lt;br /&gt;
* &#039;&#039;&#039;Interaction with BatchNorm&#039;&#039;&#039;: Using dropout and [[Batch Normalization]] together requires care, as dropout introduces variance that can destabilize batch statistics. A common practice is to apply dropout only after the final batch-normalized layer.&lt;br /&gt;
* &#039;&#039;&#039;Scheduled dropout&#039;&#039;&#039;: Some training regimes start with no dropout and gradually increase the rate, or vice versa, over the course of training.&lt;br /&gt;
&lt;br /&gt;
== Effect on Training ==&lt;br /&gt;
&lt;br /&gt;
Dropout typically increases training loss and slows convergence, since the effective model capacity is reduced at each step. However, it decreases the gap between training and validation performance, leading to better generalization. If training loss is already high (underfitting), dropout should be reduced or removed.&lt;br /&gt;
&lt;br /&gt;
== See also ==&lt;br /&gt;
&lt;br /&gt;
* [[Overfitting and Regularization]]&lt;br /&gt;
* [[Batch Normalization]]&lt;br /&gt;
* [[Neural Networks]]&lt;br /&gt;
* [[Backpropagation]]&lt;br /&gt;
* [[Bayesian deep learning]]&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
* Srivastava, N. et al. (2014). &amp;quot;Dropout: A Simple Way to Prevent Neural Networks from Overfitting&amp;quot;. &#039;&#039;Journal of Machine Learning Research&#039;&#039; 15(56):1929–1958.&lt;br /&gt;
* Gal, Y. and Ghahramani, Z. (2016). &amp;quot;Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning&amp;quot;. &#039;&#039;ICML&#039;&#039;.&lt;br /&gt;
* Tompson, J. et al. (2015). &amp;quot;Efficient Object Localization Using Convolutional Networks&amp;quot;. &#039;&#039;CVPR&#039;&#039;.&lt;br /&gt;
* Wan, L. et al. (2013). &amp;quot;Regularization of Neural Networks using DropConnect&amp;quot;. &#039;&#039;ICML&#039;&#039;.&lt;br /&gt;
* Ghiasi, G., Lin, T.-Y. and Le, Q. V. (2018). &amp;quot;DropBlock: A regularization method for convolutional networks&amp;quot;. &#039;&#039;NeurIPS&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
[[Category:Deep Learning]]&lt;br /&gt;
[[Category:Intermediate]]&lt;br /&gt;
[[Category:Neural Networks]]&lt;/div&gt;</summary>
		<author><name>DeployBot</name></author>
	</entry>
	<entry>
		<id>https://marovi.ai/index.php?title=Cross-Entropy_Loss&amp;diff=2135</id>
		<title>Cross-Entropy Loss</title>
		<link rel="alternate" type="text/html" href="https://marovi.ai/index.php?title=Cross-Entropy_Loss&amp;diff=2135"/>
		<updated>2026-04-24T07:08:58Z</updated>

		<summary type="html">&lt;p&gt;DeployBot: [deploy-bot] Deploy from CI (8c92aeb)&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{LanguageBar | page = Cross-Entropy Loss}}&lt;br /&gt;
{{ArticleInfobox | topic_area = Machine Learning | difficulty = Intermediate | prerequisites = [[Loss Functions]], [[Softmax Function]]}}&lt;br /&gt;
{{ContentMeta | generated_by = claude-opus | model_used = claude-opus-4-6 | generated_date = 2026-03-13}}&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Cross-entropy loss&#039;&#039;&#039; (also called &#039;&#039;&#039;log loss&#039;&#039;&#039;) is the most widely used loss function for classification tasks in machine learning. Rooted in information theory, it measures the dissimilarity between the true label distribution and the model&#039;s predicted probability distribution, providing a smooth, differentiable objective that drives probabilistic classifiers toward confident, correct predictions.&lt;br /&gt;
&lt;br /&gt;
== Information-Theoretic Foundations ==&lt;br /&gt;
&lt;br /&gt;
=== Entropy ===&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;entropy&#039;&#039;&#039; of a discrete probability distribution &amp;lt;math&amp;gt;p&amp;lt;/math&amp;gt; quantifies its uncertainty:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;H(p) = -\sum_{k=1}^{K} p_k \log p_k&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For a deterministic distribution (one-hot label), &amp;lt;math&amp;gt;H(p) = 0&amp;lt;/math&amp;gt;. Entropy is maximized when all outcomes are equally likely.&lt;br /&gt;
&lt;br /&gt;
=== KL Divergence ===&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;Kullback-Leibler divergence&#039;&#039;&#039; measures how one distribution &amp;lt;math&amp;gt;q&amp;lt;/math&amp;gt; differs from a reference distribution &amp;lt;math&amp;gt;p&amp;lt;/math&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;D_{\mathrm{KL}}(p \,\|\, q) = \sum_{k=1}^{K} p_k \log \frac{p_k}{q_k}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
KL divergence is non-negative and equals zero if and only if &amp;lt;math&amp;gt;p = q&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
=== Cross-Entropy ===&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;cross-entropy&#039;&#039;&#039; between distributions &amp;lt;math&amp;gt;p&amp;lt;/math&amp;gt; (true) and &amp;lt;math&amp;gt;q&amp;lt;/math&amp;gt; (predicted) is:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;H(p, q) = -\sum_{k=1}^{K} p_k \log q_k = H(p) + D_{\mathrm{KL}}(p \,\|\, q)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Since &amp;lt;math&amp;gt;H(p)&amp;lt;/math&amp;gt; is constant with respect to model parameters, minimizing cross-entropy is equivalent to minimizing KL divergence — i.e., making the predicted distribution &amp;lt;math&amp;gt;q&amp;lt;/math&amp;gt; as close to the true distribution &amp;lt;math&amp;gt;p&amp;lt;/math&amp;gt; as possible.&lt;br /&gt;
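&lt;br /&gt;
The decomposition is easy to check numerically; a small sketch in plain Python (the two distributions are chosen only for illustration):&lt;br /&gt;
&lt;br /&gt;
```python
import math

p = [0.7, 0.2, 0.1]   # "true" distribution
q = [0.5, 0.3, 0.2]   # model's prediction

H_p = -sum(pk * math.log(pk) for pk in p)                    # entropy H(p)
KL = sum(pk * math.log(pk / qk) for pk, qk in zip(p, q))     # D_KL(p || q)
H_pq = -sum(pk * math.log(qk) for pk, qk in zip(p, q))       # cross-entropy H(p, q)

# H(p, q) equals H(p) + D_KL(p || q) up to floating-point error.
```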
&lt;br /&gt;
== Binary Cross-Entropy ==&lt;br /&gt;
&lt;br /&gt;
For binary classification with true label &amp;lt;math&amp;gt;y \in \{0, 1\}&amp;lt;/math&amp;gt; and predicted probability &amp;lt;math&amp;gt;\hat{y} = \sigma(z)&amp;lt;/math&amp;gt; (where &amp;lt;math&amp;gt;\sigma&amp;lt;/math&amp;gt; is the [[Softmax Function|sigmoid function]]):&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\mathcal{L}_{\mathrm{BCE}} = -\bigl[y \log \hat{y} + (1 - y) \log(1 - \hat{y})\bigr]&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Over a dataset of &amp;lt;math&amp;gt;N&amp;lt;/math&amp;gt; samples:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \bigl[y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i)\bigr]&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The gradient with respect to the logit &amp;lt;math&amp;gt;z&amp;lt;/math&amp;gt; takes the elegantly simple form &amp;lt;math&amp;gt;\hat{y} - y&amp;lt;/math&amp;gt;, which is both intuitive and computationally efficient.&lt;br /&gt;
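&lt;br /&gt;
That gradient can be verified against a centered finite difference; a short sketch in plain Python (the logit and label values are arbitrary):&lt;br /&gt;
&lt;br /&gt;
```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce(z, y):
    # Binary cross-entropy expressed as a function of the logit z.
    yhat = sigmoid(z)
    return -(y * math.log(yhat) + (1 - y) * math.log(1 - yhat))

z, y = 1.3, 1.0
eps = 1e-6
numeric = (bce(z + eps, y) - bce(z - eps, y)) / (2 * eps)
analytic = sigmoid(z) - y
# numeric and analytic agree to many decimal places
```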
&lt;br /&gt;
== Categorical Cross-Entropy ==&lt;br /&gt;
&lt;br /&gt;
For multi-class classification with &amp;lt;math&amp;gt;K&amp;lt;/math&amp;gt; classes, the true label is typically a one-hot vector &amp;lt;math&amp;gt;\mathbf{y}&amp;lt;/math&amp;gt; with &amp;lt;math&amp;gt;y_c = 1&amp;lt;/math&amp;gt; for the correct class &amp;lt;math&amp;gt;c&amp;lt;/math&amp;gt;. The predicted probabilities &amp;lt;math&amp;gt;\hat{\mathbf{y}}&amp;lt;/math&amp;gt; are obtained via the [[Softmax Function]]:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\mathcal{L}_{\mathrm{CE}} = -\sum_{k=1}^{K} y_k \log \hat{y}_k = -\log \hat{y}_c&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This reduces to the negative log-probability of the correct class, which is why categorical cross-entropy is also called &#039;&#039;&#039;negative log-likelihood&#039;&#039;&#039; in this context.&lt;br /&gt;
&lt;br /&gt;
== Numerical Stability ==&lt;br /&gt;
&lt;br /&gt;
=== The Log-Sum-Exp Trick ===&lt;br /&gt;
&lt;br /&gt;
Naively computing &amp;lt;math&amp;gt;\log(\mathrm{softmax}(z_k))&amp;lt;/math&amp;gt; involves exponentiating potentially large logits, causing overflow. The &#039;&#039;&#039;log-sum-exp&#039;&#039;&#039; trick avoids this:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\log \hat{y}_k = z_k - \log \sum_{j=1}^{K} e^{z_j} = z_k - \left(m + \log \sum_{j=1}^{K} e^{z_j - m}\right)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;math&amp;gt;m = \max_j z_j&amp;lt;/math&amp;gt;. Subtracting the maximum logit ensures the largest exponent is zero, preventing overflow. All major deep learning frameworks implement this fused operation (e.g., PyTorch&#039;s &amp;lt;code&amp;gt;CrossEntropyLoss&amp;lt;/code&amp;gt; accepts raw logits).&lt;br /&gt;
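&lt;br /&gt;
A minimal sketch of the trick in plain Python (a naive exponentiation of logits near 1000 would overflow a double, so the shifted form is essential):&lt;br /&gt;
&lt;br /&gt;
```python
import math

def log_softmax(z):
    # Stable log-softmax: subtract the max logit before exponentiating,
    # so the largest exponent is exp(0) = 1 and nothing overflows.
    m = max(z)
    lse = m + math.log(sum(math.exp(v - m) for v in z))
    return [v - lse for v in z]

# math.exp(1002.0) alone raises OverflowError; the shifted version
# handles these logits without trouble.
logits = [1000.0, 1001.0, 1002.0]
stable = log_softmax(logits)
```
Because softmax is shift-invariant, the result matches what the logits [0, 1, 2] would give.&lt;br /&gt;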
&lt;br /&gt;
=== Clamping ===&lt;br /&gt;
&lt;br /&gt;
Predicted probabilities should be clamped away from exactly 0 and 1 to avoid &amp;lt;math&amp;gt;\log(0) = -\infty&amp;lt;/math&amp;gt;. A small epsilon (e.g., &amp;lt;math&amp;gt;10^{-7}&amp;lt;/math&amp;gt;) is typically used.&lt;br /&gt;
&lt;br /&gt;
== Label Smoothing ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Label smoothing&#039;&#039;&#039; (Szegedy et al., 2016) replaces the hard one-hot target with a soft distribution:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;y_k^{\mathrm{smooth}} = (1 - \alpha)\, y_k + \frac{\alpha}{K}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;math&amp;gt;\alpha&amp;lt;/math&amp;gt; is a small constant (commonly 0.1). This prevents the model from becoming overconfident, improves calibration, and often yields better generalization. It is standard practice in training large image classifiers and Transformer models.&lt;br /&gt;
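&lt;br /&gt;
A sketch of the transformation in plain Python (the label vector and smoothing constant are illustrative):&lt;br /&gt;
&lt;br /&gt;
```python
def smooth_labels(one_hot, alpha=0.1):
    # y_k -> (1 - alpha) * y_k + alpha / K
    K = len(one_hot)
    return [(1 - alpha) * y + alpha / K for y in one_hot]

smoothed = smooth_labels([0, 0, 1, 0])
# [0.025, 0.025, 0.925, 0.025] -- the target still sums to 1,
# but the correct class no longer demands probability exactly 1.
```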
&lt;br /&gt;
== Comparison with Other Losses ==&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! Loss !! Formula !! Typical use&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;Cross-entropy&#039;&#039;&#039; || &amp;lt;math&amp;gt;-\sum y_k \log \hat{y}_k&amp;lt;/math&amp;gt; || Classification&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;Mean squared error&#039;&#039;&#039; || &amp;lt;math&amp;gt;\frac{1}{K}\sum(y_k - \hat{y}_k)^2&amp;lt;/math&amp;gt; || Regression (poor for classification)&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;Hinge loss&#039;&#039;&#039; || &amp;lt;math&amp;gt;\max(0, 1 - y \cdot z)&amp;lt;/math&amp;gt; || SVM-style classification&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;Focal loss&#039;&#039;&#039; || &amp;lt;math&amp;gt;-(1-\hat{y}_c)^\gamma \log \hat{y}_c&amp;lt;/math&amp;gt; || Imbalanced classification&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Cross-entropy has steeper gradients than MSE when the prediction is confidently wrong, leading to faster correction of large errors.&lt;br /&gt;
&lt;br /&gt;
== See also ==&lt;br /&gt;
&lt;br /&gt;
* [[Loss Functions]]&lt;br /&gt;
* [[Softmax Function]]&lt;br /&gt;
* [[Logistic regression]]&lt;br /&gt;
* [[Information theory]]&lt;br /&gt;
* [[Neural Networks]]&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
* Shannon, C. E. (1948). &amp;quot;A Mathematical Theory of Communication&amp;quot;. &#039;&#039;Bell System Technical Journal&#039;&#039;.&lt;br /&gt;
* Goodfellow, I., Bengio, Y. and Courville, A. (2016). &#039;&#039;Deep Learning&#039;&#039;. MIT Press, Chapter 6.&lt;br /&gt;
* Szegedy, C. et al. (2016). &amp;quot;Rethinking the Inception Architecture for Computer Vision&amp;quot;. &#039;&#039;CVPR&#039;&#039;.&lt;br /&gt;
* Lin, T.-Y. et al. (2017). &amp;quot;Focal Loss for Dense Object Detection&amp;quot;. &#039;&#039;ICCV&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
[[Category:Machine Learning]]&lt;br /&gt;
[[Category:Intermediate]]&lt;/div&gt;</summary>
		<author><name>DeployBot</name></author>
	</entry>
	<entry>
		<id>https://marovi.ai/index.php?title=Convolutional_Neural_Networks&amp;diff=2134</id>
		<title>Convolutional Neural Networks</title>
		<link rel="alternate" type="text/html" href="https://marovi.ai/index.php?title=Convolutional_Neural_Networks&amp;diff=2134"/>
		<updated>2026-04-24T07:08:58Z</updated>

		<summary type="html">&lt;p&gt;DeployBot: [deploy-bot] Deploy from CI (8c92aeb)&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{LanguageBar | page = Convolutional Neural Networks}}&lt;br /&gt;
{{ArticleInfobox | topic_area = Deep Learning | difficulty = Intermediate | prerequisites = [[Neural Networks]], [[Backpropagation]]}}&lt;br /&gt;
{{ContentMeta | generated_by = claude-opus | model_used = claude-opus-4-6 | generated_date = 2026-03-13}}&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Convolutional neural networks&#039;&#039;&#039; (&#039;&#039;&#039;CNNs&#039;&#039;&#039; or &#039;&#039;&#039;ConvNets&#039;&#039;&#039;) are a class of deep [[Neural Networks|neural networks]] specifically designed to process data with a grid-like topology, such as images (2D grids of pixels), audio spectrograms, and video. They exploit the spatial structure of the input through local connectivity, weight sharing, and pooling, making them far more efficient than fully connected networks for visual and spatial tasks.&lt;br /&gt;
&lt;br /&gt;
== The convolution operation ==&lt;br /&gt;
&lt;br /&gt;
The core building block is the &#039;&#039;&#039;discrete convolution&#039;&#039;&#039;. For a 2D input &amp;lt;math&amp;gt;\mathbf{X}&amp;lt;/math&amp;gt; and a filter (kernel) &amp;lt;math&amp;gt;\mathbf{K}&amp;lt;/math&amp;gt; of size &amp;lt;math&amp;gt;k \times k&amp;lt;/math&amp;gt;, the output feature map &amp;lt;math&amp;gt;\mathbf{Y}&amp;lt;/math&amp;gt; is:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;Y_{i,j} = \sum_{m=0}^{k-1}\sum_{n=0}^{k-1} K_{m,n} \cdot X_{i+m,\, j+n} + b&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;math&amp;gt;b&amp;lt;/math&amp;gt; is a bias term. The filter slides (convolves) across the input, computing a dot product at each position. Technically, most implementations compute &#039;&#039;&#039;cross-correlation&#039;&#039;&#039; rather than true convolution (which would flip the kernel), but the distinction is immaterial since the kernel weights are learned.&lt;br /&gt;
&lt;br /&gt;
Key hyperparameters controlling the convolution:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Kernel size&#039;&#039;&#039; — the spatial extent of the filter (e.g. &amp;lt;math&amp;gt;3 \times 3&amp;lt;/math&amp;gt;, &amp;lt;math&amp;gt;5 \times 5&amp;lt;/math&amp;gt;).&lt;br /&gt;
* &#039;&#039;&#039;Stride&#039;&#039;&#039; — the step size between successive positions of the kernel. A stride of 2 halves the spatial dimensions.&lt;br /&gt;
* &#039;&#039;&#039;Padding&#039;&#039;&#039; — adding zeros around the border of the input to control the output size. &amp;quot;Same&amp;quot; padding preserves the spatial dimensions; &amp;quot;valid&amp;quot; means no padding is applied, so the output shrinks.&lt;br /&gt;
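&lt;br /&gt;
These hyperparameters can be made concrete with a naive single-channel convolution in NumPy (a teaching sketch only; production implementations use far more optimised routines):&lt;br /&gt;
&lt;br /&gt;
```python
import numpy as np

def conv2d(x, k, stride=1, padding=0, b=0.0):
    # Naive single-channel 2D "convolution" (really cross-correlation,
    # as in deep learning frameworks) with stride and zero padding.
    x = np.pad(x, padding)
    kh, kw = k.shape
    oh = (x.shape[0] - kh) // stride + 1
    ow = (x.shape[1] - kw) // stride + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = x[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * k) + b
    return out

x = np.arange(25, dtype=float).reshape(5, 5)
k = np.ones((3, 3))
conv2d(x, k).shape              # (3, 3): "valid", no padding
conv2d(x, k, padding=1).shape   # (5, 5): "same" padding for a 3x3 kernel
conv2d(x, k, stride=2).shape    # (2, 2): stride 2 halves each dimension
```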
&lt;br /&gt;
== Filters and feature detection ==&lt;br /&gt;
&lt;br /&gt;
Each filter learns to detect a specific local pattern. In early layers, filters typically respond to edges, corners, and colour gradients. Deeper layers compose these into higher-level features — textures, parts, and eventually entire objects.&lt;br /&gt;
&lt;br /&gt;
A convolutional layer applies multiple filters in parallel, producing a stack of feature maps. If the input has &amp;lt;math&amp;gt;C_{\text{in}}&amp;lt;/math&amp;gt; channels and the layer has &amp;lt;math&amp;gt;C_{\text{out}}&amp;lt;/math&amp;gt; filters, the total number of learnable parameters is:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;C_{\text{out}} \times (C_{\text{in}} \times k^2 + 1)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This is dramatically fewer than a fully connected layer with the same input and output dimensions, because weights are shared across all spatial positions.&lt;br /&gt;
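&lt;br /&gt;
For example (plain Python; the layer sizes are illustrative):&lt;br /&gt;
&lt;br /&gt;
```python
def conv_params(c_in, c_out, k):
    # C_out filters, each with C_in * k * k weights plus one bias.
    return c_out * (c_in * k * k + 1)

conv = conv_params(64, 128, 3)          # 73,856 parameters
# A fully connected layer between 64-channel and 128-channel 32x32
# feature maps would instead need (64*32*32) * (128*32*32) weights:
fc = (64 * 32 * 32) * (128 * 32 * 32)   # 8,589,934,592
```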
&lt;br /&gt;
== Pooling ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Pooling&#039;&#039;&#039; layers downsample the feature maps, reducing their spatial dimensions and providing a degree of translation invariance. Common pooling operations:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Max pooling&#039;&#039;&#039; — takes the maximum value in each local window (e.g. &amp;lt;math&amp;gt;2 \times 2&amp;lt;/math&amp;gt;).&lt;br /&gt;
* &#039;&#039;&#039;Average pooling&#039;&#039;&#039; — takes the mean value in each window.&lt;br /&gt;
* &#039;&#039;&#039;Global average pooling&#039;&#039;&#039; — averages each entire feature map to a single value, often used before the final classification layer.&lt;br /&gt;
&lt;br /&gt;
Pooling reduces computational cost and helps prevent overfitting by progressively abstracting the representation.&lt;br /&gt;
&lt;br /&gt;
== Architecture of a CNN ==&lt;br /&gt;
&lt;br /&gt;
A typical CNN alternates convolutional layers and pooling layers, followed by one or more fully connected layers for the final prediction:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Input → [Conv → ReLU → Pool] × N → Flatten → FC → FC → Output&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Each conv-pool block extracts increasingly abstract features, while the fully connected layers combine them for classification or regression.&lt;br /&gt;
&lt;br /&gt;
== Landmark architectures ==&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! Architecture !! Year !! Key contribution !! Depth&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;LeNet-5&#039;&#039;&#039; || 1998 || Pioneered CNNs for handwritten digit recognition (MNIST) || 5 layers&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;AlexNet&#039;&#039;&#039; || 2012 || Won ImageNet; popularised ReLU, dropout, GPU training || 8 layers&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;VGGNet&#039;&#039;&#039; || 2014 || Showed depth matters; used only &amp;lt;math&amp;gt;3 \times 3&amp;lt;/math&amp;gt; filters throughout || 16–19 layers&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;GoogLeNet (Inception)&#039;&#039;&#039; || 2014 || Introduced inception modules with parallel filter sizes || 22 layers&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;ResNet&#039;&#039;&#039; || 2015 || Introduced residual connections enabling very deep networks || 18–152+ layers&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;DenseNet&#039;&#039;&#039; || 2017 || Connected each layer to every subsequent layer via dense blocks || 121–264 layers&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;EfficientNet&#039;&#039;&#039; || 2019 || Compound scaling of depth, width, and resolution || Variable&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=== Residual connections ===&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;residual connection&#039;&#039;&#039; (or skip connection) introduced by ResNet adds the input of a block directly to its output:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\mathbf{y} = \mathcal{F}(\mathbf{x}) + \mathbf{x}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This allows gradients to flow directly through the identity path, mitigating the vanishing gradient problem and enabling the training of networks with hundreds of layers. Residual connections have become a standard component in nearly all modern architectures.&lt;br /&gt;
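&lt;br /&gt;
The identity path is easy to see in a toy linear version of the block (NumPy sketch; ResNet&#039;s actual blocks use convolutions and batch normalisation):&lt;br /&gt;
&lt;br /&gt;
```python
import numpy as np

def relu(a):
    return np.maximum(a, 0.0)

def residual_block(x, w1, w2):
    # y = F(x) + x, with F a two-layer transform standing in for
    # ResNet's conv-BN-ReLU stack. The "+ x" is the skip connection.
    return relu(x @ w1) @ w2 + x

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 8))
# With a zeroed-out residual branch the block is exactly the identity,
# so a deep stack of such blocks can always fall back to doing nothing.
zeros = np.zeros((8, 8))
identity_out = residual_block(x, zeros, zeros)
```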
&lt;br /&gt;
== Applications in computer vision ==&lt;br /&gt;
&lt;br /&gt;
CNNs have achieved state-of-the-art performance across a wide range of vision tasks:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Image classification&#039;&#039;&#039; — assigning a label to an entire image (ImageNet, CIFAR).&lt;br /&gt;
* &#039;&#039;&#039;Object detection&#039;&#039;&#039; — localising and classifying objects within an image (YOLO, Faster R-CNN, SSD).&lt;br /&gt;
* &#039;&#039;&#039;Semantic segmentation&#039;&#039;&#039; — assigning a class label to every pixel (U-Net, DeepLab).&lt;br /&gt;
* &#039;&#039;&#039;Instance segmentation&#039;&#039;&#039; — distinguishing individual instances of objects (Mask R-CNN).&lt;br /&gt;
* &#039;&#039;&#039;Image generation&#039;&#039;&#039; — generating realistic images using CNN-based generators (GANs, diffusion models).&lt;br /&gt;
* &#039;&#039;&#039;Medical imaging&#039;&#039;&#039; — tumour detection, retinal analysis, and radiology screening.&lt;br /&gt;
&lt;br /&gt;
== Practical tips ==&lt;br /&gt;
&lt;br /&gt;
* Use pretrained models (transfer learning) when labelled data is limited.&lt;br /&gt;
* Prefer small kernels (&amp;lt;math&amp;gt;3 \times 3&amp;lt;/math&amp;gt;) stacked in depth — two &amp;lt;math&amp;gt;3 \times 3&amp;lt;/math&amp;gt; layers have the same receptive field as one &amp;lt;math&amp;gt;5 \times 5&amp;lt;/math&amp;gt; layer but with fewer parameters.&lt;br /&gt;
* Apply batch normalisation after convolution and before activation.&lt;br /&gt;
* Use data augmentation generously to reduce [[Overfitting and Regularization|overfitting]].&lt;br /&gt;
* Replace fully connected layers with global average pooling to reduce parameters.&lt;br /&gt;
&lt;br /&gt;
== See also ==&lt;br /&gt;
&lt;br /&gt;
* [[Neural Networks]]&lt;br /&gt;
* [[Backpropagation]]&lt;br /&gt;
* [[Overfitting and Regularization]]&lt;br /&gt;
* [[Recurrent Neural Networks]]&lt;br /&gt;
* [[Gradient Descent]]&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
* LeCun, Y. et al. (1998). &amp;quot;Gradient-Based Learning Applied to Document Recognition&amp;quot;. &#039;&#039;Proceedings of the IEEE&#039;&#039;.&lt;br /&gt;
* Krizhevsky, A., Sutskever, I. and Hinton, G. E. (2012). &amp;quot;ImageNet Classification with Deep Convolutional Neural Networks&amp;quot;. &#039;&#039;NeurIPS&#039;&#039;.&lt;br /&gt;
* Simonyan, K. and Zisserman, A. (2015). &amp;quot;Very Deep Convolutional Networks for Large-Scale Image Recognition&amp;quot;. &#039;&#039;ICLR&#039;&#039;.&lt;br /&gt;
* He, K. et al. (2016). &amp;quot;Deep Residual Learning for Image Recognition&amp;quot;. &#039;&#039;CVPR&#039;&#039;.&lt;br /&gt;
* Tan, M. and Le, Q. V. (2019). &amp;quot;EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks&amp;quot;. &#039;&#039;ICML&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
[[Category:Deep Learning]]&lt;br /&gt;
[[Category:Intermediate]]&lt;br /&gt;
[[Category:Neural Networks]]&lt;/div&gt;</summary>
		<author><name>DeployBot</name></author>
	</entry>
	<entry>
		<id>https://marovi.ai/index.php?title=Batch_Normalization&amp;diff=2133</id>
		<title>Batch Normalization</title>
		<link rel="alternate" type="text/html" href="https://marovi.ai/index.php?title=Batch_Normalization&amp;diff=2133"/>
		<updated>2026-04-24T07:08:58Z</updated>

		<summary type="html">&lt;p&gt;DeployBot: [deploy-bot] Deploy from CI (8c92aeb)&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{LanguageBar | page = Batch Normalization}}&lt;br /&gt;
{{ArticleInfobox | topic_area = Deep Learning | difficulty = Intermediate | prerequisites = [[Neural Networks]], [[Backpropagation]]}}&lt;br /&gt;
{{ContentMeta | generated_by = claude-opus | model_used = claude-opus-4-6 | generated_date = 2026-03-13}}&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Batch normalization&#039;&#039;&#039; (often abbreviated &#039;&#039;&#039;BatchNorm&#039;&#039;&#039; or &#039;&#039;&#039;BN&#039;&#039;&#039;) is a technique for improving the speed, stability, and performance of deep neural networks by normalizing the inputs to each layer. Introduced by Ioffe and Szegedy in 2015, it has become a standard component in most modern deep learning architectures.&lt;br /&gt;
&lt;br /&gt;
== Internal Covariate Shift ==&lt;br /&gt;
&lt;br /&gt;
The original motivation for batch normalization was to address &#039;&#039;&#039;internal covariate shift&#039;&#039;&#039; — the phenomenon where the distribution of each layer&#039;s inputs changes during training as the parameters of preceding layers are updated. This shifting distribution forces each layer to continuously adapt, slowing down convergence and requiring careful initialization and small learning rates.&lt;br /&gt;
&lt;br /&gt;
While the precise role of internal covariate shift has been debated (Santurkar et al., 2018, argued that BatchNorm&#039;s benefits stem more from smoothing the loss landscape), the practical effectiveness of the technique is well established.&lt;br /&gt;
&lt;br /&gt;
== The Batch Normalization Algorithm ==&lt;br /&gt;
&lt;br /&gt;
=== During Training ===&lt;br /&gt;
&lt;br /&gt;
For a mini-batch &amp;lt;math&amp;gt;\mathcal{B} = \{x_1, \dots, x_m\}&amp;lt;/math&amp;gt; of activations at a given layer, BatchNorm proceeds as follows:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 1.&#039;&#039;&#039; Compute the mini-batch mean and variance:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\mu_{\mathcal{B}} = \frac{1}{m} \sum_{i=1}^{m} x_i, \qquad \sigma_{\mathcal{B}}^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_{\mathcal{B}})^2&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 2.&#039;&#039;&#039; Normalize:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\hat{x}_i = \frac{x_i - \mu_{\mathcal{B}}}{\sqrt{\sigma_{\mathcal{B}}^2 + \epsilon}}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;math&amp;gt;\epsilon&amp;lt;/math&amp;gt; is a small constant (e.g., &amp;lt;math&amp;gt;10^{-5}&amp;lt;/math&amp;gt;) for numerical stability.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 3.&#039;&#039;&#039; Scale and shift with learned parameters &amp;lt;math&amp;gt;\gamma&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;\beta&amp;lt;/math&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;y_i = \gamma \hat{x}_i + \beta&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The parameters &amp;lt;math&amp;gt;\gamma&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;\beta&amp;lt;/math&amp;gt; are learned during training. They restore the network&#039;s ability to represent the identity transformation if that is optimal, ensuring that normalization does not reduce the model&#039;s expressiveness.&lt;br /&gt;
&lt;br /&gt;
=== During Inference ===&lt;br /&gt;
&lt;br /&gt;
At inference time, statistics from individual mini-batches are unreliable (the input may be a single example). Instead, BatchNorm uses running estimates of the population mean and variance accumulated during training via exponential moving averages:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\mu_{\mathrm{running}} \leftarrow (1 - \alpha)\, \mu_{\mathrm{running}} + \alpha\, \mu_{\mathcal{B}}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\sigma^2_{\mathrm{running}} \leftarrow (1 - \alpha)\, \sigma^2_{\mathrm{running}} + \alpha\, \sigma^2_{\mathcal{B}}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;math&amp;gt;\alpha&amp;lt;/math&amp;gt; is the momentum parameter (typically 0.1). These fixed statistics ensure deterministic outputs at inference.&lt;br /&gt;
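&lt;br /&gt;
The two modes can be sketched for a (batch, features) input (NumPy; the function and parameter names are illustrative, not a framework API):&lt;br /&gt;
&lt;br /&gt;
```python
import numpy as np

def batchnorm_train(x, gamma, beta, running, momentum=0.1, eps=1e-5):
    # Training mode: normalize with mini-batch statistics and update
    # the running estimates used later at inference.
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    running["mean"] = (1 - momentum) * running["mean"] + momentum * mu
    running["var"] = (1 - momentum) * running["var"] + momentum * var
    xhat = (x - mu) / np.sqrt(var + eps)
    return gamma * xhat + beta

def batchnorm_eval(x, gamma, beta, running, eps=1e-5):
    # Inference mode: fixed running statistics, deterministic output.
    xhat = (x - running["mean"]) / np.sqrt(running["var"] + eps)
    return gamma * xhat + beta

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 4)) * 3.0 + 5.0   # shifted, scaled batch
gamma, beta = np.ones(4), np.zeros(4)
running = {"mean": np.zeros(4), "var": np.ones(4)}
y = batchnorm_train(x, gamma, beta, running)
# Per-feature mean of y is ~0 and std is ~1; batchnorm_eval now works
# even on a single example, using the accumulated running statistics.
```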
&lt;br /&gt;
== Benefits ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Higher learning rates&#039;&#039;&#039;: By constraining activation distributions, BatchNorm allows larger step sizes without divergence.&lt;br /&gt;
* &#039;&#039;&#039;Reduced sensitivity to initialization&#039;&#039;&#039;: Networks with BatchNorm are more forgiving of poor weight initialization.&lt;br /&gt;
* &#039;&#039;&#039;Regularization effect&#039;&#039;&#039;: The noise introduced by mini-batch statistics acts as a mild regularizer, sometimes reducing the need for [[Dropout]].&lt;br /&gt;
* &#039;&#039;&#039;Faster convergence&#039;&#039;&#039;: Training typically requires fewer epochs to reach a given level of performance.&lt;br /&gt;
&lt;br /&gt;
== Placement ==&lt;br /&gt;
&lt;br /&gt;
BatchNorm is typically applied &#039;&#039;&#039;before&#039;&#039;&#039; the activation function (as in the original paper), though some practitioners place it &#039;&#039;&#039;after&#039;&#039;&#039; the activation. For convolutional layers, normalization is performed per-channel across the spatial dimensions and the batch dimension.&lt;br /&gt;
&lt;br /&gt;
== Normalization Alternatives ==&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! Method !! Normalizes over !! Use case&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;Batch Norm&#039;&#039;&#039; || Batch and spatial dims, per channel || CNNs with large batches&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;Layer Norm&#039;&#039;&#039; || All channels and spatial dims, per sample || Transformers, RNNs, small batches&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;Instance Norm&#039;&#039;&#039; || Spatial dims only, per sample per channel || Style transfer, image generation&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;Group Norm&#039;&#039;&#039; || Groups of channels, per sample || Object detection, small-batch training&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Layer normalization&#039;&#039;&#039; (Ba et al., 2016) normalizes across all features within a single sample, making it independent of batch size. It is the standard choice in Transformer architectures.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Group normalization&#039;&#039;&#039; (Wu and He, 2018) divides channels into groups and normalizes within each group per sample. It bridges the gap between Layer Norm and Instance Norm and performs well when batch sizes are too small for reliable batch statistics.&lt;br /&gt;
&lt;br /&gt;
== Limitations ==&lt;br /&gt;
&lt;br /&gt;
* Performance degrades with very small batch sizes, as batch statistics become noisy.&lt;br /&gt;
* Introduces a discrepancy between training (batch statistics) and inference (running statistics) behavior.&lt;br /&gt;
* Not directly applicable to variable-length sequences without padding or masking.&lt;br /&gt;
* The running statistics require careful handling when using distributed training across multiple devices.&lt;br /&gt;
&lt;br /&gt;
== See also ==&lt;br /&gt;
&lt;br /&gt;
* [[Neural Networks]]&lt;br /&gt;
* [[Backpropagation]]&lt;br /&gt;
* [[Dropout]]&lt;br /&gt;
* [[Stochastic Gradient Descent]]&lt;br /&gt;
* [[Transformer]]&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
* Ioffe, S. and Szegedy, C. (2015). &amp;quot;Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift&amp;quot;. &#039;&#039;ICML&#039;&#039;.&lt;br /&gt;
* Ba, J. L., Kiros, J. R. and Hinton, G. E. (2016). &amp;quot;Layer Normalization&amp;quot;. &#039;&#039;arXiv:1607.06450&#039;&#039;.&lt;br /&gt;
* Wu, Y. and He, K. (2018). &amp;quot;Group Normalization&amp;quot;. &#039;&#039;ECCV&#039;&#039;.&lt;br /&gt;
* Santurkar, S. et al. (2018). &amp;quot;How Does Batch Normalization Help Optimization?&amp;quot;. &#039;&#039;NeurIPS&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
[[Category:Deep Learning]]&lt;br /&gt;
[[Category:Intermediate]]&lt;br /&gt;
[[Category:Neural Networks]]&lt;/div&gt;</summary>
		<author><name>DeployBot</name></author>
	</entry>
	<entry>
		<id>https://marovi.ai/index.php?title=Backpropagation&amp;diff=2132</id>
		<title>Backpropagation</title>
		<link rel="alternate" type="text/html" href="https://marovi.ai/index.php?title=Backpropagation&amp;diff=2132"/>
		<updated>2026-04-24T07:08:58Z</updated>

		<summary type="html">&lt;p&gt;DeployBot: [deploy-bot] Deploy from CI (8c92aeb)&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{LanguageBar | page = Backpropagation}}&lt;br /&gt;
{{ArticleInfobox | topic_area = Deep Learning | difficulty = Intermediate | prerequisites = [[Gradient Descent]], [[Neural Networks]]}}&lt;br /&gt;
{{ContentMeta | generated_by = claude-opus | model_used = claude-opus-4-6 | generated_date = 2026-03-13}}&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Backpropagation&#039;&#039;&#039; (short for &#039;&#039;&#039;backward propagation of errors&#039;&#039;&#039;) is an algorithm for efficiently computing the gradient of a loss function with respect to every weight in a neural network. Combined with an optimisation method such as [[Gradient Descent|gradient descent]], it forms the standard training procedure for modern deep learning models.&lt;br /&gt;
&lt;br /&gt;
== The chain rule ==&lt;br /&gt;
&lt;br /&gt;
Backpropagation is fundamentally an application of the &#039;&#039;&#039;chain rule&#039;&#039;&#039; of calculus. If a variable &amp;lt;math&amp;gt;z&amp;lt;/math&amp;gt; depends on &amp;lt;math&amp;gt;y&amp;lt;/math&amp;gt;, which in turn depends on &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt;, then:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\frac{\partial z}{\partial x} = \frac{\partial z}{\partial y} \cdot \frac{\partial y}{\partial x}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In a neural network the loss &amp;lt;math&amp;gt;L&amp;lt;/math&amp;gt; depends on the output, which depends on the activations of the last hidden layer, which depend on the activations of the previous layer, and so on back to the input. The chain rule allows us to decompose the gradient into a product of local derivatives, one for each layer.&lt;br /&gt;
&lt;br /&gt;
== Forward pass ==&lt;br /&gt;
&lt;br /&gt;
During the forward pass, input data propagates through the network layer by layer. For a fully connected layer &amp;lt;math&amp;gt;l&amp;lt;/math&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\mathbf{z}^{(l)} = \mathbf{W}^{(l)} \mathbf{a}^{(l-1)} + \mathbf{b}^{(l)}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\mathbf{a}^{(l)} = g^{(l)}(\mathbf{z}^{(l)})&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;math&amp;gt;\mathbf{a}^{(l-1)}&amp;lt;/math&amp;gt; is the activation from the previous layer (with &amp;lt;math&amp;gt;\mathbf{a}^{(0)} = \mathbf{x}&amp;lt;/math&amp;gt;), &amp;lt;math&amp;gt;\mathbf{W}^{(l)}&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;\mathbf{b}^{(l)}&amp;lt;/math&amp;gt; are the weights and biases, and &amp;lt;math&amp;gt;g^{(l)}&amp;lt;/math&amp;gt; is the activation function. The forward pass stores all intermediate values &amp;lt;math&amp;gt;\mathbf{z}^{(l)}&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;\mathbf{a}^{(l)}&amp;lt;/math&amp;gt; because they are needed during the backward pass.&lt;br /&gt;
&lt;br /&gt;
== Backward pass ==&lt;br /&gt;
&lt;br /&gt;
The backward pass computes gradients starting from the loss and moving toward the input. Define the error signal at layer &amp;lt;math&amp;gt;l&amp;lt;/math&amp;gt; as:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\boldsymbol{\delta}^{(l)} = \frac{\partial L}{\partial \mathbf{z}^{(l)}}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For the output layer (layer &amp;lt;math&amp;gt;L_{\text{out}}&amp;lt;/math&amp;gt;):&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\boldsymbol{\delta}^{(L_{\text{out}})} = \frac{\partial L}{\partial \mathbf{a}^{(L_{\text{out}})}} \odot g&#039;^{(L_{\text{out}})}(\mathbf{z}^{(L_{\text{out}})})&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For each earlier layer, the error propagates backward:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\boldsymbol{\delta}^{(l)} = \bigl(\mathbf{W}^{(l+1)}\bigr)^\top \boldsymbol{\delta}^{(l+1)} \odot g&#039;^{(l)}(\mathbf{z}^{(l)})&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;math&amp;gt;\odot&amp;lt;/math&amp;gt; denotes element-wise multiplication. Once the error signal is known, the parameter gradients are:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\frac{\partial L}{\partial \mathbf{W}^{(l)}} = \boldsymbol{\delta}^{(l)} \bigl(\mathbf{a}^{(l-1)}\bigr)^\top, \qquad \frac{\partial L}{\partial \mathbf{b}^{(l)}} = \boldsymbol{\delta}^{(l)}&amp;lt;/math&amp;gt;&lt;br /&gt;
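Putting the forward and backward equations together for a small two-layer tanh network with squared loss (the shapes, data, and loss are illustrative assumptions; &amp;lt;math&amp;gt;\tanh&#039;(z) = 1 - \tanh^2(z)&amp;lt;/math&amp;gt;):&lt;br /&gt;
&lt;br /&gt;

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((4, 3)), np.zeros(4)
W2, b2 = rng.standard_normal((1, 4)), np.zeros(1)
x, target = rng.standard_normal(3), np.array([0.5])

# forward pass (z and a are stored, as noted above)
z1 = W1 @ x + b1;  a1 = np.tanh(z1)
z2 = W2 @ a1 + b2; a2 = np.tanh(z2)
L = 0.5 * np.sum((a2 - target) ** 2)

# backward pass: delta^(l) = dL/dz^(l)
delta2 = (a2 - target) * (1 - a2 ** 2)        # output-layer error signal
dW2, db2 = np.outer(delta2, a1), delta2       # dL/dW = delta (a^(l-1))^T
delta1 = (W2.T @ delta2) * (1 - a1 ** 2)      # delta^(l) = W^T delta * g'(z)
dW1, db1 = np.outer(delta1, x), delta1
```

A finite-difference check on any single weight confirms the analytic gradient.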
&lt;br /&gt;
== Computational graphs ==&lt;br /&gt;
&lt;br /&gt;
Modern deep learning frameworks (PyTorch, TensorFlow, JAX) implement backpropagation by constructing a &#039;&#039;&#039;computational graph&#039;&#039;&#039; — a directed acyclic graph where each node represents an operation and each edge carries a tensor. The forward pass builds the graph; the backward pass traverses it in reverse topological order, applying the chain rule at every node.&lt;br /&gt;
&lt;br /&gt;
This abstraction makes it possible to differentiate arbitrary compositions of operations, not just standard layer types. Two implementation strategies exist:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Static graphs&#039;&#039;&#039; — the graph is defined once before execution (early TensorFlow). Enables aggressive compiler optimisations but is less flexible.&lt;br /&gt;
* &#039;&#039;&#039;Dynamic graphs&#039;&#039;&#039; — the graph is rebuilt on every forward pass (PyTorch, TensorFlow Eager mode). More intuitive for debugging and models with data-dependent control flow.&lt;br /&gt;
&lt;br /&gt;
== Automatic differentiation ==&lt;br /&gt;
&lt;br /&gt;
Backpropagation is a special case of &#039;&#039;&#039;reverse-mode automatic differentiation&#039;&#039;&#039; (AD). Unlike numerical differentiation (which is approximate) or symbolic differentiation (which can produce unwieldy expressions), AD computes exact derivatives by systematically applying the chain rule to elementary operations.&lt;br /&gt;
&lt;br /&gt;
Reverse-mode AD computes the gradient of a scalar output with respect to all inputs in a single backward pass, making it ideally suited to neural networks where the loss is scalar but the parameters number in the millions or billions.&lt;br /&gt;
&lt;br /&gt;
The cost of the backward pass is typically 2–3 times that of the forward pass, because it must evaluate the local Jacobians and multiply them with the incoming error signal.&lt;br /&gt;
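The contrast with numerical differentiation can be seen on a toy function (the function &amp;lt;math&amp;gt;f(x) = x^2 \sin x&amp;lt;/math&amp;gt; is an illustrative choice): the chain-rule derivative is exact up to floating point, while the finite difference carries a truncation error:&lt;br /&gt;
&lt;br /&gt;

```python
import math

def f(x):
    return x * x * math.sin(x)

def df_exact(x):
    # product rule + chain rule, evaluated exactly (AD-style)
    return 2 * x * math.sin(x) + x * x * math.cos(x)

def df_numeric(x, h=1e-5):
    # central finite difference: approximate, with O(h^2) truncation error
    return (f(x + h) - f(x - h)) / (2 * h)
```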
&lt;br /&gt;
== Vanishing and exploding gradients ==&lt;br /&gt;
&lt;br /&gt;
When a network has many layers, the gradient is a product of many local derivatives. If these factors are consistently less than 1, the gradient shrinks exponentially toward zero — the &#039;&#039;&#039;vanishing gradient&#039;&#039;&#039; problem. If they are consistently greater than 1, the gradient grows exponentially — the &#039;&#039;&#039;exploding gradient&#039;&#039;&#039; problem.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! Problem !! Symptom !! Common mitigations&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;Vanishing gradients&#039;&#039;&#039; || Early layers learn extremely slowly || ReLU activations, residual connections, batch normalisation, careful initialisation&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;Exploding gradients&#039;&#039;&#039; || Loss diverges or produces NaN values || Gradient clipping, weight regularisation, lower learning rate&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
These issues were major obstacles to training deep networks before the introduction of ReLU activations, residual connections (ResNets), and normalisation techniques.&lt;br /&gt;
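The exponential behaviour is easy to see with a caricature in which every layer contributes the same local factor (the factors 0.8 and 1.25 are illustrative, not derived from any real network):&lt;br /&gt;
&lt;br /&gt;

```python
# Gradient magnitude through `layers` layers when each layer multiplies
# the signal by a constant local derivative `factor`.
def gradient_scale(factor, layers):
    g = 1.0
    for _ in range(layers):
        g *= factor
    return g
```

With 50 layers, a factor of 0.8 drives the gradient below 10⁻⁴ (vanishing), while 1.25 pushes it above 10⁴ (exploding).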
&lt;br /&gt;
== Practical considerations ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Memory&#039;&#039;&#039; — the forward pass must store all intermediate activations for the backward pass. For very deep networks this can be prohibitive; &#039;&#039;&#039;gradient checkpointing&#039;&#039;&#039; trades compute for memory by recomputing activations during the backward pass instead of storing them.&lt;br /&gt;
* &#039;&#039;&#039;Numerical stability&#039;&#039;&#039; — using log-sum-exp tricks and fused softmax-cross-entropy implementations avoids overflow and underflow.&lt;br /&gt;
* &#039;&#039;&#039;Higher-order gradients&#039;&#039;&#039; — differentiating through the backward pass itself yields second-order information (Hessian-vector products), useful for methods like natural gradient descent and meta-learning.&lt;br /&gt;
* &#039;&#039;&#039;Mixed precision&#039;&#039;&#039; — computing the forward pass in half precision while keeping a master copy of the weights in full precision speeds up training on modern GPUs.&lt;br /&gt;
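The log-sum-exp trick mentioned above amounts to shifting by the maximum before exponentiating, which keeps &amp;lt;code&amp;gt;exp&amp;lt;/code&amp;gt; from overflowing (a minimal sketch, not a fused production implementation):&lt;br /&gt;
&lt;br /&gt;

```python
import numpy as np

def log_softmax(z):
    # Subtracting max(z) leaves the result unchanged mathematically
    # but prevents overflow in exp for large logits.
    z = z - np.max(z)
    return z - np.log(np.sum(np.exp(z)))
```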
&lt;br /&gt;
== Historical development ==&lt;br /&gt;
&lt;br /&gt;
The key ideas behind backpropagation were developed independently by several researchers. Seppo Linnainmaa described reverse-mode automatic differentiation in 1970. Paul Werbos applied it to neural networks in his 1974 PhD thesis. The algorithm achieved widespread adoption after the influential 1986 paper by Rumelhart, Hinton, and Williams, which demonstrated its effectiveness on multi-layer networks.&lt;br /&gt;
&lt;br /&gt;
== See also ==&lt;br /&gt;
&lt;br /&gt;
* [[Gradient Descent]]&lt;br /&gt;
* [[Stochastic Gradient Descent]]&lt;br /&gt;
* [[Neural Networks]]&lt;br /&gt;
* [[Loss Functions]]&lt;br /&gt;
* [[Convolutional Neural Networks]]&lt;br /&gt;
* [[Recurrent Neural Networks]]&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
* Rumelhart, D. E., Hinton, G. E. and Williams, R. J. (1986). &amp;quot;Learning representations by back-propagating errors&amp;quot;. &#039;&#039;Nature&#039;&#039;, 323, 533–536.&lt;br /&gt;
* Linnainmaa, S. (1970). &amp;quot;The representation of the cumulative rounding error of an algorithm as a Taylor expansion of the local rounding errors&amp;quot;. Master&#039;s thesis, University of Helsinki.&lt;br /&gt;
* Werbos, P. J. (1974). &amp;quot;Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences&amp;quot;. PhD thesis, Harvard University.&lt;br /&gt;
* Baydin, A. G. et al. (2018). &amp;quot;Automatic Differentiation in Machine Learning: a Survey&amp;quot;. &#039;&#039;JMLR&#039;&#039;, 18(153), 1–43.&lt;br /&gt;
* Goodfellow, I., Bengio, Y. and Courville, A. (2016). &#039;&#039;Deep Learning&#039;&#039;, Chapter 6. MIT Press.&lt;br /&gt;
&lt;br /&gt;
[[Category:Deep Learning]]&lt;br /&gt;
[[Category:Intermediate]]&lt;/div&gt;</summary>
		<author><name>DeployBot</name></author>
	</entry>
	<entry>
		<id>https://marovi.ai/index.php?title=Attention_Mechanisms&amp;diff=2131</id>
		<title>Attention Mechanisms</title>
		<link rel="alternate" type="text/html" href="https://marovi.ai/index.php?title=Attention_Mechanisms&amp;diff=2131"/>
		<updated>2026-04-24T07:08:58Z</updated>

		<summary type="html">&lt;p&gt;DeployBot: [deploy-bot] Deploy from CI (8c92aeb)&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{LanguageBar | page = Attention Mechanisms}}&lt;br /&gt;
{{ArticleInfobox | topic_area = Deep Learning | difficulty = Advanced | prerequisites = [[Neural Networks]], [[Recurrent Neural Networks]]}}&lt;br /&gt;
{{ContentMeta | generated_by = claude-opus | model_used = claude-opus-4-6 | generated_date = 2026-03-13}}&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Attention mechanisms&#039;&#039;&#039; are a family of techniques that allow neural networks to focus selectively on relevant parts of their input when producing each element of the output. Originally introduced to overcome the limitations of fixed-length context vectors in sequence-to-sequence models, attention has become the foundational building block of modern architectures such as the [[Transformer]].&lt;br /&gt;
&lt;br /&gt;
== Motivation ==&lt;br /&gt;
&lt;br /&gt;
Early sequence-to-sequence models encoded an entire input sequence into a single fixed-dimensional vector using a [[Recurrent Neural Networks|recurrent neural network]]. This &#039;&#039;bottleneck&#039;&#039; forced long-range dependencies to be compressed into a vector of constant size, degrading performance on long sequences. Attention resolves this by letting the decoder consult every encoder hidden state at each generation step, weighting them by learned relevance scores.&lt;br /&gt;
&lt;br /&gt;
== Bahdanau (Additive) Attention ==&lt;br /&gt;
&lt;br /&gt;
Bahdanau et al. (2015) proposed the first widely adopted attention mechanism for machine translation. Given encoder hidden states &amp;lt;math&amp;gt;h_1, \dots, h_T&amp;lt;/math&amp;gt; and the decoder state &amp;lt;math&amp;gt;s_{t-1}&amp;lt;/math&amp;gt;, the alignment score is computed as:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;e_{t,i} = v^{\!\top} \tanh(W_s\, s_{t-1} + W_h\, h_i)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;math&amp;gt;W_s&amp;lt;/math&amp;gt;, &amp;lt;math&amp;gt;W_h&amp;lt;/math&amp;gt;, and &amp;lt;math&amp;gt;v&amp;lt;/math&amp;gt; are learned parameters. The attention weights are obtained by applying softmax:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{j=1}^{T} \exp(e_{t,j})}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The context vector is the weighted sum &amp;lt;math&amp;gt;c_t = \sum_{i=1}^{T} \alpha_{t,i}\, h_i&amp;lt;/math&amp;gt;, which is concatenated with &amp;lt;math&amp;gt;s_{t-1}&amp;lt;/math&amp;gt; and fed into the decoder.&lt;br /&gt;
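One decoder step of this mechanism can be sketched directly from the three equations above; the dimensions and random parameters are illustrative assumptions:&lt;br /&gt;
&lt;br /&gt;

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_h, d_s, d_a = 5, 8, 8, 16
H = rng.standard_normal((T, d_h))       # encoder states h_1..h_T
s_prev = rng.standard_normal(d_s)       # decoder state s_{t-1}
W_s = rng.standard_normal((d_a, d_s))
W_h = rng.standard_normal((d_a, d_h))
v = rng.standard_normal(d_a)

e = np.tanh(s_prev @ W_s.T + H @ W_h.T) @ v   # alignment scores e_{t,i}
alpha = np.exp(e - e.max())
alpha /= alpha.sum()                          # softmax -> attention weights
c = alpha @ H                                 # context vector c_t
```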
&lt;br /&gt;
== Luong (Multiplicative) Attention ==&lt;br /&gt;
&lt;br /&gt;
Luong et al. (2015) simplified the scoring function by replacing the additive network with a dot product or a bilinear form:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! Variant !! Score function&lt;br /&gt;
|-&lt;br /&gt;
| Dot || &amp;lt;math&amp;gt;e_{t,i} = s_t^{\!\top} h_i&amp;lt;/math&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| General || &amp;lt;math&amp;gt;e_{t,i} = s_t^{\!\top} W_a\, h_i&amp;lt;/math&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| Concat || &amp;lt;math&amp;gt;e_{t,i} = v^{\!\top} \tanh(W_a [s_t;\, h_i])&amp;lt;/math&amp;gt;&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The dot variant requires encoder and decoder dimensions to match, while the general variant introduces a learnable weight matrix &amp;lt;math&amp;gt;W_a&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Scaled Dot-Product Attention ==&lt;br /&gt;
&lt;br /&gt;
Vaswani et al. (2017) introduced the formulation used in the Transformer. Given matrices of queries &amp;lt;math&amp;gt;Q&amp;lt;/math&amp;gt;, keys &amp;lt;math&amp;gt;K&amp;lt;/math&amp;gt;, and values &amp;lt;math&amp;gt;V&amp;lt;/math&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\!\top}}{\sqrt{d_k}}\right) V&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The scaling factor &amp;lt;math&amp;gt;\sqrt{d_k}&amp;lt;/math&amp;gt; prevents the dot products from growing large in magnitude as the key dimension &amp;lt;math&amp;gt;d_k&amp;lt;/math&amp;gt; increases, which would push the softmax into regions of extremely small gradients.&lt;br /&gt;
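The formula translates almost line-for-line into NumPy (a single-sequence sketch without batching or masking; the max-shift inside the softmax is a standard stability measure, not part of the formula):&lt;br /&gt;
&lt;br /&gt;

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V with a row-wise stable softmax."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V
```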
&lt;br /&gt;
== Self-Attention ==&lt;br /&gt;
&lt;br /&gt;
In &#039;&#039;&#039;self-attention&#039;&#039;&#039;, the queries, keys, and values all derive from the same sequence. Each position attends to every other position (including itself), enabling the model to capture long-range dependencies in a single layer. For an input matrix &amp;lt;math&amp;gt;X \in \mathbb{R}^{n \times d}&amp;lt;/math&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;Q = X W^Q, \quad K = X W^K, \quad V = X W^V&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Self-attention has &amp;lt;math&amp;gt;O(n^2 d)&amp;lt;/math&amp;gt; complexity, which can be expensive for very long sequences. Efficient variants such as sparse attention and linear attention reduce this cost.&lt;br /&gt;
&lt;br /&gt;
== Multi-Head Attention ==&lt;br /&gt;
&lt;br /&gt;
Rather than performing a single attention function, &#039;&#039;&#039;multi-head attention&#039;&#039;&#039; runs &amp;lt;math&amp;gt;h&amp;lt;/math&amp;gt; parallel attention heads with independent projections:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^O&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;math&amp;gt;\mathrm{head}_i = \mathrm{Attention}(Q W_i^Q,\, K W_i^K,\, V W_i^V)&amp;lt;/math&amp;gt;. Each head can learn to attend to different aspects of the input — for example, one head might capture syntactic relationships while another captures semantic ones. Typical configurations use 8 or 16 heads.&lt;br /&gt;
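The head split can be sketched as slicing the model dimension into &amp;lt;math&amp;gt;h&amp;lt;/math&amp;gt; chunks (a self-attention sketch; the weight shapes and slicing scheme are illustrative assumptions rather than the exact Transformer layout):&lt;br /&gt;
&lt;br /&gt;

```python
import numpy as np

def multi_head_attention(X, Wq, Wk, Wv, Wo, h):
    """Project X, attend independently in h head slices, concat, project."""
    n, d = X.shape
    dk = d // h
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    heads = []
    for i in range(h):
        q, k, v = (M[:, i * dk:(i + 1) * dk] for M in (Q, K, V))
        s = q @ k.T / np.sqrt(dk)
        s -= s.max(axis=-1, keepdims=True)
        w = np.exp(s)
        w /= w.sum(axis=-1, keepdims=True)
        heads.append(w @ v)                       # head_i
    return np.concatenate(heads, axis=-1) @ Wo    # Concat(...) W^O
```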
&lt;br /&gt;
== Positional Encoding ==&lt;br /&gt;
&lt;br /&gt;
Because self-attention is permutation-invariant (it treats the input as an unordered set), positional information must be injected explicitly. The original Transformer uses sinusoidal encodings:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\mathrm{PE}(pos, 2i) = \sin\!\left(\frac{pos}{10000^{2i/d}}\right), \quad \mathrm{PE}(pos, 2i+1) = \cos\!\left(\frac{pos}{10000^{2i/d}}\right)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Learned positional embeddings and relative positional encodings (e.g., RoPE, ALiBi) are common alternatives that can generalise better to unseen sequence lengths.&lt;br /&gt;
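The sinusoidal formula above can be generated for all positions at once (assuming an even model dimension &amp;lt;math&amp;gt;d&amp;lt;/math&amp;gt;):&lt;br /&gt;
&lt;br /&gt;

```python
import numpy as np

def sinusoidal_pe(n_pos, d):
    """PE matrix: sin at even indices 2i, cos at odd indices 2i+1."""
    pos = np.arange(n_pos)[:, None]
    i = np.arange(d // 2)[None, :]
    angles = pos / (10000 ** (2 * i / d))
    pe = np.zeros((n_pos, d))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe
```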
&lt;br /&gt;
== Cross-Attention ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Cross-attention&#039;&#039;&#039; is used when queries come from one sequence and keys/values come from another. In encoder-decoder Transformers, the decoder attends to encoder outputs via cross-attention, enabling the model to condition its generation on the full input context.&lt;br /&gt;
&lt;br /&gt;
== Practical Considerations ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Masking&#039;&#039;&#039;: In autoregressive decoding, future positions are masked (set to &amp;lt;math&amp;gt;-\infty&amp;lt;/math&amp;gt; before softmax) to preserve the causal structure.&lt;br /&gt;
* &#039;&#039;&#039;Attention dropout&#039;&#039;&#039;: Dropping attention weights randomly during training acts as a regulariser and reduces overfitting to specific alignment patterns.&lt;br /&gt;
* &#039;&#039;&#039;Key-value caching&#039;&#039;&#039;: During inference, previously computed key and value vectors are cached to avoid redundant computation, significantly speeding up autoregressive generation.&lt;br /&gt;
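The causal masking described in the first point can be sketched as follows (random scores stand in for &amp;lt;math&amp;gt;QK^{\!\top}/\sqrt{d_k}&amp;lt;/math&amp;gt;):&lt;br /&gt;
&lt;br /&gt;

```python
import numpy as np

# Set future positions (above the diagonal) to -inf before softmax,
# so each row attends only to positions <= its own.
n = 4
scores = np.random.default_rng(0).standard_normal((n, n))
mask = np.triu(np.ones((n, n), dtype=bool), k=1)
scores = np.where(mask, -np.inf, scores)
w = np.exp(scores - scores.max(axis=-1, keepdims=True))
w /= w.sum(axis=-1, keepdims=True)   # exp(-inf) = 0, so masked weights vanish
```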
&lt;br /&gt;
== See also ==&lt;br /&gt;
&lt;br /&gt;
* [[Transformer]]&lt;br /&gt;
* [[Recurrent Neural Networks]]&lt;br /&gt;
* [[Sequence-to-sequence models]]&lt;br /&gt;
* [[Self-supervised learning]]&lt;br /&gt;
* [[Softmax Function]]&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
* Bahdanau, D., Cho, K. and Bengio, Y. (2015). &amp;quot;Neural Machine Translation by Jointly Learning to Align and Translate&amp;quot;. &#039;&#039;ICLR&#039;&#039;.&lt;br /&gt;
* Luong, M.-T., Pham, H. and Manning, C. D. (2015). &amp;quot;Effective Approaches to Attention-based Neural Machine Translation&amp;quot;. &#039;&#039;EMNLP&#039;&#039;.&lt;br /&gt;
* Vaswani, A. et al. (2017). &amp;quot;Attention Is All You Need&amp;quot;. &#039;&#039;NeurIPS&#039;&#039;.&lt;br /&gt;
* Shaw, P., Uszkoreit, J. and Vaswani, A. (2018). &amp;quot;Self-Attention with Relative Position Representations&amp;quot;. &#039;&#039;NAACL&#039;&#039;.&lt;br /&gt;
* Su, J. et al. (2021). &amp;quot;RoFormer: Enhanced Transformer with Rotary Position Embedding&amp;quot;. &#039;&#039;arXiv:2104.09864&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
[[Category:Deep Learning]]&lt;br /&gt;
[[Category:Advanced]]&lt;br /&gt;
[[Category:Neural Networks]]&lt;/div&gt;</summary>
		<author><name>DeployBot</name></author>
	</entry>
	<entry>
		<id>https://marovi.ai/index.php?title=Template:GlossaryTranslation&amp;diff=2130</id>
		<title>Template:GlossaryTranslation</title>
		<link rel="alternate" type="text/html" href="https://marovi.ai/index.php?title=Template:GlossaryTranslation&amp;diff=2130"/>
		<updated>2026-04-24T07:06:58Z</updated>

		<summary type="html">&lt;p&gt;DeployBot: [deploy-bot] Deploy template from PR #41 (v1.2.0)&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;noinclude&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;GlossaryTranslation&#039;&#039;&#039; — Template for language-specific glossary subpages.&lt;br /&gt;
&lt;br /&gt;
== Usage ==&lt;br /&gt;
&lt;br /&gt;
Place on subpages like &amp;lt;code&amp;gt;Glossary:Gradient_descent/es&amp;lt;/code&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
{{GlossaryTranslation&lt;br /&gt;
|surface_form=descenso de gradiente&lt;br /&gt;
|aliases=descenso por gradiente, método del gradiente&lt;br /&gt;
|definition_status=reviewed&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
Translated definition goes here (1-3 sentences in the target language).&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Parameters ==&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! Parameter !! Required !! Description&lt;br /&gt;
|-&lt;br /&gt;
| surface_form || Yes || The term as written in this language&lt;br /&gt;
|-&lt;br /&gt;
| aliases || No || Comma-separated alternate phrasings in this language&lt;br /&gt;
|-&lt;br /&gt;
| definition_status || No || draft, reviewed, or verified&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
[[Category:Marovi templates]]&lt;br /&gt;
&amp;lt;/noinclude&amp;gt;&amp;lt;includeonly&amp;gt;&amp;lt;div class=&amp;quot;glossary-translation-box&amp;quot; style=&amp;quot;border:1px solid #c8ccd1; border-radius:4px; padding:12px 16px; margin-bottom:12px; background:#f8f9fa;&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;div style=&amp;quot;font-size:12px; color:#72777d; text-transform:uppercase; letter-spacing:0.5px; margin-bottom:4px;&amp;quot;&amp;gt;Translation{{#if:{{{definition_status|}}}|&amp;amp;ensp;·&amp;amp;ensp;&amp;lt;span style=&amp;quot;{{#ifeq:{{{definition_status|}}}|verified|color:#14866d|{{#ifeq:{{{definition_status|}}}|reviewed|color:#36c|color:#72777d}}}}&amp;quot;&amp;gt;{{{definition_status|}}}&amp;lt;/span&amp;gt;|}}&amp;lt;/div&amp;gt;&lt;br /&gt;
&amp;lt;div style=&amp;quot;font-size:15px; font-weight:600; margin-bottom:4px;&amp;quot;&amp;gt;{{{surface_form|}}}&amp;lt;/div&amp;gt;&lt;br /&gt;
{{#if:{{{aliases|}}}|&amp;lt;div style=&amp;quot;font-size:13px; color:#54595d;&amp;quot;&amp;gt;&amp;lt;strong&amp;gt;Also:&amp;lt;/strong&amp;gt; {{{aliases|}}}&amp;lt;/div&amp;gt;|}}&lt;br /&gt;
&amp;lt;/div&amp;gt;&amp;lt;/includeonly&amp;gt;&lt;/div&gt;</summary>
		<author><name>DeployBot</name></author>
	</entry>
	<entry>
		<id>https://marovi.ai/index.php?title=Template:GlossaryConcept&amp;diff=2129</id>
		<title>Template:GlossaryConcept</title>
		<link rel="alternate" type="text/html" href="https://marovi.ai/index.php?title=Template:GlossaryConcept&amp;diff=2129"/>
		<updated>2026-04-24T07:06:57Z</updated>

		<summary type="html">&lt;p&gt;DeployBot: [deploy-bot] Deploy template from PR #41 (v1.2.0)&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;noinclude&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;GlossaryConcept&#039;&#039;&#039; — Template for glossary entries in the Glossary: namespace.&lt;br /&gt;
&lt;br /&gt;
== Usage ==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
{{GlossaryConcept&lt;br /&gt;
|concept_id=gradient_descent&lt;br /&gt;
|domain=machine learning, optimization&lt;br /&gt;
|aliases=GD, batch gradient descent&lt;br /&gt;
|related=Glossary:Stochastic_gradient_descent, Glossary:Learning_rate&lt;br /&gt;
|article=Gradient descent&lt;br /&gt;
|article_section=&lt;br /&gt;
|difficulty=intermediate&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
Free-text definition goes here (1-3 sentences).&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Parameters ==&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! Parameter !! Required !! Description&lt;br /&gt;
|-&lt;br /&gt;
| concept_id || Yes || Lowercase slug identifier for the concept&lt;br /&gt;
|-&lt;br /&gt;
| domain || Yes || Comma-separated domain tags (e.g. machine learning, linear algebra)&lt;br /&gt;
|-&lt;br /&gt;
| aliases || No || Comma-separated alternate English phrasings&lt;br /&gt;
|-&lt;br /&gt;
| related || No || Comma-separated links to related Glossary: entries&lt;br /&gt;
|-&lt;br /&gt;
| article || No || Title of the main wiki article for this concept&lt;br /&gt;
|-&lt;br /&gt;
| article_section || No || Anchor to a specific section in the article&lt;br /&gt;
|-&lt;br /&gt;
| difficulty || No || beginner, intermediate, or advanced&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
[[Category:Marovi templates]]&lt;br /&gt;
&amp;lt;/noinclude&amp;gt;&amp;lt;includeonly&amp;gt;&amp;lt;div class=&amp;quot;glossary-concept-box&amp;quot; style=&amp;quot;border:1px solid #c8ccd1; border-radius:4px; padding:12px 16px; margin-bottom:12px; background:#f8f9fa;&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;div style=&amp;quot;font-size:12px; color:#72777d; text-transform:uppercase; letter-spacing:0.5px; margin-bottom:4px;&amp;quot;&amp;gt;Glossary{{#if:{{{domain|}}}|&amp;amp;ensp;·&amp;amp;ensp;{{{domain|}}}|}}&amp;lt;/div&amp;gt;&lt;br /&gt;
{{#if:{{{article|}}}|&amp;lt;div style=&amp;quot;margin-bottom:6px;&amp;quot;&amp;gt;📖 [[{{{article}}}{{#if:{{{article_section|}}}|#{{{article_section}}}|}}|Read full article]]&amp;lt;/div&amp;gt;|}}&lt;br /&gt;
{{#if:{{{aliases|}}}|&amp;lt;div style=&amp;quot;font-size:13px; color:#54595d;&amp;quot;&amp;gt;&amp;lt;strong&amp;gt;Also known as:&amp;lt;/strong&amp;gt; {{{aliases|}}}&amp;lt;/div&amp;gt;|}}&lt;br /&gt;
{{#if:{{{related|}}}|&amp;lt;div style=&amp;quot;font-size:13px; color:#54595d; margin-top:4px;&amp;quot;&amp;gt;&amp;lt;strong&amp;gt;Related:&amp;lt;/strong&amp;gt; {{{related|}}}&amp;lt;/div&amp;gt;|}}&lt;br /&gt;
{{#if:{{{difficulty|}}}|&amp;lt;div style=&amp;quot;margin-top:4px;&amp;quot;&amp;gt;&amp;lt;span style=&amp;quot;display:inline-block; background:#eaecf0; color:#54595d; padding:1px 8px; border-radius:3px; font-size:11px;&amp;quot;&amp;gt;{{{difficulty|}}}&amp;lt;/span&amp;gt;&amp;lt;/div&amp;gt;|}}&lt;br /&gt;
&amp;lt;/div&amp;gt;&amp;lt;/includeonly&amp;gt;&lt;/div&gt;</summary>
		<author><name>DeployBot</name></author>
	</entry>
	<entry>
		<id>https://marovi.ai/index.php?title=Template:ArticleMeta&amp;diff=2128</id>
		<title>Template:ArticleMeta</title>
		<link rel="alternate" type="text/html" href="https://marovi.ai/index.php?title=Template:ArticleMeta&amp;diff=2128"/>
		<updated>2026-04-24T07:06:55Z</updated>

		<summary type="html">&lt;p&gt;DeployBot: [deploy-bot] Deploy template from PR #41 (v1.2.0)&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;noinclude&amp;gt;&lt;br /&gt;
== ArticleMeta ==&lt;br /&gt;
Hidden metadata template for Marovi content pages.  Records provenance,&lt;br /&gt;
generation, and review information.  Does not render any visible output.&lt;br /&gt;
&lt;br /&gt;
=== Parameters ===&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! Parameter !! Description !! Example&lt;br /&gt;
|-&lt;br /&gt;
| content_type || paper, article, or summary || paper&lt;br /&gt;
|-&lt;br /&gt;
| source_url || URL of the original source || &amp;lt;nowiki&amp;gt;https://arxiv.org/abs/1706.03762&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| source_language || ISO language code of the source || en&lt;br /&gt;
|-&lt;br /&gt;
| generated_by || Bot key that created the content || claude-opus&lt;br /&gt;
|-&lt;br /&gt;
| generation_date || Date the content was generated || 2026-03-10&lt;br /&gt;
|-&lt;br /&gt;
| last_review_date || Date of last quality review || 2026-03-14&lt;br /&gt;
|-&lt;br /&gt;
| quality_score || Overall quality score (0.00–1.00) || 0.82&lt;br /&gt;
|-&lt;br /&gt;
| review_state || Current review state || reviewed&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=== Usage ===&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
{{ArticleMeta&lt;br /&gt;
| content_type = paper&lt;br /&gt;
| source_url = https://arxiv.org/abs/1706.03762&lt;br /&gt;
| source_language = en&lt;br /&gt;
| generated_by = claude-opus&lt;br /&gt;
| generation_date = 2026-03-10&lt;br /&gt;
| last_review_date =&lt;br /&gt;
| quality_score =&lt;br /&gt;
| review_state = draft&lt;br /&gt;
}}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;/noinclude&amp;gt;&amp;lt;includeonly&amp;gt;&amp;lt;!-- ArticleMeta: content_type={{{content_type|}}} source_url={{{source_url|}}} source_language={{{source_language|}}} generated_by={{{generated_by|}}} generation_date={{{generation_date|}}} last_review_date={{{last_review_date|}}} quality_score={{{quality_score|}}} review_state={{{review_state|}}} --&amp;gt;&amp;lt;/includeonly&amp;gt;&lt;/div&gt;</summary>
		<author><name>DeployBot</name></author>
	</entry>
	<entry>
		<id>https://marovi.ai/index.php?title=Template:SummaryInfobox&amp;diff=2127</id>
		<title>Template:SummaryInfobox</title>
		<link rel="alternate" type="text/html" href="https://marovi.ai/index.php?title=Template:SummaryInfobox&amp;diff=2127"/>
		<updated>2026-04-24T07:06:54Z</updated>

		<summary type="html">&lt;p&gt;DeployBot: [deploy-bot] Deploy template from PR #41 (v1.2.0)&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;noinclude&amp;gt;&lt;br /&gt;
Infobox template for summary subpages. Usage:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
{{SummaryInfobox&lt;br /&gt;
| parent_page = Gradient Descent&lt;br /&gt;
| parent_type = article&lt;br /&gt;
}}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;/noinclude&amp;gt;&amp;lt;includeonly&amp;gt;{| class=&amp;quot;infobox&amp;quot; style=&amp;quot;border: 1px solid #aaa; background: #f9f9f9; padding: 0.5em; float: right; clear: right; margin: 0 0 1em 1em; width: 22em; font-size: 90%;&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! colspan=&amp;quot;2&amp;quot; style=&amp;quot;background: #dcc; text-align: center; font-size: 110%;&amp;quot; | Summary&lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;Full {{{parent_type|article}}}&#039;&#039;&#039;&lt;br /&gt;
| [[{{{parent_page}}}]]&lt;br /&gt;
|}&amp;lt;/includeonly&amp;gt;&lt;/div&gt;</summary>
		<author><name>DeployBot</name></author>
	</entry>
	<entry>
		<id>https://marovi.ai/index.php?title=Template:PaperInfobox&amp;diff=2126</id>
		<title>Template:PaperInfobox</title>
		<link rel="alternate" type="text/html" href="https://marovi.ai/index.php?title=Template:PaperInfobox&amp;diff=2126"/>
		<updated>2026-04-24T07:06:52Z</updated>

		<summary type="html">&lt;p&gt;DeployBot: [deploy-bot] Deploy template from PR #41 (v1.2.0)&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;noinclude&amp;gt;&lt;br /&gt;
Infobox template for Marovi paper articles. Usage:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
{{PaperInfobox&lt;br /&gt;
| topic_area  = NLP&lt;br /&gt;
| difficulty  = Research&lt;br /&gt;
| authors     = Vaswani, A.; Shazeer, N.; Parmar, N.&lt;br /&gt;
| year        = 2017&lt;br /&gt;
| venue       = NeurIPS 2017&lt;br /&gt;
| arxiv_id    = 1706.03762&lt;br /&gt;
| source_url  = https://arxiv.org/abs/1706.03762&lt;br /&gt;
| pdf_url     = https://arxiv.org/pdf/1706.03762&lt;br /&gt;
}}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;/noinclude&amp;gt;&amp;lt;includeonly&amp;gt;{| class=&amp;quot;infobox&amp;quot; style=&amp;quot;border: 1px solid #aaa; background: #f9f9f9; padding: 0.5em; float: right; clear: right; margin: 0 0 1em 1em; width: 22em; font-size: 90%;&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! colspan=&amp;quot;2&amp;quot; style=&amp;quot;background: #cdc; text-align: center; font-size: 110%;&amp;quot; | Research Paper&lt;br /&gt;
|-&lt;br /&gt;
{{#if: {{{authors|}}} |&lt;br /&gt;
{{!}} &#039;&#039;&#039;Authors&#039;&#039;&#039;&lt;br /&gt;
{{!}} {{{authors}}}&lt;br /&gt;
}}&lt;br /&gt;
|-&lt;br /&gt;
{{#if: {{{year|}}} |&lt;br /&gt;
{{!}} &#039;&#039;&#039;Year&#039;&#039;&#039;&lt;br /&gt;
{{!}} {{{year}}}&lt;br /&gt;
}}&lt;br /&gt;
|-&lt;br /&gt;
{{#if: {{{venue|}}} |&lt;br /&gt;
{{!}} &#039;&#039;&#039;Venue&#039;&#039;&#039;&lt;br /&gt;
{{!}} {{{venue}}}&lt;br /&gt;
}}&lt;br /&gt;
|-&lt;br /&gt;
{{#if: {{{topic_area|}}} |&lt;br /&gt;
{{!}} &#039;&#039;&#039;Topic area&#039;&#039;&#039;&lt;br /&gt;
{{!}} [[:Category:{{{topic_area}}}|{{{topic_area}}}]]&lt;br /&gt;
}}&lt;br /&gt;
|-&lt;br /&gt;
{{#if: {{{difficulty|}}} |&lt;br /&gt;
{{!}} &#039;&#039;&#039;Difficulty&#039;&#039;&#039;&lt;br /&gt;
{{!}} [[:Category:{{{difficulty}}}|{{{difficulty}}}]]&lt;br /&gt;
}}&lt;br /&gt;
|-&lt;br /&gt;
{{#if: {{{arxiv_id|}}} |&lt;br /&gt;
{{!}} &#039;&#039;&#039;arXiv&#039;&#039;&#039;&lt;br /&gt;
{{!}} [{{{source_url|https://arxiv.org/abs/{{{arxiv_id}}}}}} {{{arxiv_id}}}]&lt;br /&gt;
}}&lt;br /&gt;
|-&lt;br /&gt;
{{#if: {{{pdf_url|}}} |&lt;br /&gt;
{{!}} &#039;&#039;&#039;PDF&#039;&#039;&#039;&lt;br /&gt;
{{!}} [{{{pdf_url}}} Download PDF]&lt;br /&gt;
}}&lt;br /&gt;
|}&amp;lt;/includeonly&amp;gt;&lt;/div&gt;</summary>
		<author><name>DeployBot</name></author>
	</entry>
	<entry>
		<id>https://marovi.ai/index.php?title=Template:RelatedContent&amp;diff=2125</id>
		<title>Template:RelatedContent</title>
		<link rel="alternate" type="text/html" href="https://marovi.ai/index.php?title=Template:RelatedContent&amp;diff=2125"/>
		<updated>2026-04-24T07:06:51Z</updated>

		<summary type="html">&lt;p&gt;DeployBot: [deploy-bot] Deploy template from PR #41 (v1.2.0)&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;noinclude&amp;gt;&lt;br /&gt;
Displays a related content box at the bottom of a page linking to related articles and papers.&lt;br /&gt;
&lt;br /&gt;
== Usage ==&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
{{RelatedContent&lt;br /&gt;
|articles=Gradient descent, Backpropagation, Learning rate&lt;br /&gt;
|papers=Attention Is All You Need, BERT&lt;br /&gt;
}}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Parameters ==&lt;br /&gt;
* &#039;&#039;&#039;articles&#039;&#039;&#039; — Comma-separated list of related article page names.&lt;br /&gt;
* &#039;&#039;&#039;papers&#039;&#039;&#039; — Comma-separated list of related paper page names.&lt;br /&gt;
&lt;br /&gt;
[[Category:Marovi templates]]&lt;br /&gt;
&amp;lt;/noinclude&amp;gt;&amp;lt;includeonly&amp;gt;&amp;lt;div class=&amp;quot;marovi-related&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;div class=&amp;quot;marovi-related-header&amp;quot;&amp;gt;Related Content&amp;lt;/div&amp;gt;&lt;br /&gt;
&amp;lt;div class=&amp;quot;marovi-related-columns&amp;quot;&amp;gt;{{#if:{{{articles|}}}|&lt;br /&gt;
&amp;lt;div class=&amp;quot;marovi-related-column&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;div class=&amp;quot;marovi-related-column-title&amp;quot;&amp;gt;Articles&amp;lt;/div&amp;gt;&lt;br /&gt;
&amp;lt;ul&amp;gt;{{#arraymap:{{{articles|}}}|,|@@item@@|&amp;lt;li&amp;gt;[[@@item@@]]&amp;lt;/li&amp;gt;|}}&amp;lt;/ul&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;}}{{#if:{{{papers|}}}|&lt;br /&gt;
&amp;lt;div class=&amp;quot;marovi-related-column&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;div class=&amp;quot;marovi-related-column-title&amp;quot;&amp;gt;Papers&amp;lt;/div&amp;gt;&lt;br /&gt;
&amp;lt;ul&amp;gt;{{#arraymap:{{{papers|}}}|,|@@item@@|&amp;lt;li&amp;gt;[[@@item@@]]&amp;lt;/li&amp;gt;|}}&amp;lt;/ul&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;}}&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&amp;lt;/includeonly&amp;gt;&lt;/div&gt;</summary>
		<author><name>DeployBot</name></author>
	</entry>
	<entry>
		<id>https://marovi.ai/index.php?title=Template:TopicNav&amp;diff=2124</id>
		<title>Template:TopicNav</title>
		<link rel="alternate" type="text/html" href="https://marovi.ai/index.php?title=Template:TopicNav&amp;diff=2124"/>
		<updated>2026-04-24T07:06:50Z</updated>

		<summary type="html">&lt;p&gt;DeployBot: [deploy-bot] Deploy template from PR #41 (v1.2.0)&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;noinclude&amp;gt;&lt;br /&gt;
Displays a breadcrumb navigation bar above the article content.&lt;br /&gt;
&lt;br /&gt;
== Usage ==&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
{{TopicNav|field=Machine Learning|subfield=Optimization}}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Parameters ==&lt;br /&gt;
* &#039;&#039;&#039;field&#039;&#039;&#039; — Top-level field of study.&lt;br /&gt;
* &#039;&#039;&#039;subfield&#039;&#039;&#039; — Sub-topic within the field. Optional.&lt;br /&gt;
&lt;br /&gt;
[[Category:Marovi templates]]&lt;br /&gt;
&amp;lt;/noinclude&amp;gt;&amp;lt;includeonly&amp;gt;&amp;lt;div class=&amp;quot;marovi-topicnav&amp;quot;&amp;gt;[[:Category:Articles|Articles]]{{#if:{{{field|}}}|&amp;lt;span class=&amp;quot;marovi-topicnav-separator&amp;quot;&amp;gt;&amp;amp;#8250;&amp;lt;/span&amp;gt;[[:Category:{{{field|}}}|{{{field|}}}]]}}{{#if:{{{subfield|}}}|&amp;lt;span class=&amp;quot;marovi-topicnav-separator&amp;quot;&amp;gt;&amp;amp;#8250;&amp;lt;/span&amp;gt;[[:Category:{{{subfield|}}}|{{{subfield|}}}]]}}&amp;lt;span class=&amp;quot;marovi-topicnav-separator&amp;quot;&amp;gt;&amp;amp;#8250;&amp;lt;/span&amp;gt;&#039;&#039;&#039;{{PAGENAME}}&#039;&#039;&#039;&amp;lt;/div&amp;gt;&amp;lt;/includeonly&amp;gt;&lt;/div&gt;</summary>
		<author><name>DeployBot</name></author>
	</entry>
	<entry>
		<id>https://marovi.ai/index.php?title=Template:Term&amp;diff=2123</id>
		<title>Template:Term</title>
		<link rel="alternate" type="text/html" href="https://marovi.ai/index.php?title=Template:Term&amp;diff=2123"/>
		<updated>2026-04-24T07:06:48Z</updated>

		<summary type="html">&lt;p&gt;DeployBot: [deploy-bot] Deploy template from PR #41 (v1.2.0)&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;noinclude&amp;gt;&lt;br /&gt;
Wraps a technical term with glossary popup support. The term is looked up in [[Module:Glossary/data]] and rendered with a dotted underline. Hovering shows a popup with the definition.&lt;br /&gt;
&lt;br /&gt;
== Usage ==&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
{{Term|SGD}}&lt;br /&gt;
{{Term|SGD|stochastic gradient descent}}&lt;br /&gt;
{{Term|learning rate}}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The first parameter is the lookup key. The optional second parameter is the display text.&lt;br /&gt;
&lt;br /&gt;
[[Category:Marovi templates]]&lt;br /&gt;
&amp;lt;/noinclude&amp;gt;&amp;lt;includeonly&amp;gt;{{#invoke:Glossary|term}}&amp;lt;/includeonly&amp;gt;&lt;/div&gt;</summary>
		<author><name>DeployBot</name></author>
	</entry>
	<entry>
		<id>https://marovi.ai/index.php?title=Template:Paper&amp;diff=2122</id>
		<title>Template:Paper</title>
		<link rel="alternate" type="text/html" href="https://marovi.ai/index.php?title=Template:Paper&amp;diff=2122"/>
		<updated>2026-04-24T07:06:47Z</updated>

		<summary type="html">&lt;p&gt;DeployBot: [deploy-bot] Deploy template from PR #41 (v1.2.0)&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;noinclude&amp;gt;&lt;br /&gt;
Displays a paper metadata infobox on the right side of the page.&lt;br /&gt;
&lt;br /&gt;
== Usage ==&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
{{Paper&lt;br /&gt;
|title=Attention Is All You Need&lt;br /&gt;
|authors=Vaswani et al.&lt;br /&gt;
|year=2017&lt;br /&gt;
|venue=NeurIPS&lt;br /&gt;
|arxiv=1706.03762&lt;br /&gt;
|doi=&lt;br /&gt;
|topics=Transformers, Attention, NLP&lt;br /&gt;
|related_articles=Transformer, Self-Attention&lt;br /&gt;
|languages=es, zh&lt;br /&gt;
}}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Parameters ==&lt;br /&gt;
* &#039;&#039;&#039;title&#039;&#039;&#039; — Paper title. Defaults to page name.&lt;br /&gt;
* &#039;&#039;&#039;authors&#039;&#039;&#039; — Author list.&lt;br /&gt;
* &#039;&#039;&#039;year&#039;&#039;&#039; — Publication year.&lt;br /&gt;
* &#039;&#039;&#039;venue&#039;&#039;&#039; — Conference or journal.&lt;br /&gt;
* &#039;&#039;&#039;arxiv&#039;&#039;&#039; — arXiv identifier (number only, e.g. 1706.03762).&lt;br /&gt;
* &#039;&#039;&#039;doi&#039;&#039;&#039; — DOI (without https://doi.org/ prefix).&lt;br /&gt;
* &#039;&#039;&#039;topics&#039;&#039;&#039; — Comma-separated topics.&lt;br /&gt;
* &#039;&#039;&#039;related_articles&#039;&#039;&#039; — Comma-separated related article page names.&lt;br /&gt;
* &#039;&#039;&#039;languages&#039;&#039;&#039; — Comma-separated ISO codes of languages for which translations are available.&lt;br /&gt;
&lt;br /&gt;
[[Category:Marovi templates]]&lt;br /&gt;
&amp;lt;/noinclude&amp;gt;&amp;lt;includeonly&amp;gt;&amp;lt;div class=&amp;quot;marovi-infobox&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;div class=&amp;quot;marovi-infobox-header marovi-infobox-header--paper&amp;quot;&amp;gt;{{{title|{{PAGENAME}}}}}&amp;lt;/div&amp;gt;&lt;br /&gt;
&amp;lt;div class=&amp;quot;marovi-infobox-body&amp;quot;&amp;gt;{{#if:{{{authors|}}}|&lt;br /&gt;
&amp;lt;div class=&amp;quot;marovi-infobox-row&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;span class=&amp;quot;marovi-infobox-label&amp;quot;&amp;gt;Authors&amp;lt;/span&amp;gt;&lt;br /&gt;
&amp;lt;span class=&amp;quot;marovi-infobox-value&amp;quot;&amp;gt;{{{authors|}}}&amp;lt;/span&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;}}{{#if:{{{year|}}}|&lt;br /&gt;
&amp;lt;div class=&amp;quot;marovi-infobox-row&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;span class=&amp;quot;marovi-infobox-label&amp;quot;&amp;gt;Year&amp;lt;/span&amp;gt;&lt;br /&gt;
&amp;lt;span class=&amp;quot;marovi-infobox-value&amp;quot;&amp;gt;{{{year|}}}&amp;lt;/span&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;}}{{#if:{{{venue|}}}|&lt;br /&gt;
&amp;lt;div class=&amp;quot;marovi-infobox-row&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;span class=&amp;quot;marovi-infobox-label&amp;quot;&amp;gt;Venue&amp;lt;/span&amp;gt;&lt;br /&gt;
&amp;lt;span class=&amp;quot;marovi-infobox-value&amp;quot;&amp;gt;{{{venue|}}}&amp;lt;/span&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;}}{{#if:{{{arxiv|}}}|&lt;br /&gt;
&amp;lt;div class=&amp;quot;marovi-infobox-row&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;span class=&amp;quot;marovi-infobox-label&amp;quot;&amp;gt;arXiv&amp;lt;/span&amp;gt;&lt;br /&gt;
&amp;lt;span class=&amp;quot;marovi-infobox-value&amp;quot;&amp;gt;[https://arxiv.org/abs/{{{arxiv|}}} {{{arxiv|}}}]&amp;lt;/span&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;}}{{#if:{{{doi|}}}|&lt;br /&gt;
&amp;lt;div class=&amp;quot;marovi-infobox-row&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;span class=&amp;quot;marovi-infobox-label&amp;quot;&amp;gt;DOI&amp;lt;/span&amp;gt;&lt;br /&gt;
&amp;lt;span class=&amp;quot;marovi-infobox-value&amp;quot;&amp;gt;[https://doi.org/{{{doi|}}} {{{doi|}}}]&amp;lt;/span&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;}}{{#if:{{{topics|}}}|&lt;br /&gt;
&amp;lt;div class=&amp;quot;marovi-infobox-row&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;span class=&amp;quot;marovi-infobox-label&amp;quot;&amp;gt;Topics&amp;lt;/span&amp;gt;&lt;br /&gt;
&amp;lt;span class=&amp;quot;marovi-infobox-value&amp;quot;&amp;gt;&amp;lt;span class=&amp;quot;marovi-infobox-tags&amp;quot;&amp;gt;{{#arraymap:{{{topics|}}}|,|@@item@@|&amp;lt;span class=&amp;quot;marovi-infobox-tag&amp;quot;&amp;gt;@@item@@&amp;lt;/span&amp;gt;|}}&amp;lt;/span&amp;gt;&amp;lt;/span&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;}}{{#if:{{{related_articles|}}}|&lt;br /&gt;
&amp;lt;div class=&amp;quot;marovi-infobox-links&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;div class=&amp;quot;marovi-infobox-links-label&amp;quot;&amp;gt;Related Articles&amp;lt;/div&amp;gt;&lt;br /&gt;
&amp;lt;ul&amp;gt;{{#arraymap:{{{related_articles|}}}|,|@@item@@|&amp;lt;li&amp;gt;[[@@item@@]]&amp;lt;/li&amp;gt;|}}&amp;lt;/ul&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;}}{{#if:{{{languages|}}}|&lt;br /&gt;
&amp;lt;div class=&amp;quot;marovi-infobox-links&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;div class=&amp;quot;marovi-infobox-links-label&amp;quot;&amp;gt;Languages&amp;lt;/div&amp;gt;&lt;br /&gt;
&amp;lt;ul&amp;gt;{{#arraymap:{{{languages|}}}|,|@@item@@|&amp;lt;li&amp;gt;[[{{PAGENAME}}/@@item@@|@@item@@]]&amp;lt;/li&amp;gt;|}}&amp;lt;/ul&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;}}&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&amp;lt;/includeonly&amp;gt;&lt;/div&gt;</summary>
		<author><name>DeployBot</name></author>
	</entry>
	<entry>
		<id>https://marovi.ai/index.php?title=Template:Summary&amp;diff=2121</id>
		<title>Template:Summary</title>
		<link rel="alternate" type="text/html" href="https://marovi.ai/index.php?title=Template:Summary&amp;diff=2121"/>
		<updated>2026-04-24T07:06:45Z</updated>

		<summary type="html">&lt;p&gt;DeployBot: [deploy-bot] Deploy template from PR #41 (v1.2.0)&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;noinclude&amp;gt;&lt;br /&gt;
Displays an inline summary box at the top of an article or paper page.&lt;br /&gt;
&lt;br /&gt;
== Usage ==&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
{{Summary&lt;br /&gt;
|text=A concise summary of the article content.&lt;br /&gt;
|key_points=First key point; Second key point; Third key point&lt;br /&gt;
}}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Parameters ==&lt;br /&gt;
* &#039;&#039;&#039;text&#039;&#039;&#039; — Summary paragraph(s). Required.&lt;br /&gt;
* &#039;&#039;&#039;key_points&#039;&#039;&#039; — Semicolon-separated list of key takeaways. Optional.&lt;br /&gt;
&lt;br /&gt;
[[Category:Marovi templates]]&lt;br /&gt;
&amp;lt;/noinclude&amp;gt;&amp;lt;includeonly&amp;gt;&amp;lt;div class=&amp;quot;marovi-summary&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;div class=&amp;quot;marovi-summary-label&amp;quot;&amp;gt;Summary&amp;lt;/div&amp;gt;&lt;br /&gt;
&amp;lt;div class=&amp;quot;marovi-summary-text&amp;quot;&amp;gt;{{{text|}}}&amp;lt;/div&amp;gt;{{#if:{{{key_points|}}}|&lt;br /&gt;
&amp;lt;ul class=&amp;quot;marovi-summary-keypoints&amp;quot;&amp;gt;{{#arraymap:{{{key_points|}}}|;|@@item@@|&amp;lt;li&amp;gt;@@item@@&amp;lt;/li&amp;gt;|}}&amp;lt;/ul&amp;gt;}}&lt;br /&gt;
&amp;lt;/div&amp;gt;&amp;lt;/includeonly&amp;gt;&lt;/div&gt;</summary>
		<author><name>DeployBot</name></author>
	</entry>
	<entry>
		<id>https://marovi.ai/index.php?title=Template:Article&amp;diff=2120</id>
		<title>Template:Article</title>
		<link rel="alternate" type="text/html" href="https://marovi.ai/index.php?title=Template:Article&amp;diff=2120"/>
		<updated>2026-04-24T07:06:44Z</updated>

		<summary type="html">&lt;p&gt;DeployBot: [deploy-bot] Deploy template from PR #41 (v1.2.0)&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;noinclude&amp;gt;&lt;br /&gt;
Displays an article metadata infobox on the right side of the page.&lt;br /&gt;
&lt;br /&gt;
== Usage ==&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
{{Article&lt;br /&gt;
|title=Stochastic Gradient Descent&lt;br /&gt;
|field=Machine Learning&lt;br /&gt;
|topics=Optimization, Neural Networks, Gradient Methods&lt;br /&gt;
|related_papers=Attention Is All You Need, BERT&lt;br /&gt;
|languages=es, zh&lt;br /&gt;
}}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Parameters ==&lt;br /&gt;
* &#039;&#039;&#039;title&#039;&#039;&#039; — Display title. Defaults to page name.&lt;br /&gt;
* &#039;&#039;&#039;field&#039;&#039;&#039; — Field of study.&lt;br /&gt;
* &#039;&#039;&#039;topics&#039;&#039;&#039; — Comma-separated list of topics, rendered as tags.&lt;br /&gt;
* &#039;&#039;&#039;related_papers&#039;&#039;&#039; — Comma-separated list of related paper page names.&lt;br /&gt;
* &#039;&#039;&#039;languages&#039;&#039;&#039; — Comma-separated ISO codes of languages for which translations are available.&lt;br /&gt;
&lt;br /&gt;
[[Category:Marovi templates]]&lt;br /&gt;
&amp;lt;/noinclude&amp;gt;&amp;lt;includeonly&amp;gt;&amp;lt;div class=&amp;quot;marovi-infobox&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;div class=&amp;quot;marovi-infobox-header&amp;quot;&amp;gt;{{{title|{{PAGENAME}}}}}&amp;lt;/div&amp;gt;&lt;br /&gt;
&amp;lt;div class=&amp;quot;marovi-infobox-body&amp;quot;&amp;gt;{{#if:{{{field|}}}|&lt;br /&gt;
&amp;lt;div class=&amp;quot;marovi-infobox-row&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;span class=&amp;quot;marovi-infobox-label&amp;quot;&amp;gt;Field&amp;lt;/span&amp;gt;&lt;br /&gt;
&amp;lt;span class=&amp;quot;marovi-infobox-value&amp;quot;&amp;gt;[[:Category:{{{field|}}}|{{{field|}}}]]&amp;lt;/span&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;}}{{#if:{{{topics|}}}|&lt;br /&gt;
&amp;lt;div class=&amp;quot;marovi-infobox-row&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;span class=&amp;quot;marovi-infobox-label&amp;quot;&amp;gt;Topics&amp;lt;/span&amp;gt;&lt;br /&gt;
&amp;lt;span class=&amp;quot;marovi-infobox-value&amp;quot;&amp;gt;&amp;lt;span class=&amp;quot;marovi-infobox-tags&amp;quot;&amp;gt;{{#arraymap:{{{topics|}}}|,|@@item@@|&amp;lt;span class=&amp;quot;marovi-infobox-tag&amp;quot;&amp;gt;@@item@@&amp;lt;/span&amp;gt;|}}&amp;lt;/span&amp;gt;&amp;lt;/span&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;}}{{#if:{{{related_papers|}}}|&lt;br /&gt;
&amp;lt;div class=&amp;quot;marovi-infobox-links&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;div class=&amp;quot;marovi-infobox-links-label&amp;quot;&amp;gt;Related Papers&amp;lt;/div&amp;gt;&lt;br /&gt;
&amp;lt;ul&amp;gt;{{#arraymap:{{{related_papers|}}}|,|@@item@@|&amp;lt;li&amp;gt;[[@@item@@]]&amp;lt;/li&amp;gt;|}}&amp;lt;/ul&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;}}{{#if:{{{languages|}}}|&lt;br /&gt;
&amp;lt;div class=&amp;quot;marovi-infobox-links&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;div class=&amp;quot;marovi-infobox-links-label&amp;quot;&amp;gt;Languages&amp;lt;/div&amp;gt;&lt;br /&gt;
&amp;lt;ul&amp;gt;{{#arraymap:{{{languages|}}}|,|@@item@@|&amp;lt;li&amp;gt;[[{{PAGENAME}}/@@item@@|@@item@@]]&amp;lt;/li&amp;gt;|}}&amp;lt;/ul&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;}}&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&amp;lt;/includeonly&amp;gt;&lt;/div&gt;</summary>
		<author><name>DeployBot</name></author>
	</entry>
</feed>