<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://marovi.ai/index.php?action=history&amp;feed=atom&amp;title=Cross-Entropy_Loss</id>
	<title>Cross-Entropy Loss - Revision history</title>
	<link rel="self" type="application/atom+xml" href="https://marovi.ai/index.php?action=history&amp;feed=atom&amp;title=Cross-Entropy_Loss"/>
	<link rel="alternate" type="text/html" href="https://marovi.ai/index.php?title=Cross-Entropy_Loss&amp;action=history"/>
	<updated>2026-04-24T12:44:37Z</updated>
	<subtitle>Revision history for this page on the wiki</subtitle>
	<generator>MediaWiki 1.39.1</generator>
	<entry>
		<id>https://marovi.ai/index.php?title=Cross-Entropy_Loss&amp;diff=2135&amp;oldid=prev</id>
		<title>DeployBot: [deploy-bot] Deploy from CI (8c92aeb)</title>
		<link rel="alternate" type="text/html" href="https://marovi.ai/index.php?title=Cross-Entropy_Loss&amp;diff=2135&amp;oldid=prev"/>
		<updated>2026-04-24T07:08:58Z</updated>

		<summary type="html">&lt;p&gt;[deploy-bot] Deploy from CI (8c92aeb)&lt;/p&gt;
&lt;table style=&quot;background-color: #fff; color: #202122;&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;Revision as of 07:08, 24 April 2026&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l107&quot;&gt;Line 107:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 107:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[[Category:Machine Learning]]&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[[Category:Machine Learning]]&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[[Category:Intermediate]]&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[[Category:Intermediate]]&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&amp;lt;!--v1.2.0 cache-bust--&gt;&lt;/del&gt;&lt;/div&gt;&lt;/td&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-added&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&amp;lt;!-- pass 2 --&gt;&lt;/del&gt;&lt;/div&gt;&lt;/td&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-added&quot;&gt;&lt;/td&gt;&lt;/tr&gt;

&lt;!-- diff cache key mediawiki:diff::1.12:old-2094:rev-2135 --&gt;
&lt;/table&gt;</summary>
		<author><name>DeployBot</name></author>
	</entry>
	<entry>
		<id>https://marovi.ai/index.php?title=Cross-Entropy_Loss&amp;diff=2094&amp;oldid=prev</id>
		<title>DeployBot: Pass 2 force re-parse</title>
		<link rel="alternate" type="text/html" href="https://marovi.ai/index.php?title=Cross-Entropy_Loss&amp;diff=2094&amp;oldid=prev"/>
		<updated>2026-04-24T07:00:32Z</updated>

		<summary type="html">&lt;p&gt;Pass 2 force re-parse&lt;/p&gt;
&lt;table style=&quot;background-color: #fff; color: #202122;&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;Revision as of 07:00, 24 April 2026&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l108&quot;&gt;Line 108:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 108:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[[Category:Intermediate]]&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[[Category:Intermediate]]&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&amp;lt;!--v1.2.0 cache-bust--&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&amp;lt;!--v1.2.0 cache-bust--&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-deleted&quot;&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&amp;lt;!-- pass 2 --&gt;&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;

&lt;!-- diff cache key mediawiki:diff::1.12:old-2057:rev-2094 --&gt;
&lt;/table&gt;</summary>
		<author><name>DeployBot</name></author>
	</entry>
	<entry>
		<id>https://marovi.ai/index.php?title=Cross-Entropy_Loss&amp;diff=2057&amp;oldid=prev</id>
		<title>DeployBot: Force re-parse after Math source-mode rollout (v1.2.0)</title>
		<link rel="alternate" type="text/html" href="https://marovi.ai/index.php?title=Cross-Entropy_Loss&amp;diff=2057&amp;oldid=prev"/>
		<updated>2026-04-24T06:57:56Z</updated>

		<summary type="html">&lt;p&gt;Force re-parse after Math source-mode rollout (v1.2.0)&lt;/p&gt;
&lt;table style=&quot;background-color: #fff; color: #202122;&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;Revision as of 06:57, 24 April 2026&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l107&quot;&gt;Line 107:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 107:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[[Category:Machine Learning]]&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[[Category:Machine Learning]]&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[[Category:Intermediate]]&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[[Category:Intermediate]]&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-deleted&quot;&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&amp;lt;!--v1.2.0 cache-bust--&gt;&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;

&lt;!-- diff cache key mediawiki:diff::1.12:old-1984:rev-2057 --&gt;
&lt;/table&gt;</summary>
		<author><name>DeployBot</name></author>
	</entry>
	<entry>
		<id>https://marovi.ai/index.php?title=Cross-Entropy_Loss&amp;diff=1984&amp;oldid=prev</id>
		<title>DeployBot: [deploy-bot] Deploy from CI (775ba6e)</title>
		<link rel="alternate" type="text/html" href="https://marovi.ai/index.php?title=Cross-Entropy_Loss&amp;diff=1984&amp;oldid=prev"/>
		<updated>2026-04-24T04:01:41Z</updated>

		<summary type="html">&lt;p&gt;[deploy-bot] Deploy from CI (775ba6e)&lt;/p&gt;
&lt;p&gt;&lt;b&gt;New page&lt;/b&gt;&lt;/p&gt;&lt;div&gt;{{LanguageBar | page = Cross-Entropy Loss}}&lt;br /&gt;
{{ArticleInfobox | topic_area = Machine Learning | difficulty = Intermediate | prerequisites = [[Loss Functions]], [[Softmax Function]]}}&lt;br /&gt;
{{ContentMeta | generated_by = claude-opus | model_used = claude-opus-4-6 | generated_date = 2026-03-13}}&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;Cross-entropy loss&amp;#039;&amp;#039;&amp;#039; (also called &amp;#039;&amp;#039;&amp;#039;log loss&amp;#039;&amp;#039;&amp;#039;) is the most widely used loss function for classification tasks in machine learning. Rooted in information theory, it measures the dissimilarity between the true label distribution and the model&amp;#039;s predicted probability distribution, providing a smooth, differentiable objective that drives probabilistic classifiers toward confident, correct predictions.&lt;br /&gt;
&lt;br /&gt;
== Information-Theoretic Foundations ==&lt;br /&gt;
&lt;br /&gt;
=== Entropy ===&lt;br /&gt;
&lt;br /&gt;
The &amp;#039;&amp;#039;&amp;#039;entropy&amp;#039;&amp;#039;&amp;#039; of a discrete probability distribution &amp;lt;math&amp;gt;p&amp;lt;/math&amp;gt; quantifies its uncertainty:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;H(p) = -\sum_{k=1}^{K} p_k \log p_k&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For a deterministic distribution (one-hot label), &amp;lt;math&amp;gt;H(p) = 0&amp;lt;/math&amp;gt;. Entropy is maximized when all outcomes are equally likely.&lt;br /&gt;
&lt;br /&gt;
=== KL Divergence ===&lt;br /&gt;
&lt;br /&gt;
The &amp;#039;&amp;#039;&amp;#039;Kullback-Leibler divergence&amp;#039;&amp;#039;&amp;#039; measures how one distribution &amp;lt;math&amp;gt;q&amp;lt;/math&amp;gt; differs from a reference distribution &amp;lt;math&amp;gt;p&amp;lt;/math&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;D_{\mathrm{KL}}(p \,\|\, q) = \sum_{k=1}^{K} p_k \log \frac{p_k}{q_k}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
KL divergence is non-negative and equals zero if and only if &amp;lt;math&amp;gt;p = q&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
=== Cross-Entropy ===&lt;br /&gt;
&lt;br /&gt;
The &amp;#039;&amp;#039;&amp;#039;cross-entropy&amp;#039;&amp;#039;&amp;#039; between distributions &amp;lt;math&amp;gt;p&amp;lt;/math&amp;gt; (true) and &amp;lt;math&amp;gt;q&amp;lt;/math&amp;gt; (predicted) is:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;H(p, q) = -\sum_{k=1}^{K} p_k \log q_k = H(p) + D_{\mathrm{KL}}(p \,\|\, q)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Since &amp;lt;math&amp;gt;H(p)&amp;lt;/math&amp;gt; is constant with respect to model parameters, minimizing cross-entropy is equivalent to minimizing KL divergence — i.e., making the predicted distribution &amp;lt;math&amp;gt;q&amp;lt;/math&amp;gt; as close to the true distribution &amp;lt;math&amp;gt;p&amp;lt;/math&amp;gt; as possible.&lt;br /&gt;
&lt;br /&gt;
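This decomposition can be checked directly. A short sketch using NumPy (the two distributions below are arbitrary illustrative values) computes &amp;lt;math&amp;gt;H(p)&amp;lt;/math&amp;gt;, &amp;lt;math&amp;gt;D_{\mathrm{KL}}(p \,\|\, q)&amp;lt;/math&amp;gt;, and &amp;lt;math&amp;gt;H(p, q)&amp;lt;/math&amp;gt; and confirms the identity:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;python&amp;quot;&amp;gt;&lt;br /&gt;
import numpy as np&lt;br /&gt;
&lt;br /&gt;
# Arbitrary example distributions over K = 3 classes&lt;br /&gt;
p = np.array([0.7, 0.2, 0.1])   # true distribution&lt;br /&gt;
q = np.array([0.5, 0.3, 0.2])   # predicted distribution&lt;br /&gt;
&lt;br /&gt;
entropy = -np.sum(p * np.log(p))         # H(p)&lt;br /&gt;
kl = np.sum(p * np.log(p / q))           # D_KL(p || q)&lt;br /&gt;
cross_entropy = -np.sum(p * np.log(q))   # H(p, q)&lt;br /&gt;
&lt;br /&gt;
# Cross-entropy decomposes into entropy plus KL divergence&lt;br /&gt;
assert np.isclose(cross_entropy, entropy + kl)&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;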
== Binary Cross-Entropy ==&lt;br /&gt;
&lt;br /&gt;
For binary classification with true label &amp;lt;math&amp;gt;y \in \{0, 1\}&amp;lt;/math&amp;gt; and predicted probability &amp;lt;math&amp;gt;\hat{y} = \sigma(z)&amp;lt;/math&amp;gt; (where &amp;lt;math&amp;gt;\sigma&amp;lt;/math&amp;gt; is the logistic sigmoid, the two-class special case of the [[Softmax Function]]):&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\mathcal{L}_{\mathrm{BCE}} = -\bigl[y \log \hat{y} + (1 - y) \log(1 - \hat{y})\bigr]&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Over a dataset of &amp;lt;math&amp;gt;N&amp;lt;/math&amp;gt; samples:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \bigl[y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i)\bigr]&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The gradient with respect to the logit &amp;lt;math&amp;gt;z&amp;lt;/math&amp;gt; takes the elegantly simple form &amp;lt;math&amp;gt;\hat{y} - y&amp;lt;/math&amp;gt;, which is both intuitive and computationally efficient.&lt;br /&gt;
&lt;br /&gt;
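As a sketch (the labels and logits below are arbitrary), the loss and the &amp;lt;math&amp;gt;\hat{y} - y&amp;lt;/math&amp;gt; gradient can be written in a few lines of NumPy and verified with a finite-difference check:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;python&amp;quot;&amp;gt;&lt;br /&gt;
import numpy as np&lt;br /&gt;
&lt;br /&gt;
def sigmoid(z):&lt;br /&gt;
    return 1.0 / (1.0 + np.exp(-z))&lt;br /&gt;
&lt;br /&gt;
def bce(y, z):&lt;br /&gt;
    # Binary cross-entropy for labels y in {0, 1} and raw logits z&lt;br /&gt;
    y_hat = sigmoid(z)&lt;br /&gt;
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))&lt;br /&gt;
&lt;br /&gt;
y = np.array([1.0, 0.0, 1.0])    # arbitrary labels&lt;br /&gt;
z = np.array([2.0, -1.0, 0.5])   # arbitrary logits&lt;br /&gt;
&lt;br /&gt;
# Analytic per-sample gradient with respect to each logit: y_hat - y&lt;br /&gt;
grad = sigmoid(z) - y&lt;br /&gt;
&lt;br /&gt;
# Finite-difference check on the first logit (mean loss divides by N)&lt;br /&gt;
eps = 1e-6&lt;br /&gt;
z_plus = z.copy()&lt;br /&gt;
z_plus[0] += eps&lt;br /&gt;
z_minus = z.copy()&lt;br /&gt;
z_minus[0] -= eps&lt;br /&gt;
numeric = (bce(y, z_plus) - bce(y, z_minus)) / (2 * eps)&lt;br /&gt;
assert np.isclose(numeric, grad[0] / len(y), atol=1e-5)&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;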
== Categorical Cross-Entropy ==&lt;br /&gt;
&lt;br /&gt;
For multi-class classification with &amp;lt;math&amp;gt;K&amp;lt;/math&amp;gt; classes, the true label is typically a one-hot vector &amp;lt;math&amp;gt;\mathbf{y}&amp;lt;/math&amp;gt; with &amp;lt;math&amp;gt;y_c = 1&amp;lt;/math&amp;gt; for the correct class &amp;lt;math&amp;gt;c&amp;lt;/math&amp;gt;. The predicted probabilities &amp;lt;math&amp;gt;\hat{\mathbf{y}}&amp;lt;/math&amp;gt; are obtained via the [[Softmax Function]]:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\mathcal{L}_{\mathrm{CE}} = -\sum_{k=1}^{K} y_k \log \hat{y}_k = -\log \hat{y}_c&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This reduces to the negative log-probability of the correct class, which is why categorical cross-entropy is also called &amp;#039;&amp;#039;&amp;#039;negative log-likelihood&amp;#039;&amp;#039;&amp;#039; in this context.&lt;br /&gt;
&lt;br /&gt;
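A brief NumPy sketch (arbitrary predicted probabilities) shows the one-hot sum collapsing to the single &amp;lt;math&amp;gt;-\log \hat{y}_c&amp;lt;/math&amp;gt; term:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;python&amp;quot;&amp;gt;&lt;br /&gt;
import numpy as np&lt;br /&gt;
&lt;br /&gt;
y = np.array([0.0, 1.0, 0.0])          # one-hot target, correct class c = 1&lt;br /&gt;
y_hat = np.array([0.2, 0.7, 0.1])      # arbitrary predicted probabilities&lt;br /&gt;
&lt;br /&gt;
full_sum = -np.sum(y * np.log(y_hat))  # -sum_k y_k log y_hat_k&lt;br /&gt;
single_term = -np.log(y_hat[1])        # -log y_hat_c&lt;br /&gt;
assert np.isclose(full_sum, single_term)&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;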
== Numerical Stability ==&lt;br /&gt;
&lt;br /&gt;
=== The Log-Sum-Exp Trick ===&lt;br /&gt;
&lt;br /&gt;
Naively computing &amp;lt;math&amp;gt;\log(\mathrm{softmax}(z_k))&amp;lt;/math&amp;gt; involves exponentiating potentially large logits, causing overflow. The &amp;#039;&amp;#039;&amp;#039;log-sum-exp&amp;#039;&amp;#039;&amp;#039; trick avoids this:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\log \hat{y}_k = z_k - \log \sum_{j=1}^{K} e^{z_j} = z_k - \left(m + \log \sum_{j=1}^{K} e^{z_j - m}\right)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;math&amp;gt;m = \max_j z_j&amp;lt;/math&amp;gt;. Subtracting the maximum logit ensures the largest exponent is zero, preventing overflow. All major deep learning frameworks implement this fused operation (e.g., PyTorch&amp;#039;s &amp;lt;code&amp;gt;CrossEntropyLoss&amp;lt;/code&amp;gt; accepts raw logits).&lt;br /&gt;
&lt;br /&gt;
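A sketch of the trick in NumPy (the logits below are deliberately large, so naive exponentiation would overflow):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;python&amp;quot;&amp;gt;&lt;br /&gt;
import numpy as np&lt;br /&gt;
&lt;br /&gt;
def log_softmax(z):&lt;br /&gt;
    # Numerically stable log-softmax via the log-sum-exp trick&lt;br /&gt;
    m = np.max(z)&lt;br /&gt;
    return z - (m + np.log(np.sum(np.exp(z - m))))&lt;br /&gt;
&lt;br /&gt;
z = np.array([1000.0, 998.0, 995.0])   # naive np.exp(z) would overflow&lt;br /&gt;
log_probs = log_softmax(z)&lt;br /&gt;
&lt;br /&gt;
# Cross-entropy for correct class c = 0 is simply -log_probs[0]&lt;br /&gt;
loss = -log_probs[0]&lt;br /&gt;
assert np.isfinite(loss)&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;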
=== Clamping ===&lt;br /&gt;
&lt;br /&gt;
Predicted probabilities should be clamped away from exactly 0 and 1 to avoid &amp;lt;math&amp;gt;\log(0) = -\infty&amp;lt;/math&amp;gt;. A small epsilon (e.g., &amp;lt;math&amp;gt;10^{-7}&amp;lt;/math&amp;gt;) is typically used.&lt;br /&gt;
&lt;br /&gt;
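When the loss is computed from probabilities rather than raw logits, a single clip suffices (the epsilon below is a common but arbitrary choice):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;python&amp;quot;&amp;gt;&lt;br /&gt;
import numpy as np&lt;br /&gt;
&lt;br /&gt;
eps = 1e-7&lt;br /&gt;
y_hat = np.array([0.0, 0.3, 1.0])             # probabilities at the boundaries&lt;br /&gt;
y_hat_safe = np.clip(y_hat, eps, 1.0 - eps)   # keep log() finite&lt;br /&gt;
loss = -np.log(y_hat_safe)                    # no -inf values&lt;br /&gt;
assert np.all(np.isfinite(loss))&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;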
== Label Smoothing ==&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;Label smoothing&amp;#039;&amp;#039;&amp;#039; (Szegedy et al., 2016) replaces the hard one-hot target with a soft distribution:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;y_k^{\mathrm{smooth}} = (1 - \alpha)\, y_k + \frac{\alpha}{K}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;math&amp;gt;\alpha&amp;lt;/math&amp;gt; is a small constant (commonly 0.1). This prevents the model from becoming overconfident, improves calibration, and often yields better generalization. It is standard practice in training large image classifiers and Transformer models.&lt;br /&gt;
&lt;br /&gt;
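A sketch of the smoothing formula for an illustrative case with &amp;lt;math&amp;gt;K = 4&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;\alpha = 0.1&amp;lt;/math&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;python&amp;quot;&amp;gt;&lt;br /&gt;
import numpy as np&lt;br /&gt;
&lt;br /&gt;
K = 4&lt;br /&gt;
alpha = 0.1&lt;br /&gt;
y = np.array([0.0, 0.0, 1.0, 0.0])       # hard one-hot target&lt;br /&gt;
&lt;br /&gt;
y_smooth = (1 - alpha) * y + alpha / K   # soft target&lt;br /&gt;
# Result: [0.025, 0.025, 0.925, 0.025], still a valid distribution&lt;br /&gt;
assert np.isclose(np.sum(y_smooth), 1.0)&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;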
== Comparison with Other Losses ==&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! Loss !! Formula !! Typical use&lt;br /&gt;
|-&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;Cross-entropy&amp;#039;&amp;#039;&amp;#039; || &amp;lt;math&amp;gt;-\sum y_k \log \hat{y}_k&amp;lt;/math&amp;gt; || Classification&lt;br /&gt;
|-&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;Mean squared error&amp;#039;&amp;#039;&amp;#039; || &amp;lt;math&amp;gt;\frac{1}{K}\sum(y_k - \hat{y}_k)^2&amp;lt;/math&amp;gt; || Regression (poor for classification)&lt;br /&gt;
|-&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;Hinge loss&amp;#039;&amp;#039;&amp;#039; || &amp;lt;math&amp;gt;\max(0, 1 - y \cdot z)&amp;lt;/math&amp;gt; || SVM-style classification (labels &amp;lt;math&amp;gt;y \in \{-1, +1\}&amp;lt;/math&amp;gt;)&lt;br /&gt;
|-&lt;br /&gt;
| &amp;#039;&amp;#039;&amp;#039;Focal loss&amp;#039;&amp;#039;&amp;#039; || &amp;lt;math&amp;gt;-(1-\hat{y}_c)^\gamma \log \hat{y}_c&amp;lt;/math&amp;gt; || Imbalanced classification&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Cross-entropy has steeper gradients than MSE when the prediction is confidently wrong, leading to faster correction of large errors.&lt;br /&gt;
&lt;br /&gt;
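A quick numerical illustration of this point (a single example with &amp;lt;math&amp;gt;y = 1&amp;lt;/math&amp;gt; and a confidently wrong &amp;lt;math&amp;gt;\hat{y} = 0.01&amp;lt;/math&amp;gt;, comparing gradient magnitudes with respect to &amp;lt;math&amp;gt;\hat{y}&amp;lt;/math&amp;gt;):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;python&amp;quot;&amp;gt;&lt;br /&gt;
y = 1.0        # true label&lt;br /&gt;
y_hat = 0.01   # confidently wrong prediction&lt;br /&gt;
&lt;br /&gt;
# Gradient magnitudes with respect to y_hat&lt;br /&gt;
ce_grad = abs(-y / y_hat)         # d/dy_hat of -log(y_hat) is -1/y_hat, so 100&lt;br /&gt;
mse_grad = abs(2 * (y_hat - y))   # d/dy_hat of (y_hat - y)^2, so about 2&lt;br /&gt;
&lt;br /&gt;
# Cross-entropy pushes back roughly 50x harder on this error&lt;br /&gt;
assert ce_grad &amp;gt; mse_grad&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;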
== See also ==&lt;br /&gt;
&lt;br /&gt;
* [[Loss Functions]]&lt;br /&gt;
* [[Softmax Function]]&lt;br /&gt;
* [[Logistic regression]]&lt;br /&gt;
* [[Information theory]]&lt;br /&gt;
* [[Neural Networks]]&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
* Shannon, C. E. (1948). &amp;quot;A Mathematical Theory of Communication&amp;quot;. &amp;#039;&amp;#039;Bell System Technical Journal&amp;#039;&amp;#039;.&lt;br /&gt;
* Goodfellow, I., Bengio, Y. and Courville, A. (2016). &amp;#039;&amp;#039;Deep Learning&amp;#039;&amp;#039;. MIT Press, Chapter 6.&lt;br /&gt;
* Szegedy, C. et al. (2016). &amp;quot;Rethinking the Inception Architecture for Computer Vision&amp;quot;. &amp;#039;&amp;#039;CVPR&amp;#039;&amp;#039;.&lt;br /&gt;
* Lin, T.-Y. et al. (2017). &amp;quot;Focal Loss for Dense Object Detection&amp;quot;. &amp;#039;&amp;#039;ICCV&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
[[Category:Machine Learning]]&lt;br /&gt;
[[Category:Intermediate]]&lt;/div&gt;</summary>
		<author><name>DeployBot</name></author>
	</entry>
</feed>