<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://marovi.ai/index.php?action=history&amp;feed=atom&amp;title=Softmax_Function</id>
	<title>Softmax Function - Revision history</title>
	<link rel="self" type="application/atom+xml" href="https://marovi.ai/index.php?action=history&amp;feed=atom&amp;title=Softmax_Function"/>
	<link rel="alternate" type="text/html" href="https://marovi.ai/index.php?title=Softmax_Function&amp;action=history"/>
	<updated>2026-04-24T11:31:34Z</updated>
	<subtitle>Revision history for this page on the wiki</subtitle>
	<generator>MediaWiki 1.39.1</generator>
	<entry>
		<id>https://marovi.ai/index.php?title=Softmax_Function&amp;diff=2143&amp;oldid=prev</id>
		<title>DeployBot: [deploy-bot] Deploy from CI (8c92aeb)</title>
		<link rel="alternate" type="text/html" href="https://marovi.ai/index.php?title=Softmax_Function&amp;diff=2143&amp;oldid=prev"/>
		<updated>2026-04-24T07:08:59Z</updated>

		<summary type="html">&lt;p&gt;[deploy-bot] Deploy from CI (8c92aeb)&lt;/p&gt;
&lt;table style=&quot;background-color: #fff; color: #202122;&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;Revision as of 07:08, 24 April 2026&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l103&quot;&gt;Line 103:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 103:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[[Category:Machine Learning]]&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[[Category:Machine Learning]]&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[[Category:Introductory]]&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[[Category:Introductory]]&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&amp;lt;!--v1.2.0 cache-bust--&gt;&lt;/del&gt;&lt;/div&gt;&lt;/td&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-added&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&amp;lt;!-- pass 2 --&gt;&lt;/del&gt;&lt;/div&gt;&lt;/td&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-added&quot;&gt;&lt;/td&gt;&lt;/tr&gt;

&lt;!-- diff cache key mediawiki:diff::1.12:old-2112:rev-2143 --&gt;
&lt;/table&gt;</summary>
		<author><name>DeployBot</name></author>
	</entry>
	<entry>
		<id>https://marovi.ai/index.php?title=Softmax_Function&amp;diff=2112&amp;oldid=prev</id>
		<title>DeployBot: Pass 2 force re-parse</title>
		<link rel="alternate" type="text/html" href="https://marovi.ai/index.php?title=Softmax_Function&amp;diff=2112&amp;oldid=prev"/>
		<updated>2026-04-24T07:01:14Z</updated>

		<summary type="html">&lt;p&gt;Pass 2 force re-parse&lt;/p&gt;
&lt;table style=&quot;background-color: #fff; color: #202122;&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;Revision as of 07:01, 24 April 2026&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l104&quot;&gt;Line 104:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 104:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[[Category:Introductory]]&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[[Category:Introductory]]&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&amp;lt;!--v1.2.0 cache-bust--&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&amp;lt;!--v1.2.0 cache-bust--&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-deleted&quot;&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&amp;lt;!-- pass 2 --&gt;&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;

&lt;!-- diff cache key mediawiki:diff::1.12:old-2075:rev-2112 --&gt;
&lt;/table&gt;</summary>
		<author><name>DeployBot</name></author>
	</entry>
	<entry>
		<id>https://marovi.ai/index.php?title=Softmax_Function&amp;diff=2075&amp;oldid=prev</id>
		<title>DeployBot: Force re-parse after Math source-mode rollout (v1.2.0)</title>
		<link rel="alternate" type="text/html" href="https://marovi.ai/index.php?title=Softmax_Function&amp;diff=2075&amp;oldid=prev"/>
		<updated>2026-04-24T06:58:37Z</updated>

		<summary type="html">&lt;p&gt;Force re-parse after Math source-mode rollout (v1.2.0)&lt;/p&gt;
&lt;table style=&quot;background-color: #fff; color: #202122;&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;Revision as of 06:58, 24 April 2026&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l103&quot;&gt;Line 103:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 103:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[[Category:Machine Learning]]&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[[Category:Machine Learning]]&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[[Category:Introductory]]&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[[Category:Introductory]]&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-deleted&quot;&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&amp;lt;!--v1.2.0 cache-bust--&gt;&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;

&lt;!-- diff cache key mediawiki:diff::1.12:old-1992:rev-2075 --&gt;
&lt;/table&gt;</summary>
		<author><name>DeployBot</name></author>
	</entry>
	<entry>
		<id>https://marovi.ai/index.php?title=Softmax_Function&amp;diff=1992&amp;oldid=prev</id>
		<title>DeployBot: [deploy-bot] Deploy from CI (775ba6e)</title>
		<link rel="alternate" type="text/html" href="https://marovi.ai/index.php?title=Softmax_Function&amp;diff=1992&amp;oldid=prev"/>
		<updated>2026-04-24T04:01:44Z</updated>

		<summary type="html">&lt;p&gt;[deploy-bot] Deploy from CI (775ba6e)&lt;/p&gt;
&lt;p&gt;&lt;b&gt;New page&lt;/b&gt;&lt;/p&gt;&lt;div&gt;{{LanguageBar | page = Softmax Function}}&lt;br /&gt;
{{ArticleInfobox | topic_area = Machine Learning | difficulty = Introductory | prerequisites = }}&lt;br /&gt;
{{ContentMeta | generated_by = claude-opus | model_used = claude-opus-4-6 | generated_date = 2026-03-13}}&lt;br /&gt;
&lt;br /&gt;
The &amp;#039;&amp;#039;&amp;#039;softmax function&amp;#039;&amp;#039;&amp;#039; (also called the &amp;#039;&amp;#039;&amp;#039;normalized exponential function&amp;#039;&amp;#039;&amp;#039;) is a mathematical function that converts a vector of real numbers (&amp;#039;&amp;#039;&amp;#039;logits&amp;#039;&amp;#039;&amp;#039;) into a probability distribution. It is the standard output activation for multi-class classification in neural networks and plays a central role in models ranging from logistic regression to large language models.&lt;br /&gt;
&lt;br /&gt;
== Definition ==&lt;br /&gt;
&lt;br /&gt;
Given a vector of logits &amp;lt;math&amp;gt;\mathbf{z} = (z_1, z_2, \dots, z_K)&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;K&amp;lt;/math&amp;gt; classes, the softmax function produces:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\sigma(\mathbf{z})_k = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}, \qquad k = 1, \dots, K&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output satisfies two properties that make it a valid probability distribution:&lt;br /&gt;
&lt;br /&gt;
# &amp;lt;math&amp;gt;\sigma(\mathbf{z})_k &amp;gt; 0&amp;lt;/math&amp;gt; for all &amp;lt;math&amp;gt;k&amp;lt;/math&amp;gt; (since the exponential is always positive).&lt;br /&gt;
# &amp;lt;math&amp;gt;\sum_{k=1}^{K} \sigma(\mathbf{z})_k = 1&amp;lt;/math&amp;gt; (by construction).&lt;br /&gt;
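&lt;br /&gt;
The definition can be sketched in a few lines of Python using only the standard library. The function name softmax_from_definition and the sample logits are illustrative assumptions, and this direct form is not protected against overflow for large logits.&lt;br /&gt;
&lt;br /&gt;
```python
import math


def softmax_from_definition(logits):
    """Apply the softmax definition directly: exp(z_k) / sum_j exp(z_j)."""
    exps = [math.exp(z) for z in logits]  # every term is strictly positive
    total = sum(exps)                     # normalizing constant
    return [e / total for e in exps]


probs = softmax_from_definition([2.0, 1.0, 0.1])  # illustrative logits
total_mass = sum(probs)                           # equals 1 up to rounding
```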
&lt;br /&gt;
== Intuition ==&lt;br /&gt;
&lt;br /&gt;
The softmax function amplifies differences between logits. A logit that is larger than its peers receives a disproportionately large share of the probability mass because the exponential function grows super-linearly. For example:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! Logits !! Softmax output&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;math&amp;gt;(2.0,\; 1.0,\; 0.1)&amp;lt;/math&amp;gt; || &amp;lt;math&amp;gt;(0.659,\; 0.242,\; 0.099)&amp;lt;/math&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;math&amp;gt;(5.0,\; 1.0,\; 0.1)&amp;lt;/math&amp;gt; || &amp;lt;math&amp;gt;(0.975,\; 0.018,\; 0.007)&amp;lt;/math&amp;gt;&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
As the gap between the largest logit and the others increases, the output approaches a one-hot vector. This &amp;quot;winner-take-most&amp;quot; behavior makes softmax well-suited for classification where a single class should dominate.&lt;br /&gt;
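&lt;br /&gt;
One way to see the amplification, as a short worked step: the ratio of two softmax outputs depends only on the difference between their logits, because the shared denominator cancels.&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\frac{\sigma(\mathbf{z})_i}{\sigma(\mathbf{z})_j} = \frac{e^{z_i}}{e^{z_j}} = e^{z_i - z_j}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For the logits in the first row of the table, the two largest values differ by 1, so their probabilities differ by a factor of &amp;lt;math&amp;gt;e^{1} \approx 2.72&amp;lt;/math&amp;gt;. In the second row they differ by 4, giving a factor of &amp;lt;math&amp;gt;e^{4} \approx 54.6&amp;lt;/math&amp;gt;.&lt;br /&gt;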
&lt;br /&gt;
== Temperature Parameter ==&lt;br /&gt;
&lt;br /&gt;
A &amp;#039;&amp;#039;&amp;#039;temperature&amp;#039;&amp;#039;&amp;#039; parameter &amp;lt;math&amp;gt;T &amp;gt; 0&amp;lt;/math&amp;gt; controls the sharpness of the distribution:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\sigma(\mathbf{z}; T)_k = \frac{e^{z_k / T}}{\sum_{j=1}^{K} e^{z_j / T}}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;math&amp;gt;T \to 0^+&amp;lt;/math&amp;gt;: The distribution collapses to a one-hot vector selecting the argmax (assuming a unique largest logit), which is equivalent to a hard decision.&lt;br /&gt;
* &amp;lt;math&amp;gt;T = 1&amp;lt;/math&amp;gt;: Standard softmax.&lt;br /&gt;
* &amp;lt;math&amp;gt;T \to \infty&amp;lt;/math&amp;gt;: The distribution approaches uniform, so all classes become equally likely.&lt;br /&gt;
&lt;br /&gt;
Temperature scaling is widely used in knowledge distillation (Hinton et al., 2015), where a &amp;quot;soft&amp;quot; distribution from a teacher model provides richer training signal than hard labels. It is also used to control randomness in text generation from language models.&lt;br /&gt;
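&lt;br /&gt;
The three regimes listed above can be illustrated with a small Python sketch. The function name softmax_with_temperature, the sample logits, and the specific temperatures 0.05 and 100 are assumptions chosen for illustration; the sketch does not validate that the temperature is positive.&lt;br /&gt;
&lt;br /&gt;
```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax of logits / T; T must be positive (T = 1 is standard softmax)."""
    scaled = [z / temperature for z in logits]
    shift = max(scaled)  # subtract the max so math.exp cannot overflow
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # illustrative values
sharp = softmax_with_temperature(logits, temperature=0.05)    # close to one-hot
standard = softmax_with_temperature(logits, temperature=1.0)  # plain softmax
flat = softmax_with_temperature(logits, temperature=100.0)    # close to uniform
```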
&lt;br /&gt;
== Numerical Stability ==&lt;br /&gt;
&lt;br /&gt;
A naive implementation of softmax can overflow when logits are large (e.g., &amp;lt;math&amp;gt;e^{1000}&amp;lt;/math&amp;gt; exceeds the largest finite double-precision value). The standard fix subtracts the maximum logit:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\sigma(\mathbf{z})_k = \frac{e^{z_k - m}}{\sum_{j=1}^{K} e^{z_j - m}}, \qquad m = \max_j z_j&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This is mathematically equivalent (the constant factor &amp;lt;math&amp;gt;e^{-m}&amp;lt;/math&amp;gt; cancels between numerator and denominator) but ensures the largest exponentiated term is &amp;lt;math&amp;gt;e^0 = 1&amp;lt;/math&amp;gt;, preventing overflow. Major deep learning frameworks typically implement this stabilized version automatically.&lt;br /&gt;
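&lt;br /&gt;
A minimal Python sketch of the max-subtraction trick follows. The name stable_softmax and the sample logits are assumptions for illustration. Note that the standard-library math.exp signals overflow by raising OverflowError rather than returning infinity, which the sketch uses to show that the naive first step fails.&lt;br /&gt;
&lt;br /&gt;
```python
import math


def stable_softmax(logits):
    """Softmax with the maximum logit subtracted before exponentiating."""
    m = max(logits)                           # m = max_j z_j
    exps = [math.exp(z - m) for z in logits]  # largest term is exp(0) = 1
    total = sum(exps)
    return [e / total for e in exps]


large = [1000.0, 1001.0, 1002.0]  # illustrative logits
try:
    math.exp(large[0])            # the naive first step
    naive_overflows = False
except OverflowError:             # raised by math.exp when the result overflows
    naive_overflows = True

probs = stable_softmax(large)     # same result as for [0.0, 1.0, 2.0]
```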
&lt;br /&gt;
== Relationship to Sigmoid ==&lt;br /&gt;
&lt;br /&gt;
For the special case of &amp;lt;math&amp;gt;K = 2&amp;lt;/math&amp;gt; classes, the softmax function reduces to the &amp;#039;&amp;#039;&amp;#039;sigmoid&amp;#039;&amp;#039;&amp;#039; (logistic) function. If we define &amp;lt;math&amp;gt;z = z_1 - z_2&amp;lt;/math&amp;gt;, then:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\sigma(\mathbf{z})_1 = \frac{e^{z_1}}{e^{z_1} + e^{z_2}} = \frac{1}{1 + e^{-z}} = \sigma_{\mathrm{sigmoid}}(z)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This is why binary classifiers typically use a single output neuron with a sigmoid activation rather than two neurons with softmax: the two formulations produce the same class probabilities.&lt;br /&gt;
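&lt;br /&gt;
The middle step of the identity above can be spelled out by dividing the numerator and denominator by &amp;lt;math&amp;gt;e^{z_1}&amp;lt;/math&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\frac{e^{z_1}}{e^{z_1} + e^{z_2}} = \frac{1}{1 + e^{z_2 - z_1}} = \frac{1}{1 + e^{-(z_1 - z_2)}} = \frac{1}{1 + e^{-z}}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The second class receives the remaining mass, &amp;lt;math&amp;gt;\sigma(\mathbf{z})_2 = 1 - \sigma_{\mathrm{sigmoid}}(z) = \sigma_{\mathrm{sigmoid}}(-z)&amp;lt;/math&amp;gt;.&lt;br /&gt;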
&lt;br /&gt;
== Gradient ==&lt;br /&gt;
&lt;br /&gt;
The Jacobian of the softmax function with respect to its input is:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\frac{\partial \sigma_k}{\partial z_j} = \sigma_k (\delta_{kj} - \sigma_j)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
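where &amp;lt;math&amp;gt;\delta_{kj}&amp;lt;/math&amp;gt; is the Kronecker delta. When combined with [[Cross-Entropy Loss]], the gradient of the loss with respect to the logit &amp;lt;math&amp;gt;z_k&amp;lt;/math&amp;gt; simplifies to &amp;lt;math&amp;gt;\hat{y}_k - y_k&amp;lt;/math&amp;gt;, where &amp;lt;math&amp;gt;\hat{y}_k = \sigma_k&amp;lt;/math&amp;gt; is the predicted probability and &amp;lt;math&amp;gt;y_k&amp;lt;/math&amp;gt; is the target probability (1 for the true class and 0 otherwise under one-hot labels). This form is computationally efficient and numerically stable.&lt;br /&gt;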
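&lt;br /&gt;
As a sketch of why the combined gradient is so simple, assume a target vector &amp;lt;math&amp;gt;\mathbf{y}&amp;lt;/math&amp;gt; whose entries sum to 1 (for example a one-hot label) and write the cross-entropy loss as &amp;lt;math&amp;gt;\mathcal{L} = -\sum_{k=1}^{K} y_k \log \sigma_k&amp;lt;/math&amp;gt;. The symbol &amp;lt;math&amp;gt;\mathcal{L}&amp;lt;/math&amp;gt; is notation introduced here for illustration. Applying the chain rule with the Jacobian above gives:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\frac{\partial \mathcal{L}}{\partial z_j} = -\sum_{k=1}^{K} \frac{y_k}{\sigma_k} \, \sigma_k (\delta_{kj} - \sigma_j) = -y_j + \sigma_j \sum_{k=1}^{K} y_k = \sigma_j - y_j = \hat{y}_j - y_j&amp;lt;/math&amp;gt;&lt;br /&gt;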
&lt;br /&gt;
== Use in Classification ==&lt;br /&gt;
&lt;br /&gt;
In a typical classification pipeline:&lt;br /&gt;
&lt;br /&gt;
# A neural network produces raw logits &amp;lt;math&amp;gt;\mathbf{z}&amp;lt;/math&amp;gt; from its final linear layer.&lt;br /&gt;
# Softmax converts logits to probabilities: &amp;lt;math&amp;gt;\hat{\mathbf{y}} = \sigma(\mathbf{z})&amp;lt;/math&amp;gt;.&lt;br /&gt;
# The predicted class is &amp;lt;math&amp;gt;\hat{c} = \arg\max_k \hat{y}_k&amp;lt;/math&amp;gt;.&lt;br /&gt;
# Training uses [[Cross-Entropy Loss]] applied to the predicted distribution and the true labels.&lt;br /&gt;
&lt;br /&gt;
In practice, the softmax and cross-entropy are computed jointly for numerical stability (the &amp;#039;&amp;#039;&amp;#039;log-softmax&amp;#039;&amp;#039;&amp;#039; formulation), and the argmax at inference time can be applied directly to the logits without computing softmax at all.&lt;br /&gt;
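&lt;br /&gt;
The joint computation can be sketched in Python with the log-sum-exp form of log-softmax. The function names log_softmax and cross_entropy_from_logits, the integer label convention, and the sample logits are assumptions for illustration and do not mirror any particular framework API. Because softmax preserves the ordering of the logits, the sketch takes the argmax over the logits directly.&lt;br /&gt;
&lt;br /&gt;
```python
import math


def log_softmax(logits):
    """Return log(softmax(z)) computed as z_k - m - log(sum_j exp(z_j - m))."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum_exp for z in logits]


def cross_entropy_from_logits(logits, true_class):
    """Cross-entropy for a single example with an integer class label."""
    return -log_softmax(logits)[true_class]


logits = [2.0, 1.0, 0.1]                                      # illustrative logits
loss = cross_entropy_from_logits(logits, true_class=0)
predicted = max(range(len(logits)), key=lambda k: logits[k])  # argmax on logits
```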
&lt;br /&gt;
== Beyond Classification ==&lt;br /&gt;
&lt;br /&gt;
Softmax appears in many contexts beyond the output layer:&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Attention mechanisms&amp;#039;&amp;#039;&amp;#039;: Softmax normalizes alignment scores into attention weights in the [[Attention Mechanisms|Transformer]] architecture.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Reinforcement learning&amp;#039;&amp;#039;&amp;#039;: Softmax over action-value estimates produces a stochastic policy (Boltzmann exploration).&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Mixture models&amp;#039;&amp;#039;&amp;#039;: Softmax parameterizes mixing coefficients in mixture-of-experts architectures.&lt;br /&gt;
&lt;br /&gt;
== See also ==&lt;br /&gt;
&lt;br /&gt;
* [[Cross-Entropy Loss]]&lt;br /&gt;
* [[Loss Functions]]&lt;br /&gt;
* [[Logistic regression]]&lt;br /&gt;
* [[Attention Mechanisms]]&lt;br /&gt;
* [[Neural Networks]]&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
* Bishop, C. M. (2006). &amp;#039;&amp;#039;Pattern Recognition and Machine Learning&amp;#039;&amp;#039;. Springer, Section 4.3.4.&lt;br /&gt;
* Goodfellow, I., Bengio, Y. and Courville, A. (2016). &amp;#039;&amp;#039;Deep Learning&amp;#039;&amp;#039;. MIT Press, Section 6.2.2.3.&lt;br /&gt;
* Hinton, G., Vinyals, O. and Dean, J. (2015). &amp;quot;Distilling the Knowledge in a Neural Network&amp;quot;. &amp;#039;&amp;#039;arXiv:1503.02531&amp;#039;&amp;#039;.&lt;br /&gt;
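* Bridle, J. S. (1990). &amp;quot;Probabilistic Interpretation of Feedforward Classification Network Outputs, with Relationships to Statistical Pattern Recognition&amp;quot;. In &amp;#039;&amp;#039;Neurocomputing: Algorithms, Architectures and Applications&amp;#039;&amp;#039;. Springer.&lt;br /&gt;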
&lt;br /&gt;
[[Category:Machine Learning]]&lt;br /&gt;
[[Category:Introductory]]&lt;/div&gt;</summary>
		<author><name>DeployBot</name></author>
	</entry>
</feed>