DeployBot: [deploy-bot] Deploy from CI (8c92aeb)

2026-04-24T07:08:59Z

← Older revision		Revision as of 07:08, 24 April 2026
Line 103:		Line 103:
	[[Category:Machine Learning]]		[[Category:Machine Learning]]
	[[Category:Introductory]]		[[Category:Introductory]]
	~~<!--v1.2.0 cache-bust-->~~
	~~<!-- pass 2 -->~~

DeployBot: Pass 2 force re-parse

2026-04-24T07:01:14Z

← Older revision		Revision as of 07:01, 24 April 2026
Line 104:		Line 104:
	[[Category:Introductory]]		[[Category:Introductory]]
	<!--v1.2.0 cache-bust-->		<!--v1.2.0 cache-bust-->
			<!-- pass 2 -->

DeployBot: Force re-parse after Math source-mode rollout (v1.2.0)

2026-04-24T06:58:37Z

← Older revision		Revision as of 06:58, 24 April 2026
Line 103:		Line 103:
	[[Category:Machine Learning]]		[[Category:Machine Learning]]
	[[Category:Introductory]]		[[Category:Introductory]]
			<!--v1.2.0 cache-bust-->

DeployBot: [deploy-bot] Deploy from CI (775ba6e)

2026-04-24T04:01:44Z

New page

{{LanguageBar | page = Softmax Function}}
{{ArticleInfobox | topic_area = Machine Learning | difficulty = Introductory | prerequisites = }}
{{ContentMeta | generated_by = claude-opus | model_used = claude-opus-4-6 | generated_date = 2026-03-13}}

The '''softmax function''' (also called the '''normalized exponential function''') is a mathematical function that converts a vector of real numbers ('''logits''') into a probability distribution. It is the standard output activation for multi-class classification in neural networks and plays a central role in models ranging from logistic regression to large language models.

== Definition ==

Given a vector of logits <math>\mathbf{z} = (z_1, z_2, \dots, z_K)</math> for <math>K</math> classes, the softmax function produces:

:<math>\sigma(\mathbf{z})_k = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}, \qquad k = 1, \dots, K</math>

The output satisfies two properties that make it a valid probability distribution:

# <math>\sigma(\mathbf{z})_k > 0</math> for all <math>k</math> (since the exponential is always positive).
# <math>\sum_{k=1}^{K} \sigma(\mathbf{z})_k = 1</math> (by construction).
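
The definition and its two properties can be checked with a few lines of code. The following is a minimal sketch in plain Python (standard library only); the function name <code>softmax</code> is chosen here for illustration, and the version shown is the direct, unstabilized formula.

```python
import math

def softmax(z):
    """Direct transcription of the definition: exp(z_k) / sum_j exp(z_j)."""
    exps = [math.exp(v) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])    # [0.659, 0.242, 0.099]
print(all(p > 0 for p in probs))       # True: every entry is positive
print(math.isclose(sum(probs), 1.0))   # True: entries sum to one
```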

== Intuition ==

The softmax function amplifies differences between logits. A logit that is larger than its peers receives a disproportionately large share of the probability mass because the exponential function grows super-linearly. For example:

{| class="wikitable"
|-
! Logits !! Softmax output
|-
| <math>(2.0,\; 1.0,\; 0.1)</math> || <math>(0.659,\; 0.242,\; 0.099)</math>
|-
| <math>(5.0,\; 1.0,\; 0.1)</math> || <math>(0.975,\; 0.018,\; 0.007)</math>
|}

As the gap between the largest logit and the others increases, the output approaches a one-hot vector. This "winner-take-most" behavior makes softmax well-suited for classification where a single class should dominate.
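
The drift toward a one-hot vector can be observed by raising a single logit while holding the others fixed. The sketch below uses hypothetical logits chosen only for illustration; the leading probability should grow toward 1 as the gap widens.

```python
import math

def softmax(z):
    exps = [math.exp(v) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits: one leading entry and two fixed competitors.
for lead in (2.0, 5.0, 10.0):
    probs = softmax([lead, 1.0, 0.1])
    print(lead, [round(p, 3) for p in probs])
```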

== Temperature Parameter ==

A '''temperature''' parameter <math>T > 0</math> controls the sharpness of the distribution:

:<math>\sigma(\mathbf{z}; T)_k = \frac{e^{z_k / T}}{\sum_{j=1}^{K} e^{z_j / T}}</math>

* <math>T \to 0</math>: The distribution collapses to a one-hot vector on the argmax (assuming a unique largest logit), which is equivalent to a hard decision.
* <math>T = 1</math>: Standard softmax.
* <math>T \to \infty</math>: The distribution approaches uniform, so all classes become equally likely.

Temperature scaling is widely used in knowledge distillation (Hinton et al., 2015), where a "soft" distribution from a teacher model provides richer training signal than hard labels. It is also used to control randomness in text generation from language models.
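
The three temperature regimes can be checked numerically. This is a minimal sketch; the function name <code>softmax_with_temperature</code> and the example temperatures are assumptions made for illustration. The logits are shifted by their maximum before exponentiation, which leaves the result unchanged and avoids overflow at small <math>T</math>.

```python
import math

def softmax_with_temperature(z, T=1.0):
    """Softmax of z / T for a temperature T > 0."""
    if T <= 0:
        raise ValueError("temperature must be positive")
    scaled = [v / T for v in z]
    m = max(scaled)  # shifting by the max does not change the output
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
for T in (0.1, 1.0, 100.0):
    # Small T sharpens the distribution; large T flattens it.
    print(T, [round(p, 3) for p in softmax_with_temperature(logits, T)])
```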

== Numerical Stability ==

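A naive implementation of softmax can overflow when logits are large: <math>e^{1000}</math>, for example, exceeds the largest finite double-precision value and evaluates to infinity, which turns the ratio into an undefined <math>\infty / \infty</math>. The standard fix subtracts the maximum logit: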

:<math>\sigma(\mathbf{z})_k = \frac{e^{z_k - m}}{\sum_{j=1}^{K} e^{z_j - m}}, \qquad m = \max_j z_j</math>

This is mathematically equivalent (the factor <math>e^{-m}</math> cancels between numerator and denominator) but ensures that the largest exponent is 0, so the largest term is <math>e^0 = 1</math> and overflow cannot occur. Major deep learning frameworks typically apply this stabilization in their built-in softmax routines.
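
The difference between the two formulations can be seen on large logits. The sketch below is illustrative only; the names <code>softmax_naive</code> and <code>softmax_stable</code> are assumptions. Note that Python's <code>math.exp</code> raises <code>OverflowError</code> on overflow, whereas array libraries commonly return infinity instead.

```python
import math

def softmax_naive(z):
    # math.exp raises OverflowError once the argument exceeds roughly 709.
    exps = [math.exp(v) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_stable(z):
    m = max(z)                            # largest logit
    exps = [math.exp(v - m) for v in z]   # every exponent is <= 0
    total = sum(exps)                     # >= 1, since one term equals exp(0)
    return [e / total for e in exps]

logits = [1000.0, 999.0, 998.0]
try:
    softmax_naive(logits)
except OverflowError as err:
    print("naive version failed:", err)

print([round(p, 3) for p in softmax_stable(logits)])  # [0.665, 0.245, 0.09]
```

For logits that are small enough not to overflow, both functions should agree up to rounding error.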

== Relationship to Sigmoid ==

For the special case of <math>K = 2</math> classes, the softmax function reduces to the '''sigmoid''' (logistic) function. If we define <math>z = z_1 - z_2</math>, then:

:<math>\sigma(\mathbf{z})_1 = \frac{e^{z_1}}{e^{z_1} + e^{z_2}} = \frac{1}{1 + e^{-z}} = \sigma_{\mathrm{sigmoid}}(z)</math>

This is why binary classifiers typically use a single output neuron with a sigmoid activation rather than two neurons with softmax: the two parameterizations are mathematically equivalent, and only the difference <math>z_1 - z_2</math> affects the output.
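
The two-class identity can be confirmed numerically. This is a small sketch with hypothetical logit values; both printed numbers should agree up to floating-point rounding.

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

z1, z2 = 1.3, -0.4            # hypothetical logits for the two classes
print(softmax([z1, z2])[0])   # probability of class 1 from softmax
print(sigmoid(z1 - z2))       # same probability from the logit difference
```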

== Gradient ==

The Jacobian of the softmax function with respect to its input is:

:<math>\frac{\partial \sigma_k}{\partial z_j} = \sigma_k (\delta_{kj} - \sigma_j)</math>

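where <math>\delta_{kj}</math> is the Kronecker delta. When softmax is combined with [[Cross-Entropy Loss]], the gradient of the loss with respect to the logit <math>z_k</math> simplifies to <math>\hat{y}_k - y_k</math>, where <math>\hat{\mathbf{y}} = \sigma(\mathbf{z})</math> is the predicted distribution and <math>\mathbf{y}</math> is the target distribution. This form is computationally efficient and numerically stable.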
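
Both formulas can be checked against finite differences. The following is a sketch under stated assumptions: the helper names, the example logits, and the one-hot target are all hypothetical, and the cross-entropy is taken to be <math>-\sum_k y_k \log \sigma(\mathbf{z})_k</math>.

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_jacobian(z):
    """J[k][j] = s_k * (delta_kj - s_j), with s = softmax(z)."""
    s = softmax(z)
    K = len(s)
    return [[s[k] * ((1.0 if k == j else 0.0) - s[j]) for j in range(K)]
            for k in range(K)]

def cross_entropy(z, y):
    """-sum_k y_k * log softmax(z)_k for a target distribution y."""
    return -sum(yk * math.log(sk) for yk, sk in zip(y, softmax(z)))

def cross_entropy_grad(z, y):
    """Closed-form gradient with respect to the logits: softmax(z) - y."""
    return [sk - yk for sk, yk in zip(softmax(z), y)]

def numerical_grad(f, z, eps=1e-6):
    """Central finite differences, used only as an independent check."""
    grad = []
    for i in range(len(z)):
        up = z[:i] + [z[i] + eps] + z[i + 1:]
        down = z[:i] + [z[i] - eps] + z[i + 1:]
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad

z = [2.0, 1.0, 0.1]   # hypothetical logits
y = [0.0, 1.0, 0.0]   # hypothetical one-hot target
print(cross_entropy_grad(z, y))
print(numerical_grad(lambda v: cross_entropy(v, y), z))
```

The two printed vectors should match to several decimal places. Each column of the Jacobian sums to zero, which reflects the constraint that the outputs always sum to one.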

== Use in Classification ==

In a typical classification pipeline:

# A neural network produces raw logits <math>\mathbf{z}</math> from its final linear layer.
# Softmax converts logits to probabilities: <math>\hat{\mathbf{y}} = \sigma(\mathbf{z})</math>.
# The predicted class is <math>\hat{c} = \arg\max_k \hat{y}_k</math>.
# Training uses [[Cross-Entropy Loss]] applied to the predicted distribution and the true labels.

In practice, the softmax and cross-entropy are computed jointly for numerical stability (the '''log-softmax''' formulation). At inference time the argmax can be applied directly to the logits without computing softmax at all, because softmax preserves the ordering of its inputs.
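
A minimal sketch of the joint computation is given below. It assumes the log-softmax is evaluated as <math>z_k - \log \sum_j e^{z_j}</math> with the usual maximum shift; the function names and the integer class-index convention for the target are illustrative choices, not a particular framework's API.

```python
import math

def log_softmax(z):
    """log softmax(z)_k = z_k - logsumexp(z), computed with the max shift."""
    m = max(z)
    lse = m + math.log(sum(math.exp(v - m) for v in z))
    return [v - lse for v in z]

def cross_entropy_from_logits(z, target):
    """Negative log-probability of the integer class index `target`."""
    return -log_softmax(z)[target]

def predict(z):
    """Argmax over the raw logits; softmax would not change the winner."""
    return max(range(len(z)), key=lambda k: z[k])

logits = [1000.0, 999.0, 998.0]   # large values that would overflow naively
print(predict(logits))            # 0
print(cross_entropy_from_logits(logits, 1))
```

Working in log space avoids taking the logarithm of a probability that has already underflowed to zero.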

== Beyond Classification ==

Softmax appears in many contexts beyond the output layer:

* '''Attention mechanisms''': Softmax normalizes alignment scores into attention weights in the [[Attention Mechanisms|Transformer]] architecture.
* '''Reinforcement learning''': Softmax over action-value estimates produces a stochastic policy (Boltzmann exploration).
* '''Mixture models''': Softmax parameterizes mixing coefficients in mixture-of-experts architectures.
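
As one illustration of the attention case, the sketch below applies softmax row by row to a matrix of alignment scores. The dot-product scoring with a <math>1/\sqrt{d}</math> scale follows the usual Transformer formulation and is an assumption here, as are the function name and the toy inputs.

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(queries, keys):
    """Row-wise softmax over scaled dot-product scores.

    queries: list of n vectors; keys: list of m vectors; all of dimension d.
    Returns an n x m matrix whose rows each sum to 1.
    """
    d = len(keys[0])
    scale = math.sqrt(d)
    weights = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        weights.append(softmax(scores))
    return weights

# Hypothetical toy inputs: two queries, three keys, dimension 2.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for row in attention_weights(Q, K):
    print([round(w, 3) for w in row])
```

Each row is a probability distribution over the keys, so keys with a larger alignment score for a given query receive more weight.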

== See also ==

* [[Cross-Entropy Loss]]
* [[Loss Functions]]
* [[Logistic regression]]
* [[Attention Mechanisms]]
* [[Neural Networks]]

== References ==

* Bishop, C. M. (2006). ''Pattern Recognition and Machine Learning''. Springer, Section 4.3.4.
* Goodfellow, I., Bengio, Y. and Courville, A. (2016). ''Deep Learning''. MIT Press, Section 6.2.2.3.
* Hinton, G., Vinyals, O. and Dean, J. (2015). "Distilling the Knowledge in a Neural Network". ''arXiv:1503.02531''.
* Bridle, J. S. (1990). "Probabilistic Interpretation of Feedforward Classification Network Outputs". ''Neurocomputing''.

[[Category:Machine Learning]]
[[Category:Introductory]]

Softmax Function - Revision history
