Softmax Function: Difference between revisions

    From Marovi AI
    <languages />
    {{LanguageBar | page = Softmax Function}}
    {{ArticleInfobox | topic_area = Machine Learning | difficulty = Introductory | prerequisites = }}
    {{ContentMeta | generated_by = claude-opus | model_used = claude-opus-4-6 | generated_date = 2026-03-13}}


    <translate>
    <!--T:1-->
    The '''softmax function''' (also called the '''normalized exponential function''') is a mathematical function that converts a vector of real numbers ('''logits''') into a probability distribution. It is the standard output activation for multi-class classification in neural networks and plays a central role in models ranging from logistic regression to large language models.


    <!--T:2-->
    == Definition ==


    <!--T:3-->
    Given a vector of logits <math>\mathbf{z} = (z_1, z_2, \dots, z_K)</math> for <math>K</math> classes, the softmax function produces:


    <!--T:4-->
    :<math>\sigma(\mathbf{z})_k = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}, \qquad k = 1, \dots, K</math>


    <!--T:5-->
    The output satisfies two properties that make it a valid probability distribution:


    <!--T:6-->
    # <math>\sigma(\mathbf{z})_k > 0</math> for all <math>k</math> (since the exponential is always positive).
    # <math>\sum_{k=1}^{K} \sigma(\mathbf{z})_k = 1</math> (by construction).
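
    As a quick check, the definition and both properties can be verified in a few lines of NumPy (an illustrative sketch; the helper name <code>softmax</code> is not from any particular library):

```python
import numpy as np

def softmax(z):
    """Softmax over the last axis: exp(z_k) / sum_j exp(z_j)."""
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

z = np.array([2.0, 1.0, 0.1])
p = softmax(z)
# every entry is strictly positive and the entries sum to 1
```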


    <!--T:7-->
    == Intuition ==


    <!--T:8-->
    The softmax function amplifies differences between logits. A logit that is larger than its peers receives a disproportionately large share of the probability mass because the exponential function grows super-linearly. For example:


    <!--T:9-->
    {| class="wikitable"
    |-
    ! Logits !! Softmax output
    |-
    | <math>(2.0,\; 1.0,\; 0.1)</math> || <math>(0.659,\; 0.242,\; 0.099)</math>
    |-
    | <math>(5.0,\; 1.0,\; 0.1)</math> || <math>(0.975,\; 0.018,\; 0.007)</math>
    |}


    <!--T:10-->
    As the gap between the largest logit and the others increases, the output approaches a one-hot vector. This "winner-take-most" behavior makes softmax well-suited for classification where a single class should dominate.
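
    The widening-gap effect can be reproduced with a short NumPy sketch (a minimal illustration, not a library routine):

```python
import numpy as np

def softmax(z):
    e = np.exp(z)
    return e / e.sum()

# A larger lead for the top logit concentrates probability on it.
p_small_gap = softmax(np.array([2.0, 1.0, 0.1]))  # top prob ~0.66
p_large_gap = softmax(np.array([5.0, 1.0, 0.1]))  # top prob ~0.97
```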


    <!--T:11-->
    == Temperature Parameter ==


    <!--T:12-->
    A '''temperature''' parameter <math>T > 0</math> controls the sharpness of the distribution:


    <!--T:13-->
    :<math>\sigma(\mathbf{z}; T)_k = \frac{e^{z_k / T}}{\sum_{j=1}^{K} e^{z_j / T}}</math>


    <!--T:14-->
    * <math>T \to 0</math>: The distribution collapses to a one-hot vector selecting the argmax — equivalent to a hard decision.
    * <math>T = 1</math>: Standard softmax.
    * <math>T \to \infty</math>: The distribution approaches uniform — all classes become equally likely.
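
    The three regimes are easy to observe numerically (NumPy sketch; the helper name and temperature values are illustrative):

```python
import numpy as np

def softmax_T(z, T=1.0):
    """Temperature-scaled softmax: divide logits by T before exponentiating."""
    e = np.exp(z / T)
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])
sharp = softmax_T(z, T=0.1)     # low T: near one-hot on the argmax
standard = softmax_T(z, T=1.0)  # T = 1: ordinary softmax
flat = softmax_T(z, T=10.0)     # high T: near uniform
```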


    <!--T:15-->
    Temperature scaling is widely used in knowledge distillation (Hinton et al., 2015), where a "soft" distribution from a teacher model provides richer training signal than hard labels. It is also used to control randomness in text generation from language models.


    <!--T:16-->
    == Numerical Stability ==


    <!--T:17-->
    A naive implementation of softmax can overflow when logits are large (e.g., <math>e^{1000}</math> is infinite in floating point). The standard fix subtracts the maximum logit:


    <!--T:18-->
    :<math>\sigma(\mathbf{z})_k = \frac{e^{z_k - m}}{\sum_{j=1}^{K} e^{z_j - m}}, \qquad m = \max_j z_j</math>


    <!--T:19-->
    This is mathematically equivalent (the constant cancels) but ensures the largest exponent is <math>e^0 = 1</math>, preventing overflow. All major deep learning frameworks implement this stabilized version automatically.
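
    A minimal NumPy sketch of the max-subtraction trick (illustrative; as noted above, frameworks apply it internally):

```python
import numpy as np

def softmax_stable(z):
    """Max-subtracted softmax: same result as the naive form, no overflow."""
    m = np.max(z)
    e = np.exp(z - m)
    return e / e.sum()

big = np.array([1000.0, 999.0, 0.0])
p = softmax_stable(big)  # naive np.exp(1000.0) would overflow to inf
```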


    <!--T:20-->
    == Relationship to Sigmoid ==


    <!--T:21-->
    For the special case of <math>K = 2</math> classes, the softmax function reduces to the '''sigmoid''' (logistic) function. If we define <math>z = z_1 - z_2</math>, then:


    <!--T:22-->
    :<math>\sigma(\mathbf{z})_1 = \frac{e^{z_1}}{e^{z_1} + e^{z_2}} = \frac{1}{1 + e^{-z}} = \sigma_{\mathrm{sigmoid}}(z)</math>


    <!--T:23-->
    This is why binary classifiers typically use a single output neuron with a sigmoid activation rather than two neurons with softmax — they are mathematically equivalent.
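
    The equivalence is straightforward to verify numerically (NumPy sketch; the logit values are arbitrary):

```python
import numpy as np

def softmax(z):
    e = np.exp(z)
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

z1, z2 = 1.7, -0.4
p = softmax(np.array([z1, z2]))
# the first softmax output equals the sigmoid of the logit difference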


    <!--T:24-->
    == Gradient ==


    <!--T:25-->
    The Jacobian of the softmax function with respect to its input is:


    <!--T:26-->
    :<math>\frac{\partial \sigma_k}{\partial z_j} = \sigma_k (\delta_{kj} - \sigma_j)</math>


    <!--T:27-->
    where <math>\delta_{kj}</math> is the Kronecker delta. When combined with [[Cross-Entropy Loss]], the gradient simplifies to <math>\hat{y}_k - y_k</math>, which is computationally efficient and numerically stable.
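
    The Jacobian formula can be sanity-checked against finite differences (NumPy sketch; logit values and tolerances are illustrative):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def softmax_jacobian(z):
    """J[k, j] = sigma_k * (delta_kj - sigma_j)."""
    s = softmax(z)
    return np.diag(s) - np.outer(s, s)

z = np.array([2.0, 1.0, 0.1])
J = softmax_jacobian(z)

# central finite-difference approximation of the same Jacobian
eps = 1e-6
J_num = np.zeros((3, 3))
for j in range(3):
    dz = np.zeros(3)
    dz[j] = eps
    J_num[:, j] = (softmax(z + dz) - softmax(z - dz)) / (2 * eps)
```

    Note that each column of the Jacobian sums to zero: perturbing one logit redistributes probability mass without changing the total.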


    <!--T:28-->
    == Use in Classification ==


    <!--T:29-->
    In a typical classification pipeline:


    <!--T:30-->
    # A neural network produces raw logits <math>\mathbf{z}</math> from its final linear layer.
    # Softmax converts logits to probabilities: <math>\hat{\mathbf{y}} = \sigma(\mathbf{z})</math>.
    # The predicted class is <math>\hat{c} = \arg\max_k \hat{y}_k</math>.
    # Training uses [[Cross-Entropy Loss]] applied to the predicted distribution and the true labels.


    <!--T:31-->
    In practice, the softmax and cross-entropy are computed jointly for numerical stability (the '''log-softmax''' formulation), and the argmax at inference time can be applied directly to the logits without computing softmax at all.
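
    A sketch of the stable log-softmax route, assuming a one-hot label (NumPy; names and values are illustrative):

```python
import numpy as np

def log_softmax(z):
    """log softmax(z), computed stably as z - m - log(sum exp(z - m))."""
    m = z.max()
    return z - m - np.log(np.exp(z - m).sum())

z = np.array([2.0, 1.0, 0.1])
y = np.array([0.0, 1.0, 0.0])       # one-hot true label (class 1)
loss = -(y * log_softmax(z)).sum()  # cross-entropy via log-softmax
grad = np.exp(log_softmax(z)) - y   # gradient simplifies to y_hat - y
pred = z.argmax()                   # inference: argmax on raw logits
```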


    <!--T:32-->
    == Beyond Classification ==


    <!--T:33-->
    Softmax appears in many contexts beyond the output layer:


    <!--T:34-->
    * '''Attention mechanisms''': Softmax normalizes alignment scores into attention weights in the [[Attention Mechanisms|Transformer]] architecture.
    * '''Reinforcement learning''': Softmax over action-value estimates produces a stochastic policy (Boltzmann exploration).
    * '''Mixture models''': Softmax parameterizes mixing coefficients in mixture-of-experts architectures.
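
    As one illustration, the attention use can be sketched with random matrices standing in for queries and keys, following the standard scaled dot-product formulation (all names, shapes, and the seed here are illustrative):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))    # 4 queries of dimension 8
K = rng.standard_normal((6, 8))    # 6 keys of dimension 8
scores = Q @ K.T / np.sqrt(8)      # scaled dot-product alignment scores
weights = softmax(scores)          # each row: a distribution over the 6 keys
```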


    <!--T:35-->
    == See also ==


    <!--T:36-->
    * [[Cross-Entropy Loss]]
    * [[Loss Functions]]
    * [[Neural Networks]]


    <!--T:37-->
    == References ==


    <!--T:38-->
    * Bishop, C. M. (2006). ''Pattern Recognition and Machine Learning''. Springer, Section 4.3.4.
    * Goodfellow, I., Bengio, Y. and Courville, A. (2016). ''Deep Learning''. MIT Press, Section 6.2.2.3.
    * Hinton, G., Vinyals, O. and Dean, J. (2015). "Distilling the Knowledge in a Neural Network". ''arXiv:1503.02531''.
    * Bridle, J. S. (1990). "Probabilistic Interpretation of Feedforward Classification Network Outputs". ''Neurocomputing''.
    </translate>


    [[Category:Machine Learning]]
    [[Category:Introductory]]

    Revision as of 00:30, 27 April 2026
