Loss Functions/en - Revision history

FuzzyBot: Updating to match new version of source page

2026-04-27T23:52:37Z

Updating to match new version of source page

← Older revision		Revision as of 23:52, 27 April 2026
Line 31:		Line 31:
	== Cross-entropy loss ==		== Cross-entropy loss ==

	'''Cross-entropy loss''' is the standard choice for classification tasks. It measures the dissimilarity between the predicted probability distribution and the true label distribution.		'''{{Term\|categorical cross-entropy\|Cross-entropy loss}}''' is the standard choice for classification tasks. It measures the dissimilarity between the predicted probability distribution and the true label distribution.

	=== Binary cross-entropy ===		=== Binary cross-entropy ===
Line 47:		Line 47:
	:<math>L_{\text{CE}} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} y_{i,c} \log \hat{y}_{i,c}</math>		:<math>L_{\text{CE}} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} y_{i,c} \log \hat{y}_{i,c}</math>

	When the true labels are one-hot encoded, only the term corresponding to the correct class survives.		When the true labels are {{Term\|one-hot encoding\|one-hot}} encoded, only the term corresponding to the correct class survives.

	== Hinge loss ==		== Hinge loss ==
Line 79:		Line 79:

	* '''Regression''' — MSE is the default; switch to MAE or Huber if outliers are a concern.		* '''Regression''' — MSE is the default; switch to MAE or Huber if outliers are a concern.
	* '''Binary classification''' — binary cross-entropy with sigmoid output.		* '''Binary classification''' — binary {{Term\|categorical cross-entropy\|cross-entropy}} with sigmoid output.
	* '''Multi-class classification''' — categorical cross-entropy with softmax output.		* '''Multi-class classification''' — {{Term\|categorical cross-entropy}} with {{Term\|softmax}} output.
	* '''Multi-label classification''' — binary cross-entropy applied independently per label.		* '''Multi-label classification''' — binary {{Term\|categorical cross-entropy\|cross-entropy}} applied independently per label.
	* '''Ranking or retrieval''' — contrastive loss, triplet loss, or listwise ranking losses.		* '''Ranking or retrieval''' — contrastive loss, triplet loss, or listwise ranking losses.

	An important consideration is whether the loss is '''calibrated''' — i.e., whether minimising it yields well-calibrated predicted probabilities. Cross-entropy is a proper scoring rule and produces calibrated probabilities, while hinge loss does not.		An important consideration is whether the loss is '''calibrated''' — i.e., whether minimising it yields well-calibrated predicted probabilities. {{Term\|categorical cross-entropy\|Cross-entropy}} is a proper scoring rule and produces calibrated probabilities, while hinge loss does not.

	== Regularisation terms ==		== Regularisation terms ==

	In practice, the total objective often includes a '''regularisation term''' that penalises model complexity:		In practice, the total objective often includes a '''{{Term\|regularization\|regularisation}} term''' that penalises model complexity:

	:<math>J(\theta) = L(\theta) + \lambda \, R(\theta)</math>		:<math>J(\theta) = L(\theta) + \lambda \, R(\theta)</math>

	where <math>\lambda</math> controls the strength of regularisation. Common choices include L2 regularisation (<math>R = \\|\theta\\|_2^2</math>) and L1 regularisation (<math>R = \\|\theta\\|_1</math>). See [[Overfitting and Regularization]] for more detail.		where <math>\lambda</math> controls the strength of {{Term\|regularization\|regularisation}}. Common choices include L2 {{Term\|regularization\|regularisation}} (<math>R = \\|\theta\\|_2^2</math>) and L1 {{Term\|regularization\|regularisation}} (<math>R = \\|\theta\\|_1</math>). See [[Overfitting and Regularization]] for more detail.

	== See also ==		== See also ==

FuzzyBot: Updating to match new version of source page

2026-04-27T22:01:02Z

Updating to match new version of source page

← Older revision		Revision as of 22:01, 27 April 2026
Line 3:		Line 3:
	{{ContentMeta \| generated_by = claude-opus \| model_used = claude-opus-4-6 \| generated_date = 2026-03-13}}		{{ContentMeta \| generated_by = claude-opus \| model_used = claude-opus-4-6 \| generated_date = 2026-03-13}}

	'''~~{{Term\|loss function\|~~Loss functions}}''' (also called '''~~{{Term\|loss function\|~~cost functions}}''' or '''~~{{Term\|loss function\|~~objective functions}}''') quantify how far a model's predictions are from the desired output. Minimising the ~~{{Term\|~~loss function}} is the central goal of the training process in machine learning: the optimisation algorithm adjusts the model's parameters to drive the loss as low as possible.		'''Loss functions''' (also called '''cost functions''' or '''objective functions''') quantify how far a model's predictions are from the desired output. Minimising the loss function is the central goal of the training process in machine learning: the optimisation algorithm adjusts the model's parameters to drive the loss as low as possible.

	== Purpose ==		== Purpose ==

	A ~~{{Term\|~~loss function}} maps the model's prediction <math>\hat{y}</math> and the true target <math>y</math> to a non-negative real number. Formally, for a single example:		A loss function maps the model's prediction <math>\hat{y}</math> and the true target <math>y</math> to a non-negative real number. Formally, for a single example:

	:<math>\ell: \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_{\geq 0}</math>		:<math>\ell: \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_{\geq 0}</math>
Line 15:		Line 15:
	:<math>L(\theta) = \frac{1}{N}\sum_{i=1}^{N}\ell\bigl(y_i,\, \hat{y}_i(\theta)\bigr)</math>		:<math>L(\theta) = \frac{1}{N}\sum_{i=1}^{N}\ell\bigl(y_i,\, \hat{y}_i(\theta)\bigr)</math>

	The choice of ~~{{Term\|~~loss function}} encodes the problem's structure — what kind of errors matter and how severely they should be penalised. A poorly chosen loss can lead to a model that optimises the wrong objective.		The choice of loss function encodes the problem's structure — what kind of errors matter and how severely they should be penalised. A poorly chosen loss can lead to a model that optimises the wrong objective.

	== Mean squared error ==		== Mean squared error ==
Line 31:		Line 31:
	== Cross-entropy loss ==		== Cross-entropy loss ==

	'''~~{{Term\|categorical cross-entropy\|~~Cross-entropy loss}}''' is the standard choice for classification tasks. It measures the dissimilarity between the predicted probability distribution and the true label distribution.		'''Cross-entropy loss''' is the standard choice for classification tasks. It measures the dissimilarity between the predicted probability distribution and the true label distribution.

	=== Binary cross-entropy ===		=== Binary cross-entropy ===
Line 47:		Line 47:
	:<math>L_{\text{CE}} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} y_{i,c} \log \hat{y}_{i,c}</math>		:<math>L_{\text{CE}} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} y_{i,c} \log \hat{y}_{i,c}</math>

	When the true labels are ~~{{Term\|~~one-hot ~~encoding\|one-hot}}~~ encoded, only the term corresponding to the correct class survives.		When the true labels are one-hot encoded, only the term corresponding to the correct class survives.

	== Hinge loss ==		== Hinge loss ==
Line 76:		Line 76:
	== Choosing the right loss ==		== Choosing the right loss ==

	The appropriate ~~{{Term\|~~loss function}} depends on the task:		The appropriate loss function depends on the task:

	* '''Regression''' — MSE is the default; switch to MAE or Huber if outliers are a concern.		* '''Regression''' — MSE is the default; switch to MAE or Huber if outliers are a concern.
	* '''Binary classification''' — binary ~~{{Term\|categorical cross-entropy\|~~cross-entropy}} with sigmoid output.		* '''Binary classification''' — binary cross-entropy with sigmoid output.
	* '''Multi-class classification''' — ~~{{Term\|~~categorical cross-entropy}} with ~~{{Term\|~~softmax}} output.		* '''Multi-class classification''' — categorical cross-entropy with softmax output.
	* '''Multi-label classification''' — binary ~~{{Term\|categorical cross-entropy\|~~cross-entropy}} applied independently per label.		* '''Multi-label classification''' — binary cross-entropy applied independently per label.
	* '''Ranking or retrieval''' — contrastive loss, triplet loss, or listwise ranking losses.		* '''Ranking or retrieval''' — contrastive loss, triplet loss, or listwise ranking losses.

	An important consideration is whether the loss is '''calibrated''' — i.e., whether minimising it yields well-calibrated predicted probabilities. ~~{{Term\|categorical cross-entropy\|~~Cross-entropy}} is a proper scoring rule and produces calibrated probabilities, while hinge loss does not.		An important consideration is whether the loss is '''calibrated''' — i.e., whether minimising it yields well-calibrated predicted probabilities. Cross-entropy is a proper scoring rule and produces calibrated probabilities, while hinge loss does not.

	== Regularisation terms ==		== Regularisation terms ==

	In practice, the total objective often includes a '''~~{{Term\|regularization\|~~regularisation}} term''' that penalises model complexity:		In practice, the total objective often includes a '''regularisation term''' that penalises model complexity:

	:<math>J(\theta) = L(\theta) + \lambda \, R(\theta)</math>		:<math>J(\theta) = L(\theta) + \lambda \, R(\theta)</math>

	where <math>\lambda</math> controls the strength of ~~{{Term\|regularization\|~~regularisation}}. Common choices include L2 ~~{{Term\|regularization\|~~regularisation}} (<math>R = \\|\theta\\|_2^2</math>) and L1 ~~{{Term\|regularization\|~~regularisation}} (<math>R = \\|\theta\\|_1</math>). See [[Overfitting and Regularization]] for more detail.		where <math>\lambda</math> controls the strength of regularisation. Common choices include L2 regularisation (<math>R = \\|\theta\\|_2^2</math>) and L1 regularisation (<math>R = \\|\theta\\|_1</math>). See [[Overfitting and Regularization]] for more detail.

	== See also ==		== See also ==

FuzzyBot: Updating to match new version of source page

2026-04-27T19:41:54Z

Updating to match new version of source page

← Older revision		Revision as of 19:41, 27 April 2026
Line 3:		Line 3:
	{{ContentMeta \| generated_by = claude-opus \| model_used = claude-opus-4-6 \| generated_date = 2026-03-13}}		{{ContentMeta \| generated_by = claude-opus \| model_used = claude-opus-4-6 \| generated_date = 2026-03-13}}

	'''Loss functions''' (also called '''cost functions''' or '''objective functions''') quantify how far a model's predictions are from the desired output. Minimising the loss function is the central goal of the training process in machine learning: the optimisation algorithm adjusts the model's parameters to drive the loss as low as possible.		'''{{Term\|loss function\|Loss functions}}''' (also called '''{{Term\|loss function\|cost functions}}''' or '''{{Term\|loss function\|objective functions}}''') quantify how far a model's predictions are from the desired output. Minimising the {{Term\|loss function}} is the central goal of the training process in machine learning: the optimisation algorithm adjusts the model's parameters to drive the loss as low as possible.

	== Purpose ==		== Purpose ==

	A loss function maps the model's prediction <math>\hat{y}</math> and the true target <math>y</math> to a non-negative real number. Formally, for a single example:		A {{Term\|loss function}} maps the model's prediction <math>\hat{y}</math> and the true target <math>y</math> to a non-negative real number. Formally, for a single example:

	:<math>\ell: \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_{\geq 0}</math>		:<math>\ell: \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_{\geq 0}</math>
Line 15:		Line 15:
	:<math>L(\theta) = \frac{1}{N}\sum_{i=1}^{N}\ell\bigl(y_i,\, \hat{y}_i(\theta)\bigr)</math>		:<math>L(\theta) = \frac{1}{N}\sum_{i=1}^{N}\ell\bigl(y_i,\, \hat{y}_i(\theta)\bigr)</math>

	The choice of loss function encodes the problem's structure — what kind of errors matter and how severely they should be penalised. A poorly chosen loss can lead to a model that optimises the wrong objective.		The choice of {{Term\|loss function}} encodes the problem's structure — what kind of errors matter and how severely they should be penalised. A poorly chosen loss can lead to a model that optimises the wrong objective.

	== Mean squared error ==		== Mean squared error ==
Line 31:		Line 31:
	== Cross-entropy loss ==		== Cross-entropy loss ==

	'''Cross-entropy loss''' is the standard choice for classification tasks. It measures the dissimilarity between the predicted probability distribution and the true label distribution.		'''{{Term\|categorical cross-entropy\|Cross-entropy loss}}''' is the standard choice for classification tasks. It measures the dissimilarity between the predicted probability distribution and the true label distribution.

	=== Binary cross-entropy ===		=== Binary cross-entropy ===
Line 47:		Line 47:
	:<math>L_{\text{CE}} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} y_{i,c} \log \hat{y}_{i,c}</math>		:<math>L_{\text{CE}} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} y_{i,c} \log \hat{y}_{i,c}</math>

	When the true labels are one-hot encoded, only the term corresponding to the correct class survives.		When the true labels are {{Term\|one-hot encoding\|one-hot}} encoded, only the term corresponding to the correct class survives.

	== Hinge loss ==		== Hinge loss ==
Line 76:		Line 76:
	== Choosing the right loss ==		== Choosing the right loss ==

	The appropriate loss function depends on the task:		The appropriate {{Term\|loss function}} depends on the task:

	* '''Regression''' — MSE is the default; switch to MAE or Huber if outliers are a concern.		* '''Regression''' — MSE is the default; switch to MAE or Huber if outliers are a concern.
	* '''Binary classification''' — binary cross-entropy with sigmoid output.		* '''Binary classification''' — binary {{Term\|categorical cross-entropy\|cross-entropy}} with sigmoid output.
	* '''Multi-class classification''' — categorical cross-entropy with softmax output.		* '''Multi-class classification''' — {{Term\|categorical cross-entropy}} with {{Term\|softmax}} output.
	* '''Multi-label classification''' — binary cross-entropy applied independently per label.		* '''Multi-label classification''' — binary {{Term\|categorical cross-entropy\|cross-entropy}} applied independently per label.
	* '''Ranking or retrieval''' — contrastive loss, triplet loss, or listwise ranking losses.		* '''Ranking or retrieval''' — contrastive loss, triplet loss, or listwise ranking losses.

	An important consideration is whether the loss is '''calibrated''' — i.e., whether minimising it yields well-calibrated predicted probabilities. Cross-entropy is a proper scoring rule and produces calibrated probabilities, while hinge loss does not.		An important consideration is whether the loss is '''calibrated''' — i.e., whether minimising it yields well-calibrated predicted probabilities. {{Term\|categorical cross-entropy\|Cross-entropy}} is a proper scoring rule and produces calibrated probabilities, while hinge loss does not.

	== Regularisation terms ==		== Regularisation terms ==

	In practice, the total objective often includes a '''regularisation term''' that penalises model complexity:		In practice, the total objective often includes a '''{{Term\|regularization\|regularisation}} term''' that penalises model complexity:

	:<math>J(\theta) = L(\theta) + \lambda \, R(\theta)</math>		:<math>J(\theta) = L(\theta) + \lambda \, R(\theta)</math>

	where <math>\lambda</math> controls the strength of regularisation. Common choices include L2 regularisation (<math>R = \\|\theta\\|_2^2</math>) and L1 regularisation (<math>R = \\|\theta\\|_1</math>). See [[Overfitting and Regularization]] for more detail.		where <math>\lambda</math> controls the strength of {{Term\|regularization\|regularisation}}. Common choices include L2 {{Term\|regularization\|regularisation}} (<math>R = \\|\theta\\|_2^2</math>) and L1 {{Term\|regularization\|regularisation}} (<math>R = \\|\theta\\|_1</math>). See [[Overfitting and Regularization]] for more detail.

	== See also ==		== See also ==

FuzzyBot: Updating to match new version of source page

2026-04-27T02:37:48Z

Updating to match new version of source page

← Older revision		Revision as of 02:37, 27 April 2026
Line 1:		Line 1:
	<languages />		<languages />
	~~{{LanguageBar \| page = Loss Functions}}~~
	{{ArticleInfobox \| topic_area = Machine Learning \| difficulty = Introductory \| prerequisites = }}		{{ArticleInfobox \| topic_area = Machine Learning \| difficulty = Introductory \| prerequisites = }}
	{{ContentMeta \| generated_by = claude-opus \| model_used = claude-opus-4-6 \| generated_date = 2026-03-13}}		{{ContentMeta \| generated_by = claude-opus \| model_used = claude-opus-4-6 \| generated_date = 2026-03-13}}

FuzzyBot: Updating to match new version of source page

2026-04-27T00:31:24Z

Updating to match new version of source page

New page

<languages />
{{LanguageBar | page = Loss Functions}}
{{ArticleInfobox | topic_area = Machine Learning | difficulty = Introductory | prerequisites = }}
{{ContentMeta | generated_by = claude-opus | model_used = claude-opus-4-6 | generated_date = 2026-03-13}}

'''Loss functions''' (also called '''cost functions''' or '''objective functions''') quantify how far a model's predictions are from the desired output. Minimising the loss function is the central goal of the training process in machine learning: the optimisation algorithm adjusts the model's parameters to drive the loss as low as possible.

== Purpose ==

A loss function maps the model's prediction <math>\hat{y}</math> and the true target <math>y</math> to a non-negative real number. Formally, for a single example:

:<math>\ell: \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_{\geq 0}</math>

Over a dataset of <math>N</math> examples, the total loss is typically the average:

:<math>L(\theta) = \frac{1}{N}\sum_{i=1}^{N}\ell\bigl(y_i,\, \hat{y}_i(\theta)\bigr)</math>

The choice of loss function encodes the problem's structure — what kind of errors matter and how severely they should be penalised. A poorly chosen loss can lead to a model that optimises the wrong objective.
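
The averaging step above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation; the names <code>mean_loss</code>, <code>squared_error</code> and <code>absolute_error</code> are hypothetical helpers introduced only for this example.

```python
from typing import Callable, Sequence


def mean_loss(per_example_loss: Callable[[float, float], float],
              targets: Sequence[float],
              predictions: Sequence[float]) -> float:
    """Average a per-example loss l(y, y_hat) over N examples."""
    if len(targets) != len(predictions):
        raise ValueError("targets and predictions must have the same length")
    if len(targets) == 0:
        raise ValueError("need at least one example")
    total = sum(per_example_loss(y, y_hat)
                for y, y_hat in zip(targets, predictions))
    return total / len(targets)


# Two illustrative per-example losses (hypothetical helper names).
def squared_error(y: float, y_hat: float) -> float:
    return (y - y_hat) ** 2


def absolute_error(y: float, y_hat: float) -> float:
    return abs(y - y_hat)


targets = [1.0, 2.0, 3.0]
predictions = [1.0, 2.0, 5.0]

# The same predictions score differently depending on the chosen loss.
print(mean_loss(squared_error, targets, predictions))
print(mean_loss(absolute_error, targets, predictions))
```

Swapping the per-example function changes which errors dominate the average, which is the sense in which the loss encodes the problem's structure.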

== Mean squared error ==

'''Mean squared error''' (MSE) is the default loss for regression tasks:

:<math>L_{\text{MSE}} = \frac{1}{N}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2</math>

MSE penalises large errors quadratically, making it sensitive to outliers. The derivative of each squared-error term with respect to its prediction is straightforward:

:<math>\frac{\partial}{\partial \hat{y}_i} (y_i - \hat{y}_i)^2 = -2(y_i - \hat{y}_i)</math>
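
As a sanity check, the analytic derivative can be compared with a central finite difference. The sketch below differentiates the averaged loss, so each component carries an extra factor of <math>1/N</math> relative to the per-term formula above; the function names are assumptions made for this example.

```python
def mse(targets, predictions):
    """Mean squared error over N examples."""
    n = len(targets)
    return sum((y - p) ** 2 for y, p in zip(targets, predictions)) / n


def mse_gradient(targets, predictions):
    """Gradient of the averaged MSE w.r.t. each prediction: -2 (y_i - p_i) / N."""
    n = len(targets)
    return [-2.0 * (y - p) / n for y, p in zip(targets, predictions)]


def numerical_gradient(f, x, eps=1e-6):
    """Central finite-difference approximation of the gradient of f at x."""
    grad = []
    for i in range(len(x)):
        up = list(x)
        down = list(x)
        up[i] += eps
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2.0 * eps))
    return grad


targets = [1.0, 2.0, 3.0]
predictions = [1.5, 2.0, 2.0]

analytic = mse_gradient(targets, predictions)
numeric = numerical_gradient(lambda p: mse(targets, p), predictions)
print(analytic)
print(numeric)
```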

A closely related variant is '''mean absolute error''' (MAE), <math>\frac{1}{N}\sum_{i=1}^{N}|y_i - \hat{y}_i|</math>, which is more robust to outliers but is not differentiable where the error is exactly zero. The '''Huber loss''' combines both: it is quadratic (like MSE) for errors smaller than a threshold <math>\delta</math> and linear (like MAE) for larger ones.
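
The different sensitivity to outliers can be seen on a small made-up dataset. The sketch below uses the Huber definition from the table later in this article; the threshold <code>delta = 1.0</code> and the data values are arbitrary choices for illustration.

```python
def mse(targets, predictions):
    return sum((y - p) ** 2 for y, p in zip(targets, predictions)) / len(targets)


def mae(targets, predictions):
    return sum(abs(y - p) for y, p in zip(targets, predictions)) / len(targets)


def huber(residual, delta=1.0):
    """Huber loss for one residual: quadratic up to delta, linear beyond it."""
    a = abs(residual)
    if a <= delta:
        return 0.5 * residual ** 2
    return delta * (a - 0.5 * delta)


def mean_huber(targets, predictions, delta=1.0):
    return sum(huber(y - p, delta) for y, p in zip(targets, predictions)) / len(targets)


targets = [0.0, 0.0, 0.0, 0.0]
clean = [0.5, -0.5, 0.5, -0.5]
with_outlier = [0.5, -0.5, 0.5, 10.0]  # last prediction is far off

for name, preds in [("clean", clean), ("outlier", with_outlier)]:
    print(name, mse(targets, preds), mae(targets, preds), mean_huber(targets, preds))
```

A single large residual multiplies the MSE by roughly a hundred here, while MAE and Huber grow only linearly with it.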

== Cross-entropy loss ==

'''Cross-entropy loss''' is the standard choice for classification tasks. It measures the dissimilarity between the predicted probability distribution and the true label distribution.

=== Binary cross-entropy ===

For binary classification with predicted probability <math>p</math> and true label <math>y \in \{0, 1\}</math>:

:<math>L_{\text{BCE}} = -\frac{1}{N}\sum_{i=1}^{N}\bigl[y_i \log p_i + (1 - y_i)\log(1 - p_i)\bigr]</math>

This loss is minimised when the predicted probability matches the true label perfectly (<math>p = 1</math> when <math>y = 1</math> and <math>p = 0</math> when <math>y = 0</math>).
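
A minimal sketch of the formula follows. Since <math>\log 0</math> is undefined, the probabilities are clipped to <math>[\varepsilon, 1 - \varepsilon]</math> before taking logarithms; the value <code>eps = 1e-12</code> and the function name are assumptions for this example, not part of the definition.

```python
import math


def binary_cross_entropy(labels, probs, eps=1e-12):
    """Mean binary cross-entropy for labels in {0, 1} and probabilities in [0, 1]."""
    total = 0.0
    for y, p in zip(labels, probs):
        # Clip so that log() never receives exactly 0.
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(labels)


labels = [1, 0]
print(binary_cross_entropy(labels, [0.5, 0.5]))  # uninformative prediction
print(binary_cross_entropy(labels, [0.9, 0.1]))  # confident and correct
print(binary_cross_entropy(labels, [0.1, 0.9]))  # confident and wrong
```

Confident wrong predictions are penalised far more heavily than uninformative ones, because the loss grows without bound as the probability assigned to the true label approaches zero.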

=== Categorical cross-entropy ===

For multi-class classification with <math>C</math> classes and predicted probability vector <math>\hat{\mathbf{y}}</math>:

:<math>L_{\text{CE}} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} y_{i,c} \log \hat{y}_{i,c}</math>

When the true labels are one-hot encoded, only the term corresponding to the correct class survives.
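
The one-hot simplification can be verified directly: the double sum reduces to the negative log-probability assigned to the correct class. The sketch below is illustrative only; the function name, the clipping constant and the example probabilities are assumptions.

```python
import math


def categorical_cross_entropy(one_hot_labels, probs, eps=1e-12):
    """Mean categorical cross-entropy over N examples and C classes."""
    total = 0.0
    for y_row, p_row in zip(one_hot_labels, probs):
        # Inner sum over classes; max() guards against log(0).
        total += sum(y * math.log(max(p, eps)) for y, p in zip(y_row, p_row))
    return -total / len(one_hot_labels)


one_hot_labels = [[0, 1, 0],
                  [1, 0, 0]]
probs = [[0.2, 0.7, 0.1],
         [0.5, 0.3, 0.2]]

full_sum = categorical_cross_entropy(one_hot_labels, probs)

# With one-hot labels only the correct-class term survives in each row.
correct_class_only = -(math.log(0.7) + math.log(0.5)) / 2.0

print(full_sum, correct_class_only)
```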

== Hinge loss ==

'''Hinge loss''' is associated with support vector machines (SVMs) and maximum-margin classifiers. For a binary classification problem with labels <math>y \in \{-1, +1\}</math> and raw model output <math>s</math>:

:<math>L_{\text{hinge}} = \frac{1}{N}\sum_{i=1}^{N}\max(0,\; 1 - y_i \, s_i)</math>

The hinge loss is zero when the prediction has the correct sign with margin at least 1 (<math>y_i s_i \geq 1</math>), and increases linearly otherwise. Because it is not differentiable at the hinge point <math>y_i s_i = 1</math>, subgradient methods are used for optimisation.
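
A minimal sketch of the hinge loss and one valid subgradient with respect to the raw scores is given below. At the kink <math>y_i s_i = 1</math> any value between the two one-sided slopes is a valid subgradient; this example picks zero, which is a convention chosen here rather than a requirement. Function names are hypothetical.

```python
def hinge_loss(labels, scores):
    """Mean hinge loss for labels in {-1, +1} and raw scores s."""
    return sum(max(0.0, 1.0 - y * s) for y, s in zip(labels, scores)) / len(labels)


def hinge_subgradient(labels, scores):
    """One valid subgradient of the mean hinge loss w.r.t. each score.

    Inside the margin (y * s < 1) the slope is -y / N; otherwise 0 is used,
    including at the non-differentiable point y * s == 1.
    """
    n = len(labels)
    return [(-y / n) if y * s < 1.0 else 0.0 for y, s in zip(labels, scores)]


labels = [1, -1, 1]
scores = [2.0, 0.5, 0.3]  # correct with margin, wrong sign, correct but inside margin

print(hinge_loss(labels, scores))
print(hinge_subgradient(labels, scores))
```

Only examples that violate the margin contribute to the subgradient, which is why correctly classified points far from the decision boundary do not affect the update.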

== Other common loss functions ==

{| class="wikitable"
|-
! Loss !! Formula !! Typical use
|-
| '''Huber''' || <math>\begin{cases}\tfrac{1}{2}(y-\hat{y})^2 & |y-\hat{y}|\leq\delta \\ \delta(|y-\hat{y}|-\tfrac{\delta}{2}) & \text{otherwise}\end{cases}</math> || Robust regression
|-
| '''KL divergence''' || <math>\sum_c p_c \log\frac{p_c}{q_c}</math> || Distribution matching, VAEs
|-
| '''Focal loss''' || <math>-\alpha(1-p_t)^\gamma \log p_t</math> || Imbalanced classification
|-
| '''CTC loss''' || Dynamic programming over alignments || Speech recognition, OCR
|-
| '''Triplet loss''' || <math>\max(0,\; d(a,p) - d(a,n) + m)</math> || Metric learning, face verification
|}
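
Three of the closed-form entries in the table can be sketched as follows. The function names and the argument values used in the calls are illustrative assumptions; the CTC loss is omitted because it requires a dynamic-programming routine rather than a one-line formula.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; terms with p_c == 0 contribute 0."""
    return sum(pc * math.log(pc / qc) for pc, qc in zip(p, q) if pc > 0.0)


def focal_loss(p_t, alpha, gamma):
    """Focal loss for one example, where p_t is the probability of the true class."""
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)


def triplet_loss(d_ap, d_an, margin):
    """Triplet loss from anchor-positive and anchor-negative distances."""
    return max(0.0, d_ap - d_an + margin)


p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
print(kl_divergence(p, p), kl_divergence(p, q))

# With alpha = 1 and gamma = 0 the focal loss reduces to plain cross-entropy.
print(focal_loss(0.9, alpha=1.0, gamma=0.0), -math.log(0.9))
# A larger gamma down-weights examples that are already well classified.
print(focal_loss(0.9, alpha=1.0, gamma=2.0))

# Zero once the negative is farther than the positive by at least the margin.
print(triplet_loss(0.5, 2.0, margin=1.0), triplet_loss(1.0, 0.5, margin=1.0))
```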

== Choosing the right loss ==

The appropriate loss function depends on the task:

* '''Regression''' — MSE is the default; switch to MAE or Huber if outliers are a concern.
* '''Binary classification''' — binary cross-entropy with sigmoid output.
* '''Multi-class classification''' — categorical cross-entropy with softmax output.
* '''Multi-label classification''' — binary cross-entropy applied independently per label.
* '''Ranking or retrieval''' — contrastive loss, triplet loss, or listwise ranking losses.
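
The difference between the multi-class and multi-label recipes above comes down to the output activation. The sketch below is a simplified illustration with hypothetical function names: softmax produces one distribution over mutually exclusive classes, whereas independent sigmoids give each label its own probability. Averaging over labels in <code>multilabel_bce</code> is a convention chosen here; summing is equally common. No clipping is applied, so very large logits would need extra care.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softmax(logits):
    """Numerically stabilised softmax: subtract the maximum before exponentiating."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def multilabel_bce(labels, logits):
    """Binary cross-entropy applied independently to each label of one example."""
    total = 0.0
    for y, z in zip(labels, logits):
        p = sigmoid(z)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)


logits = [2.0, -1.0, 0.5]

class_probs = softmax(logits)               # sums to 1: exactly one class is true
label_probs = [sigmoid(z) for z in logits]  # need not sum to 1: labels are independent

print(class_probs, sum(class_probs))
print(label_probs, sum(label_probs))
print(multilabel_bce([1, 0, 1], logits))
```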

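An important consideration is whether the loss is '''calibrated''', i.e., whether minimising it yields well-calibrated predicted probabilities. Cross-entropy is a proper scoring rule: its expected value is minimised by the true class probabilities, so it encourages calibrated outputs, although a trained model can still be miscalibrated in practice. Hinge loss yields margin scores rather than probabilities and gives no such guarantee.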
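
The contrast can be checked numerically for a single input whose true positive-class probability is <math>q</math>. The sketch below is a toy grid search under that assumption, with hypothetical function names: the expected binary cross-entropy is minimised at <math>p = q</math>, whereas the expected hinge loss is minimised at the same score for any <math>q > 0.5</math>, so the score reveals only which class is more likely.

```python
import math


def expected_bce(q, p):
    """Expected binary cross-entropy of prediction p when P(y = 1) = q."""
    return -(q * math.log(p) + (1.0 - q) * math.log(1.0 - p))


def expected_hinge(q, s):
    """Expected hinge loss of score s when P(y = +1) = q and P(y = -1) = 1 - q."""
    return q * max(0.0, 1.0 - s) + (1.0 - q) * max(0.0, 1.0 + s)


prob_grid = [i / 100.0 for i in range(1, 100)]    # 0.01 ... 0.99
score_grid = [i / 10.0 for i in range(-20, 21)]   # -2.0 ... 2.0


def best_probability(q):
    return min(prob_grid, key=lambda p: expected_bce(q, p))


def best_score(q):
    return min(score_grid, key=lambda s: expected_hinge(q, s))


for q in (0.7, 0.9):
    print(q, best_probability(q), best_score(q))
```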

== Regularisation terms ==

In practice, the total objective often includes a '''regularisation term''' that penalises model complexity:

:<math>J(\theta) = L(\theta) + \lambda \, R(\theta)</math>

where <math>\lambda</math> controls the strength of regularisation. Common choices include L2 regularisation (<math>R = \|\theta\|_2^2</math>) and L1 regularisation (<math>R = \|\theta\|_1</math>). See [[Overfitting and Regularization]] for more detail.
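
The regularised objective can be sketched as a data loss plus a weighted penalty. The parameter vector, the data-loss value and <math>\lambda = 0.1</math> below are made-up numbers for illustration, and the function names are hypothetical.

```python
def l2_penalty(theta):
    """Squared L2 norm of the parameters."""
    return sum(t * t for t in theta)


def l1_penalty(theta):
    """L1 norm of the parameters."""
    return sum(abs(t) for t in theta)


def regularised_objective(data_loss, theta, lam, penalty):
    """J(theta) = L(theta) + lambda * R(theta), with L(theta) passed in as a number."""
    return data_loss + lam * penalty(theta)


theta = [0.5, -2.0, 0.0]
data_loss = 0.3  # stands in for L(theta) computed elsewhere
lam = 0.1

print(regularised_objective(data_loss, theta, lam, l2_penalty))
print(regularised_objective(data_loss, theta, lam, l1_penalty))
```

The L2 penalty is dominated by the largest weight because it is squared, while the L1 penalty charges every non-zero weight in proportion to its magnitude.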

== See also ==

* [[Gradient Descent]]
* [[Neural Networks]]
* [[Backpropagation]]
* [[Overfitting and Regularization]]
* [[Stochastic Gradient Descent]]

== References ==

* Bishop, C. M. (2006). ''Pattern Recognition and Machine Learning'', Chapter 1. Springer.
* Goodfellow, I., Bengio, Y. and Courville, A. (2016). ''Deep Learning'', Chapters 6 and 8. MIT Press.
* Lin, T.-Y. et al. (2017). "Focal Loss for Dense Object Detection". ''ICCV''.
* Murphy, K. P. (2022). ''Probabilistic Machine Learning: An Introduction''. MIT Press.

[[Category:Machine Learning]]
[[Category:Introductory]]