F1 Score
| Article | |
|---|---|
| Topic area | Machine Learning |
| Prerequisites | Precision, Recall |
Overview
The F1 score is a single-number summary of a binary classifier's quality, defined as the harmonic mean of precision and recall. Its value lies in the closed interval from 0 to 1, with 1 attained only when the classifier achieves both perfect precision and perfect recall on the evaluation set. The metric was popularised in information retrieval by van Rijsbergen, whose 1979 textbook formalised the broader F-beta family of which F1 is the symmetric special case, and it is now the default headline number for binary and multi-class classification tasks where the positive class is rare or where false positives and false negatives carry comparable cost.
F1 is preferred over accuracy whenever class imbalance is present, because accuracy can be made arbitrarily high by always predicting the majority class. Precision alone, the fraction of predicted positives that are correct, says nothing about how many true positives were missed; recall alone, the fraction of true positives that were retrieved, can be inflated by predicting the positive class indiscriminately. The harmonic mean combines them in a way that punishes the smaller of the two more heavily than an arithmetic mean would: a system with precision 1.0 and recall 0.1 has F1 of about 0.18, not 0.55. This sensitivity to the weaker component is what makes F1 a useful diagnostic in settings such as fraud detection, medical triage, and named-entity recognition, where neither false positives nor false negatives can be ignored.
Precision and recall
For a binary classifier on a labelled test set, every prediction falls into one of four cells of the confusion matrix: a true positive (TP) is a positive instance correctly classified as positive; a false positive (FP) is a negative instance incorrectly classified as positive; a false negative (FN) is a positive instance incorrectly classified as negative; and a true negative (TN) is a negative instance correctly classified as negative. Precision and recall are then
$ {\displaystyle P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}.} $
Precision answers the question: of the items the system flagged as positive, what fraction actually were? Recall answers the complementary question: of the items that actually were positive, what fraction did the system flag? The two are coupled through a tunable decision threshold on the classifier's score: lowering the threshold typically raises recall at the expense of precision, and raising the threshold does the opposite. Plotting one against the other as the threshold sweeps produces the precision-recall curve, and F1 picks out a single operating point on that curve.
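The coupling through the threshold can be made concrete with a small sketch. The labels and scores below are made up for illustration, and the function name is not from any particular library; raising the threshold trades recall for precision on this toy data.

```python
import numpy as np

def precision_recall_at_threshold(y_true, scores, threshold):
    """Precision and recall for a hard decision at the given threshold.

    y_true: array of 0/1 labels; scores: classifier scores, higher = more positive.
    """
    y_pred = (scores >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall

# Sweeping the threshold traces out points on the precision-recall curve.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.1])
for t in (0.25, 0.5, 0.75):
    p, r = precision_recall_at_threshold(y_true, scores, t)
    print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")
```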
Definition
The F1 score is the harmonic mean of precision and recall:
$ {\displaystyle F_1 = \frac{2 P R}{P + R} = \frac{2 \, TP}{2 \, TP + FP + FN}.} $
Two facts about the harmonic mean explain why it is the natural choice. First, the harmonic mean of two positive numbers is bounded above by both the geometric mean and the arithmetic mean, and approaches zero whenever either input approaches zero. A classifier that gets one of precision or recall right but the other catastrophically wrong receives a low F1, whereas the arithmetic mean of 1.0 and 0.0 would still be 0.5. Second, the harmonic mean is the reciprocal of the arithmetic mean of the reciprocals, so maximising F1 is the same as minimising the average of 1/P and 1/R, the two complementary per-true-positive error views of the classifier: that average works out to 1 + (FP + FN)/(2 TP), so F1 is high exactly when the combined error count FP + FN is small relative to the number of true positives.
The compact form on the right of the equation is the version most often implemented in code. It avoids a separate computation of P and R and sidesteps the division-by-zero cases of the original definitions: when TP = 0 but FP + FN > 0 the compact form evaluates directly to 0, and in the only genuinely undefined case, TP = FP = FN = 0, F1 is conventionally also reported as 0, since a classifier that identifies no true positive cannot be said to perform on the positive class at all.
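A minimal sketch of the compact form in code, including the zero-denominator convention just described (the function name and counts are illustrative):

```python
def f1_from_counts(tp, fp, fn):
    """F1 via the compact form 2*TP / (2*TP + FP + FN).

    Returns 0.0 when the denominator is zero (TP = FP = FN = 0); when
    TP = 0 but FP + FN > 0 the formula already evaluates to 0 on its own.
    """
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0

print(f1_from_counts(tp=80, fp=20, fn=40))   # 2*80 / (160 + 20 + 40) = 0.727...
print(f1_from_counts(tp=0, fp=5, fn=3))      # 0.0
```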
F-beta generalization
F1 is the symmetric member of a one-parameter family of weighted harmonic means of precision and recall, the F-beta score:
$ {\displaystyle F_\beta = (1 + \beta^2) \cdot \frac{P R}{\beta^2 P + R}.} $
The parameter $ \beta > 0 $ controls how recall is weighted relative to precision. When $ \beta = 1 $ the formula collapses to F1, weighting the two equally. When $ \beta < 1 $, precision dominates: F0.5 is sometimes used in spam filtering, where false positives that block legitimate mail are more painful than missed spam. When $ \beta > 1 $, recall dominates: F2 is common in clinical screening and in safety-critical detection systems, where a missed positive carries far higher cost than a false alarm. The exact relationship, due to van Rijsbergen, is that $ F_\beta $ responds equally to marginal changes in precision and recall at the point where $ R = \beta P $, so beta encodes the recall-to-precision ratio at which the evaluator is indifferent between improving either quantity.
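A short sketch (with illustrative precision and recall values) of how the same operating point is scored under different beta weightings:

```python
def f_beta(precision, recall, beta):
    """Weighted harmonic mean: (1 + beta^2) * P * R / (beta^2 * P + R)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

p, r = 0.9, 0.6
print(f_beta(p, r, beta=0.5))  # precision-weighted: ~0.82
print(f_beta(p, r, beta=1.0))  # F1: 0.72
print(f_beta(p, r, beta=2.0))  # recall-weighted: ~0.64
```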
The choice of beta should be made before evaluation, ideally tied to a stated cost ratio between false positives and false negatives. In practice, F1 is the overwhelming default and is used even in settings where the symmetric weighting is not strictly justified, because reporting F1 makes results comparable across the literature and because formal cost analyses are uncommon in research benchmarks.
Multi-class extensions
The F1 formula is defined per class. Extending it to a multi-class problem requires aggregating per-class F1 values, and three averaging schemes are in standard use. The macro F1 averages the per-class F1 scores with equal weight, regardless of how many examples each class has: it gives a small minority class the same influence as a large majority class. The weighted F1 averages the per-class scores with weights proportional to each class's support, which produces a number more representative of overall performance on the test distribution but obscures poor performance on rare classes. The micro F1 pools the TP, FP, and FN counts across all classes before computing precision, recall, and F1 once at the global level; for a single-label multi-class problem in which every example has exactly one label, micro precision, micro recall, and micro F1 all coincide with accuracy, and micro F1 diverges from accuracy only in multi-label settings or when only a subset of classes is being evaluated.
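As a sketch of how the schemes can differ on the same predictions (assuming scikit-learn is available; the toy labels are made up), the macro average is pulled down by the weaker minority class while micro F1 matches accuracy in this single-label setting:

```python
from sklearn.metrics import accuracy_score, f1_score

y_true = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 1, 0, 1, 2, 2, 2, 0, 2]

print("macro   ", f1_score(y_true, y_pred, average="macro"))
print("micro   ", f1_score(y_true, y_pred, average="micro"))
print("weighted", f1_score(y_true, y_pred, average="weighted"))
# In a single-label problem micro F1 coincides with plain accuracy:
print("accuracy", accuracy_score(y_true, y_pred))
```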
The choice between macro and micro is a substantive one. Macro F1 is the right metric when the goal is to perform well on every class, including rare ones; benchmarks for rare-disease classification, low-resource named-entity recognition, and long-tail object detection routinely report macro F1 for this reason. Micro F1 is the right metric when the goal is to maximise the number of correct decisions, regardless of which classes those decisions land in; large-scale multi-label tagging tasks often prefer it. Reporting both, alongside per-class numbers, is the most informative practice and is now standard in the major NLP and computer-vision benchmarks.
Properties and interpretation
Several properties of F1 are worth keeping in mind when interpreting it. The metric is asymmetric in the two classes: swapping which class is called "positive" generally changes the score, because precision and recall are defined relative to a chosen positive class. This is rarely a problem in practice, since most binary tasks have a natural positive class such as "fraud" or "disease present", but it does mean that F1 is not a meaningful descriptor of a classifier's overall behaviour without specifying which class is being scored.
F1 ignores true negatives entirely. The compact form contains TP, FP, and FN but no TN, which is why it remains informative on highly imbalanced data: a classifier that always predicts negative on a 1-in-1000 positive task achieves accuracy of 0.999 but F1 of 0. Conversely, F1 is invariant to changes in the negative class size: doubling the number of true negatives in the test set leaves F1 unchanged. This invariance is desirable when the negative class is essentially a residual category that absorbs everything not of interest, but it can hide behaviour that matters: a fraud-detection system that doubles its false-alarm rate on legitimate transactions is paying a real operational cost that F1 does not see.
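Both invariances described above can be checked numerically; the counts below are made up for illustration:

```python
def accuracy_and_f1(tp, fp, fn, tn):
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom > 0 else 0.0
    return accuracy, f1

# Always-negative classifier on a 1-in-1000 positive task (100,000 examples):
print(accuracy_and_f1(tp=0, fp=0, fn=100, tn=99_900))    # accuracy 0.999, F1 0.0

# Doubling the number of true negatives changes accuracy but leaves F1 unchanged:
print(accuracy_and_f1(tp=80, fp=50, fn=20, tn=10_000))
print(accuracy_and_f1(tp=80, fp=50, fn=20, tn=20_000))
```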
The F1 score has a range of 0 to 1 but the distribution of values that arise in practice is far from uniform. On easy benchmarks, F1 routinely sits above 0.9 and a difference of 0.005 may be statistically significant; on hard tasks, scores cluster around 0.5 and a difference of 0.02 is within noise. Calibrating intuition for what counts as a strong F1 requires knowing the task and the strength of recent baselines.
Limitations
F1's most criticised property is that it is a fixed weighted combination of two metrics that may have very different costs. The implicit assumption that precision and recall are equally important is rarely defensible from first principles, and in cost-sensitive settings the F-beta family or a directly calibrated cost-weighted error is more honest. A second criticism is that F1 collapses the precision-recall curve to a single threshold, hiding how the classifier behaves at other operating points; the area under the precision-recall curve, AUPRC, is a threshold-free alternative that summarises the entire curve.
A third concern is statistical: F1 is a non-linear function of TP, FP, and FN, so confidence intervals require care. Bootstrap resampling at the example level is the standard recipe; analytical approximations exist but are unreliable when any of the cells is small. Significance tests for the difference between two F1 scores are commonly performed via paired bootstrap or, for matched evaluations, McNemar's test on the underlying error counts.
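A minimal sketch of the example-level percentile bootstrap for an F1 confidence interval; the function name, the number of resamples, and the percentile method are common defaults rather than prescriptions:

```python
import numpy as np

def bootstrap_f1_ci(y_true, y_pred, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for F1, resampling examples."""
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)          # resample with replacement
        t, p = y_true[idx], y_pred[idx]
        tp = np.sum((p == 1) & (t == 1))
        fp = np.sum((p == 1) & (t == 0))
        fn = np.sum((p == 0) & (t == 1))
        denom = 2 * tp + fp + fn
        stats.append(2 * tp / denom if denom > 0 else 0.0)
    lo, hi = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi
```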
Finally, F1 is sensitive to label noise and to changes in class balance between training and test. A test set that happens to over-represent the positive class will inflate precision (recall, being a rate over the positive instances themselves, is unaffected by prevalence) and may move F1 substantially even when the classifier itself has not changed. In production deployments where the prevalence of the positive class drifts over time, F1 measured on a frozen test set can become misleading and should be supplemented with rolling on-distribution evaluation.
Comparisons with other metrics
Accuracy is the most familiar alternative. On balanced data it is informative and easy to communicate, but on imbalanced data it can be uninformative: a classifier that always predicts the majority class achieves high accuracy while being useless on the class of interest. F1 was introduced in part to fix this failure mode, and it has largely displaced accuracy as the default reporting metric on imbalanced binary tasks.
The Matthews correlation coefficient, MCC, is a single-number summary that uses all four cells of the confusion matrix and is symmetric in the two classes. It ranges from -1 to 1, with 0 corresponding to chance-level performance and 1 to perfect classification. MCC is often considered more informative than F1 on imbalanced data because it does not ignore true negatives and because it is symmetric, but it lacks F1's intuitive interpretation as a balance between two well-known quantities, and is less common in published benchmarks. Cohen's kappa similarly accounts for chance agreement and is preferred in settings where annotator-style agreement is the relevant frame.
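The contrast with F1's blindness to true negatives can be seen directly from the standard MCC formula; the counts below are made up, and only the number of true negatives differs between the two cases:

```python
import math

def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient from the four confusion-matrix cells."""
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den > 0 else 0.0

def f1(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0

# Same TP/FP/FN, different numbers of true negatives: F1 is identical in both
# cases, while MCC shifts because it accounts for behaviour on the negatives.
print(f1(90, 30, 10), mcc(90, 30, 10, 70))
print(f1(90, 30, 10), mcc(90, 30, 10, 7000))
```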
ROC-AUC summarises classifier ranking quality across all thresholds and is invariant to class balance, which makes it complementary to F1 rather than a replacement. AUPRC, the area under the precision-recall curve, is the threshold-free analogue of F1 and is often preferred for highly imbalanced detection tasks where ROC-AUC can be optimistically high.
Practical considerations
Three details matter when reporting F1. First, specify which averaging scheme is used in multi-class settings, as macro, micro, and weighted F1 can differ substantially and a number reported without this qualifier is ambiguous. Second, specify the threshold at which F1 was computed: when the classifier outputs probabilities, the threshold is a free parameter, and reporting the F1 at the threshold that maximises F1 on the test set overstates expected performance because it leaks the test set into model selection. The standard remedy is to choose the threshold on a held-out validation set and report F1 at that frozen threshold on the test set. Third, report a confidence interval, ideally via bootstrap, particularly when test sets are small or when comparing closely matched systems.
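A sketch of the threshold-selection recipe described above, with hypothetical array names (`y_val`, `val_scores`, `y_test`, `test_scores`) standing in for the validation and test splits:

```python
import numpy as np

def f1_at_threshold(y_true, scores, threshold):
    pred = (scores >= threshold).astype(int)
    tp = np.sum((pred == 1) & (y_true == 1))
    fp = np.sum((pred == 1) & (y_true == 0))
    fn = np.sum((pred == 0) & (y_true == 1))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0

def select_threshold(y_val, val_scores, grid=np.linspace(0.05, 0.95, 19)):
    """Pick the threshold that maximises F1 on the validation split only."""
    return max(grid, key=lambda t: f1_at_threshold(y_val, val_scores, t))

# The threshold is chosen on validation data and then frozen; the test set
# is scored once, at that frozen threshold:
# t_star = select_threshold(y_val, val_scores)
# report = f1_at_threshold(y_test, test_scores, t_star)
```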
When F1 is used as a training objective, the non-differentiability of TP, FP, and FN with respect to model parameters is a problem. Direct optimisation is usually replaced by a differentiable surrogate, such as the soft-F1 loss that uses predicted probabilities in place of binary decisions, or by training with cross-entropy and choosing the decision threshold post hoc to maximise validation F1. The latter recipe is more robust and is the dominant practice in modern systems.
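A minimal sketch of one common soft-F1 formulation, assuming PyTorch; the exact form varies across implementations and this is not a standard named loss:

```python
import torch

def soft_f1_loss(probs, targets, eps=1e-8):
    """Differentiable surrogate: replace hard counts with probability masses.

    probs: predicted positive-class probabilities in [0, 1];
    targets: 0/1 labels as floats.
    """
    tp = (probs * targets).sum()
    fp = (probs * (1 - targets)).sum()
    fn = ((1 - probs) * targets).sum()
    soft_f1 = 2 * tp / (2 * tp + fp + fn + eps)
    return 1 - soft_f1          # minimising the loss maximises soft F1

probs = torch.tensor([0.9, 0.2, 0.7, 0.4], requires_grad=True)
targets = torch.tensor([1.0, 0.0, 1.0, 1.0])
loss = soft_f1_loss(probs, targets)
loss.backward()                 # gradients flow through the soft counts
```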
References
- van Rijsbergen, C. J. Information Retrieval, 2nd edition. Butterworths, 1979.
- Manning, C. D., Raghavan, P., and Schütze, H. Introduction to Information Retrieval. Cambridge University Press, 2008.
- Sasaki, Y. The Truth of the F-measure. School of Computer Science, University of Manchester, 2007.
- Powers, D. M. W. Evaluation: from Precision, Recall and F-measure to ROC, Informedness, Markedness and Correlation. Journal of Machine Learning Technologies, 2011.
- Chicco, D. and Jurman, G. The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genomics, 2020.
- Lipton, Z. C., Elkan, C., and Narayanaswamy, B. Optimal Thresholding of Classifiers to Maximize F1 Measure. arXiv preprint, 2014.