BLEU Score
| Article | BLEU Score |
|---|---|
| Topic area | Natural Language Processing |
| Prerequisites | Tokenization, Machine Translation |
Overview
The BLEU score (Bilingual Evaluation Understudy) is an automatic evaluation metric for machine-translated text, introduced by Papineni and colleagues at IBM Research in 2002. It scores a candidate translation against one or more human reference translations on a scale from 0 to 1 (or 0 to 100 in the percentage form most commonly reported), with higher values indicating closer agreement with the references. Its design goals were to be cheap, language-independent, and to correlate with human judgement on average over a corpus, so that translation researchers could iterate on systems without commissioning a human evaluation for every change.
BLEU is built from two ideas. The first is modified n-gram precision: it counts how many of the n-grams in the candidate also appear in any reference, with each reference n-gram only available to be matched a bounded number of times so that repeating a single common phrase does not inflate the score. The second is a brevity penalty that multiplies the precision score down whenever the candidate is shorter than the reference, since precision alone has no defence against a short, dense translation that prints only a few high-confidence words. The geometric mean of the modified precisions for n = 1 through 4, multiplied by the brevity penalty, is the BLEU score. Despite well-documented weaknesses, this construction has remained the default headline number in machine-translation papers for two decades, in part because it is reproducible enough that two groups can agree on a number, and in part because no successor metric is simultaneously simple, free, and language-agnostic.
History and motivation
Before BLEU, the dominant evaluation method in machine translation was human rating along axes such as adequacy and fluency. Human evaluation is the gold standard but is slow and expensive: a typical campaign takes weeks of work and costs tens of thousands of dollars, which made it impractical to use during system development. The IBM team proposed BLEU as an "understudy" that researchers could query repeatedly during a development cycle, with human evaluation reserved for occasional calibration. The 2002 paper showed that BLEU correlated reasonably with human judgement at the corpus level across multiple systems and languages, and the metric was rapidly adopted by the WMT and NIST evaluation campaigns. Its introduction is widely credited with accelerating the statistical machine-translation era of the mid-2000s.
Modified n-gram precision
For a given order $ n $, the modified precision $ p_n $ compares the n-grams of the candidate against the references. Let $ C $ be the candidate and $ \{R_1, \ldots, R_m\} $ be the reference set. For each n-gram $ g $ appearing in $ C $, define the count $ \mathrm{count}(g, C) $ in the candidate and the maximum count $ \mathrm{max\_ref\_count}(g) = \max_i \mathrm{count}(g, R_i) $ over the references. The clipped count is
$ {\displaystyle \mathrm{count}_{\mathrm{clip}}(g) = \min\big(\mathrm{count}(g, C), \mathrm{max\_ref\_count}(g)\big),} $
and the modified precision is
$ {\displaystyle p_n = \frac{\sum_{g \in C} \mathrm{count}_{\mathrm{clip}}(g)}{\sum_{g \in C} \mathrm{count}(g, C)}.} $
The clipping step is what distinguishes modified precision from naive precision. The original paper motivates it with a worked example: a candidate that consists solely of the word "the" repeated seven times would obtain unigram precision 1 against any reference containing "the", because every word in the candidate is in some reference. Clipping bounds the contribution of "the" by the maximum number of times it appears in any single reference, restoring a sensible score. The same logic applies to longer n-grams; in practice clipping matters most for unigrams, where pathological repetition is most common.
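The following Python sketch is illustrative only (the function names are this article's own, not code from the original paper); it implements clipped n-gram counting and reproduces the worked example, in which clipping caps the repeated "the" at two matches:

```python
from collections import Counter

def ngrams(tokens, n):
    """Counter of the n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, references, n):
    """Clipped n-gram counts for one candidate sentence.

    Returns (numerator, denominator) so the counts can be pooled at corpus level.
    """
    cand_counts = ngrams(candidate, n)
    # Each n-gram may be matched at most as often as it occurs in any single reference.
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped, sum(cand_counts.values())

# The paper's worked example: the repeated "the" is clipped at 2.
candidate = "the the the the the the the".split()
references = ["the cat is on the mat".split(),
              "there is a cat on the mat".split()]
print(modified_precision(candidate, references, n=1))  # (2, 7) -> p_1 = 2/7
```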
In a corpus-level evaluation, the numerator and denominator are summed across all sentence pairs before the ratio is taken, rather than averaging sentence-level precisions. This corpus-level pooling is what gives BLEU much of its robustness: a single short sentence with no matching n-grams does not collapse the score because it contributes only a few terms to a much larger denominator.
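Under the same assumptions, corpus-level pooling amounts to summing the clipped numerators and the denominators over all sentence pairs before dividing once, rather than averaging per-sentence ratios:

```python
def corpus_precision(cand_ref_pairs, n):
    """Corpus-level p_n: pool clipped counts over all sentence pairs, then divide once."""
    num = den = 0
    for candidate, references in cand_ref_pairs:
        clipped, total = modified_precision(candidate, references, n)
        num += clipped
        den += total
    return num / den if den else 0.0
```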
Brevity penalty
Precision alone rewards short candidates, since it is easier to be precise when one says less. To prevent this, BLEU multiplies the precision component by a brevity penalty $ \mathrm{BP} $ defined as
$ {\displaystyle \mathrm{BP} = \begin{cases} 1 & \text{if } c > r, \\ \exp\!\left(1 - \tfrac{r}{c}\right) & \text{if } c \leq r, \end{cases}} $
where $ c $ is the total length of the candidate corpus and $ r $ is the effective reference length. When multiple references exist, $ r $ is the sum over sentences of the reference length closest to that sentence's candidate length. The penalty is exactly 1 when the candidate is at least as long as the reference, and decays smoothly toward 0 as the candidate becomes much shorter. There is deliberately no symmetric penalty for over-long candidates, because the modified precision already drops when extra words fail to match references.
The brevity penalty operates at the corpus level, not per sentence. This is a deliberate design choice: a short sentence may be a faithful translation of a short source, so penalising every short candidate would itself be unfair. Aggregating lengths across the corpus averages out this fluctuation.
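A corpus-level brevity penalty can be sketched in the same style; note that the tie-breaking rule when two reference lengths are equally close to the candidate's (here, prefer the shorter) is a toolkit convention rather than part of the original definition:

```python
import math

def closest_ref_length(candidate, references):
    """Effective reference length: the reference length closest to the candidate's."""
    cand_len = len(candidate)
    return min((len(ref) for ref in references),
               key=lambda ref_len: (abs(ref_len - cand_len), ref_len))

def brevity_penalty(cand_ref_pairs):
    """BP computed from candidate and effective reference lengths pooled over the corpus."""
    c = sum(len(candidate) for candidate, _ in cand_ref_pairs)
    r = sum(closest_ref_length(candidate, refs) for candidate, refs in cand_ref_pairs)
    if c == 0:
        return 0.0
    return 1.0 if c > r else math.exp(1 - r / c)
```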
The full BLEU formula
The conventional BLEU score combines modified precisions for n = 1 through 4 with the brevity penalty:
$ {\displaystyle \mathrm{BLEU} = \mathrm{BP} \cdot \exp\!\left(\sum_{n=1}^{4} w_n \log p_n\right),} $
with uniform weights $ w_n = 1/4 $. The exponential of the weighted log-sum is the geometric mean of the precisions, and the geometric mean is what makes BLEU drop to zero whenever any single $ p_n $ is zero. This is in keeping with the intent of the metric: a translation that fails to recover any 4-gram from the references is not a good translation, even if its unigram precision is high.
The choice of n-gram orders up to 4 and uniform weights is a convention rather than a mathematical necessity. The 2002 paper experimented with several configurations and found that the four-gram geometric mean correlated best with human judgement on their data; the convention has since been frozen in place, partly because changing it would make new results incomparable with the literature. BLEU-1, BLEU-2, and so on refer to BLEU computed with the geometric mean truncated at the corresponding order, and are sometimes reported separately to give a more granular picture.
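Combining the sketches above gives an unsmoothed corpus BLEU. This is for illustration only; published numbers should come from a standard toolkit (see SacreBLEU below):

```python
def bleu(cand_ref_pairs, max_n=4):
    """Unsmoothed corpus BLEU: BP times the geometric mean of p_1 .. p_max_n."""
    precisions = [corpus_precision(cand_ref_pairs, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # the geometric mean is zero if any p_n is zero
    log_mean = sum(math.log(p) for p in precisions) / max_n  # uniform weights w_n = 1/max_n
    return brevity_penalty(cand_ref_pairs) * math.exp(log_mean)
```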
Smoothing
Because the geometric mean is zero whenever any $ p_n $ is zero, sentence-level BLEU is highly unstable: a single sentence missing a 4-gram match scores zero, even with high lower-order precision. This is acceptable at corpus level, where zero modified precisions are rare once a sufficient number of sentences are pooled, but it is a serious problem when BLEU is used as a per-sentence training signal or to evaluate small test sets. A family of smoothing methods, systematically compared by Chen and Cherry, addresses this. The common smoothing strategies add a small constant to the numerator and denominator (additive smoothing), substitute a small positive value when a zero count is encountered, or use exponential smoothing that interpolates lower-order precisions into higher orders. SacreBLEU exposes several of these as named options.
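A minimal additive-smoothing variant, in the style of the sketches above, shows the idea; it is not identical to any of Chen and Cherry's named methods, which differ in exactly where and how the extra mass is added:

```python
def smoothed_sentence_bleu(candidate, references, max_n=4, eps=1.0):
    """Sentence-level BLEU with simple additive smoothing.

    Adding eps to each order's numerator and denominator keeps the score
    nonzero even when no 4-gram in the candidate matches a reference.
    """
    precisions = []
    for n in range(1, max_n + 1):
        clipped, total = modified_precision(candidate, references, n)
        precisions.append((clipped + eps) / (total + eps))
    log_mean = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty([(candidate, references)]) * math.exp(log_mean)
```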
Variants and standardisation
The metric as specified in the original paper underdetermines several practical decisions: how to tokenise, how to lowercase, how to handle punctuation, and how to count when there are multiple references. Different toolkits made different choices, and for years two papers reporting "BLEU 30" might have used incompatible procedures. SacreBLEU, introduced by Post in 2018, standardises the entire pipeline: it applies its own fixed tokenisation to both the references and the hypothesis, fixes the smoothing, and reports a version-tagged signature so that the result is reproducible. The community has converged on SacreBLEU as the de facto standard for published numbers; the older, tokenisation-dependent BLEU is now considered unreliable for cross-paper comparison.
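A typical invocation, assuming the sacrebleu Python package (version 2.x; names may differ across versions), looks as follows; the signature string is what should be quoted in papers:

```python
from sacrebleu.metrics import BLEU

hypotheses = ["the cat sat on the mat"]
references = [["the cat is on the mat"]]  # list of reference streams, each parallel to hypotheses

bleu = BLEU()                              # default tokenisation and smoothing
result = bleu.corpus_score(hypotheses, references)
print(result.score)                        # BLEU on the 0-100 scale
print(bleu.get_signature())                # version-tagged signature for reproducibility
```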
Variants extend BLEU along several axes. BLEU-1 through BLEU-4 vary the maximum n-gram order, with BLEU-1 (unigram-only with brevity penalty) sometimes used as a coarse proxy for adequacy. NIST replaces uniform weights with information-weighted ones, so that rare matched n-grams contribute more than common ones. ChrF computes character-level F-scores instead of word-level precision and is more robust on morphologically rich languages. Self-BLEU and back-translation BLEU repurpose the metric for diversity and quality estimation in generation tasks beyond translation.
Strengths, limitations, and modern alternatives
BLEU's strengths are practical: it is cheap, deterministic, language-agnostic in the sense that it requires no language-specific resources beyond a tokeniser, and it has a long enough literature that researchers have a strong intuition for what a given number means in a given setting. It correlates with human judgement well enough at the corpus level, when comparing systems of similar architecture, that it served as the workhorse evaluation throughout the statistical and early neural machine-translation eras.
Its limitations are well-catalogued. It is a surface-form metric that does not understand synonyms: a perfectly fluent translation that happens to use different words from the reference can score badly. It is insensitive to word order beyond the n-gram window, so reordering errors that a human reader would notice are invisible to BLEU. It rewards vocabulary overlap with the reference rather than meaning preservation, which is exploitable: systems can be trained or tuned to maximise BLEU in ways that drift away from human-rated quality. At the sentence level it is noisy, and at the corpus level its correlation with human judgement is weakest precisely when comparing very strong systems, which is when modern translation research operates.
The METEOR metric introduced explicit synonym matching, paraphrase tables, and a recall component to address some of these issues but is more expensive and language-dependent. TER (translation edit rate) measures the number of edits required to transform the candidate into a reference, providing a complementary view. Embedding-based metrics such as BERTScore use contextual representations to score semantic similarity, and learned metrics such as COMET and BLEURT are trained directly on human-judgement data and now substantially outperform BLEU in correlation with human ratings on modern strong systems. The contemporary practice in machine-translation evaluation is to report SacreBLEU for backward compatibility alongside one or more learned metrics.
Practical considerations
When reporting BLEU, three details matter. First, always specify the toolkit and version, ideally by quoting the SacreBLEU signature; this is the only way to make a number reproducible. Second, distinguish corpus BLEU from sentence BLEU and from the average of sentence BLEUs, as the three differ and only corpus BLEU matches the metric as originally defined. Third, note the number of references: BLEU's correlation with human judgement improves with more references, and a single-reference score is a noisier estimate than a four-reference score on the same system. When using BLEU as an optimisation target during training, smoothed sentence BLEU or one of its differentiable surrogates is appropriate; minimum-risk training and reinforcement-learning fine-tuning of Transformer-based translation systems both routinely use BLEU as the reward, and the resulting reward-hacking effects are part of the broader case for moving to learned metrics.
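Using the illustrative functions sketched earlier, the following toy comparison makes the second point concrete: corpus BLEU pools counts before dividing, so it generally differs from the average of per-sentence scores.

```python
pairs = [
    ("the cat sat on the mat".split(), ["the cat is on the mat".split()]),
    ("there is a dog".split(),         ["there is a dog in the house".split()]),
]

corpus_level = bleu(pairs)  # pooled counts, single ratio per order
averaged = sum(smoothed_sentence_bleu(c, r) for c, r in pairs) / len(pairs)
print(corpus_level, averaged)  # the two numbers generally differ
```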
References
- Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. BLEU: a Method for Automatic Evaluation of Machine Translation. ACL 2002.
- Post, M. A Call for Clarity in Reporting BLEU Scores. WMT 2018.
- Chen, B. and Cherry, C. A Systematic Comparison of Smoothing Techniques for Sentence-Level BLEU. WMT 2014.
- Banerjee, S. and Lavie, A. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. ACL 2005.
- Snover, M., Dorr, B., Schwartz, R., Micciulla, L., and Makhoul, J. A Study of Translation Edit Rate with Targeted Human Annotation. AMTA 2006.
- Zhang, T., Kishore, V., Wu, F., Weinberger, K. Q., and Artzi, Y. BERTScore: Evaluating Text Generation with BERT. ICLR 2020.
- Rei, R., Stewart, C., Farinha, A. C., and Lavie, A. COMET: A Neural Framework for MT Evaluation. EMNLP 2020.