Machine Translation

    Topic area Natural Language Processing
    Prerequisites Transformer, Cross-Entropy Loss


    Overview

    Machine translation is the task of automatically converting text or speech from one natural language into another while preserving meaning. It is one of the oldest applications of computational linguistics, with practical attempts dating to the 1950s, and remains one of the most economically important uses of natural-language technology, underpinning multilingual web search, real-time messaging, document localization, and accessibility tooling. Modern systems frame the problem as conditional sequence generation: given a source sentence, output a target-language sentence that maximizes a learned probability under a parametric model.

    The field has gone through three broad paradigms. Rule-based systems of the 1970s and 1980s relied on hand-crafted dictionaries and transfer rules. Statistical machine translation, dominant from the late 1990s through the early 2010s, learned phrase-level translation tables and reordering models from parallel corpora. Neural machine translation, the current paradigm, models the source-to-target conditional distribution end-to-end with a single neural network, typically a Transformer-based encoder-decoder. Each transition was driven less by a single technical breakthrough than by the convergence of better data, more compute, and metrics such as the BLEU Score that made progress measurable.

    Problem formulation

    Given a source sequence $ x = (x_1, \ldots, x_S) $ over a source vocabulary and a target sequence $ y = (y_1, \ldots, y_T) $ over a target vocabulary, machine translation seeks the conditional distribution

    $ {\displaystyle p(y \mid x) = \prod_{t=1}^{T} p(y_t \mid y_{<t}, x).} $

    The factorization is left-to-right autoregressive; each token is predicted given all source tokens and all previously generated target tokens. The training objective is typically the negative log-likelihood of reference translations under teacher forcing,

    $ {\displaystyle \mathcal{L}(\theta) = -\sum_{(x, y) \in \mathcal{D}} \sum_{t=1}^{T} \log p_\theta(y_t \mid y_{<t}, x),} $

    which is the Cross-Entropy Loss applied position-wise. Inference replaces the reference prefix $ y_{<t} $ with the model's own previous outputs, creating a train-test mismatch known as exposure bias that has motivated a long line of fixes, including scheduled sampling, minimum-risk training, and reinforcement-learning fine-tuning with translation-quality rewards.
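
    The objective above can be made concrete in a few lines. The following is a minimal PyTorch sketch of teacher-forced training, assuming model is any encoder-decoder that maps a source batch and a target prefix to per-position vocabulary logits; the function name and the padding id are illustrative, not a specific library's API.

        import torch
        import torch.nn.functional as F

        def teacher_forced_nll(model, src, tgt, pad_id=0):
            # The decoder consumes the reference prefix y_<t and is trained
            # to predict the gold next token y_t at every position.
            decoder_in = tgt[:, :-1]           # (batch, T-1) reference prefix
            labels = tgt[:, 1:]                # (batch, T-1) gold next tokens
            logits = model(src, decoder_in)    # (batch, T-1, vocab)
            # Position-wise cross-entropy, i.e. the negative log-likelihood
            # summed over target positions; padding does not contribute.
            return F.cross_entropy(
                logits.reshape(-1, logits.size(-1)),
                labels.reshape(-1),
                ignore_index=pad_id,
            )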

    Statistical machine translation

    Statistical machine translation decomposes $ p(y \mid x) $ via a noisy-channel model into a translation model $ p(x \mid y) $ and a target-side language model $ p(y) $, then chooses $ \arg\max_y p(x \mid y)\, p(y) $. Word-level alignment models such as the IBM Models 1 through 5 estimate which source words generate which target words from sentence-aligned corpora using expectation-maximization. Phrase-based statistical machine translation, the workhorse of the 2000s, generalizes alignment to contiguous phrase pairs and adds reordering and length penalties through a log-linear discriminative model. Hierarchical and syntax-based extensions tried to capture long-range reordering with synchronous grammars. These systems produced the first deployable web translation services and set the BLEU baselines that neural systems eventually surpassed.
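
    The expectation-maximization procedure behind the IBM models can be sketched compactly. Below is a toy Python estimator of IBM Model 1 lexical probabilities t(source | target), the direction used by the noisy-channel translation model p(x | y); uniform initialization and the omission of the NULL word are simplifications for illustration.

        from collections import defaultdict

        def ibm_model1(corpus, iterations=10):
            # corpus: list of (source_tokens, target_tokens) sentence pairs.
            # t[(s, e)] approximates t(s | e); initialized uniformly.
            t = defaultdict(lambda: 1.0)
            for _ in range(iterations):
                count = defaultdict(float)    # expected counts c(s, e)
                total = defaultdict(float)    # per-target-word normalizers
                for src, tgt in corpus:
                    for s in src:
                        # E-step: posterior over which target word aligns to s.
                        z = sum(t[(s, e)] for e in tgt)
                        for e in tgt:
                            c = t[(s, e)] / z
                            count[(s, e)] += c
                            total[e] += c
                # M-step: renormalize expected counts into probabilities.
                for (s, e), c in count.items():
                    t[(s, e)] = c / total[e]
            return t

        corpus = [("das Haus".split(), "the house".split()),
                  ("das Buch".split(), "the book".split()),
                  ("ein Buch".split(), "a book".split())]
        t = ibm_model1(corpus)
        # t[("Buch", "book")] rises toward 1.0 as EM disambiguates alignments.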

    Neural machine translation

    Early neural systems, beginning with the encoder-decoder recurrent architectures of 2014, encoded the source into a fixed-length vector and decoded a target sequence with a Long Short-Term Memory network. The fixed-length bottleneck limited quality on long sentences and was relaxed by the additive attention mechanism, which lets the decoder attend to a weighted sum of encoder states at each step. The Transformer, introduced in 2017, replaced recurrence entirely with stacked self-attention and Cross-Attention layers and quickly became the standard architecture for translation and most other sequence tasks.

    A modern translation Transformer is an encoder-decoder. The encoder applies multi-head self-attention and feed-forward sublayers to source token embeddings; the decoder applies masked self-attention over the target prefix, cross-attention over encoder outputs, and a feed-forward sublayer; both stacks use residual connections and layer normalization. Inputs are subword units produced by Byte-Pair Encoding or unigram language-model tokenization, which keep vocabularies small and sidestep the open-vocabulary problem on morphologically rich languages.
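
    The architecture can be assembled from stock components. The sketch below uses PyTorch's built-in nn.Transformer, which supplies the self-attention, cross-attention, feed-forward, residual, and layer-normalization machinery internally; the hyperparameters and learned positional embeddings are ordinary choices for illustration, not a canonical configuration.

        import torch
        import torch.nn as nn

        class TranslationTransformer(nn.Module):
            def __init__(self, src_vocab, tgt_vocab, d_model=512, max_len=512):
                super().__init__()
                self.src_emb = nn.Embedding(src_vocab, d_model)
                self.tgt_emb = nn.Embedding(tgt_vocab, d_model)
                self.pos_emb = nn.Embedding(max_len, d_model)  # learned positions
                self.transformer = nn.Transformer(
                    d_model=d_model, nhead=8,
                    num_encoder_layers=6, num_decoder_layers=6,
                    dim_feedforward=2048, batch_first=True,
                )
                self.out = nn.Linear(d_model, tgt_vocab)

            def forward(self, src, tgt):
                def embed(tok, emb):
                    pos = torch.arange(tok.size(1), device=tok.device)
                    return emb(tok) + self.pos_emb(pos)
                # Causal mask: each target position attends only to its prefix.
                causal = nn.Transformer.generate_square_subsequent_mask(
                    tgt.size(1)).to(src.device)
                h = self.transformer(embed(src, self.src_emb),
                                     embed(tgt, self.tgt_emb),
                                     tgt_mask=causal)
                return self.out(h)  # (batch, T, tgt_vocab) logits

    A model of this shape plugs directly into the teacher-forced loss sketched earlier.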

    Training and inference

    Training data consists of parallel corpora (sentence pairs in two languages), supplemented by monolingual data via back-translation, in which a target-to-source model produces synthetic source sentences from genuine target text. Standard tricks include label smoothing, learning-rate warmup followed by inverse-square-root decay, and mixed-precision optimization. Because the Transformer contains no recurrence, gradients flow through the entire encoder-decoder by ordinary Backpropagation rather than backpropagation through time.
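
    Two of these tricks fit in a few lines. The sketch below shows the warmup-then-inverse-square-root schedule popularized by the original Transformer recipe, and label smoothing via PyTorch's built-in option; the warmup length and smoothing value are common but illustrative settings.

        import torch

        def inverse_sqrt_lr(step, d_model=512, warmup=4000):
            # Linear warmup for the first `warmup` steps, then decay ~ step^-0.5.
            step = max(step, 1)
            return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

        # Label smoothing spreads a little probability mass off the gold
        # token, regularizing the position-wise cross-entropy objective.
        criterion = torch.nn.CrossEntropyLoss(label_smoothing=0.1,
                                              ignore_index=0)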

    Inference uses approximate search because the exact argmax over target sequences is intractable. Beam search with a beam width of 4 to 8 and a length-normalization term is the default, although recent work suggests that sampling-based decoding with a low temperature or nucleus filtering can produce translations that are more diverse and comparably accurate under reference-free metrics. Decoding speed is often the bottleneck in production, leading to non-autoregressive variants that emit the whole target in parallel, Knowledge Distillation from large teacher models into smaller students, and quantization of the decoder weights.
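
    A compact beam-search sketch follows, reusing the model interface from the training sketch; the BOS/EOS ids, beam width, and length-normalization exponent alpha are illustrative defaults, and the full prefix is re-encoded each step (production decoders cache attention states instead).

        import torch
        import torch.nn.functional as F

        @torch.no_grad()
        def beam_search(model, src, bos=1, eos=2, beam=4, max_len=128, alpha=0.6):
            beams = [([bos], 0.0)]        # (prefix, summed log-probability)
            finished = []
            for _ in range(max_len):
                candidates = []
                for prefix, score in beams:
                    prefix_t = torch.tensor([prefix], device=src.device)
                    logits = model(src, prefix_t)[0, -1]     # next-token logits
                    logp = F.log_softmax(logits, dim=-1)
                    top = torch.topk(logp, beam)
                    for lp, tok in zip(top.values.tolist(), top.indices.tolist()):
                        candidates.append((prefix + [tok], score + lp))
                # Keep only the top-scoring hypotheses, up to the beam width.
                candidates.sort(key=lambda c: c[1], reverse=True)
                beams = []
                for prefix, score in candidates[:beam]:
                    if prefix[-1] == eos:
                        # Length normalization: dividing by |y|^alpha keeps
                        # beam search from favoring short translations.
                        finished.append((prefix, score / len(prefix) ** alpha))
                    else:
                        beams.append((prefix, score))
                if not beams:
                    break
            finished.extend((p, s / len(p) ** alpha) for p, s in beams)
            return max(finished, key=lambda c: c[1])[0]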

    Multilingual and zero-shot translation

    A single model can be trained to translate among many language pairs by prepending a target-language tag to the source. Such multilingual systems amortize encoder capacity across languages and frequently improve translation quality for low-resource pairs through positive transfer from high-resource ones. Strikingly, they often produce reasonable translations between language pairs that never co-occurred in the training data, a phenomenon known as zero-shot translation. The cost is interference between languages with very different scripts or syntax, which manifests as systematic errors that monolingual baselines do not exhibit.
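
    The tagging scheme itself is trivially simple, which is part of its appeal; the <2xx> tag format below follows a common convention and is illustrative.

        def tag_source(src_tokens, tgt_lang):
            # The target-language tag is just an extra source token; the
            # model learns to condition its output language on it.
            return ["<2" + tgt_lang + ">"] + src_tokens

        tag_source("Wie geht es dir ?".split(), "en")
        # ['<2en>', 'Wie', 'geht', 'es', 'dir', '?']
        # The same model emits French when the tag is <2fr>; zero-shot pairs
        # are those whose tag and source language never co-occurred in training.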

    Evaluation

    Translation quality is hard to measure because there are many acceptable translations of any given sentence. The BLEU Score computes modified n-gram precision against one or more reference translations and remains the most widely reported headline number despite well-documented weaknesses. Newer reference-based metrics such as chrF, COMET, and BLEURT correlate better with human judgement, especially for high-quality systems where small BLEU differences are unreliable. Reference-free metrics that compare a candidate translation against the source via a quality estimation model are increasingly used in production to flag low-confidence outputs. Human evaluation along axes of adequacy and fluency, or via direct assessment, is still the gold standard for system comparison and is required by the major shared tasks such as the Conference on Machine Translation.
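
    In practice, headline numbers are usually computed with the sacrebleu package, which standardizes tokenization so that scores are comparable across papers; the toy sentences below are illustrative.

        import sacrebleu  # pip install sacrebleu

        hyps = ["the cat sat on the mat"]
        # sacrebleu takes one list per reference stream, each aligned with
        # the hypothesis list (here: two references for one sentence).
        refs = [["the cat sat on the mat"],
                ["a cat was sitting on the mat"]]
        print(sacrebleu.corpus_bleu(hyps, refs).score)   # BLEU
        print(sacrebleu.corpus_chrf(hyps, refs).score)   # chrF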

    Limitations and active research

    Even the strongest current systems make systematic errors. Hallucinations, fluent target sentences unrelated to the source, appear under domain shift or unusual punctuation. Gender and other social biases are amplified when source pronouns are ambiguous. Document-level coherence is poor because most systems translate sentence by sentence and lose anaphoric and stylistic context. Low-resource and morphologically rich languages remain substantially behind high-resource pairs such as English-French or English-German. Active research directions include document-level and discourse-aware translation, integration with retrieval to ground rare names and terminology, instruction-tuned large language models that perform translation as one capability among many, and speech-to-speech and signed-language translation, where the input modality itself is non-textual.
