Byte-Pair Encoding
| Article | |
|---|---|
| Topic area | Natural Language Processing |
| Prerequisites | Tokenization, Language Model |
Overview
Byte-Pair Encoding (BPE) is a subword tokenization algorithm that builds a vocabulary by iteratively merging the most frequent adjacent pairs of symbols in a training corpus. Originally introduced as a data compression technique by Philip Gage in 1994, it was adapted for neural machine translation by Sennrich, Haddow, and Birch in 2016 and has since become a foundational component of modern language models, including the GPT family, RoBERTa, and many open-source large language models.[1] BPE strikes a balance between two extremes of Tokenization: word-level vocabularies, which suffer from out-of-vocabulary words and large embedding tables, and character-level vocabularies, which produce excessively long sequences and provide only weak inductive biases. By learning a fixed-size vocabulary of subword units that covers any input string without unknown tokens, BPE enables efficient open-vocabulary modeling across languages and domains.
Background and Motivation
Early neural sequence models relied on word-level vocabularies, typically capped at 30,000 to 100,000 entries. Words outside this set were replaced by an <UNK> token, discarding information. This was particularly damaging for morphologically rich languages, technical domains, and rare named entities. Character-level models avoid the unknown-token problem entirely but require longer sequences and force the model to relearn high-frequency word patterns from scratch.
Subword tokenization offers a middle ground. The intuition is that frequent words should remain whole tokens, while rare words decompose into meaningful subword pieces such as morphemes or stems. For example, "tokenization" might split into "token" and "ization", allowing the model to share representations with related forms like "tokenize" or "modernization". BPE provides a simple, data-driven procedure for finding such a vocabulary without linguistic supervision.
The Algorithm
BPE training begins with a base vocabulary consisting of all characters (or all bytes) appearing in the corpus. Each word is represented as a sequence of these base symbols, typically with an end-of-word marker such as </w> to distinguish word-internal pieces from word-final ones. The training procedure then repeatedly applies a single operation: count the frequency of every adjacent pair of symbols across the corpus, identify the most frequent pair, and merge it into a new symbol that is added to the vocabulary.
Formally, let $ V_0 $ denote the initial vocabulary of characters and let $ C $ be a multiset of pre-tokenized words, each represented as a sequence of symbols. At iteration $ t $, BPE selects the pair
$ (a^{*}, b^{*}) = \arg\max_{(a, b)} \; \mathrm{count}(a, b \mid C, V_{t-1}) $
and forms the new symbol $ a^{*}b^{*} $, setting $ V_{t} = V_{t-1} \cup \{ a^{*}b^{*} \} $. Every occurrence of the adjacent pair $ (a^{*}, b^{*}) $ in $ C $ is replaced by $ a^{*}b^{*} $, and the merge rule is appended to an ordered merge list. The procedure continues until the vocabulary reaches a target size, commonly between 8,000 and 100,000 tokens. The output of training is two artifacts: the final vocabulary and the ordered list of merge rules.
A small worked example illustrates the dynamics. Suppose the corpus contains the words "low", "lower", "newest", and "widest", with frequencies that make "es" the most common adjacent pair after the initialization step. BPE first merges "e" + "s" into "es", then perhaps "es" + "t" into "est", and continues. After several iterations, common suffixes such as "est" and "er" emerge as single tokens, while rarer forms remain segmented.
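A minimal training sketch in Python, in the spirit of the reference implementation from the original subword paper, makes the loop concrete. The toy corpus below uses the four words from the worked example with assumed frequencies chosen purely for illustration:

```python
import re
from collections import Counter

def get_pair_counts(corpus):
    """Count frequencies of adjacent symbol pairs across the corpus."""
    pairs = Counter()
    for word, freq in corpus.items():
        symbols = word.split()
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return pairs

def merge_pair(pair, corpus):
    """Replace every occurrence of the pair with the merged symbol."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    merged = "".join(pair)
    return {pattern.sub(merged, word): freq for word, freq in corpus.items()}

# Toy corpus: words as space-separated base symbols with an end-of-word
# marker, mapped to assumed frequencies (illustrative, not real data).
corpus = {
    "l o w </w>": 5,
    "l o w e r </w>": 2,
    "n e w e s t </w>": 6,
    "w i d e s t </w>": 3,
}

merges = []
for _ in range(10):                      # target number of merge operations
    pairs = get_pair_counts(corpus)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)     # most frequent adjacent pair
    corpus = merge_pair(best, corpus)
    merges.append(best)

# With these assumed counts the first merges are:
# [('e', 's'), ('es', 't'), ('est', '</w>'), ('l', 'o')]
print(merges[:4])
```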
Tokenization at Inference
Once trained, BPE tokenizes a new string deterministically. The string is first split into base symbols, and the merge rules are applied in the same order they were learned. At each step, the algorithm finds the merge rule with the lowest learned index that still applies, executes it, and repeats until no rule applies. The resulting sequence of tokens is unique given the merge list.
This greedy, priority-ordered application makes BPE encoding fast and reproducible. Practical implementations use a priority queue keyed on merge rank to achieve roughly linear time in the input length. A subtle consequence is that BPE tokenization is not invariant to whitespace or capitalization unless the pre-tokenizer normalizes them. Most modern tokenizers, including those used by GPT-2 and later models, preserve case and represent leading whitespace as a special character (often Ġ) so that "the" and " the" map to distinct tokens.
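A minimal greedy encoder over a learned merge list can be sketched as follows. The `merges` list here is the hypothetical output of the training sketch above (under its assumed word frequencies) and is included only so the example is self-contained:

```python
def bpe_encode(word, merge_ranks):
    """Greedy BPE encoding: repeatedly apply the applicable merge with the
    lowest rank (i.e. the earliest-learned rule) until none applies."""
    symbols = list(word) + ["</w>"]                    # split into base symbols
    while len(symbols) > 1:
        # Rank every adjacent pair; unknown pairs get an infinite rank.
        ranked = [(merge_ranks.get(pair, float("inf")), i)
                  for i, pair in enumerate(zip(symbols, symbols[1:]))]
        best_rank, i = min(ranked)
        if best_rank == float("inf"):                  # no learned merge applies
            break
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols

# Hypothetical merge list, as the training sketch above might produce it:
merges = [("e", "s"), ("es", "t"), ("est", "</w>"), ("l", "o"), ("lo", "w")]
merge_ranks = {pair: rank for rank, pair in enumerate(merges)}

# An unseen word decomposes into known pieces: ['low', 'est</w>']
print(bpe_encode("lowest", merge_ranks))
```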
Byte-Level BPE
A practical issue with character-level BPE is that the base vocabulary depends on the training corpus. A new language or unusual Unicode codepoint can produce unknown characters at inference. Byte-level BPE, introduced with GPT-2, sidesteps this by treating the input as a sequence of UTF-8 bytes rather than characters.[2] The base vocabulary is fixed at 256 byte values, guaranteeing that any input string has a representation. Byte-level BPE eliminates unknown tokens entirely and handles emoji, code, and multilingual text uniformly. Most contemporary large language models use this variant.
The trade-off is that non-Latin scripts encode less efficiently in bytes. A single Chinese character may consume three UTF-8 bytes and require multiple merges to consolidate, leading to longer token sequences and higher inference costs for such languages. SentencePiece with a Unigram model is sometimes preferred when fairness across languages is a goal.
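The effect is easy to see by inspecting raw UTF-8 lengths. The short check below (plain Python, no tokenizer required) shows how long the base byte sequence is before any merges apply:

```python
# Byte-level BPE starts from raw UTF-8 bytes, so the 256-entry base vocabulary
# covers any string, but scripts differ sharply in bytes per character.
for text in ["the", "naïve", "语言", "🙂"]:
    raw = text.encode("utf-8")
    print(f"{text!r}: {len(text)} chars -> {len(raw)} bytes {list(raw)}")
# 'the' needs 3 bytes, the two CJK characters need 6, and the emoji needs 4,
# so non-Latin text starts from a longer base sequence before any merges.
```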
Comparison with Other Subword Methods
WordPiece, used in BERT, is closely related to BPE but selects merges based on a likelihood criterion rather than raw frequency. At each step, WordPiece chooses the pair whose merger maximizes the likelihood of the training corpus under a unigram language model assumption. In practice, WordPiece and BPE produce similar vocabularies, and the choice is often a matter of toolchain preference.
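One common way to operationalize the WordPiece criterion is to score a pair by its joint count relative to the counts of its parts, preferring merges whose parts co-occur more often than their individual frequencies alone would suggest. The sketch below contrasts the two selection rules; the exact likelihood computation in production WordPiece trainers may differ, so this is an illustrative approximation rather than the reference algorithm:

```python
from collections import Counter

def select_bpe(pair_counts: Counter) -> tuple:
    """BPE: merge the most frequent adjacent pair."""
    return max(pair_counts, key=pair_counts.get)

def select_wordpiece(pair_counts: Counter, symbol_counts: Counter) -> tuple:
    """WordPiece-style selection (approximate): prefer pairs whose joint count
    is high relative to the product of their parts' counts."""
    return max(pair_counts,
               key=lambda p: pair_counts[p] / (symbol_counts[p[0]] * symbol_counts[p[1]]))
```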
The Unigram language model tokenizer, used in SentencePiece and the T5 family, takes a different approach. It starts with a large candidate vocabulary and iteratively prunes tokens to maximize corpus likelihood, producing a probabilistic segmentation that supports subword regularization through sampling. SentencePiece additionally treats whitespace as a regular character, allowing fully end-to-end tokenization without language-specific pre-tokenization rules.
Compared to character-level models, BPE produces shorter sequences and stronger word-level priors. Compared to word-level models, it eliminates unknown tokens and shares parameters across morphologically related forms. The resulting vocabularies generalize well across domains and have proven remarkably robust as the dominant choice for Transformer architectures.
Implementation Considerations
A naive BPE implementation recounts pair frequencies over the entire corpus after every merge, so training cost grows with the product of corpus size and the number of merges. Production implementations such as the trainer in Hugging Face Tokenizers maintain incremental data structures that update only the affected pairs after each merge, reducing training time to near-linear in corpus size. For inference, libraries such as OpenAI tiktoken achieve millions of tokens per second per CPU core through careful Rust or C++ implementations.
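For encoding, the frozen GPT-2 vocabulary can be exercised directly through the tiktoken package (assuming it is installed); the snippet below uses only its documented encode/decode interface:

```python
import tiktoken

enc = tiktoken.get_encoding("gpt2")          # byte-level BPE with ~50k merges

ids = enc.encode("Byte-Pair Encoding strikes a balance.")
print(ids)                                   # token ids under the frozen merge list
print(enc.decode(ids))                       # decodes back to the original string

# Whitespace is part of the token: "the" and " the" receive different ids.
print(enc.encode("the"), enc.encode(" the"))
```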
Pre-tokenization choices matter. Most BPE pipelines split text on whitespace and punctuation before applying merges, preventing tokens from spanning word boundaries. The exact rule set, including how digits, contractions, and punctuation are handled, varies between models and is rarely documented in detail, which makes exact reproduction of a tokenizer non-trivial. Released models therefore distribute the merge list and pre-tokenizer regex together as a frozen artifact.
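For reference, the GPT-2 pre-tokenizer is a single regular expression applied before any merges; the pattern below is reproduced from the publicly released encoder (the third-party `regex` module is required for the `\p{...}` character classes):

```python
import regex  # the standard `re` module does not support \p{...} classes

gpt2_pre_tokenizer = regex.compile(
    r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
)

print(gpt2_pre_tokenizer.findall("Don't split 123 tokens, please!"))
# ['Don', "'t", ' split', ' 123', ' tokens', ',', ' please', '!']
```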
Limitations and Trade-offs
BPE is a greedy, frequency-based heuristic with no explicit linguistic objective. It can produce counterintuitive segmentations, especially for low-frequency morphological variants, and it is sensitive to the composition of the training corpus. A vocabulary trained on web text segments medical or legal text suboptimally, inflating sequence lengths and degrading downstream performance until adaptation or vocabulary extension is performed.
BPE's frequency-driven merges have also been identified as a source of implicit bias in language models, particularly for numerical reasoning. Because digit sequences are merged according to training-corpus frequency rather than positional value, the same number can tokenize into different chunks depending on its surrounding context, which complicates arithmetic reasoning. Recent models have begun adopting digit-level tokenization as a partial remedy.
Finally, BPE provides a single canonical segmentation per string, which limits the model's exposure to alternative analyses during training. Subword regularization techniques, such as BPE dropout, randomly skip merges during training to expose the model to multiple plausible segmentations and improve robustness.[3]
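A minimal sketch of BPE dropout built on the greedy encoder from earlier illustrates the idea; the exact policy for dropped merges varies between implementations, so this is an approximation rather than the reference algorithm:

```python
import random

def bpe_dropout_encode(word, merge_ranks, p=0.1, rng=random):
    """At each step, every applicable merge is independently skipped with
    probability p; among the survivors, the lowest-ranked merge is applied.
    With p = 0 this reduces to ordinary deterministic BPE encoding."""
    symbols = list(word) + ["</w>"]
    while len(symbols) > 1:
        candidates = [(merge_ranks[pair], i)
                      for i, pair in enumerate(zip(symbols, symbols[1:]))
                      if pair in merge_ranks and rng.random() >= p]
        if not candidates:                       # nothing survived this step
            break
        _, i = min(candidates)
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols

# With the hypothetical merge list from the encoding sketch above, repeated
# calls on the same word now yield several plausible segmentations.
```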
References
- ↑ Sennrich, R., Haddow, B., and Birch, A., "Neural Machine Translation of Rare Words with Subword Units," ACL 2016.
- ↑ Radford, A. et al., "Language Models are Unsupervised Multitask Learners," OpenAI 2019.
- ↑ Provilkov, I., Emelianenko, D., and Voita, E., "BPE-Dropout: Simple and Effective Subword Regularization," ACL 2020.