BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    Research Paper
    Authors: Jacob Devlin; Ming-Wei Chang; Kenton Lee; Kristina Toutanova
    Year: 2019
    Venue: NAACL
    Topic area: NLP
    Difficulty: Research
    arXiv: 1810.04805

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding is a 2019 paper by Devlin et al. from Google AI Language that introduced BERT (Bidirectional Encoder Representations from Transformers), a method for pre-training deep bidirectional language representations. BERT revolutionized NLP by demonstrating that a single pre-trained model could be fine-tuned to achieve state-of-the-art results on a wide range of downstream tasks with minimal task-specific architecture modifications.

    Overview

    Before BERT, pre-trained language representations were either unidirectional (like GPT, which reads left-to-right) or used shallow concatenation of independently trained left-to-right and right-to-left models (like ELMo). These approaches were suboptimal because standard language models are inherently unidirectional, preventing tokens from attending to context on both sides simultaneously.

    BERT addressed this limitation by introducing a novel pre-training objective — masked language modeling (MLM) — that enables genuine bidirectional pre-training. Combined with a next sentence prediction (NSP) task, BERT learned rich contextual representations that could be transferred to downstream tasks through simple fine-tuning, eliminating the need for task-specific architectures.

    Key Contributions

    • Masked language modeling (MLM): A pre-training objective that randomly masks input tokens and trains the model to predict them from bidirectional context, enabling true bidirectional representation learning.
    • Next sentence prediction (NSP): A binary classification pre-training task that teaches the model to understand relationships between sentence pairs.
    • A simple and effective fine-tuning paradigm: Adding a single output layer to the pre-trained model suffices for a wide range of NLP tasks, from classification to question answering.
    • Demonstration that deep bidirectional pre-training is critically important for learning general-purpose language representations.

    Methods

    BERT uses the encoder portion of the Transformer architecture. The model takes a sequence of tokens as input and produces a contextualized embedding for each token. Two model sizes were released: BERT-Base (12 layers, 768 hidden units, 12 attention heads, 110M parameters) and BERT-Large (24 layers, 1024 hidden units, 16 attention heads, 340M parameters).
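
    A minimal sketch of such an encoder stack, using the BERT-Base sizes quoted above. This is an illustration only: `torch.nn.TransformerEncoder` stands in for the paper's own implementation and omits details such as the embedding layers, the pooler, and weight initialization.

```python
import torch
import torch.nn as nn

# BERT-Base sizes from the paper: 12 layers, 768 hidden units, 12 attention heads.
# The feed-forward size (4 * hidden) and GELU activation follow the BERT setup.
encoder_layer = nn.TransformerEncoderLayer(
    d_model=768,
    nhead=12,
    dim_feedforward=3072,
    activation="gelu",
    batch_first=True,
)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=12)

# One batch of already-embedded tokens: (batch, sequence_length, hidden)
dummy_embeddings = torch.randn(2, 128, 768)
contextual = encoder(dummy_embeddings)   # one contextualized vector per token
print(contextual.shape)                  # torch.Size([2, 128, 768])
```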

    The masked language modeling objective works by randomly masking 15% of the input tokens. Of these masked positions, 80% are replaced with the [MASK] token, 10% with a random token, and 10% are left unchanged. The model predicts the original token at each masked position using a cross-entropy loss:

    $ L_{\text{MLM}} = -\sum_{i \in \mathcal{M}} \log P(x_i \mid \mathbf{x}_{\backslash \mathcal{M}}) $

    where $ \mathcal{M} $ is the set of masked positions and $ \mathbf{x}_{\backslash \mathcal{M}} $ represents the corrupted input.
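
    The 80/10/10 corruption rule can be sketched in a few lines of Python. The helper below is illustrative rather than the paper's code; `mask_id` and `vocab_size` are assumed placeholders for a WordPiece vocabulary, and the sketch ignores special tokens for simplicity.

```python
import random

def corrupt_for_mlm(token_ids, vocab_size, mask_id, mask_prob=0.15):
    """Return (corrupted_ids, target_positions) following BERT's 80/10/10 rule."""
    corrupted = list(token_ids)
    targets = []
    for i in range(len(corrupted)):
        if random.random() < mask_prob:           # select ~15% of positions
            targets.append(i)                     # the MLM loss is computed only here
            r = random.random()
            if r < 0.8:                           # 80%: replace with [MASK]
                corrupted[i] = mask_id
            elif r < 0.9:                         # 10%: replace with a random token
                corrupted[i] = random.randrange(vocab_size)
            # remaining 10%: keep the original token unchanged
    return corrupted, targets
```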

    For next sentence prediction, the model receives pairs of sentences (A and B) and predicts whether B is the actual next sentence following A in the corpus, or a randomly sampled sentence. A special [CLS] token at the beginning of the input captures the aggregate sequence representation used for this binary classification.
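
    A rough sketch of how NSP training pairs might be assembled (50% actual next sentence, 50% random), using hypothetical helper names; the paper does not prescribe this exact code.

```python
import random

def make_nsp_example(sentences, index):
    """Build one (sentence_a, sentence_b, is_next) example from a list of corpus sentences."""
    sentence_a = sentences[index]
    if random.random() < 0.5 and index + 1 < len(sentences):
        sentence_b, is_next = sentences[index + 1], 1      # actual next sentence
    else:
        sentence_b, is_next = random.choice(sentences), 0  # randomly sampled sentence
    # The model input is then: [CLS] sentence_a [SEP] sentence_b [SEP]
    return sentence_a, sentence_b, is_next
```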

    Input representation combines token embeddings, segment embeddings (indicating sentence A or B), and positional embeddings. BERT uses WordPiece tokenization with a 30,000-token vocabulary.
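
    The three embedding tables are combined by simple element-wise addition, as sketched below. The sizes follow BERT-Base and the paper's 512-token maximum sequence length, but the module is an illustration under those assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class BertInputEmbeddings(nn.Module):
    def __init__(self, vocab_size=30000, hidden=768, max_len=512, num_segments=2):
        super().__init__()
        self.token = nn.Embedding(vocab_size, hidden)        # WordPiece token embeddings
        self.segment = nn.Embedding(num_segments, hidden)    # sentence A / sentence B
        self.position = nn.Embedding(max_len, hidden)        # learned positional embeddings

    def forward(self, token_ids, segment_ids):
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return (self.token(token_ids)
                + self.segment(segment_ids)
                + self.position(positions))                   # broadcast over the batch
```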

    Pre-training used the BooksCorpus (800M words) and English Wikipedia (2,500M words), running for 1M steps with a batch size of 256 sequences. The total pre-training compute was substantial for its time: each model took four days to train, on 4 Cloud TPUs for BERT-Base and 16 Cloud TPUs for BERT-Large.

    Fine-tuning is straightforward: for each downstream task, task-specific inputs and outputs are plugged into the pre-trained model, and all parameters are fine-tuned end-to-end. For token-level tasks like named entity recognition, each token's final hidden vector is fed into a classification layer. For sequence-level tasks like sentiment analysis, the [CLS] token's representation is used.
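
    For the sequence-level case, the entire task-specific addition is one linear layer on the [CLS] vector, as in the sketch below. It assumes a generic `encoder` that returns per-token hidden states (such as the stack sketched earlier) and is not the paper's released code.

```python
import torch.nn as nn

class SequenceClassifier(nn.Module):
    """Pre-trained encoder plus a single new output layer, fine-tuned end-to-end."""
    def __init__(self, encoder, hidden=768, num_labels=2):
        super().__init__()
        self.encoder = encoder                            # pre-trained BERT encoder (assumed)
        self.classifier = nn.Linear(hidden, num_labels)   # the only task-specific parameters

    def forward(self, embedded_inputs):
        hidden_states = self.encoder(embedded_inputs)     # (batch, seq_len, hidden)
        cls_vector = hidden_states[:, 0]                  # [CLS] is the first token
        return self.classifier(cls_vector)                # logits for the sequence label
```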

    Results

    BERT achieved state-of-the-art results on eleven NLP benchmarks at the time of publication:

    • GLUE benchmark: BERT-Large achieved an average score of 80.5, a 7.7-point improvement over the previous state of the art.
    • SQuAD v1.1 (question answering): F1 score of 93.2, surpassing human performance (91.2 F1).
    • SQuAD v2.0: F1 score of 83.1, a 5.1-point improvement over prior systems.
    • SWAG (commonsense reasoning): 86.3% accuracy, outperforming human expert performance (85.0%).

    Ablation studies demonstrated that both pre-training tasks were important, and that bidirectionality was the most significant factor — removing it caused large drops across all tasks. Increasing model size consistently improved results, even on small-scale tasks when fine-tuned appropriately.

    The paper also showed that BERT's representations could be used as fixed feature extractors (without fine-tuning) and still achieve strong results, though fine-tuning consistently outperformed the feature-based approach.

    Impact

    BERT catalyzed a paradigm shift in NLP toward the "pre-train then fine-tune" methodology. It spawned an extensive family of derivative models, including RoBERTa (which improved pre-training), ALBERT (parameter-efficient variant), DistilBERT (knowledge distillation), and domain-specific variants like BioBERT and SciBERT. The approach also influenced multi-modal models and cross-lingual representations through models like mBERT and XLM.

    BERT demonstrated that large-scale unsupervised pre-training could effectively transfer linguistic knowledge to downstream tasks, reducing the need for task-specific labeled data and engineering. This pre-train-then-fine-tune paradigm remains foundational to modern NLP practice.

    The paper received over 100,000 citations within its first five years and is one of the most cited works in computer science. Google integrated BERT into its search engine in 2019, marking one of the largest deployments of a neural language model for information retrieval. The model's influence extends beyond academia into widespread industrial adoption, where BERT-based systems power search, content moderation, customer service, and many other applications.

    References

    • Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of NAACL-HLT 2019. arXiv:1810.04805
    • Peters, M. E., Neumann, M., Iyyer, M., et al. (2018). Deep Contextualized Word Representations. NAACL 2018.
    • Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving Language Understanding by Generative Pre-Training. OpenAI.