BERT Pre-training of Deep Bidirectional Transformers - Revision history

DeployBot: Marked this version for translation

2026-04-27T02:36:51Z

Marked this version for translation

← Older revision		Revision as of 02:36, 27 April 2026
Line 18:		Line 18:
	'''BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding''' is a 2019 paper by Devlin et al. from Google AI Language that introduced '''BERT''' (Bidirectional Encoder Representations from Transformers), a method for pre-training deep bidirectional language representations. BERT revolutionized NLP by demonstrating that a single pre-trained model could be fine-tuned to achieve state-of-the-art results on a wide range of downstream tasks with minimal task-specific architecture modifications.		'''BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding''' is a 2019 paper by Devlin et al. from Google AI Language that introduced '''BERT''' (Bidirectional Encoder Representations from Transformers), a method for pre-training deep bidirectional language representations. BERT revolutionized NLP by demonstrating that a single pre-trained model could be fine-tuned to achieve state-of-the-art results on a wide range of downstream tasks with minimal task-specific architecture modifications.

	<!--T:3-->		== Overview == <!--T:3-->
	~~== Overview ==~~

	<!--T:4-->		<!--T:4-->
Line 27:		Line 26:
	BERT addressed this limitation by introducing a novel pre-training objective — '''masked language modeling''' (MLM) — that enables genuine bidirectional pre-training. Combined with a '''next sentence prediction''' (NSP) task, BERT learned rich contextual representations that could be transferred to downstream tasks through simple fine-tuning, eliminating the need for task-specific architectures.		BERT addressed this limitation by introducing a novel pre-training objective — '''masked language modeling''' (MLM) — that enables genuine bidirectional pre-training. Combined with a '''next sentence prediction''' (NSP) task, BERT learned rich contextual representations that could be transferred to downstream tasks through simple fine-tuning, eliminating the need for task-specific architectures.

	<!--T:6-->		== Key Contributions == <!--T:6-->
	~~== Key Contributions ==~~

	<!--T:7-->		<!--T:7-->
Line 36:		Line 34:
	* Demonstration that deep bidirectional pre-training is critically important for learning general-purpose language representations.		* Demonstration that deep bidirectional pre-training is critically important for learning general-purpose language representations.

	<!--T:8-->		== Methods == <!--T:8-->
	~~== Methods ==~~

	<!--T:9-->		<!--T:9-->
Line 63:		Line 60:
	Fine-tuning is straightforward: for each downstream task, task-specific inputs and outputs are plugged into the pre-trained model, and all parameters are fine-tuned end-to-end. For token-level tasks like named entity recognition, each token's final hidden vector is fed into a classification layer. For sequence-level tasks like sentiment analysis, the [CLS] token's representation is used.		Fine-tuning is straightforward: for each downstream task, task-specific inputs and outputs are plugged into the pre-trained model, and all parameters are fine-tuned end-to-end. For token-level tasks like named entity recognition, each token's final hidden vector is fed into a classification layer. For sequence-level tasks like sentiment analysis, the [CLS] token's representation is used.

	<!--T:17-->		== Results == <!--T:17-->
	~~== Results ==~~

	<!--T:18-->		<!--T:18-->
Line 81:		Line 77:
	The paper also showed that BERT's representations could be used as fixed feature extractors (without fine-tuning) and still achieve strong results, though fine-tuning consistently outperformed the feature-based approach.		The paper also showed that BERT's representations could be used as fixed feature extractors (without fine-tuning) and still achieve strong results, though fine-tuning consistently outperformed the feature-based approach.

	<!--T:22-->		== Impact == <!--T:22-->
	~~== Impact ==~~

	<!--T:23-->		<!--T:23-->
Line 93:		Line 88:
	The paper received over 100,000 citations within its first five years and is one of the most cited works in computer science. Google integrated BERT into its search engine in 2019, marking one of the largest deployments of a neural language model for information retrieval. The model's influence extends beyond academia into widespread industrial adoption, where BERT-based systems power search, content moderation, customer service, and many other applications.		The paper received over 100,000 citations within its first five years and is one of the most cited works in computer science. Google integrated BERT into its search engine in 2019, marking one of the largest deployments of a neural language model for information retrieval. The model's influence extends beyond academia into widespread industrial adoption, where BERT-based systems power search, content moderation, customer service, and many other applications.

	<!--T:26-->		== See also == <!--T:26-->
	~~== See also ==~~

	<!--T:27-->		<!--T:27-->
Line 101:		Line 95:
	* [[Efficient Estimation of Word Representations]]		* [[Efficient Estimation of Word Representations]]

	<!--T:28-->		== References == <!--T:28-->
	~~== References ==~~

	<!--T:29-->		<!--T:29-->

DeployBot: [deploy-bot] Drop {{LanguageBar}} (v1.4.1)

2026-04-27T02:35:51Z

[deploy-bot] Drop {{LanguageBar}} (v1.4.1)

← Older revision		Revision as of 02:35, 27 April 2026
Line 1:		Line 1:
	<languages />		<languages />
	~~{{LanguageBar | page = BERT Pre-training of Deep Bidirectional Transformers}}~~

	<translate>		<translate>
Line 19:		Line 18:
	'''BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding''' is a 2019 paper by Devlin et al. from Google AI Language that introduced '''BERT''' (Bidirectional Encoder Representations from Transformers), a method for pre-training deep bidirectional language representations. BERT revolutionized NLP by demonstrating that a single pre-trained model could be fine-tuned to achieve state-of-the-art results on a wide range of downstream tasks with minimal task-specific architecture modifications.		'''BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding''' is a 2019 paper by Devlin et al. from Google AI Language that introduced '''BERT''' (Bidirectional Encoder Representations from Transformers), a method for pre-training deep bidirectional language representations. BERT revolutionized NLP by demonstrating that a single pre-trained model could be fine-tuned to achieve state-of-the-art results on a wide range of downstream tasks with minimal task-specific architecture modifications.

	~~== Overview ==~~ <!--T:3-->		<!--T:3-->
			== Overview ==

	<!--T:4-->		<!--T:4-->
Line 27:		Line 27:
	BERT addressed this limitation by introducing a novel pre-training objective — '''masked language modeling''' (MLM) — that enables genuine bidirectional pre-training. Combined with a '''next sentence prediction''' (NSP) task, BERT learned rich contextual representations that could be transferred to downstream tasks through simple fine-tuning, eliminating the need for task-specific architectures.		BERT addressed this limitation by introducing a novel pre-training objective — '''masked language modeling''' (MLM) — that enables genuine bidirectional pre-training. Combined with a '''next sentence prediction''' (NSP) task, BERT learned rich contextual representations that could be transferred to downstream tasks through simple fine-tuning, eliminating the need for task-specific architectures.

	~~== Key Contributions ==~~ <!--T:6-->		<!--T:6-->
			== Key Contributions ==

	<!--T:7-->		<!--T:7-->
Line 35:		Line 36:
	* Demonstration that deep bidirectional pre-training is critically important for learning general-purpose language representations.		* Demonstration that deep bidirectional pre-training is critically important for learning general-purpose language representations.

	~~== Methods ==~~ <!--T:8-->		<!--T:8-->
			== Methods ==

	<!--T:9-->		<!--T:9-->
Line 61:		Line 63:
	Fine-tuning is straightforward: for each downstream task, task-specific inputs and outputs are plugged into the pre-trained model, and all parameters are fine-tuned end-to-end. For token-level tasks like named entity recognition, each token's final hidden vector is fed into a classification layer. For sequence-level tasks like sentiment analysis, the [CLS] token's representation is used.		Fine-tuning is straightforward: for each downstream task, task-specific inputs and outputs are plugged into the pre-trained model, and all parameters are fine-tuned end-to-end. For token-level tasks like named entity recognition, each token's final hidden vector is fed into a classification layer. For sequence-level tasks like sentiment analysis, the [CLS] token's representation is used.

	~~== Results ==~~ <!--T:17-->		<!--T:17-->
			== Results ==

	<!--T:18-->		<!--T:18-->
Line 78:		Line 81:
	The paper also showed that BERT's representations could be used as fixed feature extractors (without fine-tuning) and still achieve strong results, though fine-tuning consistently outperformed the feature-based approach.		The paper also showed that BERT's representations could be used as fixed feature extractors (without fine-tuning) and still achieve strong results, though fine-tuning consistently outperformed the feature-based approach.

	~~== Impact ==~~ <!--T:22-->		<!--T:22-->
			== Impact ==

	<!--T:23-->		<!--T:23-->
Line 89:		Line 93:
	The paper received over 100,000 citations within its first five years and is one of the most cited works in computer science. Google integrated BERT into its search engine in 2019, marking one of the largest deployments of a neural language model for information retrieval. The model's influence extends beyond academia into widespread industrial adoption, where BERT-based systems power search, content moderation, customer service, and many other applications.		The paper received over 100,000 citations within its first five years and is one of the most cited works in computer science. Google integrated BERT into its search engine in 2019, marking one of the largest deployments of a neural language model for information retrieval. The model's influence extends beyond academia into widespread industrial adoption, where BERT-based systems power search, content moderation, customer service, and many other applications.

	~~== See also ==~~ <!--T:26-->		<!--T:26-->
			== See also ==

	<!--T:27-->		<!--T:27-->
Line 96:		Line 101:
	* [[Efficient Estimation of Word Representations]]		* [[Efficient Estimation of Word Representations]]

	~~== References ==~~ <!--T:28-->		<!--T:28-->
			== References ==

	<!--T:29-->		<!--T:29-->

DeployBot: Marked this version for translation

2026-04-27T00:31:23Z

Marked this version for translation

← Older revision		Revision as of 00:31, 27 April 2026
Line 19:		Line 19:
	'''BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding''' is a 2019 paper by Devlin et al. from Google AI Language that introduced '''BERT''' (Bidirectional Encoder Representations from Transformers), a method for pre-training deep bidirectional language representations. BERT revolutionized NLP by demonstrating that a single pre-trained model could be fine-tuned to achieve state-of-the-art results on a wide range of downstream tasks with minimal task-specific architecture modifications.		'''BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding''' is a 2019 paper by Devlin et al. from Google AI Language that introduced '''BERT''' (Bidirectional Encoder Representations from Transformers), a method for pre-training deep bidirectional language representations. BERT revolutionized NLP by demonstrating that a single pre-trained model could be fine-tuned to achieve state-of-the-art results on a wide range of downstream tasks with minimal task-specific architecture modifications.

	<!--T:3-->		== Overview == <!--T:3-->
	~~== Overview ==~~

	<!--T:4-->		<!--T:4-->
Line 28:		Line 27:
	BERT addressed this limitation by introducing a novel pre-training objective — '''masked language modeling''' (MLM) — that enables genuine bidirectional pre-training. Combined with a '''next sentence prediction''' (NSP) task, BERT learned rich contextual representations that could be transferred to downstream tasks through simple fine-tuning, eliminating the need for task-specific architectures.		BERT addressed this limitation by introducing a novel pre-training objective — '''masked language modeling''' (MLM) — that enables genuine bidirectional pre-training. Combined with a '''next sentence prediction''' (NSP) task, BERT learned rich contextual representations that could be transferred to downstream tasks through simple fine-tuning, eliminating the need for task-specific architectures.

	<!--T:6-->		== Key Contributions == <!--T:6-->
	~~== Key Contributions ==~~

	<!--T:7-->		<!--T:7-->
Line 37:		Line 35:
	* Demonstration that deep bidirectional pre-training is critically important for learning general-purpose language representations.		* Demonstration that deep bidirectional pre-training is critically important for learning general-purpose language representations.

	<!--T:8-->		== Methods == <!--T:8-->
	~~== Methods ==~~

	<!--T:9-->		<!--T:9-->
Line 64:		Line 61:
	Fine-tuning is straightforward: for each downstream task, task-specific inputs and outputs are plugged into the pre-trained model, and all parameters are fine-tuned end-to-end. For token-level tasks like named entity recognition, each token's final hidden vector is fed into a classification layer. For sequence-level tasks like sentiment analysis, the [CLS] token's representation is used.		Fine-tuning is straightforward: for each downstream task, task-specific inputs and outputs are plugged into the pre-trained model, and all parameters are fine-tuned end-to-end. For token-level tasks like named entity recognition, each token's final hidden vector is fed into a classification layer. For sequence-level tasks like sentiment analysis, the [CLS] token's representation is used.

	<!--T:17-->		== Results == <!--T:17-->
	~~== Results ==~~

	<!--T:18-->		<!--T:18-->
Line 82:		Line 78:
	The paper also showed that BERT's representations could be used as fixed feature extractors (without fine-tuning) and still achieve strong results, though fine-tuning consistently outperformed the feature-based approach.		The paper also showed that BERT's representations could be used as fixed feature extractors (without fine-tuning) and still achieve strong results, though fine-tuning consistently outperformed the feature-based approach.

	<!--T:22-->		== Impact == <!--T:22-->
	~~== Impact ==~~

	<!--T:23-->		<!--T:23-->
Line 94:		Line 89:
	The paper received over 100,000 citations within its first five years and is one of the most cited works in computer science. Google integrated BERT into its search engine in 2019, marking one of the largest deployments of a neural language model for information retrieval. The model's influence extends beyond academia into widespread industrial adoption, where BERT-based systems power search, content moderation, customer service, and many other applications.		The paper received over 100,000 citations within its first five years and is one of the most cited works in computer science. Google integrated BERT into its search engine in 2019, marking one of the largest deployments of a neural language model for information retrieval. The model's influence extends beyond academia into widespread industrial adoption, where BERT-based systems power search, content moderation, customer service, and many other applications.

	<!--T:26-->		== See also == <!--T:26-->
	~~== See also ==~~

	<!--T:27-->		<!--T:27-->
Line 102:		Line 96:
	* [[Efficient Estimation of Word Representations]]		* [[Efficient Estimation of Word Representations]]

	<!--T:28-->		== References == <!--T:28-->
	~~== References ==~~

	<!--T:29-->		<!--T:29-->

DeployBot: [deploy-bot] Convert BERT Pre-training of Deep Bidirectional Transformers to Translate-extension page

2026-04-27T00:31:21Z

[deploy-bot] Convert BERT Pre-training of Deep Bidirectional Transformers to Translate-extension page

New page

<languages />
{{LanguageBar | page = BERT Pre-training of Deep Bidirectional Transformers}}

<translate>

{{PaperInfobox
| topic_area = NLP
| difficulty = Research
| authors = Jacob Devlin; Ming-Wei Chang; Kenton Lee; Kristina Toutanova
| year = 2019
| venue = NAACL
| arxiv_id = 1810.04805
| source_url = https://arxiv.org/abs/1810.04805
| pdf_url = https://arxiv.org/pdf/1810.04805
}}
{{ContentMeta | generated_by = claude-opus | model_used = claude-opus-4-6 | generated_date = 2026-03-13}}


'''BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding''' is a 2019 paper by Devlin et al. from Google AI Language that introduced '''BERT''' (Bidirectional Encoder Representations from Transformers), a method for pre-training deep bidirectional language representations. BERT revolutionized NLP by demonstrating that a single pre-trained model could be fine-tuned to achieve state-of-the-art results on a wide range of downstream tasks with minimal task-specific architecture modifications.


== Overview ==


Before BERT, pre-trained language representations were either unidirectional (like GPT, which reads left-to-right) or used a shallow concatenation of independently trained left-to-right and right-to-left models (like ELMo). Both approaches were limited by the standard language-modeling objective, which is unidirectional: each token's representation is conditioned on context from one side only, so the model cannot jointly use left and right context at every layer.


BERT addressed this limitation by introducing a novel pre-training objective — '''masked language modeling''' (MLM) — that enables genuine bidirectional pre-training. Combined with a '''next sentence prediction''' (NSP) task, BERT learned rich contextual representations that could be transferred to downstream tasks through simple fine-tuning, eliminating the need for task-specific architectures.


== Key Contributions ==


* '''Masked language modeling''' (MLM): A pre-training objective that randomly masks input tokens and trains the model to predict them from bidirectional context, enabling true bidirectional representation learning.
* '''Next sentence prediction''' (NSP): A binary classification pre-training task that teaches the model to understand relationships between sentence pairs.
* A simple and effective '''fine-tuning paradigm''': Adding a single output layer to the pre-trained model suffices for a wide range of NLP tasks, from classification to question answering.
* Demonstration that deep bidirectional pre-training is critically important for learning general-purpose language representations.


== Methods ==


BERT uses the encoder portion of the Transformer architecture. The model takes a sequence of tokens as input and produces a contextualized embedding for each token. Two model sizes were released: BERT-Base (12 layers, 768 hidden units, 12 attention heads, 110M parameters) and BERT-Large (24 layers, 1024 hidden units, 16 attention heads, 340M parameters).
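
As a rough sanity check on the two model sizes above, the parameter counts can be estimated from the layer count and hidden size alone. The sketch below is not taken from the paper's code; it assumes a feed-forward inner size of four times the hidden size, a maximum of 512 positions, a vocabulary of roughly 30,000 tokens, and a dense "pooler" layer on the [CLS] vector. The function name <code>bert_param_count</code> is hypothetical.

```python
def bert_param_count(layers, hidden, vocab_size=30_000, max_positions=512, segments=2):
    """Rough parameter count for a BERT-style encoder (biases and LayerNorm included)."""
    ffn = 4 * hidden         # assumed feed-forward inner size (4x hidden)
    layer_norm = 2 * hidden  # scale and shift vectors of one LayerNorm

    # Token, position and segment embedding tables, followed by one LayerNorm.
    embeddings = (vocab_size + max_positions + segments) * hidden + layer_norm

    # Self-attention: query, key, value and output projections (weights + biases).
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # Feed-forward block: hidden -> ffn -> hidden (weights + biases).
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden) + layer_norm

    # Dense layer applied to the [CLS] vector (assumed "pooler").
    pooler = hidden * hidden + hidden

    return embeddings + layers * (attention + feed_forward) + pooler


base = bert_param_count(layers=12, hidden=768)
large = bert_param_count(layers=24, hidden=1024)
print(f"BERT-Base  ~ {base / 1e6:.0f}M parameters")
print(f"BERT-Large ~ {large / 1e6:.0f}M parameters")
```

Under these assumptions both estimates land within a few percent of the reported 110M and 340M. The number of attention heads does not enter the count, because the heads split the same projection matrices.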


The '''masked language modeling''' objective randomly selects 15% of the input tokens for prediction. Of these selected positions, 80% are replaced with the [MASK] token, 10% are replaced with a random token, and 10% are left unchanged. The model predicts the original token at each selected position using a cross-entropy loss:


<math>L_{\text{MLM}} = -\sum_{i \in \mathcal{M}} \log P(x_i \mid \mathbf{x}_{\backslash \mathcal{M}})</math>


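where <math>\mathcal{M}</math> is the set of masked positions (the 15% of positions chosen for prediction), <math>x_i</math> is the original token at position <math>i</math>, and <math>\mathbf{x}_{\backslash \mathcal{M}}</math> is the corrupted input sequence that the model actually sees.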
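
The masking procedure and loss described above can be sketched in a few lines of Python. This is a minimal illustration rather than the paper's implementation: it selects each position independently with probability 0.15 instead of choosing an exact 15% of positions, it does not exclude special tokens, and the toy vocabulary and the uniform stand-in for model predictions are assumptions. The names <code>mask_tokens</code> and <code>mlm_loss</code> are hypothetical.

```python
import math
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, rng, select_prob=0.15):
    """Return (corrupted, targets); targets maps selected position -> original token."""
    corrupted = list(tokens)
    targets = {}
    for i, token in enumerate(tokens):
        if rng.random() >= select_prob:
            continue  # position not selected for prediction
        targets[i] = token
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK  # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: leave the token unchanged
    return corrupted, targets


def mlm_loss(predicted_probs, targets):
    """Sum of -log P(original token) over the selected positions."""
    return -sum(math.log(predicted_probs[i][tok]) for i, tok in targets.items())


rng = random.Random(0)
vocab = ["the", "cat", "sat", "on", "mat", "dog"]
tokens = ["the", "cat", "sat", "on", "the", "mat"] * 5
corrupted, targets = mask_tokens(tokens, vocab, rng)

# Stand-in for model output: a uniform distribution over the toy vocabulary.
uniform = {tok: 1 / len(vocab) for tok in vocab}
predicted_probs = {i: uniform for i in targets}
print(corrupted)
print(mlm_loss(predicted_probs, targets))
```

Because unselected positions contribute nothing to the loss, the model is trained only on the selected positions. Since some of those keep their original token or receive a random one, the model cannot rely on [MASK] to tell it where predictions are needed.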


For '''next sentence prediction''', the model receives pairs of sentences (A and B) and predicts whether B is the sentence that actually follows A in the corpus or a randomly sampled sentence; each case makes up 50% of the training pairs. A special [CLS] token at the beginning of the input captures the aggregate sequence representation used for this binary classification.
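
A minimal sketch of how such training pairs could be assembled is shown below. It assumes the corpus is a list of documents, each a list of at least two sentences, and that negative examples are drawn from a different document; the 50/50 split between positive and negative pairs follows the paper. The function name <code>make_nsp_example</code> and the toy sentences are illustrative only.

```python
import random


def make_nsp_example(documents, rng):
    """Return (sentence_a, sentence_b, is_next) from a list of documents.

    Each document is a list of at least two sentences.
    """
    doc_index = rng.randrange(len(documents))
    doc = documents[doc_index]
    i = rng.randrange(len(doc) - 1)
    sentence_a = doc[i]

    if rng.random() < 0.5:
        return sentence_a, doc[i + 1], True  # actual next sentence

    # Negative example: a sentence drawn from a different document.
    other_indices = [j for j in range(len(documents)) if j != doc_index]
    other_doc = documents[rng.choice(other_indices)]
    return sentence_a, rng.choice(other_doc), False


documents = [
    ["The man went to the store.", "He bought a gallon of milk."],
    ["Penguins are flightless birds.", "They live in the Southern Hemisphere."],
]
rng = random.Random(0)
for _ in range(4):
    print(make_nsp_example(documents, rng))
```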


Input representation combines token embeddings, segment embeddings (indicating sentence A or B), and positional embeddings. BERT uses WordPiece tokenization with a 30,000-token vocabulary.
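
The way the three embeddings combine can be illustrated with a toy example. The sketch below packs two already-tokenized segments into one sequence, assigns segment and position ids, and sums the three embedding vectors element-wise. The [SEP] separator token comes from the paper and is not mentioned above; the two-dimensional embedding tables, the toy vocabulary, and the function names <code>build_inputs</code> and <code>embed</code> are assumptions made for illustration.

```python
def build_inputs(tokens_a, tokens_b):
    """Pack two tokenized segments into one sequence with segment and position ids."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A and its [SEP]; segment 1 covers the rest.
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    position_ids = list(range(len(tokens)))
    return tokens, segment_ids, position_ids


def embed(token_ids, segment_ids, position_ids, tok_table, seg_table, pos_table):
    """Sum the three embedding vectors element-wise for each position."""
    return [
        [t + s + p for t, s, p in zip(tok_table[tid], seg_table[sid], pos_table[pid])]
        for tid, sid, pid in zip(token_ids, segment_ids, position_ids)
    ]


tokens, segment_ids, position_ids = build_inputs(
    ["my", "dog", "is", "cute"], ["he", "likes", "play", "##ing"]
)
vocab = {tok: i for i, tok in enumerate(sorted(set(tokens)))}
token_ids = [vocab[tok] for tok in tokens]

dim = 2  # toy embedding size; BERT-Base uses 768
tok_table = [[0.1 * i] * dim for i in range(len(vocab))]
seg_table = [[0.0] * dim, [1.0] * dim]
pos_table = [[0.01 * i] * dim for i in range(len(tokens))]
vectors = embed(token_ids, segment_ids, position_ids, tok_table, seg_table, pos_table)
print(list(zip(tokens, segment_ids, position_ids)))
```

The "##ing" piece shows how WordPiece marks a word continuation. In the real model all three tables are learned during pre-training.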


Pre-training used the BooksCorpus (800M words) and English Wikipedia (2,500M words), running for 1M steps with a batch size of 256 sequences. Pre-training was computationally expensive for its time: BERT-Base was trained on 4 Cloud TPUs and BERT-Large on 16 Cloud TPUs, and each run took four days.
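
The scale of this schedule can be checked with back-of-the-envelope arithmetic. The calculation below assumes every sequence has the maximum length of 512 tokens (a figure from the paper, not stated above) and treats words and tokens as interchangeable, so it should be read as an upper-bound estimate rather than an exact figure.

```python
steps = 1_000_000
batch_size = 256                 # sequences per step
max_seq_len = 512                # assumed: maximum sequence length from the paper
corpus_words = 800e6 + 2_500e6   # BooksCorpus + English Wikipedia

tokens_per_batch = batch_size * max_seq_len
tokens_seen = steps * tokens_per_batch
approx_epochs = tokens_seen / corpus_words

print(f"{tokens_per_batch:,} tokens per batch")
print(f"~{approx_epochs:.0f} passes over the corpus")
```

Under these assumptions the model sees on the order of 40 passes over the 3.3-billion-word corpus.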


Fine-tuning is straightforward: for each downstream task, task-specific inputs and outputs are plugged into the pre-trained model, and all parameters are fine-tuned end-to-end. For token-level tasks like named entity recognition, each token's final hidden vector is fed into a classification layer. For sequence-level tasks like sentiment analysis, the [CLS] token's representation is used.
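
The difference between the two kinds of output layer can be sketched as follows. This is a toy illustration in plain Python, not a training loop: the hidden states, weights, and label counts are made-up values, and in actual fine-tuning the encoder parameters would be updated together with the classifier. The function names are hypothetical.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear(vector, weights, bias):
    """Apply a dense layer; weights has one row per output label."""
    return [sum(w * x for w, x in zip(row, vector)) + b for row, b in zip(weights, bias)]


def classify_sequence(hidden_states, weights, bias):
    """Sequence-level task: use only the first ([CLS]) hidden vector."""
    return softmax(linear(hidden_states[0], weights, bias))


def classify_tokens(hidden_states, weights, bias):
    """Token-level task: apply the same classifier to every token's hidden vector."""
    return [softmax(linear(h, weights, bias)) for h in hidden_states]


# Toy final-layer hidden states for three positions; position 0 is [CLS].
hidden_states = [[0.5, -1.0, 0.2], [1.0, 0.0, -0.5], [0.1, 0.3, 0.9]]
weights = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # two labels, hidden size 3
bias = [0.0, 0.0]

seq_probs = classify_sequence(hidden_states, weights, bias)
token_probs = classify_tokens(hidden_states, weights, bias)
print(seq_probs)
print(token_probs)
```

In both cases the only new parameters are those of the final dense layer, which is why fine-tuning needs so little task-specific architecture.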


== Results ==


BERT achieved state-of-the-art results on eleven NLP benchmarks at the time of publication:


* '''GLUE benchmark''': BERT-Large achieved an average score of 80.5, a 7.7-point improvement over the previous state of the art.
* '''SQuAD v1.1''' (question answering): F1 score of 93.2, surpassing human performance (91.2 F1).
* '''SQuAD v2.0''': F1 score of 83.1, a 5.1-point improvement over prior systems.
* '''SWAG''' (commonsense reasoning): 86.3% accuracy, outperforming human expert performance (85.0%).


Ablation studies showed that both pre-training tasks contributed to the results and that bidirectionality was the most significant factor: a left-to-right model trained without NSP performed worse on every task evaluated, with the largest drops on MRPC and SQuAD. Increasing model size consistently improved results, even on tasks with only a few thousand labeled training examples.


The paper also showed that BERT's representations could be used as fixed feature extractors (without fine-tuning) and still achieve strong results, though fine-tuning consistently outperformed the feature-based approach.


== Impact ==


BERT catalyzed a paradigm shift in NLP toward the "pre-train then fine-tune" methodology. It spawned an extensive family of derivative models, including RoBERTa (an improved pre-training recipe), ALBERT (a parameter-efficient variant), DistilBERT (a smaller model obtained by knowledge distillation), and domain-specific variants like BioBERT and SciBERT. The approach also influenced multi-modal models and, through models like mBERT and XLM, cross-lingual representations.


BERT demonstrated that large-scale unsupervised pre-training could effectively transfer linguistic knowledge to downstream tasks, reducing the need for task-specific labeled data and engineering. This pre-train-then-fine-tune paradigm remains foundational to modern NLP practice.


The paper received over 100,000 citations within its first five years and is one of the most cited works in computer science. Google integrated BERT into its search engine in 2019, marking one of the largest deployments of a neural language model for information retrieval. The model's influence extends beyond academia into widespread industrial adoption, where BERT-based systems power search, content moderation, customer service, and many other applications.


== See also ==


* [[Attention Is All You Need]]
* [[Language Models are Few-Shot Learners]]
* [[Efficient Estimation of Word Representations]]


== References ==


* Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. ''Proceedings of NAACL-HLT 2019''. [https://arxiv.org/abs/1810.04805 arXiv:1810.04805]
* Peters, M. E., Neumann, M., Iyyer, M., et al. (2018). Deep Contextualized Word Representations. ''Proceedings of NAACL-HLT 2018''.
* Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving Language Understanding by Generative Pre-Training. OpenAI technical report.


[[Category:NLP]] [[Category:Research]] [[Category:Research Papers]]
</translate>