BERT catalyzed a paradigm shift in NLP toward the "pre-train then fine-tune" methodology. It spawned an extensive family of derivative models, including RoBERTa (which refined the pre-training recipe by dropping next sentence prediction and training longer on more data), ALBERT (a parameter-efficient variant using cross-layer parameter sharing and factorized embeddings), DistilBERT (a smaller model obtained via knowledge distillation), and domain-specific variants such as BioBERT and SciBERT. The approach also influenced multimodal models and, through models like mBERT and XLM, cross-lingual representations.
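A minimal sketch of the "pre-train then fine-tune" workflow, not taken from the source: it loads a pre-trained BERT checkpoint with the Hugging Face transformers library and updates all weights on a downstream classification task. The model name, toy examples, and hyperparameters here are illustrative assumptions.

```python
# Sketch of "pre-train then fine-tune": load a pre-trained BERT encoder,
# attach a fresh classification head, and fine-tune on labeled task data.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Pre-trained checkpoint; "bert-base-uncased" is an assumed example.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# Toy labeled examples standing in for a downstream task (e.g., sentiment).
texts = ["a genuinely moving film", "flat and tediously predictable"]
labels = torch.tensor([1, 0])
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

# Fine-tuning updates both the pre-trained encoder and the new head.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for _ in range(3):  # a few gradient steps, for illustration only
    optimizer.zero_grad()
    outputs = model(**batch, labels=labels)
    outputs.loss.backward()
    optimizer.step()

# Inference with the fine-tuned model.
model.eval()
with torch.no_grad():
    predictions = model(**batch).logits.argmax(dim=-1)
print(predictions)  # predicted class per example
```

The same pattern, with a different task head, underlies most of the derivative models listed above.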