BERT: Pre-training of Deep Bidirectional Transformers

Ablation studies demonstrated that both pre-training tasks (masked language modeling and next sentence prediction) contributed to performance, and that bidirectionality was the most significant factor: replacing the masked language model with a left-to-right model caused large drops across all tasks. Increasing model size consistently improved results, even on small-scale tasks, provided the model was fine-tuned appropriately.
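
As a concrete illustration of the model-size comparison, below is a minimal sketch (not from the paper) that fine-tunes two public BERT checkpoints of different sizes on a small classification task. It assumes the Hugging Face transformers and datasets libraries; the checkpoint names are public releases, and the hyperparameters are illustrative values within the fine-tuning ranges the paper recommends.

```python
# Illustrative sketch: fine-tune BERT checkpoints of two sizes on SST-2
# (a small sentence-classification task) and compare validation accuracy.
# Assumes Hugging Face transformers and datasets; hyperparameters are
# illustrative, not the paper's exact experimental setup.
import numpy as np
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

def compute_metrics(eval_pred):
    # Simple accuracy over argmax predictions.
    preds = np.argmax(eval_pred.predictions, axis=-1)
    return {"accuracy": (preds == eval_pred.label_ids).mean()}

def fine_tune(checkpoint: str) -> dict:
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    dataset = load_dataset("glue", "sst2")
    encoded = dataset.map(
        lambda batch: tokenizer(batch["sentence"], truncation=True),
        batched=True)
    model = AutoModelForSequenceClassification.from_pretrained(
        checkpoint, num_labels=2)
    args = TrainingArguments(
        output_dir=f"out-{checkpoint}",
        learning_rate=2e-5,               # small learning rate, typical for BERT fine-tuning
        num_train_epochs=3,
        per_device_train_batch_size=32)
    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=encoded["train"],
        eval_dataset=encoded["validation"],
        tokenizer=tokenizer,              # enables dynamic padding via the default collator
        compute_metrics=compute_metrics)
    trainer.train()
    return trainer.evaluate()

# The larger checkpoint typically scores higher, even on this small task.
for ckpt in ("bert-base-uncased", "bert-large-uncased"):
    print(ckpt, fine_tune(ckpt))
```

In practice the larger checkpoint tends to reach higher validation accuracy, mirroring the ablation result, though the paper notes that fine-tuning the large model on small datasets can be unstable and may need several random restarts.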