Pre-training used the BooksCorpus (800M words) and English Wikipedia (2,500M words), running for 1M steps with a batch size of 256 sequences. The total pre-training compute was substantial for its time: each run took four days to complete, on 4 Cloud TPUs for BERT-Base and 16 Cloud TPUs for BERT-Large.
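
To see what these settings imply in aggregate, the short sketch below multiplies them out. The 512-token maximum sequence length is an assumption taken from the original BERT paper rather than this passage, and word counts are used as a rough proxy for WordPiece tokens, so the result is only an approximation.

```python
# Back-of-the-envelope check of the reported pre-training settings.
# max_seq_len = 512 is an assumption (not stated in the passage above).

steps = 1_000_000          # pre-training steps
batch_sequences = 256      # sequences per batch
max_seq_len = 512          # assumed maximum tokens per sequence

tokens_per_batch = batch_sequences * max_seq_len   # ~131k tokens per batch
total_tokens = steps * tokens_per_batch            # ~131B tokens overall

corpus_words = (800 + 2_500) * 1_000_000           # BooksCorpus + Wikipedia
approx_epochs = total_tokens / corpus_words        # roughly 40 passes

print(f"tokens per batch: {tokens_per_batch:,}")
print(f"total tokens seen: {total_tokens:,}")
print(f"approximate epochs over the 3.3B-word corpus: {approx_epochs:.0f}")
```

Under these assumptions the model sees on the order of 130 billion tokens, i.e. roughly 40 passes over the combined 3.3 billion-word corpus.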