The model was trained on a filtered and deduplicated dataset built primarily from Common Crawl (roughly 570 GB of text after quality filtering with a classifier trained on high-quality reference corpora), supplemented with WebText2, Books1, Books2, and English Wikipedia. Training used a batch size that ramped from 32K to 3.2M tokens and a learning rate schedule with linear warmup.
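To make the training schedule concrete, the sketch below shows in Python what a linear batch-size ramp combined with a warmup-then-cosine-decay learning rate schedule can look like. The paper does not publish code, and the specific constants here (peak learning rate, warmup length, total token budget, ramp length) are illustrative placeholders in the spirit of the description above, not authoritative hyperparameters.

```python
import math

def lr_schedule(tokens_seen, peak_lr=6e-5, warmup_tokens=375e6,
                total_tokens=300e9, min_lr_frac=0.1):
    """Linear warmup to `peak_lr` over the first `warmup_tokens`,
    then cosine decay toward `min_lr_frac * peak_lr`.
    All constants are assumed, illustrative values."""
    if tokens_seen < warmup_tokens:
        return peak_lr * tokens_seen / warmup_tokens
    progress = (tokens_seen - warmup_tokens) / (total_tokens - warmup_tokens)
    progress = min(progress, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return peak_lr * (min_lr_frac + (1.0 - min_lr_frac) * cosine)

def batch_size_schedule(tokens_seen, start=32_000, end=3_200_000,
                        ramp_tokens=4e9):
    """Linearly ramp the batch size (measured in tokens) from `start`
    to `end` over the first `ramp_tokens` tokens of training.
    The ramp length is an assumed placeholder."""
    frac = min(tokens_seen / ramp_tokens, 1.0)
    return int(start + frac * (end - start))

# Example: query the schedules a few billion tokens into training.
print(lr_schedule(2e9), batch_size_schedule(2e9))
```

Ramping the batch size early in training lets the optimizer take many small, noisy steps while the loss is changing quickly, then switch to large batches once gradients are more stable; the warmup on the learning rate serves a similar stabilizing role at the very start.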