The masked language modeling objective randomly selects 15% of the input tokens for prediction. Of these selected positions, 80% are replaced with the [MASK] token, 10% with a random token, and 10% are left unchanged. The model then predicts the original token at each selected position using a cross-entropy loss.
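
Written out, the loss takes the standard form below (the notation is an assumption, since no formula appears in this translation unit): let <math>M</math> denote the set of selected positions, <math>x_i</math> the original token at position <math>i</math>, and <math>\tilde{x}</math> the corrupted input sequence. Then

<math display="block">\mathcal{L}_{\text{MLM}} = -\sum_{i \in M} \log P\!\left(x_i \mid \tilde{x}\right)</math>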
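
A minimal sketch of the 80/10/10 corruption step, assuming token ids as plain Python integers; the function name <code>mask_tokens</code> and the parameters <code>mask_token_id</code> and <code>vocab_size</code> are illustrative, not from the source:

<syntaxhighlight lang="python">
import random

def mask_tokens(token_ids, mask_token_id, vocab_size, mask_prob=0.15):
    """Apply BERT-style 80/10/10 corruption to a list of token ids.

    Returns the corrupted sequence and a label list in which -100
    marks positions that are not predicted (a common convention,
    not specified in the source).
    """
    corrupted = list(token_ids)
    labels = [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if random.random() < mask_prob:    # select ~15% of positions
            labels[i] = tok                # predict the original token here
            r = random.random()
            if r < 0.8:                    # 80%: replace with [MASK]
                corrupted[i] = mask_token_id
            elif r < 0.9:                  # 10%: replace with a random token
                corrupted[i] = random.randrange(vocab_size)
            # remaining 10%: leave the token unchanged
    return corrupted, labels

if __name__ == "__main__":
    ids = [5, 6, 7, 8, 9, 10]
    print(mask_tokens(ids, mask_token_id=0, vocab_size=100))
</syntaxhighlight>

The cross-entropy loss is then computed only at positions whose label is not -100, matching the sum over <math>M</math> in the formula above.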