Ablation studies showed that multi-head attention outperformed single-head attention, that the 1/√d_k scaling factor in dot-product attention was important when the key dimension d_k is large, and that learned positional embeddings performed comparably to the sinusoidal encodings.
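The following is a minimal NumPy sketch, not the paper's reference implementation, of the two mechanisms these ablations vary: the 1/√d_k scaling inside dot-product attention and the fixed sinusoidal positional encoding that the learned embeddings were compared against. Function names, shapes, and the toy usage at the end are illustrative assumptions.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.

    Without the 1/sqrt(d_k) factor, the dot products grow with the key
    dimension d_k, pushing the softmax into regions with very small
    gradients -- the effect the scaling-factor ablation probes.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)      # (batch, len_q, len_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V                                     # (batch, len_q, d_v)

def sinusoidal_positional_encoding(max_len, d_model):
    """Fixed sinusoidal encodings, the alternative to learned position embeddings."""
    positions = np.arange(max_len)[:, None]                # (max_len, 1)
    dims = np.arange(d_model)[None, :]                     # (1, d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])                  # even dims: sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])                  # odd dims: cosine
    return pe

# Toy self-attention over a sequence of length 4 with model width 8 (illustrative).
rng = np.random.default_rng(0)
x = rng.normal(size=(1, 4, 8)) + sinusoidal_positional_encoding(4, 8)
out = scaled_dot_product_attention(x, x, x)
print(out.shape)                                           # (1, 4, 8)
```

Multi-head attention, which the ablations found superior to a single head, simply runs several such attention functions in parallel on separately projected queries, keys, and values and concatenates the results.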