Translations:Attention Is All You Need/7/en

    From Marovi AI
    Revision as of 21:39, 27 April 2026 by FuzzyBot (importing a new version from external source)
    • Introduction of the Transformer, the first sequence transduction model based entirely on attention, dispensing with recurrence and convolution.
    • The scaled dot-product attention mechanism and multi-head attention, which allow the model to jointly attend to information from different representation subspaces at different positions.
    • Positional encodings using sinusoidal functions, providing the model with information about token order without recurrence.
    • Demonstration that attention-only models can achieve state-of-the-art results on machine translation while being more parallelizable and faster to train.
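The scaled dot-product attention mentioned above computes softmax(QKᵀ/√d_k)V; a minimal NumPy sketch (function name and unbatched 2-D shapes are illustrative assumptions, not the paper's code):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Q: (n_queries, d_k), K: (n_keys, d_k), V: (n_keys, d_v)
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # similarity of each query to each key
    scores -= scores.max(axis=-1, keepdims=True)  # stabilize the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # rows sum to 1
    return weights @ V                        # weighted average of value vectors
```

The 1/√d_k scaling keeps the dot products from growing with dimension, which would otherwise push the softmax into regions with vanishing gradients; multi-head attention applies this same operation in parallel over several learned projections of Q, K, and V.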
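The sinusoidal positional encodings in the list above follow PE(pos, 2i) = sin(pos/10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos/10000^(2i/d_model)); a small sketch, assuming an even d_model (function name is illustrative):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    # One row per position, one column per encoding dimension.
    # Even columns get sines, odd columns get cosines, at
    # geometrically spaced wavelengths from 2*pi to 10000*2*pi.
    pos = np.arange(max_len)[:, None]            # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]        # (1, d_model/2)
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe
```

These encodings are added to the token embeddings, giving the otherwise order-agnostic attention layers a signal about each token's position without any recurrent state.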