Attention Is All You Need

    Research Paper
    Authors Ashish Vaswani; Noam Shazeer; Niki Parmar; Jakob Uszkoreit; Llion Jones; Aidan N. Gomez; Lukasz Kaiser; Illia Polosukhin
    Year 2017
    Venue NeurIPS
    Topic area NLP
    Difficulty Research
    arXiv 1706.03762

    Attention Is All You Need is a landmark 2017 paper by Vaswani et al. that introduced the Transformer architecture, a novel neural network design based entirely on attention mechanisms. The paper demonstrated that recurrent and convolutional layers, previously considered essential for sequence-to-sequence tasks, could be replaced by self-attention, yielding superior performance and dramatically improved training efficiency.

    Overview

    Prior to the Transformer, the dominant sequence transduction models relied on recurrent neural networks (RNNs), particularly LSTMs and GRUs, often augmented with attention mechanisms. These architectures processed tokens sequentially, which precluded parallelization within training examples and made sequence length a fundamental bottleneck. The Transformer eliminated this constraint by relying solely on attention to draw global dependencies between input and output, enabling far greater parallelism and reducing training times from days to hours on contemporary hardware.

    The model was evaluated on English-to-German and English-to-French translation benchmarks from the WMT 2014 shared task, where it achieved new state-of-the-art BLEU scores while requiring substantially less compute to train than competing models.

    Key Contributions

    • Introduction of the Transformer, the first sequence transduction model based entirely on attention without recurrence or convolution.
    • The scaled dot-product attention mechanism and multi-head attention, which allow the model to jointly attend to information from different representation subspaces at different positions.
    • Positional encodings using sinusoidal functions, providing the model with information about token order without recurrence.
    • Demonstration that attention-only models can achieve state-of-the-art results on machine translation while being more parallelizable and faster to train.

    Methods

    The Transformer follows an encoder-decoder structure. The encoder maps an input sequence of symbol representations to a sequence of continuous representations, and the decoder generates an output sequence one element at a time in an autoregressive fashion.

    The core operation is scaled dot-product attention, defined as:

    $ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $

    where $ Q $, $ K $, and $ V $ are matrices of queries, keys, and values respectively, and $ d_k $ is the dimensionality of the keys. Dividing by $ \sqrt{d_k} $ keeps the dot products from growing large in magnitude, which would otherwise push the softmax into regions with extremely small gradients.
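
    To make the formula concrete, the following is a minimal NumPy sketch (illustrative only, not the authors' implementation) of scaled dot-product attention:

        import numpy as np

        def scaled_dot_product_attention(Q, K, V):
            # Q: (len_q, d_k), K: (len_k, d_k), V: (len_k, d_v)
            d_k = Q.shape[-1]
            # Similarity scores between queries and keys, scaled by sqrt(d_k)
            scores = Q @ K.T / np.sqrt(d_k)
            # Row-wise softmax turns scores into attention weights
            weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
            weights /= weights.sum(axis=-1, keepdims=True)
            # Output is a weighted sum of the values
            return weights @ V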

    Multi-head attention applies several attention functions in parallel, each with different learned linear projections:

    $ \text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W^O $

    where each $ \text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V) $.
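
    Reusing the scaled_dot_product_attention sketch above, multi-head attention can be illustrated as follows. The projection matrices here are random stand-ins for what would be learned parameters in a trained model:

        def multi_head_attention(Q, K, V, num_heads, seed=0):
            d_model = Q.shape[-1]
            d_k = d_model // num_heads  # per-head dimensionality
            rng = np.random.default_rng(seed)
            heads = []
            for _ in range(num_heads):
                # W_i^Q, W_i^K, W_i^V are learned in practice; random here for illustration
                W_q = rng.normal(size=(d_model, d_k)) / np.sqrt(d_model)
                W_k = rng.normal(size=(d_model, d_k)) / np.sqrt(d_model)
                W_v = rng.normal(size=(d_model, d_k)) / np.sqrt(d_model)
                heads.append(scaled_dot_product_attention(Q @ W_q, K @ W_k, V @ W_v))
            # Concatenate head outputs and apply the output projection W^O
            W_o = rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)
            return np.concatenate(heads, axis=-1) @ W_o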

    The encoder consists of six identical layers, each containing a multi-head self-attention sublayer followed by a position-wise feed-forward network, with residual connections and layer normalization around each sublayer. The decoder adds a third sublayer that performs multi-head attention over the encoder output, and masks future positions in the self-attention to preserve the autoregressive property.
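
    To make the decoder's masking concrete, here is a minimal sketch (not the paper's code) of causal self-attention, assuming for brevity that the input sequence serves directly as queries, keys, and values; a real decoder layer would apply learned projections first:

        def causal_self_attention(X):
            # X: (seq_len, d). Each position may attend only to itself
            # and earlier positions, preserving the autoregressive property.
            seq_len, d = X.shape
            scores = X @ X.T / np.sqrt(d)
            # -inf above the diagonal zeroes out future positions in the softmax
            scores += np.triu(np.full((seq_len, seq_len), -np.inf), k=1)
            weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
            weights /= weights.sum(axis=-1, keepdims=True)
            return weights @ X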

    Since the model contains no recurrence, positional encodings are added to the input embeddings using sinusoidal functions of different frequencies:

    $ PE_{(pos, 2i)} = \sin(pos / 10000^{2i/d_{\text{model}}}) $

    $ PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i/d_{\text{model}}}) $
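
    These definitions translate directly into code. A short sketch, assuming an even $ d_{\text{model}} $, fills even dimensions with sines and odd dimensions with cosines:

        def positional_encoding(max_len, d_model):
            # pos: (max_len, 1), i: (1, d_model/2); broadcasting gives all angles
            pos = np.arange(max_len)[:, None]
            i = np.arange(d_model // 2)[None, :]
            angles = pos / np.power(10000.0, 2 * i / d_model)
            pe = np.zeros((max_len, d_model))
            pe[:, 0::2] = np.sin(angles)  # even dimensions: sine
            pe[:, 1::2] = np.cos(angles)  # odd dimensions: cosine
            return pe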

    Results

    On the WMT 2014 English-to-German translation task, the big Transformer model achieved a BLEU score of 28.4, surpassing the previous best results, including ensembles, by more than 2 BLEU points. On the WMT 2014 English-to-French task, it achieved 41.0 BLEU, establishing a new single-model state of the art at a fraction of the training cost of prior models.

    The base model was trained in approximately 12 hours on 8 NVIDIA P100 GPUs, while the big model required 3.5 days on the same hardware — still substantially less than competing RNN-based architectures required for comparable performance.

    Ablation studies showed that multi-head attention outperformed single-head attention, that quality dropped when the attention key size $ d_k $ was reduced, and that learned positional embeddings performed nearly identically to the sinusoidal encodings.

    Impact

    The Transformer architecture fundamentally reshaped the landscape of deep learning and natural language processing. It became the foundation for virtually all subsequent large language models, including BERT, GPT, T5, and their successors. Beyond NLP, the architecture was adopted in computer vision (Vision Transformer), speech recognition, protein structure prediction (AlphaFold 2), and many other domains.

    The paper's title — "Attention Is All You Need" — became one of the most recognizable phrases in machine learning, and the architecture it introduced has been called one of the most influential contributions to artificial intelligence in the 2010s. As of 2026, the Transformer remains the dominant architecture for large-scale neural network models across modalities.

    The original paper has accumulated over 100,000 citations, making it one of the most cited works in computer science history. The eight co-authors went on to found or co-found multiple AI companies, reflecting the enormous commercial value that flowed from the Transformer's invention.

    References

    • Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems 30 (NeurIPS 2017). arXiv:1706.03762
    • Bahdanau, D., Cho, K., & Bengio, Y. (2015). Neural Machine Translation by Jointly Learning to Align and Translate. ICLR 2015.
    • Wu, Y., Schuster, M., Chen, Z., Le, Q. V., et al. (2016). Google's Neural Machine Translation System. arXiv:1609.08144.