Transformer
Transformers are a family of neural network architectures built around the self-attention mechanism, introduced by Vaswani et al. in the 2017 paper Attention Is All You Need. They have become the dominant architecture for natural language processing and increasingly for computer vision, speech, and multimodal tasks.
Overview
A transformer processes a sequence of tokens by repeatedly mixing information across positions using self-attention rather than recurrence or convolution. Each layer applies multi-head self-attention followed by a position-wise feed-forward network, with residual connections and layer normalisation around each sublayer. Because self-attention is permutation-invariant, positional information is injected via positional encodings.
Unlike recurrent neural networks, transformers compute all positions of the sequence in parallel during training, which dramatically improves throughput on modern hardware. This parallelism, combined with self-attention's ability to model long-range dependencies in a constant number of layers, has enabled the training of very large models on very large corpora: the foundation of contemporary large language models (LLMs).
Key Concepts
- Self-attention — every token in a sequence computes a weighted sum over all tokens (including itself), where the weights are learned from content similarity rather than position.
- Queries, keys, and values — each token is projected into three vectors: queries match against keys to produce attention weights, which are then used to combine values.
- Multi-head attention — several attention operations run in parallel with independent projections, allowing the model to attend to different aspects of the input simultaneously.
- Positional encodings — sinusoidal or learned vectors added to token embeddings so the model can distinguish positions in an otherwise order-agnostic operation.
- Residual connections and layer normalisation — wrap every sublayer to stabilise gradients and enable very deep stacks.
- Feed-forward networks — two-layer position-wise MLPs applied independently to each token, providing per-position nonlinear transformation.
- Causal masking — in decoders, future positions are masked out of the attention weights so the model cannot peek ahead during autoregressive generation.
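To make causal masking concrete, here is a minimal NumPy sketch (not any library's implementation) that applies a causal mask to a matrix of raw attention scores and then takes a row-wise softmax:

```python
import numpy as np

def causal_attention_weights(scores):
    """Apply a causal mask to raw attention scores, then softmax.

    scores: (n, n) matrix of query-key similarities; row i holds the
    scores of position i attending to every position j.
    """
    n = scores.shape[0]
    # Mask out j > i so position i cannot attend to future tokens.
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)
    # Row-wise softmax; masked entries get exactly zero weight.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

scores = np.zeros((4, 4))  # uniform scores, purely for illustration
w = causal_attention_weights(scores)
print(np.round(w, 2))
# Row 0 attends only to itself; row 3 attends uniformly to all four tokens.
```

With uniform scores, each row spreads its weight evenly over the positions it is allowed to see, which is exactly the "cannot peek ahead" behaviour described above.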
History
Attention as a soft alignment between encoder and decoder hidden states was introduced for machine translation by Bahdanau et al. (2015) and refined by Luong et al. (2015), but these models were still built on top of RNNs. In 2017, Vaswani et al. proposed the transformer, which removed recurrence entirely and relied solely on attention. The architecture won the WMT'14 English–German and English–French translation benchmarks while training in a fraction of the time.
In 2018, Devlin et al. released BERT, a deep bidirectional encoder pre-trained with masked-language modelling, which set new state-of-the-art results on a wide range of NLP benchmarks. The same year, Radford et al. introduced GPT, a decoder-only autoregressive transformer trained with a standard language-modelling objective. Subsequent scaling (GPT-2 in 2019, GPT-3 in 2020, PaLM, LLaMA, and contemporary frontier models) established that transformer performance follows predictable scaling laws (Kaplan et al. 2020) as parameters, data, and compute grow.
Transformers also expanded beyond text: vision transformers (Dosovitskiy et al. 2021) treat image patches as tokens, speech transformers (e.g. Whisper) operate on audio spectrograms, and multimodal models such as CLIP and Flamingo unify several modalities in a single architecture.
Key Approaches
The core building block is scaled dot-product attention. Given matrices of queries $ Q \in \mathbb{R}^{n \times d_k} $, keys $ K \in \mathbb{R}^{n \times d_k} $, and values $ V \in \mathbb{R}^{n \times d_v} $:
- $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\!\top}}{\sqrt{d_k}}\right) V $
The scaling factor $ \sqrt{d_k} $ prevents the dot products from growing large and pushing the softmax into low-gradient regions. Multi-head attention runs $ h $ parallel heads with independent projections and concatenates their outputs:
- $ \mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^O $
where $ \mathrm{head}_i = \mathrm{Attention}(Q W_i^Q,\, K W_i^K,\, V W_i^V) $.
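The two formulas above can be sketched directly in NumPy. This is an illustrative single-sequence implementation (random weights, no batching or masking), not a production one; the shapes follow the definitions above with $ d_v = d_k $ per head:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)
    return softmax(scores) @ V

def multi_head(X, params):
    """Multi-head self-attention over X of shape (n, d_model).

    params holds per-head projections W_q, W_k, W_v of shape
    (h, d_model, d_k) and an output projection W_o of shape
    (h * d_k, d_model), matching MultiHead = Concat(heads) W^O.
    """
    heads = [attention(X @ Wq, X @ Wk, X @ Wv)
             for Wq, Wk, Wv in zip(params["W_q"], params["W_k"], params["W_v"])]
    return np.concatenate(heads, axis=-1) @ params["W_o"]

rng = np.random.default_rng(0)
n, d_model, h = 5, 16, 4
d_k = d_model // h
params = {
    "W_q": rng.normal(size=(h, d_model, d_k)),
    "W_k": rng.normal(size=(h, d_model, d_k)),
    "W_v": rng.normal(size=(h, d_model, d_k)),
    "W_o": rng.normal(size=(h * d_k, d_model)),
}
X = rng.normal(size=(n, d_model))
out = multi_head(X, params)
print(out.shape)  # (5, 16): one d_model-dimensional output per token
```

Note how each head sees its own low-dimensional projection of the input, and the output projection $ W^O $ mixes the concatenated heads back to the model width.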
Three architectural variants dominate practice:
- Encoder–decoder transformers — the original design, used for sequence-to-sequence tasks such as translation and summarisation. Encoder layers use bidirectional self-attention; decoder layers use causal self-attention and cross-attention to encoder outputs. Examples: T5, BART, the original Vaswani model.
- Encoder-only transformers — discard the decoder and use bidirectional self-attention throughout. Trained with masked-language modelling for representation learning and classification tasks. Examples: BERT, RoBERTa, DeBERTa.
- Decoder-only transformers — discard the encoder and use causal self-attention only, trained with next-token prediction. Now the dominant design for general-purpose language models. Examples: the GPT family, LLaMA, PaLM, Mistral.
The original transformer uses sinusoidal positional encodings:
- $ \mathrm{PE}(pos, 2i) = \sin\!\left(\frac{pos}{10000^{2i/d}}\right), \quad \mathrm{PE}(pos, 2i+1) = \cos\!\left(\frac{pos}{10000^{2i/d}}\right) $
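The encoding above can be computed in a few lines; this NumPy sketch assumes an even model dimension $ d $ and follows the sin/cos interleaving of the formula:

```python
import numpy as np

def sinusoidal_positions(n, d):
    """Sinusoidal positional encodings for n positions and even width d.

    Even dimensions get sin(pos / 10000^(2i/d)); odd dimensions get the
    cosine at the same frequency.
    """
    pos = np.arange(n)[:, None]                  # (n, 1)
    i = np.arange(d // 2)[None, :]               # (1, d/2)
    angles = pos / np.power(10000.0, 2 * i / d)  # (n, d/2)
    pe = np.empty((n, d))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_positions(50, 64)
print(pe.shape)   # (50, 64)
print(pe[0, :4])  # position 0: [0. 1. 0. 1.], i.e. sin(0), cos(0), ...
```

Each dimension pair oscillates at a different geometric frequency, so nearby positions get similar encodings while distant positions remain distinguishable.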
Subsequent work introduced learned absolute positions (BERT, GPT-2), relative position encodings (Shaw et al. 2018, T5), rotary position embeddings (RoPE, Su et al. 2021), and attention with linear biases (ALiBi, Press et al. 2022), the latter two improving extrapolation to longer sequences than seen during training.
Self-attention has $ O(n^2 d) $ complexity, which is expensive for long sequences. Efficient transformer variants reduce this cost: Linformer and Performer use low-rank or kernel approximations; Longformer and BigBird mix local and global attention patterns; FlashAttention (Dao et al. 2022) reorders the computation to be IO-aware, achieving exact attention with much higher hardware utilisation. Sparse mixture-of-experts (MoE) routing replaces dense feed-forward sublayers with sparsely-activated experts (Switch Transformer, Mixtral) to scale parameters without proportionally scaling compute.
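To make the quadratic cost concrete, here is a small sketch of a sliding-window attention mask in the spirit of Longformer's local pattern (the window size is an illustrative choice, and the global tokens that Longformer adds are omitted). It counts how many query-key pairs survive the restriction:

```python
import numpy as np

def local_attention_mask(n, w):
    """Boolean (n, n) mask where token i may attend only to tokens
    within w positions on either side: a local attention pattern."""
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= w

n, w = 1024, 32
mask = local_attention_mask(n, w)
dense_pairs = n * n          # full self-attention: O(n^2)
local_pairs = int(mask.sum())  # local attention: roughly n * (2w + 1)
print(dense_pairs, local_pairs)  # 1048576 65504
```

Restricting each token to a 65-token window cuts the attended pairs by roughly 16x here, and the saving grows linearly with sequence length.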
Connections
Transformers are deeply connected to other ideas in modern machine learning. They build directly on Attention Mechanisms, generalising the soft alignment used in earlier sequence-to-sequence models. Their inputs are typically dense word embeddings (over learned subword vocabularies such as BPE and WordPiece), and their outputs over a vocabulary are produced by a softmax over linear projections of the final hidden states.
Training relies on the same machinery as other deep networks: backpropagation through the attention and feed-forward sublayers, optimisation by gradient descent variants such as Adam and AdamW, and cross-entropy loss for next-token prediction or masked-token recovery. Regularisation techniques including dropout (on attention weights and feed-forward activations) and weight decay are standard.
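The next-token cross-entropy objective mentioned above is simple enough to sketch directly. This illustrative NumPy version (random logits, no model) computes the mean negative log-probability of the true next token at each position:

```python
import numpy as np

def next_token_cross_entropy(logits, targets):
    """Mean cross-entropy for next-token prediction.

    logits: (n, vocab) unnormalised scores at each position;
    targets: (n,) index of the true next token at each position.
    """
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

rng = np.random.default_rng(0)
logits = rng.normal(size=(8, 100))          # pretend model outputs
targets = rng.integers(0, 100, size=8)      # pretend ground-truth tokens
loss = next_token_cross_entropy(logits, targets)
print(float(loss))  # near log(100) ~ 4.6 for an uninformed model
```

A model that assigns uniform probability over a vocabulary of size $ V $ incurs exactly $ \ln V $ nats of loss per token, which is why language-model losses are often reported alongside perplexity $ e^{\mathrm{loss}} $.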
Architecturally, transformers can be viewed as a stack of fully-connected layers with content-conditioned routing, or as a special case of graph neural networks operating on fully-connected graphs. They are usually pre-trained on large unlabelled corpora and then adapted to downstream tasks via transfer learning: fine-tuning, parameter-efficient methods such as LoRA and adapters, or simply prompting in-context.
See also
- Attention Mechanisms
- Neural Networks
- Recurrent Neural Networks
- Word Embeddings
- Softmax Function
- Transfer Learning
- Backpropagation
- Cross-Entropy Loss
References
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł. and Polosukhin, I. (2017). "Attention Is All You Need". NeurIPS.
- Devlin, J., Chang, M.-W., Lee, K. and Toutanova, K. (2019). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". NAACL.
- Radford, A., Narasimhan, K., Salimans, T. and Sutskever, I. (2018). "Improving Language Understanding by Generative Pre-Training". OpenAI technical report.
- Brown, T. et al. (2020). "Language Models are Few-Shot Learners". NeurIPS.
- Kaplan, J. et al. (2020). "Scaling Laws for Neural Language Models". arXiv:2001.08361.
- Dosovitskiy, A. et al. (2021). "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale". ICLR.
- Su, J., Lu, Y., Pan, S., Murtadha, A., Wen, B. and Liu, Y. (2021). "RoFormer: Enhanced Transformer with Rotary Position Embedding". arXiv:2104.09864.
- Dao, T., Fu, D. Y., Ermon, S., Rudra, A. and Ré, C. (2022). "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness". NeurIPS.