Transformer
    <languages />
    {{ArticleInfobox
    | topic_area    = Machine Learning
    | difficulty    = Introductory
    }}
    {{ContentMeta
    | generated_by  = claude-code-direct
    | model_used    = claude-opus-4-7
    | generated_date = 2026-04-27
    }}
    <translate>
    <!--T:1-->
'''Transformers''' are a family of neural network architectures built around the [[Attention Mechanisms|self-attention]] mechanism, introduced by Vaswani et al. in the 2017 paper ''{{Term|attention|Attention}} Is All You Need''. They have become the dominant architecture for natural language processing and increasingly for computer vision, speech, and multimodal tasks.


    <!--T:2-->
== Overview ==


    <!--T:3-->
A transformer processes a sequence of tokens by repeatedly mixing information across positions using {{Term|attention}} rather than recurrence or {{Term|convolution}}. Each layer applies '''multi-head self-{{Term|attention}}''' followed by a position-wise feed-forward network, with residual connections and layer normalisation at each sublayer. Because {{Term|attention}} is permutation-invariant, positional information is injected via '''positional encodings'''.
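In code, the sublayer wiring of a single layer can be sketched in a few lines. The NumPy sketch below is illustrative only: it uses the post-layer-normalisation residual pattern of the original paper, shows layer normalisation without its learned scale and shift, and takes the {{Term|attention}} and feed-forward sublayers as placeholder callables rather than implementing them here.

<syntaxhighlight lang="python">
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalise each token's features to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def transformer_layer(x, self_attention, feed_forward):
    """One layer: residual connection + layer norm around each sublayer (post-LN)."""
    x = layer_norm(x + self_attention(x))  # multi-head self-attention sublayer
    x = layer_norm(x + feed_forward(x))    # position-wise feed-forward sublayer
    return x
</syntaxhighlight>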


    <!--T:4-->
Unlike [[Recurrent Neural Networks|recurrent neural networks]], transformers compute all positions of the sequence in parallel during training, which dramatically improves throughput on modern hardware. This parallelism, combined with {{Term|attention}}'s ability to model long-range dependencies in a constant number of layers, has enabled the training of very large models on very large corpora — the foundation of contemporary '''large language models''' (LLMs).


    <!--T:5-->
== Key Concepts ==


    <!--T:6-->
* '''Self-{{Term|attention}}''' — every token in a sequence computes a weighted sum of all other tokens, where the weights are learned from content similarity rather than position.
* '''Queries, keys, and values''' — each token is projected into three vectors: queries match against keys to produce {{Term|attention}} weights, which are then used to combine values.
* '''Multi-head {{Term|attention}}''' — several {{Term|attention}} operations run in parallel with independent projections, allowing the model to attend to different aspects of the input simultaneously.
* '''Positional encoding''' — sinusoidal or learned vectors added to token {{Term|embedding|embeddings}} so the model can distinguish positions in an otherwise order-agnostic operation.
* '''Residual connections and layer normalisation''' — wrap every sublayer to stabilise gradients and enable very deep stacks.
* '''Feed-forward networks''' — two-layer position-wise MLPs applied independently to each token, providing per-position nonlinear transformation.
* '''Causal masking''' — in decoders, future positions are masked out of the {{Term|attention}} so the model cannot peek ahead during autoregressive generation (a minimal sketch follows this list).
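To make causal masking concrete, the following NumPy sketch (illustrative only) builds an upper-triangular mask over a toy 5×5 matrix of raw {{Term|attention}} scores and applies it before the [[Softmax Function|softmax]], so every token's weights over its future positions come out as zero.

<syntaxhighlight lang="python">
import numpy as np

def causal_mask(n):
    # True above the diagonal = positions a token may NOT attend to (its future).
    return np.triu(np.ones((n, n), dtype=bool), k=1)

def masked_softmax(scores, mask):
    # Masked positions get -inf so they receive zero attention weight.
    scores = np.where(mask, -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

scores = np.random.randn(5, 5)                       # raw attention scores for 5 tokens
weights = masked_softmax(scores, causal_mask(5))
print(np.round(weights, 2))                          # upper triangle is all zeros
</syntaxhighlight>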


    <!--T:7-->
== History ==


    <!--T:8-->
{{Term|attention|Attention}} as a soft alignment between encoder and decoder hidden states was introduced for neural machine translation by Bahdanau et al. (2015) and refined by Luong et al. (2015), but these models were still built on top of [[Recurrent Neural Networks|RNNs]]. In 2017, Vaswani et al. proposed the transformer, which removed recurrence entirely and relied solely on {{Term|attention}}. The architecture set new state-of-the-art results on the WMT 2014 English–German and English–French translation tasks while training in a fraction of the time.


    <!--T:9-->
In 2018, Devlin et al. released '''BERT''', a deep bidirectional encoder pre-trained with masked-language modelling, which set new state-of-the-art results on a wide range of NLP benchmarks. The same year, Radford et al. introduced '''GPT''', a decoder-only autoregressive transformer trained with a standard language-modelling objective. Subsequent scaling — GPT-2 (2019), GPT-3 (2020), PaLM, LLaMA, and contemporary frontier models — established that transformer performance follows predictable '''scaling laws''' (Kaplan et al. 2020) as parameters, data, and compute grow.


    <!--T:10-->
Transformers also expanded beyond text: '''Vision Transformers''' (Dosovitskiy et al. 2021) treat image patches as tokens, '''speech transformers''' (e.g. Whisper) operate on audio spectrograms, and multimodal models such as CLIP and Flamingo unify several modalities in a single architecture.


    <!--T:11-->
== Key Approaches ==


    <!--T:12-->
The core building block is '''scaled dot-product {{Term|attention}}'''. Given matrices of queries <math>Q \in \mathbb{R}^{n \times d_k}</math>, keys <math>K \in \mathbb{R}^{n \times d_k}</math>, and values <math>V \in \mathbb{R}^{n \times d_v}</math>:


    <!--T:13-->
:<math>\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\!\top}}{\sqrt{d_k}}\right) V</math>


    <!--T:14-->
The scaling factor <math>\sqrt{d_k}</math> prevents the dot products from growing large and pushing the [[Softmax Function|softmax]] into low-gradient regions. '''Multi-head {{Term|attention}}''' runs <math>h</math> parallel heads with independent projections and concatenates their outputs:


    <!--T:15-->
:<math>\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^O</math>


    <!--T:16-->
where <math>\mathrm{head}_i = \mathrm{Attention}(Q W_i^Q,\, K W_i^K,\, V W_i^V)</math>.
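As a concrete illustration, here is a minimal NumPy sketch of scaled dot-product and multi-head {{Term|attention}}. It uses the common equivalent formulation that splits one full-width projection into <math>h</math> heads instead of keeping separate per-head matrices <math>W_i^Q, W_i^K, W_i^V</math>; the weights are random and purely illustrative.

<syntaxhighlight lang="python">
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # softmax(Q K^T / sqrt(d_k)) V  — scaled dot-product attention
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)
    return softmax(scores) @ V

def multi_head_attention(X, W_q, W_k, W_v, W_o, h):
    # X: (n, d_model); projection matrices: (d_model, d_model); h attention heads
    n, d_model = X.shape
    d_head = d_model // h

    def split_heads(M):  # (n, d_model) -> (h, n, d_head)
        return M.reshape(n, h, d_head).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ W_q), split_heads(X @ W_k), split_heads(X @ W_v)
    heads = attention(Q, K, V)                              # (h, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)   # Concat(head_1, ..., head_h)
    return concat @ W_o

# Tiny example with random weights (illustrative only)
rng = np.random.default_rng(0)
n, d_model, h = 6, 16, 4
X = rng.normal(size=(n, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))
print(multi_head_attention(X, W_q, W_k, W_v, W_o, h).shape)  # (6, 16)
</syntaxhighlight>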


    <!--T:17-->
Three architectural variants dominate practice:


    <!--T:18-->
* '''Encoder–decoder transformers''' — the original design, used for {{Term|sequence-to-sequence}} tasks such as translation and summarisation. Encoder layers use bidirectional self-{{Term|attention}}; decoder layers use causal self-{{Term|attention}} and cross-{{Term|attention}} to encoder outputs. Examples: T5, BART, the original Vaswani model.
* '''Encoder-only transformers''' — discard the decoder and use bidirectional self-{{Term|attention}} throughout. Trained with masked-language modelling for representation learning. Examples: BERT, RoBERTa, DeBERTa.
* '''Decoder-only transformers''' — discard the encoder and use causal self-{{Term|attention}} only, trained with next-token prediction. Now the dominant design for general-purpose language models. Examples: GPT family, LLaMA, Mistral, Claude.


    <!--T:19-->
The original transformer uses '''sinusoidal positional encodings''':


    <!--T:20-->
:<math>\mathrm{PE}(pos, 2i) = \sin\!\left(\frac{pos}{10000^{2i/d}}\right), \quad \mathrm{PE}(pos, 2i+1) = \cos\!\left(\frac{pos}{10000^{2i/d}}\right)</math>
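A direct NumPy translation of these formulas (illustrative only, assuming an even model dimension <math>d</math>):

<syntaxhighlight lang="python">
import numpy as np

def sinusoidal_positional_encoding(n_positions, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i/d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))
    pos = np.arange(n_positions)[:, None]                  # (n, 1)
    i = np.arange(d_model // 2)[None, :]                   # (1, d/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)      # (n, d/2)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_positional_encoding(50, 64)
print(pe.shape)  # (50, 64) — added to the token embeddings before the first layer
</syntaxhighlight>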


    <!--T:21-->
Subsequent work introduced learned absolute positions (BERT, GPT-2), '''relative positional encodings''' (Shaw et al. 2018, T5), '''rotary position {{Term|embedding|embeddings}}''' (RoPE, Su et al. 2021), and '''ALiBi''' linear biases (Press et al. 2022), the latter two improving extrapolation to longer sequences than seen during training.


    <!--T:22-->
Self-{{Term|attention}} has <math>O(n^2 d)</math> complexity, which is expensive for long sequences. '''Efficient transformer''' variants reduce this cost: Linformer and Performer use low-rank or kernel approximations; Longformer and BigBird mix local and global {{Term|attention}} patterns; '''FlashAttention''' (Dao et al. 2022) reorders the computation to be IO-aware, achieving exact {{Term|attention}} with much higher hardware utilisation. '''Sparse {{Term|mixture of experts|mixture-of-experts}}''' ({{Term|mixture of experts|MoE}}) routing replaces dense feed-forward sublayers with sparsely activated experts (Switch Transformer, Mixtral) to scale parameters without proportionally scaling compute.
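The routing idea behind sparse {{Term|mixture of experts|MoE}} layers can be sketched as follows. This is a schematic NumPy illustration of token-level top-<math>k</math> routing, not a faithful reimplementation of Switch Transformer or Mixtral: real systems batch tokens per expert and add load-balancing losses, and the gate weights and expert networks here are random stand-ins.

<syntaxhighlight lang="python">
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def moe_ffn(X, W_gate, experts, k=2):
    # X: (n_tokens, d_model); W_gate: (d_model, n_experts); experts: list of callables.
    gate = softmax(X @ W_gate)                    # per-token routing probabilities
    out = np.zeros_like(X)
    for t, probs in enumerate(gate):
        top_k = np.argsort(probs)[-k:]            # the k experts chosen for this token
        weights = probs[top_k] / probs[top_k].sum()
        for w, idx in zip(weights, top_k):
            out[t] += w * experts[idx](X[t])      # only the chosen experts run
    return out

# Tiny example: 4 experts, each a random linear map standing in for a feed-forward network
rng = np.random.default_rng(0)
d_model, n_experts = 8, 4
expert_weights = [rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(n_experts)]
experts = [lambda x, W=W: x @ W for W in expert_weights]
X = rng.normal(size=(5, d_model))
print(moe_ffn(X, rng.normal(size=(d_model, n_experts)), experts).shape)  # (5, 8)
</syntaxhighlight>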


    <!--T:23-->
== Connections ==


    <!--T:24-->
Transformers are deeply connected to other ideas in modern machine learning. They build directly on [[Attention Mechanisms]], generalising the soft alignment used in earlier {{Term|sequence-to-sequence}} models. Their inputs are typically dense [[Word Embeddings|word embeddings]] (or learned subword {{Term|embedding|embeddings}} such as BPE and SentencePiece), and their outputs over a vocabulary are produced by applying a [[Softmax Function|softmax]] to the {{Term|logits}} of a final linear layer.


    <!--T:25-->
Training relies on the same machinery as other deep networks: [[Backpropagation|backpropagation]] through the {{Term|attention}} and feed-forward sublayers, optimisation by [[Gradient Descent|gradient descent]] variants such as {{Term|adam|Adam}} and AdamW, and [[Cross-Entropy Loss|cross-entropy loss]] for next-token prediction or masked-token recovery. {{Term|regularization|Regularisation}} techniques including [[Dropout|dropout]] (on {{Term|attention}} weights and feed-forward {{Term|activation function|activations}}) and [[Overfitting and Regularization|weight decay]] are standard.
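A minimal, hypothetical PyTorch training step illustrating this machinery: next-token prediction with [[Cross-Entropy Loss|cross-entropy loss]], an AdamW update with weight decay, [[Dropout|dropout]] inside the layers, and [[Backpropagation|backpropagation]]. The model, batch, and hyperparameters are arbitrary stand-ins, and positional encodings are omitted for brevity.

<syntaxhighlight lang="python">
import torch
import torch.nn as nn

# Tiny stand-in model: an encoder stack used with a causal mask behaves as a decoder stack.
vocab_size, d_model, n_heads, n_layers, seq_len = 1000, 128, 4, 2, 32
model = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=512,
                               dropout=0.1, batch_first=True),
    num_layers=n_layers,
)
embed = nn.Embedding(vocab_size, d_model)
to_logits = nn.Linear(d_model, vocab_size)

params = list(model.parameters()) + list(embed.parameters()) + list(to_logits.parameters())
optimizer = torch.optim.AdamW(params, lr=3e-4, weight_decay=0.01)
loss_fn = nn.CrossEntropyLoss()

tokens = torch.randint(0, vocab_size, (8, seq_len))            # random batch, stand-in for data
inputs, targets = tokens[:, :-1], tokens[:, 1:]                # next-token prediction
causal = nn.Transformer.generate_square_subsequent_mask(seq_len - 1)

hidden = model(embed(inputs), mask=causal)                     # causal self-attention
logits = to_logits(hidden)
loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))

optimizer.zero_grad()
loss.backward()                                                # backpropagation
optimizer.step()                                               # AdamW update with weight decay
</syntaxhighlight>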


    <!--T:26-->
Architecturally, transformers can be viewed as a generalisation of fully-connected layers with content-conditioned routing, or as a special case of '''graph neural networks''' on fully-connected graphs. They are usually pre-trained on large unlabelled corpora and then adapted to downstream tasks via [[Transfer Learning|transfer learning]] — {{Term|fine-tuning}}, parameter-efficient methods such as LoRA, prefix-tuning, or simply prompting in-context.


    <!--T:27-->
== See also ==


    <!--T:28-->
* [[Attention Mechanisms]]
* [[Neural Networks]]
* [[Cross-Entropy Loss]]


    <!--T:29-->
== References ==


    <!--T:30-->
* Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł. and Polosukhin, I. (2017). "{{Term|attention|Attention}} Is All You Need". ''NeurIPS''.
* Devlin, J., Chang, M.-W., Lee, K. and Toutanova, K. (2019). "BERT: {{Term|pre-training|Pre-training}} of Deep Bidirectional Transformers for Language Understanding". ''NAACL''.
* Radford, A., Narasimhan, K., Salimans, T. and Sutskever, I. (2018). "Improving Language Understanding by Generative Pre-Training". ''OpenAI technical report''.
* Brown, T. et al. (2020). "Language Models are Few-Shot Learners". ''NeurIPS''.
* Kaplan, J. et al. (2020). "Scaling Laws for Neural Language Models". ''arXiv:2001.08361''.
* Dosovitskiy, A. et al. (2021). "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale". ''ICLR''.
* Su, J., Lu, Y., Pan, S., Murtadha, A., Wen, B. and Liu, Y. (2021). "RoFormer: Enhanced Transformer with Rotary Position {{Term|embedding|Embedding}}". ''arXiv:2104.09864''.
* Dao, T., Fu, D. Y., Ermon, S., Rudra, A. and Ré, C. (2022). "FlashAttention: Fast and Memory-Efficient Exact {{Term|attention|Attention}} with IO-Awareness". ''NeurIPS''.
    </translate>
    [[Category:Machine Learning]]
    [[Category:Introductory]]
