Transformer

    From Marovi AI
    Revision as of 03:12, 28 April 2026 by DeployBot (talk | contribs) ([deploy-bot] Claude-authored article: Transformer)
    Topic area Machine Learning
    Difficulty Introductory

    Transformers are a family of neural network architectures built around the self-attention mechanism, introduced by Vaswani et al. in the 2017 paper Attention Is All You Need. They have become the dominant architecture for natural language processing and increasingly for computer vision, speech, and multimodal tasks.

    Overview

    A transformer processes a sequence of tokens by repeatedly mixing information across positions using attention rather than recurrence or convolution. Each layer applies multi-head self-attention followed by a position-wise feed-forward network, with residual connections and layer normalisation at each sublayer. Because attention is permutation-invariant, positional information is injected via positional encodings.

    Unlike recurrent neural networks, transformers compute all positions of the sequence in parallel during training, which dramatically improves throughput on modern hardware. This parallelism, combined with attention's ability to model long-range dependencies in a constant number of layers, has enabled the training of very large models on very large corpora — the foundation of contemporary large language models (LLMs).

    Key Concepts

    • Self-attention — every token in a sequence computes a weighted sum of all other tokens, where the weights are learned from content similarity rather than position.
    • Queries, keys, and values — each token is projected into three vectors: queries match against keys to produce attention weights, which are then used to combine values.
    • Multi-head attention — several attention operations run in parallel with independent projections, allowing the model to attend to different aspects of the input simultaneously.
    • Positional encoding — sinusoidal or learned vectors added to token embeddings so the model can distinguish positions in an otherwise order-agnostic operation.
    • Residual connections and layer normalisation — wrap every sublayer to stabilise gradients and enable very deep stacks.
    • Feed-forward networks — two-layer position-wise MLPs applied independently to each token, providing per-position nonlinear transformation.
    • Causal masking — in decoders, future positions are masked out of the attention so the model cannot peek ahead during autoregressive generation.
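The concepts above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: the projection matrices are random stand-ins for learned weights, and the function names (`self_attention`, `softmax`) are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv, causal=False):
    """Single-head self-attention over a sequence X of shape (n, d)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv          # queries, keys, values
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # (n, n) content-similarity matrix
    if causal:
        # mask future positions: token i may attend only to positions <= i
        scores = np.where(np.tri(len(X), dtype=bool), scores, -np.inf)
    return softmax(scores) @ V                # weighted sum of values

rng = np.random.default_rng(0)
n, d = 4, 8
X = rng.normal(size=(n, d))                   # stand-in token embeddings
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv, causal=True)
print(out.shape)  # (4, 8)
```

With the causal mask, the first token can attend only to itself, so its output is exactly its own value vector — the property that makes autoregressive generation possible.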

    History

    Attention as a soft alignment between encoder and decoder hidden states was introduced for neural machine translation by Bahdanau et al. (2015) and refined by Luong et al. (2015), but these models were still built on top of RNNs. In 2017, Vaswani et al. proposed the transformer, which removed recurrence entirely and relied solely on attention. The architecture won the WMT'14 English–German and English–French translation benchmarks while training in a fraction of the time.

    In 2018, Devlin et al. released BERT, a deep bidirectional encoder pre-trained with masked-language modelling, which set new state-of-the-art results on a wide range of NLP benchmarks. The same year, Radford et al. introduced GPT, a decoder-only autoregressive transformer trained with a standard language-modelling objective. Subsequent scaling — GPT-2 (2019), GPT-3 (2020), PaLM, LLaMA, and contemporary frontier models — established that transformer performance follows predictable scaling laws (Kaplan et al. 2020) as parameters, data, and compute grow.

    Transformers also expanded beyond text: Vision Transformers (Dosovitskiy et al. 2021) treat image patches as tokens, Speech transformers (e.g. Whisper) operate on audio spectrograms, and multimodal models such as CLIP and Flamingo unify several modalities in a single architecture.

    Key Approaches

    The core building block is scaled dot-product attention. Given matrices of queries $ Q \in \mathbb{R}^{n \times d_k} $, keys $ K \in \mathbb{R}^{n \times d_k} $, and values $ V \in \mathbb{R}^{n \times d_v} $:

    $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\!\top}}{\sqrt{d_k}}\right) V $

    The scaling factor $ \sqrt{d_k} $ prevents the dot products from growing large and pushing the softmax into low-gradient regions. Multi-head attention runs $ h $ parallel heads with independent projections and concatenates their outputs:

    $ \mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^O $

    where $ \mathrm{head}_i = \mathrm{Attention}(Q W_i^Q,\, K W_i^K,\, V W_i^V) $.
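The multi-head formula above can be sketched directly in NumPy. This is a hedged illustration under simplifying assumptions: the per-head projection lists `W_q`, `W_k`, `W_v` and the output projection `W_o` are random stand-ins for learned parameters, and each head uses dimension $ d_k = d_{\text{model}} / h $ so the concatenation restores the model width.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_model, h = 6, 16, 4
d_k = d_model // h  # per-head dimension; h heads concatenate back to d_model

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    # scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def multi_head(X, W_q, W_k, W_v, W_o):
    # each head i applies attention with its own projections W_q[i], W_k[i], W_v[i]
    heads = [attention(X @ W_q[i], X @ W_k[i], X @ W_v[i]) for i in range(h)]
    # Concat(head_1, ..., head_h) W^O
    return np.concatenate(heads, axis=-1) @ W_o

X = rng.normal(size=(n, d_model))
W_q = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_k = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_v = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_o = rng.normal(size=(h * d_k, d_model))
Y = multi_head(X, W_q, W_k, W_v, W_o)
print(Y.shape)  # (6, 16)
```

Because each head has its own projections, different heads can specialise — one tracking syntax, another coreference, for example — before the output projection $ W^O $ mixes them back together.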

    Three architectural variants dominate practice:

    • Encoder–decoder transformers — the original design, used for sequence-to-sequence tasks such as translation and summarisation. Encoder layers use bidirectional self-attention; decoder layers use causal self-attention and cross-attention to encoder outputs. Examples: T5, BART, the original Vaswani model.
    • Encoder-only transformers — discard the decoder and use bidirectional self-attention throughout. Trained with masked-language modelling for representation learning. Examples: BERT, RoBERTa, DeBERTa.
    • Decoder-only transformers — discard the encoder and use causal self-attention only, trained with next-token prediction. Now the dominant design for general-purpose language models. Examples: GPT family, LLaMA, Mistral, Claude.

    The original transformer uses sinusoidal positional encodings:

    $ \mathrm{PE}(pos, 2i) = \sin\!\left(\frac{pos}{10000^{2i/d}}\right), \quad \mathrm{PE}(pos, 2i+1) = \cos\!\left(\frac{pos}{10000^{2i/d}}\right) $
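The sinusoidal encoding above is straightforward to compute. A minimal NumPy sketch (the function name `sinusoidal_pe` is hypothetical; `d` is assumed even):

```python
import numpy as np

def sinusoidal_pe(max_len, d):
    """PE(pos, 2i) = sin(pos / 10000^(2i/d)); PE(pos, 2i+1) uses cos."""
    pos = np.arange(max_len)[:, None]            # (max_len, 1) positions
    i = np.arange(d // 2)[None, :]               # (1, d/2) dimension pairs
    angles = pos / np.power(10000.0, 2 * i / d)  # geometric frequency ladder
    pe = np.empty((max_len, d))
    pe[:, 0::2] = np.sin(angles)                 # even dimensions
    pe[:, 1::2] = np.cos(angles)                 # odd dimensions
    return pe

pe = sinusoidal_pe(50, 32)
print(pe.shape)  # (50, 32)
```

Each dimension pair oscillates at a different wavelength, from $ 2\pi $ up to $ 10000 \cdot 2\pi $, so every position receives a distinct pattern and relative offsets correspond to fixed linear transformations of the encoding.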

    Subsequent work introduced learned absolute positions (BERT, GPT-2), relative positional encodings (Shaw et al. 2018, T5), rotary position embeddings (RoPE, Su et al. 2021), and ALiBi linear biases (Press et al. 2022), the latter two improving extrapolation to longer sequences than seen during training.

    Self-attention has $ O(n^2 d) $ complexity, which is expensive for long sequences. Efficient transformer variants reduce this cost: Linformer and Performer use low-rank or kernel approximations; Longformer and BigBird mix local and global attention patterns; FlashAttention (Dao et al. 2022) reorders the computation to be IO-aware, achieving exact attention with much higher hardware utilisation. Sparse mixture-of-experts (MoE) routing replaces dense feed-forward sublayers with sparsely-activated experts (Switch Transformer, Mixtral) to scale parameters without proportionally scaling compute.

    Connections

    Transformers are deeply connected to other ideas in modern machine learning. They build directly on Attention Mechanisms, generalising the soft alignment used in earlier sequence-to-sequence models. Their inputs are typically dense word embeddings (over learned subword vocabularies such as BPE and SentencePiece), and their outputs over a vocabulary are produced by a softmax over a final linear projection.

    Training relies on the same machinery as other deep networks: backpropagation through the attention and feed-forward sublayers, optimisation by gradient descent variants such as Adam and AdamW, and cross-entropy loss for next-token prediction or masked-token recovery. Regularisation techniques including dropout (on attention weights and feed-forward activations) and weight decay are standard.
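The cross-entropy objective mentioned above is simple to state concretely. A minimal NumPy sketch of the next-token loss (the function name `next_token_loss` is hypothetical; a real training loop would compute this over batches and backpropagate through it):

```python
import numpy as np

def next_token_loss(logits, targets):
    """Mean cross-entropy between model logits (n, vocab) and target ids (n,)."""
    logits = logits - logits.max(axis=-1, keepdims=True)  # stabilise log-softmax
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    # pick out the log-probability assigned to each correct next token
    return -log_probs[np.arange(len(targets)), targets].mean()

rng = np.random.default_rng(0)
logits = rng.normal(size=(5, 10))        # 5 positions, vocabulary of 10
targets = rng.integers(0, 10, size=5)    # the true next-token ids
print(next_token_loss(logits, targets))
```

A sanity check on the definition: with uniform logits over a vocabulary of size $ V $, the loss is exactly $ \ln V $, the entropy of random guessing.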

    Architecturally, transformers can be viewed as a generalisation of fully-connected layers with content-conditioned routing, or as a special case of graph neural networks on fully-connected graphs. They are usually pre-trained on large unlabelled corpora and then adapted to downstream tasks via transfer learning (fine-tuning), parameter-efficient methods such as LoRA, prefix-tuning, or simply prompting in-context.

    References

    • Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł. and Polosukhin, I. (2017). "Attention Is All You Need". NeurIPS.
    • Devlin, J., Chang, M.-W., Lee, K. and Toutanova, K. (2019). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". NAACL.
    • Radford, A., Narasimhan, K., Salimans, T. and Sutskever, I. (2018). "Improving Language Understanding by Generative Pre-Training". OpenAI technical report.
    • Brown, T. et al. (2020). "Language Models are Few-Shot Learners". NeurIPS.
    • Kaplan, J. et al. (2020). "Scaling Laws for Neural Language Models". arXiv:2001.08361.
    • Dosovitskiy, A. et al. (2021). "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale". ICLR.
    • Su, J., Lu, Y., Pan, S., Murtadha, A., Wen, B. and Liu, Y. (2021). "RoFormer: Enhanced Transformer with Rotary Position Embedding". arXiv:2104.09864.
    • Dao, T., Fu, D. Y., Ermon, S., Rudra, A. and Ré, C. (2022). "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness". NeurIPS.