Transformer - Revision history

DeployBot: [deploy-bot] Deploy from CI (e406999)

2026-04-27T23:37:09Z

[deploy-bot] Deploy from CI (e406999)

DeployBot: Marked this version for translation

2026-04-27T23:32:56Z

Marked this version for translation


DeployBot: [deploy-bot] Claude-authored article: Transformer

2026-04-27T23:32:56Z

[deploy-bot] Claude-authored article: Transformer


DeployBot: Marked this version for translation

2026-04-27T21:57:59Z

Marked this version for translation


DeployBot: [deploy-bot] Claude-authored article: Transformer

2026-04-27T21:57:58Z

[deploy-bot] Claude-authored article: Transformer

New page

<languages />
{{ArticleInfobox
| topic_area = Machine Learning
| difficulty = Introductory
}}
{{ContentMeta
| generated_by = claude-code-direct
| model_used = claude-opus-4-7
| generated_date = 2026-04-27
}}

<translate>

'''Transformers''' are a family of neural network architectures built around the [[Attention Mechanisms|self-attention]] mechanism, introduced by Vaswani et al. in the 2017 paper ''{{Term|attention|Attention}} Is All You Need''. They have become the dominant architecture for natural language processing and increasingly for computer vision, speech, and multimodal tasks.


== Overview ==


A transformer processes a sequence of tokens by repeatedly mixing information across positions using {{Term|attention}} rather than recurrence or {{Term|convolution}}. Each layer applies '''multi-head self-{{Term|attention}}''' followed by a position-wise feed-forward network, with residual connections and layer normalisation at each sublayer. Because self-{{Term|attention}} on its own is permutation-equivariant (reordering the input tokens merely reorders the outputs), positional information is injected via '''positional encodings'''.
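
The order-agnostic behaviour of attention can be checked numerically. The following is a minimal NumPy sketch rather than a reference implementation: the helper <code>self_attention</code>, the random weight matrices, and the sizes are illustrative assumptions, and it uses a single head with no positional encoding. Shuffling the input rows shuffles the output rows in the same way, which is why position has to be supplied separately.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention with no positional information (illustrative)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V

rng = np.random.default_rng(0)
n, d = 5, 8                      # hypothetical sequence length and width
X = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

perm = rng.permutation(n)
out = self_attention(X, Wq, Wk, Wv)
out_shuffled = self_attention(X[perm], Wq, Wk, Wv)

# Permuting the tokens permutes the outputs identically: no order information.
print(np.allclose(out_shuffled, out[perm]))
```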


Unlike [[Recurrent Neural Networks|recurrent neural networks]], transformers compute all positions of the sequence in parallel during training, which dramatically improves throughput on modern hardware. This parallelism, combined with {{Term|attention}}'s ability to model long-range dependencies in a constant number of layers, has enabled the training of very large models on very large corpora — the foundation of contemporary '''large language models''' (LLMs).


== Key Concepts ==


* '''Self-{{Term|attention}}''' — every token in a sequence computes a weighted sum over the value vectors of all tokens (itself included), where the weights are computed from content similarity rather than position.
* '''Queries, keys, and values''' — each token is projected into three vectors: queries match against keys to produce {{Term|attention}} weights, which are then used to combine values.
* '''Multi-head {{Term|attention}}''' — several {{Term|attention}} operations run in parallel with independent projections, allowing the model to attend to different aspects of the input simultaneously.
* '''Positional encoding''' — sinusoidal or learned vectors added to token {{Term|embedding|embeddings}} so the model can distinguish positions in an otherwise order-agnostic operation.
* '''Residual connections and layer normalisation''' — wrap every sublayer to stabilise gradients and enable very deep stacks.
* '''Feed-forward networks''' — two-layer position-wise MLPs applied independently to each token, providing per-position nonlinear transformation.
* '''Causal masking''' — in decoders, future positions are masked out of the {{Term|attention}} so the model cannot peek ahead during autoregressive generation.
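
As an illustration of the causal masking item above, the sketch below masks future positions before the softmax. It is a minimal NumPy example under assumed inputs: the function name <code>causal_attention_weights</code> and the all-zero score matrix are hypothetical choices made so that the resulting weights are easy to read.

```python
import numpy as np

def causal_attention_weights(scores):
    """Row-wise softmax over scores with future positions masked out."""
    n = scores.shape[0]
    future = np.triu(np.ones((n, n), dtype=bool), k=1)  # True above the diagonal
    masked = np.where(future, -np.inf, scores)          # future scores -> -inf
    masked = masked - masked.max(axis=-1, keepdims=True)
    weights = np.exp(masked)                            # exp(-inf) == 0
    return weights / weights.sum(axis=-1, keepdims=True)

scores = np.zeros((4, 4))        # hypothetical uniform raw scores
weights = causal_attention_weights(scores)
print(np.round(weights, 2))
```

With uniform scores, position ''t'' spreads its weight evenly over positions up to and including ''t'' and assigns zero weight to everything after it.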


== History ==


{{Term|attention|Attention}} as a soft alignment between encoder and decoder hidden states was introduced for neural machine translation by Bahdanau et al. (2015) and refined by Luong et al. (2015), but these models were still built on top of [[Recurrent Neural Networks|RNNs]]. In 2017, Vaswani et al. proposed the transformer, which removed recurrence entirely and relied solely on {{Term|attention}}. The architecture reported state-of-the-art BLEU scores on the WMT 2014 English–German and English–French translation tasks at a fraction of the training cost of earlier models.


In 2018, Devlin et al. released '''BERT''', a deep bidirectional encoder pre-trained with masked-language modelling, which set new state-of-the-art results on a wide range of NLP benchmarks. The same year, Radford et al. introduced '''GPT''', a decoder-only autoregressive transformer trained with a standard language-modelling objective. Subsequent scaling — GPT-2 (2019), GPT-3 (2020), PaLM, LLaMA, and contemporary frontier models — established that transformer performance follows predictable '''scaling laws''' (Kaplan et al. 2020) as parameters, data, and compute grow.


Transformers also expanded beyond text: '''Vision Transformers''' (Dosovitskiy et al. 2021) treat image patches as tokens, '''Speech transformers''' (e.g. Whisper) operate on audio spectrograms, and multimodal models such as CLIP and Flamingo unify several modalities in a single architecture.


== Key Approaches ==


The core building block is '''scaled dot-product {{Term|attention}}'''. Given matrices of queries <math>Q \in \mathbb{R}^{n \times d_k}</math>, keys <math>K \in \mathbb{R}^{n \times d_k}</math>, and values <math>V \in \mathbb{R}^{n \times d_v}</math>:


:<math>\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\!\top}}{\sqrt{d_k}}\right) V</math>
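
The formula above translates almost line for line into array code. The following NumPy sketch is one possible rendering, not a production implementation: the function names, the random inputs, and the sizes are assumptions chosen for illustration, and masking and batching are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for Q:(n,d_k), K:(n,d_k), V:(n,d_v)."""
    d_k = K.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))  # (n, n) attention weights
    return weights @ V, weights

rng = np.random.default_rng(0)
n, d_k, d_v = 6, 64, 32           # hypothetical sizes
Q = rng.normal(size=(n, d_k))
K = rng.normal(size=(n, d_k))
V = rng.normal(size=(n, d_v))

out, weights = scaled_dot_product_attention(Q, K, V)
print(out.shape)                  # (6, 32)
print(weights.sum(axis=-1))       # each row of weights sums to 1
```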


Dividing by <math>\sqrt{d_k}</math> keeps the magnitude of the dot products from growing with <math>d_k</math>, which would otherwise push the [[Softmax Function|softmax]] into low-gradient regions. '''Multi-head {{Term|attention}}''' runs <math>h</math> parallel heads with independent projections and concatenates their outputs:


:<math>\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^O</math>


where <math>\mathrm{head}_i = \mathrm{Attention}(Q W_i^Q,\, K W_i^K,\, V W_i^V)</math>.
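
A compact way to see how the heads fit together is to stack the per-head projections into one array. The sketch below is a NumPy illustration for the self-attention case (<math>Q = K = V = X</math>); the stacked weight layout, the head width <code>d_head = d_model // h</code>, and all names are assumptions made for the example rather than anything prescribed by the formulas above.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo):
    """Multi-head self-attention.

    X: (n, d_model); Wq, Wk, Wv: (h, d_model, d_head); Wo: (h * d_head, d_model).
    """
    Q = np.einsum("nd,hde->hne", X, Wq)      # per-head queries, (h, n, d_head)
    K = np.einsum("nd,hde->hne", X, Wk)
    V = np.einsum("nd,hde->hne", X, Wv)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(Q.shape[-1])   # (h, n, n)
    heads = softmax(scores) @ V              # (h, n, d_head)
    n = X.shape[0]
    concat = heads.transpose(1, 0, 2).reshape(n, -1)   # Concat(head_1..head_h)
    return concat @ Wo                       # output projection W^O

rng = np.random.default_rng(0)
n, d_model, h = 5, 16, 4          # hypothetical sizes
d_head = d_model // h
X = rng.normal(size=(n, d_model))
Wq, Wk, Wv = (rng.normal(size=(h, d_model, d_head)) for _ in range(3))
Wo = rng.normal(size=(h * d_head, d_model))

out = multi_head_attention(X, Wq, Wk, Wv, Wo)
print(out.shape)                  # (5, 16)
```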


Three architectural variants dominate practice:


* '''Encoder–decoder transformers''' — the original design, used for {{Term|sequence-to-sequence}} tasks such as translation and summarisation. Encoder layers use bidirectional self-{{Term|attention}}; decoder layers use causal self-{{Term|attention}} and cross-{{Term|attention}} to encoder outputs. Examples: T5, BART, the original Vaswani model.
* '''Encoder-only transformers''' — discard the decoder and use bidirectional self-{{Term|attention}} throughout. Trained with masked-language modelling for representation learning. Examples: BERT, RoBERTa, DeBERTa.
* '''Decoder-only transformers''' — discard the encoder and use causal self-{{Term|attention}} only, trained with next-token prediction. Now the dominant design for general-purpose language models. Examples: GPT family, LLaMA, Mistral, Claude.


The original transformer uses '''sinusoidal positional encodings''':


:<math>\mathrm{PE}(pos, 2i) = \sin\!\left(\frac{pos}{10000^{2i/d}}\right), \quad \mathrm{PE}(pos, 2i+1) = \cos\!\left(\frac{pos}{10000^{2i/d}}\right)</math>
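
The two formulas can be evaluated for all positions at once. The following NumPy sketch assumes an even model width <code>d</code>; the function name and the example sizes are illustrative choices.

```python
import numpy as np

def sinusoidal_positional_encoding(n_positions, d):
    """Return an (n_positions, d) matrix of encodings; d is assumed even."""
    pos = np.arange(n_positions)[:, None]          # (n_positions, 1)
    i = np.arange(d // 2)[None, :]                 # (1, d / 2)
    angles = pos / np.power(10000.0, 2 * i / d)    # pos / 10000^(2i/d)
    pe = np.zeros((n_positions, d))
    pe[:, 0::2] = np.sin(angles)   # even dimensions 2i use sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions 2i+1 use cosine
    return pe

pe = sinusoidal_positional_encoding(50, 16)
print(pe.shape)        # (50, 16)
print(pe[0, :4])       # position 0 alternates sin(0) = 0 and cos(0) = 1
```

The resulting matrix is added to the token embeddings row by row, so each position receives a distinct, fixed pattern.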


Subsequent work introduced learned absolute positions (BERT, GPT-2), '''relative positional encodings''' (Shaw et al. 2018, T5), '''rotary position {{Term|embedding|embeddings}}''' (RoPE, Su et al. 2021), and '''ALiBi''' linear biases (Press et al. 2022). The latter two encode relative offsets directly in the {{Term|attention}} computation, and ALiBi in particular was designed to extrapolate to sequences longer than those seen during training.


Self-{{Term|attention}} has <math>O(n^2 d)</math> time complexity for a sequence of <math>n</math> tokens with representation width <math>d</math>, and a naively materialised <math>n \times n</math> score matrix makes memory grow quadratically as well, which is expensive for long sequences. '''Efficient transformer''' variants reduce this cost: Linformer and Performer use low-rank or kernel approximations; Longformer and BigBird mix local and global {{Term|attention}} patterns; '''FlashAttention''' (Dao et al. 2022) reorders the computation to be IO-aware, achieving exact {{Term|attention}} with much higher hardware utilisation. '''Sparse {{Term|mixture of experts|mixture-of-experts}}''' ({{Term|mixture of experts|MoE}}) routing replaces dense feed-forward sublayers with sparsely-activated experts (Switch Transformer, Mixtral) to scale parameters without proportionally scaling compute.
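
To make the quadratic growth concrete, the following back-of-the-envelope Python sketch estimates the cost of dense self-attention for one head in one layer. It is a rough model under stated assumptions: only the two large matrix products are counted, a multiply-add is counted as two floating-point operations, the score matrix is assumed to be fully materialised at 2 bytes per element, and the function name and example sizes are hypothetical.

```python
def attention_cost(n, d, bytes_per_element=2):
    """Rough per-head, per-layer cost of dense self-attention.

    Counts only Q K^T and weights @ V; the byte figure assumes the full
    n x n score matrix is stored, which IO-aware kernels avoid.
    """
    flops = 2 * (2 * n * n * d)              # two matmuls, 2*n*n*d FLOPs each
    score_bytes = n * n * bytes_per_element  # one n x n matrix of scores
    return flops, score_bytes

for n in (1_024, 8_192, 65_536):
    flops, score_bytes = attention_cost(n, d=64)
    print(f"n={n:>6}: {flops:.2e} FLOPs, score matrix {score_bytes / 2**20:,.0f} MiB")
```

Each eightfold increase in sequence length multiplies both figures by 64, which is the behaviour the efficient variants listed above try to mitigate.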


== Connections ==


Transformers are deeply connected to other ideas in modern machine learning. They build directly on [[Attention Mechanisms]], generalising the soft alignment used in earlier {{Term|sequence-to-sequence}} models. Their inputs are typically dense [[Word Embeddings|word embeddings]] or, more commonly, learned {{Term|embedding|embeddings}} of subword tokens produced by tokenisers such as BPE or SentencePiece, and their outputs over a vocabulary are produced by a [[Softmax Function|softmax]] over linear {{Term|logits}}.


Training relies on the same machinery as other deep networks: [[Backpropagation|backpropagation]] through the {{Term|attention}} and feed-forward sublayers, optimisation by [[Gradient Descent|gradient descent]] variants such as {{Term|adam}} and AdamW, and [[Cross-Entropy Loss|cross-entropy loss]] for next-token prediction or masked-token recovery. {{Term|regularization|Regularisation}} techniques including [[Dropout|dropout]] (on {{Term|attention}} weights and feed-forward {{Term|activation function|activations}}) and [[Overfitting and Regularization|weight decay]] are standard.
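
For next-token prediction, the cross-entropy loss compares the model's output at position ''t'' with the token at position ''t'' + 1. The NumPy sketch below illustrates that shift; the function name, the toy vocabulary size, and the all-zero logits are assumptions chosen so that the expected loss is easy to verify by hand.

```python
import numpy as np

def next_token_cross_entropy(logits, token_ids):
    """Mean cross-entropy where position t predicts token t + 1.

    logits: (n, vocab) model outputs; token_ids: (n,) input token ids.
    """
    pred = logits[:-1]                    # predictions from positions 0..n-2
    target = token_ids[1:]                # the tokens that actually follow
    pred = pred - pred.max(axis=-1, keepdims=True)        # stability
    log_probs = pred - np.log(np.exp(pred).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(target)), target].mean()

vocab = 10                                # hypothetical vocabulary size
token_ids = np.array([3, 1, 4, 1, 5])
uniform_logits = np.zeros((len(token_ids), vocab))
loss = next_token_cross_entropy(uniform_logits, token_ids)
print(loss)                               # log(10), about 2.3026
```

A model that assigns equal probability to every token scores <math>\log |V|</math>, which gives a useful baseline when reading training curves.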


Architecturally, transformers can be viewed as a generalisation of fully-connected layers with content-conditioned routing, or as a special case of '''graph neural networks''' on fully-connected graphs. They are usually pre-trained on large unlabelled corpora and then adapted to downstream tasks via [[Transfer Learning|transfer learning]] — {{Term|fine-tuning}}, parameter-efficient methods such as LoRA, prefix-tuning, or simply prompting in-context.


== See also ==


* [[Attention Mechanisms]]
* [[Neural Networks]]
* [[Recurrent Neural Networks]]
* [[Word Embeddings]]
* [[Softmax Function]]
* [[Transfer Learning]]
* [[Backpropagation]]
* [[Cross-Entropy Loss]]


== References ==


* Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł. and Polosukhin, I. (2017). "{{Term|attention|Attention}} Is All You Need". ''NeurIPS''.
* Devlin, J., Chang, M.-W., Lee, K. and Toutanova, K. (2019). "BERT: {{Term|pre-training|Pre-training}} of Deep Bidirectional Transformers for Language Understanding". ''NAACL''.
* Radford, A., Narasimhan, K., Salimans, T. and Sutskever, I. (2018). "Improving Language Understanding by Generative Pre-Training". OpenAI technical report.
* Brown, T. et al. (2020). "Language Models are Few-Shot Learners". ''NeurIPS''.
* Kaplan, J. et al. (2020). "Scaling Laws for Neural Language Models". ''arXiv:2001.08361''.
* Dosovitskiy, A. et al. (2021). "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale". ''ICLR''.
* Su, J., Lu, Y., Pan, S., Murtadha, A., Wen, B. and Liu, Y. (2021). "RoFormer: Enhanced Transformer with Rotary Position {{Term|embedding|Embedding}}". ''arXiv:2104.09864''.
* Dao, T., Fu, D. Y., Ermon, S., Rudra, A. and Ré, C. (2022). "FlashAttention: Fast and Memory-Efficient Exact {{Term|attention|Attention}} with IO-Awareness". ''NeurIPS''.
</translate>

[[Category:Machine Learning]]
[[Category:Introductory]]