Attention Mechanisms/en - Revision history

FuzzyBot: Updating to match new version of source page

2026-04-28T00:00:07Z


FuzzyBot: Updating to match new version of source page

2026-04-27T21:57:48Z


FuzzyBot: Updating to match new version of source page

2026-04-27T19:42:37Z


FuzzyBot: Updating to match new version of source page

2026-04-27T02:38:32Z


← Older revision		Revision as of 02:38, 27 April 2026
Line 1:		Line 1:
	<languages />		<languages />
	~~{{LanguageBar \| page = Attention Mechanisms}}~~
	{{ArticleInfobox \| topic_area = Deep Learning \| difficulty = Advanced \| prerequisites = [[Neural Networks]], [[Recurrent Neural Networks]]}}		{{ArticleInfobox \| topic_area = Deep Learning \| difficulty = Advanced \| prerequisites = [[Neural Networks]], [[Recurrent Neural Networks]]}}
	{{ContentMeta \| generated_by = claude-opus \| model_used = claude-opus-4-6 \| generated_date = 2026-03-13}}		{{ContentMeta \| generated_by = claude-opus \| model_used = claude-opus-4-6 \| generated_date = 2026-03-13}}

FuzzyBot: Updating to match new version of source page

2026-04-27T00:30:23Z


New page

<languages />
{{LanguageBar | page = Attention Mechanisms}}
{{ArticleInfobox | topic_area = Deep Learning | difficulty = Advanced | prerequisites = [[Neural Networks]], [[Recurrent Neural Networks]]}}
{{ContentMeta | generated_by = claude-opus | model_used = claude-opus-4-6 | generated_date = 2026-03-13}}

'''Attention mechanisms''' are a family of techniques that allow neural networks to focus selectively on relevant parts of their input when producing each element of the output. Originally introduced to overcome the limitations of fixed-length context vectors in sequence-to-sequence models, attention has become the foundational building block of modern architectures such as the [[Transformer]].

== Motivation ==

Early sequence-to-sequence models encoded an entire input sequence into a single fixed-dimensional vector using a [[Recurrent Neural Networks|recurrent neural network]]. This ''bottleneck'' forced long-range dependencies to be compressed into a vector of constant size, degrading performance on long sequences. Attention resolves this by letting the decoder consult every encoder hidden state at each generation step, weighting them by learned relevance scores.

== Bahdanau (Additive) Attention ==

Bahdanau et al. (2015) proposed the first widely adopted attention mechanism for machine translation. Given encoder hidden states <math>h_1, \dots, h_T</math> and the decoder state <math>s_{t-1}</math>, the alignment score is computed as:

:<math>e_{t,i} = v^{\!\top} \tanh(W_s\, s_{t-1} + W_h\, h_i)</math>

where <math>W_s</math>, <math>W_h</math>, and <math>v</math> are learned parameters. The attention weights are obtained by applying softmax:

:<math>\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{j=1}^{T} \exp(e_{t,j})}</math>

The context vector is the weighted sum <math>c_t = \sum_{i=1}^{T} \alpha_{t,i}\, h_i</math>, which is fed into the decoder, together with <math>s_{t-1}</math> and the previously generated token, to compute the next state <math>s_t</math>.
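
The three steps above (score, softmax, weighted sum) can be sketched in a few lines. This is a minimal illustration using plain Python lists rather than a tensor library; the function names and the toy values for <math>W_s</math>, <math>W_h</math>, <math>v</math>, and the hidden states are assumptions made for the example, not values from the original paper.

```python
import math

def softmax(scores):
    # Subtracting the maximum improves numerical stability without changing the result.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(W, x):
    # Matrix-vector product for a matrix stored as a list of rows.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def additive_attention(s_prev, H, W_s, W_h, v):
    """Return (alpha, c): attention weights over H and the context vector."""
    proj_s = matvec(W_s, s_prev)
    scores = []
    for h in H:
        proj_h = matvec(W_h, h)
        hidden = [math.tanh(a + b) for a, b in zip(proj_s, proj_h)]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))  # e_{t,i}
    alpha = softmax(scores)
    # c_t = sum_i alpha_{t,i} * h_i
    c = [sum(a * h[k] for a, h in zip(alpha, H)) for k in range(len(H[0]))]
    return alpha, c

# Hypothetical toy inputs: T = 3 encoder states of size 2.
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
s_prev = [0.5, -0.5]
W_s = [[1.0, 0.0], [0.0, 1.0]]
W_h = [[1.0, 0.0], [0.0, 1.0]]
v = [1.0, 1.0]
alpha, c = additive_attention(s_prev, H, W_s, W_h, v)
```

With these toy values the third encoder state receives the largest weight, because its projection aligns best with the decoder state after the <code>tanh</code> nonlinearity.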

== Luong (Multiplicative) Attention ==

Luong et al. (2015) simplified the scoring function by replacing the additive network with a dot product or a bilinear form:

{| class="wikitable"
|-
! Variant !! Score function
|-
| Dot || <math>e_{t,i} = s_t^{\!\top} h_i</math>
|-
| General || <math>e_{t,i} = s_t^{\!\top} W_a\, h_i</math>
|-
| Concat || <math>e_{t,i} = v^{\!\top} \tanh(W_a [s_t;\, h_i])</math>
|}

The dot variant requires encoder and decoder dimensions to match, while the general variant introduces a learnable weight matrix <math>W_a</math>.
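
The three score functions in the table can be compared side by side. The sketch below uses plain Python lists; the function names and the toy matrices are assumptions chosen for illustration. The general variant is given an identity <math>W_a</math> so that it reduces to the dot variant.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(W, x):
    return [dot(row, x) for row in W]

def score_dot(s_t, h_i):
    # Requires len(s_t) == len(h_i).
    return dot(s_t, h_i)

def score_general(s_t, h_i, W_a):
    # Bilinear form s_t^T W_a h_i; W_a may map between different dimensions.
    return dot(s_t, matvec(W_a, h_i))

def score_concat(s_t, h_i, W_a, v):
    # s_t + h_i is list concatenation, i.e. the stacked vector [s_t; h_i].
    return dot(v, [math.tanh(z) for z in matvec(W_a, s_t + h_i)])

# Hypothetical toy inputs.
s_t = [1.0, 2.0]
h_i = [0.5, -1.0]
W_general = [[1.0, 0.0], [0.0, 1.0]]
W_concat = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
v = [1.0, 1.0]

e_dot = score_dot(s_t, h_i)
e_general = score_general(s_t, h_i, W_general)
e_concat = score_concat(s_t, h_i, W_concat, v)
```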

== Scaled Dot-Product Attention ==

Vaswani et al. (2017) introduced the formulation used in the Transformer. Given matrices of queries <math>Q</math>, keys <math>K</math>, and values <math>V</math>:

:<math>\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\!\top}}{\sqrt{d_k}}\right) V</math>

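The scaling factor <math>\sqrt{d_k}</math> keeps the dot products from growing in magnitude as the key dimension <math>d_k</math> increases. If the components of a query and a key are independent with zero mean and unit variance, their dot product has variance <math>d_k</math>; dividing by <math>\sqrt{d_k}</math> restores unit variance. Without the scaling, large scores would push the softmax into saturated regions where gradients are extremely small.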
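
A direct transcription of the formula, written with plain Python lists so that it has no dependencies, might look as follows. The function name and the toy matrices are assumptions for illustration; a practical implementation would use batched tensor operations instead of loops.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Q: m x d_k, K: n x d_k, V: n x d_v (lists of rows). Returns (output, weights)."""
    d_k = len(K[0])
    output, weights = [], []
    for q in Q:
        # One row of Q K^T / sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of the value rows.
        output.append([sum(wj * v[c] for wj, v in zip(w, V)) for c in range(len(V[0]))])
    return output, weights

# Hypothetical toy inputs: 2 queries, 3 key/value pairs.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out, weights = scaled_dot_product_attention(Q, K, V)
```

Each row of <code>weights</code> sums to one, and the output has one row per query regardless of how many key/value pairs there are.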

== Self-Attention ==

In '''self-attention''', the queries, keys, and values are all derived from the same sequence. Each position attends to every position in the sequence, including itself, so the model can capture long-range dependencies in a single layer. For an input matrix <math>X \in \mathbb{R}^{n \times d}</math>:

:<math>Q = X W^Q, \quad K = X W^K, \quad V = X W^V</math>

Self-attention has <math>O(n^2 d)</math> time complexity per layer and materialises an <math>n \times n</math> matrix of attention weights, which becomes expensive for very long sequences. Efficient variants such as sparse attention and linear attention reduce this cost.
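
The projections and the quadratic score matrix can be made concrete with a small sketch. It uses plain Python lists, and the helper names and toy weights (identity projections) are assumptions for illustration only.

```python
import math

def matmul(A, B):
    # Product of two matrices stored as lists of rows.
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X, W_q, W_k, W_v):
    """Queries, keys, and values are all projections of the same input X (n x d)."""
    Q, K, V = matmul(X, W_q), matmul(X, W_k), matmul(X, W_v)
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    # The score matrix is n x n, which is the source of the quadratic cost.
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(Q, K_T)]
    A = [softmax(row) for row in scores]
    return matmul(A, V), A

# Hypothetical toy input: n = 3 positions, d = 2 features, identity projections.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
I = [[1.0, 0.0], [0.0, 1.0]]
out, A = self_attention(X, I, I, I)
```

Here <code>A</code> has shape <math>n \times n</math> (3 by 3), while <code>out</code> keeps the input shape <math>n \times d</math>.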

== Multi-Head Attention ==

Rather than performing a single attention function, '''multi-head attention''' runs <math>h</math> parallel attention heads with independent projections:

:<math>\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^O</math>

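where <math>\mathrm{head}_i = \mathrm{Attention}(Q W_i^Q,\, K W_i^K,\, V W_i^V)</math>. Each head can learn to attend to different aspects of the input; for example, one head might capture syntactic relationships while another captures semantic ones. Typical configurations use 8 or 16 heads, with each head operating in a reduced dimension (commonly <math>d/h</math> for model dimension <math>d</math>) so that the total cost stays close to that of a single full-width head.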
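
The following sketch follows the two formulas above literally: project, attend per head, concatenate, and apply <math>W^O</math>. It uses plain Python lists, and all names and toy weights are assumptions for illustration. Two heads are used, each selecting half of a 4-dimensional input, with an identity output projection so that the concatenation is easy to inspect.

```python
import math

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    A = [softmax([s / math.sqrt(d_k) for s in row]) for row in matmul(Q, K_T)]
    return matmul(A, V)

def multi_head(Q, K, V, heads, W_o):
    """heads is a list of (W_q, W_k, W_v) triples, one triple per head."""
    outputs = [attention(matmul(Q, W_q), matmul(K, W_k), matmul(V, W_v))
               for W_q, W_k, W_v in heads]
    # Concatenate the head outputs along the feature axis, row by row.
    concat = [sum((head[i] for head in outputs), []) for i in range(len(Q))]
    return matmul(concat, W_o)

# Hypothetical toy setup: d = 4, h = 2, so each head has dimension 2.
X = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
W1 = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]  # selects features 0-1
W2 = [[0.0, 0.0], [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]  # selects features 2-3
W_o = [[1.0 if r == c else 0.0 for c in range(4)] for r in range(4)]
out = multi_head(X, X, X, [(W1, W1, W1), (W2, W2, W2)], W_o)
```

Because <code>W_o</code> is the identity in this toy setup, the first two output columns are exactly the output of the first head and the last two are the output of the second.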

== Positional Encoding ==

Because self-attention is permutation-equivariant (reordering the input positions simply reorders the outputs in the same way, so the layer itself carries no notion of token order), positional information must be injected explicitly. The original Transformer uses sinusoidal encodings:

:<math>\mathrm{PE}(pos, 2i) = \sin\!\left(\frac{pos}{10000^{2i/d}}\right), \quad \mathrm{PE}(pos, 2i+1) = \cos\!\left(\frac{pos}{10000^{2i/d}}\right)</math>

Learned absolute positional embeddings are a common alternative, but they are defined only for positions seen during training. Relative positional schemes (e.g., RoPE, ALiBi) can generalise better to unseen sequence lengths.
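
The sinusoidal formula can be evaluated directly. The sketch below is a plain-Python illustration; the function name is an assumption, and it assumes an even embedding dimension <math>d</math> so that every sine has a matching cosine.

```python
import math

def sinusoidal_encoding(n_positions, d):
    """Return an n_positions x d table of sinusoidal encodings (d assumed even)."""
    pe = [[0.0] * d for _ in range(n_positions)]
    for pos in range(n_positions):
        for i in range(d // 2):
            angle = pos / (10000 ** (2 * i / d))
            pe[pos][2 * i] = math.sin(angle)      # even index: sine
            pe[pos][2 * i + 1] = math.cos(angle)  # odd index: cosine
    return pe

pe = sinusoidal_encoding(4, 6)
```

Position 0 encodes to alternating zeros and ones, since <math>\sin 0 = 0</math> and <math>\cos 0 = 1</math>; higher index pairs oscillate more slowly as the position increases.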

== Cross-Attention ==

'''Cross-attention''' is used when queries come from one sequence and keys/values come from another. In encoder-decoder Transformers, the decoder attends to encoder outputs via cross-attention, enabling the model to condition its generation on the full input context.

== Practical Considerations ==

* '''Masking''': In autoregressive decoding, future positions are masked (their attention scores are set to <math>-\infty</math> before softmax, so they receive zero weight) to preserve the causal structure.
* '''Attention dropout''': Dropping attention weights randomly during training acts as a regulariser and reduces overfitting to specific alignment patterns.
* '''Key-value caching''': During inference, previously computed key and value vectors are cached to avoid redundant computation, significantly speeding up autoregressive generation.
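
Masking and key-value caching are closely related, which a small sketch can make visible. Under the assumption of a single head and plain Python lists, the code below computes causally masked attention over a full sequence and then reproduces the same outputs one step at a time from a cache. The names <code>causal_attention</code> and <code>KVCache</code> are hypothetical and chosen for this example.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

def weighted_sum(w, V):
    return [sum(wj * v[c] for wj, v in zip(w, V)) for c in range(len(V[0]))]

def causal_attention(Q, K, V):
    """Full-sequence attention in which position i only sees positions j <= i."""
    d_k = len(K[0])
    out = []
    for i, q in enumerate(Q):
        scores = [dot(q, k) / math.sqrt(d_k) if j <= i else -math.inf
                  for j, k in enumerate(K)]
        out.append(weighted_sum(softmax(scores), V))
    return out

class KVCache:
    """Stores past keys and values so each step only processes the new token."""
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, q, k, v):
        self.keys.append(k)
        self.values.append(v)
        scores = [dot(q, kj) / math.sqrt(len(k)) for kj in self.keys]
        return weighted_sum(softmax(scores), self.values)

# Hypothetical toy sequence of 3 positions.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

full = causal_attention(Q, K, V)
cache = KVCache()
stepped = [cache.step(q, k, v) for q, k, v in zip(Q, K, V)]
```

The two results agree, and the first position attends only to itself, so its output equals the first value vector.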

== See also ==

* [[Transformer]]
* [[Recurrent Neural Networks]]
* [[Sequence-to-sequence models]]
* [[Self-supervised learning]]
* [[Softmax Function]]

== References ==

* Bahdanau, D., Cho, K. and Bengio, Y. (2015). "Neural Machine Translation by Jointly Learning to Align and Translate". ''ICLR''.
* Luong, M.-T., Pham, H. and Manning, C. D. (2015). "Effective Approaches to Attention-based Neural Machine Translation". ''EMNLP''.
* Vaswani, A. et al. (2017). "Attention Is All You Need". ''NeurIPS''.
* Shaw, P., Uszkoreit, J. and Vaswani, A. (2018). "Self-Attention with Relative Position Representations". ''NAACL''.
* Su, J. et al. (2021). "RoFormer: Enhanced Transformer with Rotary Position Embedding". ''arXiv:2104.09864''.

[[Category:Deep Learning]]
[[Category:Advanced]]
[[Category:Neural Networks]]

← Older revision		Revision as of 00:00, 28 April 2026
Line 3:		Line 3:
	{{ContentMeta \| generated_by = claude-opus \| model_used = claude-opus-4-6 \| generated_date = 2026-03-13}}		{{ContentMeta \| generated_by = claude-opus \| model_used = claude-opus-4-6 \| generated_date = 2026-03-13}}

	'''Attention mechanisms''' are a family of techniques that allow neural networks to focus selectively on relevant parts of their input when producing each element of the output. Originally introduced to overcome the limitations of fixed-length context vectors in sequence-to-sequence models, attention has become the foundational building block of modern architectures such as the [[Transformer]].		'''Attention mechanisms''' are a family of techniques that allow neural networks to focus selectively on relevant parts of their input when producing each element of the output. Originally introduced to overcome the limitations of fixed-length context vectors in {{Term\|sequence-to-sequence}} models, attention has become the foundational building block of modern architectures such as the [[Transformer]].

	== Motivation ==		== Motivation ==

	Early sequence-to-sequence models encoded an entire input sequence into a single fixed-dimensional vector using a [[Recurrent Neural Networks\|recurrent neural network]]. This ''bottleneck'' forced long-range dependencies to be compressed into a vector of constant size, degrading performance on long sequences. Attention resolves this by letting the decoder consult every encoder hidden state at each generation step, weighting them by learned relevance scores.		Early {{Term\|sequence-to-sequence}} models encoded an entire input sequence into a single fixed-dimensional vector using a [[Recurrent Neural Networks\|recurrent neural network]]. This ''bottleneck'' forced long-range dependencies to be compressed into a vector of constant size, degrading performance on long sequences. Attention resolves this by letting the decoder consult every encoder hidden state at each generation step, weighting them by learned relevance scores.

	== Bahdanau (Additive) Attention ==		== Bahdanau (Additive) Attention ==
Line 15:		Line 15:
	:<math>e_{t,i} = v^{\!\top} \tanh(W_s\, s_{t-1} + W_h\, h_i)</math>		:<math>e_{t,i} = v^{\!\top} \tanh(W_s\, s_{t-1} + W_h\, h_i)</math>

	where <math>W_s</math>, <math>W_h</math>, and <math>v</math> are learned parameters. The attention weights are obtained by applying softmax:		where <math>W_s</math>, <math>W_h</math>, and <math>v</math> are learned parameters. The attention weights are obtained by applying {{Term\|softmax}}:

	:<math>\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{j=1}^{T} \exp(e_{t,j})}</math>		:<math>\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{j=1}^{T} \exp(e_{t,j})}</math>
Line 40:		Line 40:
	== Scaled Dot-Product Attention ==		== Scaled Dot-Product Attention ==

	Vaswani et al. (2017) introduced the formulation used in the ~~Transformer~~. Given matrices of queries <math>Q</math>, keys <math>K</math>, and values <math>V</math>:		Vaswani et al. (2017) introduced the formulation used in the {{Term\|transformer}}. Given matrices of queries <math>Q</math>, keys <math>K</math>, and values <math>V</math>:

	:<math>\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\!\top}}{\sqrt{d_k}}\right) V</math>		:<math>\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\!\top}}{\sqrt{d_k}}\right) V</math>

	The scaling factor <math>\sqrt{d_k}</math> prevents the dot products from growing large in magnitude as the key dimension <math>d_k</math> increases, which would push the softmax into regions of extremely small gradients.		The scaling factor <math>\sqrt{d_k}</math> prevents the dot products from growing large in magnitude as the key dimension <math>d_k</math> increases, which would push the {{Term\|softmax}} into regions of extremely small gradients.

	== Self-Attention ==		== Self-Attention ==
Line 64:		Line 64:
	== Positional Encoding ==		== Positional Encoding ==

	Because self-attention is permutation-invariant (it treats the input as an unordered set), positional information must be injected explicitly. The original ~~Transformer~~ uses sinusoidal encodings:		Because self-attention is permutation-invariant (it treats the input as an unordered set), positional information must be injected explicitly. The original {{Term\|transformer}} uses sinusoidal encodings:

	:<math>\mathrm{PE}(pos, 2i) = \sin\!\left(\frac{pos}{10000^{2i/d}}\right), \quad \mathrm{PE}(pos, 2i+1) = \cos\!\left(\frac{pos}{10000^{2i/d}}\right)</math>		:<math>\mathrm{PE}(pos, 2i) = \sin\!\left(\frac{pos}{10000^{2i/d}}\right), \quad \mathrm{PE}(pos, 2i+1) = \cos\!\left(\frac{pos}{10000^{2i/d}}\right)</math>

	Learned positional embeddings and relative positional encodings (e.g., RoPE, ALiBi) are common alternatives that can generalise better to unseen sequence lengths.		Learned positional {{Term\|embedding\|embeddings}} and relative positional encodings (e.g., RoPE, ALiBi) are common alternatives that can generalise better to unseen sequence lengths.

	== Cross-Attention ==		== Cross-Attention ==

	'''Cross-attention''' is used when queries come from one sequence and keys/values come from another. In encoder-decoder Transformers, the decoder attends to encoder outputs via cross-attention, enabling the model to condition its generation on the full input context.		'''Cross-attention''' is used when queries come from one sequence and keys/values come from another. In encoder-decoder {{Term\|transformer\|Transformers}}, the decoder attends to encoder outputs via cross-attention, enabling the model to condition its generation on the full input context.

	== Practical Considerations ==		== Practical Considerations ==

	* '''Masking''': In autoregressive decoding, future positions are masked (set to <math>-\infty</math> before softmax) to preserve the causal structure.		* '''Masking''': In autoregressive decoding, future positions are masked (set to <math>-\infty</math> before {{Term\|softmax}}) to preserve the causal structure.
	* '''Attention dropout''': Dropping attention weights randomly during training acts as a regulariser and reduces overfitting to specific alignment patterns.		* '''Attention {{Term\|dropout}}''': Dropping attention weights randomly during training acts as a regulariser and reduces {{Term\|overfitting}} to specific alignment patterns.
	* '''Key-value caching''': During inference, previously computed key and value vectors are cached to avoid redundant computation, significantly speeding up autoregressive generation.		* '''Key-value caching''': During inference, previously computed key and value vectors are cached to avoid redundant computation, significantly speeding up autoregressive generation.

Line 94:		Line 94:
	* Vaswani, A. et al. (2017). "Attention Is All You Need". ''NeurIPS''.		* Vaswani, A. et al. (2017). "Attention Is All You Need". ''NeurIPS''.
	* Shaw, P., Uszkoreit, J. and Vaswani, A. (2018). "Self-Attention with Relative Position Representations". ''NAACL''.		* Shaw, P., Uszkoreit, J. and Vaswani, A. (2018). "Self-Attention with Relative Position Representations". ''NAACL''.
	* Su, J. et al. (2021). "RoFormer: Enhanced ~~Transformer~~ with Rotary Position ~~Embedding~~". ''arXiv:2104.09864''.		* Su, J. et al. (2021). "RoFormer: Enhanced {{Term\|transformer}} with Rotary Position {{Term\|embedding}}". ''arXiv:2104.09864''.

	[[Category:Deep Learning]]		[[Category:Deep Learning]]
	[[Category:Advanced]]		[[Category:Advanced]]
	[[Category:Neural Networks]]		[[Category:Neural Networks]]

← Older revision		Revision as of 21:57, 27 April 2026
Line 3:		Line 3:
	{{ContentMeta \| generated_by = claude-opus \| model_used = claude-opus-4-6 \| generated_date = 2026-03-13}}		{{ContentMeta \| generated_by = claude-opus \| model_used = claude-opus-4-6 \| generated_date = 2026-03-13}}

	'''Attention mechanisms''' are a family of techniques that allow neural networks to focus selectively on relevant parts of their input when producing each element of the output. Originally introduced to overcome the limitations of fixed-length context vectors in ~~{{Term\|~~sequence-to-sequence}} models, attention has become the foundational building block of modern architectures such as the [[Transformer]].		'''Attention mechanisms''' are a family of techniques that allow neural networks to focus selectively on relevant parts of their input when producing each element of the output. Originally introduced to overcome the limitations of fixed-length context vectors in sequence-to-sequence models, attention has become the foundational building block of modern architectures such as the [[Transformer]].

	== Motivation ==		== Motivation ==

	Early ~~{{Term\|~~sequence-to-sequence}} models encoded an entire input sequence into a single fixed-dimensional vector using a [[Recurrent Neural Networks\|recurrent neural network]]. This ''bottleneck'' forced long-range dependencies to be compressed into a vector of constant size, degrading performance on long sequences. Attention resolves this by letting the decoder consult every encoder hidden state at each generation step, weighting them by learned relevance scores.		Early sequence-to-sequence models encoded an entire input sequence into a single fixed-dimensional vector using a [[Recurrent Neural Networks\|recurrent neural network]]. This ''bottleneck'' forced long-range dependencies to be compressed into a vector of constant size, degrading performance on long sequences. Attention resolves this by letting the decoder consult every encoder hidden state at each generation step, weighting them by learned relevance scores.

	== Bahdanau (Additive) Attention ==		== Bahdanau (Additive) Attention ==
Line 15:		Line 15:
	:<math>e_{t,i} = v^{\!\top} \tanh(W_s\, s_{t-1} + W_h\, h_i)</math>		:<math>e_{t,i} = v^{\!\top} \tanh(W_s\, s_{t-1} + W_h\, h_i)</math>

	where <math>W_s</math>, <math>W_h</math>, and <math>v</math> are learned parameters. The attention weights are obtained by applying ~~{{Term\|~~softmax}}:		where <math>W_s</math>, <math>W_h</math>, and <math>v</math> are learned parameters. The attention weights are obtained by applying softmax:

	:<math>\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{j=1}^{T} \exp(e_{t,j})}</math>		:<math>\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{j=1}^{T} \exp(e_{t,j})}</math>
Line 40:		Line 40:
	== Scaled Dot-Product Attention ==		== Scaled Dot-Product Attention ==

	Vaswani et al. (2017) introduced the formulation used in the ~~{{Term\|transformer}}~~. Given matrices of queries <math>Q</math>, keys <math>K</math>, and values <math>V</math>:		Vaswani et al. (2017) introduced the formulation used in the Transformer. Given matrices of queries <math>Q</math>, keys <math>K</math>, and values <math>V</math>:

	:<math>\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\!\top}}{\sqrt{d_k}}\right) V</math>		:<math>\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\!\top}}{\sqrt{d_k}}\right) V</math>

	The scaling factor <math>\sqrt{d_k}</math> prevents the dot products from growing large in magnitude as the key dimension <math>d_k</math> increases, which would push the ~~{{Term\|~~softmax}} into regions of extremely small gradients.		The scaling factor <math>\sqrt{d_k}</math> prevents the dot products from growing large in magnitude as the key dimension <math>d_k</math> increases, which would push the softmax into regions of extremely small gradients.

	== Self-Attention ==		== Self-Attention ==
Line 64:		Line 64:
	== Positional Encoding ==		== Positional Encoding ==

	Because self-attention is permutation-invariant (it treats the input as an unordered set), positional information must be injected explicitly. The original ~~{{Term\|transformer}}~~ uses sinusoidal encodings:		Because self-attention is permutation-invariant (it treats the input as an unordered set), positional information must be injected explicitly. The original Transformer uses sinusoidal encodings:

	:<math>\mathrm{PE}(pos, 2i) = \sin\!\left(\frac{pos}{10000^{2i/d}}\right), \quad \mathrm{PE}(pos, 2i+1) = \cos\!\left(\frac{pos}{10000^{2i/d}}\right)</math>		:<math>\mathrm{PE}(pos, 2i) = \sin\!\left(\frac{pos}{10000^{2i/d}}\right), \quad \mathrm{PE}(pos, 2i+1) = \cos\!\left(\frac{pos}{10000^{2i/d}}\right)</math>

	Learned positional ~~{{Term\|embedding\|~~embeddings}} and relative positional encodings (e.g., RoPE, ALiBi) are common alternatives that can generalise better to unseen sequence lengths.		Learned positional embeddings and relative positional encodings (e.g., RoPE, ALiBi) are common alternatives that can generalise better to unseen sequence lengths.

	== Cross-Attention ==		== Cross-Attention ==

	'''Cross-attention''' is used when queries come from one sequence and keys/values come from another. In encoder-decoder ~~{{Term\|transformer\|~~Transformers}}, the decoder attends to encoder outputs via cross-attention, enabling the model to condition its generation on the full input context.		'''Cross-attention''' is used when queries come from one sequence and keys/values come from another. In encoder-decoder Transformers, the decoder attends to encoder outputs via cross-attention, enabling the model to condition its generation on the full input context.

	== Practical Considerations ==		== Practical Considerations ==

	* '''Masking''': In autoregressive decoding, future positions are masked (set to <math>-\infty</math> before ~~{{Term\|~~softmax}}) to preserve the causal structure.		* '''Masking''': In autoregressive decoding, future positions are masked (set to <math>-\infty</math> before softmax) to preserve the causal structure.
	* '''Attention ~~{{Term\|~~dropout}}''': Dropping attention weights randomly during training acts as a regulariser and reduces ~~{{Term\|~~overfitting}} to specific alignment patterns.		* '''Attention dropout''': Dropping attention weights randomly during training acts as a regulariser and reduces overfitting to specific alignment patterns.
	* '''Key-value caching''': During inference, previously computed key and value vectors are cached to avoid redundant computation, significantly speeding up autoregressive generation.		* '''Key-value caching''': During inference, previously computed key and value vectors are cached to avoid redundant computation, significantly speeding up autoregressive generation.

Line 94:		Line 94:
	* Vaswani, A. et al. (2017). "Attention Is All You Need". ''NeurIPS''.		* Vaswani, A. et al. (2017). "Attention Is All You Need". ''NeurIPS''.
	* Shaw, P., Uszkoreit, J. and Vaswani, A. (2018). "Self-Attention with Relative Position Representations". ''NAACL''.		* Shaw, P., Uszkoreit, J. and Vaswani, A. (2018). "Self-Attention with Relative Position Representations". ''NAACL''.
	* Su, J. et al. (2021). "RoFormer: Enhanced ~~{{Term\|transformer}}~~ with Rotary Position ~~{{Term\|embedding}}~~". ''arXiv:2104.09864''.		* Su, J. et al. (2021). "RoFormer: Enhanced Transformer with Rotary Position Embedding". ''arXiv:2104.09864''.

	[[Category:Deep Learning]]		[[Category:Deep Learning]]
	[[Category:Advanced]]		[[Category:Advanced]]
	[[Category:Neural Networks]]		[[Category:Neural Networks]]

← Older revision		Revision as of 19:42, 27 April 2026
Line 3:		Line 3:
	{{ContentMeta \| generated_by = claude-opus \| model_used = claude-opus-4-6 \| generated_date = 2026-03-13}}		{{ContentMeta \| generated_by = claude-opus \| model_used = claude-opus-4-6 \| generated_date = 2026-03-13}}

	'''Attention mechanisms''' are a family of techniques that allow neural networks to focus selectively on relevant parts of their input when producing each element of the output. Originally introduced to overcome the limitations of fixed-length context vectors in sequence-to-sequence models, attention has become the foundational building block of modern architectures such as the [[Transformer]].		'''Attention mechanisms''' are a family of techniques that allow neural networks to focus selectively on relevant parts of their input when producing each element of the output. Originally introduced to overcome the limitations of fixed-length context vectors in {{Term\|sequence-to-sequence}} models, attention has become the foundational building block of modern architectures such as the [[Transformer]].

	== Motivation ==		== Motivation ==

	Early sequence-to-sequence models encoded an entire input sequence into a single fixed-dimensional vector using a [[Recurrent Neural Networks\|recurrent neural network]]. This ''bottleneck'' forced long-range dependencies to be compressed into a vector of constant size, degrading performance on long sequences. Attention resolves this by letting the decoder consult every encoder hidden state at each generation step, weighting them by learned relevance scores.		Early {{Term\|sequence-to-sequence}} models encoded an entire input sequence into a single fixed-dimensional vector using a [[Recurrent Neural Networks\|recurrent neural network]]. This ''bottleneck'' forced long-range dependencies to be compressed into a vector of constant size, degrading performance on long sequences. Attention resolves this by letting the decoder consult every encoder hidden state at each generation step, weighting them by learned relevance scores.

	== Bahdanau (Additive) Attention ==		== Bahdanau (Additive) Attention ==
Line 15:		Line 15:
	:<math>e_{t,i} = v^{\!\top} \tanh(W_s\, s_{t-1} + W_h\, h_i)</math>		:<math>e_{t,i} = v^{\!\top} \tanh(W_s\, s_{t-1} + W_h\, h_i)</math>

	where <math>W_s</math>, <math>W_h</math>, and <math>v</math> are learned parameters. The attention weights are obtained by applying softmax:		where <math>W_s</math>, <math>W_h</math>, and <math>v</math> are learned parameters. The attention weights are obtained by applying {{Term\|softmax}}:

	:<math>\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{j=1}^{T} \exp(e_{t,j})}</math>		:<math>\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{j=1}^{T} \exp(e_{t,j})}</math>
Line 40:		Line 40:
	== Scaled Dot-Product Attention ==		== Scaled Dot-Product Attention ==

	Vaswani et al. (2017) introduced the formulation used in the ~~Transformer~~. Given matrices of queries <math>Q</math>, keys <math>K</math>, and values <math>V</math>:		Vaswani et al. (2017) introduced the formulation used in the {{Term\|transformer}}. Given matrices of queries <math>Q</math>, keys <math>K</math>, and values <math>V</math>:

	:<math>\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\!\top}}{\sqrt{d_k}}\right) V</math>		:<math>\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\!\top}}{\sqrt{d_k}}\right) V</math>

	The scaling factor <math>\sqrt{d_k}</math> prevents the dot products from growing large in magnitude as the key dimension <math>d_k</math> increases, which would push the softmax into regions of extremely small gradients.		The scaling factor <math>\sqrt{d_k}</math> prevents the dot products from growing large in magnitude as the key dimension <math>d_k</math> increases, which would push the {{Term\|softmax}} into regions of extremely small gradients.

	== Self-Attention ==		== Self-Attention ==
Line 64:		Line 64:
	== Positional Encoding ==		== Positional Encoding ==

	Because self-attention is permutation-invariant (it treats the input as an unordered set), positional information must be injected explicitly. The original ~~Transformer~~ uses sinusoidal encodings:		Because self-attention is permutation-invariant (it treats the input as an unordered set), positional information must be injected explicitly. The original {{Term\|transformer}} uses sinusoidal encodings:

	:<math>\mathrm{PE}(pos, 2i) = \sin\!\left(\frac{pos}{10000^{2i/d}}\right), \quad \mathrm{PE}(pos, 2i+1) = \cos\!\left(\frac{pos}{10000^{2i/d}}\right)</math>		:<math>\mathrm{PE}(pos, 2i) = \sin\!\left(\frac{pos}{10000^{2i/d}}\right), \quad \mathrm{PE}(pos, 2i+1) = \cos\!\left(\frac{pos}{10000^{2i/d}}\right)</math>

	Learned positional embeddings and relative positional encodings (e.g., RoPE, ALiBi) are common alternatives that can generalise better to unseen sequence lengths.		Learned positional {{Term\|embedding\|embeddings}} and relative positional encodings (e.g., RoPE, ALiBi) are common alternatives that can generalise better to unseen sequence lengths.

	== Cross-Attention ==		== Cross-Attention ==

	'''Cross-attention''' is used when queries come from one sequence and keys/values come from another. In encoder-decoder Transformers, the decoder attends to encoder outputs via cross-attention, enabling the model to condition its generation on the full input context.		'''Cross-attention''' is used when queries come from one sequence and keys/values come from another. In encoder-decoder {{Term\|transformer\|Transformers}}, the decoder attends to encoder outputs via cross-attention, enabling the model to condition its generation on the full input context.

	== Practical Considerations ==		== Practical Considerations ==

	* '''Masking''': In autoregressive decoding, future positions are masked (set to <math>-\infty</math> before softmax) to preserve the causal structure.		* '''Masking''': In autoregressive decoding, future positions are masked (set to <math>-\infty</math> before {{Term\|softmax}}) to preserve the causal structure.
	* '''Attention dropout''': Dropping attention weights randomly during training acts as a regulariser and reduces overfitting to specific alignment patterns.		* '''Attention {{Term\|dropout}}''': Dropping attention weights randomly during training acts as a regulariser and reduces {{Term\|overfitting}} to specific alignment patterns.
	* '''Key-value caching''': During inference, previously computed key and value vectors are cached to avoid redundant computation, significantly speeding up autoregressive generation.		* '''Key-value caching''': During inference, previously computed key and value vectors are cached to avoid redundant computation, significantly speeding up autoregressive generation.

Line 94:		Line 94:
	* Vaswani, A. et al. (2017). "Attention Is All You Need". ''NeurIPS''.		* Vaswani, A. et al. (2017). "Attention Is All You Need". ''NeurIPS''.
	* Shaw, P., Uszkoreit, J. and Vaswani, A. (2018). "Self-Attention with Relative Position Representations". ''NAACL''.		* Shaw, P., Uszkoreit, J. and Vaswani, A. (2018). "Self-Attention with Relative Position Representations". ''NAACL''.
	* Su, J. et al. (2021). "RoFormer: Enhanced ~~Transformer~~ with Rotary Position ~~Embedding~~". ''arXiv:2104.09864''.		* Su, J. et al. (2021). "RoFormer: Enhanced {{Term\|transformer}} with Rotary Position {{Term\|embedding}}". ''arXiv:2104.09864''.

	[[Category:Deep Learning]]		[[Category:Deep Learning]]
	[[Category:Advanced]]		[[Category:Advanced]]
	[[Category:Neural Networks]]		[[Category:Neural Networks]]