Attention Mechanisms
| Article | |
|---|---|
| Topic area | Deep Learning |
| Prerequisites | Neural Networks, Recurrent Neural Networks |
Attention mechanisms are a family of techniques that allow neural networks to focus selectively on relevant parts of their input when producing each element of the output. Originally introduced to overcome the limitations of fixed-length context vectors in sequence-to-sequence models, attention has become the foundational building block of modern architectures such as the Transformer.
Motivation
Early sequence-to-sequence models encoded an entire input sequence into a single fixed-dimensional context vector using a recurrent neural network. This bottleneck forced long-range dependencies to be compressed into a vector of constant size, degrading performance on long sequences. Attention resolves this by letting the decoder consult every encoder hidden state at each generation step, weighting them by learned relevance scores.
Bahdanau (Additive) Attention
Bahdanau et al. (2015) proposed the first widely adopted attention mechanism for neural machine translation. Given encoder hidden states $ h_1, \dots, h_T $ and the previous decoder state $ s_{t-1} $, the alignment score is computed as:
- $ e_{t,i} = v^{\!\top} \tanh(W_s\, s_{t-1} + W_h\, h_i) $
where $ W_s $, $ W_h $, and $ v $ are learned parameters. The attention weights are obtained by applying the softmax function:
- $ \alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{j=1}^{T} \exp(e_{t,j})} $
The context vector is the weighted sum $ c_t = \sum_{i=1}^{T} \alpha_{t,i}\, h_i $, which is concatenated with $ s_{t-1} $ and fed into the decoder.
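A minimal NumPy sketch of one decoding step of additive attention under toy shapes; the function and dimension names (`bahdanau_attention`, `d_a`, etc.) follow the formulas above but are otherwise illustrative, not from the paper:

```python
import numpy as np

def bahdanau_attention(s_prev, H, W_s, W_h, v):
    """One decoding step of additive attention.

    s_prev: (d_dec,) previous decoder state s_{t-1}
    H:      (T, d_enc) encoder hidden states h_1..h_T
    W_s:    (d_a, d_dec), W_h: (d_a, d_enc), v: (d_a,) learned parameters
    """
    e = np.tanh(s_prev @ W_s.T + H @ W_h.T) @ v  # scores e_{t,i}, shape (T,)
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()                         # softmax over source positions
    c = alpha @ H                                # context vector c_t, (d_enc,)
    return c, alpha

# toy shapes: T=5 source positions, d_enc=4, d_dec=3, d_a=6
rng = np.random.default_rng(0)
T, d_enc, d_dec, d_a = 5, 4, 3, 6
c, alpha = bahdanau_attention(
    rng.normal(size=d_dec), rng.normal(size=(T, d_enc)),
    rng.normal(size=(d_a, d_dec)), rng.normal(size=(d_a, d_enc)),
    rng.normal(size=d_a))
assert np.isclose(alpha.sum(), 1.0)
```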
Luong (Multiplicative) Attention
Luong et al. (2015) simplified the scoring function by replacing the additive network with a dot product or a bilinear form:
| Variant | Score function |
|---|---|
| Dot | $ e_{t,i} = s_t^{\!\top} h_i $ |
| General | $ e_{t,i} = s_t^{\!\top} W_a\, h_i $ |
| Concat | $ e_{t,i} = v^{\!\top} \tanh(W_a [s_t;\, h_i]) $ |
The dot variant requires encoder and decoder dimensions to match, while the general variant introduces a learnable weight matrix $ W_a $.
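The three score functions in the table translate directly into a few lines of NumPy; this sketch uses illustrative names and shapes (`luong_scores`, `d_a`) rather than anything prescribed by the paper:

```python
import numpy as np

def luong_scores(s_t, H, variant="dot", W_a=None, v=None):
    """Alignment scores e_{t,i} for the three Luong variants.

    s_t: (d,) current decoder state; H: (T, d) encoder hidden states.
    """
    if variant == "dot":      # e_{t,i} = s_t^T h_i; dimensions must match
        return H @ s_t                                        # (T,)
    if variant == "general":  # e_{t,i} = s_t^T W_a h_i; W_a: (d, d)
        return H @ (W_a.T @ s_t)                              # (T,)
    if variant == "concat":   # e_{t,i} = v^T tanh(W_a [s_t; h_i])
        T = H.shape[0]
        cat = np.concatenate([np.tile(s_t, (T, 1)), H], axis=1)  # (T, 2d)
        return np.tanh(cat @ W_a.T) @ v                       # W_a: (d_a, 2d), v: (d_a,)
    raise ValueError(f"unknown variant: {variant}")
```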
Scaled Dot-Product Attention
Vaswani et al. (2017) introduced the formulation used in the Transformer. Given matrices of queries $ Q $, keys $ K $, and values $ V $:
- $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\!\top}}{\sqrt{d_k}}\right) V $
The scaling factor $ \sqrt{d_k} $ prevents the dot products from growing large in magnitude as the key dimension $ d_k $ increases, which would push the softmax into regions of extremely small gradients.
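A direct NumPy transcription of the formula, with a standard max-subtraction added for numerical stability (an implementation detail, not part of the formula):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v) -> (n_q, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (n_q, n_k)
    scores -= scores.max(axis=-1, keepdims=True)    # stability shift
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V
```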
Self-Attention
In self-attention, the queries, keys, and values all derive from the same sequence. Each position attends to every other position (including itself), enabling the model to capture long-range dependencies in a single layer. For an input matrix $ X \in \mathbb{R}^{n \times d} $:
- $ Q = X W^Q, \quad K = X W^K, \quad V = X W^V $
Self-attention has $ O(n^2 d) $ complexity, which can be expensive for very long sequences. Efficient variants such as sparse attention and linearised attention reduce this cost.
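The quadratic cost is visible in a short sketch, where the $ n \times n $ attention matrix is materialised explicitly (toy shapes and random weights, purely for illustration):

```python
import numpy as np

# Self-attention: queries, keys, and values all projected from the same X.
rng = np.random.default_rng(0)
n, d = 10, 16                                  # sequence length, model width
X = rng.normal(size=(n, d))
W_Q, W_K, W_V = (rng.normal(size=(d, d)) * d**-0.5 for _ in range(3))
Q, K, V = X @ W_Q, X @ W_K, X @ W_V
scores = Q @ K.T / np.sqrt(d)                  # the (n, n) matrix: the O(n^2) term
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
out = weights @ V                              # (n, d)
```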
Multi-Head Attention
Rather than performing a single attention function, multi-head attention runs $ h $ parallel attention heads with independent projections:
- $ \mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^O $
where $ \mathrm{head}_i = \mathrm{Attention}(Q W_i^Q,\, K W_i^K,\, V W_i^V) $. Each head can learn to attend to different aspects of the input — for example, one head might capture syntactic relationships while another captures semantic ones. Typical configurations use 8 or 16 heads.
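A compact NumPy sketch of the whole computation; it folds the per-head projections $ W_i^Q $ into single $ d_{\mathrm{model}} \times d_{\mathrm{model}} $ matrices that are then reshaped into heads, which is a common implementation trick rather than the paper's notation:

```python
import numpy as np

def multi_head_attention(X, W_Q, W_K, W_V, W_O, h):
    """Split d_model into h heads, attend per head, concatenate, project.

    X: (n, d_model); W_Q, W_K, W_V, W_O: (d_model, d_model); d_model % h == 0.
    """
    n, d_model = X.shape
    d_head = d_model // h

    def split_heads(W):
        # project, then reshape (n, d_model) -> (h, n, d_head)
        return (X @ W).reshape(n, h, d_head).transpose(1, 0, 2)

    Q, K, V = split_heads(W_Q), split_heads(W_K), split_heads(W_V)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)    # (h, n, n)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)         # per-head softmax
    heads = weights @ V                                    # (h, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)  # Concat(head_1..head_h)
    return concat @ W_O
```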
Positional Encoding
Because self-attention is permutation-invariant (it treats the input as an unordered set), positional information must be injected explicitly. The original Transformer uses sinusoidal encodings:
- $ \mathrm{PE}(pos, 2i) = \sin\!\left(\frac{pos}{10000^{2i/d}}\right), \quad \mathrm{PE}(pos, 2i+1) = \cos\!\left(\frac{pos}{10000^{2i/d}}\right) $
Learned positional embeddings and relative encodings (e.g., relative position representations (Shaw et al., 2018) and rotary position embedding (Su et al., 2021)) are common alternatives; the relative schemes in particular can generalise better to unseen sequence lengths.
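A sketch of the sinusoidal encoding above, assuming an even model width $ d $:

```python
import numpy as np

def sinusoidal_positional_encoding(n, d):
    """Return the (n, d) PE matrix: sin on even columns, cos on odd."""
    pos = np.arange(n)[:, None]           # positions, shape (n, 1)
    i = np.arange(d // 2)[None, :]        # index pairs, shape (1, d/2)
    angles = pos / 10000 ** (2 * i / d)   # (n, d/2)
    pe = np.empty((n, d))
    pe[:, 0::2] = np.sin(angles)          # PE(pos, 2i)
    pe[:, 1::2] = np.cos(angles)          # PE(pos, 2i+1)
    return pe
```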
Cross-Attention
Cross-attention is used when queries come from one sequence and keys/values come from another. In encoder-decoder Transformers, the decoder attends to encoder outputs via cross-attention, enabling the model to condition its generation on the full input context.
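Structurally this is the same computation as self-attention with a different source for $ K $ and $ V $; a toy sketch with random weights and illustrative shapes:

```python
import numpy as np

# Cross-attention: queries from the decoder sequence Y, keys/values from
# encoder outputs X.
rng = np.random.default_rng(0)
m, n, d = 6, 10, 16                            # decoder length, encoder length, width
Y = rng.normal(size=(m, d))                    # decoder states (query side)
X = rng.normal(size=(n, d))                    # encoder outputs (key/value side)
W_Q, W_K, W_V = (rng.normal(size=(d, d)) * d**-0.5 for _ in range(3))
scores = (Y @ W_Q) @ (X @ W_K).T / np.sqrt(d)  # (m, n): decoder scores every encoder position
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
out = weights @ (X @ W_V)                      # (m, d)
```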
Practical Considerations
- Masking: In autoregressive decoding, future positions are masked (set to $ -\infty $ before the softmax) to preserve the causal structure, as in the sketch after this list.
- Attention dropout: Randomly dropping attention weights during training acts as a regulariser and reduces overfitting to specific alignment patterns.
- Key-value caching: During inference, previously computed key and value projections are cached to avoid redundant computation, significantly speeding up autoregressive generation.
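As referenced in the masking bullet above, a small sketch of a causal mask: entries above the diagonal are set to $ -\infty $ so their post-softmax weight is exactly zero (toy scores for illustration):

```python
import numpy as np

# Causal mask: position i may attend only to positions j <= i.
n = 5
scores = np.random.default_rng(0).normal(size=(n, n))
mask = np.triu(np.ones((n, n), dtype=bool), k=1)  # True strictly above the diagonal
scores = np.where(mask, -np.inf, scores)          # mask out future positions
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)    # masked entries become exactly 0
assert np.allclose(np.triu(weights, k=1), 0.0)
```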
See also
- Transformer
- Recurrent Neural Networks
- Sequence-to-sequence models
- Self-supervised learning
- Softmax Function
References
- Bahdanau, D., Cho, K. and Bengio, Y. (2015). "Neural Machine Translation by Jointly Learning to Align and Translate". ICLR.
- Luong, M.-T., Pham, H. and Manning, C. D. (2015). "Effective Approaches to Attention-based Neural Machine Translation". EMNLP.
- Vaswani, A. et al. (2017). "Attention Is All You Need". NeurIPS.
- Shaw, P., Uszkoreit, J. and Vaswani, A. (2018). "Self-Attention with Relative Position Representations". NAACL.
- Su, J. et al. (2021). "RoFormer: Enhanced Transformer with Rotary Position Embedding". arXiv:2104.09864.