Attention Mechanisms

    From Marovi AI
    Topic area Deep Learning
    Difficulty Advanced
    Prerequisites Neural Networks, Recurrent Neural Networks

    Attention mechanisms are a family of techniques that allow neural networks to focus selectively on relevant parts of their input when producing each element of the output. Originally introduced to overcome the limitations of fixed-length context vectors in sequence-to-sequence models, attention has become the foundational building block of modern architectures such as the Transformer.

    Motivation

    Early sequence-to-sequence models encoded an entire input sequence into a single fixed-dimensional vector using a recurrent neural network. This bottleneck forced long-range dependencies to be compressed into a vector of constant size, degrading performance on long sequences. Attention resolves this by letting the decoder consult every encoder hidden state at each generation step, weighting them by learned relevance scores.

    Bahdanau (Additive) Attention

    Bahdanau et al. (2015) proposed the first widely adopted attention mechanism for machine translation. Given encoder hidden states $ h_1, \dots, h_T $ and the previous decoder state $ s_{t-1} $, the alignment score for encoder position $ i $ at decoding step $ t $ is computed as:

    $ e_{t,i} = v^{\!\top} \tanh(W_s\, s_{t-1} + W_h\, h_i) $

    where $ W_s $, $ W_h $, and $ v $ are learned parameters. The attention weights are obtained by applying softmax:

    $ \alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{j=1}^{T} \exp(e_{t,j})} $

    The context vector is the weighted sum $ c_t = \sum_{i=1}^{T} \alpha_{t,i}\, h_i $, which is concatenated with $ s_{t-1} $ and fed into the decoder.
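    The computation above can be sketched in NumPy as follows. This is a minimal illustration, not the original implementation; the dimension names and randomly initialised parameters are assumptions for the example.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def additive_attention(s_prev, H, W_s, W_h, v):
    """Bahdanau-style additive attention (illustrative sketch).

    s_prev : (d_dec,)    previous decoder state s_{t-1}
    H      : (T, d_enc)  encoder hidden states h_1..h_T
    W_s    : (d_attn, d_dec), W_h : (d_attn, d_enc), v : (d_attn,)
    Returns the context vector c_t (d_enc,) and the weights alpha (T,).
    """
    # e_{t,i} = v^T tanh(W_s s_{t-1} + W_h h_i), computed for all i at once
    scores = np.tanh(s_prev @ W_s.T + H @ W_h.T) @ v   # (T,)
    alpha = softmax(scores)                            # attention weights
    c_t = alpha @ H                                    # weighted sum of encoder states
    return c_t, alpha
```

    The weights $ \alpha_{t,i} $ are non-negative and sum to one, so the context vector is a convex combination of the encoder states.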

    Luong (Multiplicative) Attention

    Luong et al. (2015) simplified the scoring function by replacing the additive network with a dot product or a bilinear form:

    • Dot: $ e_{t,i} = s_t^{\!\top} h_i $
    • General: $ e_{t,i} = s_t^{\!\top} W_a\, h_i $
    • Concat: $ e_{t,i} = v^{\!\top} \tanh(W_a [s_t;\, h_i]) $

    The dot variant requires encoder and decoder dimensions to match, while the general variant introduces a learnable weight matrix $ W_a $.
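    The three scoring variants can be sketched in a single NumPy function. The function name and argument conventions here are assumptions for illustration; note that $ W_a $ has a different shape in the general and concat variants.

```python
import numpy as np

def luong_scores(s_t, H, variant="dot", W_a=None, v=None):
    """Luong-style alignment scores for decoder state s_t against encoder states H.

    s_t : (d_dec,), H : (T, d_enc).
    "dot"     requires d_dec == d_enc.
    "general" takes W_a : (d_dec, d_enc).
    "concat"  takes W_a : (d_attn, d_dec + d_enc) and v : (d_attn,).
    """
    if variant == "dot":        # e_{t,i} = s_t^T h_i
        return H @ s_t
    if variant == "general":    # e_{t,i} = s_t^T W_a h_i
        return H @ (W_a.T @ s_t)
    if variant == "concat":     # e_{t,i} = v^T tanh(W_a [s_t; h_i])
        stacked = np.concatenate([np.tile(s_t, (H.shape[0], 1)), H], axis=1)
        return np.tanh(stacked @ W_a.T) @ v
    raise ValueError(f"unknown variant: {variant}")
```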

    Scaled Dot-Product Attention

    Vaswani et al. (2017) introduced the formulation used in the Transformer. Given matrices of queries $ Q $, keys $ K $, and values $ V $:

    $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\!\top}}{\sqrt{d_k}}\right) V $

    The scaling factor $ \sqrt{d_k} $ prevents the dot products from growing large in magnitude as the key dimension $ d_k $ increases, which would push the softmax into regions of extremely small gradients.
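    A minimal NumPy sketch of this formulation (shapes chosen for illustration):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V.

    Q : (n_q, d_k), K : (n_k, d_k), V : (n_k, d_v).
    Returns the output (n_q, d_v) and the attention weights (n_q, n_k).
    """
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # (n_q, n_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # softmax over keys
    return weights @ V, weights
```

    Each row of the weight matrix is a distribution over the keys, so each output row is a weighted average of the value vectors.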

    Self-Attention

    In self-attention, the queries, keys, and values all derive from the same sequence. Each position attends to every position in the sequence (including itself), enabling the model to capture long-range dependencies in a single layer. For an input matrix $ X \in \mathbb{R}^{n \times d} $:

    $ Q = X W^Q, \quad K = X W^K, \quad V = X W^V $

    Self-attention has $ O(n^2 d) $ complexity, which can be expensive for very long sequences. Efficient variants such as sparse attention and linear attention reduce this cost.
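    Combining the projections with scaled dot-product attention gives a single self-attention layer; a sketch (the projection shapes are illustrative assumptions):

```python
import numpy as np

def self_attention(X, W_Q, W_K, W_V):
    """Single self-attention layer over X (n, d): Q, K, V all derive from X.

    W_Q, W_K : (d, d_k), W_V : (d, d_v). Returns (n, d_v).
    """
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V        # project the same input three ways
    scores = Q @ K.T / np.sqrt(K.shape[-1])    # (n, n): every position vs every position
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)         # softmax over keys
    return w @ V
```

    The $ (n, n) $ score matrix is the source of the quadratic cost mentioned above.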

    Multi-Head Attention

    Rather than performing a single attention function, multi-head attention runs $ h $ parallel attention heads with independent projections:

    $ \mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^O $

    where $ \mathrm{head}_i = \mathrm{Attention}(Q W_i^Q,\, K W_i^K,\, V W_i^V) $. Each head can learn to attend to different aspects of the input — for example, one head might capture syntactic relationships while another captures semantic ones. Typical configurations use 8 or 16 heads.
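    A common implementation trick is to compute all heads with one projection per role and then split the feature dimension. The sketch below assumes square $ d \times d $ projections and $ d $ divisible by $ h $:

```python
import numpy as np

def multi_head_attention(X, W_Q, W_K, W_V, W_O, h):
    """Multi-head self-attention with h heads of size d // h.

    X : (n, d); W_Q, W_K, W_V, W_O : (d, d). Returns (n, d).
    """
    n, d = X.shape
    assert d % h == 0, "model dimension must divide evenly across heads"
    d_h = d // h
    # Project once, then split the feature axis into h heads: (h, n, d_h)
    Q = (X @ W_Q).reshape(n, h, d_h).transpose(1, 0, 2)
    K = (X @ W_K).reshape(n, h, d_h).transpose(1, 0, 2)
    V = (X @ W_V).reshape(n, h, d_h).transpose(1, 0, 2)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_h)   # (h, n, n)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    heads = w @ V                                      # (h, n, d_h)
    concat = heads.transpose(1, 0, 2).reshape(n, d)    # Concat(head_1, ..., head_h)
    return concat @ W_O
```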

    Positional Encoding

    Because self-attention is permutation-invariant (it treats the input as an unordered set), positional information must be injected explicitly. The original Transformer uses sinusoidal encodings:

    $ \mathrm{PE}(pos, 2i) = \sin\!\left(\frac{pos}{10000^{2i/d}}\right), \quad \mathrm{PE}(pos, 2i+1) = \cos\!\left(\frac{pos}{10000^{2i/d}}\right) $

    Learned positional embeddings and relative positional encodings (e.g., RoPE, ALiBi) are common alternatives that can generalise better to unseen sequence lengths.
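    The sinusoidal scheme above can be computed in a few lines (assuming an even model dimension $ d $):

```python
import numpy as np

def sinusoidal_pe(n_pos, d):
    """Sinusoidal positional encodings of shape (n_pos, d), with d even."""
    pos = np.arange(n_pos)[:, None]          # (n_pos, 1)
    i = np.arange(d // 2)[None, :]           # (1, d/2): pair index
    angles = pos / (10000 ** (2 * i / d))    # (n_pos, d/2)
    pe = np.zeros((n_pos, d))
    pe[:, 0::2] = np.sin(angles)             # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)             # odd dimensions: cosine
    return pe
```

    The encodings are added to the token embeddings before the first attention layer; each dimension oscillates at a different wavelength, so every position receives a distinct pattern.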

    Cross-Attention

    Cross-attention is used when queries come from one sequence and keys/values come from another. In encoder-decoder Transformers, the decoder attends to encoder outputs via cross-attention, enabling the model to condition its generation on the full input context.
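    Structurally, cross-attention differs from self-attention only in where the projections are applied; a sketch under the same illustrative conventions as above:

```python
import numpy as np

def cross_attention(X_dec, X_enc, W_Q, W_K, W_V):
    """Cross-attention: queries from the decoder, keys/values from the encoder.

    X_dec : (n_dec, d), X_enc : (n_enc, d). Returns (n_dec, d_v).
    """
    Q = X_dec @ W_Q                            # queries from one sequence
    K, V = X_enc @ W_K, X_enc @ W_V            # keys/values from the other
    scores = Q @ K.T / np.sqrt(K.shape[-1])    # (n_dec, n_enc)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V
```

    Note that the output length follows the query sequence, so the decoder can attend over an input of any length.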

    Practical Considerations

    • Masking: In autoregressive decoding, future positions are masked (set to $ -\infty $ before softmax) to preserve the causal structure.
    • Attention dropout: Dropping attention weights randomly during training acts as a regulariser and reduces overfitting to specific alignment patterns.
    • Key-value caching: During inference, previously computed key and value vectors are cached to avoid redundant computation, significantly speeding up autoregressive generation.
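    The masking point can be made concrete with a small sketch: a lower-triangular boolean mask marks the allowed positions, and disallowed scores are set to $ -\infty $ before the softmax so they receive zero weight.

```python
import numpy as np

def causal_mask(n):
    """Lower-triangular mask: position i may attend to positions j <= i."""
    return np.tril(np.ones((n, n), dtype=bool))

def masked_softmax(scores, mask):
    """Softmax with disallowed positions set to -inf before normalisation."""
    scores = np.where(mask, scores, -np.inf)
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)
```

    After masking, every entry above the diagonal of the weight matrix is exactly zero, so no position can attend to the future.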

    References

    • Bahdanau, D., Cho, K. and Bengio, Y. (2015). "Neural Machine Translation by Jointly Learning to Align and Translate". ICLR.
    • Luong, M.-T., Pham, H. and Manning, C. D. (2015). "Effective Approaches to Attention-based Neural Machine Translation". EMNLP.
    • Vaswani, A. et al. (2017). "Attention Is All You Need". NeurIPS.
    • Shaw, P., Uszkoreit, J. and Vaswani, A. (2018). "Self-Attention with Relative Position Representations". NAACL.
    • Su, J. et al. (2021). "RoFormer: Enhanced Transformer with Rotary Position Embedding". arXiv:2104.09864.