Attention Mechanisms

    From Marovi AI
    Topic area Deep Learning
    Difficulty Advanced
    Prerequisites Neural Networks, Recurrent Neural Networks

    Attention mechanisms are a family of techniques that allow neural networks to focus selectively on relevant parts of their input when producing each element of the output. Originally introduced to overcome the limitations of fixed-length context vectors in sequence-to-sequence models, attention has become the foundational building block of modern architectures such as the Transformer.

    Motivation

    Early sequence-to-sequence models encoded an entire input sequence into a single fixed-dimensional vector using a recurrent neural network. This bottleneck forced long-range dependencies to be compressed into a vector of constant size, degrading performance on long sequences. Attention resolves this by letting the decoder consult every encoder hidden state at each generation step, weighting them by learned relevance scores.

    Bahdanau (Additive) Attention

    Bahdanau et al. (2015) proposed the first widely adopted attention mechanism for machine translation. Given encoder hidden states $ h_1, \dots, h_T $ and the previous decoder state $ s_{t-1} $, the alignment score for encoder position $ i $ at decoding step $ t $ is computed as:

    $ e_{t,i} = v^{\!\top} \tanh(W_s\, s_{t-1} + W_h\, h_i) $

    where $ W_s $, $ W_h $, and $ v $ are learned parameters. The attention weights are obtained by applying softmax:

    $ \alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{j=1}^{T} \exp(e_{t,j})} $

    The context vector is the weighted sum $ c_t = \sum_{i=1}^{T} \alpha_{t,i}\, h_i $, which is concatenated with $ s_{t-1} $ and fed into the decoder.
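    The computation above can be sketched in NumPy as follows. This is a minimal illustration, not the original implementation; the dimension names and randomly initialised parameters are assumptions for the example.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def additive_attention(s_prev, H, W_s, W_h, v):
    """Bahdanau-style additive attention (illustrative sketch).

    s_prev : (d_dec,)    previous decoder state s_{t-1}
    H      : (T, d_enc)  encoder hidden states h_1..h_T
    W_s    : (d_attn, d_dec), W_h : (d_attn, d_enc), v : (d_attn,)
    Returns the context vector c_t (d_enc,) and the weights alpha (T,).
    """
    # e_{t,i} = v^T tanh(W_s s_{t-1} + W_h h_i), computed for all i at once
    scores = np.tanh(s_prev @ W_s.T + H @ W_h.T) @ v   # (T,)
    alpha = softmax(scores)                            # attention weights
    c_t = alpha @ H                                    # weighted sum of encoder states
    return c_t, alpha
```

    The weights $ \alpha_{t,i} $ are non-negative and sum to one, so the context vector is a convex combination of the encoder states.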

    Luong (Multiplicative) Attention

    Luong et al. (2015) simplified the scoring function by replacing the additive network with a dot product or a bilinear form:

    • Dot: $ e_{t,i} = s_t^{\!\top} h_i $
    • General: $ e_{t,i} = s_t^{\!\top} W_a\, h_i $
    • Concat: $ e_{t,i} = v^{\!\top} \tanh(W_a [s_t;\, h_i]) $

    The dot variant requires encoder and decoder dimensions to match, while the general variant introduces a learnable weight matrix $ W_a $.
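    The three scoring variants can be sketched in a single NumPy function. The function name and argument conventions here are assumptions for illustration; note that $ W_a $ has a different shape in the general and concat variants.

```python
import numpy as np

def luong_scores(s_t, H, variant="dot", W_a=None, v=None):
    """Luong-style alignment scores for decoder state s_t against encoder states H.

    s_t : (d_dec,), H : (T, d_enc).
    "dot"     requires d_dec == d_enc.
    "general" takes W_a : (d_dec, d_enc).
    "concat"  takes W_a : (d_attn, d_dec + d_enc) and v : (d_attn,).
    """
    if variant == "dot":        # e_{t,i} = s_t^T h_i
        return H @ s_t
    if variant == "general":    # e_{t,i} = s_t^T W_a h_i
        return H @ (W_a.T @ s_t)
    if variant == "concat":     # e_{t,i} = v^T tanh(W_a [s_t; h_i])
        stacked = np.concatenate([np.tile(s_t, (H.shape[0], 1)), H], axis=1)
        return np.tanh(stacked @ W_a.T) @ v
    raise ValueError(f"unknown variant: {variant}")
```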

    Scaled Dot-Product Attention

    Vaswani et al. (2017) introduced the formulation used in the Transformer. Given matrices of queries $ Q $, keys $ K $, and values $ V $:

    $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\!\top}}{\sqrt{d_k}}\right) V $

    The scaling factor $ \sqrt{d_k} $ prevents the dot products from growing large in magnitude as the key dimension $ d_k $ increases, which would push the softmax into regions of extremely small gradients.
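    A minimal NumPy sketch of this formulation (shapes chosen for illustration):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V.

    Q : (n_q, d_k), K : (n_k, d_k), V : (n_k, d_v).
    Returns the output (n_q, d_v) and the attention weights (n_q, n_k).
    """
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # (n_q, n_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # softmax over keys
    return weights @ V, weights
```

    Each row of the weight matrix is a distribution over the keys, so each output row is a weighted average of the value vectors.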

    Self-Attention

    In self-attention, the queries, keys, and values all derive from the same sequence. Each position attends to every position in the sequence (including itself), enabling the model to capture long-range dependencies in a single layer. For an input matrix $ X \in \mathbb{R}^{n \times d} $:

    $ Q = X W^Q, \quad K = X W^K, \quad V = X W^V $

    Self-attention has $ O(n^2 d) $ complexity, which can be expensive for very long sequences. Efficient variants such as sparse attention and linear attention reduce this cost.
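    Combining the projections with scaled dot-product attention gives a single self-attention layer; a sketch (the projection shapes are illustrative assumptions):

```python
import numpy as np

def self_attention(X, W_Q, W_K, W_V):
    """Single self-attention layer over X (n, d): Q, K, V all derive from X.

    W_Q, W_K : (d, d_k), W_V : (d, d_v). Returns (n, d_v).
    """
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V        # project the same input three ways
    scores = Q @ K.T / np.sqrt(K.shape[-1])    # (n, n): every position vs every position
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)         # softmax over keys
    return w @ V
```

    The $ (n, n) $ score matrix is the source of the quadratic cost mentioned above.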

    Multi-Head Attention

    Rather than performing a single attention function, multi-head attention runs $ h $ parallel attention heads with independent projections:

    $ \mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^O $

    where $ \mathrm{head}_i = \mathrm{Attention}(Q W_i^Q,\, K W_i^K,\, V W_i^V) $. Each head can learn to attend to different aspects of the input — for example, one head might capture syntactic relationships while another captures semantic ones. Typical configurations use 8 or 16 heads.
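    A common implementation trick is to compute all heads with one projection per role and then split the feature dimension. The sketch below assumes square $ d \times d $ projections and $ d $ divisible by $ h $:

```python
import numpy as np

def multi_head_attention(X, W_Q, W_K, W_V, W_O, h):
    """Multi-head self-attention with h heads of size d // h.

    X : (n, d); W_Q, W_K, W_V, W_O : (d, d). Returns (n, d).
    """
    n, d = X.shape
    assert d % h == 0, "model dimension must divide evenly across heads"
    d_h = d // h
    # Project once, then split the feature axis into h heads: (h, n, d_h)
    Q = (X @ W_Q).reshape(n, h, d_h).transpose(1, 0, 2)
    K = (X @ W_K).reshape(n, h, d_h).transpose(1, 0, 2)
    V = (X @ W_V).reshape(n, h, d_h).transpose(1, 0, 2)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_h)   # (h, n, n)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    heads = w @ V                                      # (h, n, d_h)
    concat = heads.transpose(1, 0, 2).reshape(n, d)    # Concat(head_1, ..., head_h)
    return concat @ W_O
```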

    Positional Encoding

    Because self-attention is permutation-invariant (it treats the input as an unordered set), positional information must be injected explicitly. The original Transformer uses sinusoidal encodings:

    $ \mathrm{PE}(pos, 2i) = \sin\!\left(\frac{pos}{10000^{2i/d}}\right), \quad \mathrm{PE}(pos, 2i+1) = \cos\!\left(\frac{pos}{10000^{2i/d}}\right) $

    Learned positional embeddings and relative positional encodings (e.g., RoPE, ALiBi) are common alternatives that can generalise better to unseen sequence lengths.
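    The sinusoidal scheme above can be computed in a few lines (assuming an even model dimension $ d $):

```python
import numpy as np

def sinusoidal_pe(n_pos, d):
    """Sinusoidal positional encodings of shape (n_pos, d), with d even."""
    pos = np.arange(n_pos)[:, None]          # (n_pos, 1)
    i = np.arange(d // 2)[None, :]           # (1, d/2): pair index
    angles = pos / (10000 ** (2 * i / d))    # (n_pos, d/2)
    pe = np.zeros((n_pos, d))
    pe[:, 0::2] = np.sin(angles)             # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)             # odd dimensions: cosine
    return pe
```

    The encodings are added to the token embeddings before the first attention layer; each dimension oscillates at a different wavelength, so every position receives a distinct pattern.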

    Cross-Attention

    Cross-attention is used when queries come from one sequence and keys/values come from another. In encoder-decoder Transformers, the decoder attends to encoder outputs via cross-attention, enabling the model to condition its generation on the full input context.
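    Structurally, cross-attention differs from self-attention only in where the projections are applied; a sketch under the same illustrative conventions as above:

```python
import numpy as np

def cross_attention(X_dec, X_enc, W_Q, W_K, W_V):
    """Cross-attention: queries from the decoder, keys/values from the encoder.

    X_dec : (n_dec, d), X_enc : (n_enc, d). Returns (n_dec, d_v).
    """
    Q = X_dec @ W_Q                            # queries from one sequence
    K, V = X_enc @ W_K, X_enc @ W_V            # keys/values from the other
    scores = Q @ K.T / np.sqrt(K.shape[-1])    # (n_dec, n_enc)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V
```

    Note that the output length follows the query sequence, so the decoder can attend over an input of any length.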

    Practical Considerations

    • Masking: In autoregressive decoding, future positions are masked (set to $ -\infty $ before softmax) to preserve the causal structure.
    • Attention dropout: Dropping attention weights randomly during training acts as a regulariser and reduces overfitting to specific alignment patterns.
    • Key-value caching: During inference, previously computed key and value vectors are cached to avoid redundant computation, significantly speeding up autoregressive generation.
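    The masking point can be made concrete with a small sketch: a lower-triangular boolean mask marks the allowed positions, and disallowed scores are set to $ -\infty $ before the softmax so they receive zero weight.

```python
import numpy as np

def causal_mask(n):
    """Lower-triangular mask: position i may attend to positions j <= i."""
    return np.tril(np.ones((n, n), dtype=bool))

def masked_softmax(scores, mask):
    """Softmax with disallowed positions set to -inf before normalisation."""
    scores = np.where(mask, scores, -np.inf)
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)
```

    After masking, every entry above the diagonal of the weight matrix is exactly zero, so no position can attend to the future.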

    References

    • Bahdanau, D., Cho, K. and Bengio, Y. (2015). "Neural Machine Translation by Jointly Learning to Align and Translate". ICLR.
    • Luong, M.-T., Pham, H. and Manning, C. D. (2015). "Effective Approaches to Attention-based Neural Machine Translation". EMNLP.
    • Vaswani, A. et al. (2017). "Attention Is All You Need". NeurIPS.
    • Shaw, P., Uszkoreit, J. and Vaswani, A. (2018). "Self-Attention with Relative Position Representations". NAACL.
    • Su, J. et al. (2021). "RoFormer: Enhanced Transformer with Rotary Position Embedding". arXiv:2104.09864.