ALiBi Positional Bias
| Article | |
|---|---|
| Topic area | transformers |
| Prerequisites | Transformer, Self-Attention, Positional Encoding |
Overview
Attention with Linear Biases (ALiBi) is a method for injecting positional information into Transformer attention by adding a fixed, distance-dependent bias to attention scores rather than embedding positions into the token representations. Introduced by Press, Smith, and Lewis in 2021, ALiBi replaces sinusoidal and learned positional embeddings with a static per-head bias matrix whose entries grow linearly with the distance between query and key positions. Each head penalizes attention to far-away tokens at its own rate, controlled by a fixed per-head slope.
The defining property of ALiBi is its ability to extrapolate to sequence lengths far longer than those seen during training. Standard positional encodings deteriorate sharply when evaluated on inputs longer than the training context, but a model trained on length 1024 with ALiBi can be evaluated on length 2048, 4096, or longer with only modest perplexity degradation, without any retraining or interpolation. This property made ALiBi an early workhorse for long-context language modeling and the positional method of choice in BLOOM, MPT, and Replit Code.
Motivation
The original Transformer uses sinusoidal positional encodings that are added to the input token embeddings. Learned absolute positional embeddings (as in BERT and GPT-2) replace the fixed sinusoids with trainable vectors. Both approaches share a fundamental limitation: positions beyond the training range either fall outside the embedding table (learned) or land on parts of the sinusoid the model has never had to interpret (sinusoidal). Empirically, perplexity on out-of-range positions explodes.
Press, Smith, and Lewis observed that a Transformer's positional information ultimately enters the network through the attention scores. Instead of routing position through the input embeddings, why not encode it directly where it is consumed? ALiBi removes positional embeddings entirely and instead biases the attention logits by a function of the relative distance between query and key. Because the bias is a simple linear function of distance with no learned parameters, evaluating it at a previously unseen distance is well-defined and behaves smoothly.
Formulation
Standard scaled dot-product attention computes, for queries $ Q $, keys $ K $, and values $ V $,
$ {\displaystyle \operatorname{Attention}(Q, K, V) = \operatorname{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V.} $
ALiBi modifies this by adding a per-head bias matrix $ B^{(h)} $ inside the softmax, before the row-wise normalization,
$ {\displaystyle \operatorname{Attention}^{(h)}(Q, K, V) = \operatorname{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}} + B^{(h)}\right) V.} $
In the causal language modeling setting, the bias for head $ h $ at query position $ i $ and key position $ j $ with $ j \le i $ is
$ {\displaystyle B^{(h)}_{ij} = -m_h \cdot (i - j),} $
where $ m_h > 0 $ is a fixed (non-learned) per-head slope and $ (i - j) \ge 0 $ is the causal distance. Positions $ j > i $ are masked to $ -\infty $ as usual. For a bidirectional model, the bias is symmetric: $ B^{(h)}_{ij} = -m_h \cdot |i - j| $.
The key features are: no positional embedding is added to the inputs; the bias depends only on the distance $ i - j $, not on the absolute positions; and the slopes $ m_h $ are not learned. Each head therefore implements an exponentially decaying preference for nearby tokens, with the rate of decay controlled by its slope.
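To see the decay explicitly, note that the softmax exponentiates its argument, so the linear bias factors out of each unnormalized attention weight:
$ {\displaystyle \exp\!\left(\frac{q_i k_j^\top}{\sqrt{d_k}} - m_h (i - j)\right) = \exp\!\left(\frac{q_i k_j^\top}{\sqrt{d_k}}\right) \cdot e^{-m_h (i - j)},} $
a geometric decay in distance at rate $ e^{-m_h} $. The following PyTorch sketch (function names and structure are ours, not from the paper) builds the causal bias and applies it in unfused attention; the slope formula anticipates the schedule given in the next section.

```python
import torch

def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    """Causal ALiBi bias of shape (n_heads, seq_len, seq_len).

    Entry [h, i, j] equals -m_h * (i - j) for j <= i and -inf for j > i.
    Slopes m_h = 2^(-8h/n) follow the schedule in the next section.
    """
    slopes = 2.0 ** (-8.0 * torch.arange(1, n_heads + 1) / n_heads)
    pos = torch.arange(seq_len)
    dist = pos[:, None] - pos[None, :]                # (i - j), shape (L, L)
    bias = -slopes[:, None, None] * dist              # linear distance penalty
    return bias.masked_fill(dist < 0, float("-inf"))  # mask future positions

def alibi_attention(q, k, v):
    """q, k, v: (batch, n_heads, seq_len, d_k). Minimal, unfused sketch."""
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5
    scores = scores + alibi_bias(q.size(1), q.size(2)).to(scores)
    return torch.softmax(scores, dim=-1) @ v
```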
Slope Schedule
ALiBi assigns slopes geometrically across heads. For an attention layer with $ n $ heads, the slopes are
$ {\displaystyle m_h = 2^{-8 h / n}, \quad h = 1, 2, \ldots, n.} $
For $ n = 8 $ heads, this gives slopes $ 1/2, 1/4, 1/8, \ldots, 1/256 $. For $ n = 16 $, the slopes run from $ 2^{-1/2} $ down to $ 2^{-8} $ in steps of $ 2^{-1/2} $: every other slope recovers the 8-head schedule, and the slopes in between sit at the geometric means of their neighbors. Heads with larger slopes attend almost exclusively to recent tokens; heads with smaller slopes have a near-flat bias and can attend to distant context. The schedule was chosen empirically by the authors and held fixed across model sizes; despite its simplicity, it has not been substantially improved upon.
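A direct transcription of the schedule, as a sketch (the authors' released code additionally special-cases head counts that are not powers of two, interleaving the two nearest power-of-two schedules; the formula below matches it when $ n $ is a power of two):

```python
import torch

def get_slopes(n_heads: int) -> torch.Tensor:
    """Slopes m_h = 2^(-8h/n) for h = 1..n_heads, steepest first."""
    return 2.0 ** (-8.0 * torch.arange(1, n_heads + 1) / n_heads)

print(get_slopes(8))   # 1/2, 1/4, ..., 1/256
print(get_slopes(16))  # 2^-0.5, 2^-1, 2^-1.5, ..., 2^-8
```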
Crucially, the slopes are constants, not parameters. They are not updated by Backpropagation, do not appear in the optimizer state, and add zero parameters to the model. In parameter count ALiBi therefore matches Rotary Position Embedding, which is likewise parameter-free, and undercuts learned positional embeddings.
Length Extrapolation
The headline result of the original ALiBi paper is length extrapolation. A 1.3B-parameter Transformer trained on 1024-token contexts and evaluated on 3072-token contexts stays within a small margin of the perplexity it would have achieved had it been trained at 3072 tokens. Sinusoidal and learned positional embeddings, evaluated on the same out-of-distribution length, see perplexity blow up by an order of magnitude or more.
The mechanism is intuitive. The bias $ -m_h (i - j) $ is a linear function of distance and is well-defined for any non-negative $ i - j $, including values larger than anything seen in training. The model never has to interpret a positional input it has not encountered; it has only to apply softmax to a linear bias whose shape is identical to what it has always seen. The attention distribution naturally concentrates on nearby tokens at long range, in proportion to each head's slope, which the model learns to exploit during training.
In practice, the extrapolation is not free. Quality continues to degrade gradually with length, and very long evaluations (more than several times the training length) lose information about distant tokens because all heads' biases become large and negative. Still, the graceful degradation makes ALiBi a strong choice when training-time context length is constrained but inference-time context may grow.
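As a concrete illustration, reusing the hypothetical `alibi_bias` sketch from the Formulation section: because the bias depends only on distance, the bias for a longer evaluation length contains the training-length bias as its top-left block, so short-range structure looks exactly as it did during training.

```python
train_bias = alibi_bias(n_heads=8, seq_len=1024)  # "training" context length
eval_bias = alibi_bias(n_heads=8, seq_len=4096)   # 4x longer at evaluation
# Extrapolation is prefix-consistent: no retraining or interpolation needed.
assert torch.allclose(eval_bias[:, :1024, :1024], train_bias)
```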
Comparison with Other Position Methods
Sinusoidal positional encodings add a fixed function of position to the input embeddings. They are absolute (each position has a distinct encoding), parameter-free, and in principle defined for any position; in practice they fail to extrapolate because the trained network has not learned to interpret the high-frequency components beyond the training range.
Learned absolute positional embeddings allocate a trainable vector to each position. They cannot extrapolate at all; positions beyond the table do not exist.
Relative positional encodings (T5 bias, Shaw et al.) add a learned scalar bias to attention scores based on the bucketed relative distance between query and key. ALiBi is closely related but uses a fixed linear bias instead of a learned bucketed one, which both eliminates parameters and enables extrapolation to unseen distances.
Rotary positional embedding (RoPE) rotates pairs of dimensions in $ Q $ and $ K $ by an angle proportional to position. RoPE is now more popular than ALiBi in modern large language models (LLaMA, Mistral, Qwen, Gemma) because it tends to give slightly better in-distribution quality and supports interpolation-based context extension via NTK-aware or YaRN scaling. RoPE does not extrapolate as cleanly as ALiBi out of the box, but combined with position-interpolation tricks it has overtaken ALiBi for long-context fine-tuning.
Variants and Extensions
A number of refinements have been proposed.
Symmetric ALiBi applies $ B_{ij} = -m_h |i - j| $ for bidirectional encoders such as BERT-style models. The original paper focuses on causal models, but the symmetric variant has been used in encoder-only and encoder-decoder settings.
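A minimal sketch of the symmetric variant (naming is ours); it differs from the causal bias only in taking the absolute distance and dropping the mask:

```python
import torch

def symmetric_alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    """Bidirectional ALiBi: B[h, i, j] = -m_h * |i - j|, no causal mask."""
    slopes = 2.0 ** (-8.0 * torch.arange(1, n_heads + 1) / n_heads)
    pos = torch.arange(seq_len)
    return -slopes[:, None, None] * (pos[:, None] - pos[None, :]).abs()
```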
Learnable slopes make $ m_h $ trainable rather than fixed. This usually does not improve quality and partially destroys the extrapolation guarantee, since learned slopes can drift in directions that depend on the training distribution.
Sandwich and KERPLE replace the linear bias with other monotone functions of distance (logarithmic, kernelized) that retain the extrapolation property while sometimes giving better in-distribution perplexity. These methods are conceptually descendants of ALiBi and slot into the same architectural position.
Dynamic NTK ALiBi adjusts the slope schedule at inference time to extend the effective context further, analogous to NTK-aware RoPE scaling.
Implementation Notes
ALiBi requires no changes to the input pipeline, no positional embedding layer, and no modification of $ Q $ or $ K $. Implementations typically precompute the bias matrix $ B^{(h)} $ once per sequence length and add it to the attention logits. In FlashAttention and similar fused kernels, the bias is computed on the fly from the row and column indices, using a small per-head slope buffer of size $ n $.
For inference with key-value caching, the bias depends only on the distance between the current query position and each cached key position, both of which are known. Streaming inference with a growing context therefore needs no special handling beyond extending the bias to the new positions.
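For a single decoding step this reduces to a vector of biases over the cached keys plus the current token; a sketch under the same assumptions as the earlier code:

```python
import torch

def decode_step_bias(slopes: torch.Tensor, cache_len: int) -> torch.Tensor:
    """ALiBi bias for one new query at position t = cache_len attending to
    keys 0..t (the cache plus itself). Shape: (n_heads, 1, cache_len + 1)."""
    dist = cache_len - torch.arange(cache_len + 1)  # t - j for j = 0..t
    return -slopes[:, None, None] * dist            # broadcast over heads
```

Each step adds one entry to this vector; earlier biases never need to be recomputed.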
ALiBi is compatible with Layer Normalization, decoder and encoder architectures, and standard attention masking. Because no positional information is mixed into the residual stream, the model also dispenses with a positional embedding table entirely, making it slightly smaller than models with learned positional embeddings.
Limitations
ALiBi has several known weaknesses. First, although extrapolation is graceful, in-distribution quality on benchmarks where positional precision matters (e.g. exact-match span retrieval, character-level tasks, certain code-completion settings) is typically a touch below well-tuned RoPE. Second, the slope schedule is fixed and not adapted per task or per layer; some heads end up effectively unused because their slopes are too steep to attend beyond the immediate neighborhood. Third, the linear-in-distance bias is a strong inductive bias toward locality; tasks that require attending to a single distant token (e.g. retrieving a key from the start of a long context) become harder as context grows, since the bias actively suppresses far-away keys.
These trade-offs explain why ALiBi has been partially displaced by RoPE plus position-interpolation methods in recent large language models, even though it remains a clean, parameter-free baseline for length extrapolation and is still in active use in production systems trained before 2023.
References
- Press, O., Smith, N. A., and Lewis, M. "Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation." International Conference on Learning Representations (ICLR), 2022. arXiv:2108.12409.
- Vaswani, A. et al. "Attention Is All You Need." Advances in Neural Information Processing Systems (NeurIPS), 2017. arXiv:1706.03762.
- Shaw, P., Uszkoreit, J., and Vaswani, A. "Self-Attention with Relative Position Representations." North American Chapter of the Association for Computational Linguistics (NAACL), 2018. arXiv:1803.02155.
- Su, J. et al. "RoFormer: Enhanced Transformer with Rotary Position Embedding." Neurocomputing, 2024. arXiv:2104.09864.
- Chi, T.-C., Fan, T.-H., Ramadge, P. J., and Rudnicky, A. I. "KERPLE: Kernelized Relative Positional Embedding for Length Extrapolation." Advances in Neural Information Processing Systems (NeurIPS), 2022. arXiv:2205.09921.
- Le Scao, T. et al. "BLOOM: A 176B-Parameter Open-Access Multilingual Language Model." 2022. arXiv:2211.05100.