Transformer Decoder

    Topic area: Deep Learning
    Prerequisites: Transformer, Attention Mechanism, Self-Attention


    Overview

    A transformer decoder is the autoregressive half of the Transformer architecture, designed to generate output sequences one token at a time while conditioning on previously generated tokens and, optionally, on an encoded input sequence. Introduced by Vaswani et al. in 2017,[1] it replaced the recurrent and convolutional layers used in earlier sequence-to-sequence models with stacked layers of Self-Attention, cross-attention, and position-wise feed-forward networks. Decoders form the backbone of modern generative language models such as GPT, LLaMA, and the decoding side of translation systems like the original Transformer and T5.

    The defining property of the decoder is causal (or masked) attention: at every position, the model is permitted to attend only to current and earlier positions in the target sequence. This constraint enforces the autoregressive factorization of the joint distribution and makes the decoder usable both for parallel training over full sequences and for token-by-token sampling at inference time.

    Architectural Position

    In the original encoder-decoder formulation, the decoder receives two inputs: the partial target sequence (shifted right by one position) and the contextual representations produced by the Transformer Encoder. Each decoder layer consequently contains three sublayers rather than the encoder's two:

    1. A masked multi-head self-attention sublayer over the target sequence so far.
    2. A multi-head cross-attention sublayer that queries encoder outputs.
    3. A position-wise feed-forward network (FFN).

    Each sublayer is wrapped in a residual connection and Layer Normalization. After the final decoder layer, a linear projection followed by a softmax produces a distribution over the vocabulary at every position.

    In decoder-only models such as GPT, the cross-attention sublayer is removed and the model attends only to its own past tokens. The architecture is otherwise identical, which is why decoder-only and encoder-decoder transformers share the same fundamental compute pattern.
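
    To make the layer structure concrete, the following is a minimal sketch of one encoder-decoder decoder layer in PyTorch, using post-layer normalization as in the original 2017 formulation. The class and argument names are illustrative, not drawn from any particular library.

        import torch
        import torch.nn as nn

        class DecoderLayer(nn.Module):
            def __init__(self, d_model: int, n_heads: int, d_ff: int):
                super().__init__()
                self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
                self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
                self.ffn = nn.Sequential(
                    nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
                )
                self.norm1 = nn.LayerNorm(d_model)
                self.norm2 = nn.LayerNorm(d_model)
                self.norm3 = nn.LayerNorm(d_model)

            def forward(self, x, enc_out, causal_mask):
                # 1. Masked self-attention over the target prefix.
                a, _ = self.self_attn(x, x, x, attn_mask=causal_mask)
                x = self.norm1(x + a)
                # 2. Cross-attention: queries from the decoder, keys/values from the encoder.
                a, _ = self.cross_attn(x, enc_out, enc_out)
                x = self.norm2(x + a)
                # 3. Position-wise feed-forward network.
                return self.norm3(x + self.ffn(x))

    A decoder-only layer is obtained by deleting the cross-attention sublayer and its normalization.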

    Layer Components

    Masked self-attention

    Given a target sequence of length $ T $ represented as a matrix $ X \in \mathbb{R}^{T \times d} $, masked self-attention computes queries, keys, and values via learned projections $ Q = X W^Q $, $ K = X W^K $, $ V = X W^V $, and applies scaled dot-product attention with a causal mask:

    $ {\displaystyle \mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}} + M\right) V} $

    where $ M_{ij} = 0 $ if $ j \le i $ and $ M_{ij} = -\infty $ otherwise. Adding $ -\infty $ to the logits of future positions drives their softmax weights to zero, so position $ i $ attends only to positions $ 1, \ldots, i $. Multiple attention heads are computed in parallel and concatenated.
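
    A minimal single-head sketch of this computation in PyTorch, assuming the projections $ W^Q, W^K, W^V $ have already been applied so that Q, K, and V are $ T \times d_k $ tensors:

        import torch

        def causal_attention(Q, K, V):
            T, d_k = Q.shape
            scores = Q @ K.T / d_k ** 0.5                      # (T, T) attention logits
            # M: additive mask with -inf strictly above the diagonal (j > i)
            M = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
            weights = torch.softmax(scores + M, dim=-1)        # row i attends to 1..i
            return weights @ V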

    Cross-attention

    In encoder-decoder decoders, the second sublayer uses queries derived from the decoder's hidden states and keys/values derived from the encoder outputs $ Z $:

    $ {\displaystyle \mathrm{CrossAttn}(H, Z) = \mathrm{softmax}\!\left(\frac{(H W^Q)(Z W^K)^\top}{\sqrt{d_k}}\right)(Z W^V)} $

    No causal mask is applied here because the entire input sequence is available; the only mask used is for padding positions.
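
    A corresponding sketch, with decoder states H of shape $ T_{dec} \times d $, encoder outputs Z of shape $ T_{enc} \times d $, and the projection matrices assumed given:

        import torch

        def cross_attention(H, Z, Wq, Wk, Wv):
            d_k = Wk.shape[1]
            # Queries come from the decoder, keys/values from the encoder; no causal mask.
            scores = (H @ Wq) @ (Z @ Wk).T / d_k ** 0.5        # (T_dec, T_enc)
            return torch.softmax(scores, dim=-1) @ (Z @ Wv)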

    Feed-forward network

    The third sublayer is a position-wise Multi-Layer Perceptron applied independently at each position:

    $ {\displaystyle \mathrm{FFN}(x) = \sigma(x W_1 + b_1) W_2 + b_2} $

    Common choices for $ \sigma $ are ReLU (original transformer), GELU (BERT, GPT-2), and gated variants such as SwiGLU (LLaMA, PaLM).
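
    As a sketch, the original dimensions ($ d = 512 $ with inner dimension 2048) combined with a GELU activation would be written in PyTorch as:

        import torch.nn as nn

        ffn = nn.Sequential(
            nn.Linear(512, 2048),  # x W_1 + b_1 (inner dimension 4x the model width)
            nn.GELU(),             # sigma; ReLU in the original transformer
            nn.Linear(2048, 512),  # (.) W_2 + b_2, projecting back to d
        )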

    Causal Masking and Autoregression

    The decoder factorizes the joint probability of an output sequence as

    $ {\displaystyle p(y_1, \ldots, y_T \mid x) = \prod_{t=1}^{T} p(y_t \mid y_{<t}, x)} $

    Causal masking is what makes this factorization exact at training time despite processing the entire sequence in parallel. Each position predicts the next token, and the masked self-attention guarantees no information leaks from future positions. This is the same factorization used by Recurrent Neural Network language models, but the transformer evaluates all conditional probabilities simultaneously rather than recurrently, enabling efficient training on modern accelerators.

    Training: Teacher Forcing

    Decoders are trained with Teacher Forcing: the ground-truth target sequence is shifted right (prepending a <BOS> token) and passed as input, while the unshifted sequence serves as the prediction target. The loss is the average per-token Cross-Entropy:

    $ {\displaystyle \mathcal{L} = -\frac{1}{T} \sum_{t=1}^{T} \log p_\theta(y_t \mid y_{<t}, x)} $

    Because the entire target sequence is processed in one forward pass, training is fully parallel across sequence positions. This contrasts sharply with Recurrent Neural Network decoders, whose sequential computation prevents efficient parallelization. Teacher forcing also introduces exposure bias, a mismatch between training (where the model conditions on ground-truth prefixes) and inference (where it conditions on its own previous samples).
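
    A sketch of the loss computation in PyTorch, assuming logits of shape (batch, T, vocab) from a single forward pass over the shifted inputs and targets holding the unshifted token ids:

        import torch.nn.functional as F

        def lm_loss(logits, targets):
            # Flatten (batch, T, vocab) -> (batch*T, vocab) and average the
            # per-token cross-entropy over every position in one parallel pass.
            return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                                   targets.reshape(-1))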

    Inference: Autoregressive Generation

    At inference time the decoder produces output tokens one at a time. Starting from <BOS> (and the encoder output, if applicable), the model:

    1. Computes the conditional distribution over the next token.
    2. Selects a token via Greedy Decoding, Beam Search, or sampling strategies such as Top-k Sampling, Nucleus Sampling, or Temperature Sampling.
    3. Appends the chosen token to the sequence and repeats.

    Generation stops at an <EOS> token or a length limit. Naively, each step recomputes keys and values for the entire prefix, so a single step costs quadratic work in the current length. Production systems use a KV cache that stores the keys and values from previous steps, reducing per-step compute to linear in the current length and making long-form generation tractable.
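
    A greedy decoding loop with a KV cache might look like the following sketch, assuming a hypothetical model whose forward pass accepts the cache and returns an updated one; the interface is illustrative, loosely modeled on common implementations.

        import torch

        def generate(model, bos_id, eos_id, max_len=128):
            tokens, cache = [bos_id], None
            for _ in range(max_len):
                x = torch.tensor([[tokens[-1]]])       # feed only the newest token
                logits, cache = model(x, cache=cache)  # cached K/V: O(T) per step
                next_id = int(logits[0, -1].argmax())  # greedy selection
                tokens.append(next_id)
                if next_id == eos_id:
                    break
            return tokens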

    Decoder-Only Models

    The most influential decoder variant in modern practice is the decoder-only transformer popularized by GPT.[2] By removing cross-attention, treating both inputs and outputs as a single token stream, and training on a generic next-token-prediction objective over web-scale text, decoder-only models unify many tasks under a single architecture. Few-shot prompting then conditions behavior at inference without any parameter updates.[3]

    This shift has reshaped the field: most large Language Models released since 2020, including GPT-3, GPT-4, LLaMA, Mistral, and DeepSeek, are decoder-only. Encoder-decoder architectures persist where input and output are clearly distinct (machine translation, summarization, T5-style tasks).

    Variants and Optimizations

    Practical decoders depart from the original 2017 specification in several ways:

    • Pre-LN — moving Layer Normalization before each sublayer rather than after dramatically improves training stability for deep stacks and is now standard.
    • RMSNorm — replaces LayerNorm with a simpler root-mean-square normalization (LLaMA, T5); a sketch follows this list.
    • RoPE and ALiBi — relative position schemes that improve length extrapolation versus the original learned absolute embeddings.
    • Grouped-Query Attention (GQA) and Multi-Query Attention (MQA) — share keys and values across query heads to shrink the KV cache during inference.
    • FlashAttention — an IO-aware exact attention implementation that reduces memory bandwidth pressure.
    • Mixture of Experts (MoE) — sparse FFN blocks that increase capacity at fixed inference compute (Switch, Mixtral, DeepSeek-V3).
    • Speculative Decoding — uses a small draft model to propose tokens that the larger decoder verifies in parallel, accelerating sampling without changing the output distribution.
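
    As referenced above, RMSNorm drops the mean subtraction and bias of LayerNorm, rescaling by the root mean square alone. A minimal PyTorch sketch:

        import torch
        import torch.nn as nn

        class RMSNorm(nn.Module):
            def __init__(self, d: int, eps: float = 1e-6):
                super().__init__()
                self.weight = nn.Parameter(torch.ones(d))  # learned per-channel gain
                self.eps = eps

            def forward(self, x):
                # Divide by the RMS over the feature dimension; no mean, no bias.
                rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
                return x * rms * self.weight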

    Comparison with Encoders

    Although encoder and decoder layers share the same primitives, three differences are crucial:

    1. Attention masking: encoders use bidirectional attention (every position sees every other), while decoders are restricted to past positions.
    2. Number of sublayers: encoder-decoder decoders include cross-attention; encoders and decoder-only models do not.
    3. Training objective: encoders are typically trained with Masked Language Modeling (BERT-style fill-in-the-blank), whereas decoders are trained with Causal Language Modeling (next-token prediction).

    These differences make encoders well suited to representation tasks such as classification and retrieval, and decoders well suited to open-ended generation. Encoder-decoder hybrids combine both for conditional generation tasks.

    Limitations

    • Quadratic attention cost: self-attention is $ O(T^2 d) $ in sequence length. Long-context models address this with sparse attention, linear attention, State Space Models, or sliding-window schemes.
    • Exposure bias: teacher forcing can produce models that drift on their own samples; mitigations include scheduled sampling and reinforcement learning from human feedback (RLHF).
    • KV cache memory: autoregressive inference requires storing keys and values for every layer and head across the full context, dominating memory at long lengths.
    • Sequential decoding: generation is fundamentally token-by-token, limiting throughput. Speculative decoding and parallel decoding methods chip away at this constraint but do not eliminate it.
    • Hallucination and miscalibration: because the decoder is trained purely to maximize likelihood of next tokens, it has no built-in mechanism for factual grounding.

    References

    1. Vaswani, A., et al., "Attention Is All You Need," 2017.
    2. Radford, A., et al., "Improving Language Understanding by Generative Pre-Training," 2018.
    3. Brown, T., et al., "Language Models are Few-Shot Learners," 2020.