Transformer Decoder
| Article | |
|---|---|
| Topic area | Deep Learning |
| Prerequisites | Transformer, Attention Mechanism, Self-Attention |
Overview
The transformer decoder is the autoregressive part of the Transformer architecture, designed to generate an output sequence one token at a time while conditioning on previously generated tokens and, optionally, on an encoded input sequence. Introduced by Vaswani et al. in 2017,[1] it replaces the recurrent and convolutional layers of earlier sequence-to-sequence models with stacked layers of Self-Attention, cross-attention, and position-wise feed-forward networks. The decoder forms the core of modern generative language models such as GPT and LLaMA, as well as the decoding half of translation systems such as the original Transformer and T5.
The decoder's defining feature is causal (or masked) attention: at each position, the model may attend only to the current and earlier positions of the target sequence. This constraint enforces an autoregressive factorization of the joint distribution, allowing the decoder to be used both for parallel training over complete sequences and for token-by-token sampling at inference time.
Architectural Context
In the original encoder-decoder formulation, the decoder receives two inputs: the partial target sequence (shifted right by one position) and the contextual representations produced by the Transformer Encoder. Each decoder layer therefore contains three sublayers rather than the encoder's two:
- A masked multi-head self-attention sublayer over the target sequence generated so far.
- A multi-head cross-attention sublayer that queries the encoder output.
- A position-wise feed-forward network (FFN).
Each sublayer is wrapped in a residual connection and Layer Normalization. After the final decoder layer, a linear projection followed by a softmax produces a distribution over the vocabulary at each position.
In decoder-only models such as GPT, the cross-attention sublayer is removed and the model attends only to its own past tokens. The architecture is otherwise identical, which is why decoder-only and encoder-decoder transformers share the same basic computational pattern.
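The layer structure can be illustrated with a minimal PyTorch sketch; the module name `DecoderLayer`, the hyperparameters, and the post-LN ordering follow the 2017 paper but are illustrative rather than taken from any particular library:
```python
import torch.nn as nn

class DecoderLayer(nn.Module):
    """One decoder layer: masked self-attention, cross-attention, and an FFN,
    each wrapped in a residual connection and LayerNorm (post-LN, as in the original paper)."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1, self.norm2, self.norm3 = nn.LayerNorm(d_model), nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x, enc_out, causal_mask):
        # x: (B, T, d_model) target embeddings; enc_out: (B, T_src, d_model) encoder output
        # causal_mask: (T, T) boolean mask, True above the diagonal (future positions forbidden)
        a, _ = self.self_attn(x, x, x, attn_mask=causal_mask)   # 1. masked self-attention over the prefix
        x = self.norm1(x + a)
        a, _ = self.cross_attn(x, enc_out, enc_out)             # 2. queries from decoder, keys/values from encoder
        x = self.norm2(x + a)
        return self.norm3(x + self.ffn(x))                      # 3. position-wise feed-forward network
```
A decoder-only layer is obtained by deleting `cross_attn` and its residual branch.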
Layer Components
Masked Self-Attention
Given a target sequence of length $ T $ represented as a matrix $ X \in \mathbb{R}^{T \times d} $, masked self-attention computes queries, keys, and values through learned projections $ Q = X W^Q $, $ K = X W^K $, $ V = X W^V $, and applies scaled dot-product attention with a causal mask:
$ {\displaystyle \mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}} + M\right) V} $
where $ M_{ij} = 0 $ if $ j \le i $ and $ M_{ij} = -\infty $ otherwise. After the softmax, the masked entries receive zero weight, so position $ i $ can attend only to positions $ 1, \ldots, i $. Multiple attention heads are computed in parallel and concatenated.
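A direct, single-head rendering of this formula (a sketch for clarity; no batching, dropout, or head splitting):
```python
import math
import torch

def masked_self_attention(X, Wq, Wk, Wv):
    """X: (T, d) target representations; Wq/Wk/Wv: (d, d_k) learned projections."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    T, d_k = Q.shape
    scores = Q @ K.T / math.sqrt(d_k)                        # (T, T) scaled dot products
    M = torch.full((T, T), float("-inf")).triu(diagonal=1)   # M_ij = -inf for j > i, 0 otherwise
    return torch.softmax(scores + M, dim=-1) @ V
```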
Cross-Attention
In the decoder of an encoder-decoder model, the second sublayer uses queries derived from the decoder hidden states and keys/values derived from the encoder output $ Z $:
$ {\displaystyle \mathrm{CrossAttn}(H, Z) = \mathrm{softmax}\!\left(\frac{(H W^Q)(Z W^K)^\top}{\sqrt{d_k}}\right)(Z W^V)} $
No causal mask is applied here because the entire input sequence is available; the only mask used is for padding positions.
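Under the same simplified conventions, a sketch of the cross-attention sublayer; the `src_pad_mask` argument marking padded source positions is illustrative:
```python
import math
import torch

def cross_attention(H, Z, Wq, Wk, Wv, src_pad_mask=None):
    """H: (T_tgt, d) decoder hidden states; Z: (T_src, d) encoder output."""
    Q, K, V = H @ Wq, Z @ Wk, Z @ Wv
    scores = Q @ K.T / math.sqrt(Q.shape[-1])                # (T_tgt, T_src), no causal mask
    if src_pad_mask is not None:                             # True where the source token is padding
        scores = scores.masked_fill(src_pad_mask[None, :], float("-inf"))
    return torch.softmax(scores, dim=-1) @ V
```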
Feed-Forward Network
The third sublayer is a position-wise Multi-Layer Perceptron applied independently at each position:
$ {\displaystyle \mathrm{FFN}(x) = \sigma(x W_1 + b_1) W_2 + b_2} $
Common choices for $ \sigma $ are ReLU (original transformer), GELU (BERT, GPT-2), and gated variants such as SwiGLU (LLaMA, PaLM).
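Both forms can be written in a few lines (a sketch; the weight shapes, e.g. $ W_1 \in \mathbb{R}^{d \times d_{ff}} $, are assumed):
```python
import torch.nn.functional as F

def ffn_gelu(x, W1, b1, W2, b2):
    # two-layer FFN with a GELU nonlinearity (GPT-2 style)
    return F.gelu(x @ W1 + b1) @ W2 + b2

def ffn_swiglu(x, W_gate, W_up, W_down):
    # gated variant (SwiGLU): silu(x W_gate) elementwise-scales x W_up before the down-projection
    return (F.silu(x @ W_gate) * (x @ W_up)) @ W_down
```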
Causal Masking and Autoregression
The decoder factorizes the joint probability of an output sequence as
$ {\displaystyle p(y_1, \ldots, y_T \mid x) = \prod_{t=1}^{T} p(y_t \mid y_{<t}, x)} $
Causal masking is what makes this factorization exact at training time despite processing the entire sequence in parallel. Each position predicts the next token, and the masked self-attention guarantees no information leaks from future positions. This is the same factorization used by Recurrent Neural Network language models, but the transformer evaluates all conditional probabilities simultaneously rather than recurrently, enabling efficient training on modern accelerators.
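The guarantee can be checked numerically: altering tokens after position $ i $ must leave the outputs at positions $ \le i $ unchanged. A small sketch reusing the `masked_self_attention` function defined above:
```python
import torch

torch.manual_seed(0)
T, d = 8, 16
X = torch.randn(T, d)
Wq, Wk, Wv = torch.randn(d, d), torch.randn(d, d), torch.randn(d, d)

out = masked_self_attention(X, Wq, Wk, Wv)

X_perturbed = X.clone()
X_perturbed[5:] = torch.randn(T - 5, d)          # change only the "future" positions 5..7
out_perturbed = masked_self_attention(X_perturbed, Wq, Wk, Wv)

# Positions 0..4 are unaffected: no information leaks backward through the causal mask.
assert torch.allclose(out[:5], out_perturbed[:5])
```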
Training: Teacher Forcing
Decoders are trained with Teacher Forcing: the ground-truth target sequence is shifted right (prepending a <BOS> token) and passed as input, while the unshifted sequence serves as the prediction target. The loss is the average per-token Cross-Entropy:
$ {\displaystyle \mathcal{L} = -\frac{1}{T} \sum_{t=1}^{T} \log p_\theta(y_t \mid y_{<t}, x)} $
Because the entire target sequence is processed in one forward pass, training is fully parallel across sequence positions. This contrasts sharply with Recurrent Neural Network decoders, whose sequential computation prevents efficient parallelization. Teacher forcing also introduces exposure bias, a mismatch between training (where the model conditions on ground-truth prefixes) and inference (where it conditions on its own previous samples).
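A sketch of this training step, assuming a decoder-only `model(input_ids)` that returns per-position logits of shape `(B, T, vocab)` (the interface is hypothetical):
```python
import torch
import torch.nn.functional as F

def teacher_forcing_loss(model, target_ids, bos_id):
    """target_ids: (B, T) ground-truth token ids."""
    B, T = target_ids.shape
    bos = torch.full((B, 1), bos_id, dtype=target_ids.dtype, device=target_ids.device)
    input_ids = torch.cat([bos, target_ids[:, :-1]], dim=1)   # shift right: [<BOS>, y_1, ..., y_{T-1}]
    logits = model(input_ids)                                  # (B, T, vocab), one parallel forward pass
    # average per-token cross-entropy against the unshifted targets
    return F.cross_entropy(logits.reshape(B * T, -1), target_ids.reshape(B * T))
```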
Inference: Autoregressive Generation
At inference time the decoder produces output tokens one at a time. Starting from <BOS> (and the encoder output, if applicable), the model:
- Computes the conditional distribution over the next token.
- Selects a token via Greedy Decoding, Beam Search, or sampling strategies such as Top-k Sampling, Nucleus Sampling, or Temperature Sampling.
- Appends the chosen token to the sequence and repeats.
Generation stops at an <EOS> token or a length limit. Because each new token attends to all earlier keys and values, naively re-running the full forward pass at every step makes per-step cost quadratic in the current length. Production systems use a KV cache that stores the keys and values from previous steps, so each step processes only the newest token, reducing per-step attention cost to linear in the current length and making long-form generation tractable.
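A sketch of greedy generation with a KV cache; the interface `model(input_ids, past_kv)` returning `(logits, past_kv)` is an assumption for illustration, and real libraries differ in detail:
```python
import torch

@torch.no_grad()
def generate_greedy(model, bos_id, eos_id, max_new_tokens=100):
    ids = torch.tensor([[bos_id]])
    past_kv = None                                   # per-layer keys/values from earlier steps
    next_input = ids
    for _ in range(max_new_tokens):
        logits, past_kv = model(next_input, past_kv) # only the newest token is processed each step
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=1)
        if next_id.item() == eos_id:
            break
        next_input = next_id                         # feed back one token; history lives in the cache
    return ids
```
Replacing the `argmax` with sampling from the softmax distribution (optionally after temperature scaling or top-k/nucleus truncation) yields the stochastic decoding strategies listed above.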
Decoder-Only Models
The most influential decoder variant in modern practice is the decoder-only transformer popularized by GPT.[2] By removing cross-attention, treating both inputs and outputs as a single token stream, and training on a generic next-token-prediction objective over web-scale text, decoder-only models unify many tasks under a single architecture. Few-shot prompting then conditions behavior at inference without any parameter updates.[3]
This shift has reshaped the field: most large Language Models released since 2020, including GPT-3, GPT-4, LLaMA, Mistral, and DeepSeek, are decoder-only. Encoder-decoder architectures persist where input and output are clearly distinct (machine translation, summarization, T5-style tasks).
Variants and Optimizations
Practical decoders depart from the original 2017 specification in several ways:
- Pre-LN — moving Layer Normalization before each sublayer rather than after dramatically improves training stability for deep stacks and is now standard.
- RMSNorm — replaces LayerNorm with a simpler root-mean-square normalization (LLaMA, T5).
- RoPE and ALiBi — relative position schemes that improve length extrapolation versus the original learned absolute embeddings (a RoPE sketch follows this list).
- Grouped-Query Attention (GQA) and Multi-Query Attention (MQA) — share keys and values across query heads to shrink the KV cache during inference.
- FlashAttention — an IO-aware exact attention implementation that reduces memory bandwidth pressure.
- Mixture of Experts (MoE) — sparse FFN blocks that increase capacity at fixed inference compute (Switch, Mixtral, DeepSeek-V3).
- Speculative Decoding — uses a small draft model to propose tokens that the larger decoder verifies in parallel, accelerating sampling without changing the output distribution.
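As one concrete example, rotary position embeddings apply a position-dependent rotation to queries and keys before the attention product. A minimal sketch in the split-half convention used by several open implementations (shapes and names are illustrative):
```python
import torch

def apply_rope(x, base=10000.0):
    """x: (T, n_heads, head_dim). After rotation, dot products between queries
    and keys depend only on their relative positions."""
    T, H, D = x.shape
    half = D // 2
    inv_freq = base ** (-torch.arange(half, dtype=torch.float32) / half)        # theta_i = base^(-2i/D)
    angles = torch.arange(T, dtype=torch.float32)[:, None] * inv_freq[None, :]  # (T, half)
    cos, sin = angles.cos()[:, None, :], angles.sin()[:, None, :]               # broadcast over heads
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```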
Comparison with Encoders
Although encoder and decoder layers share the same primitives, three differences are crucial:
- Attention masking: encoders use bidirectional attention (every position sees every other), while decoders are restricted to past positions.
- Number of sublayers: encoder-decoder decoders include cross-attention; encoders and decoder-only models do not.
- Training objective: encoders are typically trained with Masked Language Modeling (BERT-style fill-in-the-blank), whereas decoders are trained with Causal Language Modeling (next-token prediction).
These differences make encoders well suited to representation tasks such as classification and retrieval, and decoders well suited to open-ended generation. Encoder-decoder hybrids combine both for conditional generation tasks.
Limitations
- Quadratic attention cost: self-attention is $ O(T^2 d) $ in sequence length. Long-context models address this with sparse attention, linear attention, State Space Models, or sliding-window schemes.
- Exposure bias: teacher forcing can produce models that drift on their own samples; mitigations include scheduled sampling and reinforcement learning from human feedback (RLHF).
- KV cache memory: autoregressive inference requires storing keys and values for every layer and head across the full context, dominating memory at long lengths (a worked estimate follows this list).
- Sequential decoding: generation is fundamentally token-by-token, limiting throughput. Speculative decoding and parallel decoding methods chip away at this constraint but do not eliminate it.
- Hallucination and miscalibration: because the decoder is trained purely to maximize likelihood of next tokens, it has no built-in mechanism for factual grounding.
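To make the KV-cache limitation concrete, a back-of-the-envelope estimate assuming a LLaMA-7B-like configuration (32 layers, 32 KV heads, head dimension 128, 16-bit precision); the numbers are illustrative:
```python
def kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128,
                   seq_len=4096, bytes_per_value=2, batch=1):
    # factor of 2 for keys and values, stored for every layer, head, and position
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value * batch

print(kv_cache_bytes() / 2**30)               # ~2.0 GiB for one 4096-token sequence
print(kv_cache_bytes(n_kv_heads=8) / 2**30)   # ~0.5 GiB with grouped-query attention (8 KV heads)
```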
References
- ↑ Vaswani, A., et al., "Attention Is All You Need," 2017.
- ↑ Radford, A., et al., "Improving Language Understanding by Generative Pre-Training," 2018.
- ↑ Brown, T., et al., "Language Models are Few-Shot Learners," 2020.