Transformer Decoder
| Article | |
|---|---|
| Topic area | Deep Learning |
| Prerequisites | Transformer, Attention Mechanism, Self-Attention |
Overview
The transformer decoder is the autoregressive component of the Transformer architecture, designed to generate an output sequence one token at a time while conditioning on previously generated tokens and, optionally, on an encoded input sequence. Introduced by Vaswani et al. in 2017,[1] it replaces the recurrent and convolutional layers of earlier sequence-to-sequence models with stacked layers of Self-Attention, cross-attention, and position-wise feed-forward networks. Decoders form the core of modern generative language models such as GPT and LLaMA, as well as the decoding side of translation systems such as the original Transformer and T5.
The defining property of the decoder is causal (or masked) attention: at each position the model may attend only to the current and earlier positions in the target sequence. This constraint enforces an autoregressive factorization of the joint distribution, allowing the decoder to be used both for parallel training over complete sequences and for token-by-token sampling at inference time.
Role in the Architecture
In the original encoder-decoder formulation, the decoder receives two inputs: the partial target sequence (shifted right by one position) and the contextual representations produced by the Transformer Encoder. Each decoder layer therefore contains three sublayers rather than the encoder's two:
- A masked multi-head self-attention sublayer operating on the current target sequence.
- A multi-head cross-attention sublayer that queries the encoder output.
- A position-wise feed-forward network (FFN).
Each sublayer is wrapped in a residual connection and Layer Normalization. After the final decoder layer, a linear projection followed by a softmax produces a distribution over the vocabulary at each position.
In decoder-only models such as GPT, the cross-attention sublayer is removed and the model attends only to its own past tokens. The architecture is otherwise identical, which is why decoder-only and encoder-decoder transformers share the same basic computational pattern.
Layer Components
Masked self-attention
Given a target sequence of length $ T $ represented as a matrix $ X \in \mathbb{R}^{T \times d} $, masked self-attention computes queries, keys, and values through learned projections $ Q = X W^Q $, $ K = X W^K $, $ V = X W^V $, and applies scaled dot-product attention with a causal mask:
$ {\displaystyle \mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}} + M\right) V} $
where $ M_{ij} = 0 $ if $ j \le i $ and $ M_{ij} = -\infty $ otherwise. The masked entries receive zero weight after the softmax, so position $ i $ can attend only to positions $ 1, \ldots, i $. Multiple attention heads are computed in parallel and concatenated.
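A minimal single-head sketch of this computation, assuming PyTorch and omitting batching, dropout, and the multi-head split/concatenation:

```python
import torch
import torch.nn.functional as F

def masked_self_attention(X, W_q, W_k, W_v):
    """Single-head causal self-attention over a (T, d) target sequence."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v            # learned projections, (T, d_k) each
    d_k = Q.shape[-1]
    scores = Q @ K.T / d_k ** 0.5                  # scaled dot products, (T, T)
    T = scores.shape[0]
    # Additive causal mask M: 0 where j <= i, -inf where j > i
    M = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
    weights = F.softmax(scores + M, dim=-1)        # future positions receive zero weight
    return weights @ V
```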
Cross-attention
In an encoder-decoder model, the decoder's second sublayer uses queries derived from the decoder hidden states $ H $ and keys/values derived from the encoder output $ Z $:
$ {\displaystyle \mathrm{CrossAttn}(H, Z) = \mathrm{softmax}\!\left(\frac{(H W^Q)(Z W^K)^\top}{\sqrt{d_k}}\right)(Z W^V)} $
No causal mask is applied here because the entire input sequence is available; the only mask used is for padding positions.
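Continuing the earlier sketch (same assumed PyTorch setup, with the padding mask omitted), cross-attention differs only in where the keys and values come from:

```python
import torch.nn.functional as F

def cross_attention(H, Z, W_q, W_k, W_v):
    """Decoder states H (T_dec, d) attend over encoder outputs Z (T_enc, d); no causal mask."""
    Q, K, V = H @ W_q, Z @ W_k, Z @ W_v
    d_k = Q.shape[-1]
    scores = Q @ K.T / d_k ** 0.5                  # (T_dec, T_enc)
    return F.softmax(scores, dim=-1) @ V
```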
Feed-forward network
The third sublayer is a position-wise Multi-Layer Perceptron applied independently at each position:
$ {\displaystyle \mathrm{FFN}(x) = \sigma(x W_1 + b_1) W_2 + b_2} $
Common choices for $ \sigma $ are ReLU (original transformer), GELU (BERT, GPT-2), and gated variants such as SwiGLU (LLaMA, PaLM).
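A sketch of the ReLU variant in PyTorch (an assumed dependency); GELU or a gated SwiGLU block would replace the activation in the models named above:

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Position-wise FFN: the same two-layer MLP applied at every sequence position."""
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff)   # expand (conventionally d_ff = 4 * d_model)
        self.w2 = nn.Linear(d_ff, d_model)   # project back

    def forward(self, x):
        return self.w2(torch.relu(self.w1(x)))   # ReLU as in the original transformer
```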
Causal Masking and Autoregression
The decoder factorizes the joint probability of an output sequence as
$ {\displaystyle p(y_1, \ldots, y_T \mid x) = \prod_{t=1}^{T} p(y_t \mid y_{<t}, x)} $
Causal masking is what makes this factorization exact at training time despite processing the entire sequence in parallel. Each position predicts the next token, and the masked self-attention guarantees no information leaks from future positions. This is the same factorization used by Recurrent Neural Network language models, but the transformer evaluates all conditional probabilities simultaneously rather than recurrently, enabling efficient training on modern accelerators.
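The chain-rule sum can be read off directly from a single parallel forward pass; a toy sketch with placeholder logits (hypothetical shapes, PyTorch assumed):

```python
import torch
import torch.nn.functional as F

# Hypothetical logits from one parallel forward pass over a length-4 target:
# row t depends only on tokens y_1..y_t because of the causal mask, yet all rows
# are produced simultaneously.
T, vocab = 4, 10
logits = torch.randn(T, vocab)             # placeholder values for illustration
targets = torch.randint(0, vocab, (T,))    # the ground-truth "next tokens"
log_probs = F.log_softmax(logits, dim=-1)
seq_log_prob = log_probs[torch.arange(T), targets].sum()   # log p(y_1..y_T | x) via the chain rule
```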
Training: Teacher Forcing
Decoders are trained with Teacher Forcing: the ground-truth target sequence is shifted right (prepending a <BOS> token) and passed as input, while the unshifted sequence serves as the prediction target. The loss is the average per-token Cross-Entropy:
$ {\displaystyle \mathcal{L} = -\frac{1}{T} \sum_{t=1}^{T} \log p_\theta(y_t \mid y_{<t}, x)} $
Because the entire target sequence is processed in one forward pass, training is fully parallel across sequence positions. This contrasts sharply with Recurrent Neural Network decoders, whose sequential computation prevents efficient parallelization. Teacher forcing also introduces exposure bias, a mismatch between training (where the model conditions on ground-truth prefixes) and inference (where it conditions on its own previous samples).
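A sketch of this procedure (PyTorch assumed; the `model` interface, `bos_id`, and the absence of padding handling are simplifications for illustration):

```python
import torch
import torch.nn.functional as F

def teacher_forced_loss(model, target_ids, bos_id):
    """Average per-token cross-entropy under teacher forcing.

    `model` maps (batch, T) token ids to (batch, T, vocab) logits with
    causal masking applied internally; `target_ids` is the unshifted target.
    """
    bos = torch.full((target_ids.size(0), 1), bos_id, dtype=target_ids.dtype)
    decoder_input = torch.cat([bos, target_ids[:, :-1]], dim=1)    # shift right, prepend <BOS>
    logits = model(decoder_input)                                  # one parallel forward pass
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),    # mean over all positions
                           target_ids.reshape(-1))
```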
Inference: Autoregressive Generation
At inference time the decoder produces output tokens one at a time. Starting from <BOS> (and the encoder output, if applicable), the model:
- Computes the conditional distribution over the next token.
- Selects a token via Greedy Decoding, Beam Search, or sampling strategies such as Top-k Sampling, Nucleus Sampling, or Temperature Sampling.
- Appends the chosen token to the sequence and repeats.
Generation stops at an <EOS> token or a length limit. Because each new token requires all earlier keys and values, naive inference that re-runs the full prefix for every step has per-step cost quadratic in the current length. Production systems use a KV cache that stores the keys and values from previous steps, reducing per-step compute to linear in the current length and making long-form generation tractable.
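A minimal greedy-decoding loop illustrating the procedure (hypothetical `model` interface, PyTorch assumed); a production implementation would feed cached keys/values instead of re-encoding the whole prefix on every iteration:

```python
import torch

@torch.no_grad()
def generate_greedy(model, bos_id, eos_id, max_len):
    """Greedy decoding that re-runs the full prefix each step (no KV cache, for clarity)."""
    tokens = torch.tensor([[bos_id]])                  # (1, t) growing prefix
    for _ in range(max_len):
        logits = model(tokens)                         # (1, t, vocab); causal mask inside the model
        next_id = int(logits[0, -1].argmax())          # greedy: most probable next token
        tokens = torch.cat([tokens, torch.tensor([[next_id]])], dim=1)
        if next_id == eos_id:
            break
    return tokens[0].tolist()
```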
Decoder-Only Models
The most influential decoder variant in modern practice is the decoder-only transformer popularized by GPT.[2] By removing cross-attention, treating both inputs and outputs as a single token stream, and training on a generic next-token-prediction objective over web-scale text, decoder-only models unify many tasks under a single architecture. Few-shot prompting then conditions behavior at inference without any parameter updates.[3]
This shift has reshaped the field: most large Language Models released since 2020, including GPT-3, GPT-4, LLaMA, Mistral, and DeepSeek, are decoder-only. Encoder-decoder architectures persist where input and output are clearly distinct (machine translation, summarization, T5-style tasks).
Variants and Optimizations
Practical decoders depart from the original 2017 specification in several ways:
- Pre-LN — moving Layer Normalization before each sublayer rather than after it dramatically improves training stability for deep stacks and is now standard.
- RMSNorm — replaces LayerNorm with a simpler root-mean-square normalization (LLaMA, T5); see the sketch after this list.
- RoPE and ALiBi — relative position schemes that improve length extrapolation versus the original learned absolute embeddings.
- Grouped-Query Attention (GQA) and Multi-Query Attention (MQA) — share keys and values across query heads to shrink the KV cache during inference.
- FlashAttention — an IO-aware exact attention implementation that reduces memory bandwidth pressure.
- Mixture of Experts (MoE) — sparse FFN blocks that increase capacity at fixed inference compute (Switch, Mixtral, DeepSeek-V3).
- Speculative Decoding — uses a small draft model to propose tokens that the larger decoder verifies in parallel, accelerating sampling without changing the output distribution.
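As one concrete example of these departures, a minimal RMSNorm sketch (PyTorch assumed, following the LLaMA/T5-style formulation with no mean subtraction or bias):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root-mean-square normalization: rescale each position by its feature RMS."""
    def __init__(self, d_model, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(d_model))   # learned per-feature gain
        self.eps = eps

    def forward(self, x):
        inv_rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * inv_rms * self.weight
```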
Comparison with Encoders
Although encoder and decoder layers share the same primitives, three differences are crucial:
- Attention masking: encoders use bidirectional attention (every position sees every other), while decoders are restricted to past positions.
- Number of sublayers: encoder-decoder decoders include cross-attention; encoders and decoder-only models do not.
- Training objective: encoders are typically trained with Masked Language Modeling (BERT-style fill-in-the-blank), whereas decoders are trained with Causal Language Modeling (next-token prediction).
These differences make encoders well suited to representation tasks such as classification and retrieval, and decoders well suited to open-ended generation. Encoder-decoder hybrids combine both for conditional generation tasks.
Limitations
- Quadratic attention cost: self-attention is $ O(T^2 d) $ in sequence length. Long-context models address this with sparse attention, linear attention, State Space Models, or sliding-window schemes.
- Exposure bias: teacher forcing can produce models that drift on their own samples; mitigations include scheduled sampling and reinforcement learning from human feedback (RLHF).
- KV cache memory: autoregressive inference requires storing keys and values for every layer and head across the full context, dominating memory at long lengths.
- Sequential decoding: generation is fundamentally token-by-token, limiting throughput. Speculative decoding and parallel decoding methods chip away at this constraint but do not eliminate it.
- Hallucination and miscalibration: because the decoder is trained purely to maximize likelihood of next tokens, it has no built-in mechanism for factual grounding.
References
- ↑ Vaswani, A., et al., "Attention Is All You Need," 2017.
- ↑ Radford, A., et al., "Improving Language Understanding by Generative Pre-Training," 2018.
- ↑ Brown, T., et al., "Language Models are Few-Shot Learners," 2020.