Transformer Decoder
| Article | |
|---|---|
| Topic area | Deep Learning |
| Prerequisites | Transformer, Attention Mechanism, Self-Attention |
Summary
A transformer decoder is the autoregressive half of the Transformer architecture, designed to generate output sequences one token at a time, conditioning on previously generated tokens and, optionally, on an encoded input sequence. Introduced by Vaswani et al. in 2017,[1] it replaced the convolutional and recurrent layers used in earlier sequence-to-sequence models with stacked layers of Self-Attention, cross-attention, and position-wise feed-forward networks. Decoders form the backbone of modern generative language models such as GPT and LLaMA, as well as the decoding side of translation systems such as the original Transformer and T5.
The defining property of the decoder is causal (or masked) attention: at each position, the model can attend only to the current and earlier positions in the target sequence. This restriction enforces the autoregressive factorization of the joint distribution and lets the decoder be used both for parallel training over complete sequences and for token-by-token sampling at inference time.
Architectural Position
In the original encoder-decoder formulation, the decoder receives two inputs: the partial target sequence (shifted one position to the right) and the contextual representations produced by the Transformer Encoder. Each decoder layer therefore contains three sublayers instead of the encoder's two:
- A masked multi-head self-attention sublayer over the target sequence built so far.
- A multi-head cross-attention sublayer that queries the encoder outputs.
- A position-wise feed-forward network (FFN).
Each sublayer is wrapped in a residual connection and Layer Normalization. After the final decoder layer, a linear projection followed by a softmax produces a distribution over the vocabulary at each position.
In decoder-only models such as GPT, the cross-attention sublayer is removed and the model attends only to its own past tokens. The architecture is otherwise identical, which is why decoder-only and encoder-decoder transformers share the same fundamental compute pattern.
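The composition below is a minimal sketch of one encoder-decoder decoder layer using PyTorch's built-in multi-head attention. The module name, default sizes, and the post-layer-norm ordering follow the original design, but the code is illustrative rather than a reference implementation.

```python
import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, x, enc_out):
        # Causal mask: True above the diagonal marks disallowed (future) positions.
        T = x.size(1)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        h, _ = self.self_attn(x, x, x, attn_mask=causal)      # masked self-attention
        x = self.norm1(x + h)                                  # residual + post-LN, as in 2017
        h, _ = self.cross_attn(x, enc_out, enc_out)            # queries decoder, keys/values encoder
        x = self.norm2(x + h)
        return self.norm3(x + self.ffn(x))                     # position-wise FFN sublayer

layer = DecoderLayer()
out = layer(torch.randn(2, 7, 512), torch.randn(2, 9, 512))    # (batch, T_dec, d), (batch, T_enc, d)
```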
Layer Components
Masked self-attention
Given a target sequence of length $ T $ represented as a matrix $ X \in \mathbb{R}^{T \times d} $, masked self-attention computes queries, keys, and values through learned projections $ Q = X W^Q $, $ K = X W^K $, $ V = X W^V $, and applies scaled dot-product attention with a causal mask:
$ {\displaystyle \mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}} + M\right) V} $
where $ M_{ij} = 0 $ if $ j \le i $ and $ M_{ij} = -\infty $ otherwise. The mask removes the contribution of future positions after the softmax, so position $ i $ sees only positions $ 1, \ldots, i $. Multiple attention heads are computed in parallel and concatenated.
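As a concrete illustration, the following sketch implements the single-head masked attention formula above directly; the projection matrices and dimensions are arbitrary stand-ins.

```python
import math
import torch

def masked_self_attention(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)              # (T, T) attention logits
    T = X.size(-2)
    # M: 0 on and below the diagonal, -inf strictly above it (future positions).
    M = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
    return torch.softmax(scores + M, dim=-1) @ V                   # position i sees only 1..i

T, d, d_k = 5, 16, 8
X = torch.randn(T, d)
W_q, W_k, W_v = (torch.randn(d, d_k) for _ in range(3))
out = masked_self_attention(X, W_q, W_k, W_v)                      # shape (T, d_k)
```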
Cross-attention
In encoder-decoder decoders, the second sublayer uses queries derived from the decoder's hidden states and keys/values derived from the encoder outputs $ Z $:
$ {\displaystyle \mathrm{CrossAttn}(H, Z) = \mathrm{softmax}\!\left(\frac{(H W^Q)(Z W^K)^\top}{\sqrt{d_k}}\right)(Z W^V)} $
No causal mask is applied here because the entire input sequence is available; the only mask used is for padding positions.
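A matching sketch for single-head cross-attention; the padding mask is omitted for brevity and all names are stand-ins.

```python
import math
import torch

def cross_attention(H, Z, W_q, W_k, W_v):
    # Queries come from decoder states H; keys and values from encoder outputs Z.
    Q, K, V = H @ W_q, Z @ W_k, Z @ W_v
    scores = Q @ K.transpose(-2, -1) / math.sqrt(Q.size(-1))       # (T_dec, T_enc), no causal mask
    return torch.softmax(scores, dim=-1) @ V

H, Z = torch.randn(4, 16), torch.randn(6, 16)                      # decoder / encoder states
W_q, W_k, W_v = (torch.randn(16, 8) for _ in range(3))
out = cross_attention(H, Z, W_q, W_k, W_v)                         # shape (4, 8)
```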
Feed-forward network
The third sublayer is a position-wise Multi-Layer Perceptron applied independently at each position:
$ {\displaystyle \mathrm{FFN}(x) = \sigma(x W_1 + b_1) W_2 + b_2} $
Common choices for $ \sigma $ are ReLU (original transformer), GELU (BERT, GPT-2), and gated variants such as SwiGLU (LLaMA, PaLM).
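A minimal sketch of the position-wise FFN, with the activation left as a parameter so the variants above can be swapped in; the sizes shown are typical defaults, not requirements.

```python
import torch
import torch.nn as nn

def make_ffn(d_model=512, d_ff=2048, act=nn.GELU):
    # The same two linear maps are applied independently at every position.
    # Gated variants such as SwiGLU use an extra linear branch and are not shown here.
    return nn.Sequential(nn.Linear(d_model, d_ff), act(), nn.Linear(d_ff, d_model))

ffn = make_ffn()
out = ffn(torch.randn(2, 7, 512))   # shape preserved: (batch, T, d_model)
```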
Causal Masking and Autoregression
The decoder factorizes the joint probability of an output sequence as
$ {\displaystyle p(y_1, \ldots, y_T \mid x) = \prod_{t=1}^{T} p(y_t \mid y_{<t}, x)} $
Causal masking is what makes this factorization exact at training time despite processing the entire sequence in parallel. Each position predicts the next token, and the masked self-attention guarantees no information leaks from future positions. This is the same factorization used by Recurrent Neural Network language models, but the transformer evaluates all conditional probabilities simultaneously rather than recurrently, enabling efficient training on modern accelerators.
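A small numerical illustration of this factorization, using random stand-in logits in place of a real decoder's outputs:

```python
import torch

# Stand-in decoder outputs: logits for T=4 positions over a 10-token vocabulary.
logits = torch.randn(4, 10)
targets = torch.tensor([3, 1, 7, 2])                        # y_1 .. y_4

log_probs = torch.log_softmax(logits, dim=-1)
# log p(y_1..y_4 | x) = sum_t log p(y_t | y_<t, x); with causal masking,
# every term comes from the same parallel forward pass.
seq_log_prob = log_probs[torch.arange(4), targets].sum()
```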
Training: Teacher Forcing
Decoders are trained with Teacher Forcing: the ground-truth target sequence is shifted right (prepending a <BOS> token) and passed as input, while the unshifted sequence serves as the prediction target. The loss is the average per-token Cross-Entropy:
$ {\displaystyle \mathcal{L} = -\frac{1}{T} \sum_{t=1}^{T} \log p_\theta(y_t \mid y_{<t}, x)} $
Because the entire target sequence is processed in one forward pass, training is fully parallel across sequence positions. This contrasts sharply with Recurrent Neural Network decoders, whose sequential computation prevents efficient parallelization. Teacher forcing also introduces exposure bias, a mismatch between training (where the model conditions on ground-truth prefixes) and inference (where it conditions on its own previous samples).
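The sketch below shows the mechanics of the shifted input and the per-token loss; `model` is a random stand-in for a real decoder, and the assumption that token id 1 is <BOS> is purely illustrative.

```python
import torch
import torch.nn.functional as F

vocab = 16
# Stand-in decoder: random logits of the right shape, just to show the data flow.
model = lambda ids: torch.randn(ids.size(0), ids.size(1), vocab)

target = torch.tensor([[5, 9, 2, 8]])                      # ground-truth y_1..y_T
bos = torch.full((1, 1), 1)                                # assume id 1 is <BOS>
decoder_input = torch.cat([bos, target[:, :-1]], dim=1)    # shift right
logits = model(decoder_input)                              # (1, T, vocab) in one parallel pass
loss = F.cross_entropy(logits.reshape(-1, vocab), target.reshape(-1))
```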
Inference: Autoregressive Generation
At inference time the decoder produces output tokens one at a time. Starting from <BOS> (and the encoder output, if applicable), the model:
- Computes the conditional distribution over the next token.
- Selects a token via Greedy Decoding, Beam Search, or sampling strategies such as Top-k Sampling, Nucleus Sampling, or Temperature Sampling.
- Appends the chosen token to the sequence and repeats.
Generation stops at an <EOS> token or a length limit. Because each new token requires all earlier keys and values, naive inference is quadratic in sequence length. Production systems use a KV cache that stores the keys and values from previous steps, reducing per-step compute to linear in current length and making long-form generation tractable.
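A greedy decoding loop in that spirit, again with a random stand-in model and illustrative <BOS>/<EOS> ids; a production system would additionally reuse a KV cache across iterations rather than recomputing attention over the full prefix.

```python
import torch

vocab, BOS, EOS, max_len = 16, 1, 2, 20
# Stand-in decoder returning per-position logits.
model = lambda ids: torch.randn(ids.size(0), ids.size(1), vocab)

seq = torch.tensor([[BOS]])
for _ in range(max_len):
    logits = model(seq)                                    # KV caching would go here
    next_id = logits[:, -1].argmax(dim=-1, keepdim=True)   # greedy choice over next token
    seq = torch.cat([seq, next_id], dim=1)                 # append and repeat
    if next_id.item() == EOS:
        break
```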
Decoder-Only Models
The most influential decoder variant in modern practice is the decoder-only transformer popularized by GPT.[2] By removing cross-attention, treating both inputs and outputs as a single token stream, and training on a generic next-token-prediction objective over web-scale text, decoder-only models unify many tasks under a single architecture. Few-shot prompting then conditions behavior at inference without any parameter updates.[3]
This shift has reshaped the field: most large Language Models released since 2020, including GPT-3, GPT-4, LLaMA, Mistral, and DeepSeek, are decoder-only. Encoder-decoder architectures persist where input and output are clearly distinct (machine translation, summarization, T5-style tasks).
Variants and Optimizations
Practical decoders depart from the original 2017 specification in several ways:
- Pre-LN — moving Layer Normalization before each sublayer rather than after dramatically improves training stability for deep stacks and is now standard (see the sketch after this list).
- RMSNorm — replaces LayerNorm with a simpler root-mean-square normalization (LLaMA, T5).
- RoPE and ALiBi — relative position schemes that improve length extrapolation versus the original learned absolute embeddings.
- Grouped-Query Attention (GQA) and Multi-Query Attention (MQA) — share keys and values across query heads to shrink the KV cache during inference.
- FlashAttention — an IO-aware exact attention implementation that reduces memory bandwidth pressure.
- Mixture of Experts (MoE) — sparse FFN blocks that increase capacity at fixed inference compute (Switch, Mixtral, DeepSeek-V3).
- Speculative Decoding — uses a small draft model to propose tokens that the larger decoder verifies in parallel, accelerating sampling without changing the output distribution.
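To make the first item concrete, the following sketch contrasts the original post-LN residual wrapping with the pre-LN ordering; `sublayer` stands for either an attention block or an FFN, and all names are illustrative.

```python
import torch
import torch.nn as nn

def post_ln_block(x, sublayer, norm):
    return norm(x + sublayer(x))      # 2017 ordering: normalize after the residual add

def pre_ln_block(x, sublayer, norm):
    return x + sublayer(norm(x))      # normalize the sublayer input; residual path stays untouched

x = torch.randn(2, 7, 512)
ffn = nn.Sequential(nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 512))
out = pre_ln_block(x, ffn, nn.LayerNorm(512))
```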
Comparison with Encoders
Although encoder and decoder layers share the same primitives, three differences are crucial:
- Attention masking: encoders use bidirectional attention (every position sees every other), while decoders are restricted to past positions.
- Number of sublayers: encoder-decoder decoders include cross-attention; encoders and decoder-only models do not.
- Training objective: encoders are typically trained with Masked Language Modeling (BERT-style fill-in-the-blank), whereas decoders are trained with Causal Language Modeling (next-token prediction).
These differences make encoders well suited to representation tasks such as classification and retrieval, and decoders well suited to open-ended generation. Encoder-decoder hybrids combine both for conditional generation tasks.
Limitations
- Quadratic attention cost: self-attention is $ O(T^2 d) $ in sequence length. Long-context models address this with sparse attention, linear attention, State Space Models, or sliding-window schemes.
- Exposure bias: teacher forcing can produce models that drift on their own samples; mitigations include scheduled sampling and reinforcement learning from human feedback (RLHF).
- KV cache memory: autoregressive inference requires storing keys and values for every layer and head across the full context, dominating memory at long lengths.
- Sequential decoding: generation is fundamentally token-by-token, limiting throughput. Speculative decoding and parallel decoding methods chip away at this constraint but do not eliminate it.
- Hallucination and miscalibration: because the decoder is trained purely to maximize likelihood of next tokens, it has no built-in mechanism for factual grounding.
References
- ↑ Vaswani, A., et al., "Attention Is All You Need," 2017.
- ↑ Radford, A., et al., "Improving Language Understanding by Generative Pre-Training," 2018.
- ↑ Brown, T., et al., "Language Models are Few-Shot Learners," 2020.