Gated Recurrent Unit

    Topic area Deep Learning
    Prerequisites Recurrent Neural Networks, Long Short-Term Memory, Backpropagation


    Overview

    The Gated Recurrent Unit (GRU) is a recurrent neural network cell that uses gating mechanisms to control the flow of information across time steps. Introduced by Cho et al. in 2014 as part of an encoder-decoder model for statistical machine translation, the GRU was designed as a simpler alternative to the long short-term memory (LSTM) cell while preserving its ability to model long-range dependencies and mitigate the vanishing gradient problem.[1]

    Compared with the LSTM, the GRU merges the cell state and hidden state into a single vector and replaces the input, forget, and output gates with two gates: an update gate and a reset gate. The result is a cell with fewer parameters per unit, faster training, and competitive accuracy on a broad range of sequence-modeling tasks. While the rise of transformer architectures has displaced recurrent cells in many large-scale natural language workloads, GRUs remain widely used in low-latency inference settings, on-device speech models, time-series forecasting, and as components inside hybrid architectures.

    Background and Motivation

    Vanilla recurrent networks update a hidden state by applying a nonlinearity to a linear combination of the previous hidden state and the current input. When trained with backpropagation through time, the gradients of the loss with respect to early time steps involve repeated multiplications by the same Jacobian, which causes them to vanish or explode for long sequences. This makes it difficult for plain recurrent networks to learn dependencies that span more than a few dozen steps.
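    To make the mechanism concrete, the short NumPy sketch below (the matrix size and scaling are arbitrary choices for illustration, not taken from any reference implementation) multiplies a gradient vector by the same recurrent weight matrix repeatedly; with the largest singular value held below one, the gradient norm shrinks geometrically with the number of steps, and scaling the matrix above one makes it explode instead.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32
W = rng.standard_normal((d, d))
W *= 0.9 / np.linalg.norm(W, 2)  # rescale so the largest singular value is 0.9

# In a vanilla RNN with h_t = tanh(W h_{t-1} + V x_t), the Jacobian
# dh_t/dh_{t-1} is diag(tanh') @ W; the tanh' factor only shrinks gradients
# further, so repeated multiplication by W alone already shows the decay.
grad = np.ones(d)
for t in range(1, 101):
    grad = grad @ W
    if t in (1, 10, 50, 100):
        print(t, np.linalg.norm(grad))  # norm shrinks geometrically with t
```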

    The LSTM, proposed by Hochreiter and Schmidhuber in 1997, addresses this by introducing a separate cell state with additive updates and gates that learn when to write, retain, or read information. The LSTM became the dominant recurrent architecture for over a decade, but its three-gate design carries a substantial parameter and computation cost. The GRU was motivated by the question of whether a simpler gating scheme could match the LSTM's performance with fewer parameters, easier optimization, and a reduced memory footprint, properties that are especially attractive in the encoder-decoder setting, where many recurrent steps run per training example.

    Architecture

    A GRU cell processes one input vector per time step and emits one hidden state. At time step $ t $ it receives the current input $ x_t \in \mathbb{R}^{d_x} $ and the previous hidden state $ h_{t-1} \in \mathbb{R}^{d_h} $, and produces an updated hidden state $ h_t \in \mathbb{R}^{d_h} $ using two gates and one candidate activation:

    • The reset gate $ r_t $ controls how much of the previous hidden state is mixed into the candidate. When the reset gate is close to zero, the candidate ignores past context and behaves as if the sequence were starting fresh.
    • The update gate $ z_t $ controls the convex blend between the previous hidden state and the new candidate. When the update gate is close to zero, the cell effectively copies its previous state forward, which provides a near-identity path for gradients.
    • The candidate hidden state $ \tilde{h}_t $ is a tanh-activated proposal that, gated by $ r_t $, encodes new information from the current input together with a possibly down-weighted view of the past.

    Unlike the LSTM, there is no separate cell state and no output gate; the hidden state itself is what is exposed to subsequent layers and what is recurrently fed back into the cell.

    Mathematical Formulation

    Let $ \sigma $ denote the elementwise logistic sigmoid function and $ \odot $ the elementwise product. The standard GRU update equations are:

    $ {\displaystyle z_t = \sigma\bigl(W_z x_t + U_z h_{t-1} + b_z\bigr)} $

    $ {\displaystyle r_t = \sigma\bigl(W_r x_t + U_r h_{t-1} + b_r\bigr)} $

    $ {\displaystyle \tilde{h}_t = \tanh\bigl(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h\bigr)} $

    $ {\displaystyle h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t} $

    Here $ W_\bullet \in \mathbb{R}^{d_h \times d_x} $ map the input, $ U_\bullet \in \mathbb{R}^{d_h \times d_h} $ map the previous hidden state, and $ b_\bullet \in \mathbb{R}^{d_h} $ are biases. The total parameter count for a GRU cell is $ 3 \cdot d_h \cdot (d_x + d_h + 1) $, roughly three quarters that of an LSTM of the same hidden width, which uses four such weight blocks (three gates plus the candidate update) rather than the GRU's three. Some implementations use the convention $ h_t = z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t $; the two formulations are equivalent up to replacing $ z_t $ with $ 1 - z_t $.

    The convex combination in the final equation is the key reason GRUs can carry information across many time steps. Whenever $ z_t $ is small, $ h_t \approx h_{t-1} $, so the gradient $ \partial h_t / \partial h_{t-1} $ is close to the identity matrix and gradient signals propagate without geometric decay.
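    The equations above translate almost line for line into code. The following NumPy sketch is illustrative only (the sizes, random initialization, and gru_step helper are choices made for this example rather than any particular library's API); it implements one GRU step exactly as written, verifies the parameter count formula, and shows that a saturated-low update gate leaves the hidden state nearly unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
d_x, d_h = 8, 16  # illustrative input and hidden sizes

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# One (W, U, b) triple per block: update gate z, reset gate r, candidate h.
params = {name: [rng.standard_normal((d_h, d_x)) * 0.1,   # W_*
                 rng.standard_normal((d_h, d_h)) * 0.1,   # U_*
                 np.zeros(d_h)]                           # b_*
          for name in ("z", "r", "h")}

def gru_step(x_t, h_prev):
    """One GRU update following the equations in this section."""
    W_z, U_z, b_z = params["z"]
    W_r, U_r, b_r = params["r"]
    W_h, U_h, b_h = params["h"]
    z_t = sigmoid(W_z @ x_t + U_z @ h_prev + b_z)               # update gate
    r_t = sigmoid(W_r @ x_t + U_r @ h_prev + b_r)               # reset gate
    h_tilde = np.tanh(W_h @ x_t + U_h @ (r_t * h_prev) + b_h)   # candidate
    return (1.0 - z_t) * h_prev + z_t * h_tilde                 # convex blend

# Parameter count matches 3 * d_h * (d_x + d_h + 1).
n_params = sum(W.size + U.size + b.size for W, U, b in params.values())
assert n_params == 3 * d_h * (d_x + d_h + 1)

# Driving the update-gate bias strongly negative saturates z_t near zero,
# so the step approximately copies the previous hidden state forward.
params["z"][2] = np.full(d_h, -10.0)
h_prev = rng.standard_normal(d_h)
h_next = gru_step(rng.standard_normal(d_x), h_prev)
print(np.max(np.abs(h_next - h_prev)))  # very small: near-identity behaviour
```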

    Training and Inference

    GRUs are trained end-to-end with backpropagation through time, typically with Adam or a variant, and with gradient clipping to control the rare gradient explosions that survive the gating mechanism. Standard regularization options include dropout applied to the inputs and outputs of the cell (and, in some variants, to the recurrent connections via variational dropout), and weight decay on the input-to-hidden and hidden-to-hidden matrices.
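    A minimal PyTorch sketch of this recipe follows; the model sizes, learning rate, clipping threshold, and random batch are arbitrary example values rather than recommendations.

```python
import torch
import torch.nn as nn

class GRUClassifier(nn.Module):
    def __init__(self, d_x=32, d_h=64, n_classes=5):
        super().__init__()
        # dropout here is applied between stacked GRU layers, not to the
        # recurrent connections inside each layer
        self.gru = nn.GRU(d_x, d_h, num_layers=2, dropout=0.2, batch_first=True)
        self.head = nn.Linear(d_h, n_classes)

    def forward(self, x):
        out, _ = self.gru(x)           # out: (batch, seq_len, d_h)
        return self.head(out[:, -1])   # classify from the last time step

model = GRUClassifier()
opt = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)
loss_fn = nn.CrossEntropyLoss()

# One training step on random data standing in for a real batch.
x = torch.randn(16, 100, 32)                  # (batch, seq_len, d_x)
y = torch.randint(0, 5, (16,))
loss = loss_fn(model(x), y)
opt.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # clip gradients
opt.step()
```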

    At inference time, a GRU consumes one token at a time and updates its hidden state in place, which makes per-token compute and memory independent of sequence length. Modern frameworks fuse the three gate computations into a single matrix multiplication with a stacked weight matrix of shape $ 3 d_h \times (d_x + d_h) $ per time step, and provide cuDNN or Metal kernels that further fuse across the time dimension during training. For deployment, the GRU's small parameter count and lack of an explicit cell state make it attractive for streaming applications and for embedded hardware where memory bandwidth is the bottleneck.
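    A streaming loop reduces to the pattern below (sizes are again illustrative): a single torch.nn.GRUCell is stepped one frame at a time, and the only state carried between steps is one hidden-state tensor.

```python
import torch
import torch.nn as nn

d_x, d_h = 32, 64                  # illustrative sizes
cell = nn.GRUCell(d_x, d_h)
cell.eval()

h = torch.zeros(1, d_h)            # persistent state for a single stream
with torch.no_grad():
    for _ in range(1000):          # an arbitrarily long input stream
        x_t = torch.randn(1, d_x)  # stand-in for the next incoming frame
        h = cell(x_t, h)           # per-step cost independent of stream length
        # ... emit a prediction from h here ...
```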

    Variants

    Several variants modify the basic formulation:

    • The minimal gated unit (MGU) drops the reset gate and uses a single forget-style gate, reducing the parameter count further while remaining competitive on character-level language modeling and music modeling benchmarks.[2]
    • Coupled-gate variants share weights between the reset and update gates, trading a small amount of expressivity for additional parameter savings.
    • The bidirectional GRU runs two independent GRU layers, one over the forward sequence and one over the reverse, and concatenates their hidden states. This is standard practice for tagging and classification tasks where the entire sequence is available at inference (see the sketch after this list).
    • Stacked or deep GRUs compose several GRU layers vertically, with the output sequence of one layer serving as the input sequence of the next. Residual connections between layers help train deep recurrent stacks.
    • Convolutional GRUs replace the dense weight matrices with convolutions, producing a recurrent cell suitable for spatiotemporal data such as video and weather radar.
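    As a small configuration sketch for the bidirectional and stacked variants above (the sizes are arbitrary), PyTorch's nn.GRU exposes both through constructor flags and returns the forward and backward states concatenated along the feature dimension:

```python
import torch
import torch.nn as nn

d_x, d_h = 32, 64
gru = nn.GRU(d_x, d_h, num_layers=2, bidirectional=True, batch_first=True)

x = torch.randn(8, 50, d_x)  # (batch, seq_len, d_x)
out, h_n = gru(x)
print(out.shape)   # (8, 50, 2 * d_h): forward and backward states concatenated
print(h_n.shape)   # (2 layers * 2 directions, 8, d_h): final state per layer/direction
```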

    Comparison with LSTM

    GRUs and LSTMs are often interchangeable in practice. Empirical comparisons by Chung et al. on polyphonic music and speech-signal modeling found no consistent winner, with the choice depending on task and computational budget.[3] Greff et al. similarly concluded that no major architectural variation reliably beats the standard LSTM, and that the GRU is a competitive cheaper alternative.[4]

    The practical differences are predictable from the equations. Per-step compute and memory are about 25 percent lower for the GRU at the same hidden width, since it computes three gated transformations per step rather than four and lacks a separate cell state. The LSTM's separate cell state and output gate provide a finer-grained mechanism for protecting long-term memory from being overwritten and for selectively exposing it to downstream layers, which can matter on tasks with very long dependencies. The GRU's coupled update, in which $ z_t $ simultaneously controls how much of the past is kept and how much of the candidate is written, is a useful inductive bias for many tasks but cannot represent some patterns the LSTM can.
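    An illustrative check of the parameter gap, with arbitrary sizes: PyTorch stores separate input and recurrent biases, so the per-layer counts are $ 3 d_h (d_x + d_h + 2) $ for the GRU and $ 4 d_h (d_x + d_h + 2) $ for the LSTM rather than the single-bias formula given earlier, but the 3:4 ratio behind the roughly 25 percent saving is unchanged.

```python
import torch.nn as nn

d_x, d_h = 64, 128  # arbitrary example sizes

def n_params(module):
    return sum(p.numel() for p in module.parameters())

gru, lstm = nn.GRU(d_x, d_h), nn.LSTM(d_x, d_h)
print(n_params(gru))                    # 3 * d_h * (d_x + d_h + 2)
print(n_params(lstm))                   # 4 * d_h * (d_x + d_h + 2)
print(n_params(gru) / n_params(lstm))   # 0.75, i.e. about 25 percent fewer
```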

    Applications

    Before the widespread adoption of attention-based models, GRUs were a default building block for many sequence-to-sequence tasks: encoder-decoder models for machine translation, acoustic and language models for speech recognition, sequence labeling for named entity recognition, and recommender systems based on session-level click sequences. They remain common in:

    • On-device and streaming speech recognition and keyword spotting, where small footprints and per-token latency matter.
    • Time-series forecasting on tabular industrial data, where the inductive bias of recurrence and the modest parameter count are well matched to limited training data.
    • Reinforcement learning policies that need to integrate observations over time, particularly in partially observable environments.
    • Hybrid architectures that combine convolutional, recurrent, and attention components.

    Limitations

    The same properties that make GRUs efficient also bound their expressivity. The strictly sequential update rule means training cannot be parallelized over the time dimension, which is the central reason transformers have largely replaced GRUs at the frontier of natural language modeling. Long sequences still suffer from gradient attenuation in practice, even if the additive update path mitigates the worst cases; effective context lengths in standard GRUs are usually measured in hundreds rather than thousands of steps. The coupled update gate also makes it difficult to separately tune forgetting and writing dynamics, which is occasionally a real disadvantage relative to the LSTM. Finally, GRUs share with all recurrent models a sensitivity to initialization and to the scale of the recurrent weights, requiring careful choice of initialization (typically orthogonal or identity-scaled) for deep stacks to train reliably.

    References