Recurrent Neural Networks
Revision as of 06:58, 24 April 2026
| Article | |
|---|---|
| Topic area | Deep Learning |
| Difficulty | Intermediate |
| Prerequisites | Neural Networks, Backpropagation |
Recurrent neural networks (RNNs) are a class of neural networks designed to process sequential data — data where the order of elements matters. Unlike feedforward networks, RNNs contain recurrent connections that allow information to persist across time steps, giving them a form of memory.
Sequence modelling
Many real-world problems involve sequences: text is a sequence of words, speech is a sequence of audio frames, stock prices form a time series, and DNA is a sequence of nucleotides. Standard feedforward networks require fixed-size inputs and treat each input independently, making them unsuitable for sequences of variable length where context matters.
RNNs address this by processing inputs one element at a time while maintaining a hidden state that summarises the information seen so far.
Vanilla RNN
At each time step $ t $, a vanilla RNN computes:
- $ \mathbf{h}_t = \tanh(\mathbf{W}_{hh}\,\mathbf{h}_{t-1} + \mathbf{W}_{xh}\,\mathbf{x}_t + \mathbf{b}_h) $
- $ \mathbf{y}_t = \mathbf{W}_{hy}\,\mathbf{h}_t + \mathbf{b}_y $
where $ \mathbf{x}_t $ is the input at time $ t $, $ \mathbf{h}_t $ is the hidden state, $ \mathbf{y}_t $ is the output, and $ \mathbf{W}_{hh}, \mathbf{W}_{xh}, \mathbf{W}_{hy} $ are weight matrices shared across all time steps. The initial hidden state $ \mathbf{h}_0 $ is typically set to the zero vector.
The key insight is that the same parameters are applied at every time step — weight sharing in time — allowing the network to generalise across different positions in the sequence.
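The recurrence above can be sketched in a few lines of NumPy. This is a minimal illustration of the forward pass only; the dimensions, random initialisation, and sequence length are arbitrary choices for the example, not part of the definition.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_hh, W_xh, b_h):
    """One vanilla RNN update: h_t = tanh(W_hh h_{t-1} + W_xh x_t + b_h)."""
    return np.tanh(W_hh @ h_prev + W_xh @ x_t + b_h)

# Toy dimensions for illustration: 3-dim inputs, 4-dim hidden state.
rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 4
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
W_xh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
b_h = np.zeros(hidden_dim)

# Run a length-5 sequence; the same weights are reused at every time step.
h = np.zeros(hidden_dim)  # h_0 is the zero vector
for x_t in rng.normal(size=(5, input_dim)):
    h = rnn_step(x_t, h, W_hh, W_xh, b_h)

print(h.shape)  # (4,)
```

Note that the loop carries only `h` forward: the hidden state is the network's entire memory of the sequence so far.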
Backpropagation through time (BPTT)
Training an RNN requires computing gradients of the loss with respect to the shared weights. Backpropagation through time (BPTT) "unrolls" the RNN across time steps, producing a deep feedforward network with shared weights, and then applies standard backpropagation.
For a sequence of length $ T $, the gradient of the loss with respect to $ \mathbf{W}_{hh} $ involves a product of Jacobians:
- $ \frac{\partial L}{\partial \mathbf{W}_{hh}} = \sum_{t=1}^{T}\frac{\partial L_t}{\partial \mathbf{W}_{hh}} = \sum_{t=1}^{T}\sum_{k=1}^{t}\frac{\partial L_t}{\partial \mathbf{h}_t}\left(\prod_{j=k+1}^{t}\frac{\partial \mathbf{h}_j}{\partial \mathbf{h}_{j-1}}\right)\frac{\partial \mathbf{h}_k}{\partial \mathbf{W}_{hh}} $
The product of Jacobians $ \prod \partial \mathbf{h}_j / \partial \mathbf{h}_{j-1} $ is the source of the vanishing and exploding gradient problems.
The vanishing gradient problem
When the spectral radius of the recurrent Jacobian is less than 1, the gradient signal decays exponentially through time — the vanishing gradient problem. This makes it extremely difficult for vanilla RNNs to learn dependencies that span more than 10–20 time steps.
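The exponential decay can be seen numerically. The sketch below multiplies the linear part of the recurrent Jacobian (a matrix rescaled to spectral radius 0.5, an assumed value for illustration) across 50 steps; the $\tanh$ factor $\mathrm{diag}(1-\mathbf{h}_j^2)$ would only shrink the product further.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 8
W = rng.normal(size=(n, n))
# Rescale so the spectral radius of the recurrent matrix is 0.5 (assumed).
W *= 0.5 / np.max(np.abs(np.linalg.eigvals(W)))

# Accumulate the Jacobian product over 50 steps and track its spectral norm.
J = np.eye(n)
norms = []
for _ in range(50):
    J = W @ J
    norms.append(np.linalg.norm(J, 2))

print(norms[0], norms[-1])  # the norm collapses roughly like 0.5**t
```

After 50 steps the gradient contribution from the earliest inputs is numerically negligible, which is why vanilla RNNs struggle with long-range dependencies.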
Conversely, when the spectral radius exceeds 1, gradients can grow exponentially — the exploding gradient problem. Exploding gradients are typically handled by gradient clipping (capping the gradient norm at a threshold), but vanishing gradients require architectural solutions.
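Gradient clipping by global norm can be sketched as follows; the threshold of 5.0 and the toy gradients are illustrative values, not recommendations.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their joint L2 norm <= max_norm."""
    total = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (total + 1e-12))
    return [g * scale for g in grads], total

# Toy "exploded" gradients for two parameter tensors.
grads = [np.full((2, 2), 10.0), np.full((3,), 10.0)]
clipped, norm_before = clip_by_global_norm(grads, max_norm=5.0)
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
print(norm_before, norm_after)  # large norm before, 5.0 after
```

Clipping rescales the gradient direction uniformly rather than truncating individual entries, so the update direction is preserved.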
Long Short-Term Memory (LSTM)
The LSTM (Hochreiter and Schmidhuber, 1997) introduces a cell state $ \mathbf{c}_t $ that flows through time with minimal interference, and three gates that control the flow of information:
- $ \mathbf{f}_t = \sigma(\mathbf{W}_f[\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_f) $ (forget gate)
- $ \mathbf{i}_t = \sigma(\mathbf{W}_i[\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_i) $ (input gate)
- $ \tilde{\mathbf{c}}_t = \tanh(\mathbf{W}_c[\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_c) $ (candidate cell state)
- $ \mathbf{c}_t = \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{c}}_t $ (cell state update)
- $ \mathbf{o}_t = \sigma(\mathbf{W}_o[\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_o) $ (output gate)
- $ \mathbf{h}_t = \mathbf{o}_t \odot \tanh(\mathbf{c}_t) $
The cell state acts as a conveyor belt: the forget gate decides what old information to discard, the input gate decides what new information to store, and the output gate controls how much of the cell state is exposed as the hidden state $ \mathbf{h}_t $, which feeds both the next time step and any subsequent layer. Because the cell state is updated through addition (not multiplication), gradients flow more easily across long sequences.
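A single LSTM step follows directly from the six equations. The sketch below stacks the four gate pre-activations into one weight matrix acting on the concatenation $[\mathbf{h}_{t-1}, \mathbf{x}_t]$, a common implementation layout; the sizes and initialisation are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W maps [h_{t-1}, x_t] to the four stacked
    pre-activations: forget, input, candidate, output."""
    hx = np.concatenate([h_prev, x_t])
    z = W @ hx + b
    H = h_prev.shape[0]
    f = sigmoid(z[0 * H:1 * H])         # forget gate
    i = sigmoid(z[1 * H:2 * H])         # input gate
    c_tilde = np.tanh(z[2 * H:3 * H])   # candidate cell state
    o = sigmoid(z[3 * H:4 * H])         # output gate
    c = f * c_prev + i * c_tilde        # additive cell-state update
    h = o * np.tanh(c)
    return h, c

# Toy sizes: 3-dim input, 4-dim hidden and cell state.
rng = np.random.default_rng(2)
D, H = 3, 4
W = rng.normal(scale=0.1, size=(4 * H, H + D))
b = np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)
for x_t in rng.normal(size=(6, D)):
    h, c = lstm_step(x_t, h, c, W, b)
print(h.shape, c.shape)  # (4,) (4,)
```

The additive line `c = f * c_prev + i * c_tilde` is the mechanism that lets gradients pass through many steps without the repeated matrix multiplication of the vanilla RNN.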
Gated Recurrent Unit (GRU)
The GRU (Cho et al., 2014) simplifies the LSTM by merging the cell state and hidden state and using only two gates:
- $ \mathbf{z}_t = \sigma(\mathbf{W}_z[\mathbf{h}_{t-1}, \mathbf{x}_t]) $ (update gate)
- $ \mathbf{r}_t = \sigma(\mathbf{W}_r[\mathbf{h}_{t-1}, \mathbf{x}_t]) $ (reset gate)
- $ \tilde{\mathbf{h}}_t = \tanh(\mathbf{W}[\mathbf{r}_t \odot \mathbf{h}_{t-1}, \mathbf{x}_t]) $
- $ \mathbf{h}_t = (1 - \mathbf{z}_t) \odot \mathbf{h}_{t-1} + \mathbf{z}_t \odot \tilde{\mathbf{h}}_t $
The GRU has fewer parameters than the LSTM and often achieves comparable performance. In practice, the choice between LSTM and GRU is typically made empirically.
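The GRU equations translate to an equally short sketch. Biases are omitted to match the update and reset equations as written above; the dimensions are again illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W_z, W_r, W_h):
    """One GRU step: update gate z, reset gate r, candidate h_tilde."""
    hx = np.concatenate([h_prev, x_t])
    z = sigmoid(W_z @ hx)  # update gate
    r = sigmoid(W_r @ hx)  # reset gate
    h_tilde = np.tanh(W_h @ np.concatenate([r * h_prev, x_t]))
    # Interpolate between the old state and the candidate.
    return (1 - z) * h_prev + z * h_tilde

# Toy sizes: 3-dim input, 4-dim hidden state.
rng = np.random.default_rng(3)
D, H = 3, 4
W_z = rng.normal(scale=0.1, size=(H, H + D))
W_r = rng.normal(scale=0.1, size=(H, H + D))
W_h = rng.normal(scale=0.1, size=(H, H + D))
h = np.zeros(H)
for x_t in rng.normal(size=(6, D)):
    h = gru_step(x_t, h, W_z, W_r, W_h)
print(h.shape)  # (4,)
```

Because $\mathbf{h}_t$ is a convex combination of the old state and the candidate, the GRU gets an additive gradient path analogous to the LSTM's cell state without a separate memory vector.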
Bidirectional RNNs
A bidirectional RNN processes the sequence in both directions — forward (left to right) and backward (right to left) — and concatenates the hidden states:
- $ \mathbf{h}_t = [\overrightarrow{\mathbf{h}}_t;\; \overleftarrow{\mathbf{h}}_t] $
This allows the model to use both past and future context at every time step, which is beneficial for tasks like named entity recognition and machine translation where the meaning of a word depends on its surrounding context.
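The forward and backward passes can be sketched by running the same vanilla RNN twice, once on the reversed sequence, and concatenating the aligned states. Separate (illustrative) parameters are used for each direction, as in a real bidirectional layer.

```python
import numpy as np

def rnn_forward(xs, W_hh, W_xh, b_h):
    """Run a vanilla RNN over a sequence; return all T hidden states."""
    h = np.zeros(W_hh.shape[0])
    hs = []
    for x_t in xs:
        h = np.tanh(W_hh @ h + W_xh @ x_t + b_h)
        hs.append(h)
    return np.stack(hs)

# Toy sizes: 3-dim input, 4-dim hidden state, length-5 sequence.
rng = np.random.default_rng(4)
D, H, T = 3, 4, 5
xs = rng.normal(size=(T, D))
params_f = (rng.normal(scale=0.1, size=(H, H)),
            rng.normal(scale=0.1, size=(H, D)), np.zeros(H))
params_b = (rng.normal(scale=0.1, size=(H, H)),
            rng.normal(scale=0.1, size=(H, D)), np.zeros(H))

h_fwd = rnn_forward(xs, *params_f)               # left to right
h_bwd = rnn_forward(xs[::-1], *params_b)[::-1]   # right to left, re-aligned
h_bi = np.concatenate([h_fwd, h_bwd], axis=1)    # [h_fwd_t ; h_bwd_t]
print(h_bi.shape)  # (5, 8)
```

At each position $t$, `h_bi[t]` summarises the prefix $x_1..x_t$ in its first half and the suffix $x_t..x_T$ in its second half.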
Applications
RNNs and their gated variants have been applied to a wide range of sequence tasks:
- Language modelling — predicting the next word in a sequence.
- Machine translation — encoder-decoder architectures for sequence-to-sequence translation (Sutskever et al., 2014).
- Speech recognition — transcribing audio to text (often combined with CTC loss).
- Sentiment analysis — classifying the sentiment of text.
- Time-series forecasting — predicting future values of financial or sensor data.
- Music generation — generating sequences of notes.
Note that for many NLP tasks, Transformers (Vaswani et al., 2017) have largely superseded RNNs due to their ability to process sequences in parallel and capture long-range dependencies more effectively through self-attention.
See also
- Neural Networks
- Backpropagation
- Convolutional Neural Networks
- Word Embeddings
- Overfitting and Regularization
References
- Elman, J. L. (1990). "Finding Structure in Time". Cognitive Science, 14(2), 179–211.
- Hochreiter, S. and Schmidhuber, J. (1997). "Long Short-Term Memory". Neural Computation, 9(8), 1735–1780.
- Cho, K. et al. (2014). "Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation". EMNLP.
- Sutskever, I., Vinyals, O. and Le, Q. V. (2014). "Sequence to Sequence Learning with Neural Networks". NIPS.
- Goodfellow, I., Bengio, Y. and Courville, A. (2016). Deep Learning, Chapter 10. MIT Press.