Recurrent Neural Networks/en - Revision history

FuzzyBot: Updating to match new version of source page

2026-04-27T23:37:42Z

Updating to match new version of source page

FuzzyBot: Updating to match new version of source page

2026-04-27T22:01:10Z

Updating to match new version of source page

FuzzyBot: Updating to match new version of source page

2026-04-27T19:42:39Z

Updating to match new version of source page

FuzzyBot: Updating to match new version of source page

2026-04-27T02:39:27Z

Updating to match new version of source page

← Older revision		Revision as of 02:39, 27 April 2026
Line 1:		Line 1:
	<languages />		<languages />
	~~{{LanguageBar \| page = Recurrent Neural Networks}}~~
	{{ArticleInfobox \| topic_area = Deep Learning \| difficulty = Intermediate \| prerequisites = [[Neural Networks]], [[Backpropagation]]}}		{{ArticleInfobox \| topic_area = Deep Learning \| difficulty = Intermediate \| prerequisites = [[Neural Networks]], [[Backpropagation]]}}
	{{ContentMeta \| generated_by = claude-opus \| model_used = claude-opus-4-6 \| generated_date = 2026-03-13}}		{{ContentMeta \| generated_by = claude-opus \| model_used = claude-opus-4-6 \| generated_date = 2026-03-13}}

FuzzyBot: Updating to match new version of source page

2026-04-27T00:30:56Z

Updating to match new version of source page

New page

<languages />
{{LanguageBar | page = Recurrent Neural Networks}}
{{ArticleInfobox | topic_area = Deep Learning | difficulty = Intermediate | prerequisites = [[Neural Networks]], [[Backpropagation]]}}
{{ContentMeta | generated_by = claude-opus | model_used = claude-opus-4-6 | generated_date = 2026-03-13}}

'''Recurrent neural networks''' ('''RNNs''') are a class of [[Neural Networks|neural networks]] designed to process '''sequential data''' — data where the order of elements matters. Unlike feedforward networks, RNNs contain recurrent connections that allow information to persist across time steps, giving them a form of memory.

== Sequence modelling ==

Many real-world problems involve sequences: text is a sequence of words, speech is a sequence of audio frames, stock prices form a time series, and DNA is a sequence of nucleotides. Standard feedforward networks require fixed-size inputs and treat each input independently, making them unsuitable for sequences of variable length where context matters.

RNNs address this by processing inputs one element at a time while maintaining a '''hidden state''' that summarises the information seen so far.

== Vanilla RNN ==

At each time step <math>t</math>, a vanilla RNN computes:

:<math>\mathbf{h}_t = \tanh(\mathbf{W}_{hh}\,\mathbf{h}_{t-1} + \mathbf{W}_{xh}\,\mathbf{x}_t + \mathbf{b}_h)</math>

:<math>\mathbf{y}_t = \mathbf{W}_{hy}\,\mathbf{h}_t + \mathbf{b}_y</math>

where <math>\mathbf{x}_t</math> is the input at time <math>t</math>, <math>\mathbf{h}_t</math> is the hidden state, <math>\mathbf{y}_t</math> is the output, and the weight matrices <math>\mathbf{W}_{hh}, \mathbf{W}_{xh}, \mathbf{W}_{hy}</math> and bias vectors <math>\mathbf{b}_h, \mathbf{b}_y</math> are shared across all time steps. The initial hidden state <math>\mathbf{h}_0</math> is typically set to the zero vector.

The key insight is that the same parameters are applied at every time step — '''weight sharing in time''' — allowing the network to generalise across different positions in the sequence.
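The recurrence above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function name <code>rnn_forward</code>, the dimensions and the random initialisation scale are assumptions chosen for the example.

```python
import numpy as np

def rnn_forward(xs, W_xh, W_hh, W_hy, b_h, b_y, h0=None):
    """Run a vanilla RNN over a sequence xs of shape (T, input_dim)."""
    hidden_dim = W_hh.shape[0]
    h = np.zeros(hidden_dim) if h0 is None else h0
    hs, ys = [], []
    for x_t in xs:  # the same weights are reused at every time step
        h = np.tanh(W_hh @ h + W_xh @ x_t + b_h)
        ys.append(W_hy @ h + b_y)
        hs.append(h)
    return np.stack(hs), np.stack(ys)

# Illustrative sizes and small random weights (assumed, not from the article).
rng = np.random.default_rng(0)
input_dim, hidden_dim, output_dim, T = 3, 5, 2, 7
W_xh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
W_hy = rng.normal(scale=0.1, size=(output_dim, hidden_dim))
b_h = np.zeros(hidden_dim)
b_y = np.zeros(output_dim)

xs = rng.normal(size=(T, input_dim))
hs, ys = rnn_forward(xs, W_xh, W_hh, W_hy, b_h, b_y)
print(hs.shape, ys.shape)  # (7, 5) (7, 2)
```

Because the loop only depends on the current input and the previous hidden state, the same function handles sequences of any length without changing the parameters.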

== Backpropagation through time (BPTT) ==

Training an RNN requires computing gradients of the loss with respect to the shared weights. '''Backpropagation through time''' (BPTT) "unrolls" the RNN across time steps, producing a deep feedforward network with shared weights, and then applies standard [[Backpropagation|backpropagation]].

For a sequence of length <math>T</math> with total loss <math>L = \sum_{t=1}^{T} L_t</math>, the gradient of the loss with respect to <math>\mathbf{W}_{hh}</math> involves a product of Jacobians:

:<math>\frac{\partial L}{\partial \mathbf{W}_{hh}} = \sum_{t=1}^{T}\frac{\partial L_t}{\partial \mathbf{W}_{hh}} = \sum_{t=1}^{T}\sum_{k=1}^{t}\frac{\partial L_t}{\partial \mathbf{h}_t}\left(\prod_{j=k+1}^{t}\frac{\partial \mathbf{h}_j}{\partial \mathbf{h}_{j-1}}\right)\frac{\partial \mathbf{h}_k}{\partial \mathbf{W}_{hh}}</math>

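Here <math>\partial \mathbf{h}_k / \partial \mathbf{W}_{hh}</math> denotes the immediate derivative, with <math>\mathbf{h}_{k-1}</math> held fixed. The product of Jacobians <math>\prod \partial \mathbf{h}_j / \partial \mathbf{h}_{j-1}</math>, which grows longer as the gap <math>t - k</math> widens, is the source of the vanishing and exploding gradient problems.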
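As a rough sketch of BPTT for the tanh RNN defined earlier, the code below accumulates the gradient with respect to <math>\mathbf{W}_{hh}</math> by walking backwards through the unrolled steps, then compares it with a finite-difference estimate. The toy loss <math>L_t = \tfrac{1}{2}\lVert\mathbf{h}_t\rVert^2</math>, the function names and the sizes are assumptions made for illustration only.

```python
import numpy as np

def forward(xs, W_xh, W_hh, b_h):
    h = np.zeros(W_hh.shape[0])
    hs = [h]                            # hs[0] is the initial state h_0
    for x_t in xs:
        h = np.tanh(W_hh @ h + W_xh @ x_t + b_h)
        hs.append(h)
    return hs

def loss(hs):
    # Toy per-step loss L_t = 0.5 * ||h_t||^2, summed over t = 1..T (assumed).
    return sum(0.5 * np.dot(h, h) for h in hs[1:])

def bptt_grad_W_hh(xs, W_xh, W_hh, b_h):
    hs = forward(xs, W_xh, W_hh, b_h)
    dW_hh = np.zeros_like(W_hh)
    dh_next = np.zeros(W_hh.shape[0])   # gradient flowing back from step t+1
    for t in range(len(xs), 0, -1):
        dh = hs[t] + dh_next            # dL_t/dh_t plus the contribution of later steps
        da = dh * (1.0 - hs[t] ** 2)    # back through tanh
        dW_hh += np.outer(da, hs[t - 1])
        dh_next = W_hh.T @ da           # applies one Jacobian factor dh_t/dh_{t-1}
    return dW_hh

rng = np.random.default_rng(0)
xs = rng.normal(size=(6, 3))
W_xh = rng.normal(scale=0.5, size=(4, 3))
W_hh = rng.normal(scale=0.5, size=(4, 4))
b_h = np.zeros(4)
analytic = bptt_grad_W_hh(xs, W_xh, W_hh, b_h)

# Central finite differences as an independent check of the backward pass.
eps = 1e-6
numeric = np.zeros_like(W_hh)
for i in range(4):
    for j in range(4):
        W_plus, W_minus = W_hh.copy(), W_hh.copy()
        W_plus[i, j] += eps
        W_minus[i, j] -= eps
        numeric[i, j] = (loss(forward(xs, W_xh, W_plus, b_h))
                         - loss(forward(xs, W_xh, W_minus, b_h))) / (2 * eps)

max_abs_diff = np.max(np.abs(analytic - numeric))
print(max_abs_diff)
```

The repeated multiplication by <code>W_hh.T</code> and the tanh derivative inside the backward loop is the Jacobian product from the formula above, applied one factor per step.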

== The vanishing gradient problem ==

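For the tanh RNN above, each factor in the product is <math>\partial \mathbf{h}_j / \partial \mathbf{h}_{j-1} = \operatorname{diag}(1 - \mathbf{h}_j \odot \mathbf{h}_j)\,\mathbf{W}_{hh}</math>. When the norms of these recurrent Jacobians (and hence their spectral radii) stay below 1, which is guaranteed if the largest singular value of <math>\mathbf{W}_{hh}</math> is less than 1, the gradient signal decays exponentially through time: the '''vanishing gradient problem'''. In practice this makes it very difficult for vanilla RNNs to learn dependencies that span more than roughly 10–20 time steps.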
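The decay can be observed numerically. The sketch below, under assumed sizes and random inputs, rescales <math>\mathbf{W}_{hh}</math> so that its largest singular value is 0.9 and tracks the norm of the accumulated Jacobian product <math>\partial \mathbf{h}_t / \partial \mathbf{h}_0</math>; since the tanh derivative is at most 1, that norm cannot exceed <math>0.9^t</math>.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_dim, input_dim, T = 8, 4, 30   # illustrative sizes (assumed)

W_hh = rng.normal(size=(hidden_dim, hidden_dim))
W_hh *= 0.9 / np.linalg.norm(W_hh, 2)  # rescale: largest singular value becomes 0.9
W_xh = rng.normal(scale=0.5, size=(hidden_dim, input_dim))
xs = rng.normal(size=(T, input_dim))

h = np.zeros(hidden_dim)
product = np.eye(hidden_dim)           # accumulates dh_t/dh_0
norms = []
for x_t in xs:
    h = np.tanh(W_hh @ h + W_xh @ x_t)
    jacobian = np.diag(1.0 - h ** 2) @ W_hh   # dh_t/dh_{t-1} for a tanh RNN
    product = jacobian @ product
    norms.append(np.linalg.norm(product, 2))

for n in (1, 10, 20, 30):
    print(n, norms[n - 1], 0.9 ** n)   # observed norm next to the upper bound
```

After 30 steps the bound alone is already below 0.05, so a loss at step 30 contributes almost nothing to the gradient with respect to the earliest hidden states.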

Conversely, when the spectral radius exceeds 1, gradients can grow exponentially — the '''exploding gradient problem'''. Exploding gradients are typically handled by '''gradient clipping''' (capping the gradient norm at a threshold), but vanishing gradients require architectural solutions.
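Gradient clipping by norm can be sketched as follows. The helper name <code>clip_by_global_norm</code> and the threshold of 5.0 are assumptions for illustration; deep learning libraries ship their own versions of this operation.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their joint L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    # Scale is 1 when the norm is already small enough, otherwise shrink uniformly.
    scale = min(1.0, max_norm / (total_norm + 1e-12))
    return [g * scale for g in grads], total_norm

# Two parameter gradients with joint norm sqrt(9 + 16 + 144) = 13.
grads = [np.array([3.0, 4.0]), np.array([[12.0]])]
clipped, norm_before = clip_by_global_norm(grads, max_norm=5.0)
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
print(norm_before, norm_after)
```

All arrays are scaled by the same factor, so the direction of the update is preserved and only its length is capped.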

== Long Short-Term Memory (LSTM) ==

The '''LSTM''' (Hochreiter and Schmidhuber, 1997) introduces a '''cell state''' <math>\mathbf{c}_t</math> that flows through time with minimal interference, and three '''gates''' that control the flow of information:

:<math>\mathbf{f}_t = \sigma(\mathbf{W}_f[\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_f)</math> ('''forget gate''')

:<math>\mathbf{i}_t = \sigma(\mathbf{W}_i[\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_i)</math> ('''input gate''')

:<math>\tilde{\mathbf{c}}_t = \tanh(\mathbf{W}_c[\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_c)</math> ('''candidate cell state''')

:<math>\mathbf{c}_t = \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{c}}_t</math> ('''cell state update''')

:<math>\mathbf{o}_t = \sigma(\mathbf{W}_o[\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_o)</math> ('''output gate''')

:<math>\mathbf{h}_t = \mathbf{o}_t \odot \tanh(\mathbf{c}_t)</math>

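The cell state acts as a conveyor belt: the forget gate decides what old information to discard, the input gate decides what new information to store, and the output gate controls how much of the cell state is exposed as the hidden state <math>\mathbf{h}_t</math>. Because the cell state update is additive, and the previous cell state is only scaled element-wise by the forget gate rather than passed through a weight matrix and a squashing nonlinearity, gradients flow more easily across long sequences, particularly when the forget gate stays close to 1.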
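The six LSTM equations translate almost line by line into code. The following is a minimal single-step sketch; the function name <code>lstm_step</code>, the dictionary layout of the parameters and the sizes are assumptions made for this example.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step; each W_* acts on the concatenation [h_{t-1}, x_t]."""
    concat = np.concatenate([h_prev, x_t])
    f = sigmoid(params["W_f"] @ concat + params["b_f"])        # forget gate
    i = sigmoid(params["W_i"] @ concat + params["b_i"])        # input gate
    c_tilde = np.tanh(params["W_c"] @ concat + params["b_c"])  # candidate cell state
    c = f * c_prev + i * c_tilde                               # cell state update
    o = sigmoid(params["W_o"] @ concat + params["b_o"])        # output gate
    h = o * np.tanh(c)
    return h, c

rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 4   # illustrative sizes (assumed)
params = {}
for gate in "fico":
    params[f"W_{gate}"] = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim + input_dim))
    params[f"b_{gate}"] = np.zeros(hidden_dim)

h = np.zeros(hidden_dim)
c = np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):
    h, c = lstm_step(x_t, h, c, params)
print(h.shape, c.shape)  # (4,) (4,)
```

When the forget gate is close to 1 and the input gate close to 0, the line <code>c = f * c_prev + i * c_tilde</code> copies the previous cell state almost unchanged, which is the conveyor-belt behaviour described above.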

== Gated Recurrent Unit (GRU) ==

The '''GRU''' (Cho et al., 2014) simplifies the LSTM by merging the cell state and hidden state and using only two gates:

:<math>\mathbf{z}_t = \sigma(\mathbf{W}_z[\mathbf{h}_{t-1}, \mathbf{x}_t])</math> ('''update gate''')

:<math>\mathbf{r}_t = \sigma(\mathbf{W}_r[\mathbf{h}_{t-1}, \mathbf{x}_t])</math> ('''reset gate''')

:<math>\tilde{\mathbf{h}}_t = \tanh(\mathbf{W}[\mathbf{r}_t \odot \mathbf{h}_{t-1}, \mathbf{x}_t])</math>

:<math>\mathbf{h}_t = (1 - \mathbf{z}_t) \odot \mathbf{h}_{t-1} + \mathbf{z}_t \odot \tilde{\mathbf{h}}_t</math>

The GRU has fewer parameters than the LSTM (three weight blocks instead of four for the same input and hidden sizes) and often achieves comparable performance. In practice, the choice between LSTM and GRU is typically made empirically.
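A single GRU step following the four equations above might look like the sketch below, with a small helper that counts weight-matrix entries for comparison with the LSTM. The names <code>gru_step</code> and <code>gated_weight_count</code> and the sizes are hypothetical; like the equations, the code omits bias terms.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, W_z, W_r, W):
    """One GRU time step without bias terms, matching the equations above."""
    concat = np.concatenate([h_prev, x_t])
    z = sigmoid(W_z @ concat)                                   # update gate
    r = sigmoid(W_r @ concat)                                   # reset gate
    h_tilde = np.tanh(W @ np.concatenate([r * h_prev, x_t]))    # candidate state
    return (1.0 - z) * h_prev + z * h_tilde

def gated_weight_count(n_blocks, input_dim, hidden_dim):
    # Each block is one matrix of shape (hidden_dim, hidden_dim + input_dim).
    return n_blocks * hidden_dim * (hidden_dim + input_dim)

rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 4   # illustrative sizes (assumed)
shape = (hidden_dim, hidden_dim + input_dim)
W_z = rng.normal(scale=0.1, size=shape)
W_r = rng.normal(scale=0.1, size=shape)
W = rng.normal(scale=0.1, size=shape)

h = np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):
    h = gru_step(x_t, h, W_z, W_r, W)

lstm_weights = gated_weight_count(4, input_dim, hidden_dim)  # f, i, c, o
gru_weights = gated_weight_count(3, input_dim, hidden_dim)   # z, r, candidate
print(lstm_weights, gru_weights)  # 112 84
```

The final line of <code>gru_step</code> interpolates between the old state and the candidate, so a single update gate plays the role that the separate forget and input gates play in the LSTM.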

== Bidirectional RNNs ==

A '''bidirectional RNN''' processes the sequence in both directions — forward (left to right) and backward (right to left) — and concatenates the hidden states:

:<math>\mathbf{h}_t = [\overrightarrow{\mathbf{h}}_t;\; \overleftarrow{\mathbf{h}}_t]</math>

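This allows the model to use both past and future context at every time step, which is beneficial for tasks like named entity recognition and machine translation where the meaning of a word depends on its surrounding context. Because the backward pass needs the complete input sequence, bidirectional RNNs are not suited to online prediction, where future inputs are not yet available.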
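The concatenation can be sketched with two independent vanilla RNNs, one run over the reversed sequence. The helper names <code>run_rnn</code> and <code>bidirectional_rnn</code>, the parameter tuples and the sizes are assumptions for illustration.

```python
import numpy as np

def run_rnn(xs, W_xh, W_hh, b_h):
    """Return the hidden states of a vanilla tanh RNN, shape (T, hidden_dim)."""
    h = np.zeros(W_hh.shape[0])
    hs = []
    for x_t in xs:
        h = np.tanh(W_hh @ h + W_xh @ x_t + b_h)
        hs.append(h)
    return np.stack(hs)

def bidirectional_rnn(xs, fwd_params, bwd_params):
    h_fwd = run_rnn(xs, *fwd_params)
    # Run on the reversed sequence, then flip back so row t lines up with x_t.
    h_bwd = run_rnn(xs[::-1], *bwd_params)[::-1]
    return np.concatenate([h_fwd, h_bwd], axis=1)

def init_params(rng, input_dim, hidden_dim):
    return (rng.normal(scale=0.5, size=(hidden_dim, input_dim)),
            rng.normal(scale=0.5, size=(hidden_dim, hidden_dim)),
            np.zeros(hidden_dim))

rng = np.random.default_rng(0)
input_dim, H, T = 3, 4, 6   # illustrative sizes (assumed)
fwd_params = init_params(rng, input_dim, H)
bwd_params = init_params(rng, input_dim, H)

xs = rng.normal(size=(T, input_dim))
out = bidirectional_rnn(xs, fwd_params, bwd_params)
print(out.shape)  # (6, 8)

# Perturb only the last input: earlier forward states cannot see it,
# but earlier backward states can.
xs2 = xs.copy()
xs2[-1] += 1.0
out2 = bidirectional_rnn(xs2, fwd_params, bwd_params)
```

Comparing <code>out</code> and <code>out2</code> shows the asymmetry: the forward half of every row before the last is unchanged, while the backward half carries information from the modified final input.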

== Applications ==

RNNs and their gated variants have been applied to a wide range of sequence tasks:

* '''Language modelling''' — predicting the next word in a sequence.
* '''Machine translation''' — encoder-decoder architectures for sequence-to-sequence translation (Sutskever et al., 2014).
* '''Speech recognition''' — transcribing audio to text (often combined with CTC loss).
* '''Sentiment analysis''' — classifying the sentiment of text.
* '''Time-series forecasting''' — predicting future values of financial or sensor data.
* '''Music generation''' — generating sequences of notes.

Note that for many NLP tasks, '''Transformers''' (Vaswani et al., 2017) have largely superseded RNNs due to their ability to process sequences in parallel and capture long-range dependencies more effectively through self-attention.

== See also ==

* [[Neural Networks]]
* [[Backpropagation]]
* [[Convolutional Neural Networks]]
* [[Word Embeddings]]
* [[Overfitting and Regularization]]

== References ==

* Elman, J. L. (1990). "Finding Structure in Time". ''Cognitive Science'', 14(2), 179–211.
* Hochreiter, S. and Schmidhuber, J. (1997). "Long Short-Term Memory". ''Neural Computation'', 9(8), 1735–1780.
* Cho, K. et al. (2014). "Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation". ''EMNLP''.
* Sutskever, I., Vinyals, O. and Le, Q. V. (2014). "Sequence to Sequence Learning with Neural Networks". ''NeurIPS''.
* Goodfellow, I., Bengio, Y. and Courville, A. (2016). ''Deep Learning'', Chapter 10. MIT Press.
* Vaswani, A. et al. (2017). "Attention Is All You Need". ''NeurIPS''.

[[Category:Deep Learning]]
[[Category:Intermediate]]
[[Category:Neural Networks]]

← Older revision		Revision as of 23:37, 27 April 2026
Line 25:		Line 25:
	== Backpropagation through time (BPTT) ==		== Backpropagation through time (BPTT) ==

	Training an RNN requires computing gradients of the loss with respect to the shared weights. '''~~Backpropagation~~ through time''' (BPTT) "unrolls" the RNN across time steps, producing a deep feedforward network with shared weights, and then applies standard [[Backpropagation\|backpropagation]].		Training an RNN requires computing gradients of the loss with respect to the shared weights. '''{{Term\|backpropagation}} through time''' (BPTT) "unrolls" the RNN across time steps, producing a deep feedforward network with shared weights, and then applies standard [[Backpropagation\|backpropagation]].

	For a sequence of length <math>T</math>, the gradient of the loss with respect to <math>\mathbf{W}_{hh}</math> involves a product of Jacobians:		For a sequence of length <math>T</math>, the gradient of the loss with respect to <math>\mathbf{W}_{hh}</math> involves a product of Jacobians:
Line 37:		Line 37:
	When the spectral radius of the recurrent Jacobian is less than 1, the gradient signal decays exponentially through time — the '''vanishing gradient problem'''. This makes it extremely difficult for vanilla RNNs to learn dependencies that span more than 10–20 time steps.		When the spectral radius of the recurrent Jacobian is less than 1, the gradient signal decays exponentially through time — the '''vanishing gradient problem'''. This makes it extremely difficult for vanilla RNNs to learn dependencies that span more than 10–20 time steps.

	Conversely, when the spectral radius exceeds 1, gradients can grow exponentially — the '''exploding gradient problem'''. Exploding gradients are typically handled by '''gradient clipping''' (capping the gradient norm at a threshold), but vanishing gradients require architectural solutions.		Conversely, when the spectral radius exceeds 1, gradients can grow exponentially — the '''exploding gradient problem'''. Exploding gradients are typically handled by '''{{Term\|gradient clipping}}''' (capping the gradient norm at a threshold), but vanishing gradients require architectural solutions.

	== Long Short-Term Memory (LSTM) ==		== Long Short-Term Memory (LSTM) ==

	The '''LSTM''' (Hochreiter and Schmidhuber, 1997) introduces a '''cell state''' <math>\mathbf{c}_t</math> that flows through time with minimal interference, and three '''gates''' that control the flow of information:		The '''{{Term\|long short-term memory\|LSTM}}''' (Hochreiter and Schmidhuber, 1997) introduces a '''cell state''' <math>\mathbf{c}_t</math> that flows through time with minimal interference, and three '''gates''' that control the flow of information:

	:<math>\mathbf{f}_t = \sigma(\mathbf{W}_f[\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_f)</math> ('''forget gate''')		:<math>\mathbf{f}_t = \sigma(\mathbf{W}_f[\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_f)</math> ('''forget gate''')
Line 59:		Line 59:
	== Gated Recurrent Unit (GRU) ==		== Gated Recurrent Unit (GRU) ==

	The '''GRU''' (Cho et al., 2014) simplifies the LSTM by merging the cell state and hidden state and using only two gates:		The '''GRU''' (Cho et al., 2014) simplifies the {{Term\|long short-term memory\|LSTM}} by merging the cell state and hidden state and using only two gates:

	:<math>\mathbf{z}_t = \sigma(\mathbf{W}_z[\mathbf{h}_{t-1}, \mathbf{x}_t])</math> ('''update gate''')		:<math>\mathbf{z}_t = \sigma(\mathbf{W}_z[\mathbf{h}_{t-1}, \mathbf{x}_t])</math> ('''update gate''')
Line 69:		Line 69:
	:<math>\mathbf{h}_t = (1 - \mathbf{z}_t) \odot \mathbf{h}_{t-1} + \mathbf{z}_t \odot \tilde{\mathbf{h}}_t</math>		:<math>\mathbf{h}_t = (1 - \mathbf{z}_t) \odot \mathbf{h}_{t-1} + \mathbf{z}_t \odot \tilde{\mathbf{h}}_t</math>

	The GRU has fewer parameters than the LSTM and often achieves comparable performance. In practice, the choice between LSTM and GRU is typically made empirically.		The GRU has fewer parameters than the {{Term\|long short-term memory\|LSTM}} and often achieves comparable performance. In practice, the choice between {{Term\|long short-term memory\|LSTM}} and GRU is typically made empirically.

	== Bidirectional RNNs ==		== Bidirectional RNNs ==
Line 84:		Line 84:

	* '''Language modelling''' — predicting the next word in a sequence.		* '''Language modelling''' — predicting the next word in a sequence.
	* '''Machine translation''' — encoder-decoder architectures for sequence-to-sequence translation (Sutskever et al., 2014).		* '''Machine translation''' — encoder-decoder architectures for {{Term\|sequence-to-sequence}} translation (Sutskever et al., 2014).
	* '''Speech recognition''' — transcribing audio to text (often combined with CTC loss).		* '''Speech recognition''' — transcribing audio to text (often combined with CTC loss).
	* '''Sentiment analysis''' — classifying the sentiment of text.		* '''Sentiment analysis''' — classifying the sentiment of text.
Line 90:		Line 90:
	* '''Music generation''' — generating sequences of notes.		* '''Music generation''' — generating sequences of notes.

	Note that for many NLP tasks, '''Transformers''' (Vaswani et al., 2017) have largely superseded RNNs due to their ability to process sequences in parallel and capture long-range dependencies more effectively through self-attention.		Note that for many NLP tasks, '''{{Term\|transformer\|Transformers}}''' (Vaswani et al., 2017) have largely superseded RNNs due to their ability to process sequences in parallel and capture long-range dependencies more effectively through self-{{Term\|attention}}.

	== See also ==		== See also ==

← Older revision		Revision as of 22:01, 27 April 2026
Line 25:		Line 25:
	== Backpropagation through time (BPTT) ==		== Backpropagation through time (BPTT) ==

	Training an RNN requires computing gradients of the loss with respect to the shared weights. '''~~{{Term\|backpropagation}}~~ through time''' (BPTT) "unrolls" the RNN across time steps, producing a deep feedforward network with shared weights, and then applies standard [[Backpropagation\|backpropagation]].		Training an RNN requires computing gradients of the loss with respect to the shared weights. '''Backpropagation through time''' (BPTT) "unrolls" the RNN across time steps, producing a deep feedforward network with shared weights, and then applies standard [[Backpropagation\|backpropagation]].

	For a sequence of length <math>T</math>, the gradient of the loss with respect to <math>\mathbf{W}_{hh}</math> involves a product of Jacobians:		For a sequence of length <math>T</math>, the gradient of the loss with respect to <math>\mathbf{W}_{hh}</math> involves a product of Jacobians:
Line 37:		Line 37:
	When the spectral radius of the recurrent Jacobian is less than 1, the gradient signal decays exponentially through time — the '''vanishing gradient problem'''. This makes it extremely difficult for vanilla RNNs to learn dependencies that span more than 10–20 time steps.		When the spectral radius of the recurrent Jacobian is less than 1, the gradient signal decays exponentially through time — the '''vanishing gradient problem'''. This makes it extremely difficult for vanilla RNNs to learn dependencies that span more than 10–20 time steps.

	Conversely, when the spectral radius exceeds 1, gradients can grow exponentially — the '''exploding gradient problem'''. Exploding gradients are typically handled by '''~~{{Term\|~~gradient clipping}}''' (capping the gradient norm at a threshold), but vanishing gradients require architectural solutions.		Conversely, when the spectral radius exceeds 1, gradients can grow exponentially — the '''exploding gradient problem'''. Exploding gradients are typically handled by '''gradient clipping''' (capping the gradient norm at a threshold), but vanishing gradients require architectural solutions.

	== Long Short-Term Memory (LSTM) ==		== Long Short-Term Memory (LSTM) ==

	The '''~~{{Term\|long short-term memory\|~~LSTM}}''' (Hochreiter and Schmidhuber, 1997) introduces a '''cell state''' <math>\mathbf{c}_t</math> that flows through time with minimal interference, and three '''gates''' that control the flow of information:		The '''LSTM''' (Hochreiter and Schmidhuber, 1997) introduces a '''cell state''' <math>\mathbf{c}_t</math> that flows through time with minimal interference, and three '''gates''' that control the flow of information:

	:<math>\mathbf{f}_t = \sigma(\mathbf{W}_f[\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_f)</math> ('''forget gate''')		:<math>\mathbf{f}_t = \sigma(\mathbf{W}_f[\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_f)</math> ('''forget gate''')
Line 59:		Line 59:
	== Gated Recurrent Unit (GRU) ==		== Gated Recurrent Unit (GRU) ==

	The '''GRU''' (Cho et al., 2014) simplifies the ~~{{Term\|long short-term memory\|~~LSTM}} by merging the cell state and hidden state and using only two gates:		The '''GRU''' (Cho et al., 2014) simplifies the LSTM by merging the cell state and hidden state and using only two gates:

	:<math>\mathbf{z}_t = \sigma(\mathbf{W}_z[\mathbf{h}_{t-1}, \mathbf{x}_t])</math> ('''update gate''')		:<math>\mathbf{z}_t = \sigma(\mathbf{W}_z[\mathbf{h}_{t-1}, \mathbf{x}_t])</math> ('''update gate''')
Line 69:		Line 69:
	:<math>\mathbf{h}_t = (1 - \mathbf{z}_t) \odot \mathbf{h}_{t-1} + \mathbf{z}_t \odot \tilde{\mathbf{h}}_t</math>		:<math>\mathbf{h}_t = (1 - \mathbf{z}_t) \odot \mathbf{h}_{t-1} + \mathbf{z}_t \odot \tilde{\mathbf{h}}_t</math>

	The GRU has fewer parameters than the ~~{{Term\|long short-term memory\|~~LSTM}} and often achieves comparable performance. In practice, the choice between ~~{{Term\|long short-term memory\|~~LSTM}} and GRU is typically made empirically.		The GRU has fewer parameters than the LSTM and often achieves comparable performance. In practice, the choice between LSTM and GRU is typically made empirically.

	== Bidirectional RNNs ==		== Bidirectional RNNs ==
Line 84:		Line 84:

	* '''Language modelling''' — predicting the next word in a sequence.		* '''Language modelling''' — predicting the next word in a sequence.
	* '''Machine translation''' — encoder-decoder architectures for ~~{{Term\|~~sequence-to-sequence}} translation (Sutskever et al., 2014).		* '''Machine translation''' — encoder-decoder architectures for sequence-to-sequence translation (Sutskever et al., 2014).
	* '''Speech recognition''' — transcribing audio to text (often combined with CTC loss).		* '''Speech recognition''' — transcribing audio to text (often combined with CTC loss).
	* '''Sentiment analysis''' — classifying the sentiment of text.		* '''Sentiment analysis''' — classifying the sentiment of text.
Line 90:		Line 90:
	* '''Music generation''' — generating sequences of notes.		* '''Music generation''' — generating sequences of notes.

	Note that for many NLP tasks, '''~~{{Term\|transformer\|~~Transformers}}''' (Vaswani et al., 2017) have largely superseded RNNs due to their ability to process sequences in parallel and capture long-range dependencies more effectively through self-~~{{Term\|~~attention}}.		Note that for many NLP tasks, '''Transformers''' (Vaswani et al., 2017) have largely superseded RNNs due to their ability to process sequences in parallel and capture long-range dependencies more effectively through self-attention.

	== See also ==		== See also ==

← Older revision		Revision as of 19:42, 27 April 2026
Line 25:		Line 25:
	== Backpropagation through time (BPTT) ==		== Backpropagation through time (BPTT) ==

	Training an RNN requires computing gradients of the loss with respect to the shared weights. '''~~Backpropagation~~ through time''' (BPTT) "unrolls" the RNN across time steps, producing a deep feedforward network with shared weights, and then applies standard [[Backpropagation\|backpropagation]].		Training an RNN requires computing gradients of the loss with respect to the shared weights. '''{{Term\|backpropagation}} through time''' (BPTT) "unrolls" the RNN across time steps, producing a deep feedforward network with shared weights, and then applies standard [[Backpropagation\|backpropagation]].

	For a sequence of length <math>T</math>, the gradient of the loss with respect to <math>\mathbf{W}_{hh}</math> involves a product of Jacobians:		For a sequence of length <math>T</math>, the gradient of the loss with respect to <math>\mathbf{W}_{hh}</math> involves a product of Jacobians:
Line 37:		Line 37:
	When the spectral radius of the recurrent Jacobian is less than 1, the gradient signal decays exponentially through time — the '''vanishing gradient problem'''. This makes it extremely difficult for vanilla RNNs to learn dependencies that span more than 10–20 time steps.		When the spectral radius of the recurrent Jacobian is less than 1, the gradient signal decays exponentially through time — the '''vanishing gradient problem'''. This makes it extremely difficult for vanilla RNNs to learn dependencies that span more than 10–20 time steps.

	Conversely, when the spectral radius exceeds 1, gradients can grow exponentially — the '''exploding gradient problem'''. Exploding gradients are typically handled by '''gradient clipping''' (capping the gradient norm at a threshold), but vanishing gradients require architectural solutions.		Conversely, when the spectral radius exceeds 1, gradients can grow exponentially — the '''exploding gradient problem'''. Exploding gradients are typically handled by '''{{Term\|gradient clipping}}''' (capping the gradient norm at a threshold), but vanishing gradients require architectural solutions.

	== Long Short-Term Memory (LSTM) ==		== Long Short-Term Memory (LSTM) ==

	The '''LSTM''' (Hochreiter and Schmidhuber, 1997) introduces a '''cell state''' <math>\mathbf{c}_t</math> that flows through time with minimal interference, and three '''gates''' that control the flow of information:		The '''{{Term\|long short-term memory\|LSTM}}''' (Hochreiter and Schmidhuber, 1997) introduces a '''cell state''' <math>\mathbf{c}_t</math> that flows through time with minimal interference, and three '''gates''' that control the flow of information:

	:<math>\mathbf{f}_t = \sigma(\mathbf{W}_f[\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_f)</math> ('''forget gate''')		:<math>\mathbf{f}_t = \sigma(\mathbf{W}_f[\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_f)</math> ('''forget gate''')
Line 59:		Line 59:
	== Gated Recurrent Unit (GRU) ==		== Gated Recurrent Unit (GRU) ==

	The '''GRU''' (Cho et al., 2014) simplifies the LSTM by merging the cell state and hidden state and using only two gates:		The '''GRU''' (Cho et al., 2014) simplifies the {{Term\|long short-term memory\|LSTM}} by merging the cell state and hidden state and using only two gates:

	:<math>\mathbf{z}_t = \sigma(\mathbf{W}_z[\mathbf{h}_{t-1}, \mathbf{x}_t])</math> ('''update gate''')		:<math>\mathbf{z}_t = \sigma(\mathbf{W}_z[\mathbf{h}_{t-1}, \mathbf{x}_t])</math> ('''update gate''')
Line 69:		Line 69:
	:<math>\mathbf{h}_t = (1 - \mathbf{z}_t) \odot \mathbf{h}_{t-1} + \mathbf{z}_t \odot \tilde{\mathbf{h}}_t</math>		:<math>\mathbf{h}_t = (1 - \mathbf{z}_t) \odot \mathbf{h}_{t-1} + \mathbf{z}_t \odot \tilde{\mathbf{h}}_t</math>

	The GRU has fewer parameters than the LSTM and often achieves comparable performance. In practice, the choice between LSTM and GRU is typically made empirically.		The GRU has fewer parameters than the {{Term\|long short-term memory\|LSTM}} and often achieves comparable performance. In practice, the choice between {{Term\|long short-term memory\|LSTM}} and GRU is typically made empirically.

	== Bidirectional RNNs ==		== Bidirectional RNNs ==
Line 84:		Line 84:

	* '''Language modelling''' — predicting the next word in a sequence.		* '''Language modelling''' — predicting the next word in a sequence.
	* '''Machine translation''' — encoder-decoder architectures for sequence-to-sequence translation (Sutskever et al., 2014).		* '''Machine translation''' — encoder-decoder architectures for {{Term\|sequence-to-sequence}} translation (Sutskever et al., 2014).
	* '''Speech recognition''' — transcribing audio to text (often combined with CTC loss).		* '''Speech recognition''' — transcribing audio to text (often combined with CTC loss).
	* '''Sentiment analysis''' — classifying the sentiment of text.		* '''Sentiment analysis''' — classifying the sentiment of text.
Line 90:		Line 90:
	* '''Music generation''' — generating sequences of notes.		* '''Music generation''' — generating sequences of notes.

	Note that for many NLP tasks, '''Transformers''' (Vaswani et al., 2017) have largely superseded RNNs due to their ability to process sequences in parallel and capture long-range dependencies more effectively through self-attention.		Note that for many NLP tasks, '''{{Term\|transformer\|Transformers}}''' (Vaswani et al., 2017) have largely superseded RNNs due to their ability to process sequences in parallel and capture long-range dependencies more effectively through self-{{Term\|attention}}.

	== See also ==		== See also ==