Long Short-Term Memory
| Article | |
|---|---|
| Topic area | Deep Learning |
| Prerequisites | Recurrent Neural Network, Backpropagation, Gradient Descent |
Overview
Long Short-Term Memory (LSTM) is a Recurrent Neural Network architecture designed to learn long-range dependencies in sequential data while avoiding the vanishing and exploding gradient problems that afflict standard recurrent networks. Introduced by Sepp Hochreiter and Jürgen Schmidhuber in 1997,[1] the LSTM cell augments a recurrent unit with an internal cell state protected by multiplicative gates that learn to read, write, and erase information selectively. For roughly two decades the LSTM was the dominant neural architecture for sequential modeling, powering production systems for speech recognition, machine translation, handwriting recognition, and language modeling, before being largely supplanted by the Transformer in domains where parallel training dominates. It remains widely used in low-latency online inference, time-series forecasting, control, and any setting where strict sequential causality and bounded memory are advantages rather than constraints.
The defining idea of the LSTM is the constant error carousel: a linear self-loop on the cell state that lets gradients flow backward through arbitrarily many time steps without exponential decay. Gates learn when to let information into this loop, when to forget it, and when to expose it to the rest of the network.
Motivation: the vanishing gradient problem
A simple recurrent network maintains a hidden state $ h_t = \sigma(W_h h_{t-1} + W_x x_t + b) $ and is trained by backpropagation through time (BPTT). The gradient of a loss at time $ T $ with respect to a hidden state at time $ t $ involves a product of $ T - t $ Jacobians of the recurrent map. When the spectral radius of these Jacobians is below one the gradient shrinks exponentially with horizon and the network cannot learn dependencies more than a handful of steps apart; when above one the gradient explodes and training diverges.[2] Hochreiter's 1991 thesis identified this as the central obstacle to training deep recurrent models.
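The decay can be seen numerically. The NumPy sketch below is purely illustrative (the dimensions, weight scaling, and random pre-activations are assumptions, and tanh is chosen for concreteness): it accumulates the product of BPTT Jacobians of a small recurrent map and prints its operator norm as the horizon grows.

```python
import numpy as np

# Illustrative only: accumulate the product of BPTT Jacobians diag(tanh'(a)) @ W_h
# for a tanh RNN. With the recurrent weights scaled so each factor has norm below
# one, the product's operator norm shrinks roughly exponentially with the horizon.
rng = np.random.default_rng(0)
d = 32
W_h = rng.normal(scale=0.3 / np.sqrt(d), size=(d, d))  # assumed small scaling

J = np.eye(d)
for step in range(1, 101):
    a = rng.normal(size=d)                     # arbitrary pre-activations at this step
    D = np.diag(1.0 / np.cosh(a) ** 2)         # tanh'(a) = sech^2(a), elementwise
    J = D @ W_h @ J                            # one more Jacobian factor in the product
    if step % 20 == 0:
        print(step, np.linalg.norm(J, ord=2))  # operator norm decays toward zero
```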
LSTMs address the problem at the architectural level: rather than relying on careful initialization or normalization, they introduce a path through time along which the gradient propagates with unit Jacobian by default, and learn gates that perturb that flow only when useful.
Architecture
A standard LSTM cell at time step $ t $ takes the previous hidden state $ h_{t-1} $, the previous cell state $ c_{t-1} $, and an input vector $ x_t $, and produces a new hidden state and cell state. Three sigmoid gates control the flow of information:
- The forget gate $ f_t $ decides which components of the previous cell state to erase.
- The input gate $ i_t $ decides which components of a candidate update $ \tilde{c}_t $ to write into the cell state.
- The output gate $ o_t $ decides which components of the cell state to expose as the hidden state.
Each gate is a learned linear projection of $ [h_{t-1}, x_t] $ followed by a sigmoid, so its outputs lie in $ (0,1) $ and act as soft binary masks under elementwise multiplication. The cell state is updated additively, which is the structural property that preserves gradient flow.
Formulation
Let $ \sigma $ denote the logistic sigmoid and $ \odot $ elementwise multiplication. With weights $ W_\bullet $, recurrent weights $ U_\bullet $, and biases $ b_\bullet $ for each gate, the LSTM update rules are
$ {\displaystyle f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)} $
$ {\displaystyle i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)} $
$ {\displaystyle o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)} $
$ {\displaystyle \tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)} $
$ {\displaystyle c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t} $
$ {\displaystyle h_t = o_t \odot \tanh(c_t)} $
The first four equations can be computed with a single matrix multiplication producing a vector of width $ 4d $, where $ d $ is the hidden size; efficient implementations are written this way. The cell state update $ c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t $ is the constant error carousel: when $ f_t \approx 1 $, gradients with respect to $ c_{t-1} $ propagate backward through the additive path with no nonlinear squashing.
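A minimal NumPy sketch of one cell step follows; the fused weight layout and the gate order (input, forget, output, candidate) are implementation assumptions rather than part of the definition.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step with the four pre-activations fused into a single projection.

    Illustrative shapes: x is (n,), h_prev and c_prev are (d,),
    W is (4d, n), U is (4d, d), b is (4d,). Assumed gate order: i, f, o, candidate.
    """
    z = W @ x + U @ h_prev + b           # single projection of width 4d
    d = h_prev.shape[0]
    i = sigmoid(z[0 * d:1 * d])          # input gate
    f = sigmoid(z[1 * d:2 * d])          # forget gate
    o = sigmoid(z[2 * d:3 * d])          # output gate
    g = np.tanh(z[3 * d:4 * d])          # candidate update, tilde{c}_t
    c = f * c_prev + i * g               # constant error carousel: additive update
    h = o * np.tanh(c)
    return h, c
```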
A widely used heuristic is to initialize the forget gate bias $ b_f $ to a large positive value (often 1 or 2), so that $ f_t \approx 1 $ at the start of training and the cell state defaults to remembering rather than forgetting.[3]
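Continuing the sketch above, the heuristic amounts to setting the forget-gate slice of the fused bias vector to a positive constant at initialization; the value 1.0 below is one commonly cited choice, and the weight scaling is an assumption for the example.

```python
def init_params(n, d, forget_bias=1.0, seed=0):
    """Illustrative initialization for the fused layout used in lstm_step."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=1.0 / np.sqrt(n), size=(4 * d, n))
    U = rng.normal(scale=1.0 / np.sqrt(d), size=(4 * d, d))
    b = np.zeros(4 * d)
    b[1 * d:2 * d] = forget_bias         # bias the forget gate toward 1 (remembering)
    return W, U, b
```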
Training and inference
LSTMs are trained by truncated backpropagation through time: the loss is summed over a window of length $ T $ and gradients are propagated backward through the unrolled network, typically with Gradient Clipping to control the rare exploding-gradient events that the architecture does not eliminate. Standard optimizers such as Adam or SGD with momentum work well in practice. Regularization usually combines weight decay with Dropout applied to the non-recurrent connections, since naive dropout on the recurrent path damages the gradient highway; variants such as variational dropout apply the same mask across time to preserve it.[4]
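The sketch below shows one way such a loop can look in PyTorch; the layer sizes, optimizer settings, and clipping threshold are assumptions chosen for illustration, not prescribed values.

```python
import torch
import torch.nn as nn

# Illustrative truncated-BPTT loop with gradient clipping (all sizes are assumed).
model = nn.LSTM(input_size=64, hidden_size=256, num_layers=2, batch_first=True)
head = nn.Linear(256, 64)
params = list(model.parameters()) + list(head.parameters())
opt = torch.optim.Adam(params, lr=1e-3)

def train_on_stream(chunks, targets):
    """chunks and targets: iterables of tensors shaped (batch, T, 64)."""
    state = None
    for x, y in zip(chunks, targets):
        out, state = model(x, state)
        loss = nn.functional.mse_loss(head(out), y)
        opt.zero_grad()
        loss.backward()
        # Clip to control the exploding-gradient events the architecture does not eliminate.
        torch.nn.utils.clip_grad_norm_(params, 1.0)
        opt.step()
        # Detach so the next window does not backpropagate into earlier ones (truncation).
        state = tuple(s.detach() for s in state)
```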
Inference is inherently sequential: $ c_t $ depends on $ c_{t-1} $, so the time per token cannot be parallelized across the temporal axis the way a Transformer forward pass can. This sequential dependency is the chief reason LSTMs lost ground for large-scale pretraining, but the same property makes their per-token cost constant in sequence length and their memory footprint bounded, which is attractive for streaming inference and on-device deployment.
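A streaming sketch, again with assumed sizes: the entire recurrent state is one fixed-size $ (h, c) $ pair, so per-token compute and memory do not grow with the length of the stream.

```python
import torch
import torch.nn as nn

# Illustrative streaming inference: state is a fixed-size (h, c) pair, so memory
# and per-token work stay constant no matter how long the stream runs.
cell = nn.LSTMCell(input_size=64, hidden_size=256)
h = torch.zeros(1, 256)
c = torch.zeros(1, 256)
for x_t in torch.randn(1000, 1, 64):   # one token at a time
    h, c = cell(x_t, (h, c))           # O(d^2) work per token, independent of position
```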
Variants
Many LSTM variants have been proposed; ablation studies suggest most yield small gains and that the forget gate and the output activation are the load-bearing components.[5]
- Peephole connections allow the gates to read the cell state directly, replacing $ U_\bullet h_{t-1} $ with terms that include $ c_{t-1} $ or $ c_t $.
- Coupled input and forget gates tie $ i_t = 1 - f_t $, so the input gate needs no parameters of its own.
- The Gated Recurrent Unit (GRU) merges the cell and hidden states and uses two gates instead of three, often matching LSTM performance with fewer parameters.
- Bidirectional LSTMs run two LSTMs in opposite directions and concatenate their hidden states, exposing both past and future context for non-causal tasks such as tagging.
- ConvLSTM replaces the matrix multiplications with convolutions, preserving spatial structure for video and spatiotemporal forecasting.
- Stacked LSTMs compose several LSTM layers vertically, with the hidden state of layer $ \ell $ serving as the input to layer $ \ell+1 $; this was the standard configuration for production neural machine translation systems prior to the Transformer (a configuration sketch combining stacking and bidirectionality follows this list).
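A configuration sketch for the stacked and bidirectional variants, using PyTorch's built-in LSTM module; the sizes and dropout rate are illustrative assumptions.

```python
import torch.nn as nn

# Assumed sizes: a 3-layer stacked, bidirectional LSTM encoder. With
# bidirectional=True the output at each step concatenates the forward and
# backward hidden states, so its feature size is 2 * hidden_size.
encoder = nn.LSTM(input_size=128, hidden_size=256, num_layers=3,
                  bidirectional=True, batch_first=True, dropout=0.2)
```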
Compared to a vanilla RNN, the LSTM trades roughly 4x the parameter count and compute per step for dramatically better gradient flow and, empirically, the ability to learn dependencies hundreds of steps apart. Compared to the GRU, LSTMs have a separate cell state and one extra gate; on most benchmarks the two are close, with GRUs slightly faster and LSTMs slightly more expressive on long sequences. Compared to the Transformer, LSTMs have constant per-token memory and compute and a strict causal inductive bias, but lack direct attention over arbitrary positions and cannot be trained with the same degree of parallelism, which limits how much they benefit from large-scale pretraining. Recent linear-attention and state-space models such as Mamba revisit ideas closely related to the LSTM cell state under a more parallelizable formulation.[6]
Applications
Through the 2010s LSTMs were the workhorse architecture for sequence modeling. Notable deployments include Google's neural machine translation system,[7] acoustic models for large-vocabulary speech recognition,[8] handwriting recognition, optical character recognition, and early language models. They remain common in time-series forecasting, financial modeling, anomaly detection, and reinforcement-learning policies that operate on partial observations, where the bounded recurrent state is itself useful as a learned summary of history.
Limitations
LSTMs do not eliminate exploding gradients, only mitigate vanishing ones; gradient clipping is still standard. Their sequential forward pass prevents the kind of teacher-forced parallel training that makes Transformers efficient on accelerators, which is why scaling to billions of parameters and trillions of tokens has been pursued almost exclusively with attention-based models. The cell state is a fixed-width vector, so an LSTM cannot in principle store an unbounded amount of past context; long-range retrieval tasks that a Transformer solves by attending to a distant token must instead be compressed into the recurrent state. Empirically, LSTMs also tend to underperform attention-based models on tasks requiring sharp, content-addressable lookup, while remaining competitive or superior on tasks dominated by short- to medium-range temporal structure.
References
- ↑ Hochreiter, S. and Schmidhuber, J., "Long Short-Term Memory", Neural Computation 9(8):1735-1780, 1997.
- ↑ Bengio, Y., Simard, P., and Frasconi, P., "Learning long-term dependencies with gradient descent is difficult", IEEE Transactions on Neural Networks 5(2):157-166, 1994.
- ↑ Jozefowicz, R., Zaremba, W., and Sutskever, I., "An empirical exploration of recurrent network architectures", ICML, 2015.
- ↑ Gal, Y. and Ghahramani, Z., "A theoretically grounded application of dropout in recurrent neural networks", NeurIPS, 2016.
- ↑ Greff, K., Srivastava, R. K., Koutnik, J., Steunebrink, B. R., and Schmidhuber, J., "LSTM: A search space odyssey", IEEE TNNLS 28(10):2222-2232, 2017.
- ↑ Gu, A. and Dao, T., "Mamba: Linear-time sequence modeling with selective state spaces", arXiv:2312.00752, 2023.
- ↑ Wu, Y. et al., "Google's neural machine translation system: Bridging the gap between human and machine translation", arXiv:1609.08144, 2016.
- ↑ Sak, H., Senior, A., and Beaufays, F., "Long short-term memory recurrent neural network architectures for large scale acoustic modeling", Interspeech, 2014.