Mamba State Space Models
| Article | |
|---|---|
| Topic area | Deep Learning |
| Prerequisites | Recurrent Neural Networks, Transformer, Linear Attention |
Overview
Mamba is a sequence modeling architecture based on selective state space models (selective SSMs) introduced by Albert Gu and Tri Dao in 2023. It was proposed as a competitive alternative to the Transformer for long-sequence tasks, combining the linear-time recurrence of state space models with an input-dependent selection mechanism that lets the model decide, at every time step, what to remember and what to forget. Mamba achieves Transformer-quality performance on language, audio, and genomics benchmarks while scaling linearly in sequence length and supporting context windows in the millions of tokens.
The architecture sits in a lineage that runs from classical control-theoretic state space models, through the structured state space sequence model (S4) and its variants, to modern hardware-aware implementations. Its central contribution is showing that the long-standing trade-off between expressive content-based reasoning (Transformer) and efficient long-context recurrence (linear Recurrent Neural Networks) can be partially resolved by making the SSM parameters depend on the input.
Background: State Space Models
A continuous-time linear state space model maps an input signal $ u(t) \in \mathbb{R} $ to an output $ y(t) \in \mathbb{R} $ through a hidden state $ h(t) \in \mathbb{R}^N $:
$ {\displaystyle h'(t) = A h(t) + B u(t), \quad y(t) = C h(t) + D u(t)} $
For sequence modeling, the system is discretized with a learned step size $ \Delta $ using a zero-order hold or bilinear transform, yielding discrete matrices $ \bar{A} $ and $ \bar{B} $:
$ {\displaystyle h_t = \bar{A} h_{t-1} + \bar{B} u_t, \quad y_t = C h_t} $
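Under the zero-order hold, for example, the discrete parameters take the standard form
$ {\displaystyle \bar{A} = \exp(\Delta A), \quad \bar{B} = (\Delta A)^{-1}\left(\exp(\Delta A) - I\right)\Delta B} $
so that $ \Delta $ controls how much of the continuous dynamics is integrated per step.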
When $ A $, $ B $, $ C $, and $ \Delta $ are independent of the input, the recurrence is linear and time-invariant (LTI). The same map can then be expressed as a global convolution between $ u $ and a structured kernel $ \bar{K} = (C\bar{B}, C\bar{A}\bar{B}, C\bar{A}^2\bar{B}, \dots) $. This dual recurrence/convolution view is the foundation of S4 and lets a model train in parallel like a CNN but run with constant memory at inference like an RNN.
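The equivalence of the two views can be checked directly. The following sketch, which uses small randomly chosen parameters rather than the structured parameterization of S4, evaluates the same discrete LTI SSM first as a recurrence and then as a causal convolution with the kernel $ \bar{K} $:

```python
# Minimal sketch (not the S4 implementation): the same discrete LTI SSM evaluated
# as a step-by-step recurrence and as a causal convolution with kernel K.
import numpy as np

rng = np.random.default_rng(0)
N, L = 4, 16                                    # state size, sequence length
A_bar = 0.9 * np.eye(N) + 0.01 * rng.standard_normal((N, N))  # illustrative stable dynamics
B_bar = rng.standard_normal((N, 1))
C = rng.standard_normal((1, N))
u = rng.standard_normal(L)

# Recurrent view: h_t = A_bar h_{t-1} + B_bar u_t,  y_t = C h_t
h = np.zeros((N, 1))
y_rec = np.zeros(L)
for t in range(L):
    h = A_bar @ h + B_bar * u[t]
    y_rec[t] = (C @ h).item()

# Convolutional view: y_t = sum_k K_k u_{t-k} with K_k = C A_bar^k B_bar
K = np.array([(C @ np.linalg.matrix_power(A_bar, k) @ B_bar).item() for k in range(L)])
y_conv = np.array([sum(K[k] * u[t - k] for k in range(t + 1)) for t in range(L)])

assert np.allclose(y_rec, y_conv)               # the two views agree for LTI parameters
```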
S4 made this practical by parameterizing $ A $ with the HiPPO theory of online function approximation, ensuring the kernel has a tractable structure and stable long-range dynamics.
The Selection Mechanism
The S4 family is fundamentally LTI: every input token is processed by the same dynamics. Mamba's key insight is that LTI systems cannot perform content-based reasoning, because they cannot selectively propagate or ignore information depending on what each token actually contains. Tasks like the selective copying benchmark, where the model must reproduce a subsequence while ignoring filler tokens, require time-varying behavior.
Mamba addresses this by making $ B $, $ C $, and $ \Delta $ functions of the input $ u_t $:
$ {\displaystyle B_t = s_B(u_t), \quad C_t = s_C(u_t), \quad \Delta_t = \tau_\Delta(\mathrm{Parameter} + s_\Delta(u_t))} $
The matrix $ A $ is kept input-independent and parameterized in a structured form (real or complex diagonal). The discretized transition matrix $ \bar{A}_t = \exp(\Delta_t A) $ is nevertheless input-dependent through $ \Delta_t $. A small $ \Delta_t $ makes the state ignore the current input and persist; a large $ \Delta_t $ makes the state reset toward the current input. Selection therefore acts as a learned, content-conditioned gate over the entire hidden trajectory.
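A minimal, single-channel sketch of selection is given below. The projection matrices `W_B`, `W_C`, `W_dt` stand in for the learned functions $ s_B $, $ s_C $, $ s_\Delta $, and the discretization of $ B $ is simplified to an Euler step for brevity; names and shapes are illustrative rather than the reference implementation:

```python
# Sketch of the selection mechanism for one SSM channel, assuming a real diagonal A.
import numpy as np

rng = np.random.default_rng(0)
N, d, L = 8, 16, 32                      # state size N, feature dim d, sequence length L
A = -np.exp(rng.standard_normal(N))      # negative real diagonal A keeps the state stable
W_B = rng.standard_normal((N, d))        # stand-in for s_B
W_C = rng.standard_normal((N, d))        # stand-in for s_C
W_dt = rng.standard_normal(d)            # stand-in for s_Delta
dt_bias = -1.0                           # the "Parameter" term in Delta_t
x = rng.standard_normal((L, d))          # token features; channel 0 plays the role of u_t

softplus = lambda z: np.log1p(np.exp(z))

h = np.zeros(N)
y = np.zeros(L)
for t in range(L):
    B_t = W_B @ x[t]                     # B_t = s_B(u_t)
    C_t = W_C @ x[t]                     # C_t = s_C(u_t)
    dt = softplus(dt_bias + W_dt @ x[t]) # Delta_t > 0; small -> persist, large -> reset
    A_bar = np.exp(dt * A)               # \bar{A}_t = exp(Delta_t A), element-wise for diagonal A
    B_bar = dt * B_t                     # Euler-style discretization of B, for brevity
    h = A_bar * h + B_bar * x[t, 0]      # selective recurrence on the channel input u_t
    y[t] = C_t @ h                       # y_t = C_t h_t
```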
This change breaks the LTI property and removes the global convolution view. Mamba is a linear-time selective recurrence that can no longer be evaluated as a fixed convolution.
Hardware-Aware Parallel Scan
Without the convolutional shortcut, naive training would be sequential and slow. Mamba recovers parallelism via a hardware-aware parallel scan (associative scan) that exploits the associativity of the recurrence. The authors implement this scan in a fused CUDA kernel that:
- keeps the expanded state tensor in SRAM rather than materializing it in HBM,
- performs discretization, scan, and output projection in a single pass,
- recomputes intermediate states during the backward pass instead of storing them.
The technique mirrors the memory-hierarchy reasoning behind FlashAttention and yields wall-clock speed competitive with optimized attention implementations even at long contexts, while using only $ O(L) $ memory in sequence length $ L $. At inference, Mamba runs as a pure recurrence with constant per-token cost and a fixed-size hidden state, in contrast to the Transformer's growing key-value cache.
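The core idea behind the scan can be illustrated without any CUDA: each step of the diagonal recurrence is an affine map $ h \mapsto \bar{A}_t h + \bar{B}_t u_t $, and composition of affine maps is associative, so prefixes can be combined in any bracketing. The sketch below, with illustrative shapes and a sequential loop in place of the fused kernel, checks that composing the per-step maps reproduces the recurrence:

```python
# Why the selective recurrence admits a parallel scan: pairs (a, b) representing
# the affine map h -> a*h + b compose associatively. This is a sequential reference;
# the fused CUDA kernel performs the same composition in parallel across the sequence.
import numpy as np

def combine(left, right):
    """Compose affine maps: applying `left` first, then `right`, to a state h."""
    a1, b1 = left
    a2, b2 = right
    return a2 * a1, a2 * b1 + b2

rng = np.random.default_rng(0)
L, N = 16, 4
a = rng.uniform(0.5, 1.0, size=(L, N))   # per-step \bar{A}_t (diagonal)
b = rng.standard_normal((L, N))          # per-step \bar{B}_t u_t

# Sequential recurrence for reference
h = np.zeros(N)
h_seq = []
for t in range(L):
    h = a[t] * h + b[t]
    h_seq.append(h.copy())

# The same states via cumulative composition of the affine maps (what the scan parallelizes)
acc = (np.ones(N), np.zeros(N))           # identity map
h_scan = []
for t in range(L):
    acc = combine(acc, (a[t], b[t]))
    h_scan.append(acc[1])                 # with h_0 = 0, the composed offset equals h_t

assert np.allclose(np.array(h_seq), np.array(h_scan))
```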
Architecture: The Mamba Block
A Mamba layer wraps a selective SSM in a gated block reminiscent of gated linear units. Given an input residual stream $ x $:
- Two parallel linear projections expand $ x $ from model dimension $ d $ to an inner dimension $ e \cdot d $ (typically $ e = 2 $), producing branches $ u $ and $ z $.
- Branch $ u $ passes through a short causal 1D convolution (kernel size 3 or 4), then a SiLU activation, then the selective SSM described above.
- Branch $ z $ passes through a SiLU and acts as a multiplicative gate on the SSM output.
- The gated result is projected back to dimension $ d $ and added to the residual stream.
Mamba networks stack these blocks with normalization layers (typically RMSNorm), using neither attention layers nor separate feed-forward layers; the SSM and the gating absorb both roles. This gives a homogeneous architecture similar to a deep RNN but trained with the parallel scan.
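A shape-level sketch of this data flow is shown below, assuming model dimension $ d $, expansion factor $ e = 2 $, and a placeholder in place of the selective SSM; the weight matrices are random stand-ins rather than trained parameters:

```python
# Shape-level sketch of a Mamba block's forward pass; selective_ssm is a placeholder.
import numpy as np

rng = np.random.default_rng(0)
d, e, L, k = 64, 2, 128, 4                # model dim, expansion, sequence length, conv kernel size
d_inner = e * d

silu = lambda z: z / (1.0 + np.exp(-z))   # SiLU activation

W_in = rng.standard_normal((d, 2 * d_inner)) / np.sqrt(d)     # fused projection for both branches
W_out = rng.standard_normal((d_inner, d)) / np.sqrt(d_inner)  # output projection
conv_w = rng.standard_normal((d_inner, k)) / np.sqrt(k)       # depthwise causal conv filters

def causal_conv1d(u):                     # u: (L, d_inner)
    pad = np.vstack([np.zeros((k - 1, d_inner)), u])          # left-pad so position t sees only t-k+1..t
    return np.stack([(pad[t:t + k].T * conv_w).sum(axis=1) for t in range(L)])

def selective_ssm(u):                     # placeholder for the selective scan described above
    return u                              # identity stand-in; the real SSM mixes along time

def mamba_block(x):                       # x: (L, d) residual stream
    u, z = np.split(x @ W_in, 2, axis=-1) # two parallel branches of width d_inner
    u = selective_ssm(silu(causal_conv1d(u)))
    y = u * silu(z)                       # gating branch modulates the SSM output
    return x + y @ W_out                  # project back to d and add the residual

out = mamba_block(rng.standard_normal((L, d)))
```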
Training and Inference
During training, the parallel scan computes outputs for an entire sequence in parallel on the GPU, with gradients backpropagated through the recurrence via the recomputed states. Optimizer choices (AdamW, cosine schedules, weight decay) and initialization broadly follow Transformer practice; the main Mamba-specific considerations are the parameterization of $ A $ and the initialization of the $ \Delta $ bias, which controls how strongly the state persists at the start of training.
At inference, Mamba switches to recurrent mode. Each new token requires:
- one update of the hidden state $ h_t = \bar{A}_t h_{t-1} + \bar{B}_t u_t $,
- one read-out $ y_t = C_t h_t $.
Both the time and the memory per token are independent of the context length, in contrast to a Transformer whose attention scales as $ O(L) $ per token and whose KV cache grows linearly with $ L $. This makes Mamba attractive for streaming applications and for very long contexts.
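The per-token cost is easy to see in code. The sketch below shows one decoding step with illustrative shapes: `ssm_state` is the fixed-size analogue of a KV cache, and the discretized, input-dependent parameters are assumed to have already been computed from the current token as in the selection sketch above:

```python
# One recurrent decoding step with a fixed-size state; nothing here grows with
# the number of tokens processed so far. Shapes are illustrative.
import numpy as np

N, d_inner = 16, 128
ssm_state = np.zeros((d_inner, N))            # one length-N state per inner channel

def decode_step(u_t, A_bar_t, B_bar_t, C_t, state):
    """u_t: (d_inner,); A_bar_t, B_bar_t: (d_inner, N); C_t: (N,); state: (d_inner, N)."""
    state = A_bar_t * state + B_bar_t * u_t[:, None]  # h_t = A_bar_t h_{t-1} + B_bar_t u_t
    y_t = state @ C_t                                  # y_t = C_t h_t, one output per channel
    return y_t, state

rng = np.random.default_rng(0)
y_t, ssm_state = decode_step(
    u_t=rng.standard_normal(d_inner),
    A_bar_t=rng.uniform(0.5, 1.0, size=(d_inner, N)),
    B_bar_t=rng.standard_normal((d_inner, N)),
    C_t=rng.standard_normal(N),
    state=ssm_state,
)
```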
Variants and Extensions
Several follow-up architectures build on the selective SSM idea:
- Mamba-2 reformulates the selective SSM through a structured state space duality (SSD) that connects it to a restricted form of Linear Attention, enabling matrix-multiplication-based implementations and larger state dimensions at similar speed.
- Vision Mamba and VMamba apply selective SSMs to image patches with bidirectional or cross-scan orderings, replacing the strictly causal, left-to-right scan of language Mamba with traversal orders suited to 2D structure.
- Hybrid models such as Jamba and Zamba interleave Mamba blocks with a small number of attention layers, keeping linear-time scaling for most of the network while preserving the in-context retrieval strength of attention.
- Domain-specific variants have been applied to genomics (long DNA contexts), audio, time-series forecasting, and graph data.
Comparisons
Compared with the Transformer, Mamba trades global content-addressed retrieval for input-dependent recurrence. Empirically:
- Throughput at long contexts is several times higher because there is no quadratic attention.
- Quality on standard language modeling benchmarks is broadly comparable at matched parameter counts up to a few billion parameters; pure-Mamba models tend to lag attention-based models on tasks that demand precise multi-hop retrieval over long contexts, which is part of the motivation for hybrid architectures.
- Inference is much cheaper for long generations due to constant-size state.
Compared with prior linear-time models such as LSTMs, linear Transformers, and S4, Mamba's selection mechanism is the main differentiator: it allows content-based forgetting and copying that LTI SSMs cannot express, while preserving linear scaling.
Limitations
Mamba is not a strict superset of attention. Known limitations include:
- Associative recall: retrieving an exact value associated with an exact key from earlier in the context is harder for a fixed-size state than for attention, which can index into the entire context. Hybrid models mitigate this.
- In-context learning on certain synthetic tasks (e.g. induction heads with rare tokens) is weaker than in attention-based models of similar size.
- Numerical stability of the selective recurrence requires careful parameterization of $ A $ and $ \Delta $; aggressive choices can lead to vanishing or exploding hidden states over very long sequences.
- Tooling for selective SSMs is less mature than for attention; many quantization, distillation, and serving stacks were originally optimized for Transformer KV caches.
Whether selective SSMs scale as gracefully as attention to frontier model sizes (tens to hundreds of billions of parameters) is still an active empirical question.