The Post-Transformer Era: State Space Models, Mamba, and What Comes After Attention

What if the most important architectural innovation since Transformers isn't trying to replace attention — but to escape its quadratic scaling problem entirely?
I've been watching State Space Models go from "interesting paper" to "IBM ships it in production" in about two years. Mamba showed up in December 2023 as a research curiosity. By late 2025, IBM built Granite 4.0 on it. AI21 shipped Jamba with 256K context on a single GPU. Mistral released Codestral Mamba and it beat CodeLlama 34B at code generation — with a pure SSM, no attention at all.
The field moved fast enough that most practitioners I talk to are still working off outdated assumptions. "Mamba can't do in-context learning." "SSMs are just fancy RNNs." "You need special hardware." None of that is true anymore, and the gap between what people think and what's actually shipping is getting wider.
Here's what's actually going on.
TL;DR
This post covers how selective state spaces work, why they scale linearly where Transformers scale quadratically, and which production models you should care about. The short version: Mamba achieves 5x higher throughput than Transformers with O(n) scaling. But pure SSMs still struggle with retrieval tasks. Hybrid architectures — a handful of attention layers mixed into a stack of Mamba layers — are winning in production. You'll walk away with a decision framework for when to use what.
The quadratic problem
Every Transformer layer computes this:
Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) * V
That QK^T term is an n × n matrix, where n is your sequence length. Every token attends to every other token. The complexity is O(n² · d) per layer.
When Vaswani published "Attention Is All You Need" in 2017, sequences were 512 tokens long. The quadratic cost was a rounding error. Then context windows started growing.
| Sequence Length | Attention Pairs | KV Cache (7B model, fp16, batch 1) |
|---|---|---|
| 2K (GPT-3 era) | 4 million | ~1 GB |
| 4K tokens | 16 million | ~2 GB |
| 32K tokens | 1 billion | ~16 GB |
| 128K tokens | ~17 billion | ~64 GB |
| 1M tokens | 1 trillion | ~512 GB (impractical) |
That 128K row is where things get ugly. A 7B parameter Transformer at 128K context burns roughly 64 GB on the KV cache for a single sequence, and well over 100 GB once you batch even a couple of long requests. That's the memory cost of storing key and value tensors so each new token can attend to everything before it. The model weights themselves might only be 14 GB in half precision. The cache dwarfs the model.
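If you want to sanity-check that column, the arithmetic is short. Here's a rough sketch assuming Llama-2-7B-like dimensions (32 layers, 32 KV heads, head dimension 128) and an fp16 cache; grouped-query attention, batch size, and cache precision all shift the exact numbers:

```python
# Rough KV-cache arithmetic, assuming Llama-2-7B-like dimensions and an fp16 cache.
# Grouped-query attention, batch size, and cache precision all shift these numbers.
n_layers, n_kv_heads, head_dim, bytes_per_val = 32, 32, 128, 2

bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_val  # 2 = keys + values
print(bytes_per_token // 1024, "KB per token")  # 512 KB

for seq_len in (2_048, 4_096, 32_768, 131_072):
    print(seq_len, round(bytes_per_token * seq_len / 2**30), "GB per sequence")
# 2048 -> 1 GB, 4096 -> 2 GB, 32768 -> 16 GB, 131072 -> 64 GB.
# Batch a couple of long requests and you're past 100 GB of cache alone.
```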
```python
# The scaling gap in one snippet
def attention_flops(seq_len):
    return seq_len ** 2  # O(n²)

def mamba_flops(seq_len):
    return seq_len       # O(n)

# At 128K tokens (131,072):
# Attention: 131,072² ≈ 17.2 billion pairwise ops per head
# Mamba:     131,072 state updates
# That's a 131,072x difference. Per layer.
```
This wasn't a problem when GPT-3 had a 2K context window. It became a problem when the field decided it needed models that could read entire codebases, process hour-long transcripts, and maintain conversations that span days. Claude runs at 200K context. Gemini hit 1M+. Reaching those numbers with pure attention requires staggering amounts of memory and compute.
The whole industry spent 2023-2024 trying to fix this with engineering patches. FlashAttention. KV cache quantization. Sliding window attention. Ring attention. All useful. None of them change the fundamental math. The complexity is still quadratic. You're just making each unit of quadratic work cheaper.
State Space Models take a different approach: change the math.
How selective state spaces work
The lineage goes HiPPO (2020) → S4 (2021) → Mamba (2023). Each step solved a specific limitation.
HiPPO (Albert Gu, 2020) figured out that you could represent a running history of a sequence as coefficients of orthogonal polynomials — Legendre, Laguerre — updated continuously. Think of it as a mathematical compression scheme: instead of storing every past token, you project the history onto a set of basis functions and keep just the coefficients. This gave SSMs a principled way to compress long-range context into a fixed-size state without the information just decaying to zero like it does in vanilla RNNs.
S4 (2021-2022) proved that properly structured SSMs, initialized with HiPPO matrices, could handle sequences of tens of thousands of steps, demolishing Transformers on the Long Range Arena benchmark. S4 exploited a key equivalence: a linear time-invariant SSM can be computed as a convolution, allowing parallel training on GPUs. This spawned a family of variants (S4D, S5, DSS) through 2022, each simplifying the parameterization.
But S4 had a fatal limitation: its parameters were fixed. The A, B, C matrices didn't change based on what the model was actually reading. Every token got processed identically. The model couldn't decide "this token matters, pay attention" versus "this is noise, forget it." In the paper's language, S4 lacked content-based reasoning.
Mamba (Albert Gu & Tri Dao, December 2023) fixed exactly that problem. The core idea: make the SSM parameters functions of the input.
The underlying system is deceptively simple. You have a continuous-time state equation:
h'(t) = A · h(t) + B · x(t)
y(t) = C · h(t)
State h gets updated based on input x, modulated by matrices A, B, C. Output y reads from the state through C. Discretize it (zero-order hold) and you get a recurrence:
h_t = Ā · h_{t-1} + B̄ · x_t
y_t = C · h_t
Each step is O(N) where N is the state dimension, constant with respect to sequence length. The state h is a fixed-size vector regardless of whether you've processed 100 tokens or 100,000.
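To make the fixed-size-state point concrete, here's a minimal sketch of the discretized recurrence with toy, hand-picked Ā, B̄, C values (not Mamba's actual parameterization). The state after 100,000 tokens has exactly the same shape as after 100:

```python
import numpy as np

# Toy discretized SSM recurrence with fixed, hand-picked parameters
# (illustrative only -- not Mamba's actual values).
N = 16                                  # state dimension, Mamba's default
A_bar = 0.9 * np.eye(N)                 # decay the previous state
B_bar = 0.1 * np.ones((N, 1))           # write the current input into the state
C = np.ones((1, N)) / N                 # read the output from the state

def run(x):
    h = np.zeros((N, 1))                # the entire memory of everything seen so far
    ys = []
    for x_t in x:
        h = A_bar @ h + B_bar * x_t     # O(N) work per token, independent of length
        ys.append((C @ h).item())
    return ys, h

_, h_short = run(np.random.randn(100))
_, h_long = run(np.random.randn(100_000))
print(h_short.shape, h_long.shape)      # both (16, 1): the state never grows
```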
What makes Mamba different from every SSM before it is selectivity. Instead of fixed parameters:
Δ ← softplus(Parameter + s_Δ(x))   (input-dependent)
B ← Linear_B(x)                    (input-dependent)
C ← Linear_C(x)                    (input-dependent)
A                                  (stays fixed)
The step size Δ controls how much the model focuses on the current input versus preserving previous state. Large Δ means "gate open, let this in." Small Δ means "gate closed, keep what I have." B and C also adapt to the input, allowing content-dependent reading and writing of state.
This is formally a generalization of RNN gating. The Mamba paper proves it (Theorem 1): when N=1, A=−1, B=1, the selective SSM reduces to h_t = (1 − g_t) · h_{t-1} + g_t · x_t, which is exactly the classical gated recurrence. But with N=16 (the default), you get a state that's 16x richer than any gated RNN ever had.
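You can check that reduction numerically. Here's a small sketch under the theorem's assumptions (N=1, A=−1, B=C=1, zero-order-hold discretization), where the z values stand in for the Linear(x) outputs that drive Δ:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softplus(z):
    return np.log1p(np.exp(z))

rng = np.random.default_rng(0)
x = rng.standard_normal(8)   # a toy 1-D input sequence
z = rng.standard_normal(8)   # stand-ins for the Linear(x_t) outputs driving the step size

A, B = -1.0, 1.0             # Theorem 1's special case: N=1, A=-1, B=1 (and C=1)
h_ssm, h_gate = 0.0, 0.0
for t in range(8):
    # Selective SSM step with zero-order-hold discretization
    delta = softplus(z[t])                     # input-dependent step size
    A_bar = np.exp(delta * A)                  # = exp(-delta) = 1 - sigmoid(z_t)
    B_bar = (np.exp(delta * A) - 1.0) / A * B  # = 1 - exp(-delta) = sigmoid(z_t)
    h_ssm = A_bar * h_ssm + B_bar * x[t]

    # Classical gated recurrence with gate g_t = sigmoid(z_t)
    g = sigmoid(z[t])
    h_gate = (1.0 - g) * h_gate + g * x[t]

    assert np.isclose(h_ssm, h_gate)           # identical in this special case
```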
Here's the catch: making parameters input-dependent breaks the convolution equivalence that S4 relied on for fast parallel training. You can't precompute a fixed convolution kernel when the kernel changes at every step. Mamba sidesteps this with a hardware-aware parallel scan algorithm.
Instead of materializing the full expanded state (shape B×L×D×N) in GPU HBM (slow memory), Mamba loads parameters into SRAM (fast memory), performs discretization and the recurrence in SRAM, and writes only the output (shape B×L×D) back to HBM. This gets 20-40x speedup over a naive implementation, up to 3x over naive recurrence on A100s. During training, intermediate states are recomputed during backprop instead of stored, trading compute for memory.
The architecture stacks these blocks with expansion factor 2, SiLU activation, and LayerNorm. No positional encoding needed. The recurrence inherently provides position information. Two Mamba blocks per layer match the parameter count (12D²) of a standard Transformer's MHA + MLP.
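A back-of-the-envelope parameter count backs that up. This sketch only tracks the dominant projections and ignores the small convolution and Δ/B/C projections:

```python
D = 2048  # model width (illustrative)
E = 2     # Mamba expansion factor

# Dominant projections in one Mamba block:
# in_proj maps D -> 2*E*D (the expanded x and the gate z), out_proj maps E*D -> D.
mamba_block = 2 * E * D * D + E * D * D       # = 6 D^2

# One Transformer layer: MHA (Q, K, V, O projections) + MLP with 4x expansion.
transformer_layer = 4 * D * D + 8 * D * D     # = 12 D^2

print(2 * mamba_block == transformer_layer)   # True: two Mamba blocks ~ one Transformer layer
```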
The result: Mamba-3B matches Transformer-6B quality on language modeling. Mamba-2.8B hits 63.3% zero-shot accuracy versus Pythia-2.8B's 59.1%. 5x higher generation throughput. Linear scaling to million-length sequences. On DNA modeling at 1M sequence length, Mamba's quality improves with context while HyenaDNA degrades.
Mamba-2 and Mamba-3
Mamba-1 proved the concept. The follow-ups refined it.
Mamba-2 (Tri Dao & Albert Gu, May 2024) introduced the State Space Duality (SSD) framework, a mathematical proof that SSMs and attention are dual representations of the same underlying computation on structured matrices. The paper title says it plainly: "Transformers are SSMs."
The key insight is that a selective SSM can be written as a lower-triangular matrix multiplication y = M · x, where M encodes both the causal mask (like attention) and the state decay (like a recurrence). When the decay factors are all 1, this reduces exactly to causal linear attention. The SSM view computes it in O(n) via recurrence. The attention view computes the same thing in O(n²) via matrix multiplication. Same function, two algorithms.
Practically, Mamba-2 is 2-8x faster than Mamba-1 on training. It replaces the scan-based computation with chunkwise matrix multiplications that GPUs are optimized for. The implementation is about 30 lines of PyTorch. Larger state sizes (up to 16x bigger than Mamba-1) substantially improve retrieval tasks.
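The duality is easy to verify on a toy example. Here's a sketch of a scalar-channel selective SSM computed both ways, with random per-step decays a_t standing in for the state transition; this isn't Mamba-2's actual code, it just demonstrates that the two views agree:

```python
import numpy as np

rng = np.random.default_rng(0)
L, N = 6, 4                      # sequence length, state dimension
x = rng.standard_normal(L)       # one scalar input channel
a = rng.uniform(0.5, 1.0, L)     # per-step decay (input-dependent in Mamba-2)
B = rng.standard_normal((L, N))  # input-dependent B_t
C = rng.standard_normal((L, N))  # input-dependent C_t

# View 1: linear-time recurrence, h_t = a_t * h_{t-1} + B_t * x_t, y_t = C_t . h_t
h = np.zeros(N)
y_rec = np.zeros(L)
for t in range(L):
    h = a[t] * h + B[t] * x[t]
    y_rec[t] = C[t] @ h

# View 2: quadratic-time masked matrix multiply, y = M @ x, where
# M[t, s] = (C_t . B_s) * a_{s+1} * ... * a_t for s <= t (decay factors all 1 -> causal linear attention)
M = np.zeros((L, L))
for t in range(L):
    for s in range(t + 1):
        M[t, s] = (C[t] @ B[s]) * np.prod(a[s + 1 : t + 1])
y_mat = M @ x

assert np.allclose(y_rec, y_mat)  # same function, two algorithms
```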
Mamba-3 (2025) attacked three specific weaknesses:
- Trapezoidal discretization: Mamba-1/2 used zero-order hold to discretize the continuous system. Mamba-3 upgrades to the trapezoidal rule: a higher-order scheme that is more accurate and gives better quality at the same state size (a toy comparison of the two rules appears after this list).
- Complex-valued states: Mamba-2's real-valued states provably cannot solve certain state-tracking tasks. Mamba-3 switches to complex-valued state spaces. Look at the numbers:
| Task | Mamba-2 | Mamba-3 |
|---|---|---|
| Parity | ~0.9% (near random) | 100% |
| Modular Arithmetic | Fails | Solves |
0.9% to 100%. That's not an improvement, that's a different model. Complex-valued SSMs turn out to be connected to Data-Dependent Rotary Position Embeddings (RoPE), which bridges SSM theory with a technique Transformer practitioners already use.
- MIMO formulation: Multi-Input Multi-Output increases arithmetic intensity, trading compute for lower perplexity without increasing memory. You get better hardware utilization without paying for it in VRAM.
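To see why the trapezoidal rule helps, here's a generic numerical-integration comparison on a toy scalar system with a time-varying input. This illustrates the discretization idea only, not Mamba-3's actual parameterization:

```python
import numpy as np

# Toy comparison of discretization rules on h'(t) = a*h(t) + b*x(t).
a, b, dt, T = -1.0, 1.0, 0.25, 8.0
x = lambda t: np.sin(t)                      # a time-varying input signal
ts = np.arange(0.0, T, dt)

def reference():
    # Near-exact solution via very fine Euler steps
    h, fine = 0.0, dt / 1000
    for t in np.arange(0.0, T, fine):
        h += fine * (a * h + b * x(t))
    return h

def zoh():
    # Zero-order hold: treat x as constant over each step
    h = 0.0
    a_bar = np.exp(dt * a)
    b_bar = (a_bar - 1.0) / a * b
    for t in ts:
        h = a_bar * h + b_bar * x(t)
    return h

def trapezoidal():
    # Trapezoidal rule: average the derivative at both step endpoints,
    # solving h_new = h + dt/2 * [(a*h + b*x(t)) + (a*h_new + b*x(t+dt))]
    h = 0.0
    for t in ts:
        rhs = h + 0.5 * dt * (a * h + b * x(t))
        h = (rhs + 0.5 * dt * b * x(t + dt)) / (1.0 - 0.5 * dt * a)
    return h

print(reference(), zoh(), trapezoidal())     # trapezoidal lands closer to the reference
```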
Production hybrid architectures
The theory is interesting. What matters is what ships. Six production models tell the story.
AI21 Jamba
The first production-scale Mamba deployment. Jamba interleaves Mamba layers with attention layers at a 1:8 ratio — one attention layer for every eight total layers — plus Mixture-of-Experts routing.
| Spec | Value |
|---|---|
| Active parameters | 12B (52B total MoE) |
| Context length | 256K tokens |
| Attention cache at 256K | 4 GB |
| Equivalent Transformer cache | 128 GB (Llama-2-70B) |
Read those last two rows again. 4 GB versus 128 GB. That's the difference between "runs on one 80GB GPU" and "needs a multi-node cluster." Jamba fits 140K tokens of context on a single A100.
Benchmarks: 87.1% HellaSwag, 67.4% MMLU, 59.9% GSM8K (chain-of-thought). 3x faster token generation than Mixtral on long-context tasks.
A surprising design choice: Jamba uses Mamba-1, not Mamba-2. AI21 found that in a hybrid setup, Mamba-1 + Attention outperformed Mamba-2 + Attention. The engineering reality doesn't always follow the paper chronology.
IBM Bamba-9B
A hybrid with 29 SSM layers and 3 attention layers, built on Mamba-2. Trained on 2.2T tokens (v1) and 2.5T tokens (v2).
The inference numbers: 2.5x throughput improvement over standard Transformers in vLLM, 2x latency reduction. Quantized from 18 GB to 9 GB with minimal quality loss. Bamba-9B v2 outperforms Llama 3.1 8B on standard leaderboards — despite Llama training on 7x more data. That's architectural efficiency winning over brute-force scaling.
The v2 training process was unusual: IBM trained two separate models to 3T tokens with different learning rate schedules, merged them using MergeKit weighted averaging, then annealed on 100B high-quality tokens. Training recipes matter as much as architecture choices.
NVIDIA Hymba-1.5B
Hymba does something different: parallel hybrid heads. Instead of interleaving Mamba and attention in separate layers (like Jamba), Hymba runs both in the same layer simultaneously. Attention and Mamba process the same input in parallel, then their outputs combine.
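Conceptually, a parallel hybrid layer looks something like the sketch below. This is a simplified stand-in (a generic causal attention branch plus an SSM branch over the same tokens, outputs normalized and combined), not NVIDIA's implementation; ssm_branch is a placeholder for something like a Mamba block from mamba_ssm:

```python
import torch
import torch.nn as nn

class ParallelHybridBlock(nn.Module):
    """Sketch of a parallel-head hybrid layer: attention and an SSM branch
    read the same input side by side, and their normalized outputs are combined."""

    def __init__(self, d_model: int, n_heads: int, ssm_branch: nn.Module):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ssm = ssm_branch                 # placeholder for a Mamba-style block
        self.norm_attn = nn.LayerNorm(d_model)
        self.norm_ssm = nn.LayerNorm(d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):                     # x: (batch, seq_len, d_model)
        L = x.size(1)
        causal = torch.triu(torch.ones(L, L, dtype=torch.bool, device=x.device), diagonal=1)
        attn_out, _ = self.attn(x, x, x, attn_mask=causal, need_weights=False)
        ssm_out = self.ssm(x)
        # Both branches see the same tokens; normalize each, then mix.
        mixed = self.norm_attn(attn_out) + self.norm_ssm(ssm_out)
        return x + self.out(mixed)
```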
Other interesting choices: 128 learnable meta tokens prepended to every sequence (they absorb global information and reduce attention overhead), cross-layer KV cache sharing between consecutive attention layers, and full attention in only 3 of its layers. First, middle, last. That's it.
At 1.5B parameters, Hymba outperforms Llama-3.2-1B and uses 10x less KV cache memory on A100.
IBM Granite 4.0
IBM went aggressive with the Mamba ratio: 9 Mamba-2 blocks per 1 Transformer block in a 7B MoE model. The results justify the bet — 82.41% on HumanEval, 70%+ lower memory requirements than comparable Transformers, 2x faster inference. Apache 2.0 license, 12-language support.
IBM isn't shipping this as a research preview. It's a production model with SLAs. That tells you where enterprise AI thinks this is going.
Mistral Codestral Mamba
This one surprised me. Codestral Mamba is pure Mamba-2, no attention layers at all, with 7.28B parameters.
| Benchmark | Codestral Mamba | CodeGemma 7B | CodeLlama 34B |
|---|---|---|---|
| HumanEval | 75.0% | 61.0% | 31.1% |
| HumanEval C++ | 59.8% | 49.1% | — |
| HumanEval JS | 61.5% | 52.2% | — |
A 7B pure SSM beating a 34B Transformer at code generation. Code has enough structure and locality that the selective state mechanism captures what matters without global attention. If you're building a code-focused product, pure Mamba is a real option.
NVIDIA Nemotron-H
Replaces 92% of attention layers with Mamba-2 blocks. Up to 3x throughput over LLaMA-3.1 and Qwen-2.5 at comparable sizes. Across all six of these models, the same pattern: the ratio of attention to Mamba keeps shrinking, and quality holds.
When to use what
After staring at benchmarks and ablation studies for weeks, here's the decision framework I'd use:
| Scenario | Architecture | Why |
|---|---|---|
| Long context (>32K) | Hybrid | Linear memory + attention quality |
| Code generation | Pure Mamba | Structured tasks don't need global attention |
| Streaming / real-time | Pure Mamba | Constant memory per step |
| Complex reasoning | Transformer or Hybrid | Attention excels at in-context learning |
| Memory-constrained deployment | Mamba or Hybrid | Linear scaling wins |
| Retrieval-heavy RAG | Hybrid (mandatory) | Attention is required for retrieval |
| Edge deployment (<2B params) | Hymba-style parallel | Best efficiency at small scale |
The retrieval row deserves emphasis. A 2025 ablation study on hybrid models (RecurrentGemma, Jamba) found that removing attention layers causes retrieval accuracy to drop to 0%. Not "gets worse." Zero. The Mamba layers contribute nothing to retrieval. Hybrid architectures are really specialized module systems: Mamba handles the bulk of sequence processing, attention handles precise recall.
| Architecture | Best At | Worst At |
|---|---|---|
| Pure Transformer | In-context learning, retrieval, reasoning | Quadratic scaling, long context memory |
| Pure Mamba | Throughput, long sequences, structured tasks | Associative recall, retrieval |
| Hybrid (Interleaved) | Balance of quality and efficiency | Slightly more complex to train |
| Hybrid (Parallel heads) | Maximum efficiency per parameter | Newest approach, less battle-tested |
One thing from 2025 research that doesn't get enough attention: learning rate choice plays an outsized role in recurrent model performance. Some of the negative SSM results in the literature may reflect suboptimal hyperparameter tuning rather than architectural limitations. If you're benchmarking Mamba against Transformers internally, make sure you're actually tuning both fairly.
The emerging consensus: start hybrid. Use a small ratio of attention layers (1-in-8 or 1-in-10). Only go pure Mamba if you've validated that your workload doesn't need retrieval. Only go pure Transformer if context length is permanently short and you need maximum in-context learning.
Common misconceptions
"Mamba can't do in-context learning." This was plausible in early 2024. It's not true in 2026. Jamba hits 67.4% MMLU and 59.9% GSM8K. Granite 4.0 scores 82.41% HumanEval. Hybrids addressed the early limitations, and even pure Mamba models keep improving through better state representations (Mamba-3's complex-valued states).
"SSMs are just RNNs with better marketing." No. The selective mechanism is a different thing from fixed gating. Mamba's parameters change with the input. The model decides per-token how much state to preserve or overwrite. The state dimension (N=16 by default) gives it far more representational capacity than scalar RNN gates. And Mamba-3 solves tasks (Parity at 100%) that no RNN and no real-valued SSM can solve. Call that marketing if you want, but the math disagrees.
"Mamba will fix hallucinations." It won't. OpenAI's 2025 hallucination framework (Kalai et al.) proves mathematically that hallucination is architecture-agnostic. The core theorem: err >= 2 · err_iiv. Under binary evaluation (right/wrong), models are incentivized to guess rather than say "I don't know." This holds whether you use attention, SSMs, or anything else. Hallucination lives in the training objective, not the architecture.
"You need equal parts attention and Mamba in hybrids." Production models disagree. Jamba uses a 1-in-8 ratio. Granite 4.0 uses 1-in-10. Nemotron-H replaces 92% of attention layers. Sometimes just 3 attention layers total is enough for retrieval capability while Mamba handles everything else.
Practical implementation
If you want to start using hybrid models today, Jamba is the most accessible entry point. Here's an 8-bit quantized setup:
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_skip_modules=["mamba"],  # Preserve Mamba layer precision
)

model = AutoModelForCausalLM.from_pretrained(
    "ai21labs/Jamba-v0.1",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    quantization_config=quantization_config,
)
```
Note the llm_int8_skip_modules=["mamba"]. Mamba layers are more sensitive to quantization than attention layers. Skipping them during 8-bit conversion preserves quality where it matters.
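From there, generation is the standard transformers flow. Here's a minimal usage sketch, assuming the model object from the snippet above (the prompt is just an example):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ai21labs/Jamba-v0.1")
prompt = "Summarize the trade-offs between attention and state space models:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```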
Dependencies:
```bash
pip install mamba-ssm "causal-conv1d>=1.2.0"
pip install "transformers>=4.40.0" bitsandbytes
```
Deployment checklist before you ship anything:
- Verify CUDA 11.8+ compatibility (mamba-ssm requires it)
- Benchmark with representative workloads at your target context length
- Monitor memory usage — it should scale linearly with sequence length, not quadratically (see the probe sketch after this list). If it doesn't, something is wrong
- Compare against a Transformer baseline at the same parameter count. The throughput gain should be 2-5x depending on context length
- Test retrieval-dependent features specifically. If your application relies on finding specific information in long contexts, a hybrid is mandatory
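For the memory check in particular, here's a rough probe, assuming the quantized Jamba setup from above is already loaded on a single CUDA device; peak_memory_gb is a hypothetical helper, not a library function:

```python
import torch

def peak_memory_gb(model, tokenizer, context_len):
    """Hypothetical helper: peak GPU memory for one forward pass at a given length."""
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    ids = torch.randint(0, tokenizer.vocab_size, (1, context_len), device=model.device)
    with torch.no_grad():
        model(input_ids=ids)
    return torch.cuda.max_memory_allocated() / 1e9

for n in (2_048, 8_192, 32_768):
    print(n, round(peak_memory_gb(model, tokenizer, n), 2), "GB")
# Roughly linear growth in n is what you want to see; clearly superlinear growth
# means the attention path (or its cache) is dominating.
```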
For production inference, vLLM and llama.cpp both support Mamba-based models. Standard NVIDIA GPUs work fine.
What this means
I keep coming back to the bigger picture here. Transformers solved the long-range dependency problem that killed RNNs. Selective state spaces are solving the scaling problem that's slowly strangling attention. The Transformer's core assumption, that every token must attend to every other token, turned out to be sufficient but not necessary.
The same pattern plays out across deep learning. CNNs weren't the final word in computer vision. RNNs weren't the final word in sequence modeling. Transformers almost certainly aren't either. The question was never "will something better come along?" It was "what will it look like?"
Now we have an answer: it looks like a fixed-size state that learns what to remember and what to forget, processed in linear time, optionally augmented with a few attention layers for the tasks that genuinely need global token interaction.
The next time you're designing a system with long contexts, ask yourself: does every token need to attend to every other token? Or is selective state propagation enough?
For most workloads, the answer is shifting.
Sources
- Vaswani, A., et al. "Attention is all you need." NeurIPS 2017.
- Gu, A., & Dao, T. "Mamba: Linear-time sequence modeling with selective state spaces." 2023.
- Dao, T., & Gu, A. "Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality." ICML 2024.
- Gu, A., et al. "Mamba-3: Improved sequence modeling using state space principles." 2025.
- Lieber, O., et al. "Jamba: A hybrid transformer-Mamba language model." ICLR 2025.
- Kalai, A.T., et al. "Why language models hallucinate." OpenAI, 2025.
- Gu, A. "Efficiently modeling long sequences with structured state spaces." ICLR 2022.
- NVIDIA. "Hymba: A hybrid-head architecture for small language models." 2024.
- IBM. "Granite 4.0: Hyper-efficient, high performance hybrid models." 2025.
- Mistral AI. "Codestral Mamba." 2024.
- IBM Research. "Meet Bamba." 2025.
- Bick, A., et al. "Understanding the skill gap in recurrent language models." ICML 2025.
- Grootendorst, M. "A visual guide to Mamba and state space models."
- IBM. "What is Mamba?"