The Post-Transformer Era: State Space Models, Mamba, and What Comes After Attention

machine-learning · mamba · state-space-models · transformers · ai-architecture
Serendeep Rudraraju
February 10, 2026 · 15 min read

What if the most important architectural innovation since Transformers isn't trying to replace attention — but to escape its quadratic scaling problem entirely?

I've been watching State Space Models go from "interesting paper" to "IBM ships it in production" in about two years. Mamba showed up in December 2023 as a research curiosity. By late 2025, IBM built Granite 4.0 on it. AI21 shipped Jamba with 256K context on a single GPU. Mistral released Codestral Mamba and it beat CodeLlama 34B at code generation — with a pure SSM, no attention at all.

The field moved fast enough that most practitioners I talk to are still working off outdated assumptions. "Mamba can't do in-context learning." "SSMs are just fancy RNNs." "You need special hardware." None of that is true anymore, and the gap between what people think and what's actually shipping is getting wider.

Here's what's actually going on.

TL;DR

This post covers how selective state spaces work, why they scale linearly where Transformers scale quadratically, and which production models you should care about. The short version: Mamba achieves 5x higher throughput than Transformers with O(n) scaling. But pure SSMs still struggle with retrieval tasks. Hybrid architectures — a handful of attention layers mixed into a stack of Mamba layers — are winning in production. You'll walk away with a decision framework for when to use what.

The quadratic problem

Every Transformer layer computes this:

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) * V

That QK^T term is an n × n matrix, where n is your sequence length. Every token attends to every other token. The complexity is O(n² · d) per layer.

When Vaswani et al. published "Attention Is All You Need" in 2017, sequences were 512 tokens long. The quadratic cost was a rounding error. Then context windows started growing.

| Sequence Length | Attention Pairs | KV Cache (7B model) |
|---|---|---|
| 2K (GPT-3 era) | 4 million | ~1 GB |
| 4K tokens | 16 million | ~4 GB |
| 32K tokens | 1 billion | ~32 GB |
| 128K tokens | 16.4 billion | 100+ GB |
| 1M tokens | 1 trillion | Impractical |

That 128K row is where things get ugly. A 7B parameter Transformer at 128K context can burn over 100 GB just on the KV cache. That's the memory cost of storing key and value tensors so each new token can attend to everything before it. The model weights themselves might only be 14 GB in half precision. The cache dwarfs the model.
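
To make that concrete, here is a back-of-the-envelope KV cache estimator. The config values are illustrative (a Llama-7B-like layout: 32 layers, 32 KV heads of dimension 128, fp16, no grouped-query attention), and published figures vary with precision, batch size, and attention variant, which is why estimates for a 7B model at 128K range from tens of GB to well over 100 GB.

def kv_cache_gb(seq_len, n_layers=32, n_kv_heads=32, head_dim=128,
                bytes_per_param=2, batch_size=1):
    # Two cached tensors (K and V) per layer, per token, per head
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_param
    return batch_size * seq_len * per_token_bytes / 1e9

# ~0.5 MB per token at this config, so ~69 GB at 128K for batch size 1.
# Keep the cache in fp32 or serve a couple of requests at once and you clear 100 GB.
print(kv_cache_gb(131_072))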

# The scaling gap in one snippet
def attention_flops(seq_len):
    return seq_len ** 2  # O(n²)

def mamba_flops(seq_len):
    return seq_len       # O(n)

# At 128K tokens (131,072):
# Attention: 131,072² = 17,179,869,184 (~17B pairwise ops per head)
# Mamba:     131,072   (128K state updates)
# That's a 131,072x difference. Per layer.

This wasn't a problem when GPT-3 had a 2K context window. It became a problem when the field decided it needed models that could read entire codebases, process hour-long transcripts, and maintain conversations that span days. Claude runs at 200K context. Gemini hit 1M+. Reaching those numbers with pure attention requires staggering amounts of memory and compute.

The whole industry spent 2023-2024 trying to fix this with engineering patches. FlashAttention. KV cache quantization. Sliding window attention. Ring attention. All useful. None of them change the fundamental math. The complexity is still quadratic. You're just making each unit of quadratic work cheaper.

State Space Models take a different approach: change the math.

How selective state spaces work

The lineage goes HiPPO (2020) → S4 (2021) → Mamba (2023). Each step solved a specific limitation.

HiPPO (Albert Gu, 2020) figured out that you could represent a running history of a sequence as coefficients of orthogonal polynomials — Legendre, Laguerre — updated continuously. Think of it as a mathematical compression scheme: instead of storing every past token, you project the history onto a set of basis functions and keep just the coefficients. This gave SSMs a principled way to compress long-range context into a fixed-size state without the information just decaying to zero like it does in vanilla RNNs.
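
The intuition is easier to see with a toy projection. The sketch below is not the actual HiPPO operator (which updates the coefficients online with a specific recurrence); it only shows the compression idea: a long history squeezed into a fixed number of Legendre coefficients.

import numpy as np
from numpy.polynomial import legendre

t = np.linspace(-1, 1, 1_000)                   # the "history" axis, rescaled to [-1, 1]
signal = np.sin(4 * t) + 0.3 * np.cos(9 * t)    # a stand-in for a long input history

K = 16                                          # fixed state size, like Mamba's default N=16
coeffs = legendre.legfit(t, signal, deg=K - 1)  # project onto the first K Legendre polynomials
recon = legendre.legval(t, coeffs)              # rebuild the whole history from K numbers

print(f"{signal.size} samples -> {coeffs.size} coefficients")
print(f"reconstruction RMSE: {np.sqrt(np.mean((signal - recon) ** 2)):.2e}")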

S4 (2021-2022) proved that properly structured SSMs, initialized with HiPPO matrices, could handle sequences of tens of thousands of steps, demolishing Transformers on the Long Range Arena benchmark. S4 exploited a key equivalence: a linear time-invariant SSM can be computed as a convolution, allowing parallel training on GPUs. This spawned a family of variants (S4D, S5, DSS) through 2022, each simplifying the parameterization.

But S4 had a fatal limitation: its parameters were fixed. The A, B, C matrices didn't change based on what the model was actually reading. Every token got processed identically. The model couldn't decide "this token matters, pay attention" versus "this is noise, forget it." In the paper's language, S4 lacked content-based reasoning.

Mamba (Albert Gu & Tri Dao, December 2023) fixed exactly that problem. The core idea: make the SSM parameters functions of the input.

The underlying system is deceptively simple. You have a continuous-time state equation:

h'(t) = A · h(t) + B · x(t)
y(t)  = C · h(t)

State h gets updated based on input x, modulated by matrices A, B, C. Output y reads from the state through C. Discretize it (zero-order hold) and you get a recurrence:

h_t = Ā · h_{t-1} + B̄ · x_t
y_t = C · h_t

Each step is O(N) where N is the state dimension, constant with respect to sequence length. The state h is a fixed-size vector regardless of whether you've processed 100 tokens or 100,000.
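
A minimal numpy sketch of both points, using made-up matrices rather than anything trained: the recurrent view touches only a fixed-size state however long the sequence gets, and for time-invariant parameters it computes exactly what a convolution computes, which is the equivalence S4 exploited for parallel training.

import numpy as np

rng = np.random.default_rng(0)
N, L = 16, 256                                # state size is fixed; sequence length is not

A_bar = np.diag(rng.uniform(0.5, 0.99, N))    # stable, already-discretized state matrix
B_bar = rng.standard_normal((N, 1))
C = rng.standard_normal((1, N))
x = rng.standard_normal(L)

# Recurrent view: O(N) work per token, and memory is one length-N vector, whatever L is.
h = np.zeros((N, 1))
y_rec = np.empty(L)
for t in range(L):
    h = A_bar @ h + B_bar * x[t]
    y_rec[t] = (C @ h).item()

# Convolutional view: y = conv(x, K) with kernel K = (C·B̄, C·Ā·B̄, C·Ā²·B̄, ...)
K = np.array([(C @ np.linalg.matrix_power(A_bar, k) @ B_bar).item() for k in range(L)])
y_conv = np.array([np.dot(K[:t + 1][::-1], x[:t + 1]) for t in range(L)])

print(np.allclose(y_rec, y_conv))             # True: same outputs, two algorithms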

What makes Mamba different from every SSM before it is selectivity. Instead of fixed parameters:

Δ: input-dependent  ← softplus(Parameter + s_Δ(x))
B: input-dependent  ← Linear(x)
C: input-dependent  ← Linear(x)
A: fixed            ← remains static

The step size Δ controls how much the model focuses on the current input versus preserving previous state. Large Δ means "gate open, let this in." Small Δ means "gate closed, keep what I have." B and C also adapt to the input, allowing content-dependent reading and writing of state.

This is formally a generalization of RNN gating. The Mamba paper proves it (Theorem 1): when N=1, A=−1, B=1, the selective SSM reduces to h_t = (1 − g_t) · h_{t-1} + g_t · x_t, which is exactly the classical gated recurrence. But with N=16 (the default), you get a state that's 16x richer than any gated RNN ever had.
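
Stripped to its core, selectivity looks like the sketch below. This is illustrative numpy, not the real Mamba block (which adds input and output projections, a short convolution, gating, and the fused scan), and the weight shapes and the bias on Δ are arbitrary choices for the demo.

import numpy as np

def softplus(z):
    return np.log1p(np.exp(z))

rng = np.random.default_rng(0)
D, N, L = 4, 16, 64                           # channels, state size, sequence length
x = rng.standard_normal((L, D))

A = -np.exp(rng.standard_normal((D, N)))      # fixed and negative, so exp(Δ·A) always decays
W_delta = rng.standard_normal((D, D)) * 0.1   # everything below is computed from the input
W_B = rng.standard_normal((D, N)) * 0.1
W_C = rng.standard_normal((D, N)) * 0.1

h = np.zeros((D, N))
y = np.empty((L, D))
for t in range(L):
    delta = softplus(x[t] @ W_delta + 1.0)    # per-channel step size Δ, a function of x_t
    B_t = x[t] @ W_B                          # input-dependent "write" direction
    C_t = x[t] @ W_C                          # input-dependent "read" direction

    A_bar = np.exp(delta[:, None] * A)        # large Δ: old state decays faster
    h = A_bar * h + (delta[:, None] * B_t[None, :]) * x[t][:, None]  # large Δ: new input weighted more
    y[t] = h @ C_t

# With N=1, A=-1, B=1 this collapses to h_t = (1 - g_t)·h_{t-1} + g_t·x_t, the gated recurrence above.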

Here's the catch: making parameters input-dependent breaks the convolution equivalence that S4 relied on for fast parallel training. You can't precompute a fixed convolution kernel when the kernel changes at every step. Mamba sidesteps this with a hardware-aware parallel scan algorithm.

Instead of materializing the full expanded state (shape B×L×D×N) in GPU HBM (slow memory), Mamba loads parameters into SRAM (fast memory), performs discretization and the recurrence in SRAM, and writes only the output (shape B×L×D) back to HBM. This yields a 20-40x speedup over a naive implementation (and up to 3x over a naive recurrence) on A100s. During training, intermediate states are recomputed during backprop instead of stored, trading compute for memory.

The architecture stacks these blocks with expansion factor 2, SiLU activation, and LayerNorm. No positional encoding is needed; the recurrence inherently provides position information. Two Mamba blocks match the parameter count (12D²) of one standard Transformer layer's MHA + MLP.
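
The parameter accounting is quick to check. Counting only the dominant linear projections (and ignoring the small convolution, Δ, and SSM parameters), the two stacks land on the same 12D² figure; the helper functions below are just for the arithmetic:

def transformer_layer_params(D):
    mha = 4 * D * D      # Q, K, V, and output projections
    mlp = 8 * D * D      # up- and down-projections at the usual 4x expansion
    return mha + mlp

def two_mamba_blocks_params(D, E=2):
    one_block = 2 * E * D * D + E * D * D    # input projection (D -> 2ED) plus output projection (ED -> D)
    return 2 * one_block

D = 2048
print(transformer_layer_params(D), two_mamba_blocks_params(D))   # both 12·D² = 50,331,648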

The result: Mamba-3B matches Transformer-6B quality on language modeling. Mamba-2.8B hits 63.3% zero-shot accuracy versus Pythia-2.8B's 59.1%. 5x higher generation throughput. Linear scaling to million-length sequences. On DNA modeling at 1M sequence length, Mamba's quality improves with context while HyenaDNA degrades.

Mamba-2 and Mamba-3

Mamba-1 proved the concept. The follow-ups refined it.

Mamba-2 (Tri Dao & Albert Gu, May 2024) introduced the State Space Duality (SSD) framework, a mathematical proof that SSMs and attention are dual representations of the same underlying computation on structured matrices. The paper title says it plainly: "Transformers are SSMs."

The key insight is that a selective SSM can be written as a lower-triangular matrix multiplication y = M · x, where M encodes both the causal mask (like attention) and the state decay (like a recurrence). When the decay factors are all 1, this reduces exactly to causal linear attention. The SSM view computes it in O(n) via recurrence. The attention view computes the same thing in O(n²) via matrix multiplication. Same function, two algorithms.
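
You can verify the duality on a toy case. The sketch below uses a scalar state and made-up per-step coefficients; real SSD works with larger states and computes the matrix view in chunks, but the same-function-two-algorithms point survives the simplification.

import numpy as np

rng = np.random.default_rng(0)
L = 8
a = rng.uniform(0.5, 1.0, L)          # per-step state decay
b = rng.standard_normal(L)            # per-step input weight
c = rng.standard_normal(L)            # per-step output weight
x = rng.standard_normal(L)

# Linear-time view: a scalar selective recurrence.
h, y_rec = 0.0, np.empty(L)
for t in range(L):
    h = a[t] * h + b[t] * x[t]
    y_rec[t] = c[t] * h

# Quadratic-time view: the same map as y = M @ x with lower-triangular
# M[t, s] = c_t · (a_{s+1} · ... · a_t) · b_s for s <= t.
M = np.zeros((L, L))
for t in range(L):
    for s in range(t + 1):
        M[t, s] = c[t] * np.prod(a[s + 1:t + 1]) * b[s]

print(np.allclose(y_rec, M @ x))      # True; with all a_t = 1, M is exactly causal linear attention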

Practically, Mamba-2 is 2-8x faster than Mamba-1 on training. It replaces the scan-based computation with chunkwise matrix multiplications that GPUs are optimized for. The implementation is about 30 lines of PyTorch. Larger state sizes (up to 16x bigger than Mamba-1) substantially improve retrieval tasks.

Mamba-3 (2025) attacked three specific weaknesses:

  1. Trapezoidal discretization: Mamba-1/2 used a first-order rule (zero-order hold, Euler-style) to discretize the continuous system. Mamba-3 upgrades to the trapezoidal rule. Higher-order, more accurate, better quality at the same state size. (A toy numerical comparison follows this list.)

  2. Complex-valued states: Mamba-2's real-valued states provably cannot solve certain state-tracking tasks. Mamba-3 switches to complex-valued state spaces. Look at the numbers:

| Task | Mamba-2 | Mamba-3 |
|---|---|---|
| Parity | ~0.9% (near random) | 100% |
| Modular Arithmetic | Fails | Solves |
0.9% to 100%. That's not an improvement, that's a different model. Complex-valued SSMs turn out to be connected to Data-Dependent Rotary Position Embeddings (RoPE), which bridges SSM theory with a technique Transformer practitioners already use.

  3. MIMO formulation: Multi-Input Multi-Output increases arithmetic intensity, trading compute for lower perplexity without increasing memory. You get better hardware utilization without paying for it in VRAM.
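
For the first point, here is a generic illustration of why the higher-order rule helps. This is textbook ODE discretization on a scalar system with a deliberately coarse step, not Mamba-3's actual update:

import numpy as np

a, b, x = -1.0, 1.0, 1.0      # dh/dt = a·h + b·x with a constant input
dt, steps = 0.5, 4            # coarse step size to make the error visible

h_euler = h_trap = 0.0
for _ in range(steps):
    # Forward Euler: derivative evaluated only at the old state
    h_euler = h_euler + dt * (a * h_euler + b * x)
    # Trapezoidal rule: average the derivative at the old and new state
    h_trap = ((1 + dt * a / 2) * h_trap + dt * b * x) / (1 - dt * a / 2)

h_exact = (b * x / -a) * (1 - np.exp(a * dt * steps))
print(f"Euler: {h_euler:.4f}  Trapezoidal: {h_trap:.4f}  Exact: {h_exact:.4f}")
# Euler: 0.9375  Trapezoidal: 0.8704  Exact: 0.8647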

Production hybrid architectures

The theory is interesting. What matters is what ships. Six production models tell the story.

AI21 Jamba

The first production-scale Mamba deployment. Jamba interleaves Mamba layers with attention layers at a 1:8 ratio — one attention layer for every eight total layers — plus Mixture-of-Experts routing.

| Spec | Value |
|---|---|
| Active parameters | 12B (52B total MoE) |
| Context length | 256K tokens |
| Attention cache at 256K | 4 GB |
| Equivalent Transformer cache | 128 GB (Llama-2-70B) |

Read those last two rows again. 4 GB versus 128 GB. That's the difference between "runs on one 80GB GPU" and "needs a multi-node cluster." Jamba fits 140K tokens of context on a single A100.
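
The arithmetic behind that gap is simple: only the attention layers carry a KV cache, and those layers use grouped-query attention. The sketch below plugs in roughly Jamba-like numbers (4 attention layers out of 32, 8 KV heads of dimension 128, fp16); treat the config as approximate rather than an official spec.

def attn_kv_cache_gb(seq_len, n_attn_layers, n_kv_heads, head_dim=128, bytes_per_param=2):
    # Only layers that compute attention store K and V
    return seq_len * 2 * n_attn_layers * n_kv_heads * head_dim * bytes_per_param / 1e9

print(attn_kv_cache_gb(256_000, n_attn_layers=4, n_kv_heads=8))     # ~4 GB: hybrid + GQA
print(attn_kv_cache_gb(256_000, n_attn_layers=32, n_kv_heads=32))   # ~134 GB: full attention, no GQA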

Benchmarks: 87.1% HellaSwag, 67.4% MMLU, 59.9% GSM8K (chain-of-thought). 3x faster token generation than Mixtral on long-context tasks.

A surprising design choice: Jamba uses Mamba-1, not Mamba-2. AI21 found that in a hybrid setup, Mamba-1 + Attention outperformed Mamba-2 + Attention. The engineering reality doesn't always follow the paper chronology.

IBM Bamba-9B

A hybrid with 29 SSM layers and 3 attention layers, built on Mamba-2. Trained on 2.2T tokens (v1) and 2.5T tokens (v2).

The inference numbers: 2.5x throughput improvement over standard Transformers in vLLM, 2x latency reduction. Quantized from 18 GB to 9 GB with minimal quality loss. Bamba-9B v2 outperforms Llama 3.1 8B on standard leaderboards — despite Llama training on 7x more data. That's architectural efficiency winning over brute-force scaling.

The v2 training process was unusual: IBM trained two separate models to 3T tokens with different learning rate schedules, merged them using MergeKit weighted averaging, then annealed on 100B high-quality tokens. Training recipes matter as much as architecture choices.

NVIDIA Hymba-1.5B

Hymba does something different: parallel hybrid heads. Instead of interleaving Mamba and attention in separate layers (like Jamba), Hymba runs both in the same layer simultaneously. Attention and Mamba process the same input in parallel, then their outputs combine.

Other interesting choices: 128 learnable meta tokens prepended to every sequence (they absorb global information and reduce attention overhead), cross-layer KV cache sharing between consecutive attention layers, and full attention in only 3 of its layers. First, middle, last. That's it.

At 1.5B parameters, Hymba outperforms Llama-3.2-1B and uses 10x less KV cache memory on A100.

IBM Granite 4.0

IBM went aggressive with the Mamba ratio: 9 Mamba-2 blocks per 1 Transformer block in a 7B MoE model. The results justify the bet — 82.41% on HumanEval, 70%+ lower memory requirements than comparable Transformers, 2x faster inference. Apache 2.0 license, 12-language support.

IBM isn't shipping this as a research preview. It's a production model with SLAs. That tells you where enterprise AI thinks this is going.

Mistral Codestral Mamba

This one surprised me. Codestral Mamba is pure Mamba-2, no attention layers at all, with 7.28B parameters.

| Benchmark | Codestral Mamba | CodeGemma 7B | CodeLlama 34B |
|---|---|---|---|
| HumanEval | 75.0% | 61.0% | 31.1% |
| HumanEval C++ | 59.8% | 49.1% | n/a |
| HumanEval JS | 61.5% | 52.2% | n/a |

A 7B pure SSM beating a 34B Transformer at code generation. Code has enough structure and locality that the selective state mechanism captures what matters without global attention. If you're building a code-focused product, pure Mamba is a real option.

NVIDIA Nemotron-H

Replaces 92% of attention layers with Mamba-2 blocks. Up to 3x throughput over LLaMA-3.1 and Qwen-2.5 at comparable sizes. Across all six of these models, the same pattern: the ratio of attention to Mamba keeps shrinking, and quality holds.

When to use what

After staring at benchmarks and ablation studies for weeks, here's the decision framework I'd use:

| Scenario | Architecture | Why |
|---|---|---|
| Long context (>32K) | Hybrid | Linear memory + attention quality |
| Code generation | Pure Mamba | Structured tasks don't need global attention |
| Streaming / real-time | Pure Mamba | Constant memory per step |
| Complex reasoning | Transformer or Hybrid | Attention excels at in-context learning |
| Memory-constrained deployment | Mamba or Hybrid | Linear scaling wins |
| Retrieval-heavy RAG | Hybrid (mandatory) | Attention is required for retrieval |
| Edge deployment (<2B params) | Hymba-style parallel | Best efficiency at small scale |

The retrieval row deserves emphasis. A 2025 ablation study on hybrid models (RecurrentGemma, Jamba) found that removing attention layers causes retrieval accuracy to drop to 0%. Not "gets worse." Zero. The Mamba layers contribute nothing to retrieval. Hybrid architectures are really specialized module systems: Mamba handles the bulk of sequence processing, attention handles precise recall.

| Architecture | Best At | Worst At |
|---|---|---|
| Pure Transformer | In-context learning, retrieval, reasoning | Quadratic scaling, long-context memory |
| Pure Mamba | Throughput, long sequences, structured tasks | Associative recall, retrieval |
| Hybrid (Interleaved) | Balance of quality and efficiency | Slightly more complex to train |
| Hybrid (Parallel heads) | Maximum efficiency per parameter | Newest approach, less battle-tested |

One thing from 2025 research that doesn't get enough attention: learning rate choice plays an outsized role in recurrent model performance. Some of the negative SSM results in the literature may reflect suboptimal hyperparameter tuning rather than architectural limitations. If you're benchmarking Mamba against Transformers internally, make sure you're actually tuning both fairly.

The emerging consensus: start hybrid. Use a small ratio of attention layers (1-in-8 or 1-in-10). Only go pure Mamba if you've validated that your workload doesn't need retrieval. Only go pure Transformer if context length is permanently short and you need maximum in-context learning.

Common misconceptions

"Mamba can't do in-context learning." This was plausible in early 2024. It's not true in 2026. Jamba hits 67.4% MMLU and 59.9% GSM8K. Granite 4.0 scores 82.41% HumanEval. Hybrids addressed the early limitations, and even pure Mamba models keep improving through better state representations (Mamba-3's complex-valued states).

"SSMs are just RNNs with better marketing." No. The selective mechanism is a different thing from fixed gating. Mamba's parameters change with the input. The model decides per-token how much state to preserve or overwrite. The state dimension (N=16 by default) gives it far more representational capacity than scalar RNN gates. And Mamba-3 solves tasks (Parity at 100%) that no RNN and no real-valued SSM can solve. Call that marketing if you want, but the math disagrees.

"Mamba will fix hallucinations." It won't. OpenAI's 2025 hallucination framework (Kalai et al.) proves mathematically that hallucination is architecture-agnostic. The core theorem: err >= 2 · err_iiv. Under binary evaluation (right/wrong), models are incentivized to guess rather than say "I don't know." This holds whether you use attention, SSMs, or anything else. Hallucination lives in the training objective, not the architecture.

"You need equal parts attention and Mamba in hybrids." Production models disagree. Jamba uses a 1-in-8 ratio. Granite 4.0 uses 1-in-10. Nemotron-H replaces 92% of attention layers. Sometimes just 3 attention layers total is enough for retrieval capability while Mamba handles everything else.

Practical implementation

If you want to start using hybrid models today, Jamba is the most accessible entry point. Here's an 8-bit quantized setup:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_skip_modules=["mamba"]  # Preserve Mamba layer precision
)

model = AutoModelForCausalLM.from_pretrained(
    "ai21labs/Jamba-v0.1",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    quantization_config=quantization_config,
)

Note the llm_int8_skip_modules=["mamba"]. Mamba layers are more sensitive to quantization than attention layers. Skipping them during 8-bit conversion preserves quality where it matters.
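
A quick smoke test after loading, assuming the tokenizer that ships with the checkpoint (the prompt here is arbitrary):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ai21labs/Jamba-v0.1")
inputs = tokenizer("In the era of linear-time sequence models,", return_tensors="pt").to(model.device)

# Greedy-decode a short continuation; real evaluation needs your own long-context workloads
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))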

Dependencies:

pip install mamba-ssm "causal-conv1d>=1.2.0"
pip install "transformers>=4.40.0" bitsandbytes

Deployment checklist before you ship anything:

  1. Verify CUDA 11.8+ compatibility (mamba-ssm requires it)
  2. Benchmark with representative workloads at your target context length
  3. Monitor memory usage — it should scale linearly with sequence length, not quadratically. If it doesn't, something is wrong (see the probe sketch after this checklist)
  4. Compare against a Transformer baseline at the same parameter count. The throughput gain should be 2-5x depending on context length
  5. Test retrieval-dependent features specifically. If your application relies on finding specific information in long contexts, a hybrid is mandatory
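
For point 3, a rough probe of how peak memory grows with context length, reusing the model and tokenizer from the snippets above. It assumes a CUDA device with enough headroom for a single forward pass at each length:

import torch

def peak_forward_memory_gb(model, tokenizer, n_tokens):
    # Peak GPU memory for one forward pass over random token ids at the given length
    torch.cuda.reset_peak_memory_stats()
    ids = torch.randint(0, tokenizer.vocab_size, (1, n_tokens), device=model.device)
    with torch.no_grad():
        model(ids)
    return torch.cuda.max_memory_allocated() / 1e9

for n in (4_096, 16_384, 65_536):
    print(n, round(peak_forward_memory_gb(model, tokenizer, n), 1))
# Roughly linear growth across these rows is what the SSM path should give you;
# clearly super-linear growth points at an attention-dominated configuration.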

For production inference, vLLM and llama.cpp both support Mamba-based models. Standard NVIDIA GPUs work fine.

What this means

I keep coming back to the bigger picture here. Transformers solved the long-range dependency problem that killed RNNs. Selective state spaces are solving the scaling problem that's slowly strangling attention. The Transformer's core assumption, that every token must attend to every other token, turned out to be sufficient but not necessary.

The same pattern plays out across deep learning. CNNs weren't the final word in computer vision. RNNs weren't the final word in sequence modeling. Transformers almost certainly aren't either. The question was never "will something better come along?" It was "what will it look like?"

Now we have an answer: it looks like a fixed-size state that learns what to remember and what to forget, processed in linear time, optionally augmented with a few attention layers for the tasks that genuinely need global token interaction.

The next time you're designing a system with long contexts, ask yourself: does every token need to attend to every other token? Or is selective state propagation enough?

For most workloads, the answer is shifting.


Sources