machine-learning · ai-architecture · energy-based-models · transformers · deep-learning · scaling-laws

Energy-Based Transformers: The 1982 Architecture Finally Got Compatible Training Tricks


Serendeep Rudraraju

May 07, 2026 · 16 min read

In July 2025, Alexi Gladstone and his collaborators put a paper on arXiv claiming that a neural-network idea first written down in 1982 scales 35% faster than the modern Transformer. Ten months later, no independent lab has published a replication. Both of these things are true. Both of them matter.

The Transformer scaling story has been monolithic since 2020. Bigger pretraining, more data, Chinchilla-optimal mixtures. Energy-Based Models, the framework John Hopfield introduced in 1982 and that won the 2024 Nobel in Physics, were left for dead by ~2012. Then there's a paper. An ICLR 2026 oral. $1.03B raised by Yann LeCun's AMI Labs in March 2026 for the EBM-flavored cousin. And a replication gap nobody is talking about.

TL;DR

Energy-Based Transformers replace softmax-over-logits with a scalar energy and an iterative inference loop, sidestepping the partition function that historically broke EBMs. They scale 35% faster than Transformer++ (under 800M params), match the System 2 thesis Yann LeCun has been making since 2022, and have triggered a 2026 ecosystem of follow-up work. EBTs are the first EBM to reach foundation-model training without collapsing. They have not yet been independently replicated. Both halves of that sentence are load-bearing.

What an Energy-Based Transformer Actually Computes

Strip away the framing and an EBT is a Transformer that outputs a single scalar instead of a distribution, and treats prediction as gradient descent on that scalar. That's it. The novelty is in how you train it.

Mechanically: an EBT maps an input x and a candidate prediction ŷ to one scalar E_θ(x, ŷ) ∈ ℝ. Lower energy means more compatible. The unnormalized joint is p_θ(x, ŷ) ∝ exp(−E_θ(x, ŷ)) — the same Boltzmann form Hopfield wrote down 44 years ago. LeCun's 2006 Tutorial on Energy-Based Learning puts it cleanly in the abstract: "Energy-Based Models capture dependencies between variables by associating a scalar energy to each configuration of the variables. Inference consists in clamping the value of observed variables and finding configurations of the remaining variables that minimize the energy."
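
Concretely, the I/O contract looks like the sketch below. This is a minimal stand-in, not the paper's architecture: the real energy head is a full Transformer, and the dimensions here are illustrative.

python
import torch
import torch.nn as nn

class ToyEnergyModel(nn.Module):
    """Stand-in for E_θ(x, ŷ): two tensors in, one scalar per example out."""
    def __init__(self, x_dim=16, y_dim=16, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(x_dim + y_dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, 1),                  # single scalar energy, no softmax
        )

    def forward(self, x, y):
        # Lower output = more compatible (x, ŷ), per the Boltzmann form above.
        return self.net(torch.cat([x, y], dim=-1)).squeeze(-1)   # shape: (batch,)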

The break from a normal Transformer happens at inference. A standard decoder hands you the answer in one forward pass: softmax over the logits, argmax or sample, done. An EBT initializes a random guess ŷ_0 ~ N(0, I) and runs gradient descent on it:

python
import torch

def ebt_inference(x, model, y_shape, n_steps=8, alpha=0.1, create_graph=False):
    y = torch.randn(y_shape, requires_grad=True)       # random init ŷ_0 ~ N(0, I)
    for _ in range(n_steps):
        energy = model(x, y).sum()                     # scalar energy E_θ(x, ŷ)
        grad = torch.autograd.grad(energy, y,
                                   create_graph=create_graph)[0]  # ∇_ŷ E
        y = y - alpha * grad                           # descend on the prediction
        if not create_graph:
            y = y.detach().requires_grad_()            # pure inference: drop history
    return y                                           # converged prediction
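
Wiring the two sketches together, with illustrative shapes: model = ToyEnergyModel(); ŷ = ebt_inference(torch.randn(4, 16), model, y_shape=(4, 16)). Eight descent steps on a frozen landscape, and the returned tensor is the prediction.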

Note carefully: gradients are with respect to ŷ, not the weights. The weights are frozen at inference; what's being optimized is the prediction itself, treated as a free variable on the energy landscape that the model has learned. The comparison table in the benchmarks section lays the standard Transformer, DiT, and EBT side by side on exactly this point.


Two thinking modes flow from this structure. Increase N, the number of gradient steps, and the model "thinks longer": a deeper descent into the energy basin. Or sample M random initializations, run each to convergence, and pick argmin_j E_θ(x, ŷ_{N,j}): the model verifies its own attempts and ships the best one. Both buy quality with FLOPs, at inference, with no architectural change.
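
The second mode is worth seeing as code. A minimal sketch reusing the ebt_inference above; the best_of_m name and the argmin selection are my labels for what the paper describes, not its API:

python
import torch

def best_of_m(x, model, y_shape, m=4, n_steps=8, alpha=0.1):
    candidates, energies = [], []
    for _ in range(m):                              # M independent random restarts
        y = ebt_inference(x, model, y_shape, n_steps=n_steps, alpha=alpha)
        with torch.no_grad():
            energies.append(model(x, y))            # energy of each converged ŷ, shape (batch,)
        candidates.append(y.detach())
    energies = torch.stack(energies)                # (M, batch)
    ys = torch.stack(candidates)                    # (M, batch, ...)
    best = energies.argmin(dim=0)                   # self-verification: lowest energy wins
    return ys[best, torch.arange(ys.shape[1])]      # per-example winner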

Why This Didn't Work for 40 Years

Three failures stacked. The 1982 framework had real, structural reasons not to scale. The 2025 paper didn't fix the framework — it routed around it.

Failure one: the partition function. EBMs need Z_θ = ∫ exp(−E_θ(x, y')) dy' to produce a real probability. That integral is usually intractable. Maximum-likelihood training has a gradient that depends on Z, so every update needs samples from the model itself. Goodfellow, Bengio, and Courville devote an entire chapter — Ch. 18, "Confronting the Partition Function" — to the problem. The textbook framing: the integral is "intractable for many interesting models," so the field built models that "do not involve computing p(x) at all." Softmax classifiers. Autoregressive language models. Transformers. Every dominant deep-learning architecture is structured to dodge the EBM tax.
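
For anyone who wants the one-line version of why (a standard identity, not specific to any one paper): the maximum-likelihood gradient is ∇_θ(−log p_θ(y|x)) = ∇_θE_θ(x, y) − 𝔼_{y′∼p_θ(·|x)}[∇_θE_θ(x, y′)]. Push the data's energy down; pull the energy up wherever the model currently puts mass. The second term is an expectation over the model's own distribution, and that is exactly where the MCMC sampling requirement comes from.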

LeCun himself, in the 2006 tutorial, conceded the cost in one of the drier lines in machine-learning literature:

"Hence probabilistic modeling comes with a high price, and should be avoided when the application does not require it."

— Yann LeCun et al., A Tutorial on Energy-Based Learning, 2006

Even the framework's leading advocate said the math wasn't worth the cost most of the time.

Failure two: contrastive divergence is broken. The standard workaround was contrastive divergence with short-run MCMC, due to Hinton in 2002. Du and Mordatch's 2020 paper is blunt about what was happening: CD has "a gradient term neglected in the popular contrastive divergence formulation" that "is important in avoiding training instabilities that previously limited applicability and scalability of energy-based models." The 2010s ML establishment didn't ignore EBMs out of fashion. They had a documented instability problem, and nobody could confidently train an EBM past the size where it stopped fitting on a single GPU.

Failure three: nobody made one work at scale. From RBMs in 2009 through Du and Mordatch's 2019 ImageNet result, no EBM crossed a billion parameters with stable training. The EBT paper itself, in §3.4, puts a number on it: "zero publicly known Foundation EBMs" prior to its publication. From 2009 to 2025, while feed-forward Transformers crossed the trillion-parameter mark, the EBM camp had nothing at a scale anyone in industry would notice.

The Royal Swedish Academy gave Hopfield and Hinton the 2024 Nobel in Physics for "foundational discoveries and inventions that enable machine learning with artificial neural networks." Hopfield's network is "described in a manner equivalent to the energy in the spin system found in physics." This is the end-of-an-era citation. The framework is recognized as foundational at exactly the moment the field decides it's also salvageable.

Then July 2025 happened.

What Gladstone et al. Changed

EBTs aren't a new kind of EBM. They're a new training procedure for the same old framework, one that happens to dodge every classical failure mode by accident.

The training trick is the headline. No contrastive divergence. No MCMC. No partition-function approximation. The training loss is the standard supervised loss between the converged prediction ŷ_N and the ground-truth y (cross-entropy for tokens, MSE for image patches), backpropagated through the entire N-step inference trajectory. Side by side:

python
# Classical EBM training (the historical approach that didn't scale)
def classical_ebm_step(x, y, model, optimizer, k=20):
    optimizer.zero_grad()
    pos_energy = model(x, y).sum()                  # energy of the data sample
    y_neg = mcmc_sample(model, x, n_chain_steps=k)  # stand-in: sample from p_θ via short-run MCMC
    neg_energy = model(x, y_neg.detach()).sum()
    loss = pos_energy - neg_energy                  # CD objective: biased, drops a gradient term
    loss.backward()                                 # unstable in practice
    optimizer.step()

# EBT training (Gladstone et al. 2025)
def ebt_step(x, y, model, optimizer, n_steps=8):
    optimizer.zero_grad()
    y_pred = ebt_inference(x, model, y.shape,
                           n_steps=n_steps, create_graph=True)  # full inference loop, graph kept
    loss = supervised_loss(y_pred, y)               # cross-entropy / MSE
    loss.backward()                                 # backprop *through* the loop
    optimizer.step()                                # pays for Hessian-vector products

The training signal becomes: teach the energy landscape such that gradient descent on ŷ from a random start lands at the right answer. The verifier and the generator in one model. The partition function never appears.

Three stability tricks earn their keep (§3.3 of the paper). A replay buffer recycles previously-optimized ŷ trajectories so the energy landscape is well-defined far from initialization. Langevin noise in the inference update (ŷ_{i+1} = ŷ_i − α∇E + η_i) lets the model escape spurious local minima rather than collapse onto one mode. Randomized step size and step count keep the model from overfitting to a specific optimization schedule. None of these is novel on its own. The combination is what hadn't been tried at this scale.
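
A sketch of how the three tricks land in the training-time inference loop. The ranges and the 50% buffer-resume probability are illustrative guesses, not the paper's values; replay_buffer is assumed to be a plain list:

python
import random
import torch

def ebt_inference_for_training(x, model, y_shape, replay_buffer):
    n_steps = random.randint(4, 16)                   # randomized step count
    alpha = 10 ** random.uniform(-2.0, -0.5)          # randomized step size
    if replay_buffer and random.random() < 0.5:
        y = replay_buffer.pop().requires_grad_()      # resume a past trajectory
    else:
        y = torch.randn(y_shape, requires_grad=True)  # fresh N(0, I) init
    for _ in range(n_steps):
        energy = model(x, y).sum()
        grad = torch.autograd.grad(energy, y, create_graph=True)[0]
        noise = 0.01 * torch.randn_like(y)            # Langevin noise η_i
        y = y - alpha * grad + noise                  # escape spurious local minima
    replay_buffer.append(y.detach())                  # seed future trajectories
    return y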

The lead author concedes the obvious on his blog: "There is a long way to go in scaling these models up (I'm mainly looking at you, potential stability issues)." Stable enough for an 800M-parameter paper. Not yet stable enough to bet a frontier model on.

The System 2 connection is structural, not rhetorical. LeCun's 2022 paper explicitly proposed reasoning as energy minimization in an actor module — same equation form the EBT inference procedure uses. The structural lineage is real: descend the energy landscape until convergence, output the basin you land in. Unlike o1 or DeepSeek-R1, where System 2 emerges from RL on tasks with verifiable rewards (math, code), EBTs claim System 2 emerges from pretraining alone, on any modality. That's a stronger claim. Whether it survives at frontier scale is the open question.

My conjecture, labeled as such: the deeper unlock isn't any single trick. It's that compute is now cheap enough to backprop through 8–32 inference steps during training. Hessian-vector products were prohibitive at the scale 2019 EBMs were trying to train at. Today they're a constant-factor overhead on top of a Transformer that costs $10M to train anyway.
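
A self-contained toy that makes the cost concrete; the MLP stands in for the Transformer, and the shapes are arbitrary. The point is the final backward, which differentiates through ∇_ŷE and therefore pays for second-order terms:

python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 32), nn.GELU(), nn.Linear(32, 1))  # toy E_θ
x, y_true = torch.randn(4, 4), torch.randn(4, 4)
y = torch.randn(4, 4, requires_grad=True)          # the prediction, as a free variable

energy = model(torch.cat([x, y], dim=-1)).sum()    # E_θ(x, ŷ)
grad_y = torch.autograd.grad(energy, y, create_graph=True)[0]  # keep the graph
y_pred = y - 0.1 * grad_y                          # one inference step
loss = ((y_pred - y_true) ** 2).mean()
loss.backward()  # flows back through ∇_ŷE: Hessian-vector products w.r.t. θ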

Lead with the Win, Concede the Caveats

The headline numbers are real and peer-reviewed (ICLR 2026 oral). Every one comes with a caveat that a senior engineer will find on the second read of the table.

The wins. EBTs achieve "an up to 35% higher scaling rate" than Transformer++ across data, batch size, parameters, FLOPs, and depth — a slope improvement on the fitted scaling curves, not absolute speed at a fixed point. On image denoising, EBTs land higher PSNR (27.25 vs 26.58) and lower MSE (122.55 vs 142.98) than DiT at σ=0.1 noise. With 99% fewer forward passes. Given more inference compute, EBTs improve "29% more than the Transformer++". A delta-of-deltas, but a non-trivial one. The architecture sibling EBT-Policy (Davies et al., October 2025) beats Diffusion Policy on simulated and real robotic manipulation, converges in 2 inference steps versus 100 (~50× reduction), and recovers zero-shot from failed action sequences without retry training. That last result, in robotics, is the cleanest production-shape win EBTs have so far.

The architecture comparison reads like this:

| Dimension | Standard Transformer | Diffusion Transformer (DiT) | Energy-Based Transformer (EBT) |
|---|---|---|---|
| Output | Softmax over vocab logits | Predicted noise / velocity | Single scalar energy E_θ(x, ŷ) |
| Training | Cross-entropy on next token | Denoising score-matching | Supervised loss on ŷ_N, backprop through N-step inference |
| Inference | 1 forward pass | N denoising steps (default 250) | K gradient steps × (forward + backward through energy head) |
| Test-time compute lever | Beam search, CoT | Number of denoising steps | N (steps) and M (random restarts) |
| Scaling claim | Chinchilla-optimal | Monotonic FID gains to 675M | "Up to 35% higher rate" vs Transformer++ |
| Production deployments | Universal | Stable Diffusion, Flux, Sora-class | None known; 800M research artifact |
| Math sidesteps | None (softmax is closed-form) | Score parameterization | Never computes Z; backprops through inference loop |

The caveats — and every one is in the paper. The scale ceiling is 800M parameters. Every claim is extrapolated from sub-1B scaling curves. Frontier transformers are 100B–10T. Whether the 35% slope holds, accelerates, or collapses past 1B is unknown. EBT loses to Transformer++ on GSM8K (43.3 vs 49.6 with thinking, per The Batch's reading of Table 3) — the strongest reasoning benchmark in the table is one EBT doesn't win. Pretraining perplexity is worse (33.43 vs 31.36). The "EBT generalizes better than its perplexity suggests" framing is real but selectively true.

The bidirectional EBT for masked text collapses. The paper's own admission: the model predicts "the" for every masked token. Classical EBM mode collapse, not fully solved, just routed around in the autoregressive variant. Training compute overhead is real: The Decoder reports 3.3×–6.6× more FLOPs to train, The Batch reports ~10× to reach matched perplexity. The two numbers measure different things. Both are caveats.

And the "29% more System 2 improvement"? Measured on perplexity. Not AIME, not MMLU-Pro, not HumanEval. The paper does not benchmark against o1 or DeepSeek-R1.

Lead with the win. But say the rest.

The 2026 Ecosystem

Ten months in, the field is treating this seriously. There's a paper trail, a workshop, a billion dollars, and a critique. There is no replication.

The theory side has caught up. Mathieu Blondel and collaborators at Google DeepMind (arXiv 2512.15605, three revisions in 2026) prove a function-space bijection: autoregressive language models are energy-based models. Not metaphorically. Bijectively. Nobody has yet retrofitted Llama or Mistral with EBT-style inference, but the math now says you could.
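
One way to see the direction of the correspondence, offered as my gloss rather than Blondel et al.'s construction: define E_θ(x, y) := −logit_θ(y|x) over next tokens, and the softmax is exactly p_θ(y|x) = exp(−E_θ(x, y)) / Σ_{y′} exp(−E_θ(x, y′)), an EBM whose partition function happens to be tractable because the sum runs over a finite vocabulary. Retrofitting would mean swapping that closed-form readout for iterative descent on a continuous candidate, which is precisely the part nobody has done.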

The practitioners are running the experiment. Ying Nian Wu's group at UCLA (arXiv 2602.06584, February 2026) gets a 0.2B model with 30 "rethinking" iterations to beat baselines 10–15× larger on math reasoning. Same energy-as-optimization framing, different head. And the small-model-beats-big-model result is exactly the "test-time compute beats parameter count" thesis the field has been arguing about since DeepSeek-R1.

The robotics side is shipping. EBT-Policy converges in 2 steps versus Diffusion Policy's 100. Recovery from failed action sequences without retraining. Robotics has fewer ideological tribes than language modeling. The architecture wins on the metrics or it doesn't, and EBT-Policy is winning.

The EBM workshop is back at ICLR. The NFAM workshop at ICLR 2026 (April 26, 2026, Rio) is the first dedicated associative-memory and EBM workshop at a top-tier venue in years. Speakers include Jay McClelland, Paul Liang, Ben Hoover. The fact that the workshop exists is the field signaling it's worth a workshop again.

The money is real, even if it's not branded EBT. AMI Labs raised $1.03B at $3.5B pre-money in March 2026 to build JEPA-based world models. Logical Intelligence launched in 2026 as "the World's First Energy-Based Model for Critical Systems," with LeCun as Founding Chair of the Technical Research Board. JEPA is EBM-flavored. World models are EBM-flavored. The 2026 industry narrative is energy-based, even if the EBT brand isn't the carrier.

The critique exists, and it's pointed. NRGPT v3 by Dehmamy, Hoover, Saha, Kozachkov, Slotine, and Krotov (arXiv 2512.16762, most recent revision May 1, 2026) calls out EBT for "implementation challenges, primarily due to the potential for information leakage in naïve implementations." It's the closest thing to a 2026 EBT skeptic paper, and it comes from a group that includes Dmitry Krotov, a long-time Hopfield-network theorist with no axe to grind against EBMs in general.

The gap nobody is closing: as of May 2026, no independent lab has published a replication of Gladstone's 35% scaling-rate claim. The supporting theory is real. The empirical confirmation is missing. Ten months in, that itself is the story.

The Strongest Skeptic Case

The case against EBTs is strong enough to engage seriously, and most of the argument comes from inside the paper.

The steelman, paraphrased: EBTs are an iterative-refinement Transformer with an "energy" framing. DiT already does iterative refinement at inference (250 denoising steps standard). PonderNet, Universal Transformers. The lineage of "think longer at inference" architectures predates EBTs by years. If you took an EBT, dropped the Boltzmann interpretation, and called the scalar output "a learned step-size signal," the contribution becomes "stability tricks for training iterative-refinement Transformers at scale." Real. But more incremental than "40-year-old idea beats Transformers."

The benchmark wins concentrate on out-of-distribution and structured-reasoning tasks. The in-distribution losses (GSM8K, pretraining perplexity) are also real. Selecting only for the wins is selection bias, and a senior reviewer would catch it.

Inference cost is obscured. A single EBT forward pass requires forward + backward through the energy head: roughly 2× the FLOPs of a vanilla Transformer pass. With N=4 gradient steps, that's 8× a standard Transformer's per-token inference cost, before you've gotten any "thinking" benefit relative to a standard Transformer that also gets 8× via beam search or longer chain-of-thought.

The "EBM rebrand" version of the case, named: Gladstone et al. trained an iterative-refinement transformer with end-to-end backprop through the inference loop. The energy interpretation is mathematically clean but functionally close to a learned step-size schedule. The contribution is "stability tricks for iterative-refinement transformers at scale." That's a real contribution. It's not "the framework Hopfield invented in 1982 is back to beat Transformers."

My honest take, labeled as opinion: the strongest skeptic case is a hybrid. The empirical wins at this scale are mixed. The architectural lineage from existing iterative-refinement work is closer than the paper's framing implies. Until someone trains a 70B-parameter EBT and beats a 70B Llama on reasoning, "EBMs vindicated" is a thesis with promising data, not a settled result.

The right framing is narrower: the first time an EBM crossed 100M parameters and didn't collapse, with intriguing scaling that we can't verify at frontier scale yet.

That's calibration, not dismissal.

Where This Leaves You

Don't bet production on EBTs. Track them. Know what would change your mind.

  1. Track the GitHub repo. github.com/alexiglad/EBT is Apache-2.0, ~627 stars at time of writing, and includes custom flash-attention with second-derivative support. If a third-party fork crosses 5B parameters with the scaling rate maintained, that's the signal.
  2. Read NRGPT v3. arXiv 2512.16762 is the most rigorous 2026 alternative framing. The "information leakage in naïve implementations" critique is specific enough to read before you commit engineering time to a fork.
  3. Watch JEPA and AMI Labs more than the EBT brand. $1.03B is going into the EBM-flavored cousin, not the EBT label. If the next big architectural deployment is energy-based, it's likely JEPA-shaped, not EBT-shaped — and the deployment will tell you which version of the framework actually shipped.
  4. Don't migrate inference budgets yet. EBT inference is roughly 2N× standard Transformer per token. Without a frontier-scale reasoning win, the FLOP economics don't pencil for production serving.
  5. Update your priors when one of three things happens. A 10B+ EBT is published. Someone independently reproduces the 35% scaling claim. A major lab announces an EBT-based deployment. None of these has happened. Two of them might in 2026.

The frame for senior engineers: EBTs are the post-Transformer architecture worth paying attention to because they could be wrong. The framework is old. The training trick is new. The scaling claim is unreplicated. The field is treating it seriously enough to fund the EBM-flavored adjacent. That's the configuration where unexpected results land.

Closing

The 1982 framework was right about the math. Wrong about the training. The 2025 paper didn't change the math; it ducked it.

Hopfield wrote down the energy function 44 years ago. LeCun wrote the tutorial 19 years ago. Gladstone wrote the training loop last summer. The hard part is what it always was: showing it scales when nobody is paying you to ignore the caveats.

