Back
ai-agentscontext-engineeringcoding-agentsllmdeveloper-tools

Your Coding Agent Has Amnesia: It's the Harness, Not the Model

SR

Serendeep Rudraraju

June 09, 2026·18 min read
Your Coding Agent Has Amnesia: It's the Harness, Not the Model

The same model that fixes a bug in twelve seconds will, three hours into a refactor, confidently reintroduce the bug it fixed in hour one. It hasn't gotten dumber. It has forgotten.

I spend most of my working day inside coding agents, and the failure I actually hit is almost never a reasoning failure. The model can reason. What it can't do is hold the project in its head long enough to finish the job. It loses the thread across files, across sessions, across the compaction it just performed on itself. We talk about these tools as if the model is the whole story. "Opus is better than GPT-5 at code," as though that settled anything. It doesn't, because by the time you're four files deep, the thing deciding whether the work survives isn't the weights. It's the machinery around them that decides what the model sees and what it forgets.

There's now a benchmark and a peer-reviewed paper that say this out loud.

TL;DR

The bottleneck in agentic coding in 2026 is not model intelligence. It's the memory architecture wrapped around the model. The same model swings double-digit accuracy points depending on which harness runs it, agents collapse from ~73% to 25% the moment a task spans many files, and "more context" turns out to make recall worse, not better. This is an audit of the three ways agents try to remember, why the vendor benchmarks contradict each other, and how to structure your agent's memory so it stops forgetting.

The failure nobody put on the leaderboard

Look at the benchmark numbers that made the headlines and you'd think this problem was solved. SWE-bench Verified — the one every vendor quotes — sits around 72.8% for a frontier model. That sounds like a tool that mostly works.

But SWE-bench Verified is a single-issue benchmark. Fix one bug, in a known spot, in a repo you were handed. That is the easy shape of software work, and it's the shape agents got good at. Real engineering is the other shape: interpret a vague requirement, coordinate changes across a dozen files, and keep the thing working across many iterations.

When you benchmark that, the floor gives out. SWE-EVO, published in December 2025, builds its tasks from the release notes of seven mature open-source Python projects. Each task averages 21 files changed and is graded against 874 tests. The best agent in the paper, GPT-5.4 driving OpenHands, scores 25%. Same class of model that clears 72.8% on the easy benchmark loses two-thirds of its success rate the moment the work gets long.

SWE-bench VerifiedSWE-EVO
Task shapeOne isolated issueMulti-file software evolution
Files touchedUsually one~21 on average
Tests per taskA handful~874
Best agent score~72.8%~25%
What it measuresCan the model code?Can the agent remember while it codes?

A separate benchmark, SlopCodeBench, adds the dimension that makes this worse: it measures how agents degrade over long-horizon iterative tasks, and the answer is that each turn tends to make the next one worse. The agent doesn't hold a quality line. It drifts.

None of this is the model failing to reason. It's the model failing to remember what it already decided. And memory isn't a property of the weights. It's a property of the harness.

The binding constraint

For a long time "the harness matters" was folklore, something you believed because you'd watched the same model behave differently in two tools. As of 2026 it's measured.

Terminal-Bench, a collaboration out of Stanford and the Laude Institute, is built for exactly this. Its 2.0 release has 89 tasks, and the leaderboard pairs an agent harness with a model. That pairing is the whole point: the same model shows up multiple times under different harnesses, so the contribution of the scaffolding is visible in the data instead of inferred from vibes.

The deltas are not small. Take a single model, GPT-5.3-Codex, and read it across the harnesses that wrap it on the Terminal-Bench 2.0 board.

Harness (all running GPT-5.3-Codex)Built byTerminal-Bench 2.0
SageAgentOpenSage78.4% ± 2.2
DroidFactory77.3% ± 2.2
CodeBrain-1.5Feeling AI75.8% ± 2.0
Simple CodexOpenAI75.1% ± 2.4
MuxCoder74.6% ± 2.5
Terminus 2Terminal-Bench64.7% ± 2.7

Same weights, same 89 tasks. The score swings from 64.7% to 78.4% depending on nothing but the scaffold around the model. That's almost fourteen points, an order of magnitude past the error bars. And look at where OpenAI's own harness lands: Simple Codex, written by the lab that trained the model, sits mid-pack at 75.1%, beaten by Factory, OpenSage, and others driving the exact same weights. The bottom of the table, at 64.7%, is Terminus 2 — Terminal-Bench's own reference harness. The model isn't the thing that's varying here. The harness is.

In May 2026, Zhang and colleagues gave this a name. Their paper has the bluntest title in the genre, Stop Comparing LLM Agents Without Disclosing the Harness, and a thesis to match.

The agent execution harness is often a stronger determinant of agent performance than the model it wraps.

— Yunbei Zhang et al., Stop Comparing LLM Agents Without Disclosing the Harness

They call it the Binding Constraint Thesis, and the sharpest finding is that harness-induced variance can be large enough to cause model ranking reversal. The same board shows it happening. Put Claude Opus 4.6 in Stanford's Meta-Harness and it scores 76.4%; put GPT-5.3-Codex in the Terminus 2 reference harness and it scores 64.7%. On that pairing you'd conclude Claude is the better coding model by nearly twelve points. Now swap the scaffolds. GPT-5.3-Codex in SageAgent reaches 78.4%, while Opus 4.6 in Anthropic's own Claude Code submission manages 58.0%, and GPT wins by twenty. Same two models, opposite verdicts, decided entirely by what you wrapped them in. So when someone asks "is Opus better than GPT-5 at code?", the honest answer is another question. Better inside which harness?

There's a quiet embarrassment in those numbers, too: neither lab builds the best harness for its own model. OpenAI's Simple Codex sits mid-pack on GPT-5.3-Codex, behind Factory's Droid and OpenSage's SageAgent. Anthropic's Claude Code submission lands near the bottom of the board on Opus 4.6, while Stanford's Meta-Harness pulls eighteen more points out of the identical weights. And nobody is actually finishing the thing — the top scaffold on the board, a multi-model entry called LemonHarness, tops out at 84.5%. Whatever the model knows, someone else's scaffolding is usually better at getting it out, and even the best of it leaves one task in six on the floor.

The cleanest demonstration is a controlled before-and-after, and it isn't on the leaderboard at all. In early 2026 LangChain held the model fixed at GPT-5.2-Codex and changed nothing but the scaffold: the system prompt, the tools, and middleware hooks, including a "reasoning sandwich" that dials reasoning high to plan, low to build, then high again to verify. That alone moved the same weights from outside the top 30 on Terminal-Bench 2.0 to rank 5, 52.8% to 66.5%. Thirteen-plus points, no new model, just a better harness around the old one.

I want to be careful not to oversell this. A great harness cannot conjure capability that the model doesn't have; wrap a weak model in the best scaffold on earth and it still loses to a strong one. The claim is narrower and more useful than "the model doesn't matter." It's that at the frontier, where the top models are bunched within a few points of each other, the harness is the variable that decides the outcome. And the harness is the part you can actually change.

The effect isn't universal or monotonic, either, and the honest version says so. A fancier harness isn't automatically a better one. METR, measuring how long a task an agent can sustain, found that Claude Code and Codex did not outperform its own plain ReAct and Triframe scaffolds - the lab-built, heavily-optimized harnesses tied or lost to generic ones. And Scale's SWE-Atlas, which deliberately reruns models on a common minimal scaffold to strip the harness back out, finds the top models clustered within a few points. Set beside Terminal-Bench, the picture is that harness effects are real but regime-dependent: double digits in some setups, lost in the noise in others. That isn't a reason to ignore the harness. It's a reason to measure it on your own workload instead of trusting a leaderboard.

The corollary stings a little: a leaderboard score with no disclosed harness is half a number. It tells you how that model did inside one particular memory system, on one particular day, and almost nothing about the model in isolation.

Why "just use a bigger context window" backfires

The obvious fix, if the agent keeps forgetting, is to stop making it forget. Give it a million-token window. Dump the whole repo and the entire session history in there and let attention sort it out.

This is the intuition that makes the problem worse, and there's a clean study showing why.

Start with the distinction the field finally made explicit: long context and long-term memory are different problems. The benchmarks split along that line now. RULER and needle-in-a-haystack test whether a model can find a fact inside a big input. LongMemEval and BEAM test whether an agent can maintain continuity across many sessions. Treating a big context window as if it were memory is the core category error, and the 2026 literature treats them as separate fields with separate evaluations.

Then there's what actually happens when you fill the window. Chroma's "context rot" study ran 18 models (the Claude 4 family, the GPT-4.1 family, o3, Gemini 2.5, Qwen3) through controlled long-input tasks. Performance degraded as input grew, non-uniformly, even on trivial retrieval. Every model family scored materially higher on a focused prompt of a few hundred tokens than on the same needle buried in ~113K tokens of haystack. My favorite finding, because it's so counterintuitive, is this one: models do worse when the haystack has logical flow than when it's shuffled into nonsense. Give the model a coherent, realistic context and its recall drops relative to randomized filler.

As the number of tokens in the context window increases, the model's ability to accurately recall information from that context decreases.

— Anthropic, Effective context engineering for AI agents

The practical version of this bites people who think a memory file is free. An OpenCode user filed an issue after a 331KB AGENTS.md, about 83K tokens, got injected into the system prompt on every loop, consumed 81% of a 128K window before the agent did a single useful thing, and triggered an endless compaction loop trying to dig out. Their "memory" had become a denial-of-service attack on their own context budget.

So the design constraint flips. If filling the window degrades the model, the job of a memory system isn't accumulation. It's curation under a budget. Decide what the model sees, keep it small, keep it relevant. Which is exactly the problem every shipping agent is now trying to solve, and they don't agree on how.

The audit: three ways agents try to remember

The lineage here is short but real. In October 2023, Packer and colleagues published MemGPT: Towards LLMs as Operating Systems, which borrowed the oldest trick in systems design, virtual memory, and applied it to context windows. Tiers of memory, with the model paging information in and out of its limited context the way an OS pages RAM to disk. That framing, memory as a managed hierarchy rather than one flat window, is the ancestor of everything shipping today.

What shipped split into three families. And it keeps shipping: scroll any developer forum this month and you'll find a steady run of homemade fixes, each opening with the same confession that the agent forgets everything between sessions. People are bolting on git-backed worklogs, graph memory layers, MCP memory servers, even a Kanban board the agent updates itself - a new memory band-aid roughly every week. The shipped approaches disagree, by design, about what "relevant" means.

Whole-file instruction injection is the simplest. The agent re-reads a set of markdown files into its prompt at the start of each session or loop: CLAUDE.md, AGENTS.md, copilot-instructions.md, Cline's six-file Memory Bank. It's transparent, version-controllable, and you can read exactly what the agent is being told. Cline is the purest expression of it. The docs state plainly that it "resets completely between sessions" and rebuilds its entire understanding from those files. The failure mode is the 331KB story: injection isn't retrieval, it's a dump, and the dump competes with everything else for the budget. Anthropic's own guidance is to keep these files under 200 lines, because past that they consume context and reduce adherence.

Embedding and vector retrieval is what most "understands your codebase" marketing refers to. Cursor chunks your repo into functions and classes, embeds them into a vector database, and retrieves by semantic similarity, layered with grep. Windsurf does something similar with local embeddings. This scales to large repositories and retrieves by meaning rather than literal match. Its failure modes are subtler: retrieval is not utilization (a 2026 paper specifically diagnoses agents that retrieve the right snippet and then ignore it), indexes go stale, and similarity is a blunt instrument that misses structural dependencies a call graph would catch.

Graph and symbol retrieval is the contrarian branch. Aider builds a "repo map" with no embeddings at all. It parses the codebase with tree-sitter, builds a dependency graph of symbols, ranks them with a PageRank-style algorithm, and emits a token-budgeted map of the most-referenced definitions. Zep takes the same instinct into a temporal knowledge graph, storing time-aware facts and traversing them. These retrieve by structure or by chronology rather than by semantic vibe. The cost is building and maintaining the graph, and the ceiling is the quality of the parse.

Sitting on top of all three is the layer Anthropic has been formalizing as context engineering: compaction (summarize the conversation when it nears the limit and reinitialize), tool-result clearing (drop stale tool outputs), and sub-agents (spin off a worker that burns tens of thousands of tokens exploring and returns a 1,000–2,000 token summary). The newest move is to keep the session log outside the context window entirely: an append-only event store the model reads positional slices of on demand, so "the brain" never has to hold the whole history at once.

Here's how the major agents actually line up.

Agent / frameworkPrimary mechanismRetrieval stylePersistent store
Claude CodeCLAUDE.md + auto-memory + compaction + sub-agentsOn-demand file reads, agent-driven search~/.claude/projects/<proj>/memory/
CursorCodebase index + rules + MemoriesVector similarity + grep + Explore sub-agentEncrypted vector DB, .cursor/rules/
GitHub CopilotTool-driven context discovery + instructionsActive tool calls, no passive injection.github/copilot-instructions.md
ClineMemory Bank (six markdown files)Full re-read at session startmemory-bank/*.md
OpenCodeAGENTS.md + auto-compactionInstruction injection + summarizationMarkdown instruction files
WindsurfAuto-Memories + rules + local RAGLocal embeddings + assembly pipeline~/.codeium/windsurf/memories/
AiderRepository maptree-sitter symbol graph + PageRankNone — deterministic per run
ZedRules + @-mentions + searchExplicit context + codebase search.rules / CLAUDE.md / AGENTS.md
mem0Extract → consolidate → retrieveVector + optional graphExternal memory service
Letta (MemGPT)OS-style tiered memorySelf-editing memory via tool callsCore in-context + external DB
ZepTemporal knowledge graphGraph traversal over time-aware edgesTemporal KG (Graphiti)

The shape underneath the table is the same loop in every case: pick a retrieval strategy, assemble a curated window, work, and when the window fills, compact or offload and go again.

Loading diagram...

The counter-argument: benchmark theater

This is the part of the post where I'm supposed to tell you which memory framework wins. I can't, and the reason is instructive.

Every vendor in this space benchmarks itself to first place, and their numbers contradict each other. They use overlapping datasets (LoCoMo, LongMemEval, DMR), and each reports a configuration where it leads.

FrameworkHeadline claimOn benchmarkWho it beats
mem0+26% accuracy, 91% lower p95 latency, ~90% fewer tokensLoCoMoOpenAI memory
Letta74.0% with a simple agentLoCoMomem0's 68.5% graph variant
Zep94.8% vs 93.4%DMRMemGPT
Supermemory#1 simultaneouslyLoCoMo + LongMemEval + ConvoMemeveryone

Read that table closely. mem0's paper reports its graph variant around 68.5% on LoCoMo. Letta's rebuttal puts a Letta agent at 74.0% on the same benchmark, on gpt-4o-mini, by simply storing conversation histories in files — beating mem0's most elaborate graph configuration with the least sophisticated method on the board, and saying so directly. They can't both be the state of the art on the identical dataset. And the dataset itself is contested. LoCoMo has documented quality problems, which is part of why Letta and Supermemory keep pushing alternative benchmarks. The benchmark I win on is, naturally, the real one.

I don't think these vendors are lying. The honest reading is that all of these are real engineering wins over the naive baseline of stuffing the full history into context. mem0's ~90% token reduction is a genuine, useful result. And their head-to-head rankings against each other are marketing until somebody neutral reproduces them. Both things are true at once.

Which is exactly why Terminal-Bench matters more than any of these leaderboards. It holds the model fixed and is run by a third party with no framework to sell. The vendor memory leaderboards hold nothing fixed: different models, different prompts, different datasets, scored by the people shipping the product. One of these is evidence. The others are advertising with a methodology section.

What this actually means for how you run agents

If the harness is the binding constraint, the good news is that you have more leverage than the model-of-the-month discourse suggests. Most of it lives in how you structure memory, not in which weights you rent. Here's what I'd actually do.

Stop optimizing for context-window size. Picking an agent because it advertises a million-token window is optimizing the wrong variable. Context rot means a curated few thousand tokens beats a stuffed million. Budget your context like it's expensive, because in accuracy terms it is.

Keep instruction files small and scoped. Target Anthropic's under-200-lines guidance, and path-scope your rules so they load only when you're in the relevant part of the tree. The 331KB AGENTS.md is the cautionary tale: a memory file big enough to starve the context it was meant to enrich. Boring and load-bearing beats big.

Use compaction and sub-agents on purpose, not by accident. Offload exploration to sub-agents that return summaries instead of letting the main loop read everything itself. Let compaction reclaim the window, but know what survives it. Project-root instruction files get re-injected after a compaction; the clever thing you said in conversation three hours ago may not.

Match the memory family to the job. A large, unfamiliar repo wants embedding or graph retrieval, like Cursor or Aider. A stable project with strong conventions is well served by file-based memory you write and read yourself. Don't bolt a vector database onto a problem that grep and a good CLAUDE.md would solve.

Distrust any agent benchmark that doesn't disclose its harness. When you evaluate for yourself, hold the model fixed and vary the harness. That's the comparison that tells you what to change, and per the Binding Constraint Thesis, it's the one that moves the result.

Where this leaves you

The discourse is stuck on the model. Which lab shipped the smartest weights this month, which benchmark moved a point. But the weights are converging, the top models sit within a few points of each other, and the thing that decides whether your agent finishes the refactor or re-breaks it in hour three is the unglamorous machinery around the model: what it remembers, what it forgets, and who decided which.

I'll be honest about the gap, because the field has one. We don't yet have a neutral, coding-specific memory benchmark. Terminal-Bench measures harness quality, SWE-EVO measures long-horizon coding, and neither isolates memory the way LongMemEval isolates it for conversation. So we're building memory systems faster than we can measure them, and the vendor numbers are filling the vacuum. That's worth saying plainly rather than pretending the leaderboards have it covered.

Your agent's intelligence is rented from a lab. Its memory is the part you actually own. So it's the part worth engineering.


Sources

Enjoyed this post? Consider supporting the blog.

Buy me a coffee