Coding Assistants

A Chemist Built a 108,000-Line Codebase With AI. The Secret Was 26,000 Lines of Documentation.

New paper proposes a three-tier memory system for AI coding agents, backed by 283 real dev sessions.

Oliver Senti
Senior AI Editor
March 2, 2026 · 6 min read
[Figure: layered architectural diagram showing three tiers of knowledge flowing into an AI coding session, with documents and agent icons at different depths]

Aristidis Vasilopoulos is a chemist, not a software engineer. But he just published a research paper describing how he built a 108,000-line C# distributed system using Claude Code as his sole code-generation tool, directed by human prompting and a small army of 19 specialized AI agents. The catch: he needed roughly 26,000 lines of structured documentation to keep those agents from losing their minds between sessions.

That ratio, one line of documentation for every four lines of code, is the most interesting number in the paper. It tells you something about where we actually are with AI-assisted development, versus where the marketing suggests we are.

The memory problem nobody's solved

Anyone who's spent a week with Claude Code or Cursor on a nontrivial project knows the frustration. Session one, the agent understands your architecture. Session two, it's reintroducing bugs you fixed yesterday. Session twelve, it's wiring damage calculations through a deprecated path because nobody told it about the migration you did in session eight. (That specific example comes from the paper, and I suspect it's familiar to a lot of people reading this.)

The current solutions are underwhelming. Single-file manifests like .cursorrules, CLAUDE.md, and AGENTS.md work fine for small projects. A 1,000-line prototype can be fully described in a single prompt. But Vasilopoulos's paper makes a straightforward argument: a 100,000-line system cannot, and trying to cram everything into one file is a recipe for exactly the kind of context loss that makes agents unreliable at scale.

According to a separate study on AGENTS.md files, the format has been adopted by over 60,000 repositories. And a survey of 466 open-source repos found that only about 5% had adopted any context file format at all. So we're still early. The developers who are doing it are seeing results, though: AGENTS.md files were associated with a 29% reduction in median runtime and 17% fewer output tokens.

Three tiers of memory

Vasilopoulos's solution is a three-tier system, and the architecture is more practical than academic. Tier 1 is a "constitution" (his term) that loads into every session automatically: project conventions, naming standards, build commands, and a trigger table that routes tasks to the right specialist agent. Think of it as hot memory, always present.
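The paper doesn't reproduce the trigger table itself, but the routing idea is simple enough to sketch. Here's a minimal illustration in Python; the keywords and agent names are invented for this example, not taken from the paper:

```python
# Illustrative sketch of a "constitution" trigger table: a small routing
# layer that matches keywords in an incoming task and dispatches it to a
# specialist agent. Keywords and agent names are hypothetical.

TRIGGER_TABLE = {
    "review": "code-reviewer",
    "packet": "networking-specialist",
    "latency": "debug-profiler",
}

def route_task(task: str) -> str:
    """Return the specialist agent a task should be routed to.

    Falls back to a general-purpose agent when no trigger matches,
    so every task still has a handler.
    """
    lowered = task.lower()
    for keyword, agent in TRIGGER_TABLE.items():
        if keyword in lowered:
            return agent
    return "general-agent"
```

The point of keeping this table in the always-loaded Tier 1 file is that routing decisions happen before any specialist context is paid for: only the matched agent's knowledge gets pulled into the session.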

Tier 2 consists of 19 specialized agents, each a domain expert with focused prompts and embedded project knowledge. A code reviewer. A networking protocol specialist. A debug profiler. When the constitution detects a relevant task, it invokes the right agent. Over half of each agent's specification content is domain knowledge (codebase facts, formulas, failure modes) rather than behavioral instructions. The networking agent alone runs to 915 lines, roughly 65% of which is domain context.

Tier 3 is cold memory: 34 on-demand specification documents served through a Model Context Protocol server with search tools. Agents pull what they need, when they need it, instead of loading everything up front.
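The real system serves these documents through an MCP server, but the search-then-fetch pattern itself can be shown in a few lines of plain Python. This is a sketch of the access pattern only, not the paper's implementation:

```python
# Illustrative sketch of Tier 3 "cold memory": specification documents
# live on disk and are searched on demand, rather than being loaded
# into every prompt. Mimics only the search-then-fetch access pattern.

from pathlib import Path

class SpecStore:
    def __init__(self, spec_dir: str):
        self.spec_dir = Path(spec_dir)

    def search(self, query: str) -> list[str]:
        """Return names of spec documents mentioning the query term."""
        hits = []
        for spec in self.spec_dir.glob("*.md"):
            if query.lower() in spec.read_text().lower():
                hits.append(spec.name)
        return hits

    def fetch(self, name: str) -> str:
        """Load a single spec into context only when an agent asks for it."""
        return (self.spec_dir / name).read_text()
```

The tradeoff is latency for context budget: an agent spends a tool call to find the right document, but the session never carries 34 specifications it doesn't need.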

The idea of tiered context isn't entirely new. Birgitta Böckeler's overview of context engineering for coding agents at ThoughtWorks describes a similar hot/cold distinction emerging across the tooling ecosystem. Google's Conductor for the Gemini CLI addresses a related problem with persistent Markdown. But Vasilopoulos developed his framework independently and, more to the point, actually used it to build something substantial.

The numbers

Here's where it gets interesting, and where I start wanting more data than the paper provides. Across 283 development sessions, the infrastructure amplified 2,801 human prompts into 1,197 agent invocations, which generated 16,522 agent turns. That's a roughly 6x amplification from prompt to agent activity, which sounds impressive but is hard to evaluate without knowing what fraction of those turns were useful versus the agent spinning its wheels.

The paper doesn't tell us. It relies on observational case studies rather than controlled experiments, and Vasilopoulos acknowledges this. Four case studies illustrate how codified context prevented specific failure modes: a persistence system that maintained consistency across 74 sessions, captured debugging experience that prevented repeated trial-and-error in 10+ subsequent sessions, and so on. Compelling stories, but n=1 on a single project built by the person proposing the framework.

What the case studies do show

The save-system example is concrete enough to be useful. A two-tier save architecture (disk for permanent data, memory for temporary state) caused subtle data corruption when agents wrote to the wrong tier. Temporary buffs persisting permanently. Gold rewards vanishing on restart. A 283-line specification document eliminated the problem across subsequent sessions. The cost: writing 283 lines once. The alternative: diagnosing the same class of bug every few sessions, forever.
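The bug class is easy to picture. In effect, the 283-line spec pins down a classification like the one below; the field names here are invented for illustration, and the fix is that unclassified fields fail loudly instead of being silently routed to whichever tier the agent guesses:

```python
# Minimal sketch of the failure mode the spec document fixed: writes
# routed to the wrong tier of a two-tier save system. Field names are
# hypothetical, not from the paper.

PERMANENT_FIELDS = {"gold", "level"}   # must survive restarts -> disk tier
TEMPORARY_FIELDS = {"buff", "combo"}   # session-only state -> memory tier

def save_tier(field: str) -> str:
    """Route a field to its tier; raise on unclassified fields rather
    than silently guessing, which is what caused the corruption."""
    if field in PERMANENT_FIELDS:
        return "disk"
    if field in TEMPORARY_FIELDS:
        return "memory"
    raise ValueError(f"field {field!r} not classified in the save spec")
```

Writing a temporary buff through the disk tier gives you buffs that persist permanently; writing gold through the memory tier gives you rewards that vanish on restart. Both are exactly the symptoms the paper reports.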

And there's the stale-context failure, which is the honest part of the paper. On at least two occasions, outdated specifications caused agents to generate code that conflicted with recent refactors. The agent's output looked syntactically correct; the errors only surfaced during testing. Documentation as infrastructure means documentation failures are infrastructure failures. That's the tradeoff.

Who is this actually for?

The author's background in chemistry rather than software engineering is either the most interesting or most concerning detail, depending on your perspective. The paper positions this as a test case for domain experts building software beyond their primary expertise. A chemist using AI to build a distributed game system. The companion repository includes factory agents that generate tailored context artifacts for new projects: ask three questions, get a constitution.

I'm genuinely uncertain what to make of this. On one hand, it's a real system built by a real person who shipped actual code. The GitHub repo lists Claude itself as a contributor, which is either charming or alarming. On the other hand, 108,000 lines of C# written by AI agents guided by a non-engineer, verified primarily by the same non-engineer, with no external code review mentioned in the paper. How much of that code is good? The paper doesn't say, and that's a gap.

But the framework itself, the three-tier context architecture, doesn't depend on the quality of the underlying project. The pattern would work the same way for a team of experienced engineers. It is essentially applying the principle that Sean Grove articulated at AI Engineer 2025: specs are becoming the real source code. The actual Python and C# files are compilation artifacts.

What's next

Vasilopoulos says he's applying the framework to a drug discovery project as an initial test of cross-domain transferability. The companion repo is MIT-licensed with quickstart tooling. As of this writing, it has zero stars on GitHub, which tells you something about how early this particular contribution is.

The broader trajectory is clear. Context engineering is becoming a discipline with its own taxonomy, drawn from over 1,400 papers according to one recent survey. The question isn't whether AI coding agents need structured external memory. They do. The question is whether the answer looks like 26,000 lines of handwritten specifications, or whether the agents themselves will eventually learn to build and maintain their own context infrastructure. Vasilopoulos's paper is evidence for the first approach. We're still waiting on evidence for the second.

Tags: AI coding agents, context engineering, Claude Code, AGENTS.md, software engineering, LLM memory, developer tools, multi-agent systems
Oliver Senti

Senior AI Editor

Former software engineer turned tech writer, Oliver has spent the last five years tracking the AI landscape. He brings a practitioner's eye to the hype cycles and genuine innovations defining the field, helping readers separate signal from noise.

