A team from MIT dropped a research paper on New Year's Eve with a simple premise: what if language models could treat their input as a variable instead of reading it directly?
The idea, called Recursive Language Models (RLMs), has been floating around since Alex Zhang posted a blog version in October. But the full paper, co-authored with Tim Kraska and Omar Khattab, runs 33 pages with appendix and includes benchmark results that are genuinely surprising.
The setup
Here's what actually happens. You give an RLM a query and some context. Instead of stuffing everything into the model's context window, the context gets stored as a Python variable in an external REPL. The model receives only the query, plus awareness that this variable exists.
Then it writes code.
It can peek at slices of the string. Run regex searches. Break things into chunks. And critically, it can spawn recursive calls to smaller models, feeding them just the relevant snippets and aggregating the results.
From the outside, it looks like a normal API call. Text in, text out. But internally the model is orchestrating its own decomposition strategy.
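To make that concrete, here is a minimal, runnable sketch of the interface. None of these names come from the paper (`rlm_call` and the stubbed code string are hypothetical), and in the real system the model generates the code itself rather than having it hardcoded:

```python
def rlm_call(query: str, context: str) -> str:
    """From the outside: text in, text out, like a normal API call."""
    env = {"context": context}  # the long context lives as a REPL variable
    # The root model sees only the query plus awareness of the variable:
    prompt = (
        f"Answer: {query}\n"
        f"A variable `context` ({len(context)} chars) is available. "
        f"Write Python against it instead of reading it directly."
    )
    # In the real system, the model now emits code that runs in `env`.
    # Stub: pretend the model decided to peek at the first 100 characters.
    model_emitted_code = "result = context[:100]"
    exec(model_emitted_code, env)
    return env["result"]
```

The key design point is that `context` never enters the model's attention window; the model only ever sees the slices its own code chooses to surface.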
Why this matters
Context rot is one of those problems everyone knows about but nobody wants to benchmark directly. Your Claude Code session gets bloated, your ChatGPT conversation goes on too long, and suddenly the model seems... dumber. The attention mechanism doesn't fail catastrophically. It degrades.
The standard fix is either bigger context windows (expensive, still degrades) or RAG (lossy, requires upfront indexing). RLMs take a different approach: never let the model see the full context in the first place. Let it navigate programmatically.
The benchmarks
The paper focuses on OOLONG, a long-context reasoning benchmark the authors got early access to, and a modified version of BrowseComp-Plus for document retrieval at scale.
On OOLONG with 132k token contexts, RLM using GPT-5-mini outperformed base GPT-5 by over 34 points, roughly doubling the correct answers, while maintaining comparable cost per query.
That's GPT-5-mini beating GPT-5. The smaller model, with the recursive scaffolding, outperforms the larger model reading the context directly.
At 263k tokens (near the context limit), the gap narrows but persists. RLM using GPT-5-mini outperformed GPT-5 by around 15 points, a 49% increase, and was cheaper per query on average.
The BrowseComp-Plus results are more dramatic. At 1,000 documents (10M+ tokens of context), only the RLM built on GPT-5 achieved and maintained perfect performance; base GPT-5 approaches showed a clear dropoff as document count increased.

What the model actually does
The GitHub repo includes a visualizer for tracing RLM trajectories. The patterns that emerge are pretty intuitive once you see them.
Peeking: The model starts by grabbing the first few thousand characters to understand the structure. Like a programmer would.
Grepping: Instead of semantic retrieval, it runs regex to narrow down relevant lines. No index required.
Partition and map: For semantic tasks, it chunks the context and farms out classification to recursive sub-calls, then aggregates. This is where the real scaling happens.
Long input, long output: For tasks like tracking git diffs, the model just writes code to process the sequence programmatically, one-shotting what would otherwise require tracking state across thousands of lines.
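The patterns above can be sketched as the kind of REPL code a root model might emit. Everything here is illustrative: `sub_llm` is a stub standing in for a recursive call to a smaller model, and the log-style context is synthetic.

```python
import re

def sub_llm(prompt: str) -> str:
    """Stub for a recursive sub-model call; a real RLM hits a model API."""
    return "yes" if "error" in prompt.lower() else "no"

# Synthetic long context: 200 log lines with an ERROR every 50th line.
context = "\n".join(
    f"line {i}: {'ERROR disk full' if i % 50 == 0 else 'ok'}"
    for i in range(200)
)

# Peeking: grab the head of the string to learn its structure.
head = context[:80]

# Grepping: regex to narrow down relevant lines, no index required.
error_lines = [l for l in context.splitlines() if re.search(r"ERROR", l)]

# Partition and map: chunk the context, farm out sub-calls, aggregate.
chunks = [context[i:i + 1000] for i in range(0, len(context), 1000)]
verdicts = [sub_llm(f"Scan this chunk:\n{c}") for c in chunks]
answer = "errors found" if "yes" in verdicts else "clean"
```

The partition-and-map step is where the scaling comes from: each sub-call sees only one chunk, so no single context window ever holds the full input.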
Prime Intellect goes deeper
This is where things get interesting. Prime Intellect published their own analysis claiming RLMs will be "the paradigm of 2026." They've already integrated it into their training framework.
Their version adds some modifications: sub-LLMs can be parallelized, tools are only available to sub-LLMs (keeping the main model's context clean), and answers must be returned via an environment variable that allows iterative refinement.
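Those three modifications can be sketched roughly as follows. This is an assumption-laden illustration, not Prime Intellect's implementation: `sub_llm`, `run_parallel`, and the `FINAL_ANSWER` slot are all hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor

def sub_llm(prompt: str) -> str:
    """Stub sub-model; in this design, only sub-LLMs get tool access."""
    return f"summary of {len(prompt)} chars"

def run_parallel(prompts):
    """Modification 1: sub-LLM calls fan out concurrently."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(sub_llm, prompts))

# Modification 3: the answer lives in an environment variable the root
# model can overwrite across rounds, allowing iterative refinement.
env = {"FINAL_ANSWER": None}

for round_prompts in [["chunk A", "chunk B"], ["refine with feedback"]]:
    results = run_parallel(round_prompts)
    env["FINAL_ANSWER"] = " | ".join(results)  # later rounds replace earlier

answer = env["FINAL_ANSWER"]
```

Keeping tools out of the root model (modification 2) is implicit here: the root only orchestrates and writes to the answer slot, so its own context stays clean.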
They believe teaching models to manage their own context end-to-end through reinforcement learning will be the next major breakthrough, enabling agents to solve long-horizon tasks spanning weeks to months.
Their ablations show mixed results across different environments. RLMs help significantly for long-context retrieval and tool-heavy tasks. They hurt for math, probably because the scaffolding overhead outweighs any benefit when the task doesn't need context management. Training directly on the RLM scaffold should fix this, they argue.
The caveats
Speed is a problem. The paper doesn't optimize for latency. Each recursive call blocks, there's no prefix caching, and complex queries can take minutes. The authors acknowledge this explicitly.
The benchmarks are also narrow. OOLONG and BrowseComp-Plus test specific capabilities. Whether this transfers to, say, extended coding sessions or long document drafting is an open question.
And there's the training gap. Current results use prompting alone. Prime Intellect observed that RLM scaffolding doesn't necessarily improve baselines on all benchmarks, and hypothesizes the true potential will be unleashed after RL training. Models need to learn this paradigm, not just be dropped into it.
The bet
Omar Khattab (yes, the DSPy guy) put it directly on X: "Most people misunderstand RLMs to be about LLMs invoking themselves. The deeper insight is LLMs interacting with their own prompts as objects."
That framing matters. This isn't just another agent scaffold. It's a proposal for how models should relate to information itself. Context becomes data to manipulate, not just input to consume.
Whether that pays off depends on training. If models can learn good decomposition strategies through RL, the ceiling could be high. If the overhead consistently outweighs the benefits for real tasks, this joins the pile of clever ideas that didn't scale.
Prime Intellect is betting on the former. They're planning to train models specifically for RLM environments, test arbitrary recursion depths, and push toward multi-week agent trajectories.
The code is up. The benchmarks are reproducible. We'll know soon enough whether this is 2026's paradigm or 2025's footnote.