AI Research

Microsoft Benchmark Finds Top LLMs Corrupt 25% of Documents Over 20 Edits

DELEGATE-52 tested 19 frontier and mid-tier models. Even the best ones quietly mangle a quarter of a document.

Oliver Senti
Senior AI Editor
May 12, 2026 · 4 min read
[Illustration: stylized document pages with fragmented text and red corruption marks, evoking AI editing errors over multiple revisions]

Microsoft Research dropped a benchmark in April that should make anyone running an "agentic" document workflow nervous. The preprint, from Philippe Laban, Tobias Schnabel, and Jennifer Neville, tests 19 frontier and mid-tier models across 52 professional domains and finds that even the best of them quietly mangle about a quarter of a document over a long editing session.

It's called DELEGATE-52. The premise is simple to the point of being almost embarrassing for the industry.

How the test works

Researchers built 310 work environments spanning 52 domains, from Python source code and database schemas to crystallography files, sheet music, family trees, and accounting ledgers. Each document runs roughly 15,000 tokens. A model receives an editing instruction (add a row, transpose a measure, change a config block), then a second instruction that should undo the first. A faithful model returns the document close to its original state. An unfaithful one drifts.

Scoring is straightforward: a domain-specific similarity check between the seed file and the file you get back after k forward-and-back round trips. Twenty interactions is the headline number.
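
The paper's own harness isn't reproduced here, but the protocol is small enough to sketch. A minimal version in Python, where model_edit and similarity are hypothetical stand-ins for the LLM call and the domain-specific scorer:

```python
def round_trip_fidelity(seed_doc, instruction_pairs, model_edit, similarity):
    """Apply forward/undo instruction pairs, scoring drift after each round trip.

    model_edit(doc, instruction) -> edited doc   # stand-in for the LLM call
    similarity(a, b) -> float in [0, 1]          # domain-specific scorer
    Both names are hypothetical; the benchmark's real interface may differ.
    """
    doc = seed_doc
    scores = []
    for forward, undo in instruction_pairs:
        doc = model_edit(doc, forward)   # "add a row", "transpose a measure", ...
        doc = model_edit(doc, undo)      # instruction that should restore the doc
        scores.append(similarity(seed_doc, doc))
    return scores                        # a faithful model keeps these near 1.0
```

Twenty interactions, counting each instruction as one interaction, works out to ten forward-and-back pairs in a sketch like this.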

Code lives on GitHub, and the redistributable subset of 234 environments is on Hugging Face.

The damage

After 20 interactions, frontier models (Gemini 3.1 Pro, Claude 4.6 Opus, GPT-5.4) preserve roughly 71 to 81 percent of the document on average. Call it a 25 percent loss. Across all 19 models the average lands closer to 50 percent. Half. Of a document. Gone or wrong.
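
Back-of-envelope, assuming purely for illustration that the loss were spread evenly across the session:

```python
# Hypothetical uniform-decay model: the per-interaction retention r
# implied by 75% survival after 20 interactions.
r = 0.75 ** (1 / 20)
print(f"{r:.4f}")                  # ~0.9857: every edit must keep ~98.6% of the doc
print(f"{0.50 ** (1 / 20):.4f}")   # ~0.9659 for the all-model average of ~50%
```

Under that assumption, every single edit would need to be about 98.6 percent faithful just to match the frontier numbers. The assumption turns out to be wrong.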

The losses aren't smooth, which is the more uncomfortable finding. Most edits are clean. Then every few turns a model has a bad day and a single round trip strips out 10 to 30 percentage points. Those sparse blowups account for the bulk of total degradation, which means the average numbers above hide a distribution that is much worse to live with.
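
A toy simulation shows why that matters. The two processes below both lose about 25 points on average over 20 turns; the parameters are invented to illustrate the shape, not taken from the paper. The second, spiky one matches the paper's description, and its average hides a much worse tail:

```python
import random

random.seed(0)

def smooth(turns=20):
    """Uniform decay: ~1.4% of the document lost on every interaction."""
    f = 1.0
    for _ in range(turns):
        f *= 0.986
    return f

def spiky(turns=20, p_blowup=0.0625):
    """Mostly clean edits; rare turns strip 10-30 points at once.
    p_blowup is tuned so the mean loss also lands near 25%."""
    f = 1.0
    for _ in range(turns):
        if random.random() < p_blowup:
            f -= random.uniform(0.10, 0.30)
    return max(f, 0.0)

runs = sorted(spiky() for _ in range(10_000))
print(f"smooth final fidelity: {smooth():.2f}")                # ~0.75, every session
print(f"spiky mean fidelity:   {sum(runs) / len(runs):.2f}")   # ~0.75 on average...
print(f"spiky 10th percentile: {runs[len(runs) // 10]:.2f}")   # ...but bad sessions are far worse
```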

Researchers define "ready to delegate" as 98 percent reconstruction or higher. Out of 52 domains, the best performer, Gemini 3.1 Pro, clears that bar in 11. Python is the standout: 17 of 19 models hit lossless manipulation there.

Where it falls apart

Domains where models hold up are predictable in hindsight. Programmatic. Structured. Repetitive. Python, database schemas, molecule files, chess notation. The wrecking zones are the human-shaped stuff: recipes, fiction, sheet music, earnings statements. Anything with rich vocabulary that does not repeat itself.

Which is a problem, because the pitch for these models has spent the past year insisting they're ready to handle exactly that kind of work.

Agents make it worse

Microsoft's team also tried the obvious counter: hand the models proper tools. File read, file write, code execution, the agentic harness everyone is shipping. Across GPT-5.4, 5.2, 5.1, and 4.1, giving the model real tools added an average of six percentage points of degradation. Not improvement. Degradation.

This is the result that's hardest to square with the current product cycle. Every major lab is pushing agentic workflows as the next paradigm. Microsoft's own paper, on Microsoft's own benchmark, says the tooling makes the documents worse.

Failure modes split cleanly by tier. Weaker models delete chunks of the document outright. Frontier models, the ones whose output looks fine on review, distort what they keep. That's the silent corruption in the title. A human skimming the result might catch a missing section. They are less likely to catch that a row in an accounting ledger has different numbers than it did three turns ago.
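
The upside is that this particular failure is mechanically checkable, at least for structured files. A sketch, assuming CSV ledgers and placeholder file names: diff every cell between the pre-session and post-session snapshots instead of trusting a read-through.

```python
import csv

def cell_diff(path_before, path_after):
    """Report every cell whose value changed between two CSV snapshots.

    A missing section is easy to spot by eye; a single mutated number
    in row 847 is not. This is the check a skim doesn't do.
    """
    with open(path_before, newline="") as f:
        before = list(csv.reader(f))
    with open(path_after, newline="") as f:
        after = list(csv.reader(f))
    if len(before) != len(after):
        print(f"row count changed: {len(before)} -> {len(after)}")
    for r, (old_row, new_row) in enumerate(zip(before, after)):
        for c, (old, new) in enumerate(zip(old_row, new_row)):
            if old != new:
                print(f"row {r}, col {c}: {old!r} -> {new!r}")

cell_diff("ledger_before.csv", "ledger_after.csv")  # placeholder paths
```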

What's missing

Some gaps in the work. It doesn't test cross-session corruption, which is probably where most real delegation happens. It doesn't break down failures by edit type in much detail (sorting versus inserting versus rewriting), though there's signal in the appendix. And it doesn't try the more elaborate agent harnesses that companies actually deploy in production, just a baseline one.

Still, the headline finding is sturdy: long workflows compound errors, agents don't help, and Python being the only "ready" domain isn't the comfort it sounds like.

DELEGATE-52 was released April 17. The dataset and evaluation code are public, so the next round of model releases will get measured against it. The interesting question is whether whatever ships next moves the 11-of-52 number meaningfully, or whether long-horizon document fidelity is a problem the current architecture just can't fix.

Tags: AI, LLMs, Microsoft Research, DELEGATE-52, benchmarks, agentic AI, Gemini 3.1 Pro, Claude 4.6, GPT-5.4, document editing
Oliver Senti

Senior AI Editor

Former software engineer turned tech writer, Oliver has spent the last five years tracking the AI landscape. He brings a practitioner's eye to the hype cycles and genuine innovations defining the field, helping readers separate signal from noise.

