Coding Assistants

Alibaba's New Benchmark Shows AI Coding Agents Can't Maintain Their Own Code

75% of AI models break working code during long-term maintenance. Only Claude Opus keeps regressions in check.

Oliver Senti
Senior AI Editor
March 18, 2026 · 5 min read
[Image: a fractured codebase as stacked blocks slowly toppling, symbolizing regressions accumulating in AI-maintained software]

Researchers from Sun Yat-sen University and Alibaba Group published a paper on March 4 that asks a question the AI coding hype cycle has been ducking: sure, your agent can fix a bug, but can it maintain a codebase for eight months without setting it on fire?

The answer, for most models, is no.

SWE-CI is a new benchmark that tested 18 models from 8 providers across 100 real Python repositories. Each task spans an average of 233 days and 71 consecutive commits of actual development history. The agent doesn't patch one issue and walk away. It resolves dozens of rounds of changes, each building on the last, each capable of breaking what came before. The results: most models had a zero-regression rate below 0.25. That means in three out of four maintenance cycles, the AI agent introduced regressions into previously working code.
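The zero-regression rate is simple to state: the fraction of maintenance cycles an agent completes without breaking previously working code. A minimal sketch (the function name and toy data here are illustrative, not from the benchmark's code):

```python
# Hypothetical sketch of the zero-regression rate: the fraction of
# maintenance cycles in which the agent introduced no regression.

def zero_regression_rate(cycles: list[bool]) -> float:
    """cycles[i] is True if cycle i completed with zero regressions."""
    return sum(cycles) / len(cycles)

# Example: 2 clean cycles out of 8 gives a rate of 0.25, the threshold
# most models in the study fell below.
history = [True, False, False, True, False, False, False, False]
print(zero_regression_rate(history))  # 0.25
```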

The snapshot problem

Here's the thing about benchmarks like HumanEval and SWE-bench: they're snapshot tests. Hand the model a bug, let it write a patch, run the tests, done. An agent that hard-codes a brittle fix and one that writes clean, extensible code both pass the same test suite. Their difference in maintainability is invisible until the codebase needs to evolve.

And codebases always need to evolve. Classic software engineering literature puts maintenance at 60% to 80% of total lifecycle costs. That's the work AI coding agents are supposedly going to automate. SWE-CI is the first benchmark that actually tries to measure whether they can.

The benchmark draws from 68 distinct open-source repositories, filtered for at least three years of active maintenance, 500+ stars, and permissive licenses. Each base-to-target transition involves at least 500 lines of modified source code, excluding test files. These aren't trivial patches.

Who broke what

Only two models in the Claude Opus series (4.5 and 4.6) exceeded a 50% zero-regression rate. Every other model tested fell below 25%. Let that gap sink in for a moment. The paper found that within the same provider family, newer models always score higher, and models released after 2026 show larger gains than their predecessors. But even the improving trend line doesn't close the gap. GLM-5 gets a mention as a strong performer, and the paper notes that some providers (MiniMax, DeepSeek, GPT) do better when long-term stability is weighted more heavily in the EvoScore metric. Others, like Kimi and GLM, perform better under short-term weighting. Claude and Qwen are stable across both.

The eight providers tested were Claude, GPT, DeepSeek, Qwen, MiniMax, Kimi, GLM, and Doubao. The whole experiment consumed over 10 billion tokens.

EvoScore and how agents game benchmarks

The researchers introduce a metric called EvoScore that's designed to catch exactly the failure mode you'd expect from models trained on snapshot benchmarks. The formula weights later iterations more heavily than earlier ones. An agent that performs well on commits 1 through 20 but degrades on commits 50 through 71 gets punished. One that maintains steady performance throughout gets rewarded.
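The paper's exact formula isn't reproduced here, but the weighting idea can be sketched with a simple linearly increasing weight per iteration (the function name and weighting scheme below are illustrative assumptions, not SWE-CI's actual definition):

```python
# Illustrative late-weighted score: iteration i gets weight i+1, so
# stability in later iterations counts for more. This is a stand-in for
# EvoScore's weighting idea, not the paper's exact formula.

def evo_score(per_iteration_scores: list[float]) -> float:
    """Weighted mean where later iterations carry heavier weights."""
    weights = range(1, len(per_iteration_scores) + 1)
    total = sum(w * s for w, s in zip(weights, per_iteration_scores))
    return total / sum(weights)

fades    = [1.0] * 5 + [0.6] * 5   # strong early, degrades late
improves = [0.6] * 5 + [1.0] * 5   # same unweighted mean, stable late
# Both average 0.8 unweighted, yet the late-weighted score separates them.
print(evo_score(fades) < evo_score(improves))  # True
```

The point is that two agents with identical average performance get different scores once late-stage decay is penalized.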

This matters because the failure pattern is predictable. A fix that patches the immediate test failure but introduces a subtle dependency conflict won't show up until three commits later. By then the agent has moved on, and the regression compounds. SWE-CI's dual-agent protocol (an Architect agent that identifies requirements, a Programmer agent that implements them) mirrors how professional teams actually work in CI loops. The Architect can only issue five requirements per iteration, forcing incremental development rather than one-shot rewrites.
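The dual-agent loop described above can be sketched in a few lines. The five-requirement cap comes from the paper; everything else (function names, the toy "repo" as a list of applied changes) is a hypothetical stand-in, not SWE-CI's actual API:

```python
# Hedged sketch of one Architect/Programmer iteration. The cap of five
# requirements per iteration is from the paper; the rest is illustrative.

MAX_REQUIREMENTS_PER_ITERATION = 5

def maintenance_iteration(plan, implement, repo_state):
    """One CI round: the Architect's plan is capped at five requirements,
    and the Programmer applies each one incrementally."""
    requirements = plan(repo_state)[:MAX_REQUIREMENTS_PER_ITERATION]
    for req in requirements:
        repo_state = implement(repo_state, req)
    return repo_state

# Toy stand-ins: the Architect wants 8 changes, but only 5 are allowed,
# forcing the remainder into a later iteration.
plan = lambda state: [f"req-{i}" for i in range(8)]
implement = lambda state, req: state + [req]
final = maintenance_iteration(plan, implement, [])
print(len(final))  # 5
```

The cap is what forces incremental development: a sweeping rewrite has to be broken into rounds, each of which must leave the test suite passing.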

What it doesn't tell you

Python only. That's the biggest limitation, and the researchers acknowledge it. Maintenance patterns in TypeScript, Go, or Rust are different enough that these results might not transfer cleanly. The benchmark is also testing raw model capabilities through a specific agent framework (iFlow CLI), not the agentic scaffolding that tools like Cursor or Claude Code layer on top. A Hacker News commenter pointed out that newer GPT versions are only available through OpenAI's proprietary Codex CLI, which means a true apples-to-apples model comparison gets murky fast.

And I'm a bit skeptical of how neatly the results map onto real development workflows. The Architect-Programmer split is a reasonable abstraction of CI, but actual software maintenance involves reading Slack threads, deciphering ambiguous Jira tickets, and making judgment calls about whether to refactor now or later. No benchmark captures that.

Still

SWE-CI is filling a real gap. The disconnect between benchmark performance and real-world utility is something practitioners have complained about for a while now. A model scoring 70%+ on SWE-bench can still produce code that becomes a maintenance burden within weeks. The dataset is public under CC BY 4.0, and the researchers are building a broader evaluation framework called Otter around it.

The narrative around AI coding has been "can it write code?" for the last two years. SWE-CI reframes the question as "can it keep code working?" For most models, the honest answer is: not yet. If your team is planning to hand ongoing codebase maintenance to an AI agent, these numbers suggest you'll spend plenty of time cleaning up after it.

Tags: AI coding, SWE-CI, benchmark, code maintenance, Alibaba, Claude, software engineering, technical debt, LLM
Oliver Senti

Senior AI Editor

Former software engineer turned tech writer, Oliver has spent the last five years tracking the AI landscape. He brings a practitioner's eye to the hype cycles and genuine innovations defining the field, helping readers separate signal from noise.

