Artificial Analysis published its first coding agent benchmark ranking complete model-and-harness combinations rather than the underlying language models alone. Cursor CLI paired with Claude Opus 4.7 topped the leaderboard at 61, one point ahead of OpenAI's Codex with GPT-5.5 and Anthropic's own Claude Code running the same Opus model. Gemini CLI with Gemini 3.1 Pro came in dead last at 43.
A composite, sort of
The index averages pass@1 scores across three benchmarks: SWE-Bench-Pro-Hard-AA (150 implementation tasks pulled from Scale AI's harder SWE-Bench Pro), Terminal-Bench v2 (84 agentic shell tasks from the Laude Institute, with five excluded for environment compatibility), and SWE-Atlas-QnA (124 repository Q&A questions). Three runs, averaged. Simple arithmetic mean. No weighting.
That last detail matters. A composite that treats a single-file Q&A as equivalent to a multi-step terminal workflow is going to flatten some real differences, and Artificial Analysis admits as much on its methodology page: two agents at the same index value can still be pulling their points from very different places.
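To make that concrete, here's a minimal sketch of the index arithmetic. The unweighted three-way mean is the only part taken from the methodology page; the function name and the per-benchmark scores are invented for illustration:

```python
# Unweighted mean of three pass@1 scores, per the published methodology.
# The scores below are made up to show the flattening effect.

def composite_index(swe_pro_hard: float, terminal_bench: float, qna: float) -> float:
    """Simple arithmetic mean, no per-benchmark weighting."""
    return (swe_pro_hard + terminal_bench + qna) / 3

# Agent A: strong on implementation tasks, weak on repo Q&A.
agent_a = composite_index(70, 65, 45)

# Agent B: the mirror image.
agent_b = composite_index(45, 65, 70)

print(agent_a, agent_b)  # 60.0 60.0, same index, very different agents
```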
What does a one-point lead actually mean?
Cursor at 61, Codex and Claude Code at 60. Is that meaningful?
Probably not on its own. With 358 total tasks averaged across three runs, a single point could easily come down to a handful of coin flips on edge cases. What's more interesting is what the chart shows when you hold the model constant at Opus 4.7: Cursor's harness outperforms Anthropic's own. The site published a separate comparison for exactly this reason.
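A back-of-envelope calculation makes the coin-flip point concrete. Using the published task counts, and ignoring the three-run averaging (which only makes single-task swings smaller), one index point corresponds to just a few task outcomes on any one benchmark:

```python
# Rough arithmetic only: how many flipped tasks move the index one point?
# Task counts come from the benchmark descriptions; the simplification
# (one consistent flip across all three runs) is mine.

benchmarks = {
    "SWE-Bench-Pro-Hard-AA": 150,
    "Terminal-Bench v2": 84,
    "SWE-Atlas-QnA": 124,
}

for name, n_tasks in benchmarks.items():
    # Flipping one task moves that benchmark's pass rate by 100/n_tasks
    # points, and the unweighted three-way mean by a third of that.
    tasks_per_index_point = 3 * n_tasks / 100
    print(f"{name}: ~{tasks_per_index_point:.1f} tasks per index point")

# SWE-Bench-Pro-Hard-AA: ~4.5 tasks per index point
# Terminal-Bench v2: ~2.5 tasks per index point
# SWE-Atlas-QnA: ~3.7 tasks per index point
```

Two or three edge-case tasks landing differently on Terminal-Bench would erase the entire Cursor-Codex gap.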
If that holds up across more runs, it's awkward for Anthropic. The company makes the model. Cursor wraps it in a third-party shell and gets more out of it.
Composer 2 at seven cents
The real news might be at the bottom of the cost chart. Cursor CLI running its in-house Composer 2 model came in at $0.07 per task on average. Other combinations ran up to $0.76, a roughly 10x spread for benchmarks where Composer 2 lands below the frontier setups but well clear of dead-last Gemini. (I'd want the exact Composer 2 index score in print before getting too definitive about value-per-dollar, but the cost gap is the cost gap.)
Composer 2, per Cursor's own technical report, started as a continued-pretraining run on Kimi K2.5 followed by large-scale RL on agentic coding traces. It isn't a frontier model. It isn't trying to be. In an agentic loop where the model gets called dozens of times per task, frontier pricing compounds fast, and seven cents versus seventy-six cents per task changes the calculus entirely.
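The compounding is easy to see with a toy cost model. Everything below is assumed for illustration; the leaderboard doesn't publish call counts or token volumes, and the numbers are chosen only so the outputs land near the per-task endpoints quoted above:

```python
# Toy model of agentic-loop cost. Call counts, token volumes, and token
# prices are hypothetical; they're picked so the outputs land near the
# leaderboard's $0.07 and $0.76 per-task endpoints.

def cost_per_task(calls: int, tokens_per_call: int, usd_per_mtok: float) -> float:
    """Cost of one task at a flat blended rate per million tokens."""
    return calls * tokens_per_call * usd_per_mtok / 1_000_000

# Same hypothetical loop shape for both: 30 model calls, ~20k blended tokens each.
budget = cost_per_task(calls=30, tokens_per_call=20_000, usd_per_mtok=0.12)
frontier = cost_per_task(calls=30, tokens_per_call=20_000, usd_per_mtok=1.25)

print(f"budget model:   ${budget:.2f} per task")    # $0.07
print(f"frontier model: ${frontier:.2f} per task")  # $0.75

# At 100 tasks a day, that's roughly $7 versus $75. Nothing about the
# loop changed except the per-token price, multiplied 30 times over.
```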
Whether 60% of the quality at a tenth of the cost is the right tradeoff depends on what you're shipping. The benchmark won't tell you that.
Google in the basement
Gemini CLI with Gemini 3.1 Pro scored 43. The model itself is competitive on most general-purpose benchmarks, which makes the gap look like a harness problem rather than a weights problem.
Andreessen Horowitz's Martin Casado put it bluntly on X: "Google, fix the tooling." Hard to argue.
What it's not measuring
The pay-per-token cost figures here reflect API pricing, not what most developers actually pay. Anthropic, OpenAI, and Cursor all sell subscription plans that bundle agent usage at flat rates. The cost column on this leaderboard is useful for comparing efficiency across systems, less useful as a guide to what a given setup will add to your monthly bill.
And the suite is small. Three benchmarks. A few hundred tasks total. Artificial Analysis says coverage will grow over time, but for now this is a snapshot of a fast-moving target with a one-point margin at the top.
Where this goes
The interesting question isn't really who wins by a point. It's whether the harness gap closes when Google, or whoever else lands behind the leaders, decides to take its CLI tooling seriously. Artificial Analysis hasn't published a release schedule for index updates, so the next data point comes whenever they add benchmarks or evaluate new agents.