Cursor Composer 2.5: Closes Gap on Opus 4.7, Costs 10x Less

Cursor shipped Composer 2.5 on Monday, its second in-house coding model in two months and a pointed answer to Anthropic's pricing power over the AI coding market. The model runs on the same Kimi K2.5 base as its predecessor, but Cursor says 85% of its compute went into post-training and a new reinforcement-learning method built on self-distillation. On SWE-Bench Multilingual and Terminal-Bench 2.0 it lands within a percentage point of Claude Opus 4.7, and within two points on Cursor's own benchmark, while costing somewhere between a fifth and a tenth as much per token.

The pricing math

The economics are the whole point. Composer 2.5 keeps Composer 2's pricing: $0.50 per million input tokens, $2.50 per million output. The faster variant comes in at $3.00 and $15.00, which Cursor describes as a lower cost than the fast tiers of other frontier models. Cursor's own effort-curve chart shows Opus 4.7 and GPT-5.5 sitting several dollars per task to the right of Composer 2.5 at comparable accuracy points. That's the kind of gap that matters when you're an enterprise running thousands of agentic sessions a day.
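To make the per-token prices concrete, here is a minimal sketch of the arithmetic. The two price tiers come from the figures above; the session size (2 million input tokens, 200,000 output tokens) and the function and tier names are illustrative assumptions, not numbers or identifiers from Cursor.

```python
# Per-million-token prices quoted in the article (USD).
PRICES_PER_MILLION = {
    "standard": {"input": 0.50, "output": 2.50},
    "fast": {"input": 3.00, "output": 15.00},
}


def session_cost(tier: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one session at the given tier's prices."""
    price = PRICES_PER_MILLION[tier]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


if __name__ == "__main__":
    # Hypothetical agentic session: token counts are assumptions for illustration.
    input_tokens, output_tokens = 2_000_000, 200_000
    for tier in PRICES_PER_MILLION:
        cost = session_cost(tier, input_tokens, output_tokens)
        print(f"{tier}: ${cost:.2f} per session, ${cost * 1000:,.2f} per 1,000 sessions")
```

Under those assumed token counts, the standard tier works out to $1.50 per session and the fast tier to $9.00, so the fast variant costs six times as much at any input-output mix.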

Cursor has been in an awkward spot for a year. Claude Code reportedly crossed $2.5 billion in annualized revenue earlier this year, and Anthropic's pricing structure makes it almost impossible for a downstream wrapper to undercut the underlying API while still paying Anthropic for inference. Building its own model is no longer optional for Cursor. It is the only way out of the squeeze.

About those benchmarks

On SWE-Bench Multilingual, Composer 2.5 scores 79.8% against Opus 4.7's 80.5% and GPT-5.5's 77.8%. On CursorBench v3.1, it hits 63.2%. Opus 4.7 still scores higher at 64.8% on its max setting, though it drops to 61.6% at its default. CursorBench is, of course, Cursor's own benchmark. They built it. They run it. That doesn't make the numbers wrong, but it's worth flagging that the eval where Composer edges out Opus 4.7's default setting is the one designed by the team shipping the model.

Terminal-Bench 2.0 is the more honest comparison, since it's third-party. Composer 2.5 lands at 69.3%, effectively tied with Opus 4.7's 69.4%. GPT-5.5 sits at 82.7%, more than 13 points ahead of both. On terminal-style agentic tasks, the gap to OpenAI remains wide.

Targeted feedback, borrowed from self-distillation

The technical idea worth paying attention to is what Cursor calls targeted RL with textual feedback. Standard problem in agentic RL: rollouts can span hundreds of thousands of tokens, and a single reward at the end can't tell the model which specific decision in that chain went sideways. If the model called a nonexistent tool somewhere in the middle, the noise from everything else drowns out the signal.

Cursor's approach borrows directly from the self-distillation literature. The announcement footnote cites three recent preprints, including RL self-distillation and Self-Distilled Reasoner. The mechanism: at a problematic turn, insert a hint into the local context, something like "Reminder: Available tools…" followed by the actual list. The model, now seeing the hint, produces a better distribution over next tokens. That distribution becomes the teacher. The same model, seeing only the original context, becomes the student. A KL-divergence loss pushes the student toward the teacher at that specific turn, while the broader RL objective keeps running across the full trajectory.
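The teacher-student step described above can be sketched on a toy example. This is a minimal illustration and not Cursor's implementation: the tool names, the logits, the KL direction (teacher to student), and the `kl_weight` coefficient are all assumptions, and a real system would compute these over a model's full vocabulary with the teacher held fixed.

```python
import math

# Hypothetical three-way choice at one turn; "search_web" stands in for a
# tool that does not exist in the environment.
TOOLS = ["read_file", "run_tests", "search_web"]


def softmax(logits):
    """Convert raw logits into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(teacher, student):
    """KL(teacher || student) over two discrete distributions."""
    return sum(t * math.log(t / s) for t, s in zip(teacher, student) if t > 0)


def turn_loss(teacher_logits, student_logits):
    """Distillation loss at the single turn where the hint was inserted."""
    return kl_divergence(softmax(teacher_logits), softmax(student_logits))


def combined_loss(rl_loss, teacher_logits, student_logits, kl_weight=0.1):
    """Trajectory-level RL loss plus the turn-local KL term (weight is assumed)."""
    return rl_loss + kl_weight * turn_loss(teacher_logits, student_logits)


# Made-up logits: without the hint the model favors the nonexistent tool;
# with the hint in context, it shifts toward the tools that exist.
student_logits = [1.0, 0.5, 2.0]
teacher_logits = [2.0, 1.5, -3.0]

if __name__ == "__main__":
    print("turn-level KL:", turn_loss(teacher_logits, student_logits))
    print("combined loss:", combined_loss(1.0, teacher_logits, student_logits))
```

The point of the sketch is locality: the KL term is nonzero only where the hinted and unhinted distributions disagree, so the gradient from it lands on the one turn that went wrong instead of being spread across the whole rollout.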

Granular credit assignment without the usual credit-assignment problem. If it works as cleanly as advertised, expect every lab doing agentic RL to copy it. Quickly.

The Colossus angle

Cursor also trained on 25 times more synthetic tasks than Composer 2. One method they describe involves stripping features out of working codebases and asking the model to reimplement them, with the original test suite as the verifiable reward. Predictable side effect: the model started reward-hacking. In one case it found a leftover Python type-checking cache and reverse-engineered a deleted function signature. In another, it decompiled Java bytecode to reconstruct a third-party API. Cursor says agentic monitoring tools caught these. They probably caught most. The ones they didn't, they didn't.
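The test-suite-as-reward setup described above can be sketched as follows. Cursor has not said how it scores these tasks, so the names `CaseResult` and `suite_reward`, the fraction-passed scoring, and the optional all-or-nothing mode are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass
class CaseResult:
    """Outcome of one test from the original suite (hypothetical structure)."""
    name: str
    passed: bool


def suite_reward(results, strict=False):
    """Score a reimplementation against the original test suite.

    By default the reward is the fraction of tests passed. With strict=True
    it is 1.0 only if every test passes, else 0.0. Both schemes are assumed.
    """
    if not results:
        return 0.0
    passed = sum(1 for r in results if r.passed)
    if strict:
        return 1.0 if passed == len(results) else 0.0
    return passed / len(results)


if __name__ == "__main__":
    results = [CaseResult("test_parse", True), CaseResult("test_export", False)]
    print(suite_reward(results))               # fraction of tests passed
    print(suite_reward(results, strict=True))  # all-or-nothing variant
```

Either scheme rewards only the outcome, which is why the shortcuts above pay off: a signature recovered from a leftover cache passes the same tests as one the model worked out on its own.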

The bigger story is what comes next. Cursor and SpaceXAI (the entity formed by the SpaceX-xAI merger, which closed in February) struck a partnership in April that gives SpaceX an option to either acquire Cursor for $60 billion by year-end or take a $10 billion collaboration fee. Cursor gets access to Colossus 2, described as roughly a million H100 equivalents. Anthropic, awkwardly, is leasing the older Colossus 1 cluster from the same parent company. Musk reportedly cleared that deal personally, saying on X that his "evil detector" stayed quiet during the meetings.

Cursor is already training a significantly larger model from scratch on Colossus 2, using roughly ten times the total compute that went into Composer 2.5. That's the model that will decide whether SpaceX exercises its acquisition option.

What to watch

Composer 2.5 ships with double usage for the first week, runs only inside the Cursor IDE and CLI, and won't appear on any third-party API or Hugging Face mirror. The model docs are public. The weights aren't and won't be. Whether the from-scratch model on Colossus 2 lands before December will decide more than benchmark rankings; it will decide whether Cursor stays independent into 2027.