Cursor dropped a blog post about how they've been improving Bugbot, their automated code review agent. The headline numbers: resolution rate up from 52% to 70%, bugs flagged per PR up from 0.4 to 0.7. Do the math (0.4 × 52% ≈ 0.21 resolved bugs per PR before, 0.7 × 70% ≈ 0.49 after) and that's roughly 2.4x more resolved bugs per pull request.
The chart in their post is the interesting part. It's messy. Eleven versions plotted on two axes (resolution rate vs bugs per run), and the progression isn't a clean line. Version 5 regressed. Version 8 barely moved. Whatever they did between versions 9 and 11 was the real breakthrough.
The early approach was brute force
Before they had metrics to optimize against, they were doing what everyone does: vibes-based development. Run the thing, ask engineers if the output seemed better, ship if consensus was "sure, I guess."
Their initial architecture ran eight parallel passes over each diff, with the file ordering randomized each time. The idea being that if multiple passes flag the same issue, it's probably real. Majority voting filtered the noise. This worked well enough to launch publicly in summer 2025.
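Here's a rough sketch of that idea, just to make the shape concrete. This is my reconstruction, not Cursor's code: the eight passes and the majority vote come from the post, while the `run_review_pass` callable and the exact threshold are assumptions.

```python
import random
from collections import Counter
from typing import Callable

def review_diff(
    files: list[str],
    run_review_pass: Callable[[list[str]], list[str]],  # hypothetical: one LLM pass over the ordered files
    num_passes: int = 8,
) -> list[str]:
    """Keep only the issues that a majority of independent passes flag.

    Each pass sees the changed files in a different random order, so what
    the model happens to notice isn't tied to one ordering. Matching issues
    by exact text is a simplification; a real system would need fuzzier
    deduplication.
    """
    votes: Counter[str] = Counter()
    for _ in range(num_passes):
        shuffled = random.sample(files, len(files))  # randomize file ordering per pass
        for issue in run_review_pass(shuffled):
            votes[issue] += 1
    threshold = num_passes // 2 + 1  # strict majority
    return [issue for issue, count in votes.items() if count >= threshold]
```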
But they couldn't tell if they were actually getting better.
They built a metric by making the LLM grade itself
The resolution rate metric is clever. At PR merge time, they have an AI check whether each flagged bug was actually fixed in the final code. They spot-checked those verdicts against the PR authors themselves, and the LLM got it right nearly every time.
This gave them something to hill-climb on. Finally.
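The metric itself is simple to state. Here's a minimal sketch of its shape, assuming `merged_prs` yields each PR's flagged bugs plus the final merged diff, and `llm_judge` is a hypothetical call that asks a model whether the final code addresses a given bug (the post doesn't describe the actual implementation):

```python
def resolution_rate(merged_prs, llm_judge) -> float:
    """Fraction of flagged bugs that were actually fixed by merge time.

    `merged_prs`: iterable of (flagged_bugs, final_diff) pairs, one per merged PR.
    `llm_judge(bug, final_diff)`: hypothetical; returns True if the final code
    addresses the reported bug.
    """
    resolved = total = 0
    for flagged_bugs, final_diff in merged_prs:
        for bug in flagged_bugs:
            total += 1
            if llm_judge(bug, final_diff):
                resolved += 1
    return resolved / total if total else 0.0
```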
Most experiments failed
They ran 40 major experiments across models, prompts, validators, and context management. A lot of the changes regressed their metrics. The blog mentions this almost in passing, but it's worth sitting with: their qualitative instincts were often wrong. The changes that felt promising often weren't.
The agentic pivot
The biggest jump came when they switched to a fully agentic design last fall. Instead of a fixed batch of parallel passes, the agent reasons over the diff, calls tools, and decides where to dig deeper.
Here's the counterintuitive part. With the older system, they had to prompt the model to be cautious. Too many false positives otherwise. With the agentic approach, they had the opposite problem: the model was too cautious. They shifted to aggressive prompts that tell the agent to investigate every suspicious pattern.
Why don't aggressive prompts bring the false positives back? My guess: the agentic system gathers more context before committing to a verdict. It can investigate hunches and self-correct. The old parallel-pass system made snap judgments with limited information.
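For contrast, here's the loop shape I have in mind, written as a sketch. Everything here, the tool names, the system prompt, the `llm` interface, is invented for illustration; the post doesn't show Cursor's actual agent loop.

```python
def agentic_review(diff: str, llm, tools: dict) -> list[str]:
    """Investigate-then-report loop instead of fixed parallel passes.

    `llm(messages)` is a stand-in that returns either a tool call or a final
    report; `tools` maps names like "read_file" or "grep" to callables. The
    point: the agent gathers context before committing to a verdict, so an
    aggressive "chase every suspicious pattern" prompt doesn't have to mean
    noisy final output. (Real loops also record the assistant's own tool-call
    messages; omitted here for brevity.)
    """
    messages = [
        {"role": "system", "content": "Investigate every suspicious pattern. "
                                      "Only report a bug once you have confirmed it."},
        {"role": "user", "content": diff},
    ]
    while True:
        step = llm(messages)
        if step.kind == "tool_call":
            result = tools[step.name](**step.args)
            messages.append({"role": "tool", "content": result})
        else:
            return step.bugs  # the confirmed issues, after however many tool calls it took
```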
What's coming
They're already running Bugbot on over two million PRs monthly. Customers include Discord, Airtable, Rippling, and Sierra AI. Cursor uses it on all of its own code.
Next steps: Bugbot Autofix, now in beta, spawns a cloud agent to actually fix the bugs it finds. They're also working on letting Bugbot run code to verify its own reports, and experimenting with an always-on version that scans codebases continuously rather than waiting for PRs.
The always-on version is interesting. Static analysis that runs continuously is table stakes. But LLM-powered review that runs continuously? The inference costs would've made that prohibitive a year ago.




