OpenAI and Thrive's Tax AI: A Self-Improving Codex Agent

OpenAI and Thrive Holdings spent six months building a tax-prep agent that drafts returns and then rewrites its own code when accountants correct it. The system, called Tax AI, ran this season across Crete Professional Alliance, a network of more than 30 accounting firms, and handled roughly 7,000 returns. OpenAI laid out how it works in a blog post this week.

The part that isn't the demo

Reading a W-2 is easy. Any decent extraction model does it. The hard returns are the ones with K-1s, rental schedules, scribbled client notes, last year's filing, and figures that have to reconcile across five documents that don't agree with each other.

So forget the accuracy number for a second. The loop is the story. Every return leaves a full trace: the source file, the field Tax AI pulled, the citation it pulled it from, how that mapped into the tax engine, whatever the accountant changed, and the value that actually got filed. When the same field gets corrected over and over, that pattern turns into a test case. Codex then gets handed a narrow job, with the trace, the evals, the repo, and production data, and proposes a fix inside a sandbox that can read production context but can't write to it. Anything ambiguous kicks back to human engineers.
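To make the loop concrete, here is a minimal sketch of how a per-field trace could be stored and how repeated corrections could be promoted to test cases. It is not OpenAI's implementation. Every name in it (FieldTrace, recurring_corrections, to_test_cases) and the min_count threshold are hypothetical, chosen only to mirror the trace elements listed above.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FieldTrace:
    """One field on one return. Attribute names are illustrative assumptions."""
    return_id: str
    source_file: str                 # the document the value came from
    field_name: str                  # the field that was pulled
    extracted_value: str             # what the agent drafted
    citation: str                    # where in the source it was found
    engine_field: str                # how it mapped into the tax engine
    corrected_value: Optional[str]   # None if the accountant left it alone
    filed_value: str                 # the value that was actually filed


def was_corrected(trace: FieldTrace) -> bool:
    """True when the accountant changed the drafted value."""
    return (trace.corrected_value is not None
            and trace.corrected_value != trace.extracted_value)


def recurring_corrections(traces, min_count=3):
    """Count corrections per field; keep fields corrected at least min_count times.

    The threshold of 3 is an assumption, not a figure from the writeup.
    """
    counts = Counter(t.field_name for t in traces if was_corrected(t))
    return {name: n for name, n in counts.items() if n >= min_count}


def to_test_cases(traces, field_names):
    """Turn corrected traces for the given fields into regression test cases."""
    return [
        {
            "source_file": t.source_file,
            "citation": t.citation,
            "field": t.field_name,
            "expected": t.filed_value,  # the filed value serves as ground truth
        }
        for t in traces
        if t.field_name in field_names and was_corrected(t)
    ]
```

In this sketch the filed value is treated as ground truth, so a proposed fix passes only if it reproduces what the accountant ultimately signed off on.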

That last detail matters more than the marketing. No agent is quietly editing live tax software at 2 a.m. here. What OpenAI describes is a constrained code-suggestion pipeline with people sitting on the gate.

About that 97 percent

The headline number, up to 97 percent accuracy, is the kind of phrasing that should make you go find the footnote. OpenAI's actual metric is field completion: what share of returns reach 75%, 90%, or 100% correct fields before a human touches them. At launch, a quarter of returns cleared the 75% bar. Six weeks later, 86% did.
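A threshold metric like this is easy to misread, so a small sketch helps. It assumes per-return accuracy means the share of filed fields the draft already had right before any human edit. That definition, the function names, and the toy data are assumptions for illustration, not taken from OpenAI's post.

```python
def field_accuracy(drafted: dict, filed: dict) -> float:
    """Fraction of filed fields whose drafted value matched the filed value."""
    if not filed:
        return 0.0
    correct = sum(1 for name, value in filed.items() if drafted.get(name) == value)
    return correct / len(filed)


def share_clearing(accuracies, threshold: float) -> float:
    """Share of returns whose field accuracy is at or above the threshold."""
    if not accuracies:
        return 0.0
    return sum(1 for a in accuracies if a >= threshold) / len(accuracies)


# Toy data, invented for illustration: four returns with four fields each.
filed = {"a": "1", "b": "2", "c": "3", "d": "4"}
drafts = [
    {"a": "1", "b": "2", "c": "3", "d": "4"},  # 4 of 4 correct
    {"a": "1", "b": "2", "c": "3", "d": "9"},  # 3 of 4 correct
    {"a": "1", "b": "2", "c": "9", "d": "9"},  # 2 of 4 correct
    {"a": "1", "b": "2", "c": "3", "d": "4"},  # 4 of 4 correct
]
accuracies = [field_accuracy(d, filed) for d in drafts]

for bar in (0.75, 0.90, 1.00):
    print(f"{bar:.0%} bar: {share_clearing(accuracies, bar):.0%} of returns")
```

Under these assumptions, the same batch of returns produces a different headline figure depending on which bar is quoted, which is why the threshold matters as much as the percentage.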

That's a real jump, but the 75% threshold is the easiest of the three to clear. The company says the 90% and 100% bars climbed faster, though it never puts a single clean number on where 100% actually landed. Up to 97% is a ceiling, not an average, and the writeup doesn't say how often returns hit it.

The efficiency figures are softer still: roughly a third less prep time, throughput up about 50%. Both are company-reported, measured against Crete's own earlier workflow, with no outside audit. Plausible for document-heavy grunt work. Just not independently checked.

Rental schedules were the grind

One concrete example from the writeup: rental properties. Getting that category to 90% precision and recall took about six weeks and, in OpenAI's own words, "substantial engineering oversight." Which is a quiet way of admitting the self-improving part still leaned hard on people for the genuinely messy stuff. The payoff, the company argues, is reuse, with the patterns built for rentals carrying into Schedule C and Schedule A.

Who owns it

Thrive owns the firms doing the filing. Per coverage of the deal, Thrive Holdings also keeps the resulting IP and products, with OpenAI said to have taken an equity stake in the company late last year. OpenAI president Greg Brockman, who posted on X that the product "meaningfully self-improved as accountants used it," is tied to Thrive too. A tidy arrangement, that: build the thing inside firms you control, keep the IP, hand your AI partner equity instead of a license fee.

OpenAI frames rentals, Schedule C, and Schedule A as the starting set, not the finish. There's no date attached to what comes next, and tax season is over, so the real test is whether that 86% figure survives when the document mix shifts next year.

Andrés Martínez