OpenAI Cuts Inference Costs in Half, Report Says

OpenAI engineers found a system optimization earlier this month that cut inference costs in half for the models it touched, according to The Information. The most striking detail: after applying it to logged-out ChatGPT traffic, the company reportedly ran that traffic on a couple hundred Nvidia GPUs.

What they actually did (nobody's saying)

The method hasn't been disclosed. No paper, no code, no benchmark. So take the headline number with the usual caution you'd apply to any leak that arrives the same week a company is trying to look good to investors.

The plausible candidates are the same ones every lab is poking at right now. Quantization, which trims the numerical precision of model weights to lighten the compute load. KV caching, which reuses attention state from earlier tokens instead of recomputing it. Batching more requests together. Routing the easy queries to cheaper models. Probably some blend of those rather than one clever trick.
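For readers who haven't seen quantization up close, here is a minimal sketch of the general idea, assuming the simplest symmetric int8 scheme. Nothing here reflects what OpenAI did; the function names and the sample weights are invented for illustration.

```python
# Hypothetical illustration of symmetric int8 quantization.
# Not OpenAI's method: the method has not been disclosed.

def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate floats from the stored integers."""
    return [v * scale for v in q]


weights = [0.12, -0.5, 0.33, 1.27, -1.0]  # made-up sample values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# The price of the smaller representation is a bounded rounding error.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

Each weight now fits in one byte instead of two or four, which is where the memory and bandwidth savings come from. The rounding error is the trade-off, and it is why quantized models get re-benchmarked for quality.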

One reply going around on X put it bluntly, asking whether OpenAI was quietly relabeling some form of quantization as an "optimization." Fair question. Halving cost by doubling throughput is the oldest framing in the book.
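The arithmetic behind that framing is short enough to sketch. The GPU price and throughput figures below are placeholders picked to make the math visible, not reported numbers.

```python
# Hypothetical numbers: neither the hourly price nor the throughput
# comes from the reporting.

def cost_per_million_tokens(gpu_hour_usd, tokens_per_second):
    """Cost to generate one million tokens on a single GPU."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hour_usd / tokens_per_hour * 1_000_000


baseline = cost_per_million_tokens(gpu_hour_usd=2.0, tokens_per_second=1000)
doubled = cost_per_million_tokens(gpu_hour_usd=2.0, tokens_per_second=2000)

# Same hardware bill, twice the tokens: cost per token is cut in half.
print(baseline, doubled, doubled / baseline)
```

Any change that doubles tokens per GPU-hour at the same hardware price shows up as a 50% cost cut, whatever the underlying technique is.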

The part that matters is the GPU count

Forget the "in half" framing for a second. The number worth sitting with is that logged-out ChatGPT, the version strangers hit without an account, reportedly ran on a couple hundred GPUs after the change. That's a small footprint for one of the most-visited products on the internet.

The catch is right there in the reporting: this only hit logged-out traffic so far. Whether it generalizes to logged-in users, to the Thinking and Deep Research modes that chew through far more compute, or to the API is the open question. Frontier reasoning models have been getting more expensive to run, not less, as token consumption per task climbs. An optimization that shines on lightweight anonymous queries may do much less for the workloads that actually hurt.

Why OpenAI needs this badly

OpenAI closed Q1 2026 with a 39% gross margin, up from 33% a year earlier, per financials reported by The Information and broken down by TradingKey. That improvement still left roughly $2.2 billion in gross profit against an $8.6 billion R&D bill in the quarter. The cost of revenue, mostly inference, ran about $3.5 billion.
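Those figures hang together, which is worth a quick check. The sketch below uses only the rounded numbers quoted above; the implied revenue is derived from them, not separately reported.

```python
# Figures in billions of USD, as quoted in the article (rounded).
gross_profit = 2.2
cost_of_revenue = 3.5
rd_spend = 8.6

# Revenue is not stated directly; it is implied by the two figures above.
implied_revenue = gross_profit + cost_of_revenue
implied_margin = gross_profit / implied_revenue

# How far gross profit falls short of the R&D bill in the quarter.
rd_gap = rd_spend - gross_profit

print(f"implied revenue: ${implied_revenue:.1f}B")
print(f"implied gross margin: {implied_margin:.1%}")
print(f"R&D not covered by gross profit: ${rd_gap:.1f}B")
```

The implied margin rounds to the reported 39%, and the gap against R&D comes to roughly $6.4 billion for the quarter.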

The company has floated a longer-term inference margin target in the low-to-mid 50s percent or higher, a figure analysts have openly questioned given how competitive the model race is. Cheaper inference buys room to move on three fronts: fatten the margin, raise ChatGPT usage limits, or cut API prices to keep developers from drifting to Anthropic.

That last one is the real pressure. Sacra notes OpenAI has been weighing token price cuts to fend off Anthropic, the kind of move that compresses margins right when inference spend is projected to climb from $8.4 billion in 2025 to $14.1 billion in 2026. So a structural cost win, if it holds at scale, lands at a useful moment.

OpenAI is also attacking the problem in hardware. The company unveiled a custom inference chip called Jalapeño with Broadcom in late June, claiming roughly 50% lower cost per token than current Nvidia GPUs. Two different 50% claims in two weeks is a lot of 50%. Neither has independent production numbers behind it yet.

No timeline has been given for rolling the software optimization beyond logged-out traffic. Watch OpenAI's next financial disclosure for whether the gross margin actually moves.

Oliver Senti
