OpenAI released GPT-5.2 on Thursday, less than two weeks after CEO Sam Altman declared an internal "code red" in response to Google's Gemini 3 eating into ChatGPT's market share. The model arrives in three tiers (Instant, Thinking, and Pro) and claims benchmark leads in coding and reasoning tasks. But early testers report a familiar problem: the model that actually delivers on those benchmarks is too slow to use for everyday work.
The speed tax
Matt Shumer, CEO of HyperWriteAI and one of the more reliable independent model testers, had access to GPT-5.2 for two weeks before release. His verdict lands somewhere between impressed and exasperated. "GPT-5.2 Pro is insanely better for deep reasoning," he wrote, "but it's slow, and every so often it will think forever and still fail."
That slowness isn't a minor complaint. Shumer reports barely using the Thinking mode at all, defaulting instead to Claude Opus 4.5 for quick questions and only reaching for GPT-5.2 Pro when he needs something genuinely thought through. The standard Thinking tier, meant to be the workhorse model, sits in what he calls "an awkward middle ground: slower than Opus but without the full reasoning benefits of Pro."
OpenAI's product lead Max Schwarzer claims GPT-5.2 Thinking responses contain 38% fewer errors than GPT-5.1's. That would be meaningful if users could tolerate the wait. Allie K. Miller, former AWS executive and AI entrepreneur, flagged a different annoyance: the model's "extreme" default formatting. In her testing, a simple question produced a response with 58 bullets and numbered points. OpenAI's models have always loved bullet points. This one apparently loves them more.
What the benchmarks actually say
OpenAI's benchmark claims deserve the usual scrutiny. The company reports GPT-5.2 Thinking hits 55.6% on SWE-Bench Pro, a newer coding evaluation that tests four programming languages and aims for better contamination resistance than older benchmarks. That's a real improvement over GPT-5.1's 50.7%, and it beats Gemini 3 Pro's 43.2% on the same test.
The catch: on SWE-Bench Verified, the more established benchmark, GPT-5.2 scores 80.0%. Anthropic's Claude Opus 4.5 still leads with 80.9%. The gap is small, but OpenAI didn't close it.
On reasoning tasks, the numbers look better. GPT-5.2 Pro hits 93.2% on GPQA Diamond, a graduate-level science benchmark. That essentially ties Gemini 3 Deep Think's 93.8%. And GPT-5.2 achieves a perfect 100% on AIME 2025 without tools, matching what Gemini 3 Pro requires code execution to accomplish.
OpenAI is also pushing its own GDPval benchmark, which measures performance on "well-specified knowledge work tasks" across 44 occupations. GPT-5.2 Thinking reportedly beats or ties industry professionals 70.9% of the time. But GDPval is OpenAI's evaluation, run on their infrastructure, with their methodology. Independent validation would be more persuasive than a press release.
The code generation question
Coding is where GPT-5.2 shows the most obvious improvement over its predecessors. The model writes more code without stopping mid-task, gathers context before implementing (rather than making assumptions and hitting walls), and handles larger codebases without losing track. In Shumer's testing with Codex CLI, he found GPT-5.2 gets things right on the first shot more often than anything else he's tried.
But there's still spatial reasoning weirdness. Shumer asked it to build a Three.js baseball field scene. The textures and lighting came out well. The object placement was wrong. This pattern shows up across visual generation tasks: the model handles some aspects of a problem excellently while failing at others that seem related. It's better at understanding images than at generating spatial arrangements.
For developers considering the switch from GPT-5.1, the 40% API price increase ($1.75 per million input tokens vs. $1.25) will factor into the decision. OpenAI argues greater token efficiency makes up for the higher cost: GPT-5.2 solves tasks in fewer turns. Whether that math works out depends on your specific use case and how much you value the improved context window (400,000 tokens, roughly triple GPT-4 Turbo's 128,000).
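The break-even arithmetic is simple enough to sketch. The input-token rates below are the ones quoted above; the output rates and per-task token counts are hypothetical placeholders, since OpenAI hasn't published a standard workload, so treat this as an illustration of the tradeoff rather than a real cost model:

```python
# Illustrative break-even math for the GPT-5.2 vs. GPT-5.1 price gap.
# Input rates are from the quoted pricing; token counts are hypothetical.

GPT51_INPUT_RATE = 1.25  # $ per million input tokens
GPT52_INPUT_RATE = 1.75  # $ per million input tokens

# For GPT-5.2 to match GPT-5.1 on input cost alone, it must finish the
# same task using at most this fraction of the tokens:
breakeven = GPT51_INPUT_RATE / GPT52_INPUT_RATE
print(f"break-even token ratio: {breakeven:.1%}")  # ~71.4%

def task_input_cost(input_tokens: int, rate_per_million: float) -> float:
    """Input-side cost of one task, in dollars."""
    return input_tokens / 1_000_000 * rate_per_million

# Hypothetical workload: GPT-5.1 needs 3 turns of 40k input tokens each;
# GPT-5.2's better context-gathering finishes in 2 turns.
old_cost = task_input_cost(3 * 40_000, GPT51_INPUT_RATE)  # $0.150
new_cost = task_input_cost(2 * 40_000, GPT52_INPUT_RATE)  # $0.140
print(f"GPT-5.1: ${old_cost:.3f}  GPT-5.2: ${new_cost:.3f}")
```

In other words, OpenAI's efficiency claim only pays off if GPT-5.2 reliably uses under about 71% of the tokens GPT-5.1 would; cut one turn in three and the new model comes out slightly ahead, but a task it solves in the same number of turns simply costs 40% more.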
The competitive context
GPT-5.2 arrives at an uncomfortable moment for OpenAI. Gemini 3 currently tops LMArena's leaderboard across most categories. Anthropic's Claude Opus 4.5 holds the coding crown on SWE-Bench Verified. The Information reported that Altman's "code red" memo came amid ChatGPT traffic decline and employee concerns about losing consumer market share.
OpenAI executives pushed back on the rushed-release narrative during their Thursday briefing. "This has been in the works for many, many months," said Fidji Simo, OpenAI's CEO of applications. That may be true. But the timing, less than a month after GPT-5.1 and right after Gemini 3's strong showing, suggests competitive pressure accelerated whatever timeline existed.
The model appears designed primarily for enterprise and developer use rather than casual chat. GDPval focuses on professional knowledge work. The benchmark emphasis on coding, math, and multi-step reasoning targets power users. Simo herself positioned GPT-5.2 as a tool to "unlock economic value for people" through spreadsheets, presentations, and code.
For users who just want a conversational AI that responds quickly, this may not be the update they wanted.
When it makes sense
The clearest use case for GPT-5.2 Pro is deep research and complex reasoning where getting it right matters more than getting it fast. Shumer describes Pro as having a different relationship with difficult problems: "It will spend far longer than previous Pro models working through a problem." He tested it on meal planning and found it understood not just his explicit constraint (no time to cook) but the implied constraints (simple shopping, minimal prep). That kind of contextual understanding stands out.
For everyday questions, Claude Opus 4.5 remains faster and more direct. For frontend UI work, Gemini 3 Pro produces better-looking results, even if it requires engineering cleanup afterward. GPT-5.2 doesn't dominate any single category; it improves incrementally across several while introducing latency that undercuts everyday use.
Katie Parrott, writing for Every, noted that while GPT-5.2 excels at instruction following, it's "less resourceful" than Claude Opus 4.5 in certain contexts, like deducing user location from email data. The model does what you tell it more reliably. It doesn't necessarily figure out what you need when you haven't told it.
OpenAI says there are no current plans to deprecate GPT-5.1 in the API. That's probably wise. Users who need reliability and cost efficiency over peak benchmark performance have reason to stick with the previous generation.
The FTC hasn't announced any review of OpenAI's market position, and with GPT-5.2's mixed reception, it's unlikely to be the model that draws regulatory attention. The AI race continues. Whether OpenAI is winning it depends on which metrics you care about.