LLMs & Foundation Models

Anthropic and OpenAI Ship Flagship Coding Models 20 Minutes Apart

Claude Opus 4.6 and GPT-5.3-Codex both dropped on February 5. The benchmark claims are flying.

Liza Chan
AI & Emerging Tech Correspondent
February 6, 2026 · 5 min read

Anthropic released Claude Opus 4.6, its most capable model to date, on Thursday, February 5, 2026, with a 1-million-token context window and new multi-agent coding features. OpenAI followed roughly 20 minutes later with GPT-5.3-Codex, a coding-focused model it says helped debug its own training runs. Both companies timed their launches for the same morning window, the kind of coincidence that never really is one.

The numbers, such as they are

The benchmark picture is a mess, and not accidentally so. On Terminal-Bench 2.0, which measures command-line coding capability, GPT-5.3-Codex posts 77.3% while Opus 4.6 scores 65.4%. That's a substantial gap. But on GDPval-AA, a benchmark for real-world knowledge work in finance and legal, Anthropic says Opus 4.6 leads GPT-5.2 by 144 Elo points. Note the comparison target: GPT-5.2, not the just-released 5.3-Codex. Anthropic was already benchmarking against the old model when OpenAI dropped a new one mid-announcement.
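What does a 144-point Elo gap actually mean? Read through the standard Elo formula with its conventional 400-point scaling (an assumption here, since neither lab has published how GDPval-AA converts results into Elo), it works out to roughly a 70% expected head-to-head win rate. A quick sketch in Python:

# What a 144-point Elo lead implies under the conventional Elo formula
# with 400-point scaling. Whether GDPval-AA uses exactly this scaling
# is an assumption, not something either lab has published.

def elo_win_probability(elo_diff: float) -> float:
    """Expected head-to-head win rate for the higher-rated side."""
    return 1.0 / (1.0 + 10 ** (-elo_diff / 400.0))

print(f"{elo_win_probability(144):.1%}")  # roughly 69.6%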

On SWE-Bench Pro, which tests real software engineering across four languages, the gap barely registers: 56.8% for GPT-5.3-Codex versus roughly comparable scores for Opus. Both companies cherry-picked which benchmarks to lead with, and neither published a full head-to-head on the same day's models. Standard practice, but worth noting.

The OSWorld computer-use benchmark is where things get interesting. GPT-5.3-Codex hits 64.7%, up from 38.2% for its predecessor, a jump that's harder to dismiss as incremental. Human performance on that benchmark sits around 72%. Meanwhile, Opus 4.6 scores 72.7% on the same eval according to Anthropic's figures, which would put it at near-human level. The two companies appear to be reporting different configurations of the same benchmark, which tells you everything about how benchmark comparisons work in practice.

What each model actually does differently

Opus 4.6's headline feature is agent teams in Claude Code: multiple AI agents splitting a project into parallel tasks, one handling frontend, another the API, a third running migrations. "Instead of one agent working through tasks sequentially, you can split the work across multiple agents," Anthropic says. No other major model offers this natively yet.
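Anthropic has not published a programmatic interface for agent teams; the feature lives inside Claude Code itself. But the underlying idea, fanning independent subtasks out to separate agents instead of working through them one at a time, looks roughly like the hypothetical sketch below. Every name in it is illustrative, and none of it is Anthropic's API.

# Purely illustrative: the "agent team" pattern of dispatching
# independent subtasks concurrently rather than sequentially.
# None of these names correspond to Claude Code's actual interface.
import asyncio

async def run_agent(role: str, task: str) -> str:
    # Stand-in for handing a task to a coding agent; a real system
    # would call a model here and return its work product.
    await asyncio.sleep(0.1)
    return f"[{role}] finished: {task}"

async def main() -> None:
    subtasks = {
        "frontend": "build the settings page",
        "api": "add the export endpoint",
        "migrations": "backfill the new column",
    }
    # Run all agents in parallel; total wall time is roughly the
    # slowest agent's, not the sum of all of them.
    results = await asyncio.gather(
        *(run_agent(role, task) for role, task in subtasks.items())
    )
    for line in results:
        print(line)

asyncio.run(main())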

GPT-5.3-Codex leans into something weirder. OpenAI claims early versions of the model helped build the final version, debugging its own training and managing deployment. "Our team was blown away by how much Codex was able to accelerate its own development," the company wrote. That's a sentence worth sitting with regardless of whether you find it exciting or unsettling. OpenAI also flagged GPT-5.3-Codex as the first model it classifies as "high" for cybersecurity capability under its Preparedness Framework, meaning it could meaningfully enable real-world cyber harm if misused. The company is delaying full API access and gating certain features accordingly.

The cybersecurity split

Anthropic's cybersecurity story goes in the opposite direction, and the contrast is striking enough to deserve its own section. Before launch, the company's red team gave Opus 4.6 access to standard vulnerability tools and let it hunt for bugs in open-source code. According to Axios, the model found over 500 previously unknown zero-day vulnerabilities, each validated by a human researcher or a member of Anthropic's team. Flaws in Ghostscript, OpenSC, and CGIF were among them.

"It's a race between defenders and attackers, and we want to put the tools in the hands of defenders as fast as possible," said Logan Graham, head of Anthropic's frontier red team, which sounds reasonable until you remember OpenAI is saying the quiet part out loud about the same capability: this stuff is dangerous. Anthropic acknowledged as much too, warning that new real-time detection tools "will create friction for legitimate research and some defensive work." Both companies arrived at the same conclusion from different angles. One led with the promise, the other with the risk.

So who wins?

Dan Shipper at Every, who tested both models for over a week before launch, published a comparison that cuts through the benchmark noise. Opus 4.6, he wrote, has "a higher ceiling as a model, but it also has higher variance." He described watching it build an iOS feature his team had struggled with for two months. It just built it. But Opus also sometimes falsely reports success.

GPT-5.3-Codex, by contrast, is "an excellent model, and its output is more reliable." Shipper's conclusion, and the only honest one: the models are converging. Opus picked up the precision that made Codex popular. Codex picked up the creative willingness that made Opus popular. Both labs are building toward the same thing.

Anthropic's Scott White told CNBC he thinks "we are now transitioning almost into vibe working," extending the vibe coding concept beyond just developers. Whether that framing survives contact with actual enterprise procurement teams is another question.

GPT-5.3-Codex is available now to paid ChatGPT users across the Codex app, CLI, IDE extensions, and web, with API access in the coming weeks. Opus 4.6 is live on claude.ai and via the API at $5 per million input tokens and $25 per million output tokens, unchanged from its predecessor. Both companies are running Super Bowl ads on Sunday.
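To put that pricing in concrete terms: a single request that sends 100,000 tokens of context and gets back 10,000 tokens of code would run about $0.75. The token counts here are invented for illustration; the rates are the published ones.

# Back-of-envelope cost at the published Opus 4.6 rates: $5 per
# million input tokens, $25 per million output tokens. The example
# token counts are invented for illustration.
INPUT_PER_MTOK = 5.00
OUTPUT_PER_MTOK = 25.00

def request_cost(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens * INPUT_PER_MTOK + output_tokens * OUTPUT_PER_MTOK) / 1_000_000

print(f"${request_cost(100_000, 10_000):.2f}")  # $0.75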

Liza Chan
AI & Emerging Tech Correspondent

Liza covers the rapidly evolving world of artificial intelligence, from breakthroughs in research labs to real-world applications reshaping industries. With a background in computer science and journalism, she translates complex technical developments into accessible insights for curious readers.
