Inception Labs, the Palo Alto startup built around Stanford professor Stefano Ermon's diffusion research, launched Mercury 2 on Monday. The company calls it the fastest reasoning LLM available, and the numbers are hard to ignore: 1,009 tokens per second on NVIDIA Blackwell GPUs, according to benchmarks consistent with Artificial Analysis methodology. For comparison, Claude 4.5 Haiku manages roughly 89 tokens per second. GPT-5 Mini gets about 71.
Those throughput figures come with a catch, though. Artificial Analysis flagged Mercury 2 as "very verbose," generating 69 million tokens during its evaluation where the average model produces 16 million. A model that talks four times as much will naturally look faster on raw throughput charts. Whether that verbosity helps or hurts actual applications is a question Inception hasn't really addressed.
How diffusion changes generation
Every major LLM today generates text the same way: one token at a time, left to right, each word waiting for the one before it. Mercury 2 does something different. It starts with noise and iteratively refines the entire output in parallel, the same core approach behind image generators like Midjourney and Sora, now applied to language. Each pass through the neural network modifies multiple tokens at once.
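The contrast is easy to sketch in a few lines of toy Python. To be clear, this is an illustration of the parallel-refinement idea only, not Inception's actual architecture: the "model" here is a lookup that already knows the target sentence, and the confidence threshold is an arbitrary stand-in for real denoising predictions. The point is the loop shape: every pass can commit multiple positions at once, instead of one token per step.

```python
import random

random.seed(0)  # make the toy run reproducible

# Toy target the "model" already knows -- a real diffusion LM predicts this.
TARGET = ["the", "cat", "sat", "on", "the", "mat"]

def denoise_step(draft, confidence):
    """One toy diffusion pass: every position whose 'prediction' clears
    the confidence threshold is committed, in parallel."""
    return [TARGET[i] if random.random() < confidence else tok
            for i, tok in enumerate(draft)]

# Start from pure noise (all masked) and refine the WHOLE sequence each pass.
draft = ["<mask>"] * len(TARGET)
passes = 0
while "<mask>" in draft:
    draft = denoise_step(draft, confidence=0.5)
    passes += 1

print(passes, draft)
```

An autoregressive model would need one step per token by construction; here the number of passes depends only on how quickly positions clear the threshold, which is why throughput can decouple from sequence length.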
The company's blog post frames this as less typewriter, more editor revising a full draft. That analogy is useful. It also explains the latency gap: Inception claims 1.7 seconds end-to-end, compared to 14.4 seconds for Gemini 3 Flash and 23.4 seconds for Claude Haiku 4.5 with reasoning enabled, according to The Decoder's reporting.
So is it any good?
Speed is the headline. Quality is the fine print. Mercury 2 scored 91.1 on AIME 2025, 73.6 on GPQA, and 67.3 on LiveCodeBench, according to the press release. Inception positions these as competitive with Claude 4.5 Haiku and GPT-5.2 Mini. That's a specific comparison: speed-optimized models, not frontier reasoning systems like GPT-5 or Claude Opus. The 91.1 AIME score sounds impressive, but frontier models now routinely crack 93-95% on that benchmark.
"Most teams treat inference as an optimization exercise around the autoregressive stack, but Inception started from a more fundamental place: diffusion for language," said Tim Tully, partner at Menlo Ventures, which is fair enough as investor framing goes, though "more fundamental" is doing some heavy lifting when the quality benchmarks land squarely in the mid-tier.
The price angle
Where things get more interesting: $0.25 per million input tokens and $0.75 per million output tokens. That's half of Gemini 3 Flash's input price, with roughly four times the savings on output. Against Claude Haiku 4.5 pricing, the gap widens further. For applications where speed and cost matter more than peak intelligence (agent loops, voice interfaces, autocomplete), this is a compelling pitch.
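The arithmetic is worth making concrete. The prices below are the ones in the announcement; the workload volumes are illustrative assumptions, not figures from Inception:

```python
# Mercury 2 list prices from the announcement, in dollars per million tokens.
INPUT_PER_M = 0.25
OUTPUT_PER_M = 0.75

def monthly_cost(requests, in_tokens_per_req, out_tokens_per_req):
    """Dollar cost for a hypothetical monthly workload at Mercury 2 pricing."""
    total_in = requests * in_tokens_per_req
    total_out = requests * out_tokens_per_req
    return (total_in / 1e6) * INPUT_PER_M + (total_out / 1e6) * OUTPUT_PER_M

# Example: an agent loop making 1M calls/month, 2K tokens in, 500 out per call.
cost = monthly_cost(1_000_000, 2_000, 500)
print(f"${cost:,.2f}")  # → $875.00
```

At those volumes the bill stays in the hundreds of dollars, which is the kind of math that makes speed-tier models attractive for high-frequency use cases.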
The model supports a 128K context window, tool use, and schema-aligned JSON output. It's also OpenAI API compatible, so teams can swap it into existing stacks without rewriting anything. Skyvern CTO Suchintan Singh put it bluntly in his testimonial: Mercury 2 is "at least twice as fast as GPT-5.2."
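For teams already on the OpenAI SDK, compatibility means the swap is mostly configuration. The base URL and model identifier below are illustrative assumptions (check Inception's documentation for the real values); everything else is the standard OpenAI client call shape:

```python
import os
from openai import OpenAI

# HYPOTHETICAL endpoint and model id -- substitute the values from
# Inception's docs. Only the base_url, key, and model name change;
# the request/response shapes stay the same as any OpenAI backend.
client = OpenAI(
    base_url="https://api.inceptionlabs.ai/v1",
    api_key=os.environ.get("INCEPTION_API_KEY", ""),
)

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="mercury-2",  # hypothetical model identifier
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```

Existing retry logic, streaming handlers, and tool-use plumbing written against the OpenAI API should carry over unchanged, which is the practical meaning of "without rewriting anything."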
Inception isn't alone in this bet
Google DeepMind showed off Gemini Diffusion back in May 2025, performing on par with Gemini 2.0 Flash Lite, then went quiet. The original Mercury model shipped with the company's $50 million seed round in November 2025, backed by Menlo Ventures, Mayfield, NVentures, M12, Snowflake Ventures, and Databricks, with angel checks from Andrew Ng and Andrej Karpathy. The technical paper for Mercury Coder arrived last summer.
Mercury 2 is the first diffusion model to add reasoning capabilities, which is the actual news here. Whether diffusion can scale to frontier-level quality remains the open question. Ermon and his co-founders (Aditya Grover from UCLA, Volodymyr Kuleshov from Cornell) are betting their company on yes. The model is available now via API and free to try at chat.inceptionlabs.ai, where a "Diffusion Effect" toggle lets you watch the denoising process in real time.