Alibaba's Qwen team has released Qwen3-Max-Thinking, a reasoning-focused variant of its flagship trillion-parameter model. Alibaba reports 100% accuracy on both AIME 2025 and HMMT (the Harvard-MIT Mathematics Tournament), matching performance levels previously seen only from top-tier closed models. Built on Qwen3-Max's architecture, it uses a code interpreter and parallel test-time compute to tackle complex mathematical reasoning.
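Alibaba hasn't published the exact test-time-compute mechanism. One common form of the technique is self-consistency: sample several independent solutions in parallel, then majority-vote on the final answer. The sketch below illustrates that pattern under those assumptions; `query_model` is a hypothetical stand-in for any sampled model call, not Qwen's API.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def query_model(problem: str, seed: int) -> str:
    """Hypothetical stand-in for one sampled completion.
    A real call would hit the model with temperature > 0 so
    each sample can take a different reasoning path."""
    return "42"  # stubbed so the sketch runs end to end

def solve_with_parallel_compute(problem: str, n_samples: int = 8) -> str:
    """Sample n_samples solutions in parallel and return the majority answer."""
    with ThreadPoolExecutor(max_workers=n_samples) as pool:
        answers = list(pool.map(lambda s: query_model(problem, s), range(n_samples)))
    # Majority vote: the most common final answer wins.
    return Counter(answers).most_common(1)[0][0]

print(solve_with_parallel_compute("What is 6 * 7?"))
```

The idea is that independent samples rarely agree on the same wrong answer, so voting filters out stray reasoning errors at the cost of extra compute per problem.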
The model represents Alibaba's push into deliberate reasoning systems. Qwen3-Max-Thinking exposes its step-by-step thought process before delivering answers, letting users see how it works through problems. On GPQA, a graduate-level science benchmark, the model scores 85.4, trailing only GPT-5's 89.4. The technical report details the four-stage training pipeline behind the Qwen3 series.
Qwen3-Max itself ranks third on the LMArena text leaderboard and scores 74.8 on Tau2-Bench for agent tool calling. The thinking variant adds what Alibaba calls adaptive tool use, meaning the model decides on its own when to invoke search, memory, or code execution, with no manual configuration. Pricing is $1.20 per million input tokens and $6.00 per million output tokens, roughly half the cost of comparable frontier models. Access is available through Qwen Chat and the Alibaba Cloud API, with open-source Qwen3 variants on GitHub.
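For readers who want to try it, here is a minimal sketch of one plausible call path through Alibaba Cloud Model Studio's OpenAI-compatible endpoint. The model identifier, the reasoning_content field, and the token counts in the cost comment are assumptions; check the Model Studio catalog for current names.

```python
from openai import OpenAI

# Assumed endpoint: Alibaba Cloud Model Studio's OpenAI-compatible mode.
client = OpenAI(
    api_key="YOUR_DASHSCOPE_API_KEY",
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

# "qwen3-max-thinking" is an assumed model identifier.
response = client.chat.completions.create(
    model="qwen3-max-thinking",
    messages=[{"role": "user", "content": "How many primes are below 100?"}],
)

message = response.choices[0].message
# Some Qwen thinking endpoints surface the visible reasoning trace as
# reasoning_content; treat this field name as an assumption.
print(getattr(message, "reasoning_content", None))
print(message.content)

# Cost at the reported $1.20 / $6.00 per million tokens: a hypothetical
# 10,000-token prompt with a 2,000-token response would run
# 10,000/1e6 * $1.20 + 2,000/1e6 * $6.00 = $0.012 + $0.012 = $0.024.
```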
The Bottom Line: Qwen3-Max-Thinking matches frontier reasoning benchmarks at lower cost, though independent verification of the company-reported scores remains pending.
QUICK FACTS
- 100% accuracy on AIME 2025 and HMMT (company-reported)
- 85.4 on GPQA (vs. GPT-5 at 89.4)
- Over 1 trillion parameters, trained on 36 trillion tokens
- 262,144 token context window
- $1.20/$6.00 per million input/output tokens
- Supports 100+ languages