Alibaba's Qwen team released Qwen-AgentWorld, a language world model that predicts what an environment would return in response to an agent's action, so the agent never has to act on a live system. Think flight simulator, but for AI agents. The official blog frames it as a simulation layer for seven domains: MCP, Search, Terminal, software engineering, Android, Web, and OS.
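To make the flight-simulator analogy concrete, here is a minimal sketch of the idea. It is not Qwen's actual API: the names `WorldModel`, `ToyTerminalWorldModel`, `predict_observation`, and `rollout` are hypothetical, and a small lookup table stands in for the language model that would generate the observation text.

```python
from typing import Protocol


class WorldModel(Protocol):
    """Hypothetical interface: predict an observation without executing anything."""

    def predict_observation(self, history: list[str], action: str) -> str: ...


class ToyTerminalWorldModel:
    """Toy stand-in for a language world model (an assumption, not Qwen's API).

    A real model would generate the observation text; this one looks it up.
    """

    def __init__(self) -> None:
        self._canned = {
            "pwd": "/home/agent",
            "ls": "README.md  src",
        }

    def predict_observation(self, history: list[str], action: str) -> str:
        # Nothing runs on a live system; the output is only predicted.
        return self._canned.get(action, f"bash: {action}: command not found")


def rollout(model: WorldModel, actions: list[str]) -> list[str]:
    """Feed each action to the simulator and collect the predicted observations."""
    history: list[str] = []
    observations: list[str] = []
    for action in actions:
        obs = model.predict_observation(history, action)
        history.extend([action, obs])
        observations.append(obs)
    return observations
```

Calling `rollout(ToyTerminalWorldModel(), ["pwd", "ls"])` returns the two canned outputs without ever spawning a shell, which is the property that lets an agent be trained or tested against a simulated environment.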
The flagship 397B variant scored 58.71 on AgentWorldBench, the team's own benchmark, edging GPT-5.4 at 58.25 and beating Claude Opus 4.8. Worth noting: these are simulation-fidelity scores, judged on whether a predicted environment observation matches real execution, not head-to-head agent task completion. And the benchmark comes from Qwen, scoring its own model. Independent confirmation isn't in yet.
The benchmark itself is built from real trajectories of frontier models, including Claude Opus 4.6, on established suites like Terminal-Bench and OSWorld-Verified. Each predicted observation is scored against a ground-truth result on five dimensions. The technical paper reports a separate result too: agents trained against the simulator hit 50.3 percent F1 on WideSearch versus 45.6 percent for training in the real environment.
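For readers unfamiliar with the metric, F1 is the harmonic mean of precision and recall. The sketch below computes it over sets of items. How WideSearch actually matches items is not described here, so the item-level comparison and the function name `f1_score` are assumptions for illustration only.

```python
def f1_score(predicted: set[str], expected: set[str]) -> float:
    """Item-level F1: harmonic mean of precision and recall."""
    true_positives = len(predicted & expected)
    if true_positives == 0:
        # Covers empty inputs and the no-overlap case; avoids division by zero.
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(expected)
    return 2 * precision * recall / (precision + recall)
```

With two predicted items of which one is correct, and two expected items, precision and recall are both 0.5, so F1 is 0.5.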
So far only the 35B-A3B weights are open-sourced, under Apache 2.0, alongside the benchmark dataset. Both are live on Hugging Face, with code on GitHub. The blog doesn't say when, or whether, the 397B weights ship.
Bottom Line
Qwen-AgentWorld-397B scored 58.71 on Qwen's own AgentWorldBench, ahead of GPT-5.4's 58.25, but only the smaller 35B-A3B weights are public so far.
Quick Facts
- Score: 397B variant hit 58.71 on AgentWorldBench (company-reported)
- GPT-5.4 scored 58.25; Claude Opus 4.8 ranked lower
- Two model sizes: 35B-A3B and 397B-A17B
- Only 35B-A3B weights open-sourced, Apache 2.0
- Seven domains: MCP, Search, Terminal, SWE, Android, Web, OS
- Trained on 10M+ real interaction trajectories




