Qwen-AgentWorld: Open Model Tops Agent Simulation Benchmark


Alibaba's Qwen team has released Qwen-AgentWorld, a language world model that predicts what an environment would return when an agent acts, rather than executing the action on a live system. Think flight simulator, but for AI agents. The official blog frames it as a simulation layer for seven domains: MCP, Search, Terminal, software engineering (SWE), Android, Web, and OS.
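
To make the "flight simulator" idea concrete, here is a minimal sketch of what a simulated environment could look like from the agent's side. Everything in it is an assumption for illustration: `SimulatedEnvironment`, `step`, and `toy_terminal_model` are hypothetical names, not Qwen-AgentWorld's API, and the toy function stands in for what would really be a call to the world model.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# One (action, observation) pair per step of the interaction so far.
History = List[Tuple[str, str]]


@dataclass
class SimulatedEnvironment:
    """Hypothetical wrapper: returns a predicted observation for each action
    instead of running the action on a live system."""
    predict: Callable[[History, str], str]
    history: History = field(default_factory=list)

    def step(self, action: str) -> str:
        # The world model sees the trajectory so far plus the new action.
        observation = self.predict(self.history, action)
        self.history.append((action, observation))
        return observation


def toy_terminal_model(history: History, action: str) -> str:
    # Assumption: a hard-coded stand-in for the language world model.
    if action == "ls":
        return "README.md  src"
    return f"sh: command not found: {action.split()[0]}"


env = SimulatedEnvironment(predict=toy_terminal_model)
print(env.step("ls"))         # predicted output, nothing was executed
print(env.step("foo --bar"))  # predicted error message
```

The point of the design is that the agent's loop does not change: it still sends an action and reads back an observation, but no real terminal, browser, or device is involved.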

The flagship 397B variant scored 58.71 on AgentWorldBench, the team's own benchmark, edging GPT-5.4 at 58.25 and beating Claude Opus 4.8. Worth noting: these are simulation-fidelity scores, judged on whether a predicted environment observation matches real execution, not head-to-head agent task completion. And the benchmark comes from Qwen, scoring its own model. Independent confirmation isn't in yet.

The benchmark itself is built from real trajectories of frontier models, including Claude Opus 4.6, on established suites like Terminal-Bench and OSWorld-Verified. Each predicted observation is scored against a ground-truth result on five dimensions. The technical paper reports a separate result too: agents trained against the simulator hit 50.3 percent F1 on WideSearch versus 45.6 percent for training in the real environment.
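
For readers who want the WideSearch numbers in context, the sketch below shows the generic set-based definition of F1 and the gap between the two reported figures. The article does not say how WideSearch computes F1, so the `f1_score` function and its example inputs are assumptions for illustration; only the 50.3 and 45.6 values come from the report.

```python
def f1_score(predicted: set, expected: set) -> float:
    """Generic set-based F1: harmonic mean of precision and recall."""
    true_positives = len(predicted & expected)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(expected)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: 2 of 3 returned items are correct, 2 of 3 expected items found.
example = f1_score({"a", "b", "c"}, {"b", "c", "d"})

# Figures reported in the technical paper (percent F1 on WideSearch).
simulator_f1 = 50.3
real_env_f1 = 45.6

gap_points = simulator_f1 - real_env_f1   # absolute gap in percentage points
relative_gain = gap_points / real_env_f1  # gap relative to the real-environment baseline

print(f"example F1: {example:.3f}")
print(f"gap: {gap_points:.1f} points, relative gain: {relative_gain:.1%}")
```

On the reported figures, the simulator-trained agents lead by 4.7 percentage points, roughly a 10 percent relative improvement over training in the real environment.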

So far only the 35B-A3B weights are open-sourced, under Apache 2.0, alongside the benchmark dataset. Both are live on Hugging Face, with code on GitHub. The blog doesn't say when, or whether, the 397B weights ship.

Bottom Line

Qwen-AgentWorld-397B scored 58.71 on Qwen's own AgentWorldBench, ahead of GPT-5.4's 58.25, but only the smaller 35B-A3B weights are public so far.

Quick Facts

Score: 397B variant hit 58.71 on AgentWorldBench (company-reported)
GPT-5.4 scored 58.25; Claude Opus 4.8 ranked lower
Two model sizes: 35B-A3B and 397B-A17B
Only 35B-A3B weights open-sourced, Apache 2.0
Seven domains: MCP, Search, Terminal, SWE, Android, Web, OS
Trained on 10M+ real interaction trajectories

Tags: Qwen, Alibaba, AI agents, world models, open weights, benchmarks, LLM

Andrés Martínez

AI Content Writer

Andrés reports on the AI stories that matter right now. No hype, just clear, daily coverage of the tools, trends, and developments changing industries in real time. He makes the complex feel routine.

