
Qwen Releases AgentWorld, a Language Model That Simulates Agent Environments

Alibaba's open model predicts what tools and environments return, topping a new simulation benchmark.

Andrés Martínez, AI Content Writer
June 25, 2026 · 2 min read

Alibaba's Qwen team released Qwen-AgentWorld, a language world model that does not act on a live system itself but predicts what an environment would return when an agent acts. Think flight simulator, but for AI agents. The official blog frames it as a simulation layer for seven domains: MCP, Search, Terminal, software engineering, Android, Web, and OS.
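To make the "simulation layer" idea concrete, here is a minimal Python sketch of a predictor standing in for a live environment. It is an illustration only, not Qwen's API: `SimulatedEnvironment`, `predict`, and `toy_predictor` are hypothetical names, and the toy rule-based function merely takes the place of a call to a language world model.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Protocol


class Environment(Protocol):
    """Anything an agent can act on: takes an action, returns an observation."""

    def step(self, action: str) -> str: ...


@dataclass
class SimulatedEnvironment:
    """Wraps a text predictor so it can stand in for a live environment.

    `predict` is a hypothetical stand-in for a call to a language world
    model: it receives the interaction history plus the new action and
    returns the observation it expects the real system would produce.
    """

    predict: Callable[[List[str], str], str]
    history: List[str] = field(default_factory=list)

    def step(self, action: str) -> str:
        observation = self.predict(self.history, action)
        # Record the turn so later predictions can depend on earlier ones.
        self.history.append(f"ACTION: {action}")
        self.history.append(f"OBSERVATION: {observation}")
        return observation


def toy_predictor(history: List[str], action: str) -> str:
    """Toy rule-based predictor, used only to make the sketch runnable."""
    if action.startswith("ls"):
        return "README.md  src  tests"
    return f"(simulated output for: {action})"


# The agent issues commands exactly as it would against a real terminal,
# but nothing is executed: every observation is a prediction.
env = SimulatedEnvironment(predict=toy_predictor)
first = env.step("ls")
second = env.step("cat README.md")
```

Because the agent only ever calls `step`, the same agent code could in principle run against a real environment or a simulated one, which is what makes training against a simulator possible.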

The flagship 397B variant scored 58.71 on AgentWorldBench, the team's own benchmark, edging GPT-5.4 at 58.25 by 0.46 points and beating Claude Opus 4.8. Worth noting: these are simulation-fidelity scores, which measure whether a predicted environment observation matches real execution, not head-to-head agent task completion. And the benchmark comes from Qwen, scoring its own model. Independent confirmation isn't in yet.

The benchmark itself is built from real trajectories of frontier models, including Claude Opus 4.6, on established suites like Terminal-Bench and OSWorld-Verified. Each predicted observation is scored against a ground-truth result on five dimensions. The technical paper reports a separate result too: agents trained against the simulator hit 50.3 percent F1 on WideSearch, versus 45.6 percent for agents trained in the real environment, a gap of 4.7 points.

So far only the 35B-A3B weights are open-sourced, under Apache 2.0, alongside the benchmark dataset. Both are live on Hugging Face, with code on GitHub. The blog doesn't say when, or whether, the 397B weights ship.


Bottom Line

Qwen-AgentWorld-397B scored 58.71 on Qwen's own AgentWorldBench, ahead of GPT-5.4's 58.25, but only the smaller 35B-A3B weights are public so far.

Quick Facts

  • Score: 397B variant hit 58.71 on AgentWorldBench (company-reported)
  • GPT-5.4 scored 58.25; Claude Opus 4.8 ranked lower
  • Two model sizes: 35B-A3B and 397B-A17B
  • Only 35B-A3B weights open-sourced, Apache 2.0
  • Seven domains: MCP, Search, Terminal, SWE, Android, Web, OS
  • Trained on 10M+ real interaction trajectories
Tags: Qwen, Alibaba, AI agents, world models, open weights, benchmarks, LLM
