Qwen Robot Suite: Three Embodied AI Models, No Weights Yet

Alibaba's Qwen team released a robotics suite on Tuesday: three separate foundation models that the company calls a full stack for embodied intelligence. Qwen-RobotManip handles manipulation, Qwen-RobotNav handles getting around, and Qwen-RobotWorld predicts what happens next. They were developed by Tongyi Lab, and according to the South China Morning Post they are already in pilot testing with select Alibaba Cloud enterprise clients.

The framing everyone reached for was "the Android of robotics." Decrypt used it. The pitch is that these three pieces snap together into an operating system other people build on. Maybe. The more interesting thing is what's actually in the papers, and where the gaps are.

The manipulation model is the one to watch

RobotManip is built on a Qwen3.5-4B vision-language backbone with a flow-matching DiT action head. The core idea is an alignment problem, not a scale problem. Robot data is a mess: a Franka arm reports joint angles, an ALOHA dual-arm rig reports gripper poses, and a policy trained on one almost never transfers to the other. So the team maps everything into a single 80-dimensional state-action vector, masks the dimensions a given robot doesn't use, and expresses end-effector motion as deltas in the camera frame rather than the robot base frame. That last choice is the clever part. Visually similar motions end up numerically close, which is what lets demonstrations from different machines reinforce each other instead of fighting.
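To make the alignment idea concrete, here is a minimal sketch of what a shared action space with masking and camera-frame deltas could look like. Only the 80-dimension figure comes from the article. The slot layout, the per-robot dimension counts, and every function name are my own assumptions for illustration, not the paper's actual scheme, and the frame conversion handles only translation, not orientation.

```python
import numpy as np

UNIFIED_DIM = 80  # size quoted in the article; the slot layout below is invented

# Hypothetical slot assignments: which indices each robot fills.
SLOTS = {
    "franka": np.arange(0, 7),   # assumed: 7 joint values
    "aloha": np.arange(7, 21),   # assumed: 2 arms x 7 values
}

def to_unified(robot, values):
    """Scatter a robot-specific vector into the shared space.

    Returns (vector, mask); mask is True where the robot has real data.
    """
    idx = SLOTS[robot]
    values = np.asarray(values, dtype=np.float32)
    if values.shape != idx.shape:
        raise ValueError(f"{robot} expects {idx.size} values, got {values.size}")
    vec = np.zeros(UNIFIED_DIM, dtype=np.float32)
    mask = np.zeros(UNIFIED_DIM, dtype=bool)
    vec[idx] = values
    mask[idx] = True
    return vec, mask

def masked_mse(pred, target, mask):
    """Mean squared error over only the dimensions this robot uses."""
    diff = (pred - target)[mask]
    return float(np.mean(diff ** 2))

def delta_in_camera_frame(delta_base, R_cam_from_base):
    """Rotate a translational end-effector delta from base frame to camera frame."""
    return R_cam_from_base @ np.asarray(delta_base, dtype=np.float64)

# Two robots mounted differently relative to the same camera.
R_a = np.eye(3)                      # base axes aligned with the camera
R_b = np.array([[0.0, -1.0, 0.0],    # base rotated 90 degrees about z
                [1.0, 0.0, 0.0],
                [0.0, 0.0, 1.0]])

# Different numbers in each base frame, same motion as the camera sees it.
cam_a = delta_in_camera_frame([0.1, 0.0, 0.0], R_a)
cam_b = delta_in_camera_frame([0.0, -0.1, 0.0], R_b)
same_motion = np.allclose(cam_a, cam_b)
```

In the base frame the two deltas look unrelated. In the camera frame they coincide, which is the property the article credits for letting demonstrations from different machines reinforce each other.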

They built the training corpus, roughly 38,100 hours, entirely from open-source datasets and human demonstration videos. No proprietary collection. There's a synthesis pipeline that turns egocentric human hand footage into robot trajectories across 15 embodiments, with the hands inpainted out and the robot composited back in. It's a smart way around the data scarcity wall, though I'd want to see how much the synthetic-to-real gap costs in practice.

Now the numbers, and here's where you have to read carefully. The headline claim is 91.4% on LIBERO-Plus, 7 points ahead of π0.5. But the team makes an argument I actually buy: standard in-distribution benchmarks are nearly useless for judging foundation models, because a model trained from scratch, with no large-scale pretraining, can post comparable scores through pattern matching alone. So they lean on out-of-distribution settings instead. The cross-embodiment transfer score is the one that caught my attention: 23.9% versus 7.5% for π0.5, roughly 3.2 times better at controlling a robot it wasn't trained on. If that holds up outside their own evaluation harness, it's the most meaningful result in the suite.
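The 3.2x figure is easy to check from the two reported percentages. The snippet below only redoes that arithmetic; the variable names are mine, and the inputs are the numbers quoted above.

```python
# Cross-embodiment transfer scores as quoted in the article (percent).
qwen_score = 23.9
pi05_score = 7.5

ratio = round(qwen_score / pi05_score, 1)      # relative improvement
point_gap = round(qwen_score - pi05_score, 1)  # absolute gap in percentage points
```

The ratio rounds to 3.2 and the absolute gap is 16.4 points. Both numbers are worth keeping in mind: a large multiple on a low baseline still leaves the model failing most of the time on an unseen robot.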

The real-world claim is a 1st-place finish on the RoboChallenge Table30-v1 generalist track, with a 20% relative margin over third place. The model was validated on AgileX ALOHA, Franka, UR and ARX platforms. Worth noting: these are still mostly the team's own benchmarks, and two of them, RoboTwin-IF and RoboTwin-XE, were introduced in this same paper. Designing the test and topping it is not disqualifying, but it's not independent either.

RobotNav: five jobs, one model

The navigation model runs on Qwen3-VL and comes in 2B, 4B and 8B sizes. It folds five tasks into one framework: instruction following, object search, target tracking, autonomous driving, and environment QA. The piece the team seems proudest of is a controllable observation protocol, a configurable interface for how the model consumes the visual stream (visual token budget, temporal decay, camera weighting, frame sampling). They explicitly compare it to the Model Context Protocol for LLM tool use, which tells you how they're thinking about this.
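For a sense of what a configurable observation interface might involve, here is a rough sketch built around the four knobs the article lists. Everything in it is assumed: the field names, the defaults, and the allocation rule (geometric decay by frame age, scaled by camera weight) are my guesses at one plausible design, not Qwen's actual protocol.

```python
from dataclasses import dataclass, field

@dataclass
class ObservationConfig:
    """Hypothetical knobs mirroring the four the article mentions."""
    visual_token_budget: int = 1024
    temporal_decay: float = 0.5  # each older sampled frame gets this fraction of the newer one's share
    camera_weights: dict = field(default_factory=lambda: {"front": 1.0})
    frame_stride: int = 2        # keep every Nth frame, counting back from the newest

def allocate_tokens(cfg, num_frames):
    """Split the token budget across sampled frames and cameras.

    Returns {(frame_index, camera_name): token_count}.
    """
    if num_frames <= 0 or not cfg.camera_weights:
        return {}
    # Sample frames newest first so that age 0 is the latest frame.
    sampled = list(range(num_frames - 1, -1, -cfg.frame_stride))
    raw = {}
    for age, frame in enumerate(sampled):
        for cam, weight in cfg.camera_weights.items():
            raw[(frame, cam)] = weight * (cfg.temporal_decay ** age)
    total = sum(raw.values())
    if total <= 0:
        return {}
    # Floor each share so the sum never exceeds the budget.
    return {key: int(cfg.visual_token_budget * share / total)
            for key, share in raw.items()}

cfg = ObservationConfig(visual_token_budget=700)
allocation = allocate_tokens(cfg, num_frames=5)
```

With these settings, frames 4, 2 and 0 are sampled and the newest frame takes the largest share of the budget. The appeal of a protocol like this is that the same model can be retuned for a slow indoor search or a fast driving scene by changing configuration rather than retraining.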

Qwen reports 76.5% success on VLN-CE RxR and claims it runs on a Unitree Go2 quadruped with no fine-tuning. That zero-shot deployment claim is the kind of thing that's easy to stage and hard to verify from a blog post.

And the world model

RobotWorld is the odd one out. It's a language-conditioned video world model on a 60-layer MMDiT with a frozen Qwen2.5-VL encoder, and instead of numeric action vectors it uses plain language as the action interface. Give it a frame and a text command, and it predicts how the scene evolves. Because language is embodiment-agnostic, one model covers 20-plus robot types and 500-plus action categories. Qwen reports 1st-place finishes on physics-consistency benchmarks including EWMBench, in some cases qualified as best among open models.

It's also the only one of the three shipping as a paper alone, with no GitHub repo. Read into that what you will.

What about Chat2Robot?

There's a browser demo, Chat2Robot, for typing commands and watching a robot react in real time. The catch: it only runs a stripped-down RobotManip trained on 50 tasks, meant to show partial generalization to unfamiliar instructions. A tech demo, in other words, not the real model.

The thing nobody's saying out loud

Two of the three models, RobotManip and RobotNav, ship with public GitHub repositories. RobotWorld doesn't. And as of the announcement, Qwen hasn't said whether any of the weights will be released, or under what license. That's a real omission for a team whose reputation rests on open weights. Qwen3-8B shipped under Apache 2.0. The robotics suite, so far, gives you technical reports and code but no checkpoints to download.

So is this an operating system for robots? Hard to say from here. The alignment idea in RobotManip is genuinely good engineering and the cross-embodiment numbers are the strongest evidence of something new. But these are simulation benchmarks plus a handful of controlled real-robot runs, and the distance between that and a robot working in an actual warehouse is exactly where every robotics effort in history has gone to die. The pilot program runs through Alibaba Cloud; the company hasn't named the industries, the success metrics, or a timeline to general availability. Until the weights land and someone outside Tongyi Lab reproduces the OOD results, the right posture is interested but skeptical.