
Alibaba's Qwen Team Releases Qwen-VLA for Cross-Robot Control

One model handles manipulation, navigation and trajectory tasks across robot types via text prompts.

Andrés Martínez, AI Content Writer
May 31, 2026

Alibaba's Qwen team has announced Qwen-VLA, a vision-language-action model that runs across different robot bodies without retraining a separate policy for each one. The technical report landed May 28, and a project repository is live. The model weights themselves aren't out yet.

VLA models take a camera image plus a text command and output robot actions. Qwen-VLA is built on the Qwen3.5-4B vision-language backbone with a 1.15B-parameter DiT flow-matching action decoder. Switching robots means swapping the text description of the platform in the prompt, which the team calls embodiment-aware prompt conditioning. There are no per-platform output heads.
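To make the prompt-conditioning idea concrete, here is a minimal Python sketch of how a single policy could be pointed at different robots by changing only the text it receives. Everything in it is an assumption for illustration: the embodiment descriptions, the prompt layout, and the names `EMBODIMENTS`, `PolicyInput`, `build_prompt` and `make_input` are hypothetical and do not come from the Qwen-VLA report or repository.

```python
from dataclasses import dataclass

# Hypothetical embodiment descriptions. The wording is illustrative only
# and is not taken from the Qwen-VLA report.
EMBODIMENTS = {
    "aloha": "Bimanual ALOHA setup with two arms and grippers.",
    "mobile_base": "Wheeled mobile base driven by velocity commands.",
}


@dataclass(frozen=True)
class PolicyInput:
    """What a shared policy would receive: one prompt and one camera image."""
    prompt: str
    image_path: str


def build_prompt(embodiment: str, instruction: str) -> str:
    """Prefix the task instruction with a text description of the robot."""
    if embodiment not in EMBODIMENTS:
        raise KeyError(f"unknown embodiment: {embodiment}")
    return f"Robot: {EMBODIMENTS[embodiment]}\nTask: {instruction.strip()}"


def make_input(embodiment: str, instruction: str, image_path: str) -> PolicyInput:
    """Bundle the conditioned prompt with the image for a single shared model."""
    return PolicyInput(build_prompt(embodiment, instruction), image_path)


# Same instruction format, two robots: only the description line changes.
arm_input = make_input("aloha", "pick up the red cup", "frame_0001.png")
base_input = make_input("mobile_base", "drive to the charging dock", "frame_0001.png")
```

In this sketch the model's output interface never changes between robots; the only thing that differs is the first line of the prompt, which is the property the article attributes to Qwen-VLA.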

Qwen-VLA folds manipulation, navigation and trajectory prediction into one framework. The pitch is that a single generalist matches specialists tuned per task. On the ALOHA bimanual setup, the comparison models, GR00T N1.6 (NVIDIA) and π0.5 (Physical Intelligence), were each fine-tuned separately for every task, while Qwen-VLA ran as one model across all of them.

The self-reported numbers: 97.9% on LIBERO, 87.2% on RoboTwin-Hard, and 76.9% average success on out-of-distribution real-world ALOHA tasks. Independent replication hasn't happened yet, and there's no word on when weights ship or under what license.


Bottom Line

Qwen-VLA reports 97.9% on LIBERO and 76.9% on out-of-distribution ALOHA tasks, but only the report and repo are public, not the weights.

Quick Facts

  • Backbone: Qwen3.5-4B vision-language model
  • Action decoder: 1.15B-parameter DiT flow-matching
  • Technical report posted May 28, 2026 (arXiv 2605.30280)
  • LIBERO: 97.9%, RoboTwin-Hard: 87.2% (company-reported)
  • Real-world ALOHA OOD success: 76.9% average (company-reported)
Tags: Qwen, Alibaba, robotics, vision-language-action, embodied AI, VLA models, open source