Alibaba's Qwen team put out Qwen-VLA, a vision-language-action model that runs across different robot bodies without retraining a separate policy for each one. The technical report landed May 28, and a project repository is live. The models themselves aren't out yet.
VLA models take a camera image plus a text command and spit out robot actions. Qwen-VLA pairs the Qwen3.5-4B vision-language backbone with a 1.15B-parameter DiT (diffusion transformer) flow-matching action decoder. Switching robots means swapping the text description of the platform, which the team calls embodiment-aware prompt conditioning. There are no per-platform output heads.
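To make the two ideas concrete, here is a toy sketch of prompt-based embodiment switching feeding a flow-matching sampler. It is not Qwen-VLA's code. Every name in it (`EMBODIMENTS`, `build_prompt`, `encode`, `velocity`, `sample_actions`), the prompt wording, the action dimensions, and the chunk length are made up for illustration. The padded shared output width is likewise an assumption about how one decoder could serve several robots, not something the report is confirmed to do. A seeded random vector stands in for the backbone, and a closed-form straight-line velocity field stands in for the trained decoder.

```python
import numpy as np

MAX_ACTION_DIM = 14  # assumed shared output width (hypothetical)
HORIZON = 4          # assumed action-chunk length (hypothetical)

# Hypothetical platform descriptions and action sizes; the real prompt
# wording is not given in the article.
EMBODIMENTS = {
    "aloha": ("bimanual robot with two arms and grippers", 14),
    "mobile_base": ("wheeled base with planar velocity control", 3),
}


def build_prompt(embodiment, instruction):
    """Switching robots only changes this text, not the network."""
    description, _ = EMBODIMENTS[embodiment]
    return f"Embodiment: {description}\nTask: {instruction}"


def encode(prompt):
    """Stand-in for the vision-language backbone.

    Deterministically maps a prompt to a target action chunk so the toy
    sampler has something to condition on.
    """
    seed = sum(prompt.encode("utf-8"))
    rng = np.random.default_rng(seed)
    return rng.uniform(-1.0, 1.0, size=(HORIZON, MAX_ACTION_DIM))


def velocity(x, t, cond):
    """Toy velocity field for a straight-line flow that ends at cond.

    A trained decoder would predict this quantity with a network.
    """
    return (cond - x) / (1.0 - t)


def sample_actions(embodiment, instruction, steps=10, seed=0):
    """Euler-integrate from Gaussian noise (t=0) to an action chunk (t=1)."""
    cond = encode(build_prompt(embodiment, instruction))
    x = np.random.default_rng(seed).standard_normal((HORIZON, MAX_ACTION_DIM))
    dt = 1.0 / steps
    for k in range(steps):
        x = x + dt * velocity(x, k * dt, cond)
    _, dim = EMBODIMENTS[embodiment]
    return x[:, :dim]  # keep only the dimensions this robot uses


if __name__ == "__main__":
    print(sample_actions("aloha", "pick up the cup").shape)
    print(sample_actions("mobile_base", "drive to the door").shape)
```

The same `sample_actions` function serves both robots. Only the prompt text and the final slice differ, which is the property the prompt-conditioning design is after.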
Qwen-VLA folds manipulation, navigation and trajectory prediction into one framework. The pitch is that a single generalist matches specialists tuned per task. On the ALOHA bimanual setup, the comparison points, GR00T N1.6 (NVIDIA) and π0.5 (Physical Intelligence), were each fine-tuned per task, while Qwen-VLA ran as one model across everything.
The self-reported numbers: 97.9% on LIBERO, 87.2% on RoboTwin-Hard, and 76.9% average success on out-of-distribution real-world ALOHA tasks. Independent replication hasn't happened yet, and there's no word on when weights ship or under what license.
Bottom Line
Qwen-VLA reports 97.9% on LIBERO and 76.9% on out-of-distribution ALOHA tasks, but only the report and repo are public, not the weights.
Quick Facts
- Backbone: Qwen3.5-4B vision-language model
- Action decoder: 1.15B-parameter DiT flow-matching
- Technical report posted May 28, 2026 (arXiv 2605.30280)
- LIBERO: 97.9%, RoboTwin-Hard: 87.2% (company-reported)
- Real-world ALOHA OOD success: 76.9% average (company-reported)



