Alibaba's Tongyi Lab has released MAI-UI, a set of vision-language models designed to control smartphone interfaces through natural language commands. The models, built on the Qwen 3 VL backbone, come in four sizes: 2B, 8B, 32B, and a sparse 235B variant with 22B active parameters.
The headline numbers are strong. MAI-UI scored 76.7% on AndroidWorld, a benchmark that tests agents across 116 tasks in 20 Android apps running in a live emulator. That beats UI-Tars-2, Gemini-2.5-Pro, and ByteDance's Seed1.8. On ScreenSpot-Pro, which evaluates grounding in high-resolution professional interfaces, the largest model reached 73.5% (with a zoom-in technique), surpassing both Gemini-3-Pro and Seed1.8, according to Alibaba's self-reported results.
The release addresses practical deployment challenges that have hampered GUI agents. MAI-UI includes native support for MCP tool calls (enabling external API access during task execution), a device-cloud collaboration system that routes computation based on task complexity and data sensitivity, and an online reinforcement learning framework. The team reports that scaling parallel RL environments from 32 to 512 yielded a 5.2-point improvement, though independent verification isn't available yet.
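The device-cloud split can be pictured as a per-step routing decision. The sketch below is purely illustrative and is not Alibaba's implementation; every name and threshold in it is hypothetical, and it only demonstrates the two signals the team describes (task complexity and data sensitivity):

```python
# Hypothetical sketch of device-cloud routing for a GUI agent step.
# Not MAI-UI's actual logic; names and thresholds are assumptions.
from dataclasses import dataclass

@dataclass
class Task:
    complexity: float  # 0.0 (trivial) .. 1.0 (hard), e.g. from a lightweight classifier
    sensitive: bool    # True if the screen shows private data (passwords, messages)

def route(task: Task, local_budget: float = 0.5) -> str:
    """Return 'device' or 'cloud' for a single agent step."""
    if task.sensitive:
        return "device"  # sensitive screens never leave the phone
    if task.complexity <= local_budget:
        return "device"  # a small on-device model (e.g. 2B/8B) suffices
    return "cloud"       # escalate hard tasks to the larger hosted model

print(route(Task(complexity=0.2, sensitive=False)))  # device
print(route(Task(complexity=0.9, sensitive=True)))   # device
print(route(Task(complexity=0.9, sensitive=False)))  # cloud
```

The appeal of this shape is that privacy acts as a hard constraint while complexity is a tunable cost trade-off, which matches the "complexity and data sensitivity" framing in the release.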
The 2B and 8B models are available now on Hugging Face under the Apache 2.0 license. The 32B and 235B variants haven't been publicly released.
The Bottom Line: Local GUI agents capable of useful smartphone automation are getting closer, with the 8B model small enough to run on consumer hardware.
QUICK FACTS
- AndroidWorld score: 76.7% (company-reported)
- ScreenSpot-Pro score: 73.5% with zoom-in (company-reported)
- Model sizes: 2B, 8B, 32B, 235B-A22B (sparse)
- Released: December 29, 2025
- License: Apache 2.0 (2B and 8B models only)
- Base architecture: Qwen 3 VL