Tencent Hunyuan has released PhoneBuddy, an open family of 4B-parameter agents for controlling phones, alongside a technical paper and code on GitHub. The models are built on a Qwen3.5-4B backbone, and weights went up on Hugging Face earlier in June.
The pitch is a training recipe, not just a model. PhoneBuddy runs reinforcement learning across two environments: real apps on physical devices, plus PhoneWorld, a mock-app setup that reconstructs runnable Android apps from real GUI traces so tasks can be reset and auto-verified. On a 150-task human evaluation across single apps, WeChat mini-apps, and cross-app workflows, task success climbs from 36.67% after supervised fine-tuning to 40.67% with real-app RL, then 45.33% once mock-app training is mixed in. On AndroidWorld, the same three stages score 60.3%, 77.2%, and 83.2%.
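To put the human-evaluation percentages in concrete terms, the short sketch below converts them back into task counts. It assumes each of the 150 tasks is scored once as a plain pass or fail with equal weight, which the reported figures are consistent with but which is not confirmed here; the stage labels and the `tasks_solved` helper are illustrative names only.

```python
# Assumption: 150 tasks, each counted once as pass/fail with equal weight.
TOTAL_TASKS = 150

# Success rates as reported by Tencent, in percent.
reported = {
    "SFT": 36.67,
    "real-app RL": 40.67,
    "real + mock RL": 45.33,
}

def tasks_solved(rate_percent, total=TOTAL_TASKS):
    """Nearest whole number of tasks implied by a percentage."""
    return round(rate_percent / 100 * total)

counts = {stage: tasks_solved(rate) for stage, rate in reported.items()}

for stage, solved in counts.items():
    # Recompute the rate from the whole-task count as a sanity check.
    print(f"{stage}: {solved}/{TOTAL_TASKS} = {solved / TOTAL_TASKS:.2%}")
```

Under that assumption the three stages correspond to 55, 61, and 68 tasks, so the roughly 8.7-point gain from SFT to the full recipe amounts to 13 additional tasks out of 150.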
All numbers are Tencent's own, and the headline comparisons are worth reading closely. The team says the final model beats GPT-5.4 and Seed 2.0 Pro on the four-setting average (54.8 versus 48.2 and 51.4, respectively) but sits under Gemini 3.1 Pro at 59.1. Cross-app tasks are where it stalls: success actually drops to 18% after the full recipe, worse than the SFT baseline, because the mock task pool is mostly single-app.
The reward setup leans on other models too. Real-app rollouts were scored using Gemini-3.1-Pro-Preview to write rubrics and a 122B Qwen model to grade them, which puts a proprietary model and a much larger grader behind an "open" 4B result. The cross-app gap is the number to watch next.
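For readers unfamiliar with rubric-based rewards of the kind described above, here is a minimal sketch of the general pattern: one model writes a checklist for a task, and a second model judges a rollout against each item. Everything in it is an assumption for illustration. The `Rubric` type, the `rubric_reward` function, the fraction-of-criteria aggregation, and the keyword-matching stand-in grader are hypothetical and are not taken from Tencent's paper or code.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Rubric:
    """Hypothetical rubric: a task plus checklist items a rollout should meet."""
    task: str
    criteria: List[str]

# A grader takes one criterion and a rollout transcript and returns pass/fail.
# In a real pipeline this would be a call to a judge model.
Grader = Callable[[str, str], bool]

def rubric_reward(rubric: Rubric, trajectory: str, grade: Grader) -> float:
    """Assumed aggregation: reward is the fraction of criteria satisfied."""
    if not rubric.criteria:
        return 0.0
    passed = sum(1 for criterion in rubric.criteria if grade(criterion, trajectory))
    return passed / len(rubric.criteria)

def keyword_grader(criterion: str, trajectory: str) -> bool:
    """Toy stand-in for a judge model: pass if the criterion text appears."""
    return criterion.lower() in trajectory.lower()

example_rubric = Rubric(
    task="Send a message to a contact",
    criteria=["opened chat", "typed message", "tapped send"],
)
example_trajectory = "Opened chat with Alice. Typed message 'hi'. Closed app."
example_reward = rubric_reward(example_rubric, example_trajectory, keyword_grader)
print(example_reward)
```

The point of the pattern is that the reward depends on both the rubric writer and the grader, so results trained this way inherit whatever biases those two models carry.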
Bottom Line
PhoneBuddy's 4B model hits 83.2% on AndroidWorld and beats GPT-5.4 on Tencent's average, but cross-app success sits at just 18%.
Quick Facts
- Model: PhoneBuddy, 4B parameters, Qwen3.5-4B backbone
- Real-phone eval: 36.67% to 45.33% success (company-reported)
- AndroidWorld: 60.3% to 83.2% (company-reported)
- Four-setting average: 54.8, above GPT-5.4 (48.2), below Gemini 3.1 Pro (59.1)
- Cross-app tasks drop to 18% after full training




