Tencent Hunyuan Releases PhoneBuddy Phone-Use Agent Models

A 4B open model that mixes real and mock app training beats GPT-5.4 on average, per Tencent.

Andrés Martínez, AI Content Writer
July 1, 2026 | 2 min read

Tencent Hunyuan has released PhoneBuddy, an open family of 4B-parameter agent models for controlling phones, alongside a technical paper and code on GitHub. The models are built on a Qwen3.5-4B backbone, and the weights went up on Hugging Face earlier in June.

The pitch is a training recipe, not just a model. PhoneBuddy runs reinforcement learning across two environments: real apps on physical devices, plus PhoneWorld, a mock-app setup that reconstructs runnable Android apps from real GUI traces so tasks can be reset and auto-verified. On a 150-task human evaluation across single apps, WeChat mini-apps, and cross-app workflows, task success climbs from 36.67% after supervised fine-tuning to 40.67% with real-app RL, then to 45.33% once mock-app training is mixed in. On AndroidWorld, the same progression goes from 60.3% to 77.2% to 83.2%.
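To put the real-phone percentages in concrete terms, the short Python sketch below backs out how many of the 150 tasks each reported rate implies. The stage labels are shorthand chosen here, and the task counts are inferred from the published percentages rather than taken from Tencent's paper, so treat them as an approximation.

```python
# Back out whole-task counts from the success rates reported on the
# 150-task human evaluation. Counts are inferred, not published figures.
TOTAL_TASKS = 150

# Stage labels are shorthand for this example, not Tencent's naming.
reported_rates = {
    "SFT": 36.67,
    "real-app RL": 40.67,
    "real + mock-app RL": 45.33,
}


def implied_successes(rate_percent: float, total: int = TOTAL_TASKS) -> int:
    """Return the whole number of tasks closest to the reported rate."""
    return round(rate_percent / 100 * total)


counts = {stage: implied_successes(rate) for stage, rate in reported_rates.items()}

# Sanity check: each inferred count reproduces the reported percentage.
round_trip = {stage: round(n / TOTAL_TASKS * 100, 2) for stage, n in counts.items()}

for stage, n in counts.items():
    print(f"{stage}: {n}/{TOTAL_TASKS} tasks ({round_trip[stage]}%)")
```

Under that reading, the full recipe completes 13 more tasks out of 150 than the SFT baseline, with 6 of them coming from real-app RL and 7 from adding mock-app training.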

All numbers are Tencent's own, and the headline comparisons are worth reading closely. The team says the final model beats GPT-5.4 and Seed 2.0 Pro on the four-setting average (54.8 vs 48.2 and 51.4, respectively) but trails Gemini 3.1 Pro, which scores 59.1. Cross-app tasks are where it stalls: success actually drops to 18% after the full recipe, worse than the SFT baseline, because the mock task pool is mostly single-app.
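For a quick sense of scale, the margins on the four-setting average can be computed directly from the figures quoted above. This is a minimal sketch that assumes the averages are on the same 0 to 100 scale; the variable names are illustrative.

```python
# Margins between PhoneBuddy and the other models on Tencent's reported
# four-setting average. Positive means PhoneBuddy is ahead.
phonebuddy_avg = 54.8

other_models = {
    "GPT-5.4": 48.2,
    "Seed 2.0 Pro": 51.4,
    "Gemini 3.1 Pro": 59.1,
}

# Round to one decimal to avoid floating-point noise in the differences.
margins = {
    name: round(phonebuddy_avg - score, 1) for name, score in other_models.items()
}

for name, margin in margins.items():
    direction = "ahead of" if margin > 0 else "behind"
    print(f"PhoneBuddy is {abs(margin)} points {direction} {name}")
```

The lead over GPT-5.4 (6.6 points) is larger than the gap to Gemini 3.1 Pro (4.3 points), while the lead over Seed 2.0 Pro is 3.4 points.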

The reward setup leans on other models too. Real-app rollouts were scored using Gemini-3.1-Pro-Preview to write rubrics and a 122B Qwen model to grade them, which puts a proprietary model in the reward loop behind an "open" result. The cross-app gap is the number to watch next.


Bottom Line

PhoneBuddy's 4B model hits 83.2% on AndroidWorld and beats GPT-5.4 on Tencent's average, but cross-app success sits at just 18%.

Quick Facts

  • Model: PhoneBuddy, 4B parameters, Qwen3.5-4B backbone
  • Real-phone eval: 36.67% to 45.33% success (company-reported)
  • AndroidWorld: 60.3% to 83.2% (company-reported)
  • Four-setting average: 54.8, above GPT-5.4 (48.2), below Gemini 3.1 Pro (59.1)
  • Cross-app tasks drop to 18% after full training