Tencent's Robotics X and Hunyuan Vision teams dropped HY-Embodied-0.5 this week, a pair of vision-language models built specifically for agents that operate in the physical world. The 2B-parameter variant is open-sourced on Hugging Face with inference code. A larger 32B model handles complex reasoning but isn't publicly available yet.
The pitch: general-purpose VLMs aren't great at the stuff embodied agents actually need, like spatial perception, depth understanding, and action planning. HY-Embodied tries to close that gap. The smaller model uses a Mixture-of-Transformers architecture with 4B total parameters but only 2.2B activated during inference, so it runs at dense-2B speeds while drawing on a larger parameter pool. A latent-token scheme compresses visual input into more compact representations, which Tencent says enables finer-grained perception.
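Tencent hasn't published the routing details here, but the "4B total, 2.2B activated" split is the hallmark of architectures that keep separate parameter sets per modality while each token only flows through one of them. A toy numpy sketch of that idea (the two-modality split, sizes, and routing rule are illustrative assumptions, not the actual HY-Embodied design):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8        # toy hidden size
ffn = 4 * d  # toy feed-forward width

# One FFN weight set per modality: the *total* parameter count spans all
# of them, but each token only activates the weights for its modality.
modalities = ["text", "vision"]
W_in = {m: rng.standard_normal((d, ffn)) * 0.1 for m in modalities}
W_out = {m: rng.standard_normal((ffn, d)) * 0.1 for m in modalities}

def mot_ffn(x, token_modality):
    """Route each token through its modality-specific FFN."""
    out = np.empty_like(x)
    for i, m in enumerate(token_modality):
        h = np.maximum(x[i] @ W_in[m], 0.0)  # ReLU feed-forward
        out[i] = h @ W_out[m]
    return out

tokens = rng.standard_normal((4, d))
out = mot_ffn(tokens, ["text", "vision", "vision", "text"])

total_params = sum(W_in[m].size + W_out[m].size for m in modalities)
active_params = W_in["text"].size + W_out["text"].size  # per-token cost
print(out.shape, total_params, active_params)
```

In this toy, each token touches half the total parameters; in HY-Embodied's MoT-2B the reported ratio is 2.2B of 4B, which is why inference cost tracks a dense ~2B model rather than the full pool.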
Tencent claims the MoT-2B outperforms similarly sized models across 16 benchmarks, though these are company-reported numbers. The 32B variant reportedly matches Gemini 3.0 Pro on embodied tasks. Both claims lack independent verification so far. A technical paper details the training pipeline, which includes a self-evolving post-training loop and on-policy distillation from the large model to the small one.
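The paper's "on-policy" qualifier matters: instead of training the small model on a fixed teacher-generated dataset, the student learns on states it reaches itself, with the teacher supervising its next-token distribution there. A minimal single-step numpy sketch of that loss shape (the distributions are random stand-ins, not HY-Embodied's; real training applies this per token over full student rollouts):

```python
import numpy as np

rng = np.random.default_rng(0)
V = 5  # toy vocabulary size

def softmax(z):
    z = z - z.max()  # numerically stable softmax
    e = np.exp(z)
    return e / e.sum()

# Stand-in next-token distributions for teacher (large) and student (small).
p_teacher = softmax(rng.standard_normal(V))
p_student = softmax(rng.standard_normal(V))

# On-policy: the next token is sampled from the *student* itself, so
# supervision lands on states the student actually visits at inference.
token = rng.choice(V, p=p_student)

# Reverse KL penalizes the student for putting probability mass where
# the teacher would not; minimizing it pulls the student toward the teacher.
reverse_kl = float(np.sum(p_student * np.log(p_student / p_teacher)))
print(token, reverse_kl)
```

The design choice this illustrates: sampling from the student rather than the teacher avoids the train/inference mismatch of classic offline distillation, at the cost of needing the teacher in the loop during post-training.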
The real test is downstream. Tencent says HY-Embodied already works as a foundation for Vision-Language-Action pipelines, with results in physical robot control experiments. No pricing or API access announced. The 2B weights are available now.
Bottom Line
Tencent's open-sourced MoT-2B embodied model activates only 2.2B of its 4B total parameters during inference, running at dense-2B speeds and targeting edge robotics deployment.
Quick Facts
- MoT-2B: 4B total parameters, 2.2B activated
- 32B variant: 407B total parameters, 32B activated (not yet open-sourced)
- Outperforms peers on 16 benchmarks (company-reported)
- 32B performance comparable to Gemini 3.0 Pro (company-reported)
- Released April 9, 2026