Xiaomi's Embodied Intelligence team released OneVL this week, open-sourcing a 4B vision-language-action model for autonomous driving trajectory prediction. The team posted the project page alongside code on GitHub.
OneVL is built on Qwen3-VL-4B-Instruct. It compresses chain-of-thought reasoning into 55 latent tokens (35 visual, 20 language) and trains with two auxiliary decoders: one for language CoT, one acting as a visual world model that predicts future frames. At inference the decoders are dropped, and the latent tokens are prefilled in a single parallel pass.
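To make the training/inference split concrete, here is a minimal sketch of the token budget and of which components are active in each mode. It is not Xiaomi's code: the class, the function, and the module names (including `trajectory_head`) are hypothetical labels chosen for illustration; only the token counts and the two auxiliary decoders come from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatentBudget:
    """Latent token counts as reported for OneVL."""
    visual: int = 35
    language: int = 20

    @property
    def total(self) -> int:
        return self.visual + self.language


def active_modules(training: bool) -> list[str]:
    """Return the (hypothetically named) components used in each mode."""
    # Assumed to be always present: the VLM backbone and a trajectory output.
    modules = ["backbone", "trajectory_head"]
    if training:
        # The two auxiliary decoders supervise the latent tokens during
        # training only; they are dropped at inference.
        modules += ["language_cot_decoder", "visual_world_model_decoder"]
    return modules


budget = LatentBudget()
print(budget.total)                    # 55
print(active_modules(training=False))  # ['backbone', 'trajectory_head']
```

The point of the design, as described, is that the decoders shape what the latent tokens encode without adding any cost at inference time.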
Xiaomi reports 88.84 PDM-score on NAVSIM, ahead of 8B baselines AdaThinkDrive (86.20) and LaST-VLA (87.30). The team calls OneVL the first latent CoT method to surpass explicit CoT on driving benchmarks. Those numbers are self-reported and haven't been independently replicated.
The latency claim deserves scrutiny. OneVL's prefill takes 4.46 seconds on the test setup, roughly matching an answer-only baseline. The 0.24-second figure that has circulated belongs to a separate MLP variant, which trades accuracy (86.83 PDM-score) for real-time speed.
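A quick back-of-the-envelope check puts the two configurations side by side. The numbers are the company-reported figures quoted above; the dictionary layout and variable names are only illustrative.

```python
# Company-reported figures; not independently replicated.
main_model = {"pdm": 88.84, "latency_s": 4.46}
mlp_variant = {"pdm": 86.83, "latency_s": 0.24}

# How much faster the MLP variant is, and how much PDM-score it gives up.
speedup = main_model["latency_s"] / mlp_variant["latency_s"]
pdm_cost = main_model["pdm"] - mlp_variant["pdm"]

print(f"{speedup:.1f}x faster, {pdm_cost:.2f} PDM-score lower")
# prints "18.6x faster, 2.01 PDM-score lower"
```

In other words, the 0.24-second variant is roughly 18.6 times faster, but it scores about two points lower and falls below LaST-VLA's reported 87.30.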
The technical paper is up on arXiv. Xiaomi says it plans to fully open-source weights and codebase for outside researchers to build on.
Bottom Line
OneVL claims 88.84 PDM-score on NAVSIM with a 4B model, but the headline 0.24s latency is from a stripped-down MLP variant, not the main system.
Quick Facts
- Model size: 4B parameters, built on Qwen3-VL-4B-Instruct
- Latent tokens: 55 total (35 visual, 20 language)
- NAVSIM PDM-score: 88.84 (company-reported)
- Main model latency: 4.46s prefill
- MLP variant latency: 0.24s at 86.83 PDM-score