Microsoft on Tuesday open-sourced Phi-4-reasoning-vision-15B, a 15-billion-parameter multimodal model that pairs a SigLIP-2 vision encoder with the Phi-4-Reasoning language backbone. The model handles image captioning, document reading, math reasoning from visuals, and GUI element grounding, all under an MIT license.
The selling point is a hybrid inference system the team calls THINK/NOTHINK. Feed it a calculus problem with a diagram, and the model spins up a chain-of-thought reasoning loop. Ask it to caption a photo or read a receipt, and it skips the overhead entirely. The switching is automatic, not prompt-driven, which keeps latency low on simple tasks without sacrificing depth on hard ones.

Microsoft trained the whole thing on 240 NVIDIA B200 GPUs in four days, using roughly 200 billion multimodal tokens. Competing models from Alibaba's Qwen family and Google's Gemma series each consumed over a trillion tokens, according to the research blog. That 5x data efficiency gap is the headline number, though it rests partly on inheriting a strong pretrained language backbone.
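To make the THINK/NOTHINK idea concrete, here is a toy sketch of that kind of routing. Everything in it is an assumption for illustration: the real switch is learned inside the model and Microsoft has not published its mechanism, so a keyword score stands in for the model's internal decision.

```python
# Hypothetical illustration of automatic THINK/NOTHINK routing.
# A toy keyword score stands in for the model's learned, internal
# routing decision, which Microsoft has not documented.

REASONING_KEYWORDS = {"solve", "prove", "derivative", "integral", "calculate", "why"}

def route_request(prompt: str, threshold: int = 1) -> str:
    """Return 'THINK' for reasoning-heavy prompts, 'NOTHINK' otherwise."""
    hits = sum(
        1 for word in prompt.lower().split()
        if word.strip("?.,!") in REASONING_KEYWORDS
    )
    return "THINK" if hits >= threshold else "NOTHINK"

def answer(prompt: str) -> str:
    """Dispatch to a slow reasoning path or a fast direct path."""
    if route_request(prompt) == "THINK":
        # Slow path: chain-of-thought loop, intermediate steps, then an answer.
        return f"[THINK] step-by-step reasoning for: {prompt!r}"
    # Fast path: single pass, no reasoning trace, lower latency.
    return f"[NOTHINK] direct answer for: {prompt!r}"
```

The point of the sketch is the dispatch structure, not the classifier: simple requests never pay the chain-of-thought latency cost.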
Benchmarks are mixed, and Microsoft deserves credit for saying so. The model scored 75.2 on MathVista-MINI and 54.3 on MMMU-VAL (company-reported, greedy decoding, no prompt tuning). It beat Google's Gemma-3-12b-it by 17% on multimodal math but trails the larger Qwen3-VL-32B across the board. Where it lands on the Pareto frontier of speed versus accuracy is the more interesting open question.
Microsoft also positions the model as a perception layer for computer-use agents. It can interpret screen content, locate interactive elements, and output grounded coordinates for clicks. Weights, fine-tuning code, and evaluation logs are live on GitHub and Microsoft Foundry.
Bottom Line
Phi-4-reasoning-vision-15B trained on 5x less data than comparable models and ships under MIT, but its benchmarks still trail larger competitors like Qwen3-VL-32B.
Quick Facts
- 15 billion parameters, MIT license
- Released March 4, 2026
- Trained on 200B tokens (vs 1T+ for Qwen, Gemma competitors)
- 240 NVIDIA B200 GPUs, 4 days training time
- 75.2 on MathVista-MINI, 54.3 on MMMU-VAL (company-reported)
- 16,384 token context length
