Microsoft on Tuesday open-sourced Phi-4-reasoning-vision-15B, a 15-billion-parameter multimodal model that pairs a SigLIP-2 vision encoder with the Phi-4-Reasoning language backbone. The model handles image captioning, document reading, math reasoning from visuals, and GUI element grounding, all under an MIT license.
The selling point is a hybrid inference system the team calls THINK/NOTHINK. Feed it a calculus problem with a diagram, and the model spins up a chain-of-thought reasoning loop. Ask it to caption a photo or read a receipt, and it skips the overhead entirely. The switching is automatic, not prompt-driven, which keeps latency low on simple tasks without sacrificing depth on hard ones.

Microsoft trained the whole thing on 240 NVIDIA B200 GPUs in four days, using roughly 200 billion multimodal tokens. Competing models from Alibaba's Qwen family and Google's Gemma series each consumed over a trillion tokens, according to the research blog. That 5x data efficiency gap is the headline number, though it rests partly on inheriting a strong pretrained language backbone.
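To make the THINK/NOTHINK idea concrete, here is a toy sketch of that kind of routing. Everything in it is an assumption for illustration: the real switch is learned inside the model and Microsoft has not published its mechanism, so a keyword score stands in for the model's internal decision.

```python
# Hypothetical illustration of automatic THINK/NOTHINK routing.
# A toy keyword score stands in for the model's learned, internal
# routing decision, which Microsoft has not documented.

REASONING_KEYWORDS = {"solve", "prove", "derivative", "integral", "calculate", "why"}

def route_request(prompt: str, threshold: int = 1) -> str:
    """Return 'THINK' for reasoning-heavy prompts, 'NOTHINK' otherwise."""
    hits = sum(
        1 for word in prompt.lower().split()
        if word.strip("?.,!") in REASONING_KEYWORDS
    )
    return "THINK" if hits >= threshold else "NOTHINK"

def answer(prompt: str) -> str:
    """Dispatch to a slow reasoning path or a fast direct path."""
    if route_request(prompt) == "THINK":
        # Slow path: chain-of-thought loop, intermediate steps, then an answer.
        return f"[THINK] step-by-step reasoning for: {prompt!r}"
    # Fast path: single pass, no reasoning trace, lower latency.
    return f"[NOTHINK] direct answer for: {prompt!r}"
```

The point of the sketch is the dispatch structure, not the classifier: simple requests never pay the chain-of-thought latency cost.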
Benchmarks are mixed, and Microsoft deserves credit for saying so. The model scored 75.2 on MathVista-MINI and 54.3 on MMMU-VAL (company-reported, greedy decoding, no prompt tuning). It beat Google's Gemma-3-12b-it by 17% on multimodal math but trails the larger Qwen3-VL-32B across the board. Where it lands on the Pareto frontier of speed versus accuracy is the more interesting open question.
Microsoft also positions the model as a perception layer for computer-use agents. It can interpret screen content, locate interactive elements, and output grounded coordinates for clicks. Weights, fine-tuning code, and evaluation logs are live on GitHub and Microsoft Foundry.
Bottom Line
Phi-4-reasoning-vision-15B trained on 5x less data than comparable models and ships under MIT, but its benchmarks still trail larger competitors like Qwen3-VL-32B.
Quick Facts
- 15 billion parameters, MIT license
- Released March 4, 2026
- Trained on 200B tokens (vs 1T+ for Qwen, Gemma competitors)
- 240 NVIDIA B200 GPUs, 4 days training time
- 75.2 on MathVista-MINI, 54.3 on MMMU-VAL (company-reported)
- 16,384 token context length
