
Microsoft Releases Multimodal Phi-4 That Decides When to Think

Phi-4-reasoning-vision-15B uses 200B training tokens to match models trained on 5x more data.

Andrés Martínez, AI Content Writer
March 6, 2026 · 2 min read
[Image: Abstract visualization of a compact neural network processing text and image data streams through parallel reasoning pathways]

Microsoft on Wednesday open-sourced Phi-4-reasoning-vision-15B, a 15-billion-parameter multimodal model that pairs a SigLIP-2 vision encoder with the Phi-4-Reasoning language backbone. The model handles image captioning, document reading, math reasoning from visuals, and GUI element grounding, all under an MIT license.

The selling point is a hybrid inference system the team calls THINK/NOTHINK. Feed it a calculus problem with a diagram, and the model spins up a chain-of-thought reasoning loop. Ask it to caption a photo or read a receipt, and it skips the overhead entirely. The switching is automatic, not prompt-driven, which keeps latency low on simple tasks without sacrificing depth on hard ones.

Microsoft trained the whole thing on 240 NVIDIA B200 GPUs in four days, using roughly 200 billion multimodal tokens. Competing models from Alibaba's Qwen family and Google's Gemma series each consumed over a trillion tokens, according to the research blog. That 5x data efficiency gap is the headline number, though it rests partly on inheriting a strong pretrained language backbone.
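To picture the automatic switch, think of a lightweight router sitting in front of the decoder that chooses the expensive reasoning path only when the prompt warrants it. The sketch below is illustrative only: the keyword heuristic, function names, and output strings are assumptions for this example, not Microsoft's implementation (the real model learns the switch during training rather than matching keywords).

```python
# Illustrative THINK/NOTHINK router: decide whether a request gets an
# explicit chain-of-thought pass before the answer is produced.
# The keyword heuristic is an assumption for illustration only.

REASONING_CUES = ("solve", "prove", "calculate", "derive", "how many")

def route(prompt: str) -> str:
    """Return 'THINK' for reasoning-heavy prompts, 'NOTHINK' otherwise."""
    p = prompt.lower()
    return "THINK" if any(cue in p for cue in REASONING_CUES) else "NOTHINK"

def answer(prompt: str) -> str:
    """Dispatch to the slow reasoning path or the fast direct path."""
    if route(prompt) == "THINK":
        # Expensive path: generate a reasoning trace before the answer.
        return f"[chain-of-thought] ... final answer for: {prompt}"
    # Cheap path: direct decoding, no reasoning tokens spent.
    return f"direct answer for: {prompt}"
```

The point of the design is that the caller never sets the mode; latency-sensitive captioning requests skip the reasoning loop without any prompt engineering.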

Benchmarks are mixed, and Microsoft deserves credit for saying so. The model scored 75.2 on MathVista-MINI and 54.3 on MMMU-VAL (company-reported, greedy decoding, no prompt tuning). It beats Google's Gemma-3-12b-it by 17% on multimodal math but trails the larger Qwen3-VL-32B across the board. Where it lands on the Pareto frontier of speed versus accuracy is the more interesting claim.

Microsoft also positions the model as a perception layer for computer-use agents. It can interpret screen content, locate interactive elements, and output grounded coordinates for clicks. Weights, fine-tuning code, and evaluation logs are live on GitHub and Microsoft Foundry.


Bottom Line

Phi-4-reasoning-vision-15B trained on 5x less data than comparable models and ships under MIT, but its benchmarks still trail larger competitors like Qwen3-VL-32B.

Quick Facts

  • 15 billion parameters, MIT license
  • Released March 4, 2026
  • Trained on 200B tokens (vs 1T+ for Qwen, Gemma competitors)
  • 240 NVIDIA B200 GPUs, 4 days training time
  • 75.2 on MathVista-MINI, 54.3 on MMMU-VAL (company-reported)
  • 16,384 token context length
Tags: Microsoft · Phi-4 · multimodal AI · open-weight models · computer vision · reasoning models · SLM
Andrés Martínez

AI Content Writer

Andrés reports on the AI stories that matter right now. No hype, just clear, daily coverage of the tools, trends, and developments changing industries in real time. He makes the complex feel routine.


