Microsoft rolled out MAI-Transcribe-1.5 at Build 2026, the second version of its in-house speech-to-text model. The model page puts accuracy at 2.4% word error rate on the Artificial Analysis benchmark and 4.9% averaged across 43 languages on FLEURS. Both figures are self-reported.
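For readers unfamiliar with the metric, word error rate is the number of word-level substitutions, deletions, and insertions needed to turn the model's transcript into the reference transcript, divided by the number of words in the reference. The sketch below is a minimal illustration of that definition, not the scoring code used by Artificial Analysis or FLEURS. The function name `word_error_rate` is hypothetical, and it assumes plain whitespace tokenization, whereas benchmarks typically normalize casing and punctuation first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER: word-level edit distance divided by reference length.

    Assumption: words are split on whitespace with no text normalization.
    """
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference transcript must contain at least one word")

    # Levenshtein distance over words, keeping only one previous row.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = previous[j] + 1       # reference word missing from hypothesis
            insertion = current[j - 1] + 1   # extra word in hypothesis
            current.append(min(substitution, deletion, insertion))
        previous = current

    return previous[-1] / len(ref_words)


# One wrong word out of six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

Under this definition, a 2.4% WER corresponds to roughly 24 word errors per 1,000 reference words.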
The main addition in this release is contextual biasing, which Microsoft calls domain-aware transcription. It nudges the model toward proper names, medical terms, and other industry jargon that generic models tend to mangle. The first version, shipped in April, didn't have it.
Language coverage jumped from 25 to 43, adding Bulgarian, Greek, Ukrainian, and a batch of Indian languages including Bengali, Tamil, and Telugu. Pricing stays at $0.36 per hour of audio.
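To put the price in concrete terms, here is a back-of-the-envelope estimate based only on the quoted $0.36 per audio hour. The function name `estimate_cost_usd` is hypothetical, and the sketch assumes cost scales linearly with audio duration. Billing increments, rounding, and any volume tiers are not covered by the source and are ignored here.

```python
PRICE_PER_AUDIO_HOUR_USD = 0.36  # figure quoted in the article


def estimate_cost_usd(audio_seconds: float,
                      price_per_hour: float = PRICE_PER_AUDIO_HOUR_USD) -> float:
    """Estimate transcription cost, assuming purely linear per-hour pricing."""
    if audio_seconds < 0:
        raise ValueError("audio_seconds must be non-negative")
    return audio_seconds / 3600 * price_per_hour


# A 45-minute recording and a 100-hour archive.
print(f"${estimate_cost_usd(45 * 60):.2f}")
print(f"${estimate_cost_usd(100 * 3600):.2f}")
```

At that rate, an hour of audio costs 36 cents and a 100-hour archive about $36.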
On speed, Microsoft's spec sheet lists a 5.7X latency figure rather than the throughput numbers in the original report. Independent testing tells a more mixed story: Artificial Analysis ranked the previous model, MAI-Transcribe-1, fourth overall on its accuracy leaderboard, where Alibaba's Fun-Realtime-ASR-preview (1.8%) and ElevenLabs Scribe v2 (2.2%) lead the field. The 1.5 release hasn't been independently scored there yet.
Microsoft says streaming support is coming soon but has not given a date. The model is available now in Microsoft Foundry via the Azure Speech API.
Bottom Line
MAI-Transcribe-1.5 now covers 43 languages and adds term biasing, at $0.36 per hour of audio.
Quick Facts
- 43 languages supported, up from 25 in version 1
- 2.4% WER on Artificial Analysis benchmark (company-reported)
- 4.9% average WER across 43 languages on FLEURS (company-reported)
- Pricing: $0.36 per hour of audio
- Launched at Microsoft Build 2026; streaming support pending