Microsoft dropped three proprietary AI models today, covering transcription, voice generation, and image creation. All three are available through Microsoft Foundry and a new MAI Playground. It is the company's first major model release since Satya Nadella reorganized the AI division under Mustafa Suleyman earlier this year.
The headline model is MAI-Transcribe-1, a speech-to-text system supporting 25 languages. Microsoft says it posts the lowest word error rate (3.8%) on the FLEURS benchmark, beating Whisper-large-v3, GPT-Transcribe, and Gemini Flash on most tested languages. Batch transcription runs 2.5x faster than Azure's existing Fast tier. Pricing sits at $0.36 per hour of audio, which Microsoft calls the best price-performance among large clouds. Suleyman called it "not just the most accurate but also lightning fast," though the benchmarks are self-reported and independent testing hasn't confirmed them yet.
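For readers unfamiliar with the metric, word error rate is conventionally the number of word substitutions, deletions, and insertions needed to turn a transcript into the reference text, divided by the number of reference words. A 3.8% rate therefore means roughly 38 errors per 1,000 reference words. The sketch below illustrates that standard definition only; the function name is hypothetical, and it is not Microsoft's or FLEURS's evaluation pipeline, which would typically also normalize casing and punctuation before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (classic Levenshtein dynamic programming).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One dropped word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because insertions count as errors, the rate can exceed 100% when a transcript is much longer than the reference.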
MAI-Voice-1 handles text-to-speech: it generates 60 seconds of audio in one second and supports custom voice cloning from a few seconds of sample audio. It costs $22 per million characters. MAI-Image-2, meanwhile, ranks in the top three on the Arena.ai leaderboard and runs 2x faster than its predecessor. It is rolling out inside Bing and PowerPoint, priced at $5 per million input tokens and $33 per million image output tokens. WPP, the advertising giant, is already using it for production creative work.
The timing is pointed. Until October 2025, Microsoft was contractually barred from independently pursuing AGI-level AI. Now the company is building its own multimodal stack, competing directly with OpenAI, Google, and ElevenLabs. The models ship with enterprise safety guardrails, and MAI Playground is currently US-only. Integration of MAI-Transcribe-1 with Copilot and Teams is being tested.
Bottom Line
Microsoft now sells its own speech-to-text, voice cloning, and image generation models through Foundry, directly competing with partners like OpenAI on price and speed.
Quick Facts
- MAI-Transcribe-1: 25 languages, 3.8% word error rate (company-reported), $0.36/hour
- MAI-Voice-1: 60 seconds of audio in 1 second, $22 per 1M characters
- MAI-Image-2: Top 3 on Arena.ai, $5/1M input tokens + $33/1M image output tokens
- Batch transcription 2.5x faster than Azure Fast tier
- WPP is the first enterprise partner using MAI-Image-2 for production ads
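The listed prices lend themselves to a quick back-of-the-envelope estimate. The sketch below hard-codes the figures reported in this article; the function names and the sample workload are hypothetical, and actual Foundry billing may differ (for example through tiers, minimums, or regional pricing).

```python
# Prices as reported in the article (USD). Treat them as assumptions and
# check current Foundry pricing before relying on any estimate.
TRANSCRIBE_USD_PER_AUDIO_HOUR = 0.36
VOICE_USD_PER_MILLION_CHARS = 22.0
IMAGE_USD_PER_MILLION_INPUT_TOKENS = 5.0
IMAGE_USD_PER_MILLION_OUTPUT_TOKENS = 33.0


def transcription_cost(audio_hours: float) -> float:
    """Estimated MAI-Transcribe-1 cost for a given number of audio hours."""
    return audio_hours * TRANSCRIBE_USD_PER_AUDIO_HOUR


def voice_cost(characters: int) -> float:
    """Estimated MAI-Voice-1 cost for a given number of input characters."""
    return characters / 1_000_000 * VOICE_USD_PER_MILLION_CHARS


def image_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated MAI-Image-2 cost from input and image output token counts."""
    return (
        input_tokens / 1_000_000 * IMAGE_USD_PER_MILLION_INPUT_TOKENS
        + output_tokens / 1_000_000 * IMAGE_USD_PER_MILLION_OUTPUT_TOKENS
    )


if __name__ == "__main__":
    # Hypothetical monthly workload, for illustration only.
    print(f"1,000 hours of audio: ${transcription_cost(1_000):,.2f}")
    print(f"5M characters of speech: ${voice_cost(5_000_000):,.2f}")
    print(
        "2M input + 10M image output tokens: "
        f"${image_cost(2_000_000, 10_000_000):,.2f}"
    )
```

At these rates, image output tokens dominate the MAI-Image-2 bill, so output volume matters far more than prompt length when estimating spend.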




