Microsoft Launches 3 MAI Models for Foundry Developers


Microsoft dropped three proprietary AI models today, covering transcription, voice generation, and image creation. All three are available through Microsoft Foundry and a new MAI Playground. It is the company's first major model release since Satya Nadella reorganized the AI division under Mustafa Suleyman earlier this year.

The headline model is MAI-Transcribe-1, a speech-to-text system supporting 25 languages. Microsoft says it posts the lowest word error rate (3.8%) on the FLEURS benchmark, beating Whisper-large-v3, GPT-Transcribe, and Gemini Flash on most tested languages. Batch transcription runs 2.5x faster than Azure's existing Fast tier. Pricing sits at $0.36 per hour of audio, which Microsoft calls the best price-performance among large clouds. Suleyman called it "not just the most accurate but also lightning fast," though the benchmarks are self-reported and independent testing hasn't confirmed them yet.

MAI-Voice-1 handles text-to-speech: it generates 60 seconds of audio in one second and supports custom voice cloning from a few seconds of sample audio. It is priced at $22 per million characters. MAI-Image-2, meanwhile, ranks in the top three on the Arena.ai leaderboard and runs 2x faster than its predecessor. It is rolling out inside Bing and PowerPoint, priced at $5 per million input tokens and $33 per million image output tokens. WPP, the advertising giant, is already using it for production creative work.

The timing is pointed. Until October 2025, Microsoft was contractually barred from independently pursuing AGI-level AI. Now the company is building its own multimodal stack, competing directly with OpenAI, Google, and ElevenLabs. The models ship with enterprise safety guardrails, and MAI Playground is currently US-only. Integration with Copilot and Teams is being tested for Transcribe-1.

Bottom Line

Microsoft now sells its own speech-to-text, voice cloning, and image generation models through Foundry, directly competing with partners like OpenAI on price and speed.

Quick Facts

MAI-Transcribe-1: 25 languages, 3.8% word error rate (company-reported), $0.36/hour
MAI-Voice-1: 60 seconds of audio in 1 second, $22 per 1M characters
MAI-Image-2: Top 3 on Arena.ai, $5/1M input tokens + $33/1M image tokens
Batch transcription 2.5x faster than Azure Fast tier
WPP is the first enterprise partner using MAI-Image-2 for production ads
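
For developers budgeting against these list prices, the arithmetic can be sketched in a few lines of Python. This is a rough estimator only: the rates are the ones quoted in this article, while the function names and the sample workload are hypothetical, and actual Foundry billing may differ (rounding, minimums, or tiers are not covered here).

```python
# Rough cost estimator using the list prices quoted in this article (USD).
# Function names and the sample workload are hypothetical illustrations,
# not part of any Microsoft SDK.

TRANSCRIBE_USD_PER_AUDIO_HOUR = 0.36        # MAI-Transcribe-1
VOICE_USD_PER_MILLION_CHARS = 22.0          # MAI-Voice-1
IMAGE_USD_PER_MILLION_INPUT_TOKENS = 5.0    # MAI-Image-2, input tokens
IMAGE_USD_PER_MILLION_OUTPUT_TOKENS = 33.0  # MAI-Image-2, image output tokens


def transcribe_cost(audio_hours: float) -> float:
    """Cost of transcribing the given number of audio hours."""
    return audio_hours * TRANSCRIBE_USD_PER_AUDIO_HOUR


def voice_cost(characters: int) -> float:
    """Cost of synthesizing speech for the given number of characters."""
    return characters / 1_000_000 * VOICE_USD_PER_MILLION_CHARS


def image_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of image generation, billed separately on input and output tokens."""
    return (
        input_tokens / 1_000_000 * IMAGE_USD_PER_MILLION_INPUT_TOKENS
        + output_tokens / 1_000_000 * IMAGE_USD_PER_MILLION_OUTPUT_TOKENS
    )


if __name__ == "__main__":
    # Hypothetical monthly workload.
    print(f"Transcription, 1,000 audio hours: ${transcribe_cost(1_000):.2f}")
    print(f"Voice, 5M characters: ${voice_cost(5_000_000):.2f}")
    print(f"Images, 2M input + 1M output tokens: ${image_cost(2_000_000, 1_000_000):.2f}")
```

At these rates, 1,000 hours of audio comes to $360, five million characters of speech to $110, and an image workload of two million input tokens plus one million output tokens to $43.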

Tags: Microsoft, MAI, speech-to-text, text-to-speech, image generation, Microsoft Foundry, Mustafa Suleyman

Andrés Martínez

AI Content Writer

Andrés reports on the AI stories that matter right now. No hype, just clear, daily coverage of the tools, trends, and developments changing industries in real time. He makes the complex feel routine.
