
Microsoft Launches Three In-House AI Models on Foundry

MAI-Transcribe-1, MAI-Voice-1, and MAI-Image-2 go live for commercial use.

Andrés Martínez, AI Content Writer
April 2, 2026

Microsoft dropped three proprietary AI models today, covering transcription, voice generation, and image creation. All three are available through Microsoft Foundry and a new MAI Playground. It is the company's first major model release since Satya Nadella reorganized the AI division under Mustafa Suleyman earlier this year.

The headline model is MAI-Transcribe-1, a speech-to-text system supporting 25 languages. Microsoft says it posts the lowest word error rate (3.8%) on the FLEURS benchmark, beating Whisper-large-v3, GPT-Transcribe, and Gemini Flash on most tested languages. Batch transcription runs 2.5x faster than Azure's existing Fast tier. Pricing sits at $0.36 per hour of audio, which Microsoft calls the best price-performance among large clouds. Suleyman called it "not just the most accurate but also lightning fast," though the benchmarks are self-reported and independent testing hasn't confirmed them yet.
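
For readers unfamiliar with the metric: word error rate is conventionally the number of word-level substitutions, deletions, and insertions needed to turn a system's transcript into the reference transcript, divided by the number of words in the reference. The Python sketch below illustrates that standard definition only. It is not Microsoft's evaluation code, and the function name and the naive whitespace tokenization are assumptions made for illustration (benchmark scoring typically normalizes casing and punctuation first).

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference word count.

    Illustrative only: tokenization is a plain whitespace split, which is
    an assumption and not how any particular benchmark normalizes text.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so
    # far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)
```

Read this way, a 3.8% rate corresponds to roughly 38 word-level errors per 1,000 reference words.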

MAI-Voice-1 handles text-to-speech: it generates 60 seconds of audio in one second and supports custom voice cloning from a few seconds of sample audio, at $22 per million characters. MAI-Image-2, meanwhile, ranks in the top three on the Arena.ai leaderboard and runs 2x faster than its predecessor. It is rolling out inside Bing and PowerPoint, priced at $5 per million input tokens and $33 per million image output tokens. WPP, the advertising giant, is already using it for production creative work.
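
To make the per-unit prices easier to compare, here is a rough cost calculator in Python. It is a sketch under stated assumptions: the rates are the ones reported in this article (not checked against an official price sheet), the function and constant names are hypothetical, and the example workload is invented for illustration. How many tokens a single image consumes is not stated, so the image function takes token counts directly.

```python
# Prices as reported in this article (USD). They are not checked against
# an official price sheet and may change.
TRANSCRIBE_USD_PER_AUDIO_HOUR = 0.36
VOICE_USD_PER_MILLION_CHARS = 22.0
IMAGE_USD_PER_MILLION_INPUT_TOKENS = 5.0
IMAGE_USD_PER_MILLION_OUTPUT_TOKENS = 33.0


def transcribe_cost(audio_hours: float) -> float:
    """Estimated MAI-Transcribe-1 cost for a number of audio hours."""
    return audio_hours * TRANSCRIBE_USD_PER_AUDIO_HOUR


def voice_cost(characters: int) -> float:
    """Estimated MAI-Voice-1 cost for a number of input characters."""
    return characters / 1_000_000 * VOICE_USD_PER_MILLION_CHARS


def image_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated MAI-Image-2 cost from input and image output token counts."""
    return (
        input_tokens / 1_000_000 * IMAGE_USD_PER_MILLION_INPUT_TOKENS
        + output_tokens / 1_000_000 * IMAGE_USD_PER_MILLION_OUTPUT_TOKENS
    )


if __name__ == "__main__":
    # Hypothetical workload, chosen only for illustration.
    print(f"1,000 audio hours transcribed: ${transcribe_cost(1_000):,.2f}")
    print(f"2,000,000 characters spoken:   ${voice_cost(2_000_000):,.2f}")
    print(f"1M input + 1M output tokens:   ${image_cost(1_000_000, 1_000_000):,.2f}")
```

At these rates, 1,000 hours of transcribed audio comes to $360, and two million characters of synthesized speech to $44.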

The timing is pointed. Until October 2025, Microsoft was contractually barred from independently pursuing AGI-level AI. Now the company is building its own multimodal stack, competing directly with OpenAI, Google, and ElevenLabs. The models ship with enterprise safety guardrails, and MAI Playground is currently US-only. Integration with Copilot and Teams is being tested for Transcribe-1.


Bottom Line

Microsoft now sells its own speech-to-text, voice cloning, and image generation models through Foundry, directly competing with partners like OpenAI on price and speed.

Quick Facts

  • MAI-Transcribe-1: 25 languages, 3.8% word error rate (company-reported), $0.36/hour
  • MAI-Voice-1: 60 seconds of audio in 1 second, $22 per 1M characters
  • MAI-Image-2: Top 3 on Arena.ai, $5/1M input tokens + $33/1M image output tokens
  • Batch transcription 2.5x faster than Azure Fast tier
  • WPP is first enterprise partner using MAI-Image-2 for production ads
Tags: Microsoft, MAI, speech-to-text, text-to-speech, image generation, Microsoft Foundry, Mustafa Suleyman


Microsoft Launches 3 MAI Models for Foundry Developers | aiHola