OpenAI Adds Three Voice Models to Realtime API

GPT-Realtime-2 brings GPT-5-class reasoning, 128K context, and tiered reasoning effort to voice.

Andrés Martínez, AI Content Writer
May 8, 2026

OpenAI shipped three voice models to its Realtime API on Thursday, all built for live conversation rather than batch transcription. The headline release is GPT-Realtime-2, which the company's announcement calls its first voice model with GPT-5-class reasoning. Context jumps from 32K to 128K tokens. Reasoning effort is adjustable across five tiers from minimal to xhigh, and the model can fire parallel tool calls while saying things like "let me check that" so the line doesn't go dead while it works.

The other two are narrower. GPT-Realtime-Translate handles 70+ input languages into 13 output languages live. GPT-Realtime-Whisper is a streaming speech-to-text model meant for captions and meeting notes as the speaker talks, not post-recorded audio.

Pricing splits by model. Realtime-2 stays on per-token billing at $32 per million input audio tokens and $64 per million for output, with cached input at $0.40 per million. Translate runs $0.034 per minute and Whisper $0.017 per minute, flat rates that are easier to forecast than per-token billing.
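To make the pricing comparison concrete, here is a minimal cost sketch that uses only the rates quoted above. The function names and the treatment of cached tokens as a subset of input tokens are illustrative assumptions, not part of OpenAI's API, and real bills depend on how audio maps to tokens, which the quoted figures do not specify.

```python
# Rates quoted in this article, in USD. Everything else is illustrative.
REALTIME2_INPUT_PER_M = 32.00   # per 1M input audio tokens
REALTIME2_OUTPUT_PER_M = 64.00  # per 1M output tokens
REALTIME2_CACHED_PER_M = 0.40   # per 1M cached input tokens
TRANSLATE_PER_MIN = 0.034       # GPT-Realtime-Translate, per minute
WHISPER_PER_MIN = 0.017         # GPT-Realtime-Whisper, per minute


def realtime2_cost(input_tokens: int, output_tokens: int, cached_tokens: int = 0) -> float:
    """Estimate GPT-Realtime-2 cost in USD from token counts.

    Assumption: `cached_tokens` is the part of `input_tokens` billed at
    the cached rate instead of the full input rate.
    """
    if cached_tokens > input_tokens:
        raise ValueError("cached_tokens cannot exceed input_tokens")
    fresh_tokens = input_tokens - cached_tokens
    total = (
        fresh_tokens * REALTIME2_INPUT_PER_M
        + cached_tokens * REALTIME2_CACHED_PER_M
        + output_tokens * REALTIME2_OUTPUT_PER_M
    )
    return total / 1_000_000


def per_minute_cost(minutes: float, rate_per_min: float) -> float:
    """Estimate cost in USD for the per-minute models."""
    return minutes * rate_per_min
```

Under these assumptions, an hour of audio costs about $2.04 on Translate and about $1.02 on Whisper, while the Realtime-2 total depends on token counts that have to be measured per workload.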

OpenAI's self-reported benchmarks show Realtime-2 (high) scoring 15.2% above Realtime-1.5 on Big Bench Audio, and the xhigh variant scoring 13.8% higher on Audio MultiChallenge for instruction-following. Zillow, an early tester, reports a 26-point jump in call success rate (95% vs. 69%) on what it calls its hardest adversarial benchmark, after prompt tuning. Both sets of figures come from interested parties and have not been independently verified.
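The Zillow number is a percentage-point change, which is not the same thing as a relative percentage gain. The short sketch below shows the difference using the 95% and 69% figures from the article; the helper names are illustrative. The article does not say which of the two readings applies to OpenAI's own 15.2% and 13.8% figures.

```python
def point_change(before_pct: float, after_pct: float) -> float:
    """Absolute change in percentage points."""
    return after_pct - before_pct


def relative_change(before_pct: float, after_pct: float) -> float:
    """Change relative to the starting value, expressed as a percentage."""
    return (after_pct - before_pct) / before_pct * 100


# Zillow's reported call success rates: 69% before, 95% after.
points = point_change(69, 95)        # 26 percentage points
relative = relative_change(69, 95)   # roughly 37.7% relative improvement
```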

All three models are live in the API now, documented in the Realtime developer guide. EU data residency is supported.


Bottom Line

GPT-Realtime-2 quadruples the context window to 128K tokens and stays at $32/$64 per million audio input/output tokens.

Quick Facts

  • Three models: GPT-Realtime-2, GPT-Realtime-Translate, GPT-Realtime-Whisper
  • Context window: up from 32K to 128K tokens
  • GPT-Realtime-2 pricing: $32/1M input audio tokens, $64/1M output, $0.40 cached input
  • Translate: $0.034/min; Whisper: $0.017/min
  • Translate supports 70+ input languages, 13 output languages (company-reported)
  • Zillow reports 95% vs. 69% call success rate on its own benchmark (unverified)

OpenAI Launches GPT-Realtime-2 Voice Model in API | aiHola