NVIDIA Audio Flamingo Next: Open Model Beats Gemini on Audio


NVIDIA and the University of Maryland have released Audio Flamingo Next (AF-Next), the latest in their open audio-language model series, trained on roughly 108 million samples totaling about 1 million hours of audio. The technical paper appeared on arXiv on April 13. Three model variants are available on Hugging Face: Instruct for general audio Q&A and transcription, Think for multi-step reasoning, and Captioner for detailed audio descriptions.

The headline number: on LongAudioBench, AF-Next-Instruct scores 73.9 compared to Gemini 2.5 Pro's 60.4, per the team's own benchmarks. That gap widens to 81.2 vs. 66.2 when speech tasks are included. On MMAU-Pro, the Think variant edges out Gemini at 58.7 to 57.4. For speech recognition, AF-Next posts a 1.54 word error rate on LibriSpeech test-clean, the lowest reported among open large audio-language models. All figures are self-reported and have not been independently replicated.
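For readers unfamiliar with the metric, word error rate is the word-level edit distance between a reference transcript and the model's output, divided by the number of reference words. The sketch below is a generic illustration, not the team's evaluation code. It assumes the 1.54 figure is a percentage, as is conventional for LibriSpeech results, and it skips the text normalization (casing, punctuation) that real evaluations apply. The function name `word_error_rate` is hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER as a percentage, using word-level Levenshtein distance.

    Assumption: whitespace tokenization and no text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return 100.0 * prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

Under that reading, a WER of 1.54 corresponds to roughly one or two word errors per hundred reference words.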

Architecturally, AF-Next pairs a custom Whisper-based encoder (AF-Whisper) and a two-layer MLP adapter with Qwen-2.5-7B, extended to a 128K token context window. The key design choice is Rotary Time Embeddings, which tie each token's positional encoding to its actual timestamp in the audio rather than its sequence position. This lets the model reason about temporal relationships across recordings up to 30 minutes long, triple the 10-minute cap of its predecessor, Audio Flamingo 3.
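To make the timestamp idea concrete, here is a minimal sketch that takes the standard rotary position embedding and swaps the integer sequence index for a timestamp in seconds. This is an assumption about the general shape of the technique, not the formulation in the AF-Next paper, which the article does not spell out. The names `rotate_by_time` and `dot`, the base of 10000, and the use of raw seconds as the position unit are all illustrative choices.

```python
import math


def rotate_by_time(vec, t_seconds, base=10000.0):
    """Rotate consecutive pairs of `vec` by angles proportional to a timestamp.

    Standard rotary embeddings use the token's sequence index here;
    this sketch substitutes the token's time in seconds (an assumption).
    """
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("vector length must be even")
    out = []
    for i in range(0, d, 2):
        freq = base ** (-i / d)  # one rotation frequency per pair
        angle = t_seconds * freq
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    """Plain dot product, standing in for an attention score."""
    return sum(x * y for x, y in zip(a, b))


if __name__ == "__main__":
    q = [1.0, 0.0, 0.5, -0.5]
    k = [0.3, 0.7, -0.2, 0.9]
    # Same 2-second gap at different points in the recording.
    early = dot(rotate_by_time(q, 3.0), rotate_by_time(k, 5.0))
    late = dot(rotate_by_time(q, 100.0), rotate_by_time(k, 102.0))
    print(early, late)
```

The property worth noticing is that the score between two rotated vectors depends only on the time gap between them, so two events two seconds apart are related the same way whether they occur at the start of a recording or near the end, regardless of how many tokens sit between them.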

The reasoning variant introduces what the team calls Temporal Audio Chain-of-Thought: each intermediate reasoning step is anchored to a specific timestamp before the model produces a final answer. That is useful for pulling together facts scattered across long meetings or podcast episodes. Code is on GitHub, and a captioner demo is live. The release is licensed for noncommercial use only, under the NVIDIA OneWay Noncommercial license.

Bottom Line

AF-Next processes audio up to 30 minutes long and reports a word error rate of 1.54 on LibriSpeech test-clean, which the team says is the lowest among open large audio-language models.

Quick Facts

Training data: ~108 million samples, ~1 million hours of audio
Max audio input length: 30 minutes (up from 10 minutes in AF3)
LongAudioBench score: 73.9 vs. Gemini 2.5 Pro's 60.4 (self-reported)
LibriSpeech test-clean WER: 1.54 (self-reported, lowest among open LALMs)
License: NVIDIA OneWay Noncommercial

Tags: NVIDIA, audio AI, large language models, speech recognition, multimodal AI, open source AI, Audio Flamingo

Andrés Martínez

AI Content Writer

Andrés reports on the AI stories that matter right now. No hype, just clear, daily coverage of the tools, trends, and developments changing industries in real time. He makes the complex feel routine.
