LLMs & Foundation Models

NVIDIA Releases Audio Flamingo Next, an Open Audio-Language Model

AF-Next handles 30-minute audio inputs and beats Gemini 2.5 Pro on long-audio benchmarks.

Andrés Martínez, AI Content Writer
April 16, 2026 · 2 min read

NVIDIA and the University of Maryland dropped Audio Flamingo Next (AF-Next), the latest in their open audio-language model series, trained on roughly 108 million samples across 1 million hours of audio. The technical paper appeared on arXiv April 13. Three model variants are available on Hugging Face: Instruct for general audio Q&A and transcription, Think for multi-step reasoning, and Captioner for detailed audio descriptions.

The headline number: on LongAudioBench, AF-Next-Instruct scores 73.9 against Gemini 2.5 Pro's 60.4. That gap widens to 81.2 vs. 66.2 when speech tasks are included. On MMAU-Pro, the Think variant edges out Gemini at 58.7 to 57.4. For speech recognition, AF-Next posts a 1.54 word error rate on LibriSpeech test-clean, the lowest reported among open large audio-language models. All figures come from the team's own benchmarks and have not been independently replicated.

Architecturally, AF-Next stacks a custom Whisper-based encoder (AF-Whisper) with a two-layer MLP adapter feeding into Qwen-2.5-7B, extended to a 128K token context window. The key design choice is Rotary Time Embeddings, which tie each token's positional encoding to its actual timestamp in the audio rather than its sequence position. This lets the model reason about temporal relationships across recordings up to 30 minutes long, triple the 10-minute cap of its predecessor Audio Flamingo 3.
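The paper's exact formulation isn't reproduced here, but the core idea of Rotary Time Embeddings can be sketched as standard rotary position embeddings (RoPE) with each token's timestamp in seconds substituted for its integer sequence position. A minimal NumPy sketch, under that assumption (the function name and frequency scheme are illustrative, not NVIDIA's implementation):

```python
import numpy as np

def rotary_time_embedding(x, timestamps, base=10000.0):
    """Rotate feature pairs by angles keyed to each token's timestamp
    (in seconds) instead of its position in the sequence.

    x: (seq_len, dim) array with dim even; timestamps: (seq_len,) seconds.
    """
    seq_len, dim = x.shape
    assert dim % 2 == 0, "feature dimension must be even to form pairs"
    half = dim // 2
    # Geometric frequency ladder, as in standard RoPE
    inv_freq = 1.0 / (base ** (np.arange(half) / half))
    # Each token's rotation angle is timestamp * frequency, not index * frequency
    angles = np.outer(timestamps, inv_freq)          # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    # 2-D rotation applied to each (x1[i], x2[i]) pair
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=1)
```

Because the angle depends on wall-clock time, two sounds one second apart get the same relative rotation whether they occur at minute 1 or minute 29, which is what lets attention generalize over long recordings.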

The reasoning variant introduces what the team calls Temporal Audio Chain-of-Thought: each intermediate reasoning step is anchored to a specific timestamp before the model produces a final answer. That's useful for pulling facts scattered across long meetings or podcast episodes. Code is on GitHub, and a captioner demo is live. Licensing is noncommercial only, under NVIDIA's OneWay license.
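The paper defines its own trace format, which isn't quoted here. As a purely hypothetical illustration of what timestamp-anchored reasoning steps look like (the step content and formatting are invented for this example):

```python
# Hypothetical timestamp-anchored reasoning trace, in the spirit of
# Temporal Audio Chain-of-Thought; not AF-Next's actual output format.
steps = [
    {"t": "02:14", "note": "Speaker A proposes moving the launch to Q3"},
    {"t": "11:47", "note": "Speaker B objects, citing the budget freeze"},
    {"t": "27:03", "note": "Team agrees on a phased Q3 rollout"},
]

def render_trace(steps, answer):
    """Prefix every reasoning step with the timestamp it is anchored to,
    then append the final answer."""
    lines = [f"[{s['t']}] {s['note']}" for s in steps]
    lines.append(f"Answer: {answer}")
    return "\n".join(lines)

print(render_trace(steps, "A phased Q3 rollout was agreed."))
```

The point of anchoring is auditability: a reader can jump to 11:47 in the recording and check the claim, rather than trusting a free-floating summary.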


Bottom Line

AF-Next processes audio up to 30 minutes and reports the lowest word error rate (1.54) among open large audio-language models on LibriSpeech.

Quick Facts

  • Training data: ~108 million samples, ~1 million hours of audio
  • Max audio input length: 30 minutes (up from 10 minutes in AF3)
  • LongAudioBench score: 73.9 vs. Gemini 2.5 Pro's 60.4 (self-reported)
  • LibriSpeech test-clean WER: 1.54 (self-reported, lowest among open LALMs)
  • License: NVIDIA OneWay Noncommercial
Tags: NVIDIA, audio AI, large language models, speech recognition, multimodal AI, open source AI, Audio Flamingo
Andrés Martínez

AI Content Writer

Andrés reports on the AI stories that matter right now. No hype, just clear, daily coverage of the tools, trends, and developments changing industries in real time. He makes the complex feel routine.

