NVIDIA and the University of Maryland have released Audio Flamingo Next (AF-Next), the latest in their open audio-language model series, trained on roughly 108 million samples spanning about 1 million hours of audio. The technical paper appeared on arXiv on April 13. Three model variants are available on Hugging Face: Instruct for general audio Q&A and transcription, Think for multi-step reasoning, and Captioner for detailed audio descriptions.
The headline number: on LongAudioBench, AF-Next-Instruct scores 73.9 compared to Gemini 2.5 Pro's 60.4, per the team's own benchmarks. That gap widens from 13.5 to 15.0 points (81.2 vs. 66.2) when speech tasks are included. On MMAU-Pro, the Think variant edges out Gemini at 58.7 to 57.4. For speech recognition, AF-Next posts a 1.54% word error rate (WER) on LibriSpeech test-clean, the lowest reported among open large audio-language models. All figures are self-reported and have not been independently replicated.
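For readers unfamiliar with the metric, word error rate is the word-level edit distance between a reference transcript and the model's output, divided by the number of reference words. The sketch below is a generic illustration of that definition, not the AF-Next evaluation code; the function name `word_error_rate` and the whitespace tokenization are assumptions, and real evaluations usually normalize text (case, punctuation) first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One dropped word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Read this way, and assuming the reported 1.54 is a percentage as is customary for LibriSpeech results, the figure corresponds to roughly one or two word errors per hundred reference words.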
Architecturally, AF-Next stacks a custom Whisper-based encoder (AF-Whisper) with a two-layer MLP adapter feeding into Qwen-2.5-7B, extended to a 128K-token context window. The key design choice is Rotary Time Embeddings, which tie each token's positional encoding to its actual timestamp in the audio rather than its sequence position. This lets the model reason about temporal relationships across recordings up to 30 minutes long, triple the 10-minute cap of its predecessor, Audio Flamingo 3.
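The idea behind time-based rotary embeddings can be illustrated with a minimal sketch: take a standard rotary position embedding and drive the rotation angle with a timestamp in seconds instead of a token index. This is not the AF-Next implementation; the function `rotate_by_time`, the frequency base of 10000, and the use of raw seconds as the position unit are all assumptions made for illustration.

```python
import math


def rotate_by_time(vec, t_seconds, base=10000.0):
    """Rotate consecutive pairs of `vec` by angles proportional to a timestamp.

    Hypothetical helper: standard rotary-embedding math, with the position
    replaced by the audio timestamp (in seconds) of the token.
    """
    d = len(vec)
    if d % 2:
        raise ValueError("vector length must be even")
    out = []
    for k in range(0, d, 2):
        # Each pair gets its own frequency, as in ordinary rotary embeddings.
        theta = t_seconds * base ** (-k / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[k], vec[k + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    """Plain dot product, standing in for an attention score."""
    return sum(x * y for x, y in zip(a, b))


if __name__ == "__main__":
    q = [1.0, 0.5, -0.3, 2.0]
    k = [0.2, -1.0, 0.7, 0.4]
    # Same 3.5-second gap, early versus late in a recording.
    early = dot(rotate_by_time(q, 12.0), rotate_by_time(k, 15.5))
    late = dot(rotate_by_time(q, 612.0), rotate_by_time(k, 615.5))
    print(abs(early - late) < 1e-9)
```

Under these assumptions, the score between two tokens depends only on how far apart they are in time, not on where they fall in the recording or how many tokens separate them, which is the property that makes timestamps a natural position signal for audio.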
The reasoning variant introduces what the team calls Temporal Audio Chain-of-Thought: each intermediate reasoning step is anchored to a specific timestamp before the model produces a final answer. That is useful for pulling together facts scattered across long meetings or podcast episodes. Code is on GitHub, and a captioner demo is live. The models are licensed for noncommercial use only under NVIDIA's OneWay license.
Bottom Line
AF-Next processes audio up to 30 minutes long and reports the lowest word error rate (1.54%, self-reported) among open large audio-language models on LibriSpeech test-clean.
Quick Facts
- Training data: ~108 million samples, ~1 million hours of audio
- Max audio input length: 30 minutes (up from 10 minutes in Audio Flamingo 3)
- LongAudioBench score: 73.9 vs. Gemini 2.5 Pro's 60.4 (self-reported)
- LibriSpeech test-clean WER: 1.54% (self-reported, lowest among open large audio-language models)
- License: NVIDIA OneWay Noncommercial




