Machine Learning

Alibaba's PrismAudio Generates Stereo Sound From Video in Under a Second

ICLR 2026 paper splits audio reasoning into four specialized modules for better sync and spatial accuracy.

Andrés Martínez, AI Content Writer
April 6, 2026 · 2 min read
[Image: abstract visualization of sound waves generated from video frames, with four colored streams representing the semantic, temporal, aesthetic, and spatial audio dimensions]

Alibaba's Tongyi Lab has open-sourced PrismAudio, a video-to-audio framework that generates synchronized stereo sound from video. The ICLR 2026 paper tackles a specific weakness in prior V2A systems: they try to handle everything at once and end up doing nothing well.

PrismAudio's fix is structural. It splits chain-of-thought reasoning into four independent modules, each handling one perceptual dimension: what sounds belong in the scene, when they occur, how natural they sound, and where they sit in the stereo field. Each module gets its own reward function during reinforcement learning, so improving spatial accuracy doesn't come at the cost of temporal sync. The predecessor, ThinkSound (accepted at NeurIPS 2025), used a single monolithic reasoning block for all four tasks.

On the team's project page, PrismAudio shows clear gains: a CLAP score of 0.47 vs. 0.43, and spatial positioning error (CRW) cut roughly in half, from 13.47 to 7.72. These are self-reported numbers on VGGSound.

The training trick is worth noting. Fast-GRPO, a modified RL algorithm, applies stochastic sampling only within a small random window of diffusion steps while running the rest deterministically. The authors report it converges in 200 steps instead of 600. The whole model is 518 million parameters, and inference takes 0.63 seconds for 9 seconds of audio.
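The windowed-sampling idea can be illustrated with a toy sketch. Only the pattern comes from the paper's description (stochastic sampling confined to a small random window of diffusion steps, deterministic updates elsewhere); the step count, window size, and function names here are made up.

```python
import random

def choose_stochastic_window(num_steps: int, window: int,
                             rng: random.Random) -> range:
    """Pick a small contiguous window of diffusion steps to run
    stochastically; all other steps run deterministically."""
    start = rng.randrange(num_steps - window + 1)
    return range(start, start + window)

def denoise_schedule(num_steps: int = 50, window: int = 5,
                     seed: int = 0) -> list[str]:
    rng = random.Random(seed)
    stochastic = set(choose_stochastic_window(num_steps, window, rng))
    schedule = []
    for t in range(num_steps):
        # Stochastic steps inject sampling noise, which is where the
        # RL exploration signal comes from; the remaining steps use a
        # deterministic update, cutting per-rollout variance and cost.
        schedule.append("stochastic" if t in stochastic else "deterministic")
    return schedule

sched = denoise_schedule()
print(sched.count("stochastic"))  # 5
```

Confining noise injection to a narrow window is what would plausibly let training converge in fewer steps: most of the trajectory is reproducible, so reward differences are attributable to the few sampled steps.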

There's a catch. Users report feature extraction for a 10-second clip eats around 43 GB of VRAM, which limits who can actually run this locally. Code lives on a GitHub branch of the ThinkSound repo.


Bottom Line

PrismAudio generates 9 seconds of spatial stereo audio in 0.63 seconds, but feature extraction requires roughly 43 GB of VRAM.

Quick Facts

  • 518 million parameters
  • 0.63 seconds inference for 9-second audio clip
  • ~43 GB VRAM for feature extraction (user-reported)
  • CLAP score: 0.47 vs. 0.43 (ThinkSound), self-reported on VGGSound
  • Fast-GRPO converges in 200 steps vs. 600 (company-reported)
Tags: PrismAudio · video-to-audio · Alibaba · ICLR 2026 · reinforcement learning · audio generation

