Alibaba's Tongyi Lab has open-sourced PrismAudio, a video-to-audio framework that generates synchronized stereo sound from video. The ICLR 2026 paper tackles a specific weakness in prior V2A systems: they try to handle everything at once and end up doing nothing well.
PrismAudio's fix is structural. It splits chain-of-thought reasoning into four independent modules, each handling one perceptual dimension: what sounds belong in the scene, when they occur, how natural they sound, and where they sit in the stereo field. Each module gets its own reward function during reinforcement learning, so improving spatial accuracy doesn't come at the cost of temporal sync. The predecessor, ThinkSound (accepted at NeurIPS 2025), used a single monolithic reasoning block for all four tasks. On the team's project page, PrismAudio shows clear gains over ThinkSound: a CLAP score of 0.47 vs. 0.43, and spatial positioning error (CRW) cut roughly in half, from 13.47 to 7.72. These are self-reported numbers on VGGSound.
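To make the separation concrete, here's a minimal sketch of per-dimension rewards, assuming four independent scoring functions; the names and sample fields below are hypothetical illustrations, not taken from the released code.

```python
# Hypothetical sketch: one reward function per perceptual dimension, kept
# separate so each chain-of-thought module receives its own RL signal.
# Field names ("clap_similarity", "crw_error", ...) are illustrative assumptions.
from typing import Callable, Dict

RewardFn = Callable[[dict], float]  # scores one generated (video, audio) sample

reward_fns: Dict[str, RewardFn] = {
    "semantic":  lambda s: s["clap_similarity"],    # what sounds belong in the scene
    "temporal":  lambda s: -s["onset_offset_err"],  # when the sounds occur
    "aesthetic": lambda s: s["quality_score"],      # how natural they sound
    "spatial":   lambda s: -s["crw_error"],         # where they sit in the stereo field
}

def per_module_rewards(sample: dict) -> Dict[str, float]:
    """Score a sample on every dimension independently, so the update for the
    spatial module is never diluted into a single blended score."""
    return {name: fn(sample) for name, fn in reward_fns.items()}
```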
The training trick is worth noting. Fast-GRPO, a modified RL algorithm, applies stochastic sampling only within a small random window of diffusion steps while running the rest deterministically. The authors report it converges in 200 training steps instead of 600. The whole model is 518 million parameters, and inference takes 0.63 seconds for 9 seconds of audio.
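The sampling idea is easy to sketch, assuming a generic reverse-diffusion step function; the loop below only illustrates "stochastic inside a random window, deterministic elsewhere" and is not the paper's exact algorithm.

```python
import random

def sample_with_stochastic_window(denoise_step, x, num_steps=50, window=5):
    """denoise_step(x, t, stochastic=...) is any callable that performs one
    reverse-diffusion step and injects noise only when stochastic=True.
    Both the callable and the window size are assumptions for illustration."""
    start = random.randrange(num_steps - window + 1)  # random window position
    for t in range(num_steps):
        in_window = start <= t < start + window
        # Deterministic (DDIM-style) everywhere except the sampled window.
        x = denoise_step(x, t, stochastic=in_window)
    return x
```

Confining randomness to a few steps presumably keeps most of the trajectory cheap and reproducible while still giving the policy-gradient objective room to explore, which would help explain the faster reported convergence.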
There's a catch. Users report feature extraction for a 10-second clip eats around 43 GB of VRAM, which limits who can actually run this locally. Code lives on a GitHub branch of the ThinkSound repo.
Bottom Line
PrismAudio generates 9 seconds of spatial stereo audio in 0.63 seconds, but feature extraction requires roughly 43 GB of VRAM.
Quick Facts
- 518 million parameters
- 0.63 seconds inference for a 9-second audio clip
- ~43 GB VRAM for feature extraction (user-reported)
- CLAP score: 0.47 vs. 0.43 (ThinkSound), self-reported on VGGSound
- Fast-GRPO converges in 200 steps vs. 600 (company-reported)




