Alibaba's Tongyi Lab has open-sourced PrismAudio, a video-to-audio framework that generates synchronized stereo sound from video. The ICLR 2026 paper tackles a specific weakness in prior V2A systems: they try to handle everything at once and end up doing nothing well.
PrismAudio's fix is structural. It splits chain-of-thought reasoning into four independent modules, each handling one perceptual dimension: what sounds belong in the scene, when they occur, how natural they sound, and where they sit in the stereo field. Each module gets its own reward function during reinforcement learning, so improving spatial accuracy doesn't come at the cost of temporal sync. The predecessor, ThinkSound (accepted at NeurIPS 2025), used a single monolithic reasoning block for all four tasks. On the team's project page, PrismAudio shows clear gains over ThinkSound: a CLAP score of 0.47 vs. 0.43, and spatial positioning error (CRW) cut roughly in half, from 13.47 to 7.72. These are self-reported numbers on VGGSound.
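To make the separation concrete, here's a minimal sketch of per-dimension rewards, assuming four independent scoring functions; the names and sample fields below are hypothetical illustrations, not taken from the released code.

```python
# Hypothetical sketch: one reward function per perceptual dimension, kept
# separate so each chain-of-thought module receives its own RL signal.
# Field names ("clap_similarity", "crw_error", ...) are illustrative assumptions.
from typing import Callable, Dict

RewardFn = Callable[[dict], float]  # scores one generated (video, audio) sample

reward_fns: Dict[str, RewardFn] = {
    "semantic":  lambda s: s["clap_similarity"],    # what sounds belong in the scene
    "temporal":  lambda s: -s["onset_offset_err"],  # when the sounds occur
    "aesthetic": lambda s: s["quality_score"],      # how natural they sound
    "spatial":   lambda s: -s["crw_error"],         # where they sit in the stereo field
}

def per_module_rewards(sample: dict) -> Dict[str, float]:
    """Score a sample on every dimension independently, so the update for the
    spatial module is never diluted into a single blended score."""
    return {name: fn(sample) for name, fn in reward_fns.items()}
```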
The training trick is worth noting. Fast-GRPO, a modified RL algorithm, applies stochastic sampling only within a small random window of diffusion steps while running the rest deterministically. The authors report it converges in 200 training steps instead of 600. The whole model is 518 million parameters, and inference takes 0.63 seconds for 9 seconds of audio.
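The sampling idea is easy to sketch, assuming a generic reverse-diffusion step function; the loop below only illustrates "stochastic inside a random window, deterministic elsewhere" and is not the paper's exact algorithm.

```python
import random

def sample_with_stochastic_window(denoise_step, x, num_steps=50, window=5):
    """denoise_step(x, t, stochastic=...) is any callable that performs one
    reverse-diffusion step and injects noise only when stochastic=True.
    Both the callable and the window size are assumptions for illustration."""
    start = random.randrange(num_steps - window + 1)  # random window position
    for t in range(num_steps):
        in_window = start <= t < start + window
        # Deterministic (DDIM-style) everywhere except the sampled window.
        x = denoise_step(x, t, stochastic=in_window)
    return x
```

Confining randomness to a few steps presumably keeps most of the trajectory cheap and reproducible while still giving the policy-gradient objective room to explore, which would help explain the faster reported convergence.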
There's a catch. Users report feature extraction for a 10-second clip eats around 43 GB of VRAM, which limits who can actually run this locally. Code lives on a GitHub branch of the ThinkSound repo.
Bottom Line
PrismAudio generates 9 seconds of spatial stereo audio in 0.63 seconds, but feature extraction requires roughly 43 GB of VRAM.
Quick Facts
- 518 million parameters
- 0.63 seconds inference for a 9-second audio clip
- ~43 GB VRAM for feature extraction (user-reported)
- CLAP score: 0.47 vs. 0.43 (ThinkSound), self-reported on VGGSound
- Fast-GRPO converges in 200 steps vs. 600 (company-reported)




