
Skywork Publishes SkyReels-V4 Technical Report for Unified Video-Audio Model

New dual-stream architecture generates video and synchronized audio in a single pass.

Andrés Martínez, AI Content Writer
February 27, 2026 · 2 min read

Skywork AI, the research arm of Chinese gaming company Kunlun Tech, dropped the technical report for SkyReels-V4 this week. The model unifies video generation, inpainting, and editing with synchronized audio output under one architecture. That's a tall claim, but the team's track record of open-sourcing previous SkyReels versions (V1 through V3 all shipped weights on Hugging Face) lends it some credibility.

The core idea: a dual-stream Multimodal Diffusion Transformer where one branch handles video and the other handles audio. Both share a text encoder built on a multimodal large language model, which lets the system accept text, images, video clips, masks, and audio references as input. The video branch uses channel concatenation to treat image-to-video, video extension, and editing as variants of inpainting. Clever, if it holds up in practice.
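To make the "everything is inpainting" framing concrete, here is a minimal sketch of how a per-frame mask and channel concatenation could express several tasks with one input layout. It is an illustration only: the function names, the frame-level (rather than spatial) mask, and the channel ordering are assumptions, not details taken from the SkyReels-V4 report. Spatial dimensions are omitted, so each frame is just a short list of channel values.

```python
from typing import List

def frame_mask(task: str, num_frames: int, known: int = 1) -> List[int]:
    """Hypothetical per-frame mask: 1 = frame is supplied, 0 = frame must be generated."""
    if task == "text_to_video":
        return [0] * num_frames                       # nothing is given
    if task == "image_to_video":
        return [1] + [0] * (num_frames - 1)           # only the first frame is given
    if task == "video_extension":
        return [1] * known + [0] * (num_frames - known)  # a known prefix is given
    raise ValueError(f"unknown task: {task}")

def build_model_input(noisy: List[List[float]],
                      cond: List[List[float]],
                      mask: List[int]) -> List[List[float]]:
    """Concatenate, per frame, [noisy channels | masked conditioning channels | mask].

    Assumed layout for illustration; the real model's channel order is not
    described in the article.
    """
    rows = []
    for z, c, m in zip(noisy, cond, mask):
        kept = [v * m for v in c]      # zero out conditioning where the frame is unknown
        rows.append(z + kept + [float(m)])
    return rows

if __name__ == "__main__":
    mask = frame_mask("image_to_video", num_frames=3)
    noisy = [[0.5, 0.5]] * 3
    cond = [[2.0, 3.0]] * 3
    for row in build_model_input(noisy, cond, mask):
        print(row)
```

Under this framing, switching tasks only changes the mask and the conditioning frames; the network input shape stays the same. Editing a region inside a frame would need a spatial mask rather than the frame-level one sketched here.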

On paper, SkyReels-V4 outputs 1080p video at 32 FPS for up to 15 seconds with temporally aligned audio. To keep that computationally feasible, the team generates low-resolution full sequences alongside high-resolution keyframes, then runs super-resolution and frame interpolation. The report claims third place on the Artificial Analysis Text-to-Video with Audio Arena as of February 24, behind Veo 3.1 and Grok's video model, though arena rankings shift quickly.
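A rough back-of-the-envelope calculation shows why generating every frame at full resolution is expensive and why a two-stage approach helps. Only the 1080p, 32 FPS, and 15-second figures come from the article; the low-resolution size (480x270) and the keyframe interval (every 16 frames) are made-up values chosen purely for illustration, and the count ignores the cost of the super-resolution and interpolation stages themselves.

```python
# Figures from the article
FPS = 32
DURATION_S = 15
FULL_W, FULL_H = 1920, 1080   # standard 1080p frame size

# Hypothetical values, NOT from the report
LOW_W, LOW_H = 480, 270       # assumed low-resolution pass (1/4 per side)
KEYFRAME_EVERY = 16           # assumed: one high-resolution keyframe per 16 frames

def pixel_budget() -> dict:
    """Compare raw pixel counts for direct vs. two-stage generation."""
    frames = FPS * DURATION_S                      # total frames in the clip
    direct = frames * FULL_W * FULL_H              # every frame at 1080p
    low_res = frames * LOW_W * LOW_H               # full sequence at low resolution
    keyframes = frames // KEYFRAME_EVERY           # sparse high-resolution frames
    key_px = keyframes * FULL_W * FULL_H
    two_stage = low_res + key_px
    return {
        "frames": frames,
        "direct": direct,
        "two_stage": two_stage,
        "ratio": direct / two_stage,
    }

if __name__ == "__main__":
    for name, value in pixel_budget().items():
        print(f"{name}: {value}")
```

With these assumed settings, the diffusion stage touches roughly an eighth of the pixels it would need for direct 1080p generation. The real savings depend on the report's actual resolutions, keyframe spacing, and latent compression.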

Skywork has not announced a release date for weights or code. The company has consistently open-sourced its previous models, but V4 remains report-only for now.


Bottom Line

SkyReels-V4 claims third place on the Artificial Analysis video-audio arena, but weights and code haven't been released yet.

Quick Facts

  • Resolution: up to 1080p at 32 FPS, 15 seconds max
  • Architecture: dual-stream MMDiT with a shared MLLM-based text encoder
  • Arena rank: 3rd on Artificial Analysis Text-to-Video with Audio (company-reported, as of Feb 24)
  • Developer: Skywork AI (Kunlun Tech)
  • No model weights or code released yet
Tags: SkyReels, Skywork AI, video generation, audio generation, diffusion transformer, Kunlun Tech, open source AI
