Alibaba's Live Avatar Achieves 20 FPS Real-Time Streaming with 14 Billion Parameter Model

A joint research team from Alibaba's Quark division and Chinese universities demonstrates infinite-length avatar video generation without quality degradation.

Oliver Senti, Senior AI Editor
December 8, 2025 · 4 min read
Image: Artistic rendering of an AI-generated avatar face emerging from streams of data, illustrating real-time diffusion-based video generation technology.

Researchers from Alibaba Group and the University of Science and Technology of China published Live Avatar on December 4, 2025, a system that generates audio-synchronized talking avatars at 20 frames per second using a 14-billion-parameter diffusion model distributed across five NVIDIA H800 GPUs. The team reports video generation sustained for over 10,000 seconds without identity drift or color artifacts.

The Core Technical Problem

Diffusion models have dominated recent video generation research, but their sequential denoising process creates a bottleneck for real-time applications. Each frame requires multiple passes through the model, and autoregressive generation (where each new frame depends on previous outputs) compounds latency while accumulating errors over time.
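The bottleneck described above can be made concrete with a toy latency model. The step counts and per-step latency below are illustrative assumptions, not measurements from the paper:

```python
# Illustrative latency model: why sequential denoising on one device
# blocks real-time generation. All figures here are assumptions chosen
# for the example, not numbers reported by the Live Avatar team.

def effective_fps(steps_per_frame: int, step_latency_s: float) -> float:
    """Frames per second when all denoising steps run serially on one GPU."""
    return 1.0 / (steps_per_frame * step_latency_s)

# A hypothetical 50-step sampler at 40 ms per step:
baseline = effective_fps(50, 0.040)   # ~0.5 FPS, far below real time
# Distilling to 4 steps helps, but a single device still runs them serially:
distilled = effective_fps(4, 0.040)   # ~6.25 FPS
```

Even aggressive step reduction leaves a gap to the 20 FPS interactive threshold when one device executes every step in sequence, which motivates the parallelism described next.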

Audio-driven avatar systems face an additional challenge: the generated face must stay consistent across thousands of frames while lip movements match incoming audio with minimal delay. Prior systems from NVIDIA, Ant Group, and academic labs have achieved strong quality, but with inference times measured in minutes for every few seconds of video. EchoMimic from Ant Group, for instance, generates 240 frames in roughly 50 seconds on an NVIDIA V100 after acceleration, according to its December 2024 release notes.

Live Avatar claims an 84x improvement in frames-per-second over baseline diffusion inference without quantization. The researchers attribute this to three technical innovations.
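A quick sanity check of the claimed speedup: if the system reaches 20 FPS at 84x over baseline diffusion inference, the implied baseline throughput follows directly.

```python
# Back-of-the-envelope check of the reported 84x speedup claim.
live_avatar_fps = 20.0
speedup = 84.0
baseline_fps = live_avatar_fps / speedup
print(f"{baseline_fps:.3f} FPS")  # 0.238 FPS, i.e. roughly 4.2 s per frame
```

A baseline of roughly one frame every four seconds is consistent with the minutes-per-clip inference times reported for prior diffusion-based avatar systems.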

Timestep-Forcing Pipeline Parallelism

Standard diffusion models process each denoising step sequentially on a single GPU. Live Avatar's Timestep-forcing Pipeline Parallelism (TPP) distributes these steps across multiple GPUs, treating the denoising sequence as a pipeline where each device handles a different timestep.

The approach achieves speedup that scales linearly with the number of devices. With four-step sampling and five H800 GPUs, the team reports end-to-end generation at 20 FPS, which meets the threshold for interactive applications. The researchers used Self-Forcing Distribution Matching Distillation to reduce sampling from the typical 20-50 steps in standard diffusion models down to four steps while maintaining visual quality.
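The scheduling idea behind TPP can be sketched abstractly: each device owns one denoising timestep, and frames stream through the stages so that, once the pipeline fills, a fully denoised frame exits every stage interval. This is a generic pipeline-schedule sketch under that interpretation, not the paper's implementation:

```python
# Minimal sketch of timestep-forcing pipeline parallelism as described
# in the article: one pipeline stage per denoising step. This models
# only the schedule (which stage works on which frame in which cycle),
# not the actual diffusion computation or inter-GPU communication.

def pipeline_schedule(num_frames: int, num_stages: int):
    """Return (cycle, stage, frame) tuples for a simple filled pipeline."""
    schedule = []
    for cycle in range(num_frames + num_stages - 1):
        for stage in range(num_stages):
            frame = cycle - stage  # frame currently occupying this stage
            if 0 <= frame < num_frames:
                schedule.append((cycle, stage, frame))
    return schedule

# With 4 stages (one per sampling step), the last stage emits a finished
# frame every cycle once the pipeline is full:
sched = pipeline_schedule(num_frames=8, num_stages=4)
completions = [cycle for (cycle, stage, frame) in sched if stage == 3]
```

In this toy schedule, throughput approaches one frame per stage interval regardless of how many steps the sampler uses, which is the linear-speedup behavior the team reports.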

Solving Long-Form Collapse

Autoregressive video generators typically degrade over extended sequences. The researchers identify three failure modes: RoPE (Rotary Position Embedding) drift where positional encodings shift relative to reference frames, attention sink degradation where early cached tokens lose influence, and error accumulation in the key-value cache.

Live Avatar addresses these with the Rolling Sink Frame Mechanism (RSFM). The system maintains a cached reference image and dynamically recalibrates appearance at intervals, preventing the gradual identity drift that plagues other long-form generators. The History Corrupt technique deliberately injects noise into the KV-cache during training, forcing the model to extract motion information from history while relying on the sink frame for stable appearance details.
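The History Corrupt idea lends itself to a short illustration: perturb cached key/value tensors during training so the model cannot depend on a pristine history for appearance details. The noise distribution, scale, and tensor shapes below are assumptions for the sketch, not values from the paper:

```python
# Illustrative take on the History Corrupt technique: add noise to the
# KV-cache at training time. Shapes and the Gaussian noise scale are
# hypothetical choices for this example.
import numpy as np

def corrupt_kv_cache(kv_cache: np.ndarray, noise_std: float = 0.1,
                     rng: np.random.Generator = None) -> np.ndarray:
    """Return a noisy copy of cached keys/values (training-time only)."""
    rng = rng or np.random.default_rng(0)
    return kv_cache + rng.normal(0.0, noise_std, size=kv_cache.shape)

cache = np.zeros((2, 16, 64))  # (layers, cached tokens, head dim), hypothetical
noisy = corrupt_kv_cache(cache, noise_std=0.1)
```

Training against a corrupted history pushes the model to read motion from the cache but appearance from the stable sink frame, which is the division of labor the researchers describe.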

The team demonstrates continuous generation for over 10,000 seconds (roughly 2.8 hours) without visible degradation in their project demonstrations.

Hardware Requirements and Practical Deployment

Live Avatar requires five NVIDIA H800 GPUs running in parallel. At current cloud rates (approximately $3.15-$3.42 per H800 per hour from providers like CR8DL), running the system would cost roughly $16-17 per hour of compute time.
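The quoted cost range is simple to verify from the per-GPU rates the article cites:

```python
# Checking the quoted deployment cost: five H800 GPUs at roughly
# $3.15-$3.42 per GPU-hour (rates cited in the article; actual cloud
# pricing varies by provider and region).
gpus = 5
low_rate, high_rate = 3.15, 3.42
low_cost = gpus * low_rate
high_cost = gpus * high_rate
print(f"${low_cost:.2f}-${high_cost:.2f} per hour")  # $15.75-$17.10 per hour
```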

The hardware demand places Live Avatar in enterprise or studio territory rather than consumer applications. The researchers acknowledge this in their roadmap, listing planned optimizations for RTX 4090 and A100 deployments through additional distillation (to 3 steps), SVD quantization, and SageAttention integration.

The model builds on Alibaba's Wan2.2-S2V-14B as its base architecture. Checkpoints are available on Hugging Face under the Quark-Vision organization, with inference code scheduled for open-source release.

Integration with Conversational AI

Project demonstrations show Live Avatar combined with Qwen3-Omni, Alibaba's multimodal language model, creating autonomous dialogue agents. One published demo features two AI-generated avatars conversing for over 18 minutes, with each avatar generating responses and corresponding lip-synced video in real time.

This integration points toward applications in virtual assistants, live broadcast automation, and interactive NPCs for gaming and virtual production. Alibaba's Quark division (which developed Live Avatar) serves over 200 million users in China through its search and productivity platform, providing a potential distribution channel for avatar-based features.

The research team includes members from the University of Science and Technology of China, Beijing University of Posts and Telecommunications, and Zhejiang University. Steven Hoi, a well-known figure in multimodal AI research, leads the Alibaba contingent. Jiaming Liu is credited as project leader, with Sirui Zhao and Enhong Chen listed as corresponding authors.

The code repository lists inference code, Gradio demo, and real-time streaming implementation as pending items expected in early December 2025. The Apache 2.0 license covers both Live Avatar and the underlying Wan model.

Tags: Alibaba, AI avatars, diffusion models, real-time video generation, audio-driven animation, GPU computing, video AI, streaming inference
Oliver Senti
Senior AI Editor

Former software engineer turned tech writer, Oliver has spent the last five years tracking the AI landscape. He brings a practitioner's eye to the hype cycles and genuine innovations defining the field, helping readers separate signal from noise.
