San Francisco-based Tavus shipped Phoenix-4 on February 19: a real-time rendering engine that generates every pixel of a digital human's face and head during live conversation. The model runs at 40 frames per second at 1080p and is built on a hybrid Gaussian-diffusion architecture trained on thousands of hours of real conversational data. Tavus detailed the system on its research blog, and a live demo is already available.
The pitch here is emotional intelligence, not just visual fidelity. Phoenix-4 supports over ten explicit emotional states (joy, sadness, anger, surprise, among others) that developers can trigger through an API or let the model select contextually. Tavus calls it "the first real-time model to generate and control emotional states, active listening behavior, and continuous facial motion as a single, unified system," per its press release. Whether it truly escapes the uncanny valley is a claim independent reviewers have yet to confirm.
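As a sketch of what the explicit-versus-contextual split might look like from the developer side (the field names, emotion list, and function below are illustrative assumptions, not Tavus's documented API surface):

```python
# Illustrative sketch only: these field names and this emotion subset are
# assumptions for explanation, not Tavus's actual API.
EMOTIONS = {"joy", "sadness", "anger", "surprise"}  # subset of the 10+ states

def build_emotion_request(emotion=None):
    """Build a payload for one avatar turn (hypothetical schema).

    Passing an emotion pins the avatar to that explicit state; omitting it
    asks the model to pick one contextually, as the article describes.
    """
    if emotion is not None and emotion not in EMOTIONS:
        raise ValueError(f"unknown emotion: {emotion!r}")
    if emotion is None:
        return {"emotion_control": "contextual"}
    return {"emotion_control": "explicit", "emotion": emotion}
```

For example, `build_emotion_request("joy")` yields an explicit-control payload, while `build_emotion_request()` defers the choice to the model.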
The technical stack goes beyond just Phoenix-4. Tavus pairs it with Raven-1 for multimodal perception (reading the user's expressions and tone) and Sparrow-1 for conversational timing and turn-taking. End-to-end latency sits under 600 milliseconds, according to the company. Full-duplex support means the avatar generates listening behavior (nods, microexpressions) while the user is still speaking, rather than looping canned idle animations.
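The full-duplex behavior above amounts to a per-frame decision: while the user is speaking, the engine generates fresh listening behavior rather than replaying a canned idle clip. A toy version of that control flow, with every name invented for illustration:

```python
import random

def next_frame(user_is_speaking, rng=random):
    """Toy full-duplex frame selector (illustrative only).

    A real engine like Phoenix-4 renders pixels; here we just label which
    behavior class the next frame would be drawn from.
    """
    if user_is_speaking:
        # Generate varied listening behavior on the fly (nods,
        # microexpressions) instead of looping a fixed idle animation.
        return rng.choice(["nod", "microexpression", "gaze_shift"])
    return "speaking_frame"
```

The point of the sketch is the branch itself: listening output is sampled per frame, so the avatar never falls back to a repeating idle loop while the user talks.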
Phoenix-4 is available now through Tavus's platform and APIs. Target use cases: healthcare patient interactions, sales, education, and customer support. No pricing was disclosed.
Bottom Line
Phoenix-4 renders full-head video at 40fps in 1080p with explicit emotion control via API, available now through Tavus's platform.
Quick Facts
- 40 frames per second at 1080p resolution
- Sub-600ms end-to-end conversational latency (company-reported)
- 10+ emotional states controllable via API
- Trained on thousands of hours of conversational data
- Available now through Tavus platform and APIs; pricing not disclosed