Voice Cloning/Synthesis

NVIDIA Open-Sources PersonaPlex-7B Full-Duplex Voice Model

A 7B-parameter model that listens and talks simultaneously, with customizable voices and personas.

Andrés Martínez
AI Content Writer
February 16, 2026 · 2 min read
Image: Abstract visualization of two overlapping audio waveforms, representing simultaneous listening and speaking in full-duplex voice AI.

NVIDIA released PersonaPlex-7B-v1, a full-duplex speech-to-speech model that can listen and generate speech at the same time. No waiting for turns. The model, available on Hugging Face with over 330,000 downloads already, replaces the traditional ASR-to-LLM-to-TTS pipeline with a single Transformer that handles everything in one pass.

The pitch is straightforward: most voice AI still works like a walkie-talkie. You talk, it waits, it thinks, it responds. PersonaPlex processes incoming audio while simultaneously generating its own speech, supporting interruptions, overlapping talk, and backchannels (the "uh-huh" and "right" that make conversations feel human). Built on Kyutai's Moshi architecture with a Helium language backbone, the 7B-parameter model also lets developers customize both voice and persona through audio and text prompts.
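The walkie-talkie versus full-duplex distinction can be sketched as a toy event loop: one thread keeps consuming incoming audio frames while another emits response frames, and a detected barge-in cuts generation short mid-stream. This is purely illustrative pseudologic for the behavior the article describes; none of the names here are PersonaPlex's actual API.

```python
import queue
import threading
import time

# Toy full-duplex loop: "listen" and "speak" run concurrently, and an
# incoming frame flagged as an interruption makes the speaker yield the
# floor immediately instead of finishing its turn.
def full_duplex_session(incoming_frames, response_frames):
    heard, spoken = [], []
    interrupted = threading.Event()
    inbox = queue.Queue()

    def listen():
        for frame in incoming_frames:
            inbox.put(frame)
            time.sleep(0.001)          # simulate real-time frame arrival
        inbox.put(None)                # end-of-stream sentinel

    def process():
        while (frame := inbox.get()) is not None:
            heard.append(frame)        # model keeps "hearing" while talking
            if frame == "INTERRUPT":
                interrupted.set()      # user barged in: stop generating

    def speak():
        for frame in response_frames:
            if interrupted.is_set():
                break                  # yield the floor mid-utterance
            spoken.append(frame)
            time.sleep(0.002)

    threads = [threading.Thread(target=f) for f in (listen, process, speak)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return heard, spoken
```

A half-duplex pipeline would instead run these three stages strictly one after another; the concurrent version is what lets overlapping talk and backchannels register while the model is still speaking.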

NVIDIA's own benchmarks, measured on FullDuplexBench, show smooth turn-taking latency at 0.170 seconds and interruption handling at 0.240 seconds. On dialog naturalness scores, PersonaPlex hit 2.95 MOS compared to 2.80 for Gemini and 2.81 for Qwen-2.5-Omni. Those numbers come from NVIDIA's research paper, so take them with appropriate caution: all benchmarks are company-reported, and the evaluator pool ranged from 152 to 202 people depending on the test category.

Code ships under MIT, weights under the NVIDIA Open Model License, both cleared for commercial use. You'll need serious hardware though: NVIDIA recommends at least 24 GB of VRAM. English only for now, with other languages on the roadmap.
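For the 24 GB recommendation, a pre-flight check is trivial to write. The helper below is a hypothetical sketch: it just compares a byte count against the threshold, and the comment notes one way to obtain that count if PyTorch with CUDA happens to be installed.

```python
# Minimal VRAM pre-flight check. With PyTorch + CUDA available, the byte
# count can come from torch.cuda.get_device_properties(0).total_memory.
def meets_vram_requirement(total_vram_bytes: int, required_gb: float = 24.0) -> bool:
    """Return True if the GPU's total memory meets the stated minimum."""
    return total_vram_bytes / 1024**3 >= required_gb
```

An RTX 3090's 24 GB passes exactly at the threshold; a 16 GB card does not.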

Bottom Line

PersonaPlex-7B is commercially licensed, open-weight, and already has 330,000+ Hugging Face downloads, but requires a 24 GB VRAM GPU and only supports English so far.

Quick Facts

  • 7 billion parameters, built on Moshi architecture
  • Smooth turn-taking latency: 0.170 seconds (company-reported)
  • User interruption latency: 0.240 seconds (company-reported)
  • Dialog naturalness MOS: 2.95 vs. Gemini's 2.80 (company-reported)
  • 330,000+ downloads on Hugging Face
  • Requires 24 GB+ VRAM (A10G, A40, RTX 3090/4090)
Tags: NVIDIA, PersonaPlex, full-duplex, voice AI, open source, speech-to-speech, conversational AI
Andrés Martínez
AI Content Writer

Andrés reports on the AI stories that matter right now. No hype, just clear, daily coverage of the tools, trends, and developments changing industries in real time. He makes the complex feel routine.

