Resemble AI has released DramaBox, an open-source text-to-speech model that takes screenplay-style prompts and turns them into voiced performances. Dialogue goes inside quotes. Stage directions, like sighs, whispers, or a cracking voice, sit outside the quotes, and the model treats them as performance cues rather than words to speak. The model card is live on Hugging Face.
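The quoting convention can be illustrated with a short sketch. This is not DramaBox's own parser, whose internals the release notes here do not describe; `split_prompt` and the example prompt are hypothetical, written only to show how quoted dialogue separates from the stage directions around it.

```python
import re


def split_prompt(prompt: str) -> list[tuple[str, str]]:
    """Split a screenplay-style prompt into ("cue", text) and ("dialogue", text) parts.

    Hypothetical helper for illustration; not part of the DramaBox codebase.
    """
    segments = []
    # Each match is either a double-quoted span (group 1) or a run of unquoted text (group 2).
    for match in re.finditer(r'"([^"]*)"|([^"]+)', prompt):
        spoken, direction = match.group(1), match.group(2)
        kind = "dialogue" if spoken is not None else "cue"
        text = (spoken if spoken is not None else direction).strip()
        if text:
            segments.append((kind, text))
    return segments


# Made-up prompt: directions outside the quotes, spoken words inside.
example = (
    'She sighs, voice cracking. "I thought you had left." '
    'A long pause, then a whisper. "Stay."'
)

for kind, text in split_prompt(example):
    print(f"{kind:8} | {text}")
```

Only the segments labeled dialogue would be spoken; the cue segments describe how to deliver them.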
Voice cloning is optional. A 10-second reference clip locks the timbre while the prompt still controls emotion and pacing. Skip the reference and DramaBox invents a voice from the speaker description. Output is 48 kHz stereo, and every clip gets a Resemble Perth watermark that the company says survives MP3 compression and routine edits at near-100% detection accuracy.
Under the hood it is an IC-LoRA fine-tune of Lightricks' LTX-2.3, the 3.3B-parameter audio branch of the LTX-2 video model, conditioned on Gemma 3 12B text embeddings at 4-bit quantization. Resemble's launch post pegs generation at around 2.5 seconds on a warm H100 with roughly 24 GB peak VRAM. The release is English-only, which Resemble calls deliberate, trading language coverage for quality on directable speech.
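A back-of-envelope calculation puts the 24 GB figure in context. The parameter counts and the 4-bit text encoder come from the description above; the 2 bytes per parameter for the audio model is an assumption (bf16 or fp16 weights), since the launch material cited here does not state its precision. The sketch also ignores quantization metadata, activations, and framework overhead.

```python
GB = 1e9  # decimal gigabytes, for a rough estimate only

# Figures from the article.
gemma_params = 12e9          # Gemma 3 12B text encoder
gemma_bytes_per_param = 0.5  # 4-bit quantization = half a byte per weight
ltx_params = 3.3e9           # LTX-2.3 audio branch
reported_peak_gb = 24.0      # company-reported peak VRAM

# ASSUMPTION: audio-model weights held in 16-bit precision (not stated in the source).
ltx_bytes_per_param = 2.0

gemma_gb = gemma_params * gemma_bytes_per_param / GB
ltx_gb = ltx_params * ltx_bytes_per_param / GB
weights_gb = gemma_gb + ltx_gb
headroom_gb = reported_peak_gb - weights_gb

print(f"Text encoder weights: {gemma_gb:.1f} GB")
print(f"Audio model weights:  {ltx_gb:.1f} GB")
print(f"Total weights:        {weights_gb:.1f} GB")
print(f"Left for everything else: {headroom_gb:.1f} GB")
```

Under that assumption, weights alone account for roughly half of the reported peak, with the rest available for activations, the LoRA, and runtime overhead. The actual split may differ.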
Code is on GitHub under the LTX-2 Community License, and the trainer supports stacking your own LoRA on top of the DramaBox checkpoint. Performance and watermark robustness claims are self-reported; no third-party evaluations have surfaced yet.
Bottom Line
DramaBox generates a clip in roughly 2.5 seconds on a warm H100, using about 24 GB of peak VRAM, and watermarks every output by default. The speed and memory figures are company-reported.
Quick Facts
- Base model: LTX-2.3 audio branch, 3.3B parameters
- Text encoder: Gemma 3 12B at 4-bit quantization
- Voice cloning reference length: 10 seconds
- Output: 48 kHz stereo with Resemble Perth watermark
- Peak VRAM: approximately 24 GB on H100 (company-reported)
- Language support: English only at launch



