Tencent's Hunyuan team has open-sourced UniRL, a reinforcement learning framework that runs a single post-training loop across very different model types. The GitHub repo describes the loop the usual way: generate samples, score them, compute advantages, update the policy, then sync weights back to rollout workers.
The pitch is breadth. Most RL stacks lock to one modality. UniRL applies that loop to text-to-image, text-to-video and image-to-video, vision-language, plain LLMs, prompt enhancers, and unified autoregressive-plus-diffusion architectures. Under the hood it leans on Ray, FSDP, a Transfer Queue, and LoRA or full-weight sync, with SGLang and vLLM-Omni as rollout engines.
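For readers who want the five-step loop in concrete form, here is a toy sketch in plain Python. It is not UniRL code and uses none of its APIs: the "policy" is a single number, the reward is made up, and every function name (`generate`, `score`, `advantages`, `update`, `sync`) is hypothetical, chosen only to mirror the steps the repo describes.

```python
import random
from statistics import mean, pstdev

# Toy stand-ins for the five steps; all names here are hypothetical.

def generate(policy, prompts, n, rng):
    # Rollout: sample n outputs per prompt from a unit-variance Gaussian policy.
    return {p: [rng.gauss(policy["mean"], 1.0) for _ in range(n)] for p in prompts}

def score(samples, target=3.0):
    # Reward: higher when a sample lands closer to an arbitrary target value.
    return {p: [-(x - target) ** 2 for x in xs] for p, xs in samples.items()}

def advantages(rewards):
    # Per-prompt normalization: subtract the group mean, divide by the group std.
    out = {}
    for p, rs in rewards.items():
        mu, sd = mean(rs), pstdev(rs)
        out[p] = [(r - mu) / (sd + 1e-8) for r in rs]
    return out

def update(policy, samples, advs, lr=0.05):
    # Score-function (REINFORCE) step: for a unit-variance Gaussian,
    # d/d_mean log p(x) = x - mean.
    grads = [a * (x - policy["mean"])
             for p in samples
             for x, a in zip(samples[p], advs[p])]
    return {"mean": policy["mean"] + lr * mean(grads)}

def sync(policy):
    # Weight sync: hand the rollout worker a fresh copy of the learner's weights.
    return dict(policy)

def train(steps=200, seed=0):
    rng = random.Random(seed)
    learner = {"mean": 0.0}
    rollout = sync(learner)
    for _ in range(steps):
        samples = generate(rollout, ["prompt-a", "prompt-b"], 8, rng)
        rewards = score(samples)
        advs = advantages(rewards)
        learner = update(learner, samples, advs)
        rollout = sync(learner)
    return learner

if __name__ == "__main__":
    print(train())
```

In a real stack, each of these functions is a distributed component: rollout engines do the generating, reward backends do the scoring, and the sync step moves LoRA or full weights between trainer and rollout workers.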
Two of the team's own algorithms anchor the release. Flow-DPPO targets flow matching and diffusion models, swapping the usual probability-ratio clip for an exact divergence-based trust-region mask. The second, DRPO, is token-level LLM RL with an advantage-weighted quadratic regularizer; the team says it holds up in FP8 where some baselines wobble. Those comparisons are self-reported, and there's no independent benchmarking yet.
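To make the clip-versus-mask distinction concrete, the sketch below contrasts the standard PPO clipped term with one possible divergence-based mask. It is an illustration under assumptions, not the Flow-DPPO implementation: it assumes each sampling step is a Gaussian with shared standard deviation (so the KL divergence has a closed form), and the function names, the threshold `delta`, and the "drop the sample if KL exceeds delta" rule are all hypothetical. The team's manuscript is the reference for the actual rule.

```python
def gaussian_kl(mu_old, mu_new, sigma):
    # Exact KL between two isotropic Gaussians with the same std sigma:
    # KL = ||mu_old - mu_new||^2 / (2 * sigma^2)
    sq_dist = sum((a - b) ** 2 for a, b in zip(mu_old, mu_new))
    return sq_dist / (2.0 * sigma ** 2)

def ppo_clip_term(ratio, adv, eps=0.2):
    # Standard PPO surrogate: take the more pessimistic of the raw and
    # clipped ratio-weighted advantage.
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * adv, clipped * adv)

def kl_masked_term(ratio, adv, kl, delta=0.01):
    # Assumed trust-region mask: keep the sample's contribution only while
    # the step-level divergence stays inside the region (kl <= delta).
    return ratio * adv if kl <= delta else 0.0

if __name__ == "__main__":
    kl = gaussian_kl([0.0, 0.0], [0.1, 0.0], sigma=1.0)
    print(ppo_clip_term(1.5, 1.0), kl_masked_term(1.5, 1.0, kl))
```

The difference in behavior is that the clip bounds how much any one sample can contribute based on its probability ratio, while the mask in this sketch makes an all-or-nothing decision based on a measured divergence.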
Supported models span Stable Diffusion 3 and 3.5, Qwen-Image, FLUX.2-Klein, WAN 2.1 and 2.2, HunyuanVideo 1.0 and 1.5, Qwen-VL, Qwen3, HunyuanImage3, and Bagel. The roadmap promises wider algorithm coverage for newer families and more reward backends. The repo lists no formal release version yet.
Bottom Line
UniRL runs one RL loop across 11 model families, from Stable Diffusion 3.5 to Qwen3, under Apache 2.0.
Quick Facts
- Two team algorithms: Flow-DPPO and DRPO
- DRPO paper: arXiv 2606.09821, released May 2026
- Flow-DPPO manuscript dated June 8, 2026
- 11 supported model families listed
- Apache 2.0 license; built on Ray, FSDP, SGLang, vLLM-Omni
- Algorithm comparisons are company-reported




