Tencent Hunyuan Open-Sources UniRL RL Framework


Tencent's Hunyuan team has open-sourced UniRL, a reinforcement learning framework that runs a single post-training loop across very different model types. The GitHub repo describes the loop the usual way: generate samples, score them, compute advantages, update the policy, then sync weights back to rollout workers.
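
To make those five stages concrete, here is a minimal sketch of the loop on a toy three-action policy. It is not UniRL's API: every function name, the reward table, and the hyperparameters are assumptions chosen for illustration, and a plain list copy stands in for the weight sync that a real system would do across processes.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def generate(rollout_logits, n, rng):
    # Stage 1: the rollout worker samples from its own copy of the policy.
    probs = softmax(rollout_logits)
    return rng.choices(range(len(probs)), weights=probs, k=n)

def score(samples, rewards_by_action):
    # Stage 2: a reward function scores each sample (here, a lookup table).
    return [rewards_by_action[a] for a in samples]

def compute_advantages(rewards):
    # Stage 3: subtract the batch mean so advantages are centered on zero.
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

def update_policy(logits, samples, advantages, lr):
    # Stage 4: one policy-gradient step on the trainer's weights.
    probs = softmax(logits)
    grads = [0.0] * len(logits)
    for a, adv in zip(samples, advantages):
        for i in range(len(logits)):
            # d log pi(a) / d logit_i = 1[i == a] - pi(i)
            grads[i] += adv * ((1.0 if i == a else 0.0) - probs[i])
    n = len(samples)
    return [w + lr * g / n for w, g in zip(logits, grads)]

def sync_weights(trainer_logits):
    # Stage 5: push updated weights back to the rollout worker (a copy here).
    return list(trainer_logits)

def train(steps=200, batch=64, lr=0.5, seed=0):
    rng = random.Random(seed)
    rewards_by_action = [0.0, 0.2, 1.0]  # hypothetical toy reward table
    trainer_logits = [0.0, 0.0, 0.0]
    rollout_logits = sync_weights(trainer_logits)
    for _ in range(steps):
        samples = generate(rollout_logits, batch, rng)
        rewards = score(samples, rewards_by_action)
        advantages = compute_advantages(rewards)
        trainer_logits = update_policy(trainer_logits, samples, advantages, lr)
        rollout_logits = sync_weights(trainer_logits)
    return softmax(trainer_logits)
```

Calling `train()` returns the final action probabilities, with most of the mass ending up on the highest-reward action. Keeping the rollout copy separate from the trainer copy is the part that mirrors the article's description: sampling and training can run on different workers as long as the sync step closes the loop.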

The pitch is breadth. Most RL stacks are built around one modality. UniRL applies the same loop to text-to-image, text-to-video and image-to-video, vision-language models, plain LLMs, prompt enhancers, and unified autoregressive-plus-diffusion architectures. Under the hood it leans on Ray, FSDP, a Transfer Queue, and LoRA or full-weight sync, with SGLang and vLLM-Omni as rollout engines.

Two of the team's own algorithms anchor the release. Flow-DPPO targets flow matching and diffusion models, swapping the usual probability-ratio clip for an exact divergence-based trust-region mask. The second, DRPO, is token-level LLM RL with an advantage-weighted quadratic regularizer; the team says it holds up in FP8 where some baselines wobble. Those comparisons are self-reported, and there's no independent benchmarking yet.
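
The difference between a ratio clip and a divergence-based mask is easier to see in a small example. The sketch below is not the Flow-DPPO implementation, and the article does not give its formula. It assumes a single Gaussian sampling step with a shared standard deviation, uses the closed-form KL divergence between two such Gaussians, and picks a hypothetical threshold `delta`. The function names are illustrative. The ratio test is the standard PPO clipped-surrogate rule.

```python
import math

def gaussian_ratio(x, mu_old, mu_new, sigma):
    # Density ratio new/old for one sample under isotropic Gaussians.
    d_new = sum((xi - m) ** 2 for xi, m in zip(x, mu_new))
    d_old = sum((xi - m) ** 2 for xi, m in zip(x, mu_old))
    return math.exp((d_old - d_new) / (2.0 * sigma ** 2))

def ppo_clip_active(ratio, advantage, eps=0.2):
    # True when the PPO clipped surrogate still passes gradient for this sample.
    if advantage > 0:
        return ratio < 1.0 + eps
    if advantage < 0:
        return ratio > 1.0 - eps
    return False

def gaussian_kl(mu_old, mu_new, sigma):
    # Closed-form KL between N(mu_old, sigma^2 I) and N(mu_new, sigma^2 I).
    sq = sum((a - b) ** 2 for a, b in zip(mu_old, mu_new))
    return sq / (2.0 * sigma ** 2)

def divergence_mask_active(mu_old, mu_new, sigma, delta=0.01):
    # Assumed rule: keep the sample only if the whole-distribution KL is small.
    return gaussian_kl(mu_old, mu_new, sigma) <= delta

# A sample halfway between the two means has ratio 1, so the ratio clip
# lets it through even though the distributions have drifted apart.
x, mu_old, mu_new, sigma = [0.5], [0.0], [1.0], 1.0
ratio = gaussian_ratio(x, mu_old, mu_new, sigma)
clip_ok = ppo_clip_active(ratio, advantage=1.0)
mask_ok = divergence_mask_active(mu_old, mu_new, sigma)
```

In this case the per-sample ratio cannot see the drift, while a check on the divergence between the full distributions can. That is the general motivation for trust-region masks of this kind, whatever the exact rule in the team's manuscript turns out to be.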

Supported models span Stable Diffusion 3 and 3.5, Qwen-Image, FLUX.2-Klein, WAN 2.1 and 2.2, HunyuanVideo 1.0 and 1.5, Qwen-VL, Qwen3, HunyuanImage3, and Bagel. The roadmap promises wider algorithm coverage for newer families and more reward backends. The repo lists no formal release version yet.

Bottom Line

UniRL runs one RL loop across 11 model families, from Stable Diffusion 3.5 to Qwen3, under Apache 2.0.

Quick Facts

Two team algorithms: Flow-DPPO and DRPO
DRPO paper: arXiv 2606.09821, released May 2026
Flow-DPPO manuscript dated June 8, 2026
11 supported model families listed
Apache 2.0 license; built on Ray, FSDP, SGLang, vLLM-Omni
Algorithm comparisons are company-reported

Tags: Tencent Hunyuan, reinforcement learning, multimodal AI, open source, diffusion models, LLM training

Andrés Martínez

AI Content Writer

