Baidu is about to enter the dedicated image generation race. A ComfyUI pull request merged on April 12 implements support for a new ERNIE Image model, a diffusion transformer separate from Baidu's existing multimodal ERNIE line. The model weighs in at 8 billion parameters, according to the source, and borrows its VAE from Black Forest Labs' Flux architecture while using Mistral's Ministral 3.3B as its text encoder.
That's an unusual combination. A review of the merged code shows the model uses rotary position embeddings, patch-based image encoding, and adaptive layer normalization for diffusion conditioning. Neither the weights nor a Hugging Face page has appeared yet, but ComfyUI's lead developer, comfyanonymous, has already merged the integration code, which suggests the release is days away at most. Example workflows are missing too, and at least one GitHub user has already asked where to find them.
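For readers unfamiliar with those three building blocks, here is a rough, dependency-free Python sketch of what each one does. It is not taken from the ComfyUI pull request or from Baidu's code. The function names, the toy 4x4 "image", the patch size, and the 1D position index are all assumptions chosen for illustration. Real diffusion transformers operate on VAE latents, typically apply rotary embeddings to attention queries and keys rather than to raw tokens, and derive the shift and scale values from the timestep and text conditioning through learned layers.

```python
import math

def patchify(image, p):
    """Split an H x W grid (list of rows) into flattened p x p patches, row-major."""
    h, w = len(image), len(image[0])
    assert h % p == 0 and w % p == 0, "image sides must be divisible by patch size"
    tokens = []
    for top in range(0, h, p):
        for left in range(0, w, p):
            tokens.append([image[top + i][left + j] for i in range(p) for j in range(p)])
    return tokens

def rope(vec, pos, base=10000.0):
    """Rotate consecutive feature pairs by an angle that grows with the position."""
    d = len(vec)
    assert d % 2 == 0, "feature dimension must be even"
    out = []
    for k in range(0, d, 2):
        theta = pos * base ** (-k / d)  # lower-index pairs rotate faster
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[k], vec[k + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def ada_layer_norm(vec, shift, scale, eps=1e-6):
    """Normalize without learned affine terms, then apply conditioning shift/scale."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    normed = [(v - mean) / math.sqrt(var + eps) for v in vec]
    return [n * (1.0 + sc) + sh for n, sc, sh in zip(normed, scale, shift)]

# Toy usage: a 4x4 grid becomes four tokens of four values each.
image = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
tokens = patchify(image, 2)
rotated = [rope(t, pos) for pos, t in enumerate(tokens)]
# Hypothetical conditioning values; a real model would predict these per layer.
conditioned = [ada_layer_norm(t, shift=[0.5] * 4, scale=[0.1] * 4) for t in rotated]
```

The point of the adaptive normalization step is that the same network weights can behave differently at each denoising step, because the conditioning signal rescales and shifts every token's features.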
Baidu hasn't been a major player in standalone image generation. Its previous image work, ERNIE-ViLG, dates back to 2022, and more recently the company focused its visual generation efforts inside the massive 2.4-trillion-parameter ERNIE 5.0 multimodal model. A dedicated, open-weights image model would be a different move. The comparison floating around AI communities is Z-Image, Alibaba's 6B-parameter model that ranked first among open-source models on the Artificial Analysis leaderboard. Whether ERNIE Image can match that reception with 8B parameters and a completely different text encoder remains to be seen.
No pricing, license, or benchmark data has been disclosed. The model's Hugging Face page is expected to go live in the coming days.
Bottom Line
Baidu's first standalone image generation model since ERNIE-ViLG pairs an 8B diffusion transformer with Flux's VAE and Ministral 3.3B as the text encoder, and ComfyUI support was merged before the weights were even public.
Quick Facts
- 8B parameters (unverified, per source)
- VAE: Flux architecture
- Text encoder: Ministral 3.3B
- ComfyUI PR #13369 merged April 12, 2026
- Model weights not yet publicly available




