Open-Source AI

Z-Image-Omni-Base Is Almost Here, and the Code's Already Waiting

DiffSynth-Studio commits add complete infrastructure for Alibaba's unified 6B model, ControlNet support included

Andrés Martínez, AI Content Writer
January 9, 2026 · 4 min read
[Illustration: abstract text and image generation flowing through a single-stream architecture]

A flurry of commits to ModelScope's DiffSynth-Studio repository over January 7-8 added full support for Z-Image-Omni-Base, Alibaba's long-awaited unified image generation and editing model. The code infrastructure is now in place. The weights are still "coming soon" on Hugging Face. Make of that what you will.

What dropped in the commits

The January 8th merge from developer Artiprocher (who handles most of DiffSynth-Studio's development, according to the repo) is substantial: 37 files changed, 2,341 lines added. The interesting bits:

New model configurations for Z-Image-Omni-Base itself. A 428M parameter Siglip2 vision encoder. ZImageControlNet. And something called ZImageImage2LoRAModel, which, okay, that's new.

The ControlNet support is worth paying attention to. The commit adds a 15-layer control architecture that plugs into the base model's transformer blocks at specific intervals. There's also VRAM management for low-memory inference, training scripts, and validation code.
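To make the "plugs in at specific intervals" idea concrete, here is a minimal sketch of how a 15-layer control stack might be scheduled across the base model's transformer blocks. The even-spacing scheme and the 30-block count are illustrative assumptions, not details from the commit:

```python
# Hypothetical sketch: mapping N ControlNet layers onto the base model's
# transformer blocks at fixed intervals. The spacing scheme and block
# count are assumptions for illustration, not taken from the commit.

def control_injection_points(num_base_blocks: int, num_control_layers: int) -> list[int]:
    """Return the base-block indices that receive a control residual,
    spacing the control layers evenly across the transformer stack."""
    if num_control_layers > num_base_blocks:
        raise ValueError("more control layers than base blocks")
    stride = num_base_blocks / num_control_layers
    return [int(i * stride) for i in range(num_control_layers)]

# Example: 15 control layers spread over an assumed 30-block transformer
points = control_injection_points(30, 15)
print(points)  # every other block: [0, 2, 4, ..., 28]
```

The appeal of interval injection is that the control network stays much smaller than the base model while still steering every stage of the denoising stack.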

In other words: this isn't placeholder code. This is ready-to-run infrastructure waiting for model weights.

Why Omni-Base instead of just Base

Alibaba quietly renamed the model, adding "Omni" to the name. The technical report (released December 1st) explains why: the model was pre-trained on both generation and editing data simultaneously. You get text-to-image and image-to-image editing from the same checkpoint.

The selling point: one model, two capabilities, no task-switching overhead. LoRA adapters trained for generation should work for editing. In theory.

Z-Image-Turbo has been available since November 26th and sits at #8 on the Artificial Analysis leaderboard, #1 among open-source models. But Turbo is distilled for speed. You can't really fine-tune it without losing the acceleration. The community has been asking for base weights since day one.

The wait has been long

There's a discussion thread on Hugging Face where someone admits they have a bot checking every 8 hours for the weight release. Another commenter pleads for even a vague timeframe, anything. The response from one user captures the mood: "If the developers could give exact release dates, they would have done it already."

The official GitHub repo still lists Omni-Base and Z-Image-Edit as "to be released." That language hasn't changed. But infrastructure showing up in DiffSynth-Studio is a strong signal.

What's actually new

The Image-to-LoRA model is the piece I hadn't seen before. The commit adds a ZImageImage2LoRAModel with a 128-dimension compression layer. DiffSynth-Studio already released a similar capability for Qwen-Image back in December, letting you generate a LoRA from a single image. If this works similarly for Z-Image, that's a significant workflow addition.
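For scale, here is a rough sketch of what an image-to-LoRA head has to emit. The 128-dim bottleneck is the width named in the commit; the rank and hidden size below are purely illustrative assumptions:

```python
# Hypothetical sketch of the image-to-LoRA idea: an encoder squeezes an
# image embedding through a 128-dim bottleneck (the width named in the
# commit), and a decoder head emits low-rank LoRA factors. The rank and
# hidden size are illustrative assumptions, not from the commit.

COMPRESS_DIM = 128  # bottleneck width from the commit
RANK = 16           # assumed LoRA rank
HIDDEN = 3072       # assumed transformer hidden size

def lora_param_count(hidden: int, rank: int, num_matrices: int) -> int:
    """Parameters the head must emit: each adapted weight W gets a
    delta B @ A, with A of shape (rank, hidden) and B of (hidden, rank)."""
    return num_matrices * 2 * hidden * rank

# e.g. adapting q/k/v/o projections in every block of an assumed 30-block model
print(lora_param_count(HIDDEN, RANK, 4 * 30))  # 11,796,480 params
```

The point of the 128-dim compression is that a single image only has to specify a compact code, from which the full set of low-rank deltas is decoded.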

ControlNet support comes in two flavors: a tile variant and a union variant (supporting multiple control types). The union version references a PAI model path, suggesting these might be released separately.

The 6B question

Z-Image's whole pitch is efficiency. Six billion parameters producing results that compete with 20B-80B models. The paper claims they trained the entire thing in 314K H800 GPU hours, approximately $630K. For comparison, that's roughly one week of compute on a decently-sized cluster.
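The paper's cost figure checks out on the back of an envelope. The ~$2/GPU-hour H800 rental rate and the 2,000-GPU cluster size below are my assumptions; the 314K GPU-hours and ~$630K figures come from the paper:

```python
# Back-of-envelope check of the reported training budget. The rental
# rate and cluster size are assumptions; the GPU-hour and dollar
# figures are from the paper.
gpu_hours = 314_000
rate_usd = 2.0  # assumed H800 rental rate per GPU-hour
cost = gpu_hours * rate_usd
print(f"${cost:,.0f}")  # ~$628,000, in line with the paper's ~$630K

gpus = 2000  # assumed cluster size
days = gpu_hours / gpus / 24
print(f"{days:.1f} days on {gpus} GPUs")  # about 6.5 days
```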

The architecture uses what they call a "Scalable Single-Stream Diffusion Transformer." Everything (text embeddings, image tokens, visual semantic tokens) goes through one unified sequence. No dual streams, no separate processing paths.
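The single-stream idea can be sketched in a few lines: tokens from every modality are tagged and concatenated into one sequence before entering a shared stack. The tagging scheme and token counts here are illustrative assumptions:

```python
# Minimal sketch of the single-stream idea: text, image, and semantic
# tokens are concatenated into one sequence for a shared transformer
# stack. The type-tagging scheme is an illustrative assumption.

def build_unified_sequence(text_tokens, image_tokens, semantic_tokens):
    """One sequence, one processing path; per-modality type tags stand
    in for the positional/type embeddings that let the model tell the
    segments apart."""
    sequence = []
    for kind, tokens in (("text", text_tokens),
                         ("image", image_tokens),
                         ("semantic", semantic_tokens)):
        sequence.extend((kind, tok) for tok in tokens)
    return sequence

seq = build_unified_sequence(["a", "cat"], ["p0", "p1", "p2"], ["s0"])
print(len(seq))  # 6 tokens in one stream
```

Contrast this with dual-stream designs, where text and image tokens keep separate weights for part of the network and only mix through cross-attention.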

Whether this holds up under community fine-tuning is the open question. Turbo performs well out of the box. Omni-Base is where the customization happens.

What happens next

The DiffSynth-Studio code is merged and ready. Reddit threads are multiplying. Someone on r/StableDiffusion posted "Z-Image OmniBase looking like it's gonna release soon" on January 8th, citing exactly these commits.

The weights could appear on Hugging Face or ModelScope at any time. Days, probably. Could be hours. The Tongyi-MAI team hasn't announced anything publicly, but they also don't tend to announce things; they just push updates.

I'll spare you the obvious prediction.

Tags: Z-Image, Alibaba, open-source AI, image generation, DiffSynth-Studio, ControlNet
Andrés Martínez

AI Content Writer

Andrés reports on the AI stories that matter right now. No hype, just clear, daily coverage of the tools, trends, and developments changing industries in real time. He makes the complex feel routine.
