Google Releases DiffusionGemma Text Diffusion Model

[Image: abstract visualization of text emerging from random noise across a grid, representing parallel diffusion-based generation]

Google DeepMind dropped DiffusionGemma on June 10, an experimental open model that writes text the way image models draw pictures. Instead of generating one token at a time, it starts with a 256-token "canvas" of random placeholders and refines the whole block over several passes until it reads cleanly. The launch post frames it as a speed play for local, single-user workflows.
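To make the block-refinement idea concrete, here is a toy sketch in Python. It is not DiffusionGemma's actual sampler. The `toy_denoiser` function, the `<mask>` placeholder, and the "commit the most confident slots first" schedule are all assumptions made for illustration. The stand-in denoiser simply reveals a known target sentence, so the only thing the sketch shows is the control flow: every position is proposed in parallel on each pass, and the canvas fills in over a fixed number of passes rather than left to right.

```python
import random

MASK = "<mask>"  # hypothetical placeholder token, not taken from the model card


def toy_denoiser(canvas, target, rng):
    """Stand-in for a model call (assumption): for every masked slot,
    return a proposed token and a made-up confidence score."""
    return {
        i: (target[i], rng.random())
        for i, tok in enumerate(canvas)
        if tok == MASK
    }


def diffusion_decode(target, num_passes=4, seed=0):
    """Fill a fully masked canvas over num_passes parallel refinement passes."""
    rng = random.Random(seed)
    canvas = [MASK] * len(target)
    history = []
    for step in range(num_passes):
        proposals = toy_denoiser(canvas, target, rng)
        if not proposals:
            break
        # Commit an even share of the remaining slots, most confident first.
        passes_left = num_passes - step
        k = -(-len(proposals) // passes_left)  # ceiling division
        ranked = sorted(proposals, key=lambda i: proposals[i][1], reverse=True)
        for i in ranked[:k]:
            canvas[i] = proposals[i][0]
        history.append(list(canvas))
    return canvas, history


if __name__ == "__main__":
    words = "the quick brown fox jumps over the lazy dog".split()
    final, passes = diffusion_decode(words, num_passes=4)
    for n, snapshot in enumerate(passes, start=1):
        print(n, " ".join(snapshot))
```

Run as a script, it prints one snapshot per pass, with words appearing at scattered positions until the sentence is complete. A real diffusion model would predict the tokens itself and could revise earlier choices. This sketch only commits.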

It's a 26B Mixture of Experts model with 3.8B parameters active per step, built on the Gemma 4 backbone, shipped under Apache 2.0. Quantized, it fits in 18GB of VRAM. Google reports more than 1,000 tokens per second on a single H100 and 700+ on an RTX 5090, which it pegs at up to 4x faster than autoregressive Gemma 4. Those numbers are Google's own. The vLLM team, which made DiffusionGemma the first diffusion model it supports natively, claims roughly 1,200 tokens/sec at batch size 1 on an H200 in FP8.
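A quick back-of-envelope check helps put those figures in context. The sketch below is an estimate under stated assumptions, not anything Google has published: it counts weights only, uses decimal gigabytes, and tries 4-bit and 8-bit precision because the bit width behind the 18GB figure is not given here. The function names are made up for this example.

```python
def weight_memory_gb(params_billion, bits_per_weight):
    """Weights-only footprint in decimal GB (ignores activations and caches)."""
    return params_billion * bits_per_weight / 8


def seconds_per_block(block_tokens, tokens_per_second):
    """Time to produce one block at a given sustained throughput."""
    return block_tokens / tokens_per_second


if __name__ == "__main__":
    for bits in (4, 8):
        print(f"26B weights at {bits}-bit: {weight_memory_gb(26, bits):.1f} GB")
    for label, tps in (("H100", 1000), ("RTX 5090", 700)):
        print(f"256-token block on {label}: {seconds_per_block(256, tps):.3f} s")
```

At 4 bits the weights alone come to about 13 GB, which would leave headroom under the quoted 18GB. At 8 bits they come to 26 GB, which would not fit. At the reported throughput floors, a full 256-token block takes roughly a quarter to a third of a second.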

And yes, it reasons. The model card lists a configurable thinking mode that emits an internal reasoning channel before the answer, which is unusual for a diffusion model.

The catch sits in plain sight: Google says DiffusionGemma loses to standard Gemma 4 across benchmarks. "For applications that demand maximum quality, we recommend deploying standard Gemma 4," the company writes. So this is a preview, not a replacement. The speed edge also fades under high-concurrency cloud loads, where batched autoregressive models stay ahead.

Weights are on Hugging Face now, with day-zero support across vLLM, Transformers, MLX, and Unsloth. A developer guide walks through the mechanics. Official llama.cpp support is coming soon.

Bottom Line

Google reports 1,000+ tokens/sec for DiffusionGemma on one H100, but by the company's own admission the model scores below standard Gemma 4 across benchmarks.

Quick Facts

26B total parameters, 3.8B active per step (MoE)
Released June 10, 2026 under Apache 2.0
1,000+ tokens/sec on H100, 700+ on RTX 5090 (Google-reported)
Generates 256-token blocks in parallel via diffusion
Runs in 18GB VRAM when quantized; underperforms Gemma 4 on benchmarks

Tags: DiffusionGemma, Google DeepMind, text diffusion, open weights, Gemma 4, local inference

Andrés Martínez

AI Content Writer

Andrés reports on the AI stories that matter right now. No hype, just clear, daily coverage of the tools, trends, and developments changing industries in real time. He makes the complex feel routine.
