
Google Releases DiffusionGemma, a Text Diffusion Model

Experimental 26B MoE model generates 256-token blocks in parallel, up to 4x faster on a single GPU.

Andrés Martínez, AI Content Writer
June 11, 2026 | 2 min read
[Image: Abstract visualization of text emerging from random noise across a grid, representing parallel diffusion-based generation]

Google DeepMind dropped DiffusionGemma on June 10, an experimental open model that writes text the way image models draw pictures. Instead of generating one token at a time, it starts with a 256-token "canvas" of random placeholders and refines the whole block over several passes until it reads cleanly. The launch post frames it as a speed play for local, single-user workflows.
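To make the "canvas" idea concrete, here is a toy sketch of block-wise refinement. It is not Google's algorithm: the `toy_model` function is a hypothetical stand-in that simply knows the answer and attaches a random confidence to each guess, and the schedule (commit the most confident positions on each pass) is an assumption borrowed from masked-diffusion decoders in general. A real model works on 256-token blocks; the example uses eight tokens so the passes are easy to read.

```python
import math
import random

MASK = "<mask>"  # placeholder token filling the initial canvas


def toy_model(canvas, target, rng):
    """Hypothetical stand-in for the network.

    Returns a (token, confidence) guess for every position. A real model
    would predict tokens from the partially filled canvas; this toy one
    just reads them from `target` and invents a confidence score.
    """
    return [(target[i], rng.random()) for i in range(len(canvas))]


def diffusion_decode(target, steps=4, seed=0):
    """Fill a fully masked canvas over `steps` parallel refinement passes."""
    rng = random.Random(seed)
    canvas = [MASK] * len(target)
    history = [list(canvas)]  # snapshot of the canvas after each pass

    for step in range(steps):
        masked = [i for i, tok in enumerate(canvas) if tok == MASK]
        if not masked:
            break
        guesses = toy_model(canvas, target, rng)
        # Commit an even share of the remaining positions on each pass,
        # so the final pass always fills whatever is left.
        keep = math.ceil(len(masked) / (steps - step))
        most_confident = sorted(masked, key=lambda i: guesses[i][1], reverse=True)
        for i in most_confident[:keep]:
            canvas[i] = guesses[i][0]
        history.append(list(canvas))

    return canvas, history


final, history = diffusion_decode("the quick brown fox jumps over the lazy".split())
for n, snapshot in enumerate(history):
    print(n, " ".join(snapshot))
```

Every pass touches the whole block at once, which is where the speed claim comes from: the number of model calls depends on the number of passes, not on the number of tokens.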

It's a 26B Mixture of Experts model with 3.8B parameters active per step, built on the Gemma 4 backbone, shipped under Apache 2.0. Quantized, it fits in 18GB of VRAM. Google reports more than 1,000 tokens per second on a single H100 and 700+ on an RTX 5090, which it pegs at up to 4x faster than autoregressive Gemma 4. Those numbers are Google's own. The vLLM team, which made DiffusionGemma the first diffusion model it supports natively, claims roughly 1,200 tokens/sec at batch size 1 on an H200 in FP8.
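A quick back-of-the-envelope pass over those figures helps put them in scale. The inputs below come straight from the numbers above; the one assumption is that "18GB" means 18 × 10^9 bytes. The bits-per-parameter value is only an upper bound, since VRAM also has to hold activations and caches, not just weights.

```python
# Figures quoted in the article (Google-reported where noted there).
TOTAL_PARAMS = 26e9      # total parameters across all experts
ACTIVE_PARAMS = 3.8e9    # parameters active per step
VRAM_BYTES = 18e9        # assumption: "18GB" read as 18 * 10**9 bytes
TOKENS_PER_SEC = 1000    # reported H100 throughput (lower bound, "more than 1,000")
BLOCK_TOKENS = 256       # tokens generated per diffusion block

# Share of the model that does work on any given step.
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

# Upper bound on average weight precision if everything fits in 18GB.
max_bits_per_param = VRAM_BYTES * 8 / TOTAL_PARAMS

# Wall-clock time to finish one 256-token block at the reported rate.
seconds_per_block = BLOCK_TOKENS / TOKENS_PER_SEC

print(f"active share of parameters: {active_fraction:.1%}")
print(f"max average bits per parameter: {max_bits_per_param:.2f}")
print(f"seconds per 256-token block: {seconds_per_block:.3f}")
```

In other words, roughly one parameter in seven is active per step, the quantized weights average under six bits each, and a full block lands in about a quarter of a second at the quoted H100 rate.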

And yes, it reasons. The model card lists a configurable thinking mode that emits an internal reasoning channel before the answer, which is unusual for a diffusion model.

The catch sits in plain sight: Google says DiffusionGemma loses to standard Gemma 4 across benchmarks. "For applications that demand maximum quality, we recommend deploying standard Gemma 4," the company writes. So this is a preview, not a replacement. The speed edge also fades under high-concurrency cloud loads, where batched autoregressive models stay ahead.

Weights are on Hugging Face now, with day-zero support across vLLM, Transformers, MLX, and Unsloth. A developer guide walks through the mechanics. Official llama.cpp support is coming soon.


Bottom Line

DiffusionGemma hits 1,000+ tokens/sec on one H100 by Google's count, but by the company's own admission it scores below standard Gemma 4 across benchmarks.

Quick Facts

  • 26B total parameters, 3.8B active per step (MoE)
  • Released June 10, 2026 under Apache 2.0
  • 1,000+ tokens/sec on H100, 700+ on RTX 5090 (Google-reported)
  • Generates 256-token blocks in parallel via diffusion
  • Runs in 18GB VRAM when quantized; underperforms Gemma 4 on benchmarks
Tags: DiffusionGemma, Google DeepMind, text diffusion, open weights, Gemma 4, local inference
