Google Releases Gemma 4 12B Encoder-Free Multimodal Model


Google DeepMind dropped Gemma 4 12B on Tuesday, a dense multimodal model that ditches the separate vision and audio encoders most models lean on. Vision and audio feed straight into the LLM backbone. The whole thing is meant to run locally on a laptop with 16GB of VRAM, and it ships under Apache 2.0.

The interesting part is the architecture. Google says it swapped the vision encoder for a lightweight embedding module built around a single matrix multiplication plus positional embeddings and normalizations. For audio, it removed the encoder entirely and projects the raw signal into the same embedding space as text tokens. The claimed payoff, per the launch post, is less compute, fewer parameters, and lower latency.
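To make the "single matrix multiplication" idea concrete, here is a minimal NumPy sketch of an encoder-free image embedding: cut the image into patches, project each patch with one matmul, add positional embeddings, and normalize. Everything specific in it is an assumption for illustration and not taken from Google's release: the function names, the 16-pixel patch size, the 64-wide model dimension, the randomly initialized weights, and the choice of RMS normalization.

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    grid = image.reshape(h // patch, patch, w // patch, patch, c)
    grid = grid.transpose(0, 2, 1, 3, 4)  # (rows, cols, patch, patch, C)
    return grid.reshape(-1, patch * patch * c)

def rms_norm(x, eps=1e-6):
    """Scale each token so its root-mean-square is roughly 1 (assumed norm)."""
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def embed_image(image, w_proj, pos_emb, patch):
    """Hypothetical encoder-free embedding: one matmul, positions, a norm."""
    patches = patchify(image, patch)          # (num_patches, patch*patch*C)
    tokens = patches @ w_proj                 # the single matrix multiplication
    tokens = tokens + pos_emb[: len(tokens)]  # one position vector per patch
    return rms_norm(tokens)

# Illustrative sizes only; the real model's dimensions are not in the article.
PATCH, D_MODEL = 16, 64
rng = np.random.default_rng(0)
image = rng.random((64, 48, 3))               # 4 x 3 grid of 16-pixel patches
w_proj = rng.normal(scale=0.02, size=(PATCH * PATCH * 3, D_MODEL))
pos_emb = rng.normal(scale=0.02, size=(12, D_MODEL))

tokens = embed_image(image, w_proj, pos_emb, PATCH)
print(tokens.shape)  # (12, 64)
```

The resulting rows have the same width as the (assumed) text embeddings, so they could be concatenated with text tokens and passed to the decoder without a separate vision tower in between.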

It is also the first mid-sized Gemma to take native audio, slotting between the edge-friendly E4B and the 26B MoE. Google claims performance "nearing" that 26B model at under half the memory footprint, though full benchmarks weren't published at launch and the comparison is self-reported.
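A quick back-of-the-envelope check puts the 16GB figure in context. The sketch below counts weight storage only, at a nominal 12 billion parameters and three common precisions. The precisions are an assumption: the article does not say which format Google has in mind for 16GB machines, and the estimate ignores the KV cache, activations, and runtime overhead, all of which add to the total.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

PARAMS = 12e9   # nominal parameter count from the article
VRAM_GB = 16    # memory budget quoted in the article

# Precisions are illustrative assumptions, not stated in the release.
for label, bits in [("bf16", 16), ("int8", 8), ("4-bit", 4)]:
    gb = weight_memory_gb(PARAMS, bits)
    verdict = "fits" if gb < VRAM_GB else "does not fit"
    print(f"{label:>5}: {gb:4.0f} GB of weights, {verdict} in {VRAM_GB} GB")
```

Under these assumptions the weights alone need 24 GB at 16-bit precision, 12 GB at 8-bit, and 6 GB at 4-bit, which suggests a 16GB laptop would rely on a quantized build.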

For clarity on the lineup, there is no separate 124B sibling: the largest Gemma 4 is the 31B Dense. Google also put out a companion developer guide and a skills repository alongside the weights, though a full paper has not yet appeared.

Weights are live on Hugging Face in pretrained and instruction-tuned variants. The context window runs to 256K tokens with multilingual support across 140-plus languages.

Bottom Line

Gemma 4 12B fits in 16GB of VRAM and replaces its vision encoder with a single matrix multiplication, with weights live on Hugging Face under Apache 2.0.

Quick Facts

12 billion parameters, dense decoder-only architecture
Runs on 16GB VRAM or unified memory
Apache 2.0 license
256K token context, 140+ languages
Performance "nearing" the 26B MoE (company-reported; full benchmarks not published)

Tags: Gemma, Google DeepMind, open weights, multimodal AI, local AI, Apache 2.0

Andrés Martínez

AI Content Writer

Andrés reports on the AI stories that matter right now. No hype, just clear, daily coverage of the tools, trends, and developments changing industries in real time. He makes the complex feel routine.

