
Google Releases Gemma 4 12B Encoder-Free Multimodal Model

Open-weights model handles text, image, audio and video, runs on 16GB of VRAM.

Andrés Martínez, AI Content Writer
June 4, 2026

Google DeepMind dropped Gemma 4 12B on Tuesday, a dense multimodal model that ditches the separate vision and audio encoders most models lean on. Vision and audio feed straight into the LLM backbone. The whole thing is meant to run locally on a laptop with 16GB of VRAM, and it ships under Apache 2.0.

The interesting part is the architecture. Google says it swapped the vision encoder for a lightweight embedding module built around a single matrix multiplication plus positional embeddings and normalizations, and for audio it removed the encoder entirely and projected the raw signal into the same space as text tokens. Less compute, fewer parameters, lower latency. That tracks with what the launch post describes.
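As a rough illustration of that description, here is what an encoder-free image path could look like: cut the image into patches, project them with one matrix multiplication, add positional embeddings, and normalize. This is a sketch and not Google's implementation. The function names, the patch size, the choice of RMS normalization, the order of the steps, and the random weights are all assumptions made for the example.

```python
import math
import random


def patchify(image, patch):
    """Split an H x W x C image (nested lists) into flattened square patches."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height - patch + 1, patch):
        for left in range(0, width - patch + 1, patch):
            flat = []
            for y in range(top, top + patch):
                for x in range(left, left + patch):
                    flat.extend(image[y][x])
            patches.append(flat)
    return patches


def matmul(a, b):
    """Plain matrix product of a (n x k) and b (k x m)."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def rms_norm(vec, eps=1e-6):
    """Scale a vector to roughly unit root-mean-square (assumed normalization)."""
    rms = math.sqrt(sum(v * v for v in vec) / len(vec) + eps)
    return [v / rms for v in vec]


def embed_image(image, patch, weight, pos_emb):
    """Turn an image into token-like vectors without a separate vision encoder."""
    patches = patchify(image, patch)
    if len(patches) != len(pos_emb):
        raise ValueError("need one positional embedding per patch")
    projected = matmul(patches, weight)  # the single matrix multiplication
    return [
        rms_norm([p + e for p, e in zip(row, pos)])
        for row, pos in zip(projected, pos_emb)
    ]


# Hypothetical sizes: an 8x8 RGB image, 4x4 patches, 16-dimensional embeddings.
random.seed(0)
PATCH, CHANNELS, D_MODEL = 4, 3, 16
image = [[[random.random() for _ in range(CHANNELS)] for _ in range(8)] for _ in range(8)]
weight = [[random.gauss(0, 0.02) for _ in range(D_MODEL)]
          for _ in range(PATCH * PATCH * CHANNELS)]
pos_emb = [[random.gauss(0, 0.02) for _ in range(D_MODEL)] for _ in range(4)]

tokens = embed_image(image, PATCH, weight, pos_emb)
print(len(tokens), len(tokens[0]))  # 4 16
```

The resulting vectors have the same width as the (hypothetical) text embeddings, which is the point of the design: image patches can sit in the same sequence as text tokens without a dedicated encoder stack in front of the model.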

It is also the first mid-sized Gemma to take native audio, slotting between the edge-friendly E4B and the 26B MoE. Google claims performance "nearing" that 26B model at under half the memory footprint, though full benchmarks weren't published at launch and the comparison is self-reported.
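The memory claims can be sanity-checked with back-of-the-envelope arithmetic. The sketch below counts weight storage only, as parameters times bytes per parameter. The article does not say which precision Google has in mind for the 16GB figure, so the bit widths here are assumptions, and the KV cache, activations, and runtime overhead are left out. It also assumes every parameter of the 26B MoE stays resident in memory.

```python
def weight_memory_gb(params_billions: float, bits_per_param: int) -> float:
    """Approximate weight storage in decimal gigabytes (weights only)."""
    return params_billions * bits_per_param / 8


VRAM_BUDGET_GB = 16  # figure quoted in the article

# Assumed precisions; the launch material may target a different one.
for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    size = weight_memory_gb(12, bits)
    verdict = "fits" if size <= VRAM_BUDGET_GB else "does not fit"
    print(f"12B at {label}: {size:.0f} GB of weights, {verdict} in {VRAM_BUDGET_GB} GB")

# At equal precision the ratio depends only on parameter counts.
ratio = weight_memory_gb(12, 8) / weight_memory_gb(26, 8)
print(f"12B vs 26B at equal precision: {ratio:.0%}")  # 46%
```

Under these assumptions, 16-bit weights alone would come to 24 GB, so fitting in 16GB implies some form of quantization, while the 12-to-26 parameter ratio is consistent with the "under half" footprint claim.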

Two clarifications on the lineup and the release: there is no separate 124B sibling, and the largest Gemma 4 is the 31B Dense. The launch was also not weights-only. Google put out a companion developer guide and a skills repository alongside the weights, though a full paper still hasn't appeared.

Weights are live on Hugging Face in pretrained and instruction-tuned variants. The context window runs to 256K tokens with multilingual support across 140-plus languages.


Bottom Line

Gemma 4 12B fits in 16GB of VRAM and replaces its vision encoder with a lightweight embedding module built around a single matrix multiplication. Weights are live on Hugging Face under Apache 2.0.

Quick Facts

  • 12 billion parameters, dense decoder-only architecture
  • Runs on 16GB VRAM or unified memory
  • Apache 2.0 license
  • 256K token context, 140+ languages
  • Performance "nearing" the 26B MoE (company-reported; full benchmarks not published)