Google Gemini Embedding 2: First Multimodal Embedding Model


Google released Gemini Embedding 2 on March 10, its first natively multimodal embedding model. Available now in public preview as gemini-embedding-2-preview via the Gemini API and Vertex AI, the model takes text, images, video, audio, and PDF documents and projects them all into one unified vector space. Five modalities, one index. For teams stitching together CLIP derivatives and separate audio pipelines, this collapses a lot of plumbing.

The specs: up to 8,192 input tokens for text (quadruple the 2,048 limit of its predecessor), six images per request in PNG or JPEG, video up to 120 seconds in MP4 or MOV, native audio ingestion without intermediate transcription, and PDFs up to six pages. You can also mix modalities in a single request, passing an image alongside text and getting back one embedding that captures cross-modal relationships. Output dimensions default to 3,072 but scale down to 1,536 or 768 via Matryoshka Representation Learning, letting developers trade accuracy for cheaper storage.
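The storage trade-off can be sketched in a few lines. With Matryoshka-style embeddings, the usual technique is to keep the leading components of the full vector and rescale the result to unit length. The sketch below assumes that approach applies here and does the truncation client-side; the article does not say whether the API returns reduced vectors already normalized. The helper names (truncate_embedding, storage_bytes) and the stand-in vector are hypothetical, not part of any Google SDK.

```python
import math

def truncate_embedding(vec, dims):
    """Keep the first `dims` components and rescale to unit length."""
    if dims > len(vec):
        raise ValueError("dims exceeds the embedding length")
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in head]

def storage_bytes(num_vectors, dims, bytes_per_value=4):
    """Raw storage for float32 vectors, ignoring any index overhead."""
    return num_vectors * dims * bytes_per_value

# Stand-in for a 3,072-dimensional embedding (not real model output).
full = [math.sin(i + 1) for i in range(3072)]
small = truncate_embedding(full, 768)

full_size = storage_bytes(1_000_000, 3072)   # one million vectors at 3,072 dims
small_size = storage_bytes(1_000_000, 768)   # the same corpus at 768 dims
```

At four bytes per value, one million vectors take roughly 12.3 GB at 3,072 dimensions and roughly 3.1 GB at 768, a fourfold saving before any index overhead.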

Google claims top placement on the MTEB Multilingual leaderboard, though these results are self-reported. Text embedding costs $0.20 per million tokens, up from $0.15 for the text-only embedding-001. One catch worth flagging: embedding-001 and the new model produce vectors in incompatible spaces, so old and new embeddings cannot be compared or mixed in one index. Anyone upgrading in production needs to re-embed their entire dataset. No migration path exists.
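One way to keep incompatible vectors from mixing during an upgrade is to store the producing model's ID next to every vector and refuse cross-model comparisons. What follows is a minimal sketch of that bookkeeping, not a feature of the Gemini API or of any particular vector database. StoredVector, needs_reembedding, and similarity are hypothetical names, and the model ID strings are copied from the article.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StoredVector:
    doc_id: str
    model_id: str    # which embedding model produced `values`
    values: tuple

def needs_reembedding(records, target_model):
    """Return the doc_ids whose vectors came from a different model."""
    return [r.doc_id for r in records if r.model_id != target_model]

def similarity(a, b):
    """Dot product, allowed only within a single model's vector space."""
    if a.model_id != b.model_id:
        raise ValueError(
            f"cannot compare vectors from {a.model_id} and {b.model_id}"
        )
    return sum(x * y for x, y in zip(a.values, b.values))

# Toy two-dimensional vectors for illustration only.
records = [
    StoredVector("doc-1", "embedding-001", (1.0, 0.0)),
    StoredVector("doc-2", "gemini-embedding-2-preview", (0.0, 1.0)),
]
stale = needs_reembedding(records, "gemini-embedding-2-preview")
```

Here stale lists every document that still needs a pass through the new model, and similarity raises an error rather than returning a meaningless score across the two spaces.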

Over 100 languages are supported. Early access partners report use cases spanning legal discovery across multimedia records and audio-first knowledge bases that skip transcription entirely. The model is accessible through LangChain, LlamaIndex, Weaviate, ChromaDB, and other popular frameworks. Batch API support at 50% off is available for workloads that don't need real-time responses.
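For budgeting a full re-embed, the arithmetic is simple enough to sketch. The figures below come from the article: $0.20 per million text tokens and 50% off through the Batch API. The sketch assumes the discount applies directly to that text rate and ignores non-text modalities, whose pricing the article does not give. The function name and the 500-million-token corpus are hypothetical.

```python
PRICE_PER_MILLION_TOKENS = 0.20   # USD, text input, as reported in the article
BATCH_DISCOUNT = 0.50             # assumed to apply to the text rate above

def embedding_cost(total_tokens, batch=False):
    """Estimated USD cost of embedding `total_tokens` text tokens."""
    cost = total_tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS
    if batch:
        cost *= 1 - BATCH_DISCOUNT
    return cost

# Hypothetical corpus of 500 million tokens.
realtime_cost = embedding_cost(500_000_000)
batch_cost = embedding_cost(500_000_000, batch=True)
```

Under those assumptions, the hypothetical corpus costs about $100 to embed in real time and about $50 through the batch route.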

Bottom Line

Google's first multimodal embedding model covers five data types in a single API call at $0.20 per million text tokens, but upgrading from embedding-001 requires a full re-embed of existing datasets.

Quick Facts

Model ID: gemini-embedding-2-preview
Modalities: text, images, video, audio, PDFs
Text context: 8,192 tokens (up from 2,048)
Video limit: 120 seconds per request
Text pricing: $0.20 per 1M tokens (company-reported)
Output dimensions: 3,072 / 1,536 / 768

Tags: Google, embeddings, multimodal AI, Gemini, vector search, RAG, semantic search

Andrés Martínez

AI Content Writer

Andrés reports on the AI stories that matter right now. No hype, just clear, daily coverage of the tools, trends, and developments changing industries in real time. He makes the complex feel routine.
