On March 10, Google released Gemini Embedding 2, its first natively multimodal embedding model. Available now in public preview as gemini-embedding-2-preview via the Gemini API and Vertex AI, the model takes text, images, video, audio, and PDF documents and projects them all into one unified vector space. Five modalities, one index. For teams stitching together CLIP derivatives and separate audio pipelines, this collapses a lot of plumbing.
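To make "one index" concrete, here is a minimal sketch of nearest-neighbor search over a single collection that mixes modalities. It is not Google's API: the `Item` class, the `search` helper, and the three-dimensional toy vectors are all hypothetical stand-ins for embeddings a real model would return.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Item:
    """One indexed asset. The vector is a made-up stand-in for a real embedding."""
    item_id: str
    modality: str  # e.g. "text", "image", "audio"
    vector: List[float]


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(index: List[Item], query: List[float], k: int = 3) -> List[Tuple[float, Item]]:
    """Rank every item against the query, regardless of modality."""
    scored = [(cosine(query, item.vector), item) for item in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy index: hypothetical vectors, chosen by hand for illustration only.
TOY_INDEX = [
    Item("contract.pdf", "pdf", [1.0, 0.0, 0.0]),
    Item("signing.jpg", "image", [0.9, 0.1, 0.0]),
    Item("voicemail.wav", "audio", [0.0, 1.0, 0.0]),
]
```

Because every item lives in the same space, a text query can surface an image or an audio clip without a per-modality index or a separate ranking step.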
The specs: up to 8,192 input tokens for text (quadruple the 2,048 limit of its predecessor), six images per request in PNG or JPEG, video up to 120 seconds in MP4 or MOV, native audio ingestion without intermediate transcription, and PDFs up to six pages. You can also mix modalities in a single request, passing an image alongside text and getting back one embedding that captures cross-modal relationships. Output dimensions default to 3,072 but scale down to 1,536 or 768 via Matryoshka Representation Learning, letting developers trade accuracy for cheaper storage.
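The storage trade-off can be sketched in a few lines. Matryoshka-style embeddings are trained so that the leading components carry most of the signal, which means a shorter vector is just a prefix of the full one. The helper names below are hypothetical, the client-side re-normalization step is an assumption (the service may already handle it when a smaller size is requested), and the 4 bytes per value assumes float32 storage.

```python
import math
from typing import List, Sequence

# Output sizes named in the article; treated here as the allowed set.
SUPPORTED_DIMS = (3072, 1536, 768)


def truncate_embedding(vec: Sequence[float], dim: int) -> List[float]:
    """Keep the first `dim` components and rescale the result to unit length."""
    if dim not in SUPPORTED_DIMS:
        raise ValueError(f"dim must be one of {SUPPORTED_DIMS}")
    if len(vec) < dim:
        raise ValueError("vector is shorter than the requested dimension")
    head = list(vec[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # nothing to rescale
    return [x / norm for x in head]


def storage_bytes(num_vectors: int, dim: int, bytes_per_value: int = 4) -> int:
    """Raw vector storage, assuming float32 values and no index overhead."""
    return num_vectors * dim * bytes_per_value
```

Under these assumptions, one million vectors take 12,288,000,000 bytes at 3,072 dimensions and 3,072,000,000 bytes at 768, a fourfold saving before any index overhead.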
Google claims top placement on the MTEB Multilingual leaderboard, though the results are self-reported. Text embedding pricing sits at $0.20 per million tokens, up from $0.15 for the text-only embedding-001. One catch worth flagging: the vector spaces of embedding-001 and the new model are incompatible, so vectors from one cannot be compared against vectors from the other. Anyone upgrading in production needs to re-embed their entire dataset. No migration path exists.
Over 100 languages are supported. Early access partners report use cases spanning legal discovery across multimedia records and audio-first knowledge bases that skip transcription entirely. The model is accessible through LangChain, LlamaIndex, Weaviate, ChromaDB, and other popular frameworks. Batch API support at 50% off is available for workloads that don't need real-time responses.
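For teams weighing a full re-embed, the arithmetic is simple enough to sketch. The function below uses the $0.20 per million text tokens and the 50% batch discount quoted above. The function name and the corpus size in the usage note are hypothetical, and the figures cover text tokens only.

```python
# Prices as reported in the article; text tokens only.
TEXT_PRICE_PER_MILLION_TOKENS = 0.20  # USD
BATCH_DISCOUNT = 0.50                 # 50% off via the Batch API


def embedding_cost_usd(total_tokens: int, batch: bool = False) -> float:
    """Estimate the cost of embedding `total_tokens` text tokens."""
    if total_tokens < 0:
        raise ValueError("total_tokens must be non-negative")
    cost = total_tokens / 1_000_000 * TEXT_PRICE_PER_MILLION_TOKENS
    if batch:
        cost *= 1 - BATCH_DISCOUNT
    return cost
```

For a hypothetical corpus of 10 million documents averaging 500 tokens each (5 billion tokens), `embedding_cost_usd(5_000_000_000)` comes to about $1,000 at the real-time rate and about $500 with `batch=True`.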
Bottom Line
Google's first multimodal embedding model covers five data types in a single API call at $0.20 per million text tokens, but upgrading from embedding-001 requires a full re-embed of existing datasets.
Quick Facts
- Model ID: gemini-embedding-2-preview
- Modalities: text, images, video, audio, PDFs
- Text context: 8,192 tokens (up from 2,048)
- Video limit: 120 seconds per request
- Text pricing: $0.20 per 1M tokens (company-reported)
- Output dimensions: 3,072 / 1,536 / 768




