Elastic last week released jina-embeddings-v5-omni, a multimodal embedding family that maps text, images, audio, and video into one vector space. Two sizes ship: small at 1.57B parameters and nano at 0.9B. Both are available on Hugging Face and through the Jina API.
The pitch is drop-in compatibility. The v5-omni models share a text embedding space with the existing v5-text models, so teams running text search can add multimedia vectors to the index they already have instead of rebuilding their pipeline. According to the technical report, Jina froze the text backbone and the new media encoders and trained only the projector layers between them, roughly 0.35% of total weights.
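What a shared vector space means for an existing index can be shown with a toy sketch. The code below calls no Jina model or API. The three-dimensional vectors, the item ids, and the `search` helper are all made up for illustration, standing in for real embeddings. It only shows the mechanics: media vectors are appended next to text vectors that are left untouched, and one cosine-similarity query ranks across both.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical existing index, as if built earlier with a text-only model.
index = [
    {"id": "doc-1", "kind": "text", "vec": [0.9, 0.1, 0.0]},
    {"id": "doc-2", "kind": "text", "vec": [0.0, 0.2, 0.9]},
]
text_vectors_before = [list(item["vec"]) for item in index]

# New media items are appended; the text entries are not re-embedded.
# These vectors are invented stand-ins for multimodal embeddings.
index.append({"id": "img-1", "kind": "image", "vec": [0.8, 0.3, 0.1]})
index.append({"id": "clip-1", "kind": "audio", "vec": [0.1, 0.9, 0.2]})

def search(query_vec, items, top_k=2):
    """Return the ids of the top_k items ranked by cosine similarity."""
    ranked = sorted(items, key=lambda it: cosine(query_vec, it["vec"]), reverse=True)
    return [it["id"] for it in ranked[:top_k]]

# A made-up text query vector; results can mix text and media hits.
results = search([1.0, 0.2, 0.0], index)
print(results)
```

With these toy numbers the top two hits are one text document and one image, which is the behavior the compatibility claim is about. Whether real v5-omni vectors line up this cleanly with an existing v5-text index is something to verify on your own data.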
"Make multimodal search as easy and scalable as text search already is," said Ken Exner, Elastic's chief product officer, which is the standard pitch. The numbers worth checking: Jina reports v5-omni-small leads its size class on image retrieval (MIEB) and audio retrieval (MAEB), with self-reported scores of 56.05 on image and 51.46 on audio. Video is the soft spot at 41.20, trailing the larger LCO-7B's 47.41.
The models run locally, through the Jina API, or via Elastic Inference Service. Task-specific variants for retrieval, classification, clustering, and text-matching are also up on Hugging Face. GGUF builds for llama.cpp require a fork; the patches aren't upstream yet.
Bottom Line
v5-omni-small trains just 0.35% of its weights and shares a vector space with v5-text, so existing text indexes can be kept without re-embedding.
Quick Facts
- Small variant: 1.57B parameters
- Nano variant: 0.9B parameters
- Trainable weights: 0.35% of total (company-reported)
- Modalities: text, image, audio, video
- Announced: May 11, 2026
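
For a sense of scale, the Quick Facts figures can be turned into rough absolute counts. This is back-of-the-envelope arithmetic only. It assumes the company-reported 0.35% fraction applies to each size's rounded parameter count, which the source does not spell out for the nano variant.

```python
# Rounded parameter counts from the Quick Facts above.
total_params = {"small": 1_570_000_000, "nano": 900_000_000}

# 0.35% expressed as an integer ratio (35 / 10,000) to avoid float rounding.
# Assumption: the same fraction holds for both sizes.
FRACTION_NUM, FRACTION_DEN = 35, 10_000

trainable = {
    name: total * FRACTION_NUM // FRACTION_DEN
    for name, total in total_params.items()
}

for name, count in trainable.items():
    print(f"{name}: about {count:,} trainable of {total_params[name]:,} total")
```

Under that assumption, roughly 5.5 million weights would be trained in the small model and about 3.15 million in nano, with everything else frozen.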



