Today we’re releasing Gemini Embedding 2, our first fully multimodal embedding model built on the Gemini architecture, in Public Preview via the Gemini API and Vertex AI.
Expanding on our previous text-only foundation, Gemini Embedding 2 maps text, images, videos, audio and documents into a single, unified embedding space, and captures semantic intent across over 100 languages. This simplifies complex pipelines and enhances a wide variety of multimodal downstream tasks—from Retrieval-Augmented Generation (RAG) and semantic search to sentiment analysis and data clustering.
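Because every modality lands in the same vector space, downstream tasks like semantic search reduce to a nearest-neighbor lookup over embeddings. The sketch below shows just that ranking step with plain NumPy; the embedding vectors themselves would come from the model, and the function names here are illustrative.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def search(query_embedding: np.ndarray,
           corpus_embeddings: list[np.ndarray],
           top_k: int = 3) -> list[tuple[int, float]]:
    """Rank corpus items by similarity to the query.

    Because all modalities share one embedding space, the corpus can mix
    embeddings of text snippets, images, audio clips, and documents freely.
    """
    scores = [cosine_similarity(query_embedding, e) for e in corpus_embeddings]
    ranked = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]  # [(corpus_index, score), ...]
```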
New modalities and flexible output dimensions
The model is based on Gemini and leverages its best-in-class multimodal understanding capabilities to create high-quality embeddings across the following inputs (a short usage sketch follows the list):
- Text: supports an expansive context of up to 8192 input tokens
- Images: capable of processing up to 6 images per request, supporting PNG and JPEG formats
- Videos: supports up to 120 seconds of video input in MP4 and MOV formats
- Audio: natively ingests and embeds audio data without needing intermediate text transcriptions
- Documents: directly embeds PDFs up to 6 pages long
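Here is a minimal sketch of embedding a single text input through the Gemini API's `embed_content` method with the Python SDK. The model ID `gemini-embedding-2` is illustrative; check the Gemini API model reference for the exact identifier and supported configuration.

```python
from google import genai
from google.genai import types

client = genai.Client()  # reads the API key from the environment

# Embed a single text input; output_dimensionality controls the vector size.
result = client.models.embed_content(
    model="gemini-embedding-2",  # illustrative model ID
    contents="Alpine lakes reflecting snow-capped peaks at sunrise",
    config=types.EmbedContentConfig(output_dimensionality=768),
)

[embedding] = result.embeddings
print(len(embedding.values))  # -> 768
```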
Beyond processing one modality at a time, the model natively understands interleaved input, so you can pass multiple modalities (e.g., image + text) in a single request. This lets it capture the nuanced relationships between different media types, unlocking a more accurate understanding of complex, real-world data.
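As a rough illustration of interleaved input, the sketch below pairs an image with a text caption in one request. The exact request shape for multimodal embeddings is an assumption here, modeled on how `generate_content` accepts mixed `Part` objects; consult the API documentation for the supported form.

```python
from google import genai
from google.genai import types

client = genai.Client()

# Load an image to pair with a text caption in a single request.
with open("product_photo.jpg", "rb") as f:
    image_bytes = f.read()

# Speculative call shape: an interleaved image + text contents list,
# mirroring how generate_content accepts mixed Parts.
result = client.models.embed_content(
    model="gemini-embedding-2",  # illustrative model ID
    contents=[
        types.Part.from_bytes(data=image_bytes, mime_type="image/jpeg"),
        "Red trail-running shoe, size 42, lightly used",
    ],
)

print(len(result.embeddings[0].values))
```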