Summarized by Dodly:

Gemini Embedding 2: Multimodal AI Goes Live


Summary

Google DeepMind has launched Gemini Embedding 2, a natively multimodal embedding model that maps text, images, video, and audio into a single unified embedding space. Because raw media such as images or video clips can be embedded directly, developers no longer need an intermediate text-conversion step, which simplifies their architectures. The model supports over 100 languages out of the box and offers flexible output dimensions from 768 to 3072, using Matryoshka Representation Learning to maintain quality at the smaller sizes. According to the reported benchmarks, it outperforms leading models on text, image, and video tasks and adds strong speech capabilities. Its primary use is as a retrieval backbone for multimodal RAG and agentic workflows, letting agents query several media types at once without separate ingestion pipelines. It is also optimized for search, question answering, fact-checking, and classification. Gemini Embedding 2 is available now via the Gemini API and the Gemini Enterprise Agent platform, with documentation and an interactive app for developers to explore multimodal search.
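To make the flexible dimensions and the single shared space more concrete, here is a minimal Python sketch of the mechanics usually associated with Matryoshka-style embeddings: keep the leading components of a full-size vector, rescale to unit length, and rank items of any media type by cosine similarity. It does not call the Gemini API. The vectors are random stand-ins, the file names are hypothetical, and the helpers `truncate_and_normalize` and `top_k` are illustrative names rather than anything from Google's SDK. Random vectors also lack the property that makes truncation work well in a Matryoshka-trained model, so only the arithmetic is shown here, not retrieval quality.

```python
import math
import random


def truncate_and_normalize(vec, dim):
    """Keep the first `dim` components and rescale them to unit length."""
    if not 0 < dim <= len(vec):
        raise ValueError("dim must be between 1 and len(vec)")
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in head]


def cosine(a, b):
    """Cosine similarity of two unit-length vectors (a plain dot product)."""
    return sum(x * y for x, y in zip(a, b))


def top_k(query, corpus, k=3):
    """Return the k (item_id, score) pairs most similar to the query."""
    scored = [(item_id, cosine(query, vec)) for item_id, vec in corpus.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Stand-in data (assumption): random vectors in place of real model output.
# In a real system each vector would come from the embedding model, whatever
# the media type of the underlying item.
rng = random.Random(0)
FULL_DIM, SMALL_DIM = 3072, 768  # the two ends of the range in the summary
raw = {
    name: [rng.gauss(0.0, 1.0) for _ in range(FULL_DIM)]
    for name in ("doc.txt", "photo.png", "clip.mp4", "memo.wav")
}

# One index holds every media type because all vectors share one space.
corpus = {name: truncate_and_normalize(v, SMALL_DIM) for name, v in raw.items()}

# Query with the (truncated) vector of one of the items.
query = truncate_and_normalize(raw["clip.mp4"], SMALL_DIM)
results = top_k(query, corpus, k=2)
print(results)
```

The smaller index uses a quarter of the storage of the full 3072-dimensional vectors. How much retrieval quality survives the truncation depends on the model's training, which this sketch does not simulate.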
