Advancing the frontier of video understanding with Gemini 2.5

We recently launched two new models in our Gemini family: Gemini 2.5 Pro Preview (05/06) and Gemini 2.5 Flash (04/17). These models mark a major leap in video understanding. Gemini 2.5 Pro achieves state-of-the-art performance on key video understanding benchmarks, surpassing recent models like GPT 4.1 under comparable testing conditions (same prompt and video frames).

Furthermore, it rivals specialized fine-tuned models on several challenging benchmarks (e.g. YouCook2 dense captioning and QVHighlights moment retrieval). For cost-sensitive applications, Gemini 2.5 Flash provides a highly competitive alternative.

Advancing the frontier of video understanding with Gemini 2.5

Evaluation of Gemini 2.5 vs. prior models on video understanding benchmarks.
Performance is measured by string-match accuracy for multiple-choice VideoQA, LLM-based accuracy for EgoTempo, R1@0.5 for QVHighlights and CIDEr for YouCook2.
*Videos were processed at 1fps and linearly subsampled to a maximum of 256 frames, except for 1H-VideoQA (7200 frames).

Combining video and code with Gemini 2.5

Gemini 2.5 is the first time a natively multimodal model can use audio-visual information seamlessly with code and other data formats. To illustrate the power of Gemini 2.5’s video understanding capabilities, we showcase some of the use cases that we’ve been most excited about below.

Transforming videos into interactive applications

Gemini 2.5 Pro unlocks new possibilities for transforming videos into interactive applications. Video To Learning App, a Google AI Studio starter app, uses Gemini 2.5 to make learning from video content more effective and engaging.

First, the model sees a YouTube URL along with a text prompt that explains how it should analyze the video. Gemini 2.5 Pro analyzes the video and crafts a detailed spec for a learning application which reinforces key ideas in the video.

The generated spec is then sent directly back to Gemini 2.5 Pro to generate the code for the application, as illustrated in the vision correction simulator application below. Gemini 2.5 Flash can achieve similar results, offering a glimpse into novel video use cases in domains such as education and interactive content creation.

Creating animations from video with p5.js

Gemini 2.5 Pro unlocks exciting creative possibilities, such as the ability to generate dynamic animations from videos with a single prompt. This capability opens up new avenues for use cases such as automated content generation and creating accessible video summaries.

For example, when given our video on Project Astra along with the prompt ‘Create an animation in p5.js covering the different landmarks seen in this video.‘, Gemini 2.5 Pro analyzes the footage and produces a corresponding p5.js animation. The animation visualizes the landmarks identified by Gemini 2.5 Pro in the same temporal order as in the video.

Retrieving and describing moments from video

Gemini 2.5 Pro excels at identifying specific moments within videos using audio-visual cues with significantly higher accuracy than previous video processing systems. For example, in this 10-minute video of the Google Cloud Next ’25 opening keynote, it accurately identifies 16 distinct segments related to product presentations, using both audio and visual cues from the video to do so.

Temporal reasoning

With its advanced moment retrieval capabilities, Gemini 2.5 Pro is also able to solve nuanced temporal reasoning problems such as counting. In this example, Gemini successfully counts 17 distinct occurrences where the main character uses their phone in the project Astra video.

Building with Gemini 2.5 video understanding

Video understanding in Gemini 2.5 Flash and Pro are available in Google AI Studio, the Gemini API, and Vertex AI. Support for YouTube videos is available via the Gemini API and Google AI Studio, enabling anyone to build applications with access to billions of videos.

The Gemini API now offers a ‘low’ media resolution parameter enabling Gemini 2.5 Pro to process ~6 hours of video with 2 million token context. This provides for a more cost-effective setting with competitive video understanding performance (e.g., 84.7% vs 85.2% accuracy on VideoMME) for many long video understanding use cases.

Google’s TV Streamer 4K doubles as a smart home hub and it’s on sale

Gemini in Chrome expands to India, New Zealand and Canada

We are inspired by the innovative video applications already emerging from the community and can’t wait to see what you build!

Acknowledgements

^{A big shoutout to} ^{Aaron Wade}^{for creating} ^{Video To Learning App}^{and for the Vision Correction simulator example showcased in the blogpost.}

^{We thank} ^{Sergi Caelles}^, ^{Boyu Wang}^and ^{Saarthak Khanna}^{for their contributions on the evaluations presented above,} ^{Angeliki Lazaridou}^{for inspiring some demo examples, and the entire Gemini video understanding team for the work culminating in this release. Finally, we would like to thank the video understanding leads} ^{Mario Lučić, Shuo-yiin Chang,} ^and^{Paul Natsev,} ^{and overall multimodal understanding lead} ^{Jean-Baptiste Alayrac.}