mGrowTech
Conversational image segmentation with Gemini 2.5

by Josh
July 22, 2025
in Google Marketing

The way AI visually understands images has evolved tremendously. Initially, AI could tell us “where” an object was using bounding boxes. Then, segmentation models arrived, precisely outlining an object’s shape. More recently, open-vocabulary models emerged, allowing us to segment objects using less common labels like “blue ski boot” or “xylophone” without needing a predefined list of categories.

Previous models matched pixels to nouns. However, the real challenge, conversational image segmentation (closely related to referring expression segmentation in the literature), demands a deeper understanding: parsing complex descriptive phrases. Rather than just identifying “a car,” what if we could identify “the car that is farthest away”?

Today, Gemini’s advanced visual understanding brings a new level of conversational image segmentation. Gemini now “understands” what you’re asking it to “see.”

Leveraging conversational image segmentation queries

The magic of this feature lies in the types of questions you can ask. By moving beyond simple single-word labels, you can interact with visual data in a more intuitive and powerful way. Consider the five categories of queries below.

1. Object relationships

Gemini can now identify objects based on their complex relationships to the objects around them.

1: Relational understanding: "the person holding the umbrella"

2: Ordering: "the third book from the left"

3: Comparative attributes: "the most wilted flower in the bouquet"


2. Conditional logic

Sometimes you need to query with conditional logic. For example, you can filter with queries like "food that is vegetarian". Gemini can also handle queries with negations like "the people who are not sitting".

Within an office meeting, the natural language query "the people who are not sitting" is used to overlay segmentation masks on the two individuals who are standing.

3. Abstract concepts

This is where Gemini’s world knowledge shines. You can ask it to segment things that don’t have a simple, fixed visual definition. This includes concepts like “damage,” “a mess,” or “opportunity.”

On a kitchen counter, a natural language segmentation overlay highlights a spill in response to the abstract query, "area that should be cleaned up".

4. In-image text

When appearance alone is not enough to distinguish the precise category of an object, the user might refer to it through a written text label present in the image. This requires the model to read text in the image (OCR), one of the strengths of Gemini 2.5.

In a bakery setting, the model uses natural language segmentation to overlay masks on "the pistachio baklava", distinguishing it from other nearby pastries based on in-image text.

5. Multi-lingual labels

Gemini is not restricted to a single language and can handle labels in many different languages.

A plate of food has natural language segmentation overlays identifying various components, with the model providing corresponding labels in French as requested by the prompt "tous les objects en français".
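All five query types above are plain-language strings. As a minimal sketch, assuming a conversational query can simply take the place of "the objects" in the recommended prompt quoted under "Recommended best practices" (our assumption, not a documented rule), a hypothetical helper `build_segmentation_prompt` could assemble the full prompt. The example queries are the ones used in this post.

```python
# Hypothetical helper (not part of any Google SDK): insert a conversational
# query into the recommended segmentation prompt quoted in this post.
PROMPT_LINES = (
    "Give the segmentation masks for {target}.",
    "Output a JSON list of segmentation masks where each entry contains the 2D "
    'bounding box in the key "box_2d", the segmentation mask in key "mask", '
    'and the text label in the key "label".',
    "Use descriptive labels.",
)

# One example query per category described above.
EXAMPLE_QUERIES = {
    "object relationships": "the third book from the left",
    "conditional logic": "the people who are not sitting",
    "abstract concepts": "area that should be cleaned up",
    "in-image text": "the pistachio baklava",
    "multi-lingual labels": "tous les objects en français",
}


def build_segmentation_prompt(query: str) -> str:
    """Return the recommended prompt with the query in place of 'the objects'."""
    target = query.strip().rstrip(".")
    if not target:
        raise ValueError("query must be a non-empty phrase")
    # Only the first line has a placeholder; the other lines are used verbatim.
    return "\n".join([PROMPT_LINES[0].format(target=target), *PROMPT_LINES[1:]])


if __name__ == "__main__":
    for category, query in EXAMPLE_QUERIES.items():
        print(f"--- {category} ---")
        print(build_segmentation_prompt(query))
```

Keeping the JSON instructions fixed and varying only the target phrase means the same response parser can handle every query type.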

Conversational image segmentation in action

Let’s explore how these query types could enable new use cases.

1. Unlocking creativity: Interactive media editing

This capability transforms creative workflows. Instead of using complex selection tools, a designer can now direct software with words, for example by asking it to select "the shadow cast by the building". This makes for a more fluid and intuitive process.

An aerial view of a park demonstrates a natural language segmentation overlay identifying "the shadow of the building".

2. Building a safer world: Intelligent safety & compliance monitoring

For workplace safety, you need to identify situations, not just objects. With a prompt like, "Highlight any employees on the factory floor not wearing a hard hat", Gemini comprehends the entire conditional instruction as a single query, producing a final, precise mask of only the non-compliant individuals.

At a construction site, a natural language segmentation overlay is applied to identify "the people not wearing a hard hat".
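Turning a list of per-person results into "a final, precise mask" is a small post-processing step. As we read the Gemini API segmentation documentation, each entry's `box_2d` is `[y0, x0, y1, x1]` normalized to 0 to 1000, and `mask` is a base64 PNG probability map (values 0 to 255) covering only that box. The minimal sketch below assumes the PNGs have already been decoded into 2D lists of integers (for example with an image library such as Pillow, not shown). It resizes each map to its box with nearest-neighbour sampling, binarizes it at an assumed threshold of 127, and ORs everything into one full-image mask. All function names are hypothetical.

```python
from typing import Iterable, List, Sequence, Tuple

Grid = List[List[int]]


def box_to_pixels(box_2d: Sequence[int], width: int, height: int) -> Tuple[int, int, int, int]:
    """Convert a [y0, x0, y1, x1] box normalized to 0..1000 into clamped pixel coords."""
    y0, x0, y1, x1 = (min(max(int(v), 0), 1000) for v in box_2d)
    return (y0 * height // 1000, x0 * width // 1000,
            y1 * height // 1000, x1 * width // 1000)


def resize_nearest(grid: Grid, out_h: int, out_w: int) -> Grid:
    """Nearest-neighbour resize of a 2D grid to out_h rows by out_w columns."""
    in_h, in_w = len(grid), len(grid[0])
    return [[grid[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
            for r in range(out_h)]


def union_masks(instances: Iterable[Tuple[Sequence[int], Grid]],
                width: int, height: int, threshold: int = 127) -> Grid:
    """OR every instance's binarized probability map into one height x width mask."""
    full = [[0] * width for _ in range(height)]
    for box_2d, prob_map in instances:
        y0, x0, y1, x1 = box_to_pixels(box_2d, width, height)
        box_h, box_w = y1 - y0, x1 - x0
        if box_h <= 0 or box_w <= 0 or not prob_map or not prob_map[0]:
            continue  # skip empty boxes or empty maps
        resized = resize_nearest(prob_map, box_h, box_w)
        for r in range(box_h):
            for c in range(box_w):
                if resized[r][c] > threshold:  # assumed cut-off for "inside"
                    full[y0 + r][x0 + c] = 1
    return full
```

For a compliance check, `instances` would hold only the entries returned for a query such as "the people not wearing a hard hat", so the union is exactly the non-compliant area.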

3. The future of claims: Nuanced insurance damage assessment

“Damage” is an abstract concept with many visual forms. An insurance adjuster can now use prompts like "Segment the homes with weather damage", and Gemini will use its world knowledge to identify the specific dents and textures associated with that type of damage, distinguishing it from a simple reflection or rust.

In an aerial photo of a subdivision, natural language segmentation is used to overlay masks on each "damaged house".

Why this matters for developers

1: Flexible Language: Move beyond rigid, predefined classes. The natural language approach gives you the flexibility to build solutions for the “long tail” of visual queries that are specific to your industry and users.

2: Simplified Developer Experience: Get started in minutes with a single API. There is no need to find, train, and host separate, specialized segmentation models. This accessibility lowers the barrier to entry for building sophisticated vision applications.
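To make the single-API point concrete, here is a minimal standard-library sketch of a Gemini API REST `generateContent` request that applies the settings recommended under "Recommended best practices" (`gemini-2.5-flash`, `thinkingBudget` of 0, JSON output). The endpoint, header, and field names follow the public REST API as we understand it; the helper names are hypothetical, and the official SDKs are the more common way to make this call.

```python
import base64
import json
import urllib.request
from typing import Any, Dict, Tuple

# Gemini API REST endpoint (v1beta) for a given model.
ENDPOINT = ("https://generativelanguage.googleapis.com/v1beta/models/"
            "{model}:generateContent")


def build_segmentation_request(image_bytes: bytes, prompt: str,
                               mime_type: str = "image/jpeg",
                               model: str = "gemini-2.5-flash"
                               ) -> Tuple[str, Dict[str, Any]]:
    """Return (url, JSON body) for a segmentation call using the recommended settings."""
    body = {
        "contents": [{
            "parts": [
                {"inline_data": {
                    "mime_type": mime_type,
                    "data": base64.b64encode(image_bytes).decode("ascii"),
                }},
                {"text": prompt},
            ],
        }],
        "generationConfig": {
            "thinkingConfig": {"thinkingBudget": 0},  # disable thinking
            "responseMimeType": "application/json",   # ask for JSON text
        },
    }
    return ENDPOINT.format(model=model), body


def send_request(url: str, body: Dict[str, Any], api_key: str) -> Dict[str, Any]:
    """POST the body; needs network access and a valid Gemini API key."""
    request = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "x-goog-api-key": api_key},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a key in hand, `send_request(*build_segmentation_request(data, prompt), api_key)` returns the parsed response, and the model's JSON text is normally found at `response["candidates"][0]["content"]["parts"][0]["text"]`.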

Start building today

We believe that giving language a direct, pixel-level connection to vision will unlock a new generation of intelligent applications. We are incredibly excited to see what you will build.

Get started right away in Google AI Studio via our interactive Spatial Understanding demo.

Or if you’d prefer a Python environment, feel free to start with our interactive Spatial Understanding colab.

To start building with the Gemini API, visit our developer guide and read more about starting with segmentation. You can also join our developer forum to meet other builders, discuss your use cases, and get help from the Gemini API team.

Recommended best practices

For best results, we recommend the following practices:

1: Use the gemini-2.5-flash model

2: Disable thinking (set thinkingBudget=0)

3: Stay close to the recommended prompt, and request JSON as output format.

Give the segmentation masks for the objects. 
Output a JSON list of segmentation masks where each entry contains the 2D bounding box in the key "box_2d", the segmentation mask in key "mask", and the text label in the key "label". 
Use descriptive labels.

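A reply to this prompt still benefits from light defensive parsing. The minimal sketch below assumes the reply follows the prompt above and the format described in the Gemini API segmentation documentation: a `box_2d` of `[y0, x0, y1, x1]` normalized to 0 to 1000, and a `mask` holding a base64 PNG, typically prefixed with `data:image/png;base64,`. It strips an optional Markdown code fence, converts boxes to pixel coordinates, and keeps only entries whose mask decodes to PNG bytes. `parse_segments` and `Segment` are hypothetical names.

```python
import base64
import json
import re
from dataclasses import dataclass
from typing import List, Tuple

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
DATA_URL_PREFIX = "data:image/png;base64,"
# Matches an optional Markdown code fence at the start or end of the reply.
_FENCE = re.compile(r"^```(?:json)?\s*|\s*```$")


@dataclass
class Segment:
    label: str
    box_px: Tuple[int, int, int, int]  # (y0, x0, y1, x1) in pixels
    png_bytes: bytes                    # mask PNG, still encoded


def parse_segments(text: str, width: int, height: int) -> List[Segment]:
    """Parse the model's JSON reply into Segment objects, skipping bad entries."""
    items = json.loads(_FENCE.sub("", text.strip()))
    segments = []
    for item in items:
        y0, x0, y1, x1 = (min(max(int(v), 0), 1000) for v in item["box_2d"])
        box = (y0 * height // 1000, x0 * width // 1000,
               y1 * height // 1000, x1 * width // 1000)
        if box[2] <= box[0] or box[3] <= box[1]:
            continue  # degenerate box
        mask_field = item.get("mask", "")
        if mask_field.startswith(DATA_URL_PREFIX):
            mask_field = mask_field[len(DATA_URL_PREFIX):]
        try:
            png = base64.b64decode(mask_field, validate=True)
        except ValueError:  # binascii.Error is a ValueError subclass
            continue  # not valid base64
        if not png.startswith(PNG_SIGNATURE):
            continue  # not a PNG payload
        segments.append(Segment(item.get("label", ""), box, png))
    return segments
```

Decoding the PNG bytes into a probability map needs an image library and is left out of this sketch.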

Acknowledgements

We thank Weicheng Kuo, Rich Munoz, and Huizhong Chen for their work on Gemini segmentation, Junyan Xu for work on infrastructure, Guillaume Vernade for work on documentation and code samples, and the entire Gemini image understanding team, whose work culminated in this release. Finally, we would like to thank image understanding leads Xi Chen and Fei Xia and multimodal understanding lead Jean-Baptiste Alayrac.


