Tuesday, March 10, 2026
mGrowTech

Gemini 3 Pro scores 69% trust in blinded testing up from 16% for Gemini 2.5: The case for evaluating AI on real-world trust, not academic benchmarks

by Josh
December 4, 2025
in Technology And Software



Just a few short weeks ago, Google debuted its Gemini 3 model, claiming it scored a leadership position in multiple AI benchmarks. But the challenge with vendor-provided benchmarks is that they are just that — vendor-provided.

A new vendor-neutral evaluation from Prolific, however, puts Gemini 3 at the top of the leaderboard. This isn't on a set of academic benchmarks; rather, it's on a set of real-world attributes that actual users and organizations care about. 

Prolific was founded by researchers at the University of Oxford. The company delivers high-quality, reliable human data to power rigorous research and ethical AI development. The company's “HUMAINE benchmark” applies this approach by using representative human sampling and blind testing to rigorously compare AI models across a variety of user scenarios, measuring not just technical performance but also user trust, adaptability and communication style.

The latest HUMAINE evaluation drew on blind tests with 26,000 users. Gemini 3 Pro's trust score surged from 16% to 69%, the highest ever recorded by Prolific. Put precisely, Gemini 3 Pro ranks number one in trust, ethics and safety 69% of the time across demographic subgroups, whereas its predecessor, Gemini 2.5 Pro, held the top spot only 16% of the time.

Overall, Gemini 3 ranked first in three of four evaluation categories: performance and reasoning; interaction and adaptiveness; and trust and safety. It lost only on communication style, where DeepSeek V3 topped preferences at 43%. The HUMAINE test also showed that Gemini 3 performed consistently well across 22 different demographic user groups, including variations in age, sex, ethnicity and political orientation. The evaluation also found that users are now five times more likely to choose the model in head-to-head blind comparisons.

But the ranking matters less than why it won.

"It's the consistency across a very wide range of different use cases, and a personality and a style that appeals across a wide range of different user types," Phelim Bradley, co-founder and CEO of Prolific, told VentureBeat. "Although in some specific instances, other models are preferred by either small subgroups or on a particular conversation type, it's the breadth of knowledge and the flexibility of the model across a range of different use cases and audience types that allowed it to win this particular benchmark."

How blinded testing reveals what academic benchmarks miss

HUMAINE's methodology exposes gaps in how the industry evaluates models. Users interact with two models simultaneously in multi-turn conversations. They don't know which vendors power each response. They discuss whatever topics matter to them, not predetermined test questions.
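The blinding step described above can be sketched in a few lines. This is a minimal illustration, not Prolific's implementation: the function names `blind_pair` and `unblind_vote` and the model names are hypothetical, and the only idea taken from the article is that participants see neutral labels rather than vendor names.

```python
import random

def blind_pair(model_1, model_2, rng):
    """Randomly assign two models to the neutral labels "A" and "B"."""
    pair = [model_1, model_2]
    rng.shuffle(pair)  # random order so position carries no vendor signal
    return {"A": pair[0], "B": pair[1]}

def unblind_vote(assignment, chosen_label):
    """Map a participant's choice of "A" or "B" back to the model name."""
    return assignment[chosen_label]

# Hypothetical model names; the participant only ever sees "A" and "B".
rng = random.Random(0)
assignment = blind_pair("model_x", "model_y", rng)
winner = unblind_vote(assignment, "A")
```

Keeping the label-to-model mapping on the evaluator's side, and revealing it only when votes are tallied, is what keeps brand perception out of the participant's judgment.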

It's the sample itself that matters. HUMAINE uses representative sampling across U.S. and UK populations, controlling for age, sex, ethnicity and political orientation. This reveals something static benchmarks can't capture: Model performance varies by audience.

"If you take an AI leaderboard, the majority of them still could have a fairly static list," Bradley said. "But for us, if you control for the audience, we end up with a slightly different leaderboard, whether you're looking at a left-leaning sample, right-leaning sample, U.S., UK. And I think age was actually the most different stated condition in our experiment."

For enterprises deploying AI across diverse employee populations, this matters. A model that performs well for one demographic may underperform for another.

The methodology also addresses a fundamental question in AI evaluation: Why use human judges at all when AI could evaluate itself? Bradley noted that his firm does use AI judges in certain use cases, although he stressed that human evaluation is still the critical factor.

"We see the biggest benefit coming from smart orchestration of both LLM judge and human data, both have strengths and weaknesses, that, when smartly combined, do better together," said Bradley. "But we still think that human data is where the alpha is. We're still extremely bullish that human data and human intelligence is required to be in the loop."

What trust means in AI evaluation

Trust, ethics and safety measures user confidence in reliability, factual accuracy and responsible behavior. In HUMAINE's methodology, trust isn't a vendor claim or a technical metric — it's what users report after blinded conversations with competing models.

The 69% figure is a probability across demographic groups: how often the model ranks first on trust, ethics and safety when results are broken out by subgroup. This consistency matters more than an aggregate score because organizations serve diverse populations.
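One way to read a "ranks first X% of the time across subgroups" figure is as the share of subgroups in which a model has the top score. The sketch below assumes that reading; the function name `top_rank_share`, the subgroup labels and the scores are all made up for illustration and are not HUMAINE data or Prolific's actual calculation.

```python
from collections import Counter

def top_rank_share(scores_by_group):
    """Return, for each winning model, the fraction of subgroups it ranks first in.

    scores_by_group maps a subgroup label to a dict of {model: score}.
    Ties go to the alphabetically first model so results are deterministic.
    """
    total = len(scores_by_group)
    if total == 0:
        return {}
    winners = Counter()
    for scores in scores_by_group.values():
        best = min(scores, key=lambda m: (-scores[m], m))
        winners[best] += 1
    return {model: count / total for model, count in winners.items()}

# Made-up trust scores per subgroup, for illustration only.
example = {
    "18-34": {"model_a": 0.61, "model_b": 0.55},
    "35-54": {"model_a": 0.58, "model_b": 0.60},
    "55+": {"model_a": 0.64, "model_b": 0.52},
    "left-leaning": {"model_a": 0.59, "model_b": 0.57},
}
shares = top_rank_share(example)
print(shares)
```

In this toy data, `model_a` leads in three of the four subgroups, so its share is 0.75 even though `model_b` wins one group outright. A single pooled average would hide that split.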

"There was no awareness that they were using Gemini in this scenario," Bradley said. "It was based only on the blinded multi-turn response."

This separates perceived trust from earned trust. Users judged model outputs without knowing which vendor produced them, eliminating Google's brand advantage. For customer-facing deployments where the AI vendor remains invisible to end users, this distinction matters.

What enterprises should do now

When comparing models, the first thing enterprises should do is adopt an evaluation framework that works.

"It is increasingly challenging to evaluate models exclusively based on vibes," Bradley said. "I think increasingly we need more rigorous, scientific approaches to truly understand how these models are performing."

The HUMAINE data provides a framework: Test for consistency across use cases and user demographics, not just peak performance on specific tasks. Blind the testing to separate model quality from brand perception. Use representative samples that match your actual user population. Plan for continuous evaluation as models change.
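For the "representative samples" step, a team might start by turning its own user mix into per-group recruitment quotas. The following is a rough sketch under that assumption: the function name `stratum_quotas`, the age bands and the weights are hypothetical, and largest-remainder rounding is simply one common way to make the quotas add up to the panel size.

```python
def stratum_quotas(population_weights, sample_size):
    """Split sample_size across strata in proportion to population_weights.

    Uses largest-remainder rounding so the quotas sum exactly to sample_size.
    """
    total = sum(population_weights.values())
    exact = {k: sample_size * w / total for k, w in population_weights.items()}
    quotas = {k: int(x) for k, x in exact.items()}  # round down first
    leftover = sample_size - sum(quotas.values())
    # Give the remaining seats to strata with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda k: exact[k] - quotas[k], reverse=True)
    for k in by_remainder[:leftover]:
        quotas[k] += 1
    return quotas

# Hypothetical user mix, expressed as percentage weights.
weights = {"18-34": 30, "35-54": 35, "55+": 35}
quotas = stratum_quotas(weights, 200)
print(quotas)
```

The same function can be rerun whenever the user base shifts, which fits the article's point that evaluation should be continuous rather than a one-off.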

For enterprises looking to deploy AI at scale, this means moving beyond "which model is best" to "which model is best for our specific use case, user demographics and required attributes."

The rigor of representative sampling and blind testing provides the data to make that determination, something technical benchmarks and vibes-based evaluation cannot deliver.


