mGrowTech

The 70% factuality ceiling: why Google’s new ‘FACTS’ benchmark is a wake-up call for enterprise AI

by Josh
December 11, 2025
in Technology And Software



There's no shortage of generative AI benchmarks designed to measure a model's performance on enterprise tasks, from coding to instruction following to agentic web browsing and tool use. But many of these benchmarks share one major shortcoming: they measure the AI's ability to complete specific problems and requests, not how factual its outputs are. In other words, they do not test how well the model generates objectively correct information tied to real-world data, especially when that information is contained in imagery or graphics.

For industries where accuracy is paramount — legal, finance, and medical — the lack of a standardized way to measure factuality has been a critical blind spot.

That changes today: Google’s FACTS team and its data science unit Kaggle released the FACTS Benchmark Suite, a comprehensive evaluation framework designed to close this gap.

The associated research paper reveals a more nuanced definition of the problem, splitting "factuality" into two distinct operational scenarios: "contextual factuality" (grounding responses in provided data) and "world knowledge factuality" (retrieving information from memory or the web).

While the headline news is Gemini 3 Pro’s top-tier placement, the deeper story for builders is the industry-wide "factuality wall."

According to the initial results, no model—including Gemini 3 Pro, GPT-5, or Claude 4.5 Opus—managed to crack a 70% accuracy score across the suite of problems. For technical leaders, this is a signal: the era of "trust but verify" is far from over.

Deconstructing the Benchmark

The FACTS suite moves beyond simple Q&A. It is composed of four distinct tests, each simulating a different real-world failure mode that developers encounter in production:

  1. Parametric Benchmark (Internal Knowledge): Can the model accurately answer trivia-style questions using only its training data?

  2. Search Benchmark (Tool Use): Can the model effectively use a web search tool to retrieve and synthesize live information?

  3. Multimodal Benchmark (Vision): Can the model accurately interpret charts, diagrams, and images without hallucinating?

  4. Grounding Benchmark v2 (Context): Can the model stick strictly to the provided source text?

Google has released 3,513 examples to the public, while Kaggle holds a private set to prevent developers from training on the test data—a common issue known as "contamination."

The Leaderboard: A Game of Inches

The initial run of the benchmark places Gemini 3 Pro in the lead with a comprehensive FACTS Score of 68.8%, followed by Gemini 2.5 Pro (62.1%) and OpenAI’s GPT-5 (61.8%). However, a closer look at the data reveals where the real battlegrounds are for engineering teams.

| Model | FACTS Score (Avg, %) | Search (RAG Capability, %) | Multimodal (Vision, %) |
| --- | --- | --- | --- |
| Gemini 3 Pro | 68.8 | 83.8 | 46.1 |
| Gemini 2.5 Pro | 62.1 | 63.9 | 46.9 |
| GPT-5 | 61.8 | 77.7 | 44.1 |
| Grok 4 | 53.6 | 75.3 | 25.7 |
| Claude 4.5 Opus | 51.3 | 73.2 | 39.2 |

Data sourced from the FACTS Team release notes.

For Builders: The "Search" vs. "Parametric" Gap

For developers building RAG (Retrieval-Augmented Generation) systems, the Search Benchmark is the most critical metric.

The data shows a clear gap between a model's ability to "know" things (Parametric) and its ability to "find" things (Search). For instance, Gemini 3 Pro scores 83.8% on Search tasks but only 76.4% on Parametric tasks, a difference of 7.4 percentage points.

This validates the current enterprise architecture standard: do not rely on a model's internal memory for critical facts.

If you are building an internal knowledge bot, the FACTS results suggest that hooking your model up to a search tool or vector database is not optional—it is the only way to push accuracy toward acceptable production levels.
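To make the "search tool" pattern concrete, here is a minimal sketch of retrieval-grounded prompting. It is not taken from the FACTS paper: the `DOCUMENTS` list, the word-overlap ranking, and the prompt wording are all illustrative assumptions, and a real system would swap the toy `retrieve` function for a search API or vector database.

```python
import re

# Hypothetical in-memory knowledge base (an assumption for illustration).
# In production this role is played by a search tool or vector database.
DOCUMENTS = [
    "Refunds are accepted within 30 days of purchase with a receipt.",
    "Standard shipping takes 5 to 7 business days.",
    "Support is available Monday to Friday, 9am to 5pm.",
]


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, documents: list[str], top_k: int = 1) -> list[str]:
    """Rank documents by word overlap with the question (toy stand-in for search)."""
    question_words = tokenize(question)
    ranked = sorted(
        documents,
        key=lambda doc: len(question_words & tokenize(doc)),
        reverse=True,
    )
    return ranked[:top_k]


def build_grounded_prompt(question: str, documents: list[str]) -> str:
    """Build a prompt that asks the model to answer only from retrieved text."""
    context = "\n".join(f"- {doc}" for doc in retrieve(question, documents))
    return (
        "Answer using only the context below. "
        "If the context does not contain the answer, say you do not know.\n"
        f"Context:\n{context}\n"
        f"Question: {question}"
    )


prompt = build_grounded_prompt("How many days do I have for refunds?", DOCUMENTS)
print(prompt)
```

The design point is that the critical fact (the refund window) reaches the model through the prompt rather than through its internal memory, which is the behavior the Search and Grounding sub-benchmarks reward.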

The Multimodal Warning

The most alarming data point for product managers is the performance on Multimodal tasks. The scores here are universally low. Even the category leader, Gemini 2.5 Pro, only hit 46.9% accuracy.

The benchmark tasks included reading charts, interpreting diagrams, and identifying objects in nature. With less than 50% accuracy across the board, this suggests that Multimodal AI is not yet ready for unsupervised data extraction.

Bottom line: If your product roadmap involves having an AI automatically scrape data from invoices or interpret financial charts without human-in-the-loop review, you are likely introducing significant error rates into your pipeline.
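One way to keep a human in the loop without reviewing every document is to run cheap consistency checks on the extracted fields and escalate anything that fails. The sketch below is an assumption-laden illustration, not a method from the FACTS release: the `InvoiceExtraction` schema, the `needs_human_review` helper, and the tolerance value are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class InvoiceExtraction:
    """Fields a vision model claims to have read from an invoice (hypothetical schema)."""
    line_items: list[float]
    stated_total: float


def needs_human_review(extraction: InvoiceExtraction, tolerance: float = 0.01) -> bool:
    """Return True unless the line items add up to the stated total."""
    if not extraction.line_items:
        # Nothing to cross-check against, so escalate by default.
        return True
    difference = abs(sum(extraction.line_items) - extraction.stated_total)
    return difference > tolerance


consistent = InvoiceExtraction(line_items=[120.00, 35.50], stated_total=155.50)
suspicious = InvoiceExtraction(line_items=[120.00, 35.50], stated_total=165.50)

for name, extraction in [("consistent", consistent), ("suspicious", suspicious)]:
    route = "human review" if needs_human_review(extraction) else "auto-accept"
    print(f"{name}: {route}")
```

A passing check only shows that the extracted numbers agree with each other, not that they match the document, so teams may still want to sample auto-accepted items for manual review.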

Why This Matters for Your Stack

The FACTS Benchmark is likely to become a standard reference point for procurement. When evaluating models for enterprise use, technical leaders should look beyond the composite score and drill into the specific sub-benchmark that matches their use case:

  • Building a Customer Support Bot? Look at the Grounding score to ensure the bot sticks to your policy documents. (Gemini 2.5 Pro actually outscored Gemini 3 Pro here, 74.2 vs. 69.0.)

  • Building a Research Assistant? Prioritize Search scores.

  • Building an Image Analysis Tool? Proceed with extreme caution.
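The "drill into the sub-benchmark" advice can be sketched as a small lookup over the leaderboard figures quoted in this article. The use-case names and the `rank_models` helper are hypothetical, and Grounding is left out because the article only reports that score for two of the five models.

```python
# Scores (percent) copied from the leaderboard figures quoted in this article.
LEADERBOARD = {
    "Gemini 3 Pro": {"facts": 68.8, "search": 83.8, "multimodal": 46.1},
    "Gemini 2.5 Pro": {"facts": 62.1, "search": 63.9, "multimodal": 46.9},
    "GPT-5": {"facts": 61.8, "search": 77.7, "multimodal": 44.1},
    "Grok 4": {"facts": 53.6, "search": 75.3, "multimodal": 25.7},
    "Claude 4.5 Opus": {"facts": 51.3, "search": 73.2, "multimodal": 39.2},
}

# Hypothetical mapping from a product use case to the metric that matters most.
USE_CASE_TO_METRIC = {
    "research_assistant": "search",
    "image_analysis": "multimodal",
}


def rank_models(metric: str) -> list[tuple[str, float]]:
    """Return (model, score) pairs sorted from highest to lowest on one metric."""
    pairs = [(name, scores[metric]) for name, scores in LEADERBOARD.items()]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


for use_case, metric in USE_CASE_TO_METRIC.items():
    best_model, best_score = rank_models(metric)[0]
    print(f"{use_case}: {best_model} leads on {metric} at {best_score}%")
```

Ranking by the composite score alone would hide the fact that the Multimodal leader is not the overall leader, which is why the per-metric view is worth the extra step.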

As the FACTS team noted in their release, "All evaluated models achieved an overall accuracy below 70%, leaving considerable headroom for future progress." For now, the message to the industry is clear: The models are getting smarter, but they aren't yet infallible. Design your systems with the assumption that, roughly one-third of the time, the raw model might just be wrong.
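The "roughly one-third" figure can be checked with simple arithmetic on the four Gemini 3 Pro sub-scores quoted in this article. The sketch assumes the composite FACTS Score is an unweighted mean of the four sub-benchmarks; the "Avg" column label suggests this and the numbers reproduce the reported 68.8, but the article does not state the formula explicitly. The function names are illustrative.

```python
# Sub-benchmark scores for Gemini 3 Pro as quoted in this article (percent).
gemini_3_pro = {
    "parametric": 76.4,
    "search": 83.8,
    "multimodal": 46.1,
    "grounding": 69.0,
}


def composite_score(scores: dict[str, float]) -> float:
    """Unweighted mean of the sub-benchmark scores (an assumption, see above)."""
    return sum(scores.values()) / len(scores)


def error_rate(score_percent: float) -> float:
    """Share of responses judged incorrect, as a fraction between 0 and 1."""
    return 1.0 - score_percent / 100.0


facts_score = composite_score(gemini_3_pro)
print(f"composite score: {facts_score:.1f}%")
print(f"implied error rate: {error_rate(facts_score):.1%}")
```

Under that assumption, even the top-ranked model is wrong on a little under a third of the benchmark's questions, and every other model on the leaderboard sits above that error rate.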


