Databricks research reveals that building better AI judges isn't just a technical concern, it's a people problem

By Josh | November 5, 2025



The intelligence of AI models isn't what's blocking enterprise deployments. It's the inability to define and measure quality in the first place.

That's where AI judges are now playing an increasingly important role. In AI evaluation, a "judge" is an AI system that scores outputs from another AI system. 

Judge Builder is Databricks' framework for creating judges and was first deployed as part of the company's Agent Bricks technology earlier this year. The framework has evolved significantly since its initial launch in response to direct user feedback and deployments.

Early versions focused on technical implementation, but customer feedback revealed that the real bottleneck was organizational alignment. Databricks now offers a structured workshop process that guides teams through three core challenges: getting stakeholders to agree on quality criteria, capturing domain expertise from a limited pool of subject matter experts, and deploying evaluation systems at scale.

"The intelligence of the model is typically not the bottleneck, the models are really smart," Jonathan Frankle, Databricks' chief AI scientist, told VentureBeat in an exclusive briefing. "Instead, it's really about asking, how do we get the models to do what we want, and how do we know if they did what we wanted?"

The 'Ouroboros problem' of AI evaluation

Judge Builder addresses what Pallavi Koppol, the Databricks research scientist who led its development, calls the "Ouroboros problem." An Ouroboros is an ancient symbol depicting a snake eating its own tail: using AI systems to evaluate AI systems creates a similarly circular validation challenge.

"You want a judge to see if your system is good, if your AI system is good, but then your judge is also an AI system," Koppol explained. "And now you're saying like, well, how do I know this judge is good?"

The solution is measuring "distance to human expert ground truth" as the primary scoring function. By minimizing the gap between how an AI judge scores outputs versus how domain experts would score them, organizations can trust these judges as scalable proxies for human evaluation.
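The scoring function described above can be sketched in a few lines. This is an illustrative reading of "distance to ground truth," not Databricks' actual implementation; the function name and the 1-to-5 rating scale are assumptions:

```python
# Hypothetical sketch: score a judge by its distance to expert ground truth.
# Ratings are assumed to be on a 1-5 scale.

def distance_to_ground_truth(judge_scores, expert_scores):
    """Mean absolute distance between judge and expert ratings (lower is better)."""
    if len(judge_scores) != len(expert_scores):
        raise ValueError("score lists must be the same length")
    total = sum(abs(j - e) for j, e in zip(judge_scores, expert_scores))
    return total / len(judge_scores)

# A judge that tracks the experts closely has a small distance:
distance_to_ground_truth([4, 5, 2, 3], [4, 4, 2, 3])  # 0.25
# A judge that drifts from them has a large one:
distance_to_ground_truth([1, 5, 5, 1], [4, 4, 2, 3])  # 2.25
```

Minimizing this gap is what lets an organization treat the judge as a stand-in for its experts.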

This approach differs fundamentally from traditional guardrail systems or single-metric evaluations. Rather than asking whether an AI output passed or failed on a generic quality check, Judge Builder creates highly specific evaluation criteria tailored to each organization's domain expertise and business requirements.

The technical implementation also sets it apart. Judge Builder integrates with Databricks' MLflow and prompt optimization tools and can work with any underlying model. Teams can version control their judges, track performance over time and deploy multiple judges simultaneously across different quality dimensions.
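To make the version-control idea concrete, here is a minimal in-memory sketch of a judge registry. `JudgeRegistry` is a hypothetical stand-in for illustration only, not MLflow's actual registry API:

```python
class JudgeRegistry:
    """Hypothetical in-memory judge registry: each judge name maps to a
    history of prompts, so versions can be tracked and rolled back."""

    def __init__(self):
        self._history = {}  # name -> list of prompts; index + 1 is the version

    def register(self, name, prompt):
        """Store a new prompt for this judge and return its version number."""
        self._history.setdefault(name, []).append(prompt)
        return len(self._history[name])

    def latest(self, name):
        """Return the most recently registered prompt for this judge."""
        return self._history[name][-1]
```

A real deployment would back this with a tracking server rather than a dictionary, but the shape is the same: judges are named, versioned artifacts, not throwaway prompts.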

Lessons learned: Building judges that actually work

Databricks' work with enterprise customers revealed three critical lessons that apply to anyone building AI judges.

Lesson one: Your experts don't agree as much as you think. When quality is subjective, organizations discover that even their own subject matter experts disagree on what constitutes acceptable output. A customer service response might be factually correct but use an inappropriate tone. A financial summary might be comprehensive but too technical for the intended audience.

"One of the biggest lessons of this whole process is that all problems become people problems," Frankle said. "The hardest part is getting an idea out of a person's brain and into something explicit. And the harder part is that companies are not one brain, but many brains."

The fix is batched annotation with inter-rater reliability checks. Teams annotate examples in small groups, then measure agreement scores before proceeding. This catches misalignment early. In one case, three experts gave ratings of 1, 5 and neutral for the same output before discussion revealed they were interpreting the evaluation criteria differently.

Companies using this approach achieve inter-rater reliability scores as high as 0.6 compared to typical scores of 0.3 from external annotation services. Higher agreement translates directly to better judge performance because the training data contains less noise.
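Inter-rater reliability between a pair of annotators is commonly measured with Cohen's kappa, which corrects raw agreement for agreement expected by chance. A small self-contained sketch (not Databricks' code; the example ratings are made up):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters over the same items.
    1.0 = perfect agreement, 0.0 = chance-level agreement."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(counts_a) | set(counts_b)
    # Chance agreement: probability both raters pick the same label at random.
    expected = sum((counts_a[l] / n) * (counts_b[l] / n) for l in labels)
    return (observed - expected) / (1 - expected)

cohens_kappa([1, 1, 5, 5], [1, 1, 5, 1])  # 0.5
```

Running a check like this after each annotation batch is what surfaces misalignment before it contaminates the judge's training data.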

Lesson two: Break down vague criteria into specific judges. Instead of one judge evaluating whether a response is "relevant, factual and concise," create three separate judges. Each targets a specific quality aspect. This granularity matters because a failing "overall quality" score reveals something is wrong but not what to fix.
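The one-judge-per-criterion idea can be sketched with toy stand-ins for model calls. The three judge functions below are hypothetical keyword checks, not real LLM judges; the point is the structure of the report, not the checks themselves:

```python
# Hypothetical: three narrow judges instead of one "overall quality" judge.
# Each function is a toy stand-in for a separate judge-model call.

def relevance_judge(response):
    return "pricing" in response          # did it address the pricing question?

def factual_judge(response):
    return "free" not in response         # toy check: no false "free" claims

def concision_judge(response):
    return len(response.split()) <= 25    # short enough for the channel

JUDGES = {"relevant": relevance_judge, "factual": factual_judge, "concise": concision_judge}

def evaluate(response):
    """Score one response on every quality dimension independently."""
    return {name: judge(response) for name, judge in JUDGES.items()}

evaluate("Our pricing starts at $20 per seat per month.")
# -> {'relevant': True, 'factual': True, 'concise': True}
```

A failing dimension in this report tells you what to fix, which a single pass/fail score cannot.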

The best results come from combining top-down requirements, such as regulatory constraints and stakeholder priorities, with bottom-up discovery of observed failure patterns. One customer built a top-down judge for correctness but discovered through data analysis that correct responses almost always cited the top two retrieval results. This insight became a new production-friendly judge that could proxy for correctness without requiring ground-truth labels.
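A production-friendly proxy of that kind could look like the following sketch; the function name and document IDs are invented for illustration:

```python
# Hypothetical proxy judge: flag responses that cite neither of the top two
# retrieved documents, as a label-free stand-in for a correctness check.

def cites_top_results(cited_ids, retrieved_ids, top_k=2):
    """True if the response cites at least one of the top-k retrieved docs."""
    top = set(retrieved_ids[:top_k])
    return any(doc in top for doc in cited_ids)

cites_top_results(["doc7"], ["doc7", "doc3", "doc9"])  # True
cites_top_results(["doc9"], ["doc7", "doc3", "doc9"])  # False -- worth a look
```

Because it needs only the retrieval trace, a check like this can run on every production request, while the expensive ground-truth correctness judge runs only on samples.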

Lesson three: You need fewer examples than you think. Teams can create robust judges from just 20-30 well-chosen examples. The key is selecting edge cases that expose disagreement rather than obvious examples where everyone agrees.
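One way to select such disagreement-exposing examples is to rank candidates by the variance of expert scores and keep the most contested ones. A minimal sketch with hypothetical data:

```python
from statistics import pvariance

def pick_edge_cases(annotations, k=3):
    """annotations: {example_id: [scores from each expert]}.
    Return the k examples where experts disagree most (highest score variance)."""
    ranked = sorted(annotations, key=lambda ex: pvariance(annotations[ex]), reverse=True)
    return ranked[:k]

scores = {
    "refund_policy": [5, 5, 5],   # everyone agrees -- low training value
    "edge_case_tax": [1, 5, 3],   # experts split -- exactly what to discuss
    "greeting_tone": [4, 5, 4],
}
pick_edge_cases(scores, k=1)  # ['edge_case_tax']
```

Annotating the contested examples first is what lets 20-30 items carry as much signal as a much larger random sample.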

"We're able to run this process with some teams in as little as three hours, so it doesn't really take that long to start getting a good judge," Koppol said.

Production results: From pilots to seven-figure deployments

Frankle shared three metrics Databricks uses to measure Judge Builder's success: whether customers want to use it again, whether they increase AI spending and whether they progress further in their AI journey.

On the first metric, one customer created more than a dozen judges after their initial workshop. "This customer made more than a dozen judges after we walked them through doing this in a rigorous way for the first time with this framework," Frankle said. "They really went to town on judges and are now measuring everything."

For the second metric, the business impact is clear. "There are multiple customers who have gone through this workshop and have become seven-figure spenders on GenAI at Databricks in a way that they weren't before," Frankle said.

The third metric reveals Judge Builder's strategic value. Customers who previously hesitated to use advanced techniques like reinforcement learning now feel confident deploying them because they can measure whether improvements actually occurred.

"There are customers who have gone and done very advanced things after having had these judges where they were reluctant to do so before," Frankle said. "They've moved from doing a little bit of prompt engineering to doing reinforcement learning with us. Why spend the money on reinforcement learning, and why spend the energy on reinforcement learning if you don't know whether it actually made a difference?"

What enterprises should do now

The teams successfully moving AI from pilot to production treat judges not as one-time artifacts but as evolving assets that grow with their systems.

Databricks recommends three practical steps. First, focus on high-impact judges by identifying one critical regulatory requirement plus one observed failure mode. These become your initial judge portfolio.

Second, create lightweight workflows with subject matter experts. A few hours reviewing 20-30 edge cases provides sufficient calibration for most judges. Use batched annotation and inter-rater reliability checks to denoise your data.

Third, schedule regular judge reviews using production data. New failure modes will emerge as your system evolves. Your judge portfolio should evolve with them.

"A judge is a way to evaluate a model, it's also a way to create guardrails, it's also a way to have a metric against which you can do prompt optimization and it's also a way to have a metric against which you can do reinforcement learning," Frankle said. "Once you have a judge that you know represents your human taste in an empirical form that you can query as much as you want, you can use it in 10,000 different ways to measure or improve your agents."


