• About Us
  • Disclaimer
  • Contact Us
  • Privacy Policy
Monday, May 4, 2026
mGrowTech
No Result
View All Result
  • Technology And Software
    • Account Based Marketing
    • Channel Marketing
    • Marketing Automation
      • Al, Analytics and Automation
      • Ad Management
  • Digital Marketing
    • Social Media Management
    • Google Marketing
  • Direct Marketing
    • Brand Management
    • Marketing Attribution and Consulting
  • Mobile Marketing
  • Event Management
  • PR Solutions
  • Technology And Software
    • Account Based Marketing
    • Channel Marketing
    • Marketing Automation
      • Al, Analytics and Automation
      • Ad Management
  • Digital Marketing
    • Social Media Management
    • Google Marketing
  • Direct Marketing
    • Brand Management
    • Marketing Attribution and Consulting
  • Mobile Marketing
  • Event Management
  • PR Solutions
No Result
View All Result
mGrowTech
No Result
View All Result
Home Al, Analytics and Automation

A better method for identifying overconfident large language models | MIT News

Josh by Josh
March 19, 2026
in Al, Analytics and Automation
0
A better method for identifying overconfident large language models | MIT News



Large language models (LLMs) can generate credible but inaccurate responses, so researchers have developed uncertainty quantification methods to check the reliability of predictions. One popular method involves submitting the same prompt multiple times to see if the model generates the same answer.

But this method measures self-confidence, and even the most impressive LLM might be confidently wrong. Overconfidence can mislead users about the accuracy of a prediction, which might result in devastating consequences in high-stakes settings like health care or finance.   

To address this shortcoming, MIT researchers introduced a new method for measuring a different type of uncertainty that more reliably identifies confident but incorrect LLM responses.

Their method involves comparing a target model’s response to responses from a group of similar LLMs. They found that measuring cross-model disagreement more accurately captures this type of uncertainty than traditional approaches.

They combined their approach with a measure of LLM self-consistency to create a total uncertainty metric, and evaluated it on 10 realistic tasks, such as question-answering and math reasoning. This total uncertainty metric consistently outperformed other measures and was better at identifying unreliable predictions.

“Self-consistency is being used in a lot of different approaches for uncertainty quantification, but if your estimate of uncertainty only relies on a single model’s outcome, it is not necessarily trustable. We went back to the beginning to understand the limitations of current approaches and used those as a starting point to design a complementary method that can empirically improve the results,” says Kimia Hamidieh, an electrical engineering and computer science (EECS) graduate student at MIT and lead author of a paper on this technique.

She is joined on the paper by Veronika Thost, a research scientist at the MIT-IBM Watson AI Lab; Walter Gerych, a former MIT postdoc who is now an assistant professor at Worcester Polytechnic Institute; Mikhail Yurochkin, a staff research scientist at the MIT-IBM Watson AI Lab; and senior author Marzyeh Ghassemi, an associate professor in EECS and a member of the Institute of Medical Engineering Sciences and the Laboratory for Information and Decision Systems.

Understanding overconfidence

Many popular methods for uncertainty quantification involve asking a model for a confidence score or testing the consistency of its responses to the same prompt. These methods estimate aleatoric uncertainty, or how internally confident a model is in its own prediction.

However, LLMs can be confident when they are completely wrong. Research has shown that epistemic uncertainty, or uncertainty about whether one is using the right model, can be a better way to assess true uncertainty when a model is overconfident.

The MIT researchers estimate epistemic uncertainty by measuring disagreement across a similar group of LLMs.    

“If I ask ChatGPT the same question multiple times and it gives me the same answer over and over again, that doesn’t mean the answer is necessarily correct. If I switch to Claude or Gemini and ask them the same question, and I get a different answer, that is going to give me a sense of the epistemic uncertainty,” Hamidieh explains.

Epistemic uncertainty attempts to capture how far a target model diverges from the ideal model for that task. But since it is impossible to build an ideal model, researchers use surrogates or approximations that often rely on faulty assumptions.

To improve uncertainty quantification, the MIT researchers needed a more accurate way to estimate epistemic uncertainty.

An ensemble approach

The method they developed involves measuring the divergence between the target model and a small ensemble of models with similar size and architecture. They found that comparing semantic similarity, or how closely the meanings of the responses match, could provide a better estimate of epistemic uncertainty.

To achieve the most accurate estimate, the researchers needed a set of LLMs that covered diverse responses, weren’t too similar to the target model, and were weighted based on credibility.

“We found that the easiest way to satisfy all these properties is to take models that are trained by different companies. We tried many different approaches that were more complex, but this very simple approach ended up working best,” Hamidieh says.

Once they had developed this method for estimating epistemic uncertainty, they combined it with a standard approach that measures aleatoric uncertainty. This total uncertainty metric (TU) offered the most accurate reflection of whether a model’s confidence level is trustworthy.

“Uncertainty depends on the uncertainty of the given prompt as well as how close our model is to the optimal model. This is why summing up these two uncertainty metrics is going to give us the best estimate,” Hamidieh says.

TU could more effectively identify situations where an LLM is hallucinating, since epistemic uncertainty can flag confidently wrong outputs that aleatoric uncertainty might miss. It could also enable researchers to reinforce an LLM’s confidently correct answers during training, which may improve performance.

They tested TU using multiple LLMs on 10 common tasks, such as question-answering, summarization, translation, and math reasoning. Their method more effectively identified unreliable predictions than either measure on its own.

Measuring total uncertainty often required fewer queries than calculating aleatoric uncertainty, which could reduce computational costs and save energy.

Their experiments also revealed that epistemic uncertainty is most effective on tasks with a unique correct answer, like factual question-answering, but may underperform on more open-ended tasks.

In the future, the researchers could adapt their technique to improve its performance on open-ended queries. They may also build on this work by exploring other forms of aleatoric uncertainty.

This work is funded, in part, by the MIT-IBM Watson AI Lab.



Source_link

READ ALSO

A Coding Implementation to Explore and Analyze the TaskTrove Dataset with Streaming Parsing Visualization and Verifier Detection

A Developer’s Guide to Systematic Prompting: Mastering Negative Constraints, Structured JSON Outputs, and Multi-Hypothesis Verbalized Sampling

Related Posts

A Coding Implementation to Explore and Analyze the TaskTrove Dataset with Streaming Parsing Visualization and Verifier Detection
Al, Analytics and Automation

A Coding Implementation to Explore and Analyze the TaskTrove Dataset with Streaming Parsing Visualization and Verifier Detection

May 4, 2026
A Developer’s Guide to Systematic Prompting: Mastering Negative Constraints, Structured JSON Outputs, and Multi-Hypothesis Verbalized Sampling
Al, Analytics and Automation

A Developer’s Guide to Systematic Prompting: Mastering Negative Constraints, Structured JSON Outputs, and Multi-Hypothesis Verbalized Sampling

May 4, 2026
Sakana AI Introduces KAME: A Tandem Speech-to-Speech Architecture That Injects LLM Knowledge in Real Time
Al, Analytics and Automation

Sakana AI Introduces KAME: A Tandem Speech-to-Speech Architecture That Injects LLM Knowledge in Real Time

May 3, 2026
Mistral AI Launches Remote Agents in Vibe and Mistral Medium 3.5 with 77.6% SWE-Bench Verified Score
Al, Analytics and Automation

Mistral AI Launches Remote Agents in Vibe and Mistral Medium 3.5 with 77.6% SWE-Bench Verified Score

May 3, 2026
You’re allowed to use AI to help make a movie, but you’re not allowed to use AI actors or writers
Al, Analytics and Automation

You’re allowed to use AI to help make a movie, but you’re not allowed to use AI actors or writers

May 2, 2026
Making the case for curiosity-driven science | MIT News
Al, Analytics and Automation

Making the case for curiosity-driven science | MIT News

May 2, 2026
Next Post
Top 11 JPG to WEBP Conversion Tools

Top 11 JPG to WEBP Conversion Tools

POPULAR NEWS

Trump ends trade talks with Canada over a digital services tax

Trump ends trade talks with Canada over a digital services tax

June 28, 2025
Communication Effectiveness Skills For Business Leaders

Communication Effectiveness Skills For Business Leaders

June 10, 2025
15 Trending Songs on TikTok in 2025 (+ How to Use Them)

15 Trending Songs on TikTok in 2025 (+ How to Use Them)

June 18, 2025
App Development Cost in Singapore: Pricing Breakdown & Insights

App Development Cost in Singapore: Pricing Breakdown & Insights

June 22, 2025
Comparing the Top 7 Large Language Models LLMs/Systems for Coding in 2025

Comparing the Top 7 Large Language Models LLMs/Systems for Coding in 2025

November 4, 2025

EDITOR'S PICK

Gemini in Google Sheets just achieved state-of-the-art performance

Gemini in Google Sheets just achieved state-of-the-art performance

March 15, 2026
AI in Finance: Reshaping Banking and Investments

AI in Finance: Reshaping Banking and Investments

August 6, 2025
Liquid AI Releases LFM2-ColBERT-350M: A New Small Model that brings Late Interaction Retrieval to Multilingual and Cross-Lingual RAG

Liquid AI Releases LFM2-ColBERT-350M: A New Small Model that brings Late Interaction Retrieval to Multilingual and Cross-Lingual RAG

October 29, 2025
Apex Legends won’t be playable on Nintendo Switch after its next season

Apex Legends won’t be playable on Nintendo Switch after its next season

February 2, 2026

About

We bring you the best Premium WordPress Themes that perfect for news, magazine, personal blog, etc. Check our landing page for details.

Follow us

Categories

  • Account Based Marketing
  • Ad Management
  • Al, Analytics and Automation
  • Brand Management
  • Channel Marketing
  • Digital Marketing
  • Direct Marketing
  • Event Management
  • Google Marketing
  • Marketing Attribution and Consulting
  • Marketing Automation
  • Mobile Marketing
  • PR Solutions
  • Social Media Management
  • Technology And Software
  • Uncategorized

Recent Posts

  • 44 B2B Influencer Marketing Campaign & Activation Ideas for 2026 – TopRank® Marketing
  • Fitness App Development Cost, Pricing and ROI Guide 2026
  • Experiential Marketing Trend of the Week: Miniaturized Touchpoints
  • Celebrate America’s 250th with Google Arts & Culture
  • About Us
  • Disclaimer
  • Contact Us
  • Privacy Policy
No Result
View All Result
  • Technology And Software
    • Account Based Marketing
    • Channel Marketing
    • Marketing Automation
      • Al, Analytics and Automation
      • Ad Management
  • Digital Marketing
    • Social Media Management
    • Google Marketing
  • Direct Marketing
    • Brand Management
    • Marketing Attribution and Consulting
  • Mobile Marketing
  • Event Management
  • PR Solutions