mGrowTech

Ai2 Researchers Are Changing the Benchmarking Game by Introducing Fluid Benchmarking, Which Enhances Evaluation Along Several Dimensions

By Josh
September 17, 2025
in AI, Analytics and Automation


A team of researchers from the Allen Institute for Artificial Intelligence (Ai2), the University of Washington, and Carnegie Mellon University introduces Fluid Benchmarking, an adaptive LLM evaluation method that replaces static accuracy with two-parameter IRT ability estimation and Fisher-information-driven item selection. By asking only the most informative questions for a model’s current ability, it yields smoother training curves, delays benchmark saturation, improves external validity at small budgets, and filters out mislabeled items.

Fluid Benchmarking replaces static accuracy with an adaptive, psychometrics-grounded procedure. A two-parameter logistic IRT model maps responses to a latent ability score and selects each next item by maximizing Fisher information at the model’s current ability estimate. Across six popular benchmarks and multiple model checkpoints, it improves validity (smaller rank distance), reduces variance (lower normalized total variation), delays saturation (more monotonic training curves), and administers roughly 100× fewer mislabeled items than random sampling at equal budget.

What problem does Fluid Benchmarking solve?

Static subsets and plain accuracy conflate item quality and item difficulty, inflate step-to-step variance, and hit benchmark saturation early (training curves flatten while the model still improves). Fluid Benchmarking reframes both aggregation and selection: score in a latent ability space and adapt the item subset to the current ability, rather than treating all items equally or fixing them a priori.

How does it work?

1) Ability, not accuracy

Fit a two-parameter logistic (2PL) IRT model on historical LM responses: for item j with discrimination a_j and difficulty b_j, the probability that a model with ability θ_i answers correctly is

p(u_ij = 1) = logistic(a_j (θ_i − b_j))

At evaluation time, estimate the MAP ability θ̂_i for the candidate LM by maximizing the 2PL likelihood over its observed right/wrong responses on the administered items. Items are thus weighted by their discrimination and difficulty, unlike plain accuracy, which weights every item equally.
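As a concrete illustration, here is a minimal sketch of 2PL scoring and MAP ability estimation. The function names, the grid search, and the standard-normal prior are assumptions for illustration, not the paper's implementation:

```python
import math

def p_correct(theta, a, b):
    """2PL: probability that a model with ability theta answers an item
    with discrimination a and difficulty b correctly."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def map_ability(responses, items, prior_sd=1.0):
    """MAP estimate of theta from binary responses on administered items.

    responses: list of 0/1 outcomes; items: list of (a_j, b_j) pairs.
    Uses a coarse grid search over theta in [-4, 4] with a Gaussian prior."""
    grid = [i / 100.0 for i in range(-400, 401)]
    def log_posterior(theta):
        lp = -0.5 * (theta / prior_sd) ** 2  # log-prior, up to a constant
        for u, (a, b) in zip(responses, items):
            p = p_correct(theta, a, b)
            lp += math.log(p if u else 1.0 - p)
        return lp
    return max(grid, key=log_posterior)
```

Under this scoring, a model that answers the easy items but misses the hard ones lands between their difficulties, whereas plain accuracy would score it identically to any other pattern with the same number correct.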

2) Dynamic item selection via Fisher information

At each step t, select the next item q_j that maximizes Fisher information at the current ability estimate θ̂^(t):

I(θ_i; a_j, b_j) = a_j² · logistic(a_j (θ_i − b_j)) · (1 − logistic(a_j (θ_i − b_j)))

High-information items minimize the variance of the ability estimate. As training progresses, the most informative items shift from easy to hard, so the administered subset evolves with model capability.
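With the 2PL formula above, selection reduces to a one-line argmax; this sketch (hypothetical helper names) shows how the most informative item tracks the ability estimate:

```python
import math

def fisher_information(theta, a, b):
    """Fisher information of a 2PL item at ability theta:
    I = a^2 * p * (1 - p), maximal when difficulty b is near theta."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)

def next_item(theta_hat, remaining):
    """Pick the unadministered (a, b) item most informative at theta_hat."""
    return max(remaining, key=lambda ab: fisher_information(theta_hat, *ab))
```

At theta_hat = 0 a matched-difficulty item wins; as the estimate rises over training, harder items become the most informative, which is exactly why the administered subset drifts from easy to hard.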

What does “better evaluation” mean here?

Fluid evaluates four dimensions with concrete metrics:

  • Validity: external agreement with “true” model ranking; measured by mean rank distance (lower is better).
  • Variance: normalized total variation of the training curve across checkpoints (lower is better).
  • Saturation: monotonicity (Spearman rank correlation between checkpoint index and predicted performance; higher is better).
  • Efficiency: quality at small item budgets.
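As one plausible reading of the variance metric (the paper's exact normalization may differ, so treat this as an assumption), normalized total variation penalizes step-to-step jitter relative to the curve's overall range:

```python
def normalized_total_variation(curve):
    """Sum of absolute step-to-step changes, divided by the curve's range.
    A perfectly monotonic curve scores 1.0; noisier curves score higher."""
    tv = sum(abs(b - a) for a, b in zip(curve, curve[1:]))
    rng = max(curve) - min(curve)
    return tv / rng if rng else 0.0
```

A smooth, monotone training curve achieves the minimum of 1.0; a curve that oscillates between the same endpoints scores proportionally higher.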

How strong are the results?

Across six benchmarks (ARC-Challenge, GSM8K, HellaSwag, MMLU, TruthfulQA, WinoGrande) and six LMs with 61–94 checkpoints each:

  • Validity: On the smallest subset (AP-10), mean rank distance drops from 20.0 → 10.1; on AP-50, 15.2 → 8.8.
  • Variance: Total variation shrinks markedly; e.g., 28.3 → 10.7 (AP-10) and 19.1 → 6.5 (AP-50).
  • Saturation: Monotonicity improves from 0.48 → 0.76 (AP-10) and 0.62 → 0.86 (AP-50).
  • Small-budget efficiency: With 10 items, Fluid improves mean rank distance by 9.9 vs. random; at 500 items, the improvement is 0.8—consistent with diminishing returns as budget grows.

In pretraining runs, accuracy space often looks flat late in training, but ability space continues to rise, delaying apparent saturation (e.g., HellaSwag monotonicity 0.91 → 0.99 for random vs. Fluid).

Fluid also avoids mislabeled items: on MMLU-Redux with 100-item budgets, mislabeled items per session drop from 0.75 (random) to 0.01 (Fluid)—about two orders of magnitude fewer.

Ablations isolate where the gains come from: IRT aggregation raises validity, but only dynamic selection lowers variance; “RANDOM-IRT” can even exceed random’s variance at large budgets, underscoring selection as the key lever.

Does it stop early when confident?

Yes. Fluid supports dynamic stopping using the standard error (SE) of the ability estimate: terminate when the SE falls below the average ability gap between rank-adjacent LMs on the Open LLM Leaderboard. In practice, the number of required items varies widely over training (≈20 early in a run, >80 mid-run), which is why fixed budgets are suboptimal.
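The stopping rule can be sketched by approximating the SE of the ability estimate from the total Fisher information of the administered items; the threshold value below is a placeholder, not the leaderboard-derived one:

```python
import math

def ability_standard_error(theta_hat, administered):
    """Approximate SE of the ability estimate as 1/sqrt(total information)
    over the administered (a, b) items."""
    def info(a, b):
        p = 1.0 / (1.0 + math.exp(-a * (theta_hat - b)))
        return a * a * p * (1.0 - p)
    total = sum(info(a, b) for a, b in administered)
    return 1.0 / math.sqrt(total)

def should_stop(theta_hat, administered, se_threshold=0.3):
    """Stop once the ability estimate is precise enough to separate models."""
    return ability_standard_error(theta_hat, administered) < se_threshold
```

Because each well-matched item contributes up to a²/4 information, early checkpoints (where few items match the ability estimate well) need more questions to reach the same SE, matching the observation that required budgets vary over training.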

Where does it fit in the evaluation stack?

Fluid Benchmarking is a benchmark-refinement method: it does not invent new tasks; it re-weights and re-orders existing items to maximize information against a latent ability metric. It generalizes beyond pretraining to post-training and to other modalities, provided there are enough responses to fit and update an IRT model. As models improve, the IRT parameters must be refreshed to resolve difficulty among items that were previously “too hard”; otherwise the top of the scale compresses.

Summary

Fluid Benchmarking makes LLM evaluation budget-efficient and stable by scoring models in ability space and selecting items by Fisher information, yielding lower variance, better rank validity, and delayed saturation with far fewer questions. The trade-offs are operational: maintain fresh response matrices, periodically refit IRT parameters, and ensure reliable right/wrong binarization for open-ended tasks. As these practices standardize, Fluid becomes a practical default for in-loop pretraining and post-training evals across evolving benchmarks.


