
Your AI models are failing in production—Here’s how to fix model selection

By Josh
June 4, 2025


Enterprises need to know if the models that power their applications and agents work in real-life scenarios. This type of evaluation can sometimes be complex because it is hard to predict specific scenarios. A revamped version of the RewardBench benchmark looks to give organizations a better idea of a model’s real-life performance. 

The Allen Institute for AI (Ai2) launched RewardBench 2, an updated version of its reward model benchmark, RewardBench, which it claims provides a more holistic view of model performance and assesses how models align with an enterprise’s goals and standards.

Ai2 built RewardBench with classification tasks that measure correlations through inference-time compute and downstream training. RewardBench mainly deals with reward models (RMs), which can act as judges and evaluate LLM outputs. RMs assign a score or a “reward” that guides reinforcement learning from human feedback (RLHF).
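
As a rough illustration of the classification framing, a reward model can be checked by whether it scores a preferred response above the rejected ones for the same prompt. The sketch below is not Ai2's implementation: `toy_reward`, the sample data, and the accuracy definition are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence, Tuple

# (prompt, preferred response, rejected responses)
Example = Tuple[str, str, Sequence[str]]


def toy_reward(prompt: str, response: str) -> float:
    """Hypothetical stand-in for a reward model.

    Counts how many response words also appear in the prompt.
    A real RM would be a trained neural network, not a word counter.
    """
    prompt_words = set(prompt.lower().split())
    return float(sum(1 for w in response.lower().split() if w in prompt_words))


def preference_accuracy(
    reward: Callable[[str, str], float], examples: List[Example]
) -> float:
    """Fraction of examples where the preferred response outscores every rejected one."""
    if not examples:
        raise ValueError("examples must not be empty")
    correct = 0
    for prompt, chosen, rejected in examples:
        chosen_score = reward(prompt, chosen)
        if all(chosen_score > reward(prompt, r) for r in rejected):
            correct += 1
    return correct / len(examples)


# Made-up examples, not drawn from any benchmark.
EXAMPLES: List[Example] = [
    ("capital of france", "the capital of france is paris", ["i like turtles", "france"]),
    ("two plus two", "four", ["two plus two is five"]),
]

ACCURACY = preference_accuracy(toy_reward, EXAMPLES)
print(ACCURACY)
```

The toy scorer gets the second example wrong because it rewards echoing the prompt rather than correctness, which is the kind of failure an accuracy-based evaluation is meant to surface.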

RewardBench 2 is here! We took a long time to learn from our first reward model evaluation tool to make one that is substantially harder and more correlated with both downstream RLHF and inference-time scaling. pic.twitter.com/NGetvNrOQV

— Ai2 (@allen_ai) June 2, 2025

Nathan Lambert, a senior research scientist at Ai2, told VentureBeat that the first RewardBench worked as intended when it was launched. But the model landscape evolved rapidly, and its benchmarks needed to keep pace.

“As reward models became more advanced and use cases more nuanced, we quickly recognized with the community that the first version didn’t fully capture the complexity of real-world human preferences,” he said. 

Lambert added that with RewardBench 2, “we set out to improve both the breadth and depth of evaluation—incorporating more diverse, challenging prompts and refining the methodology to reflect better how humans actually judge AI outputs in practice.” He said the second version uses unseen human prompts, has a more challenging scoring setup and new domains. 

Using evaluations for models that evaluate

While reward models test how well models work, it’s also important that RMs align with company values. Otherwise, the fine-tuning and reinforcement learning process can reinforce bad behavior such as hallucinations, reduce generalization, and score harmful responses too highly.

RewardBench 2 covers six different domains: factuality, precise instruction following, math, safety, focus and ties.

“Enterprises should use RewardBench 2 in two different ways depending on their application. If they’re performing RLHF themselves, they should adopt the best practices and datasets from leading models in their own pipelines because reward models need on-policy training recipes (i.e. reward models that mirror the model they’re trying to train with RL). For inference time scaling or data filtering, RewardBench 2 has shown that they can select the best model for their domain and see correlated performance,” Lambert said. 
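
The inference-time scaling and data filtering uses Lambert mentions can be sketched as best-of-N selection and threshold filtering. This is a minimal sketch under stated assumptions: `length_reward` is a hypothetical placeholder for a real reward model, and the candidate responses and threshold are invented.

```python
from typing import Callable, List

RewardFn = Callable[[str, str], float]


def length_reward(prompt: str, response: str) -> float:
    """Hypothetical reward that prefers responses of about five words.

    Stands in for a trained reward model purely for illustration.
    """
    return -float(abs(len(response.split()) - 5))


def best_of_n(prompt: str, candidates: List[str], reward: RewardFn) -> str:
    """Inference-time scaling: sample N candidates, keep the highest-scoring one."""
    if not candidates:
        raise ValueError("candidates must not be empty")
    return max(candidates, key=lambda c: reward(prompt, c))


def filter_by_reward(
    prompt: str, candidates: List[str], reward: RewardFn, threshold: float
) -> List[str]:
    """Data filtering: keep only candidates scoring at or above the threshold."""
    return [c for c in candidates if reward(prompt, c) >= threshold]


CANDIDATES = [
    "ok",
    "the answer is forty two",
    "well it really depends on quite a lot of things",
]

BEST = best_of_n("what is the answer?", CANDIDATES, length_reward)
KEPT = filter_by_reward("what is the answer?", CANDIDATES, length_reward, threshold=-4.0)
print(BEST)
print(KEPT)
```

In both cases the quality of the result depends entirely on the reward function, which is why picking a reward model that performs well in the relevant domain matters.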

Lambert noted that benchmarks like RewardBench offer users a way to evaluate the models they’re choosing based on the “dimensions that matter most to them, rather than relying on a narrow one-size-fits-all score.” He said the idea of performance, which many evaluation methods claim to assess, is very subjective because a good response from a model highly depends on the context and goals of the user. At the same time, human preferences get very nuanced. 
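
One way to act on per-domain results instead of a single aggregate score is to weight the six domains by how much each matters to the application. The sketch below assumes hypothetical model names (`rm_a`, `rm_b`) and placeholder scores. None of the numbers come from the RewardBench 2 leaderboard.

```python
from typing import Dict

# The six RewardBench 2 domains named in the article.
DOMAINS = ("factuality", "precise_if", "math", "safety", "focus", "ties")

# PLACEHOLDER scores for hypothetical models; not real benchmark results.
PLACEHOLDER_SCORES: Dict[str, Dict[str, float]] = {
    "rm_a": {"factuality": 0.9, "precise_if": 0.5, "math": 0.5,
             "safety": 0.5, "focus": 0.5, "ties": 0.5},
    "rm_b": {"factuality": 0.5, "precise_if": 0.5, "math": 0.5,
             "safety": 0.9, "focus": 0.5, "ties": 0.5},
}


def weighted_score(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average over the domains that appear in `weights`."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(scores[domain] * w for domain, w in weights.items()) / total


def pick_model(all_scores: Dict[str, Dict[str, float]], weights: Dict[str, float]) -> str:
    """Return the model name with the highest weighted score."""
    return max(all_scores, key=lambda name: weighted_score(all_scores[name], weights))


# A safety-sensitive application might weight safety three times as much as factuality.
SAFETY_FIRST = {"safety": 3.0, "factuality": 1.0}
FACTS_ONLY = {"factuality": 1.0}

print(pick_model(PLACEHOLDER_SCORES, SAFETY_FIRST))
print(pick_model(PLACEHOLDER_SCORES, FACTS_ONLY))
```

The two hypothetical models have the same unweighted average, so only the weighting separates them, which is the point Lambert makes about one-size-fits-all scores.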

Ai2 released the first version of RewardBench in March 2024. At the time, the institute said it was the first benchmark and leaderboard for reward models. Since then, several methods for benchmarking and improving RMs have emerged. Researchers at Meta’s FAIR came out with reWordBench. DeepSeek released a new technique called Self-Principled Critique Tuning for smarter and scalable RMs.

Super excited that our second reward model evaluation is out. It’s substantially harder, much cleaner, and well correlated with downstream PPO/BoN sampling.

Happy hillclimbing!

Huge congrats to @saumyamalik44 who lead the project with a total commitment to excellence. https://t.co/c0b6rHTXY5

— Nathan Lambert (@natolambert) June 2, 2025

How models performed

Since RewardBench 2 is an updated version of RewardBench, Ai2 tested both existing and newly trained models to see if they continue to rank high. These included a variety of models, such as versions of Gemini, Claude, GPT-4.1, and Llama-3.1, along with datasets and models like Qwen, Skywork, and its own Tulu. 

The company found that larger reward models perform best on the benchmark because their base models are stronger. Overall, the strongest-performing models are variants of Llama-3.1 Instruct. In terms of focus and safety, Skywork data “is particularly helpful,” and Tulu did well on factuality. 

Ai2 said that while it believes RewardBench 2 “is a step forward in broad, multi-domain accuracy-based evaluation” for reward models, model evaluation should mainly be used as a guide to pick the models that work best for an enterprise’s needs.
