mGrowTech

Your AI models are failing in production—Here’s how to fix model selection

by Josh
June 4, 2025
in Technology And Software

Enterprises need to know if the models that power their applications and agents work in real-life scenarios. This type of evaluation can sometimes be complex because it is hard to predict specific scenarios. A revamped version of the RewardBench benchmark looks to give organizations a better idea of a model’s real-life performance. 

The Allen Institute for AI (Ai2) launched RewardBench 2, an updated version of its reward model benchmark, RewardBench, which it claims provides a more holistic view of model performance and assesses how models align with an enterprise’s goals and standards.

Ai2 built RewardBench with classification tasks that measure correlations through inference-time compute and downstream training. RewardBench mainly deals with reward models (RMs), which can act as judges and evaluate LLM outputs. RMs assign a score or a “reward” that guides reinforcement learning from human feedback (RLHF).
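To make the classification framing concrete, here is a minimal sketch of how a reward model can be scored as a judge. It assumes each evaluation item has one preferred response and one or more rejected ones. The `toy_reward` function and the items in `ITEMS` are hypothetical stand-ins invented for illustration; they are not Ai2's harness or data.

```python
from typing import Callable, Dict, List

# A reward model is treated here as any function (prompt, response) -> score.
RewardFn = Callable[[str, str], float]


def toy_reward(prompt: str, response: str) -> float:
    """Hypothetical stand-in for a reward model: counts response words that echo the prompt."""
    prompt_words = set(prompt.lower().split())
    return float(sum(1 for word in response.lower().split() if word in prompt_words))


def prefers_chosen(reward_fn: RewardFn, item: Dict) -> bool:
    """True if the preferred response outscores every rejected response."""
    chosen_score = reward_fn(item["prompt"], item["chosen"])
    return all(chosen_score > reward_fn(item["prompt"], r) for r in item["rejected"])


def accuracy(reward_fn: RewardFn, items: List[Dict]) -> float:
    """Fraction of items on which the reward model ranks the preferred response first."""
    return sum(prefers_chosen(reward_fn, item) for item in items) / len(items)


# Made-up evaluation items, for illustration only.
ITEMS = [
    {
        "prompt": "capital of france",
        "chosen": "the capital of france is paris",
        "rejected": ["i like turtles", "berlin"],
    },
    {
        "prompt": "add two and two",
        "chosen": "four",
        "rejected": ["two and two make five"],
    },
]

print(accuracy(toy_reward, ITEMS))
```

The toy scorer gets the second item wrong because it rewards word overlap rather than correctness. That is the kind of failure an accuracy-based benchmark is meant to expose.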

RewardBench 2 is here! We took a long time to learn from our first reward model evaluation tool to make one that is substantially harder and more correlated with both downstream RLHF and inference-time scaling. pic.twitter.com/NGetvNrOQV

— Ai2 (@allen_ai) June 2, 2025

Nathan Lambert, a senior research scientist at Ai2, told VentureBeat that the first RewardBench worked as intended when it was launched. Still, the model environment rapidly evolved, and so should its benchmarks. 

“As reward models became more advanced and use cases more nuanced, we quickly recognized with the community that the first version didn’t fully capture the complexity of real-world human preferences,” he said. 

Lambert added that with RewardBench 2, “we set out to improve both the breadth and depth of evaluation—incorporating more diverse, challenging prompts and refining the methodology to reflect better how humans actually judge AI outputs in practice.” He said the second version uses unseen human prompts, has a more challenging scoring setup and new domains. 

Using evaluations for models that evaluate

While reward models test how well models work, RMs themselves must also align with company values. Otherwise, a misaligned RM can score harmful responses too high, and the fine-tuning and reinforcement learning process can reinforce bad behavior, such as hallucinations, and reduce generalization.

RewardBench 2 covers six different domains: factuality, precise instruction following, math, safety, focus and ties.

“Enterprises should use RewardBench 2 in two different ways depending on their application. If they’re performing RLHF themselves, they should adopt the best practices and datasets from leading models in their own pipelines because reward models need on-policy training recipes (i.e. reward models that mirror the model they’re trying to train with RL). For inference time scaling or data filtering, RewardBench 2 has shown that they can select the best model for their domain and see correlated performance,” Lambert said. 
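The inference-time scaling and data filtering uses Lambert mentions can be sketched as follows. This is a minimal illustration that assumes a reward model is exposed as a function returning a score. The `toy_reward` scorer, its keyword rule, and the candidate responses are hypothetical, not part of RewardBench 2.

```python
from typing import Callable, List

RewardFn = Callable[[str, str], float]


def toy_reward(prompt: str, response: str) -> float:
    """Hypothetical scorer: rewards mentioning 'refund' and lightly penalizes length."""
    keyword_bonus = 1.0 if "refund" in response.lower() else 0.0
    return keyword_bonus - 0.01 * len(response.split())


def best_of_n(prompt: str, candidates: List[str], reward_fn: RewardFn) -> str:
    """Inference-time scaling: sample several responses, keep the highest-scoring one."""
    return max(candidates, key=lambda c: reward_fn(prompt, c))


def filter_by_reward(
    prompt: str, candidates: List[str], reward_fn: RewardFn, threshold: float
) -> List[str]:
    """Data filtering: keep only responses whose score meets the threshold."""
    return [c for c in candidates if reward_fn(prompt, c) >= threshold]


PROMPT = "How do I get my money back?"
CANDIDATES = [
    "Please contact support.",
    "You can request a refund within 30 days.",
    "Refund " * 11 + "refund",
]

print(best_of_n(PROMPT, CANDIDATES, toy_reward))
print(len(filter_by_reward(PROMPT, CANDIDATES, toy_reward, threshold=0.5)))
```

In both cases the quality of the result depends entirely on the reward model doing the scoring, which is why picking one that performs well in the relevant domain matters.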

Lambert noted that benchmarks like RewardBench offer users a way to evaluate the models they’re choosing based on the “dimensions that matter most to them, rather than relying on a narrow one-size-fits-all score.” He said the idea of performance, which many evaluation methods claim to assess, is very subjective because a good response from a model highly depends on the context and goals of the user. At the same time, human preferences get very nuanced. 
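One way to act on this is to weight per-domain scores by what the application cares about, rather than comparing a single average. The sketch below assumes per-domain accuracies are available for each candidate reward model. The model names (`rm_a`, `rm_b`), the scores, and the weights are all invented for illustration and are not leaderboard results.

```python
from typing import Dict

# Made-up per-domain accuracies for two hypothetical reward models.
CANDIDATES: Dict[str, Dict[str, float]] = {
    "rm_a": {"factuality": 0.80, "precise_if": 0.40, "math": 0.60,
             "safety": 0.70, "focus": 0.70, "ties": 0.60},
    "rm_b": {"factuality": 0.60, "precise_if": 0.50, "math": 0.70,
             "safety": 0.90, "focus": 0.80, "ties": 0.60},
}


def weighted_score(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average over the domains named in `weights`."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(scores[domain] * w for domain, w in weights.items()) / total


def pick_model(candidates: Dict[str, Dict[str, float]], weights: Dict[str, float]) -> str:
    """Return the candidate with the highest weighted score."""
    return max(candidates, key=lambda name: weighted_score(candidates[name], weights))


# Equal weight on every domain versus a factuality-heavy application.
UNIFORM = {domain: 1.0 for domain in CANDIDATES["rm_a"]}
FACT_HEAVY = {"factuality": 3.0, "math": 1.0}

print(pick_model(CANDIDATES, UNIFORM))
print(pick_model(CANDIDATES, FACT_HEAVY))
```

With these invented numbers, the equal-weight average favors `rm_b`, while the factuality-heavy weighting favors `rm_a`. The same two models rank differently depending on which dimensions matter to the user.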

Ai2 released the first version of RewardBench in March 2024. At the time, the company said it was the first benchmark and leaderboard for reward models. Since then, several methods for benchmarking and improving RMs have emerged. Researchers at Meta’s FAIR came out with reWordBench. DeepSeek released a new technique called Self-Principled Critique Tuning for smarter and more scalable RMs.

Super excited that our second reward model evaluation is out. It’s substantially harder, much cleaner, and well correlated with downstream PPO/BoN sampling.

Happy hillclimbing!

Huge congrats to @saumyamalik44 who lead the project with a total commitment to excellence. https://t.co/c0b6rHTXY5

— Nathan Lambert (@natolambert) June 2, 2025

How models performed

Since RewardBench 2 is an updated version of RewardBench, Ai2 tested both existing and newly trained models to see if they continue to rank high. These included a variety of models, such as versions of Gemini, Claude, GPT-4.1, and Llama-3.1, along with datasets and models like Qwen, Skywork, and its own Tulu. 

The company found that larger reward models perform best on the benchmark because their base models are stronger. Overall, the strongest-performing models are variants of Llama-3.1 Instruct. In terms of focus and safety, Skywork data “is particularly helpful,” and Tulu did well on factuality. 

Ai2 said it believes RewardBench 2 “is a step forward in broad, multi-domain accuracy-based evaluation” for reward models, but cautioned that model evaluation should mainly be used as a guide to pick models that work best with an enterprise’s needs.
