mGrowTech

AI agent evaluation replaces data labeling as the critical path to production deployment

By Josh
November 21, 2025
in Technology And Software



As LLMs have improved and become increasingly able to work with all types of data, some in the industry have questioned whether standalone data labeling tools are still needed. HumanSignal, the lead commercial vendor behind the open-source Label Studio project, has a different view. Rather than seeing less demand for data labeling, the company is seeing more.

Earlier this month, HumanSignal acquired Erud AI and launched its physical Frontier Data Labs for novel data collection. But creating data is only half the challenge. Today, the company is tackling what comes next: proving the AI systems trained on that data actually work. The new multi-modal agent evaluation capabilities let enterprises validate complex AI agents generating applications, images, code, and video.

"If you focus on the enterprise segments, then all of the AI solutions that they're building still need to be evaluated, which is just another word for data labeling by humans and even more so by experts," HumanSignal co-founder and CEO Michael Malyuk told VentureBeat in an exclusive interview.

The intersection of data labeling and agentic AI evaluation

Having the right data is great, but that's not the end goal for an enterprise. Modern data labeling is headed toward evaluation.

It's a fundamental shift in what enterprises need validated: not whether their model correctly classified an image, but whether their AI agent made good decisions across a complex, multi-step task involving reasoning, tool usage and code generation.

If evaluation is just data labeling for AI outputs, then the shift from models to agents represents a step change in what needs to be labeled. Where traditional data labeling might involve marking images or categorizing text, agent evaluation requires judging multi-step reasoning chains, tool selection decisions and multi-modal outputs — all within a single interaction.

"There is this very strong need for not just human in the loop anymore, but expert in the loop," Malyuk said. He pointed to high-stakes applications like healthcare and legal advice as examples where the cost of errors remains prohibitively high.

The connection between data labeling and AI evaluation runs deeper than semantics. Both activities require the same fundamental capabilities:

  • Structured interfaces for human judgment: Whether reviewers are labeling images for training data or assessing whether an agent correctly orchestrated multiple tools, they need purpose-built interfaces to capture their assessments systematically.

  • Multi-reviewer consensus: High-quality training datasets require multiple labelers who reconcile disagreements. High-quality evaluation requires the same — multiple experts assessing outputs and resolving differences in judgment.

  • Domain expertise at scale: Training modern AI systems requires subject matter experts, not just crowd workers clicking buttons. Evaluating production AI outputs requires the same depth of expertise.

  • Feedback loops into AI systems: Labeled training data feeds model development. Evaluation data feeds continuous improvement, fine-tuning and benchmarking.
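The multi-reviewer consensus step described above can be sketched in a few lines of Python. This is a minimal illustration, not HumanSignal's or Label Studio's implementation: the function name `consensus`, the 0.6 agreement threshold, and the sample labels are all assumptions made for the example. It takes the majority label for each item and flags items where too few reviewers agree, so they can be routed back to experts for reconciliation.

```python
from collections import Counter


def consensus(labels, min_agreement=0.6):
    """Return (majority_label, agreement, needs_review) for one item.

    `labels` is the list of judgments from independent reviewers.
    `min_agreement` is a hypothetical threshold chosen for this sketch.
    """
    counts = Counter(labels)
    top_label, top_count = counts.most_common(1)[0]
    agreement = top_count / len(labels)
    # Low agreement means reviewers disagree and the item needs reconciliation.
    return top_label, agreement, agreement < min_agreement


# Hypothetical expert judgments on two agent outputs.
reviews = {
    "trace-1": ["pass", "pass", "fail"],
    "trace-2": ["pass", "fail", "partial"],
}

results = {item: consensus(labels) for item, labels in reviews.items()}
for item, (label, agreement, needs_review) in results.items():
    print(item, label, round(agreement, 2), needs_review)
```

The same routine applies whether the labels are training annotations or evaluation verdicts, which is the point the list above makes: only the thing being judged changes.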

Evaluating the full agent trace

The challenge with evaluating agents isn't just the volume of data; it's the complexity of what needs to be assessed. Agents don't produce simple text outputs; they generate reasoning chains, make tool selections, and produce artifacts across multiple modalities.

The new capabilities in Label Studio Enterprise address agent validation requirements: 

  • Multi-modal trace inspection: The platform provides unified interfaces for reviewing complete agent execution traces—reasoning steps, tool calls, and outputs across modalities. This addresses a common pain point where teams must parse separate log streams. 

  • Interactive multi-turn evaluation: Evaluators assess conversational flows where agents maintain state across multiple turns, validating context tracking and intent interpretation throughout the interaction sequence. 

  • Agent Arena: Comparative evaluation framework for testing different agent configurations (base models, prompt templates, guardrail implementations) under identical conditions. 

  • Flexible evaluation rubrics: Teams define domain-specific evaluation criteria programmatically rather than using pre-defined metrics, supporting requirements like comprehension accuracy, response appropriateness or output quality for specific use cases.
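To make the ideas of a full agent trace, a programmatically defined rubric, and side-by-side comparison more concrete, here is a small Python sketch. It does not use or represent the Label Studio Enterprise API. The `Step` and `Trace` classes, the two criteria, the weights, and the configuration names are all hypothetical, and the automated checks stand in for judgments that human experts would normally supply through a review interface.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Step:
    kind: str                   # "reasoning", "tool_call", or "output"
    content: str
    tool: Optional[str] = None  # set only for tool calls


@dataclass
class Trace:
    agent_config: str           # label for the configuration under test
    steps: List[Step] = field(default_factory=list)


ALLOWED_TOOLS = frozenset({"search", "calculator"})  # assumption for the example


def tool_selection(trace: Trace) -> float:
    """Fraction of tool calls that used an allowed tool."""
    calls = [s for s in trace.steps if s.kind == "tool_call"]
    if not calls:
        return 1.0
    return sum(s.tool in ALLOWED_TOOLS for s in calls) / len(calls)


def completion(trace: Trace) -> float:
    """1.0 if the trace ends with a final output, else 0.0."""
    return 1.0 if trace.steps and trace.steps[-1].kind == "output" else 0.0


# A rubric maps a criterion name to (weight, scoring function).
Rubric = Dict[str, Tuple[float, Callable[[Trace], float]]]
RUBRIC: Rubric = {
    "tool_selection": (0.5, tool_selection),
    "completion": (0.5, completion),
}


def score(trace: Trace, rubric: Rubric) -> Dict[str, float]:
    """Score a trace per criterion and as a weighted overall value."""
    per_criterion = {name: fn(trace) for name, (_, fn) in rubric.items()}
    total_weight = sum(weight for weight, _ in rubric.values())
    overall = sum(rubric[n][0] * v for n, v in per_criterion.items()) / total_weight
    return {"overall": overall, **per_criterion}


# Two hypothetical configurations run on the same task.
trace_a = Trace("config-a", [
    Step("reasoning", "Need current figures; search first."),
    Step("tool_call", "query: quarterly revenue", tool="search"),
    Step("output", "Revenue grew quarter over quarter."),
])
trace_b = Trace("config-b", [
    Step("reasoning", "Try running a script."),
    Step("tool_call", "cat report.txt", tool="shell"),
    Step("tool_call", "query: quarterly revenue", tool="search"),
])

for trace in (trace_a, trace_b):
    print(trace.agent_config, score(trace, RUBRIC))
```

Scoring every configuration against the same rubric and the same task is what makes the comparison fair. In a real deployment the criteria would more likely capture expert ratings than simple rule checks.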

Agent evaluation is the new battleground for data labeling vendors

HumanSignal isn't alone in recognizing that agent evaluation represents the next phase of the data labeling market. Competitors are making similar pivots as the industry responds to both technological shifts and market disruption.

Labelbox launched its Evaluation Studio in August 2025, focused on rubric-based evaluations. Like HumanSignal, the company is expanding beyond traditional data labeling into production AI validation.

The overall competitive landscape for data labeling shifted dramatically in June 2025 when Meta invested $14.3 billion for a 49% stake in Scale AI, the market's previous leader. The deal triggered an exodus of some of Scale's largest customers. HumanSignal capitalized on the disruption, with Malyuk claiming that his company was able to win multiple competitive deals last quarter. Malyuk cites platform maturity, configuration flexibility, and customer support as differentiators, though competitors make similar claims.

What this means for AI builders

For enterprises building production AI systems, the convergence of data labeling and evaluation infrastructure has several strategic implications:

Start with ground truth. Investment in creating high-quality labeled datasets with multiple expert reviewers who resolve disagreements pays dividends throughout the AI development lifecycle — from initial training through continuous production improvement.

Observability is necessary but insufficient. While monitoring what AI systems do remains important, observability tools measure activity, not quality. Enterprises require dedicated evaluation infrastructure to assess outputs and drive improvement. These are distinct problems requiring different capabilities.

Training data infrastructure doubles as evaluation infrastructure. Organizations that have invested in data labeling platforms for model development can extend that same infrastructure to production evaluation. These aren't separate problems requiring separate tools — they're the same fundamental workflow applied at different lifecycle stages.

For enterprises deploying AI at scale, the bottleneck has shifted from building models to validating them. Organizations that recognize this shift early gain advantages in shipping production AI systems.

The critical question for enterprises has evolved: not whether AI systems are sophisticated enough, but whether organizations can systematically prove they meet the quality requirements of specific high-stakes domains.
