
NVIDIA AI Introduces PivotRL: A New AI Framework Achieving High Agentic Accuracy With 4x Fewer Rollout Turns Efficiently

By Josh
March 25, 2026
AI, Analytics and Automation


Post-training Large Language Models (LLMs) for long-horizon agentic tasks—such as software engineering, web browsing, and complex tool use—presents a persistent trade-off between computational efficiency and model generalization. While Supervised Fine-Tuning (SFT) is computationally inexpensive, it frequently suffers from out-of-domain (OOD) performance degradation and struggles to generalize beyond its training distribution. Conversely, end-to-end reinforcement learning (E2E RL) typically preserves OOD capabilities and achieves high in-domain accuracy, but it incurs massive compute costs due to the necessity of repeated, many-turn on-policy rollouts for every parameter update.

NVIDIA researchers have introduced PivotRL, a framework designed to bridge this gap. By operating on existing SFT trajectories, PivotRL aims to deliver the generalization benefits of E2E RL while maintaining the data efficiency associated with SFT.

The Architecture of a Pivot

The core of PivotRL is the transition from full-trajectory rollouts to targeted, turn-level updates. The framework identifies and utilizes two primary mechanisms: Pivot Filtering and Functional Rewards.

1. Pivot Filtering

In turn-level agentic training, every assistant completion at a model-call boundary is considered an action. PivotRL begins by extracting all assistant turns from an SFT dataset into a ‘pivot candidate’ pool.

The system then profiles these candidates offline using a frozen reference policy π₀. To optimize the training budget, PivotRL filters for pivots: specific states where local, on-policy rollouts exhibit high variance in outcomes. The filtering criteria are defined by two conditions:

  • Nonzero empirical reward variance: σ̂²(s) > 0.
  • Low empirical reward mean: μ̂(s) < λ_diff.

This approach addresses the uninformative-turn bottleneck. In group-normalized RL—specifically Group Relative Policy Optimization (GRPO)—turns where actions either uniformly succeed or uniformly fail result in a normalized advantage of zero, providing no meaningful gradient update. By focusing on mixed-outcome turns that remain difficult for the reference policy, PivotRL concentrates compute on states that provide the strongest learning signal.
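The filtering step above can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: the function name `filter_pivots`, the per-turn reward lists, and the default λ_diff = 0.5 are all assumptions standing in for k local rollouts of the frozen reference policy.

```python
from statistics import mean, pvariance

def is_pivot(rewards, lambda_diff=0.5):
    """A candidate turn is a pivot when its local rollouts under the frozen
    reference policy have nonzero empirical reward variance and an
    empirical mean below the difficulty threshold lambda_diff."""
    return pvariance(rewards) > 0 and mean(rewards) < lambda_diff

def filter_pivots(candidate_rollouts, lambda_diff=0.5):
    """candidate_rollouts maps a turn id to the rewards of its k rollouts."""
    return [turn for turn, rs in candidate_rollouts.items()
            if is_pivot(rs, lambda_diff)]

rollouts = {
    "t0": [0, 0, 0, 0],  # uniform failure: zero variance, no GRPO signal
    "t1": [1, 1, 1, 1],  # uniform success: zero variance, no GRPO signal
    "t2": [1, 0, 0, 0],  # mixed and hard: kept as a pivot
    "t3": [1, 1, 1, 0],  # mixed but easy: mean >= lambda_diff, dropped
}
print(filter_pivots(rollouts))  # -> ['t2']
```

Only the mixed-outcome, low-mean turn survives, which is exactly the population that carries a nonzero group-normalized advantage.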

2. Implementing Functional Rewards

Standard SFT-to-RL adaptations often rely on exact string matching with the demonstration data to assign rewards. However, in generative action spaces (e.g., shell commands or search queries), multiple functionally equivalent actions may diverge from the specific string in the training data.

PivotRL replaces strict matching with functional rewards, r_func(s, a) = 1[a ∈ ℳ(s)], where ℳ(s) is the set of locally acceptable actions determined by a domain-specific verifier. These verifiers can range from normalized schema checks and string similarity to lightweight LLM-as-a-judge scoring.
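A minimal sketch of the idea, under stated assumptions: the verifier below is a hypothetical normalized-comparison check for a shell-command action space (one of the "normalized schema checks" the text mentions), not the paper's verifier. It rewards any quoting or whitespace variant of the demonstration command that exact string matching would reject.

```python
import shlex

def functional_reward(state, action, accepts):
    """r_func(s, a) = 1[a in M(s)]: reward 1 iff the domain verifier
    accepts the action as functionally equivalent to the demonstration."""
    return 1.0 if accepts(state, action) else 0.0

def shell_equivalence(state, action):
    """Toy verifier: compare shlex-normalized token lists, so quoting and
    spacing variants of the demo command still count as acceptable."""
    return shlex.split(action) == shlex.split(state["demo_action"])

s = {"demo_action": "grep -r 'TODO' src/"}
print(functional_reward(s, 'grep   -r "TODO" src/', shell_equivalence))  # -> 1.0
print(functional_reward(s, "grep -r TODO docs/", shell_equivalence))     # -> 0.0
```

Exact string matching would score the first action 0; the functional reward credits it because the tokenized command is identical.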

Theoretical Foundations: Gradient Signal and OOD Retention

The effectiveness of these design choices is supported by two primary theoretical results:

  • Theorem 3.2 (Reward Variance and GRPO Signal): The research team proved that the Fisher norm of the natural gradient of the statewise reward objective scales with the reward standard deviation. Specifically, the population GRPO score γ_{s,β} equals σ/β². This validates the strategy of filtering for mixed-outcome pivots to maximize the local in-domain learning signal.
  • Theorem 3.3 (Minimal KL Change): This theorem demonstrates that functional reward-based RL shifts probability mass toward acceptable actions while preserving the reference policy’s relative probability ordering for actions unrelated to the training task. Because the relative ranking of task-unrelated actions remains unchanged, PivotRL significantly mitigates the catastrophic forgetting and OOD degradation common in SFT.
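Theorem 3.2's intuition can be seen directly in the group-normalized advantage computation. The sketch below is illustrative (the `eps` stabilizer and reward vectors are assumptions, not the paper's code): uniform outcomes have zero standard deviation, so every advantage collapses to zero and the turn contributes no gradient, while mixed outcomes yield ±1 advantages.

```python
from statistics import mean, pstdev

def grpo_advantages(rewards, eps=1e-8):
    """Group-normalized advantages as in GRPO: A_i = (r_i - mean) / (std + eps).
    Zero reward variance makes every A_i ~0, i.e. no learning signal."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

print(grpo_advantages([1, 1, 1, 1]))  # all 0.0: uniform success, no signal
print(grpo_advantages([1, 0, 0, 1]))  # ~[+1, -1, -1, +1]: mixed outcomes carry signal
```

This is why pivot filtering discards uniformly succeeding or failing turns before any expensive training rollouts are spent on them.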

Performance and Efficiency

The research team evaluated PivotRL using Qwen3-30B-A3B-Thinking-2507 as the base model across four agentic domains: conversational tool use (τ²-Bench), software engineering (SWE-Bench Verified), terminal control (Terminal-Bench), and web browsing (BrowseComp).

In-Domain Accuracy Gains

Compared to SFT on identical data, PivotRL achieved superior in-domain results:

  • Average Gain: +14.11 points over the base model, compared to +9.94 points for SFT.
  • Domain Specifics: PivotRL outperformed SFT on τ²-Bench (+5.37), Terminal-Bench (+6.25), and BrowseComp (+9.80).

Out-of-Domain Retention

The most significant advantage was observed in OOD stability. While SFT caused an average regression of -9.83 across eight OOD benchmarks (including math and science QA), PivotRL maintained a near-zero average change of +0.21. Notably, PivotRL achieved +10.04% higher OOD accuracy in non-agentic tasks compared to SFT.

Compute Efficiency on SWE-Bench

On SWE-Bench Verified, a rigorous standard for long-horizon agents, PivotRL demonstrated a substantial reduction in training overhead:

  • Turn Efficiency: PivotRL reached accuracy levels comparable to E2E RL using 4x fewer rollout turns.
  • Temporal Efficiency: Training was ~5.5x faster in wall-clock time than E2E RL when using the same number of compute nodes.

Key Takeaways

  • Hybrid Efficiency: PivotRL combines the compute efficiency of Supervised Fine-Tuning (SFT) with the out-of-domain (OOD) generalization of End-to-End RL.
  • Pivot Filtering: The framework identifies ‘pivots’—critical intermediate turns where sampled actions show high variance in success/failure, providing the strongest learning signals.
  • Functional Verifiers: Instead of requiring exact text matches, PivotRL uses domain-specific verifiers to reward any functionally equivalent action.
  • OOD Stability: Unlike SFT, PivotRL preserves the model’s performance on unrelated tasks (e.g., math) by maintaining the reference policy’s probability ordering for task-unrelated actions.
  • Production Speed: It achieves accuracy comparable to E2E RL with 4x fewer rollout turns and ~5.5x faster training time, as demonstrated with NVIDIA’s Nemotron-3-Super.

Check out the Paper.
