How Google’s 'internal RL' could unlock long-horizon AI agents

By Josh
January 19, 2026
Technology And Software

Researchers at Google have developed a technique that makes it easier for AI models to learn complex reasoning tasks that usually cause LLMs to hallucinate or fall apart. Instead of training LLMs through next-token prediction, their technique, called internal reinforcement learning (internal RL), steers the model’s internal activations toward developing a high-level step-by-step solution for the input problem. 

Ultimately, this could provide a scalable path for creating autonomous agents that can handle complex reasoning and real-world robotics without needing constant, manual guidance.

The limits of next-token prediction

Reinforcement learning plays a key role in the post-training of LLMs, particularly for complex reasoning tasks that require long-horizon planning. The problem lies in the architecture of these models: LLMs are autoregressive, meaning they generate sequences one token at a time. When these models explore new strategies during training, they do so by making small, random changes to the next token or action. This exposes a deeper limitation: next-token prediction forces models to search for solutions at the wrong level of abstraction, making long-horizon reasoning inefficient even when the model “knows” what to do.

This token-by-token approach works well for basic language modeling but breaks down in long-horizon tasks where rewards are sparse. If the model relies solely on random token-level sampling, the probability of stumbling upon the correct multi-step solution is infinitesimally small, "on the order of one in a million," according to the researchers.
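
A back-of-the-envelope calculation illustrates the scale. The per-step odds below are our own illustrative assumption, not the paper's, but they show how quickly modest per-step uncertainty compounds into the researchers' one-in-a-million regime:

```python
# Illustrative arithmetic (our numbers, not the paper's): a 20-step task
# where random token-level exploration happens to pick a "correct"
# continuation at each step with probability 0.5.
p_step, n_steps = 0.5, 20
p_success = p_step ** n_steps
print(f"{p_success:.2e}")  # ~9.54e-07 -- on the order of one in a million
```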

The issue isn't just that the models get confused; it’s that they get confused at the wrong level. In comments provided to VentureBeat, Yanick Schimpf, a co-author of the paper, notes that in a 20-step task, an agent can get lost in the minute details of a single step, or it can lose track of the overall goal.

"We argue that when facing a problem with some abstract structure… [goal-oriented exploration] is what you want," Schimpf said. By solving the problem at the abstract level first, the agent commits to a path, ensuring it doesn't "get lost in one of the reasoning steps" and fail to complete the broader workflow.

To address this, the field has long looked toward hierarchical reinforcement learning (HRL). HRL attempts to solve complex problems by decomposing them into a hierarchy of temporally abstract actions (high-level subroutines that represent different stages of the solution) rather than managing a task as a string of tokens.
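
As a schematic illustration of that framing (the option names and sub-policies below are invented for this example, not drawn from the paper or any specific HRL system), a high-level policy picks a temporally abstract option, and a low-level policy executes its primitive actions:

```python
import random

# Hypothetical options-style decomposition: each "option" is a high-level
# subroutine that expands into a sequence of primitive actions.
OPTIONS = {
    "go_to_door": ["step", "step", "turn", "step"],
    "open_door":  ["grasp", "twist", "push"],
    "cross_room": ["step", "step", "step", "step"],
}

def high_level_policy(state):
    # A trained policy would condition on state; here we sample uniformly.
    return random.choice(list(OPTIONS))

state = "start"
for _ in range(3):
    option = high_level_policy(state)   # decide at the abstract level
    for action in OPTIONS[option]:      # then execute the primitive steps
        print(option, "->", action)
```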

However, discovering these appropriate subroutines remains a longstanding challenge. Current HRL methods often fail to discover proper policies, frequently "converging to degenerate options" that do not represent meaningful behaviors. Even sophisticated modern methods like GRPO (a popular RL algorithm used for sparse-reward tasks) fail in complex environments because they cannot effectively bridge the gap between low-level execution and high-level planning.

Steering the LLM's internal thoughts

To overcome these limitations, the Google team proposed internal RL. The key insight is that advanced autoregressive models already "know" how to perform complex, multi-step tasks internally, even if they aren't explicitly trained to do so.

Because these complex behaviors are hidden inside the model's residual stream (i.e., the numerical values that carry information through the network's layers), the researchers introduced an "internal neural network controller," or metacontroller. Instead of monitoring and changing output tokens, the metacontroller steers the model's behavior by modifying its internal activations in the middle layers.
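
Conceptually, this is close to activation steering with a learned controller. The following is a minimal PyTorch sketch under our own simplifying assumptions: the toy residual blocks and the single linear "metacontroller" are stand-ins for illustration, not the paper's architecture:

```python
import torch
import torch.nn as nn

D = 64  # hidden width of a toy model

class ResidualBlock(nn.Module):
    """Stand-in for a transformer block: x plus a learned update."""
    def __init__(self, d):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(d, d), nn.GELU(), nn.Linear(d, d))

    def forward(self, x):
        return x + self.ff(x)  # the residual stream carries information forward

base_model = nn.Sequential(*[ResidualBlock(D) for _ in range(8)])
metacontroller = nn.Linear(D, D)  # maps activations to a steering nudge

def steer_hook(module, inputs, output):
    # Add the controller's nudge to the residual stream mid-network,
    # leaving output tokens and base weights untouched.
    return output + metacontroller(output)

handle = base_model[4].register_forward_hook(steer_hook)  # a middle layer

x = torch.randn(2, 10, D)   # (batch, sequence, hidden)
steered = base_model(x)     # forward pass now includes the internal nudge
handle.remove()
```

Only the hook's additive term is new; the base model's weights and its token-by-token decoding are left intact.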

This nudge steers the model into a specific useful state. The base model then automatically generates the sequence of individual steps needed to achieve that goal because it has already seen those patterns during its initial pretraining. 

The metacontroller operates through unsupervised learning and does not require human-labeled training examples. Instead, the researchers use a self-supervised framework where the model analyzes a full sequence of behavior and works backward to infer the hidden, high-level intent that best explains the actions.
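
A rough sketch of that recipe, again under our own simplifying assumptions (the GRU encoder and the single intent vector per trajectory are illustrative choices, not the paper's design):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

N_ACTIONS, D, Z = 16, 64, 8

embed = nn.Embedding(N_ACTIONS, D)
encoder = nn.GRU(D, D, batch_first=True)  # reads the *complete* trajectory
to_intent = nn.Linear(D, Z)               # final hidden state -> intent z
decoder = nn.Linear(Z, N_ACTIONS)         # intent -> action logits

actions = torch.randint(0, N_ACTIONS, (4, 20))  # (batch, trajectory length)
_, h = encoder(embed(actions))                  # summarize the full sequence
z = to_intent(h[-1])                            # inferred high-level intent

# Self-supervised objective: the inferred intent must explain (reconstruct)
# the very actions it was distilled from. A deliberately crude decoder here
# predicts every step from z alone.
logits = decoder(z).unsqueeze(1).expand(-1, actions.size(1), -1)
loss = F.cross_entropy(logits.reshape(-1, N_ACTIONS), actions.reshape(-1))
loss.backward()  # no human labels anywhere in the loop
```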

During the internal RL phase, the updates are applied to the metacontroller, which shifts training from next-token prediction to learning high-level actions that can lead to the solution.
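
In sketch form, that division of labor could look like the REINFORCE-style loop below; the Gaussian latent, shapes, and stand-in reward are our placeholders, not details from the paper:

```python
import torch
import torch.nn as nn

D, Z = 64, 8
metacontroller = nn.Linear(D, 2 * Z)  # outputs mean and log-std of latent z
optimizer = torch.optim.Adam(metacontroller.parameters(), lr=1e-3)

def rollout_with_steering(z):
    # Placeholder: steer the frozen base model with z, run the episode,
    # and return a sparse 0/1 task reward. Here it is a fake function of z.
    return (z.sum(dim=-1) > 0).float()

state_summary = torch.randn(16, D)  # batch of task encodings
mean, log_std = metacontroller(state_summary).chunk(2, dim=-1)
dist = torch.distributions.Normal(mean, log_std.exp())
z = dist.sample()                   # explore in the space of abstract actions
reward = rollout_with_steering(z)

# Policy gradient flows into the metacontroller only; token-level
# generation stays with the (frozen) base model.
loss = -(dist.log_prob(z).sum(-1) * (reward - reward.mean())).mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()
```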

To understand the practical value of this, consider an enterprise agent tasked with code generation. Today, there is a difficult trade-off: You need "low temperature" (predictability) to get the syntax right, but "high temperature" (creativity) to solve the logic puzzle.

"Internal RL might facilitate this by allowing the model to explore the space of abstract actions, i.e. structuring logic and method calls, while delegating the token-level realization of those actions to the robust, lower-temperature distribution of the base model," Schimpf said. The agent explores the solution without breaking the syntax.

The researchers investigated two methods for applying this controller. In the first, the base autoregressive model is pretrained on a behavioral dataset and then frozen, while the metacontroller is trained to steer the frozen model's residual stream. In the second, the metacontroller and the base model are jointly optimized, with parameters of both networks updated simultaneously. 
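
In code, the two regimes differ mainly in which parameters receive gradients. A schematic comparison with stand-in modules:

```python
import torch
import torch.nn as nn

base_model = nn.Linear(64, 64)      # stand-in for the pretrained model
metacontroller = nn.Linear(64, 64)  # stand-in for the controller

# Variant 1: freeze the pretrained base model; train only the metacontroller.
for p in base_model.parameters():
    p.requires_grad = False
opt_frozen = torch.optim.Adam(metacontroller.parameters(), lr=1e-3)

# Variant 2: joint optimization; re-enable gradients so both networks
# are updated simultaneously.
for p in base_model.parameters():
    p.requires_grad = True
opt_joint = torch.optim.Adam(
    list(base_model.parameters()) + list(metacontroller.parameters()), lr=1e-3
)
```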

Internal RL in action

To evaluate the effectiveness of internal RL, the researchers ran experiments across hierarchical environments designed to stump traditional learners. These included a discrete grid world and a continuous control task where a quadrupedal "ant" robot must coordinate joint movements. Both environments used sparse rewards with very long action sequences.

While baselines like GRPO and CompILE failed to learn the tasks within a million episodes due to the difficulty of credit assignment over long horizons, internal RL achieved high success rates with a small number of training episodes. By choosing high-level goals rather than tiny steps, the metacontroller drastically reduced the search space. This allowed the model to identify which high-level decisions led to success, making credit assignment efficient enough to solve the sparse reward problem.

Notably, the researchers found that the "frozen" approach was superior. When the base model and metacontroller were co-trained from scratch, the system failed to develop meaningful abstractions. However, applied to a frozen model, the metacontroller successfully discovered key checkpoints without any human labels, perfectly aligning its internal switching mechanism with the ground-truth moments when an agent finished one subgoal and started the next.

As the industry currently fixates on reasoning models that output verbose "chains of thought" to solve problems, Google’s research points toward a different, perhaps more efficient future.

"Our study joins a growing body of work suggesting that 'internal reasoning' is not only feasible but potentially more efficient than token-based approaches," Schimpf said. "Moreover, these silent 'thoughts' can be decoupled from specific input modalities — a property that could be particularly relevant for the future of multi-modal AI."

If internal reasoning can be guided without being externalized, the future of AI agents may hinge less on prompting strategies and more on how well we can access and steer what models already represent internally. For enterprises betting on autonomous systems that must plan, adapt, and act over long horizons, that shift could matter more than any new reasoning benchmark.


