How Google’s 'internal RL' could unlock long-horizon AI agents

By Josh
January 19, 2026



Researchers at Google have developed a technique that makes it easier for AI models to learn complex reasoning tasks that usually cause LLMs to hallucinate or fall apart. Instead of training LLMs through next-token prediction, their technique, called internal reinforcement learning (internal RL), steers the model’s internal activations toward developing a high-level step-by-step solution for the input problem. 


Ultimately, this could provide a scalable path for creating autonomous agents that can handle complex reasoning and real-world robotics without needing constant, manual guidance.

The limits of next-token prediction

Reinforcement learning plays a key role in post-training LLMs, particularly for complex reasoning tasks that require long-horizon planning. However, the problem lies in the architecture of these models. LLMs are autoregressive, meaning they generate sequences one token at a time. When these models explore new strategies during training, they do so by making small, random changes to the next single token or action. This exposes a deeper limitation: next-token prediction forces models to search for solutions at the wrong level of abstraction, making long-horizon reasoning inefficient even when the model “knows” what to do.

This token-by-token approach works well for basic language modeling but breaks down in long-horizon tasks where rewards are sparse. If the model relies solely on random token-level sampling, the probability of stumbling upon the correct multi-step solution is infinitesimally small, "on the order of one in a million," according to the researchers.
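A back-of-the-envelope calculation shows why the odds collapse at the token level. The numbers below are illustrative assumptions (a 20-step task, a 1,000-token vocabulary, a handful of candidate subgoals), not figures from the paper:

```python
# Toy numbers (assumptions, not from the paper): a 20-step task where each
# step must pick the right token from a 1,000-token vocabulary, versus
# picking the right subgoal from 5 high-level options at 4 decision points.
STEPS, VOCAB = 20, 1_000
SUBGOAL_POINTS, SUBGOALS = 4, 5

# Probability that uniform token-level sampling hits the exact solution.
p_token_level = (1 / VOCAB) ** STEPS

# Probability that uniform sampling over abstract subgoals hits it.
p_abstract = (1 / SUBGOALS) ** SUBGOAL_POINTS

print(f"token-level: {p_token_level:.1e}")
print(f"abstract:    {p_abstract:.1e}")
```

Even with generous assumptions, random token-level search is hopeless, while searching over a few abstract choices stays tractable.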

The issue isn't just that the models get confused; it’s that they get confused at the wrong level. In comments provided to VentureBeat, Yanick Schimpf, a co-author of the paper, notes that in a 20-step task, an agent can get lost in the minute details of a single step, or it can lose track of the overall goal.

"We argue that when facing a problem with some abstract structure… [goal-oriented exploration] is what you want," Schimpf said. By solving the problem at the abstract level first, the agent commits to a path, ensuring it doesn't "get lost in one of the reasoning steps" and fail to complete the broader workflow.

To address this, the field has long looked toward hierarchical reinforcement learning (HRL). Rather than managing a task as a string of tokens, HRL attempts to solve complex problems by decomposing them into a hierarchy of temporally abstract actions: high-level subroutines that represent different stages of the solution.

However, discovering these appropriate subroutines remains a longstanding challenge. Current HRL methods often fail to discover proper policies, frequently "converging to degenerate options" that do not represent meaningful behaviors. Even sophisticated modern methods like GRPO (a popular RL algorithm used for sparse-reward tasks) fail in complex environments because they cannot effectively bridge the gap between low-level execution and high-level planning.

Steering the LLM's internal thoughts

To overcome these limitations, the Google team proposed internal RL. Advanced autoregressive models already "know" how to perform complex, multi-step tasks internally, even if they aren't explicitly trained to do so.

Because these complex behaviors are hidden inside the model's residual stream (i.e., the numerical values that carry information through the network's layers), the researchers introduced an "internal neural network controller," or metacontroller. Instead of monitoring and changing the output token, the metacontroller controls the model’s behavior by applying changes to the model's internal activations in the middle layers.
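In spirit, the intervention looks something like the sketch below. This is a minimal stand-in, not Google's architecture: the "layers" are toy nonlinearities, and the metacontroller's steering rule is a hypothetical placeholder for a learned network.

```python
# Minimal sketch of residual-stream steering (illustrative only): a
# metacontroller adds a delta to a mid-layer activation of a frozen model.
import math

def base_layer(h):
    # Stand-in for a frozen transformer block: a fixed nonlinearity.
    return [math.tanh(x) for x in h]

def metacontroller(goal_embedding, strength=1.0):
    # Hypothetical steering rule: a delta in the direction of the goal.
    return [strength * g for g in goal_embedding]

def forward(token_embedding, goal_embedding):
    h = base_layer(token_embedding)          # early layers (frozen)
    delta = metacontroller(goal_embedding)   # steer the residual stream
    h = [x + d for x, d in zip(h, delta)]    # nudge mid-layer activations
    return base_layer(h)                     # later layers (frozen)

steered = forward([0.1, -0.2, 0.3], goal_embedding=[0.5, 0.0, -0.5])
unsteered = forward([0.1, -0.2, 0.3], goal_embedding=[0.0, 0.0, 0.0])
```

The point of the sketch: no output token is edited directly; the same frozen layers produce different downstream behavior because the hidden state was nudged in between.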

This nudge steers the model into a specific useful state. The base model then automatically generates the sequence of individual steps needed to achieve that goal because it has already seen those patterns during its initial pretraining. 

The metacontroller operates through unsupervised learning and does not require human-labeled training examples. Instead, the researchers use a self-supervised framework where the model analyzes a full sequence of behavior and works backward to infer the hidden, high-level intent that best explains the actions.
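A toy version of that backward inference can be written as a likelihood comparison. The intent names and per-intent action distributions below are invented for illustration; the paper's actual objective is learned, not tabulated:

```python
# Toy latent-intent inference (illustrative, not the paper's objective):
# score each candidate high-level intent by how likely it makes the
# observed action sequence, and pick the best explanation.
import math

# Hypothetical per-intent distributions over actions {0, 1, 2}.
INTENT_MODELS = {
    "fetch":   [0.7, 0.2, 0.1],
    "deliver": [0.1, 0.2, 0.7],
}

def infer_intent(actions):
    # Work backward from the full trajectory: argmax log-likelihood.
    def loglik(intent):
        probs = INTENT_MODELS[intent]
        return sum(math.log(probs[a]) for a in actions)
    return max(INTENT_MODELS, key=loglik)
```

A trajectory dominated by action 2 is best explained by the "deliver" intent; no human ever labeled the trajectory, yet the hidden goal falls out of the actions themselves.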

During the internal RL phase, the updates are applied to the metacontroller, which shifts training from next-token prediction to learning high-level actions that can lead to the solution.
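The division of labor can be sketched at toy scale with a REINFORCE loop. Everything here is a simplification under stated assumptions: the "frozen base model" is a lookup table standing in for pretrained behaviors, and the metacontroller is just a set of goal-selection logits, the only parameters that ever receive gradients.

```python
# Toy sketch of the training split: a frozen base model expands a high-level
# goal into a plan, while REINFORCE updates only the metacontroller's logits.
import math
import random

random.seed(0)

# Frozen stand-in for pretrained behaviors: goal -> plan.
BASE_MODEL = {0: "wrong plan", 1: "correct plan", 2: "wrong plan"}

logits = [0.0, 0.0, 0.0]  # metacontroller parameters (the only trainables)

def sample_goal(probs):
    r, acc = random.random(), 0.0
    for g, p in enumerate(probs):
        acc += p
        if r < acc:
            return g
    return len(probs) - 1

for _ in range(500):
    exps = [math.exp(l) for l in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    goal = sample_goal(probs)
    reward = 1.0 if BASE_MODEL[goal] == "correct plan" else 0.0
    # REINFORCE on the high-level choice: reward * grad log pi(goal).
    for g in range(3):
        logits[g] += 0.1 * reward * ((1.0 if g == goal else 0.0) - probs[g])

best = max(range(3), key=lambda g: logits[g])
```

After training, the controller concentrates probability on the rewarded goal. Credit assignment happens over a handful of abstract choices rather than thousands of tokens, which is the efficiency argument the paper makes.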

To understand the practical value of this, consider an enterprise agent tasked with code generation. Today, there is a difficult trade-off: You need "low temperature" (predictability) to get the syntax right, but "high temperature" (creativity) to solve the logic puzzle.

"Internal RL might facilitate this by allowing the model to explore the space of abstract actions, i.e. structuring logic and method calls, while delegating the token-level realization of those actions to the robust, lower-temperature distribution of the base model," Schimpf said. The agent explores the solution without breaking the syntax.

The researchers investigated two methods for applying this controller. In the first, the base autoregressive model is pretrained on a behavioral dataset and then frozen, while the metacontroller is trained to steer the frozen model's residual stream. In the second, the metacontroller and the base model are jointly optimized, with parameters of both networks updated simultaneously. 

Internal RL in action

To evaluate the effectiveness of internal RL, the researchers ran experiments across hierarchical environments designed to stump traditional learners. These included a discrete grid world and a continuous control task where a quadrupedal "ant" robot must coordinate joint movements. Both environments used sparse rewards with very long action sequences.

While baselines like GRPO and CompILE failed to learn the tasks within a million episodes due to the difficulty of credit assignment over long horizons, internal RL achieved high success rates with a small number of training episodes. By choosing high-level goals rather than tiny steps, the metacontroller drastically reduced the search space. This allowed the model to identify which high-level decisions led to success, making credit assignment efficient enough to solve the sparse reward problem.

Notably, the researchers found that the "frozen" approach was superior. When the base model and metacontroller were co-trained from scratch, the system failed to develop meaningful abstractions. However, applied to a frozen model, the metacontroller successfully discovered key checkpoints without any human labels, perfectly aligning its internal switching mechanism with the ground-truth moments when an agent finished one subgoal and started the next.

As the industry currently fixates on reasoning models that output verbose "chains of thought" to solve problems, Google’s research points toward a different, perhaps more efficient future.

"Our study joins a growing body of work suggesting that 'internal reasoning' is not only feasible but potentially more efficient than token-based approaches," Schimpf said. "Moreover, these silent 'thoughts' can be decoupled from specific input modalities — a property that could be particularly relevant for the future of multi-modal AI."

If internal reasoning can be guided without being externalized, the future of AI agents may hinge less on prompting strategies and more on how well we can access and steer what models already represent internally. For enterprises betting on autonomous systems that must plan, adapt, and act over long horizons, that shift could matter more than any new reasoning benchmark.


