TTT-Discover optimizes GPU kernels 2x faster than human experts — by training during inference

By Josh
February 8, 2026
in Technology And Software



Researchers from Stanford, Nvidia, and Together AI have developed a new technique that can discover novel solutions to very complex problems. For example, they used it to optimize a critical GPU kernel to run 2x faster than the previous state of the art written by human experts.


Their technique, called “Test-Time Training to Discover” (TTT-Discover), challenges the current paradigm of letting models “think longer” on reasoning problems. Instead, it allows the model to continue training during inference, updating its weights for the problem at hand.

The limits of 'frozen' reasoning

Current enterprise AI strategies often rely on "frozen" models. Whether you use a closed or open reasoning model, the model's parameters are static. When you prompt these models, they search for answers within the fixed manifold of their training data. This works well for problems that resemble what the model has seen before.

However, true discovery problems, like inventing a novel algorithm or proving a new mathematical theorem, are, by definition, out-of-distribution. If the solution requires a leap of logic that doesn't exist in the training set, a frozen model will likely fail, no matter how much compute you throw at it during inference.

In comments to VentureBeat, Mert Yuksekgonul, a co-author of the paper and a doctoral student at Stanford, illustrated this distinction using a famous mathematical breakthrough:

"I believe that thinking models wouldn't be able to prove, for example, P != NP, without test-time training, just like Andrew Wiles wouldn't be able to prove Fermat's Last Theorem without the 7 years he spent pursuing this single problem in isolation and continuously learning from his own failures."

TTT-Discover treats the test problem not as a query to be answered, but as an environment to be mastered. As the model attempts to solve the problem, it generates different types of data: failures, partial successes, and errors. Instead of discarding this data, TTT-Discover uses it to update the model's weights in real-time, effectively allowing the model to laser focus on that specific challenge as opposed to developing a very general problem-solving framework.
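
The loop described above can be sketched roughly as follows. This is a minimal illustration based on the article's description, not the released code; the generate, verify, and train_step calls are hypothetical placeholders for the rollout, verifier, and optimizer steps.

```python
# Minimal sketch of a test-time training loop, per the description above.
# All method names are illustrative placeholders, not the paper's released code.

def ttt_discover(model, problem, steps=50, rollouts_per_step=64):
    best_score, best_artifact = float("-inf"), None
    for _ in range(steps):
        # 1. Attempt the problem many times with the current weights.
        attempts = [model.generate(problem) for _ in range(rollouts_per_step)]

        # 2. Score every attempt with a verifier that returns a scalar
        #    (e.g., negative runtime in microseconds, or negative error rate).
        scored = [(attempt, problem.verify(attempt)) for attempt in attempts]

        # 3. Keep the best artifact found so far; the artifact, not the
        #    final network, is the real output of the procedure.
        for attempt, score in scored:
            if score > best_score:
                best_score, best_artifact = score, attempt

        # 4. Update the model's weights on this problem's own attempts,
        #    so the next round of rollouts is specialized to it.
        model.train_step(scored)

    return best_artifact, best_score
```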

A different approach to reinforcement learning

TTT-Discover represents a fundamental shift in how reasoning models are trained. In standard reinforcement learning (RL) training, the goal is a generalist policy that performs well on average across many tasks. In TTT-Discover, the goal is to find the best solution to a very specific problem, and the policy is “a means towards this end,” according to the authors. Once the model discovers the artifact (i.e., the optimized code, the proof, or the molecule), the neural network that produced it can be discarded.

To achieve this, the researchers engineered two specific components that differentiate TTT-Discover from standard reinforcement learning:

  1. Entropic objective: Standard RL optimizes for the average expected reward. If a model tries a risky path and fails, standard RL punishes it. TTT-Discover flips this: it uses an "entropic objective" that exponentially weights high-reward outcomes. This forces the model to ignore "safe," average answers and aggressively hunt for "eureka" outliers, solutions that have a low probability of being found but offer a massive reward (see the sketch after this list).

  2. PUCT search: The system uses PUCT, a tree-search algorithm popularized by AlphaZero, to explore different solution paths and build a dataset of attempts. The model then trains on this dataset in real time, learning to recognize which partial steps lead to high-reward outcomes.
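
To make the first component concrete, the snippet below contrasts how a standard mean-reward objective and an exponentially weighted ("entropic") objective distribute training weight across a batch of rollouts. It is a minimal sketch with made-up rewards and an assumed temperature parameter, not the paper's exact formulation.

```python
import numpy as np

def mean_reward_weights(rewards):
    # Standard-RL flavor: each rollout's weight is proportional to its reward,
    # so a single outlier is diluted by the rest of the batch.
    r = np.asarray(rewards, dtype=float)
    return r / r.sum()

def entropic_weights(rewards, temperature=0.1):
    # Entropic flavor: exponentiate rewards before normalizing, so the
    # highest-reward ("eureka") rollouts dominate the training signal.
    r = np.asarray(rewards, dtype=float)
    z = (r - r.max()) / temperature   # subtract the max for numerical stability
    w = np.exp(z)
    return w / w.sum()

rewards = [0.10, 0.12, 0.11, 0.95]    # one rare, high-reward outlier
print(mean_reward_weights(rewards))   # outlier gets ~74% of the weight
print(entropic_weights(rewards))      # outlier gets ~99.9% of the weight
```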

Crucially, this method works best on problems with a continuous reward signal. The system needs a way to measure incremental progress such as "runtime in microseconds" or "error rate" rather than a binary "pass/fail" signal. This allows the model to follow the gradual improvement toward the optimal solution.
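
For intuition, a continuous verifier for the kernel-optimization case can simply time a candidate and return a score the search can climb, whereas a binary verifier stays at zero until the target is hit. The toy verifier below (timing a matrix-multiply candidate against a baseline) is an illustrative assumption, not part of the released system.

```python
import time
import numpy as np

def continuous_reward(candidate_fn, a, b, baseline_us):
    # Continuous signal: any speedup, however small, moves the score,
    # so the search can follow incremental progress toward the optimum.
    start = time.perf_counter()
    out = candidate_fn(a, b)
    runtime_us = (time.perf_counter() - start) * 1e6
    if not np.allclose(out, a @ b):    # incorrect results earn no reward
        return 0.0
    return baseline_us / runtime_us    # >1.0 means faster than the baseline

def binary_reward(candidate_fn, a, b, target_us):
    # Binary signal: flat at 0 until the target is reached, which gives
    # the search nothing to climb in between.
    return 1.0 if continuous_reward(candidate_fn, a, b, target_us) >= 1.0 else 0.0

# Example: score a trivial candidate against a 100-microsecond baseline.
a, b = np.random.rand(256, 256), np.random.rand(256, 256)
print(continuous_reward(np.matmul, a, b, baseline_us=100.0))
```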

The economics of 'heavy inference'

For enterprises accustomed to paying fractions of a cent per API call, the cost profile of TTT-Discover requires a mindset shift. In their experiments, the researchers reported that a single discovery run involves approximately 50 training steps and thousands of rollouts, costing roughly $500 per problem.

TTT-Discover is therefore best suited to “static, high-value assets,” as opposed to trivial, recurring problems that can be solved with existing models and approaches.

Consider a cloud-native enterprise running a data pipeline that processes petabytes of information nightly. If that pipeline relies on a specific SQL query or GPU kernel, optimizing that code by just 1% could save hundreds of thousands of dollars in annual compute costs. In this context, spending $500 to find a kernel that is 50% faster is a trivial expense with an immediate ROI.
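
A rough back-of-the-envelope calculation makes the point; the dollar figures and the kernel's share of total spend below are assumptions for illustration, not numbers from the paper.

```python
# Hypothetical numbers, for illustration only.
annual_gpu_spend = 20_000_000   # $20M/year of compute for the nightly pipeline
kernel_share = 0.05             # the target kernel accounts for 5% of that spend
cost_reduction = 0.50           # assume the discovered kernel is 2x faster, halving its cost
discovery_cost = 500            # one TTT-Discover run

annual_savings = annual_gpu_spend * kernel_share * cost_reduction
print(f"${annual_savings:,.0f} saved per year")          # $500,000 saved per year
print(f"{annual_savings / discovery_cost:,.0f}x return") # 1,000x return on the $500 run
```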

"This makes the most sense for low-frequency, high-impact decisions where a single improvement is worth far more than the compute cost," Yuksekgonul said. "Supply chain routing, drug design, and material discovery qualify. In these settings, spending hundreds of dollars on a single discovery step can easily pay for itself."

Implementation considerations

One of the most significant findings for enterprise adoption is that TTT-Discover does not require a proprietary frontier model. The researchers achieved state-of-the-art results using gpt-oss-120b, OpenAI’s open-weights model. They have also released the code for TTT-Discover so that other researchers and developers can use it with their own models.

Because the technique works with open models, companies can run this "discovery loop" entirely within their own secure VPCs or on-premise H100 clusters without sending their proprietary data to third-party servers.

“If a company already runs reinforcement learning, there is no additional infrastructure required,” Yuksekgonul said. “TTT-Discover uses the same training stack (GPUs, rollout workers, optimizers, checkpointing).” 

If they don’t already run RL, they would need to build that infrastructure. But enterprises can also use existing solutions to reduce the complexity of the process. The researchers orchestrated these training runs using Tinker, an API from Thinking Machines that manages the complexity of distributed training and inference.

“Tooling such as Tinker (and open variants, e.g., OpenTinker) lowers the setup cost, and both labor and compute costs are likely to drop over time,” he said.

Real-world use cases

The researchers deployed TTT-Discover across four distinct technical domains: systems engineering, algorithm design, biology, and mathematics. In almost every instance, the method set a new state-of-the-art.

In one experiment, the model optimized GPU kernels for matrix multiplication (including the "TriMul" kernel used in AlphaFold), achieving execution speeds up to 2x faster than prior state-of-the-art and outperforming the best human-written kernels on the leaderboard.

In competitive programming scenarios (AtCoder), it solved complex heuristic problems (e.g., optimizing geometric constraints for fishing nets) better than top human experts and prior AI baselines.

For the enterprise, the transition from these academic benchmarks to business value hinges on one specific constraint: the existence of a verifiable, scalar signal. Unlike a chatbot that generates text, TTT-Discover needs a hard metric (e.g., runtime, error rate, or profit margin) to optimize against.

Yuksekgonul said that this requirement draws a clear line between where this technology should and shouldn't be used. "At the moment, the key requirement is a reliable scalar signal of progress — cost, error, molecular properties — that the system can optimize against," he said.

This directs enterprise adoption toward "hard" engineering and operations challenges such as logistics, supply chain, and resource management, where problems like fleet routing or crew scheduling often rely on static heuristics. TTT-Discover can treat these as optimization environments, spending hours to find a route structure that shaves 5% off daily fuel costs.

The requirement for clear verifiers rules out qualitative tasks like "write a better marketing strategy," where verification is subjective and prone to noise.

"Hard to verify problems are still an open question,” Yuksekgonul said.

With current technology, the best path forward is to try to design verifiers, but “making those verifiers robust and hard to game is challenging, and we don’t have a good solution yet," he added.

From inference to invention

The broader implication is that enterprise AI stacks may need to evolve to support this kind of per-problem learning.

“Systems built around a frozen model will need to support per-problem (or per-domain) adaptation, and enterprises will need better problem specifications and internal feedback signals to make test-time learning effective,” Yuksekgonul said. “If training runs inside a private VPC, the training loop can also be integrated with more of the company’s internal environment, not just a central lab pipeline.”

For the enterprise, the value lies in identifying “million-dollar problems”: optimization challenges where a verifiable metric exists but human progress has stalled. These are the candidates for TTT-Discover. By accepting higher latency and cost for specific queries, enterprises can turn their inference compute into an automated R&D lab, discovering solutions that were previously out of reach for both humans and frozen AI models.


