Thursday, April 2, 2026
mGrowTech

Nvidia’s new technique cuts LLM reasoning costs by 8x without losing accuracy

by Josh
February 15, 2026
in Technology And Software



Researchers at Nvidia have developed a technique that can reduce the memory costs of large language model reasoning by up to eight times. Their technique, called dynamic memory sparsification (DMS), compresses the key-value (KV) cache, the temporary memory LLMs generate and store as they process prompts and reason through problems and documents.

While researchers have proposed various methods to compress this cache before, most struggle to do so without degrading the model's intelligence. Nvidia's approach manages to discard much of the cache while maintaining (and in some cases improving) the model's reasoning capabilities.

Experiments show that DMS enables LLMs to "think" longer and explore more solutions without the usual penalty in speed or memory costs.

The bottleneck of reasoning

LLMs improve their performance on complex tasks by generating "chain-of-thought" tokens, essentially writing out their reasoning steps before arriving at a final answer. Inference-time scaling techniques leverage this by giving the model a larger budget to generate these thinking tokens or to explore multiple potential reasoning paths in parallel.

However, this improved reasoning comes with a significant computational cost. As the model generates more tokens, it builds up a KV cache.

For real-world applications, the KV cache is a major bottleneck. As the reasoning chain grows, the cache grows linearly with the number of tokens, consuming vast amounts of memory on GPUs. This forces the hardware to spend more time reading data from memory than actually computing, which slows down generation and increases latency. It also caps the number of users a system can serve simultaneously, as running out of VRAM causes the system to crash or slow to a crawl.
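To make the linear growth concrete, here is a back-of-the-envelope sketch of KV cache size. The formula is the usual estimate for a decoder-only transformer; the model shape (32 layers, 8 KV heads, head dimension 128, 16-bit values) is a hypothetical assumption for illustration, not a figure from the article, and the 8x column simply divides by the headline compression ratio.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Approximate KV cache size for a decoder-only transformer."""
    # Factor 2: every layer stores one key and one value vector
    # per token and per KV head.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


# Hypothetical model shape (an assumption, not taken from the article).
MODEL = dict(num_layers=32, num_kv_heads=8, head_dim=128)

for seq_len in (1_000, 10_000, 100_000):
    full = kv_cache_bytes(seq_len=seq_len, **MODEL)
    print(f"{seq_len:>7} tokens: {full / 2**30:6.2f} GiB uncompressed, "
          f"{full / 8 / 2**30:6.2f} GiB at 8x compression")
```

Under these assumptions each token costs 128 KiB of cache, so a chain ten times longer needs ten times the memory, for every sequence being served.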

Nvidia researchers frame this not just as a technical hurdle, but as a fundamental economic one for the enterprise.

"The question isn't just about hardware quantity; it's about whether your infrastructure is processing 100 reasoning threads or 800 threads for the same cost," Piotr Nawrot, Senior Deep Learning Engineer at Nvidia, told VentureBeat.
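The "100 or 800 threads" framing can be reproduced with simple arithmetic. The sketch below assumes a fixed memory budget reserved for the KV cache and ignores model weights, activations, and fragmentation. All three constants are hypothetical values chosen so the numbers come out round; they are not measurements from Nvidia.

```python
def max_concurrent_threads(budget_bytes, bytes_per_token,
                           tokens_per_thread, compression_ratio=1):
    """How many reasoning threads fit in a fixed KV cache budget."""
    # A cache compressed by `compression_ratio` behaves like a budget
    # that is `compression_ratio` times larger.
    per_thread = bytes_per_token * tokens_per_thread
    return (budget_bytes * compression_ratio) // per_thread


# Hypothetical values, for illustration only.
BUDGET = 50 * 2**30           # 50 GiB of GPU memory reserved for KV cache
BYTES_PER_TOKEN = 128 * 2**10  # 128 KiB of cache per token
TOKENS_PER_THREAD = 4096       # length of each reasoning chain

baseline = max_concurrent_threads(BUDGET, BYTES_PER_TOKEN, TOKENS_PER_THREAD)
compressed = max_concurrent_threads(BUDGET, BYTES_PER_TOKEN,
                                    TOKENS_PER_THREAD, compression_ratio=8)
print(baseline, compressed)
```

The same budget could instead be spent on fewer, longer chains; the compression ratio multiplies whichever dimension the operator chooses to scale.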

Previous attempts to solve this focused on heuristics-based approaches. These methods use rigid rules, such as a "sliding window" that only caches the most recent tokens and deletes the rest. While this reduces memory usage, it often forces the model to discard critical information required for solving the problem, degrading the accuracy of the output.
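A sliding window is easy to sketch, which also makes its weakness easy to see. The toy cache below is an illustration of the general heuristic, not any specific library's implementation; the class name and the string placeholders standing in for key/value tensors are hypothetical.

```python
from collections import deque


class SlidingWindowCache:
    """Toy KV cache that keeps only the most recent `window` tokens."""

    def __init__(self, window):
        # A deque with maxlen silently drops the oldest entry when full.
        self.entries = deque(maxlen=window)

    def append(self, position, kv):
        self.entries.append((position, kv))

    def positions(self):
        return [pos for pos, _ in self.entries]


cache = SlidingWindowCache(window=4)
for pos in range(10):
    cache.append(pos, f"kv{pos}")  # placeholder for real key/value tensors

print(cache.positions())  # [6, 7, 8, 9]
```

Positions 0 through 5 are gone regardless of their content. If the problem statement sat at the start of the prompt, the model can no longer attend to it.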

"Standard eviction methods attempt to select old and unused tokens for eviction using heuristics," the researchers said. "They simplify the problem, hoping that if they approximate the model's internal mechanics, the answer will remain correct."

Other solutions use paging to offload the unused parts of the KV cache to slower memory, but the constant swapping of data introduces latency overhead that makes real-time applications sluggish.

Dynamic memory sparsification

DMS takes a different approach by "retrofitting" existing LLMs to intelligently manage their own memory. Rather than applying a fixed rule for what to delete, DMS trains the model to identify which tokens are essential for future reasoning and which are disposable.

"It doesn't just guess importance; it learns a policy that explicitly preserves the model's final output distribution," Nawrot said.

The process transforms a standard, pre-trained LLM such as Llama 3 or Qwen 3 into a self-compressing model. Crucially, this does not require training the model from scratch, which would be prohibitively expensive. Instead, DMS repurposes existing neurons within the model’s attention layers to output a "keep" or "evict" signal for each token.

For teams worried about the complexity of retrofitting, the researchers noted that the process is designed to be lightweight. "To improve the efficiency of this process, the model's weights can be frozen, which makes the process similar to Low-Rank Adaptation (LoRA)," Nawrot said. This means a standard enterprise model like Qwen3-8B "can be retrofitted with DMS within hours on a single DGX H100."

One of the important parts of DMS is a mechanism called "delayed eviction." In standard sparsification, if a token is deemed unimportant, it is deleted immediately. This is risky because the model might need a split second to integrate that token's context into its current state.

DMS mitigates this by flagging a token for eviction but keeping it accessible for a short window of time (e.g., a few hundred steps). This delay allows the model to "extract" any remaining necessary information from the token and merge it into the current context before the token is wiped from the KV cache.

“The ‘delayed eviction’ mechanism is crucial because not all tokens are simply ‘important’ (keep forever) or ‘useless’ (delete immediately). Many fall in between — they carry some information, but not enough to justify occupying an entire slot in memory,” Nawrot said. “This is where the redundancy lies. By keeping these tokens in a local window for a short time before eviction, we allow the model to attend to them and redistribute their information into future tokens.”
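The bookkeeping behind delayed eviction can be sketched in a few lines. This is a toy model under stated assumptions, not Nvidia's implementation: in DMS the keep/evict signal comes from the retrofitted model itself, whereas here it is a hand-written list of booleans, and the delay is 3 steps rather than the few hundred mentioned above. The class and variable names are hypothetical.

```python
class DelayedEvictionCache:
    """Toy cache: tokens flagged for eviction stay visible for `delay` more steps."""

    def __init__(self, delay):
        self.delay = delay
        self.deadlines = {}  # position -> last step at which it is visible (None = keep)

    def step(self, position, evict):
        # Record the new token with its eviction deadline, if any.
        self.deadlines[position] = position + self.delay if evict else None
        # Drop tokens whose deadline has passed.
        expired = [p for p, d in self.deadlines.items()
                   if d is not None and d < position]
        for p in expired:
            del self.deadlines[p]
        # Positions the current token can still attend to.
        return sorted(self.deadlines)


# Hypothetical evict decisions for positions 0..7 (True = flag for eviction).
flags = [False, True, True, False, True, False, False, False]

cache = DelayedEvictionCache(delay=3)
history = [cache.step(pos, evict) for pos, evict in enumerate(flags)]
for pos, visible in enumerate(history):
    print(pos, visible)
```

Token 1 is flagged at step 1 but stays visible through step 4, so tokens 2, 3, and 4 can still attend to it and absorb what it carries. It disappears at step 5. Setting `delay=0` gives immediate eviction: a flagged token is gone by the very next step.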

The researchers found that this retrofitting process is highly efficient. They could equip a pre-trained LLM with DMS in just 1,000 training steps, a tiny fraction of the compute required for the original training. The resulting models use standard kernels and can drop directly into existing high-performance inference stacks without custom hardware or complex software rewriting.

DMS in action

To validate the technique, the researchers applied DMS to several reasoning models, including the Qwen-R1 series (distilled from DeepSeek R1) and Llama 3.2, and tested them on difficult benchmarks like AIME 24 (math), GPQA Diamond (science), and LiveCodeBench (coding).

The results show that DMS effectively moves the Pareto frontier, the optimal trade-off between cost and performance. On the AIME 24 math benchmark, a Qwen-R1 32B model equipped with DMS achieved a score 12.0 points higher than a standard model when constrained to the same memory bandwidth budget. By compressing the cache, the model could afford to "think" much deeper and wider than the standard model could for the same memory and compute budget.

Perhaps most surprisingly, DMS defied the common wisdom that compression hurts long-context understanding. In "needle-in-a-haystack" tests, which measure a model's ability to find a specific piece of information buried in a large document, DMS variants actually outperformed the standard models. By actively managing its memory rather than passively accumulating noise, the model maintained a cleaner, more useful context.

For enterprise infrastructure, the efficiency gains translate directly to throughput and hardware savings. Because the memory cache is significantly smaller, the GPU spends less time fetching data, reducing the wait time for users. In tests with the Qwen3-8B model, DMS matched the accuracy of the vanilla model while delivering up to 5x higher throughput. This means a single server can handle five times as many customer queries per second without a drop in quality.

The future of memory

Nvidia has released DMS as part of its Model Optimizer framework. Regarding how enterprises can get started with DMS, Nawrot emphasized that the barrier to entry is low. "The 'minimum viable infrastructure' is standard Hugging Face pipelines — no custom CUDA kernels are required," Nawrot said, noting that the code is fully compatible with standard FlashAttention. 

Looking ahead, the team views DMS as part of a larger shift where memory management becomes a distinct, intelligent layer of the AI stack. Nawrot also confirmed that DMS is "fully compatible" with newer architectures like the Multi-Head Latent Attention (MLA) used in DeepSeek’s models, suggesting that combining these approaches could yield even greater efficiency gains.

As enterprises move from simple chatbots to complex agentic systems that require extended reasoning, the cost of inference is becoming a primary concern. Techniques like DMS provide a path to scale these capabilities sustainably.

"We’ve barely scratched the surface of what is possible," Nawrot said, "and we expect inference-time scaling to further evolve."


