• About Us
  • Disclaimer
  • Contact Us
  • Privacy Policy
Saturday, March 28, 2026
mGrowTech
No Result
View All Result
  • Technology And Software
    • Account Based Marketing
    • Channel Marketing
    • Marketing Automation
      • Al, Analytics and Automation
      • Ad Management
  • Digital Marketing
    • Social Media Management
    • Google Marketing
  • Direct Marketing
    • Brand Management
    • Marketing Attribution and Consulting
  • Mobile Marketing
  • Event Management
  • PR Solutions
  • Technology And Software
    • Account Based Marketing
    • Channel Marketing
    • Marketing Automation
      • Al, Analytics and Automation
      • Ad Management
  • Digital Marketing
    • Social Media Management
    • Google Marketing
  • Direct Marketing
    • Brand Management
    • Marketing Attribution and Consulting
  • Mobile Marketing
  • Event Management
  • PR Solutions
No Result
View All Result
mGrowTech
No Result
View All Result
Home Technology And Software

IndexCache, a new sparse attention optimizer, delivers 1.82x faster inference on long-context AI models

Josh by Josh
March 28, 2026
in Technology And Software
0
IndexCache, a new sparse attention optimizer, delivers 1.82x faster inference on long-context AI models



Processing 200,000 tokens through a large language model is expensive and slow: the longer the context, the faster the costs spiral. Researchers at Tsinghua University and Z.ai have built a technique called IndexCache that cuts up to 75% of the redundant computation in sparse attention models, delivering up to 1.82x faster time-to-first-token and 1.48x faster generation throughput at that context length.

READ ALSO

Why Gmail’s AI-generated email replies feel so creepy

Verizon waives late fees for federal workers affected by partial DHS shutdown

The technique applies to models using the DeepSeek Sparse Attention architecture, including the latest DeepSeek and GLM families. It can help enterprises provide faster user experiences for production-scale, long-context models, a capability already proven in preliminary tests on the 744-billion-parameter GLM-5 model.

The DSA bottleneck

Large language models rely on the self-attention mechanism, a process where the model computes the relationship between every token in its context and all the preceding ones to predict the next token.

However, self-attention has a severe limitation. Its computational complexity scales quadratically with sequence length. For applications requiring extended context windows (e.g., large document processing, multi-step agentic workflows, or long chain-of-thought reasoning), this quadratic scaling leads to sluggish inference speeds and significant compute and memory costs.

Sparse attention offers a principled solution to this scaling problem. Instead of calculating the relationship between every token and all preceding ones, sparse attention optimizes the process by having each query select and attend to only the most relevant subset of tokens.

DeepSeek Sparse Attention (DSA) is a highly efficient implementation of this concept, first introduced in DeepSeek-V3.2. To determine which tokens matter most, DSA introduces a lightweight "lightning indexer module" at every layer of the model. This indexer scores all preceding tokens and selects a small batch for the main core attention mechanism to process. By doing this, DSA slashes the heavy core attention computation from quadratic to linear, dramatically speeding up the model while preserving output quality.

But the researchers identified a lingering flaw: the DSA indexer itself still operates at a quadratic complexity at every single layer. Even though the indexer is computationally cheaper than the main attention process, as context lengths grow, the time the model spends running these indexers skyrockets. This severely slows down the model, especially during the initial "prefill" stage where the prompt is first processed.

Caching attention with IndexCache

To solve the indexer bottleneck, the research team discovered a crucial characteristic of how DSA models process data. The subset of important tokens an indexer selects remains remarkably stable as data moves through consecutive transformer layers. Empirical tests on DSA models revealed that adjacent layers share between 70% and 100% of their selected tokens.

To capitalize on this cross-layer redundancy, the researchers developed IndexCache. The technique partitions the model’s layers into two categories. A small number of full (F) layers retain their indexers, actively scoring the tokens and choosing the most important ones to cache. The rest of the layers become shared (S), performing no indexing and reusing the cached indices from the nearest preceding F layer.

During inference, the model simply checks the layer type. If it reaches an F layer, it calculates and caches fresh indices. If it is an S layer, it skips the math and copies the cached data.

There is a wide range of optimization techniques that try to address the attention bottleneck by compressing the KV cache, where the computed attention values are stored. Instead of shrinking the memory footprint like standard KV cache compression, IndexCache attacks the compute bottleneck. 

“IndexCache is not a traditional KV cache compression or sharing technique,” Yushi Bai, co-author of the paper, told VentureBeat. “It eliminates this redundancy by reusing indices across layers, thereby reducing computation rather than just memory footprint. It is complementary to existing approaches and can be combined with them.”

The researchers developed two deployment approaches for IndexCache. (It is worth noting that IndexCache only applies to models that use the DSA architecture, such as the latest DeepSeek models and the latest family of GLM models.)

For developers working with off-the-shelf DSA models where retraining is unfeasible or too expensive, they created a training-free method relying on a “greedy layer selection” algorithm. By running a small calibration dataset through the model, this algorithm automatically determines the optimal placement of F and S layers without any weight updates. Empirical evidence shows that the greedy algorithm can safely remove 75% of the indexers while matching the downstream performance of the original model.

For teams pre-training or heavily fine-tuning their own foundation models, the researchers propose a training-aware version that optimizes the network parameters to natively support cross-layer sharing. This approach introduces a “multi-layer distillation loss” during training. It forces each retained indexer to learn how to select a consensus subset of tokens that will be highly relevant for all the subsequent layers it serves.

Real-world speedups on production models

To test the impact of IndexCache, the researchers applied it to the 30-billion-parameter GLM-4.7 Flash model and compared it against the standard baseline.

At a 200K context length, removing 75% of the indexers slashed the prefill latency from 19.5 seconds down to just 10.7 seconds, delivering a 1.82x speedup. The researchers note these speedups are expected to be even greater in longer contexts.

During the decoding phase, where the model generates its response, IndexCache boosted per-request throughput from 58 tokens per second to 86 tokens per second at the 200K context mark, yielding a 1.48x speedup. When the server's memory is fully saturated with requests, total decode throughput jumped by up to 51%.

For enterprise teams, these efficiency gains translate directly into cost savings. “In terms of ROI, IndexCache provides consistent benefits across scenarios, but the gains are most noticeable in long-context workloads such as RAG, document analysis, and agentic pipelines,” Bai said. “In these cases, we observe at least an approximate 20% reduction in deployment cost and similar improvements in user-perceived latency.” He added that for very short-context tasks, the benefits hover around 5%.

Remarkably, these efficiency gains did not compromise reasoning capabilities. Using the training-free approach to eliminate 75% of indexers, the 30B model matched the original baseline's average score on long-context benchmarks, scoring 49.9 against the original 50.2. On the highly complex AIME 2025 math reasoning benchmark, the optimized model actually outperformed the original baseline, scoring 92.6 compared to 91.0.

The team also ran preliminary experiments on the production-scale 744-billion-parameter GLM-5 model. They found that eliminating 75% of its indexers with the training-free method yielded at least a 1.3x speedup on contexts over 100K tokens. At the same time, the model maintained a nearly identical quality average on long-context tasks.

Getting IndexCache into production

For development teams wanting to implement the training-free approach today, the process is straightforward but requires careful setup. While the greedy search algorithm automatically finds the optimal layer configuration, the quality of that configuration depends on the data it processes.

“We recommend using domain-specific data as a calibration set so that the discovered layer-sharing pattern aligns with real workloads,” Bai said.

Once calibrated, the optimization is highly accessible for production environments. Open-source patches are already available on GitHub for major serving engines. “Integration is relatively straightforward — developers can apply the patch to existing inference stacks, such as vLLM or SGLang, and enable IndexCache with minimal configuration changes,” Bai said.

While IndexCache provides an immediate fix for today’s compute bottlenecks, its underlying philosophy points to a broader shift in how the AI industry will approach model design.

“Future foundation models will likely be architected with downstream inference constraints in mind from the beginning,” Bai concluded. “This means designs that are not only scalable in terms of model size, but also optimized for real-world throughput and latency, rather than treating these as post-hoc concerns.”



Source_link

Related Posts

Why Gmail’s AI-generated email replies feel so creepy
Technology And Software

Why Gmail’s AI-generated email replies feel so creepy

March 28, 2026
Verizon waives late fees for federal workers affected by partial DHS shutdown
Technology And Software

Verizon waives late fees for federal workers affected by partial DHS shutdown

March 28, 2026
AI Research Is Getting Harder to Separate From Geopolitics
Technology And Software

AI Research Is Getting Harder to Separate From Geopolitics

March 28, 2026
Why SoftBank’s new $40B loan points to a 2026 OpenAI IPO
Technology And Software

Why SoftBank’s new $40B loan points to a 2026 OpenAI IPO

March 27, 2026
Intercom's new post-trained Fin Apex 1.0 beats GPT-5.4 and Claude Sonnet 4.6 at customer service resolutions
Technology And Software

Intercom's new post-trained Fin Apex 1.0 beats GPT-5.4 and Claude Sonnet 4.6 at customer service resolutions

March 27, 2026
The Supreme Court is scared it’s going to break the internet
Technology And Software

The Supreme Court is scared it’s going to break the internet

March 27, 2026
Next Post
How Evidence Matters in a Birth Injury Case

How Evidence Matters in a Birth Injury Case

POPULAR NEWS

Trump ends trade talks with Canada over a digital services tax

Trump ends trade talks with Canada over a digital services tax

June 28, 2025
Communication Effectiveness Skills For Business Leaders

Communication Effectiveness Skills For Business Leaders

June 10, 2025
15 Trending Songs on TikTok in 2025 (+ How to Use Them)

15 Trending Songs on TikTok in 2025 (+ How to Use Them)

June 18, 2025
App Development Cost in Singapore: Pricing Breakdown & Insights

App Development Cost in Singapore: Pricing Breakdown & Insights

June 22, 2025
Comparing the Top 7 Large Language Models LLMs/Systems for Coding in 2025

Comparing the Top 7 Large Language Models LLMs/Systems for Coding in 2025

November 4, 2025

EDITOR'S PICK

How to Create a Pinterest Ad: Step-by-Step Guide for 2025

May 28, 2025
Best Employee Communications Software: My Top 5 Picks

Best Employee Communications Software: My Top 5 Picks

June 20, 2025
Recognizing the Signs of Narcissism On Social Media

Recognizing the Signs of Narcissism On Social Media

June 12, 2025
Sam Altman warns there’s no legal confidentiality when using ChatGPT as a therapist

Sam Altman warns there’s no legal confidentiality when using ChatGPT as a therapist

July 25, 2025

About

We bring you the best Premium WordPress Themes that perfect for news, magazine, personal blog, etc. Check our landing page for details.

Follow us

Categories

  • Account Based Marketing
  • Ad Management
  • Al, Analytics and Automation
  • Brand Management
  • Channel Marketing
  • Digital Marketing
  • Direct Marketing
  • Event Management
  • Google Marketing
  • Marketing Attribution and Consulting
  • Marketing Automation
  • Mobile Marketing
  • PR Solutions
  • Social Media Management
  • Technology And Software
  • Uncategorized

Recent Posts

  • How Evidence Matters in a Birth Injury Case
  • IndexCache, a new sparse attention optimizer, delivers 1.82x faster inference on long-context AI models
  • Augmenting citizen science with computer vision for fish monitoring | MIT News
  • LL COOL J and Google’s James Manyika discuss AI and music
  • About Us
  • Disclaimer
  • Contact Us
  • Privacy Policy
No Result
View All Result
  • Technology And Software
    • Account Based Marketing
    • Channel Marketing
    • Marketing Automation
      • Al, Analytics and Automation
      • Ad Management
  • Digital Marketing
    • Social Media Management
    • Google Marketing
  • Direct Marketing
    • Brand Management
    • Marketing Attribution and Consulting
  • Mobile Marketing
  • Event Management
  • PR Solutions