mGrowTech
Wednesday, March 18, 2026

Moonshot AI Releases Attention Residuals to Replace Fixed Residual Mixing with Depth-Wise Attention for Better Scaling in Transformers

By Josh
March 16, 2026
in AI, Analytics and Automation
Residual connections are one of the least questioned parts of modern Transformer design. In PreNorm architectures, each layer adds its output back into a running hidden state, which keeps optimization stable and allows deep models to train. Moonshot AI researchers argue that this standard mechanism also introduces a structural problem: all prior layer outputs are accumulated with fixed unit weights, which causes hidden-state magnitude to grow with depth and progressively weakens the contribution of any single layer.

The research team proposes Attention Residuals (AttnRes) as a drop-in replacement for standard residual accumulation. Instead of forcing every layer to consume the same uniformly mixed residual stream, AttnRes lets each layer aggregate earlier representations using softmax attention over depth. The input to layer l is a weighted sum of the token embedding and the previous layer outputs, where the weights are computed over prior depth positions rather than over sequence positions. The core idea is simple: if attention improved sequence modeling by replacing fixed recurrence over time, the same idea can be applied to the depth dimension of a network.
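The contrast can be sketched in a few lines of numpy. This is a toy illustration of the idea, not Moonshot's implementation: a standard PreNorm residual stream adds every prior layer output with a fixed unit weight, while depth-wise attention mixes the same sources with learned softmax weights (the random scores here stand in for learned query/key scores).

```python
import numpy as np

rng = np.random.default_rng(0)
d, L = 8, 6                                     # toy hidden size and depth
x0 = rng.normal(size=d)                         # token embedding
outs = [rng.normal(size=d) for _ in range(L)]   # hypothetical layer outputs

# Standard PreNorm residual stream: every prior output enters with a
# fixed unit weight, so the hidden-state norm tends to grow with depth.
standard_input = x0 + sum(outs)

# Depth-wise attention (the AttnRes idea, sketched): the next layer mixes
# the same depth sources with softmax weights instead of unit weights.
scores = rng.normal(size=L + 1)                 # stand-in for learned scores
w = np.exp(scores - scores.max())
w = w / w.sum()                                 # softmax over depth
attn_input = w @ np.stack([x0] + outs)          # convex combination, (d,)
```

Because the attention weights form a convex combination, the mixed input's norm is bounded by the largest source norm, unlike the plain accumulated sum.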

Repo: https://github.com/MoonshotAI/Attention-Residuals/tree/master?tab=readme-ov-file

Why Standard Residuals Become a Bottleneck

The research team identified three issues with standard residual accumulation. First, there is no selective access: all layers receive the same aggregated state even though attention layers and feed-forward or MoE layers may benefit from different mixtures of earlier information. Second, there is irreversible loss: once information is blended into a single residual stream, later layers cannot selectively recover specific earlier representations. Third, there is output growth: deeper layers tend to produce larger outputs to remain influential inside an ever-growing accumulated state, which can destabilize training.

This is the research team’s main framing: standard residuals behave like a compressed recurrence over layers. AttnRes replaces that fixed recurrence with explicit attention over previous layer outputs.

Full AttnRes: Attention Over All Previous Layers

In Full AttnRes, each layer computes attention weights over all preceding depth sources. The default design does not use an input-conditioned query. Instead, each layer has a learned layer-specific pseudo-query vector w_l ∈ R^d, while keys and values come from the token embedding and previous layer outputs after RMSNorm. The RMSNorm step is important because it prevents large-magnitude layer outputs from dominating the depth-wise attention weights.
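The per-layer mixing can be sketched as follows. This follows the paper's description rather than Moonshot's code; function and variable names are illustrative. Note that a zero-initialized pseudo-query yields uniform weights, i.e. equal-weight averaging of the normalized sources:

```python
import numpy as np

def rmsnorm(x, eps=1e-6):
    # Normalize each source so large-magnitude layer outputs don't dominate.
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def full_attnres_input(pseudo_query, embedding, layer_outputs):
    """Input to layer l under Full AttnRes (sketch): a learned per-layer
    pseudo-query attends over RMSNorm'ed depth sources."""
    sources = np.stack([embedding] + list(layer_outputs))  # (l+1, d)
    normed = rmsnorm(sources)
    scores = normed @ pseudo_query        # depth-wise attention logits, (l+1,)
    w = np.exp(scores - scores.max())
    w = w / w.sum()                       # softmax over depth, not sequence
    return w @ normed                     # weighted mix of normalized sources

rng = np.random.default_rng(1)
d = 16
emb = rng.normal(size=d)
outs = [rng.normal(size=d) * (i + 1) for i in range(4)]  # growing magnitudes

# Zero pseudo-query -> zero logits -> uniform softmax -> equal-weight average.
mixed_init = full_attnres_input(np.zeros(d), emb, outs)
uniform = rmsnorm(np.stack([emb] + outs)).mean(axis=0)
```

The last two lines mirror the initialization detail the paper describes: starting from zeroed pseudo-queries, AttnRes begins as plain averaging over depth and only later learns selective mixtures.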

Full AttnRes is straightforward, but it increases cost. Per token, it requires O(L^2 d) arithmetic and O(L d) memory to store layer outputs. In standard training this memory largely overlaps with activations already needed for backpropagation, but under activation recomputation and pipeline parallelism the overhead becomes more significant, because those earlier outputs must remain available and may need to be transmitted across stages.

Block AttnRes: A Practical Variant for Large Models

To make the method usable at scale, the Moonshot AI research team introduces Block AttnRes. Instead of attending over every earlier layer output, the model partitions layers into N blocks. Within each block, outputs are accumulated into a single block representation, and attention is applied only over those block-level representations plus the token embedding. This reduces memory and communication overhead from O(L d) to O(N d).
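The bookkeeping can be sketched as below; this is an assumption-laden illustration of the block partitioning, not Moonshot's implementation. Attention then runs over the embedding plus the N block representations instead of all L layer outputs:

```python
import numpy as np

def block_attnres_sources(embedding, layer_outputs, num_blocks):
    """Collapse L per-layer outputs into num_blocks block representations
    (accumulated within each block), then prepend the token embedding."""
    blocks = np.array_split(np.stack(layer_outputs), num_blocks)
    block_reprs = [b.sum(axis=0) for b in blocks]  # accumulate within block
    return np.stack([embedding] + block_reprs)     # (N + 1, d) sources

rng = np.random.default_rng(2)
d, L, N = 8, 24, 8
emb = rng.normal(size=d)
outs = [rng.normal(size=d) for _ in range(L)]

sources = block_attnres_sources(emb, outs, N)
# The depth cache shrinks from L vectors to N vectors per token:
full_cache, block_cache = L * d, N * d
```

With 24 layers and 8 blocks, each block representation is the accumulated output of 3 consecutive layers, and the per-token depth cache drops from 24 vectors to 8.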

The research team describes cache-based pipeline communication and a two-phase computation strategy that make Block AttnRes practical in distributed training and inference. This results in less than 4% training overhead under pipeline parallelism, while the repository reports less than 2% inference latency overhead on typical workloads.

Scaling Results

The research team evaluates five model sizes and compares three variants at each size: a PreNorm baseline, Full AttnRes, and Block AttnRes with about eight blocks. All variants within each size group share the same hyperparameters chosen under the baseline, which the research team notes makes the comparison conservative. The fitted scaling laws are reported as:

Baseline: L = 1.891 · C^(-0.057)
Block AttnRes: L = 1.870 · C^(-0.058)
Full AttnRes: L = 1.865 · C^(-0.057)

The practical implication is that AttnRes achieves lower validation loss across the tested compute range, and Block AttnRes matches the loss of a baseline trained with about 1.25× more compute.
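As a quick sanity check, the fitted laws can be evaluated directly; at an illustrative compute value, both AttnRes curves sit below the baseline. The compute value and variable names here are illustrative, and the exact compute units follow the paper's fit:

```python
# Fitted scaling laws as reported above: validation loss L as a
# function of training compute C, L = a * C**b.
laws = {
    "baseline":      (1.891, -0.057),
    "block_attnres": (1.870, -0.058),
    "full_attnres":  (1.865, -0.057),
}

def predicted_loss(variant, C):
    a, b = laws[variant]
    return a * C ** b

C = 1e20  # illustrative compute budget, not a value stated in the article
losses = {name: predicted_loss(name, C) for name in laws}
```

Full AttnRes beats the baseline at any positive compute because it shares the same exponent with a smaller coefficient; Block AttnRes has a slightly steeper exponent, so its advantage grows with compute.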

Integration into Kimi Linear

Moonshot AI also integrates AttnRes into Kimi Linear, its MoE architecture with 48B total parameters and 3B activated parameters, and pre-trains it on 1.4T tokens. According to the research paper, AttnRes mitigates PreNorm dilution by keeping output magnitudes more bounded across depth and distributing gradients more uniformly across layers. Another implementation detail is that all pseudo-query vectors are initialized to zero so the initial attention weights are uniform across source layers, effectively reducing AttnRes to equal-weight averaging at the start of training and avoiding early instability.

On downstream evaluation, the reported gains are consistent across all listed tasks:

  • MMLU: 73.5 → 74.6
  • GPQA-Diamond: 36.9 → 44.4
  • BBH: 76.3 → 78.0
  • Math: 53.5 → 57.1
  • HumanEval: 59.1 → 62.2
  • MBPP: 72.0 → 73.9
  • CMMLU: 82.0 → 82.9
  • C-Eval: 79.6 → 82.5

Key Takeaways

  • Attention Residuals replaces fixed residual accumulation with softmax attention over previous layers.
  • The default AttnRes design uses a learned layer-specific pseudo-query, not an input-conditioned query.
  • Block AttnRes makes the method practical by reducing depth-wise memory and communication from O(Ld) to O(Nd).
  • The Moonshot research team reports lower scaling loss than the PreNorm baseline, with Block AttnRes matching about 1.25× more baseline compute.
  • In Kimi Linear, AttnRes improves results across reasoning, coding, and evaluation benchmarks with limited overhead.

Check out the Paper and the Repo.



