Tuesday, May 12, 2026
mGrowTech

Tilde Research Introduces Aurora: A Leverage-Aware Optimizer That Fixes a Hidden Neuron Death Problem in Muon

by Josh
May 12, 2026
in AI, Analytics and Automation


Researchers at Tilde Research have released Aurora, a new optimizer for training neural networks that addresses a structural flaw in the widely used Muon optimizer. The flaw quietly kills off a significant fraction of MLP neurons during training and keeps them permanently dead. Aurora comes with a 1.1B-parameter pretraining experiment, a new state-of-the-art result on the modded-nanoGPT speedrun benchmark, and open-source code.

What is Muon?

To understand Aurora, it helps to first understand Muon. The Muon optimizer attracted attention in the ML community after outperforming AdamW in wall-clock time to convergence on the nanoGPT speedrun competition — a community benchmark that measures how fast you can train a GPT-style model to a target validation loss. Since then, Muon has been adopted in frontier-scale model training by several research groups.

Muon’s key algorithmic step is computing the polar factor of the gradient matrix. For a gradient matrix G with thin Singular Value Decomposition (SVD) G = UΣVᵀ, Muon computes polar(G) = UVᵀ, which is the closest semi-orthogonal matrix to G in the Frobenius norm. This orthogonalized gradient is then used to update the weights: W ← W − η UVᵀ for a learning rate η. The use of matmul-only iterative algorithms to compute the polar factor is what makes Muon practical at scale.
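As a rough illustration of the step described above, the sketch below computes the polar factor in two ways: exactly via the thin SVD, and approximately with a matmul-only iteration. The iteration shown is the classical cubic Newton-Schulz scheme, chosen here for simplicity; it is not claimed to match the coefficients or iteration count used in any particular Muon implementation. The function names, matrix sizes, and learning rate are assumptions made for this example.

```python
import numpy as np

def polar_svd(G):
    """Exact polar factor U V^T from the thin SVD G = U diag(s) V^T."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

def polar_newton_schulz(G, steps=30):
    """Matmul-only approximation of the polar factor (classical cubic iteration)."""
    # Dividing by the Frobenius norm puts every singular value in (0, 1],
    # which is inside the convergence region of the cubic iteration.
    X = G / np.linalg.norm(G)
    for _ in range(steps):
        # Each step pushes every singular value s toward 1 via s -> 1.5 s - 0.5 s^3.
        X = 1.5 * X - 0.5 * X @ (X.T @ X)
    return X

rng = np.random.default_rng(0)
G = rng.standard_normal((8, 4))   # illustrative tall gradient: 8 rows, 4 columns
P = polar_svd(G)
P_ns = polar_newton_schulz(G)

# Weight update W <- W - eta * polar(G), with an illustrative learning rate.
W = rng.standard_normal((8, 4))
eta = 0.02
W_new = W - eta * P
```

Both routes should agree up to numerical tolerance, and the resulting matrix has orthonormal columns because G is tall.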

The NorMuon Puzzle: Row Normalization Helps, But Why?

Before Aurora, NorMuon led the modded-nanoGPT speedrun. It introduced a row-normalization step, similar to Adam’s per-parameter scaling, that rescales each row of the polar factor by the inverse of its RMS norm. While this often pulls the update away from a strictly orthogonal matrix, NorMuon still yields impressive results. The Tilde team set out to understand exactly what gap in Muon’s formulation NorMuon was addressing.

The Core Problem: Row-Norm Anisotropy and Neuron Death in Tall Matrices

The research team discovered that the Muon optimizer unintentionally “kills” a large portion of neurons in tall weight matrices (more rows than columns), such as the up and gate projections in SwiGLU-based MLP layers. A tall matrix can have orthonormal columns but not orthonormal rows, so orthogonalization alone does not force its rows to have equal norms. The polar factor therefore inherits the row-norm imbalance of the gradient, and the optimizer ends up giving massive updates to some neurons while virtually ignoring others. This results in a “death spiral” where under-performing neurons receive less signal over time, eventually becoming permanently inactive.
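The imbalance can be reproduced on a toy example. The sketch below builds a tall random gradient in which half of the rows (standing in for neurons) are artificially scaled down, then inspects the row norms of its polar factor. The sizes and the 0.01 scaling factor are hypothetical choices for illustration, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 16, 4                         # tall: 16 rows (neurons), 4 columns
G = rng.standard_normal((m, n))
G[: m // 2] *= 0.01                  # hypothetical: first half of the neurons get tiny gradients

U, _, Vt = np.linalg.svd(G, full_matrices=False)
P = U @ Vt                           # polar factor of G

row_norms = np.linalg.norm(P, axis=1)
weak_mean = row_norms[: m // 2].mean()    # rows that had small gradients
strong_mean = row_norms[m // 2 :].mean()  # rows that had normal-sized gradients

print("row norms:", np.round(row_norms, 3))
print("sum of squared row norms:", np.sum(row_norms ** 2))
```

The columns of the polar factor are orthonormal, so the squared row norms always sum to n, but nothing spreads that budget evenly: the rows with small gradients still receive much smaller updates after orthogonalization.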

The research study revealed that by the 500th training step, more than one in four neurons are effectively dead. This isn’t just a local issue; the lack of activity in these neurons starves subsequent layers of necessary data, spreading the inefficiency throughout the model. Aurora solves this by using a new mathematical approach that enforces uniform updates across all neurons without sacrificing the benefits of orthogonalization.

Before arriving at Aurora, the research introduces an intermediate fix called U-NorMuon. The key observation is that NorMuon normalizes each row to unit norm (norm = 1), but this is the wrong target for a tall matrix. For a tall m × n matrix (m > n) with orthonormal columns, the squared row norms sum to n, so the root-mean-square row norm is √(n/m), not 1. U-NorMuon corrects this by normalizing tall-matrix rows to have norm √(n/m) instead of 1.
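The √(n/m) target follows from a short trace identity. The derivation below is a sketch in this article's notation, taking U to be a tall m × n matrix with orthonormal columns and U_{i:} to be its i-th row; the worked number at the end uses a hypothetical shape with m = 4n.

```latex
\sum_{i=1}^{m} \|U_{i:}\|_2^2
  \;=\; \|U\|_F^2
  \;=\; \mathrm{Tr}(U^{\top} U)
  \;=\; \mathrm{Tr}(I_n)
  \;=\; n
\quad\Longrightarrow\quad
\frac{1}{m}\sum_{i=1}^{m} \|U_{i:}\|_2^2 \;=\; \frac{n}{m}.

% If every row carries the same norm, each row must therefore satisfy
\|U_{i:}\|_2 \;=\; \sqrt{n/m},
\qquad \text{e.g. } m = 4n \;\Rightarrow\; \|U_{i:}\|_2 = \tfrac{1}{2}.
```

Normalizing every row to 1 instead would give a squared Frobenius norm of m rather than n, which no matrix with orthonormal columns can have when m > n.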

In experiments at 340M scale, U-NorMuon outperforms both Muon and standard NorMuon and completely eliminates the neuron death phenomenon — leverage scores become approximately isotropic throughout training. Crucially, U-NorMuon propagates this benefit to layers it doesn’t directly touch: keeping up/gate rows alive ensures isotropic gradient flow into the down-projection, stabilizing its column leverage without any direct intervention.

However, U-NorMuon still has a problem: it forcefully overrides the polar factor with uniform row norms, sacrificing polar factor precision, which is both theoretically undesirable and empirically costly in the Muon framework (the paper shows that Muon achieves monotonically lower loss with more precise orthogonalization). This is the motivation for Aurora.
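The precision cost can be seen on a small example. The sketch below applies a simplified version of the row rescaling (a direct rescale of each row of the polar factor to norm √(n/m)) and compares singular values before and after. It is an illustration of the trade-off only, under assumed sizes, and is not the authors' U-NorMuon implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 16, 4
G = rng.standard_normal((m, n))

U, _, Vt = np.linalg.svd(G, full_matrices=False)
P = U @ Vt                                   # exact polar factor

# Simplified row normalization: force every row to norm sqrt(n/m).
target = np.sqrt(n / m)
row_norms = np.linalg.norm(P, axis=1, keepdims=True)
P_rownorm = P * (target / row_norms)

sv_before = np.linalg.svd(P, compute_uv=False)
sv_after = np.linalg.svd(P_rownorm, compute_uv=False)

print("singular values before:", np.round(sv_before, 4))
print("singular values after: ", np.round(sv_after, 4))
```

The rescaled matrix keeps the right total energy (squared Frobenius norm n) and has perfectly uniform rows, but its singular values generally drift away from 1, so it is no longer an exact polar factor.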

Aurora: Steepest Descent Under Two Joint Constraints

Aurora reformulates the update-selection problem from scratch. Rather than running orthogonalization and then patching it with row normalization, Aurora asks: what is the optimal update under the joint constraint of left semi-orthogonality and uniform row norms?

Formally, for tall matrices, Aurora solves:

$$U^{*} = \arg\max_{U} \; \mathrm{Tr}(G^{\top} U) \quad \text{s.t.} \quad U^{\top} U = I_n, \quad \|U_{i:}\|_2 = \sqrt{n/m} \;\; \forall i$$
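To show that the two constraints can be satisfied at the same time, the sketch below constructs one feasible matrix from the columns of a Hadamard matrix and evaluates the objective Tr(GᵀU) on it. This is only a feasible point, not the maximizer, and it is not Aurora's algorithm; the helper name, the sizes, and the random gradient are assumptions for this example.

```python
import numpy as np

def sylvester_hadamard(k):
    """Return the 2^k x 2^k Hadamard matrix with +/-1 entries (Sylvester construction)."""
    H = np.array([[1.0]])
    H2 = np.array([[1.0, 1.0], [1.0, -1.0]])
    for _ in range(k):
        H = np.kron(H, H2)
    return H

m, n = 8, 3
# Hadamard columns are mutually orthogonal with squared norm m, so dividing
# by sqrt(m) gives orthonormal columns. Every entry is +/- 1/sqrt(m), so each
# row of the first n columns has squared norm n/m.
U = sylvester_hadamard(3)[:, :n] / np.sqrt(m)

gram = U.T @ U                               # should equal I_n
row_norms = np.linalg.norm(U, axis=1)        # should all equal sqrt(n/m)

rng = np.random.default_rng(0)
G = rng.standard_normal((m, n))              # illustrative gradient
objective = np.trace(G.T @ U)                # value of Tr(G^T U) at this feasible point
```

Aurora's task is to search over all matrices of this kind for the one that maximizes the objective for the given gradient.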

The research shows that these two constraints together force all singular values of U to exactly equal 1. This means the joint constraint still produces a valid left semi-orthogonal update, not a compromised one. This is the key insight that separates Aurora from NorMuon and U-NorMuon: it achieves row-norm uniformity and orthogonality simultaneously rather than trading one off against the other.
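One way to see a result of this kind is a short counting argument. The sketch below assumes that the orthogonality requirement is relaxed to a spectral-norm bound, σ_j(U) ≤ 1 for every singular value; that relaxation is an assumption made here for illustration, since the formulation above states the constraint as UᵀU = I_n.

```latex
% Uniform row norms fix the total energy of U:
\|U\|_F^2 \;=\; \sum_{i=1}^{m} \|U_{i:}\|_2^2 \;=\; m \cdot \frac{n}{m} \;=\; n.

% The same quantity written in terms of the n singular values of U:
\|U\|_F^2 \;=\; \sum_{j=1}^{n} \sigma_j(U)^2 .

% With \sigma_j(U) \le 1 for all j, the sum of n such terms can reach n only if
\sigma_1(U) = \sigma_2(U) = \dots = \sigma_n(U) = 1
\quad\Longleftrightarrow\quad U^{\top} U = I_n .
```

Under that assumption, uniform row norms of size √(n/m) leave no slack: any singular value below 1 would make the total fall short of n.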

The research also provides two algorithmic implementations of Aurora’s solution. The Riemannian Aurora uses a gradient projection approach restricted to the joint Stiefel/equal-row-leverage manifold. The vanilla Aurora is a simpler, more practical implementation. Both are open-sourced. For non-tall (wide and square) matrices, row-norm uniformity is already implied by orthogonality, so Aurora leaves those parameters unchanged.
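The claim about non-tall matrices is easy to check numerically. The sketch below compares the row norms of the polar factor for a wide and a tall random matrix; the `polar` helper and the shapes are assumptions for this example.

```python
import numpy as np

def polar(G):
    """Polar factor U V^T computed from the thin SVD."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

rng = np.random.default_rng(0)
wide = polar(rng.standard_normal((4, 16)))   # fewer rows than columns
tall = polar(rng.standard_normal((16, 4)))   # more rows than columns

wide_row_norms = np.linalg.norm(wide, axis=1)
tall_row_norms = np.linalg.norm(tall, axis=1)

print("wide row norms:", np.round(wide_row_norms, 3))
print("tall row norms:", np.round(tall_row_norms, 3))
```

For the wide matrix the rows of the polar factor are orthonormal, so every row norm is 1 automatically. Only the tall case shows a spread of row norms, which is the case Aurora targets.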

Results

Aurora was used to train a 1.1B model that achieves 100x data efficiency on open-source internet data and outperforms larger models on general evals like HellaSwag. At 1B scale, Aurora achieves large gains over both Muon and NorMuon. On the modded-nanoGPT optimization speedrun, Aurora’s submitted run outperforms the prior state-of-the-art (which was NorMuon). Untuned Aurora carries only a 6% compute overhead over traditional Muon and is designed as a drop-in replacement.

The research team also found that Aurora’s performance gains scale with MLP width, suggesting it is particularly effective for networks with large MLP expansion factors. This is consistent with the neuron death hypothesis, since wider MLPs have taller matrices and more opportunity for leverage anisotropy to compound.

Key Takeaways

  • Muon’s polar factor update inherits row-norm anisotropy on tall matrices, causing over 25% of MLP neurons to permanently die as early as step 500 of training.
  • Aurora solves this by finding the optimal update under a joint constraint of left semi-orthogonality and uniform row norms — achieving both simultaneously rather than trading one off against the other.
  • At 1.1B scale, Aurora achieves 100x data efficiency on open-source internet data, outperforms larger models on HellaSwag, and sets a new SoTA on the modded-nanoGPT speedrun.
  • Aurora is a near-drop-in replacement for Muon with only 6% compute overhead, and its gains scale with MLP width.

Check out the Paper and GitHub Repo.
