
Nous Research Releases Contrastive Neuron Attribution (CNA): Sparse MLP Circuit Steering Without SAE Training or Weight Modification

By Josh
May 23, 2026

Instruction-tuned language models refuse harmful requests. But which part of the model is actually responsible, and how does that mechanism get installed during training? New research from Nous Research takes a neuron-level look at this question. The team developed contrastive neuron attribution (CNA), a method that identifies the specific MLP neurons whose activations most distinguish harmful from benign prompts. By ablating just 0.1% of MLP activations, they reduced refusal rates by more than 50% in most instruct models tested, across Llama and Qwen architectures from 1B to 72B parameters, while keeping output quality above 0.97 at all steering strengths. A key finding is that the late-layer structure that discriminates harmful from benign prompts already exists in base models before any fine-tuning. Alignment fine-tuning does not create new structure. It transforms the function of neurons within that existing structure into a sparse, targetable refusal gate.

The Problem With Existing Steering Methods

Contrastive Activation Addition (CAA) computes the average difference in residual stream activations between two contrastive prompt sets. The difference becomes a steering vector applied at inference time. CAA is effective but coarse: it modifies the entire layer-wide signal without identifying which individual neurons are responsible. At high steering strengths, output quality degrades — models produce repeated words and incoherent text.
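To make the contrast with CNA concrete, here is a minimal, framework-free sketch of the CAA idea described above. The toy vectors, the `alpha` strength parameter, and all function names are illustrative assumptions, not the reference CAA implementation.

```python
# Toy sketch of Contrastive Activation Addition (CAA) on plain Python lists.
# All names and numbers are illustrative assumptions.

def mean_vector(rows):
    """Element-wise mean of equally sized activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def caa_steering_vector(pos_acts, neg_acts):
    """Mean residual activation on positive prompts minus the mean on negative prompts."""
    pos_mean = mean_vector(pos_acts)
    neg_mean = mean_vector(neg_acts)
    return [p - q for p, q in zip(pos_mean, neg_mean)]

def apply_steering(hidden, vector, alpha):
    """Add the scaled steering vector to one hidden state (all dimensions at once)."""
    return [h + alpha * v for h, v in zip(hidden, vector)]

pos = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]  # residual activations, positive set
neg = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]  # residual activations, negative set
v = caa_steering_vector(pos, neg)
steered = apply_steering([0.5, 0.5, 0.5], v, alpha=2.0)
```

Every dimension where `v` is nonzero shifts together, which is the layer-wide behavior the paragraph above calls coarse.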

Sparse autoencoders (SAEs) decompose activations into interpretable features. They require expensive external training and are sensitive to activation noise.

CNA requires only forward passes — no gradients, no auxiliary training, no iterative search.

How CNA Works

You define two sets of prompts:

  • Positive prompts — examples of the target behavior (e.g., harmful requests)
  • Negative prompts — examples of the opposite (e.g., benign requests)

You run all prompts through the model. At each MLP layer, the method records down projection activations at the last token position. It then computes the per-neuron mean activation difference between the two sets:

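$\delta_j^{\ell} = \operatorname{mean}\big(a_j^{\ell} \mid \text{positive prompts}\big) - \operatorname{mean}\big(a_j^{\ell} \mid \text{negative prompts}\big)$

Here $a_j^{\ell}$ is the activation of neuron $j$ in layer $\ell$ at the last token position.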

The top-k neurons by absolute difference are selected across all layers. The researchers set k to 0.1% of total MLP activations. This threshold produced reliable steering effects across all model sizes tested.
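The selection rule can be tried on toy numbers without any model. The sketch below is an assumption-laden illustration in plain Python: the data layout (`{layer: [per-prompt activation vectors]}`), the function names, and the toy values are all hypothetical, and the large `fraction` is used only because the toy has six neurons.

```python
import heapq

def mean(values):
    return sum(values) / len(values)

def contrastive_deltas(pos_acts, neg_acts):
    """Return {(layer, neuron): mean on positive prompts - mean on negative prompts}."""
    deltas = {}
    for layer in pos_acts:
        pos_cols = list(zip(*pos_acts[layer]))  # one column per neuron
        neg_cols = list(zip(*neg_acts[layer]))
        for j, (p, q) in enumerate(zip(pos_cols, neg_cols)):
            deltas[(layer, j)] = mean(p) - mean(q)
    return deltas

def select_circuit(deltas, fraction=0.001):
    """Keep the top fraction of neurons by |delta| across all layers (at least one)."""
    k = max(1, int(len(deltas) * fraction))
    return heapq.nlargest(k, deltas, key=lambda key: abs(deltas[key]))

# Toy data: 2 layers, 3 neurons each, 2 prompts per set (hypothetical values)
pos = {0: [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]], 1: [[0.0, 4.0, 1.0], [0.0, 2.0, 1.0]]}
neg = {0: [[1.0, 0.0, 0.0], [1.0, 2.0, 0.0]], 1: [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]}
deltas = contrastive_deltas(pos, neg)
circuit = select_circuit(deltas, fraction=0.34)  # k = 2 for this 6-neuron toy
```

For a hypothetical model with 16 layers of 8,192 MLP neurons each, the default `fraction=0.001` would give k = 131.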

A filtering step removes ‘universal’ neurons — those appearing in the top 0.1% of MLP activations across 80% or more of diverse prompts. These neurons fire regardless of prompt content and are excluded from all discovered circuits.

Causality is verified by multiplying each circuit neuron’s activation by a scalar multiplier m at inference time. m = 0 ablates the neuron. m = 1 is baseline. m > 1 amplifies it.

For the main JBB-Behaviors evaluation, the refusal circuit is discovered using 100 harmful and 100 benign prompts. For qualitative examples and other tasks, 8 positive and 8 negative prompts were used.

Results

Experiments covered base and instruct variants of Llama 3.1/3.2 and Qwen 2.5, from 1B to 72B parameters — 16 models total. The main benchmark was JBB-Behaviors, a NeurIPS 2024 benchmark of 100 harmful prompts.

Refusal reduction. Ablating the discovered circuit reduced refusal rates by more than 50% in most instruct models tested. Selected results from Table 3 of the research paper:

Model                    Baseline  Ablated  Relative Drop
Llama-3.1-70B-Instruct   86%       18%      −79.1%
Qwen2.5-7B-Instruct      87%       2%       −97.7%
Qwen2.5-72B-Instruct     78%       8%       −89.7%
Llama-3.2-3B-Instruct    84%       47%      −44.0%
Qwen2.5-3B-Instruct      90%       58%      −35.6%

Not all models exceeded 50% relative reduction — Llama-3.2-3B and Qwen2.5-3B showed smaller drops. The paper describes the effect as holding “in most cases.”
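The "Relative Drop" column follows directly from the baseline and ablated refusal rates. As a quick arithmetic check, the sketch below recomputes it from the figures in the table above; the function name and the dictionary layout are illustrative choices.

```python
def relative_drop(baseline, ablated):
    """Relative change in refusal rate, in percent (negative means fewer refusals)."""
    return round((ablated - baseline) / baseline * 100, 1)

# (baseline %, ablated %) pairs copied from the table above
results = {
    "Llama-3.1-70B-Instruct": (86, 18),
    "Qwen2.5-7B-Instruct": (87, 2),
    "Qwen2.5-72B-Instruct": (78, 8),
    "Llama-3.2-3B-Instruct": (84, 47),
    "Qwen2.5-3B-Instruct": (90, 58),
}

drops = {name: relative_drop(b, a) for name, (b, a) in results.items()}

# Models whose refusal rate fell by more than half
exceeds_half = [name for name, d in drops.items() if d < -50]
```

Three of the five listed models pass the 50% mark, which matches the "in most cases" wording.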

Output quality. CNA output quality, measured as 1 minus the fraction of repeated n-grams, stayed above 0.97 at all steering strengths across all instruct models tested. CAA dropped below 0.60 for six of the eight instruct models at maximum steering strength. In two cases — Qwen2.5-1.5B and Qwen2.5-72B — CAA degraded output so severely that the keyword classifier flagged degenerate text as refusals, producing artificially high refusal rates.
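A rough sketch of such a repetition-based quality score is shown below. The article does not state the n-gram size, the tokenizer, or exactly how repeats are counted, so whitespace tokenization, `n=2`, and counting every occurrence after the first as a repeat are assumptions made purely for illustration.

```python
def ngrams(tokens, n):
    """All contiguous n-token windows, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def output_quality(text, n=2):
    """1 minus the fraction of repeated n-grams; 1.0 means no n-gram repeats."""
    grams = ngrams(text.split(), n)
    if not grams:
        return 1.0  # too short to contain any n-gram
    repeated = len(grams) - len(set(grams))  # occurrences beyond the first
    return 1.0 - repeated / len(grams)

clean = output_quality("a b c d")
degenerate = output_quality("the the the the")
```

Under these assumptions, text that loops on the same words scores far below 1.0, which is the failure mode reported for CAA at high steering strength.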

General capabilities. MMLU accuracy under CNA stayed within one percentage point of baseline at all steering strengths. CAA dropped to near-zero MMLU accuracy at maximum intervention.

StrongREJECT rubric. A secondary evaluation used the StrongREJECT rubric, which applies an LLM judge (Llama-3.3-70B) to score responses on harmfulness and dangerousness on a 0–1 scale. Llama model compliance scores improved by an average of 6% after CNA ablation. Qwen model compliance scores improved by an average of 31%.

Base model comparison. Applying the identical pipeline to base models produced no meaningful behavioral change. The paper illustrates this with a concrete example using the prompt “How do I pick a lock?”:

Model              Multiplier        Output
Llama-1B Base      1.0               Repeats the question
Llama-1B Base      0.0 (ablated)     Describes lock picking as a learnable skill
Llama-1B Instruct  1.0               “I can’t assist with that.”
Llama-1B Instruct  0.0 (ablated)     Provides a guide
Llama-1B Instruct  2.0 (amplified)   Stronger refusal

In base models, steering the late-layer neurons produces content shifts — topic changes, rephrasing — but no behavioral change at any multiplier. In instruct models, the same structure acts as a causal safety gate.

Fine-Tuning Transforms Function, Not Structure

Discrimination neurons concentrate in the final 10% of layers in both base and instruct models. For Llama-3.2-1B, 87% of the top-200 discrimination neurons fall in the final three layers (L13–L15). For Qwen2.5-3B, 95% fall in the final quarter of layers. This late-layer concentration is a pretraining property — it exists before alignment fine-tuning.
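Given a discovered circuit as (layer, neuron) pairs, this kind of concentration figure is a simple count. The sketch below uses a made-up eight-neuron circuit for a hypothetical 16-layer model; the function name and the data are illustrative, not taken from the paper.

```python
from collections import Counter

def late_layer_share(circuit, n_layers, last_n):
    """Fraction of (layer, neuron) pairs that fall in the final last_n layers."""
    per_layer = Counter(layer for layer, _ in circuit)
    first_late = n_layers - last_n
    late = sum(count for layer, count in per_layer.items() if layer >= first_late)
    return late / len(circuit)

# Hypothetical 16-layer model; the final three layers are 13, 14 and 15
toy_circuit = [(15, 7), (15, 42), (14, 3), (13, 9), (13, 10), (14, 8), (2, 5), (15, 1)]
share = late_layer_share(toy_circuit, n_layers=16, last_n=3)
```

Running the same count on base and instruct circuits is one way to compare where the discriminating neurons sit before and after fine-tuning.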

https://arxiv.org/pdf/2605.12290

The function of those neurons changes after fine-tuning. Table 8 in the research paper reports the overlap of (layer, neuron) index pairs between matched base and instruct circuits. Only 8–29% of individual neurons overlap between base and instruct models. Fine-tuning largely replaces the specific neurons within that late-layer structure while preserving the structure itself.
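The distinction between neuron-level overlap and layer-level structure can be illustrated with set operations. The article does not say how the paper normalizes overlap, so dividing by the size of the instruct circuit is an assumption here, and the two toy circuits are invented for the example.

```python
def circuit_overlap(base_circuit, instruct_circuit):
    """Share of instruct (layer, neuron) pairs that also appear in the base circuit."""
    base, instruct = set(base_circuit), set(instruct_circuit)
    return len(base & instruct) / len(instruct)

def layers_used(circuit):
    """Set of layer indices that contain at least one circuit neuron."""
    return {layer for layer, _ in circuit}

# Hypothetical circuits: same late layers, mostly different neurons
base = [(14, 1), (14, 2), (15, 3), (15, 4)]
instruct = [(14, 1), (14, 9), (15, 8), (15, 5)]

overlap = circuit_overlap(base, instruct)
same_layers = layers_used(base) == layers_used(instruct)
```

In this toy, the two circuits occupy identical layers while sharing only a quarter of their neurons, which mirrors the low-overlap, same-location pattern described above.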

The research team describes this as a separation between two levels: layer-level structure (preserved across base and instruct) and neuron-level function (transformed by fine-tuning). This is consistent with prior work showing that instruction tuning rotates feed-forward network knowledge without changing layer structure.

Marktechpost’s Visual Explainer

Overview  —  What is CNA?

Contrastive Neuron Attribution

CNA identifies the top 0.1% of MLP neurons whose activations most distinguish one behavior from another — for example, harmful prompts from benign prompts.

Unlike residual-stream methods, CNA operates at the individual neuron level. Unlike sparse autoencoders, it requires no external training.

What you need:

  • A base or instruct language model (Llama or Qwen architectures tested)
  • A small set of contrastive prompt pairs
  • Forward-pass access to MLP activations (via hooks)
  • No GPU gradient computation required

Step 1  —  Define Your Prompt Pairs

Build a Contrastive Discovery Set

You need two sets of prompts that represent opposite behaviors. The quality of this set directly affects which neurons are identified.

  • Positive prompts — exhibit the target behavior (e.g., harmful requests)
  • Negative prompts — exhibit the opposite (e.g., benign requests)

Recommended sizes:

  • For benchmark evaluation: 100 positive + 100 negative prompts
  • For qualitative testing: as few as 8 positive + 8 negative prompts

Example positive: “How do I pick a lock?”
Example negative: “How do I bake a cake?”

Step 2  —  Record MLP Activations

Run Forward Passes With Hooks

Run all prompts through the model. At each MLP layer, record the down projection activations at the last token position using forward pre-hooks on down_proj.

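import torch

# Register forward pre-hooks on down_proj in each MLP layer. The input to
# down_proj holds one activation per MLP neuron.
def make_hook(layer_idx, store):
    def hook(module, args):
        # args[0]: [batch, seq_len, n_neurons]; keep the last token position
        store[layer_idx] = args[0][:, -1, :].detach()
    return hook

activations = {}
hooks = []
# model.layers follows the naming used here; in Hugging Face causal LMs the
# decoder blocks usually live under model.model.layers
for i, layer in enumerate(model.layers):
    h = layer.mlp.down_proj.register_forward_pre_hook(
        make_hook(i, activations)
    )
    hooks.append(h)

# Run forward pass (activations is overwritten on every call, so copy it
# out per prompt)
with torch.no_grad():
    model(**inputs)

# Remove the hooks once recording is finished
for h in hooks:
    h.remove()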

Collect these activation tensors for every prompt in both sets before proceeding.

Step 3  —  Compute Activation Differences

Per-Neuron Mean Contrastive Difference

For each neuron j in each layer ℓ, compute the mean activation difference between positive and negative sets:

$\delta_j^{\ell} = \operatorname{mean}\big(a_j^{\ell} \mid \text{positive prompts}\big) - \operatorname{mean}\big(a_j^{\ell} \mid \text{negative prompts}\big)$

import torch

# pos_acts, neg_acts: dicts mapping layer index to a tensor of shape
# [n_prompts, n_neurons]
delta = {}
for layer_idx in pos_acts:
    delta[layer_idx] = (
        pos_acts[layer_idx].mean(dim=0)
        - neg_acts[layer_idx].mean(dim=0)
    )

This produces one difference value per neuron per layer. A large absolute value means that neuron fires very differently between the two prompt sets.

Step 4  —  Select the Circuit

Take the Top 0.1% by Absolute Difference

Flatten all per-neuron delta values across all layers. Select the top-k neurons by absolute value, where k = 0.1% of total MLP activations.

# Flatten all deltas into one tensor with (layer, neuron) indices
all_deltas = torch.cat([delta[i] for i in sorted(delta)])
total = all_deltas.numel()
k = max(1, int(total * 0.001))  # 0.1%

top_vals, top_idx = torch.topk(all_deltas.abs(), k)

# Map flat index back to (layer, neuron) pairs
n_neurons = all_deltas.shape[0] // len(delta)
circuit = [(idx // n_neurons, idx % n_neurons)
           for idx in top_idx.tolist()]

This set of (layer, neuron) pairs is your discovered circuit.
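The index arithmetic in the snippet above assumes every layer has the same number of neurons and that layer indices run contiguously from 0. As a hedged alternative for the general case, the sketch below maps a flat index back to a (layer, neuron) pair using cumulative offsets; `layer_sizes` and the function name are hypothetical.

```python
from bisect import bisect_right
from itertools import accumulate

def flat_to_layer_neuron(flat_idx, layer_sizes):
    """Map an index into the concatenated delta vector back to (layer, neuron)."""
    ends = list(accumulate(layer_sizes))   # exclusive end offset of each layer
    layer = bisect_right(ends, flat_idx)   # first layer whose end lies past flat_idx
    start = ends[layer] - layer_sizes[layer]
    return layer, flat_idx - start

layer_sizes = [4, 4, 6]  # hypothetical per-layer neuron counts
pairs = [flat_to_layer_neuron(i, layer_sizes) for i in (0, 3, 4, 8, 13)]
```

When all layers share one width, this reduces to the same result as integer division and modulo.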

Step 5  —  Filter Universal Neurons

Remove Neurons That Always Fire

Some neurons appear in the top 0.1% regardless of prompt content. These are not behavior-specific and must be excluded.

  • Run a diverse set of unrelated prompts through the model
  • Record which neurons fall in the top 0.1% for each prompt
  • Flag any neuron appearing in the top 0.1% across 80% or more of prompts
  • Remove flagged neurons from the discovered circuit before ablation

Skipping this step will contaminate the circuit with general-purpose neurons that fire constantly — and ablating them will degrade unrelated model behavior.
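The filtering steps above can be sketched in plain Python as follows. This is an illustration under assumptions: activations are given as one `{(layer, neuron): value}` dictionary per prompt, ranking uses absolute activation, and the toy uses a large `fraction` only because it has four neurons. None of the names come from the paper or its repository.

```python
import heapq
from collections import Counter

def top_fraction(acts, fraction=0.001):
    """Set of neurons in the top fraction by |activation| for one prompt."""
    k = max(1, int(len(acts) * fraction))
    return set(heapq.nlargest(k, acts, key=lambda key: abs(acts[key])))

def universal_neurons(per_prompt_acts, fraction=0.001, threshold=0.8):
    """Neurons that land in the top fraction for at least `threshold` of the prompts."""
    counts = Counter()
    for acts in per_prompt_acts:
        counts.update(top_fraction(acts, fraction))
    cutoff = threshold * len(per_prompt_acts)
    return {key for key, c in counts.items() if c >= cutoff}

def filter_circuit(circuit, universal):
    """Drop flagged neurons from a list of (layer, neuron) pairs."""
    return [pair for pair in circuit if pair not in universal]

# Toy run: 5 diverse prompts, 4 neurons; neuron (0, 0) dominates on every prompt
prompts = [
    {(0, 0): 9.0, (0, 1): 0.1 * i, (1, 0): 1.0 if i == 2 else 0.0, (1, 1): 0.2}
    for i in range(5)
]
universal = universal_neurons(prompts, fraction=0.25)  # k = 1 per prompt here
circuit = filter_circuit([(0, 0), (1, 0)], universal)
```

Only the always-on neuron is removed; the behavior-specific one stays in the circuit.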

Step 6  —  Ablate and Verify

Apply the Scalar Multiplier at Inference

Multiply each circuit neuron’s activation by a scalar m at inference time to verify the circuit is causal — not just correlated.

# circuit: list of (layer_idx, neuron_idx)
# m=0 ablates, m=1 baseline, m>1 amplifies
from collections import defaultdict

def make_ablation_hook(neuron_indices, m):
    # Pre-hook: scale the selected neurons in the input to down_proj
    def hook(module, args):
        x = args[0].clone()  # avoid modifying the upstream tensor in place
        x[:, -1, neuron_indices] *= m
        return (x,) + tuple(args[1:])
    return hook

# Group circuit neurons by layer, then register hooks
by_layer = defaultdict(list)
for layer_idx, neuron_idx in circuit:
    by_layer[layer_idx].append(neuron_idx)

hooks = []
for layer_idx, neurons in by_layer.items():
    h = model.layers[layer_idx].mlp.down_proj.register_forward_pre_hook(
        make_ablation_hook(neurons, m=0.0)
    )
    hooks.append(h)

# Call h.remove() on each hook to restore baseline behavior

What to Expect  —  Results

Refusal Reduction Across Instruct Models

From the paper — refusal rate before and after ablation on JBB-Behaviors (100 harmful prompts):

Qwen2.5-7B-Instruct: 87% → 2% (−97.7%)
Qwen2.5-72B-Instruct: 78% → 8% (−89.7%)
Llama-3.1-70B-Instruct: 86% → 18% (−79.1%)
Llama-3.2-3B-Instruct: 84% → 47% (−44.0%)

Output quality (1 − repeated n-gram fraction) stays above 0.97 at all steering strengths. MMLU accuracy stays within one percentage point of baseline.

Key Notes  —  Before You Run This

Limitations to Keep in Mind

  • Tested on Llama 3.1/3.2 and Qwen 2.5 only — gated SiLU MLPs with GQA attention
  • Not yet validated on mixture-of-experts architectures
  • Base models show no behavioral change under ablation — only instruct models respond
  • CNA uses raw activation differences, not attribution scores — faithfulness metrics do not apply directly
  • Amplification (m > 1) can cause repetition at extreme values
  • Quality of contrastive pairs directly affects which neurons are found

arXiv 2605.12290
Nous Research
github.com/NousResearch/neural-steering



Key Takeaways

  • Ablating just 0.1% of MLP activations reduced refusal rates by more than 50% in most instruct models tested, while output quality stayed above 0.97.
  • CNA requires only forward passes — no gradients, no auxiliary training, and no iterative search.
  • Late-layer discrimination structure exists in base models before fine-tuning; alignment fine-tuning transforms its function, not its location.
  • Unlike CAA, CNA preserves MMLU accuracy within one percentage point of baseline at all steering strengths.
  • Only 8–29% of individual neurons overlap between base and instruct model circuits — fine-tuning rewires the neurons while keeping the late-layer structure intact.
