Thursday, June 11, 2026
mGrowTech

Meet ‘North Mini Code’: Cohere’s 30B Open-Weight Mixture-of-Experts Model With 3B Active Parameters for Agentic Coding

by Josh
June 11, 2026
in AI, Analytics and Automation


This week, Cohere’s AI team shipped its first developer-facing coding model, ‘North Mini Code’. The model is open-weight and aimed at software engineers. It is a mixture-of-experts (MoE) model with 30B total parameters, of which only 3B activate per token.

The release is positioned around “sovereign” AI. The idea is simple: run capable models on your own terms. Small, efficient coding models let teams self-host without large GPU clusters. North Mini Code targets that gap directly.

North Mini Code

North Mini Code is a 30B-A3B parameter model. The A3B stands for three billion active parameters per token. Cohere optimized it for three jobs: code generation, agentic software engineering, and terminal tasks. The model is text-in, text-out. There is no image or video input.

The context window is 256K tokens. Maximum output length is 64K tokens. Cohere lists a minimum hardware bar of one H100 at FP8. Weights ship under Apache 2.0 on Hugging Face. You can also reach it through the Cohere API, Model Vault, and OpenRouter.

Field | North-Mini-Code-1.0
License | Apache 2.0
Model size | 30B total; 3B active
Context length | 256K total; 64K max generation
Optimized for | Code generation, agentic software engineering, terminal tasks
Availability | Hugging Face, Cohere API, Cohere Model Vault, OpenRouter
Hardware (minimum) | 1× H100 @ FP8
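
The single-H100 figure can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes one byte per parameter at FP8 and two at BF16, and that all experts stay resident in memory even though only 3B parameters are active per token. It counts weights only; KV cache, activations, and runtime overhead are ignored, so treat it as a lower bound rather than a sizing guide.

```python
# Rough weight-memory estimate; ignores KV cache, activations, and overhead.
TOTAL_PARAMS = 30e9    # 30B total parameters (from the article)
ACTIVE_PARAMS = 3e9    # 3B active parameters per token (from the article)

BYTES_PER_PARAM = {"bf16": 2, "fp8": 1}  # standard storage widths


def weight_gb(params: float, dtype: str) -> float:
    """Approximate weight footprint in gigabytes (1 GB = 1e9 bytes)."""
    return params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in ("bf16", "fp8"):
    print(f"{dtype}: ~{weight_gb(TOTAL_PARAMS, dtype):.0f} GB of weights")

print(f"active fraction per token: {ACTIVE_PARAMS / TOTAL_PARAMS:.0%}")
```

At FP8 the weights come to roughly 30 GB, which leaves room for cache on an 80 GB H100. The BF16 weights are about twice that size.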

The Architecture

North Mini Code is a decoder-only Transformer with sparse MoE layers. Its attention interleaves two types in a 3:1 ratio. Sliding-window attention uses RoPE for positions. Global attention uses no positional embeddings at all. The feed-forward block holds 128 experts. Eight experts activate per token. Each expert is an FFN with SwiGLU activation.
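
To make the 3:1 interleave concrete, here is a minimal sketch of a per-layer attention schedule. The article gives the ratio but not the ordering or the layer count, so the "three sliding-window layers, then one global layer" grouping, the 8-layer depth, and the names `attention_schedule`, `sliding_rope`, and `global_nope` are all assumptions made for illustration.

```python
def attention_schedule(num_layers: int, local_per_global: int = 3) -> list[str]:
    """Return a per-layer attention type list with a local:global ratio of 3:1.

    ASSUMPTION: each group is three sliding-window (RoPE) layers followed by
    one global (no positional embedding) layer; the ordering is not from the
    article, only the ratio is.
    """
    period = local_per_global + 1
    return [
        "global_nope" if (i + 1) % period == 0 else "sliding_rope"
        for i in range(num_layers)
    ]


schedule = attention_schedule(8)  # 8 layers is a hypothetical depth
print(schedule)
```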

The router applies a sigmoid before top-k selection. A single dense layer sits before the sparse layers. That mix keeps active compute small while widening total capacity. Cohere released the weights in BF16.
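
The routing step described above can be sketched in a few lines of plain Python. This is an illustration of sigmoid-then-top-k selection in general, not Cohere’s implementation: the random logits stand in for a real router’s output, and renormalizing the selected gate weights so they sum to 1 is an assumption that the article does not confirm.

```python
import math
import random

NUM_EXPERTS = 128  # from the article
TOP_K = 8          # from the article


def route(logits: list[float], k: int = TOP_K) -> list[tuple[int, float]]:
    """Sigmoid-then-top-k routing for one token.

    Returns (expert_index, gate_weight) pairs. Renormalizing the selected
    gates to sum to 1 is an ASSUMPTION, not something the article states.
    """
    scores = [1.0 / (1.0 + math.exp(-x)) for x in logits]  # sigmoid per expert
    top = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:k]
    total = sum(scores[i] for i in top)
    return [(i, scores[i] / total) for i in top]


rng = random.Random(0)
token_logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]  # stand-in router output
chosen = route(token_logits)
print(len(chosen), "of", NUM_EXPERTS, "experts selected")
```

Unlike a softmax, the sigmoid scores each expert independently, so one expert’s score does not shrink when another’s grows. Only the eight selected experts run their SwiGLU FFN for that token.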

Post-training ran in two phases. First came two-stage cascaded supervised fine-tuning (SFT). Then came reinforcement learning with verifiable rewards (RLVR). The post-training focused on agentic coding. The model also supports interleaved thinking and native tool use.

Benchmarks

Cohere reports a 33.4 on the Artificial Analysis Coding Index. It describes this as a competitive position among similarly sized models. The company evaluated on SWE-Bench Verified, SWE-Bench Pro, and Terminal-Bench v2. It also used Terminal-Bench Hard, SciCode, and LiveCodeBench v6.

The methodology is specific. SWE-Bench used the SWE-agent harness v1.1.0. Terminal-Bench v2 used a simple ReAct harness with one terminal tool. Terminal-Bench Hard used the Terminus-2 harness. Each benchmark ran with three seeds, then averaged. Sampling used temperature 1.0 and top_p 0.95.
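
Because sampling at temperature 1.0 is stochastic, a single run can land high or low, which is why scores are averaged over seeds. A minimal sketch of that aggregation follows. The scores in it are placeholders chosen for illustration and are not Cohere’s results, and the `aggregate` helper is hypothetical.

```python
from statistics import mean

SAMPLING = {"temperature": 1.0, "top_p": 0.95}  # settings reported in the article
NUM_SEEDS = 3                                    # runs per benchmark, per the article


def aggregate(per_seed_scores: list[float]) -> float:
    """Average one benchmark's scores across seeds."""
    if len(per_seed_scores) != NUM_SEEDS:
        raise ValueError(f"expected {NUM_SEEDS} seeds, got {len(per_seed_scores)}")
    return mean(per_seed_scores)


# PLACEHOLDER scores for illustration only; these are not Cohere's results.
placeholder_runs = [40.0, 42.0, 44.0]
print(aggregate(placeholder_runs))
```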

The Speed

In Cohere’s internal tests, North Mini Code reached up to 2.8x higher output throughput than Devstral Small 2. That held at identical concurrency and hardware. It also showed a 30% edge in inter-token latency. Time-to-first-token was closer between the two models, with Devstral Small 2 keeping a slight lead.

Metric | North Mini Code vs Devstral Small 2
Output throughput | Up to 2.8x higher (same concurrency and hardware)
Inter-token latency | 30% better for North Mini Code
Time-to-first-token | Slightly behind Devstral Small 2
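
A simple latency model shows why a small TTFT deficit matters less than inter-token latency for long generations. The sketch assumes a single request whose total time is TTFT plus one inter-token gap per remaining token. Every number in it is hypothetical, and reading "30% better" as "30% lower latency" is an interpretation, not a figure from Cohere.

```python
def response_time(ttft_s: float, itl_s: float, output_tokens: int) -> float:
    """Single-request latency: first token, then one gap per remaining token."""
    return ttft_s + itl_s * max(output_tokens - 1, 0)


# HYPOTHETICAL baseline numbers, chosen only to illustrate the trade-off.
baseline_ttft, baseline_itl = 0.50, 0.020
faster_itl = baseline_itl * 0.70     # ASSUMPTION: "30% better" means 30% lower ITL
slower_ttft = baseline_ttft * 1.10   # ASSUMPTION: a slightly worse TTFT

for n in (4, 2048):
    base = response_time(baseline_ttft, baseline_itl, n)
    alt = response_time(slower_ttft, faster_itl, n)
    print(f"{n} tokens: baseline {base:.2f}s, lower-ITL model {alt:.2f}s")
```

With these made-up inputs the lower-ITL model loses on a four-token reply and wins clearly on a 2,048-token one. Long replies are typical of agentic coding output.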

Use Cases With Examples

Cohere built North Mini Code for agentic workflows.

Three patterns stand out in its own framing:

  • Sub-agent orchestration: A main agent delegates subtasks to helpers. Example: one agent writes unit tests while another fixes failing code.
  • Systems architecture mapping: The model reads a repository and sketches its structure. Example: tracing how services call each other before a large refactor.
  • Code reviews: The model scans a diff for problems. Example: flagging an unguarded null dereference before a merge.

Terminal tasks fit the model as well. Example: listing files, running a build, then parsing the output for errors.
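
The terminal example above maps onto a small tool that an agent harness might expose: run a command, capture its output, and surface the error lines. The sketch below is a generic illustration using Python’s standard library, not the harness Cohere used. The `run_and_find_errors` helper is hypothetical, and a Python one-liner stands in for a real build command.

```python
import subprocess
import sys


def run_and_find_errors(cmd: list[str]) -> tuple[int, list[str]]:
    """Run a command, then return its exit code and any lines mentioning 'error'."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    lines = (proc.stdout + proc.stderr).splitlines()
    return proc.returncode, [ln for ln in lines if "error" in ln.lower()]


# Stand-in for a real build command (e.g. `make`): a Python one-liner that
# prints a fake compiler diagnostic to stderr and exits non-zero.
fake_build = [
    sys.executable, "-c",
    "import sys; print('compiling main.c'); "
    "print('main.c:3: error: expected ;', file=sys.stderr); sys.exit(1)",
]

code, errors = run_and_find_errors(fake_build)
print(code, errors)
```

In an agent loop, the exit code and the filtered lines would be returned to the model as the tool result, and the model would decide what to run next.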

Getting Started

The fastest path is Hugging Face Transformers. Install Transformers from source for this model. Recommended sampling is temperature 1.0 and top_p 0.95.

# Install Transformers from source (required for this model):
# pip install "git+https://github.com/huggingface/transformers.git"
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "CohereLabs/North-Mini-Code-1.0"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Write a python program to check if a string is a palindrome or not."
messages = [{"role": "user", "content": prompt}]

# return_dict=True yields a dict (input_ids + attention_mask) so **inputs unpacks cleanly
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

gen_tokens = model.generate(
    **inputs,
    max_new_tokens=1024,
    do_sample=True,
    temperature=1.0,
    top_p=0.95,
)

# Decode only the newly generated tokens, not the prompt
output = tokenizer.decode(gen_tokens[0][inputs["input_ids"].shape[-1]:])
print(output)

For serving, vLLM works. You need vLLM main plus Cohere’s melody library. Accurate response parsing depends on it.

uv pip install "git+https://github.com/vllm-project/vllm.git"
uv pip install "cohere_melody>=0.9.0"

vllm serve CohereLabs/North-Mini-Code-1.0 \
  -tp 2 \
  --max-model-len 320000 \
  --tool-call-parser cohere_command4 \
  --reasoning-parser cohere_command4 \
  --enable-auto-tool-choice
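
Once the server is up, it can be queried over vLLM’s OpenAI-compatible HTTP API. The sketch below uses only the standard library and assumes the default address of `http://localhost:8000`; adjust `BASE_URL` if you pass a different host or port. The `build_request` and `ask` helpers are hypothetical names, and nothing is sent until `ask` is called.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # ASSUMPTION: vLLM's default host and port
MODEL = "CohereLabs/North-Mini-Code-1.0"


def build_request(prompt: str, max_tokens: int = 512) -> urllib.request.Request:
    """Build a chat-completions request using the article's recommended sampling."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 1.0,
        "top_p": 0.95,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str) -> str:
    """Send the request to a running server and return the reply text."""
    with urllib.request.urlopen(build_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


req = build_request("Write a Python palindrome checker.")
print(req.full_url)
```

With the server running, `print(ask("Write a Python palindrome checker."))` returns the model’s reply.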

Quantized builds exist for Ollama, LM Studio, and llama.cpp. You can also try the model before downloading. Cohere offers free access through OpenCode and a hosted Hugging Face Space.

Key Takeaways

  • Cohere’s first coding model, North Mini Code, is a 30B mixture-of-experts that activates just 3B parameters per token.
  • It runs on a single H100 at FP8, with 256K context and 64K max output.
  • Weights ship under Apache 2.0, though the Hugging Face card adds a non-commercial note.
  • Cohere’s official release reports 33.4 on the Artificial Analysis Coding Index, and up to 2.8x the output throughput of Devstral Small 2.
  • Built for agentic coding: sub-agent orchestration, architecture mapping, and code reviews, with native tool use.

Marktechpost’s Interactive Explainer

The model at a glance

Open weights, released June 9, 2026. Text in, text out.

Size: 30B total / 3B active
Architecture: Sparse MoE (decoder-only)
Min hardware: 1× H100 @ FP8
License: Apache 2.0 (see license note)


Optimized for

Code generation
Agentic software engineering
Terminal tasks

Agentic use cases

Sub-agent orchestration
Systems architecture mapping
Code reviews

License note: Cohere’s blog states Apache 2.0. The Hugging Face card adds an acceptable-use addendum and a non-commercial note. Check both before deploying.



Each MoE block holds 128 experts. The router selects 8 per token, so 6.25% of experts run per token and compute stays small.

Reported performance

Figures are from Cohere. Independent runs on your own workload still matter.

Artificial Analysis Coding Index: 33.4
Inter-token latency: 30% better than Devstral Small 2

Output throughput vs Devstral Small 2 (higher is better)
North Mini Code: up to 2.8×
Devstral Small 2: 1.0× (baseline)

Time-to-first-token was closely matched, with Devstral Small 2 holding a slight edge.

Benchmarks: SWE-Bench Verified, SWE-Bench Pro, Terminal-Bench v2, Terminal-Bench Hard, SciCode, LiveCodeBench v6. Harnesses: SWE-agent v1.1.0 (SWE-Bench), a ReAct harness with one terminal tool (Terminal-Bench v2), Terminus-2 (Terminal-Bench Hard). Each run used 3 seeds, averaged, at temperature 1.0 and top_p 0.95.

Quickstart

Hugging Face Transformers, installed from source. Recommended sampling: temperature 1.0, top_p 0.95.

# Install Transformers from source, then:
from transformers import AutoTokenizer, AutoModelForCausalLM

mid = "CohereLabs/North-Mini-Code-1.0"
tok = AutoTokenizer.from_pretrained(mid)
model = AutoModelForCausalLM.from_pretrained(mid, device_map="auto")

msgs = [{"role": "user", "content": "Write a Python palindrome checker."}]
inputs = tok.apply_chat_template(
    msgs, add_generation_prompt=True,
    return_dict=True, return_tensors="pt",
).to(model.device)

out = model.generate(**inputs, max_new_tokens=1024,
                     do_sample=True, temperature=1.0, top_p=0.95)
print(tok.decode(out[0][inputs["input_ids"].shape[-1]:]))

Serve with vLLM (+ cohere_melody)
Trained for OpenCode
Native tool use + interleaved thinking

Quantized: Ollama, LM Studio, llama.cpp
Also on Cohere API, Model Vault, OpenRouter


