mGrowTech

IBM Releases Two Granite Speech 4.1 2B Models: Autoregressive ASR with Translation and Non-Autoregressive Editing for Fast Inference

By Josh
April 30, 2026
in AI, Analytics and Automation


IBM released two new open speech recognition models — Granite Speech 4.1 2B and Granite Speech 4.1 2B-NAR — and they make a compelling case for what a ~2B-parameter speech model can do. Both are available on Hugging Face under the Apache 2.0 license.

The pair targets a specific problem that enterprise AI teams know well: most production-grade automatic speech recognition (ASR) systems either demand massive compute or sacrifice accuracy to stay within budget. IBM’s bet is that careful architecture decisions can let you have it both ways.

What These Models Actually Do

Granite Speech 4.1 2B is a compact and efficient speech-language model designed for multilingual automatic speech recognition (ASR) and bidirectional automatic speech translation (AST) covering English, French, German, Spanish, Portuguese, and Japanese. Its non-autoregressive counterpart, Granite Speech 4.1 2B-NAR, focuses exclusively on ASR — specifically targeting latency-sensitive deployments — and supports English, French, German, Spanish, and Portuguese, but not Japanese. That’s a meaningful distinction: teams that need Japanese transcription or any speech translation capability should reach for the standard autoregressive model.

IBM also quietly released a third variant alongside these two. Granite Speech 4.1 2B-Plus adds speaker-attributed ASR and word-level timestamps for applications where knowing who said what — and exactly when — is a requirement.

Word Error Rate (WER) is the primary metric for measuring transcription quality. Lower is better. A WER of 5% means roughly 5 out of every 100 words are wrong. On the Open ASR Leaderboard (as of April 2026), Granite Speech 4.1 2B scores a mean WER of 5.33. Drilling into the benchmarks, the model achieves a WER of 1.33 on LibriSpeech clean and 2.5 on LibriSpeech other.
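To make the metric concrete, here is a minimal generic sketch of WER as word-level edit distance divided by reference length (this is the standard definition, not IBM's evaluation code):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level Levenshtein distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution / match
    return dp[len(ref)][len(hyp)] / len(ref)

# One substitution ("the" -> "a") out of 6 reference words -> 1/6 ≈ 0.167
print(wer("the cat sat on the mat", "the cat sat on a mat"))
```

Production evaluations typically also apply text normalization (casing, punctuation) before scoring, which this sketch omits.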

The Architecture, Explained

Both models share the same three-component design at a high level — a speech encoder, a modality adapter, and a language model — though the decoding mechanism diverges significantly.

The first component is the speech encoder. A Conformer is a neural network block that combines convolutional layers (good at capturing local acoustic patterns) with attention mechanisms (good at capturing long-range dependencies); Connectionist Temporal Classification (CTC) is a training technique that lets the model learn from audio-text pairs without needing exact frame-level alignment. The encoder stacks 16 Conformer blocks trained with CTC using two classification heads — one for graphemic (character-level) outputs and one for BPE units — with frame importance sampling to focus on informative parts of the audio.
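The CTC decoding rule is easy to illustrate with a toy greedy decoder (a generic sketch of the CTC collapse rule, unrelated to Granite's actual heads): merge adjacent repeated frame predictions, then drop the special blank token. The blank is what lets CTC represent genuinely repeated characters.

```python
BLANK = "_"  # CTC blank symbol (placeholder choice for this sketch)

def ctc_collapse(frames: list) -> str:
    """Greedy CTC decoding: merge adjacent repeats, then remove blanks."""
    out = []
    prev = None
    for sym in frames:
        if sym != prev and sym != BLANK:  # new non-blank symbol -> emit it
            out.append(sym)
        prev = sym
    return "".join(out)

# Repeated frame predictions collapse to a single character:
print(ctc_collapse(["c", "c", BLANK, "a", "a", "a", "t", "t"]))  # "cat"
# A blank between two identical symbols preserves a true double letter:
print(ctc_collapse(["b", "o", BLANK, "o", "k"]))  # "book"
```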

The second component is a speech-text modality adapter. A 2-layer window query transformer (Q-Former) operates on blocks of fifteen 1024-dimensional acoustic embeddings from the last Conformer block, downsampling by a factor of 5 using 3 trainable queries per block and per layer — for a total temporal downsampling factor of 10 — resulting in a 10Hz acoustic embedding rate for the LLM. This adapter bridges the gap between continuous acoustic features and discrete text tokens, compressing the audio representation so the language model can process it efficiently. In the NAR model, the Q-Former has 160M parameters and downsamples the concatenated hidden representations from four encoder layers (layers 4, 8, 12, and 16).
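The downsampling arithmetic can be sanity-checked in a few lines. Two assumptions here are not stated in the article and are only implied by the numbers: a typical 100 frames/s acoustic feature rate, and that the remaining 2x reduction (total factor 10 versus the Q-Former's 5) happens inside the encoder:

```python
feature_rate_hz = 100    # typical filterbank frame rate (assumption, not in the article)
encoder_downsample = 2   # implied: total factor 10 / Q-Former factor 5
block_size = 15          # acoustic embeddings per Q-Former window
queries_per_block = 3    # trainable queries emitted per window

qformer_downsample = block_size // queries_per_block         # 15 -> 3, i.e. 5x
total_downsample = encoder_downsample * qformer_downsample   # 10x overall
llm_embedding_rate = feature_rate_hz / total_downsample      # 10.0 Hz for the LLM

print(qformer_downsample, total_downsample, llm_embedding_rate)  # 5 10 10.0
```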

The third component is the language model. Granite Speech 4.1 2B uses an intermediate checkpoint of granite-4.0-1b-base with 128k context length, fine-tuned on all training corpora. In the NAR variant, this becomes a 1B-parameter bidirectional LLM editor — granite-4.0-1b-base with its causal attention mask removed to enable bidirectional context — adapted with LoRA at rank 128 applied to both attention and MLP layers.

The Autoregressive vs. Non-Autoregressive Tradeoff

This is where the two models diverge most sharply, and it has direct consequences for production deployment.

In the standard Granite Speech 4.1 2B, text is generated autoregressively — one token at a time, each depending on every token before it. This produces accurate, stable transcripts with full support for AST, keyword-biased recognition, and punctuation, but is inherently sequential and slower at scale.

Granite Speech 4.1 2B-NAR takes a fundamentally different approach. Rather than decoding tokens one at a time, it edits a CTC hypothesis in a single forward pass using a bidirectional LLM, achieving competitive accuracy with faster inference than autoregressive alternatives. This is the NLE (Non-autoregressive LLM-based Editing) architecture. Concretely: the CTC encoder produces a rough initial transcript, that hypothesis is interleaved with insertion slots, and then a bidirectional LLM predicts edits — copy, insert, delete, or replace — at all positions simultaneously in one pass.
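A toy version of the editing step makes the idea concrete. The sketch below applies per-position edit decisions to a CTC hypothesis interleaved with insertion slots; the operation names and data layout are illustrative only, not the model's actual interface, and in the real model all decisions come from one bidirectional forward pass rather than a handwritten list:

```python
def apply_edits(hypothesis, edits):
    """Apply one edit op per hypothesis token plus one insertion slot before/after each.

    edits has 2 * len(hypothesis) + 1 entries, alternating
    [slot0, tok0, slot1, tok1, ..., slotN]:
      slots take ("keep", None) or ("insert", word);
      tokens take ("copy", None), ("delete", None), or ("replace", word).
    """
    out = []
    for i, (op, arg) in enumerate(edits):
        if i % 2 == 0:              # insertion slot between tokens
            if op == "insert":
                out.append(arg)
        else:                       # hypothesis token at index i // 2
            tok = hypothesis[i // 2]
            if op == "copy":
                out.append(tok)
            elif op == "replace":
                out.append(arg)
            # "delete": emit nothing
    return out

# Rough CTC hypothesis with a duplicate and a misrecognized word:
hyp = ["the", "cat", "cat", "zat"]
edits = [("keep", None), ("copy", None),       # "the"
         ("keep", None), ("copy", None),       # "cat"
         ("keep", None), ("delete", None),     # drop duplicate "cat"
         ("keep", None), ("replace", "sat"),   # fix "zat" -> "sat"
         ("insert", "down")]                   # append a missed word
print(apply_edits(hyp, edits))  # ['the', 'cat', 'sat', 'down']
```

Because every position is decided independently given bidirectional context, the whole transcript is repaired in one step instead of one token per step.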

The NAR model measured an RTFx of approximately 1820 on a single H100 GPU using batched inference at batch size 128. RTFx (real-time factor multiplier) measures how many times faster than real time a model can process audio — an RTFx of 1820 means a one-hour audio file can be transcribed in under two seconds on that hardware. One practical constraint engineers should note: the NAR model requires flash_attention_2 for inference, since this backend supports sequence packing and respects the is_causal=False flag.
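The RTFx arithmetic is easy to sanity-check (this is the generic definition, applied to the article's reported figure):

```python
def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    """Real-time factor multiplier: audio duration / wall-clock processing time."""
    return audio_seconds / processing_seconds

# At RTFx ≈ 1820, one hour of audio (3600 s) takes about 3600 / 1820 ≈ 1.98 s:
seconds_for_one_hour = 3600 / 1820
print(round(seconds_for_one_hour, 2))  # 1.98
```

Note the reported number is for batched inference at batch size 128; single-stream latency on one file would look different.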

Training Data and Infrastructure

The two models were trained on different datasets. The standard model was trained on 174,000 hours of audio from public corpora for ASR and AST, as well as synthetic datasets tailored to support Japanese ASR, keyword-biased ASR, and speech translation. The NAR model was trained on approximately 130,000 hours of speech across five languages using publicly available datasets including CommonVoice 15, MLS, LibriSpeech, LibriHeavy, AMI, Granary VoxPopuli, Granary YODAS, Earnings-22, Fisher, CallHome, and SwitchBoard.

The infrastructure gap between the two is equally telling. The standard model’s training was completed in 30 days — 26 days for the encoder and 4 days for the projector — on 8 H100 GPUs. The NAR model trained in just 3 days on 16 H100 GPUs (2 nodes) for 5 epochs — a much lighter training run, which reflects the architectural simplicity of editing over full autoregressive generation.
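In GPU-days, the gap works out as follows (straight arithmetic on the figures above):

```python
standard_gpu_days = 30 * 8   # 30 days on 8 H100s  -> 240 GPU-days
nar_gpu_days = 3 * 16        # 3 days on 16 H100s  -> 48 GPU-days

print(standard_gpu_days, nar_gpu_days, standard_gpu_days / nar_gpu_days)  # 240 48 5.0
```

Roughly a 5x reduction in total compute for the NAR run, consistent with the lighter editing objective.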

Key Takeaways


  • IBM released two open ASR models — Granite Speech 4.1 2B (autoregressive) and Granite Speech 4.1 2B-NAR (non-autoregressive) — both ~2B parameters and Apache 2.0 licensed.
  • The standard model achieves a mean WER of 5.33 on the Open ASR Leaderboard, supports 6 languages for ASR (including Japanese), bidirectional speech translation, keyword biasing, and punctuation/truecasing — competitive with models several times its size.
  • The NAR model trades capabilities for speed — it drops Japanese, AST, and keyword biasing, but delivers an RTFx of ~1820 on a single H100 GPU by editing a CTC hypothesis in a single forward pass rather than generating tokens one at a time.
  • The architecture has three core components — a 16-layer Conformer encoder trained with dual-head CTC, a 2-layer window Q-Former projector that downsamples audio to a 10Hz embedding rate, and a fine-tuned granite-4.0-1b-base language model.
  • A third variant, Granite Speech 4.1 2B-Plus, also exists — extending the standard model with speaker-attributed ASR and word-level timestamps for applications where speaker identity and precise timing are required.

