• About Us
  • Disclaimer
  • Contact Us
  • Privacy Policy
Friday, April 24, 2026
mGrowTech
No Result
View All Result
  • Technology And Software
    • Account Based Marketing
    • Channel Marketing
    • Marketing Automation
      • Al, Analytics and Automation
      • Ad Management
  • Digital Marketing
    • Social Media Management
    • Google Marketing
  • Direct Marketing
    • Brand Management
    • Marketing Attribution and Consulting
  • Mobile Marketing
  • Event Management
  • PR Solutions
  • Technology And Software
    • Account Based Marketing
    • Channel Marketing
    • Marketing Automation
      • Al, Analytics and Automation
      • Ad Management
  • Digital Marketing
    • Social Media Management
    • Google Marketing
  • Direct Marketing
    • Brand Management
    • Marketing Attribution and Consulting
  • Mobile Marketing
  • Event Management
  • PR Solutions
No Result
View All Result
mGrowTech
No Result
View All Result
Home Al, Analytics and Automation

Microsoft Releases VibeVoice-ASR: A Unified Speech-to-Text Model Designed to Handle 60-Minute Long-Form Audio in a Single Pass

Josh by Josh
January 23, 2026
in Al, Analytics and Automation
0
Microsoft Releases VibeVoice-ASR: A Unified Speech-to-Text Model Designed to Handle 60-Minute Long-Form Audio in a Single Pass


Microsoft has released VibeVoice-ASR as part of the VibeVoice family of open source frontier voice AI models. VibeVoice-ASR is described as a unified speech-to-text model that can handle 60-minute long-form audio in a single pass and output structured transcriptions that encode Who, When, and What, with support for Customized Hotwords.

VibeVoice sits in a single repository that hosts Text-to-Speech, real time TTS, and Automatic Speech Recognition models under an MIT license. VibeVoice uses continuous speech tokenizers that run at 7.5 Hz and a next-token diffusion framework where a Large Language Model reasons over text and dialogue and a diffusion head generates acoustic detail. This framework is mainly documented for TTS, but it defines the overall design context in which VibeVoice-ASR lives.

https://huggingface.co/microsoft/VibeVoice-ASR

Long form ASR with a single global context

Unlike conventional ASR (Automatic Speech Recognition) systems that first cut audio into short segments and then run diarization and alignment as separate components, VibeVoice-ASR is designed to accept up to 60 minutes of continuous audio input within a 64K token length budget. The model keeps one global representation of the full session. This means the model can maintain speaker identity and topic context across the entire hour instead of resetting every few seconds.

60-minute Single-Pass Processing

The first key feature is that many conventional ASR systems process long audio by cutting it into short segments, which can lose global context. VibeVoice-ASR instead takes up to 60 minutes of continuous audio within a 64K token window so it can maintain consistent speaker tracking and semantic context across the entire recording.

This is important for tasks like meeting transcription, lectures, and long support calls. A single pass over the complete sequence simplifies the pipeline. There is no need to implement custom logic to merge partial hypotheses or repair speaker labels at boundaries between audio chunks.

Customized Hotwords for domain accuracy

Customized Hotwords are the second key feature. Users can provide hotwords such as product names, organization names, technical terms, or background context. The model uses these hotwords to guide the recognition process.

This allows you to bias decoding toward the correct spelling and pronunciation for domain specific tokens without retraining the model. For example, a dev-user can pass internal project names or customer specific terms at inference time. This is useful when deploying the same base model across several products that share similar acoustic conditions but very different vocabularies.

Microsoft also ships a finetuning-asr directory with LoRA based fine tuning scripts for VibeVoice-ASR. Together, hotwords and LoRA fine tuning give a path for both light weight adaptation and deeper domain specialization.

Rich Transcription, diarization, and timing

The third feature is Rich Transcription with Who, When, and What. The model jointly performs ASR, diarization, and timestamping, and returns a structured output that indicates who said what and when.

See below the three evaluation figures named DER, cpWER, and tcpWER.

https://huggingface.co/microsoft/VibeVoice-ASR
  • DER is Diarization Error Rate, it measures how well the model assigns speech segments to the correct speaker
  • cpWER and tcpWER are word error rate metrics computed under conversational settings

These graphs summarize how well the model performs on multi speaker long form data, which is the primary target setting for this ASR system.

The structured output format is well suited for downstream processing like speaker specific summarization, action item extraction, or analytics dashboards. Since segments, speakers, and timestamps already come from a single model, downstream code can treat the transcript as a time aligned event log.

Key Takeaways

  • VibeVoice-ASR is a unified speech to text model that handles 60 minute long form audio in a single pass within a 64K token context.
  • The model jointly performs ASR, diarization, and timestamping so it outputs structured transcripts that encode Who, When, and What in a single inference step.
  • Customized Hotwords let users inject domain specific terms such as product names or technical jargon to improve recognition accuracy without retraining the model.
  • Evaluation with DER, cpWER, and tcpWER focuses on multi speaker conversational scenarios which aligns the model with meetings, lectures, and long calls.
  • VibeVoice-ASR is released in the VibeVoice open source stack under MIT license with official weights, fine tuning scripts, and an online Playground for experimentation.

Check out the Model Weights, Repo and Playground. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well.




Source_link

READ ALSO

Mend Releases AI Security Governance Framework: Covering Asset Inventory, Risk Tiering, AI Supply Chain Security, and Maturity Model

“Your Next Coworker May Not Be Human” as Google Bets Everything on AI Agents to Power the Office

Related Posts

Mend Releases AI Security Governance Framework: Covering Asset Inventory, Risk Tiering, AI Supply Chain Security, and Maturity Model
Al, Analytics and Automation

Mend Releases AI Security Governance Framework: Covering Asset Inventory, Risk Tiering, AI Supply Chain Security, and Maturity Model

April 24, 2026
“Your Next Coworker May Not Be Human” as Google Bets Everything on AI Agents to Power the Office
Al, Analytics and Automation

“Your Next Coworker May Not Be Human” as Google Bets Everything on AI Agents to Power the Office

April 23, 2026
Google Cloud AI Research Introduces ReasoningBank: A Memory Framework that Distills Reasoning Strategies from Agent Successes and Failures
Al, Analytics and Automation

Google Cloud AI Research Introduces ReasoningBank: A Memory Framework that Distills Reasoning Strategies from Agent Successes and Failures

April 23, 2026
The Most Efficient Approach to Crafting Your Personal AI Productivity System
Al, Analytics and Automation

The Most Efficient Approach to Crafting Your Personal AI Productivity System

April 23, 2026
Teaching AI models to say “I’m not sure” | MIT News
Al, Analytics and Automation

Teaching AI models to say “I’m not sure” | MIT News

April 23, 2026
Alibaba Qwen Team Releases Qwen3.6-27B: A Dense Open-Weight Model Outperforming 397B MoE on Agentic Coding Benchmarks
Al, Analytics and Automation

Alibaba Qwen Team Releases Qwen3.6-27B: A Dense Open-Weight Model Outperforming 397B MoE on Agentic Coding Benchmarks

April 22, 2026
Next Post
Sennheiser introduces new TV headphones bundle with Auracast

Sennheiser introduces new TV headphones bundle with Auracast

POPULAR NEWS

Trump ends trade talks with Canada over a digital services tax

Trump ends trade talks with Canada over a digital services tax

June 28, 2025
Communication Effectiveness Skills For Business Leaders

Communication Effectiveness Skills For Business Leaders

June 10, 2025
15 Trending Songs on TikTok in 2025 (+ How to Use Them)

15 Trending Songs on TikTok in 2025 (+ How to Use Them)

June 18, 2025
App Development Cost in Singapore: Pricing Breakdown & Insights

App Development Cost in Singapore: Pricing Breakdown & Insights

June 22, 2025
Comparing the Top 7 Large Language Models LLMs/Systems for Coding in 2025

Comparing the Top 7 Large Language Models LLMs/Systems for Coding in 2025

November 4, 2025

EDITOR'S PICK

How To Develop A Smart AI Trip Planner App in 2025

How To Develop A Smart AI Trip Planner App in 2025

June 21, 2025
Google Shopping launches agentic checkout and more AI shopping tools

Google Shopping launches agentic checkout and more AI shopping tools

November 16, 2025
Tuft & Needle Is Offering a Free Pillow Set With Mattress Purchase

Tuft & Needle Is Offering a Free Pillow Set With Mattress Purchase

December 17, 2025
It’s not just you, Google and others are partially down today

It’s not just you, Google and others are partially down today

June 12, 2025

About

We bring you the best Premium WordPress Themes that perfect for news, magazine, personal blog, etc. Check our landing page for details.

Follow us

Categories

  • Account Based Marketing
  • Ad Management
  • Al, Analytics and Automation
  • Brand Management
  • Channel Marketing
  • Digital Marketing
  • Direct Marketing
  • Event Management
  • Google Marketing
  • Marketing Attribution and Consulting
  • Marketing Automation
  • Mobile Marketing
  • PR Solutions
  • Social Media Management
  • Technology And Software
  • Uncategorized

Recent Posts

  • 13 Local Citation Building Services You Can Trust
  • Bob Iger rejoins Thrive Capital as advisor after Disney exit
  • Mend Releases AI Security Governance Framework: Covering Asset Inventory, Risk Tiering, AI Supply Chain Security, and Maturity Model
  • What are the Best IT Alerting Software for Enterprises?
  • About Us
  • Disclaimer
  • Contact Us
  • Privacy Policy
No Result
View All Result
  • Technology And Software
    • Account Based Marketing
    • Channel Marketing
    • Marketing Automation
      • Al, Analytics and Automation
      • Ad Management
  • Digital Marketing
    • Social Media Management
    • Google Marketing
  • Direct Marketing
    • Brand Management
    • Marketing Attribution and Consulting
  • Mobile Marketing
  • Event Management
  • PR Solutions