mGrowTech

Microsoft Released VibeVoice-1.5B: An Open-Source Text-to-Speech Model that can Synthesize up to 90 Minutes of Speech with Four Distinct Speakers

by Josh
August 26, 2025
in AI, Analytics and Automation


Microsoft’s latest open-source release, VibeVoice-1.5B, targets expressive, long-form, multi-speaker text-to-speech (TTS). The model is MIT-licensed and aimed at research use. More than a single-voice TTS engine, it is a framework designed to generate up to 90 minutes of uninterrupted, natural-sounding audio, to voice up to four distinct speakers in one session, and to handle cross-lingual and singing synthesis. With a larger, streaming-capable 7B model announced for the near future, VibeVoice-1.5B is a notable step forward for AI-powered conversational audio, podcasting, and synthetic voice research.

Key Features

  • Massive Context and Multi-Speaker Support: VibeVoice-1.5B can synthesize up to 90 minutes of speech with up to four distinct speakers in a single session, well beyond the one- or two-speaker limit typical of earlier TTS models.
  • Single-Pass Conversation Generation: Rather than stitching together single-voice clips, the model generates a whole multi-speaker conversation in one session and handles natural turn-taking. Turns are sequential; overlapping speech is not modeled (see the limitations below).
  • Cross-Lingual and Singing Synthesis: Although trained primarily on English and Chinese, the model is capable of cross-lingual synthesis and can even generate singing, features rarely demonstrated in previous open-source TTS models.
  • MIT License: The release is open source under the permissive MIT license, with a focus on research, transparency, and reproducibility.
  • Scalable for Streaming and Long-Form Audio: The architecture is designed for efficient long-duration synthesis, and a forthcoming streaming-capable 7B model is expected to extend it toward real-time, high-fidelity TTS.
  • Emotion and Expressiveness: The model is touted for its emotion control and natural expressiveness, which makes it a candidate for podcasts and other conversational scenarios.
Model page: https://huggingface.co/microsoft/VibeVoice-1.5B

Architecture and Technical Deep Dive

VibeVoice is built on a 1.5B-parameter LLM (Qwen2.5-1.5B) paired with two novel tokenizers, one acoustic and one semantic. Both operate at a low frame rate of 7.5 Hz, which keeps long sequences short enough to process efficiently and consistently.
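To see why the low frame rate matters, here is a back-of-the-envelope sketch using only figures quoted in this article (24 kHz audio, 7.5 Hz frames, 90 minutes, a 65k-token training ceiling). Two things are assumptions rather than facts from the source: "65k" is read as 65,536, and each frame is assumed to occupy one position in the LLM context. The 50 Hz comparison rate is purely hypothetical.

```python
# Figures taken from the article.
SAMPLE_RATE_HZ = 24_000
FRAME_RATE_HZ = 7.5
MAX_MINUTES = 90

# Assumptions, not stated in the article.
CONTEXT_TOKENS = 65_536        # "65k" read as 2**16
BASELINE_FRAME_RATE_HZ = 50    # hypothetical higher-rate tokenizer for comparison


def samples_per_frame(sample_rate_hz: float, frame_rate_hz: float) -> float:
    """Number of raw audio samples summarized by one tokenizer frame."""
    return sample_rate_hz / frame_rate_hz


def frames_for_duration(minutes: float, frame_rate_hz: float) -> int:
    """Number of frames needed to cover the given duration."""
    return round(minutes * 60 * frame_rate_hz)


if __name__ == "__main__":
    low = frames_for_duration(MAX_MINUTES, FRAME_RATE_HZ)
    high = frames_for_duration(MAX_MINUTES, BASELINE_FRAME_RATE_HZ)
    print("samples per frame:", samples_per_frame(SAMPLE_RATE_HZ, FRAME_RATE_HZ))
    print("frames for 90 min at 7.5 Hz:", low)
    print("frames for 90 min at 50 Hz:", high)
    print("fits in assumed context:", low <= CONTEXT_TOKENS)
```

Under these assumptions, each frame covers 3,200 samples (matching the quoted 3200x downsampling), and 90 minutes comes to 40,500 frames, against 270,000 at the hypothetical 50 Hz rate. The script text also consumes context, so the real headroom is smaller than the raw frame count suggests.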

  • Acoustic Tokenizer: A σ-VAE variant with a mirrored encoder-decoder structure (about 340M parameters each) that achieves 3200x downsampling from raw 24 kHz audio.
  • Semantic Tokenizer: An encoder-only architecture trained via an ASR proxy task. It mirrors the acoustic tokenizer’s encoder design, minus the VAE components.
  • Diffusion Decoder Head: A lightweight (~123M parameter) conditional diffusion module that predicts acoustic features, using Classifier-Free Guidance (CFG) and DPM-Solver to improve perceptual quality.
  • Context Length Curriculum: Training starts at 4k tokens and scales up to 65k tokens, which is what lets the model generate very long, coherent audio segments.
  • Sequence Modeling: The LLM tracks dialogue flow and turn-taking, while the diffusion head generates fine-grained acoustic detail. This separates semantics from synthesis while preserving speaker identity over long durations.
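For readers unfamiliar with Classifier-Free Guidance, the sketch below shows the standard CFG combination step in plain Python. It is a generic illustration of the technique, not VibeVoice's code: the function name, the list-based vectors, and the guidance scale values are all hypothetical, and the article does not say what scale the model uses.

```python
from typing import List, Sequence


def cfg_combine(
    cond: Sequence[float],
    uncond: Sequence[float],
    guidance_scale: float,
) -> List[float]:
    """Blend conditional and unconditional predictions (standard CFG).

    guided = uncond + guidance_scale * (cond - uncond)

    A scale of 1.0 returns the conditional prediction unchanged; larger
    values push the result further in the direction the condition implies.
    """
    if len(cond) != len(uncond):
        raise ValueError("cond and uncond must have the same length")
    return [u + guidance_scale * (c - u) for c, u in zip(cond, uncond)]


if __name__ == "__main__":
    # Toy two-dimensional "predictions"; real ones would be model outputs.
    conditional = [1.0, 2.0]
    unconditional = [0.0, 1.0]
    print(cfg_combine(conditional, unconditional, 2.0))
```

In a diffusion sampler this blend would be applied at every denoising step, with a solver such as DPM-Solver deciding how the guided prediction updates the sample.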

Model Limitations and Responsible Use

  • English and Chinese Only: The model is trained solely on these languages; input in other languages may produce unintelligible or offensive output.
  • No Overlapping Speech: While it supports turn-taking, VibeVoice-1.5B does not model overlapping speech between speakers.
  • Speech-Only: The model does not generate background sounds, Foley, or music. Audio output is strictly speech.
  • Legal and Ethical Risks: Microsoft explicitly prohibits use for voice impersonation, disinformation, or authentication bypass. Users must comply with applicable laws and disclose AI-generated content.
  • Not for Professional Real-Time Applications: While efficient, this release is not optimized for low-latency, interactive, or live-streaming scenarios; those are the stated target of the upcoming 7B variant.
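Two of these limits (at most four speakers, strictly sequential turns) can be checked before any audio is generated. The sketch below is a hypothetical pre-flight validator: the `Speaker N: text` line format and the `parse_script` helper are assumptions made for illustration and are not taken from the article or from Microsoft's tooling.

```python
import re
from typing import List, Tuple

MAX_SPEAKERS = 4  # limit quoted in the article

# Hypothetical script format: one turn per line, e.g. "Speaker 1: Hello".
_TURN = re.compile(r"^Speaker\s+(\d+)\s*:\s*(.+)$")


def parse_script(text: str) -> List[Tuple[int, str]]:
    """Parse a dialogue script into ordered (speaker_id, utterance) turns.

    One turn per line means turns are sequential by construction, which
    matches the model's lack of overlapping speech. Raises ValueError on
    a malformed line or when more than MAX_SPEAKERS speakers appear.
    """
    turns: List[Tuple[int, str]] = []
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue  # blank lines between turns are allowed
        match = _TURN.match(line)
        if match is None:
            raise ValueError(f"line {lineno}: expected 'Speaker N: text'")
        turns.append((int(match.group(1)), match.group(2).strip()))

    speakers = {speaker for speaker, _ in turns}
    if len(speakers) > MAX_SPEAKERS:
        raise ValueError(
            f"{len(speakers)} speakers found; at most {MAX_SPEAKERS} are supported"
        )
    return turns


if __name__ == "__main__":
    script = "Speaker 1: Welcome to the show.\nSpeaker 2: Glad to be here."
    for speaker, utterance in parse_script(script):
        print(speaker, utterance)
```

A check like this only guards the input shape; it says nothing about language coverage or the usage restrictions above, which remain the user's responsibility.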

Conclusion

Microsoft’s VibeVoice-1.5B is a significant step for open TTS: scalable, expressive, and multi-speaker, with a lightweight diffusion-based architecture that makes long-form, conversational audio synthesis accessible to researchers and open-source developers. Use is currently research-focused and limited to English and Chinese, but the model’s capabilities, together with the promise of upcoming versions, point to a real shift in how AI can generate synthetic speech.

For technical teams, content creators, and AI enthusiasts, VibeVoice-1.5B is well worth exploring for the next generation of synthetic voice applications. It is available now on Hugging Face and GitHub, with documentation and an open license. As the field moves toward more expressive, interactive, and ethically transparent TTS, Microsoft’s latest offering stands out among open-source speech synthesis releases.


FAQs

What makes VibeVoice-1.5B different from other text-to-speech models?

VibeVoice-1.5B can generate up to 90 minutes of expressive, multi-speaker audio (up to four speakers), supports cross-lingual and singing synthesis, and is fully open source under the MIT license, which together push the boundaries of long-form conversational audio generation.

What hardware is recommended for running the model locally?

Community tests show that generating a multi-speaker dialog with the 1.5B checkpoint consumes about 7 GB of GPU VRAM, so an 8 GB consumer card (e.g., an RTX 3060) is generally sufficient for inference.
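As a rough sanity check on that figure, the sketch below adds up the parameter counts quoted in this article and converts them to weight memory. Several inputs are assumptions: the weights are taken to be stored in 16-bit precision (2 bytes per parameter), the semantic encoder is taken to be about the same size as the acoustic encoder (the article says only that it mirrors the design), and the 1.5B LLM size is the nominal figure.

```python
from typing import Dict

# Parameter counts as quoted in the article, except where noted.
COMPONENT_PARAMS: Dict[str, float] = {
    "llm": 1.5e9,                # nominal size of the Qwen2.5-1.5B backbone
    "acoustic_encoder": 340e6,
    "acoustic_decoder": 340e6,
    "semantic_encoder": 340e6,   # assumption: same size as the acoustic encoder
    "diffusion_head": 123e6,
}

BYTES_PER_PARAM = 2  # assumption: 16-bit weights (fp16 or bf16)


def weight_memory_gib(params: Dict[str, float], bytes_per_param: int) -> float:
    """Memory needed to hold the weights alone, in GiB."""
    total_bytes = sum(params.values()) * bytes_per_param
    return total_bytes / 2**30


if __name__ == "__main__":
    gib = weight_memory_gib(COMPONENT_PARAMS, BYTES_PER_PARAM)
    print(f"estimated weight memory: {gib:.2f} GiB")
```

Under these assumptions the weights alone come to a little under 5 GiB. The rest of the reported usage would then be activations, the attention cache for long sequences, and framework overhead, none of which this estimate models.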

Which languages and audio styles does the model support today?

VibeVoice-1.5B is trained only on English and Chinese and can perform cross-lingual narration (e.g., English prompt → Chinese speech) as well as basic singing synthesis. It produces speech only, with no background sounds, and it does not model overlapping speakers; turn-taking is sequential.




Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence Media Platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform draws over 2 million monthly views, illustrating its popularity among audiences.


