Thursday, June 4, 2026
mGrowTech

Miso Labs Releases MisoTTS: An 8B Emotive Text-to-Speech Model with Open Weights

By Josh
June 4, 2026


Miso Labs has released MisoTTS, an open-weights 8-billion-parameter text-to-speech model. It generates expressive speech from both text and audio context. The model uses residual vector quantization (RVQ) to widen its sonic range without scaling a single flat vocabulary, so the parameter count stays fixed.

What is MisoTTS

MisoTTS is an 8B-parameter text-to-dialogue RVQ Transformer. It is inspired by the Sesame CSM architecture. It pairs a Llama 3.2-style backbone with a smaller audio decoder. It generates Mimi audio codes from text and optional audio context. The model conditions on both text and prior audio. That second input lets it respond to the speaker’s tone.

The text vocabulary is 128,256 tokens, and there are 32 audio codebooks. Mimi is the audio tokenizer, and max sequence length is 2,048. Default inference runs in torch.bfloat16.

Miso Labs claims a latency of 110 ms for MisoTTS. For comparison, it lists ElevenLabs at 700 ms and Sesame at 300 ms.

The Vocabulary Size Problem

Standard transformers generate from a fixed vocabulary of discrete tokens. That works when a small vocabulary covers the target space. Human speech does not fit that assumption. It varies across pitch, rhythm, emphasis, emotion, and accent.

Expanding the audio vocabulary is the obvious fix. But larger vocabularies need more parameters in a standard transformer. Each token must be represented and predicted by the model. Miso Labs calls this the vocabulary size problem.
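To make the scaling argument concrete, the rough sketch below counts only the vocabulary-dependent parameters of a flat-vocabulary transformer: the input embedding table and the output projection. The hidden width of 4096 is an illustrative assumption, not a published MisoTTS figure, and `flat_vocab_params` is a hypothetical helper.

```python
def flat_vocab_params(vocab_size: int, hidden_dim: int) -> int:
    """Parameters in an untied input embedding plus output projection."""
    embedding = vocab_size * hidden_dim      # one vector per token
    output_head = hidden_dim * vocab_size    # one logit per token
    return embedding + output_head


HIDDEN_DIM = 4096  # assumption for illustration only

# Each step multiplies the flat vocabulary by 2048, and the
# vocabulary-related parameter count grows by the same factor.
for vocab in (2_048, 2_048 ** 2, 2_048 ** 3):
    params = flat_vocab_params(vocab, HIDDEN_DIM)
    print(f"vocab={vocab:>13,}  vocab-related params={params:,}")
```

Under these assumptions the cost grows linearly with vocabulary size, which is why a single flat audio vocabulary becomes impractical long before it covers the range of human speech.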

The second issue is conditioning. Most TTS models condition only on text. They ignore the interlocutor’s tone. Miso Labs argues this contributes to the “uncanny valley” effect.

Residual Vector Quantization: The Core Idea

MisoTTS addresses both problems with residual vector quantization (RVQ). Miso Labs traces RVQ to image-generation research and to Sesame’s CSM for audio. Instead of one token index, the model emits a vector of indices.

Each audio token is 32 codebook indices over 2048-way codebooks. The model keeps a separate codebook for each position in the vector. To recover the sound, it sums the looked-up vectors. Each codebook adds another refinement to the signal.
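The mechanics described above can be sketched in a few lines of plain Python. This is a toy illustration of generic residual vector quantization, not Miso Labs' or Mimi's implementation: the codebooks are random, the sizes are shrunk, and the helper names (`nearest`, `rvq_encode`, `rvq_decode`) are hypothetical.

```python
import random


def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )


def rvq_encode(codebooks, vec):
    """One index per codebook; each level quantizes the leftover residual."""
    residual, indices = list(vec), []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        indices.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices


def rvq_decode(codebooks, indices):
    """Reconstruct by summing the looked-up vectors, one per codebook."""
    out = [0.0] * len(codebooks[0][0])
    for codebook, idx in zip(codebooks, indices):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Toy sizes (assumption): 4 codebooks of 16 entries in 8 dimensions.
# The article describes 32 codebooks of 2048 entries each.
rng = random.Random(0)
DEPTH, SIZE, DIM = 4, 16, 8
codebooks = [
    [[rng.gauss(0, 1.0 / (level + 1)) for _ in range(DIM)] for _ in range(SIZE)]
    for level in range(DEPTH)
]

target = [rng.gauss(0, 1) for _ in range(DIM)]
codes = rvq_encode(codebooks, target)   # a vector of indices, one per codebook
recon = rvq_decode(codebooks, codes)    # sum of the selected entries
print(codes)
```

Because the codebooks here are random rather than trained, the sketch shows only how indices map to a summed vector, not the reconstruction quality a learned tokenizer would reach.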

This is what makes the scaling work. Addressable vocabulary equals codebook size raised to the depth. Growing the depth adds no parameters to the model. So MisoTTS reaches about 2048³², or roughly 10¹⁰⁵ addressable tokens. Miso Labs notes naive scaling would require a far larger network.
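The size of that addressable space is easy to check with exact integer arithmetic. The short sketch below uses only the two figures given in the article (2048-way codebooks, depth 32); the variable names are illustrative.

```python
import math

CODEBOOK_SIZE = 2048   # entries per codebook, from the article
DEPTH = 32             # number of codebooks, from the article

# Python integers are arbitrary precision, so this value is exact.
addressable = CODEBOOK_SIZE ** DEPTH

bits_per_token = DEPTH * math.log2(CODEBOOK_SIZE)
log10_size = DEPTH * math.log10(CODEBOOK_SIZE)

print(f"bits per audio token: {bits_per_token:.0f}")
print(f"log10(addressable):   {log10_size:.2f}")
print(f"decimal digits:       {len(str(addressable))}")
```

Since 2048 is 2¹¹, each audio token carries 11 × 32 = 352 bits of index information, and the count of distinct tokens falls between 10¹⁰⁵ and 10¹⁰⁶.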

https://www.misolabs.ai/blog/miso-tts-8b

The Two-Transformer Architecture

The model splits into a backbone and a decoder. The backbone is a 7.7B-parameter transformer, autoregressive over time. It predicts the first codebook index and a final hidden state.

A 300M-parameter decoder then runs autoregressively over depth. It predicts the remaining codebook indices, one position at a time. Each prediction conditions on the indices already chosen in the frame. The same 300M parameters are reused for every position.
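The control flow of the two stages can be sketched as follows. The two model functions are random stand-ins, not the real networks, and every name here (`backbone_step`, `decoder_step`, `generate_frame`, `generate`) is hypothetical; only the loop structure reflects the description above.

```python
import random

NUM_CODEBOOKS = 32     # from the article
CODEBOOK_SIZE = 2048   # from the article


def backbone_step(history, rng):
    """Stand-in for the 7.7B backbone: returns (first index, hidden state)."""
    hidden = [rng.random() for _ in range(4)]   # placeholder hidden state
    return rng.randrange(CODEBOOK_SIZE), hidden


def decoder_step(hidden, chosen, rng):
    """Stand-in for the 300M decoder: next index given the hidden state
    and the indices already chosen in this frame."""
    return rng.randrange(CODEBOOK_SIZE)


def generate_frame(history, rng):
    """One audio frame: the backbone picks index 1, the decoder fills 2..32."""
    first, hidden = backbone_step(history, rng)
    frame = [first]
    for _ in range(1, NUM_CODEBOOKS):
        # The same decoder (same weights) is reused at every depth position.
        frame.append(decoder_step(hidden, frame, rng))
    return frame


def generate(num_frames, seed=0):
    """Autoregressive over time: each new frame sees all previous frames."""
    rng = random.Random(seed)
    history = []
    for _ in range(num_frames):
        history.append(generate_frame(history, rng))
    return history


frames = generate(num_frames=3)
print(len(frames), len(frames[0]))
```

The outer loop runs once per time step on the large model, while the inner loop runs 31 times per frame on the small one, which is how the design keeps most of the per-frame work on the cheaper network.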

Embeddings follow the same logic. Text tokens use a single lookup. An audio token’s embedding is the sum of per-position codebook lookups. Interleaving text and audio lets the backbone use conversation history. That is how it carries context across turns.
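A minimal sketch of that embedding scheme is shown below. The tables are random and the sizes are toy values chosen for readability; `embed_text` and `embed_audio` are hypothetical names, not functions from the MisoTTS repository.

```python
import random

rng = random.Random(0)
EMBED_DIM = 8        # toy width (assumption)
TEXT_VOCAB = 100     # toy size; the article lists 128,256
NUM_CODEBOOKS = 4    # toy depth; the article lists 32
CODEBOOK_SIZE = 16   # toy size; the article lists 2048


def table(rows):
    """Random embedding table with one EMBED_DIM vector per row."""
    return [[rng.gauss(0, 1) for _ in range(EMBED_DIM)] for _ in range(rows)]


text_table = table(TEXT_VOCAB)
audio_tables = [table(CODEBOOK_SIZE) for _ in range(NUM_CODEBOOKS)]


def embed_text(token_id):
    """Text tokens use a single lookup."""
    return text_table[token_id]


def embed_audio(frame):
    """Audio tokens sum one lookup per codebook position."""
    out = [0.0] * EMBED_DIM
    for position, index in enumerate(frame):
        row = audio_tables[position][index]
        out = [o + r for o, r in zip(out, row)]
    return out


# Interleave a text turn and an audio turn into one backbone input sequence.
sequence = [embed_text(t) for t in (5, 17, 42)]
sequence += [embed_audio(f) for f in ([1, 2, 3, 4], [0, 15, 7, 9])]
print(len(sequence), len(sequence[0]))
```

Both kinds of token end up as vectors of the same width, so the backbone can attend over text and audio from earlier turns in a single sequence.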

Strengths and Challenges

Strengths:

  • Open weights on day one, under a modified MIT license.
  • RVQ scales the sonic range without scaling parameter count.
  • Conditions on audio context, not text alone.
  • Local deployment keeps sensitive audio data in-house.
  • The architecture and math are documented in a public blog post.

Challenges:

  • Half-duplex only, with no turn-taking yet.
  • The large model needs a capable CUDA GPU.
  • API access is announced but not yet available.
  • Latency and quality claims still need third-party testing.

Marktechpost’s Visual Explainer


Open-Weights Release · June 3, 2026

MisoTTS

An 8B emotive text-to-speech model from Miso Labs, built on residual vector quantization and conditioned on both text and audio.

8B params
RVQ Transformer
Mimi codes
modified MIT

What MisoTTS Is

A text-to-dialogue RVQ Transformer

  • An 8B-parameter model inspired by the Sesame CSM architecture.
  • Pairs a Llama 3.2-style backbone with a smaller audio decoder.
  • Generates Mimi audio codes from text and optional audio context.
  • Conditions on prior audio, so output responds to speaker tone.

At a Glance

Published specifications

  • Parameters: 8B (7.7B + 300M)
  • Architecture: RVQ Transformer
  • Audio codebooks: 32 (2048-way)
  • Default precision: torch.bfloat16

The Motivation

The vocabulary size problem

  • Transformers generate from a fixed vocabulary of discrete tokens.
  • Speech varies in pitch, rhythm, emphasis, emotion, and accent.
  • A bigger audio vocabulary needs more parameters in a standard transformer.
  • Most TTS models condition only on text and ignore tone, which Miso Labs links to the “uncanny valley” effect.

The Core Idea

Residual vector quantization

  • The model emits a vector of indices, not a single token index.
  • Each token is 32 codebook indices over 2048-way codebooks.
  • Summing the looked-up vectors reconstructs the sound.
  • Depth scales addressable vocabulary to ~2048³² (≈10¹⁰⁵) with no added parameters.

Architecture

Two transformers, one vector token

  • Backbone (7.7B) — autoregressive over time; predicts codebook index k₁ and hidden state h₀.
  • Decoder (300M) — autoregressive over depth; predicts k₂ through k₃₂.
  • The same 300M parameters are reused for every position.
  • Interleaved text and audio let the backbone use conversation history.

Run It Locally

Inference in a few lines

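# Loader exposed by the repo's generator module
from generator import load_miso_8b
import torchaudio

# Load the weights from the given path or repo id onto the GPU
gen = load_miso_8b(device="cuda",
    model_path_or_repo_id="MisoLabs/MisoTTS")

# Synthesize a single turn with no prior audio context
audio = gen.generate(
    text="Hello from Miso.",
    speaker=0, context=[],
    max_audio_length_ms=10_000)

# torchaudio.save expects a (channels, samples) tensor on the CPU
torchaudio.save("miso.wav",
    audio.unsqueeze(0).cpu(), gen.sample_rate)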

Setup uses uv with Python 3.10. Weights download from Hugging Face. Audio is watermarked by default via SilentCipher. One-shot voice cloning works from a ~10-second clip.

Limitations

Where it stops, for now

  • Handles individual turns only; no turn-taking yet.
  • Generates half-duplex audio — it cannot speak while the other party speaks.
  • Miso Labs frames full-duplex and turn-taking as future work.
  • API access is announced but not yet available.

Key Takeaways

The short version

  • Open-weights 8B TTS under a modified MIT license.
  • Conditions on text and audio, so output tracks speaker tone.
  • RVQ scales vocabulary to ~2048³² without adding parameters.
  • 7.7B backbone over time, 300M decoder over depth.
  • Half-duplex and single-turn today; API access pending.

Key Takeaways

  • Miso Labs open-sourced MisoTTS, an 8B text-to-speech model, under a modified MIT license.
  • It conditions on both text and audio context, making generations responsive to speaker tone.
  • Residual vector quantization (32 codebooks × 2048-way) scales vocabulary to ~2048³² without adding parameters.
  • Architecture splits a 7.7B backbone (over time) and a 300M decoder (over depth).
  • It is half-duplex and single-turn only today; API access is still pending.
