Saturday, June 20, 2026
mGrowTech

Inworld AI Launches Realtime TTS-2: A Closed-Loop Voice Model That Adapts to How You Actually Talk

by Josh
May 6, 2026
in AI, Analytics and Automation


Voice AI has a dirty secret: most of it was never designed for conversation. The dominant paradigm — feed text in, get audio out — traces its lineage to audiobook narration and voiceover production, where the model never hears the person on the other end. That’s fine when you’re generating a podcast intro. It’s not fine when a frustrated user is trying to get support from an AI agent at 11pm.

Inworld AI is calling that out directly with the launch of Realtime TTS-2, a new voice model released as a research preview via its Inworld API and Inworld Realtime API. The model hears the full audio of the exchange, picks up the user’s tone, pacing and emotional state, then takes voice direction in plain English the way developers prompt an LLM.

What’s Actually Different Here

The meaningful architectural distinction with TTS-2 is that it operates as a closed-loop system. The model takes the actual audio of the prior turns of the exchange as input, not just a transcript — it hears how the user actually sounded. That’s a non-trivial difference. A transcript of “okay, fine” gives you the words. The audio of “okay, fine” tells you whether the person is relieved, resigned, or sarcastic. TTS-2 is designed to use that signal.

The same line lands differently after a joke than after bad news, and the model knows the difference because it heard the prior turn. Tone, pacing, and emotional state carry forward automatically. Practically speaking, audio context flows across turns inside a Realtime session without developers needing to pass explicit prior_audio fields or build additional plumbing.
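The contrast between open-loop and closed-loop synthesis can be illustrated with a toy model. This is only a sketch of the idea, not the Inworld API: `ToySession`, `Turn`, `open_loop_context`, and the `prior_audio` key in the returned dictionary are all hypothetical names chosen for illustration. The point it shows is that the session, rather than the caller, carries prior audio forward.

```python
from dataclasses import dataclass, field


@dataclass
class Turn:
    speaker: str   # "user" or "agent"
    text: str      # transcript of the turn
    audio: bytes   # audio of the turn (placeholder bytes in this sketch)


@dataclass
class ToySession:
    """Hypothetical stand-in for a realtime session; not the Inworld API."""
    turns: list = field(default_factory=list)

    def add_user_turn(self, text: str, audio: bytes) -> None:
        self.turns.append(Turn("user", text, audio))

    def synthesis_context(self, reply_text: str) -> dict:
        # Closed loop: the session attaches every prior turn's audio itself,
        # so the caller only supplies the reply text.
        return {
            "text": reply_text,
            "prior_audio": [t.audio for t in self.turns],
        }


def open_loop_context(reply_text: str) -> dict:
    # Open loop (text in, audio out): no record of how the user sounded.
    return {"text": reply_text, "prior_audio": []}


session = ToySession()
session.add_user_turn("okay, fine", b"\x00\x01")  # fake audio bytes
closed = session.synthesis_context("Thanks for your patience.")
opened = open_loop_context("Thanks for your patience.")
```

In this sketch the caller's code is identical in both cases apart from going through the session, which is the sense in which no extra plumbing is needed.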

Four Capabilities, One Model

The Inworld team is shipping TTS-2 with four key features and positions the combination, not any individual piece, as the differentiator.

  1. Voice Direction: It lets developers steer delivery with plain-language prompts placed inline at inference time. Instead of selecting from a fixed emotion enum like [sad] or [excited], developers pass a bracket tag like [speak sadly, as if something bad just happened] directly in the text. Long, descriptive prompts work far better than single-word labels because they give the model fuller context. Inline non-verbal markers like [laugh], [sigh], [breathe], [clear_throat], and [cough] can be dropped anywhere in the text where the moment should occur, and the model renders them as audio events, not pronounced words.
  2. Conversational Awareness: This is the closed-loop architecture described above, the shift that separates TTS-2 from prior-generation models that treat each sentence as a stateless generation call.
  3. Crosslingual Support: One voice identity is preserved across more than 100 languages, including mid-utterance language switches inside a single generation. No language flag is needed: the model handles transitions automatically, keeping timbre, pitch, and character constant across the switch. The top-tier languages ship at native-speaker quality, while the long tail is described as launch-window experimental, consistent with the model's release as a research preview.
  4. Advanced Voice Design: It generates a saved voice from a written prompt, with no reference audio required. Developers can describe a person in prose, save the result as a reusable voice, and call it like any other voice in the app. Voice Design ships with three stability modes: Expressive (for live consumer conversation and companions), Balanced (the default for most agent workloads), and Stable (for IVR and professional deployments where pitch drift is unacceptable).
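To make the bracket-tag convention and the stability modes concrete, here is a minimal sketch of how an application might assemble directed text before sending it for synthesis. The helper names (`direct`, `marker`, `pick_stability`), the workload labels, and the lowercase enum values are assumptions for illustration and are not taken from Inworld's documentation. Placing the direction tag at the start of the line is also an assumption. Only the tag syntax, the five marker names, and the three mode names come from the article.

```python
from enum import Enum

# Non-verbal markers named in the article.
NON_VERBAL_MARKERS = {"laugh", "sigh", "breathe", "clear_throat", "cough"}


def direct(text: str, direction: str) -> str:
    """Prefix text with a plain-language direction in a bracket tag."""
    direction = direction.strip()
    if not direction or "[" in direction or "]" in direction:
        raise ValueError("direction must be non-empty and contain no brackets")
    return f"[{direction}] {text}"


def marker(name: str) -> str:
    """Return an inline non-verbal marker, rejecting names not in the article."""
    if name not in NON_VERBAL_MARKERS:
        raise ValueError(f"unknown marker: {name}")
    return f"[{name}]"


class StabilityMode(Enum):
    EXPRESSIVE = "expressive"  # live consumer conversation, companions
    BALANCED = "balanced"      # default for most agent workloads
    STABLE = "stable"          # IVR and professional deployments


def pick_stability(workload: str) -> StabilityMode:
    """Map a hypothetical workload label to a mode; Balanced is the default."""
    table = {
        "companion": StabilityMode.EXPRESSIVE,
        "ivr": StabilityMode.STABLE,
    }
    return table.get(workload, StabilityMode.BALANCED)


line = direct(
    f"I checked the order. {marker('sigh')} It has not shipped yet.",
    "speak sadly, as if something bad just happened",
)
mode = pick_stability("ivr")
```

Validating marker names locally is a design choice of this sketch: a misspelled marker would otherwise risk being read aloud as a word instead of rendered as an audio event.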

The Conversational Layer Underneath

Beyond the four key features, Inworld calls out a set of behaviors that push speech further into what it describes as “person paying attention” territory. The most technically interesting is disfluencies: the model generates natural “uh” and “um” fillers, self-corrections, mid-noun-phrase pauses, and trailing thoughts that signal warmth and recall rather than malfunction. Critically, different speaker profiles cluster fillers differently, and the model follows the rhythm: filler-as-energy sounds different from filler-as-hesitation. Voice cloning is also supported via a two-step API: upload a reference sample (5–15 seconds, clean, single speaker) to /voices/v1/voices:clone, receive a voice ID, and use it like any other voice.
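The first step of the cloning flow might look roughly like the sketch below, which only builds the upload request and does not send it. The path /voices/v1/voices:clone and the 5–15 second guideline come from the article. Everything else is an assumption made for illustration: the placeholder host, the JSON body with an `audio_base64` field, and the bearer-token header. Consult the official docs for the real request shape.

```python
import base64
import json
import urllib.request

BASE_URL = "https://api.example.invalid"  # hypothetical placeholder host
CLONE_PATH = "/voices/v1/voices:clone"    # path quoted in the article


def build_clone_request(
    sample: bytes, duration_s: float, api_key: str
) -> urllib.request.Request:
    """Build (but do not send) a voice-clone upload request."""
    # The article asks for a clean, single-speaker sample of 5-15 seconds.
    if not 5 <= duration_s <= 15:
        raise ValueError("reference sample should be 5-15 seconds long")
    body = json.dumps(
        {"audio_base64": base64.b64encode(sample).decode("ascii")}  # assumed field
    ).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + CLONE_PATH,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
        },
    )


request = build_clone_request(b"fake-audio-bytes", duration_s=10.0, api_key="KEY")
```

The second step, per the article, is to take the voice ID from the response and pass it wherever a voice is selected. The response field that carries the ID is not named in the article, so it is left out here.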

Where It Fits in the Stack

TTS-2 is one layer in Inworld’s broader Realtime API pipeline. The full stack has three parts: Realtime STT, which transcribes and profiles the speaker in one pass, capturing age, accent, pitch, vocal style, emotional tone, and pacing as structured signals on the same connection; a Realtime Router, which routes across 200+ models and selects the appropriate model and tools based on the user’s state and conversation context; and TTS-2 at the output layer. The pipeline runs over a single persistent WebSocket connection, with sub-200ms median time-to-first-audio for the TTS layer.
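For readers who want to check a latency claim like this against their own integration, median time-to-first-audio can be computed from two timestamps per request: when the text was sent and when the first audio chunk arrived. The function name and the sample timestamps below are made up for illustration and are not measurements of TTS-2.

```python
import statistics


def median_ttfa_ms(request_sent: list, first_audio: list) -> float:
    """Median time-to-first-audio in milliseconds.

    Both lists hold timestamps in seconds (e.g. from time.monotonic()),
    one pair per synthesis request.
    """
    if not request_sent or len(request_sent) != len(first_audio):
        raise ValueError("need equal-length, non-empty timestamp lists")
    latencies = [(b - a) * 1000.0 for a, b in zip(request_sent, first_audio)]
    if any(latency < 0 for latency in latencies):
        raise ValueError("first audio cannot precede the request")
    return statistics.median(latencies)


# Illustrative timestamps only: three requests with roughly
# 150 ms, 180 ms, and 250 ms to first audio.
sent = [0.0, 1.0, 2.0]
first = [0.15, 1.18, 2.25]
ttfa = median_ttfa_ms(sent, first)
under_target = ttfa < 200.0
```

A median hides tail latency, so a real evaluation would likely also track a high percentile alongside it.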

Source: https://artificialanalysis.ai/text-to-speech/leaderboard (data as of May 5, 2026)

The Broader Context

Realtime TTS 1.5 already ranks #1 on the Artificial Analysis Speech Arena (as of May 5, 2026), ahead of Google (#2) and ElevenLabs (#3). The launch of TTS-2 signals that Inworld considers raw audio quality a solved problem — and is now competing on the behavioral layer: context-awareness, steerability, and identity consistency across languages.

