• About Us
  • Disclaimer
  • Contact Us
  • Privacy Policy
Wednesday, June 3, 2026
mGrowTech
No Result
View All Result
  • Technology And Software
    • Account Based Marketing
    • Channel Marketing
    • Marketing Automation
      • Al, Analytics and Automation
      • Ad Management
  • Digital Marketing
    • Social Media Management
    • Google Marketing
  • Direct Marketing
    • Brand Management
    • Marketing Attribution and Consulting
  • Mobile Marketing
  • Event Management
  • PR Solutions
  • Technology And Software
    • Account Based Marketing
    • Channel Marketing
    • Marketing Automation
      • Al, Analytics and Automation
      • Ad Management
  • Digital Marketing
    • Social Media Management
    • Google Marketing
  • Direct Marketing
    • Brand Management
    • Marketing Attribution and Consulting
  • Mobile Marketing
  • Event Management
  • PR Solutions
No Result
View All Result
mGrowTech
No Result
View All Result
Home Al, Analytics and Automation

NVIDIA Releases Cosmos 3: A Two-Tower Mixture-of-Transformers Foundation Model Unifying Physical Reasoning, World Generation, and Action Generation

Josh by Josh
June 3, 2026
in Al, Analytics and Automation
0
NVIDIA Releases Cosmos 3: A Two-Tower Mixture-of-Transformers Foundation Model Unifying Physical Reasoning, World Generation, and Action Generation


NVIDIA AI team have released Cosmos 3. It is a family of omnimodal world models for physical AI. The models combine physical reasoning, world generation, and action generation. All three capabilities live inside one open model. NVIDIA open sourced the checkpoints, training scripts, deployment tools, and datasets. The Cosmos 3 release targets robotics, autonomous vehicles, and warehouse monitoring teams.

NVIDIA Cosmos 3

Physical AI systems must understand the world before acting in it. Robots and vehicles need to perceive, predict, and then act. Earlier Cosmos releases split these jobs across separate models. Cosmos 3 unifies them with a Mixture-of-Transformers (MoT) architecture. The architecture is built around two towers.

The reasoner tower is a vision-language model (VLM). It interprets images, videos, and text using an autoregressive architecture. It understands motion, object interactions, and other physical context. NVIDIA team describes this tower as the model’s brain.

The generator tower produces future observations and action sequences. It uses a diffusion-based process for physics-aware video and actions. These outputs are conditioned on the reasoner tower’s understanding. Information flows one way, from reasoner to generator. The reasoner can run alone. The generator always activates both towers for guided generation.

A single model can therefore handle reasoning and generation together.

https://developer.nvidia.com/blog/develop-physical-ai-reasoning-world-and-action-models-with-nvidia-cosmos-3

The Model Family

NVIDIA team describes three model scales: Edge, Nano, and Super. Each uses the dual-tower Mixture-of-Transformers design. The two towers are initialized from pre-trained Qwen3-VL weights. That roughly doubles the parameter count of the backbone transformer.

Cosmos3-Nano is a 16B model built on a dense 8B transformer. It adapts the Qwen3-VL 8B architecture. Nano targets efficient inference on workstation GPUs. It runs on hardware like the NVIDIA RTX PRO 6000. That suits real-time robotics and on-device physical AI.

Cosmos3-Super is a 64B model built on a dense 32B transformer. It adapts the Qwen3-VL 32B architecture. Super targets datacenter GPUs, including NVIDIA Hopper and Blackwell. It fits large-scale synthetic data generation and advanced reasoning.

This release ships Nano and Super, along with task-specific variants. These include Super Text2Image, Super Image2Video, and Nano-Policy-DROID.

How the Unified Design Works

Both towers share one transformer architecture and a joint attention operator. They use a 3D multimodal rotary position embedding (mRoPE). mRoPE aligns video, audio, and action tokens on one temporal axis. In Reasoner Mode, tokens pass through causal self-attention. This enables next-token prediction for perception, planning, and reasoning. In Generator Mode, noisy tokens are denoised through full attention. The autoregressive tokens are never updated by the diffusion tokens.

The model treats action as a core modality with dedicated action tokens. Supported inputs include text, image, video, and JSON action arrays. Outputs include images, video, synchronized sound, action states, and text. The reasoner follows Qwen3-VL-compatible message conventions for vision inputs.

Generation supports 256p, 480p, and 720p resolution tiers. Frame counts range from 5 to 300, defaulting to 189. That equals about 7.9 seconds of video at 24 FPS. Sound is generated as stereo AAC at 48 kHz. Action conditioning spans camera, vehicle, egocentric, single-arm, dual-arm, and humanoid embodiments. Each embodiment uses a fixed action dimension, such as 9D for cameras.

The Benchmark Case

NVIDIA team evaluated Cosmos 3 across reasoning and generation suites. On reasoning, Super and Nano lead VANTAGE-Bench at their respective tiers. VANTAGE-Bench tests VLMs on real-world fixed-camera footage. It covers warehouses, transportation, and smart spaces. Cosmos 3 also tops the Traffic Anomaly Reasoning (TAR) leaderboard. TAR is the official leaderboard for AI City Challenge 2026 Track 3.

On generation, NVIDIA reports open-source state-of-the-art results. Cosmos 3 is the open-source SOTA on R-Bench. It also leads PAI-Bench, Physics-IQ, and RoboLab on public leaderboards. On Artificial Analysis, it leads two open-source leaderboards. These cover text-to-image and image-to-video without audio.

NVIDIA team also introduced its Cosmos Human Evaluation framework, called HUE. HUE decomposes each generated video into yes/no fact questions. It scores four dimensions across seven physical AI domains. The dimensions are semantic alignment, physical laws, geometric reasoning, and visual integrity. A VLM pipeline drafts the questions, and human experts refine them.

Marktechpost’s Visual Explainer




marktechpost@guide ~ /nvidia/cosmos-3
01 / 09

DEVELOPER GUIDE · PHYSICAL AI

NVIDIA Cosmos 3

Open omnimodal world models for physical AI.

Released May 31, 2026. One model for physical reasoning, world generation, and action generation.

Mixture-of-Transformers
Open weights
OpenMDW-1.1

Use ← → or swipe to navigate

01 · WHAT IT IS

A unified model for understanding and generation

Cosmos 3 is a family of omnimodal world models for physical AI. Earlier Cosmos releases split jobs across separate models. Cosmos 3 unifies them in a single open model.

  • Physical reasoning over images, video, and text.
  • World generation of physics-aware video and sound.
  • Action generation for robots and autonomous systems.

Subsumes VLMs, video generators, world simulators, and world-action models.

02 · ARCHITECTURE

Two towers, one transformer

REASONER TOWER

An autoregressive vision-language model (VLM). It interprets motion, object interactions, and physical context. NVIDIA calls it the model’s brain.

GENERATOR TOWER

A diffusion-based path for physics-aware video and actions. It is conditioned on the reasoner’s understanding.

Information flows one way, reasoner → generator. Both towers share a 3D multimodal RoPE (mRoPE).

03 · MODEL FAMILY

Pick a size for your hardware

Cosmos3-Nano
16B total (dense 8B, Qwen3-VL 8B). Workstation GPUs like RTX PRO 6000. Real-time robotics.

Cosmos3-Super
64B total (dense 32B, Qwen3-VL 32B). Datacenter Hopper and Blackwell GPUs. Large-scale SDG.

Cosmos3-Edge
4B total (dense 2B). On-device scale. Planned for a later release.

Plus variants: Super-Text2Image, Super-Image2Video, and Nano-Policy-DROID.

READ ALSO

TinyFish Launches BigSet: An Open-Source Multi-Agent System That Builds Structured Live Datasets from Plain-English Descriptions

JetBrains Releases Mellum2: A 12B MoE Model for Fast, Specialized Tasks in Multi-Model AI Pipelines

04 · MODALITIES

Inputs, outputs, and generation settings

  • Inputs: text, image, video, and JSON action arrays.
  • Outputs: image, video, synchronized sound, action states, text.
  • Resolution: 256p, 480p, 720p. Sound: stereo AAC at 48 kHz.
  • Length: 5 to 300 frames; default 189 (about 7.9s at 24 FPS).
  • Embodiments: camera, vehicle, egocentric, single-arm, dual-arm, humanoid.

05 · BENCHMARKS

What NVIDIA reports

REASONING

Nano and Super lead VANTAGE-Bench at their tiers. Cosmos 3 tops TAR, the AI City Challenge 2026 Track 3 leaderboard.

GENERATION

Open-source SOTA on R-Bench. Leads PAI-Bench, Physics-IQ, and RoboLab. Top open-source on Artificial Analysis text-to-image and image-to-video.

HUE evaluates videos with yes/no fact checks across four dimensions and seven domains.

06 · OPEN RELEASE

Everything ships open

  • Checkpoints for Nano, Super, and task-specific variants.
  • Six SDG datasets: robotics, physics, spatial reasoning, human motion, driving, warehouses.
  • Training recipes: SFT plus action post-training.
  • Action modes: forward dynamics, inverse dynamics, and policy generation.
  • License: OpenMDW-1.1.

07 · DEPLOYMENT

Run it in production

  • NIM microservices: Reasoner NIM available now; Generator NIM later.
  • Quantization: BF16, FP8, and NVFP4. NVFP4 gives up to 2x speedup.
  • Serving: the Reasoner NIM stack is built on vLLM.
  • Efficient Video Sampling (EVS): prunes redundant video tokens at inference.

Use Diffusers and Transformers for research; vLLM-Omni and vLLM for serving.

08 · LIMITATIONS & START

Know the caveats, then build

Outputs can show temporal inconsistency, unstable motion, object morphing, inaccurate 3D structure, and sound-video misalignment. Safety-critical control needs validation, guardrails, and system-level analysis.

GitHubgithub.com/nvidia/cosmos

Hugging Facehuggingface.co/collections/nvidia/cosmos3

Key Takeaways

  • Cosmos 3 is NVIDIA’s open family of omnimodal world models, unifying physical reasoning, world generation, and action generation in one model.
  • A two-tower Mixture-of-Transformers design pairs an autoregressive VLM reasoner with a diffusion generator, conditioned one-way from reasoner to generator.
  • Two checkpoints ship now: Cosmos3-Nano (16B, dense 8B backbone) for workstations and Cosmos3-Super (64B, dense 32B backbone) for datacenters.
  • NVIDIA open sourced the checkpoints, six SDG datasets, training recipes, and the HUE benchmark under the OpenMDW-1.1 license.
  • It reports open-source SOTA on R-Bench and leading Artificial Analysis text-to-image and image-to-video results.

Check out the Model Weights, GitHub Repo, Project Page and Technical details. Also, feel free to follow us on Twitter and don’t forget to join our 150k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well.

Need to partner with us for promoting your GitHub Repo OR Hugging Face Page OR Product Release OR Webinar etc.? Connect with us




Source_link

Related Posts

TinyFish Launches BigSet: An Open-Source Multi-Agent System That Builds Structured Live Datasets from Plain-English Descriptions
Al, Analytics and Automation

TinyFish Launches BigSet: An Open-Source Multi-Agent System That Builds Structured Live Datasets from Plain-English Descriptions

June 2, 2026
JetBrains Releases Mellum2: A 12B MoE Model for Fast, Specialized Tasks in Multi-Model AI Pipelines
Al, Analytics and Automation

JetBrains Releases Mellum2: A 12B MoE Model for Fast, Specialized Tasks in Multi-Model AI Pipelines

June 2, 2026
Meet Memory OS: A 6-Layer Open-Source Memory Stack Built on Top of Hermes Agent
Al, Analytics and Automation

Meet Memory OS: A 6-Layer Open-Source Memory Stack Built on Top of Hermes Agent

June 1, 2026
Parallax: A Parameterized Local Linear Attention That Keeps Softmax and Adds a Learned Covariance Correction Branch
Al, Analytics and Automation

Parallax: A Parameterized Local Linear Attention That Keeps Softmax and Adds a Learned Covariance Correction Branch

June 1, 2026
An Implementation of the Microsoft Agent Governance Toolkit for Safe AI Agent Tool Use with Policies, Approvals, Audit Logs, and Risk Controls
Al, Analytics and Automation

An Implementation of the Microsoft Agent Governance Toolkit for Safe AI Agent Tool Use with Policies, Approvals, Audit Logs, and Risk Controls

May 31, 2026
Trajectory Releases a Concurrent Multi-LoRA Training Stack for Continual Learning, Reporting a 2.81× Experiment-Throughput Gain
Al, Analytics and Automation

Trajectory Releases a Concurrent Multi-LoRA Training Stack for Continual Learning, Reporting a 2.81× Experiment-Throughput Gain

May 31, 2026
Next Post
Power BI Development Process: Step-by-Step Guide for Businesses 

Power BI Development Process: Step-by-Step Guide for Businesses 

POPULAR NEWS

Trump ends trade talks with Canada over a digital services tax

Trump ends trade talks with Canada over a digital services tax

June 28, 2025
15 Trending Songs on TikTok in 2025 (+ How to Use Them)

15 Trending Songs on TikTok in 2025 (+ How to Use Them)

June 18, 2025
Communication Effectiveness Skills For Business Leaders

Communication Effectiveness Skills For Business Leaders

June 10, 2025
App Development Cost in Singapore: Pricing Breakdown & Insights

App Development Cost in Singapore: Pricing Breakdown & Insights

June 22, 2025
Comparing the Top 7 Large Language Models LLMs/Systems for Coding in 2025

Comparing the Top 7 Large Language Models LLMs/Systems for Coding in 2025

November 4, 2025

EDITOR'S PICK

Advanced Frontline Marketing Strategies Are Outperforming In B2B Orgs

Advanced Frontline Marketing Strategies Are Outperforming In B2B Orgs

June 4, 2025
New Perch AI model helps protect endangered species

New Perch AI model helps protect endangered species

August 7, 2025
I Tested FreeGF AI for 30 Days: Here’s what really happened

I Tested FreeGF AI for 30 Days: Here’s what really happened

October 2, 2025
How Buffer Helped Me Triple My LinkedIn Impressions and Grow My Newsletter

How Buffer Helped Me Triple My LinkedIn Impressions and Grow My Newsletter

December 2, 2025

About

We bring you the best Premium WordPress Themes that perfect for news, magazine, personal blog, etc. Check our landing page for details.

Follow us

Categories

  • Account Based Marketing
  • Ad Management
  • Al, Analytics and Automation
  • Brand Management
  • Channel Marketing
  • Digital Marketing
  • Direct Marketing
  • Event Management
  • Google Marketing
  • Marketing Attribution and Consulting
  • Marketing Automation
  • Mobile Marketing
  • PR Solutions
  • Social Media Management
  • Technology And Software
  • Uncategorized

Recent Posts

  • Precision at Scale: The Career and Professional Philosophy of Anubhav Mittal
  • Power BI Development Process: Step-by-Step Guide for Businesses 
  • NVIDIA Releases Cosmos 3: A Two-Tower Mixture-of-Transformers Foundation Model Unifying Physical Reasoning, World Generation, and Action Generation
  • What Is a Good Domain Rating? (With Real Data)
  • About Us
  • Disclaimer
  • Contact Us
  • Privacy Policy
No Result
View All Result
  • Technology And Software
    • Account Based Marketing
    • Channel Marketing
    • Marketing Automation
      • Al, Analytics and Automation
      • Ad Management
  • Digital Marketing
    • Social Media Management
    • Google Marketing
  • Direct Marketing
    • Brand Management
    • Marketing Attribution and Consulting
  • Mobile Marketing
  • Event Management
  • PR Solutions