MBZUAI Researchers Introduce PAN: A General World Model For Interactable Long Horizon Simulation

By Josh
November 16, 2025
in AI, Analytics and Automation


Most text to video models generate a single clip from a prompt and then stop. They do not keep an internal world state that persists as actions arrive over time. PAN, a new model from MBZUAI’s Institute of Foundation Models, is designed to fill that gap by acting as a general world model that predicts future world states as video, conditioned on history and natural language actions.

Paper: https://arxiv.org/pdf/2511.09057

From video generator to interactive world simulator

PAN is defined as a general, interactable, long horizon world model. It maintains an internal latent state that represents the current world, then updates that state when it receives a natural language action such as ‘turn left and speed up’ or ‘move the robot arm to the red block.’ The model then decodes the updated state into a short video segment that shows the consequence of that action. This cycle repeats, so the same world state evolves across many steps.

This design allows PAN to support open domain, action conditioned simulation. It can roll out counterfactual futures for different action sequences. An external agent can query PAN as a simulator, compare predicted futures, and choose actions based on those predictions.
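The interaction cycle described above can be sketched as a stateful step loop. The `WorldModel` class and its `step` method below are hypothetical stand-ins for PAN's interface, with strings standing in for latent states and decoded video segments:

```python
from dataclasses import dataclass, field

@dataclass
class WorldModel:
    """Toy stand-in for an interactable world model: it keeps a persistent
    internal state and updates it whenever a language action arrives."""
    state: str = "initial scene"
    history: list = field(default_factory=list)

    def step(self, action: str) -> str:
        # Predict the next world state from history + action, then
        # 'decode' it into an observation (here, just a string).
        self.history.append((self.state, action))
        self.state = f"{self.state} -> {action}"
        return f"video segment showing: {self.state}"

sim = WorldModel()
obs1 = sim.step("turn left and speed up")
obs2 = sim.step("move the robot arm to the red block")
```

The key property is that the same `sim` object carries state across both calls, which is what distinguishes a world model rollout from generating two unrelated clips.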

GLP architecture, separating what happens from how it looks

The base of PAN is the Generative Latent Prediction, GLP, architecture. GLP separates world dynamics from visual rendering. First, a vision encoder maps images or video frames into a latent world state. Second, an autoregressive latent dynamics backbone based on a large language model predicts the next latent state, conditioned on history and the current action. Third, a video diffusion decoder reconstructs the corresponding video segment from that latent state.

In PAN, the vision encoder and backbone are built on Qwen2.5-VL-7B-Instruct. The vision tower tokenizes frames into patches and produces structured embeddings. The language backbone runs over a history of world states and actions, plus learned query tokens, and outputs the latent representation of the next world state. These latents live in the shared multimodal space of the VLM, which helps ground the dynamics in both text and vision.
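At the level of tensor shapes, the backbone's latent prediction step can be sketched with NumPy. All sizes below, and the random linear map standing in for the LLM backbone, are illustrative assumptions, not the model's real dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64                                   # latent width (stand-in value)
hist = rng.standard_normal((32, d))      # tokens encoding past world states
action = rng.standard_normal((8, d))     # tokens for the current language action
queries = rng.standard_normal((16, d))   # learned query tokens

# The backbone runs over [history, action, queries]; a random linear map
# stands in for it here. The outputs at the query positions form the next
# latent world state, which the diffusion decoder would render as video.
seq = np.concatenate([hist, action, queries], axis=0)
out = seq @ rng.standard_normal((d, d))
next_state = out[-queries.shape[0]:]
```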

The video diffusion decoder is adapted from Wan2.1-T2V-14B, a diffusion transformer for high fidelity video generation. The research team trains this decoder with a flow matching objective, using one thousand denoising steps and a Rectified Flow formulation. The decoder conditions on both the predicted latent world state and the current natural language action, with a dedicated cross attention stream for the world state and another for the action text.
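A flow matching objective under the Rectified Flow formulation regresses the straight-line velocity between a noise sample and the clean target. A minimal single-step sketch, with a dummy model standing in for the decoder:

```python
import numpy as np

rng = np.random.default_rng(0)

def rectified_flow_loss(x1, model, rng):
    """One flow-matching training step: interpolate along the straight path
    x_t = (1 - t) * x0 + t * x1 between noise x0 and clean sample x1; the
    target velocity along that path is the constant x1 - x0."""
    x0 = rng.standard_normal(x1.shape)   # noise sample
    t = rng.uniform()                    # random time in [0, 1]
    xt = (1 - t) * x0 + t * x1           # linear interpolation
    v_target = x1 - x0                   # velocity the model should predict
    v_pred = model(xt, t)
    return float(np.mean((v_pred - v_target) ** 2))

x1 = rng.standard_normal((4, 8))         # stand-in for a clean video latent
loss = rectified_flow_loss(x1, lambda xt, t: np.zeros_like(xt), rng)
```

In the real system the model is conditioned on the predicted latent world state and the action text; here those conditions are omitted to keep the objective itself visible.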


Causal Swin DPM and sliding window diffusion

Naively chaining single shot video models by conditioning only on the last frame leads to local discontinuities and rapid quality degradation over long rollouts. PAN addresses this with Causal Swin DPM, which augments the Shift Window Denoising Process Model with chunk wise causal attention.

The decoder operates on a sliding temporal window that holds two chunks of video frames at different noise levels. During denoising, one chunk moves from high noise to clean frames and then leaves the window. A new noisy chunk enters at the other end. Chunk wise causal attention ensures that the later chunk can only attend to the earlier one, not to unseen future actions. This keeps transitions between chunks smooth and reduces error accumulation over long horizons.
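The chunk wise causal constraint can be written as an attention mask: frames attend freely within their own chunk and to earlier chunks, but never to later ones. A small illustrative mask builder:

```python
import numpy as np

def chunk_causal_mask(n_chunks, chunk_len):
    """Boolean attention mask where a frame in chunk j may attend to any
    frame in chunks <= j (full attention inside a chunk, causal across
    chunks). True means attention is allowed."""
    n = n_chunks * chunk_len
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        for j in range(n):
            mask[i, j] = (j // chunk_len) <= (i // chunk_len)
    return mask

# Two chunks of three frames: the earlier chunk cannot see the later one.
m = chunk_causal_mask(2, 3)
```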

PAN also adds controlled noise to the conditioning frame, rather than using a perfectly sharp frame. This suppresses incidental pixel details that do not matter for dynamics and encourages the model to focus on stable structure such as objects and layout.
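This conditioning trick amounts to adding mild Gaussian noise to the otherwise clean frame; the noise scale below is an illustrative assumption, not the paper's schedule:

```python
import numpy as np

rng = np.random.default_rng(0)

def noise_conditioning_frame(frame, sigma=0.1):
    """Perturb the clean conditioning frame slightly so the decoder keys on
    stable structure (objects, layout) rather than exact pixel values."""
    return frame + sigma * rng.standard_normal(frame.shape)

frame = np.ones((4, 4))
noised = noise_conditioning_frame(frame)
```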


Training stack and data construction

PAN is trained in two stages. In the first stage, the research team adapts Wan2.1-T2V-14B into the Causal Swin DPM architecture. They train the decoder in BFloat16 with AdamW, a cosine learning rate schedule, gradient clipping, FlashAttention-3 and FlexAttention kernels, and a hybrid sharded data parallel scheme across 960 NVIDIA H200 GPUs.
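The reported stage one settings can be collected into a config sketch; values not stated in the text, such as the learning rate itself, are omitted rather than guessed:

```python
# Sketch of the stage-1 decoder training configuration as reported in the
# article; this is a summary dict, not a runnable training script.
stage1_config = {
    "base_model": "Wan2.1-T2V-14B",
    "objective": "flow matching (Rectified Flow, 1000 denoising steps)",
    "precision": "bfloat16",
    "optimizer": "AdamW",
    "lr_schedule": "cosine",
    "gradient_clipping": True,
    "attention_kernels": ["FlashAttention-3", "FlexAttention"],
    "parallelism": "hybrid sharded data parallel",
    "hardware": "960x NVIDIA H200",
}
```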

In the second stage, they integrate the frozen Qwen2.5-VL-7B-Instruct backbone with the video diffusion decoder under the GLP objective. With the vision language model frozen, training updates only the learned query embeddings and the decoder, so that predicted latents and reconstructed videos stay consistent. This joint training also uses sequence parallelism and Ulysses style attention sharding to handle long context sequences. Early stopping ends training after 1 epoch once validation converges, even though the schedule allows 5 epochs.

Training data comes from widely used publicly accessible video sources that cover everyday activities, human object interactions, natural environments, and multi agent scenarios. Long form videos are segmented into coherent clips using shot boundary detection. A filtering pipeline removes static or overly dynamic clips, low aesthetic quality, heavy text overlays, and screen recordings using rule based metrics, pretrained detectors, and a custom VLM filter. The research team then re-captions clips with dense, temporally grounded descriptions that emphasize motion and causal events.
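The rule based part of such a filtering pipeline might look like the sketch below; all thresholds are illustrative assumptions, not the paper's values:

```python
def keep_clip(motion_score, aesthetic_score, text_overlay_frac, is_screen_recording):
    """Illustrative clip filter mirroring the pipeline described above:
    drop screen recordings, static or overly dynamic clips, low aesthetic
    quality, and heavy text overlays. All thresholds are assumptions."""
    if is_screen_recording:
        return False
    if not (0.05 < motion_score < 0.9):   # static or overly dynamic
        return False
    if aesthetic_score < 0.5:             # low aesthetic quality
        return False
    if text_overlay_frac > 0.2:           # heavy text overlays
        return False
    return True
```

In the described pipeline these rules run alongside pretrained detectors and a custom VLM filter, which this sketch does not attempt to model.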

Benchmarks, action fidelity, long horizon stability, planning

The research team evaluates the model along three axes: action simulation fidelity, long horizon forecast, and simulative reasoning and planning, against both open source and commercial video generators and world models. Baselines include Wan2.1 and Wan2.2, Cosmos 1 and Cosmos 2, V-JEPA 2, and commercial systems such as KLING, MiniMax Hailuo, and Gen-3.

For action simulation fidelity, a VLM based judge scores how well the model executes language specified actions while maintaining a stable background. PAN reaches 70.3% accuracy on agent simulation and 47% on environment simulation, for an overall score of 58.6%. It achieves the highest fidelity among open source models and surpasses most commercial baselines.

For long horizon forecast, the research team measures Transition Smoothness and Simulation Consistency. Transition Smoothness uses optical flow acceleration to quantify how smooth motion is across action boundaries. Simulation Consistency uses metrics inspired by WorldScore to monitor degradation over extended sequences. PAN scores 53.6% on Transition Smoothness and 64.1% on Simulation Consistency and exceeds all baselines, including KLING and MiniMax, on these metrics.
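The idea behind an optical flow acceleration metric can be sketched as a second difference over per-frame flow vectors, mapped to a bounded score. The exact formula here is illustrative, not the benchmark's definition:

```python
import numpy as np

def transition_smoothness(flows):
    """Score motion smoothness from per-frame optical flow vectors (T, 2):
    the second temporal difference approximates flow acceleration, and the
    score is 1 / (1 + mean |acceleration|), so constant motion scores 1.0."""
    flows = np.asarray(flows, dtype=float)
    accel = np.diff(flows, n=2, axis=0)   # second difference = acceleration
    mean_accel = float(np.abs(accel).mean()) if len(accel) else 0.0
    return 1.0 / (1.0 + mean_accel)

smooth = transition_smoothness([[1, 0]] * 5)                         # constant flow
jerky = transition_smoothness([[0, 0], [5, 5], [0, 0], [5, 5], [0, 0]])  # oscillating
```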

For simulative reasoning and planning, PAN is used as an internal simulator inside an OpenAI-o3 based agent loop. In step wise simulation, PAN achieves 56.1% accuracy, the best among open source world models.
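Using a world model as an internal simulator inside an agent loop reduces to rolling out each candidate action and scoring the predicted futures. A generic sketch, with caller-supplied stand-ins for the simulator and the judge:

```python
def plan_with_simulator(state, candidate_actions, simulate, score):
    """Sketch of simulative planning: roll out each candidate action with
    the world model ('simulate'), score each predicted future ('score'),
    and return the best-scoring action. Both callables are stand-ins for
    the world model and the evaluating agent."""
    best_action, best_score = None, float("-inf")
    for action in candidate_actions:
        predicted_future = simulate(state, action)
        s = score(predicted_future)
        if s > best_score:
            best_action, best_score = action, s
    return best_action

# Toy usage: futures are numbers, and higher is better.
best = plan_with_simulator(
    state=0,
    candidate_actions=["left", "right"],
    simulate=lambda st, a: st + (1 if a == "right" else -1),
    score=lambda fut: fut,
)
```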


Key Takeaways

  1. PAN implements the Generative Latent Prediction architecture, combining a Qwen2.5-VL-7B based latent dynamics backbone with a Wan2.1-T2V-14B based video diffusion decoder, to unify latent world reasoning and realistic video generation.
  2. The Causal Swin DPM mechanism introduces a sliding window, chunk wise causal denoising process that conditions on partially noised past chunks, which stabilizes long horizon video rollouts and reduces temporal drift compared to naive last frame conditioning.
  3. PAN is trained in two stages, first adapting the Wan2.1 decoder to Causal Swin DPM on 960 NVIDIA H200 GPUs with a flow matching objective, then jointly training the GLP stack with a frozen Qwen2.5-VL backbone and learned query embeddings plus decoder.
  4. The training corpus consists of large scale video action pairs from diverse domains, processed with segmentation, filtering, and dense temporal recaptioning, enabling PAN to learn action conditioned, long range dynamics instead of isolated short clips.
  5. PAN achieves state of the art open source results on action simulation fidelity, long horizon forecasting, and simulative planning, with reported scores such as 70.3% agent simulation, 47% environment simulation, 53.6% transition smoothness, and 64.1% simulation consistency, while remaining competitive with leading commercial systems.

Comparison Table

  • Organization
    • PAN: MBZUAI Institute of Foundation Models
    • Cosmos video2world WFM: NVIDIA Research
    • Wan2.1-T2V-14B: Wan AI and Open Laboratory
    • V-JEPA 2: Meta AI
  • Primary role
    • PAN: General world model for interactive, long horizon world simulation with natural language actions
    • Cosmos: World foundation model platform for Physical AI with video to world generation for control and navigation
    • Wan2.1: High quality text to video and image to video generator for general content creation and editing
    • V-JEPA 2: Self supervised video model for understanding, prediction, and planning tasks
  • World model framing
    • PAN: Explicit GLP world model with latent state, action, and next observation defined, focused on simulative reasoning and planning
    • Cosmos: Described as a world foundation model that generates future video worlds from past video and a control prompt, aimed at Physical AI, robotics, driving, and navigation
    • Wan2.1: Framed as a video generation model, not primarily as a world model, with no persistent internal world state described in its docs
    • V-JEPA 2: Joint embedding predictive architecture for video, focused on latent prediction rather than explicit generative supervision in observation space
  • Core architecture
    • PAN: GLP stack with a vision encoder from Qwen2.5-VL-7B, an LLM based latent dynamics backbone, and a video diffusion decoder with Causal Swin DPM
    • Cosmos: Family of diffusion based and autoregressive world models with video2world generation, plus a diffusion decoder and a prompt upsampler based on a language model
    • Wan2.1: Spatio temporal variational autoencoder and diffusion transformer T2V model at 14 billion parameters, supporting multiple generative tasks and resolutions
    • V-JEPA 2: JEPA style encoder plus predictor architecture that matches latent representations of consecutive video observations
  • Backbone and latent space
    • PAN: Multimodal latent space from Qwen2.5-VL-7B, used both for encoding observations and for autoregressive latent prediction under actions
    • Cosmos: Token based video2world model with text prompt conditioning and an optional diffusion decoder for refinement; latent space details depend on the model variant
    • Wan2.1: Latent space from a VAE plus diffusion transformer, driven mainly by text or image prompts, with no explicit agent action sequence interface
    • V-JEPA 2: Latent space built from a self supervised video encoder with a predictive loss in representation space, not a generative reconstruction loss
  • Action or control input
    • PAN: Natural language actions in dialogue format, applied at every simulation step; the model predicts the next latent state and decodes video conditioned on action and history
    • Cosmos: Control input as a text prompt and optionally camera pose, for navigation and downstream tasks such as humanoid control and autonomous driving
    • Wan2.1: Text prompts and image inputs for content control, with no explicit multi step agent action interface described as world model control
    • V-JEPA 2: Does not focus on natural language actions; used more as a visual representation and predictor module inside larger agents or planners
  • Long horizon design
    • PAN: Causal Swin DPM sliding window diffusion with chunk wise causal attention, conditioning on a slightly noised last frame to reduce drift and maintain stable long horizon rollouts
    • Cosmos: The video2world model generates future video given a past window and prompt, and supports navigation and long sequences, but the paper does not describe a Causal Swin DPM style mechanism
    • Wan2.1: Can generate several seconds at 480P and 720P, focusing on visual quality and motion; long horizon stability is evaluated through Wan Bench but without an explicit world state mechanism
    • V-JEPA 2: Long temporal reasoning comes from predictive latent modeling and self supervised training, not from generative video rollouts with explicit diffusion windows
  • Training data focus
    • PAN: Large scale video action pairs across diverse physical and embodied domains, with segmentation, filtering, and dense temporal recaptioning for action conditioned dynamics
    • Cosmos: Mix of proprietary and public Internet videos focused on Physical AI categories such as driving, manipulation, human activity, navigation, and nature dynamics, with a dedicated curation pipeline
    • Wan2.1: Large open domain video and image corpora for general visual generation, with Wan Bench evaluation prompts; not targeted specifically at agent environment rollouts
    • V-JEPA 2: Large scale unlabelled video data for self supervised representation learning and prediction, with details in the V-JEPA 2 paper

PAN is an important step because it operationalizes Generative Latent Prediction with production scale components such as Qwen2.5-VL-7B and Wan2.1-T2V-14B, then validates this stack on well defined benchmarks for action simulation, long horizon forecasting, and simulative planning. The training and evaluation pipeline is clearly documented by the research team, the metrics are reproducible, and the model is released within a transparent world modeling framework rather than as an opaque video demo. Overall, PAN shows how a vision language backbone plus diffusion video decoder can function as a practical world model instead of a pure generative toy.


