• About Us
  • Disclaimer
  • Contact Us
  • Privacy Policy
Sunday, April 26, 2026
mGrowTech
No Result
View All Result
  • Technology And Software
    • Account Based Marketing
    • Channel Marketing
    • Marketing Automation
      • Al, Analytics and Automation
      • Ad Management
  • Digital Marketing
    • Social Media Management
    • Google Marketing
  • Direct Marketing
    • Brand Management
    • Marketing Attribution and Consulting
  • Mobile Marketing
  • Event Management
  • PR Solutions
  • Technology And Software
    • Account Based Marketing
    • Channel Marketing
    • Marketing Automation
      • Al, Analytics and Automation
      • Ad Management
  • Digital Marketing
    • Social Media Management
    • Google Marketing
  • Direct Marketing
    • Brand Management
    • Marketing Attribution and Consulting
  • Mobile Marketing
  • Event Management
  • PR Solutions
No Result
View All Result
mGrowTech
No Result
View All Result
Home Al, Analytics and Automation

A better method for planning complex visual tasks | MIT News

Josh by Josh
March 11, 2026
in Al, Analytics and Automation
0
A better method for planning complex visual tasks | MIT News



MIT researchers have developed a generative artificial intelligence-driven approach for planning long-term visual tasks, like robot navigation, that is about twice as effective as some existing techniques.

Their method uses a specialized vision-language model to perceive the scenario in an image and simulate actions needed to reach a goal. Then a second model translates those simulations into a standard programming language for planning problems, and refines the solution.

In the end, the system automatically generates a set of files that can be fed into classical planning software, which computes a plan to achieve the goal. This two-step system generated plans with an average success rate of about 70 percent, outperforming the best baseline methods that could only reach about 30 percent.

Importantly, the system can solve new problems it hasn’t encountered before, making it well-suited for real environments where conditions can change at a moment’s notice.

“Our framework combines the advantages of vision-language models, like their ability to understand images, with the strong planning capabilities of a formal solver,” says Yilun Hao, an aeronautics and astronautics (AeroAstro) graduate student at MIT and lead author of an open-access paper on this technique. “It can take a single image and move it through simulation and then to a reliable, long-horizon plan that could be useful in many real-life applications.”

She is joined on the paper by Yongchao Chen, a graduate student in the MIT Laboratory for Information and Decision Systems (LIDS); Chuchu Fan, an associate professor in AeroAstro and a principal investigator in LIDS; and Yang Zhang, a research scientist at the MIT-IBM Watson AI Lab. The paper will be presented at the International Conference on Learning Representations.

Tackling visual tasks

For the past few years, Fan and her colleagues have studied the use of generative AI models to perform complex reasoning and planning, often employing large language models (LLMs) to process text inputs.

Many real-world planning problems, like robotic assembly and autonomous driving, have visual inputs that an LLM can’t handle well on its own. The researchers sought to expand into the visual domain by utilizing vision-language models (VLMs), powerful AI systems that can process images and text.

But VLMs struggle to understand spatial relationships between objects in a scene and often fail to reason correctly over many steps. This makes it difficult to use VLMs for long-range planning.

On the other hand, scientists have developed robust, formal planners that can generate effective long-horizon plans for complex situations. However, these software systems can’t process visual inputs and require expert knowledge to encode a problem into language the solver can understand.

Fan and her team built an automatic planning system that takes the best of both methods. The system, called VLM-guided formal planning (VLMFP), utilizes two specialized VLMs that work together to turn visual planning problems into ready-to-use files for formal planning software.

The researchers first carefully trained a small model they call SimVLM to specialize in describing the scenario in an image using natural language and simulating a sequence of actions in that scenario. Then a much larger model, which they call GenVLM, uses the description from SimVLM to generate a set of initial files in a formal planning language known as the Planning Domain Definition Language (PDDL).

The files are ready to be fed into a classical PDDL solver, which computes a step-by-step plan to solve the task. GenVLM compares the results of the solver with those of the simulator and iteratively refines the PDDL files.

“The generator and simulator work together to be able to reach the exact same result, which is an action simulation that achieves the goal,” Hao says.

Because GenVLM is a large generative AI model, it has seen many examples of PDDL during training and learned how this formal language can solve a wide range of problems. This existing knowledge enables the model to generate accurate PDDL files.

A flexible approach

VLMFP generates two separate PDDL files. The first is a domain file that defines the environment, valid actions, and domain rules. It also produces a problem file that defines the initial states and the goal of a particular problem at hand.

“One advantage of PDDL is the domain file is the same for all instances in that environment. This makes our framework good at generalizing to unseen instances under the same domain,” Hao explains.

To enable the system to generalize effectively, the researchers needed to carefully design just enough training data for SimVLM so the model learned to understand the problem and goal without memorizing patterns in the scenario. When tested, SimVLM successfully described the scenario, simulated actions, and detected if the goal was reached in about 85 percent of experiments.

Overall, the VLMFP framework achieved a success rate of about 60 percent on six 2D planning tasks and greater than 80 percent on two 3D tasks, including multirobot collaboration and robotic assembly. It also generated valid plans for more than 50 percent of scenarios it hadn’t seen before, far outpacing the baseline methods.

“Our framework can generalize when the rules change in different situations. This gives our system the flexibility to solve many types of visual-based planning problems,” Fan adds.

In the future, the researchers want to enable VLMFP to handle more complex scenarios and explore methods to identify and mitigate hallucinations by the VLMs.

“In the long term, generative AI models could act as agents and make use of the right tools to solve much more complicated problems. But what does it mean to have the right tools, and how do we incorporate those tools? There is still a long way to go, but by bringing visual-based planning into the picture, this work is an important piece of the puzzle,” Fan says.

This work was funded, in part, by the MIT-IBM Watson AI Lab.



Source_link

READ ALSO

Meet GitNexus: An Open-Source MCP-Native Knowledge Graph Engine That Gives Claude Code and Cursor Full Codebase Structural Awareness

Google DeepMind Introduces Vision Banana: An Instruction-Tuned Image Generator That Beats SAM 3 on Segmentation and Depth Anything V3 on Metric Depth Estimation

Related Posts

Meet GitNexus: An Open-Source MCP-Native Knowledge Graph Engine That Gives Claude Code and Cursor Full Codebase Structural Awareness
Al, Analytics and Automation

Meet GitNexus: An Open-Source MCP-Native Knowledge Graph Engine That Gives Claude Code and Cursor Full Codebase Structural Awareness

April 25, 2026
Google DeepMind Introduces Vision Banana: An Instruction-Tuned Image Generator That Beats SAM 3 on Segmentation and Depth Anything V3 on Metric Depth Estimation
Al, Analytics and Automation

Google DeepMind Introduces Vision Banana: An Instruction-Tuned Image Generator That Beats SAM 3 on Segmentation and Depth Anything V3 on Metric Depth Estimation

April 25, 2026
MIT scientists build the world’s largest collection of Olympiad-level math problems, and open it to everyone | MIT News
Al, Analytics and Automation

MIT scientists build the world’s largest collection of Olympiad-level math problems, and open it to everyone | MIT News

April 24, 2026
Google DeepMind Introduces Decoupled DiLoCo: An Asynchronous Training Architecture Achieving 88% Goodput Under High Hardware Failure Rates
Al, Analytics and Automation

Google DeepMind Introduces Decoupled DiLoCo: An Asynchronous Training Architecture Achieving 88% Goodput Under High Hardware Failure Rates

April 24, 2026
Mend Releases AI Security Governance Framework: Covering Asset Inventory, Risk Tiering, AI Supply Chain Security, and Maturity Model
Al, Analytics and Automation

Mend Releases AI Security Governance Framework: Covering Asset Inventory, Risk Tiering, AI Supply Chain Security, and Maturity Model

April 24, 2026
“Your Next Coworker May Not Be Human” as Google Bets Everything on AI Agents to Power the Office
Al, Analytics and Automation

“Your Next Coworker May Not Be Human” as Google Bets Everything on AI Agents to Power the Office

April 23, 2026
Next Post
Looking Glass’ Musubi showcases its holographic display in a consumer-friendly package

Looking Glass' Musubi showcases its holographic display in a consumer-friendly package

POPULAR NEWS

Trump ends trade talks with Canada over a digital services tax

Trump ends trade talks with Canada over a digital services tax

June 28, 2025
Communication Effectiveness Skills For Business Leaders

Communication Effectiveness Skills For Business Leaders

June 10, 2025
15 Trending Songs on TikTok in 2025 (+ How to Use Them)

15 Trending Songs on TikTok in 2025 (+ How to Use Them)

June 18, 2025
App Development Cost in Singapore: Pricing Breakdown & Insights

App Development Cost in Singapore: Pricing Breakdown & Insights

June 22, 2025
Comparing the Top 7 Large Language Models LLMs/Systems for Coding in 2025

Comparing the Top 7 Large Language Models LLMs/Systems for Coding in 2025

November 4, 2025

EDITOR'S PICK

Android Quick Share can now work with iOS’s AirDrop

Android Quick Share can now work with iOS’s AirDrop

November 23, 2025
Google and Apple reportedly warn employees on visas to avoid international travel

Google and Apple reportedly warn employees on visas to avoid international travel

December 20, 2025

A Complete Guide to Meta Ads Attribution

June 9, 2025
I Started Over on LinkedIn After Deleting 7,000 Followers — and Grew Faster Than Before

I Started Over on LinkedIn After Deleting 7,000 Followers — and Grew Faster Than Before

October 22, 2025

About

We bring you the best Premium WordPress Themes that perfect for news, magazine, personal blog, etc. Check our landing page for details.

Follow us

Categories

  • Account Based Marketing
  • Ad Management
  • Al, Analytics and Automation
  • Brand Management
  • Channel Marketing
  • Digital Marketing
  • Direct Marketing
  • Event Management
  • Google Marketing
  • Marketing Attribution and Consulting
  • Marketing Automation
  • Mobile Marketing
  • PR Solutions
  • Social Media Management
  • Technology And Software
  • Uncategorized

Recent Posts

  • How to Farm Scraps Fast in Goat Simulator 3
  • Best Apps for Focus (2026): Focus Friend, Forest, Focus Traveller
  • The 3 elements behind every effective story
  • OpenAI CEO apologizes to Tumbler Ridge community
  • About Us
  • Disclaimer
  • Contact Us
  • Privacy Policy
No Result
View All Result
  • Technology And Software
    • Account Based Marketing
    • Channel Marketing
    • Marketing Automation
      • Al, Analytics and Automation
      • Ad Management
  • Digital Marketing
    • Social Media Management
    • Google Marketing
  • Direct Marketing
    • Brand Management
    • Marketing Attribution and Consulting
  • Mobile Marketing
  • Event Management
  • PR Solutions