• About Us
  • Disclaimer
  • Contact Us
  • Privacy Policy
Wednesday, June 10, 2026
mGrowTech
No Result
View All Result
  • Technology And Software
    • Account Based Marketing
    • Channel Marketing
    • Marketing Automation
      • Al, Analytics and Automation
      • Ad Management
  • Digital Marketing
    • Social Media Management
    • Google Marketing
  • Direct Marketing
    • Brand Management
    • Marketing Attribution and Consulting
  • Mobile Marketing
  • Event Management
  • PR Solutions
  • Technology And Software
    • Account Based Marketing
    • Channel Marketing
    • Marketing Automation
      • Al, Analytics and Automation
      • Ad Management
  • Digital Marketing
    • Social Media Management
    • Google Marketing
  • Direct Marketing
    • Brand Management
    • Marketing Attribution and Consulting
  • Mobile Marketing
  • Event Management
  • PR Solutions
No Result
View All Result
mGrowTech
No Result
View All Result
Home Al, Analytics and Automation

A better method for planning complex visual tasks | MIT News

Josh by Josh
March 11, 2026
in Al, Analytics and Automation
0
A better method for planning complex visual tasks | MIT News



MIT researchers have developed a generative artificial intelligence-driven approach for planning long-term visual tasks, like robot navigation, that is about twice as effective as some existing techniques.

Their method uses a specialized vision-language model to perceive the scenario in an image and simulate actions needed to reach a goal. Then a second model translates those simulations into a standard programming language for planning problems, and refines the solution.

In the end, the system automatically generates a set of files that can be fed into classical planning software, which computes a plan to achieve the goal. This two-step system generated plans with an average success rate of about 70 percent, outperforming the best baseline methods that could only reach about 30 percent.

Importantly, the system can solve new problems it hasn’t encountered before, making it well-suited for real environments where conditions can change at a moment’s notice.

“Our framework combines the advantages of vision-language models, like their ability to understand images, with the strong planning capabilities of a formal solver,” says Yilun Hao, an aeronautics and astronautics (AeroAstro) graduate student at MIT and lead author of an open-access paper on this technique. “It can take a single image and move it through simulation and then to a reliable, long-horizon plan that could be useful in many real-life applications.”

She is joined on the paper by Yongchao Chen, a graduate student in the MIT Laboratory for Information and Decision Systems (LIDS); Chuchu Fan, an associate professor in AeroAstro and a principal investigator in LIDS; and Yang Zhang, a research scientist at the MIT-IBM Watson AI Lab. The paper will be presented at the International Conference on Learning Representations.

Tackling visual tasks

For the past few years, Fan and her colleagues have studied the use of generative AI models to perform complex reasoning and planning, often employing large language models (LLMs) to process text inputs.

Many real-world planning problems, like robotic assembly and autonomous driving, have visual inputs that an LLM can’t handle well on its own. The researchers sought to expand into the visual domain by utilizing vision-language models (VLMs), powerful AI systems that can process images and text.

But VLMs struggle to understand spatial relationships between objects in a scene and often fail to reason correctly over many steps. This makes it difficult to use VLMs for long-range planning.

On the other hand, scientists have developed robust, formal planners that can generate effective long-horizon plans for complex situations. However, these software systems can’t process visual inputs and require expert knowledge to encode a problem into language the solver can understand.

Fan and her team built an automatic planning system that takes the best of both methods. The system, called VLM-guided formal planning (VLMFP), utilizes two specialized VLMs that work together to turn visual planning problems into ready-to-use files for formal planning software.

The researchers first carefully trained a small model they call SimVLM to specialize in describing the scenario in an image using natural language and simulating a sequence of actions in that scenario. Then a much larger model, which they call GenVLM, uses the description from SimVLM to generate a set of initial files in a formal planning language known as the Planning Domain Definition Language (PDDL).

The files are ready to be fed into a classical PDDL solver, which computes a step-by-step plan to solve the task. GenVLM compares the results of the solver with those of the simulator and iteratively refines the PDDL files.

“The generator and simulator work together to be able to reach the exact same result, which is an action simulation that achieves the goal,” Hao says.

Because GenVLM is a large generative AI model, it has seen many examples of PDDL during training and learned how this formal language can solve a wide range of problems. This existing knowledge enables the model to generate accurate PDDL files.

A flexible approach

VLMFP generates two separate PDDL files. The first is a domain file that defines the environment, valid actions, and domain rules. It also produces a problem file that defines the initial states and the goal of a particular problem at hand.

“One advantage of PDDL is the domain file is the same for all instances in that environment. This makes our framework good at generalizing to unseen instances under the same domain,” Hao explains.

To enable the system to generalize effectively, the researchers needed to carefully design just enough training data for SimVLM so the model learned to understand the problem and goal without memorizing patterns in the scenario. When tested, SimVLM successfully described the scenario, simulated actions, and detected if the goal was reached in about 85 percent of experiments.

Overall, the VLMFP framework achieved a success rate of about 60 percent on six 2D planning tasks and greater than 80 percent on two 3D tasks, including multirobot collaboration and robotic assembly. It also generated valid plans for more than 50 percent of scenarios it hadn’t seen before, far outpacing the baseline methods.

“Our framework can generalize when the rules change in different situations. This gives our system the flexibility to solve many types of visual-based planning problems,” Fan adds.

In the future, the researchers want to enable VLMFP to handle more complex scenarios and explore methods to identify and mitigate hallucinations by the VLMs.

“In the long term, generative AI models could act as agents and make use of the right tools to solve much more complicated problems. But what does it mean to have the right tools, and how do we incorporate those tools? There is still a long way to go, but by bringing visual-based planning into the picture, this work is an important piece of the puzzle,” Fan says.

This work was funded, in part, by the MIT-IBM Watson AI Lab.



Source_link

READ ALSO

Top AI Coding Agents and Development Platforms in 2026: Atoms, Devin, Windsurf, Cursor, Warp, and More Compared

The Practitioner’s Guide to AgentOps

Related Posts

Top AI Coding Agents and Development Platforms in 2026: Atoms, Devin, Windsurf, Cursor, Warp, and More Compared
Al, Analytics and Automation

Top AI Coding Agents and Development Platforms in 2026: Atoms, Devin, Windsurf, Cursor, Warp, and More Compared

June 10, 2026
The Practitioner’s Guide to AgentOps
Al, Analytics and Automation

The Practitioner’s Guide to AgentOps

June 10, 2026
The consequences of relying on AI for accurate news | MIT News
Al, Analytics and Automation

The consequences of relying on AI for accurate news | MIT News

June 10, 2026
Google Releases Gemini 3.5 Live Translate, a Streaming Speech-to-Speech Audio Model Covering 70+ Languages Across Meet, Translate, and the Live API
Al, Analytics and Automation

Google Releases Gemini 3.5 Live Translate, a Streaming Speech-to-Speech Audio Model Covering 70+ Languages Across Meet, Translate, and the Live API

June 9, 2026
NVIDIA cuTile Python Tutorial: Building Tiled GPU Kernels for Vector Addition, Matrix Addition, and Matrix Multiplication in Colab
Al, Analytics and Automation

NVIDIA cuTile Python Tutorial: Building Tiled GPU Kernels for Vector Addition, Matrix Addition, and Matrix Multiplication in Colab

June 9, 2026
ClawHub Security Signals: A Coding Guide to End-to-End Security Signal Analysis and Verdict Classification on the AI Skills Dataset
Al, Analytics and Automation

ClawHub Security Signals: A Coding Guide to End-to-End Security Signal Analysis and Verdict Classification on the AI Skills Dataset

June 8, 2026
Next Post
Looking Glass’ Musubi showcases its holographic display in a consumer-friendly package

Looking Glass' Musubi showcases its holographic display in a consumer-friendly package

POPULAR NEWS

Trump ends trade talks with Canada over a digital services tax

Trump ends trade talks with Canada over a digital services tax

June 28, 2025
15 Trending Songs on TikTok in 2025 (+ How to Use Them)

15 Trending Songs on TikTok in 2025 (+ How to Use Them)

June 18, 2025
Communication Effectiveness Skills For Business Leaders

Communication Effectiveness Skills For Business Leaders

June 10, 2025
App Development Cost in Singapore: Pricing Breakdown & Insights

App Development Cost in Singapore: Pricing Breakdown & Insights

June 22, 2025
Comparing the Top 7 Large Language Models LLMs/Systems for Coding in 2025

Comparing the Top 7 Large Language Models LLMs/Systems for Coding in 2025

November 4, 2025

EDITOR'S PICK

Replying to Comments Boosts Engagement by 5-42% on These Major Platforms

Replying to Comments Boosts Engagement by 5-42% on These Major Platforms

November 14, 2025
How to Stay Ready for Compliance Audits for Online Payments

How to Stay Ready for Compliance Audits for Online Payments

September 26, 2025
The AI Growth Engine: Accelerating Your Marketing Results

The AI Growth Engine: Accelerating Your Marketing Results

June 19, 2025
P2P Lending Software Development: Process, Costs & Features

P2P Lending Software Development: Process, Costs & Features

September 24, 2025

About

We bring you the best Premium WordPress Themes that perfect for news, magazine, personal blog, etc. Check our landing page for details.

Follow us

Categories

  • Account Based Marketing
  • Ad Management
  • Al, Analytics and Automation
  • Brand Management
  • Channel Marketing
  • Digital Marketing
  • Direct Marketing
  • Event Management
  • Google Marketing
  • Marketing Attribution and Consulting
  • Marketing Automation
  • Mobile Marketing
  • PR Solutions
  • Social Media Management
  • Technology And Software
  • Uncategorized

Recent Posts

  • Netflix expands revamped mobile app across Asia and doubles down on kids’ gaming
  • Marketing Attribution: Moving Beyond Touchpoints
  • Top AI Coding Agents and Development Platforms in 2026: Atoms, Devin, Windsurf, Cursor, Warp, and More Compared
  • UK’s Most Eco-Friendly Brands Revealed as Charlotte Tilbury Takes No.1 Spot
  • About Us
  • Disclaimer
  • Contact Us
  • Privacy Policy
No Result
View All Result
  • Technology And Software
    • Account Based Marketing
    • Channel Marketing
    • Marketing Automation
      • Al, Analytics and Automation
      • Ad Management
  • Digital Marketing
    • Social Media Management
    • Google Marketing
  • Direct Marketing
    • Brand Management
    • Marketing Attribution and Consulting
  • Mobile Marketing
  • Event Management
  • PR Solutions