• About Us
  • Disclaimer
  • Contact Us
  • Privacy Policy
Friday, April 24, 2026
mGrowTech
No Result
View All Result
  • Technology And Software
    • Account Based Marketing
    • Channel Marketing
    • Marketing Automation
      • Al, Analytics and Automation
      • Ad Management
  • Digital Marketing
    • Social Media Management
    • Google Marketing
  • Direct Marketing
    • Brand Management
    • Marketing Attribution and Consulting
  • Mobile Marketing
  • Event Management
  • PR Solutions
  • Technology And Software
    • Account Based Marketing
    • Channel Marketing
    • Marketing Automation
      • Al, Analytics and Automation
      • Ad Management
  • Digital Marketing
    • Social Media Management
    • Google Marketing
  • Direct Marketing
    • Brand Management
    • Marketing Attribution and Consulting
  • Mobile Marketing
  • Event Management
  • PR Solutions
No Result
View All Result
mGrowTech
No Result
View All Result
Home Al, Analytics and Automation

A New Agency-Focused Supervision Approach Scales Software AI Agents With Only 78 Examples

Josh by Josh
October 6, 2025
in Al, Analytics and Automation
0
A New Agency-Focused Supervision Approach Scales Software AI Agents With Only 78 Examples






Do curated, tool-grounded demonstrations build stronger software agents than broad piles of generic instruction data? A team of researchers from Shanghai Jiao Tong University and SII Generative AI Research Lab (GAIR) proposes LIMI (“Less Is More for Agency”), a supervised fine-tuning method that turns a base model into a capable software/research agent using 78 samples. LIMI scores 73.5% average on AgencyBench (FTFC 71.7, RC@3 74.2, SR@3 74.6), beating strong baselines (GLM-4.5 45.1, Qwen3-235B-A22B 27.5, Kimi-K2 24.1, DeepSeek-V3.1 11.9) and even surpassing variants trained on 10,000 samples—with 128× less data.

https://arxiv.org/pdf/2509.17567

What exactly is new?

  • Agency Efficiency Principle: LIMI state that agentic competence scales more with data quality/structure than raw sample count. The research team fine-tune GLM-4.5/GLM-4.5-Air on 78 long-horizon, tool-use trajectories (samples) and report large gains on AgencyBench and generalization suites (TAU2-bench, EvalPlus-HE/MBPP, DS-1000, SciCode).
  • Minimal but dense supervision. Each trajectory (~13k–152k tokens; ~42.4k avg.) captures complete multi-turn workflows—model reasoning, tool calls, and environment observations—collected in the SII-CLI execution environment. Tasks span “vibe coding” (interactive software development) and research workflows (search, analysis, experiment design).
https://arxiv.org/pdf/2509.17567

How does it work?

  • Base models: GLM-4.5 (355B) and GLM-4.5-Air (106B). Training uses the slime SFT framework with identical configs across comparisons (to isolate data effects).
  • Data construction: 60 real queries from practitioners + 18 synthesized from high-star GitHub PRs (tight QA by PhD annotators). For each query, LIMI logs the full agent trajectory to successful completion inside SII-CLI.
  • Evaluation: AgencyBench (R=3 rounds) with FTFC, SR@3, RC@3; plus generalization suites (TAU2-airline/retail Pass^4, EvalPlus HE/MBPP, DS-1000, SciCode).
https://arxiv.org/pdf/2509.17567

Results

  • AgencyBench (avg): 73.5%. LIMI vs. GLM-4.5 (+28.4 pts); FTFC 71.7% vs 37.8%; SR@3 74.6% vs 47.4%.
  • Data efficiency: LIMI (78 samples) outperforms GLM-4.5 trained on AFM-CodeAgent SFT (10,000 samples): 73.5% vs 47.8%—+53.7% absolute with 128× less data. Similar gaps hold vs AFM-WebAgent (7,610) and CC-Bench-Traj (260).
  • Generalization: Across tool-use/coding/scientific computing, LIMI averages ~57%, exceeding GLM-4.5 and other baselines; without tool access, LIMI still leads slightly (50.0% vs 48.7% for GLM-4.5), indicating intrinsic gains beyond environment tooling.
https://arxiv.org/pdf/2509.17567

Key Takeaways

  1. Data efficiency dominates scale. LIMI reaches 73.5% average on AgencyBench using curated trajectories, surpassing GLM-4.5 (45.1%) and showing a +53.7-point advantage over a 10k-sample SFT baseline—with 128× fewer samples.
  2. Trajectory quality, not bulk. Training data are long-horizon, tool-grounded workflows in collaborative software development and scientific research, collected via the SII-CLI execution stack referenced by the paper.
  3. Across-metric gains. On AgencyBench, LIMI reports FTFC 71.7%, SR@3 74.6%, and strong RC@3, with detailed tables showing large margins over baselines; generalization suites (TAU2, EvalPlus-HE/MBPP, DS-1000, SciCode) average 57.2%.
  4. Works across scales. Fine-tuning GLM-4.5 (355B) and GLM-4.5-Air (106B) both yields large deltas over their bases, indicating method robustness to model size.

The research team trains GLM-4.5 variants with 78 curated, long-horizon, tool-grounded trajectories captured in a CLI environment spanning software-engineering and research tasks. It reports 73.5% average on AgencyBench with FTFC, RC@3, and SR@3 metrics; baseline GLM-4.5 is reported at 45.1%. A comparison against a 10,000-sample AFM-CodeAgent SFT baseline shows 73.5% vs 47.8%; tool-free evaluation indicates intrinsic gains (≈50.0% for LIMI vs 48.7% GLM-4.5). Trajectories are multi-turn and token-dense, emphasizing planning, tool orchestration, and verification.


Check out the Paper, GitHub Page and Model Card on HF. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter.


Asif Razzaq is the CEO of Marktechpost Media Inc.. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence Media Platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts of over 2 million monthly views, illustrating its popularity among audiences.

🙌 Follow MARKTECHPOST: Add us as a preferred source on Google.



READ ALSO

MIT scientists build the world’s largest collection of Olympiad-level math problems, and open it to everyone | MIT News

Google DeepMind Introduces Decoupled DiLoCo: An Asynchronous Training Architecture Achieving 88% Goodput Under High Hardware Failure Rates




Previous articleStreamTensor: A PyTorch-to-Accelerator Compiler that Streams LLM Intermediates Across FPGA Dataflows




Source_link

Related Posts

MIT scientists build the world’s largest collection of Olympiad-level math problems, and open it to everyone | MIT News
Al, Analytics and Automation

MIT scientists build the world’s largest collection of Olympiad-level math problems, and open it to everyone | MIT News

April 24, 2026
Google DeepMind Introduces Decoupled DiLoCo: An Asynchronous Training Architecture Achieving 88% Goodput Under High Hardware Failure Rates
Al, Analytics and Automation

Google DeepMind Introduces Decoupled DiLoCo: An Asynchronous Training Architecture Achieving 88% Goodput Under High Hardware Failure Rates

April 24, 2026
Mend Releases AI Security Governance Framework: Covering Asset Inventory, Risk Tiering, AI Supply Chain Security, and Maturity Model
Al, Analytics and Automation

Mend Releases AI Security Governance Framework: Covering Asset Inventory, Risk Tiering, AI Supply Chain Security, and Maturity Model

April 24, 2026
“Your Next Coworker May Not Be Human” as Google Bets Everything on AI Agents to Power the Office
Al, Analytics and Automation

“Your Next Coworker May Not Be Human” as Google Bets Everything on AI Agents to Power the Office

April 23, 2026
Google Cloud AI Research Introduces ReasoningBank: A Memory Framework that Distills Reasoning Strategies from Agent Successes and Failures
Al, Analytics and Automation

Google Cloud AI Research Introduces ReasoningBank: A Memory Framework that Distills Reasoning Strategies from Agent Successes and Failures

April 23, 2026
The Most Efficient Approach to Crafting Your Personal AI Productivity System
Al, Analytics and Automation

The Most Efficient Approach to Crafting Your Personal AI Productivity System

April 23, 2026
Next Post
Grow a Garden Farmer Chipmunk Pet Wiki

Grow a Garden Farmer Chipmunk Pet Wiki

POPULAR NEWS

Trump ends trade talks with Canada over a digital services tax

Trump ends trade talks with Canada over a digital services tax

June 28, 2025
Communication Effectiveness Skills For Business Leaders

Communication Effectiveness Skills For Business Leaders

June 10, 2025
15 Trending Songs on TikTok in 2025 (+ How to Use Them)

15 Trending Songs on TikTok in 2025 (+ How to Use Them)

June 18, 2025
App Development Cost in Singapore: Pricing Breakdown & Insights

App Development Cost in Singapore: Pricing Breakdown & Insights

June 22, 2025
Comparing the Top 7 Large Language Models LLMs/Systems for Coding in 2025

Comparing the Top 7 Large Language Models LLMs/Systems for Coding in 2025

November 4, 2025

EDITOR'S PICK

Google Cloud is working with HCA Healthcare on a Nurse Handoff app

Google Cloud is working with HCA Healthcare on a Nurse Handoff app

August 2, 2025
Labor Day Mobile Marketing Strategies

Labor Day Mobile Marketing Strategies

May 29, 2025
How to Gather Evidence After a Bicycle Accident for a Strong Case

How to Gather Evidence After a Bicycle Accident for a Strong Case

March 27, 2026
I Found 9 Field Service Management Tools To Be The Most Valuable

I Found 9 Field Service Management Tools To Be The Most Valuable

September 24, 2025

About

We bring you the best Premium WordPress Themes that perfect for news, magazine, personal blog, etc. Check our landing page for details.

Follow us

Categories

  • Account Based Marketing
  • Ad Management
  • Al, Analytics and Automation
  • Brand Management
  • Channel Marketing
  • Digital Marketing
  • Direct Marketing
  • Event Management
  • Google Marketing
  • Marketing Attribution and Consulting
  • Marketing Automation
  • Mobile Marketing
  • PR Solutions
  • Social Media Management
  • Technology And Software
  • Uncategorized

Recent Posts

  • Top 25 SEM Tools: Content, SEO, and More!
  • MIT scientists build the world’s largest collection of Olympiad-level math problems, and open it to everyone | MIT News
  • Which is the Best Knowledge Base Software for Contact Centers?
  • 10 Critical Benefits of Computer Vision for Business in 2026
  • About Us
  • Disclaimer
  • Contact Us
  • Privacy Policy
No Result
View All Result
  • Technology And Software
    • Account Based Marketing
    • Channel Marketing
    • Marketing Automation
      • Al, Analytics and Automation
      • Ad Management
  • Digital Marketing
    • Social Media Management
    • Google Marketing
  • Direct Marketing
    • Brand Management
    • Marketing Attribution and Consulting
  • Mobile Marketing
  • Event Management
  • PR Solutions