• About Us
  • Disclaimer
  • Contact Us
  • Privacy Policy
Tuesday, June 9, 2026
mGrowTech
  • Technology And Software
    • Account Based Marketing
    • Channel Marketing
    • Marketing Automation
      • Al, Analytics and Automation
      • Ad Management
  • Digital Marketing
    • Social Media Management
    • Google Marketing
  • Direct Marketing
    • Brand Management
    • Marketing Attribution and Consulting
  • Mobile Marketing
  • Event Management
  • PR Solutions
Meta and Harvard Researchers Introduce the Confucius Code Agent (CCA): A Software Engineering Agent that can Operate at Large-Scale Codebases

By Josh
January 10, 2026
in Al, Analytics and Automation

How far can a mid-sized language model go if the real innovation moves from the backbone into the agent scaffold and tool stack? Meta and Harvard researchers have released the Confucius Code Agent (CCA), an open source AI software engineer built on the Confucius SDK and designed for industrial-scale software repositories and long-running sessions. The system targets real GitHub projects, complex test toolchains at evaluation time, and reproducible results on benchmarks such as SWE Bench Pro and SWE Bench Verified, while exposing the full scaffold to developers.

https://arxiv.org/pdf/2512.10398

Confucius SDK, scaffolding around the model

The Confucius SDK is an agent development platform that treats scaffolding as a primary design problem rather than a thin wrapper around a language model. It is organized around three axes: Agent Experience, User Experience, and Developer Experience.

Agent Experience controls what the model sees, including context layout, working memory and tool results. User Experience focuses on readable traces, code diffs and safeguards for human engineers. Developer Experience focuses on observability, configuration and debugging of the agent itself.

The SDK introduces three core mechanisms: a unified orchestrator with hierarchical working memory, a persistent note-taking system, and a modular extension interface for tools. A meta agent then automates the synthesis and refinement of agent configurations through a build, test, improve loop. The Confucius Code Agent is one concrete instantiation of this scaffold for software engineering.

Hierarchical working memory for long horizon coding

Real software tasks on SWE Bench Pro often require reasoning over dozens of files and many interaction steps. The orchestrator in Confucius SDK maintains hierarchical working memory, which partitions a trajectory into scopes, summarizes past steps and keeps compressed context for later turns.

This design helps keep prompts within model context limits while preserving important artifacts such as patches, error logs and design decisions. The key point is that effective tool-based coding agents need an explicit memory architecture, not just a sliding window over previous messages.
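
As a rough illustration of the idea, the sketch below partitions a trajectory into scopes, replaces the steps of a closed scope with a summary, and keeps flagged artifacts verbatim. It is not the Confucius SDK API: the names `Scope`, `HierarchicalMemory` and `naive_summary` are hypothetical, and the summarizer is a trivial stand-in for what would normally be a model call.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Scope:
    """One segment of a trajectory, e.g. 'reproduce bug' or 'write patch' (hypothetical)."""
    name: str
    steps: List[str] = field(default_factory=list)
    artifacts: List[str] = field(default_factory=list)  # patches, error logs, decisions
    summary: Optional[str] = None  # filled in when the scope is closed


def naive_summary(steps: List[str]) -> str:
    """Stand-in for an LLM summarizer: step count plus a truncated last step."""
    if not steps:
        return "no steps"
    return f"{len(steps)} steps; last: {steps[-1][:60]}"


class HierarchicalMemory:
    """Assumed design: closed scopes are compressed, the open scope stays verbatim."""

    def __init__(self, summarize: Callable[[List[str]], str] = naive_summary):
        self.summarize = summarize
        self.scopes: List[Scope] = []

    def open_scope(self, name: str) -> None:
        # Opening a new scope closes (and compresses) the previous one.
        if self.scopes and self.scopes[-1].summary is None:
            self.close_scope()
        self.scopes.append(Scope(name))

    def add_step(self, text: str, artifact: bool = False) -> None:
        scope = self.scopes[-1]
        scope.steps.append(text)
        if artifact:
            scope.artifacts.append(text)

    def close_scope(self) -> None:
        scope = self.scopes[-1]
        scope.summary = self.summarize(scope.steps)

    def render_context(self) -> str:
        """Build the text that would be placed in the next prompt."""
        lines: List[str] = []
        for scope in self.scopes:
            if scope.summary is not None:
                lines.append(f"[{scope.name}] summary: {scope.summary}")
                lines.extend(f"[{scope.name}] artifact: {a}" for a in scope.artifacts)
            else:
                lines.extend(f"[{scope.name}] {s}" for s in scope.steps)
        return "\n".join(lines)
```

In this sketch, once a second scope is opened, the first scope contributes only its summary line and its artifacts to the rendered context, which is the property that keeps long trajectories within a fixed prompt budget.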

Persistent note taking for cross session learning

The second mechanism is a note taking system that uses a dedicated agent to write structured Markdown notes from execution traces. These notes capture task specific strategies, repository conventions and common failure modes, and they are stored as long term memory that can be reused across sessions.

The research team ran Confucius Code Agent twice on 151 SWE Bench Pro instances with Claude 4.5 Sonnet. On the first run, the agent solves tasks from scratch and generates notes. On the second run, the agent reads these notes. In this setting, average turns drop from 64 to 61, token usage drops from about 104k to 93k, and Resolve@1 improves from 53.0 to 54.4. This suggests that the notes are more than logs: they function as effective cross-session memory.
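
A minimal sketch of such a note store is shown below, assuming notes are kept as one Markdown file per topic in a per-repository directory. The function names `write_note` and `load_notes`, the directory layout and the example repository name are all hypothetical; the article does not describe the SDK's actual storage format.

```python
from pathlib import Path
from typing import List


def write_note(notes_dir: Path, repo: str, title: str, bullets: List[str]) -> Path:
    """Persist one structured Markdown note for a repository (hypothetical layout)."""
    repo_dir = notes_dir / repo
    repo_dir.mkdir(parents=True, exist_ok=True)
    slug = "-".join(title.lower().split())  # "Test Conventions" -> "test-conventions"
    path = repo_dir / f"{slug}.md"
    body = [f"# {title}", ""] + [f"- {b}" for b in bullets]
    path.write_text("\n".join(body) + "\n", encoding="utf-8")
    return path


def load_notes(notes_dir: Path, repo: str) -> str:
    """Concatenate all notes for a repository so a later session can read them first."""
    repo_dir = notes_dir / repo
    if not repo_dir.is_dir():
        return ""  # first run: nothing learned yet
    files = sorted(repo_dir.glob("*.md"))
    return "\n".join(p.read_text(encoding="utf-8") for p in files)
```

A first session would call `write_note` after finishing a task, and a second session on the same repository would call `load_notes` and place the result in its prompt, which mirrors the two-run setup described above.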

Modular extensions and tool use sophistication

Confucius SDK exposes tools as extensions, for example file editing, command execution, test runners and code search. Each extension can maintain its own state and prompt wiring.
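
To make the extension idea concrete, here is a small sketch of a tool registry in which each extension carries its own prompt text and its own state. The classes `Extension`, `CodeSearch` and `Registry` are illustrative assumptions, not the interface defined by the Confucius SDK.

```python
from typing import Dict, List


class Extension:
    """Hypothetical base class: a tool with a name, prompt wiring and private state."""
    name = "base"
    prompt = ""

    def run(self, argument: str) -> str:
        raise NotImplementedError


class CodeSearch(Extension):
    name = "code_search"
    prompt = "code_search <substring>: list files whose contents contain the substring."

    def __init__(self, files: Dict[str, str]):
        self.files = files            # path -> file contents (in-memory stand-in for a repo)
        self.history: List[str] = []  # per-extension state: queries seen so far

    def run(self, argument: str) -> str:
        self.history.append(argument)
        hits = sorted(path for path, text in self.files.items() if argument in text)
        return "\n".join(hits) or "no matches"


class Registry:
    """Collects extensions, assembles their prompt text and routes tool calls."""

    def __init__(self) -> None:
        self._extensions: Dict[str, Extension] = {}

    def register(self, ext: Extension) -> None:
        self._extensions[ext.name] = ext

    def system_prompt(self) -> str:
        return "\n".join(e.prompt for e in self._extensions.values())

    def dispatch(self, name: str, argument: str) -> str:
        if name not in self._extensions:
            return f"unknown tool: {name}"
        return self._extensions[name].run(argument)
```

Keeping state inside the extension rather than in the orchestrator means a tool can be added or removed without touching the rest of the agent, which is the modularity the paragraph above describes.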

The research team studies the impact of tool use sophistication with an ablation on a 100-example subset of SWE Bench Pro. With Claude 4 Sonnet, moving from a configuration without advanced context features to one with advanced context raises Resolve@1 from 42.0 to 48.6. With Claude 4.5 Sonnet, a simple tool use configuration reaches 44.0, an intermediate variant reaches 51.0, and richer tool handling reaches 51.6. These numbers indicate that how the agent chooses and sequences tools matters almost as much as the choice of backbone model.

Meta agent for automatic agent design

On top of these mechanisms, the Confucius SDK includes a meta agent that takes a natural language specification of an agent and iteratively proposes configurations, prompts and extension sets. It then runs the candidate agent on tasks, inspects traces and metrics, and edits the configuration in a build, test, improve loop.

The Confucius Code Agent that the research team evaluates is produced with the help of this meta agent, rather than being tuned only by hand. This approach turns part of the agent engineering process itself into an LLM-guided optimization problem.
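
The build, test, improve loop can be sketched as a simple search over configurations. In the sketch below, `propose` and `evaluate` are placeholders: in the described system a language model proposes changes and real task runs provide the metrics, whereas here both are deterministic toy functions, and every name (`improve_loop`, `toy_propose`, `toy_evaluate`, the extension names) is an assumption made for illustration.

```python
from typing import Callable, Dict, List, Tuple

Config = Dict[str, List[str]]
History = List[Tuple[Config, float]]


def improve_loop(
    initial: Config,
    propose: Callable[[Config, History], Config],
    evaluate: Callable[[Config], float],
    rounds: int = 5,
) -> Tuple[Config, float]:
    """Keep the best configuration seen so far across build/test/improve rounds."""
    best, best_score = initial, evaluate(initial)
    history: History = [(best, best_score)]
    for _ in range(rounds):
        candidate = propose(best, history)  # build: derive a new candidate
        score = evaluate(candidate)         # test: run it and measure
        history.append((candidate, score))
        if score > best_score:              # improve: accept only real gains
            best, best_score = candidate, score
    return best, best_score


# Toy stand-ins so the loop can run without a model or a benchmark.
POOL = ["code_search", "file_edit", "shell", "test_runner"]
USEFUL = {"file_edit", "shell", "test_runner"}


def toy_propose(cfg: Config, history: History) -> Config:
    """Try adding the next extension from the pool, one per round."""
    ext = POOL[(len(history) - 1) % len(POOL)]
    extensions = list(cfg["extensions"])
    if ext not in extensions:
        extensions.append(ext)
    return {"extensions": extensions}


def toy_evaluate(cfg: Config) -> float:
    """Reward useful extensions, with a small cost per enabled extension."""
    enabled = cfg["extensions"]
    return len(USEFUL & set(enabled)) - 0.1 * len(enabled)
```

Running `improve_loop({"extensions": []}, toy_propose, toy_evaluate, rounds=4)` rejects the candidate that only adds `code_search` and then accepts the three extensions that raise the toy score.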

Results on SWE Bench Pro and SWE Bench Verified

The main evaluation uses SWE Bench Pro, which has 731 GitHub issues that require modifying real repositories until tests pass. All compared systems share the same repositories, tool environment and evaluation harness, so differences come from the scaffolds and models.

On SWE Bench Pro, the reported Resolve@1 scores are:

  • Claude 4 Sonnet with SWE Agent, 42.7
  • Claude 4 Sonnet with Confucius Code Agent, 45.5
  • Claude 4.5 Sonnet with SWE Agent, 43.6
  • Claude 4.5 Sonnet with Live SWE Agent, 45.8
  • Claude 4.5 Sonnet with Confucius Code Agent, 52.7
  • Claude 4.5 Opus with Anthropic system card scaffold, 52.0
  • Claude 4.5 Opus with Confucius Code Agent, 54.3

These results show that a strong scaffold with a mid-tier model can outperform a stronger model with a weaker scaffold: Claude 4.5 Sonnet with Confucius Code Agent reaches 52.7, while Claude 4.5 Opus with the Anthropic system card scaffold reaches 52.0.
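
The comparison can be checked directly from the scores listed above. The short script below only re-derives differences from those reported numbers; the helper names `scaffold_gain` and `model_gain` are introduced here for illustration.

```python
# Resolve@1 on SWE Bench Pro, copied from the list above: (model, scaffold) -> score
SCORES = {
    ("Claude 4 Sonnet", "SWE Agent"): 42.7,
    ("Claude 4 Sonnet", "Confucius Code Agent"): 45.5,
    ("Claude 4.5 Sonnet", "SWE Agent"): 43.6,
    ("Claude 4.5 Sonnet", "Live SWE Agent"): 45.8,
    ("Claude 4.5 Sonnet", "Confucius Code Agent"): 52.7,
    ("Claude 4.5 Opus", "Anthropic system card scaffold"): 52.0,
    ("Claude 4.5 Opus", "Confucius Code Agent"): 54.3,
}


def scaffold_gain(model: str) -> float:
    """Gap in points between the best and worst scaffold for a fixed model."""
    values = [score for (m, _), score in SCORES.items() if m == model]
    return round(max(values) - min(values), 1)


def model_gain(scaffold: str, weaker: str, stronger: str) -> float:
    """Gap in points between two models under the same scaffold."""
    return round(SCORES[(stronger, scaffold)] - SCORES[(weaker, scaffold)], 1)


if __name__ == "__main__":
    print(scaffold_gain("Claude 4.5 Sonnet"))
    print(model_gain("Confucius Code Agent", "Claude 4.5 Sonnet", "Claude 4.5 Opus"))
```

For Claude 4.5 Sonnet, switching scaffolds spans 9.1 points (43.6 to 52.7), while switching from Claude 4.5 Sonnet to Claude 4.5 Opus under Confucius Code Agent adds 1.6 points (52.7 to 54.3).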

On SWE Bench Verified, Confucius Code Agent with Claude 4 Sonnet reaches Resolve@1 74.6, compared to 66.6 for SWE Agent and 72.8 for OpenHands. A mini SWE Agent variant with Claude 4.5 Sonnet reaches 70.6, which is also below Confucius Code Agent with Claude 4 Sonnet.

The research team also reports performance as a function of edited file count. For tasks editing 1 to 2 files, Confucius Code Agent reaches 57.8 Resolve@1; for 3 to 4 files, 49.2; for 5 to 6 files, 44.1; for 7 to 10 files, 52.6; and for more than 10 files, 44.4. Resolve@1 stays between 44.1 and 57.8 across all buckets, which indicates relatively stable behavior on multi-file changes in large codebases.
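
A quick way to read these bucketed numbers is to compute their spread and check whether scores fall steadily as more files are edited. The snippet below uses only the figures reported above; the helper names are hypothetical.

```python
from typing import List, Tuple

# (edited file count bucket, Resolve@1) as reported above
BUCKETS: List[Tuple[str, float]] = [
    ("1-2", 57.8),
    ("3-4", 49.2),
    ("5-6", 44.1),
    ("7-10", 52.6),
    (">10", 44.4),
]


def spread(buckets: List[Tuple[str, float]]) -> float:
    """Difference in points between the best and worst bucket."""
    scores = [score for _, score in buckets]
    return round(max(scores) - min(scores), 1)


def is_monotonic_decreasing(buckets: List[Tuple[str, float]]) -> bool:
    """True only if every bucket scores no higher than the one before it."""
    scores = [score for _, score in buckets]
    return all(a >= b for a, b in zip(scores, scores[1:]))


def worst_bucket(buckets: List[Tuple[str, float]]) -> str:
    """Label of the bucket with the lowest score."""
    return min(buckets, key=lambda item: item[1])[0]
```

The spread is 13.7 points, and the trend is not monotonic: the 7 to 10 file bucket (52.6) scores higher than the 5 to 6 file bucket (44.1), so the drop is not a simple function of how many files a task touches.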

Key Takeaways

  • Scaffolding can outweigh model size: Confucius Code Agent shows that with strong scaffolding, Claude 4.5 Sonnet reaches 52.7 Resolve@1 on SWE-Bench-Pro, surpassing Claude 4.5 Opus with a weaker scaffold at 52.0.
  • Hierarchical working memory is essential for long horizon coding: The Confucius SDK orchestrator uses hierarchical working memory and context compression to manage long trajectories over large repositories, rather than relying on a simple rolling history.
  • Persistent notes act as effective cross session memory: On 151 SWE-Bench-Pro tasks with Claude 4.5 Sonnet, reusing structured notes reduces turns from 64 to 61, token usage from about 104k to 93k, and increases Resolve@1 from 53.0 to 54.4.
  • Tool configuration materially impacts success rates: On a 100 task SWE-Bench-Pro subset, moving from simple to richer tool handling with Claude 4.5 Sonnet increases Resolve@1 from 44.0 to 51.6, indicating that learned tool routing and recovery strategies are a major performance lever, not just an implementation detail.
  • Meta agent automates agent design and tuning: A meta agent iteratively proposes prompts, tool sets and configurations, then evaluates and edits them in a build, test, improve loop, and the production Confucius Code Agent is itself generated with this process rather than only manual tuning.



Asif Razzaq is the CEO of Marktechpost Media Inc.. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence Media Platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts of over 2 million monthly views, illustrating its popularity among audiences.


