mGrowTech

TikTok Researchers Introduce SWE-Perf: The First Benchmark for Repository-Level Code Performance Optimization

by Josh
July 21, 2025
in AI, Analytics and Automation

Introduction

As large language models (LLMs) advance in software engineering tasks—ranging from code generation to bug fixing—performance optimization remains an elusive frontier, especially at the repository level. To bridge this gap, researchers from TikTok and collaborating institutions have introduced SWE-Perf—the first benchmark specifically designed to evaluate the ability of LLMs to optimize code performance in real-world repositories.

Unlike prior benchmarks focused on correctness or function-level efficiency (e.g., SWE-bench, Mercury, EffiBench), SWE-Perf captures the complexity and contextual depth of repository-scale performance tuning. It provides a reproducible, quantitative foundation for studying and improving the performance optimization capabilities of modern LLMs.

Image source: https://arxiv.org/abs/2507.12415

Why SWE-Perf Is Needed

Real-world codebases are often large, modular, and intricately interdependent. Optimizing them for performance requires understanding of cross-file interactions, execution paths, and computational bottlenecks—challenges beyond the scope of isolated function-level datasets.

LLMs today are largely evaluated on tasks like syntax correction or small function transformations. But in production environments, performance tuning across repositories can yield more substantial system-wide benefits. SWE-Perf is explicitly built to measure LLM capabilities in such settings.

Dataset Construction

SWE-Perf is built from more than 100,000 pull requests mined from high-profile GitHub repositories. The final dataset covers 9 repositories and includes:

  • 140 curated instances demonstrating measurable and stable performance improvements.
  • Complete codebases pre- and post-optimization.
  • Target functions categorized as oracle (file-level) or realistic (repo-level).
  • Unit tests and Docker environments for reproducible execution and performance measurement.
  • Expert-authored patches used as gold standards.

To ensure validity, each unit test must:

  1. Pass before and after the patch.
  2. Show statistically significant runtime gains over 20 repetitions (Mann-Whitney U test, p < 0.1).

Performance is measured as the minimum performance gain (δ), which isolates the statistically supported improvement attributable to the patch and filters out measurement noise.
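The two validity conditions above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's own code: the function names are hypothetical, the test is assumed to be one-sided (patched runtimes smaller than original runtimes), and the p-value uses the normal approximation to the Mann-Whitney U statistic without tie or continuity correction. It does not compute δ.

```python
import math

def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def p_value_faster(before, after):
    """One-sided Mann-Whitney U p-value (normal approximation) for
    'runtimes in `after` tend to be smaller than runtimes in `before`'."""
    n1, n2 = len(before), len(after)
    ranks = average_ranks(list(before) + list(after))
    u_after = sum(ranks[n1:]) - n2 * (n2 + 1) / 2
    mean_u = n1 * n2 / 2
    std_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u_after - mean_u) / std_u
    return 0.5 * math.erfc(-z / math.sqrt(2))  # standard normal CDF at z

def is_valid_instance(passes_before, passes_after, before, after, alpha=0.1):
    """Hypothetical filter: the test passes on both versions and the
    runtime gain is significant at the given level."""
    return passes_before and passes_after and p_value_faster(before, after) < alpha

# Synthetic runtimes in seconds, 20 repetitions each (illustrative data only).
before = [1.00 + 0.01 * i for i in range(20)]
after = [0.80 + 0.01 * i for i in range(20)]
print(is_valid_instance(True, True, before, after))
```

In practice a library routine such as SciPy's mannwhitneyu would likely be preferable to a hand-rolled approximation; the stdlib version is used here only to keep the sketch self-contained.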

Benchmark Settings: Oracle vs. Realistic

  • Oracle Setting: The model receives only the target functions and corresponding files. This setting tests localized optimization skills.
  • Realistic Setting: The model is given an entire repository and must identify and optimize performance-critical paths autonomously. This is a closer analog to how human engineers work.
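The difference between the two settings comes down to what the model is shown. The sketch below is a hypothetical illustration of that idea, assuming a repository is represented as a mapping from file path to source text; the function name and data layout are not taken from the benchmark.

```python
def build_context(repo_files, target_files, setting):
    """Select the files exposed to the model under each setting.

    repo_files: dict mapping file path -> source text (assumed layout)
    target_files: set of paths that contain the target functions
    setting: "oracle" or "realistic"
    """
    if setting == "oracle":
        # Only the files that contain the target functions are visible.
        return {path: src for path, src in repo_files.items() if path in target_files}
    if setting == "realistic":
        # The whole repository is visible; the model must find the hot paths itself.
        return dict(repo_files)
    raise ValueError(f"unknown setting: {setting!r}")

# Toy repository with made-up file names.
repo = {
    "pkg/core.py": "def slow(): ...",
    "pkg/utils.py": "def helper(): ...",
    "tests/test_core.py": "def test_slow(): ...",
}
oracle_view = build_context(repo, {"pkg/core.py"}, "oracle")
realistic_view = build_context(repo, {"pkg/core.py"}, "realistic")
print(sorted(oracle_view), len(realistic_view))
```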

Evaluation Metrics

SWE-Perf defines a three-tier evaluation framework, reporting each metric independently:

  1. Apply: Can the model-generated patch be applied cleanly?
  2. Correctness: Does the patch preserve functional integrity (all unit tests pass)?
  3. Performance: Does the patch yield measurable runtime improvement?

The metrics are not aggregated into a single score, which allows a more nuanced view of the tradeoffs between patch applicability, functional correctness, and performance gains.
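Reporting the three tiers independently can be sketched as follows. This is an illustrative aggregation under stated assumptions, not the benchmark's scoring code: the InstanceResult record and summarize function are hypothetical, and a patch that fails to apply or breaks a test is assumed to contribute zero performance gain.

```python
from dataclasses import dataclass

@dataclass
class InstanceResult:
    applied: bool      # tier 1: patch applied cleanly
    tests_pass: bool   # tier 2: all unit tests still pass
    gain_pct: float    # tier 3: measured runtime gain, in percent

def summarize(results):
    """Return the three metrics separately, each as a percentage."""
    n = len(results)
    apply_rate = 100 * sum(r.applied for r in results) / n
    correctness = 100 * sum(r.applied and r.tests_pass for r in results) / n
    # Assumption: only applied, correct patches can contribute a gain.
    performance = sum(
        r.gain_pct if (r.applied and r.tests_pass) else 0.0 for r in results
    ) / n
    return {"apply": apply_rate, "correctness": correctness, "performance": performance}

# Made-up results for four instances.
results = [
    InstanceResult(applied=True, tests_pass=True, gain_pct=4.0),
    InstanceResult(applied=True, tests_pass=False, gain_pct=0.0),
    InstanceResult(applied=False, tests_pass=False, gain_pct=0.0),
    InstanceResult(applied=True, tests_pass=True, gain_pct=0.0),
]
summary = summarize(results)
print(summary)
```

Keeping the three numbers apart makes it visible when a model produces many applicable, correct patches that nonetheless deliver little speedup.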

Experimental Results

The benchmark evaluates several top-tier LLMs under both oracle and realistic settings:

Model                   | Setting   | Performance (%)
------------------------|-----------|----------------
Claude-4-opus           | Oracle    | 1.28
GPT-4o                  | Oracle    | 0.60
Gemini-2.5-Pro          | Oracle    | 1.48
Claude-3.7 (Agentless)  | Realistic | 0.41
Claude-3.7 (OpenHands)  | Realistic | 2.26
Expert (Human Patch)    | –         | 10.85

Notably, even the best-performing LLM configurations fall well short of human-level performance: the highest model score is 2.26%, against 10.85% for expert patches. The agent-based method OpenHands, built on Claude-3.7-sonnet, outperforms other configurations in the realistic setting but still lags behind expert-crafted optimizations.

Key Observations

  • Agent-based frameworks like OpenHands are better suited for complex, multi-step optimization, outperforming direct model prompts and pipeline-based approaches like Agentless.
  • Performance degrades as the number of target functions increases—LLMs struggle with broader optimization scopes.
  • LLMs exhibit limited scalability in long-runtime scenarios, where expert patches continue to show performance gains.
  • Patch analysis shows LLMs focus more on low-level code structures (e.g., imports, environment setup), while experts target high-level semantic abstractions for performance tuning.

Conclusion

SWE-Perf represents a pivotal step toward measuring and improving the performance optimization capabilities of LLMs in realistic software engineering workflows. It uncovers a significant capability gap between existing models and human experts, offering a strong foundation for future research in repository-scale performance tuning. As LLMs evolve, SWE-Perf can serve as a north star guiding them toward practical, production-ready software enhancement at scale.

