
Terminal-Bench 2.0 launches alongside Harbor, a new framework for testing agents in containers

by Josh
November 8, 2025



The developers of Terminal-Bench, a benchmark suite for evaluating the performance of autonomous AI agents on real-world terminal-based tasks, have released version 2.0 alongside Harbor, a new framework for testing, improving and optimizing AI agents in containerized environments.

The dual release aims to address long-standing pain points in testing and optimizing AI agents, particularly those built to operate autonomously in realistic developer environments.

With a more difficult and rigorously verified task set, Terminal-Bench 2.0 replaces version 1.0 as the standard for assessing frontier model capabilities.

Harbor, the accompanying runtime framework, enables developers and researchers to scale evaluations across thousands of cloud containers and integrates with both open-source and proprietary agents and training pipelines.

“Harbor is the package we wish we had had while making Terminal-Bench,” wrote co-creator Alex Shaw on X. “It’s for agent, model, and benchmark developers and researchers who want to evaluate and improve agents and models.”

Higher Bar, Cleaner Data

Terminal-Bench 1.0 saw rapid adoption after its release in May 2025, becoming a default benchmark for evaluating AI-powered agents that operate in developer-style terminal environments. These agents interact with systems through the command line, mimicking how developers work behind the scenes of the graphical user interface.

However, its broad scope came with inconsistencies. Several tasks were identified by the community as poorly specified or unstable due to external service changes.

Version 2.0 addresses those issues directly. The updated suite includes 89 tasks, each subjected to several hours of manual and LLM-assisted validation. The emphasis is on making tasks solvable, realistic, and clearly specified, raising the difficulty ceiling while improving reliability and reproducibility.

A notable example is the download-youtube task, which was removed or refactored in 2.0 due to its dependence on unstable third-party APIs.

“Astute Terminal-Bench fans may notice that SOTA performance is comparable to TB1.0 despite our claim that TB2.0 is harder,” Shaw noted on X. “We believe this is because task quality is substantially higher in the new benchmark.”

Harbor: Unified Rollouts at Scale

Alongside the benchmark update, the team launched Harbor, a new framework for running and evaluating agents in cloud-deployed containers.

Harbor supports large-scale rollout infrastructure, with compatibility for major providers like Daytona and Modal.

Designed to generalize across agent architectures, Harbor supports:

  • Evaluation of any container-installable agent

  • Scalable supervised fine-tuning (SFT) and reinforcement learning (RL) pipelines

  • Custom benchmark creation and deployment

  • Full integration with Terminal-Bench 2.0

Harbor was used internally to run tens of thousands of rollouts during the creation of the new benchmark. It is now publicly available via harborframework.com, with documentation for testing and submitting agents to the public leaderboard.

Early Results: GPT-5 Leads in Task Success

Initial results from the Terminal-Bench 2.0 leaderboard show OpenAI's Codex CLI (command line interface), a GPT-5-powered variant, in the lead with a 49.6% success rate, the highest among all agents tested so far.

Close behind are other GPT-5 variants and Claude Sonnet 4.5-based agents.

Top 5 Agent Results (Terminal-Bench 2.0):

  1. Codex CLI (GPT-5) — 49.6%

  2. Codex CLI (GPT-5-Codex) — 44.3%

  3. OpenHands (GPT-5) — 43.8%

  4. Terminus 2 (GPT-5-Codex) — 43.4%

  5. Terminus 2 (Claude Sonnet 4.5) — 42.8%

The close clustering among top models indicates active competition across platforms, with no single agent solving more than half the tasks.
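To make the clustering concrete, the short Python sketch below recomputes the gap between first and fifth place from the figures listed above. It also converts each rate into a rough task count, assuming the success rate can be read as a plain fraction of the 89 tasks. That reading is an assumption made for illustration, since the leaderboard may average over several attempts, and the helper names are hypothetical.

```python
# Top-5 success rates (percent) as listed in the article.
results = {
    "Codex CLI (GPT-5)": 49.6,
    "Codex CLI (GPT-5-Codex)": 44.3,
    "OpenHands (GPT-5)": 43.8,
    "Terminus 2 (GPT-5-Codex)": 43.4,
    "Terminus 2 (Claude Sonnet 4.5)": 42.8,
}

TASK_COUNT = 89  # number of tasks in Terminal-Bench 2.0, per the article


def spread(scores):
    """Gap in percentage points between the best and worst listed score."""
    return round(max(scores.values()) - min(scores.values()), 1)


def approx_tasks_solved(rate_percent, task_count=TASK_COUNT):
    """Rough task count, ASSUMING the rate is a simple fraction of all tasks."""
    return rate_percent / 100 * task_count


top_five_spread = spread(results)

for name, rate in results.items():
    solved = approx_tasks_solved(rate)
    print(f"{name}: {rate}% (roughly {solved:.1f} of {TASK_COUNT} tasks)")

print(f"Gap between 1st and 5th place: {top_five_spread} percentage points")
```

Under that reading, the leader completes roughly 44 of the 89 tasks, and all five agents sit within about seven percentage points of one another.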

Submission and Use

To test or submit an agent, users install Harbor and run the benchmark using simple CLI commands. Submissions to the leaderboard require five benchmark runs, and results can be emailed to the developers along with job directories for validation.

harbor run -d terminal-bench@2.0 -m "<model>" -a "<agent>" --n-attempts 5 --jobs-dir <path/to/output>
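For scripted submissions, the same invocation can be assembled programmatically. The sketch below is a minimal illustration, not part of Harbor itself. It uses only the flags that appear in the command above, assumes the two long options are written with double ASCII hyphens, and introduces a hypothetical helper name (`build_harbor_command`). The model, agent, and output path remain placeholders.

```python
import shlex


def build_harbor_command(model, agent, jobs_dir, n_attempts=5,
                         dataset="terminal-bench@2.0"):
    """Return the argument list for a Harbor benchmark run.

    Hypothetical helper: only the flags shown in the article are used.
    n_attempts defaults to 5 because leaderboard submissions require
    five benchmark runs.
    """
    return [
        "harbor", "run",
        "-d", dataset,
        "-m", model,
        "-a", agent,
        "--n-attempts", str(n_attempts),
        "--jobs-dir", jobs_dir,
    ]


# Placeholders kept exactly as in the article; substitute real values.
cmd = build_harbor_command("<model>", "<agent>", "<path/to/output>")

# Print a shell-safe rendering for inspection instead of executing it.
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would launch the run on a machine where Harbor is installed. Keeping the arguments as a list avoids shell quoting problems with model or agent names.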

Terminal-Bench 2.0 is already being integrated into research workflows focused on agentic reasoning, code generation, and tool use. According to co-creator Mike Merrill, a postdoctoral researcher at Stanford, a detailed preprint is in progress covering the verification process and design methodology behind the benchmark.

Aiming for Standardization

The combined release of Terminal-Bench 2.0 and Harbor marks a step toward more consistent and scalable agent evaluation infrastructure. As LLM agents proliferate in developer and operational environments, the need for controlled, reproducible testing has grown.

These tools offer a potential foundation for a unified evaluation stack that supports model improvement, environment simulation, and benchmark standardization across the AI ecosystem.


