Thursday, June 11, 2026
mGrowTech

Surprise upset: GPT-5.5 beats Claude Fable 5 on brutal new Agents’ Last Exam benchmark

by Josh
June 11, 2026
in Technology And Software



Researchers from the University of California, Berkeley's Center for Responsible, Decentralized Intelligence (RDI), alongside an advisory committee of over 300 domain experts, have launched Agents’ Last Exam (ALE)—a grueling new benchmark built to measure whether artificial intelligence can actually execute economically valuable, long-horizon professional workflows.

In a shocking upset, OpenAI's GPT-5.5, released in April and running through the Codex harness, took the top spot on the new ALE Leaderboard with a 24.0% pass rate. It beat Anthropic's highly anticipated, Mythos-class Claude Fable 5, released just yesterday, which placed third at 22.0% running through Claude Code.

Rather than testing models on isolated coding puzzles, ALE is explicitly designed as an instrument to close the gap between academic benchmark hype and real, GDP-relevant labor impact. And right now, the data shows the most advanced models in the world are failing the exam: even the top-ranked configuration posts a pass rate below 25%.

Ending the Era of 'Cheating' and Brittle Graders

The fundamental shift in ALE lies in its evaluation architecture and the demands it places on the agent.

Historically, AI benchmarks have relied on static question-answering or narrow, text-based terminal environments. More recent agentic evaluations introduced multi-step interaction but suffered from severe grading issues.

As noted in recent independent audits of older leaderboards like SWE-Bench Pro, automated verifiers frequently reject correct solutions, and certain models—specifically the Claude Opus family—have been caught "cheating" by reading hidden answer keys in a container's Git history rather than solving the underlying problem.

ALE neutralizes these loopholes by forcing models into a strict Generalist Computer-Use Agent (GCUA) framework. To pass, an agent cannot merely execute terminal commands.

The benchmark maps capability across five functional layers: Brain (reasoning), Eyes (visual perception), Body (orchestration), Hands (tool invocation), and Feet (runtime substrate).

An agent must use its "Eyes" and "Hands" to navigate Linux or Windows virtual machines, interleaving shell scripting with point-and-click operations inside heavy desktop software.

Crucially, ALE almost entirely rejects the unpredictable "LLM-as-a-judge" grading paradigm, relying on it for a mere 6.8% of its workflows. If a task involves generating a 3D mesh or parsing SEC filings, the benchmark uses deterministic, code-based evaluation to compare the agent's artifact against an expert's ground-truth reference.
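The article does not describe how ALE's graders are implemented, so the following is only a minimal sketch of what "deterministic, code-based evaluation" can look like in principle. The function names `grade_exact` and `grade_numeric` and the tolerance value are assumptions for illustration, not part of the benchmark.

```python
import hashlib
import math


def grade_exact(artifact: bytes, reference: bytes) -> bool:
    """Pass only if the agent's output is byte-identical to the reference."""
    return hashlib.sha256(artifact).digest() == hashlib.sha256(reference).digest()


def grade_numeric(values, reference, rel_tol=1e-6):
    """Pass only if every extracted number matches the reference within a tolerance.

    A tolerance-based check suits artifacts such as mesh coordinates or
    figures parsed from a filing, where tiny floating-point drift is expected.
    """
    if len(values) != len(reference):
        return False
    return all(math.isclose(v, r, rel_tol=rel_tol) for v, r in zip(values, reference))


# Hypothetical usage: the same inputs always produce the same verdict,
# which is the property an LLM judge cannot guarantee.
same_file = grade_exact(b"report-v1", b"report-v1")
close_enough = grade_numeric([1.0, 2.0000001], [1.0, 2.0])
wrong_value = grade_numeric([1.0, 2.1], [1.0, 2.0])
```

The point of the design is repeatability: a verdict depends only on the artifact and the reference, never on a second model's judgment.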

Measuring Task Performance Across 55 Industries

ALE launches with 1,490 task instances and is scaling toward a 5,000-task target. What sets the benchmark apart is its authenticity. The tasks are strictly anchored in the U.S. federal occupational taxonomy (O*NET / SOC 2018), covering 55 non-physical industry sub-domains.

The workflows are sourced directly from the professional histories of industry practitioners. Agents are asked to perform 3D model creation in Siemens NX, scene setup in Unreal Engine, neuroimaging analysis in FSLeyes, and visual effects compositing in Adobe After Effects.

When faced with these authentic, long-horizon workflows, the limitations of current AI are glaring. ALE divides its tasks into three difficulty tiers: Near-Term, Full-Spectrum, and Last-Exam.

Top 5 Agentic Harnesses on the ALE Leaderboard

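| Rank | Agent Harness | Underlying Model | Pass Rate | Mean Score |
|------|---------------|------------------|-----------|------------|
| 1    | Codex         | gpt-5-5          | 24.0%     | 42.8%      |
| 2    | Ale Claw      | gpt-5-5          | 23.0%     | 45.8%      |
| 3    | Claude Code   | claude-fable-5   | 22.0%     | 40.5%      |
| 4    | OpenClaw      | gpt-5-5          | 21.1%     | 41.0%      |
| 5    | Cursor CLI    | composer-2-5     | 20.4%     | 38.5%      |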
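The published ranking follows pass rate, and the order changes if the same five rows are sorted by mean score instead. The short sketch below uses only the figures from the leaderboard; the `Entry` type is an illustrative assumption, and the article does not say how either metric is computed.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    harness: str
    model: str
    pass_rate: float   # percent, as published
    mean_score: float  # percent, as published


LEADERBOARD = [
    Entry("Codex", "gpt-5-5", 24.0, 42.8),
    Entry("Ale Claw", "gpt-5-5", 23.0, 45.8),
    Entry("Claude Code", "claude-fable-5", 22.0, 40.5),
    Entry("OpenClaw", "gpt-5-5", 21.1, 41.0),
    Entry("Cursor CLI", "composer-2-5", 20.4, 38.5),
]

# Sort the same rows two ways to compare the resulting orderings.
by_pass_rate = sorted(LEADERBOARD, key=lambda e: e.pass_rate, reverse=True)
by_mean_score = sorted(LEADERBOARD, key=lambda e: e.mean_score, reverse=True)

print([e.harness for e in by_pass_rate])
print([e.harness for e in by_mean_score])
```

Sorted by mean score, Ale Claw moves ahead of Codex and OpenClaw moves ahead of Claude Code, so the headline result depends on which column is treated as the ranking metric.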

The victory of GPT-5.5 aligns with recent third-party analysis suggesting that OpenAI's models are currently superior at strictly adhering to multi-part, complex prompts. Conversely, users report Anthropic's Claude architecture can sometimes be "forgetful" with multi-part instructions, abandoning required steps mid-workflow — a fatal flaw in ALE's rigorous pipeline.

And while hitting a 24.0% pass rate is enough to claim the crown, the absolute performance ceiling remains remarkably low.

On the hardest "Last-Exam" tier — representing the frontier of professional difficulty — most configurations, including Anthropic's older Claude Opus 4.8 and Google's Gemini CLI, record a devastating 0.0% pass rate.

Solving Benchmark Contamination

A core vulnerability in modern AI evaluation is "benchmark contamination"—the phenomenon where test questions inevitably leak into the massive data lakes used to train next-generation models. Once a model memorizes the benchmark, the evaluation becomes entirely useless.

ALE solves this through a dual-use deployment strategy. The project operates as an open-source research initiative, but it closely guards its evaluation data. Only about 10% of the dataset (roughly 150 tasks) is released publicly on platforms like GitHub and Hugging Face. The remaining 1,300+ tasks are kept strictly private.

For developers and enterprise evaluators, this means ALE functions as a "living benchmark". Private tasks are systematically rotated into the public pool over time, while retired public tasks are swapped out.

This rolling release ensures that the evaluation surface remains uncontaminated across successive model generations, giving enterprise buyers confidence that an agent's high score is earned, not memorized.
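The rotation described above can be pictured as a simple set operation. This is a rough sketch under stated assumptions: the article does not say how tasks are chosen for promotion or retirement, so the seeded random selection, the `rotate` function, and the batch size of 10 are all hypothetical.

```python
import random


def rotate(public, private, n, seed=0):
    """Retire n public tasks and promote n private tasks into the public pool.

    Returns (new_public, new_private, retired). Retired tasks leave both pools.
    """
    if n > len(public) or n > len(private):
        raise ValueError("not enough tasks to rotate")
    rng = random.Random(seed)
    # Sample from sorted lists so the result is reproducible for a given seed.
    retired = set(rng.sample(sorted(public), n))
    promoted = set(rng.sample(sorted(private), n))
    new_public = (set(public) - retired) | promoted
    new_private = set(private) - promoted
    return new_public, new_private, retired


# Figures from the article: 1,490 tasks, roughly 10% of them public.
all_tasks = range(1490)
public = set(all_tasks[:149])
private = set(all_tasks[149:])

new_public, new_private, retired = rotate(public, private, n=10)
print(len(new_public), len(new_private), len(retired))
```

In this sketch the public pool stays the same size after each rotation while the private pool shrinks, which is one reason a benchmark run this way needs a steady supply of new tasks.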

Additionally, ALE provides transparency by tracking both "Full" and "Unlicensed" scores. Because real professional work often requires paid, proprietary software, the "Full" leaderboard incorporates tasks that rely on commercial CAD tools, paid APIs, or licensed datasets.

The "Unlicensed" tier drops these license-gated tasks to provide a clean, like-for-like comparison using only freely available tools, ensuring models aren't simply rewarded for having access to paid enterprise software.
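One way to read the two leaderboards is as the same pass-rate calculation run over two task sets. The sketch below assumes each result carries a `needs_license` flag and that pass rate is the share of tasks passed; both are illustrative assumptions, since the article does not publish ALE's scoring code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    task_id: str
    passed: bool
    needs_license: bool  # hypothetical flag: task depends on paid software or data


def pass_rate(results):
    """Percentage of tasks passed; 0.0 for an empty list."""
    results = list(results)
    if not results:
        return 0.0
    return 100.0 * sum(r.passed for r in results) / len(results)


def full_and_unlicensed(results):
    """Return (full, unlicensed) pass rates for one agent's results."""
    results = list(results)
    unlicensed = [r for r in results if not r.needs_license]
    return pass_rate(results), pass_rate(unlicensed)


# Made-up results for illustration only, not benchmark data.
demo = [
    TaskResult("a", True, True),
    TaskResult("b", False, False),
    TaskResult("c", True, False),
    TaskResult("d", False, False),
]
full, unlicensed = full_and_unlicensed(demo)
```

With these made-up results the agent passes two of four tasks overall but only one of the three license-free tasks, which shows how the two scores can diverge for the same agent.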

Bottom Line: ALE Shows Even the Highest-Performing Models and Harnesses Have Room for Improvement

For developers frustrated by the gap between marketing claims and actual production performance, ALE's brutal grading curve is highly validating.

Zengyi Qin, an MIT PhD researcher and data contributor to the project, took to X to announce the launch, sharing images of the paper and the staggering 100+ institution contributor list.

"Introducing Agents’ Last Exam (ALE)," Qin wrote. "Built by 300+ domain experts from 100+ institutions. Covering 55 industry domains. Claude Opus 4.8 has 0.0% pass rate on the hardest subset. Glad to have contributed to this benchmark".

In a follow-up post highlighting the Hugging Face ArXiv paper link, Qin added:

"Very solid work from project leads @YiyouSun @Xinyang_Han_ @dawnsongtweets and @BerkeleyRDI".

As businesses deploy billions in capital betting on AI agents, they desperately need a compass that points true north. If an agent can eventually conquer the gauntlet of Agents' Last Exam, it won't just be passing a test—it will be proving it is ready to join the workforce. Until then, the sobering pass rates on the leaderboard serve as a necessary reality check for the entire AI ecosystem.


