GPT-5.5 Outperforms Claude Fable 5 in New Agents’ Last Exam Benchmark

Quick answer

The new Agents’ Last Exam benchmark evaluates AI agents' ability to perform professional tasks across 55 industries, including 3D modeling and graphic design.

Researchers at the Center for Responsible Decentralized Intelligence (RDI) at the University of California, Berkeley, have introduced Agents’ Last Exam (ALE), a new benchmark designed to evaluate AI agents' ability to perform economically significant professional tasks. Unlike traditional tests, ALE focuses on real-world workflows, such as creating 3D models in Siemens NX, configuring scenes in Unreal Engine, and working with neuroimaging data in FSLeyes.

In ALE, models are assessed across five key parameters: logical reasoning, visual perception, task orchestration, tool usage, and runtime environment interaction. The benchmark covers 55 industries and includes 1,490 tasks based on real-world professional scenarios. ALE emphasizes deterministic evaluation: instead of using LLMs as judges, it employs code checks to compare AI-generated artifacts with reference solutions.
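The idea of a deterministic code check can be illustrated with a minimal Python sketch. It is not ALE's actual checker: the artifact fields (`vertex_count`, `volume_mm3`, `material`), the function names, and the tolerance are all hypothetical, chosen only to show how a produced artifact might be compared with a reference solution without an LLM judge.

```python
import math
from typing import Any


def values_match(actual: Any, expected: Any, rel_tol: float) -> bool:
    """Compare one field; numbers are compared within a relative tolerance."""
    # bool is a subclass of int in Python, so handle it before numeric checks.
    if isinstance(expected, bool) or isinstance(actual, bool):
        return actual is expected
    if isinstance(expected, (int, float)) and isinstance(actual, (int, float)):
        return math.isclose(actual, expected, rel_tol=rel_tol)
    return actual == expected


def check_artifact(produced: dict, reference: dict, rel_tol: float = 1e-6) -> bool:
    """Pass only if every field of the reference is matched by the artifact."""
    return all(
        key in produced and values_match(produced[key], expected, rel_tol)
        for key, expected in reference.items()
    )


def success_rate(results: list[bool]) -> float:
    """Share of tasks whose artifacts passed their checks."""
    return sum(results) / len(results) if results else 0.0


# Hypothetical reference solution and two agent outputs (illustrative values).
reference = {"vertex_count": 8, "volume_mm3": 1000.0, "material": "steel"}
good = {"vertex_count": 8, "volume_mm3": 1000.0000001, "material": "steel"}
bad = {"vertex_count": 8, "volume_mm3": 990.0, "material": "steel"}

results = [check_artifact(good, reference), check_artifact(bad, reference)]
print(success_rate(results))
```

Because the same inputs always yield the same verdict, a check like this can be rerun and audited, which is the property the benchmark's authors emphasize.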

The test results were unexpected: OpenAI’s GPT-5.5, operating via the Codex framework, claimed the top spot with a 24% success rate, outperforming Anthropic’s recently released Claude Fable 5, which scored 22%. On the most complex tasks, however, even the leading models performed poorly, and most models, including Claude Opus 4.8 and Gemini CLI, failed entirely. This underscores the existing gap between academic benchmarks and real-world AI performance in professional settings.

To keep evaluations objective and prevent benchmark contamination, ALE uses a task rotation system: only 10% of the dataset is publicly available, while the remaining tasks are kept private and updated regularly. This is intended to keep results reliable for corporate clients.
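One way to picture a fixed public/private split is the following Python sketch. It is an assumption for illustration, not ALE's documented mechanism: the hash-based bucketing, the `salt` value, and the task ID format are hypothetical, and the regular refresh of private tasks is not modeled.

```python
import hashlib


def is_public(task_id: str, salt: str, public_percent: int = 10) -> bool:
    """Assign a task to the public split based on a stable hash bucket."""
    digest = hashlib.sha256(f"{salt}:{task_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # bucket in 0..99
    return bucket < public_percent


def split_tasks(task_ids: list[str], salt: str) -> tuple[list[str], list[str]]:
    """Return (public, private) lists; the same inputs give the same split."""
    public = [t for t in task_ids if is_public(t, salt)]
    private = [t for t in task_ids if not is_public(t, salt)]
    return public, private


# Hypothetical task IDs; 1,490 matches the task count reported for ALE.
task_ids = [f"task-{i:04d}" for i in range(1490)]
public, private = split_tasks(task_ids, salt="example-salt")
print(len(public), len(private))
```

Hashing gives roughly, not exactly, 10% public tasks. An exact quota would need a different scheme, such as sorting tasks by hash and taking the first tenth.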

Common questions

What is Agents’ Last Exam (ALE)?
ALE is a benchmark developed by researchers at UC Berkeley to assess AI agents' ability to perform complex professional tasks across 55 industries, including 3D modeling and graphic design.
Which models performed best in ALE?
GPT-5.5 from OpenAI took first place with a 24% success rate, surpassing Claude Fable 5 from Anthropic, which scored 22%. However, even top models failed to solve the most complex tasks.
Why are ALE benchmark results important for businesses?
ALE measures real-world AI performance in professional workflows, not just academic tests. Low success rates indicate that AI agents are not yet ready for full-scale business integration.

Why trust this

Prepared by the V-Help editorial team from the primary source with a published date.

Published by: V-Help.ru news desk

Source: VentureBeat