Artificial intelligence

Amazon to Unveil Framework for Building Trustworthy AI Agents at VB Transform 2026

June 24, 2026

Quick answer

Amazon has developed a framework for evaluating AI agent reliability that emphasizes stability, resilience, predictability, and security.

At VB Transform 2026, taking place July 14–15 in Menlo Park, Amazon will present its framework for building reliable AI agents. The company proposes a new approach to evaluating agent effectiveness that goes beyond conventional evaluation benchmarks (evals). According to Brian Silverthorn, Director of the AGI Autonomy Research Lab at Amazon, existing standards often fail to reflect how reliably AI agents behave in dynamic business scenarios.

Amazon’s framework focuses on four key aspects: stability, resilience, predictability, and security. Instead of relying on built-in model safeguards, the company uses isolated environments (sandboxes) where AI agents propose changes that are reviewed by humans before implementation. This approach is particularly critical for sensitive sectors like finance, where errors can have serious consequences.
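
The propose-then-review pattern described above can be illustrated with a minimal Python sketch. It is not Amazon's implementation, which the article does not detail; the names `Sandbox`, `Proposal`, `Status`, and the `payment_limit` setting are all hypothetical and chosen only for illustration. The idea shown is that the agent can file proposals but cannot touch live state, and only a human decision applies or discards a change.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"


@dataclass(eq=False)  # compare proposals by identity, not by field values
class Proposal:
    """A change the agent wants to make, held until a human decides."""
    key: str
    new_value: str
    status: Status = Status.PENDING


@dataclass
class Sandbox:
    """Hypothetical sandbox: the agent can only write to a review queue."""
    live: dict = field(default_factory=dict)   # the real system state
    queue: list = field(default_factory=list)  # proposals awaiting review

    def propose(self, key: str, new_value: str) -> Proposal:
        # The agent never modifies `live` directly; it only files a proposal.
        proposal = Proposal(key, new_value)
        self.queue.append(proposal)
        return proposal

    def review(self, proposal: Proposal, approved: bool) -> None:
        # A human reviewer's decision is the only path to the live state.
        if proposal.status is not Status.PENDING:
            raise ValueError("proposal has already been reviewed")
        if approved:
            self.live[proposal.key] = proposal.new_value
            proposal.status = Status.APPROVED
        else:
            proposal.status = Status.REJECTED
        self.queue.remove(proposal)


sandbox = Sandbox(live={"payment_limit": "1000"})
raise_limit = sandbox.propose("payment_limit", "5000")
print(sandbox.live["payment_limit"])  # prints 1000: nothing changes before review
sandbox.review(raise_limit, approved=False)
print(raise_limit.status.value)       # prints rejected
```

In this sketch a rejected proposal leaves the live value untouched, which is the property that matters in sensitive settings such as finance: an agent's mistake stays in the queue unless a person approves it.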

A VentureBeat survey of over 100 IT executives revealed that only 4% are fully confident in built-in AI model safeguards. Primary concerns include unauthorized data access (40%) and prompt manipulation (27%). The conference will also feature a presentation by Waymo on AI applications in the physical world, highlighting growing interest in secure and scalable AI-driven solutions.

Common questions

Why are traditional AI agent evaluation metrics insufficient?

Conventional evaluation benchmarks (evals) provide static performance assessments but do not account for predictability across diverse scenarios and environments. This limits their usefulness for judging how reliably AI agents perform real-world business tasks.

How does Amazon propose to enhance trust in AI agents?

Amazon uses a sandbox approach in which AI agents suggest changes that humans review before implementation. This reduces the risk of unauthorized data access and manipulation.

What are the primary concerns businesses have about AI agents?

A VentureBeat study found that 40% of IT executives fear unauthorized data access, while 27% worry about prompt manipulation or injection. Only 4% are willing to rely solely on built-in model safeguards.
