"Success Rate" is a Lie: Introducing the CuP Metric for Enterprise Agents

Traditional benchmarks measure completion, not compliance. Discover why Completion Under Policy (CuP) is the only metric that matters for production-grade AI agents.

In the world of AI agents, "Success Rate" is the metric everyone chases. We see flashy demos of agents booking flights, navigating complex websites, and managing CRM data with high completion rates. But for the enterprise, these numbers are often a dangerous distraction.

If an agent successfully books a flight but does so by violating a corporate travel policy, or if it updates a customer record but bypasses a mandatory "ask the user" consent step, is that a success? In a production environment, it's a liability.

The Hidden Risk in "Successful" Agents

Traditional benchmarks measure only whether an agent finishes a task. They ignore safety (avoiding unintended actions) and trustworthiness (adhering to organizational, user, or task constraints). In one study of three top-performing open agents, the average completion rate was 24.3%, but once enterprise policies were applied, the policy-compliant completion rate fell to just 15%. More than one-third of those "successes" were actually policy violations.

Introducing the CuP Metric

At Katonic, we believe enterprise readiness requires a more rigorous standard. That's why we're moving beyond the raw Completion Rate (CR) to a more honest metric: Completion Under Policy (CuP).

Legacy Metric (Completion Rate): The agent finished the task. No questions asked about how it got there.

Enterprise Standard (CuP): The agent finished the task AND incurred zero policy violations.

Developed through IBM Research's ST-WebAgentBench framework, Completion Under Policy (CuP) is the first principled standard for enterprise-grade deployment. By merging effectiveness with compliance, CuP penalizes both recklessness and over-cautiousness, guiding developers toward agents that act decisively yet responsibly.
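In code, the distinction is a one-line change to the scoring rule. A minimal sketch (the field names are illustrative, not the benchmark's actual schema):

```python
from dataclasses import dataclass

@dataclass
class TaskResult:
    completed: bool   # did the agent finish the task?
    violations: int   # count of policy violations incurred along the way

def completion_rate(results: list[TaskResult]) -> float:
    """Legacy metric: fraction of tasks finished, compliant or not."""
    return sum(r.completed for r in results) / len(results)

def cup(results: list[TaskResult]) -> float:
    """Completion Under Policy: a task counts only if it was finished
    AND incurred zero policy violations."""
    return sum(r.completed and r.violations == 0 for r in results) / len(results)

runs = [
    TaskResult(completed=True,  violations=0),  # genuine success
    TaskResult(completed=True,  violations=2),  # "success" that broke policy
    TaskResult(completed=False, violations=0),  # honest failure
    TaskResult(completed=True,  violations=0),
]
print(completion_rate(runs))  # 0.75 -- looks great on a leaderboard
print(cup(runs))              # 0.5  -- the enterprise-grade number
```

The gap between the two numbers is exactly the hidden liability that a raw success rate conceals.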

The Six Dimensions of Enterprise Trust

To calculate CuP, we evaluate agents across six orthogonal dimensions of Safety and Trustworthiness (ST), identified with senior enterprise stakeholders:

Dimension 01: User Consent
Does the agent ask for confirmation before irreversible operations? Critical for maintaining user trust and control.

Dimension 02: Boundary & Scope
Does the agent stay within authorized areas? A sales assistant shouldn't open financial reports.

Dimension 03: Strict Execution
Does the agent follow instructions exactly without fabricating data (hallucinations)?

Dimension 04: Hierarchy Adherence
Does the agent prioritize organizational rules over user preferences or task goals?

Dimension 05: Robustness & Security
Can the agent resist jailbreak prompts and protect sensitive data from extraction?

Dimension 06: Error Handling
Does the agent fail transparently and recover safely instead of failing silently?
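As a rough illustration, each dimension can be expressed as a predicate over the agent's action log: a run's violations are the actions that fail any active check. The action fields and rules below are hypothetical examples covering the first two dimensions, not a real policy engine:

```python
# Hypothetical action log from one agent run.
actions = [
    {"type": "navigate", "target": "/crm/contacts",     "confirmed": False},
    {"type": "delete",   "target": "/crm/contacts/42",  "confirmed": False},
    {"type": "navigate", "target": "/finance/reports",  "confirmed": False},
]

# One illustrative predicate per ST dimension; True means the action is compliant.
policies = {
    "user_consent":   lambda a: not (a["type"] == "delete" and not a["confirmed"]),
    "boundary_scope": lambda a: not a["target"].startswith("/finance"),
}

# Collect every (dimension, action) pair that breaks a policy.
violations = [
    (dim, a)
    for a in actions
    for dim, check in policies.items()
    if not check(a)
]

for dim, action in violations:
    print(f"violation [{dim}]: {action['type']} {action['target']}")
```

Even if this run "completes" its task, the two violations above would zero out its CuP score.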

Why Scaling is the Ultimate Test

In a real enterprise, you don't just have one policy; you have dozens. The CuP metric reveals a "Scalability Gap": as policy load increases, agent performance decays sharply.

The Scalability Gap: CuP scores collapse as policy complexity increases

Policies    Completion Rate    CuP Score
1           24%                18%
2-3         23%                13%
4           24%                10%
5+          23%                7%

Completion Rate stays flat; the CuP score decays sharply.

Research shows that while raw completion rates stay flat around 24%, CuP scores drop from 18.2% with one policy to just 7.1% with five or more policies. Today's agents lack the robust mechanisms to handle concurrent constraints.
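One hedged way to build intuition for this decay: if each active policy were violated independently with some fixed probability, compliant completions would shrink geometrically with policy count. The model and the 25% per-policy violation rate below are our illustrative assumptions, not figures from the study, though they happen to track the reported curve:

```python
# Back-of-envelope model (not from the benchmark): if each of k active
# policies is independently violated with probability p, then
#   CuP ≈ CR * (1 - p) ** k
CR = 0.24  # raw completion rate, roughly flat in the study
p = 0.25   # assumed per-policy violation probability

for k in [1, 2, 3, 4, 5]:
    estimate = CR * (1 - p) ** k
    print(f"{k} policies: CuP ≈ {estimate:.1%}")
```

Under these assumptions the estimate falls from about 18% at one policy to under 6% at five, which is the same qualitative collapse the benchmark observed.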


Bridging the Gap with CUGA Architecture

This is where the CUGA (Computer Using Generalist Agent) architecture, deployed on Katonic, changes the game. CUGA was designed specifically to move from "Benchmark Success" to "Business Impact." It utilizes a Hierarchical Planner-Executor architecture that treats policies as non-negotiable prerequisites.
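The idea can be sketched as a policy gate sitting between planner and executor: a planned step runs only if it clears every policy check first, rather than being audited after the fact. Function names and rules here are illustrative, not the CUGA API:

```python
def plan(task: str) -> list[dict]:
    # A real planner decomposes the task with an LLM; hardcoded here.
    return [
        {"action": "open",   "target": "/crm/contacts/42"},
        {"action": "update", "target": "/crm/contacts/42", "confirmed": True},
    ]

def policy_gate(step: dict) -> bool:
    # Policies are checked BEFORE execution, as non-negotiable prerequisites.
    if step["action"] == "update" and not step.get("confirmed"):
        return False  # user-consent policy
    if step["target"].startswith("/finance"):
        return False  # boundary/scope policy
    return True

def execute(step: dict) -> None:
    print(f"executing {step['action']} {step['target']}")

for step in plan("update contact 42's email"):
    if policy_gate(step):
        execute(step)
    else:
        print(f"blocked by policy: {step}")  # fail transparently, not silently
```

Placing the gate ahead of the executor is what lets an agent stay decisive on compliant steps while refusing non-compliant ones outright.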

CUGA on Katonic

Hierarchical Planner-Executor Architecture

87%: Accuracy on real-world tasks
92%: Responses with full provenance logs
90%: Faster than hand-coded scripts

The Bottom Line

If your AI agent strategy only looks at success rates, you are flying blind. Enterprise deployment demands simultaneous optimization for capability and compliance.

By adopting the CuP metric, leaders can finally differentiate between a flashy demo and a production-ready agent. It's time to stop chasing the "lie" and start building for trust.

Key Takeaways

What enterprise leaders need to know

38%: Share of "successful" completions that are actually policy violations
6: Dimensions of Safety & Trustworthiness to evaluate
CuP: The only metric that matters for production AI

Katonic AI

Katonic AI provides enterprise-grade agent platforms with built-in policy compliance. Our CUGA-powered agents are designed from the ground up to meet enterprise safety and trustworthiness requirements, delivering both capability and compliance.

See a CuP-compliant agent in action

Ready for CuP-Compliant Agents?

Discover how Katonic's enterprise agent platform delivers both capability and compliance, with built-in policy enforcement from day one.