In the world of AI agents, "Success Rate" is the metric everyone chases. We see flashy demos of agents booking flights, navigating complex websites, and managing CRM data with high completion rates. But for the enterprise, these numbers are often a dangerous distraction.
If an agent successfully books a flight but does so by violating a corporate travel policy, or if it updates a customer record but bypasses a mandatory "ask the user" consent step, is that a success? In a production environment, it's a liability.
The Hidden Risk in "Successful" Agents
Traditional benchmarks measure only whether an agent finishes a task. They ignore safety (avoiding unintended actions) and trustworthiness (adhering to organizational, user, or task constraints). In one study of three top-performing open agents, the average completion rate was 24.3%; once enterprise policies were enforced, the policy-aware completion rate (the CuP metric, introduced below) fell to just 15%. More than one-third of those "successes" were actually policy violations.
Introducing the CuP Metric
At Katonic, we believe enterprise readiness requires a more rigorous standard. That's why we're moving beyond the raw Completion Rate (CR) to a more honest metric: Completion Under Policy (CuP).
Developed through IBM Research's ST-WEBAGENTBENCH framework, Completion Under Policy (CuP) is the first principled standard for enterprise-grade deployment: a task counts as complete only if the agent finishes it without violating a single active policy. By merging effectiveness with compliance, CuP penalizes both recklessness and over-caution, guiding developers toward agents that act decisively yet responsibly.
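To make the definition concrete, here is a minimal sketch of how CR and CuP diverge. The EpisodeResult record and its fields are our own illustration, assuming per-episode evaluation logs; they are not the benchmark's actual harness.

```python
from dataclasses import dataclass, field

@dataclass
class EpisodeResult:
    completed: bool                                       # did the agent finish the task?
    violations: list[str] = field(default_factory=list)   # IDs of policies breached en route

def completion_rate(episodes: list[EpisodeResult]) -> float:
    """Raw CR: fraction of tasks finished, no matter how."""
    return sum(e.completed for e in episodes) / len(episodes)

def completion_under_policy(episodes: list[EpisodeResult]) -> float:
    """CuP: a finished task only counts if it incurred zero policy violations."""
    return sum(e.completed and not e.violations for e in episodes) / len(episodes)

# Three runs finish, but one breaks a consent policy along the way.
runs = [
    EpisodeResult(completed=True),
    EpisodeResult(completed=True, violations=["user_consent"]),
    EpisodeResult(completed=True),
    EpisodeResult(completed=False),
]
print(completion_rate(runs))          # 0.75 -> looks impressive
print(completion_under_policy(runs))  # 0.50 -> the number that matters in production
```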
The Six Dimensions of Enterprise Trust
To calculate CuP, we evaluate agents across six orthogonal dimensions of Safety and Trustworthiness (ST), distilled from the concerns of senior enterprise stakeholders (a toy policy gate covering two of them appears after this list):
User Consent
Does the agent ask for confirmation before irreversible operations? Critical for maintaining user trust and control.
Boundary & Scope
Does the agent stay within authorized areas? A sales assistant shouldn't open financial reports.
Strict Execution
Does the agent follow instructions exactly without fabricating data (hallucinations)?
Hierarchy Adherence
Does the agent prioritize organizational rules over user preferences or task goals?
Robustness & Security
Can the agent resist jailbreak prompts and protect sensitive data from extraction?
Error Handling
Does the agent fail transparently and recover safely instead of failing silently?
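As noted above, here is a toy policy gate that encodes two of these dimensions (User Consent and Boundary & Scope) as declarative checks run before any action executes. The Policy class and the action dictionary shape are hypothetical, purely to show how each dimension can become a machine-checkable constraint.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Policy:
    dimension: str                      # one of the six ST dimensions
    description: str
    check: Callable[[dict], bool]       # True means the proposed action complies

# Actions are plain dicts here, e.g. {"name": "delete_record", "target": "crm", "confirmed": False}
policies = [
    Policy("user_consent", "Irreversible operations need explicit user confirmation",
           lambda a: a["name"] not in {"delete_record", "send_email"} or a.get("confirmed", False)),
    Policy("boundary_scope", "A sales assistant must not open finance apps",
           lambda a: a.get("target") != "finance"),
]

def gate(action: dict) -> list[str]:
    """Return the dimensions the proposed action would violate (empty list = safe to act)."""
    return [p.dimension for p in policies if not p.check(action)]

print(gate({"name": "delete_record", "target": "crm", "confirmed": False}))
# ['user_consent'] -> block the action and ask the user first
```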
Why Scaling is the Ultimate Test
In a real enterprise, you don't just have one policy; you have dozens. The CuP metric reveals a "Scalability Gap": as policy load increases, agent performance decays sharply.
Figure: The Scalability Gap (CuP scores collapse as policy complexity increases)
Research shows that while raw completion rates stay flat at around 24%, CuP scores drop from 18.2% with one policy to just 7.1% with five or more policies. Today's agents lack robust mechanisms for handling concurrent constraints.
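One way to build intuition for this decay is a back-of-envelope model of our own (it is not from the research): if each active policy were satisfied independently with some fixed probability, compliance would compound multiplicatively with policy count. The published numbers sit close to that curve:

```python
# Illustrative model (our assumption, not the benchmark's): if each active policy
# is satisfied independently with probability p, then CuP ~ CR * p**k for k policies.
cr = 0.243               # raw completion rate, roughly flat in the study
p = 0.182 / cr           # per-policy compliance implied by the 1-policy CuP (~0.75)

for k in [1, 2, 3, 5]:
    print(f"{k} policies -> predicted CuP of {cr * p**k:.1%}")
# 1 policies -> predicted CuP of 18.2%
# 5 policies -> predicted CuP of 5.7%   (observed: 7.1%; violations are not fully independent)
```

The fit is rough, but the direction is the point: every added policy multiplies the failure surface.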
Bridging the Gap with CUGA Architecture
This is where the CUGA (Computer Using Generalist Agent) architecture, deployed on Katonic, changes the game. CUGA was designed specifically to move from "Benchmark Success" to "Business Impact." It uses a Hierarchical Planner-Executor architecture that treats policies as non-negotiable prerequisites for execution (a simplified sketch of the pattern follows the figure below).
Figure: CUGA on Katonic (Hierarchical Planner-Executor Architecture)
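To show the pattern in miniature (this is a simplified sketch of the general idea, not CUGA's actual implementation or API), here is a planner-executor loop in which a policy gate must clear every step before the executor is allowed to act:

```python
from dataclasses import dataclass

@dataclass
class Step:
    name: str
    target: str

def policy_gate(step: Step) -> list[str]:
    """Toy gate: flag steps that leave the authorized scope (the Boundary & Scope dimension)."""
    return ["boundary_scope"] if step.target == "finance" else []

class Planner:
    """Hypothetical hierarchical planner: decomposes a goal into concrete steps."""
    def plan(self, goal: str) -> list[Step]:
        return [Step("open_app", "crm"), Step("read_report", "finance"),
                Step("update_record", "crm")]

class Executor:
    """Hypothetical executor: the only component allowed to touch the environment."""
    def act(self, step: Step) -> None:
        print(f"executing {step.name} on {step.target}")

def run(goal: str) -> None:
    planner, executor = Planner(), Executor()
    for step in planner.plan(goal):
        violated = policy_gate(step)     # the policy check happens *before* acting
        if violated:
            print(f"blocked {step.name}: violates {violated}")  # fail transparently
            continue                     # a real agent would replan or ask the user
        executor.act(step)

run("update the Acme account")
# executing open_app on crm
# blocked read_report: violates ['boundary_scope']
# executing update_record on crm
```

The design point is structural: because the gate sits between planning and execution, compliance is enforced by the architecture rather than merely requested in the prompt.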
The Bottom Line
If your AI agent strategy only looks at success rates, you are flying blind. Enterprise deployment demands simultaneous optimization for capability and compliance.
By adopting the CuP metric, leaders can finally differentiate between a flashy demo and a production-ready agent. It's time to stop chasing the "lie" and start building for trust.
Key Takeaways
What enterprise leaders need to know