Now Available

CUGA Enterprise Agent on Katonic

An open-source generalist agent framework from IBM Research, purpose-built for enterprise automation. CUGA combines ReAct, CodeAct, and Planner-Executor patterns into a modular architecture enabling trustworthy, policy-aware, and composable automation across web interfaces, APIs, and enterprise systems.

WebArena Benchmark: 61.7%
AppWorld Benchmark: 48.2%
API Integrations: 457
CUGA Agent - Katonic Studio (live demo)
User Task: "Get top accounts by revenue from digital sales, calculate Q4 growth, and draft an executive summary email."
Task Planning: Decomposed into 4 subtasks (Done)
Code Generation: Revenue calculation script (Done)
Secure Execution: Sandbox environment (Running)
Email Draft: Executive summary ready (Pending)
What is CUGA?

The Computer Using Generalist Agent

CUGA is IBM Research's open-source AI agent designed for complex enterprise automation, from multi-step workflows to code execution and API orchestration. It ranks #1 on the WebArena and AppWorld benchmarks.

Stop building agents from scratch. Start with a generalist.

Building domain-specific enterprise agents is complex: it requires orchestration, planning logic, safety policies, evaluation, and continuous improvement. CUGA abstracts this complexity with a Planner-Executor architecture built on LangGraph, which enables cyclic graphs for retry loops and dynamic re-planning.

CodeAct Pattern: writes and executes Python code (not just JSON tool calls) for complex logic, loops, and data transformation
Task Ledger: a persistent record of execution state that enables dynamic re-planning when step outputs invalidate future steps
Model Agnostic: works with GPT-4, IBM Granite, Mistral, LLaMA, and Azure OpenAI. No vendor lock-in.
Glass-Box Transparency: every decision is auditable. Critical for regulated industries that can't deploy black-box agents.
CUGA Planner-Executor Architecture
1. Task Analyzer + Plan Controller: Interprets intent, decomposes into Task Ledger
2. API Agent (CodeAct): Shortlister → Code Planner → Coder → Reflection
3. Browser Agent (Computer Use): Browser Planner → Action Agent → QA Agent
4. ALTK Post-Processing: JSON Processor, RAG Repair, Silent Review
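
To make the Task Ledger and re-planning idea more concrete, here is a minimal Python sketch. It is an illustration only, not CUGA's implementation: the `Step` and `TaskLedger` classes, the `run` loop, and the `replan` callback are all hypothetical names invented for this example. It shows a ledger recording each step's outcome and a planner revising the remaining steps when one fails.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Step:
    name: str
    status: str = "pending"  # pending | done | failed
    output: Optional[object] = None


@dataclass
class TaskLedger:
    """Hypothetical ledger: an ordered record of steps and their outcomes."""
    steps: List[Step] = field(default_factory=list)

    def next_pending(self) -> Optional[Step]:
        return next((s for s in self.steps if s.status == "pending"), None)


def run(ledger: TaskLedger,
        executors: Dict[str, Callable[[TaskLedger], object]],
        replan: Callable[[TaskLedger, Step], List[Step]],
        max_iterations: int = 10) -> TaskLedger:
    """Execute pending steps; on failure, replace the remaining plan."""
    for _ in range(max_iterations):  # bound the loop so re-planning cannot spin forever
        step = ledger.next_pending()
        if step is None:
            break
        try:
            step.output = executors[step.name](ledger)
            step.status = "done"
        except Exception as exc:
            step.status = "failed"
            step.output = str(exc)
            # Keep finished and failed steps for auditing, drop stale pending ones
            kept = [s for s in ledger.steps if s.status != "pending"]
            ledger.steps = kept + replan(ledger, step)
    return ledger


# Stand-in executors for the example (no real APIs are called)
def fetch_accounts(ledger):
    raise RuntimeError("primary API unavailable")

def fetch_accounts_backup(ledger):
    return [("Acme", 120), ("Globex", 90)]

def top_account(ledger):
    accounts = next(s.output for s in ledger.steps
                    if s.status == "done" and s.name.startswith("fetch"))
    return max(accounts, key=lambda a: a[1])[0]

executors = {
    "fetch_accounts": fetch_accounts,
    "fetch_accounts_backup": fetch_accounts_backup,
    "top_account": top_account,
}

def replan(ledger, failed_step):
    # A real planner would consult an LLM; this one hard-codes a fallback plan
    return [Step("fetch_accounts_backup"), Step("top_account")]

ledger = run(TaskLedger([Step("fetch_accounts"), Step("top_account")]),
             executors, replan)
```

After the run, the ledger holds the failed first attempt alongside the two completed fallback steps, which is the kind of record that makes a plan auditable after the fact.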

See CUGA in Action

Watch how CUGA automates complex enterprise workflows

Hybrid Task Execution

Example Task: get top account by revenue from digital sales, then add it to the current page

Hybrid task execution on web and API

Human in the Loop

Watch CUGA pause for human approval during critical decision points

Example Task: get best accounts

Benchmark Results

Ranked #1 on both WebArena and AppWorld leaderboards - beating OpenAI Operator, Anthropic, and Google.

WebArena: 61.7% (#1 on Leaderboard)
Complex autonomous web agent benchmark across e-commerce, CMS, forums, and dev platforms.
AppWorld: 48.2% (#1 on Leaderboard)
750 real-world tasks across 457 APIs. Comprehensive test of API-based agent capabilities.

WebArena Leaderboard

Rank | Agent | Score | License
#1 | IBM CUGA | 61.7% | Open Source
#2 | OpenAI Operator | 58.1% | Closed
#3 | Autonomous Web Agent | 57.1% | Closed
#4 | ScribeAgent + GPT-4o | 53.0% | Closed

Why this matters: CUGA outperforms OpenAI's Operator by 3.6 percentage points while being fully open source. For regulated industries that need to inspect agent decisions, this is the only production-ready option.

Proven in Production

Enterprise Pilot Results

Real metrics from IBM's BPO Talent Acquisition pilot deployment - not synthetic benchmarks.

Task Accuracy: 87% (26 enterprise analytics tasks)
Dev Time Reduction: ~90% (vs. specialized agents)
Dev Cost Reduction: ~50% (vs. custom builds)
Provenance Logs: 95% (full audit trail coverage)
Reproducibility: 4.6/5 (analyst-reported consistency)
Avg Latency: 11.2s (per query response time)
"CUGA saved 20-30 minutes of manual dashboard comparisons per query. It freed time for actual decision-making."
— BPO Talent Acquisition Architects, IBM Consulting

Research Papers

Explore the research behind CUGA's architecture and enterprise deployment

Towards Enterprise-Ready Computer Using Generalist Agent

Our evolutionary approach to building enterprise-ready agentic systems, achieving state-of-the-art performance on WebArena and AppWorld through systematic evaluation, analysis, and refinement.


From Benchmarks to Business Impact: Deploying IBM Generalist Agent in Enterprise Production

Evidence from deploying CUGA in enterprise production, including architectural modifications for auditability, safety, and governance.


ST-WEBAGENTBENCH: A Benchmark for Evaluating Safety and Trustworthiness in Web Agents

A configurable benchmark suite with 222 tasks for evaluating web agent safety and trustworthiness across enterprise scenarios, introducing the Completion Under Policy (CuP) metric.

Agent Lifecycle Toolkit

Self-Healing Reliability with ALTK

CUGA's secret weapon: the Agent Lifecycle Toolkit (ALTK) provides the "immune system" that turns fragile prototypes into resilient enterprise systems. It reduced parsing-related failures by more than one-third in IBM pilot runs.

The ALTK Philosophy

Reliability cannot be "prompted" into an LLM; it must be engineered around it. ALTK intervenes at three critical stages: Pre-LLM (before reasoning), Pre-Tool (before execution), and Post-Tool (after results). This cycle of prompt → call → validation → reflection/replan reduced parsing-related failures by more than one-third in IBM pilot runs.
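
As a rough illustration of the three intervention points, the sketch below wires hooks around a single agent step. It is not ALTK's API: the `Lifecycle` class, the stage names `pre_llm`, `pre_tool`, and `post_tool`, and the stand-in `fake_llm` and `fake_tool` functions are assumptions made for this example.

```python
from typing import Callable, Dict, List

Hook = Callable[[dict], dict]


class Lifecycle:
    """Hypothetical hook runner with one hook list per stage."""

    def __init__(self) -> None:
        self.hooks: Dict[str, List[Hook]] = {
            "pre_llm": [], "pre_tool": [], "post_tool": [],
        }

    def register(self, stage: str, hook: Hook) -> None:
        self.hooks[stage].append(hook)

    def _apply(self, stage: str, payload: dict) -> dict:
        for hook in self.hooks[stage]:
            payload = hook(payload)
        return payload

    def step(self, prompt: str, llm: Callable, tool: Callable):
        # Pre-LLM: adjust the prompt before the model reasons over it
        prompt = self._apply("pre_llm", {"prompt": prompt})["prompt"]
        call = llm(prompt)  # expected shape: {"args": {...}}
        # Pre-Tool: inspect or repair the proposed call before it runs
        call = self._apply("pre_tool", call)
        result = tool(**call["args"])
        # Post-Tool: review or trim the result before the agent sees it
        return self._apply("post_tool", {"result": result})["result"]


seen_prompts = []

def fake_llm(prompt: str) -> dict:
    seen_prompts.append(prompt)
    return {"args": {"city": " Paris "}}  # note the stray whitespace

def fake_tool(city: str) -> dict:
    return {"status": 200, "body": f"weather for {city}"}

lifecycle = Lifecycle()
lifecycle.register(
    "pre_llm",
    lambda p: {**p, "prompt": "IMPORTANT: answer in JSON.\n" + p["prompt"]})
lifecycle.register(
    "pre_tool",
    lambda c: {**c, "args": {k: v.strip() for k, v in c["args"].items()}})
lifecycle.register(
    "post_tool",
    lambda r: {**r, "result": r["result"]["body"]})

out = lifecycle.step("What is the weather?", fake_llm, fake_tool)
```

Keeping each stage as a list of small functions means a check can be added or removed without touching the model call or the tool itself.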

Pre-LLM

Spotlight

Steers the model's attention toward critical instructions. Prevents "instruction drift" in long contexts by dynamically biasing attention logits, improving constraint adherence by more than 26%.

Pre-Tool

SPARC

Semantic Pre-execution Analysis for Reliable Calls. Validates generated arguments against OpenAPI specs before execution, catching hallucinated parameters before they cause failures.
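
A pre-execution check of this kind might look like the following sketch. It is not SPARC's implementation: `validate_args` is a hypothetical helper, and the parameter list only mimics the general shape of OpenAPI parameter objects (`name`, `required`, `schema.type`). It covers unknown, missing, and wrongly typed arguments.

```python
from typing import Any, Dict, List

# Mapping from OpenAPI primitive type names to Python types
TYPE_MAP = {"string": str, "integer": int, "number": (int, float), "boolean": bool}


def validate_args(args: Dict[str, Any], parameters: List[dict]) -> List[str]:
    """Return a list of problems; an empty list means the call looks valid."""
    known = {p["name"]: p for p in parameters}
    problems = []
    # Hallucinated parameters: present in the call but absent from the spec
    for name in args:
        if name not in known:
            problems.append(f"unknown parameter: {name}")
    for name, spec in known.items():
        if name not in args:
            if spec.get("required", False):
                problems.append(f"missing required parameter: {name}")
            continue
        kind = spec["schema"]["type"]
        expected = TYPE_MAP.get(kind)
        value = args[name]
        # bool is a subclass of int in Python, so reject it for non-boolean types
        wrong_bool = isinstance(value, bool) and kind != "boolean"
        if expected is not None and (wrong_bool or not isinstance(value, expected)):
            problems.append(f"{name}: expected {kind}, got {type(value).__name__}")
    return problems


parameters = [
    {"name": "account_id", "required": True, "schema": {"type": "string"}},
    {"name": "limit", "schema": {"type": "integer"}},
]

bad_call = validate_args(
    {"account_id": "A-17", "limit": "10", "sort_by": "revenue"}, parameters)
missing_call = validate_args({"limit": 5}, parameters)
good_call = validate_args({"account_id": "A-17", "limit": 5}, parameters)
```

Returning every problem at once, rather than failing on the first, gives the agent enough detail to correct the whole call in a single retry.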

Pre-Tool

Refraction

Syntax repair engine. Intercepts minor code errors (missing brackets, indentation) and repairs them deterministically, saving costly LLM inference cycles.
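
One simple form of deterministic repair is closing brackets that were left open. The sketch below is an assumption about how such a fix could work, not Refraction's actual logic; `close_brackets` and `repair` are hypothetical names. It uses Python's standard `ast` module to confirm the repaired code parses. The character scan ignores string literals and comments, so it only suits simple cases.

```python
import ast

PAIRS = {"(": ")", "[": "]", "{": "}"}


def close_brackets(code: str) -> str:
    """Append closers for any brackets still open at the end of the code."""
    stack = []
    for ch in code:
        if ch in PAIRS:
            stack.append(PAIRS[ch])
        elif stack and ch == stack[-1]:
            stack.pop()
    # Close the innermost bracket first
    return code + "".join(reversed(stack))


def repair(code: str) -> str:
    """Return code unchanged if it parses, else try the bracket fix."""
    try:
        ast.parse(code)
        return code
    except SyntaxError:
        fixed = close_brackets(code)
        ast.parse(fixed)  # raises SyntaxError if the fix was not enough
        return fixed


repaired = repair("total = sum([1, 2, 3")
untouched = repair("x = 1")
```

If the deterministic fix does not yield parseable code, `repair` raises `SyntaxError`, which is the point where a caller could fall back to asking the model again.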

Post-Tool

JSON Processor

Auto-generates extraction code for "fat" API payloads. Filters megabytes of JSON down to the relevant fields only, reducing token costs and improving reasoning accuracy.
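
The kind of extraction code this describes can be as small as the sketch below. `pick_fields` and the sample payload are hypothetical; in practice the list of paths would be generated per task rather than hard-coded.

```python
from typing import Any, Dict, Iterable

_MISSING = object()  # sentinel so that legitimate None values are preserved


def pick_fields(payload: Dict[str, Any], paths: Iterable[str]) -> Dict[str, Any]:
    """Keep only the dotted paths requested, e.g. 'account.revenue.q4'."""
    slim = {}
    for path in paths:
        node: Any = payload
        for key in path.split("."):
            node = node.get(key, _MISSING) if isinstance(node, dict) else _MISSING
            if node is _MISSING:
                break
        if node is not _MISSING:
            slim[path] = node
    return slim


payload = {
    "account": {
        "name": "Acme",
        "revenue": {"q3": 100, "q4": 120},
        "history": ["..."] * 1000,  # bulk the model does not need
    },
    "meta": {"trace_id": "abc", "debug": {"timings": [1, 2, 3]}},
}

slim = pick_fields(payload, ["account.name", "account.revenue.q4", "account.owner"])
```

Paths that do not exist in the payload (here `account.owner`) are skipped rather than raising, so a slightly wrong field list degrades gracefully.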

Post-Tool

RAG Repair

Self-healing infrastructure. When tools fail, RAG Repair searches documentation to find solutions, mimicking a developer "Googling the error" and generating corrected commands.

Post-Tool

Silent Review

Semantic auditor. Detects "silent failures" where APIs return 200 OK but with empty or error content, prompting the agent to try alternative strategies.
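
A basic version of this check can be expressed in a few lines. The function below is a hypothetical heuristic written for illustration, not Silent Review's logic, and the `error` / `errors` keys are assumed conventions that vary between APIs.

```python
def is_silent_failure(status: int, body) -> bool:
    """Flag responses that report success but carry no usable content."""
    if status != 200:
        return False  # explicit failures are handled by normal error paths
    if body is None or body == "" or body == [] or body == {}:
        return True   # success status with nothing in the payload
    if isinstance(body, dict) and (body.get("error") or body.get("errors")):
        return True   # error details embedded in a "successful" response
    if isinstance(body, str) and body.strip().lower().startswith("error"):
        return True
    return False


checks = [
    is_silent_failure(200, {}),
    is_silent_failure(200, {"error": "quota exceeded"}),
    is_silent_failure(200, {"data": [1, 2]}),
    is_silent_failure(500, ""),
]
```

In the example, the first two responses are flagged, the third is a genuine success, and the fourth is left to ordinary error handling because its status already signals failure.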

Configurable Reasoning Modes

Not every task needs deep planning. Trade off latency, cost, and accuracy based on your requirements.

Low Latency

Fast Heuristics Mode

Lighter prompting with faster models (Granite, GPT-3.5). Bypasses deep planning for routine tasks.

Best for: Customer service, FAQ responses, simple lookups
High Accuracy

Deep Planning Mode

"System 2" thinking. Extensive task decomposition, self-reflection, and multi-step planning.

Best for: Financial transactions, compliance, high-stakes decisions
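
The trade-off between the two modes can be pictured as a small routing decision. The sketch below is purely illustrative: the `ReasoningMode` fields, the step limits, and the `HIGH_STAKES` categories are assumptions for this example and do not reflect CUGA's actual configuration options.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReasoningMode:
    name: str
    decompose: bool       # break the task into subtasks first
    reflect: bool         # run a self-review pass on each result
    max_plan_steps: int   # illustrative cap on planning depth


FAST = ReasoningMode("fast_heuristics", decompose=False, reflect=False, max_plan_steps=1)
DEEP = ReasoningMode("deep_planning", decompose=True, reflect=True, max_plan_steps=12)

# Hypothetical task categories that always warrant deep planning
HIGH_STAKES = {"financial_transaction", "compliance"}


def choose_mode(task_category: str) -> ReasoningMode:
    """Route high-stakes work to deep planning, everything else to the fast path."""
    return DEEP if task_category in HIGH_STAKES else FAST


faq_mode = choose_mode("faq")
compliance_mode = choose_mode("compliance")
```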

How CUGA Works on Katonic

Deploy, run, and scale CUGA on your infrastructure using Katonic's sovereign AI platform.

1. Deploy Models on Ops (Katonic Ops)

Deploy open source LLMs like LLaMA, Mistral, or Granite on Katonic Ops. Run on NVIDIA GPUs or Groq LPUs with enterprise-grade inference that powers CUGA.

2. Deploy CUGA in Studio (Katonic Studio)

Launch and deploy CUGA from Katonic AI Studio. Connect to your deployed models, configure data sources, and define the tools CUGA can access.

3. Access via ACE Co-pilot (Katonic ACE Co-pilot)

CUGA becomes available as an intelligent agent in ACE Co-pilot. Your teams interact with CUGA through natural language - complex tasks are automatically handled.

4. Configure with Langflow (Agent Marketplace)

Use Langflow from the Katonic App Store to visually configure and customize CUGA's workflows. Build and modify agent pipelines without code.

Key Features & Capabilities

Everything enterprises need to deploy autonomous AI agents at scale, from structured planning to policy enforcement.

High-Performing Generalist Agent

Combines best-of-breed agentic patterns (planner-executor, CodeAct) with structured planning and smart variable management to prevent hallucination and handle complexity.

Human-in-the-Loop Controls

Configure policy-aware instructions and approval gates. Business defines where autonomy is permitted and where human approval is mandatory.
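
An approval gate of this kind can be sketched as a policy lookup in front of each action. The names below (`execute_with_gate`, the `policy` mapping, and the action types) are hypothetical and only illustrate the pattern; they are not CUGA's configuration format.

```python
from typing import Callable, Dict

AUTO = "auto"
REQUIRE_APPROVAL = "require_approval"


def execute_with_gate(action: dict,
                      policy: Dict[str, str],
                      approve: Callable[[dict], bool],
                      run: Callable[[dict], object]) -> dict:
    """Run an action only if policy allows it or a human approves it."""
    # Action types missing from the policy default to requiring approval
    rule = policy.get(action["type"], REQUIRE_APPROVAL)
    if rule == REQUIRE_APPROVAL and not approve(action):
        return {"status": "rejected", "action": action["type"]}
    return {"status": "executed", "action": action["type"], "result": run(action)}


policy = {"read_report": AUTO, "send_email": REQUIRE_APPROVAL}

def deny_all(action: dict) -> bool:
    return False  # stand-in for a human reviewer who declines

def run_action(action: dict) -> str:
    return f"ran {action['type']}"

auto_result = execute_with_gate({"type": "read_report"}, policy, deny_all, run_action)
gated_result = execute_with_gate({"type": "send_email"}, policy, deny_all, run_action)
unknown_result = execute_with_gate({"type": "delete_account"}, policy, deny_all, run_action)
```

Defaulting unlisted action types to "require approval" is a deliberate fail-closed choice: a new capability stays gated until someone explicitly marks it as safe to automate.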

API/Tool Hub Integration

Onboard new APIs in hours, not weeks. A centralized hub condenses OpenAPI specs into LLM-friendly schemas with strict JSON validation.

Computer Use (Browser Agent)

Navigate web interfaces via DOM interaction. Combines Browser Planner, Action Agent, and QA Agent for visual parsing.

Open Source & Model Agnostic

Apache 2.0 license with no vendor lock-in. Choose your LLM: GPT-4, Granite, LLaMA, or Mistral. CUGA can even be a tool for other agents.

Full Provenance & Audit Trails

Every response includes API paths, parameters, and computation logs. 95% of pilot responses had complete audit trails for compliance.

How CUGA Compares to Production AI Agents

The only open-source agent that beats the tech giants on benchmarks while giving you full control.

Feature | IBM CUGA on Katonic | OpenAI Operator | Anthropic Computer Use | Google Mariner
WebArena Score | 61.7% (#1) | 58.1% | (not given) | (not given)
Open Source | Apache 2.0 | Proprietary | Proprietary | Proprietary
Data Sovereignty | On-premise / Your cloud | OpenAI servers | Anthropic servers | Google servers
Auditability | Glass-Box (full logs) | Black Box | Black Box | Black Box
Model Choice | Any LLM (GPT, LLaMA, etc.) | GPT-4 only | Claude only | Gemini only
Enterprise HITL | Configurable gates | Limited | Limited | Limited
Self-Healing (ALTK) | Native | No | No | No
Best For | Regulated enterprises | Consumer tasks | Developer tools | Browser research

vs. OpenAI Operator: CUGA outperforms by 3.6 percentage points on WebArena while being fully open source and deployable on your infrastructure.
vs. Anthropic/Google: All three major vendors offer black-box agents that run on their servers. For regulated industries, this is a non-starter.
Why it matters: Banking, healthcare, government, and defense cannot send sensitive data to third-party APIs. CUGA is the only production-grade option.

Fully Open Source

CUGA is released under the Apache 2.0 license. Inspect the code, contribute improvements, and deploy with confidence knowing there's no black box.

Ready to See CUGA in Action?

Get the enterprise AI agent running on your infrastructure. Full data sovereignty. No vendor lock-in. Reach out for a personalized demo.