What benchmarks does CUGA lead?

CUGA achieves state-of-the-art performance with 61.7% on WebArena (complex web automation benchmark) and 48.2% on AppWorld (750 real-world tasks across 457 APIs).

How does CUGA work on Katonic?

CUGA on Katonic works in 4 steps: 1) Deploy open source LLMs on Katonic Ops using NVIDIA GPUs or Groq LPUs, 2) Deploy CUGA from Katonic AI Studio and connect to your models, 3) Access CUGA as an agent through Katonic ACE Co-pilot, 4) Configure workflows visually using Langflow from the Agent Marketplace.

Is CUGA open source?

Yes, CUGA is fully open source under the Apache 2.0 license. You can inspect the code, contribute improvements, and deploy with confidence knowing there's no black box.

What is ALTK in CUGA?

ALTK (Agent Lifecycle Toolkit) is CUGA's self-healing reliability system. It provides an 'immune system' that turns fragile prototypes into resilient enterprise systems through interventions at Pre-LLM, Pre-Tool, and Post-Tool stages including Spotlight, SPARC, Refraction, JSON Processor, RAG Repair, and Silent Review.

NEW: Now Available

CUGA Enterprise Agent on Katonic

Name: CUGA on Katonic
Author: IBM Research

An open-source generalist agent framework from IBM Research, purpose-built for enterprise automation. CUGA combines ReAct, CodeAct, and Planner-Executor patterns into a modular architecture enabling trustworthy, policy-aware, and composable automation across web interfaces, APIs, and enterprise systems.

61.7%

WebArena Benchmark

48.2%

AppWorld Benchmark

457

API Integrations

CUGA Agent - Katonic Studio (live demo)

User Task: "Get top accounts by revenue from digital sales, calculate Q4 growth, and draft an executive summary email."

Task Planning: Decomposed into 4 subtasks (Done)
Code Generation: Revenue calculation script (Done)
Secure Execution: Sandbox environment (Running)
Email Draft: Executive summary ready (Pending)

What is CUGA?

The Computer Using Generalist Agent

CUGA is IBM Research's open-source AI agent designed for complex enterprise automation, from multi-step workflows to code execution and API orchestration. It ranks #1 on the WebArena and AppWorld benchmarks.

Stop building agents from scratch. Start with a generalist.

Building domain-specific enterprise agents is complex: it involves orchestration, planning logic, safety policies, evaluation, and continuous improvement. CUGA abstracts this complexity with a Planner-Executor architecture built on LangGraph, which enables cyclic graphs for retry loops and dynamic re-planning.

CodeAct Pattern: writes and executes Python code (not just JSON tool calls) for complex logic, loops, and data transformation.

Task Ledger: a persistent record of execution state that enables dynamic re-planning when step outputs invalidate future steps.

Model Agnostic: works with GPT-4, IBM Granite, Mistral, LLaMA, and Azure OpenAI. No vendor lock-in.

Glass-Box Transparency: every decision is auditable. Critical for regulated industries that can't deploy black-box agents.
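To make the Task Ledger idea concrete, here is a minimal sketch of a ledger that marks downstream steps as stale when an earlier step's output changes. The names `Step`, `TaskLedger`, `record`, and `invalidate_after` are hypothetical and are not taken from CUGA's codebase; the sketch only illustrates the re-planning trigger described above.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional


@dataclass
class Step:
    name: str
    status: str = "pending"          # pending | done | stale
    output: Optional[Any] = None


@dataclass
class TaskLedger:
    """Hypothetical ledger: an ordered record of steps and their state."""
    steps: List[Step] = field(default_factory=list)

    def record(self, name: str, output: Any) -> None:
        """Store a step's output; a changed output invalidates later steps."""
        for i, step in enumerate(self.steps):
            if step.name == name:
                changed = step.status == "done" and step.output != output
                step.status, step.output = "done", output
                if changed:
                    self.invalidate_after(i)
                return
        raise KeyError(name)

    def invalidate_after(self, index: int) -> None:
        """Mark every completed step after `index` as stale so it is re-run."""
        for step in self.steps[index + 1:]:
            if step.status == "done":
                step.status, step.output = "stale", None

    def next_step(self) -> Optional[Step]:
        """Return the first step that still needs work, or None if all are done."""
        for step in self.steps:
            if step.status != "done":
                return step
        return None
```

In this sketch a planner would call `next_step()` in a loop; because stale steps are returned again, a changed upstream result naturally forces the dependent work to be redone.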

CUGA Planner-Executor Architecture

Task Analyzer + Plan Controller

Interprets intent, decomposes into Task Ledger

API Agent (CodeAct)

Shortlister → Code Planner → Coder → Reflection

Browser Agent (Computer Use)

Browser Planner → Action Agent → QA Agent

ALTK Post-Processing

JSON Processor, RAG Repair, Silent Review

See CUGA in Action

Watch how CUGA automates complex enterprise workflows

CUGA Agent LIVE

Hybrid Task Execution

Example Task: get top account by revenue from digital sales, then add it to current page

Hybrid task execution on web and API

CUGA Agent LIVE

Human in the Loop

Watch CUGA pause for human approval during critical decision points

Example Task: get best accounts

Benchmark Results

Ranked #1 on both WebArena and AppWorld leaderboards - beating OpenAI Operator, Anthropic, and Google.

61.7%

WebArena

Complex autonomous web agent benchmark across e-commerce, CMS, forums, and dev platforms.

#1 on Leaderboard

48.2%

AppWorld

750 real-world tasks across 457 APIs. Comprehensive test of API-based agent capabilities.

#1 on Leaderboard

WebArena Leaderboard

Rank	Agent	Score	Source
#1	IBM CUGA	61.7%	Open Source
#2	OpenAI Operator	58.1%	Closed
#3	Autonomous Web Agent	57.1%	Closed
#4	ScribeAgent + GPT-4o	53.0%	Closed

Why this matters: CUGA outperforms OpenAI's Operator by 3.6 percentage points while being fully open source. For regulated industries that need to inspect agent decisions, this is the only production-ready option.

Proven in Production

Enterprise Pilot Results

Real metrics from IBM's BPO Talent Acquisition pilot deployment - not synthetic benchmarks.

87%

Task Accuracy

26 enterprise analytics tasks

~90%

Dev Time Reduction

vs. specialized agents

~50%

Dev Cost Reduction

vs. custom builds

95%

Provenance Logs

Full audit trail coverage

4.6/5

Reproducibility

Analyst-reported consistency

11.2s

Avg Latency

Per query response time

"CUGA saved 20-30 minutes of manual dashboard comparisons per query. It freed time for actual decision-making."

— BPO Talent Acquisition Architects, IBM Consulting

Research Papers

Explore the research behind CUGA's architecture and enterprise deployment

Towards Enterprise-Ready Computer Using Generalist Agent

Our evolutionary approach to building enterprise-ready agentic systems, achieving state-of-the-art performance on WebArena and AppWorld through systematic evaluation, analysis, and refinement.

From Benchmarks to Business Impact: Deploying IBM Generalist Agent in Enterprise Production

Evidence from deploying CUGA in enterprise production, including architectural modifications for auditability, safety, and governance.

ST-WEBAGENTBENCH: A Benchmark for Evaluating Safety and Trustworthiness in Web Agents

A configurable benchmark suite with 222 tasks for evaluating web agent safety and trustworthiness across enterprise scenarios, introducing the Completion Under Policy (CuP) metric.

Agent Lifecycle Toolkit

Self-Healing Reliability with ALTK

CUGA's secret weapon: the Agent Lifecycle Toolkit (ALTK) provides the "immune system" that turns fragile prototypes into resilient enterprise systems. It reduces parsing-related failures by more than 33% in production.

The ALTK Philosophy

Reliability cannot be "prompted" into an LLM; it must be engineered around it. ALTK intervenes at three critical stages: Pre-LLM (before reasoning), Pre-Tool (before execution), and Post-Tool (after results). This cycle of prompt → call → validation → reflection/replan reduced parsing-related failures by more than one-third in IBM pilot runs.
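The prompt → call → validation → reflection cycle can be sketched as a bounded retry loop with checks on either side of the tool call. This is an illustration under stated assumptions, not ALTK's interface: `run_step`, `propose_call`, `execute`, and the check lists are hypothetical names, and the "LLM" and "tool" are plain callables supplied by the caller.

```python
from typing import Callable, List, Optional

# A check returns None when satisfied, or a string describing the problem.
Check = Callable[[dict], Optional[str]]


def run_step(
    prompt: str,
    propose_call: Callable[[str], dict],   # stands in for the LLM
    execute: Callable[[dict], dict],       # stands in for the tool
    pre_tool: List[Check],
    post_tool: List[Check],
    max_attempts: int = 3,
) -> dict:
    """Prompt -> call -> validation -> reflection, with a bounded retry budget."""
    feedback: List[str] = []
    for _ in range(max_attempts):
        # Pre-LLM stage: fold earlier failures back into the prompt.
        full_prompt = prompt + "".join(f"\nPrevious issue: {f}" for f in feedback)
        call = propose_call(full_prompt)

        # Pre-Tool stage: reject a bad call before it runs (Python 3.8+ walrus).
        problems = [p for check in pre_tool if (p := check(call))]
        if not problems:
            result = execute(call)
            # Post-Tool stage: inspect the result before trusting it.
            problems = [p for check in post_tool if (p := check(result))]
            if not problems:
                return result
        feedback.extend(problems)
    raise RuntimeError(f"gave up after {max_attempts} attempts: {feedback}")
```

The design choice worth noting is that every rejected attempt feeds its reason into the next prompt, so retries are informed rather than blind.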

Pre-LLM

Spotlight

Steers the model's attention toward critical instructions. Prevents "instruction drift" in long contexts by dynamically biasing attention logits, improving constraint adherence by more than 26%.
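As a toy illustration of what "biasing attention logits" means, the sketch below adds a constant to the logits of selected token positions before the softmax. It is not Spotlight's implementation: the function names, the constant bias, and its default value of 2.0 are assumptions chosen only to show the effect on the resulting attention weights.

```python
import math
from typing import List, Set


def softmax(logits: List[float]) -> List[float]:
    peak = max(logits)                        # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def biased_attention(logits: List[float], spotlight: Set[int], bias: float = 2.0) -> List[float]:
    """Add a constant bias to spotlighted positions, then normalize with softmax."""
    shifted = [x + bias if i in spotlight else x for i, x in enumerate(logits)]
    return softmax(shifted)
```

Because the weights still sum to one, raising the share of the spotlighted positions necessarily lowers the share of everything else.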

Pre-Tool

SPARC

Semantic Pre-execution Analysis for Reliable Calls. Validates generated arguments against OpenAPI specs before execution, catching hallucinated parameters before they cause failures.
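A simplified version of this kind of pre-execution check might look like the following. It assumes the operation's parameters are given as a list of OpenAPI Parameter Objects (with `name`, `required`, and `schema.type`); the function `validate_arguments` is a hypothetical helper, not SPARC's API, and it covers only primitive types.

```python
from typing import Any, Dict, List

# Map OpenAPI primitive type names to Python types.
_TYPES = {"string": str, "integer": int, "number": (int, float), "boolean": bool}


def validate_arguments(parameters: List[Dict[str, Any]], args: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the call may proceed."""
    problems: List[str] = []
    declared = {p["name"]: p for p in parameters}

    for name in args:
        if name not in declared:
            problems.append(f"unknown parameter: {name}")   # likely hallucinated

    for name, spec in declared.items():
        if name not in args:
            if spec.get("required", False):
                problems.append(f"missing required parameter: {name}")
            continue
        expected = _TYPES.get(spec.get("schema", {}).get("type"))
        value = args[name]
        # bool is a subclass of int in Python, so reject it for non-boolean types.
        if expected and (not isinstance(value, expected)
                         or (expected is not bool and isinstance(value, bool))):
            problems.append(f"wrong type for {name}")
    return problems
```

Returning a list of problems rather than raising on the first one lets the agent see every issue in a single reflection pass.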

Pre-Tool

Refraction

Syntax repair engine. Intercepts minor code errors (missing brackets, indentation) and repairs them deterministically, saving costly LLM inference cycles.
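One deterministic repair of this kind, closing brackets left open at the end of a snippet, can be sketched with the standard library alone. The helpers `close_brackets` and `repair` are hypothetical and deliberately narrow: they ignore brackets inside string literals and comments, and they do not attempt indentation fixes.

```python
import ast

_PAIRS = {"(": ")", "[": "]", "{": "}"}


def close_brackets(code: str) -> str:
    """Append any closing brackets that are still open at the end of the snippet.

    Simplification: brackets inside string literals or comments are not handled.
    """
    stack = []
    for ch in code:
        if ch in _PAIRS:
            stack.append(_PAIRS[ch])
        elif stack and ch == stack[-1]:
            stack.pop()
    return code + "".join(reversed(stack))


def repair(code: str) -> str:
    """Return the code unchanged if it parses, otherwise try the bracket fix."""
    try:
        ast.parse(code)
        return code
    except SyntaxError:
        fixed = close_brackets(code)
        ast.parse(fixed)          # raises SyntaxError if the fix was not enough
        return fixed
```

Re-parsing the repaired snippet with `ast.parse` keeps the step honest: if the rule-based fix does not produce valid Python, the error surfaces and a model-based retry can take over.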

Post-Tool

JSON Processor

Auto-generates extraction code for "fat" API payloads. Filters megabytes of JSON down to the relevant fields, reducing token costs and improving reasoning accuracy.
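The kind of extraction code such a processor might generate can be approximated by a small path-based filter. The sketch below is hand-written rather than generated, and the helpers `pick` and `slim` and the dotted-path convention are assumptions for illustration only.

```python
from typing import Any, Dict, List


def pick(payload: Any, path: str) -> Any:
    """Follow a dotted path; a list in the middle of the path is mapped element-wise."""
    current = payload
    keys = path.split(".")
    for i, key in enumerate(keys):
        if isinstance(current, list):
            rest = ".".join(keys[i:])
            return [pick(item, rest) for item in current]
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def slim(payload: Dict[str, Any], paths: List[str]) -> Dict[str, Any]:
    """Keep only the requested paths from a large response."""
    return {path: pick(payload, path) for path in paths}
```

For example, asking for `data.accounts.name` and `data.accounts.revenue` on a response with many other fields per account would hand the model two short lists instead of the whole payload.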

Post-Tool

RAG Repair

Self-healing infrastructure. When tools fail, RAG Repair searches documentation to find solutions, mimicking a developer "Googling the error" and generating corrected commands.
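The retrieval half of this idea can be sketched with a simple word-overlap ranking over documentation snippets. This is a stand-in, not RAG Repair's method: a real system would more likely use embeddings or a search index, and `tokens`, `retrieve`, and `repair_prompt` are hypothetical names. The final step, asking a model for the corrected command, is left out.

```python
import re
from typing import List, Set


def tokens(text: str) -> Set[str]:
    """Lowercase word set used for a crude similarity score."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


def retrieve(error: str, docs: List[str], k: int = 1) -> List[str]:
    """Rank documentation snippets by word overlap with the error message."""
    query = tokens(error)
    ranked = sorted(docs, key=lambda d: len(query & tokens(d)), reverse=True)
    return ranked[:k]


def repair_prompt(command: str, error: str, docs: List[str]) -> str:
    """Assemble the context a model would need to propose a corrected command."""
    context = "\n".join(f"- {d}" for d in retrieve(error, docs))
    return (
        f"Command failed: {command}\n"
        f"Error: {error}\n"
        f"Relevant documentation:\n{context}\n"
        "Propose a corrected command."
    )
```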

Post-Tool

Silent Review

Semantic auditor. Detects "silent failures" where APIs return 200 OK but with empty or error content, prompting the agent to try alternative strategies.
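A rule-based approximation of a silent-failure check is shown below. The function `silent_failure` and the list of error keys are assumptions made for this sketch; the actual Silent Review component is described as a semantic auditor and may well rely on a model rather than fixed rules.

```python
from typing import Any, Optional

# Keys that commonly signal an error inside an otherwise successful response (assumed list).
_ERROR_KEYS = ("error", "errors", "fault")


def silent_failure(status: int, body: Any) -> Optional[str]:
    """Return a reason if a 2xx response looks like a failure, else None."""
    if not 200 <= status < 300:
        return None                      # a loud failure; handled elsewhere
    if body is None or body == "" or body == [] or body == {}:
        return "empty body"
    if isinstance(body, dict):
        for key in _ERROR_KEYS:
            if body.get(key):
                return f"error field present: {key}"
        # An envelope such as {"data": []} carries no usable content.
        if all(v in (None, "", [], {}) for v in body.values()):
            return "all fields empty"
    return None
```

A non-None return value is the signal an agent would use to try an alternative strategy instead of reasoning over an empty result.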

Configurable Reasoning Modes

Not every task needs deep planning. Trade off latency, cost, and accuracy based on your requirements.

Low Latency

Fast Heuristics Mode

Lighter prompting with faster models (Granite, GPT-3.5). Bypasses deep planning for routine tasks.

Best for: Customer service, FAQ responses, simple lookups

High Accuracy

Deep Planning Mode

"System 2" thinking. Extensive task decomposition, self-reflection, and multi-step planning.

Best for: Financial transactions, compliance, high-stakes decisions
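The trade-off between the two modes can be expressed as a small routing rule. Everything in this sketch is hypothetical: the `ReasoningMode` fields, the step limits, the five-second threshold, and the task-type labels are placeholders, not CUGA configuration options.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReasoningMode:
    name: str
    decompose: bool        # run explicit task decomposition
    self_reflect: bool     # run a reflection pass after each step
    max_plan_steps: int    # placeholder budget


FAST = ReasoningMode("fast_heuristics", decompose=False, self_reflect=False, max_plan_steps=1)
DEEP = ReasoningMode("deep_planning", decompose=True, self_reflect=True, max_plan_steps=12)

# Task types that should never take the fast path (assumed labels).
HIGH_STAKES = {"financial_transaction", "compliance"}


def choose_mode(task_type: str, latency_budget_s: float) -> ReasoningMode:
    """Prefer accuracy for high-stakes work, otherwise follow the latency budget."""
    if task_type in HIGH_STAKES:
        return DEEP
    return FAST if latency_budget_s < 5.0 else DEEP
```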

How CUGA Works on Katonic

Deploy, run, and scale CUGA on your infrastructure using Katonic's sovereign AI platform.

Deploy Models on Ops

Deploy open source LLMs like LLaMA, Mistral, or Granite on Katonic Ops. Run on NVIDIA GPUs or Groq LPUs with enterprise-grade inference that powers CUGA.

Katonic Ops

Deploy CUGA in Studio

Launch and deploy CUGA from Katonic AI Studio. Connect to your deployed models, configure data sources, and define the tools CUGA can access.

Katonic Studio

Access via ACE Co-pilot

CUGA becomes available as an intelligent agent in ACE Co-pilot. Your teams interact with CUGA through natural language - complex tasks are automatically handled.

Katonic ACE Co-pilot

Configure with Langflow

Use Langflow from the Katonic Agent Marketplace to visually configure and customize CUGA's workflows. Build and modify agent pipelines without code.

Agent Marketplace

Key Features & Capabilities

Everything enterprises need to deploy autonomous AI agents at scale, from structured planning to policy enforcement.

High-Performing Generalist Agent

Combines best-of-breed agentic patterns (planner-executor, CodeAct) with structured planning and smart variable management to prevent hallucination and handle complexity.

Human-in-the-Loop Controls

Configure policy-aware instructions and approval gates. Business defines where autonomy is permitted and where human approval is mandatory.
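An approval gate of this kind reduces to a lookup plus a callback to a human reviewer. The sketch below assumes a policy expressed as a mapping from action names to `"auto"`, `"approve"`, or `"deny"`; that format, the action names, and the `gate` function are illustrative and not CUGA's policy syntax.

```python
from typing import Callable, Dict

Policy = Dict[str, str]   # action name -> "auto" | "approve" | "deny"


def gate(action: str, policy: Policy, ask_human: Callable[[str], bool]) -> bool:
    """Return True if the action may run under the policy."""
    rule = policy.get(action, "approve")   # unknown actions default to human review
    if rule == "auto":
        return True
    if rule == "deny":
        return False
    return ask_human(action)               # pause until a person decides


# Example policy: reads run freely, emails need sign-off, deletions are blocked.
EXAMPLE_POLICY: Policy = {
    "read_accounts": "auto",
    "send_email": "approve",
    "delete_record": "deny",
}
```

Defaulting unlisted actions to human review is the conservative choice: new tools stay gated until the business explicitly grants autonomy.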

API/Tool Hub Integration

Onboard new APIs in hours, not weeks. Centralized hub minimizes OpenAPI specs into LLM-friendly schemas with strict JSON validation.
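Minimizing a spec can be pictured as projecting each operation down to the few fields a model needs to call it. The sketch below assumes an OpenAPI 3 document already loaded as a Python dict; the output shape and the `minimize` function are hypothetical, and the strict JSON validation mentioned above is not shown.

```python
from typing import Any, Dict, List

HTTP_METHODS = {"get", "post", "put", "patch", "delete"}


def minimize(spec: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Reduce each operation to name, description, method, path, and parameters."""
    tools: List[Dict[str, Any]] = []
    for path, item in spec.get("paths", {}).items():
        for method, op in item.items():
            if method not in HTTP_METHODS:
                continue                   # skip path-level keys such as "parameters"
            tools.append({
                "name": op.get("operationId", f"{method}_{path}"),
                "description": op.get("summary", ""),
                "method": method.upper(),
                "path": path,
                "parameters": {
                    p["name"]: {
                        "type": p.get("schema", {}).get("type", "string"),
                        "required": p.get("required", False),
                    }
                    for p in op.get("parameters", [])
                },
            })
    return tools
```

Response schemas, long descriptions, and examples are dropped on purpose; they are the bulk of most specs and rarely help a model choose or fill in a call.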

Computer Use (Browser Agent)

Navigate web interfaces via DOM interaction. Combines Browser Planner, Action Agent, and QA Agent for visual parsing.

Open Source & Model Agnostic

Apache 2.0 license with no vendor lock-in. Choose your LLM: GPT-4, Granite, LLaMA, or Mistral. CUGA can even be a tool for other agents.

Full Provenance & Audit Trails

Every response includes API paths, parameters, and computation logs. 95% of pilot responses had complete audit trails for compliance.

How CUGA Compares to Production AI Agents

The only open-source agent that beats the tech giants on benchmarks while giving you full control.

Feature	IBM CUGA on Katonic	OpenAI Operator	Anthropic Computer Use	Google Mariner
WebArena Score	61.7% #1	58.1%	—	—
Open Source	✓ Apache 2.0	✗ Proprietary	✗ Proprietary	✗ Proprietary
Data Sovereignty	✓ On-premise / Your cloud	✗ OpenAI servers	✗ Anthropic servers	✗ Google servers
Auditability	✓ Glass-Box (full logs)	✗ Black Box	✗ Black Box	✗ Black Box
Model Choice	✓ Any LLM (GPT, LLaMA, etc.)	✗ GPT-4 only	✗ Claude only	✗ Gemini only
Enterprise HITL	✓ Configurable gates	Limited	Limited	Limited
Self-Healing (ALTK)	✓ Native	✗ No	✗ No	✗ No
Best For	Regulated enterprises	Consumer tasks	Developer tools	Browser research

vs. OpenAI Operator: CUGA outperforms by 3.6 percentage points on WebArena while being fully open source and deployable on your infrastructure.

vs. Anthropic/Google: All three major vendors offer black-box agents that run on their servers. For regulated industries, this is a non-starter.

Why it matters: Banking, healthcare, government, and defense cannot send sensitive data to third-party APIs. CUGA is the only production-grade option.

Fully Open Source

CUGA is released under the Apache 2.0 license. Inspect the code, contribute improvements, and deploy with confidence knowing there's no black box.

View on GitHub Try on Hugging Face

Ready to See CUGA in Action?

Get the enterprise AI agent running on your infrastructure. Full data sovereignty. No vendor lock-in. Reach out for a personalized demo.

Request a Demo