ServiceNow Research Introduces EnterpriseOps-Gym: A High-Fidelity Benchmark Designed to Evaluate Agentic Planning in Realistic Enterprise Settings

发布时间：2026-03-18来源：MarkTechPost

Large language models (LLMs) are transitioning from conversational to autonomous agents capable of executing complex professional workflows. However, their deployment in enterprise environments remains limited by the lack of benchmarks that capture the specific challenges of professional settings: long-horizon planning, persistent state changes, and strict access protocols. To address this, researchers from ServiceNow Research, Mila and Universite de Montreal have introduced

EnterpriseOps-Gym

, a high-fidelity sandbox designed to evaluate agentic planning in realistic enterprise scenarios.

The Evaluation Environment

EnterpriseOps-Gym features a containerized Docker environment that simulates eight mission-critical enterprise domains:

Operational Domains:

Customer Service Management (CSM), Human Resources (HR), and IT Service Management (ITSM).

Collaboration Domains:

Email, Calendar, Teams, and Drive.

Hybrid Domain:

Cross-domain tasks requiring coordinated execution across multiple systems.

The benchmark comprises

164 relational database tables

and

512 functional tools

. With a mean foreign key degree of

1.7

, the environment presents high relational density, forcing agents to navigate complex inter-table dependencies to maintain referential integrity

. The benchmark includes

1,150 expert-curated tasks

, with execution trajectories averaging 9 steps and reaching up to 34 steps

.

Performance Results: A Capability Gap

The research team evaluated 14 frontier models using a

pass@1

metric, where a task is successful only if all outcome-based SQL verifiers pass.

Model	Average Success Rate (%)	Cost per Task (USD)
Claude Opus 4.5	37.4%	$0.36
Gemini-3-Flash	31.9%	$0.03
GPT-5.2 (High)	31.8%	Not explicitly listed in text
Claude Sonnet 4.5	30.9%	$0.26
GPT-5	29.8%	$0.16
DeepSeek-V3.2 (High)	24.5%	$0.014
GPT-OSS-120B (High)	23.7%	$0.015

The results indicate that even state-of-the-art models fail to reach 40% reliability in these structured environments

. Performance is strongly domain-dependent; models performed best on collaboration tools (Email, Teams) but dropped significantly in policy-heavy domains like

ITSM (28.5%)

and

Hybrid (30.7%)

workflows

.

Planning vs. Execution

A critical finding of this research is that

strategic planning

, rather than tool invocation, is the primary performance bottleneck.

The research team conducted ‘Oracle’ experiments where agents were provided with human-authored plans. This intervention improved performance by

14-35 percentage points

across all models. Strikingly, smaller models like

Qwen3-4B

became competitive with much larger models when strategic reasoning was externalized. Conversely, adding ‘distractor tools’ to simulate retrieval errors had a negligible impact on performance, further suggesting that tool discovery is not the binding constraint.

Failure Modes and Safety Concerns

The qualitative analysis revealed four recurring failure patterns:

Missing Prerequisite Lookup:

Creating objects without querying necessary prerequisites, leading to “orphaned” records.

Cascading State Propagation:

Failing to trigger follow-up actions required by system policies after a state change.

Incorrect ID Resolution:

Passing unverified or guessed identifiers to tool calls.

Premature Completion Hallucination:

Declaring a task finished before all required steps are executed.

Furthermore, agents struggle with

safe refusal

. The benchmark includes 30 infeasible tasks (e.g., requests violating access rules or involving inactive users)

. The best-performing model,

GPT-5.2 (Low)

, correctly refused these tasks only

53.9%

of the time

. In professional settings, failing to refuse an unauthorized or impossible task can lead to corrupted database states and security risks

.

Orchestration and Multi-Agent Systems (MAS)

The research team also evaluated whether more complex agent architectures could close the performance gap. While a

Planner+Executor

setup (where one model plans and another executes) yielded modest gains, more complex

decomposition architectures

often regressed performance. In domains like CSM and HR, tasks have strong sequential state dependencies; breaking these into sub-tasks for separate agents often disrupted the necessary context, leading to lower success rates than simple ReAct loops.

Economic Considerations: The Pareto Frontier

For deployment, the benchmark establishes a clear cost-performance tradeoff:

Gemini-3-Flash

represents the strongest practical tradeoff for closed-source models, offering 31.9% performance at a 90% lower cost than GPT-5 or Claude Sonnet 4.5.

DeepSeek-V3.2 (High)

and

GPT-OSS-120B (High)

are the dominant open-source options, offering approximately 24% performance at roughly $0.015 per task.

Claude Opus 4.5

remains the benchmark for absolute reliability (37.4%) but at the highest cost of $0.36 per task.

Key Takeaways

Benchmark Scale and Complexity

: EnterpriseOps-Gym provides a high-fidelity evaluation environment featuring

164 relational database tables

and

512 functional tools

across eight enterprise domains.

Significant Performance Gap

: Current frontier models are not yet reliable for autonomous deployment; the top-performing model,

Claude Opus 4.5

, achieves only a

37.4% success rate

.

Planning as the Primary Bottleneck

: Strategic reasoning is the binding constraint rather than tool execution, as providing agents with human-authored plans improves performance by

14 to 35 percentage points

.

Inadequate Safe Refusal

: Models struggle to identify and refuse infeasible or policy-violating requests, with even the best-performing model cleanly abstaining only

53.9%

of the time.

Thinking Budget Limitations

: While increasing test-time compute yields gains in some domains, performance plateaus in others, suggesting that more ‘thinking’ tokens cannot fully overcome fundamental gaps in

policy understanding

or

domain knowledge

.

Check out

Paper

,

Codes

and

Technical details

.

Also, feel free to follow us on

Twitter

and don’t forget to join our

120k+ ML SubReddit

and Subscribe to

our Newsletter

. Wait! are you on telegram?

now you can join us on telegram as well.

转载说明：本文系转载内容，版权归原作者及原出处所有。转载目的在于传递更多行业信息，文章观点仅代表原作者本人，与本平台立场无关。若涉及作品版权问题，请原作者或相关权利人及时与本平台联系，我们将在第一时间核实后移除相关内容。