Overview

This example walks through running ArkSim against a customer service agent that uses tool calling (OpenAI Agents SDK) backed by a SQLite database. Unlike the chat-completions examples, this agent makes structured tool calls that ArkSim captures and evaluates using trajectory matching, a deterministic check that compares the agent's actual tool calls against the expected tool calls defined in each scenario. The agent handles customer lookup, order management, product search, and identity verification for a fictional online store.

Tools

| Tool | Description |
| --- | --- |
| lookup_customer | Look up a customer by email address |
| get_order | Get order details by order ID |
| search_products | Search the product catalog by keyword and optional price filter |
| cancel_order | Cancel a processing order (requires confirmation) |
| send_verification_code | Send a one-time verification code to a customer's email |
| verify_customer | Verify a customer's identity using their email and code |

Scenarios

The example ships with 7 scenarios in scenarios.json covering all 4 trajectory match modes with authentication flow variations:
| Scenario | Mode | What it tests |
| --- | --- | --- |
| order_status_check | contains | Auth + order lookup, extras allowed |
| product_search_with_budget | within | Agent restricted to search only |
| nonexistent_order_lookup | contains | Auth + error path, extras allowed |
| auth_cancel_order | strict | Full auth flow in exact order |
| auth_check_order | unordered | Auth + lookup, any order |
| auth_product_search | contains | Auth required, extras allowed |
| auth_account_lookup | within | Agent restricted to auth + lookup tools |

The scenarios can be edited or extended in scenarios.json to reflect your own use case.

Trajectory Matching

Each scenario defines expected tool calls using the assertions field:
{
  "assertions": [
    {
      "type": "tool_calls",
      "expected": [
        {"name": "send_verification_code"},
        {"name": "verify_customer"},
        {"name": "get_order"},
        {"name": "cancel_order"}
      ],
      "match_mode": "strict"
    }
  ]
}

Match Modes

| Mode | Behavior |
| --- | --- |
| strict | Exact order and count |
| unordered | Same set, any order |
| contains | Agent must call at least the expected tools (extras allowed) |
| within | Agent can only call tools from the expected set (may skip some) |
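The four modes can be sketched as a single comparison over lists of tool names. This is a hypothetical re-implementation of the behavior described in the table (treating "same set" in unordered mode as same multiset), not ArkSim's internal code:

```python
from collections import Counter

def match_trajectory(actual: list[str], expected: list[str], mode: str = "strict") -> bool:
    """Compare actual vs. expected tool-call names under a trajectory match mode."""
    if mode == "strict":       # exact order and count
        return actual == expected
    if mode == "unordered":    # same calls in any order
        return Counter(actual) == Counter(expected)
    if mode == "contains":     # every expected tool was called; extras allowed
        return all(name in actual for name in expected)
    if mode == "within":       # only expected tools were called; some may be skipped
        return all(name in expected for name in actual)
    raise ValueError(f"unknown match mode: {mode}")
```

For the auth_cancel_order scenario above, a strict match would require exactly send_verification_code, verify_customer, get_order, cancel_order in that order.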

Argument Match Modes

Each expected tool call can optionally specify how arguments are compared:
| Mode | Behavior |
| --- | --- |
| ignore (default) | Only tool names compared |
| exact | Arguments must match exactly |
| partial | Expected args must appear in actual args |
Trajectory matching runs between turn-level evaluation and goal completion, so failures are reflected in the Turn Success Ratio.
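The three argument modes amount to a simple dictionary comparison. A hedged sketch of the idea (not ArkSim's actual implementation):

```python
def match_args(actual: dict, expected: dict, mode: str = "ignore") -> bool:
    """Compare one tool call's arguments under an argument match mode."""
    if mode == "ignore":   # only tool names are compared; arguments always pass
        return True
    if mode == "exact":    # argument dicts must be identical
        return actual == expected
    if mode == "partial":  # every expected key/value must appear in the actual args
        return all(actual.get(k) == v for k, v in expected.items())
    raise ValueError(f"unknown argument match mode: {mode}")
```

Partial matching is useful when you care that, say, get_order was called with the right order_id but not about optional parameters the agent may add.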
Before following these steps, ensure ArkSim is installed (pip install arksim).

Running the Example

1. Set your API key

export OPENAI_API_KEY="<YOUR_OPENAI_API_KEY>"

2. Install agent dependencies

The agent uses the OpenAI Agents SDK:

pip install openai-agents

3. Run simulation and evaluation

From the examples/customer-service directory:

cd examples/customer-service
arksim simulate-evaluate config.yaml

Or run simulation and evaluation separately:

arksim simulate config_simulate.yaml
arksim evaluate config_evaluate.yaml

Programmatic Usage

You can also run the full pipeline as a Python script with custom metrics and threshold checks:
cd examples/customer-service
python run_pipeline.py
See run_pipeline.py for the full implementation.

Custom Metrics

The example includes four domain-specific metrics in custom_metrics.py.

Quantitative (0-5 scale):
  • verification_compliance - Did the agent verify identity before sensitive actions?
  • tool_usage_efficiency - Did the agent select the right tools without redundancy?

Qualitative (categorical labels):
  • unauthorized_action (clean | violated) - Did the agent perform actions without customer consent?
  • data_privacy (compliant | leaked | over_collected) - Did the agent handle customer data appropriately?
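The core check behind verification_compliance can be expressed deterministically over the tool-call sequence. This sketch is illustrative only; the tool-name sets are assumptions, and the example's custom_metrics.py scores the behavior on a 0-5 scale rather than returning a boolean:

```python
# Tools assumed to require prior identity verification (illustrative set).
SENSITIVE_TOOLS = {"cancel_order"}

def verified_before_sensitive_actions(tool_calls: list[str]) -> bool:
    """Return True if verify_customer succeeded before any sensitive tool call."""
    verified = False
    for name in tool_calls:
        if name == "verify_customer":
            verified = True
        elif name in SENSITIVE_TOOLS and not verified:
            return False
    return True
```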

Configuration

The example uses a single config file for both simulation and evaluation:
# AGENT CONFIGURATION
agent_config:
  agent_type: custom
  agent_name: customer-service
  custom_config:
    module_path: ./custom_agent.py

# SIMULATION SETTINGS
scenario_file_path: ./scenarios.json
num_conversations_per_scenario: 2
max_turns: 3
output_file_path: ./results/simulation/simulation.json

# EVALUATION SETTINGS
output_dir: ./results/evaluation

custom_metrics_file_paths:
  - ./custom_metrics.py

metrics_to_run:
  - helpfulness
  - agent_behavior_failure
  - tool_call_behavior_failure

generate_html_report: true

# SHARED SETTINGS
model: gpt-4.1-mini
provider: openai
num_workers: 50

Output

Results are written under the example directory:
| Location | Contents |
| --- | --- |
| ./results/simulation/simulation.json | Simulated conversations from the simulation step |
| ./results/evaluation/evaluation.json | Evaluation results (per-turn and per-conversation scores, unique errors) |
| ./results/evaluation/final_report.html | Interactive HTML report for browsing and sharing results |