Overview
This example walks through running ArkSim against a customer service agent that uses tool calling (OpenAI Agents SDK) backed by a SQLite database. Unlike the chat-completions examples, this agent makes structured tool calls, which ArkSim captures and evaluates using trajectory matching: a deterministic check that compares the agent’s actual tool calls against the expected tool calls defined in each scenario. The agent handles customer lookup, order management, product search, and identity verification for a fictional online store.
Tools
| Tool | Description |
|---|---|
| lookup_customer | Look up a customer by email address |
| get_order | Get order details by order ID |
| search_products | Search the product catalog by keyword and optional price filter |
| cancel_order | Cancel a processing order (requires confirmation) |
| send_verification_code | Send a one-time verification code to a customer’s email |
| verify_customer | Verify a customer’s identity using their email and code |
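To make the SQLite-backed tools concrete, here is a minimal sketch of what a lookup such as lookup_customer could look like. The table name, columns, sample row, and return shape are all assumptions made for illustration; the example's real schema and its registration with the OpenAI Agents SDK are not shown here.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical customers table."""
    conn = sqlite3.connect(":memory:")
    conn.row_factory = sqlite3.Row  # rows can be converted to dicts
    conn.execute(
        "CREATE TABLE customers ("
        "id INTEGER PRIMARY KEY, email TEXT UNIQUE, name TEXT)"
    )
    conn.execute(
        "INSERT INTO customers (email, name) VALUES (?, ?)",
        ("ada@example.com", "Ada"),
    )
    conn.commit()
    return conn


def lookup_customer(conn: sqlite3.Connection, email: str) -> dict:
    """Look up a customer by email; return an error dict when none exists."""
    row = conn.execute(
        "SELECT id, email, name FROM customers WHERE email = ?",
        (email,),  # parameterized query, never string formatting
    ).fetchone()
    if row is None:
        return {"error": f"No customer found for {email}"}
    return dict(row)


conn = make_demo_db()
print(lookup_customer(conn, "ada@example.com"))
print(lookup_customer(conn, "nobody@example.com"))
```

Returning a structured error instead of raising gives the agent something it can relay to the customer, which is the kind of error path the nonexistent_order_lookup scenario exercises.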
Scenarios
The example ships with 7 scenarios in scenarios.json. Together they cover all 4 trajectory match modes, with variations on the authentication flow:
| Scenario | Mode | What it tests |
|---|---|---|
| order_status_check | contains | Auth + order lookup, extras allowed |
| product_search_with_budget | within | Agent restricted to search only |
| nonexistent_order_lookup | contains | Auth + error path, extras allowed |
| auth_cancel_order | strict | Full auth flow in exact order |
| auth_check_order | unordered | Auth + lookup, any order |
| auth_product_search | contains | Auth required, extras allowed |
| auth_account_lookup | within | Agent restricted to auth + lookup tools |
The scenarios are defined in scenarios.json in the example directory and can be edited or extended to reflect your own use case.
Trajectory Matching
Each scenario defines its expected tool calls in the assertions field.
Match Modes
| Mode | Behavior |
|---|---|
| strict | Exact order and count |
| unordered | Same set, any order |
| contains | Agent must call at least the expected tools (extras allowed) |
| within | Agent can only call tools from the expected set (may skip some) |
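The four modes can be illustrated over plain lists of tool names. This is a sketch of the semantics described in the table, not ArkSim's implementation; the function name is hypothetical, and counting duplicate calls in unordered and contains is an assumption.

```python
from collections import Counter


def trajectory_matches(expected: list[str], actual: list[str], mode: str) -> bool:
    """Compare actual tool-call names against expected ones under a match mode."""
    if mode == "strict":
        # Same tools, same order, same count.
        return actual == expected
    if mode == "unordered":
        # Same tools and counts, order ignored (duplicates counted: an assumption).
        return Counter(actual) == Counter(expected)
    if mode == "contains":
        # Every expected call must appear; extra calls are allowed.
        need, have = Counter(expected), Counter(actual)
        return all(have[name] >= count for name, count in need.items())
    if mode == "within":
        # Only tools from the expected set may be called; some may be skipped.
        return set(actual) <= set(expected)
    raise ValueError(f"unknown match mode: {mode}")


expected = ["send_verification_code", "verify_customer", "get_order"]
with_extra = ["lookup_customer"] + expected

for mode in ("strict", "unordered", "contains", "within"):
    print(mode, trajectory_matches(expected, with_extra, mode))
```

With one extra lookup_customer call, only contains passes: strict and unordered reject the extra call, and within rejects it because lookup_customer is outside the expected set.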
Argument Match Modes
Each expected tool call can optionally specify how arguments are compared:

| Mode | Behavior |
|---|---|
| ignore (default) | Only tool names compared |
| exact | Arguments must match exactly |
| partial | Expected args must appear in actual args |
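The argument modes reduce to three dictionary comparisons. The sketch below shows one way to read the table; it is not ArkSim's code, and the function name and argument keys (order_id, confirm) are hypothetical.

```python
def args_match(expected: dict, actual: dict, mode: str = "ignore") -> bool:
    """Compare a tool call's actual arguments against the expected ones."""
    if mode == "ignore":
        # Only tool names are compared, so arguments always pass.
        return True
    if mode == "exact":
        # Keys and values must be identical.
        return actual == expected
    if mode == "partial":
        # Every expected key/value must be present; extra actual keys are fine.
        return all(
            key in actual and actual[key] == value
            for key, value in expected.items()
        )
    raise ValueError(f"unknown argument match mode: {mode}")


expected_args = {"order_id": "ORD-1"}
actual_args = {"order_id": "ORD-1", "confirm": True}

for mode in ("ignore", "exact", "partial"):
    print(mode, args_match(expected_args, actual_args, mode))
```

Here exact fails because the agent passed an additional confirm argument, while partial passes because the one expected key matches.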
Before following these steps, ensure ArkSim is installed (pip install arksim).
Running the Example
Programmatic Usage
You can also run the full pipeline as a Python script with custom metrics and threshold checks. See run_pipeline.py for the full implementation.
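A threshold check of the kind mentioned above could be sketched as follows. This does not call ArkSim and is not taken from run_pipeline.py; the scores dictionary, the threshold values, and the function name are assumptions standing in for whatever the evaluation step actually returns.

```python
def check_thresholds(scores: dict[str, float], thresholds: dict[str, float]) -> list[str]:
    """Return one failure message per metric that is missing or below its minimum."""
    failures = []
    for metric, minimum in thresholds.items():
        score = scores.get(metric)
        if score is None:
            failures.append(f"{metric}: no score reported")
        elif score < minimum:
            failures.append(f"{metric}: {score:.2f} < {minimum:.2f}")
    return failures


# Hypothetical average scores on the 0-5 scale used by the quantitative metrics.
scores = {"verification_compliance": 4.5, "tool_usage_efficiency": 2.0}
thresholds = {"verification_compliance": 4.0, "tool_usage_efficiency": 3.0}

failures = check_thresholds(scores, thresholds)
for message in failures:
    print("FAIL", message)
```

In a CI job, a non-empty failures list would typically be turned into a non-zero exit code so that a regression blocks the build.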
Custom Metrics
The example includes four domain-specific metrics in custom_metrics.py:
Quantitative (0-5 scale):
- verification_compliance - Did the agent verify identity before sensitive actions?
- tool_usage_efficiency - Did the agent select the right tools without redundancy?
Qualitative (categorical labels):
- unauthorized_action (clean|violated) - Did the agent perform actions without customer consent?
- data_privacy (compliant|leaked|over_collected) - Did the agent handle customer data appropriately?
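The rule behind verification_compliance can be illustrated with a simple ordering check over tool-call names. This is not the metric from custom_metrics.py, which scores on a 0-5 scale. It is a boolean sketch that assumes cancel_order is the only sensitive tool and that calling verify_customer means verification succeeded.

```python
# Assumption: which tools count as "sensitive" is chosen here for illustration.
SENSITIVE_TOOLS = {"cancel_order"}


def verified_before_sensitive(tool_calls: list[str]) -> bool:
    """Return False if any sensitive tool is called before verify_customer."""
    verified = False
    for name in tool_calls:
        if name == "verify_customer":
            verified = True  # ignores the tool's result: a simplification
        elif name in SENSITIVE_TOOLS and not verified:
            return False
    return True


good = ["send_verification_code", "verify_customer", "cancel_order"]
bad = ["get_order", "cancel_order", "verify_customer"]

print(verified_before_sensitive(good))
print(verified_before_sensitive(bad))
```

A real metric would also need to inspect tool results, since a failed verification followed by a cancellation should still count as a violation.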
Configuration
The example uses a single config file for both simulation and evaluation.
Output
Results are written under the example directory:

| Location | Contents |
|---|---|
| ./results/simulation/simulation.json | Simulated conversations from the simulation step |
| ./results/evaluation/evaluation.json | Evaluation results (per-turn and per-conversation scores, unique errors) |
| ./results/evaluation/final_report.html | Interactive HTML report for browsing and sharing results |