## Why run ArkSim in CI?
Agents break in non-obvious ways. A prompt tweak, a model upgrade, or a dependency change can silently degrade helpfulness, introduce false information, or cause goal failures — without triggering any unit test. Running ArkSim on every pull request turns quality regression into a CI signal:

| Without ArkSim in CI | With ArkSim in CI |
|---|---|
| Regressions found in production | Regressions caught before merge |
| Manual spot-checking before releases | Automated evaluation on every PR |
| Subjective “looks good to me” reviews | Objective, reproducible metric scores |
| No record of quality over time | Artifact history per commit |
## Choose your approach
| | pytest (custom agent) | HTTP server |
|---|---|---|
| How it works | ArkSim loads your agent class in-process | ArkSim calls your agent over HTTP |
| Agent type | Python class subclassing `BaseAgent` | Any HTTP server (any language or framework) |
| CI complexity | Simple — just run pytest | Requires starting, health-checking, and stopping a server |
| Template | `arksim-pytest.yml` + `test_agent_quality.py` | `arksim.yml` |
## Approach 1: pytest with a custom agent
Your agent is a Python class that subclasses `BaseAgent`. ArkSim loads it in-process — no HTTP server needed. The pytest test runs simulation and evaluation directly and asserts scores.
### How it works
### Setup
### Subclass `BaseAgent`
Your agent class must implement two methods. If you already have an existing agent, wrap it in a thin adapter.
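As an illustrative sketch only: the real `BaseAgent` interface ships with ArkSim, and the method names used below (`reset`, `respond`), the stand-in `BaseAgent`, and the wrapped `ExistingAgent` are all assumptions, not ArkSim's actual API. The pattern to take away is the thin adapter that delegates to your existing agent:

```python
class ExistingAgent:
    """Stand-in for the agent you already have (hypothetical)."""

    def chat(self, prompt: str) -> str:
        return f"echo: {prompt}"


class BaseAgent:
    """Placeholder mirroring the shape of ArkSim's BaseAgent (assumed)."""

    def reset(self) -> None: ...
    def respond(self, messages: list[dict]) -> str: ...


class MyAgentAdapter(BaseAgent):
    """Thin adapter that exposes an existing agent to ArkSim."""

    def __init__(self) -> None:
        self.inner = ExistingAgent()

    def reset(self) -> None:
        # Start from fresh state for each simulated conversation.
        self.inner = ExistingAgent()

    def respond(self, messages: list[dict]) -> str:
        # `messages` is assumed to follow the OpenAI chat format:
        # [{"role": "user", "content": "..."}, ...]
        return self.inner.chat(messages[-1]["content"])
```

Because the adapter owns its inner agent, resetting between scenarios is just a matter of rebuilding it, which keeps conversations independent.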
### Copy the pytest templates
Download the GitHub Actions example and copy the pytest workflow and test template into your repo. Create a `conftest.py` in your `tests/` directory so pytest can resolve your agent and metrics modules.

### Create your scenarios

Create `tests/arksim/scenarios.json` with test cases representative of your agent's real usage. See Build Scenarios for the schema, and `examples/bank-insurance` for a full example.

### Customize the test file
Open `tests/test_agent_quality.py` and update:

- Import your agent class (replace `from my_agent import MyAgent`)
- Import your custom metrics if needed (replace `from my_metrics import ...`), or remove the custom metrics block entirely to use built-in metrics only
- Set `NUMERIC_THRESHOLDS` and `QUALITATIVE_FAILURE_LABELS` to match your quality bar
- Adjust `num_convos_per_scenario`, `max_turns`, and `num_workers` as needed
### Add GitHub secret
In your repo → Settings → Secrets and variables → Actions:
| Secret | Purpose |
|---|---|
| `OPENAI_API_KEY` | API key for the LLM ArkSim uses to evaluate your agent |
## Approach 2: HTTP server
Your agent runs as an HTTP server exposing an OpenAI-compatible chat completions endpoint. ArkSim calls it over HTTP during CI, so this approach works with any language or framework.

### How it works

The workflow starts your server, runs ArkSim against it, and gates the job on the exit code:
| Exit code | Meaning |
|---|---|
| 0 | All evaluations passed |
| 1 | Evaluation failed — a threshold was not met |
| 2 | Config / usage error |
| 3 | Internal / engine error |
### Setup
### Get the workflow template
Download the GitHub Actions example, then copy the workflow template into your repo.
### Create your ArkSim config
Create `arksim/config.yaml` in your repo and point the endpoint at the port where your agent server will listen. See Evaluation for all configuration options and the full list of available metrics.

### Create your scenarios
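As a rough illustration only (the field names below are assumptions, not ArkSim's actual config schema; consult the Evaluation docs for the real keys), such a config might point at a local port like this:

```yaml
# Hypothetical shape — replace field names with ArkSim's documented schema.
endpoint:
  base_url: http://localhost:8000/v1   # port your agent server listens on
  api_key_env: AGENT_API_KEY           # optional, if your server requires auth
metrics:
  - goal_completion
  - overall_score
```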
Create `arksim/scenarios.json` with test cases representative of your agent's real usage. See Build Scenarios for the schema, and `examples/bank-insurance` for a full example.

### Add custom metrics (optional)
If you have custom metrics, add them to `arksim/custom_metrics/` and reference them in your `arksim/config.yaml`. See Custom Metrics for the schema.

### Customize the workflow
Open `.github/workflows/arksim.yml` and update the two TODO sections:

- Start agent server — replace with your framework's startup command (examples are provided for FastAPI / uvicorn, LangChain + LangServe, the OpenAI Agents SDK, and Google ADK)
- Wait for agent — update the health-check URL if your server uses a different route
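The two TODO sections can be sketched as follows. To keep the sketch self-contained, a Python stdlib server stands in for a real agent; in your workflow, the startup line would be your framework's command (e.g. uvicorn for a FastAPI app), and the polled URL would be your app's actual health route:

```shell
# Start agent server (background). Stand-in only: replace this line with
# your framework's startup command, e.g. `uvicorn my_app:app --port 8123`
# (module path is a placeholder).
python3 -m http.server 8123 --bind 127.0.0.1 &
SERVER_PID=$!

# Wait for agent: poll the health-check URL until the server answers,
# so ArkSim doesn't start before the agent is ready.
AGENT_UP=0
for _ in $(seq 1 30); do
  if curl -fsS http://127.0.0.1:8123/ > /dev/null 2>&1; then
    AGENT_UP=1
    break
  fi
  sleep 1
done

# Stop the stand-in server (your workflow's cleanup step would do this).
kill "$SERVER_PID"
```

Polling with a bounded retry loop, rather than a fixed `sleep`, keeps the job fast on quick startups and fails cleanly if the server never comes up.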
### Add GitHub secrets
In your repo → Settings → Secrets and variables → Actions:
| Secret | Purpose |
|---|---|
| `OPENAI_API_KEY` | API key for the LLM ArkSim uses to evaluate your agent |
| `AGENT_API_KEY` | (optional) API key your agent server needs |
## Quality gates
Configure pass/fail thresholds in your config or test file. If any gate is not met, the job fails.

### Per-metric thresholds
Fail if any conversation's per-metric score falls below the minimum. Built-in turn-level metrics use a 1–5 scale; `goal_completion` and `overall_score` use 0–1.
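A minimal sketch of the two scales side by side; note that `helpfulness` is an assumed turn-level metric name used only for illustration, while `goal_completion` and `overall_score` come from the text above:

```python
# Thresholds on two different scales: turn-level metrics are 1–5,
# goal_completion and overall_score are 0–1.
NUMERIC_THRESHOLDS = {
    "helpfulness": 4.0,      # assumed turn-level metric, 1–5 scale
    "goal_completion": 0.8,  # 0–1 scale
    "overall_score": 0.75,   # 0–1 scale
}


def passes(scores: dict[str, float]) -> bool:
    """A conversation passes only if every metric meets its minimum."""
    return all(scores.get(m, 0.0) >= t for m, t in NUMERIC_THRESHOLDS.items())
```

Keeping the scales straight matters: a 0.8 threshold is demanding on a 0–1 metric but trivially satisfied on a 1–5 one.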
### Qualitative failure labels
Fail if any evaluated turn returns one of these labels.

### Viewing the evaluation report
After every run — pass or fail — two artifacts are uploaded so you can inspect results regardless of outcome.

- Open the workflow run in GitHub Actions.
- Scroll to Artifacts at the bottom.
- Download `arksim-html-report` → unzip → open `final_report.html` in your browser.
- Download `arksim-full-results` for the raw simulation and evaluation JSONs (useful for programmatic analysis or diffing metric trends across runs).