Evaluation

What is Evaluation?

Evaluation analyzes the conversation transcripts produced by simulation and scores your agent across multiple dimensions. The output is a set of files summarizing your agent’s performance across turns, conversations, and error types.

What Gets Evaluated

Each agent response is scored on five metrics per turn:

Metric	What it measures	Scale
Helpfulness	How effectively the response addresses the user’s need	1–5
Coherence	Logical flow and clarity of the response	1–5
Relevance	How directly the response addresses the user’s input	1–5
Verbosity	Conciseness; penalizes unnecessarily long responses.	1–5
Faithfulness	Whether the response contradicts the user’s knowledge	1–5

Scores range from 1 (poor) to 5 (excellent), with 3–4 considered good and 4–5 excellent. You can add domain-specific custom metrics (e.g. product suitability, compliance) via Python files. At the conversation level, two additional scores are computed on a 0–1 scale:

Goal Completion: whether the user’s goal was fully addressed by the end of the conversation.
Turn Success Ratio: proportion of turns with no detected behavior failure.

These combine into an Overall Agent Score per conversation:

overall_agent_score = turn_success_ratio × 0.75 + goal_completion_score × 0.25Each conversation is assigned a status based on this score. For the exact status values and thresholds, see Evaluation output in the Schema Reference.

Behavior Failure Detection

Beyond numeric scores, evaluation detects the type of failure in each underperforming turn:

Failure type	Description
`lack of specific information`	Correct but incomplete answer that omits details present in the knowledge
`failure to ask for clarification`	Proceeds with an assumption instead of asking the user
`disobey user request`	Completely ignores what the user asked
`repetition`	Restates previous content without adding value
`false information`	Directly contradicts what is stated in the knowledge
`no failure`	No failure detected

Across all conversations, duplicate failures are deduplicated into a unique errors list with occurrence counts. When errors are found and a scenario file is available, the evaluator automatically generates focus files for targeted reruns (see Focus Files below).

Inputs

The input to evaluation is the simulation output file (e.g. simulation.json) from the Simulation step. Knowledge referenced during evaluation comes from your Scenarios file. Evaluation is configured via a YAML file:

# Inputs
scenario_file_path: ./examples/path/to/scenario.json
simulation_file_path: ./examples/path/to/simulation.json

# Model
provider: openai
model: gpt-5.1

# Execution
num_workers: 50

# Metrics
metrics_to_run: [
  faithfulness,
  helpfulness,
  coherence,
  verbosity,
  relevance,
  goal_completion,
  agent_behavior_failure
]
custom_metrics_file_paths: []

# Output
output_dir: ./examples/path/to/evaluation
generate_html_report: true

# Threshold gates (optional — omit or set to null to disable)
numeric_thresholds:
  overall_score: 0.7
  faithfulness: 3.5
  helpfulness: 3.0
  goal_completion: 0.8
qualitative_failure_labels:
  agent_behavior_failure: ["false information", "disobey user request"]
  prohibited_statements: ["violated"]

For all evaluation config keys, types, and defaults, see Run configuration → Evaluation keys in the Schema Reference.

Custom Metrics

In addition to the built-in metrics (helpfulness, coherence, relevance, verbosity, faithfulness, goal completion, behavior failure), you can define custom metrics in Python and load them via config.

custom_metrics_file_paths

list[str]

List of paths to Python files. Each file is loaded and every public QuantitativeMetric or QualitativeMetric subclass is instantiated and run. Custom metrics always run — they are not filtered by metrics_to_run.

metrics_to_run

list[str]

Names of built-in metrics to run. If empty, all built-in metrics run. Use this to restrict evaluation to a subset of built-ins while still running all custom metrics from custom_metrics_file_paths.

Scope

Every custom metric has a scope that controls when it runs:

Scope	Runs	Receives
`"turn"` (default)	Once per agent response	The current turn + conversation history up to that point
`"conversation"`	Once per full conversation	The complete conversation history

Use "turn" for response-quality checks (clarity, tone, compliance statements). Use "conversation" for end-to-end assessments that only make sense after the whole conversation (goal completion, needs assessment, product suitability).

Metric Types

Quantitative
Qualitative

Produces a numeric score (e.g. 0–5).

Step	Detail
Subclass	`QuantitativeMetric`
Implement	`score(score_input: ScoreInput) -> QuantResult`
Return	`QuantResult` with `name`, `value` (float), `reason`

from arksim.evaluator import (
    QuantitativeMetric,
    QuantResult,
    ScoreInput,
)

# Turn-level: scored on every agent response
class ClarityScore(QuantitativeMetric):
    def __init__(self, llm=None):
        super().__init__(name="clarity", score_range=(0, 5), llm=llm, scope="turn")

    def score(self, score_input: ScoreInput) -> QuantResult:
        # self.llm is the evaluator's configured LLM, injected automatically.
        # score_input.chat_history, .current_turn, .user_goal, .knowledge, etc.
        value = ...  # your scoring logic using self.llm or any other logic
        return QuantResult(name="clarity", value=value, reason="...")

# Conversation-level: scored once after the full conversation
class ProductSuitabilityScore(QuantitativeMetric):
    def __init__(self, llm=None):
        super().__init__(name="product_suitability", score_range=(0, 5), llm=llm, scope="conversation")

    def score(self, score_input: ScoreInput) -> QuantResult:
        # score_input.chat_history contains the full conversation.
        value = ...
        return QuantResult(name="product_suitability", value=value, reason="...")

Produces a categorical label (e.g. "complete" / "partial").

Step	Detail
Subclass	`QualitativeMetric`
Implement	`evaluate(score_input: ScoreInput) -> QualResult`
Return	`QualResult` with `name`, `value` (str), `reason`

from arksim.evaluator import (
    QualitativeMetric,
    QualResult,
    ScoreInput,
)

# Turn-level: checked on every agent response
class ProhibitedStatements(QualitativeMetric):
    def __init__(self, llm=None):
        super().__init__(name="prohibited_statements", llm=llm, scope="turn")

    def evaluate(self, score_input: ScoreInput) -> QualResult:
        # self.llm is the evaluator's configured LLM, injected automatically.
        # score_input.chat_history, .current_turn, .user_goal, .knowledge, etc.
        label = ...  # "ok" | "violated"
        return QualResult(name="prohibited_statements", value=label, reason="...")

# Conversation-level: checked once after the full conversation
class NeedsAssessment(QualitativeMetric):
    def __init__(self, llm=None):
        super().__init__(name="needs_assessment", llm=llm, scope="conversation")

    def evaluate(self, score_input: ScoreInput) -> QualResult:
        # score_input.chat_history contains the full conversation.
        label = ...  # "complete" | "partial" | "missing"
        return QualResult(name="needs_assessment", value=label, reason="...")

Available fields

Both metric types receive a ScoreInput:

score_input.chat_history   # Full conversation turns
score_input.user_goal      # The goal the user was trying to achieve
score_input.knowledge      # Retrieved knowledge / context passed to the agent

Threshold Gates

Threshold gates let you fail the CLI with a non-zero exit code when evaluation results fall below acceptable levels. All three gate types are evaluated after every arksim evaluate or arksim simulate-evaluate run.

`numeric_thresholds`

Per-metric minimum scores on each metric’s native scale. The mean score across all turns per conversation is compared against the threshold. Any conversation that falls below the threshold fails the run with exit code 1.

Metric	Scale	Notes
`overall_score`	0–1	Per-conversation `overall_agent_score` compared directly
`faithfulness`, `helpfulness`, `coherence`, `verbosity`, `relevance`	1–5	Mean of per-turn scores
`goal_completion`	0–1	Per-conversation score compared directly
Custom quantitative metrics (`scope="turn"`)	Your scale	Mean of per-turn scores
Custom quantitative metrics (`scope="conversation"`)	Your scale	Per-conversation score compared directly

numeric_thresholds:
  overall_score: 0.7
  faithfulness: 3.5
  helpfulness: 3.0
  goal_completion: 0.8

`qualitative_failure_labels`

Hard-gate failure labels for qualitative metrics. Any evaluated turn whose label appears in the list fails the run with exit code 1. Turns where the metric did not run are skipped.

qualitative_failure_labels:
  agent_behavior_failure: ["false information", "disobey user request"]
  prohibited_statements: ["violated"]

If any gate fails, the CLI exits with code 1 and logs which conversations and turns did not meet the requirement. See Exit codes for the full list of exit codes.

Programmatic threshold gates

When running ArkSim programmatically (e.g. with a custom agent class), you can apply the same gates by importing check_numeric_thresholds and check_qualitative_failure_labels directly from arksim.evaluator:

import sys
from arksim.evaluator import (
    Evaluator,
    check_numeric_thresholds,
    check_qualitative_failure_labels,
)

evaluator_output = evaluator.evaluate(simulation_output)

numeric_ok = check_numeric_thresholds(
    evaluator_output,
    numeric_thresholds={
        "clarity": 3.5,
        "goal_completion": 0.6,
    },
)
qual_ok = check_qualitative_failure_labels(
    evaluator_output,
    qualitative_failure_labels={
        "agent_behavior_failure": ["false information"],
        "prohibited_statements": ["violated"],
    },
)

if not numeric_ok or not qual_ok:
    sys.exit(1)

You can also write your own custom gate by inspecting evaluator_output.conversations directly and applying any logic you need:

# Example: fail if any conversation has overall_agent_score below 0.7
for convo in evaluator_output.conversations:
    if convo.overall_agent_score < 0.7:
        print(f"Conversation {convo.conversation_id} failed: score {convo.overall_agent_score:.2f}")
        sys.exit(1)

Running Evaluation

Install ArkSim

pip install arksim

Run evaluation

CLI
Python (from file)
Python (in memory)

arksim evaluate config.yaml

from arksim.evaluator import run_evaluation, EvaluationInput

evaluation_results = run_evaluation(EvaluationInput(
    simulation_file_path="./simulation.json",
    scenario_file_path="./scenarios.json",
    output_dir="./evaluation",
    model="gpt-5.1",
    provider="openai",
    num_workers=50,
    generate_html_report=True,
))

from arksim.scenario import Scenarios
from arksim.evaluator import run_evaluation, EvaluationInput

# simulation comes from run_simulation(...)
scenarios = Scenarios.load("./scenarios.json")

evaluation_results = run_evaluation(EvaluationInput(
    output_dir="./evaluation",
    model="gpt-5.1",
    provider="openai",
    num_workers="auto",
    generate_html_report=True,
), simulation=simulation, scenarios=scenarios)

Output Files

File	Description
`evaluation.json`	Full evaluation output: conversations with scores, failure labels, and unique errors. See Evaluation output for the schema.
`final_report.html`	High-level summary with metric averages, overall assessment, and top unique errors, rendered for easier browsing and sharing.
`focus/error_N.json`	Filtered scenario file for error group N (sorted by severity). Only appears when unique errors are detected and a scenario file is provided. See Focus Files.
`focus/all_failures.json`	Union of all failing scenarios across every error group. Use this to rerun everything that failed in one pass.

Reading the Output

Results go to evaluation.json: conversations (one per run), unique_errors (deduplicated failures), and error_scenario_mappings (which scenarios triggered each error).

Show By conversation

Each conversations[] entry has:

overall_agent_score (0–1): Turn success + goal completion. Use to rank or filter.
evaluation_status: Done, Partial Failure, Failed, or Evaluation Failed.
goal_completion_score (0–1, or -1 if skipped): How fully the goal was addressed. goal_completion_reason explains it.
turn_success_ratio (0–1): Share of turns with no behavior failure.

Use turn_scores[] for a turn: turn_id, scores (e.g. helpfulness, faithfulness), turn_behavior_failure, turn_behavior_failure_reason, unique_error_ids. Match IDs to unique_errors for category and description.

Show By error type

Each unique_errors[] entry has:

unique_error_description: What went wrong.
behavior_failure_category: e.g. false information, failure to ask for clarification, repetition.
severity: e.g. critical, high, medium, low.
occurrences: { "conversation_id", "turn_id" } list. Use to find the conversation and turn in logs or CSV.

Use occurrences to go from an error to the exact turn; then check turn_scores and the transcript.

Show By error scenario mapping

Each error_scenario_mappings[] entry maps one unique error to the scenarios that triggered it:

unique_error_id: Links to the matching entry in unique_errors.
scenario_ids: Which scenarios triggered this error. Use these IDs with --scenario_file_path to rerun only the affected scenarios.
severity and error_description: Same as the linked unique_errors entry, included for convenience.

This field is empty when no errors are detected or no scenario file was provided. See Focus Files for the file-based workflow.

Focus Files

When evaluation detects unique errors and a scenario file is provided, the evaluator generates focus files in a focus/ subdirectory. These are filtered scenario files that let you rerun only the failing subset without any manual config.

How it works

Each unique error group is mapped back to the scenarios that triggered it. The evaluator writes one JSON file per error group plus a combined file with all failing scenarios:

evaluation/
  evaluation.json
  final_report.html
  focus/
    error_1.json          # scenarios for the highest-severity error
    error_2.json          # scenarios for the second error
    all_failures.json     # union of all failing scenarios

Error files are numbered by severity (critical first, then high, medium, low) and by occurrence count within the same severity level. Each focus file uses the standard Scenarios schema, so it works directly with --scenario_file_path:

# Rerun only the scenarios that triggered error group 1
arksim simulate-evaluate config.yaml \
  --scenario_file_path ./evaluation/focus/error_1.json

# Or rerun all failing scenarios at once
arksim simulate-evaluate config.yaml \
  --scenario_file_path ./evaluation/focus/all_failures.json

Pass a separate --output_dir for each rerun so results do not overwrite the original evaluation. Each rerun produces its own evaluation, report, and focus files (or no focus files if everything passes).

Iterative fix workflow

Focus files support an iterative fix-and-verify loop:

Run arksim simulate-evaluate config.yaml and review the errors
Fix the highest-severity bug in your agent
Rerun with the focus file: --scenario_file_path ./evaluation/focus/error_1.json --output_dir ./results/rerun_1
If the error is resolved, move to the next error group
When all targeted reruns pass, do a full battle test with the original scenario file

The focus set shrinks as you fix bugs. Scenarios that pass on rerun will not appear in the new run’s focus files.

Getting started

Core Capabilities

CI/CD

Integrations

Examples

Help

What is Evaluation?

What Gets Evaluated

Behavior Failure Detection

Inputs

Custom Metrics

Scope

Metric Types

Available fields

Threshold Gates

`numeric_thresholds`

`qualitative_failure_labels`

Programmatic threshold gates

Running Evaluation

Output Files

Reading the Output

Focus Files

How it works

Iterative fix workflow

Getting started

Core Capabilities

CI/CD

Integrations

Examples

Help

Documentation Index

​What is Evaluation?

​What Gets Evaluated

​Behavior Failure Detection

​Inputs

​Custom Metrics

​Scope

​Metric Types

​Available fields

​Threshold Gates

​numeric_thresholds

​qualitative_failure_labels

​Programmatic threshold gates

​Running Evaluation

​Output Files

​Reading the Output

​Focus Files

​How it works

​Iterative fix workflow

What is Evaluation?

What Gets Evaluated

Behavior Failure Detection

Inputs

Custom Metrics

Scope

Metric Types

Available fields

Threshold Gates

`numeric_thresholds`

`qualitative_failure_labels`

Programmatic threshold gates

Running Evaluation

Output Files

Reading the Output

Focus Files

How it works

Iterative fix workflow