What is Evaluation?

Evaluation analyzes the conversation transcripts produced by simulation and scores your agent across multiple dimensions. The output is a set of files summarizing your agent’s performance across turns, conversations, and error types.

What Gets Evaluated

Each agent response is scored on five metrics per turn:
Metric        What it measures                                          Scale
Helpfulness   How effectively the response addresses the user's need    1–5
Coherence     Logical flow and clarity of the response                  1–5
Relevance     How directly the response addresses the user's input      1–5
Verbosity     Conciseness; penalizes unnecessarily long responses       1–5
Faithfulness  Whether the response contradicts the user's knowledge     1–5
Scores range from 1 (poor) to 5 (excellent); 3–4 is considered good and 4–5 excellent. You can add domain-specific custom metrics (e.g. product suitability, compliance) via Python files.

At the conversation level, two additional scores are computed on a 0–1 scale:
  • Goal Completion: whether the user’s goal was fully addressed by the end of the conversation.
  • Turn Success Ratio: proportion of turns with no detected behavior failure.
These combine into an Overall Agent Score per conversation:

overall_agent_score = turn_success_ratio × 0.75 + goal_completion_score × 0.25

Each conversation is assigned a status based on this score. For the exact status values and thresholds, see Evaluation output in the Schema Reference.
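As a sketch, the weighting above in plain Python (function and argument names are illustrative, not part of the ArkSim API):

```python
def overall_agent_score(turn_success_ratio: float, goal_completion_score: float) -> float:
    """Combine the two conversation-level scores with the 0.75 / 0.25 weighting."""
    return turn_success_ratio * 0.75 + goal_completion_score * 0.25

# A conversation where 4 of 5 turns succeeded and the goal was completed:
score = overall_agent_score(turn_success_ratio=0.8, goal_completion_score=1.0)
print(round(score, 2))  # 0.85
```

The 0.75 weight on turn success means per-turn behavior dominates the overall score; goal completion alone cannot rescue a conversation with many failed turns.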

Behavior Failure Detection

Beyond numeric scores, evaluation detects the type of failure in each underperforming turn:
Failure type                      Description
lack of specific information      Correct but incomplete answer that omits details present in the knowledge
failure to ask for clarification  Proceeds with an assumption instead of asking the user
disobey user request              Completely ignores what the user asked
repetition                        Restates previous content without adding value
false information                 Directly contradicts what is stated in the knowledge
no failure                        No failure detected
Across all conversations, duplicate failures are deduplicated into a unique errors list with occurrence counts.
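The deduplication step can be sketched in plain Python; the input/output shapes here are hypothetical stand-ins, not the actual output schema:

```python
from collections import Counter

def unique_errors(failures: list[str]) -> list[dict]:
    """Collapse repeated failure descriptions into (error, count) entries,
    skipping turns where no failure was detected."""
    counts = Counter(f for f in failures if f != "no failure")
    return [{"error": err, "count": n} for err, n in counts.most_common()]

failures = ["repetition", "no failure", "repetition", "false information"]
print(unique_errors(failures))
# [{'error': 'repetition', 'count': 2}, {'error': 'false information', 'count': 1}]
```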

Inputs

The input to evaluation is the simulation output file (e.g. simulation.json) from the Simulation step. Knowledge referenced during evaluation comes from your Scenarios file. Evaluation is configured via a YAML file:
# Inputs
scenario_file_path: ./examples/path/to/scenario.json
simulation_file_path: ./examples/path/to/simulation.json

# Model
provider: openai
model: gpt-5.1

# Execution
num_workers: 50

# Metrics
metrics_to_run: [
  faithfulness,
  helpfulness,
  coherence,
  verbosity,
  relevance,
  goal_completion,
  agent_behavior_failure
]
custom_metrics_file_paths: []

# Output
output_dir: ./examples/path/to/evaluation
generate_html_report: true

# Threshold gates (optional — omit or set to null to disable)
numeric_thresholds:
  overall_score: 0.7
  faithfulness: 3.5
  helpfulness: 3.0
  goal_completion: 0.8
qualitative_failure_labels:
  agent_behavior_failure: ["false information", "disobey user request"]
  prohibited_statements: ["violated"]
For all evaluation config keys, types, and defaults, see Run configuration → Evaluation keys in the Schema Reference.

Custom Metrics

In addition to the built-in metrics (helpfulness, coherence, relevance, verbosity, faithfulness, goal completion, behavior failure), you can define custom metrics in Python and load them via config.
custom_metrics_file_paths (list[str])
    List of paths to Python files. Each file is loaded and every public QuantitativeMetric or QualitativeMetric subclass is instantiated and run. Custom metrics always run; they are not filtered by metrics_to_run.

metrics_to_run (list[str])
    Names of built-in metrics to run. If empty, all built-in metrics run. Use this to restrict evaluation to a subset of built-ins while still running all custom metrics from custom_metrics_file_paths.
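For example, a config fragment that restricts evaluation to two built-ins while still loading a custom metric file (the file path is hypothetical):

```yaml
metrics_to_run: [faithfulness, goal_completion]
custom_metrics_file_paths:
  - ./metrics/clarity_metric.py
```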

Metric Types

Quantitative metrics produce a numeric score (e.g. 0–5).

Step       Detail
Subclass   QuantitativeMetric
Implement  score(score_input: ScoreInput) -> QuantResult
Return     QuantResult with name, value (float), reason
from arksim.evaluator import (
    QuantitativeMetric,
    QuantResult,
    ScoreInput,
)

class ClarityScore(QuantitativeMetric):
    def score(self, score_input: ScoreInput) -> QuantResult:
        # Use your own LLM or any logic; independent of evaluator model config.
        # score_input.chat_history, .user_goal, .knowledge, etc.
        value = ...  # your scoring logic or LLM call
        return QuantResult(name="clarity", value=value, reason="...")

Available fields

Both metric types receive a ScoreInput:
score_input.chat_history   # Full conversation turns
score_input.user_goal      # The goal the user was trying to achieve
score_input.knowledge      # Retrieved knowledge / context passed to the agent
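Custom metrics typically iterate over these fields. A minimal helper over a hypothetical chat_history shape (a list of dicts with role and content keys; the real turn type may differ):

```python
def mean_agent_words(chat_history: list[dict]) -> float:
    """Average word count of agent turns — e.g. raw input for a verbosity-style metric."""
    lengths = [len(turn["content"].split())
               for turn in chat_history if turn["role"] == "assistant"]
    return sum(lengths) / len(lengths) if lengths else 0.0

history = [
    {"role": "user", "content": "How do I reset my password?"},
    {"role": "assistant", "content": "Use the reset link on the login page."},
]
print(mean_agent_words(history))  # 8.0
```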

Threshold Gates

Threshold gates let you fail the CLI with a non-zero exit code when evaluation results fall below acceptable levels. Gates are evaluated after every arksim evaluate or arksim simulate-evaluate run.

numeric_thresholds

Per-metric minimum scores on each metric’s native scale. The mean score across all turns per conversation is compared against the threshold. Any conversation that falls below the threshold fails the run with exit code 1.
Metric                                                      Scale       Notes
overall_score                                               0–1         Per-conversation overall_agent_score compared directly
faithfulness, helpfulness, coherence, verbosity, relevance  1–5         Mean of per-turn scores
goal_completion                                             0–1         Per-conversation score compared directly
Custom quantitative metrics                                 Your scale  Mean of per-turn scores
numeric_thresholds:
  overall_score: 0.7
  faithfulness: 3.5
  helpfulness: 3.0
  goal_completion: 0.8
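The mean-of-turn-scores comparison can be sketched as follows (the per-turn score list is a hypothetical stand-in for the evaluator's output):

```python
def passes_threshold(turn_scores: list[float], threshold: float) -> bool:
    """A conversation passes if the mean of its per-turn scores meets the threshold."""
    return sum(turn_scores) / len(turn_scores) >= threshold

print(passes_threshold([4, 3, 4, 5], 3.5))  # True  (mean 4.0)
print(passes_threshold([3, 3, 2, 4], 3.5))  # False (mean 3.0)
```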

qualitative_failure_labels

Hard-gate failure labels for qualitative metrics. Any evaluated turn whose label appears in the list fails the run with exit code 1. Turns where the metric did not run are skipped.
qualitative_failure_labels:
  agent_behavior_failure: ["false information", "disobey user request"]
  prohibited_statements: ["violated"]
If any gate fails, the CLI exits with code 1 and logs which conversations and turns did not meet the requirement. See Exit codes for the full list of exit codes.
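Because a failed gate exits with code 1, the command can gate a pipeline directly. For example, a hypothetical GitHub Actions step (workflow names illustrative):

```yaml
- name: Evaluate agent
  run: arksim evaluate config.yaml   # a non-zero exit fails the job if any gate fails
```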

Programmatic threshold gates

When running ArkSim programmatically (e.g. with a custom agent class), you can apply the same gates by importing check_numeric_thresholds and check_qualitative_failure_labels directly from arksim.evaluator:
import sys
from arksim.evaluator import (
    Evaluator,
    check_numeric_thresholds,
    check_qualitative_failure_labels,
)

# `evaluator` is an Evaluator instance configured earlier;
# `simulation_output` is the loaded simulation result.
evaluator_output = evaluator.evaluate(simulation_output)

numeric_ok = check_numeric_thresholds(
    evaluator_output,
    numeric_thresholds={
        "clarity": 3.5,
        "goal_completion": 0.6,
    },
)
qual_ok = check_qualitative_failure_labels(
    evaluator_output,
    qualitative_failure_labels={
        "agent_behavior_failure": ["false information"],
        "prohibited_statements": ["violated"],
    },
)

if not numeric_ok or not qual_ok:
    sys.exit(1)
You can also write your own custom gate by inspecting evaluator_output.conversations directly and applying any logic you need:
# Example: fail if any conversation has overall_agent_score below 0.7
for convo in evaluator_output.conversations:
    if convo.overall_agent_score < 0.7:
        print(f"Conversation {convo.conversation_id} failed: score {convo.overall_agent_score:.2f}")
        sys.exit(1)

Running Evaluation

1. Install ArkSim

pip install arksim

2. Run evaluation

arksim evaluate config.yaml

Output Files

File               Description
evaluation.json    Full evaluation output: conversations with scores, failure labels, and unique errors. See Evaluation output for the schema.
final_report.html  High-level summary with metric averages, overall assessment, and top unique errors, rendered for easier browsing and sharing.

Reading the Output

Results are written to evaluation.json, which contains conversations (one entry per simulated conversation) and unique_errors (deduplicated failures with occurrence counts).
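A minimal sketch of loading and scanning the output, assuming the field names described above (the authoritative schema is in the Schema Reference):

```python
import json

def worst_conversations(evaluation: dict, limit: int = 3) -> list[tuple[str, float]]:
    """Return (conversation_id, overall_agent_score) for the lowest-scoring conversations."""
    ranked = sorted(
        ((c["conversation_id"], c["overall_agent_score"]) for c in evaluation["conversations"]),
        key=lambda pair: pair[1],
    )
    return ranked[:limit]

# In practice: evaluation = json.load(open("evaluation.json"))
evaluation = {"conversations": [
    {"conversation_id": "c1", "overall_agent_score": 0.92},
    {"conversation_id": "c2", "overall_agent_score": 0.55},
]}
print(worst_conversations(evaluation, limit=1))  # [('c2', 0.55)]
```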