Evaluation analyzes the conversation transcripts produced by simulation and scores your agent across multiple dimensions. The output is a set of files summarizing your agent's performance across turns, conversations, and error types.
Each agent response is scored on five metrics per turn:
| Metric | What it measures | Scale |
| --- | --- | --- |
| Helpfulness | How effectively the response addresses the user's need | 1–5 |
| Coherence | Logical flow and clarity of the response | 1–5 |
| Relevance | How directly the response addresses the user's input | 1–5 |
| Verbosity | Conciseness; penalizes unnecessarily long responses | 1–5 |
| Faithfulness | Whether the response contradicts the user's knowledge | 1–5 |
Scores range from 1 (poor) to 5 (excellent), with 3–4 considered good and 4–5 excellent. You can add domain-specific custom metrics (e.g. product suitability, compliance) via Python files.

At the conversation level, two additional scores are computed on a 0–1 scale:
- **Goal Completion**: whether the user's goal was fully addressed by the end of the conversation.
- **Turn Success Ratio**: the proportion of turns with no detected behavior failure.
These combine into an Overall Agent Score per conversation:
overall_agent_score = turn_success_ratio × 0.75 + goal_completion_score × 0.25

Each conversation is assigned a status based on this score. For the exact status values and thresholds, see Evaluation output in the Schema Reference.
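As a worked example of the weighting (an illustration, not the evaluator's internal code), a conversation where half the turns succeeded but the goal was fully completed scores:

```python
def overall_agent_score(turn_success_ratio: float, goal_completion_score: float) -> float:
    # Weights from the formula above: turn success dominates at 75%.
    return turn_success_ratio * 0.75 + goal_completion_score * 0.25

# 5 of 10 turns succeeded, goal fully completed:
print(overall_agent_score(0.5, 1.0))  # 0.625
```

Note that a perfect goal completion alone cannot lift a conversation above 0.25 if every turn fails, which is why per-turn behavior matters most.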
The input to evaluation is the simulation output file (e.g. simulation.json) from the Simulation step. Knowledge referenced during evaluation comes from your Scenarios file.

Evaluation is configured via a YAML file:
```yaml
# Inputs
scenario_file_path: ./examples/path/to/scenario.json
simulation_file_path: ./examples/path/to/simulation.json

# Model
provider: openai
model: gpt-5.1

# Execution
num_workers: 50

# Metrics
metrics_to_run: [faithfulness, helpfulness, coherence, verbosity, relevance, goal_completion, agent_behavior_failure]
custom_metrics_file_paths: []

# Output
output_dir: ./examples/path/to/evaluation
generate_html_report: true

# Threshold gates (optional — omit or set to null to disable)
numeric_thresholds:
  overall_score: 0.7
  faithfulness: 3.5
  helpfulness: 3.0
  goal_completion: 0.8
qualitative_failure_labels:
  agent_behavior_failure: ["false information", "disobey user request"]
  prohibited_statements: ["violated"]
```
In addition to the built-in metrics (helpfulness, coherence, relevance, verbosity, faithfulness, goal completion, behavior failure), you can define custom metrics in Python and load them via config.
`custom_metrics_file_paths`: A list of paths to Python files. Each file is loaded, and every public `QuantitativeMetric` or `QualitativeMetric` subclass found is instantiated and run. Custom metrics always run; they are not filtered by `metrics_to_run`.
`metrics_to_run`: The names of the built-in metrics to run. If the list is empty, all built-in metrics run. Use this to restrict evaluation to a subset of built-ins while still running every custom metric from `custom_metrics_file_paths`.
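For example, to score only two built-ins while a custom metric still runs (the metric file path below is illustrative):

```yaml
metrics_to_run: [helpfulness, faithfulness]
custom_metrics_file_paths: ["./metrics/clarity_metric.py"]
```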
```python
from arksim.evaluator import (
    QuantitativeMetric,
    QuantResult,
    ScoreInput,
)

class ClarityScore(QuantitativeMetric):
    def score(self, score_input: ScoreInput) -> QuantResult:
        # Use your own LLM or any logic; independent of evaluator model config.
        # score_input.chat_history, .user_goal, .knowledge, etc.
        value = ...  # your scoring logic or LLM call
        return QuantResult(name="clarity", value=value, reason="...")
```
A `QualitativeMetric` produces a categorical label (e.g. "complete" / "partial") instead of a numeric score.
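Mirroring the quantitative example above, a qualitative metric might look like the following sketch. The return shape for qualitative metrics is not shown in this document, so the bare-label return below is an assumption; adapt it to the actual `QualitativeMetric` API.

```python
from arksim.evaluator import QualitativeMetric, ScoreInput

class ComplianceLabel(QualitativeMetric):
    def score(self, score_input: ScoreInput):
        # Inspect the transcript and emit a label rather than a number.
        label = "violated" if ... else "compliant"  # your logic or LLM call
        return label  # assumption: adapt to the actual return type
```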
Each metric's `score` method receives a `ScoreInput` with fields such as:

```python
score_input.chat_history  # Full conversation turns
score_input.user_goal     # The goal the user was trying to achieve
score_input.knowledge     # Retrieved knowledge / context passed to the agent
```
Threshold gates let you fail the CLI with a non-zero exit code when evaluation results fall below acceptable levels. All three gate types are evaluated after every arksim evaluate or arksim simulate-evaluate run.
Per-metric minimum scores on each metric’s native scale. The mean score across all turns per conversation is compared against the threshold. Any conversation that falls below the threshold fails the run with exit code 1.
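As an illustration of that check (not the evaluator's actual implementation), the per-conversation gate amounts to comparing a mean against the threshold:

```python
def passes_threshold(turn_scores: list[float], threshold: float) -> bool:
    """Mean score across a conversation's turns must meet the metric's threshold."""
    return sum(turn_scores) / len(turn_scores) >= threshold

# faithfulness on its native 1-5 scale, gated at 3.5:
print(passes_threshold([4, 4, 3, 4], 3.5))  # True  (mean 3.75)
print(passes_threshold([4, 3, 2, 4], 3.5))  # False (mean 3.25)
```

Because the gate uses the per-conversation mean, a single bad turn can be offset by strong turns elsewhere in the same conversation.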
Hard-gate failure labels for qualitative metrics. Any evaluated turn whose label appears in the list fails the run with exit code 1. Turns where the metric did not run are skipped.
```yaml
qualitative_failure_labels:
  agent_behavior_failure: ["false information", "disobey user request"]
  prohibited_statements: ["violated"]
```
If any gate fails, the CLI exits with code 1 and logs which conversations and turns did not meet the requirement. See Exit codes for the full list of exit codes.
When running ArkSim programmatically (e.g. with a custom agent class), you can apply the same gates by importing check_numeric_thresholds and check_qualitative_failure_labels directly from arksim.evaluator:
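The exact signatures of these functions are not shown in this document, so the arguments in the sketch below are assumptions; the threshold and label dictionaries mirror the YAML config keys above.

```python
from arksim.evaluator import (
    check_numeric_thresholds,
    check_qualitative_failure_labels,
)

# evaluator_output: the result of running the evaluator programmatically.
check_numeric_thresholds(evaluator_output, {"overall_score": 0.7, "faithfulness": 3.5})
check_qualitative_failure_labels(evaluator_output, {"agent_behavior_failure": ["false information"]})
```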
You can also write your own custom gate by inspecting evaluator_output.conversations directly and applying any logic you need:
```python
import sys

# Example: fail if any conversation has overall_agent_score below 0.7
for convo in evaluator_output.conversations:
    if convo.overall_agent_score < 0.7:
        print(f"Conversation {convo.conversation_id} failed: score {convo.overall_agent_score:.2f}")
        sys.exit(1)
```
Results are written to evaluation.json, which contains two top-level arrays: conversations (one entry per conversation run) and unique_errors (deduplicated failures).
**By conversation**
Each `conversations[]` entry has:

- `overall_agent_score` (0–1): the weighted combination of turn success and goal completion. Use it to rank or filter conversations.
- `evaluation_status`: `Done`, `Partial Failure`, `Failed`, or `Evaluation Failed`.
- `goal_completion_score` (0–1, or -1 if skipped): how fully the goal was addressed; `goal_completion_reason` explains the score.
- `turn_success_ratio` (0–1): the share of turns with no behavior failure.

Use `turn_scores[]` for per-turn detail: `turn_id`, `scores` (e.g. helpfulness, faithfulness), `turn_behavior_failure`, `turn_behavior_failure_reason`, and `unique_error_ids`. Match those IDs against `unique_errors` for category and description.
**By error type**
Each `unique_errors[]` entry has:

- `unique_error_description`: what went wrong.
- `behavior_failure_category`: e.g. false information, failure to ask for clarification, repetition.
- `severity`: e.g. critical, high, medium, low.
- `occurrences`: a list of `{ "conversation_id", "turn_id" }` pairs. Use it to find the conversation and turn in logs or CSV.

Use `occurrences` to go from an error to the exact turn; then check `turn_scores` and the transcript.
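That lookup can be sketched as follows. The excerpt mimics the fields described above; the `unique_error_id` key is illustrative (this document does not name the ID field), so match whatever ID field your evaluation.json uses against each turn's `unique_error_ids`.

```python
# Hypothetical excerpt of evaluation.json, matching the fields described above.
report = {
    "conversations": [
        {
            "conversation_id": "c1",
            "overall_agent_score": 0.85,
            "turn_scores": [
                {
                    "turn_id": 2,
                    "scores": {"helpfulness": 2, "faithfulness": 1},
                    "turn_behavior_failure": True,
                    "unique_error_ids": ["e1"],
                }
            ],
        }
    ],
    "unique_errors": [
        {
            "unique_error_id": "e1",  # illustrative field name
            "unique_error_description": "Stated a refund policy not present in the knowledge.",
            "behavior_failure_category": "false information",
            "severity": "high",
            "occurrences": [{"conversation_id": "c1", "turn_id": 2}],
        }
    ],
}

def turns_for_error(report: dict, error_id: str) -> list[tuple[str, int]]:
    """Map an error back to the exact (conversation_id, turn_id) pairs."""
    for err in report["unique_errors"]:
        if err.get("unique_error_id") == error_id:
            return [(o["conversation_id"], o["turn_id"]) for o in err["occurrences"]]
    return []

print(turns_for_error(report, "e1"))  # [('c1', 2)]
```

From each pair you can then pull the matching `turn_scores` entry and the transcript turn to see the failure in context.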