Documentation Index
Fetch the complete documentation index at: https://docs.arklex.ai/llms.txt
Use this file to discover all available pages before exploring further.
What is Evaluation?
Evaluation analyzes the conversation transcripts produced by simulation and scores your agent across multiple dimensions. The output is a set of files summarizing your agent’s performance across turns, conversations, and error types.What Gets Evaluated
Each agent response is scored on five metrics per turn:| Metric | What it measures | Scale |
|---|---|---|
| Helpfulness | How effectively the response addresses the user’s need | 1–5 |
| Coherence | Logical flow and clarity of the response | 1–5 |
| Relevance | How directly the response addresses the user’s input | 1–5 |
| Verbosity | Conciseness; penalizes unnecessarily long responses. | 1–5 |
| Faithfulness | Whether the response contradicts the user’s knowledge | 1–5 |
- Goal Completion: whether the user’s goal was fully addressed by the end of the conversation.
- Turn Success Ratio: proportion of turns with no detected behavior failure.
overall_agent_score = turn_success_ratio × 0.75 + goal_completion_score × 0.25Each conversation is assigned a status based on this score. For the exact status values and thresholds, see Evaluation output in the Schema Reference.Behavior Failure Detection
Beyond numeric scores, evaluation detects the type of failure in each underperforming turn:| Failure type | Description |
|---|---|
lack of specific information | Generic or partial answer that omits important details |
failure to ask for clarification | Proceeds with an assumption instead of asking the user |
disobey user request | Completely ignores what the user asked |
repetition | Restates previous content without adding value |
false information | Contradicts/fabricates information relative to knowledge |
no failure | No failure detected |
Inputs
The input to evaluation is the simulation output file (e.g.simulation.json) from the Simulation step. Knowledge referenced during evaluation comes from your Scenarios file.
Evaluation is configured via a YAML file:
Custom Metrics
In addition to the built-in metrics (helpfulness, coherence, relevance, verbosity, faithfulness, goal completion, behavior failure), you can define custom metrics in Python and load them via config.List of paths to Python files. Each file is loaded and every public
QuantitativeMetric or QualitativeMetric subclass is instantiated and run. Custom metrics always run — they are not filtered by metrics_to_run.Names of built-in metrics to run. If empty, all built-in metrics run. Use this to restrict evaluation to a subset of built-ins while still running all custom metrics from
custom_metrics_file_paths.Metric Types
- Quantitative
- Qualitative
Produces a numeric score (e.g. 0–5).
| Step | Detail |
|---|---|
| Subclass | QuantitativeMetric |
| Implement | score(score_input: ScoreInput) -> QuantResult |
| Return | QuantResult with name, value (float), reason |
Available fields
Both metric types receive aScoreInput:
Running Evaluation
Output Files
| File | Description |
|---|---|
evaluation.json | Full evaluation output: conversations with scores, failure labels, and unique errors. See Evaluation output for the schema. |
final_report.html | High-level summary with metric averages, overall assessment, and top unique errors, rendered for easier browsing and sharing. |
Reading the Output
Results go toevaluation.json: conversations (one per run) and unique_errors (deduplicated failures).