
Overview

Use this stage to run conversation evaluation on its own, against completed simulation results. When to use this:
  • Re-evaluating existing conversations with different criteria
  • Generating new fix suggestions after updating your codebase
  • Analyzing agent performance against specific quality thresholds
  • Creating reports for stakeholders or CI/CD pipelines
What it does: Analyzes completed conversations against your agent’s goals, knowledge accuracy, and capabilities to produce scores and actionable feedback.

Output: Evaluation results, scores, and an optional HTML report saved to output_dir (default: ./results/evaluation/).

Running the Command

./run_arksim.sh evaluate config_evaluate.yaml

Configuration File

Understanding Key Settings

  • input_dir — Points to conversation logs from the simulate step
  • code_file_path — Optional: provide your agent’s source code to get specific fix suggestions
  • entry_function — The main function in your code where the agent logic starts
  • generate_html_report — Creates a visual report with scores and conversation breakdowns
  • score_threshold — Set a pass/fail threshold (0.0-1.0) for CI/CD integration
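For CI/CD gating, score_threshold can be set directly in the configuration; a minimal sketch, where the 0.75 value is purely illustrative:

```yaml
# Fail the run if any conversation scores below this value
generate_html_report: true   # keep the visual report for reviewers
score_threshold: 0.75        # evaluate exits non-zero below this score
```

With this set, the evaluate command's exit code can fail a CI job whenever quality regresses below the bar.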

YAML Configuration

# Path to agent setup directory
agent_setup_dir: ./examples/bank-insurance

# Path to simulation output directory (from simulate step)
input_dir: ./results/conversation

# Output directory for evaluation results (default: ./results/evaluation/)
output_dir: null

# LLM settings
model: gpt-5.1                           # LLM model used by Arksim (not your agent)
provider: null                           # LLM provider (e.g., openai, azure)
num_workers: auto                        # Number of parallel workers

# Code fix settings (both required together)
code_file_path: null                     # Path to code file for fix generation (requires entry_function)
entry_function: null                     # Entry function for code fix generation (requires code_file_path)

# Report settings
generate_html_report: true               # Whether to generate an HTML report
score_threshold: null                    # If any per-conversation final score is below this threshold, exit with non-zero code
Code Fix Requirements: Both code_file_path and entry_function must be provided together. If only one is set, code fixes will not be generated.
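To enable code fix generation, set both keys together; a sketch with hypothetical values (your file path and function name will differ):

```yaml
# Both keys must be set for fix suggestions to be generated
code_file_path: ./examples/bank-insurance/agent.py   # hypothetical path
entry_function: handle_message                       # hypothetical entry point
```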

Next Steps

After evaluation completes, review the results in the output directory:
  • CSV files — Detailed metrics for analysis and tracking
  • HTML report — Interactive visualization of agent performance
  • Fix suggestions — Module-level and code-level recommendations (if code analysis enabled)
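The CSV metrics can feed custom tracking or dashboards; a minimal sketch, assuming a per-conversation score column (the `conversation_id` and `final_score` column names here are hypothetical, not Arksim's actual schema):

```python
import csv
import io

# Hypothetical sample mirroring an evaluation CSV; real column names may differ.
SAMPLE = """conversation_id,final_score
conv-001,0.82
conv-002,0.67
conv-003,0.91
"""

def scores_below(csv_text: str, threshold: float) -> list[str]:
    """Return conversation ids whose final score falls below the threshold."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row["conversation_id"]
            for row in reader
            if float(row["final_score"]) < threshold]

# Conversations that would fail a 0.8 score_threshold
print(scores_below(SAMPLE, 0.8))  # ['conv-002']
```

In practice you would read the CSV from the evaluation output directory rather than an inline string; check the actual column headers in your results before relying on specific names.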
See the Results page for a detailed explanation of all evaluation outputs.