Overview
Use this stage to run conversation evaluation on its own, against completed simulation results.

When to use this:
- Re-evaluating existing conversations with different criteria
- Generating new fix suggestions after updating your codebase
- Analyzing agent performance against specific quality thresholds
- Creating reports for stakeholders or CI/CD pipelines
Results are written to output_dir (default: ./results/evaluation/).
Running the Command
You can run the evaluation command either with the standalone binary or with Docker.
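As a sketch of what the invocation might look like (the binary name, image name, and flags below are placeholders; substitute the ones from your installation):

```shell
# Standalone binary (hypothetical name and flags):
./evaluator evaluate --config evaluation.yaml

# Docker (hypothetical image name); mount the results directory so the
# container can read the simulation logs and write evaluation output:
docker run \
  -v "$(pwd)/results:/app/results" \
  my-evaluator-image evaluate --config evaluation.yaml
```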
Configuration File
Understanding Key Settings
- input_dir — Points to conversation logs from the simulate step
- code_file_path — Optional: provide your agent’s source code to get specific fix suggestions
- entry_function — The main function in your code where the agent logic starts
- generate_html_report — Creates a visual report with scores and conversation breakdowns
- score_threshold — Set a pass/fail threshold (0.0-1.0) for CI/CD integration
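The score_threshold setting is what enables CI/CD gating: compare the aggregate evaluation score against the threshold and fail the pipeline when it falls short. A minimal sketch of that check, assuming the evaluation stage leaves a summary file you can read (the file name and the "overall_score" field are hypothetical; adjust them to the actual output):

```python
import json

def check_threshold(summary_path: str, score_threshold: float) -> bool:
    """Return True if the aggregate evaluation score meets the threshold."""
    with open(summary_path) as f:
        summary = json.load(f)
    # "overall_score" is a hypothetical field name; adjust to the real output.
    return summary["overall_score"] >= score_threshold
```

In a CI job, you would call this after evaluation completes and exit with a nonzero status when it returns False, so the pipeline step fails.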
YAML Configuration
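A sketch of what such a configuration might look like, using the settings described above (the exact layout and any defaults are assumptions about the actual schema; the paths and function name are placeholders):

```yaml
input_dir: ./results/simulation/    # conversation logs from the simulate step
output_dir: ./results/evaluation/   # where evaluation artifacts are written
code_file_path: ./src/agent.py      # optional: enables code-level fix suggestions
entry_function: handle_message      # must be set together with code_file_path
generate_html_report: true          # produce the visual report
score_threshold: 0.8                # pass/fail cutoff (0.0-1.0) for CI/CD
```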
Code Fix Requirements: Both code_file_path and entry_function must be provided together. If only one is set, code fixes will not be generated.

Next Steps
After evaluation completes, review the results in the output directory:
- CSV files — Detailed metrics for analysis and tracking
- HTML report — Interactive visualization of agent performance
- Fix suggestions — Module-level and code-level recommendations (if code analysis enabled)
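As a sketch of loading the CSV metrics for analysis or tracking (the file name and column names here are hypothetical; check the actual files in your output directory):

```python
import csv

def load_metrics(csv_path: str) -> list[dict]:
    """Read a metrics CSV into a list of row dicts for further analysis."""
    with open(csv_path, newline="") as f:
        return list(csv.DictReader(f))

def average(rows: list[dict], column: str) -> float:
    """Average a numeric column, e.g. a per-conversation score."""
    values = [float(row[column]) for row in rows]
    return sum(values) / len(values)
```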