Overview
After running a simulation, you’ll receive detailed conversation logs, per-turn quality metrics, error analysis, and actionable code fixes to improve your agent.
Output Structure
Results are saved in the results folder, organized by stage:
```
results/
├── conversation/
│   ├── conversations.json                 # Full conversation logs between user simulator and agent
│   └── target_goals.json                  # Goals assigned to simulated users
├── evaluation/
│   ├── agent_performance_per_convo.csv    # Performance metrics aggregated per conversation
│   ├── agent_performance_per_turn.csv     # Performance metrics for each turn
│   ├── unique_errors.csv                  # Distinct errors encountered during evaluation
│   ├── final_report.md                    # Summary report of the simulation run
│   └── final_report.html                  # Interactive HTML report of the simulation results
└── scenario/
    ├── attributes.json                    # User attributes used in simulation
    ├── cap_goal_map.json                  # Mapping of capabilities to goals
    ├── goals.json                         # Goal definitions
    ├── knowledges.json                    # Knowledge base entries
    ├── profiles.json                      # User profile configurations
    └── artifacts/
```
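A minimal Python sketch for loading the key outputs from this layout. The file paths come from the structure above, but the JSON schema inside each file is not specified here, so treat the parsed objects as opaque until you inspect them:

```python
import csv
import json
from pathlib import Path

def load_results(results_dir):
    """Load the main simulation outputs from a results folder."""
    root = Path(results_dir)
    # Full conversation logs produced by the user simulator and agent.
    with open(root / "conversation" / "conversations.json") as f:
        conversations = json.load(f)
    # Per-conversation and per-turn metric tables.
    with open(root / "evaluation" / "agent_performance_per_convo.csv", newline="") as f:
        per_convo = list(csv.DictReader(f))
    with open(root / "evaluation" / "agent_performance_per_turn.csv", newline="") as f:
        per_turn = list(csv.DictReader(f))
    return conversations, per_convo, per_turn
```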
Simulated Conversation Data
Complete conversation transcripts and associated metadata are stored in conversation/conversations.json. This file contains the full dialogue history and context for each simulated interaction.
What’s Included
Each conversation entry includes:
- Conversation History — Complete turn-by-turn dialogue between the simulated user and agent
- System Prompts — Instructions provided to the user simulator that guide behavior
- User Profiles — Detailed persona information including attributes, demographics, and behavioral characteristics
- User Events — Relevant contextual events that may influence user behavior or needs
- User Goals — Specific objectives assigned to each conversation that define success criteria
- Knowledge Base — FAQ-style content and domain knowledge available to the agent during the conversation
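The exact JSON schema of conversations.json is not documented above, so the key names in this sketch (`user_goal`, `conversation_history`) are illustrative assumptions; adapt them to the keys actually present in your file:

```python
import json

def summarize_conversations(path):
    """Return (goal, turn_count) pairs for each conversation entry.

    NOTE: the key names used here are assumptions, not a documented schema.
    """
    with open(path) as f:
        entries = json.load(f)
    return [
        (e.get("user_goal", "<unknown>"),            # assigned objective
         len(e.get("conversation_history", [])))     # turn-by-turn dialogue
        for e in entries
    ]
```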
Per-Turn Metrics
Per-turn metrics evaluate the quality of each individual agent response during a conversation.
Scoring Scale
All metrics use a 1–5 scale, where:
- 1 = Poor performance
- 5 = Excellent performance

For verbosity_score, the scale measures length appropriateness rather than overall quality: 1 = overly verbose, 5 = appropriately succinct.
Available Metrics
| Metric | Description |
|---|---|
| helpfulness_score | Measures how effectively the response addresses the user’s needs and provides value |
| coherence_score | Evaluates clarity, logical flow, and internal consistency of the response |
| verbosity_score | Assesses response length appropriateness (1 = overly verbose, 5 = appropriately succinct) |
| relevance_score | Measures how directly the response addresses the user’s input without unnecessary content |
| faithfulness_score | Evaluates correctness and grounding with respect to the provided knowledge base |
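As a sketch, average per-turn scores can be computed from agent_performance_per_turn.csv with the standard library. The column names below are taken from the metric names above; the CSV may contain additional columns, which this code ignores:

```python
import csv
from statistics import mean

# Metric columns expected in agent_performance_per_turn.csv (per the table above).
METRICS = [
    "helpfulness_score", "coherence_score", "verbosity_score",
    "relevance_score", "faithfulness_score",
]

def average_turn_scores(path):
    """Return the mean of each 1-5 per-turn metric across all rows."""
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f))
    if not rows:
        return {}
    return {m: mean(float(r[m]) for r in rows) for m in METRICS if m in rows[0]}
```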
Per-Conversation Metrics
Per-conversation metrics assess overall agent performance across complete interactions.
Available Metrics
| Field | Description |
|---|---|
| goal_completion_score | Indicates whether the agent successfully completed the user’s goal. Scored 0.0–1.0, where 1.0 represents full goal achievement. |
| turn_success_ratio | Measures consistency of agent behavior across turns. Calculated as (total_turns - behavior_failures) / total_turns. A ratio of 1.0 indicates zero failures. |
| final_score | Weighted combination of goal completion and turn success ratio. Formula: 0.25 × goal_completion_score + 0.75 × turn_success_ratio. |
| status | Conversation status category based on final score: Done (1.0), Partial Failure (≥0.6 and <1.0), or Failed (<0.6). |
Interpreting Status
| Status | Final Score | Meaning |
|---|---|---|
| Done | 1.0 | Agent successfully completed the user’s goal with no behavior failures |
| Partial Failure | ≥0.6 and <1.0 | Agent had minor issues or incomplete goal completion, but the interaction was acceptable |
| Failed | <0.6 | Agent had significant behavior failures or did not complete the user’s goal |
These metrics help you understand both the reliability of agent behavior and the effectiveness of goal achievement across different conversation scenarios.
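The scoring above can be reproduced directly from the documented formula (final_score = 0.25 × goal_completion_score + 0.75 × turn_success_ratio) and the status thresholds of 1.0 and 0.6:

```python
def final_score(goal_completion_score, total_turns, behavior_failures):
    """Combine goal completion and turn success ratio per the documented weights."""
    turn_success_ratio = (total_turns - behavior_failures) / total_turns
    return 0.25 * goal_completion_score + 0.75 * turn_success_ratio

def status(score):
    """Map a final score to its conversation status category."""
    if score >= 1.0:
        return "Done"
    if score >= 0.6:
        return "Partial Failure"
    return "Failed"
```

For example, a conversation that fully achieved its goal but had 2 behavior failures over 10 turns scores 0.25 × 1.0 + 0.75 × 0.8 = 0.85, which falls in the Partial Failure band.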
Error Analysis
Agent Behavior Failures
When scores fall below threshold levels, the system identifies specific behavior failure categories:
| Failure Category | Description |
|---|---|
| Failure to Ask for Clarification | Agent proceeds with assumptions instead of requesting clarification on ambiguous queries |
| Repetition | Agent redundantly restates the same content without adding new value |
| Disobey User Request | Agent response completely ignores or contradicts the user’s clear request |
| Lack of Specific Information | Agent provides only generic or partial answers when specific details are available |
| False Information | Agent generates hallucinated or fabricated information that contradicts the knowledge base |
These failure categories help identify root causes and guide targeted improvements to agent behavior.
Unique Errors
The unique errors analysis identifies and deduplicates recurring behavior failure patterns across conversations, grouping failures by root cause.
How It Works
The system analyzes behavior failures across all conversations and groups similar failures together, creating fine-grained error descriptions that capture the specific ways agents break down.
Data Fields
| Field | Description |
|---|---|
| unique_error | Fine-grained description of the specific error pattern |
| agent_behaviour_failure_category | High-level failure category (e.g., “Failure to Ask for Clarification”, “Repetition”) |
| occurrences | List of conversation and turn identifiers where this error appears |
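A minimal sketch for tallying unique_errors.csv by failure category. It assumes the occurrences column stores a JSON-encoded list of (conversation, turn) identifiers; adjust the parsing if your export encodes it differently:

```python
import csv
import json
from collections import Counter

def errors_by_category(path):
    """Count total occurrences of each high-level failure category.

    ASSUMPTION: `occurrences` is a JSON-encoded list of identifiers.
    """
    counts = Counter()
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            occurrences = json.loads(row["occurrences"])
            counts[row["agent_behaviour_failure_category"]] += len(occurrences)
    return counts
```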
Fix Recommendations
Module Fixes
For each unique error identified in the evaluation, the system generates high-level fix recommendations that target specific agent framework modules. The system evaluates all potential module-level fixes and selects the most effective option with detailed reasoning.
Module fixes provide high-level guidance on which agent framework components should be modified. When code analysis is available, these module fixes serve as the foundation for generating specific code-level implementation suggestions.
Agent Framework Modules
Errors are mapped to standard agent architecture components:
| Module | Responsibility |
|---|---|
| Memory Module | Manages context, history, and knowledge retrieval |
| Tools/Action Module | Interfaces with external APIs and services |
| Planning/Reasoning Module | Handles multi-step reasoning and task decomposition |
| Orchestration Module | Coordinates module behavior and control flow |
| Generation Module | Produces natural language responses |
This mapping helps identify which components need modification to resolve specific error patterns.
Generated Outputs
The module fixes analysis produces:
- Module Fixes — Comprehensive fix suggestions for each implicated module, addressing how that component should be modified to resolve the error
- Best Module Fix — The selected fix that would most effectively address the root cause of the error
- Best Module Fix Reasoning — Detailed explanation of why this particular module fix was chosen over alternatives
Code Fixes
When code analysis is enabled using --code-file-path and --entry-function parameters, the system generates actionable, implementation-level code fixes for each unique error. These fixes provide specific code modifications that can be directly applied to resolve identified issues.
Limitation: Code fixes are currently available only for OpenAI SDK-based agents implemented in a single file. For other implementations, module-level fixes provide guidance on what to change, even if specific code patches cannot be generated.
Generated Outputs
The code fixes analysis produces:
- Code Fixes — All generated code-level fixes for each implicated module, showing specific implementation changes
- Code Fix Entry Function — The entry point function where the fix should be applied
- Best Code Fix Location — The specific function or code location where the fix is most effective
- Best Code Fix — The selected best fix with precise line numbers and code modifications
These code fixes build upon module-level recommendations to provide ready-to-apply implementation guidance for supported agent architectures.