# Goose Benchmarking Framework
The `goose-bench` crate provides a framework for benchmarking and evaluating LLMs with Goose. It helps quantify model performance across various tasks and generate structured reports.
## Features
- Run benchmark suites across multiple LLMs
- Execute evaluations in parallel when supported
- Generate structured JSON and CSV reports
- Process evaluation results with custom scripts
- Calculate aggregate metrics across evaluations
- Support for tool-shim evaluation
- Generate leaderboards and comparative metrics
## Prerequisites
- **Python Environment**: The `generate-leaderboard` command executes Python scripts and requires a valid Python environment with necessary dependencies (pandas, etc.)
- **OpenAI API Key**: For evaluations using LLM-as-judge (like `blog_summary` and `restaurant_research`), you must have an `OPENAI_API_KEY` environment variable set, as the judge uses the OpenAI GPT-4o model
## Benchmark Workflow
Running benchmarks is a two-step process:
### Step 1: Run Benchmarks
First, run the benchmark evaluations with your configuration:
```bash
goose bench run --config /path/to/your-config.json
```
This will execute all evaluations for all models specified in your configuration and create a benchmark directory with results.
### Step 2: Generate Leaderboard
After the benchmarks complete, generate the leaderboard and aggregated metrics:
```bash
goose bench generate-leaderboard --benchmark-dir /path/to/benchmark-output-directory
```
The benchmark directory path will be shown in the output of the previous command, typically in the format `benchmark-YYYY-MM-DD-HH:MM:SS`.
**Note**: This command requires a valid Python environment as it executes Python scripts for data aggregation and leaderboard generation.
## Configuration
Benchmark configuration is provided through a JSON file. Here's a sample configuration file (`leaderboard-config.json`) that you can use as a template:
```json
{
  "models": [
    {
      "provider": "databricks",
      "name": "gpt-4-1-mini",
      "parallel_safe": true,
      "tool_shim": {
        "use_tool_shim": false,
        "tool_shim_model": null
      }
    },
    {
      "provider": "databricks",
      "name": "claude-3-5-sonnet",
      "parallel_safe": true,
      "tool_shim": null
    },
    {
      "provider": "databricks",
      "name": "gpt-4o",
      "parallel_safe": true,
      "tool_shim": null
    }
  ],
  "evals": [
    {
      "selector": "core:developer",
      "post_process_cmd": null,
      "parallel_safe": true
    },
    {
      "selector": "core:developer_search_replace",
      "post_process_cmd": null,
      "parallel_safe": true
    },
    {
      "selector": "vibes:blog_summary",
      "post_process_cmd": "/Users/ahau/Development/goose-1.0/goose/scripts/bench-postprocess-scripts/llm-judges/run_vibes_judge.sh",
      "parallel_safe": true
    },
    {
      "selector": "vibes:restaurant_research",
      "post_process_cmd": "/Users/ahau/Development/goose-1.0/goose/scripts/bench-postprocess-scripts/llm-judges/run_vibes_judge.sh",
      "parallel_safe": true
    }
  ],
  "include_dirs": [],
  "repeat": 3,
  "run_id": null,
  "output_dir": "/path/to/output/directory",
  "eval_result_filename": "eval-results.json",
  "run_summary_filename": "run-results-summary.json",
  "env_file": "/path/to/.goosebench.env"
}
```
## Configuration Options
### Models
- `provider`: The LLM provider (e.g., "databricks", "openai")
- `name`: The model name
- `parallel_safe`: Whether the model can be run in parallel
- `tool_shim`: Configuration for tool-shim support
  - `use_tool_shim`: Whether to use tool-shim
  - `tool_shim_model`: Optional custom model for tool-shim
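For example, a model entry that routes tool calls through the shim with a custom shim model might look like the following sketch (the model names here are placeholders, not recommendations):

```json
{
  "provider": "databricks",
  "name": "model-without-native-tool-calls",
  "parallel_safe": true,
  "tool_shim": {
    "use_tool_shim": true,
    "tool_shim_model": "gpt-4o"
  }
}
```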
### Evaluations
- `selector`: The evaluation selector in the format `suite:evaluation`
- `post_process_cmd`: Optional path to a post-processing script
- `parallel_safe`: Whether the evaluation can be run in parallel
### Global Configuration
- `include_dirs`: Additional directories to include in the benchmark environment
- `repeat`: Number of times to repeat evaluations (for statistical significance)
- `run_id`: Optional identifier for the run (defaults to timestamp)
- `output_dir`: Directory to store benchmark results (must be absolute path)
- `eval_result_filename`: Filename for individual evaluation results
- `run_summary_filename`: Filename for run summary
- `env_file`: Optional path to environment variables file
## Environment Variables
You can provide environment variables through the `env_file` configuration option. This is useful for provider API keys and other sensitive information. Example `.goosebench.env` file:
```bash
OPENAI_API_KEY=your_openai_api_key_here
DATABRICKS_TOKEN=your_databricks_token_here
# Add other environment variables as needed
```
**Important**: For evaluations that use LLM-as-judge (like `blog_summary` and `restaurant_research`), you must set `OPENAI_API_KEY` as the judging system uses OpenAI's GPT-4o model.
## Post-Processing
You can specify post-processing commands for evaluations, which will be executed after each evaluation completes. The command receives the path to the evaluation results file as its first argument.
For example, the `run_vibes_judge.sh` script processes outputs from the `blog_summary` and `restaurant_research` evaluations, using LLM-based judging to assign scores.
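As a rough sketch, any executable can serve as a post-process command; the Python example below only reads the results file handed to it (the exact schema of `eval-results.json` is not documented here, so what you do with the contents is left as an assumption):

```python
#!/usr/bin/env python3
"""Minimal post-process sketch: goose-bench invokes the command with the
path to the evaluation's results file as the first argument."""
import json
import sys


def main() -> None:
    results_path = sys.argv[1]  # e.g. .../{suite}/{evaluation}/eval-results.json
    with open(results_path) as f:
        results = json.load(f)

    # Inspect or score the results here; how you process them (for example,
    # calling an LLM judge and recording scores) depends on your evaluation.
    print(f"post-processing {results_path} ({len(json.dumps(results))} bytes)")


if __name__ == "__main__":
    main()
```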
## Output Structure
Results are organized in a directory structure that follows this pattern:
```
{benchmark_dir}/
├── config.cfg                            # Configuration used for the benchmark
├── {provider}-{model}/
│   ├── eval-results/
│   │   └── aggregate_metrics.csv         # Aggregated metrics for this model
│   └── run-{run_id}/
│       ├── {suite}/
│       │   └── {evaluation}/
│       │       ├── eval-results.json     # Individual evaluation results
│       │       ├── {eval_name}.jsonl     # Session logs
│       │       └── work_dir.json         # Info about evaluation working dir
│       └── run-results-summary.json      # Summary of all evaluations in this run
├── leaderboard.csv                       # Final leaderboard comparing all models
└── all_metrics.csv                       # Union of all metrics across all models
```
### Output Files Explained
#### Per-Model Files
- **`eval-results/aggregate_metrics.csv`**: Contains aggregated metrics for each evaluation, averaged across all runs. Includes metrics like `score_mean`, `total_tokens_mean`, `prompt_execution_time_seconds_mean`, etc.
#### Global Output Files
- **`leaderboard.csv`**: Final leaderboard ranking all models by their average performance across evaluations. Contains columns like:
  - `provider`, `model_name`: Model identification
  - `avg_score_mean`: Average score across all evaluations
  - `avg_prompt_execution_time_seconds_mean`: Average execution time
  - `avg_total_tool_calls_mean`: Average number of tool calls
  - `avg_total_tokens_mean`: Average token usage
- **`all_metrics.csv`**: Comprehensive dataset containing detailed metrics for every model-evaluation combination. This is a union of all individual model metrics, useful for detailed analysis and custom reporting.
Each model gets its own directory, containing run results and aggregated CSV files for analysis. The `generate-leaderboard` command processes all individual evaluation results and creates the comparative metrics files.
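If you want analysis beyond the generated files, the CSVs load directly into pandas (which the leaderboard step already requires). The column names in this sketch are the ones listed above; verify them against your own output before relying on them:

```python
import pandas as pd

benchmark_dir = "/path/to/benchmark-output-directory"  # placeholder path

# One row per model, averaged across all evaluations.
leaderboard = pd.read_csv(f"{benchmark_dir}/leaderboard.csv")
print(
    leaderboard.sort_values("avg_score_mean", ascending=False)[
        ["provider", "model_name", "avg_score_mean", "avg_total_tokens_mean"]
    ]
)

# Detailed per-model, per-evaluation metrics for custom breakdowns.
all_metrics = pd.read_csv(f"{benchmark_dir}/all_metrics.csv")
print(all_metrics.columns.tolist())  # inspect available columns before slicing
```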
## Error Handling and Troubleshooting
**Important**: The current version of goose-bench does not have robust error handling for common issues that can occur during evaluation runs, such as:
- Rate limiting from inference providers
- Network timeouts or connection errors
- Provider API errors that cause early session termination
- Resource exhaustion or memory issues
### Checking for Failed Evaluations
After running benchmarks, you should inspect the generated metrics files to identify any evaluations that may have failed or terminated early:
1. **Check the `aggregate_metrics.csv` files** in each model's `eval-results/` directory for:
   - Missing evaluations (fewer rows than expected)
   - Unusually low scores or metrics
   - Zero or near-zero execution times
   - Missing or NaN values
2. **Look for the `server_error_mean` column** in the aggregate metrics - values greater than 0 indicate that server errors occurred during evaluation
3. **Review session logs** (`.jsonl` files) in individual evaluation directories for error messages like:
   - "Server error"
   - "Rate limit exceeded"
   - "TEMPORARILY_UNAVAILABLE"
   - Unexpected session terminations
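To automate the checks above, a short script can scan every model's `aggregate_metrics.csv`. This sketch relies only on the `server_error_mean` column mentioned above plus generic NaN detection, so adapt it to the columns your runs actually produce:

```python
import glob

import pandas as pd

benchmark_dir = "/path/to/benchmark-output-directory"  # placeholder path

for path in glob.glob(f"{benchmark_dir}/*/eval-results/aggregate_metrics.csv"):
    df = pd.read_csv(path)
    # Server errors recorded during evaluation runs.
    if "server_error_mean" in df.columns and (df["server_error_mean"] > 0).any():
        print(f"{path}: server errors reported")
    # Missing/NaN values often indicate evaluations that terminated early.
    if df.isna().any().any():
        print(f"{path}: contains missing values")
```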
### Re-running Failed Evaluations
If you identify failed evaluations, you may need to:
1. **Adjust rate limiting**: Add delays between requests or reduce parallel execution
2. **Update environment variables**: Ensure API keys and tokens are valid
3. **Re-run specific model/evaluation combinations**: Create a new config with only the failed combinations
4. **Check provider status**: Verify the inference provider is operational
Example of creating a config to re-run failed evaluations:
```json
{
  "models": [
    {
      "provider": "databricks",
      "name": "claude-3-5-sonnet",
      "parallel_safe": false
    }
  ],
  "evals": [
    {
      "selector": "vibes:blog_summary",
      "post_process_cmd": "/path/to/scripts/bench-postprocess-scripts/llm-judges/run_vibes_judge.sh",
      "parallel_safe": false
    }
  ],
  "repeat": 1,
  "output_dir": "/path/to/retry-benchmark"
}
```
We recommend monitoring evaluation progress and checking for errors regularly, especially when running large benchmark suites across multiple models.
## Available Commands
### List Evaluations
```bash
goose bench selectors --config /path/to/config.json
```
### Generate Initial Config
```bash
goose bench init-config --name my-benchmark-config.json
```
### Run Benchmarks
```bash
goose bench run --config /path/to/config.json
```
### Generate Leaderboard
```bash
goose bench generate-leaderboard --benchmark-dir /path/to/benchmark-output
```