# Goose Benchmarking Framework
The `goose-bench` crate provides a framework for benchmarking and evaluating LLMs with Goose. It helps quantify model performance across various tasks and generate structured reports.
## Features
- Run benchmark suites across multiple LLMs
- Execute evaluations in parallel when supported
- Generate structured JSON and CSV reports
- Process evaluation results with custom scripts
- Calculate aggregate metrics across evaluations
- Support for tool-shim evaluation
- Generate leaderboards and comparative metrics
## Prerequisites
- **Python Environment**: The `generate-leaderboard` command executes Python scripts and requires a valid Python environment with necessary dependencies (pandas, etc.)
- **OpenAI API Key**: For evaluations using LLM-as-judge (like `blog_summary` and `restaurant_research`), you must have an `OPENAI_API_KEY` environment variable set, as the judge uses the OpenAI GPT-4o model
## Benchmark Workflow
Running benchmarks is a two-step process:
### Step 1: Run Benchmarks
First, run the benchmark evaluations with your configuration:
```bash
goose bench run --config /path/to/your-config.json
```
This will execute all evaluations for all models specified in your configuration and create a benchmark directory with results.
### Step 2: Generate Leaderboard
After the benchmarks complete, generate the leaderboard and aggregated metrics:
```bash
goose bench generate-leaderboard --benchmark-dir /path/to/benchmark-output-directory
```
The benchmark directory path will be shown in the output of the previous command, typically in the format `benchmark-YYYY-MM-DD-HH:MM:SS`.
**Note**: This command requires a valid Python environment as it executes Python scripts for data aggregation and leaderboard generation.
## Configuration
Benchmark configuration is provided through a JSON file. Here's a sample configuration file (`leaderboard-config.json`) that you can use as a template:
```json
{
  "models": [
    {
      "provider": "databricks",
      "name": "gpt-4-1-mini",
      "parallel_safe": true,
      "tool_shim": {
        "use_tool_shim": false,
        "tool_shim_model": null
      }
    },
    {
      "provider": "databricks",
      "name": "claude-3-5-sonnet",
      "parallel_safe": true,
      "tool_shim": null
    },
    {
      "provider": "databricks",
      "name": "gpt-4o",
      "parallel_safe": true,
      "tool_shim": null
    }
  ],
  "evals": [
    {
      "selector": "core:developer",
      "post_process_cmd": null,
      "parallel_safe": true
    },
    {
      "selector": "core:developer_search_replace",
      "post_process_cmd": null,
      "parallel_safe": true
    },
    {
      "selector": "vibes:blog_summary",
      "post_process_cmd": "/Users/ahau/Development/goose-1.0/goose/scripts/bench-postprocess-scripts/llm-judges/run_vibes_judge.sh",
      "parallel_safe": true
    },
    {
      "selector": "vibes:restaurant_research",
      "post_process_cmd": "/Users/ahau/Development/goose-1.0/goose/scripts/bench-postprocess-scripts/llm-judges/run_vibes_judge.sh",
      "parallel_safe": true
    }
  ],
  "include_dirs": [],
  "repeat": 3,
  "run_id": null,
  "output_dir": "/path/to/output/directory",
  "eval_result_filename": "eval-results.json",
  "run_summary_filename": "run-results-summary.json",
  "env_file": "/path/to/.goosebench.env"
}
```
## Configuration Options
### Models
- `provider`: The LLM provider (e.g., "databricks", "openai")
- `name`: The model name
- `parallel_safe`: Whether the model can be run in parallel
- `tool_shim`: Configuration for tool-shim support
  - `use_tool_shim`: Whether to use tool-shim
  - `tool_shim_model`: Optional custom model for tool-shim
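For example, a model entry that routes tool calls through the shim with a custom shim model might look like the following sketch (the model names here are placeholders, not recommendations):

```json
{
  "provider": "databricks",
  "name": "model-without-native-tool-calls",
  "parallel_safe": true,
  "tool_shim": {
    "use_tool_shim": true,
    "tool_shim_model": "gpt-4o"
  }
}
```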
### Evaluations
- `selector`: The evaluation selector in the format `suite:evaluation`
- `post_process_cmd`: Optional path to a post-processing script
- `parallel_safe`: Whether the evaluation can be run in parallel
### Global Configuration
- `include_dirs`: Additional directories to include in the benchmark environment
- `repeat`: Number of times to repeat evaluations (for statistical significance)
- `run_id`: Optional identifier for the run (defaults to timestamp)
- `output_dir`: Directory to store benchmark results (must be absolute path)
- `eval_result_filename`: Filename for individual evaluation results
- `run_summary_filename`: Filename for run summary
- `env_file`: Optional path to environment variables file
## Environment Variables
You can provide environment variables through the `env_file` configuration option. This is useful for provider API keys and other sensitive information. Example `.goosebench.env` file:
```bash
OPENAI_API_KEY=your_openai_api_key_here
DATABRICKS_TOKEN=your_databricks_token_here
# Add other environment variables as needed
```
**Important**: For evaluations that use LLM-as-judge (like `blog_summary` and `restaurant_research`), you must set `OPENAI_API_KEY` as the judging system uses OpenAI's GPT-4o model.
## Post-Processing
You can specify post-processing commands for evaluations, which will be executed after each evaluation completes. The command receives the path to the evaluation results file as its first argument.
For example, the `run_vibes_judge.sh` script processes outputs from the `blog_summary` and `restaurant_research` evaluations, using LLM-based judging to assign scores.
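As a rough sketch, any executable can serve as a post-process command; the Python example below only reads the results file handed to it (the exact schema of `eval-results.json` is not documented here, so what you do with the contents is left as an assumption):

```python
#!/usr/bin/env python3
"""Minimal post-process sketch: goose-bench invokes the command with the
path to the evaluation's results file as the first argument."""
import json
import sys


def main() -> None:
    results_path = sys.argv[1]  # e.g. .../{suite}/{evaluation}/eval-results.json
    with open(results_path) as f:
        results = json.load(f)

    # Inspect or score the results here; how you process them (for example,
    # calling an LLM judge and recording scores) depends on your evaluation.
    print(f"post-processing {results_path} ({len(json.dumps(results))} bytes)")


if __name__ == "__main__":
    main()
```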
## Output Structure
Results are organized in a directory structure that follows this pattern:
```
{benchmark_dir}/
├── config.cfg                            # Configuration used for the benchmark
├── {provider}-{model}/
│   ├── eval-results/
│   │   └── aggregate_metrics.csv         # Aggregated metrics for this model
│   └── run-{run_id}/
│       ├── {suite}/
│       │   └── {evaluation}/
│       │       ├── eval-results.json     # Individual evaluation results
│       │       ├── {eval_name}.jsonl     # Session logs
│       │       └── work_dir.json         # Info about evaluation working dir
│       └── run-results-summary.json      # Summary of all evaluations in this run
├── leaderboard.csv                       # Final leaderboard comparing all models
└── all_metrics.csv                       # Union of all metrics across all models
```
### Output Files Explained
#### Per-Model Files
- **`eval-results/aggregate_metrics.csv`**: Contains aggregated metrics for each evaluation, averaged across all runs. Includes metrics like `score_mean`, `total_tokens_mean`, `prompt_execution_time_seconds_mean`, etc.
#### Global Output Files
- **`leaderboard.csv`**: Final leaderboard ranking all models by their average performance across evaluations. Contains columns like:
  - `provider`, `model_name`: Model identification
  - `avg_score_mean`: Average score across all evaluations
  - `avg_prompt_execution_time_seconds_mean`: Average execution time
  - `avg_total_tool_calls_mean`: Average number of tool calls
  - `avg_total_tokens_mean`: Average token usage
- **`all_metrics.csv`**: Comprehensive dataset containing detailed metrics for every model-evaluation combination. This is a union of all individual model metrics, useful for detailed analysis and custom reporting.
Each model gets its own directory, containing run results and aggregated CSV files for analysis. The `generate-leaderboard` command processes all individual evaluation results and creates the comparative metrics files.
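If you want analysis beyond the generated files, the CSVs load directly into pandas (which the leaderboard step already requires). The column names in this sketch are the ones listed above; verify them against your own output before relying on them:

```python
import pandas as pd

benchmark_dir = "/path/to/benchmark-output-directory"  # placeholder path

# One row per model, averaged across all evaluations.
leaderboard = pd.read_csv(f"{benchmark_dir}/leaderboard.csv")
print(
    leaderboard.sort_values("avg_score_mean", ascending=False)[
        ["provider", "model_name", "avg_score_mean", "avg_total_tokens_mean"]
    ]
)

# Detailed per-model, per-evaluation metrics for custom breakdowns.
all_metrics = pd.read_csv(f"{benchmark_dir}/all_metrics.csv")
print(all_metrics.columns.tolist())  # inspect available columns before slicing
```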
## Error Handling and Troubleshooting
**Important**: The current version of goose-bench does not have robust error handling for common issues that can occur during evaluation runs, such as:
- Rate limiting from inference providers
- Network timeouts or connection errors
- Provider API errors that cause early session termination
- Resource exhaustion or memory issues
### Checking for Failed Evaluations
After running benchmarks, you should inspect the generated metrics files to identify any evaluations that may have failed or terminated early:
1. **Check the `aggregate_metrics.csv` files** in each model's `eval-results/` directory for:
   - Missing evaluations (fewer rows than expected)
   - Unusually low scores or metrics
   - Zero or near-zero execution times
   - Missing or NaN values
2. **Look for the `server_error_mean` column** in the aggregate metrics - values greater than 0 indicate that server errors occurred during evaluation
3. **Review session logs** (`.jsonl` files) in individual evaluation directories for error messages like:
   - "Server error"
   - "Rate limit exceeded"
   - "TEMPORARILY_UNAVAILABLE"
   - Unexpected session terminations
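To automate the checks above, a short script can scan every model's `aggregate_metrics.csv`. This sketch relies only on the `server_error_mean` column mentioned above plus generic NaN detection, so adapt it to the columns your runs actually produce:

```python
import glob

import pandas as pd

benchmark_dir = "/path/to/benchmark-output-directory"  # placeholder path

for path in glob.glob(f"{benchmark_dir}/*/eval-results/aggregate_metrics.csv"):
    df = pd.read_csv(path)
    # Server errors recorded during evaluation runs.
    if "server_error_mean" in df.columns and (df["server_error_mean"] > 0).any():
        print(f"{path}: server errors reported")
    # Missing/NaN values often indicate evaluations that terminated early.
    if df.isna().any().any():
        print(f"{path}: contains missing values")
```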
### Re-running Failed Evaluations
If you identify failed evaluations, you may need to:
1. **Adjust rate limiting**: Add delays between requests or reduce parallel execution
2. **Update environment variables**: Ensure API keys and tokens are valid
3. **Re-run specific model/evaluation combinations**: Create a new config with only the failed combinations
4. **Check provider status**: Verify the inference provider is operational
Example of creating a config to re-run failed evaluations:
```json
{
  "models": [
    {
      "provider": "databricks",
      "name": "claude-3-5-sonnet",
      "parallel_safe": false
    }
  ],
  "evals": [
    {
      "selector": "vibes:blog_summary",
      "post_process_cmd": "/path/to/scripts/bench-postprocess-scripts/llm-judges/run_vibes_judge.sh",
      "parallel_safe": false
    }
  ],
  "repeat": 1,
  "output_dir": "/path/to/retry-benchmark"
}
```
We recommend monitoring evaluation progress and checking for errors regularly, especially when running large benchmark suites across multiple models.
## Available Commands
### List Evaluations
```bash
goose bench selectors --config /path/to/config.json
```
### Generate Initial Config
```bash
goose bench init-config --name my-benchmark-config.json
```
### Run Benchmarks
```bash
goose bench run --config /path/to/config.json
```
### Generate Leaderboard
```bash
goose bench generate-leaderboard --benchmark-dir /path/to/benchmark-output
```