# Goose Benchmarking Framework

The `goose-bench` crate provides a framework for benchmarking and evaluating LLM models with the Goose framework. This tool helps quantify model performance across various tasks and generate structured reports.
## Features

- Run benchmark suites across multiple LLM models
- Execute evaluations in parallel when supported
- Generate structured JSON and CSV reports
- Process evaluation results with custom scripts
- Calculate aggregate metrics across evaluations
- Support for tool-shim evaluation
- Generate leaderboards and comparative metrics
## Prerequisites

- **Python Environment**: The `generate-leaderboard` command executes Python scripts and requires a valid Python environment with the necessary dependencies (pandas, etc.).
- **OpenAI API Key**: Evaluations that use LLM-as-judge (such as `blog_summary` and `restaurant_research`) require the `OPENAI_API_KEY` environment variable to be set, as the judge uses the OpenAI GPT-4o model.
## Benchmark Workflow

Running benchmarks is a two-step process:

### Step 1: Run Benchmarks

First, run the benchmark evaluations with your configuration:

```bash
goose bench run --config /path/to/your-config.json
```

This will execute all evaluations for all models specified in your configuration and create a benchmark directory with results.

### Step 2: Generate Leaderboard

After the benchmarks complete, generate the leaderboard and aggregated metrics:

```bash
goose bench generate-leaderboard --benchmark-dir /path/to/benchmark-output-directory
```

The benchmark directory path will be shown in the output of the previous command, typically in the format `benchmark-YYYY-MM-DD-HH:MM:SS`.

**Note**: This command requires a valid Python environment as it executes Python scripts for data aggregation and leaderboard generation.
## Configuration

Benchmark configuration is provided through a JSON file. Here's a sample configuration file (`leaderboard-config.json`) that you can use as a template:

```json
{
  "models": [
    {
      "provider": "databricks",
      "name": "gpt-4-1-mini",
      "parallel_safe": true,
      "tool_shim": {
        "use_tool_shim": false,
        "tool_shim_model": null
      }
    },
    {
      "provider": "databricks",
      "name": "claude-3-5-sonnet",
      "parallel_safe": true,
      "tool_shim": null
    },
    {
      "provider": "databricks",
      "name": "gpt-4o",
      "parallel_safe": true,
      "tool_shim": null
    }
  ],
  "evals": [
    {
      "selector": "core:developer",
      "post_process_cmd": null,
      "parallel_safe": true
    },
    {
      "selector": "core:developer_search_replace",
      "post_process_cmd": null,
      "parallel_safe": true
    },
    {
      "selector": "vibes:blog_summary",
      "post_process_cmd": "/Users/ahau/Development/goose-1.0/goose/scripts/bench-postprocess-scripts/llm-judges/run_vibes_judge.sh",
      "parallel_safe": true
    },
    {
      "selector": "vibes:restaurant_research",
      "post_process_cmd": "/Users/ahau/Development/goose-1.0/goose/scripts/bench-postprocess-scripts/llm-judges/run_vibes_judge.sh",
      "parallel_safe": true
    }
  ],
  "include_dirs": [],
  "repeat": 3,
  "run_id": null,
  "output_dir": "/path/to/output/directory",
  "eval_result_filename": "eval-results.json",
  "run_summary_filename": "run-results-summary.json",
  "env_file": "/path/to/.goosebench.env"
}
```
## Configuration Options

### Models

- `provider`: The LLM provider (e.g., "databricks", "openai")
- `name`: The model name
- `parallel_safe`: Whether the model can be run in parallel
- `tool_shim`: Configuration for tool-shim support
  - `use_tool_shim`: Whether to use tool-shim
  - `tool_shim_model`: Optional custom model for tool-shim

### Evaluations

- `selector`: The evaluation selector in the format `suite:evaluation`
- `post_process_cmd`: Optional path to a post-processing script
- `parallel_safe`: Whether the evaluation can be run in parallel

### Global Configuration

- `include_dirs`: Additional directories to include in the benchmark environment
- `repeat`: Number of times to repeat each evaluation (for statistical significance)
- `run_id`: Optional identifier for the run (defaults to a timestamp)
- `output_dir`: Directory to store benchmark results (must be an absolute path)
- `eval_result_filename`: Filename for individual evaluation results
- `run_summary_filename`: Filename for the run summary
- `env_file`: Optional path to an environment variables file
## Environment Variables

You can provide environment variables through the `env_file` configuration option. This is useful for provider API keys and other sensitive information. Example `.goosebench.env` file:

```bash
OPENAI_API_KEY=your_openai_api_key_here
DATABRICKS_TOKEN=your_databricks_token_here
# Add other environment variables as needed
```

**Important**: For evaluations that use LLM-as-judge (like `blog_summary` and `restaurant_research`), you must set `OPENAI_API_KEY`, as the judging system uses OpenAI's GPT-4o model.
## Post-Processing

You can specify post-processing commands for evaluations, which will be executed after each evaluation completes. The command receives the path to the evaluation results file as its first argument.

For example, the `run_vibes_judge.sh` script processes outputs from the `blog_summary` and `restaurant_research` evaluations, using LLM-based judging to assign scores.
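
As a rough illustration of this contract, here is a minimal sketch of a custom post-processing script in Python. Everything beyond "the script receives the results-file path as its first argument" is an assumption for illustration: the JSON structure and the `custom_metrics` field are hypothetical, not part of goose-bench.

```python
#!/usr/bin/env python3
# post_process.py -- hypothetical sketch of a post-processing hook.
# Assumption: the results file is JSON and may be annotated in place.
import json
import sys


def main() -> None:
    # goose-bench invokes the command with the evaluation results file path
    # as the first argument.
    results_path = sys.argv[1]

    with open(results_path) as f:
        results = json.load(f)

    # Hypothetical post-processing step: attach a custom annotation.
    # Replace this with whatever scoring or judging your evaluation needs.
    results.setdefault("custom_metrics", {})["postprocessed"] = True

    with open(results_path, "w") as f:
        json.dump(results, f, indent=2)


if __name__ == "__main__":
    main()
```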
## Output Structure

Results are organized in a directory structure that follows this pattern:

```
{benchmark_dir}/
├── config.cfg                          # Configuration used for the benchmark
├── {provider}-{model}/
│   ├── eval-results/
│   │   └── aggregate_metrics.csv       # Aggregated metrics for this model
│   └── run-{run_id}/
│       ├── {suite}/
│       │   └── {evaluation}/
│       │       ├── eval-results.json   # Individual evaluation results
│       │       ├── {eval_name}.jsonl   # Session logs
│       │       └── work_dir.json       # Info about evaluation working dir
│       └── run-results-summary.json    # Summary of all evaluations in this run
├── leaderboard.csv                     # Final leaderboard comparing all models
└── all_metrics.csv                     # Union of all metrics across all models
```
### Output Files Explained

#### Per-Model Files

- **`eval-results/aggregate_metrics.csv`**: Contains aggregated metrics for each evaluation, averaged across all runs. Includes metrics like `score_mean`, `total_tokens_mean`, `prompt_execution_time_seconds_mean`, etc.

#### Global Output Files

- **`leaderboard.csv`**: Final leaderboard ranking all models by their average performance across evaluations. Contains columns like:
  - `provider`, `model_name`: Model identification
  - `avg_score_mean`: Average score across all evaluations
  - `avg_prompt_execution_time_seconds_mean`: Average execution time
  - `avg_total_tool_calls_mean`: Average number of tool calls
  - `avg_total_tokens_mean`: Average token usage
- **`all_metrics.csv`**: Comprehensive dataset containing detailed metrics for every model-evaluation combination. This is a union of all individual model metrics, useful for detailed analysis and custom reporting.

Each model gets its own directory, containing run results and aggregated CSV files for analysis. The `generate-leaderboard` command processes all individual evaluation results and creates the comparative metrics files.
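
For custom analysis, these CSV files can be loaded directly with pandas. The sketch below is illustrative only: the benchmark directory path is a placeholder, and the column names are the ones listed above.

```python
# analyze_results.py -- minimal sketch for exploring the generated CSV files.
import pandas as pd

BENCHMARK_DIR = "/path/to/benchmark-output-directory"  # placeholder path

# Rank models by average score using the generated leaderboard.
leaderboard = pd.read_csv(f"{BENCHMARK_DIR}/leaderboard.csv")
ranked = leaderboard.sort_values("avg_score_mean", ascending=False)
print(ranked[["provider", "model_name", "avg_score_mean", "avg_total_tokens_mean"]])

# Load the union of all per-evaluation metrics for deeper, custom analysis.
all_metrics = pd.read_csv(f"{BENCHMARK_DIR}/all_metrics.csv")
print(all_metrics.head())
```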
## Error Handling and Troubleshooting

**Important**: The current version of goose-bench does not have robust error handling for common issues that can occur during evaluation runs, such as:

- Rate limiting from inference providers
- Network timeouts or connection errors
- Provider API errors that cause early session termination
- Resource exhaustion or memory issues
### Checking for Failed Evaluations

After running benchmarks, you should inspect the generated metrics files to identify any evaluations that may have failed or terminated early:

1. **Check the `aggregate_metrics.csv` files** in each model's `eval-results/` directory for:
   - Missing evaluations (fewer rows than expected)
   - Unusually low scores or metrics
   - Zero or near-zero execution times
   - Missing or NaN values
2. **Look for the `server_error_mean` column** in the aggregate metrics: values greater than 0 indicate that server errors occurred during evaluation (see the sketch after this list for an automated check)
3. **Review session logs** (`.jsonl` files) in individual evaluation directories for error messages like:
   - "Server error"
   - "Rate limit exceeded"
   - "TEMPORARILY_UNAVAILABLE"
   - Unexpected session terminations
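
As a starting point for the checks above, a short script along these lines can flag suspect rows across all models. The path is a placeholder, and only the `server_error_mean` column and the directory layout described under "Output Structure" are taken from this document; everything else is an illustrative assumption.

```python
# check_failures.py -- minimal sketch for spotting failed or truncated evaluations.
import glob

import pandas as pd

BENCHMARK_DIR = "/path/to/benchmark-output-directory"  # placeholder path

for path in glob.glob(f"{BENCHMARK_DIR}/*/eval-results/aggregate_metrics.csv"):
    df = pd.read_csv(path)

    # Server errors reported during evaluation (values > 0 indicate failures).
    if "server_error_mean" in df.columns:
        errored = df[df["server_error_mean"] > 0]
        if not errored.empty:
            print(f"{path}: {len(errored)} row(s) with server errors")

    # Missing or NaN values often indicate an evaluation that terminated early.
    if df.isna().any().any():
        print(f"{path}: contains missing/NaN values -- inspect the session logs")
```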
### Re-running Failed Evaluations

If you identify failed evaluations, you may need to:

1. **Adjust rate limiting**: Add delays between requests or reduce parallel execution
2. **Update environment variables**: Ensure API keys and tokens are valid
3. **Re-run specific model/evaluation combinations**: Create a new config with only the failed combinations
4. **Check provider status**: Verify the inference provider is operational

Example of creating a config to re-run failed evaluations:

```json
{
  "models": [
    {
      "provider": "databricks",
      "name": "claude-3-5-sonnet",
      "parallel_safe": false
    }
  ],
  "evals": [
    {
      "selector": "vibes:blog_summary",
      "post_process_cmd": "/path/to/scripts/bench-postprocess-scripts/llm-judges/run_vibes_judge.sh",
      "parallel_safe": false
    }
  ],
  "repeat": 1,
  "output_dir": "/path/to/retry-benchmark"
}
```

We recommend monitoring evaluation progress and checking for errors regularly, especially when running large benchmark suites across multiple models.
## Available Commands

### List Evaluations

```bash
goose bench selectors --config /path/to/config.json
```

### Generate Initial Config

```bash
goose bench init-config --name my-benchmark-config.json
```

### Run Benchmarks

```bash
goose bench run --config /path/to/config.json
```

### Generate Leaderboard

```bash
goose bench generate-leaderboard --benchmark-dir /path/to/benchmark-output
```