Goose Benchmarking Framework
The goose-bench crate provides a framework for benchmarking and evaluating LLM models with Goose. It helps quantify model performance across a variety of tasks and generates structured reports.
Features
- Run benchmark suites across multiple LLM models
- Execute evaluations in parallel when supported
- Generate structured JSON and CSV reports
- Process evaluation results with custom scripts
- Calculate aggregate metrics across evaluations
- Support for tool-shim evaluation
- Generate leaderboards and comparative metrics
Prerequisites
- Python Environment: The generate-leaderboard command executes Python scripts and requires a valid Python environment with the necessary dependencies (pandas, etc.)
- OpenAI API Key: For evaluations using LLM-as-judge (like blog_summary and restaurant_research), you must have an OPENAI_API_KEY environment variable set, as the judge uses the OpenAI GPT-4o model
Benchmark Workflow
Running benchmarks is a two-step process:
Step 1: Run Benchmarks
First, run the benchmark evaluations with your configuration:
goose bench run --config /path/to/your-config.json
This will execute all evaluations for all models specified in your configuration and create a benchmark directory with results.
Step 2: Generate Leaderboard
After the benchmarks complete, generate the leaderboard and aggregated metrics:
goose bench generate-leaderboard --benchmark-dir /path/to/benchmark-output-directory
The benchmark directory path will be shown in the output of the previous command, typically in the format benchmark-YYYY-MM-DD-HH:MM:SS.
Note: This command requires a valid Python environment as it executes Python scripts for data aggregation and leaderboard generation.
Configuration
Benchmark configuration is provided through a JSON file. Here's a sample configuration file (leaderboard-config.json) that you can use as a template:
{
"models": [
{
"provider": "databricks",
"name": "gpt-4-1-mini",
"parallel_safe": true,
"tool_shim": {
"use_tool_shim": false,
"tool_shim_model": null
}
},
{
"provider": "databricks",
"name": "claude-3-5-sonnet",
"parallel_safe": true,
"tool_shim": null
},
{
"provider": "databricks",
"name": "gpt-4o",
"parallel_safe": true,
"tool_shim": null
}
],
"evals": [
{
"selector": "core:developer",
"post_process_cmd": null,
"parallel_safe": true
},
{
"selector": "core:developer_search_replace",
"post_process_cmd": null,
"parallel_safe": true
},
{
"selector": "vibes:blog_summary",
"post_process_cmd": "/Users/ahau/Development/goose-1.0/goose/scripts/bench-postprocess-scripts/llm-judges/run_vibes_judge.sh",
"parallel_safe": true
},
{
"selector": "vibes:restaurant_research",
"post_process_cmd": "/Users/ahau/Development/goose-1.0/goose/scripts/bench-postprocess-scripts/llm-judges/run_vibes_judge.sh",
"parallel_safe": true
}
],
"include_dirs": [],
"repeat": 3,
"run_id": null,
"output_dir": "/path/to/output/directory",
"eval_result_filename": "eval-results.json",
"run_summary_filename": "run-results-summary.json",
"env_file": "/path/to/.goosebench.env"
}
Configuration Options
Models
- provider: The LLM provider (e.g., "databricks", "openai")
- name: The model name
- parallel_safe: Whether the model can be run in parallel
- tool_shim: Configuration for tool-shim support
  - use_tool_shim: Whether to use tool-shim
  - tool_shim_model: Optional custom model for tool-shim
Evaluations
- selector: The evaluation selector in the format suite:evaluation
- post_process_cmd: Optional path to a post-processing script
- parallel_safe: Whether the evaluation can be run in parallel
Global Configuration
- include_dirs: Additional directories to include in the benchmark environment
- repeat: Number of times to repeat evaluations (for statistical significance)
- run_id: Optional identifier for the run (defaults to a timestamp)
- output_dir: Directory to store benchmark results (must be an absolute path)
- eval_result_filename: Filename for individual evaluation results
- run_summary_filename: Filename for the run summary
- env_file: Optional path to an environment variables file
Environment Variables
You can provide environment variables through the env_file configuration option. This is useful for provider API keys and other sensitive information. Example .goosebench.env file:
OPENAI_API_KEY=your_openai_api_key_here
DATABRICKS_TOKEN=your_databricks_token_here
# Add other environment variables as needed
Important: For evaluations that use LLM-as-judge (like blog_summary and restaurant_research), you must set OPENAI_API_KEY as the judging system uses OpenAI's GPT-4o model.
Post-Processing
You can specify post-processing commands for evaluations, which will be executed after each evaluation completes. The command receives the path to the evaluation results file as its first argument.
For example, the run_vibes_judge.sh script processes outputs from the blog_summary and restaurant_research evaluations, using LLM-based judging to assign scores.
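As an illustration only, here is a minimal Python sketch of a custom post-processing script. It assumes nothing about the contents of eval-results.json beyond it being JSON; the custom_score field and the write-back convention are placeholders, not documented goose-bench behavior.

#!/usr/bin/env python3
# Hypothetical post-processing sketch: goose-bench invokes post_process_cmd
# with the path to the evaluation results file as its first argument.
import json
import sys

def main():
    results_path = sys.argv[1]  # e.g. .../{suite}/{evaluation}/eval-results.json

    with open(results_path) as f:
        results = json.load(f)

    # Inspect or augment the results here. The "custom_score" field is a
    # placeholder for whatever your judging or scoring logic produces; the
    # real schema of eval-results.json is defined by goose-bench, not here.
    results["custom_score"] = 1.0

    # Writing the file back in place is one possible convention;
    # run_vibes_judge.sh may behave differently, so adapt to your pipeline.
    with open(results_path, "w") as f:
        json.dump(results, f, indent=2)

if __name__ == "__main__":
    main()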
Output Structure
Results are organized in a directory structure that follows this pattern:
{benchmark_dir}/
├── config.cfg # Configuration used for the benchmark
├── {provider}-{model}/
│ ├── eval-results/
│ │ └── aggregate_metrics.csv # Aggregated metrics for this model
│ └── run-{run_id}/
│ ├── {suite}/
│ │ └── {evaluation}/
│ │ ├── eval-results.json # Individual evaluation results
│ │ ├── {eval_name}.jsonl # Session logs
│ │ └── work_dir.json # Info about evaluation working dir
│ └── run-results-summary.json # Summary of all evaluations in this run
├── leaderboard.csv # Final leaderboard comparing all models
└── all_metrics.csv # Union of all metrics across all models
Output Files Explained
Per-Model Files
- eval-results/aggregate_metrics.csv: Contains aggregated metrics for each evaluation, averaged across all runs. Includes metrics like score_mean, total_tokens_mean, prompt_execution_time_seconds_mean, etc.
Global Output Files
- leaderboard.csv: Final leaderboard ranking all models by their average performance across evaluations. Contains columns like:
  - provider, model_name: Model identification
  - avg_score_mean: Average score across all evaluations
  - avg_prompt_execution_time_seconds_mean: Average execution time
  - avg_total_tool_calls_mean: Average number of tool calls
  - avg_total_tokens_mean: Average token usage
- all_metrics.csv: Comprehensive dataset containing detailed metrics for every model-evaluation combination. This is a union of all individual model metrics, useful for detailed analysis and custom reporting.
Each model gets its own directory, containing run results and aggregated CSV files for analysis. The generate-leaderboard command processes all individual evaluation results and creates the comparative metrics files.
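For custom analysis beyond the generated leaderboard, the CSVs can be loaded directly. A minimal sketch, assuming the pandas dependency from the prerequisites; the benchmark directory path is a placeholder, and the filtering on all_metrics.csv assumes a model_name column rather than a documented schema:

import pandas as pd

benchmark_dir = "/path/to/benchmark-output-directory"  # placeholder

# Rank models by average score across evaluations (columns described above).
leaderboard = pd.read_csv(f"{benchmark_dir}/leaderboard.csv")
ranked = leaderboard.sort_values("avg_score_mean", ascending=False)
print(ranked[["provider", "model_name", "avg_score_mean", "avg_total_tokens_mean"]])

# Drill into per-evaluation detail; adjust column names to what your
# all_metrics.csv actually contains.
all_metrics = pd.read_csv(f"{benchmark_dir}/all_metrics.csv")
print(all_metrics[all_metrics["model_name"] == "gpt-4o"].head())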
Error Handling and Troubleshooting
Important: The current version of goose-bench does not have robust error handling for common issues that can occur during evaluation runs, such as:
- Rate limiting from inference providers
- Network timeouts or connection errors
- Provider API errors that cause early session termination
- Resource exhaustion or memory issues
Checking for Failed Evaluations
After running benchmarks, you should inspect the generated metrics files to identify any evaluations that may have failed or terminated early:
- Check the aggregate_metrics.csv files in each model's eval-results/ directory (see the sketch after this list) for:
  - Missing evaluations (fewer rows than expected)
  - Unusually low scores or metrics
  - Zero or near-zero execution times
  - Missing or NaN values
- Look for the server_error_mean column in the aggregate metrics; values greater than 0 indicate server errors occurred during evaluation
- Review session logs (.jsonl files) in individual evaluation directories for error messages such as:
  - "Server error"
  - "Rate limit exceeded"
  - "TEMPORARILY_UNAVAILABLE"
  - Unexpected session terminations
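The first two checks can be scripted. A minimal sketch, assuming pandas and the directory layout shown under Output Structure; the benchmark directory path is a placeholder, and the failure heuristics (NaN values, zero scores) are assumptions to adapt to your own output:

import glob
import pandas as pd

benchmark_dir = "/path/to/benchmark-output-directory"  # placeholder

# Scan every model's aggregate_metrics.csv for likely failure signals.
for path in glob.glob(f"{benchmark_dir}/*/eval-results/aggregate_metrics.csv"):
    df = pd.read_csv(path)
    problems = []

    if df.isna().any().any():
        problems.append("missing/NaN values")
    if "server_error_mean" in df.columns and (df["server_error_mean"] > 0).any():
        problems.append("server errors reported")
    if "score_mean" in df.columns and (df["score_mean"] == 0).any():
        problems.append("zero scores")

    if problems:
        print(f"{path}: {', '.join(problems)}")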
Re-running Failed Evaluations
If you identify failed evaluations, you may need to:
- Adjust rate limiting: Add delays between requests or reduce parallel execution
- Update environment variables: Ensure API keys and tokens are valid
- Re-run specific model/evaluation combinations: Create a new config with only the failed combinations
- Check provider status: Verify the inference provider is operational
Example of creating a config to re-run failed evaluations:
{
"models": [
{
"provider": "databricks",
"name": "claude-3-5-sonnet",
"parallel_safe": false
}
],
"evals": [
{
"selector": "vibes:blog_summary",
"post_process_cmd": "/path/to/scripts/bench-postprocess-scripts/llm-judges/run_vibes_judge.sh",
"parallel_safe": false
}
],
"repeat": 1,
"output_dir": "/path/to/retry-benchmark"
}
We recommend monitoring evaluation progress and checking for errors regularly, especially when running large benchmark suites across multiple models.
Available Commands
List Evaluations
goose bench selectors --config /path/to/config.json
Generate Initial Config
goose bench init-config --name my-benchmark-config.json
Run Benchmarks
goose bench run --config /path/to/config.json
Generate Leaderboard
goose bench generate-leaderboard --benchmark-dir /path/to/benchmark-output