Goose Benchmarking Framework

The goose-bench crate provides a framework for benchmarking and evaluating LLMs within Goose. It helps quantify model performance across a variety of tasks and generate structured reports.

Features

  • Run benchmark suites across multiple models
  • Execute evaluations in parallel when supported
  • Generate structured JSON and CSV reports
  • Process evaluation results with custom scripts
  • Calculate aggregate metrics across evaluations
  • Support for tool-shim evaluation
  • Generate leaderboards and comparative metrics

Prerequisites

  • Python Environment: The generate-leaderboard command executes Python scripts and requires a valid Python environment with the necessary dependencies installed (pandas, etc.)
  • OpenAI API Key: For evaluations using LLM-as-judge (like blog_summary and restaurant_research), you must have an OPENAI_API_KEY environment variable set, as the judge uses the OpenAI GPT-4o model

Benchmark Workflow

Running benchmarks is a two-step process:

Step 1: Run Benchmarks

First, run the benchmark evaluations with your configuration:

goose bench run --config /path/to/your-config.json

This will execute all evaluations for all models specified in your configuration and create a benchmark directory with results.

Step 2: Generate Leaderboard

After the benchmarks complete, generate the leaderboard and aggregated metrics:

goose bench generate-leaderboard --benchmark-dir /path/to/benchmark-output-directory

The benchmark directory path will be shown in the output of the previous command, typically in the format benchmark-YYYY-MM-DD-HH:MM:SS.

Note: This command requires a valid Python environment as it executes Python scripts for data aggregation and leaderboard generation.
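
If you do not already have a suitable environment, a minimal setup along these lines is usually enough (pandas is the main dependency named above; check the scripts the command executes for anything else they import):

python3 -m venv .venv
source .venv/bin/activate
pip install pandas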

Configuration

Benchmark configuration is provided through a JSON file. Here's a sample configuration file (leaderboard-config.json) that you can use as a template:

{
  "models": [
    {
      "provider": "databricks",
      "name": "gpt-4-1-mini",
      "parallel_safe": true,
      "tool_shim": {
        "use_tool_shim": false,
        "tool_shim_model": null
      }
    },
    {
      "provider": "databricks",
      "name": "claude-3-5-sonnet",
      "parallel_safe": true,
      "tool_shim": null
    },
    {
      "provider": "databricks",
      "name": "gpt-4o",
      "parallel_safe": true,
      "tool_shim": null
    }
  ],
  "evals": [
    {
      "selector": "core:developer",
      "post_process_cmd": null,
      "parallel_safe": true
    },
    {
      "selector": "core:developer_search_replace",
      "post_process_cmd": null,
      "parallel_safe": true
    },
    {
      "selector": "vibes:blog_summary",
      "post_process_cmd": "/Users/ahau/Development/goose-1.0/goose/scripts/bench-postprocess-scripts/llm-judges/run_vibes_judge.sh",
      "parallel_safe": true
    },
    {
      "selector": "vibes:restaurant_research",
      "post_process_cmd": "/Users/ahau/Development/goose-1.0/goose/scripts/bench-postprocess-scripts/llm-judges/run_vibes_judge.sh",
      "parallel_safe": true
    }
  ],
  "include_dirs": [],
  "repeat": 3,
  "run_id": null,
  "output_dir": "/path/to/output/directory",
  "eval_result_filename": "eval-results.json",
  "run_summary_filename": "run-results-summary.json",
  "env_file": "/path/to/.goosebench.env"
}

Configuration Options

Models

  • provider: The LLM provider (e.g., "databricks", "openai")
  • name: The model name
  • parallel_safe: Whether the model can be run in parallel
  • tool_shim: Configuration for tool-shim support
    • use_tool_shim: Whether to use tool-shim
    • tool_shim_model: Optional custom model for tool-shim
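
For illustration, a model entry with the shim enabled could look like the following fragment (the tool_shim_model value is a placeholder; as in the sample config above, it may also be null to use the default):

{
  "provider": "databricks",
  "name": "claude-3-5-sonnet",
  "parallel_safe": true,
  "tool_shim": {
    "use_tool_shim": true,
    "tool_shim_model": "my-shim-model"
  }
}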

Evaluations

  • selector: The evaluation selector, in the format suite:evaluation
  • post_process_cmd: Optional path to a post-processing script
  • parallel_safe: Whether the evaluation can be run in parallel

Global Configuration

  • include_dirs: Additional directories to include in the benchmark environment
  • repeat: Number of times to repeat each evaluation (metrics are averaged across runs)
  • run_id: Optional identifier for the run (defaults to a timestamp)
  • output_dir: Directory to store benchmark results (must be an absolute path)
  • eval_result_filename: Filename for individual evaluation results
  • run_summary_filename: Filename for run summary
  • env_file: Optional path to environment variables file

Environment Variables

You can provide environment variables through the env_file configuration option. This is useful for provider API keys and other sensitive information. Example .goosebench.env file:

OPENAI_API_KEY=your_openai_api_key_here
DATABRICKS_TOKEN=your_databricks_token_here
# Add other environment variables as needed

Important: For evaluations that use LLM-as-judge (like blog_summary and restaurant_research), you must set OPENAI_API_KEY as the judging system uses OpenAI's GPT-4o model.

Post-Processing

You can specify post-processing commands for evaluations, which will be executed after each evaluation completes. The command receives the path to the evaluation results file as its first argument.

For example, the run_vibes_judge.sh script processes outputs from the blog_summary and restaurant_research evaluations, using LLM-based judging to assign scores.
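
As a minimal sketch, a custom post-processing script only has to read the results path passed as its first argument; the pretty-printing step below is a stand-in for real scoring or judging logic:

#!/usr/bin/env bash
# Hypothetical post-processor: goose-bench invokes it with the path to the
# evaluation results file as the first argument.
set -euo pipefail
results_file="$1"

# Placeholder logic: pretty-print the results alongside the original file.
python3 -m json.tool "$results_file" > "${results_file%.json}.pretty.json"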

Output Structure

Results are organized in a directory structure that follows this pattern:

{benchmark_dir}/
├── config.cfg                           # Configuration used for the benchmark
├── {provider}-{model}/
│   ├── eval-results/
│   │   └── aggregate_metrics.csv        # Aggregated metrics for this model
│   └── run-{run_id}/
│       ├── {suite}/
│       │   └── {evaluation}/
│       │       ├── eval-results.json    # Individual evaluation results
│       │       ├── {eval_name}.jsonl    # Session logs
│       │       └── work_dir.json        # Info about evaluation working dir
│       └── run-results-summary.json     # Summary of all evaluations in this run
├── leaderboard.csv                      # Final leaderboard comparing all models
└── all_metrics.csv                      # Union of all metrics across all models

Output Files Explained

Per-Model Files

  • eval-results/aggregate_metrics.csv: Contains aggregated metrics for each evaluation, averaged across all runs. Includes metrics like score_mean, total_tokens_mean, prompt_execution_time_seconds_mean, etc.

Global Output Files

  • leaderboard.csv: Final leaderboard ranking all models by their average performance across evaluations. Contains columns like:

    • provider, model_name: Model identification
    • avg_score_mean: Average score across all evaluations
    • avg_prompt_execution_time_seconds_mean: Average execution time
    • avg_total_tool_calls_mean: Average number of tool calls
    • avg_total_tokens_mean: Average token usage
  • all_metrics.csv: Comprehensive dataset containing detailed metrics for every model-evaluation combination. This is a union of all individual model metrics, useful for detailed analysis and custom reporting.

Each model gets its own directory, containing run results and aggregated CSV files for analysis. The generate-leaderboard command processes all individual evaluation results and creates the comparative metrics files.
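
Since pandas is already required for leaderboard generation, a quick way to inspect these CSV files from the shell is a one-liner like the following (the path is illustrative):

python3 -c "import sys, pandas as pd; print(pd.read_csv(sys.argv[1]).to_string(index=False))" /path/to/benchmark-output-directory/leaderboard.csv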

Error Handling and Troubleshooting

Important: The current version of goose-bench does not have robust error handling for common issues that can occur during evaluation runs, such as:

  • Rate limiting from inference providers
  • Network timeouts or connection errors
  • Provider API errors that cause early session termination
  • Resource exhaustion or memory issues

Checking for Failed Evaluations

After running benchmarks, you should inspect the generated metrics files to identify any evaluations that may have failed or terminated early:

  1. Check the aggregate_metrics.csv files in each model's eval-results/ directory for:

    • Missing evaluations (fewer rows than expected)
    • Unusually low scores or metrics
    • Zero or near-zero execution times
    • Missing or NaN values
  2. Look for the server_error_mean column in the aggregate metrics: values greater than 0 indicate that server errors occurred during evaluation

  3. Review session logs (.jsonl files) in individual evaluation directories for error messages like the following (a grep example appears after this list):

    • "Server error"
    • "Rate limit exceeded"
    • "TEMPORARILY_UNAVAILABLE"
    • Unexpected session terminations
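
One way to do the log check in step 3 is a recursive grep over the session logs, using the error strings listed above (adjust the directory to your benchmark output):

grep -rl --include="*.jsonl" \
  -e "Server error" \
  -e "Rate limit exceeded" \
  -e "TEMPORARILY_UNAVAILABLE" \
  /path/to/benchmark-output-directory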

Re-running Failed Evaluations

If you identify failed evaluations, you may need to:

  1. Adjust rate limiting: Add delays between requests or reduce parallel execution
  2. Update environment variables: Ensure API keys and tokens are valid
  3. Re-run specific model/evaluation combinations: Create a new config with only the failed combinations
  4. Check provider status: Verify the inference provider is operational

Example of creating a config to re-run failed evaluations:

{
  "models": [
    {
      "provider": "databricks",
      "name": "claude-3-5-sonnet",
      "parallel_safe": false
    }
  ],
  "evals": [
    {
      "selector": "vibes:blog_summary",
      "post_process_cmd": "/path/to/scripts/bench-postprocess-scripts/llm-judges/run_vibes_judge.sh",
      "parallel_safe": false
    }
  ],
  "repeat": 1,
  "output_dir": "/path/to/retry-benchmark"
}
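
After saving the retry config, run it the same way as before, then point generate-leaderboard at the new benchmark directory reported by the run (see Step 2 above):

goose bench run --config /path/to/retry-config.json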

We recommend monitoring evaluation progress and checking for errors regularly, especially when running large benchmark suites across multiple models.

Available Commands

List Evaluations

goose bench selectors --config /path/to/config.json

Generate Initial Config

goose bench init-config --name my-benchmark-config.json

Run Benchmarks

goose bench run --config /path/to/config.json

Generate Leaderboard

goose bench generate-leaderboard --benchmark-dir /path/to/benchmark-output