# Goose Benchmark Scripts

This directory contains scripts for running and analyzing Goose benchmarks.

## run-benchmarks.sh

This script runs Goose benchmarks across multiple provider:model pairs and analyzes the results.

### Prerequisites

- Goose CLI must be built or installed
- `jq` command-line tool for JSON processing (optional, but recommended for result analysis)

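If `jq` is not already available, it can usually be installed from your platform's package manager, for example:

```bash
# Install jq with your platform's package manager (pick the matching line)
brew install jq          # macOS (Homebrew)
sudo apt-get install jq  # Debian/Ubuntu
sudo dnf install jq      # Fedora
```
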
### Usage

```bash
./scripts/run-benchmarks.sh [options]
```

#### Options

- `-p, --provider-models`: Comma-separated list of provider:model pairs (e.g., 'openai:gpt-4o,anthropic:claude-3-5-sonnet')
- `-s, --suites`: Comma-separated list of benchmark suites to run (e.g., 'core,small_models')
- `-o, --output-dir`: Directory to store benchmark results (default: './benchmark-results')
- `-d, --debug`: Use debug build instead of release build
- `-h, --help`: Show help message

#### Examples

```bash
# Run with release build (default)
./scripts/run-benchmarks.sh --provider-models 'openai:gpt-4o,anthropic:claude-3-5-sonnet' --suites 'core,small_models'

# Run with debug build
./scripts/run-benchmarks.sh --provider-models 'openai:gpt-4o' --suites 'core' --debug
```

### How It Works

The script:

1. Parses the provider:model pairs and benchmark suites
2. Determines whether to use the debug or release binary
3. For each provider:model pair (see the sketch after this list):
   - Sets the `GOOSE_PROVIDER` and `GOOSE_MODEL` environment variables
   - Runs the benchmark with the specified suites
   - Analyzes the results for failures
4. Generates a summary of all benchmark runs

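The per-pair loop can be pictured roughly as follows. This is an illustrative sketch, not the script itself: the binary path, the inner `goose bench` invocation and its flags, and redirecting raw output into the JSON file are assumptions layered on the behavior described above.

```bash
#!/usr/bin/env bash
# Illustrative sketch of run-benchmarks.sh's per-pair loop (not the actual script).
set -euo pipefail

GOOSE_BIN="./target/release/goose"   # assumption: ./target/debug/goose when --debug is used
OUTPUT_DIR="./benchmark-results"
SUITES="core,small_models"
failed=0

mkdir -p "$OUTPUT_DIR"

for pair in "openai:gpt-4o" "anthropic:claude-3-5-sonnet"; do
  provider="${pair%%:*}"   # text before the first colon
  model="${pair#*:}"       # text after the first colon

  # The wrapper selects the provider and model through environment variables.
  export GOOSE_PROVIDER="$provider"
  export GOOSE_MODEL="$model"

  result="$OUTPUT_DIR/$provider-$model.json"

  # Hypothetical benchmark invocation; the real script's command may differ.
  "$GOOSE_BIN" bench --suites "$SUITES" > "$result" || failed=1

  # Analyze the run and keep the per-pair report next to the raw JSON.
  ./scripts/parse-benchmark-results.sh "$result" \
    > "$OUTPUT_DIR/$provider-$model-analysis.txt" || failed=1
done

exit "$failed"
```
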
### Output

The script creates the following files in the output directory:

- `summary.md`: A summary of all benchmark results
- `{provider}-{model}.json`: Raw JSON output from each benchmark run
- `{provider}-{model}-analysis.txt`: Analysis of each benchmark run

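For example, after a run against `openai:gpt-4o` you might inspect the results like this (file names follow the pattern above; `jq` is only needed for pretty-printing the JSON):

```bash
# Read the cross-run summary
cat ./benchmark-results/summary.md

# Pretty-print the raw JSON for one provider:model pair
jq . ./benchmark-results/openai-gpt-4o.json

# Review the per-run analysis
cat ./benchmark-results/openai-gpt-4o-analysis.txt
```
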
### Exit Codes

- `0`: All benchmarks completed successfully
- `1`: One or more benchmarks failed

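Because failures are reported through the exit code, the script can gate a CI job or a local check directly; a minimal sketch:

```bash
# Fail the surrounding step if any benchmark run reported failures
if ./scripts/run-benchmarks.sh --provider-models 'openai:gpt-4o' --suites 'core'; then
  echo "All benchmarks passed"
else
  echo "One or more benchmarks failed" >&2
  exit 1
fi
```
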
## parse-benchmark-results.sh

This script analyzes a single benchmark JSON result file and identifies any failures.

### Usage

```bash
./scripts/parse-benchmark-results.sh path/to/benchmark-results.json
```

### Output

The script outputs an analysis of the benchmark results to stdout, including:

- Basic information about the benchmark run
- Results for each evaluation in each suite
- Summary of passed and failed metrics

### Exit Codes

- `0`: All metrics passed successfully
- `1`: One or more metrics failed

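Since failures are signaled via the exit code, the script composes naturally with a shell loop over a directory of result files; a minimal sketch, assuming the results were written by run-benchmarks.sh to its default output directory:

```bash
# Re-analyze every saved result file and report which ones contain failures
shopt -s nullglob   # skip the loop entirely if no .json files exist
failed=0
for result in ./benchmark-results/*.json; do
  echo "== $result =="
  if ! ./scripts/parse-benchmark-results.sh "$result"; then
    echo "FAILURES in $result" >&2
    failed=1
  fi
done
exit "$failed"
```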