mirror of
https://github.com/aljazceru/goose.git
synced 2025-12-18 14:44:21 +01:00
move config details further into doc (#2092)
---
sidebar_position: 7
---

# Benchmarking with Goose

The Goose benchmarking system allows you to evaluate goose performance on complex tasks with one or more system configurations.<br></br>
This guide covers how to use the `goose bench` command to run benchmarks and analyze results.
## Running Benchmarks

### Quick Start

1. The benchmarking system includes several evaluation suites.<br></br>
```bash
cat bench-config.json
goose bench run -c bench-config.json
```
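Before launching a run, it can help to confirm that the config file at least parses as JSON. One quick way, assuming a Python interpreter is available on the system (this check is not part of `goose bench` itself):

```shell
# validate that bench-config.json parses as JSON before running the benchmark
python3 -m json.tool bench-config.json > /dev/null && echo "config OK"
```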
## Configuration File

The benchmark configuration is specified in a JSON file with the following structure:

```json
{
  "models": [
    {
      "provider": "databricks",
      "name": "goose",
      "parallel_safe": true,
      "tool_shim": {
        "use_tool_shim": false,
        "tool_shim_model": null
      }
    }
  ],
  "evals": [
    {
      "selector": "core",
      "post_process_cmd": null,
      "parallel_safe": true
    }
  ],
  "include_dirs": [],
  "repeat": 2,
  "run_id": null,
  "eval_result_filename": "eval-results.json",
  "run_summary_filename": "run-results-summary.json",
  "env_file": null
}
```
### Configuration Options

#### Models Section

Each model entry in the `models` array specifies:

- `provider`: The model provider (e.g., "databricks")
- `name`: Model identifier
- `parallel_safe`: Whether the model can be run in parallel
- `tool_shim`: Optional configuration for tool shimming
  - `use_tool_shim`: Enable/disable tool shimming
  - `tool_shim_model`: Optional model to use for tool shimming
#### Evals Section

Each evaluation entry in the `evals` array specifies:

- `selector`: The evaluation suite to run (e.g., "core")
- `post_process_cmd`: Optional path to a post-processing script
- `parallel_safe`: Whether the evaluation can run in parallel
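For example, an `evals` entry that triggers a post-processing script after the suite finishes might look like the following sketch (the script path is a placeholder for illustration, not a file shipped with goose):

```json
{
  "evals": [
    {
      "selector": "core",
      "post_process_cmd": "/path/to/post-process.sh",
      "parallel_safe": true
    }
  ]
}
```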
#### General Options

- `include_dirs`: Additional directories to include in the evaluation
- `repeat`: Number of times to repeat each evaluation
- `run_id`: Optional identifier for the benchmark run
- `eval_result_filename`: Name of the evaluation results file
- `run_summary_filename`: Name of the summary results file
- `env_file`: Optional path to an environment file
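As an illustration, the file referenced by `env_file` is a plain list of `KEY=value` pairs made available to the run; the variable names below are examples only, not a required set:

```bash
# illustrative env file: any KEY=value pairs your provider setup needs
RUST_LOG=info
DATABRICKS_HOST=https://example.cloud.databricks.com
```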
##### Mechanics of the include_dirs option

The `include_dirs` config parameter makes the items at every listed path available to all evaluations.<br></br>
It accomplishes this by:

* copying each included asset into the top-level directory created for each model/provider pair
* at evaluation run time:
  * any asset explicitly required by an evaluation is copied into the eval-specific directory
    * only if the evaluation code specifically pulls it in
    * and only if the evaluation is actually covered by one of the configured selectors and therefore runs
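The two copy steps above can be pictured with a small sketch. This is an illustration of the staging behavior, not goose's actual implementation; the function names are invented:

```python
import shutil
from pathlib import Path


def stage_include_dirs(include_dirs, model_work_dir):
    """Copy each included asset into the top-level directory
    created for a model/provider pair (step one above)."""
    model_work_dir = Path(model_work_dir)
    model_work_dir.mkdir(parents=True, exist_ok=True)
    for src in map(Path, include_dirs):
        dest = model_work_dir / src.name
        if src.is_dir():
            shutil.copytree(src, dest, dirs_exist_ok=True)
        else:
            shutil.copy2(src, dest)


def stage_for_eval(asset_name, model_work_dir, eval_dir):
    """At evaluation run time, copy one asset into the eval-specific
    directory -- only invoked when the evaluation code asks for it."""
    src = Path(model_work_dir) / asset_name
    dest = Path(eval_dir) / asset_name
    dest.parent.mkdir(parents=True, exist_ok=True)
    if src.is_dir():
        shutil.copytree(src, dest, dirs_exist_ok=True)
    else:
        shutil.copy2(src, dest)
```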
### Customizing Evaluations

You can customize runs in several ways:

1. Using Post-Processing Commands after evaluation:

```json
{
  "evals": [
```
2. Including Additional Data:

```json
{
  "include_dirs": [
```
3. Setting Environment Variables:

```json
{
  "env_file": "/path/to/env-file"
}
```
The benchmark generates two main output files within a file hierarchy similar to the following.<br></br>
Results from running each model/provider pair are stored within their own directory:

```bash
benchmark-${datetime}/
  ${model}-${provider}[-tool-shim[-${shim-model}]]/
```
```bash
RUST_LOG=debug goose bench bench-config.json
```
### Tool Shimming

Tool shimming allows you to use non-tool-capable models with Goose, provided Ollama is installed on the system.<br></br>
See this guide for important details on [tool shimming](experimental-features).
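Tying this back to the Configuration File section, a model entry with shimming enabled might look like the sketch below; the provider and model names are placeholders, so substitute whatever models your local Ollama install actually serves:

```json
{
  "models": [
    {
      "provider": "ollama",
      "name": "some-non-tool-model",
      "parallel_safe": false,
      "tool_shim": {
        "use_tool_shim": true,
        "tool_shim_model": "llama3.2"
      }
    }
  ]
}
```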