From 4f9f21ac90177329d128bc30e5bc94e87b1cc731 Mon Sep 17 00:00:00 2001
From: marcelle <1852848+laanak08@users.noreply.github.com>
Date: Tue, 8 Apr 2025 16:09:40 -0400
Subject: [PATCH] move config details further into doc (#2092)

---
 documentation/docs/guides/benchmarking.md | 116 ++++++++++++----------
 1 file changed, 63 insertions(+), 53 deletions(-)

diff --git a/documentation/docs/guides/benchmarking.md b/documentation/docs/guides/benchmarking.md
index 0211dbae..3fcc55db 100644
--- a/documentation/docs/guides/benchmarking.md
+++ b/documentation/docs/guides/benchmarking.md
@@ -1,11 +1,52 @@
 ---
 sidebar_position: 7
 ---
+
 # Benchmarking with Goose
-The Goose benchmarking system allows you to evaluate goose performance on complex tasks with one or more system configurations.

+The Goose benchmarking system allows you to evaluate goose performance on complex tasks with one or more system
+configurations.

 This guide covers how to use the `goose bench` command to run benchmarks and analyze results.
+### Quick Start
+
+1. The benchmarking system includes several evaluation suites.

+ Run the following to see a listing of every valid selector:
+
+```bash
+goose bench selectors
+```
+
+2. Create a basic configuration file:
+
+```bash
+goose bench init-config -n bench-config.json
+cat bench-config.json
+{
+  "models": [
+    {
+      "provider": "databricks",
+      "name": "goose",
+      "parallel_safe": true
+    }
+  ],
+  "evals": [
+    {
+      "selector": "core",
+      "parallel_safe": true
+    }
+  ],
+  "repeat": 1
+}
+...etc.
+```
+
+3. Run the benchmark:
+
+```bash
+goose bench run -c bench-config.json
+```
+
 ## Configuration File
 
 The benchmark configuration is specified in a JSON file with the following structure:
@@ -42,16 +83,18 @@ The benchmark configuration is specified in a JSON file with the following struc
 ### Configuration Options
 
 #### Models Section
+
 Each model entry in the `models` array specifies:
 
 - `provider`: The model provider (e.g., "databricks")
 - `name`: Model identifier
 - `parallel_safe`: Whether the model can be run in parallel
 - `tool_shim`: Optional configuration for tool shimming
-  - `use_tool_shim`: Enable/disable tool shimming
-  - `tool_shim_model`: Optional model to use for tool shimming
+    - `use_tool_shim`: Enable/disable tool shimming
+    - `tool_shim_model`: Optional model to use for tool shimming
 
 #### Evals Section
+
 Each evaluation entry in the `evals` array specifies:
 
 - `selector`: The evaluation suite to run (e.g., "core")
@@ -68,60 +111,23 @@ Each evaluation entry in the `evals` array specifies:
 - `post_process_cmd`: Optional path to a script to run after each evaluation
 - `env_file`: Optional path to an environment file
 
 ##### Mechanics of include_dirs option
-The `include_dirs` config parameter makes the items at all paths listed within the option, available to all evaluations.

+
+The `include_dirs` config parameter makes the items at all paths listed within the option available to all
+evaluations.
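+
+The option sits at the top level of the configuration file; a minimal sketch (the path is a placeholder):
+
+```json
+{
+  "include_dirs": [
+    "/path/to/data"
+  ]
+}
+```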

 It accomplishes this by:
+
+* copying each included asset into the top-level directory created for each model/provider pair
+* at evaluation run-time
-  * whichever assets is explicitly required by an evaluation gets copied into the eval-specific directory
-  * only if the evaluation-code specifically pulls it in
-  * and only if the evaluation actually is covered by one of the configured selectors and therefore runs
-
-## Running Benchmarks
-
-### Quick Start
-
-1. The benchmarking system includes several evaluation suites.

-Run the following to see a listing of every valid selector:
-
-```bash
-goose bench selectors
-```
-
-2. Create a basic configuration file:
-
-```bash
-goose bench init-config -n bench-config.json
-cat bench-config.json
-{
-  "models": [
-    {
-      "provider": "databricks",
-      "name": "goose",
-      "parallel_safe": true
-    }
-  ],
-  "evals": [
-    {
-      "selector": "core",
-      "parallel_safe": true
-    }
-  ],
-  "repeat": 1
-}
-...etc.
-```
-
-2. Run the benchmark:
-
-```bash
-goose bench run -c bench-config.json
-```
+  * whichever asset is explicitly required by an evaluation gets copied into the eval-specific directory
+    * only if the evaluation code specifically pulls it in
+    * and only if the evaluation is actually covered by one of the configured selectors and therefore runs
 
 ### Customizing Evaluations
 
 You can customize runs in several ways:
 
 1. Using Post-Processing Commands after evaluation:
+
 ```json
 {
   "evals": [
     {
       "selector": "core",
       "post_process_cmd": "/path/to/script.sh"
     }
   ]
 }
 ```
 
 2. Including Additional Data:
+
 ```json
 {
   "include_dirs": [
     "/path/to/data"
   ]
 }
 ```
 
 3. Setting Environment Variables:
+
 ```json
 {
   "env_file": "/path/to/env-file"
 }
 ```
 
 The benchmark generates two main output files within a file-hierarchy similar to the following.

 Results from running each model/provider pair are stored within its own directory:
+
 ```bash
 benchmark-${datetime}/
     ${model}-${provider}[-tool-shim[-${shim-model}]]/
         ${eval-suite}/
             ${eval-name}/
                 [eval-results.json]
                 [any-files-and-dirs-created-by-eval]
         run-results-summary.json
 ```
 
 1. `eval-results.json`: Contains detailed results from each evaluation, including:
-   - Individual test case results
-   - Model responses
-   - Scoring metrics
-   - Error logs
+    - Individual test case results
+    - Model responses
+    - Scoring metrics
+    - Error logs
 
 2. `run-results-summary.json`: A collection of all eval results across all suites.
@@ -185,5 +194,6 @@ RUST_LOG=debug goose bench bench-config.json
 ### Tool Shimming
 
-Tool shimming allows you to use a non-tool-capable models with Goose, provided Ollama is installed on the system.

+Tool shimming allows you to use non-tool-capable models with Goose, provided Ollama is installed on the
+system.
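+
+As a sketch, a model entry that enables shimming might look like the following (the `tool_shim` fields are the Models Section options described above; the provider and model names are illustrative placeholders):
+
+```json
+{
+  "provider": "ollama",
+  "name": "llama3.2",
+  "parallel_safe": false,
+  "tool_shim": {
+    "use_tool_shim": true,
+    "tool_shim_model": "mistral-nemo"
+  }
+}
+```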

See this guide for important details on [tool shimming](experimental-features).