diff --git a/documentation/docs/guides/benchmarking.md b/documentation/docs/guides/benchmarking.md
index 0211dbae..3fcc55db 100644
--- a/documentation/docs/guides/benchmarking.md
+++ b/documentation/docs/guides/benchmarking.md
@@ -1,11 +1,52 @@
---
sidebar_position: 7
---
+
# Benchmarking with Goose
-The Goose benchmarking system allows you to evaluate goose performance on complex tasks with one or more system configurations.
+The Goose benchmarking system allows you to evaluate Goose's performance on complex tasks with one or more system
+configurations.
This guide covers how to use the `goose bench` command to run benchmarks and analyze results.
+## Quick Start
+
+1. The benchmarking system includes several evaluation suites.
+ Run the following to see a listing of every valid selector:
+
+```bash
+goose bench selectors
+```
+
+2. Create a basic configuration file:
+
+```bash
+goose bench init-config -n bench-config.json
+cat bench-config.json
+{
+ "models": [
+ {
+ "provider": "databricks",
+ "name": "goose",
+ "parallel_safe": true
+ }
+ ],
+ "evals": [
+ {
+ "selector": "core",
+ "parallel_safe": true
+ }
+ ],
+ "repeat": 1
+}
+...etc.
+```
+
+3. Run the benchmark:
+
+```bash
+goose bench run -c bench-config.json
+```
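+
+Results are written to a timestamped `benchmark-${datetime}/` directory; see the output layout described later in
+this guide.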
+
## Configuration File
The benchmark configuration is specified in a JSON file with the following structure:
@@ -42,16 +83,18 @@ The benchmark configuration is specified in a JSON file with the following struc
### Configuration Options
#### Models Section
+
Each model entry in the `models` array specifies:
- `provider`: The model provider (e.g., "databricks")
- `name`: Model identifier
- `parallel_safe`: Whether the model can be run in parallel
- `tool_shim`: Optional configuration for tool shimming
- - `use_tool_shim`: Enable/disable tool shimming
- - `tool_shim_model`: Optional model to use for tool shimming
+ - `use_tool_shim`: Enable/disable tool shimming
+  - `tool_shim_model`: Optional model to use for tool shimming (see the example below)
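+
+For example, a model entry that enables tool shimming might look like the following. This is a sketch inferred
+from the field list above; the nesting of `tool_shim` and the shim model name are assumptions:
+
+```json
+{
+  "models": [
+    {
+      "provider": "databricks",
+      "name": "goose",
+      "parallel_safe": true,
+      "tool_shim": {
+        "use_tool_shim": true,
+        "tool_shim_model": "some-local-ollama-model"
+      }
+    }
+  ]
+}
+```
+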
#### Evals Section
+
Each evaluation entry in the `evals` array specifies:
- `selector`: The evaluation suite to run (e.g., "core")
@@ -68,60 +111,23 @@ Each evaluation entry in the `evals` array specifies:
- `env_file`: Optional path to an environment file
##### Mechanics of include_dirs option
-The `include_dirs` config parameter makes the items at all paths listed within the option, available to all evaluations.
+
+The `include_dirs` config parameter makes the items at every listed path available to all evaluations; see the
+sketch after the list below.
It accomplishes this by:
+
* copying each included asset into the top-level directory created for each model/provider pair
* at evaluation run-time
- * whichever assets is explicitly required by an evaluation gets copied into the eval-specific directory
- * only if the evaluation-code specifically pulls it in
- * and only if the evaluation actually is covered by one of the configured selectors and therefore runs
-
-## Running Benchmarks
-
-### Quick Start
-
-1. The benchmarking system includes several evaluation suites.
-Run the following to see a listing of every valid selector:
-
-```bash
-goose bench selectors
-```
-
-2. Create a basic configuration file:
-
-```bash
-goose bench init-config -n bench-config.json
-cat bench-config.json
-{
- "models": [
- {
- "provider": "databricks",
- "name": "goose",
- "parallel_safe": true
- }
- ],
- "evals": [
- {
- "selector": "core",
- "parallel_safe": true
- }
- ],
- "repeat": 1
-}
-...etc.
-```
-
-2. Run the benchmark:
-
-```bash
-goose bench run -c bench-config.json
-```
+  * any asset explicitly required by an evaluation gets copied into the eval-specific directory
+    * only if the evaluation code specifically pulls it in
+    * and only if the evaluation is actually covered by one of the configured selectors and therefore runs
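+
+For example, given a configuration like the following (the path is illustrative):
+
+```json
+{
+  "include_dirs": [
+    "/path/to/test-data"
+  ]
+}
+```
+
+Here the `test-data` directory would be copied once into each model/provider directory, and into an individual
+evaluation's directory only if that evaluation explicitly pulls it in and is selected to run.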
### Customizing Evaluations
You can customize runs in several ways:
1. Using Post-Processing Commands after evaluation:
+
```json
{
"evals": [
@@ -135,6 +141,7 @@ You can customize runs in several ways:
```
2. Including Additional Data:
+
```json
{
"include_dirs": [
@@ -144,6 +151,7 @@ You can customize runs in several ways:
```
3. Setting Environment Variables:
+
```json
{
"env_file": "/path/to/env-file"
@@ -154,6 +162,7 @@ You can customize runs in several ways:
The benchmark generates two main output files within a file hierarchy similar to the following.
Results from running each model/provider pair are stored within their own directory:
+
```bash
benchmark-${datetime}/
${model}-${provider}[-tool-shim[-${shim-model}]]/
@@ -166,10 +175,10 @@ benchmark-${datetime}/
```
1. `eval-results.json`: Contains detailed results from each evaluation, including:
- - Individual test case results
- - Model responses
- - Scoring metrics
- - Error logs
+ - Individual test case results
+ - Model responses
+ - Scoring metrics
+ - Error logs
2. `run-results-summary.json`: A collection of all eval results across all suites.
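+
+Because both files are plain JSON, they can be inspected with standard tooling. A minimal sketch, assuming `jq` is
+installed (the model/provider directory name is illustrative):
+
+```bash
+# Pretty-print the cross-suite summary for one model/provider pair
+jq '.' benchmark-*/goose-databricks/run-results-summary.json
+```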
@@ -185,5 +194,6 @@ RUST_LOG=debug goose bench bench-config.json
### Tool Shimming
-Tool shimming allows you to use a non-tool-capable models with Goose, provided Ollama is installed on the system.
+Tool shimming allows you to use non-tool-capable models with Goose, provided Ollama is installed on the
+system.
See this guide for important details on [tool shimming](experimental-features).
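+
+Since shimming relies on a locally running Ollama instance, a quick sanity check before a benchmark run (assuming
+the standard `ollama` CLI is on your PATH):
+
+```bash
+# Lists locally available models; fails if the Ollama daemon is not running
+ollama list
+```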