move config details further into doc (#2092)

marcelle
2025-04-08 16:09:40 -04:00
committed by GitHub
parent 8b5ff8a4a9
commit 4f9f21ac90


---
sidebar_position: 7
---
# Benchmarking with Goose
The Goose benchmarking system allows you to evaluate Goose's performance on complex tasks with one or more system configurations.<br></br>
This guide covers how to use the `goose bench` command to run benchmarks and analyze results.
### Quick Start
1. The benchmarking system includes several evaluation suites.<br></br>
Run the following to see a listing of every valid selector:
```bash
goose bench selectors
```
2. Create a basic configuration file:
```bash
goose bench init-config -n bench-config.json
cat bench-config.json
{
  "models": [
    {
      "provider": "databricks",
      "name": "goose",
      "parallel_safe": true
    }
  ],
  "evals": [
    {
      "selector": "core",
      "parallel_safe": true
    }
  ],
  "repeat": 1
}
...etc.
```
3. Run the benchmark:
```bash
goose bench run -c bench-config.json
```
## Configuration File
The benchmark configuration is specified in a JSON file with the following structure:
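Putting the options described below together, a complete file looks roughly like this sketch (values and paths are illustrative; `goose bench init-config` generates a canonical starting point):
```json
{
  "models": [
    {
      "provider": "databricks",
      "name": "goose",
      "parallel_safe": true
    }
  ],
  "evals": [
    {
      "selector": "core",
      "parallel_safe": true
    }
  ],
  "include_dirs": ["/path/to/test-data"],
  "repeat": 1,
  "env_file": "/path/to/env-file"
}
```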
### Configuration Options
#### Models Section
Each model entry in the `models` array specifies:
- `provider`: The model provider (e.g., "databricks")
- `name`: Model identifier
- `parallel_safe`: Whether the model can be run in parallel
- `tool_shim`: Optional configuration for tool shimming
  - `use_tool_shim`: Enable/disable tool shimming
  - `tool_shim_model`: Optional model to use for tool shimming
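For example, a model entry that enables shimming might look like the following sketch. The nesting assumes the fields above, and the shim model name is purely illustrative:
```json
{
  "provider": "databricks",
  "name": "goose",
  "parallel_safe": true,
  "tool_shim": {
    "use_tool_shim": true,
    "tool_shim_model": "llama3.2"
  }
}
```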
#### Evals Section
Each evaluation entry in the `evals` array specifies:
- `selector`: The evaluation suite to run (e.g., "core")
- `post_process_cmd`: Optional command to run after the evaluation (see Customizing Evaluations below)
- `parallel_safe`: Whether the evaluation can run in parallel
- `env_file`: Optional path to an environment file
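If `env_file` is set, a conventional dotenv-style file (one `KEY=value` per line) is assumed in this sketch; the variable names are illustrative, not prescribed by Goose:
```bash
# /path/to/env-file (illustrative variables)
DATABRICKS_HOST=https://example.cloud.databricks.com
DATABRICKS_TOKEN=your-token
```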
##### Mechanics of include_dirs option
The `include_dirs` config parameter makes the items at all paths listed within the option available to all evaluations.<br></br>
It accomplishes this by (sketched below):
* copying each included asset into the top-level directory created for each model/provider pair
* then, at evaluation run-time, copying whichever asset an evaluation explicitly requires into that eval-specific directory
  * only if the evaluation code specifically pulls it in
  * and only if the evaluation is actually covered by one of the configured selectors and therefore runs
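A sketch of the effect, assuming `"include_dirs": ["/path/to/test-data"]` and the output hierarchy described later (the eval-level directory names are hypothetical):
```bash
benchmark-${datetime}/
    ${model}-${provider}/
        test-data/        # copied in once for the model/provider pair
        some-eval/        # hypothetical eval directory
            test-data/    # copied here only if this eval pulls it in and actually runs
```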
## Running Benchmarks
### Quick Start
1. The benchmarking system includes several evaluation suites.<br></br>
Run the following to see a listing of every valid selector:
```bash
goose bench selectors
```
2. Create a basic configuration file:
```bash
goose bench init-config -n bench-config.json
cat bench-config.json
{
"models": [
{
"provider": "databricks",
"name": "goose",
"parallel_safe": true
}
],
"evals": [
{
"selector": "core",
"parallel_safe": true
}
],
"repeat": 1
}
...etc.
```
2. Run the benchmark:
```bash
goose bench run -c bench-config.json
```
### Customizing Evaluations
You can customize runs in several ways:
1. Using Post-Processing Commands after evaluation, via an eval entry's `post_process_cmd` (the script path is illustrative):
```json
{
  "evals": [
    {
      "selector": "core",
      "post_process_cmd": "/path/to/post-process-script",
      "parallel_safe": true
    }
  ]
}
```
2. Including Additional Data (path illustrative):
```json
{
  "include_dirs": [
    "/path/to/test-data"
  ]
}
```
3. Setting Environment Variables:
```json
{
  "env_file": "/path/to/env-file"
}
```
### Output Files
The benchmark generates two main output files within a file-hierarchy similar to the following.<br></br>
Results from running each model/provider pair are stored within their own directory:
```bash
benchmark-${datetime}/
    ${model}-${provider}[-tool-shim[-${shim-model}]]/
```
1. `eval-results.json`: Contains detailed results from each evaluation, including:
   - Individual test case results
   - Model responses
   - Scoring metrics
   - Error logs
2. `run-results-summary.json`: A collection of all eval results across all suites.
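To skim a run, pretty-print the summary with `jq` (assuming the summary lands in the per-model/provider directory shown above):
```bash
jq . benchmark-${datetime}/${model}-${provider}/run-results-summary.json
```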
### Troubleshooting
For more detail while a benchmark runs, enable debug logging:
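```bash
RUST_LOG=debug goose bench bench-config.json
```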
### Tool Shimming
Tool shimming allows you to use non-tool-capable models with Goose, provided Ollama is installed on the system.<br></br>
See the guide on [tool shimming](experimental-features) for important details.