# Auto-GPT Benchmark

A repo built for benchmarking the performance of agents far and wide, regardless of how they are set up and how they work.

##### Diagrams: https://whimsical.com/agbenchmark-5n4hXBq1ZGzBwRsK4TVY7x

### To run the basic existing mock (June 21)

1. Clone the repo `auto-gpt-benchmarks`
2. `pip install poetry`
3. `poetry shell`
4. `poetry install`
5. `agbenchmark start`

Keep the config the same and watch the logs :)

- To add requirements, use `poetry add <requirement>`.

Feel free to create PRs to merge with `main` at will (but also feel free to ask for review). If you can't, send a message in the R&D chat for access.

If you push at any point and break things (it'll happen to everyone), fix it ASAP. Step 1 is to revert `main` to the last working commit.

Let people know what your beautiful code does, and document everything well.

Share your progress :)

## How this works

1. `pip install auto-gpt-benchmarks`
2. Add the boilerplate code that starts the webserver to your agent (run loop and stop condition)
3. `agbenchmark start --category challenge_category` (remove the `--category` flag to run all tests). Specify the hostname, port, and workspace directory in the config; see the sketch after this list
4. We call the server to run the agent for each test
5. Show the pass rate of tests, logs, and any other metrics

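A minimal sketch of how the benchmark might read that config, assuming `config.json` holds the hostname, port, and workspace folder (the exact key names are assumptions):

```
# Sketch only: key names ("hostname", "port", "workspace") are assumptions.
import json

with open("config.json") as f:
    config = json.load(f)

host = config["hostname"]            # where the agent's webserver listens
port = config["port"]
workspace_dir = config["workspace"]  # directory the agent reads/writes files in

print(f"Benchmarking agent at http://{host}:{port}, workspace: {workspace_dir}")
```
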
#### Bonuses

- You can add tests by git cloning auto-gpt-benchmarks into your repo
- The agent is abstracted from the benchmark; you don't need any extra setup other than starting the server
- Simple, easy to use
- You don't have to deal with cloud or parallelization yet

### Pytest

To create a test:

```
import os

import pytest


@pytest.mark.parametrize(
    "server_response",
    ["VARIABLE"],  # VARIABLE = the query/goal you provide to the model
    indirect=True,
)
@pytest.mark.VARIABLE  # VARIABLE = category of the test, e.g. basic_abilities
def test_file_in_workspace(workspace):  # the actual test that asserts
    assert os.path.exists(os.path.join(workspace, "file_to_check.txt"))
```

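`workspace` and `server_response` here are shared pytest fixtures (see `conftest.py` in the repo layout below). Purely as a hypothetical sketch, not the repo's actual implementation, they could look something like this:

```
# Hypothetical sketch of shared fixtures in conftest.py; fixture names come
# from the test above, but their bodies (and the endpoint URL) are assumptions.
import pytest
import requests


@pytest.fixture
def workspace(tmp_path):
    """Give each test its own scratch workspace directory."""
    ws = tmp_path / "workspace"
    ws.mkdir()
    return str(ws)


@pytest.fixture
def server_response(request, workspace):
    """Send the parametrized query/goal to the agent's server and return its reply."""
    query = request.param  # the "VARIABLE" string from @pytest.mark.parametrize
    response = requests.post(
        "http://localhost:8080/run",  # hostname, port, and route are assumptions
        json={"task": query, "workspace": workspace},
    )
    response.raise_for_status()
    return response.json()
```
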
## API

FastAPI with REST; import `requests` to call it from auto-gpt-benchmarks. Boilerplate code is given to the agent project to start the server.

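A minimal sketch of what that boilerplate might look like on the agent side (the route name and payload fields are assumptions, not the repo's actual interface):

```
# Hypothetical boilerplate for the agent project; route and fields are assumptions.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


class Task(BaseModel):
    task: str       # the query/goal sent by the benchmark
    workspace: str  # directory the agent should read and write files in


@app.post("/run")
def run_agent(payload: Task):
    # The agent plugs its own run loop and stop condition in here.
    return {"status": "finished"}
```

The benchmark then only needs to POST each challenge's task to this endpoint (e.g. with `requests`) and inspect the workspace afterwards.
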
## Workspace

Defined by the user in the config.

#### Dataset

Manually created challenges, existing challenges within Auto-GPT, and https://osu-nlp-group.github.io/Mind2Web/.

## Repo

```
|-- auto-gpt-benchmarks/ **main project directory**
|   |-- metrics.py **combining scores, metrics, final evaluation**
|   |-- start_benchmark.py **entry point from cli**
|   |-- conftest.py **shared fixtures across all tests**
|   |-- Challenge.py **easy challenge creation class?**
|   |-- config.json **hostname, port, workspace folder**
|   |-- challenges/ **challenges across different domains**
|   |   |-- adaptability/
|   |   |-- basic_abilities/
|   |   |-- code/
|   |   |-- memory/
|   |   |-- retrieval/
|   |   |-- web_navigation/
|   |   |-- writing/
|   |-- tests/ **challenges across different metrics**
|   |   |-- basic_abilities/
|   |   |-- interface/
|   |-- workspace/ **workspace related func**
|   |   |-- __init__.py
|   |   |-- workspace_manager.py **creation, deletion**
```

### Easy Challenge Creation

TBD, but potentially a shared `Challenge` class that individual challenges instantiate, since challenges need different utils/metrics for eval.

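Purely as a hypothetical sketch (the `Challenge.py` entry in the repo layout above still carries a question mark), such a shared class could look like:

```
# Hypothetical sketch; the real Challenge interface is still undecided.
import os
from dataclasses import dataclass, field


@dataclass
class Challenge:
    name: str
    category: str
    task: str  # the query/goal sent to the agent
    expected_files: list = field(default_factory=list)

    def evaluate(self, workspace: str) -> bool:
        """Default evaluation: check that the expected files exist in the workspace."""
        return all(
            os.path.exists(os.path.join(workspace, f)) for f in self.expected_files
        )
```

Challenges that need different metrics would override `evaluate` (or plug in their own utils).
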
#### Written Challenges

For code and writing challenges, we can create a reference text and use metrics like METEOR, BERTScore, and BARTScore.

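For example, with the `bert-score` package (shown only to illustrate the idea; not necessarily the implementation the benchmark will use):

```
# Illustration only; assumes `pip install bert-score`.
from bert_score import score

candidates = ["The agent wrote a short summary of the article."]
references = ["The agent produced a brief summary of the article."]

# P, R, F1 each contain one value per candidate/reference pair.
P, R, F1 = score(candidates, references, lang="en")
print(f"BERTScore F1: {F1.mean().item():.3f}")
```
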
#### Validators

Designed to handle specific types of output (e.g., text, code, structured data).

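No validator interface exists yet; one possible shape, sketched purely as an assumption:

```
# Hypothetical sketch; the validator interface is not defined yet.
import json


def validate_text(output: str, must_contain: str) -> bool:
    """Validate free-form text output by checking for a required phrase."""
    return must_contain.lower() in output.lower()


def validate_structured(output: str, required_keys: list) -> bool:
    """Validate structured (JSON) output: it must parse and contain the expected keys."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return all(key in data for key in required_keys)
```
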
#### Logging

Log the different requests coming in (write file, change file, etc.). Maybe a DB in the future for metrics, logs, etc.

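A small sketch of what that could look like with the standard library (the log file name and fields are assumptions):

```
# Sketch only; log file name and fields are assumptions.
import logging

logger = logging.getLogger("agbenchmark")
logging.basicConfig(filename="benchmark.log", level=logging.INFO)


def log_request(action: str, path: str) -> None:
    """Record an incoming agent action such as 'write file' or 'change file'."""
    logger.info("agent action=%s path=%s", action, path)
```
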
Later: GitHub Actions integration, OpenAPI?, good versioning and backward compatibility.