# Auto-GPT Benchmark

A repo built for benchmarking the performance of agents far and wide, regardless of how they are set up and how they work.

##### Diagrams: https://whimsical.com/agbenchmark-5n4hXBq1ZGzBwRsK4TVY7x

### To run the basic existing mock (June 21)

1. Clone the repo `auto-gpt-benchmarks`
2. `pip install poetry`
3. `poetry shell`
4. `poetry install`
5. `agbenchmark start`

Keep the config the same and watch the logs :)

- To add a requirement, use `poetry add <requirement>`.

Feel free to create PRs to merge into `main` at will (but also feel free to ask for review). If you can't, send a message in the R&D chat for access.

If you push at any point and break things (it happens to everyone), fix it ASAP. Step 1 is to revert `main` to the last working commit.

Let people know what your beautiful code does and document everything well.

Share your progress :)

## How this works

1. `pip install auto-gpt-benchmarks`
2. Add the boilerplate code for starting a webserver to your agent (a run loop and a stop condition)
3. `agbenchmark start --category challenge_category` (remove the category flag to run all tests); specify the hostname, port, and workspace directory in the config
4. We call the server to run the agent for each test (see the sketch below)
5. The pass rate of tests, logs, and any other metrics are shown

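As a rough illustration of step 4 (not the actual implementation), the benchmark side could call the agent's server like this; the hostname and port come from the config, while the `/run` endpoint path, payload, and response shape are assumptions:

```python
# Illustrative only: how the benchmark might ask the agent's webserver to run
# one test task. The endpoint, payload, and response fields are assumptions.
import requests


def run_agent_on_task(hostname: str, port: int, task: str) -> dict:
    response = requests.post(
        f"http://{hostname}:{port}/run",  # hypothetical endpoint
        json={"task": task},
        timeout=300,
    )
    response.raise_for_status()
    return response.json()
```
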
#### Bonuses

- You can add tests by git cloning `auto-gpt-benchmarks` into your repo
- The agent is abstracted from the benchmark, so no extra setup is needed other than starting the server
- Simple and easy to use
- No need to deal with cloud or parallelization yet

### Pytest

To create a test:

```python
import os

import pytest

from agbenchmark.challenges.define_task_types import ChallengeData

from ..CategoryChallenge import CategoryChallenge

data = ChallengeData.deserialize(
    os.path.join(os.path.dirname(__file__), "r_file_data.json")
)


class TestSomething(CategoryChallenge):
    """Testing if LLM can read a file"""

    @pytest.mark.parametrize(
        "server_response",
        [(data.task, data.mock_func)],
        indirect=True,
    )
    def test_retrieval(self, workspace):
        # scoring logic goes here
        pass
```

All challenges will inherit from a parent class which has the mark:

```python
@pytest.mark.basic
class BasicChallenge(Challenge):
    pass
```

If you want to add a custom mark to a Challenge, you must specify it before the test definition:

```python
@pytest.mark.other_mark
def test_retrieval(self, workspace):
    ...
```

To add a dependency to a challenge, use the following:

```python
from pytest_dependency import depends


# for defining what a test depends on
def test1(self, request, workspace):
    depends(request, data.dependencies)


# for defining a test as a dependency
@pytest.mark.dependency()
def test2(self, workspace):
    ...
```

Ordering of challenges needs to be used in combination with the above to make sure a dependent test executes after the tests it depends on:

```python
@pytest.mark.run(order=1)
```

To create a file needed to test a challenge, add this to the challenge file; it will create the file before running the server:

```python
@pytest.fixture(scope="module", autouse=True)
def setup_module(workspace):
    if data.ground.should_contain:
        Challenge.write_to_file(
            workspace, data.ground.files[0], "this is how we're doing"
        )
```

## API

FastAPI with REST; import `requests` to make calls from auto-gpt-benchmarks. Boilerplate code is given to the agent project to start the server.

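A minimal sketch of what that agent-side boilerplate could look like, assuming a FastAPI app with a hypothetical `/run` endpoint and an illustrative run loop/stop condition (none of these names come from the actual codebase):

```python
# Sketch of the agent-side boilerplate, not the actual agbenchmark API.
# The endpoint path, request fields, and run-loop details are assumptions.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


class TaskRequest(BaseModel):
    task: str


def agent_step(task: str, step: int) -> bool:
    """Placeholder for one iteration of the agent's run loop.

    A real agent would plan/act here and return True once it considers
    the task finished (the stop condition).
    """
    return step >= 3  # pretend the agent finishes after a few steps


@app.post("/run")  # hypothetical endpoint the benchmark calls per test
def run_task(request: TaskRequest) -> dict:
    for step in range(50):  # illustrative step limit
        if agent_step(request.task, step):
            break
    return {"status": "completed"}
```

Served with something like `uvicorn server:app --port <port>` (module name is whatever file holds the app), with the hostname and port matching the benchmark config.
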
## Workspace

Defined by the user in the config.

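For example, assuming `config.json` holds the hostname, port, and workspace folder (the key names below are illustrative assumptions), reading it might look like:

```python
# Sketch of reading the benchmark config; key names are assumptions.
import json
from pathlib import Path

config = json.loads(Path("config.json").read_text())
workspace_dir = Path(config["workspace"])  # where the agent reads/writes files
workspace_dir.mkdir(parents=True, exist_ok=True)
print(config["hostname"], config["port"], workspace_dir)
```
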
#### Dataset

Manually created challenges, existing challenges within Auto-GPT, and https://osu-nlp-group.github.io/Mind2Web/

## Repo

```
|-- auto-gpt-benchmarks/ **main project directory**
| |-- metrics.py **combining scores, metrics, final evaluation**
| |-- start_benchmark.py **entry point from cli**
| |-- conftest.py **shared fixtures across all tests**
| |-- Challenge.py **easy challenge creation class?**
| |-- config.json **hostname, port, workspace folder**
| |-- challenges/ **challenges across different domains**
| | |-- adaptability/
| | |-- basic_abilities/
| | |-- code/
| | |-- memory/
| | |-- retrieval/
| | |-- web_navigation/
| | |-- writing/
| |-- tests/ **challenges across different metrics**
| | |-- basic_abilities/
| | |-- interface/
| |-- workspace/ **workspace related func**
| | |-- __init__.py
| | |-- workspace_manager.py **creation, deletion**
```

### Easy Challenge Creation

TBD, but potentially a shared Challenge class that individual challenges instantiate, since challenges need different utils/metrics for evaluation.

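A hypothetical sketch of such a shared class; apart from `write_to_file`, which the fixture example above already uses, the method names here are assumptions rather than existing code:

```python
# Hypothetical Challenge base class with shared utilities.
import os


class Challenge:
    """Utilities shared by all challenge classes."""

    @staticmethod
    def write_to_file(workspace: str, filename: str, content: str) -> None:
        # Create a file inside the agent's workspace before the test runs.
        os.makedirs(workspace, exist_ok=True)
        with open(os.path.join(workspace, filename), "w") as f:
            f.write(content)

    @staticmethod
    def open_file(workspace: str, filename: str) -> str:
        # Read a file from the workspace for scoring (hypothetical helper).
        with open(os.path.join(workspace, filename)) as f:
            return f.read()
```
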
#### Written Challenges

For code and writing challenges, we can create a reference text and use metrics like METEOR, BERTScore, and BARTScore.

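As an illustration (not necessarily what this repo will use), scoring a candidate against a reference with the `bert-score` package could look like:

```python
# Sketch of scoring an agent's written output against a reference text with
# BERTScore; illustrative only, not the benchmark's actual scoring code.
from bert_score import score  # pip install bert-score

candidates = ["The agent wrote this summary of the file."]
references = ["A reference summary of the file, written by hand."]

# P, R, F1 are tensors with one value per candidate/reference pair.
P, R, F1 = score(candidates, references, lang="en")
print(f"BERTScore F1: {F1.mean().item():.3f}")
```
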
#### Validators

Designed to handle specific types of output (e.g., text, code, structured data).

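One hypothetical shape for this, purely as a design sketch (class and field names are assumptions):

```python
# Hypothetical validator interface; not existing agbenchmark code.
from abc import ABC, abstractmethod


class Validator(ABC):
    """Scores one kind of agent output against a challenge's ground truth."""

    @abstractmethod
    def validate(self, output: str) -> float:
        """Return a score between 0 and 1."""


class TextValidator(Validator):
    def __init__(self, should_contain: list[str]):
        self.should_contain = should_contain

    def validate(self, output: str) -> float:
        # Fraction of required phrases found in the agent's output.
        if not self.should_contain:
            return 1.0
        hits = sum(phrase in output for phrase in self.should_contain)
        return hits / len(self.should_contain)
```
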
#### Logging

Log the different requests coming in (write file, change file, etc.). Maybe a DB in the future for metrics, logs, etc.

Later: GitHub Actions integration, OpenAPI?, good versioning and backward compatibility.