agbenchmark
A repo built for the purpose of benchmarking the performance of agents far and wide, regardless of how they are set up and how they work
Simple boilerplate code that spins up a webserver to plug their agent into. We call multiple tasks by invoking different pytest commands on folders and once the agent stops or reaches 50 loops (which they will have to define). We handle the deletion of files after a run loop ends. Then we call call the POST request for the next task. Then we will spit out a combined benchmark once all tests run
- Agent adds tests by adding to our repo
- Agent abstracted from benchmark
- Scalable (parallel servers running tests)
- Better standardization
Diagrams (out of date, cloud oriented): https://whimsical.com/agbenchmark-5n4hXBq1ZGzBwRsK4TVY7x
Contributing
-
Make sure you have
poetryinstalled -pip install poetry. -
Then
poetry installfor dependencies -
To add requirements
poetry add requirement. -
To run in venv
poetry run python script.py
Feel free to merge with main at will (but also to ask for review) - if you can't send msg in R&D chat for access.
If you push at any point and break things - it'll happen to everyone - fix it asap. Step 1 is to revert main to last working commit
Let people know what beautiful code you write does, document everything well
Share your progress :)
Api
FastAPI with REST, import requests
POST hostname:8080/challenges
{
"test_name": ""
"challenge": "memory" - optional
}
Auth:
get preSignedUrl from API
POST preSignedUrl
{
"artifacts": [{}]
}
Workspace
Kubernetes with AWS3 or GCP
Challenges
Dataset
Manually created, existing challenges within Auto-Gpt, https://osu-nlp-group.github.io/Mind2Web/
Simple challenge creation through a DSL (domain specific language)
Challenge TicTacToeCoding
Description "The agent should implement a basic tic-tac-toe game in Python."
Artifacts {
Code "tictactoe.py"
}
Tasks {
Code "Write a function to initialize the game board."
Code "Write a function to handle a player's turn."
Code "Write a function to check for a winning move."
Test "Write tests for the blog post model, serializer, and view."
Command "Run Django's test suite to ensure everything is working as expected."
}
SuccessCriteria {
Correctness "The game should correctly alternate between two players."
Correctness "The game should correctly identify a winning move."
Efficiency "The game should not use unnecessary computational resources."
Design "The solution should follow good practices for Django and Django Rest Framework."
}
EndChallenge
Validators
Designed to handle specific types of output (e.g., text, code, structured data)
Logging
Log different requests coming in - write file, change file, etc. Maybe a db in the future for metrics, logs, etc
Written Challenges
For code, writing we can create a reference text and use metrics like METEOR, BERTScore, BARTScore
Repo
|-- agbenchmark/ **main project directory**
| |-- **init**.py
| |-- server/
| | |-- **init**.py
| | |-- api.py **opens server on host and exposes urls**
| | |-- utils.py
| |-- benchmark/
| | |-- **init**.py
| | |-- benchmark.py **combining scores, metrics, final evaluation**
| | |-- run.py **entry point. sets everything up**
| | |-- challenges/ **challenges across different metrics**
| | | |-- **init**.py
| | | |-- Challenge.py **easy challenge creation through Challenge class. potentially how DSL is defined. may need to inherit challenge class like Adaptability(Challenge)**
| | | |-- utils.py
| | | |-- adaptability.py
| | | |-- basic_abilities.py
| | | |-- code.py
| | | |-- memory.py
| | | |-- retrieval.py
| | | |-- web_navigation.py
| | | |-- writing.py
| |-- workspace/ **workspace related func**
| | |-- **init**.py
| | |-- workspace_manager.py **creation, deletion, preSignedUrl generation**
| | |-- cloud_services/
| | | |-- **init**.py
| | | |-- aws.py **not finalized, but write, read, and del files**
|-- tests/ **test func of agbenchmark**
| |-- **init**.py
| |-- test_api.py
| |-- test_benchmark.py
| |-- test_workspace_manager.py
Later: GitHub Actions integration, OpenAPI?, good versioning and backward compatibility