Auto-GPT Benchmark
A repo built to benchmark the performance of agents far and wide, regardless of how they are set up and how they work
Diagrams: https://whimsical.com/agbenchmark-5n4hXBq1ZGzBwRsK4TVY7x
To run the basic existing mock (June 21)
- clone the repo auto-gpt-benchmarks
- `pip install poetry`
- `poetry shell`
- `poetry install`
- `agbenchmark start`
- Keep the config the same and watch the logs :)
- To add requirements: `poetry add <requirement>`
Feel free to create PRs and merge them into main at will (but also feel free to ask for review). If you can't, send a message in the R&D chat for access.
If you push at any point and break things (it'll happen to everyone), fix it ASAP. Step 1 is to revert main to the last working commit.
Let people know what your beautiful code does, and document everything well
Share your progress :)
How this works
- `pip install auto-gpt-benchmarks`
- Add the boilerplate code to your agent to start a webserver (run loop and stop condition)
- `agbenchmark start --category challenge_category` (remove the category flag to run all tests); specify hostname, port, and workspace directory in the config (see the example config below)
- We call the server to run the agent for each test
- Show pass rate of tests, logs, and any other metrics
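For reference, a minimal sketch of what `config.json` could look like, assuming the keys correspond to the hostname, port, and workspace settings described above (the exact key names and values are assumptions, not the final schema):

```json
{
  "hostname": "localhost",
  "port": 8080,
  "workspace": "agbenchmark/workspace"
}
```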
Bonuses
- You can add tests by git cloning auto-gpt-benchmarks into your repo
- The agent is abstracted from the benchmark; no extra setup is needed other than starting the server
- Simple, easy to use
- Don't have to deal with cloud or parallelization yet
Pytest
To create a test:

```python
import os

import pytest

from agbenchmark.challenges.define_task_types import ChallengeData
from ..CategoryChallenge import CategoryChallenge

data = ChallengeData.deserialize(
    os.path.join(os.path.dirname(__file__), "r_file_data.json")
)


class TestSomething(CategoryChallenge):
    """Testing if LLM can read a file"""

    @pytest.mark.parametrize(
        "server_response",
        [(data.task, data.mock_func)],
        indirect=True,
    )
    def test_retrieval(self, workspace):
        # scoring logic goes here
        ...
```
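For context, here is a hypothetical `r_file_data.json` matching the fields referenced in this doc (`task`, `mock_func`, `dependencies`, `ground.should_contain`, `ground.files`); the real schema lives in `define_task_types.py` and the values below are only illustrative:

```json
{
  "task": "Read the file and tell me what it says",
  "dependencies": [],
  "ground": {
    "should_contain": ["this is how we're doing"],
    "files": ["file_to_check.txt"]
  },
  "mock_func": "mock_retrieval"
}
```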
All challenges will inherit from a parent class which has the mark:

```python
@pytest.mark.basic
class BasicChallenge(Challenge):
    pass
```
If you want to add a custom mark to a challenge, you must specify it before the test definition:

```python
@pytest.mark.other_mark
def test_retrieval(self, workspace):
    ...
```
To add a dependency to a challenge, use the following:

```python
# for defining what a test depends on
from pytest_dependency import depends


def test1(self, request, workspace):
    depends(request, data.dependencies)


# for defining a test as a dependency
@pytest.mark.dependency()
def test2(self, workspace):
    ...
```
Ordering of challenges needs to be used in combination with the above to make sure a test executes after the test it depends on:

```python
@pytest.mark.run(order=1)
```
To create a file to test a challenge, add this to the challenge file; it will create the file before running the server:

```python
@pytest.fixture(scope="module", autouse=True)
def setup_module(workspace):
    if data.ground.should_contain:
        Challenge.write_to_file(
            workspace, data.ground.files[0], "this is how we're doing"
        )
```
API
FastAPI with REST; `requests` is used inside auto-gpt-benchmarks to call it. Boilerplate code is given to the agent project to start the server.
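A rough sketch of what that boilerplate could look like, assuming FastAPI + uvicorn; the route name, payload shape, and `run_agent` entry point are illustrative placeholders, not the actual interface:

```python
from fastapi import FastAPI
from pydantic import BaseModel
import uvicorn

app = FastAPI()


class TaskRequest(BaseModel):
    task: str


def run_agent(task: str) -> None:
    # placeholder for the agent's own run loop and stop condition
    ...


@app.post("/run")
def run(request: TaskRequest) -> dict:
    # the benchmark calls this endpoint once per test
    run_agent(request.task)
    return {"status": "done"}


if __name__ == "__main__":
    uvicorn.run(app, host="localhost", port=8080)
```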
Workspace
Defined by the user in the config
Dataset
Manually created, existing challenges within Auto-GPT, https://osu-nlp-group.github.io/Mind2Web/
Repo
```
|-- auto-gpt-benchmarks/ **main project directory**
|   |-- metrics.py **combining scores, metrics, final evaluation**
|   |-- start_benchmark.py **entry point from cli**
|   |-- conftest.py **shared fixtures across all tests**
|   |-- Challenge.py **easy challenge creation class?**
|   |-- config.json **hostname, port, workspace folder**
|   |-- challenges/ **challenges across different domains**
|   |   |-- adaptability/
|   |   |-- basic_abilities/
|   |   |-- code/
|   |   |-- memory/
|   |   |-- retrieval/
|   |   |-- web_navigation/
|   |   |-- writing/
|   |-- tests/ **challenges across different metrics**
|   |   |-- basic_abilities/
|   |   |-- interface/
|   |-- workspace/ **workspace related func**
|   |   |-- __init__.py
|   |   |-- workspace_manager.py **creation, deletion**
```
Easy Challenge Creation
TBD, but potentially a shared Challenge class that challenges instantiate, since challenges need different utils/metrics for eval
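A speculative sketch of what that shared class could provide; only `write_to_file` is referenced elsewhere in this doc, and the rest (`open_file`) is illustrative:

```python
import os


class Challenge:
    """Base class that concrete challenges inherit from."""

    @staticmethod
    def write_to_file(workspace: str, filename: str, content: str) -> None:
        # seed the agent's workspace with a file before a test runs
        with open(os.path.join(workspace, filename), "w") as f:
            f.write(content)

    @staticmethod
    def open_file(workspace: str, filename: str) -> str:
        # read a file the agent produced, for scoring
        with open(os.path.join(workspace, filename), "r") as f:
            return f.read()
```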
Written Challenges
For code and writing challenges, we can create a reference text and use metrics like METEOR, BERTScore, and BARTScore
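As an example of reference-based scoring, a minimal sketch assuming the `bert-score` package; the actual metric wiring is still undecided:

```python
from bert_score import score


def score_against_reference(candidate: str, reference: str) -> float:
    # F1 similarity between the agent's text and the reference text
    _, _, f1 = score([candidate], [reference], lang="en")
    return f1.item()
```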
Validators
Designed to handle specific types of output (e.g., text, code, structured data)
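A hypothetical sketch of what a validator interface could look like; none of these classes exist yet, and the shape of `ground` is assumed:

```python
import json
from abc import ABC, abstractmethod


class Validator(ABC):
    """Validates one kind of agent output against the challenge's ground truth."""

    @abstractmethod
    def validate(self, output: str, ground: dict) -> bool:
        ...


class TextValidator(Validator):
    def validate(self, output: str, ground: dict) -> bool:
        # every required phrase must appear in the output
        return all(phrase in output for phrase in ground.get("should_contain", []))


class StructuredDataValidator(Validator):
    def validate(self, output: str, ground: dict) -> bool:
        # output must at least parse as JSON
        try:
            json.loads(output)
            return True
        except json.JSONDecodeError:
            return False
```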
Logging
Log the different requests coming in (write file, change file, etc.). Maybe a DB in the future for metrics, logs, etc.
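A minimal sketch of per-request logging using the standard `logging` module; the event names are illustrative:

```python
import logging

logger = logging.getLogger("agbenchmark")
logging.basicConfig(level=logging.INFO)


def log_request(event: str, path: str) -> None:
    # e.g. log_request("write_file", "output.txt")
    logger.info("agent request: %s %s", event, path)
```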
Later: GitHub Actions integration, OpenAPI?, good versioning and backward compatibility