AGBenchmark codebase clean-up (#6650)

* refactor(benchmark): Deduplicate configuration loading logic

   - Move the configuration loading logic to a separate `load_agbenchmark_config` function in the `agbenchmark/config.py` module.
   - Replace the duplicated loading logic in `conftest.py`, `generate_test.py`, `ReportManager.py`, `reports.py`, and `__main__.py` with calls to `load_agbenchmark_config`.
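
As an illustration, a minimal sketch of what such a shared loader might look like (assuming the config lives at `agbenchmark_config/config.json`, as in the previous inline logic; this is not the verbatim implementation):

```python
# agbenchmark/config.py -- illustrative sketch of the deduplicated loader
import json
from pathlib import Path

from agbenchmark.utils.data_types import AgentBenchmarkConfig


def load_agbenchmark_config() -> AgentBenchmarkConfig:
    """Load the benchmark config from agbenchmark_config/config.json in the working directory."""
    config_path = Path.cwd() / "agbenchmark_config" / "config.json"
    with open(config_path) as f:
        return AgentBenchmarkConfig(**json.load(f))
```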

* fix(benchmark): Fix type errors, linting errors, and clean up CLI validation in __main__.py

   - Fixed type errors and linting errors in `__main__.py`
   - Improved the readability of CLI argument validation by introducing a separate function for it
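
A rough sketch of the extracted validation helper, based on the checks that were previously inlined (parameter names follow the CLI options; the exact signature is an assumption):

```python
class InvalidInvocationError(ValueError):
    pass


def validate_args(
    maintain: bool,
    improve: bool,
    explore: bool,
    tests: tuple[str, ...],
    categories: tuple[str, ...],
    skip_categories: tuple[str, ...],
    no_cutoff: bool,
    cutoff: int | None,
) -> None:
    """Raise InvalidInvocationError if the combination of CLI options is invalid."""
    if sum((maintain, improve, explore)) > 1:
        raise InvalidInvocationError(
            "--maintain, --improve and --explore are mutually exclusive."
        )
    if tests and (categories or skip_categories or maintain or improve or explore):
        raise InvalidInvocationError(
            "--test can not be combined with other selection options."
        )
    if no_cutoff and cutoff:
        raise InvalidInvocationError("--nc and --cutoff are mutually exclusive.")
```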

* refactor(benchmark): Lint and typefix app.py

   - Rearranged and cleaned up import statements
   - Fixed type errors caused by improper use of `psutil` objects
   - Simplified a number of `os.path` usages by converting to `pathlib`
   - Use `Task` and `TaskRequestBody` classes from `agent_protocol_client` instead of `.schema`

* refactor(benchmark): Replace `.agent_protocol_client` with `agent-protocol-client`, clean up schema.py

   - Remove `agbenchmark.agent_protocol_client` (an offline copy of `agent-protocol-client`).
      - Add `agent-protocol-client` as a dependency and change imports to `agent_protocol_client`.
   - Fix type annotation on `agent_api_interface.py::upload_artifacts` (`ApiClient` -> `AgentApi`).
   - Remove all unused types from schema.py (= most of them).
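
The corresponding import change, as reflected in the diff further below:

```python
# Before: imports from the vendored offline copy
# from agbenchmark.agent_protocol_client import AgentApi, ApiClient, Configuration, TaskRequestBody

# After: imports from the agent-protocol-client package
from agent_protocol_client import AgentApi, ApiClient, Configuration, TaskRequestBody
```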

* refactor(benchmark): Use pathlib in agent_interface.py and agent_api_interface.py

* refactor(benchmark): Improve typing, response validation, and readability in app.py

   - Simplified response generation by leveraging type checking and conversion by FastAPI.
   - Introduced use of `HTTPException` for error responses.
   - Improved naming, formatting, and typing in `app.py::create_evaluation`.
   - Updated the docstring on `app.py::create_agent_task`.
   - Fixed return type annotations of `create_single_test` and `create_challenge` in generate_test.py.
   - Added default values to optional attributes on models in report_types_v2.py.
   - Removed unused imports in `generate_test.py`
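
To illustrate the error-response pattern with `HTTPException` (a generic FastAPI sketch, not the actual `app.py` code; the route and the in-memory store are hypothetical):

```python
from fastapi import FastAPI, HTTPException

app = FastAPI()
TASKS: dict[str, dict] = {}  # hypothetical in-memory task store


@app.get("/ap/v1/agent/tasks/{task_id}")
def get_task(task_id: str) -> dict:
    task = TASKS.get(task_id)
    if not task:
        # FastAPI turns this into a clean 404 JSON error response
        raise HTTPException(status_code=404, detail=f"Task {task_id} not found")
    # FastAPI validates and serializes the return value based on the annotations
    return task
```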

* refactor(benchmark): Clean up logging and print statements

   - Introduced use of the `logging` library for unified logging and better readability.
   - Converted most print statements to use `logger.debug`, `logger.warning`, and `logger.error`.
   - Improved descriptiveness of log statements.
   - Removed unnecessary print statements.
   - Added log statements to `except` blocks that were overly broad or produced no output.
   - Added `--debug` flag, which sets the log level to `DEBUG` and enables a more comprehensive log format.
   - Added `.utils.logging` module with `configure_logging` function to easily configure the logging library.
   - Converted raw escape sequences in `.utils.challenge` to use `colorama`.
   - Renamed `generate_test.py::generate_tests` to `load_challenges`.
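
A minimal sketch of what `configure_logging` could look like (the format strings are assumptions; the real helper may differ):

```python
# agbenchmark/utils/logging.py -- illustrative sketch
import logging

SIMPLE_LOG_FORMAT = "%(asctime)s %(levelname)s  %(message)s"
DEBUG_LOG_FORMAT = "%(asctime)s %(levelname)s %(name)s:%(lineno)d  %(message)s"


def configure_logging(level: int = logging.INFO) -> None:
    """Configure the root logger; use a more comprehensive format for DEBUG."""
    logging.basicConfig(
        level=level,
        format=DEBUG_LOG_FORMAT if level == logging.DEBUG else SIMPLE_LOG_FORMAT,
    )
```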

* refactor(benchmark): Remove unused server.py and agent_interface.py::run_agent

   - Remove unused server.py file
   - Remove unused run_agent function from agent_interface.py

* refactor(benchmark): Clean up conftest.py

   - Fix and add type annotations
   - Rewrite docstrings
   - Disable or remove unused code
   - Fix definition of arguments and their types in `pytest_addoption`
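
For illustration, typed option definitions in `pytest_addoption` might look roughly like this (the option list mirrors the CLI flags above; defaults and types are assumptions):

```python
# conftest.py -- illustrative sketch
import pytest


def pytest_addoption(parser: pytest.Parser) -> None:
    parser.addoption("--mock", action="store_true")
    parser.addoption("--keep-answers", action="store_true")
    parser.addoption("--category", action="append", default=[])
    parser.addoption("--maintain", action="store_true")
    parser.addoption("--improve", action="store_true")
    parser.addoption("--explore", action="store_true")
    parser.addoption("--no-dep", action="store_true")
    parser.addoption("--nc", action="store_true")
    parser.addoption("--cutoff", action="store", type=int)
```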

* refactor(benchmark): Clean up generate_test.py file

   - Refactored the `create_single_test` function for clarity and readability
      - Removed unused variables
      - Made creation of `Challenge` subclasses more straightforward
      - Made bare `except` more specific
   - Renamed `Challenge.setup_challenge` method to `run_challenge`
   - Updated type hints and annotations
   - Made minor code/readability improvements in `load_challenges`
   - Added a helper function `_add_challenge_to_module` for attaching a Challenge class to the current module
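
A sketch of what the helper might do; the mechanism (registering the generated class as a module attribute so pytest can collect it) is the point, the exact name and signature are assumptions:

```python
import sys


def _add_challenge_to_module(challenge: type) -> None:
    """Attach a generated Challenge class to this module so pytest can collect it."""
    this_module = sys.modules[__name__]
    setattr(this_module, challenge.__name__, challenge)
```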

* fix(benchmark): Fix and add type annotations in execute_sub_process.py

* refactor(benchmark): Simplify const determination in agent_interface.py

   - Simplify the logic that determines the value of `HELICONE_GRAPHQL_LOGS`
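
The simplification, as it appears in the diff further below:

```python
import os

# Before
helicone_graphql_logs = os.getenv("HELICONE_GRAPHQL_LOGS")
HELICONE_GRAPHQL_LOGS = (
    helicone_graphql_logs.lower() == "true" if helicone_graphql_logs else False
)

# After
HELICONE_GRAPHQL_LOGS = os.getenv("HELICONE_GRAPHQL_LOGS", "").lower() == "true"
```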

* fix(benchmark): Register category markers to prevent warnings

   - Use the `pytest_configure` hook to register the known challenge categories as markers. Otherwise, Pytest will raise "unknown marker" warnings at runtime.
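
A minimal sketch of the hook (the category names here are placeholders; the real list comes from the challenge specs):

```python
# conftest.py -- illustrative sketch
import pytest

CHALLENGE_CATEGORIES = ["code", "retrieval", "data", "general"]  # placeholder list


def pytest_configure(config: pytest.Config) -> None:
    for category in CHALLENGE_CATEGORIES:
        config.addinivalue_line("markers", f"{category}: {category} challenges")
```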

* refactor(benchmark/challenges): Fix indentation in 4_revenue_retrieval_2/data.json

* refactor(benchmark): Update agent_api_interface.py

   - Add type annotations to `copy_agent_artifacts_into_temp_folder` function
   - Add note about broken endpoint in the `agent_protocol_client` library
   - Remove unused variable in `run_api_agent` function
   - Improve readability and resolve linting error

* feat(benchmark): Improve and centralize pathfinding

   - Search the path hierarchy for an applicable `agbenchmark_config`, rather than assuming it is in the current folder.
   - Create the `agbenchmark.utils.path_manager` module with `AGBenchmarkPathManager`, exporting a `PATH_MANAGER` constant.
   - Replace the path constants defined in `__main__.py` with usages of `PATH_MANAGER`.
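
A sketch of the hierarchy search, under the assumption that a folder qualifies when it contains `agbenchmark_config/config.json`:

```python
from pathlib import Path


def find_config_folder(for_dir: Path = Path.cwd()) -> Path:
    """Walk up the directory tree until an agbenchmark_config folder is found."""
    current = for_dir
    while True:
        config_dir = current / "agbenchmark_config"
        if (config_dir / "config.json").is_file():
            return config_dir
        if current == current.parent:  # reached the filesystem root
            raise FileNotFoundError("No 'agbenchmark_config' folder found in the path hierarchy")
        current = current.parent
```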

* feat(benchmark/cli): Clean up and improve CLI

   - Updated commands, options, and their descriptions to be more intuitive and consistent
   - Moved slow imports into the entrypoints that use them to speed up application startup
   - Fixed type hints to match output types of Click options
   - Hid deprecated `agbenchmark start` command
   - Refactored code to improve readability and maintainability
   - Moved main entrypoint into `run` subcommand
   - Fixed `version` and `serve` subcommands
   - Added `click-default-group` package to allow using `run` implicitly (for backwards compatibility)
   - Renamed `--no_dep` to `--no-dep` for consistency
   - Fixed string formatting issues in log statements
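
The `click-default-group` mechanism boils down to the following (matching the decorators visible in the diff further below):

```python
import click
from click_default_group import DefaultGroup


@click.group(cls=DefaultGroup, default_if_no_args=True)
def cli() -> None:
    pass


@cli.command(default=True)
def run() -> None:
    """Invoked by both `agbenchmark run` and plain `agbenchmark`."""
    click.echo("running benchmark...")
```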

* refactor(benchmark/config): Move AgentBenchmarkConfig and related functions to config.py

   - Move the `AgentBenchmarkConfig` class from `utils/data_types.py` to `config.py`.
   - Extract the `calculate_info_test_path` function from `utils/data_types.py` and move it to `config.py` as a private helper function `_calculate_info_test_path`.
   - Move `load_agent_benchmark_config()` to `AgentBenchmarkConfig.load()`.
   - Changed simple getter methods on `AgentBenchmarkConfig` to calculated properties.
   - Update all code references according to the changes mentioned above.

* refactor(benchmark): Fix ReportManager init parameter types and use pathlib

   - Fix the type annotation of the `benchmark_start_time` parameter in `ReportManager.__init__`; it was mistyped as `str` instead of `datetime`.
   - Change the type of the `filename` parameter in `ReportManager.__init__` from `str` to `Path`.
   - Rename `self.filename` to `self.report_file` in `ReportManager`.
   - Change the way the report file is created, opened, and saved to use the `Path` object.

* refactor(benchmark): Improve typing surrounding ChallengeData and clean up its implementation

   - Use `ChallengeData` objects instead of untyped `dict`s in app.py, generate_test.py, and reports.py.
   - Remove unnecessary methods `serialize`, `get_data`, `get_json_from_path`, `deserialize` from the `ChallengeData` class.
   - Remove unused methods `challenge_from_datum` and `challenge_from_test_data` from the `ChallengeData` class.
   - Update function signatures and annotations of `create_challenge` and `generate_single_test` functions in generate_test.py.
   - Add types to function signatures of `generate_single_call_report` and `finalize_reports` in reports.py.
   - Remove unnecessary `challenge_data` parameter (in generate_test.py) and fixture (in conftest.py).

* refactor(benchmark): Clean up generate_test.py, conftest.py and __main__.py

   - Cleaned up generate_test.py and conftest.py
      - Consolidated challenge creation logic in the `Challenge` class itself, most notably the new `Challenge.from_challenge_spec` method.
      - Moved challenge selection logic from generate_test.py to the `pytest_collection_modifyitems` hook in conftest.py.
   - Converted methods in the `Challenge` class to class methods where appropriate.
   - Improved argument handling in the `run_benchmark` function in `__main__.py`.
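
A simplified sketch of category-based selection in the `pytest_collection_modifyitems` hook (the real hook also handles the `--maintain`/`--improve`/`--explore` modes):

```python
# conftest.py -- illustrative sketch
import pytest


def pytest_collection_modifyitems(config: pytest.Config, items: list[pytest.Item]) -> None:
    selected_categories = set(config.getoption("--category") or [])
    if not selected_categories:
        return
    for item in list(items):
        item_categories = {marker.name for marker in item.iter_markers()}
        if not item_categories & selected_categories:
            items.remove(item)
```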

* refactor(benchmark/config): Merge AGBenchmarkPathManager into AgentBenchmarkConfig and reduce fragmented/global state

   - Merge the functionality of `AGBenchmarkPathManager` into `AgentBenchmarkConfig` to consolidate the configuration management.
   - Remove the `.path_manager` module containing `AGBenchmarkPathManager`.
   - Pass the `AgentBenchmarkConfig` and its attributes through function arguments to reduce global state and improve code clarity.

* feat(benchmark/serve): Configurable port for `serve` subcommand

   - Added `--port` option to `serve` subcommand to allow for specifying the port to run the API on.
   - If no `--port` option is provided, the port will default to the value specified in the `PORT` environment variable, or 8080 if not set.

* feat(benchmark/cli): Add `config` subcommand

   - Added a new `config` subcommand to the AGBenchmark CLI to display information about the current AGBenchmark config.

* fix(benchmark): Gracefully handle incompatible challenge spec files in app.py

   - Added a check to skip deprecated challenges
   - Added logging to allow debugging of the loading process
   - Added handling of validation errors when parsing challenge spec files
   - Added missing `spec_file` attribute to `ChallengeData`
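
A sketch of the tolerant loading pattern (assuming `ChallengeData` is a pydantic model; the paths, helper name, and deprecation check are illustrative):

```python
# app.py -- illustrative sketch
import logging
from pathlib import Path

from pydantic import ValidationError

from agbenchmark.utils.data_types import ChallengeData

logger = logging.getLogger(__name__)


def load_challenge_specs(challenges_dir: Path) -> list[ChallengeData]:
    challenges = []
    for spec_file in challenges_dir.rglob("data.json"):
        if "deprecated" in spec_file.parts:
            logger.debug(f"Skipping deprecated challenge: {spec_file}")
            continue
        logger.debug(f"Loading challenge spec: {spec_file}")
        try:
            challenges.append(ChallengeData.parse_file(spec_file))
        except ValidationError as e:
            logger.warning(f"Incompatible challenge spec {spec_file}: {e}")
    return challenges
```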

* refactor(benchmark): Move `run_benchmark` entrypoint to main.py, use it in `/reports` endpoint

   - Move `run_benchmark` and `validate_args` from `__main__.py` to `main.py`
   - Replace the agbenchmark subprocess in `app.py:run_single_test` with a call to `run_benchmark`
   - Move `get_unique_categories` from `__main__.py` to `challenges/__init__.py`
   - Move `OPTIONAL_CATEGORIES` from `__main__.py` to `challenge.py`
   - Reduce operations on updates.json (including `initialize_updates_file`) outside of the API

* refactor(benchmark): Remove unused `/updates` endpoint and all related code

   - Remove `updates_json_file` attribute from `AgentBenchmarkConfig`
   - Remove `get_updates` and `_initialize_updates_file` in app.py
   - Remove `append_updates_file` and `create_update_json` functions in agent_api_interface.py
   - Remove call to `append_updates_file` in challenge.py

* refactor(benchmark/config): Clean up and update docstrings on `AgentBenchmarkConfig`

   - Add and update docstrings
   - Change base class from `BaseModel` to `BaseSettings`, allow extras for backwards compatibility
   - Make naming of path attributes on `AgentBenchmarkConfig` more consistent
   - Remove unused `agent_home_directory` attribute
   - Remove unused `workspace` attribute
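
A minimal sketch of the base-class change (pydantic v1 style; field names are abbreviated and assumed):

```python
from pathlib import Path

from pydantic import BaseSettings


class AgentBenchmarkConfig(BaseSettings):
    agbenchmark_config_dir: Path
    host: str

    class Config:
        extra = "allow"  # tolerate unknown keys from older config.json files
```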

* fix(benchmark): Restore mechanism to select (optional) categories in agent benchmark config

* fix(benchmark): Update agent-protocol-client to v1.1.0

   - Fixes an issue with fetching task artifact listings
Authored by Reinier van der Leer on 2024-01-02 22:23:09 +01:00, committed by GitHub.
parent b8238c2228 · commit 25cc6ad6ae
47 changed files with 2122 additions and 7752 deletions


@@ -121,7 +121,7 @@ jobs:
./run agent start $AGENT_NAME
cd ../benchmark
poetry install
poetry run agbenchmark --no_dep
poetry run agbenchmark --no-dep
env:
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
SERP_API_KEY: ${{ secrets.SERP_API_KEY }}


@@ -1,5 +1,4 @@
import glob
import json
import logging
import os
import sys
from datetime import datetime, timezone
@@ -7,205 +6,97 @@ from pathlib import Path
from typing import Any, Optional
import click
import pytest
import toml
from click_default_group import DefaultGroup
from dotenv import load_dotenv
from helicone.lock import HeliconeLockManager
from agbenchmark.app import app
from agbenchmark.reports.ReportManager import SingletonReportManager
from agbenchmark.utils.data_types import AgentBenchmarkConfig
from agbenchmark.config import AgentBenchmarkConfig
from agbenchmark.utils.logging import configure_logging
load_dotenv()
try:
if os.getenv("HELICONE_API_KEY"):
import helicone # noqa
helicone_enabled = True
else:
helicone_enabled = False
except ImportError:
helicone_enabled = False
class InvalidInvocationError(ValueError):
pass
logger = logging.getLogger(__name__)
BENCHMARK_START_TIME_DT = datetime.now(timezone.utc)
BENCHMARK_START_TIME = BENCHMARK_START_TIME_DT.strftime("%Y-%m-%dT%H:%M:%S+00:00")
TEMP_FOLDER_ABS_PATH = Path.cwd() / "agbenchmark_config" / "temp_folder"
CHALLENGES_ALREADY_BEATEN = (
Path.cwd() / "agbenchmark_config" / "challenges_already_beaten.json"
)
UPDATES_JSON_PATH = Path.cwd() / "agbenchmark_config" / "updates.json"
if os.environ.get("HELICONE_API_KEY"):
if helicone_enabled:
from helicone.lock import HeliconeLockManager
HeliconeLockManager.write_custom_property(
"benchmark_start_time", BENCHMARK_START_TIME
)
with open(
Path(__file__).resolve().parent / "challenges" / "optional_categories.json"
) as f:
OPTIONAL_CATEGORIES = json.load(f)["optional_categories"]
@click.group(cls=DefaultGroup, default_if_no_args=True)
@click.option("--debug", is_flag=True, help="Enable debug output")
def cli(
debug: bool,
) -> Any:
configure_logging(logging.DEBUG if debug else logging.INFO)
def get_unique_categories() -> set[str]:
"""Find all data.json files in the directory relative to this file and its subdirectories,
read the "category" field from each file, and return a set of unique categories."""
categories = set()
# Get the directory of this file
this_dir = os.path.dirname(os.path.abspath(__file__))
glob_path = os.path.join(this_dir, "./challenges/**/data.json")
# Use it as the base for the glob pattern
for data_file in glob.glob(glob_path, recursive=True):
with open(data_file, "r") as f:
try:
data = json.load(f)
categories.update(data.get("category", []))
except json.JSONDecodeError:
print(f"Error: {data_file} is not a valid JSON file.")
continue
except IOError:
print(f"IOError: file could not be read: {data_file}")
continue
return categories
@cli.command(hidden=True)
def start():
raise DeprecationWarning(
"`agbenchmark start` is deprecated. Use `agbenchmark run` instead."
)
def run_benchmark(
maintain: bool = False,
improve: bool = False,
explore: bool = False,
mock: bool = False,
no_dep: bool = False,
nc: bool = False,
keep_answers: bool = False,
category: Optional[tuple[str]] = None,
skip_category: Optional[tuple[str]] = None,
test: Optional[str] = None,
cutoff: Optional[int] = None,
server: bool = False,
) -> int:
"""Start the benchmark tests. If a category flag is provided, run the categories with that mark."""
# Check if configuration file exists and is not empty
initialize_updates_file()
SingletonReportManager()
agent_benchmark_config_path = str(Path.cwd() / "agbenchmark_config" / "config.json")
try:
with open(agent_benchmark_config_path, "r") as f:
agent_benchmark_config = AgentBenchmarkConfig(**json.load(f))
agent_benchmark_config.agent_benchmark_config_path = (
agent_benchmark_config_path
)
except json.JSONDecodeError:
print("Error: benchmark_config.json is not a valid JSON file.")
return 1
if maintain and improve and explore:
print(
"Error: You can't use --maintain, --improve or --explore at the same time. Please choose one."
)
return 1
if test and (category or skip_category or maintain or improve or explore):
print(
"Error: If you're running a specific test make sure no other options are selected. Please just pass the --test."
)
return 1
assert agent_benchmark_config.host, "Error: host needs to be added to the config."
print("Current configuration:")
for key, value in vars(agent_benchmark_config).items():
print(f"{key}: {value}")
pytest_args = ["-vs"]
if keep_answers:
pytest_args.append("--keep-answers")
if test:
print("Running specific test:", test)
else:
# Categories that are used in the challenges
categories = get_unique_categories()
if category:
invalid_categories = set(category) - categories
assert (
not invalid_categories
), f"Invalid categories: {invalid_categories}. Valid categories are: {categories}"
if category:
categories_to_run = set(category)
if skip_category:
categories_to_run = categories_to_run.difference(set(skip_category))
assert categories_to_run, "Error: You can't skip all categories"
pytest_args.extend(["-m", " or ".join(categories_to_run), "--category"])
print("Running tests of category:", categories_to_run)
elif skip_category:
categories_to_run = categories - set(skip_category)
assert categories_to_run, "Error: You can't skip all categories"
pytest_args.extend(["-m", " or ".join(categories_to_run), "--category"])
print("Running tests of category:", categories_to_run)
else:
print("Running all categories")
if maintain:
print("Running only regression tests")
pytest_args.append("--maintain")
elif improve:
print("Running only non-regression tests")
pytest_args.append("--improve")
elif explore:
print("Only attempt challenges that have never been beaten")
pytest_args.append("--explore")
if mock:
pytest_args.append("--mock")
os.environ[
"IS_MOCK"
] = "True" # ugly hack to make the mock work when calling from API
if no_dep:
pytest_args.append("--no_dep")
if nc and cutoff:
print(
"Error: You can't use both --nc and --cutoff at the same time. Please choose one."
)
return 1
if nc:
pytest_args.append("--nc")
if cutoff:
pytest_args.append("--cutoff")
print(f"Setting cuttoff override to {cutoff} seconds.")
current_dir = Path(__file__).resolve().parent
print(f"Current directory: {current_dir}")
pytest_args.extend((str(current_dir), "--cache-clear"))
exit_code = pytest.main(pytest_args)
SingletonReportManager().clear_instance()
@click.group(invoke_without_command=True)
@click.option("--backend", is_flag=True, help="If it's being run from the cli")
@click.option("-c", "--category", multiple=True, help="Specific category to run")
@cli.command(default=True)
@click.option(
"-c",
"--category",
multiple=True,
help="(+) Select a category to run.",
)
@click.option(
"-s",
"--skip-category",
multiple=True,
help="Skips preventing the tests from this category from running",
help="(+) Exclude a category from running.",
)
@click.option("--test", multiple=True, help="Specific test to run")
@click.option("--maintain", is_flag=True, help="Runs only regression tests")
@click.option("--improve", is_flag=True, help="Run only non-regression tests")
@click.option("--test", multiple=True, help="(+) Select a test to run.")
@click.option("--maintain", is_flag=True, help="Run only regression tests.")
@click.option("--improve", is_flag=True, help="Run only non-regression tests.")
@click.option(
"--explore",
is_flag=True,
help="Only attempt challenges that have never been beaten",
help="Run only challenges that have never been beaten.",
)
@click.option("--mock", is_flag=True, help="Run with mock")
@click.option(
"--no_dep",
"--no-dep",
is_flag=True,
help="Run without dependencies",
help="Run all (selected) challenges, regardless of dependency success/failure.",
)
@click.option("--nc", is_flag=True, help="Run without cutoff")
@click.option("--cutoff", type=int, help="Override the challenge time limit (seconds).")
@click.option("--nc", is_flag=True, help="Disable the challenge time limit.")
@click.option("--mock", is_flag=True, help="Run with mock")
@click.option("--keep-answers", is_flag=True, help="Keep answers")
@click.option("--cutoff", help="Set or override tests cutoff (seconds)")
@click.argument("value", type=str, required=False)
def cli(
@click.option(
"--backend",
is_flag=True,
help="Write log output to a file instead of the terminal.",
)
# @click.argument(
# "agent_path", type=click.Path(exists=True, file_okay=False), required=False
# )
def run(
maintain: bool,
improve: bool,
explore: bool,
@@ -213,18 +104,37 @@ def cli(
no_dep: bool,
nc: bool,
keep_answers: bool,
category: Optional[list[str]] = None,
skip_category: Optional[list[str]] = None,
test: Optional[str] = None,
test: tuple[str],
category: tuple[str],
skip_category: tuple[str],
cutoff: Optional[int] = None,
backend: Optional[bool] = False,
value: Optional[str] = None,
) -> Any:
# Redirect stdout if backend is True
if value == "start":
raise ("`agbenchmark start` is removed. Run `agbenchmark` instead.")
if value == "serve":
return serve()
# agent_path: Optional[Path] = None,
) -> None:
"""
Run the benchmark on the agent in the current directory.
Options marked with (+) can be specified multiple times, to select multiple items.
"""
from agbenchmark.main import run_benchmark, validate_args
agbenchmark_config = AgentBenchmarkConfig.load()
logger.debug(f"agbenchmark_config: {agbenchmark_config.agbenchmark_config_dir}")
try:
validate_args(
maintain=maintain,
improve=improve,
explore=explore,
tests=test,
categories=category,
skip_categories=skip_category,
no_cutoff=nc,
cutoff=cutoff,
)
except InvalidInvocationError as e:
logger.error("Error: " + "\n".join(e.args))
sys.exit(1)
original_stdout = sys.stdout # Save the original standard output
exit_code = None
@@ -232,16 +142,17 @@ def cli(
with open("backend/backend_stdout.txt", "w") as f:
sys.stdout = f
exit_code = run_benchmark(
config=agbenchmark_config,
maintain=maintain,
improve=improve,
explore=explore,
mock=mock,
no_dep=no_dep,
nc=nc,
no_cutoff=nc,
keep_answers=keep_answers,
category=category,
skip_category=skip_category,
test=test,
tests=test,
categories=category,
skip_categories=skip_category,
cutoff=cutoff,
)
@@ -249,16 +160,17 @@ def cli(
else:
exit_code = run_benchmark(
config=agbenchmark_config,
maintain=maintain,
improve=improve,
explore=explore,
mock=mock,
no_dep=no_dep,
nc=nc,
no_cutoff=nc,
keep_answers=keep_answers,
category=category,
skip_category=skip_category,
test=test,
tests=test,
categories=category,
skip_categories=skip_category,
cutoff=cutoff,
)
@@ -266,33 +178,44 @@ def cli(
@cli.command()
def version():
"""Print the version of the benchmark tool."""
current_directory = Path(__file__).resolve().parent
version = toml.load(current_directory / ".." / "pyproject.toml")["tool"]["poetry"][
"version"
]
print(f"Benchmark Tool Version {version}")
def serve():
@click.option("--port", type=int, help="Port to run the API on.")
def serve(port: Optional[int] = None):
"""Serve the benchmark frontend and API on port 8080."""
import uvicorn
from agbenchmark.app import setup_fastapi_app
config = AgentBenchmarkConfig.load()
app = setup_fastapi_app(config)
# Run the FastAPI application using uvicorn
uvicorn.run(app, host="0.0.0.0", port=8080)
port = port or int(os.getenv("PORT", 8080))
uvicorn.run(app, host="0.0.0.0", port=port)
def initialize_updates_file():
if os.path.exists(UPDATES_JSON_PATH):
# If the file already exists, overwrite it with an empty list
with open(UPDATES_JSON_PATH, "w") as file:
json.dump([], file, indent=2)
print("Initialized updates.json by overwriting with an empty array")
else:
# If the file doesn't exist, create it and write an empty list
with open(UPDATES_JSON_PATH, "w") as file:
json.dump([], file, indent=2)
print("Created updates.json and initialized it with an empty array")
@cli.command()
def config():
"""Displays info regarding the present AGBenchmark config."""
try:
config = AgentBenchmarkConfig.load()
except FileNotFoundError as e:
click.echo(e, err=True)
return 1
k_col_width = max(len(k) for k in config.dict().keys())
for k, v in config.dict().items():
click.echo(f"{k: <{k_col_width}} = {v}")
@cli.command()
def version():
"""Print version info for the AGBenchmark application."""
import toml
package_root = Path(__file__).resolve().parent.parent
pyproject = toml.load(package_root / "pyproject.toml")
version = pyproject["tool"]["poetry"]["version"]
click.echo(f"AGBenchmark version {version}")
if __name__ == "__main__":


@@ -1,30 +1,25 @@
import json
import logging
import os
import pathlib
import time
from typing import Any, Dict, Optional
from pathlib import Path
from typing import Optional
from agent_protocol_client import AgentApi, ApiClient, Configuration, TaskRequestBody
from agbenchmark.__main__ import TEMP_FOLDER_ABS_PATH, UPDATES_JSON_PATH
from agbenchmark.agent_interface import get_list_of_file_paths
from agbenchmark.agent_protocol_client import (
AgentApi,
ApiClient,
Configuration,
TaskRequestBody,
)
from agbenchmark.agent_protocol_client.models.step import Step
from agbenchmark.config import AgentBenchmarkConfig
from agbenchmark.utils.data_types import ChallengeData
LOG = logging.getLogger(__name__)
async def run_api_agent(
task: ChallengeData, config: Dict[str, Any], artifacts_location: str, timeout: int
task: ChallengeData,
config: AgentBenchmarkConfig,
artifacts_location: str,
timeout: int,
) -> None:
host_value = None
configuration = Configuration(host=config["AgentBenchmarkConfig"].host + "/ap/v1")
configuration = Configuration(host=config.host)
async with ApiClient(configuration) as api_client:
api_instance = AgentApi(api_client)
task_request_body = TaskRequestBody(input=task.task)
@@ -45,7 +40,6 @@ async def run_api_agent(
# Read the existing JSON data from the file
step = await api_instance.execute_agent_task_step(task_id=task_id)
await append_updates_file(step)
print(f"[{task.name}] - step {step.name} ({i}. request)")
i += 1
@@ -54,34 +48,38 @@ async def run_api_agent(
raise TimeoutError("Time limit exceeded")
if not step or step.is_last:
steps_remaining = False
# if we're calling a mock agent, we "cheat" and give the correct artifacts to pass the tests
# In "mock" mode, we cheat by giving the correct artifacts to pass the challenge
if os.getenv("IS_MOCK"):
await upload_artifacts(
api_instance, artifacts_location, task_id, "artifacts_out"
)
await copy_agent_artifacts_into_temp_folder(api_instance, task_id)
await copy_agent_artifacts_into_folder(
api_instance, task_id, config.temp_folder
)
async def copy_agent_artifacts_into_temp_folder(api_instance, task_id):
async def copy_agent_artifacts_into_folder(
api_instance: AgentApi, task_id: str, folder: Path
):
artifacts = await api_instance.list_agent_task_artifacts(task_id=task_id)
for artifact in artifacts.artifacts:
# current absolute path of the directory of the file
directory_location = pathlib.Path(TEMP_FOLDER_ABS_PATH)
if artifact.relative_path:
path = (
path: str = (
artifact.relative_path
if not artifact.relative_path.startswith("/")
else artifact.relative_path[1:]
)
directory_location = pathlib.Path(
os.path.dirname(directory_location / path)
)
LOG.info(f"Creating directory {directory_location}")
folder = (folder / path).parent
directory_location.mkdir(parents=True, exist_ok=True)
if not folder.exists():
LOG.info(f"Creating directory {folder}")
folder.mkdir(parents=True)
file_path = directory_location / artifact.file_name
file_path = folder / artifact.file_name
LOG.info(f"Writing file {file_path}")
with open(file_path, "wb") as f:
content = await api_instance.download_agent_task_artifact(
@@ -91,35 +89,16 @@ async def copy_agent_artifacts_into_temp_folder(api_instance, task_id):
f.write(content)
async def append_updates_file(step: Step):
with open(UPDATES_JSON_PATH, "r") as file:
existing_data = json.load(file)
# Append the new update to the existing array
new_update = create_update_json(step)
existing_data.append(new_update)
# Write the updated array back to the file
with open(UPDATES_JSON_PATH, "w") as file:
file.write(json.dumps(existing_data, indent=2))
async def upload_artifacts(
api_instance: ApiClient, artifacts_location: str, task_id: str, type: str
api_instance: AgentApi, artifacts_location: str, task_id: str, type: str
) -> None:
for file_path in get_list_of_file_paths(artifacts_location, type):
relative_path: Optional[str] = "/".join(
file_path.split(f"{type}/", 1)[-1].split("/")[:-1]
str(file_path).split(f"{type}/", 1)[-1].split("/")[:-1]
)
if not relative_path:
relative_path = None
await api_instance.upload_agent_task_artifacts(
task_id=task_id, file=file_path, relative_path=relative_path
task_id=task_id, file=str(file_path), relative_path=relative_path
)
def create_update_json(step: Step):
now = int(time.time())
content = {"content": step.to_dict(), "timestamp": now}
return content


@@ -1,45 +1,27 @@
import os
import shutil
import sys
from typing import List
from pathlib import Path
from dotenv import load_dotenv
from agbenchmark.execute_sub_process import execute_subprocess
load_dotenv()
helicone_graphql_logs = os.getenv("HELICONE_GRAPHQL_LOGS")
HELICONE_GRAPHQL_LOGS = (
helicone_graphql_logs.lower() == "true" if helicone_graphql_logs else False
)
def run_agent(task: str, timeout: int) -> None:
print(f"Running agbenchmark/benchmarks.py with timeout {timeout}")
command = [sys.executable, "-m", "agbenchmark_config.benchmarks", str(task)]
execute_subprocess(command, timeout)
HELICONE_GRAPHQL_LOGS = os.getenv("HELICONE_GRAPHQL_LOGS", "").lower() == "true"
def get_list_of_file_paths(
challenge_dir_path: str, artifact_folder_name: str
) -> List[str]:
# this file is at agbenchmark\agent_interface.py
source_dir = os.path.join(
challenge_dir_path,
artifact_folder_name,
)
if not os.path.exists(source_dir):
challenge_dir_path: str | Path, artifact_folder_name: str
) -> list[Path]:
source_dir = Path(challenge_dir_path) / artifact_folder_name
if not source_dir.exists():
return []
return [os.path.join(source_dir, file_name) for file_name in os.listdir(source_dir)]
return list(source_dir.iterdir())
def copy_artifacts_into_temp_folder(
workspace: str | dict[str, str], artifact_folder_name: str, challenge_dir_path: str
workspace: str | Path, artifact_folder_name: str, challenge_dir_path: str | Path
) -> None:
file_paths = get_list_of_file_paths(challenge_dir_path, artifact_folder_name)
for file_path in file_paths:
if os.path.isfile(file_path):
if file_path.is_file():
shutil.copy(file_path, workspace)


@@ -1,42 +0,0 @@
# coding: utf-8
# flake8: noqa
"""
Agent Communication Protocol
Specification of the API protocol for communication with an agent. # noqa: E501
The version of the OpenAPI document: v0.2
Generated by OpenAPI Generator (https://openapi-generator.tech)
Do not edit the class manually.
"""
__version__ = "1.0.0"
# import apis into sdk package
from agbenchmark.agent_protocol_client.api.agent_api import AgentApi
from agbenchmark.agent_protocol_client.api_client import ApiClient
# import ApiClient
from agbenchmark.agent_protocol_client.api_response import ApiResponse
from agbenchmark.agent_protocol_client.configuration import Configuration
from agbenchmark.agent_protocol_client.exceptions import (
ApiAttributeError,
ApiException,
ApiKeyError,
ApiTypeError,
ApiValueError,
OpenApiException,
)
# import models into sdk package
from agbenchmark.agent_protocol_client.models.artifact import Artifact
from agbenchmark.agent_protocol_client.models.step import Step
from agbenchmark.agent_protocol_client.models.step_all_of import StepAllOf
from agbenchmark.agent_protocol_client.models.step_request_body import StepRequestBody
from agbenchmark.agent_protocol_client.models.task import Task
from agbenchmark.agent_protocol_client.models.task_all_of import TaskAllOf
from agbenchmark.agent_protocol_client.models.task_request_body import TaskRequestBody


@@ -1,4 +0,0 @@
# flake8: noqa
# import apis into api package
from agbenchmark.agent_protocol_client.api.agent_api import AgentApi

File diff suppressed because it is too large.


@@ -1,838 +0,0 @@
# coding: utf-8
"""
Agent Communication Protocol
Specification of the API protocol for communication with an agent. # noqa: E501
The version of the OpenAPI document: v0.2
Generated by OpenAPI Generator (https://openapi-generator.tech)
Do not edit the class manually.
"""
import atexit
import datetime
import json
import mimetypes
import os
import re
import tempfile
from multiprocessing.pool import ThreadPool
from urllib.parse import quote
from dateutil.parser import parse
import agbenchmark.agent_protocol_client.models
from agbenchmark.agent_protocol_client import rest
from agbenchmark.agent_protocol_client.api_response import ApiResponse
from agbenchmark.agent_protocol_client.configuration import Configuration
from agbenchmark.agent_protocol_client.exceptions import ApiException, ApiValueError
class ApiClient(object):
"""Generic API client for OpenAPI client library builds.
OpenAPI generic API client. This client handles the client-
server communication, and is invariant across implementations. Specifics of
the methods and models for each application are generated from the OpenAPI
templates.
:param configuration: .Configuration object for this client
:param header_name: a header to pass when making calls to the API.
:param header_value: a header value to pass when making calls to
the API.
:param cookie: a cookie to include in the header when making calls
to the API
:param pool_threads: The number of threads to use for async requests
to the API. More threads means more concurrent API requests.
"""
PRIMITIVE_TYPES = (float, bool, bytes, str, int)
NATIVE_TYPES_MAPPING = {
"int": int,
"long": int, # TODO remove as only py3 is supported?
"float": float,
"str": str,
"bool": bool,
"date": datetime.date,
"datetime": datetime.datetime,
"object": object,
}
_pool = None
def __init__(
self,
configuration=None,
header_name=None,
header_value=None,
cookie=None,
pool_threads=1,
):
# use default configuration if none is provided
if configuration is None:
configuration = Configuration.get_default()
self.configuration = configuration
self.pool_threads = pool_threads
self.rest_client = rest.RESTClientObject(configuration)
self.default_headers = {}
if header_name is not None:
self.default_headers[header_name] = header_value
self.cookie = cookie
# Set default User-Agent.
self.user_agent = "OpenAPI-Generator/1.0.0/python"
self.client_side_validation = configuration.client_side_validation
async def __aenter__(self):
return self
async def __aexit__(self, exc_type, exc_value, traceback):
await self.close()
async def close(self):
await self.rest_client.close()
if self._pool:
self._pool.close()
self._pool.join()
self._pool = None
if hasattr(atexit, "unregister"):
atexit.unregister(self.close)
@property
def pool(self):
"""Create thread pool on first request
avoids instantiating unused threadpool for blocking clients.
"""
if self._pool is None:
atexit.register(self.close)
self._pool = ThreadPool(self.pool_threads)
return self._pool
@property
def user_agent(self):
"""User agent for this API client"""
return self.default_headers["User-Agent"]
@user_agent.setter
def user_agent(self, value):
self.default_headers["User-Agent"] = value
def set_default_header(self, header_name, header_value):
self.default_headers[header_name] = header_value
_default = None
@classmethod
def get_default(cls):
"""Return new instance of ApiClient.
This method returns newly created, based on default constructor,
object of ApiClient class or returns a copy of default
ApiClient.
:return: The ApiClient object.
"""
if cls._default is None:
cls._default = ApiClient()
return cls._default
@classmethod
def set_default(cls, default):
"""Set default instance of ApiClient.
It stores default ApiClient.
:param default: object of ApiClient.
"""
cls._default = default
async def __call_api(
self,
resource_path,
method,
path_params=None,
query_params=None,
header_params=None,
body=None,
post_params=None,
files=None,
response_types_map=None,
auth_settings=None,
_return_http_data_only=None,
collection_formats=None,
_preload_content=True,
_request_timeout=None,
_host=None,
_request_auth=None,
):
config = self.configuration
# header parameters
header_params = header_params or {}
header_params.update(self.default_headers)
if self.cookie:
header_params["Cookie"] = self.cookie
if header_params:
header_params = self.sanitize_for_serialization(header_params)
header_params = dict(
self.parameters_to_tuples(header_params, collection_formats)
)
# path parameters
if path_params:
path_params = self.sanitize_for_serialization(path_params)
path_params = self.parameters_to_tuples(path_params, collection_formats)
for k, v in path_params:
# specified safe chars, encode everything
resource_path = resource_path.replace(
"{%s}" % k, quote(str(v), safe=config.safe_chars_for_path_param)
)
# post parameters
if post_params or files:
post_params = post_params if post_params else []
post_params = self.sanitize_for_serialization(post_params)
post_params = self.parameters_to_tuples(post_params, collection_formats)
post_params.extend(self.files_parameters(files))
# auth setting
self.update_params_for_auth(
header_params,
query_params,
auth_settings,
resource_path,
method,
body,
request_auth=_request_auth,
)
# body
if body:
body = self.sanitize_for_serialization(body)
# request url
if _host is None:
url = self.configuration.host + resource_path
else:
# use server/host defined in path or operation instead
url = _host + resource_path
# query parameters
if query_params:
query_params = self.sanitize_for_serialization(query_params)
url_query = self.parameters_to_url_query(query_params, collection_formats)
url += "?" + url_query
try:
# perform request and return response
response_data = await self.request(
method,
url,
query_params=query_params,
headers=header_params,
post_params=post_params,
body=body,
_preload_content=_preload_content,
_request_timeout=_request_timeout,
)
except ApiException as e:
if e.body:
e.body = e.body.decode("utf-8")
raise e
self.last_response = response_data
return_data = None # assuming derialization is not needed
# data needs deserialization or returns HTTP data (deserialized) only
if _preload_content or _return_http_data_only:
response_type = response_types_map.get(str(response_data.status), None)
if response_type == "bytearray":
response_data.data = response_data.data
else:
match = None
content_type = response_data.getheader("content-type")
if content_type is not None:
match = re.search(r"charset=([a-zA-Z\-\d]+)[\s;]?", content_type)
encoding = match.group(1) if match else "utf-8"
response_data.data = response_data.data.decode(encoding)
# deserialize response data
if response_type == "bytearray":
return_data = response_data.data
elif response_type:
return_data = self.deserialize(response_data, response_type)
else:
return_data = None
if _return_http_data_only:
return return_data
else:
return ApiResponse(
status_code=response_data.status,
data=return_data,
headers=response_data.getheaders(),
raw_data=response_data.data,
)
def sanitize_for_serialization(self, obj):
"""Builds a JSON POST object.
If obj is None, return None.
If obj is str, int, long, float, bool, return directly.
If obj is datetime.datetime, datetime.date
convert to string in iso8601 format.
If obj is list, sanitize each element in the list.
If obj is dict, return the dict.
If obj is OpenAPI model, return the properties dict.
:param obj: The data to serialize.
:return: The serialized form of data.
"""
if obj is None:
return None
elif isinstance(obj, self.PRIMITIVE_TYPES):
return obj
elif isinstance(obj, list):
return [self.sanitize_for_serialization(sub_obj) for sub_obj in obj]
elif isinstance(obj, tuple):
return tuple(self.sanitize_for_serialization(sub_obj) for sub_obj in obj)
elif isinstance(obj, (datetime.datetime, datetime.date)):
return obj.isoformat()
if isinstance(obj, dict):
obj_dict = obj
else:
# Convert model obj to dict except
# attributes `openapi_types`, `attribute_map`
# and attributes which value is not None.
# Convert attribute name to json key in
# model definition for request.
obj_dict = obj.to_dict()
return {
key: self.sanitize_for_serialization(val) for key, val in obj_dict.items()
}
def deserialize(self, response, response_type):
"""Deserializes response into an object.
:param response: RESTResponse object to be deserialized.
:param response_type: class literal for
deserialized object, or string of class name.
:return: deserialized object.
"""
# handle file downloading
# save response body into a tmp file and return the instance
if response_type == "file":
return self.__deserialize_file(response)
# fetch data from response object
try:
data = json.loads(response.data)
except ValueError:
data = response.data
return self.__deserialize(data, response_type)
def __deserialize(self, data, klass):
"""Deserializes dict, list, str into an object.
:param data: dict, list or str.
:param klass: class literal, or string of class name.
:return: object.
"""
if data is None:
return None
if type(klass) == str:
if klass.startswith("List["):
sub_kls = re.match(r"List\[(.*)]", klass).group(1)
return [self.__deserialize(sub_data, sub_kls) for sub_data in data]
if klass.startswith("Dict["):
sub_kls = re.match(r"Dict\[([^,]*), (.*)]", klass).group(2)
return {k: self.__deserialize(v, sub_kls) for k, v in data.items()}
# convert str to class
if klass in self.NATIVE_TYPES_MAPPING:
klass = self.NATIVE_TYPES_MAPPING[klass]
else:
klass = getattr(agbenchmark.agent_protocol_client.models, klass)
if klass in self.PRIMITIVE_TYPES:
return self.__deserialize_primitive(data, klass)
elif klass == object:
return self.__deserialize_object(data)
elif klass == datetime.date:
return self.__deserialize_date(data)
elif klass == datetime.datetime:
return self.__deserialize_datetime(data)
else:
return self.__deserialize_model(data, klass)
def call_api(
self,
resource_path,
method,
path_params=None,
query_params=None,
header_params=None,
body=None,
post_params=None,
files=None,
response_types_map=None,
auth_settings=None,
async_req=None,
_return_http_data_only=None,
collection_formats=None,
_preload_content=True,
_request_timeout=None,
_host=None,
_request_auth=None,
):
"""Makes the HTTP request (synchronous) and returns deserialized data.
To make an async_req request, set the async_req parameter.
:param resource_path: Path to method endpoint.
:param method: Method to call.
:param path_params: Path parameters in the url.
:param query_params: Query parameters in the url.
:param header_params: Header parameters to be
placed in the request header.
:param body: Request body.
:param post_params dict: Request post form parameters,
for `application/x-www-form-urlencoded`, `multipart/form-data`.
:param auth_settings list: Auth Settings names for the request.
:param response: Response data type.
:param files dict: key -> filename, value -> filepath,
for `multipart/form-data`.
:param async_req bool: execute request asynchronously
:param _return_http_data_only: response data instead of ApiResponse
object with status code, headers, etc
:param _preload_content: if False, the ApiResponse.data will
be set to none and raw_data will store the
HTTP response body without reading/decoding.
Default is True.
:param collection_formats: dict of collection formats for path, query,
header, and post parameters.
:param _request_timeout: timeout setting for this request. If one
number provided, it will be total request
timeout. It can also be a pair (tuple) of
(connection, read) timeouts.
:param _request_auth: set to override the auth_settings for an a single
request; this effectively ignores the authentication
in the spec for a single request.
:type _request_token: dict, optional
:return:
If async_req parameter is True,
the request will be called asynchronously.
The method will return the request thread.
If parameter async_req is False or missing,
then the method will return the response directly.
"""
if not async_req:
return self.__call_api(
resource_path,
method,
path_params,
query_params,
header_params,
body,
post_params,
files,
response_types_map,
auth_settings,
_return_http_data_only,
collection_formats,
_preload_content,
_request_timeout,
_host,
_request_auth,
)
return self.pool.apply_async(
self.__call_api,
(
resource_path,
method,
path_params,
query_params,
header_params,
body,
post_params,
files,
response_types_map,
auth_settings,
_return_http_data_only,
collection_formats,
_preload_content,
_request_timeout,
_host,
_request_auth,
),
)
def request(
self,
method,
url,
query_params=None,
headers=None,
post_params=None,
body=None,
_preload_content=True,
_request_timeout=None,
):
"""Makes the HTTP request using RESTClient."""
if method == "GET":
return self.rest_client.get_request(
url,
query_params=query_params,
_preload_content=_preload_content,
_request_timeout=_request_timeout,
headers=headers,
)
elif method == "HEAD":
return self.rest_client.head_request(
url,
query_params=query_params,
_preload_content=_preload_content,
_request_timeout=_request_timeout,
headers=headers,
)
elif method == "OPTIONS":
return self.rest_client.options_request(
url,
query_params=query_params,
headers=headers,
_preload_content=_preload_content,
_request_timeout=_request_timeout,
)
elif method == "POST":
return self.rest_client.post_request(
url,
query_params=query_params,
headers=headers,
post_params=post_params,
_preload_content=_preload_content,
_request_timeout=_request_timeout,
body=body,
)
elif method == "PUT":
return self.rest_client.put_request(
url,
query_params=query_params,
headers=headers,
post_params=post_params,
_preload_content=_preload_content,
_request_timeout=_request_timeout,
body=body,
)
elif method == "PATCH":
return self.rest_client.patch_request(
url,
query_params=query_params,
headers=headers,
post_params=post_params,
_preload_content=_preload_content,
_request_timeout=_request_timeout,
body=body,
)
elif method == "DELETE":
return self.rest_client.delete_request(
url,
query_params=query_params,
headers=headers,
_preload_content=_preload_content,
_request_timeout=_request_timeout,
body=body,
)
else:
raise ApiValueError(
"http method must be `GET`, `HEAD`, `OPTIONS`,"
" `POST`, `PATCH`, `PUT` or `DELETE`."
)
def parameters_to_tuples(self, params, collection_formats):
"""Get parameters as list of tuples, formatting collections.
:param params: Parameters as dict or list of two-tuples
:param dict collection_formats: Parameter collection formats
:return: Parameters as list of tuples, collections formatted
"""
new_params = []
if collection_formats is None:
collection_formats = {}
for k, v in (
params.items() if isinstance(params, dict) else params
): # noqa: E501
if k in collection_formats:
collection_format = collection_formats[k]
if collection_format == "multi":
new_params.extend((k, value) for value in v)
else:
if collection_format == "ssv":
delimiter = " "
elif collection_format == "tsv":
delimiter = "\t"
elif collection_format == "pipes":
delimiter = "|"
else: # csv is the default
delimiter = ","
new_params.append((k, delimiter.join(str(value) for value in v)))
else:
new_params.append((k, v))
return new_params
def parameters_to_url_query(self, params, collection_formats):
"""Get parameters as list of tuples, formatting collections.
:param params: Parameters as dict or list of two-tuples
:param dict collection_formats: Parameter collection formats
:return: URL query string (e.g. a=Hello%20World&b=123)
"""
new_params = []
if collection_formats is None:
collection_formats = {}
for k, v in (
params.items() if isinstance(params, dict) else params
): # noqa: E501
if isinstance(v, (int, float)):
v = str(v)
if isinstance(v, bool):
v = str(v).lower()
if isinstance(v, dict):
v = json.dumps(v)
if k in collection_formats:
collection_format = collection_formats[k]
if collection_format == "multi":
new_params.extend((k, value) for value in v)
else:
if collection_format == "ssv":
delimiter = " "
elif collection_format == "tsv":
delimiter = "\t"
elif collection_format == "pipes":
delimiter = "|"
else: # csv is the default
delimiter = ","
new_params.append(
(k, delimiter.join(quote(str(value)) for value in v))
)
else:
new_params.append((k, quote(str(v))))
return "&".join(["=".join(item) for item in new_params])
def files_parameters(self, files=None):
"""Builds form parameters.
:param files: File parameters.
:return: Form parameters with files.
"""
params = []
if files:
for k, v in files.items():
if not v:
continue
file_names = v if type(v) is list else [v]
for n in file_names:
with open(n, "rb") as f:
filename = os.path.basename(f.name)
filedata = f.read()
mimetype = (
mimetypes.guess_type(filename)[0]
or "application/octet-stream"
)
params.append(tuple([k, tuple([filename, filedata, mimetype])]))
return params
def select_header_accept(self, accepts):
"""Returns `Accept` based on an array of accepts provided.
:param accepts: List of headers.
:return: Accept (e.g. application/json).
"""
if not accepts:
return
for accept in accepts:
if re.search("json", accept, re.IGNORECASE):
return accept
return accepts[0]
def select_header_content_type(self, content_types):
"""Returns `Content-Type` based on an array of content_types provided.
:param content_types: List of content-types.
:return: Content-Type (e.g. application/json).
"""
if not content_types:
return None
for content_type in content_types:
if re.search("json", content_type, re.IGNORECASE):
return content_type
return content_types[0]
def update_params_for_auth(
self,
headers,
queries,
auth_settings,
resource_path,
method,
body,
request_auth=None,
):
"""Updates header and query params based on authentication setting.
:param headers: Header parameters dict to be updated.
:param queries: Query parameters tuple list to be updated.
:param auth_settings: Authentication setting identifiers list.
:resource_path: A string representation of the HTTP request resource path.
:method: A string representation of the HTTP request method.
:body: A object representing the body of the HTTP request.
The object type is the return value of sanitize_for_serialization().
:param request_auth: if set, the provided settings will
override the token in the configuration.
"""
if not auth_settings:
return
if request_auth:
self._apply_auth_params(
headers, queries, resource_path, method, body, request_auth
)
return
for auth in auth_settings:
auth_setting = self.configuration.auth_settings().get(auth)
if auth_setting:
self._apply_auth_params(
headers, queries, resource_path, method, body, auth_setting
)
def _apply_auth_params(
self, headers, queries, resource_path, method, body, auth_setting
):
"""Updates the request parameters based on a single auth_setting
:param headers: Header parameters dict to be updated.
:param queries: Query parameters tuple list to be updated.
:resource_path: A string representation of the HTTP request resource path.
:method: A string representation of the HTTP request method.
:body: A object representing the body of the HTTP request.
The object type is the return value of sanitize_for_serialization().
:param auth_setting: auth settings for the endpoint
"""
if auth_setting["in"] == "cookie":
headers["Cookie"] = auth_setting["value"]
elif auth_setting["in"] == "header":
if auth_setting["type"] != "http-signature":
headers[auth_setting["key"]] = auth_setting["value"]
elif auth_setting["in"] == "query":
queries.append((auth_setting["key"], auth_setting["value"]))
else:
raise ApiValueError("Authentication token must be in `query` or `header`")
def __deserialize_file(self, response):
"""Deserializes body to file
Saves response body into a file in a temporary folder,
using the filename from the `Content-Disposition` header if provided.
:param response: RESTResponse.
:return: file path.
"""
fd, path = tempfile.mkstemp(dir=self.configuration.temp_folder_path)
os.close(fd)
os.remove(path)
content_disposition = response.getheader("Content-Disposition")
if content_disposition:
filename = re.search(
r'filename=[\'"]?([^\'"\s]+)[\'"]?', content_disposition
).group(1)
path = os.path.join(os.path.dirname(path), filename)
with open(path, "wb") as f:
f.write(response.data)
return path
def __deserialize_primitive(self, data, klass):
"""Deserializes string to primitive type.
:param data: str.
:param klass: class literal.
:return: int, long, float, str, bool.
"""
try:
return klass(data)
except UnicodeEncodeError:
return str(data)
except TypeError:
return data
def __deserialize_object(self, value):
"""Return an original value.
:return: object.
"""
return value
def __deserialize_date(self, string):
"""Deserializes string to date.
:param string: str.
:return: date.
"""
try:
return parse(string).date()
except ImportError:
return string
except ValueError:
raise rest.ApiException(
status=0, reason="Failed to parse `{0}` as date object".format(string)
)
def __deserialize_datetime(self, string):
"""Deserializes string to datetime.
The string should be in iso8601 datetime format.
:param string: str.
:return: datetime.
"""
try:
return parse(string)
except ImportError:
return string
except ValueError:
raise rest.ApiException(
status=0,
reason=("Failed to parse `{0}` as datetime object".format(string)),
)
def __deserialize_model(self, data, klass):
"""Deserializes list or dict to model.
:param data: dict, list.
:param klass: class literal.
:return: model object.
"""
return klass.from_dict(data)


@@ -1,28 +0,0 @@
"""API response object."""
from __future__ import annotations
from typing import Any, Dict, Optional
from pydantic import Field, StrictInt, StrictStr
class ApiResponse:
"""
API response object
"""
status_code: Optional[StrictInt] = Field(None, description="HTTP status code")
headers: Optional[Dict[StrictStr, StrictStr]] = Field(
None, description="HTTP headers"
)
data: Optional[Any] = Field(
None, description="Deserialized data given the data type"
)
raw_data: Optional[Any] = Field(None, description="Raw data (HTTP response body)")
def __init__(self, status_code=None, headers=None, data=None, raw_data=None):
self.status_code = status_code
self.headers = headers
self.data = data
self.raw_data = raw_data


@@ -1,447 +0,0 @@
# coding: utf-8
"""
Agent Communication Protocol
Specification of the API protocol for communication with an agent. # noqa: E501
The version of the OpenAPI document: v0.2
Generated by OpenAPI Generator (https://openapi-generator.tech)
Do not edit the class manually.
"""
import copy
import http.client as httplib
import logging
import sys
import urllib3
JSON_SCHEMA_VALIDATION_KEYWORDS = {
"multipleOf",
"maximum",
"exclusiveMaximum",
"minimum",
"exclusiveMinimum",
"maxLength",
"minLength",
"pattern",
"maxItems",
"minItems",
}
class Configuration(object):
"""This class contains various settings of the API client.
:param host: Base url.
:param api_key: Dict to store API key(s).
Each entry in the dict specifies an API key.
The dict key is the name of the security scheme in the OAS specification.
The dict value is the API key secret.
:param api_key_prefix: Dict to store API prefix (e.g. Bearer).
The dict key is the name of the security scheme in the OAS specification.
The dict value is an API key prefix when generating the auth data.
:param username: Username for HTTP basic authentication.
:param password: Password for HTTP basic authentication.
:param access_token: Access token.
:param server_index: Index to servers configuration.
:param server_variables: Mapping with string values to replace variables in
templated server configuration. The validation of enums is performed for
variables with defined enum values before.
:param server_operation_index: Mapping from operation ID to an index to server
configuration.
:param server_operation_variables: Mapping from operation ID to a mapping with
string values to replace variables in templated server configuration.
The validation of enums is performed for variables with defined enum values before.
:param ssl_ca_cert: str - the path to a file of concatenated CA certificates
in PEM format.
"""
_default = None
def __init__(
self,
host=None,
api_key=None,
api_key_prefix=None,
username=None,
password=None,
access_token=None,
server_index=None,
server_variables=None,
server_operation_index=None,
server_operation_variables=None,
ssl_ca_cert=None,
):
"""Constructor"""
self._base_path = "http://localhost" if host is None else host
"""Default Base url
"""
self.server_index = 0 if server_index is None and host is None else server_index
self.server_operation_index = server_operation_index or {}
"""Default server index
"""
self.server_variables = server_variables or {}
self.server_operation_variables = server_operation_variables or {}
"""Default server variables
"""
self.temp_folder_path = None
"""Temp file folder for downloading files
"""
# Authentication Settings
self.api_key = {}
if api_key:
self.api_key = api_key
"""dict to store API key(s)
"""
self.api_key_prefix = {}
if api_key_prefix:
self.api_key_prefix = api_key_prefix
"""dict to store API prefix (e.g. Bearer)
"""
self.refresh_api_key_hook = None
"""function hook to refresh API key if expired
"""
self.username = username
"""Username for HTTP basic authentication
"""
self.password = password
"""Password for HTTP basic authentication
"""
self.access_token = access_token
"""Access token
"""
self.logger = {}
"""Logging Settings
"""
self.logger["package_logger"] = logging.getLogger("agent_protocol_client")
self.logger["urllib3_logger"] = logging.getLogger("urllib3")
self.logger_format = "%(asctime)s %(levelname)s %(message)s"
"""Log format
"""
self.logger_stream_handler = None
"""Log stream handler
"""
self.logger_file_handler = None
"""Log file handler
"""
self.logger_file = None
"""Debug file location
"""
self.debug = False
"""Debug switch
"""
self.verify_ssl = True
"""SSL/TLS verification
Set this to False to skip verifying the SSL certificate when calling the
API over HTTPS.
"""
self.ssl_ca_cert = ssl_ca_cert
"""Set this to customize the certificate file to verify the peer.
"""
self.cert_file = None
"""client certificate file
"""
self.key_file = None
"""client key file
"""
self.assert_hostname = None
"""Set this to True/False to enable/disable SSL hostname verification.
"""
self.tls_server_name = None
"""SSL/TLS Server Name Indication (SNI)
Set this to the SNI value expected by the server.
"""
self.connection_pool_maxsize = 100
"""This value is passed to the aiohttp to limit simultaneous connections.
Default values is 100, None means no-limit.
"""
self.proxy = None
"""Proxy URL
"""
self.proxy_headers = None
"""Proxy headers
"""
self.safe_chars_for_path_param = ""
"""Safe chars for path_param
"""
self.retries = None
"""Adding retries to override urllib3 default value 3
"""
# Enable client side validation
self.client_side_validation = True
self.socket_options = None
"""Options to pass down to the underlying urllib3 socket
"""
self.datetime_format = "%Y-%m-%dT%H:%M:%S.%f%z"
"""datetime format
"""
self.date_format = "%Y-%m-%d"
"""date format
"""
def __deepcopy__(self, memo):
cls = self.__class__
result = cls.__new__(cls)
memo[id(self)] = result
for k, v in self.__dict__.items():
if k not in ("logger", "logger_file_handler"):
setattr(result, k, copy.deepcopy(v, memo))
# shallow copy of loggers
result.logger = copy.copy(self.logger)
# use setters to configure loggers
result.logger_file = self.logger_file
result.debug = self.debug
return result
def __setattr__(self, name, value):
object.__setattr__(self, name, value)
@classmethod
def set_default(cls, default):
"""Set default instance of configuration.
It stores default configuration, which can be
returned by get_default_copy method.
:param default: object of Configuration
"""
cls._default = default
@classmethod
def get_default_copy(cls):
"""Deprecated. Please use `get_default` instead.
Deprecated. Please use `get_default` instead.
:return: The configuration object.
"""
return cls.get_default()
@classmethod
def get_default(cls):
"""Return the default configuration.
Returns the default Configuration instance, creating one with the
default constructor if no default has been set yet.
:return: The configuration object.
"""
if cls._default is None:
cls._default = Configuration()
return cls._default
@property
def logger_file(self):
"""The logger file.
If the logger_file is None, then add stream handler and remove file
handler. Otherwise, add file handler and remove stream handler.
:param value: The logger_file path.
:type: str
"""
return self.__logger_file
@logger_file.setter
def logger_file(self, value):
"""The logger file.
If the logger_file is None, then add stream handler and remove file
handler. Otherwise, add file handler and remove stream handler.
:param value: The logger_file path.
:type: str
"""
self.__logger_file = value
if self.__logger_file:
# If set logging file,
# then add file handler and remove stream handler.
self.logger_file_handler = logging.FileHandler(self.__logger_file)
self.logger_file_handler.setFormatter(self.logger_formatter)
for _, logger in self.logger.items():
logger.addHandler(self.logger_file_handler)
@property
def debug(self):
"""Debug status
:param value: The debug status, True or False.
:type: bool
"""
return self.__debug
@debug.setter
def debug(self, value):
"""Debug status
:param value: The debug status, True or False.
:type: bool
"""
self.__debug = value
if self.__debug:
# if debug status is True, turn on debug logging
for _, logger in self.logger.items():
logger.setLevel(logging.DEBUG)
# turn on httplib debug
httplib.HTTPConnection.debuglevel = 1
else:
# if debug status is False, turn off debug logging,
# setting log level to default `logging.WARNING`
for _, logger in self.logger.items():
logger.setLevel(logging.WARNING)
# turn off httplib debug
httplib.HTTPConnection.debuglevel = 0
@property
def logger_format(self):
"""The logger format.
The logger_formatter is updated when logger_format is set.
:param value: The format string.
:type: str
"""
return self.__logger_format
@logger_format.setter
def logger_format(self, value):
"""The logger format.
The logger_formatter is updated when logger_format is set.
:param value: The format string.
:type: str
"""
self.__logger_format = value
self.logger_formatter = logging.Formatter(self.__logger_format)
def get_api_key_with_prefix(self, identifier, alias=None):
"""Gets API key (with prefix if set).
:param identifier: The identifier of apiKey.
:param alias: The alternative identifier of apiKey.
:return: The token for api key authentication.
"""
if self.refresh_api_key_hook is not None:
self.refresh_api_key_hook(self)
key = self.api_key.get(
identifier, self.api_key.get(alias) if alias is not None else None
)
if key:
prefix = self.api_key_prefix.get(identifier)
if prefix:
return "%s %s" % (prefix, key)
else:
return key
def get_basic_auth_token(self):
"""Gets HTTP basic authentication header (string).
:return: The token for basic HTTP authentication.
"""
username = ""
if self.username is not None:
username = self.username
password = ""
if self.password is not None:
password = self.password
return urllib3.util.make_headers(basic_auth=username + ":" + password).get(
"authorization"
)
def auth_settings(self):
"""Gets Auth Settings dict for api client.
:return: The Auth Settings information dict.
"""
auth = {}
return auth
def to_debug_report(self):
"""Gets the essential information for debugging.
:return: The report for debugging.
"""
return (
"Python SDK Debug Report:\n"
"OS: {env}\n"
"Python Version: {pyversion}\n"
"Version of the API: v0.2\n"
"SDK Package Version: 1.0.0".format(env=sys.platform, pyversion=sys.version)
)
def get_host_settings(self):
"""Gets an array of host settings
:return: An array of host settings
"""
return [
{
"url": "",
"description": "No description provided",
}
]
def get_host_from_settings(self, index, variables=None, servers=None):
"""Gets host URL based on the index and variables
:param index: array index of the host settings
:param variables: hash of variable and the corresponding value
:param servers: an array of host settings or None
:return: URL based on host settings
"""
if index is None:
return self._base_path
variables = {} if variables is None else variables
servers = self.get_host_settings() if servers is None else servers
try:
server = servers[index]
except IndexError:
raise ValueError(
"Invalid index {0} when selecting the host settings. "
"Must be less than {1}".format(index, len(servers))
)
url = server["url"]
# go through variables and replace placeholders
for variable_name, variable in server.get("variables", {}).items():
used_value = variables.get(variable_name, variable["default_value"])
if "enum_values" in variable and used_value not in variable["enum_values"]:
raise ValueError(
"The variable `{0}` in the host URL has invalid value "
"{1}. Must be {2}.".format(
variable_name, variables[variable_name], variable["enum_values"]
)
)
url = url.replace("{" + variable_name + "}", used_value)
return url
@property
def host(self):
"""Return generated host."""
return self.get_host_from_settings(
self.server_index, variables=self.server_variables
)
@host.setter
def host(self, value):
"""Fix base path."""
self._base_path = value
self.server_index = None
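With this vendored `configuration.py` removed, client settings come from the external `agent-protocol-client` dependency instead. A minimal sketch of the equivalent setup, assuming the external package exposes the same `Configuration` interface (both are generated from the same Agent Protocol OpenAPI spec):

```python
# Illustrative sketch only: configuring the external agent-protocol-client,
# which replaces the vendored configuration.py above.
from agent_protocol_client import Configuration

config = Configuration(host="http://localhost:8000")
config.debug = True  # the debug setter switches the package loggers to DEBUG
```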


@@ -1,615 +0,0 @@
# agbenchmark.agent_protocol_client.AgentApi
All URIs are relative to _http://localhost_
| Method | HTTP request | Description |
| ---------------------------------------------------------------------------- | ------------------------------------------------------ | ------------------------------------------------------------- |
| [**create_agent_task**](AgentApi.md#create_agent_task) | **POST** /agent/tasks | Creates a task for the agent. |
| [**download_agent_task_artifact**](AgentApi.md#download_agent_task_artifact) | **GET** /agent/tasks/{task_id}/artifacts/{artifact_id} | Download a specified artifact. |
| [**execute_agent_task_step**](AgentApi.md#execute_agent_task_step) | **POST** /agent/tasks/{task_id}/steps | Execute a step in the specified agent task. |
| [**get_agent_task**](AgentApi.md#get_agent_task) | **GET** /agent/tasks/{task_id} | Get details about a specified agent task. |
| [**get_agent_task_step**](AgentApi.md#get_agent_task_step) | **GET** /agent/tasks/{task_id}/steps/{step_id} | Get details about a specified task step. |
| [**list_agent_task_artifacts**](AgentApi.md#list_agent_task_artifacts) | **GET** /agent/tasks/{task_id}/artifacts | List all artifacts that have been created for the given task. |
| [**list_agent_task_steps**](AgentApi.md#list_agent_task_steps) | **GET** /agent/tasks/{task_id}/steps | List all steps for the specified task. |
| [**list_agent_tasks_ids**](AgentApi.md#list_agent_tasks_ids) | **GET** /agent/tasks | List all tasks that have been created for the agent. |
| [**upload_agent_task_artifacts**](AgentApi.md#upload_agent_task_artifacts) | **POST** /agent/tasks/{task_id}/artifacts | Upload an artifact for the specified task. |
# **create_agent_task**
> Task create_agent_task(task_request_body=task_request_body)
Creates a task for the agent.
### Example
```python
import time
import os
import agbenchmark.agent_protocol_client
from agbenchmark.agent_protocol_client.models.task import Task
from agbenchmark.agent_protocol_client.models.task_request_body import TaskRequestBody
from agbenchmark.agent_protocol_client.rest import ApiException
from pprint import pprint
# Defining the host is optional and defaults to http://localhost
# See configuration.py for a list of all supported configuration parameters.
configuration = agbenchmark.agent_protocol_client.Configuration(
host = "http://localhost"
)
# Enter a context with an instance of the API client
async with agbenchmark.agent_protocol_client.ApiClient(configuration) as api_client:
# Create an instance of the API class
api_instance = agbenchmark.agent_protocol_client.AgentApi(api_client)
task_request_body = agbenchmark.agent_protocol_client.TaskRequestBody() # TaskRequestBody | (optional)
try:
# Creates a task for the agent.
api_response = await api_instance.create_agent_task(task_request_body=task_request_body)
print("The response of AgentApi->create_agent_task:\n")
pprint(api_response)
except Exception as e:
print("Exception when calling AgentApi->create_agent_task: %s\n" % e)
```
### Parameters
| Name | Type | Description | Notes |
| --------------------- | ----------------------------------------- | ----------- | ---------- |
| **task_request_body** | [**TaskRequestBody**](TaskRequestBody.md) | | [optional] |
### Return type
[**Task**](Task.md)
### Authorization
No authorization required
### HTTP request headers
- **Content-Type**: application/json
- **Accept**: application/json
### HTTP response details
| Status code | Description | Response headers |
| ----------- | ------------------------------------------ | ---------------- |
| **200** | A new agent task was successfully created. | - |
| **0** | Internal Server Error | - |
[[Back to top]](#) [[Back to API list]](../README.md#documentation-for-api-endpoints) [[Back to Model list]](../README.md#documentation-for-models) [[Back to README]](../README.md)
# **download_agent_task_artifact**
> bytearray download_agent_task_artifact(task_id, artifact_id)
Download a specified artifact.
### Example
```python
import time
import os
import agbenchmark.agent_protocol_client
from agbenchmark.agent_protocol_client.rest import ApiException
from pprint import pprint
# Defining the host is optional and defaults to http://localhost
# See configuration.py for a list of all supported configuration parameters.
configuration = agbenchmark.agent_protocol_client.Configuration(
host = "http://localhost"
)
# Enter a context with an instance of the API client
async with agbenchmark.agent_protocol_client.ApiClient(configuration) as api_client:
# Create an instance of the API class
api_instance = agbenchmark.agent_protocol_client.AgentApi(api_client)
task_id = 'task_id_example' # str | ID of the task
artifact_id = 'artifact_id_example' # str | ID of the artifact
try:
# Download a specified artifact.
api_response = await api_instance.download_agent_task_artifact(task_id, artifact_id)
print("The response of AgentApi->download_agent_task_artifact:\n")
pprint(api_response)
except Exception as e:
print("Exception when calling AgentApi->download_agent_task_artifact: %s\n" % e)
```
### Parameters
| Name | Type | Description | Notes |
| --------------- | ------- | ------------------ | ----- |
| **task_id** | **str** | ID of the task |
| **artifact_id** | **str** | ID of the artifact |
### Return type
**bytearray**
### Authorization
No authorization required
### HTTP request headers
- **Content-Type**: Not defined
- **Accept**: application/octet-stream
### HTTP response details
| Status code | Description | Response headers |
| ----------- | ------------------------------------- | ---------------- |
| **200** | Returned the content of the artifact. | - |
| **0** | Internal Server Error | - |
[[Back to top]](#) [[Back to API list]](../README.md#documentation-for-api-endpoints) [[Back to Model list]](../README.md#documentation-for-models) [[Back to README]](../README.md)
# **execute_agent_task_step**
> Step execute_agent_task_step(task_id, step_request_body=step_request_body)
Execute a step in the specified agent task.
### Example
```python
import time
import os
import agbenchmark.agent_protocol_client
from agbenchmark.agent_protocol_client.models.step import Step
from agbenchmark.agent_protocol_client.models.step_request_body import StepRequestBody
from agbenchmark.agent_protocol_client.rest import ApiException
from pprint import pprint
# Defining the host is optional and defaults to http://localhost
# See configuration.py for a list of all supported configuration parameters.
configuration = agbenchmark.agent_protocol_client.Configuration(
host = "http://localhost"
)
# Enter a context with an instance of the API client
async with agbenchmark.agent_protocol_client.ApiClient(configuration) as api_client:
# Create an instance of the API class
api_instance = agbenchmark.agent_protocol_client.AgentApi(api_client)
task_id = 'task_id_example' # str | ID of the task
step_request_body = agbenchmark.agent_protocol_client.StepRequestBody() # StepRequestBody | (optional)
try:
# Execute a step in the specified agent task.
api_response = await api_instance.execute_agent_task_step(task_id, step_request_body=step_request_body)
print("The response of AgentApi->execute_agent_task_step:\n")
pprint(api_response)
except Exception as e:
print("Exception when calling AgentApi->execute_agent_task_step: %s\n" % e)
```
### Parameters
| Name | Type | Description | Notes |
| --------------------- | ----------------------------------------- | -------------- | ---------- |
| **task_id** | **str** | ID of the task |
| **step_request_body** | [**StepRequestBody**](StepRequestBody.md) | | [optional] |
### Return type
[**Step**](Step.md)
### Authorization
No authorization required
### HTTP request headers
- **Content-Type**: application/json
- **Accept**: application/json
### HTTP response details
| Status code | Description | Response headers |
| ----------- | --------------------------------- | ---------------- |
| **200** | Executed step for the agent task. | - |
| **0** | Internal Server Error | - |
[[Back to top]](#) [[Back to API list]](../README.md#documentation-for-api-endpoints) [[Back to Model list]](../README.md#documentation-for-models) [[Back to README]](../README.md)
# **get_agent_task**
> Task get_agent_task(task_id)
Get details about a specified agent task.
### Example
```python
import time
import os
import agbenchmark.agent_protocol_client
from agbenchmark.agent_protocol_client.models.task import Task
from agbenchmark.agent_protocol_client.rest import ApiException
from pprint import pprint
# Defining the host is optional and defaults to http://localhost
# See configuration.py for a list of all supported configuration parameters.
configuration = agbenchmark.agent_protocol_client.Configuration(
host = "http://localhost"
)
# Enter a context with an instance of the API client
async with agbenchmark.agent_protocol_client.ApiClient(configuration) as api_client:
# Create an instance of the API class
api_instance = agbenchmark.agent_protocol_client.AgentApi(api_client)
task_id = 'task_id_example' # str | ID of the task
try:
# Get details about a specified agent task.
api_response = await api_instance.get_agent_task(task_id)
print("The response of AgentApi->get_agent_task:\n")
pprint(api_response)
except Exception as e:
print("Exception when calling AgentApi->get_agent_task: %s\n" % e)
```
### Parameters
| Name | Type | Description | Notes |
| ----------- | ------- | -------------- | ----- |
| **task_id** | **str** | ID of the task |
### Return type
[**Task**](Task.md)
### Authorization
No authorization required
### HTTP request headers
- **Content-Type**: Not defined
- **Accept**: application/json
### HTTP response details
| Status code | Description | Response headers |
| ----------- | ------------------------------------- | ---------------- |
| **200** | Returned details about an agent task. | - |
| **0** | Internal Server Error | - |
[[Back to top]](#) [[Back to API list]](../README.md#documentation-for-api-endpoints) [[Back to Model list]](../README.md#documentation-for-models) [[Back to README]](../README.md)
# **get_agent_task_step**
> Step get_agent_task_step(task_id, step_id)
Get details about a specified task step.
### Example
```python
import time
import os
import agbenchmark.agent_protocol_client
from agbenchmark.agent_protocol_client.models.step import Step
from agbenchmark.agent_protocol_client.rest import ApiException
from pprint import pprint
# Defining the host is optional and defaults to http://localhost
# See configuration.py for a list of all supported configuration parameters.
configuration = agbenchmark.agent_protocol_client.Configuration(
host = "http://localhost"
)
# Enter a context with an instance of the API client
async with agbenchmark.agent_protocol_client.ApiClient(configuration) as api_client:
# Create an instance of the API class
api_instance = agbenchmark.agent_protocol_client.AgentApi(api_client)
task_id = 'task_id_example' # str | ID of the task
step_id = 'step_id_example' # str | ID of the step
try:
# Get details about a specified task step.
api_response = await api_instance.get_agent_task_step(task_id, step_id)
print("The response of AgentApi->get_agent_task_step:\n")
pprint(api_response)
except Exception as e:
print("Exception when calling AgentApi->get_agent_task_step: %s\n" % e)
```
### Parameters
| Name | Type | Description | Notes |
| ----------- | ------- | -------------- | ----- |
| **task_id** | **str** | ID of the task |
| **step_id** | **str** | ID of the step |
### Return type
[**Step**](Step.md)
### Authorization
No authorization required
### HTTP request headers
- **Content-Type**: Not defined
- **Accept**: application/json
### HTTP response details
| Status code | Description | Response headers |
| ----------- | ------------------------------------------ | ---------------- |
| **200** | Returned details about an agent task step. | - |
| **0** | Internal Server Error | - |
[[Back to top]](#) [[Back to API list]](../README.md#documentation-for-api-endpoints) [[Back to Model list]](../README.md#documentation-for-models) [[Back to README]](../README.md)
# **list_agent_task_artifacts**
> List[Artifact] list_agent_task_artifacts(task_id)
List all artifacts that have been created for the given task.
### Example
```python
import time
import os
import agbenchmark.agent_protocol_client
from agbenchmark.agent_protocol_client.models.artifact import Artifact
from agbenchmark.agent_protocol_client.rest import ApiException
from pprint import pprint
# Defining the host is optional and defaults to http://localhost
# See configuration.py for a list of all supported configuration parameters.
configuration = agbenchmark.agent_protocol_client.Configuration(
host = "http://localhost"
)
# Enter a context with an instance of the API client
async with agbenchmark.agent_protocol_client.ApiClient(configuration) as api_client:
# Create an instance of the API class
api_instance = agbenchmark.agent_protocol_client.AgentApi(api_client)
task_id = 'task_id_example' # str | ID of the task
try:
# List all artifacts that have been created for the given task.
api_response = await api_instance.list_agent_task_artifacts(task_id)
print("The response of AgentApi->list_agent_task_artifacts:\n")
pprint(api_response)
except Exception as e:
print("Exception when calling AgentApi->list_agent_task_artifacts: %s\n" % e)
```
### Parameters
| Name | Type | Description | Notes |
| ----------- | ------- | -------------- | ----- |
| **task_id** | **str** | ID of the task |
### Return type
[**List[Artifact]**](Artifact.md)
### Authorization
No authorization required
### HTTP request headers
- **Content-Type**: Not defined
- **Accept**: application/json
### HTTP response details
| Status code | Description | Response headers |
| ----------- | ------------------------------------- | ---------------- |
| **200** | Returned the content of the artifact. | - |
| **0** | Internal Server Error | - |
[[Back to top]](#) [[Back to API list]](../README.md#documentation-for-api-endpoints) [[Back to Model list]](../README.md#documentation-for-models) [[Back to README]](../README.md)
# **list_agent_task_steps**
> List[str] list_agent_task_steps(task_id)
List all steps for the specified task.
### Example
```python
import time
import os
import agbenchmark.agent_protocol_client
from agbenchmark.agent_protocol_client.rest import ApiException
from pprint import pprint
# Defining the host is optional and defaults to http://localhost
# See configuration.py for a list of all supported configuration parameters.
configuration = agbenchmark.agent_protocol_client.Configuration(
host = "http://localhost"
)
# Enter a context with an instance of the API client
async with agbenchmark.agent_protocol_client.ApiClient(configuration) as api_client:
# Create an instance of the API class
api_instance = agbenchmark.agent_protocol_client.AgentApi(api_client)
task_id = 'task_id_example' # str | ID of the task
try:
# List all steps for the specified task.
api_response = await api_instance.list_agent_task_steps(task_id)
print("The response of AgentApi->list_agent_task_steps:\n")
pprint(api_response)
except Exception as e:
print("Exception when calling AgentApi->list_agent_task_steps: %s\n" % e)
```
### Parameters
| Name | Type | Description | Notes |
| ----------- | ------- | -------------- | ----- |
| **task_id** | **str** | ID of the task |
### Return type
**List[str]**
### Authorization
No authorization required
### HTTP request headers
- **Content-Type**: Not defined
- **Accept**: application/json
### HTTP response details
| Status code | Description | Response headers |
| ----------- | ------------------------------------------------------------- | ---------------- |
| **200** | Returned list of agent&#39;s step IDs for the specified task. | - |
| **0** | Internal Server Error | - |
[[Back to top]](#) [[Back to API list]](../README.md#documentation-for-api-endpoints) [[Back to Model list]](../README.md#documentation-for-models) [[Back to README]](../README.md)
# **list_agent_tasks_ids**
> List[str] list_agent_tasks_ids()
List all tasks that have been created for the agent.
### Example
```python
import time
import os
import agbenchmark.agent_protocol_client
from agbenchmark.agent_protocol_client.rest import ApiException
from pprint import pprint
# Defining the host is optional and defaults to http://localhost
# See configuration.py for a list of all supported configuration parameters.
configuration = agbenchmark.agent_protocol_client.Configuration(
host = "http://localhost"
)
# Enter a context with an instance of the API client
async with agbenchmark.agent_protocol_client.ApiClient(configuration) as api_client:
# Create an instance of the API class
api_instance = agbenchmark.agent_protocol_client.AgentApi(api_client)
try:
# List all tasks that have been created for the agent.
api_response = await api_instance.list_agent_tasks_ids()
print("The response of AgentApi->list_agent_tasks_ids:\n")
pprint(api_response)
except Exception as e:
print("Exception when calling AgentApi->list_agent_tasks_ids: %s\n" % e)
```
### Parameters
This endpoint does not need any parameter.
### Return type
**List[str]**
### Authorization
No authorization required
### HTTP request headers
- **Content-Type**: Not defined
- **Accept**: application/json
### HTTP response details
| Status code | Description | Response headers |
| ----------- | -------------------------------------- | ---------------- |
| **200** | Returned list of agent&#39;s task IDs. | - |
| **0** | Internal Server Error | - |
[[Back to top]](#) [[Back to API list]](../README.md#documentation-for-api-endpoints) [[Back to Model list]](../README.md#documentation-for-models) [[Back to README]](../README.md)
# **upload_agent_task_artifacts**
> Artifact upload_agent_task_artifacts(task_id, file, relative_path=relative_path)
Upload an artifact for the specified task.
### Example
```python
import time
import os
import agbenchmark.agent_protocol_client
from agbenchmark.agent_protocol_client.models.artifact import Artifact
from agbenchmark.agent_protocol_client.rest import ApiException
from pprint import pprint
# Defining the host is optional and defaults to http://localhost
# See configuration.py for a list of all supported configuration parameters.
configuration = agbenchmark.agent_protocol_client.Configuration(
host = "http://localhost"
)
# Enter a context with an instance of the API client
async with agbenchmark.agent_protocol_client.ApiClient(configuration) as api_client:
# Create an instance of the API class
api_instance = agbenchmark.agent_protocol_client.AgentApi(api_client)
task_id = 'task_id_example' # str | ID of the task
file = None # bytearray | File to upload.
relative_path = 'relative_path_example' # str | Relative path of the artifact in the agent's workspace. (optional)
try:
# Upload an artifact for the specified task.
api_response = await api_instance.upload_agent_task_artifacts(task_id, file, relative_path=relative_path)
print("The response of AgentApi->upload_agent_task_artifacts:\n")
pprint(api_response)
except Exception as e:
print("Exception when calling AgentApi->upload_agent_task_artifacts: %s\n" % e)
```
### Parameters
| Name | Type | Description | Notes |
| ----------------- | ------------- | ----------------------------------------------------------- | ---------- |
| **task_id** | **str** | ID of the task |
| **file** | **bytearray** | File to upload. |
| **relative_path** | **str** | Relative path of the artifact in the agent&#39;s workspace. | [optional] |
### Return type
[**Artifact**](Artifact.md)
### Authorization
No authorization required
### HTTP request headers
- **Content-Type**: multipart/form-data
- **Accept**: application/json
### HTTP response details
| Status code | Description | Response headers |
| ----------- | ------------------------------------- | ---------------- |
| **200** | Returned the content of the artifact. | - |
| **0** | Internal Server Error | - |
[[Back to top]](#) [[Back to API list]](../README.md#documentation-for-api-endpoints) [[Back to Model list]](../README.md#documentation-for-models) [[Back to README]](../README.md)
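With these vendored docs removed along with the client, the same endpoints are reached through `agent_protocol_client` directly. A minimal end-to-end sketch, assuming the external package mirrors the method names documented above; the host and task input are placeholders:

```python
# Illustrative sketch: create a task and execute one step via the
# external agent-protocol-client package (method names assumed to match
# the vendored client documented above).
import asyncio

from agent_protocol_client import AgentApi, ApiClient, Configuration, TaskRequestBody


async def run_one_task(host: str, task_input: str) -> None:
    config = Configuration(host=host)
    async with ApiClient(config) as api_client:
        agent = AgentApi(api_client)
        task = await agent.create_agent_task(
            task_request_body=TaskRequestBody(input=task_input)
        )
        step = await agent.execute_agent_task_step(task.task_id)
        print(step.output)


asyncio.run(run_one_task("http://localhost:8000", "Write 'Hello World' to hello.txt"))
```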


@@ -1,154 +0,0 @@
# coding: utf-8
"""
Agent Communication Protocol
Specification of the API protocol for communication with an agent. # noqa: E501
The version of the OpenAPI document: v0.2
Generated by OpenAPI Generator (https://openapi-generator.tech)
Do not edit the class manually.
"""
class OpenApiException(Exception):
"""The base exception class for all OpenAPIExceptions"""
class ApiTypeError(OpenApiException, TypeError):
def __init__(self, msg, path_to_item=None, valid_classes=None, key_type=None):
"""Raises an exception for TypeErrors
Args:
msg (str): the exception message
Keyword Args:
path_to_item (list): a list of keys and indices to get to the
current_item
None if unset
valid_classes (tuple): the primitive classes that current item
should be an instance of
None if unset
key_type (bool): False if our value is a value in a dict
True if it is a key in a dict
False if our item is an item in a list
None if unset
"""
self.path_to_item = path_to_item
self.valid_classes = valid_classes
self.key_type = key_type
full_msg = msg
if path_to_item:
full_msg = "{0} at {1}".format(msg, render_path(path_to_item))
super(ApiTypeError, self).__init__(full_msg)
class ApiValueError(OpenApiException, ValueError):
def __init__(self, msg, path_to_item=None):
"""
Args:
msg (str): the exception message
Keyword Args:
path_to_item (list) the path to the exception in the
received_data dict. None if unset
"""
self.path_to_item = path_to_item
full_msg = msg
if path_to_item:
full_msg = "{0} at {1}".format(msg, render_path(path_to_item))
super(ApiValueError, self).__init__(full_msg)
class ApiAttributeError(OpenApiException, AttributeError):
def __init__(self, msg, path_to_item=None):
"""
Raised when an attribute reference or assignment fails.
Args:
msg (str): the exception message
Keyword Args:
path_to_item (None/list) the path to the exception in the
received_data dict
"""
self.path_to_item = path_to_item
full_msg = msg
if path_to_item:
full_msg = "{0} at {1}".format(msg, render_path(path_to_item))
super(ApiAttributeError, self).__init__(full_msg)
class ApiKeyError(OpenApiException, KeyError):
def __init__(self, msg, path_to_item=None):
"""
Args:
msg (str): the exception message
Keyword Args:
path_to_item (None/list) the path to the exception in the
received_data dict
"""
self.path_to_item = path_to_item
full_msg = msg
if path_to_item:
full_msg = "{0} at {1}".format(msg, render_path(path_to_item))
super(ApiKeyError, self).__init__(full_msg)
class ApiException(OpenApiException):
def __init__(self, status=None, reason=None, http_resp=None):
if http_resp:
self.status = http_resp.status
self.reason = http_resp.reason
self.body = http_resp.data
self.headers = http_resp.getheaders()
else:
self.status = status
self.reason = reason
self.body = None
self.headers = None
def __str__(self):
"""Custom error messages for exception"""
error_message = "({0})\n" "Reason: {1}\n".format(self.status, self.reason)
if self.headers:
error_message += "HTTP response headers: {0}\n".format(self.headers)
if self.body:
error_message += "HTTP response body: {0}\n".format(self.body)
return error_message
class NotFoundException(ApiException):
def __init__(self, status=None, reason=None, http_resp=None):
super(NotFoundException, self).__init__(status, reason, http_resp)
class UnauthorizedException(ApiException):
def __init__(self, status=None, reason=None, http_resp=None):
super(UnauthorizedException, self).__init__(status, reason, http_resp)
class ForbiddenException(ApiException):
def __init__(self, status=None, reason=None, http_resp=None):
super(ForbiddenException, self).__init__(status, reason, http_resp)
class ServiceException(ApiException):
def __init__(self, status=None, reason=None, http_resp=None):
super(ServiceException, self).__init__(status, reason, http_resp)
def render_path(path_to_item):
"""Returns a string representation of a path"""
result = ""
for pth in path_to_item:
if isinstance(pth, int):
result += "[{0}]".format(pth)
else:
result += "['{0}']".format(pth)
return result
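Callers that caught these exceptions keep working against the external package, which ships the same hierarchy. A hedged sketch (import path assumed to mirror the module removed above):

```python
# Illustrative sketch: handling Agent Protocol errors with the external
# package's exceptions, assumed to mirror the hierarchy above.
from agent_protocol_client.rest import ApiException


async def get_task_or_none(agent, task_id: str):
    try:
        return await agent.get_agent_task(task_id)
    except ApiException as e:
        if e.status == 404:
            return None  # unknown task ID
        raise  # __str__ (defined above) renders status, reason, headers and body
```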


@@ -1,25 +0,0 @@
# coding: utf-8
# flake8: noqa
"""
Agent Communication Protocol
Specification of the API protocol for communication with an agent. # noqa: E501
The version of the OpenAPI document: v0.2
Generated by OpenAPI Generator (https://openapi-generator.tech)
Do not edit the class manually.
"""
# import models into model package
from agbenchmark.agent_protocol_client.models.artifact import Artifact
from agbenchmark.agent_protocol_client.models.artifacts import Artifacts
from agbenchmark.agent_protocol_client.models.pagination import Pagination
from agbenchmark.agent_protocol_client.models.step import Step
from agbenchmark.agent_protocol_client.models.step_all_of import StepAllOf
from agbenchmark.agent_protocol_client.models.step_request_body import StepRequestBody
from agbenchmark.agent_protocol_client.models.task import Task
from agbenchmark.agent_protocol_client.models.task_all_of import TaskAllOf
from agbenchmark.agent_protocol_client.models.task_request_body import TaskRequestBody


@@ -1,72 +0,0 @@
# coding: utf-8
from __future__ import annotations
import json
import pprint
import re # noqa: F401
from typing import Optional
from pydantic import BaseModel, Field, StrictStr
class Artifact(BaseModel):
"""
Artifact that the task has produced.
"""
artifact_id: StrictStr = Field(..., description="ID of the artifact.")
file_name: StrictStr = Field(..., description="Filename of the artifact.")
relative_path: Optional[StrictStr] = Field(
None, description="Relative path of the artifact in the agent's workspace."
)
__properties = ["artifact_id", "file_name", "relative_path"]
created_at: StrictStr = Field(..., description="Creation date of the artifact.")
# modified_at: StrictStr = Field(..., description="Modification date of the artifact.")
agent_created: bool = Field(..., description="True if created by the agent")
class Config:
"""Pydantic configuration"""
allow_population_by_field_name = True
validate_assignment = True
def to_str(self) -> str:
"""Returns the string representation of the model using alias"""
return pprint.pformat(self.dict(by_alias=True))
def to_json(self) -> str:
"""Returns the JSON representation of the model using alias"""
return json.dumps(self.to_dict())
@classmethod
def from_json(cls, json_str: str) -> Artifact:
"""Create an instance of Artifact from a JSON string"""
return cls.from_dict(json.loads(json_str))
def to_dict(self):
"""Returns the dictionary representation of the model using alias"""
_dict = self.dict(by_alias=True, exclude={}, exclude_none=True)
return _dict
@classmethod
def from_dict(cls, obj: dict) -> Artifact:
"""Create an instance of Artifact from a dict"""
if obj is None:
return None
if not isinstance(obj, dict):
return Artifact.parse_obj(obj)
_obj = Artifact.parse_obj(
{
"artifact_id": obj.get("artifact_id"),
"file_name": obj.get("file_name"),
"relative_path": obj.get("relative_path"),
"created_at": obj.get("created_at"),
"modified_at": obj.get("modified_at"),
"agent_created": obj.get("agent_created"),
}
)
return _obj
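For reference, the serialization helpers on this model are symmetric: `from_dict()` feeds the payload to pydantic's `parse_obj` (unknown keys are ignored by default) and `to_dict()` drops fields that are `None`. A small round-trip sketch using the `Artifact` class defined directly above, with made-up values:

```python
# Round-trip sketch for the Artifact model defined above.
artifact = Artifact.from_dict(
    {
        "artifact_id": "a-1",
        "file_name": "hello.txt",
        "created_at": "2023-12-01T00:00:00Z",
        "agent_created": True,
    }
)
# relative_path was not set, so exclude_none drops it from the output.
assert "relative_path" not in artifact.to_dict()
assert artifact.to_dict()["file_name"] == "hello.txt"
```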


@@ -1,77 +0,0 @@
# coding: utf-8
"""
Agent Communication Protocol
Specification of the API protocol for communication with an agent. # noqa: E501
The version of the OpenAPI document: v0.2
Generated by OpenAPI Generator (https://openapi-generator.tech)
Do not edit the class manually.
"""
from __future__ import annotations
import json
import pprint
import re # noqa: F401
from pydantic import BaseModel
from agbenchmark.agent_protocol_client.models.artifact import Artifact
from agbenchmark.agent_protocol_client.models.pagination import Pagination
class Artifacts(BaseModel):
"""
Artifacts that the task has produced.
"""
artifacts: list[Artifact]
pagination: Pagination
class Config:
"""Pydantic configuration"""
allow_population_by_field_name = True
validate_assignment = True
def to_str(self) -> str:
"""Returns the string representation of the model using alias"""
return pprint.pformat(self.dict(by_alias=True))
def to_json(self) -> str:
"""Returns the JSON representation of the model using alias"""
return json.dumps(self.to_dict())
@classmethod
def from_json(cls, json_str: str) -> Artifacts:
"""Create an instance of Artifacts from a JSON string"""
return cls.from_dict(json.loads(json_str))
def to_dict(self):
"""Returns the dictionary representation of the model using alias"""
_dict = self.dict(by_alias=True, exclude={}, exclude_none=True)
return _dict
@classmethod
def from_dict(cls, obj: dict) -> Artifacts:
"""Create an instance of Artifacts from a dict"""
if obj is None:
return None
if not isinstance(obj, dict):
return Artifacts.parse_obj(obj)
_obj = Artifacts.parse_obj(
{
"artifacts": obj.get("artifacts"),
"pagination": obj.get("pagination"),
}
)
return _obj
Artifacts.update_forward_refs()


@@ -1,75 +0,0 @@
# coding: utf-8
"""
Agent Communication Protocol
Specification of the API protocol for communication with an agent. # noqa: E501
The version of the OpenAPI document: v0.2
Generated by OpenAPI Generator (https://openapi-generator.tech)
Do not edit the class manually.
"""
from __future__ import annotations
import json
import pprint
import re # noqa: F401
from pydantic import BaseModel
class Pagination(BaseModel):
"""
Pagination that the task has produced.
"""
total_items: int
total_pages: int
current_page: int
page_size: int
class Config:
"""Pydantic configuration"""
allow_population_by_field_name = True
validate_assignment = True
def to_str(self) -> str:
"""Returns the string representation of the model using alias"""
return pprint.pformat(self.dict(by_alias=True))
def to_json(self) -> str:
"""Returns the JSON representation of the model using alias"""
return json.dumps(self.to_dict())
@classmethod
def from_json(cls, json_str: str) -> Pagination:
"""Create an instance of Pagination from a JSON string"""
return cls.from_dict(json.loads(json_str))
def to_dict(self):
"""Returns the dictionary representation of the model using alias"""
_dict = self.dict(by_alias=True, exclude={}, exclude_none=True)
return _dict
@classmethod
def from_dict(cls, obj: dict) -> Pagination:
"""Create an instance of Pagination from a dict"""
if obj is None:
return None
if not isinstance(obj, dict):
return Pagination.parse_obj(obj)
_obj = Pagination.parse_obj(
{
"total_items": obj.get("total_items"),
"total_pages": obj.get("total_pages"),
"current_page": obj.get("current_page"),
"page_size": obj.get("page_size"),
}
)
return _obj


@@ -1,146 +0,0 @@
# coding: utf-8
"""
Agent Communication Protocol
Specification of the API protocol for communication with an agent. # noqa: E501
The version of the OpenAPI document: v0.2
Generated by OpenAPI Generator (https://openapi-generator.tech)
Do not edit the class manually.
"""
from __future__ import annotations
import json
import pprint
import re # noqa: F401
from typing import Any, Optional
from pydantic import BaseModel, Field, StrictBool, StrictStr, conlist, validator
from agbenchmark.agent_protocol_client.models.artifact import Artifact
class Step(BaseModel):
"""
Step
"""
input: Optional[StrictStr] = Field(None, description="Input prompt for the step.")
additional_input: Optional[Any] = Field(
None, description="Input parameters for the task step. Any value is allowed."
)
task_id: StrictStr = Field(
..., description="The ID of the task this step belongs to."
)
step_id: StrictStr = Field(..., description="The ID of the task step.")
name: Optional[StrictStr] = Field(None, description="The name of the task step.")
status: StrictStr = Field(..., description="The status of the task step.")
output: Optional[StrictStr] = Field(None, description="Output of the task step.")
additional_output: Optional[Any] = Field(
None,
description="Output that the task step has produced. Any value is allowed.",
)
artifacts: conlist(Artifact) = Field(
..., description="A list of artifacts that the step has produced."
)
is_last: Optional[StrictBool] = Field(
False, description="Whether this is the last step in the task."
)
__properties = [
"input",
"additional_input",
"task_id",
"step_id",
"name",
"status",
"output",
"additional_output",
"artifacts",
"is_last",
]
@validator("status")
def status_validate_enum(cls, value):
"""Validates the enum"""
if value not in ("created", "completed"):
raise ValueError("must be one of enum values ('created', 'completed')")
return value
class Config:
"""Pydantic configuration"""
allow_population_by_field_name = True
validate_assignment = True
def to_str(self) -> str:
"""Returns the string representation of the model using alias"""
return pprint.pformat(self.dict(by_alias=True))
def to_json(self) -> str:
"""Returns the JSON representation of the model using alias"""
return json.dumps(self.to_dict())
@classmethod
def from_json(cls, json_str: str) -> Step:
"""Create an instance of Step from a JSON string"""
return cls.from_dict(json.loads(json_str))
def to_dict(self):
"""Returns the dictionary representation of the model using alias"""
_dict = self.dict(by_alias=True, exclude={}, exclude_none=True)
# override the default output from pydantic by calling `to_dict()` of each item in artifacts (list)
_items = []
if self.artifacts:
for _item in self.artifacts:
if _item:
_items.append(_item.to_dict())
_dict["artifacts"] = _items
# set to None if additional_input (nullable) is None
# and __fields_set__ contains the field
if self.additional_input is None and "additional_input" in self.__fields_set__:
_dict["additional_input"] = None
# set to None if additional_output (nullable) is None
# and __fields_set__ contains the field
if (
self.additional_output is None
and "additional_output" in self.__fields_set__
):
_dict["additional_output"] = None
return _dict
@classmethod
def from_dict(cls, obj: dict) -> Step:
"""Create an instance of Step from a dict"""
if obj is None:
return None
if not isinstance(obj, dict):
return Step.parse_obj(obj)
_obj = Step.parse_obj(
{
"input": obj.get("input"),
"additional_input": obj.get("additional_input"),
"task_id": obj.get("task_id"),
"step_id": obj.get("step_id"),
"name": obj.get("name"),
"status": obj.get("status"),
"output": obj.get("output"),
"additional_output": obj.get("additional_output"),
"artifacts": [
Artifact.from_dict(_item) for _item in obj.get("artifacts")
]
if obj.get("artifacts") is not None
else None,
"is_last": obj.get("is_last")
if obj.get("is_last") is not None
else False,
}
)
return _obj
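The `status` validator is the only field-level constraint on `Step`: any value other than `created` or `completed` is rejected, both on parsing and on assignment (`validate_assignment = True`). A short sketch using the class defined above:

```python
from pydantic import ValidationError

# Sketch: the status validator on the Step class above only accepts
# 'created' or 'completed'.
try:
    Step.parse_obj(
        {"task_id": "t-1", "step_id": "s-1", "status": "running", "artifacts": []}
    )
except ValidationError as err:
    print(err)  # "must be one of enum values ('created', 'completed')"
```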


@@ -1,133 +0,0 @@
# coding: utf-8
"""
Agent Communication Protocol
Specification of the API protocol for communication with an agent. # noqa: E501
The version of the OpenAPI document: v0.2
Generated by OpenAPI Generator (https://openapi-generator.tech)
Do not edit the class manually.
"""
from __future__ import annotations
import json
import pprint
import re # noqa: F401
from typing import Any, Optional
from pydantic import BaseModel, Field, StrictBool, StrictStr, conlist, validator
from agbenchmark.agent_protocol_client.models.artifact import Artifact
class StepAllOf(BaseModel):
"""
StepAllOf
"""
task_id: StrictStr = Field(
..., description="The ID of the task this step belongs to."
)
step_id: StrictStr = Field(..., description="The ID of the task step.")
name: Optional[StrictStr] = Field(None, description="The name of the task step.")
status: StrictStr = Field(..., description="The status of the task step.")
output: Optional[StrictStr] = Field(None, description="Output of the task step.")
additional_output: Optional[Any] = Field(
None,
description="Output that the task step has produced. Any value is allowed.",
)
artifacts: conlist(Artifact) = Field(
..., description="A list of artifacts that the step has produced."
)
is_last: Optional[StrictBool] = Field(
False, description="Whether this is the last step in the task."
)
__properties = [
"task_id",
"step_id",
"name",
"status",
"output",
"additional_output",
"artifacts",
"is_last",
]
@validator("status")
def status_validate_enum(cls, value):
"""Validates the enum"""
if value not in ("created", "completed"):
raise ValueError("must be one of enum values ('created', 'completed')")
return value
class Config:
"""Pydantic configuration"""
allow_population_by_field_name = True
validate_assignment = True
def to_str(self) -> str:
"""Returns the string representation of the model using alias"""
return pprint.pformat(self.dict(by_alias=True))
def to_json(self) -> str:
"""Returns the JSON representation of the model using alias"""
return json.dumps(self.to_dict())
@classmethod
def from_json(cls, json_str: str) -> StepAllOf:
"""Create an instance of StepAllOf from a JSON string"""
return cls.from_dict(json.loads(json_str))
def to_dict(self):
"""Returns the dictionary representation of the model using alias"""
_dict = self.dict(by_alias=True, exclude={}, exclude_none=True)
# override the default output from pydantic by calling `to_dict()` of each item in artifacts (list)
_items = []
if self.artifacts:
for _item in self.artifacts:
if _item:
_items.append(_item.to_dict())
_dict["artifacts"] = _items
# set to None if additional_output (nullable) is None
# and __fields_set__ contains the field
if (
self.additional_output is None
and "additional_output" in self.__fields_set__
):
_dict["additional_output"] = None
return _dict
@classmethod
def from_dict(cls, obj: dict) -> StepAllOf:
"""Create an instance of StepAllOf from a dict"""
if obj is None:
return None
if not isinstance(obj, dict):
return StepAllOf.parse_obj(obj)
_obj = StepAllOf.parse_obj(
{
"task_id": obj.get("task_id"),
"step_id": obj.get("step_id"),
"name": obj.get("name"),
"status": obj.get("status"),
"output": obj.get("output"),
"additional_output": obj.get("additional_output"),
"artifacts": [
Artifact.from_dict(_item) for _item in obj.get("artifacts")
]
if obj.get("artifacts") is not None
else None,
"is_last": obj.get("is_last")
if obj.get("is_last") is not None
else False,
}
)
return _obj


@@ -1,77 +0,0 @@
# coding: utf-8
"""
Agent Communication Protocol
Specification of the API protocol for communication with an agent. # noqa: E501
The version of the OpenAPI document: v0.2
Generated by OpenAPI Generator (https://openapi-generator.tech)
Do not edit the class manually.
"""
from __future__ import annotations
import json
import pprint
import re # noqa: F401
from typing import Any, Optional
from pydantic import BaseModel, Field, StrictStr
class StepRequestBody(BaseModel):
"""
Body of the task request.
"""
input: Optional[StrictStr] = Field(None, description="Input prompt for the step.")
additional_input: Optional[Any] = Field(
None, description="Input parameters for the task step. Any value is allowed."
)
__properties = ["input", "additional_input"]
class Config:
"""Pydantic configuration"""
allow_population_by_field_name = True
validate_assignment = True
def to_str(self) -> str:
"""Returns the string representation of the model using alias"""
return pprint.pformat(self.dict(by_alias=True))
def to_json(self) -> str:
"""Returns the JSON representation of the model using alias"""
return json.dumps(self.to_dict())
@classmethod
def from_json(cls, json_str: str) -> StepRequestBody:
"""Create an instance of StepRequestBody from a JSON string"""
return cls.from_dict(json.loads(json_str))
def to_dict(self):
"""Returns the dictionary representation of the model using alias"""
_dict = self.dict(by_alias=True, exclude={}, exclude_none=True)
# set to None if additional_input (nullable) is None
# and __fields_set__ contains the field
if self.additional_input is None and "additional_input" in self.__fields_set__:
_dict["additional_input"] = None
return _dict
@classmethod
def from_dict(cls, obj: dict) -> StepRequestBody:
"""Create an instance of StepRequestBody from a dict"""
if obj is None:
return None
if not isinstance(obj, dict):
return StepRequestBody.parse_obj(obj)
_obj = StepRequestBody.parse_obj(
{"input": obj.get("input"), "additional_input": obj.get("additional_input")}
)
return _obj


@@ -1,89 +0,0 @@
# coding: utf-8
"""
Agent Communication Protocol
Specification of the API protocol for communication with an agent. # noqa: E501
The version of the OpenAPI document: v1
Generated by OpenAPI Generator (https://openapi-generator.tech)
Do not edit the class manually.
"""
from __future__ import annotations
import json
import pprint
import re # noqa: F401
from typing import Any, Optional
from pydantic import BaseModel, Field, StrictBool, conlist
class StepResult(BaseModel):
"""
Result of the task step.
"""
output: Optional[Any] = Field(
None,
description="Output that the task step has produced. Any value is allowed.",
)
artifacts: conlist(Any) = Field(
..., description="A list of artifacts that the step has produced."
)
is_last: Optional[StrictBool] = Field(
False, description="Whether this is the last step in the task."
)
__properties = ["output", "artifacts", "is_last"]
class Config:
"""Pydantic configuration"""
allow_population_by_field_name = True
validate_assignment = True
def to_str(self) -> str:
"""Returns the string representation of the model using alias"""
return pprint.pformat(self.dict(by_alias=True))
def to_json(self) -> str:
"""Returns the JSON representation of the model using alias"""
return json.dumps(self.to_dict())
@classmethod
def from_json(cls, json_str: str) -> StepResult:
"""Create an instance of StepResult from a JSON string"""
return cls.from_dict(json.loads(json_str))
def to_dict(self):
"""Returns the dictionary representation of the model using alias"""
_dict = self.dict(by_alias=True, exclude={}, exclude_none=True)
# set to None if output (nullable) is None
# and __fields_set__ contains the field
if self.output is None and "output" in self.__fields_set__:
_dict["output"] = None
return _dict
@classmethod
def from_dict(cls, obj: dict) -> StepResult:
"""Create an instance of StepResult from a dict"""
if obj is None:
return None
if not isinstance(obj, dict):
return StepResult.parse_obj(obj)
_obj = StepResult.parse_obj(
{
"output": obj.get("output"),
"artifacts": obj.get("artifacts"),
"is_last": obj.get("is_last")
if obj.get("is_last") is not None
else False,
}
)
return _obj


@@ -1,99 +0,0 @@
# coding: utf-8
"""
Agent Communication Protocol
Specification of the API protocol for communication with an agent. # noqa: E501
The version of the OpenAPI document: v0.2
Generated by OpenAPI Generator (https://openapi-generator.tech)
Do not edit the class manually.
"""
from __future__ import annotations
import json
import pprint
import re # noqa: F401
from typing import Any, Optional
from pydantic import BaseModel, Field, StrictStr, conlist
from agbenchmark.agent_protocol_client.models.artifact import Artifact
class Task(BaseModel):
"""
Task
"""
input: Optional[StrictStr] = Field(None, description="Input prompt for the task.")
additional_input: Optional[Any] = Field(
None, description="Input parameters for the task. Any value is allowed."
)
task_id: StrictStr = Field(..., description="The ID of the task.")
artifacts: conlist(Artifact) = Field(
..., description="A list of artifacts that the task has produced."
)
__properties = ["input", "additional_input", "task_id", "artifacts"]
class Config:
"""Pydantic configuration"""
allow_population_by_field_name = True
validate_assignment = True
def to_str(self) -> str:
"""Returns the string representation of the model using alias"""
return pprint.pformat(self.dict(by_alias=True))
def to_json(self) -> str:
"""Returns the JSON representation of the model using alias"""
return json.dumps(self.to_dict())
@classmethod
def from_json(cls, json_str: str) -> Task:
"""Create an instance of Task from a JSON string"""
return cls.from_dict(json.loads(json_str))
def to_dict(self):
"""Returns the dictionary representation of the model using alias"""
_dict = self.dict(by_alias=True, exclude={}, exclude_none=True)
# override the default output from pydantic by calling `to_dict()` of each item in artifacts (list)
_items = []
if self.artifacts:
for _item in self.artifacts:
if _item:
_items.append(_item.to_dict())
_dict["artifacts"] = _items
# set to None if additional_input (nullable) is None
# and __fields_set__ contains the field
if self.additional_input is None and "additional_input" in self.__fields_set__:
_dict["additional_input"] = None
return _dict
@classmethod
def from_dict(cls, obj: dict) -> Task:
"""Create an instance of Task from a dict"""
if obj is None:
return None
if not isinstance(obj, dict):
return Task.parse_obj(obj)
_obj = Task.parse_obj(
{
"input": obj.get("input"),
"additional_input": obj.get("additional_input"),
"task_id": obj.get("task_id"),
"artifacts": [
Artifact.from_dict(_item) for _item in obj.get("artifacts")
]
if obj.get("artifacts") is not None
else None,
}
)
return _obj


@@ -1,87 +0,0 @@
# coding: utf-8
"""
Agent Communication Protocol
Specification of the API protocol for communication with an agent. # noqa: E501
The version of the OpenAPI document: v0.2
Generated by OpenAPI Generator (https://openapi-generator.tech)
Do not edit the class manually.
"""
from __future__ import annotations
import json
import pprint
import re # noqa: F401
from pydantic import BaseModel, Field, StrictStr, conlist
from agbenchmark.agent_protocol_client.models.artifact import Artifact
class TaskAllOf(BaseModel):
"""
Definition of an agent task.
"""
task_id: StrictStr = Field(..., description="The ID of the task.")
artifacts: conlist(Artifact) = Field(
..., description="A list of artifacts that the task has produced."
)
__properties = ["task_id", "artifacts"]
class Config:
"""Pydantic configuration"""
allow_population_by_field_name = True
validate_assignment = True
def to_str(self) -> str:
"""Returns the string representation of the model using alias"""
return pprint.pformat(self.dict(by_alias=True))
def to_json(self) -> str:
"""Returns the JSON representation of the model using alias"""
return json.dumps(self.to_dict())
@classmethod
def from_json(cls, json_str: str) -> TaskAllOf:
"""Create an instance of TaskAllOf from a JSON string"""
return cls.from_dict(json.loads(json_str))
def to_dict(self):
"""Returns the dictionary representation of the model using alias"""
_dict = self.dict(by_alias=True, exclude={}, exclude_none=True)
# override the default output from pydantic by calling `to_dict()` of each item in artifacts (list)
_items = []
if self.artifacts:
for _item in self.artifacts:
if _item:
_items.append(_item.to_dict())
_dict["artifacts"] = _items
return _dict
@classmethod
def from_dict(cls, obj: dict) -> TaskAllOf:
"""Create an instance of TaskAllOf from a dict"""
if obj is None:
return None
if not isinstance(obj, dict):
return TaskAllOf.parse_obj(obj)
_obj = TaskAllOf.parse_obj(
{
"task_id": obj.get("task_id"),
"artifacts": [
Artifact.from_dict(_item) for _item in obj.get("artifacts")
]
if obj.get("artifacts") is not None
else None,
}
)
return _obj


@@ -1,77 +0,0 @@
# coding: utf-8
"""
Agent Communication Protocol
Specification of the API protocol for communication with an agent. # noqa: E501
The version of the OpenAPI document: v0.2
Generated by OpenAPI Generator (https://openapi-generator.tech)
Do not edit the class manually.
"""
from __future__ import annotations
import json
import pprint
import re # noqa: F401
from typing import Any, Optional
from pydantic import BaseModel, Field, StrictStr
class TaskRequestBody(BaseModel):
"""
Body of the task request.
"""
input: Optional[StrictStr] = Field(None, description="Input prompt for the task.")
additional_input: Optional[Any] = Field(
None, description="Input parameters for the task. Any value is allowed."
)
__properties = ["input", "additional_input"]
class Config:
"""Pydantic configuration"""
allow_population_by_field_name = True
validate_assignment = True
def to_str(self) -> str:
"""Returns the string representation of the model using alias"""
return pprint.pformat(self.dict(by_alias=True))
def to_json(self) -> str:
"""Returns the JSON representation of the model using alias"""
return json.dumps(self.to_dict())
@classmethod
def from_json(cls, json_str: str) -> TaskRequestBody:
"""Create an instance of TaskRequestBody from a JSON string"""
return cls.from_dict(json.loads(json_str))
def to_dict(self):
"""Returns the dictionary representation of the model using alias"""
_dict = self.dict(by_alias=True, exclude={}, exclude_none=True)
# set to None if additional_input (nullable) is None
# and __fields_set__ contains the field
if self.additional_input is None and "additional_input" in self.__fields_set__:
_dict["additional_input"] = None
return _dict
@classmethod
def from_dict(cls, obj: dict) -> TaskRequestBody:
"""Create an instance of TaskRequestBody from a dict"""
if obj is None:
return None
if not isinstance(obj, dict):
return TaskRequestBody.parse_obj(obj)
_obj = TaskRequestBody.parse_obj(
{"input": obj.get("input"), "additional_input": obj.get("additional_input")}
)
return _obj


@@ -1,311 +0,0 @@
# coding: utf-8
"""
Agent Communication Protocol
Specification of the API protocol for communication with an agent. # noqa: E501
The version of the OpenAPI document: v0.2
Generated by OpenAPI Generator (https://openapi-generator.tech)
Do not edit the class manually.
"""
import io
import json
import logging
import re
import ssl
from urllib.parse import urlencode
import aiohttp
from agbenchmark.agent_protocol_client.exceptions import ApiException, ApiValueError
logger = logging.getLogger(__name__)
class RESTResponse(io.IOBase):
def __init__(self, resp, data):
self.aiohttp_response = resp
self.status = resp.status
self.reason = resp.reason
self.data = data
def getheaders(self):
"""Returns a CIMultiDictProxy of the response headers."""
return self.aiohttp_response.headers
def getheader(self, name, default=None):
"""Returns a given response header."""
return self.aiohttp_response.headers.get(name, default)
class RESTClientObject(object):
def __init__(self, configuration, pools_size=4, maxsize=None):
# maxsize is number of requests to host that are allowed in parallel
if maxsize is None:
maxsize = configuration.connection_pool_maxsize
ssl_context = ssl.create_default_context(cafile=configuration.ssl_ca_cert)
if configuration.cert_file:
ssl_context.load_cert_chain(
configuration.cert_file, keyfile=configuration.key_file
)
if not configuration.verify_ssl:
ssl_context.check_hostname = False
ssl_context.verify_mode = ssl.CERT_NONE
connector = aiohttp.TCPConnector(limit=maxsize, ssl=ssl_context)
self.proxy = configuration.proxy
self.proxy_headers = configuration.proxy_headers
# https pool manager
self.pool_manager = aiohttp.ClientSession(connector=connector, trust_env=True)
async def close(self):
await self.pool_manager.close()
async def request(
self,
method,
url,
query_params=None,
headers=None,
body=None,
post_params=None,
_preload_content=True,
_request_timeout=None,
):
"""Execute request
:param method: http request method
:param url: http request url
:param query_params: query parameters in the url
:param headers: http request headers
:param body: request json body, for `application/json`
:param post_params: request post parameters,
`application/x-www-form-urlencoded`
and `multipart/form-data`
:param _preload_content: this is a non-applicable field for
the AiohttpClient.
:param _request_timeout: timeout setting for this request. If one
number provided, it will be total request
timeout. It can also be a pair (tuple) of
(connection, read) timeouts.
"""
method = method.upper()
assert method in ["GET", "HEAD", "DELETE", "POST", "PUT", "PATCH", "OPTIONS"]
if post_params and body:
raise ApiValueError(
"body parameter cannot be used with post_params parameter."
)
post_params = post_params or {}
headers = headers or {}
# url already contains the URL query string
# so reset query_params to empty dict
query_params = {}
timeout = _request_timeout or 5 * 60
if "Content-Type" not in headers:
headers["Content-Type"] = "application/json"
args = {"method": method, "url": url, "timeout": timeout, "headers": headers}
if self.proxy:
args["proxy"] = self.proxy
if self.proxy_headers:
args["proxy_headers"] = self.proxy_headers
if query_params:
args["url"] += "?" + urlencode(query_params)
# For `POST`, `PUT`, `PATCH`, `OPTIONS`, `DELETE`
if method in ["POST", "PUT", "PATCH", "OPTIONS", "DELETE"]:
if re.search("json", headers["Content-Type"], re.IGNORECASE):
if body is not None:
body = json.dumps(body)
args["data"] = body
elif (
headers["Content-Type"] == "application/x-www-form-urlencoded"
): # noqa: E501
args["data"] = aiohttp.FormData(post_params)
elif headers["Content-Type"] == "multipart/form-data":
# must delete headers['Content-Type'] so that the correct multipart
# Content-Type (with boundary) generated by aiohttp is used instead
del headers["Content-Type"]
data = aiohttp.FormData()
for param in post_params:
k, v = param
if isinstance(v, tuple) and len(v) == 3:
data.add_field(k, value=v[1], filename=v[0], content_type=v[2])
else:
data.add_field(k, v)
args["data"] = data
# Pass a `bytes` parameter directly in the body to support
# other content types than Json when `body` argument is provided
# in serialized form
elif isinstance(body, bytes):
args["data"] = body
else:
# Cannot generate the request from given parameters
msg = """Cannot prepare a request message for provided
arguments. Please check that your arguments match
declared content type."""
raise ApiException(status=0, reason=msg)
r = await self.pool_manager.request(**args)
if _preload_content:
data = await r.read()
r = RESTResponse(r, data)
# log response body
logger.debug("response body: %s", r.data)
if not 200 <= r.status <= 299:
raise ApiException(http_resp=r)
return r
async def get_request(
self,
url,
headers=None,
query_params=None,
_preload_content=True,
_request_timeout=None,
):
return await self.request(
"GET",
url,
headers=headers,
_preload_content=_preload_content,
_request_timeout=_request_timeout,
query_params=query_params,
)
async def head_request(
self,
url,
headers=None,
query_params=None,
_preload_content=True,
_request_timeout=None,
):
return await self.request(
"HEAD",
url,
headers=headers,
_preload_content=_preload_content,
_request_timeout=_request_timeout,
query_params=query_params,
)
async def options_request(
self,
url,
headers=None,
query_params=None,
post_params=None,
body=None,
_preload_content=True,
_request_timeout=None,
):
return await self.request(
"OPTIONS",
url,
headers=headers,
query_params=query_params,
post_params=post_params,
_preload_content=_preload_content,
_request_timeout=_request_timeout,
body=body,
)
async def delete_request(
self,
url,
headers=None,
query_params=None,
body=None,
_preload_content=True,
_request_timeout=None,
):
return await self.request(
"DELETE",
url,
headers=headers,
query_params=query_params,
_preload_content=_preload_content,
_request_timeout=_request_timeout,
body=body,
)
async def post_request(
self,
url,
headers=None,
query_params=None,
post_params=None,
body=None,
_preload_content=True,
_request_timeout=None,
):
return await self.request(
"POST",
url,
headers=headers,
query_params=query_params,
post_params=post_params,
_preload_content=_preload_content,
_request_timeout=_request_timeout,
body=body,
)
async def put_request(
self,
url,
headers=None,
query_params=None,
post_params=None,
body=None,
_preload_content=True,
_request_timeout=None,
):
return await self.request(
"PUT",
url,
headers=headers,
query_params=query_params,
post_params=post_params,
_preload_content=_preload_content,
_request_timeout=_request_timeout,
body=body,
)
async def patch_request(
self,
url,
headers=None,
query_params=None,
post_params=None,
body=None,
_preload_content=True,
_request_timeout=None,
):
return await self.request(
"PATCH",
url,
headers=headers,
query_params=query_params,
post_params=post_params,
_preload_content=_preload_content,
_request_timeout=_request_timeout,
body=body,
)
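
This vendored aiohttp REST client is likewise removed in full; callers now go through the published `agent_protocol_client` package. A minimal, hedged sketch of that usage pattern, mirroring the `ApiClient`/`AgentApi` calls made in app.py further down (the host value is illustrative):

# Hedged sketch of the replacement client usage; host is illustrative and
# normally comes from the benchmark configuration.
import asyncio

from agent_protocol_client import AgentApi, ApiClient, Configuration
from agent_protocol_client.models import TaskRequestBody

async def create_task_example() -> None:
    configuration = Configuration(host="http://localhost:8000")
    async with ApiClient(configuration) as api_client:
        api_instance = AgentApi(api_client)
        task = await api_instance.create_agent_task(
            task_request_body=TaskRequestBody(input="Say hello")
        )
        print(task.task_id)

asyncio.run(create_task_example())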

View File

@@ -1,78 +1,74 @@
import datetime
import glob
import json
import logging
import sys
import time
import uuid
from collections import defaultdict, deque
from multiprocessing import Process
from pathlib import Path
import httpx
from agbenchmark.agent_protocol_client import (
AgentApi,
ApiClient,
ApiException,
Configuration,
)
from agbenchmark.reports.processing.report_types_v2 import BenchmarkRun
from agbenchmark.schema import TaskEvalRequestBody
from agbenchmark.utils.utils import write_pretty_json
configuration = Configuration(host="http://localhost:8000" + "/ap/v1")
import json
import os
import sys
from typing import Any, Optional
import httpx
import psutil
from fastapi import APIRouter, FastAPI
from fastapi import (
HTTPException as FastAPIHTTPException, # Import HTTPException from FastAPI
)
from fastapi import Request, Response
from agent_protocol_client import AgentApi, ApiClient, ApiException, Configuration
from agent_protocol_client.models import Task, TaskRequestBody
from fastapi import APIRouter, FastAPI, HTTPException, Request, Response
from fastapi.middleware.cors import CORSMiddleware
from pydantic import BaseModel, Extra, ValidationError
from agbenchmark.execute_sub_process import execute_subprocess
from agbenchmark.schema import Task, TaskRequestBody
from agbenchmark.config import AgentBenchmarkConfig
from agbenchmark.reports.processing.report_types_v2 import (
BenchmarkRun,
Metrics,
RepositoryInfo,
RunDetails,
TaskInfo,
)
from agbenchmark.schema import TaskEvalRequestBody
from agbenchmark.utils.data_types import ChallengeData
from agbenchmark.utils.utils import write_pretty_json
sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
from fastapi import FastAPI
from pydantic import BaseModel, Extra
sys.path.append(str(Path(__file__).parent.parent))
router = APIRouter()
import glob
logger = logging.getLogger(__name__)
# Change the current working directory to the benchmark path
# home_path = find_absolute_benchmark_path()
# os.chdir(home_path)
general_command = ["poetry", "run", "agbenchmark", "start", "--backend"]
import psutil
challenges_path = os.path.join(os.path.dirname(__file__), "challenges")
json_files = deque(
CHALLENGES: dict[str, ChallengeData] = {}
challenges_path = Path(__file__).parent / "challenges"
challenge_spec_files = deque(
glob.glob(
f"{challenges_path}/**/data.json",
recursive=True,
)
)
CHALLENGES = {}
task_informations = defaultdict(dict)
logger.debug("Loading challenges...")
while challenge_spec_files:
challenge_spec_file = Path(challenge_spec_files.popleft())
challenge_relpath = challenge_spec_file.relative_to(challenges_path.parent)
if challenge_relpath.is_relative_to("challenges/deprecated"):
continue
while json_files:
json_file = json_files.popleft()
logger.debug(f"Loading {challenge_relpath}...")
try:
challenge_info = ChallengeData.parse_file(challenge_spec_file)
except ValidationError as e:
if logging.getLogger().level == logging.DEBUG:
logger.warning(f"Spec file {challenge_relpath} failed to load:\n{e}")
logger.debug(f"Invalid challenge spec: {challenge_spec_file.read_text()}")
continue
challenge_info.spec_file = challenge_spec_file
with open(json_file, "r") as file:
data = json.load(file)
if not challenge_info.eval_id:
challenge_info.eval_id = str(uuid.uuid4())
# this will sort all the keys of the JSON systematically
# so that the order is always the same
write_pretty_json(challenge_info.dict(), challenge_spec_file)
if "eval_id" not in data:
data["eval_id"] = str(uuid.uuid4())
# this will sort all the keys of the JSON systematically so that the order is always the same
write_pretty_json(data, json_file)
# ok
CHALLENGES[data["eval_id"]] = data
CHALLENGES[data["eval_id"]]["path"] = json_file
CHALLENGES[challenge_info.eval_id] = challenge_info
task_informations = defaultdict(dict[str, Any])
def find_agbenchmark_without_uvicorn():
@@ -93,10 +89,10 @@ def find_agbenchmark_without_uvicorn():
):
try:
# Convert the process.info dictionary values to strings and concatenate them
full_info = " ".join([str(v) for k, v in process.info.items()])
full_info = " ".join([str(v) for k, v in process.as_dict().items()])
if "agbenchmark" in full_info and "uvicorn" not in full_info:
pids.append(process.info["pid"])
pids.append(process.pid)
except (psutil.NoSuchProcess, psutil.AccessDenied, psutil.ZombieProcess):
pass
return pids
@@ -114,24 +110,12 @@ class CreateReportRequest(BaseModel):
updates_list = []
updates_list = []
import json
origins = [
"http://localhost:8000",
"http://localhost:8080",
"http://127.0.0.1:5000",
"http://localhost:5000",
]
app = FastAPI()
app.add_middleware(
CORSMiddleware,
allow_origins=origins,
allow_credentials=True,
allow_methods=["*"],
allow_headers=["*"],
)
def stream_output(pipe):
@@ -139,275 +123,210 @@ def stream_output(pipe):
print(line, end="")
@router.post("/reports")
def run_single_test(body: CreateReportRequest) -> Any:
pids = find_agbenchmark_without_uvicorn()
print(f"pids already running with agbenchmark: {pids}")
print(body.dict())
# it's a hack because other parts of the code are using sys.argv
print(os.getcwd())
command_options = ["agbenchmark"]
# if body.category:
# sys.argv.append(f"--category={body.category}")
command_options.append(f"--test={body.test}")
if body.mock:
command_options.append("--mock")
execute_subprocess(command_options, 200)
import json
from pathlib import Path
print("finished running")
# List all folders in the current working directory
path_reports = Path.cwd() / "agbenchmark_config" / "reports"
folders = [folder for folder in path_reports.iterdir() if folder.is_dir()]
# Sort the folders based on their names
sorted_folders = sorted(folders, key=lambda x: x.name)
# Get the last folder
last_folder = sorted_folders[-1] if sorted_folders else None
# Read report.json from this folder
if last_folder:
report_path = last_folder / "report.json"
print(report_path)
if report_path.exists():
with report_path.open() as file:
data = json.load(file)
print(data)
else:
print(f"'report.json' does not exist in '{last_folder}'")
else:
print("No folders found.")
return Response(
content=json.dumps(data),
status_code=200,
media_type="application/json",
def setup_fastapi_app(agbenchmark_config: AgentBenchmarkConfig) -> FastAPI:
from agbenchmark.agent_api_interface import (
copy_agent_artifacts_into_folder,
upload_artifacts,
)
import json
from typing import Any
from fastapi import FastAPI, Request, Response
@router.get("/updates")
def get_updates(request: Request) -> Any:
from agbenchmark.__main__ import UPDATES_JSON_PATH
try:
# Read data from the "update.json" file (provide the correct file path)
with open(UPDATES_JSON_PATH, "r") as file:
data = json.load(file)
# Get the last_update_time from the query parameter
query_param = request.query_params.get("last_update_time")
if query_param is None:
# Handle the case when last_update_time is not provided
print("ERROR: last_update_time parameter is missing")
return Response(
content=json.dumps({"error": "last_update_time parameter is missing"}),
status_code=400,
media_type="application/json",
headers={"Content-Type": "application/json"},
)
# Convert query_param to a Unix timestamp (assuming it's in seconds as a string)
query_timestamp = int(query_param)
# Filter the data based on the timestamp (keep timestamps before query_timestamp)
filtered_data = [item for item in data if item["timestamp"] > query_timestamp]
# Extract only the "content" field from each item
filtered_data = [item["content"] for item in filtered_data]
# Convert the filtered data to JSON
filtered_json = json.dumps(filtered_data, indent=2)
print("INFO: Returning filtered data to the client")
return Response(
content=filtered_json,
status_code=200,
media_type="application/json",
headers={"Content-Type": "application/json"},
)
except FileNotFoundError:
print("ERROR: File not found: updates.json")
return Response(
content=json.dumps({"error": "File not found"}),
status_code=404,
media_type="application/json",
headers={"Content-Type": "application/json"},
)
@router.post("/agent/tasks", tags=["agent"], response_model=Task)
async def create_agent_task(task_eval_request: TaskEvalRequestBody) -> Task:
"""
Creates a new task using the provided TaskRequestBody and returns a Task.
Args:
request (Request): FastAPI request object.
task (TaskRequestBody): The task request containing input and additional input data.
Returns:
Task: A new task with task_id, input, additional_input, and empty lists for artifacts and steps.
Example:
Request (TaskRequestBody defined in schema.py):
{
"input": "Write the words you receive to the file 'output.txt'.",
"additional_input": "python/code"
}
Response (Task defined in schema.py):
{
"task_id": "50da533e-3904-4401-8a07-c49adf88b5eb",
"input": "Write the word 'Washington' to a .txt file",
"additional_input": "python/code",
"artifacts": [],
}
"""
from agbenchmark.agent_api_interface import upload_artifacts
try:
async with ApiClient(configuration) as api_client:
api_instance = AgentApi(api_client)
task_input = CHALLENGES[task_eval_request.eval_id]["task"]
task_request_body = TaskRequestBody(input=task_input)
task_response = await api_instance.create_agent_task(
task_request_body=task_request_body
)
task_informations[task_response.task_id][
"benchmark_start_time"
] = datetime.datetime.now(datetime.timezone.utc).strftime(
"%Y-%m-%dT%H:%M:%S+00:00"
)
task_informations[task_response.task_id][
"eval_id"
] = task_eval_request.eval_id
await upload_artifacts(
api_instance,
str(Path(CHALLENGES[task_eval_request.eval_id]["path"]).parent),
task_response.task_id,
"artifacts_in",
)
return Response(
content=task_response.json(),
status_code=200,
media_type="application/json",
)
except ApiException as e:
print(f"Error whilst trying to create a task: {task_eval_request}")
return Response(
content=json.dumps({"error": "Internal server error"}),
status_code=500,
media_type="application/json",
)
@router.post("/agent/tasks/{task_id}/steps")
async def proxy(request: Request, task_id: str):
timeout = httpx.Timeout(300.0, read=300.0) # 5 minutes
async with httpx.AsyncClient(timeout=timeout) as client:
# Construct the new URL
new_url = f"http://localhost:8000/ap/v1/agent/tasks/{task_id}/steps"
# Forward the request
response = await client.post(
new_url,
data=await request.body(),
headers=dict(request.headers),
)
# Return the response from the forwarded request
return Response(content=response.content, status_code=response.status_code)
@router.post("/agent/tasks/{task_id}/evaluations")
async def create_evaluation(task_id: str) -> deque:
from agbenchmark.__main__ import TEMP_FOLDER_ABS_PATH
from agbenchmark.agent_api_interface import copy_agent_artifacts_into_temp_folder
from agbenchmark.agent_interface import copy_artifacts_into_temp_folder
from agbenchmark.generate_test import create_challenge
from agbenchmark.generate_test import create_challenge_from_spec_file
from agbenchmark.main import run_benchmark
try:
async with ApiClient(configuration) as api_client:
api_instance = AgentApi(api_client)
await copy_agent_artifacts_into_temp_folder(api_instance, task_id)
# add custom python
data = CHALLENGES[task_informations[task_id]["eval_id"]]
configuration = Configuration(
host=agbenchmark_config.host or "http://localhost:8000"
)
app = FastAPI()
app.add_middleware(
CORSMiddleware,
allow_origins=origins,
allow_credentials=True,
allow_methods=["*"],
allow_headers=["*"],
)
router = APIRouter()
artifact_path = str(Path(data["path"]).parent)
copy_artifacts_into_temp_folder(
TEMP_FOLDER_ABS_PATH, "custom_python", artifact_path
@router.post("/reports")
def run_single_test(body: CreateReportRequest) -> dict:
pids = find_agbenchmark_without_uvicorn()
logger.info(f"pids already running with agbenchmark: {pids}")
logger.debug(f"Request to /reports: {body.dict()}")
# Start the benchmark in a separate thread
benchmark_process = Process(
target=lambda: run_benchmark(
config=agbenchmark_config,
tests=(body.test,),
mock=body.mock or False,
)
)
json_file = CHALLENGES[task_informations[task_id]["eval_id"]]["path"]
json_files = deque()
benchmark_process.start()
_, challenge_class = create_challenge(data, json_file, json_files)
challenge_instance = challenge_class()
scores = challenge_instance.get_scores(config={})
test_name = "Test" + data["name"]
is_score_100 = 1 in scores["values"]
# Wait for the benchmark to finish, with a timeout of 200 seconds
timeout = 200
start_time = time.time()
while benchmark_process.is_alive():
if time.time() - start_time > timeout:
logger.warning(f"Benchmark run timed out after {timeout} seconds")
benchmark_process.terminate()
break
time.sleep(1)
else:
logger.debug(f"Benchmark finished running in {time.time() - start_time} s")
info_details = {
"repository_info": {
"repo_url": None,
"team_name": None,
"benchmark_git_commit_sha": None,
"agent_git_commit_sha": None,
},
"run_details": {
"run_id": None,
"command": "agbenchmark" + " --test=" + test_name,
"completion_time": None,
"benchmark_start_time": task_informations[task_id][
# List all folders in the current working directory
path_reports = agbenchmark_config.reports_folder
folders = [folder for folder in path_reports.iterdir() if folder.is_dir()]
# Sort the folders based on their names
sorted_folders = sorted(folders, key=lambda x: x.name)
# Get the last folder
latest_folder = sorted_folders[-1] if sorted_folders else None
# Read report.json from this folder
if latest_folder:
report_path = latest_folder / "report.json"
logger.debug(f"Getting latest report from {report_path}")
if report_path.exists():
with report_path.open() as file:
data = json.load(file)
logger.debug(f"Report data: {data}")
else:
logger.error(
"Could not get result after running benchmark: "
f"'report.json' does not exist in '{latest_folder}'"
)
else:
logger.error(
"Could not get result after running benchmark: no reports found"
)
return data
@router.post("/agent/tasks", tags=["agent"])
async def create_agent_task(task_eval_request: TaskEvalRequestBody) -> Task:
"""
Creates a new task using the provided TaskEvalRequestBody and returns a Task.
Args:
task_eval_request: `TaskRequestBody` including an eval_id.
Returns:
Task: A new task with task_id, input, additional_input,
and empty lists for artifacts and steps.
Example:
Request (TaskEvalRequestBody defined in schema.py):
{
...,
"eval_id": "50da533e-3904-4401-8a07-c49adf88b5eb"
}
Response (Task defined in `agent_protocol_client.models`):
{
"task_id": "50da533e-3904-4401-8a07-c49adf88b5eb",
"input": "Write the word 'Washington' to a .txt file",
"artifacts": []
}
"""
try:
async with ApiClient(configuration) as api_client:
api_instance = AgentApi(api_client)
task_input = CHALLENGES[task_eval_request.eval_id].task
task_request_body = TaskRequestBody(input=task_input)
task_response = await api_instance.create_agent_task(
task_request_body=task_request_body
)
task_informations[task_response.task_id][
"benchmark_start_time"
],
"test_name": data["name"],
},
"task_info": {
"data_path": data["path"].split("benchmark/", 1)[-1],
"is_regression": None,
"category": data["category"],
"task": data["task"],
"answer": data["ground"]["answer"],
"description": data["info"]["description"],
},
"metrics": {
"difficulty": None,
"success": is_score_100,
"attempted": True,
"success_percentage": None,
"cost": None,
"run_time": None,
},
"reached_cutoff": None,
"config": {},
}
] = datetime.datetime.now(datetime.timezone.utc).strftime(
"%Y-%m-%dT%H:%M:%S+00:00"
)
task_informations[task_response.task_id][
"eval_id"
] = task_eval_request.eval_id
await upload_artifacts(
api_instance,
str(CHALLENGES[task_eval_request.eval_id].spec_file.parent),
task_response.task_id,
"artifacts_in",
)
return task_response
except ApiException as e:
logger.error(f"Error whilst trying to create a task:\n{e}")
logger.error(
"The above error was caused while processing request: "
f"{task_eval_request}"
)
raise HTTPException(500)
BenchmarkRun.parse_obj(info_details)
@router.post("/agent/tasks/{task_id}/steps")
async def proxy(request: Request, task_id: str):
timeout = httpx.Timeout(300.0, read=300.0) # 5 minutes
async with httpx.AsyncClient(timeout=timeout) as client:
# Construct the new URL
new_url = f"{configuration.host}/ap/v1/agent/tasks/{task_id}/steps"
print(json.dumps(info_details, indent=4))
return Response(
content=json.dumps(info_details),
status_code=200,
media_type="application/json",
)
except ApiException as e:
print(f"Error whilst trying to evaluate the task: {task_id}")
return Response(
content=json.dumps({"error": "Internal server error"}),
status_code=500,
media_type="application/json",
)
# path = Path(json_file).resolve()
# Forward the request
response = await client.post(
new_url,
data=await request.body(),
headers=dict(request.headers),
)
# Return the response from the forwarded request
return Response(content=response.content, status_code=response.status_code)
app.include_router(router, prefix="/ap/v1")
@router.post("/agent/tasks/{task_id}/evaluations")
async def create_evaluation(task_id: str) -> BenchmarkRun:
challenge_info = CHALLENGES[task_informations[task_id]["eval_id"]]
workspace = agbenchmark_config.temp_folder
try:
async with ApiClient(configuration) as api_client:
api_instance = AgentApi(api_client)
await copy_agent_artifacts_into_folder(api_instance, task_id, workspace)
artifact_path = challenge_info.spec_file.parent
copy_artifacts_into_temp_folder(workspace, "custom_python", artifact_path)
challenge = create_challenge_from_spec_file(challenge_info.spec_file)
scores = challenge.get_scores(workspace)
is_score_100 = 1 in scores["values"]
eval_info = BenchmarkRun(
repository_info=RepositoryInfo(),
run_details=RunDetails(
command=f"agbenchmark --test={challenge_info.name}",
benchmark_start_time=(
task_informations[task_id]["benchmark_start_time"]
),
test_name=challenge_info.name,
),
task_info=TaskInfo(
data_path=str(
challenge_info.spec_file.relative_to(challenges_path.parent)
),
is_regression=None,
category=[c.value for c in challenge_info.category],
task=challenge_info.task,
answer=challenge_info.ground.answer,
description=challenge_info.info.description,
),
metrics=Metrics(
success=is_score_100,
attempted=True,
),
config={},
)
logger.debug(f"Returning evaluation data:\n{eval_info.json(indent=4)}")
return eval_info
except ApiException as e:
logger.error(f"Error {e} whilst trying to evaluate task: {task_id}")
raise HTTPException(500)
app.include_router(router, prefix="/ap/v1")
return app
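
The module now exposes `setup_fastapi_app` instead of a module-level `app`; how the returned application is served is not shown in this hunk. A purely hypothetical sketch, assuming the module path `agbenchmark.app` and a standard uvicorn entry point (port chosen arbitrarily):

# Hypothetical usage sketch, not part of this diff. Module path and port are
# assumptions; the config is loaded from the nearest agbenchmark_config folder.
import uvicorn

from agbenchmark.app import setup_fastapi_app
from agbenchmark.config import AgentBenchmarkConfig

if __name__ == "__main__":
    config = AgentBenchmarkConfig.load()
    app = setup_fastapi_app(config)
    uvicorn.run(app, host="localhost", port=8080)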

View File

@@ -0,0 +1,32 @@
import glob
import json
import logging
from pathlib import Path
logger = logging.getLogger(__name__)
def get_unique_categories() -> set[str]:
"""
Find all data.json files in this file's directory and its subdirectories, read
the "category" field from each file, and return a set of unique categories.
"""
categories = set()
challenges_dir = Path(__file__).parent
glob_path = f"{challenges_dir}/**/data.json"
for data_file in glob.glob(glob_path, recursive=True):
with open(data_file, "r") as f:
try:
challenge_data = json.load(f)
categories.update(challenge_data.get("category", []))
except json.JSONDecodeError:
logger.error(f"Error: {data_file} is not a valid JSON file.")
continue
except IOError:
logger.error(f"IOError: file could not be read: {data_file}")
continue
return categories
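
A short usage sketch for the helper above; the import path is an assumption based on the challenges directory this file globs over:

# Hedged example: print every category declared across the bundled challenges.
# The module path agbenchmark.challenges is an assumption.
from agbenchmark.challenges import get_unique_categories

for category in sorted(get_unique_categories()):
    print(category)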

View File

@@ -16,21 +16,21 @@
".txt"
],
"should_contain": [
"15",
"112",
"117",
"204",
"413",
"2,0",
"3,198",
"4,046",
"7,000",
"11,759",
"21,461",
"24,578",
"31,536",
"53,823",
"81,462"
"15",
"112",
"117",
"204",
"413",
"2,0",
"3,198",
"4,046",
"7,000",
"11,759",
"21,461",
"24,578",
"31,536",
"53,823",
"81,462"
],
"should_not_contain": []
},

View File

@@ -0,0 +1,119 @@
import json
import sys
from datetime import datetime
from pathlib import Path
from typing import Optional
from pydantic import BaseSettings
def _calculate_info_test_path(base_path: Path, benchmark_start_time: datetime) -> Path:
"""
Calculates the path to the directory where the test report will be saved.
"""
# Ensure the reports path exists
base_path.mkdir(parents=True, exist_ok=True)
# Get current UTC date-time stamp
date_stamp = benchmark_start_time.strftime("%Y%m%dT%H%M%S")
# Default run name
run_name = "full_run"
# Map command-line arguments to their respective labels
arg_labels = {
"--test": None,
"--category": None,
"--maintain": "maintain",
"--improve": "improve",
"--explore": "explore",
}
# Identify the relevant command-line argument
for arg, label in arg_labels.items():
if arg in sys.argv:
test_arg = sys.argv[sys.argv.index(arg) + 1] if label is None else None
run_name = arg.strip("--")
if test_arg:
run_name = f"{run_name}_{test_arg}"
break
# Create the full new directory path with ISO standard UTC date-time stamp
report_path = base_path / f"{date_stamp}_{run_name}"
# Ensure the new directory is created
# FIXME: this is not a desirable side-effect of loading the config
report_path.mkdir(exist_ok=True)
return report_path
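
To illustrate the naming scheme above with made-up values: a run started at 2024-01-01 12:00 UTC and invoked with `--test TestExample` on the command line would get a report directory named as computed below.

# Illustration only; the date and test name are invented.
from datetime import datetime

start = datetime(2024, 1, 1, 12, 0)
print(start.strftime("%Y%m%dT%H%M%S") + "_test_TestExample")
# -> 20240101T120000_test_TestExample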
class AgentBenchmarkConfig(BaseSettings, extra="allow"):
"""
Configuration model and loader for the AGBenchmark.
Projects that want to use AGBenchmark should contain an agbenchmark_config folder
with a config.json file that - at minimum - specifies the `host` at which the
subject application exposes an Agent Protocol compliant API.
"""
agbenchmark_config_dir: Path
"""Path to the agbenchmark_config folder of the subject agent application."""
categories: list[str] | None = None
"""Categories to benchmark the agent for. If omitted, all categories are assumed."""
host: str
"""Host (scheme://address:port) of the subject agent application."""
@classmethod
def load(cls, config_dir: Optional[Path] = None) -> "AgentBenchmarkConfig":
config_dir = config_dir or cls.find_config_folder()
with (config_dir / "config.json").open("r") as f:
return cls(
agbenchmark_config_dir=config_dir,
**json.load(f),
)
@staticmethod
def find_config_folder(for_dir: Path = Path.cwd()) -> Path:
"""
Find the closest ancestor folder containing an agbenchmark_config folder,
and return the path of that agbenchmark_config folder.
"""
current_directory = for_dir
while current_directory != Path("/"):
if (path := current_directory / "agbenchmark_config").exists():
if (path / "config.json").is_file():
return path
current_directory = current_directory.parent
raise FileNotFoundError(
"No 'agbenchmark_config' directory found in the path hierarchy."
)
@property
def config_file(self) -> Path:
return self.agbenchmark_config_dir / "config.json"
@property
def reports_folder(self) -> Path:
return self.agbenchmark_config_dir / "reports"
def get_report_dir(self, benchmark_start_time: datetime) -> Path:
return _calculate_info_test_path(self.reports_folder, benchmark_start_time)
@property
def regression_tests_file(self) -> Path:
return self.reports_folder / "regression_tests.json"
@property
def success_rate_file(self) -> Path:
return self.reports_folder / "success_rate.json"
@property
def challenges_already_beaten_file(self) -> Path:
return self.agbenchmark_config_dir / "challenges_already_beaten.json"
@property
def temp_folder(self) -> Path:
return self.agbenchmark_config_dir / "temp_folder"

View File

@@ -1,167 +1,127 @@
import contextlib
import json
import logging
import os
import shutil
import sys
import threading
import time
from pathlib import Path # noqa
from pathlib import Path
from typing import Any, Generator
import pytest
from agbenchmark.__main__ import TEMP_FOLDER_ABS_PATH
from agbenchmark.config import AgentBenchmarkConfig
from agbenchmark.reports.reports import (
finalize_reports,
generate_single_call_report,
session_finish,
)
from agbenchmark.utils.data_types import AgentBenchmarkConfig
from agbenchmark.utils.challenge import Challenge
from agbenchmark.utils.data_types import Category
GLOBAL_TIMEOUT = (
1500 # The tests will stop after 25 minutes so we can send the reports.
)
agbenchmark_config = AgentBenchmarkConfig.load()
logger = logging.getLogger(__name__)
pytest_plugins = ["agbenchmark.utils.dependencies"]
collect_ignore = ["challenges"]
suite_reports: dict[str, list] = {}
def load_config_from_request(request: Any) -> AgentBenchmarkConfig:
"""
This function loads the configuration for the agent benchmark from a given request.
Args:
request (Any): The request object from which the agent benchmark configuration is to be loaded.
Returns:
AgentBenchmarkConfig: The loaded agent benchmark configuration.
Raises:
json.JSONDecodeError: If the benchmark configuration file is not a valid JSON file.
"""
agent_benchmark_config_path = Path.cwd() / "agbenchmark_config" / "config.json"
try:
with open(agent_benchmark_config_path, "r") as f:
agent_benchmark_config = AgentBenchmarkConfig(**json.load(f))
agent_benchmark_config.agent_benchmark_config_path = (
agent_benchmark_config_path
)
return agent_benchmark_config
except json.JSONDecodeError:
print("Error: benchmark_config.json is not a valid JSON file.")
raise
@pytest.fixture(scope="module")
def config(request: Any) -> Any:
"""
This pytest fixture is responsible for loading the agent benchmark configuration from a given request.
This fixture is scoped to the module level, meaning it's invoked once per test module.
Args:
request (Any): The request object from which the agent benchmark configuration is to be loaded.
Returns:
Any: The loaded configuration dictionary.
Raises:
json.JSONDecodeError: If the benchmark configuration file is not a valid JSON file.
"""
config = {}
agent_benchmark_config_path = Path.cwd() / "agbenchmark_config" / "config.json"
try:
with open(agent_benchmark_config_path, "r") as f:
agent_benchmark_config = AgentBenchmarkConfig(**json.load(f))
agent_benchmark_config.agent_benchmark_config_path = (
agent_benchmark_config_path
)
except json.JSONDecodeError:
print("Error: benchmark_config.json is not a valid JSON file.")
raise
config["AgentBenchmarkConfig"] = agent_benchmark_config
return config
def config() -> AgentBenchmarkConfig:
return agbenchmark_config
@pytest.fixture(autouse=True)
def temp_folder() -> Generator[str, None, None]:
def temp_folder() -> Generator[Path, None, None]:
"""
This pytest fixture is responsible for setting up and tearing down the temporary folder for each test.
Pytest fixture that sets up and tears down the temporary folder for each test.
It is automatically used in every test due to the 'autouse=True' parameter.
It is used in order to let agbenchmark store files so they can then be evaluated.
"""
# create output directory if it doesn't exist
if not os.path.exists(TEMP_FOLDER_ABS_PATH):
os.makedirs(TEMP_FOLDER_ABS_PATH, exist_ok=True)
if not os.path.exists(agbenchmark_config.temp_folder):
os.makedirs(agbenchmark_config.temp_folder, exist_ok=True)
yield
yield agbenchmark_config.temp_folder
# teardown after test function completes
if not os.getenv("KEEP_TEMP_FOLDER_FILES"):
for filename in os.listdir(TEMP_FOLDER_ABS_PATH):
file_path = os.path.join(TEMP_FOLDER_ABS_PATH, filename)
for filename in os.listdir(agbenchmark_config.temp_folder):
file_path = os.path.join(agbenchmark_config.temp_folder, filename)
try:
if os.path.isfile(file_path) or os.path.islink(file_path):
os.unlink(file_path)
elif os.path.isdir(file_path):
shutil.rmtree(file_path)
except Exception as e:
print(f"Failed to delete {file_path}. Reason: {e}")
logger.warning(f"Failed to delete {file_path}. Reason: {e}")
def pytest_addoption(parser: Any) -> None:
def pytest_addoption(parser: pytest.Parser) -> None:
"""
This function is a pytest hook that is called to add command-line options.
It is used to add custom command-line options that are specific to the agent benchmark tests.
These options can be used to control the behavior of the tests.
The "--mock" option is used to run the tests in mock mode.
The "--host" option is used to specify the host for the tests.
The "--category" option is used to run only tests of a specific category.
The "--nc" option is used to run the tests without caching.
The "--cutoff" option is used to specify a cutoff time for the tests.
The "--improve" option is used to run only the tests that are marked for improvement.
The "--maintain" option is used to run only the tests that are marked for maintenance.
The "--explore" option is used to run the tests in exploration mode.
The "--test" option is used to run a specific test.
The "--no_dep" option is used to run the tests without dependencies.
The "--keep_answers" option is used to keep the answers of the tests.
Pytest hook that adds command-line options to the `pytest` command.
The added options are specific to agbenchmark and control its behavior:
* `--mock` is used to run the tests in mock mode.
* `--host` is used to specify the host for the tests.
* `--category` is used to run only tests of a specific category.
* `--nc` is used to run the tests without caching.
* `--cutoff` is used to specify a cutoff time for the tests.
* `--improve` is used to run only the tests that are marked for improvement.
* `--maintain` is used to run only the tests that are marked for maintenance.
* `--explore` is used to run the tests in exploration mode.
* `--test` is used to run a specific test.
* `--no-dep` is used to run the tests without dependencies.
* `--keep-answers` is used to keep the answers of the tests.
Args:
parser (Any): The parser object to which the command-line options are added.
parser: The Pytest CLI parser to which the command-line options are added.
"""
parser.addoption("--no_dep", action="store_true", default=False)
parser.addoption("--mock", action="store_true", default=False)
parser.addoption("--host", action="store_true", default=None)
parser.addoption("--nc", action="store_true", default=False)
parser.addoption("--cutoff", action="store_true", default=False)
parser.addoption("--category", action="store_true", default=False)
parser.addoption("--test", action="store_true", default=None)
parser.addoption("--improve", action="store_true", default=False)
parser.addoption("--maintain", action="store_true", default=False)
parser.addoption("--explore", action="store_true", default=False)
parser.addoption("--keep-answers", action="store_true", default=False)
parser.addoption("--no-dep", action="store_true")
parser.addoption("--mock", action="store_true")
parser.addoption("--host", default=None)
parser.addoption("--nc", action="store_true")
parser.addoption("--cutoff", action="store")
parser.addoption("--category", action="append")
parser.addoption("--test", action="append")
parser.addoption("--improve", action="store_true")
parser.addoption("--maintain", action="store_true")
parser.addoption("--explore", action="store_true")
parser.addoption("--keep-answers", action="store_true")
def pytest_configure(config: pytest.Config) -> None:
# Register category markers to prevent "unknown marker" warnings
for category in Category:
config.addinivalue_line("markers", f"{category.value}: {category}")
@pytest.fixture(autouse=True)
def check_regression(request: Any) -> None:
def check_regression(request: pytest.FixtureRequest) -> None:
"""
This pytest fixture is responsible for checking if a test is a regression test.
It is automatically used in every test due to the 'autouse=True' parameter.
The test name and the agent benchmark configuration are retrieved from the request object.
The regression reports are loaded from the path specified in the agent benchmark configuration.
If the "--improve" option is used and the test name exists in the regression tests, the test is skipped.
If the "--maintain" option is used and the test name does not exist in the regression tests, the test is also skipped.
Fixture that checks, for each test, whether it should be treated as a regression
test and whether it should be skipped on that basis.
The test name is retrieved from the `request` object. Regression reports are loaded
from the path specified in the benchmark configuration.
Effect:
* If the `--improve` option is used and the current test is considered a regression
test, it is skipped.
* If the `--maintain` option is used and the current test is not considered a
regression test, it is also skipped.
Args:
request (Any): The request object from which the test name and the agent benchmark configuration are retrieved.
request: The request object from which the test name and the benchmark
configuration are retrieved.
"""
test_name = request.node.parent.name
agent_benchmark_config = load_config_from_request(request)
with contextlib.suppress(Exception):
test = agent_benchmark_config.get_regression_reports_path()
data = json.loads(test)
with contextlib.suppress(FileNotFoundError):
regression_report = agbenchmark_config.regression_tests_file
data = json.loads(regression_report.read_bytes())
challenge_location = getattr(request.node.parent.cls, "CHALLENGE_LOCATION", "")
skip_string = f"Skipping {test_name} at {challenge_location}"
@@ -173,55 +133,33 @@ def check_regression(request: Any) -> None:
pytest.skip(f"{skip_string} because it's not a regression test")
# this is to get the challenge_data from every test
@pytest.fixture(autouse=True)
def challenge_data(request: Any) -> None:
"""
This pytest fixture is responsible for providing the challenge data for each test.
It is automatically used in every test due to the 'autouse=True' parameter.
The challenge data is retrieved from the request object's parameters.
This fixture is essential for the pytest system as it provides the necessary data for each test.
Args:
request (Any): The request object from which the challenge data is retrieved.
Returns:
None: The challenge data is directly passed to the test function and does not need to be returned.
"""
return request.param
@pytest.fixture(autouse=True, scope="session")
def mock(request: Any) -> None:
def mock(request: pytest.FixtureRequest) -> bool:
"""
This pytest fixture is responsible for retrieving the value of the "--mock" command-line option.
It is automatically used in every test session due to the 'autouse=True' parameter and 'session' scope.
The "--mock" option is used to run the tests in mock mode.
This fixture is essential for the pytest system as it provides the necessary command-line option value for each test session.
Pytest fixture that retrieves the value of the `--mock` command-line option.
The `--mock` option is used to run the tests in mock mode.
Args:
request (Any): The request object from which the "--mock" option value is retrieved.
request: The `pytest.FixtureRequest` from which the `--mock` option value
is retrieved.
Returns:
None: The "--mock" option value is directly passed to the test session and does not need to be returned.
bool: Whether `--mock` is set for this session.
"""
return request.config.getoption("--mock")
@pytest.fixture(autouse=True, scope="function")
def timer(request: Any) -> Any:
def timer(request: pytest.FixtureRequest) -> Generator[None, None, None]:
"""
This pytest fixture is responsible for timing the execution of each test.
It is automatically used in every test due to the 'autouse=True' parameter and 'function' scope.
Pytest fixture that times the execution of each test.
At the start of each test, it records the current time.
After the test function completes, it calculates the run time and appends it to the test node's user properties.
This allows the run time of each test to be accessed later for reporting or analysis.
After the test function completes, it calculates the run time and adds it to
the test node's `user_properties`.
Args:
request (Any): The request object from which the test node is retrieved.
Yields:
None: Control is yielded back to the test function.
request: The `pytest.FixtureRequest` object through which the run time is stored
in the test node's `user_properties`.
"""
start_time = time.time()
yield
@@ -229,33 +167,21 @@ def timer(request: Any) -> Any:
request.node.user_properties.append(("run_time", run_time))
def pytest_runtest_makereport(item: Any, call: Any) -> None:
def pytest_runtest_makereport(item: pytest.Item, call: pytest.CallInfo) -> None:
"""
This function is a pytest hook that is called when a test report is being generated.
Pytest hook that is called when a test report is being generated.
It is used to generate and finalize reports for each test.
Args:
item (Any): The test item for which the report is being generated.
call (Any): The call object from which the test result is retrieved.
item: The test item for which the report is being generated.
call: The call object from which the test result is retrieved.
"""
challenge_data = item.funcargs.get("challenge_data", None)
if not challenge_data:
# this will only happen for dummy dependency setup tests
return
challenge_location: str = getattr(item.cls, "CHALLENGE_LOCATION", "")
flags = (
"--test" in sys.argv
or "--maintain" in sys.argv
or "--improve" in sys.argv
or "--explore" in sys.argv
)
challenge: type[Challenge] = item.cls # type: ignore
challenge_data = challenge.data
challenge_location = challenge.CHALLENGE_LOCATION
if call.when == "call":
answers = getattr(item, "answers", None)
challenge_location: str = getattr(item.cls, "CHALLENGE_LOCATION", "")
test_name = item.nodeid.split("::")[1]
item.test_name = test_name
@@ -264,14 +190,14 @@ def pytest_runtest_makereport(item: Any, call: Any) -> None:
)
if call.when == "teardown":
finalize_reports(item, challenge_data)
finalize_reports(agbenchmark_config, item, challenge_data)
def timeout_monitor(start_time: int) -> None:
"""
This function is responsible for monitoring the total execution time of the test suite.
It runs in a separate thread and checks every second if the total execution time has exceeded the global timeout.
If the global timeout is exceeded, it terminates the pytest session with a specific return code.
Function that limits the total execution time of the test suite.
This function is supposed to be run in a separate thread and calls `pytest.exit`
if the total execution time has exceeded the global timeout.
Args:
start_time (int): The start time of the test suite.
@@ -282,14 +208,11 @@ def timeout_monitor(start_time: int) -> None:
pytest.exit("Test suite exceeded the global timeout", returncode=1)
def pytest_sessionstart(session: Any) -> None:
def pytest_sessionstart(session: pytest.Session) -> None:
"""
This function is a pytest hook that is called at the start of the test session.
It starts the timeout monitor in a separate thread.
The timeout monitor checks if the total execution time of the test suite has exceeded the global timeout.
Pytest hook that is called at the start of a test session.
Args:
session (Any): The pytest session object.
Sets up and runs a `timeout_monitor` in a separate thread.
"""
start_time = time.time()
t = threading.Thread(target=timeout_monitor, args=(start_time,))
@@ -297,94 +220,125 @@ def pytest_sessionstart(session: Any) -> None:
t.start()
def pytest_sessionfinish(session: Any) -> None:
def pytest_sessionfinish(session: pytest.Session) -> None:
"""
This function is a pytest hook that is called at the end of the test session.
It is used to finalize and save the test reports.
The reports are saved in a specific location defined in the suite reports.
Pytest hook that is called at the end of a test session.
Args:
session (Any): The pytest session object.
Finalizes and saves the test reports.
"""
session_finish(suite_reports)
session_finish(agbenchmark_config, suite_reports)
@pytest.fixture
def scores(request: Any) -> None:
def scores(request: pytest.FixtureRequest) -> None:
"""
This pytest fixture is responsible for retrieving the scores of the test class.
The scores are retrieved from the test class's 'scores' attribute using the test class name.
This fixture is essential for the pytest system as it provides the necessary scores for each test.
Pytest fixture that retrieves the scores of the test class.
The scores are retrieved from the `Challenge.scores` attribute
using the test class name.
Args:
request (Any): The request object from which the test class is retrieved.
Returns:
None: The scores are directly passed to the test function and do not need to be returned.
request: The request object.
"""
test_class_name = request.node.cls.__name__
return request.node.cls.scores.get(test_class_name)
challenge: type[Challenge] = request.node.cls
return challenge.scores.get(challenge.__name__)
# this is adding the dependency marker and category markers automatically from the json
def pytest_collection_modifyitems(items: Any, config: Any) -> None:
def pytest_collection_modifyitems(
items: list[pytest.Item], config: pytest.Config
) -> None:
"""
This function is a pytest hook that is called after the test collection has been performed.
It is used to modify the collected test items based on the agent benchmark configuration.
The function loads the agent benchmark configuration from the specified path and retrieves the regression reports.
For each test item, it checks if the test method exists and retrieves the dependencies and categories from the test class instance.
If the "--improve" or "--category" options are used, the dependencies are filtered based on the regression data.
If the "--test", "--no_dep", or "--maintain" options are used, the dependencies are cleared.
The function then dynamically adds the 'depends' and 'category' markers to the test item.
This function is essential for the pytest system as it provides the necessary modification of the test items based on the agent benchmark configuration.
Pytest hook that is called after initial test collection has been performed.
Modifies the collected test items based on the agent benchmark configuration,
adding the dependency marker and category markers.
Args:
items (Any): The collected test items to be modified.
config (Any): The pytest configuration object from which the agent benchmark configuration path is retrieved.
items: The collected test items to be modified.
config: The active pytest configuration.
"""
agent_benchmark_config_path = str(Path.cwd() / "agbenchmark_config" / "config.json")
try:
with open(agent_benchmark_config_path) as f:
agent_benchmark_config = AgentBenchmarkConfig(**json.load(f))
except json.JSONDecodeError:
print("Error: benchmark_config.json is not a valid JSON file.")
raise
regression_file = agent_benchmark_config.get_regression_reports_path()
data = (
json.loads(open(regression_file, "r").read())
if os.path.exists(regression_file)
else {}
regression_file = agbenchmark_config.regression_tests_file
regression_tests: dict[str, Any] = (
json.loads(regression_file.read_bytes()) if regression_file.is_file() else {}
)
for item in items:
# Assuming item.cls is your test class
test_class_instance = item.cls()
try:
challenges_beaten_in_the_past = json.loads(
agbenchmark_config.challenges_already_beaten_file.read_bytes()
)
except FileNotFoundError:
challenges_beaten_in_the_past = {}
if "test_method" not in item.name:
selected_tests: tuple[str] = config.getoption("--test") # type: ignore
selected_categories: tuple[str] = config.getoption("--category") # type: ignore
# Can't use a for-loop to remove items in-place
i = 0
while i < len(items):
item = items[i]
challenge = item.cls
challenge_name = item.cls.__name__
if not issubclass(challenge, Challenge):
item.warn(
pytest.PytestCollectionWarning(
f"Non-challenge item collected: {challenge}"
)
)
i += 1
continue
# Then you can access your properties
name = item.parent.cls.__name__
# dependencies = test_class_instance.data.dependencies
# --test: remove the test from the set if it's not specifically selected
if selected_tests and challenge.data.name not in selected_tests:
items.remove(item)
continue
# Filter dependencies if they exist in regression data if its an improvement test
# if config.getoption("--improve") or config.getoption(
# "--category"
# ):
# dependencies = [dep for dep in dependencies if not data.get(dep, None)]
# if (
# config.getoption("--test")
# or config.getoption("--no_dep")
# or config.getoption("--maintain")
# ):
dependencies = test_class_instance.dependencies
# Filter challenges for --maintain, --improve, and --explore:
# --maintain -> only challenges expected to be passed (= regression tests)
# --improve -> only challenges that so far are not passed (reliably)
# --explore -> only challenges that have never been passed
is_regression_test = regression_tests.get(challenge.data.name, None)
has_been_passed = challenges_beaten_in_the_past.get(challenge.data.name, False)
if (
(config.getoption("--maintain") and not is_regression_test)
or (config.getoption("--improve") and is_regression_test)
or (config.getoption("--explore") and has_been_passed)
):
items.remove(item)
continue
# Add depends marker dynamically
item.add_marker(pytest.mark.depends(on=dependencies, name=name))
dependencies = challenge.data.dependencies
if (
config.getoption("--test")
or config.getoption("--no-dep")
or config.getoption("--maintain")
):
# Ignore dependencies:
# --test -> user selected specific tests to run, don't care about deps
# --no-dep -> ignore dependency relations regardless of test selection
# --maintain -> all "regression" tests must pass, so run all of them
dependencies = []
elif config.getoption("--improve"):
# Filter dependencies, keep only deps that are not "regression" tests
dependencies = [
d for d in dependencies if not regression_tests.get(d, None)
]
categories = test_class_instance.data.category
# Set category markers
challenge_categories = [c.value for c in challenge.data.category]
for category in challenge_categories:
item.add_marker(category)
# Add category marker dynamically
for category in categories:
item.add_marker(getattr(pytest.mark, category))
# Enforce category selection
if selected_categories:
if not set(challenge_categories).intersection(set(selected_categories)):
items.remove(item)
continue
# # Filter dependencies, keep only deps from selected categories
# dependencies = [
# d for d in dependencies
# if not set(d.categories).intersection(set(selected_categories))
# ]
# Add marker for the DependencyManager
item.add_marker(pytest.mark.depends(on=dependencies, name=challenge_name))
i += 1
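
For completeness, a hypothetical test module using the fixtures defined above (`config` returns the shared `AgentBenchmarkConfig`, `temp_folder` yields the per-test workspace path). This is illustrative only and has not been verified against the full fixture set:

# Hypothetical test relying on the conftest fixtures above; not part of the diff.
from pathlib import Path

from agbenchmark.config import AgentBenchmarkConfig

def test_workspace_is_writable(config: AgentBenchmarkConfig, temp_folder: Path) -> None:
    output_file = temp_folder / "output.txt"
    output_file.write_text("Washington")
    assert output_file.read_text() == "Washington"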

View File

@@ -1,79 +0,0 @@
import platform
import queue
import select
import subprocess
import time
from threading import Thread
from typing import Any
import psutil
def run_linux_env(process: Any, start_time: float, timeout: float) -> None:
while True:
try:
# This checks if there's data to be read from stdout without blocking.
if process.stdout and select.select([process.stdout], [], [], 0)[0]:
output = process.stdout.readline()
print(output.strip())
except Exception as e:
continue
# Check if process has ended, has no more output, or exceeded timeout
if process.poll() is not None or (time.time() - start_time > timeout):
break
if time.time() - start_time > timeout:
print("The Python function has exceeded the time limit and was terminated.")
parent = psutil.Process(process.pid)
for child in parent.children(recursive=True):
child.kill()
parent.kill()
else:
print("The Python function has finished running.")
def enqueue_output(out: Any, my_queue: Any) -> None:
for line in iter(out.readline, b""):
my_queue.put(line)
out.close()
def run_windows_env(process: Any, start_time: float, timeout: float) -> None:
my_queue: Any = queue.Queue()
thread = Thread(target=enqueue_output, args=(process.stdout, my_queue))
thread.daemon = True
thread.start()
while True:
try:
output = my_queue.get_nowait().strip()
print(output)
except queue.Empty:
pass
if process.poll() is not None or (time.time() - start_time > timeout):
break
if time.time() - start_time > timeout:
print("The Python function has exceeded the time limit and was terminated.")
process.terminate()
def execute_subprocess(command, timeout):
process = subprocess.Popen(
command,
stdout=subprocess.PIPE,
stderr=subprocess.STDOUT,
universal_newlines=True,
bufsize=1,
)
start_time = time.time()
if platform.system() == "Windows":
run_windows_env(process, start_time, timeout)
else:
run_linux_env(process, start_time, timeout)
process.wait()
if process.returncode != 0:
print(f"The agent timed out")

View File

@@ -1,147 +1,34 @@
import glob
import importlib
import json
import logging
import os
import sys
import types
from collections import deque
from pathlib import Path
from typing import Any, Dict, Optional, Union
import pytest
from agbenchmark.__main__ import CHALLENGES_ALREADY_BEATEN
from agbenchmark.agent_api_interface import append_updates_file
from agbenchmark.agent_protocol_client.models.step import Step
from agbenchmark.utils.challenge import Challenge
from agbenchmark.utils.data_types import AgentBenchmarkConfig, ChallengeData
from agbenchmark.utils.data_types import ChallengeData
DATA_CATEGORY = {}
def create_single_test(
data: Dict[str, Any] | ChallengeData,
challenge_location: str,
file_datum: Optional[list[dict[str, Any]]] = None,
) -> None:
challenge_data = None
artifacts_location = None
if isinstance(data, ChallengeData):
challenge_data = data
data = data.get_data()
DATA_CATEGORY[data["name"]] = data["category"][0]
# Define test class dynamically
challenge_class = types.new_class(f"Test{data['name']}", (Challenge,))
print(challenge_location)
# clean_challenge_location = get_test_path(challenge_location)
setattr(challenge_class, "CHALLENGE_LOCATION", challenge_location)
setattr(
challenge_class,
"ARTIFACTS_LOCATION",
artifacts_location or str(Path(challenge_location).resolve().parent),
)
# Define test method within the dynamically created class
@pytest.mark.asyncio
async def test_method(self, config: Dict[str, Any], request) -> None: # type: ignore
# create a random number between 0 and 1
test_name = self.data.name
try:
with open(CHALLENGES_ALREADY_BEATEN, "r") as f:
challenges_beaten_in_the_past = json.load(f)
except:
challenges_beaten_in_the_past = {}
if request.config.getoption("--explore") and challenges_beaten_in_the_past.get(
test_name, False
):
return None
# skip optional categories
self.skip_optional_categories(config)
from helicone.lock import HeliconeLockManager
if os.environ.get("HELICONE_API_KEY"):
HeliconeLockManager.write_custom_property("challenge", self.data.name)
cutoff = self.data.cutoff or 60
timeout = cutoff
if "--nc" in sys.argv:
timeout = 100000
if "--cutoff" in sys.argv:
timeout = int(sys.argv[sys.argv.index("--cutoff") + 1])
await self.setup_challenge(config, timeout)
scores = self.get_scores(config)
request.node.answers = (
scores["answers"] if "--keep-answers" in sys.argv else None
)
del scores["answers"] # remove answers from scores
request.node.scores = scores # store scores in request.node
is_score_100 = 1 in scores["values"]
evaluation = "Correct!" if is_score_100 else "Incorrect."
eval_step = Step(
input=evaluation,
additional_input=None,
task_id="irrelevant, this step is a hack",
step_id="irrelevant, this step is a hack",
name="",
status="created",
output=None,
additional_output=None,
artifacts=[],
is_last=True,
)
await append_updates_file(eval_step)
assert is_score_100
# Parametrize the method here
test_method = pytest.mark.parametrize(
"challenge_data",
[data],
indirect=True,
)(test_method)
setattr(challenge_class, "test_method", test_method)
# Attach the new class to a module so it can be discovered by pytest
module = importlib.import_module(__name__)
setattr(module, f"Test{data['name']}", challenge_class)
return challenge_class
logger = logging.getLogger(__name__)
def create_single_suite_challenge(challenge_data: ChallengeData, path: Path) -> None:
create_single_test(challenge_data, str(path))
def create_challenge_from_spec_file(spec_file: Path) -> type[Challenge]:
challenge = Challenge.from_challenge_spec(spec_file)
DATA_CATEGORY[challenge.data.name] = challenge.data.category[0].value
return challenge
def create_challenge(
data: Dict[str, Any],
json_file: str,
json_files: deque,
) -> Union[deque, Any]:
path = Path(json_file).resolve()
print("Creating challenge for", path)
challenge_class = create_single_test(data, str(path))
print("Creation complete for", path)
return json_files, challenge_class
def create_challenge_from_spec_file_path(spec_file_path: str) -> type[Challenge]:
spec_file = Path(spec_file_path).resolve()
return create_challenge_from_spec_file(spec_file)
def generate_tests() -> None: # sourcery skip: invert-any-all
print("Generating tests...")
def load_challenges() -> None:
logger.info("Loading challenges...")
challenges_path = os.path.join(os.path.dirname(__file__), "challenges")
print(f"Looking for challenges in {challenges_path}...")
logger.debug(f"Looking for challenges in {challenges_path}...")
json_files = deque(
glob.glob(
@@ -150,74 +37,39 @@ def generate_tests() -> None: # sourcery skip: invert-any-all
)
)
print(f"Found {len(json_files)} challenges.")
print(f"Sample path: {json_files[0]}")
agent_benchmark_config_path = str(Path.cwd() / "agbenchmark_config" / "config.json")
try:
with open(agent_benchmark_config_path, "r") as f:
agent_benchmark_config = AgentBenchmarkConfig(**json.load(f))
agent_benchmark_config.agent_benchmark_config_path = (
agent_benchmark_config_path
)
except json.JSONDecodeError:
print("Error: benchmark_config.json is not a valid JSON file.")
raise
regression_reports_path = agent_benchmark_config.get_regression_reports_path()
if regression_reports_path and os.path.exists(regression_reports_path):
with open(regression_reports_path, "r") as f:
regression_tests = json.load(f)
else:
regression_tests = {}
logger.debug(f"Found {len(json_files)} challenges.")
logger.debug(f"Sample path: {json_files[0]}")
loaded, ignored = 0, 0
while json_files:
json_file = (
json_files.popleft()
) # Take and remove the first element from json_files
# Take and remove the first element from json_files
json_file = json_files.popleft()
if challenge_should_be_ignored(json_file):
ignored += 1
continue
data = ChallengeData.get_json_from_path(json_file)
challenge_info = ChallengeData.parse_file(json_file)
commands = sys.argv
# --by flag
if "--category" in commands:
categories = data.get("category", [])
commands_set = set(commands)
challenge_class = create_challenge_from_spec_file_path(json_file)
# Convert the combined list to a set
categories_set = set(categories)
logger.debug(f"Generated test for {challenge_info.name}")
_add_challenge_to_module(challenge_class)
loaded += 1
# If there's no overlap with commands
if not categories_set.intersection(commands_set):
continue
# --test flag, only run the test if it's the exact one specified
tests = []
for command in commands:
if command.startswith("--test="):
tests.append(command.split("=")[1])
if tests and data["name"] not in tests:
continue
# --maintain and --improve flag
in_regression = regression_tests.get(data["name"], None)
improve_flag = in_regression and "--improve" in commands
maintain_flag = not in_regression and "--maintain" in commands
if "--maintain" in commands and maintain_flag:
continue
elif "--improve" in commands and improve_flag:
continue
json_files, challenge_class = create_challenge(data, json_file, json_files)
print(f"Generated test for {data['name']}.")
print("Test generation complete.")
logger.info(f"Loading challenges complete: loaded {loaded}, ignored {ignored}.")
def challenge_should_be_ignored(json_file):
return "challenges/deprecated" in json_file or "challenges/library" in json_file
def challenge_should_be_ignored(json_file_path: str):
return (
"challenges/deprecated" in json_file_path
or "challenges/library" in json_file_path
)
generate_tests()
def _add_challenge_to_module(challenge: type[Challenge]):
# Attach the Challenge class to this module so it can be discovered by pytest
module = importlib.import_module(__name__)
setattr(module, f"{challenge.__name__}", challenge)
load_challenges()
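
For illustration, a minimal sketch of the dynamic test discovery pattern used by `load_challenges` and `_add_challenge_to_module` above: a class is built at import time and attached to the module so pytest's collector can find it. `DummyChallenge` and the spec path are hypothetical stand-ins, not part of the benchmark.

```python
# Sketch only: DummyChallenge and the spec path are made-up stand-ins.
import importlib
import sys


class DummyChallenge:
    """Stand-in for a dynamically generated Challenge subclass."""

    CHALLENGE_LOCATION = "challenges/example/data.json"  # hypothetical path


def _add_challenge_to_module(challenge: type) -> None:
    # Attach the class to this module so pytest can discover it by name
    module = importlib.import_module(__name__)
    setattr(module, challenge.__name__, challenge)


_add_challenge_to_module(type("TestExample", (DummyChallenge,), {}))
assert hasattr(sys.modules[__name__], "TestExample")
```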


@@ -0,0 +1,153 @@
import logging
import os
from pathlib import Path
from typing import Optional, Sequence
from dotenv import load_dotenv
from agbenchmark.challenges import get_unique_categories
from agbenchmark.config import AgentBenchmarkConfig
load_dotenv()
logger = logging.getLogger(__name__)
def run_benchmark(
config: AgentBenchmarkConfig,
maintain: bool = False,
improve: bool = False,
explore: bool = False,
tests: tuple[str] = tuple(),
categories: tuple[str] = tuple(),
skip_categories: tuple[str] = tuple(),
mock: bool = False,
no_dep: bool = False,
no_cutoff: bool = False,
cutoff: Optional[int] = None,
keep_answers: bool = False,
server: bool = False,
) -> int:
"""
Starts the benchmark. If a category flag is provided, only challenges with the
corresponding mark will be run.
"""
import pytest
from agbenchmark.reports.ReportManager import SingletonReportManager
validate_args(
maintain=maintain,
improve=improve,
explore=explore,
tests=tests,
categories=categories,
skip_categories=skip_categories,
no_cutoff=no_cutoff,
cutoff=cutoff,
)
SingletonReportManager()
for key, value in vars(config).items():
logger.debug(f"config.{key} = {repr(value)}")
pytest_args = ["-vs"]
if tests:
logger.info(f"Running specific test(s): {' '.join(tests)}")
pytest_args += [f"--test={t}" for t in tests]
else:
all_categories = get_unique_categories()
if categories or skip_categories:
categories_to_run = set(categories) or all_categories
if skip_categories:
categories_to_run = categories_to_run.difference(set(skip_categories))
assert categories_to_run, "Error: You can't skip all categories"
pytest_args += [f"--category={c}" for c in categories_to_run]
logger.info(f"Running tests of category: {categories_to_run}")
else:
logger.info("Running all categories")
if maintain:
logger.info("Running only regression tests")
elif improve:
logger.info("Running only non-regression tests")
elif explore:
logger.info("Only attempt challenges that have never been beaten")
if mock:
# TODO: unhack
os.environ[
"IS_MOCK"
] = "True" # ugly hack to make the mock work when calling from API
# Pass through flags
for flag, active in {
"--maintain": maintain,
"--improve": improve,
"--explore": explore,
"--no-dep": no_dep,
"--mock": mock,
"--nc": no_cutoff,
"--keep-answers": keep_answers,
}.items():
if active:
pytest_args.append(flag)
if cutoff:
pytest_args.append(f"--cutoff={cutoff}")
logger.debug(f"Setting cuttoff override to {cutoff} seconds.")
current_dir = Path(__file__).resolve().parent
pytest_args.append(str(current_dir / "generate_test.py"))
pytest_args.append("--cache-clear")
exit_code = pytest.main(pytest_args)
SingletonReportManager.clear_instance()
return exit_code
class InvalidInvocationError(ValueError):
pass
def validate_args(
maintain: bool,
improve: bool,
explore: bool,
tests: Sequence[str],
categories: Sequence[str],
skip_categories: Sequence[str],
no_cutoff: bool,
cutoff: Optional[int],
) -> None:
if categories:
all_categories = get_unique_categories()
invalid_categories = set(categories) - all_categories
if invalid_categories:
raise InvalidInvocationError(
"One or more invalid categories were specified: "
f"{', '.join(invalid_categories)}.\n"
f"Valid categories are: {', '.join(all_categories)}."
)
if (maintain + improve + explore) > 1:
raise InvalidInvocationError(
"You can't use --maintain, --improve or --explore at the same time. "
"Please choose one."
)
if tests and (categories or skip_categories or maintain or improve or explore):
raise InvalidInvocationError(
"If you're running a specific test make sure no other options are "
"selected. Please just pass the --test."
)
if no_cutoff and cutoff:
raise InvalidInvocationError(
"You can't use both --nc and --cutoff at the same time. "
"Please choose one."
)
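
As a rough, self-contained sketch of the mutual-exclusion check `validate_args` performs on `--maintain`, `--improve` and `--explore` (the helper name below is illustrative, not the benchmark's API):

```python
class InvalidInvocationError(ValueError):
    pass


def check_mode_flags(maintain: bool, improve: bool, explore: bool) -> None:
    # Booleans sum as integers, so selecting more than one flag raises.
    if (maintain + improve + explore) > 1:
        raise InvalidInvocationError(
            "You can't use --maintain, --improve or --explore at the same time. "
            "Please choose one."
        )


check_mode_flags(maintain=True, improve=False, explore=False)  # OK
try:
    check_mode_flags(maintain=True, improve=True, explore=False)
except InvalidInvocationError as e:
    print(f"Rejected: {e}")
```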


@@ -4,11 +4,12 @@ import os
import sys
import time
from datetime import datetime, timezone
from pathlib import Path
from agbenchmark.config import AgentBenchmarkConfig
from agbenchmark.reports.processing.graphs import save_single_radar_chart
from agbenchmark.reports.processing.process_report import get_agent_category
from agbenchmark.reports.processing.report_types import Report
from agbenchmark.utils.data_types import AgentBenchmarkConfig
from agbenchmark.utils.utils import get_highest_success_difficulty
@@ -16,32 +17,26 @@ class SingletonReportManager:
instance = None
def __new__(cls):
from agbenchmark.reports.agent_benchmark_config import (
get_agent_benchmark_config,
)
if not cls.instance:
cls.instance = super(SingletonReportManager, cls).__new__(cls)
agent_benchmark_config = get_agent_benchmark_config()
agent_benchmark_config = AgentBenchmarkConfig.load()
benchmark_start_time_dt = datetime.now(
timezone.utc
) # or any logic to fetch the datetime
# Make the Managers class attributes
cls.REGRESSION_MANAGER = ReportManager(
agent_benchmark_config.get_regression_reports_path(),
agent_benchmark_config.regression_tests_file,
benchmark_start_time_dt,
)
cls.INFO_MANAGER = ReportManager(
str(
agent_benchmark_config.get_reports_path(benchmark_start_time_dt)
/ "report.json"
),
agent_benchmark_config.get_report_dir(benchmark_start_time_dt)
/ "report.json",
benchmark_start_time_dt,
)
cls.INTERNAL_INFO_MANAGER = ReportManager(
agent_benchmark_config.get_success_rate_path(), benchmark_start_time_dt
agent_benchmark_config.success_rate_file, benchmark_start_time_dt
)
return cls.instance
@@ -57,21 +52,20 @@ class SingletonReportManager:
class ReportManager:
"""Abstracts interaction with the regression tests file"""
def __init__(self, filename: str, benchmark_start_time: str):
self.filename = filename
def __init__(self, report_file: Path, benchmark_start_time: datetime):
self.report_file = report_file
self.start_time = time.time()
self.benchmark_start_time = benchmark_start_time
self.load()
def load(self) -> None:
if not os.path.exists(self.filename):
os.makedirs(os.path.dirname(self.filename), exist_ok=True)
with open(self.filename, "w") as f:
pass
if not self.report_file.exists():
self.report_file.parent.mkdir(exist_ok=True)
self.report_file.touch()
try:
with open(self.filename, "r") as f:
with self.report_file.open("r") as f:
file_content = (
f.read().strip()
) # read the content and remove any leading/trailing whitespace
@@ -87,7 +81,7 @@ class ReportManager:
self.save()
def save(self) -> None:
with open(self.filename, "w") as f:
with self.report_file.open("w") as f:
json.dump(self.tests, f, indent=4)
def add_test(self, test_name: str, test_details: dict | list) -> None:
@@ -137,7 +131,7 @@ class ReportManager:
if len(agent_categories) > 1:
save_single_radar_chart(
agent_categories,
config.get_reports_path(self.benchmark_start_time) / "radar_chart.png",
config.get_report_dir(self.benchmark_start_time) / "radar_chart.png",
)
self.save()
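
A hedged sketch of the pathlib-based "create if missing, then read" pattern the refactored `ReportManager.load` follows; the file path is an example, and the real class also stores the parsed result on `self.tests`.

```python
import json
from pathlib import Path


def load_report(report_file: Path) -> dict:
    # Create the parent directory and an empty file on first use
    if not report_file.exists():
        report_file.parent.mkdir(parents=True, exist_ok=True)
        report_file.touch()
    content = report_file.read_text().strip()
    return json.loads(content) if content else {}


print(load_report(Path("example_reports/report.json")))  # hypothetical path
```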


@@ -1,18 +0,0 @@
import json
from pathlib import Path
from agbenchmark.utils.data_types import AgentBenchmarkConfig
def get_agent_benchmark_config() -> AgentBenchmarkConfig:
agent_benchmark_config_path = str(Path.cwd() / "agbenchmark_config" / "config.json")
try:
with open(agent_benchmark_config_path, "r") as f:
agent_benchmark_config = AgentBenchmarkConfig(**json.load(f))
agent_benchmark_config.agent_benchmark_config_path = (
agent_benchmark_config_path
)
return agent_benchmark_config
except json.JSONDecodeError:
print("Error: benchmark_config.json is not a valid JSON file.")
raise


@@ -1,4 +1,5 @@
import json
import logging
import os
from pathlib import Path
from typing import Any
@@ -9,6 +10,8 @@ from agbenchmark.reports.processing.get_files import (
from agbenchmark.reports.processing.report_types import Report, Test
from agbenchmark.utils.data_types import STRING_DIFFICULTY_MAP
logger = logging.getLogger(__name__)
def get_reports_data(report_path: str) -> dict[str, Any]:
latest_files = get_latest_report_from_agent_directories(report_path)
@@ -60,7 +63,7 @@ def all_agent_categories(reports_data: dict[str, Any]) -> dict[str, Any]:
for name, report in reports_data.items():
categories = get_agent_category(report)
if categories: # only add to all_categories if categories is not empty
print(f"Adding {name}: {categories}")
logger.debug(f"Adding {name}: {categories}")
all_categories[name] = categories
return all_categories


@@ -1,7 +1,6 @@
from typing import Dict, List
from pydantic import BaseModel, constr
datetime_format = r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\+00:00$"
from pydantic import BaseModel, constr
class BaseModelBenchmark(BaseModel):
@@ -14,32 +13,32 @@ class TaskInfo(BaseModelBenchmark):
is_regression: bool | None
answer: str
description: str
category: List[str]
category: list[str]
task: str
class RepositoryInfo(BaseModelBenchmark):
repo_url: str | None
team_name: str | None
benchmark_git_commit_sha: str | None
agent_git_commit_sha: str | None
repo_url: str | None = None
team_name: str | None = None
agent_git_commit_sha: str | None = None
benchmark_git_commit_sha: str | None = None
class Metrics(BaseModelBenchmark):
difficulty: str | None
cost: float | None = None
success: bool
success_percentage: float | None
run_time: str | None
fail_reason: str | None
attempted: bool
cost: float | None
difficulty: str | None = None
run_time: str | None = None
fail_reason: str | None = None
success_percentage: float | None = None
class RunDetails(BaseModelBenchmark):
test_name: str
run_id: str | None
run_id: str | None = None
command: str
completion_time: str | None
completion_time: str | None = None
benchmark_start_time: constr(regex=datetime_format)
@@ -48,5 +47,5 @@ class BenchmarkRun(BaseModelBenchmark):
run_details: RunDetails
task_info: TaskInfo
metrics: Metrics
reached_cutoff: bool | None
config: Dict[str, str | dict[str, str]]
reached_cutoff: bool | None = None
config: dict[str, str | dict[str, str]]
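
For example, with explicit `= None` defaults the optional fields can simply be omitted when a report entry is built; the model below is a trimmed-down illustration, not the full `RepositoryInfo` class.

```python
from pydantic import BaseModel


class RepoInfo(BaseModel):
    repo_url: str | None = None
    team_name: str | None = None


print(RepoInfo())                                      # both default to None
print(RepoInfo(repo_url="https://example.com/agent"))  # team_name stays None
```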


@@ -1,20 +1,24 @@
import json
import logging
import os
import sys
from pathlib import Path
from typing import Any, Dict
from agbenchmark.__main__ import CHALLENGES_ALREADY_BEATEN
from agbenchmark.reports.agent_benchmark_config import get_agent_benchmark_config
import pytest
from agbenchmark.config import AgentBenchmarkConfig
from agbenchmark.reports.ReportManager import SingletonReportManager
from agbenchmark.utils.data_types import DifficultyLevel
from agbenchmark.utils.data_types import ChallengeData, DifficultyLevel
from agbenchmark.utils.get_data_from_helicone import get_data_from_helicone
from agbenchmark.utils.utils import calculate_success_percentage
logger = logging.getLogger(__name__)
def get_previous_test_results(
test_name: str, info_details: dict[str, Any]
) -> list[bool]:
agent_tests: dict[str, list[bool]] = {}
mock = os.getenv("IS_MOCK") # Check if --mock is in sys.argv
prev_test_results = SingletonReportManager().INTERNAL_INFO_MANAGER.tests.get(
@@ -49,17 +53,14 @@ def update_regression_tests(
def generate_single_call_report(
item: Any,
call: Any,
challenge_data: dict[str, Any],
item: pytest.Item,
call: pytest.CallInfo,
challenge_data: ChallengeData,
answers: dict[str, Any],
challenge_location,
test_name,
challenge_location: str,
test_name: str,
) -> None:
try:
difficulty = challenge_data["info"]["difficulty"]
except KeyError:
return None
difficulty = challenge_data.info.difficulty
if isinstance(difficulty, DifficultyLevel):
difficulty = difficulty.value
@@ -77,10 +78,10 @@ def generate_single_call_report(
info_details: Any = {
"data_path": challenge_location,
"is_regression": False,
"category": challenge_data["category"],
"task": challenge_data["task"],
"answer": challenge_data["ground"]["answer"],
"description": challenge_data["info"]["description"],
"category": challenge_data.category,
"task": challenge_data.task,
"answer": challenge_data.ground.answer,
"description": challenge_data.info.description,
"metrics": {
"difficulty": difficulty,
"success": False,
@@ -91,8 +92,8 @@ def generate_single_call_report(
if answers:
info_details["answers"] = answers
if "metadata" in challenge_data:
info_details["metadata"] = challenge_data["metadata"]
if challenge_data.metadata:
info_details["metadata"] = challenge_data.metadata
mock = os.getenv("IS_MOCK") # Check if --mock is in sys.argv
if call:
@@ -116,7 +117,9 @@ def generate_single_call_report(
return info_details
def finalize_reports(item: Any, challenge_data: dict[str, Any]) -> None:
def finalize_reports(
config: AgentBenchmarkConfig, item: pytest.Item, challenge_data: ChallengeData
) -> None:
run_time = dict(item.user_properties).get("run_time")
info_details = getattr(item, "info_details", {})
@@ -126,8 +129,9 @@ def finalize_reports(item: Any, challenge_data: dict[str, Any]) -> None:
if run_time is not None:
cost = None
if "--mock" not in sys.argv and os.environ.get("HELICONE_API_KEY"):
print("Getting cost from Helicone")
logger.debug("Getting cost from Helicone")
cost = get_data_from_helicone(test_name)
logger.debug(f"Cost: {cost}")
info_details["metrics"]["cost"] = cost
@@ -142,29 +146,33 @@ def finalize_reports(item: Any, challenge_data: dict[str, Any]) -> None:
info_details["metrics"]["run_time"] = f"{str(round(run_time, 3))} seconds"
info_details["reached_cutoff"] = float(run_time) > challenge_data["cutoff"]
info_details["reached_cutoff"] = float(run_time) > challenge_data.cutoff
if "--mock" not in sys.argv:
update_challenges_already_beaten(info_details, test_name)
update_challenges_already_beaten(
config.challenges_already_beaten_file, info_details, test_name
)
if info_details.get("tests") is not None:
for nested_test_name, nested_test_info in info_details[
"tests"
].items():
update_challenges_already_beaten(
nested_test_info, nested_test_name
config.challenges_already_beaten_file,
nested_test_info,
nested_test_name,
)
SingletonReportManager().INFO_MANAGER.add_test(test_name, info_details)
def update_challenges_already_beaten(
info_details: Dict[str, Any], test_name: str
challenges_already_beaten_file: Path, info_details: Dict[str, Any], test_name: str
) -> None:
current_run_successful = info_details["metrics"]["success"]
try:
with open(CHALLENGES_ALREADY_BEATEN, "r") as f:
with open(challenges_already_beaten_file, "r") as f:
challenge_data = json.load(f)
except:
except FileNotFoundError:
challenge_data = {}
challenge_beaten_in_the_past = challenge_data.get(test_name)
@@ -172,13 +180,13 @@ def update_challenges_already_beaten(
if challenge_beaten_in_the_past is None and not current_run_successful:
challenge_data[test_name] = False
with open(CHALLENGES_ALREADY_BEATEN, "w") as f:
with open(challenges_already_beaten_file, "w") as f:
json.dump(challenge_data, f, indent=4)
def session_finish(suite_reports: dict) -> None:
agent_benchmark_config = get_agent_benchmark_config()
def session_finish(
agbenchmark_config: AgentBenchmarkConfig, suite_reports: dict
) -> None:
SingletonReportManager().INTERNAL_INFO_MANAGER.save()
SingletonReportManager().INFO_MANAGER.end_info_report(agent_benchmark_config)
SingletonReportManager().INFO_MANAGER.end_info_report(agbenchmark_config)
SingletonReportManager().REGRESSION_MANAGER.save()


@@ -1,79 +1,14 @@
# generated by fastapi-codegen:
# filename: ../../postman/schemas/openapi.yaml
# timestamp: 2023-08-25T10:36:11+00:00
from __future__ import annotations
from datetime import datetime
from enum import Enum
from typing import List, Optional
from typing import Optional
from pydantic import BaseModel, Field
class ArtifactUpload(BaseModel):
file: str = Field(..., description="File to upload.", format="binary")
relative_path: str = Field(
...,
description="Relative path of the artifact in the agent's workspace.",
example="python/code",
)
class Pagination(BaseModel):
total_items: int = Field(..., description="Total number of items.", example=42)
total_pages: int = Field(..., description="Total number of pages.", example=97)
current_page: int = Field(..., description="Current_page page number.", example=1)
page_size: int = Field(..., description="Number of items per page.", example=25)
class TaskInput(BaseModel):
pass
class Artifact(BaseModel):
created_at: datetime = Field(
...,
description="The creation datetime of the task.",
example="2023-01-01T00:00:00Z",
json_encoders={datetime: lambda v: v.isoformat()},
)
modified_at: datetime = Field(
...,
description="The modification datetime of the task.",
example="2023-01-01T00:00:00Z",
json_encoders={datetime: lambda v: v.isoformat()},
)
artifact_id: str = Field(
...,
description="ID of the artifact.",
example="b225e278-8b4c-4f99-a696-8facf19f0e56",
)
agent_created: bool = Field(
...,
description="Whether the artifact has been created by the agent.",
example=False,
)
relative_path: str = Field(
...,
description="Relative path of the artifact in the agents workspace.",
example="/my_folder/my_other_folder/",
)
file_name: str = Field(
...,
description="Filename of the artifact.",
example="main.py",
)
class StepInput(BaseModel):
pass
class StepOutput(BaseModel):
pass
class TaskRequestBody(BaseModel):
input: str = Field(
...,
@@ -86,108 +21,3 @@ class TaskRequestBody(BaseModel):
class TaskEvalRequestBody(TaskRequestBody):
eval_id: str
class Task(TaskRequestBody):
created_at: datetime = Field(
...,
description="The creation datetime of the task.",
example="2023-01-01T00:00:00Z",
json_encoders={datetime: lambda v: v.isoformat()},
)
modified_at: datetime = Field(
...,
description="The modification datetime of the task.",
example="2023-01-01T00:00:00Z",
json_encoders={datetime: lambda v: v.isoformat()},
)
task_id: str = Field(
...,
description="The ID of the task.",
example="50da533e-3904-4401-8a07-c49adf88b5eb",
)
artifacts: Optional[List[Artifact]] = Field(
[],
description="A list of artifacts that the task has produced.",
example=[
"7a49f31c-f9c6-4346-a22c-e32bc5af4d8e",
"ab7b4091-2560-4692-a4fe-d831ea3ca7d6",
],
)
class StepRequestBody(BaseModel):
name: Optional[str] = Field(
None, description="The name of the task step.", example="Write to file"
)
input: Optional[str] = Field(
None,
min_length=1,
description="Input prompt for the step.",
example="Washington",
)
additional_input: Optional[StepInput] = {}
class Status(Enum):
created = "created"
running = "running"
completed = "completed"
class Step(StepRequestBody):
created_at: datetime = Field(
...,
description="The creation datetime of the task.",
example="2023-01-01T00:00:00Z",
json_encoders={datetime: lambda v: v.isoformat()},
)
modified_at: datetime = Field(
...,
description="The modification datetime of the task.",
example="2023-01-01T00:00:00Z",
json_encoders={datetime: lambda v: v.isoformat()},
)
task_id: str = Field(
...,
description="The ID of the task this step belongs to.",
example="50da533e-3904-4401-8a07-c49adf88b5eb",
)
step_id: str = Field(
...,
description="The ID of the task step.",
example="6bb1801a-fd80-45e8-899a-4dd723cc602e",
)
name: Optional[str] = Field(
None, description="The name of the task step.", example="Write to file"
)
status: Status = Field(
..., description="The status of the task step.", example="created"
)
output: Optional[str] = Field(
None,
description="Output of the task step.",
example="I am going to use the write_to_file command and write Washington to a file called output.txt <write_to_file('output.txt', 'Washington')",
)
additional_output: Optional[StepOutput] = {}
artifacts: Optional[List[Artifact]] = Field(
[], description="A list of artifacts that the step has produced."
)
is_last: bool = Field(
..., description="Whether this is the last step in the task.", example=True
)
class TaskListResponse(BaseModel):
tasks: Optional[List[Task]] = None
pagination: Optional[Pagination] = None
class TaskStepsListResponse(BaseModel):
steps: Optional[List[Step]] = None
pagination: Optional[Pagination] = None
class TaskArtifactsListResponse(BaseModel):
artifacts: Optional[List[Artifact]] = None
pagination: Optional[Pagination] = None


@@ -1,17 +1,20 @@
import glob
import json
import logging
import math
import os
import subprocess
import sys
from abc import ABC
from pathlib import Path
from typing import Any, Dict, List
from typing import Any, ClassVar, List
import openai
import pytest
from colorama import Fore, Style
from agbenchmark.__main__ import OPTIONAL_CATEGORIES, TEMP_FOLDER_ABS_PATH
from agbenchmark.agent_api_interface import run_api_agent
from agbenchmark.config import AgentBenchmarkConfig
from agbenchmark.utils.data_types import ChallengeData, Ground
from agbenchmark.utils.prompts import (
END_PROMPT,
@@ -19,43 +22,84 @@ from agbenchmark.utils.prompts import (
PROMPT_MAP,
SCORING_MAP,
)
from agbenchmark.utils.utils import agent_eligibible_for_optional_categories
logger = logging.getLogger(__name__)
with open(
Path(__file__).parent.parent / "challenges" / "optional_categories.json"
) as f:
OPTIONAL_CATEGORIES: list[str] = json.load(f)["optional_categories"]
class Challenge(ABC):
"""The parent class to all specific challenges classes.
Defines helper methods for running a challenge"""
_data_cache: Dict[str, ChallengeData] = {}
CHALLENGE_LOCATION: str = ""
scores: dict[str, Any] = {} # this is for suites
data: ChallengeData
CHALLENGE_LOCATION: ClassVar[str]
ARTIFACTS_LOCATION: ClassVar[str]
scores: ClassVar[dict[str, Any]] = {} # this is for suites
@property
def data(self) -> ChallengeData:
if self.CHALLENGE_LOCATION not in self._data_cache:
self._data_cache[self.CHALLENGE_LOCATION] = ChallengeData.deserialize(
self.CHALLENGE_LOCATION
)
return self._data_cache[self.CHALLENGE_LOCATION]
@staticmethod
def from_challenge_spec(spec_file: Path) -> type["Challenge"]:
challenge_data = ChallengeData.parse_file(spec_file)
@property
def task(self) -> str:
return self.data.task
challenge_class_name = f"Test{challenge_data.name}"
logger.debug(f"Creating {challenge_class_name} from spec: {spec_file}")
return type(
challenge_class_name,
(Challenge,),
{
"data": challenge_data,
"CHALLENGE_LOCATION": str(spec_file),
"ARTIFACTS_LOCATION": str(spec_file.resolve().parent),
},
)
@property
def dependencies(self) -> list:
return self.data.dependencies
# Define test method within the dynamically created class
@pytest.mark.asyncio
async def test_method(
self, config: AgentBenchmarkConfig, request: pytest.FixtureRequest
) -> None:
# skip optional categories
self.skip_optional_categories(config)
async def setup_challenge(self, config: Dict[str, Any], cutoff: int) -> None:
if os.environ.get("HELICONE_API_KEY"):
from helicone.lock import HeliconeLockManager
HeliconeLockManager.write_custom_property("challenge", self.data.name)
timeout = self.data.cutoff or 60
if request.config.getoption("--nc"):
timeout = 100000
elif cutoff := request.config.getoption("--cutoff"):
timeout = int(cutoff)
await self.run_challenge(config, timeout)
scores = self.get_scores(config.temp_folder)
request.node.answers = (
scores["answers"] if request.config.getoption("--keep-answers") else None
)
del scores["answers"] # remove answers from scores
request.node.scores = scores # store scores in request.node
is_score_100 = 1 in scores["values"]
assert is_score_100
async def run_challenge(self, config: AgentBenchmarkConfig, cutoff: int) -> None:
from agbenchmark.agent_interface import copy_artifacts_into_temp_folder
if not self.task:
if not self.data.task:
return
print(
f"\033[1;35m============Starting {self.data.name} challenge============\033[0m"
f"{Fore.MAGENTA + Style.BRIGHT}{'='*24} "
f"Starting {self.data.name} challenge"
f" {'='*24}{Style.RESET_ALL}"
)
print(f"\033[1;30mTask: {self.task}\033[0m")
print(f"{Fore.BLACK}Task: {self.data.task}{Fore.RESET}")
await run_api_agent(self.data, config, self.ARTIFACTS_LOCATION, cutoff)
@@ -66,13 +110,11 @@ class Challenge(ABC):
str(Path(self.CHALLENGE_LOCATION).parent),
]
for path in artifact_paths:
copy_artifacts_into_temp_folder(TEMP_FOLDER_ABS_PATH, "custom_python", path)
def test_method(self, config: Dict[str, Any]) -> None:
raise NotImplementedError
copy_artifacts_into_temp_folder(config.temp_folder, "custom_python", path)
@staticmethod
def get_artifacts_out(
self, workspace: str | dict[str, str], ground: Ground
workspace: str | Path | dict[str, str], ground: Ground
) -> List[str]:
if isinstance(workspace, dict):
workspace = workspace["output"]
@@ -108,7 +150,7 @@ class Challenge(ABC):
if ground.eval.type == "pytest":
result = subprocess.run(
[sys.executable, "-m", "pytest"],
cwd=TEMP_FOLDER_ABS_PATH,
cwd=os.path.abspath(workspace),
capture_output=True,
text=True,
)
@@ -119,15 +161,17 @@ class Challenge(ABC):
return files_contents
def scoring(self, config: Dict[str, Any], content: str, ground: Ground) -> float:
print("\033[1;34mScoring content:\033[0m", content)
@staticmethod
def scoring(content: str, ground: Ground) -> float:
print(f"{Fore.BLUE}Scoring content:{Style.RESET_ALL}", content)
if ground.should_contain:
for should_contain_word in ground.should_contain:
if not getattr(ground, "case_sensitive", True):
should_contain_word = should_contain_word.lower()
content = content.lower()
print_content = (
f"\033[1;34mWord that should exist\033[0m - {should_contain_word}:"
f"{Fore.BLUE}Word that should exist{Style.RESET_ALL}"
f" - {should_contain_word}:"
)
if should_contain_word not in content:
print(print_content, "False")
@@ -140,7 +184,10 @@ class Challenge(ABC):
if not getattr(ground, "case_sensitive", True):
should_not_contain_word = should_not_contain_word.lower()
content = content.lower()
print_content = f"\033[1;34mWord that should not exist\033[0m - {should_not_contain_word}:"
print_content = (
f"{Fore.BLUE}Word that should not exist{Style.RESET_ALL}"
f" - {should_not_contain_word}:"
)
if should_not_contain_word in content:
print(print_content, "False")
return 0.0
@@ -149,14 +196,17 @@ class Challenge(ABC):
return 1.0
def llm_eval(self, config: Dict[str, Any], content: str, ground: Ground) -> float:
@classmethod
def llm_eval(cls, content: str, ground: Ground) -> float:
openai.api_key = os.getenv("OPENAI_API_KEY")
if os.getenv("IS_MOCK"):
return 1.0
# the validation for this is done in the Eval BaseModel
scoring = SCORING_MAP[ground.eval.scoring] # type: ignore
prompt = PROMPT_MAP[ground.eval.template].format(task=self.data.task, scoring=scoring, answer=ground.answer, response=content) # type: ignore
prompt = PROMPT_MAP[ground.eval.template].format( # type: ignore
task=cls.data.task, scoring=scoring, answer=ground.answer, response=content
)
if ground.eval.examples:
prompt += FEW_SHOT_EXAMPLES.format(examples=ground.eval.examples)
@@ -172,34 +222,31 @@ class Challenge(ABC):
return float(answer["choices"][0]["message"]["content"]) # type: ignore
def get_scores(self, config: Dict[str, Any]) -> dict[str, Any]:
@classmethod
def get_scores(cls, workspace: Path) -> dict[str, Any]:
scores = []
scores_dict: Any = {}
percentage = None
answers = {}
try:
if self.data.task == "" and os.getenv("IS_MOCK"):
if cls.data.task == "" and os.getenv("IS_MOCK"):
scores = [1.0]
answers = {"mock": "This is a mock answer"}
elif isinstance(self.data.ground, Ground):
files_contents = self.get_artifacts_out(
TEMP_FOLDER_ABS_PATH, self.data.ground
)
elif isinstance(cls.data.ground, Ground):
files_contents = cls.get_artifacts_out(workspace, cls.data.ground)
answers = {"answer": files_contents}
for file_content in files_contents:
score = self.scoring(config, file_content, self.data.ground)
print("\033[1;32mYour score is:\033[0m", score)
score = cls.scoring(file_content, cls.data.ground)
print(f"{Fore.GREEN}Your score is:{Style.RESET_ALL}", score)
scores.append(score)
if self.data.ground.eval.type == "llm":
llm_eval = self.llm_eval(
config, "\n".join(files_contents), self.data.ground
)
if self.data.ground.eval.scoring == "percentage":
if cls.data.ground.eval.type == "llm":
llm_eval = cls.llm_eval("\n".join(files_contents), cls.data.ground)
if cls.data.ground.eval.scoring == "percentage":
scores.append(math.ceil(llm_eval / 100))
elif self.data.ground.eval.scoring == "scale":
elif cls.data.ground.eval.scoring == "scale":
scores.append(math.ceil(llm_eval / 10))
print("\033[1;32mYour score is:\033[0m", llm_eval)
print(f"{Fore.GREEN}Your score is:{Style.RESET_ALL}", llm_eval)
scores.append(llm_eval)
except Exception as e:
@@ -212,7 +259,7 @@ class Challenge(ABC):
"answers": answers,
}
self.scores[self.__class__.__name__] = scores_data
cls.scores[cls.__name__] = scores_data
return scores_data
@@ -223,14 +270,15 @@ class Challenge(ABC):
return None
def skip_optional_categories(self, config: Dict[str, Any]) -> None:
challenge_category = self.data.category
categories = [
category
for category in OPTIONAL_CATEGORIES
if category in challenge_category
]
if not agent_eligibible_for_optional_categories(
categories, config.get("category", [])
@classmethod
def skip_optional_categories(cls, config: AgentBenchmarkConfig) -> None:
challenge_categories = set(c.value for c in cls.data.category)
challenge_optional_categories = challenge_categories & set(OPTIONAL_CATEGORIES)
if challenge_optional_categories and not (
config.categories
and set(challenge_optional_categories).issubset(set(config.categories))
):
pytest.skip("Agent is not eligible for this category")
pytest.skip(
f"Category {', '.join(challenge_optional_categories)} is optional, "
"and not explicitly selected in the benchmark config."
)
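
A minimal sketch of the `type()`-based class creation that `from_challenge_spec` uses above, with made-up attribute values:

```python
class Challenge:
    CHALLENGE_LOCATION: str = ""


TestExample = type(
    "TestExample",
    (Challenge,),
    {"CHALLENGE_LOCATION": "/path/to/spec.json"},  # hypothetical spec location
)

print(TestExample.__name__)                # TestExample
print(issubclass(TestExample, Challenge))  # True
print(TestExample.CHALLENGE_LOCATION)
```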


@@ -1,12 +1,8 @@
import datetime
import json
import sys
from datetime import datetime
from enum import Enum
from pathlib import Path
from typing import Any, Dict, List, Optional
from pydantic import BaseModel, constr, validator
from pydantic import BaseModel, Field, constr, validator
class DifficultyLevel(Enum):
@@ -33,80 +29,6 @@ DIFFICULTY_MAP = {
STRING_DIFFICULTY_MAP = {e.value: DIFFICULTY_MAP[e] for e in DifficultyLevel}
def calculate_info_test_path(base_path: Path, benchmark_start_time: datetime) -> Path:
"""
Calculates the path to the directory where the test report will be saved.
"""
# Ensure the reports path exists
base_path.mkdir(parents=True, exist_ok=True)
# Get current UTC date-time stamp
date_stamp = benchmark_start_time.strftime("%Y%m%dT%H%M%S")
# Default run name
run_name = "full_run"
# Map command-line arguments to their respective labels
arg_labels = {
"--test": None,
"--category": None,
"--maintain": "maintain",
"--improve": "improve",
"--explore": "explore",
}
# Identify the relevant command-line argument
for arg, label in arg_labels.items():
if arg in sys.argv:
test_arg = sys.argv[sys.argv.index(arg) + 1] if label is None else None
run_name = arg.strip("--")
if test_arg:
run_name = f"{run_name}_{test_arg}"
break
# Create the full new directory path with ISO standard UTC date-time stamp
report_path = base_path / f"{date_stamp}_{run_name}"
# Ensure the new directory is created
report_path.mkdir(exist_ok=True)
return report_path
class AgentBenchmarkConfig(BaseModel):
"""
This class represents the configuration for the Agent agbenchmark.
It includes the following attributes:
- agent_benchmark_config_path: The path to the agent benchmark config that this object was created from.
- reports_folder: The path to the folder where the benchmark reports will be stored.
- host: The host where the benchmark is run.
"""
agent_benchmark_config_path: Path | None = None
reports_folder: Path | None = None
host: str | None
def get_reports_location(self) -> Path:
# if not self.reports_folder:
# self.reports_folder = (
# Path(self.agent_benchmark_config_path).parent / "reports"
# ).resolve()
return Path.cwd() / "agbenchmark_config" / "reports"
def get_reports_path(self, benchmark_start_time: datetime) -> Path:
return calculate_info_test_path(
self.get_reports_location(), benchmark_start_time
)
def get_regression_reports_path(self) -> Path:
return self.get_reports_location() / "regression_tests.json"
def get_success_rate_path(self) -> Path:
return self.get_reports_location() / "success_rate.json"
def get_agent_home_directory(self) -> Path:
return Path(self.agent_benchmark_config_path).resolve().parent
class Info(BaseModel):
difficulty: DifficultyLevel
description: constr(regex=r"^Tests if the agent can.*")
@@ -180,6 +102,7 @@ class Category(str, Enum):
class ChallengeData(BaseModel):
eval_id: str = ""
name: str
category: List[Category]
task: str
@@ -189,73 +112,4 @@ class ChallengeData(BaseModel):
info: Info | Dict[str, Info]
metadata: Optional[Dict[str, Any]] = None
def serialize(self, path: str) -> None:
with open(path, "w") as file:
file.write(self.json())
def get_data(self) -> dict:
return self.dict()
@staticmethod
def get_json_from_path(json_path: Path | str) -> dict:
path = Path(json_path).resolve()
with open(path, "r") as file:
data = json.load(file)
return data
@staticmethod
def deserialize(path: str) -> "ChallengeData":
# this script is in root/agbenchmark/utils/define_task_types.py
script_dir = Path(__file__).resolve().parent.parent.parent
json_path = script_dir / Path(path)
with open(json_path, "r") as file:
data = json.load(file)
try:
return ChallengeData(**data)
except:
test = "ok"
def challenge_from_datum(self, file_datum: list[dict[str, Any]]) -> "ChallengeData":
same_task_data = {
"name": self.prefix,
"dependencies": self.dependencies,
"category": self.shared_category,
"task": self.task,
"cutoff": self.cutoff,
}
if not self.info:
same_task_data["info"] = {
datum["name"]: datum["info"] for datum in file_datum
}
else:
same_task_data["info"] = self.info
if not self.ground:
same_task_data["ground"] = {
datum["name"]: datum["ground"] for datum in file_datum
}
else:
same_task_data["ground"] = self.ground
return ChallengeData(**same_task_data)
def challenge_from_test_data(self, data: dict[str, Any]) -> "ChallengeData":
same_task_data = {
"name": data["name"],
"dependencies": data["dependencies"],
"category": data["category"],
"info": data["info"],
"ground": data["ground"],
}
if self.same_task:
same_task_data["category"].extend(self.shared_category)
same_task_data["task"] = self.task
same_task_data["cutoff"] = self.cutoff
else:
same_task_data["task"] = data["task"]
same_task_data["cutoff"] = data["cutoff"]
return ChallengeData(**same_task_data)
spec_file: Path | None = Field(None, exclude=True)


@@ -1,3 +1,5 @@
import json
import logging
import math
from pathlib import Path
from typing import Any, Dict, List, Tuple
@@ -11,6 +13,8 @@ from pyvis.network import Network
from agbenchmark.generate_test import DATA_CATEGORY
from agbenchmark.utils.utils import write_pretty_json
logger = logging.getLogger(__name__)
def bezier_curve(
src: np.ndarray, ctrl: List[float], dst: np.ndarray
@@ -221,8 +225,8 @@ def graph_interactive_network(
f"{source_id_str}_to_{target_id_str}" # Construct a unique edge id
)
if not (source_id_str in nt.get_nodes() and target_id_str in nt.get_nodes()):
print(
f"Skipping edge {source_id_str} -> {target_id_str} due to missing nodes."
logger.warning(
f"Skipping edge {source_id_str} -> {target_id_str} due to missing nodes"
)
continue
nt.add_edge(source_id_str, target_id_str, id=edge_id_str)
@@ -271,9 +275,12 @@ def graph_interactive_network(
"layout": {"hierarchical": hierarchical_options},
}
# Serialize the graph to JSON
# Serialize the graph to JSON and save in appropriate locations
graph_data = {"nodes": nt.nodes, "edges": nt.edges}
logger.debug(f"Generated graph data:\n{json.dumps(graph_data, indent=4)}")
# FIXME: use more reliable method to find the right location for these files.
# This will fail in all cases except if run from the root of our repo.
home_path = Path.cwd()
write_pretty_json(graph_data, home_path / "frontend" / "public" / "graph.json")
@@ -284,7 +291,6 @@ def graph_interactive_network(
# this literally only works in the AutoGPT repo, but this part of the code is not reached if BUILD_SKILL_TREE is false
write_pretty_json(graph_data, flutter_app_path / "tree_structure.json")
validate_skill_tree(graph_data, "")
import json
# Extract node IDs with category "coding"
@@ -317,9 +323,6 @@ def graph_interactive_network(
scrape_synthesize_tree,
flutter_app_path / "scrape_synthesize_tree_structure.json",
)
# If you want to convert back to JSON
filtered_json = json.dumps(graph_data, indent=4)
print(filtered_json)
if html_graph_path:
file_path = str(Path(html_graph_path).resolve())


@@ -1,4 +1,5 @@
import json
import logging
import os
from typing import Optional
@@ -7,6 +8,8 @@ import requests
from agbenchmark.__main__ import BENCHMARK_START_TIME
from agbenchmark.agent_interface import HELICONE_GRAPHQL_LOGS
logger = logging.getLogger(__name__)
def get_data_from_helicone(challenge: str) -> Optional[float]:
# Define the endpoint of your GraphQL server
@@ -38,8 +41,8 @@ query ExampleQuery($properties: [PropertyFilter!]){
]
}
if HELICONE_GRAPHQL_LOGS:
print(query)
print(json.dumps(variables, indent=4))
logger.debug(f"Executing Helicone query:\n{query.strip()}")
logger.debug(f"Query variables:\n{json.dumps(variables, indent=4)}")
operation_name = "ExampleQuery"
@@ -59,24 +62,22 @@ query ExampleQuery($properties: [PropertyFilter!]){
data = response.json()
except requests.HTTPError as http_err:
print(f"HTTP error occurred: {http_err}")
return None # Re-raise the exception to stop execution
logger.error(f"Helicone returned an HTTP error: {http_err}")
return None
except json.JSONDecodeError:
print(f"Invalid JSON response: {response.text if response else 'No response'}")
raw_response = response.text # type: ignore
logger.error(
f"Helicone returned an invalid JSON response: '''{raw_response}'''"
)
return None
except Exception as err:
print(f"Other error occurred: {err}")
logger.error(f"Error while trying to get data from Helicone: {err}")
return None
try:
if data is None or data.get("data") is None:
print("Invalid response received from server: no data")
return None
return (
data.get("data", {})
.get("aggregatedHeliconeRequest", {})
.get("costUSD", None)
)
except Exception as err:
print(f"Error occurred while parsing response: {err}")
if data is None or data.get("data") is None:
logger.error("Invalid response received from Helicone: no data")
logger.error(f"Offending response: {response}")
return None
return (
data.get("data", {}).get("aggregatedHeliconeRequest", {}).get("costUSD", None)
)


@@ -0,0 +1,74 @@
from __future__ import annotations
import logging
from colorama import Fore, Style
SIMPLE_LOG_FORMAT = "[%(asctime)s] %(levelname)s %(message)s"
DEBUG_LOG_FORMAT = "[%(asctime)s] %(levelname)s %(filename)s:%(lineno)03d %(message)s"
def configure_logging(
level: int = logging.INFO,
) -> None:
"""Configure the native logging module."""
# Auto-adjust default log format based on log level
log_format = DEBUG_LOG_FORMAT if level == logging.DEBUG else SIMPLE_LOG_FORMAT
console_handler = logging.StreamHandler()
console_handler.setFormatter(FancyConsoleFormatter(log_format))
# Configure the root logger
logging.basicConfig(
level=level,
format=log_format,
handlers=[console_handler],
)
class FancyConsoleFormatter(logging.Formatter):
"""
A custom logging formatter designed for console output.
This formatter enhances the standard logging output with color coding. The color
coding is based on the level of the log message, making it easier to distinguish
between different types of messages in the console output.
The color for each level is defined in the LEVEL_COLOR_MAP class attribute.
"""
# level -> (level & text color, title color)
LEVEL_COLOR_MAP = {
logging.DEBUG: Fore.LIGHTBLACK_EX,
logging.INFO: Fore.BLUE,
logging.WARNING: Fore.YELLOW,
logging.ERROR: Fore.RED,
logging.CRITICAL: Fore.RED + Style.BRIGHT,
}
def format(self, record: logging.LogRecord) -> str:
# Make sure `msg` is a string
if not hasattr(record, "msg"):
record.msg = ""
elif not type(record.msg) is str:
record.msg = str(record.msg)
# Justify the level name to 5 characters minimum
record.levelname = record.levelname.ljust(5)
# Determine default color based on error level
level_color = ""
if record.levelno in self.LEVEL_COLOR_MAP:
level_color = self.LEVEL_COLOR_MAP[record.levelno]
record.levelname = f"{level_color}{record.levelname}{Style.RESET_ALL}"
# Determine color for message
color = getattr(record, "color", level_color)
color_is_specified = hasattr(record, "color")
# Don't color INFO messages unless the color is explicitly specified.
if color and (record.levelno != logging.INFO or color_is_specified):
record.msg = f"{color}{record.msg}{Style.RESET_ALL}"
return super().format(record)
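
A short usage sketch, assuming `agbenchmark` is installed so that `agbenchmark.utils.logging` is importable; this is roughly what the `--debug` flag enables.

```python
import logging

from agbenchmark.utils.logging import configure_logging

configure_logging(level=logging.DEBUG)

logger = logging.getLogger(__name__)
logger.debug("Debug messages now use the more detailed DEBUG_LOG_FORMAT")
logger.warning("Warnings are highlighted in yellow by FancyConsoleFormatter")
```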


@@ -1,18 +1,22 @@
# radar charts, logs, helper functions for tests, anything else relevant.
import json
import logging
import os
import re
from pathlib import Path
from typing import Any, List, Optional
from typing import Any, Optional
from dotenv import load_dotenv
load_dotenv()
from agbenchmark.utils.data_types import DIFFICULTY_MAP, DifficultyLevel
load_dotenv()
AGENT_NAME = os.getenv("AGENT_NAME")
REPORT_LOCATION = os.getenv("REPORT_LOCATION", None)
logger = logging.getLogger(__name__)
def replace_backslash(value: Any) -> Any:
if isinstance(value, str):
@@ -72,8 +76,9 @@ def get_highest_success_difficulty(
highest_difficulty = DifficultyLevel[highest_difficulty_str]
highest_difficulty_level = DIFFICULTY_MAP[highest_difficulty]
except KeyError:
print(
f"Unexpected difficulty level '{highest_difficulty_str}' in test '{test_name}'"
logger.warning(
f"Unexpected difficulty level '{highest_difficulty_str}' "
f"in test '{test_name}'"
)
continue
else:
@@ -88,12 +93,21 @@ def get_highest_success_difficulty(
highest_difficulty = difficulty_enum
highest_difficulty_level = difficulty_level
except KeyError:
print(
f"Unexpected difficulty level '{difficulty_str}' in test '{test_name}'"
logger.warning(
f"Unexpected difficulty level '{difficulty_str}' "
f"in test '{test_name}'"
)
continue
except Exception:
print(f"Make sure you selected the right test, no reports were generated.")
except Exception as e:
logger.warning(
"An unexpected error [1] occurred while analyzing report [2]."
"Please notify a maintainer.\n"
f"Report data [1]: {data}\n"
f"Error [2]: {e}"
)
logger.warning(
"Make sure you selected the right test, no reports were generated."
)
break
if highest_difficulty is not None:
@@ -116,22 +130,13 @@ def get_highest_success_difficulty(
# remote_url = remote_url[:-4]
# git_commit_sha = f"{remote_url}/tree/{repo.head.commit.hexsha}"
# # print(f"GIT_COMMIT_SHA: {git_commit_sha}")
# # logger.debug(f"GIT_COMMIT_SHA: {git_commit_sha}")
# return git_commit_sha
# except Exception:
# # print(f"{directory} is not a git repository!")
# # logger.error(f"{directory} is not a git repository!")
# return None
def agent_eligibible_for_optional_categories(
optional_challenge_categories: List, agent_categories: List
) -> bool:
for element in optional_challenge_categories:
if element not in agent_categories:
return False
return True
def write_pretty_json(data, json_file):
sorted_data = deep_sort(data)
json_graph = json.dumps(sorted_data, indent=4)

benchmark/poetry.lock (generated, 1787 changed lines): diff suppressed because it is too large.

@@ -32,6 +32,8 @@ python-multipart = "^0.0.6"
toml = "^0.10.2"
helicone = "^1.0.9"
httpx = "^0.24.0"
agent-protocol-client = "^1.1.0"
click-default-group = "^1.2.4"
[tool.poetry.group.dev.dependencies]
flake8 = "^3.9.2"


@@ -1,121 +0,0 @@
import io
import json
import logging
import shutil
from pathlib import Path
from random import randint
from typing import Annotated, Any, Dict, List
from fastapi import FastAPI, File, Form, HTTPException, UploadFile
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
# Set up logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
app = FastAPI()
artifacts: List[Dict[str, Any]] = []
class Task(BaseModel):
input: str
@app.post("/agent/tasks/{task_id}/artifacts")
async def upload_file(
task_id: str, file: Annotated[UploadFile, File()], relative_path: str = Form("")
) -> Dict[str, Any]:
logger.info(
"Uploading file for task_id: %s with relative path: %s", task_id, relative_path
)
absolute_directory_path = Path(__file__).parent.absolute()
save_path = (
absolute_directory_path
/ "agent/gpt-engineer"
/ "projects/my-new-project/workspace"
)
random_string = str(randint(0, 100000))
while random_string in artifacts:
random_string = str(randint(0, 100000))
artifact_data = await file.read()
artifacts.append(
{
"binary": artifact_data,
"relative_path": relative_path,
"file_name": file.filename,
"artifact_id": random_string,
}
)
print(artifacts)
return {
"artifact_id": random_string,
"file_name": "file_name",
"relative_path": "relative_path",
}
@app.get("/agent/tasks/{task_id}/artifacts")
async def get_files() -> List[Dict[str, Any]]:
logger.info("Fetching list of files for task")
return artifacts
@app.get("/agent/tasks/{task_id}/artifacts/{artifact_id}")
async def get_file(artifact_id: str):
for artifact in artifacts:
if artifact["artifact_id"] == artifact_id:
break
else:
logger.error("Attempt to access nonexistent artifact with ID: %s", artifact_id)
raise HTTPException(status_code=404, detail="Artifact not found")
logger.info("Fetching artifact with ID: %s", artifact_id)
# find aritifact where artifact_id = artifact_id
for artifact in artifacts:
if artifact["artifact_id"] == artifact_id:
return StreamingResponse(
io.BytesIO(artifact["binary"]),
media_type="application/octet-stream",
headers={"Content-Disposition": f"attachment; filename=test.txt"},
)
# return 404
return HTTPException(status_code=404, detail="Artifact not found")
@app.post("/agent/tasks/{task_id}/steps")
async def create_steps(task_id: str):
logger.info("Creating step for task_id: %s", task_id)
return {
"input": "random",
"additional_input": {},
"task_id": task_id,
"step_id": "random_step",
"name": "random",
"status": "created",
"output": "random",
"additional_output": {},
"artifacts": [],
"is_last": True,
}
@app.post("/agent/tasks")
async def create_tasks(task: Task):
artifacts.clear()
return {
"input": "random",
"additional_input": {},
"task_id": "static_task_id",
"artifacts": [],
}
if __name__ == "__main__":
import uvicorn
uvicorn.run(app, host="0.0.0.0", port=8000)