AGBenchmark codebase clean-up (#6650)

* refactor(benchmark): Deduplicate configuration loading logic

   - Move the configuration loading logic into a separate `load_agbenchmark_config` function in the `agbenchmark/config.py` module.
   - Replace the duplicated loading logic in `conftest.py`, `generate_test.py`, `ReportManager.py`, `reports.py`, and `__main__.py` with calls to the `load_agbenchmark_config` function.
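
A minimal sketch of what such a shared loader could look like, modelled on the loading code that was previously duplicated (the exact signature and location in `agbenchmark/config.py` may differ):

```python
import json
from pathlib import Path

from agbenchmark.utils.data_types import AgentBenchmarkConfig


def load_agbenchmark_config() -> AgentBenchmarkConfig:
    """Sketch: load agbenchmark_config/config.json from the current working directory."""
    config_file = Path.cwd() / "agbenchmark_config" / "config.json"
    with open(config_file) as f:
        return AgentBenchmarkConfig(**json.load(f))
```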

* fix(benchmark): Fix type errors, linting errors, and clean up CLI validation in __main__.py

   - Fixed type errors and linting errors in `__main__.py`
   - Improved the readability of CLI argument validation by introducing a separate function for it

* refactor(benchmark): Lint and typefix app.py

   - Rearranged and cleaned up import statements
   - Fixed type errors caused by improper use of `psutil` objects
   - Simplified a number of `os.path` usages by converting to `pathlib`
   - Use `Task` and `TaskRequestBody` classes from `agent_protocol_client` instead of `.schema`

* refactor(benchmark): Replace `.agent_protocol_client` with `agent-protocol-client`, clean up schema.py

   - Remove `agbenchmark.agent_protocol_client` (an offline copy of `agent-protocol-client`).
      - Add `agent-protocol-client` as a dependency and change imports to `agent_protocol_client`.
   - Fix type annotation on `agent_api_interface.py::upload_artifacts` (`ApiClient` -> `AgentApi`).
   - Remove all unused types from schema.py (= most of them).

* refactor(benchmark): Use pathlib in agent_interface.py and agent_api_interface.py

* refactor(benchmark): Improve typing, response validation, and readability in app.py

   - Simplified response generation by leveraging FastAPI's type checking and conversion.
   - Introduced use of `HTTPException` for error responses (see the sketch after this list).
   - Improved naming, formatting, and typing in `app.py::create_evaluation`.
   - Updated the docstring on `app.py::create_agent_task`.
   - Fixed return type annotations of `create_single_test` and `create_challenge` in generate_test.py.
   - Added default values to optional attributes on models in report_types_v2.py.
   - Removed unused imports in `generate_test.py`.
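
To illustrate the response-handling style described above, a hedged sketch with a stand-in endpoint and model (not the actual app.py routes):

```python
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI()

# Hypothetical in-memory store standing in for the real task registry
_TASK_SCORES: dict[str, float] = {"demo-task": 1.0}


class TaskEvalResponse(BaseModel):
    task_id: str
    score: float | None = None


@app.get("/tasks/{task_id}/evaluation", response_model=TaskEvalResponse)
def get_task_evaluation(task_id: str) -> TaskEvalResponse:
    if task_id not in _TASK_SCORES:
        # Error responses via HTTPException instead of hand-built JSON responses
        raise HTTPException(status_code=404, detail=f"Task {task_id} not found")
    # Returning the Pydantic model lets FastAPI handle type checking and serialization
    return TaskEvalResponse(task_id=task_id, score=_TASK_SCORES[task_id])
```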

* refactor(benchmark): Clean up logging and print statements

   - Introduced use of the `logging` library for unified logging and better readability.
   - Converted most print statements to use `logger.debug`, `logger.warning`, and `logger.error`.
   - Improved descriptiveness of log statements.
   - Removed unnecessary print statements.
   - Added log statements to broad `except` blocks that previously failed silently.
   - Added a `--debug` flag, which sets the log level to `DEBUG` and enables a more comprehensive log format.
   - Added a `.utils.logging` module with a `configure_logging` function to easily configure the logging library (a sketch follows below).
   - Converted raw escape sequences in `.utils.challenge` to use `colorama`.
   - Renamed `generate_test.py::generate_tests` to `load_challenges`.
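
A rough sketch of what `configure_logging` might look like; the actual format strings and handlers in `agbenchmark.utils.logging` may differ:

```python
import logging


def configure_logging(level: int = logging.INFO) -> None:
    """Sketch: configure the root logger for AGBenchmark output."""
    if level == logging.DEBUG:
        # More comprehensive format when --debug is passed
        log_format = "%(asctime)s %(levelname)s %(name)s:%(lineno)d  %(message)s"
    else:
        log_format = "%(levelname)s  %(message)s"
    logging.basicConfig(level=level, format=log_format)


# The CLI could then map the --debug flag to a log level like this:
# configure_logging(logging.DEBUG if debug else logging.INFO)
```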

* refactor(benchmark): Remove unused server.py and agent_interface.py::run_agent

   - Remove unused server.py file
   - Remove unused run_agent function from agent_interface.py

* refactor(benchmark): Clean up conftest.py

   - Fix and add type annotations
   - Rewrite docstrings
   - Disable or remove unused code
   - Fix definition of arguments and their types in `pytest_addoption`

* refactor(benchmark): Clean up generate_test.py file

   - Refactored the `create_single_test` function for clarity and readability
      - Removed unused variables
      - Made creation of `Challenge` subclasses more straightforward
      - Made bare `except` more specific
   - Renamed `Challenge.setup_challenge` method to `run_challenge`
   - Updated type hints and annotations
   - Made minor code/readability improvements in `load_challenges`
   - Added a helper function `_add_challenge_to_module` for attaching a Challenge class to the current module

* fix(benchmark): Fix and add type annotations in execute_sub_process.py

* refactor(benchmark): Simplify const determination in agent_interface.py

   - Simplify the logic that determines the value of `HELICONE_GRAPHQL_LOGS`
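
The simplified logic presumably boils down to a single boolean expression along these lines (illustrative, not a verbatim copy of agent_interface.py):

```python
import os

# True only when the environment variable is explicitly set to "true"
HELICONE_GRAPHQL_LOGS = os.getenv("HELICONE_GRAPHQL_LOGS", "").lower() == "true"
```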

* fix(benchmark): Register category markers to prevent warnings

   - Use the `pytest_configure` hook to register the known challenge categories as markers; otherwise, Pytest emits "unknown marker" warnings when the tests run.
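
A sketch of that hook, with a placeholder category list (the real list comes from the known challenge categories):

```python
import pytest


def pytest_configure(config: pytest.Config) -> None:
    # Register challenge categories as markers so Pytest doesn't emit
    # "unknown marker" warnings when tests are collected.
    for category in ("interface", "code", "retrieval", "memory"):  # placeholder list
        config.addinivalue_line("markers", f"{category}: challenge category")
```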

* refactor(benchmark/challenges): Fix indentation in 4_revenue_retrieval_2/data.json

* refactor(benchmark): Update agent_api_interface.py

   - Add type annotations to `copy_agent_artifacts_into_temp_folder` function
   - Add note about broken endpoint in the `agent_protocol_client` library
   - Remove unused variable in `run_api_agent` function
   - Improve readability and resolve linting error

* feat(benchmark): Improve and centralize pathfinding

   - Search the path hierarchy for an applicable `agbenchmark_config` folder, rather than assuming it is in the current working directory (see the sketch below).
   - Create `agbenchmark.utils.path_manager` with an `AGBenchmarkPathManager` class and export a `PATH_MANAGER` constant.
   - Replace the path constants defined in `__main__.py` with usages of `PATH_MANAGER`.
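
The hierarchy search could look roughly like this (a sketch; the real `AGBenchmarkPathManager` resolves additional paths):

```python
from pathlib import Path


def find_config_folder(start: Path = Path.cwd()) -> Path:
    """Walk up from `start` until a folder containing agbenchmark_config is found."""
    for path in [start, *start.parents]:
        candidate = path / "agbenchmark_config"
        if candidate.is_dir():
            return candidate
    raise FileNotFoundError("No 'agbenchmark_config' folder found in path hierarchy")
```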

* feat(benchmark/cli): Clean up and improve CLI

   - Updated commands, options, and their descriptions to be more intuitive and consistent
   - Moved slow imports into the entrypoints that use them to speed up application startup
   - Fixed type hints to match output types of Click options
   - Hid deprecated `agbenchmark start` command
   - Refactored code to improve readability and maintainability
   - Moved main entrypoint into `run` subcommand
   - Fixed `version` and `serve` subcommands
   - Added the `click-default-group` package to allow using `run` implicitly, for backwards compatibility (see the sketch below)
   - Renamed `--no_dep` to `--no-dep` for consistency
   - Fixed string formatting issues in log statements
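
A minimal sketch of how `click-default-group` makes `run` the implicit subcommand; the options shown are illustrative, not the full CLI:

```python
import click
from click_default_group import DefaultGroup


@click.group(cls=DefaultGroup, default="run", default_if_no_args=True)
def cli() -> None:
    """AGBenchmark command line interface."""


@cli.command()
@click.option("--no-dep", is_flag=True, help="Run without checking dependencies.")
def run(no_dep: bool) -> None:
    click.echo(f"Running benchmark (no_dep={no_dep})")


if __name__ == "__main__":
    cli()  # invoking the CLI without a subcommand falls through to `run`
```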

* refactor(benchmark/config): Move AgentBenchmarkConfig and related functions to config.py

   - Move the `AgentBenchmarkConfig` class from `utils/data_types.py` to `config.py`.
   - Extract the `calculate_info_test_path` function from `utils/data_types.py` and move it to `config.py` as a private helper function `_calculate_info_test_path`.
   - Move `load_agent_benchmark_config()` to `AgentBenchmarkConfig.load()`.
   - Changed simple getter methods on `AgentBenchmarkConfig` to calculated properties (illustrated in the sketch below).
   - Update all code references according to the changes mentioned above.
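
Sketch of the classmethod-plus-property shape described above; attribute and property names are approximations:

```python
import json
from pathlib import Path

from pydantic import BaseModel


class AgentBenchmarkConfig(BaseModel):
    agbenchmark_config_dir: Path

    @classmethod
    def load(cls, config_dir: Path) -> "AgentBenchmarkConfig":
        with open(config_dir / "config.json") as f:
            return cls(agbenchmark_config_dir=config_dir, **json.load(f))

    @property
    def regression_tests_file(self) -> Path:
        # Formerly a get_regression_reports_path()-style getter
        return self.agbenchmark_config_dir / "regression_tests.json"
```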

* refactor(benchmark): Fix ReportManager init parameter types and use pathlib

   - Fix the type annotation of the `benchmark_start_time` parameter in `ReportManager.__init__`, which was mistyped as `str` instead of `datetime`.
   - Change the type of the `filename` parameter of `ReportManager.__init__` from `str` to `Path`.
   - Rename `self.filename` to `self.report_file` in `ReportManager`.
   - Change the way the report file is created, opened, and saved to use the `Path` object (a simplified sketch follows).
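
An illustrative, simplified shape of the pathlib-based handling (not the full `ReportManager`):

```python
import json
from datetime import datetime
from pathlib import Path


class ReportManager:
    def __init__(self, report_file: Path, benchmark_start_time: datetime):
        self.report_file = report_file
        self.benchmark_start_time = benchmark_start_time

        if report_file.exists():
            self.tests: dict = json.loads(report_file.read_text())
        else:
            report_file.parent.mkdir(parents=True, exist_ok=True)
            report_file.write_text("{}")
            self.tests = {}

    def save(self) -> None:
        self.report_file.write_text(json.dumps(self.tests, indent=4))
```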

* refactor(benchmark): Improve typing surrounding ChallengeData and clean up its implementation

   - Use `ChallengeData` objects instead of untyped `dict`s in app.py, generate_test.py, and reports.py.
   - Remove unnecessary methods `serialize`, `get_data`, `get_json_from_path`, `deserialize` from `ChallengeData` class.
   - Remove unused methods `challenge_from_datum` and `challenge_from_test_data` from the `ChallengeData` class.
   - Update function signatures and annotations of `create_challenge` and `generate_single_test` functions in generate_test.py.
   - Add types to function signatures of `generate_single_call_report` and `finalize_reports` in reports.py.
   - Remove unnecessary `challenge_data` parameter (in generate_test.py) and fixture (in conftest.py).

* refactor(benchmark): Clean up generate_test.py, conftest.py and __main__.py

   - Cleaned up generate_test.py and conftest.py
      - Consolidated challenge creation logic in the `Challenge` class itself, most notably the new `Challenge.from_challenge_spec` method.
      - Moved challenge selection logic from generate_test.py to the `pytest_collection_modifyitems` hook in conftest.py (see the sketch below).
   - Converted methods in the `Challenge` class to class methods where appropriate.
   - Improved argument handling in the `run_benchmark` function in `__main__.py`.
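
A hedged sketch of category-based selection in that hook; it assumes a `--category` append-style option registered via `pytest_addoption`, and the real implementation also handles flags like `--test`, `--maintain`, and `--improve`:

```python
import pytest


def pytest_collection_modifyitems(items: list[pytest.Item], config: pytest.Config) -> None:
    # Deselect challenges whose category markers don't match the --category option
    selected = set(config.getoption("--category") or [])
    if not selected:
        return
    for item in list(items):
        item_categories = {marker.name for marker in item.iter_markers()}
        if not item_categories & selected:
            items.remove(item)
```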

* refactor(benchmark/config): Merge AGBenchmarkPathManager into AgentBenchmarkConfig and reduce fragmented/global state

   - Merge the functionality of `AGBenchmarkPathManager` into `AgentBenchmarkConfig` to consolidate the configuration management.
   - Remove the `.path_manager` module containing `AGBenchmarkPathManager`.
   - Pass the `AgentBenchmarkConfig` and its attributes through function arguments to reduce global state and improve code clarity.

* feat(benchmark/serve): Configurable port for `serve` subcommand

   - Added `--port` option to `serve` subcommand to allow for specifying the port to run the API on.
   - If no `--port` option is provided, the port will default to the value specified in the `PORT` environment variable, or 8080 if not set.
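
Sketch of the fallback logic for the `serve` subcommand (the actual serving code is elided):

```python
import os

import click


@click.command()
@click.option("--port", type=int, help="Port to run the API on.")
def serve(port: int | None) -> None:
    # Fall back to the PORT environment variable, then to 8080
    port = port or int(os.getenv("PORT", 8080))
    click.echo(f"Serving AGBenchmark API on port {port}")
```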

* feat(benchmark/cli): Add `config` subcommand

   - Added a new subcommand `config` to the AGBenchmark CLI, to display information about the present AGBenchmark config.

* fix(benchmark): Gracefully handle incompatible challenge spec files in app.py

   - Added a check to skip deprecated challenges
   - Added logging to allow debugging of the loading process
   - Added handling of validation errors when parsing challenge spec files
   - Added missing `spec_file` attribute to `ChallengeData`

* refactor(benchmark): Move `run_benchmark` entrypoint to main.py, use it in `/reports` endpoint

   - Move `run_benchmark` and `validate_args` from __main__.py to main.py
   - Replace agbenchmark subprocess in `app.py:run_single_test` with `run_benchmark`
   - Move `get_unique_categories` from __main__.py to challenges/__init__.py
   - Move `OPTIONAL_CATEGORIES` from __main__.py to challenge.py
   - Reduce operations on updates.json (including `initialize_updates_file`) outside of API

* refactor(benchmark): Remove unused `/updates` endpoint and all related code

   - Remove `updates_json_file` attribute from `AgentBenchmarkConfig`
   - Remove `get_updates` and `_initialize_updates_file` in app.py
   - Remove `append_updates_file` and `create_update_json` functions in agent_api_interface.py
   - Remove call to `append_updates_file` in challenge.py

* refactor(benchmark/config): Clean up and update docstrings on `AgentBenchmarkConfig`

   - Add and update docstrings
   - Change the base class from `BaseModel` to `BaseSettings`, allowing extra fields for backwards compatibility (see the sketch below)
   - Make naming of path attributes on `AgentBenchmarkConfig` more consistent
   - Remove unused `agent_home_directory` attribute
   - Remove unused `workspace` attribute
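
A hedged sketch of the resulting config class, assuming Pydantic v1-style `BaseSettings`; the attribute set shown is illustrative:

```python
from pathlib import Path

from pydantic import BaseSettings


class AgentBenchmarkConfig(BaseSettings):
    """AGBenchmark configuration, loaded from agbenchmark_config/config.json."""

    agbenchmark_config_dir: Path
    categories: list[str] | None = None

    class Config:
        extra = "allow"  # tolerate legacy/unknown keys for backwards compatibility
```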

* fix(benchmark): Restore mechanism to select (optional) categories in agent benchmark config

* fix(benchmark): Update agent-protocol-client to v1.1.0

   - Fixes an issue with fetching task artifact listings
Author: Reinier van der Leer
Date: 2024-01-02 22:23:09 +01:00
Committed by: GitHub
Parent: b8238c2228
Commit: 25cc6ad6ae

47 changed files with 2122 additions and 7752 deletions

@@ -1,147 +1,34 @@
 import glob
 import importlib
-import json
+import logging
 import os
-import sys
-import types
 from collections import deque
 from pathlib import Path
-from typing import Any, Dict, Optional, Union
-import pytest
-from agbenchmark.__main__ import CHALLENGES_ALREADY_BEATEN
-from agbenchmark.agent_api_interface import append_updates_file
-from agbenchmark.agent_protocol_client.models.step import Step
 from agbenchmark.utils.challenge import Challenge
-from agbenchmark.utils.data_types import AgentBenchmarkConfig, ChallengeData
+from agbenchmark.utils.data_types import ChallengeData
 DATA_CATEGORY = {}
-def create_single_test(
-    data: Dict[str, Any] | ChallengeData,
-    challenge_location: str,
-    file_datum: Optional[list[dict[str, Any]]] = None,
-) -> None:
-    challenge_data = None
-    artifacts_location = None
-    if isinstance(data, ChallengeData):
-        challenge_data = data
-        data = data.get_data()
-    DATA_CATEGORY[data["name"]] = data["category"][0]
-    # Define test class dynamically
-    challenge_class = types.new_class(f"Test{data['name']}", (Challenge,))
-    print(challenge_location)
-    # clean_challenge_location = get_test_path(challenge_location)
-    setattr(challenge_class, "CHALLENGE_LOCATION", challenge_location)
-    setattr(
-        challenge_class,
-        "ARTIFACTS_LOCATION",
-        artifacts_location or str(Path(challenge_location).resolve().parent),
-    )
-    # Define test method within the dynamically created class
-    @pytest.mark.asyncio
-    async def test_method(self, config: Dict[str, Any], request) -> None:  # type: ignore
-        # create a random number between 0 and 1
-        test_name = self.data.name
-        try:
-            with open(CHALLENGES_ALREADY_BEATEN, "r") as f:
-                challenges_beaten_in_the_past = json.load(f)
-        except:
-            challenges_beaten_in_the_past = {}
-        if request.config.getoption("--explore") and challenges_beaten_in_the_past.get(
-            test_name, False
-        ):
-            return None
-        # skip optional categories
-        self.skip_optional_categories(config)
-        from helicone.lock import HeliconeLockManager
-        if os.environ.get("HELICONE_API_KEY"):
-            HeliconeLockManager.write_custom_property("challenge", self.data.name)
-        cutoff = self.data.cutoff or 60
-        timeout = cutoff
-        if "--nc" in sys.argv:
-            timeout = 100000
-        if "--cutoff" in sys.argv:
-            timeout = int(sys.argv[sys.argv.index("--cutoff") + 1])
-        await self.setup_challenge(config, timeout)
-        scores = self.get_scores(config)
-        request.node.answers = (
-            scores["answers"] if "--keep-answers" in sys.argv else None
-        )
-        del scores["answers"]  # remove answers from scores
-        request.node.scores = scores  # store scores in request.node
-        is_score_100 = 1 in scores["values"]
-        evaluation = "Correct!" if is_score_100 else "Incorrect."
-        eval_step = Step(
-            input=evaluation,
-            additional_input=None,
-            task_id="irrelevant, this step is a hack",
-            step_id="irrelevant, this step is a hack",
-            name="",
-            status="created",
-            output=None,
-            additional_output=None,
-            artifacts=[],
-            is_last=True,
-        )
-        await append_updates_file(eval_step)
-        assert is_score_100
-    # Parametrize the method here
-    test_method = pytest.mark.parametrize(
-        "challenge_data",
-        [data],
-        indirect=True,
-    )(test_method)
-    setattr(challenge_class, "test_method", test_method)
-    # Attach the new class to a module so it can be discovered by pytest
-    module = importlib.import_module(__name__)
-    setattr(module, f"Test{data['name']}", challenge_class)
-    return challenge_class
+logger = logging.getLogger(__name__)
-def create_single_suite_challenge(challenge_data: ChallengeData, path: Path) -> None:
-    create_single_test(challenge_data, str(path))
+def create_challenge_from_spec_file(spec_file: Path) -> type[Challenge]:
+    challenge = Challenge.from_challenge_spec(spec_file)
+    DATA_CATEGORY[challenge.data.name] = challenge.data.category[0].value
+    return challenge
-def create_challenge(
-    data: Dict[str, Any],
-    json_file: str,
-    json_files: deque,
-) -> Union[deque, Any]:
-    path = Path(json_file).resolve()
-    print("Creating challenge for", path)
-    challenge_class = create_single_test(data, str(path))
-    print("Creation complete for", path)
-    return json_files, challenge_class
+def create_challenge_from_spec_file_path(spec_file_path: str) -> type[Challenge]:
+    spec_file = Path(spec_file_path).resolve()
+    return create_challenge_from_spec_file(spec_file)
-def generate_tests() -> None:  # sourcery skip: invert-any-all
-    print("Generating tests...")
+def load_challenges() -> None:
+    logger.info("Loading challenges...")
     challenges_path = os.path.join(os.path.dirname(__file__), "challenges")
-    print(f"Looking for challenges in {challenges_path}...")
+    logger.debug(f"Looking for challenges in {challenges_path}...")
     json_files = deque(
         glob.glob(
@@ -150,74 +37,39 @@ def generate_tests() -> None: # sourcery skip: invert-any-all
         )
     )
-    print(f"Found {len(json_files)} challenges.")
-    print(f"Sample path: {json_files[0]}")
-    agent_benchmark_config_path = str(Path.cwd() / "agbenchmark_config" / "config.json")
-    try:
-        with open(agent_benchmark_config_path, "r") as f:
-            agent_benchmark_config = AgentBenchmarkConfig(**json.load(f))
-            agent_benchmark_config.agent_benchmark_config_path = (
-                agent_benchmark_config_path
-            )
-    except json.JSONDecodeError:
-        print("Error: benchmark_config.json is not a valid JSON file.")
-        raise
-    regression_reports_path = agent_benchmark_config.get_regression_reports_path()
-    if regression_reports_path and os.path.exists(regression_reports_path):
-        with open(regression_reports_path, "r") as f:
-            regression_tests = json.load(f)
-    else:
-        regression_tests = {}
+    logger.debug(f"Found {len(json_files)} challenges.")
+    logger.debug(f"Sample path: {json_files[0]}")
+    loaded, ignored = 0, 0
     while json_files:
-        json_file = (
-            json_files.popleft()
-        )  # Take and remove the first element from json_files
+        # Take and remove the first element from json_files
+        json_file = json_files.popleft()
         if challenge_should_be_ignored(json_file):
+            ignored += 1
             continue
-        data = ChallengeData.get_json_from_path(json_file)
+        challenge_info = ChallengeData.parse_file(json_file)
-        commands = sys.argv
-        # --by flag
-        if "--category" in commands:
-            categories = data.get("category", [])
-            commands_set = set(commands)
+        challenge_class = create_challenge_from_spec_file_path(json_file)
-            # Convert the combined list to a set
-            categories_set = set(categories)
+        logger.debug(f"Generated test for {challenge_info.name}")
+        _add_challenge_to_module(challenge_class)
+        loaded += 1
-            # If there's no overlap with commands
-            if not categories_set.intersection(commands_set):
-                continue
-        # --test flag, only run the test if it's the exact one specified
-        tests = []
-        for command in commands:
-            if command.startswith("--test="):
-                tests.append(command.split("=")[1])
-        if tests and data["name"] not in tests:
-            continue
-        # --maintain and --improve flag
-        in_regression = regression_tests.get(data["name"], None)
-        improve_flag = in_regression and "--improve" in commands
-        maintain_flag = not in_regression and "--maintain" in commands
-        if "--maintain" in commands and maintain_flag:
-            continue
-        elif "--improve" in commands and improve_flag:
-            continue
-        json_files, challenge_class = create_challenge(data, json_file, json_files)
-        print(f"Generated test for {data['name']}.")
-    print("Test generation complete.")
+    logger.info(f"Loading challenges complete: loaded {loaded}, ignored {ignored}.")
-def challenge_should_be_ignored(json_file):
-    return "challenges/deprecated" in json_file or "challenges/library" in json_file
+def challenge_should_be_ignored(json_file_path: str):
+    return (
+        "challenges/deprecated" in json_file_path
+        or "challenges/library" in json_file_path
+    )
-generate_tests()
+def _add_challenge_to_module(challenge: type[Challenge]):
+    # Attach the Challenge class to this module so it can be discovered by pytest
+    module = importlib.import_module(__name__)
+    setattr(module, f"{challenge.__name__}", challenge)
+load_challenges()