AGBenchmark codebase clean-up (#6650)

* refactor(benchmark): Deduplicate configuration loading logic

   - Move the configuration loading logic into a separate `load_agbenchmark_config` function in the `agbenchmark/config.py` module.
   - Replace the duplicated loading logic in `conftest.py`, `generate_test.py`, `ReportManager.py`, `reports.py`, and `__main__.py` with calls to the `load_agbenchmark_config` function.
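
A minimal sketch of what such a shared loader could look like, modelled on the loading code that was previously duplicated (the exact signature and location in `agbenchmark/config.py` may differ):

```python
import json
from pathlib import Path

from agbenchmark.utils.data_types import AgentBenchmarkConfig


def load_agbenchmark_config() -> AgentBenchmarkConfig:
    """Sketch: load agbenchmark_config/config.json from the current working directory."""
    config_file = Path.cwd() / "agbenchmark_config" / "config.json"
    with open(config_file) as f:
        return AgentBenchmarkConfig(**json.load(f))
```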

* fix(benchmark): Fix type errors, linting errors, and clean up CLI validation in __main__.py

   - Fixed type errors and linting errors in `__main__.py`
   - Improved the readability of CLI argument validation by introducing a separate function for it

* refactor(benchmark): Lint and typefix app.py

   - Rearranged and cleaned up import statements
   - Fixed type errors caused by improper use of `psutil` objects
   - Simplified a number of `os.path` usages by converting to `pathlib`
   - Use `Task` and `TaskRequestBody` classes from `agent_protocol_client` instead of `.schema`

* refactor(benchmark): Replace `.agent_protocol_client` with `agent-protocol-client`, clean up schema.py

   - Remove `agbenchmark.agent_protocol_client` (an offline copy of `agent-protocol-client`).
      - Add `agent-protocol-client` as a dependency and change imports to `agent_protocol_client`.
   - Fix type annotation on `agent_api_interface.py::upload_artifacts` (`ApiClient` -> `AgentApi`).
   - Remove all unused types from schema.py (= most of them).

* refactor(benchmark): Use pathlib in agent_interface.py and agent_api_interface.py

* refactor(benchmark): Improve typing, response validation, and readability in app.py

   - Simplified response generation by leveraging FastAPI's type checking and conversion.
   - Introduced use of `HTTPException` for error responses (see the sketch after this list).
   - Improved naming, formatting, and typing in `app.py::create_evaluation`.
   - Updated the docstring on `app.py::create_agent_task`.
   - Fixed return type annotations of `create_single_test` and `create_challenge` in generate_test.py.
   - Added default values to optional attributes on models in report_types_v2.py.
   - Removed unused imports in `generate_test.py`.
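
To illustrate the response-handling style described above, a hedged sketch with a stand-in endpoint and model (not the actual app.py routes):

```python
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI()

# Hypothetical in-memory store standing in for the real task registry
_TASK_SCORES: dict[str, float] = {"demo-task": 1.0}


class TaskEvalResponse(BaseModel):
    task_id: str
    score: float | None = None


@app.get("/tasks/{task_id}/evaluation", response_model=TaskEvalResponse)
def get_task_evaluation(task_id: str) -> TaskEvalResponse:
    if task_id not in _TASK_SCORES:
        # Error responses via HTTPException instead of hand-built JSON responses
        raise HTTPException(status_code=404, detail=f"Task {task_id} not found")
    # Returning the Pydantic model lets FastAPI handle type checking and serialization
    return TaskEvalResponse(task_id=task_id, score=_TASK_SCORES[task_id])
```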

* refactor(benchmark): Clean up logging and print statements

   - Introduced use of the `logging` library for unified logging and better readability.
   - Converted most print statements to use `logger.debug`, `logger.warning`, and `logger.error`.
   - Improved descriptiveness of log statements.
   - Removed unnecessary print statements.
   - Added log statements to broad `except` blocks that previously failed silently.
   - Added a `--debug` flag, which sets the log level to `DEBUG` and enables a more comprehensive log format.
   - Added a `.utils.logging` module with a `configure_logging` function to easily configure the logging library (a sketch follows below).
   - Converted raw escape sequences in `.utils.challenge` to use `colorama`.
   - Renamed `generate_test.py::generate_tests` to `load_challenges`.
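
A rough sketch of what `configure_logging` might look like; the actual format strings and handlers in `agbenchmark.utils.logging` may differ:

```python
import logging


def configure_logging(level: int = logging.INFO) -> None:
    """Sketch: configure the root logger for AGBenchmark output."""
    if level == logging.DEBUG:
        # More comprehensive format when --debug is passed
        log_format = "%(asctime)s %(levelname)s %(name)s:%(lineno)d  %(message)s"
    else:
        log_format = "%(levelname)s  %(message)s"
    logging.basicConfig(level=level, format=log_format)


# The CLI could then map the --debug flag to a log level like this:
# configure_logging(logging.DEBUG if debug else logging.INFO)
```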

* refactor(benchmark): Remove unused server.py and agent_interface.py::run_agent

   - Remove unused server.py file
   - Remove unused run_agent function from agent_interface.py

* refactor(benchmark): Clean up conftest.py

   - Fix and add type annotations
   - Rewrite docstrings
   - Disable or remove unused code
   - Fix definition of arguments and their types in `pytest_addoption`

* refactor(benchmark): Clean up generate_test.py file

   - Refactored the `create_single_test` function for clarity and readability
      - Removed unused variables
      - Made creation of `Challenge` subclasses more straightforward
      - Made bare `except` more specific
   - Renamed `Challenge.setup_challenge` method to `run_challenge`
   - Updated type hints and annotations
   - Made minor code/readability improvements in `load_challenges`
   - Added a helper function `_add_challenge_to_module` for attaching a Challenge class to the current module

* fix(benchmark): Fix and add type annotations in execute_sub_process.py

* refactor(benchmark): Simplify const determination in agent_interface.py

   - Simplify the logic that determines the value of `HELICONE_GRAPHQL_LOGS`
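
The simplified logic presumably boils down to a single boolean expression along these lines (illustrative, not a verbatim copy of agent_interface.py):

```python
import os

# True only when the environment variable is explicitly set to "true"
HELICONE_GRAPHQL_LOGS = os.getenv("HELICONE_GRAPHQL_LOGS", "").lower() == "true"
```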

* fix(benchmark): Register category markers to prevent warnings

   - Use the `pytest_configure` hook to register the known challenge categories as markers; otherwise, Pytest emits "unknown marker" warnings when the tests run.
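
A sketch of that hook, with a placeholder category list (the real list comes from the known challenge categories):

```python
import pytest


def pytest_configure(config: pytest.Config) -> None:
    # Register challenge categories as markers so Pytest doesn't emit
    # "unknown marker" warnings when tests are collected.
    for category in ("interface", "code", "retrieval", "memory"):  # placeholder list
        config.addinivalue_line("markers", f"{category}: challenge category")
```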

* refactor(benchmark/challenges): Fix indentation in 4_revenue_retrieval_2/data.json

* refactor(benchmark): Update agent_api_interface.py

   - Add type annotations to `copy_agent_artifacts_into_temp_folder` function
   - Add note about broken endpoint in the `agent_protocol_client` library
   - Remove unused variable in `run_api_agent` function
   - Improve readability and resolve linting error

* feat(benchmark): Improve and centralize pathfinding

   - Search the path hierarchy for an applicable `agbenchmark_config` folder, rather than assuming it is in the current working directory (see the sketch below).
   - Create `agbenchmark.utils.path_manager` with an `AGBenchmarkPathManager` class and export a `PATH_MANAGER` constant.
   - Replace the path constants defined in `__main__.py` with usages of `PATH_MANAGER`.
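
The hierarchy search could look roughly like this (a sketch; the real `AGBenchmarkPathManager` resolves additional paths):

```python
from pathlib import Path


def find_config_folder(start: Path = Path.cwd()) -> Path:
    """Walk up from `start` until a folder containing agbenchmark_config is found."""
    for path in [start, *start.parents]:
        candidate = path / "agbenchmark_config"
        if candidate.is_dir():
            return candidate
    raise FileNotFoundError("No 'agbenchmark_config' folder found in path hierarchy")
```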

* feat(benchmark/cli): Clean up and improve CLI

   - Updated commands, options, and their descriptions to be more intuitive and consistent
   - Moved slow imports into the entrypoints that use them to speed up application startup
   - Fixed type hints to match output types of Click options
   - Hid deprecated `agbenchmark start` command
   - Refactored code to improve readability and maintainability
   - Moved main entrypoint into `run` subcommand
   - Fixed `version` and `serve` subcommands
   - Added the `click-default-group` package to allow using `run` implicitly, for backwards compatibility (see the sketch below)
   - Renamed `--no_dep` to `--no-dep` for consistency
   - Fixed string formatting issues in log statements
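
A minimal sketch of how `click-default-group` makes `run` the implicit subcommand; the options shown are illustrative, not the full CLI:

```python
import click
from click_default_group import DefaultGroup


@click.group(cls=DefaultGroup, default="run", default_if_no_args=True)
def cli() -> None:
    """AGBenchmark command line interface."""


@cli.command()
@click.option("--no-dep", is_flag=True, help="Run without checking dependencies.")
def run(no_dep: bool) -> None:
    click.echo(f"Running benchmark (no_dep={no_dep})")


if __name__ == "__main__":
    cli()  # invoking the CLI without a subcommand falls through to `run`
```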

* refactor(benchmark/config): Move AgentBenchmarkConfig and related functions to config.py

   - Move the `AgentBenchmarkConfig` class from `utils/data_types.py` to `config.py`.
   - Extract the `calculate_info_test_path` function from `utils/data_types.py` and move it to `config.py` as a private helper function `_calculate_info_test_path`.
   - Move `load_agent_benchmark_config()` to `AgentBenchmarkConfig.load()`.
   - Changed simple getter methods on `AgentBenchmarkConfig` to calculated properties (illustrated in the sketch below).
   - Update all code references according to the changes mentioned above.
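
Sketch of the classmethod-plus-property shape described above; attribute and property names are approximations:

```python
import json
from pathlib import Path

from pydantic import BaseModel


class AgentBenchmarkConfig(BaseModel):
    agbenchmark_config_dir: Path

    @classmethod
    def load(cls, config_dir: Path) -> "AgentBenchmarkConfig":
        with open(config_dir / "config.json") as f:
            return cls(agbenchmark_config_dir=config_dir, **json.load(f))

    @property
    def regression_tests_file(self) -> Path:
        # Formerly a get_regression_reports_path()-style getter
        return self.agbenchmark_config_dir / "regression_tests.json"
```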

* refactor(benchmark): Fix ReportManager init parameter types and use pathlib

   - Fix the type annotation of the `benchmark_start_time` parameter in `ReportManager.__init__`, which was mistyped as `str` instead of `datetime`.
   - Change the type of the `filename` parameter of `ReportManager.__init__` from `str` to `Path`.
   - Rename `self.filename` to `self.report_file` in `ReportManager`.
   - Change the way the report file is created, opened, and saved to use the `Path` object (a simplified sketch follows).
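
An illustrative, simplified shape of the pathlib-based handling (not the full `ReportManager`):

```python
import json
from datetime import datetime
from pathlib import Path


class ReportManager:
    def __init__(self, report_file: Path, benchmark_start_time: datetime):
        self.report_file = report_file
        self.benchmark_start_time = benchmark_start_time

        if report_file.exists():
            self.tests: dict = json.loads(report_file.read_text())
        else:
            report_file.parent.mkdir(parents=True, exist_ok=True)
            report_file.write_text("{}")
            self.tests = {}

    def save(self) -> None:
        self.report_file.write_text(json.dumps(self.tests, indent=4))
```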

* refactor(benchmark): Improve typing surrounding ChallengeData and clean up its implementation

   - Use `ChallengeData` objects instead of untyped `dict`s in app.py, generate_test.py, and reports.py.
   - Remove unnecessary methods `serialize`, `get_data`, `get_json_from_path`, `deserialize` from `ChallengeData` class.
   - Remove unused methods `challenge_from_datum` and `challenge_from_test_data` from the `ChallengeData` class.
   - Update function signatures and annotations of `create_challenge` and `generate_single_test` functions in generate_test.py.
   - Add types to function signatures of `generate_single_call_report` and `finalize_reports` in reports.py.
   - Remove unnecessary `challenge_data` parameter (in generate_test.py) and fixture (in conftest.py).

* refactor(benchmark): Clean up generate_test.py, conftest.py and __main__.py

   - Cleaned up generate_test.py and conftest.py
      - Consolidated challenge creation logic in the `Challenge` class itself, most notably the new `Challenge.from_challenge_spec` method.
      - Moved challenge selection logic from generate_test.py to the `pytest_collection_modifyitems` hook in conftest.py (see the sketch below).
   - Converted methods in the `Challenge` class to class methods where appropriate.
   - Improved argument handling in the `run_benchmark` function in `__main__.py`.
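
A hedged sketch of category-based selection in that hook; it assumes a `--category` append-style option registered via `pytest_addoption`, and the real implementation also handles flags like `--test`, `--maintain`, and `--improve`:

```python
import pytest


def pytest_collection_modifyitems(items: list[pytest.Item], config: pytest.Config) -> None:
    # Deselect challenges whose category markers don't match the --category option
    selected = set(config.getoption("--category") or [])
    if not selected:
        return
    for item in list(items):
        item_categories = {marker.name for marker in item.iter_markers()}
        if not item_categories & selected:
            items.remove(item)
```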

* refactor(benchmark/config): Merge AGBenchmarkPathManager into AgentBenchmarkConfig and reduce fragmented/global state

   - Merge the functionality of `AGBenchmarkPathManager` into `AgentBenchmarkConfig` to consolidate the configuration management.
   - Remove the `.path_manager` module containing `AGBenchmarkPathManager`.
   - Pass the `AgentBenchmarkConfig` and its attributes through function arguments to reduce global state and improve code clarity.

* feat(benchmark/serve): Configurable port for `serve` subcommand

   - Added `--port` option to `serve` subcommand to allow for specifying the port to run the API on.
   - If no `--port` option is provided, the port will default to the value specified in the `PORT` environment variable, or 8080 if not set.
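
Sketch of the fallback logic for the `serve` subcommand (the actual serving code is elided):

```python
import os

import click


@click.command()
@click.option("--port", type=int, help="Port to run the API on.")
def serve(port: int | None) -> None:
    # Fall back to the PORT environment variable, then to 8080
    port = port or int(os.getenv("PORT", 8080))
    click.echo(f"Serving AGBenchmark API on port {port}")
```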

* feat(benchmark/cli): Add `config` subcommand

   - Added a new subcommand `config` to the AGBenchmark CLI, to display information about the present AGBenchmark config.

* fix(benchmark): Gracefully handle incompatible challenge spec files in app.py

   - Added a check to skip deprecated challenges
   - Added logging to allow debugging of the loading process
   - Added handling of validation errors when parsing challenge spec files
   - Added missing `spec_file` attribute to `ChallengeData`

* refactor(benchmark): Move `run_benchmark` entrypoint to main.py, use it in `/reports` endpoint

   - Move `run_benchmark` and `validate_args` from __main__.py to main.py
   - Replace agbenchmark subprocess in `app.py:run_single_test` with `run_benchmark`
   - Move `get_unique_categories` from __main__.py to challenges/__init__.py
   - Move `OPTIONAL_CATEGORIES` from __main__.py to challenge.py
   - Reduce operations on updates.json (including `initialize_updates_file`) outside of API

* refactor(benchmark): Remove unused `/updates` endpoint and all related code

   - Remove `updates_json_file` attribute from `AgentBenchmarkConfig`
   - Remove `get_updates` and `_initialize_updates_file` in app.py
   - Remove `append_updates_file` and `create_update_json` functions in agent_api_interface.py
   - Remove call to `append_updates_file` in challenge.py

* refactor(benchmark/config): Clean up and update docstrings on `AgentBenchmarkConfig`

   - Add and update docstrings
   - Change the base class from `BaseModel` to `BaseSettings`, allowing extra fields for backwards compatibility (see the sketch below)
   - Make naming of path attributes on `AgentBenchmarkConfig` more consistent
   - Remove unused `agent_home_directory` attribute
   - Remove unused `workspace` attribute
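
A hedged sketch of the resulting config class, assuming Pydantic v1-style `BaseSettings`; the attribute set shown is illustrative:

```python
from pathlib import Path

from pydantic import BaseSettings


class AgentBenchmarkConfig(BaseSettings):
    """AGBenchmark configuration, loaded from agbenchmark_config/config.json."""

    agbenchmark_config_dir: Path
    categories: list[str] | None = None

    class Config:
        extra = "allow"  # tolerate legacy/unknown keys for backwards compatibility
```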

* fix(benchmark): Restore mechanism to select (optional) categories in agent benchmark config

* fix(benchmark): Update agent-protocol-client to v1.1.0

   - Fixes an issue with fetching task artifact listings
Author: Reinier van der Leer
Date: 2024-01-02 22:23:09 +01:00
Committed by: GitHub
Parent: b8238c2228
Commit: 25cc6ad6ae

47 changed files with 2122 additions and 7752 deletions

@@ -1,147 +1,34 @@
 import glob
 import importlib
-import json
+import logging
 import os
-import sys
-import types
 from collections import deque
 from pathlib import Path
-from typing import Any, Dict, Optional, Union
-import pytest
-from agbenchmark.__main__ import CHALLENGES_ALREADY_BEATEN
-from agbenchmark.agent_api_interface import append_updates_file
-from agbenchmark.agent_protocol_client.models.step import Step
 from agbenchmark.utils.challenge import Challenge
-from agbenchmark.utils.data_types import AgentBenchmarkConfig, ChallengeData
+from agbenchmark.utils.data_types import ChallengeData
 DATA_CATEGORY = {}
-def create_single_test(
-    data: Dict[str, Any] | ChallengeData,
-    challenge_location: str,
-    file_datum: Optional[list[dict[str, Any]]] = None,
-) -> None:
-    challenge_data = None
-    artifacts_location = None
-    if isinstance(data, ChallengeData):
-        challenge_data = data
-        data = data.get_data()
-    DATA_CATEGORY[data["name"]] = data["category"][0]
-    # Define test class dynamically
-    challenge_class = types.new_class(f"Test{data['name']}", (Challenge,))
-    print(challenge_location)
-    # clean_challenge_location = get_test_path(challenge_location)
-    setattr(challenge_class, "CHALLENGE_LOCATION", challenge_location)
-    setattr(
-        challenge_class,
-        "ARTIFACTS_LOCATION",
-        artifacts_location or str(Path(challenge_location).resolve().parent),
-    )
-    # Define test method within the dynamically created class
-    @pytest.mark.asyncio
-    async def test_method(self, config: Dict[str, Any], request) -> None:  # type: ignore
-        # create a random number between 0 and 1
-        test_name = self.data.name
-        try:
-            with open(CHALLENGES_ALREADY_BEATEN, "r") as f:
-                challenges_beaten_in_the_past = json.load(f)
-        except:
-            challenges_beaten_in_the_past = {}
-        if request.config.getoption("--explore") and challenges_beaten_in_the_past.get(
-            test_name, False
-        ):
-            return None
-        # skip optional categories
-        self.skip_optional_categories(config)
-        from helicone.lock import HeliconeLockManager
-        if os.environ.get("HELICONE_API_KEY"):
-            HeliconeLockManager.write_custom_property("challenge", self.data.name)
-        cutoff = self.data.cutoff or 60
-        timeout = cutoff
-        if "--nc" in sys.argv:
-            timeout = 100000
-        if "--cutoff" in sys.argv:
-            timeout = int(sys.argv[sys.argv.index("--cutoff") + 1])
-        await self.setup_challenge(config, timeout)
-        scores = self.get_scores(config)
-        request.node.answers = (
-            scores["answers"] if "--keep-answers" in sys.argv else None
-        )
-        del scores["answers"]  # remove answers from scores
-        request.node.scores = scores  # store scores in request.node
-        is_score_100 = 1 in scores["values"]
-        evaluation = "Correct!" if is_score_100 else "Incorrect."
-        eval_step = Step(
-            input=evaluation,
-            additional_input=None,
-            task_id="irrelevant, this step is a hack",
-            step_id="irrelevant, this step is a hack",
-            name="",
-            status="created",
-            output=None,
-            additional_output=None,
-            artifacts=[],
-            is_last=True,
-        )
-        await append_updates_file(eval_step)
-        assert is_score_100
-    # Parametrize the method here
-    test_method = pytest.mark.parametrize(
-        "challenge_data",
-        [data],
-        indirect=True,
-    )(test_method)
-    setattr(challenge_class, "test_method", test_method)
-    # Attach the new class to a module so it can be discovered by pytest
-    module = importlib.import_module(__name__)
-    setattr(module, f"Test{data['name']}", challenge_class)
-    return challenge_class
+logger = logging.getLogger(__name__)
-def create_single_suite_challenge(challenge_data: ChallengeData, path: Path) -> None:
-    create_single_test(challenge_data, str(path))
+def create_challenge_from_spec_file(spec_file: Path) -> type[Challenge]:
+    challenge = Challenge.from_challenge_spec(spec_file)
+    DATA_CATEGORY[challenge.data.name] = challenge.data.category[0].value
+    return challenge
-def create_challenge(
-    data: Dict[str, Any],
-    json_file: str,
-    json_files: deque,
-) -> Union[deque, Any]:
-    path = Path(json_file).resolve()
-    print("Creating challenge for", path)
-    challenge_class = create_single_test(data, str(path))
-    print("Creation complete for", path)
-    return json_files, challenge_class
+def create_challenge_from_spec_file_path(spec_file_path: str) -> type[Challenge]:
+    spec_file = Path(spec_file_path).resolve()
+    return create_challenge_from_spec_file(spec_file)
-def generate_tests() -> None:  # sourcery skip: invert-any-all
-    print("Generating tests...")
+def load_challenges() -> None:
+    logger.info("Loading challenges...")
     challenges_path = os.path.join(os.path.dirname(__file__), "challenges")
-    print(f"Looking for challenges in {challenges_path}...")
+    logger.debug(f"Looking for challenges in {challenges_path}...")
     json_files = deque(
         glob.glob(
@@ -150,74 +37,39 @@ def generate_tests() -> None: # sourcery skip: invert-any-all
         )
     )
-    print(f"Found {len(json_files)} challenges.")
-    print(f"Sample path: {json_files[0]}")
-    agent_benchmark_config_path = str(Path.cwd() / "agbenchmark_config" / "config.json")
-    try:
-        with open(agent_benchmark_config_path, "r") as f:
-            agent_benchmark_config = AgentBenchmarkConfig(**json.load(f))
-            agent_benchmark_config.agent_benchmark_config_path = (
-                agent_benchmark_config_path
-            )
-    except json.JSONDecodeError:
-        print("Error: benchmark_config.json is not a valid JSON file.")
-        raise
-    regression_reports_path = agent_benchmark_config.get_regression_reports_path()
-    if regression_reports_path and os.path.exists(regression_reports_path):
-        with open(regression_reports_path, "r") as f:
-            regression_tests = json.load(f)
-    else:
-        regression_tests = {}
+    logger.debug(f"Found {len(json_files)} challenges.")
+    logger.debug(f"Sample path: {json_files[0]}")
+    loaded, ignored = 0, 0
     while json_files:
-        json_file = (
-            json_files.popleft()
-        )  # Take and remove the first element from json_files
+        # Take and remove the first element from json_files
+        json_file = json_files.popleft()
         if challenge_should_be_ignored(json_file):
+            ignored += 1
             continue
-        data = ChallengeData.get_json_from_path(json_file)
+        challenge_info = ChallengeData.parse_file(json_file)
-        commands = sys.argv
-        # --by flag
-        if "--category" in commands:
-            categories = data.get("category", [])
-            commands_set = set(commands)
+        challenge_class = create_challenge_from_spec_file_path(json_file)
-            # Convert the combined list to a set
-            categories_set = set(categories)
+        logger.debug(f"Generated test for {challenge_info.name}")
+        _add_challenge_to_module(challenge_class)
+        loaded += 1
-            # If there's no overlap with commands
-            if not categories_set.intersection(commands_set):
-                continue
-        # --test flag, only run the test if it's the exact one specified
-        tests = []
-        for command in commands:
-            if command.startswith("--test="):
-                tests.append(command.split("=")[1])
-        if tests and data["name"] not in tests:
-            continue
-        # --maintain and --improve flag
-        in_regression = regression_tests.get(data["name"], None)
-        improve_flag = in_regression and "--improve" in commands
-        maintain_flag = not in_regression and "--maintain" in commands
-        if "--maintain" in commands and maintain_flag:
-            continue
-        elif "--improve" in commands and improve_flag:
-            continue
-        json_files, challenge_class = create_challenge(data, json_file, json_files)
-        print(f"Generated test for {data['name']}.")
-    print("Test generation complete.")
+    logger.info(f"Loading challenges complete: loaded {loaded}, ignored {ignored}.")
-def challenge_should_be_ignored(json_file):
-    return "challenges/deprecated" in json_file or "challenges/library" in json_file
+def challenge_should_be_ignored(json_file_path: str):
+    return (
+        "challenges/deprecated" in json_file_path
+        or "challenges/library" in json_file_path
+    )
-generate_tests()
+def _add_challenge_to_module(challenge: type[Challenge]):
+    # Attach the Challenge class to this module so it can be discovered by pytest
+    module = importlib.import_module(__name__)
+    setattr(module, f"{challenge.__name__}", challenge)
+load_challenges()