AGBenchmark codebase clean-up (#6650)
* refactor(benchmark): Deduplicate configuration loading logic
- Move the configuration loading logic to a separate `load_agbenchmark_config` function in the `agbenchmark/config.py` module.
- Replace the duplicate loading logic in `conftest.py`, `generate_test.py`, `ReportManager.py`, `reports.py`, and `__main__.py` with calls to `load_agbenchmark_config` function.
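As a rough illustration of the shared loader idea (a minimal sketch only; the field names and default paths below are assumptions, not taken from the actual `agbenchmark/config.py`):

```python
# Hypothetical sketch of a single shared config loader.
import json
from pathlib import Path

from pydantic import BaseModel


class AgentBenchmarkConfig(BaseModel):
    """Illustrative subset of fields only."""

    host: str = "http://localhost:8000"
    reports_folder: Path = Path("agbenchmark_config/reports")


def load_agbenchmark_config(
    config_dir: Path = Path("agbenchmark_config"),
) -> AgentBenchmarkConfig:
    # Read config.json once, in one place, instead of re-implementing the
    # same parsing in conftest.py, generate_test.py, ReportManager.py, etc.
    config_file = config_dir / "config.json"
    return AgentBenchmarkConfig(**json.loads(config_file.read_text()))
```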
* fix(benchmark): Fix type errors, linting errors, and clean up CLI validation in __main__.py
- Fixed type errors and linting errors in `__main__.py`
- Improved the readability of CLI argument validation by introducing a separate function for it
* refactor(benchmark): Lint and typefix app.py
- Rearranged and cleaned up import statements
- Fixed type errors caused by improper use of `psutil` objects
- Simplified a number of `os.path` usages by converting to `pathlib`
- Use `Task` and `TaskRequestBody` classes from `agent_protocol_client` instead of `.schema`
* refactor(benchmark): Replace `.agent_protocol_client` with `agent-protocol-client`, clean up schema.py
- Remove `agbenchmark.agent_protocol_client` (an offline copy of `agent-protocol-client`).
- Add `agent-protocol-client` as a dependency and change imports to `agent_protocol_client`.
- Fix type annotation on `agent_api_interface.py::upload_artifacts` (`ApiClient` -> `AgentApi`).
- Remove all unused types from schema.py (= most of them).
* refactor(benchmark): Use pathlib in agent_interface.py and agent_api_interface.py
* refactor(benchmark): Improve typing, response validation, and readability in app.py
- Simplified response generation by leveraging type checking and conversion by FastAPI.
- Introduced use of `HTTPException` for error responses.
- Improved naming, formatting, and typing in `app.py::create_evaluation`.
- Updated the docstring on `app.py::create_agent_task`.
- Fixed return type annotations of `create_single_test` and `create_challenge` in generate_test.py.
- Added default values to optional attributes on models in report_types_v2.py.
- Removed unused imports in `generate_test.py`
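A hedged sketch of the pattern described above, i.e. returning typed models and raising `HTTPException` so FastAPI handles serialization and error responses; the endpoint path, model, and in-memory store are invented for illustration and are not the actual `app.py` code:

```python
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel


class TaskInfo(BaseModel):
    task_id: str
    input: str


app = FastAPI()
_TASKS: dict[str, TaskInfo] = {}  # stand-in for real task storage


@app.get("/agent/tasks/{task_id}", response_model=TaskInfo)
def get_task(task_id: str) -> TaskInfo:
    task = _TASKS.get(task_id)
    if task is None:
        # FastAPI turns this into a clean JSON 404 error response
        raise HTTPException(status_code=404, detail=f"Task {task_id} not found")
    # Returning the model lets FastAPI do validation and conversion
    return task
```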
* refactor(benchmark): Clean up logging and print statements
- Introduced use of the `logging` library for unified logging and better readability.
- Converted most print statements to use `logger.debug`, `logger.warning`, and `logger.error`.
- Improved descriptiveness of log statements.
- Removed unnecessary print statements.
- Added log statements to previously silent, unspecific `except` blocks.
- Added `--debug` flag, which sets the log level to `DEBUG` and enables a more comprehensive log format.
- Added `.utils.logging` module with `configure_logging` function to easily configure the logging library.
- Converted raw escape sequences in `.utils.challenge` to use `colorama`.
- Renamed `generate_test.py::generate_tests` to `load_challenges`.
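A minimal sketch of what a `configure_logging` helper along these lines could look like (the format strings and signature are assumptions, not the actual `agbenchmark.utils.logging` implementation):

```python
import logging


def configure_logging(level: int = logging.INFO) -> None:
    debug = level == logging.DEBUG
    logging.basicConfig(
        level=level,
        # Use a more comprehensive format when the --debug flag is set
        format=(
            "%(asctime)s %(levelname)s %(name)s:%(lineno)d  %(message)s"
            if debug
            else "%(levelname)s  %(message)s"
        ),
    )


# e.g. called from the CLI entrypoint:
# configure_logging(logging.DEBUG if debug_flag else logging.INFO)
```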
* refactor(benchmark): Remove unused server.py and agent_interface.py::run_agent
- Remove unused server.py file
- Remove unused run_agent function from agent_interface.py
* refactor(benchmark): Clean up conftest.py
- Fix and add type annotations
- Rewrite docstrings
- Disable or remove unused code
- Fix definition of arguments and their types in `pytest_addoption`
* refactor(benchmark): Clean up generate_test.py file
- Refactored the `create_single_test` function for clarity and readability
- Removed unused variables
- Made creation of `Challenge` subclasses more straightforward
- Made bare `except` more specific
- Renamed `Challenge.setup_challenge` method to `run_challenge`
- Updated type hints and annotations
- Made minor code/readability improvements in `load_challenges`
- Added a helper function `_add_challenge_to_module` for attaching a Challenge class to the current module
* fix(benchmark): Fix and add type annotations in execute_sub_process.py
* refactor(benchmark): Simplify const determination in agent_interface.py
- Simplify the logic that determines the value of `HELICONE_GRAPHQL_LOGS`
* fix(benchmark): Register category markers to prevent warnings
- Use the `pytest_configure` hook to register the known challenge categories as markers. Otherwise, Pytest will raise "unknown marker" warnings at runtime.
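For example, marker registration via `pytest_configure` generally looks like the following sketch (the category list here is illustrative, not the real one):

```python
import pytest

EXAMPLE_CATEGORIES = ["code", "retrieval", "data", "general"]  # illustrative


def pytest_configure(config: pytest.Config) -> None:
    # Registering markers up front prevents "unknown marker" warnings
    for category in EXAMPLE_CATEGORIES:
        config.addinivalue_line("markers", f"{category}: challenge category")
```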
* refactor(benchmark/challenges): Fix indentation in 4_revenue_retrieval_2/data.json
* refactor(benchmark): Update agent_api_interface.py
- Add type annotations to `copy_agent_artifacts_into_temp_folder` function
- Add note about broken endpoint in the `agent_protocol_client` library
- Remove unused variable in `run_api_agent` function
- Improve readability and resolve linting error
* feat(benchmark): Improve and centralize pathfinding
- Search the path hierarchy for an applicable `agbenchmark_config`, rather than assuming it is in the current folder.
- Create `agbenchmark.utils.path_manager` with `AGBenchmarkPathManager`, exporting a `PATH_MANAGER` constant.
- Replace path constants defined in __main__.py with usages of `PATH_MANAGER`.
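A rough sketch of the "walk up the directory tree" idea; the function name and error handling are illustrative and not necessarily those used by `AGBenchmarkPathManager`:

```python
from pathlib import Path


def find_agbenchmark_config_dir(start: Path | None = None) -> Path:
    current = (start or Path.cwd()).resolve()
    # Check the current folder and every parent for an agbenchmark_config dir
    for folder in [current, *current.parents]:
        candidate = folder / "agbenchmark_config"
        if candidate.is_dir():
            return candidate
    raise FileNotFoundError("No agbenchmark_config directory found in path hierarchy")
```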
* feat(benchmark/cli): Clean up and improve CLI
- Updated commands, options, and their descriptions to be more intuitive and consistent
- Moved slow imports into the entrypoints that use them to speed up application startup
- Fixed type hints to match output types of Click options
- Hid deprecated `agbenchmark start` command
- Refactored code to improve readability and maintainability
- Moved main entrypoint into `run` subcommand
- Fixed `version` and `serve` subcommands
- Added `click-default-group` package to allow using `run` implicitly (for backwards compatibility)
- Renamed `--no_dep` to `--no-dep` for consistency
- Fixed string formatting issues in log statements
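A minimal illustration of how `click-default-group` makes `run` the implicit subcommand; the actual AGBenchmark CLI wiring is more involved, and the option shown is only an example:

```python
import click
from click_default_group import DefaultGroup


@click.group(cls=DefaultGroup, default="run", default_if_no_args=True)
def cli() -> None:
    """AGBenchmark-style CLI entrypoint."""


@cli.command()
@click.option("--no-dep", is_flag=True, help="Skip challenge dependency checks.")
def run(no_dep: bool) -> None:
    # Invoking the CLI without an explicit subcommand still runs `run`,
    # which keeps the old `agbenchmark ...` invocations working.
    click.echo(f"running benchmark (no_dep={no_dep})")


if __name__ == "__main__":
    cli()
```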
* refactor(benchmark/config): Move AgentBenchmarkConfig and related functions to config.py
- Move the `AgentBenchmarkConfig` class from `utils/data_types.py` to `config.py`.
- Extract the `calculate_info_test_path` function from `utils/data_types.py` and move it to `config.py` as a private helper function `_calculate_info_test_path`.
- Move `load_agent_benchmark_config()` to `AgentBenchmarkConfig.load()`.
- Changed simple getter methods on `AgentBenchmarkConfig` to calculated properties.
- Update all code references according to the changes mentioned above.
* refactor(benchmark): Fix ReportManager init parameter types and use pathlib
- Fix the type annotation of the `benchmark_start_time` parameter in `ReportManager.__init__`, which was mistyped as `str` instead of `datetime`.
- Change the type of the `filename` parameter in the `ReportManager.__init__` method from `str` to `Path`.
- Rename `self.filename` to `self.report_file` in `ReportManager`.
- Change the way the report file is created, opened and saved to use the `Path` object.
* refactor(benchmark): Improve typing surrounding ChallengeData and clean up its implementation
- Use `ChallengeData` objects instead of untyped `dict` in app.py, generate_test.py, reports.py.
- Remove unnecessary methods `serialize`, `get_data`, `get_json_from_path`, `deserialize` from `ChallengeData` class.
- Remove unused methods `challenge_from_datum` and `challenge_from_test_data` from the `ChallengeData` class.
- Update function signatures and annotations of `create_challenge` and `generate_single_test` functions in generate_test.py.
- Add types to function signatures of `generate_single_call_report` and `finalize_reports` in reports.py.
- Remove unnecessary `challenge_data` parameter (in generate_test.py) and fixture (in conftest.py).
* refactor(benchmark): Clean up generate_test.py, conftest.py and __main__.py
- Cleaned up generate_test.py and conftest.py
- Consolidated challenge creation logic in the `Challenge` class itself, most notably the new `Challenge.from_challenge_spec` method.
- Moved challenge selection logic from generate_test.py to the `pytest_collection_modifyitems` hook in conftest.py.
- Converted methods in the `Challenge` class to class methods where appropriate.
- Improved argument handling in the `run_benchmark` function in `__main__.py`.
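A hedged sketch of selecting challenges in the `pytest_collection_modifyitems` hook rather than at test-generation time; the option name and keyword matching below are assumptions for illustration:

```python
import pytest


def pytest_collection_modifyitems(
    config: pytest.Config, items: list[pytest.Item]
) -> None:
    selected_category = config.getoption("--category", default=None)
    if not selected_category:
        return
    # Keep only items whose markers/keywords include the selected category
    deselected = [item for item in items if selected_category not in item.keywords]
    if deselected:
        config.hook.pytest_deselected(items=deselected)
        items[:] = [item for item in items if item not in deselected]
```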
* refactor(benchmark/config): Merge AGBenchmarkPathManager into AgentBenchmarkConfig and reduce fragmented/global state
- Merge the functionality of `AGBenchmarkPathManager` into `AgentBenchmarkConfig` to consolidate the configuration management.
- Remove the `.path_manager` module containing `AGBenchmarkPathManager`.
- Pass the `AgentBenchmarkConfig` and its attributes through function arguments to reduce global state and improve code clarity.
* feat(benchmark/serve): Configurable port for `serve` subcommand
- Added `--port` option to `serve` subcommand to allow for specifying the port to run the API on.
- If no `--port` option is provided, the port will default to the value specified in the `PORT` environment variable, or 8080 if not set.
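One possible shape of such an option (illustrative only; the served app module and the use of `uvicorn` are assumptions, not confirmed details of the `serve` subcommand):

```python
import click
import uvicorn  # assumed ASGI server for this sketch


@click.command()
@click.option(
    "--port",
    type=int,
    envvar="PORT",   # fall back to the PORT environment variable
    default=8080,    # final fallback if neither is provided
    show_default=True,
)
def serve(port: int) -> None:
    """Serve the benchmark API on the given port."""
    uvicorn.run("agbenchmark.app:app", host="0.0.0.0", port=port)
```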
* feat(benchmark/cli): Add `config` subcommand
- Added a new subcommand `config` to the AGBenchmark CLI, to display information about the present AGBenchmark config.
* fix(benchmark): Gracefully handle incompatible challenge spec files in app.py
- Added a check to skip deprecated challenges
- Added logging to allow debugging of the loading process
- Added handling of validation errors when parsing challenge spec files
- Added missing `spec_file` attribute to `ChallengeData`
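The general pattern is sketched below (pydantic v1 style, with a simplified stand-in for the real `ChallengeData` model; paths and field names are illustrative):

```python
import logging
from pathlib import Path

from pydantic import BaseModel, ValidationError

logger = logging.getLogger(__name__)


class ChallengeSpec(BaseModel):  # stand-in for the real ChallengeData model
    name: str
    category: list[str]


def load_challenge_specs(spec_files: list[Path]) -> list[ChallengeSpec]:
    specs = []
    for spec_file in spec_files:
        if "deprecated" in spec_file.parts:
            logger.debug(f"Skipping deprecated challenge: {spec_file}")
            continue
        try:
            specs.append(ChallengeSpec.parse_file(spec_file))
        except ValidationError as e:
            # Log and move on instead of crashing the whole listing
            logger.warning(f"Spec file {spec_file} is invalid, skipping: {e}")
    return specs
```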
* refactor(benchmark): Move `run_benchmark` entrypoint to main.py, use it in `/reports` endpoint
- Move `run_benchmark` and `validate_args` from __main__.py to main.py
- Replace agbenchmark subprocess in `app.py:run_single_test` with `run_benchmark`
- Move `get_unique_categories` from __main__.py to challenges/__init__.py
- Move `OPTIONAL_CATEGORIES` from __main__.py to challenge.py
- Reduce operations on updates.json (including `initialize_updates_file`) outside of API
* refactor(benchmark): Remove unused `/updates` endpoint and all related code
- Remove `updates_json_file` attribute from `AgentBenchmarkConfig`
- Remove `get_updates` and `_initialize_updates_file` in app.py
- Remove `append_updates_file` and `create_update_json` functions in agent_api_interface.py
- Remove call to `append_updates_file` in challenge.py
* refactor(benchmark/config): Clean up and update docstrings on `AgentBenchmarkConfig`
- Add and update docstrings
- Change base class from `BaseModel` to `BaseSettings`, allow extras for backwards compatibility
- Make naming of path attributes on `AgentBenchmarkConfig` more consistent
- Remove unused `agent_home_directory` attribute
- Remove unused `workspace` attribute
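A hedged sketch of a `BaseSettings`-based config with extras allowed and derived paths as properties (pydantic v1 style; the field names here are examples, not the real attributes):

```python
from pathlib import Path

from pydantic import BaseSettings


class AgentBenchmarkConfig(BaseSettings):
    class Config:
        extra = "allow"  # tolerate legacy keys in old config.json files

    agbenchmark_config_dir: Path = Path("agbenchmark_config")
    host: str = "http://localhost:8000"

    @property
    def reports_folder(self) -> Path:
        # Computed from the config location instead of stored separately
        return self.agbenchmark_config_dir / "reports"
```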
* fix(benchmark): Restore mechanism to select (optional) categories in agent benchmark config
* fix(benchmark): Update agent-protocol-client to v1.1.0
- Fixes issue with fetching task artifact listings
Commit 25cc6ad6ae (parent b8238c2228), committed by GitHub.
agbenchmark/generate_test.py
@@ -1,147 +1,34 @@
 import glob
 import importlib
-import json
+import logging
 import os
-import sys
-import types
 from collections import deque
 from pathlib import Path
-from typing import Any, Dict, Optional, Union
-
-import pytest
 
-from agbenchmark.__main__ import CHALLENGES_ALREADY_BEATEN
-from agbenchmark.agent_api_interface import append_updates_file
-from agbenchmark.agent_protocol_client.models.step import Step
 from agbenchmark.utils.challenge import Challenge
-from agbenchmark.utils.data_types import AgentBenchmarkConfig, ChallengeData
+from agbenchmark.utils.data_types import ChallengeData
 
 DATA_CATEGORY = {}
 
-
-def create_single_test(
-    data: Dict[str, Any] | ChallengeData,
-    challenge_location: str,
-    file_datum: Optional[list[dict[str, Any]]] = None,
-) -> None:
-    challenge_data = None
-    artifacts_location = None
-    if isinstance(data, ChallengeData):
-        challenge_data = data
-        data = data.get_data()
-
-    DATA_CATEGORY[data["name"]] = data["category"][0]
-
-    # Define test class dynamically
-    challenge_class = types.new_class(f"Test{data['name']}", (Challenge,))
-    print(challenge_location)
-    # clean_challenge_location = get_test_path(challenge_location)
-    setattr(challenge_class, "CHALLENGE_LOCATION", challenge_location)
-
-    setattr(
-        challenge_class,
-        "ARTIFACTS_LOCATION",
-        artifacts_location or str(Path(challenge_location).resolve().parent),
-    )
-
-    # Define test method within the dynamically created class
-    @pytest.mark.asyncio
-    async def test_method(self, config: Dict[str, Any], request) -> None:  # type: ignore
-        # create a random number between 0 and 1
-        test_name = self.data.name
-
-        try:
-            with open(CHALLENGES_ALREADY_BEATEN, "r") as f:
-                challenges_beaten_in_the_past = json.load(f)
-        except:
-            challenges_beaten_in_the_past = {}
-
-        if request.config.getoption("--explore") and challenges_beaten_in_the_past.get(
-            test_name, False
-        ):
-            return None
-
-        # skip optional categories
-        self.skip_optional_categories(config)
-
-        from helicone.lock import HeliconeLockManager
-
-        if os.environ.get("HELICONE_API_KEY"):
-            HeliconeLockManager.write_custom_property("challenge", self.data.name)
-
-        cutoff = self.data.cutoff or 60
-
-        timeout = cutoff
-        if "--nc" in sys.argv:
-            timeout = 100000
-        if "--cutoff" in sys.argv:
-            timeout = int(sys.argv[sys.argv.index("--cutoff") + 1])
-
-        await self.setup_challenge(config, timeout)
-
-        scores = self.get_scores(config)
-        request.node.answers = (
-            scores["answers"] if "--keep-answers" in sys.argv else None
-        )
-        del scores["answers"]  # remove answers from scores
-        request.node.scores = scores  # store scores in request.node
-        is_score_100 = 1 in scores["values"]
-
-        evaluation = "Correct!" if is_score_100 else "Incorrect."
-        eval_step = Step(
-            input=evaluation,
-            additional_input=None,
-            task_id="irrelevant, this step is a hack",
-            step_id="irrelevant, this step is a hack",
-            name="",
-            status="created",
-            output=None,
-            additional_output=None,
-            artifacts=[],
-            is_last=True,
-        )
-        await append_updates_file(eval_step)
-
-        assert is_score_100
-
-    # Parametrize the method here
-    test_method = pytest.mark.parametrize(
-        "challenge_data",
-        [data],
-        indirect=True,
-    )(test_method)
-
-    setattr(challenge_class, "test_method", test_method)
-
-    # Attach the new class to a module so it can be discovered by pytest
-    module = importlib.import_module(__name__)
-    setattr(module, f"Test{data['name']}", challenge_class)
-    return challenge_class
+logger = logging.getLogger(__name__)
 
 
-def create_single_suite_challenge(challenge_data: ChallengeData, path: Path) -> None:
-    create_single_test(challenge_data, str(path))
+def create_challenge_from_spec_file(spec_file: Path) -> type[Challenge]:
+    challenge = Challenge.from_challenge_spec(spec_file)
+    DATA_CATEGORY[challenge.data.name] = challenge.data.category[0].value
+    return challenge
 
 
-def create_challenge(
-    data: Dict[str, Any],
-    json_file: str,
-    json_files: deque,
-) -> Union[deque, Any]:
-    path = Path(json_file).resolve()
-    print("Creating challenge for", path)
-
-    challenge_class = create_single_test(data, str(path))
-    print("Creation complete for", path)
-
-    return json_files, challenge_class
+def create_challenge_from_spec_file_path(spec_file_path: str) -> type[Challenge]:
+    spec_file = Path(spec_file_path).resolve()
+    return create_challenge_from_spec_file(spec_file)
 
 
-def generate_tests() -> None:  # sourcery skip: invert-any-all
-    print("Generating tests...")
+def load_challenges() -> None:
+    logger.info("Loading challenges...")
 
     challenges_path = os.path.join(os.path.dirname(__file__), "challenges")
-    print(f"Looking for challenges in {challenges_path}...")
+    logger.debug(f"Looking for challenges in {challenges_path}...")
 
     json_files = deque(
         glob.glob(
@@ -150,74 +37,39 @@ def generate_tests() -> None:  # sourcery skip: invert-any-all
         )
     )
 
-    print(f"Found {len(json_files)} challenges.")
-    print(f"Sample path: {json_files[0]}")
-
-    agent_benchmark_config_path = str(Path.cwd() / "agbenchmark_config" / "config.json")
-    try:
-        with open(agent_benchmark_config_path, "r") as f:
-            agent_benchmark_config = AgentBenchmarkConfig(**json.load(f))
-            agent_benchmark_config.agent_benchmark_config_path = (
-                agent_benchmark_config_path
-            )
-    except json.JSONDecodeError:
-        print("Error: benchmark_config.json is not a valid JSON file.")
-        raise
-
-    regression_reports_path = agent_benchmark_config.get_regression_reports_path()
-    if regression_reports_path and os.path.exists(regression_reports_path):
-        with open(regression_reports_path, "r") as f:
-            regression_tests = json.load(f)
-    else:
-        regression_tests = {}
+    logger.debug(f"Found {len(json_files)} challenges.")
+    logger.debug(f"Sample path: {json_files[0]}")
 
+    loaded, ignored = 0, 0
     while json_files:
-        json_file = (
-            json_files.popleft()
-        )  # Take and remove the first element from json_files
+        # Take and remove the first element from json_files
+        json_file = json_files.popleft()
         if challenge_should_be_ignored(json_file):
+            ignored += 1
             continue
 
-        data = ChallengeData.get_json_from_path(json_file)
+        challenge_info = ChallengeData.parse_file(json_file)
 
-        commands = sys.argv
-        # --by flag
-        if "--category" in commands:
-            categories = data.get("category", [])
-            commands_set = set(commands)
+        challenge_class = create_challenge_from_spec_file_path(json_file)
 
-            # Convert the combined list to a set
-            categories_set = set(categories)
+        logger.debug(f"Generated test for {challenge_info.name}")
+        _add_challenge_to_module(challenge_class)
+        loaded += 1
 
-            # If there's no overlap with commands
-            if not categories_set.intersection(commands_set):
-                continue
-
-        # --test flag, only run the test if it's the exact one specified
-        tests = []
-        for command in commands:
-            if command.startswith("--test="):
-                tests.append(command.split("=")[1])
-
-        if tests and data["name"] not in tests:
-            continue
-
-        # --maintain and --improve flag
-        in_regression = regression_tests.get(data["name"], None)
-        improve_flag = in_regression and "--improve" in commands
-        maintain_flag = not in_regression and "--maintain" in commands
-        if "--maintain" in commands and maintain_flag:
-            continue
-        elif "--improve" in commands and improve_flag:
-            continue
-        json_files, challenge_class = create_challenge(data, json_file, json_files)
-
-        print(f"Generated test for {data['name']}.")
-    print("Test generation complete.")
+    logger.info(f"Loading challenges complete: loaded {loaded}, ignored {ignored}.")
 
 
-def challenge_should_be_ignored(json_file):
-    return "challenges/deprecated" in json_file or "challenges/library" in json_file
+def challenge_should_be_ignored(json_file_path: str):
+    return (
+        "challenges/deprecated" in json_file_path
+        or "challenges/library" in json_file_path
    )
 
 
-generate_tests()
+def _add_challenge_to_module(challenge: type[Challenge]):
+    # Attach the Challenge class to this module so it can be discovered by pytest
+    module = importlib.import_module(__name__)
    setattr(module, f"{challenge.__name__}", challenge)
+
+
+load_challenges()