Auto-GPT/benchmark/agbenchmark/conftest.py
Reinier van der Leer 25cc6ad6ae AGBenchmark codebase clean-up (#6650)
* refactor(benchmark): Deduplicate configuration loading logic

   - Move the configuration loading logic to a separate `load_agbenchmark_config` function in the `agbenchmark/config.py` module.
   - Replace the duplicate loading logic in `conftest.py`, `generate_test.py`, `ReportManager.py`, `reports.py`, and `__main__.py` with calls to the `load_agbenchmark_config` function.

* fix(benchmark): Fix type errors, linting errors, and clean up CLI validation in __main__.py

   - Fixed type errors and linting errors in `__main__.py`
   - Improved the readability of CLI argument validation by introducing a separate function for it

* refactor(benchmark): Lint and typefix app.py

   - Rearranged and cleaned up import statements
   - Fixed type errors caused by improper use of `psutil` objects
   - Simplified a number of `os.path` usages by converting to `pathlib`
   - Use `Task` and `TaskRequestBody` classes from `agent_protocol_client` instead of `.schema`

* refactor(benchmark): Replace `.agent_protocol_client` with `agent-protocol-client`, clean up schema.py

   - Remove `agbenchmark.agent_protocol_client` (an offline copy of `agent-protocol-client`).
      - Add `agent-protocol-client` as a dependency and change imports to `agent_protocol_client`.
   - Fix type annotation on `agent_api_interface.py::upload_artifacts` (`ApiClient` -> `AgentApi`).
   - Remove all unused types from schema.py (= most of them).

* refactor(benchmark): Use pathlib in agent_interface.py and agent_api_interface.py

* refactor(benchmark): Improve typing, response validation, and readability in app.py

   - Simplified response generation by leveraging type checking and conversion by FastAPI.
   - Introduced use of `HTTPException` for error responses (see the sketch after this list).
   - Improved naming, formatting, and typing in `app.py::create_evaluation`.
   - Updated the docstring on `app.py::create_agent_task`.
   - Fixed return type annotations of `create_single_test` and `create_challenge` in generate_test.py.
   - Added default values to optional attributes on models in report_types_v2.py.
   - Removed unused imports in `generate_test.py`.
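
A rough illustration of the first two points; the route, the in-memory `tasks` store, and the exact response model are hypothetical stand-ins, not the actual app.py code:

```python
from fastapi import FastAPI, HTTPException

from agent_protocol_client import Task  # per the earlier change to use these models

app = FastAPI()
tasks: dict[str, Task] = {}  # hypothetical in-memory store, for illustration only


@app.get("/ap/v1/agent/tasks/{task_id}", response_model=Task)
async def get_agent_task(task_id: str) -> Task:
    # FastAPI validates and converts the returned object against `response_model`,
    # so no manual serialization of the response is needed.
    task = tasks.get(task_id)
    if task is None:
        raise HTTPException(status_code=404, detail=f"Task {task_id} not found")
    return task
```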

* refactor(benchmark): Clean up logging and print statements

   - Introduced use of the `logging` library for unified logging and better readability.
   - Converted most print statements to use `logger.debug`, `logger.warning`, and `logger.error`.
   - Improved descriptiveness of log statements.
   - Removed unnecessary print statements.
   - Added log statements to unspecific and non-verbose `except` blocks.
   - Added `--debug` flag, which sets the log level to `DEBUG` and enables a more comprehensive log format.
   - Added `.utils.logging` module with a `configure_logging` function to easily configure the logging library (see the sketch after this list).
   - Converted raw escape sequences in `.utils.challenge` to use `colorama`.
   - Renamed `generate_test.py::generate_tests` to `load_challenges`.
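
A minimal sketch of what such a `configure_logging` helper could look like; the format strings and signature here are assumptions, not the actual `agbenchmark.utils.logging` implementation:

```python
import logging

# Illustrative formats: a terse default, and a more comprehensive one for --debug.
SIMPLE_LOG_FORMAT = "%(levelname)s  %(message)s"
DEBUG_LOG_FORMAT = "%(asctime)s %(levelname)s %(filename)s:%(lineno)d  %(message)s"


def configure_logging(level: int = logging.INFO) -> None:
    """Configure the root logger, using a more detailed format for DEBUG."""
    logging.basicConfig(
        level=level,
        format=DEBUG_LOG_FORMAT if level == logging.DEBUG else SIMPLE_LOG_FORMAT,
    )
```

The `--debug` flag would then amount to calling `configure_logging(logging.DEBUG)` at startup.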

* refactor(benchmark): Remove unused server.py and agent_interface.py::run_agent

   - Remove unused server.py file
   - Remove unused run_agent function from agent_interface.py

* refactor(benchmark): Clean up conftest.py

   - Fix and add type annotations
   - Rewrite docstrings
   - Disable or remove unused code
   - Fix definition of arguments and their types in `pytest_addoption`

* refactor(benchmark): Clean up generate_test.py file

   - Refactored the `create_single_test` function for clarity and readability
      - Removed unused variables
      - Made creation of `Challenge` subclasses more straightforward
      - Made bare `except` more specific
   - Renamed `Challenge.setup_challenge` method to `run_challenge`
   - Updated type hints and annotations
   - Made minor code/readability improvements in `load_challenges`
   - Added a helper function `_add_challenge_to_module` for attaching a `Challenge` class to the current module (see the sketch after this list)
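
Such a helper plausibly boils down to a `setattr` on the module object; a hedged sketch of the idea, not the exact implementation:

```python
import sys

from agbenchmark.utils.challenge import Challenge


def _add_challenge_to_module(challenge: type[Challenge]) -> None:
    # Expose the dynamically created Challenge subclass as a module-level
    # attribute so that pytest's collector picks it up as a test class.
    setattr(sys.modules[__name__], challenge.__name__, challenge)
```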

* fix(benchmark): Fix and add type annotations in execute_sub_process.py

* refactor(benchmark): Simplify const determination in agent_interface.py

   - Simplify the logic that determines the value of `HELICONE_GRAPHQL_LOGS`

* fix(benchmark): Register category markers to prevent warnings

   - Use the `pytest_configure` hook to register the known challenge categories as markers. Otherwise, Pytest will raise "unknown marker" warnings at runtime.

* refactor(benchmark/challenges): Fix indentation in 4_revenue_retrieval_2/data.json

* refactor(benchmark): Update agent_api_interface.py

   - Add type annotations to `copy_agent_artifacts_into_temp_folder` function
   - Add note about broken endpoint in the `agent_protocol_client` library
   - Remove unused variable in `run_api_agent` function
   - Improve readability and resolve linting error

* feat(benchmark): Improve and centralize pathfinding

   - Search the path hierarchy for an applicable `agbenchmark_config`, rather than assuming it's in the current folder (see the sketch after this list).
   - Create `agbenchmark.utils.path_manager` with `AGBenchmarkPathManager` and export a `PATH_MANAGER` constant.
   - Replace path constants defined in `__main__.py` with usages of `PATH_MANAGER`.
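
The hierarchy search presumably walks up the parent directories of the working directory; a minimal sketch, with the function name and error handling as assumptions:

```python
from pathlib import Path


def find_config_folder(start: Path | None = None) -> Path:
    """Return the nearest `agbenchmark_config` folder in `start` or any of its parents."""
    start = start or Path.cwd()
    for candidate_dir in [start, *start.parents]:
        config_folder = candidate_dir / "agbenchmark_config"
        if config_folder.is_dir():
            return config_folder
    raise FileNotFoundError(
        f"No 'agbenchmark_config' directory found in {start} or any of its parents"
    )
```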

* feat(benchmark/cli): Clean up and improve CLI

   - Updated commands, options, and their descriptions to be more intuitive and consistent
   - Moved slow imports into the entrypoints that use them to speed up application startup
   - Fixed type hints to match output types of Click options
   - Hid deprecated `agbenchmark start` command
   - Refactored code to improve readability and maintainability
   - Moved main entrypoint into `run` subcommand
   - Fixed `version` and `serve` subcommands
   - Added `click-default-group` package to allow using `run` implicitly (for backwards compatibility; see the sketch after this list)
   - Renamed `--no_dep` to `--no-dep` for consistency
   - Fixed string formatting issues in log statements
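
For the `click-default-group` point, the wiring looks roughly like this; the `--no-dep` option shown is illustrative, not the full AGBenchmark CLI:

```python
import click
from click_default_group import DefaultGroup


@click.group(cls=DefaultGroup, default="run", default_if_no_args=True)
def cli() -> None:
    """AGBenchmark command-line interface."""


@cli.command()
@click.option("--no-dep", is_flag=True, help="Run without checking dependencies.")
def run(no_dep: bool) -> None:
    """Run the benchmark (also invoked when no subcommand is given)."""
    click.echo(f"Running benchmark (no_dep={no_dep})")
```

With `default_if_no_args=True`, invoking `agbenchmark` without a subcommand behaves like `agbenchmark run`, which preserves backwards compatibility with the old CLI.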

* refactor(benchmark/config): Move AgentBenchmarkConfig and related functions to config.py

   - Move the `AgentBenchmarkConfig` class from `utils/data_types.py` to `config.py`.
   - Extract the `calculate_info_test_path` function from `utils/data_types.py` and move it to `config.py` as a private helper function `_calculate_info_test_path`.
   - Move `load_agent_benchmark_config()` to `AgentBenchmarkConfig.load()`.
   - Changed simple getter methods on `AgentBenchmarkConfig` to calculated properties (sketched after this list).
   - Update all code references according to the changes mentioned above.
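
A hedged illustration of the last two points; the attribute, file, and property names are made up for the example and do not mirror the real class exactly:

```python
import json
from pathlib import Path

from pydantic import BaseModel


class AgentBenchmarkConfig(BaseModel):
    agbenchmark_config_dir: Path  # illustrative attribute name

    @classmethod
    def load(cls, config_dir: Path) -> "AgentBenchmarkConfig":
        # Replaces the former free-standing load_agent_benchmark_config() function.
        config_file = config_dir / "config.json"
        return cls(
            agbenchmark_config_dir=config_dir,
            **json.loads(config_file.read_text()),
        )

    @property
    def reports_folder(self) -> Path:
        # A calculated property replacing a get_reports_path()-style getter.
        return self.agbenchmark_config_dir / "reports"
```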

* refactor(benchmark): Fix ReportManager init parameter types and use pathlib

   - Fix the type annotation of the `benchmark_start_time` parameter in `ReportManager.__init__`; it was mistyped as `str` instead of `datetime`.
   - Change the type of the `filename` parameter in the `ReportManager.__init__` method from `str` to `Path`.
   - Rename `self.filename` to `self.report_file` in `ReportManager`.
   - Change the way the report file is created, opened and saved to use the `Path` object.

* refactor(benchmark): Improve typing surrounding ChallengeData and clean up its implementation

   - Use `ChallengeData` objects instead of untyped `dict`s in app.py, generate_test.py, and reports.py.
   - Remove unnecessary methods `serialize`, `get_data`, `get_json_from_path`, `deserialize` from `ChallengeData` class.
   - Remove unused methods `challenge_from_datum` and `challenge_from_test_data` from the `ChallengeData` class.
   - Update function signatures and annotations of `create_challenge` and `generate_single_test` functions in generate_test.py.
   - Add types to function signatures of `generate_single_call_report` and `finalize_reports` in reports.py.
   - Remove unnecessary `challenge_data` parameter (in generate_test.py) and fixture (in conftest.py).

* refactor(benchmark): Clean up generate_test.py, conftest.py and __main__.py

   - Cleaned up generate_test.py and conftest.py
      - Consolidated challenge creation logic in the `Challenge` class itself, most notably the new `Challenge.from_challenge_spec` method.
      - Moved challenge selection logic from generate_test.py to the `pytest_collection_modifyitems` hook in conftest.py.
   - Converted methods in the `Challenge` class to class methods where appropriate.
   - Improved argument handling in the `run_benchmark` function in `__main__.py`.

* refactor(benchmark/config): Merge AGBenchmarkPathManager into AgentBenchmarkConfig and reduce fragmented/global state

   - Merge the functionality of `AGBenchmarkPathManager` into `AgentBenchmarkConfig` to consolidate the configuration management.
   - Remove the `.path_manager` module containing `AGBenchmarkPathManager`.
   - Pass the `AgentBenchmarkConfig` and its attributes through function arguments to reduce global state and improve code clarity.

* feat(benchmark/serve): Configurable port for `serve` subcommand

   - Added a `--port` option to the `serve` subcommand to allow specifying the port to run the API on (see the sketch after this list).
   - If no `--port` option is provided, the port will default to the value specified in the `PORT` environment variable, or 8080 if not set.
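
A minimal sketch of that fallback logic, assuming the API is a FastAPI app served with uvicorn; the exact option wiring in the real CLI may differ:

```python
import os

import click
import uvicorn


@click.command()
@click.option("--port", type=int, help="Port to run the API on.")
def serve(port: int | None) -> None:
    """Serve the AGBenchmark API on the requested port."""
    # --port wins; otherwise fall back to the PORT env var, then to 8080.
    port = port or int(os.getenv("PORT", 8080))
    uvicorn.run("agbenchmark.app:app", host="localhost", port=port)  # app path is illustrative
```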

* feat(benchmark/cli): Add `config` subcommand

   - Added a new subcommand `config` to the AGBenchmark CLI, to display information about the present AGBenchmark config.

* fix(benchmark): Gracefully handle incompatible challenge spec files in app.py

   - Added a check to skip deprecated challenges
   - Added logging to allow debugging of the loading process
   - Added handling of validation errors when parsing challenge spec files (sketched after this list)
   - Added missing `spec_file` attribute to `ChallengeData`
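
A hedged sketch of the loading pattern described here, assuming `ChallengeData` is a Pydantic model importable from `agbenchmark.utils.data_types`; the function name is made up, and the deprecated-challenge check is omitted:

```python
import logging
from pathlib import Path

from pydantic import ValidationError

from agbenchmark.utils.data_types import ChallengeData

logger = logging.getLogger(__name__)


def try_load_challenge_spec(spec_file: Path) -> ChallengeData | None:
    """Parse a challenge spec file, returning None if it is incompatible."""
    logger.debug(f"Loading challenge spec: {spec_file}")
    try:
        return ChallengeData.parse_file(spec_file)
    except ValidationError as e:
        logger.warning(f"Skipping incompatible spec file {spec_file}: {e}")
        return None
```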

* refactor(benchmark): Move `run_benchmark` entrypoint to main.py, use it in `/reports` endpoint

   - Move `run_benchmark` and `validate_args` from __main__.py to main.py
   - Replace agbenchmark subprocess in `app.py:run_single_test` with `run_benchmark`
   - Move `get_unique_categories` from __main__.py to challenges/__init__.py
   - Move `OPTIONAL_CATEGORIES` from __main__.py to challenge.py
   - Reduce operations on updates.json (including `initialize_updates_file`) outside of the API

* refactor(benchmark): Remove unused `/updates` endpoint and all related code

   - Remove `updates_json_file` attribute from `AgentBenchmarkConfig`
   - Remove `get_updates` and `_initialize_updates_file` in app.py
   - Remove `append_updates_file` and `create_update_json` functions in agent_api_interface.py
   - Remove call to `append_updates_file` in challenge.py

* refactor(benchmark/config): Clean up and update docstrings on `AgentBenchmarkConfig`

   - Add and update docstrings
   - Change base class from `BaseModel` to `BaseSettings`, allow extras for backwards compatibility
   - Make naming of path attributes on `AgentBenchmarkConfig` more consistent
   - Remove unused `agent_home_directory` attribute
   - Remove unused `workspace` attribute

* fix(benchmark): Restore mechanism to select (optional) categories in agent benchmark config

* fix(benchmark): Update agent-protocol-client to v1.1.0

   - Fixes issue with fetching task artifact listings
2024-01-02 22:23:09 +01:00

345 lines
13 KiB
Python

import contextlib
import json
import logging
import os
import shutil
import threading
import time
from pathlib import Path
from typing import Any, Generator

import pytest

from agbenchmark.config import AgentBenchmarkConfig
from agbenchmark.reports.reports import (
    finalize_reports,
    generate_single_call_report,
    session_finish,
)
from agbenchmark.utils.challenge import Challenge
from agbenchmark.utils.data_types import Category

GLOBAL_TIMEOUT = (
    1500  # The tests will stop after 25 minutes so we can send the reports.
)

agbenchmark_config = AgentBenchmarkConfig.load()
logger = logging.getLogger(__name__)

pytest_plugins = ["agbenchmark.utils.dependencies"]
collect_ignore = ["challenges"]
suite_reports: dict[str, list] = {}


@pytest.fixture(scope="module")
def config() -> AgentBenchmarkConfig:
    return agbenchmark_config


@pytest.fixture(autouse=True)
def temp_folder() -> Generator[Path, None, None]:
    """
    Pytest fixture that sets up and tears down the temporary folder for each test.
    It is automatically used in every test due to the 'autouse=True' parameter.
    """
    # create output directory if it doesn't exist
    if not os.path.exists(agbenchmark_config.temp_folder):
        os.makedirs(agbenchmark_config.temp_folder, exist_ok=True)

    yield agbenchmark_config.temp_folder

    # teardown after test function completes
    if not os.getenv("KEEP_TEMP_FOLDER_FILES"):
        for filename in os.listdir(agbenchmark_config.temp_folder):
            file_path = os.path.join(agbenchmark_config.temp_folder, filename)
            try:
                if os.path.isfile(file_path) or os.path.islink(file_path):
                    os.unlink(file_path)
                elif os.path.isdir(file_path):
                    shutil.rmtree(file_path)
            except Exception as e:
                logger.warning(f"Failed to delete {file_path}. Reason: {e}")


def pytest_addoption(parser: pytest.Parser) -> None:
    """
    Pytest hook that adds command-line options to the `pytest` command.
    The added options are specific to agbenchmark and control its behavior:
    * `--mock` is used to run the tests in mock mode.
    * `--host` is used to specify the host for the tests.
    * `--category` is used to run only tests of a specific category.
    * `--nc` is used to run the tests without caching.
    * `--cutoff` is used to specify a cutoff time for the tests.
    * `--improve` is used to run only the tests that are marked for improvement.
    * `--maintain` is used to run only the tests that are marked for maintenance.
    * `--explore` is used to run the tests in exploration mode.
    * `--test` is used to run a specific test.
    * `--no-dep` is used to run the tests without dependencies.
    * `--keep-answers` is used to keep the answers of the tests.

    Args:
        parser: The Pytest CLI parser to which the command-line options are added.
    """
    parser.addoption("--no-dep", action="store_true")
    parser.addoption("--mock", action="store_true")
    parser.addoption("--host", default=None)
    parser.addoption("--nc", action="store_true")
    parser.addoption("--cutoff", action="store")
    parser.addoption("--category", action="append")
    parser.addoption("--test", action="append")
    parser.addoption("--improve", action="store_true")
    parser.addoption("--maintain", action="store_true")
    parser.addoption("--explore", action="store_true")
    parser.addoption("--keep-answers", action="store_true")


def pytest_configure(config: pytest.Config) -> None:
    # Register category markers to prevent "unknown marker" warnings
    for category in Category:
        config.addinivalue_line("markers", f"{category.value}: {category}")


@pytest.fixture(autouse=True)
def check_regression(request: pytest.FixtureRequest) -> None:
    """
    Fixture that checks for every test if it should be treated as a regression test,
    and whether to skip it based on that.

    The test name is retrieved from the `request` object. Regression reports are loaded
    from the path specified in the benchmark configuration.

    Effect:
    * If the `--improve` option is used and the current test is considered a regression
      test, it is skipped.
    * If the `--maintain` option is used and the current test is not considered a
      regression test, it is also skipped.

    Args:
        request: The request object from which the test name and the benchmark
            configuration are retrieved.
    """
    test_name = request.node.parent.name
    with contextlib.suppress(FileNotFoundError):
        regression_report = agbenchmark_config.regression_tests_file
        data = json.loads(regression_report.read_bytes())
        challenge_location = getattr(request.node.parent.cls, "CHALLENGE_LOCATION", "")

        skip_string = f"Skipping {test_name} at {challenge_location}"

        # Check if the test name exists in the regression tests
        if request.config.getoption("--improve") and data.get(test_name, None):
            pytest.skip(f"{skip_string} because it's a regression test")
        elif request.config.getoption("--maintain") and not data.get(test_name, None):
            pytest.skip(f"{skip_string} because it's not a regression test")


@pytest.fixture(autouse=True, scope="session")
def mock(request: pytest.FixtureRequest) -> bool:
    """
    Pytest fixture that retrieves the value of the `--mock` command-line option.
    The `--mock` option is used to run the tests in mock mode.

    Args:
        request: The `pytest.FixtureRequest` from which the `--mock` option value
            is retrieved.

    Returns:
        bool: Whether `--mock` is set for this session.
    """
    return request.config.getoption("--mock")


@pytest.fixture(autouse=True, scope="function")
def timer(request: pytest.FixtureRequest) -> Generator[None, None, None]:
    """
    Pytest fixture that times the execution of each test.
    At the start of each test, it records the current time.
    After the test function completes, it calculates the run time and adds it to
    the test node's `user_properties`.

    Args:
        request: The `pytest.FixtureRequest` object through which the run time is stored
            in the test node's `user_properties`.
    """
    start_time = time.time()
    yield
    run_time = time.time() - start_time
    request.node.user_properties.append(("run_time", run_time))


def pytest_runtest_makereport(item: pytest.Item, call: pytest.CallInfo) -> None:
    """
    Pytest hook that is called when a test report is being generated.
    It is used to generate and finalize reports for each test.

    Args:
        item: The test item for which the report is being generated.
        call: The call object from which the test result is retrieved.
    """
    challenge: type[Challenge] = item.cls  # type: ignore
    challenge_data = challenge.data
    challenge_location = challenge.CHALLENGE_LOCATION

    if call.when == "call":
        answers = getattr(item, "answers", None)
        test_name = item.nodeid.split("::")[1]
        item.test_name = test_name

        generate_single_call_report(
            item, call, challenge_data, answers, challenge_location, test_name
        )

    if call.when == "teardown":
        finalize_reports(agbenchmark_config, item, challenge_data)


def timeout_monitor(start_time: int) -> None:
    """
    Function that limits the total execution time of the test suite.
    This function is supposed to be run in a separate thread and calls `pytest.exit`
    if the total execution time has exceeded the global timeout.

    Args:
        start_time (int): The start time of the test suite.
    """
    while time.time() - start_time < GLOBAL_TIMEOUT:
        time.sleep(1)  # check every second

    pytest.exit("Test suite exceeded the global timeout", returncode=1)


def pytest_sessionstart(session: pytest.Session) -> None:
    """
    Pytest hook that is called at the start of a test session.
    Sets up and runs a `timeout_monitor` in a separate thread.
    """
    start_time = time.time()
    t = threading.Thread(target=timeout_monitor, args=(start_time,))
    t.daemon = True  # Daemon threads are abruptly stopped at shutdown
    t.start()


def pytest_sessionfinish(session: pytest.Session) -> None:
    """
    Pytest hook that is called at the end of a test session.
    Finalizes and saves the test reports.
    """
    session_finish(agbenchmark_config, suite_reports)


@pytest.fixture
def scores(request: pytest.FixtureRequest) -> None:
    """
    Pytest fixture that retrieves the scores of the test class.
    The scores are retrieved from the `Challenge.scores` attribute
    using the test class name.

    Args:
        request: The request object.
    """
    challenge: type[Challenge] = request.node.cls
    return challenge.scores.get(challenge.__name__)


def pytest_collection_modifyitems(
    items: list[pytest.Item], config: pytest.Config
) -> None:
    """
    Pytest hook that is called after initial test collection has been performed.
    Modifies the collected test items based on the agent benchmark configuration,
    adding the dependency marker and category markers.

    Args:
        items: The collected test items to be modified.
        config: The active pytest configuration.
    """
    regression_file = agbenchmark_config.regression_tests_file
    regression_tests: dict[str, Any] = (
        json.loads(regression_file.read_bytes()) if regression_file.is_file() else {}
    )
    try:
        challenges_beaten_in_the_past = json.loads(
            agbenchmark_config.challenges_already_beaten_file.read_bytes()
        )
    except FileNotFoundError:
        challenges_beaten_in_the_past = {}

    selected_tests: tuple[str] = config.getoption("--test")  # type: ignore
    selected_categories: tuple[str] = config.getoption("--category")  # type: ignore

    # Can't use a for-loop to remove items in-place
    i = 0
    while i < len(items):
        item = items[i]
        challenge = item.cls
        challenge_name = item.cls.__name__

        if not issubclass(challenge, Challenge):
            item.warn(
                pytest.PytestCollectionWarning(
                    f"Non-challenge item collected: {challenge}"
                )
            )
            i += 1
            continue

        # --test: remove the test from the set if it's not specifically selected
        if selected_tests and challenge.data.name not in selected_tests:
            items.remove(item)
            continue

        # Filter challenges for --maintain, --improve, and --explore:
        # --maintain -> only challenges expected to be passed (= regression tests)
        # --improve -> only challenges that so far are not passed (reliably)
        # --explore -> only challenges that have never been passed
        is_regression_test = regression_tests.get(challenge.data.name, None)
        has_been_passed = challenges_beaten_in_the_past.get(challenge.data.name, False)
        if (
            (config.getoption("--maintain") and not is_regression_test)
            or (config.getoption("--improve") and is_regression_test)
            or (config.getoption("--explore") and has_been_passed)
        ):
            items.remove(item)
            continue

        dependencies = challenge.data.dependencies
        if (
            config.getoption("--test")
            or config.getoption("--no-dep")
            or config.getoption("--maintain")
        ):
            # Ignore dependencies:
            # --test -> user selected specific tests to run, don't care about deps
            # --no-dep -> ignore dependency relations regardless of test selection
            # --maintain -> all "regression" tests must pass, so run all of them
            dependencies = []
        elif config.getoption("--improve"):
            # Filter dependencies, keep only deps that are not "regression" tests
            dependencies = [
                d for d in dependencies if not regression_tests.get(d, None)
            ]

        # Set category markers
        challenge_categories = [c.value for c in challenge.data.category]
        for category in challenge_categories:
            item.add_marker(category)

        # Enforce category selection
        if selected_categories:
            if not set(challenge_categories).intersection(set(selected_categories)):
                items.remove(item)
                continue
            # # Filter dependencies, keep only deps from selected categories
            # dependencies = [
            #     d for d in dependencies
            #     if not set(d.categories).intersection(set(selected_categories))
            # ]

        # Add marker for the DependencyManager
        item.add_marker(pytest.mark.depends(on=dependencies, name=challenge_name))

        i += 1