Mirror of https://github.com/aljazceru/Auto-GPT.git, synced 2025-12-17 22:14:28 +01:00
* refactor(benchmark): Deduplicate configuration loading logic
- Move the configuration loading logic into a separate `load_agbenchmark_config` function in the `agbenchmark/config.py` module.
- Replace the duplicated loading logic in `conftest.py`, `generate_test.py`, `ReportManager.py`, `reports.py`, and `__main__.py` with calls to the `load_agbenchmark_config` function.
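A minimal sketch of what such a shared loader could look like; the config file location and constructor fields are assumptions for illustration (a later commit in this log folds this logic into `AgentBenchmarkConfig.load()`):

```python
# Hypothetical sketch of agbenchmark/config.py::load_agbenchmark_config.
# The config file location and constructor fields are assumptions.
import json
from pathlib import Path
from typing import Optional

from agbenchmark.utils.data_types import AgentBenchmarkConfig  # pre-refactor location, per a later commit


def load_agbenchmark_config(base_dir: Optional[Path] = None) -> AgentBenchmarkConfig:
    """Load the benchmark config from the agbenchmark_config directory."""
    config_file = (base_dir or Path.cwd()) / "agbenchmark_config" / "config.json"
    return AgentBenchmarkConfig(**json.loads(config_file.read_text()))
```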
* fix(benchmark): Fix type errors, linting errors, and clean up CLI validation in __main__.py
- Fixed type errors and linting errors in `__main__.py`
- Improved the readability of CLI argument validation by introducing a separate function for it
* refactor(benchmark): Lint and typefix app.py
- Rearranged and cleaned up import statements
- Fixed type errors caused by improper use of `psutil` objects
- Simplified a number of `os.path` usages by converting to `pathlib`
- Use `Task` and `TaskRequestBody` classes from `agent_protocol_client` instead of `.schema`
* refactor(benchmark): Replace `.agent_protocol_client` with `agent-protocol-client`, clean up schema.py
- Remove `agbenchmark.agent_protocol_client` (an offline copy of `agent-protocol-client`).
- Add `agent-protocol-client` as a dependency and change imports to `agent_protocol_client`.
- Fix type annotation on `agent_api_interface.py::upload_artifacts` (`ApiClient` -> `AgentApi`).
- Remove all unused types from schema.py (= most of them).
* refactor(benchmark): Use pathlib in agent_interface.py and agent_api_interface.py
* refactor(benchmark): Improve typing, response validation, and readability in app.py
- Simplified response generation by leveraging type checking and conversion by FastAPI.
- Introduced use of `HTTPException` for error responses.
- Improved naming, formatting, and typing in `app.py::create_evaluation`.
- Updated the docstring on `app.py::create_agent_task`.
- Fixed return type annotations of `create_single_test` and `create_challenge` in generate_test.py.
- Added default values to optional attributes on models in report_types_v2.py.
- Removed unused imports in `generate_test.py`
* refactor(benchmark): Clean up logging and print statements
- Introduced use of the `logging` library for unified logging and better readability.
- Converted most print statements to use `logger.debug`, `logger.warning`, and `logger.error`.
- Improved descriptiveness of log statements.
- Removed unnecessary print statements.
- Added log statements to unspecific and non-verbose `except` blocks.
- Added `--debug` flag, which sets the log level to `DEBUG` and enables a more comprehensive log format.
- Added `.utils.logging` module with `configure_logging` function to easily configure the logging library.
- Converted raw escape sequences in `.utils.challenge` to use `colorama`.
- Renamed `generate_test.py::generate_tests` to `load_challenges`.
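A sketch of what the `.utils.logging` helper described above might look like; the format strings are assumptions, while the `configure_logging` entry point and the DEBUG-level behaviour come from this commit:

```python
# Hypothetical sketch of agbenchmark/utils/logging.py::configure_logging.
# Format strings are assumptions; the --debug flag would pass logging.DEBUG here.
import logging

SIMPLE_LOG_FORMAT = "%(levelname)s  %(message)s"
DEBUG_LOG_FORMAT = "%(asctime)s %(levelname)s %(name)s:%(lineno)d  %(message)s"


def configure_logging(level: int = logging.INFO) -> None:
    """Configure the root logger; DEBUG level enables a more comprehensive log format."""
    logging.basicConfig(
        level=level,
        format=DEBUG_LOG_FORMAT if level == logging.DEBUG else SIMPLE_LOG_FORMAT,
    )
```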
* refactor(benchmark): Remove unused server.py and agent_interface.py::run_agent
- Remove unused server.py file
- Remove unused run_agent function from agent_interface.py
* refactor(benchmark): Clean up conftest.py
- Fix and add type annotations
- Rewrite docstrings
- Disable or remove unused code
- Fix definition of arguments and their types in `pytest_addoption`
* refactor(benchmark): Clean up generate_test.py file
- Refactored the `create_single_test` function for clarity and readability
- Removed unused variables
- Made creation of `Challenge` subclasses more straightforward
- Made bare `except` more specific
- Renamed `Challenge.setup_challenge` method to `run_challenge`
- Updated type hints and annotations
- Made minor code/readability improvements in `load_challenges`
- Added a helper function `_add_challenge_to_module` for attaching a Challenge class to the current module
* fix(benchmark): Fix and add type annotations in execute_sub_process.py
* refactor(benchmark): Simplify const determination in agent_interface.py
- Simplify the logic that determines the value of `HELICONE_GRAPHQL_LOGS`
* fix(benchmark): Register category markers to prevent warnings
- Use the `pytest_configure` hook to register the known challenge categories as markers. Otherwise, Pytest will raise "unknown marker" warnings at runtime.
* refactor(benchmark/challenges): Fix indentation in 4_revenue_retrieval_2/data.json
* refactor(benchmark): Update agent_api_interface.py
- Add type annotations to `copy_agent_artifacts_into_temp_folder` function
- Add note about broken endpoint in the `agent_protocol_client` library
- Remove unused variable in `run_api_agent` function
- Improve readability and resolve linting error
* feat(benchmark): Improve and centralize pathfinding
- Search path hierarchy for applicable `agbenchmark_config`, rather than assuming it's in the current folder.
- Create `agbenchmark.utils.path_manager` with `AGBenchmarkPathManager` and export a `PATH_MANAGER` constant.
- Replace path constants defined in __main__.py with usages of `PATH_MANAGER`.
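The hierarchy search could look roughly like this; the directory name `agbenchmark_config` comes from the commit message, while the helper name and error type are assumptions:

```python
# Hypothetical sketch of the upward search for an agbenchmark_config directory.
from pathlib import Path
from typing import Optional


def find_config_folder(start: Optional[Path] = None) -> Path:
    """Walk up from `start` (default: cwd) until an agbenchmark_config folder is found."""
    current = (start or Path.cwd()).resolve()
    for candidate_dir in [current, *current.parents]:
        candidate = candidate_dir / "agbenchmark_config"
        if candidate.is_dir():
            return candidate
    raise FileNotFoundError("No 'agbenchmark_config' directory found in the path hierarchy")
```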
* feat(benchmark/cli): Clean up and improve CLI
- Updated commands, options, and their descriptions to be more intuitive and consistent
- Moved slow imports into the entrypoints that use them to speed up application startup
- Fixed type hints to match output types of Click options
- Hid deprecated `agbenchmark start` command
- Refactored code to improve readability and maintainability
- Moved main entrypoint into `run` subcommand
- Fixed `version` and `serve` subcommands
- Added `click-default-group` package to allow using `run` implicitly (for backwards compatibility)
- Renamed `--no_dep` to `--no-dep` for consistency
- Fixed string formatting issues in log statements
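A sketch of how `click-default-group` makes `run` the implicit subcommand; the option names are taken from the bullets above, the command body is illustrative only:

```python
# Hypothetical sketch of the CLI structure using click-default-group.
import click
from click_default_group import DefaultGroup


@click.group(cls=DefaultGroup, default="run", default_if_no_args=True)
def cli() -> None:
    """AGBenchmark CLI; `agbenchmark ...` without a subcommand behaves like `agbenchmark run ...`."""


@cli.command()
@click.option("--no-dep", is_flag=True, help="Run without checking challenge dependencies.")
@click.option("--category", multiple=True, help="Only run challenges of the given category.")
def run(no_dep: bool, category: tuple) -> None:
    # Slow imports would be deferred to this entrypoint to speed up CLI startup (see bullet above)
    click.echo(f"running benchmark (no_dep={no_dep}, categories={list(category)})")  # placeholder body
```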
* refactor(benchmark/config): Move AgentBenchmarkConfig and related functions to config.py
- Move the `AgentBenchmarkConfig` class from `utils/data_types.py` to `config.py`.
- Extract the `calculate_info_test_path` function from `utils/data_types.py` and move it to `config.py` as a private helper function `_calculate_info_test_path`.
- Move `load_agent_benchmark_config()` to `AgentBenchmarkConfig.load()`.
- Changed simple getter methods on `AgentBenchmarkConfig` to calculated properties.
- Update all code references according to the changes mentioned above.
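A rough sketch of the resulting shape of `config.py`; the field and property names are assumptions, guided by attributes referenced elsewhere in this log (e.g. `regression_tests_file` in conftest.py below):

```python
# Hypothetical sketch of AgentBenchmarkConfig after the move to config.py.
import json
from pathlib import Path

from pydantic import BaseModel


class AgentBenchmarkConfig(BaseModel):
    agbenchmark_config_dir: Path  # assumed field name
    host: str = "http://localhost:8000"  # assumed field and default

    @classmethod
    def load(cls, config_dir: Path) -> "AgentBenchmarkConfig":
        config_file = config_dir / "config.json"
        return cls(agbenchmark_config_dir=config_dir, **json.loads(config_file.read_text()))

    @property
    def reports_folder(self) -> Path:
        # former getter method, now a calculated property
        return self.agbenchmark_config_dir / "reports"

    @property
    def regression_tests_file(self) -> Path:
        return self.reports_folder / "regression_tests.json"
```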
* refactor(benchmark): Fix ReportManager init parameter types and use pathlib
- Fix the type annotation of the `benchmark_start_time` parameter in `ReportManager.__init__`, which was mistyped as `str` instead of `datetime`.
- Change the type of the `filename` parameter in the `ReportManager.__init__` method from `str` to `Path`.
- Rename `self.filename` to `self.report_file` in `ReportManager`.
- Change the way the report file is created, opened and saved to use the `Path` object.
* refactor(benchmark): Improve typing surrounding ChallengeData and clean up its implementation
- Use `ChallengeData` objects instead of untyped `dict` in app.py, generate_test.py, reports.py.
- Remove unnecessary methods `serialize`, `get_data`, `get_json_from_path`, `deserialize` from `ChallengeData` class.
- Remove unused methods `challenge_from_datum` and `challenge_from_test_data` from the `ChallengeData` class.
- Update function signatures and annotations of `create_challenge` and `generate_single_test` functions in generate_test.py.
- Add types to function signatures of `generate_single_call_report` and `finalize_reports` in reports.py.
- Remove unnecessary `challenge_data` parameter (in generate_test.py) and fixture (in conftest.py).
* refactor(benchmark): Clean up generate_test.py, conftest.py and __main__.py
- Cleaned up generate_test.py and conftest.py
- Consolidated challenge creation logic in the `Challenge` class itself, most notably the new `Challenge.from_challenge_spec` method.
- Moved challenge selection logic from generate_test.py to the `pytest_collection_modifyitems` hook in conftest.py.
- Converted methods in the `Challenge` class to class methods where appropriate.
- Improved argument handling in the `run_benchmark` function in `__main__.py`.
* refactor(benchmark/config): Merge AGBenchmarkPathManager into AgentBenchmarkConfig and reduce fragmented/global state
- Merge the functionality of `AGBenchmarkPathManager` into `AgentBenchmarkConfig` to consolidate the configuration management.
- Remove the `.path_manager` module containing `AGBenchmarkPathManager`.
- Pass the `AgentBenchmarkConfig` and its attributes through function arguments to reduce global state and improve code clarity.
* feat(benchmark/serve): Configurable port for `serve` subcommand
- Added `--port` option to `serve` subcommand to allow for specifying the port to run the API on.
- If no `--port` option is provided, the port will default to the value specified in the `PORT` environment variable, or 8080 if not set.
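A sketch of the described port resolution; the `serve` body is a placeholder, while the precedence (option, then `PORT` environment variable, then 8080) is taken from the bullets above:

```python
# Hypothetical sketch of the `serve` subcommand's port handling.
import os
from typing import Optional

import click


@click.command()
@click.option("--port", type=int, help="Port to run the API on.")
def serve(port: Optional[int]) -> None:
    """Serve the benchmark API on the resolved port."""
    port = port or int(os.getenv("PORT", 8080))  # option > PORT env var > 8080
    click.echo(f"Serving AGBenchmark API on port {port}")  # placeholder for starting the server
```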
* feat(benchmark/cli): Add `config` subcommand
- Added a new subcommand `config` to the AGBenchmark CLI, to display information about the present AGBenchmark config.
* fix(benchmark): Gracefully handle incompatible challenge spec files in app.py
- Added a check to skip deprecated challenges
- Added logging to allow debugging of the loading process
- Added handling of validation errors when parsing challenge spec files
- Added missing `spec_file` attribute to `ChallengeData`
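The loading loop might handle incompatible spec files roughly like this; the function shape, the `deprecated` key, and the import locations are assumptions, and only the skip/log/validation-error behaviour comes from the bullets above:

```python
# Hypothetical sketch of tolerant challenge spec loading.
import json
import logging
from pathlib import Path

from pydantic import ValidationError

from agbenchmark.utils.data_types import ChallengeData  # assumed import location

logger = logging.getLogger(__name__)


def load_challenge_specs(spec_files: list) -> list:
    challenges = []
    for spec_file in spec_files:
        logger.debug(f"Loading challenge spec {spec_file}")
        spec = json.loads(Path(spec_file).read_text())
        if spec.get("deprecated"):  # assumed marker for deprecated challenges
            logger.debug(f"Skipping deprecated challenge {spec_file}")
            continue
        try:
            challenges.append(ChallengeData.parse_obj({**spec, "spec_file": spec_file}))
        except ValidationError as e:
            logger.warning(f"Spec file {spec_file} is incompatible and will be skipped: {e}")
    return challenges
```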
* refactor(benchmark): Move `run_benchmark` entrypoint to main.py, use it in `/reports` endpoint
- Move `run_benchmark` and `validate_args` from __main__.py to main.py
- Replace agbenchmark subprocess in `app.py::run_single_test` with `run_benchmark`
- Move `get_unique_categories` from __main__.py to challenges/__init__.py
- Move `OPTIONAL_CATEGORIES` from __main__.py to challenge.py
- Reduce operations on updates.json (including `initialize_updates_file`) outside of API
* refactor(benchmark): Remove unused `/updates` endpoint and all related code
- Remove `updates_json_file` attribute from `AgentBenchmarkConfig`
- Remove `get_updates` and `_initialize_updates_file` in app.py
- Remove `append_updates_file` and `create_update_json` functions in agent_api_interface.py
- Remove call to `append_updates_file` in challenge.py
* refactor(benchmark/config): Clean up and update docstrings on `AgentBenchmarkConfig`
- Add and update docstrings
- Change base class from `BaseModel` to `BaseSettings`, allow extras for backwards compatibility
- Make naming of path attributes on `AgentBenchmarkConfig` more consistent
- Remove unused `agent_home_directory` attribute
- Remove unused `workspace` attribute
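A minimal sketch of the described base-class change, assuming Pydantic v1-style `BaseSettings`; the field shown is illustrative:

```python
# Hypothetical sketch: AgentBenchmarkConfig as a BaseSettings subclass that tolerates extra keys.
from pathlib import Path

from pydantic import BaseSettings


class AgentBenchmarkConfig(BaseSettings):
    """Benchmark config; unknown keys in legacy config files are accepted for backwards compatibility."""

    agbenchmark_config_dir: Path  # assumed field name

    class Config:
        extra = "allow"
```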
* fix(benchmark): Restore mechanism to select (optional) categories in agent benchmark config
* fix(benchmark): Update agent-protocol-client to v1.1.0
- Fixes issue with fetching task artifact listings
345 lines | 13 KiB | Python
import contextlib
import json
import logging
import os
import shutil
import threading
import time
from pathlib import Path
from typing import Any, Generator

import pytest

from agbenchmark.config import AgentBenchmarkConfig
from agbenchmark.reports.reports import (
    finalize_reports,
    generate_single_call_report,
    session_finish,
)
from agbenchmark.utils.challenge import Challenge
from agbenchmark.utils.data_types import Category

GLOBAL_TIMEOUT = (
    1500  # The tests will stop after 25 minutes so we can send the reports.
)

agbenchmark_config = AgentBenchmarkConfig.load()
logger = logging.getLogger(__name__)

pytest_plugins = ["agbenchmark.utils.dependencies"]
collect_ignore = ["challenges"]
suite_reports: dict[str, list] = {}


@pytest.fixture(scope="module")
def config() -> AgentBenchmarkConfig:
    return agbenchmark_config


@pytest.fixture(autouse=True)
def temp_folder() -> Generator[Path, None, None]:
    """
    Pytest fixture that sets up and tears down the temporary folder for each test.
    It is automatically used in every test due to the 'autouse=True' parameter.
    """

    # create output directory if it doesn't exist
    if not os.path.exists(agbenchmark_config.temp_folder):
        os.makedirs(agbenchmark_config.temp_folder, exist_ok=True)

    yield agbenchmark_config.temp_folder
    # teardown after test function completes
    if not os.getenv("KEEP_TEMP_FOLDER_FILES"):
        for filename in os.listdir(agbenchmark_config.temp_folder):
            file_path = os.path.join(agbenchmark_config.temp_folder, filename)
            try:
                if os.path.isfile(file_path) or os.path.islink(file_path):
                    os.unlink(file_path)
                elif os.path.isdir(file_path):
                    shutil.rmtree(file_path)
            except Exception as e:
                logger.warning(f"Failed to delete {file_path}. Reason: {e}")


def pytest_addoption(parser: pytest.Parser) -> None:
    """
    Pytest hook that adds command-line options to the `pytest` command.
    The added options are specific to agbenchmark and control its behavior:
    * `--mock` is used to run the tests in mock mode.
    * `--host` is used to specify the host for the tests.
    * `--category` is used to run only tests of a specific category.
    * `--nc` is used to run the tests without caching.
    * `--cutoff` is used to specify a cutoff time for the tests.
    * `--improve` is used to run only the tests that are marked for improvement.
    * `--maintain` is used to run only the tests that are marked for maintenance.
    * `--explore` is used to run the tests in exploration mode.
    * `--test` is used to run a specific test.
    * `--no-dep` is used to run the tests without dependencies.
    * `--keep-answers` is used to keep the answers of the tests.

    Args:
        parser: The Pytest CLI parser to which the command-line options are added.
    """
    parser.addoption("--no-dep", action="store_true")
    parser.addoption("--mock", action="store_true")
    parser.addoption("--host", default=None)
    parser.addoption("--nc", action="store_true")
    parser.addoption("--cutoff", action="store")
    parser.addoption("--category", action="append")
    parser.addoption("--test", action="append")
    parser.addoption("--improve", action="store_true")
    parser.addoption("--maintain", action="store_true")
    parser.addoption("--explore", action="store_true")
    parser.addoption("--keep-answers", action="store_true")


def pytest_configure(config: pytest.Config) -> None:
    # Register category markers to prevent "unknown marker" warnings
    for category in Category:
        config.addinivalue_line("markers", f"{category.value}: {category}")


@pytest.fixture(autouse=True)
def check_regression(request: pytest.FixtureRequest) -> None:
    """
    Fixture that checks for every test if it should be treated as a regression test,
    and whether to skip it based on that.

    The test name is retrieved from the `request` object. Regression reports are loaded
    from the path specified in the benchmark configuration.

    Effect:
    * If the `--improve` option is used and the current test is considered a regression
      test, it is skipped.
    * If the `--maintain` option is used and the current test is not considered a
      regression test, it is also skipped.

    Args:
        request: The request object from which the test name and the benchmark
            configuration are retrieved.
    """
    test_name = request.node.parent.name
    with contextlib.suppress(FileNotFoundError):
        regression_report = agbenchmark_config.regression_tests_file
        data = json.loads(regression_report.read_bytes())
        challenge_location = getattr(request.node.parent.cls, "CHALLENGE_LOCATION", "")

        skip_string = f"Skipping {test_name} at {challenge_location}"

        # Check if the test name exists in the regression tests
        if request.config.getoption("--improve") and data.get(test_name, None):
            pytest.skip(f"{skip_string} because it's a regression test")
        elif request.config.getoption("--maintain") and not data.get(test_name, None):
            pytest.skip(f"{skip_string} because it's not a regression test")


@pytest.fixture(autouse=True, scope="session")
def mock(request: pytest.FixtureRequest) -> bool:
    """
    Pytest fixture that retrieves the value of the `--mock` command-line option.
    The `--mock` option is used to run the tests in mock mode.

    Args:
        request: The `pytest.FixtureRequest` from which the `--mock` option value
            is retrieved.

    Returns:
        bool: Whether `--mock` is set for this session.
    """
    return request.config.getoption("--mock")


@pytest.fixture(autouse=True, scope="function")
def timer(request: pytest.FixtureRequest) -> Generator[None, None, None]:
    """
    Pytest fixture that times the execution of each test.
    At the start of each test, it records the current time.
    After the test function completes, it calculates the run time and adds it to
    the test node's `user_properties`.

    Args:
        request: The `pytest.FixtureRequest` object through which the run time is
            stored in the test node's `user_properties`.
    """
    start_time = time.time()
    yield
    run_time = time.time() - start_time
    request.node.user_properties.append(("run_time", run_time))


def pytest_runtest_makereport(item: pytest.Item, call: pytest.CallInfo) -> None:
    """
    Pytest hook that is called when a test report is being generated.
    It is used to generate and finalize reports for each test.

    Args:
        item: The test item for which the report is being generated.
        call: The call object from which the test result is retrieved.
    """
    challenge: type[Challenge] = item.cls  # type: ignore
    challenge_data = challenge.data
    challenge_location = challenge.CHALLENGE_LOCATION

    if call.when == "call":
        answers = getattr(item, "answers", None)
        test_name = item.nodeid.split("::")[1]
        item.test_name = test_name

        generate_single_call_report(
            item, call, challenge_data, answers, challenge_location, test_name
        )

    if call.when == "teardown":
        finalize_reports(agbenchmark_config, item, challenge_data)


def timeout_monitor(start_time: int) -> None:
    """
    Function that limits the total execution time of the test suite.
    This function is supposed to be run in a separate thread and calls `pytest.exit`
    if the total execution time has exceeded the global timeout.

    Args:
        start_time (int): The start time of the test suite.
    """
    while time.time() - start_time < GLOBAL_TIMEOUT:
        time.sleep(1)  # check every second

    pytest.exit("Test suite exceeded the global timeout", returncode=1)


def pytest_sessionstart(session: pytest.Session) -> None:
    """
    Pytest hook that is called at the start of a test session.

    Sets up and runs a `timeout_monitor` in a separate thread.
    """
    start_time = time.time()
    t = threading.Thread(target=timeout_monitor, args=(start_time,))
    t.daemon = True  # Daemon threads are abruptly stopped at shutdown
    t.start()


def pytest_sessionfinish(session: pytest.Session) -> None:
    """
    Pytest hook that is called at the end of a test session.

    Finalizes and saves the test reports.
    """
    session_finish(agbenchmark_config, suite_reports)


@pytest.fixture
def scores(request: pytest.FixtureRequest) -> None:
    """
    Pytest fixture that retrieves the scores of the test class.
    The scores are retrieved from the `Challenge.scores` attribute
    using the test class name.

    Args:
        request: The request object.
    """
    challenge: type[Challenge] = request.node.cls
    return challenge.scores.get(challenge.__name__)


def pytest_collection_modifyitems(
    items: list[pytest.Item], config: pytest.Config
) -> None:
    """
    Pytest hook that is called after initial test collection has been performed.
    Modifies the collected test items based on the agent benchmark configuration,
    adding the dependency marker and category markers.

    Args:
        items: The collected test items to be modified.
        config: The active pytest configuration.
    """
    regression_file = agbenchmark_config.regression_tests_file
    regression_tests: dict[str, Any] = (
        json.loads(regression_file.read_bytes()) if regression_file.is_file() else {}
    )

    try:
        challenges_beaten_in_the_past = json.loads(
            agbenchmark_config.challenges_already_beaten_file.read_bytes()
        )
    except FileNotFoundError:
        challenges_beaten_in_the_past = {}

    selected_tests: tuple[str] = config.getoption("--test")  # type: ignore
    selected_categories: tuple[str] = config.getoption("--category")  # type: ignore

    # Can't use a for-loop to remove items in-place
    i = 0
    while i < len(items):
        item = items[i]
        challenge = item.cls
        challenge_name = item.cls.__name__

        if not issubclass(challenge, Challenge):
            item.warn(
                pytest.PytestCollectionWarning(
                    f"Non-challenge item collected: {challenge}"
                )
            )
            i += 1
            continue

        # --test: remove the test from the set if it's not specifically selected
        if selected_tests and challenge.data.name not in selected_tests:
            items.remove(item)
            continue

        # Filter challenges for --maintain, --improve, and --explore:
        # --maintain -> only challenges expected to be passed (= regression tests)
        # --improve -> only challenges that so far are not passed (reliably)
        # --explore -> only challenges that have never been passed
        is_regression_test = regression_tests.get(challenge.data.name, None)
        has_been_passed = challenges_beaten_in_the_past.get(challenge.data.name, False)
        if (
            (config.getoption("--maintain") and not is_regression_test)
            or (config.getoption("--improve") and is_regression_test)
            or (config.getoption("--explore") and has_been_passed)
        ):
            items.remove(item)
            continue

        dependencies = challenge.data.dependencies
        if (
            config.getoption("--test")
            or config.getoption("--no-dep")
            or config.getoption("--maintain")
        ):
            # Ignore dependencies:
            # --test -> user selected specific tests to run, don't care about deps
            # --no-dep -> ignore dependency relations regardless of test selection
            # --maintain -> all "regression" tests must pass, so run all of them
            dependencies = []
        elif config.getoption("--improve"):
            # Filter dependencies, keep only deps that are not "regression" tests
            dependencies = [
                d for d in dependencies if not regression_tests.get(d, None)
            ]

        # Set category markers
        challenge_categories = [c.value for c in challenge.data.category]
        for category in challenge_categories:
            item.add_marker(category)

        # Enforce category selection
        if selected_categories:
            if not set(challenge_categories).intersection(set(selected_categories)):
                items.remove(item)
                continue
            # # Filter dependencies, keep only deps from selected categories
            # dependencies = [
            #     d for d in dependencies
            #     if not set(d.categories).intersection(set(selected_categories))
            # ]

        # Add marker for the DependencyManager
        item.add_marker(pytest.mark.depends(on=dependencies, name=challenge_name))

        i += 1