Mirror of https://github.com/aljazceru/Auto-GPT.git (synced 2026-02-21 22:24:30 +01:00)
* refactor(benchmark): Deduplicate configuration loading logic
- Move the configuration loading logic to a separate `load_agbenchmark_config` function in the `agbenchmark/config.py` module.
- Replace the duplicate loading logic in `conftest.py`, `generate_test.py`, `ReportManager.py`, `reports.py`, and `__main__.py` with calls to the `load_agbenchmark_config` function.
* fix(benchmark): Fix type errors and linting errors, and clean up CLI validation in `__main__.py`
- Fixed type errors and linting errors in `__main__.py`
- Improved the readability of CLI argument validation by introducing a separate function for it
* refactor(benchmark): Lint and typefix app.py
- Rearranged and cleaned up import statements
- Fixed type errors caused by improper use of `psutil` objects
- Simplified a number of `os.path` usages by converting them to `pathlib` (see the sketch after this entry)
- Use `Task` and `TaskRequestBody` classes from `agent_protocol_client` instead of `.schema`
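For illustration, a minimal before/after of the kind of `os.path` → `pathlib` conversion described above; the path and variable names are hypothetical, not taken from app.py:

```python
import os
from pathlib import Path

# Before: os.path style (hypothetical example)
reports_dir = os.path.join(os.getcwd(), "agbenchmark_config", "reports")
if not os.path.exists(reports_dir):
    os.makedirs(reports_dir)

# After: the same logic with pathlib
reports_path = Path.cwd() / "agbenchmark_config" / "reports"
reports_path.mkdir(parents=True, exist_ok=True)
```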
* refactor(benchmark): Replace `agbenchmark.agent_protocol_client` with `agent-protocol-client`, clean up schema.py
- Remove `agbenchmark.agent_protocol_client` (an offline copy of `agent-protocol-client`).
- Add `agent-protocol-client` as a dependency and change imports to `agent_protocol_client`.
- Fix type annotation on `agent_api_interface.py::upload_artifacts` (`ApiClient` -> `AgentApi`).
- Remove all unused types from schema.py (= most of them).
* refactor(benchmark): Use pathlib in agent_interface.py and agent_api_interface.py
* refactor(benchmark): Improve typing, response validation, and readability in app.py
- Simplified response generation by leveraging type checking and conversion by FastAPI.
- Introduced use of `HTTPException` for error responses.
- Improved naming, formatting, and typing in `app.py::create_evaluation`.
- Updated the docstring on `app.py::create_agent_task`.
- Fixed return type annotations of `create_single_test` and `create_challenge` in generate_test.py.
- Added default values to optional attributes on models in report_types_v2.py.
- Removed unused imports in `generate_test.py`
* refactor(benchmark): Clean up logging and print statements
- Introduced use of the `logging` library for unified logging and better readability.
- Converted most print statements to use `logger.debug`, `logger.warning`, and `logger.error`.
- Improved descriptiveness of log statements.
- Removed unnecessary print statements.
- Added log statements to `except` blocks that previously failed silently or with little detail.
- Added `--debug` flag, which sets the log level to `DEBUG` and enables a more comprehensive log format.
- Added a `.utils.logging` module with a `configure_logging` function to easily configure the logging library (sketched after this entry).
- Converted raw escape sequences in `.utils.challenge` to use `colorama`.
- Renamed `generate_test.py::generate_tests` to `load_challenges`.
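A minimal sketch of what the `configure_logging` helper in `.utils.logging` could look like; the exact signature and format strings are assumptions:

```python
import logging

SIMPLE_LOG_FORMAT = "%(asctime)s %(levelname)s  %(message)s"
DEBUG_LOG_FORMAT = "%(asctime)s %(levelname)s %(filename)s:%(lineno)d  %(message)s"


def configure_logging(level: int = logging.INFO) -> None:
    """Configure the root logger; the --debug flag maps to level=logging.DEBUG."""
    logging.basicConfig(
        level=level,
        # a more comprehensive format when debugging, as described above
        format=DEBUG_LOG_FORMAT if level == logging.DEBUG else SIMPLE_LOG_FORMAT,
    )
```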
* refactor(benchmark): Remove unused server.py and agent_interface.py::run_agent
- Remove unused server.py file
- Remove unused run_agent function from agent_interface.py
* refactor(benchmark): Clean up conftest.py
- Fix and add type annotations
- Rewrite docstrings
- Disable or remove unused code
- Fix definition of arguments and their types in `pytest_addoption`
* refactor(benchmark): Clean up generate_test.py file
- Refactored the `create_single_test` function for clarity and readability
- Removed unused variables
- Made creation of `Challenge` subclasses more straightforward
- Made bare `except` more specific
- Renamed `Challenge.setup_challenge` method to `run_challenge`
- Updated type hints and annotations
- Made minor code/readability improvements in `load_challenges`
- Added a helper function `_add_challenge_to_module` for attaching a Challenge class to the current module
* fix(benchmark): Fix and add type annotations in execute_sub_process.py
* refactor(benchmark): Simplify const determination in agent_interface.py
- Simplify the logic that determines the value of `HELICONE_GRAPHQL_LOGS`
* fix(benchmark): Register category markers to prevent warnings
- Use the `pytest_configure` hook to register the known challenge categories as markers; otherwise, Pytest raises "unknown marker" warnings at runtime. (See the sketch below.)
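A sketch of the marker registration described above, assuming the challenge categories are available as a `Category` enum (the import path is an assumption):

```python
import pytest

from agbenchmark.utils.data_types import Category  # assumed location of the category enum


def pytest_configure(config: pytest.Config) -> None:
    """Register challenge categories as markers so Pytest doesn't warn about them."""
    for category in Category:
        config.addinivalue_line("markers", f"{category.value}: challenge category")
```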
* refactor(benchmark/challenges): Fix indentation in 4_revenue_retrieval_2/data.json
* refactor(benchmark): Update agent_api_interface.py
- Add type annotations to `copy_agent_artifacts_into_temp_folder` function
- Add note about broken endpoint in the `agent_protocol_client` library
- Remove unused variable in `run_api_agent` function
- Improve readability and resolve linting error
* feat(benchmark): Improve and centralize pathfinding
- Search the path hierarchy for an applicable `agbenchmark_config`, rather than assuming it is in the current folder (sketched after this entry).
- Create `agbenchmark.utils.path_manager` with `AGBenchmarkPathManager`, exporting a `PATH_MANAGER` constant.
- Replace the path constants defined in `__main__.py` with usages of `PATH_MANAGER`.
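A rough sketch of the config-folder lookup this introduces; the helper name is illustrative, and the real logic lives in `agbenchmark.utils.path_manager`:

```python
from pathlib import Path


def find_config_folder(for_dir: Path = Path.cwd()) -> Path:
    """Walk up the directory tree to find the nearest `agbenchmark_config` folder."""
    for candidate in [for_dir, *for_dir.parents]:
        config_folder = candidate / "agbenchmark_config"
        if config_folder.is_dir():
            return config_folder
    raise FileNotFoundError("No 'agbenchmark_config' folder found in the path hierarchy")
```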
* feat(benchmark/cli): Clean up and improve CLI
- Updated commands, options, and their descriptions to be more intuitive and consistent
- Moved slow imports into the entrypoints that use them to speed up application startup
- Fixed type hints to match output types of Click options
- Hid deprecated `agbenchmark start` command
- Refactored code to improve readability and maintainability
- Moved main entrypoint into `run` subcommand
- Fixed `version` and `serve` subcommands
- Added the `click-default-group` package to allow using `run` implicitly, for backwards compatibility (see the sketch after this entry)
- Renamed `--no_dep` to `--no-dep` for consistency
- Fixed string formatting issues in log statements
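A sketch of how `click-default-group` makes `run` the implicit subcommand; the option set shown here is reduced to `--no-dep` and is otherwise illustrative:

```python
import click
from click_default_group import DefaultGroup


@click.group(cls=DefaultGroup, default="run", default_if_no_args=True)
def cli() -> None:
    """AGBenchmark CLI: `agbenchmark` without a subcommand falls through to `run`."""


@cli.command()
@click.option("--no-dep", is_flag=True, help="Run without checking challenge dependencies.")
def run(no_dep: bool) -> None:
    """Run the benchmark."""
    ...
```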
* refactor(benchmark/config): Move AgentBenchmarkConfig and related functions to config.py
- Move the `AgentBenchmarkConfig` class from `utils/data_types.py` to `config.py`.
- Extract the `calculate_info_test_path` function from `utils/data_types.py` and move it to `config.py` as a private helper function `_calculate_info_test_path`.
- Move `load_agent_benchmark_config()` to `AgentBenchmarkConfig.load()`.
- Change simple getter methods on `AgentBenchmarkConfig` to calculated properties (see the sketch after this entry).
- Update all code references according to the changes mentioned above.
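A reduced sketch of the resulting shape of `AgentBenchmarkConfig`; the field name, config file name, and property shown here are illustrative rather than copied from config.py:

```python
import json
from pathlib import Path

from pydantic import BaseModel


class AgentBenchmarkConfig(BaseModel):
    agbenchmark_config_dir: Path  # illustrative field name

    @classmethod
    def load(cls, config_dir: Path) -> "AgentBenchmarkConfig":
        # Replaces the former module-level load_agent_benchmark_config()
        with (config_dir / "config.json").open() as f:
            return cls(agbenchmark_config_dir=config_dir, **json.load(f))

    @property
    def reports_folder(self) -> Path:
        # Former getter method, now a calculated property
        return self.agbenchmark_config_dir / "reports"
```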
* refactor(benchmark): Fix ReportManager init parameter types and use pathlib
- Fix the type annotation of the `benchmark_start_time` parameter in `ReportManager.__init__`, which was mistyped as `str` instead of `datetime`.
- Change the type of the `filename` parameter in `ReportManager.__init__` from `str` to `Path`.
- Rename `self.filename` to `self.report_file` in `ReportManager`.
- Change the way the report file is created, opened, and saved to use the `Path` object. (A sketch of the resulting signature follows.)
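A sketch of the corrected constructor signature; the initialization details beyond the parameter types are assumptions:

```python
from datetime import datetime
from pathlib import Path


class ReportManager:
    def __init__(self, report_file: Path, benchmark_start_time: datetime):
        # `benchmark_start_time` was previously annotated as `str`;
        # `report_file` (a Path) replaces the old string `filename` parameter.
        self.report_file = report_file
        self.benchmark_start_time = benchmark_start_time

        if not report_file.exists():
            report_file.parent.mkdir(parents=True, exist_ok=True)
            report_file.write_text("{}")
```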
* refactor(benchmark): Improve typing surrounding ChallengeData and clean up its implementation
- Use `ChallengeData` objects instead of untyped `dict` in app.py, generate_test.py, reports.py.
- Remove unnecessary methods `serialize`, `get_data`, `get_json_from_path`, `deserialize` from `ChallengeData` class.
- Remove unused methods `challenge_from_datum` and `challenge_from_test_data` from the `ChallengeData` class.
- Update function signatures and annotations of `create_challenge` and `generate_single_test` functions in generate_test.py.
- Add types to function signatures of `generate_single_call_report` and `finalize_reports` in reports.py.
- Remove unnecessary `challenge_data` parameter (in generate_test.py) and fixture (in conftest.py).
* refactor(benchmark): Clean up generate_test.py, conftest.py and __main__.py
- Cleaned up generate_test.py and conftest.py
- Consolidated challenge creation logic in the `Challenge` class itself, most notably the new `Challenge.from_challenge_spec` method (sketched after this entry).
- Moved challenge selection logic from generate_test.py to the `pytest_collection_modifyitems` hook in conftest.py.
- Converted methods in the `Challenge` class to class methods where appropriate.
- Improved argument handling in the `run_benchmark` function in `__main__.py`.
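A condensed sketch of what a `Challenge.from_challenge_spec` factory can look like; the attribute names and the use of `type()` to build the subclass are assumptions, not the exact implementation:

```python
from pathlib import Path

from agbenchmark.utils.data_types import ChallengeData


class Challenge:
    data: ChallengeData
    spec_file: Path

    @classmethod
    def from_challenge_spec(cls, spec_file: Path) -> type["Challenge"]:
        """Build a concrete Challenge subclass from a challenge spec file."""
        challenge_data = ChallengeData.parse_file(spec_file)
        return type(
            f"Test{challenge_data.name}",
            (cls,),
            {"data": challenge_data, "spec_file": spec_file},
        )
```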
* refactor(benchmark/config): Merge AGBenchmarkPathManager into AgentBenchmarkConfig and reduce fragmented/global state
- Merge the functionality of `AGBenchmarkPathManager` into `AgentBenchmarkConfig` to consolidate the configuration management.
- Remove the `.path_manager` module containing `AGBenchmarkPathManager`.
- Pass the `AgentBenchmarkConfig` and its attributes through function arguments to reduce global state and improve code clarity.
* feat(benchmark/serve): Configurable port for `serve` subcommand
- Added `--port` option to `serve` subcommand to allow for specifying the port to run the API on.
- If no `--port` option is provided, the port defaults to the value of the `PORT` environment variable, or to 8080 if that is not set either. (See the sketch below.)
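A sketch of the `--port` handling described above; how the command hands the port to the app is elided here, since only the option fallback chain is being illustrated:

```python
import click


@click.command()
@click.option(
    "--port",
    type=int,
    envvar="PORT",   # if --port is omitted, fall back to the PORT environment variable...
    default=8080,    # ...or to 8080 if that is not set either
    show_default=True,
    help="Port to run the API on.",
)
def serve(port: int) -> None:
    """Serve the benchmark frontend and API on the given port."""
    click.echo(f"Serving AGBenchmark API on port {port}")
    # ...build the FastAPI app (see setup_fastapi_app below) and pass `port` to uvicorn
```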
* feat(benchmark/cli): Add `config` subcommand
- Added a new subcommand `config` to the AGBenchmark CLI, to display information about the present AGBenchmark config.
* fix(benchmark): Gracefully handle incompatible challenge spec files in app.py
- Added a check to skip deprecated challenges
- Added logging to allow debugging of the loading process
- Added handling of validation errors when parsing challenge spec files
- Added missing `spec_file` attribute to `ChallengeData`
* refactor(benchmark): Move `run_benchmark` entrypoint to main.py, use it in `/reports` endpoint
- Move `run_benchmark` and `validate_args` from __main__.py to main.py
- Replace agbenchmark subprocess in `app.py:run_single_test` with `run_benchmark`
- Move `get_unique_categories` from __main__.py to challenges/__init__.py
- Move `OPTIONAL_CATEGORIES` from __main__.py to challenge.py
- Reduce operations on updates.json (including `initialize_updates_file`) outside of API
* refactor(benchmark): Remove unused `/updates` endpoint and all related code
- Remove `updates_json_file` attribute from `AgentBenchmarkConfig`
- Remove `get_updates` and `_initialize_updates_file` in app.py
- Remove `append_updates_file` and `create_update_json` functions in agent_api_interface.py
- Remove call to `append_updates_file` in challenge.py
* refactor(benchmark/config): Clean up and update docstrings on `AgentBenchmarkConfig`
- Add and update docstrings
- Change the base class from `BaseModel` to `BaseSettings`, allowing extra fields for backwards compatibility (sketched after this entry)
- Make naming of path attributes on `AgentBenchmarkConfig` more consistent
- Remove unused `agent_home_directory` attribute
- Remove unused `workspace` attribute
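A minimal sketch of the base-class change; the attribute shown is a placeholder for the renamed path attributes mentioned above:

```python
from pathlib import Path

from pydantic import BaseSettings, Extra


class AgentBenchmarkConfig(BaseSettings):
    """Config for a benchmark run, read from the agent's agbenchmark_config folder."""

    reports_folder: Path  # placeholder attribute

    class Config:
        extra = Extra.allow  # tolerate unknown keys for backwards compatibility
```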
* fix(benchmark): Restore mechanism to select (optional) categories in agent benchmark config
* fix(benchmark): Update agent-protocol-client to v1.1.0
- Fixes issue with fetching task artifact listings
333 lines · 12 KiB · Python
import datetime
import glob
import json
import logging
import sys
import time
import uuid
from collections import defaultdict, deque
from multiprocessing import Process
from pathlib import Path
from typing import Any, Optional

import httpx
import psutil
from agent_protocol_client import AgentApi, ApiClient, ApiException, Configuration
from agent_protocol_client.models import Task, TaskRequestBody
from fastapi import APIRouter, FastAPI, HTTPException, Request, Response
from fastapi.middleware.cors import CORSMiddleware
from pydantic import BaseModel, Extra, ValidationError

from agbenchmark.config import AgentBenchmarkConfig
from agbenchmark.reports.processing.report_types_v2 import (
    BenchmarkRun,
    Metrics,
    RepositoryInfo,
    RunDetails,
    TaskInfo,
)
from agbenchmark.schema import TaskEvalRequestBody
from agbenchmark.utils.data_types import ChallengeData
from agbenchmark.utils.utils import write_pretty_json

sys.path.append(str(Path(__file__).parent.parent))

logger = logging.getLogger(__name__)

CHALLENGES: dict[str, ChallengeData] = {}
challenges_path = Path(__file__).parent / "challenges"
challenge_spec_files = deque(
    glob.glob(
        f"{challenges_path}/**/data.json",
        recursive=True,
    )
)

logger.debug("Loading challenges...")
while challenge_spec_files:
    challenge_spec_file = Path(challenge_spec_files.popleft())
    challenge_relpath = challenge_spec_file.relative_to(challenges_path.parent)
    if challenge_relpath.is_relative_to("challenges/deprecated"):
        continue

    logger.debug(f"Loading {challenge_relpath}...")
    try:
        challenge_info = ChallengeData.parse_file(challenge_spec_file)
    except ValidationError as e:
        if logging.getLogger().level == logging.DEBUG:
            logger.warning(f"Spec file {challenge_relpath} failed to load:\n{e}")
            logger.debug(f"Invalid challenge spec: {challenge_spec_file.read_text()}")
        continue
    challenge_info.spec_file = challenge_spec_file

    if not challenge_info.eval_id:
        challenge_info.eval_id = str(uuid.uuid4())
        # this will sort all the keys of the JSON systematically
        # so that the order is always the same
        write_pretty_json(challenge_info.dict(), challenge_spec_file)

    CHALLENGES[challenge_info.eval_id] = challenge_info

task_informations = defaultdict(dict[str, Any])


def find_agbenchmark_without_uvicorn():
    pids = []
    for process in psutil.process_iter(
        attrs=[
            "pid",
            "cmdline",
            "name",
            "username",
            "status",
            "cpu_percent",
            "memory_info",
            "create_time",
            "cwd",
            "connections",
        ]
    ):
        try:
            # Convert the process.info dictionary values to strings and concatenate them
            full_info = " ".join([str(v) for k, v in process.as_dict().items()])

            if "agbenchmark" in full_info and "uvicorn" not in full_info:
                pids.append(process.pid)
        except (psutil.NoSuchProcess, psutil.AccessDenied, psutil.ZombieProcess):
            pass
    return pids


class CreateReportRequest(BaseModel):
    test: str = None
    test_run_id: str = None
    # category: Optional[str] = []
    mock: Optional[bool] = False

    class Config:
        extra = Extra.forbid  # this will forbid any extra fields


updates_list = []

origins = [
    "http://localhost:8000",
    "http://localhost:8080",
    "http://127.0.0.1:5000",
    "http://localhost:5000",
]


def stream_output(pipe):
    for line in pipe:
        print(line, end="")


def setup_fastapi_app(agbenchmark_config: AgentBenchmarkConfig) -> FastAPI:
    from agbenchmark.agent_api_interface import (
        copy_agent_artifacts_into_folder,
        upload_artifacts,
    )
    from agbenchmark.agent_interface import copy_artifacts_into_temp_folder
    from agbenchmark.generate_test import create_challenge_from_spec_file
    from agbenchmark.main import run_benchmark

    configuration = Configuration(
        host=agbenchmark_config.host or "http://localhost:8000"
    )
    app = FastAPI()
    app.add_middleware(
        CORSMiddleware,
        allow_origins=origins,
        allow_credentials=True,
        allow_methods=["*"],
        allow_headers=["*"],
    )
    router = APIRouter()

    @router.post("/reports")
    def run_single_test(body: CreateReportRequest) -> dict:
        pids = find_agbenchmark_without_uvicorn()
        logger.info(f"pids already running with agbenchmark: {pids}")

        logger.debug(f"Request to /reports: {body.dict()}")

        # Start the benchmark in a separate thread
        benchmark_process = Process(
            target=lambda: run_benchmark(
                config=agbenchmark_config,
                tests=(body.test,),
                mock=body.mock or False,
            )
        )
        benchmark_process.start()

        # Wait for the benchmark to finish, with a timeout of 200 seconds
        timeout = 200
        start_time = time.time()
        while benchmark_process.is_alive():
            if time.time() - start_time > timeout:
                logger.warning(f"Benchmark run timed out after {timeout} seconds")
                benchmark_process.terminate()
                break
            time.sleep(1)
        else:
            logger.debug(f"Benchmark finished running in {time.time() - start_time} s")

        # List all folders in the current working directory
        path_reports = agbenchmark_config.reports_folder
        folders = [folder for folder in path_reports.iterdir() if folder.is_dir()]

        # Sort the folders based on their names
        sorted_folders = sorted(folders, key=lambda x: x.name)

        # Get the last folder
        latest_folder = sorted_folders[-1] if sorted_folders else None

        # Read report.json from this folder
        if latest_folder:
            report_path = latest_folder / "report.json"
            logger.debug(f"Getting latest report from {report_path}")
            if report_path.exists():
                with report_path.open() as file:
                    data = json.load(file)
                logger.debug(f"Report data: {data}")
            else:
                logger.error(
                    "Could not get result after running benchmark: "
                    f"'report.json' does not exist in '{latest_folder}'"
                )
        else:
            logger.error(
                "Could not get result after running benchmark: no reports found"
            )

        return data

    @router.post("/agent/tasks", tags=["agent"])
    async def create_agent_task(task_eval_request: TaskEvalRequestBody) -> Task:
        """
        Creates a new task using the provided TaskEvalRequestBody and returns a Task.

        Args:
            task_eval_request: `TaskRequestBody` including an eval_id.

        Returns:
            Task: A new task with task_id, input, additional_input,
            and empty lists for artifacts and steps.

        Example:
            Request (TaskEvalRequestBody defined in schema.py):
                {
                    ...,
                    "eval_id": "50da533e-3904-4401-8a07-c49adf88b5eb"
                }

            Response (Task defined in `agent_protocol_client.models`):
                {
                    "task_id": "50da533e-3904-4401-8a07-c49adf88b5eb",
                    "input": "Write the word 'Washington' to a .txt file",
                    "artifacts": []
                }
        """
        try:
            async with ApiClient(configuration) as api_client:
                api_instance = AgentApi(api_client)
                task_input = CHALLENGES[task_eval_request.eval_id].task

                task_request_body = TaskRequestBody(input=task_input)
                task_response = await api_instance.create_agent_task(
                    task_request_body=task_request_body
                )
                task_informations[task_response.task_id][
                    "benchmark_start_time"
                ] = datetime.datetime.now(datetime.timezone.utc).strftime(
                    "%Y-%m-%dT%H:%M:%S+00:00"
                )
                task_informations[task_response.task_id][
                    "eval_id"
                ] = task_eval_request.eval_id
                await upload_artifacts(
                    api_instance,
                    str(CHALLENGES[task_eval_request.eval_id].spec_file.parent),
                    task_response.task_id,
                    "artifacts_in",
                )
                return task_response
        except ApiException as e:
            logger.error(f"Error whilst trying to create a task:\n{e}")
            logger.error(
                "The above error was caused while processing request: "
                f"{task_eval_request}"
            )
            raise HTTPException(500)

    @router.post("/agent/tasks/{task_id}/steps")
    async def proxy(request: Request, task_id: str):
        timeout = httpx.Timeout(300.0, read=300.0)  # 5 minutes
        async with httpx.AsyncClient(timeout=timeout) as client:
            # Construct the new URL
            new_url = f"{configuration.host}/ap/v1/agent/tasks/{task_id}/steps"

            # Forward the request
            response = await client.post(
                new_url,
                data=await request.body(),
                headers=dict(request.headers),
            )

            # Return the response from the forwarded request
            return Response(content=response.content, status_code=response.status_code)

    @router.post("/agent/tasks/{task_id}/evaluations")
    async def create_evaluation(task_id: str) -> BenchmarkRun:
        challenge_info = CHALLENGES[task_informations[task_id]["eval_id"]]
        workspace = agbenchmark_config.temp_folder
        try:
            async with ApiClient(configuration) as api_client:
                api_instance = AgentApi(api_client)
                await copy_agent_artifacts_into_folder(api_instance, task_id, workspace)

            artifact_path = challenge_info.spec_file.parent
            copy_artifacts_into_temp_folder(workspace, "custom_python", artifact_path)

            challenge = create_challenge_from_spec_file(challenge_info.spec_file)
            scores = challenge.get_scores(workspace)
            is_score_100 = 1 in scores["values"]

            eval_info = BenchmarkRun(
                repository_info=RepositoryInfo(),
                run_details=RunDetails(
                    command=f"agbenchmark --test={challenge_info.name}",
                    benchmark_start_time=(
                        task_informations[task_id]["benchmark_start_time"]
                    ),
                    test_name=challenge_info.name,
                ),
                task_info=TaskInfo(
                    data_path=str(
                        challenge_info.spec_file.relative_to(challenges_path.parent)
                    ),
                    is_regression=None,
                    category=[c.value for c in challenge_info.category],
                    task=challenge_info.task,
                    answer=challenge_info.ground.answer,
                    description=challenge_info.info.description,
                ),
                metrics=Metrics(
                    success=is_score_100,
                    attempted=True,
                ),
                config={},
            )

            logger.debug(f"Returning evaluation data:\n{eval_info.json(indent=4)}")
            return eval_info
        except ApiException as e:
            logger.error(f"Error {e} whilst trying to evaluate task: {task_id}")
            raise HTTPException(500)

    app.include_router(router, prefix="/ap/v1")

    return app