AGBenchmark codebase clean-up (#6650)

* refactor(benchmark): Deduplicate configuration loading logic

   - Move the configuration loading logic to a separate `load_agbenchmark_config` function in the `agbenchmark/config.py` module.
   - Replace the duplicated loading logic in `conftest.py`, `generate_test.py`, `ReportManager.py`, `reports.py`, and `__main__.py` with calls to `load_agbenchmark_config`.
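
As an illustration, a minimal sketch of what such a shared loader might look like (assuming the config lives at `agbenchmark_config/config.json`, as in the previous inline logic; this is not the verbatim implementation):

```python
# agbenchmark/config.py -- illustrative sketch of the deduplicated loader
import json
from pathlib import Path

from agbenchmark.utils.data_types import AgentBenchmarkConfig


def load_agbenchmark_config() -> AgentBenchmarkConfig:
    """Load the benchmark config from agbenchmark_config/config.json in the working directory."""
    config_path = Path.cwd() / "agbenchmark_config" / "config.json"
    with open(config_path) as f:
        return AgentBenchmarkConfig(**json.load(f))
```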

* fix(benchmark): Fix type errors, linting errors, and clean up CLI validation in __main__.py

   - Fixed type errors and linting errors in `__main__.py`
   - Improved the readability of CLI argument validation by introducing a separate function for it
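
A rough sketch of the extracted validation helper, based on the checks that were previously inlined (parameter names follow the CLI options; the exact signature is an assumption):

```python
class InvalidInvocationError(ValueError):
    pass


def validate_args(
    maintain: bool,
    improve: bool,
    explore: bool,
    tests: tuple[str, ...],
    categories: tuple[str, ...],
    skip_categories: tuple[str, ...],
    no_cutoff: bool,
    cutoff: int | None,
) -> None:
    """Raise InvalidInvocationError if the combination of CLI options is invalid."""
    if sum((maintain, improve, explore)) > 1:
        raise InvalidInvocationError(
            "--maintain, --improve and --explore are mutually exclusive."
        )
    if tests and (categories or skip_categories or maintain or improve or explore):
        raise InvalidInvocationError(
            "--test can not be combined with other selection options."
        )
    if no_cutoff and cutoff:
        raise InvalidInvocationError("--nc and --cutoff are mutually exclusive.")
```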

* refactor(benchmark): Lint and typefix app.py

   - Rearranged and cleaned up import statements
   - Fixed type errors caused by improper use of `psutil` objects
   - Simplified a number of `os.path` usages by converting to `pathlib`
   - Use `Task` and `TaskRequestBody` classes from `agent_protocol_client` instead of `.schema`

* refactor(benchmark): Replace `.agent_protocol_client` with `agent-protocol-client`, clean up schema.py

   - Remove `agbenchmark.agent_protocol_client` (an offline copy of `agent-protocol-client`).
      - Add `agent-protocol-client` as a dependency and change imports to `agent_protocol_client`.
   - Fix type annotation on `agent_api_interface.py::upload_artifacts` (`ApiClient` -> `AgentApi`).
   - Remove all unused types from schema.py (= most of them).
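
The corresponding import change, as reflected in the diff further below:

```python
# Before: imports from the vendored offline copy
# from agbenchmark.agent_protocol_client import AgentApi, ApiClient, Configuration, TaskRequestBody

# After: imports from the agent-protocol-client package
from agent_protocol_client import AgentApi, ApiClient, Configuration, TaskRequestBody
```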

* refactor(benchmark): Use pathlib in agent_interface.py and agent_api_interface.py

* refactor(benchmark): Improve typing, response validation, and readability in app.py

   - Simplified response generation by leveraging type checking and conversion by FastAPI.
   - Introduced use of `HTTPException` for error responses.
   - Improved naming, formatting, and typing in `app.py::create_evaluation`.
   - Updated the docstring on `app.py::create_agent_task`.
   - Fixed return type annotations of `create_single_test` and `create_challenge` in generate_test.py.
   - Added default values to optional attributes on models in report_types_v2.py.
   - Removed unused imports in `generate_test.py`
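
To illustrate the error-response pattern with `HTTPException` (a generic FastAPI sketch, not the actual `app.py` code; the route and the in-memory store are hypothetical):

```python
from fastapi import FastAPI, HTTPException

app = FastAPI()
TASKS: dict[str, dict] = {}  # hypothetical in-memory task store


@app.get("/ap/v1/agent/tasks/{task_id}")
def get_task(task_id: str) -> dict:
    task = TASKS.get(task_id)
    if not task:
        # FastAPI turns this into a clean 404 JSON error response
        raise HTTPException(status_code=404, detail=f"Task {task_id} not found")
    # FastAPI validates and serializes the return value based on the annotations
    return task
```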

* refactor(benchmark): Clean up logging and print statements

   - Introduced use of the `logging` library for unified logging and better readability.
   - Converted most print statements to use `logger.debug`, `logger.warning`, and `logger.error`.
   - Improved descriptiveness of log statements.
   - Removed unnecessary print statements.
   - Added log statements to `except` blocks that were overly broad or produced no output.
   - Added `--debug` flag, which sets the log level to `DEBUG` and enables a more comprehensive log format.
   - Added `.utils.logging` module with `configure_logging` function to easily configure the logging library.
   - Converted raw escape sequences in `.utils.challenge` to use `colorama`.
   - Renamed `generate_test.py::generate_tests` to `load_challenges`.
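
A minimal sketch of what `configure_logging` could look like (the format strings are assumptions; the real helper may differ):

```python
# agbenchmark/utils/logging.py -- illustrative sketch
import logging

SIMPLE_LOG_FORMAT = "%(asctime)s %(levelname)s  %(message)s"
DEBUG_LOG_FORMAT = "%(asctime)s %(levelname)s %(name)s:%(lineno)d  %(message)s"


def configure_logging(level: int = logging.INFO) -> None:
    """Configure the root logger; use a more comprehensive format for DEBUG."""
    logging.basicConfig(
        level=level,
        format=DEBUG_LOG_FORMAT if level == logging.DEBUG else SIMPLE_LOG_FORMAT,
    )
```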

* refactor(benchmark): Remove unused server.py and agent_interface.py::run_agent

   - Remove unused server.py file
   - Remove unused run_agent function from agent_interface.py

* refactor(benchmark): Clean up conftest.py

   - Fix and add type annotations
   - Rewrite docstrings
   - Disable or remove unused code
   - Fix definition of arguments and their types in `pytest_addoption`
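
For illustration, typed option definitions in `pytest_addoption` might look roughly like this (the option list mirrors the CLI flags above; defaults and types are assumptions):

```python
# conftest.py -- illustrative sketch
import pytest


def pytest_addoption(parser: pytest.Parser) -> None:
    parser.addoption("--mock", action="store_true")
    parser.addoption("--keep-answers", action="store_true")
    parser.addoption("--category", action="append", default=[])
    parser.addoption("--maintain", action="store_true")
    parser.addoption("--improve", action="store_true")
    parser.addoption("--explore", action="store_true")
    parser.addoption("--no-dep", action="store_true")
    parser.addoption("--nc", action="store_true")
    parser.addoption("--cutoff", action="store", type=int)
```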

* refactor(benchmark): Clean up generate_test.py file

   - Refactored the `create_single_test` function for clarity and readability
      - Removed unused variables
      - Made creation of `Challenge` subclasses more straightforward
      - Made bare `except` more specific
   - Renamed `Challenge.setup_challenge` method to `run_challenge`
   - Updated type hints and annotations
   - Made minor code/readability improvements in `load_challenges`
   - Added a helper function `_add_challenge_to_module` for attaching a Challenge class to the current module
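
A sketch of what the helper might do; the mechanism (registering the generated class as a module attribute so pytest can collect it) is the point, the exact name and signature are assumptions:

```python
import sys


def _add_challenge_to_module(challenge: type) -> None:
    """Attach a generated Challenge class to this module so pytest can collect it."""
    this_module = sys.modules[__name__]
    setattr(this_module, challenge.__name__, challenge)
```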

* fix(benchmark): Fix and add type annotations in execute_sub_process.py

* refactor(benchmark): Simplify const determination in agent_interface.py

   - Simplify the logic that determines the value of `HELICONE_GRAPHQL_LOGS`
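
The simplification, as it appears in the diff further below:

```python
import os

# Before
helicone_graphql_logs = os.getenv("HELICONE_GRAPHQL_LOGS")
HELICONE_GRAPHQL_LOGS = (
    helicone_graphql_logs.lower() == "true" if helicone_graphql_logs else False
)

# After
HELICONE_GRAPHQL_LOGS = os.getenv("HELICONE_GRAPHQL_LOGS", "").lower() == "true"
```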

* fix(benchmark): Register category markers to prevent warnings

   - Use the `pytest_configure` hook to register the known challenge categories as markers. Otherwise, Pytest will raise "unknown marker" warnings at runtime.
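
A minimal sketch of the hook (the category names here are placeholders; the real list comes from the challenge specs):

```python
# conftest.py -- illustrative sketch
import pytest

CHALLENGE_CATEGORIES = ["code", "retrieval", "data", "general"]  # placeholder list


def pytest_configure(config: pytest.Config) -> None:
    for category in CHALLENGE_CATEGORIES:
        config.addinivalue_line("markers", f"{category}: {category} challenges")
```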

* refactor(benchmark/challenges): Fix indentation in 4_revenue_retrieval_2/data.json

* refactor(benchmark): Update agent_api_interface.py

   - Add type annotations to `copy_agent_artifacts_into_temp_folder` function
   - Add note about broken endpoint in the `agent_protocol_client` library
   - Remove unused variable in `run_api_agent` function
   - Improve readability and resolve linting error

* feat(benchmark): Improve and centralize pathfinding

   - Search the path hierarchy for an applicable `agbenchmark_config`, rather than assuming it is in the current folder.
   - Create the `agbenchmark.utils.path_manager` module with `AGBenchmarkPathManager`, exporting a `PATH_MANAGER` constant.
   - Replace the path constants defined in `__main__.py` with usages of `PATH_MANAGER`.
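
A sketch of the hierarchy search, under the assumption that a folder qualifies when it contains `agbenchmark_config/config.json`:

```python
from pathlib import Path


def find_config_folder(for_dir: Path = Path.cwd()) -> Path:
    """Walk up the directory tree until an agbenchmark_config folder is found."""
    current = for_dir
    while True:
        config_dir = current / "agbenchmark_config"
        if (config_dir / "config.json").is_file():
            return config_dir
        if current == current.parent:  # reached the filesystem root
            raise FileNotFoundError("No 'agbenchmark_config' folder found in the path hierarchy")
        current = current.parent
```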

* feat(benchmark/cli): Clean up and improve CLI

   - Updated commands, options, and their descriptions to be more intuitive and consistent
   - Moved slow imports into the entrypoints that use them to speed up application startup
   - Fixed type hints to match output types of Click options
   - Hid deprecated `agbenchmark start` command
   - Refactored code to improve readability and maintainability
   - Moved main entrypoint into `run` subcommand
   - Fixed `version` and `serve` subcommands
   - Added `click-default-group` package to allow using `run` implicitly (for backwards compatibility)
   - Renamed `--no_dep` to `--no-dep` for consistency
   - Fixed string formatting issues in log statements
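
The `click-default-group` mechanism boils down to the following (matching the decorators visible in the diff further below):

```python
import click
from click_default_group import DefaultGroup


@click.group(cls=DefaultGroup, default_if_no_args=True)
def cli() -> None:
    pass


@cli.command(default=True)
def run() -> None:
    """Invoked by both `agbenchmark run` and plain `agbenchmark`."""
    click.echo("running benchmark...")
```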

* refactor(benchmark/config): Move AgentBenchmarkConfig and related functions to config.py

   - Move the `AgentBenchmarkConfig` class from `utils/data_types.py` to `config.py`.
   - Extract the `calculate_info_test_path` function from `utils/data_types.py` and move it to `config.py` as a private helper function `_calculate_info_test_path`.
   - Move `load_agent_benchmark_config()` to `AgentBenchmarkConfig.load()`.
   - Changed simple getter methods on `AgentBenchmarkConfig` to calculated properties.
   - Update all code references according to the changes mentioned above.

* refactor(benchmark): Fix ReportManager init parameter types and use pathlib

   - Fix the type annotation of the `benchmark_start_time` parameter in `ReportManager.__init__`; it was mistyped as `str` instead of `datetime`.
   - Change the type of the `filename` parameter in `ReportManager.__init__` from `str` to `Path`.
   - Rename `self.filename` to `self.report_file` in `ReportManager`.
   - Change the way the report file is created, opened, and saved to use the `Path` object.

* refactor(benchmark): Improve typing surrounding ChallengeData and clean up its implementation

   - Use `ChallengeData` objects instead of untyped `dict`s in app.py, generate_test.py, and reports.py.
   - Remove unnecessary methods `serialize`, `get_data`, `get_json_from_path`, `deserialize` from the `ChallengeData` class.
   - Remove unused methods `challenge_from_datum` and `challenge_from_test_data` from the `ChallengeData` class.
   - Update function signatures and annotations of `create_challenge` and `generate_single_test` functions in generate_test.py.
   - Add types to function signatures of `generate_single_call_report` and `finalize_reports` in reports.py.
   - Remove unnecessary `challenge_data` parameter (in generate_test.py) and fixture (in conftest.py).

* refactor(benchmark): Clean up generate_test.py, conftest.py and __main__.py

   - Cleaned up generate_test.py and conftest.py
      - Consolidated challenge creation logic in the `Challenge` class itself, most notably the new `Challenge.from_challenge_spec` method.
      - Moved challenge selection logic from generate_test.py to the `pytest_collection_modifyitems` hook in conftest.py.
   - Converted methods in the `Challenge` class to class methods where appropriate.
   - Improved argument handling in the `run_benchmark` function in `__main__.py`.
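
A simplified sketch of category-based selection in the `pytest_collection_modifyitems` hook (the real hook also handles the `--maintain`/`--improve`/`--explore` modes):

```python
# conftest.py -- illustrative sketch
import pytest


def pytest_collection_modifyitems(config: pytest.Config, items: list[pytest.Item]) -> None:
    selected_categories = set(config.getoption("--category") or [])
    if not selected_categories:
        return
    for item in list(items):
        item_categories = {marker.name for marker in item.iter_markers()}
        if not item_categories & selected_categories:
            items.remove(item)
```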

* refactor(benchmark/config): Merge AGBenchmarkPathManager into AgentBenchmarkConfig and reduce fragmented/global state

   - Merge the functionality of `AGBenchmarkPathManager` into `AgentBenchmarkConfig` to consolidate the configuration management.
   - Remove the `.path_manager` module containing `AGBenchmarkPathManager`.
   - Pass the `AgentBenchmarkConfig` and its attributes through function arguments to reduce global state and improve code clarity.

* feat(benchmark/serve): Configurable port for `serve` subcommand

   - Added `--port` option to `serve` subcommand to allow for specifying the port to run the API on.
   - If no `--port` option is provided, the port will default to the value specified in the `PORT` environment variable, or 8080 if not set.

* feat(benchmark/cli): Add `config` subcommand

   - Added a new `config` subcommand to the AGBenchmark CLI to display information about the current AGBenchmark config.

* fix(benchmark): Gracefully handle incompatible challenge spec files in app.py

   - Added a check to skip deprecated challenges
   - Added logging to allow debugging of the loading process
   - Added handling of validation errors when parsing challenge spec files
   - Added missing `spec_file` attribute to `ChallengeData`
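
A sketch of the tolerant loading pattern (assuming `ChallengeData` is a pydantic model; the paths, helper name, and deprecation check are illustrative):

```python
# app.py -- illustrative sketch
import logging
from pathlib import Path

from pydantic import ValidationError

from agbenchmark.utils.data_types import ChallengeData

logger = logging.getLogger(__name__)


def load_challenge_specs(challenges_dir: Path) -> list[ChallengeData]:
    challenges = []
    for spec_file in challenges_dir.rglob("data.json"):
        if "deprecated" in spec_file.parts:
            logger.debug(f"Skipping deprecated challenge: {spec_file}")
            continue
        logger.debug(f"Loading challenge spec: {spec_file}")
        try:
            challenges.append(ChallengeData.parse_file(spec_file))
        except ValidationError as e:
            logger.warning(f"Incompatible challenge spec {spec_file}: {e}")
    return challenges
```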

* refactor(benchmark): Move `run_benchmark` entrypoint to main.py, use it in `/reports` endpoint

   - Move `run_benchmark` and `validate_args` from `__main__.py` to `main.py`
   - Replace the agbenchmark subprocess in `app.py:run_single_test` with a call to `run_benchmark`
   - Move `get_unique_categories` from `__main__.py` to `challenges/__init__.py`
   - Move `OPTIONAL_CATEGORIES` from `__main__.py` to `challenge.py`
   - Reduce operations on updates.json (including `initialize_updates_file`) outside of the API

* refactor(benchmark): Remove unused `/updates` endpoint and all related code

   - Remove `updates_json_file` attribute from `AgentBenchmarkConfig`
   - Remove `get_updates` and `_initialize_updates_file` in app.py
   - Remove `append_updates_file` and `create_update_json` functions in agent_api_interface.py
   - Remove call to `append_updates_file` in challenge.py

* refactor(benchmark/config): Clean up and update docstrings on `AgentBenchmarkConfig`

   - Add and update docstrings
   - Change base class from `BaseModel` to `BaseSettings`, allow extras for backwards compatibility
   - Make naming of path attributes on `AgentBenchmarkConfig` more consistent
   - Remove unused `agent_home_directory` attribute
   - Remove unused `workspace` attribute
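
A minimal sketch of the base-class change (pydantic v1 style; field names are abbreviated and assumed):

```python
from pathlib import Path

from pydantic import BaseSettings


class AgentBenchmarkConfig(BaseSettings):
    agbenchmark_config_dir: Path
    host: str

    class Config:
        extra = "allow"  # tolerate unknown keys from older config.json files
```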

* fix(benchmark): Restore mechanism to select (optional) categories in agent benchmark config

* fix(benchmark): Update agent-protocol-client to v1.1.0

   - Fixes an issue with fetching task artifact listings
Authored by Reinier van der Leer on 2024-01-02 22:23:09 +01:00, committed by GitHub.
parent b8238c2228 · commit 25cc6ad6ae
47 changed files with 2122 additions and 7752 deletions


@@ -121,7 +121,7 @@ jobs:
./run agent start $AGENT_NAME
cd ../benchmark
poetry install
poetry run agbenchmark --no_dep
poetry run agbenchmark --no-dep
env:
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
SERP_API_KEY: ${{ secrets.SERP_API_KEY }}


@@ -1,5 +1,4 @@
import glob
import json
import logging
import os
import sys
from datetime import datetime, timezone
@@ -7,205 +6,97 @@ from pathlib import Path
from typing import Any, Optional
import click
import pytest
import toml
from click_default_group import DefaultGroup
from dotenv import load_dotenv
from helicone.lock import HeliconeLockManager
from agbenchmark.app import app
from agbenchmark.reports.ReportManager import SingletonReportManager
from agbenchmark.utils.data_types import AgentBenchmarkConfig
from agbenchmark.config import AgentBenchmarkConfig
from agbenchmark.utils.logging import configure_logging
load_dotenv()
try:
if os.getenv("HELICONE_API_KEY"):
import helicone # noqa
helicone_enabled = True
else:
helicone_enabled = False
except ImportError:
helicone_enabled = False
class InvalidInvocationError(ValueError):
pass
logger = logging.getLogger(__name__)
BENCHMARK_START_TIME_DT = datetime.now(timezone.utc)
BENCHMARK_START_TIME = BENCHMARK_START_TIME_DT.strftime("%Y-%m-%dT%H:%M:%S+00:00")
TEMP_FOLDER_ABS_PATH = Path.cwd() / "agbenchmark_config" / "temp_folder"
CHALLENGES_ALREADY_BEATEN = (
Path.cwd() / "agbenchmark_config" / "challenges_already_beaten.json"
)
UPDATES_JSON_PATH = Path.cwd() / "agbenchmark_config" / "updates.json"
if os.environ.get("HELICONE_API_KEY"):
if helicone_enabled:
from helicone.lock import HeliconeLockManager
HeliconeLockManager.write_custom_property(
"benchmark_start_time", BENCHMARK_START_TIME
)
with open(
Path(__file__).resolve().parent / "challenges" / "optional_categories.json"
) as f:
OPTIONAL_CATEGORIES = json.load(f)["optional_categories"]
@click.group(cls=DefaultGroup, default_if_no_args=True)
@click.option("--debug", is_flag=True, help="Enable debug output")
def cli(
debug: bool,
) -> Any:
configure_logging(logging.DEBUG if debug else logging.INFO)
def get_unique_categories() -> set[str]:
"""Find all data.json files in the directory relative to this file and its subdirectories,
read the "category" field from each file, and return a set of unique categories."""
categories = set()
# Get the directory of this file
this_dir = os.path.dirname(os.path.abspath(__file__))
glob_path = os.path.join(this_dir, "./challenges/**/data.json")
# Use it as the base for the glob pattern
for data_file in glob.glob(glob_path, recursive=True):
with open(data_file, "r") as f:
try:
data = json.load(f)
categories.update(data.get("category", []))
except json.JSONDecodeError:
print(f"Error: {data_file} is not a valid JSON file.")
continue
except IOError:
print(f"IOError: file could not be read: {data_file}")
continue
return categories
@cli.command(hidden=True)
def start():
raise DeprecationWarning(
"`agbenchmark start` is deprecated. Use `agbenchmark run` instead."
)
def run_benchmark(
maintain: bool = False,
improve: bool = False,
explore: bool = False,
mock: bool = False,
no_dep: bool = False,
nc: bool = False,
keep_answers: bool = False,
category: Optional[tuple[str]] = None,
skip_category: Optional[tuple[str]] = None,
test: Optional[str] = None,
cutoff: Optional[int] = None,
server: bool = False,
) -> int:
"""Start the benchmark tests. If a category flag is provided, run the categories with that mark."""
# Check if configuration file exists and is not empty
initialize_updates_file()
SingletonReportManager()
agent_benchmark_config_path = str(Path.cwd() / "agbenchmark_config" / "config.json")
try:
with open(agent_benchmark_config_path, "r") as f:
agent_benchmark_config = AgentBenchmarkConfig(**json.load(f))
agent_benchmark_config.agent_benchmark_config_path = (
agent_benchmark_config_path
)
except json.JSONDecodeError:
print("Error: benchmark_config.json is not a valid JSON file.")
return 1
if maintain and improve and explore:
print(
"Error: You can't use --maintain, --improve or --explore at the same time. Please choose one."
)
return 1
if test and (category or skip_category or maintain or improve or explore):
print(
"Error: If you're running a specific test make sure no other options are selected. Please just pass the --test."
)
return 1
assert agent_benchmark_config.host, "Error: host needs to be added to the config."
print("Current configuration:")
for key, value in vars(agent_benchmark_config).items():
print(f"{key}: {value}")
pytest_args = ["-vs"]
if keep_answers:
pytest_args.append("--keep-answers")
if test:
print("Running specific test:", test)
else:
# Categories that are used in the challenges
categories = get_unique_categories()
if category:
invalid_categories = set(category) - categories
assert (
not invalid_categories
), f"Invalid categories: {invalid_categories}. Valid categories are: {categories}"
if category:
categories_to_run = set(category)
if skip_category:
categories_to_run = categories_to_run.difference(set(skip_category))
assert categories_to_run, "Error: You can't skip all categories"
pytest_args.extend(["-m", " or ".join(categories_to_run), "--category"])
print("Running tests of category:", categories_to_run)
elif skip_category:
categories_to_run = categories - set(skip_category)
assert categories_to_run, "Error: You can't skip all categories"
pytest_args.extend(["-m", " or ".join(categories_to_run), "--category"])
print("Running tests of category:", categories_to_run)
else:
print("Running all categories")
if maintain:
print("Running only regression tests")
pytest_args.append("--maintain")
elif improve:
print("Running only non-regression tests")
pytest_args.append("--improve")
elif explore:
print("Only attempt challenges that have never been beaten")
pytest_args.append("--explore")
if mock:
pytest_args.append("--mock")
os.environ[
"IS_MOCK"
] = "True" # ugly hack to make the mock work when calling from API
if no_dep:
pytest_args.append("--no_dep")
if nc and cutoff:
print(
"Error: You can't use both --nc and --cutoff at the same time. Please choose one."
)
return 1
if nc:
pytest_args.append("--nc")
if cutoff:
pytest_args.append("--cutoff")
print(f"Setting cuttoff override to {cutoff} seconds.")
current_dir = Path(__file__).resolve().parent
print(f"Current directory: {current_dir}")
pytest_args.extend((str(current_dir), "--cache-clear"))
exit_code = pytest.main(pytest_args)
SingletonReportManager().clear_instance()
@click.group(invoke_without_command=True)
@click.option("--backend", is_flag=True, help="If it's being run from the cli")
@click.option("-c", "--category", multiple=True, help="Specific category to run")
@cli.command(default=True)
@click.option(
"-c",
"--category",
multiple=True,
help="(+) Select a category to run.",
)
@click.option(
"-s",
"--skip-category",
multiple=True,
help="Skips preventing the tests from this category from running",
help="(+) Exclude a category from running.",
)
@click.option("--test", multiple=True, help="Specific test to run")
@click.option("--maintain", is_flag=True, help="Runs only regression tests")
@click.option("--improve", is_flag=True, help="Run only non-regression tests")
@click.option("--test", multiple=True, help="(+) Select a test to run.")
@click.option("--maintain", is_flag=True, help="Run only regression tests.")
@click.option("--improve", is_flag=True, help="Run only non-regression tests.")
@click.option(
"--explore",
is_flag=True,
help="Only attempt challenges that have never been beaten",
help="Run only challenges that have never been beaten.",
)
@click.option("--mock", is_flag=True, help="Run with mock")
@click.option(
"--no_dep",
"--no-dep",
is_flag=True,
help="Run without dependencies",
help="Run all (selected) challenges, regardless of dependency success/failure.",
)
@click.option("--nc", is_flag=True, help="Run without cutoff")
@click.option("--cutoff", type=int, help="Override the challenge time limit (seconds).")
@click.option("--nc", is_flag=True, help="Disable the challenge time limit.")
@click.option("--mock", is_flag=True, help="Run with mock")
@click.option("--keep-answers", is_flag=True, help="Keep answers")
@click.option("--cutoff", help="Set or override tests cutoff (seconds)")
@click.argument("value", type=str, required=False)
def cli(
@click.option(
"--backend",
is_flag=True,
help="Write log output to a file instead of the terminal.",
)
# @click.argument(
# "agent_path", type=click.Path(exists=True, file_okay=False), required=False
# )
def run(
maintain: bool,
improve: bool,
explore: bool,
@@ -213,18 +104,37 @@ def cli(
no_dep: bool,
nc: bool,
keep_answers: bool,
category: Optional[list[str]] = None,
skip_category: Optional[list[str]] = None,
test: Optional[str] = None,
test: tuple[str],
category: tuple[str],
skip_category: tuple[str],
cutoff: Optional[int] = None,
backend: Optional[bool] = False,
value: Optional[str] = None,
) -> Any:
# Redirect stdout if backend is True
if value == "start":
raise ("`agbenchmark start` is removed. Run `agbenchmark` instead.")
if value == "serve":
return serve()
# agent_path: Optional[Path] = None,
) -> None:
"""
Run the benchmark on the agent in the current directory.
Options marked with (+) can be specified multiple times, to select multiple items.
"""
from agbenchmark.main import run_benchmark, validate_args
agbenchmark_config = AgentBenchmarkConfig.load()
logger.debug(f"agbenchmark_config: {agbenchmark_config.agbenchmark_config_dir}")
try:
validate_args(
maintain=maintain,
improve=improve,
explore=explore,
tests=test,
categories=category,
skip_categories=skip_category,
no_cutoff=nc,
cutoff=cutoff,
)
except InvalidInvocationError as e:
logger.error("Error: " + "\n".join(e.args))
sys.exit(1)
original_stdout = sys.stdout # Save the original standard output
exit_code = None
@@ -232,16 +142,17 @@ def cli(
with open("backend/backend_stdout.txt", "w") as f:
sys.stdout = f
exit_code = run_benchmark(
config=agbenchmark_config,
maintain=maintain,
improve=improve,
explore=explore,
mock=mock,
no_dep=no_dep,
nc=nc,
no_cutoff=nc,
keep_answers=keep_answers,
category=category,
skip_category=skip_category,
test=test,
tests=test,
categories=category,
skip_categories=skip_category,
cutoff=cutoff,
)
@@ -249,16 +160,17 @@ def cli(
else:
exit_code = run_benchmark(
config=agbenchmark_config,
maintain=maintain,
improve=improve,
explore=explore,
mock=mock,
no_dep=no_dep,
nc=nc,
no_cutoff=nc,
keep_answers=keep_answers,
category=category,
skip_category=skip_category,
test=test,
tests=test,
categories=category,
skip_categories=skip_category,
cutoff=cutoff,
)
@@ -266,33 +178,44 @@ def cli(
@cli.command()
def version():
"""Print the version of the benchmark tool."""
current_directory = Path(__file__).resolve().parent
version = toml.load(current_directory / ".." / "pyproject.toml")["tool"]["poetry"][
"version"
]
print(f"Benchmark Tool Version {version}")
def serve():
@click.option("--port", type=int, help="Port to run the API on.")
def serve(port: Optional[int] = None):
"""Serve the benchmark frontend and API on port 8080."""
import uvicorn
from agbenchmark.app import setup_fastapi_app
config = AgentBenchmarkConfig.load()
app = setup_fastapi_app(config)
# Run the FastAPI application using uvicorn
uvicorn.run(app, host="0.0.0.0", port=8080)
port = port or int(os.getenv("PORT", 8080))
uvicorn.run(app, host="0.0.0.0", port=port)
def initialize_updates_file():
if os.path.exists(UPDATES_JSON_PATH):
# If the file already exists, overwrite it with an empty list
with open(UPDATES_JSON_PATH, "w") as file:
json.dump([], file, indent=2)
print("Initialized updates.json by overwriting with an empty array")
else:
# If the file doesn't exist, create it and write an empty list
with open(UPDATES_JSON_PATH, "w") as file:
json.dump([], file, indent=2)
print("Created updates.json and initialized it with an empty array")
@cli.command()
def config():
"""Displays info regarding the present AGBenchmark config."""
try:
config = AgentBenchmarkConfig.load()
except FileNotFoundError as e:
click.echo(e, err=True)
return 1
k_col_width = max(len(k) for k in config.dict().keys())
for k, v in config.dict().items():
click.echo(f"{k: <{k_col_width}} = {v}")
@cli.command()
def version():
"""Print version info for the AGBenchmark application."""
import toml
package_root = Path(__file__).resolve().parent.parent
pyproject = toml.load(package_root / "pyproject.toml")
version = pyproject["tool"]["poetry"]["version"]
click.echo(f"AGBenchmark version {version}")
if __name__ == "__main__":


@@ -1,30 +1,25 @@
import json
import logging
import os
import pathlib
import time
from typing import Any, Dict, Optional
from pathlib import Path
from typing import Optional
from agent_protocol_client import AgentApi, ApiClient, Configuration, TaskRequestBody
from agbenchmark.__main__ import TEMP_FOLDER_ABS_PATH, UPDATES_JSON_PATH
from agbenchmark.agent_interface import get_list_of_file_paths
from agbenchmark.agent_protocol_client import (
AgentApi,
ApiClient,
Configuration,
TaskRequestBody,
)
from agbenchmark.agent_protocol_client.models.step import Step
from agbenchmark.config import AgentBenchmarkConfig
from agbenchmark.utils.data_types import ChallengeData
LOG = logging.getLogger(__name__)
async def run_api_agent(
task: ChallengeData, config: Dict[str, Any], artifacts_location: str, timeout: int
task: ChallengeData,
config: AgentBenchmarkConfig,
artifacts_location: str,
timeout: int,
) -> None:
host_value = None
configuration = Configuration(host=config["AgentBenchmarkConfig"].host + "/ap/v1")
configuration = Configuration(host=config.host)
async with ApiClient(configuration) as api_client:
api_instance = AgentApi(api_client)
task_request_body = TaskRequestBody(input=task.task)
@@ -45,7 +40,6 @@ async def run_api_agent(
# Read the existing JSON data from the file
step = await api_instance.execute_agent_task_step(task_id=task_id)
await append_updates_file(step)
print(f"[{task.name}] - step {step.name} ({i}. request)")
i += 1
@@ -54,34 +48,38 @@ async def run_api_agent(
raise TimeoutError("Time limit exceeded")
if not step or step.is_last:
steps_remaining = False
# if we're calling a mock agent, we "cheat" and give the correct artifacts to pass the tests
# In "mock" mode, we cheat by giving the correct artifacts to pass the challenge
if os.getenv("IS_MOCK"):
await upload_artifacts(
api_instance, artifacts_location, task_id, "artifacts_out"
)
await copy_agent_artifacts_into_temp_folder(api_instance, task_id)
await copy_agent_artifacts_into_folder(
api_instance, task_id, config.temp_folder
)
async def copy_agent_artifacts_into_temp_folder(api_instance, task_id):
async def copy_agent_artifacts_into_folder(
api_instance: AgentApi, task_id: str, folder: Path
):
artifacts = await api_instance.list_agent_task_artifacts(task_id=task_id)
for artifact in artifacts.artifacts:
# current absolute path of the directory of the file
directory_location = pathlib.Path(TEMP_FOLDER_ABS_PATH)
if artifact.relative_path:
path = (
path: str = (
artifact.relative_path
if not artifact.relative_path.startswith("/")
else artifact.relative_path[1:]
)
directory_location = pathlib.Path(
os.path.dirname(directory_location / path)
)
LOG.info(f"Creating directory {directory_location}")
folder = (folder / path).parent
directory_location.mkdir(parents=True, exist_ok=True)
if not folder.exists():
LOG.info(f"Creating directory {folder}")
folder.mkdir(parents=True)
file_path = directory_location / artifact.file_name
file_path = folder / artifact.file_name
LOG.info(f"Writing file {file_path}")
with open(file_path, "wb") as f:
content = await api_instance.download_agent_task_artifact(
@@ -91,35 +89,16 @@ async def copy_agent_artifacts_into_temp_folder(api_instance, task_id):
f.write(content)
async def append_updates_file(step: Step):
with open(UPDATES_JSON_PATH, "r") as file:
existing_data = json.load(file)
# Append the new update to the existing array
new_update = create_update_json(step)
existing_data.append(new_update)
# Write the updated array back to the file
with open(UPDATES_JSON_PATH, "w") as file:
file.write(json.dumps(existing_data, indent=2))
async def upload_artifacts(
api_instance: ApiClient, artifacts_location: str, task_id: str, type: str
api_instance: AgentApi, artifacts_location: str, task_id: str, type: str
) -> None:
for file_path in get_list_of_file_paths(artifacts_location, type):
relative_path: Optional[str] = "/".join(
file_path.split(f"{type}/", 1)[-1].split("/")[:-1]
str(file_path).split(f"{type}/", 1)[-1].split("/")[:-1]
)
if not relative_path:
relative_path = None
await api_instance.upload_agent_task_artifacts(
task_id=task_id, file=file_path, relative_path=relative_path
task_id=task_id, file=str(file_path), relative_path=relative_path
)
def create_update_json(step: Step):
now = int(time.time())
content = {"content": step.to_dict(), "timestamp": now}
return content


@@ -1,45 +1,27 @@
import os
import shutil
import sys
from typing import List
from pathlib import Path
from dotenv import load_dotenv
from agbenchmark.execute_sub_process import execute_subprocess
load_dotenv()
helicone_graphql_logs = os.getenv("HELICONE_GRAPHQL_LOGS")
HELICONE_GRAPHQL_LOGS = (
helicone_graphql_logs.lower() == "true" if helicone_graphql_logs else False
)
def run_agent(task: str, timeout: int) -> None:
print(f"Running agbenchmark/benchmarks.py with timeout {timeout}")
command = [sys.executable, "-m", "agbenchmark_config.benchmarks", str(task)]
execute_subprocess(command, timeout)
HELICONE_GRAPHQL_LOGS = os.getenv("HELICONE_GRAPHQL_LOGS", "").lower() == "true"
def get_list_of_file_paths(
challenge_dir_path: str, artifact_folder_name: str
) -> List[str]:
# this file is at agbenchmark\agent_interface.py
source_dir = os.path.join(
challenge_dir_path,
artifact_folder_name,
)
if not os.path.exists(source_dir):
challenge_dir_path: str | Path, artifact_folder_name: str
) -> list[Path]:
source_dir = Path(challenge_dir_path) / artifact_folder_name
if not source_dir.exists():
return []
return [os.path.join(source_dir, file_name) for file_name in os.listdir(source_dir)]
return list(source_dir.iterdir())
def copy_artifacts_into_temp_folder(
workspace: str | dict[str, str], artifact_folder_name: str, challenge_dir_path: str
workspace: str | Path, artifact_folder_name: str, challenge_dir_path: str | Path
) -> None:
file_paths = get_list_of_file_paths(challenge_dir_path, artifact_folder_name)
for file_path in file_paths:
if os.path.isfile(file_path):
if file_path.is_file():
shutil.copy(file_path, workspace)


@@ -1,42 +0,0 @@
# coding: utf-8
# flake8: noqa
"""
Agent Communication Protocol
Specification of the API protocol for communication with an agent. # noqa: E501
The version of the OpenAPI document: v0.2
Generated by OpenAPI Generator (https://openapi-generator.tech)
Do not edit the class manually.
"""
__version__ = "1.0.0"
# import apis into sdk package
from agbenchmark.agent_protocol_client.api.agent_api import AgentApi
from agbenchmark.agent_protocol_client.api_client import ApiClient
# import ApiClient
from agbenchmark.agent_protocol_client.api_response import ApiResponse
from agbenchmark.agent_protocol_client.configuration import Configuration
from agbenchmark.agent_protocol_client.exceptions import (
ApiAttributeError,
ApiException,
ApiKeyError,
ApiTypeError,
ApiValueError,
OpenApiException,
)
# import models into sdk package
from agbenchmark.agent_protocol_client.models.artifact import Artifact
from agbenchmark.agent_protocol_client.models.step import Step
from agbenchmark.agent_protocol_client.models.step_all_of import StepAllOf
from agbenchmark.agent_protocol_client.models.step_request_body import StepRequestBody
from agbenchmark.agent_protocol_client.models.task import Task
from agbenchmark.agent_protocol_client.models.task_all_of import TaskAllOf
from agbenchmark.agent_protocol_client.models.task_request_body import TaskRequestBody


@@ -1,4 +0,0 @@
# flake8: noqa
# import apis into api package
from agbenchmark.agent_protocol_client.api.agent_api import AgentApi

File diff suppressed because it is too large.


@@ -1,838 +0,0 @@
# coding: utf-8
"""
Agent Communication Protocol
Specification of the API protocol for communication with an agent. # noqa: E501
The version of the OpenAPI document: v0.2
Generated by OpenAPI Generator (https://openapi-generator.tech)
Do not edit the class manually.
"""
import atexit
import datetime
import json
import mimetypes
import os
import re
import tempfile
from multiprocessing.pool import ThreadPool
from urllib.parse import quote
from dateutil.parser import parse
import agbenchmark.agent_protocol_client.models
from agbenchmark.agent_protocol_client import rest
from agbenchmark.agent_protocol_client.api_response import ApiResponse
from agbenchmark.agent_protocol_client.configuration import Configuration
from agbenchmark.agent_protocol_client.exceptions import ApiException, ApiValueError
class ApiClient(object):
"""Generic API client for OpenAPI client library builds.
OpenAPI generic API client. This client handles the client-
server communication, and is invariant across implementations. Specifics of
the methods and models for each application are generated from the OpenAPI
templates.
:param configuration: .Configuration object for this client
:param header_name: a header to pass when making calls to the API.
:param header_value: a header value to pass when making calls to
the API.
:param cookie: a cookie to include in the header when making calls
to the API
:param pool_threads: The number of threads to use for async requests
to the API. More threads means more concurrent API requests.
"""
PRIMITIVE_TYPES = (float, bool, bytes, str, int)
NATIVE_TYPES_MAPPING = {
"int": int,
"long": int, # TODO remove as only py3 is supported?
"float": float,
"str": str,
"bool": bool,
"date": datetime.date,
"datetime": datetime.datetime,
"object": object,
}
_pool = None
def __init__(
self,
configuration=None,
header_name=None,
header_value=None,
cookie=None,
pool_threads=1,
):
# use default configuration if none is provided
if configuration is None:
configuration = Configuration.get_default()
self.configuration = configuration
self.pool_threads = pool_threads
self.rest_client = rest.RESTClientObject(configuration)
self.default_headers = {}
if header_name is not None:
self.default_headers[header_name] = header_value
self.cookie = cookie
# Set default User-Agent.
self.user_agent = "OpenAPI-Generator/1.0.0/python"
self.client_side_validation = configuration.client_side_validation
async def __aenter__(self):
return self
async def __aexit__(self, exc_type, exc_value, traceback):
await self.close()
async def close(self):
await self.rest_client.close()
if self._pool:
self._pool.close()
self._pool.join()
self._pool = None
if hasattr(atexit, "unregister"):
atexit.unregister(self.close)
@property
def pool(self):
"""Create thread pool on first request
avoids instantiating unused threadpool for blocking clients.
"""
if self._pool is None:
atexit.register(self.close)
self._pool = ThreadPool(self.pool_threads)
return self._pool
@property
def user_agent(self):
"""User agent for this API client"""
return self.default_headers["User-Agent"]
@user_agent.setter
def user_agent(self, value):
self.default_headers["User-Agent"] = value
def set_default_header(self, header_name, header_value):
self.default_headers[header_name] = header_value
_default = None
@classmethod
def get_default(cls):
"""Return new instance of ApiClient.
This method returns newly created, based on default constructor,
object of ApiClient class or returns a copy of default
ApiClient.
:return: The ApiClient object.
"""
if cls._default is None:
cls._default = ApiClient()
return cls._default
@classmethod
def set_default(cls, default):
"""Set default instance of ApiClient.
It stores default ApiClient.
:param default: object of ApiClient.
"""
cls._default = default
async def __call_api(
self,
resource_path,
method,
path_params=None,
query_params=None,
header_params=None,
body=None,
post_params=None,
files=None,
response_types_map=None,
auth_settings=None,
_return_http_data_only=None,
collection_formats=None,
_preload_content=True,
_request_timeout=None,
_host=None,
_request_auth=None,
):
config = self.configuration
# header parameters
header_params = header_params or {}
header_params.update(self.default_headers)
if self.cookie:
header_params["Cookie"] = self.cookie
if header_params:
header_params = self.sanitize_for_serialization(header_params)
header_params = dict(
self.parameters_to_tuples(header_params, collection_formats)
)
# path parameters
if path_params:
path_params = self.sanitize_for_serialization(path_params)
path_params = self.parameters_to_tuples(path_params, collection_formats)
for k, v in path_params:
# specified safe chars, encode everything
resource_path = resource_path.replace(
"{%s}" % k, quote(str(v), safe=config.safe_chars_for_path_param)
)
# post parameters
if post_params or files:
post_params = post_params if post_params else []
post_params = self.sanitize_for_serialization(post_params)
post_params = self.parameters_to_tuples(post_params, collection_formats)
post_params.extend(self.files_parameters(files))
# auth setting
self.update_params_for_auth(
header_params,
query_params,
auth_settings,
resource_path,
method,
body,
request_auth=_request_auth,
)
# body
if body:
body = self.sanitize_for_serialization(body)
# request url
if _host is None:
url = self.configuration.host + resource_path
else:
# use server/host defined in path or operation instead
url = _host + resource_path
# query parameters
if query_params:
query_params = self.sanitize_for_serialization(query_params)
url_query = self.parameters_to_url_query(query_params, collection_formats)
url += "?" + url_query
try:
# perform request and return response
response_data = await self.request(
method,
url,
query_params=query_params,
headers=header_params,
post_params=post_params,
body=body,
_preload_content=_preload_content,
_request_timeout=_request_timeout,
)
except ApiException as e:
if e.body:
e.body = e.body.decode("utf-8")
raise e
self.last_response = response_data
return_data = None # assuming derialization is not needed
# data needs deserialization or returns HTTP data (deserialized) only
if _preload_content or _return_http_data_only:
response_type = response_types_map.get(str(response_data.status), None)
if response_type == "bytearray":
response_data.data = response_data.data
else:
match = None
content_type = response_data.getheader("content-type")
if content_type is not None:
match = re.search(r"charset=([a-zA-Z\-\d]+)[\s;]?", content_type)
encoding = match.group(1) if match else "utf-8"
response_data.data = response_data.data.decode(encoding)
# deserialize response data
if response_type == "bytearray":
return_data = response_data.data
elif response_type:
return_data = self.deserialize(response_data, response_type)
else:
return_data = None
if _return_http_data_only:
return return_data
else:
return ApiResponse(
status_code=response_data.status,
data=return_data,
headers=response_data.getheaders(),
raw_data=response_data.data,
)
def sanitize_for_serialization(self, obj):
"""Builds a JSON POST object.
If obj is None, return None.
If obj is str, int, long, float, bool, return directly.
If obj is datetime.datetime, datetime.date
convert to string in iso8601 format.
If obj is list, sanitize each element in the list.
If obj is dict, return the dict.
If obj is OpenAPI model, return the properties dict.
:param obj: The data to serialize.
:return: The serialized form of data.
"""
if obj is None:
return None
elif isinstance(obj, self.PRIMITIVE_TYPES):
return obj
elif isinstance(obj, list):
return [self.sanitize_for_serialization(sub_obj) for sub_obj in obj]
elif isinstance(obj, tuple):
return tuple(self.sanitize_for_serialization(sub_obj) for sub_obj in obj)
elif isinstance(obj, (datetime.datetime, datetime.date)):
return obj.isoformat()
if isinstance(obj, dict):
obj_dict = obj
else:
# Convert model obj to dict except
# attributes `openapi_types`, `attribute_map`
# and attributes which value is not None.
# Convert attribute name to json key in
# model definition for request.
obj_dict = obj.to_dict()
return {
key: self.sanitize_for_serialization(val) for key, val in obj_dict.items()
}
def deserialize(self, response, response_type):
"""Deserializes response into an object.
:param response: RESTResponse object to be deserialized.
:param response_type: class literal for
deserialized object, or string of class name.
:return: deserialized object.
"""
# handle file downloading
# save response body into a tmp file and return the instance
if response_type == "file":
return self.__deserialize_file(response)
# fetch data from response object
try:
data = json.loads(response.data)
except ValueError:
data = response.data
return self.__deserialize(data, response_type)
def __deserialize(self, data, klass):
"""Deserializes dict, list, str into an object.
:param data: dict, list or str.
:param klass: class literal, or string of class name.
:return: object.
"""
if data is None:
return None
if type(klass) == str:
if klass.startswith("List["):
sub_kls = re.match(r"List\[(.*)]", klass).group(1)
return [self.__deserialize(sub_data, sub_kls) for sub_data in data]
if klass.startswith("Dict["):
sub_kls = re.match(r"Dict\[([^,]*), (.*)]", klass).group(2)
return {k: self.__deserialize(v, sub_kls) for k, v in data.items()}
# convert str to class
if klass in self.NATIVE_TYPES_MAPPING:
klass = self.NATIVE_TYPES_MAPPING[klass]
else:
klass = getattr(agbenchmark.agent_protocol_client.models, klass)
if klass in self.PRIMITIVE_TYPES:
return self.__deserialize_primitive(data, klass)
elif klass == object:
return self.__deserialize_object(data)
elif klass == datetime.date:
return self.__deserialize_date(data)
elif klass == datetime.datetime:
return self.__deserialize_datetime(data)
else:
return self.__deserialize_model(data, klass)
def call_api(
self,
resource_path,
method,
path_params=None,
query_params=None,
header_params=None,
body=None,
post_params=None,
files=None,
response_types_map=None,
auth_settings=None,
async_req=None,
_return_http_data_only=None,
collection_formats=None,
_preload_content=True,
_request_timeout=None,
_host=None,
_request_auth=None,
):
"""Makes the HTTP request (synchronous) and returns deserialized data.
To make an async_req request, set the async_req parameter.
:param resource_path: Path to method endpoint.
:param method: Method to call.
:param path_params: Path parameters in the url.
:param query_params: Query parameters in the url.
:param header_params: Header parameters to be
placed in the request header.
:param body: Request body.
:param post_params dict: Request post form parameters,
for `application/x-www-form-urlencoded`, `multipart/form-data`.
:param auth_settings list: Auth Settings names for the request.
:param response: Response data type.
:param files dict: key -> filename, value -> filepath,
for `multipart/form-data`.
:param async_req bool: execute request asynchronously
:param _return_http_data_only: response data instead of ApiResponse
object with status code, headers, etc
:param _preload_content: if False, the ApiResponse.data will
be set to none and raw_data will store the
HTTP response body without reading/decoding.
Default is True.
:param collection_formats: dict of collection formats for path, query,
header, and post parameters.
:param _request_timeout: timeout setting for this request. If one
number provided, it will be total request
timeout. It can also be a pair (tuple) of
(connection, read) timeouts.
:param _request_auth: set to override the auth_settings for an a single
request; this effectively ignores the authentication
in the spec for a single request.
:type _request_token: dict, optional
:return:
If async_req parameter is True,
the request will be called asynchronously.
The method will return the request thread.
If parameter async_req is False or missing,
then the method will return the response directly.
"""
if not async_req:
return self.__call_api(
resource_path,
method,
path_params,
query_params,
header_params,
body,
post_params,
files,
response_types_map,
auth_settings,
_return_http_data_only,
collection_formats,
_preload_content,
_request_timeout,
_host,
_request_auth,
)
return self.pool.apply_async(
self.__call_api,
(
resource_path,
method,
path_params,
query_params,
header_params,
body,
post_params,
files,
response_types_map,
auth_settings,
_return_http_data_only,
collection_formats,
_preload_content,
_request_timeout,
_host,
_request_auth,
),
)
def request(
self,
method,
url,
query_params=None,
headers=None,
post_params=None,
body=None,
_preload_content=True,
_request_timeout=None,
):
"""Makes the HTTP request using RESTClient."""
if method == "GET":
return self.rest_client.get_request(
url,
query_params=query_params,
_preload_content=_preload_content,
_request_timeout=_request_timeout,
headers=headers,
)
elif method == "HEAD":
return self.rest_client.head_request(
url,
query_params=query_params,
_preload_content=_preload_content,
_request_timeout=_request_timeout,
headers=headers,
)
elif method == "OPTIONS":
return self.rest_client.options_request(
url,
query_params=query_params,
headers=headers,
_preload_content=_preload_content,
_request_timeout=_request_timeout,
)
elif method == "POST":
return self.rest_client.post_request(
url,
query_params=query_params,
headers=headers,
post_params=post_params,
_preload_content=_preload_content,
_request_timeout=_request_timeout,
body=body,
)
elif method == "PUT":
return self.rest_client.put_request(
url,
query_params=query_params,
headers=headers,
post_params=post_params,
_preload_content=_preload_content,
_request_timeout=_request_timeout,
body=body,
)
elif method == "PATCH":
return self.rest_client.patch_request(
url,
query_params=query_params,
headers=headers,
post_params=post_params,
_preload_content=_preload_content,
_request_timeout=_request_timeout,
body=body,
)
elif method == "DELETE":
return self.rest_client.delete_request(
url,
query_params=query_params,
headers=headers,
_preload_content=_preload_content,
_request_timeout=_request_timeout,
body=body,
)
else:
raise ApiValueError(
"http method must be `GET`, `HEAD`, `OPTIONS`,"
" `POST`, `PATCH`, `PUT` or `DELETE`."
)
def parameters_to_tuples(self, params, collection_formats):
"""Get parameters as list of tuples, formatting collections.
:param params: Parameters as dict or list of two-tuples
:param dict collection_formats: Parameter collection formats
:return: Parameters as list of tuples, collections formatted
"""
new_params = []
if collection_formats is None:
collection_formats = {}
for k, v in (
params.items() if isinstance(params, dict) else params
): # noqa: E501
if k in collection_formats:
collection_format = collection_formats[k]
if collection_format == "multi":
new_params.extend((k, value) for value in v)
else:
if collection_format == "ssv":
delimiter = " "
elif collection_format == "tsv":
delimiter = "\t"
elif collection_format == "pipes":
delimiter = "|"
else: # csv is the default
delimiter = ","
new_params.append((k, delimiter.join(str(value) for value in v)))
else:
new_params.append((k, v))
return new_params
def parameters_to_url_query(self, params, collection_formats):
"""Get parameters as list of tuples, formatting collections.
:param params: Parameters as dict or list of two-tuples
:param dict collection_formats: Parameter collection formats
:return: URL query string (e.g. a=Hello%20World&b=123)
"""
new_params = []
if collection_formats is None:
collection_formats = {}
for k, v in (
params.items() if isinstance(params, dict) else params
): # noqa: E501
if isinstance(v, (int, float)):
v = str(v)
if isinstance(v, bool):
v = str(v).lower()
if isinstance(v, dict):
v = json.dumps(v)
if k in collection_formats:
collection_format = collection_formats[k]
if collection_format == "multi":
new_params.extend((k, value) for value in v)
else:
if collection_format == "ssv":
delimiter = " "
elif collection_format == "tsv":
delimiter = "\t"
elif collection_format == "pipes":
delimiter = "|"
else: # csv is the default
delimiter = ","
new_params.append(
(k, delimiter.join(quote(str(value)) for value in v))
)
else:
new_params.append((k, quote(str(v))))
return "&".join(["=".join(item) for item in new_params])
def files_parameters(self, files=None):
"""Builds form parameters.
:param files: File parameters.
:return: Form parameters with files.
"""
params = []
if files:
for k, v in files.items():
if not v:
continue
file_names = v if type(v) is list else [v]
for n in file_names:
with open(n, "rb") as f:
filename = os.path.basename(f.name)
filedata = f.read()
mimetype = (
mimetypes.guess_type(filename)[0]
or "application/octet-stream"
)
params.append(tuple([k, tuple([filename, filedata, mimetype])]))
return params
def select_header_accept(self, accepts):
"""Returns `Accept` based on an array of accepts provided.
:param accepts: List of headers.
:return: Accept (e.g. application/json).
"""
if not accepts:
return
for accept in accepts:
if re.search("json", accept, re.IGNORECASE):
return accept
return accepts[0]
def select_header_content_type(self, content_types):
"""Returns `Content-Type` based on an array of content_types provided.
:param content_types: List of content-types.
:return: Content-Type (e.g. application/json).
"""
if not content_types:
return None
for content_type in content_types:
if re.search("json", content_type, re.IGNORECASE):
return content_type
return content_types[0]
def update_params_for_auth(
self,
headers,
queries,
auth_settings,
resource_path,
method,
body,
request_auth=None,
):
"""Updates header and query params based on authentication setting.
:param headers: Header parameters dict to be updated.
:param queries: Query parameters tuple list to be updated.
:param auth_settings: Authentication setting identifiers list.
:resource_path: A string representation of the HTTP request resource path.
:method: A string representation of the HTTP request method.
:body: A object representing the body of the HTTP request.
The object type is the return value of sanitize_for_serialization().
:param request_auth: if set, the provided settings will
override the token in the configuration.
"""
if not auth_settings:
return
if request_auth:
self._apply_auth_params(
headers, queries, resource_path, method, body, request_auth
)
return
for auth in auth_settings:
auth_setting = self.configuration.auth_settings().get(auth)
if auth_setting:
self._apply_auth_params(
headers, queries, resource_path, method, body, auth_setting
)
def _apply_auth_params(
self, headers, queries, resource_path, method, body, auth_setting
):
"""Updates the request parameters based on a single auth_setting
:param headers: Header parameters dict to be updated.
:param queries: Query parameters tuple list to be updated.
:resource_path: A string representation of the HTTP request resource path.
:method: A string representation of the HTTP request method.
:body: A object representing the body of the HTTP request.
The object type is the return value of sanitize_for_serialization().
:param auth_setting: auth settings for the endpoint
"""
if auth_setting["in"] == "cookie":
headers["Cookie"] = auth_setting["value"]
elif auth_setting["in"] == "header":
if auth_setting["type"] != "http-signature":
headers[auth_setting["key"]] = auth_setting["value"]
elif auth_setting["in"] == "query":
queries.append((auth_setting["key"], auth_setting["value"]))
else:
raise ApiValueError("Authentication token must be in `query` or `header`")
def __deserialize_file(self, response):
"""Deserializes body to file
Saves response body into a file in a temporary folder,
using the filename from the `Content-Disposition` header if provided.
:param response: RESTResponse.
:return: file path.
"""
fd, path = tempfile.mkstemp(dir=self.configuration.temp_folder_path)
os.close(fd)
os.remove(path)
content_disposition = response.getheader("Content-Disposition")
if content_disposition:
filename = re.search(
r'filename=[\'"]?([^\'"\s]+)[\'"]?', content_disposition
).group(1)
path = os.path.join(os.path.dirname(path), filename)
with open(path, "wb") as f:
f.write(response.data)
return path
def __deserialize_primitive(self, data, klass):
"""Deserializes string to primitive type.
:param data: str.
:param klass: class literal.
:return: int, long, float, str, bool.
"""
try:
return klass(data)
except UnicodeEncodeError:
return str(data)
except TypeError:
return data
def __deserialize_object(self, value):
"""Return an original value.
:return: object.
"""
return value
def __deserialize_date(self, string):
"""Deserializes string to date.
:param string: str.
:return: date.
"""
try:
return parse(string).date()
except ImportError:
return string
except ValueError:
raise rest.ApiException(
status=0, reason="Failed to parse `{0}` as date object".format(string)
)
def __deserialize_datetime(self, string):
"""Deserializes string to datetime.
The string should be in iso8601 datetime format.
:param string: str.
:return: datetime.
"""
try:
return parse(string)
except ImportError:
return string
except ValueError:
raise rest.ApiException(
status=0,
reason=("Failed to parse `{0}` as datetime object".format(string)),
)
def __deserialize_model(self, data, klass):
"""Deserializes list or dict to model.
:param data: dict, list.
:param klass: class literal.
:return: model object.
"""
return klass.from_dict(data)


@@ -1,28 +0,0 @@
"""API response object."""
from __future__ import annotations
from typing import Any, Dict, Optional
from pydantic import Field, StrictInt, StrictStr
class ApiResponse:
"""
API response object
"""
status_code: Optional[StrictInt] = Field(None, description="HTTP status code")
headers: Optional[Dict[StrictStr, StrictStr]] = Field(
None, description="HTTP headers"
)
data: Optional[Any] = Field(
None, description="Deserialized data given the data type"
)
raw_data: Optional[Any] = Field(None, description="Raw data (HTTP response body)")
def __init__(self, status_code=None, headers=None, data=None, raw_data=None):
self.status_code = status_code
self.headers = headers
self.data = data
self.raw_data = raw_data


@@ -1,447 +0,0 @@
# coding: utf-8
"""
Agent Communication Protocol
Specification of the API protocol for communication with an agent. # noqa: E501
The version of the OpenAPI document: v0.2
Generated by OpenAPI Generator (https://openapi-generator.tech)
Do not edit the class manually.
"""
import copy
import http.client as httplib
import logging
import sys
import urllib3
JSON_SCHEMA_VALIDATION_KEYWORDS = {
"multipleOf",
"maximum",
"exclusiveMaximum",
"minimum",
"exclusiveMinimum",
"maxLength",
"minLength",
"pattern",
"maxItems",
"minItems",
}
class Configuration(object):
"""This class contains various settings of the API client.
:param host: Base url.
:param api_key: Dict to store API key(s).
Each entry in the dict specifies an API key.
The dict key is the name of the security scheme in the OAS specification.
The dict value is the API key secret.
:param api_key_prefix: Dict to store API prefix (e.g. Bearer).
The dict key is the name of the security scheme in the OAS specification.
The dict value is an API key prefix when generating the auth data.
:param username: Username for HTTP basic authentication.
:param password: Password for HTTP basic authentication.
:param access_token: Access token.
:param server_index: Index to servers configuration.
:param server_variables: Mapping with string values to replace variables in
templated server configuration. The validation of enums is performed for
variables with defined enum values before.
:param server_operation_index: Mapping from operation ID to an index to server
configuration.
:param server_operation_variables: Mapping from operation ID to a mapping with
string values to replace variables in templated server configuration.
The validation of enums is performed for variables with defined enum values before.
:param ssl_ca_cert: str - the path to a file of concatenated CA certificates
in PEM format.
"""
_default = None
def __init__(
self,
host=None,
api_key=None,
api_key_prefix=None,
username=None,
password=None,
access_token=None,
server_index=None,
server_variables=None,
server_operation_index=None,
server_operation_variables=None,
ssl_ca_cert=None,
):
"""Constructor"""
self._base_path = "http://localhost" if host is None else host
"""Default Base url
"""
self.server_index = 0 if server_index is None and host is None else server_index
self.server_operation_index = server_operation_index or {}
"""Default server index
"""
self.server_variables = server_variables or {}
self.server_operation_variables = server_operation_variables or {}
"""Default server variables
"""
self.temp_folder_path = None
"""Temp file folder for downloading files
"""
# Authentication Settings
self.api_key = {}
if api_key:
self.api_key = api_key
"""dict to store API key(s)
"""
self.api_key_prefix = {}
if api_key_prefix:
self.api_key_prefix = api_key_prefix
"""dict to store API prefix (e.g. Bearer)
"""
self.refresh_api_key_hook = None
"""function hook to refresh API key if expired
"""
self.username = username
"""Username for HTTP basic authentication
"""
self.password = password
"""Password for HTTP basic authentication
"""
self.access_token = access_token
"""Access token
"""
self.logger = {}
"""Logging Settings
"""
self.logger["package_logger"] = logging.getLogger("agent_protocol_client")
self.logger["urllib3_logger"] = logging.getLogger("urllib3")
self.logger_format = "%(asctime)s %(levelname)s %(message)s"
"""Log format
"""
self.logger_stream_handler = None
"""Log stream handler
"""
self.logger_file_handler = None
"""Log file handler
"""
self.logger_file = None
"""Debug file location
"""
self.debug = False
"""Debug switch
"""
self.verify_ssl = True
"""SSL/TLS verification
Set this to False to skip verifying the SSL certificate when calling the
API over HTTPS.
"""
self.ssl_ca_cert = ssl_ca_cert
"""Set this to customize the certificate file to verify the peer.
"""
self.cert_file = None
"""client certificate file
"""
self.key_file = None
"""client key file
"""
self.assert_hostname = None
"""Set this to True/False to enable/disable SSL hostname verification.
"""
self.tls_server_name = None
"""SSL/TLS Server Name Indication (SNI)
Set this to the SNI value expected by the server.
"""
self.connection_pool_maxsize = 100
"""This value is passed to the aiohttp to limit simultaneous connections.
Default values is 100, None means no-limit.
"""
self.proxy = None
"""Proxy URL
"""
self.proxy_headers = None
"""Proxy headers
"""
self.safe_chars_for_path_param = ""
"""Safe chars for path_param
"""
self.retries = None
"""Adding retries to override urllib3 default value 3
"""
# Enable client side validation
self.client_side_validation = True
self.socket_options = None
"""Options to pass down to the underlying urllib3 socket
"""
self.datetime_format = "%Y-%m-%dT%H:%M:%S.%f%z"
"""datetime format
"""
self.date_format = "%Y-%m-%d"
"""date format
"""
def __deepcopy__(self, memo):
cls = self.__class__
result = cls.__new__(cls)
memo[id(self)] = result
for k, v in self.__dict__.items():
if k not in ("logger", "logger_file_handler"):
setattr(result, k, copy.deepcopy(v, memo))
# shallow copy of loggers
result.logger = copy.copy(self.logger)
# use setters to configure loggers
result.logger_file = self.logger_file
result.debug = self.debug
return result
def __setattr__(self, name, value):
object.__setattr__(self, name, value)
@classmethod
def set_default(cls, default):
"""Set default instance of configuration.
It stores default configuration, which can be
returned by get_default_copy method.
:param default: object of Configuration
"""
cls._default = default
@classmethod
def get_default_copy(cls):
"""Deprecated. Please use `get_default` instead.
Deprecated. Please use `get_default` instead.
:return: The configuration object.
"""
return cls.get_default()
@classmethod
def get_default(cls):
"""Return the default configuration.
Returns the default Configuration instance, creating one with the
default constructor if no default has been set yet.
:return: The configuration object.
"""
if cls._default is None:
cls._default = Configuration()
return cls._default
@property
def logger_file(self):
"""The logger file.
If the logger_file is None, then add stream handler and remove file
handler. Otherwise, add file handler and remove stream handler.
:param value: The logger_file path.
:type: str
"""
return self.__logger_file
@logger_file.setter
def logger_file(self, value):
"""The logger file.
If the logger_file is None, then add stream handler and remove file
handler. Otherwise, add file handler and remove stream handler.
:param value: The logger_file path.
:type: str
"""
self.__logger_file = value
if self.__logger_file:
# If set logging file,
# then add file handler and remove stream handler.
self.logger_file_handler = logging.FileHandler(self.__logger_file)
self.logger_file_handler.setFormatter(self.logger_formatter)
for _, logger in self.logger.items():
logger.addHandler(self.logger_file_handler)
@property
def debug(self):
"""Debug status
:param value: The debug status, True or False.
:type: bool
"""
return self.__debug
@debug.setter
def debug(self, value):
"""Debug status
:param value: The debug status, True or False.
:type: bool
"""
self.__debug = value
if self.__debug:
# if debug status is True, turn on debug logging
for _, logger in self.logger.items():
logger.setLevel(logging.DEBUG)
# turn on httplib debug
httplib.HTTPConnection.debuglevel = 1
else:
# if debug status is False, turn off debug logging,
# setting log level to default `logging.WARNING`
for _, logger in self.logger.items():
logger.setLevel(logging.WARNING)
# turn off httplib debug
httplib.HTTPConnection.debuglevel = 0
@property
def logger_format(self):
"""The logger format.
The logger_formatter is updated when logger_format is set.
:param value: The format string.
:type: str
"""
return self.__logger_format
@logger_format.setter
def logger_format(self, value):
"""The logger format.
The logger_formatter is updated when logger_format is set.
:param value: The format string.
:type: str
"""
self.__logger_format = value
self.logger_formatter = logging.Formatter(self.__logger_format)
def get_api_key_with_prefix(self, identifier, alias=None):
"""Gets API key (with prefix if set).
:param identifier: The identifier of apiKey.
:param alias: The alternative identifier of apiKey.
:return: The token for api key authentication.
"""
if self.refresh_api_key_hook is not None:
self.refresh_api_key_hook(self)
key = self.api_key.get(
identifier, self.api_key.get(alias) if alias is not None else None
)
if key:
prefix = self.api_key_prefix.get(identifier)
if prefix:
return "%s %s" % (prefix, key)
else:
return key
def get_basic_auth_token(self):
"""Gets HTTP basic authentication header (string).
:return: The token for basic HTTP authentication.
"""
username = ""
if self.username is not None:
username = self.username
password = ""
if self.password is not None:
password = self.password
return urllib3.util.make_headers(basic_auth=username + ":" + password).get(
"authorization"
)
def auth_settings(self):
"""Gets Auth Settings dict for api client.
:return: The Auth Settings information dict.
"""
auth = {}
return auth
def to_debug_report(self):
"""Gets the essential information for debugging.
:return: The report for debugging.
"""
return (
"Python SDK Debug Report:\n"
"OS: {env}\n"
"Python Version: {pyversion}\n"
"Version of the API: v0.2\n"
"SDK Package Version: 1.0.0".format(env=sys.platform, pyversion=sys.version)
)
def get_host_settings(self):
"""Gets an array of host settings
:return: An array of host settings
"""
return [
{
"url": "",
"description": "No description provided",
}
]
def get_host_from_settings(self, index, variables=None, servers=None):
"""Gets host URL based on the index and variables
:param index: array index of the host settings
:param variables: hash of variable and the corresponding value
:param servers: an array of host settings or None
:return: URL based on host settings
"""
if index is None:
return self._base_path
variables = {} if variables is None else variables
servers = self.get_host_settings() if servers is None else servers
try:
server = servers[index]
except IndexError:
raise ValueError(
"Invalid index {0} when selecting the host settings. "
"Must be less than {1}".format(index, len(servers))
)
url = server["url"]
# go through variables and replace placeholders
for variable_name, variable in server.get("variables", {}).items():
used_value = variables.get(variable_name, variable["default_value"])
if "enum_values" in variable and used_value not in variable["enum_values"]:
raise ValueError(
"The variable `{0}` in the host URL has invalid value "
"{1}. Must be {2}.".format(
variable_name, variables[variable_name], variable["enum_values"]
)
)
url = url.replace("{" + variable_name + "}", used_value)
return url
@property
def host(self):
"""Return generated host."""
return self.get_host_from_settings(
self.server_index, variables=self.server_variables
)
@host.setter
def host(self, value):
"""Fix base path."""
self._base_path = value
self.server_index = None
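With this vendored `configuration.py` removed, client settings come from the external `agent-protocol-client` dependency instead. A minimal sketch of the equivalent setup, assuming the external package exposes the same `Configuration` interface (both are generated from the same Agent Protocol OpenAPI spec):

```python
# Illustrative sketch only: configuring the external agent-protocol-client,
# which replaces the vendored configuration.py above.
from agent_protocol_client import Configuration

config = Configuration(host="http://localhost:8000")
config.debug = True  # the debug setter switches the package loggers to DEBUG
```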


@@ -1,615 +0,0 @@
# agbenchmark.agent_protocol_client.AgentApi
All URIs are relative to _http://localhost_
| Method | HTTP request | Description |
| ---------------------------------------------------------------------------- | ------------------------------------------------------ | ------------------------------------------------------------- |
| [**create_agent_task**](AgentApi.md#create_agent_task) | **POST** /agent/tasks | Creates a task for the agent. |
| [**download_agent_task_artifact**](AgentApi.md#download_agent_task_artifact) | **GET** /agent/tasks/{task_id}/artifacts/{artifact_id} | Download a specified artifact. |
| [**execute_agent_task_step**](AgentApi.md#execute_agent_task_step) | **POST** /agent/tasks/{task_id}/steps | Execute a step in the specified agent task. |
| [**get_agent_task**](AgentApi.md#get_agent_task) | **GET** /agent/tasks/{task_id} | Get details about a specified agent task. |
| [**get_agent_task_step**](AgentApi.md#get_agent_task_step) | **GET** /agent/tasks/{task_id}/steps/{step_id} | Get details about a specified task step. |
| [**list_agent_task_artifacts**](AgentApi.md#list_agent_task_artifacts) | **GET** /agent/tasks/{task_id}/artifacts | List all artifacts that have been created for the given task. |
| [**list_agent_task_steps**](AgentApi.md#list_agent_task_steps) | **GET** /agent/tasks/{task_id}/steps | List all steps for the specified task. |
| [**list_agent_tasks_ids**](AgentApi.md#list_agent_tasks_ids) | **GET** /agent/tasks | List all tasks that have been created for the agent. |
| [**upload_agent_task_artifacts**](AgentApi.md#upload_agent_task_artifacts) | **POST** /agent/tasks/{task_id}/artifacts | Upload an artifact for the specified task. |
# **create_agent_task**
> Task create_agent_task(task_request_body=task_request_body)
Creates a task for the agent.
### Example
```python
import time
import os
import agbenchmark.agent_protocol_client
from agbenchmark.agent_protocol_client.models.task import Task
from agbenchmark.agent_protocol_client.models.task_request_body import TaskRequestBody
from agbenchmark.agent_protocol_client.rest import ApiException
from pprint import pprint
# Defining the host is optional and defaults to http://localhost
# See configuration.py for a list of all supported configuration parameters.
configuration = agbenchmark.agent_protocol_client.Configuration(
host = "http://localhost"
)
# Enter a context with an instance of the API client
async with agbenchmark.agent_protocol_client.ApiClient(configuration) as api_client:
# Create an instance of the API class
api_instance = agbenchmark.agent_protocol_client.AgentApi(api_client)
task_request_body = agbenchmark.agent_protocol_client.TaskRequestBody() # TaskRequestBody | (optional)
try:
# Creates a task for the agent.
api_response = await api_instance.create_agent_task(task_request_body=task_request_body)
print("The response of AgentApi->create_agent_task:\n")
pprint(api_response)
except Exception as e:
print("Exception when calling AgentApi->create_agent_task: %s\n" % e)
```
### Parameters
| Name | Type | Description | Notes |
| --------------------- | ----------------------------------------- | ----------- | ---------- |
| **task_request_body** | [**TaskRequestBody**](TaskRequestBody.md) | | [optional] |
### Return type
[**Task**](Task.md)
### Authorization
No authorization required
### HTTP request headers
- **Content-Type**: application/json
- **Accept**: application/json
### HTTP response details
| Status code | Description | Response headers |
| ----------- | ------------------------------------------ | ---------------- |
| **200** | A new agent task was successfully created. | - |
| **0** | Internal Server Error | - |
[[Back to top]](#) [[Back to API list]](../README.md#documentation-for-api-endpoints) [[Back to Model list]](../README.md#documentation-for-models) [[Back to README]](../README.md)
# **download_agent_task_artifact**
> bytearray download_agent_task_artifact(task_id, artifact_id)
Download a specified artifact.
### Example
```python
import time
import os
import agbenchmark.agent_protocol_client
from agbenchmark.agent_protocol_client.rest import ApiException
from pprint import pprint
# Defining the host is optional and defaults to http://localhost
# See configuration.py for a list of all supported configuration parameters.
configuration = agbenchmark.agent_protocol_client.Configuration(
host = "http://localhost"
)
# Enter a context with an instance of the API client
async with agbenchmark.agent_protocol_client.ApiClient(configuration) as api_client:
# Create an instance of the API class
api_instance = agbenchmark.agent_protocol_client.AgentApi(api_client)
task_id = 'task_id_example' # str | ID of the task
artifact_id = 'artifact_id_example' # str | ID of the artifact
try:
# Download a specified artifact.
api_response = await api_instance.download_agent_task_artifact(task_id, artifact_id)
print("The response of AgentApi->download_agent_task_artifact:\n")
pprint(api_response)
except Exception as e:
print("Exception when calling AgentApi->download_agent_task_artifact: %s\n" % e)
```
### Parameters
| Name | Type | Description | Notes |
| --------------- | ------- | ------------------ | ----- |
| **task_id** | **str** | ID of the task |
| **artifact_id** | **str** | ID of the artifact |
### Return type
**bytearray**
### Authorization
No authorization required
### HTTP request headers
- **Content-Type**: Not defined
- **Accept**: application/octet-stream
### HTTP response details
| Status code | Description | Response headers |
| ----------- | ------------------------------------- | ---------------- |
| **200** | Returned the content of the artifact. | - |
| **0** | Internal Server Error | - |
[[Back to top]](#) [[Back to API list]](../README.md#documentation-for-api-endpoints) [[Back to Model list]](../README.md#documentation-for-models) [[Back to README]](../README.md)
# **execute_agent_task_step**
> Step execute_agent_task_step(task_id, step_request_body=step_request_body)
Execute a step in the specified agent task.
### Example
```python
import time
import os
import agbenchmark.agent_protocol_client
from agbenchmark.agent_protocol_client.models.step import Step
from agbenchmark.agent_protocol_client.models.step_request_body import StepRequestBody
from agbenchmark.agent_protocol_client.rest import ApiException
from pprint import pprint
# Defining the host is optional and defaults to http://localhost
# See configuration.py for a list of all supported configuration parameters.
configuration = agbenchmark.agent_protocol_client.Configuration(
host = "http://localhost"
)
# Enter a context with an instance of the API client
async with agbenchmark.agent_protocol_client.ApiClient(configuration) as api_client:
# Create an instance of the API class
api_instance = agbenchmark.agent_protocol_client.AgentApi(api_client)
task_id = 'task_id_example' # str | ID of the task
step_request_body = agbenchmark.agent_protocol_client.StepRequestBody() # StepRequestBody | (optional)
try:
# Execute a step in the specified agent task.
api_response = await api_instance.execute_agent_task_step(task_id, step_request_body=step_request_body)
print("The response of AgentApi->execute_agent_task_step:\n")
pprint(api_response)
except Exception as e:
print("Exception when calling AgentApi->execute_agent_task_step: %s\n" % e)
```
### Parameters
| Name | Type | Description | Notes |
| --------------------- | ----------------------------------------- | -------------- | ---------- |
| **task_id** | **str** | ID of the task |
| **step_request_body** | [**StepRequestBody**](StepRequestBody.md) | | [optional] |
### Return type
[**Step**](Step.md)
### Authorization
No authorization required
### HTTP request headers
- **Content-Type**: application/json
- **Accept**: application/json
### HTTP response details
| Status code | Description | Response headers |
| ----------- | --------------------------------- | ---------------- |
| **200** | Executed step for the agent task. | - |
| **0** | Internal Server Error | - |
[[Back to top]](#) [[Back to API list]](../README.md#documentation-for-api-endpoints) [[Back to Model list]](../README.md#documentation-for-models) [[Back to README]](../README.md)
# **get_agent_task**
> Task get_agent_task(task_id)
Get details about a specified agent task.
### Example
```python
import time
import os
import agbenchmark.agent_protocol_client
from agbenchmark.agent_protocol_client.models.task import Task
from agbenchmark.agent_protocol_client.rest import ApiException
from pprint import pprint
# Defining the host is optional and defaults to http://localhost
# See configuration.py for a list of all supported configuration parameters.
configuration = agbenchmark.agent_protocol_client.Configuration(
host = "http://localhost"
)
# Enter a context with an instance of the API client
async with agbenchmark.agent_protocol_client.ApiClient(configuration) as api_client:
# Create an instance of the API class
api_instance = agbenchmark.agent_protocol_client.AgentApi(api_client)
task_id = 'task_id_example' # str | ID of the task
try:
# Get details about a specified agent task.
api_response = await api_instance.get_agent_task(task_id)
print("The response of AgentApi->get_agent_task:\n")
pprint(api_response)
except Exception as e:
print("Exception when calling AgentApi->get_agent_task: %s\n" % e)
```
### Parameters
| Name | Type | Description | Notes |
| ----------- | ------- | -------------- | ----- |
| **task_id** | **str** | ID of the task |
### Return type
[**Task**](Task.md)
### Authorization
No authorization required
### HTTP request headers
- **Content-Type**: Not defined
- **Accept**: application/json
### HTTP response details
| Status code | Description | Response headers |
| ----------- | ------------------------------------- | ---------------- |
| **200** | Returned details about an agent task. | - |
| **0** | Internal Server Error | - |
[[Back to top]](#) [[Back to API list]](../README.md#documentation-for-api-endpoints) [[Back to Model list]](../README.md#documentation-for-models) [[Back to README]](../README.md)
# **get_agent_task_step**
> Step get_agent_task_step(task_id, step_id)
Get details about a specified task step.
### Example
```python
import time
import os
import agbenchmark.agent_protocol_client
from agbenchmark.agent_protocol_client.models.step import Step
from agbenchmark.agent_protocol_client.rest import ApiException
from pprint import pprint
# Defining the host is optional and defaults to http://localhost
# See configuration.py for a list of all supported configuration parameters.
configuration = agbenchmark.agent_protocol_client.Configuration(
host = "http://localhost"
)
# Enter a context with an instance of the API client
async with agbenchmark.agent_protocol_client.ApiClient(configuration) as api_client:
# Create an instance of the API class
api_instance = agbenchmark.agent_protocol_client.AgentApi(api_client)
task_id = 'task_id_example' # str | ID of the task
step_id = 'step_id_example' # str | ID of the step
try:
# Get details about a specified task step.
api_response = await api_instance.get_agent_task_step(task_id, step_id)
print("The response of AgentApi->get_agent_task_step:\n")
pprint(api_response)
except Exception as e:
print("Exception when calling AgentApi->get_agent_task_step: %s\n" % e)
```
### Parameters
| Name | Type | Description | Notes |
| ----------- | ------- | -------------- | ----- |
| **task_id** | **str** | ID of the task |
| **step_id** | **str** | ID of the step |
### Return type
[**Step**](Step.md)
### Authorization
No authorization required
### HTTP request headers
- **Content-Type**: Not defined
- **Accept**: application/json
### HTTP response details
| Status code | Description | Response headers |
| ----------- | ------------------------------------------ | ---------------- |
| **200** | Returned details about an agent task step. | - |
| **0** | Internal Server Error | - |
[[Back to top]](#) [[Back to API list]](../README.md#documentation-for-api-endpoints) [[Back to Model list]](../README.md#documentation-for-models) [[Back to README]](../README.md)
# **list_agent_task_artifacts**
> List[Artifact] list_agent_task_artifacts(task_id)
List all artifacts that have been created for the given task.
### Example
```python
import time
import os
import agbenchmark.agent_protocol_client
from agbenchmark.agent_protocol_client.models.artifact import Artifact
from agbenchmark.agent_protocol_client.rest import ApiException
from pprint import pprint
# Defining the host is optional and defaults to http://localhost
# See configuration.py for a list of all supported configuration parameters.
configuration = agbenchmark.agent_protocol_client.Configuration(
host = "http://localhost"
)
# Enter a context with an instance of the API client
async with agbenchmark.agent_protocol_client.ApiClient(configuration) as api_client:
# Create an instance of the API class
api_instance = agbenchmark.agent_protocol_client.AgentApi(api_client)
task_id = 'task_id_example' # str | ID of the task
try:
# List all artifacts that have been created for the given task.
api_response = await api_instance.list_agent_task_artifacts(task_id)
print("The response of AgentApi->list_agent_task_artifacts:\n")
pprint(api_response)
except Exception as e:
print("Exception when calling AgentApi->list_agent_task_artifacts: %s\n" % e)
```
### Parameters
| Name | Type | Description | Notes |
| ----------- | ------- | -------------- | ----- |
| **task_id** | **str** | ID of the task |
### Return type
[**List[Artifact]**](Artifact.md)
### Authorization
No authorization required
### HTTP request headers
- **Content-Type**: Not defined
- **Accept**: application/json
### HTTP response details
| Status code | Description | Response headers |
| ----------- | ------------------------------------- | ---------------- |
| **200** | Returned the content of the artifact. | - |
| **0** | Internal Server Error | - |
[[Back to top]](#) [[Back to API list]](../README.md#documentation-for-api-endpoints) [[Back to Model list]](../README.md#documentation-for-models) [[Back to README]](../README.md)
# **list_agent_task_steps**
> List[str] list_agent_task_steps(task_id)
List all steps for the specified task.
### Example
```python
import time
import os
import agbenchmark.agent_protocol_client
from agbenchmark.agent_protocol_client.rest import ApiException
from pprint import pprint
# Defining the host is optional and defaults to http://localhost
# See configuration.py for a list of all supported configuration parameters.
configuration = agbenchmark.agent_protocol_client.Configuration(
host = "http://localhost"
)
# Enter a context with an instance of the API client
async with agbenchmark.agent_protocol_client.ApiClient(configuration) as api_client:
# Create an instance of the API class
api_instance = agbenchmark.agent_protocol_client.AgentApi(api_client)
task_id = 'task_id_example' # str | ID of the task
try:
# List all steps for the specified task.
api_response = await api_instance.list_agent_task_steps(task_id)
print("The response of AgentApi->list_agent_task_steps:\n")
pprint(api_response)
except Exception as e:
print("Exception when calling AgentApi->list_agent_task_steps: %s\n" % e)
```
### Parameters
| Name | Type | Description | Notes |
| ----------- | ------- | -------------- | ----- |
| **task_id** | **str** | ID of the task |
### Return type
**List[str]**
### Authorization
No authorization required
### HTTP request headers
- **Content-Type**: Not defined
- **Accept**: application/json
### HTTP response details
| Status code | Description | Response headers |
| ----------- | ------------------------------------------------------------- | ---------------- |
| **200** | Returned list of agent&#39;s step IDs for the specified task. | - |
| **0** | Internal Server Error | - |
[[Back to top]](#) [[Back to API list]](../README.md#documentation-for-api-endpoints) [[Back to Model list]](../README.md#documentation-for-models) [[Back to README]](../README.md)
# **list_agent_tasks_ids**
> List[str] list_agent_tasks_ids()
List all tasks that have been created for the agent.
### Example
```python
import time
import os
import agbenchmark.agent_protocol_client
from agbenchmark.agent_protocol_client.rest import ApiException
from pprint import pprint
# Defining the host is optional and defaults to http://localhost
# See configuration.py for a list of all supported configuration parameters.
configuration = agbenchmark.agent_protocol_client.Configuration(
host = "http://localhost"
)
# Enter a context with an instance of the API client
async with agbenchmark.agent_protocol_client.ApiClient(configuration) as api_client:
# Create an instance of the API class
api_instance = agbenchmark.agent_protocol_client.AgentApi(api_client)
try:
# List all tasks that have been created for the agent.
api_response = await api_instance.list_agent_tasks_ids()
print("The response of AgentApi->list_agent_tasks_ids:\n")
pprint(api_response)
except Exception as e:
print("Exception when calling AgentApi->list_agent_tasks_ids: %s\n" % e)
```
### Parameters
This endpoint does not need any parameter.
### Return type
**List[str]**
### Authorization
No authorization required
### HTTP request headers
- **Content-Type**: Not defined
- **Accept**: application/json
### HTTP response details
| Status code | Description | Response headers |
| ----------- | -------------------------------------- | ---------------- |
| **200** | Returned list of agent&#39;s task IDs. | - |
| **0** | Internal Server Error | - |
[[Back to top]](#) [[Back to API list]](../README.md#documentation-for-api-endpoints) [[Back to Model list]](../README.md#documentation-for-models) [[Back to README]](../README.md)
# **upload_agent_task_artifacts**
> Artifact upload_agent_task_artifacts(task_id, file, relative_path=relative_path)
Upload an artifact for the specified task.
### Example
```python
import time
import os
import agbenchmark.agent_protocol_client
from agbenchmark.agent_protocol_client.models.artifact import Artifact
from agbenchmark.agent_protocol_client.rest import ApiException
from pprint import pprint
# Defining the host is optional and defaults to http://localhost
# See configuration.py for a list of all supported configuration parameters.
configuration = agbenchmark.agent_protocol_client.Configuration(
host = "http://localhost"
)
# Enter a context with an instance of the API client
async with agbenchmark.agent_protocol_client.ApiClient(configuration) as api_client:
# Create an instance of the API class
api_instance = agbenchmark.agent_protocol_client.AgentApi(api_client)
task_id = 'task_id_example' # str | ID of the task
file = None # bytearray | File to upload.
relative_path = 'relative_path_example' # str | Relative path of the artifact in the agent's workspace. (optional)
try:
# Upload an artifact for the specified task.
api_response = await api_instance.upload_agent_task_artifacts(task_id, file, relative_path=relative_path)
print("The response of AgentApi->upload_agent_task_artifacts:\n")
pprint(api_response)
except Exception as e:
print("Exception when calling AgentApi->upload_agent_task_artifacts: %s\n" % e)
```
### Parameters
| Name | Type | Description | Notes |
| ----------------- | ------------- | ----------------------------------------------------------- | ---------- |
| **task_id** | **str** | ID of the task |
| **file** | **bytearray** | File to upload. |
| **relative_path** | **str** | Relative path of the artifact in the agent&#39;s workspace. | [optional] |
### Return type
[**Artifact**](Artifact.md)
### Authorization
No authorization required
### HTTP request headers
- **Content-Type**: multipart/form-data
- **Accept**: application/json
### HTTP response details
| Status code | Description | Response headers |
| ----------- | ------------------------------------- | ---------------- |
| **200** | Returned the content of the artifact. | - |
| **0** | Internal Server Error | - |
[[Back to top]](#) [[Back to API list]](../README.md#documentation-for-api-endpoints) [[Back to Model list]](../README.md#documentation-for-models) [[Back to README]](../README.md)
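With these vendored docs removed along with the client, the same endpoints are reached through `agent_protocol_client` directly. A minimal end-to-end sketch, assuming the external package mirrors the method names documented above; the host and task input are placeholders:

```python
# Illustrative sketch: create a task and execute one step via the
# external agent-protocol-client package (method names assumed to match
# the vendored client documented above).
import asyncio

from agent_protocol_client import AgentApi, ApiClient, Configuration, TaskRequestBody


async def run_one_task(host: str, task_input: str) -> None:
    config = Configuration(host=host)
    async with ApiClient(config) as api_client:
        agent = AgentApi(api_client)
        task = await agent.create_agent_task(
            task_request_body=TaskRequestBody(input=task_input)
        )
        step = await agent.execute_agent_task_step(task.task_id)
        print(step.output)


asyncio.run(run_one_task("http://localhost:8000", "Write 'Hello World' to hello.txt"))
```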


@@ -1,154 +0,0 @@
# coding: utf-8
"""
Agent Communication Protocol
Specification of the API protocol for communication with an agent. # noqa: E501
The version of the OpenAPI document: v0.2
Generated by OpenAPI Generator (https://openapi-generator.tech)
Do not edit the class manually.
"""
class OpenApiException(Exception):
"""The base exception class for all OpenAPIExceptions"""
class ApiTypeError(OpenApiException, TypeError):
def __init__(self, msg, path_to_item=None, valid_classes=None, key_type=None):
"""Raises an exception for TypeErrors
Args:
msg (str): the exception message
Keyword Args:
path_to_item (list): a list of keys and indices to get to the
current_item
None if unset
valid_classes (tuple): the primitive classes that current item
should be an instance of
None if unset
key_type (bool): False if our value is a value in a dict
True if it is a key in a dict
False if our item is an item in a list
None if unset
"""
self.path_to_item = path_to_item
self.valid_classes = valid_classes
self.key_type = key_type
full_msg = msg
if path_to_item:
full_msg = "{0} at {1}".format(msg, render_path(path_to_item))
super(ApiTypeError, self).__init__(full_msg)
class ApiValueError(OpenApiException, ValueError):
def __init__(self, msg, path_to_item=None):
"""
Args:
msg (str): the exception message
Keyword Args:
path_to_item (list) the path to the exception in the
received_data dict. None if unset
"""
self.path_to_item = path_to_item
full_msg = msg
if path_to_item:
full_msg = "{0} at {1}".format(msg, render_path(path_to_item))
super(ApiValueError, self).__init__(full_msg)
class ApiAttributeError(OpenApiException, AttributeError):
def __init__(self, msg, path_to_item=None):
"""
Raised when an attribute reference or assignment fails.
Args:
msg (str): the exception message
Keyword Args:
path_to_item (None/list) the path to the exception in the
received_data dict
"""
self.path_to_item = path_to_item
full_msg = msg
if path_to_item:
full_msg = "{0} at {1}".format(msg, render_path(path_to_item))
super(ApiAttributeError, self).__init__(full_msg)
class ApiKeyError(OpenApiException, KeyError):
def __init__(self, msg, path_to_item=None):
"""
Args:
msg (str): the exception message
Keyword Args:
path_to_item (None/list) the path to the exception in the
received_data dict
"""
self.path_to_item = path_to_item
full_msg = msg
if path_to_item:
full_msg = "{0} at {1}".format(msg, render_path(path_to_item))
super(ApiKeyError, self).__init__(full_msg)
class ApiException(OpenApiException):
def __init__(self, status=None, reason=None, http_resp=None):
if http_resp:
self.status = http_resp.status
self.reason = http_resp.reason
self.body = http_resp.data
self.headers = http_resp.getheaders()
else:
self.status = status
self.reason = reason
self.body = None
self.headers = None
def __str__(self):
"""Custom error messages for exception"""
error_message = "({0})\n" "Reason: {1}\n".format(self.status, self.reason)
if self.headers:
error_message += "HTTP response headers: {0}\n".format(self.headers)
if self.body:
error_message += "HTTP response body: {0}\n".format(self.body)
return error_message
class NotFoundException(ApiException):
def __init__(self, status=None, reason=None, http_resp=None):
super(NotFoundException, self).__init__(status, reason, http_resp)
class UnauthorizedException(ApiException):
def __init__(self, status=None, reason=None, http_resp=None):
super(UnauthorizedException, self).__init__(status, reason, http_resp)
class ForbiddenException(ApiException):
def __init__(self, status=None, reason=None, http_resp=None):
super(ForbiddenException, self).__init__(status, reason, http_resp)
class ServiceException(ApiException):
def __init__(self, status=None, reason=None, http_resp=None):
super(ServiceException, self).__init__(status, reason, http_resp)
def render_path(path_to_item):
"""Returns a string representation of a path"""
result = ""
for pth in path_to_item:
if isinstance(pth, int):
result += "[{0}]".format(pth)
else:
result += "['{0}']".format(pth)
return result
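Callers that caught these exceptions keep working against the external package, which ships the same hierarchy. A hedged sketch (import path assumed to mirror the module removed above):

```python
# Illustrative sketch: handling Agent Protocol errors with the external
# package's exceptions, assumed to mirror the hierarchy above.
from agent_protocol_client.rest import ApiException


async def get_task_or_none(agent, task_id: str):
    try:
        return await agent.get_agent_task(task_id)
    except ApiException as e:
        if e.status == 404:
            return None  # unknown task ID
        raise  # __str__ (defined above) renders status, reason, headers and body
```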


@@ -1,25 +0,0 @@
# coding: utf-8
# flake8: noqa
"""
Agent Communication Protocol
Specification of the API protocol for communication with an agent. # noqa: E501
The version of the OpenAPI document: v0.2
Generated by OpenAPI Generator (https://openapi-generator.tech)
Do not edit the class manually.
"""
# import models into model package
from agbenchmark.agent_protocol_client.models.artifact import Artifact
from agbenchmark.agent_protocol_client.models.artifacts import Artifacts
from agbenchmark.agent_protocol_client.models.pagination import Pagination
from agbenchmark.agent_protocol_client.models.step import Step
from agbenchmark.agent_protocol_client.models.step_all_of import StepAllOf
from agbenchmark.agent_protocol_client.models.step_request_body import StepRequestBody
from agbenchmark.agent_protocol_client.models.task import Task
from agbenchmark.agent_protocol_client.models.task_all_of import TaskAllOf
from agbenchmark.agent_protocol_client.models.task_request_body import TaskRequestBody


@@ -1,72 +0,0 @@
# coding: utf-8
from __future__ import annotations
import json
import pprint
import re # noqa: F401
from typing import Optional
from pydantic import BaseModel, Field, StrictStr
class Artifact(BaseModel):
"""
Artifact that the task has produced.
"""
artifact_id: StrictStr = Field(..., description="ID of the artifact.")
file_name: StrictStr = Field(..., description="Filename of the artifact.")
relative_path: Optional[StrictStr] = Field(
None, description="Relative path of the artifact in the agent's workspace."
)
__properties = ["artifact_id", "file_name", "relative_path"]
created_at: StrictStr = Field(..., description="Creation date of the artifact.")
# modified_at: StrictStr = Field(..., description="Modification date of the artifact.")
agent_created: bool = Field(..., description="True if created by the agent")
class Config:
"""Pydantic configuration"""
allow_population_by_field_name = True
validate_assignment = True
def to_str(self) -> str:
"""Returns the string representation of the model using alias"""
return pprint.pformat(self.dict(by_alias=True))
def to_json(self) -> str:
"""Returns the JSON representation of the model using alias"""
return json.dumps(self.to_dict())
@classmethod
def from_json(cls, json_str: str) -> Artifact:
"""Create an instance of Artifact from a JSON string"""
return cls.from_dict(json.loads(json_str))
def to_dict(self):
"""Returns the dictionary representation of the model using alias"""
_dict = self.dict(by_alias=True, exclude={}, exclude_none=True)
return _dict
@classmethod
def from_dict(cls, obj: dict) -> Artifact:
"""Create an instance of Artifact from a dict"""
if obj is None:
return None
if not isinstance(obj, dict):
return Artifact.parse_obj(obj)
_obj = Artifact.parse_obj(
{
"artifact_id": obj.get("artifact_id"),
"file_name": obj.get("file_name"),
"relative_path": obj.get("relative_path"),
"created_at": obj.get("created_at"),
"modified_at": obj.get("modified_at"),
"agent_created": obj.get("agent_created"),
}
)
return _obj
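For reference, the serialization helpers on this model are symmetric: `from_dict()` feeds the payload to pydantic's `parse_obj` (unknown keys are ignored by default) and `to_dict()` drops fields that are `None`. A small round-trip sketch using the `Artifact` class defined directly above, with made-up values:

```python
# Round-trip sketch for the Artifact model defined above.
artifact = Artifact.from_dict(
    {
        "artifact_id": "a-1",
        "file_name": "hello.txt",
        "created_at": "2023-12-01T00:00:00Z",
        "agent_created": True,
    }
)
# relative_path was not set, so exclude_none drops it from the output.
assert "relative_path" not in artifact.to_dict()
assert artifact.to_dict()["file_name"] == "hello.txt"
```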


@@ -1,77 +0,0 @@
# coding: utf-8
"""
Agent Communication Protocol
Specification of the API protocol for communication with an agent. # noqa: E501
The version of the OpenAPI document: v0.2
Generated by OpenAPI Generator (https://openapi-generator.tech)
Do not edit the class manually.
"""
from __future__ import annotations
import json
import pprint
import re # noqa: F401
from pydantic import BaseModel
from agbenchmark.agent_protocol_client.models.artifact import Artifact
from agbenchmark.agent_protocol_client.models.pagination import Pagination
class Artifacts(BaseModel):
"""
Artifacts that the task has produced.
"""
artifacts: list[Artifact]
pagination: Pagination
class Config:
"""Pydantic configuration"""
allow_population_by_field_name = True
validate_assignment = True
def to_str(self) -> str:
"""Returns the string representation of the model using alias"""
return pprint.pformat(self.dict(by_alias=True))
def to_json(self) -> str:
"""Returns the JSON representation of the model using alias"""
return json.dumps(self.to_dict())
@classmethod
def from_json(cls, json_str: str) -> Artifacts:
"""Create an instance of Artifacts from a JSON string"""
return cls.from_dict(json.loads(json_str))
def to_dict(self):
"""Returns the dictionary representation of the model using alias"""
_dict = self.dict(by_alias=True, exclude={}, exclude_none=True)
return _dict
@classmethod
def from_dict(cls, obj: dict) -> Artifacts:
"""Create an instance of Artifacts from a dict"""
if obj is None:
return None
if not isinstance(obj, dict):
return Artifacts.parse_obj(obj)
_obj = Artifacts.parse_obj(
{
"artifacts": obj.get("artifacts"),
"pagination": obj.get("pagination"),
}
)
return _obj
Artifacts.update_forward_refs()


@@ -1,75 +0,0 @@
# coding: utf-8
"""
Agent Communication Protocol
Specification of the API protocol for communication with an agent. # noqa: E501
The version of the OpenAPI document: v0.2
Generated by OpenAPI Generator (https://openapi-generator.tech)
Do not edit the class manually.
"""
from __future__ import annotations
import json
import pprint
import re # noqa: F401
from pydantic import BaseModel
class Pagination(BaseModel):
"""
Pagination that the task has produced.
"""
total_items: int
total_pages: int
current_page: int
page_size: int
class Config:
"""Pydantic configuration"""
allow_population_by_field_name = True
validate_assignment = True
def to_str(self) -> str:
"""Returns the string representation of the model using alias"""
return pprint.pformat(self.dict(by_alias=True))
def to_json(self) -> str:
"""Returns the JSON representation of the model using alias"""
return json.dumps(self.to_dict())
@classmethod
def from_json(cls, json_str: str) -> Pagination:
"""Create an instance of Pagination from a JSON string"""
return cls.from_dict(json.loads(json_str))
def to_dict(self):
"""Returns the dictionary representation of the model using alias"""
_dict = self.dict(by_alias=True, exclude={}, exclude_none=True)
return _dict
@classmethod
def from_dict(cls, obj: dict) -> Pagination:
"""Create an instance of Pagination from a dict"""
if obj is None:
return None
if not isinstance(obj, dict):
return Pagination.parse_obj(obj)
_obj = Pagination.parse_obj(
{
"total_items": obj.get("total_items"),
"total_pages": obj.get("total_pages"),
"current_page": obj.get("current_page"),
"page_size": obj.get("page_size"),
}
)
return _obj


@@ -1,146 +0,0 @@
# coding: utf-8
"""
Agent Communication Protocol
Specification of the API protocol for communication with an agent. # noqa: E501
The version of the OpenAPI document: v0.2
Generated by OpenAPI Generator (https://openapi-generator.tech)
Do not edit the class manually.
"""
from __future__ import annotations
import json
import pprint
import re # noqa: F401
from typing import Any, Optional
from pydantic import BaseModel, Field, StrictBool, StrictStr, conlist, validator
from agbenchmark.agent_protocol_client.models.artifact import Artifact
class Step(BaseModel):
"""
Step
"""
input: Optional[StrictStr] = Field(None, description="Input prompt for the step.")
additional_input: Optional[Any] = Field(
None, description="Input parameters for the task step. Any value is allowed."
)
task_id: StrictStr = Field(
..., description="The ID of the task this step belongs to."
)
step_id: StrictStr = Field(..., description="The ID of the task step.")
name: Optional[StrictStr] = Field(None, description="The name of the task step.")
status: StrictStr = Field(..., description="The status of the task step.")
output: Optional[StrictStr] = Field(None, description="Output of the task step.")
additional_output: Optional[Any] = Field(
None,
description="Output that the task step has produced. Any value is allowed.",
)
artifacts: conlist(Artifact) = Field(
..., description="A list of artifacts that the step has produced."
)
is_last: Optional[StrictBool] = Field(
False, description="Whether this is the last step in the task."
)
__properties = [
"input",
"additional_input",
"task_id",
"step_id",
"name",
"status",
"output",
"additional_output",
"artifacts",
"is_last",
]
@validator("status")
def status_validate_enum(cls, value):
"""Validates the enum"""
if value not in ("created", "completed"):
raise ValueError("must be one of enum values ('created', 'completed')")
return value
class Config:
"""Pydantic configuration"""
allow_population_by_field_name = True
validate_assignment = True
def to_str(self) -> str:
"""Returns the string representation of the model using alias"""
return pprint.pformat(self.dict(by_alias=True))
def to_json(self) -> str:
"""Returns the JSON representation of the model using alias"""
return json.dumps(self.to_dict())
@classmethod
def from_json(cls, json_str: str) -> Step:
"""Create an instance of Step from a JSON string"""
return cls.from_dict(json.loads(json_str))
def to_dict(self):
"""Returns the dictionary representation of the model using alias"""
_dict = self.dict(by_alias=True, exclude={}, exclude_none=True)
# override the default output from pydantic by calling `to_dict()` of each item in artifacts (list)
_items = []
if self.artifacts:
for _item in self.artifacts:
if _item:
_items.append(_item.to_dict())
_dict["artifacts"] = _items
# set to None if additional_input (nullable) is None
# and __fields_set__ contains the field
if self.additional_input is None and "additional_input" in self.__fields_set__:
_dict["additional_input"] = None
# set to None if additional_output (nullable) is None
# and __fields_set__ contains the field
if (
self.additional_output is None
and "additional_output" in self.__fields_set__
):
_dict["additional_output"] = None
return _dict
@classmethod
def from_dict(cls, obj: dict) -> Step:
"""Create an instance of Step from a dict"""
if obj is None:
return None
if not isinstance(obj, dict):
return Step.parse_obj(obj)
_obj = Step.parse_obj(
{
"input": obj.get("input"),
"additional_input": obj.get("additional_input"),
"task_id": obj.get("task_id"),
"step_id": obj.get("step_id"),
"name": obj.get("name"),
"status": obj.get("status"),
"output": obj.get("output"),
"additional_output": obj.get("additional_output"),
"artifacts": [
Artifact.from_dict(_item) for _item in obj.get("artifacts")
]
if obj.get("artifacts") is not None
else None,
"is_last": obj.get("is_last")
if obj.get("is_last") is not None
else False,
}
)
return _obj
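The `status` validator is the only field-level constraint on `Step`: any value other than `created` or `completed` is rejected, both on parsing and on assignment (`validate_assignment = True`). A short sketch using the class defined above:

```python
from pydantic import ValidationError

# Sketch: the status validator on the Step class above only accepts
# 'created' or 'completed'.
try:
    Step.parse_obj(
        {"task_id": "t-1", "step_id": "s-1", "status": "running", "artifacts": []}
    )
except ValidationError as err:
    print(err)  # "must be one of enum values ('created', 'completed')"
```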


@@ -1,133 +0,0 @@
# coding: utf-8
"""
Agent Communication Protocol
Specification of the API protocol for communication with an agent. # noqa: E501
The version of the OpenAPI document: v0.2
Generated by OpenAPI Generator (https://openapi-generator.tech)
Do not edit the class manually.
"""
from __future__ import annotations
import json
import pprint
import re # noqa: F401
from typing import Any, Optional
from pydantic import BaseModel, Field, StrictBool, StrictStr, conlist, validator
from agbenchmark.agent_protocol_client.models.artifact import Artifact
class StepAllOf(BaseModel):
"""
StepAllOf
"""
task_id: StrictStr = Field(
..., description="The ID of the task this step belongs to."
)
step_id: StrictStr = Field(..., description="The ID of the task step.")
name: Optional[StrictStr] = Field(None, description="The name of the task step.")
status: StrictStr = Field(..., description="The status of the task step.")
output: Optional[StrictStr] = Field(None, description="Output of the task step.")
additional_output: Optional[Any] = Field(
None,
description="Output that the task step has produced. Any value is allowed.",
)
artifacts: conlist(Artifact) = Field(
..., description="A list of artifacts that the step has produced."
)
is_last: Optional[StrictBool] = Field(
False, description="Whether this is the last step in the task."
)
__properties = [
"task_id",
"step_id",
"name",
"status",
"output",
"additional_output",
"artifacts",
"is_last",
]
@validator("status")
def status_validate_enum(cls, value):
"""Validates the enum"""
if value not in ("created", "completed"):
raise ValueError("must be one of enum values ('created', 'completed')")
return value
class Config:
"""Pydantic configuration"""
allow_population_by_field_name = True
validate_assignment = True
def to_str(self) -> str:
"""Returns the string representation of the model using alias"""
return pprint.pformat(self.dict(by_alias=True))
def to_json(self) -> str:
"""Returns the JSON representation of the model using alias"""
return json.dumps(self.to_dict())
@classmethod
def from_json(cls, json_str: str) -> StepAllOf:
"""Create an instance of StepAllOf from a JSON string"""
return cls.from_dict(json.loads(json_str))
def to_dict(self):
"""Returns the dictionary representation of the model using alias"""
_dict = self.dict(by_alias=True, exclude={}, exclude_none=True)
# override the default output from pydantic by calling `to_dict()` of each item in artifacts (list)
_items = []
if self.artifacts:
for _item in self.artifacts:
if _item:
_items.append(_item.to_dict())
_dict["artifacts"] = _items
# set to None if additional_output (nullable) is None
# and __fields_set__ contains the field
if (
self.additional_output is None
and "additional_output" in self.__fields_set__
):
_dict["additional_output"] = None
return _dict
@classmethod
def from_dict(cls, obj: dict) -> StepAllOf:
"""Create an instance of StepAllOf from a dict"""
if obj is None:
return None
if not isinstance(obj, dict):
return StepAllOf.parse_obj(obj)
_obj = StepAllOf.parse_obj(
{
"task_id": obj.get("task_id"),
"step_id": obj.get("step_id"),
"name": obj.get("name"),
"status": obj.get("status"),
"output": obj.get("output"),
"additional_output": obj.get("additional_output"),
"artifacts": [
Artifact.from_dict(_item) for _item in obj.get("artifacts")
]
if obj.get("artifacts") is not None
else None,
"is_last": obj.get("is_last")
if obj.get("is_last") is not None
else False,
}
)
return _obj


@@ -1,77 +0,0 @@
# coding: utf-8
"""
Agent Communication Protocol
Specification of the API protocol for communication with an agent. # noqa: E501
The version of the OpenAPI document: v0.2
Generated by OpenAPI Generator (https://openapi-generator.tech)
Do not edit the class manually.
"""
from __future__ import annotations
import json
import pprint
import re # noqa: F401
from typing import Any, Optional
from pydantic import BaseModel, Field, StrictStr
class StepRequestBody(BaseModel):
"""
Body of the task request.
"""
input: Optional[StrictStr] = Field(None, description="Input prompt for the step.")
additional_input: Optional[Any] = Field(
None, description="Input parameters for the task step. Any value is allowed."
)
__properties = ["input", "additional_input"]
class Config:
"""Pydantic configuration"""
allow_population_by_field_name = True
validate_assignment = True
def to_str(self) -> str:
"""Returns the string representation of the model using alias"""
return pprint.pformat(self.dict(by_alias=True))
def to_json(self) -> str:
"""Returns the JSON representation of the model using alias"""
return json.dumps(self.to_dict())
@classmethod
def from_json(cls, json_str: str) -> StepRequestBody:
"""Create an instance of StepRequestBody from a JSON string"""
return cls.from_dict(json.loads(json_str))
def to_dict(self):
"""Returns the dictionary representation of the model using alias"""
_dict = self.dict(by_alias=True, exclude={}, exclude_none=True)
# set to None if additional_input (nullable) is None
# and __fields_set__ contains the field
if self.additional_input is None and "additional_input" in self.__fields_set__:
_dict["additional_input"] = None
return _dict
@classmethod
def from_dict(cls, obj: dict) -> StepRequestBody:
"""Create an instance of StepRequestBody from a dict"""
if obj is None:
return None
if not isinstance(obj, dict):
return StepRequestBody.parse_obj(obj)
_obj = StepRequestBody.parse_obj(
{"input": obj.get("input"), "additional_input": obj.get("additional_input")}
)
return _obj


@@ -1,89 +0,0 @@
# coding: utf-8
"""
Agent Communication Protocol
Specification of the API protocol for communication with an agent. # noqa: E501
The version of the OpenAPI document: v1
Generated by OpenAPI Generator (https://openapi-generator.tech)
Do not edit the class manually.
"""
from __future__ import annotations
import json
import pprint
import re # noqa: F401
from typing import Any, Optional
from pydantic import BaseModel, Field, StrictBool, conlist
class StepResult(BaseModel):
"""
Result of the task step.
"""
output: Optional[Any] = Field(
None,
description="Output that the task step has produced. Any value is allowed.",
)
artifacts: conlist(Any) = Field(
..., description="A list of artifacts that the step has produced."
)
is_last: Optional[StrictBool] = Field(
False, description="Whether this is the last step in the task."
)
__properties = ["output", "artifacts", "is_last"]
class Config:
"""Pydantic configuration"""
allow_population_by_field_name = True
validate_assignment = True
def to_str(self) -> str:
"""Returns the string representation of the model using alias"""
return pprint.pformat(self.dict(by_alias=True))
def to_json(self) -> str:
"""Returns the JSON representation of the model using alias"""
return json.dumps(self.to_dict())
@classmethod
def from_json(cls, json_str: str) -> StepResult:
"""Create an instance of StepResult from a JSON string"""
return cls.from_dict(json.loads(json_str))
def to_dict(self):
"""Returns the dictionary representation of the model using alias"""
_dict = self.dict(by_alias=True, exclude={}, exclude_none=True)
# set to None if output (nullable) is None
# and __fields_set__ contains the field
if self.output is None and "output" in self.__fields_set__:
_dict["output"] = None
return _dict
@classmethod
def from_dict(cls, obj: dict) -> StepResult:
"""Create an instance of StepResult from a dict"""
if obj is None:
return None
if not isinstance(obj, dict):
return StepResult.parse_obj(obj)
_obj = StepResult.parse_obj(
{
"output": obj.get("output"),
"artifacts": obj.get("artifacts"),
"is_last": obj.get("is_last")
if obj.get("is_last") is not None
else False,
}
)
return _obj


@@ -1,99 +0,0 @@
# coding: utf-8
"""
Agent Communication Protocol
Specification of the API protocol for communication with an agent. # noqa: E501
The version of the OpenAPI document: v0.2
Generated by OpenAPI Generator (https://openapi-generator.tech)
Do not edit the class manually.
"""
from __future__ import annotations
import json
import pprint
import re # noqa: F401
from typing import Any, Optional
from pydantic import BaseModel, Field, StrictStr, conlist
from agbenchmark.agent_protocol_client.models.artifact import Artifact
class Task(BaseModel):
"""
Task
"""
input: Optional[StrictStr] = Field(None, description="Input prompt for the task.")
additional_input: Optional[Any] = Field(
None, description="Input parameters for the task. Any value is allowed."
)
task_id: StrictStr = Field(..., description="The ID of the task.")
artifacts: conlist(Artifact) = Field(
..., description="A list of artifacts that the task has produced."
)
__properties = ["input", "additional_input", "task_id", "artifacts"]
class Config:
"""Pydantic configuration"""
allow_population_by_field_name = True
validate_assignment = True
def to_str(self) -> str:
"""Returns the string representation of the model using alias"""
return pprint.pformat(self.dict(by_alias=True))
def to_json(self) -> str:
"""Returns the JSON representation of the model using alias"""
return json.dumps(self.to_dict())
@classmethod
def from_json(cls, json_str: str) -> Task:
"""Create an instance of Task from a JSON string"""
return cls.from_dict(json.loads(json_str))
def to_dict(self):
"""Returns the dictionary representation of the model using alias"""
_dict = self.dict(by_alias=True, exclude={}, exclude_none=True)
# override the default output from pydantic by calling `to_dict()` of each item in artifacts (list)
_items = []
if self.artifacts:
for _item in self.artifacts:
if _item:
_items.append(_item.to_dict())
_dict["artifacts"] = _items
# set to None if additional_input (nullable) is None
# and __fields_set__ contains the field
if self.additional_input is None and "additional_input" in self.__fields_set__:
_dict["additional_input"] = None
return _dict
@classmethod
def from_dict(cls, obj: dict) -> Task:
"""Create an instance of Task from a dict"""
if obj is None:
return None
if not isinstance(obj, dict):
return Task.parse_obj(obj)
_obj = Task.parse_obj(
{
"input": obj.get("input"),
"additional_input": obj.get("additional_input"),
"task_id": obj.get("task_id"),
"artifacts": [
Artifact.from_dict(_item) for _item in obj.get("artifacts")
]
if obj.get("artifacts") is not None
else None,
}
)
return _obj


@@ -1,87 +0,0 @@
# coding: utf-8
"""
Agent Communication Protocol
Specification of the API protocol for communication with an agent. # noqa: E501
The version of the OpenAPI document: v0.2
Generated by OpenAPI Generator (https://openapi-generator.tech)
Do not edit the class manually.
"""
from __future__ import annotations
import json
import pprint
import re # noqa: F401
from pydantic import BaseModel, Field, StrictStr, conlist
from agbenchmark.agent_protocol_client.models.artifact import Artifact
class TaskAllOf(BaseModel):
"""
Definition of an agent task.
"""
task_id: StrictStr = Field(..., description="The ID of the task.")
artifacts: conlist(Artifact) = Field(
..., description="A list of artifacts that the task has produced."
)
__properties = ["task_id", "artifacts"]
class Config:
"""Pydantic configuration"""
allow_population_by_field_name = True
validate_assignment = True
def to_str(self) -> str:
"""Returns the string representation of the model using alias"""
return pprint.pformat(self.dict(by_alias=True))
def to_json(self) -> str:
"""Returns the JSON representation of the model using alias"""
return json.dumps(self.to_dict())
@classmethod
def from_json(cls, json_str: str) -> TaskAllOf:
"""Create an instance of TaskAllOf from a JSON string"""
return cls.from_dict(json.loads(json_str))
def to_dict(self):
"""Returns the dictionary representation of the model using alias"""
_dict = self.dict(by_alias=True, exclude={}, exclude_none=True)
# override the default output from pydantic by calling `to_dict()` of each item in artifacts (list)
_items = []
if self.artifacts:
for _item in self.artifacts:
if _item:
_items.append(_item.to_dict())
_dict["artifacts"] = _items
return _dict
@classmethod
def from_dict(cls, obj: dict) -> TaskAllOf:
"""Create an instance of TaskAllOf from a dict"""
if obj is None:
return None
if not isinstance(obj, dict):
return TaskAllOf.parse_obj(obj)
_obj = TaskAllOf.parse_obj(
{
"task_id": obj.get("task_id"),
"artifacts": [
Artifact.from_dict(_item) for _item in obj.get("artifacts")
]
if obj.get("artifacts") is not None
else None,
}
)
return _obj


@@ -1,77 +0,0 @@
# coding: utf-8
"""
Agent Communication Protocol
Specification of the API protocol for communication with an agent. # noqa: E501
The version of the OpenAPI document: v0.2
Generated by OpenAPI Generator (https://openapi-generator.tech)
Do not edit the class manually.
"""
from __future__ import annotations
import json
import pprint
import re # noqa: F401
from typing import Any, Optional
from pydantic import BaseModel, Field, StrictStr
class TaskRequestBody(BaseModel):
"""
Body of the task request.
"""
input: Optional[StrictStr] = Field(None, description="Input prompt for the task.")
additional_input: Optional[Any] = Field(
None, description="Input parameters for the task. Any value is allowed."
)
__properties = ["input", "additional_input"]
class Config:
"""Pydantic configuration"""
allow_population_by_field_name = True
validate_assignment = True
def to_str(self) -> str:
"""Returns the string representation of the model using alias"""
return pprint.pformat(self.dict(by_alias=True))
def to_json(self) -> str:
"""Returns the JSON representation of the model using alias"""
return json.dumps(self.to_dict())
@classmethod
def from_json(cls, json_str: str) -> TaskRequestBody:
"""Create an instance of TaskRequestBody from a JSON string"""
return cls.from_dict(json.loads(json_str))
def to_dict(self):
"""Returns the dictionary representation of the model using alias"""
_dict = self.dict(by_alias=True, exclude={}, exclude_none=True)
# set to None if additional_input (nullable) is None
# and __fields_set__ contains the field
if self.additional_input is None and "additional_input" in self.__fields_set__:
_dict["additional_input"] = None
return _dict
@classmethod
def from_dict(cls, obj: dict) -> TaskRequestBody:
"""Create an instance of TaskRequestBody from a dict"""
if obj is None:
return None
if not isinstance(obj, dict):
return TaskRequestBody.parse_obj(obj)
_obj = TaskRequestBody.parse_obj(
{"input": obj.get("input"), "additional_input": obj.get("additional_input")}
)
return _obj


@@ -1,311 +0,0 @@
# coding: utf-8
"""
Agent Communication Protocol
Specification of the API protocol for communication with an agent. # noqa: E501
The version of the OpenAPI document: v0.2
Generated by OpenAPI Generator (https://openapi-generator.tech)
Do not edit the class manually.
"""
import io
import json
import logging
import re
import ssl
from urllib.parse import urlencode
import aiohttp
from agbenchmark.agent_protocol_client.exceptions import ApiException, ApiValueError
logger = logging.getLogger(__name__)
class RESTResponse(io.IOBase):
def __init__(self, resp, data):
self.aiohttp_response = resp
self.status = resp.status
self.reason = resp.reason
self.data = data
def getheaders(self):
"""Returns a CIMultiDictProxy of the response headers."""
return self.aiohttp_response.headers
def getheader(self, name, default=None):
"""Returns a given response header."""
return self.aiohttp_response.headers.get(name, default)
class RESTClientObject(object):
def __init__(self, configuration, pools_size=4, maxsize=None):
# maxsize is number of requests to host that are allowed in parallel
if maxsize is None:
maxsize = configuration.connection_pool_maxsize
ssl_context = ssl.create_default_context(cafile=configuration.ssl_ca_cert)
if configuration.cert_file:
ssl_context.load_cert_chain(
configuration.cert_file, keyfile=configuration.key_file
)
if not configuration.verify_ssl:
ssl_context.check_hostname = False
ssl_context.verify_mode = ssl.CERT_NONE
connector = aiohttp.TCPConnector(limit=maxsize, ssl=ssl_context)
self.proxy = configuration.proxy
self.proxy_headers = configuration.proxy_headers
# https pool manager
self.pool_manager = aiohttp.ClientSession(connector=connector, trust_env=True)
async def close(self):
await self.pool_manager.close()
async def request(
self,
method,
url,
query_params=None,
headers=None,
body=None,
post_params=None,
_preload_content=True,
_request_timeout=None,
):
"""Execute request
:param method: http request method
:param url: http request url
:param query_params: query parameters in the url
:param headers: http request headers
:param body: request json body, for `application/json`
:param post_params: request post parameters,
`application/x-www-form-urlencoded`
and `multipart/form-data`
:param _preload_content: this is a non-applicable field for
the AiohttpClient.
:param _request_timeout: timeout setting for this request. If one
number provided, it will be total request
timeout. It can also be a pair (tuple) of
(connection, read) timeouts.
"""
method = method.upper()
assert method in ["GET", "HEAD", "DELETE", "POST", "PUT", "PATCH", "OPTIONS"]
if post_params and body:
raise ApiValueError(
"body parameter cannot be used with post_params parameter."
)
post_params = post_params or {}
headers = headers or {}
# url already contains the URL query string
# so reset query_params to empty dict
query_params = {}
timeout = _request_timeout or 5 * 60
if "Content-Type" not in headers:
headers["Content-Type"] = "application/json"
args = {"method": method, "url": url, "timeout": timeout, "headers": headers}
if self.proxy:
args["proxy"] = self.proxy
if self.proxy_headers:
args["proxy_headers"] = self.proxy_headers
if query_params:
args["url"] += "?" + urlencode(query_params)
# For `POST`, `PUT`, `PATCH`, `OPTIONS`, `DELETE`
if method in ["POST", "PUT", "PATCH", "OPTIONS", "DELETE"]:
if re.search("json", headers["Content-Type"], re.IGNORECASE):
if body is not None:
body = json.dumps(body)
args["data"] = body
elif (
headers["Content-Type"] == "application/x-www-form-urlencoded"
): # noqa: E501
args["data"] = aiohttp.FormData(post_params)
elif headers["Content-Type"] == "multipart/form-data":
# must delete headers['Content-Type'] so that the correct multipart
# Content-Type (with boundary) generated by aiohttp is used instead
del headers["Content-Type"]
data = aiohttp.FormData()
for param in post_params:
k, v = param
if isinstance(v, tuple) and len(v) == 3:
data.add_field(k, value=v[1], filename=v[0], content_type=v[2])
else:
data.add_field(k, v)
args["data"] = data
# Pass a `bytes` parameter directly in the body to support
# other content types than Json when `body` argument is provided
# in serialized form
elif isinstance(body, bytes):
args["data"] = body
else:
# Cannot generate the request from given parameters
msg = """Cannot prepare a request message for provided
arguments. Please check that your arguments match
declared content type."""
raise ApiException(status=0, reason=msg)
r = await self.pool_manager.request(**args)
if _preload_content:
data = await r.read()
r = RESTResponse(r, data)
# log response body
logger.debug("response body: %s", r.data)
if not 200 <= r.status <= 299:
raise ApiException(http_resp=r)
return r
async def get_request(
self,
url,
headers=None,
query_params=None,
_preload_content=True,
_request_timeout=None,
):
return await self.request(
"GET",
url,
headers=headers,
_preload_content=_preload_content,
_request_timeout=_request_timeout,
query_params=query_params,
)
async def head_request(
self,
url,
headers=None,
query_params=None,
_preload_content=True,
_request_timeout=None,
):
return await self.request(
"HEAD",
url,
headers=headers,
_preload_content=_preload_content,
_request_timeout=_request_timeout,
query_params=query_params,
)
async def options_request(
self,
url,
headers=None,
query_params=None,
post_params=None,
body=None,
_preload_content=True,
_request_timeout=None,
):
return await self.request(
"OPTIONS",
url,
headers=headers,
query_params=query_params,
post_params=post_params,
_preload_content=_preload_content,
_request_timeout=_request_timeout,
body=body,
)
async def delete_request(
self,
url,
headers=None,
query_params=None,
body=None,
_preload_content=True,
_request_timeout=None,
):
return await self.request(
"DELETE",
url,
headers=headers,
query_params=query_params,
_preload_content=_preload_content,
_request_timeout=_request_timeout,
body=body,
)
async def post_request(
self,
url,
headers=None,
query_params=None,
post_params=None,
body=None,
_preload_content=True,
_request_timeout=None,
):
return await self.request(
"POST",
url,
headers=headers,
query_params=query_params,
post_params=post_params,
_preload_content=_preload_content,
_request_timeout=_request_timeout,
body=body,
)
async def put_request(
self,
url,
headers=None,
query_params=None,
post_params=None,
body=None,
_preload_content=True,
_request_timeout=None,
):
return await self.request(
"PUT",
url,
headers=headers,
query_params=query_params,
post_params=post_params,
_preload_content=_preload_content,
_request_timeout=_request_timeout,
body=body,
)
async def patch_request(
self,
url,
headers=None,
query_params=None,
post_params=None,
body=None,
_preload_content=True,
_request_timeout=None,
):
return await self.request(
"PATCH",
url,
headers=headers,
query_params=query_params,
post_params=post_params,
_preload_content=_preload_content,
_request_timeout=_request_timeout,
body=body,
)
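
This vendored aiohttp REST client is likewise removed in full; callers now go through the published `agent_protocol_client` package. A minimal, hedged sketch of that usage pattern, mirroring the `ApiClient`/`AgentApi` calls made in app.py further down (the host value is illustrative):

# Hedged sketch of the replacement client usage; host is illustrative and
# normally comes from the benchmark configuration.
import asyncio

from agent_protocol_client import AgentApi, ApiClient, Configuration
from agent_protocol_client.models import TaskRequestBody

async def create_task_example() -> None:
    configuration = Configuration(host="http://localhost:8000")
    async with ApiClient(configuration) as api_client:
        api_instance = AgentApi(api_client)
        task = await api_instance.create_agent_task(
            task_request_body=TaskRequestBody(input="Say hello")
        )
        print(task.task_id)

asyncio.run(create_task_example())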

View File

@@ -1,78 +1,74 @@
import datetime
import glob
import json
import logging
import sys
import time
import uuid
from collections import defaultdict, deque
from multiprocessing import Process
from pathlib import Path
import httpx
from agbenchmark.agent_protocol_client import (
AgentApi,
ApiClient,
ApiException,
Configuration,
)
from agbenchmark.reports.processing.report_types_v2 import BenchmarkRun
from agbenchmark.schema import TaskEvalRequestBody
from agbenchmark.utils.utils import write_pretty_json
configuration = Configuration(host="http://localhost:8000" + "/ap/v1")
import json
import os
import sys
from typing import Any, Optional
import httpx
import psutil
from fastapi import APIRouter, FastAPI
from fastapi import (
HTTPException as FastAPIHTTPException, # Import HTTPException from FastAPI
)
from fastapi import Request, Response
from agent_protocol_client import AgentApi, ApiClient, ApiException, Configuration
from agent_protocol_client.models import Task, TaskRequestBody
from fastapi import APIRouter, FastAPI, HTTPException, Request, Response
from fastapi.middleware.cors import CORSMiddleware
from pydantic import BaseModel, Extra, ValidationError
from agbenchmark.execute_sub_process import execute_subprocess
from agbenchmark.schema import Task, TaskRequestBody
from agbenchmark.config import AgentBenchmarkConfig
from agbenchmark.reports.processing.report_types_v2 import (
BenchmarkRun,
Metrics,
RepositoryInfo,
RunDetails,
TaskInfo,
)
from agbenchmark.schema import TaskEvalRequestBody
from agbenchmark.utils.data_types import ChallengeData
from agbenchmark.utils.utils import write_pretty_json
sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
from fastapi import FastAPI
from pydantic import BaseModel, Extra
sys.path.append(str(Path(__file__).parent.parent))
router = APIRouter()
import glob
logger = logging.getLogger(__name__)
# Change the current working directory to the benchmark path
# home_path = find_absolute_benchmark_path()
# os.chdir(home_path)
general_command = ["poetry", "run", "agbenchmark", "start", "--backend"]
import psutil
challenges_path = os.path.join(os.path.dirname(__file__), "challenges")
json_files = deque(
CHALLENGES: dict[str, ChallengeData] = {}
challenges_path = Path(__file__).parent / "challenges"
challenge_spec_files = deque(
glob.glob(
f"{challenges_path}/**/data.json",
recursive=True,
)
)
CHALLENGES = {}
task_informations = defaultdict(dict)
logger.debug("Loading challenges...")
while challenge_spec_files:
challenge_spec_file = Path(challenge_spec_files.popleft())
challenge_relpath = challenge_spec_file.relative_to(challenges_path.parent)
if challenge_relpath.is_relative_to("challenges/deprecated"):
continue
while json_files:
json_file = json_files.popleft()
logger.debug(f"Loading {challenge_relpath}...")
try:
challenge_info = ChallengeData.parse_file(challenge_spec_file)
except ValidationError as e:
if logging.getLogger().level == logging.DEBUG:
logger.warning(f"Spec file {challenge_relpath} failed to load:\n{e}")
logger.debug(f"Invalid challenge spec: {challenge_spec_file.read_text()}")
continue
challenge_info.spec_file = challenge_spec_file
with open(json_file, "r") as file:
data = json.load(file)
if not challenge_info.eval_id:
challenge_info.eval_id = str(uuid.uuid4())
# this will sort all the keys of the JSON systematically
# so that the order is always the same
write_pretty_json(challenge_info.dict(), challenge_spec_file)
if "eval_id" not in data:
data["eval_id"] = str(uuid.uuid4())
# this will sort all the keys of the JSON systematically so that the order is always the same
write_pretty_json(data, json_file)
# ok
CHALLENGES[data["eval_id"]] = data
CHALLENGES[data["eval_id"]]["path"] = json_file
CHALLENGES[challenge_info.eval_id] = challenge_info
task_informations = defaultdict(dict[str, Any])
def find_agbenchmark_without_uvicorn():
@@ -93,10 +89,10 @@ def find_agbenchmark_without_uvicorn():
):
try:
# Convert the process.info dictionary values to strings and concatenate them
full_info = " ".join([str(v) for k, v in process.info.items()])
full_info = " ".join([str(v) for k, v in process.as_dict().items()])
if "agbenchmark" in full_info and "uvicorn" not in full_info:
pids.append(process.info["pid"])
pids.append(process.pid)
except (psutil.NoSuchProcess, psutil.AccessDenied, psutil.ZombieProcess):
pass
return pids
@@ -114,24 +110,12 @@ class CreateReportRequest(BaseModel):
updates_list = []
updates_list = []
import json
origins = [
"http://localhost:8000",
"http://localhost:8080",
"http://127.0.0.1:5000",
"http://localhost:5000",
]
app = FastAPI()
app.add_middleware(
CORSMiddleware,
allow_origins=origins,
allow_credentials=True,
allow_methods=["*"],
allow_headers=["*"],
)
def stream_output(pipe):
@@ -139,275 +123,210 @@ def stream_output(pipe):
print(line, end="")
@router.post("/reports")
def run_single_test(body: CreateReportRequest) -> Any:
pids = find_agbenchmark_without_uvicorn()
print(f"pids already running with agbenchmark: {pids}")
print(body.dict())
# it's a hack because other parts of the code are using sys.argv
print(os.getcwd())
command_options = ["agbenchmark"]
# if body.category:
# sys.argv.append(f"--category={body.category}")
command_options.append(f"--test={body.test}")
if body.mock:
command_options.append("--mock")
execute_subprocess(command_options, 200)
import json
from pathlib import Path
print("finished running")
# List all folders in the current working directory
path_reports = Path.cwd() / "agbenchmark_config" / "reports"
folders = [folder for folder in path_reports.iterdir() if folder.is_dir()]
# Sort the folders based on their names
sorted_folders = sorted(folders, key=lambda x: x.name)
# Get the last folder
last_folder = sorted_folders[-1] if sorted_folders else None
# Read report.json from this folder
if last_folder:
report_path = last_folder / "report.json"
print(report_path)
if report_path.exists():
with report_path.open() as file:
data = json.load(file)
print(data)
else:
print(f"'report.json' does not exist in '{last_folder}'")
else:
print("No folders found.")
return Response(
content=json.dumps(data),
status_code=200,
media_type="application/json",
def setup_fastapi_app(agbenchmark_config: AgentBenchmarkConfig) -> FastAPI:
from agbenchmark.agent_api_interface import (
copy_agent_artifacts_into_folder,
upload_artifacts,
)
import json
from typing import Any
from fastapi import FastAPI, Request, Response
@router.get("/updates")
def get_updates(request: Request) -> Any:
from agbenchmark.__main__ import UPDATES_JSON_PATH
try:
# Read data from the "update.json" file (provide the correct file path)
with open(UPDATES_JSON_PATH, "r") as file:
data = json.load(file)
# Get the last_update_time from the query parameter
query_param = request.query_params.get("last_update_time")
if query_param is None:
# Handle the case when last_update_time is not provided
print("ERROR: last_update_time parameter is missing")
return Response(
content=json.dumps({"error": "last_update_time parameter is missing"}),
status_code=400,
media_type="application/json",
headers={"Content-Type": "application/json"},
)
# Convert query_param to a Unix timestamp (assuming it's in seconds as a string)
query_timestamp = int(query_param)
# Filter the data based on the timestamp (keep timestamps before query_timestamp)
filtered_data = [item for item in data if item["timestamp"] > query_timestamp]
# Extract only the "content" field from each item
filtered_data = [item["content"] for item in filtered_data]
# Convert the filtered data to JSON
filtered_json = json.dumps(filtered_data, indent=2)
print("INFO: Returning filtered data to the client")
return Response(
content=filtered_json,
status_code=200,
media_type="application/json",
headers={"Content-Type": "application/json"},
)
except FileNotFoundError:
print("ERROR: File not found: updates.json")
return Response(
content=json.dumps({"error": "File not found"}),
status_code=404,
media_type="application/json",
headers={"Content-Type": "application/json"},
)
@router.post("/agent/tasks", tags=["agent"], response_model=Task)
async def create_agent_task(task_eval_request: TaskEvalRequestBody) -> Task:
"""
Creates a new task using the provided TaskRequestBody and returns a Task.
Args:
request (Request): FastAPI request object.
task (TaskRequestBody): The task request containing input and additional input data.
Returns:
Task: A new task with task_id, input, additional_input, and empty lists for artifacts and steps.
Example:
Request (TaskRequestBody defined in schema.py):
{
"input": "Write the words you receive to the file 'output.txt'.",
"additional_input": "python/code"
}
Response (Task defined in schema.py):
{
"task_id": "50da533e-3904-4401-8a07-c49adf88b5eb",
"input": "Write the word 'Washington' to a .txt file",
"additional_input": "python/code",
"artifacts": [],
}
"""
from agbenchmark.agent_api_interface import upload_artifacts
try:
async with ApiClient(configuration) as api_client:
api_instance = AgentApi(api_client)
task_input = CHALLENGES[task_eval_request.eval_id]["task"]
task_request_body = TaskRequestBody(input=task_input)
task_response = await api_instance.create_agent_task(
task_request_body=task_request_body
)
task_informations[task_response.task_id][
"benchmark_start_time"
] = datetime.datetime.now(datetime.timezone.utc).strftime(
"%Y-%m-%dT%H:%M:%S+00:00"
)
task_informations[task_response.task_id][
"eval_id"
] = task_eval_request.eval_id
await upload_artifacts(
api_instance,
str(Path(CHALLENGES[task_eval_request.eval_id]["path"]).parent),
task_response.task_id,
"artifacts_in",
)
return Response(
content=task_response.json(),
status_code=200,
media_type="application/json",
)
except ApiException as e:
print(f"Error whilst trying to create a task: {task_eval_request}")
return Response(
content=json.dumps({"error": "Internal server error"}),
status_code=500,
media_type="application/json",
)
@router.post("/agent/tasks/{task_id}/steps")
async def proxy(request: Request, task_id: str):
timeout = httpx.Timeout(300.0, read=300.0) # 5 minutes
async with httpx.AsyncClient(timeout=timeout) as client:
# Construct the new URL
new_url = f"http://localhost:8000/ap/v1/agent/tasks/{task_id}/steps"
# Forward the request
response = await client.post(
new_url,
data=await request.body(),
headers=dict(request.headers),
)
# Return the response from the forwarded request
return Response(content=response.content, status_code=response.status_code)
@router.post("/agent/tasks/{task_id}/evaluations")
async def create_evaluation(task_id: str) -> deque:
from agbenchmark.__main__ import TEMP_FOLDER_ABS_PATH
from agbenchmark.agent_api_interface import copy_agent_artifacts_into_temp_folder
from agbenchmark.agent_interface import copy_artifacts_into_temp_folder
from agbenchmark.generate_test import create_challenge
from agbenchmark.generate_test import create_challenge_from_spec_file
from agbenchmark.main import run_benchmark
try:
async with ApiClient(configuration) as api_client:
api_instance = AgentApi(api_client)
await copy_agent_artifacts_into_temp_folder(api_instance, task_id)
# add custom python
data = CHALLENGES[task_informations[task_id]["eval_id"]]
configuration = Configuration(
host=agbenchmark_config.host or "http://localhost:8000"
)
app = FastAPI()
app.add_middleware(
CORSMiddleware,
allow_origins=origins,
allow_credentials=True,
allow_methods=["*"],
allow_headers=["*"],
)
router = APIRouter()
artifact_path = str(Path(data["path"]).parent)
copy_artifacts_into_temp_folder(
TEMP_FOLDER_ABS_PATH, "custom_python", artifact_path
@router.post("/reports")
def run_single_test(body: CreateReportRequest) -> dict:
pids = find_agbenchmark_without_uvicorn()
logger.info(f"pids already running with agbenchmark: {pids}")
logger.debug(f"Request to /reports: {body.dict()}")
# Start the benchmark in a separate thread
benchmark_process = Process(
target=lambda: run_benchmark(
config=agbenchmark_config,
tests=(body.test,),
mock=body.mock or False,
)
)
json_file = CHALLENGES[task_informations[task_id]["eval_id"]]["path"]
json_files = deque()
benchmark_process.start()
_, challenge_class = create_challenge(data, json_file, json_files)
challenge_instance = challenge_class()
scores = challenge_instance.get_scores(config={})
test_name = "Test" + data["name"]
is_score_100 = 1 in scores["values"]
# Wait for the benchmark to finish, with a timeout of 200 seconds
timeout = 200
start_time = time.time()
while benchmark_process.is_alive():
if time.time() - start_time > timeout:
logger.warning(f"Benchmark run timed out after {timeout} seconds")
benchmark_process.terminate()
break
time.sleep(1)
else:
logger.debug(f"Benchmark finished running in {time.time() - start_time} s")
info_details = {
"repository_info": {
"repo_url": None,
"team_name": None,
"benchmark_git_commit_sha": None,
"agent_git_commit_sha": None,
},
"run_details": {
"run_id": None,
"command": "agbenchmark" + " --test=" + test_name,
"completion_time": None,
"benchmark_start_time": task_informations[task_id][
# List all folders in the current working directory
path_reports = agbenchmark_config.reports_folder
folders = [folder for folder in path_reports.iterdir() if folder.is_dir()]
# Sort the folders based on their names
sorted_folders = sorted(folders, key=lambda x: x.name)
# Get the last folder
latest_folder = sorted_folders[-1] if sorted_folders else None
# Read report.json from this folder
if latest_folder:
report_path = latest_folder / "report.json"
logger.debug(f"Getting latest report from {report_path}")
if report_path.exists():
with report_path.open() as file:
data = json.load(file)
logger.debug(f"Report data: {data}")
else:
logger.error(
"Could not get result after running benchmark: "
f"'report.json' does not exist in '{latest_folder}'"
)
else:
logger.error(
"Could not get result after running benchmark: no reports found"
)
return data
@router.post("/agent/tasks", tags=["agent"])
async def create_agent_task(task_eval_request: TaskEvalRequestBody) -> Task:
"""
Creates a new task using the provided TaskEvalRequestBody and returns a Task.
Args:
task_eval_request: `TaskRequestBody` including an eval_id.
Returns:
Task: A new task with task_id, input, additional_input,
and empty lists for artifacts and steps.
Example:
Request (TaskEvalRequestBody defined in schema.py):
{
...,
"eval_id": "50da533e-3904-4401-8a07-c49adf88b5eb"
}
Response (Task defined in `agent_protocol_client.models`):
{
"task_id": "50da533e-3904-4401-8a07-c49adf88b5eb",
"input": "Write the word 'Washington' to a .txt file",
"artifacts": []
}
"""
try:
async with ApiClient(configuration) as api_client:
api_instance = AgentApi(api_client)
task_input = CHALLENGES[task_eval_request.eval_id].task
task_request_body = TaskRequestBody(input=task_input)
task_response = await api_instance.create_agent_task(
task_request_body=task_request_body
)
task_informations[task_response.task_id][
"benchmark_start_time"
],
"test_name": data["name"],
},
"task_info": {
"data_path": data["path"].split("benchmark/", 1)[-1],
"is_regression": None,
"category": data["category"],
"task": data["task"],
"answer": data["ground"]["answer"],
"description": data["info"]["description"],
},
"metrics": {
"difficulty": None,
"success": is_score_100,
"attempted": True,
"success_percentage": None,
"cost": None,
"run_time": None,
},
"reached_cutoff": None,
"config": {},
}
] = datetime.datetime.now(datetime.timezone.utc).strftime(
"%Y-%m-%dT%H:%M:%S+00:00"
)
task_informations[task_response.task_id][
"eval_id"
] = task_eval_request.eval_id
await upload_artifacts(
api_instance,
str(CHALLENGES[task_eval_request.eval_id].spec_file.parent),
task_response.task_id,
"artifacts_in",
)
return task_response
except ApiException as e:
logger.error(f"Error whilst trying to create a task:\n{e}")
logger.error(
"The above error was caused while processing request: "
f"{task_eval_request}"
)
raise HTTPException(500)
BenchmarkRun.parse_obj(info_details)
@router.post("/agent/tasks/{task_id}/steps")
async def proxy(request: Request, task_id: str):
timeout = httpx.Timeout(300.0, read=300.0) # 5 minutes
async with httpx.AsyncClient(timeout=timeout) as client:
# Construct the new URL
new_url = f"{configuration.host}/ap/v1/agent/tasks/{task_id}/steps"
print(json.dumps(info_details, indent=4))
return Response(
content=json.dumps(info_details),
status_code=200,
media_type="application/json",
)
except ApiException as e:
print(f"Error whilst trying to evaluate the task: {task_id}")
return Response(
content=json.dumps({"error": "Internal server error"}),
status_code=500,
media_type="application/json",
)
# path = Path(json_file).resolve()
# Forward the request
response = await client.post(
new_url,
data=await request.body(),
headers=dict(request.headers),
)
# Return the response from the forwarded request
return Response(content=response.content, status_code=response.status_code)
app.include_router(router, prefix="/ap/v1")
@router.post("/agent/tasks/{task_id}/evaluations")
async def create_evaluation(task_id: str) -> BenchmarkRun:
challenge_info = CHALLENGES[task_informations[task_id]["eval_id"]]
workspace = agbenchmark_config.temp_folder
try:
async with ApiClient(configuration) as api_client:
api_instance = AgentApi(api_client)
await copy_agent_artifacts_into_folder(api_instance, task_id, workspace)
artifact_path = challenge_info.spec_file.parent
copy_artifacts_into_temp_folder(workspace, "custom_python", artifact_path)
challenge = create_challenge_from_spec_file(challenge_info.spec_file)
scores = challenge.get_scores(workspace)
is_score_100 = 1 in scores["values"]
eval_info = BenchmarkRun(
repository_info=RepositoryInfo(),
run_details=RunDetails(
command=f"agbenchmark --test={challenge_info.name}",
benchmark_start_time=(
task_informations[task_id]["benchmark_start_time"]
),
test_name=challenge_info.name,
),
task_info=TaskInfo(
data_path=str(
challenge_info.spec_file.relative_to(challenges_path.parent)
),
is_regression=None,
category=[c.value for c in challenge_info.category],
task=challenge_info.task,
answer=challenge_info.ground.answer,
description=challenge_info.info.description,
),
metrics=Metrics(
success=is_score_100,
attempted=True,
),
config={},
)
logger.debug(f"Returning evaluation data:\n{eval_info.json(indent=4)}")
return eval_info
except ApiException as e:
logger.error(f"Error {e} whilst trying to evaluate task: {task_id}")
raise HTTPException(500)
app.include_router(router, prefix="/ap/v1")
return app
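
The module now exposes `setup_fastapi_app` instead of a module-level `app`; how the returned application is served is not shown in this hunk. A purely hypothetical sketch, assuming the module path `agbenchmark.app` and a standard uvicorn entry point (port chosen arbitrarily):

# Hypothetical usage sketch, not part of this diff. Module path and port are
# assumptions; the config is loaded from the nearest agbenchmark_config folder.
import uvicorn

from agbenchmark.app import setup_fastapi_app
from agbenchmark.config import AgentBenchmarkConfig

if __name__ == "__main__":
    config = AgentBenchmarkConfig.load()
    app = setup_fastapi_app(config)
    uvicorn.run(app, host="localhost", port=8080)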

View File

@@ -0,0 +1,32 @@
import glob
import json
import logging
from pathlib import Path
logger = logging.getLogger(__name__)
def get_unique_categories() -> set[str]:
"""
Find all data.json files in this file's directory and its subdirectories, read
the "category" field from each file, and return a set of unique categories.
"""
categories = set()
challenges_dir = Path(__file__).parent
glob_path = f"{challenges_dir}/**/data.json"
for data_file in glob.glob(glob_path, recursive=True):
with open(data_file, "r") as f:
try:
challenge_data = json.load(f)
categories.update(challenge_data.get("category", []))
except json.JSONDecodeError:
logger.error(f"Error: {data_file} is not a valid JSON file.")
continue
except IOError:
logger.error(f"IOError: file could not be read: {data_file}")
continue
return categories
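
A short usage sketch for the helper above; the import path is an assumption based on the challenges directory this file globs over:

# Hedged example: print every category declared across the bundled challenges.
# The module path agbenchmark.challenges is an assumption.
from agbenchmark.challenges import get_unique_categories

for category in sorted(get_unique_categories()):
    print(category)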

View File

@@ -16,21 +16,21 @@
".txt"
],
"should_contain": [
"15",
"112",
"117",
"204",
"413",
"2,0",
"3,198",
"4,046",
"7,000",
"11,759",
"21,461",
"24,578",
"31,536",
"53,823",
"81,462"
"15",
"112",
"117",
"204",
"413",
"2,0",
"3,198",
"4,046",
"7,000",
"11,759",
"21,461",
"24,578",
"31,536",
"53,823",
"81,462"
],
"should_not_contain": []
},

View File

@@ -0,0 +1,119 @@
import json
import sys
from datetime import datetime
from pathlib import Path
from typing import Optional
from pydantic import BaseSettings
def _calculate_info_test_path(base_path: Path, benchmark_start_time: datetime) -> Path:
"""
Calculates the path to the directory where the test report will be saved.
"""
# Ensure the reports path exists
base_path.mkdir(parents=True, exist_ok=True)
# Get current UTC date-time stamp
date_stamp = benchmark_start_time.strftime("%Y%m%dT%H%M%S")
# Default run name
run_name = "full_run"
# Map command-line arguments to their respective labels
arg_labels = {
"--test": None,
"--category": None,
"--maintain": "maintain",
"--improve": "improve",
"--explore": "explore",
}
# Identify the relevant command-line argument
for arg, label in arg_labels.items():
if arg in sys.argv:
test_arg = sys.argv[sys.argv.index(arg) + 1] if label is None else None
run_name = arg.strip("--")
if test_arg:
run_name = f"{run_name}_{test_arg}"
break
# Create the full new directory path with ISO standard UTC date-time stamp
report_path = base_path / f"{date_stamp}_{run_name}"
# Ensure the new directory is created
# FIXME: this is not a desirable side-effect of loading the config
report_path.mkdir(exist_ok=True)
return report_path
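
To illustrate the naming scheme above with made-up values: a run started at 2024-01-01 12:00 UTC and invoked with `--test TestExample` on the command line would get a report directory named as computed below.

# Illustration only; the date and test name are invented.
from datetime import datetime

start = datetime(2024, 1, 1, 12, 0)
print(start.strftime("%Y%m%dT%H%M%S") + "_test_TestExample")
# -> 20240101T120000_test_TestExample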
class AgentBenchmarkConfig(BaseSettings, extra="allow"):
"""
Configuration model and loader for the AGBenchmark.
Projects that want to use AGBenchmark should contain an agbenchmark_config folder
with a config.json file that - at minimum - specifies the `host` at which the
subject application exposes an Agent Protocol compliant API.
"""
agbenchmark_config_dir: Path
"""Path to the agbenchmark_config folder of the subject agent application."""
categories: list[str] | None = None
"""Categories to benchmark the agent for. If omitted, all categories are assumed."""
host: str
"""Host (scheme://address:port) of the subject agent application."""
@classmethod
def load(cls, config_dir: Optional[Path] = None) -> "AgentBenchmarkConfig":
config_dir = config_dir or cls.find_config_folder()
with (config_dir / "config.json").open("r") as f:
return cls(
agbenchmark_config_dir=config_dir,
**json.load(f),
)
@staticmethod
def find_config_folder(for_dir: Path = Path.cwd()) -> Path:
"""
Find the closest ancestor folder containing an agbenchmark_config folder,
and return the path of that agbenchmark_config folder.
"""
current_directory = for_dir
while current_directory != Path("/"):
if (path := current_directory / "agbenchmark_config").exists():
if (path / "config.json").is_file():
return path
current_directory = current_directory.parent
raise FileNotFoundError(
"No 'agbenchmark_config' directory found in the path hierarchy."
)
@property
def config_file(self) -> Path:
return self.agbenchmark_config_dir / "config.json"
@property
def reports_folder(self) -> Path:
return self.agbenchmark_config_dir / "reports"
def get_report_dir(self, benchmark_start_time: datetime) -> Path:
return _calculate_info_test_path(self.reports_folder, benchmark_start_time)
@property
def regression_tests_file(self) -> Path:
return self.reports_folder / "regression_tests.json"
@property
def success_rate_file(self) -> Path:
return self.reports_folder / "success_rate.json"
@property
def challenges_already_beaten_file(self) -> Path:
return self.agbenchmark_config_dir / "challenges_already_beaten.json"
@property
def temp_folder(self) -> Path:
return self.agbenchmark_config_dir / "temp_folder"

View File

@@ -1,167 +1,127 @@
import contextlib
import json
import logging
import os
import shutil
import sys
import threading
import time
from pathlib import Path # noqa
from pathlib import Path
from typing import Any, Generator
import pytest
from agbenchmark.__main__ import TEMP_FOLDER_ABS_PATH
from agbenchmark.config import AgentBenchmarkConfig
from agbenchmark.reports.reports import (
finalize_reports,
generate_single_call_report,
session_finish,
)
from agbenchmark.utils.data_types import AgentBenchmarkConfig
from agbenchmark.utils.challenge import Challenge
from agbenchmark.utils.data_types import Category
GLOBAL_TIMEOUT = (
1500 # The tests will stop after 25 minutes so we can send the reports.
)
agbenchmark_config = AgentBenchmarkConfig.load()
logger = logging.getLogger(__name__)
pytest_plugins = ["agbenchmark.utils.dependencies"]
collect_ignore = ["challenges"]
suite_reports: dict[str, list] = {}
def load_config_from_request(request: Any) -> AgentBenchmarkConfig:
"""
This function loads the configuration for the agent benchmark from a given request.
Args:
request (Any): The request object from which the agent benchmark configuration is to be loaded.
Returns:
AgentBenchmarkConfig: The loaded agent benchmark configuration.
Raises:
json.JSONDecodeError: If the benchmark configuration file is not a valid JSON file.
"""
agent_benchmark_config_path = Path.cwd() / "agbenchmark_config" / "config.json"
try:
with open(agent_benchmark_config_path, "r") as f:
agent_benchmark_config = AgentBenchmarkConfig(**json.load(f))
agent_benchmark_config.agent_benchmark_config_path = (
agent_benchmark_config_path
)
return agent_benchmark_config
except json.JSONDecodeError:
print("Error: benchmark_config.json is not a valid JSON file.")
raise
@pytest.fixture(scope="module")
def config(request: Any) -> Any:
"""
This pytest fixture is responsible for loading the agent benchmark configuration from a given request.
This fixture is scoped to the module level, meaning it's invoked once per test module.
Args:
request (Any): The request object from which the agent benchmark configuration is to be loaded.
Returns:
Any: The loaded configuration dictionary.
Raises:
json.JSONDecodeError: If the benchmark configuration file is not a valid JSON file.
"""
config = {}
agent_benchmark_config_path = Path.cwd() / "agbenchmark_config" / "config.json"
try:
with open(agent_benchmark_config_path, "r") as f:
agent_benchmark_config = AgentBenchmarkConfig(**json.load(f))
agent_benchmark_config.agent_benchmark_config_path = (
agent_benchmark_config_path
)
except json.JSONDecodeError:
print("Error: benchmark_config.json is not a valid JSON file.")
raise
config["AgentBenchmarkConfig"] = agent_benchmark_config
return config
def config() -> AgentBenchmarkConfig:
return agbenchmark_config
@pytest.fixture(autouse=True)
def temp_folder() -> Generator[str, None, None]:
def temp_folder() -> Generator[Path, None, None]:
"""
This pytest fixture is responsible for setting up and tearing down the temporary folder for each test.
Pytest fixture that sets up and tears down the temporary folder for each test.
It is automatically used in every test due to the 'autouse=True' parameter.
It is used in order to let agbenchmark store files so they can then be evaluated.
"""
# create output directory if it doesn't exist
if not os.path.exists(TEMP_FOLDER_ABS_PATH):
os.makedirs(TEMP_FOLDER_ABS_PATH, exist_ok=True)
if not os.path.exists(agbenchmark_config.temp_folder):
os.makedirs(agbenchmark_config.temp_folder, exist_ok=True)
yield
yield agbenchmark_config.temp_folder
# teardown after test function completes
if not os.getenv("KEEP_TEMP_FOLDER_FILES"):
for filename in os.listdir(TEMP_FOLDER_ABS_PATH):
file_path = os.path.join(TEMP_FOLDER_ABS_PATH, filename)
for filename in os.listdir(agbenchmark_config.temp_folder):
file_path = os.path.join(agbenchmark_config.temp_folder, filename)
try:
if os.path.isfile(file_path) or os.path.islink(file_path):
os.unlink(file_path)
elif os.path.isdir(file_path):
shutil.rmtree(file_path)
except Exception as e:
print(f"Failed to delete {file_path}. Reason: {e}")
logger.warning(f"Failed to delete {file_path}. Reason: {e}")
def pytest_addoption(parser: Any) -> None:
def pytest_addoption(parser: pytest.Parser) -> None:
"""
This function is a pytest hook that is called to add command-line options.
It is used to add custom command-line options that are specific to the agent benchmark tests.
These options can be used to control the behavior of the tests.
The "--mock" option is used to run the tests in mock mode.
The "--host" option is used to specify the host for the tests.
The "--category" option is used to run only tests of a specific category.
The "--nc" option is used to run the tests without caching.
The "--cutoff" option is used to specify a cutoff time for the tests.
The "--improve" option is used to run only the tests that are marked for improvement.
The "--maintain" option is used to run only the tests that are marked for maintenance.
The "--explore" option is used to run the tests in exploration mode.
The "--test" option is used to run a specific test.
The "--no_dep" option is used to run the tests without dependencies.
The "--keep_answers" option is used to keep the answers of the tests.
Pytest hook that adds command-line options to the `pytest` command.
The added options are specific to agbenchmark and control its behavior:
* `--mock` is used to run the tests in mock mode.
* `--host` is used to specify the host for the tests.
* `--category` is used to run only tests of a specific category.
* `--nc` is used to run the tests without caching.
* `--cutoff` is used to specify a cutoff time for the tests.
* `--improve` is used to run only the tests that are marked for improvement.
* `--maintain` is used to run only the tests that are marked for maintenance.
* `--explore` is used to run the tests in exploration mode.
* `--test` is used to run a specific test.
* `--no-dep` is used to run the tests without dependencies.
* `--keep-answers` is used to keep the answers of the tests.
Args:
parser (Any): The parser object to which the command-line options are added.
parser: The Pytest CLI parser to which the command-line options are added.
"""
parser.addoption("--no_dep", action="store_true", default=False)
parser.addoption("--mock", action="store_true", default=False)
parser.addoption("--host", action="store_true", default=None)
parser.addoption("--nc", action="store_true", default=False)
parser.addoption("--cutoff", action="store_true", default=False)
parser.addoption("--category", action="store_true", default=False)
parser.addoption("--test", action="store_true", default=None)
parser.addoption("--improve", action="store_true", default=False)
parser.addoption("--maintain", action="store_true", default=False)
parser.addoption("--explore", action="store_true", default=False)
parser.addoption("--keep-answers", action="store_true", default=False)
parser.addoption("--no-dep", action="store_true")
parser.addoption("--mock", action="store_true")
parser.addoption("--host", default=None)
parser.addoption("--nc", action="store_true")
parser.addoption("--cutoff", action="store")
parser.addoption("--category", action="append")
parser.addoption("--test", action="append")
parser.addoption("--improve", action="store_true")
parser.addoption("--maintain", action="store_true")
parser.addoption("--explore", action="store_true")
parser.addoption("--keep-answers", action="store_true")
def pytest_configure(config: pytest.Config) -> None:
# Register category markers to prevent "unknown marker" warnings
for category in Category:
config.addinivalue_line("markers", f"{category.value}: {category}")
@pytest.fixture(autouse=True)
def check_regression(request: Any) -> None:
def check_regression(request: pytest.FixtureRequest) -> None:
"""
This pytest fixture is responsible for checking if a test is a regression test.
It is automatically used in every test due to the 'autouse=True' parameter.
The test name and the agent benchmark configuration are retrieved from the request object.
The regression reports are loaded from the path specified in the agent benchmark configuration.
If the "--improve" option is used and the test name exists in the regression tests, the test is skipped.
If the "--maintain" option is used and the test name does not exist in the regression tests, the test is also skipped.
Fixture that checks, for each test, whether it should be treated as a regression
test and whether it should be skipped on that basis.
The test name is retrieved from the `request` object. Regression reports are loaded
from the path specified in the benchmark configuration.
Effect:
* If the `--improve` option is used and the current test is considered a regression
test, it is skipped.
* If the `--maintain` option is used and the current test is not considered a
regression test, it is also skipped.
Args:
request (Any): The request object from which the test name and the agent benchmark configuration are retrieved.
request: The request object from which the test name and the benchmark
configuration are retrieved.
"""
test_name = request.node.parent.name
agent_benchmark_config = load_config_from_request(request)
with contextlib.suppress(Exception):
test = agent_benchmark_config.get_regression_reports_path()
data = json.loads(test)
with contextlib.suppress(FileNotFoundError):
regression_report = agbenchmark_config.regression_tests_file
data = json.loads(regression_report.read_bytes())
challenge_location = getattr(request.node.parent.cls, "CHALLENGE_LOCATION", "")
skip_string = f"Skipping {test_name} at {challenge_location}"
@@ -173,55 +133,33 @@ def check_regression(request: Any) -> None:
pytest.skip(f"{skip_string} because it's not a regression test")
# this is to get the challenge_data from every test
@pytest.fixture(autouse=True)
def challenge_data(request: Any) -> None:
"""
This pytest fixture is responsible for providing the challenge data for each test.
It is automatically used in every test due to the 'autouse=True' parameter.
The challenge data is retrieved from the request object's parameters.
This fixture is essential for the pytest system as it provides the necessary data for each test.
Args:
request (Any): The request object from which the challenge data is retrieved.
Returns:
None: The challenge data is directly passed to the test function and does not need to be returned.
"""
return request.param
@pytest.fixture(autouse=True, scope="session")
def mock(request: Any) -> None:
def mock(request: pytest.FixtureRequest) -> bool:
"""
This pytest fixture is responsible for retrieving the value of the "--mock" command-line option.
It is automatically used in every test session due to the 'autouse=True' parameter and 'session' scope.
The "--mock" option is used to run the tests in mock mode.
This fixture is essential for the pytest system as it provides the necessary command-line option value for each test session.
Pytest fixture that retrieves the value of the `--mock` command-line option.
The `--mock` option is used to run the tests in mock mode.
Args:
request (Any): The request object from which the "--mock" option value is retrieved.
request: The `pytest.FixtureRequest` from which the `--mock` option value
is retrieved.
Returns:
None: The "--mock" option value is directly passed to the test session and does not need to be returned.
bool: Whether `--mock` is set for this session.
"""
return request.config.getoption("--mock")
@pytest.fixture(autouse=True, scope="function")
def timer(request: Any) -> Any:
def timer(request: pytest.FixtureRequest) -> Generator[None, None, None]:
"""
This pytest fixture is responsible for timing the execution of each test.
It is automatically used in every test due to the 'autouse=True' parameter and 'function' scope.
Pytest fixture that times the execution of each test.
At the start of each test, it records the current time.
After the test function completes, it calculates the run time and appends it to the test node's user properties.
This allows the run time of each test to be accessed later for reporting or analysis.
After the test function completes, it calculates the run time and adds it to
the test node's `user_properties`.
Args:
request (Any): The request object from which the test node is retrieved.
Yields:
None: Control is yielded back to the test function.
request: The `pytest.FixtureRequest` object through which the run time is stored
in the test node's `user_properties`.
"""
start_time = time.time()
yield
@@ -229,33 +167,21 @@ def timer(request: Any) -> Any:
request.node.user_properties.append(("run_time", run_time))
def pytest_runtest_makereport(item: Any, call: Any) -> None:
def pytest_runtest_makereport(item: pytest.Item, call: pytest.CallInfo) -> None:
"""
This function is a pytest hook that is called when a test report is being generated.
Pytest hook that is called when a test report is being generated.
It is used to generate and finalize reports for each test.
Args:
item (Any): The test item for which the report is being generated.
call (Any): The call object from which the test result is retrieved.
item: The test item for which the report is being generated.
call: The call object from which the test result is retrieved.
"""
challenge_data = item.funcargs.get("challenge_data", None)
if not challenge_data:
# this will only happen for dummy dependency setup tests
return
challenge_location: str = getattr(item.cls, "CHALLENGE_LOCATION", "")
flags = (
"--test" in sys.argv
or "--maintain" in sys.argv
or "--improve" in sys.argv
or "--explore" in sys.argv
)
challenge: type[Challenge] = item.cls # type: ignore
challenge_data = challenge.data
challenge_location = challenge.CHALLENGE_LOCATION
if call.when == "call":
answers = getattr(item, "answers", None)
challenge_location: str = getattr(item.cls, "CHALLENGE_LOCATION", "")
test_name = item.nodeid.split("::")[1]
item.test_name = test_name
@@ -264,14 +190,14 @@ def pytest_runtest_makereport(item: Any, call: Any) -> None:
)
if call.when == "teardown":
finalize_reports(item, challenge_data)
finalize_reports(agbenchmark_config, item, challenge_data)
def timeout_monitor(start_time: int) -> None:
"""
This function is responsible for monitoring the total execution time of the test suite.
It runs in a separate thread and checks every second if the total execution time has exceeded the global timeout.
If the global timeout is exceeded, it terminates the pytest session with a specific return code.
Function that limits the total execution time of the test suite.
This function is supposed to be run in a separate thread and calls `pytest.exit`
if the total execution time has exceeded the global timeout.
Args:
start_time (int): The start time of the test suite.
@@ -282,14 +208,11 @@ def timeout_monitor(start_time: int) -> None:
pytest.exit("Test suite exceeded the global timeout", returncode=1)
def pytest_sessionstart(session: Any) -> None:
def pytest_sessionstart(session: pytest.Session) -> None:
"""
This function is a pytest hook that is called at the start of the test session.
It starts the timeout monitor in a separate thread.
The timeout monitor checks if the total execution time of the test suite has exceeded the global timeout.
Pytest hook that is called at the start of a test session.
Args:
session (Any): The pytest session object.
Sets up and runs a `timeout_monitor` in a separate thread.
"""
start_time = time.time()
t = threading.Thread(target=timeout_monitor, args=(start_time,))
@@ -297,94 +220,125 @@ def pytest_sessionstart(session: Any) -> None:
t.start()
def pytest_sessionfinish(session: Any) -> None:
def pytest_sessionfinish(session: pytest.Session) -> None:
"""
This function is a pytest hook that is called at the end of the test session.
It is used to finalize and save the test reports.
The reports are saved in a specific location defined in the suite reports.
Pytest hook that is called at the end of a test session.
Args:
session (Any): The pytest session object.
Finalizes and saves the test reports.
"""
session_finish(suite_reports)
session_finish(agbenchmark_config, suite_reports)
@pytest.fixture
def scores(request: Any) -> None:
def scores(request: pytest.FixtureRequest) -> None:
"""
This pytest fixture is responsible for retrieving the scores of the test class.
The scores are retrieved from the test class's 'scores' attribute using the test class name.
This fixture is essential for the pytest system as it provides the necessary scores for each test.
Pytest fixture that retrieves the scores of the test class.
The scores are retrieved from the `Challenge.scores` attribute
using the test class name.
Args:
request (Any): The request object from which the test class is retrieved.
Returns:
None: The scores are directly passed to the test function and do not need to be returned.
request: The request object.
"""
test_class_name = request.node.cls.__name__
return request.node.cls.scores.get(test_class_name)
challenge: type[Challenge] = request.node.cls
return challenge.scores.get(challenge.__name__)
# this is adding the dependency marker and category markers automatically from the json
def pytest_collection_modifyitems(items: Any, config: Any) -> None:
def pytest_collection_modifyitems(
items: list[pytest.Item], config: pytest.Config
) -> None:
"""
This function is a pytest hook that is called after the test collection has been performed.
It is used to modify the collected test items based on the agent benchmark configuration.
The function loads the agent benchmark configuration from the specified path and retrieves the regression reports.
For each test item, it checks if the test method exists and retrieves the dependencies and categories from the test class instance.
If the "--improve" or "--category" options are used, the dependencies are filtered based on the regression data.
If the "--test", "--no_dep", or "--maintain" options are used, the dependencies are cleared.
The function then dynamically adds the 'depends' and 'category' markers to the test item.
This function is essential for the pytest system as it provides the necessary modification of the test items based on the agent benchmark configuration.
Pytest hook that is called after initial test collection has been performed.
Modifies the collected test items based on the agent benchmark configuration,
adding the dependency marker and category markers.
Args:
items (Any): The collected test items to be modified.
config (Any): The pytest configuration object from which the agent benchmark configuration path is retrieved.
items: The collected test items to be modified.
config: The active pytest configuration.
"""
agent_benchmark_config_path = str(Path.cwd() / "agbenchmark_config" / "config.json")
try:
with open(agent_benchmark_config_path) as f:
agent_benchmark_config = AgentBenchmarkConfig(**json.load(f))
except json.JSONDecodeError:
print("Error: benchmark_config.json is not a valid JSON file.")
raise
regression_file = agent_benchmark_config.get_regression_reports_path()
data = (
json.loads(open(regression_file, "r").read())
if os.path.exists(regression_file)
else {}
regression_file = agbenchmark_config.regression_tests_file
regression_tests: dict[str, Any] = (
json.loads(regression_file.read_bytes()) if regression_file.is_file() else {}
)
for item in items:
# Assuming item.cls is your test class
test_class_instance = item.cls()
try:
challenges_beaten_in_the_past = json.loads(
agbenchmark_config.challenges_already_beaten_file.read_bytes()
)
except FileNotFoundError:
challenges_beaten_in_the_past = {}
if "test_method" not in item.name:
selected_tests: tuple[str] = config.getoption("--test") # type: ignore
selected_categories: tuple[str] = config.getoption("--category") # type: ignore
# Can't use a for-loop to remove items in-place
i = 0
while i < len(items):
item = items[i]
challenge = item.cls
challenge_name = item.cls.__name__
if not issubclass(challenge, Challenge):
item.warn(
pytest.PytestCollectionWarning(
f"Non-challenge item collected: {challenge}"
)
)
i += 1
continue
# Then you can access your properties
name = item.parent.cls.__name__
# dependencies = test_class_instance.data.dependencies
# --test: remove the test from the set if it's not specifically selected
if selected_tests and challenge.data.name not in selected_tests:
items.remove(item)
continue
# Filter dependencies if they exist in regression data if its an improvement test
# if config.getoption("--improve") or config.getoption(
# "--category"
# ):
# dependencies = [dep for dep in dependencies if not data.get(dep, None)]
# if (
# config.getoption("--test")
# or config.getoption("--no_dep")
# or config.getoption("--maintain")
# ):
dependencies = test_class_instance.dependencies
# Filter challenges for --maintain, --improve, and --explore:
# --maintain -> only challenges expected to be passed (= regression tests)
# --improve -> only challenges that so far are not passed (reliably)
# --explore -> only challenges that have never been passed
is_regression_test = regression_tests.get(challenge.data.name, None)
has_been_passed = challenges_beaten_in_the_past.get(challenge.data.name, False)
if (
(config.getoption("--maintain") and not is_regression_test)
or (config.getoption("--improve") and is_regression_test)
or (config.getoption("--explore") and has_been_passed)
):
items.remove(item)
continue
# Add depends marker dynamically
item.add_marker(pytest.mark.depends(on=dependencies, name=name))
dependencies = challenge.data.dependencies
if (
config.getoption("--test")
or config.getoption("--no-dep")
or config.getoption("--maintain")
):
# Ignore dependencies:
# --test -> user selected specific tests to run, don't care about deps
# --no-dep -> ignore dependency relations regardless of test selection
# --maintain -> all "regression" tests must pass, so run all of them
dependencies = []
elif config.getoption("--improve"):
# Filter dependencies, keep only deps that are not "regression" tests
dependencies = [
d for d in dependencies if not regression_tests.get(d, None)
]
categories = test_class_instance.data.category
# Set category markers
challenge_categories = [c.value for c in challenge.data.category]
for category in challenge_categories:
item.add_marker(category)
# Add category marker dynamically
for category in categories:
item.add_marker(getattr(pytest.mark, category))
# Enforce category selection
if selected_categories:
if not set(challenge_categories).intersection(set(selected_categories)):
items.remove(item)
continue
# # Filter dependencies, keep only deps from selected categories
# dependencies = [
# d for d in dependencies
# if not set(d.categories).intersection(set(selected_categories))
# ]
# Add marker for the DependencyManager
item.add_marker(pytest.mark.depends(on=dependencies, name=challenge_name))
i += 1
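
For completeness, a hypothetical test module using the fixtures defined above (`config` returns the shared `AgentBenchmarkConfig`, `temp_folder` yields the per-test workspace path). This is illustrative only and has not been verified against the full fixture set:

# Hypothetical test relying on the conftest fixtures above; not part of the diff.
from pathlib import Path

from agbenchmark.config import AgentBenchmarkConfig

def test_workspace_is_writable(config: AgentBenchmarkConfig, temp_folder: Path) -> None:
    output_file = temp_folder / "output.txt"
    output_file.write_text("Washington")
    assert output_file.read_text() == "Washington"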

View File

@@ -1,79 +0,0 @@
import platform
import queue
import select
import subprocess
import time
from threading import Thread
from typing import Any
import psutil
def run_linux_env(process: Any, start_time: float, timeout: float) -> None:
while True:
try:
# This checks if there's data to be read from stdout without blocking.
if process.stdout and select.select([process.stdout], [], [], 0)[0]:
output = process.stdout.readline()
print(output.strip())
except Exception as e:
continue
# Check if process has ended, has no more output, or exceeded timeout
if process.poll() is not None or (time.time() - start_time > timeout):
break
if time.time() - start_time > timeout:
print("The Python function has exceeded the time limit and was terminated.")
parent = psutil.Process(process.pid)
for child in parent.children(recursive=True):
child.kill()
parent.kill()
else:
print("The Python function has finished running.")
def enqueue_output(out: Any, my_queue: Any) -> None:
for line in iter(out.readline, b""):
my_queue.put(line)
out.close()
def run_windows_env(process: Any, start_time: float, timeout: float) -> None:
my_queue: Any = queue.Queue()
thread = Thread(target=enqueue_output, args=(process.stdout, my_queue))
thread.daemon = True
thread.start()
while True:
try:
output = my_queue.get_nowait().strip()
print(output)
except queue.Empty:
pass
if process.poll() is not None or (time.time() - start_time > timeout):
break
if time.time() - start_time > timeout:
print("The Python function has exceeded the time limit and was terminated.")
process.terminate()
def execute_subprocess(command, timeout):
process = subprocess.Popen(
command,
stdout=subprocess.PIPE,
stderr=subprocess.STDOUT,
universal_newlines=True,
bufsize=1,
)
start_time = time.time()
if platform.system() == "Windows":
run_windows_env(process, start_time, timeout)
else:
run_linux_env(process, start_time, timeout)
process.wait()
if process.returncode != 0:
print(f"The agent timed out")

View File

@@ -1,147 +1,34 @@
import glob
import importlib
import json
import logging
import os
import sys
import types
from collections import deque
from pathlib import Path
from typing import Any, Dict, Optional, Union
import pytest
from agbenchmark.__main__ import CHALLENGES_ALREADY_BEATEN
from agbenchmark.agent_api_interface import append_updates_file
from agbenchmark.agent_protocol_client.models.step import Step
from agbenchmark.utils.challenge import Challenge
from agbenchmark.utils.data_types import AgentBenchmarkConfig, ChallengeData
from agbenchmark.utils.data_types import ChallengeData
DATA_CATEGORY = {}
def create_single_test(
data: Dict[str, Any] | ChallengeData,
challenge_location: str,
file_datum: Optional[list[dict[str, Any]]] = None,
) -> None:
challenge_data = None
artifacts_location = None
if isinstance(data, ChallengeData):
challenge_data = data
data = data.get_data()
DATA_CATEGORY[data["name"]] = data["category"][0]
# Define test class dynamically
challenge_class = types.new_class(f"Test{data['name']}", (Challenge,))
print(challenge_location)
# clean_challenge_location = get_test_path(challenge_location)
setattr(challenge_class, "CHALLENGE_LOCATION", challenge_location)
setattr(
challenge_class,
"ARTIFACTS_LOCATION",
artifacts_location or str(Path(challenge_location).resolve().parent),
)
# Define test method within the dynamically created class
@pytest.mark.asyncio
async def test_method(self, config: Dict[str, Any], request) -> None: # type: ignore
# create a random number between 0 and 1
test_name = self.data.name
try:
with open(CHALLENGES_ALREADY_BEATEN, "r") as f:
challenges_beaten_in_the_past = json.load(f)
except:
challenges_beaten_in_the_past = {}
if request.config.getoption("--explore") and challenges_beaten_in_the_past.get(
test_name, False
):
return None
# skip optional categories
self.skip_optional_categories(config)
from helicone.lock import HeliconeLockManager
if os.environ.get("HELICONE_API_KEY"):
HeliconeLockManager.write_custom_property("challenge", self.data.name)
cutoff = self.data.cutoff or 60
timeout = cutoff
if "--nc" in sys.argv:
timeout = 100000
if "--cutoff" in sys.argv:
timeout = int(sys.argv[sys.argv.index("--cutoff") + 1])
await self.setup_challenge(config, timeout)
scores = self.get_scores(config)
request.node.answers = (
scores["answers"] if "--keep-answers" in sys.argv else None
)
del scores["answers"] # remove answers from scores
request.node.scores = scores # store scores in request.node
is_score_100 = 1 in scores["values"]
evaluation = "Correct!" if is_score_100 else "Incorrect."
eval_step = Step(
input=evaluation,
additional_input=None,
task_id="irrelevant, this step is a hack",
step_id="irrelevant, this step is a hack",
name="",
status="created",
output=None,
additional_output=None,
artifacts=[],
is_last=True,
)
await append_updates_file(eval_step)
assert is_score_100
# Parametrize the method here
test_method = pytest.mark.parametrize(
"challenge_data",
[data],
indirect=True,
)(test_method)
setattr(challenge_class, "test_method", test_method)
# Attach the new class to a module so it can be discovered by pytest
module = importlib.import_module(__name__)
setattr(module, f"Test{data['name']}", challenge_class)
return challenge_class
logger = logging.getLogger(__name__)
def create_single_suite_challenge(challenge_data: ChallengeData, path: Path) -> None:
create_single_test(challenge_data, str(path))
def create_challenge_from_spec_file(spec_file: Path) -> type[Challenge]:
challenge = Challenge.from_challenge_spec(spec_file)
DATA_CATEGORY[challenge.data.name] = challenge.data.category[0].value
return challenge
def create_challenge(
data: Dict[str, Any],
json_file: str,
json_files: deque,
) -> Union[deque, Any]:
path = Path(json_file).resolve()
print("Creating challenge for", path)
challenge_class = create_single_test(data, str(path))
print("Creation complete for", path)
return json_files, challenge_class
def create_challenge_from_spec_file_path(spec_file_path: str) -> type[Challenge]:
spec_file = Path(spec_file_path).resolve()
return create_challenge_from_spec_file(spec_file)
def generate_tests() -> None: # sourcery skip: invert-any-all
print("Generating tests...")
def load_challenges() -> None:
logger.info("Loading challenges...")
challenges_path = os.path.join(os.path.dirname(__file__), "challenges")
print(f"Looking for challenges in {challenges_path}...")
logger.debug(f"Looking for challenges in {challenges_path}...")
json_files = deque(
glob.glob(
@@ -150,74 +37,39 @@ def generate_tests() -> None: # sourcery skip: invert-any-all
)
)
print(f"Found {len(json_files)} challenges.")
print(f"Sample path: {json_files[0]}")
agent_benchmark_config_path = str(Path.cwd() / "agbenchmark_config" / "config.json")
try:
with open(agent_benchmark_config_path, "r") as f:
agent_benchmark_config = AgentBenchmarkConfig(**json.load(f))
agent_benchmark_config.agent_benchmark_config_path = (
agent_benchmark_config_path
)
except json.JSONDecodeError:
print("Error: benchmark_config.json is not a valid JSON file.")
raise
regression_reports_path = agent_benchmark_config.get_regression_reports_path()
if regression_reports_path and os.path.exists(regression_reports_path):
with open(regression_reports_path, "r") as f:
regression_tests = json.load(f)
else:
regression_tests = {}
logger.debug(f"Found {len(json_files)} challenges.")
logger.debug(f"Sample path: {json_files[0]}")
loaded, ignored = 0, 0
while json_files:
json_file = (
json_files.popleft()
) # Take and remove the first element from json_files
# Take and remove the first element from json_files
json_file = json_files.popleft()
if challenge_should_be_ignored(json_file):
ignored += 1
continue
data = ChallengeData.get_json_from_path(json_file)
challenge_info = ChallengeData.parse_file(json_file)
commands = sys.argv
# --by flag
if "--category" in commands:
categories = data.get("category", [])
commands_set = set(commands)
challenge_class = create_challenge_from_spec_file_path(json_file)
# Convert the combined list to a set
categories_set = set(categories)
logger.debug(f"Generated test for {challenge_info.name}")
_add_challenge_to_module(challenge_class)
loaded += 1
# If there's no overlap with commands
if not categories_set.intersection(commands_set):
continue
# --test flag, only run the test if it's the exact one specified
tests = []
for command in commands:
if command.startswith("--test="):
tests.append(command.split("=")[1])
if tests and data["name"] not in tests:
continue
# --maintain and --improve flag
in_regression = regression_tests.get(data["name"], None)
improve_flag = in_regression and "--improve" in commands
maintain_flag = not in_regression and "--maintain" in commands
if "--maintain" in commands and maintain_flag:
continue
elif "--improve" in commands and improve_flag:
continue
json_files, challenge_class = create_challenge(data, json_file, json_files)
print(f"Generated test for {data['name']}.")
print("Test generation complete.")
logger.info(f"Loading challenges complete: loaded {loaded}, ignored {ignored}.")
def challenge_should_be_ignored(json_file):
return "challenges/deprecated" in json_file or "challenges/library" in json_file
def challenge_should_be_ignored(json_file_path: str):
return (
"challenges/deprecated" in json_file_path
or "challenges/library" in json_file_path
)
generate_tests()
def _add_challenge_to_module(challenge: type[Challenge]):
# Attach the Challenge class to this module so it can be discovered by pytest
module = importlib.import_module(__name__)
setattr(module, f"{challenge.__name__}", challenge)
load_challenges()
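
For illustration, a minimal sketch of the dynamic test discovery pattern used by `load_challenges` and `_add_challenge_to_module` above: a class is built at import time and attached to the module so pytest's collector can find it. `DummyChallenge` and the spec path are hypothetical stand-ins, not part of the benchmark.

```python
# Sketch only: DummyChallenge and the spec path are made-up stand-ins.
import importlib
import sys


class DummyChallenge:
    """Stand-in for a dynamically generated Challenge subclass."""

    CHALLENGE_LOCATION = "challenges/example/data.json"  # hypothetical path


def _add_challenge_to_module(challenge: type) -> None:
    # Attach the class to this module so pytest can discover it by name
    module = importlib.import_module(__name__)
    setattr(module, challenge.__name__, challenge)


_add_challenge_to_module(type("TestExample", (DummyChallenge,), {}))
assert hasattr(sys.modules[__name__], "TestExample")
```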


@@ -0,0 +1,153 @@
import logging
import os
from pathlib import Path
from typing import Optional, Sequence
from dotenv import load_dotenv
from agbenchmark.challenges import get_unique_categories
from agbenchmark.config import AgentBenchmarkConfig
load_dotenv()
logger = logging.getLogger(__name__)
def run_benchmark(
config: AgentBenchmarkConfig,
maintain: bool = False,
improve: bool = False,
explore: bool = False,
tests: tuple[str] = tuple(),
categories: tuple[str] = tuple(),
skip_categories: tuple[str] = tuple(),
mock: bool = False,
no_dep: bool = False,
no_cutoff: bool = False,
cutoff: Optional[int] = None,
keep_answers: bool = False,
server: bool = False,
) -> int:
"""
Starts the benchmark. If a category flag is provided, only challenges with the
corresponding mark will be run.
"""
import pytest
from agbenchmark.reports.ReportManager import SingletonReportManager
validate_args(
maintain=maintain,
improve=improve,
explore=explore,
tests=tests,
categories=categories,
skip_categories=skip_categories,
no_cutoff=no_cutoff,
cutoff=cutoff,
)
SingletonReportManager()
for key, value in vars(config).items():
logger.debug(f"config.{key} = {repr(value)}")
pytest_args = ["-vs"]
if tests:
logger.info(f"Running specific test(s): {' '.join(tests)}")
pytest_args += [f"--test={t}" for t in tests]
else:
all_categories = get_unique_categories()
if categories or skip_categories:
categories_to_run = set(categories) or all_categories
if skip_categories:
categories_to_run = categories_to_run.difference(set(skip_categories))
assert categories_to_run, "Error: You can't skip all categories"
pytest_args += [f"--category={c}" for c in categories_to_run]
logger.info(f"Running tests of category: {categories_to_run}")
else:
logger.info("Running all categories")
if maintain:
logger.info("Running only regression tests")
elif improve:
logger.info("Running only non-regression tests")
elif explore:
logger.info("Only attempt challenges that have never been beaten")
if mock:
# TODO: unhack
os.environ[
"IS_MOCK"
] = "True" # ugly hack to make the mock work when calling from API
# Pass through flags
for flag, active in {
"--maintain": maintain,
"--improve": improve,
"--explore": explore,
"--no-dep": no_dep,
"--mock": mock,
"--nc": no_cutoff,
"--keep-answers": keep_answers,
}.items():
if active:
pytest_args.append(flag)
if cutoff:
pytest_args.append(f"--cutoff={cutoff}")
logger.debug(f"Setting cuttoff override to {cutoff} seconds.")
current_dir = Path(__file__).resolve().parent
pytest_args.append(str(current_dir / "generate_test.py"))
pytest_args.append("--cache-clear")
exit_code = pytest.main(pytest_args)
SingletonReportManager.clear_instance()
return exit_code
class InvalidInvocationError(ValueError):
pass
def validate_args(
maintain: bool,
improve: bool,
explore: bool,
tests: Sequence[str],
categories: Sequence[str],
skip_categories: Sequence[str],
no_cutoff: bool,
cutoff: Optional[int],
) -> None:
if categories:
all_categories = get_unique_categories()
invalid_categories = set(categories) - all_categories
if invalid_categories:
raise InvalidInvocationError(
"One or more invalid categories were specified: "
f"{', '.join(invalid_categories)}.\n"
f"Valid categories are: {', '.join(all_categories)}."
)
if (maintain + improve + explore) > 1:
raise InvalidInvocationError(
"You can't use --maintain, --improve or --explore at the same time. "
"Please choose one."
)
if tests and (categories or skip_categories or maintain or improve or explore):
raise InvalidInvocationError(
"If you're running a specific test make sure no other options are "
"selected. Please just pass the --test."
)
if no_cutoff and cutoff:
raise InvalidInvocationError(
"You can't use both --nc and --cutoff at the same time. "
"Please choose one."
)
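
As a rough, self-contained sketch of the mutual-exclusion check `validate_args` performs on `--maintain`, `--improve` and `--explore` (the helper name below is illustrative, not the benchmark's API):

```python
class InvalidInvocationError(ValueError):
    pass


def check_mode_flags(maintain: bool, improve: bool, explore: bool) -> None:
    # Booleans sum as integers, so selecting more than one flag raises.
    if (maintain + improve + explore) > 1:
        raise InvalidInvocationError(
            "You can't use --maintain, --improve or --explore at the same time. "
            "Please choose one."
        )


check_mode_flags(maintain=True, improve=False, explore=False)  # OK
try:
    check_mode_flags(maintain=True, improve=True, explore=False)
except InvalidInvocationError as e:
    print(f"Rejected: {e}")
```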


@@ -4,11 +4,12 @@ import os
import sys
import time
from datetime import datetime, timezone
from pathlib import Path
from agbenchmark.config import AgentBenchmarkConfig
from agbenchmark.reports.processing.graphs import save_single_radar_chart
from agbenchmark.reports.processing.process_report import get_agent_category
from agbenchmark.reports.processing.report_types import Report
from agbenchmark.utils.data_types import AgentBenchmarkConfig
from agbenchmark.utils.utils import get_highest_success_difficulty
@@ -16,32 +17,26 @@ class SingletonReportManager:
instance = None
def __new__(cls):
from agbenchmark.reports.agent_benchmark_config import (
get_agent_benchmark_config,
)
if not cls.instance:
cls.instance = super(SingletonReportManager, cls).__new__(cls)
agent_benchmark_config = get_agent_benchmark_config()
agent_benchmark_config = AgentBenchmarkConfig.load()
benchmark_start_time_dt = datetime.now(
timezone.utc
) # or any logic to fetch the datetime
# Make the Managers class attributes
cls.REGRESSION_MANAGER = ReportManager(
agent_benchmark_config.get_regression_reports_path(),
agent_benchmark_config.regression_tests_file,
benchmark_start_time_dt,
)
cls.INFO_MANAGER = ReportManager(
str(
agent_benchmark_config.get_reports_path(benchmark_start_time_dt)
/ "report.json"
),
agent_benchmark_config.get_report_dir(benchmark_start_time_dt)
/ "report.json",
benchmark_start_time_dt,
)
cls.INTERNAL_INFO_MANAGER = ReportManager(
agent_benchmark_config.get_success_rate_path(), benchmark_start_time_dt
agent_benchmark_config.success_rate_file, benchmark_start_time_dt
)
return cls.instance
@@ -57,21 +52,20 @@ class SingletonReportManager:
class ReportManager:
"""Abstracts interaction with the regression tests file"""
def __init__(self, filename: str, benchmark_start_time: str):
self.filename = filename
def __init__(self, report_file: Path, benchmark_start_time: datetime):
self.report_file = report_file
self.start_time = time.time()
self.benchmark_start_time = benchmark_start_time
self.load()
def load(self) -> None:
if not os.path.exists(self.filename):
os.makedirs(os.path.dirname(self.filename), exist_ok=True)
with open(self.filename, "w") as f:
pass
if not self.report_file.exists():
self.report_file.parent.mkdir(exist_ok=True)
self.report_file.touch()
try:
with open(self.filename, "r") as f:
with self.report_file.open("r") as f:
file_content = (
f.read().strip()
) # read the content and remove any leading/trailing whitespace
@@ -87,7 +81,7 @@ class ReportManager:
self.save()
def save(self) -> None:
with open(self.filename, "w") as f:
with self.report_file.open("w") as f:
json.dump(self.tests, f, indent=4)
def add_test(self, test_name: str, test_details: dict | list) -> None:
@@ -137,7 +131,7 @@ class ReportManager:
if len(agent_categories) > 1:
save_single_radar_chart(
agent_categories,
config.get_reports_path(self.benchmark_start_time) / "radar_chart.png",
config.get_report_dir(self.benchmark_start_time) / "radar_chart.png",
)
self.save()
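
A hedged sketch of the pathlib-based "create if missing, then read" pattern the refactored `ReportManager.load` follows; the file path is an example, and the real class also stores the parsed result on `self.tests`.

```python
import json
from pathlib import Path


def load_report(report_file: Path) -> dict:
    # Create the parent directory and an empty file on first use
    if not report_file.exists():
        report_file.parent.mkdir(parents=True, exist_ok=True)
        report_file.touch()
    content = report_file.read_text().strip()
    return json.loads(content) if content else {}


print(load_report(Path("example_reports/report.json")))  # hypothetical path
```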


@@ -1,18 +0,0 @@
import json
from pathlib import Path
from agbenchmark.utils.data_types import AgentBenchmarkConfig
def get_agent_benchmark_config() -> AgentBenchmarkConfig:
agent_benchmark_config_path = str(Path.cwd() / "agbenchmark_config" / "config.json")
try:
with open(agent_benchmark_config_path, "r") as f:
agent_benchmark_config = AgentBenchmarkConfig(**json.load(f))
agent_benchmark_config.agent_benchmark_config_path = (
agent_benchmark_config_path
)
return agent_benchmark_config
except json.JSONDecodeError:
print("Error: benchmark_config.json is not a valid JSON file.")
raise


@@ -1,4 +1,5 @@
import json
import logging
import os
from pathlib import Path
from typing import Any
@@ -9,6 +10,8 @@ from agbenchmark.reports.processing.get_files import (
from agbenchmark.reports.processing.report_types import Report, Test
from agbenchmark.utils.data_types import STRING_DIFFICULTY_MAP
logger = logging.getLogger(__name__)
def get_reports_data(report_path: str) -> dict[str, Any]:
latest_files = get_latest_report_from_agent_directories(report_path)
@@ -60,7 +63,7 @@ def all_agent_categories(reports_data: dict[str, Any]) -> dict[str, Any]:
for name, report in reports_data.items():
categories = get_agent_category(report)
if categories: # only add to all_categories if categories is not empty
print(f"Adding {name}: {categories}")
logger.debug(f"Adding {name}: {categories}")
all_categories[name] = categories
return all_categories


@@ -1,7 +1,6 @@
from typing import Dict, List
from pydantic import BaseModel, constr
datetime_format = r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\+00:00$"
from pydantic import BaseModel, constr
class BaseModelBenchmark(BaseModel):
@@ -14,32 +13,32 @@ class TaskInfo(BaseModelBenchmark):
is_regression: bool | None
answer: str
description: str
category: List[str]
category: list[str]
task: str
class RepositoryInfo(BaseModelBenchmark):
repo_url: str | None
team_name: str | None
benchmark_git_commit_sha: str | None
agent_git_commit_sha: str | None
repo_url: str | None = None
team_name: str | None = None
agent_git_commit_sha: str | None = None
benchmark_git_commit_sha: str | None = None
class Metrics(BaseModelBenchmark):
difficulty: str | None
cost: float | None = None
success: bool
success_percentage: float | None
run_time: str | None
fail_reason: str | None
attempted: bool
cost: float | None
difficulty: str | None = None
run_time: str | None = None
fail_reason: str | None = None
success_percentage: float | None = None
class RunDetails(BaseModelBenchmark):
test_name: str
run_id: str | None
run_id: str | None = None
command: str
completion_time: str | None
completion_time: str | None = None
benchmark_start_time: constr(regex=datetime_format)
@@ -48,5 +47,5 @@ class BenchmarkRun(BaseModelBenchmark):
run_details: RunDetails
task_info: TaskInfo
metrics: Metrics
reached_cutoff: bool | None
config: Dict[str, str | dict[str, str]]
reached_cutoff: bool | None = None
config: dict[str, str | dict[str, str]]
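
For example, with explicit `= None` defaults the optional fields can simply be omitted when a report entry is built; the model below is a trimmed-down illustration, not the full `RepositoryInfo` class.

```python
from pydantic import BaseModel


class RepoInfo(BaseModel):
    repo_url: str | None = None
    team_name: str | None = None


print(RepoInfo())                                      # both default to None
print(RepoInfo(repo_url="https://example.com/agent"))  # team_name stays None
```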


@@ -1,20 +1,24 @@
import json
import logging
import os
import sys
from pathlib import Path
from typing import Any, Dict
from agbenchmark.__main__ import CHALLENGES_ALREADY_BEATEN
from agbenchmark.reports.agent_benchmark_config import get_agent_benchmark_config
import pytest
from agbenchmark.config import AgentBenchmarkConfig
from agbenchmark.reports.ReportManager import SingletonReportManager
from agbenchmark.utils.data_types import DifficultyLevel
from agbenchmark.utils.data_types import ChallengeData, DifficultyLevel
from agbenchmark.utils.get_data_from_helicone import get_data_from_helicone
from agbenchmark.utils.utils import calculate_success_percentage
logger = logging.getLogger(__name__)
def get_previous_test_results(
test_name: str, info_details: dict[str, Any]
) -> list[bool]:
agent_tests: dict[str, list[bool]] = {}
mock = os.getenv("IS_MOCK") # Check if --mock is in sys.argv
prev_test_results = SingletonReportManager().INTERNAL_INFO_MANAGER.tests.get(
@@ -49,17 +53,14 @@ def update_regression_tests(
def generate_single_call_report(
item: Any,
call: Any,
challenge_data: dict[str, Any],
item: pytest.Item,
call: pytest.CallInfo,
challenge_data: ChallengeData,
answers: dict[str, Any],
challenge_location,
test_name,
challenge_location: str,
test_name: str,
) -> None:
try:
difficulty = challenge_data["info"]["difficulty"]
except KeyError:
return None
difficulty = challenge_data.info.difficulty
if isinstance(difficulty, DifficultyLevel):
difficulty = difficulty.value
@@ -77,10 +78,10 @@ def generate_single_call_report(
info_details: Any = {
"data_path": challenge_location,
"is_regression": False,
"category": challenge_data["category"],
"task": challenge_data["task"],
"answer": challenge_data["ground"]["answer"],
"description": challenge_data["info"]["description"],
"category": challenge_data.category,
"task": challenge_data.task,
"answer": challenge_data.ground.answer,
"description": challenge_data.info.description,
"metrics": {
"difficulty": difficulty,
"success": False,
@@ -91,8 +92,8 @@ def generate_single_call_report(
if answers:
info_details["answers"] = answers
if "metadata" in challenge_data:
info_details["metadata"] = challenge_data["metadata"]
if challenge_data.metadata:
info_details["metadata"] = challenge_data.metadata
mock = os.getenv("IS_MOCK") # Check if --mock is in sys.argv
if call:
@@ -116,7 +117,9 @@ def generate_single_call_report(
return info_details
def finalize_reports(item: Any, challenge_data: dict[str, Any]) -> None:
def finalize_reports(
config: AgentBenchmarkConfig, item: pytest.Item, challenge_data: ChallengeData
) -> None:
run_time = dict(item.user_properties).get("run_time")
info_details = getattr(item, "info_details", {})
@@ -126,8 +129,9 @@ def finalize_reports(item: Any, challenge_data: dict[str, Any]) -> None:
if run_time is not None:
cost = None
if "--mock" not in sys.argv and os.environ.get("HELICONE_API_KEY"):
print("Getting cost from Helicone")
logger.debug("Getting cost from Helicone")
cost = get_data_from_helicone(test_name)
logger.debug(f"Cost: {cost}")
info_details["metrics"]["cost"] = cost
@@ -142,29 +146,33 @@ def finalize_reports(item: Any, challenge_data: dict[str, Any]) -> None:
info_details["metrics"]["run_time"] = f"{str(round(run_time, 3))} seconds"
info_details["reached_cutoff"] = float(run_time) > challenge_data["cutoff"]
info_details["reached_cutoff"] = float(run_time) > challenge_data.cutoff
if "--mock" not in sys.argv:
update_challenges_already_beaten(info_details, test_name)
update_challenges_already_beaten(
config.challenges_already_beaten_file, info_details, test_name
)
if info_details.get("tests") is not None:
for nested_test_name, nested_test_info in info_details[
"tests"
].items():
update_challenges_already_beaten(
nested_test_info, nested_test_name
config.challenges_already_beaten_file,
nested_test_info,
nested_test_name,
)
SingletonReportManager().INFO_MANAGER.add_test(test_name, info_details)
def update_challenges_already_beaten(
info_details: Dict[str, Any], test_name: str
challenges_already_beaten_file: Path, info_details: Dict[str, Any], test_name: str
) -> None:
current_run_successful = info_details["metrics"]["success"]
try:
with open(CHALLENGES_ALREADY_BEATEN, "r") as f:
with open(challenges_already_beaten_file, "r") as f:
challenge_data = json.load(f)
except:
except FileNotFoundError:
challenge_data = {}
challenge_beaten_in_the_past = challenge_data.get(test_name)
@@ -172,13 +180,13 @@ def update_challenges_already_beaten(
if challenge_beaten_in_the_past is None and not current_run_successful:
challenge_data[test_name] = False
with open(CHALLENGES_ALREADY_BEATEN, "w") as f:
with open(challenges_already_beaten_file, "w") as f:
json.dump(challenge_data, f, indent=4)
def session_finish(suite_reports: dict) -> None:
agent_benchmark_config = get_agent_benchmark_config()
def session_finish(
agbenchmark_config: AgentBenchmarkConfig, suite_reports: dict
) -> None:
SingletonReportManager().INTERNAL_INFO_MANAGER.save()
SingletonReportManager().INFO_MANAGER.end_info_report(agent_benchmark_config)
SingletonReportManager().INFO_MANAGER.end_info_report(agbenchmark_config)
SingletonReportManager().REGRESSION_MANAGER.save()


@@ -1,79 +1,14 @@
# generated by fastapi-codegen:
# filename: ../../postman/schemas/openapi.yaml
# timestamp: 2023-08-25T10:36:11+00:00
from __future__ import annotations
from datetime import datetime
from enum import Enum
from typing import List, Optional
from typing import Optional
from pydantic import BaseModel, Field
class ArtifactUpload(BaseModel):
file: str = Field(..., description="File to upload.", format="binary")
relative_path: str = Field(
...,
description="Relative path of the artifact in the agent's workspace.",
example="python/code",
)
class Pagination(BaseModel):
total_items: int = Field(..., description="Total number of items.", example=42)
total_pages: int = Field(..., description="Total number of pages.", example=97)
current_page: int = Field(..., description="Current_page page number.", example=1)
page_size: int = Field(..., description="Number of items per page.", example=25)
class TaskInput(BaseModel):
pass
class Artifact(BaseModel):
created_at: datetime = Field(
...,
description="The creation datetime of the task.",
example="2023-01-01T00:00:00Z",
json_encoders={datetime: lambda v: v.isoformat()},
)
modified_at: datetime = Field(
...,
description="The modification datetime of the task.",
example="2023-01-01T00:00:00Z",
json_encoders={datetime: lambda v: v.isoformat()},
)
artifact_id: str = Field(
...,
description="ID of the artifact.",
example="b225e278-8b4c-4f99-a696-8facf19f0e56",
)
agent_created: bool = Field(
...,
description="Whether the artifact has been created by the agent.",
example=False,
)
relative_path: str = Field(
...,
description="Relative path of the artifact in the agents workspace.",
example="/my_folder/my_other_folder/",
)
file_name: str = Field(
...,
description="Filename of the artifact.",
example="main.py",
)
class StepInput(BaseModel):
pass
class StepOutput(BaseModel):
pass
class TaskRequestBody(BaseModel):
input: str = Field(
...,
@@ -86,108 +21,3 @@ class TaskRequestBody(BaseModel):
class TaskEvalRequestBody(TaskRequestBody):
eval_id: str
class Task(TaskRequestBody):
created_at: datetime = Field(
...,
description="The creation datetime of the task.",
example="2023-01-01T00:00:00Z",
json_encoders={datetime: lambda v: v.isoformat()},
)
modified_at: datetime = Field(
...,
description="The modification datetime of the task.",
example="2023-01-01T00:00:00Z",
json_encoders={datetime: lambda v: v.isoformat()},
)
task_id: str = Field(
...,
description="The ID of the task.",
example="50da533e-3904-4401-8a07-c49adf88b5eb",
)
artifacts: Optional[List[Artifact]] = Field(
[],
description="A list of artifacts that the task has produced.",
example=[
"7a49f31c-f9c6-4346-a22c-e32bc5af4d8e",
"ab7b4091-2560-4692-a4fe-d831ea3ca7d6",
],
)
class StepRequestBody(BaseModel):
name: Optional[str] = Field(
None, description="The name of the task step.", example="Write to file"
)
input: Optional[str] = Field(
None,
min_length=1,
description="Input prompt for the step.",
example="Washington",
)
additional_input: Optional[StepInput] = {}
class Status(Enum):
created = "created"
running = "running"
completed = "completed"
class Step(StepRequestBody):
created_at: datetime = Field(
...,
description="The creation datetime of the task.",
example="2023-01-01T00:00:00Z",
json_encoders={datetime: lambda v: v.isoformat()},
)
modified_at: datetime = Field(
...,
description="The modification datetime of the task.",
example="2023-01-01T00:00:00Z",
json_encoders={datetime: lambda v: v.isoformat()},
)
task_id: str = Field(
...,
description="The ID of the task this step belongs to.",
example="50da533e-3904-4401-8a07-c49adf88b5eb",
)
step_id: str = Field(
...,
description="The ID of the task step.",
example="6bb1801a-fd80-45e8-899a-4dd723cc602e",
)
name: Optional[str] = Field(
None, description="The name of the task step.", example="Write to file"
)
status: Status = Field(
..., description="The status of the task step.", example="created"
)
output: Optional[str] = Field(
None,
description="Output of the task step.",
example="I am going to use the write_to_file command and write Washington to a file called output.txt <write_to_file('output.txt', 'Washington')",
)
additional_output: Optional[StepOutput] = {}
artifacts: Optional[List[Artifact]] = Field(
[], description="A list of artifacts that the step has produced."
)
is_last: bool = Field(
..., description="Whether this is the last step in the task.", example=True
)
class TaskListResponse(BaseModel):
tasks: Optional[List[Task]] = None
pagination: Optional[Pagination] = None
class TaskStepsListResponse(BaseModel):
steps: Optional[List[Step]] = None
pagination: Optional[Pagination] = None
class TaskArtifactsListResponse(BaseModel):
artifacts: Optional[List[Artifact]] = None
pagination: Optional[Pagination] = None


@@ -1,17 +1,20 @@
import glob
import json
import logging
import math
import os
import subprocess
import sys
from abc import ABC
from pathlib import Path
from typing import Any, Dict, List
from typing import Any, ClassVar, List
import openai
import pytest
from colorama import Fore, Style
from agbenchmark.__main__ import OPTIONAL_CATEGORIES, TEMP_FOLDER_ABS_PATH
from agbenchmark.agent_api_interface import run_api_agent
from agbenchmark.config import AgentBenchmarkConfig
from agbenchmark.utils.data_types import ChallengeData, Ground
from agbenchmark.utils.prompts import (
END_PROMPT,
@@ -19,43 +22,84 @@ from agbenchmark.utils.prompts import (
PROMPT_MAP,
SCORING_MAP,
)
from agbenchmark.utils.utils import agent_eligibible_for_optional_categories
logger = logging.getLogger(__name__)
with open(
Path(__file__).parent.parent / "challenges" / "optional_categories.json"
) as f:
OPTIONAL_CATEGORIES: list[str] = json.load(f)["optional_categories"]
class Challenge(ABC):
"""The parent class to all specific challenges classes.
Defines helper methods for running a challenge"""
_data_cache: Dict[str, ChallengeData] = {}
CHALLENGE_LOCATION: str = ""
scores: dict[str, Any] = {} # this is for suites
data: ChallengeData
CHALLENGE_LOCATION: ClassVar[str]
ARTIFACTS_LOCATION: ClassVar[str]
scores: ClassVar[dict[str, Any]] = {} # this is for suites
@property
def data(self) -> ChallengeData:
if self.CHALLENGE_LOCATION not in self._data_cache:
self._data_cache[self.CHALLENGE_LOCATION] = ChallengeData.deserialize(
self.CHALLENGE_LOCATION
)
return self._data_cache[self.CHALLENGE_LOCATION]
@staticmethod
def from_challenge_spec(spec_file: Path) -> type["Challenge"]:
challenge_data = ChallengeData.parse_file(spec_file)
@property
def task(self) -> str:
return self.data.task
challenge_class_name = f"Test{challenge_data.name}"
logger.debug(f"Creating {challenge_class_name} from spec: {spec_file}")
return type(
challenge_class_name,
(Challenge,),
{
"data": challenge_data,
"CHALLENGE_LOCATION": str(spec_file),
"ARTIFACTS_LOCATION": str(spec_file.resolve().parent),
},
)
@property
def dependencies(self) -> list:
return self.data.dependencies
# Define test method within the dynamically created class
@pytest.mark.asyncio
async def test_method(
self, config: AgentBenchmarkConfig, request: pytest.FixtureRequest
) -> None:
# skip optional categories
self.skip_optional_categories(config)
async def setup_challenge(self, config: Dict[str, Any], cutoff: int) -> None:
if os.environ.get("HELICONE_API_KEY"):
from helicone.lock import HeliconeLockManager
HeliconeLockManager.write_custom_property("challenge", self.data.name)
timeout = self.data.cutoff or 60
if request.config.getoption("--nc"):
timeout = 100000
elif cutoff := request.config.getoption("--cutoff"):
timeout = int(cutoff)
await self.run_challenge(config, timeout)
scores = self.get_scores(config.temp_folder)
request.node.answers = (
scores["answers"] if request.config.getoption("--keep-answers") else None
)
del scores["answers"] # remove answers from scores
request.node.scores = scores # store scores in request.node
is_score_100 = 1 in scores["values"]
assert is_score_100
async def run_challenge(self, config: AgentBenchmarkConfig, cutoff: int) -> None:
from agbenchmark.agent_interface import copy_artifacts_into_temp_folder
if not self.task:
if not self.data.task:
return
print(
f"\033[1;35m============Starting {self.data.name} challenge============\033[0m"
f"{Fore.MAGENTA + Style.BRIGHT}{'='*24} "
f"Starting {self.data.name} challenge"
f" {'='*24}{Style.RESET_ALL}"
)
print(f"\033[1;30mTask: {self.task}\033[0m")
print(f"{Fore.BLACK}Task: {self.data.task}{Fore.RESET}")
await run_api_agent(self.data, config, self.ARTIFACTS_LOCATION, cutoff)
@@ -66,13 +110,11 @@ class Challenge(ABC):
str(Path(self.CHALLENGE_LOCATION).parent),
]
for path in artifact_paths:
copy_artifacts_into_temp_folder(TEMP_FOLDER_ABS_PATH, "custom_python", path)
def test_method(self, config: Dict[str, Any]) -> None:
raise NotImplementedError
copy_artifacts_into_temp_folder(config.temp_folder, "custom_python", path)
@staticmethod
def get_artifacts_out(
self, workspace: str | dict[str, str], ground: Ground
workspace: str | Path | dict[str, str], ground: Ground
) -> List[str]:
if isinstance(workspace, dict):
workspace = workspace["output"]
@@ -108,7 +150,7 @@ class Challenge(ABC):
if ground.eval.type == "pytest":
result = subprocess.run(
[sys.executable, "-m", "pytest"],
cwd=TEMP_FOLDER_ABS_PATH,
cwd=os.path.abspath(workspace),
capture_output=True,
text=True,
)
@@ -119,15 +161,17 @@ class Challenge(ABC):
return files_contents
def scoring(self, config: Dict[str, Any], content: str, ground: Ground) -> float:
print("\033[1;34mScoring content:\033[0m", content)
@staticmethod
def scoring(content: str, ground: Ground) -> float:
print(f"{Fore.BLUE}Scoring content:{Style.RESET_ALL}", content)
if ground.should_contain:
for should_contain_word in ground.should_contain:
if not getattr(ground, "case_sensitive", True):
should_contain_word = should_contain_word.lower()
content = content.lower()
print_content = (
f"\033[1;34mWord that should exist\033[0m - {should_contain_word}:"
f"{Fore.BLUE}Word that should exist{Style.RESET_ALL}"
f" - {should_contain_word}:"
)
if should_contain_word not in content:
print(print_content, "False")
@@ -140,7 +184,10 @@ class Challenge(ABC):
if not getattr(ground, "case_sensitive", True):
should_not_contain_word = should_not_contain_word.lower()
content = content.lower()
print_content = f"\033[1;34mWord that should not exist\033[0m - {should_not_contain_word}:"
print_content = (
f"{Fore.BLUE}Word that should not exist{Style.RESET_ALL}"
f" - {should_not_contain_word}:"
)
if should_not_contain_word in content:
print(print_content, "False")
return 0.0
@@ -149,14 +196,17 @@ class Challenge(ABC):
return 1.0
def llm_eval(self, config: Dict[str, Any], content: str, ground: Ground) -> float:
@classmethod
def llm_eval(cls, content: str, ground: Ground) -> float:
openai.api_key = os.getenv("OPENAI_API_KEY")
if os.getenv("IS_MOCK"):
return 1.0
# the validation for this is done in the Eval BaseModel
scoring = SCORING_MAP[ground.eval.scoring] # type: ignore
prompt = PROMPT_MAP[ground.eval.template].format(task=self.data.task, scoring=scoring, answer=ground.answer, response=content) # type: ignore
prompt = PROMPT_MAP[ground.eval.template].format( # type: ignore
task=cls.data.task, scoring=scoring, answer=ground.answer, response=content
)
if ground.eval.examples:
prompt += FEW_SHOT_EXAMPLES.format(examples=ground.eval.examples)
@@ -172,34 +222,31 @@ class Challenge(ABC):
return float(answer["choices"][0]["message"]["content"]) # type: ignore
def get_scores(self, config: Dict[str, Any]) -> dict[str, Any]:
@classmethod
def get_scores(cls, workspace: Path) -> dict[str, Any]:
scores = []
scores_dict: Any = {}
percentage = None
answers = {}
try:
if self.data.task == "" and os.getenv("IS_MOCK"):
if cls.data.task == "" and os.getenv("IS_MOCK"):
scores = [1.0]
answers = {"mock": "This is a mock answer"}
elif isinstance(self.data.ground, Ground):
files_contents = self.get_artifacts_out(
TEMP_FOLDER_ABS_PATH, self.data.ground
)
elif isinstance(cls.data.ground, Ground):
files_contents = cls.get_artifacts_out(workspace, cls.data.ground)
answers = {"answer": files_contents}
for file_content in files_contents:
score = self.scoring(config, file_content, self.data.ground)
print("\033[1;32mYour score is:\033[0m", score)
score = cls.scoring(file_content, cls.data.ground)
print(f"{Fore.GREEN}Your score is:{Style.RESET_ALL}", score)
scores.append(score)
if self.data.ground.eval.type == "llm":
llm_eval = self.llm_eval(
config, "\n".join(files_contents), self.data.ground
)
if self.data.ground.eval.scoring == "percentage":
if cls.data.ground.eval.type == "llm":
llm_eval = cls.llm_eval("\n".join(files_contents), cls.data.ground)
if cls.data.ground.eval.scoring == "percentage":
scores.append(math.ceil(llm_eval / 100))
elif self.data.ground.eval.scoring == "scale":
elif cls.data.ground.eval.scoring == "scale":
scores.append(math.ceil(llm_eval / 10))
print("\033[1;32mYour score is:\033[0m", llm_eval)
print(f"{Fore.GREEN}Your score is:{Style.RESET_ALL}", llm_eval)
scores.append(llm_eval)
except Exception as e:
@@ -212,7 +259,7 @@ class Challenge(ABC):
"answers": answers,
}
self.scores[self.__class__.__name__] = scores_data
cls.scores[cls.__name__] = scores_data
return scores_data
@@ -223,14 +270,15 @@ class Challenge(ABC):
return None
def skip_optional_categories(self, config: Dict[str, Any]) -> None:
challenge_category = self.data.category
categories = [
category
for category in OPTIONAL_CATEGORIES
if category in challenge_category
]
if not agent_eligibible_for_optional_categories(
categories, config.get("category", [])
@classmethod
def skip_optional_categories(cls, config: AgentBenchmarkConfig) -> None:
challenge_categories = set(c.value for c in cls.data.category)
challenge_optional_categories = challenge_categories & set(OPTIONAL_CATEGORIES)
if challenge_optional_categories and not (
config.categories
and set(challenge_optional_categories).issubset(set(config.categories))
):
pytest.skip("Agent is not eligible for this category")
pytest.skip(
f"Category {', '.join(challenge_optional_categories)} is optional, "
"and not explicitly selected in the benchmark config."
)
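
A minimal sketch of the `type()`-based class creation that `from_challenge_spec` uses above, with made-up attribute values:

```python
class Challenge:
    CHALLENGE_LOCATION: str = ""


TestExample = type(
    "TestExample",
    (Challenge,),
    {"CHALLENGE_LOCATION": "/path/to/spec.json"},  # hypothetical spec location
)

print(TestExample.__name__)                # TestExample
print(issubclass(TestExample, Challenge))  # True
print(TestExample.CHALLENGE_LOCATION)
```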


@@ -1,12 +1,8 @@
import datetime
import json
import sys
from datetime import datetime
from enum import Enum
from pathlib import Path
from typing import Any, Dict, List, Optional
from pydantic import BaseModel, constr, validator
from pydantic import BaseModel, Field, constr, validator
class DifficultyLevel(Enum):
@@ -33,80 +29,6 @@ DIFFICULTY_MAP = {
STRING_DIFFICULTY_MAP = {e.value: DIFFICULTY_MAP[e] for e in DifficultyLevel}
def calculate_info_test_path(base_path: Path, benchmark_start_time: datetime) -> Path:
"""
Calculates the path to the directory where the test report will be saved.
"""
# Ensure the reports path exists
base_path.mkdir(parents=True, exist_ok=True)
# Get current UTC date-time stamp
date_stamp = benchmark_start_time.strftime("%Y%m%dT%H%M%S")
# Default run name
run_name = "full_run"
# Map command-line arguments to their respective labels
arg_labels = {
"--test": None,
"--category": None,
"--maintain": "maintain",
"--improve": "improve",
"--explore": "explore",
}
# Identify the relevant command-line argument
for arg, label in arg_labels.items():
if arg in sys.argv:
test_arg = sys.argv[sys.argv.index(arg) + 1] if label is None else None
run_name = arg.strip("--")
if test_arg:
run_name = f"{run_name}_{test_arg}"
break
# Create the full new directory path with ISO standard UTC date-time stamp
report_path = base_path / f"{date_stamp}_{run_name}"
# Ensure the new directory is created
report_path.mkdir(exist_ok=True)
return report_path
class AgentBenchmarkConfig(BaseModel):
"""
This class represents the configuration for the Agent agbenchmark.
It includes the following attributes:
- agent_benchmark_config_path: The path to the agent benchmark config that this object was created from.
- reports_folder: The path to the folder where the benchmark reports will be stored.
- host: The host where the benchmark is run.
"""
agent_benchmark_config_path: Path | None = None
reports_folder: Path | None = None
host: str | None
def get_reports_location(self) -> Path:
# if not self.reports_folder:
# self.reports_folder = (
# Path(self.agent_benchmark_config_path).parent / "reports"
# ).resolve()
return Path.cwd() / "agbenchmark_config" / "reports"
def get_reports_path(self, benchmark_start_time: datetime) -> Path:
return calculate_info_test_path(
self.get_reports_location(), benchmark_start_time
)
def get_regression_reports_path(self) -> Path:
return self.get_reports_location() / "regression_tests.json"
def get_success_rate_path(self) -> Path:
return self.get_reports_location() / "success_rate.json"
def get_agent_home_directory(self) -> Path:
return Path(self.agent_benchmark_config_path).resolve().parent
class Info(BaseModel):
difficulty: DifficultyLevel
description: constr(regex=r"^Tests if the agent can.*")
@@ -180,6 +102,7 @@ class Category(str, Enum):
class ChallengeData(BaseModel):
eval_id: str = ""
name: str
category: List[Category]
task: str
@@ -189,73 +112,4 @@ class ChallengeData(BaseModel):
info: Info | Dict[str, Info]
metadata: Optional[Dict[str, Any]] = None
def serialize(self, path: str) -> None:
with open(path, "w") as file:
file.write(self.json())
def get_data(self) -> dict:
return self.dict()
@staticmethod
def get_json_from_path(json_path: Path | str) -> dict:
path = Path(json_path).resolve()
with open(path, "r") as file:
data = json.load(file)
return data
@staticmethod
def deserialize(path: str) -> "ChallengeData":
# this script is in root/agbenchmark/utils/define_task_types.py
script_dir = Path(__file__).resolve().parent.parent.parent
json_path = script_dir / Path(path)
with open(json_path, "r") as file:
data = json.load(file)
try:
return ChallengeData(**data)
except:
test = "ok"
def challenge_from_datum(self, file_datum: list[dict[str, Any]]) -> "ChallengeData":
same_task_data = {
"name": self.prefix,
"dependencies": self.dependencies,
"category": self.shared_category,
"task": self.task,
"cutoff": self.cutoff,
}
if not self.info:
same_task_data["info"] = {
datum["name"]: datum["info"] for datum in file_datum
}
else:
same_task_data["info"] = self.info
if not self.ground:
same_task_data["ground"] = {
datum["name"]: datum["ground"] for datum in file_datum
}
else:
same_task_data["ground"] = self.ground
return ChallengeData(**same_task_data)
def challenge_from_test_data(self, data: dict[str, Any]) -> "ChallengeData":
same_task_data = {
"name": data["name"],
"dependencies": data["dependencies"],
"category": data["category"],
"info": data["info"],
"ground": data["ground"],
}
if self.same_task:
same_task_data["category"].extend(self.shared_category)
same_task_data["task"] = self.task
same_task_data["cutoff"] = self.cutoff
else:
same_task_data["task"] = data["task"]
same_task_data["cutoff"] = data["cutoff"]
return ChallengeData(**same_task_data)
spec_file: Path | None = Field(None, exclude=True)


@@ -1,3 +1,5 @@
import json
import logging
import math
from pathlib import Path
from typing import Any, Dict, List, Tuple
@@ -11,6 +13,8 @@ from pyvis.network import Network
from agbenchmark.generate_test import DATA_CATEGORY
from agbenchmark.utils.utils import write_pretty_json
logger = logging.getLogger(__name__)
def bezier_curve(
src: np.ndarray, ctrl: List[float], dst: np.ndarray
@@ -221,8 +225,8 @@ def graph_interactive_network(
f"{source_id_str}_to_{target_id_str}" # Construct a unique edge id
)
if not (source_id_str in nt.get_nodes() and target_id_str in nt.get_nodes()):
print(
f"Skipping edge {source_id_str} -> {target_id_str} due to missing nodes."
logger.warning(
f"Skipping edge {source_id_str} -> {target_id_str} due to missing nodes"
)
continue
nt.add_edge(source_id_str, target_id_str, id=edge_id_str)
@@ -271,9 +275,12 @@ def graph_interactive_network(
"layout": {"hierarchical": hierarchical_options},
}
# Serialize the graph to JSON
# Serialize the graph to JSON and save in appropriate locations
graph_data = {"nodes": nt.nodes, "edges": nt.edges}
logger.debug(f"Generated graph data:\n{json.dumps(graph_data, indent=4)}")
# FIXME: use more reliable method to find the right location for these files.
# This will fail in all cases except if run from the root of our repo.
home_path = Path.cwd()
write_pretty_json(graph_data, home_path / "frontend" / "public" / "graph.json")
@@ -284,7 +291,6 @@ def graph_interactive_network(
# this literally only works in the AutoGPT repo, but this part of the code is not reached if BUILD_SKILL_TREE is false
write_pretty_json(graph_data, flutter_app_path / "tree_structure.json")
validate_skill_tree(graph_data, "")
import json
# Extract node IDs with category "coding"
@@ -317,9 +323,6 @@ def graph_interactive_network(
scrape_synthesize_tree,
flutter_app_path / "scrape_synthesize_tree_structure.json",
)
# If you want to convert back to JSON
filtered_json = json.dumps(graph_data, indent=4)
print(filtered_json)
if html_graph_path:
file_path = str(Path(html_graph_path).resolve())


@@ -1,4 +1,5 @@
import json
import logging
import os
from typing import Optional
@@ -7,6 +8,8 @@ import requests
from agbenchmark.__main__ import BENCHMARK_START_TIME
from agbenchmark.agent_interface import HELICONE_GRAPHQL_LOGS
logger = logging.getLogger(__name__)
def get_data_from_helicone(challenge: str) -> Optional[float]:
# Define the endpoint of your GraphQL server
@@ -38,8 +41,8 @@ query ExampleQuery($properties: [PropertyFilter!]){
]
}
if HELICONE_GRAPHQL_LOGS:
print(query)
print(json.dumps(variables, indent=4))
logger.debug(f"Executing Helicone query:\n{query.strip()}")
logger.debug(f"Query variables:\n{json.dumps(variables, indent=4)}")
operation_name = "ExampleQuery"
@@ -59,24 +62,22 @@ query ExampleQuery($properties: [PropertyFilter!]){
data = response.json()
except requests.HTTPError as http_err:
print(f"HTTP error occurred: {http_err}")
return None # Re-raise the exception to stop execution
logger.error(f"Helicone returned an HTTP error: {http_err}")
return None
except json.JSONDecodeError:
print(f"Invalid JSON response: {response.text if response else 'No response'}")
raw_response = response.text # type: ignore
logger.error(
f"Helicone returned an invalid JSON response: '''{raw_response}'''"
)
return None
except Exception as err:
print(f"Other error occurred: {err}")
logger.error(f"Error while trying to get data from Helicone: {err}")
return None
try:
if data is None or data.get("data") is None:
print("Invalid response received from server: no data")
return None
return (
data.get("data", {})
.get("aggregatedHeliconeRequest", {})
.get("costUSD", None)
)
except Exception as err:
print(f"Error occurred while parsing response: {err}")
if data is None or data.get("data") is None:
logger.error("Invalid response received from Helicone: no data")
logger.error(f"Offending response: {response}")
return None
return (
data.get("data", {}).get("aggregatedHeliconeRequest", {}).get("costUSD", None)
)


@@ -0,0 +1,74 @@
from __future__ import annotations
import logging
from colorama import Fore, Style
SIMPLE_LOG_FORMAT = "[%(asctime)s] %(levelname)s %(message)s"
DEBUG_LOG_FORMAT = "[%(asctime)s] %(levelname)s %(filename)s:%(lineno)03d %(message)s"
def configure_logging(
level: int = logging.INFO,
) -> None:
"""Configure the native logging module."""
# Auto-adjust default log format based on log level
log_format = DEBUG_LOG_FORMAT if level == logging.DEBUG else SIMPLE_LOG_FORMAT
console_handler = logging.StreamHandler()
console_handler.setFormatter(FancyConsoleFormatter(log_format))
# Configure the root logger
logging.basicConfig(
level=level,
format=log_format,
handlers=[console_handler],
)
class FancyConsoleFormatter(logging.Formatter):
"""
A custom logging formatter designed for console output.
This formatter enhances the standard logging output with color coding. The color
coding is based on the level of the log message, making it easier to distinguish
between different types of messages in the console output.
The color for each level is defined in the LEVEL_COLOR_MAP class attribute.
"""
# level -> (level & text color, title color)
LEVEL_COLOR_MAP = {
logging.DEBUG: Fore.LIGHTBLACK_EX,
logging.INFO: Fore.BLUE,
logging.WARNING: Fore.YELLOW,
logging.ERROR: Fore.RED,
logging.CRITICAL: Fore.RED + Style.BRIGHT,
}
def format(self, record: logging.LogRecord) -> str:
# Make sure `msg` is a string
if not hasattr(record, "msg"):
record.msg = ""
elif not type(record.msg) is str:
record.msg = str(record.msg)
# Justify the level name to 5 characters minimum
record.levelname = record.levelname.ljust(5)
# Determine default color based on error level
level_color = ""
if record.levelno in self.LEVEL_COLOR_MAP:
level_color = self.LEVEL_COLOR_MAP[record.levelno]
record.levelname = f"{level_color}{record.levelname}{Style.RESET_ALL}"
# Determine color for message
color = getattr(record, "color", level_color)
color_is_specified = hasattr(record, "color")
# Don't color INFO messages unless the color is explicitly specified.
if color and (record.levelno != logging.INFO or color_is_specified):
record.msg = f"{color}{record.msg}{Style.RESET_ALL}"
return super().format(record)
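
A short usage sketch, assuming `agbenchmark` is installed so that `agbenchmark.utils.logging` is importable; this is roughly what the `--debug` flag enables.

```python
import logging

from agbenchmark.utils.logging import configure_logging

configure_logging(level=logging.DEBUG)

logger = logging.getLogger(__name__)
logger.debug("Debug messages now use the more detailed DEBUG_LOG_FORMAT")
logger.warning("Warnings are highlighted in yellow by FancyConsoleFormatter")
```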


@@ -1,18 +1,22 @@
# radar charts, logs, helper functions for tests, anything else relevant.
import json
import logging
import os
import re
from pathlib import Path
from typing import Any, List, Optional
from typing import Any, Optional
from dotenv import load_dotenv
load_dotenv()
from agbenchmark.utils.data_types import DIFFICULTY_MAP, DifficultyLevel
load_dotenv()
AGENT_NAME = os.getenv("AGENT_NAME")
REPORT_LOCATION = os.getenv("REPORT_LOCATION", None)
logger = logging.getLogger(__name__)
def replace_backslash(value: Any) -> Any:
if isinstance(value, str):
@@ -72,8 +76,9 @@ def get_highest_success_difficulty(
highest_difficulty = DifficultyLevel[highest_difficulty_str]
highest_difficulty_level = DIFFICULTY_MAP[highest_difficulty]
except KeyError:
print(
f"Unexpected difficulty level '{highest_difficulty_str}' in test '{test_name}'"
logger.warning(
f"Unexpected difficulty level '{highest_difficulty_str}' "
f"in test '{test_name}'"
)
continue
else:
@@ -88,12 +93,21 @@ def get_highest_success_difficulty(
highest_difficulty = difficulty_enum
highest_difficulty_level = difficulty_level
except KeyError:
print(
f"Unexpected difficulty level '{difficulty_str}' in test '{test_name}'"
logger.warning(
f"Unexpected difficulty level '{difficulty_str}' "
f"in test '{test_name}'"
)
continue
except Exception:
print(f"Make sure you selected the right test, no reports were generated.")
except Exception as e:
logger.warning(
"An unexpected error [1] occurred while analyzing report [2]."
"Please notify a maintainer.\n"
f"Report data [1]: {data}\n"
f"Error [2]: {e}"
)
logger.warning(
"Make sure you selected the right test, no reports were generated."
)
break
if highest_difficulty is not None:
@@ -116,22 +130,13 @@ def get_highest_success_difficulty(
# remote_url = remote_url[:-4]
# git_commit_sha = f"{remote_url}/tree/{repo.head.commit.hexsha}"
# # print(f"GIT_COMMIT_SHA: {git_commit_sha}")
# # logger.debug(f"GIT_COMMIT_SHA: {git_commit_sha}")
# return git_commit_sha
# except Exception:
# # print(f"{directory} is not a git repository!")
# # logger.error(f"{directory} is not a git repository!")
# return None
def agent_eligibible_for_optional_categories(
optional_challenge_categories: List, agent_categories: List
) -> bool:
for element in optional_challenge_categories:
if element not in agent_categories:
return False
return True
def write_pretty_json(data, json_file):
sorted_data = deep_sort(data)
json_graph = json.dumps(sorted_data, indent=4)

benchmark/poetry.lock (generated, 1787 changed lines): diff suppressed because it is too large.

@@ -32,6 +32,8 @@ python-multipart = "^0.0.6"
toml = "^0.10.2"
helicone = "^1.0.9"
httpx = "^0.24.0"
agent-protocol-client = "^1.1.0"
click-default-group = "^1.2.4"
[tool.poetry.group.dev.dependencies]
flake8 = "^3.9.2"


@@ -1,121 +0,0 @@
import io
import json
import logging
import shutil
from pathlib import Path
from random import randint
from typing import Annotated, Any, Dict, List
from fastapi import FastAPI, File, Form, HTTPException, UploadFile
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
# Set up logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
app = FastAPI()
artifacts: List[Dict[str, Any]] = []
class Task(BaseModel):
input: str
@app.post("/agent/tasks/{task_id}/artifacts")
async def upload_file(
task_id: str, file: Annotated[UploadFile, File()], relative_path: str = Form("")
) -> Dict[str, Any]:
logger.info(
"Uploading file for task_id: %s with relative path: %s", task_id, relative_path
)
absolute_directory_path = Path(__file__).parent.absolute()
save_path = (
absolute_directory_path
/ "agent/gpt-engineer"
/ "projects/my-new-project/workspace"
)
random_string = str(randint(0, 100000))
while random_string in artifacts:
random_string = str(randint(0, 100000))
artifact_data = await file.read()
artifacts.append(
{
"binary": artifact_data,
"relative_path": relative_path,
"file_name": file.filename,
"artifact_id": random_string,
}
)
print(artifacts)
return {
"artifact_id": random_string,
"file_name": "file_name",
"relative_path": "relative_path",
}
@app.get("/agent/tasks/{task_id}/artifacts")
async def get_files() -> List[Dict[str, Any]]:
logger.info("Fetching list of files for task")
return artifacts
@app.get("/agent/tasks/{task_id}/artifacts/{artifact_id}")
async def get_file(artifact_id: str):
for artifact in artifacts:
if artifact["artifact_id"] == artifact_id:
break
else:
logger.error("Attempt to access nonexistent artifact with ID: %s", artifact_id)
raise HTTPException(status_code=404, detail="Artifact not found")
logger.info("Fetching artifact with ID: %s", artifact_id)
# find aritifact where artifact_id = artifact_id
for artifact in artifacts:
if artifact["artifact_id"] == artifact_id:
return StreamingResponse(
io.BytesIO(artifact["binary"]),
media_type="application/octet-stream",
headers={"Content-Disposition": f"attachment; filename=test.txt"},
)
# return 404
return HTTPException(status_code=404, detail="Artifact not found")
@app.post("/agent/tasks/{task_id}/steps")
async def create_steps(task_id: str):
logger.info("Creating step for task_id: %s", task_id)
return {
"input": "random",
"additional_input": {},
"task_id": task_id,
"step_id": "random_step",
"name": "random",
"status": "created",
"output": "random",
"additional_output": {},
"artifacts": [],
"is_last": True,
}
@app.post("/agent/tasks")
async def create_tasks(task: Task):
artifacts.clear()
return {
"input": "random",
"additional_input": {},
"task_id": "static_task_id",
"artifacts": [],
}
if __name__ == "__main__":
import uvicorn
uvicorn.run(app, host="0.0.0.0", port=8000)