`agbenchmark` currently creates files such as `success_rate.json` directly in the base `REPORTS_FOLDER`, which causes conflicts in the last step of the benchmark workflow.
To prevent these conflicts, such files must be removed before switching to the data branch.
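For illustration, a minimal sketch of the cleanup in Python; the glob pattern and the `REPORTS_FOLDER` default are assumptions, and in the workflow itself this could just as well be a plain shell `rm` step.

```python
import os
from pathlib import Path

# Remove loose report files (e.g. success_rate.json) from the base REPORTS_FOLDER
# so they cannot conflict with the subsequent checkout of the data branch.
reports_folder = Path(os.getenv("REPORTS_FOLDER", "./agbenchmark_config/reports"))
for leftover in reports_folder.glob("*.json"):
    leftover.unlink()
```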
Validation errors don't mention the values that caused them, which makes them hard to debug. This has happened a few times in the autogpts-benchmark.yml workflow, so let's keep this log statement here until we figure out what makes it crash.
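A sketch of the kind of temporary debug log meant here, assuming a pydantic-style model is being validated; the function and parameter names are illustrative.

```python
import json
import logging

logger = logging.getLogger(__name__)

def validate_with_debug_log(model_cls, raw_data: dict):
    """Log the raw values before validating, so that a ValidationError raised
    in CI shows what it was actually raised on. Temporary debugging aid."""
    logger.debug("Data being validated:\n%s", json.dumps(raw_data, indent=2, default=str))
    return model_cls.parse_obj(raw_data)  # pydantic v1-style; use model_validate on v2
```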
- Added `task_cumulative_cost` and `task_total_cost` attributes to `Step.additional_output` in the `AgentProtocolServer.execute_step` endpoint.
- Updated `agbenchmark` dependency in Agent and Forge
- Added `n_steps` attribute to `TestResult` type
- Added logic to record the number of steps to `BuiltinChallenge.test_method`, `WebArenaChallenge.test_method`, and `.reports.add_test_result_to_report`
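For the cost attributes above, a rough sketch of how they could be merged into `Step.additional_output` inside `execute_step`; the helper name and the cost variables are assumptions.

```python
from typing import Any, Optional

def with_cost_info(
    additional_output: Optional[dict[str, Any]],
    cumulative_cost: float,
    total_cost: Optional[float],
) -> dict[str, Any]:
    """Merge the task's cost figures into a step's additional_output dict."""
    return {
        **(additional_output or {}),
        "task_cumulative_cost": cumulative_cost,
        "task_total_cost": total_cost,
    }

# e.g. step.additional_output = with_cost_info(step.additional_output, 0.042, None)
```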
- Add `challenge list` command with options `--all`, `--names`, `--json`
- Add `tabulate` dependency
- Add `.utils.utils.sorted_by_enum_index` function to sort lists by an enum value/property, following the order in which the enum's members are defined (sketched after this list)
- Add `challenge info [name]` command with option `--json`
- Add `.utils.utils.pretty_print_model` routine to pretty-print Pydantic models
- Refactor `config` subcommand to use `pretty_print_model`
- Reduce duplicate and nested statements
- Add `skip_unavailable` parameter
Related changes:
- Add `available` and `unavailable_reason` attributes to `ChallengeInfo` and `WebArenaChallengeSpec`
- Add `pytest.skip` statement to `WebArenaChallenge.test_method` to make sure unavailable challenges are not run
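As referenced above, a sketch of what `sorted_by_enum_index` could look like; the exact signature is an assumption.

```python
from enum import Enum
from typing import Callable, Iterable, Optional, TypeVar

T = TypeVar("T")

def sorted_by_enum_index(
    items: Iterable[T],
    enum: type[Enum],
    *,
    key: Optional[Callable[[T], Enum]] = None,
    reverse: bool = False,
) -> list[T]:
    """Sort `items` by an enum member (taken from each item via `key`),
    using the order in which the enum's members are defined."""
    order = {member: i for i, member in enumerate(enum)}
    get_member = key or (lambda item: item)
    return sorted(
        items,
        key=lambda item: order.get(get_member(item), len(order)),
        reverse=reverse,
    )
```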
- Make `AgentBenchmarkConfig.reports_folder` directly configurable (through the `REPORTS_FOLDER` environment variable). The default is still `./agbenchmark_config/reports`.
- Change all mentions of `REPORT_LOCATION` (which fulfilled the same function at some point in the past) to `REPORTS_FOLDER`.
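A minimal sketch of the resolution logic, assuming the agbenchmark config directory is known; the helper name is illustrative.

```python
import os
from pathlib import Path

def resolve_reports_folder(agbenchmark_config_dir: Path) -> Path:
    """The REPORTS_FOLDER env variable wins; otherwise fall back to the
    default `reports` folder inside the agbenchmark config directory."""
    if env_value := os.getenv("REPORTS_FOLDER"):
        return Path(env_value)
    return agbenchmark_config_dir / "reports"
```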
- Added a helper function `.app.utils.vcs_state_diverges_from_master()`. This function determines whether the relevant part of the codebase diverges from our `master`.
- Updated `.app.telemetry._setup_sentry()` to determine the default environment name using `vcs_state_diverges_from_master`.
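Roughly how such a divergence check could work, here using plain `git` via `subprocess`; the remote/branch name and the path being compared are assumptions.

```python
import subprocess
from pathlib import Path

def vcs_state_diverges_from_master(relevant_path: str = ".") -> bool:
    """Return True if the working tree differs from origin/master in the
    relevant part of the codebase. Sketch only; assumes origin/master is fetched."""
    result = subprocess.run(
        ["git", "diff", "--quiet", "origin/master", "--", relevant_path],
        cwd=Path(__file__).parent,
        check=False,
    )
    # `git diff --quiet` exits with 1 when there are differences.
    return result.returncode != 0
```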
- Added a helper function `wait_until_conn_ready(port)` to wait for the benchmark and agent applications to finish starting (see the sketch after this list)
- Improved the CLI's own logging (within the `agent start` command)
- Fixed `--mock` mode
- Moved the interrupt to the beginning of the step iterator pipeline (from `BuiltinChallenge` to `agent_api_interface.py:run_api_agent`). This ensures that any finish-up code still runs properly after a single step has been executed.
- Implemented mock mode in `WebArenaChallenge`
- Fixed `fixture 'i_attempt' not found` error when `--attempts`/`-N` is omitted
- Fixed handling of `python`/`pytest` evals in `BuiltinChallenge`
- Disabled left-over Helicone code (see 056163e)
- Fixed a couple of challenge definitions
- WebArena task 107: fix spelling of months (Sepetember, Octorbor *lmao*)
- synthesize/1_basic_content_gen (SynthesizeInfo): remove empty string from `should_contain` list
- Added some debug logging in agent_api_interface.py and challenges/builtin.py
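As referenced above, a sketch of what `wait_until_conn_ready(port)` could look like; the host, polling interval and timeout are assumptions.

```python
import socket
import time

def wait_until_conn_ready(port: int, timeout: int = 30) -> None:
    """Block until something accepts TCP connections on localhost:port,
    or raise TimeoutError after `timeout` seconds."""
    deadline = time.time() + timeout
    while True:
        try:
            with socket.create_connection(("localhost", port), timeout=1):
                return
        except OSError:
            if time.time() > deadline:
                raise TimeoutError(f"Port {port} did not open within {timeout}s")
            time.sleep(0.5)
```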
OpenAI's newest models return JSON with markdown fences around it, breaking the `json.loads` parser.
This commit adds an `extract_list_from_response` function to json_utils/utilities.py and uses this function to replace `json.loads` in `_process_text`.
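A sketch of the fence-stripping approach; the regex details are illustrative, not necessarily the exact implementation.

```python
import json
import re

def extract_list_from_response(response_content: str) -> list:
    """Parse a JSON list from an LLM response, tolerating a surrounding
    markdown code fence (e.g. a block opened with three backticks and 'json')."""
    match = re.search(r"```(?:json)?\s*(.*?)\s*```", response_content, re.DOTALL)
    if match:
        response_content = match.group(1)
    result = json.loads(response_content)
    if not isinstance(result, list):
        raise ValueError(f"Response content is not a JSON list: {result!r}")
    return result
```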
* Add Sentry integration for telemetry
- Add `sentry_sdk` dependency
- Add setup logic and config flow using `TELEMETRY_OPT_IN` environment variable
- Add app/telemetry.py with `setup_telemetry` helper routine
- Call `setup_telemetry` in `cli()` in app/cli.py
- Add `TELEMETRY_OPT_IN` to .env.template
- Add helper function `env_file_exists` and routine `set_env_config_value` to app/utils.py
- Add unit tests for `set_env_config_value` in test_utils.py
- Add a prompt at startup asking whether the user wants to enable telemetry if the env variable isn't set
* Add `capture_exception` statements for LLM parsing errors and command failures
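A sketch of the opt-in flow described above; the DSN placeholder, the prompt wording, and the variable used for the environment name are assumptions — only `TELEMETRY_OPT_IN`, the prompt behaviour, and the `sentry_sdk` calls come from the changes themselves.

```python
import os

import click  # the CLI already uses click; assumed available here
import sentry_sdk

def setup_telemetry() -> None:
    """Ask once if TELEMETRY_OPT_IN is unset, then initialise Sentry only
    when the user has opted in."""
    opt_in = os.getenv("TELEMETRY_OPT_IN")
    if opt_in is None:
        # Persisting the answer to .env is what set_env_config_value is for.
        opt_in = "true" if click.confirm("Enable anonymous telemetry?", default=False) else "false"
    if opt_in.lower() not in ("1", "true", "yes"):
        return
    sentry_sdk.init(
        dsn="https://examplePublicKey@o0.ingest.sentry.io/0",  # placeholder DSN
        environment=os.getenv("TELEMETRY_ENVIRONMENT", "production"),  # assumed variable
    )

# Failures can then be reported explicitly, e.g.:
#     except Exception as e:
#         sentry_sdk.capture_exception(e)
```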