Commit Graph

5154 Commits

Author SHA1 Message Date
Ethan Presberg
6cfe229332 feat(frontend): Allow sending a message with the enter key (#6378)
This has not yet been tested due to an issue with compiling on WSL. This was the fix suggested by Pwuts.
2024-02-20 10:49:37 +01:00
Reinier van der Leer
1079d71699 fix(ci/benchmark): Unbreak "Push reports to data branch" step
The `report_subfolder` variable was being populated with two identical lines, because there will be two untracked files in the folder, resulting in the same dirname.
This caused later commands using that variable to fail. Fix is to `sort -u` before storing the value to `report_subfolder`.
2024-02-20 10:35:14 +01:00
Reinier van der Leer
e104427767 feat(ci/benchmark): Generate step summary from benchmark report 2024-02-19 17:13:41 +01:00
Reinier van der Leer
bfd479a50b feat(benchmark): Add reports/format.py script to convert report.json to markdown 2024-02-19 17:13:05 +01:00
Reinier van der Leer
fb63bf4425 chore: Update agbenchmark dependency for agent and forge 2024-02-19 17:11:19 +01:00
Reinier van der Leer
3a17011129 feat(benchmark): Include Steps in Report 2024-02-19 17:08:24 +01:00
Reinier van der Leer
c339c6b54f chore: Update agbenchmark dependency for agent and forge 2024-02-18 17:37:03 +01:00
Reinier van der Leer
7f71d6d9fd debug(benchmark): Improve TestResult validation error output format 2024-02-18 17:10:14 +01:00
Reinier van der Leer
784e2bbb1c fix(ci/benchmark): Mitigate VCS conflicts with files in data branch
`agbenchmark` currently creates files like success_rate.json in the base REPORTS_FOLDER, which causes conflicts in the last step of the benchmark workflow.
To prevent issues, these files must be removed prior to switching to the data branch.
2024-02-17 18:09:44 +01:00
Reinier van der Leer
959377f54c fix(ci/benchmark): Add set +e because we expect (some) challenges to fail 2024-02-17 15:56:55 +01:00
Reinier van der Leer
6bc83e925c chore: Update agbenchmark dependency for agent and forge 2024-02-17 15:56:33 +01:00
Reinier van der Leer
4ede773f5a debug(benchmark): Add more debug code to pinpoint cause of rare crash
Target: https://github.com/Significant-Gravitas/AutoGPT/actions/runs/7941977633/job/21684817491
2024-02-17 15:48:57 +01:00
Reinier van der Leer
d5ad719757 ci: Allow telemetry for non-push events, as long as it's on master
Also disable telemetry for AutoGPT's unit/integration tests.
2024-02-17 15:12:43 +01:00
Reinier van der Leer
1ca9b9fa93 ci: Fix setting/passing TELEMETRY_* environment variables 2024-02-17 14:26:03 +01:00
Reinier van der Leer
15024fb5a1 chore: Update agbenchmark dependency for agent and forge 2024-02-17 14:18:02 +01:00
Reinier van der Leer
fa4bdef17c ci: Update actions to newest versions
- `actions/stale` -> `v9`
- `actions/cache` -> `v4`
- `actions/checkout` -> `v4`
- `actions/setup-node` -> `v4`
- `docker/login-action` -> `v3`
- `actions/setup-python` -> `v5`
- `codecov/codecov-action` -> `v4`
- `actions/upload-artifact` -> `v4`
- `subosito/flutter-action` -> `v2`
- `docker/build-push-action` -> `v5`
- `docker/setup-buildx-action` -> `v3`
2024-02-17 13:59:13 +01:00
Reinier van der Leer
e2b519ef3b debug(benchmark): Make sure TestResult validator error output is sufficient to debug 2024-02-17 13:36:17 +01:00
Reinier van der Leer
09c307d679 debug(benchmark): Add log statement to validator on TestResult
Validation errors don't mention the values causing the error, making it hard to debug. This happened a few times in autogpts-benchmark.yml, so let's put this log statement here until we figure out what makes it crash.
2024-02-17 13:32:22 +01:00
Reinier van der Leer
880c8e804c fix(ci/benchmark): Allow workflow to continue regardless of challenge outcomes 2024-02-17 11:52:26 +01:00
Reinier van der Leer
5f0764b65c chore: Update agbenchmark dependency for agent and forge 2024-02-16 19:07:37 +01:00
Reinier van der Leer
63e6014b27 fix(benchmark): Fix TestResult.fail_reason assignment condition
The condition must be the same as for `success`, because otherwise it causes a crash when `call.excinfo` evaluates to `False` but is not `None`.
2024-02-16 19:05:00 +01:00
Reinier van der Leer
83fcd9ad16 chore: Update agbenchmark dependency for agent and forge 2024-02-16 18:44:58 +01:00
Reinier van der Leer
f9792ed7f3 fix(benchmark): Unbreak -N/--attempts option 2024-02-16 18:43:37 +01:00
Reinier van der Leer
d6ab470c58 Rename autogpts-benchmark-nightly.yml to autogpts-benchmark.yml 2024-02-16 18:32:50 +01:00
Reinier van der Leer
666a5a8777 feat(agent/serve): Report task cost through Step.additional_output
- Added `task_cumulative_cost` and `task_total_cost` attributes to the `Step.additional_output` in the `AgentProtocolServer.execute_step` endpoint.
- Updated `agbenchmark` dependency in Agent and Forge
2024-02-16 18:19:04 +01:00
Reinier van der Leer
21f1e64559 feat(benchmark): Get agent task cost from Step.additional_output 2024-02-16 18:10:46 +01:00
Reinier van der Leer
752bac099b feat(benchmark/report): Add and record TestResult.n_steps
- Added `n_steps` attribute to `TestResult` type
- Added logic to record the number of steps to `BuiltinChallenge.test_method`, `WebArenaChallenge.test_method`, and `.reports.add_test_result_to_report`
2024-02-16 17:53:19 +01:00
Reinier van der Leer
a5de79beb6 ci(benchmark): Add nightly benchmark workflow
Added autogpts-benchmark-nightly.yml, which will run every night at 02:00 UTC with a selection of challenges.
2024-02-16 17:41:58 +01:00
Reinier van der Leer
483c01b681 lint(benchmark): Remove unnecessary pass statement in __main__.py 2024-02-16 17:27:56 +01:00
Reinier van der Leer
992b8874fc chore: Update agbenchmark dependency for agent and forge 2024-02-16 17:22:58 +01:00
Reinier van der Leer
2a55efb322 fix(benchmark): Include WebArenaSiteInfo.additional_info (e.g. credentials) in task input
Without the `additional_info`, it is impossible to get past the login page on challenges where that is necessary.
2024-02-16 17:20:44 +01:00
Reinier van der Leer
23d58a3cc0 feat(benchmark/cli): Add challenge list, challenge info subcommands
- Add `challenge list` command with options `--all`, `--names`, `--json`
   - Add `tabular` dependency
   - Add `.utils.utils.sorted_by_enum_index` function to easily sort lists by an enum value/property based on the order of the enum's definition
- Add `challenge info [name]` command with option `--json`
   - Add `.utils.utils.pretty_print_model` routine to pretty-print Pydantic models
- Refactor `config` subcommand to use `pretty_print_model`
2024-02-16 15:17:11 +01:00
Reinier van der Leer
70e345b2ce refactor(benchmark): load_webarena_challenges
- Reduce duplicate and nested statements
- Add `skip_unavailable` parameter

Related changes:
- Add `available` and `unavailable_reason` attributes to `ChallengeInfo` and `WebArenaChallengeSpec`
- Add `pytest.skip` statement to `WebArenaChallenge.test_method` to make sure unavailable challenges are not run
2024-02-16 15:11:48 +01:00
Reinier van der Leer
650a701317 chore: Update agbenchmark dependency for agent and forge 2024-02-15 18:19:06 +01:00
Reinier van der Leer
679339d00c feat(benchmark): Make report output folder configurable
- Make `AgentBenchmarkConfig.reports_folder` directly configurable (through `REPORTS_FOLDER` env variable). The default is still `./agbenchmark_config/reports`.
- Change all mentions of `REPORT_LOCATION` (which fulfilled the same function at some point in the past) to `REPORTS_FOLDER`.
2024-02-15 18:07:45 +01:00
Reinier van der Leer
fd5730b04a feat(agent/telemetry): Distinguish between production and dev environment based on VCS state
- Added a helper function `.app.utils.vcs_state_diverges_from_master()`. This function determines whether the relevant part of the codebase diverges from our `master`.
- Updated `.app.telemetry._setup_sentry()` to determine the default environment name using `vcs_state_diverges_from_master`.
2024-02-15 16:00:30 +01:00
Reinier van der Leer
b7f08cd0f7 feat(agent/telemetry): Enable performance tracing & update opt-in prompt accordingly 2024-02-15 14:46:36 +01:00
Reinier van der Leer
8762f7ab3d fix(forge): Make watchfiles pattern more specific to prevent unwanted (breaking) reloads
This fixes the issue of changes in artifacts triggering an application reload (which caused connection errors for in-progress requests).
2024-02-15 13:42:38 +01:00
Reinier van der Leer
a9b7b175ff fix(agent/profile_generator): Improve robustness by leveraging create_chat_completion's parse handling 2024-02-15 11:48:07 +01:00
Reinier van der Leer
52b93dd84e fix(cli/agent start): Wait for applications to finish starting before returning
- Added a helper function `wait_until_conn_ready(port)` to wait for the benchmark and agent applications to finish starting
- Improved the CLI's own logging (within the `agent start` command)
2024-02-15 11:26:26 +01:00
Reinier van der Leer
6a09a44ef7 lint(agent): Fix telemetry.py linting error & formatting 2024-02-14 23:31:35 +01:00
Toran Bruce Richards
32a627eda9 Add Privacy Policy link to telementry opt-in. 2024-02-14 16:42:34 +00:00
Reinier van der Leer
67bafa6302 fix(autogpt/llm): AssistantChatMessage.tool_calls default [] instead of None
OpenAI ChatCompletion calls fail when `tool_calls = None`. This issue came to light after 22aba6d.
2024-02-14 14:34:04 +01:00
Reinier van der Leer
6017eefb32 ci: Enable telemetry in CI runs on master 2024-02-14 12:03:54 +01:00
Reinier van der Leer
ae197fc85f feat(agent/telemetry): Distinguish between users
This allows us to get a much better sense of how many users actually experience issues, and how issue occurrence is distributed among users.
2024-02-14 11:50:45 +01:00
Reinier van der Leer
22aba6dd8a fix(agent/llm): Include bad response in parse-fix prompt in OpenAIProvider.create_chat_completion
Apparently I forgot to also append the response that caused the parse error before throwing it back to the LLM and letting it fix its mistake(s).
2024-02-14 11:20:31 +01:00
Reinier van der Leer
88bbdfc7fc ci: Pick 3 challenges to run with --mock in smoke test CI 2024-02-14 02:30:03 +01:00
Reinier van der Leer
d0c9b7c405 lint(benchmark): Remove unused imports 2024-02-14 01:34:30 +01:00
Reinier van der Leer
e7698a4610 chore(agent): Update forge and agbenchmark dependencies 2024-02-14 01:32:28 +01:00
Reinier van der Leer
ab05b7ae70 chore(forge): Update agbenchmark dependency 2024-02-14 01:27:07 +01:00