LLMs are probabilistic systems, so reproducibility of completions is not guaranteed. It makes sense to account for this by running each challenge multiple times and reporting a success ratio rather than a boolean pass/fail result.
Changes:
- Add a `-N`/`--attempts` option to the CLI and an `attempts_per_challenge` parameter to `main.py:run_benchmark`.
- Add a dynamic `i_attempt` fixture through the `pytest_generate_tests` hook in conftest.py to run each challenge multiple times (see the first sketch after this list).
- Modify the `pytest_runtest_makereport` hook in conftest.py to handle multiple reporting calls per challenge.
- Refactor report_types.py, reports.py, and process_report.py to allow multiple results per challenge.
- Calculate `success_percentage` from the results of the current run only, rather than from all results ever recorded (see the second sketch after this list).
- Add docstrings to a number of models in report_types.py.
- Allow `None` as a success value, e.g. for runs that were cut off before producing any result.
- Make `SingletonReportManager` thread-safe (see the third sketch after this list).
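
The multi-attempt mechanism could look roughly like the following minimal sketch. The option plumbing is an assumption made to keep the example self-contained: here the attempt count is read from a pytest `--attempts` option, whereas in practice the value is forwarded from the benchmark CLI.

```python
# conftest.py -- minimal sketch, not the actual implementation.
# Assumption: the attempt count reaches pytest as an `--attempts` option.
import pytest


def pytest_addoption(parser: pytest.Parser) -> None:
    parser.addoption(
        "--attempts",
        type=int,
        default=1,
        help="Number of times to run each challenge.",
    )


def pytest_generate_tests(metafunc: pytest.Metafunc) -> None:
    # Parametrize every test that requests the `i_attempt` fixture,
    # so each challenge is collected once per attempt.
    if "i_attempt" in metafunc.fixturenames:
        n_attempts = metafunc.config.getoption("--attempts")
        metafunc.parametrize("i_attempt", range(n_attempts))
```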
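For the success-ratio and `None`-result changes, the per-run calculation could plausibly look like the sketch below. Counting `None` toward the attempt total but not as a success is an assumption about the intended semantics.

```python
def success_percentage(results: list[bool | None]) -> float:
    """Ratio of successful attempts in the current run, as a percentage.

    `None` entries are attempts that were cut off before producing a
    result; they count toward the attempt total but not as successes.
    """
    if not results:
        return 0.0
    successes = sum(1 for result in results if result is True)
    return 100 * successes / len(results)
```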
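One conventional way to make a singleton thread-safe is double-checked locking around instance creation. This sketch shows that pattern in general; it is not necessarily how `SingletonReportManager` implements it.

```python
import threading


class SingletonReportManager:
    _instance = None
    _lock = threading.Lock()

    def __new__(cls):
        # Double-checked locking: skip the lock on the common path,
        # but re-check inside it so only one thread creates the instance.
        if cls._instance is None:
            with cls._lock:
                if cls._instance is None:
                    cls._instance = super().__new__(cls)
        return cls._instance
```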
* feat(benchmark): Add JungleGym WebArena challenges
- Add `WebArenaChallenge`, `WebArenaChallengeSpec`, and supporting logic to make these challenges work (a rough sketch of the spec model follows this list)
- Add WebArena challenges to the Pytest collection endpoint, generate_test.py
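
A challenge spec model for this might look something like the sketch below. Apart from the class name, every field here is an assumption about how a JungleGym WebArena task could be described, not the actual schema.

```python
# Illustrative only: field names and types are assumptions.
from pydantic import BaseModel


class WebArenaChallengeSpec(BaseModel):
    task_id: int        # ID of the task in the WebArena dataset
    sites: list[str]    # which WebArena demo sites the task involves
    intent: str         # natural-language description of the task
    reference_answer: str | None = None  # expected result, if verifiable
```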
* feat(benchmark/webarena): Add hand-picked selection of WebArena challenges