9 Commits

Author SHA1 Message Date
Reinier van der Leer
3a17011129 feat(benchmark): Include Steps in Report 2024-02-19 17:08:24 +01:00
Reinier van der Leer
f9792ed7f3 fix(benchmark): Unbreak -N/--attempts option 2024-02-16 18:43:37 +01:00
Reinier van der Leer
21f1e64559 feat(benchmark): Get agent task cost from Step.additional_output 2024-02-16 18:10:46 +01:00
Reinier van der Leer
752bac099b feat(benchmark/report): Add and record TestResult.n_steps
- Added `n_steps` attribute to `TestResult` type
- Added logic to record the number of steps to `BuiltinChallenge.test_method`, `WebArenaChallenge.test_method`, and `.reports.add_test_result_to_report`
2024-02-16 17:53:19 +01:00
Reinier van der Leer
2a55efb322 fix(benchmark): Include WebArenaSiteInfo.additional_info (e.g. credentials) in task input
Without the `additional_info`, it is impossible to get past the login page on challenges where that is necessary.
2024-02-16 17:20:44 +01:00
Reinier van der Leer
70e345b2ce refactor(benchmark): load_webarena_challenges
- Reduce duplicate and nested statements
- Add `skip_unavailable` parameter

Related changes:
- Add `available` and `unavailable_reason` attributes to `ChallengeInfo` and `WebArenaChallengeSpec`
- Add `pytest.skip` statement to `WebArenaChallenge.test_method` to make sure unavailable challenges are not run
2024-02-16 15:11:48 +01:00
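A minimal sketch of how a `skip_unavailable` parameter could interact with the `available` and `unavailable_reason` attributes. `ChallengeSpec` and `load_challenges` are hypothetical stand-ins for illustration, not the actual `WebArenaChallengeSpec` or `load_webarena_challenges` code:

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass
class ChallengeSpec:
    """Hypothetical, simplified stand-in for a challenge specification."""

    name: str
    available: bool = True
    unavailable_reason: Optional[str] = None


def load_challenges(
    specs: Iterable[ChallengeSpec], skip_unavailable: bool = True
) -> Iterator[ChallengeSpec]:
    """Yield challenge specs, optionally leaving out unavailable ones."""
    for spec in specs:
        if not spec.available and skip_unavailable:
            # Dropped at load time, so the challenge is never collected.
            continue
        yield spec


specs = [
    ChallengeSpec("shopping-1"),
    ChallengeSpec("reddit-7", available=False, unavailable_reason="site not hosted"),
]
loaded_default = [s.name for s in load_challenges(specs)]
loaded_all = [s.name for s in load_challenges(specs, skip_unavailable=False)]
```

With `skip_unavailable=False` the unavailable spec is still loaded, which is where a guard in the test method matters: calling `pytest.skip(spec.unavailable_reason)` there would report the challenge as skipped instead of running it.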
Reinier van der Leer
327fb1f916 fix(benchmark): Mock mode, python evals, --attempts flag, challenge definitions
- Fixed `--mock` mode
   - Moved interrupt to beginning of the step iterator pipeline (from `BuiltinChallenge` to `agent_api_interface.py:run_api_agent`). This ensures that any finish-up code is properly executed after executing a single step.
   - Implemented mock mode in `WebArenaChallenge`

- Fixed `fixture 'i_attempt' not found` error when `--attempts`/`-N` is omitted

- Fixed handling of `python`/`pytest` evals in `BuiltinChallenge`

- Disabled left-over Helicone code (see 056163e)

- Fixed a couple of challenge definitions
   - WebArena task 107: fix spelling of months (Sepetember, Octorbor *lmao*)
   - synthesize/1_basic_content_gen (SynthesizeInfo): remove empty string from `should_contain` list

- Added some debug logging in agent_api_interface.py and challenges/builtin.py
2024-02-14 01:05:34 +01:00
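The effect of moving the mock-mode interrupt to the start of the step iterator pipeline can be illustrated with a small sketch. All names below are hypothetical and the real `run_api_agent` differs; the point is only that when the producer ends the stream after one step, the consumer's loop exits normally and its finish-up code still runs:

```python
import asyncio
from typing import AsyncIterator, List


async def agent_steps(n_steps: int) -> AsyncIterator[str]:
    """Hypothetical stand-in for the stream of steps an agent produces."""
    for i in range(n_steps):
        yield f"step-{i}"


async def run_agent(n_steps: int, mock: bool = False) -> AsyncIterator[str]:
    """Producer side: in mock mode, end the stream after the first step."""
    async for step in agent_steps(n_steps):
        yield step
        if mock:
            # Interrupt at the source of the pipeline.
            break


async def run_challenge(mock: bool) -> List[str]:
    """Consumer side: process steps, then run finish-up code."""
    log: List[str] = []
    async for step in run_agent(5, mock=mock):
        log.append(step)
    # Reached in both modes, because the stream simply ends.
    log.append("finish-up")
    return log


mock_log = asyncio.run(run_challenge(mock=True))
full_log = asyncio.run(run_challenge(mock=False))
```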
Reinier van der Leer
a0cae78ba3 feat(benchmark): Add -N, --attempts option for multiple attempts per challenge
LLMs are probabilistic systems, so reproducibility of completions is not guaranteed. It makes sense to account for this by running each challenge multiple times and reporting a success ratio rather than a single boolean success/failure result.

Changes:
- Add `-N`, `--attempts` option to CLI and `attempts_per_challenge` parameter to `main.py:run_benchmark`.
- Add dynamic `i_attempt` fixture through `pytest_generate_tests` hook in conftest.py to achieve multiple runs per challenge.
- Modify `pytest_runtest_makereport` hook in conftest.py to handle multiple reporting calls per challenge.
- Refactor report_types.py, reports.py, process_report.py to allow multiple results per challenge.
   - Calculate `success_percentage` from results of the current run, rather than all known results ever.
   - Add docstrings to a number of models in report_types.py.
   - Allow `None` as a success value, e.g. for runs that did not render any results before being cut off.
- Make SingletonReportManager thread-safe.
2024-01-22 17:16:55 +01:00
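A rough sketch of a dynamic `i_attempt` fixture generated through the `pytest_generate_tests` hook. The option type and the default of one attempt are assumptions made for illustration; the actual conftest.py may register and read the option differently:

```python
def pytest_addoption(parser) -> None:
    """Register the attempts option (assumed here to default to 1)."""
    parser.addoption("-N", "--attempts", action="store", type=int, default=1)


def pytest_generate_tests(metafunc) -> None:
    """Parametrize tests that request `i_attempt`, one run per attempt."""
    if "i_attempt" in metafunc.fixturenames:
        # Fall back to a single attempt if the option yields no value.
        n_attempts = metafunc.config.getoption("--attempts") or 1
        metafunc.parametrize("i_attempt", range(n_attempts))
```

Placed in a conftest.py, this makes pytest collect a test such as `def test_method(i_attempt): ...` once per attempt, so each challenge produces several results that can be aggregated into a success ratio.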
Reinier van der Leer
488f40a20f feat(benchmark): JungleGym WebArena (#6691)
* feat(benchmark): Add JungleGym WebArena challenges
   - Add `WebArenaChallenge`, `WebArenaChallengeSpec`, and other logic to make these challenges work
   - Add WebArena challenges to Pytest collection endpoint generate_test.py

* feat(benchmark/webarena): Add hand-picked selection of WebArena challenges
2024-01-19 20:34:04 +01:00