- Fixed `--mock` mode
- Moved the interrupt to the beginning of the step iterator pipeline (from `BuiltinChallenge` to `agent_api_interface.py:run_api_agent`). This ensures that any finish-up code is executed properly after a single step.
- Implemented mock mode in `WebArenaChallenge`
- Fixed `fixture 'i_attempt' not found` error when `--attempts`/`-N` is omitted
- Fixed handling of `python`/`pytest` evals in `BuiltinChallenge`
- Disabled left-over Helicone code (see 056163e)
- Fixed a couple of challenge definitions:
  - WebArena task 107: fixed the spelling of months ("Sepetember", "Octorbor")
  - synthesize/1_basic_content_gen (SynthesizeInfo): removed the empty string from the `should_contain` list
- Added some debug logging in `agent_api_interface.py` and `challenges/builtin.py`
This is the official challenge library for https://github.com/Significant-Gravitas/Auto-GPT-Benchmarks.
The goal of this repo is to make it easy to create challenges for test-driven development with the Auto-GPT-Benchmarks package. It is essentially a library for crafting challenges using a DSL (JSON files, in this case).
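To give a concrete picture of that DSL, here is a minimal sketch of a single challenge definition. The field names follow the built-in challenges as far as I know them, but treat the exact schema as an assumption that may differ between versions:

```json
{
  "name": "WriteFile",
  "category": ["general"],
  "task": "Write the word 'Washington' to a .txt file",
  "dependencies": [],
  "cutoff": 60,
  "ground": {
    "answer": "The word 'Washington', printed to a .txt file named anything",
    "should_contain": ["Washington"],
    "should_not_contain": [],
    "files": [".txt"],
    "eval": { "type": "file" }
  },
  "info": {
    "difficulty": "interface",
    "description": "Tests whether the agent can write a file",
    "side_effects": []
  }
}
```

The `ground` block drives evaluation: `should_contain` lists strings that must appear in the agent's output files, which is exactly the field the 1_basic_content_gen fix above touches.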
An up-to-date dependency graph is available here: https://sapphire-denys-23.tiiny.site/
How to use
Make sure you have the package installed: `pip install agbenchmark`.
If you just want to use the default challenges, you don't need this repo: installing the package gives you access to them.
To add new challenges as you develop, add this repo as a submodule inside your project's `agbenchmark` folder, as shown in the sketch below. Any new challenges you add within the submodule will be registered automatically.
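A minimal sketch of that setup, run from your project root. The submodule destination path is illustrative, and `<url-of-this-repo>` is a placeholder for the actual repository URL:

```shell
# Install the benchmark package (the pip name comes from the instructions above)
pip install agbenchmark

# Vendor this challenge repo as a submodule inside your agbenchmark folder
# so the challenges it contains are registered automatically.
# <url-of-this-repo> and the destination path are placeholders; adjust to your layout.
git submodule add <url-of-this-repo> agbenchmark/challenges
```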