From dad4804b4e53f4aab4f2615345d4638719399da1 Mon Sep 17 00:00:00 2001
From: Douglas Schonholtz <15002691+dschonholtz@users.noreply.github.com>
Date: Tue, 18 Apr 2023 10:29:05 -0400
Subject: [PATCH] Update README.md

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 123c87e8..f3b54648 100644
--- a/README.md
+++ b/README.md
@@ -6,7 +6,7 @@ A set of standardised benchmarks to assess the performance of Auto-GPTs.
 - [ ] Build longer form tasks, (code fix backed by testing)
 - [ ] Explicitly note the common failure modes in the test harness and fix them. Most of these appear to be failure modes with the core AutoGPT project
 - [ ] Switch to a ubuntu container so it can do more things (git, bash, etc)
-- [ ] Lower priority, but put this in a webserver backend so we have a good API
+- [ ] Lower priority, but put this in a webserver backend so we have a good API rather than doing container and file management for our interface between evals and our agent.
 - [ ] Get token counting data from the model Add scores to result files based on pricing associated with tokens and models used
 - [ ] Think about how this can be applied to other projects besides AutoGPT so we can be THE agent evaluation framework.
 - [ ] Figure our how the OpenAI Evals results are saved...