diff --git a/ROADMAP.md b/ROADMAP.md
index 902f226..cd1f815 100644
--- a/ROADMAP.md
+++ b/ROADMAP.md
@@ -1,16 +1,28 @@
-# Capability improvement roadmap
-- [ ] Continuous capability measurements
-  - [ ] Create a step that asks “did it run/work/perfect” in the end [#240](https://github.com/AntonOsika/gpt-engineer/issues/240)
-  - [ ] Run the benchmark repeatedly and document the results for the different "step configs" (`STEPS` in `steps.py`) [#239](https://github.com/AntonOsika/gpt-engineer/issues/239)
-  - [ ] Document the best performing configs, and feed this into our roadmap
-  - [ ] Collect a dataset for gpt engineer to learn from, by storing code generation runs, and if they fail/succeed on an opt out basis
+# Roadmap
+
+We are building AGI by first creating the code generation tooling of the future.
+
+There are three main milestones we think can improve gpt-engineer's capability 2x:
+- Continuous evaluation of our progress
+- Break code generation into small, verifiable steps
+- Run tests and fix errors with GPT4
+
+
+## Steps to achieve our roadmap
+
+- [ ] Continuous evaluation of our progress
+  - [ ] Create a step that asks “did it run/work/perfect” at the end of each run [#240](https://github.com/AntonOsika/gpt-engineer/issues/240)
+  - [ ] Run the benchmark multiple times, and document the results for the different "step configs" (`STEPS` in `steps.py`) [#239](https://github.com/AntonOsika/gpt-engineer/issues/239)
+  - [ ] Document the best performing configs, and feed these learnings into our roadmap
+  - [ ] Collect a dataset for gpt-engineer to learn from, by storing code generation runs and whether they fail/succeed (on an opt-out basis)
 - [ ] Self healing code
   - [ ] Feed the results of failing tests back into GPT4 and ask it to fix the code
 - [ ] Let human give feedback
-  - [ ] Ask human for what is not working as expected in a loop, and feed it into
-GPT4 to fix the code, until the human is happy or gives up
-- [ ] Break down the code generation into small parts
-  - [ ] For each small part, generate tests for each subpart, and do the loop of running the tests for each part, feeding
+  - [ ] Ask the human what is not working as expected in a loop, and feed it into GPT4 to fix the code, until the human is happy or gives up
+- [ ] Break code generation into small, verifiable steps
+  - [ ] Ask GPT4 to decide how to sequence the entire generation, and do one
+    prompt for each subcomponent
+  - [ ] For each small part, generate tests for that subpart, and do the loop of running the tests for each part, feeding
    results into GPT4, and let it edit the code until they pass
 - [ ] LLM tests in CI
   - [ ] Run very small tests with GPT3.5 in CI, to make sure we don't worsen
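To make the "Self healing code" and test-driven loop items above more concrete, here is a minimal Python sketch of what such a run/fix loop could look like. Everything in it is assumed for illustration: `ask_gpt4`, `run_tests`, the `main.py` path, and the pytest invocation are hypothetical placeholders, not functions or files from the gpt-engineer codebase or this PR.

```python
# Hypothetical sketch (not part of this PR) of a self-healing loop:
# run the tests, feed failures back to GPT4, and retry until they pass.
import subprocess


def ask_gpt4(prompt: str) -> str:
    """Placeholder for a chat-completion call that returns revised code."""
    raise NotImplementedError


def run_tests(workdir: str) -> tuple[bool, str]:
    """Run the project's test suite and return (passed, combined output)."""
    result = subprocess.run(
        ["pytest", "-x"], cwd=workdir, capture_output=True, text=True
    )
    return result.returncode == 0, result.stdout + result.stderr


def self_heal(code: str, workdir: str, max_attempts: int = 3) -> str:
    """Loop: test the generated code, ask GPT4 to fix it while tests fail."""
    for _ in range(max_attempts):
        passed, output = run_tests(workdir)
        if passed:
            break
        code = ask_gpt4(
            "These tests failed:\n" + output + "\n\nFix this code:\n" + code
        )
        # Write the revised code back to disk before the next test run.
        # The target file name is assumed for this sketch.
        with open(f"{workdir}/main.py", "w") as f:
            f.write(code)
    return code
```

The same loop shape would apply per subcomponent in the "small, verifiable steps" milestone: generate tests for that part, run them, and feed failures back to GPT4 until they pass.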