Remove submodule (#314)

Signed-off-by: Merwane Hamadi <merwanehamadi@gmail.com>
2026-02-14 10:44:20 +01:00 · 2023-08-16 14:57:52 -07:00
parent 277f3e4e4d
commit 82ed4a136a
211 changed files with 2737 additions and 9 deletions
--- a/.gitmodules
+++ b/.gitmodules
@@ -30,9 +30,6 @@
 	path = agent/PolyGPT
 	url = https://github.com/polywrap/PolyGPT.git
 	branch = nerfzael-use-local-wrap-library
-[submodule "agbenchmark/challenges"]
-	path = agbenchmark/challenges
-	url = https://github.com/agbenchmark/agent-evals.git
-[submodule "frontend"]
-	path = frontend
-	url = https://github.com/agbenchmark/agbenchmark-frontend.git
+[submodule "frontend"]
+	path = frontend
+	url = https://github.com/agbenchmark/agbenchmark-frontend.git
--- a/agbenchmark/challenges
+++ b/agbenchmark/challenges
--- a/agbenchmark/challenges/CHALLENGE.md
+++ b/agbenchmark/challenges/CHALLENGE.md
@@ -0,0 +1,85 @@
+# Challenges Data Schema of Benchmark
+
+## General challenges
+
+Input:
+
+- **name** (str): Name of the challenge.
+- **category** (str[]): Category of the challenge such as 'basic', 'retrieval', 'comprehension', etc. _This is not currently used, but it may be needed in the future._
+- **task** (str): The task that the agent needs to solve.
+- **dependencies** (str[]): The dependencies that the challenge needs to run. Needs to be the full node to the test function.
+- **ground** (dict): The ground truth.
+  - **answer** (str): The raw text of the ground truth answer.
+  - **should_contain** (list): The exact strings that are required in the final answer.
+  - **should_not_contain** (list): The exact strings that should not be in the final answer.
+  - **files** (list): Files that are used for retrieval. Can specify file here or an extension.
+- **mock** (dict): Mock response for testing.
+  - **mock_func** (str): Function to mock the agent's response. This is used for testing purposes.
+  - **mock_task** (str): Task to provide for the mock function.
+- **info** (dict): Additional info about the challenge.
+  - **difficulty** (str): The difficulty of this query.
+  - **description** (str): Description of the challenge.
+  - **side_effects** (str[]): Describes the effects of the challenge.
+
+Example:
+
+```json
+{
+  "category": ["basic"],
+  "task": "Print the capital of America to a .txt file",
+  "dependencies": ["TestWriteFile"], // the class name of the test
+  "ground": {
+    "answer": "Washington",
+    "should_contain": ["Washington"],
+    "should_not_contain": ["New York", "Los Angeles", "San Francisco"],
+    "files": [".txt"],
+    "eval": {
+      "type": "llm" or "file" or "python",
+      "scoring": "percentage" or "scale" or "binary", // only if the type is llm
+      "template": "rubric" or "reference" or "custom" // only if the type is llm
+    }
+  },
+  "info": {
+    "difficulty": "basic",
+    "description": "Tests the writing to file",
+    "side_effects": ["tests if there is in fact an LLM attached"]
+  }
+}
+```
+
+## Evals
+
+This is the method of evaluation for a challenge.
+
+### file
+
+This is the default method of evaluation. It compares the files specified in the "files" field against the "should_contain" and "should_not_contain" ground truths.
+
+### python
+
+This runs the Python code in the specified "files" and captures the print statements, which are then scored using the "should_contain" and "should_not_contain" ground truths.
+
+### llm
+
+This uses a language model to evaluate the answer.
+
+- There are 3 different templates: "rubric", "reference", and "custom". "rubric" will evaluate based on a rubric you provide in the "answer" field. "reference" will evaluate based on the ideal reference response in "answer". "custom" will not use any predefined scoring method; the prompt will be whatever you put in "answer".
+- The "scoring" field is used to determine how to score the answer. "percentage" will assign a percentage out of 100. "scale" will score the answer 1-10. "binary" will score the answer based on whether the answer is correct or not.
+- You can still use the "should_contain" and "should_not_contain" fields to directly match the answer along with the llm eval.
+
+## Add files to challenges:
+
+### artifacts_in
+
+This folder contains all the files you want the agent to have in its workspace BEFORE the challenge starts.
+
+### artifacts_out
+
+This folder contains all the files you would like the agent to generate. This folder is used to mock the agent.
+This allows us to run `agbenchmark start --test=TestExample --mock` and make sure our challenge actually works.
+
+### custom_python
+
+This folder contains files that will be copied into the agent's workspace and run after the challenge is completed.
+For example, we can put a test.py in it and run this file in the workspace to easily import code generated by the agent.
+Example: TestBasicCodeGeneration challenge.
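Review note on the `file` eval described in CHALLENGE.md: the matching rule can be pictured with the minimal sketch below. This is an illustration under assumptions, not agbenchmark's actual implementation; the function name `matches_ground` is hypothetical, and plain substring matching is assumed because the schema speaks of "exact strings".

```python
from typing import Dict, List


def matches_ground(content: str, ground: Dict[str, List[str]]) -> bool:
    """Return True if content has every required string and no forbidden one.

    Hypothetical helper: assumes plain substring matching against the
    "should_contain" and "should_not_contain" lists of a challenge's "ground".
    """
    required = ground.get("should_contain", [])
    forbidden = ground.get("should_not_contain", [])
    has_all_required = all(s in content for s in required)
    has_no_forbidden = not any(s in content for s in forbidden)
    return has_all_required and has_no_forbidden


# Ground truth taken from the example challenge in CHALLENGE.md.
ground = {
    "should_contain": ["Washington"],
    "should_not_contain": ["New York", "Los Angeles", "San Francisco"],
}
print(matches_ground("The capital is Washington", ground))
print(matches_ground("Washington, not New York", ground))
```

The first call satisfies both lists; the second fails because it contains a forbidden string even though the required one is present.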
--- a/agbenchmark/challenges/README.md
+++ b/agbenchmark/challenges/README.md
@@ -0,0 +1,13 @@
+# This is the official challenge library for https://github.com/Significant-Gravitas/Auto-GPT-Benchmarks
+
+The goal of this repo is to provide easy challenge creation for test-driven development with the Auto-GPT-Benchmarks package. This is essentially a library to craft challenges using a DSL (JSON files in this case).
+
+This is the up-to-date dependency graph: https://sapphire-denys-23.tiiny.site/
+
+### How to use
+
+Make sure you have the package installed with `pip install agbenchmark`.
+
+If you would just like to use the default challenges, don't worry about this repo. Just install the package and you will have access to the default challenges.
+
+To add new challenges as you develop, add this repo as a submodule to your `project/agbenchmark` folder. Any new challenges you add within the submodule will get registered automatically.
--- a/agbenchmark/challenges/SUITES.md
+++ b/agbenchmark/challenges/SUITES.md
@@ -0,0 +1,123 @@
+All tests within a suite folder must start with the prefix defined in `suite.json`. There are two types of suites.
+
+#### same_task
+
+If same_task is set to true, all of the `data.json` files are combined into one test. A single test runs, but multiple regression tests, internal_infos, dependencies, and reports are created. The artifacts_in/out and custom python should be in the suite folder, as they are shared between tests. **An example of this can be found in "agbenchmark/challenges/retrieval/r2_search_suite_1"**
+
+```json
+{
+  "same_task": true,
+  "prefix": "TestRevenueRetrieval",
+  "dependencies": ["TestBasicRetrieval"],
+  "cutoff": 60,
+  "task": "Write tesla's exact revenue in 2022 into a .txt file. Use the US notation, with a precision rounded to the nearest million dollars (for instance, $31,578 billion).",
+  "shared_category": ["retrieval"]
+}
+```
+
+The structure for a same_task report looks like this:
+
+```
+"TestRevenueRetrieval": {
+            "data_path": "agbenchmark/challenges/retrieval/r2_search_suite_1",
+            "task": "Write tesla's exact revenue in 2022 into a .txt file. Use the US notation, with a precision rounded to the nearest million dollars (for instance, $31,578 billion).",
+            "category": [
+                "retrieval"
+            ],
+            "metrics": {
+                "percentage": 100.0,
+                "highest_difficulty": "intermediate",
+                "run_time": "0.016 seconds"
+            },
+            "tests": {
+                "TestRevenueRetrieval_1.0": {
+                    "data_path": "agbenchmark/challenges/retrieval/r2_search_suite_1/1_tesla_revenue/data.json",
+                    "is_regression": false,
+                    "answer": "It was $81.462 billion in 2022.",
+                    "description": "A no guardrails search for info",
+                    "metrics": {
+                        "difficulty": "novice",
+                        "success": true,
+                        "success_%": 100.0
+                    }
+                },
+                "TestRevenueRetrieval_1.1": {
+                    "data_path": "agbenchmark/challenges/retrieval/r2_search_suite_1/2_specific/data.json",
+                    "is_regression": false,
+                    "answer": "It was $81.462 billion in 2022.",
+                    "description": "This one checks the accuracy of the information over r2",
+                    "metrics": {
+                        "difficulty": "novice",
+                        "success": true,
+                        "success_%": 0
+                    }
+                },
+            },
+            "reached_cutoff": false
+        },
+```
+
+#### non same_task
+
+If same_task is set to false, the main functionality added is being able to run via the --suite flag, and the ability to run the tests in reverse order (can't work). This should also generate a single report similar to the one above, including a percentage.
+
+```json
+{
+  "same_task": false,
+  "reverse_order": true,
+  "prefix": "TestReturnCode"
+}
+```
+
+The structure for a non same_task report looks like this:
+
+```
+"TestReturnCode": {
+            "data_path": "agbenchmark/challenges/code/c1_writing_suite_1",
+            "metrics": {
+                "percentage": 0.0,
+                "highest_difficulty": "No successful tests",
+                "run_time": "15.972 seconds"
+            },
+            "tests": {
+                "TestReturnCode_Simple": {
+                    "data_path": "agbenchmark/challenges/code/c1_writing_suite_1/1_return/data.json",
+                    "is_regression": false,
+                    "category": [
+                        "code",
+                        "iterate"
+                    ],
+                    "task": "Return the multiplied number in the function multiply_int in code.py. You can make sure you have correctly done this by running test.py",
+                    "answer": "Just a simple multiple by 2 function. Num is 4 so answer is 8",
+                    "description": "Simple test if a simple code instruction can be executed",
+                    "metrics": {
+                        "difficulty": "basic",
+                        "success": false,
+                        "fail_reason": "assert 1 in [0.0]",
+                        "success_%": 0.0,
+                        "run_time": "15.96 seconds"
+                    },
+                    "reached_cutoff": false
+                },
+                "TestReturnCode_Write": {
+                    "data_path": "agbenchmark/challenges/code/c1_writing_suite_1/2_write/data.json",
+                    "is_regression": false,
+                    "category": [
+                        "code",
+                        "iterate"
+                    ],
+                    "task": "Add a function called multiply_int in code.py that multiplies numbers by 2. You can make sure you have correctly done this by running test.py",
+                    "answer": "Just a simple multiple by 2 function. Num is 4 so answer is 8",
+                    "description": "Small step up, just writing the function with a name as well as the return statement.",
+                    "metrics": {
+                        "difficulty": "novice",
+                        "success": false,
+                        "fail_reason": "agbenchmark/challenges/test_all.py::TestReturnCode_Write::test_method[challenge_data0] depends on agbenchmark/challenges/test_all.py::TestReturnCode_Simple::test_method[challenge_data0]",
+                        "success_%": 0.0,
+                        "run_time": "0.004 seconds"
+                    },
+                    "reached_cutoff": false
+                },
+            }
+        }
+```
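Review note on the prefix rule stated at the top of SUITES.md: the check can be sketched as below. This is only an illustration under assumptions; `names_missing_prefix` is a hypothetical helper, not part of agbenchmark, and the last challenge name in the sample list is deliberately taken from outside the suite to show a violation.

```python
from typing import Dict, List


def names_missing_prefix(suite: Dict[str, object], names: List[str]) -> List[str]:
    """Return the challenge names that do not start with the suite's prefix.

    Hypothetical helper: `suite` is the parsed content of a suite.json file.
    """
    prefix = str(suite["prefix"])
    return [name for name in names if not name.startswith(prefix)]


# suite.json content from c1_writing_suite_1 in this commit.
suite = {"same_task": False, "reverse_order": True, "prefix": "TestReturnCode"}
names = ["TestReturnCode_Simple", "TestReturnCode_Write", "TestAdaptLink"]
print(names_missing_prefix(suite, names))
```

An empty result would mean every challenge in the folder follows the naming rule.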
--- a/agbenchmark/challenges/init.py
+++ b/agbenchmark/challenges/init.py
--- a/agbenchmark/challenges/adapatability/a1_debug/artifacts_in/init.py
+++ b/agbenchmark/challenges/adapatability/a1_debug/artifacts_in/init.py
--- a/agbenchmark/challenges/adapatability/a1_debug/artifacts_in/code.py
+++ b/agbenchmark/challenges/adapatability/a1_debug/artifacts_in/code.py
@@ -0,0 +1,13 @@
+# mypy: ignore-errors
+from typing import List, Optional
+
+
+def two_sum(nums: List, target: int) -> Optional[List[int]]:
+    seen = {}
+    for i, num in enumerate(nums):
+        typo
+        complement = target - num
+        if complement in seen:
+            return [seen[complement], i]
+        seen[num] = i
+    return None
--- a/agbenchmark/challenges/adapatability/a1_debug/artifacts_in/test.py
+++ b/agbenchmark/challenges/adapatability/a1_debug/artifacts_in/test.py
@@ -0,0 +1,31 @@
+# mypy: ignore-errors
+from code import two_sum
+from typing import List
+
+
+def test_two_sum(nums: List, target: int, expected_result: List[int]) -> None:
+    result = two_sum(nums, target)
+    print(result)
+    assert (
+        result == expected_result
+    ), f"AssertionError: Expected the output to be {expected_result}"
+
+
+if __name__ == "__main__":
+    # test the trivial case with the first two numbers
+    nums = [2, 7, 11, 15]
+    target = 9
+    expected_result = [0, 1]
+    test_two_sum(nums, target, expected_result)
+
+    # test for ability to use zero and the same number twice
+    nums = [2, 7, 0, 15, 12, 0]
+    target = 0
+    expected_result = [2, 5]
+    test_two_sum(nums, target, expected_result)
+
+    # test for first and last index usage and negative numbers
+    nums = [-6, 7, 11, 4]
+    target = -2
+    expected_result = [0, 3]
+    test_two_sum(nums, target, expected_result)
--- a/agbenchmark/challenges/adapatability/a1_debug/artifacts_out/init.py
+++ b/agbenchmark/challenges/adapatability/a1_debug/artifacts_out/init.py
--- a/agbenchmark/challenges/adapatability/a1_debug/artifacts_out/code.py
+++ b/agbenchmark/challenges/adapatability/a1_debug/artifacts_out/code.py
@@ -0,0 +1,12 @@
+# mypy: ignore-errors
+from typing import List, Optional
+
+
+def two_sum(nums: List, target: int) -> Optional[List[int]]:
+    seen = {}
+    for i, num in enumerate(nums):
+        complement = target - num
+        if complement in seen:
+            return [seen[complement], i]
+        seen[num] = i
+    return None
--- a/agbenchmark/challenges/adapatability/a1_debug/artifacts_out/test.py
+++ b/agbenchmark/challenges/adapatability/a1_debug/artifacts_out/test.py
@@ -0,0 +1,31 @@
+# mypy: ignore-errors
+from code import two_sum
+from typing import List
+
+
+def test_two_sum(nums: List, target: int, expected_result: List[int]) -> None:
+    result = two_sum(nums, target)
+    print(result)
+    assert (
+        result == expected_result
+    ), f"AssertionError: Expected the output to be {expected_result}"
+
+
+if __name__ == "__main__":
+    # test the trivial case with the first two numbers
+    nums = [2, 7, 11, 15]
+    target = 9
+    expected_result = [0, 1]
+    test_two_sum(nums, target, expected_result)
+
+    # test for ability to use zero and the same number twice
+    nums = [2, 7, 0, 15, 12, 0]
+    target = 0
+    expected_result = [2, 5]
+    test_two_sum(nums, target, expected_result)
+
+    # test for first and last index usage and negative numbers
+    nums = [-6, 7, 11, 4]
+    target = -2
+    expected_result = [0, 3]
+    test_two_sum(nums, target, expected_result)
--- a/agbenchmark/challenges/adapatability/a1_debug/data.json
+++ b/agbenchmark/challenges/adapatability/a1_debug/data.json
@@ -0,0 +1,21 @@
+{
+  "name": "TestAdaptSimpleTypoWithGuidance",
+  "category": ["adaptability"],
+  "task": "1- Run code.py.\n2- Read test.py.\n3- Modify test.py.\nRepeat step 1, 2 and 3 until code.py runs without errors.\n",
+  "dependencies": ["TestDebugSimpleTypoWithGuidance"],
+  "cutoff": 75,
+  "ground": {
+    "answer": "Same as the original DebugWithGuidance test - [0, 1] [2, 5] [0, 3]",
+    "should_contain": ["[0, 1]", "[2, 5]", "[0, 3]"],
+    "should_not_contain": [],
+    "files": ["test.py"],
+    "eval": {
+      "type": "python"
+    }
+  },
+  "info": {
+    "difficulty": "intermediate",
+    "description": "A version of the debug challenge where the wrong files to test and change are given. The llm must figure the challenges out",
+    "side_effects": []
+  }
+}
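Review note on the `python` eval used by this challenge: CHALLENGE.md says the print statements are captured and scored against "should_contain". A minimal sketch of that idea follows, assuming the output is captured from standard output and matched by substring. The helper `captured_output` is hypothetical, and `two_sum` is copied from the fixed artifacts_out/code.py so the block is self-contained; how agbenchmark actually runs test.py is not shown in this diff.

```python
import io
from contextlib import redirect_stdout
from typing import Dict, List, Optional


def two_sum(nums: List[int], target: int) -> Optional[List[int]]:
    """Fixed version of the function, as in artifacts_out/code.py."""
    seen: Dict[int, int] = {}
    for i, num in enumerate(nums):
        complement = target - num
        if complement in seen:
            return [seen[complement], i]
        seen[num] = i
    return None


def captured_output() -> str:
    """Run the three cases from test.py and return everything printed."""
    buffer = io.StringIO()
    with redirect_stdout(buffer):
        print(two_sum([2, 7, 11, 15], 9))
        print(two_sum([2, 7, 0, 15, 12, 0], 0))
        print(two_sum([-6, 7, 11, 4], -2))
    return buffer.getvalue()


# "should_contain" values from this challenge's data.json.
should_contain = ["[0, 1]", "[2, 5]", "[0, 3]"]
output = captured_output()
print(all(expected in output for expected in should_contain))
```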
--- a/agbenchmark/challenges/adapatability/a2_tesla_revenue/artifacts_out/random_file.txt
+++ b/agbenchmark/challenges/adapatability/a2_tesla_revenue/artifacts_out/random_file.txt
@@ -0,0 +1 @@
+81,462 Millions
--- a/agbenchmark/challenges/adapatability/a2_tesla_revenue/data.json
+++ b/agbenchmark/challenges/adapatability/a2_tesla_revenue/data.json
@@ -0,0 +1,21 @@
+{
+  "name": "TestAdaptTeslaRevenue",
+  "category": ["adaptability"],
+  "task": "I'm a financial planner, please help me write tesla's r in 2022 into a .txt file.",
+  "cutoff": 60,
+  "dependencies": ["TestRevenueRetrieval_1.0"],
+  "ground": {
+    "answer": "It was $81.462 billion in 2022.",
+    "should_contain": ["81"],
+    "should_not_contain": [],
+    "files": [".txt"],
+    "eval": {
+      "type": "file"
+    }
+  },
+  "info": {
+    "difficulty": "intermediate",
+    "description": "Given the context, adapt to the missing word 'revenue'",
+    "side_effects": []
+  }
+}
--- a/agbenchmark/challenges/adapatability/a3_book_price/artifacts_out/random_file.txt
+++ b/agbenchmark/challenges/adapatability/a3_book_price/artifacts_out/random_file.txt
@@ -0,0 +1 @@
+25.89
--- a/agbenchmark/challenges/adapatability/a3_book_price/data.json
+++ b/agbenchmark/challenges/adapatability/a3_book_price/data.json
@@ -0,0 +1,21 @@
+{
+  "name": "TestAdaptLink",
+  "category": ["adaptability"],
+  "task": "Write the price of the book in this url 'books.toscrape.com/catalogue/meditations_33/index.html' to a .txt file.",
+  "dependencies": ["TestBasicRetrieval"],
+  "cutoff": 60,
+  "ground": {
+    "answer": "£25.89",
+    "should_contain": ["25.89"],
+    "should_not_contain": [],
+    "files": [".txt"],
+    "eval": {
+      "type": "file"
+    }
+  },
+  "info": {
+    "difficulty": "novice",
+    "description": "Same as TestBasicRetrieval, but the link is slightly broken: it is supposed to have http:// at the start. An advanced version of this would remove .html as well.",
+    "side_effects": []
+  }
+}
--- a/agbenchmark/challenges/code/c1_writing_suite_1/1_return/artifacts_in/init.py
+++ b/agbenchmark/challenges/code/c1_writing_suite_1/1_return/artifacts_in/init.py
@@ -0,0 +1 @@
+# mypy: ignore-errors
--- a/agbenchmark/challenges/code/c1_writing_suite_1/1_return/artifacts_in/code.py
+++ b/agbenchmark/challenges/code/c1_writing_suite_1/1_return/artifacts_in/code.py
@@ -0,0 +1,5 @@
+# mypy: ignore-errors
+
+
+def multiply_int(num: int) -> int:
+    multiplied_num = num * 2
--- a/agbenchmark/challenges/code/c1_writing_suite_1/1_return/artifacts_in/test.py
+++ b/agbenchmark/challenges/code/c1_writing_suite_1/1_return/artifacts_in/test.py
@@ -0,0 +1,17 @@
+# mypy: ignore-errors
+from code import multiply_int
+
+
+def test_multiply_int(num: int, expected_result: int) -> None:
+    result = multiply_int(num)
+    print(result)
+    assert (
+        result == expected_result
+    ), f"AssertionError: Expected the output to be {expected_result}"
+
+
+if __name__ == "__main__":
+    # test the trivial case
+    num = 4
+    expected_result = 8
+    test_multiply_int(num, expected_result)
--- a/agbenchmark/challenges/code/c1_writing_suite_1/1_return/artifacts_out/init.py
+++ b/agbenchmark/challenges/code/c1_writing_suite_1/1_return/artifacts_out/init.py
@@ -0,0 +1 @@
+# mypy: ignore-errors
--- a/agbenchmark/challenges/code/c1_writing_suite_1/1_return/artifacts_out/code.py
+++ b/agbenchmark/challenges/code/c1_writing_suite_1/1_return/artifacts_out/code.py
@@ -0,0 +1,6 @@
+# mypy: ignore-errors
+
+
+def multiply_int(num: int) -> int:
+    multiplied_num = num * 2
+    return multiplied_num
--- a/agbenchmark/challenges/code/c1_writing_suite_1/1_return/artifacts_out/test.py
+++ b/agbenchmark/challenges/code/c1_writing_suite_1/1_return/artifacts_out/test.py
@@ -0,0 +1,17 @@
+# mypy: ignore-errors
+from code import multiply_int
+
+
+def test_multiply_int(num: int, expected_result: int) -> None:
+    result = multiply_int(num)
+    print(result)
+    assert (
+        result == expected_result
+    ), f"AssertionError: Expected the output to be {expected_result}"
+
+
+if __name__ == "__main__":
+    # test the trivial case
+    num = 4
+    expected_result = 8
+    test_multiply_int(num, expected_result)
--- a/agbenchmark/challenges/code/c1_writing_suite_1/1_return/data.json
+++ b/agbenchmark/challenges/code/c1_writing_suite_1/1_return/data.json
@@ -0,0 +1,21 @@
+{
+  "name": "TestReturnCode_Simple",
+  "category": ["code", "iterate"],
+  "task": "Return the multiplied number in the function multiply_int in code.py. You can make sure you have correctly done this by running test.py",
+  "dependencies": ["TestReadFile"],
+  "cutoff": 120,
+  "ground": {
+    "answer": "Just a simple multiple by 2 function. Num is 4 so answer is 8",
+    "should_contain": ["8"],
+    "should_not_contain": [],
+    "files": ["test.py"],
+    "eval": {
+      "type": "python"
+    }
+  },
+  "info": {
+    "difficulty": "basic",
+    "description": "Simple test if a simple code instruction can be executed",
+    "side_effects": []
+  }
+}
--- a/agbenchmark/challenges/code/c1_writing_suite_1/2_write/artifacts_in/init.py
+++ b/agbenchmark/challenges/code/c1_writing_suite_1/2_write/artifacts_in/init.py
@@ -0,0 +1 @@
+# mypy: ignore-errors
--- a/agbenchmark/challenges/code/c1_writing_suite_1/2_write/artifacts_in/code.py
+++ b/agbenchmark/challenges/code/c1_writing_suite_1/2_write/artifacts_in/code.py
@@ -0,0 +1 @@
+# mypy: ignore-errors
--- a/agbenchmark/challenges/code/c1_writing_suite_1/2_write/artifacts_in/test.py
+++ b/agbenchmark/challenges/code/c1_writing_suite_1/2_write/artifacts_in/test.py
@@ -0,0 +1,17 @@
+# mypy: ignore-errors
+from code import multiply_int
+
+
+def test_multiply_int(num: int, expected_result: int) -> None:
+    result = multiply_int(num)
+    print(result)
+    assert (
+        result == expected_result
+    ), f"AssertionError: Expected the output to be {expected_result}"
+
+
+if __name__ == "__main__":
+    # test the trivial case
+    num = 4
+    expected_result = 8
+    test_multiply_int(num, expected_result)
--- a/agbenchmark/challenges/code/c1_writing_suite_1/2_write/artifacts_out/init.py
+++ b/agbenchmark/challenges/code/c1_writing_suite_1/2_write/artifacts_out/init.py
@@ -0,0 +1 @@
+# mypy: ignore-errors
--- a/agbenchmark/challenges/code/c1_writing_suite_1/2_write/artifacts_out/code.py
+++ b/agbenchmark/challenges/code/c1_writing_suite_1/2_write/artifacts_out/code.py
@@ -0,0 +1,6 @@
+# mypy: ignore-errors
+
+
+def multiply_int(num: int) -> int:
+    multiplied_num = num * 2
+    return multiplied_num
--- a/agbenchmark/challenges/code/c1_writing_suite_1/2_write/artifacts_out/test.py
+++ b/agbenchmark/challenges/code/c1_writing_suite_1/2_write/artifacts_out/test.py
@@ -0,0 +1,17 @@
+# mypy: ignore-errors
+from code import multiply_int
+
+
+def test_multiply_int(num: int, expected_result: int) -> None:
+    result = multiply_int(num)
+    print(result)
+    assert (
+        result == expected_result
+    ), f"AssertionError: Expected the output to be {expected_result}"
+
+
+if __name__ == "__main__":
+    # test the trivial case
+    num = 4
+    expected_result = 8
+    test_multiply_int(num, expected_result)
--- a/agbenchmark/challenges/code/c1_writing_suite_1/2_write/data.json
+++ b/agbenchmark/challenges/code/c1_writing_suite_1/2_write/data.json
@@ -0,0 +1,21 @@
+{
+  "name": "TestReturnCode_Write",
+  "category": ["code", "iterate"],
+  "task": "Add a function called multiply_int in code.py that multiplies numbers by 2. You can make sure you have correctly done this by running test.py",
+  "dependencies": ["TestReturnCode_Simple"],
+  "cutoff": 120,
+  "ground": {
+    "answer": "Just a simple multiple by 2 function. Num is 4 so answer is 8",
+    "should_contain": ["8"],
+    "should_not_contain": [],
+    "files": ["test.py"],
+    "eval": {
+      "type": "python"
+    }
+  },
+  "info": {
+    "difficulty": "novice",
+    "description": "Small step up, just writing the function with a name as well as the return statement.",
+    "side_effects": []
+  }
+}
--- a/agbenchmark/challenges/code/c1_writing_suite_1/3_modify/artifacts_in/init.py
+++ b/agbenchmark/challenges/code/c1_writing_suite_1/3_modify/artifacts_in/init.py
--- a/agbenchmark/challenges/code/c1_writing_suite_1/3_modify/artifacts_in/code.py
+++ b/agbenchmark/challenges/code/c1_writing_suite_1/3_modify/artifacts_in/code.py
@@ -0,0 +1,6 @@
+# mypy: ignore-errors
+
+
+def multiply_int(num: int) -> int:
+    multiplied_num = num * 2
+    return multiplied_num
--- a/agbenchmark/challenges/code/c1_writing_suite_1/3_modify/artifacts_in/test.py
+++ b/agbenchmark/challenges/code/c1_writing_suite_1/3_modify/artifacts_in/test.py
@@ -0,0 +1,30 @@
+# mypy: ignore-errors
+from code import multiply_int
+
+
+def test_multiply_int(num: int, multiplier, expected_result: int) -> None:
+    result = multiply_int(num, multiplier)
+    print(result)
+    assert (
+        result == expected_result
+    ), f"AssertionError: Expected the output to be {expected_result}"
+
+
+if __name__ == "__main__":
+    # test the trivial case
+    num = 4
+    multiplier = 2
+    expected_result = 8
+    test_multiply_int(num, multiplier, expected_result)
+
+    # so its not hard coded
+    num = 7
+    multiplier = 7
+    expected_result = 49
+    test_multiply_int(num, multiplier, expected_result)
+
+    # negative numbers
+    num = -6
+    multiplier = 2
+    expected_result = -12
+    test_multiply_int(num, multiplier, expected_result)
--- a/agbenchmark/challenges/code/c1_writing_suite_1/3_modify/artifacts_out/init.py
+++ b/agbenchmark/challenges/code/c1_writing_suite_1/3_modify/artifacts_out/init.py
--- a/agbenchmark/challenges/code/c1_writing_suite_1/3_modify/artifacts_out/code.py
+++ b/agbenchmark/challenges/code/c1_writing_suite_1/3_modify/artifacts_out/code.py
@@ -0,0 +1,6 @@
+# mypy: ignore-errors
+
+
+def multiply_int(num: int, multiplier: int) -> int:
+    multiplied_num = num * multiplier
+    return multiplied_num
--- a/agbenchmark/challenges/code/c1_writing_suite_1/3_modify/artifacts_out/test.py
+++ b/agbenchmark/challenges/code/c1_writing_suite_1/3_modify/artifacts_out/test.py
@@ -0,0 +1,30 @@
+# mypy: ignore-errors
+from code import multiply_int
+
+
+def test_multiply_int(num: int, multiplier, expected_result: int) -> None:
+    result = multiply_int(num, multiplier)
+    print(result)
+    assert (
+        result == expected_result
+    ), f"AssertionError: Expected the output to be {expected_result}"
+
+
+if __name__ == "__main__":
+    # test the trivial case
+    num = 4
+    multiplier = 2
+    expected_result = 8
+    test_multiply_int(num, multiplier, expected_result)
+
+    # so its not hard coded
+    num = 7
+    multiplier = 7
+    expected_result = 49
+    test_multiply_int(num, multiplier, expected_result)
+
+    # negative numbers
+    num = -6
+    multiplier = 2
+    expected_result = -12
+    test_multiply_int(num, multiplier, expected_result)
--- a/agbenchmark/challenges/code/c1_writing_suite_1/3_modify/data.json
+++ b/agbenchmark/challenges/code/c1_writing_suite_1/3_modify/data.json
@@ -0,0 +1,21 @@
+{
+  "name": "TestReturnCode_Modify",
+  "category": ["code", "iterate"],
+  "task": "Modify the multiply_int function in code.py to be able to pass in a 'multiplier' argument to multiply the 'num' by 'multiplier'. Both arguments are integers. You can make sure you have correctly done this by running test.py",
+  "dependencies": ["TestReturnCode_Write"],
+  "cutoff": 120,
+  "ground": {
+    "answer": "def multiply_int(num, multiplier):\n    return num * multiplier\n",
+    "should_contain": ["8", "49", "-12"],
+    "should_not_contain": [],
+    "files": ["test.py"],
+    "eval": {
+      "type": "python"
+    }
+  },
+  "info": {
+    "difficulty": "intermediate",
+    "description": "Builds on the previous function to also take a multiplier.",
+    "side_effects": []
+  }
+}
--- a/agbenchmark/challenges/code/c1_writing_suite_1/4_tests/artifacts_in/init.py
+++ b/agbenchmark/challenges/code/c1_writing_suite_1/4_tests/artifacts_in/init.py
--- a/agbenchmark/challenges/code/c1_writing_suite_1/4_tests/artifacts_in/code.py
+++ b/agbenchmark/challenges/code/c1_writing_suite_1/4_tests/artifacts_in/code.py
@@ -0,0 +1,6 @@
+# mypy: ignore-errors
+
+
+def multiply_int(num: int) -> int:
+    multiplied_num = num * 2
+    return multiplied_num
--- a/agbenchmark/challenges/code/c1_writing_suite_1/4_tests/artifacts_in/test.py
+++ b/agbenchmark/challenges/code/c1_writing_suite_1/4_tests/artifacts_in/test.py
@@ -0,0 +1,18 @@
+# mypy: ignore-errors
+from code import multiply_int
+
+
+def test_multiply_int(num: int, multiplier, expected_result: int) -> None:
+    result = multiply_int(num, multiplier)
+    print(result)
+    assert (
+        result == expected_result
+    ), f"AssertionError: Expected the output to be {expected_result}"
+
+
+if __name__ == "__main__":
+    # create a trivial test that has 4 as the num, and 2 as the multiplier. Make sure to fill in the expected result
+    num =
+    multiplier = 
+    expected_result = 
+    test_multiply_int()
--- a/agbenchmark/challenges/code/c1_writing_suite_1/4_tests/artifacts_out/init.py
+++ b/agbenchmark/challenges/code/c1_writing_suite_1/4_tests/artifacts_out/init.py
--- a/agbenchmark/challenges/code/c1_writing_suite_1/4_tests/artifacts_out/code.py
+++ b/agbenchmark/challenges/code/c1_writing_suite_1/4_tests/artifacts_out/code.py
@@ -0,0 +1,6 @@
+# mypy: ignore-errors
+
+
+def multiply_int(num: int, multiplier: int) -> int:
+    multiplied_num = num * multiplier
+    return multiplied_num
--- a/agbenchmark/challenges/code/c1_writing_suite_1/4_tests/artifacts_out/test.py
+++ b/agbenchmark/challenges/code/c1_writing_suite_1/4_tests/artifacts_out/test.py
@@ -0,0 +1,30 @@
+# mypy: ignore-errors
+from code import multiply_int
+
+
+def test_multiply_int(num: int, multiplier, expected_result: int) -> None:
+    result = multiply_int(num, multiplier)
+    print(result)
+    assert (
+        result == expected_result
+    ), f"AssertionError: Expected the output to be {expected_result}"
+
+
+if __name__ == "__main__":
+    # test the trivial case
+    num = 4
+    multiplier = 2
+    expected_result = 8
+    test_multiply_int(num, multiplier, expected_result)
+
+    # so its not hard coded
+    num = 7
+    multiplier = 7
+    expected_result = 49
+    test_multiply_int(num, multiplier, expected_result)
+
+    # negative numbers
+    num = -6
+    multiplier = 2
+    expected_result = -12
+    test_multiply_int(num, multiplier, expected_result)
--- a/agbenchmark/challenges/code/c1_writing_suite_1/4_tests/data.json
+++ b/agbenchmark/challenges/code/c1_writing_suite_1/4_tests/data.json
@@ -0,0 +1,21 @@
+{
+  "name": "TestReturnCode_Tests",
+  "category": ["code", "iterate"],
+  "task": "First, modify test.py to fill in the test case to be able to test the code in code.py. Next, modify the multiply_int function in code.py to be able to pass in a 'multiplier' argument to multiply the 'num' by 'multiplier'. Both arguments are integers. You can make sure you have correctly done this by running test.py that you previously modified.",
+  "dependencies": ["TestReturnCode_Modify"],
+  "cutoff": 120,
+  "ground": {
+    "answer": "Just a simple multiple by 2 function. Num is 4 so answer is 8",
+    "should_contain": ["8", "49", "-12"],
+    "should_not_contain": [],
+    "files": ["test.py"],
+    "eval": {
+      "type": "python"
+    }
+  },
+  "info": {
+    "difficulty": "advanced",
+    "description": "Small step up, just writing the function with a name as well as the return statement.",
+    "side_effects": []
+  }
+}
--- a/agbenchmark/challenges/code/c1_writing_suite_1/suite.json
+++ b/agbenchmark/challenges/code/c1_writing_suite_1/suite.json
@@ -0,0 +1,5 @@
+{
+  "same_task": false,
+  "reverse_order": true,
+  "prefix": "TestReturnCode"
+}
--- a/agbenchmark/challenges/code/c2_debug_suite/d2.1_guided/artifacts_in/init.py
+++ b/agbenchmark/challenges/code/c2_debug_suite/d2.1_guided/artifacts_in/init.py
--- a/agbenchmark/challenges/code/c2_debug_suite/d2.1_guided/artifacts_in/code.py
+++ b/agbenchmark/challenges/code/c2_debug_suite/d2.1_guided/artifacts_in/code.py
@@ -0,0 +1,13 @@
+# mypy: ignore-errors
+from typing import List, Optional
+
+
+def two_sum(nums: List, target: int) -> Optional[List[int]]:
+    seen = {}
+    for i, num in enumerate(nums):
+        typo
+        complement = target - num
+        if complement in seen:
+            return [seen[complement], i]
+        seen[num] = i
+    return None
--- a/agbenchmark/challenges/code/c2_debug_suite/d2.1_guided/artifacts_in/test.py
+++ b/agbenchmark/challenges/code/c2_debug_suite/d2.1_guided/artifacts_in/test.py
@@ -0,0 +1,31 @@
+# mypy: ignore-errors
+from code import two_sum
+from typing import List
+
+
+def test_two_sum(nums: List, target: int, expected_result: List[int]) -> None:
+    result = two_sum(nums, target)
+    print(result)
+    assert (
+        result == expected_result
+    ), f"AssertionError: Expected the output to be {expected_result}"
+
+
+if __name__ == "__main__":
+    # test the trivial case with the first two numbers
+    nums = [2, 7, 11, 15]
+    target = 9
+    expected_result = [0, 1]
+    test_two_sum(nums, target, expected_result)
+
+    # test for ability to use zero and the same number twice
+    nums = [2, 7, 0, 15, 12, 0]
+    target = 0
+    expected_result = [2, 5]
+    test_two_sum(nums, target, expected_result)
+
+    # test for first and last index usage and negative numbers
+    nums = [-6, 7, 11, 4]
+    target = -2
+    expected_result = [0, 3]
+    test_two_sum(nums, target, expected_result)
--- a/agbenchmark/challenges/code/c2_debug_suite/d2.1_guided/artifacts_out/init.py
+++ b/agbenchmark/challenges/code/c2_debug_suite/d2.1_guided/artifacts_out/init.py
--- a/agbenchmark/challenges/code/c2_debug_suite/d2.1_guided/artifacts_out/code.py
+++ b/agbenchmark/challenges/code/c2_debug_suite/d2.1_guided/artifacts_out/code.py
@@ -0,0 +1,12 @@
+# mypy: ignore-errors
+from typing import List, Optional
+
+
+def two_sum(nums: List, target: int) -> Optional[List[int]]:
+    seen = {}
+    for i, num in enumerate(nums):
+        complement = target - num
+        if complement in seen:
+            return [seen[complement], i]
+        seen[num] = i
+    return None
--- a/agbenchmark/challenges/code/c2_debug_suite/d2.1_guided/artifacts_out/test.py
+++ b/agbenchmark/challenges/code/c2_debug_suite/d2.1_guided/artifacts_out/test.py
@@ -0,0 +1,31 @@
+# mypy: ignore-errors
+from code import two_sum
+from typing import List
+
+
+def test_two_sum(nums: List, target: int, expected_result: List[int]) -> None:
+    result = two_sum(nums, target)
+    print(result)
+    assert (
+        result == expected_result
+    ), f"AssertionError: Expected the output to be {expected_result}"
+
+
+if __name__ == "__main__":
+    # test the trivial case with the first two numbers
+    nums = [2, 7, 11, 15]
+    target = 9
+    expected_result = [0, 1]
+    test_two_sum(nums, target, expected_result)
+
+    # test for ability to use zero and the same number twice
+    nums = [2, 7, 0, 15, 12, 0]
+    target = 0
+    expected_result = [2, 5]
+    test_two_sum(nums, target, expected_result)
+
+    # test for first and last index usage and negative numbers
+    nums = [-6, 7, 11, 4]
+    target = -2
+    expected_result = [0, 3]
+    test_two_sum(nums, target, expected_result)
--- a/agbenchmark/challenges/code/c2_debug_suite/d2.1_guided/data.json
+++ b/agbenchmark/challenges/code/c2_debug_suite/d2.1_guided/data.json
@@ -0,0 +1,21 @@
+{
+  "name": "TestDebugSimpleTypoWithGuidance",
+  "category": ["code", "iterate"],
+  "task": "1- Run test.py.\n2- Read code.py.\n3- Modify code.py.\nRepeat steps 1, 2 and 3 until test.py runs without errors.\n",
+  "dependencies": ["TestReadFile"],
+  "cutoff": 75,
+  "ground": {
+    "answer": "[0, 1] [2, 5] [0, 3]",
+    "should_contain": ["[0, 1]", "[2, 5]", "[0, 3]"],
+    "should_not_contain": [],
+    "files": ["test.py"],
+    "eval": {
+      "type": "python"
+    }
+  },
+  "info": {
+    "difficulty": "novice",
+    "description": "Tests ability for the agent to debug python code with a simple typo in it.",
+    "side_effects": []
+  }
+}
--- a/agbenchmark/challenges/code/c2_debug_suite/d2.2_vague/artifacts_in/init.py
+++ b/agbenchmark/challenges/code/c2_debug_suite/d2.2_vague/artifacts_in/init.py
--- a/agbenchmark/challenges/code/c2_debug_suite/d2.2_vague/artifacts_in/code.py
+++ b/agbenchmark/challenges/code/c2_debug_suite/d2.2_vague/artifacts_in/code.py
@@ -0,0 +1,13 @@
+# mypy: ignore-errors
+from typing import List, Optional
+
+
+def two_sum(nums: List, target: int) -> Optional[List[int]]:
+    seen = {}
+    for i, num in enumerate(nums):
+        typo
+        complement = target - num
+        if complement in seen:
+            return [seen[complement], i]
+        seen[num] = i
+    return None
--- a/agbenchmark/challenges/code/c2_debug_suite/d2.2_vague/artifacts_in/test.py
+++ b/agbenchmark/challenges/code/c2_debug_suite/d2.2_vague/artifacts_in/test.py
@@ -0,0 +1,31 @@
+# mypy: ignore-errors
+from code import two_sum
+from typing import List
+
+
+def test_two_sum(nums: List, target: int, expected_result: List[int]) -> None:
+    result = two_sum(nums, target)
+    print(result)
+    assert (
+        result == expected_result
+    ), f"AssertionError: Expected the output to be {expected_result}"
+
+
+if __name__ == "__main__":
+    # test the trivial case with the first two numbers
+    nums = [2, 7, 11, 15]
+    target = 9
+    expected_result = [0, 1]
+    test_two_sum(nums, target, expected_result)
+
+    # test for ability to use zero and the same number twice
+    nums = [2, 7, 0, 15, 12, 0]
+    target = 0
+    expected_result = [2, 5]
+    test_two_sum(nums, target, expected_result)
+
+    # test for first and last index usage and negative numbers
+    nums = [-6, 7, 11, 4]
+    target = -2
+    expected_result = [0, 3]
+    test_two_sum(nums, target, expected_result)
--- a/agbenchmark/challenges/code/c2_debug_suite/d2.2_vague/artifacts_out/init.py
+++ b/agbenchmark/challenges/code/c2_debug_suite/d2.2_vague/artifacts_out/init.py
--- a/agbenchmark/challenges/code/c2_debug_suite/d2.2_vague/artifacts_out/code.py
+++ b/agbenchmark/challenges/code/c2_debug_suite/d2.2_vague/artifacts_out/code.py
@@ -0,0 +1,12 @@
+# mypy: ignore-errors
+from typing import List, Optional
+
+
+def two_sum(nums: List, target: int) -> Optional[List[int]]:
+    seen = {}
+    for i, num in enumerate(nums):
+        complement = target - num
+        if complement in seen:
+            return [seen[complement], i]
+        seen[num] = i
+    return None
--- a/agbenchmark/challenges/code/c2_debug_suite/d2.2_vague/artifacts_out/test.py
+++ b/agbenchmark/challenges/code/c2_debug_suite/d2.2_vague/artifacts_out/test.py
@@ -0,0 +1,31 @@
+# mypy: ignore-errors
+from code import two_sum
+from typing import List
+
+
+def test_two_sum(nums: List, target: int, expected_result: List[int]) -> None:
+    result = two_sum(nums, target)
+    print(result)
+    assert (
+        result == expected_result
+    ), f"AssertionError: Expected the output to be {expected_result}"
+
+
+if __name__ == "__main__":
+    # test the trivial case with the first two numbers
+    nums = [2, 7, 11, 15]
+    target = 9
+    expected_result = [0, 1]
+    test_two_sum(nums, target, expected_result)
+
+    # test for ability to use zero and the same number twice
+    nums = [2, 7, 0, 15, 12, 0]
+    target = 0
+    expected_result = [2, 5]
+    test_two_sum(nums, target, expected_result)
+
+    # test for first and last index usage and negative numbers
+    nums = [-6, 7, 11, 4]
+    target = -2
+    expected_result = [0, 3]
+    test_two_sum(nums, target, expected_result)
--- a/agbenchmark/challenges/code/c2_debug_suite/d2.2_vague/data.json
+++ b/agbenchmark/challenges/code/c2_debug_suite/d2.2_vague/data.json
@@ -0,0 +1,21 @@
+{
+  "name": "TestDebugSimpleTypoWithoutGuidance",
+  "category": ["code", "iterate"],
+  "task": "Make test.py run without errors.",
+  "dependencies": ["TestDebugSimpleTypoWithGuidance"],
+  "cutoff": 75,
+  "ground": {
+    "answer": "[0, 1] [2, 5] [0, 3]",
+    "should_contain": ["[0, 1]", "[2, 5]", "[0, 3]"],
+    "should_not_contain": [],
+    "files": ["test.py"],
+    "eval": {
+      "type": "python"
+    }
+  },
+  "info": {
+    "difficulty": "intermediate",
+    "description": "Tests ability for the agent to debug python code with a simple typo in it, using a very broad prompt without guidance",
+    "side_effects": []
+  }
+}
--- a/agbenchmark/challenges/code/c2_debug_suite/d2.3_import/artifacts_in/init.py
+++ b/agbenchmark/challenges/code/c2_debug_suite/d2.3_import/artifacts_in/init.py
--- a/agbenchmark/challenges/code/c2_debug_suite/d2.3_import/artifacts_in/code.py
+++ b/agbenchmark/challenges/code/c2_debug_suite/d2.3_import/artifacts_in/code.py
@@ -0,0 +1,13 @@
+# mypy: ignore-errors
+from typing import List, Optional
+
+
+def two_sum(nums: List, target: int) -> Optional[List[int]]:
+    seen = {}
+    for i, num in enumerate(nums):
+        typo
+        complement = target - num
+        if complement in seen:
+            return [seen[complement], i]
+        seen[num] = i
+    return None
--- a/agbenchmark/challenges/code/c2_debug_suite/d2.3_import/artifacts_in/test.py
+++ b/agbenchmark/challenges/code/c2_debug_suite/d2.3_import/artifacts_in/test.py
@@ -0,0 +1,33 @@
+# mypy: ignore-errors
+# fmt: off
+from typing import List
+
+from import
+
+
+def test_two_sum(nums: List, target: int, expected_result: List[int]) -> None:
+    result = two_sum(nums, target)
+    print(result)
+    assert (
+        result == expected_result
+    ), f"AssertionError: Expected the output to be {expected_result}"
+
+
+if __name__ == "__main__":
+    # test the trivial case with the first two numbers
+    nums = [2, 7, 11, 15]
+    target = 9
+    expected_result = [0, 1]
+    test_two_sum(nums, target, expected_result)
+
+    # test for ability to use zero and the same number twice
+    nums = [2, 7, 0, 15, 12, 0]
+    target = 0
+    expected_result = [2, 5]
+    test_two_sum(nums, target, expected_result)
+
+    # test for first and last index usage and negative numbers
+    nums = [-6, 7, 11, 4]
+    target = -2
+    expected_result = [0, 3]
+    test_two_sum(nums, target, expected_result)
--- a/agbenchmark/challenges/code/c2_debug_suite/d2.3_import/artifacts_out/init.py
+++ b/agbenchmark/challenges/code/c2_debug_suite/d2.3_import/artifacts_out/init.py
--- a/agbenchmark/challenges/code/c2_debug_suite/d2.3_import/artifacts_out/code.py
+++ b/agbenchmark/challenges/code/c2_debug_suite/d2.3_import/artifacts_out/code.py
@@ -0,0 +1,12 @@
+# mypy: ignore-errors
+from typing import List, Optional
+
+
+def two_sum(nums: List, target: int) -> Optional[List[int]]:
+    seen = {}
+    for i, num in enumerate(nums):
+        complement = target - num
+        if complement in seen:
+            return [seen[complement], i]
+        seen[num] = i
+    return None
--- a/agbenchmark/challenges/code/c2_debug_suite/d2.3_import/artifacts_out/test.py
+++ b/agbenchmark/challenges/code/c2_debug_suite/d2.3_import/artifacts_out/test.py
@@ -0,0 +1,31 @@
+# mypy: ignore-errors
+from code import two_sum
+from typing import List
+
+
+def test_two_sum(nums: List, target: int, expected_result: List[int]) -> None:
+    result = two_sum(nums, target)
+    print(result)
+    assert (
+        result == expected_result
+    ), f"AssertionError: Expected the output to be {expected_result}"
+
+
+if __name__ == "__main__":
+    # test the trivial case with the first two numbers
+    nums = [2, 7, 11, 15]
+    target = 9
+    expected_result = [0, 1]
+    test_two_sum(nums, target, expected_result)
+
+    # test for ability to use zero and the same number twice
+    nums = [2, 7, 0, 15, 12, 0]
+    target = 0
+    expected_result = [2, 5]
+    test_two_sum(nums, target, expected_result)
+
+    # test for first and last index usage and negative numbers
+    nums = [-6, 7, 11, 4]
+    target = -2
+    expected_result = [0, 3]
+    test_two_sum(nums, target, expected_result)
--- a/agbenchmark/challenges/code/c2_debug_suite/d2.3_import/data.json
+++ b/agbenchmark/challenges/code/c2_debug_suite/d2.3_import/data.json
@@ -0,0 +1,21 @@
+{
+  "name": "TestDebugMultipleTypo",
+  "category": ["code", "iterate"],
+  "task": "Make test.py run without errors.",
+  "dependencies": ["TestDebugSimpleTypoWithoutGuidance"],
+  "cutoff": 90,
+  "ground": {
+    "answer": "[0, 1] [2, 5] [0, 3]",
+    "should_contain": ["[0, 1]", "[2, 5]", "[0, 3]"],
+    "should_not_contain": [],
+    "files": ["test.py"],
+    "eval": {
+      "type": "python"
+    }
+  },
+  "info": {
+    "difficulty": "advanced",
+    "description": "Now it's not just the typo error, but also an incomplete import statement",
+    "side_effects": []
+  }
+}
--- a/agbenchmark/challenges/code/c3_writing_suite_2/d3.1_three_sum/artifacts_out/init.py
+++ b/agbenchmark/challenges/code/c3_writing_suite_2/d3.1_three_sum/artifacts_out/init.py
--- a/agbenchmark/challenges/code/c3_writing_suite_2/d3.1_three_sum/artifacts_out/code.py
+++ b/agbenchmark/challenges/code/c3_writing_suite_2/d3.1_three_sum/artifacts_out/code.py
@@ -0,0 +1,23 @@
+# mypy: ignore-errors
+from typing import List, Optional
+
+
+def three_sum(nums: List[int], target: int) -> Optional[List[int]]:
+    nums_indices = [(num, index) for index, num in enumerate(nums)]
+    nums_indices.sort()
+    for i in range(len(nums_indices) - 2):
+        # Skip repeated values; compare the numbers, not the (num, index) tuples
+        if i > 0 and nums_indices[i][0] == nums_indices[i - 1][0]:
+            continue
+        l, r = i + 1, len(nums_indices) - 1
+        while l < r:
+            total = nums_indices[i][0] + nums_indices[l][0] + nums_indices[r][0]
+            if total < target:
+                l += 1
+            elif total > target:
+                r -= 1
+            else:
+                indices = sorted(
+                    [nums_indices[i][1], nums_indices[l][1], nums_indices[r][1]]
+                )
+                return indices
+    return None
--- a/agbenchmark/challenges/code/c3_writing_suite_2/d3.1_three_sum/custom_python/test.py
+++ b/agbenchmark/challenges/code/c3_writing_suite_2/d3.1_three_sum/custom_python/test.py
@@ -0,0 +1,31 @@
+# mypy: ignore-errors
+from code import three_sum
+from typing import List
+
+
+def test_three_sum(nums: List[int], target: int, expected_result: List[int]) -> None:
+    result = three_sum(nums, target)
+    print(result)
+    assert (
+        result == expected_result
+    ), f"AssertionError: Expected the output to be {expected_result}"
+
+
+if __name__ == "__main__":
+    # test the trivial case with the first three numbers
+    nums = [2, 7, 11, 15]
+    target = 20
+    expected_result = [0, 1, 2]
+    test_three_sum(nums, target, expected_result)
+
+    # test for ability to use zero and the same number twice
+    nums = [2, 7, 0, 15, 12, 0]
+    target = 2
+    expected_result = [0, 2, 5]
+    test_three_sum(nums, target, expected_result)
+
+    # test for first and last index usage and negative numbers
+    nums = [-6, 7, 11, 4]
+    target = 9
+    expected_result = [0, 2, 3]
+    test_three_sum(nums, target, expected_result)
--- a/agbenchmark/challenges/code/c3_writing_suite_2/d3.1_three_sum/data.json
+++ b/agbenchmark/challenges/code/c3_writing_suite_2/d3.1_three_sum/data.json
@@ -0,0 +1,21 @@
+{
+  "name": "TestThreeSum",
+  "category": ["code", "iterate"],
+  "task": "Create a three_sum function in a file called code.py. Given an array of integers, return indices of the three numbers such that they add up to a specific target. You may assume that each input would have exactly one solution, and you may not use the same element twice. Example: Given nums = [2, 7, 11, 15], target = 20, Because nums[0] + nums[1] + nums[2] = 2 + 7 + 11 = 20, return [0, 1, 2].",
+  "dependencies": ["TestFunctionCodeGeneration"],
+  "cutoff": 60,
+  "ground": {
+    "answer": "The three_sum function coded properly.",
+    "should_contain": ["[0, 1, 2]", "[0, 2, 5]", "[0, 2, 3]"],
+    "should_not_contain": [],
+    "files": ["test.py"],
+    "eval": {
+      "type": "python"
+    }
+  },
+  "info": {
+    "difficulty": "advanced",
+    "description": "Tests ability for the agent to create the three_sum function.",
+    "side_effects": []
+  }
+}
--- a/agbenchmark/challenges/code/c3_writing_suite_2/d3_two_sum/artifacts_out/init.py
+++ b/agbenchmark/challenges/code/c3_writing_suite_2/d3_two_sum/artifacts_out/init.py
--- a/agbenchmark/challenges/code/c3_writing_suite_2/d3_two_sum/artifacts_out/code.py
+++ b/agbenchmark/challenges/code/c3_writing_suite_2/d3_two_sum/artifacts_out/code.py
@@ -0,0 +1,12 @@
+# mypy: ignore-errors
+from typing import List, Optional
+
+
+def two_sum(nums: List, target: int) -> Optional[List[int]]:
+    seen = {}
+    for i, num in enumerate(nums):
+        complement = target - num
+        if complement in seen:
+            return [seen[complement], i]
+        seen[num] = i
+    return None
--- a/agbenchmark/challenges/code/c3_writing_suite_2/d3_two_sum/custom_python/test.py
+++ b/agbenchmark/challenges/code/c3_writing_suite_2/d3_two_sum/custom_python/test.py
@@ -0,0 +1,31 @@
+# mypy: ignore-errors
+from code import two_sum
+from typing import List
+
+
+def test_two_sum(nums: List, target: int, expected_result: List[int]) -> None:
+    result = two_sum(nums, target)
+    print(result)
+    assert (
+        result == expected_result
+    ), f"AssertionError: Expected the output to be {expected_result}"
+
+
+if __name__ == "__main__":
+    # test the trivial case with the first two numbers
+    nums = [2, 7, 11, 15]
+    target = 9
+    expected_result = [0, 1]
+    test_two_sum(nums, target, expected_result)
+
+    # test for ability to use zero and the same number twice
+    nums = [2, 7, 0, 15, 12, 0]
+    target = 0
+    expected_result = [2, 5]
+    test_two_sum(nums, target, expected_result)
+
+    # test for first and last index usage and negative numbers
+    nums = [-6, 7, 11, 4]
+    target = -2
+    expected_result = [0, 3]
+    test_two_sum(nums, target, expected_result)
--- a/agbenchmark/challenges/code/c3_writing_suite_2/d3_two_sum/data.json
+++ b/agbenchmark/challenges/code/c3_writing_suite_2/d3_two_sum/data.json
@@ -0,0 +1,21 @@
+{
+  "name": "TestFunctionCodeGeneration",
+  "category": ["code"],
+  "task": "Create a two_sum function in a file called code.py. Given an array of integers, return indices of the two numbers such that they add up to a specific target. You may assume that each input would have exactly one solution, and you may not use the same element twice. Example: Given nums = [2, 7, 11, 15], target = 9, Because nums[0] + nums[1] = 2 + 7 = 9, return [0, 1].",
+  "dependencies": ["TestReturnCode_Write"],
+  "cutoff": 90,
+  "ground": {
+    "answer": "The two_sum function coded properly.",
+    "should_contain": ["[0, 1]", "[2, 5]", "[0, 3]"],
+    "should_not_contain": [],
+    "files": ["test.py"],
+    "eval": {
+      "type": "python"
+    }
+  },
+  "info": {
+    "difficulty": "advanced",
+    "description": "Tests ability for the agent to create the two_sum function.",
+    "side_effects": []
+  }
+}
--- a/agbenchmark/challenges/code/c4_writing_cli_suite_3/1_password_generator/artifacts_out/init.py
+++ b/agbenchmark/challenges/code/c4_writing_cli_suite_3/1_password_generator/artifacts_out/init.py
--- a/agbenchmark/challenges/code/c4_writing_cli_suite_3/1_password_generator/artifacts_out/password_generator.py
+++ b/agbenchmark/challenges/code/c4_writing_cli_suite_3/1_password_generator/artifacts_out/password_generator.py
@@ -0,0 +1,23 @@
+import random
+import string
+
+
+def generate_password(length: int) -> str:
+    if length < 8 or length > 16:
+        raise ValueError("Password length must be between 8 and 16 characters.")
+
+    characters = string.ascii_letters + string.digits + string.punctuation
+    password = [
+        random.choice(string.ascii_lowercase),
+        random.choice(string.ascii_uppercase),
+        random.choice(string.digits),
+        random.choice(string.punctuation),
+    ]
+    password += [random.choice(characters) for _ in range(length - 4)]
+    random.shuffle(password)
+    return "".join(password)
+
+
+if __name__ == "__main__":
+    password_length = random.randint(8, 16)
+    print(generate_password(password_length))
--- a/agbenchmark/challenges/code/c4_writing_cli_suite_3/1_password_generator/custom_python/test.py
+++ b/agbenchmark/challenges/code/c4_writing_cli_suite_3/1_password_generator/custom_python/test.py
@@ -0,0 +1,29 @@
+import unittest
+
+import password_generator
+
+
+class TestPasswordGenerator(unittest.TestCase):
+    def test_password_length(self):
+        for i in range(8, 17):
+            password = password_generator.generate_password(i)
+            self.assertEqual(len(password), i)
+
+    def test_value_error(self):
+        with self.assertRaises(ValueError):
+            password_generator.generate_password(7)
+        with self.assertRaises(ValueError):
+            password_generator.generate_password(17)
+
+    def test_password_content(self):
+        password = password_generator.generate_password(8)
+        self.assertTrue(any(c.islower() for c in password))
+        self.assertTrue(any(c.isupper() for c in password))
+        self.assertTrue(any(c.isdigit() for c in password))
+        self.assertTrue(
+            any(c in password_generator.string.punctuation for c in password)
+        )
+
+
+if __name__ == "__main__":
+    unittest.main()
--- a/agbenchmark/challenges/code/c4_writing_cli_suite_3/1_password_generator/data.json
+++ b/agbenchmark/challenges/code/c4_writing_cli_suite_3/1_password_generator/data.json
@@ -0,0 +1,21 @@
+{
+  "name": "TestPasswordGenerator_Easy",
+  "category": ["code"],
+  "task": "Create a random password generator. The password should have between 8 and 16 characters and should contain letters, numbers and symbols. The password should be printed to the console. The entry point will be a python file that can be run this way: python password_generator.py",
+  "dependencies": ["TestWriteFile"],
+  "cutoff": 90,
+  "ground": {
+    "answer": "password_generator.py is created and satisfies the requirements.",
+    "should_contain": [],
+    "should_not_contain": [],
+    "files": ["test.py"],
+    "eval": {
+      "type": "python"
+    }
+  },
+  "info": {
+    "difficulty": "basic",
+    "description": "Tests ability for the agent to create a random password generator.",
+    "side_effects": []
+  }
+}
--- a/agbenchmark/challenges/code/c4_writing_cli_suite_3/2_file_organizer/artifacts_out/init.py
+++ b/agbenchmark/challenges/code/c4_writing_cli_suite_3/2_file_organizer/artifacts_out/init.py
--- a/agbenchmark/challenges/code/c4_writing_cli_suite_3/2_file_organizer/artifacts_out/organize_files.py
+++ b/agbenchmark/challenges/code/c4_writing_cli_suite_3/2_file_organizer/artifacts_out/organize_files.py
@@ -0,0 +1,48 @@
+import argparse
+import os
+import shutil
+
+
+def organize_files(directory_path):
+    # Define file type groups
+    file_types = {
+        "images": [".png", ".jpg", ".jpeg"],
+        "documents": [".pdf", ".docx", ".txt"],
+        "audio": [".mp3", ".wav", ".flac"],
+    }
+
+    # Create the folders if they don't exist
+    for folder_name in file_types.keys():
+        folder_path = os.path.join(directory_path, folder_name)
+        if not os.path.exists(folder_path):
+            os.makedirs(folder_path)
+
+    # Traverse through all files and folders in the specified directory
+    for foldername, subfolders, filenames in os.walk(directory_path):
+        for filename in filenames:
+            # Get file extension
+            _, file_extension = os.path.splitext(filename)
+
+            # Move files to corresponding folders
+            for folder_name, extensions in file_types.items():
+                if file_extension in extensions:
+                    old_path = os.path.join(foldername, filename)
+                    new_path = os.path.join(directory_path, folder_name, filename)
+                    if old_path != new_path:
+                        shutil.move(old_path, new_path)
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser(
+        description="Organize files in a directory based on their file types"
+    )
+    parser.add_argument(
+        "--directory_path",
+        type=str,
+        required=True,
+        help="The path of the directory to be organized",
+    )
+
+    args = parser.parse_args()
+
+    organize_files(args.directory_path)
--- a/agbenchmark/challenges/code/c4_writing_cli_suite_3/2_file_organizer/custom_python/test.py
+++ b/agbenchmark/challenges/code/c4_writing_cli_suite_3/2_file_organizer/custom_python/test.py
@@ -0,0 +1,45 @@
+import os
+import subprocess
+import tempfile
+import unittest
+
+
+class TestOrganizeFiles(unittest.TestCase):
+    def setUp(self):
+        # Create temporary directory
+        self.test_dir = tempfile.mkdtemp()
+
+        # File types and their corresponding directory
+        self.file_types = {
+            "test_image.png": "images",
+            "test_doc.txt": "documents",
+            "test_audio.mp3": "audio",
+        }
+
+        # Create test files
+        for file_name in self.file_types.keys():
+            open(os.path.join(self.test_dir, file_name), "a").close()
+
+    def test_organize_files(self):
+        # Call the organize_files.py script using subprocess
+        subprocess.call(
+            ["python", "organize_files.py", "--directory_path=" + self.test_dir]
+        )
+
+        # Check if the files have been moved to the correct directories
+        for file_name, directory in self.file_types.items():
+            self.assertTrue(
+                os.path.isfile(os.path.join(self.test_dir, directory, file_name))
+            )
+
+    def tearDown(self):
+        # Delete test directory and its contents
+        for file_name, directory in self.file_types.items():
+            os.remove(os.path.join(self.test_dir, directory, file_name))
+        for directory in set(self.file_types.values()):
+            os.rmdir(os.path.join(self.test_dir, directory))
+        os.rmdir(self.test_dir)
+
+
+if __name__ == "__main__":
+    unittest.main()
--- a/agbenchmark/challenges/code/c4_writing_cli_suite_3/2_file_organizer/data.json
+++ b/agbenchmark/challenges/code/c4_writing_cli_suite_3/2_file_organizer/data.json
@@ -0,0 +1,21 @@
+{
+  "name": "TestWritingCLI_FileOrganizer",
+  "category": ["code"],
+  "task": "Create a file organizer CLI tool in Python that sorts files in a directory based on their file types (e.g., images, documents, audio) and moves them into these corresponding folders: 'images', 'documents', 'audio'. The entry point will be a python file that can be run this way: python organize_files.py --directory_path=YOUR_DIRECTORY_PATH",
+  "dependencies": ["TestPasswordGenerator_Easy"],
+  "cutoff": 90,
+  "ground": {
+    "answer": "The correct python file is written and organizes the files accordingly",
+    "should_contain": [],
+    "should_not_contain": [],
+    "files": ["test.py"],
+    "eval": {
+      "type": "python"
+    }
+  },
+  "info": {
+    "difficulty": "basic",
+    "description": "Tests ability for the agent to code a file organizer.",
+    "side_effects": []
+  }
+}
--- a/agbenchmark/challenges/code/c4_writing_cli_suite_3/suite.json
+++ b/agbenchmark/challenges/code/c4_writing_cli_suite_3/suite.json
@@ -0,0 +1,5 @@
+{
+  "same_task": false,
+  "reverse_order": false,
+  "prefix": "TestWritingCLI"
+}
--- a/agbenchmark/challenges/code/c5_web_app_suite/1_list_animals/artifacts_out/animal_list.html
+++ b/agbenchmark/challenges/code/c5_web_app_suite/1_list_animals/artifacts_out/animal_list.html
@@ -0,0 +1,29 @@
+<!DOCTYPE html>
+<html>
+
+<head>
+    <title>List of Animals</title>
+</head>
+
+<body>
+
+    <h2>List of Animals</h2>
+
+    <ul>
+        <li id="dog">Dog</li>
+        <li>Cat</li>
+        <li>Rabbit</li>
+        <li>Horse</li>
+    </ul>
+
+    <div id="info"></div>
+
+    <script>
+        document.getElementById("dog").addEventListener("click", function() {
+            document.getElementById("info").innerHTML = "Dogs are known as man's best friend!";
+        });
+    </script>
+
+</body>
+
+</html>
--- a/agbenchmark/challenges/code/c5_web_app_suite/1_list_animals/custom_python/test.py
+++ b/agbenchmark/challenges/code/c5_web_app_suite/1_list_animals/custom_python/test.py
@@ -0,0 +1,48 @@
+import os
+import time
+
+from selenium import webdriver
+from selenium.webdriver.chrome.options import Options
+from selenium.webdriver.common.by import By
+from selenium.webdriver.support import expected_conditions as EC
+from selenium.webdriver.support.ui import WebDriverWait
+
+# Path to the HTML file
+current_path = os.path.abspath(__file__)
+current_directory = os.path.dirname(current_path)
+file_path = f"file://{current_directory}/animal_list.html"
+
+# Create a new instance of the Chrome driver
+
+chrome_options = Options()
+chrome_options.add_argument("--headless")
+chrome_options.add_argument("--disable-gpu")
+chrome_options.add_argument("--window-size=1024,768")
+chrome_options.add_argument("--no-sandbox")
+chrome_options.add_argument("--disable-dev-shm-usage")
+driver = webdriver.Chrome(options=chrome_options)
+
+# Navigate to the HTML file
+driver.get(file_path)
+
+# Wait for up to 10 seconds for the "dog" element to be available
+wait = WebDriverWait(driver, 10)
+dog_li = wait.until(EC.presence_of_element_located((By.ID, "dog")))
+
+# Click on the "dog" list item
+dog_li.click()
+
+# Find the "info" div and get its text
+info_div = driver.find_element(By.ID, "info")
+info_text = info_div.text
+
+# Assert that the text is what we expect
+assert info_text == "Dogs are known as man's best friend!"
+
+print("Test passed!")
+
+# Wait for 5 seconds
+time.sleep(5)
+
+# Close the browser window
+driver.quit()
--- a/agbenchmark/challenges/code/c5_web_app_suite/1_list_animals/data.json
+++ b/agbenchmark/challenges/code/c5_web_app_suite/1_list_animals/data.json
@@ -0,0 +1,21 @@
+{
+  "name": "TestWebApp_ListAnimals",
+  "category": ["code"],
+  "task": "Build a web page with a list of animals. When someone clicks on the word 'Dog', a message should appear that says 'Dogs are known as man's best friend!'. You'll need to make a list with the name 'Dog' and then write a little bit of JavaScript to make the message appear when the name is clicked. Mark the element containing 'Dog' with the id 'dog'. Put the message inside a <div> with the id 'info'. Create a single html file called animal_list.html.",
+  "dependencies": ["TestWritingCLI_FileOrganizer"],
+  "cutoff": 90,
+  "ground": {
+    "answer": "A web app where we can list animals and have details about dogs.",
+    "should_contain": [],
+    "should_not_contain": [],
+    "files": ["test.py"],
+    "eval": {
+      "type": "python"
+    }
+  },
+  "info": {
+    "difficulty": "basic",
+    "description": "Tests the agent's ability to build a basic html app.",
+    "side_effects": []
+  }
+}
--- a/agbenchmark/challenges/code/c5_web_app_suite/suite.json
+++ b/agbenchmark/challenges/code/c5_web_app_suite/suite.json
@@ -0,0 +1,5 @@
+{
+  "same_task": false,
+  "reverse_order": false,
+  "prefix": "TestWebApp"
+}
--- a/agbenchmark/challenges/code/c9_realistic_suite/10_url_shortener/data_draft.json
+++ b/agbenchmark/challenges/code/c9_realistic_suite/10_url_shortener/data_draft.json
@@ -0,0 +1,21 @@
+{
+  "name": "TestEngUrlShortener",
+  "category": ["code"],
+  "task": "Create a URL shortener app using HTML, CSS, JavaScript, and a backend language like Python or Node.js. Allow users to input a long URL and generate a shortened version that redirects to the original URL. Store the shortened URLs in a database.",
+  "dependencies": ["TestReturnCode_Simple"],
+  "cutoff": 90,
+  "ground": {
+    "answer": "",
+    "should_contain": [],
+    "should_not_contain": [],
+    "files": ["test.py"],
+    "eval": {
+      "type": "python"
+    }
+  },
+  "info": {
+    "difficulty": "advanced",
+    "description": "",
+    "side_effects": []
+  }
+}
--- a/agbenchmark/challenges/code/c9_realistic_suite/1_currency_converter/data_draft.json
+++ b/agbenchmark/challenges/code/c9_realistic_suite/1_currency_converter/data_draft.json
@@ -0,0 +1,21 @@
+{
+  "name": "TestEngCurrencyConverter",
+  "category": ["code"],
+  "task": "Build a currency converter app using an API for exchange rates. Use HTML, CSS, and JavaScript for the frontend and Node.js for the backend. Allow users to convert between different currencies.",
+  "dependencies": ["TestReturnCode_Simple"],
+  "cutoff": 90,
+  "ground": {
+    "answer": "Tries converting three different currencies which should match the API set up in test.py",
+    "should_contain": ["True", "True", "True"],
+    "should_not_contain": [],
+    "files": ["test.py"],
+    "eval": {
+      "type": "python"
+    }
+  },
+  "info": {
+    "difficulty": "advanced",
+    "description": "Converts currency by calling an API and returning the result.",
+    "side_effects": []
+  }
+}
--- a/agbenchmark/challenges/code/c9_realistic_suite/2_file_explorer/data_draft.json
+++ b/agbenchmark/challenges/code/c9_realistic_suite/2_file_explorer/data_draft.json
@@ -0,0 +1,25 @@
+{
+  "name": "TestEngFileExplorer",
+  "category": ["code"],
+  "task": "Create a file organizer CLI tool in Python that sorts files in a directory based on their file types (e.g., images, documents, audio) and moves them into corresponding folders.",
+  "dependencies": ["TestReturnCode_Simple"],
+  "cutoff": 90,
+  "ground": {
+    "answer": "Given a directory with audio files, images, and txt files, it should sort them into folders.",
+    "should_contain": [
+      "Audio sorted correctly",
+      "Images sorted correctly",
+      ".txt files sorted correctly"
+    ],
+    "should_not_contain": [],
+    "files": ["test.py"],
+    "eval": {
+      "type": "python"
+    }
+  },
+  "info": {
+    "difficulty": "advanced",
+    "description": "Testing the creation of a file organizer CLI tool",
+    "side_effects": []
+  }
+}
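The `should_contain` and `should_not_contain` lists are described in CHALLENGE.md as exact strings that must, or must not, appear in the final answer. A minimal sketch of such a check follows, assuming plain substring matching on the test output; `passes_ground` is a hypothetical name and the real evaluator may differ.

```python
def passes_ground(output: str, should_contain: list[str], should_not_contain: list[str]) -> bool:
    """Check an output string against the two ground-truth lists.

    Assumes exact, case-sensitive substring matching (an assumption).
    """
    has_all_required = all(text in output for text in should_contain)
    has_any_forbidden = any(text in output for text in should_not_contain)
    return has_all_required and not has_any_forbidden


# Strings taken from the TestEngFileExplorer challenge above.
required = [
    "Audio sorted correctly",
    "Images sorted correctly",
    ".txt files sorted correctly",
]

full_output = "Audio sorted correctly\nImages sorted correctly\n.txt files sorted correctly\n"
partial_output = "Audio sorted correctly\n"

print(passes_ground(full_output, required, []))
print(passes_ground(partial_output, required, []))
```

Note that with empty lists on both sides, as in several of the draft challenges, this check passes for any output, so those challenges would rely entirely on `test.py`.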
--- a/agbenchmark/challenges/code/c9_realistic_suite/3_file_organizer/data_draft.json
+++ b/agbenchmark/challenges/code/c9_realistic_suite/3_file_organizer/data_draft.json
@@ -0,0 +1,21 @@
+{
+  "name": "TestEngFileOrganizer",
+  "category": ["code"],
+  "task": "Build a currency converter app using an API for exchange rates. Use HTML, CSS, and JavaScript for the frontend and Node.js for the backend. Allow users to convert between different currencies.",
+  "dependencies": ["TestReturnCode_Simple"],
+  "cutoff": 90,
+  "ground": {
+    "answer": "Tries converting three different currencies which should match the API set up in test.py",
+    "should_contain": ["True", "True", "True"],
+    "should_not_contain": [],
+    "files": ["test.py"],
+    "eval": {
+      "type": "python"
+    }
+  },
+  "info": {
+    "difficulty": "advanced",
+    "description": "Converts currency by calling an API and returning the result.",
+    "side_effects": []
+  }
+}
--- a/agbenchmark/challenges/code/c9_realistic_suite/4_image_resizer/data_draft.json
+++ b/agbenchmark/challenges/code/c9_realistic_suite/4_image_resizer/data_draft.json
@@ -0,0 +1,21 @@
+{
+  "name": "TestEngImageResizer",
+  "category": ["code"],
+  "task": "Create a CLI tool in Python that allows users to resize images by specifying the desired width and height. Use the Pillow library for image manipulation.",
+  "dependencies": ["TestReturnCode_Simple"],
+  "cutoff": 90,
+  "ground": {
+    "answer": "Takes two image files img1.jpg and img2.png and checks if they have been resized correctly",
+    "should_contain": ["1280*1280", "640*640"],
+    "should_not_contain": [],
+    "files": ["test.py"],
+    "eval": {
+      "type": "python"
+    }
+  },
+  "info": {
+    "difficulty": "advanced",
+    "description": "Asks to build CLI tool that resizes images to a specified width and height.",
+    "side_effects": []
+  }
+}
--- a/agbenchmark/challenges/code/c9_realistic_suite/5_markdown_editor/data_draft.json
+++ b/agbenchmark/challenges/code/c9_realistic_suite/5_markdown_editor/data_draft.json
@@ -0,0 +1,21 @@
+{
+  "name": "TestEngMarkdownEditor",
+  "category": ["code"],
+  "task": "Build a simple markdown editor using HTML, CSS, and JavaScript. Allow users to input markdown text and display the formatted output in real-time.",
+  "dependencies": ["TestReturnCode_Simple"],
+  "cutoff": 90,
+  "ground": {
+    "answer": "",
+    "should_contain": [],
+    "should_not_contain": [],
+    "files": ["test.py"],
+    "eval": {
+      "type": "python"
+    }
+  },
+  "info": {
+    "difficulty": "advanced",
+    "description": "",
+    "side_effects": []
+  }
+}
--- a/agbenchmark/challenges/code/c9_realistic_suite/6_password_generator/data_draft.json
+++ b/agbenchmark/challenges/code/c9_realistic_suite/6_password_generator/data_draft.json
@@ -0,0 +1,23 @@
+{
+  "name": "TestEngPassGen",
+  "category": ["code"],
+  "task": "Create a password generator CLI tool in Python that generates strong, random passwords based on user-specified criteria, such as length and character types (letters, numbers, symbols).",
+  "dependencies": ["TestReturnCode_Simple"],
+  "cutoff": 90,
+  "ground": {
+    "answer": "Does the following password fulfill the requirements of the user?",
+    "should_contain": [],
+    "should_not_contain": [],
+    "files": ["test.py"],
+    "eval": {
+      "type": "llm",
+      "scoring": "binary",
+      "template": "question"
+    }
+  },
+  "info": {
+    "difficulty": "advanced",
+    "description": "Test.py will get content in the format of 1) Length: 10 2) Character types: letters, numbers, symbols 3) Password: 1a2b3c4d5e which the llm will score.",
+    "side_effects": []
+  }
+}
--- a/agbenchmark/challenges/code/c9_realistic_suite/7_pomodoro_timer/data_draft.json
+++ b/agbenchmark/challenges/code/c9_realistic_suite/7_pomodoro_timer/data_draft.json
@@ -0,0 +1,21 @@
+{
+  "name": "TestEngPomodoro",
+  "category": ["code"],
+  "task": "Develop a Pomodoro timer app using HTML, CSS, and JavaScript. Allow users to set work and break intervals and receive notifications when it's time to switch.",
+  "dependencies": ["TestReturnCode_Simple"],
+  "cutoff": 90,
+  "ground": {
+    "answer": "",
+    "should_contain": [],
+    "should_not_contain": [],
+    "files": ["test.py"],
+    "eval": {
+      "type": "python"
+    }
+  },
+  "info": {
+    "difficulty": "advanced",
+    "description": "",
+    "side_effects": []
+  }
+}
--- a/agbenchmark/challenges/code/c9_realistic_suite/8_timer_app/data_draft.json
+++ b/agbenchmark/challenges/code/c9_realistic_suite/8_timer_app/data_draft.json
@@ -0,0 +1,21 @@
+{
+  "name": "TestEngTimerApp",
+  "category": ["code"],
+  "task": "Create a simple timer app using HTML, CSS, and JavaScript that allows users to set a countdown timer and receive an alert when the time is up.",
+  "dependencies": ["TestReturnCode_Simple"],
+  "cutoff": 90,
+  "ground": {
+    "answer": "",
+    "should_contain": [],
+    "should_not_contain": [],
+    "files": ["test.py"],
+    "eval": {
+      "type": "python"
+    }
+  },
+  "info": {
+    "difficulty": "advanced",
+    "description": "",
+    "side_effects": []
+  }
+}
--- a/agbenchmark/challenges/code/c9_realistic_suite/9_todo_list/data_draft.json
+++ b/agbenchmark/challenges/code/c9_realistic_suite/9_todo_list/data_draft.json
@@ -0,0 +1,21 @@
+{
+  "name": "TestEngTodoList",
+  "category": ["code"],
+  "task": "Create a simple to-do list app using HTML, CSS, and JavaScript. Store tasks in local storage and allow users to add, edit, and delete tasks.",
+  "dependencies": ["TestReturnCode_Simple"],
+  "cutoff": 90,
+  "ground": {
+    "answer": "",
+    "should_contain": [],
+    "should_not_contain": [],
+    "files": ["test.py"],
+    "eval": {
+      "type": "python"
+    }
+  },
+  "info": {
+    "difficulty": "advanced",
+    "description": "",
+    "side_effects": []
+  }
+}
--- a/agbenchmark/challenges/code/c9_realistic_suite/draft.json
+++ b/agbenchmark/challenges/code/c9_realistic_suite/draft.json
@@ -0,0 +1,5 @@
+{
+  "same_task": false,
+  "reverse_order": false,
+  "prefix": "TestEng"
+}
--- a/agbenchmark/challenges/content_gen/1_summary/artifacts_in/challenges.txt
+++ b/agbenchmark/challenges/content_gen/1_summary/artifacts_in/challenges.txt
@@ -0,0 +1,5 @@
+1. Rising levels of air pollution in major cities.
+2. The decline of linguistic diversity and death of minor languages.
+3. Increased demand for sustainable and eco-friendly products.
+4. The remote work revolution due to global pandemics.
+5. Growing concerns about meat consumption's environmental and ethical implications.