report # bug, adding submodule challenges (#193)

2026-02-14 10:44:20 +01:00 · 2023-07-26 13:53:10 +01:00
parent 6b7e2da1df
commit 80506e9a3b
163 changed files with 56 additions and 1783 deletions
--- a/.github/workflows/ci.yml
+++ b/.github/workflows/ci.yml
@@ -190,6 +190,7 @@ jobs:
            ${prefix}agbenchmark start --mock --category=memory
            ${prefix}agbenchmark start --mock --category=iterate
            ${prefix}agbenchmark start --mock --suite TestReturnCode 
+            ${prefix}agbenchmark start --mock --suite TestRevenueRetrieval
          else
            bash -c "$(curl -fsSL https://raw.githubusercontent.com/Helicone/helicone/779bb99c6e9cd878e324e5e1c6a41c0d8db81754/mitmproxy.sh)" -s start
            ${prefix}agbenchmark start || echo "This command will always return a non zero exit code unless all the challenges are solved."
--- a/.gitmodules
+++ b/.gitmodules
@@ -1,28 +1,31 @@
-[submodule "agent/Auto-GPT"]
-	path = agent/Auto-GPT
-	url = https://github.com/merwanehamadi/Auto-GPT.git
-	branch = remove-append-to-file
-[submodule "agent/gpt-engineer"]
-	path = agent/gpt-engineer
-	url = https://github.com/merwanehamadi/gpt-engineer.git
-	branch = benchmark-integration
-[submodule "agent/mini-agi"]
-	path = agent/mini-agi
-	url = https://github.com/SilenNaihin/mini-agi.git
-	branch = benchmark-integration
-[submodule "agent/smol-developer"]
-	path = agent/smol-developer
-	url = https://github.com/merwanehamadi/developer.git
-	branch = benchmark-integration
-[submodule "agent/SuperAGI"]
-	path = agent/SuperAGI
-	url = https://github.com/SilenNaihin/SuperAGI.git
-	branch = benchmark-integration
-[submodule "agent/BabyAGI"]
-	path = agent/BabyAGI
-	url = https://github.com/SilenNaihin/babyagi.git
-	branch = benchmark-integration
-[submodule "agent/beebot"]
-	path = agent/beebot
-	url = https://github.com/merwanehamadi/beebot.git
-	branch = master
+[submodule "agent/Auto-GPT"]
+	path = agent/Auto-GPT
+	url = https://github.com/merwanehamadi/Auto-GPT.git
+	branch = remove-append-to-file
+[submodule "agent/gpt-engineer"]
+	path = agent/gpt-engineer
+	url = https://github.com/merwanehamadi/gpt-engineer.git
+	branch = benchmark-integration
+[submodule "agent/mini-agi"]
+	path = agent/mini-agi
+	url = https://github.com/SilenNaihin/mini-agi.git
+	branch = benchmark-integration
+[submodule "agent/smol-developer"]
+	path = agent/smol-developer
+	url = https://github.com/merwanehamadi/developer.git
+	branch = benchmark-integration
+[submodule "agent/SuperAGI"]
+	path = agent/SuperAGI
+	url = https://github.com/SilenNaihin/SuperAGI.git
+	branch = benchmark-integration
+[submodule "agent/BabyAGI"]
+	path = agent/BabyAGI
+	url = https://github.com/SilenNaihin/babyagi.git
+	branch = benchmark-integration
+[submodule "agent/beebot"]
+	path = agent/beebot
+	url = https://github.com/merwanehamadi/beebot.git
+	branch = master
+[submodule "agbenchmark/challenges"]
+	path = agbenchmark/challenges
+	url = https://github.com/SilenNaihin/agbenchmark_challenges.git
--- a/agbenchmark/README.md
+++ b/agbenchmark/README.md
@@ -20,7 +20,8 @@
 3. `poetry shell`
 4. `poetry install`
 5. `cp .env_example .env`
-6. `agbenchmark start --mock`
+6. `git submodule update --init --remote --recursive`
+7. `agbenchmark start --mock`
   Keep config the same and watch the logs :)

 ### To run with mini-agi
@@ -28,7 +29,8 @@
 1. Navigate to `auto-gpt-benchmarks/agent/mini-agi`
 2. `pip install -r requirements.txt`
 3. `cp .env_example .env`, set `PROMPT_USER=false` and add your `OPENAI_API_KEY=`. Set `MODEL="gpt-3.5-turbo"` if you don't have access to `gpt-4` yet. Also make sure you have Python 3.10^ installed
-4. Make sure to follow the commands above, and remove mock flag `agbenchmark start`
+4. Set `AGENT_NAME=mini-agi` in the `.env` file, and set `REPORT_LOCATION` to where you want the reports to go
+5. Make sure to follow the commands above, and remove mock flag `agbenchmark start`

 - To add requirements `poetry add requirement`.

@@ -65,6 +67,6 @@ https://github.com/Significant-Gravitas/Auto-GPT-Benchmarks/pull/48/files
 **To just use as the benchmark for your agent**. `pip install` the package and run `agbenchmark start`

 **For internal Auto-GPT ci runs**, specify the `AGENT_NAME` you want to use and set the `HOME_ENV`.
-Ex. `HOME_ENV=ci AGENT_NAME=mini-agi`
+Ex. `AGENT_NAME=mini-agi`

 **To develop agent alongside benchmark**, you can specify the `AGENT_NAME` you want to use and add as a submodule to the repo
--- a/agbenchmark/challenges
+++ b/agbenchmark/challenges
--- a/agbenchmark/challenges/README.md
+++ b/agbenchmark/challenges/README.md
@@ -1,63 +0,0 @@
-# Challenges Data Schema of Benchmark
-
-## General challenges
-
-Input:
-
- **name** (str): Name of the challenge.
- **category** (str[]): Category of the challenge such as 'basic', 'retrieval', 'comprehension', etc. _this is not currently used. for the future it may be needed_
- **task** (str): The task that the agent needs to solve.
- **dependencies** (str[]): The dependencies that the challenge needs to run. Needs to be the full node to the test function.
- **ground** (dict): The ground truth.
-  - **answer** (str): The raw text of the ground truth answer.
-  - **should_contain** (list): The exact strings that are required in the final answer.
-  - **should_not_contain** (list): The exact strings that should not be in the final answer.
-  - **files** (list): Files that are used for retrieval. Can specify file here or an extension.
- **mock** (dict): Mock response for testing.
-  - **mock_func** (str): Function to mock the agent's response. This is used for testing purposes.
-  - **mock_task** (str): Task to provide for the mock function.
- **info** (dict): Additional info about the challenge.
-  - **difficulty** (str): The difficulty of this query.
-  - **description** (str): Description of the challenge.
-  - **side_effects** (str[]): Describes the effects of the challenge.
-
-Example:
-
-```python
-{
-  "category": ["basic"],
-  "task": "Print the capital of America to a .txt file",
-  "dependencies": ["TestWriteFile"], # the class name of the test
-  "ground": {
-    "answer": "Washington",
-    "should_contain": ["Washington"],
-    "should_not_contain": ["New York", "Los Angeles", "San Francisco"],
-    "files": [".txt"],
-    "type": "file"
-  },
-  "info": {
-    "difficulty": "basic",
-    "description": "Tests the writing to file",
-    "side_effects": ["tests if there is in fact an LLM attached"]
-  }
-}
-```
-
-Current Output:
-
- **score** (float): scores range from [0, 1]
-
-## Add files to challenges: 
-
-### artifacts_in
-
-This folder contains all the files you want the agent to have in its workspace BEFORE the challenge starts
-
-### artifacts_out
-This folder contains all the files you would like the agent to generate. This folder is used to mock the agent.
-This lets us run `agbenchmark start --test=TestExample --mock` and make sure our challenge actually works.
-
-### custom_python
-This folder contains files that will be copied into the agent's workspace and run after the challenge is completed.
-For example we can have a test.py in it and run this file in the workspace to easily import code generated by the agent.
-Example: TestBasicCodeGeneration challenge.
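Reviewer note on the schema removed above: the `should_contain` / `should_not_contain` fields of `ground` and the `[0, 1]` score range can be illustrated with a minimal sketch. This is not the benchmark's actual scoring code; `score_answer` is a hypothetical helper, and all-or-nothing scoring is an assumption made only for illustration.

```python
from typing import Any, Dict


def score_answer(content: str, ground: Dict[str, Any]) -> float:
    """Hypothetical scorer: 1.0 if the content satisfies the ground truth, else 0.0.

    Assumes exact substring matching, as the schema describes
    "exact strings" for both lists.
    """
    # Every required string must appear somewhere in the content.
    for required in ground.get("should_contain", []):
        if required not in content:
            return 0.0
    # No forbidden string may appear anywhere in the content.
    for forbidden in ground.get("should_not_contain", []):
        if forbidden in content:
            return 0.0
    return 1.0


# Ground truth copied from the README example in this diff.
ground = {
    "answer": "Washington",
    "should_contain": ["Washington"],
    "should_not_contain": ["New York", "Los Angeles", "San Francisco"],
}

print(score_answer("The capital is Washington", ground))
print(score_answer("The capital is New York", ground))
```

Under these assumptions the first call passes and the second fails, because the required string is missing and a forbidden one is present.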
--- a/agbenchmark/challenges/init.py
+++ b/agbenchmark/challenges/init.py
--- a/agbenchmark/challenges/adapatability/a1_debug/artifacts_in/init.py
+++ b/agbenchmark/challenges/adapatability/a1_debug/artifacts_in/init.py
--- a/agbenchmark/challenges/adapatability/a1_debug/artifacts_in/code.py
+++ b/agbenchmark/challenges/adapatability/a1_debug/artifacts_in/code.py
@@ -1,13 +0,0 @@
-# mypy: ignore-errors
-from typing import List, Optional
-
-
-def two_sum(nums: List, target: int) -> Optional[List[int]]:
-    seen = {}
-    for i, num in enumerate(nums):
-        typo
-        complement = target - num
-        if complement in seen:
-            return [seen[complement], i]
-        seen[num] = i
-    return None
--- a/agbenchmark/challenges/adapatability/a1_debug/artifacts_in/test.py
+++ b/agbenchmark/challenges/adapatability/a1_debug/artifacts_in/test.py
@@ -1,31 +0,0 @@
-# mypy: ignore-errors
-from code import two_sum
-from typing import List
-
-
-def test_two_sum(nums: List, target: int, expected_result: List[int]) -> None:
-    result = two_sum(nums, target)
-    print(result)
-    assert (
-        result == expected_result
-    ), f"AssertionError: Expected the output to be {expected_result}"
-
-
-if __name__ == "__main__":
-    # test the trivial case with the first two numbers
-    nums = [2, 7, 11, 15]
-    target = 9
-    expected_result = [0, 1]
-    test_two_sum(nums, target, expected_result)
-
-    # test for ability to use zero and the same number twice
-    nums = [2, 7, 0, 15, 12, 0]
-    target = 0
-    expected_result = [2, 5]
-    test_two_sum(nums, target, expected_result)
-
-    # test for first and last index usage and negative numbers
-    nums = [-6, 7, 11, 4]
-    target = -2
-    expected_result = [0, 3]
-    test_two_sum(nums, target, expected_result)
--- a/agbenchmark/challenges/adapatability/a1_debug/artifacts_out/init.py
+++ b/agbenchmark/challenges/adapatability/a1_debug/artifacts_out/init.py
--- a/agbenchmark/challenges/adapatability/a1_debug/artifacts_out/code.py
+++ b/agbenchmark/challenges/adapatability/a1_debug/artifacts_out/code.py
@@ -1,12 +0,0 @@
-# mypy: ignore-errors
-from typing import List, Optional
-
-
-def two_sum(nums: List, target: int) -> Optional[List[int]]:
-    seen = {}
-    for i, num in enumerate(nums):
-        complement = target - num
-        if complement in seen:
-            return [seen[complement], i]
-        seen[num] = i
-    return None
--- a/agbenchmark/challenges/adapatability/a1_debug/artifacts_out/test.py
+++ b/agbenchmark/challenges/adapatability/a1_debug/artifacts_out/test.py
@@ -1,31 +0,0 @@
-# mypy: ignore-errors
-from code import two_sum
-from typing import List
-
-
-def test_two_sum(nums: List, target: int, expected_result: List[int]) -> None:
-    result = two_sum(nums, target)
-    print(result)
-    assert (
-        result == expected_result
-    ), f"AssertionError: Expected the output to be {expected_result}"
-
-
-if __name__ == "__main__":
-    # test the trivial case with the first two numbers
-    nums = [2, 7, 11, 15]
-    target = 9
-    expected_result = [0, 1]
-    test_two_sum(nums, target, expected_result)
-
-    # test for ability to use zero and the same number twice
-    nums = [2, 7, 0, 15, 12, 0]
-    target = 0
-    expected_result = [2, 5]
-    test_two_sum(nums, target, expected_result)
-
-    # test for first and last index usage and negative numbers
-    nums = [-6, 7, 11, 4]
-    target = -2
-    expected_result = [0, 3]
-    test_two_sum(nums, target, expected_result)
--- a/agbenchmark/challenges/adapatability/a1_debug/data.json
+++ b/agbenchmark/challenges/adapatability/a1_debug/data.json
@@ -1,19 +0,0 @@
-{
-  "name": "TestAdaptSimpleTypoWithGuidance",
-  "category": ["adaptability"],
-  "task": "1- Run code.py.\n2- Read test.py.\n3- Modify test.py.\nRepeat step 1, 2 and 3 until code.py runs without errors.\n",
-  "dependencies": ["TestDebugSimpleTypoWithGuidance"],
-  "cutoff": 75,
-  "ground": {
-    "answer": "Same as the original DebugWithGuidance test - [0, 1] [2, 5] [0, 3]",
-    "should_contain": ["[0, 1]", "[2, 5]", "[0, 3]"],
-    "should_not_contain": [],
-    "files": ["test.py"],
-    "type": "execute_python_code"
-  },
-  "info": {
-    "difficulty": "intermediate",
-    "description": "A version of the debug challenge where the wrong files to test and change are given. The llm must figure the challenges out",
-    "side_effects": []
-  }
-}
--- a/agbenchmark/challenges/adapatability/a2_tesla_revenue/artifacts_out/random_file.txt
+++ b/agbenchmark/challenges/adapatability/a2_tesla_revenue/artifacts_out/random_file.txt
@@ -1 +0,0 @@
-81,462 Millions
--- a/agbenchmark/challenges/adapatability/a2_tesla_revenue/data.json
+++ b/agbenchmark/challenges/adapatability/a2_tesla_revenue/data.json
@@ -1,19 +0,0 @@
-{
-  "name": "TestAdaptTeslaRevenue",
-  "category": ["adaptability"],
-  "task": "I'm a financial planner, please help me write tesla's r in 2022 into a .txt file.",
-  "cutoff": 60,
-  "dependencies": ["TestRevenueRetrieval_1.0"],
-  "ground": {
-    "answer": "It was $81.462 billion in 2022.",
-    "should_contain": ["81"],
-    "should_not_contain": [],
-    "files": [".txt"],
-    "type": "file"
-  },
-  "info": {
-    "difficulty": "intermediate",
-    "description": "Given context adapt to the missing of the word revenue",
-    "side_effects": []
-  }
-}
--- a/agbenchmark/challenges/adapatability/a3_book_price/artifacts_out/random_file.txt
+++ b/agbenchmark/challenges/adapatability/a3_book_price/artifacts_out/random_file.txt
@@ -1 +0,0 @@
-25.89
--- a/agbenchmark/challenges/adapatability/a3_book_price/data.json
+++ b/agbenchmark/challenges/adapatability/a3_book_price/data.json
@@ -1,19 +0,0 @@
-{
-  "name": "TestAdaptLink",
-  "category": ["adaptability"],
-  "task": "Write the price of the book in this url 'books.toscrape.com/catalogue/meditations_33/index.html' to a .txt file.",
-  "dependencies": ["TestBasicRetrieval"],
-  "cutoff": 60,
-  "ground": {
-    "answer": "£25.89",
-    "should_contain": ["25.89"],
-    "should_not_contain": [],
-    "files": [".txt"],
-    "type": "file"
-  },
-  "info": {
-    "difficulty": "novice",
-    "description": "An advanced version of this -> remove.html as well. Same as TestBasicRetrieval but link is slightly broken, supposed to be http:// at the start.",
-    "side_effects": []
-  }
-}
--- a/agbenchmark/challenges/code/c1_writing_suite_1/1_return/artifacts_in/init.py
+++ b/agbenchmark/challenges/code/c1_writing_suite_1/1_return/artifacts_in/init.py
@@ -1 +0,0 @@
-# mypy: ignore-errors
--- a/agbenchmark/challenges/code/c1_writing_suite_1/1_return/artifacts_in/code.py
+++ b/agbenchmark/challenges/code/c1_writing_suite_1/1_return/artifacts_in/code.py
@@ -1,5 +0,0 @@
-# mypy: ignore-errors
-
-
-def multiply_int(num: int) -> int:
-    multiplied_num = num * 2
--- a/agbenchmark/challenges/code/c1_writing_suite_1/1_return/artifacts_in/test.py
+++ b/agbenchmark/challenges/code/c1_writing_suite_1/1_return/artifacts_in/test.py
@@ -1,17 +0,0 @@
-# mypy: ignore-errors
-from code import multiply_int
-
-
-def test_multiply_int(num: int, expected_result: int) -> None:
-    result = multiply_int(num)
-    print(result)
-    assert (
-        result == expected_result
-    ), f"AssertionError: Expected the output to be {expected_result}"
-
-
-if __name__ == "__main__":
-    # test the trivial case
-    num = 4
-    expected_result = 8
-    test_multiply_int(num, expected_result)
--- a/agbenchmark/challenges/code/c1_writing_suite_1/1_return/artifacts_out/init.py
+++ b/agbenchmark/challenges/code/c1_writing_suite_1/1_return/artifacts_out/init.py
@@ -1 +0,0 @@
-# mypy: ignore-errors
--- a/agbenchmark/challenges/code/c1_writing_suite_1/1_return/artifacts_out/code.py
+++ b/agbenchmark/challenges/code/c1_writing_suite_1/1_return/artifacts_out/code.py
@@ -1,6 +0,0 @@
-# mypy: ignore-errors
-
-
-def multiply_int(num: int) -> int:
-    multiplied_num = num * 2
-    return multiplied_num
--- a/agbenchmark/challenges/code/c1_writing_suite_1/1_return/artifacts_out/test.py
+++ b/agbenchmark/challenges/code/c1_writing_suite_1/1_return/artifacts_out/test.py
@@ -1,17 +0,0 @@
-# mypy: ignore-errors
-from code import multiply_int
-
-
-def test_multiply_int(num: int, expected_result: int) -> None:
-    result = multiply_int(num)
-    print(result)
-    assert (
-        result == expected_result
-    ), f"AssertionError: Expected the output to be {expected_result}"
-
-
-if __name__ == "__main__":
-    # test the trivial case
-    num = 4
-    expected_result = 8
-    test_multiply_int(num, expected_result)
--- a/agbenchmark/challenges/code/c1_writing_suite_1/1_return/data.json
+++ b/agbenchmark/challenges/code/c1_writing_suite_1/1_return/data.json
@@ -1,19 +0,0 @@
-{
-  "name": "TestReturnCode_Simple",
-  "category": ["code", "iterate"],
-  "task": "Return the multiplied number in the function multiply_int in code.py. You can make sure you have correctly done this by running test.py",
-  "dependencies": ["TestReadFile", "TestWriteFile"],
-  "cutoff": 60,
-  "ground": {
-    "answer": "Just a simple multiple by 2 function. Num is 4 so answer is 8",
-    "should_contain": ["8"],
-    "should_not_contain": [],
-    "files": ["test.py"],
-    "type": "execute_python_code"
-  },
-  "info": {
-    "difficulty": "basic",
-    "description": "Simple test if a simple code instruction can be executed",
-    "side_effects": []
-  }
-}
--- a/agbenchmark/challenges/code/c1_writing_suite_1/2_write/artifacts_in/init.py
+++ b/agbenchmark/challenges/code/c1_writing_suite_1/2_write/artifacts_in/init.py
@@ -1 +0,0 @@
-# mypy: ignore-errors
--- a/agbenchmark/challenges/code/c1_writing_suite_1/2_write/artifacts_in/code.py
+++ b/agbenchmark/challenges/code/c1_writing_suite_1/2_write/artifacts_in/code.py
@@ -1 +0,0 @@
-# mypy: ignore-errors
--- a/agbenchmark/challenges/code/c1_writing_suite_1/2_write/artifacts_in/test.py
+++ b/agbenchmark/challenges/code/c1_writing_suite_1/2_write/artifacts_in/test.py
@@ -1,17 +0,0 @@
-# mypy: ignore-errors
-from code import multiply_int
-
-
-def test_multiply_int(num: int, expected_result: int) -> None:
-    result = multiply_int(num)
-    print(result)
-    assert (
-        result == expected_result
-    ), f"AssertionError: Expected the output to be {expected_result}"
-
-
-if __name__ == "__main__":
-    # test the trivial case
-    num = 4
-    expected_result = 8
-    test_multiply_int(num, expected_result)
--- a/agbenchmark/challenges/code/c1_writing_suite_1/2_write/artifacts_out/init.py
+++ b/agbenchmark/challenges/code/c1_writing_suite_1/2_write/artifacts_out/init.py
@@ -1 +0,0 @@
-# mypy: ignore-errors
--- a/agbenchmark/challenges/code/c1_writing_suite_1/2_write/artifacts_out/code.py
+++ b/agbenchmark/challenges/code/c1_writing_suite_1/2_write/artifacts_out/code.py
@@ -1,6 +0,0 @@
-# mypy: ignore-errors
-
-
-def multiply_int(num: int) -> int:
-    multiplied_num = num * 2
-    return multiplied_num
--- a/agbenchmark/challenges/code/c1_writing_suite_1/2_write/artifacts_out/test.py
+++ b/agbenchmark/challenges/code/c1_writing_suite_1/2_write/artifacts_out/test.py
@@ -1,17 +0,0 @@
-# mypy: ignore-errors
-from code import multiply_int
-
-
-def test_multiply_int(num: int, expected_result: int) -> None:
-    result = multiply_int(num)
-    print(result)
-    assert (
-        result == expected_result
-    ), f"AssertionError: Expected the output to be {expected_result}"
-
-
-if __name__ == "__main__":
-    # test the trivial case
-    num = 4
-    expected_result = 8
-    test_multiply_int(num, expected_result)
--- a/agbenchmark/challenges/code/c1_writing_suite_1/2_write/data.json
+++ b/agbenchmark/challenges/code/c1_writing_suite_1/2_write/data.json
@@ -1,19 +0,0 @@
-{
-  "name": "TestReturnCode_Write",
-  "category": ["code", "iterate"],
-  "task": "Add a function called multiply_int in code.py that multiplies numbers by 2. You can make sure you have correctly done this by running test.py",
-  "dependencies": ["TestReturnCode_Simple"],
-  "cutoff": 60,
-  "ground": {
-    "answer": "Just a simple multiple by 2 function. Num is 4 so answer is 8",
-    "should_contain": ["8"],
-    "should_not_contain": [],
-    "files": ["test.py"],
-    "type": "execute_python_code"
-  },
-  "info": {
-    "difficulty": "novice",
-    "description": "Small step up, just writing the function with a name as well as the return statement.",
-    "side_effects": []
-  }
-}
--- a/agbenchmark/challenges/code/c1_writing_suite_1/3_modify/artifacts_in/init.py
+++ b/agbenchmark/challenges/code/c1_writing_suite_1/3_modify/artifacts_in/init.py
--- a/agbenchmark/challenges/code/c1_writing_suite_1/3_modify/artifacts_in/code.py
+++ b/agbenchmark/challenges/code/c1_writing_suite_1/3_modify/artifacts_in/code.py
@@ -1,6 +0,0 @@
-# mypy: ignore-errors
-
-
-def multiply_int(num: int) -> int:
-    multiplied_num = num * 2
-    return multiplied_num
--- a/agbenchmark/challenges/code/c1_writing_suite_1/3_modify/artifacts_in/test.py
+++ b/agbenchmark/challenges/code/c1_writing_suite_1/3_modify/artifacts_in/test.py
@@ -1,30 +0,0 @@
-# mypy: ignore-errors
-from code import multiply_int
-
-
-def test_multiply_int(num: int, multiplier, expected_result: int) -> None:
-    result = multiply_int(num, multiplier)
-    print(result)
-    assert (
-        result == expected_result
-    ), f"AssertionError: Expected the output to be {expected_result}"
-
-
-if __name__ == "__main__":
-    # test the trivial case
-    num = 4
-    multiplier = 2
-    expected_result = 8
-    test_multiply_int(num, multiplier, expected_result)
-
-    # so its not hard coded
-    num = 7
-    multiplier = 7
-    expected_result = 49
-    test_multiply_int(num, multiplier, expected_result)
-
-    # negative numbers
-    num = -6
-    multiplier = 2
-    expected_result = -12
-    test_multiply_int(num, multiplier, expected_result)
--- a/agbenchmark/challenges/code/c1_writing_suite_1/3_modify/artifacts_out/init.py
+++ b/agbenchmark/challenges/code/c1_writing_suite_1/3_modify/artifacts_out/init.py
--- a/agbenchmark/challenges/code/c1_writing_suite_1/3_modify/artifacts_out/code.py
+++ b/agbenchmark/challenges/code/c1_writing_suite_1/3_modify/artifacts_out/code.py
@@ -1,6 +0,0 @@
-# mypy: ignore-errors
-
-
-def multiply_int(num: int, multiplier: int) -> int:
-    multiplied_num = num * multiplier
-    return multiplied_num
--- a/agbenchmark/challenges/code/c1_writing_suite_1/3_modify/artifacts_out/test.py
+++ b/agbenchmark/challenges/code/c1_writing_suite_1/3_modify/artifacts_out/test.py
@@ -1,30 +0,0 @@
-# mypy: ignore-errors
-from code import multiply_int
-
-
-def test_multiply_int(num: int, multiplier, expected_result: int) -> None:
-    result = multiply_int(num, multiplier)
-    print(result)
-    assert (
-        result == expected_result
-    ), f"AssertionError: Expected the output to be {expected_result}"
-
-
-if __name__ == "__main__":
-    # test the trivial case
-    num = 4
-    multiplier = 2
-    expected_result = 8
-    test_multiply_int(num, multiplier, expected_result)
-
-    # so its not hard coded
-    num = 7
-    multiplier = 7
-    expected_result = 49
-    test_multiply_int(num, multiplier, expected_result)
-
-    # negative numbers
-    num = -6
-    multiplier = 2
-    expected_result = -12
-    test_multiply_int(num, multiplier, expected_result)
--- a/agbenchmark/challenges/code/c1_writing_suite_1/3_modify/data.json
+++ b/agbenchmark/challenges/code/c1_writing_suite_1/3_modify/data.json
@@ -1,19 +0,0 @@
-{
-  "name": "TestReturnCode_Modify",
-  "category": ["code", "iterate"],
-  "task": "Modify the multiply_int function in code.py to be able to pass in a 'multiplier' argument to multiply the 'num' by 'multiplier'. Both arguments are integers. You can make sure you have correctly done this by running test.py",
-  "dependencies": ["TestReturnCode_Write"],
-  "cutoff": 75,
-  "ground": {
-    "answer": "def multiply_int(num, multiplier):\n    return num * multiplier\n",
-    "should_contain": ["8", "49", "-12"],
-    "should_not_contain": [],
-    "files": ["test.py"],
-    "type": "execute_python_code"
-  },
-  "info": {
-    "difficulty": "intermediate",
-    "description": "Builds on the previous function also take a multiplier .",
-    "side_effects": []
-  }
-}
--- a/agbenchmark/challenges/code/c1_writing_suite_1/4_tests/artifacts_in/init.py
+++ b/agbenchmark/challenges/code/c1_writing_suite_1/4_tests/artifacts_in/init.py
--- a/agbenchmark/challenges/code/c1_writing_suite_1/4_tests/artifacts_in/code.py
+++ b/agbenchmark/challenges/code/c1_writing_suite_1/4_tests/artifacts_in/code.py
@@ -1,6 +0,0 @@
-# mypy: ignore-errors
-
-
-def multiply_int(num: int) -> int:
-    multiplied_num = num * 2
-    return multiplied_num
--- a/agbenchmark/challenges/code/c1_writing_suite_1/4_tests/artifacts_in/test.py
+++ b/agbenchmark/challenges/code/c1_writing_suite_1/4_tests/artifacts_in/test.py
@@ -1,18 +0,0 @@
-# mypy: ignore-errors
-from code import multiply_int
-
-
-def test_multiply_int(num: int, multiplier, expected_result: int) -> None:
-    result = multiply_int(num, multiplier)
-    print(result)
-    assert (
-        result == expected_result
-    ), f"AssertionError: Expected the output to be {expected_result}"
-
-
-if __name__ == "__main__":
-    # create a trivial test that has 4 as the num, and 2 as the multiplier. Make sure to fill in the expected result
-    num =
-    multiplier = 
-    expected_result = 
-    test_multiply_int()
--- a/agbenchmark/challenges/code/c1_writing_suite_1/4_tests/artifacts_out/init.py
+++ b/agbenchmark/challenges/code/c1_writing_suite_1/4_tests/artifacts_out/init.py
--- a/agbenchmark/challenges/code/c1_writing_suite_1/4_tests/artifacts_out/code.py
+++ b/agbenchmark/challenges/code/c1_writing_suite_1/4_tests/artifacts_out/code.py
@@ -1,6 +0,0 @@
-# mypy: ignore-errors
-
-
-def multiply_int(num: int, multiplier: int) -> int:
-    multiplied_num = num * multiplier
-    return multiplied_num
--- a/agbenchmark/challenges/code/c1_writing_suite_1/4_tests/artifacts_out/test.py
+++ b/agbenchmark/challenges/code/c1_writing_suite_1/4_tests/artifacts_out/test.py
@@ -1,30 +0,0 @@
-# mypy: ignore-errors
-from code import multiply_int
-
-
-def test_multiply_int(num: int, multiplier, expected_result: int) -> None:
-    result = multiply_int(num, multiplier)
-    print(result)
-    assert (
-        result == expected_result
-    ), f"AssertionError: Expected the output to be {expected_result}"
-
-
-if __name__ == "__main__":
-    # test the trivial case
-    num = 4
-    multiplier = 2
-    expected_result = 8
-    test_multiply_int(num, multiplier, expected_result)
-
-    # so its not hard coded
-    num = 7
-    multiplier = 7
-    expected_result = 49
-    test_multiply_int(num, multiplier, expected_result)
-
-    # negative numbers
-    num = -6
-    multiplier = 2
-    expected_result = -12
-    test_multiply_int(num, multiplier, expected_result)
--- a/agbenchmark/challenges/code/c1_writing_suite_1/4_tests/data.json
+++ b/agbenchmark/challenges/code/c1_writing_suite_1/4_tests/data.json
@@ -1,19 +0,0 @@
-{
-  "name": "TestReturnCode_Tests",
-  "category": ["code", "iterate"],
-  "task": "First, modify test.py to fill in the test case to be able to test the code in code.py. Next, modify the multiply_int function in code.py to be able to pass in a 'multiplier' argument to multiply the 'num' by 'multiplier'. Both arguments are integers. You can make sure you have correctly done this by running test.py that you previously modified.",
-  "dependencies": ["TestReturnCode_Modify"],
-  "cutoff": 90,
-  "ground": {
-    "answer": "Just a simple multiple by 2 function. Num is 4 so answer is 8",
-    "should_contain": ["8", "49", "-12"],
-    "should_not_contain": [],
-    "files": ["test.py"],
-    "type": "execute_python_code"
-  },
-  "info": {
-    "difficulty": "advanced",
-    "description": "Small step up, just writing the function with a name as well as the return statement.",
-    "side_effects": []
-  }
-}
--- a/agbenchmark/challenges/code/c1_writing_suite_1/suite.json
+++ b/agbenchmark/challenges/code/c1_writing_suite_1/suite.json
@@ -1,5 +0,0 @@
-{
-  "same_task": false,
-  "reverse_order": true,
-  "prefix": "TestReturnCode"
-}
--- a/agbenchmark/challenges/code/c2_debug_suite/d2.1_vague/artifacts_in/init.py
+++ b/agbenchmark/challenges/code/c2_debug_suite/d2.1_vague/artifacts_in/init.py
--- a/agbenchmark/challenges/code/c2_debug_suite/d2.1_vague/artifacts_in/code.py
+++ b/agbenchmark/challenges/code/c2_debug_suite/d2.1_vague/artifacts_in/code.py
@@ -1,13 +0,0 @@
-# mypy: ignore-errors
-from typing import List, Optional
-
-
-def two_sum(nums: List, target: int) -> Optional[List[int]]:
-    seen = {}
-    for i, num in enumerate(nums):
-        typo
-        complement = target - num
-        if complement in seen:
-            return [seen[complement], i]
-        seen[num] = i
-    return None
--- a/agbenchmark/challenges/code/c2_debug_suite/d2.1_vague/artifacts_in/test.py
+++ b/agbenchmark/challenges/code/c2_debug_suite/d2.1_vague/artifacts_in/test.py
@@ -1,31 +0,0 @@
-# mypy: ignore-errors
-from code import two_sum
-from typing import List
-
-
-def test_two_sum(nums: List, target: int, expected_result: List[int]) -> None:
-    result = two_sum(nums, target)
-    print(result)
-    assert (
-        result == expected_result
-    ), f"AssertionError: Expected the output to be {expected_result}"
-
-
-if __name__ == "__main__":
-    # test the trivial case with the first two numbers
-    nums = [2, 7, 11, 15]
-    target = 9
-    expected_result = [0, 1]
-    test_two_sum(nums, target, expected_result)
-
-    # test for ability to use zero and the same number twice
-    nums = [2, 7, 0, 15, 12, 0]
-    target = 0
-    expected_result = [2, 5]
-    test_two_sum(nums, target, expected_result)
-
-    # test for first and last index usage and negative numbers
-    nums = [-6, 7, 11, 4]
-    target = -2
-    expected_result = [0, 3]
-    test_two_sum(nums, target, expected_result)
--- a/agbenchmark/challenges/code/c2_debug_suite/d2.1_vague/artifacts_out/init.py
+++ b/agbenchmark/challenges/code/c2_debug_suite/d2.1_vague/artifacts_out/init.py
--- a/agbenchmark/challenges/code/c2_debug_suite/d2.1_vague/artifacts_out/code.py
+++ b/agbenchmark/challenges/code/c2_debug_suite/d2.1_vague/artifacts_out/code.py
@@ -1,12 +0,0 @@
-# mypy: ignore-errors
-from typing import List, Optional
-
-
-def two_sum(nums: List, target: int) -> Optional[List[int]]:
-    seen = {}
-    for i, num in enumerate(nums):
-        complement = target - num
-        if complement in seen:
-            return [seen[complement], i]
-        seen[num] = i
-    return None
--- a/agbenchmark/challenges/code/c2_debug_suite/d2.1_vague/artifacts_out/test.py
+++ b/agbenchmark/challenges/code/c2_debug_suite/d2.1_vague/artifacts_out/test.py
@@ -1,31 +0,0 @@
-# mypy: ignore-errors
-from code import two_sum
-from typing import List
-
-
-def test_two_sum(nums: List, target: int, expected_result: List[int]) -> None:
-    result = two_sum(nums, target)
-    print(result)
-    assert (
-        result == expected_result
-    ), f"AssertionError: Expected the output to be {expected_result}"
-
-
-if __name__ == "__main__":
-    # test the trivial case with the first two numbers
-    nums = [2, 7, 11, 15]
-    target = 9
-    expected_result = [0, 1]
-    test_two_sum(nums, target, expected_result)
-
-    # test for ability to use zero and the same number twice
-    nums = [2, 7, 0, 15, 12, 0]
-    target = 0
-    expected_result = [2, 5]
-    test_two_sum(nums, target, expected_result)
-
-    # test for first and last index usage and negative numbers
-    nums = [-6, 7, 11, 4]
-    target = -2
-    expected_result = [0, 3]
-    test_two_sum(nums, target, expected_result)
--- a/agbenchmark/challenges/code/c2_debug_suite/d2.1_vague/data.json
+++ b/agbenchmark/challenges/code/c2_debug_suite/d2.1_vague/data.json
@@ -1,19 +0,0 @@
-{
-  "name": "TestDebugSimpleTypoWithoutGuidance",
-  "category": ["code", "iterate"],
-  "task": "Make test.py run without errors.",
-  "dependencies": ["TestDebugSimpleTypoWithGuidance"],
-  "cutoff": 75,
-  "ground": {
-    "answer": "[0, 1] [2, 5] [0, 3]",
-    "should_contain": ["[0, 1]", "[2, 5]", "[0, 3]"],
-    "should_not_contain": [],
-    "files": ["test.py"],
-    "type": "execute_python_code"
-  },
-  "info": {
-    "difficulty": "intermediate",
-    "description": "Tests ability for the agent to debug python code with a simple typo in it, using a very broad prompt without guidance",
-    "side_effects": []
-  }
-}
--- a/agbenchmark/challenges/code/c2_debug_suite/d2.2_import/artifacts_in/init.py
+++ b/agbenchmark/challenges/code/c2_debug_suite/d2.2_import/artifacts_in/init.py
--- a/agbenchmark/challenges/code/c2_debug_suite/d2.2_import/artifacts_in/code.py
+++ b/agbenchmark/challenges/code/c2_debug_suite/d2.2_import/artifacts_in/code.py
@@ -1,13 +0,0 @@
-# mypy: ignore-errors
-from typing import List, Optional
-
-
-def two_sum(nums: List, target: int) -> Optional[List[int]]:
-    seen = {}
-    for i, num in enumerate(nums):
-        typo
-        complement = target - num
-        if complement in seen:
-            return [seen[complement], i]
-        seen[num] = i
-    return None
--- a/agbenchmark/challenges/code/c2_debug_suite/d2.2_import/artifacts_in/test.py
+++ b/agbenchmark/challenges/code/c2_debug_suite/d2.2_import/artifacts_in/test.py
@@ -1,33 +0,0 @@
-# mypy: ignore-errors
-# fmt: off
-from typing import List
-
-from import
-
-
-def test_two_sum(nums: List, target: int, expected_result: List[int]) -> None:
-    result = two_sum(nums, target)
-    print(result)
-    assert (
-        result == expected_result
-    ), f"AssertionError: Expected the output to be {expected_result}"
-
-
-if __name__ == "__main__":
-    # test the trivial case with the first two numbers
-    nums = [2, 7, 11, 15]
-    target = 9
-    expected_result = [0, 1]
-    test_two_sum(nums, target, expected_result)
-
-    # test for ability to use zero and the same number twice
-    nums = [2, 7, 0, 15, 12, 0]
-    target = 0
-    expected_result = [2, 5]
-    test_two_sum(nums, target, expected_result)
-
-    # test for first and last index usage and negative numbers
-    nums = [-6, 7, 11, 4]
-    target = -2
-    expected_result = [0, 3]
-    test_two_sum(nums, target, expected_result)
--- a/agbenchmark/challenges/code/c2_debug_suite/d2.2_import/artifacts_out/init.py
+++ b/agbenchmark/challenges/code/c2_debug_suite/d2.2_import/artifacts_out/init.py
--- a/agbenchmark/challenges/code/c2_debug_suite/d2.2_import/artifacts_out/code.py
+++ b/agbenchmark/challenges/code/c2_debug_suite/d2.2_import/artifacts_out/code.py
@@ -1,12 +0,0 @@
-# mypy: ignore-errors
-from typing import List, Optional
-
-
-def two_sum(nums: List, target: int) -> Optional[List[int]]:
-    seen = {}
-    for i, num in enumerate(nums):
-        complement = target - num
-        if complement in seen:
-            return [seen[complement], i]
-        seen[num] = i
-    return None
--- a/agbenchmark/challenges/code/c2_debug_suite/d2.2_import/artifacts_out/test.py
+++ b/agbenchmark/challenges/code/c2_debug_suite/d2.2_import/artifacts_out/test.py
@@ -1,31 +0,0 @@
-# mypy: ignore-errors
-from code import two_sum
-from typing import List
-
-
-def test_two_sum(nums: List, target: int, expected_result: List[int]) -> None:
-    result = two_sum(nums, target)
-    print(result)
-    assert (
-        result == expected_result
-    ), f"AssertionError: Expected the output to be {expected_result}"
-
-
-if __name__ == "__main__":
-    # test the trivial case with the first two numbers
-    nums = [2, 7, 11, 15]
-    target = 9
-    expected_result = [0, 1]
-    test_two_sum(nums, target, expected_result)
-
-    # test for ability to use zero and the same number twice
-    nums = [2, 7, 0, 15, 12, 0]
-    target = 0
-    expected_result = [2, 5]
-    test_two_sum(nums, target, expected_result)
-
-    # test for first and last index usage and negative numbers
-    nums = [-6, 7, 11, 4]
-    target = -2
-    expected_result = [0, 3]
-    test_two_sum(nums, target, expected_result)
--- a/agbenchmark/challenges/code/c2_debug_suite/d2.2_import/data.json
+++ b/agbenchmark/challenges/code/c2_debug_suite/d2.2_import/data.json
@@ -1,19 +0,0 @@
-{
-  "name": "TestDebugMultipleTypo",
-  "category": ["code", "iterate"],
-  "task": "Make test.py run without errors.",
-  "dependencies": ["TestDebugSimpleTypoWithoutGuidance"],
-  "cutoff": 90,
-  "ground": {
-    "answer": "[0, 1] [2, 5] [0, 3]",
-    "should_contain": ["[0, 1]", "[2, 5]", "[0, 3]"],
-    "should_not_contain": [],
-    "files": ["test.py"],
-    "type": "execute_python_code"
-  },
-  "info": {
-    "difficulty": "advanced",
-    "description": "Now it's not just the typo error, but also an incomplete import statement",
-    "side_effects": []
-  }
-}
--- a/agbenchmark/challenges/code/c2_debug_suite/d2_debug/artifacts_in/init.py
+++ b/agbenchmark/challenges/code/c2_debug_suite/d2_debug/artifacts_in/init.py
--- a/agbenchmark/challenges/code/c2_debug_suite/d2_debug/artifacts_in/code.py
+++ b/agbenchmark/challenges/code/c2_debug_suite/d2_debug/artifacts_in/code.py
@@ -1,13 +0,0 @@
-# mypy: ignore-errors
-from typing import List, Optional
-
-
-def two_sum(nums: List, target: int) -> Optional[List[int]]:
-    seen = {}
-    for i, num in enumerate(nums):
-        typo
-        complement = target - num
-        if complement in seen:
-            return [seen[complement], i]
-        seen[num] = i
-    return None
--- a/agbenchmark/challenges/code/c2_debug_suite/d2_debug/artifacts_in/test.py
+++ b/agbenchmark/challenges/code/c2_debug_suite/d2_debug/artifacts_in/test.py
@@ -1,31 +0,0 @@
-# mypy: ignore-errors
-from code import two_sum
-from typing import List
-
-
-def test_two_sum(nums: List, target: int, expected_result: List[int]) -> None:
-    result = two_sum(nums, target)
-    print(result)
-    assert (
-        result == expected_result
-    ), f"AssertionError: Expected the output to be {expected_result}"
-
-
-if __name__ == "__main__":
-    # test the trivial case with the first two numbers
-    nums = [2, 7, 11, 15]
-    target = 9
-    expected_result = [0, 1]
-    test_two_sum(nums, target, expected_result)
-
-    # test for ability to use zero and the same number twice
-    nums = [2, 7, 0, 15, 12, 0]
-    target = 0
-    expected_result = [2, 5]
-    test_two_sum(nums, target, expected_result)
-
-    # test for first and last index usage and negative numbers
-    nums = [-6, 7, 11, 4]
-    target = -2
-    expected_result = [0, 3]
-    test_two_sum(nums, target, expected_result)
--- a/agbenchmark/challenges/code/c2_debug_suite/d2_debug/artifacts_out/init.py
+++ b/agbenchmark/challenges/code/c2_debug_suite/d2_debug/artifacts_out/init.py
--- a/agbenchmark/challenges/code/c2_debug_suite/d2_debug/artifacts_out/code.py
+++ b/agbenchmark/challenges/code/c2_debug_suite/d2_debug/artifacts_out/code.py
@@ -1,12 +0,0 @@
-# mypy: ignore-errors
-from typing import List, Optional
-
-
-def two_sum(nums: List, target: int) -> Optional[List[int]]:
-    seen = {}
-    for i, num in enumerate(nums):
-        complement = target - num
-        if complement in seen:
-            return [seen[complement], i]
-        seen[num] = i
-    return None
--- a/agbenchmark/challenges/code/c2_debug_suite/d2_debug/artifacts_out/test.py
+++ b/agbenchmark/challenges/code/c2_debug_suite/d2_debug/artifacts_out/test.py
@@ -1,31 +0,0 @@
-# mypy: ignore-errors
-from code import two_sum
-from typing import List
-
-
-def test_two_sum(nums: List, target: int, expected_result: List[int]) -> None:
-    result = two_sum(nums, target)
-    print(result)
-    assert (
-        result == expected_result
-    ), f"AssertionError: Expected the output to be {expected_result}"
-
-
-if __name__ == "__main__":
-    # test the trivial case with the first two numbers
-    nums = [2, 7, 11, 15]
-    target = 9
-    expected_result = [0, 1]
-    test_two_sum(nums, target, expected_result)
-
-    # test for ability to use zero and the same number twice
-    nums = [2, 7, 0, 15, 12, 0]
-    target = 0
-    expected_result = [2, 5]
-    test_two_sum(nums, target, expected_result)
-
-    # test for first and last index usage and negative numbers
-    nums = [-6, 7, 11, 4]
-    target = -2
-    expected_result = [0, 3]
-    test_two_sum(nums, target, expected_result)
--- a/agbenchmark/challenges/code/c2_debug_suite/d2_debug/data.json
+++ b/agbenchmark/challenges/code/c2_debug_suite/d2_debug/data.json
@@ -1,19 +0,0 @@
-{
-  "name": "TestDebugSimpleTypoWithGuidance",
-  "category": ["code", "iterate"],
-  "task": "1- Run test.py.\n2- Read code.py.\n3- Modify code.py.\nRepeat step 1, 2 and 3 until test.py runs without errors.\n",
-  "dependencies": ["TestReadFile", "TestWriteFile"],
-  "cutoff": 75,
-  "ground": {
-    "answer": "[0, 1] [2, 5] [0, 3]",
-    "should_contain": ["[0, 1]", "[2, 5]", "[0, 3]"],
-    "should_not_contain": [],
-    "files": ["test.py"],
-    "type": "execute_python_code"
-  },
-  "info": {
-    "difficulty": "novice",
-    "description": "Tests ability for the agent to debug python code with a simple typo in it.",
-    "side_effects": []
-  }
-}
--- a/agbenchmark/challenges/code/c3_writing_suite_2/d3.1_three_sum/artifacts_out/init.py
+++ b/agbenchmark/challenges/code/c3_writing_suite_2/d3.1_three_sum/artifacts_out/init.py
--- a/agbenchmark/challenges/code/c3_writing_suite_2/d3.1_three_sum/artifacts_out/code.py
+++ b/agbenchmark/challenges/code/c3_writing_suite_2/d3.1_three_sum/artifacts_out/code.py
@@ -1,23 +0,0 @@
-# mypy: ignore-errors
-from typing import List, Optional
-
-
-def three_sum(nums: List[int], target: int) -> Optional[List[int]]:
-    nums_indices = [(num, index) for index, num in enumerate(nums)]
-    nums_indices.sort()
-    for i in range(len(nums_indices) - 2):
-        if i > 0 and nums_indices[i] == nums_indices[i - 1]:
-            continue
-        l, r = i + 1, len(nums_indices) - 1
-        while l < r:
-            three_sum = nums_indices[i][0] + nums_indices[l][0] + nums_indices[r][0]
-            if three_sum < target:
-                l += 1
-            elif three_sum > target:
-                r -= 1
-            else:
-                indices = sorted(
-                    [nums_indices[i][1], nums_indices[l][1], nums_indices[r][1]]
-                )
-                return indices
-    return None
--- a/agbenchmark/challenges/code/c3_writing_suite_2/d3.1_three_sum/custom_python/test.py
+++ b/agbenchmark/challenges/code/c3_writing_suite_2/d3.1_three_sum/custom_python/test.py
@@ -1,31 +0,0 @@
-# mypy: ignore-errors
-from code import three_sum
-from typing import List
-
-
-def test_three_sum(nums: List[int], target: int, expected_result: List[int]) -> None:
-    result = three_sum(nums, target)
-    print(result)
-    assert (
-        result == expected_result
-    ), f"AssertionError: Expected the output to be {expected_result}"
-
-
-if __name__ == "__main__":
-    # test the trivial case with the first three numbers
-    nums = [2, 7, 11, 15]
-    target = 20
-    expected_result = [0, 1, 2]
-    test_three_sum(nums, target, expected_result)
-
-    # test for ability to use zero and the same number twice
-    nums = [2, 7, 0, 15, 12, 0]
-    target = 2
-    expected_result = [0, 2, 5]
-    test_three_sum(nums, target, expected_result)
-
-    # test for first and last index usage and negative numbers
-    nums = [-6, 7, 11, 4]
-    target = 9
-    expected_result = [0, 2, 3]
-    test_three_sum(nums, target, expected_result)
--- a/agbenchmark/challenges/code/c3_writing_suite_2/d3.1_three_sum/data.json
+++ b/agbenchmark/challenges/code/c3_writing_suite_2/d3.1_three_sum/data.json
@@ -1,19 +0,0 @@
-{
-  "name": "TestThreeSum",
-  "category": ["code", "iterate"],
-  "task": "Create a three_sum function in a file called code.py. Given an array of integers, return indices of the three numbers such that they add up to a specific target. You may assume that each input would have exactly one solution, and you may not use the same element twice. Example: Given nums = [2, 7, 11, 15], target = 20, Because nums[0] + nums[1] + nums[2] = 2 + 7 + 11 = 20, return [0, 1, 2].",
-  "dependencies": ["TestFunctionCodeGeneration"],
-  "cutoff": 60,
-  "ground": {
-    "answer": "The three_sum function coded properly.",
-    "should_contain": ["[0, 1, 2]", "[0, 2, 5]", "[0, 2, 3]"],
-    "should_not_contain": [],
-    "files": ["test.py"],
-    "type": "execute_python_code"
-  },
-  "info": {
-    "difficulty": "advanced",
-    "description": "Tests ability for the agent to create the three_sum function.",
-    "side_effects": []
-  }
-}
--- a/agbenchmark/challenges/code/c3_writing_suite_2/d3_two_sum/artifacts_out/init.py
+++ b/agbenchmark/challenges/code/c3_writing_suite_2/d3_two_sum/artifacts_out/init.py
--- a/agbenchmark/challenges/code/c3_writing_suite_2/d3_two_sum/artifacts_out/code.py
+++ b/agbenchmark/challenges/code/c3_writing_suite_2/d3_two_sum/artifacts_out/code.py
@@ -1,12 +0,0 @@
-# mypy: ignore-errors
-from typing import List, Optional
-
-
-def two_sum(nums: List, target: int) -> Optional[List[int]]:
-    seen = {}
-    for i, num in enumerate(nums):
-        complement = target - num
-        if complement in seen:
-            return [seen[complement], i]
-        seen[num] = i
-    return None
--- a/agbenchmark/challenges/code/c3_writing_suite_2/d3_two_sum/custom_python/test.py
+++ b/agbenchmark/challenges/code/c3_writing_suite_2/d3_two_sum/custom_python/test.py
@@ -1,31 +0,0 @@
-# mypy: ignore-errors
-from code import two_sum
-from typing import List
-
-
-def test_two_sum(nums: List, target: int, expected_result: List[int]) -> None:
-    result = two_sum(nums, target)
-    print(result)
-    assert (
-        result == expected_result
-    ), f"AssertionError: Expected the output to be {expected_result}"
-
-
-if __name__ == "__main__":
-    # test the trivial case with the first two numbers
-    nums = [2, 7, 11, 15]
-    target = 9
-    expected_result = [0, 1]
-    test_two_sum(nums, target, expected_result)
-
-    # test for ability to use zero and the same number twice
-    nums = [2, 7, 0, 15, 12, 0]
-    target = 0
-    expected_result = [2, 5]
-    test_two_sum(nums, target, expected_result)
-
-    # test for first and last index usage and negative numbers
-    nums = [-6, 7, 11, 4]
-    target = -2
-    expected_result = [0, 3]
-    test_two_sum(nums, target, expected_result)
--- a/agbenchmark/challenges/code/c3_writing_suite_2/d3_two_sum/data.json
+++ b/agbenchmark/challenges/code/c3_writing_suite_2/d3_two_sum/data.json
@@ -1,19 +0,0 @@
-{
-  "name": "TestFunctionCodeGeneration",
-  "category": ["code"],
-  "task": "Create a two_sum function in a file called code.py. Given an array of integers, return indices of the two numbers such that they add up to a specific target. You may assume that each input would have exactly one solution, and you may not use the same element twice. Example: Given nums = [2, 7, 11, 15], target = 9, Because nums[0] + nums[1] = 2 + 7 = 9, return [0, 1].",
-  "dependencies": ["TestReturnCode_Write"],
-  "cutoff": 90,
-  "ground": {
-    "answer": "The two_sum function coded properly.",
-    "should_contain": ["[0, 1]", "[2, 5]", "[0, 3]"],
-    "should_not_contain": [],
-    "files": ["test.py"],
-    "type": "execute_python_code"
-  },
-  "info": {
-    "difficulty": "advanced",
-    "description": "Tests ability for the agent to create the two_sum function.",
-    "side_effects": []
-  }
-}
--- a/agbenchmark/challenges/code/d4_web_server/custom_python/api_tests.py
+++ b/agbenchmark/challenges/code/d4_web_server/custom_python/api_tests.py
@@ -1,41 +0,0 @@
-import os
-from typing import Any, Dict
-from unittest.mock import Mock, patch
-
-import requests
-
-
-def test_make_request_and_assert() -> None:
-    result = make_request_and_assert()
-    print(result)
-    expected_result = "OK"
-    error_message = f"AssertionError: Expected the output to be {expected_result}"
-    print(error_message)
-    assert result == expected_result, error_message
-
-
-def make_assertion() -> None:
-    if os.environ.get("MOCK_TEST", "False").lower() == "true":
-        mock_response = Mock(requests.Response)
-        mock_response.status_code = 200
-        mock_response.json.return_value = {"status": "OK"}
-
-        with patch("requests.get", return_value=mock_response):
-            make_request_and_assert()
-    else:
-        make_request_and_assert()
-
-
-def make_request_and_assert() -> Dict[str, Any]:
-    response = requests.get("http://localhost:8079/health")
-    if response.status_code != 200:
-        raise AssertionError(
-            f"Expected status code 200, but got {response.status_code}"
-        )
-
-    return response.json()
-
-
-if __name__ == "__main__":
-    # test for the case when server is healthy
-    test_make_request_and_assert()
--- a/agbenchmark/challenges/code/d4_web_server/data.json
+++ b/agbenchmark/challenges/code/d4_web_server/data.json
@@ -1,27 +0,0 @@
-{
-  "name": "TestCreateSimpleWebServer",
-  "category": [
-    "code"
-  ],
-  "task": "Create a basic flask web server that responds to a GET localhost:8079/health with a 200 OK. Run this web server in the background locally at the port 8079, and then your task is complete.",
-  "dependencies": [
-    "TestDebugSimpleTypoWithGuidance"
-  ],
-  "cutoff": 90,
-  "ground": {
-    "answer": "GET localhost:8079/health responds with a 200 OK",
-    "should_contain": [],
-    "should_not_contain": [
-      "AssertionError"
-    ],
-    "files": [
-      "api_tests.py"
-    ],
-    "type": "execute_python_code"
-  },
-  "info": {
-    "difficulty": "advanced",
-    "description": "Tests ability for the agent to build a simple web server locally",
-    "side_effects": []
-  }
-}
--- a/agbenchmark/challenges/interface/read_file/artifacts_in/file_to_check.txt
+++ b/agbenchmark/challenges/interface/read_file/artifacts_in/file_to_check.txt
@@ -1 +0,0 @@
-Hello World!
--- a/agbenchmark/challenges/interface/read_file/artifacts_out/file_to_check.txt
+++ b/agbenchmark/challenges/interface/read_file/artifacts_out/file_to_check.txt
@@ -1 +0,0 @@
-random string Hello World!
--- a/agbenchmark/challenges/interface/read_file/data.json
+++ b/agbenchmark/challenges/interface/read_file/data.json
@@ -1,18 +0,0 @@
-{
-  "name": "TestReadFile",
-  "category": ["interface"],
-  "task": "Write the string 'random string' before any existing text to the file called file_to_check.txt",
-  "dependencies": ["TestWriteFile"],
-  "cutoff": 60,
-  "ground": {
-    "answer": "random string Hello World!",
-    "should_contain": ["random string", "Hello World!"],
-    "files": ["file_to_check.txt"],
-    "type": "file"
-  },
-  "info": {
-    "description": "This reads the file quickly",
-    "difficulty": "interface",
-    "side_effects": [""]
-  }
-}
--- a/agbenchmark/challenges/interface/search/artifacts_out/random_file.txt
+++ b/agbenchmark/challenges/interface/search/artifacts_out/random_file.txt
@@ -1,2 +0,0 @@
-This is a Heading
-This is a paragraph.
--- a/agbenchmark/challenges/interface/search/data.json
+++ b/agbenchmark/challenges/interface/search/data.json
@@ -1,19 +0,0 @@
-{
-  "name": "TestSearch",
-  "category": ["interface"],
-  "task": "Open 'https://silennaihin.com/random/plain.html' and paste the text on the page in a .txt file",
-  "dependencies": ["TestWriteFile"],
-  "cutoff": 60,
-  "ground": {
-    "answer": "This is a Heading\nThis is a paragraph.",
-    "should_contain": ["Heading", "paragraph"],
-    "should_not_contain": ["The", "the"],
-    "files": [".txt"],
-    "type": "file"
-  },
-  "info": {
-    "difficulty": "interface",
-    "description": "Tests if an llm can search",
-    "side_effects": [""]
-  }
-}
--- a/agbenchmark/challenges/interface/write_file/artifacts_out/random_file.txt
+++ b/agbenchmark/challenges/interface/write_file/artifacts_out/random_file.txt
@@ -1 +0,0 @@
-Washington
--- a/agbenchmark/challenges/interface/write_file/data.json
+++ b/agbenchmark/challenges/interface/write_file/data.json
@@ -1,19 +0,0 @@
-{
-  "name": "TestWriteFile",
-  "category": ["interface"],
-  "task": "Print the the capital of America to a .txt file",
-  "dependencies": [],
-  "cutoff": 60,
-  "ground": {
-    "answer": "Washington",
-    "should_contain": ["Washington"],
-    "should_not_contain": ["New York", "Los Angeles", "San Francisco"],
-    "files": [".txt"],
-    "type": "file"
-  },
-  "info": {
-    "difficulty": "interface",
-    "description": "Tests the writing to file",
-    "side_effects": ["tests if there is in fact an LLM attached"]
-  }
-}
--- a/agbenchmark/challenges/memory/m1_id/artifacts_in/instructions_1.txt
+++ b/agbenchmark/challenges/memory/m1_id/artifacts_in/instructions_1.txt
@@ -1,2 +0,0 @@
-The id to remember is 2314
-Read the file instructions_2.txt
--- a/agbenchmark/challenges/memory/m1_id/artifacts_in/instructions_2.txt
+++ b/agbenchmark/challenges/memory/m1_id/artifacts_in/instructions_2.txt
@@ -1 +0,0 @@
-Read the file instructions_3.txt
--- a/agbenchmark/challenges/memory/m1_id/artifacts_in/instructions_3.txt
+++ b/agbenchmark/challenges/memory/m1_id/artifacts_in/instructions_3.txt
@@ -1 +0,0 @@
-Read the file instructions_4.txt
--- a/agbenchmark/challenges/memory/m1_id/artifacts_in/instructions_4.txt
+++ b/agbenchmark/challenges/memory/m1_id/artifacts_in/instructions_4.txt
@@ -1 +0,0 @@
-Read the file instructions_5.txt
--- a/agbenchmark/challenges/memory/m1_id/artifacts_in/instructions_5.txt
+++ b/agbenchmark/challenges/memory/m1_id/artifacts_in/instructions_5.txt
@@ -1 +0,0 @@
-Write the id previously mentioned to a file called result.txt.
--- a/agbenchmark/challenges/memory/m1_id/artifacts_out/result.txt
+++ b/agbenchmark/challenges/memory/m1_id/artifacts_out/result.txt
@@ -1 +0,0 @@
-2314
--- a/agbenchmark/challenges/memory/m1_id/data.json
+++ b/agbenchmark/challenges/memory/m1_id/data.json
@@ -1,19 +0,0 @@
-{
-  "name": "TestBasicMemory",
-  "category": ["memory"],
-  "task": "Follow the instructions in the instructions_1.txt file",
-  "dependencies": ["TestReadFile", "TestWriteFile"],
-  "cutoff": 60,
-  "ground": {
-    "answer": "2314",
-    "should_contain": ["2314"],
-    "should_not_contain": [],
-    "files": ["result.txt"],
-    "type": "file"
-  },
-  "info": {
-    "difficulty": "basic",
-    "description": "Tests ability for the agent to remember information between each action. An id is presented initially and the agent has to remember it after reading 4 other files",
-    "side_effects": []
-  }
-}
--- a/agbenchmark/challenges/memory/m2_multiple/artifacts_in/instructions_1.txt
+++ b/agbenchmark/challenges/memory/m2_multiple/artifacts_in/instructions_1.txt
@@ -1 +0,0 @@
-The id to remember is 3145. Read the file instructions_2.txt.
--- a/agbenchmark/challenges/memory/m2_multiple/artifacts_in/instructions_2.txt
+++ b/agbenchmark/challenges/memory/m2_multiple/artifacts_in/instructions_2.txt
@@ -1 +0,0 @@
-The id to remember is 3791. Read the file instructions_3.txt.
--- a/agbenchmark/challenges/memory/m2_multiple/artifacts_in/instructions_3.txt
+++ b/agbenchmark/challenges/memory/m2_multiple/artifacts_in/instructions_3.txt
@@ -1 +0,0 @@
-The id to remember is 9317. Read the file instructions_4.txt.
--- a/agbenchmark/challenges/memory/m2_multiple/artifacts_in/instructions_4.txt
+++ b/agbenchmark/challenges/memory/m2_multiple/artifacts_in/instructions_4.txt
@@ -1 +0,0 @@
-The id to remember is 9471. Read the file instructions_5.txt.
--- a/agbenchmark/challenges/memory/m2_multiple/artifacts_in/instructions_5.txt
+++ b/agbenchmark/challenges/memory/m2_multiple/artifacts_in/instructions_5.txt
@@ -1 +0,0 @@
-Write the ids previously mentioned to a file called result.txt.
--- a/agbenchmark/challenges/memory/m2_multiple/artifacts_out/result.txt
+++ b/agbenchmark/challenges/memory/m2_multiple/artifacts_out/result.txt
@@ -1,4 +0,0 @@
-3145
-3791
-9317
-9471
--- a/agbenchmark/challenges/memory/m2_multiple/data.json
+++ b/agbenchmark/challenges/memory/m2_multiple/data.json
@@ -1,19 +0,0 @@
-{
-  "name": "TestRememberMultipleIds",
-  "category": ["memory"],
-  "task": "Follow the instructions in the instructions_1.txt file",
-  "dependencies": ["TestBasicMemory"],
-  "cutoff": 60,
-  "ground": {
-    "answer": "3145\n3791\n9317\n9471",
-    "should_contain": ["3145", "3791", "9317", "9471"],
-    "should_not_contain": [],
-    "files": ["result.txt"],
-    "type": "file"
-  },
-  "info": {
-    "difficulty": "novice",
-    "description": "Tests ability for the agent to remember information between each action. Multiple ids are presented in different files and the last file instructs the agent to write these ids into another file.",
-    "side_effects": []
-  }
-}
--- a/agbenchmark/challenges/memory/m3_noise/artifacts_in/instructions_1.txt
+++ b/agbenchmark/challenges/memory/m3_noise/artifacts_in/instructions_1.txt
@@ -1,5 +0,0 @@
-xBd9cZTfaExYalwtUVR7m0pe3Nfaf5uBS4IxGFZPZcQjDf8Tfk2vdpqWI0ESBWCdVzsdlxMmUxq43INSz1iftsv6PTOMGQ88Cojwj5mQXp8XKZ6VJC893BDLVLLW00iQy8VerWjQay9rAJz8rYVZHa6dByYNWZNOgtjC7ejnKt0VYZtUFfRBZNWG2HNX3tgX8H2h4xeu7GIfm4wETGvzlZDANeqiY2hMCXEAsliVXXjgmvVeB05tjkS5uvq5uV2DnNyStimIVVdEMFI5Ft0qM82aMvJlUtVj6TJEmE0qPTqBXeHud72iRTcBa9okCzjYiAd6oSoJ8k9o6lmFTeq323ILYCGzsICjqoysuVonfHUDh1Ll2LTo4I2AygfPqCqvgQWq9wa8YfWKBlwPPVy2lymJRTd1mS7RUaiagoNn76ApJviCYh2fWEZcxULCKAbbn0E6vz1CBADSOEIVB14ZyyRfiDcXbgYYcnOShwMsg0vYcKDKfAHk
-
-The id to remember is 3145. Read the file instructions_2.txt.
-
-OueiZyOoM5eGI5VkTt6Ue1XboZ4jztAa5IGWqSbhIhLiI4X2nOmJw6tLBd3smZNwKQpq8NHxZSk76Xd82yGI3l9KhLARXRT37MRkY7WG8YQVJEurki86cIGuXthSWjq9dDKh6ILSq4DdWPIHfDeV12psErCcgEUxWtoU6bnSMnaoYteOkKWTAkXdC1t4j5p3rXbMv1j92nLmCmoslT2A9noQIODWLdudtCecTMmrbq85RLBt5WFLGMfWVsuSrSMGo5tiN7vC1siLfhlhco0q5QaMv0px6kVg44Wceg3UXIUoMxTNoh9G8uEABJhvsF2qzxkbHuhk6VRuydIWoGgfN01upk6BDfvooyAkdcTJG5jFlHOJixTe4ramT5uP54oZ0anJTB6w7hybN3o9vb4xLbAFQxCZIXZ9HXgeBnl1L8qIvQg9VsklntCMsu2cm5CgIryRBGPqnTAbrhmAmFOkNyLSVFfYmu2wtlMov2NIkYilT4Oa1Rkt
--- a/agbenchmark/challenges/memory/m3_noise/artifacts_in/instructions_2.txt
+++ b/agbenchmark/challenges/memory/m3_noise/artifacts_in/instructions_2.txt
@@ -1,5 +0,0 @@
-2yXfw2vPZCZq4jGOTHF4NEUYLbAUBIcmkgLxG7qXnYLNsvvZDqAvBPw4OcOfleIWvS6S5GThSPcrSfX94yB1TT6SVHGqPkulJUk4W1wfIFRIiOSps6V8ulLyrmeZsEJ6l9B9Vrm4h6SZTQVP750TUfECOH4d5j5AtupugjqThyw3t6ZFYHr2eUYRiOiTlng2uvsoZiioBQlUitrjQ4mw8FRL3VaR2aAhHGwaNV0Q7XelFU50YQgcUYqfxHxmqCLqb7dtZ7WWwxrLcqwVbj4y1YteFzPZyU4TJKopMVizgWaam8tKa1hYAQHqEaiAAHigqvYhutPHarpzc4PP2RLE4AZCxRblSY40iXpxQ9waXsrUEZ51ZRFmvm5G17wuKghMcKea2jN2MIgvSxNBy7cszFyBTe6V6u6IMk1wVWa0YulPslLc0bXUVKqZ54b61lyBAKSoFbJVRFYB5XZBL5tp2efvTsEQ3JvFmSREEOhmawIriifCApy1
-
-The id to remember is 3791. Read the file instructions_3.txt.
-
-BDLfeJBcfS4iqE9sNAm4ndZT2F1fsFYdXGRpRQ6xSXl014c9H7NeMbQCtFb7kRtVvzx9AItPj1uqtjA0R35N2Pj8FgxfSPDb8dlizLA6dbKY4JfCWmibzrBYoFzoxiPX57m3n8yLKHA0aejG38aMJ6XjR75kAjBW0Cw9d3Ny0MphakfW8KDZoMO3qwsPLLASYrz42K7JjThVGZvEXczRBY8la4NJPZpj91GmfsQuJezCvcI87gjfjtRDp1GECU9SmLSWBufjQWWlc4p6z5XtPPu0vqxRjoiFDFZvafU35KkEDcWAHv3KhR0Z20JD2qIrJ4CHntwGBAk61nMBpKhNx0t3ONK5X0WD7gNCdG64obji2ifsI8ZydLkROJkAJCpe4zRd04mkydCwKGJzmCGv0lu1KRn4QobFq7mEeuzD0xvvGtyiuiVXJSVqphf5ySmfjD4EvDCMRDNZx7c4pECUnLBPDlB8JwMyugfyD5mslte9YCG9kK6n