category (str): information-retrieval
query (str): The question to be solved.
ground (dict): The ground truth.
- answer (str): The raw text of the ground-truth answer.
- should_contain (list): Exact strings that must appear in the final answer.
- should_not_contain (list): Exact strings that must not appear in the final answer.
difficulty_level (str): The difficulty of this query. One of ["easy", "medium", "hard"].

Example:

{
    "category": "information-retrieval",
    "query": "What is the capital of the United States?",
    "ground": {
        "answer": "Washington",
        "should_contain": ["Washington"],
        "should_not_contain": ["New York", "Los Angeles", "San Francisco"]
    },
    "difficulty_level": "easy"
}
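
A record in this format can be checked before use. The following is a minimal sketch, not part of the original specification: the function name validate_record and the wording of the error messages are hypothetical, and it only checks the field names, types, and difficulty choices listed above.

```python
# Hypothetical helper: validates a record against the field list above.
ALLOWED_DIFFICULTIES = ("easy", "medium", "hard")


def validate_record(record: dict) -> list:
    """Return a list of problems; an empty list means the record looks valid."""
    problems = []

    # Top-level string fields.
    for key in ("category", "query", "difficulty_level"):
        if not isinstance(record.get(key), str):
            problems.append(f"{key} must be a str")

    if record.get("difficulty_level") not in ALLOWED_DIFFICULTIES:
        problems.append("difficulty_level must be one of easy, medium, hard")

    # Nested ground-truth dict.
    ground = record.get("ground")
    if not isinstance(ground, dict):
        problems.append("ground must be a dict")
        return problems

    if not isinstance(ground.get("answer"), str):
        problems.append("ground.answer must be a str")

    for key in ("should_contain", "should_not_contain"):
        value = ground.get(key)
        if not isinstance(value, list) or not all(isinstance(s, str) for s in value):
            problems.append(f"ground.{key} must be a list of str")

    return problems
```

Calling validate_record on the example record above should return an empty list; a record with a missing or mistyped field returns one message per problem.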

Output:

score (float): A score in the range [0, 1].
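
The text does not say how the score is computed from the ground truth. As one possible illustration only, the sketch below assumes a simple rule: the score is 0 if any should_not_contain string appears in the final answer, and otherwise the fraction of should_contain strings that appear. The function name score_answer and this rule are assumptions, not the documented scoring method.

```python
# Hypothetical scoring rule (an assumption, not the documented method).
def score_answer(final_answer: str, ground: dict) -> float:
    """Score a final answer in [0, 1] using exact, case-sensitive substring checks."""
    required = ground.get("should_contain", [])
    forbidden = ground.get("should_not_contain", [])

    # Any forbidden string zeroes the score under this assumed rule.
    if any(s in final_answer for s in forbidden):
        return 0.0

    # With nothing required, an answer free of forbidden strings gets full credit.
    if not required:
        return 1.0

    # Otherwise, credit is the fraction of required strings present.
    hits = sum(1 for s in required if s in final_answer)
    return hits / len(required)
```

Under this assumed rule, an answer such as "The capital is Washington." scores 1.0 against the example record, while an answer mentioning "New York" scores 0.0.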