# Chapter 6: Getting Specific Data - ExtractionStrategy
In the previous chapter, [Chapter 5: Focusing on What Matters - RelevantContentFilter](05_relevantcontentfilter.md), we learned how to sift through the cleaned webpage content to keep only the parts relevant to our query or goal, producing a focused `fit_markdown`. This is great for tasks like summarization or getting the main gist of an article.
But sometimes, we need more than just relevant text. Imagine you're analyzing an e-commerce website listing products. You don't just want the *description*; you need the exact **product name**, the specific **price**, the **customer rating**, and maybe the **SKU number**, all neatly organized. How do we tell Crawl4AI to find these *specific* pieces of information and return them in a structured format, like a JSON object?
## What Problem Does `ExtractionStrategy` Solve?
Think of the content we've processed so far (like the cleaned HTML or the generated Markdown) as a detailed report delivered by a researcher. `RelevantContentFilter` helped trim the report down to the most relevant pages.
Now, we need to give specific instructions to an **Analyst** to go through that focused report and pull out precise data points. We don't just want the report; we want a filled-in spreadsheet with columns for "Product Name," "Price," and "Rating."
`ExtractionStrategy` is the set of instructions we give to this Analyst. It defines *how* to locate and extract specific, structured information (like fields in a database or keys in a JSON object) from the content.
## What is `ExtractionStrategy`?
`ExtractionStrategy` is a core concept (a blueprint) in Crawl4AI that represents the **method used to extract structured data** from the processed content (which could be HTML or Markdown). It specifies *that* we need a way to find specific fields, but the actual *technique* used to find them can vary.
This allows us to choose the best "Analyst" for the job, depending on the complexity of the website and the data we need.
## The Different Analysts: Ways to Extract Data
Crawl4AI offers several concrete implementations (the different Analysts) for extracting structured data:
1. **The Precise Locator (`JsonCssExtractionStrategy` & `JsonXPathExtractionStrategy`)**
   * **Analogy:** An analyst who uses very precise map coordinates (CSS Selectors or XPath expressions) to find information on a page. They need to be told exactly where to look. "The price is always in the HTML element with the ID `#product-price`."
   * **How it works:** You define a **schema** (a Python dictionary) that maps the names of the fields you want (e.g., "product_name", "price") to the specific CSS selector (`JsonCssExtractionStrategy`) or XPath expression (`JsonXPathExtractionStrategy`) that locates that information within the HTML structure.
   * **Pros:** Very fast and reliable if the website structure is consistent and predictable. Doesn't require external AI services.
   * **Cons:** Can break easily if the website changes its layout (selectors become invalid). Requires you to inspect the HTML and figure out the correct selectors.
   * **Input:** Typically works directly on the raw or cleaned HTML.
2. **The Smart Interpreter (`LLMExtractionStrategy`)**
   * **Analogy:** A highly intelligent analyst who can *read and understand* the content. You give them a list of fields you need (a schema) or even just natural language instructions ("Find the product name, its price, and a short description"). They read the content (usually Markdown) and use their understanding of language and context to figure out the values, even if the layout isn't perfectly consistent.
   * **How it works:** You provide a desired output schema (e.g., a Pydantic model or a dictionary structure) or a natural language instruction. The strategy sends the content (often the generated Markdown, possibly split into chunks) along with your schema/instruction to a configured Large Language Model (LLM) like GPT or Llama. The LLM reads the text and generates the structured data (usually JSON) according to your request.
   * **Pros:** Much more resilient to website layout changes. Can understand context and handle variations. Can extract data based on meaning, not just location.
   * **Cons:** Requires setting up access to an LLM (API keys, potentially costs). Can be significantly slower than selector-based methods. The quality of extraction depends on the LLM's capabilities and the clarity of your instructions/schema.
   * **Input:** Often works best on the cleaned Markdown representation of the content, but can sometimes use HTML.
## How to Use an `ExtractionStrategy`
You tell the `AsyncWebCrawler` which extraction strategy to use (if any) by setting the `extraction_strategy` parameter within the [CrawlerRunConfig](03_crawlerrunconfig.md) object you pass to `arun` or `arun_many`.
### Example 1: Extracting Data with `JsonCssExtractionStrategy`
Let's imagine we want to extract the page title (from the `<title>` tag) and the main heading (from the `<h1>` tag) of the simple `httpbin.org/html` page.
```python
# chapter6_example_1.py
import asyncio
import json
from crawl4ai import (
    AsyncWebCrawler,
    CrawlerRunConfig,
    JsonCssExtractionStrategy  # Import the CSS strategy
)

async def main():
    # 1. Define the extraction schema (Field Name -> CSS Selector)
    extraction_schema = {
        "baseSelector": "body",  # Operate within the body tag
        "fields": [
            {"name": "page_title", "selector": "title", "type": "text"},
            {"name": "main_heading", "selector": "h1", "type": "text"}
        ]
    }
    print("Extraction Schema defined using CSS selectors.")

    # 2. Create an instance of the strategy with the schema
    css_extractor = JsonCssExtractionStrategy(schema=extraction_schema)
    print(f"Using strategy: {css_extractor.__class__.__name__}")

    # 3. Create CrawlerRunConfig and set the extraction_strategy
    run_config = CrawlerRunConfig(
        extraction_strategy=css_extractor
    )

    # 4. Run the crawl
    async with AsyncWebCrawler() as crawler:
        url_to_crawl = "https://httpbin.org/html"
        print(f"\nCrawling {url_to_crawl} to extract structured data...")
        result = await crawler.arun(url=url_to_crawl, config=run_config)

        if result.success and result.extracted_content:
            print("\nExtraction successful!")
            # The extracted data is stored as a JSON string in result.extracted_content
            # Parse the JSON string to work with the data as a Python object
            extracted_data = json.loads(result.extracted_content)
            print("Extracted Data:")
            # Print the extracted data nicely formatted
            print(json.dumps(extracted_data, indent=2))
        elif result.success:
            print("\nCrawl successful, but no structured data extracted.")
        else:
            print(f"\nCrawl failed: {result.error_message}")

if __name__ == "__main__":
    asyncio.run(main())
```
**Explanation:**
1. **Schema Definition:** We create a Python dictionary `extraction_schema`.
   * `baseSelector: "body"` tells the strategy to look for items within the `<body>` tag of the HTML.
   * `fields` is a list of dictionaries, each defining a field to extract:
     * `name`: The key for this field in the output JSON (e.g., "page_title").
     * `selector`: The CSS selector to find the element containing the data (e.g., "title" finds the `<title>` tag, "h1" finds the `<h1>` tag).
     * `type`: How to get the data from the selected element (`"text"` means get the text content).
2. **Instantiate Strategy:** We create an instance of `JsonCssExtractionStrategy`, passing our `extraction_schema`. This strategy knows its input format should be HTML.
3. **Configure Run:** We create a `CrawlerRunConfig` and assign our `css_extractor` instance to the `extraction_strategy` parameter.
4. **Crawl:** We run `crawler.arun`. After fetching and basic scraping, the `AsyncWebCrawler` will see the `extraction_strategy` in the config and call our `css_extractor`.
5. **Result:** The `CrawlResult` object now contains a field called `extracted_content`. This field holds the structured data found by the strategy, formatted as a **JSON string**. We use `json.loads()` to convert this string back into a Python list/dictionary.
**Expected Output (Conceptual):**
```
Extraction Schema defined using CSS selectors.
Using strategy: JsonCssExtractionStrategy
Crawling https://httpbin.org/html to extract structured data...
Extraction successful!
Extracted Data:
[
  {
    "page_title": "Herman Melville - Moby-Dick",
    "main_heading": "Moby Dick"
  }
]
```
*(Note: The actual output is a list containing one dictionary because `baseSelector: "body"` matches one element, and we extract fields relative to that.)*
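If you'd rather locate elements with XPath, `JsonXPathExtractionStrategy` follows the same pattern. Below is a minimal sketch, assuming it accepts the same schema shape as the CSS strategy with XPath expressions in the `selector` fields; selector syntax and relative-path handling may differ slightly by version.

```python
# Hedged sketch: the same kind of extraction expressed with XPath selectors.
# Assumes JsonXPathExtractionStrategy takes the same schema shape as the CSS
# strategy; check your installed crawl4ai version for the exact conventions.
from crawl4ai import CrawlerRunConfig, JsonXPathExtractionStrategy

xpath_schema = {
    "baseSelector": "//body",  # XPath locating the base element(s)
    "fields": [
        {"name": "main_heading", "selector": ".//h1", "type": "text"}
    ]
}

xpath_extractor = JsonXPathExtractionStrategy(schema=xpath_schema)
run_config = CrawlerRunConfig(extraction_strategy=xpath_extractor)
# Pass run_config to crawler.arun() exactly as in the CSS example above.
```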
### Example 2: Extracting Data with `LLMExtractionStrategy` (Conceptual)
Now, let's imagine we want the same information (title, heading) but using an AI. We'll provide a schema describing what we want. (Note: This requires setting up LLM access separately, e.g., API keys).
```python
# chapter6_example_2.py
import asyncio
import json
from crawl4ai import (
    AsyncWebCrawler,
    CrawlerRunConfig,
    LLMExtractionStrategy,  # Import the LLM strategy
    LlmConfig  # Import LLM configuration helper
)

# Assume llm_config is properly configured with provider, API key, etc.
# This is just a placeholder - replace with your actual LLM setup
# E.g., llm_config = LlmConfig(provider="openai", api_token="env:OPENAI_API_KEY")
class MockLlmConfig:
    provider = "mock"
    api_token = "mock"
    base_url = None

llm_config = MockLlmConfig()

async def main():
    # 1. Define the desired output schema (what fields we want)
    #    This helps guide the LLM.
    output_schema = {
        "page_title": "string",
        "main_heading": "string"
    }
    print("Extraction Schema defined for LLM.")

    # 2. Create an instance of the LLM strategy
    #    We pass the schema and the LLM configuration.
    #    We also specify input_format='markdown' (common for LLMs).
    llm_extractor = LLMExtractionStrategy(
        schema=output_schema,
        llmConfig=llm_config,    # Pass the LLM provider details
        input_format="markdown"  # Tell it to read the Markdown content
    )
    print(f"Using strategy: {llm_extractor.__class__.__name__}")
    print(f"LLM Provider (mocked): {llm_config.provider}")

    # 3. Create CrawlerRunConfig with the strategy
    run_config = CrawlerRunConfig(
        extraction_strategy=llm_extractor
    )

    # 4. Run the crawl
    async with AsyncWebCrawler() as crawler:
        url_to_crawl = "https://httpbin.org/html"
        print(f"\nCrawling {url_to_crawl} using LLM to extract...")
        # This would make calls to the configured LLM API
        result = await crawler.arun(url=url_to_crawl, config=run_config)

        if result.success and result.extracted_content:
            print("\nExtraction successful (using LLM)!")
            # Extracted data is a JSON string
            try:
                extracted_data = json.loads(result.extracted_content)
                print("Extracted Data:")
                print(json.dumps(extracted_data, indent=2))
            except json.JSONDecodeError:
                print("Could not parse LLM output as JSON:")
                print(result.extracted_content)
        elif result.success:
            print("\nCrawl successful, but no structured data extracted by LLM.")
            # This might happen if the mock LLM doesn't return valid JSON
            # or if the content was too small/irrelevant for extraction.
        else:
            print(f"\nCrawl failed: {result.error_message}")

if __name__ == "__main__":
    asyncio.run(main())
```
**Explanation:**
1. **Schema Definition:** We define a simple dictionary `output_schema` telling the LLM we want fields named "page_title" and "main_heading", both expected to be strings.
2. **Instantiate Strategy:** We create `LLMExtractionStrategy`, passing:
   * `schema=output_schema`: Our desired output structure.
   * `llmConfig=llm_config`: The configuration telling the strategy *which* LLM to use and how to authenticate (here, it's mocked).
   * `input_format="markdown"`: Instructs the strategy to feed the generated Markdown content (from `result.markdown.raw_markdown`) to the LLM, which is often easier for LLMs to parse than raw HTML.
3. **Configure Run & Crawl:** Same as before, we set the `extraction_strategy` in `CrawlerRunConfig` and run the crawl.
4. **Result:** The `AsyncWebCrawler` calls the `llm_extractor`. The strategy sends the Markdown content and the schema instructions to the configured LLM. The LLM analyzes the text and (hopefully) returns a JSON object matching the schema. This JSON is stored as a string in `result.extracted_content`.
**Expected Output (Conceptual, with a real LLM):**
```
Extraction Schema defined for LLM.
Using strategy: LLMExtractionStrategy
LLM Provider (mocked): mock
Crawling https://httpbin.org/html using LLM to extract...
Extraction successful (using LLM)!
Extracted Data:
[
  {
    "page_title": "Herman Melville - Moby-Dick",
    "main_heading": "Moby Dick"
  }
]
```
*(Note: LLM output format might vary slightly, but it aims to match the requested schema based on the content it reads.)*
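When you move from the mocked setup to a real one, an actual `LlmConfig` (as in the placeholder comment in the example above) replaces `MockLlmConfig`, and the schema can also be derived from a Pydantic model. Here is a hedged sketch; parameter names follow the example above, and exact signatures may differ between crawl4ai versions.

```python
# Sketch of a non-mocked configuration; verify argument names against your
# installed crawl4ai version before relying on this.
from pydantic import BaseModel
from crawl4ai import LLMExtractionStrategy, LlmConfig

class PageInfo(BaseModel):
    page_title: str
    main_heading: str

# "env:OPENAI_API_KEY" mirrors the placeholder comment in the example above.
llm_config = LlmConfig(provider="openai", api_token="env:OPENAI_API_KEY")

llm_extractor = LLMExtractionStrategy(
    schema=PageInfo.model_json_schema(),  # JSON Schema generated from the model
    llmConfig=llm_config,
    input_format="markdown"
)
```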
## How It Works Inside (Under the Hood)
When you provide an `extraction_strategy` in the `CrawlerRunConfig`, how does `AsyncWebCrawler` use it?
1. **Fetch & Scrape:** The crawler fetches the raw HTML ([AsyncCrawlerStrategy](01_asynccrawlerstrategy.md)) and performs initial cleaning/scraping ([ContentScrapingStrategy](04_contentscrapingstrategy.md)) to get `cleaned_html`, links, etc.
2. **Markdown Generation:** It usually generates a Markdown representation ([DefaultMarkdownGenerator](05_relevantcontentfilter.md#how-relevantcontentfilter-is-used-via-markdown-generation)).
3. **Check for Strategy:** The `AsyncWebCrawler` (specifically in its internal `aprocess_html` method) checks if `config.extraction_strategy` is set.
4. **Execute Strategy:** If a strategy exists:
   * It determines the required input format (e.g., "html" for `JsonCssExtractionStrategy`, "markdown" for `LLMExtractionStrategy`, based on the strategy's `input_format` attribute); a simplified sketch of this dispatch appears after the diagram below.
   * It retrieves the corresponding content (e.g., `result.cleaned_html` or `result.markdown.raw_markdown`).
   * If the content is long and the strategy supports chunking (like `LLMExtractionStrategy`), it might first split the content into smaller chunks.
   * It calls the strategy's `run` method, passing the content chunk(s).
   * The strategy performs its logic (applying selectors, calling the LLM API).
   * The strategy returns the extracted data (typically as a list of dictionaries).
5. **Store Result:** The `AsyncWebCrawler` converts the returned structured data into a JSON string and stores it in `CrawlResult.extracted_content`.
Here's a simplified view:
```mermaid
sequenceDiagram
participant User
participant AWC as AsyncWebCrawler
participant Config as CrawlerRunConfig
participant Processor as HTML Processing
participant Extractor as ExtractionStrategy
participant Result as CrawlResult
User->>AWC: arun(url, config=my_config)
Note over AWC: Config includes an Extraction Strategy
AWC->>Processor: Process HTML (scrape, generate markdown)
Processor-->>AWC: Processed Content (HTML, Markdown)
AWC->>Extractor: Run extraction on content (using Strategy's input format)
Note over Extractor: Applying logic (CSS, XPath, LLM...)
Extractor-->>AWC: Structured Data (List[Dict])
AWC->>AWC: Convert data to JSON String
AWC->>Result: Store JSON String in extracted_content
AWC-->>User: Return CrawlResult
```
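To make the dispatch step concrete, here is a rough sketch of the idea (not the library's actual `aprocess_html` code): pick the content that matches the strategy's declared `input_format`, hand it to `run` as a list of sections, and serialize the result.

```python
import json

# Illustrative only -- a simplified stand-in for the dispatch logic described
# above, not the real aprocess_html implementation.
def apply_extraction(config, cleaned_html: str, raw_markdown: str, url: str):
    strategy = config.extraction_strategy
    if strategy is None:
        return None  # No strategy configured, nothing to extract

    # Choose the input the strategy expects
    content = cleaned_html if strategy.input_format == "html" else raw_markdown

    # Strategies operate on a list of content sections (chunks)
    extracted = strategy.run(url, [content])

    # Stored as a JSON string in CrawlResult.extracted_content
    return json.dumps(extracted)
```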
### Code Glimpse (`extraction_strategy.py`)
Inside the `crawl4ai` library, the file `extraction_strategy.py` defines the blueprint and the implementations.
**The Blueprint (Abstract Base Class):**
```python
# Simplified from crawl4ai/extraction_strategy.py
from abc import ABC, abstractmethod
from typing import List, Dict, Any

class ExtractionStrategy(ABC):
    """Abstract base class for all extraction strategies."""

    def __init__(self, input_format: str = "markdown", **kwargs):
        self.input_format = input_format  # e.g., 'html', 'markdown'
        # ... other common init ...

    @abstractmethod
    def extract(self, url: str, content_chunk: str, *q, **kwargs) -> List[Dict[str, Any]]:
        """Extract structured data from a single chunk of content."""
        pass

    def run(self, url: str, sections: List[str], *q, **kwargs) -> List[Dict[str, Any]]:
        """Process content sections (potentially chunked) and call extract."""
        # Default implementation might process sections in parallel or sequentially
        all_extracted_data = []
        for section in sections:
            all_extracted_data.extend(self.extract(url, section, **kwargs))
        return all_extracted_data
```
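Because the blueprint is this small, you can write your own analyst. The class below is a hypothetical example (not part of the library) that pulls email addresses out of each content chunk with a regular expression:

```python
# Hypothetical custom strategy -- shows how the interface could be implemented,
# not code from the crawl4ai library itself.
import re
from typing import List, Dict, Any

from crawl4ai.extraction_strategy import ExtractionStrategy

class EmailExtractionStrategy(ExtractionStrategy):
    """Extracts email addresses from each content chunk."""

    EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

    def __init__(self, **kwargs):
        # Plain text is enough, so ask for the Markdown representation
        super().__init__(input_format="markdown", **kwargs)

    def extract(self, url: str, content_chunk: str, *q, **kwargs) -> List[Dict[str, Any]]:
        # One dictionary per unique email address found in this chunk
        return [{"email": email} for email in sorted(set(self.EMAIL_RE.findall(content_chunk)))]
```

An instance of this class could then be set as `extraction_strategy` in `CrawlerRunConfig`, just like the built-in strategies.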
**Example Implementation (`JsonCssExtractionStrategy`):**
```python
# Simplified from crawl4ai/extraction_strategy.py
from bs4 import BeautifulSoup  # Uses BeautifulSoup for CSS selectors

class JsonCssExtractionStrategy(ExtractionStrategy):
    def __init__(self, schema: Dict[str, Any], **kwargs):
        # Force input format to HTML for CSS selectors
        super().__init__(input_format="html", **kwargs)
        self.schema = schema  # Store the user-defined schema

    def extract(self, url: str, html_content: str, *q, **kwargs) -> List[Dict[str, Any]]:
        # Parse the HTML content chunk
        soup = BeautifulSoup(html_content, "html.parser")
        extracted_items = []

        # Find base elements defined in the schema
        base_elements = soup.select(self.schema.get("baseSelector", "body"))
        for element in base_elements:
            item = {}
            # Extract fields based on schema selectors and types
            fields_to_extract = self.schema.get("fields", [])
            for field_def in fields_to_extract:
                try:
                    # Find the specific sub-element using CSS selector
                    target_element = element.select_one(field_def["selector"])
                    if target_element:
                        if field_def["type"] == "text":
                            item[field_def["name"]] = target_element.get_text(strip=True)
                        elif field_def["type"] == "attribute":
                            item[field_def["name"]] = target_element.get(field_def["attribute"])
                        # ... other types like 'html', 'list', 'nested' ...
                except Exception as e:
                    # Handle errors, maybe log them if verbose
                    pass
            if item:
                extracted_items.append(item)
        return extracted_items

    # run() method likely uses the default implementation from base class
```
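Because the CSS strategy only needs HTML text, you can try a schema out on its own before running a full crawl. A small sketch, assuming the `run(url, sections)` signature shown in the simplified code above:

```python
# Quick schema check without launching a crawl.
from crawl4ai import JsonCssExtractionStrategy

sample_html = """
<html><head><title>Demo Page</title></head>
<body><h1>Hello</h1></body></html>
"""

schema = {
    "baseSelector": "body",
    "fields": [{"name": "main_heading", "selector": "h1", "type": "text"}]
}

strategy = JsonCssExtractionStrategy(schema=schema)
print(strategy.run("https://example.com", [sample_html]))
# Expected shape: [{"main_heading": "Hello"}]
```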
**Example Implementation (`LLMExtractionStrategy`):**
```python
# Simplified from crawl4ai/extraction_strategy.py
# Needs imports for LLM interaction (e.g., perform_completion_with_backoff)
from .utils import perform_completion_with_backoff, chunk_documents, escape_json_string
from .prompts import PROMPT_EXTRACT_SCHEMA_WITH_INSTRUCTION  # Example prompt

class LLMExtractionStrategy(ExtractionStrategy):
    def __init__(self, schema: Dict = None, instruction: str = None, llmConfig=None, input_format="markdown", **kwargs):
        super().__init__(input_format=input_format, **kwargs)
        self.schema = schema
        self.instruction = instruction
        self.llmConfig = llmConfig  # Contains provider, API key, etc.
        # ... other LLM specific setup ...

    def extract(self, url: str, content_chunk: str, *q, **kwargs) -> List[Dict[str, Any]]:
        # Prepare the prompt for the LLM
        prompt = self._build_llm_prompt(url, content_chunk)

        # Call the LLM API
        response = perform_completion_with_backoff(
            provider=self.llmConfig.provider,
            prompt_with_variables=prompt,
            api_token=self.llmConfig.api_token,
            base_url=self.llmConfig.base_url,
            json_response=True  # Often expect JSON from LLM for extraction
            # ... pass other necessary args ...
        )

        # Parse the LLM's response (which should ideally be JSON)
        try:
            extracted_data = json.loads(response.choices[0].message.content)
            # Ensure it's a list
            if isinstance(extracted_data, dict):
                extracted_data = [extracted_data]
            return extracted_data
        except Exception as e:
            # Handle LLM response parsing errors
            print(f"Error parsing LLM response: {e}")
            return [{"error": "Failed to parse LLM output", "raw_output": response.choices[0].message.content}]

    def _build_llm_prompt(self, url: str, content_chunk: str) -> str:
        # Logic to construct the prompt using self.schema or self.instruction
        # and the content_chunk. Example:
        prompt_template = PROMPT_EXTRACT_SCHEMA_WITH_INSTRUCTION  # Choose appropriate prompt
        variable_values = {
            "URL": url,
            "CONTENT": escape_json_string(content_chunk),  # Send Markdown or HTML chunk
            "SCHEMA": json.dumps(self.schema) if self.schema else "{}",
            "REQUEST": self.instruction if self.instruction else "Extract relevant data based on the schema."
        }
        prompt = prompt_template
        for var, val in variable_values.items():
            prompt = prompt.replace("{" + var + "}", str(val))
        return prompt

    # run() method might override the base to handle chunking specifically for LLMs
    def run(self, url: str, sections: List[str], *q, **kwargs) -> List[Dict[str, Any]]:
        # Potentially chunk sections based on token limits before calling extract
        # chunked_content = chunk_documents(sections, ...)
        # extracted_data = []
        # for chunk in chunked_content:
        #     extracted_data.extend(self.extract(url, chunk, **kwargs))
        # return extracted_data
        # Simplified for now:
        return super().run(url, sections, *q, **kwargs)
```
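The chunking hinted at in the commented-out `run()` override matters because LLM context windows are limited. The helper below is a naive, library-agnostic sketch of that idea; the real `chunk_documents` utility may use token counts, overlap, or other rules.

```python
# Naive chunking sketch: split long text into pieces of roughly max_words words.
# Illustrates why run() might pre-split sections before calling extract();
# not the actual crawl4ai chunking implementation.
def simple_chunks(text: str, max_words: int = 800) -> list[str]:
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]
```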
## Conclusion
You've learned about `ExtractionStrategy`, Crawl4AI's way of giving instructions to an "Analyst" to pull out specific, structured data from web content.
* It solves the problem of needing precise data points (like product names, prices) in an organized format, not just blocks of text.
* You can choose your "Analyst":
  * **Precise Locators (`JsonCssExtractionStrategy`, `JsonXPathExtractionStrategy`):** Use exact CSS/XPath selectors defined in a schema. Fast but brittle.
  * **Smart Interpreter (`LLMExtractionStrategy`):** Uses an AI (LLM) guided by a schema or instructions. More flexible but slower and needs setup.
* You configure the desired strategy within the [CrawlerRunConfig](03_crawlerrunconfig.md).
* The extracted structured data is returned as a JSON string in the `CrawlResult.extracted_content` field.
Now that we understand how to fetch, clean, filter, and extract data, let's put it all together and look at the final package that Crawl4AI delivers after a crawl.
**Next:** Let's dive into the details of the output with [Chapter 7: Understanding the Results - CrawlResult](07_crawlresult.md).
---
Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge)