# Chapter 6: Getting Specific Data - ExtractionStrategy
In the previous chapter, [Chapter 5: Focusing on What Matters - RelevantContentFilter](05_relevantcontentfilter.md), we learned how to sift through the cleaned webpage content to keep only the parts relevant to our query or goal, producing a focused `fit_markdown`. This is great for tasks like summarization or getting the main gist of an article.
But sometimes, we need more than just relevant text. Imagine you're analyzing an e-commerce website listing products. You don't just want the *description*; you need the exact **product name**, the specific **price**, the **customer rating**, and maybe the **SKU number**, all neatly organized. How do we tell Crawl4AI to find these *specific* pieces of information and return them in a structured format, like a JSON object?
## What Problem Does `ExtractionStrategy` Solve?
Think of the content we've processed so far (like the cleaned HTML or the generated Markdown) as a detailed report delivered by a researcher. `RelevantContentFilter` helped trim the report down to the most relevant pages.
Now, we need to give specific instructions to an **Analyst** to go through that focused report and pull out precise data points. We don't just want the report; we want a filled-in spreadsheet with columns for "Product Name," "Price," and "Rating."
`ExtractionStrategy` is the set of instructions we give to this Analyst. It defines *how* to locate and extract specific, structured information (like fields in a database or keys in a JSON object) from the content.
## What is `ExtractionStrategy`?
`ExtractionStrategy` is a core concept (a blueprint) in Crawl4AI that represents the **method used to extract structured data** from the processed content (which could be HTML or Markdown). It specifies *that* we need a way to find specific fields, but the actual *technique* used to find them can vary.
This allows us to choose the best "Analyst" for the job, depending on the complexity of the website and the data we need.
## The Different Analysts: Ways to Extract Data
Crawl4AI offers several concrete implementations (the different Analysts) for extracting structured data:
1. **The Precise Locator (`JsonCssExtractionStrategy` & `JsonXPathExtractionStrategy`)**
   * **Analogy:** An analyst who uses very precise map coordinates (CSS Selectors or XPath expressions) to find information on a page. They need to be told exactly where to look. "The price is always in the HTML element with the ID `#product-price`."
   * **How it works:** You define a **schema** (a Python dictionary) that maps the names of the fields you want (e.g., "product_name", "price") to the specific CSS selector (`JsonCssExtractionStrategy`) or XPath expression (`JsonXPathExtractionStrategy`) that locates that information within the HTML structure.
   * **Pros:** Very fast and reliable if the website structure is consistent and predictable. Doesn't require external AI services.
   * **Cons:** Can break easily if the website changes its layout (selectors become invalid). Requires you to inspect the HTML and figure out the correct selectors.
   * **Input:** Typically works directly on the raw or cleaned HTML.
2. **The Smart Interpreter (`LLMExtractionStrategy`)**
   * **Analogy:** A highly intelligent analyst who can *read and understand* the content. You give them a list of fields you need (a schema) or even just natural language instructions ("Find the product name, its price, and a short description"). They read the content (usually Markdown) and use their understanding of language and context to figure out the values, even if the layout isn't perfectly consistent.
   * **How it works:** You provide a desired output schema (e.g., a Pydantic model or a dictionary structure) or a natural language instruction. The strategy sends the content (often the generated Markdown, possibly split into chunks) along with your schema/instruction to a configured Large Language Model (LLM) like GPT or Llama. The LLM reads the text and generates the structured data (usually JSON) according to your request.
   * **Pros:** Much more resilient to website layout changes. Can understand context and handle variations. Can extract data based on meaning, not just location.
   * **Cons:** Requires setting up access to an LLM (API keys, potentially costs). Can be significantly slower than selector-based methods. The quality of extraction depends on the LLM's capabilities and the clarity of your instructions/schema.
   * **Input:** Often works best on the cleaned Markdown representation of the content, but can sometimes use HTML.
## How to Use an `ExtractionStrategy`
You tell the `AsyncWebCrawler` which extraction strategy to use (if any) by setting the `extraction_strategy` parameter within the [CrawlerRunConfig](03_crawlerrunconfig.md) object you pass to `arun` or `arun_many`.
### Example 1: Extracting Data with `JsonCssExtractionStrategy`
Let's imagine we want to extract the page title (from the `<title>` tag) and the main heading (from the `<h1>` tag) of the simple `httpbin.org/html` page.
```python
# chapter6_example_1.py
import asyncio
import json
from crawl4ai import (
    AsyncWebCrawler,
    CrawlerRunConfig,
    JsonCssExtractionStrategy  # Import the CSS strategy
)

async def main():
    # 1. Define the extraction schema (Field Name -> CSS Selector)
    extraction_schema = {
        "baseSelector": "body",  # Operate within the body tag
        "fields": [
            {"name": "page_title", "selector": "title", "type": "text"},
            {"name": "main_heading", "selector": "h1", "type": "text"}
        ]
    }
    print("Extraction Schema defined using CSS selectors.")

    # 2. Create an instance of the strategy with the schema
    css_extractor = JsonCssExtractionStrategy(schema=extraction_schema)
    print(f"Using strategy: {css_extractor.__class__.__name__}")

    # 3. Create CrawlerRunConfig and set the extraction_strategy
    run_config = CrawlerRunConfig(
        extraction_strategy=css_extractor
    )

    # 4. Run the crawl
    async with AsyncWebCrawler() as crawler:
        url_to_crawl = "https://httpbin.org/html"
        print(f"\nCrawling {url_to_crawl} to extract structured data...")
        result = await crawler.arun(url=url_to_crawl, config=run_config)

        if result.success and result.extracted_content:
            print("\nExtraction successful!")
            # The extracted data is stored as a JSON string in result.extracted_content
            # Parse the JSON string to work with the data as a Python object
            extracted_data = json.loads(result.extracted_content)
            print("Extracted Data:")
            # Print the extracted data nicely formatted
            print(json.dumps(extracted_data, indent=2))
        elif result.success:
            print("\nCrawl successful, but no structured data extracted.")
        else:
            print(f"\nCrawl failed: {result.error_message}")

if __name__ == "__main__":
    asyncio.run(main())
```
**Explanation:**
1. **Schema Definition:** We create a Python dictionary `extraction_schema`.
   * `baseSelector: "body"` tells the strategy to look for items within the `<body>` tag of the HTML.
   * `fields` is a list of dictionaries, each defining a field to extract:
     * `name`: The key for this field in the output JSON (e.g., "page_title").
     * `selector`: The CSS selector to find the element containing the data (e.g., "title" finds the `<title>` tag, "h1" finds the `<h1>` tag).
     * `type`: How to get the data from the selected element (`"text"` means get the text content).
2. **Instantiate Strategy:** We create an instance of `JsonCssExtractionStrategy`, passing our `extraction_schema`. This strategy knows its input format should be HTML.
3. **Configure Run:** We create a `CrawlerRunConfig` and assign our `css_extractor` instance to the `extraction_strategy` parameter.
4. **Crawl:** We run `crawler.arun`. After fetching and basic scraping, the `AsyncWebCrawler` will see the `extraction_strategy` in the config and call our `css_extractor`.
5. **Result:** The `CrawlResult` object now contains a field called `extracted_content`. This field holds the structured data found by the strategy, formatted as a **JSON string**. We use `json.loads()` to convert this string back into a Python list/dictionary.
**Expected Output (Conceptual):**
```
Extraction Schema defined using CSS selectors.
Using strategy: JsonCssExtractionStrategy
Crawling https://httpbin.org/html to extract structured data...
Extraction successful!
Extracted Data:
[
  {
    "page_title": "Herman Melville - Moby-Dick",
    "main_heading": "Moby Dick"
  }
]
```
*(Note: The actual output is a list containing one dictionary because `baseSelector: "body"` matches one element, and we extract fields relative to that.)*
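If you'd rather locate elements with XPath, `JsonXPathExtractionStrategy` follows the same pattern. Below is a minimal sketch, assuming it accepts the same schema shape as the CSS strategy with XPath expressions in the `selector` fields; selector syntax and relative-path handling may differ slightly by version.

```python
# Hedged sketch: the same kind of extraction expressed with XPath selectors.
# Assumes JsonXPathExtractionStrategy takes the same schema shape as the CSS
# strategy; check your installed crawl4ai version for the exact conventions.
from crawl4ai import CrawlerRunConfig, JsonXPathExtractionStrategy

xpath_schema = {
    "baseSelector": "//body",  # XPath locating the base element(s)
    "fields": [
        {"name": "main_heading", "selector": ".//h1", "type": "text"}
    ]
}

xpath_extractor = JsonXPathExtractionStrategy(schema=xpath_schema)
run_config = CrawlerRunConfig(extraction_strategy=xpath_extractor)
# Pass run_config to crawler.arun() exactly as in the CSS example above.
```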
### Example 2: Extracting Data with `LLMExtractionStrategy` (Conceptual)
Now, let's imagine we want the same information (title, heading) but using an AI. We'll provide a schema describing what we want. (Note: This requires setting up LLM access separately, e.g., API keys).
```python
# chapter6_example_2.py
import asyncio
import json
from crawl4ai import (
    AsyncWebCrawler,
    CrawlerRunConfig,
    LLMExtractionStrategy,  # Import the LLM strategy
    LlmConfig  # Import LLM configuration helper
)

# Assume llm_config is properly configured with provider, API key, etc.
# This is just a placeholder - replace with your actual LLM setup
# E.g., llm_config = LlmConfig(provider="openai", api_token="env:OPENAI_API_KEY")
class MockLlmConfig:
    provider = "mock"
    api_token = "mock"
    base_url = None

llm_config = MockLlmConfig()

async def main():
    # 1. Define the desired output schema (what fields we want)
    #    This helps guide the LLM.
    output_schema = {
        "page_title": "string",
        "main_heading": "string"
    }
    print("Extraction Schema defined for LLM.")

    # 2. Create an instance of the LLM strategy
    #    We pass the schema and the LLM configuration.
    #    We also specify input_format='markdown' (common for LLMs).
    llm_extractor = LLMExtractionStrategy(
        schema=output_schema,
        llmConfig=llm_config,    # Pass the LLM provider details
        input_format="markdown"  # Tell it to read the Markdown content
    )
    print(f"Using strategy: {llm_extractor.__class__.__name__}")
    print(f"LLM Provider (mocked): {llm_config.provider}")

    # 3. Create CrawlerRunConfig with the strategy
    run_config = CrawlerRunConfig(
        extraction_strategy=llm_extractor
    )

    # 4. Run the crawl
    async with AsyncWebCrawler() as crawler:
        url_to_crawl = "https://httpbin.org/html"
        print(f"\nCrawling {url_to_crawl} using LLM to extract...")
        # This would make calls to the configured LLM API
        result = await crawler.arun(url=url_to_crawl, config=run_config)

        if result.success and result.extracted_content:
            print("\nExtraction successful (using LLM)!")
            # Extracted data is a JSON string
            try:
                extracted_data = json.loads(result.extracted_content)
                print("Extracted Data:")
                print(json.dumps(extracted_data, indent=2))
            except json.JSONDecodeError:
                print("Could not parse LLM output as JSON:")
                print(result.extracted_content)
        elif result.success:
            print("\nCrawl successful, but no structured data extracted by LLM.")
            # This might happen if the mock LLM doesn't return valid JSON
            # or if the content was too small/irrelevant for extraction.
        else:
            print(f"\nCrawl failed: {result.error_message}")

if __name__ == "__main__":
    asyncio.run(main())
```
**Explanation:**
1. **Schema Definition:** We define a simple dictionary `output_schema` telling the LLM we want fields named "page_title" and "main_heading", both expected to be strings.
2. **Instantiate Strategy:** We create `LLMExtractionStrategy`, passing:
   * `schema=output_schema`: Our desired output structure.
   * `llmConfig=llm_config`: The configuration telling the strategy *which* LLM to use and how to authenticate (here, it's mocked).
   * `input_format="markdown"`: Instructs the strategy to feed the generated Markdown content (from `result.markdown.raw_markdown`) to the LLM, which is often easier for LLMs to parse than raw HTML.
3. **Configure Run & Crawl:** Same as before, we set the `extraction_strategy` in `CrawlerRunConfig` and run the crawl.
4. **Result:** The `AsyncWebCrawler` calls the `llm_extractor`. The strategy sends the Markdown content and the schema instructions to the configured LLM. The LLM analyzes the text and (hopefully) returns a JSON object matching the schema. This JSON is stored as a string in `result.extracted_content`.
**Expected Output (Conceptual, with a real LLM):**
```
Extraction Schema defined for LLM.
Using strategy: LLMExtractionStrategy
LLM Provider (mocked): mock
Crawling https://httpbin.org/html using LLM to extract...
Extraction successful (using LLM)!
Extracted Data:
[
  {
    "page_title": "Herman Melville - Moby-Dick",
    "main_heading": "Moby Dick"
  }
]
```
*(Note: LLM output format might vary slightly, but it aims to match the requested schema based on the content it reads.)*
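When you move from the mocked setup to a real one, an actual `LlmConfig` (as in the placeholder comment in the example above) replaces `MockLlmConfig`, and the schema can also be derived from a Pydantic model. Here is a hedged sketch; parameter names follow the example above, and exact signatures may differ between crawl4ai versions.

```python
# Sketch of a non-mocked configuration; verify argument names against your
# installed crawl4ai version before relying on this.
from pydantic import BaseModel
from crawl4ai import LLMExtractionStrategy, LlmConfig

class PageInfo(BaseModel):
    page_title: str
    main_heading: str

# "env:OPENAI_API_KEY" mirrors the placeholder comment in the example above.
llm_config = LlmConfig(provider="openai", api_token="env:OPENAI_API_KEY")

llm_extractor = LLMExtractionStrategy(
    schema=PageInfo.model_json_schema(),  # JSON Schema generated from the model
    llmConfig=llm_config,
    input_format="markdown"
)
```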
## How It Works Inside (Under the Hood)
When you provide an `extraction_strategy` in the `CrawlerRunConfig`, how does `AsyncWebCrawler` use it?
1. **Fetch & Scrape:** The crawler fetches the raw HTML ([AsyncCrawlerStrategy](01_asynccrawlerstrategy.md)) and performs initial cleaning/scraping ([ContentScrapingStrategy](04_contentscrapingstrategy.md)) to get `cleaned_html`, links, etc.
2. **Markdown Generation:** It usually generates a Markdown representation ([DefaultMarkdownGenerator](05_relevantcontentfilter.md#how-relevantcontentfilter-is-used-via-markdown-generation)).
3. **Check for Strategy:** The `AsyncWebCrawler` (specifically in its internal `aprocess_html` method) checks if `config.extraction_strategy` is set.
4. **Execute Strategy:** If a strategy exists:
   * It determines the required input format (e.g., "html" for `JsonCssExtractionStrategy`, "markdown" for `LLMExtractionStrategy`, based on the strategy's `input_format` attribute); a simplified sketch of this dispatch appears after the diagram below.
   * It retrieves the corresponding content (e.g., `result.cleaned_html` or `result.markdown.raw_markdown`).
   * If the content is long and the strategy supports chunking (like `LLMExtractionStrategy`), it might first split the content into smaller chunks.
   * It calls the strategy's `run` method, passing the content chunk(s).
   * The strategy performs its logic (applying selectors, calling the LLM API).
   * The strategy returns the extracted data (typically as a list of dictionaries).
5. **Store Result:** The `AsyncWebCrawler` converts the returned structured data into a JSON string and stores it in `CrawlResult.extracted_content`.
Here's a simplified view:
```mermaid
sequenceDiagram
participant User
participant AWC as AsyncWebCrawler
participant Config as CrawlerRunConfig
participant Processor as HTML Processing
participant Extractor as ExtractionStrategy
participant Result as CrawlResult
User->>AWC: arun(url, config=my_config)
Note over AWC: Config includes an Extraction Strategy
AWC->>Processor: Process HTML (scrape, generate markdown)
Processor-->>AWC: Processed Content (HTML, Markdown)
AWC->>Extractor: Run extraction on content (using Strategy's input format)
Note over Extractor: Applying logic (CSS, XPath, LLM...)
Extractor-->>AWC: Structured Data (List[Dict])
AWC->>AWC: Convert data to JSON String
AWC->>Result: Store JSON String in extracted_content
AWC-->>User: Return CrawlResult
```
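To make the dispatch step concrete, here is a rough sketch of the idea (not the library's actual `aprocess_html` code): pick the content that matches the strategy's declared `input_format`, hand it to `run` as a list of sections, and serialize the result.

```python
import json

# Illustrative only -- a simplified stand-in for the dispatch logic described
# above, not the real aprocess_html implementation.
def apply_extraction(config, cleaned_html: str, raw_markdown: str, url: str):
    strategy = config.extraction_strategy
    if strategy is None:
        return None  # No strategy configured, nothing to extract

    # Choose the input the strategy expects
    content = cleaned_html if strategy.input_format == "html" else raw_markdown

    # Strategies operate on a list of content sections (chunks)
    extracted = strategy.run(url, [content])

    # Stored as a JSON string in CrawlResult.extracted_content
    return json.dumps(extracted)
```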
### Code Glimpse (`extraction_strategy.py`)
Inside the `crawl4ai` library, the file `extraction_strategy.py` defines the blueprint and the implementations.
**The Blueprint (Abstract Base Class):**
```python
# Simplified from crawl4ai/extraction_strategy.py
from abc import ABC, abstractmethod
from typing import List, Dict, Any

class ExtractionStrategy(ABC):
    """Abstract base class for all extraction strategies."""

    def __init__(self, input_format: str = "markdown", **kwargs):
        self.input_format = input_format  # e.g., 'html', 'markdown'
        # ... other common init ...

    @abstractmethod
    def extract(self, url: str, content_chunk: str, *q, **kwargs) -> List[Dict[str, Any]]:
        """Extract structured data from a single chunk of content."""
        pass

    def run(self, url: str, sections: List[str], *q, **kwargs) -> List[Dict[str, Any]]:
        """Process content sections (potentially chunked) and call extract."""
        # Default implementation might process sections in parallel or sequentially
        all_extracted_data = []
        for section in sections:
            all_extracted_data.extend(self.extract(url, section, **kwargs))
        return all_extracted_data
```
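Because the blueprint is this small, you can write your own analyst. The class below is a hypothetical example (not part of the library) that pulls email addresses out of each content chunk with a regular expression:

```python
# Hypothetical custom strategy -- shows how the interface could be implemented,
# not code from the crawl4ai library itself.
import re
from typing import List, Dict, Any

from crawl4ai.extraction_strategy import ExtractionStrategy

class EmailExtractionStrategy(ExtractionStrategy):
    """Extracts email addresses from each content chunk."""

    EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

    def __init__(self, **kwargs):
        # Plain text is enough, so ask for the Markdown representation
        super().__init__(input_format="markdown", **kwargs)

    def extract(self, url: str, content_chunk: str, *q, **kwargs) -> List[Dict[str, Any]]:
        # One dictionary per unique email address found in this chunk
        return [{"email": email} for email in sorted(set(self.EMAIL_RE.findall(content_chunk)))]
```

An instance of this class could then be set as `extraction_strategy` in `CrawlerRunConfig`, just like the built-in strategies.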
**Example Implementation (`JsonCssExtractionStrategy`):**
```python
# Simplified from crawl4ai/extraction_strategy.py
from bs4 import BeautifulSoup  # Uses BeautifulSoup for CSS selectors

class JsonCssExtractionStrategy(ExtractionStrategy):
    def __init__(self, schema: Dict[str, Any], **kwargs):
        # Force input format to HTML for CSS selectors
        super().__init__(input_format="html", **kwargs)
        self.schema = schema  # Store the user-defined schema

    def extract(self, url: str, html_content: str, *q, **kwargs) -> List[Dict[str, Any]]:
        # Parse the HTML content chunk
        soup = BeautifulSoup(html_content, "html.parser")
        extracted_items = []

        # Find base elements defined in the schema
        base_elements = soup.select(self.schema.get("baseSelector", "body"))
        for element in base_elements:
            item = {}
            # Extract fields based on schema selectors and types
            fields_to_extract = self.schema.get("fields", [])
            for field_def in fields_to_extract:
                try:
                    # Find the specific sub-element using CSS selector
                    target_element = element.select_one(field_def["selector"])
                    if target_element:
                        if field_def["type"] == "text":
                            item[field_def["name"]] = target_element.get_text(strip=True)
                        elif field_def["type"] == "attribute":
                            item[field_def["name"]] = target_element.get(field_def["attribute"])
                        # ... other types like 'html', 'list', 'nested' ...
                except Exception as e:
                    # Handle errors, maybe log them if verbose
                    pass
            if item:
                extracted_items.append(item)
        return extracted_items

    # run() method likely uses the default implementation from base class
```
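Because the CSS strategy only needs HTML text, you can try a schema out on its own before running a full crawl. A small sketch, assuming the `run(url, sections)` signature shown in the simplified code above:

```python
# Quick schema check without launching a crawl.
from crawl4ai import JsonCssExtractionStrategy

sample_html = """
<html><head><title>Demo Page</title></head>
<body><h1>Hello</h1></body></html>
"""

schema = {
    "baseSelector": "body",
    "fields": [{"name": "main_heading", "selector": "h1", "type": "text"}]
}

strategy = JsonCssExtractionStrategy(schema=schema)
print(strategy.run("https://example.com", [sample_html]))
# Expected shape: [{"main_heading": "Hello"}]
```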
**Example Implementation (`LLMExtractionStrategy`):**
```python
# Simplified from crawl4ai/extraction_strategy.py
# Needs imports for LLM interaction (e.g., perform_completion_with_backoff)
from .utils import perform_completion_with_backoff, chunk_documents, escape_json_string
from .prompts import PROMPT_EXTRACT_SCHEMA_WITH_INSTRUCTION  # Example prompt

class LLMExtractionStrategy(ExtractionStrategy):
    def __init__(self, schema: Dict = None, instruction: str = None, llmConfig=None, input_format="markdown", **kwargs):
        super().__init__(input_format=input_format, **kwargs)
        self.schema = schema
        self.instruction = instruction
        self.llmConfig = llmConfig  # Contains provider, API key, etc.
        # ... other LLM specific setup ...

    def extract(self, url: str, content_chunk: str, *q, **kwargs) -> List[Dict[str, Any]]:
        # Prepare the prompt for the LLM
        prompt = self._build_llm_prompt(url, content_chunk)

        # Call the LLM API
        response = perform_completion_with_backoff(
            provider=self.llmConfig.provider,
            prompt_with_variables=prompt,
            api_token=self.llmConfig.api_token,
            base_url=self.llmConfig.base_url,
            json_response=True  # Often expect JSON from LLM for extraction
            # ... pass other necessary args ...
        )

        # Parse the LLM's response (which should ideally be JSON)
        try:
            extracted_data = json.loads(response.choices[0].message.content)
            # Ensure it's a list
            if isinstance(extracted_data, dict):
                extracted_data = [extracted_data]
            return extracted_data
        except Exception as e:
            # Handle LLM response parsing errors
            print(f"Error parsing LLM response: {e}")
            return [{"error": "Failed to parse LLM output", "raw_output": response.choices[0].message.content}]

    def _build_llm_prompt(self, url: str, content_chunk: str) -> str:
        # Logic to construct the prompt using self.schema or self.instruction
        # and the content_chunk. Example:
        prompt_template = PROMPT_EXTRACT_SCHEMA_WITH_INSTRUCTION  # Choose appropriate prompt
        variable_values = {
            "URL": url,
            "CONTENT": escape_json_string(content_chunk),  # Send Markdown or HTML chunk
            "SCHEMA": json.dumps(self.schema) if self.schema else "{}",
            "REQUEST": self.instruction if self.instruction else "Extract relevant data based on the schema."
        }
        prompt = prompt_template
        for var, val in variable_values.items():
            prompt = prompt.replace("{" + var + "}", str(val))
        return prompt

    # run() method might override the base to handle chunking specifically for LLMs
    def run(self, url: str, sections: List[str], *q, **kwargs) -> List[Dict[str, Any]]:
        # Potentially chunk sections based on token limits before calling extract
        # chunked_content = chunk_documents(sections, ...)
        # extracted_data = []
        # for chunk in chunked_content:
        #     extracted_data.extend(self.extract(url, chunk, **kwargs))
        # return extracted_data
        # Simplified for now:
        return super().run(url, sections, *q, **kwargs)
```
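The chunking hinted at in the commented-out `run()` override matters because LLM context windows are limited. The helper below is a naive, library-agnostic sketch of that idea; the real `chunk_documents` utility may use token counts, overlap, or other rules.

```python
# Naive chunking sketch: split long text into pieces of roughly max_words words.
# Illustrates why run() might pre-split sections before calling extract();
# not the actual crawl4ai chunking implementation.
def simple_chunks(text: str, max_words: int = 800) -> list[str]:
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]
```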
## Conclusion
You've learned about `ExtractionStrategy`, Crawl4AI's way of giving instructions to an "Analyst" to pull out specific, structured data from web content.
* It solves the problem of needing precise data points (like product names, prices) in an organized format, not just blocks of text.
* You can choose your "Analyst":
  * **Precise Locators (`JsonCssExtractionStrategy`, `JsonXPathExtractionStrategy`):** Use exact CSS/XPath selectors defined in a schema. Fast but brittle.
  * **Smart Interpreter (`LLMExtractionStrategy`):** Uses an AI (LLM) guided by a schema or instructions. More flexible but slower and needs setup.
* You configure the desired strategy within the [CrawlerRunConfig](03_crawlerrunconfig.md).
* The extracted structured data is returned as a JSON string in the `CrawlResult.extracted_content` field.
Now that we understand how to fetch, clean, filter, and extract data, let's put it all together and look at the final package that Crawl4AI delivers after a crawl.
**Next:** Let's dive into the details of the output with [Chapter 7: Understanding the Results - CrawlResult](07_crawlresult.md).
---
Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge)