init push

This commit is contained in:
zachary62
2025-04-04 13:03:54 -04:00
parent e62ee2cb13
commit 2ebad5e5f2
160 changed files with 2 additions and 0 deletions


@@ -0,0 +1,242 @@
# Chapter 1: How We Fetch Webpages - AsyncCrawlerStrategy
Welcome to the Crawl4AI tutorial series! Our goal is to build intelligent agents that can understand and extract information from the web. The very first step in this process is actually *getting* the content from a webpage. This chapter explains how Crawl4AI handles that fundamental task.
Imagine you need to pick up a package from a specific address. How do you get there and retrieve it?
* You could send a **simple, fast drone** that just grabs the package off the porch (if it's easily accessible). This is quick but might fail if the package is inside or requires a signature.
* Or, you could send a **full delivery truck with a driver**. The driver can ring the bell, wait, sign for the package, and even handle complex instructions. This is more versatile but takes more time and resources.
In Crawl4AI, the `AsyncCrawlerStrategy` is like choosing your delivery vehicle. It defines *how* the crawler fetches the raw content (like the HTML, CSS, and maybe JavaScript results) of a webpage.
## What Exactly is AsyncCrawlerStrategy?
`AsyncCrawlerStrategy` is a core concept in Crawl4AI that represents the **method** or **technique** used to download the content of a given URL. Think of it as a blueprint: it specifies *that* we need a way to fetch content, but the specific *details* of how it's done can vary.
This "blueprint" approach is powerful because it allows us to swap out the fetching mechanism depending on our needs, without changing the rest of our crawling logic.
## The Default: AsyncPlaywrightCrawlerStrategy (The Delivery Truck)
By default, Crawl4AI uses `AsyncPlaywrightCrawlerStrategy`. This strategy uses a real, automated web browser engine (like Chrome, Firefox, or WebKit) behind the scenes.
**Why use a full browser?**
* **Handles JavaScript:** Modern websites rely heavily on JavaScript to load content, change the layout, or fetch data after the initial page load. `AsyncPlaywrightCrawlerStrategy` runs this JavaScript, just like your normal browser does.
* **Simulates User Interaction:** It can wait for elements to appear, handle dynamic content, and see the page *after* scripts have run.
* **Gets the "Final" View:** It fetches the content as a user would see it in their browser.
This is our "delivery truck": powerful and capable of handling complex websites. However, like a real truck, it's slower and uses more memory and CPU compared to simpler methods.
You generally don't need to *do* anything to use it, as it's the default! When you start Crawl4AI, it picks this strategy automatically.
## Another Option: AsyncHTTPCrawlerStrategy (The Delivery Drone)
Crawl4AI also offers `AsyncHTTPCrawlerStrategy`. This strategy is much simpler. It directly requests the URL and downloads the *initial* HTML source code that the web server sends back.
**Why use this simpler strategy?**
* **Speed:** It's significantly faster because it doesn't need to start a browser, render the page, or execute JavaScript.
* **Efficiency:** It uses much less memory and CPU.
This is our "delivery drone": super fast and efficient for simple tasks.
**What's the catch?**
* **No JavaScript:** It won't run any JavaScript on the page. If content is loaded dynamically by scripts, this strategy will likely miss it.
* **Basic HTML Only:** You get the raw HTML source, not necessarily what a user *sees* after the browser processes everything.
This strategy is great for websites with simple, static HTML content or when you only need the basic structure and metadata very quickly.
## Why Have Different Strategies? (The Power of Abstraction)
Having `AsyncCrawlerStrategy` as a distinct concept offers several advantages:
1. **Flexibility:** You can choose the best tool for the job. Need to crawl complex, dynamic sites? Use the default `AsyncPlaywrightCrawlerStrategy`. Need to quickly fetch basic HTML from thousands of simple pages? Switch to `AsyncHTTPCrawlerStrategy`.
2. **Maintainability:** The logic for *fetching* content is kept separate from the logic for *processing* it.
3. **Extensibility:** Advanced users could even create their *own* custom strategies for specialized fetching needs (though that's beyond this beginner tutorial).
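For the curious, a custom strategy is just a class that implements the same `crawl` interface shown later in this chapter. The sketch below is purely illustrative: it assumes `AsyncCrawlResponse` can be constructed from an `html` string, a `status_code`, and `response_headers`, and the real base class may require a few more methods than the simplified blueprint we'll see below.
```python
# custom_strategy_sketch.py (illustrative only, not part of the library)
import urllib.request

from crawl4ai.async_crawler_strategy import AsyncCrawlerStrategy
from crawl4ai.models import AsyncCrawlResponse


class StdlibHTTPCrawlerStrategy(AsyncCrawlerStrategy):
    """A hypothetical strategy that fetches pages with Python's standard library."""

    async def crawl(self, url: str, **kwargs) -> AsyncCrawlResponse:
        # Blocking call kept short for illustration; a real strategy would use
        # an async HTTP client so it doesn't stall the event loop.
        with urllib.request.urlopen(url) as resp:
            html = resp.read().decode("utf-8", errors="replace")
            status = resp.status
            headers = dict(resp.headers)
        # The field names below are assumptions based on the simplified code
        # later in this chapter; check crawl4ai's models module for the real ones.
        return AsyncCrawlResponse(
            html=html,
            status_code=status,
            response_headers=headers,
        )
```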
## How It Works Conceptually
When you ask Crawl4AI to crawl a URL, the main `AsyncWebCrawler` doesn't fetch the content itself. Instead, it delegates the task to the currently selected `AsyncCrawlerStrategy`.
Here's a simplified flow:
```mermaid
sequenceDiagram
participant C as AsyncWebCrawler
participant S as AsyncCrawlerStrategy
participant W as Website
C->>S: Please crawl("https://example.com")
Note over S: I'm using my method (e.g., Browser or HTTP)
S->>W: Request Page Content
W-->>S: Return Raw Content (HTML, etc.)
S-->>C: Here's the result (AsyncCrawlResponse)
```
The `AsyncWebCrawler` only needs to know how to talk to *any* strategy through a common interface (the `crawl` method). The strategy handles the specific details of the fetching process.
## Using the Default Strategy (You're Already Doing It!)
Let's see how you use the default `AsyncPlaywrightCrawlerStrategy` without even needing to specify it.
```python
# main_example.py
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
async def main():
# When you create AsyncWebCrawler without specifying a strategy,
# it automatically uses AsyncPlaywrightCrawlerStrategy!
async with AsyncWebCrawler() as crawler:
print("Crawler is ready using the default strategy (Playwright).")
# Let's crawl a simple page that just returns HTML
# We use CacheMode.BYPASS to ensure we fetch it fresh each time for this demo.
config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
result = await crawler.arun(
url="https://httpbin.org/html",
config=config
)
if result.success:
print("\nSuccessfully fetched content!")
# The strategy fetched the raw HTML.
# AsyncWebCrawler then processes it (more on that later).
print(f"First 100 chars of fetched HTML: {result.html[:100]}...")
else:
print(f"\nFailed to fetch content: {result.error_message}")
if __name__ == "__main__":
asyncio.run(main())
```
**Explanation:**
1. We import `AsyncWebCrawler` and supporting classes.
2. We create an instance of `AsyncWebCrawler()` inside an `async with` block (this handles setup and cleanup). Since we didn't tell it *which* strategy to use, it defaults to `AsyncPlaywrightCrawlerStrategy`.
3. We call `crawler.arun()` to crawl the URL. Under the hood, the `AsyncPlaywrightCrawlerStrategy` starts a browser, navigates to the page, gets the content, and returns it.
4. We print the first part of the fetched HTML from the `result`.
## Explicitly Choosing the HTTP Strategy
What if you know the page is simple and want the speed of the "delivery drone"? You can explicitly tell `AsyncWebCrawler` to use `AsyncHTTPCrawlerStrategy`.
```python
# http_strategy_example.py
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
# Import the specific strategies we want to use
from crawl4ai.async_crawler_strategy import AsyncHTTPCrawlerStrategy
async def main():
# 1. Create an instance of the strategy you want
http_strategy = AsyncHTTPCrawlerStrategy()
# 2. Pass the strategy instance when creating the AsyncWebCrawler
async with AsyncWebCrawler(crawler_strategy=http_strategy) as crawler:
print("Crawler is ready using the explicit HTTP strategy.")
# Crawl the same simple page
config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
result = await crawler.arun(
url="https://httpbin.org/html",
config=config
)
if result.success:
print("\nSuccessfully fetched content using HTTP strategy!")
print(f"First 100 chars of fetched HTML: {result.html[:100]}...")
else:
print(f"\nFailed to fetch content: {result.error_message}")
if __name__ == "__main__":
asyncio.run(main())
```
**Explanation:**
1. We now also import `AsyncHTTPCrawlerStrategy`.
2. We create an instance: `http_strategy = AsyncHTTPCrawlerStrategy()`.
3. We pass this instance to the `AsyncWebCrawler` constructor: `AsyncWebCrawler(crawler_strategy=http_strategy)`.
4. The rest of the code is the same, but now `crawler.arun()` will use the faster, simpler HTTP GET request method defined by `AsyncHTTPCrawlerStrategy`.
For a simple page like `httpbin.org/html`, both strategies will likely return the same HTML content, but the HTTP strategy would generally be faster and use fewer resources. On a complex JavaScript-heavy site, the HTTP strategy might fail to get the full content, while the Playwright strategy would handle it correctly.
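If you'd like to see that difference for yourself, a rough timing comparison is easy to put together. The sketch below reuses only classes introduced in this chapter; the exact numbers will vary with your machine and network, and on a tiny page the gap may be small.
```python
# strategy_timing_sketch.py (a rough, illustrative comparison)
import asyncio
import time

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
from crawl4ai.async_crawler_strategy import AsyncHTTPCrawlerStrategy


async def timed_crawl(strategy, label: str, url: str) -> None:
    # Passing strategy=None falls back to the default Playwright strategy.
    config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)  # always fetch fresh
    async with AsyncWebCrawler(crawler_strategy=strategy) as crawler:
        start = time.perf_counter()
        result = await crawler.arun(url=url, config=config)
        elapsed = time.perf_counter() - start
        html_length = len(result.html) if result.success else 0
        print(f"{label:>10}: success={result.success}, "
              f"html_length={html_length}, {elapsed:.2f}s")


async def main():
    url = "https://httpbin.org/html"
    await timed_crawl(None, "Playwright", url)                  # the delivery truck
    await timed_crawl(AsyncHTTPCrawlerStrategy(), "HTTP", url)  # the delivery drone


if __name__ == "__main__":
    asyncio.run(main())
```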
## A Glimpse Under the Hood
You don't *need* to know the deep internals to use the strategies, but it helps to understand the structure. Inside the `crawl4ai` library, you'd find a file like `async_crawler_strategy.py`.
It defines the "blueprint" (an Abstract Base Class):
```python
# Simplified from async_crawler_strategy.py
from abc import ABC, abstractmethod
from .models import AsyncCrawlResponse # Defines the structure of the result
class AsyncCrawlerStrategy(ABC):
"""
Abstract base class for crawler strategies.
"""
@abstractmethod
async def crawl(self, url: str, **kwargs) -> AsyncCrawlResponse:
"""Fetch content from the URL."""
pass # Each specific strategy must implement this
```
And then the specific implementations:
```python
# Simplified from async_crawler_strategy.py
from playwright.async_api import Page # Playwright library for browser automation
# ... other imports
class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
# ... (Initialization code to manage browsers)
async def crawl(self, url: str, config: CrawlerRunConfig, **kwargs) -> AsyncCrawlResponse:
# Uses Playwright to:
# 1. Get a browser page
# 2. Navigate to the url (page.goto(url))
# 3. Wait for content, run JS, etc.
# 4. Get the final HTML (page.content())
# 5. Optionally take screenshots, etc.
# 6. Return an AsyncCrawlResponse
# ... implementation details ...
pass
```
```python
# Simplified from async_crawler_strategy.py
import aiohttp # Library for making HTTP requests asynchronously
# ... other imports
class AsyncHTTPCrawlerStrategy(AsyncCrawlerStrategy):
# ... (Initialization code to manage HTTP sessions)
async def crawl(self, url: str, config: CrawlerRunConfig, **kwargs) -> AsyncCrawlResponse:
# Uses aiohttp to:
# 1. Make an HTTP GET (or other method) request to the url
# 2. Read the response body (HTML)
# 3. Get response headers and status code
# 4. Return an AsyncCrawlResponse
# ... implementation details ...
pass
```
The key takeaway is that both strategies implement the same `crawl` method, allowing `AsyncWebCrawler` to use them interchangeably.
## Conclusion
You've learned about `AsyncCrawlerStrategy`, the core concept defining *how* Crawl4AI fetches webpage content.
* It's like choosing a vehicle: a powerful browser (`AsyncPlaywrightCrawlerStrategy`, the default) or a fast, simple HTTP request (`AsyncHTTPCrawlerStrategy`).
* This abstraction gives you flexibility to choose the right fetching method for your task.
* You usually don't need to worry about it, as the default handles most modern websites well.
Now that we understand how the raw content is fetched, the next step is to look at the main class that orchestrates the entire crawling process.
**Next:** Let's dive into the [AsyncWebCrawler](02_asyncwebcrawler.md) itself!
---
Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge)


@@ -0,0 +1,339 @@
# Chapter 2: Meet the General Manager - AsyncWebCrawler
In [Chapter 1: How We Fetch Webpages - AsyncCrawlerStrategy](01_asynccrawlerstrategy.md), we learned about the different ways Crawl4AI can fetch the raw content of a webpage, like choosing between a fast drone (`AsyncHTTPCrawlerStrategy`) or a versatile delivery truck (`AsyncPlaywrightCrawlerStrategy`).
But who decides *which* delivery vehicle to use? Who tells it *which* address (URL) to go to? And who takes the delivered package (the raw HTML) and turns it into something useful?
That's where the `AsyncWebCrawler` comes in. Think of it as the **General Manager** of the entire crawling operation.
## What Problem Does `AsyncWebCrawler` Solve?
Imagine you want to get information from a website. You need to:
1. Decide *how* to fetch the page (like choosing the drone or truck from Chapter 1).
2. Actually *fetch* the page content.
3. Maybe *clean up* the messy HTML.
4. Perhaps *extract* specific pieces of information (like product prices or article titles).
5. Maybe *save* the results so you don't have to fetch them again immediately (caching).
6. Finally, give you the *final, processed result*.
Doing all these steps manually for every URL would be tedious and complex. `AsyncWebCrawler` acts as the central coordinator, managing all these steps for you. You just tell it what URL to crawl and maybe some preferences, and it handles the rest.
## What is `AsyncWebCrawler`?
`AsyncWebCrawler` is the main class you'll interact with when using Crawl4AI. It's the primary entry point for starting any crawling task.
**Key Responsibilities:**
* **Initialization:** Sets up the necessary components, like the browser (if needed).
* **Coordination:** Takes your request (a URL and configuration) and orchestrates the different parts:
* Delegates fetching to an [AsyncCrawlerStrategy](01_asynccrawlerstrategy.md).
* Manages caching using [CacheContext / CacheMode](09_cachecontext___cachemode.md).
* Uses a [ContentScrapingStrategy](04_contentscrapingstrategy.md) to clean and parse HTML.
* Applies a [RelevantContentFilter](05_relevantcontentfilter.md) if configured.
* Uses an [ExtractionStrategy](06_extractionstrategy.md) to pull out specific data if needed.
* **Result Packaging:** Bundles everything up into a neat [CrawlResult](07_crawlresult.md) object.
* **Resource Management:** Handles starting and stopping resources (like browsers) cleanly.
It's the "conductor" making sure all the different instruments play together harmoniously.
## Your First Crawl: Using `arun`
Let's see the `AsyncWebCrawler` in action. The most common way to use it is with an `async with` block, which automatically handles setup and cleanup. The main method to crawl a single URL is `arun`.
```python
# chapter2_example_1.py
import asyncio
from crawl4ai import AsyncWebCrawler # Import the General Manager
async def main():
# Create the General Manager instance using 'async with'
# This handles setup (like starting a browser if needed)
# and cleanup (closing the browser).
async with AsyncWebCrawler() as crawler:
print("Crawler is ready!")
# Tell the manager to crawl a specific URL
url_to_crawl = "https://httpbin.org/html" # A simple example page
print(f"Asking the crawler to fetch: {url_to_crawl}")
result = await crawler.arun(url=url_to_crawl)
# Check if the crawl was successful
if result.success:
print("\nSuccess! Crawler got the content.")
# The result object contains the processed data
# We'll learn more about CrawlResult in Chapter 7
print(f"Page Title: {result.metadata.get('title', 'N/A')}")
print(f"First 100 chars of Markdown: {result.markdown.raw_markdown[:100]}...")
else:
print(f"\nFailed to crawl: {result.error_message}")
if __name__ == "__main__":
asyncio.run(main())
```
**Explanation:**
1. **`import AsyncWebCrawler`**: We import the main class.
2. **`async def main():`**: Crawl4AI uses Python's `asyncio` for efficiency, so our code needs to be in an `async` function.
3. **`async with AsyncWebCrawler() as crawler:`**: This is the standard way to create and manage the crawler. The `async with` statement ensures that resources (like the underlying browser used by the default `AsyncPlaywrightCrawlerStrategy`) are properly started and stopped, even if errors occur.
4. **`crawler.arun(url=url_to_crawl)`**: This is the core command. We tell our `crawler` instance (the General Manager) to run (`arun`) the crawling process for the specified `url`. `await` is used because fetching webpages takes time, and `asyncio` allows other tasks to run while waiting.
5. **`result`**: The `arun` method returns a `CrawlResult` object. This object contains all the information gathered during the crawl (HTML, cleaned text, metadata, etc.). We'll explore this object in detail in [Chapter 7: Understanding the Results - CrawlResult](07_crawlresult.md).
6. **`result.success`**: We check this boolean flag to see if the crawl completed without critical errors.
7. **Accessing Data:** If successful, we can access processed information like the page title (`result.metadata['title']`) or the content formatted as Markdown (`result.markdown.raw_markdown`).
## Configuring the Crawl
Sometimes, the default behavior isn't quite what you need. Maybe you want to use the faster "drone" strategy from Chapter 1, or perhaps you want to ensure you *always* fetch a fresh copy of the page, ignoring any saved cache.
You can customize the behavior of a specific `arun` call by passing a `CrawlerRunConfig` object. Think of this as giving specific instructions to the General Manager for *this particular job*.
```python
# chapter2_example_2.py
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai import CrawlerRunConfig # Import configuration class
from crawl4ai import CacheMode # Import cache options
async def main():
async with AsyncWebCrawler() as crawler:
print("Crawler is ready!")
url_to_crawl = "https://httpbin.org/html"
# Create a specific configuration for this run
# Tell the crawler to BYPASS the cache (fetch fresh)
run_config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS
)
print("Configuration: Bypass cache for this run.")
# Pass the config object to the arun method
result = await crawler.arun(
url=url_to_crawl,
config=run_config # Pass the specific instructions
)
if result.success:
print("\nSuccess! Crawler got fresh content (cache bypassed).")
print(f"Page Title: {result.metadata.get('title', 'N/A')}")
else:
print(f"\nFailed to crawl: {result.error_message}")
if __name__ == "__main__":
asyncio.run(main())
```
**Explanation:**
1. **`from crawl4ai import CrawlerRunConfig, CacheMode`**: We import the necessary classes for configuration.
2. **`run_config = CrawlerRunConfig(...)`**: We create an instance of `CrawlerRunConfig`. This object holds various settings for a specific crawl job.
3. **`cache_mode=CacheMode.BYPASS`**: We set the `cache_mode`. `CacheMode.BYPASS` tells the crawler to ignore any previously saved results for this URL and fetch it directly from the web server. We'll learn all about caching options in [Chapter 9: Smart Fetching with Caching - CacheContext / CacheMode](09_cachecontext___cachemode.md).
4. **`crawler.arun(..., config=run_config)`**: We pass our custom `run_config` object to the `arun` method using the `config` parameter.
The `CrawlerRunConfig` is very powerful and lets you control many aspects of the crawl, including which scraping or extraction methods to use. We'll dive deep into it in the next chapter: [Chapter 3: Giving Instructions - CrawlerRunConfig](03_crawlerrunconfig.md).
## What Happens When You Call `arun`? (The Flow)
When you call `crawler.arun(url="...")`, the `AsyncWebCrawler` (our General Manager) springs into action and coordinates several steps behind the scenes:
```mermaid
sequenceDiagram
participant U as User
participant AWC as AsyncWebCrawler (Manager)
participant CC as Cache Check
participant CS as AsyncCrawlerStrategy (Fetcher)
participant SP as Scraping/Processing
participant CR as CrawlResult (Final Report)
U->>AWC: arun("https://example.com", config)
AWC->>CC: Need content for "https://example.com"? (Respect CacheMode in config)
alt Cache Hit & Cache Mode allows reading
CC-->>AWC: Yes, here's the cached result.
AWC-->>CR: Package cached result.
AWC-->>U: Here is the CrawlResult
else Cache Miss or Cache Mode prevents reading
CC-->>AWC: No cached result / Cannot read cache.
AWC->>CS: Please fetch "https://example.com" (using configured strategy)
CS-->>AWC: Here's the raw response (HTML, etc.)
AWC->>SP: Process this raw content (Scrape, Filter, Extract based on config)
SP-->>AWC: Here's the processed data (Markdown, Metadata, etc.)
AWC->>CC: Cache this result? (Respect CacheMode in config)
CC-->>AWC: OK, cached.
AWC-->>CR: Package new result.
AWC-->>U: Here is the CrawlResult
end
```
**Simplified Steps:**
1. **Receive Request:** The `AsyncWebCrawler` gets the URL and configuration from your `arun` call.
2. **Check Cache:** It checks if a valid result for this URL is already saved (cached) and if the `CacheMode` allows using it. (See [Chapter 9](09_cachecontext___cachemode.md)).
3. **Fetch (if needed):** If no valid cached result exists or caching is bypassed, it asks the configured [AsyncCrawlerStrategy](01_asynccrawlerstrategy.md) (e.g., Playwright or HTTP) to fetch the raw page content.
4. **Process Content:** It takes the raw HTML and passes it through various processing steps based on the configuration:
* **Scraping:** Cleaning up HTML, extracting basic structure using a [ContentScrapingStrategy](04_contentscrapingstrategy.md).
* **Filtering:** Optionally filtering content for relevance using a [RelevantContentFilter](05_relevantcontentfilter.md).
* **Extraction:** Optionally extracting specific structured data using an [ExtractionStrategy](06_extractionstrategy.md).
5. **Cache Result (if needed):** If caching is enabled for writing, it saves the final processed result.
6. **Return Result:** It bundles everything into a [CrawlResult](07_crawlresult.md) object and returns it to you.
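An easy way to observe steps 2 and 5 is to crawl the same URL twice with the default cache behavior: the second call is typically much faster because the result comes straight from the cache. A minimal sketch (timings are illustrative):
```python
# cache_demo_sketch.py (relies on the default CacheMode.ENABLED behavior)
import asyncio
import time

from crawl4ai import AsyncWebCrawler


async def main():
    url = "https://httpbin.org/html"
    async with AsyncWebCrawler() as crawler:
        for attempt in (1, 2):
            start = time.perf_counter()
            # No config passed, so caching defaults to CacheMode.ENABLED:
            # the first run fetches and caches, the second reads the cache.
            result = await crawler.arun(url=url)
            elapsed = time.perf_counter() - start
            print(f"Attempt {attempt}: success={result.success}, took {elapsed:.2f}s")


if __name__ == "__main__":
    asyncio.run(main())
```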
## Crawling Many Pages: `arun_many`
What if you have a whole list of URLs to crawl? Calling `arun` in a loop works, but it might not be the most efficient way. `AsyncWebCrawler` provides the `arun_many` method designed for this.
```python
# chapter2_example_3.py
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
async def main():
async with AsyncWebCrawler() as crawler:
urls_to_crawl = [
"https://httpbin.org/html",
"https://httpbin.org/links/10/0",
"https://httpbin.org/robots.txt"
]
print(f"Asking crawler to fetch {len(urls_to_crawl)} URLs.")
# Use arun_many for multiple URLs
# We can still pass a config that applies to all URLs in the batch
config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
results = await crawler.arun_many(urls=urls_to_crawl, config=config)
print(f"\nFinished crawling! Got {len(results)} results.")
for result in results:
status = "Success" if result.success else "Failed"
url_short = result.url.split('/')[-1] # Get last part of URL
print(f"- URL: {url_short:<10} | Status: {status:<7} | Title: {result.metadata.get('title', 'N/A')}")
if __name__ == "__main__":
asyncio.run(main())
```
**Explanation:**
1. **`urls_to_crawl = [...]`**: We define a list of URLs.
2. **`await crawler.arun_many(urls=urls_to_crawl, config=config)`**: We call `arun_many`, passing the list of URLs. It handles crawling them concurrently (like dispatching multiple delivery trucks or drones efficiently).
3. **`results`**: `arun_many` returns a list where each item is a `CrawlResult` object corresponding to one of the input URLs.
`arun_many` is much more efficient for batch processing as it leverages `asyncio` to handle multiple fetches and processing tasks concurrently. It uses a [BaseDispatcher](10_basedispatcher.md) internally to manage this concurrency.
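For contrast, here is what the "call `arun` in a loop" approach mentioned above looks like. It works, but each URL waits for the previous one to finish, so the total time is roughly the sum of the individual crawls; `arun_many` overlaps that work for you. (A simple sketch using only the methods shown in this chapter.)
```python
# sequential_loop_sketch.py (the slower, one-at-a-time alternative to arun_many)
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode


async def main():
    urls = [
        "https://httpbin.org/html",
        "https://httpbin.org/links/10/0",
        "https://httpbin.org/robots.txt",
    ]
    config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
    results = []
    async with AsyncWebCrawler() as crawler:
        for url in urls:
            # Each await completes fully before the next URL starts.
            results.append(await crawler.arun(url=url, config=config))
    print(f"Crawled {len(results)} URLs sequentially.")


if __name__ == "__main__":
    asyncio.run(main())
```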
## Under the Hood (A Peek at the Code)
You don't need to know the internal details to use `AsyncWebCrawler`, but seeing the structure can help. Inside the `crawl4ai` library, the file `async_webcrawler.py` defines this class.
```python
# Simplified from async_webcrawler.py
# ... imports ...
from .async_crawler_strategy import AsyncCrawlerStrategy, AsyncPlaywrightCrawlerStrategy
from .async_configs import BrowserConfig, CrawlerRunConfig
from .models import CrawlResult
from .cache_context import CacheContext, CacheMode
# ... other strategy imports ...
class AsyncWebCrawler:
def __init__(
self,
crawler_strategy: AsyncCrawlerStrategy = None, # You can provide a strategy...
config: BrowserConfig = None, # Configuration for the browser
# ... other parameters like logger, base_directory ...
):
# If no strategy is given, it defaults to Playwright (the 'truck')
self.crawler_strategy = crawler_strategy or AsyncPlaywrightCrawlerStrategy(...)
self.browser_config = config or BrowserConfig()
# ... setup logger, directories, etc. ...
self.ready = False # Flag to track if setup is complete
async def __aenter__(self):
# This is called when you use 'async with'. It starts the strategy.
await self.crawler_strategy.__aenter__()
await self.awarmup() # Perform internal setup
self.ready = True
return self
async def __aexit__(self, exc_type, exc_val, exc_tb):
# This is called when exiting 'async with'. It cleans up.
await self.crawler_strategy.__aexit__(exc_type, exc_val, exc_tb)
self.ready = False
async def arun(self, url: str, config: CrawlerRunConfig = None) -> CrawlResult:
# 1. Ensure config exists, set defaults (like CacheMode.ENABLED)
crawler_config = config or CrawlerRunConfig()
if crawler_config.cache_mode is None:
crawler_config.cache_mode = CacheMode.ENABLED
# 2. Create CacheContext to manage caching logic
cache_context = CacheContext(url, crawler_config.cache_mode)
# 3. Try reading from cache if allowed
cached_result = None
if cache_context.should_read():
cached_result = await async_db_manager.aget_cached_url(url)
# 4. If cache hit and valid, return cached result
if cached_result and self._is_cache_valid(cached_result, crawler_config):
# ... log cache hit ...
return cached_result
# 5. If no cache hit or cache invalid/bypassed: Fetch fresh content
# Delegate to the configured AsyncCrawlerStrategy
async_response = await self.crawler_strategy.crawl(url, config=crawler_config)
# 6. Process the HTML (scrape, filter, extract)
# This involves calling other strategies based on config
crawl_result = await self.aprocess_html(
url=url,
html=async_response.html,
config=crawler_config,
# ... other details from async_response ...
)
# 7. Write to cache if allowed
if cache_context.should_write():
await async_db_manager.acache_url(crawl_result)
# 8. Return the final CrawlResult
return crawl_result
async def aprocess_html(self, url: str, html: str, config: CrawlerRunConfig, ...) -> CrawlResult:
# This internal method handles:
# - Getting the configured ContentScrapingStrategy
# - Calling its 'scrap' method
# - Getting the configured MarkdownGenerationStrategy
# - Calling its 'generate_markdown' method
# - Getting the configured ExtractionStrategy (if any)
# - Calling its 'run' method
# - Packaging everything into a CrawlResult
# ... implementation details ...
pass # Simplified
async def arun_many(self, urls: List[str], config: Optional[CrawlerRunConfig] = None, ...) -> List[CrawlResult]:
# Uses a Dispatcher (like MemoryAdaptiveDispatcher)
# to run self.arun for each URL concurrently.
# ... implementation details using a dispatcher ...
pass # Simplified
# ... other methods like awarmup, close, caching helpers ...
```
The key takeaway is that `AsyncWebCrawler` doesn't do the fetching or detailed processing *itself*. It acts as the central hub, coordinating calls to the various specialized `Strategy` classes based on the provided configuration.
## Conclusion
You've met the General Manager: `AsyncWebCrawler`!
* It's the **main entry point** for using Crawl4AI.
* It **coordinates** all the steps: fetching, caching, scraping, extracting.
* You primarily interact with it using `async with` and the `arun()` (single URL) or `arun_many()` (multiple URLs) methods.
* It takes a URL and an optional `CrawlerRunConfig` object to customize the crawl.
* It returns a comprehensive `CrawlResult` object.
Now that you understand the central role of `AsyncWebCrawler`, let's explore how to give it detailed instructions for each crawling job.
**Next:** Let's dive into the specifics of configuration with [Chapter 3: Giving Instructions - CrawlerRunConfig](03_crawlerrunconfig.md).
---
Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge)


@@ -0,0 +1,277 @@
# Chapter 3: Giving Instructions - CrawlerRunConfig
In [Chapter 2: Meet the General Manager - AsyncWebCrawler](02_asyncwebcrawler.md), we met the `AsyncWebCrawler`, the central coordinator for our web crawling tasks. We saw how to tell it *what* URL to crawl using the `arun` method.
But what if we want to tell the crawler *how* to crawl that URL? Maybe we want it to take a picture (screenshot) of the page? Or perhaps we only care about a specific section of the page? Or maybe we want to ignore the cache and get the very latest version?
Passing all these different instructions individually every time we call `arun` could get complicated and messy.
```python
# Imagine doing this every time - it gets long!
# result = await crawler.arun(
# url="https://example.com",
# take_screenshot=True,
# ignore_cache=True,
# only_look_at_this_part="#main-content",
# wait_for_this_element="#data-table",
# # ... maybe many more settings ...
# )
```
That's where `CrawlerRunConfig` comes in!
## What Problem Does `CrawlerRunConfig` Solve?
Think of `CrawlerRunConfig` as the **Instruction Manual** for a *specific* crawl job. Instead of giving the `AsyncWebCrawler` manager lots of separate instructions each time, you bundle them all neatly into a single `CrawlerRunConfig` object.
This object tells the `AsyncWebCrawler` exactly *how* to handle a particular URL or set of URLs for that specific run. It makes your code cleaner and easier to manage.
## What is `CrawlerRunConfig`?
`CrawlerRunConfig` is a configuration class that holds all the settings for a single crawl operation initiated by `AsyncWebCrawler.arun()` or `arun_many()`.
It allows you to customize various aspects of the crawl, such as:
* **Taking Screenshots:** Should the crawler capture an image of the page? (`screenshot`)
* **Waiting:** How long should the crawler wait for the page or specific elements to load? (`page_timeout`, `wait_for`)
* **Focusing Content:** Should the crawler only process a specific part of the page? (`css_selector`)
* **Extracting Data:** Should the crawler use a specific method to pull out structured data? ([ExtractionStrategy](06_extractionstrategy.md))
* **Caching:** How should the crawler interact with previously saved results? ([CacheMode](09_cachecontext___cachemode.md))
* **And much more!** (like handling JavaScript, filtering links, etc.)
## Using `CrawlerRunConfig`
Let's see how to use it. Remember our basic crawl from Chapter 2?
```python
# chapter3_example_1.py
import asyncio
from crawl4ai import AsyncWebCrawler
async def main():
async with AsyncWebCrawler() as crawler:
url_to_crawl = "https://httpbin.org/html"
print(f"Crawling {url_to_crawl} with default settings...")
# This uses the default behavior (no specific config)
result = await crawler.arun(url=url_to_crawl)
if result.success:
print("Success! Got the content.")
print(f"Screenshot taken? {'Yes' if result.screenshot else 'No'}") # Likely No
# We'll learn about CacheMode later, but it defaults to using the cache
else:
print(f"Failed: {result.error_message}")
if __name__ == "__main__":
asyncio.run(main())
```
Now, let's say for this *specific* crawl, we want to bypass the cache (fetch fresh) and also take a screenshot.
We create a `CrawlerRunConfig` instance and pass it to `arun`:
```python
# chapter3_example_2.py
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai import CrawlerRunConfig # 1. Import the config class
from crawl4ai import CacheMode # Import cache options
async def main():
async with AsyncWebCrawler() as crawler:
url_to_crawl = "https://httpbin.org/html"
print(f"Crawling {url_to_crawl} with custom settings...")
# 2. Create an instance of CrawlerRunConfig with our desired settings
my_instructions = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS, # Don't use the cache, fetch fresh
screenshot=True # Take a screenshot
)
print("Instructions: Bypass cache, take screenshot.")
# 3. Pass the config object to arun()
result = await crawler.arun(
url=url_to_crawl,
config=my_instructions # Pass our instruction manual
)
if result.success:
print("\nSuccess! Got the content with custom config.")
print(f"Screenshot taken? {'Yes' if result.screenshot else 'No'}") # Should be Yes
# Check if the screenshot file path exists in result.screenshot
if result.screenshot:
print(f"Screenshot saved to: {result.screenshot}")
else:
print(f"\nFailed: {result.error_message}")
if __name__ == "__main__":
asyncio.run(main())
```
**Explanation:**
1. **Import:** We import `CrawlerRunConfig` and `CacheMode`.
2. **Create Config:** We create an instance: `my_instructions = CrawlerRunConfig(...)`. We set `cache_mode` to `CacheMode.BYPASS` and `screenshot` to `True`. All other settings remain at their defaults.
3. **Pass Config:** We pass this `my_instructions` object to `crawler.arun` using the `config=` parameter.
Now, when `AsyncWebCrawler` runs this job, it will look inside `my_instructions` and follow those specific settings for *this run only*.
## Some Common `CrawlerRunConfig` Parameters
`CrawlerRunConfig` has many options, but here are a few common ones you might use:
* **`cache_mode`**: Controls caching behavior.
* `CacheMode.ENABLED` (Default): Use the cache if available, otherwise fetch and save.
* `CacheMode.BYPASS`: Always fetch fresh, ignoring any cached version (but still save the new result).
* `CacheMode.DISABLED`: Never read from or write to the cache.
* *(More details in [Chapter 9: Smart Fetching with Caching - CacheContext / CacheMode](09_cachecontext___cachemode.md))*
* **`screenshot` (bool)**: If `True`, takes a screenshot of the fully rendered page. The path to the screenshot file will be in `CrawlResult.screenshot`. Default: `False`.
* **`pdf` (bool)**: If `True`, generates a PDF of the page. The path to the PDF file will be in `CrawlResult.pdf`. Default: `False`.
* **`css_selector` (str)**: If provided (e.g., `"#main-content"` or `.article-body`), the crawler will try to extract *only* the HTML content within the element(s) matching this CSS selector. This is great for focusing on the important part of a page. Default: `None` (process the whole page).
* **`wait_for` (str)**: A CSS selector (e.g., `"#data-loaded-indicator"`). The crawler will wait until an element matching this selector appears on the page before proceeding. Useful for pages that load content dynamically with JavaScript. Default: `None`.
* **`page_timeout` (int)**: Maximum time in milliseconds to wait for page navigation or certain operations. Default: `60000` (60 seconds).
* **`extraction_strategy`**: An object that defines how to extract specific, structured data (like product names and prices) from the page. Default: `None`. *(See [Chapter 6: Getting Specific Data - ExtractionStrategy](06_extractionstrategy.md))*
* **`scraping_strategy`**: An object defining how the raw HTML is cleaned and basic content (like text and links) is extracted. Default: `WebScrapingStrategy()`. *(See [Chapter 4: Cleaning Up the Mess - ContentScrapingStrategy](04_contentscrapingstrategy.md))*
Let's try combining a few: focus on a specific part of the page and set a tighter page timeout.
```python
# chapter3_example_3.py
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
async def main():
# This example site has a heading 'H1' inside a 'body' tag.
url_to_crawl = "https://httpbin.org/html"
async with AsyncWebCrawler() as crawler:
print(f"Crawling {url_to_crawl}, focusing on the H1 tag...")
# Instructions: Only get the H1 tag, wait max 10s for it
specific_config = CrawlerRunConfig(
css_selector="h1", # Only grab content inside <h1> tags
page_timeout=10000 # Set page timeout to 10 seconds
# We could also add wait_for="h1" if needed for dynamic loading
)
result = await crawler.arun(url=url_to_crawl, config=specific_config)
if result.success:
print("\nSuccess! Focused crawl completed.")
# The markdown should now ONLY contain the H1 content
print(f"Markdown content:\n---\n{result.markdown.raw_markdown.strip()}\n---")
else:
print(f"\nFailed: {result.error_message}")
if __name__ == "__main__":
asyncio.run(main())
```
This time, the `result.markdown` should only contain the text from the `<h1>` tag on that page, because we used `css_selector="h1"` in our `CrawlerRunConfig`.
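The other parameters from the list above combine in exactly the same way. For example, if a page loads its data with JavaScript and you also want a PDF snapshot, you might write something like the sketch below. The URL and the `#data-table` selector are placeholders; substitute ones that exist on your target page.
```python
# chapter3_example_4.py (a sketch; URL and selector are placeholders)
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode


async def main():
    async with AsyncWebCrawler() as crawler:
        config = CrawlerRunConfig(
            cache_mode=CacheMode.BYPASS,  # fetch fresh
            wait_for="#data-table",       # wait until this element appears
            pdf=True,                     # also produce a PDF of the rendered page
            page_timeout=30000,           # give up after 30 seconds
        )
        result = await crawler.arun(
            url="https://example.com/some-dynamic-page",  # placeholder URL
            config=config,
        )
        if result.success:
            print(f"PDF captured? {'Yes' if result.pdf else 'No'}")
        else:
            print(f"Failed: {result.error_message}")


if __name__ == "__main__":
    asyncio.run(main())
```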
## How `AsyncWebCrawler` Uses the Config (Under the Hood)
You don't need to know the exact internal code, but it helps to understand the flow. When you call `crawler.arun(url, config=my_config)`, the `AsyncWebCrawler` essentially does this:
1. Receives the `url` and the `my_config` object.
2. Before fetching, it checks `my_config.cache_mode` to see if it should look in the cache first.
3. If fetching is needed, it passes `my_config` to the underlying [AsyncCrawlerStrategy](01_asynccrawlerstrategy.md).
4. The strategy uses settings from `my_config` like `page_timeout`, `wait_for`, and whether to take a `screenshot`.
5. After getting the raw HTML, `AsyncWebCrawler` uses the `my_config.scraping_strategy` and `my_config.css_selector` to process the content.
6. If `my_config.extraction_strategy` is set, it uses that to extract structured data.
7. Finally, it bundles everything into a `CrawlResult` and returns it.
Here's a simplified view:
```mermaid
sequenceDiagram
participant User
participant AWC as AsyncWebCrawler
participant Config as CrawlerRunConfig
participant Fetcher as AsyncCrawlerStrategy
participant Processor as Scraping/Extraction
User->>AWC: arun(url, config=my_config)
AWC->>Config: Check my_config.cache_mode
alt Need to Fetch
AWC->>Fetcher: crawl(url, config=my_config)
Note over Fetcher: Uses my_config settings (timeout, wait_for, screenshot...)
Fetcher-->>AWC: Raw Response (HTML, screenshot?)
AWC->>Processor: Process HTML (using my_config.css_selector, my_config.extraction_strategy...)
Processor-->>AWC: Processed Data
else Use Cache
AWC->>AWC: Retrieve from Cache
end
AWC-->>User: Return CrawlResult
```
The `CrawlerRunConfig` acts as a messenger carrying your specific instructions throughout the crawling process.
Inside the `crawl4ai` library, in the file `async_configs.py`, you'll find the definition of the `CrawlerRunConfig` class. It looks something like this (simplified):
```python
# Simplified from crawl4ai/async_configs.py
from .cache_context import CacheMode
from .extraction_strategy import ExtractionStrategy
from .content_scraping_strategy import ContentScrapingStrategy, WebScrapingStrategy
# ... other imports ...
class CrawlerRunConfig():
"""
Configuration class for controlling how the crawler runs each crawl operation.
"""
def __init__(
self,
# Caching
        cache_mode: CacheMode = None,  # If None, AsyncWebCrawler falls back to CacheMode.ENABLED (see Chapter 2)
# Content Selection / Waiting
css_selector: str = None,
wait_for: str = None,
page_timeout: int = 60000, # 60 seconds
# Media
screenshot: bool = False,
pdf: bool = False,
# Processing Strategies
scraping_strategy: ContentScrapingStrategy = None, # Defaults internally if None
extraction_strategy: ExtractionStrategy = None,
# ... many other parameters omitted for clarity ...
**kwargs # Allows for flexibility
):
self.cache_mode = cache_mode
self.css_selector = css_selector
self.wait_for = wait_for
self.page_timeout = page_timeout
self.screenshot = screenshot
self.pdf = pdf
# Assign scraping strategy, ensuring a default if None is provided
self.scraping_strategy = scraping_strategy or WebScrapingStrategy()
self.extraction_strategy = extraction_strategy
# ... initialize other attributes ...
# Helper methods like 'clone', 'to_dict', 'from_kwargs' might exist too
# ...
```
The key idea is that it's a class designed to hold various settings together. When you create an instance `CrawlerRunConfig(...)`, you're essentially creating an object that stores your choices for these parameters.
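If the `clone` helper hinted at in the snippet above exists in your version, a handy pattern is to keep one base configuration and derive variations from it; if it doesn't, simply construct separate `CrawlerRunConfig` instances with the shared arguments. A sketch, assuming `clone(**overrides)` returns a copy with only the given fields changed:
```python
# config_reuse_sketch.py (assumes CrawlerRunConfig.clone(**overrides) is available)
from crawl4ai import CrawlerRunConfig, CacheMode

# One base "instruction manual" shared by every job
base_config = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS,
    page_timeout=30000,
)

# A variation for jobs where we also want a screenshot.
screenshot_config = base_config.clone(screenshot=True)

print(screenshot_config.screenshot)    # True
print(screenshot_config.page_timeout)  # 30000, inherited from base_config
```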
## Conclusion
You've learned about `CrawlerRunConfig`, the "Instruction Manual" for individual crawl jobs in Crawl4AI!
* It solves the problem of passing many settings individually to `AsyncWebCrawler`.
* You create an instance of `CrawlerRunConfig` and set the parameters you want to customize (like `cache_mode`, `screenshot`, `css_selector`, `wait_for`).
* You pass this config object to `crawler.arun(url, config=your_config)`.
* This makes your code cleaner and gives you fine-grained control over *how* each crawl is performed.
Now that we know how to fetch content ([AsyncCrawlerStrategy](01_asynccrawlerstrategy.md)), manage the overall process ([AsyncWebCrawler](02_asyncwebcrawler.md)), and give specific instructions ([CrawlerRunConfig](03_crawlerrunconfig.md)), let's look at how the raw, messy HTML fetched from the web is initially cleaned up and processed.
**Next:** Let's explore [Chapter 4: Cleaning Up the Mess - ContentScrapingStrategy](04_contentscrapingstrategy.md).
---
Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge)


@@ -0,0 +1,321 @@
# Chapter 4: Cleaning Up the Mess - ContentScrapingStrategy
In [Chapter 3: Giving Instructions - CrawlerRunConfig](03_crawlerrunconfig.md), we learned how to give specific instructions to our `AsyncWebCrawler` using `CrawlerRunConfig`. This included telling it *how* to fetch the page and potentially take screenshots or PDFs.
Now, imagine the crawler has successfully fetched the raw HTML content of a webpage. What's next? Raw HTML is often messy! It contains not just the main article or product description you might care about, but also:
* Navigation menus
* Advertisements
* Headers and footers
* Hidden code like JavaScript (`<script>`) and styling information (`<style>`)
* Comments left by developers
Before we can really understand the *meaning* of the page or extract specific important information, we need to clean up this mess and get a basic understanding of its structure.
## What Problem Does `ContentScrapingStrategy` Solve?
Think of the raw HTML fetched by the crawler as a very rough first draft of a book manuscript. It has the core story, but it's full of editor's notes, coffee stains, layout instructions for the printer, and maybe even doodles in the margins.
Before the *main* editor (who focuses on plot and character) can work on it, someone needs to do an initial cleanup. This "First Pass Editor" would:
1. Remove the coffee stains and doodles (irrelevant stuff like ads, scripts, styles).
2. Identify the basic structure: chapter headings (like the page title), paragraph text, image captions (image alt text), and maybe a list of illustrations (links).
3. Produce a tidier version of the manuscript, ready for more detailed analysis.
In Crawl4AI, the `ContentScrapingStrategy` acts as this **First Pass Editor**. It takes the raw HTML and performs an initial cleanup and structure extraction. Its job is to transform the messy HTML into a more manageable format, identifying key elements like text content, links, images, and basic page metadata (like the title).
## What is `ContentScrapingStrategy`?
`ContentScrapingStrategy` is an abstract concept (like a job description) in Crawl4AI that defines *how* the initial processing of raw HTML should happen. It specifies *that* we need a method to clean HTML and extract basic structure, but the specific tools and techniques used can vary.
This allows Crawl4AI to be flexible. Different strategies might use different underlying libraries or have different performance characteristics.
## The Implementations: Meet the Editors
Crawl4AI provides concrete implementations (the actual editors doing the work) of this strategy:
1. **`WebScrapingStrategy` (The Default Editor):**
* This is the strategy used by default if you don't specify otherwise.
* It uses a popular Python library called `BeautifulSoup` behind the scenes to parse and manipulate the HTML.
* It's generally robust and good at handling imperfect HTML.
* Think of it as a reliable, experienced editor who does a thorough job.
2. **`LXMLWebScrapingStrategy` (The Speedy Editor):**
* This strategy uses another powerful library called `lxml`.
* `lxml` is often faster than `BeautifulSoup`, especially on large or complex pages.
* Think of it as a very fast editor who might be slightly stricter about the manuscript's format but gets the job done quickly.
For most beginners, the default `WebScrapingStrategy` works perfectly fine! You usually don't need to worry about switching unless you encounter performance issues on very large-scale crawls (which is a more advanced topic).
## How It Works Conceptually
Here's the flow:
1. The [AsyncWebCrawler](02_asyncwebcrawler.md) receives the raw HTML from the [AsyncCrawlerStrategy](01_asynccrawlerstrategy.md) (the fetcher).
2. It looks at the [CrawlerRunConfig](03_crawlerrunconfig.md) to see which `ContentScrapingStrategy` to use (defaulting to `WebScrapingStrategy` if none is specified).
3. It hands the raw HTML over to the chosen strategy's `scrap` method.
4. The strategy parses the HTML, removes unwanted tags (like `<script>`, `<style>`, `<nav>`, `<aside>`, etc., based on its internal rules), extracts all links (`<a>` tags), images (`<img>` tags with their `alt` text), and metadata (like the `<title>` tag).
5. It returns the results packaged in a `ScrapingResult` object, containing the cleaned HTML, lists of links and media items, and extracted metadata.
6. The `AsyncWebCrawler` then takes this `ScrapingResult` and uses its contents (along with other info) to build the final [CrawlResult](07_crawlresult.md).
```mermaid
sequenceDiagram
participant AWC as AsyncWebCrawler (Manager)
participant Fetcher as AsyncCrawlerStrategy
participant HTML as Raw HTML
participant CSS as ContentScrapingStrategy (Editor)
participant SR as ScrapingResult (Cleaned Draft)
participant CR as CrawlResult (Final Report)
AWC->>Fetcher: Fetch("https://example.com")
Fetcher-->>AWC: Here's the Raw HTML
AWC->>CSS: Please scrap this Raw HTML (using config)
Note over CSS: Parsing HTML... Removing scripts, styles, ads... Extracting links, images, title...
CSS-->>AWC: Here's the ScrapingResult (Cleaned HTML, Links, Media, Metadata)
AWC->>CR: Combine ScrapingResult with other info
AWC-->>User: Return final CrawlResult
```
## Using the Default Strategy (`WebScrapingStrategy`)
You're likely already using it without realizing it! When you run a basic crawl, `AsyncWebCrawler` automatically employs `WebScrapingStrategy`.
```python
# chapter4_example_1.py
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
async def main():
# Uses the default AsyncPlaywrightCrawlerStrategy (fetching)
# AND the default WebScrapingStrategy (scraping/cleaning)
async with AsyncWebCrawler() as crawler:
url_to_crawl = "https://httpbin.org/html" # A very simple HTML page
# We don't specify a scraping_strategy in the config, so it uses the default
config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS) # Fetch fresh
print(f"Crawling {url_to_crawl} using default scraping strategy...")
result = await crawler.arun(url=url_to_crawl, config=config)
if result.success:
print("\nSuccess! Content fetched and scraped.")
# The 'result' object now contains info processed by WebScrapingStrategy
# 1. Metadata extracted (e.g., page title)
print(f"Page Title: {result.metadata.get('title', 'N/A')}")
# 2. Links extracted
print(f"Found {len(result.links.internal)} internal links and {len(result.links.external)} external links.")
# Example: print first external link if exists
if result.links.external:
print(f" Example external link: {result.links.external[0].href}")
# 3. Media extracted (images, videos, etc.)
print(f"Found {len(result.media.images)} images.")
# Example: print first image alt text if exists
if result.media.images:
print(f" Example image alt text: '{result.media.images[0].alt}'")
# 4. Cleaned HTML (scripts, styles etc. removed) - might still be complex
# print(f"\nCleaned HTML snippet:\n---\n{result.cleaned_html[:200]}...\n---")
# 5. Markdown representation (generated AFTER scraping)
print(f"\nMarkdown snippet:\n---\n{result.markdown.raw_markdown[:200]}...\n---")
else:
print(f"\nFailed: {result.error_message}")
if __name__ == "__main__":
asyncio.run(main())
```
**Explanation:**
1. We create `AsyncWebCrawler` and `CrawlerRunConfig` as usual.
2. We **don't** set the `scraping_strategy` parameter in `CrawlerRunConfig`. Crawl4AI automatically picks `WebScrapingStrategy`.
3. When `crawler.arun` executes, after fetching the HTML, it internally calls `WebScrapingStrategy.scrap()`.
4. The `result` (a [CrawlResult](07_crawlresult.md) object) contains fields populated by the scraping strategy:
* `result.metadata`: Contains things like the page title found in `<title>` tags.
* `result.links`: Contains lists of internal and external links found (`<a>` tags).
* `result.media`: Contains lists of images (`<img>`), videos (`<video>`), etc.
* `result.cleaned_html`: The HTML after the strategy removed unwanted tags and attributes (this is then used to generate the Markdown).
* `result.markdown`: While not *directly* created by the scraping strategy, the cleaned HTML it produces is the input for generating the Markdown representation.
## Explicitly Choosing a Strategy (e.g., `LXMLWebScrapingStrategy`)
What if you want to try the potentially faster `LXMLWebScrapingStrategy`? You can specify it in the `CrawlerRunConfig`.
```python
# chapter4_example_2.py
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
# 1. Import the specific strategy you want to use
from crawl4ai import LXMLWebScrapingStrategy
async def main():
# 2. Create an instance of the desired scraping strategy
lxml_editor = LXMLWebScrapingStrategy()
print(f"Using scraper: {lxml_editor.__class__.__name__}")
async with AsyncWebCrawler() as crawler:
url_to_crawl = "https://httpbin.org/html"
# 3. Create a CrawlerRunConfig and pass the strategy instance
config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS,
scraping_strategy=lxml_editor # Tell the config which strategy to use
)
print(f"Crawling {url_to_crawl} with explicit LXML scraping strategy...")
result = await crawler.arun(url=url_to_crawl, config=config)
if result.success:
print("\nSuccess! Content fetched and scraped using LXML.")
print(f"Page Title: {result.metadata.get('title', 'N/A')}")
print(f"Found {len(result.links.external)} external links.")
# Output should be largely the same as the default strategy for simple pages
else:
print(f"\nFailed: {result.error_message}")
if __name__ == "__main__":
asyncio.run(main())
```
**Explanation:**
1. **Import:** We import `LXMLWebScrapingStrategy` alongside the other classes.
2. **Instantiate:** We create an instance: `lxml_editor = LXMLWebScrapingStrategy()`.
3. **Configure:** We create `CrawlerRunConfig` and pass our instance to the `scraping_strategy` parameter: `CrawlerRunConfig(..., scraping_strategy=lxml_editor)`.
4. **Run:** Now, when `crawler.arun` is called with this config, it will use `LXMLWebScrapingStrategy` instead of the default `WebScrapingStrategy` for the initial HTML processing step.
For simple pages, the results from both strategies will often be very similar. The choice typically comes down to performance considerations in more advanced scenarios.
## A Glimpse Under the Hood
Inside the `crawl4ai` library, the file `content_scraping_strategy.py` defines the blueprint and the implementations.
**The Blueprint (Abstract Base Class):**
```python
# Simplified from crawl4ai/content_scraping_strategy.py
from abc import ABC, abstractmethod
from .models import ScrapingResult # Defines the structure of the result
class ContentScrapingStrategy(ABC):
"""Abstract base class for content scraping strategies."""
@abstractmethod
def scrap(self, url: str, html: str, **kwargs) -> ScrapingResult:
"""
Synchronous method to scrape content.
Takes raw HTML, returns structured ScrapingResult.
"""
pass
@abstractmethod
async def ascrap(self, url: str, html: str, **kwargs) -> ScrapingResult:
"""
Asynchronous method to scrape content.
Takes raw HTML, returns structured ScrapingResult.
"""
pass
```
**The Implementations:**
```python
# Simplified from crawl4ai/content_scraping_strategy.py
from bs4 import BeautifulSoup # Library used by WebScrapingStrategy
# ... other imports like models ...
class WebScrapingStrategy(ContentScrapingStrategy):
def __init__(self, logger=None):
self.logger = logger
# ... potentially other setup ...
def scrap(self, url: str, html: str, **kwargs) -> ScrapingResult:
# 1. Parse HTML using BeautifulSoup
soup = BeautifulSoup(html, 'lxml') # Or another parser
# 2. Find the main content area (maybe using kwargs['css_selector'])
# 3. Remove unwanted tags (scripts, styles, nav, footer, ads...)
# 4. Extract metadata (title, description...)
# 5. Extract all links (<a> tags)
# 6. Extract all images (<img> tags) and other media
# 7. Get the remaining cleaned HTML text content
# ... complex cleaning and extraction logic using BeautifulSoup methods ...
# 8. Package results into a ScrapingResult object
cleaned_html_content = "<html><body>Cleaned content...</body></html>" # Placeholder
links_data = Links(...)
media_data = Media(...)
metadata_dict = {"title": "Page Title"}
return ScrapingResult(
cleaned_html=cleaned_html_content,
links=links_data,
media=media_data,
metadata=metadata_dict,
success=True
)
async def ascrap(self, url: str, html: str, **kwargs) -> ScrapingResult:
# Often delegates to the synchronous version for CPU-bound tasks
return await asyncio.to_thread(self.scrap, url, html, **kwargs)
```
```python
# Simplified from crawl4ai/content_scraping_strategy.py
from lxml import html as lhtml # Library used by LXMLWebScrapingStrategy
# ... other imports like models ...
class LXMLWebScrapingStrategy(WebScrapingStrategy): # Often inherits for shared logic
def __init__(self, logger=None):
super().__init__(logger)
# ... potentially LXML specific setup ...
def scrap(self, url: str, html: str, **kwargs) -> ScrapingResult:
# 1. Parse HTML using lxml
doc = lhtml.document_fromstring(html)
# 2. Find main content, remove unwanted tags, extract info
# ... complex cleaning and extraction logic using lxml's XPath or CSS selectors ...
# 3. Package results into a ScrapingResult object
cleaned_html_content = "<html><body>Cleaned LXML content...</body></html>" # Placeholder
links_data = Links(...)
media_data = Media(...)
metadata_dict = {"title": "Page Title LXML"}
return ScrapingResult(
cleaned_html=cleaned_html_content,
links=links_data,
media=media_data,
metadata=metadata_dict,
success=True
)
# ascrap might also delegate or have specific async optimizations
```
The key takeaway is that both strategies implement the `scrap` (and `ascrap`) method, taking raw HTML and returning a structured `ScrapingResult`. The `AsyncWebCrawler` can use either one thanks to this common interface.
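Because of that common interface, you can even call a scraping strategy directly on a raw HTML string, outside of any crawl, which is handy for quick experiments. A minimal sketch (it assumes `WebScrapingStrategy` is importable from the top-level package, as `LXMLWebScrapingStrategy` was above, and that the `ScrapingResult` fields match the simplified code in this section):
```python
# direct_scrap_sketch.py (calling a scraping strategy by hand, for illustration)
from crawl4ai import WebScrapingStrategy

raw_html = """
<html>
  <head><title>Tiny Test Page</title><script>console.log('noise');</script></head>
  <body>
    <h1>Hello</h1>
    <p>Some text with a <a href="https://example.com">link</a>.</p>
  </body>
</html>
"""

strategy = WebScrapingStrategy()
# The page URL is a placeholder; it is used to classify links as internal/external.
scraping_result = strategy.scrap(url="https://my-site.example/test-page", html=raw_html)

print(scraping_result.success)                # True if parsing worked
print(scraping_result.metadata.get("title"))  # expected: "Tiny Test Page"
print(len(scraping_result.links.external))    # the example.com link should land here
```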
## Conclusion
You've learned about `ContentScrapingStrategy`, Crawl4AI's "First Pass Editor" for raw HTML.
* It tackles the problem of messy HTML by cleaning it and extracting basic structure.
* It acts as a blueprint, with `WebScrapingStrategy` (default, using BeautifulSoup) and `LXMLWebScrapingStrategy` (using lxml) as concrete implementations.
* It's used automatically by `AsyncWebCrawler` after fetching content.
* You can specify which strategy to use via `CrawlerRunConfig`.
* Its output (cleaned HTML, links, media, metadata) is packaged into a `ScrapingResult` and contributes significantly to the final `CrawlResult`.
Now that we have this initially cleaned and structured content, we might want to further filter it. What if we only care about the parts of the page that are *relevant* to a specific topic?
**Next:** Let's explore how to filter content for relevance with [Chapter 5: Focusing on What Matters - RelevantContentFilter](05_relevantcontentfilter.md).
---
Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge)


@@ -0,0 +1,425 @@
# Chapter 5: Focusing on What Matters - RelevantContentFilter
In [Chapter 4: Cleaning Up the Mess - ContentScrapingStrategy](04_contentscrapingstrategy.md), we learned how Crawl4AI takes the raw, messy HTML from a webpage and cleans it up using a `ContentScrapingStrategy`. This gives us a tidier version of the HTML (`cleaned_html`) and extracts basic elements like links and images.
But even after this initial cleanup, the page might still contain a lot of "noise" relative to what we *actually* care about. Imagine a news article page: the `ContentScrapingStrategy` might remove scripts and styles, but it could still leave the main article text, plus related article links, user comments, sidebars with ads, and maybe a lengthy footer.
If our goal is just to get the main article content (e.g., to summarize it or feed it to an AI), all that extra stuff is just noise. How can we filter the cleaned content even further to keep only the truly relevant parts?
## What Problem Does `RelevantContentFilter` Solve?
Think of the `cleaned_html` from the previous step like flour that's been roughly sifted: the biggest lumps are gone, but there might still be smaller clumps or bran mixed in. If you want super-fine flour for a delicate cake, you need a finer sieve.
`RelevantContentFilter` acts as this **finer sieve** or a **Relevance Sieve**. It's a strategy applied *after* the initial cleaning by `ContentScrapingStrategy` but *before* the final processing (like generating the final Markdown output or using an AI for extraction). Its job is to go through the cleaned content and decide which parts are truly relevant to our goal, removing the rest.
This helps us:
1. **Reduce Noise:** Eliminate irrelevant sections like comments, footers, navigation bars, or tangential "related content" blocks.
2. **Focus AI:** If we're sending the content to a Large Language Model (LLM), feeding it only the most relevant parts saves processing time (and potentially money) and can lead to better results.
3. **Improve Accuracy:** By removing distracting noise, subsequent steps like data extraction are less likely to grab the wrong information.
## What is `RelevantContentFilter`?
`RelevantContentFilter` is an abstract concept (a blueprint) in Crawl4AI representing a **method for identifying and retaining only the relevant portions of cleaned HTML content**. It defines *that* we need a way to filter for relevance, but the specific technique used can vary.
This allows us to choose different filtering approaches depending on the task and the type of content.
## The Different Filters: Tools for Sieving
Crawl4AI provides several concrete implementations (the actual sieves) of `RelevantContentFilter`:
1. **`BM25ContentFilter` (The Keyword Sieve):**
* **Analogy:** Like a mini search engine operating *within* the webpage.
* **How it Works:** You give it (or it figures out) some keywords related to what you're looking for (e.g., from a user query like "product specifications" or derived from the page title). It then uses a search algorithm called BM25 to score different chunks of the cleaned HTML based on how relevant they are to those keywords. Only the chunks scoring above a certain threshold are kept.
* **Good For:** Finding specific sections about a known topic within a larger page (e.g., finding only the paragraphs discussing "climate change impact" on a long environmental report page).
2. **`PruningContentFilter` (The Structural Sieve):**
* **Analogy:** Like a gardener pruning a bush, removing weak or unnecessary branches based on their structure.
* **How it Works:** This filter doesn't care about keywords. Instead, it looks at the *structure* and *characteristics* of the HTML elements. It removes elements that often represent noise, such as those with very little text compared to the number of links (low text density), elements with common "noise" words in their CSS classes or IDs (like `sidebar`, `comments`, `footer`), or elements deemed structurally insignificant.
* **Good For:** Removing common boilerplate sections (like headers, footers, simple sidebars, navigation) based purely on layout and density clues, even if you don't have a specific topic query.
3. **`LLMContentFilter` (The AI Sieve):**
* **Analogy:** Asking a smart assistant to read the cleaned content and pick out only the parts relevant to your request.
* **How it Works:** This filter sends the cleaned HTML (often broken into manageable chunks) to a Large Language Model (like GPT). You provide an instruction (e.g., "Extract only the main article content, removing all comments and related links" or "Keep only the sections discussing financial results"). The AI uses its understanding of language and context to identify and return only the relevant parts, often already formatted nicely (like in Markdown).
* **Good For:** Handling complex relevance decisions that require understanding meaning and context, following nuanced natural language instructions. (Note: Requires configuring LLM access, like API keys, and can be slower and potentially costlier than other methods).
## How `RelevantContentFilter` is Used (Via Markdown Generation)
In Crawl4AI, the `RelevantContentFilter` is typically integrated into the **Markdown generation** step. The standard markdown generator (`DefaultMarkdownGenerator`) can accept a `RelevantContentFilter` instance.
When configured this way:
1. The `AsyncWebCrawler` fetches the page and uses the `ContentScrapingStrategy` to get `cleaned_html`.
2. It then calls the `DefaultMarkdownGenerator` to produce the Markdown output.
3. The generator first creates the standard, "raw" Markdown from the *entire* `cleaned_html`.
4. **If** a `RelevantContentFilter` was provided to the generator, it then uses this filter on the `cleaned_html` to select only the relevant HTML fragments.
5. It converts *these filtered fragments* into Markdown. This becomes the `fit_markdown`.
So, the `CrawlResult` will contain *both*:
* `result.markdown.raw_markdown`: Markdown based on the full `cleaned_html`.
* `result.markdown.fit_markdown`: Markdown based *only* on the parts deemed relevant by the filter.
Let's see how to configure this.
### Example 1: Using `BM25ContentFilter` to find specific content
Imagine we crawled a page about renewable energy, but we only want the parts specifically discussing **solar power**.
```python
# chapter5_example_1.py
import asyncio
from crawl4ai import (
AsyncWebCrawler,
CrawlerRunConfig,
DefaultMarkdownGenerator, # The standard markdown generator
BM25ContentFilter # The keyword-based filter
)
async def main():
# 1. Create the BM25 filter with our query
solar_filter = BM25ContentFilter(user_query="solar power technology")
print(f"Filter created for query: '{solar_filter.user_query}'")
# 2. Create a Markdown generator that USES this filter
markdown_generator_with_filter = DefaultMarkdownGenerator(
content_filter=solar_filter
)
print("Markdown generator configured with BM25 filter.")
# 3. Create CrawlerRunConfig using this specific markdown generator
run_config = CrawlerRunConfig(
markdown_generator=markdown_generator_with_filter
)
# 4. Run the crawl
async with AsyncWebCrawler() as crawler:
# Example URL (replace with a real page having relevant content)
url_to_crawl = "https://en.wikipedia.org/wiki/Renewable_energy"
print(f"\nCrawling {url_to_crawl}...")
result = await crawler.arun(url=url_to_crawl, config=run_config)
if result.success:
print("\nCrawl successful!")
print(f"Raw Markdown length: {len(result.markdown.raw_markdown)}")
print(f"Fit Markdown length: {len(result.markdown.fit_markdown)}")
# The fit_markdown should be shorter and focused on solar power
print("\n--- Start of Fit Markdown (Solar Power Focus) ---")
# Print first 500 chars of the filtered markdown
print(result.markdown.fit_markdown[:500] + "...")
print("--- End of Fit Markdown Snippet ---")
else:
print(f"\nCrawl failed: {result.error_message}")
if __name__ == "__main__":
asyncio.run(main())
```
**Explanation:**
1. **Create Filter:** We make an instance of `BM25ContentFilter`, telling it we're interested in "solar power technology".
2. **Create Generator:** We make an instance of `DefaultMarkdownGenerator` and pass our `solar_filter` to its `content_filter` parameter.
3. **Configure Run:** We create `CrawlerRunConfig` and tell it to use our special `markdown_generator_with_filter` for this run.
4. **Crawl & Check:** We run the crawl as usual. In the `result`, `result.markdown.raw_markdown` will have the markdown for the whole page, while `result.markdown.fit_markdown` will *only* contain markdown derived from the HTML parts that the `BM25ContentFilter` scored highly for relevance to "solar power technology". You'll likely see the `fit_markdown` is significantly shorter.
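If `fit_markdown` keeps too much (or too little), note that the filter also accepts a `bm25_threshold` parameter (it appears in the code glimpse later in this chapter); raising it keeps only higher-scoring chunks. To make the scoring idea concrete, here is a standalone sketch of the BM25 step itself, using the `rank_bm25` package whose `BM25Okapi` class the built-in filter imports. It illustrates the algorithm only; the real filter adds HTML parsing, chunk extraction, and result packaging, and the chunks, query, and threshold below are made up for illustration.
```python
# A minimal sketch of BM25 relevance scoring, separate from Crawl4AI itself.
from rank_bm25 import BM25Okapi

chunks = [
    "Solar power converts sunlight into electricity using photovoltaic panels.",
    "Leave a comment below and subscribe to our newsletter.",
    "Wind turbines generate electricity from moving air.",
]
query = "solar power technology"

bm25 = BM25Okapi([chunk.lower().split() for chunk in chunks])
scores = bm25.get_scores(query.lower().split())

threshold = 0.5  # chunks scoring below this are dropped (value is illustrative)
relevant_chunks = [c for c, s in zip(chunks, scores) if s >= threshold]
print(relevant_chunks)  # expected: only the solar power sentence survives
```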
### Example 2: Using `PruningContentFilter` to remove boilerplate
Now, let's try removing common noise like sidebars or footers based on structure, without needing a specific query.
```python
# chapter5_example_2.py
import asyncio
from crawl4ai import (
AsyncWebCrawler,
CrawlerRunConfig,
DefaultMarkdownGenerator,
PruningContentFilter # The structural filter
)
async def main():
# 1. Create the Pruning filter (no query needed)
pruning_filter = PruningContentFilter()
print("Filter created: PruningContentFilter (structural)")
# 2. Create a Markdown generator that uses this filter
markdown_generator_with_filter = DefaultMarkdownGenerator(
content_filter=pruning_filter
)
print("Markdown generator configured with Pruning filter.")
# 3. Create CrawlerRunConfig using this generator
run_config = CrawlerRunConfig(
markdown_generator=markdown_generator_with_filter
)
# 4. Run the crawl
async with AsyncWebCrawler() as crawler:
# Example URL (replace with a real page that has boilerplate)
url_to_crawl = "https://www.python.org/" # Python homepage likely has headers/footers
print(f"\nCrawling {url_to_crawl}...")
result = await crawler.arun(url=url_to_crawl, config=run_config)
if result.success:
print("\nCrawl successful!")
print(f"Raw Markdown length: {len(result.markdown.raw_markdown)}")
print(f"Fit Markdown length: {len(result.markdown.fit_markdown)}")
# fit_markdown should have less header/footer/sidebar content
print("\n--- Start of Fit Markdown (Pruned) ---")
print(result.markdown.fit_markdown[:500] + "...")
print("--- End of Fit Markdown Snippet ---")
else:
print(f"\nCrawl failed: {result.error_message}")
if __name__ == "__main__":
asyncio.run(main())
```
**Explanation:**
The structure is the same as the BM25 example, but:
1. We instantiate `PruningContentFilter()`, which doesn't require a `user_query`.
2. We pass this filter to the `DefaultMarkdownGenerator`.
3. The resulting `result.markdown.fit_markdown` should contain Markdown primarily from the main content areas of the page, with structurally identified boilerplate removed.
### Example 3: Using `LLMContentFilter` (Conceptual)
Using `LLMContentFilter` follows the same pattern, but requires setting up LLM provider details.
```python
# chapter5_example_3_conceptual.py
import asyncio
from crawl4ai import (
AsyncWebCrawler,
CrawlerRunConfig,
DefaultMarkdownGenerator,
LLMContentFilter,
# Assume LlmConfig is set up correctly (see LLM-specific docs)
# from crawl4ai.async_configs import LlmConfig
)
# Assume llm_config is properly configured with API keys, provider, etc.
# Example: llm_config = LlmConfig(provider="openai", api_token="env:OPENAI_API_KEY")
# For this example, we'll pretend it's ready.
class MockLlmConfig: # Mock for demonstration
provider = "mock_provider"
api_token = "mock_token"
base_url = None
llm_config = MockLlmConfig()
async def main():
# 1. Create the LLM filter with an instruction
instruction = "Extract only the main news article content. Remove headers, footers, ads, comments, and related links."
llm_filter = LLMContentFilter(
instruction=instruction,
llmConfig=llm_config # Pass the LLM configuration
)
print(f"Filter created: LLMContentFilter")
print(f"Instruction: '{llm_filter.instruction}'")
# 2. Create a Markdown generator using this filter
markdown_generator_with_filter = DefaultMarkdownGenerator(
content_filter=llm_filter
)
print("Markdown generator configured with LLM filter.")
# 3. Create CrawlerRunConfig
run_config = CrawlerRunConfig(
markdown_generator=markdown_generator_with_filter
)
# 4. Run the crawl
async with AsyncWebCrawler() as crawler:
# Example URL (replace with a real news article)
url_to_crawl = "https://httpbin.org/html" # Using simple page for demo
print(f"\nCrawling {url_to_crawl}...")
# In a real scenario, this would call the LLM API
result = await crawler.arun(url=url_to_crawl, config=run_config)
if result.success:
print("\nCrawl successful!")
# The fit_markdown would contain the AI-filtered content
print("\n--- Start of Fit Markdown (AI Filtered - Conceptual) ---")
# Because we used a mock LLM/simple page, fit_markdown might be empty or simple.
# On a real page with a real LLM, it would ideally contain just the main article.
print(result.markdown.fit_markdown[:500] + "...")
print("--- End of Fit Markdown Snippet ---")
else:
print(f"\nCrawl failed: {result.error_message}")
if __name__ == "__main__":
asyncio.run(main())
```
**Explanation:**
1. We create `LLMContentFilter`, providing our natural language `instruction` and the necessary `llmConfig` (which holds provider details and API keys - mocked here for simplicity).
2. We integrate it into `DefaultMarkdownGenerator` and `CrawlerRunConfig` as before.
3. When `arun` is called, the `LLMContentFilter` would (in a real scenario) interact with the configured LLM API, sending chunks of the `cleaned_html` and the instruction, then assembling the AI's response into the `fit_markdown`.
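If you have real credentials, swapping the mock for an actual configuration is straightforward. The sketch below follows the commented-out hint in the example above; the import path, the `"openai"` provider string, and the `env:` token convention are taken from that comment and should be treated as assumptions to verify against your installed version.
```python
# Sketch: a real LLM configuration instead of the mock above.
# Import path and constructor follow the comments in the example; verify
# them against your crawl4ai version before relying on them.
from crawl4ai.async_configs import LlmConfig

llm_config = LlmConfig(
    provider="openai",                # which LLM provider to use
    api_token="env:OPENAI_API_KEY",   # read the API key from an environment variable
)
```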
## Under the Hood: How Filtering Fits In
The `RelevantContentFilter` doesn't run on its own; it's invoked by another component, typically the `DefaultMarkdownGenerator`.
Here's the sequence:
```mermaid
sequenceDiagram
participant User
participant AWC as AsyncWebCrawler
participant Config as CrawlerRunConfig
participant Scraper as ContentScrapingStrategy
participant MDGen as DefaultMarkdownGenerator
participant Filter as RelevantContentFilter
participant Result as CrawlResult
User->>AWC: arun(url, config=my_config)
Note over AWC: Config includes Markdown Generator with a Filter
AWC->>Scraper: scrap(raw_html)
Scraper-->>AWC: cleaned_html, links, etc.
AWC->>MDGen: generate_markdown(cleaned_html, config=my_config)
Note over MDGen: Uses html2text for raw markdown
MDGen-->>MDGen: raw_markdown = html2text(cleaned_html)
Note over MDGen: Now, check for content_filter
alt Filter Provided in MDGen
MDGen->>Filter: filter_content(cleaned_html)
Filter-->>MDGen: filtered_html_fragments
Note over MDGen: Uses html2text on filtered fragments
MDGen-->>MDGen: fit_markdown = html2text(filtered_html_fragments)
else No Filter Provided
MDGen-->>MDGen: fit_markdown = "" (or None)
end
Note over MDGen: Generate citations if needed
MDGen-->>AWC: MarkdownGenerationResult (raw, fit, references)
AWC->>Result: Package everything
AWC-->>User: Return CrawlResult
```
**Code Glimpse:**
Inside `crawl4ai/markdown_generation_strategy.py`, the `DefaultMarkdownGenerator`'s `generate_markdown` method has logic like this (simplified):
```python
# Simplified from markdown_generation_strategy.py
from .models import MarkdownGenerationResult
from .html2text import CustomHTML2Text
from .content_filter_strategy import RelevantContentFilter # Import filter base class
class DefaultMarkdownGenerator(MarkdownGenerationStrategy):
# ... __init__ stores self.content_filter ...
def generate_markdown(
self,
cleaned_html: str,
# ... other params like base_url, options ...
content_filter: Optional[RelevantContentFilter] = None,
**kwargs,
) -> MarkdownGenerationResult:
h = CustomHTML2Text(...) # Setup html2text converter
# ... apply options ...
# 1. Generate raw markdown from the full cleaned_html
raw_markdown = h.handle(cleaned_html)
# ... post-process raw_markdown ...
# 2. Convert links to citations (if enabled)
markdown_with_citations, references_markdown = self.convert_links_to_citations(...)
# 3. Generate fit markdown IF a filter is available
fit_markdown = ""
filtered_html = ""
# Use the filter passed directly, or the one stored during initialization
active_filter = content_filter or self.content_filter
if active_filter:
try:
# Call the filter's main method
filtered_html_fragments = active_filter.filter_content(cleaned_html)
# Join fragments (assuming filter returns list of HTML strings)
filtered_html = "\n".join(filtered_html_fragments)
# Convert ONLY the filtered HTML to markdown
fit_markdown = h.handle(filtered_html)
except Exception as e:
fit_markdown = f"Error during filtering: {e}"
# Log error...
return MarkdownGenerationResult(
raw_markdown=raw_markdown,
markdown_with_citations=markdown_with_citations,
references_markdown=references_markdown,
fit_markdown=fit_markdown, # Contains the filtered result
fit_html=filtered_html, # The HTML fragments kept by the filter
)
```
And inside `crawl4ai/content_filter_strategy.py`, you find the blueprint and implementations:
```python
# Simplified from content_filter_strategy.py
from abc import ABC, abstractmethod
from typing import List
# ... other imports like BeautifulSoup, BM25Okapi ...
class RelevantContentFilter(ABC):
"""Abstract base class for content filtering strategies"""
def __init__(self, user_query: str = None, ...):
self.user_query = user_query
# ... common setup ...
@abstractmethod
def filter_content(self, html: str) -> List[str]:
"""
Takes cleaned HTML, returns a list of HTML fragments
deemed relevant by the specific strategy.
"""
pass
# ... common helper methods like extract_page_query, is_excluded ...
class BM25ContentFilter(RelevantContentFilter):
def __init__(self, user_query: str = None, bm25_threshold: float = 1.0, ...):
super().__init__(user_query)
self.bm25_threshold = bm25_threshold
# ... BM25 specific setup ...
def filter_content(self, html: str) -> List[str]:
# 1. Parse HTML (e.g., with BeautifulSoup)
# 2. Extract text chunks (candidates)
# 3. Determine query (user_query or extracted)
# 4. Tokenize query and chunks
# 5. Calculate BM25 scores for chunks vs query
# 6. Filter chunks based on score and threshold
# 7. Return the HTML string of the selected chunks
# ... implementation details ...
relevant_html_fragments = ["<p>Relevant paragraph 1...</p>", "<h2>Relevant Section</h2>..."] # Placeholder
return relevant_html_fragments
# ... Implementations for PruningContentFilter and LLMContentFilter ...
```
The key is that each filter implements the `filter_content` method, returning the list of HTML fragments it considers relevant. The `DefaultMarkdownGenerator` then uses these fragments to create the `fit_markdown`.
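Because the contract is just `filter_content(cleaned_html) -> List[str]`, you can also plug in your own sieve. Below is a minimal, hypothetical sketch of a custom filter that keeps only paragraphs mentioning a keyword. It assumes that subclassing `RelevantContentFilter` and implementing `filter_content` is sufficient (as the simplified base class above suggests) and that BeautifulSoup is available, since the built-in filters already use it.
```python
# Hypothetical custom filter: keep only <p> elements that mention a keyword.
# Assumes the simplified RelevantContentFilter interface shown above.
from typing import List

from bs4 import BeautifulSoup
from crawl4ai.content_filter_strategy import RelevantContentFilter


class KeywordParagraphFilter(RelevantContentFilter):
    def __init__(self, keyword: str):
        super().__init__(user_query=keyword)
        self.keyword = keyword.lower()

    def filter_content(self, html: str) -> List[str]:
        soup = BeautifulSoup(html, "html.parser")
        # Return the raw HTML of each paragraph containing the keyword.
        return [
            str(p)
            for p in soup.find_all("p")
            if self.keyword in p.get_text().lower()
        ]
```
An instance of such a class could then be passed to `DefaultMarkdownGenerator(content_filter=...)` exactly like the built-in filters.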
## Conclusion
You've learned about `RelevantContentFilter`, Crawl4AI's "Relevance Sieve"!
* It addresses the problem that even cleaned HTML can contain noise relative to a specific goal.
* It acts as a strategy to filter cleaned HTML, keeping only the relevant parts.
* Different filter types exist: `BM25ContentFilter` (keywords), `PruningContentFilter` (structure), and `LLMContentFilter` (AI/semantic).
* It's typically used *within* the `DefaultMarkdownGenerator` to produce a focused `fit_markdown` output in the `CrawlResult`, alongside the standard `raw_markdown`.
* You configure it by passing the chosen filter instance to the `DefaultMarkdownGenerator` and then passing that generator to the `CrawlerRunConfig`.
By using `RelevantContentFilter`, you can significantly improve the signal-to-noise ratio of the content you get from webpages, making downstream tasks like summarization or analysis more effective.
But what if just getting relevant *text* isn't enough? What if you need specific, *structured* data like product names, prices, and ratings from an e-commerce page, or names and affiliations from a list of conference speakers?
**Next:** Let's explore how to extract structured data with [Chapter 6: Getting Specific Data - ExtractionStrategy](06_extractionstrategy.md).
---
Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge)

# Chapter 6: Getting Specific Data - ExtractionStrategy
In the previous chapter, [Chapter 5: Focusing on What Matters - RelevantContentFilter](05_relevantcontentfilter.md), we learned how to sift through the cleaned webpage content to keep only the parts relevant to our query or goal, producing a focused `fit_markdown`. This is great for tasks like summarization or getting the main gist of an article.
But sometimes, we need more than just relevant text. Imagine you're analyzing an e-commerce website listing products. You don't just want the *description*; you need the exact **product name**, the specific **price**, the **customer rating**, and maybe the **SKU number**, all neatly organized. How do we tell Crawl4AI to find these *specific* pieces of information and return them in a structured format, like a JSON object?
## What Problem Does `ExtractionStrategy` Solve?
Think of the content we've processed so far (like the cleaned HTML or the generated Markdown) as a detailed report delivered by a researcher. `RelevantContentFilter` helped trim the report down to the most relevant pages.
Now, we need to give specific instructions to an **Analyst** to go through that focused report and pull out precise data points. We don't just want the report; we want a filled-in spreadsheet with columns for "Product Name," "Price," and "Rating."
`ExtractionStrategy` is the set of instructions we give to this Analyst. It defines *how* to locate and extract specific, structured information (like fields in a database or keys in a JSON object) from the content.
## What is `ExtractionStrategy`?
`ExtractionStrategy` is a core concept (a blueprint) in Crawl4AI that represents the **method used to extract structured data** from the processed content (which could be HTML or Markdown). It specifies *that* we need a way to find specific fields, but the actual *technique* used to find them can vary.
This allows us to choose the best "Analyst" for the job, depending on the complexity of the website and the data we need.
## The Different Analysts: Ways to Extract Data
Crawl4AI offers several concrete implementations (the different Analysts) for extracting structured data:
1. **The Precise Locator (`JsonCssExtractionStrategy` & `JsonXPathExtractionStrategy`)**
* **Analogy:** An analyst who uses very precise map coordinates (CSS Selectors or XPath expressions) to find information on a page. They need to be told exactly where to look. "The price is always in the HTML element with the ID `#product-price`."
* **How it works:** You define a **schema** (a Python dictionary) that maps the names of the fields you want (e.g., "product_name", "price") to the specific CSS selector (`JsonCssExtractionStrategy`) or XPath expression (`JsonXPathExtractionStrategy`) that locates that information within the HTML structure.
* **Pros:** Very fast and reliable if the website structure is consistent and predictable. Doesn't require external AI services.
* **Cons:** Can break easily if the website changes its layout (selectors become invalid). Requires you to inspect the HTML and figure out the correct selectors.
* **Input:** Typically works directly on the raw or cleaned HTML.
2. **The Smart Interpreter (`LLMExtractionStrategy`)**
* **Analogy:** A highly intelligent analyst who can *read and understand* the content. You give them a list of fields you need (a schema) or even just natural language instructions ("Find the product name, its price, and a short description"). They read the content (usually Markdown) and use their understanding of language and context to figure out the values, even if the layout isn't perfectly consistent.
* **How it works:** You provide a desired output schema (e.g., a Pydantic model or a dictionary structure) or a natural language instruction. The strategy sends the content (often the generated Markdown, possibly split into chunks) along with your schema/instruction to a configured Large Language Model (LLM) like GPT or Llama. The LLM reads the text and generates the structured data (usually JSON) according to your request.
* **Pros:** Much more resilient to website layout changes. Can understand context and handle variations. Can extract data based on meaning, not just location.
* **Cons:** Requires setting up access to an LLM (API keys, potentially costs). Can be significantly slower than selector-based methods. The quality of extraction depends on the LLM's capabilities and the clarity of your instructions/schema.
* **Input:** Often works best on the cleaned Markdown representation of the content, but can sometimes use HTML.
## How to Use an `ExtractionStrategy`
You tell the `AsyncWebCrawler` which extraction strategy to use (if any) by setting the `extraction_strategy` parameter within the [CrawlerRunConfig](03_crawlerrunconfig.md) object you pass to `arun` or `arun_many`.
### Example 1: Extracting Data with `JsonCssExtractionStrategy`
Let's imagine we want to extract the page title (from the `<title>` tag) and the main heading (from the `<h1>` tag) of the simple `httpbin.org/html` page.
```python
# chapter6_example_1.py
import asyncio
import json
from crawl4ai import (
AsyncWebCrawler,
CrawlerRunConfig,
JsonCssExtractionStrategy # Import the CSS strategy
)
async def main():
# 1. Define the extraction schema (Field Name -> CSS Selector)
extraction_schema = {
"baseSelector": "body", # Operate within the body tag
"fields": [
{"name": "page_title", "selector": "title", "type": "text"},
{"name": "main_heading", "selector": "h1", "type": "text"}
]
}
print("Extraction Schema defined using CSS selectors.")
# 2. Create an instance of the strategy with the schema
css_extractor = JsonCssExtractionStrategy(schema=extraction_schema)
print(f"Using strategy: {css_extractor.__class__.__name__}")
# 3. Create CrawlerRunConfig and set the extraction_strategy
run_config = CrawlerRunConfig(
extraction_strategy=css_extractor
)
# 4. Run the crawl
async with AsyncWebCrawler() as crawler:
url_to_crawl = "https://httpbin.org/html"
print(f"\nCrawling {url_to_crawl} to extract structured data...")
result = await crawler.arun(url=url_to_crawl, config=run_config)
if result.success and result.extracted_content:
print("\nExtraction successful!")
# The extracted data is stored as a JSON string in result.extracted_content
# Parse the JSON string to work with the data as a Python object
extracted_data = json.loads(result.extracted_content)
print("Extracted Data:")
# Print the extracted data nicely formatted
print(json.dumps(extracted_data, indent=2))
elif result.success:
print("\nCrawl successful, but no structured data extracted.")
else:
print(f"\nCrawl failed: {result.error_message}")
if __name__ == "__main__":
asyncio.run(main())
```
**Explanation:**
1. **Schema Definition:** We create a Python dictionary `extraction_schema`.
* `baseSelector: "body"` tells the strategy to look for items within the `<body>` tag of the HTML.
* `fields` is a list of dictionaries, each defining a field to extract:
* `name`: The key for this field in the output JSON (e.g., "page_title").
* `selector`: The CSS selector to find the element containing the data (e.g., "title" finds the `<title>` tag, "h1" finds the `<h1>` tag).
* `type`: How to get the data from the selected element (`"text"` means get the text content).
2. **Instantiate Strategy:** We create an instance of `JsonCssExtractionStrategy`, passing our `extraction_schema`. This strategy knows its input format should be HTML.
3. **Configure Run:** We create a `CrawlerRunConfig` and assign our `css_extractor` instance to the `extraction_strategy` parameter.
4. **Crawl:** We run `crawler.arun`. After fetching and basic scraping, the `AsyncWebCrawler` will see the `extraction_strategy` in the config and call our `css_extractor`.
5. **Result:** The `CrawlResult` object now contains a field called `extracted_content`. This field holds the structured data found by the strategy, formatted as a **JSON string**. We use `json.loads()` to convert this string back into a Python list/dictionary.
**Expected Output (Conceptual):**
```
Extraction Schema defined using CSS selectors.
Using strategy: JsonCssExtractionStrategy
Crawling https://httpbin.org/html to extract structured data...
Extraction successful!
Extracted Data:
[
{
"page_title": "Herman Melville - Moby-Dick",
"main_heading": "Moby Dick"
}
]
```
*(Note: The actual output is a list containing one dictionary because `baseSelector: "body"` matches one element, and we extract fields relative to that.)*
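The schema format supports more than plain text fields. Based on the `"attribute"` handling shown in the simplified code glimpse later in this chapter, a hedged sketch of a schema that also captures an attribute value (here, the `href` of the first link) might look like this; the extra field type is an assumption to verify against your version:
```python
# Sketch: a schema mixing a text field and an attribute field.
# The "attribute" type follows the simplified code glimpse later in this
# chapter; treat it as an assumption until verified.
extraction_schema = {
    "baseSelector": "body",
    "fields": [
        {"name": "main_heading", "selector": "h1", "type": "text"},
        {
            "name": "first_link_href",
            "selector": "a",
            "type": "attribute",
            "attribute": "href",  # which HTML attribute to read
        },
    ],
}
```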
### Example 2: Extracting Data with `LLMExtractionStrategy` (Conceptual)
Now, let's imagine we want the same information (title, heading) but using an AI. We'll provide a schema describing what we want. (Note: This requires setting up LLM access separately, e.g., API keys).
```python
# chapter6_example_2.py
import asyncio
import json
from crawl4ai import (
AsyncWebCrawler,
CrawlerRunConfig,
LLMExtractionStrategy, # Import the LLM strategy
LlmConfig # Import LLM configuration helper
)
# Assume llm_config is properly configured with provider, API key, etc.
# This is just a placeholder - replace with your actual LLM setup
# E.g., llm_config = LlmConfig(provider="openai", api_token="env:OPENAI_API_KEY")
class MockLlmConfig: provider="mock"; api_token="mock"; base_url=None
llm_config = MockLlmConfig()
async def main():
# 1. Define the desired output schema (what fields we want)
# This helps guide the LLM.
output_schema = {
"page_title": "string",
"main_heading": "string"
}
print("Extraction Schema defined for LLM.")
# 2. Create an instance of the LLM strategy
# We pass the schema and the LLM configuration.
# We also specify input_format='markdown' (common for LLMs).
llm_extractor = LLMExtractionStrategy(
schema=output_schema,
llmConfig=llm_config, # Pass the LLM provider details
input_format="markdown" # Tell it to read the Markdown content
)
print(f"Using strategy: {llm_extractor.__class__.__name__}")
print(f"LLM Provider (mocked): {llm_config.provider}")
# 3. Create CrawlerRunConfig with the strategy
run_config = CrawlerRunConfig(
extraction_strategy=llm_extractor
)
# 4. Run the crawl
async with AsyncWebCrawler() as crawler:
url_to_crawl = "https://httpbin.org/html"
print(f"\nCrawling {url_to_crawl} using LLM to extract...")
# This would make calls to the configured LLM API
result = await crawler.arun(url=url_to_crawl, config=run_config)
if result.success and result.extracted_content:
print("\nExtraction successful (using LLM)!")
# Extracted data is a JSON string
try:
extracted_data = json.loads(result.extracted_content)
print("Extracted Data:")
print(json.dumps(extracted_data, indent=2))
except json.JSONDecodeError:
print("Could not parse LLM output as JSON:")
print(result.extracted_content)
elif result.success:
print("\nCrawl successful, but no structured data extracted by LLM.")
# This might happen if the mock LLM doesn't return valid JSON
# or if the content was too small/irrelevant for extraction.
else:
print(f"\nCrawl failed: {result.error_message}")
if __name__ == "__main__":
asyncio.run(main())
```
**Explanation:**
1. **Schema Definition:** We define a simple dictionary `output_schema` telling the LLM we want fields named "page_title" and "main_heading", both expected to be strings.
2. **Instantiate Strategy:** We create `LLMExtractionStrategy`, passing:
* `schema=output_schema`: Our desired output structure.
* `llmConfig=llm_config`: The configuration telling the strategy *which* LLM to use and how to authenticate (here, it's mocked).
* `input_format="markdown"`: Instructs the strategy to feed the generated Markdown content (from `result.markdown.raw_markdown`) to the LLM, which is often easier for LLMs to parse than raw HTML.
3. **Configure Run & Crawl:** Same as before, we set the `extraction_strategy` in `CrawlerRunConfig` and run the crawl.
4. **Result:** The `AsyncWebCrawler` calls the `llm_extractor`. The strategy sends the Markdown content and the schema instructions to the configured LLM. The LLM analyzes the text and (hopefully) returns a JSON object matching the schema. This JSON is stored as a string in `result.extracted_content`.
**Expected Output (Conceptual, with a real LLM):**
```
Extraction Schema defined for LLM.
Using strategy: LLMExtractionStrategy
LLM Provider (mocked): mock
Crawling https://httpbin.org/html using LLM to extract...
Extraction successful (using LLM)!
Extracted Data:
[
{
"page_title": "Herman Melville - Moby-Dick",
"main_heading": "Moby Dick"
}
]
```
*(Note: LLM output format might vary slightly, but it aims to match the requested schema based on the content it reads.)*
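If you prefer describing the desired fields with a Pydantic model rather than a hand-written dictionary, one option is to convert the model into a JSON schema dict and pass that instead. This is a sketch under the assumption that `LLMExtractionStrategy` accepts any dict-shaped schema, as in the example above:
```python
# Sketch: deriving the schema dict from a Pydantic model (Pydantic v2 API).
# Assumes the strategy accepts any dict-shaped schema, as in the example above.
from pydantic import BaseModel


class PageInfo(BaseModel):
    page_title: str
    main_heading: str


output_schema = PageInfo.model_json_schema()  # a plain dict describing the fields
# llm_extractor = LLMExtractionStrategy(
#     schema=output_schema, llmConfig=llm_config, input_format="markdown"
# )
```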
## How It Works Inside (Under the Hood)
When you provide an `extraction_strategy` in the `CrawlerRunConfig`, how does `AsyncWebCrawler` use it?
1. **Fetch & Scrape:** The crawler fetches the raw HTML ([AsyncCrawlerStrategy](01_asynccrawlerstrategy.md)) and performs initial cleaning/scraping ([ContentScrapingStrategy](04_contentscrapingstrategy.md)) to get `cleaned_html`, links, etc.
2. **Markdown Generation:** It usually generates Markdown representation ([DefaultMarkdownGenerator](05_relevantcontentfilter.md#how-relevantcontentfilter-is-used-via-markdown-generation)).
3. **Check for Strategy:** The `AsyncWebCrawler` (specifically in its internal `aprocess_html` method) checks if `config.extraction_strategy` is set.
4. **Execute Strategy:** If a strategy exists:
* It determines the required input format (e.g., "html" for `JsonCssExtractionStrategy`, "markdown" for `LLMExtractionStrategy` based on its `input_format` attribute).
* It retrieves the corresponding content (e.g., `result.cleaned_html` or `result.markdown.raw_markdown`).
* If the content is long and the strategy supports chunking (like `LLMExtractionStrategy`), it might first split the content into smaller chunks.
* It calls the strategy's `run` method, passing the content chunk(s).
* The strategy performs its logic (applying selectors, calling LLM API).
* The strategy returns the extracted data (typically as a list of dictionaries).
5. **Store Result:** The `AsyncWebCrawler` converts the returned structured data into a JSON string and stores it in `CrawlResult.extracted_content`.
Here's a simplified view:
```mermaid
sequenceDiagram
participant User
participant AWC as AsyncWebCrawler
participant Config as CrawlerRunConfig
participant Processor as HTML Processing
participant Extractor as ExtractionStrategy
participant Result as CrawlResult
User->>AWC: arun(url, config=my_config)
Note over AWC: Config includes an Extraction Strategy
AWC->>Processor: Process HTML (scrape, generate markdown)
Processor-->>AWC: Processed Content (HTML, Markdown)
AWC->>Extractor: Run extraction on content (using Strategy's input format)
Note over Extractor: Applying logic (CSS, XPath, LLM...)
Extractor-->>AWC: Structured Data (List[Dict])
AWC->>AWC: Convert data to JSON String
AWC->>Result: Store JSON String in extracted_content
AWC-->>User: Return CrawlResult
```
### Code Glimpse (`extraction_strategy.py`)
Inside the `crawl4ai` library, the file `extraction_strategy.py` defines the blueprint and the implementations.
**The Blueprint (Abstract Base Class):**
```python
# Simplified from crawl4ai/extraction_strategy.py
from abc import ABC, abstractmethod
from typing import List, Dict, Any
class ExtractionStrategy(ABC):
"""Abstract base class for all extraction strategies."""
def __init__(self, input_format: str = "markdown", **kwargs):
self.input_format = input_format # e.g., 'html', 'markdown'
# ... other common init ...
@abstractmethod
def extract(self, url: str, content_chunk: str, *q, **kwargs) -> List[Dict[str, Any]]:
"""Extract structured data from a single chunk of content."""
pass
def run(self, url: str, sections: List[str], *q, **kwargs) -> List[Dict[str, Any]]:
"""Process content sections (potentially chunked) and call extract."""
# Default implementation might process sections in parallel or sequentially
all_extracted_data = []
for section in sections:
all_extracted_data.extend(self.extract(url, section, **kwargs))
return all_extracted_data
```
**Example Implementation (`JsonCssExtractionStrategy`):**
```python
# Simplified from crawl4ai/extraction_strategy.py
from bs4 import BeautifulSoup # Uses BeautifulSoup for CSS selectors
class JsonCssExtractionStrategy(ExtractionStrategy):
def __init__(self, schema: Dict[str, Any], **kwargs):
# Force input format to HTML for CSS selectors
super().__init__(input_format="html", **kwargs)
self.schema = schema # Store the user-defined schema
def extract(self, url: str, html_content: str, *q, **kwargs) -> List[Dict[str, Any]]:
# Parse the HTML content chunk
soup = BeautifulSoup(html_content, "html.parser")
extracted_items = []
# Find base elements defined in the schema
base_elements = soup.select(self.schema.get("baseSelector", "body"))
for element in base_elements:
item = {}
# Extract fields based on schema selectors and types
fields_to_extract = self.schema.get("fields", [])
for field_def in fields_to_extract:
try:
# Find the specific sub-element using CSS selector
target_element = element.select_one(field_def["selector"])
if target_element:
if field_def["type"] == "text":
item[field_def["name"]] = target_element.get_text(strip=True)
elif field_def["type"] == "attribute":
item[field_def["name"]] = target_element.get(field_def["attribute"])
# ... other types like 'html', 'list', 'nested' ...
except Exception as e:
# Handle errors, maybe log them if verbose
pass
if item:
extracted_items.append(item)
return extracted_items
# run() method likely uses the default implementation from base class
```
**Example Implementation (`LLMExtractionStrategy`):**
```python
# Simplified from crawl4ai/extraction_strategy.py
# Needs imports for LLM interaction (e.g., perform_completion_with_backoff)
from .utils import perform_completion_with_backoff, chunk_documents, escape_json_string
from .prompts import PROMPT_EXTRACT_SCHEMA_WITH_INSTRUCTION # Example prompt
class LLMExtractionStrategy(ExtractionStrategy):
def __init__(self, schema: Dict = None, instruction: str = None, llmConfig=None, input_format="markdown", **kwargs):
super().__init__(input_format=input_format, **kwargs)
self.schema = schema
self.instruction = instruction
self.llmConfig = llmConfig # Contains provider, API key, etc.
# ... other LLM specific setup ...
def extract(self, url: str, content_chunk: str, *q, **kwargs) -> List[Dict[str, Any]]:
# Prepare the prompt for the LLM
prompt = self._build_llm_prompt(url, content_chunk)
# Call the LLM API
response = perform_completion_with_backoff(
provider=self.llmConfig.provider,
prompt_with_variables=prompt,
api_token=self.llmConfig.api_token,
base_url=self.llmConfig.base_url,
json_response=True # Often expect JSON from LLM for extraction
# ... pass other necessary args ...
)
# Parse the LLM's response (which should ideally be JSON)
try:
extracted_data = json.loads(response.choices[0].message.content)
# Ensure it's a list
if isinstance(extracted_data, dict):
extracted_data = [extracted_data]
return extracted_data
except Exception as e:
# Handle LLM response parsing errors
print(f"Error parsing LLM response: {e}")
return [{"error": "Failed to parse LLM output", "raw_output": response.choices[0].message.content}]
def _build_llm_prompt(self, url: str, content_chunk: str) -> str:
# Logic to construct the prompt using self.schema or self.instruction
# and the content_chunk. Example:
prompt_template = PROMPT_EXTRACT_SCHEMA_WITH_INSTRUCTION # Choose appropriate prompt
variable_values = {
"URL": url,
"CONTENT": escape_json_string(content_chunk), # Send Markdown or HTML chunk
"SCHEMA": json.dumps(self.schema) if self.schema else "{}",
"REQUEST": self.instruction if self.instruction else "Extract relevant data based on the schema."
}
prompt = prompt_template
for var, val in variable_values.items():
prompt = prompt.replace("{" + var + "}", str(val))
return prompt
# run() method might override the base to handle chunking specifically for LLMs
def run(self, url: str, sections: List[str], *q, **kwargs) -> List[Dict[str, Any]]:
# Potentially chunk sections based on token limits before calling extract
# chunked_content = chunk_documents(sections, ...)
# extracted_data = []
# for chunk in chunked_content:
# extracted_data.extend(self.extract(url, chunk, **kwargs))
# return extracted_data
# Simplified for now:
return super().run(url, sections, *q, **kwargs)
```
## Conclusion
You've learned about `ExtractionStrategy`, Crawl4AI's way of giving instructions to an "Analyst" to pull out specific, structured data from web content.
* It solves the problem of needing precise data points (like product names, prices) in an organized format, not just blocks of text.
* You can choose your "Analyst":
* **Precise Locators (`JsonCssExtractionStrategy`, `JsonXPathExtractionStrategy`):** Use exact CSS/XPath selectors defined in a schema. Fast but brittle.
* **Smart Interpreter (`LLMExtractionStrategy`):** Uses an AI (LLM) guided by a schema or instructions. More flexible but slower and needs setup.
* You configure the desired strategy within the [CrawlerRunConfig](03_crawlerrunconfig.md).
* The extracted structured data is returned as a JSON string in the `CrawlResult.extracted_content` field.
Now that we understand how to fetch, clean, filter, and extract data, let's put it all together and look at the final package that Crawl4AI delivers after a crawl.
**Next:** Let's dive into the details of the output with [Chapter 7: Understanding the Results - CrawlResult](07_crawlresult.md).
---
Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge)

# Chapter 7: Understanding the Results - CrawlResult
In the previous chapter, [Chapter 6: Getting Specific Data - ExtractionStrategy](06_extractionstrategy.md), we learned how to teach Crawl4AI to act like an analyst, extracting specific, structured data points from a webpage using an `ExtractionStrategy`. We've seen how Crawl4AI can fetch pages, clean them, filter them, and even extract precise information.
But after all that work, where does all the gathered information go? When you ask the `AsyncWebCrawler` to crawl a URL using `arun()`, what do you actually get back?
## What Problem Does `CrawlResult` Solve?
Imagine you sent a research assistant to the library (a website) with a set of instructions: "Find this book (URL), make a clean copy of the relevant chapter (clean HTML/Markdown), list all the cited references (links), take photos of the illustrations (media), find the author and publication date (metadata), and maybe extract specific quotes (structured data)."
When the assistant returns, they wouldn't just hand you a single piece of paper. They'd likely give you a folder containing everything you asked for: the clean copy, the list of references, the photos, the metadata notes, and the extracted quotes, all neatly organized. They might also include a note if they encountered any problems (errors).
`CrawlResult` is exactly this **final report folder** or **delivery package**. It's a single object that neatly contains *all* the information Crawl4AI gathered and processed for a specific URL during a crawl operation. Instead of getting lots of separate pieces of data back, you get one convenient container.
## What is `CrawlResult`?
`CrawlResult` is a Python object (specifically, a Pydantic model, which is like a super-powered dictionary) that acts as a data container. It holds the results of a single crawl task performed by `AsyncWebCrawler.arun()` or one of the results from `arun_many()`.
Think of it as a toolbox filled with different tools and information related to the crawled page.
**Key Information Stored in `CrawlResult`:**
* **`url` (string):** The original URL that was requested.
* **`success` (boolean):** Did the crawl complete without critical errors? `True` if successful, `False` otherwise. **Always check this first!**
* **`html` (string):** The raw, original HTML source code fetched from the page.
* **`cleaned_html` (string):** The HTML after initial cleaning by the [ContentScrapingStrategy](04_contentscrapingstrategy.md) (e.g., scripts, styles removed).
* **`markdown` (object):** An object containing different Markdown representations of the content.
* `markdown.raw_markdown`: Basic Markdown generated from `cleaned_html`.
* `markdown.fit_markdown`: Markdown generated *only* from content deemed relevant by a [RelevantContentFilter](05_relevantcontentfilter.md) (if one was used). Might be empty if no filter was applied.
* *(Other fields like `markdown_with_citations` might exist)*
* **`extracted_content` (string):** If you used an [ExtractionStrategy](06_extractionstrategy.md), this holds the extracted structured data, usually formatted as a JSON string. `None` if no extraction was performed or nothing was found.
* **`metadata` (dictionary):** Information extracted from the page's metadata tags, like the page title (`metadata['title']`), description, keywords, etc.
* **`links` (object):** Contains lists of links found on the page.
* `links.internal`: List of links pointing to the same website.
* `links.external`: List of links pointing to other websites.
* **`media` (object):** Contains lists of media items found.
* `media.images`: List of images (`<img>` tags).
* `media.videos`: List of videos (`<video>` tags).
* *(Other media types might be included)*
* **`screenshot` (string):** If you requested a screenshot (`screenshot=True` in `CrawlerRunConfig`), this holds the file path to the saved image. `None` otherwise.
* **`pdf` (bytes):** If you requested a PDF (`pdf=True` in `CrawlerRunConfig`), this holds the PDF data as bytes. `None` otherwise. (Note: in earlier versions this may have been a file path; it is now typically raw bytes.)
* **`error_message` (string):** If `success` is `False`, this field usually contains details about what went wrong.
* **`status_code` (integer):** The HTTP status code received from the server (e.g., 200 for OK, 404 for Not Found).
* **`response_headers` (dictionary):** The HTTP response headers sent by the server.
* **`redirected_url` (string):** If the original URL redirected, this shows the final URL the crawler landed on.
## Accessing the `CrawlResult`
You get a `CrawlResult` object back every time you `await` a call to `crawler.arun()`:
```python
# chapter7_example_1.py
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlResult  # CrawlResult imported for the type hint below
async def main():
async with AsyncWebCrawler() as crawler:
url = "https://httpbin.org/html"
print(f"Crawling {url}...")
# The 'arun' method returns a CrawlResult object
result: CrawlResult = await crawler.arun(url=url) # Type hint optional
print("Crawl finished!")
# Now 'result' holds all the information
print(f"Result object type: {type(result)}")
if __name__ == "__main__":
asyncio.run(main())
```
**Explanation:**
1. We call `crawler.arun(url=url)`.
2. The `await` keyword pauses execution until the crawl is complete.
3. The value returned by `arun` is assigned to the `result` variable.
4. This `result` variable is our `CrawlResult` object.
If you use `crawler.arun_many()`, it returns a list where each item is a `CrawlResult` object for one of the requested URLs (or an async generator if `stream=True`).
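Here is a quick sketch of handling several results at once in non-streaming mode; the `urls` parameter name is an assumption about the `arun_many` signature:
```python
# Sketch: crawling several URLs and checking each CrawlResult.
# Assumes arun_many(urls=..., ...) returns a list when streaming is not enabled.
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    urls = ["https://httpbin.org/html", "https://httpbin.org/links/10/0"]
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many(urls=urls)
        for result in results:
            if result.success:
                print(f"OK  {result.url} ({len(result.html)} bytes of HTML)")
            else:
                print(f"ERR {result.url}: {result.error_message}")

if __name__ == "__main__":
    asyncio.run(main())
```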
## Exploring the Attributes: Using the Toolbox
Once you have the `result` object, you can access its attributes using dot notation (e.g., `result.success`, `result.markdown`).
**1. Checking for Success (Most Important!)**
Before you try to use any data, always check if the crawl was successful:
```python
# chapter7_example_2.py
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlResult # Import CrawlResult for type hint
async def main():
async with AsyncWebCrawler() as crawler:
url = "https://httpbin.org/html" # A working URL
# url = "https://httpbin.org/status/404" # Try this URL to see failure
result: CrawlResult = await crawler.arun(url=url)
# --- ALWAYS CHECK 'success' FIRST! ---
if result.success:
print(f"✅ Successfully crawled: {result.url}")
# Now it's safe to access other attributes
print(f" Page Title: {result.metadata.get('title', 'N/A')}")
else:
print(f"❌ Failed to crawl: {result.url}")
print(f" Error: {result.error_message}")
print(f" Status Code: {result.status_code}")
if __name__ == "__main__":
asyncio.run(main())
```
**Explanation:**
* We use an `if result.success:` block.
* If `True`, we proceed to access other data like `result.metadata`.
* If `False`, we print the `result.error_message` and `result.status_code` to understand why it failed.
**2. Accessing Content (HTML, Markdown)**
```python
# chapter7_example_3.py
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlResult
async def main():
async with AsyncWebCrawler() as crawler:
url = "https://httpbin.org/html"
result: CrawlResult = await crawler.arun(url=url)
if result.success:
print("--- Content ---")
# Print the first 150 chars of raw HTML
print(f"Raw HTML snippet: {result.html[:150]}...")
# Access the raw markdown
if result.markdown: # Check if markdown object exists
print(f"Markdown snippet: {result.markdown.raw_markdown[:150]}...")
else:
print("Markdown not generated.")
else:
print(f"Crawl failed: {result.error_message}")
if __name__ == "__main__":
asyncio.run(main())
```
**Explanation:**
* We access `result.html` for the original HTML.
* We access `result.markdown.raw_markdown` for the main Markdown content. Note the two dots: `result.markdown` gives the `MarkdownGenerationResult` object, and `.raw_markdown` accesses the specific string within it. We also check `if result.markdown:` first, just in case markdown generation failed for some reason.
**3. Getting Metadata, Links, and Media**
```python
# chapter7_example_4.py
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlResult
async def main():
async with AsyncWebCrawler() as crawler:
url = "https://httpbin.org/links/10/0" # A page with links
result: CrawlResult = await crawler.arun(url=url)
if result.success:
print("--- Metadata & Links ---")
print(f"Title: {result.metadata.get('title', 'N/A')}")
print(f"Found {len(result.links.internal)} internal links.")
print(f"Found {len(result.links.external)} external links.")
if result.links.internal:
print(f" First internal link text: '{result.links.internal[0].text}'")
# Similarly access result.media.images etc.
else:
print(f"Crawl failed: {result.error_message}")
if __name__ == "__main__":
asyncio.run(main())
```
**Explanation:**
* `result.metadata` is a dictionary; use `.get()` for safe access.
* `result.links` and `result.media` are objects containing lists (`internal`, `external`, `images`, etc.). We can check their lengths (`len()`) and access individual items by index (e.g., `[0]`).
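The media lists work the same way. Per the simplified `models.py` glimpse later in this chapter, each entry is a dictionary, so a hedged sketch for peeking at the first few images might look like this (the `src` and `alt` keys are assumptions to verify against your own results):
```python
# Sketch: inspecting image entries found on a crawled page.
# Each entry is assumed to be a dict; the "src" and "alt" keys are assumptions.
from crawl4ai import CrawlResult

def print_first_images(result: CrawlResult, limit: int = 3) -> None:
    """Print the source and alt text of the first few images."""
    for image in result.media.images[:limit]:
        print(f"Image source: {image.get('src')}, alt text: {image.get('alt')}")
```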
**4. Checking for Extracted Data, Screenshots, PDFs**
```python
# chapter7_example_5.py
import asyncio
import json
from crawl4ai import (
AsyncWebCrawler, CrawlResult, CrawlerRunConfig,
JsonCssExtractionStrategy # Example extractor
)
async def main():
# Define a simple extraction strategy (from Chapter 6)
schema = {"baseSelector": "body", "fields": [{"name": "heading", "selector": "h1", "type": "text"}]}
extractor = JsonCssExtractionStrategy(schema=schema)
# Configure the run to extract and take a screenshot
config = CrawlerRunConfig(
extraction_strategy=extractor,
screenshot=True
)
async with AsyncWebCrawler() as crawler:
url = "https://httpbin.org/html"
result: CrawlResult = await crawler.arun(url=url, config=config)
if result.success:
print("--- Extracted Data & Media ---")
# Check if structured data was extracted
if result.extracted_content:
print("Extracted Data found:")
data = json.loads(result.extracted_content) # Parse the JSON string
print(json.dumps(data, indent=2))
else:
print("No structured data extracted.")
# Check if a screenshot was taken
if result.screenshot:
print(f"Screenshot saved to: {result.screenshot}")
else:
print("Screenshot not taken.")
# Check for PDF (would be bytes if requested and successful)
if result.pdf:
print(f"PDF data captured ({len(result.pdf)} bytes).")
else:
print("PDF not generated.")
else:
print(f"Crawl failed: {result.error_message}")
if __name__ == "__main__":
asyncio.run(main())
```
**Explanation:**
* We check if `result.extracted_content` is not `None` or empty before trying to parse it as JSON.
* We check if `result.screenshot` is not `None` to see if the file path exists.
* We check if `result.pdf` is not `None` to see if the PDF data (bytes) was captured.
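Since `result.pdf` holds raw bytes and `result.screenshot` holds a file path, persisting or reusing them takes only a couple of lines. A small sketch (the output file name is an arbitrary choice):
```python
# Sketch: persisting captured artifacts from a CrawlResult.
from crawl4ai import CrawlResult

def save_artifacts(result: CrawlResult, pdf_path: str = "page.pdf") -> None:
    if result.pdf:
        with open(pdf_path, "wb") as f:
            f.write(result.pdf)            # result.pdf holds raw PDF bytes
        print(f"Saved PDF to {pdf_path}")
    if result.screenshot:
        print(f"Screenshot already saved at: {result.screenshot}")
```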
## How is `CrawlResult` Created? (Under the Hood)
You don't interact with the `CrawlResult` constructor directly. The `AsyncWebCrawler` creates it for you at the very end of the `arun` process, typically inside its internal `aprocess_html` method (or just before returning if fetching from cache).
Here's a simplified sequence:
1. **Fetch:** `AsyncWebCrawler` calls the [AsyncCrawlerStrategy](01_asynccrawlerstrategy.md) to get the raw `html`, `status_code`, `response_headers`, etc.
2. **Scrape:** It passes the `html` to the [ContentScrapingStrategy](04_contentscrapingstrategy.md) to get `cleaned_html`, `links`, `media`, `metadata`.
3. **Markdown:** It generates Markdown using the configured generator, possibly involving a [RelevantContentFilter](05_relevantcontentfilter.md), resulting in a `MarkdownGenerationResult` object.
4. **Extract (Optional):** If an [ExtractionStrategy](06_extractionstrategy.md) is configured, it runs it on the appropriate content (HTML or Markdown) to get `extracted_content`.
5. **Screenshot/PDF (Optional):** If requested, the fetching strategy captures the `screenshot` path or `pdf` data.
6. **Package:** `AsyncWebCrawler` gathers all these pieces (`url`, `html`, `cleaned_html`, the markdown object, `links`, `media`, `metadata`, `extracted_content`, `screenshot`, `pdf`, `success` status, `error_message`, etc.).
7. **Instantiate:** It creates the `CrawlResult` object, passing all the gathered data into its constructor.
8. **Return:** It returns this fully populated `CrawlResult` object to your code.
## Code Glimpse (`models.py`)
The `CrawlResult` is defined in the `crawl4ai/models.py` file. It uses Pydantic, a library that helps define data structures with type hints and validation. Here's a simplified view:
```python
# Simplified from crawl4ai/models.py
from pydantic import BaseModel, HttpUrl
from typing import List, Dict, Optional, Any
# Other related models (simplified)
class MarkdownGenerationResult(BaseModel):
raw_markdown: str
fit_markdown: Optional[str] = None
# ... other markdown fields ...
class Links(BaseModel):
internal: List[Dict] = []
external: List[Dict] = []
class Media(BaseModel):
images: List[Dict] = []
videos: List[Dict] = []
# The main CrawlResult model
class CrawlResult(BaseModel):
url: str
html: str
success: bool
cleaned_html: Optional[str] = None
media: Media = Media() # Use the Media model
links: Links = Links() # Use the Links model
screenshot: Optional[str] = None
pdf: Optional[bytes] = None
# Uses a private attribute and property for markdown for compatibility
_markdown: Optional[MarkdownGenerationResult] = None # Actual storage
extracted_content: Optional[str] = None # JSON string
metadata: Optional[Dict[str, Any]] = None
error_message: Optional[str] = None
status_code: Optional[int] = None
response_headers: Optional[Dict[str, str]] = None
redirected_url: Optional[str] = None
# ... other fields like session_id, ssl_certificate ...
# Custom property to access markdown data
@property
def markdown(self) -> Optional[MarkdownGenerationResult]:
return self._markdown
# Configuration for Pydantic
class Config:
arbitrary_types_allowed = True
# Custom init and model_dump might exist for backward compatibility handling
# ... (omitted for simplicity) ...
```
**Explanation:**
* It's defined as a `class CrawlResult(BaseModel):`.
* Each attribute (like `url`, `html`, `success`) is defined with a type hint (like `str`, `bool`, `Optional[str]`). `Optional[str]` means the field can be a string or `None`.
* Some attributes are themselves complex objects defined by other Pydantic models (like `media: Media`, `links: Links`).
* The `markdown` field uses a common pattern (property wrapping a private attribute) to provide the `MarkdownGenerationResult` object while maintaining some backward compatibility. You access it simply as `result.markdown`.
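Because `CrawlResult` is a Pydantic model, it can be serialized like any other model. Here is a hedged sketch; it assumes the standard Pydantic v2 `model_dump` method mentioned in the glimpse is available and that no extra compatibility handling gets in the way:
```python
# Sketch: dumping a CrawlResult to JSON for storage or debugging.
# Assumes the standard Pydantic v2 model_dump() mentioned in the glimpse above.
import json

def save_result(result, path: str = "crawl_result.json") -> None:
    data = result.model_dump(exclude={"html"})  # drop raw HTML to keep the file small
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2, default=str)  # default=str handles non-JSON types
```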
## Conclusion
You've now met the `CrawlResult` object the final, comprehensive report delivered by Crawl4AI after processing a URL.
* It acts as a **container** holding all gathered information (HTML, Markdown, metadata, links, media, extracted data, errors, etc.).
* It's the **return value** of `AsyncWebCrawler.arun()` and `arun_many()`.
* The most crucial attribute is **`success` (boolean)**, which you should always check first.
* You can easily **access** all the different pieces of information using dot notation (e.g., `result.metadata['title']`, `result.markdown.raw_markdown`, `result.links.external`).
Understanding the `CrawlResult` is key to effectively using the information Crawl4AI provides.
So far, we've focused on crawling single pages or lists of specific URLs. But what if you want to start at one page and automatically discover and crawl linked pages, exploring a website more deeply?
**Next:** Let's explore how to perform multi-page crawls with [Chapter 8: Exploring Websites - DeepCrawlStrategy](08_deepcrawlstrategy.md).
---
Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge)

# Chapter 8: Exploring Websites - DeepCrawlStrategy
In [Chapter 7: Understanding the Results - CrawlResult](07_crawlresult.md), we saw the final report (`CrawlResult`) that Crawl4AI gives us after processing a single URL. This report contains cleaned content, links, metadata, and maybe even extracted data.
But what if you want to explore a website *beyond* just the first page? Imagine you land on a blog's homepage. You don't just want the homepage content; you want to automatically discover and crawl all the individual blog posts linked from it. How can you tell Crawl4AI to act like an explorer, following links and venturing deeper into the website?
## What Problem Does `DeepCrawlStrategy` Solve?
Think of the `AsyncWebCrawler.arun()` method we've used so far like visiting just the entrance hall of a vast library. You get information about that specific hall, but you don't automatically explore the adjoining rooms or different floors.
What if you want to systematically explore the library? You need a plan:
* Do you explore room by room on the current floor before going upstairs? (Level by level)
* Do you pick one wing and explore all its rooms down to the very end before exploring another wing? (Go deep first)
* Do you have a map highlighting potentially interesting sections and prioritize visiting those first? (Prioritize promising paths)
`DeepCrawlStrategy` provides this **exploration plan**. It defines the logic for how Crawl4AI should discover and crawl new URLs starting from the initial one(s) by following the links it finds on each page. It turns the crawler from a single-page visitor into a website explorer.
## What is `DeepCrawlStrategy`?
`DeepCrawlStrategy` is a concept (a blueprint) in Crawl4AI that represents the **method or logic used to navigate and crawl multiple pages by following links**. It tells the crawler *which links* to follow and in *what order* to visit them.
It essentially takes over the process when you call `arun()` if a deep crawl is requested, managing a queue or list of URLs to visit and coordinating the crawling of those URLs, potentially up to a certain depth or number of pages.
## Different Exploration Plans: The Strategies
Crawl4AI provides several concrete exploration plans (implementations) for `DeepCrawlStrategy`:
1. **`BFSDeepCrawlStrategy` (Level-by-Level Explorer):**
* **Analogy:** Like ripples spreading in a pond.
* **How it works:** It first crawls the starting URL (Level 0). Then, it crawls all the valid links found on that page (Level 1). Then, it crawls all the valid links found on *those* pages (Level 2), and so on. It explores the website layer by layer.
* **Good for:** Finding the shortest path to all reachable pages, getting a broad overview quickly near the start page.
2. **`DFSDeepCrawlStrategy` (Deep Path Explorer):**
* **Analogy:** Like exploring one specific corridor in a maze all the way to the end before backtracking and trying another corridor.
* **How it works:** It starts at the initial URL, follows one link, then follows a link from *that* page, and continues going deeper down one path as far as possible (or until a specified depth limit). Only when it hits a dead end or the limit does it backtrack and try another path.
* **Good for:** Exploring specific branches of a website thoroughly, potentially reaching deeper pages faster than BFS (if the target is down a specific path).
3. **`BestFirstCrawlingStrategy` (Priority Explorer):**
* **Analogy:** Like using a treasure map where some paths are marked as more promising than others.
* **How it works:** This strategy uses a **scoring system**. It looks at all the discovered (but not yet visited) links and assigns a score to each one based on how "promising" it seems (e.g., does the URL contain relevant keywords? Is it from a trusted domain?). It then crawls the link with the *best* score first, regardless of its depth.
* **Good for:** Focusing the crawl on the most relevant or important pages first, especially useful when you can't crawl the entire site and need to prioritize.
**Guiding the Explorer: Filters and Scorers**
Deep crawl strategies often work together with:
* **Filters:** Rules that decide *if* a discovered link should even be considered for crawling. Examples:
* `DomainFilter`: Only follow links within the starting website's domain.
* `URLPatternFilter`: Only follow links matching a specific pattern (e.g., `/blog/posts/...`).
* `ContentTypeFilter`: Avoid following links to non-HTML content like PDFs or images.
* **Scorers:** (Used mainly by `BestFirstCrawlingStrategy`) Rules that assign a score to a potential link to help prioritize it. Examples:
* `KeywordRelevanceScorer`: Scores links higher if the URL contains certain keywords.
* `PathDepthScorer`: Might score links differently based on how deep they are.
These act like instructions for the explorer: "Only explore rooms on this floor (filter)," "Ignore corridors marked 'Staff Only' (filter)," or "Check rooms marked with a star first (scorer)."
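To make the pairing concrete, a priority-driven crawl might be wired up roughly as sketched below. Treat it as a sketch only: `BestFirstCrawlingStrategy`, `KeywordRelevanceScorer`, and `DomainFilter` are the names introduced above, but the constructor arguments (`keywords`, `url_scorer`, passing filters as a list) are assumptions modeled on the full BFS example in the next section, not a verified signature.

```python
# chapter8_sketch_bestfirst.py -- illustrative sketch only
import asyncio
from crawl4ai import (
    AsyncWebCrawler,
    CrawlerRunConfig,
    BestFirstCrawlingStrategy,   # the "Priority Explorer"
    KeywordRelevanceScorer,      # scorer; constructor arguments assumed
    DomainFilter,                # filter: stay on the starting domain
)

async def main():
    explorer = BestFirstCrawlingStrategy(
        max_depth=2,
        filter_chain=[DomainFilter()],
        url_scorer=KeywordRelevanceScorer(keywords=["blog", "tutorial"]),
    )
    config = CrawlerRunConfig(deep_crawl_strategy=explorer, stream=True)
    async with AsyncWebCrawler() as crawler:
        results_gen = await crawler.arun(url="https://example.com", config=config)
        async for result in results_gen:
            print("✅" if result.success else "❌", result.url)

if __name__ == "__main__":
    asyncio.run(main())
```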
## How to Use a `DeepCrawlStrategy`
You enable deep crawling by adding a `DeepCrawlStrategy` instance to your `CrawlerRunConfig`. Let's try exploring a website layer by layer using `BFSDeepCrawlStrategy`, going only one level deep from the start page.
```python
# chapter8_example_1.py
import asyncio
from crawl4ai import (
AsyncWebCrawler,
CrawlerRunConfig,
BFSDeepCrawlStrategy, # 1. Import the desired strategy
DomainFilter # Import a filter to stay on the same site
)
async def main():
# 2. Create an instance of the strategy
# - max_depth=1: Crawl start URL (depth 0) + links found (depth 1)
# - filter_chain: Use DomainFilter to only follow links on the same website
bfs_explorer = BFSDeepCrawlStrategy(
max_depth=1,
filter_chain=[DomainFilter()] # Stay within the initial domain
)
print(f"Strategy: BFS, Max Depth: {bfs_explorer.max_depth}")
# 3. Create CrawlerRunConfig and set the deep_crawl_strategy
# Also set stream=True to get results as they come in.
run_config = CrawlerRunConfig(
deep_crawl_strategy=bfs_explorer,
stream=True # Get results one by one using async for
)
# 4. Run the crawl - arun now handles the deep crawl!
async with AsyncWebCrawler() as crawler:
start_url = "https://httpbin.org/links/10/0" # A page with 10 internal links
print(f"\nStarting deep crawl from: {start_url}...")
crawl_results_generator = await crawler.arun(url=start_url, config=run_config)
crawled_count = 0
# Iterate over the results as they are yielded
async for result in crawl_results_generator:
crawled_count += 1
status = "✅" if result.success else "❌"
depth = result.metadata.get("depth", "N/A")
parent = result.metadata.get("parent_url", "Start")
url_short = result.url.split('/')[-1] # Show last part of URL
print(f" {status} Crawled: {url_short:<6} (Depth: {depth})")
print(f"\nFinished deep crawl. Total pages processed: {crawled_count}")
# Expecting 1 (start URL) + 10 (links) = 11 results
if __name__ == "__main__":
asyncio.run(main())
```
**Explanation:**
1. **Import:** We import `AsyncWebCrawler`, `CrawlerRunConfig`, `BFSDeepCrawlStrategy`, and `DomainFilter`.
2. **Instantiate Strategy:** We create `BFSDeepCrawlStrategy`.
* `max_depth=1`: We tell it to crawl the starting URL (depth 0) and any valid links it finds on that page (depth 1), but not to go any further.
* `filter_chain=[DomainFilter()]`: We provide a list containing `DomainFilter`. This tells the strategy to only consider following links that point to the same domain as the `start_url`. Links to external sites will be ignored.
3. **Configure Run:** We create a `CrawlerRunConfig` and pass our `bfs_explorer` instance to the `deep_crawl_strategy` parameter. We also set `stream=True` so we can process results as soon as they are ready, rather than waiting for the entire crawl to finish.
4. **Crawl:** We call `await crawler.arun(url=start_url, config=run_config)`. Because the config contains a `deep_crawl_strategy`, `arun` doesn't just crawl the single `start_url`. Instead, it activates the deep crawl logic defined by `BFSDeepCrawlStrategy`.
5. **Process Results:** Since we used `stream=True`, the return value is an asynchronous generator. We use `async for result in crawl_results_generator:` to loop through the `CrawlResult` objects as they are produced by the deep crawl. For each result, we print its status and depth.
You'll see the output showing the crawl starting, then processing the initial page (`links/10/0` at depth 0), followed by the 10 linked pages (e.g., `9`, `8`, ... `0` at depth 1).
## How It Works (Under the Hood)
How does simply putting a strategy in the config change `arun`'s behavior? It involves a bit of Python magic called a **decorator**.
1. **Decorator:** When you create an `AsyncWebCrawler`, its `arun` method is automatically wrapped by a `DeepCrawlDecorator`.
2. **Check Config:** When you call `await crawler.arun(url=..., config=...)`, this decorator checks if `config.deep_crawl_strategy` is set.
3. **Delegate or Run Original:**
* If a strategy **is set**, the decorator *doesn't* run the original single-page crawl logic. Instead, it calls the `arun` method of your chosen `DeepCrawlStrategy` instance (e.g., `bfs_explorer.arun(...)`), passing it the `crawler` itself, the `start_url`, and the `config`.
* If no strategy is set, the decorator simply calls the original `arun` logic to crawl the single page.
4. **Strategy Takes Over:** The `DeepCrawlStrategy`'s `arun` method now manages the crawl.
* It maintains a list or queue of URLs to visit (e.g., `current_level` in BFS, a stack in DFS, a priority queue in BestFirst).
* It repeatedly takes batches of URLs from its list/queue.
* For each batch, it calls `crawler.arun_many(urls=batch_urls, config=batch_config)` (with deep crawling disabled in `batch_config` to avoid infinite loops!).
* As results come back from `arun_many`, the strategy processes them:
* It yields the `CrawlResult` if running in stream mode.
* It extracts links using its `link_discovery` method.
* `link_discovery` uses `can_process_url` (which applies filters) to validate links.
* Valid new links are added to the list/queue for future crawling.
* This continues until the list/queue is empty, the max depth/pages limit is reached, or it's cancelled.
```mermaid
sequenceDiagram
participant User
participant Decorator as DeepCrawlDecorator
participant Strategy as DeepCrawlStrategy (e.g., BFS)
participant AWC as AsyncWebCrawler
User->>Decorator: arun(start_url, config_with_strategy)
Decorator->>Strategy: arun(start_url, crawler=AWC, config)
Note over Strategy: Initialize queue/level with start_url
loop Until Queue Empty or Limits Reached
Strategy->>Strategy: Get next batch of URLs from queue
Note over Strategy: Create batch_config (deep_crawl=None)
Strategy->>AWC: arun_many(batch_urls, config=batch_config)
AWC-->>Strategy: batch_results (List/Stream of CrawlResult)
loop For each result in batch_results
Strategy->>Strategy: Process result (yield if streaming)
Strategy->>Strategy: Discover links (apply filters)
Strategy->>Strategy: Add valid new links to queue
end
end
Strategy-->>Decorator: Final result (List or Generator)
Decorator-->>User: Final result
```
## Code Glimpse
Let's peek at the simplified structure:
**1. The Decorator (`deep_crawling/base_strategy.py`)**
```python
# Simplified from deep_crawling/base_strategy.py
from contextvars import ContextVar
from functools import wraps
# ... other imports
class DeepCrawlDecorator:
deep_crawl_active = ContextVar("deep_crawl_active", default=False)
def __init__(self, crawler: AsyncWebCrawler):
self.crawler = crawler
def __call__(self, original_arun):
@wraps(original_arun)
async def wrapped_arun(url: str, config: CrawlerRunConfig = None, **kwargs):
# Is a strategy present AND not already inside a deep crawl?
if config and config.deep_crawl_strategy and not self.deep_crawl_active.get():
# Mark that we are starting a deep crawl
token = self.deep_crawl_active.set(True)
try:
# Call the STRATEGY's arun method instead of the original
strategy_result = await config.deep_crawl_strategy.arun(
crawler=self.crawler,
start_url=url,
config=config
)
# Handle streaming if needed
if config.stream:
# Return an async generator that resets the context var on exit
async def result_wrapper():
try:
async for result in strategy_result: yield result
finally: self.deep_crawl_active.reset(token)
return result_wrapper()
else:
return strategy_result # Return the list of results directly
finally:
# Reset the context var if not streaming (or handled in wrapper)
if not config.stream: self.deep_crawl_active.reset(token)
else:
# No strategy or already deep crawling, call the original single-page arun
return await original_arun(url, config=config, **kwargs)
return wrapped_arun
```
**2. The Strategy Blueprint (`deep_crawling/base_strategy.py`)**
```python
# Simplified from deep_crawling/base_strategy.py
from abc import ABC, abstractmethod
# ... other imports
class DeepCrawlStrategy(ABC):
@abstractmethod
async def _arun_batch(self, start_url, crawler, config) -> List[CrawlResult]:
# Implementation for non-streaming mode
pass
@abstractmethod
async def _arun_stream(self, start_url, crawler, config) -> AsyncGenerator[CrawlResult, None]:
# Implementation for streaming mode
pass
async def arun(self, start_url, crawler, config) -> RunManyReturn:
# Decides whether to call _arun_batch or _arun_stream
if config.stream:
return self._arun_stream(start_url, crawler, config)
else:
return await self._arun_batch(start_url, crawler, config)
@abstractmethod
async def can_process_url(self, url: str, depth: int) -> bool:
# Applies filters to decide if a URL is valid to crawl
pass
@abstractmethod
async def link_discovery(self, result, source_url, current_depth, visited, next_level, depths):
# Extracts, validates, and prepares links for the next step
pass
@abstractmethod
async def shutdown(self):
# Cleanup logic
pass
```
**3. Example: BFS Implementation (`deep_crawling/bfs_strategy.py`)**
```python
# Simplified from deep_crawling/bfs_strategy.py
# ... imports ...
from .base_strategy import DeepCrawlStrategy # Import the base class
class BFSDeepCrawlStrategy(DeepCrawlStrategy):
def __init__(self, max_depth, filter_chain=None, url_scorer=None, ...):
self.max_depth = max_depth
self.filter_chain = filter_chain or FilterChain() # Use default if none
self.url_scorer = url_scorer
# ... other init ...
self._pages_crawled = 0
async def can_process_url(self, url: str, depth: int) -> bool:
# ... (validation logic using self.filter_chain) ...
is_valid = True # Placeholder
if depth != 0 and not await self.filter_chain.apply(url):
is_valid = False
return is_valid
async def link_discovery(self, result, source_url, current_depth, visited, next_level, depths):
# ... (logic to get links from result.links) ...
links = result.links.get("internal", []) # Example: only internal
for link_data in links:
url = link_data.get("href")
if url and url not in visited:
if await self.can_process_url(url, current_depth + 1):
# Check scoring, max_pages limit etc.
depths[url] = current_depth + 1
next_level.append((url, source_url)) # Add (url, parent) tuple
async def _arun_batch(self, start_url, crawler, config) -> List[CrawlResult]:
visited = set()
current_level = [(start_url, None)] # List of (url, parent_url)
depths = {start_url: 0}
all_results = []
while current_level: # While there are pages in the current level
next_level = []
urls_in_level = [url for url, parent in current_level]
visited.update(urls_in_level)
# Create config for this batch (no deep crawl recursion)
batch_config = config.clone(deep_crawl_strategy=None, stream=False)
# Crawl all URLs in the current level
batch_results = await crawler.arun_many(urls=urls_in_level, config=batch_config)
for result in batch_results:
# Add metadata (depth, parent)
depth = depths.get(result.url, 0)
result.metadata = result.metadata or {}
result.metadata["depth"] = depth
# ... find parent ...
all_results.append(result)
# Discover links for the *next* level
if result.success:
await self.link_discovery(result, result.url, depth, visited, next_level, depths)
current_level = next_level # Move to the next level
return all_results
async def _arun_stream(self, start_url, crawler, config) -> AsyncGenerator[CrawlResult, None]:
# Similar logic to _arun_batch, but uses 'yield result'
# and processes results as they come from arun_many stream
visited = set()
current_level = [(start_url, None)] # List of (url, parent_url)
depths = {start_url: 0}
while current_level:
next_level = []
urls_in_level = [url for url, parent in current_level]
visited.update(urls_in_level)
# Use stream=True for arun_many
batch_config = config.clone(deep_crawl_strategy=None, stream=True)
batch_results_gen = await crawler.arun_many(urls=urls_in_level, config=batch_config)
async for result in batch_results_gen:
# Add metadata
depth = depths.get(result.url, 0)
result.metadata = result.metadata or {}
result.metadata["depth"] = depth
# ... find parent ...
yield result # Yield result immediately
# Discover links for the next level
if result.success:
await self.link_discovery(result, result.url, depth, visited, next_level, depths)
current_level = next_level
# ... shutdown method ...
```
## Conclusion
You've learned about `DeepCrawlStrategy`, the component that turns Crawl4AI into a website explorer!
* It solves the problem of crawling beyond a single starting page by following links.
* It defines the **exploration plan**:
* `BFSDeepCrawlStrategy`: Level by level.
* `DFSDeepCrawlStrategy`: Deep paths first.
* `BestFirstCrawlingStrategy`: Prioritized by score.
* **Filters** and **Scorers** help guide the exploration.
* You enable it by setting `deep_crawl_strategy` in the `CrawlerRunConfig`.
* A decorator mechanism intercepts `arun` calls to activate the strategy.
* The strategy manages the queue of URLs and uses `crawler.arun_many` to crawl them in batches.
Deep crawling allows you to gather information from multiple related pages automatically. But how does Crawl4AI avoid re-fetching the same page over and over again, especially during these deeper crawls? The answer lies in caching.
**Next:** Let's explore how Crawl4AI smartly caches results with [Chapter 9: Smart Fetching with Caching - CacheContext / CacheMode](09_cachecontext___cachemode.md).
---
Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge)
@@ -0,0 +1,346 @@
# Chapter 9: Smart Fetching with Caching - CacheContext / CacheMode
In the previous chapter, [Chapter 8: Exploring Websites - DeepCrawlStrategy](08_deepcrawlstrategy.md), we saw how Crawl4AI can explore websites by following links, potentially visiting many pages. During such explorations, or even when you run the same crawl multiple times, the crawler might try to fetch the exact same webpage again and again. This can be slow and might unnecessarily put a load on the website you're crawling. Wouldn't it be smarter to remember the result from the first time and just reuse it?
## What Problem Does Caching Solve?
Imagine you need to download a large instruction manual (a webpage) from the internet.
* **Without Caching:** Every single time you need the manual, you download the entire file again. This takes time and uses bandwidth every time.
* **With Caching:** The first time you download it, you save a copy on your computer (the "cache"). The next time you need it, you first check your local copy. If it's there, you use it instantly! You only download it again if you specifically want the absolute latest version or if your local copy is missing.
Caching in Crawl4AI works the same way. It's a mechanism to **store the results** of crawling a webpage locally (in a database file). When asked to crawl a URL again, Crawl4AI can check its cache first. If a valid result is already stored, it can return that saved result almost instantly, saving time and resources.
## Introducing `CacheMode` and `CacheContext`
Crawl4AI uses two key concepts to manage this caching behavior:
1. **`CacheMode` (The Cache Policy):**
* Think of this like setting the rules for how you interact with your saved instruction manuals.
* It's an **instruction** you give the crawler for a specific run, telling it *how* to use the cache.
* **Analogy:** Should you *always* use your saved copy if you have one? (`ENABLED`) Should you *ignore* your saved copies and always download a fresh one? (`BYPASS`) Should you *never* save any copies? (`DISABLED`) Should you save new copies but never reuse old ones? (`WRITE_ONLY`)
* `CacheMode` lets you choose the caching behavior that best fits your needs for a particular task.
2. **`CacheContext` (The Decision Maker):**
* This is an internal helper that Crawl4AI uses *during* a crawl. You don't usually interact with it directly.
* It looks at the `CacheMode` you provided (the policy) and the type of URL being processed.
* **Analogy:** Imagine a librarian who checks the library's borrowing rules (`CacheMode`) and the type of item you're requesting (e.g., a reference book that can't be checked out, like `raw:` HTML which isn't cached). Based on these, the librarian (`CacheContext`) decides if you can borrow an existing copy (read from cache) or if a new copy should be added to the library (write to cache).
* It helps the main `AsyncWebCrawler` make the right decision about reading from or writing to the cache for each specific URL based on the active policy.
## Setting the Cache Policy: Using `CacheMode`
You control the caching behavior by setting the `cache_mode` parameter within the `CrawlerRunConfig` object that you pass to `crawler.arun()` or `crawler.arun_many()`.
Let's explore the most common `CacheMode` options:
**1. `CacheMode.ENABLED` (The Default Behavior - If not specified)**
* **Policy:** "Use the cache if a valid result exists. If not, fetch the page, save the result to the cache, and then return it."
* This is the standard, balanced approach. It saves time on repeated crawls but ensures you get the content eventually.
* *Note: In recent versions, the default if `cache_mode` is left completely unspecified might be `CacheMode.BYPASS`. Always check the documentation or explicitly set the mode for clarity.* For this tutorial, let's assume we explicitly set it.
```python
# chapter9_example_1.py
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
async def main():
url = "https://httpbin.org/html"
async with AsyncWebCrawler() as crawler:
# Explicitly set the mode to ENABLED
config_enabled = CrawlerRunConfig(cache_mode=CacheMode.ENABLED)
print(f"Running with CacheMode: {config_enabled.cache_mode.name}")
# First run: Fetches, caches, and returns result
print("First run (ENABLED)...")
result1 = await crawler.arun(url=url, config=config_enabled)
print(f"Got result 1? {'Yes' if result1.success else 'No'}")
# Second run: Finds result in cache and returns it instantly
print("Second run (ENABLED)...")
result2 = await crawler.arun(url=url, config=config_enabled)
print(f"Got result 2? {'Yes' if result2.success else 'No'}")
# This second run should be much faster!
if __name__ == "__main__":
asyncio.run(main())
```
**Explanation:**
* We create a `CrawlerRunConfig` with `cache_mode=CacheMode.ENABLED`.
* The first `arun` call fetches the page from the web and saves the result in the cache.
* The second `arun` call (same URL, with no config changes that would invalidate the cached entry) finds the saved result in the cache and returns it immediately, skipping the web fetch.
**2. `CacheMode.BYPASS`**
* **Policy:** "Ignore any existing saved copy. Always fetch a fresh copy from the web. After fetching, save this new result to the cache (overwriting any old one)."
* Useful when you *always* need the absolute latest version of the page, but you still want to update the cache for potential future use with `CacheMode.ENABLED`.
```python
# chapter9_example_2.py
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
import time
async def main():
url = "https://httpbin.org/html"
async with AsyncWebCrawler() as crawler:
# Set the mode to BYPASS
config_bypass = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
print(f"Running with CacheMode: {config_bypass.cache_mode.name}")
# First run: Fetches, caches, and returns result
print("First run (BYPASS)...")
start_time = time.perf_counter()
result1 = await crawler.arun(url=url, config=config_bypass)
duration1 = time.perf_counter() - start_time
print(f"Got result 1? {'Yes' if result1.success else 'No'} (took {duration1:.2f}s)")
# Second run: Ignores cache, fetches again, updates cache, returns result
print("Second run (BYPASS)...")
start_time = time.perf_counter()
result2 = await crawler.arun(url=url, config=config_bypass)
duration2 = time.perf_counter() - start_time
print(f"Got result 2? {'Yes' if result2.success else 'No'} (took {duration2:.2f}s)")
# Both runs should take a similar amount of time (fetching time)
if __name__ == "__main__":
asyncio.run(main())
```
**Explanation:**
* We set `cache_mode=CacheMode.BYPASS`.
* Both the first and second `arun` calls will fetch the page directly from the web, ignoring any previously cached result. They will still write the newly fetched result to the cache. Notice both runs take roughly the same amount of time (network fetch time).
**3. `CacheMode.DISABLED`**
* **Policy:** "Completely ignore the cache. Never read from it, never write to it."
* Useful when you don't want Crawl4AI to interact with the cache files at all, perhaps for debugging or if you have storage constraints.
```python
# chapter9_example_3.py
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
import time
async def main():
url = "https://httpbin.org/html"
async with AsyncWebCrawler() as crawler:
# Set the mode to DISABLED
config_disabled = CrawlerRunConfig(cache_mode=CacheMode.DISABLED)
print(f"Running with CacheMode: {config_disabled.cache_mode.name}")
# First run: Fetches, returns result (does NOT cache)
print("First run (DISABLED)...")
start_time = time.perf_counter()
result1 = await crawler.arun(url=url, config=config_disabled)
duration1 = time.perf_counter() - start_time
print(f"Got result 1? {'Yes' if result1.success else 'No'} (took {duration1:.2f}s)")
# Second run: Fetches again, returns result (does NOT cache)
print("Second run (DISABLED)...")
start_time = time.perf_counter()
result2 = await crawler.arun(url=url, config=config_disabled)
duration2 = time.perf_counter() - start_time
print(f"Got result 2? {'Yes' if result2.success else 'No'} (took {duration2:.2f}s)")
# Both runs fetch fresh, and nothing is ever saved to the cache.
if __name__ == "__main__":
asyncio.run(main())
```
**Explanation:**
* We set `cache_mode=CacheMode.DISABLED`.
* Both `arun` calls fetch fresh content from the web. Crucially, neither run reads from nor writes to the cache database.
**Other Modes (`READ_ONLY`, `WRITE_ONLY`):**
* `CacheMode.READ_ONLY`: Only uses existing cached results. If a result isn't in the cache, it will fail or return an empty result rather than fetching it. Never saves anything new.
* `CacheMode.WRITE_ONLY`: Never reads from the cache (always fetches fresh). It *only* writes the newly fetched result to the cache.
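These two modes can be combined into a "warm the cache now, read it back later" workflow. The sketch below is illustrative only; what `READ_ONLY` does on a cache miss (failing or returning an empty result) is as described above.

```python
# chapter9_sketch_read_write_only.py -- illustrative sketch only
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

async def main():
    url = "https://httpbin.org/html"
    async with AsyncWebCrawler() as crawler:
        # Pass 1: always fetch fresh, write the result to the cache, never read it.
        warm_config = CrawlerRunConfig(cache_mode=CacheMode.WRITE_ONLY)
        await crawler.arun(url=url, config=warm_config)

        # Pass 2: only use what is already cached; no new fetch, nothing new written.
        read_config = CrawlerRunConfig(cache_mode=CacheMode.READ_ONLY)
        result = await crawler.arun(url=url, config=read_config)
        print(f"Served from cache? {'Yes' if result and result.success else 'No'}")

if __name__ == "__main__":
    asyncio.run(main())
```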
## How Caching Works Internally
When you call `crawler.arun(url="...", config=...)`:
1. **Create Context:** The `AsyncWebCrawler` creates a `CacheContext` instance using the `url` and the `config.cache_mode`.
2. **Check Read:** It asks the `CacheContext`, "Should I read from the cache?" (`cache_context.should_read()`).
3. **Try Reading:** If `should_read()` is `True`, it asks the database manager ([`AsyncDatabaseManager`](async_database.py)) to look for a cached result for the `url`.
4. **Cache Hit?**
* If a valid cached result is found: The `AsyncWebCrawler` returns this cached `CrawlResult` immediately. Done!
* If no cached result is found (or if `should_read()` was `False`): Proceed to fetching.
5. **Fetch:** The `AsyncWebCrawler` calls the appropriate [AsyncCrawlerStrategy](01_asynccrawlerstrategy.md) to fetch the content from the web.
6. **Process:** It processes the fetched HTML (scraping, filtering, extracting) to create a new `CrawlResult`.
7. **Check Write:** It asks the `CacheContext`, "Should I write this result to the cache?" (`cache_context.should_write()`).
8. **Write Cache:** If `should_write()` is `True`, it tells the database manager to save the new `CrawlResult` into the cache database.
9. **Return:** The `AsyncWebCrawler` returns the newly created `CrawlResult`.
```mermaid
sequenceDiagram
participant User
participant AWC as AsyncWebCrawler
participant Ctx as CacheContext
participant DB as DatabaseManager
participant Fetcher as AsyncCrawlerStrategy
User->>AWC: arun(url, config)
AWC->>Ctx: Create CacheContext(url, config.cache_mode)
AWC->>Ctx: should_read()?
alt Cache Read Allowed
Ctx-->>AWC: Yes
AWC->>DB: aget_cached_url(url)
DB-->>AWC: Cached Result (or None)
alt Cache Hit & Valid
AWC-->>User: Return Cached CrawlResult
else Cache Miss or Invalid
AWC->>AWC: Proceed to Fetch
end
else Cache Read Not Allowed
Ctx-->>AWC: No
AWC->>AWC: Proceed to Fetch
end
Note over AWC: Fetching Required
AWC->>Fetcher: crawl(url, config)
Fetcher-->>AWC: Raw Response
AWC->>AWC: Process HTML -> New CrawlResult
AWC->>Ctx: should_write()?
alt Cache Write Allowed
Ctx-->>AWC: Yes
AWC->>DB: acache_url(New CrawlResult)
DB-->>AWC: OK
else Cache Write Not Allowed
Ctx-->>AWC: No
end
AWC-->>User: Return New CrawlResult
```
## Code Glimpse
Let's look at simplified code snippets.
**Inside `async_webcrawler.py` (where `arun` uses caching):**
```python
# Simplified from crawl4ai/async_webcrawler.py
from .cache_context import CacheContext, CacheMode
from .async_database import async_db_manager
from .models import CrawlResult
# ... other imports
class AsyncWebCrawler:
# ... (init, other methods) ...
async def arun(self, url: str, config: CrawlerRunConfig = None) -> CrawlResult:
# ... (ensure config exists, set defaults) ...
if config.cache_mode is None:
config.cache_mode = CacheMode.ENABLED # Example default
# 1. Create CacheContext
cache_context = CacheContext(url, config.cache_mode)
cached_result = None
# 2. Check if cache read is allowed
if cache_context.should_read():
# 3. Try reading from database
cached_result = await async_db_manager.aget_cached_url(url)
# 4. If cache hit and valid, return it
if cached_result and self._is_cache_valid(cached_result, config):
self.logger.info("Cache hit for: %s", url) # Example log
return cached_result # Return early
# 5. Fetch fresh content (if no cache hit or read disabled)
async_response = await self.crawler_strategy.crawl(url, config=config)
html = async_response.html # ... and other data ...
# 6. Process the HTML to get a new CrawlResult
crawl_result = await self.aprocess_html(
url=url, html=html, config=config, # ... other params ...
)
# 7. Check if cache write is allowed
if cache_context.should_write():
# 8. Write the new result to the database
await async_db_manager.acache_url(crawl_result)
# 9. Return the new result
return crawl_result
def _is_cache_valid(self, cached_result: CrawlResult, config: CrawlerRunConfig) -> bool:
# Internal logic to check if cached result meets current needs
# (e.g., was screenshot requested now but not cached?)
if config.screenshot and not cached_result.screenshot: return False
if config.pdf and not cached_result.pdf: return False
# ... other checks ...
return True
```
**Inside `cache_context.py` (defining the concepts):**
```python
# Simplified from crawl4ai/cache_context.py
from enum import Enum
class CacheMode(Enum):
"""Defines the caching behavior for web crawling operations."""
ENABLED = "enabled" # Read and Write
DISABLED = "disabled" # No Read, No Write
READ_ONLY = "read_only" # Read Only, No Write
WRITE_ONLY = "write_only" # Write Only, No Read
BYPASS = "bypass" # No Read, Write Only (similar to WRITE_ONLY but explicit intention)
class CacheContext:
"""Encapsulates cache-related decisions and URL handling."""
def __init__(self, url: str, cache_mode: CacheMode, always_bypass: bool = False):
self.url = url
self.cache_mode = cache_mode
self.always_bypass = always_bypass # Usually False
# Determine if URL type is cacheable (e.g., not 'raw:')
self.is_cacheable = url.startswith(("http://", "https://", "file://"))
# ... other URL type checks ...
def should_read(self) -> bool:
"""Determines if cache should be read based on context."""
if self.always_bypass or not self.is_cacheable:
return False
# Allow read if mode is ENABLED or READ_ONLY
return self.cache_mode in [CacheMode.ENABLED, CacheMode.READ_ONLY]
def should_write(self) -> bool:
"""Determines if cache should be written based on context."""
if self.always_bypass or not self.is_cacheable:
return False
# Allow write if mode is ENABLED, WRITE_ONLY, or BYPASS
return self.cache_mode in [CacheMode.ENABLED, CacheMode.WRITE_ONLY, CacheMode.BYPASS]
@property
def display_url(self) -> str:
"""Returns the URL in display format."""
return self.url if not self.url.startswith("raw:") else "Raw HTML"
# Helper for backward compatibility (may be removed later)
def _legacy_to_cache_mode(...) -> CacheMode:
# ... logic to convert old boolean flags ...
pass
```
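Because `CacheContext` is so small, you can reason about its decision table directly. Here is a tiny sketch exercising the simplified class above (the import path is assumed from the file name in the snippet header):

```python
# cache_context_decisions.py -- exercising the simplified CacheContext above
from crawl4ai.cache_context import CacheContext, CacheMode  # path per the snippet header

for mode in (CacheMode.ENABLED, CacheMode.BYPASS, CacheMode.READ_ONLY, CacheMode.DISABLED):
    ctx = CacheContext("https://example.com", mode)
    # ENABLED   -> read=True,  write=True
    # BYPASS    -> read=False, write=True
    # READ_ONLY -> read=True,  write=False
    # DISABLED  -> read=False, write=False
    print(f"{mode.name:<10} read={ctx.should_read()!s:<5} write={ctx.should_write()}")
```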
## Conclusion
You've learned how Crawl4AI uses caching to avoid redundant work and speed up repeated crawls!
* **Caching** stores results locally to reuse them later.
* **`CacheMode`** is the policy you set in `CrawlerRunConfig` to control *how* the cache is used (`ENABLED`, `BYPASS`, `DISABLED`, etc.).
* **`CacheContext`** is an internal helper that makes decisions based on the `CacheMode` and URL type.
* Using the cache effectively (especially `CacheMode.ENABLED`) can significantly speed up your crawling tasks, particularly during development or when dealing with many URLs, including deep crawls.
We've seen how Crawl4AI can crawl single pages, lists of pages (`arun_many`), and even explore websites (`DeepCrawlStrategy`). But how does `arun_many` or a deep crawl manage running potentially hundreds or thousands of individual crawl tasks efficiently without overwhelming your system or the target website?
**Next:** Let's explore the component responsible for managing concurrent tasks: [Chapter 10: Orchestrating the Crawl - BaseDispatcher](10_basedispatcher.md).
---
Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge)
@@ -0,0 +1,387 @@
# Chapter 10: Orchestrating the Crawl - BaseDispatcher
In [Chapter 9: Smart Fetching with Caching - CacheContext / CacheMode](09_cachecontext___cachemode.md), we learned how Crawl4AI uses caching to cleverly avoid re-fetching the same webpage multiple times, which is especially helpful when crawling many URLs. We've also seen how methods like `arun_many()` ([Chapter 2: Meet the General Manager - AsyncWebCrawler](02_asyncwebcrawler.md)) or strategies like [DeepCrawlStrategy](08_deepcrawlstrategy.md) can lead to potentially hundreds or thousands of individual URLs needing to be crawled.
This raises a question: if we have 1000 URLs to crawl, does Crawl4AI try to crawl all 1000 simultaneously? That would likely overwhelm your computer's resources (like memory and CPU) and could also flood the target website with too many requests, potentially getting you blocked! How does Crawl4AI manage running many crawls efficiently and responsibly?
## What Problem Does `BaseDispatcher` Solve?
Imagine you're managing a fleet of delivery drones (`AsyncWebCrawler` tasks) that need to pick up packages from many different addresses (URLs). If you launch all 1000 drones at the exact same moment:
* Your control station (your computer) might crash due to the processing load.
* The central warehouse (the target website) might get overwhelmed by simultaneous arrivals.
* Some drones might collide or interfere with each other.
You need a **Traffic Controller** or a **Dispatch Center** to manage the fleet. This controller decides:
1. How many drones can be active in the air at any one time.
2. When to launch the next drone, maybe based on available airspace (system resources) or just a simple count limit.
3. How to handle potential delays or issues (like rate limiting from a specific website).
In Crawl4AI, the `BaseDispatcher` acts as this **Traffic Controller** or **Task Scheduler** for concurrent crawling operations, primarily when using `arun_many()`. It manages *how* multiple crawl tasks are executed concurrently, ensuring the process is efficient without overwhelming your system or the target websites.
## What is `BaseDispatcher`?
`BaseDispatcher` is an abstract concept (a blueprint or job description) in Crawl4AI. It defines *that* we need a system for managing the execution of multiple, concurrent crawling tasks. It specifies the *interface* for how the main `AsyncWebCrawler` interacts with such a system, but the specific *logic* for managing concurrency can vary.
Think of it as the control panel for our drone fleet: the panel exists, but the specific rules programmed into it determine how drones are dispatched.
## The Different Controllers: Ways to Dispatch Tasks
Crawl4AI provides concrete implementations (the actual traffic control systems) based on the `BaseDispatcher` blueprint:
1. **`SemaphoreDispatcher` (The Simple Counter):**
* **Analogy:** A parking garage with a fixed number of spots (e.g., 10). A gate (`asyncio.Semaphore`) only lets a new car in if one of the 10 spots is free.
* **How it works:** You tell it the maximum number of crawls that can run *at the same time* (e.g., `semaphore_count=10`). It uses a simple counter (a semaphore) to ensure that no more than this number of crawls are active simultaneously. When one crawl finishes, it allows another one from the queue to start.
* **Good for:** Simple, direct control over concurrency when you know a specific limit works well for your system and the target sites.
2. **`MemoryAdaptiveDispatcher` (The Resource-Aware Controller - Default):**
* **Analogy:** A smart parking garage attendant who checks not just the number of cars, but also the *total space* they occupy (system memory). They might stop letting cars in if the garage is nearing its memory capacity, even if some numbered spots are technically free.
* **How it works:** This dispatcher monitors your system's available memory. It tries to run multiple crawls concurrently (up to a configurable maximum like `max_session_permit`), but it will pause launching new crawls if the system memory usage exceeds a certain threshold (e.g., `memory_threshold_percent=90.0`). It adapts the concurrency level based on available resources.
* **Good for:** Automatically adjusting concurrency to prevent out-of-memory errors, especially when crawl tasks vary significantly in resource usage. **This is the default dispatcher used by `arun_many` if you don't specify one.**
These dispatchers can also optionally work with a `RateLimiter` component, which adds politeness rules for specific websites (e.g., slowing down requests to a domain if it returns "429 Too Many Requests").
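If you want to tune the default controller yourself, you can build it explicitly and hand it to `arun_many`, as sketched below. The parameter names (`memory_threshold_percent`, `max_session_permit`) are the ones mentioned above, but the import location and defaults are assumptions, and the commented-out `RateLimiter` line is only a pointer to the optional politeness component.

```python
# chapter10_sketch_memory_adaptive.py -- illustrative sketch only
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, MemoryAdaptiveDispatcher

async def main():
    # Pause launching new crawls above 85% memory; never exceed 8 concurrent sessions.
    dispatcher = MemoryAdaptiveDispatcher(
        memory_threshold_percent=85.0,
        max_session_permit=8,
        # rate_limiter=RateLimiter(...)  # optional politeness rules per domain
    )
    urls = ["https://httpbin.org/html", "https://httpbin.org/robots.txt"]
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many(
            urls=urls,
            config=CrawlerRunConfig(stream=False),
            dispatcher=dispatcher,
        )
        for result in results:
            print("✅" if result.success else "❌", result.url)

if __name__ == "__main__":
    asyncio.run(main())
```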
## How `arun_many` Uses the Dispatcher
When you call `crawler.arun_many(urls=...)`, here's the basic flow involving the dispatcher:
1. **Get URLs:** `arun_many` receives the list of URLs you want to crawl.
2. **Select Dispatcher:** It checks if you provided a specific `dispatcher` instance. If not, it creates an instance of the default `MemoryAdaptiveDispatcher`.
3. **Delegate Execution:** It hands over the list of URLs and the `CrawlerRunConfig` to the chosen dispatcher's `run_urls` (or `run_urls_stream`) method.
4. **Manage Tasks:** The dispatcher takes charge:
* It iterates through the URLs.
* For each URL, it decides *when* to start the actual crawl based on its rules (semaphore count, memory usage, rate limits).
* When ready, it typically calls the single-page `crawler.arun(url, config)` method internally for that specific URL, wrapped within its concurrency control mechanism.
* It manages the running tasks (e.g., using `asyncio.create_task` and `asyncio.wait`).
5. **Collect Results:** As individual `arun` calls complete, the dispatcher collects their `CrawlResult` objects.
6. **Return:** Once all URLs are processed, the dispatcher returns the list of results (or yields them if streaming).
```mermaid
sequenceDiagram
participant User
participant AWC as AsyncWebCrawler
participant Dispatcher as BaseDispatcher (e.g., MemoryAdaptive)
participant TaskPool as Concurrency Manager
User->>AWC: arun_many(urls, config, dispatcher?)
AWC->>Dispatcher: run_urls(crawler=AWC, urls, config)
Dispatcher->>TaskPool: Initialize (e.g., set max concurrency)
loop For each URL in urls
Dispatcher->>TaskPool: Can I start a new task? (Checks limits)
alt Yes
TaskPool-->>Dispatcher: OK
Note over Dispatcher: Create task: call AWC.arun(url, config) internally
Dispatcher->>TaskPool: Add new task
else No
TaskPool-->>Dispatcher: Wait
Note over Dispatcher: Waits for a running task to finish
end
end
Note over Dispatcher: Manages running tasks, collects results
Dispatcher-->>AWC: List of CrawlResults
AWC-->>User: List of CrawlResults
```
## Using the Dispatcher (Often Implicitly!)
Most of the time, you don't need to think about the dispatcher explicitly. When you use `arun_many`, the default `MemoryAdaptiveDispatcher` handles things automatically.
```python
# chapter10_example_1.py
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
async def main():
urls_to_crawl = [
"https://httpbin.org/html",
"https://httpbin.org/links/5/0", # Page with 5 links
"https://httpbin.org/robots.txt",
"https://httpbin.org/status/200",
]
# We DON'T specify a dispatcher here.
# arun_many will use the default MemoryAdaptiveDispatcher.
async with AsyncWebCrawler() as crawler:
print(f"Crawling {len(urls_to_crawl)} URLs using the default dispatcher...")
config = CrawlerRunConfig(stream=False) # Get results as a list at the end
# The MemoryAdaptiveDispatcher manages concurrency behind the scenes.
results = await crawler.arun_many(urls=urls_to_crawl, config=config)
print(f"\nFinished! Got {len(results)} results.")
for result in results:
status = "✅" if result.success else "❌"
url_short = result.url.split('/')[-1]
print(f" {status} {url_short:<15} | Title: {result.metadata.get('title', 'N/A')}")
if __name__ == "__main__":
asyncio.run(main())
```
**Explanation:**
* We call `crawler.arun_many` without passing a `dispatcher` argument.
* Crawl4AI automatically creates and uses a `MemoryAdaptiveDispatcher`.
* This dispatcher runs the crawls concurrently, adapting to your system's memory, and returns all the results once completed (because `stream=False`). You benefit from concurrency without explicit setup.
## Explicitly Choosing a Dispatcher
What if you want simpler, fixed concurrency? You can explicitly create and pass a `SemaphoreDispatcher`.
```python
# chapter10_example_2.py
import asyncio
from crawl4ai import (
AsyncWebCrawler,
CrawlerRunConfig,
SemaphoreDispatcher # 1. Import the specific dispatcher
)
async def main():
urls_to_crawl = [
"https://httpbin.org/delay/1", # Takes 1 second
"https://httpbin.org/delay/1",
"https://httpbin.org/delay/1",
"https://httpbin.org/delay/1",
"https://httpbin.org/delay/1",
]
# 2. Create an instance of the SemaphoreDispatcher
# Allow only 2 crawls to run at the same time.
semaphore_controller = SemaphoreDispatcher(semaphore_count=2)
print(f"Using SemaphoreDispatcher with limit: {semaphore_controller.semaphore_count}")
async with AsyncWebCrawler() as crawler:
print(f"Crawling {len(urls_to_crawl)} URLs with explicit dispatcher...")
config = CrawlerRunConfig(stream=False)
# 3. Pass the dispatcher instance to arun_many
results = await crawler.arun_many(
urls=urls_to_crawl,
config=config,
dispatcher=semaphore_controller # Pass our controller
)
print(f"\nFinished! Got {len(results)} results.")
# This crawl likely took around 3 seconds (5 tasks, 1s each, 2 concurrent = ceil(5/2)*1s)
for result in results:
status = "✅" if result.success else "❌"
print(f" {status} {result.url}")
if __name__ == "__main__":
asyncio.run(main())
```
**Explanation:**
1. **Import:** We import `SemaphoreDispatcher`.
2. **Instantiate:** We create `SemaphoreDispatcher(semaphore_count=2)`, limiting concurrency to 2 simultaneous crawls.
3. **Pass Dispatcher:** We pass our `semaphore_controller` instance directly to the `dispatcher` parameter of `arun_many`.
4. **Execution:** Now, `arun_many` uses our `SemaphoreDispatcher`. It will start the first two crawls. As one finishes, it will start the next one from the list, always ensuring no more than two are running concurrently.
## A Glimpse Under the Hood
Where are these dispatchers defined? In `crawl4ai/async_dispatcher.py`.
**The Blueprint (`BaseDispatcher`):**
```python
# Simplified from crawl4ai/async_dispatcher.py
from abc import ABC, abstractmethod
from typing import List, Optional
# ... other imports like CrawlerRunConfig, CrawlerTaskResult, AsyncWebCrawler ...
class BaseDispatcher(ABC):
def __init__(
self,
rate_limiter: Optional[RateLimiter] = None,
monitor: Optional[CrawlerMonitor] = None,
):
self.crawler = None # Will be set by arun_many
self.rate_limiter = rate_limiter
self.monitor = monitor
# ... other common state ...
@abstractmethod
async def crawl_url(
self,
url: str,
config: CrawlerRunConfig,
task_id: str,
# ... maybe other internal params ...
) -> CrawlerTaskResult:
"""Crawls a single URL, potentially handling concurrency primitives."""
# This is often the core worker method called by run_urls
pass
@abstractmethod
async def run_urls(
self,
urls: List[str],
crawler: "AsyncWebCrawler",
config: CrawlerRunConfig,
) -> List[CrawlerTaskResult]:
"""Manages the concurrent execution of crawl_url for multiple URLs."""
# This is the main entry point called by arun_many
pass
async def run_urls_stream(
self,
urls: List[str],
crawler: "AsyncWebCrawler",
config: CrawlerRunConfig,
) -> AsyncGenerator[CrawlerTaskResult, None]:
""" Streaming version of run_urls (might be implemented in base or subclasses) """
# Example default implementation (subclasses might override)
results = await self.run_urls(urls, crawler, config)
for res in results: yield res # Naive stream, real one is more complex
# ... other potential helper methods ...
```
**Example Implementation (`SemaphoreDispatcher`):**
```python
# Simplified from crawl4ai/async_dispatcher.py
import asyncio
import uuid
import psutil # For memory tracking in crawl_url
import time # For timing in crawl_url
# ... other imports ...
class SemaphoreDispatcher(BaseDispatcher):
def __init__(
self,
semaphore_count: int = 5,
# ... other params like rate_limiter, monitor ...
):
super().__init__(...) # Pass rate_limiter, monitor to base
self.semaphore_count = semaphore_count
async def crawl_url(
self,
url: str,
config: CrawlerRunConfig,
task_id: str,
semaphore: asyncio.Semaphore = None, # Takes the semaphore
) -> CrawlerTaskResult:
# ... (Code to track start time, memory usage - similar to MemoryAdaptiveDispatcher's version)
start_time = time.time()
error_message = ""
memory_usage = peak_memory = 0.0
result = None
try:
# Update monitor state if used
if self.monitor: self.monitor.update_task(task_id, status=CrawlStatus.IN_PROGRESS)
# Wait for rate limiter if used
if self.rate_limiter: await self.rate_limiter.wait_if_needed(url)
# --- Core Semaphore Logic ---
async with semaphore: # Acquire a spot from the semaphore
# Now that we have a spot, run the actual crawl
process = psutil.Process()
start_memory = process.memory_info().rss / (1024 * 1024)
# Call the single-page crawl method of the main crawler
result = await self.crawler.arun(url, config=config, session_id=task_id)
end_memory = process.memory_info().rss / (1024 * 1024)
memory_usage = peak_memory = end_memory - start_memory
# --- Semaphore spot is released automatically on exiting 'async with' ---
# Update rate limiter based on result status if used
if self.rate_limiter and result.status_code:
if not self.rate_limiter.update_delay(url, result.status_code):
# Handle retry limit exceeded
error_message = "Rate limit retry count exceeded"
# ... update monitor, prepare error result ...
# Update monitor status (success/fail)
if result and not result.success: error_message = result.error_message
if self.monitor: self.monitor.update_task(task_id, status=CrawlStatus.COMPLETED if result.success else CrawlStatus.FAILED)
except Exception as e:
# Handle unexpected errors during the crawl
error_message = str(e)
if self.monitor: self.monitor.update_task(task_id, status=CrawlStatus.FAILED)
# Create a failed CrawlResult if needed
if not result: result = CrawlResult(url=url, html="", success=False, error_message=error_message)
finally:
# Final monitor update with timing, memory etc.
end_time = time.time()
if self.monitor: self.monitor.update_task(...)
# Package everything into CrawlerTaskResult
return CrawlerTaskResult(...)
async def run_urls(
self,
crawler: "AsyncWebCrawler",
urls: List[str],
config: CrawlerRunConfig,
) -> List[CrawlerTaskResult]:
self.crawler = crawler # Store the crawler instance
if self.monitor: self.monitor.start()
try:
# Create the semaphore with the specified count
semaphore = asyncio.Semaphore(self.semaphore_count)
tasks = []
# Create a crawl task for each URL, passing the semaphore
for url in urls:
task_id = str(uuid.uuid4())
if self.monitor: self.monitor.add_task(task_id, url)
# Create an asyncio task to run crawl_url
task = asyncio.create_task(
self.crawl_url(url, config, task_id, semaphore=semaphore)
)
tasks.append(task)
# Wait for all created tasks to complete
# asyncio.gather runs them concurrently, respecting the semaphore limit
results = await asyncio.gather(*tasks, return_exceptions=True)
# Process results (handle potential exceptions returned by gather)
final_results = []
for res in results:
if isinstance(res, Exception):
# Handle case where gather caught an exception from a task
# You might create a failed CrawlerTaskResult here
pass
elif isinstance(res, CrawlerTaskResult):
final_results.append(res)
return final_results
finally:
if self.monitor: self.monitor.stop()
# run_urls_stream would have similar logic but use asyncio.as_completed
# or manage tasks manually to yield results as they finish.
```
The key takeaway is that the `Dispatcher` orchestrates calls to the single-page `crawler.arun` method, wrapping them with concurrency controls (like the `async with semaphore:` block) before running them using `asyncio`'s concurrency tools (`asyncio.create_task`, `asyncio.gather`, etc.).
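If the underlying pattern is new to you, here is the same idea in plain `asyncio`, with no Crawl4AI involved at all: a semaphore bounds how many "fetches" run at once, while `asyncio.gather` waits for them all.

```python
# bounded_concurrency_sketch.py -- plain asyncio, independent of Crawl4AI
import asyncio

async def fetch(url: str, semaphore: asyncio.Semaphore) -> str:
    async with semaphore:              # at most N workers inside this block
        await asyncio.sleep(1)         # stand-in for a real network fetch
        return f"done: {url}"

async def main():
    semaphore = asyncio.Semaphore(2)   # allow 2 concurrent "crawls"
    urls = [f"https://example.com/{i}" for i in range(5)]
    tasks = [asyncio.create_task(fetch(u, semaphore)) for u in urls]
    for result in await asyncio.gather(*tasks):
        print(result)                  # ~3 seconds total: ceil(5/2) * 1s

if __name__ == "__main__":
    asyncio.run(main())
```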
## Conclusion
You've learned about `BaseDispatcher`, the crucial "Traffic Controller" that manages concurrent crawls in Crawl4AI, especially for `arun_many`.
* It solves the problem of efficiently running many crawls without overloading systems or websites.
* It acts as a **blueprint** for managing concurrency.
* Key implementations:
* **`SemaphoreDispatcher`**: Uses a simple count limit.
* **`MemoryAdaptiveDispatcher`**: Adjusts concurrency based on system memory (the default for `arun_many`).
* The dispatcher is used **automatically** by `arun_many`, but you can provide a specific instance if needed.
* It orchestrates the execution of individual crawl tasks, respecting defined limits.
Understanding the dispatcher helps appreciate how Crawl4AI handles large-scale crawling tasks responsibly and efficiently.
This concludes our tour of the core concepts in Crawl4AI! We've covered how pages are fetched, how the process is managed, how content is cleaned, filtered, and extracted, how deep crawls are performed, how caching optimizes fetches, and finally, how concurrency is managed. You now have a solid foundation to start building powerful web data extraction and processing applications with Crawl4AI. Happy crawling!
---
Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge)
docs/Crawl4AI/index.md Normal file
@@ -0,0 +1,52 @@
# Tutorial: Crawl4AI
`Crawl4AI` is a flexible Python library for *asynchronously crawling websites* and *extracting structured content*, specifically designed for **AI use cases**.
You primarily interact with the `AsyncWebCrawler`, which acts as the main coordinator. You provide it with URLs and a `CrawlerRunConfig` detailing *how* to crawl (e.g., using specific strategies for fetching, scraping, filtering, and extraction).
It can handle single pages or multiple URLs concurrently using a `BaseDispatcher`, optionally crawl deeper by following links via `DeepCrawlStrategy`, manage `CacheMode`, and apply `RelevantContentFilter` before finally returning a `CrawlResult` containing all the gathered data.
**Source Repository:** [https://github.com/unclecode/crawl4ai/tree/9c58e4ce2ee025debd3f36bf213330bd72b90e46/crawl4ai](https://github.com/unclecode/crawl4ai/tree/9c58e4ce2ee025debd3f36bf213330bd72b90e46/crawl4ai)
```mermaid
flowchart TD
A0["AsyncWebCrawler"]
A1["CrawlerRunConfig"]
A2["AsyncCrawlerStrategy"]
A3["ContentScrapingStrategy"]
A4["ExtractionStrategy"]
A5["CrawlResult"]
A6["BaseDispatcher"]
A7["DeepCrawlStrategy"]
A8["CacheContext / CacheMode"]
A9["RelevantContentFilter"]
A0 -- "Configured by" --> A1
A0 -- "Uses Fetching Strategy" --> A2
A0 -- "Uses Scraping Strategy" --> A3
A0 -- "Uses Extraction Strategy" --> A4
A0 -- "Produces" --> A5
A0 -- "Uses Dispatcher for `arun_m..." --> A6
A0 -- "Uses Caching Logic" --> A8
A6 -- "Calls Crawler's `arun`" --> A0
A1 -- "Specifies Deep Crawl Strategy" --> A7
A7 -- "Processes Links from" --> A5
A3 -- "Provides Cleaned HTML to" --> A9
A1 -- "Specifies Content Filter" --> A9
```
## Chapters
1. [AsyncCrawlerStrategy](01_asynccrawlerstrategy.md)
2. [AsyncWebCrawler](02_asyncwebcrawler.md)
3. [CrawlerRunConfig](03_crawlerrunconfig.md)
4. [ContentScrapingStrategy](04_contentscrapingstrategy.md)
5. [RelevantContentFilter](05_relevantcontentfilter.md)
6. [ExtractionStrategy](06_extractionstrategy.md)
7. [CrawlResult](07_crawlresult.md)
8. [DeepCrawlStrategy](08_deepcrawlstrategy.md)
9. [CacheContext / CacheMode](09_cachecontext___cachemode.md)
10. [BaseDispatcher](10_basedispatcher.md)
---
Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge)