docs/Crawl4AI/01_asynccrawlerstrategy.md
# Chapter 1: How We Fetch Webpages - AsyncCrawlerStrategy

Welcome to the Crawl4AI tutorial series! Our goal is to build intelligent agents that can understand and extract information from the web. The very first step in this process is actually *getting* the content from a webpage. This chapter explains how Crawl4AI handles that fundamental task.

Imagine you need to pick up a package from a specific address. How do you get there and retrieve it?

* You could send a **simple, fast drone** that just grabs the package off the porch (if it's easily accessible). This is quick but might fail if the package is inside or requires a signature.
* Or, you could send a **full delivery truck with a driver**. The driver can ring the bell, wait, sign for the package, and even handle complex instructions. This is more versatile but takes more time and resources.

In Crawl4AI, the `AsyncCrawlerStrategy` is like choosing your delivery vehicle. It defines *how* the crawler fetches the raw content (like the HTML, CSS, and maybe JavaScript results) of a webpage.

## What Exactly is AsyncCrawlerStrategy?

`AsyncCrawlerStrategy` is a core concept in Crawl4AI that represents the **method** or **technique** used to download the content of a given URL. Think of it as a blueprint: it specifies *that* we need a way to fetch content, but the specific *details* of how it's done can vary.

This "blueprint" approach is powerful because it allows us to swap out the fetching mechanism depending on our needs, without changing the rest of our crawling logic.

## The Default: AsyncPlaywrightCrawlerStrategy (The Delivery Truck)

By default, Crawl4AI uses `AsyncPlaywrightCrawlerStrategy`. This strategy uses a real, automated web browser engine (like Chrome, Firefox, or WebKit) behind the scenes.

**Why use a full browser?**

* **Handles JavaScript:** Modern websites rely heavily on JavaScript to load content, change the layout, or fetch data after the initial page load. `AsyncPlaywrightCrawlerStrategy` runs this JavaScript, just like your normal browser does.
* **Simulates User Interaction:** It can wait for elements to appear, handle dynamic content, and see the page *after* scripts have run.
* **Gets the "Final" View:** It fetches the content as a user would see it in their browser.

This is our "delivery truck" – powerful and capable of handling complex websites. However, like a real truck, it's slower and uses more memory and CPU compared to simpler methods.

You generally don't need to *do* anything to use it, as it's the default! When you start Crawl4AI, it picks this strategy automatically.

## Another Option: AsyncHTTPCrawlerStrategy (The Delivery Drone)

Crawl4AI also offers `AsyncHTTPCrawlerStrategy`. This strategy is much simpler. It directly requests the URL and downloads the *initial* HTML source code that the web server sends back.

**Why use this simpler strategy?**

* **Speed:** It's significantly faster because it doesn't need to start a browser, render the page, or execute JavaScript.
* **Efficiency:** It uses much less memory and CPU.

This is our "delivery drone" – super fast and efficient for simple tasks.

**What's the catch?**

* **No JavaScript:** It won't run any JavaScript on the page. If content is loaded dynamically by scripts, this strategy will likely miss it.
* **Basic HTML Only:** You get the raw HTML source, not necessarily what a user *sees* after the browser processes everything.

This strategy is great for websites with simple, static HTML content or when you only need the basic structure and metadata very quickly.

## Why Have Different Strategies? (The Power of Abstraction)

Having `AsyncCrawlerStrategy` as a distinct concept offers several advantages:

1. **Flexibility:** You can choose the best tool for the job. Need to crawl complex, dynamic sites? Use the default `AsyncPlaywrightCrawlerStrategy`. Need to quickly fetch basic HTML from thousands of simple pages? Switch to `AsyncHTTPCrawlerStrategy`.
2. **Maintainability:** The logic for *fetching* content is kept separate from the logic for *processing* it.
3. **Extensibility:** Advanced users could even create their *own* custom strategies for specialized fetching needs (though that's beyond this beginner tutorial).

## How It Works Conceptually

When you ask Crawl4AI to crawl a URL, the main `AsyncWebCrawler` doesn't fetch the content itself. Instead, it delegates the task to the currently selected `AsyncCrawlerStrategy`.

Here's a simplified flow:

```mermaid
sequenceDiagram
    participant C as AsyncWebCrawler
    participant S as AsyncCrawlerStrategy
    participant W as Website

    C->>S: Please crawl("https://example.com")
    Note over S: I'm using my method (e.g., Browser or HTTP)
    S->>W: Request Page Content
    W-->>S: Return Raw Content (HTML, etc.)
    S-->>C: Here's the result (AsyncCrawlResponse)
```

The `AsyncWebCrawler` only needs to know how to talk to *any* strategy through a common interface (the `crawl` method). The strategy handles the specific details of the fetching process.
## Using the Default Strategy (You're Already Doing It!)

Let's see how you use the default `AsyncPlaywrightCrawlerStrategy` without even needing to specify it.

```python
# main_example.py
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

async def main():
    # When you create AsyncWebCrawler without specifying a strategy,
    # it automatically uses AsyncPlaywrightCrawlerStrategy!
    async with AsyncWebCrawler() as crawler:
        print("Crawler is ready using the default strategy (Playwright).")

        # Let's crawl a simple page that just returns HTML
        # We use CacheMode.BYPASS to ensure we fetch it fresh each time for this demo.
        config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
        result = await crawler.arun(
            url="https://httpbin.org/html",
            config=config
        )

        if result.success:
            print("\nSuccessfully fetched content!")
            # The strategy fetched the raw HTML.
            # AsyncWebCrawler then processes it (more on that later).
            print(f"First 100 chars of fetched HTML: {result.html[:100]}...")
        else:
            print(f"\nFailed to fetch content: {result.error_message}")

if __name__ == "__main__":
    asyncio.run(main())
```

**Explanation:**

1. We import `AsyncWebCrawler` and supporting classes.
2. We create an instance of `AsyncWebCrawler()` inside an `async with` block (this handles setup and cleanup). Since we didn't tell it *which* strategy to use, it defaults to `AsyncPlaywrightCrawlerStrategy`.
3. We call `crawler.arun()` to crawl the URL. Under the hood, the `AsyncPlaywrightCrawlerStrategy` starts a browser, navigates to the page, gets the content, and returns it.
4. We print the first part of the fetched HTML from the `result`.

## Explicitly Choosing the HTTP Strategy

What if you know the page is simple and want the speed of the "delivery drone"? You can explicitly tell `AsyncWebCrawler` to use `AsyncHTTPCrawlerStrategy`.

```python
# http_strategy_example.py
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
# Import the specific strategies we want to use
from crawl4ai.async_crawler_strategy import AsyncHTTPCrawlerStrategy

async def main():
    # 1. Create an instance of the strategy you want
    http_strategy = AsyncHTTPCrawlerStrategy()

    # 2. Pass the strategy instance when creating the AsyncWebCrawler
    async with AsyncWebCrawler(crawler_strategy=http_strategy) as crawler:
        print("Crawler is ready using the explicit HTTP strategy.")

        # Crawl the same simple page
        config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
        result = await crawler.arun(
            url="https://httpbin.org/html",
            config=config
        )

        if result.success:
            print("\nSuccessfully fetched content using HTTP strategy!")
            print(f"First 100 chars of fetched HTML: {result.html[:100]}...")
        else:
            print(f"\nFailed to fetch content: {result.error_message}")

if __name__ == "__main__":
    asyncio.run(main())
```

**Explanation:**

1. We now also import `AsyncHTTPCrawlerStrategy`.
2. We create an instance: `http_strategy = AsyncHTTPCrawlerStrategy()`.
3. We pass this instance to the `AsyncWebCrawler` constructor: `AsyncWebCrawler(crawler_strategy=http_strategy)`.
4. The rest of the code is the same, but now `crawler.arun()` will use the faster, simpler HTTP GET request method defined by `AsyncHTTPCrawlerStrategy`.

For a simple page like `httpbin.org/html`, both strategies will likely return the same HTML content, but the HTTP strategy would generally be faster and use fewer resources. On a complex JavaScript-heavy site, the HTTP strategy might fail to get the full content, while the Playwright strategy would handle it correctly.
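If you are curious how big the difference is in practice, a rough timing comparison is easy to put together from the two examples above. This is only an illustrative sketch (the file name is invented, and the absolute numbers will vary with your machine and network); it uses nothing beyond the APIs already shown in this chapter.

```python
# compare_strategies.py (illustrative sketch, not part of the library)
import asyncio
import time

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
from crawl4ai.async_crawler_strategy import AsyncHTTPCrawlerStrategy

async def timed_crawl(crawler, url, config):
    # Time a single arun() call with whatever strategy the crawler was built with.
    start = time.perf_counter()
    result = await crawler.arun(url=url, config=config)
    return result.success, time.perf_counter() - start

async def main():
    url = "https://httpbin.org/html"
    config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)  # always fetch fresh

    # The default "delivery truck" (Playwright browser)
    async with AsyncWebCrawler() as crawler:
        ok, seconds = await timed_crawl(crawler, url, config)
        print(f"Playwright strategy: success={ok}, took {seconds:.2f}s")

    # The "delivery drone" (plain HTTP requests)
    async with AsyncWebCrawler(crawler_strategy=AsyncHTTPCrawlerStrategy()) as crawler:
        ok, seconds = await timed_crawl(crawler, url, config)
        print(f"HTTP strategy:       success={ok}, took {seconds:.2f}s")

if __name__ == "__main__":
    asyncio.run(main())
```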
## A Glimpse Under the Hood

You don't *need* to know the deep internals to use the strategies, but it helps to understand the structure. Inside the `crawl4ai` library, you'd find a file like `async_crawler_strategy.py`.

It defines the "blueprint" (an Abstract Base Class):

```python
# Simplified from async_crawler_strategy.py
from abc import ABC, abstractmethod
from .models import AsyncCrawlResponse # Defines the structure of the result

class AsyncCrawlerStrategy(ABC):
    """
    Abstract base class for crawler strategies.
    """
    @abstractmethod
    async def crawl(self, url: str, **kwargs) -> AsyncCrawlResponse:
        """Fetch content from the URL."""
        pass # Each specific strategy must implement this
```

And then the specific implementations:

```python
# Simplified from async_crawler_strategy.py
from playwright.async_api import Page # Playwright library for browser automation
# ... other imports

class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
    # ... (Initialization code to manage browsers)

    async def crawl(self, url: str, config: CrawlerRunConfig, **kwargs) -> AsyncCrawlResponse:
        # Uses Playwright to:
        # 1. Get a browser page
        # 2. Navigate to the url (page.goto(url))
        # 3. Wait for content, run JS, etc.
        # 4. Get the final HTML (page.content())
        # 5. Optionally take screenshots, etc.
        # 6. Return an AsyncCrawlResponse
        # ... implementation details ...
        pass
```

```python
# Simplified from async_crawler_strategy.py
import aiohttp # Library for making HTTP requests asynchronously
# ... other imports

class AsyncHTTPCrawlerStrategy(AsyncCrawlerStrategy):
    # ... (Initialization code to manage HTTP sessions)

    async def crawl(self, url: str, config: CrawlerRunConfig, **kwargs) -> AsyncCrawlResponse:
        # Uses aiohttp to:
        # 1. Make an HTTP GET (or other method) request to the url
        # 2. Read the response body (HTML)
        # 3. Get response headers and status code
        # 4. Return an AsyncCrawlResponse
        # ... implementation details ...
        pass
```

The key takeaway is that both strategies implement the same `crawl` method, allowing `AsyncWebCrawler` to use them interchangeably.
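Because the contract is just "implement `crawl` and return an `AsyncCrawlResponse`", a custom strategy is, at least conceptually, a small subclass. The sketch below is illustrative only and not taken from the library: it assumes `crawl` is the only abstract method (as in the simplified blueprint above), that a strategy should also provide `__aenter__`/`__aexit__` (Chapter 2's simplified `AsyncWebCrawler` calls them), and that `AsyncCrawlResponse` accepts `html`, `response_headers`, and `status_code` keyword arguments; your installed version may differ.

```python
# custom_strategy_sketch.py (illustrative only, not from the library)
import aiohttp

from crawl4ai.async_crawler_strategy import AsyncCrawlerStrategy
from crawl4ai.models import AsyncCrawlResponse  # import path assumed from the snippet above

class PlainAiohttpStrategy(AsyncCrawlerStrategy):
    """A toy 'drone': one GET request per URL, no browser, no JavaScript."""

    async def __aenter__(self):
        # Chapter 2's simplified AsyncWebCrawler calls these on the strategy,
        # so even a toy strategy should provide them.
        self._session = aiohttp.ClientSession()
        return self

    async def __aexit__(self, exc_type, exc_val, exc_tb):
        await self._session.close()

    async def crawl(self, url: str, **kwargs) -> AsyncCrawlResponse:
        async with self._session.get(url) as resp:
            html = await resp.text()
            # Field names here are assumptions based on this chapter's description.
            return AsyncCrawlResponse(
                html=html,
                response_headers=dict(resp.headers),
                status_code=resp.status,
            )

# Usage would mirror the AsyncHTTPCrawlerStrategy example above:
# async with AsyncWebCrawler(crawler_strategy=PlainAiohttpStrategy()) as crawler:
#     result = await crawler.arun(url="https://httpbin.org/html")
```

In practice you would only do this for genuinely special fetching needs; the two built-in strategies already cover the common cases.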
## Conclusion

You've learned about `AsyncCrawlerStrategy`, the core concept defining *how* Crawl4AI fetches webpage content.

* It's like choosing a vehicle: a powerful browser (`AsyncPlaywrightCrawlerStrategy`, the default) or a fast, simple HTTP request (`AsyncHTTPCrawlerStrategy`).
* This abstraction gives you flexibility to choose the right fetching method for your task.
* You usually don't need to worry about it, as the default handles most modern websites well.

Now that we understand how the raw content is fetched, the next step is to look at the main class that orchestrates the entire crawling process.

**Next:** Let's dive into the [AsyncWebCrawler](02_asyncwebcrawler.md) itself!

---

Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge)
docs/Crawl4AI/02_asyncwebcrawler.md
# Chapter 2: Meet the General Manager - AsyncWebCrawler

In [Chapter 1: How We Fetch Webpages - AsyncCrawlerStrategy](01_asynccrawlerstrategy.md), we learned about the different ways Crawl4AI can fetch the raw content of a webpage, like choosing between a fast drone (`AsyncHTTPCrawlerStrategy`) or a versatile delivery truck (`AsyncPlaywrightCrawlerStrategy`).

But who decides *which* delivery vehicle to use? Who tells it *which* address (URL) to go to? And who takes the delivered package (the raw HTML) and turns it into something useful?

That's where the `AsyncWebCrawler` comes in. Think of it as the **General Manager** of the entire crawling operation.

## What Problem Does `AsyncWebCrawler` Solve?

Imagine you want to get information from a website. You need to:

1. Decide *how* to fetch the page (like choosing the drone or truck from Chapter 1).
2. Actually *fetch* the page content.
3. Maybe *clean up* the messy HTML.
4. Perhaps *extract* specific pieces of information (like product prices or article titles).
5. Maybe *save* the results so you don't have to fetch them again immediately (caching).
6. Finally, give you the *final, processed result*.

Doing all these steps manually for every URL would be tedious and complex. `AsyncWebCrawler` acts as the central coordinator, managing all these steps for you. You just tell it what URL to crawl and maybe some preferences, and it handles the rest.

## What is `AsyncWebCrawler`?

`AsyncWebCrawler` is the main class you'll interact with when using Crawl4AI. It's the primary entry point for starting any crawling task.

**Key Responsibilities:**

* **Initialization:** Sets up the necessary components, like the browser (if needed).
* **Coordination:** Takes your request (a URL and configuration) and orchestrates the different parts:
    * Delegates fetching to an [AsyncCrawlerStrategy](01_asynccrawlerstrategy.md).
    * Manages caching using [CacheContext / CacheMode](09_cachecontext___cachemode.md).
    * Uses a [ContentScrapingStrategy](04_contentscrapingstrategy.md) to clean and parse HTML.
    * Applies a [RelevantContentFilter](05_relevantcontentfilter.md) if configured.
    * Uses an [ExtractionStrategy](06_extractionstrategy.md) to pull out specific data if needed.
* **Result Packaging:** Bundles everything up into a neat [CrawlResult](07_crawlresult.md) object.
* **Resource Management:** Handles starting and stopping resources (like browsers) cleanly.

It's the "conductor" making sure all the different instruments play together harmoniously.

## Your First Crawl: Using `arun`

Let's see the `AsyncWebCrawler` in action. The most common way to use it is with an `async with` block, which automatically handles setup and cleanup. The main method to crawl a single URL is `arun`.

```python
# chapter2_example_1.py
import asyncio
from crawl4ai import AsyncWebCrawler # Import the General Manager

async def main():
    # Create the General Manager instance using 'async with'
    # This handles setup (like starting a browser if needed)
    # and cleanup (closing the browser).
    async with AsyncWebCrawler() as crawler:
        print("Crawler is ready!")

        # Tell the manager to crawl a specific URL
        url_to_crawl = "https://httpbin.org/html" # A simple example page
        print(f"Asking the crawler to fetch: {url_to_crawl}")

        result = await crawler.arun(url=url_to_crawl)

        # Check if the crawl was successful
        if result.success:
            print("\nSuccess! Crawler got the content.")
            # The result object contains the processed data
            # We'll learn more about CrawlResult in Chapter 7
            print(f"Page Title: {result.metadata.get('title', 'N/A')}")
            print(f"First 100 chars of Markdown: {result.markdown.raw_markdown[:100]}...")
        else:
            print(f"\nFailed to crawl: {result.error_message}")

if __name__ == "__main__":
    asyncio.run(main())
```

**Explanation:**

1. **`import AsyncWebCrawler`**: We import the main class.
2. **`async def main():`**: Crawl4AI uses Python's `asyncio` for efficiency, so our code needs to be in an `async` function.
3. **`async with AsyncWebCrawler() as crawler:`**: This is the standard way to create and manage the crawler. The `async with` statement ensures that resources (like the underlying browser used by the default `AsyncPlaywrightCrawlerStrategy`) are properly started and stopped, even if errors occur.
4. **`crawler.arun(url=url_to_crawl)`**: This is the core command. We tell our `crawler` instance (the General Manager) to run (`arun`) the crawling process for the specified `url`. `await` is used because fetching webpages takes time, and `asyncio` allows other tasks to run while waiting.
5. **`result`**: The `arun` method returns a `CrawlResult` object. This object contains all the information gathered during the crawl (HTML, cleaned text, metadata, etc.). We'll explore this object in detail in [Chapter 7: Understanding the Results - CrawlResult](07_crawlresult.md).
6. **`result.success`**: We check this boolean flag to see if the crawl completed without critical errors.
7. **Accessing Data:** If successful, we can access processed information like the page title (`result.metadata['title']`) or the content formatted as Markdown (`result.markdown.raw_markdown`); a small example of saving that Markdown to disk follows below.
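If you want to keep what the crawler produced, you can write it straight to disk. The short sketch below (the file names are invented for illustration) relies only on the `result.markdown.raw_markdown` field already used in the example above.

```python
# save_markdown_sketch.py (illustrative follow-up to the example above)
import asyncio
from pathlib import Path

from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://httpbin.org/html")
        if result.success:
            # Persist the Markdown the crawler generated so you can reuse it later.
            Path("httpbin_html.md").write_text(result.markdown.raw_markdown, encoding="utf-8")
            print("Saved Markdown to httpbin_html.md")
        else:
            print(f"Failed: {result.error_message}")

if __name__ == "__main__":
    asyncio.run(main())
```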
## Configuring the Crawl

Sometimes, the default behavior isn't quite what you need. Maybe you want to use the faster "drone" strategy from Chapter 1, or perhaps you want to ensure you *always* fetch a fresh copy of the page, ignoring any saved cache.

You can customize the behavior of a specific `arun` call by passing a `CrawlerRunConfig` object. Think of this as giving specific instructions to the General Manager for *this particular job*.

```python
# chapter2_example_2.py
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai import CrawlerRunConfig # Import configuration class
from crawl4ai import CacheMode # Import cache options

async def main():
    async with AsyncWebCrawler() as crawler:
        print("Crawler is ready!")
        url_to_crawl = "https://httpbin.org/html"

        # Create a specific configuration for this run
        # Tell the crawler to BYPASS the cache (fetch fresh)
        run_config = CrawlerRunConfig(
            cache_mode=CacheMode.BYPASS
        )
        print("Configuration: Bypass cache for this run.")

        # Pass the config object to the arun method
        result = await crawler.arun(
            url=url_to_crawl,
            config=run_config # Pass the specific instructions
        )

        if result.success:
            print("\nSuccess! Crawler got fresh content (cache bypassed).")
            print(f"Page Title: {result.metadata.get('title', 'N/A')}")
        else:
            print(f"\nFailed to crawl: {result.error_message}")

if __name__ == "__main__":
    asyncio.run(main())
```

**Explanation:**

1. **`from crawl4ai import CrawlerRunConfig, CacheMode`**: We import the necessary classes for configuration.
2. **`run_config = CrawlerRunConfig(...)`**: We create an instance of `CrawlerRunConfig`. This object holds various settings for a specific crawl job.
3. **`cache_mode=CacheMode.BYPASS`**: We set the `cache_mode`. `CacheMode.BYPASS` tells the crawler to ignore any previously saved results for this URL and fetch it directly from the web server. We'll learn all about caching options in [Chapter 9: Smart Fetching with Caching - CacheContext / CacheMode](09_cachecontext___cachemode.md).
4. **`crawler.arun(..., config=run_config)`**: We pass our custom `run_config` object to the `arun` method using the `config` parameter.

The `CrawlerRunConfig` is very powerful and lets you control many aspects of the crawl, including which scraping or extraction methods to use. We'll dive deep into it in the next chapter: [Chapter 3: Giving Instructions - CrawlerRunConfig](03_crawlerrunconfig.md).

## What Happens When You Call `arun`? (The Flow)

When you call `crawler.arun(url="...")`, the `AsyncWebCrawler` (our General Manager) springs into action and coordinates several steps behind the scenes:

```mermaid
sequenceDiagram
    participant U as User
    participant AWC as AsyncWebCrawler (Manager)
    participant CC as Cache Check
    participant CS as AsyncCrawlerStrategy (Fetcher)
    participant SP as Scraping/Processing
    participant CR as CrawlResult (Final Report)

    U->>AWC: arun("https://example.com", config)
    AWC->>CC: Need content for "https://example.com"? (Respect CacheMode in config)
    alt Cache Hit & Cache Mode allows reading
        CC-->>AWC: Yes, here's the cached result.
        AWC-->>CR: Package cached result.
        AWC-->>U: Here is the CrawlResult
    else Cache Miss or Cache Mode prevents reading
        CC-->>AWC: No cached result / Cannot read cache.
        AWC->>CS: Please fetch "https://example.com" (using configured strategy)
        CS-->>AWC: Here's the raw response (HTML, etc.)
        AWC->>SP: Process this raw content (Scrape, Filter, Extract based on config)
        SP-->>AWC: Here's the processed data (Markdown, Metadata, etc.)
        AWC->>CC: Cache this result? (Respect CacheMode in config)
        CC-->>AWC: OK, cached.
        AWC-->>CR: Package new result.
        AWC-->>U: Here is the CrawlResult
    end
```

**Simplified Steps:**

1. **Receive Request:** The `AsyncWebCrawler` gets the URL and configuration from your `arun` call.
2. **Check Cache:** It checks if a valid result for this URL is already saved (cached) and if the `CacheMode` allows using it. (See [Chapter 9](09_cachecontext___cachemode.md)).
3. **Fetch (if needed):** If no valid cached result exists or caching is bypassed, it asks the configured [AsyncCrawlerStrategy](01_asynccrawlerstrategy.md) (e.g., Playwright or HTTP) to fetch the raw page content.
4. **Process Content:** It takes the raw HTML and passes it through various processing steps based on the configuration:
    * **Scraping:** Cleaning up HTML, extracting basic structure using a [ContentScrapingStrategy](04_contentscrapingstrategy.md).
    * **Filtering:** Optionally filtering content for relevance using a [RelevantContentFilter](05_relevantcontentfilter.md).
    * **Extraction:** Optionally extracting specific structured data using an [ExtractionStrategy](06_extractionstrategy.md).
5. **Cache Result (if needed):** If caching is enabled for writing, it saves the final processed result.
6. **Return Result:** It bundles everything into a [CrawlResult](07_crawlresult.md) object and returns it to you.
## Crawling Many Pages: `arun_many`

What if you have a whole list of URLs to crawl? Calling `arun` in a loop works, but it might not be the most efficient way. `AsyncWebCrawler` provides the `arun_many` method designed for this.

```python
# chapter2_example_3.py
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

async def main():
    async with AsyncWebCrawler() as crawler:
        urls_to_crawl = [
            "https://httpbin.org/html",
            "https://httpbin.org/links/10/0",
            "https://httpbin.org/robots.txt"
        ]
        print(f"Asking crawler to fetch {len(urls_to_crawl)} URLs.")

        # Use arun_many for multiple URLs
        # We can still pass a config that applies to all URLs in the batch
        config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
        results = await crawler.arun_many(urls=urls_to_crawl, config=config)

        print(f"\nFinished crawling! Got {len(results)} results.")
        for result in results:
            status = "Success" if result.success else "Failed"
            url_short = result.url.split('/')[-1] # Get last part of URL
            print(f"- URL: {url_short:<10} | Status: {status:<7} | Title: {result.metadata.get('title', 'N/A')}")

if __name__ == "__main__":
    asyncio.run(main())
```

**Explanation:**

1. **`urls_to_crawl = [...]`**: We define a list of URLs.
2. **`await crawler.arun_many(urls=urls_to_crawl, config=config)`**: We call `arun_many`, passing the list of URLs. It handles crawling them concurrently (like dispatching multiple delivery trucks or drones efficiently).
3. **`results`**: `arun_many` returns a list where each item is a `CrawlResult` object corresponding to one of the input URLs.

`arun_many` is much more efficient for batch processing as it leverages `asyncio` to handle multiple fetches and processing tasks concurrently. It uses a [BaseDispatcher](10_basedispatcher.md) internally to manage this concurrency.
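To get a feel for what the dispatcher is doing for you, here is roughly how you might batch `arun` calls yourself with plain `asyncio`. This is only a sketch of the idea; `arun_many` and its dispatcher add smarter scheduling (such as adapting concurrency to available memory) on top of something like this, so prefer `arun_many` in real code.

```python
# manual_batch_sketch.py (illustrative only; prefer arun_many in real code)
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

async def crawl_all(urls, max_concurrent=3):
    config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
    semaphore = asyncio.Semaphore(max_concurrent)  # cap how many pages load at once

    async with AsyncWebCrawler() as crawler:
        async def crawl_one(url):
            async with semaphore:
                return await crawler.arun(url=url, config=config)

        # Run the (limited) tasks concurrently and keep results in input order.
        return await asyncio.gather(*(crawl_one(u) for u in urls))

if __name__ == "__main__":
    urls = ["https://httpbin.org/html", "https://httpbin.org/robots.txt"]
    for result in asyncio.run(crawl_all(urls)):
        print(result.url, "->", "OK" if result.success else result.error_message)
```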
## Under the Hood (A Peek at the Code)

You don't need to know the internal details to use `AsyncWebCrawler`, but seeing the structure can help. Inside the `crawl4ai` library, the file `async_webcrawler.py` defines this class.

```python
# Simplified from async_webcrawler.py

# ... imports ...
from .async_crawler_strategy import AsyncCrawlerStrategy, AsyncPlaywrightCrawlerStrategy
from .async_configs import BrowserConfig, CrawlerRunConfig
from .models import CrawlResult
from .cache_context import CacheContext, CacheMode
# ... other strategy imports ...

class AsyncWebCrawler:
    def __init__(
        self,
        crawler_strategy: AsyncCrawlerStrategy = None, # You can provide a strategy...
        config: BrowserConfig = None, # Configuration for the browser
        # ... other parameters like logger, base_directory ...
    ):
        # If no strategy is given, it defaults to Playwright (the 'truck')
        self.crawler_strategy = crawler_strategy or AsyncPlaywrightCrawlerStrategy(...)
        self.browser_config = config or BrowserConfig()
        # ... setup logger, directories, etc. ...
        self.ready = False # Flag to track if setup is complete

    async def __aenter__(self):
        # This is called when you use 'async with'. It starts the strategy.
        await self.crawler_strategy.__aenter__()
        await self.awarmup() # Perform internal setup
        self.ready = True
        return self

    async def __aexit__(self, exc_type, exc_val, exc_tb):
        # This is called when exiting 'async with'. It cleans up.
        await self.crawler_strategy.__aexit__(exc_type, exc_val, exc_tb)
        self.ready = False

    async def arun(self, url: str, config: CrawlerRunConfig = None) -> CrawlResult:
        # 1. Ensure config exists, set defaults (like CacheMode.ENABLED)
        crawler_config = config or CrawlerRunConfig()
        if crawler_config.cache_mode is None:
            crawler_config.cache_mode = CacheMode.ENABLED

        # 2. Create CacheContext to manage caching logic
        cache_context = CacheContext(url, crawler_config.cache_mode)

        # 3. Try reading from cache if allowed
        cached_result = None
        if cache_context.should_read():
            cached_result = await async_db_manager.aget_cached_url(url)

        # 4. If cache hit and valid, return cached result
        if cached_result and self._is_cache_valid(cached_result, crawler_config):
            # ... log cache hit ...
            return cached_result

        # 5. If no cache hit or cache invalid/bypassed: Fetch fresh content
        # Delegate to the configured AsyncCrawlerStrategy
        async_response = await self.crawler_strategy.crawl(url, config=crawler_config)

        # 6. Process the HTML (scrape, filter, extract)
        # This involves calling other strategies based on config
        crawl_result = await self.aprocess_html(
            url=url,
            html=async_response.html,
            config=crawler_config,
            # ... other details from async_response ...
        )

        # 7. Write to cache if allowed
        if cache_context.should_write():
            await async_db_manager.acache_url(crawl_result)

        # 8. Return the final CrawlResult
        return crawl_result

    async def aprocess_html(self, url: str, html: str, config: CrawlerRunConfig, ...) -> CrawlResult:
        # This internal method handles:
        # - Getting the configured ContentScrapingStrategy
        # - Calling its 'scrap' method
        # - Getting the configured MarkdownGenerationStrategy
        # - Calling its 'generate_markdown' method
        # - Getting the configured ExtractionStrategy (if any)
        # - Calling its 'run' method
        # - Packaging everything into a CrawlResult
        # ... implementation details ...
        pass # Simplified

    async def arun_many(self, urls: List[str], config: Optional[CrawlerRunConfig] = None, ...) -> List[CrawlResult]:
        # Uses a Dispatcher (like MemoryAdaptiveDispatcher)
        # to run self.arun for each URL concurrently.
        # ... implementation details using a dispatcher ...
        pass # Simplified

    # ... other methods like awarmup, close, caching helpers ...
```

The key takeaway is that `AsyncWebCrawler` doesn't do the fetching or detailed processing *itself*. It acts as the central hub, coordinating calls to the various specialized `Strategy` classes based on the provided configuration.
## Conclusion

You've met the General Manager: `AsyncWebCrawler`!

* It's the **main entry point** for using Crawl4AI.
* It **coordinates** all the steps: fetching, caching, scraping, extracting.
* You primarily interact with it using `async with` and the `arun()` (single URL) or `arun_many()` (multiple URLs) methods.
* It takes a URL and an optional `CrawlerRunConfig` object to customize the crawl.
* It returns a comprehensive `CrawlResult` object.

Now that you understand the central role of `AsyncWebCrawler`, let's explore how to give it detailed instructions for each crawling job.

**Next:** Let's dive into the specifics of configuration with [Chapter 3: Giving Instructions - CrawlerRunConfig](03_crawlerrunconfig.md).

---

Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge)
docs/Crawl4AI/03_crawlerrunconfig.md
# Chapter 3: Giving Instructions - CrawlerRunConfig

In [Chapter 2: Meet the General Manager - AsyncWebCrawler](02_asyncwebcrawler.md), we met the `AsyncWebCrawler`, the central coordinator for our web crawling tasks. We saw how to tell it *what* URL to crawl using the `arun` method.

But what if we want to tell the crawler *how* to crawl that URL? Maybe we want it to take a picture (screenshot) of the page? Or perhaps we only care about a specific section of the page? Or maybe we want to ignore the cache and get the very latest version?

Passing all these different instructions individually every time we call `arun` could get complicated and messy.

```python
# Imagine doing this every time - it gets long!
# result = await crawler.arun(
#     url="https://example.com",
#     take_screenshot=True,
#     ignore_cache=True,
#     only_look_at_this_part="#main-content",
#     wait_for_this_element="#data-table",
#     # ... maybe many more settings ...
# )
```

That's where `CrawlerRunConfig` comes in!

## What Problem Does `CrawlerRunConfig` Solve?

Think of `CrawlerRunConfig` as the **Instruction Manual** for a *specific* crawl job. Instead of giving the `AsyncWebCrawler` manager lots of separate instructions each time, you bundle them all neatly into a single `CrawlerRunConfig` object.

This object tells the `AsyncWebCrawler` exactly *how* to handle a particular URL or set of URLs for that specific run. It makes your code cleaner and easier to manage.

## What is `CrawlerRunConfig`?

`CrawlerRunConfig` is a configuration class that holds all the settings for a single crawl operation initiated by `AsyncWebCrawler.arun()` or `arun_many()`.

It allows you to customize various aspects of the crawl, such as:

* **Taking Screenshots:** Should the crawler capture an image of the page? (`screenshot`)
* **Waiting:** How long should the crawler wait for the page or specific elements to load? (`page_timeout`, `wait_for`)
* **Focusing Content:** Should the crawler only process a specific part of the page? (`css_selector`)
* **Extracting Data:** Should the crawler use a specific method to pull out structured data? ([ExtractionStrategy](06_extractionstrategy.md))
* **Caching:** How should the crawler interact with previously saved results? ([CacheMode](09_cachecontext___cachemode.md))
* **And much more!** (like handling JavaScript, filtering links, etc.)

## Using `CrawlerRunConfig`

Let's see how to use it. Remember our basic crawl from Chapter 2?

```python
# chapter3_example_1.py
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        url_to_crawl = "https://httpbin.org/html"
        print(f"Crawling {url_to_crawl} with default settings...")

        # This uses the default behavior (no specific config)
        result = await crawler.arun(url=url_to_crawl)

        if result.success:
            print("Success! Got the content.")
            print(f"Screenshot taken? {'Yes' if result.screenshot else 'No'}") # Likely No
            # We'll learn about CacheMode later, but it defaults to using the cache
        else:
            print(f"Failed: {result.error_message}")

if __name__ == "__main__":
    asyncio.run(main())
```
Now, let's say for this *specific* crawl, we want to bypass the cache (fetch fresh) and also take a screenshot.

We create a `CrawlerRunConfig` instance and pass it to `arun`:

```python
# chapter3_example_2.py
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai import CrawlerRunConfig # 1. Import the config class
from crawl4ai import CacheMode # Import cache options

async def main():
    async with AsyncWebCrawler() as crawler:
        url_to_crawl = "https://httpbin.org/html"
        print(f"Crawling {url_to_crawl} with custom settings...")

        # 2. Create an instance of CrawlerRunConfig with our desired settings
        my_instructions = CrawlerRunConfig(
            cache_mode=CacheMode.BYPASS, # Don't use the cache, fetch fresh
            screenshot=True # Take a screenshot
        )
        print("Instructions: Bypass cache, take screenshot.")

        # 3. Pass the config object to arun()
        result = await crawler.arun(
            url=url_to_crawl,
            config=my_instructions # Pass our instruction manual
        )

        if result.success:
            print("\nSuccess! Got the content with custom config.")
            print(f"Screenshot taken? {'Yes' if result.screenshot else 'No'}") # Should be Yes
            # result.screenshot holds the captured screenshot data (typically base64-encoded)
            if result.screenshot:
                print(f"Screenshot captured ({len(result.screenshot)} characters of data).")
        else:
            print(f"\nFailed: {result.error_message}")

if __name__ == "__main__":
    asyncio.run(main())
```

**Explanation:**

1. **Import:** We import `CrawlerRunConfig` and `CacheMode`.
2. **Create Config:** We create an instance: `my_instructions = CrawlerRunConfig(...)`. We set `cache_mode` to `CacheMode.BYPASS` and `screenshot` to `True`. All other settings remain at their defaults.
3. **Pass Config:** We pass this `my_instructions` object to `crawler.arun` using the `config=` parameter.

Now, when `AsyncWebCrawler` runs this job, it will look inside `my_instructions` and follow those specific settings for *this run only*.
## Some Common `CrawlerRunConfig` Parameters

`CrawlerRunConfig` has many options, but here are a few common ones you might use:

* **`cache_mode`**: Controls caching behavior.
    * `CacheMode.ENABLED` (Default): Use the cache if available, otherwise fetch and save.
    * `CacheMode.BYPASS`: Always fetch fresh, ignoring any cached version (but still save the new result).
    * `CacheMode.DISABLED`: Never read from or write to the cache.
    * *(More details in [Chapter 9: Smart Fetching with Caching - CacheContext / CacheMode](09_cachecontext___cachemode.md))*
* **`screenshot` (bool)**: If `True`, takes a screenshot of the fully rendered page. The captured screenshot data is stored in `CrawlResult.screenshot`. Default: `False`.
* **`pdf` (bool)**: If `True`, generates a PDF of the page. The PDF data is stored in `CrawlResult.pdf`. Default: `False`.
* **`css_selector` (str)**: If provided (e.g., `"#main-content"` or `".article-body"`), the crawler will try to extract *only* the HTML content within the element(s) matching this CSS selector. This is great for focusing on the important part of a page. Default: `None` (process the whole page).
* **`wait_for` (str)**: A CSS selector (e.g., `"#data-loaded-indicator"`). The crawler will wait until an element matching this selector appears on the page before proceeding. Useful for pages that load content dynamically with JavaScript. Default: `None`.
* **`page_timeout` (int)**: Maximum time in milliseconds to wait for page navigation or certain operations. Default: `60000` (60 seconds).
* **`extraction_strategy`**: An object that defines how to extract specific, structured data (like product names and prices) from the page. Default: `None`. *(See [Chapter 6: Getting Specific Data - ExtractionStrategy](06_extractionstrategy.md))*
* **`scraping_strategy`**: An object defining how the raw HTML is cleaned and basic content (like text and links) is extracted. Default: `WebScrapingStrategy()`. *(See [Chapter 4: Cleaning Up the Mess - ContentScrapingStrategy](04_contentscrapingstrategy.md))*

Let's try combining a few: focus on a specific part of the page and set a shorter page timeout.

```python
# chapter3_example_3.py
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    # This example page has an <h1> heading inside its <body>.
    url_to_crawl = "https://httpbin.org/html"
    async with AsyncWebCrawler() as crawler:
        print(f"Crawling {url_to_crawl}, focusing on the H1 tag...")

        # Instructions: Only get the H1 tag, wait at most 10s for the page
        specific_config = CrawlerRunConfig(
            css_selector="h1", # Only grab content inside <h1> tags
            page_timeout=10000 # Set page timeout to 10 seconds
            # We could also add wait_for="h1" if needed for dynamic loading
        )

        result = await crawler.arun(url=url_to_crawl, config=specific_config)

        if result.success:
            print("\nSuccess! Focused crawl completed.")
            # The markdown should now ONLY contain the H1 content
            print(f"Markdown content:\n---\n{result.markdown.raw_markdown.strip()}\n---")
        else:
            print(f"\nFailed: {result.error_message}")

if __name__ == "__main__":
    asyncio.run(main())
```

This time, the `result.markdown` should only contain the text from the `<h1>` tag on that page, because we used `css_selector="h1"` in our `CrawlerRunConfig`. (On that page, the only `<h1>` contains "Herman Melville - Moby-Dick", so that is roughly what the Markdown output should show.)
## How `AsyncWebCrawler` Uses the Config (Under the Hood)

You don't need to know the exact internal code, but it helps to understand the flow. When you call `crawler.arun(url, config=my_config)`, the `AsyncWebCrawler` essentially does this:

1. Receives the `url` and the `my_config` object.
2. Before fetching, it checks `my_config.cache_mode` to see if it should look in the cache first.
3. If fetching is needed, it passes `my_config` to the underlying [AsyncCrawlerStrategy](01_asynccrawlerstrategy.md).
4. The strategy uses settings from `my_config` like `page_timeout`, `wait_for`, and whether to take a `screenshot`.
5. After getting the raw HTML, `AsyncWebCrawler` uses the `my_config.scraping_strategy` and `my_config.css_selector` to process the content.
6. If `my_config.extraction_strategy` is set, it uses that to extract structured data.
7. Finally, it bundles everything into a `CrawlResult` and returns it.

Here's a simplified view:

```mermaid
sequenceDiagram
    participant User
    participant AWC as AsyncWebCrawler
    participant Config as CrawlerRunConfig
    participant Fetcher as AsyncCrawlerStrategy
    participant Processor as Scraping/Extraction

    User->>AWC: arun(url, config=my_config)
    AWC->>Config: Check my_config.cache_mode
    alt Need to Fetch
        AWC->>Fetcher: crawl(url, config=my_config)
        Note over Fetcher: Uses my_config settings (timeout, wait_for, screenshot...)
        Fetcher-->>AWC: Raw Response (HTML, screenshot?)
        AWC->>Processor: Process HTML (using my_config.css_selector, my_config.extraction_strategy...)
        Processor-->>AWC: Processed Data
    else Use Cache
        AWC->>AWC: Retrieve from Cache
    end
    AWC-->>User: Return CrawlResult
```

The `CrawlerRunConfig` acts as a messenger carrying your specific instructions throughout the crawling process.

Inside the `crawl4ai` library, in the file `async_configs.py`, you'll find the definition of the `CrawlerRunConfig` class. It looks something like this (simplified):

```python
# Simplified from crawl4ai/async_configs.py

from .cache_context import CacheMode
from .extraction_strategy import ExtractionStrategy
from .content_scraping_strategy import ContentScrapingStrategy, WebScrapingStrategy
# ... other imports ...

class CrawlerRunConfig():
    """
    Configuration class for controlling how the crawler runs each crawl operation.
    """
    def __init__(
        self,
        # Caching
        cache_mode: CacheMode = CacheMode.BYPASS, # Default behavior if not specified

        # Content Selection / Waiting
        css_selector: str = None,
        wait_for: str = None,
        page_timeout: int = 60000, # 60 seconds

        # Media
        screenshot: bool = False,
        pdf: bool = False,

        # Processing Strategies
        scraping_strategy: ContentScrapingStrategy = None, # Defaults internally if None
        extraction_strategy: ExtractionStrategy = None,

        # ... many other parameters omitted for clarity ...
        **kwargs # Allows for flexibility
    ):
        self.cache_mode = cache_mode
        self.css_selector = css_selector
        self.wait_for = wait_for
        self.page_timeout = page_timeout
        self.screenshot = screenshot
        self.pdf = pdf
        # Assign scraping strategy, ensuring a default if None is provided
        self.scraping_strategy = scraping_strategy or WebScrapingStrategy()
        self.extraction_strategy = extraction_strategy
        # ... initialize other attributes ...

    # Helper methods like 'clone', 'to_dict', 'from_kwargs' might exist too
    # ...
```

The key idea is that it's a class designed to hold various settings together. When you create an instance `CrawlerRunConfig(...)`, you're essentially creating an object that stores your choices for these parameters.
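One practical consequence of bundling settings into a single object is easy reuse. The snippet above mentions a `clone` helper; if your installed version provides it (treat this as an assumption), deriving a variation of a base configuration could look like the sketch below. Otherwise you can simply construct a second `CrawlerRunConfig` with the changed values.

```python
# config_reuse_sketch.py (assumes a CrawlerRunConfig.clone(**overrides) helper exists)
from crawl4ai import CrawlerRunConfig, CacheMode

# One base "instruction manual" shared by most crawls...
base_config = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS,
    page_timeout=30000,  # 30 seconds
)

# ...and a variant for the few pages where we also want a screenshot.
screenshot_config = base_config.clone(screenshot=True)

print(screenshot_config.screenshot)    # True
print(screenshot_config.page_timeout)  # 30000, carried over from base_config
```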
## Conclusion

You've learned about `CrawlerRunConfig`, the "Instruction Manual" for individual crawl jobs in Crawl4AI!

* It solves the problem of passing many settings individually to `AsyncWebCrawler`.
* You create an instance of `CrawlerRunConfig` and set the parameters you want to customize (like `cache_mode`, `screenshot`, `css_selector`, `wait_for`).
* You pass this config object to `crawler.arun(url, config=your_config)`.
* This makes your code cleaner and gives you fine-grained control over *how* each crawl is performed.

Now that we know how to fetch content ([AsyncCrawlerStrategy](01_asynccrawlerstrategy.md)), manage the overall process ([AsyncWebCrawler](02_asyncwebcrawler.md)), and give specific instructions ([CrawlerRunConfig](03_crawlerrunconfig.md)), let's look at how the raw, messy HTML fetched from the web is initially cleaned up and processed.

**Next:** Let's explore [Chapter 4: Cleaning Up the Mess - ContentScrapingStrategy](04_contentscrapingstrategy.md).

---

Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge)
docs/Crawl4AI/04_contentscrapingstrategy.md
# Chapter 4: Cleaning Up the Mess - ContentScrapingStrategy

In [Chapter 3: Giving Instructions - CrawlerRunConfig](03_crawlerrunconfig.md), we learned how to give specific instructions to our `AsyncWebCrawler` using `CrawlerRunConfig`. This included telling it *how* to fetch the page and potentially take screenshots or PDFs.

Now, imagine the crawler has successfully fetched the raw HTML content of a webpage. What's next? Raw HTML is often messy! It contains not just the main article or product description you might care about, but also:

* Navigation menus
* Advertisements
* Headers and footers
* Hidden code like JavaScript (`<script>`) and styling information (`<style>`)
* Comments left by developers

Before we can really understand the *meaning* of the page or extract specific important information, we need to clean up this mess and get a basic understanding of its structure.

## What Problem Does `ContentScrapingStrategy` Solve?

Think of the raw HTML fetched by the crawler as a very rough first draft of a book manuscript. It has the core story, but it's full of editor's notes, coffee stains, layout instructions for the printer, and maybe even doodles in the margins.

Before the *main* editor (who focuses on plot and character) can work on it, someone needs to do an initial cleanup. This "First Pass Editor" would:

1. Remove the coffee stains and doodles (irrelevant stuff like ads, scripts, styles).
2. Identify the basic structure: chapter headings (like the page title), paragraph text, image captions (image alt text), and maybe a list of illustrations (links).
3. Produce a tidier version of the manuscript, ready for more detailed analysis.

In Crawl4AI, the `ContentScrapingStrategy` acts as this **First Pass Editor**. It takes the raw HTML and performs an initial cleanup and structure extraction. Its job is to transform the messy HTML into a more manageable format, identifying key elements like text content, links, images, and basic page metadata (like the title).

## What is `ContentScrapingStrategy`?

`ContentScrapingStrategy` is an abstract concept (like a job description) in Crawl4AI that defines *how* the initial processing of raw HTML should happen. It specifies *that* we need a method to clean HTML and extract basic structure, but the specific tools and techniques used can vary.

This allows Crawl4AI to be flexible. Different strategies might use different underlying libraries or have different performance characteristics.

## The Implementations: Meet the Editors

Crawl4AI provides concrete implementations (the actual editors doing the work) of this strategy:

1. **`WebScrapingStrategy` (The Default Editor):**
    * This is the strategy used by default if you don't specify otherwise.
    * It uses a popular Python library called `BeautifulSoup` behind the scenes to parse and manipulate the HTML.
    * It's generally robust and good at handling imperfect HTML.
    * Think of it as a reliable, experienced editor who does a thorough job.

2. **`LXMLWebScrapingStrategy` (The Speedy Editor):**
    * This strategy uses another powerful library called `lxml`.
    * `lxml` is often faster than `BeautifulSoup`, especially on large or complex pages.
    * Think of it as a very fast editor who might be slightly stricter about the manuscript's format but gets the job done quickly.

For most beginners, the default `WebScrapingStrategy` works perfectly fine! You usually don't need to worry about switching unless you encounter performance issues on very large-scale crawls (which is a more advanced topic).

## How It Works Conceptually

Here's the flow:

1. The [AsyncWebCrawler](02_asyncwebcrawler.md) receives the raw HTML from the [AsyncCrawlerStrategy](01_asynccrawlerstrategy.md) (the fetcher).
2. It looks at the [CrawlerRunConfig](03_crawlerrunconfig.md) to see which `ContentScrapingStrategy` to use (defaulting to `WebScrapingStrategy` if none is specified).
3. It hands the raw HTML over to the chosen strategy's `scrap` method (you can also call this method directly yourself; see the sketch after the diagram below).
4. The strategy parses the HTML, removes unwanted tags (like `<script>`, `<style>`, `<nav>`, `<aside>`, etc., based on its internal rules), extracts all links (`<a>` tags), images (`<img>` tags with their `alt` text), and metadata (like the `<title>` tag).
5. It returns the results packaged in a `ScrapingResult` object, containing the cleaned HTML, lists of links and media items, and extracted metadata.
6. The `AsyncWebCrawler` then takes this `ScrapingResult` and uses its contents (along with other info) to build the final [CrawlResult](07_crawlresult.md).
```mermaid
sequenceDiagram
    participant AWC as AsyncWebCrawler (Manager)
    participant Fetcher as AsyncCrawlerStrategy
    participant HTML as Raw HTML
    participant CSS as ContentScrapingStrategy (Editor)
    participant SR as ScrapingResult (Cleaned Draft)
    participant CR as CrawlResult (Final Report)

    AWC->>Fetcher: Fetch("https://example.com")
    Fetcher-->>AWC: Here's the Raw HTML
    AWC->>CSS: Please scrap this Raw HTML (using config)
    Note over CSS: Parsing HTML... Removing scripts, styles, ads... Extracting links, images, title...
    CSS-->>AWC: Here's the ScrapingResult (Cleaned HTML, Links, Media, Metadata)
    AWC->>CR: Combine ScrapingResult with other info
    AWC-->>User: Return final CrawlResult
```
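If you are curious, you can run steps 3 to 5 above by hand on a small HTML string and inspect what the "editor" hands back. The sketch below is illustrative only: it assumes `WebScrapingStrategy` can be imported from the package top level (as `LXMLWebScrapingStrategy` is in the example later in this chapter) and that the returned `ScrapingResult` exposes fields along the lines described above (metadata, links, media, cleaned HTML); the exact attribute names may differ in your version.

```python
# direct_scrap_sketch.py (illustrative; normally AsyncWebCrawler calls this for you)
from crawl4ai import WebScrapingStrategy  # assumed top-level export

raw_html = """
<html>
  <head><title>Tiny Page</title><script>console.log('noise')</script></head>
  <body>
    <h1>Hello</h1>
    <p>Some text with a <a href="https://example.com/about">link</a>.</p>
    <img src="cat.png" alt="A cat">
  </body>
</html>
"""

scraper = WebScrapingStrategy()
scraping_result = scraper.scrap(url="https://example.com/tiny", html=raw_html)

# Attribute names below follow this chapter's description and may vary by version.
print(scraping_result.metadata)            # e.g. {'title': 'Tiny Page', ...}
print(scraping_result.links)               # links found in the page
print(scraping_result.media)               # images (with alt text), videos, audio
print(scraping_result.cleaned_html[:120])  # HTML with <script>/<style> removed
```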
## Using the Default Strategy (`WebScrapingStrategy`)

You're likely already using it without realizing it! When you run a basic crawl, `AsyncWebCrawler` automatically employs `WebScrapingStrategy`.

```python
# chapter4_example_1.py
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

async def main():
    # Uses the default AsyncPlaywrightCrawlerStrategy (fetching)
    # AND the default WebScrapingStrategy (scraping/cleaning)
    async with AsyncWebCrawler() as crawler:
        url_to_crawl = "https://httpbin.org/html" # A very simple HTML page

        # We don't specify a scraping_strategy in the config, so it uses the default
        config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS) # Fetch fresh

        print(f"Crawling {url_to_crawl} using default scraping strategy...")
        result = await crawler.arun(url=url_to_crawl, config=config)

        if result.success:
            print("\nSuccess! Content fetched and scraped.")
            # The 'result' object now contains info processed by WebScrapingStrategy

            # 1. Metadata extracted (e.g., page title)
            print(f"Page Title: {result.metadata.get('title', 'N/A')}")

            # 2. Links extracted
            print(f"Found {len(result.links.internal)} internal links and {len(result.links.external)} external links.")
            # Example: print first external link if exists
            if result.links.external:
                print(f"  Example external link: {result.links.external[0].href}")

            # 3. Media extracted (images, videos, etc.)
            print(f"Found {len(result.media.images)} images.")
            # Example: print first image alt text if exists
            if result.media.images:
                print(f"  Example image alt text: '{result.media.images[0].alt}'")

            # 4. Cleaned HTML (scripts, styles etc. removed) - might still be complex
            # print(f"\nCleaned HTML snippet:\n---\n{result.cleaned_html[:200]}...\n---")

            # 5. Markdown representation (generated AFTER scraping)
            print(f"\nMarkdown snippet:\n---\n{result.markdown.raw_markdown[:200]}...\n---")

        else:
            print(f"\nFailed: {result.error_message}")

if __name__ == "__main__":
    asyncio.run(main())
```

**Explanation:**

1. We create `AsyncWebCrawler` and `CrawlerRunConfig` as usual.
2. We **don't** set the `scraping_strategy` parameter in `CrawlerRunConfig`. Crawl4AI automatically picks `WebScrapingStrategy`.
3. When `crawler.arun` executes, after fetching the HTML, it internally calls `WebScrapingStrategy.scrap()`.
4. The `result` (a [CrawlResult](07_crawlresult.md) object) contains fields populated by the scraping strategy:
    * `result.metadata`: Contains things like the page title found in `<title>` tags.
    * `result.links`: Contains lists of internal and external links found (`<a>` tags).
    * `result.media`: Contains lists of images (`<img>`), videos (`<video>`), etc.
    * `result.cleaned_html`: The HTML after the strategy removed unwanted tags and attributes (this is then used to generate the Markdown).
    * `result.markdown`: While not *directly* created by the scraping strategy, the cleaned HTML it produces is the input for generating the Markdown representation.
|
||||
|
||||
## Explicitly Choosing a Strategy (e.g., `LXMLWebScrapingStrategy`)
|
||||
|
||||
What if you want to try the potentially faster `LXMLWebScrapingStrategy`? You can specify it in the `CrawlerRunConfig`.
|
||||
|
||||
```python
|
||||
# chapter4_example_2.py
|
||||
import asyncio
|
||||
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
|
||||
# 1. Import the specific strategy you want to use
|
||||
from crawl4ai import LXMLWebScrapingStrategy
|
||||
|
||||
async def main():
|
||||
# 2. Create an instance of the desired scraping strategy
|
||||
lxml_editor = LXMLWebScrapingStrategy()
|
||||
print(f"Using scraper: {lxml_editor.__class__.__name__}")
|
||||
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
url_to_crawl = "https://httpbin.org/html"
|
||||
|
||||
# 3. Create a CrawlerRunConfig and pass the strategy instance
|
||||
config = CrawlerRunConfig(
|
||||
cache_mode=CacheMode.BYPASS,
|
||||
scraping_strategy=lxml_editor # Tell the config which strategy to use
|
||||
)
|
||||
|
||||
print(f"Crawling {url_to_crawl} with explicit LXML scraping strategy...")
|
||||
result = await crawler.arun(url=url_to_crawl, config=config)
|
||||
|
||||
if result.success:
|
||||
print("\nSuccess! Content fetched and scraped using LXML.")
|
||||
print(f"Page Title: {result.metadata.get('title', 'N/A')}")
|
||||
print(f"Found {len(result.links.external)} external links.")
|
||||
# Output should be largely the same as the default strategy for simple pages
|
||||
else:
|
||||
print(f"\nFailed: {result.error_message}")
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
```
|
||||
|
||||
**Explanation:**
|
||||
|
||||
1. **Import:** We import `LXMLWebScrapingStrategy` alongside the other classes.
|
||||
2. **Instantiate:** We create an instance: `lxml_editor = LXMLWebScrapingStrategy()`.
|
||||
3. **Configure:** We create `CrawlerRunConfig` and pass our instance to the `scraping_strategy` parameter: `CrawlerRunConfig(..., scraping_strategy=lxml_editor)`.
|
||||
4. **Run:** Now, when `crawler.arun` is called with this config, it will use `LXMLWebScrapingStrategy` instead of the default `WebScrapingStrategy` for the initial HTML processing step.
|
||||
|
||||
For simple pages, the results from both strategies will often be very similar. The choice typically comes down to performance considerations in more advanced scenarios.
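Because both scrapers plug into the same `scraping_strategy` slot, switching between them is a one-line change. The sketch below assumes `WebScrapingStrategy` can be imported from the top-level `crawl4ai` package just like `LXMLWebScrapingStrategy` was above; if your version exposes it elsewhere, adjust the import accordingly.

```python
# chapter4_sketch_switch_strategy.py (illustrative sketch, not from the library docs)
from crawl4ai import CrawlerRunConfig, CacheMode, WebScrapingStrategy, LXMLWebScrapingStrategy

# Flip this flag to compare the two scrapers with an otherwise identical run setup.
use_lxml = True

scraper = LXMLWebScrapingStrategy() if use_lxml else WebScrapingStrategy()

config = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS,
    scraping_strategy=scraper,  # the rest of the crawl configuration stays the same
)
print(f"This run would use: {scraper.__class__.__name__}")
```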
|
||||
|
||||
## A Glimpse Under the Hood
|
||||
|
||||
Inside the `crawl4ai` library, the file `content_scraping_strategy.py` defines the blueprint and the implementations.
|
||||
|
||||
**The Blueprint (Abstract Base Class):**
|
||||
|
||||
```python
|
||||
# Simplified from crawl4ai/content_scraping_strategy.py
|
||||
from abc import ABC, abstractmethod
|
||||
from .models import ScrapingResult # Defines the structure of the result
|
||||
|
||||
class ContentScrapingStrategy(ABC):
|
||||
"""Abstract base class for content scraping strategies."""
|
||||
|
||||
@abstractmethod
|
||||
def scrap(self, url: str, html: str, **kwargs) -> ScrapingResult:
|
||||
"""
|
||||
Synchronous method to scrape content.
|
||||
Takes raw HTML, returns structured ScrapingResult.
|
||||
"""
|
||||
pass
|
||||
|
||||
@abstractmethod
|
||||
async def ascrap(self, url: str, html: str, **kwargs) -> ScrapingResult:
|
||||
"""
|
||||
Asynchronous method to scrape content.
|
||||
Takes raw HTML, returns structured ScrapingResult.
|
||||
"""
|
||||
pass
|
||||
```
|
||||
|
||||
**The Implementations:**
|
||||
|
||||
```python
|
||||
# Simplified from crawl4ai/content_scraping_strategy.py
|
||||
import asyncio
from bs4 import BeautifulSoup # Library used by WebScrapingStrategy
|
||||
# ... other imports like models ...
|
||||
|
||||
class WebScrapingStrategy(ContentScrapingStrategy):
|
||||
def __init__(self, logger=None):
|
||||
self.logger = logger
|
||||
# ... potentially other setup ...
|
||||
|
||||
def scrap(self, url: str, html: str, **kwargs) -> ScrapingResult:
|
||||
# 1. Parse HTML using BeautifulSoup
|
||||
soup = BeautifulSoup(html, 'lxml') # Or another parser
|
||||
|
||||
# 2. Find the main content area (maybe using kwargs['css_selector'])
|
||||
# 3. Remove unwanted tags (scripts, styles, nav, footer, ads...)
|
||||
# 4. Extract metadata (title, description...)
|
||||
# 5. Extract all links (<a> tags)
|
||||
# 6. Extract all images (<img> tags) and other media
|
||||
# 7. Get the remaining cleaned HTML text content
|
||||
|
||||
# ... complex cleaning and extraction logic using BeautifulSoup methods ...
|
||||
|
||||
# 8. Package results into a ScrapingResult object
|
||||
cleaned_html_content = "<html><body>Cleaned content...</body></html>" # Placeholder
|
||||
links_data = Links(...)
|
||||
media_data = Media(...)
|
||||
metadata_dict = {"title": "Page Title"}
|
||||
|
||||
return ScrapingResult(
|
||||
cleaned_html=cleaned_html_content,
|
||||
links=links_data,
|
||||
media=media_data,
|
||||
metadata=metadata_dict,
|
||||
success=True
|
||||
)
|
||||
|
||||
async def ascrap(self, url: str, html: str, **kwargs) -> ScrapingResult:
|
||||
# Often delegates to the synchronous version for CPU-bound tasks
|
||||
return await asyncio.to_thread(self.scrap, url, html, **kwargs)
|
||||
|
||||
```
|
||||
|
||||
```python
|
||||
# Simplified from crawl4ai/content_scraping_strategy.py
|
||||
from lxml import html as lhtml # Library used by LXMLWebScrapingStrategy
|
||||
# ... other imports like models ...
|
||||
|
||||
class LXMLWebScrapingStrategy(WebScrapingStrategy): # Often inherits for shared logic
|
||||
def __init__(self, logger=None):
|
||||
super().__init__(logger)
|
||||
# ... potentially LXML specific setup ...
|
||||
|
||||
def scrap(self, url: str, html: str, **kwargs) -> ScrapingResult:
|
||||
# 1. Parse HTML using lxml
|
||||
doc = lhtml.document_fromstring(html)
|
||||
|
||||
# 2. Find main content, remove unwanted tags, extract info
|
||||
# ... complex cleaning and extraction logic using lxml's XPath or CSS selectors ...
|
||||
|
||||
# 3. Package results into a ScrapingResult object
|
||||
cleaned_html_content = "<html><body>Cleaned LXML content...</body></html>" # Placeholder
|
||||
links_data = Links(...)
|
||||
media_data = Media(...)
|
||||
metadata_dict = {"title": "Page Title LXML"}
|
||||
|
||||
return ScrapingResult(
|
||||
cleaned_html=cleaned_html_content,
|
||||
links=links_data,
|
||||
media=media_data,
|
||||
metadata=metadata_dict,
|
||||
success=True
|
||||
)
|
||||
|
||||
# ascrap might also delegate or have specific async optimizations
|
||||
```
|
||||
|
||||
The key takeaway is that both strategies implement the `scrap` (and `ascrap`) method, taking raw HTML and returning a structured `ScrapingResult`. The `AsyncWebCrawler` can use either one thanks to this common interface.
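Because the contract is simply "HTML in, `ScrapingResult` out", you can also exercise a strategy on its own, without a crawler, which is handy for quick experiments or unit tests. This is only a sketch that relies on the `scrap()` signature and `ScrapingResult` fields shown above; it assumes `WebScrapingStrategy` is importable from the top-level package.

```python
# chapter4_sketch_direct_scrap.py (illustrative sketch)
from crawl4ai import WebScrapingStrategy

raw_html = """
<html>
  <head><title>Tiny Test Page</title></head>
  <body>
    <h1>Hello</h1>
    <p>Some text with a <a href="https://example.com">link</a>.</p>
    <img src="/logo.png" alt="Logo">
  </body>
</html>
"""

scraper = WebScrapingStrategy()
# Same call the crawler makes internally: raw HTML in, ScrapingResult out.
scraping_result = scraper.scrap(url="https://example.com/test", html=raw_html)

print("Title:  ", scraping_result.metadata.get("title"))
print("Links:  ", len(scraping_result.links.external), "external")
print("Images: ", len(scraping_result.media.images))
print("Cleaned:", scraping_result.cleaned_html[:80], "...")
```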
|
||||
|
||||
## Conclusion
|
||||
|
||||
You've learned about `ContentScrapingStrategy`, Crawl4AI's "First Pass Editor" for raw HTML.
|
||||
|
||||
* It tackles the problem of messy HTML by cleaning it and extracting basic structure.
|
||||
* It acts as a blueprint, with `WebScrapingStrategy` (default, using BeautifulSoup) and `LXMLWebScrapingStrategy` (using lxml) as concrete implementations.
|
||||
* It's used automatically by `AsyncWebCrawler` after fetching content.
|
||||
* You can specify which strategy to use via `CrawlerRunConfig`.
|
||||
* Its output (cleaned HTML, links, media, metadata) is packaged into a `ScrapingResult` and contributes significantly to the final `CrawlResult`.
|
||||
|
||||
Now that we have this initially cleaned and structured content, we might want to further filter it. What if we only care about the parts of the page that are *relevant* to a specific topic?
|
||||
|
||||
**Next:** Let's explore how to filter content for relevance with [Chapter 5: Focusing on What Matters - RelevantContentFilter](05_relevantcontentfilter.md).
|
||||
|
||||
---
|
||||
|
||||
Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge)
|
||||
425
docs/Crawl4AI/05_relevantcontentfilter.md
Normal file
@@ -0,0 +1,425 @@
|
||||
# Chapter 5: Focusing on What Matters - RelevantContentFilter
|
||||
|
||||
In [Chapter 4: Cleaning Up the Mess - ContentScrapingStrategy](04_contentscrapingstrategy.md), we learned how Crawl4AI takes the raw, messy HTML from a webpage and cleans it up using a `ContentScrapingStrategy`. This gives us a tidier version of the HTML (`cleaned_html`) and extracts basic elements like links and images.
|
||||
|
||||
But even after this initial cleanup, the page might still contain a lot of "noise" relative to what we *actually* care about. Imagine a news article page: the `ContentScrapingStrategy` might remove scripts and styles, but it could still leave the main article text, plus related article links, user comments, sidebars with ads, and maybe a lengthy footer.
|
||||
|
||||
If our goal is just to get the main article content (e.g., to summarize it or feed it to an AI), all that extra stuff is just noise. How can we filter the cleaned content even further to keep only the truly relevant parts?
|
||||
|
||||
## What Problem Does `RelevantContentFilter` Solve?
|
||||
|
||||
Think of the `cleaned_html` from the previous step like flour that's been roughly sifted – the biggest lumps are gone, but there might still be smaller clumps or bran mixed in. If you want super fine flour for a delicate cake, you need a finer sieve.
|
||||
|
||||
`RelevantContentFilter` acts as this **finer sieve** or a **Relevance Sieve**. It's a strategy applied *after* the initial cleaning by `ContentScrapingStrategy` but *before* the final processing (like generating the final Markdown output or using an AI for extraction). Its job is to go through the cleaned content and decide which parts are truly relevant to our goal, removing the rest.
|
||||
|
||||
This helps us:
|
||||
|
||||
1. **Reduce Noise:** Eliminate irrelevant sections like comments, footers, navigation bars, or tangential "related content" blocks.
|
||||
2. **Focus AI:** If we're sending the content to a Large Language Model (LLM), feeding it only the most relevant parts saves processing time (and potentially money) and can lead to better results.
|
||||
3. **Improve Accuracy:** By removing distracting noise, subsequent steps like data extraction are less likely to grab the wrong information.
|
||||
|
||||
## What is `RelevantContentFilter`?
|
||||
|
||||
`RelevantContentFilter` is an abstract concept (a blueprint) in Crawl4AI representing a **method for identifying and retaining only the relevant portions of cleaned HTML content**. It defines *that* we need a way to filter for relevance, but the specific technique used can vary.
|
||||
|
||||
This allows us to choose different filtering approaches depending on the task and the type of content.
|
||||
|
||||
## The Different Filters: Tools for Sieving
|
||||
|
||||
Crawl4AI provides several concrete implementations (the actual sieves) of `RelevantContentFilter`:
|
||||
|
||||
1. **`BM25ContentFilter` (The Keyword Sieve):**
|
||||
* **Analogy:** Like a mini search engine operating *within* the webpage.
|
||||
* **How it Works:** You give it (or it figures out) some keywords related to what you're looking for (e.g., from a user query like "product specifications" or derived from the page title). It then uses a search algorithm called BM25 to score different chunks of the cleaned HTML based on how relevant they are to those keywords. Only the chunks scoring above a certain threshold are kept.
|
||||
* **Good For:** Finding specific sections about a known topic within a larger page (e.g., finding only the paragraphs discussing "climate change impact" on a long environmental report page).
|
||||
|
||||
2. **`PruningContentFilter` (The Structural Sieve):**
|
||||
* **Analogy:** Like a gardener pruning a bush, removing weak or unnecessary branches based on their structure.
|
||||
* **How it Works:** This filter doesn't care about keywords. Instead, it looks at the *structure* and *characteristics* of the HTML elements. It removes elements that often represent noise, such as those with very little text compared to the number of links (low text density), elements with common "noise" words in their CSS classes or IDs (like `sidebar`, `comments`, `footer`), or elements deemed structurally insignificant.
|
||||
* **Good For:** Removing common boilerplate sections (like headers, footers, simple sidebars, navigation) based purely on layout and density clues, even if you don't have a specific topic query.
|
||||
|
||||
3. **`LLMContentFilter` (The AI Sieve):**
|
||||
* **Analogy:** Asking a smart assistant to read the cleaned content and pick out only the parts relevant to your request.
|
||||
* **How it Works:** This filter sends the cleaned HTML (often broken into manageable chunks) to a Large Language Model (like GPT). You provide an instruction (e.g., "Extract only the main article content, removing all comments and related links" or "Keep only the sections discussing financial results"). The AI uses its understanding of language and context to identify and return only the relevant parts, often already formatted nicely (like in Markdown).
|
||||
* **Good For:** Handling complex relevance decisions that require understanding meaning and context, following nuanced natural language instructions. (Note: Requires configuring LLM access, like API keys, and can be slower and potentially costlier than other methods).
|
||||
|
||||
## How `RelevantContentFilter` is Used (Via Markdown Generation)
|
||||
|
||||
In Crawl4AI, the `RelevantContentFilter` is typically integrated into the **Markdown generation** step. The standard markdown generator (`DefaultMarkdownGenerator`) can accept a `RelevantContentFilter` instance.
|
||||
|
||||
When configured this way:
|
||||
|
||||
1. The `AsyncWebCrawler` fetches the page and uses the `ContentScrapingStrategy` to get `cleaned_html`.
|
||||
2. It then calls the `DefaultMarkdownGenerator` to produce the Markdown output.
|
||||
3. The generator first creates the standard, "raw" Markdown from the *entire* `cleaned_html`.
|
||||
4. **If** a `RelevantContentFilter` was provided to the generator, it then uses this filter on the `cleaned_html` to select only the relevant HTML fragments.
|
||||
5. It converts *these filtered fragments* into Markdown. This becomes the `fit_markdown`.
|
||||
|
||||
So, the `CrawlResult` will contain *both*:
|
||||
* `result.markdown.raw_markdown`: Markdown based on the full `cleaned_html`.
|
||||
* `result.markdown.fit_markdown`: Markdown based *only* on the parts deemed relevant by the filter.
|
||||
|
||||
Let's see how to configure this.
|
||||
|
||||
### Example 1: Using `BM25ContentFilter` to find specific content
|
||||
|
||||
Imagine we crawled a page about renewable energy, but we only want the parts specifically discussing **solar power**.
|
||||
|
||||
```python
|
||||
# chapter5_example_1.py
|
||||
import asyncio
|
||||
from crawl4ai import (
|
||||
AsyncWebCrawler,
|
||||
CrawlerRunConfig,
|
||||
DefaultMarkdownGenerator, # The standard markdown generator
|
||||
BM25ContentFilter # The keyword-based filter
|
||||
)
|
||||
|
||||
async def main():
|
||||
# 1. Create the BM25 filter with our query
|
||||
solar_filter = BM25ContentFilter(user_query="solar power technology")
|
||||
print(f"Filter created for query: '{solar_filter.user_query}'")
|
||||
|
||||
# 2. Create a Markdown generator that USES this filter
|
||||
markdown_generator_with_filter = DefaultMarkdownGenerator(
|
||||
content_filter=solar_filter
|
||||
)
|
||||
print("Markdown generator configured with BM25 filter.")
|
||||
|
||||
# 3. Create CrawlerRunConfig using this specific markdown generator
|
||||
run_config = CrawlerRunConfig(
|
||||
markdown_generator=markdown_generator_with_filter
|
||||
)
|
||||
|
||||
# 4. Run the crawl
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
# Example URL (replace with a real page having relevant content)
|
||||
url_to_crawl = "https://en.wikipedia.org/wiki/Renewable_energy"
|
||||
print(f"\nCrawling {url_to_crawl}...")
|
||||
|
||||
result = await crawler.arun(url=url_to_crawl, config=run_config)
|
||||
|
||||
if result.success:
|
||||
print("\nCrawl successful!")
|
||||
print(f"Raw Markdown length: {len(result.markdown.raw_markdown)}")
|
||||
print(f"Fit Markdown length: {len(result.markdown.fit_markdown)}")
|
||||
|
||||
# The fit_markdown should be shorter and focused on solar power
|
||||
print("\n--- Start of Fit Markdown (Solar Power Focus) ---")
|
||||
# Print first 500 chars of the filtered markdown
|
||||
print(result.markdown.fit_markdown[:500] + "...")
|
||||
print("--- End of Fit Markdown Snippet ---")
|
||||
else:
|
||||
print(f"\nCrawl failed: {result.error_message}")
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
```
|
||||
|
||||
**Explanation:**
|
||||
|
||||
1. **Create Filter:** We make an instance of `BM25ContentFilter`, telling it we're interested in "solar power technology".
|
||||
2. **Create Generator:** We make an instance of `DefaultMarkdownGenerator` and pass our `solar_filter` to its `content_filter` parameter.
|
||||
3. **Configure Run:** We create `CrawlerRunConfig` and tell it to use our special `markdown_generator_with_filter` for this run.
|
||||
4. **Crawl & Check:** We run the crawl as usual. In the `result`, `result.markdown.raw_markdown` will have the markdown for the whole page, while `result.markdown.fit_markdown` will *only* contain markdown derived from the HTML parts that the `BM25ContentFilter` scored highly for relevance to "solar power technology". You'll likely see that the `fit_markdown` is significantly shorter. (See the sketch below for tuning how strict the filter is.)
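How strict the sieve is can be tuned: as the simplified source later in this chapter shows, `BM25ContentFilter` accepts a `bm25_threshold` argument (default `1.0`). Raising it keeps only the highest-scoring chunks; lowering it lets more borderline content through. A hedged sketch:

```python
# chapter5_sketch_bm25_threshold.py (illustrative sketch)
from crawl4ai import DefaultMarkdownGenerator, BM25ContentFilter

# A stricter filter: only chunks scoring clearly above the default cutoff survive.
strict_filter = BM25ContentFilter(
    user_query="solar power technology",
    bm25_threshold=1.5,  # default is 1.0 per the simplified source shown later
)

markdown_generator = DefaultMarkdownGenerator(content_filter=strict_filter)
# Pass markdown_generator to CrawlerRunConfig(markdown_generator=...) exactly as in Example 1.
```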
|
||||
|
||||
### Example 2: Using `PruningContentFilter` to remove boilerplate
|
||||
|
||||
Now, let's try removing common noise like sidebars or footers based on structure, without needing a specific query.
|
||||
|
||||
```python
|
||||
# chapter5_example_2.py
|
||||
import asyncio
|
||||
from crawl4ai import (
|
||||
AsyncWebCrawler,
|
||||
CrawlerRunConfig,
|
||||
DefaultMarkdownGenerator,
|
||||
PruningContentFilter # The structural filter
|
||||
)
|
||||
|
||||
async def main():
|
||||
# 1. Create the Pruning filter (no query needed)
|
||||
pruning_filter = PruningContentFilter()
|
||||
print("Filter created: PruningContentFilter (structural)")
|
||||
|
||||
# 2. Create a Markdown generator that uses this filter
|
||||
markdown_generator_with_filter = DefaultMarkdownGenerator(
|
||||
content_filter=pruning_filter
|
||||
)
|
||||
print("Markdown generator configured with Pruning filter.")
|
||||
|
||||
# 3. Create CrawlerRunConfig using this generator
|
||||
run_config = CrawlerRunConfig(
|
||||
markdown_generator=markdown_generator_with_filter
|
||||
)
|
||||
|
||||
# 4. Run the crawl
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
# Example URL (replace with a real page that has boilerplate)
|
||||
url_to_crawl = "https://www.python.org/" # Python homepage likely has headers/footers
|
||||
print(f"\nCrawling {url_to_crawl}...")
|
||||
|
||||
result = await crawler.arun(url=url_to_crawl, config=run_config)
|
||||
|
||||
if result.success:
|
||||
print("\nCrawl successful!")
|
||||
print(f"Raw Markdown length: {len(result.markdown.raw_markdown)}")
|
||||
print(f"Fit Markdown length: {len(result.markdown.fit_markdown)}")
|
||||
|
||||
# fit_markdown should have less header/footer/sidebar content
|
||||
print("\n--- Start of Fit Markdown (Pruned) ---")
|
||||
print(result.markdown.fit_markdown[:500] + "...")
|
||||
print("--- End of Fit Markdown Snippet ---")
|
||||
else:
|
||||
print(f"\nCrawl failed: {result.error_message}")
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
```
|
||||
|
||||
**Explanation:**
|
||||
|
||||
The structure is the same as the BM25 example, but:
|
||||
|
||||
1. We instantiate `PruningContentFilter()`, which doesn't require a `user_query`.
|
||||
2. We pass this filter to the `DefaultMarkdownGenerator`.
|
||||
3. The resulting `result.markdown.fit_markdown` should contain Markdown primarily from the main content areas of the page, with structurally identified boilerplate removed.
|
||||
|
||||
### Example 3: Using `LLMContentFilter` (Conceptual)
|
||||
|
||||
Using `LLMContentFilter` follows the same pattern, but requires setting up LLM provider details.
|
||||
|
||||
```python
|
||||
# chapter5_example_3_conceptual.py
|
||||
import asyncio
|
||||
from crawl4ai import (
|
||||
AsyncWebCrawler,
|
||||
CrawlerRunConfig,
|
||||
DefaultMarkdownGenerator,
|
||||
LLMContentFilter,
|
||||
# Assume LlmConfig is set up correctly (see LLM-specific docs)
|
||||
# from crawl4ai.async_configs import LlmConfig
|
||||
)
|
||||
|
||||
# Assume llm_config is properly configured with API keys, provider, etc.
|
||||
# Example: llm_config = LlmConfig(provider="openai", api_token="env:OPENAI_API_KEY")
|
||||
# For this example, we'll pretend it's ready.
|
||||
class MockLlmConfig: # Mock for demonstration
|
||||
provider = "mock_provider"
|
||||
api_token = "mock_token"
|
||||
base_url = None
|
||||
llm_config = MockLlmConfig()
|
||||
|
||||
|
||||
async def main():
|
||||
# 1. Create the LLM filter with an instruction
|
||||
instruction = "Extract only the main news article content. Remove headers, footers, ads, comments, and related links."
|
||||
llm_filter = LLMContentFilter(
|
||||
instruction=instruction,
|
||||
llmConfig=llm_config # Pass the LLM configuration
|
||||
)
|
||||
print(f"Filter created: LLMContentFilter")
|
||||
print(f"Instruction: '{llm_filter.instruction}'")
|
||||
|
||||
# 2. Create a Markdown generator using this filter
|
||||
markdown_generator_with_filter = DefaultMarkdownGenerator(
|
||||
content_filter=llm_filter
|
||||
)
|
||||
print("Markdown generator configured with LLM filter.")
|
||||
|
||||
# 3. Create CrawlerRunConfig
|
||||
run_config = CrawlerRunConfig(
|
||||
markdown_generator=markdown_generator_with_filter
|
||||
)
|
||||
|
||||
# 4. Run the crawl
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
# Example URL (replace with a real news article)
|
||||
url_to_crawl = "https://httpbin.org/html" # Using simple page for demo
|
||||
print(f"\nCrawling {url_to_crawl}...")
|
||||
|
||||
# In a real scenario, this would call the LLM API
|
||||
result = await crawler.arun(url=url_to_crawl, config=run_config)
|
||||
|
||||
if result.success:
|
||||
print("\nCrawl successful!")
|
||||
# The fit_markdown would contain the AI-filtered content
|
||||
print("\n--- Start of Fit Markdown (AI Filtered - Conceptual) ---")
|
||||
# Because we used a mock LLM/simple page, fit_markdown might be empty or simple.
|
||||
# On a real page with a real LLM, it would ideally contain just the main article.
|
||||
print(result.markdown.fit_markdown[:500] + "...")
|
||||
print("--- End of Fit Markdown Snippet ---")
|
||||
else:
|
||||
print(f"\nCrawl failed: {result.error_message}")
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
```
|
||||
|
||||
**Explanation:**
|
||||
|
||||
1. We create `LLMContentFilter`, providing our natural language `instruction` and the necessary `llmConfig` (which holds provider details and API keys - mocked here for simplicity).
|
||||
2. We integrate it into `DefaultMarkdownGenerator` and `CrawlerRunConfig` as before.
|
||||
3. When `arun` is called, the `LLMContentFilter` would (in a real scenario) interact with the configured LLM API, sending chunks of the `cleaned_html` and the instruction, then assembling the AI's response into the `fit_markdown`.
|
||||
|
||||
## Under the Hood: How Filtering Fits In
|
||||
|
||||
The `RelevantContentFilter` doesn't run on its own; it's invoked by another component, typically the `DefaultMarkdownGenerator`.
|
||||
|
||||
Here's the sequence:
|
||||
|
||||
```mermaid
|
||||
sequenceDiagram
|
||||
participant User
|
||||
participant AWC as AsyncWebCrawler
|
||||
participant Config as CrawlerRunConfig
|
||||
participant Scraper as ContentScrapingStrategy
|
||||
participant MDGen as DefaultMarkdownGenerator
|
||||
participant Filter as RelevantContentFilter
|
||||
participant Result as CrawlResult
|
||||
|
||||
User->>AWC: arun(url, config=my_config)
|
||||
Note over AWC: Config includes Markdown Generator with a Filter
|
||||
AWC->>Scraper: scrap(raw_html)
|
||||
Scraper-->>AWC: cleaned_html, links, etc.
|
||||
AWC->>MDGen: generate_markdown(cleaned_html, config=my_config)
|
||||
Note over MDGen: Uses html2text for raw markdown
|
||||
MDGen-->>MDGen: raw_markdown = html2text(cleaned_html)
|
||||
Note over MDGen: Now, check for content_filter
|
||||
alt Filter Provided in MDGen
|
||||
MDGen->>Filter: filter_content(cleaned_html)
|
||||
Filter-->>MDGen: filtered_html_fragments
|
||||
Note over MDGen: Uses html2text on filtered fragments
|
||||
MDGen-->>MDGen: fit_markdown = html2text(filtered_html_fragments)
|
||||
else No Filter Provided
|
||||
MDGen-->>MDGen: fit_markdown = "" (or None)
|
||||
end
|
||||
Note over MDGen: Generate citations if needed
|
||||
MDGen-->>AWC: MarkdownGenerationResult (raw, fit, references)
|
||||
AWC->>Result: Package everything
|
||||
AWC-->>User: Return CrawlResult
|
||||
```
|
||||
|
||||
**Code Glimpse:**
|
||||
|
||||
Inside `crawl4ai/markdown_generation_strategy.py`, the `DefaultMarkdownGenerator`'s `generate_markdown` method has logic like this (simplified):
|
||||
|
||||
```python
|
||||
# Simplified from markdown_generation_strategy.py
|
||||
from typing import Optional
from .models import MarkdownGenerationResult
|
||||
from .html2text import CustomHTML2Text
|
||||
from .content_filter_strategy import RelevantContentFilter # Import filter base class
|
||||
|
||||
class DefaultMarkdownGenerator(MarkdownGenerationStrategy):
|
||||
# ... __init__ stores self.content_filter ...
|
||||
|
||||
def generate_markdown(
|
||||
self,
|
||||
cleaned_html: str,
|
||||
# ... other params like base_url, options ...
|
||||
content_filter: Optional[RelevantContentFilter] = None,
|
||||
**kwargs,
|
||||
) -> MarkdownGenerationResult:
|
||||
|
||||
h = CustomHTML2Text(...) # Setup html2text converter
|
||||
# ... apply options ...
|
||||
|
||||
# 1. Generate raw markdown from the full cleaned_html
|
||||
raw_markdown = h.handle(cleaned_html)
|
||||
# ... post-process raw_markdown ...
|
||||
|
||||
# 2. Convert links to citations (if enabled)
|
||||
markdown_with_citations, references_markdown = self.convert_links_to_citations(...)
|
||||
|
||||
# 3. Generate fit markdown IF a filter is available
|
||||
fit_markdown = ""
|
||||
filtered_html = ""
|
||||
# Use the filter passed directly, or the one stored during initialization
|
||||
active_filter = content_filter or self.content_filter
|
||||
if active_filter:
|
||||
try:
|
||||
# Call the filter's main method
|
||||
filtered_html_fragments = active_filter.filter_content(cleaned_html)
|
||||
# Join fragments (assuming filter returns list of HTML strings)
|
||||
filtered_html = "\n".join(filtered_html_fragments)
|
||||
# Convert ONLY the filtered HTML to markdown
|
||||
fit_markdown = h.handle(filtered_html)
|
||||
except Exception as e:
|
||||
fit_markdown = f"Error during filtering: {e}"
|
||||
# Log error...
|
||||
|
||||
return MarkdownGenerationResult(
|
||||
raw_markdown=raw_markdown,
|
||||
markdown_with_citations=markdown_with_citations,
|
||||
references_markdown=references_markdown,
|
||||
fit_markdown=fit_markdown, # Contains the filtered result
|
||||
fit_html=filtered_html, # The HTML fragments kept by the filter
|
||||
)
|
||||
|
||||
```
|
||||
|
||||
And inside `crawl4ai/content_filter_strategy.py`, you find the blueprint and implementations:
|
||||
|
||||
```python
|
||||
# Simplified from content_filter_strategy.py
|
||||
from abc import ABC, abstractmethod
|
||||
from typing import List
|
||||
# ... other imports like BeautifulSoup, BM25Okapi ...
|
||||
|
||||
class RelevantContentFilter(ABC):
|
||||
"""Abstract base class for content filtering strategies"""
|
||||
def __init__(self, user_query: str = None, ...):
|
||||
self.user_query = user_query
|
||||
# ... common setup ...
|
||||
|
||||
@abstractmethod
|
||||
def filter_content(self, html: str) -> List[str]:
|
||||
"""
|
||||
Takes cleaned HTML, returns a list of HTML fragments
|
||||
deemed relevant by the specific strategy.
|
||||
"""
|
||||
pass
|
||||
# ... common helper methods like extract_page_query, is_excluded ...
|
||||
|
||||
class BM25ContentFilter(RelevantContentFilter):
|
||||
def __init__(self, user_query: str = None, bm25_threshold: float = 1.0, ...):
|
||||
super().__init__(user_query)
|
||||
self.bm25_threshold = bm25_threshold
|
||||
# ... BM25 specific setup ...
|
||||
|
||||
def filter_content(self, html: str) -> List[str]:
|
||||
# 1. Parse HTML (e.g., with BeautifulSoup)
|
||||
# 2. Extract text chunks (candidates)
|
||||
# 3. Determine query (user_query or extracted)
|
||||
# 4. Tokenize query and chunks
|
||||
# 5. Calculate BM25 scores for chunks vs query
|
||||
# 6. Filter chunks based on score and threshold
|
||||
# 7. Return the HTML string of the selected chunks
|
||||
# ... implementation details ...
|
||||
relevant_html_fragments = ["<p>Relevant paragraph 1...</p>", "<h2>Relevant Section</h2>..."] # Placeholder
|
||||
return relevant_html_fragments
|
||||
|
||||
# ... Implementations for PruningContentFilter and LLMContentFilter ...
|
||||
```
|
||||
|
||||
The key is that each filter implements the `filter_content` method, returning the list of HTML fragments it considers relevant. The `DefaultMarkdownGenerator` then uses these fragments to create the `fit_markdown`.
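Since `filter_content()` simply takes an HTML string and returns the fragments it kept, you can also run a filter on its own to inspect what it would keep, without going through the crawler or the Markdown generator. A minimal sketch, assuming the filters are importable from the top-level package as in the examples above (the HTML snippet is made up for illustration):

```python
# chapter5_sketch_direct_filter.py (illustrative sketch)
from crawl4ai import PruningContentFilter

cleaned_html = """
<div id="content"><h1>Main Article</h1><p>Plenty of useful body text goes here ...</p></div>
<div class="sidebar"><a href="/a">Ad link</a> <a href="/b">Another ad</a></div>
<div class="footer"><p>Copyright Example Corp</p></div>
"""

pruner = PruningContentFilter()
kept_fragments = pruner.filter_content(cleaned_html)  # returns a list of HTML strings

print(f"Filter kept {len(kept_fragments)} fragment(s):")
for fragment in kept_fragments:
    print("-", fragment[:60].replace("\n", " "), "...")
```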
|
||||
|
||||
## Conclusion
|
||||
|
||||
You've learned about `RelevantContentFilter`, Crawl4AI's "Relevance Sieve"!
|
||||
|
||||
* It addresses the problem that even cleaned HTML can contain noise relative to a specific goal.
|
||||
* It acts as a strategy to filter cleaned HTML, keeping only the relevant parts.
|
||||
* Different filter types exist: `BM25ContentFilter` (keywords), `PruningContentFilter` (structure), and `LLMContentFilter` (AI/semantic).
|
||||
* It's typically used *within* the `DefaultMarkdownGenerator` to produce a focused `fit_markdown` output in the `CrawlResult`, alongside the standard `raw_markdown`.
|
||||
* You configure it by passing the chosen filter instance to the `DefaultMarkdownGenerator` and then passing that generator to the `CrawlerRunConfig`.
|
||||
|
||||
By using `RelevantContentFilter`, you can significantly improve the signal-to-noise ratio of the content you get from webpages, making downstream tasks like summarization or analysis more effective.
|
||||
|
||||
But what if just getting relevant *text* isn't enough? What if you need specific, *structured* data like product names, prices, and ratings from an e-commerce page, or names and affiliations from a list of conference speakers?
|
||||
|
||||
**Next:** Let's explore how to extract structured data with [Chapter 6: Getting Specific Data - ExtractionStrategy](06_extractionstrategy.md).
|
||||
|
||||
---
|
||||
|
||||
Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge)
|
||||
442
docs/Crawl4AI/06_extractionstrategy.md
Normal file
@@ -0,0 +1,442 @@
|
||||
# Chapter 6: Getting Specific Data - ExtractionStrategy
|
||||
|
||||
In the previous chapter, [Chapter 5: Focusing on What Matters - RelevantContentFilter](05_relevantcontentfilter.md), we learned how to sift through the cleaned webpage content to keep only the parts relevant to our query or goal, producing a focused `fit_markdown`. This is great for tasks like summarization or getting the main gist of an article.
|
||||
|
||||
But sometimes, we need more than just relevant text. Imagine you're analyzing an e-commerce website listing products. You don't just want the *description*; you need the exact **product name**, the specific **price**, the **customer rating**, and maybe the **SKU number**, all neatly organized. How do we tell Crawl4AI to find these *specific* pieces of information and return them in a structured format, like a JSON object?
|
||||
|
||||
## What Problem Does `ExtractionStrategy` Solve?
|
||||
|
||||
Think of the content we've processed so far (like the cleaned HTML or the generated Markdown) as a detailed report delivered by a researcher. `RelevantContentFilter` helped trim the report down to the most relevant pages.
|
||||
|
||||
Now, we need to give specific instructions to an **Analyst** to go through that focused report and pull out precise data points. We don't just want the report; we want a filled-in spreadsheet with columns for "Product Name," "Price," and "Rating."
|
||||
|
||||
`ExtractionStrategy` is the set of instructions we give to this Analyst. It defines *how* to locate and extract specific, structured information (like fields in a database or keys in a JSON object) from the content.
|
||||
|
||||
## What is `ExtractionStrategy`?
|
||||
|
||||
`ExtractionStrategy` is a core concept (a blueprint) in Crawl4AI that represents the **method used to extract structured data** from the processed content (which could be HTML or Markdown). It specifies *that* we need a way to find specific fields, but the actual *technique* used to find them can vary.
|
||||
|
||||
This allows us to choose the best "Analyst" for the job, depending on the complexity of the website and the data we need.
|
||||
|
||||
## The Different Analysts: Ways to Extract Data
|
||||
|
||||
Crawl4AI offers several concrete implementations (the different Analysts) for extracting structured data:
|
||||
|
||||
1. **The Precise Locator (`JsonCssExtractionStrategy` & `JsonXPathExtractionStrategy`)**
|
||||
* **Analogy:** An analyst who uses very precise map coordinates (CSS Selectors or XPath expressions) to find information on a page. They need to be told exactly where to look. "The price is always in the HTML element with the ID `#product-price`."
|
||||
* **How it works:** You define a **schema** (a Python dictionary) that maps the names of the fields you want (e.g., "product_name", "price") to the specific CSS selector (`JsonCssExtractionStrategy`) or XPath expression (`JsonXPathExtractionStrategy`) that locates that information within the HTML structure.
|
||||
* **Pros:** Very fast and reliable if the website structure is consistent and predictable. Doesn't require external AI services.
|
||||
* **Cons:** Can break easily if the website changes its layout (selectors become invalid). Requires you to inspect the HTML and figure out the correct selectors.
|
||||
* **Input:** Typically works directly on the raw or cleaned HTML.
|
||||
|
||||
2. **The Smart Interpreter (`LLMExtractionStrategy`)**
|
||||
* **Analogy:** A highly intelligent analyst who can *read and understand* the content. You give them a list of fields you need (a schema) or even just natural language instructions ("Find the product name, its price, and a short description"). They read the content (usually Markdown) and use their understanding of language and context to figure out the values, even if the layout isn't perfectly consistent.
|
||||
* **How it works:** You provide a desired output schema (e.g., a Pydantic model or a dictionary structure) or a natural language instruction. The strategy sends the content (often the generated Markdown, possibly split into chunks) along with your schema/instruction to a configured Large Language Model (LLM) like GPT or Llama. The LLM reads the text and generates the structured data (usually JSON) according to your request.
|
||||
* **Pros:** Much more resilient to website layout changes. Can understand context and handle variations. Can extract data based on meaning, not just location.
|
||||
* **Cons:** Requires setting up access to an LLM (API keys, potentially costs). Can be significantly slower than selector-based methods. The quality of extraction depends on the LLM's capabilities and the clarity of your instructions/schema.
|
||||
* **Input:** Often works best on the cleaned Markdown representation of the content, but can sometimes use HTML.
|
||||
|
||||
## How to Use an `ExtractionStrategy`
|
||||
|
||||
You tell the `AsyncWebCrawler` which extraction strategy to use (if any) by setting the `extraction_strategy` parameter within the [CrawlerRunConfig](03_crawlerrunconfig.md) object you pass to `arun` or `arun_many`.
|
||||
|
||||
### Example 1: Extracting Data with `JsonCssExtractionStrategy`
|
||||
|
||||
Let's imagine we want to extract the page title (from the `<title>` tag) and the main heading (from the `<h1>` tag) of the simple `httpbin.org/html` page.
|
||||
|
||||
```python
|
||||
# chapter6_example_1.py
|
||||
import asyncio
|
||||
import json
|
||||
from crawl4ai import (
|
||||
AsyncWebCrawler,
|
||||
CrawlerRunConfig,
|
||||
JsonCssExtractionStrategy # Import the CSS strategy
|
||||
)
|
||||
|
||||
async def main():
|
||||
# 1. Define the extraction schema (Field Name -> CSS Selector)
|
||||
extraction_schema = {
|
||||
"baseSelector": "body", # Operate within the body tag
|
||||
"fields": [
|
||||
{"name": "page_title", "selector": "title", "type": "text"},
|
||||
{"name": "main_heading", "selector": "h1", "type": "text"}
|
||||
]
|
||||
}
|
||||
print("Extraction Schema defined using CSS selectors.")
|
||||
|
||||
# 2. Create an instance of the strategy with the schema
|
||||
css_extractor = JsonCssExtractionStrategy(schema=extraction_schema)
|
||||
print(f"Using strategy: {css_extractor.__class__.__name__}")
|
||||
|
||||
# 3. Create CrawlerRunConfig and set the extraction_strategy
|
||||
run_config = CrawlerRunConfig(
|
||||
extraction_strategy=css_extractor
|
||||
)
|
||||
|
||||
# 4. Run the crawl
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
url_to_crawl = "https://httpbin.org/html"
|
||||
print(f"\nCrawling {url_to_crawl} to extract structured data...")
|
||||
|
||||
result = await crawler.arun(url=url_to_crawl, config=run_config)
|
||||
|
||||
if result.success and result.extracted_content:
|
||||
print("\nExtraction successful!")
|
||||
# The extracted data is stored as a JSON string in result.extracted_content
|
||||
# Parse the JSON string to work with the data as a Python object
|
||||
extracted_data = json.loads(result.extracted_content)
|
||||
print("Extracted Data:")
|
||||
# Print the extracted data nicely formatted
|
||||
print(json.dumps(extracted_data, indent=2))
|
||||
elif result.success:
|
||||
print("\nCrawl successful, but no structured data extracted.")
|
||||
else:
|
||||
print(f"\nCrawl failed: {result.error_message}")
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
```
|
||||
|
||||
**Explanation:**
|
||||
|
||||
1. **Schema Definition:** We create a Python dictionary `extraction_schema`.
|
||||
* `baseSelector: "body"` tells the strategy to look for items within the `<body>` tag of the HTML.
|
||||
* `fields` is a list of dictionaries, each defining a field to extract:
|
||||
* `name`: The key for this field in the output JSON (e.g., "page_title").
|
||||
* `selector`: The CSS selector to find the element containing the data (e.g., "title" finds the `<title>` tag, "h1" finds the `<h1>` tag).
|
||||
* `type`: How to get the data from the selected element (`"text"` means get the text content).
|
||||
2. **Instantiate Strategy:** We create an instance of `JsonCssExtractionStrategy`, passing our `extraction_schema`. This strategy knows its input format should be HTML.
|
||||
3. **Configure Run:** We create a `CrawlerRunConfig` and assign our `css_extractor` instance to the `extraction_strategy` parameter.
|
||||
4. **Crawl:** We run `crawler.arun`. After fetching and basic scraping, the `AsyncWebCrawler` will see the `extraction_strategy` in the config and call our `css_extractor`.
|
||||
5. **Result:** The `CrawlResult` object now contains a field called `extracted_content`. This field holds the structured data found by the strategy, formatted as a **JSON string**. We use `json.loads()` to convert this string back into a Python list/dictionary.
|
||||
|
||||
**Expected Output (Conceptual):**
|
||||
|
||||
```
|
||||
Extraction Schema defined using CSS selectors.
|
||||
Using strategy: JsonCssExtractionStrategy
|
||||
|
||||
Crawling https://httpbin.org/html to extract structured data...
|
||||
|
||||
Extraction successful!
|
||||
Extracted Data:
|
||||
[
|
||||
{
|
||||
"page_title": "Herman Melville - Moby-Dick",
|
||||
"main_heading": "Moby Dick"
|
||||
}
|
||||
]
|
||||
```
|
||||
*(Note: The actual output is a list containing one dictionary because `baseSelector: "body"` matches one element, and we extract fields relative to that.)*
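The same schema format scales to repeated items: if `baseSelector` matches many elements, the strategy loops over each match and emits one dictionary per element, and a field can pull an attribute instead of text (the simplified source later in this chapter shows an `"attribute"` type that reads `field_def["attribute"]`). A hedged sketch for a hypothetical product-listing page; the CSS classes are invented and would need to match the real page:

```python
# chapter6_sketch_product_cards.py (illustrative sketch; CSS classes are hypothetical)
from crawl4ai import JsonCssExtractionStrategy

product_schema = {
    "baseSelector": "div.product-card",   # one extracted item per matching card
    "fields": [
        {"name": "name",  "selector": "h2.product-title", "type": "text"},
        {"name": "price", "selector": "span.price",       "type": "text"},
        # "attribute" pulls a tag attribute instead of its text content
        {"name": "url",   "selector": "a.details-link",   "type": "attribute", "attribute": "href"},
    ],
}

css_extractor = JsonCssExtractionStrategy(schema=product_schema)
# Pass css_extractor to CrawlerRunConfig(extraction_strategy=...) exactly as in Example 1;
# result.extracted_content would then be a JSON list with one object per product card.
```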
|
||||
|
||||
### Example 2: Extracting Data with `LLMExtractionStrategy` (Conceptual)
|
||||
|
||||
Now, let's imagine we want the same information (title, heading) but using an AI. We'll provide a schema describing what we want. (Note: This requires setting up LLM access separately, e.g., API keys).
|
||||
|
||||
```python
|
||||
# chapter6_example_2.py
|
||||
import asyncio
|
||||
import json
|
||||
from crawl4ai import (
|
||||
AsyncWebCrawler,
|
||||
CrawlerRunConfig,
|
||||
LLMExtractionStrategy, # Import the LLM strategy
|
||||
LlmConfig # Import LLM configuration helper
|
||||
)
|
||||
|
||||
# Assume llm_config is properly configured with provider, API key, etc.
|
||||
# This is just a placeholder - replace with your actual LLM setup
|
||||
# E.g., llm_config = LlmConfig(provider="openai", api_token="env:OPENAI_API_KEY")
|
||||
class MockLlmConfig: # Mock for demonstration
    provider = "mock"
    api_token = "mock"
    base_url = None
|
||||
llm_config = MockLlmConfig()
|
||||
|
||||
|
||||
async def main():
|
||||
# 1. Define the desired output schema (what fields we want)
|
||||
# This helps guide the LLM.
|
||||
output_schema = {
|
||||
"page_title": "string",
|
||||
"main_heading": "string"
|
||||
}
|
||||
print("Extraction Schema defined for LLM.")
|
||||
|
||||
# 2. Create an instance of the LLM strategy
|
||||
# We pass the schema and the LLM configuration.
|
||||
# We also specify input_format='markdown' (common for LLMs).
|
||||
llm_extractor = LLMExtractionStrategy(
|
||||
schema=output_schema,
|
||||
llmConfig=llm_config, # Pass the LLM provider details
|
||||
input_format="markdown" # Tell it to read the Markdown content
|
||||
)
|
||||
print(f"Using strategy: {llm_extractor.__class__.__name__}")
|
||||
print(f"LLM Provider (mocked): {llm_config.provider}")
|
||||
|
||||
# 3. Create CrawlerRunConfig with the strategy
|
||||
run_config = CrawlerRunConfig(
|
||||
extraction_strategy=llm_extractor
|
||||
)
|
||||
|
||||
# 4. Run the crawl
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
url_to_crawl = "https://httpbin.org/html"
|
||||
print(f"\nCrawling {url_to_crawl} using LLM to extract...")
|
||||
|
||||
# This would make calls to the configured LLM API
|
||||
result = await crawler.arun(url=url_to_crawl, config=run_config)
|
||||
|
||||
if result.success and result.extracted_content:
|
||||
print("\nExtraction successful (using LLM)!")
|
||||
# Extracted data is a JSON string
|
||||
try:
|
||||
extracted_data = json.loads(result.extracted_content)
|
||||
print("Extracted Data:")
|
||||
print(json.dumps(extracted_data, indent=2))
|
||||
except json.JSONDecodeError:
|
||||
print("Could not parse LLM output as JSON:")
|
||||
print(result.extracted_content)
|
||||
elif result.success:
|
||||
print("\nCrawl successful, but no structured data extracted by LLM.")
|
||||
# This might happen if the mock LLM doesn't return valid JSON
|
||||
# or if the content was too small/irrelevant for extraction.
|
||||
else:
|
||||
print(f"\nCrawl failed: {result.error_message}")
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
|
||||
```
|
||||
|
||||
**Explanation:**
|
||||
|
||||
1. **Schema Definition:** We define a simple dictionary `output_schema` telling the LLM we want fields named "page_title" and "main_heading", both expected to be strings.
|
||||
2. **Instantiate Strategy:** We create `LLMExtractionStrategy`, passing:
|
||||
* `schema=output_schema`: Our desired output structure.
|
||||
* `llmConfig=llm_config`: The configuration telling the strategy *which* LLM to use and how to authenticate (here, it's mocked).
|
||||
* `input_format="markdown"`: Instructs the strategy to feed the generated Markdown content (from `result.markdown.raw_markdown`) to the LLM, which is often easier for LLMs to parse than raw HTML.
|
||||
3. **Configure Run & Crawl:** Same as before, we set the `extraction_strategy` in `CrawlerRunConfig` and run the crawl.
|
||||
4. **Result:** The `AsyncWebCrawler` calls the `llm_extractor`. The strategy sends the Markdown content and the schema instructions to the configured LLM. The LLM analyzes the text and (hopefully) returns a JSON object matching the schema. This JSON is stored as a string in `result.extracted_content`.
|
||||
|
||||
**Expected Output (Conceptual, with a real LLM):**
|
||||
|
||||
```
|
||||
Extraction Schema defined for LLM.
|
||||
Using strategy: LLMExtractionStrategy
|
||||
LLM Provider (mocked): mock
|
||||
|
||||
Crawling https://httpbin.org/html using LLM to extract...
|
||||
|
||||
Extraction successful (using LLM)!
|
||||
Extracted Data:
|
||||
[
|
||||
{
|
||||
"page_title": "Herman Melville - Moby-Dick",
|
||||
"main_heading": "Moby Dick"
|
||||
}
|
||||
]
|
||||
```
|
||||
*(Note: LLM output format might vary slightly, but it aims to match the requested schema based on the content it reads.)*
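For anything richer than a couple of string fields, it is often easier to describe the desired output with a Pydantic model, which this chapter mentioned earlier as one way to define the schema. A hedged sketch (assuming Pydantic v2): here the model's generated JSON schema is passed as the `schema` dictionary, and depending on your crawl4ai version you may also be able to pass the model class directly.

```python
# chapter6_sketch_pydantic_schema.py (illustrative sketch, assumes Pydantic v2)
import json
from pydantic import BaseModel

class PageSummary(BaseModel):
    page_title: str
    main_heading: str
    summary: str  # a short description of what the page is about

# Turn the model into a plain JSON-schema dict to use as LLMExtractionStrategy(schema=...)
schema_dict = PageSummary.model_json_schema()
print(json.dumps(schema_dict, indent=2))

# llm_extractor = LLMExtractionStrategy(
#     schema=schema_dict,
#     llmConfig=llm_config,      # your real LLM configuration
#     input_format="markdown",
# )
```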
|
||||
|
||||
## How It Works Inside (Under the Hood)
|
||||
|
||||
When you provide an `extraction_strategy` in the `CrawlerRunConfig`, how does `AsyncWebCrawler` use it?
|
||||
|
||||
1. **Fetch & Scrape:** The crawler fetches the raw HTML ([AsyncCrawlerStrategy](01_asynccrawlerstrategy.md)) and performs initial cleaning/scraping ([ContentScrapingStrategy](04_contentscrapingstrategy.md)) to get `cleaned_html`, links, etc.
|
||||
2. **Markdown Generation:** It usually generates a Markdown representation via the [DefaultMarkdownGenerator](05_relevantcontentfilter.md#how-relevantcontentfilter-is-used-via-markdown-generation).
|
||||
3. **Check for Strategy:** The `AsyncWebCrawler` (specifically in its internal `aprocess_html` method) checks if `config.extraction_strategy` is set.
|
||||
4. **Execute Strategy:** If a strategy exists:
|
||||
* It determines the required input format (e.g., "html" for `JsonCssExtractionStrategy`, "markdown" for `LLMExtractionStrategy` based on its `input_format` attribute).
|
||||
* It retrieves the corresponding content (e.g., `result.cleaned_html` or `result.markdown.raw_markdown`).
|
||||
* If the content is long and the strategy supports chunking (like `LLMExtractionStrategy`), it might first split the content into smaller chunks.
|
||||
* It calls the strategy's `run` method, passing the content chunk(s).
|
||||
* The strategy performs its logic (applying selectors, calling LLM API).
|
||||
* The strategy returns the extracted data (typically as a list of dictionaries).
|
||||
5. **Store Result:** The `AsyncWebCrawler` converts the returned structured data into a JSON string and stores it in `CrawlResult.extracted_content`.
|
||||
|
||||
Here's a simplified view:
|
||||
|
||||
```mermaid
|
||||
sequenceDiagram
|
||||
participant User
|
||||
participant AWC as AsyncWebCrawler
|
||||
participant Config as CrawlerRunConfig
|
||||
participant Processor as HTML Processing
|
||||
participant Extractor as ExtractionStrategy
|
||||
participant Result as CrawlResult
|
||||
|
||||
User->>AWC: arun(url, config=my_config)
|
||||
Note over AWC: Config includes an Extraction Strategy
|
||||
AWC->>Processor: Process HTML (scrape, generate markdown)
|
||||
Processor-->>AWC: Processed Content (HTML, Markdown)
|
||||
AWC->>Extractor: Run extraction on content (using Strategy's input format)
|
||||
Note over Extractor: Applying logic (CSS, XPath, LLM...)
|
||||
Extractor-->>AWC: Structured Data (List[Dict])
|
||||
AWC->>AWC: Convert data to JSON String
|
||||
AWC->>Result: Store JSON String in extracted_content
|
||||
AWC-->>User: Return CrawlResult
|
||||
```
|
||||
|
||||
### Code Glimpse (`extraction_strategy.py`)
|
||||
|
||||
Inside the `crawl4ai` library, the file `extraction_strategy.py` defines the blueprint and the implementations.
|
||||
|
||||
**The Blueprint (Abstract Base Class):**
|
||||
|
||||
```python
|
||||
# Simplified from crawl4ai/extraction_strategy.py
|
||||
from abc import ABC, abstractmethod
|
||||
from typing import List, Dict, Any
|
||||
|
||||
class ExtractionStrategy(ABC):
|
||||
"""Abstract base class for all extraction strategies."""
|
||||
def __init__(self, input_format: str = "markdown", **kwargs):
|
||||
self.input_format = input_format # e.g., 'html', 'markdown'
|
||||
# ... other common init ...
|
||||
|
||||
@abstractmethod
|
||||
def extract(self, url: str, content_chunk: str, *q, **kwargs) -> List[Dict[str, Any]]:
|
||||
"""Extract structured data from a single chunk of content."""
|
||||
pass
|
||||
|
||||
def run(self, url: str, sections: List[str], *q, **kwargs) -> List[Dict[str, Any]]:
|
||||
"""Process content sections (potentially chunked) and call extract."""
|
||||
# Default implementation might process sections in parallel or sequentially
|
||||
all_extracted_data = []
|
||||
for section in sections:
|
||||
all_extracted_data.extend(self.extract(url, section, **kwargs))
|
||||
return all_extracted_data
|
||||
```
|
||||
|
||||
**Example Implementation (`JsonCssExtractionStrategy`):**
|
||||
|
||||
```python
|
||||
# Simplified from crawl4ai/extraction_strategy.py
|
||||
from bs4 import BeautifulSoup # Uses BeautifulSoup for CSS selectors
|
||||
|
||||
class JsonCssExtractionStrategy(ExtractionStrategy):
|
||||
def __init__(self, schema: Dict[str, Any], **kwargs):
|
||||
# Force input format to HTML for CSS selectors
|
||||
super().__init__(input_format="html", **kwargs)
|
||||
self.schema = schema # Store the user-defined schema
|
||||
|
||||
def extract(self, url: str, html_content: str, *q, **kwargs) -> List[Dict[str, Any]]:
|
||||
# Parse the HTML content chunk
|
||||
soup = BeautifulSoup(html_content, "html.parser")
|
||||
extracted_items = []
|
||||
|
||||
# Find base elements defined in the schema
|
||||
base_elements = soup.select(self.schema.get("baseSelector", "body"))
|
||||
|
||||
for element in base_elements:
|
||||
item = {}
|
||||
# Extract fields based on schema selectors and types
|
||||
fields_to_extract = self.schema.get("fields", [])
|
||||
for field_def in fields_to_extract:
|
||||
try:
|
||||
# Find the specific sub-element using CSS selector
|
||||
target_element = element.select_one(field_def["selector"])
|
||||
if target_element:
|
||||
if field_def["type"] == "text":
|
||||
item[field_def["name"]] = target_element.get_text(strip=True)
|
||||
elif field_def["type"] == "attribute":
|
||||
item[field_def["name"]] = target_element.get(field_def["attribute"])
|
||||
# ... other types like 'html', 'list', 'nested' ...
|
||||
except Exception as e:
|
||||
# Handle errors, maybe log them if verbose
|
||||
pass
|
||||
if item:
|
||||
extracted_items.append(item)
|
||||
|
||||
return extracted_items
|
||||
|
||||
# run() method likely uses the default implementation from base class
|
||||
```
|
||||
|
||||
**Example Implementation (`LLMExtractionStrategy`):**
|
||||
|
||||
```python
|
||||
# Simplified from crawl4ai/extraction_strategy.py
|
||||
# Needs imports for LLM interaction (e.g., perform_completion_with_backoff)
import json
|
||||
from .utils import perform_completion_with_backoff, chunk_documents, escape_json_string
|
||||
from .prompts import PROMPT_EXTRACT_SCHEMA_WITH_INSTRUCTION # Example prompt
|
||||
|
||||
class LLMExtractionStrategy(ExtractionStrategy):
|
||||
def __init__(self, schema: Dict = None, instruction: str = None, llmConfig=None, input_format="markdown", **kwargs):
|
||||
super().__init__(input_format=input_format, **kwargs)
|
||||
self.schema = schema
|
||||
self.instruction = instruction
|
||||
self.llmConfig = llmConfig # Contains provider, API key, etc.
|
||||
# ... other LLM specific setup ...
|
||||
|
||||
def extract(self, url: str, content_chunk: str, *q, **kwargs) -> List[Dict[str, Any]]:
|
||||
# Prepare the prompt for the LLM
|
||||
prompt = self._build_llm_prompt(url, content_chunk)
|
||||
|
||||
# Call the LLM API
|
||||
response = perform_completion_with_backoff(
|
||||
provider=self.llmConfig.provider,
|
||||
prompt_with_variables=prompt,
|
||||
api_token=self.llmConfig.api_token,
|
||||
base_url=self.llmConfig.base_url,
|
||||
json_response=True # Often expect JSON from LLM for extraction
|
||||
# ... pass other necessary args ...
|
||||
)
|
||||
|
||||
# Parse the LLM's response (which should ideally be JSON)
|
||||
try:
|
||||
extracted_data = json.loads(response.choices[0].message.content)
|
||||
# Ensure it's a list
|
||||
if isinstance(extracted_data, dict):
|
||||
extracted_data = [extracted_data]
|
||||
return extracted_data
|
||||
except Exception as e:
|
||||
# Handle LLM response parsing errors
|
||||
print(f"Error parsing LLM response: {e}")
|
||||
return [{"error": "Failed to parse LLM output", "raw_output": response.choices[0].message.content}]
|
||||
|
||||
def _build_llm_prompt(self, url: str, content_chunk: str) -> str:
|
||||
# Logic to construct the prompt using self.schema or self.instruction
|
||||
# and the content_chunk. Example:
|
||||
prompt_template = PROMPT_EXTRACT_SCHEMA_WITH_INSTRUCTION # Choose appropriate prompt
|
||||
variable_values = {
|
||||
"URL": url,
|
||||
"CONTENT": escape_json_string(content_chunk), # Send Markdown or HTML chunk
|
||||
"SCHEMA": json.dumps(self.schema) if self.schema else "{}",
|
||||
"REQUEST": self.instruction if self.instruction else "Extract relevant data based on the schema."
|
||||
}
|
||||
prompt = prompt_template
|
||||
for var, val in variable_values.items():
|
||||
prompt = prompt.replace("{" + var + "}", str(val))
|
||||
return prompt
|
||||
|
||||
# run() method might override the base to handle chunking specifically for LLMs
|
||||
def run(self, url: str, sections: List[str], *q, **kwargs) -> List[Dict[str, Any]]:
|
||||
# Potentially chunk sections based on token limits before calling extract
|
||||
# chunked_content = chunk_documents(sections, ...)
|
||||
# extracted_data = []
|
||||
# for chunk in chunked_content:
|
||||
# extracted_data.extend(self.extract(url, chunk, **kwargs))
|
||||
# return extracted_data
|
||||
# Simplified for now:
|
||||
return super().run(url, sections, *q, **kwargs)
|
||||
|
||||
```
|
||||
|
||||
## Conclusion
|
||||
|
||||
You've learned about `ExtractionStrategy`, Crawl4AI's way of giving instructions to an "Analyst" to pull out specific, structured data from web content.
|
||||
|
||||
* It solves the problem of needing precise data points (like product names, prices) in an organized format, not just blocks of text.
|
||||
* You can choose your "Analyst":
|
||||
* **Precise Locators (`JsonCssExtractionStrategy`, `JsonXPathExtractionStrategy`):** Use exact CSS/XPath selectors defined in a schema. Fast but brittle.
|
||||
* **Smart Interpreter (`LLMExtractionStrategy`):** Uses an AI (LLM) guided by a schema or instructions. More flexible but slower and needs setup.
|
||||
* You configure the desired strategy within the [CrawlerRunConfig](03_crawlerrunconfig.md).
|
||||
* The extracted structured data is returned as a JSON string in the `CrawlResult.extracted_content` field.
|
||||
|
||||
Now that we understand how to fetch, clean, filter, and extract data, let's put it all together and look at the final package that Crawl4AI delivers after a crawl.
|
||||
|
||||
**Next:** Let's dive into the details of the output with [Chapter 7: Understanding the Results - CrawlResult](07_crawlresult.md).
|
||||
|
||||
---
|
||||
|
||||
Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge)
|
||||
341
docs/Crawl4AI/07_crawlresult.md
Normal file
@@ -0,0 +1,341 @@
|
||||
# Chapter 7: Understanding the Results - CrawlResult
|
||||
|
||||
In the previous chapter, [Chapter 6: Getting Specific Data - ExtractionStrategy](06_extractionstrategy.md), we learned how to teach Crawl4AI to act like an analyst, extracting specific, structured data points from a webpage using an `ExtractionStrategy`. We've seen how Crawl4AI can fetch pages, clean them, filter them, and even extract precise information.
|
||||
|
||||
But after all that work, where does all the gathered information go? When you ask the `AsyncWebCrawler` to crawl a URL using `arun()`, what do you actually get back?
|
||||
|
||||
## What Problem Does `CrawlResult` Solve?
|
||||
|
||||
Imagine you sent a research assistant to the library (a website) with a set of instructions: "Find this book (URL), make a clean copy of the relevant chapter (clean HTML/Markdown), list all the cited references (links), take photos of the illustrations (media), find the author and publication date (metadata), and maybe extract specific quotes (structured data)."
|
||||
|
||||
When the assistant returns, they wouldn't just hand you a single piece of paper. They'd likely give you a folder containing everything you asked for: the clean copy, the list of references, the photos, the metadata notes, and the extracted quotes, all neatly organized. They might also include a note if they encountered any problems (errors).
|
||||
|
||||
`CrawlResult` is exactly this **final report folder** or **delivery package**. It's a single object that neatly contains *all* the information Crawl4AI gathered and processed for a specific URL during a crawl operation. Instead of getting lots of separate pieces of data back, you get one convenient container.
|
||||
|
||||
## What is `CrawlResult`?
|
||||
|
||||
`CrawlResult` is a Python object (specifically, a Pydantic model, which is like a super-powered dictionary) that acts as a data container. It holds the results of a single crawl task performed by `AsyncWebCrawler.arun()` or one of the results from `arun_many()`.
|
||||
|
||||
Think of it as a toolbox filled with different tools and information related to the crawled page.
|
||||
|
||||
**Key Information Stored in `CrawlResult`:**
|
||||
|
||||
* **`url` (string):** The original URL that was requested.
|
||||
* **`success` (boolean):** Did the crawl complete without critical errors? `True` if successful, `False` otherwise. **Always check this first!**
|
||||
* **`html` (string):** The raw, original HTML source code fetched from the page.
|
||||
* **`cleaned_html` (string):** The HTML after initial cleaning by the [ContentScrapingStrategy](04_contentscrapingstrategy.md) (e.g., scripts, styles removed).
|
||||
* **`markdown` (object):** An object containing different Markdown representations of the content.
|
||||
* `markdown.raw_markdown`: Basic Markdown generated from `cleaned_html`.
|
||||
* `markdown.fit_markdown`: Markdown generated *only* from content deemed relevant by a [RelevantContentFilter](05_relevantcontentfilter.md) (if one was used). Might be empty if no filter was applied.
|
||||
* *(Other fields like `markdown_with_citations` might exist)*
|
||||
* **`extracted_content` (string):** If you used an [ExtractionStrategy](06_extractionstrategy.md), this holds the extracted structured data, usually formatted as a JSON string. `None` if no extraction was performed or nothing was found.
|
||||
* **`metadata` (dictionary):** Information extracted from the page's metadata tags, like the page title (`metadata['title']`), description, keywords, etc.
|
||||
* **`links` (object):** Contains lists of links found on the page.
|
||||
* `links.internal`: List of links pointing to the same website.
|
||||
* `links.external`: List of links pointing to other websites.
|
||||
* **`media` (object):** Contains lists of media items found.
|
||||
* `media.images`: List of images (`<img>` tags).
|
||||
* `media.videos`: List of videos (`<video>` tags).
|
||||
* *(Other media types might be included)*
|
||||
* **`screenshot` (string):** If you requested a screenshot (`screenshot=True` in `CrawlerRunConfig`), this holds the file path to the saved image. `None` otherwise.
|
||||
* **`pdf` (bytes):** If you requested a PDF (`pdf=True` in `CrawlerRunConfig`), this holds the raw PDF data as bytes. `None` otherwise. (Note: older versions sometimes stored a file path here; current versions typically return bytes.)
|
||||
* **`error_message` (string):** If `success` is `False`, this field usually contains details about what went wrong.
|
||||
* **`status_code` (integer):** The HTTP status code received from the server (e.g., 200 for OK, 404 for Not Found).
|
||||
* **`response_headers` (dictionary):** The HTTP response headers sent by the server.
|
||||
* **`redirected_url` (string):** If the original URL redirected, this shows the final URL the crawler landed on.
|
||||
|
||||
## Accessing the `CrawlResult`
|
||||
|
||||
You get a `CrawlResult` object back every time you `await` a call to `crawler.arun()`:
|
||||
|
||||
```python
|
||||
# chapter7_example_1.py
|
||||
import asyncio
|
||||
from crawl4ai import AsyncWebCrawler, CrawlResult  # CrawlResult is needed for the type hint below
|
||||
|
||||
async def main():
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
url = "https://httpbin.org/html"
|
||||
print(f"Crawling {url}...")
|
||||
|
||||
# The 'arun' method returns a CrawlResult object
|
||||
result: CrawlResult = await crawler.arun(url=url) # Type hint optional
|
||||
|
||||
print("Crawl finished!")
|
||||
# Now 'result' holds all the information
|
||||
print(f"Result object type: {type(result)}")
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
```
|
||||
|
||||
**Explanation:**
|
||||
|
||||
1. We call `crawler.arun(url=url)`.
|
||||
2. The `await` keyword pauses execution until the crawl is complete.
|
||||
3. The value returned by `arun` is assigned to the `result` variable.
|
||||
4. This `result` variable is our `CrawlResult` object.
|
||||
|
||||
If you use `crawler.arun_many()`, it returns a list where each item is a `CrawlResult` object for one of the requested URLs (or an async generator if `stream=True`).
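For example, here is a minimal sketch of both modes, assuming the same setup as the example above (the `arun_many` keyword arguments follow the conventions used throughout this tutorial):

```python
# arun_many sketch: batch mode vs. streaming mode
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    urls = ["https://httpbin.org/html", "https://httpbin.org/links/10/0"]
    async with AsyncWebCrawler() as crawler:
        # Batch mode: wait for everything, get a list of CrawlResult objects back
        results = await crawler.arun_many(urls=urls)
        for r in results:
            print(f"{r.url} -> success={r.success}")

        # Streaming mode: get an async generator and handle results as they arrive
        stream_config = CrawlerRunConfig(stream=True)
        async for r in await crawler.arun_many(urls=urls, config=stream_config):
            print(f"(streamed) {r.url} -> success={r.success}")

if __name__ == "__main__":
    asyncio.run(main())
```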
|
||||
|
||||
## Exploring the Attributes: Using the Toolbox
|
||||
|
||||
Once you have the `result` object, you can access its attributes using dot notation (e.g., `result.success`, `result.markdown`).
|
||||
|
||||
**1. Checking for Success (Most Important!)**
|
||||
|
||||
Before you try to use any data, always check if the crawl was successful:
|
||||
|
||||
```python
|
||||
# chapter7_example_2.py
|
||||
import asyncio
|
||||
from crawl4ai import AsyncWebCrawler, CrawlResult # Import CrawlResult for type hint
|
||||
|
||||
async def main():
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
url = "https://httpbin.org/html" # A working URL
|
||||
# url = "https://httpbin.org/status/404" # Try this URL to see failure
|
||||
result: CrawlResult = await crawler.arun(url=url)
|
||||
|
||||
# --- ALWAYS CHECK 'success' FIRST! ---
|
||||
if result.success:
|
||||
print(f"✅ Successfully crawled: {result.url}")
|
||||
# Now it's safe to access other attributes
|
||||
print(f" Page Title: {result.metadata.get('title', 'N/A')}")
|
||||
else:
|
||||
print(f"❌ Failed to crawl: {result.url}")
|
||||
print(f" Error: {result.error_message}")
|
||||
print(f" Status Code: {result.status_code}")
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
```
|
||||
|
||||
**Explanation:**
|
||||
|
||||
* We use an `if result.success:` block.
|
||||
* If `True`, we proceed to access other data like `result.metadata`.
|
||||
* If `False`, we print the `result.error_message` and `result.status_code` to understand why it failed.
|
||||
|
||||
**2. Accessing Content (HTML, Markdown)**
|
||||
|
||||
```python
|
||||
# chapter7_example_3.py
|
||||
import asyncio
|
||||
from crawl4ai import AsyncWebCrawler, CrawlResult
|
||||
|
||||
async def main():
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
url = "https://httpbin.org/html"
|
||||
result: CrawlResult = await crawler.arun(url=url)
|
||||
|
||||
if result.success:
|
||||
print("--- Content ---")
|
||||
# Print the first 150 chars of raw HTML
|
||||
print(f"Raw HTML snippet: {result.html[:150]}...")
|
||||
|
||||
# Access the raw markdown
|
||||
if result.markdown: # Check if markdown object exists
|
||||
print(f"Markdown snippet: {result.markdown.raw_markdown[:150]}...")
|
||||
else:
|
||||
print("Markdown not generated.")
|
||||
else:
|
||||
print(f"Crawl failed: {result.error_message}")
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
```
|
||||
|
||||
**Explanation:**
|
||||
|
||||
* We access `result.html` for the original HTML.
|
||||
* We access `result.markdown.raw_markdown` for the main Markdown content. Note the two dots: `result.markdown` gives the `MarkdownGenerationResult` object, and `.raw_markdown` accesses the specific string within it. We also check `if result.markdown:` first, just in case markdown generation failed for some reason.
|
||||
|
||||
**3. Getting Metadata, Links, and Media**
|
||||
|
||||
```python
|
||||
# chapter7_example_4.py
|
||||
import asyncio
|
||||
from crawl4ai import AsyncWebCrawler, CrawlResult
|
||||
|
||||
async def main():
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
url = "https://httpbin.org/links/10/0" # A page with links
|
||||
result: CrawlResult = await crawler.arun(url=url)
|
||||
|
||||
if result.success:
|
||||
print("--- Metadata & Links ---")
|
||||
print(f"Title: {result.metadata.get('title', 'N/A')}")
|
||||
print(f"Found {len(result.links.internal)} internal links.")
|
||||
print(f"Found {len(result.links.external)} external links.")
|
||||
if result.links.internal:
|
||||
print(f" First internal link text: '{result.links.internal[0].text}'")
|
||||
# Similarly access result.media.images etc.
|
||||
else:
|
||||
print(f"Crawl failed: {result.error_message}")
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
```
|
||||
|
||||
**Explanation:**
|
||||
|
||||
* `result.metadata` is a dictionary; use `.get()` for safe access.
|
||||
* `result.links` and `result.media` are objects containing lists (`internal`, `external`, `images`, etc.). We can check their lengths (`len()`) and access individual items by index (e.g., `[0]`); a short sketch for images follows below.
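For media, the pattern is the same. A small sketch, assuming `result` is a successful `CrawlResult` and that each image entry exposes keys such as `src` and `alt` (the exact key names may vary):

```python
# Sketch: inspecting images found on the page (assumes 'result' is a successful CrawlResult)
if result.media.images:
    print(f"Found {len(result.media.images)} images.")
    first_image = result.media.images[0]
    # Each entry is a dict-like record; 'src' and 'alt' are assumed key names
    print(f"  First image src: {first_image.get('src', 'N/A')}")
    print(f"  First image alt: {first_image.get('alt', '')}")
else:
    print("No images found on this page.")
```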
|
||||
|
||||
**4. Checking for Extracted Data, Screenshots, PDFs**
|
||||
|
||||
```python
|
||||
# chapter7_example_5.py
|
||||
import asyncio
|
||||
import json
|
||||
from crawl4ai import (
|
||||
AsyncWebCrawler, CrawlResult, CrawlerRunConfig,
|
||||
JsonCssExtractionStrategy # Example extractor
|
||||
)
|
||||
|
||||
async def main():
|
||||
# Define a simple extraction strategy (from Chapter 6)
|
||||
schema = {"baseSelector": "body", "fields": [{"name": "heading", "selector": "h1", "type": "text"}]}
|
||||
extractor = JsonCssExtractionStrategy(schema=schema)
|
||||
|
||||
# Configure the run to extract and take a screenshot
|
||||
config = CrawlerRunConfig(
|
||||
extraction_strategy=extractor,
|
||||
screenshot=True
|
||||
)
|
||||
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
url = "https://httpbin.org/html"
|
||||
result: CrawlResult = await crawler.arun(url=url, config=config)
|
||||
|
||||
if result.success:
|
||||
print("--- Extracted Data & Media ---")
|
||||
# Check if structured data was extracted
|
||||
if result.extracted_content:
|
||||
print("Extracted Data found:")
|
||||
data = json.loads(result.extracted_content) # Parse the JSON string
|
||||
print(json.dumps(data, indent=2))
|
||||
else:
|
||||
print("No structured data extracted.")
|
||||
|
||||
# Check if a screenshot was taken
|
||||
if result.screenshot:
|
||||
print(f"Screenshot saved to: {result.screenshot}")
|
||||
else:
|
||||
print("Screenshot not taken.")
|
||||
|
||||
# Check for PDF (would be bytes if requested and successful)
|
||||
if result.pdf:
|
||||
print(f"PDF data captured ({len(result.pdf)} bytes).")
|
||||
else:
|
||||
print("PDF not generated.")
|
||||
else:
|
||||
print(f"Crawl failed: {result.error_message}")
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
```
|
||||
|
||||
**Explanation:**
|
||||
|
||||
* We check if `result.extracted_content` is not `None` or empty before trying to parse it as JSON.
|
||||
* We check if `result.screenshot` is not `None` to see if the file path exists.
|
||||
* We check if `result.pdf` is not `None` to see if the PDF data (bytes) was captured. A short sketch for saving these outputs to disk follows below.
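Once you have these pieces, saving them is plain Python. A minimal sketch, assuming `result` is a successful `CrawlResult` from the example above (the file names are arbitrary):

```python
# Sketch: persisting the gathered outputs (assumes 'result' is a successful CrawlResult)
from pathlib import Path

if result.markdown:
    Path("page.md").write_text(result.markdown.raw_markdown, encoding="utf-8")

if result.extracted_content:
    Path("data.json").write_text(result.extracted_content, encoding="utf-8")

if result.pdf:
    Path("page.pdf").write_bytes(result.pdf)  # result.pdf holds raw bytes
```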
|
||||
|
||||
## How is `CrawlResult` Created? (Under the Hood)
|
||||
|
||||
You don't interact with the `CrawlResult` constructor directly. The `AsyncWebCrawler` creates it for you at the very end of the `arun` process, typically inside its internal `aprocess_html` method (or just before returning if fetching from cache).
|
||||
|
||||
Here’s a simplified sequence:
|
||||
|
||||
1. **Fetch:** `AsyncWebCrawler` calls the [AsyncCrawlerStrategy](01_asynccrawlerstrategy.md) to get the raw `html`, `status_code`, `response_headers`, etc.
|
||||
2. **Scrape:** It passes the `html` to the [ContentScrapingStrategy](04_contentscrapingstrategy.md) to get `cleaned_html`, `links`, `media`, `metadata`.
|
||||
3. **Markdown:** It generates Markdown using the configured generator, possibly involving a [RelevantContentFilter](05_relevantcontentfilter.md), resulting in a `MarkdownGenerationResult` object.
|
||||
4. **Extract (Optional):** If an [ExtractionStrategy](06_extractionstrategy.md) is configured, it runs it on the appropriate content (HTML or Markdown) to get `extracted_content`.
|
||||
5. **Screenshot/PDF (Optional):** If requested, the fetching strategy captures the `screenshot` path or `pdf` data.
|
||||
6. **Package:** `AsyncWebCrawler` gathers all these pieces (`url`, `html`, `cleaned_html`, the markdown object, `links`, `media`, `metadata`, `extracted_content`, `screenshot`, `pdf`, `success` status, `error_message`, etc.).
|
||||
7. **Instantiate:** It creates the `CrawlResult` object, passing all the gathered data into its constructor.
|
||||
8. **Return:** It returns this fully populated `CrawlResult` object to your code.
|
||||
|
||||
## Code Glimpse (`models.py`)
|
||||
|
||||
The `CrawlResult` is defined in the `crawl4ai/models.py` file. It uses Pydantic, a library that helps define data structures with type hints and validation. Here's a simplified view:
|
||||
|
||||
```python
|
||||
# Simplified from crawl4ai/models.py
|
||||
from pydantic import BaseModel, HttpUrl
|
||||
from typing import List, Dict, Optional, Any
|
||||
|
||||
# Other related models (simplified)
|
||||
class MarkdownGenerationResult(BaseModel):
|
||||
raw_markdown: str
|
||||
fit_markdown: Optional[str] = None
|
||||
# ... other markdown fields ...
|
||||
|
||||
class Links(BaseModel):
|
||||
internal: List[Dict] = []
|
||||
external: List[Dict] = []
|
||||
|
||||
class Media(BaseModel):
|
||||
images: List[Dict] = []
|
||||
videos: List[Dict] = []
|
||||
|
||||
# The main CrawlResult model
|
||||
class CrawlResult(BaseModel):
|
||||
url: str
|
||||
html: str
|
||||
success: bool
|
||||
cleaned_html: Optional[str] = None
|
||||
media: Media = Media() # Use the Media model
|
||||
links: Links = Links() # Use the Links model
|
||||
screenshot: Optional[str] = None
|
||||
pdf: Optional[bytes] = None
|
||||
# Uses a private attribute and property for markdown for compatibility
|
||||
_markdown: Optional[MarkdownGenerationResult] = None # Actual storage
|
||||
extracted_content: Optional[str] = None # JSON string
|
||||
metadata: Optional[Dict[str, Any]] = None
|
||||
error_message: Optional[str] = None
|
||||
status_code: Optional[int] = None
|
||||
response_headers: Optional[Dict[str, str]] = None
|
||||
redirected_url: Optional[str] = None
|
||||
# ... other fields like session_id, ssl_certificate ...
|
||||
|
||||
# Custom property to access markdown data
|
||||
@property
|
||||
def markdown(self) -> Optional[MarkdownGenerationResult]:
|
||||
return self._markdown
|
||||
|
||||
# Configuration for Pydantic
|
||||
class Config:
|
||||
arbitrary_types_allowed = True
|
||||
|
||||
# Custom init and model_dump might exist for backward compatibility handling
|
||||
# ... (omitted for simplicity) ...
|
||||
```
|
||||
|
||||
**Explanation:**
|
||||
|
||||
* It's defined as a `class CrawlResult(BaseModel):`.
|
||||
* Each attribute (like `url`, `html`, `success`) is defined with a type hint (like `str`, `bool`, `Optional[str]`). `Optional[str]` means the field can be a string or `None`.
|
||||
* Some attributes are themselves complex objects defined by other Pydantic models (like `media: Media`, `links: Links`).
|
||||
* The `markdown` field uses a common pattern (property wrapping a private attribute) to provide the `MarkdownGenerationResult` object while maintaining some backward compatibility. You access it simply as `result.markdown`.
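Because `CrawlResult` is a Pydantic model, you can also turn it into a plain dictionary for logging or storage. A rough sketch; note the glimpse above mentions a custom `model_dump`, so the exact output shape may differ from a vanilla Pydantic dump:

```python
# Sketch: serializing a CrawlResult (assumes 'result' comes from crawler.arun())
import json

result_dict = result.model_dump()  # Pydantic v2 style; use .dict() on older Pydantic versions

# Drop the bulky HTML fields before printing a lightweight summary
summary = {k: v for k, v in result_dict.items() if k not in ("html", "cleaned_html")}
print(json.dumps(summary, default=str)[:300])  # default=str copes with bytes and other non-JSON types
```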
|
||||
|
||||
## Conclusion
|
||||
|
||||
You've now met the `CrawlResult` object – the final, comprehensive report delivered by Crawl4AI after processing a URL.
|
||||
|
||||
* It acts as a **container** holding all gathered information (HTML, Markdown, metadata, links, media, extracted data, errors, etc.).
|
||||
* It's the **return value** of `AsyncWebCrawler.arun()` and `arun_many()`.
|
||||
* The most crucial attribute is **`success` (boolean)**, which you should always check first.
|
||||
* You can easily **access** all the different pieces of information using dot notation (e.g., `result.metadata['title']`, `result.markdown.raw_markdown`, `result.links.external`).
|
||||
|
||||
Understanding the `CrawlResult` is key to effectively using the information Crawl4AI provides.
|
||||
|
||||
So far, we've focused on crawling single pages or lists of specific URLs. But what if you want to start at one page and automatically discover and crawl linked pages, exploring a website more deeply?
|
||||
|
||||
**Next:** Let's explore how to perform multi-page crawls with [Chapter 8: Exploring Websites - DeepCrawlStrategy](08_deepcrawlstrategy.md).
|
||||
|
||||
---
|
||||
|
||||
Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge)
|
||||
378
docs/Crawl4AI/08_deepcrawlstrategy.md
Normal file
@@ -0,0 +1,378 @@
|
||||
# Chapter 8: Exploring Websites - DeepCrawlStrategy
|
||||
|
||||
In [Chapter 7: Understanding the Results - CrawlResult](07_crawlresult.md), we saw the final report (`CrawlResult`) that Crawl4AI gives us after processing a single URL. This report contains cleaned content, links, metadata, and maybe even extracted data.
|
||||
|
||||
But what if you want to explore a website *beyond* just the first page? Imagine you land on a blog's homepage. You don't just want the homepage content; you want to automatically discover and crawl all the individual blog posts linked from it. How can you tell Crawl4AI to act like an explorer, following links and venturing deeper into the website?
|
||||
|
||||
## What Problem Does `DeepCrawlStrategy` Solve?
|
||||
|
||||
Think of the `AsyncWebCrawler.arun()` method we've used so far like visiting just the entrance hall of a vast library. You get information about that specific hall, but you don't automatically explore the adjoining rooms or different floors.
|
||||
|
||||
What if you want to systematically explore the library? You need a plan:
|
||||
|
||||
* Do you explore room by room on the current floor before going upstairs? (Level by level)
|
||||
* Do you pick one wing and explore all its rooms down to the very end before exploring another wing? (Go deep first)
|
||||
* Do you have a map highlighting potentially interesting sections and prioritize visiting those first? (Prioritize promising paths)
|
||||
|
||||
`DeepCrawlStrategy` provides this **exploration plan**. It defines the logic for how Crawl4AI should discover and crawl new URLs starting from the initial one(s) by following the links it finds on each page. It turns the crawler from a single-page visitor into a website explorer.
|
||||
|
||||
## What is `DeepCrawlStrategy`?
|
||||
|
||||
`DeepCrawlStrategy` is a concept (a blueprint) in Crawl4AI that represents the **method or logic used to navigate and crawl multiple pages by following links**. It tells the crawler *which links* to follow and in *what order* to visit them.
|
||||
|
||||
It essentially takes over the process when you call `arun()` if a deep crawl is requested, managing a queue or list of URLs to visit and coordinating the crawling of those URLs, potentially up to a certain depth or number of pages.
|
||||
|
||||
## Different Exploration Plans: The Strategies
|
||||
|
||||
Crawl4AI provides several concrete exploration plans (implementations) for `DeepCrawlStrategy`:
|
||||
|
||||
1. **`BFSDeepCrawlStrategy` (Level-by-Level Explorer):**
|
||||
* **Analogy:** Like ripples spreading in a pond.
|
||||
* **How it works:** It first crawls the starting URL (Level 0). Then, it crawls all the valid links found on that page (Level 1). Then, it crawls all the valid links found on *those* pages (Level 2), and so on. It explores the website layer by layer.
|
||||
* **Good for:** Finding the shortest path to all reachable pages, getting a broad overview quickly near the start page.
|
||||
|
||||
2. **`DFSDeepCrawlStrategy` (Deep Path Explorer):**
|
||||
* **Analogy:** Like exploring one specific corridor in a maze all the way to the end before backtracking and trying another corridor.
|
||||
* **How it works:** It starts at the initial URL, follows one link, then follows a link from *that* page, and continues going deeper down one path as far as possible (or until a specified depth limit). Only when it hits a dead end or the limit does it backtrack and try another path.
|
||||
* **Good for:** Exploring specific branches of a website thoroughly, potentially reaching deeper pages faster than BFS (if the target is down a specific path).
|
||||
|
||||
3. **`BestFirstCrawlingStrategy` (Priority Explorer):**
|
||||
* **Analogy:** Like using a treasure map where some paths are marked as more promising than others.
|
||||
* **How it works:** This strategy uses a **scoring system**. It looks at all the discovered (but not yet visited) links and assigns a score to each one based on how "promising" it seems (e.g., does the URL contain relevant keywords? Is it from a trusted domain?). It then crawls the link with the *best* score first, regardless of its depth.
|
||||
* **Good for:** Focusing the crawl on the most relevant or important pages first, especially useful when you can't crawl the entire site and need to prioritize.
|
||||
|
||||
**Guiding the Explorer: Filters and Scorers**
|
||||
|
||||
Deep crawl strategies often work together with:
|
||||
|
||||
* **Filters:** Rules that decide *if* a discovered link should even be considered for crawling. Examples:
|
||||
* `DomainFilter`: Only follow links within the starting website's domain.
|
||||
* `URLPatternFilter`: Only follow links matching a specific pattern (e.g., `/blog/posts/...`).
|
||||
* `ContentTypeFilter`: Avoid following links to non-HTML content like PDFs or images.
|
||||
* **Scorers:** (Used mainly by `BestFirstCrawlingStrategy`) Rules that assign a score to a potential link to help prioritize it. Examples:
|
||||
* `KeywordRelevanceScorer`: Scores links higher if the URL contains certain keywords.
|
||||
* `PathDepthScorer`: Might score links differently based on how deep they are.
|
||||
|
||||
These act like instructions for the explorer: "Only explore rooms on this floor (filter)," "Ignore corridors marked 'Staff Only' (filter)," or "Check rooms marked with a star first (scorer)."
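For example, a prioritized crawl that prefers documentation-related links while staying on the starting domain could be set up roughly like this. This is a sketch only: the imports mirror the BFS example in the next section, and constructor parameters such as `keywords` and `url_scorer` are assumptions, so check the current API before relying on them:

```python
# Sketch: a "Priority Explorer" setup (parameter names are assumptions, see note above)
from crawl4ai import (
    CrawlerRunConfig,
    BestFirstCrawlingStrategy,
    DomainFilter,
    KeywordRelevanceScorer,
)

scorer = KeywordRelevanceScorer(keywords=["docs", "tutorial", "pricing"])

best_first = BestFirstCrawlingStrategy(
    max_depth=2,                    # never follow links more than 2 hops from the start URL
    filter_chain=[DomainFilter()],  # stay on the starting domain, as in the BFS example
    url_scorer=scorer,              # crawl higher-scoring links first
)

config = CrawlerRunConfig(deep_crawl_strategy=best_first, stream=True)
```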
|
||||
|
||||
## How to Use a `DeepCrawlStrategy`
|
||||
|
||||
You enable deep crawling by adding a `DeepCrawlStrategy` instance to your `CrawlerRunConfig`. Let's try exploring a website layer by layer using `BFSDeepCrawlStrategy`, going only one level deep from the start page.
|
||||
|
||||
```python
|
||||
# chapter8_example_1.py
|
||||
import asyncio
|
||||
from crawl4ai import (
|
||||
AsyncWebCrawler,
|
||||
CrawlerRunConfig,
|
||||
BFSDeepCrawlStrategy, # 1. Import the desired strategy
|
||||
DomainFilter # Import a filter to stay on the same site
|
||||
)
|
||||
|
||||
async def main():
|
||||
# 2. Create an instance of the strategy
|
||||
# - max_depth=1: Crawl start URL (depth 0) + links found (depth 1)
|
||||
# - filter_chain: Use DomainFilter to only follow links on the same website
|
||||
bfs_explorer = BFSDeepCrawlStrategy(
|
||||
max_depth=1,
|
||||
filter_chain=[DomainFilter()] # Stay within the initial domain
|
||||
)
|
||||
print(f"Strategy: BFS, Max Depth: {bfs_explorer.max_depth}")
|
||||
|
||||
# 3. Create CrawlerRunConfig and set the deep_crawl_strategy
|
||||
# Also set stream=True to get results as they come in.
|
||||
run_config = CrawlerRunConfig(
|
||||
deep_crawl_strategy=bfs_explorer,
|
||||
stream=True # Get results one by one using async for
|
||||
)
|
||||
|
||||
# 4. Run the crawl - arun now handles the deep crawl!
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
start_url = "https://httpbin.org/links/10/0" # A page with 10 internal links
|
||||
print(f"\nStarting deep crawl from: {start_url}...")
|
||||
|
||||
crawl_results_generator = await crawler.arun(url=start_url, config=run_config)
|
||||
|
||||
crawled_count = 0
|
||||
# Iterate over the results as they are yielded
|
||||
async for result in crawl_results_generator:
|
||||
crawled_count += 1
|
||||
status = "✅" if result.success else "❌"
|
||||
depth = result.metadata.get("depth", "N/A")
|
||||
parent = result.metadata.get("parent_url", "Start")
|
||||
url_short = result.url.split('/')[-1] # Show last part of URL
|
||||
print(f" {status} Crawled: {url_short:<6} (Depth: {depth})")
|
||||
|
||||
print(f"\nFinished deep crawl. Total pages processed: {crawled_count}")
|
||||
# Expecting 1 (start URL) + 10 (links) = 11 results
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
```
|
||||
|
||||
**Explanation:**
|
||||
|
||||
1. **Import:** We import `AsyncWebCrawler`, `CrawlerRunConfig`, `BFSDeepCrawlStrategy`, and `DomainFilter`.
|
||||
2. **Instantiate Strategy:** We create `BFSDeepCrawlStrategy`.
|
||||
* `max_depth=1`: We tell it to crawl the starting URL (depth 0) and any valid links it finds on that page (depth 1), but not to go any further.
|
||||
* `filter_chain=[DomainFilter()]`: We provide a list containing `DomainFilter`. This tells the strategy to only consider following links that point to the same domain as the `start_url`. Links to external sites will be ignored.
|
||||
3. **Configure Run:** We create a `CrawlerRunConfig` and pass our `bfs_explorer` instance to the `deep_crawl_strategy` parameter. We also set `stream=True` so we can process results as soon as they are ready, rather than waiting for the entire crawl to finish.
|
||||
4. **Crawl:** We call `await crawler.arun(url=start_url, config=run_config)`. Because the config contains a `deep_crawl_strategy`, `arun` doesn't just crawl the single `start_url`. Instead, it activates the deep crawl logic defined by `BFSDeepCrawlStrategy`.
|
||||
5. **Process Results:** Since we used `stream=True`, the return value is an asynchronous generator. We use `async for result in crawl_results_generator:` to loop through the `CrawlResult` objects as they are produced by the deep crawl. For each result, we print its status and depth.
|
||||
|
||||
You'll see the output showing the crawl starting, then processing the initial page (`links/10/0` at depth 0), followed by the 10 linked pages (e.g., `9`, `8`, ... `0` at depth 1).
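If you would rather wait for the whole exploration to finish instead of streaming, leave `stream` unset. A minimal sketch, reusing the same strategy settings and assuming `stream` defaults to `False` when not specified; `arun` then returns the complete list of results:

```python
# Sketch: the same BFS deep crawl without streaming (assumes stream defaults to False)
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, BFSDeepCrawlStrategy, DomainFilter

async def main():
    strategy = BFSDeepCrawlStrategy(max_depth=1, filter_chain=[DomainFilter()])
    config = CrawlerRunConfig(deep_crawl_strategy=strategy)  # no stream=True this time

    async with AsyncWebCrawler() as crawler:
        # With streaming off, arun returns the full list once the whole crawl is done
        results = await crawler.arun(url="https://httpbin.org/links/10/0", config=config)
        print(f"Deep crawl finished with {len(results)} pages.")
        for result in results:
            print(f"  depth={result.metadata.get('depth')} url={result.url}")

if __name__ == "__main__":
    asyncio.run(main())
```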
|
||||
|
||||
## How It Works (Under the Hood)
|
||||
|
||||
How does simply putting a strategy in the config change `arun`'s behavior? It involves a bit of Python magic called a **decorator**.
|
||||
|
||||
1. **Decorator:** When you create an `AsyncWebCrawler`, its `arun` method is automatically wrapped by a `DeepCrawlDecorator`.
|
||||
2. **Check Config:** When you call `await crawler.arun(url=..., config=...)`, this decorator checks if `config.deep_crawl_strategy` is set.
|
||||
3. **Delegate or Run Original:**
|
||||
* If a strategy **is set**, the decorator *doesn't* run the original single-page crawl logic. Instead, it calls the `arun` method of your chosen `DeepCrawlStrategy` instance (e.g., `bfs_explorer.arun(...)`), passing it the `crawler` itself, the `start_url`, and the `config`.
|
||||
* If no strategy is set, the decorator simply calls the original `arun` logic to crawl the single page.
|
||||
4. **Strategy Takes Over:** The `DeepCrawlStrategy`'s `arun` method now manages the crawl.
|
||||
* It maintains a list or queue of URLs to visit (e.g., `current_level` in BFS, a stack in DFS, a priority queue in BestFirst).
|
||||
* It repeatedly takes batches of URLs from its list/queue.
|
||||
* For each batch, it calls `crawler.arun_many(urls=batch_urls, config=batch_config)` (with deep crawling disabled in `batch_config` to avoid infinite loops!).
|
||||
* As results come back from `arun_many`, the strategy processes them:
|
||||
* It yields the `CrawlResult` if running in stream mode.
|
||||
* It extracts links using its `link_discovery` method.
|
||||
* `link_discovery` uses `can_process_url` (which applies filters) to validate links.
|
||||
* Valid new links are added to the list/queue for future crawling.
|
||||
* This continues until the list/queue is empty, the max depth/pages limit is reached, or it's cancelled.
|
||||
|
||||
```mermaid
|
||||
sequenceDiagram
|
||||
participant User
|
||||
participant Decorator as DeepCrawlDecorator
|
||||
participant Strategy as DeepCrawlStrategy (e.g., BFS)
|
||||
participant AWC as AsyncWebCrawler
|
||||
|
||||
User->>Decorator: arun(start_url, config_with_strategy)
|
||||
Decorator->>Strategy: arun(start_url, crawler=AWC, config)
|
||||
Note over Strategy: Initialize queue/level with start_url
|
||||
loop Until Queue Empty or Limits Reached
|
||||
Strategy->>Strategy: Get next batch of URLs from queue
|
||||
Note over Strategy: Create batch_config (deep_crawl=None)
|
||||
Strategy->>AWC: arun_many(batch_urls, config=batch_config)
|
||||
AWC-->>Strategy: batch_results (List/Stream of CrawlResult)
|
||||
loop For each result in batch_results
|
||||
Strategy->>Strategy: Process result (yield if streaming)
|
||||
Strategy->>Strategy: Discover links (apply filters)
|
||||
Strategy->>Strategy: Add valid new links to queue
|
||||
end
|
||||
end
|
||||
Strategy-->>Decorator: Final result (List or Generator)
|
||||
Decorator-->>User: Final result
|
||||
```
|
||||
|
||||
## Code Glimpse
|
||||
|
||||
Let's peek at the simplified structure:
|
||||
|
||||
**1. The Decorator (`deep_crawling/base_strategy.py`)**
|
||||
|
||||
```python
|
||||
# Simplified from deep_crawling/base_strategy.py
|
||||
from contextvars import ContextVar
|
||||
from functools import wraps
|
||||
# ... other imports
|
||||
|
||||
class DeepCrawlDecorator:
|
||||
deep_crawl_active = ContextVar("deep_crawl_active", default=False)
|
||||
|
||||
def __init__(self, crawler: AsyncWebCrawler):
|
||||
self.crawler = crawler
|
||||
|
||||
def __call__(self, original_arun):
|
||||
@wraps(original_arun)
|
||||
async def wrapped_arun(url: str, config: CrawlerRunConfig = None, **kwargs):
|
||||
# Is a strategy present AND not already inside a deep crawl?
|
||||
if config and config.deep_crawl_strategy and not self.deep_crawl_active.get():
|
||||
# Mark that we are starting a deep crawl
|
||||
token = self.deep_crawl_active.set(True)
|
||||
try:
|
||||
# Call the STRATEGY's arun method instead of the original
|
||||
strategy_result = await config.deep_crawl_strategy.arun(
|
||||
crawler=self.crawler,
|
||||
start_url=url,
|
||||
config=config
|
||||
)
|
||||
# Handle streaming if needed
|
||||
if config.stream:
|
||||
# Return an async generator that resets the context var on exit
|
||||
async def result_wrapper():
|
||||
try:
|
||||
async for result in strategy_result: yield result
|
||||
finally: self.deep_crawl_active.reset(token)
|
||||
return result_wrapper()
|
||||
else:
|
||||
return strategy_result # Return the list of results directly
|
||||
finally:
|
||||
# Reset the context var if not streaming (or handled in wrapper)
|
||||
if not config.stream: self.deep_crawl_active.reset(token)
|
||||
else:
|
||||
# No strategy or already deep crawling, call the original single-page arun
|
||||
return await original_arun(url, config=config, **kwargs)
|
||||
return wrapped_arun
|
||||
```
|
||||
|
||||
**2. The Strategy Blueprint (`deep_crawling/base_strategy.py`)**
|
||||
|
||||
```python
|
||||
# Simplified from deep_crawling/base_strategy.py
|
||||
from abc import ABC, abstractmethod
|
||||
# ... other imports
|
||||
|
||||
class DeepCrawlStrategy(ABC):
|
||||
|
||||
@abstractmethod
|
||||
async def _arun_batch(self, start_url, crawler, config) -> List[CrawlResult]:
|
||||
# Implementation for non-streaming mode
|
||||
pass
|
||||
|
||||
@abstractmethod
|
||||
async def _arun_stream(self, start_url, crawler, config) -> AsyncGenerator[CrawlResult, None]:
|
||||
# Implementation for streaming mode
|
||||
pass
|
||||
|
||||
async def arun(self, start_url, crawler, config) -> RunManyReturn:
|
||||
# Decides whether to call _arun_batch or _arun_stream
|
||||
if config.stream:
|
||||
return self._arun_stream(start_url, crawler, config)
|
||||
else:
|
||||
return await self._arun_batch(start_url, crawler, config)
|
||||
|
||||
@abstractmethod
|
||||
async def can_process_url(self, url: str, depth: int) -> bool:
|
||||
# Applies filters to decide if a URL is valid to crawl
|
||||
pass
|
||||
|
||||
@abstractmethod
|
||||
async def link_discovery(self, result, source_url, current_depth, visited, next_level, depths):
|
||||
# Extracts, validates, and prepares links for the next step
|
||||
pass
|
||||
|
||||
@abstractmethod
|
||||
async def shutdown(self):
|
||||
# Cleanup logic
|
||||
pass
|
||||
```
|
||||
|
||||
**3. Example: BFS Implementation (`deep_crawling/bfs_strategy.py`)**
|
||||
|
||||
```python
|
||||
# Simplified from deep_crawling/bfs_strategy.py
|
||||
# ... imports ...
|
||||
from .base_strategy import DeepCrawlStrategy # Import the base class
|
||||
|
||||
class BFSDeepCrawlStrategy(DeepCrawlStrategy):
|
||||
def __init__(self, max_depth, filter_chain=None, url_scorer=None, ...):
|
||||
self.max_depth = max_depth
|
||||
self.filter_chain = filter_chain or FilterChain() # Use default if none
|
||||
self.url_scorer = url_scorer
|
||||
# ... other init ...
|
||||
self._pages_crawled = 0
|
||||
|
||||
async def can_process_url(self, url: str, depth: int) -> bool:
|
||||
# ... (validation logic using self.filter_chain) ...
|
||||
is_valid = True # Placeholder
|
||||
if depth != 0 and not await self.filter_chain.apply(url):
|
||||
is_valid = False
|
||||
return is_valid
|
||||
|
||||
async def link_discovery(self, result, source_url, current_depth, visited, next_level, depths):
|
||||
# ... (logic to get links from result.links) ...
|
||||
links = result.links.get("internal", []) # Example: only internal
|
||||
for link_data in links:
|
||||
url = link_data.get("href")
|
||||
if url and url not in visited:
|
||||
if await self.can_process_url(url, current_depth + 1):
|
||||
# Check scoring, max_pages limit etc.
|
||||
depths[url] = current_depth + 1
|
||||
next_level.append((url, source_url)) # Add (url, parent) tuple
|
||||
|
||||
async def _arun_batch(self, start_url, crawler, config) -> List[CrawlResult]:
|
||||
visited = set()
|
||||
current_level = [(start_url, None)] # List of (url, parent_url)
|
||||
depths = {start_url: 0}
|
||||
all_results = []
|
||||
|
||||
while current_level: # While there are pages in the current level
|
||||
next_level = []
|
||||
urls_in_level = [url for url, parent in current_level]
|
||||
visited.update(urls_in_level)
|
||||
|
||||
# Create config for this batch (no deep crawl recursion)
|
||||
batch_config = config.clone(deep_crawl_strategy=None, stream=False)
|
||||
# Crawl all URLs in the current level
|
||||
batch_results = await crawler.arun_many(urls=urls_in_level, config=batch_config)
|
||||
|
||||
for result in batch_results:
|
||||
# Add metadata (depth, parent)
|
||||
depth = depths.get(result.url, 0)
|
||||
result.metadata = result.metadata or {}
|
||||
result.metadata["depth"] = depth
|
||||
# ... find parent ...
|
||||
all_results.append(result)
|
||||
# Discover links for the *next* level
|
||||
if result.success:
|
||||
await self.link_discovery(result, result.url, depth, visited, next_level, depths)
|
||||
|
||||
current_level = next_level # Move to the next level
|
||||
|
||||
return all_results
|
||||
|
||||
async def _arun_stream(self, start_url, crawler, config) -> AsyncGenerator[CrawlResult, None]:
|
||||
# Similar logic to _arun_batch, but uses 'yield result'
|
||||
# and processes results as they come from arun_many stream
|
||||
visited = set()
|
||||
current_level = [(start_url, None)] # List of (url, parent_url)
|
||||
depths = {start_url: 0}
|
||||
|
||||
while current_level:
|
||||
next_level = []
|
||||
urls_in_level = [url for url, parent in current_level]
|
||||
visited.update(urls_in_level)
|
||||
|
||||
# Use stream=True for arun_many
|
||||
batch_config = config.clone(deep_crawl_strategy=None, stream=True)
|
||||
batch_results_gen = await crawler.arun_many(urls=urls_in_level, config=batch_config)
|
||||
|
||||
async for result in batch_results_gen:
|
||||
# Add metadata
|
||||
depth = depths.get(result.url, 0)
|
||||
result.metadata = result.metadata or {}
|
||||
result.metadata["depth"] = depth
|
||||
# ... find parent ...
|
||||
yield result # Yield result immediately
|
||||
# Discover links for the next level
|
||||
if result.success:
|
||||
await self.link_discovery(result, result.url, depth, visited, next_level, depths)
|
||||
|
||||
current_level = next_level
|
||||
# ... shutdown method ...
|
||||
```
|
||||
|
||||
## Conclusion
|
||||
|
||||
You've learned about `DeepCrawlStrategy`, the component that turns Crawl4AI into a website explorer!
|
||||
|
||||
* It solves the problem of crawling beyond a single starting page by following links.
|
||||
* It defines the **exploration plan**:
|
||||
* `BFSDeepCrawlStrategy`: Level by level.
|
||||
* `DFSDeepCrawlStrategy`: Deep paths first.
|
||||
* `BestFirstCrawlingStrategy`: Prioritized by score.
|
||||
* **Filters** and **Scorers** help guide the exploration.
|
||||
* You enable it by setting `deep_crawl_strategy` in the `CrawlerRunConfig`.
|
||||
* A decorator mechanism intercepts `arun` calls to activate the strategy.
|
||||
* The strategy manages the queue of URLs and uses `crawler.arun_many` to crawl them in batches.
|
||||
|
||||
Deep crawling allows you to gather information from multiple related pages automatically. But how does Crawl4AI avoid re-fetching the same page over and over again, especially during these deeper crawls? The answer lies in caching.
|
||||
|
||||
**Next:** Let's explore how Crawl4AI smartly caches results with [Chapter 9: Smart Fetching with Caching - CacheContext / CacheMode](09_cachecontext___cachemode.md).
|
||||
|
||||
---
|
||||
|
||||
Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge)
|
||||
346
docs/Crawl4AI/09_cachecontext___cachemode.md
Normal file
@@ -0,0 +1,346 @@
|
||||
# Chapter 9: Smart Fetching with Caching - CacheContext / CacheMode
|
||||
|
||||
In the previous chapter, [Chapter 8: Exploring Websites - DeepCrawlStrategy](08_deepcrawlstrategy.md), we saw how Crawl4AI can explore websites by following links, potentially visiting many pages. During such explorations, or even when you run the same crawl multiple times, the crawler might try to fetch the exact same webpage again and again. This can be slow and might unnecessarily put a load on the website you're crawling. Wouldn't it be smarter to remember the result from the first time and just reuse it?
|
||||
|
||||
## What Problem Does Caching Solve?
|
||||
|
||||
Imagine you need to download a large instruction manual (a webpage) from the internet.
|
||||
|
||||
* **Without Caching:** Every single time you need the manual, you download the entire file again. This takes time and uses bandwidth every time.
|
||||
* **With Caching:** The first time you download it, you save a copy on your computer (the "cache"). The next time you need it, you first check your local copy. If it's there, you use it instantly! You only download it again if you specifically want the absolute latest version or if your local copy is missing.
|
||||
|
||||
Caching in Crawl4AI works the same way. It's a mechanism to **store the results** of crawling a webpage locally (in a database file). When asked to crawl a URL again, Crawl4AI can check its cache first. If a valid result is already stored, it can return that saved result almost instantly, saving time and resources.
|
||||
|
||||
## Introducing `CacheMode` and `CacheContext`
|
||||
|
||||
Crawl4AI uses two key concepts to manage this caching behavior:
|
||||
|
||||
1. **`CacheMode` (The Cache Policy):**
|
||||
* Think of this like setting the rules for how you interact with your saved instruction manuals.
|
||||
* It's an **instruction** you give the crawler for a specific run, telling it *how* to use the cache.
|
||||
* **Analogy:** Should you *always* use your saved copy if you have one? (`ENABLED`) Should you *ignore* your saved copies and always download a fresh one? (`BYPASS`) Should you *never* save any copies? (`DISABLED`) Should you save new copies but never reuse old ones? (`WRITE_ONLY`)
|
||||
* `CacheMode` lets you choose the caching behavior that best fits your needs for a particular task.
|
||||
|
||||
2. **`CacheContext` (The Decision Maker):**
|
||||
* This is an internal helper that Crawl4AI uses *during* a crawl. You don't usually interact with it directly.
|
||||
* It looks at the `CacheMode` you provided (the policy) and the type of URL being processed.
|
||||
* **Analogy:** Imagine a librarian who checks the library's borrowing rules (`CacheMode`) and the type of item you're requesting (e.g., a reference book that can't be checked out, like `raw:` HTML which isn't cached). Based on these, the librarian (`CacheContext`) decides if you can borrow an existing copy (read from cache) or if a new copy should be added to the library (write to cache).
|
||||
* It helps the main `AsyncWebCrawler` make the right decision about reading from or writing to the cache for each specific URL based on the active policy.
|
||||
|
||||
## Setting the Cache Policy: Using `CacheMode`
|
||||
|
||||
You control the caching behavior by setting the `cache_mode` parameter within the `CrawlerRunConfig` object that you pass to `crawler.arun()` or `crawler.arun_many()`.
|
||||
|
||||
Let's explore the most common `CacheMode` options:
|
||||
|
||||
**1. `CacheMode.ENABLED` (The Default Behavior - If not specified)**
|
||||
|
||||
* **Policy:** "Use the cache if a valid result exists. If not, fetch the page, save the result to the cache, and then return it."
|
||||
* This is the standard, balanced approach. It saves time on repeated crawls but ensures you get the content eventually.
|
||||
* *Note: In recent versions, the default when `cache_mode` is left completely unspecified may be `CacheMode.BYPASS`, so always check the documentation or set the mode explicitly for clarity.* In this tutorial we set it explicitly.
|
||||
|
||||
```python
|
||||
# chapter9_example_1.py
|
||||
import asyncio
|
||||
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
|
||||
|
||||
async def main():
|
||||
url = "https://httpbin.org/html"
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
# Explicitly set the mode to ENABLED
|
||||
config_enabled = CrawlerRunConfig(cache_mode=CacheMode.ENABLED)
|
||||
print(f"Running with CacheMode: {config_enabled.cache_mode.name}")
|
||||
|
||||
# First run: Fetches, caches, and returns result
|
||||
print("First run (ENABLED)...")
|
||||
result1 = await crawler.arun(url=url, config=config_enabled)
|
||||
print(f"Got result 1? {'Yes' if result1.success else 'No'}")
|
||||
|
||||
# Second run: Finds result in cache and returns it instantly
|
||||
print("Second run (ENABLED)...")
|
||||
result2 = await crawler.arun(url=url, config=config_enabled)
|
||||
print(f"Got result 2? {'Yes' if result2.success else 'No'}")
|
||||
# This second run should be much faster!
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
```
|
||||
|
||||
**Explanation:**
|
||||
|
||||
* We create a `CrawlerRunConfig` with `cache_mode=CacheMode.ENABLED`.
|
||||
* The first `arun` call fetches the page from the web and saves the result in the cache.
|
||||
* The second `arun` call (for the same URL and cache-relevant config) finds the saved result in the cache and returns it immediately, skipping the web fetch. You can confirm the speed-up by timing both calls, as sketched below.
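A drop-in addition to `chapter9_example_1.py` that times the two calls (it reuses `crawler`, `url`, and `config_enabled` from that script), in the same style as the later examples in this chapter:

```python
# Sketch: timing the cache hit (drop-in for chapter9_example_1.py)
import time

start = time.perf_counter()
result1 = await crawler.arun(url=url, config=config_enabled)   # fetched from the web
print(f"First run took  {time.perf_counter() - start:.2f}s")

start = time.perf_counter()
result2 = await crawler.arun(url=url, config=config_enabled)   # served from the cache
print(f"Second run took {time.perf_counter() - start:.2f}s (should be much faster)")
```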
|
||||
|
||||
**2. `CacheMode.BYPASS`**
|
||||
|
||||
* **Policy:** "Ignore any existing saved copy. Always fetch a fresh copy from the web. After fetching, save this new result to the cache (overwriting any old one)."
|
||||
* Useful when you *always* need the absolute latest version of the page, but you still want to update the cache for potential future use with `CacheMode.ENABLED`.
|
||||
|
||||
```python
|
||||
# chapter9_example_2.py
|
||||
import asyncio
|
||||
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
|
||||
import time
|
||||
|
||||
async def main():
|
||||
url = "https://httpbin.org/html"
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
# Set the mode to BYPASS
|
||||
config_bypass = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
|
||||
print(f"Running with CacheMode: {config_bypass.cache_mode.name}")
|
||||
|
||||
# First run: Fetches, caches, and returns result
|
||||
print("First run (BYPASS)...")
|
||||
start_time = time.perf_counter()
|
||||
result1 = await crawler.arun(url=url, config=config_bypass)
|
||||
duration1 = time.perf_counter() - start_time
|
||||
print(f"Got result 1? {'Yes' if result1.success else 'No'} (took {duration1:.2f}s)")
|
||||
|
||||
# Second run: Ignores cache, fetches again, updates cache, returns result
|
||||
print("Second run (BYPASS)...")
|
||||
start_time = time.perf_counter()
|
||||
result2 = await crawler.arun(url=url, config=config_bypass)
|
||||
duration2 = time.perf_counter() - start_time
|
||||
print(f"Got result 2? {'Yes' if result2.success else 'No'} (took {duration2:.2f}s)")
|
||||
# Both runs should take a similar amount of time (fetching time)
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
```
|
||||
|
||||
**Explanation:**
|
||||
|
||||
* We set `cache_mode=CacheMode.BYPASS`.
|
||||
* Both the first and second `arun` calls will fetch the page directly from the web, ignoring any previously cached result. They will still write the newly fetched result to the cache. Notice both runs take roughly the same amount of time (network fetch time).
|
||||
|
||||
**3. `CacheMode.DISABLED`**
|
||||
|
||||
* **Policy:** "Completely ignore the cache. Never read from it, never write to it."
|
||||
* Useful when you don't want Crawl4AI to interact with the cache files at all, perhaps for debugging or if you have storage constraints.
|
||||
|
||||
```python
|
||||
# chapter9_example_3.py
|
||||
import asyncio
|
||||
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
|
||||
import time
|
||||
|
||||
async def main():
|
||||
url = "https://httpbin.org/html"
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
# Set the mode to DISABLED
|
||||
config_disabled = CrawlerRunConfig(cache_mode=CacheMode.DISABLED)
|
||||
print(f"Running with CacheMode: {config_disabled.cache_mode.name}")
|
||||
|
||||
# First run: Fetches, returns result (does NOT cache)
|
||||
print("First run (DISABLED)...")
|
||||
start_time = time.perf_counter()
|
||||
result1 = await crawler.arun(url=url, config=config_disabled)
|
||||
duration1 = time.perf_counter() - start_time
|
||||
print(f"Got result 1? {'Yes' if result1.success else 'No'} (took {duration1:.2f}s)")
|
||||
|
||||
# Second run: Fetches again, returns result (does NOT cache)
|
||||
print("Second run (DISABLED)...")
|
||||
start_time = time.perf_counter()
|
||||
result2 = await crawler.arun(url=url, config=config_disabled)
|
||||
duration2 = time.perf_counter() - start_time
|
||||
print(f"Got result 2? {'Yes' if result2.success else 'No'} (took {duration2:.2f}s)")
|
||||
# Both runs fetch fresh, and nothing is ever saved to the cache.
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
```
|
||||
|
||||
**Explanation:**
|
||||
|
||||
* We set `cache_mode=CacheMode.DISABLED`.
|
||||
* Both `arun` calls fetch fresh content from the web. Crucially, neither run reads from nor writes to the cache database.
|
||||
|
||||
**Other Modes (`READ_ONLY`, `WRITE_ONLY`):**
|
||||
|
||||
* `CacheMode.READ_ONLY`: Only uses existing cached results. If a result isn't in the cache, it will fail or return an empty result rather than fetching it. Never saves anything new.
|
||||
* `CacheMode.WRITE_ONLY`: Never reads from the cache (always fetches fresh). It *only* writes the newly fetched result to the cache. A short configuration sketch follows below.
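As with the other modes, only the `cache_mode` value changes:

```python
# Sketch: configuring the remaining cache policies
from crawl4ai import CrawlerRunConfig, CacheMode

config_read_only = CrawlerRunConfig(cache_mode=CacheMode.READ_ONLY)    # reuse cached results, never save new ones
config_write_only = CrawlerRunConfig(cache_mode=CacheMode.WRITE_ONLY)  # always fetch fresh, only save
```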
|
||||
|
||||
## How Caching Works Internally
|
||||
|
||||
When you call `crawler.arun(url="...", config=...)`:
|
||||
|
||||
1. **Create Context:** The `AsyncWebCrawler` creates a `CacheContext` instance using the `url` and the `config.cache_mode`.
|
||||
2. **Check Read:** It asks the `CacheContext`, "Should I read from the cache?" (`cache_context.should_read()`).
|
||||
3. **Try Reading:** If `should_read()` is `True`, it asks the database manager ([`AsyncDatabaseManager`](async_database.py)) to look for a cached result for the `url`.
|
||||
4. **Cache Hit?**
|
||||
* If a valid cached result is found: The `AsyncWebCrawler` returns this cached `CrawlResult` immediately. Done!
|
||||
* If no cached result is found (or if `should_read()` was `False`): Proceed to fetching.
|
||||
5. **Fetch:** The `AsyncWebCrawler` calls the appropriate [AsyncCrawlerStrategy](01_asynccrawlerstrategy.md) to fetch the content from the web.
|
||||
6. **Process:** It processes the fetched HTML (scraping, filtering, extracting) to create a new `CrawlResult`.
|
||||
7. **Check Write:** It asks the `CacheContext`, "Should I write this result to the cache?" (`cache_context.should_write()`).
|
||||
8. **Write Cache:** If `should_write()` is `True`, it tells the database manager to save the new `CrawlResult` into the cache database.
|
||||
9. **Return:** The `AsyncWebCrawler` returns the newly created `CrawlResult`.
|
||||
|
||||
```mermaid
|
||||
sequenceDiagram
|
||||
participant User
|
||||
participant AWC as AsyncWebCrawler
|
||||
participant Ctx as CacheContext
|
||||
participant DB as DatabaseManager
|
||||
participant Fetcher as AsyncCrawlerStrategy
|
||||
|
||||
User->>AWC: arun(url, config)
|
||||
AWC->>Ctx: Create CacheContext(url, config.cache_mode)
|
||||
AWC->>Ctx: should_read()?
|
||||
alt Cache Read Allowed
|
||||
Ctx-->>AWC: Yes
|
||||
AWC->>DB: aget_cached_url(url)
|
||||
DB-->>AWC: Cached Result (or None)
|
||||
alt Cache Hit & Valid
|
||||
AWC-->>User: Return Cached CrawlResult
|
||||
else Cache Miss or Invalid
|
||||
AWC->>AWC: Proceed to Fetch
|
||||
end
|
||||
else Cache Read Not Allowed
|
||||
Ctx-->>AWC: No
|
||||
AWC->>AWC: Proceed to Fetch
|
||||
end
|
||||
|
||||
Note over AWC: Fetching Required
|
||||
AWC->>Fetcher: crawl(url, config)
|
||||
Fetcher-->>AWC: Raw Response
|
||||
AWC->>AWC: Process HTML -> New CrawlResult
|
||||
AWC->>Ctx: should_write()?
|
||||
alt Cache Write Allowed
|
||||
Ctx-->>AWC: Yes
|
||||
AWC->>DB: acache_url(New CrawlResult)
|
||||
DB-->>AWC: OK
|
||||
else Cache Write Not Allowed
|
||||
Ctx-->>AWC: No
|
||||
end
|
||||
AWC-->>User: Return New CrawlResult
|
||||
|
||||
```
|
||||
|
||||
## Code Glimpse
|
||||
|
||||
Let's look at simplified code snippets.
|
||||
|
||||
**Inside `async_webcrawler.py` (where `arun` uses caching):**
|
||||
|
||||
```python
|
||||
# Simplified from crawl4ai/async_webcrawler.py
|
||||
from .cache_context import CacheContext, CacheMode
|
||||
from .async_database import async_db_manager
|
||||
from .models import CrawlResult
|
||||
# ... other imports
|
||||
|
||||
class AsyncWebCrawler:
|
||||
# ... (init, other methods) ...
|
||||
|
||||
async def arun(self, url: str, config: CrawlerRunConfig = None) -> CrawlResult:
|
||||
# ... (ensure config exists, set defaults) ...
|
||||
if config.cache_mode is None:
|
||||
config.cache_mode = CacheMode.ENABLED # Example default
|
||||
|
||||
# 1. Create CacheContext
|
||||
cache_context = CacheContext(url, config.cache_mode)
|
||||
|
||||
cached_result = None
|
||||
# 2. Check if cache read is allowed
|
||||
if cache_context.should_read():
|
||||
# 3. Try reading from database
|
||||
cached_result = await async_db_manager.aget_cached_url(url)
|
||||
|
||||
# 4. If cache hit and valid, return it
|
||||
if cached_result and self._is_cache_valid(cached_result, config):
|
||||
self.logger.info("Cache hit for: %s", url) # Example log
|
||||
return cached_result # Return early
|
||||
|
||||
# 5. Fetch fresh content (if no cache hit or read disabled)
|
||||
async_response = await self.crawler_strategy.crawl(url, config=config)
|
||||
html = async_response.html # ... and other data ...
|
||||
|
||||
# 6. Process the HTML to get a new CrawlResult
|
||||
crawl_result = await self.aprocess_html(
|
||||
url=url, html=html, config=config, # ... other params ...
|
||||
)
|
||||
|
||||
# 7. Check if cache write is allowed
|
||||
if cache_context.should_write():
|
||||
# 8. Write the new result to the database
|
||||
await async_db_manager.acache_url(crawl_result)
|
||||
|
||||
# 9. Return the new result
|
||||
return crawl_result
|
||||
|
||||
def _is_cache_valid(self, cached_result: CrawlResult, config: CrawlerRunConfig) -> bool:
|
||||
# Internal logic to check if cached result meets current needs
|
||||
# (e.g., was screenshot requested now but not cached?)
|
||||
if config.screenshot and not cached_result.screenshot: return False
|
||||
if config.pdf and not cached_result.pdf: return False
|
||||
# ... other checks ...
|
||||
return True
|
||||
```
|
||||
|
||||
**Inside `cache_context.py` (defining the concepts):**
|
||||
|
||||
```python
|
||||
# Simplified from crawl4ai/cache_context.py
|
||||
from enum import Enum
|
||||
|
||||
class CacheMode(Enum):
|
||||
"""Defines the caching behavior for web crawling operations."""
|
||||
ENABLED = "enabled" # Read and Write
|
||||
DISABLED = "disabled" # No Read, No Write
|
||||
READ_ONLY = "read_only" # Read Only, No Write
|
||||
WRITE_ONLY = "write_only" # Write Only, No Read
|
||||
BYPASS = "bypass" # No Read, Write Only (similar to WRITE_ONLY but explicit intention)
|
||||
|
||||
class CacheContext:
|
||||
"""Encapsulates cache-related decisions and URL handling."""
|
||||
def __init__(self, url: str, cache_mode: CacheMode, always_bypass: bool = False):
|
||||
self.url = url
|
||||
self.cache_mode = cache_mode
|
||||
self.always_bypass = always_bypass # Usually False
|
||||
# Determine if URL type is cacheable (e.g., not 'raw:')
|
||||
self.is_cacheable = url.startswith(("http://", "https://", "file://"))
|
||||
# ... other URL type checks ...
|
||||
|
||||
def should_read(self) -> bool:
|
||||
"""Determines if cache should be read based on context."""
|
||||
if self.always_bypass or not self.is_cacheable:
|
||||
return False
|
||||
# Allow read if mode is ENABLED or READ_ONLY
|
||||
return self.cache_mode in [CacheMode.ENABLED, CacheMode.READ_ONLY]
|
||||
|
||||
def should_write(self) -> bool:
|
||||
"""Determines if cache should be written based on context."""
|
||||
if self.always_bypass or not self.is_cacheable:
|
||||
return False
|
||||
# Allow write if mode is ENABLED, WRITE_ONLY, or BYPASS
|
||||
return self.cache_mode in [CacheMode.ENABLED, CacheMode.WRITE_ONLY, CacheMode.BYPASS]
|
||||
|
||||
@property
|
||||
def display_url(self) -> str:
|
||||
"""Returns the URL in display format."""
|
||||
return self.url if not self.url.startswith("raw:") else "Raw HTML"
|
||||
|
||||
# Helper for backward compatibility (may be removed later)
|
||||
def _legacy_to_cache_mode(...) -> CacheMode:
|
||||
# ... logic to convert old boolean flags ...
|
||||
pass
|
||||
```
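Based on the simplified class above, you can trace the decisions it makes for a given policy. A small sketch, assuming `CacheContext` is importable from `crawl4ai.cache_context` as the file path in the glimpse suggests:

```python
# Sketch: how CacheContext applies a policy (based on the simplified class above)
from crawl4ai.cache_context import CacheContext, CacheMode  # module path taken from the glimpse above

ctx = CacheContext(url="https://example.com", cache_mode=CacheMode.BYPASS)
print(ctx.should_read())    # False -> ignore any saved copy
print(ctx.should_write())   # True  -> store the freshly fetched result

raw_ctx = CacheContext(url="raw:<html><body>Hi</body></html>", cache_mode=CacheMode.ENABLED)
print(raw_ctx.should_read())  # False -> raw HTML is not cacheable, whatever the mode
```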
|
||||
|
||||
## Conclusion
|
||||
|
||||
You've learned how Crawl4AI uses caching to avoid redundant work and speed up repeated crawls!
|
||||
|
||||
* **Caching** stores results locally to reuse them later.
|
||||
* **`CacheMode`** is the policy you set in `CrawlerRunConfig` to control *how* the cache is used (`ENABLED`, `BYPASS`, `DISABLED`, etc.).
|
||||
* **`CacheContext`** is an internal helper that makes decisions based on the `CacheMode` and URL type.
|
||||
* Using the cache effectively (especially `CacheMode.ENABLED`) can significantly speed up your crawling tasks, particularly during development or when dealing with many URLs, including deep crawls.
|
||||
|
||||
We've seen how Crawl4AI can crawl single pages, lists of pages (`arun_many`), and even explore websites (`DeepCrawlStrategy`). But how does `arun_many` or a deep crawl manage running potentially hundreds or thousands of individual crawl tasks efficiently without overwhelming your system or the target website?
|
||||
|
||||
**Next:** Let's explore the component responsible for managing concurrent tasks: [Chapter 10: Orchestrating the Crawl - BaseDispatcher](10_basedispatcher.md).
|
||||
|
||||
---
|
||||
|
||||
Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge)
|
||||
387
docs/Crawl4AI/10_basedispatcher.md
Normal file
@@ -0,0 +1,387 @@
# Chapter 10: Orchestrating the Crawl - BaseDispatcher

In [Chapter 9: Smart Fetching with Caching - CacheContext / CacheMode](09_cachecontext___cachemode.md), we learned how Crawl4AI uses caching to cleverly avoid re-fetching the same webpage multiple times, which is especially helpful when crawling many URLs. We've also seen how methods like `arun_many()` ([Chapter 2: Meet the General Manager - AsyncWebCrawler](02_asyncwebcrawler.md)) or strategies like [DeepCrawlStrategy](08_deepcrawlstrategy.md) can lead to potentially hundreds or thousands of individual URLs needing to be crawled.

This raises a question: if we have 1000 URLs to crawl, does Crawl4AI try to crawl all 1000 simultaneously? That would likely overwhelm your computer's resources (like memory and CPU) and could also flood the target website with too many requests, potentially getting you blocked! How does Crawl4AI manage running many crawls efficiently and responsibly?

## What Problem Does `BaseDispatcher` Solve?

Imagine you're managing a fleet of delivery drones (`AsyncWebCrawler` tasks) that need to pick up packages from many different addresses (URLs). If you launch all 1000 drones at the exact same moment:

* Your control station (your computer) might crash due to the processing load.
* The central warehouse (the target website) might get overwhelmed by simultaneous arrivals.
* Some drones might collide or interfere with each other.

You need a **Traffic Controller** or a **Dispatch Center** to manage the fleet. This controller decides:

1. How many drones can be active in the air at any one time.
2. When to launch the next drone, maybe based on available airspace (system resources) or just a simple count limit.
3. How to handle potential delays or issues (like rate limiting from a specific website).

In Crawl4AI, the `BaseDispatcher` acts as this **Traffic Controller** or **Task Scheduler** for concurrent crawling operations, primarily when using `arun_many()`. It manages *how* multiple crawl tasks are executed concurrently, ensuring the process is efficient without overwhelming your system or the target websites.

## What is `BaseDispatcher`?

`BaseDispatcher` is an abstract concept (a blueprint or job description) in Crawl4AI. It defines *that* we need a system for managing the execution of multiple, concurrent crawling tasks. It specifies the *interface* for how the main `AsyncWebCrawler` interacts with such a system, but the specific *logic* for managing concurrency can vary.

Think of it as the control panel for our drone fleet: the panel exists, but the specific rules programmed into it determine how drones are dispatched.

## The Different Controllers: Ways to Dispatch Tasks

Crawl4AI provides concrete implementations (the actual traffic control systems) based on the `BaseDispatcher` blueprint:

1. **`SemaphoreDispatcher` (The Simple Counter):**
    * **Analogy:** A parking garage with a fixed number of spots (e.g., 10). A gate (`asyncio.Semaphore`) only lets a new car in if one of the 10 spots is free.
    * **How it works:** You tell it the maximum number of crawls that can run *at the same time* (e.g., `semaphore_count=10`). It uses a simple counter (a semaphore) to ensure that no more than this number of crawls are active simultaneously. When one crawl finishes, it allows another one from the queue to start.
    * **Good for:** Simple, direct control over concurrency when you know a specific limit works well for your system and the target sites.

2. **`MemoryAdaptiveDispatcher` (The Resource-Aware Controller - Default):**
    * **Analogy:** A smart parking garage attendant who checks not just the number of cars, but also the *total space* they occupy (system memory). They might stop letting cars in if the garage is nearing its memory capacity, even if some numbered spots are technically free.
    * **How it works:** This dispatcher monitors your system's available memory. It tries to run multiple crawls concurrently (up to a configurable maximum like `max_session_permit`), but it will pause launching new crawls if the system memory usage exceeds a certain threshold (e.g., `memory_threshold_percent=90.0`). It adapts the concurrency level based on available resources.
    * **Good for:** Automatically adjusting concurrency to prevent out-of-memory errors, especially when crawl tasks vary significantly in resource usage. **This is the default dispatcher used by `arun_many` if you don't specify one.**

These dispatchers can also optionally work with a `RateLimiter` component, which adds politeness rules for specific websites (e.g., slowing down requests to a domain if it returns "429 Too Many Requests").

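If you want to tune the default controller rather than replace it, you can construct it yourself. The sketch below uses only the knobs mentioned above (`memory_threshold_percent`, `max_session_permit`, a `rate_limiter`); the `RateLimiter` constructor arguments (`base_delay`, `max_retries`) are assumptions, so check the exact signatures against your installed version of Crawl4AI.

```python
# A minimal sketch of customizing the default dispatcher.
# RateLimiter argument names (base_delay, max_retries) are assumptions;
# memory_threshold_percent and max_session_permit are the knobs described above.
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.async_dispatcher import MemoryAdaptiveDispatcher, RateLimiter

async def main():
    dispatcher = MemoryAdaptiveDispatcher(
        memory_threshold_percent=80.0,  # pause new tasks above 80% memory usage
        max_session_permit=10,          # never run more than 10 crawls at once
        rate_limiter=RateLimiter(
            base_delay=(1.0, 2.0),      # polite random delay between hits to a domain
            max_retries=2,              # give up after repeated 429/503 responses
        ),
    )

    urls = [f"https://httpbin.org/links/10/{i}" for i in range(5)]
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many(
            urls=urls,
            config=CrawlerRunConfig(stream=False),
            dispatcher=dispatcher,
        )
    print(f"Got {len(results)} results")

if __name__ == "__main__":
    asyncio.run(main())
```
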
## How `arun_many` Uses the Dispatcher

When you call `crawler.arun_many(urls=...)`, here's the basic flow involving the dispatcher:

1. **Get URLs:** `arun_many` receives the list of URLs you want to crawl.
2. **Select Dispatcher:** It checks if you provided a specific `dispatcher` instance. If not, it creates an instance of the default `MemoryAdaptiveDispatcher`.
3. **Delegate Execution:** It hands over the list of URLs and the `CrawlerRunConfig` to the chosen dispatcher's `run_urls` (or `run_urls_stream`) method.
4. **Manage Tasks:** The dispatcher takes charge:
    * It iterates through the URLs.
    * For each URL, it decides *when* to start the actual crawl based on its rules (semaphore count, memory usage, rate limits).
    * When ready, it typically calls the single-page `crawler.arun(url, config)` method internally for that specific URL, wrapped within its concurrency control mechanism.
    * It manages the running tasks (e.g., using `asyncio.create_task` and `asyncio.wait`).
5. **Collect Results:** As individual `arun` calls complete, the dispatcher collects their `CrawlResult` objects.
6. **Return:** Once all URLs are processed, the dispatcher returns the list of results (or yields them if streaming).

```mermaid
sequenceDiagram
    participant User
    participant AWC as AsyncWebCrawler
    participant Dispatcher as BaseDispatcher (e.g., MemoryAdaptive)
    participant TaskPool as Concurrency Manager

    User->>AWC: arun_many(urls, config, dispatcher?)
    AWC->>Dispatcher: run_urls(crawler=AWC, urls, config)
    Dispatcher->>TaskPool: Initialize (e.g., set max concurrency)
    loop For each URL in urls
        Dispatcher->>TaskPool: Can I start a new task? (Checks limits)
        alt Yes
            TaskPool-->>Dispatcher: OK
            Note over Dispatcher: Create task: call AWC.arun(url, config) internally
            Dispatcher->>TaskPool: Add new task
        else No
            TaskPool-->>Dispatcher: Wait
            Note over Dispatcher: Waits for a running task to finish
        end
    end
    Note over Dispatcher: Manages running tasks, collects results
    Dispatcher-->>AWC: List of CrawlResults
    AWC-->>User: List of CrawlResults
```

## Using the Dispatcher (Often Implicitly!)

Most of the time, you don't need to think about the dispatcher explicitly. When you use `arun_many`, the default `MemoryAdaptiveDispatcher` handles things automatically.

```python
# chapter10_example_1.py
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    urls_to_crawl = [
        "https://httpbin.org/html",
        "https://httpbin.org/links/5/0",  # Page with 5 links
        "https://httpbin.org/robots.txt",
        "https://httpbin.org/status/200",
    ]

    # We DON'T specify a dispatcher here.
    # arun_many will use the default MemoryAdaptiveDispatcher.
    async with AsyncWebCrawler() as crawler:
        print(f"Crawling {len(urls_to_crawl)} URLs using the default dispatcher...")
        config = CrawlerRunConfig(stream=False)  # Get results as a list at the end

        # The MemoryAdaptiveDispatcher manages concurrency behind the scenes.
        results = await crawler.arun_many(urls=urls_to_crawl, config=config)

        print(f"\nFinished! Got {len(results)} results.")
        for result in results:
            status = "✅" if result.success else "❌"
            url_short = result.url.split('/')[-1]
            # metadata can be empty for non-HTML responses, so guard the lookup
            title = (result.metadata or {}).get('title', 'N/A')
            print(f"  {status} {url_short:<15} | Title: {title}")

if __name__ == "__main__":
    asyncio.run(main())
```

**Explanation:**

* We call `crawler.arun_many` without passing a `dispatcher` argument.
* Crawl4AI automatically creates and uses a `MemoryAdaptiveDispatcher`.
* This dispatcher runs the crawls concurrently, adapting to your system's memory, and returns all the results once completed (because `stream=False`). You benefit from concurrency without explicit setup.

## Explicitly Choosing a Dispatcher

What if you want simpler, fixed concurrency? You can explicitly create and pass a `SemaphoreDispatcher`.

```python
# chapter10_example_2.py
import asyncio
from crawl4ai import (
    AsyncWebCrawler,
    CrawlerRunConfig,
    SemaphoreDispatcher  # 1. Import the specific dispatcher
)

async def main():
    urls_to_crawl = [
        "https://httpbin.org/delay/1",  # Takes 1 second
        "https://httpbin.org/delay/1",
        "https://httpbin.org/delay/1",
        "https://httpbin.org/delay/1",
        "https://httpbin.org/delay/1",
    ]

    # 2. Create an instance of the SemaphoreDispatcher.
    #    Allow only 2 crawls to run at the same time.
    semaphore_controller = SemaphoreDispatcher(semaphore_count=2)
    print(f"Using SemaphoreDispatcher with limit: {semaphore_controller.semaphore_count}")

    async with AsyncWebCrawler() as crawler:
        print(f"Crawling {len(urls_to_crawl)} URLs with explicit dispatcher...")
        config = CrawlerRunConfig(stream=False)

        # 3. Pass the dispatcher instance to arun_many
        results = await crawler.arun_many(
            urls=urls_to_crawl,
            config=config,
            dispatcher=semaphore_controller  # Pass our controller
        )

        print(f"\nFinished! Got {len(results)} results.")
        # This crawl likely took around 3 seconds (5 tasks, 1s each, 2 concurrent = ceil(5/2) * 1s)
        for result in results:
            status = "✅" if result.success else "❌"
            print(f"  {status} {result.url}")

if __name__ == "__main__":
    asyncio.run(main())
```

**Explanation:**

1. **Import:** We import `SemaphoreDispatcher`.
2. **Instantiate:** We create `SemaphoreDispatcher(semaphore_count=2)`, limiting concurrency to 2 simultaneous crawls.
3. **Pass Dispatcher:** We pass our `semaphore_controller` instance directly to the `dispatcher` parameter of `arun_many`.
4. **Execution:** Now, `arun_many` uses our `SemaphoreDispatcher`. It will start the first two crawls. As one finishes, it will start the next one from the list, always ensuring no more than two are running concurrently. (If you would rather process results as they arrive instead of waiting for the full list, see the streaming sketch below.)

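Both examples above use `stream=False` and collect everything at the end. The dispatchers also have a streaming path (`run_urls_stream`); the sketch below assumes that setting `stream=True` in `CrawlerRunConfig` makes `arun_many` return an async generator you can iterate with `async for`. Treat that calling pattern as an assumption and verify it against your installed version.

```python
# A minimal streaming sketch. Assumes stream=True makes arun_many yield
# CrawlResult objects as each crawl finishes (an assumption to verify).
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, SemaphoreDispatcher

async def main():
    urls = ["https://httpbin.org/delay/1"] * 4
    dispatcher = SemaphoreDispatcher(semaphore_count=2)
    config = CrawlerRunConfig(stream=True)  # yield results as they complete

    async with AsyncWebCrawler() as crawler:
        async for result in await crawler.arun_many(
            urls=urls, config=config, dispatcher=dispatcher
        ):
            # Each result arrives as soon as its crawl finishes,
            # so processing can start before the whole batch is done.
            print("Done:", result.url, result.success)

if __name__ == "__main__":
    asyncio.run(main())
```
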
## A Glimpse Under the Hood

Where are these dispatchers defined? In `crawl4ai/async_dispatcher.py`.

**The Blueprint (`BaseDispatcher`):**

```python
# Simplified from crawl4ai/async_dispatcher.py
from abc import ABC, abstractmethod
from typing import AsyncGenerator, List, Optional
# ... other imports like CrawlerRunConfig, CrawlerTaskResult, AsyncWebCrawler ...

class BaseDispatcher(ABC):
    def __init__(
        self,
        rate_limiter: Optional[RateLimiter] = None,
        monitor: Optional[CrawlerMonitor] = None,
    ):
        self.crawler = None  # Will be set by arun_many
        self.rate_limiter = rate_limiter
        self.monitor = monitor
        # ... other common state ...

    @abstractmethod
    async def crawl_url(
        self,
        url: str,
        config: CrawlerRunConfig,
        task_id: str,
        # ... maybe other internal params ...
    ) -> CrawlerTaskResult:
        """Crawls a single URL, potentially handling concurrency primitives."""
        # This is often the core worker method called by run_urls
        pass

    @abstractmethod
    async def run_urls(
        self,
        urls: List[str],
        crawler: "AsyncWebCrawler",
        config: CrawlerRunConfig,
    ) -> List[CrawlerTaskResult]:
        """Manages the concurrent execution of crawl_url for multiple URLs."""
        # This is the main entry point called by arun_many
        pass

    async def run_urls_stream(
        self,
        urls: List[str],
        crawler: "AsyncWebCrawler",
        config: CrawlerRunConfig,
    ) -> AsyncGenerator[CrawlerTaskResult, None]:
        """Streaming version of run_urls (might be implemented in base or subclasses)."""
        # Example default implementation (subclasses might override)
        results = await self.run_urls(urls, crawler, config)
        for res in results:
            yield res  # Naive stream; the real one is more complex

    # ... other potential helper methods ...
```

**Example Implementation (`SemaphoreDispatcher`):**

```python
# Simplified from crawl4ai/async_dispatcher.py
import asyncio
import uuid
import psutil  # For memory tracking in crawl_url
import time    # For timing in crawl_url
# ... other imports ...

class SemaphoreDispatcher(BaseDispatcher):
    def __init__(
        self,
        semaphore_count: int = 5,
        # ... other params like rate_limiter, monitor ...
    ):
        super().__init__(...)  # Pass rate_limiter, monitor to base
        self.semaphore_count = semaphore_count

    async def crawl_url(
        self,
        url: str,
        config: CrawlerRunConfig,
        task_id: str,
        semaphore: asyncio.Semaphore = None,  # Takes the semaphore
    ) -> CrawlerTaskResult:
        # ... (track start time and memory usage, similar to MemoryAdaptiveDispatcher's version)
        start_time = time.time()
        error_message = ""
        memory_usage = peak_memory = 0.0
        result = None

        try:
            # Update monitor state if used
            if self.monitor: self.monitor.update_task(task_id, status=CrawlStatus.IN_PROGRESS)

            # Wait for the rate limiter if used
            if self.rate_limiter: await self.rate_limiter.wait_if_needed(url)

            # --- Core Semaphore Logic ---
            async with semaphore:  # Acquire a spot from the semaphore
                # Now that we have a spot, run the actual crawl
                process = psutil.Process()
                start_memory = process.memory_info().rss / (1024 * 1024)

                # Call the single-page crawl method of the main crawler
                result = await self.crawler.arun(url, config=config, session_id=task_id)

                end_memory = process.memory_info().rss / (1024 * 1024)
                memory_usage = peak_memory = end_memory - start_memory
            # --- Semaphore spot is released automatically on exiting 'async with' ---

            # Update the rate limiter based on the result status if used
            if self.rate_limiter and result.status_code:
                if not self.rate_limiter.update_delay(url, result.status_code):
                    # Handle retry limit exceeded
                    error_message = "Rate limit retry count exceeded"
                    # ... update monitor, prepare error result ...

            # Update monitor status (success/fail)
            if result and not result.success: error_message = result.error_message
            if self.monitor: self.monitor.update_task(task_id, status=CrawlStatus.COMPLETED if result.success else CrawlStatus.FAILED)

        except Exception as e:
            # Handle unexpected errors during the crawl
            error_message = str(e)
            if self.monitor: self.monitor.update_task(task_id, status=CrawlStatus.FAILED)
            # Create a failed CrawlResult if needed
            if not result: result = CrawlResult(url=url, html="", success=False, error_message=error_message)

        finally:
            # Final monitor update with timing, memory, etc.
            end_time = time.time()
            if self.monitor: self.monitor.update_task(...)

        # Package everything into a CrawlerTaskResult
        return CrawlerTaskResult(...)

    async def run_urls(
        self,
        urls: List[str],
        crawler: "AsyncWebCrawler",
        config: CrawlerRunConfig,
    ) -> List[CrawlerTaskResult]:
        self.crawler = crawler  # Store the crawler instance
        if self.monitor: self.monitor.start()

        try:
            # Create the semaphore with the specified count
            semaphore = asyncio.Semaphore(self.semaphore_count)
            tasks = []

            # Create a crawl task for each URL, passing the semaphore
            for url in urls:
                task_id = str(uuid.uuid4())
                if self.monitor: self.monitor.add_task(task_id, url)
                # Create an asyncio task to run crawl_url
                task = asyncio.create_task(
                    self.crawl_url(url, config, task_id, semaphore=semaphore)
                )
                tasks.append(task)

            # Wait for all created tasks to complete.
            # asyncio.gather runs them concurrently, respecting the semaphore limit.
            results = await asyncio.gather(*tasks, return_exceptions=True)

            # Process results (handle potential exceptions returned by gather)
            final_results = []
            for res in results:
                if isinstance(res, Exception):
                    # Handle the case where gather caught an exception from a task.
                    # You might create a failed CrawlerTaskResult here.
                    pass
                elif isinstance(res, CrawlerTaskResult):
                    final_results.append(res)
            return final_results
        finally:
            if self.monitor: self.monitor.stop()

    # run_urls_stream would have similar logic but use asyncio.as_completed
    # or manage tasks manually to yield results as they finish.
```

The key takeaway is that the `Dispatcher` orchestrates calls to the single-page `crawler.arun` method, wrapping them with concurrency controls (like the `async with semaphore:` block) before running them using `asyncio`'s concurrency tools (`asyncio.create_task`, `asyncio.gather`, etc.).

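If you strip away the Crawl4AI-specific pieces, the underlying pattern is plain `asyncio`. The following standalone sketch (not taken from the library) shows the same semaphore-plus-gather structure, with a dummy coroutine standing in for `crawler.arun`:

```python
import asyncio

async def fake_crawl(url: str, semaphore: asyncio.Semaphore) -> str:
    # Acquire one of the limited "spots" before doing the work,
    # just as SemaphoreDispatcher.crawl_url does around crawler.arun().
    async with semaphore:
        await asyncio.sleep(1)  # stand-in for the real fetch
        return f"done: {url}"

async def main():
    urls = [f"https://example.com/page/{i}" for i in range(5)]
    semaphore = asyncio.Semaphore(2)  # at most 2 "crawls" at once
    tasks = [asyncio.create_task(fake_crawl(u, semaphore)) for u in urls]
    results = await asyncio.gather(*tasks)  # finishes in ~3s (ceil(5/2) * 1s)
    print(results)

if __name__ == "__main__":
    asyncio.run(main())
```
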
## Conclusion

You've learned about `BaseDispatcher`, the crucial "Traffic Controller" that manages concurrent crawls in Crawl4AI, especially for `arun_many`.

* It solves the problem of efficiently running many crawls without overloading your system or the target websites.
* It acts as a **blueprint** for managing concurrency.
* Key implementations:
    * **`SemaphoreDispatcher`**: Uses a simple count limit.
    * **`MemoryAdaptiveDispatcher`**: Adjusts concurrency based on system memory (the default for `arun_many`).
* The dispatcher is used **automatically** by `arun_many`, but you can provide a specific instance if needed.
* It orchestrates the execution of individual crawl tasks, respecting defined limits.

Understanding the dispatcher helps you appreciate how Crawl4AI handles large-scale crawling tasks responsibly and efficiently.

This concludes our tour of the core concepts in Crawl4AI! We've covered how pages are fetched, how the process is managed, how content is cleaned, filtered, and extracted, how deep crawls are performed, how caching optimizes fetches, and finally, how concurrency is managed. You now have a solid foundation to start building powerful web data extraction and processing applications with Crawl4AI. Happy crawling!

---

Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge)

52
docs/Crawl4AI/index.md
Normal file
@@ -0,0 +1,52 @@

# Tutorial: Crawl4AI

`Crawl4AI` is a flexible Python library for *asynchronously crawling websites* and *extracting structured content*, designed specifically for **AI use cases**.
You primarily interact with the `AsyncWebCrawler`, which acts as the main coordinator. You provide it with URLs and a `CrawlerRunConfig` detailing *how* to crawl (e.g., which strategies to use for fetching, scraping, filtering, and extraction).
It can handle single pages or multiple URLs concurrently using a `BaseDispatcher`, optionally crawl deeper by following links via a `DeepCrawlStrategy`, manage caching through `CacheMode`, and apply a `RelevantContentFilter` before finally returning a `CrawlResult` containing all the gathered data.

**Source Repository:** [https://github.com/unclecode/crawl4ai/tree/9c58e4ce2ee025debd3f36bf213330bd72b90e46/crawl4ai](https://github.com/unclecode/crawl4ai/tree/9c58e4ce2ee025debd3f36bf213330bd72b90e46/crawl4ai)

```mermaid
flowchart TD
    A0["AsyncWebCrawler"]
    A1["CrawlerRunConfig"]
    A2["AsyncCrawlerStrategy"]
    A3["ContentScrapingStrategy"]
    A4["ExtractionStrategy"]
    A5["CrawlResult"]
    A6["BaseDispatcher"]
    A7["DeepCrawlStrategy"]
    A8["CacheContext / CacheMode"]
    A9["RelevantContentFilter"]
    A0 -- "Configured by" --> A1
    A0 -- "Uses Fetching Strategy" --> A2
    A0 -- "Uses Scraping Strategy" --> A3
    A0 -- "Uses Extraction Strategy" --> A4
    A0 -- "Produces" --> A5
    A0 -- "Uses Dispatcher for `arun_m..." --> A6
    A0 -- "Uses Caching Logic" --> A8
    A6 -- "Calls Crawler's `arun`" --> A0
    A1 -- "Specifies Deep Crawl Strategy" --> A7
    A7 -- "Processes Links from" --> A5
    A3 -- "Provides Cleaned HTML to" --> A9
    A1 -- "Specifies Content Filter" --> A9
```

## Chapters

1. [AsyncCrawlerStrategy](01_asynccrawlerstrategy.md)
2. [AsyncWebCrawler](02_asyncwebcrawler.md)
3. [CrawlerRunConfig](03_crawlerrunconfig.md)
4. [ContentScrapingStrategy](04_contentscrapingstrategy.md)
5. [RelevantContentFilter](05_relevantcontentfilter.md)
6. [ExtractionStrategy](06_extractionstrategy.md)
7. [CrawlResult](07_crawlresult.md)
8. [DeepCrawlStrategy](08_deepcrawlstrategy.md)
9. [CacheContext / CacheMode](09_cachecontext___cachemode.md)
10. [BaseDispatcher](10_basedispatcher.md)

---

Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge)