---
layout: default
title: "CrawlerRunConfig"
parent: "Crawl4AI"
nav_order: 3
---
# Chapter 3: Giving Instructions - CrawlerRunConfig

In [Chapter 2: Meet the General Manager - AsyncWebCrawler](02_asyncwebcrawler.md), we met the `AsyncWebCrawler`, the central coordinator for our web crawling tasks. We saw how to tell it *what* URL to crawl using the `arun` method.

But what if we want to tell the crawler *how* to crawl that URL? Maybe we want it to take a picture (screenshot) of the page? Or perhaps we only care about a specific section of the page? Or maybe we want to ignore the cache and get the very latest version?

Passing all these different instructions individually every time we call `arun` could get complicated and messy.

```python
# Imagine doing this every time - it gets long!
# result = await crawler.arun(
#     url="https://example.com",
#     take_screenshot=True,
#     ignore_cache=True,
#     only_look_at_this_part="#main-content",
#     wait_for_this_element="#data-table",
#     # ... maybe many more settings ...
# )
```

That's where `CrawlerRunConfig` comes in!

## What Problem Does `CrawlerRunConfig` Solve?

Think of `CrawlerRunConfig` as the **Instruction Manual** for a *specific* crawl job. Instead of giving the `AsyncWebCrawler` manager lots of separate instructions each time, you bundle them all neatly into a single `CrawlerRunConfig` object.

This object tells the `AsyncWebCrawler` exactly *how* to handle a particular URL or set of URLs for that specific run. It makes your code cleaner and easier to manage.
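With `CrawlerRunConfig`, those same wishes are bundled into one object and handed over in a single `config=` argument. Roughly, it looks like the sketch below (the real parameter names are `screenshot`, `cache_mode`, `css_selector`, and `wait_for`; a full runnable example follows later in this chapter):

```python
from crawl4ai import CrawlerRunConfig, CacheMode

# The same wishes as the messy call above, bundled into one "instruction manual"
my_instructions = CrawlerRunConfig(
    screenshot=True,               # take a picture of the page
    cache_mode=CacheMode.BYPASS,   # ignore the cache, fetch fresh
    css_selector="#main-content",  # only keep this part of the page
    wait_for="#data-table",        # wait until this element appears
)

# Later, inside an AsyncWebCrawler context:
# result = await crawler.arun(url="https://example.com", config=my_instructions)
```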
## What is `CrawlerRunConfig`?

`CrawlerRunConfig` is a configuration class that holds all the settings for a single crawl operation initiated by `AsyncWebCrawler.arun()` or `arun_many()`.

It allows you to customize various aspects of the crawl, such as:

* **Taking Screenshots:** Should the crawler capture an image of the page? (`screenshot`)
* **Waiting:** How long should the crawler wait for the page or specific elements to load? (`page_timeout`, `wait_for`)
* **Focusing Content:** Should the crawler only process a specific part of the page? (`css_selector`)
* **Extracting Data:** Should the crawler use a specific method to pull out structured data? ([ExtractionStrategy](06_extractionstrategy.md))
* **Caching:** How should the crawler interact with previously saved results? ([CacheMode](09_cachecontext___cachemode.md))
* **And much more!** (like handling JavaScript, filtering links, etc.)
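Each of these options is just a constructor argument with a default value, so a config can be as small or as detailed as a job needs. A minimal sketch (the defaults themselves are discussed later in this chapter):

```python
from crawl4ai import CrawlerRunConfig

# An empty config: every setting keeps its default value
default_job = CrawlerRunConfig()

# A "give me a picture and a PDF of this page" job
snapshot_job = CrawlerRunConfig(screenshot=True, pdf=True)

# A "just the article body, and don't take forever" job
article_job = CrawlerRunConfig(css_selector=".article-body", page_timeout=30000)
```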
## Using `CrawlerRunConfig`

Let's see how to use it. Remember our basic crawl from Chapter 2?

```python
# chapter3_example_1.py
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        url_to_crawl = "https://httpbin.org/html"
        print(f"Crawling {url_to_crawl} with default settings...")

        # This uses the default behavior (no specific config)
        result = await crawler.arun(url=url_to_crawl)

        if result.success:
            print("Success! Got the content.")
            print(f"Screenshot taken? {'Yes' if result.screenshot else 'No'}")  # Likely No
            # We'll learn about CacheMode later, but it defaults to using the cache
        else:
            print(f"Failed: {result.error_message}")

if __name__ == "__main__":
    asyncio.run(main())
```
Now, let's say for this *specific* crawl, we want to bypass the cache (fetch fresh) and also take a screenshot.

We create a `CrawlerRunConfig` instance and pass it to `arun`:

```python
# chapter3_example_2.py
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai import CrawlerRunConfig  # 1. Import the config class
from crawl4ai import CacheMode  # Import cache options

async def main():
    async with AsyncWebCrawler() as crawler:
        url_to_crawl = "https://httpbin.org/html"
        print(f"Crawling {url_to_crawl} with custom settings...")

        # 2. Create an instance of CrawlerRunConfig with our desired settings
        my_instructions = CrawlerRunConfig(
            cache_mode=CacheMode.BYPASS,  # Don't use the cache, fetch fresh
            screenshot=True  # Take a screenshot
        )
        print("Instructions: Bypass cache, take screenshot.")

        # 3. Pass the config object to arun()
        result = await crawler.arun(
            url=url_to_crawl,
            config=my_instructions  # Pass our instruction manual
        )

        if result.success:
            print("\nSuccess! Got the content with custom config.")
            print(f"Screenshot taken? {'Yes' if result.screenshot else 'No'}")  # Should be Yes
            # result.screenshot holds the captured screenshot data (it can be large, so don't print it all)
            if result.screenshot:
                print(f"Screenshot captured ({len(result.screenshot)} characters of data).")
        else:
            print(f"\nFailed: {result.error_message}")

if __name__ == "__main__":
    asyncio.run(main())
```
**Explanation:**

1. **Import:** We import `CrawlerRunConfig` and `CacheMode`.
2. **Create Config:** We create an instance: `my_instructions = CrawlerRunConfig(...)`. We set `cache_mode` to `CacheMode.BYPASS` and `screenshot` to `True`. All other settings remain at their defaults.
3. **Pass Config:** We pass this `my_instructions` object to `crawler.arun` using the `config=` parameter.

Now, when `AsyncWebCrawler` runs this job, it will look inside `my_instructions` and follow those specific settings for *this run only*.
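Because the settings live in the config object rather than in the crawler itself, you can reuse one crawler with a different instruction manual for each run. A small sketch of that idea, using the same test page as above:

```python
# One crawler, two different instruction manuals
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

async def main():
    async with AsyncWebCrawler() as crawler:
        fresh_with_screenshot = CrawlerRunConfig(cache_mode=CacheMode.BYPASS, screenshot=True)
        cached_heading_only = CrawlerRunConfig(cache_mode=CacheMode.ENABLED, css_selector="h1")

        # Each call follows only the config passed to it
        result_a = await crawler.arun(url="https://httpbin.org/html", config=fresh_with_screenshot)
        result_b = await crawler.arun(url="https://httpbin.org/html", config=cached_heading_only)

        print(f"Run A screenshot? {'Yes' if result_a.screenshot else 'No'}")
        print(f"Run B succeeded? {result_b.success}")

if __name__ == "__main__":
    asyncio.run(main())
```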
## Some Common `CrawlerRunConfig` Parameters

`CrawlerRunConfig` has many options, but here are a few common ones you might use:

* **`cache_mode`**: Controls caching behavior.
  * `CacheMode.ENABLED` (Default): Use the cache if available, otherwise fetch and save.
  * `CacheMode.BYPASS`: Always fetch fresh, ignoring any cached version (but still save the new result).
  * `CacheMode.DISABLED`: Never read from or write to the cache.
  * *(More details in [Chapter 9: Smart Fetching with Caching - CacheContext / CacheMode](09_cachecontext___cachemode.md))*
* **`screenshot` (bool)**: If `True`, takes a screenshot of the fully rendered page. The captured screenshot data will be available in `CrawlResult.screenshot`. Default: `False`.
* **`pdf` (bool)**: If `True`, generates a PDF of the page. The PDF data will be available in `CrawlResult.pdf`. Default: `False`.
* **`css_selector` (str)**: If provided (e.g., `"#main-content"` or `".article-body"`), the crawler will try to extract *only* the HTML content within the element(s) matching this CSS selector. This is great for focusing on the important part of a page. Default: `None` (process the whole page).
* **`wait_for` (str)**: A CSS selector (e.g., `"#data-loaded-indicator"`). The crawler will wait until an element matching this selector appears on the page before proceeding. Useful for pages that load content dynamically with JavaScript. Default: `None`.
* **`page_timeout` (int)**: Maximum time in milliseconds to wait for page navigation or certain operations. Default: `60000` (60 seconds).
* **`extraction_strategy`**: An object that defines how to extract specific, structured data (like product names and prices) from the page. Default: `None`. *(See [Chapter 6: Getting Specific Data - ExtractionStrategy](06_extractionstrategy.md))*
* **`scraping_strategy`**: An object defining how the raw HTML is cleaned and basic content (like text and links) is extracted. Default: `WebScrapingStrategy()`. *(See [Chapter 4: Cleaning Up the Mess - ContentScrapingStrategy](04_contentscrapingstrategy.md))*
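For instance, `wait_for` and `pdf` pair naturally on pages that build their content with JavaScript. The sketch below uses a placeholder URL and selector (not from this tutorial's examples), so adapt them to a real page before running it:

```python
# chapter3_sketch_wait_and_pdf.py -- illustrative sketch, placeholder URL and selector
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    url = "https://example.com/dashboard"  # placeholder: a page that renders content with JavaScript

    config = CrawlerRunConfig(
        wait_for="#chart-ready",  # placeholder selector: wait until this element appears
        pdf=True,                 # then capture the rendered page as a PDF
        page_timeout=30000        # give up after 30 seconds
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url, config=config)
        if result.success:
            print(f"PDF captured? {'Yes' if result.pdf else 'No'}")
        else:
            print(f"Failed: {result.error_message}")

if __name__ == "__main__":
    asyncio.run(main())
```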
Let's try combining a few: focus on a specific part of the page and set a shorter timeout.

```python
# chapter3_example_3.py
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    # This example page has an <h1> heading inside its <body>.
    url_to_crawl = "https://httpbin.org/html"

    async with AsyncWebCrawler() as crawler:
        print(f"Crawling {url_to_crawl}, focusing on the H1 tag...")

        # Instructions: only grab the <h1> content, and allow at most 10 seconds for the page
        specific_config = CrawlerRunConfig(
            css_selector="h1",  # Only grab content inside <h1> tags
            page_timeout=10000  # Set the page timeout to 10 seconds
            # We could also add wait_for="h1" if the content loaded dynamically
        )

        result = await crawler.arun(url=url_to_crawl, config=specific_config)

        if result.success:
            print("\nSuccess! Focused crawl completed.")
            # The markdown should now ONLY contain the H1 content
            print(f"Markdown content:\n---\n{result.markdown.raw_markdown.strip()}\n---")
        else:
            print(f"\nFailed: {result.error_message}")

if __name__ == "__main__":
    asyncio.run(main())
```
This time, the `result.markdown` should only contain the text from the `<h1>` tag on that page, because we used `css_selector="h1"` in our `CrawlerRunConfig`.

## How `AsyncWebCrawler` Uses the Config (Under the Hood)

You don't need to know the exact internal code, but it helps to understand the flow. When you call `crawler.arun(url, config=my_config)`, the `AsyncWebCrawler` essentially does this:

1. Receives the `url` and the `my_config` object.
2. Before fetching, it checks `my_config.cache_mode` to see if it should look in the cache first.
3. If fetching is needed, it passes `my_config` to the underlying [AsyncCrawlerStrategy](01_asynccrawlerstrategy.md).
4. The strategy uses settings from `my_config` like `page_timeout`, `wait_for`, and whether to take a `screenshot`.
5. After getting the raw HTML, `AsyncWebCrawler` uses the `my_config.scraping_strategy` and `my_config.css_selector` to process the content.
6. If `my_config.extraction_strategy` is set, it uses that to extract structured data.
7. Finally, it bundles everything into a `CrawlResult` and returns it.
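In illustrative pseudocode (this is not the real crawl4ai source, and the helper names below are invented for the sketch), that flow reads roughly like this:

```python
# Illustrative pseudocode only -- the helper names here are invented, not real crawl4ai APIs
async def arun(url, my_config):
    # 2. Consult the caching policy carried by the config
    if allows_cache_read(my_config.cache_mode):         # hypothetical helper
        cached = look_up_cache(url)                     # hypothetical helper
        if cached:
            return cached

    # 3-4. Fetch; the strategy honours page_timeout, wait_for, screenshot, ...
    raw = await fetch_with_strategy(url, my_config)     # hypothetical helper

    # 5. Clean the HTML, narrowing to my_config.css_selector if one was given
    content = clean_html(raw, my_config)                # hypothetical helper

    # 6. Optionally pull out structured data
    extracted = None
    if my_config.extraction_strategy:
        extracted = extract_data(content, my_config)    # hypothetical helper

    # 7. Bundle everything into a CrawlResult and return it
    return build_crawl_result(url, raw, content, extracted)  # hypothetical helper
```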
Here's a simplified view:

```mermaid
sequenceDiagram
    participant User
    participant AWC as AsyncWebCrawler
    participant Config as CrawlerRunConfig
    participant Fetcher as AsyncCrawlerStrategy
    participant Processor as Scraping/Extraction

    User->>AWC: arun(url, config=my_config)
    AWC->>Config: Check my_config.cache_mode
    alt Need to Fetch
        AWC->>Fetcher: crawl(url, config=my_config)
        Note over Fetcher: Uses my_config settings (timeout, wait_for, screenshot...)
        Fetcher-->>AWC: Raw Response (HTML, screenshot?)
        AWC->>Processor: Process HTML (using my_config.css_selector, my_config.extraction_strategy...)
        Processor-->>AWC: Processed Data
    else Use Cache
        AWC->>AWC: Retrieve from Cache
    end
    AWC-->>User: Return CrawlResult
```
The `CrawlerRunConfig` acts as a messenger carrying your specific instructions throughout the crawling process.

Inside the `crawl4ai` library, in the file `async_configs.py`, you'll find the definition of the `CrawlerRunConfig` class. It looks something like this (simplified):
```python
# Simplified from crawl4ai/async_configs.py

from .cache_context import CacheMode
from .extraction_strategy import ExtractionStrategy
from .content_scraping_strategy import ContentScrapingStrategy, WebScrapingStrategy
# ... other imports ...

class CrawlerRunConfig():
    """
    Configuration class for controlling how the crawler runs each crawl operation.
    """
    def __init__(
        self,
        # Caching
        cache_mode: CacheMode = CacheMode.BYPASS,  # Default behavior if not specified

        # Content Selection / Waiting
        css_selector: str = None,
        wait_for: str = None,
        page_timeout: int = 60000,  # 60 seconds

        # Media
        screenshot: bool = False,
        pdf: bool = False,

        # Processing Strategies
        scraping_strategy: ContentScrapingStrategy = None,  # Defaults internally if None
        extraction_strategy: ExtractionStrategy = None,

        # ... many other parameters omitted for clarity ...
        **kwargs  # Allows for flexibility
    ):
        self.cache_mode = cache_mode
        self.css_selector = css_selector
        self.wait_for = wait_for
        self.page_timeout = page_timeout
        self.screenshot = screenshot
        self.pdf = pdf
        # Assign scraping strategy, ensuring a default if None is provided
        self.scraping_strategy = scraping_strategy or WebScrapingStrategy()
        self.extraction_strategy = extraction_strategy
        # ... initialize other attributes ...

    # Helper methods like 'clone', 'to_dict', 'from_kwargs' might exist too
    # ...
```
The key idea is that it's a class designed to hold various settings together. When you create an instance `CrawlerRunConfig(...)`, you're essentially creating an object that stores your choices for these parameters.
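If the `clone` helper mentioned in the snippet is available in your version of `crawl4ai`, it offers a convenient way to derive a variation of an existing config without repeating every setting. A minimal sketch, assuming `clone()` accepts keyword overrides (check your installed version):

```python
from crawl4ai import CrawlerRunConfig, CacheMode

base_config = CrawlerRunConfig(css_selector="#main-content", page_timeout=30000)

# Assumption: clone() exists and accepts keyword overrides for the fields to change
screenshot_variant = base_config.clone(screenshot=True, cache_mode=CacheMode.BYPASS)
```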
## Conclusion

You've learned about `CrawlerRunConfig`, the "Instruction Manual" for individual crawl jobs in Crawl4AI!

* It solves the problem of passing many settings individually to `AsyncWebCrawler`.
* You create an instance of `CrawlerRunConfig` and set the parameters you want to customize (like `cache_mode`, `screenshot`, `css_selector`, `wait_for`).
* You pass this config object to `crawler.arun(url, config=your_config)`.
* This makes your code cleaner and gives you fine-grained control over *how* each crawl is performed.

Now that we know how to fetch content ([AsyncCrawlerStrategy](01_asynccrawlerstrategy.md)), manage the overall process ([AsyncWebCrawler](02_asyncwebcrawler.md)), and give specific instructions ([CrawlerRunConfig](03_crawlerrunconfig.md)), let's look at how the raw, messy HTML fetched from the web is initially cleaned up and processed.

**Next:** Let's explore [Chapter 4: Cleaning Up the Mess - ContentScrapingStrategy](04_contentscrapingstrategy.md).

---

Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge)