# Chapter 3: Giving Instructions - CrawlerRunConfig

In [Chapter 2: Meet the General Manager - AsyncWebCrawler](02_asyncwebcrawler.md), we met the `AsyncWebCrawler`, the central coordinator for our web crawling tasks. We saw how to tell it *what* URL to crawl using the `arun` method.

But what if we want to tell the crawler *how* to crawl that URL? Maybe we want it to take a picture (screenshot) of the page? Or perhaps we only care about a specific section of the page? Or maybe we want to ignore the cache and get the very latest version?

Passing all these different instructions individually every time we call `arun` could get complicated and messy.

```python
# Imagine doing this every time - it gets long!
# result = await crawler.arun(
#     url="https://example.com",
#     take_screenshot=True,
#     ignore_cache=True,
#     only_look_at_this_part="#main-content",
#     wait_for_this_element="#data-table",
#     # ... maybe many more settings ...
# )
```

That's where `CrawlerRunConfig` comes in!
## What Problem Does `CrawlerRunConfig` Solve?

Think of `CrawlerRunConfig` as the **Instruction Manual** for a *specific* crawl job. Instead of giving the `AsyncWebCrawler` manager lots of separate instructions each time, you bundle them all neatly into a single `CrawlerRunConfig` object.

This object tells the `AsyncWebCrawler` exactly *how* to handle a particular URL or set of URLs for that specific run. It makes your code cleaner and easier to manage.
## What is `CrawlerRunConfig`?

`CrawlerRunConfig` is a configuration class that holds all the settings for a single crawl operation initiated by `AsyncWebCrawler.arun()` or `arun_many()`.

It allows you to customize various aspects of the crawl, such as:

* **Taking Screenshots:** Should the crawler capture an image of the page? (`screenshot`)
* **Waiting:** How long should the crawler wait for the page or specific elements to load? (`page_timeout`, `wait_for`)
* **Focusing Content:** Should the crawler only process a specific part of the page? (`css_selector`)
* **Extracting Data:** Should the crawler use a specific method to pull out structured data? ([ExtractionStrategy](06_extractionstrategy.md))
* **Caching:** How should the crawler interact with previously saved results? ([CacheMode](09_cachecontext___cachemode.md))
* **And much more!** (like handling JavaScript, filtering links, etc.) Several of these options are combined in the short sketch right after this list.
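
Each of these options corresponds to a keyword parameter on the `CrawlerRunConfig` constructor. As a rough sketch, reusing the illustrative selectors from the pseudo-code earlier, bundling several of them into a single object might look like this:

```python
# Sketch: bundling several options into one CrawlerRunConfig object.
# The selectors are placeholders borrowed from the pseudo-code above.
from crawl4ai import CrawlerRunConfig, CacheMode

my_config = CrawlerRunConfig(
    screenshot=True,               # capture an image of the rendered page
    wait_for="#data-table",        # wait until this element appears
    css_selector="#main-content",  # keep only this part of the page
    cache_mode=CacheMode.BYPASS,   # fetch fresh instead of using the cache
    page_timeout=60000,            # give the page up to 60 seconds to load
)
```

The next section walks through actually handing such an object to the crawler.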
## Using `CrawlerRunConfig`

Let's see how to use it. Remember our basic crawl from Chapter 2?

```python
# chapter3_example_1.py
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        url_to_crawl = "https://httpbin.org/html"
        print(f"Crawling {url_to_crawl} with default settings...")

        # This uses the default behavior (no specific config)
        result = await crawler.arun(url=url_to_crawl)

        if result.success:
            print("Success! Got the content.")
            print(f"Screenshot taken? {'Yes' if result.screenshot else 'No'}") # Likely No
            # We'll learn about CacheMode later, but it defaults to using the cache
        else:
            print(f"Failed: {result.error_message}")

if __name__ == "__main__":
    asyncio.run(main())
```
Now, let's say for this *specific* crawl, we want to bypass the cache (fetch fresh) and also take a screenshot.

We create a `CrawlerRunConfig` instance and pass it to `arun`:

```python
# chapter3_example_2.py
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai import CrawlerRunConfig # 1. Import the config class
from crawl4ai import CacheMode # Import cache options

async def main():
    async with AsyncWebCrawler() as crawler:
        url_to_crawl = "https://httpbin.org/html"
        print(f"Crawling {url_to_crawl} with custom settings...")

        # 2. Create an instance of CrawlerRunConfig with our desired settings
        my_instructions = CrawlerRunConfig(
            cache_mode=CacheMode.BYPASS, # Don't use the cache, fetch fresh
            screenshot=True              # Take a screenshot
        )
        print("Instructions: Bypass cache, take screenshot.")

        # 3. Pass the config object to arun()
        result = await crawler.arun(
            url=url_to_crawl,
            config=my_instructions # Pass our instruction manual
        )

        if result.success:
            print("\nSuccess! Got the content with custom config.")
            print(f"Screenshot taken? {'Yes' if result.screenshot else 'No'}") # Should be Yes
            # Check if the screenshot file path exists in result.screenshot
            if result.screenshot:
                print(f"Screenshot saved to: {result.screenshot}")
        else:
            print(f"\nFailed: {result.error_message}")

if __name__ == "__main__":
    asyncio.run(main())
```
**Explanation:**

1. **Import:** We import `CrawlerRunConfig` and `CacheMode`.
2. **Create Config:** We create an instance: `my_instructions = CrawlerRunConfig(...)`. We set `cache_mode` to `CacheMode.BYPASS` and `screenshot` to `True`. All other settings remain at their defaults.
3. **Pass Config:** We pass this `my_instructions` object to `crawler.arun` using the `config=` parameter.

Now, when `AsyncWebCrawler` runs this job, it will look inside `my_instructions` and follow those specific settings for *this run only*.
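
Because the config is just a plain object, the same "instruction manual" can also be reused across several pages, for example with the `arun_many()` method mentioned earlier. Here is a minimal sketch, assuming `arun_many()` accepts a list of URLs plus the same `config=` parameter and returns one result per URL:

```python
# Sketch: reuse one CrawlerRunConfig across several URLs.
# Assumes arun_many() takes a list of URLs and the same config= parameter,
# returning a result object per URL.
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

async def main():
    shared_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS, screenshot=True)
    urls = ["https://httpbin.org/html", "https://example.com"]

    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many(urls=urls, config=shared_config)
        for result in results:
            print(result.url, "->", "ok" if result.success else result.error_message)

if __name__ == "__main__":
    asyncio.run(main())
```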
## Some Common `CrawlerRunConfig` Parameters

`CrawlerRunConfig` has many options, but here are a few common ones you might use:

* **`cache_mode`**: Controls caching behavior.
    * `CacheMode.ENABLED` (Default): Use the cache if available, otherwise fetch and save.
    * `CacheMode.BYPASS`: Always fetch fresh, ignoring any cached version (but still save the new result).
    * `CacheMode.DISABLED`: Never read from or write to the cache.
    * *(More details in [Chapter 9: Smart Fetching with Caching - CacheContext / CacheMode](09_cachecontext___cachemode.md))*
* **`screenshot` (bool)**: If `True`, takes a screenshot of the fully rendered page. The path to the screenshot file will be in `CrawlResult.screenshot`. Default: `False`.
* **`pdf` (bool)**: If `True`, generates a PDF of the page. The path to the PDF file will be in `CrawlResult.pdf`. Default: `False`.
* **`css_selector` (str)**: If provided (e.g., `"#main-content"` or `".article-body"`), the crawler will try to extract *only* the HTML content within the element(s) matching this CSS selector. This is great for focusing on the important part of a page. Default: `None` (process the whole page).
* **`wait_for` (str)**: A CSS selector (e.g., `"#data-loaded-indicator"`). The crawler will wait until an element matching this selector appears on the page before proceeding. Useful for pages that load content dynamically with JavaScript. Default: `None`.
* **`page_timeout` (int)**: Maximum time in milliseconds to wait for page navigation or certain operations. Default: `60000` (60 seconds).
* **`extraction_strategy`**: An object that defines how to extract specific, structured data (like product names and prices) from the page. Default: `None`. *(See [Chapter 6: Getting Specific Data - ExtractionStrategy](06_extractionstrategy.md))*
* **`scraping_strategy`**: An object defining how the raw HTML is cleaned and basic content (like text and links) is extracted. Default: `WebScrapingStrategy()`. *(See [Chapter 4: Cleaning Up the Mess - ContentScrapingStrategy](04_contentscrapingstrategy.md))*

Let's try combining a few: focus on a specific part of the page and wait for something to appear.
```python
# chapter3_example_3.py
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    # This example page has an <h1> heading inside its <body>.
    url_to_crawl = "https://httpbin.org/html"
    async with AsyncWebCrawler() as crawler:
        print(f"Crawling {url_to_crawl}, focusing on the H1 tag...")

        # Instructions: only grab the <h1> tag, and give the page at most 10 seconds
        specific_config = CrawlerRunConfig(
            css_selector="h1",  # Only grab content inside <h1> tags
            page_timeout=10000  # Set page timeout to 10 seconds
            # We could also add wait_for="h1" if needed for dynamic loading
        )

        result = await crawler.arun(url=url_to_crawl, config=specific_config)

        if result.success:
            print("\nSuccess! Focused crawl completed.")
            # The markdown should now ONLY contain the H1 content
            print(f"Markdown content:\n---\n{result.markdown.raw_markdown.strip()}\n---")
        else:
            print(f"\nFailed: {result.error_message}")

if __name__ == "__main__":
    asyncio.run(main())
```

This time, the `result.markdown` should only contain the text from the `<h1>` tag on that page, because we used `css_selector="h1"` in our `CrawlerRunConfig`.
## How `AsyncWebCrawler` Uses the Config (Under the Hood)

You don't need to know the exact internal code, but it helps to understand the flow. When you call `crawler.arun(url, config=my_config)`, the `AsyncWebCrawler` essentially does this:

1. Receives the `url` and the `my_config` object.
2. Before fetching, it checks `my_config.cache_mode` to see if it should look in the cache first.
3. If fetching is needed, it passes `my_config` to the underlying [AsyncCrawlerStrategy](01_asynccrawlerstrategy.md).
4. The strategy uses settings from `my_config` like `page_timeout`, `wait_for`, and whether to take a `screenshot`.
5. After getting the raw HTML, `AsyncWebCrawler` uses the `my_config.scraping_strategy` and `my_config.css_selector` to process the content.
6. If `my_config.extraction_strategy` is set, it uses that to extract structured data.
7. Finally, it bundles everything into a `CrawlResult` and returns it.

Here's a simplified view:
```mermaid
sequenceDiagram
    participant User
    participant AWC as AsyncWebCrawler
    participant Config as CrawlerRunConfig
    participant Fetcher as AsyncCrawlerStrategy
    participant Processor as Scraping/Extraction

    User->>AWC: arun(url, config=my_config)
    AWC->>Config: Check my_config.cache_mode
    alt Need to Fetch
        AWC->>Fetcher: crawl(url, config=my_config)
        Note over Fetcher: Uses my_config settings (timeout, wait_for, screenshot...)
        Fetcher-->>AWC: Raw Response (HTML, screenshot?)
        AWC->>Processor: Process HTML (using my_config.css_selector, my_config.extraction_strategy...)
        Processor-->>AWC: Processed Data
    else Use Cache
        AWC->>AWC: Retrieve from Cache
    end
    AWC-->>User: Return CrawlResult
```

The `CrawlerRunConfig` acts as a messenger carrying your specific instructions throughout the crawling process.

Inside the `crawl4ai` library, in the file `async_configs.py`, you'll find the definition of the `CrawlerRunConfig` class. It looks something like this (simplified):
```python
# Simplified from crawl4ai/async_configs.py

from .cache_context import CacheMode
from .extraction_strategy import ExtractionStrategy
from .content_scraping_strategy import ContentScrapingStrategy, WebScrapingStrategy
# ... other imports ...

class CrawlerRunConfig():
    """
    Configuration class for controlling how the crawler runs each crawl operation.
    """
    def __init__(
        self,
        # Caching
        cache_mode: CacheMode = CacheMode.BYPASS, # Default behavior if not specified

        # Content Selection / Waiting
        css_selector: str = None,
        wait_for: str = None,
        page_timeout: int = 60000, # 60 seconds

        # Media
        screenshot: bool = False,
        pdf: bool = False,

        # Processing Strategies
        scraping_strategy: ContentScrapingStrategy = None, # Defaults internally if None
        extraction_strategy: ExtractionStrategy = None,

        # ... many other parameters omitted for clarity ...
        **kwargs # Allows for flexibility
    ):
        self.cache_mode = cache_mode
        self.css_selector = css_selector
        self.wait_for = wait_for
        self.page_timeout = page_timeout
        self.screenshot = screenshot
        self.pdf = pdf
        # Assign scraping strategy, ensuring a default if None is provided
        self.scraping_strategy = scraping_strategy or WebScrapingStrategy()
        self.extraction_strategy = extraction_strategy
        # ... initialize other attributes ...

    # Helper methods like 'clone', 'to_dict', 'from_kwargs' might exist too
    # ...
```

The key idea is that it's a class designed to hold various settings together. When you create an instance `CrawlerRunConfig(...)`, you're essentially creating an object that stores your choices for these parameters.
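
If the `clone()` helper mentioned in that comment is available in your version, one convenient pattern is to keep a base config and derive per-job variants from it. A minimal sketch, assuming `clone()` returns a copy and accepts keyword overrides:

```python
# Sketch only: assumes CrawlerRunConfig.clone() exists and accepts keyword
# overrides, as hinted at by the "helper methods" comment above.
from crawl4ai import CrawlerRunConfig, CacheMode

base_config = CrawlerRunConfig(
    page_timeout=30000,            # 30 seconds for every job
    cache_mode=CacheMode.ENABLED,  # reuse cached pages by default
)

# Derive a one-off variant without touching the base config
screenshot_config = base_config.clone(screenshot=True, cache_mode=CacheMode.BYPASS)

print(base_config.screenshot)        # False: the base is unchanged
print(screenshot_config.screenshot)  # True
```

If `clone()` is not available, simply constructing a second `CrawlerRunConfig` with the settings you need achieves the same effect.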
## Conclusion

You've learned about `CrawlerRunConfig`, the "Instruction Manual" for individual crawl jobs in Crawl4AI!

* It solves the problem of passing many settings individually to `AsyncWebCrawler`.
* You create an instance of `CrawlerRunConfig` and set the parameters you want to customize (like `cache_mode`, `screenshot`, `css_selector`, `wait_for`).
* You pass this config object to `crawler.arun(url, config=your_config)`.
* This makes your code cleaner and gives you fine-grained control over *how* each crawl is performed.

Now that we know how to fetch content ([AsyncCrawlerStrategy](01_asynccrawlerstrategy.md)), manage the overall process ([AsyncWebCrawler](02_asyncwebcrawler.md)), and give specific instructions ([CrawlerRunConfig](03_crawlerrunconfig.md)), let's look at how the raw, messy HTML fetched from the web is initially cleaned up and processed.

**Next:** Let's explore [Chapter 4: Cleaning Up the Mess - ContentScrapingStrategy](04_contentscrapingstrategy.md).

---

Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge)