# Chapter 3: Giving Instructions - CrawlerRunConfig

In [Chapter 2: Meet the General Manager - AsyncWebCrawler](02_asyncwebcrawler.md), we met the `AsyncWebCrawler`, the central coordinator for our web crawling tasks. We saw how to tell it *what* URL to crawl using the `arun` method.

But what if we want to tell the crawler *how* to crawl that URL? Maybe we want it to take a picture (screenshot) of the page? Or perhaps we only care about a specific section of the page? Or maybe we want to ignore the cache and get the very latest version?

Passing all these different instructions individually every time we call `arun` could get complicated and messy.

```python
# Imagine doing this every time - it gets long!
# result = await crawler.arun(
#     url="https://example.com",
#     take_screenshot=True,
#     ignore_cache=True,
#     only_look_at_this_part="#main-content",
#     wait_for_this_element="#data-table",
#     # ... maybe many more settings ...
# )
```

That's where `CrawlerRunConfig` comes in!

## What Problem Does `CrawlerRunConfig` Solve?

Think of `CrawlerRunConfig` as the **Instruction Manual** for a *specific* crawl job. Instead of giving the `AsyncWebCrawler` manager lots of separate instructions each time, you bundle them all neatly into a single `CrawlerRunConfig` object.

This object tells the `AsyncWebCrawler` exactly *how* to handle a particular URL or set of URLs for that specific run. It makes your code cleaner and easier to manage.

## What is `CrawlerRunConfig`?

`CrawlerRunConfig` is a configuration class that holds all the settings for a single crawl operation initiated by `AsyncWebCrawler.arun()` or `arun_many()`. It allows you to customize various aspects of the crawl, such as:

* **Taking Screenshots:** Should the crawler capture an image of the page? (`screenshot`)
* **Waiting:** How long should the crawler wait for the page or specific elements to load? (`page_timeout`, `wait_for`)
* **Focusing Content:** Should the crawler only process a specific part of the page? (`css_selector`)
* **Extracting Data:** Should the crawler use a specific method to pull out structured data? ([ExtractionStrategy](06_extractionstrategy.md))
* **Caching:** How should the crawler interact with previously saved results? ([CacheMode](09_cachecontext___cachemode.md))
* **And much more!** (like handling JavaScript, filtering links, etc.)

## Using `CrawlerRunConfig`

Let's see how to use it. Remember our basic crawl from Chapter 2?

```python
# chapter3_example_1.py
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        url_to_crawl = "https://httpbin.org/html"
        print(f"Crawling {url_to_crawl} with default settings...")

        # This uses the default behavior (no specific config)
        result = await crawler.arun(url=url_to_crawl)

        if result.success:
            print("Success! Got the content.")
            print(f"Screenshot taken? {'Yes' if result.screenshot else 'No'}")  # Likely No
            # We'll learn about CacheMode later, but it defaults to using the cache
        else:
            print(f"Failed: {result.error_message}")

if __name__ == "__main__":
    asyncio.run(main())
```

Now, let's say for this *specific* crawl, we want to bypass the cache (fetch fresh) and also take a screenshot. We create a `CrawlerRunConfig` instance and pass it to `arun`:

```python
# chapter3_example_2.py
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai import CrawlerRunConfig  # 1. Import the config class
from crawl4ai import CacheMode         # Import cache options

async def main():
    async with AsyncWebCrawler() as crawler:
        url_to_crawl = "https://httpbin.org/html"
        print(f"Crawling {url_to_crawl} with custom settings...")

        # 2. Create an instance of CrawlerRunConfig with our desired settings
        my_instructions = CrawlerRunConfig(
            cache_mode=CacheMode.BYPASS,  # Don't use the cache, fetch fresh
            screenshot=True               # Take a screenshot
        )
        print("Instructions: Bypass cache, take screenshot.")

        # 3. Pass the config object to arun()
        result = await crawler.arun(
            url=url_to_crawl,
            config=my_instructions  # Pass our instruction manual
        )

        if result.success:
            print("\nSuccess! Got the content with custom config.")
            print(f"Screenshot taken? {'Yes' if result.screenshot else 'No'}")  # Should be Yes
            # Check if the screenshot file path exists in result.screenshot
            if result.screenshot:
                print(f"Screenshot saved to: {result.screenshot}")
        else:
            print(f"\nFailed: {result.error_message}")

if __name__ == "__main__":
    asyncio.run(main())
```

**Explanation:**

1. **Import:** We import `CrawlerRunConfig` and `CacheMode`.
2. **Create Config:** We create an instance: `my_instructions = CrawlerRunConfig(...)`. We set `cache_mode` to `CacheMode.BYPASS` and `screenshot` to `True`. All other settings remain at their defaults.
3. **Pass Config:** We pass this `my_instructions` object to `crawler.arun` using the `config=` parameter.

Now, when `AsyncWebCrawler` runs this job, it will look inside `my_instructions` and follow those specific settings for *this run only*.
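Because the config is just an ordinary Python object, you can also build it once and reuse the same instruction manual for several crawls. Here is a minimal sketch of that idea, assuming `arun_many()` accepts the same `config=` parameter as `arun()` (as suggested above) and returns one result per URL in input order; the URLs and file name are placeholders:

```python
# reuse_config_sketch.py (hypothetical example)
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

async def main():
    # One instruction manual, reused for every URL in the batch.
    shared_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,  # always fetch fresh
        screenshot=True               # take a screenshot of each page
    )

    urls = ["https://httpbin.org/html", "https://example.com"]  # placeholder URLs

    async with AsyncWebCrawler() as crawler:
        # Assumption: arun_many() takes the same config= parameter as arun()
        # and returns results in the same order as the input URLs.
        results = await crawler.arun_many(urls=urls, config=shared_config)
        for url, result in zip(urls, results):
            print(f"{url}: {'OK' if result.success else result.error_message}")

if __name__ == "__main__":
    asyncio.run(main())
```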
## Some Common `CrawlerRunConfig` Parameters

`CrawlerRunConfig` has many options, but here are a few common ones you might use:

* **`cache_mode`**: Controls caching behavior.
    * `CacheMode.ENABLED` (Default): Use the cache if available, otherwise fetch and save.
    * `CacheMode.BYPASS`: Always fetch fresh, ignoring any cached version (but still save the new result).
    * `CacheMode.DISABLED`: Never read from or write to the cache.
    * *(More details in [Chapter 9: Smart Fetching with Caching - CacheContext / CacheMode](09_cachecontext___cachemode.md))*
* **`screenshot` (bool)**: If `True`, takes a screenshot of the fully rendered page. The path to the screenshot file will be in `CrawlResult.screenshot`. Default: `False`.
* **`pdf` (bool)**: If `True`, generates a PDF of the page. The path to the PDF file will be in `CrawlResult.pdf`. Default: `False`. (A short sketch using this parameter follows the list.)
* **`css_selector` (str)**: If provided (e.g., `"#main-content"` or `".article-body"`), the crawler will try to extract *only* the HTML content within the element(s) matching this CSS selector. This is great for focusing on the important part of a page. Default: `None` (process the whole page).
* **`wait_for` (str)**: A CSS selector (e.g., `"#data-loaded-indicator"`). The crawler will wait until an element matching this selector appears on the page before proceeding. Useful for pages that load content dynamically with JavaScript. Default: `None`.
* **`page_timeout` (int)**: Maximum time in milliseconds to wait for page navigation or certain operations. Default: `60000` (60 seconds).
* **`extraction_strategy`**: An object that defines how to extract specific, structured data (like product names and prices) from the page. Default: `None`. *(See [Chapter 6: Getting Specific Data - ExtractionStrategy](06_extractionstrategy.md))*
* **`scraping_strategy`**: An object defining how the raw HTML is cleaned and basic content (like text and links) is extracted. Default: `WebScrapingStrategy()`. *(See [Chapter 4: Cleaning Up the Mess - ContentScrapingStrategy](04_contentscrapingstrategy.md))*
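To illustrate two parameters from the list that the earlier examples didn't touch, here is a small, hedged sketch that asks for a PDF and shortens the page timeout. The file name is made up, and the result handling simply follows the `CrawlResult.pdf` behaviour described above:

```python
# pdf_timeout_sketch.py (hypothetical example)
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    async with AsyncWebCrawler() as crawler:
        pdf_config = CrawlerRunConfig(
            pdf=True,            # generate a PDF of the rendered page
            page_timeout=30000   # wait at most 30 seconds (value in milliseconds)
        )
        result = await crawler.arun(url="https://httpbin.org/html", config=pdf_config)

        if result.success:
            # Per the list above, CrawlResult.pdf should point at the generated PDF.
            print(f"PDF generated? {'Yes' if result.pdf else 'No'}")
        else:
            print(f"Failed: {result.error_message}")

if __name__ == "__main__":
    asyncio.run(main())
```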
Let's try combining a few: focus on a specific part of the page and wait for something to appear.

```python
# chapter3_example_3.py
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    # This example site has a heading 'H1' inside a 'body' tag.
    url_to_crawl = "https://httpbin.org/html"

    async with AsyncWebCrawler() as crawler:
        print(f"Crawling {url_to_crawl}, focusing on the H1 tag...")

        # Instructions: Only get the H1 tag, wait max 10s for it
        specific_config = CrawlerRunConfig(
            css_selector="h1",  # Only grab content inside