# Chapter 9: Smart Fetching with Caching - CacheContext / CacheMode
In the previous chapter, [Chapter 8: Exploring Websites - DeepCrawlStrategy](08_deepcrawlstrategy.md), we saw how Crawl4AI can explore websites by following links, potentially visiting many pages. During such explorations, or even when you run the same crawl multiple times, the crawler might try to fetch the exact same webpage again and again. This is slow and puts unnecessary load on the website you're crawling. Wouldn't it be smarter to remember the result from the first time and just reuse it?

## What Problem Does Caching Solve?

Imagine you need to download a large instruction manual (a webpage) from the internet.

* **Without Caching:** Every single time you need the manual, you download the entire file again. This takes time and uses bandwidth every time.
* **With Caching:** The first time you download it, you save a copy on your computer (the "cache"). The next time you need it, you first check your local copy. If it's there, you use it instantly! You only download it again if you specifically want the absolute latest version or if your local copy is missing.

Caching in Crawl4AI works the same way. It's a mechanism to **store the results** of crawling a webpage locally (in a database file). When asked to crawl a URL again, Crawl4AI can check its cache first. If a valid result is already stored, it can return that saved result almost instantly, saving time and resources.

## Introducing `CacheMode` and `CacheContext`

Crawl4AI uses two key concepts to manage this caching behavior:

1. **`CacheMode` (The Cache Policy):**
    * Think of this like setting the rules for how you interact with your saved instruction manuals.
    * It's an **instruction** you give the crawler for a specific run, telling it *how* to use the cache.
    * **Analogy:** Should you *always* use your saved copy if you have one? (`ENABLED`) Should you *ignore* your saved copies and always download a fresh one? (`BYPASS`) Should you skip your collection entirely, neither reusing nor saving copies? (`DISABLED`) Should you save new copies but never reuse old ones? (`WRITE_ONLY`)
    * `CacheMode` lets you choose the caching behavior that best fits your needs for a particular task.

2. **`CacheContext` (The Decision Maker):**
    * This is an internal helper that Crawl4AI uses *during* a crawl. You don't usually interact with it directly.
    * It looks at the `CacheMode` you provided (the policy) and the type of URL being processed.
    * **Analogy:** Imagine a librarian who checks the library's borrowing rules (`CacheMode`) and the type of item you're requesting (e.g., a reference book that can't be checked out, like `raw:` HTML, which isn't cached). Based on these, the librarian (`CacheContext`) decides if you can borrow an existing copy (read from cache) or if a new copy should be added to the library (write to cache).
    * It helps the main `AsyncWebCrawler` make the right decision about reading from or writing to the cache for each specific URL, based on the active policy (see the short sketch after this list).
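
To make the librarian analogy concrete, here is a minimal sketch of how `CacheContext` treats a normal URL versus `raw:` HTML. It assumes the internal `crawl4ai.cache_context` module path and constructor shown in the Code Glimpse later in this chapter; in everyday use you only set `cache_mode` and let the crawler build the context for you.

```python
# Minimal sketch (not something you normally write yourself), assuming the internal
# crawl4ai.cache_context module shown in the Code Glimpse section below.
from crawl4ai.cache_context import CacheContext, CacheMode

web_ctx = CacheContext("https://httpbin.org/html", CacheMode.ENABLED)
raw_ctx = CacheContext("raw:<html><body>Hello</body></html>", CacheMode.ENABLED)

# A normal http(s) URL is cacheable, so ENABLED allows both reading and writing.
print(web_ctx.should_read(), web_ctx.should_write())  # expected: True True

# 'raw:' HTML is treated as non-cacheable, so the context refuses both.
print(raw_ctx.should_read(), raw_ctx.should_write())  # expected: False False
```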
## Setting the Cache Policy: Using `CacheMode`

You control the caching behavior by setting the `cache_mode` parameter within the `CrawlerRunConfig` object that you pass to `crawler.arun()` or `crawler.arun_many()`.

Let's explore the most common `CacheMode` options:

**1. `CacheMode.ENABLED` (The Default Behavior - If not specified)**

* **Policy:** "Use the cache if a valid result exists. If not, fetch the page, save the result to the cache, and then return it."
* This is the standard, balanced approach. It saves time on repeated crawls but ensures you get the content eventually.
* *Note: In recent versions, the default if `cache_mode` is left completely unspecified might be `CacheMode.BYPASS`. Always check the documentation or explicitly set the mode for clarity.* For this tutorial, let's assume we explicitly set it.
```python
# chapter9_example_1.py
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

async def main():
    url = "https://httpbin.org/html"
    async with AsyncWebCrawler() as crawler:
        # Explicitly set the mode to ENABLED
        config_enabled = CrawlerRunConfig(cache_mode=CacheMode.ENABLED)
        print(f"Running with CacheMode: {config_enabled.cache_mode.name}")

        # First run: fetches, caches, and returns the result
        print("First run (ENABLED)...")
        result1 = await crawler.arun(url=url, config=config_enabled)
        print(f"Got result 1? {'Yes' if result1.success else 'No'}")

        # Second run: finds the result in the cache and returns it instantly
        print("Second run (ENABLED)...")
        result2 = await crawler.arun(url=url, config=config_enabled)
        print(f"Got result 2? {'Yes' if result2.success else 'No'}")
        # This second run should be much faster!

if __name__ == "__main__":
    asyncio.run(main())
```
**Explanation:**

* We create a `CrawlerRunConfig` with `cache_mode=CacheMode.ENABLED`.
* The first `arun` call fetches the page from the web and saves the result in the cache.
* The second `arun` call for the same URL (with a config that doesn't require anything missing from the cached entry, such as a screenshot) finds the saved result in the cache and returns it immediately, skipping the web fetch.

**2. `CacheMode.BYPASS`**

* **Policy:** "Ignore any existing saved copy. Always fetch a fresh copy from the web. After fetching, save this new result to the cache (overwriting any old one)."
* Useful when you *always* need the absolute latest version of the page, but you still want to keep the cache up to date for future runs that use `CacheMode.ENABLED`.
```python
# chapter9_example_2.py
import asyncio
import time

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

async def main():
    url = "https://httpbin.org/html"
    async with AsyncWebCrawler() as crawler:
        # Set the mode to BYPASS
        config_bypass = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
        print(f"Running with CacheMode: {config_bypass.cache_mode.name}")

        # First run: fetches, caches, and returns the result
        print("First run (BYPASS)...")
        start_time = time.perf_counter()
        result1 = await crawler.arun(url=url, config=config_bypass)
        duration1 = time.perf_counter() - start_time
        print(f"Got result 1? {'Yes' if result1.success else 'No'} (took {duration1:.2f}s)")

        # Second run: ignores the cache, fetches again, updates the cache, returns the result
        print("Second run (BYPASS)...")
        start_time = time.perf_counter()
        result2 = await crawler.arun(url=url, config=config_bypass)
        duration2 = time.perf_counter() - start_time
        print(f"Got result 2? {'Yes' if result2.success else 'No'} (took {duration2:.2f}s)")
        # Both runs should take a similar amount of time (fetching time)

if __name__ == "__main__":
    asyncio.run(main())
```
**Explanation:**

* We set `cache_mode=CacheMode.BYPASS`.
* Both the first and second `arun` calls fetch the page directly from the web, ignoring any previously cached result. They still write the newly fetched result to the cache. Notice that both runs take roughly the same amount of time (the network fetch time). A short "refresh now, reuse later" sketch follows below.
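
A common pattern that follows from this policy is "refresh now, reuse later": run once with `BYPASS` to force a fresh fetch that also updates the cache, then switch to `ENABLED` so later runs can serve the refreshed copy. A minimal sketch, using only the API already shown in the examples above:

```python
# Sketch: refresh the cache once with BYPASS, then reuse it with ENABLED.
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

async def refresh_then_reuse(url: str):
    async with AsyncWebCrawler() as crawler:
        # Force a fresh fetch and overwrite any stale cached copy
        await crawler.arun(url=url, config=CrawlerRunConfig(cache_mode=CacheMode.BYPASS))
        # This (and any later) call can now be served from the just-refreshed cache
        return await crawler.arun(url=url, config=CrawlerRunConfig(cache_mode=CacheMode.ENABLED))

if __name__ == "__main__":
    result = asyncio.run(refresh_then_reuse("https://httpbin.org/html"))
    print(f"Success? {result.success}")
```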
**3. `CacheMode.DISABLED`**

* **Policy:** "Completely ignore the cache. Never read from it, never write to it."
* Useful when you don't want Crawl4AI to interact with the cache files at all, perhaps for debugging or if you have storage constraints.
```python
# chapter9_example_3.py
import asyncio
import time

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

async def main():
    url = "https://httpbin.org/html"
    async with AsyncWebCrawler() as crawler:
        # Set the mode to DISABLED
        config_disabled = CrawlerRunConfig(cache_mode=CacheMode.DISABLED)
        print(f"Running with CacheMode: {config_disabled.cache_mode.name}")

        # First run: fetches and returns the result (does NOT cache)
        print("First run (DISABLED)...")
        start_time = time.perf_counter()
        result1 = await crawler.arun(url=url, config=config_disabled)
        duration1 = time.perf_counter() - start_time
        print(f"Got result 1? {'Yes' if result1.success else 'No'} (took {duration1:.2f}s)")

        # Second run: fetches again and returns the result (does NOT cache)
        print("Second run (DISABLED)...")
        start_time = time.perf_counter()
        result2 = await crawler.arun(url=url, config=config_disabled)
        duration2 = time.perf_counter() - start_time
        print(f"Got result 2? {'Yes' if result2.success else 'No'} (took {duration2:.2f}s)")
        # Both runs fetch fresh, and nothing is ever saved to the cache.

if __name__ == "__main__":
    asyncio.run(main())
```
**Explanation:**

* We set `cache_mode=CacheMode.DISABLED`.
* Both `arun` calls fetch fresh content from the web. Crucially, neither run reads from nor writes to the cache database.

**Other Modes (`READ_ONLY`, `WRITE_ONLY`):**

* `CacheMode.READ_ONLY`: Reads from the cache when a matching result exists, but never saves anything new. On a cache miss, the page is still fetched fresh; the result simply isn't written back to the cache.
* `CacheMode.WRITE_ONLY`: Never reads from the cache (always fetches fresh); it *only* writes the newly fetched result to the cache. A short configuration sketch for both modes follows below.
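
Configuring these modes looks exactly like the earlier examples; only the policy changes. A minimal sketch:

```python
# Sketch: the less common modes are set the same way as ENABLED/BYPASS/DISABLED.
from crawl4ai import CrawlerRunConfig, CacheMode

# Prefer cached copies, but never add anything new to the cache
config_read_only = CrawlerRunConfig(cache_mode=CacheMode.READ_ONLY)

# Always fetch fresh, but keep updating the cache for other runs to reuse
config_write_only = CrawlerRunConfig(cache_mode=CacheMode.WRITE_ONLY)

# Pass either config to crawler.arun(url=..., config=...) just like in the examples above.
```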
## How Caching Works Internally

When you call `crawler.arun(url="...", config=...)`:

1. **Create Context:** The `AsyncWebCrawler` creates a `CacheContext` instance using the `url` and the `config.cache_mode`.
2. **Check Read:** It asks the `CacheContext`, "Should I read from the cache?" (`cache_context.should_read()`).
3. **Try Reading:** If `should_read()` is `True`, it asks the database manager ([`AsyncDatabaseManager`](async_database.py)) to look for a cached result for the `url`.
4. **Cache Hit?**
    * If a valid cached result is found: The `AsyncWebCrawler` returns this cached `CrawlResult` immediately. Done!
    * If no cached result is found (or if `should_read()` was `False`): Proceed to fetching.
5. **Fetch:** The `AsyncWebCrawler` calls the appropriate [AsyncCrawlerStrategy](01_asynccrawlerstrategy.md) to fetch the content from the web.
6. **Process:** It processes the fetched HTML (scraping, filtering, extracting) to create a new `CrawlResult`.
7. **Check Write:** It asks the `CacheContext`, "Should I write this result to the cache?" (`cache_context.should_write()`).
8. **Write Cache:** If `should_write()` is `True`, it tells the database manager to save the new `CrawlResult` into the cache database.
9. **Return:** The `AsyncWebCrawler` returns the newly created `CrawlResult`.
```mermaid
sequenceDiagram
    participant User
    participant AWC as AsyncWebCrawler
    participant Ctx as CacheContext
    participant DB as DatabaseManager
    participant Fetcher as AsyncCrawlerStrategy

    User->>AWC: arun(url, config)
    AWC->>Ctx: Create CacheContext(url, config.cache_mode)
    AWC->>Ctx: should_read()?
    alt Cache Read Allowed
        Ctx-->>AWC: Yes
        AWC->>DB: aget_cached_url(url)
        DB-->>AWC: Cached Result (or None)
        alt Cache Hit & Valid
            AWC-->>User: Return Cached CrawlResult
        else Cache Miss or Invalid
            AWC->>AWC: Proceed to Fetch
        end
    else Cache Read Not Allowed
        Ctx-->>AWC: No
        AWC->>AWC: Proceed to Fetch
    end

    Note over AWC: Fetching Required
    AWC->>Fetcher: crawl(url, config)
    Fetcher-->>AWC: Raw Response
    AWC->>AWC: Process HTML -> New CrawlResult
    AWC->>Ctx: should_write()?
    alt Cache Write Allowed
        Ctx-->>AWC: Yes
        AWC->>DB: acache_url(New CrawlResult)
        DB-->>AWC: OK
    else Cache Write Not Allowed
        Ctx-->>AWC: No
    end
    AWC-->>User: Return New CrawlResult
```
## Code Glimpse

Let's look at simplified code snippets.

**Inside `async_webcrawler.py` (where `arun` uses caching):**
```python
# Simplified from crawl4ai/async_webcrawler.py
from .cache_context import CacheContext, CacheMode
from .async_database import async_db_manager
from .models import CrawlResult
# ... other imports

class AsyncWebCrawler:
    # ... (init, other methods) ...

    async def arun(self, url: str, config: CrawlerRunConfig = None) -> CrawlResult:
        # ... (ensure config exists, set defaults) ...
        if config.cache_mode is None:
            config.cache_mode = CacheMode.ENABLED  # Example default

        # 1. Create CacheContext
        cache_context = CacheContext(url, config.cache_mode)

        cached_result = None
        # 2. Check if cache read is allowed
        if cache_context.should_read():
            # 3. Try reading from database
            cached_result = await async_db_manager.aget_cached_url(url)

        # 4. If cache hit and valid, return it
        if cached_result and self._is_cache_valid(cached_result, config):
            self.logger.info("Cache hit for: %s", url)  # Example log
            return cached_result  # Return early

        # 5. Fetch fresh content (if no cache hit or read disabled)
        async_response = await self.crawler_strategy.crawl(url, config=config)
        html = async_response.html  # ... and other data ...

        # 6. Process the HTML to get a new CrawlResult
        crawl_result = await self.aprocess_html(
            url=url, html=html, config=config,  # ... other params ...
        )

        # 7. Check if cache write is allowed
        if cache_context.should_write():
            # 8. Write the new result to the database
            await async_db_manager.acache_url(crawl_result)

        # 9. Return the new result
        return crawl_result

    def _is_cache_valid(self, cached_result: CrawlResult, config: CrawlerRunConfig) -> bool:
        # Internal logic to check if the cached result meets current needs
        # (e.g., was a screenshot requested now but not cached?)
        if config.screenshot and not cached_result.screenshot:
            return False
        if config.pdf and not cached_result.pdf:
            return False
        # ... other checks ...
        return True
```
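
One practical consequence of `_is_cache_valid` above: a cached entry is reused only if it can satisfy everything the current run asks for. Here is a hedged sketch of that behavior, assuming `CrawlerRunConfig` exposes the `screenshot` flag checked in that helper and that `CrawlResult.screenshot` holds the captured image:

```python
# Sketch: a cached result saved without a screenshot cannot satisfy a run that asks
# for one, so the second call below should re-fetch even though ENABLED allows reads.
# Assumes CrawlerRunConfig(screenshot=True) matches the flag checked in _is_cache_valid.
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

async def main():
    url = "https://httpbin.org/html"
    async with AsyncWebCrawler() as crawler:
        # First run: result is cached, but without a screenshot
        await crawler.arun(url=url, config=CrawlerRunConfig(cache_mode=CacheMode.ENABLED))
        # Second run: cache is consulted, judged invalid for this request, page is fetched again
        result = await crawler.arun(
            url=url,
            config=CrawlerRunConfig(cache_mode=CacheMode.ENABLED, screenshot=True),
        )
        print(f"Screenshot captured? {bool(result.screenshot)}")

if __name__ == "__main__":
    asyncio.run(main())
```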
**Inside `cache_context.py` (defining the concepts):**
```python
# Simplified from crawl4ai/cache_context.py
from enum import Enum

class CacheMode(Enum):
    """Defines the caching behavior for web crawling operations."""
    ENABLED = "enabled"        # Read and Write
    DISABLED = "disabled"      # No Read, No Write
    READ_ONLY = "read_only"    # Read Only, No Write
    WRITE_ONLY = "write_only"  # Write Only, No Read
    BYPASS = "bypass"          # No Read, Write Only (similar to WRITE_ONLY, but signals an explicit intention)

class CacheContext:
    """Encapsulates cache-related decisions and URL handling."""

    def __init__(self, url: str, cache_mode: CacheMode, always_bypass: bool = False):
        self.url = url
        self.cache_mode = cache_mode
        self.always_bypass = always_bypass  # Usually False
        # Determine if the URL type is cacheable (e.g., not 'raw:')
        self.is_cacheable = url.startswith(("http://", "https://", "file://"))
        # ... other URL type checks ...

    def should_read(self) -> bool:
        """Determines if cache should be read based on context."""
        if self.always_bypass or not self.is_cacheable:
            return False
        # Allow read if mode is ENABLED or READ_ONLY
        return self.cache_mode in [CacheMode.ENABLED, CacheMode.READ_ONLY]

    def should_write(self) -> bool:
        """Determines if cache should be written based on context."""
        if self.always_bypass or not self.is_cacheable:
            return False
        # Allow write if mode is ENABLED, WRITE_ONLY, or BYPASS
        return self.cache_mode in [CacheMode.ENABLED, CacheMode.WRITE_ONLY, CacheMode.BYPASS]

    @property
    def display_url(self) -> str:
        """Returns the URL in display format."""
        return self.url if not self.url.startswith("raw:") else "Raw HTML"

# Helper for backward compatibility (may be removed later)
def _legacy_to_cache_mode(...) -> CacheMode:
    # ... logic to convert old boolean flags ...
    pass
```
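
As a quick sanity check of the logic above, here is a small sketch that prints the read/write decision for every mode on a normal `http(s)` URL. It assumes the internal `crawl4ai.cache_context` module path used in this glimpse:

```python
# Sketch: print should_read()/should_write() for every CacheMode on a cacheable URL.
# Assumes the internal crawl4ai.cache_context module shown above.
from crawl4ai.cache_context import CacheContext, CacheMode

for mode in CacheMode:
    ctx = CacheContext("https://example.com", mode)
    print(f"{mode.name:<11} read={str(ctx.should_read()):<5} write={ctx.should_write()}")

# Expected, following the simplified logic above:
#   ENABLED     read=True  write=True
#   DISABLED    read=False write=False
#   READ_ONLY   read=True  write=False
#   WRITE_ONLY  read=False write=True
#   BYPASS      read=False write=True
```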
## Conclusion

You've learned how Crawl4AI uses caching to avoid redundant work and speed up repeated crawls!

* **Caching** stores results locally to reuse them later.
* **`CacheMode`** is the policy you set in `CrawlerRunConfig` to control *how* the cache is used (`ENABLED`, `BYPASS`, `DISABLED`, etc.).
* **`CacheContext`** is an internal helper that makes decisions based on the `CacheMode` and URL type.
* Using the cache effectively (especially `CacheMode.ENABLED`) can significantly speed up your crawling tasks, particularly during development or when dealing with many URLs, including deep crawls.

We've seen how Crawl4AI can crawl single pages, lists of pages (`arun_many`), and even explore websites (`DeepCrawlStrategy`). But how does `arun_many` or a deep crawl manage running potentially hundreds or thousands of individual crawl tasks efficiently without overwhelming your system or the target website?

**Next:** Let's explore the component responsible for managing concurrent tasks: [Chapter 10: Orchestrating the Crawl - BaseDispatcher](10_basedispatcher.md).

---

Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge)