update nav

This commit is contained in:
zachary62
2025-04-04 14:03:22 -04:00
parent 2fa60fe7d5
commit 0426110e66
24 changed files with 261 additions and 32 deletions

View File

@@ -1,3 +1,10 @@
---
layout: default
title: "AsyncCrawlerStrategy"
parent: "Crawl4AI"
nav_order: 1
---
# Chapter 1: How We Fetch Webpages - AsyncCrawlerStrategy
Welcome to the Crawl4AI tutorial series! Our goal is to build intelligent agents that can understand and extract information from the web. The very first step in this process is actually *getting* the content from a webpage. This chapter explains how Crawl4AI handles that fundamental task.
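To ground the idea, here's a minimal sketch of choosing a fetching strategy explicitly. It assumes the `crawler_strategy` parameter and the `AsyncHTTPCrawlerStrategy` import path shown below; check your installed version if these differ:

```python
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_crawler_strategy import AsyncHTTPCrawlerStrategy

async def main():
    # Swap in the lightweight HTTP "drone" instead of the default
    # Playwright browser "truck" used when no strategy is passed.
    strategy = AsyncHTTPCrawlerStrategy()
    async with AsyncWebCrawler(crawler_strategy=strategy) as crawler:
        result = await crawler.arun(url="https://example.com")
        print(result.status_code)

asyncio.run(main())
```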

View File

@@ -1,3 +1,10 @@
---
layout: default
title: "AsyncWebCrawler"
parent: "Crawl4AI"
nav_order: 2
---
# Chapter 2: Meet the General Manager - AsyncWebCrawler
In [Chapter 1: How We Fetch Webpages - AsyncCrawlerStrategy](01_asynccrawlerstrategy.md), we learned about the different ways Crawl4AI can fetch the raw content of a webpage, like choosing between a fast drone (`AsyncHTTPCrawlerStrategy`) or a versatile delivery truck (`AsyncPlaywrightCrawlerStrategy`).
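As a first taste of the general manager in action, here's a minimal run following the library's quick-start pattern (a sketch, not a production-ready program):

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    # The context manager handles browser startup and teardown for us.
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com")
        print(result.success, len(result.html))

asyncio.run(main())
```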

View File

@@ -1,3 +1,10 @@
---
layout: default
title: "CrawlerRunConfig"
parent: "Crawl4AI"
nav_order: 3
---
# Chapter 3: Giving Instructions - CrawlerRunConfig
In [Chapter 2: Meet the General Manager - AsyncWebCrawler](02_asyncwebcrawler.md), we met the `AsyncWebCrawler`, the central coordinator for our web crawling tasks. We saw how to tell it *what* URL to crawl using the `arun` method.
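For example, a run-specific instruction set might look like the sketch below; `screenshot` and `word_count_threshold` are two of the config's options (verify the names against your installed version):

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    # Per-run instructions travel in a CrawlerRunConfig object.
    config = CrawlerRunConfig(screenshot=True, word_count_threshold=10)
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com", config=config)
        print("screenshot captured:", result.screenshot is not None)

asyncio.run(main())
```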

View File

@@ -1,3 +1,10 @@
---
layout: default
title: "ContentScrapingStrategy"
parent: "Crawl4AI"
nav_order: 4
---
# Chapter 4: Cleaning Up the Mess - ContentScrapingStrategy
In [Chapter 3: Giving Instructions - CrawlerRunConfig](03_crawlerrunconfig.md), we learned how to give specific instructions to our `AsyncWebCrawler` using `CrawlerRunConfig`. This included telling it *how* to fetch the page and potentially take screenshots or PDFs.
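A quick way to see the scraping step's effect is to compare the raw and cleaned HTML on the result, as in this sketch (attribute names as used throughout these chapters; adjust if your version differs):

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com")
        # The scraping strategy strips boilerplate, producing cleaned_html
        # plus structured inventories of links and media.
        print(len(result.html), "raw chars ->", len(result.cleaned_html), "cleaned chars")
        print("internal links:", len(result.links.get("internal", [])))

asyncio.run(main())
```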

View File

@@ -1,3 +1,10 @@
---
layout: default
title: "RelevantContentFilter"
parent: "Crawl4AI"
nav_order: 5
---
# Chapter 5: Focusing on What Matters - RelevantContentFilter
In [Chapter 4: Cleaning Up the Mess - ContentScrapingStrategy](04_contentscrapingstrategy.md), we learned how Crawl4AI takes the raw, messy HTML from a webpage and cleans it up using a `ContentScrapingStrategy`. This gives us a tidier version of the HTML (`cleaned_html`) and extracts basic elements like links and images.
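Here's a sketch of wiring a filter in. It assumes the `PruningContentFilter` and `DefaultMarkdownGenerator` import paths shown, and that the focused output lands on `result.markdown.fit_markdown` (older releases exposed it as `result.markdown_v2.fit_markdown`):

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def main():
    # Attach a relevance filter so the result carries a focused
    # fit_markdown alongside the full markdown.
    md_generator = DefaultMarkdownGenerator(content_filter=PruningContentFilter())
    config = CrawlerRunConfig(markdown_generator=md_generator)
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com", config=config)
        print(result.markdown.fit_markdown[:300])

asyncio.run(main())
```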

View File

@@ -1,3 +1,10 @@
---
layout: default
title: "ExtractionStrategy"
parent: "Crawl4AI"
nav_order: 6
---
# Chapter 6: Getting Specific Data - ExtractionStrategy
In the previous chapter, [Chapter 5: Focusing on What Matters - RelevantContentFilter](05_relevantcontentfilter.md), we learned how to sift through the cleaned webpage content to keep only the parts relevant to our query or goal, producing a focused `fit_markdown`. This is great for tasks like summarization or getting the main gist of an article.
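To preview the analyst role, here's a sketch using a CSS-based extraction schema; the schema itself is hypothetical, and the `JsonCssExtractionStrategy` import path is assumed:

```python
import asyncio
import json
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def main():
    # Hypothetical schema: collect every <h1> on the page as a record.
    schema = {
        "name": "Headings",
        "baseSelector": "body",
        "fields": [{"name": "title", "selector": "h1", "type": "text"}],
    }
    config = CrawlerRunConfig(extraction_strategy=JsonCssExtractionStrategy(schema))
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com", config=config)
        print(json.loads(result.extracted_content))

asyncio.run(main())
```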

View File

@@ -1,3 +1,10 @@
---
layout: default
title: "CrawlResult"
parent: "Crawl4AI"
nav_order: 7
---
# Chapter 7: Understanding the Results - CrawlResult
In the previous chapter, [Chapter 6: Getting Specific Data - ExtractionStrategy](06_extractionstrategy.md), we learned how to teach Crawl4AI to act like an analyst, extracting specific, structured data points from a webpage using an `ExtractionStrategy`. We've seen how Crawl4AI can fetch pages, clean them, filter them, and even extract precise information.
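Before diving into the fields one by one, here's a sketch of poking at a fresh `CrawlResult` (attribute names as covered in earlier chapters):

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com")
        # CrawlResult bundles everything the pipeline produced for one URL.
        print("success:", result.success)
        print("status:", result.status_code)
        print("links:", {kind: len(urls) for kind, urls in result.links.items()})
        print("markdown chars:", len(str(result.markdown)))

asyncio.run(main())
```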
@@ -247,7 +254,7 @@ if __name__ == "__main__":
You don't interact with the `CrawlResult` constructor directly. The `AsyncWebCrawler` creates it for you at the very end of the `arun` process, typically inside its internal `aprocess_html` method (or just before returning if fetching from cache).
Here's a simplified sequence:
1. **Fetch:** `AsyncWebCrawler` calls the [AsyncCrawlerStrategy](01_asynccrawlerstrategy.md) to get the raw `html`, `status_code`, `response_headers`, etc.
2. **Scrape:** It passes the `html` to the [ContentScrapingStrategy](04_contentscrapingstrategy.md) to get `cleaned_html`, `links`, `media`, `metadata`.
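In code shape, the hand-off looks roughly like this (purely illustrative pseudocode, not the library's actual internals):

```python
# Illustrative sketch only -- the real method is more involved.
async def aprocess_simplified(crawler, url, config):
    # 1. Fetch: the crawler strategy returns raw html plus response metadata.
    response = await crawler.crawler_strategy.crawl(url, config=config)
    # 2. Scrape: the scraping strategy derives cleaned_html, links, media,
    #    and metadata, which later land on the CrawlResult.
    scraped = config.scraping_strategy.scrap(url, response.html)
    return scraped
```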

View File

@@ -1,3 +1,10 @@
---
layout: default
title: "DeepCrawlStrategy"
parent: "Crawl4AI"
nav_order: 8
---
# Chapter 8: Exploring Websites - DeepCrawlStrategy
In [Chapter 7: Understanding the Results - CrawlResult](07_crawlresult.md), we saw the final report (`CrawlResult`) that Crawl4AI gives us after processing a single URL. This report contains cleaned content, links, metadata, and maybe even extracted data.
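As a preview, a deep crawl is configured like any other run. This sketch assumes the `BFSDeepCrawlStrategy` import path and parameters shown, and that a non-streaming deep crawl returns a list of results:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

async def main():
    # Follow internal links breadth-first, up to two hops from the start page.
    config = CrawlerRunConfig(
        deep_crawl_strategy=BFSDeepCrawlStrategy(max_depth=2, include_external=False)
    )
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun(url="https://example.com", config=config)
        for page in results:
            print(page.url, page.success)

asyncio.run(main())
```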

View File

@@ -1,3 +1,10 @@
---
layout: default
title: "CacheContext & CacheMode"
parent: "Crawl4AI"
nav_order: 9
---
# Chapter 9: Smart Fetching with Caching - CacheContext / CacheMode
In the previous chapter, [Chapter 8: Exploring Websites - DeepCrawlStrategy](08_deepcrawlstrategy.md), we saw how Crawl4AI can explore websites by following links, potentially visiting many pages. During such explorations, or even when you run the same crawl multiple times, the crawler might try to fetch the exact same webpage again and again. This can be slow and might unnecessarily put a load on the website you're crawling. Wouldn't it be smarter to remember the result from the first time and just reuse it?
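That's exactly what `CacheMode` controls. Here's a sketch of the contrast (enum member names assumed from the library; verify against your version):

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

async def main():
    async with AsyncWebCrawler() as crawler:
        # First run: read from and write to the local cache as usual.
        cached = CrawlerRunConfig(cache_mode=CacheMode.ENABLED)
        await crawler.arun(url="https://example.com", config=cached)
        # Same URL again: served from cache, no second network fetch.
        await crawler.arun(url="https://example.com", config=cached)
        # Force a fresh fetch when you need up-to-the-minute content.
        fresh = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
        await crawler.arun(url="https://example.com", config=fresh)

asyncio.run(main())
```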

View File

@@ -1,3 +1,10 @@
---
layout: default
title: "BaseDispatcher"
parent: "Crawl4AI"
nav_order: 10
---
# Chapter 10: Orchestrating the Crawl - BaseDispatcher
In [Chapter 9: Smart Fetching with Caching - CacheContext / CacheMode](09_cachecontext___cachemode.md), we learned how Crawl4AI uses caching to cleverly avoid re-fetching the same webpage multiple times, which is especially helpful when crawling many URLs. We've also seen how methods like `arun_many()` ([Chapter 2: Meet the General Manager - AsyncWebCrawler](02_asyncwebcrawler.md)) or strategies like [DeepCrawlStrategy](08_deepcrawlstrategy.md) can lead to potentially hundreds or thousands of individual URLs needing to be crawled.
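Here's the scale problem in miniature: `arun_many()` takes a batch of URLs and hands them to a dispatcher behind the scenes (a sketch of the call, with the default dispatcher assumed):

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    urls = ["https://example.com", "https://example.org", "https://example.net"]
    async with AsyncWebCrawler() as crawler:
        # The dispatcher decides how many of these crawls run concurrently
        # and in what order; we just hand over the list.
        results = await crawler.arun_many(urls, config=CrawlerRunConfig())
        for page in results:
            print(page.url, page.success)

asyncio.run(main())
```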