update nav

Mirror of https://github.com/aljazceru/Tutorial-Codebase-Knowledge.git. This commit adds Jekyll navigation front matter (`layout`, `title`, `parent`, `nav_order`) to each chapter page of the Crawl4AI tutorial.
01_asynccrawlerstrategy.md:

```diff
@@ -1,3 +1,10 @@
+---
+layout: default
+title: "AsyncCrawlerStrategy"
+parent: "Crawl4AI"
+nav_order: 1
+---
+
 # Chapter 1: How We Fetch Webpages - AsyncCrawlerStrategy
 
 Welcome to the Crawl4AI tutorial series! Our goal is to build intelligent agents that can understand and extract information from the web. The very first step in this process is actually *getting* the content from a webpage. This chapter explains how Crawl4AI handles that fundamental task.
```
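The chapter opening in this hunk introduces the fetch layer. As a rough sketch of what choosing a fetch strategy looks like in code, assuming the `AsyncHTTPCrawlerStrategy` import path and the `crawler_strategy` keyword match your installed Crawl4AI version:

```python
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_crawler_strategy import AsyncHTTPCrawlerStrategy

async def main():
    # Swap in the lightweight HTTP-only fetcher instead of the default
    # Playwright browser strategy (import path is a version assumption).
    async with AsyncWebCrawler(crawler_strategy=AsyncHTTPCrawlerStrategy()) as crawler:
        result = await crawler.arun(url="https://example.com")
        print(result.status_code)

asyncio.run(main())
```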
02_asyncwebcrawler.md:

```diff
@@ -1,3 +1,10 @@
+---
+layout: default
+title: "AsyncWebCrawler"
+parent: "Crawl4AI"
+nav_order: 2
+---
+
 # Chapter 2: Meet the General Manager - AsyncWebCrawler
 
 In [Chapter 1: How We Fetch Webpages - AsyncCrawlerStrategy](01_asynccrawlerstrategy.md), we learned about the different ways Crawl4AI can fetch the raw content of a webpage, like choosing between a fast drone (`AsyncHTTPCrawlerStrategy`) or a versatile delivery truck (`AsyncPlaywrightCrawlerStrategy`).
```
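The `arun` entry point this chapter is built around, in its simplest form; a minimal sketch assuming the standard top-level `AsyncWebCrawler` import:

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    # AsyncWebCrawler is the coordinator: open it, hand it a URL,
    # and it drives the fetch/scrape/extract pipeline end to end.
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com")
        print(result.markdown)

asyncio.run(main())
```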
03_crawlerrunconfig.md:

```diff
@@ -1,3 +1,10 @@
+---
+layout: default
+title: "CrawlerRunConfig"
+parent: "Crawl4AI"
+nav_order: 3
+---
+
 # Chapter 3: Giving Instructions - CrawlerRunConfig
 
 In [Chapter 2: Meet the General Manager - AsyncWebCrawler](02_asyncwebcrawler.md), we met the `AsyncWebCrawler`, the central coordinator for our web crawling tasks. We saw how to tell it *what* URL to crawl using the `arun` method.
```
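A sketch of the per-run instructions the chapter describes; `screenshot` and `pdf` are flags present in recent Crawl4AI releases, but treat the exact names as assumptions about your version:

```python
from crawl4ai import CrawlerRunConfig

# Per-run instructions travel in a CrawlerRunConfig rather than in
# arguments scattered across arun(); these flags ask the browser
# strategy to also capture a screenshot and a PDF of the page.
config = CrawlerRunConfig(screenshot=True, pdf=True)
# result = await crawler.arun(url="https://example.com", config=config)
```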
04_contentscrapingstrategy.md:

```diff
@@ -1,3 +1,10 @@
+---
+layout: default
+title: "ContentScrapingStrategy"
+parent: "Crawl4AI"
+nav_order: 4
+---
+
 # Chapter 4: Cleaning Up the Mess - ContentScrapingStrategy
 
 In [Chapter 3: Giving Instructions - CrawlerRunConfig](03_crawlerrunconfig.md), we learned how to give specific instructions to our `AsyncWebCrawler` using `CrawlerRunConfig`. This included telling it *how* to fetch the page and potentially take screenshots or PDFs.
```
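To make the scrape outputs concrete, a sketch of reading them off the result; the `"internal"` and `"images"` dictionary keys are assumptions about the layout in your version:

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com")
        # The scraping strategy turns raw HTML into a tidier form and
        # pulls out basic elements alongside it.
        print(len(result.cleaned_html))          # cleaned-up HTML
        print(result.links.get("internal", []))  # extracted links
        print(result.media.get("images", []))    # extracted images

asyncio.run(main())
```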
05_relevantcontentfilter.md:

```diff
@@ -1,3 +1,10 @@
+---
+layout: default
+title: "RelevantContentFilter"
+parent: "Crawl4AI"
+nav_order: 5
+---
+
 # Chapter 5: Focusing on What Matters - RelevantContentFilter
 
 In [Chapter 4: Cleaning Up the Mess - ContentScrapingStrategy](04_contentscrapingstrategy.md), we learned how Crawl4AI takes the raw, messy HTML from a webpage and cleans it up using a `ContentScrapingStrategy`. This gives us a tidier version of the HTML (`cleaned_html`) and extracts basic elements like links and images.
```
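A hedged sketch of wiring a relevance filter into the markdown pipeline; `PruningContentFilter`, `DefaultMarkdownGenerator`, and the `fit_markdown` attribute are taken from recent Crawl4AI releases and may differ in yours:

```python
from crawl4ai import CrawlerRunConfig
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

# Attach a content filter to the markdown generator; the filtered,
# focused output then lands in result.markdown.fit_markdown.
config = CrawlerRunConfig(
    markdown_generator=DefaultMarkdownGenerator(
        content_filter=PruningContentFilter(threshold=0.5)
    )
)
```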
06_extractionstrategy.md:

```diff
@@ -1,3 +1,10 @@
+---
+layout: default
+title: "ExtractionStrategy"
+parent: "Crawl4AI"
+nav_order: 6
+---
+
 # Chapter 6: Getting Specific Data - ExtractionStrategy
 
 In the previous chapter, [Chapter 5: Focusing on What Matters - RelevantContentFilter](05_relevantcontentfilter.md), we learned how to sift through the cleaned webpage content to keep only the parts relevant to our query or goal, producing a focused `fit_markdown`. This is great for tasks like summarization or getting the main gist of an article.
```
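A sketch of the CSS-schema flavor of extraction; the selectors are placeholders, and the schema keys assume `JsonCssExtractionStrategy` as shipped in recent versions:

```python
from crawl4ai import CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

# A CSS-based schema: pull every article title and link on the page
# into structured JSON (selectors here are placeholders).
schema = {
    "name": "articles",
    "baseSelector": "article",
    "fields": [
        {"name": "title", "selector": "h2", "type": "text"},
        {"name": "url", "selector": "a", "type": "attribute", "attribute": "href"},
    ],
}
config = CrawlerRunConfig(extraction_strategy=JsonCssExtractionStrategy(schema))
# After arun(): json.loads(result.extracted_content) gives the records.
```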
07_crawlresult.md:

```diff
@@ -1,3 +1,10 @@
+---
+layout: default
+title: "CrawlResult"
+parent: "Crawl4AI"
+nav_order: 7
+---
+
 # Chapter 7: Understanding the Results - CrawlResult
 
 In the previous chapter, [Chapter 6: Getting Specific Data - ExtractionStrategy](06_extractionstrategy.md), we learned how to teach Crawl4AI to act like an analyst, extracting specific, structured data points from a webpage using an `ExtractionStrategy`. We've seen how Crawl4AI can fetch pages, clean them, filter them, and even extract precise information.
@@ -247,7 +254,7 @@ if __name__ == "__main__":
 
 You don't interact with the `CrawlResult` constructor directly. The `AsyncWebCrawler` creates it for you at the very end of the `arun` process, typically inside its internal `aprocess_html` method (or just before returning if fetching from cache).
 
-Here’s a simplified sequence:
+Here's a simplified sequence:
 
 1. **Fetch:** `AsyncWebCrawler` calls the [AsyncCrawlerStrategy](01_asynccrawlerstrategy.md) to get the raw `html`, `status_code`, `response_headers`, etc.
 2. **Scrape:** It passes the `html` to the [ContentScrapingStrategy](04_contentscrapingstrategy.md) to get `cleaned_html`, `links`, `media`, `metadata`.
```
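A sketch of poking at the fields that the fetch/scrape sequence in the second hunk assembles; attribute names follow the chapter's own description of `CrawlResult`:

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com")
        # CrawlResult bundles everything the pipeline produced:
        print(result.success)            # did the crawl succeed?
        print(result.status_code)        # from the fetch step
        print(result.metadata)           # from the scrape step
        print(result.extracted_content)  # from extraction (or None)

asyncio.run(main())
```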
08_deepcrawlstrategy.md:

```diff
@@ -1,3 +1,10 @@
+---
+layout: default
+title: "DeepCrawlStrategy"
+parent: "Crawl4AI"
+nav_order: 8
+---
+
 # Chapter 8: Exploring Websites - DeepCrawlStrategy
 
 In [Chapter 7: Understanding the Results - CrawlResult](07_crawlresult.md), we saw the final report (`CrawlResult`) that Crawl4AI gives us after processing a single URL. This report contains cleaned content, links, metadata, and maybe even extracted data.
```
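A sketch of turning link-following on, assuming the `deep_crawling` module and the `BFSDeepCrawlStrategy` name from Crawl4AI 0.5+; both are version assumptions:

```python
from crawl4ai import CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

# Follow links breadth-first, one hop beyond the starting page.
config = CrawlerRunConfig(
    deep_crawl_strategy=BFSDeepCrawlStrategy(max_depth=1)
)
# results = await crawler.arun(url="https://example.com", config=config)
# With a deep-crawl strategy attached, arun() yields one CrawlResult
# per visited page rather than a single result.
```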
09_cachecontext___cachemode.md:

```diff
@@ -1,3 +1,10 @@
+---
+layout: default
+title: "CacheContext & CacheMode"
+parent: "Crawl4AI"
+nav_order: 9
+---
+
 # Chapter 9: Smart Fetching with Caching - CacheContext / CacheMode
 
 In the previous chapter, [Chapter 8: Exploring Websites - DeepCrawlStrategy](08_deepcrawlstrategy.md), we saw how Crawl4AI can explore websites by following links, potentially visiting many pages. During such explorations, or even when you run the same crawl multiple times, the crawler might try to fetch the exact same webpage again and again. This can be slow and might unnecessarily put a load on the website you're crawling. Wouldn't it be smarter to remember the result from the first time and just reuse it?
```
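A sketch of the cache switch the chapter introduces; `CacheMode.ENABLED` and `CacheMode.BYPASS` are documented members, though the default mode varies by version:

```python
from crawl4ai import CacheMode, CrawlerRunConfig

# Reuse a stored result when one exists, or force a fresh fetch:
cached = CrawlerRunConfig(cache_mode=CacheMode.ENABLED)
fresh = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
# result = await crawler.arun(url="https://example.com", config=cached)
```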
10_basedispatcher.md:

```diff
@@ -1,3 +1,10 @@
+---
+layout: default
+title: "BaseDispatcher"
+parent: "Crawl4AI"
+nav_order: 10
+---
+
 # Chapter 10: Orchestrating the Crawl - BaseDispatcher
 
 In [Chapter 9: Smart Fetching with Caching - CacheContext / CacheMode](09_cachecontext___cachemode.md), we learned how Crawl4AI uses caching to cleverly avoid re-fetching the same webpage multiple times, which is especially helpful when crawling many URLs. We've also seen how methods like `arun_many()` ([Chapter 2: Meet the General Manager - AsyncWebCrawler](02_asyncwebcrawler.md)) or strategies like [DeepCrawlStrategy](08_deepcrawlstrategy.md) can lead to potentially hundreds or thousands of individual URLs needing to be crawled.
```
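A sketch of handing `arun_many()` a dispatcher; `MemoryAdaptiveDispatcher` and its `max_session_permit` knob are assumptions based on recent releases:

```python
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_dispatcher import MemoryAdaptiveDispatcher

# A dispatcher decides how many of the queued URLs run concurrently;
# this one also throttles itself based on available memory.
dispatcher = MemoryAdaptiveDispatcher(max_session_permit=10)
# async with AsyncWebCrawler() as crawler:
#     results = await crawler.arun_many(urls=urls, dispatcher=dispatcher)
```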