Post

How Crawl4ai Works: The Open-Source AI Scraping Framework

How Crawl4ai Works: The Open-Source AI Scraping Framework

Crawl4ai is an open-source Python framework that combines web crawling with LLM-powered data extraction. Instead of writing brittle CSS selectors or XPath queries that break every time a site changes its layout, you point Crawl4ai at a URL and get back clean markdown or structured JSON – ready for feeding into a language model, populating a knowledge base, or building a dataset. The project has grown rapidly since its initial release, sitting at over 40,000 GitHub stars and becoming one of the most popular AI-adjacent scraping tools in the Python ecosystem. The recent Crawl4ai v0.8 release added crash recovery and prefetch mode, further improving its production readiness. This post covers how Crawl4ai works under the hood, how to use it in practice, and where it fits in the broader scraping landscape.

What Crawl4ai Actually Does

At its core, Crawl4ai solves a specific problem: turning messy web pages into clean, structured data that AI systems can consume. Traditional scraping tools give you raw HTML and leave the parsing to you. Crawl4ai handles the entire pipeline – from rendering JavaScript-heavy pages in a real browser to converting the result into markdown or extracting structured fields using an LLM.

The framework is built on top of Playwright for browser automation, which means it can handle single-page applications, lazy-loaded content, and pages that require JavaScript execution to display their data. On top of that rendering layer, Crawl4ai adds its own content extraction pipeline: stripping out boilerplate (navigation, ads, footers), converting the meaningful content to clean markdown, and optionally sending that content to an LLM for structured extraction.

The Crawl4ai Pipeline

Every request through Crawl4ai follows the same general flow. Understanding this pipeline helps you configure the framework effectively and debug issues when they arise.

graph TD
    A["URL Input"] --> B["Playwright Browser<br>Renders the page"]
    B --> C["Raw HTML<br>Full DOM after JS execution"]
    C --> D["Content Cleaning<br>Strip boilerplate and noise"]
    D --> E["Markdown Conversion<br>Clean, readable output"]
    E --> F{"LLM Extraction<br>Configured?"}
    F -->|Yes| G["Send to LLM<br>with schema"]
    G --> H["Structured Data<br>JSON output"]
    F -->|No| I["Return Markdown<br>as final output"]

    style A fill:#ffcccc
    style B fill:#ffffcc
    style D fill:#ffffcc
    style E fill:#ccffcc
    style G fill:#ccccff
    style H fill:#99ff99
    style I fill:#99ff99

The key insight is that LLM extraction is optional. Many use cases only need the markdown conversion step – feeding documentation pages into a RAG system, for example, or archiving article content. The LLM layer adds cost and latency, so Crawl4ai lets you skip it when you do not need it.

Installation

Crawl4ai requires Python 3.9 or later. Installation pulls in Playwright and its browser binaries.

1
2
3
4
pip install crawl4ai

# Install the browser binary (Chromium by default)
crawl4ai-setup

The crawl4ai-setup command downloads Chromium and any system dependencies needed for headless browser operation. On Linux servers, this typically includes libraries like libglib2.0 and libnss3. On macOS and Windows, the setup is usually seamless.

If you already have Playwright installed, Crawl4ai will use your existing browser binaries. You can also point it at a specific browser installation using configuration options.

Basic Usage: Fetching a Page as Markdown

The simplest use case is fetching a page and getting clean markdown back. The main entry point is the AsyncWebCrawler class.

1
2
3
4
5
6
7
8
9
import asyncio
from crawl4ai import AsyncWebCrawler

async def fetch_page():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com")
        print(result.markdown)

asyncio.run(fetch_page())

The result object contains several properties:

  • result.markdown – clean markdown with boilerplate removed
  • result.html – the raw HTML after JavaScript execution
  • result.extracted_content – structured data if an extraction strategy is configured
  • result.links – all links found on the page
  • result.media – images and other media elements

The markdown output is not just a naive HTML-to-markdown conversion. Crawl4ai strips out navigation elements, cookie banners, sidebars, and other noise before converting. The result is close to what you would get if you manually copied the main content of a page.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
async def fetch_multiple_pages():
    urls = [
        "https://docs.python.org/3/tutorial/classes.html",
        "https://docs.python.org/3/tutorial/modules.html",
        "https://docs.python.org/3/tutorial/errors.html",
    ]

    async with AsyncWebCrawler() as crawler:
        for url in urls:
            result = await crawler.arun(url=url)
            print(f"--- {url} ---")
            print(f"Word count: {len(result.markdown.split())}")
            print(result.markdown[:200])
            print()

asyncio.run(fetch_multiple_pages())

For crawling many pages at once, arun_many processes a list of URLs with configurable concurrency.

1
2
3
4
5
6
7
8
9
10
11
12
13
async def batch_crawl():
    urls = [
        "https://example.com/page1",
        "https://example.com/page2",
        "https://example.com/page3",
    ]

    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many(urls=urls)
        for result in results:
            print(f"{result.url}: {len(result.markdown)} chars")

asyncio.run(batch_crawl())

LLM Extraction: Getting Structured Data

This is where Crawl4ai differentiates itself from other scraping frameworks. You define a schema describing the data you want, and Crawl4ai uses an LLM to extract it from the page content.

The extraction system uses LLMExtractionStrategy, which wraps calls to OpenAI, Anthropic, or any OpenAI-compatible API. For a deeper look at how LLMs handle HTML-to-data conversion, see our guide on LLM-powered structured data extraction.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
import asyncio
import json
from pydantic import BaseModel, Field
from typing import List, Optional
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy

class Product(BaseModel):
    name: str = Field(description="Product name")
    price: float = Field(description="Price in USD")
    rating: Optional[float] = Field(
        default=None, description="Rating out of 5"
    )
    description: str = Field(description="Short product description")

async def extract_products():
    strategy = LLMExtractionStrategy(
        provider="openai/gpt-4o",
        api_token="your-api-key",
        schema=Product.model_json_schema(),
        extraction_type="schema",
        instruction="Extract all product listings from this page.",
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com/products",
            extraction_strategy=strategy,
        )

        products = json.loads(result.extracted_content)
        for product in products:
            print(f"{product['name']}: ${product['price']}")

asyncio.run(extract_products())

The provider parameter follows the provider/model format. Common options include:

ProviderFormat
OpenAIopenai/gpt-4o
Anthropicanthropic/claude-sonnet-4-20250514
Ollama (local)ollama/llama3
Any OpenAI-compatibleopenai/model-name with custom api_base

Using Anthropic’s Claude for extraction looks like this:

1
2
3
4
5
6
7
strategy = LLMExtractionStrategy(
    provider="anthropic/claude-sonnet-4-20250514",
    api_token="your-anthropic-key",
    schema=Product.model_json_schema(),
    extraction_type="schema",
    instruction="Extract all product listings. Include prices and ratings.",
)

You can also use the instruction parameter without a schema for free-form extraction, where the LLM decides the structure based on your natural language description.

1
2
3
4
5
strategy = LLMExtractionStrategy(
    provider="openai/gpt-4o",
    api_token="your-api-key",
    instruction="Extract the main article title, author, publication date, and a one-sentence summary.",
)
AI is reshaping how we think about web data extraction.
AI is reshaping how we think about web data extraction. Photo by Pavel Danilyuk / Pexels

Chunking Strategies

When pages are large, sending the entire content to an LLM in one call is impractical – it burns tokens and can exceed context windows. Crawl4ai includes chunking strategies that split content into manageable pieces before extraction.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
from crawl4ai.chunking_strategy import RegexChunking, SlidingWindowChunking

# Split by headers or paragraphs
chunking = RegexChunking(patterns=[r"\n## ", r"\n\n"])

# Or use a sliding window with overlap
chunking = SlidingWindowChunking(
    window_size=2000,
    step=1500,  # 500-token overlap between chunks
)

strategy = LLMExtractionStrategy(
    provider="openai/gpt-4o",
    api_token="your-api-key",
    schema=Product.model_json_schema(),
    extraction_type="schema",
    instruction="Extract product data from this content.",
    chunking_strategy=chunking,
)

The sliding window approach with overlap ensures that items spanning chunk boundaries are not missed. Crawl4ai merges duplicate extractions from overlapping chunks automatically.

Configuration Options

Crawl4ai exposes configuration through CrawlerRunConfig for per-request settings and BrowserConfig for browser-level settings.

Browser Configuration

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
from crawl4ai import AsyncWebCrawler, BrowserConfig

browser_config = BrowserConfig(
    headless=True,
    browser_type="chromium",  # or "firefox", "webkit"
    proxy="http://user:pass@proxy.example.com:8080",
    headers={
        "Accept-Language": "en-US,en;q=0.9",
        "User-Agent": "Mozilla/5.0 (compatible; MyBot/1.0)",
    },
    viewport_width=1920,
    viewport_height=1080,
)

async with AsyncWebCrawler(config=browser_config) as crawler:
    result = await crawler.arun(url="https://example.com")

Run Configuration

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
from crawl4ai import CrawlerRunConfig, CacheMode

run_config = CrawlerRunConfig(
    # Wait for specific content before extracting
    wait_for="css:.product-list",

    # JavaScript to execute before extraction
    js_code="window.scrollTo(0, document.body.scrollHeight);",

    # Delay after JS execution
    delay_before_return_html=2.0,

    # Cache settings
    cache_mode=CacheMode.ENABLED,

    # Content filtering
    word_count_threshold=10,
    excluded_tags=["nav", "footer", "header"],

    # Screenshot capture
    screenshot=True,
)

async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(
        url="https://example.com",
        config=run_config,
    )

The wait_for parameter is critical for JavaScript-heavy sites. Without it, Crawl4ai might extract content before the page has finished rendering. You can wait for a CSS selector to appear, a specific text string to be present, or a JavaScript expression to evaluate to true.

1
2
3
4
5
6
7
8
# Wait for a CSS selector
config = CrawlerRunConfig(wait_for="css:.results-loaded")

# Wait for text to appear
config = CrawlerRunConfig(wait_for="text:Showing results")

# Wait for a JS condition
config = CrawlerRunConfig(wait_for="js:() => document.querySelectorAll('.item').length > 5")

Caching

Crawl4ai includes a built-in caching system that avoids re-fetching pages you have already crawled. This is useful during development (you do not want to hit the target site repeatedly while tweaking your extraction logic) and for production crawls where some pages rarely change.

1
2
3
4
5
6
7
8
9
10
from crawl4ai import CrawlerRunConfig, CacheMode

# Enable caching -- results are stored locally
config = CrawlerRunConfig(cache_mode=CacheMode.ENABLED)

# Force bypass cache for this run
config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)

# Read from cache only, do not make network requests
config = CrawlerRunConfig(cache_mode=CacheMode.READ_ONLY)

Cached results include the full HTML, markdown, and any extracted content. The cache is stored on disk, so it persists between runs.

Use Cases

Crawl4ai fits naturally into several workflows that have emerged as LLMs have become central to data processing.

Research Automation

Academic researchers and analysts use Crawl4ai to build corpora from web sources. The markdown output feeds directly into RAG pipelines, where a vector database indexes the content for retrieval during LLM conversations.

1
2
3
4
5
6
7
8
9
10
11
12
13
async def build_research_corpus(urls: list[str]):
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many(urls=urls)

        corpus = []
        for result in results:
            corpus.append({
                "url": result.url,
                "content": result.markdown,
                "links": result.links,
            })

        return corpus

Content Aggregation

News aggregators and content monitors use Crawl4ai to track changes across many sites. The LLM extraction layer normalizes data from sites with completely different layouts into a single consistent schema.

Dataset Building

Training data for machine learning models often comes from the web. Crawl4ai’s schema-driven extraction using Pydantic models produces clean, typed records that can go directly into a training pipeline without manual cleaning.

Machine learning adds intelligence to what was once a mechanical process.
Machine learning adds intelligence to what was once a mechanical process. Photo by Google DeepMind / Pexels

Limitations

Crawl4ai is not the right tool for every scraping job. Understanding its limitations helps you avoid frustrating dead ends.

Speed. Every page goes through a full browser render. Even with caching and concurrency, Crawl4ai is slower than HTTP-based scrapers like Scrapy or raw requests + BeautifulSoup. If your target pages are static HTML and you do not need JavaScript rendering, a lighter tool will be significantly faster.

LLM costs. Structured extraction with an LLM adds cost per page. Extracting data from 10,000 product pages using GPT-4o will cost real money. Local models via Ollama reduce the dollar cost but add latency and may reduce extraction accuracy.

Reliability at scale. Browser-based crawling is inherently less reliable than HTTP-based approaches, and these challenges are among the unsolved problems of AI web scraping in 2026. Browsers consume significant memory, can crash, and occasionally render pages differently between runs. For crawls exceeding tens of thousands of pages, you need to handle failures, retries, and resource management carefully.

Anti-bot detection. Crawl4ai uses standard Playwright, which is detectable by sophisticated anti-bot systems like Cloudflare Turnstile or DataDome. It does not include stealth patches or fingerprint randomization out of the box. For sites with aggressive bot detection, you may need to layer in additional tools or use a stealth browser like Camoufox or nodriver.

Crawl4ai vs Traditional Scraping

The decision between Crawl4ai and a traditional scraping stack depends on what you are building.

graph TD
    A["Do you need<br>JavaScript rendering?"] -->|No| B["Do you need<br>structured extraction?"]
    A -->|Yes| C["Do you need<br>LLM extraction?"]
    B -->|No| D["Use requests +<br>BeautifulSoup"]
    B -->|Yes| E["Use requests +<br>Pydantic manually"]
    C -->|No| F["Use Playwright<br>directly"]
    C -->|Yes| G["Use Crawl4ai"]

    style D fill:#ccffcc
    style E fill:#ccffcc
    style F fill:#ffffcc
    style G fill:#ccccff

Use Crawl4ai when:

  • You are building a RAG pipeline and need clean markdown from diverse sources
  • The sites you are scraping have varied layouts and you cannot write a single CSS selector strategy that covers all of them
  • You want structured extraction without writing and maintaining page-specific parsing code
  • You are comfortable with the per-page cost of LLM extraction

Use traditional scraping when:

  • You are scraping a single site with a known, stable layout
  • Speed is critical and you are scraping millions of pages
  • You need fine-grained control over the extraction logic
  • Budget constraints rule out per-page LLM costs

The hybrid approach also works well. Use Crawl4ai to prototype your extraction, figure out the patterns in the data, then write optimized custom scrapers for the sites you hit most frequently. Keep Crawl4ai for the long tail of sites where writing custom parsers is not worth the effort.

Crawl4ai vs Other AI Scraping Tools

Crawl4ai is not the only framework in this space. Here is how it compares to the main alternatives.

FeatureCrawl4aiFirecrawlScrapeGraphAI
Open sourceYesPartiallyYes
Self-hostedYesYes (Docker)Yes
Browser enginePlaywrightPlaywrightPlaywright
LLM extractionYesYesYes
Markdown outputYesYesLimited
CostFree + LLM costsFree tier + paidFree + LLM costs
Chunking strategiesBuilt-inBuilt-inManual
CachingBuilt-inServer-sideNo

Crawl4ai’s main advantage over Firecrawl is that it is fully open source and runs entirely on your infrastructure – no API keys to the scraping service itself, no rate limits beyond what the target site imposes. Compared to ScrapeGraphAI, Crawl4ai offers more mature caching, better chunking, and cleaner markdown output.

Putting It Together

Here is a complete example that combines browser configuration, wait conditions, and LLM extraction to scrape job listings from a hypothetical site.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
import asyncio
import json
from pydantic import BaseModel, Field
from typing import List, Optional
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.extraction_strategy import LLMExtractionStrategy

class JobListing(BaseModel):
    title: str = Field(description="Job title")
    company: str = Field(description="Company name")
    location: str = Field(description="Job location")
    salary_range: Optional[str] = Field(
        default=None, description="Salary range if listed"
    )
    requirements: List[str] = Field(
        description="Key requirements or qualifications"
    )

async def scrape_jobs():
    browser_config = BrowserConfig(
        headless=True,
        headers={"Accept-Language": "en-US,en;q=0.9"},
    )

    extraction_strategy = LLMExtractionStrategy(
        provider="openai/gpt-4o",
        api_token="your-api-key",
        schema=JobListing.model_json_schema(),
        extraction_type="schema",
        instruction="Extract all job listings from this page. Include salary if shown.",
    )

    run_config = CrawlerRunConfig(
        wait_for="css:.job-listing",
        delay_before_return_html=1.0,
        cache_mode=CacheMode.ENABLED,
        word_count_threshold=10,
        excluded_tags=["nav", "footer"],
        extraction_strategy=extraction_strategy,
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://jobs.example.com/search?q=python",
            config=run_config,
        )

        # Clean markdown of the page
        print("--- Page Content ---")
        print(result.markdown[:500])

        # Structured job data from LLM extraction
        print("\n--- Extracted Jobs ---")
        jobs = json.loads(result.extracted_content)
        for job in jobs:
            print(f"{job['title']} at {job['company']} ({job['location']})")
            if job.get('salary_range'):
                print(f"  Salary: {job['salary_range']}")
            print(f"  Requirements: {', '.join(job['requirements'][:3])}")
            print()

asyncio.run(scrape_jobs())

This example shows the three layers working together: Playwright renders the page with JavaScript, Crawl4ai converts it to clean markdown and strips boilerplate, and the LLM extracts structured job listings matching your Pydantic schema. The cache ensures that re-running the script during development does not hammer the target site.

Where Crawl4ai Fits in Your Stack

Crawl4ai is best understood as a bridge between raw web content and AI systems. It is not trying to replace Scrapy for high-volume production crawling or Playwright for complex browser automation workflows. Its value is in the conversion layer – taking the chaotic mess of a rendered web page and producing clean, structured output that an LLM can work with.

If you are building anything that needs to consume web content programmatically – a RAG system, a research tool, a content aggregator, a dataset pipeline – Crawl4ai is worth evaluating. The open-source model means you can inspect exactly what it does, run it on your own infrastructure, and modify it when it does not quite fit your needs. The LLM extraction layer is genuinely useful for sites where writing custom parsers is impractical, but remember that it adds cost and latency. Start with the markdown output, and only add LLM extraction when you actually need structured fields.

Contact Arman for Complex Problems
This post is licensed under CC BY 4.0 by the author.