
BeautifulSoup vs Scrapy vs Selenium: A Python Scraper's Decision Tree


Python’s three most popular scraping tools – BeautifulSoup, Scrapy, and Selenium – each fill a fundamentally different niche. BeautifulSoup is a parser. Scrapy is a framework. Selenium is a browser controller. Picking the wrong one does not just cost you time writing code; it costs you time rewriting code when you hit the tool’s ceiling. For an even broader comparison that includes Playwright and Puppeteer, see our Playwright vs Puppeteer vs Selenium vs Scrapy mega comparison. This post gives you a decision tree, code comparisons, and a feature matrix so you can pick the right tool before writing a single line.

BeautifulSoup: The Parser

BeautifulSoup (BS4) is not a scraper. It is an HTML and XML parser. It cannot fetch pages, execute JavaScript, or manage request sessions on its own. You pair it with an HTTP library like requests or httpx, hand it the raw HTML, and it builds a parse tree you can search.

This is its greatest strength and its greatest limitation. Because it does nothing beyond parsing, it is fast, lightweight, and simple. There are no event loops, no middleware stacks, no configuration files. You write a few lines of code and you have your data.

import requests
from bs4 import BeautifulSoup

response = requests.get("https://books.toscrape.com/")
soup = BeautifulSoup(response.text, "html.parser")

for article in soup.select("article.product_pod"):
    title = article.select_one("h3 a")["title"]
    price = article.select_one("p.price_color").get_text(strip=True)
    print(f"{title}: {price}")

That is the entire program. No imports from a framework, no class definitions, no settings files. For small jobs – pulling data from a handful of pages where the content is in the initial HTML – BS4 is the fastest path from idea to data.

What BS4 Does Well

  • Parsing HTML with CSS selectors (soup.select()) and the find API (soup.find(), soup.find_all())
  • Navigating the document tree (parent, children, siblings)
  • Handling malformed HTML gracefully
  • Working with multiple parser backends (html.parser, lxml, html5lib)
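A quick sketch of the find API and tree navigation together, run against a toy HTML snippet (the markup below is invented for illustration):

```python
from bs4 import BeautifulSoup

# Toy document, invented for illustration -- not from a real site.
html = """
<div class="post">
  <h2>First post</h2>
  <p class="meta">2024-01-01</p>
  <p class="body">Hello <a href="/more">world</a></p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# find API: locate an element by tag name and attributes
meta = soup.find("p", class_="meta")
print(meta.get_text(strip=True))      # the date string

# Tree navigation: move between related nodes
body = meta.find_next_sibling("p")    # the adjacent <p class="body">
link = body.a                         # first <a> descendant
print(link["href"], link.get_text())

# Walk upward from any node
print(link.parent.parent["class"])    # back to the enclosing div's classes
```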

What BS4 Does Not Do

  • Fetch pages (you need requests, httpx, or aiohttp)
  • Execute JavaScript
  • Handle concurrency, retries, or rate limiting
  • Manage crawl state or URL queues
  • Export data to structured formats

You build all of that yourself if you need it. For ten pages, that is fine. For ten thousand, you are reinventing Scrapy.

Scrapy: The Framework

Scrapy is a complete web crawling framework. It handles HTTP requests, response parsing, link following, data pipelines, retries, rate limiting, middleware, and output serialization. You define spiders – classes that describe how to navigate a site and what data to extract – and Scrapy handles everything else.

import scrapy


class BooksSpider(scrapy.Spider):
    name = "books"
    start_urls = ["https://books.toscrape.com/"]

    def parse(self, response):
        for article in response.css("article.product_pod"):
            yield {
                "title": article.css("h3 a::attr(title)").get(),
                "price": article.css("p.price_color::text").get(),
            }

        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

Run this with scrapy runspider books_spider.py -o books.json and Scrapy will crawl every page, follow pagination links, extract data, and write it to a JSON file. It handles concurrent requests, respects DOWNLOAD_DELAY settings, retries failed requests, and deduplicates URLs automatically.

Scrapy’s Architecture

Scrapy is built around an event-driven architecture with clearly separated components:

graph TD
    A[Spider] -->|Yields Requests| B[Scheduler]
    B -->|Dispatches| C[Downloader]
    C -->|Returns Response| A
    A -->|Yields Items| D[Item Pipeline]
    D -->|Processes & Stores| E[Output<br>JSON / CSV / DB]
    C <-->|Middleware| F[Downloader<br>Middleware]
    A <-->|Middleware| G[Spider<br>Middleware]

Each component is pluggable. You can write custom downloader middleware to rotate proxies, spider middleware to filter responses, and item pipelines to validate, deduplicate, or store data in a database.
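For example, a minimal item pipeline might normalize prices and drop duplicates. The `open_spider`/`process_item` hook names follow Scrapy's pipeline interface; the import fallback exists only so this sketch runs without Scrapy installed:

```python
# Sketch of a custom item pipeline: normalize price strings to floats
# and drop items with missing or duplicate titles.
try:
    from scrapy.exceptions import DropItem
except ImportError:  # fallback so the sketch runs without Scrapy installed
    class DropItem(Exception):
        pass


class PricePipeline:
    def open_spider(self, spider):
        self.seen_titles = set()

    def process_item(self, item, spider):
        title = item.get("title")
        if not title or title in self.seen_titles:
            raise DropItem(f"missing or duplicate title: {title!r}")
        self.seen_titles.add(title)
        # "£51.77" -> 51.77
        item["price"] = float(item["price"].lstrip("£$"))
        return item
```

In a real project the pipeline is enabled in settings.py, e.g. `ITEM_PIPELINES = {"myproject.pipelines.PricePipeline": 300}` (the module path is hypothetical).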

Key Scrapy Features

  • Asynchronous I/O: Built on Twisted, handles many concurrent requests without threading
  • Built-in selectors: CSS and XPath selectors on response objects
  • AutoThrottle: Adjusts crawl speed based on server response times
  • Retry middleware: Automatically retries failed requests with configurable policies
  • Feed exports: Output to JSON, CSV, XML, or custom backends
  • Duplicate filtering: Skips already-visited URLs by default
  • Robots.txt compliance: Respects robots.txt by default (configurable)
  • Stats collection: Tracks items scraped, requests made, errors, and timing

The Boilerplate Trade-off

For a simple five-page scrape, Scrapy’s structure feels heavy. You need a spider class, you deal with yield and callbacks, and the mental model of asynchronous request-response flow is more complex than a for loop with requests.get(). But that structure pays off the moment your crawl grows beyond a trivial size.

Selenium: The Browser Controller

Selenium launches a real web browser – Chrome, Firefox, Edge – and controls it programmatically. It renders pages exactly as a human user would see them: JavaScript executes, AJAX calls fire, single-page applications load their data, and dynamic content appears in the DOM.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = Options()
options.add_argument("--headless=new")
driver = webdriver.Chrome(options=options)

driver.get("https://books.toscrape.com/")

wait = WebDriverWait(driver, 10)
articles = wait.until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, "article.product_pod"))
)

for article in articles:
    title = article.find_element(By.CSS_SELECTOR, "h3 a").get_attribute("title")
    price = article.find_element(By.CSS_SELECTOR, "p.price_color").text
    print(f"{title}: {price}")

driver.quit()

Selenium is the heaviest option. It launches a browser process, allocates memory for rendering, and waits for page loads. But it is the only option when the data you need is generated by JavaScript after the initial page load.

When You Need Selenium

  • The page is a single-page application (React, Vue, Angular)
  • Data loads via AJAX calls triggered by user interaction
  • You need to click buttons, fill forms, or scroll to load content
  • The site uses client-side rendering with no server-side fallback
  • You need to interact with iframes, shadow DOM, or canvas elements

When You Do Not Need Selenium

If the data is in the initial HTML response, Selenium is overkill. A curl to the page URL will get you the same HTML that Selenium renders, but in milliseconds instead of seconds and without launching a browser process.
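A cheap way to find out before committing to a tool: fetch the raw HTML once and test your selector against it. The helper name below is a hypothetical convenience, demonstrated here on inline HTML rather than a live request:

```python
from bs4 import BeautifulSoup

def in_initial_html(html: str, css_selector: str) -> bool:
    """Return True if the selector matches the raw, un-rendered HTML."""
    return BeautifulSoup(html, "html.parser").select_one(css_selector) is not None

# In practice, fetch the page once without a browser, e.g.:
# html = requests.get("https://books.toscrape.com/", timeout=10).text

# Demonstration: a server-rendered page vs. a bare SPA shell
rendered = '<body><article class="product_pod">x</article></body>'
spa_shell = '<body><div id="root"></div><script src="app.js"></script></body>'
print(in_initial_html(rendered, "article.product_pod"))   # True
print(in_initial_html(spa_shell, "article.product_pod"))  # False
```

If the selector misses in the raw HTML but matches in your browser's DevTools, the content is JavaScript-rendered and you need a browser tool.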

The Decision Tree

Before writing any code, ask yourself three questions. The answers point to the right tool.

graph TD
    A[Does the site<br>require JavaScript<br>to load data?] -->|Yes| B[Do you need to<br>interact with the page?<br>Clicks, scrolls, forms]
    A -->|No| C[How many pages<br>do you need to scrape?]
    B -->|Yes| D[Selenium]
    B -->|No| E[Consider Scrapy +<br>scrapy-playwright<br>or Selenium]
    C -->|A few<br>under 100| F[BeautifulSoup<br>+ requests]
    C -->|Hundreds<br>or thousands| G[Scrapy]

The first branch is binary: does the content require a browser to render? If yes, you need a browser tool. If no, you pick between BS4 and Scrapy based on scale.

Parsing HTML doesn't always require a full browser. Photo by Stanislav Kondratiev / Pexels

Code Comparison: Same Task, Three Tools

To make the differences concrete, here is the same task implemented with all three tools: scrape article titles and links from a blog’s listing page, following pagination to get all articles.

BeautifulSoup + requests

import requests
from bs4 import BeautifulSoup

base_url = "https://example-blog.com"
url = f"{base_url}/articles"
articles = []

while url:
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    for item in soup.select("div.article-card"):
        title = item.select_one("h2 a").get_text(strip=True)
        link = item.select_one("h2 a")["href"]
        articles.append({"title": title, "link": base_url + link})

    next_link = soup.select_one("a.next-page")
    url = base_url + next_link["href"] if next_link else None

print(f"Scraped {len(articles)} articles")

Straightforward. Sequential. No concurrency. You handle pagination manually with a while loop. Error handling, retries, and rate limiting are your responsibility.
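If you stay with this approach, retries at least come cheap: requests can mount urllib3's Retry policy on a Session (the `allowed_methods` parameter assumes urllib3 1.26+). A sketch:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Retry transient failures with exponential backoff (0.5s, 1s, 2s, ...)
retry = Retry(
    total=3,
    backoff_factor=0.5,
    status_forcelist=[429, 500, 502, 503, 504],
    allowed_methods=["GET"],
)
session = requests.Session()
session.mount("https://", HTTPAdapter(max_retries=retry))
session.mount("http://", HTTPAdapter(max_retries=retry))

# Use session.get(...) wherever the plain script used requests.get(...):
# response = session.get(url, timeout=10)
```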

Scrapy

import scrapy


class ArticleSpider(scrapy.Spider):
    name = "articles"
    start_urls = ["https://example-blog.com/articles"]
    custom_settings = {
        "FEEDS": {"articles.json": {"format": "json"}},
        "DOWNLOAD_DELAY": 1,
        "CONCURRENT_REQUESTS": 4,
    }

    def parse(self, response):
        for card in response.css("div.article-card"):
            yield {
                "title": card.css("h2 a::text").get(default="").strip(),
                "link": response.urljoin(card.css("h2 a::attr(href)").get()),
            }

        next_page = response.css("a.next-page::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

More structure, but Scrapy handles concurrency, retries, URL joining, output serialization, and duplicate filtering. The spider defines what to scrape; the framework handles how.

Selenium

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import json

options = Options()
options.add_argument("--headless=new")
driver = webdriver.Chrome(options=options)

articles = []
driver.get("https://example-blog.com/articles")

while True:
    wait = WebDriverWait(driver, 10)
    cards = wait.until(
        EC.presence_of_all_elements_located((By.CSS_SELECTOR, "div.article-card"))
    )

    for card in cards:
        title_el = card.find_element(By.CSS_SELECTOR, "h2 a")
        articles.append({
            "title": title_el.text.strip(),
            "link": title_el.get_attribute("href"),
        })

    try:
        next_btn = driver.find_element(By.CSS_SELECTOR, "a.next-page")
        next_btn.click()
        wait.until(EC.staleness_of(cards[0]))
    except Exception:
        break

driver.quit()

with open("articles.json", "w") as f:
    json.dump(articles, f, indent=2)

print(f"Scraped {len(articles)} articles")

The most verbose version. You manage the browser lifecycle, wait for elements to appear, handle navigation by clicking, and detect when pagination ends with a try/except. But this is the only version that works if the blog is a JavaScript SPA.

Complexity Comparison

Lines of code is a crude metric, but it tells a clear story for small jobs:

| Aspect | BS4 + requests | Scrapy | Selenium |
| --- | --- | --- | --- |
| Lines for basic scrape | 10-15 | 15-25 | 25-40 |
| Dependencies | 2 (requests, bs4) | 1 (scrapy) | 2 (selenium, webdriver) |
| Setup overhead | pip install | pip install + project structure | pip install + browser driver |
| Learning curve | Low | Medium | Medium |
| Mental model | Sequential script | Async callbacks + yield | Browser interaction |

For a quick one-off scrape, BS4 wins on simplicity. You open a Python file, write a script, run it, and you are done. Scrapy’s class-based spiders and callback model require understanding the framework’s execution flow. Selenium requires understanding browser automation, waits, and element staleness.

Scalability

This is where the tools diverge sharply. Scaling means handling thousands or millions of pages with acceptable speed, resource usage, and reliability.

BS4 + requests

Out of the box, requests is synchronous. One request at a time. You can add concurrency with concurrent.futures or switch to httpx with asyncio:

import asyncio
import httpx
from bs4 import BeautifulSoup

async def scrape_page(client, url):
    response = await client.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    return soup.select_one("h1").get_text(strip=True)

async def main():
    urls = [f"https://example.com/page/{i}" for i in range(1, 101)]
    async with httpx.AsyncClient() as client:
        tasks = [scrape_page(client, url) for url in urls]
        results = await asyncio.gather(*tasks)
    for title in results:
        print(title)

asyncio.run(main())

This works, but you are building your own concurrency, error handling, and rate limiting from scratch. At a few hundred pages, it is manageable. At tens of thousands, you are writing a worse version of Scrapy.
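The usual first step is bounding the number of in-flight requests with an asyncio.Semaphore. The stdlib-only sketch below stubs out the actual fetch with a sleep so the shape is visible; the counters are instrumentation to show the bound holds:

```python
import asyncio

MAX_CONCURRENCY = 10
active = 0   # instrumentation only: fetches currently in flight
peak = 0     # instrumentation only: highest concurrency observed

async def fetch(url: str, sem: asyncio.Semaphore) -> str:
    global active, peak
    async with sem:                # at most MAX_CONCURRENCY in flight
        active += 1
        peak = max(peak, active)
        await asyncio.sleep(0.01)  # stand-in for client.get(url)
        active -= 1
        return url

async def main() -> list[str]:
    sem = asyncio.Semaphore(MAX_CONCURRENCY)
    urls = [f"https://example.com/page/{i}" for i in range(1, 101)]
    return await asyncio.gather(*(fetch(u, sem) for u in urls))

results = asyncio.run(main())
print(len(results))  # 100
```

That handles concurrency; retries, backoff, per-domain rate limits, and URL deduplication are still yours to build.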

Scrapy

Scrapy was designed for scale. Its Twisted-based engine handles concurrent requests natively. Configuration controls everything:

# settings.py
CONCURRENT_REQUESTS = 16
CONCURRENT_REQUESTS_PER_DOMAIN = 8
DOWNLOAD_DELAY = 0.5
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_TARGET_CONCURRENCY = 4.0
RETRY_TIMES = 3
RETRY_HTTP_CODES = [500, 502, 503, 504, 408]
HTTPCACHE_ENABLED = True

Scrapy also integrates with scrapy-redis for distributed crawling across multiple machines. For large-scale projects, Scrapy is the clear winner.
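A sketch of what that looks like in settings.py, assuming the scrapy-redis package is installed and a Redis instance is reachable at the URL shown:

```python
# settings.py -- scrapy-redis sketch: spiders on multiple machines share
# one Redis-backed request queue and duplicate filter
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER_PERSIST = True                 # keep the queue between runs
REDIS_URL = "redis://localhost:6379"     # assumed Redis location
```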

Selenium

Selenium scales poorly, and the speed gap between requests and Selenium is dramatic. Each browser instance consumes 200-500 MB of RAM. Running ten concurrent browsers requires 2-5 GB of memory. You can use Selenium Grid for distributed execution, but the resource cost per page is orders of magnitude higher than HTTP-only approaches.

from concurrent.futures import ThreadPoolExecutor
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options

def scrape_with_browser(url):
    options = Options()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        title = driver.find_element(By.TAG_NAME, "h1").text
        return title
    finally:
        driver.quit()

urls = [f"https://example.com/page/{i}" for i in range(1, 21)]
with ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(scrape_with_browser, urls))

Each call launches and destroys a browser. Even with pooling, the overhead is enormous compared to HTTP requests.
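If you must run Selenium concurrently, at least reuse one browser per worker thread instead of launching one per URL. The sketch below substitutes a stub factory for webdriver.Chrome so it runs anywhere; in real code, swap in the actual driver and quit it when each worker finishes:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Stub standing in for webdriver.Chrome(options=...) in this sketch
created = []
def make_driver():
    d = object()
    created.append(d)
    return d

_local = threading.local()

def get_driver():
    # One driver per worker thread, created lazily and then reused
    if not hasattr(_local, "driver"):
        _local.driver = make_driver()
    return _local.driver

def scrape(url: str) -> str:
    driver = get_driver()
    # Real code: driver.get(url), then extract data from the reused browser.
    # Remember to driver.quit() per thread once the pool shuts down.
    return url

urls = [f"https://example.com/page/{i}" for i in range(1, 21)]
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(scrape, urls))

print(len(results), "pages scraped with", len(created), "drivers")
```

With 4 workers and 20 URLs, at most 4 browsers are ever created instead of 20, which is the difference between 1-2 GB of RAM and 10 GB.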

Speed Benchmarks

Raw speed for scraping 100 static HTML pages (approximate, typical hardware):

| Tool | Time (100 pages) | Requests/sec |
| --- | --- | --- |
| BS4 + requests (sequential) | ~30s | ~3 |
| BS4 + httpx (async, 10 concurrent) | ~5s | ~20 |
| Scrapy (16 concurrent) | ~4s | ~25 |
| Selenium (headless, sequential) | ~200s | ~0.5 |
| Selenium (headless, 4 concurrent) | ~60s | ~1.7 |

The speed gap is not small. BS4 with async HTTP is roughly 40x faster than sequential Selenium. Scrapy is the fastest out of the box because concurrency and connection pooling are built in.

For JavaScript-rendered pages, the comparison is unfair – BS4 and Scrapy simply cannot access the data without a browser, so speed is irrelevant if the tool cannot do the job.

When to Combine Tools

The three tools are not mutually exclusive. Some of the most effective scraping setups combine them.

Selenium + BeautifulSoup

Use Selenium to render the page, then pass the rendered HTML to BS4 for parsing. Selenium’s element-finding API is functional but slower than BS4’s parsing for complex extraction:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup

options = Options()
options.add_argument("--headless=new")
driver = webdriver.Chrome(options=options)

driver.get("https://spa-example.com/dashboard")

# Wait for JS to render...
# Then hand off to BS4 for fast, flexible parsing
soup = BeautifulSoup(driver.page_source, "html.parser")

for row in soup.select("table.data tbody tr"):
    cells = row.select("td")
    print({
        "name": cells[0].get_text(strip=True),
        "value": cells[1].get_text(strip=True),
        "date": cells[2].get_text(strip=True),
    })

driver.quit()

This gives you the best of both worlds: Selenium handles JavaScript rendering, and BS4 handles the parsing with its richer API.

Scrapy + scrapy-playwright

For JavaScript-heavy sites at scale, scrapy-playwright integrates Playwright (a browser automation library that has largely replaced Selenium for new projects) into Scrapy’s framework. You get Scrapy’s crawling infrastructure with browser rendering where needed:

import scrapy


class JSSpider(scrapy.Spider):
    name = "js_spider"
    start_urls = ["https://spa-example.com/products"]

    custom_settings = {
        "DOWNLOAD_HANDLERS": {
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        "PLAYWRIGHT_BROWSER_TYPE": "chromium",
        "PLAYWRIGHT_LAUNCH_OPTIONS": {"headless": True},
    }

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url,
                meta={"playwright": True, "playwright_include_page": True},
            )

    async def parse(self, response):
        page = response.meta["playwright_page"]
        await page.wait_for_selector("div.product-card")

        for card in response.css("div.product-card"):
            yield {
                "name": card.css("h3::text").get(),
                "price": card.css("span.price::text").get(),
            }

        await page.close()

This approach gives you Scrapy’s scheduler, retries, and pipelines while using a real browser for rendering. It is more resource-intensive than plain Scrapy but far more manageable than raw Selenium at scale.

Scrapy for Crawling + BS4 for Post-Processing

Sometimes you want Scrapy to handle the crawl but prefer BS4’s API for complex HTML manipulation during post-processing:

import scrapy
from bs4 import BeautifulSoup


class HybridSpider(scrapy.Spider):
    name = "hybrid"
    start_urls = ["https://example.com/articles"]

    def parse(self, response):
        soup = BeautifulSoup(response.text, "html.parser")

        for article in soup.select("article"):
            # Use BS4's powerful navigation for complex extraction
            content = article.select_one("div.content")

            # Remove unwanted elements
            for ad in content.select("div.advertisement"):
                ad.decompose()

            yield {
                "title": article.select_one("h2").get_text(strip=True),
                "clean_html": str(content),
                "text": content.get_text(separator="\n", strip=True),
            }

Feature Comparison

| Feature | BS4 + requests | Scrapy | Selenium |
| --- | --- | --- | --- |
| HTML parsing | Yes (core feature) | Yes (CSS + XPath) | Yes (browser DOM) |
| JavaScript execution | No | No (yes with plugin) | Yes (core feature) |
| Async / concurrent | Manual (httpx/asyncio) | Built-in | Manual (threading) |
| Pagination handling | Manual loops | Built-in link following | Manual clicks |
| Retry logic | Manual | Built-in middleware | Manual |
| Rate limiting | Manual | Built-in AutoThrottle | Manual |
| Data export | Manual (json/csv) | Built-in feeds | Manual |
| Proxy support | Manual per request | Built-in middleware | Browser flags |
| Robots.txt | Manual | Built-in compliance | N/A |
| Cookie handling | requests.Session | Built-in | Browser-native |
| User interaction | No | No | Yes (clicks, typing) |
| Memory per page | ~1-5 MB | ~1-5 MB | ~200-500 MB |
| Best scale | 1 - 100 pages | 100 - 1M+ pages | 1 - 100 pages |
| Learning time | Hours | Days | Hours |

Full Decision Flowchart

Use this flowchart when starting a new scraping project:

graph TD
    A[New scraping<br>project] --> B{Is the data in<br>the initial HTML?}
    B -->|Yes| C{How many pages?}
    B -->|No, needs JS| D{Need user<br>interaction?}

    C -->|Under 100| E[BS4 + requests<br>Simple script]
    C -->|100 - 1000| F{Need retries,<br>rate limiting,<br>pipelines?}
    C -->|1000+| G[Scrapy<br>Full framework]

    F -->|No| H[BS4 + httpx<br>Async script]
    F -->|Yes| G

    D -->|Yes, clicks<br>forms, scrolls| I[Selenium<br>Browser control]
    D -->|No, just<br>rendering| J{How many pages?}

    J -->|Under 100| K[Selenium +<br>BS4 parsing]
    J -->|100+| L[Scrapy +<br>scrapy-playwright]

Choosing Quickly

If you want a one-line answer:

  • Just need data from a few pages: BS4 + requests
  • Crawling a large site with structure: Scrapy
  • Page needs a browser to work: Selenium
  • Browser-rendered pages at scale: Scrapy + scrapy-playwright
  • Complex HTML parsing after rendering: Selenium + BS4

The best scraping projects start with the simplest tool that works. Begin with BS4 and requests. If you hit a JavaScript wall, add Selenium or Playwright. If you need scale, reach for Scrapy. And if the page structure is unpredictable, consider LLM-powered structured data extraction to avoid writing brittle selectors. The decision tree is not about picking the most powerful tool – it is about picking the one that matches the problem with the least overhead.

This post is licensed under CC BY 4.0 by the author.