CSS Selectors for Web Scraping: A Practical Cheat Sheet

CSS selectors are the most intuitive way to target elements when scraping web pages. Unlike XPath, which reads like a file system path, CSS selectors use the same syntax you already know from stylesheets. Every browser, every scraping library, and every automation framework supports them. Whether you are using BeautifulSoup in Python or Playwright in JavaScript, the selector string itself stays the same – only the API call around it changes. This post is a practical cheat sheet covering every CSS selector pattern you will actually use in scraping work, with real code examples and a reference table you can bookmark.

Basic Selectors

These are the four selectors you will use on almost every scraping project. They target elements by tag name, class, ID, or the presence of an attribute.

Tag Selector

Matches every element of the given tag type.

h1

Given this HTML:

<h1>Product Title</h1>
<h2>Description</h2>
<h1>Another Title</h1>

The selector h1 matches both <h1> elements.

Class Selector

Matches elements with a specific class. Prefix the class name with a dot.

.product-card

Given this HTML:

<div class="product-card">Laptop</div>
<div class="sidebar">Filters</div>
<div class="product-card">Phone</div>

The selector .product-card matches both product card divs but not the sidebar.

ID Selector

Matches a single element with a specific ID. Prefix with a hash.

#main-content

Given this HTML:

<div id="main-content">
  <p>Page content here</p>
</div>

The selector #main-content matches that one div. IDs should be unique on a page, so this returns at most one element.

Attribute Selector

Matches elements that have a specific attribute, regardless of its value.

[data-product-id]

Given this HTML:

<div data-product-id="101">Laptop</div>
<div data-product-id="102">Phone</div>
<div class="promo">Sale banner</div>

The selector [data-product-id] matches the first two divs.

Combinators

Combinators let you express relationships between elements. This is where CSS selectors start to get powerful for scraping, because you can scope your selections to specific parts of the page.

Descendant Combinator (space)

Matches elements nested anywhere inside an ancestor, at any depth.

.product-card .price

Given this HTML:

<div class="product-card">
  <h3>Laptop</h3>
  <div class="details">
    <span class="price">$999</span>
  </div>
</div>

The selector .product-card .price matches the price span even though it is nested two levels deep inside the product card. The space means “anywhere inside,” not just direct children.

Child Combinator (>)

Matches only direct children, not deeper descendants.

ul > li

Given this HTML:

<ul class="menu">
  <li>Home</li>
  <li>Products
    <ul>
      <li>Laptops</li>
      <li>Phones</li>
    </ul>
  </li>
</ul>

The selector ul.menu > li matches only “Home” and “Products,” not “Laptops” or “Phones.” Those are children of the nested <ul>, not direct children of ul.menu.

Adjacent Sibling Combinator (+)

Matches an element that immediately follows another element at the same level.

h2 + p

Given this HTML:

<h2>Description</h2>
<p>This laptop has a 15-inch screen.</p>
<p>It weighs 2.1 kg.</p>

The selector h2 + p matches only the first paragraph – the one directly after the <h2>. The second paragraph is not an immediate sibling of the heading.

General Sibling Combinator (~)

Matches all sibling elements that follow, not just the immediate one.

h2 ~ p

Using the same HTML above, h2 ~ p matches both paragraphs because they are both siblings that come after the <h2>.
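Both sibling combinators are easy to verify in a scraping library. A minimal BeautifulSoup sketch using the same markup (wrapped in a div so the elements are siblings):

```python
from bs4 import BeautifulSoup

html = """
<div>
  <h2>Description</h2>
  <p>This laptop has a 15-inch screen.</p>
  <p>It weighs 2.1 kg.</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

adjacent = soup.select("h2 + p")   # only the paragraph directly after the h2
following = soup.select("h2 ~ p")  # every paragraph that follows the h2

print(len(adjacent))   # 1
print(len(following))  # 2
```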

Attribute Selectors

Beyond just checking for the presence of an attribute, CSS lets you match attribute values with different comparison operators. These are extremely useful for scraping because many sites use data attributes, dynamic classes, and parameterized URLs.

Exact Match

[type="submit"]

Matches elements where the attribute value is exactly “submit.”

Starts With (^=)

a[href^="https://"]

Matches all links whose href begins with “https://”. This is useful for filtering external links from internal ones:

<a href="https://example.com">External</a>
<a href="/about">Internal</a>
<a href="https://cdn.example.com/image.jpg">CDN Link</a>

The selector a[href^="https://"] matches the first and third links.

Ends With ($=)

a[href$=".pdf"]

Matches all links pointing to PDF files:

<a href="/docs/report.pdf">Annual Report</a>
<a href="/docs/slides.pptx">Slides</a>
<a href="/docs/invoice.pdf">Invoice</a>

The selector a[href$=".pdf"] matches the first and third links.

Contains (*=)

[class*="price"]

Matches any element whose class attribute contains the substring “price” anywhere:

<span class="product-price">$29.99</span>
<span class="price-sale">$19.99</span>
<span class="quantity">3</span>

The selector [class*="price"] matches the first two spans. This is particularly useful when sites use naming conventions like price-original, price-discounted, sale-price, and you want to catch all of them.

Contains Word (~=)

[class~="featured"]

Matches elements where the attribute contains the word “featured” as a space-separated token. This is more precise than *= because [class*="featured"] would also match a class like unfeatured, while [class~="featured"] only matches when “featured” is a standalone class name.
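The difference between the two operators is easy to demonstrate. A quick BeautifulSoup sketch (the class names here are invented for illustration):

```python
from bs4 import BeautifulSoup

html = """
<div class="featured product">Laptop</div>
<div class="unfeatured product">Phone</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Substring match: also catches "unfeatured"
substring_matches = soup.select('[class*="featured"]')

# Word match: only the standalone "featured" token
word_matches = soup.select('[class~="featured"]')

print(len(substring_matches))  # 2
print(len(word_matches))       # 1
```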

Pseudo-Selectors for Scraping

CSS pseudo-selectors let you target elements based on their position within a parent. These are invaluable when scraping tables, lists, and repeated structures.

:first-child and :last-child

tr td:first-child

Matches the first <td> in every table row. Useful for extracting the first column from a table:

<table>
  <tr><td>Laptop</td><td>$999</td><td>In Stock</td></tr>
  <tr><td>Phone</td><td>$699</td><td>Out of Stock</td></tr>
</table>

The selector tr td:first-child returns “Laptop” and “Phone.” The selector tr td:last-child returns “In Stock” and “Out of Stock.”

:nth-child(n)

tr td:nth-child(2)

Matches the second <td> in each row. Using the table above, this returns “$999” and “$699” – the price column.

You can also use formulas:

  • tr:nth-child(2n) – every even row
  • tr:nth-child(2n+1) – every odd row
  • tr:nth-child(n+2) – all rows starting from the second (useful for skipping a header row)
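These formulas behave the same way in a scraping library as in a stylesheet. A small BeautifulSoup sketch with a four-row table:

```python
from bs4 import BeautifulSoup

html = """
<table>
  <tr><th>Name</th><th>Price</th></tr>
  <tr><td>Laptop</td><td>$999</td></tr>
  <tr><td>Phone</td><td>$699</td></tr>
  <tr><td>Tablet</td><td>$499</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# Every even row: positions 2 and 4
even_rows = soup.select("tr:nth-child(2n)")

# Every row from the second onward -- skips the header
data_rows = soup.select("tr:nth-child(n+2)")

print(len(even_rows))  # 2
print(len(data_rows))  # 3
```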

:not()

a:not(.nav-link)

Matches all links that do not have the class nav-link. This is a powerful filter when you want to exclude navigation links and focus on content links:

<a class="nav-link" href="/home">Home</a>
<a class="nav-link" href="/products">Products</a>
<a href="/products/laptop-123">Laptop Pro 15</a>
<a href="/products/phone-456">Phone Ultra</a>

The selector a:not(.nav-link) matches only the product links.

Combining Multiple Selectors

You can chain selectors to create precise targeting. Here are patterns that come up regularly in scraping:

/* Element with multiple classes */
div.product-card.featured

/* Element with tag, class, and attribute */
a.product-link[href^="/products/"]

/* Complex nesting */
#search-results .product-card:not(.ad) .price

That last selector reads: inside the element with ID search-results, find .product-card elements that do not have the class .ad, then grab the .price element inside each of those. This kind of precise targeting is what makes CSS selectors so effective for scraping.
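Here is that last selector in action with BeautifulSoup, on a stripped-down version of the markup it describes:

```python
from bs4 import BeautifulSoup

html = """
<div id="search-results">
  <div class="product-card"><span class="price">$999</span></div>
  <div class="product-card ad"><span class="price">$1</span></div>
  <div class="product-card"><span class="price">$699</span></div>
</div>
<span class="price">$0 cart total</span>
"""
soup = BeautifulSoup(html, "html.parser")

# Scoped to #search-results, ads excluded, cart total never considered
prices = [el.text for el in soup.select("#search-results .product-card:not(.ad) .price")]
print(prices)  # ['$999', '$699']
```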

CSS selectors are the bridge between what you see and what you can extract. (Photo by Bibek ghosh / Pexels)

Real Scraping Patterns

Here are practical selector patterns for common scraping scenarios.

Select All Product Cards

.product-card

Most e-commerce sites wrap each product in a container with a class like product-card, product-item, product-tile, or similar. Start by inspecting one product in DevTools and identifying this container class.

Select Price Inside a Product

.product-card .price

Scoping the price selector inside the product card ensures you get the price that belongs to each product, not a price element somewhere else on the page like a cart total.

Select Navigation Links

nav a[href]

This targets only anchor tags inside <nav> elements that have an href attribute. It filters out anchor tags used as buttons (no href) and links outside the navigation.

Select Data Attributes

[data-product-id]

Many modern sites store structured data in data-* attributes. These are more stable than class names and often contain the exact IDs, SKUs, or metadata you need:

<div class="x7g_t2 Jk99a" data-product-id="12345" data-sku="LP-PRO-15">
  Laptop Pro 15
</div>

The class names x7g_t2 Jk99a are auto-generated and will change on the next deploy. The data-product-id attribute is part of the application logic and far more stable.

Select by Partial Class Name

[class*="price"]

When a site uses CSS-in-JS or utility class frameworks, the actual class names may be obfuscated, but they often still contain recognizable substrings. The *= operator lets you match those substrings.

Select Rows After a Header

table.results tr:nth-child(n+2) td

Skips the first row (typically the header) and selects all <td> elements from the data rows.

Select External Links

a[href^="http"]:not([href*="example.com"])

Matches links that start with “http” but do not contain the site’s own domain. Useful for extracting outbound links from a page.
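A quick BeautifulSoup sketch of this pattern; example.com stands in for the site being scraped:

```python
from bs4 import BeautifulSoup

html = """
<a href="https://example.com/about">Internal</a>
<a href="https://other-site.org/post">Outbound</a>
<a href="/relative">Relative</a>
"""
soup = BeautifulSoup(html, "html.parser")

# Absolute URLs that do not mention the site's own domain
outbound = soup.select('a[href^="http"]:not([href*="example.com"])')
print([a["href"] for a in outbound])  # ['https://other-site.org/post']
```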

Python Examples with BeautifulSoup

BeautifulSoup supports CSS selectors through the select() and select_one() methods. The select() method returns a list of all matching elements, while select_one() returns the first match or None.

from bs4 import BeautifulSoup

html = """
<html>
<body>
  <nav>
    <a href="/">Home</a>
    <a href="/products">Products</a>
  </nav>
  <div id="search-results">
    <div class="product-card" data-product-id="101">
      <h3 class="product-title">Laptop Pro 15</h3>
      <span class="price">$999.00</span>
      <a href="/products/laptop-101">View Details</a>
    </div>
    <div class="product-card" data-product-id="102">
      <h3 class="product-title">Phone Ultra</h3>
      <span class="price">$699.00</span>
      <a href="/products/phone-102">View Details</a>
    </div>
    <div class="product-card ad" data-product-id="promo">
      <h3 class="product-title">Sponsored: Case Deluxe</h3>
      <span class="price">$29.00</span>
    </div>
  </div>
</body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

# Select all product cards
cards = soup.select(".product-card")
print(f"Found {len(cards)} product cards")
# Found 3 product cards

# Select the first product title
title = soup.select_one(".product-card .product-title")
print(title.text)
# Laptop Pro 15

# Select all prices (excluding ads)
prices = soup.select(".product-card:not(.ad) .price")
for price in prices:
    print(price.text)
# $999.00
# $699.00

# Extract data attributes
for card in soup.select("[data-product-id]"):
    product_id = card["data-product-id"]
    name = card.select_one(".product-title").text
    print(f"ID: {product_id}, Name: {name}")
# ID: 101, Name: Laptop Pro 15
# ID: 102, Name: Phone Ultra
# ID: promo, Name: Sponsored: Case Deluxe

# Select navigation links only
nav_links = soup.select("nav a[href]")
for link in nav_links:
    print(f"{link.text} -> {link['href']}")
# Home -> /
# Products -> /products

# Select product detail links using starts-with
detail_links = soup.select('a[href^="/products/"]')
for link in detail_links:
    print(link["href"])
# /products/laptop-101
# /products/phone-102

Extracting a Table with BeautifulSoup

from bs4 import BeautifulSoup

html = """
<table class="specs">
  <tr><th>Feature</th><th>Value</th></tr>
  <tr><td>Screen</td><td>15 inch</td></tr>
  <tr><td>RAM</td><td>16 GB</td></tr>
  <tr><td>Storage</td><td>512 GB SSD</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Skip the header row and get data rows
rows = soup.select("table.specs tr:nth-child(n+2)")
for row in rows:
    cells = row.select("td")
    feature = cells[0].text
    value = cells[1].text
    print(f"{feature}: {value}")
# Screen: 15 inch
# RAM: 16 GB
# Storage: 512 GB SSD

JavaScript Examples with Playwright

Playwright provides locator() for actionable queries and querySelector()/querySelectorAll() through page.evaluate() for direct DOM access. The locator() method is preferred because it includes auto-waiting and retry logic.

const { chromium } = require("playwright");

(async () => {
  const browser = await chromium.launch();
  const page = await browser.newPage();

  await page.setContent(`
    <div id="search-results">
      <div class="product-card" data-product-id="101">
        <h3 class="product-title">Laptop Pro 15</h3>
        <span class="price">$999.00</span>
        <a href="/products/laptop-101">View Details</a>
      </div>
      <div class="product-card" data-product-id="102">
        <h3 class="product-title">Phone Ultra</h3>
        <span class="price">$699.00</span>
        <a href="/products/phone-102">View Details</a>
      </div>
    </div>
  `);

  // Select all product cards using locator
  const cards = page.locator(".product-card");
  const count = await cards.count();
  console.log(`Found ${count} product cards`);
  // Found 2 product cards

  // Get text from each price element
  const prices = page.locator(".product-card .price");
  const allPrices = await prices.allTextContents();
  console.log(allPrices);
  // ['$999.00', '$699.00']

  // Extract data attributes with evaluate
  const productIds = await page.$$eval("[data-product-id]", (elements) =>
    elements.map((el) => ({
      id: el.dataset.productId,
      name: el.querySelector(".product-title").textContent,
    }))
  );
  console.log(productIds);
  // [{ id: '101', name: 'Laptop Pro 15' }, { id: '102', name: 'Phone Ultra' }]

  // Use locator().first() for a single element
  const firstTitle = await page
    .locator(".product-card .product-title")
    .first()
    .textContent();
  console.log(firstTitle);
  // Laptop Pro 15

  // Select links that start with /products/
  const detailLinks = page.locator('a[href^="/products/"]');
  const hrefs = await detailLinks.evaluateAll((links) =>
    links.map((a) => a.getAttribute("href"))
  );
  console.log(hrefs);
  // ['/products/laptop-101', '/products/phone-102']

  await browser.close();
})();

Playwright with Python

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com/products")

    # Select all product titles using locator
    titles = page.locator(".product-card .product-title").all_text_contents()
    print(titles)

    # Get a single element's text
    first_price = page.locator(".product-card .price").first.text_content()
    print(first_price)

    # Extract data attributes
    product_ids = page.locator("[data-product-id]").evaluate_all(
        "elements => elements.map(el => el.dataset.productId)"
    )
    print(product_ids)

    # Use nth to get a specific match
    second_card_title = page.locator(".product-card .product-title").nth(1).text_content()
    print(second_card_title)

    browser.close()

Cheat Sheet Reference Table

| Selector | What It Does | Example HTML | What It Matches |
|---|---|---|---|
| h1 | All <h1> elements | <h1>Title</h1> | The <h1> element |
| .price | Elements with class "price" | <span class="price">$10</span> | The span |
| #main | Element with ID "main" | <div id="main">...</div> | The div |
| [href] | Elements with an href attribute | <a href="/page">Link</a> | The anchor |
| [type="text"] | Attribute equals value | <input type="text"> | The input |
| [href^="/products"] | Attribute starts with value | <a href="/products/1"> | The anchor |
| [href$=".pdf"] | Attribute ends with value | <a href="/doc.pdf"> | The anchor |
| [class*="price"] | Attribute contains substring | <span class="sale-price"> | The span |
| .card .price | Descendant (any depth) | <div class="card"><p><span class="price"> | The span |
| ul > li | Direct child only | <ul><li>Item</li></ul> | The <li> |
| h2 + p | Immediately following sibling | <h2>...</h2><p>Text</p> | The <p> |
| h2 ~ p | Any following sibling | <h2>...</h2><div>...</div><p>Text</p> | The <p> |
| td:first-child | First child element | <tr><td>A</td><td>B</td></tr> | First <td> ("A") |
| td:last-child | Last child element | <tr><td>A</td><td>B</td></tr> | Last <td> ("B") |
| td:nth-child(2) | Nth child element | <tr><td>A</td><td>B</td><td>C</td></tr> | Second <td> ("B") |
| tr:nth-child(n+2) | All from 2nd onward | <table><tr>header</tr><tr>data</tr> | Data rows |
| a:not(.nav) | Elements not matching | <a class="nav"> <a class="link"> | Second anchor only |
| div.card.featured | Multiple classes | <div class="card featured"> | The div |
Selecting the right element is half the battle in web scraping. (Photo by Mikhail Nilov / Pexels)

How CSS Selectors Connect to the DOM

When you write a CSS selector, the browser (or parser) walks the DOM tree to find matching elements. Understanding this flow helps you write selectors that are both accurate and efficient.

graph TD
    A[Your CSS Selector<br>.product-card .price] --> B[Browser / Parser<br>walks the DOM tree]
    B --> C[document]
    C --> D[body]
    D --> E[div.product-card]
    D --> F[div.sidebar]
    E --> G[h3.product-title]
    E --> H["span.price -- MATCH"]
    F --> I[span.price]

    style H fill:#2d6,stroke:#333,color:#000
    style I fill:#d44,stroke:#333,color:#fff

The selector .product-card .price matches the price span inside the product card but not the one inside the sidebar, even though both have the class price. Scoping your selectors with a parent context is how you avoid picking up stray matches from other parts of the page.

Testing Selectors with Browser DevTools

Before writing any scraping code, test your selectors in the browser. This saves significant time and debugging effort.

Step 1: Open DevTools (F12 or right-click and Inspect).

Step 2: Open the Console tab.

Step 3: Use document.querySelectorAll() to test your selector:

// Count matches
document.querySelectorAll(".product-card").length

// Preview matched elements
document.querySelectorAll(".product-card .price")

// Extract text from matches
[...document.querySelectorAll(".product-card .price")].map(el => el.textContent)

Step 4: Use the Elements panel shortcut. Right-click any element and select “Copy > Copy selector” to get a CSS selector for that specific element. This gives you a starting point, but the auto-generated selector is usually overly specific and needs simplification.

Another useful technique is the $$() shortcut available in most browser consoles:

// Shorthand for document.querySelectorAll()
$$(".product-card .price")

// Quick check if your selector finds anything
$$("[data-product-id]").length

Common Mistakes

Overly Specific Selectors

The “Copy selector” feature in DevTools generates paths like this:

#root > div > div:nth-child(2) > div > div:nth-child(1) > div > span

This selector is brittle. If the site adds a wrapper div or reorders elements, it breaks immediately. Instead, identify the meaningful class or attribute and use that:

.product-card .price

Relying on Auto-Generated Class Names

Sites using CSS-in-JS frameworks (styled-components, CSS Modules, Tailwind with JIT) produce class names like _3xK9d, css-1a2b3c, or sc-bdVTJa. These change every time the site rebuilds. Instead of targeting these classes directly, look for:

  • Data attributes like [data-testid="product-price"] or [data-product-id]
  • Semantic HTML tags like <article>, <nav>, <main>, <header>
  • Attribute substrings like [class*="price"] when the obfuscated class still contains a readable keyword
  • Structural selectors like article > h2 that rely on tag hierarchy rather than class names
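A short sketch of the first alternative in BeautifulSoup; the obfuscated class and the data-testid value are invented for illustration:

```python
from bs4 import BeautifulSoup

# Obfuscated class name, stable data attribute -- a common CSS-in-JS pattern
html = '<span class="css-1a2b3c" data-testid="product-price">$49.99</span>'
soup = BeautifulSoup(html, "html.parser")

# Brittle: this class name changes on the next deploy
by_class = soup.select_one(".css-1a2b3c")

# Stable: the data attribute is part of the application logic
by_testid = soup.select_one('[data-testid="product-price"]')

print(by_testid.text)  # $49.99
```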

Forgetting That select() Returns a List

In BeautifulSoup, select() always returns a list, even if there is only one match:

# Wrong -- this is a list, not an element
price = soup.select(".price")
print(price.text)  # AttributeError

# Correct -- use select_one() for a single element
price = soup.select_one(".price")
print(price.text)

# Or index into the list
price = soup.select(".price")[0]
print(price.text)

Not Handling Missing Elements

A selector might return None (in select_one()) or an empty list (in select()). Always check:

price_element = soup.select_one(".product-card .price")
if price_element:
    price = price_element.text.strip()
else:
    price = "N/A"

In Playwright, using locator() with assertions or count() prevents errors on missing elements:

const priceLocator = page.locator(".product-card .price");
if ((await priceLocator.count()) > 0) {
  const price = await priceLocator.first().textContent();
  console.log(price);
}

Selector Strategy Decision Flow

When you are looking at a page and trying to decide which selector to use, follow this flow:

graph TD
    A[Inspect the target element] --> B{Has a unique ID?}
    B -->|Yes| C["Use #id"]
    B -->|No| D{Has a stable class name?}
    D -->|Yes| E["Use .class-name"]
    D -->|No| F{Has a data attribute?}
    F -->|Yes| G["Use [data-attr]"]
    F -->|No| H{Class contains<br>readable keyword?}
    H -->|Yes| I["Use [class*=keyword]"]
    H -->|No| J{Inside a semantic<br>tag like nav or article?}
    J -->|Yes| K[Use tag-based selector<br>like article > h2]
    J -->|No| L[Use structural selector<br>with :nth-child]

Start from the top and use the first strategy that gives you a reliable, stable selector. IDs and data attributes are the most resilient to site changes. Class names are next, as long as they are human-readable and not auto-generated. Structural selectors are the last resort because they depend on the exact DOM layout.
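This flow can be sketched as a small helper that tries candidate selectors in order of resilience and returns the first that matches. The selectors in the list are hypothetical examples, not taken from any real page:

```python
from bs4 import BeautifulSoup

def select_with_fallback(soup, selectors):
    """Try each selector in order; return the first selector that matches and its results."""
    for selector in selectors:
        matches = soup.select(selector)
        if matches:
            return selector, matches
    return None, []

# Page with an obfuscated class but a semantic article > h2 structure
html = '<article><h2 class="xK9_2">Laptop Pro 15</h2></article>'
soup = BeautifulSoup(html, "html.parser")

# Ordered from most to least resilient, per the flow above
used, matches = select_with_fallback(soup, [
    "#product-title",           # unique ID -- not present here
    ".product-title",           # stable class -- not present
    '[data-testid="title"]',    # data attribute -- not present
    "article > h2",             # semantic structure -- matches
])
print(used)             # article > h2
print(matches[0].text)  # Laptop Pro 15
```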

Putting It All Together

Here is a complete scraping script that uses multiple selector strategies to extract product data:

import requests
from bs4 import BeautifulSoup

response = requests.get("https://example.com/products")
soup = BeautifulSoup(response.text, "html.parser")

products = []

# Select all real product cards, excluding ads
for card in soup.select(".product-card:not(.ad):not(.sponsored)"):
    product = {}

    # Use data attribute for ID
    product["id"] = card.get("data-product-id", "unknown")

    # Use class selector for title
    title_el = card.select_one(".product-title")
    product["title"] = title_el.text.strip() if title_el else "N/A"

    # Use attribute substring match for price (handles price, sale-price, etc.)
    price_el = card.select_one("[class*='price']")
    product["price"] = price_el.text.strip() if price_el else "N/A"

    # Use starts-with for product detail links
    link_el = card.select_one("a[href^='/products/']")
    product["url"] = link_el["href"] if link_el else None

    # Use nth-child for spec table inside the card
    specs = {}
    for row in card.select("table.specs tr:nth-child(n+2)"):
        cells = row.select("td")
        if len(cells) == 2:
            specs[cells[0].text.strip()] = cells[1].text.strip()
    product["specs"] = specs

    products.append(product)

for p in products:
    print(f"{p['id']}: {p['title']} - {p['price']}")

CSS selectors cover the vast majority of element targeting needs in web scraping. They are readable, well-supported, and fast. For the rare cases where you need to traverse upward in the DOM (selecting a parent based on a child), XPath is the better tool. But for everything else – extracting products, prices, links, tables, and structured data – CSS selectors are the right default choice.

This post is licensed under CC BY 4.0 by the author.