Building a Web Scraper with Regex: Practical Patterns and Pitfalls
Most web scraping tutorials reach for BeautifulSoup or lxml the moment HTML enters the picture. Those are excellent tools, but they are not always necessary. If you are scraping a predictable page, extracting a handful of fields from a known HTML structure, or working in an environment where you cannot install third-party parsing libraries, regular expressions can do the job. This post walks through building a complete web scraper in Python using only the requests library for fetching and the built-in re module for extraction. Along the way, we will build a reusable pattern library, handle common edge cases, and then look honestly at where regex scraping falls apart.
This is not a regex basics article. If you need a refresher on capture groups and quantifiers, check our earlier post on pattern matching fundamentals. Here we are building a working tool.
The Project
We will build a scraper that takes a URL, fetches the page, and extracts the following structured data using only regex:
- Page title
- All links with their anchor text
- Meta tags (description, keywords, Open Graph tags)
- Structured product-like data from repeating HTML blocks
The entire scraper will be a single Python script with no dependencies beyond requests and the standard library. By the end, you will have a reusable module you can adapt for your own projects.
Step 1: Fetching HTML with Python Requests
Before we extract anything, we need clean HTML to work with. The requests library handles fetching, redirects, and encoding detection for us.
import requests

def fetch_page(url, timeout=10):
    """Fetch a webpage and return its HTML as a string."""
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/131.0.0.0 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'en-US,en;q=0.5',
    }
    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()
    return response.text
A few things are worth noting here. We set a User-Agent header because some servers return different HTML or block requests that identify as Python. The response.text property returns the HTML as a decoded string, which matters because re requires the pattern and the input to be the same type: a str pattern cannot be run against bytes. If you use response.content instead, you get bytes, and your patterns will need b'...' byte literals. Stick with response.text unless you have a specific reason not to.
html = fetch_page('https://example.com')
print(f"Fetched {len(html)} characters")
That gives us a string we can run patterns against.
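To make the str-versus-bytes point concrete, here is a minimal sketch that uses hard-coded sample data instead of a live request. The names sample_text and sample_bytes are illustrative stand-ins for response.text and response.content; they are not part of the scraper.

```python
import re

# Illustrative stand-ins for response.text and response.content
sample_text = '<title>Example Domain</title>'
sample_bytes = sample_text.encode('utf-8')

# str pattern on str input: works and returns a str
str_result = re.search(r'<title>(.*?)</title>', sample_text).group(1)

# bytes pattern on bytes input: works, but the captured group is bytes
bytes_result = re.search(rb'<title>(.*?)</title>', sample_bytes).group(1)

# Mixing a str pattern with bytes input raises TypeError
try:
    re.search(r'<title>(.*?)</title>', sample_bytes)
    mixed_ok = True
except TypeError:
    mixed_ok = False

print(str_result, bytes_result, mixed_ok)
```

If you do end up working with bytes, remember that every captured group also comes back as bytes and needs decoding before you treat it as text.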
Step 2: Extracting the Page Title
The simplest extraction task: pull the text between <title> and </title>.
import re

def extract_title(html):
    """Extract the page title from HTML."""
    match = re.search(r'<title[^>]*>(.*?)</title>', html, re.IGNORECASE | re.DOTALL)
    if match:
        return match.group(1).strip()
    return None
Let us break down the pattern <title[^>]*>(.*?)</title>:
- <title matches the opening tag literally
- [^>]* handles any attributes on the title tag (rare, but possible)
- > closes the opening tag
- (.*?) is a non-greedy capture group that grabs everything inside the tag
- </title> matches the closing tag
- re.IGNORECASE handles <TITLE> or <Title> variations
- re.DOTALL lets . match newlines, since the title might span multiple lines
title = extract_title(html)
print(f"Title: {title}")
# Output: Title: Example Domain
The re.search function returns None if the pattern does not match, so always check before calling .group(). This is a pattern we will repeat throughout the scraper.
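One detail the pattern above does not handle: titles often contain HTML entities such as &amp; and stray line breaks inside the tag. A possible refinement, sketched here under the hypothetical name extract_title_clean, decodes entities with the standard library's html.unescape and collapses runs of whitespace. The sample markup is made up for illustration.

```python
import html as html_lib
import re

def extract_title_clean(page_html):
    """Hypothetical variant of extract_title: decode entities, collapse whitespace."""
    match = re.search(r'<title[^>]*>(.*?)</title>', page_html, re.IGNORECASE | re.DOTALL)
    if not match:
        return None
    text = html_lib.unescape(match.group(1))   # &amp; -> &, &#39; -> '
    return re.sub(r'\s+', ' ', text).strip()   # newlines and tabs -> single spaces

sample = "<html><head><TITLE>\n  Tom &amp; Jerry&#39;s\n  Shop </TITLE></head></html>"
print(extract_title_clean(sample))
```

The module is imported as html_lib only to avoid clashing with the html variable name used throughout this post.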

Step 3: Extracting All Links with Their Text
Links are more interesting because there are many on a page, and each has both a URL and anchor text.
def extract_links(html):
    """Extract all links with their href and anchor text."""
    pattern = r'<a\s+[^>]*href=["\']([^"\']+)["\'][^>]*>(.*?)</a>'
    matches = re.findall(pattern, html, re.IGNORECASE | re.DOTALL)
    links = []
    for href, text in matches:
        # Clean the anchor text by removing nested HTML tags
        clean_text = re.sub(r'<[^>]+>', '', text).strip()
        links.append({'href': href, 'text': clean_text})
    return links
The pattern <a\s+[^>]*href=["\']([^"\']+)["\'][^>]*>(.*?)</a> works like this:
- <a\s+ matches the opening <a tag followed by at least one whitespace character
- [^>]* skips over any attributes before href (like class, id)
- href=["\'] matches the href attribute with either single or double quotes
- ([^"\']+) captures the URL (everything that is not a quote)
- ["\'] closes the quote
- [^>]*> skips any remaining attributes and closes the opening tag
- (.*?) captures the anchor text non-greedily
- </a> matches the closing tag
There is a subtlety here: anchor text often contains nested HTML like <span>, <img>, or <strong> tags. The secondary re.sub call strips those out, leaving just the visible text. For a dedicated look at extracting email addresses from links and page content, see our post on email regex patterns for web scraping.
links = extract_links(html)
for link in links:
    print(f" {link['text']} -> {link['href']}")
One limitation: this pattern requires href to use quotes. Some HTML in the wild uses unquoted attribute values like href=https://example.com. If you need to handle that, extend the pattern with an alternation:
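# Handles quoted and unquoted href values.
# Group 1 holds a quoted URL, group 2 an unquoted one, group 3 the anchor text.
# The unquoted branch uses [^\s>]+ rather than \S+ so it cannot run past the tag's closing >.
pattern = r'<a\s+[^>]*href=(?:["\']([^"\']+)["\']|([^\s>]+))[^>]*>(.*?)</a>'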
For most modern websites, the quoted version is sufficient.
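If you do adopt an alternation like this, note that it adds a capture group, so re.findall returns three-element tuples instead of two. The sketch below shows one way to fold the quoted and unquoted groups back into a single href. The helper name extract_links_any_quotes and the sample markup are assumptions for illustration, and the unquoted branch here is written as [^\s>]+ so that it stops at the end of the tag.

```python
import re

# Assumed pattern: group 1 = quoted URL, group 2 = unquoted URL, group 3 = anchor text
LINK_PATTERN = re.compile(
    r'<a\s+[^>]*href=(?:["\']([^"\']+)["\']|([^\s>]+))[^>]*>(.*?)</a>',
    re.IGNORECASE | re.DOTALL,
)

def extract_links_any_quotes(page_html):
    """Hypothetical helper: return links whether or not href is quoted."""
    links = []
    for quoted, unquoted, text in LINK_PATTERN.findall(page_html):
        href = quoted or unquoted          # the group that did not match is ''
        clean_text = re.sub(r'<[^>]+>', '', text).strip()
        links.append({'href': href, 'text': clean_text})
    return links

sample = '<a href="/about">About</a><a class=nav href=https://example.com/x>Go <b>there</b></a>'
print(extract_links_any_quotes(sample))
```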
Step 4: Extracting Meta Tags
Meta tags carry structured metadata that is useful for scraping: descriptions, keywords, Open Graph tags for social media, and canonical URLs.
def extract_meta_tags(html):
    """Extract meta tags and return them as a dictionary."""
    meta = {}

    # Standard meta tags with name attribute
    pattern_name = r'<meta\s+[^>]*name=["\']([^"\']+)["\'][^>]*content=["\']([^"\']*)["\'][^>]*/?>'
    for name, content in re.findall(pattern_name, html, re.IGNORECASE):
        meta[name.lower()] = content

    # Handle reversed attribute order (content before name)
    pattern_reversed = r'<meta\s+[^>]*content=["\']([^"\']*)["\'][^>]*name=["\']([^"\']+)["\'][^>]*/?>'
    for content, name in re.findall(pattern_reversed, html, re.IGNORECASE):
        meta[name.lower()] = content

    # Open Graph meta tags (use property instead of name)
    pattern_og = r'<meta\s+[^>]*property=["\']([^"\']+)["\'][^>]*content=["\']([^"\']*)["\'][^>]*/?>'
    for prop, content in re.findall(pattern_og, html, re.IGNORECASE):
        meta[prop.lower()] = content

    return meta
This function uses three patterns because meta tags in real HTML vary in attribute order and use different attribute names. The HTML spec does not enforce any particular order for attributes, so <meta name="description" content="..."> and <meta content="..." name="description"> are both valid. You will encounter both in production scraping.
meta = extract_meta_tags(html)
for key, value in meta.items():
    print(f" {key}: {value}")
Notice that we need separate patterns for name vs property attributes. Open Graph tags use property="og:title" instead of name. This is exactly the kind of variation that makes regex scraping require careful pattern design.
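An alternative that avoids one pattern per attribute order is to split the work in two: match each <meta ...> tag as a whole, then parse its attributes into a dictionary with a second pattern. The following is only a sketch of that idea. The function name extract_meta_any_order and the sample markup are made up, and it still assumes quoted attribute values with no > inside them.

```python
import re

META_TAG = re.compile(r'<meta\b([^>]*)>', re.IGNORECASE)
# Attribute name, then a double-quoted or a single-quoted value
ATTR = re.compile(r'([\w:-]+)\s*=\s*(?:"([^"]*)"|\'([^\']*)\')')

def extract_meta_any_order(page_html):
    """Hypothetical alternative: parse each <meta> tag's attributes into a dict first."""
    meta = {}
    for attr_text in META_TAG.findall(page_html):
        attrs = {}
        for name, dq, sq in ATTR.findall(attr_text):
            attrs[name.lower()] = dq or sq
        # name= and property= are treated the same way
        key = attrs.get('name') or attrs.get('property')
        if key and 'content' in attrs:
            meta[key.lower()] = attrs['content']
    return meta

sample = '''
<meta name="description" content="A page about mice">
<meta content="mouse, wireless" name="keywords">
<meta property="og:title" content="Tom's Mouse Shop" />
<meta charset="utf-8">
'''
print(extract_meta_any_order(sample))
```

Because each quote style gets its own branch, an apostrophe inside a double-quoted value (as in the og:title line) is captured intact, which the ["\'] character-class approach cannot do.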

Step 5: Extracting Structured Data from Repeating Blocks
The most practical use case for regex scraping is pulling structured data from repeating HTML blocks: product listings, search results, table rows, article cards. Here we will parse a product listing page.
Suppose the HTML contains product cards with this structure:
<div class="product-card">
    <h2 class="product-name">Wireless Mouse</h2>
    <span class="price">$29.99</span>
    <p class="description">Ergonomic wireless mouse with USB receiver</p>
</div>
First, we extract each card as a block, then parse fields within each block:
def extract_products(html):
    """Extract product data from repeating HTML blocks."""
    products = []

    # Step 1: Extract each product card block
    card_pattern = r'<div\s+class=["\']product-card["\'][^>]*>(.*?)</div>'
    cards = re.findall(card_pattern, html, re.IGNORECASE | re.DOTALL)

    # Step 2: Extract fields from each card
    name_pattern = re.compile(r'<h2[^>]*class=["\']product-name["\'][^>]*>(.*?)</h2>', re.IGNORECASE | re.DOTALL)
    price_pattern = re.compile(r'<span[^>]*class=["\']price["\'][^>]*>\$?([\d,.]+)</span>', re.IGNORECASE | re.DOTALL)
    desc_pattern = re.compile(r'<p[^>]*class=["\']description["\'][^>]*>(.*?)</p>', re.IGNORECASE | re.DOTALL)

    for card_html in cards:
        product = {}
        name_match = name_pattern.search(card_html)
        product['name'] = name_match.group(1).strip() if name_match else None
        price_match = price_pattern.search(card_html)
        product['price'] = float(price_match.group(1).replace(',', '')) if price_match else None
        desc_match = desc_pattern.search(card_html)
        product['description'] = desc_match.group(1).strip() if desc_match else None
        products.append(product)

    return products
The two-phase approach (extract blocks, then parse fields) is important. If you try to match all fields in a single pattern across the entire page, you risk matching a price from one product with the name from another. Isolating each block first keeps the data aligned.
Notice that the field patterns are compiled with re.compile(). This is the entry point into building a reusable pattern library.
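To see why the two-phase approach matters, consider a page where one card has no price. The sketch below uses made-up markup and simplified patterns (double quotes only) to compare pairing page-wide match lists by position against searching inside each isolated card.

```python
import re

sample = '''
<div class="product-card">
  <h2 class="product-name">Wireless Mouse</h2>
  <p class="description">No price listed</p>
</div>
<div class="product-card">
  <h2 class="product-name">USB Keyboard</h2>
  <span class="price">$49.00</span>
</div>
'''

flags = re.IGNORECASE | re.DOTALL
name_re = re.compile(r'<h2[^>]*class="product-name"[^>]*>(.*?)</h2>', flags)
price_re = re.compile(r'<span[^>]*class="price"[^>]*>\$?([\d,.]+)</span>', flags)

# Page-wide approach: two independent lists, paired by position
page_wide = list(zip(name_re.findall(sample), price_re.findall(sample)))

# Two-phase approach: isolate each card, then search inside it
two_phase = []
for card in re.findall(r'<div\s+class="product-card"[^>]*>(.*?)</div>', sample, flags):
    name = name_re.search(card)
    price = price_re.search(card)
    two_phase.append((name.group(1) if name else None,
                      price.group(1) if price else None))

print(page_wide)   # the keyboard's price ends up attached to the mouse
print(two_phase)   # each card keeps its own fields; the missing price is None
```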
Building a Reusable Pattern Library
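When you scrape multiple pages from the same site, you end up writing the same patterns repeatedly. Compiling them into a library avoids that duplication and gives each pattern a name and a single home. There is also a small performance benefit, since re.compile() processes the pattern once up front, although the re module already caches recently used patterns, so the gain is usually modest.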
import re

class PatternLibrary:
    """Reusable compiled regex patterns for HTML extraction."""

    def __init__(self):
        flags = re.IGNORECASE | re.DOTALL

        # Structural patterns
        self.title = re.compile(r'<title[^>]*>(.*?)</title>', flags)
        self.links = re.compile(
            r'<a\s+[^>]*href=["\']([^"\']+)["\'][^>]*>(.*?)</a>', flags
        )
        # The lazy skip lives inside the optional alt group; a greedy [^>]*
        # in front of it would swallow the alt attribute before it could be
        # captured. alt is only captured when it appears after src.
        self.images = re.compile(
            r'<img\s+[^>]*src=["\']([^"\']+)["\'](?:[^>]*?alt=["\']([^"\']*)["\'])?[^>]*/?>',
            flags
        )

        # Meta patterns
        self.meta_name = re.compile(
            r'<meta\s+[^>]*name=["\']([^"\']+)["\'][^>]*content=["\']([^"\']*)["\'][^>]*/?>',
            flags
        )
        self.meta_property = re.compile(
            r'<meta\s+[^>]*property=["\']([^"\']+)["\'][^>]*content=["\']([^"\']*)["\'][^>]*/?>',
            flags
        )

        # Content cleaning
        self.html_tags = re.compile(r'<[^>]+>')
        self.whitespace = re.compile(r'\s+')

    def clean_text(self, text):
        """Remove HTML tags and normalize whitespace."""
        text = self.html_tags.sub(' ', text)
        text = self.whitespace.sub(' ', text)
        return text.strip()

    def extract_tag(self, html, tag, class_name=None):
        """Generic extractor for content within a tag, optionally filtered by class."""
        if class_name:
            pattern = re.compile(
                rf'<{tag}\s+[^>]*class=["\'][^"\']*{re.escape(class_name)}[^"\']*["\'][^>]*>(.*?)</{tag}>',
                re.IGNORECASE | re.DOTALL
            )
        else:
            pattern = re.compile(
                rf'<{tag}[^>]*>(.*?)</{tag}>',
                re.IGNORECASE | re.DOTALL
            )
        return [self.clean_text(m) for m in pattern.findall(html)]
The extract_tag method is particularly useful. Instead of writing a new pattern for every tag and class combination, you build patterns dynamically. The re.escape() call on the class name prevents special regex characters in class names from breaking the pattern.
patterns = PatternLibrary()
# Use compiled patterns
title_match = patterns.title.search(html)
all_links = patterns.links.findall(html)
headings = patterns.extract_tag(html, 'h2')
prices = patterns.extract_tag(html, 'span', class_name='price')
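The role of re.escape() is easiest to see with a class name that contains regex metacharacters. The sketch below uses a made-up utility-style class, text-[14px], and a local helper named build (both assumptions for illustration) to compare the escaped and unescaped versions of the same dynamic pattern.

```python
import re

snippet = '<span class="text-[14px] price">$29.99</span><span class="text-x">other</span>'
class_name = 'text-[14px]'   # illustrative class name containing [ and ]

def build(class_fragment):
    """Build the same kind of dynamic pattern extract_tag uses, for <span> only."""
    return re.compile(
        rf'<span\s+[^>]*class=["\'][^"\']*{class_fragment}[^"\']*["\'][^>]*>(.*?)</span>',
        re.IGNORECASE | re.DOTALL,
    )

# Unescaped: "[14px]" is read as a character class, so the pattern looks for
# "text-" followed by one of 1, 4, p, x and matches the wrong element
unescaped = build(class_name).findall(snippet)

# Escaped: the brackets are matched literally
escaped = build(re.escape(class_name)).findall(snippet)

print(unescaped)
print(escaped)
```

Without escaping, the pattern both misses the intended element and matches an unrelated one, and it does so without raising any error.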

Error Handling: When Patterns Do Not Match
Regex extraction fails silently. re.search returns None, re.findall returns an empty list. Your scraper needs to handle both without crashing.
def safe_extract(pattern, html, group=1, default=None):
    """Safely extract a single value from HTML using a regex pattern."""
    match = pattern.search(html) if hasattr(pattern, 'search') else re.search(pattern, html)
    if match:
        try:
            value = match.group(group)
        except IndexError:
            return default
        # An optional group that did not take part in the match is None
        return value.strip() if value is not None else default
    return default

def safe_extract_all(pattern, html):
    """Safely extract all matches, always returns a list."""
    if hasattr(pattern, 'findall'):
        return pattern.findall(html)
    return re.findall(pattern, html)
Use these wrappers everywhere in your scraper. They prevent AttributeError from calling .group() on None and ensure you always get a consistent return type.
title = safe_extract(patterns.title, html, default='No title found')
links = safe_extract_all(patterns.links, html)
print(f"Title: {title}")
print(f"Found {len(links)} links")
For more robust error handling, add logging to track which patterns fail on which pages. This is invaluable when a site updates its HTML and your patterns start returning empty results:
import logging

logger = logging.getLogger('regex_scraper')

def extract_with_logging(pattern, html, field_name, url=''):
    """Extract with logging for debugging pattern failures."""
    result = safe_extract(pattern, html)
    if result is None:
        logger.warning(f"Pattern failed for '{field_name}' on {url}")
    return result
The Pitfalls
Regex scraping works well in controlled scenarios, but it has sharp edges that will cut you if you are not careful. Every one of these has cost real developers real debugging time.
HTML Is Not a Regular Language
This is the fundamental limitation. Regular expressions can match regular languages, and HTML is not one. HTML has nested structures, and regex cannot reliably count nesting depth.
Consider this:
<div class="product-card">
    <div class="inner-wrapper">
        <h2>Product Name</h2>
    </div>
</div>
If your card extraction pattern uses (.*?)</div>, it will match up to the first </div>, which closes inner-wrapper, not product-card. You get a partial block.
# This breaks on nested divs
broken_pattern = r'<div class="product-card">(.*?)</div>'
# For single-level nesting, you can be more specific
# but this is fragile and does not generalize
workaround = r'<div class="product-card">\s*<div[^>]*>.*?</div>\s*</div>'
There is no general regex solution to the nesting problem. If the HTML you are scraping has predictable, shallow nesting, you can work around it. If the nesting varies or goes deep, use a parser.
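If you cannot use a parser and still need to survive nested divs, one workaround is to let regex find the tags and let ordinary code count the depth. This is not a pure regex solution, and it is only a sketch: the helper name extract_balanced_div is hypothetical, and it will be confused by div tags inside comments, scripts, or attribute values.

```python
import re

# Matches either an opening <div ...> or a closing </div>
DIV_TOKEN = re.compile(r'<div\b[^>]*>|</div\s*>', re.IGNORECASE)

def extract_balanced_div(page_html, open_tag_pattern):
    """Hypothetical helper: return the inner HTML of the first div matching
    open_tag_pattern, tracking nesting depth to find its real closing tag."""
    start = re.search(open_tag_pattern, page_html, re.IGNORECASE)
    if not start:
        return None
    depth = 1
    # Scan div tokens that appear after the opening tag
    for token in DIV_TOKEN.finditer(page_html, start.end()):
        if token.group().startswith('</'):
            depth -= 1
            if depth == 0:
                return page_html[start.end():token.start()]
        else:
            depth += 1
    return None  # unbalanced markup: no matching close found

sample = ('<div class="product-card"><div class="inner-wrapper">'
          '<h2>Product Name</h2></div><span class="price">$5</span></div><div>next</div>')
card_inner = extract_balanced_div(sample, r'<div\s+class="product-card"[^>]*>')
print(card_inner)
```

At this point you are writing a small tokenizer by hand, which is a reasonable signal that a parser may be the simpler option.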
Character Encoding Issues
The response.text from requests is usually correctly decoded, but not always. Some servers send incorrect encoding headers, or the HTML declares one encoding in a meta tag while the server header says another.
def fetch_page_safe(url, timeout=10):
    """Fetch with encoding fallback."""
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/131.0.0.0 Safari/537.36',
    }
    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()

    # Check if encoding was explicitly set by the server
    if response.encoding and response.encoding.lower() != 'iso-8859-1':
        return response.text

    # Try to detect encoding from HTML meta tag
    meta_match = re.search(
        rb'<meta[^>]+charset=["\']?([^"\'\s;>]+)',
        response.content,
        re.IGNORECASE
    )
    if meta_match:
        detected = meta_match.group(1).decode('ascii', errors='ignore')
        response.encoding = detected

    return response.text
Notice we use response.content (bytes) to look for the charset meta tag, then set the encoding before reading response.text. This avoids mojibake (garbled text) that would make your regex patterns match garbage.
Greedy Matching Eating Too Much
This is the most common regex scraping bug. The difference between (.*) and (.*?) determines whether your pattern captures too much content.
html_snippet = '<p>First paragraph</p><p>Second paragraph</p>'
# Greedy: captures everything between the first <p> and the LAST </p>
greedy = re.search(r'<p>(.*)</p>', html_snippet)
print(greedy.group(1))
# Output: First paragraph</p><p>Second paragraph
# Non-greedy: captures only up to the first </p>
lazy = re.search(r'<p>(.*?)</p>', html_snippet)
print(lazy.group(1))
# Output: First paragraph
Always use *?, +?, and ?? (non-greedy quantifiers) when extracting content between HTML tags. The greedy versions will reach across multiple tags and produce incorrect results.
A more insidious version of this problem appears with re.DOTALL:
# Without DOTALL, . does not match newlines, which naturally limits greediness
# With DOTALL, a greedy .* can consume the entire page
bad_pattern = re.search(r'<div class="target">(.*)</div>', huge_html, re.DOTALL)
# This might match from the first target div to the very last </div> on the page
good_pattern = re.search(r'<div class="target">(.*?)</div>', huge_html, re.DOTALL)
# Non-greedy stops at the first </div> after the target
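An alternative to a lazy quantifier is a negated character class, which cannot overrun by construction. The sketch below reuses the two-paragraph snippet from earlier in this section and shows both the benefit and the trade-off.

```python
import re

html_snippet = '<p>First paragraph</p><p>Second paragraph</p>'

# [^<]* cannot cross a "<", so it stops at the first tag boundary
bounded = re.findall(r'<p>([^<]*)</p>', html_snippet)
print(bounded)

# Trade-off: content that contains nested tags no longer matches at all
nested = re.findall(r'<p>([^<]*)</p>', '<p>Some <b>bold</b> text</p>')
print(nested)
```

This form suits fields that should hold plain text only, such as prices or IDs. For content that may contain inline markup, the lazy (.*?) version remains the practical choice.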
Brittle Patterns That Break When HTML Changes
Regex patterns encode assumptions about the HTML structure. When the site updates its layout, even minor changes can break your scraper.
# This pattern assumes the price is in a <span> with class "price"
price_v1 = re.compile(r'<span class="price">\$([\d.]+)</span>')
# Site update: they added a data attribute
# <span class="price" data-currency="USD">$29.99</span>
# price_v1 breaks because it expects > immediately after "price"
# More resilient version
price_v2 = re.compile(r'<span[^>]*class=["\'][^"\']*price[^"\']*["\'][^>]*>\$?([\d,.]+)</span>')
Tips for writing less brittle patterns:
- Use [^>]* to skip unknown attributes
- Match class names with [^"\']*classname[^"\']* to handle multiple classes
- Do not depend on whitespace formatting (use \s* between tags)
- Use ["\'] to handle both quote types
Even with these precautions, regex patterns will break more often than CSS selectors or XPath queries when a site changes its markup. Budget time for pattern maintenance.
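One inexpensive way to budget for that maintenance is to keep a handful of markup variants next to each pattern and check the pattern against all of them whenever you change it. The sketch below runs the two price patterns from above against hypothetical variants a redesign might introduce. The variant strings are made up for illustration.

```python
import re

price_v1 = re.compile(r'<span class="price">\$([\d.]+)</span>')
price_v2 = re.compile(
    r'<span[^>]*class=["\'][^"\']*price[^"\']*["\'][^>]*>\$?([\d,.]+)</span>'
)

# Hypothetical markup variants a site redesign might introduce
variants = [
    '<span class="price">$29.99</span>',
    '<span class="price" data-currency="USD">$29.99</span>',
    "<span id='p1' class='price sale'>$29.99</span>",
    '<span class="price">1,299.00</span>',
]

for markup in variants:
    v1 = price_v1.search(markup)
    v2 = price_v2.search(markup)
    print(bool(v1), v2.group(1) if v2 else None)

# Count how many variants each pattern survives
v1_hits = sum(1 for m in variants if price_v1.search(m))
v2_hits = sum(1 for m in variants if price_v2.search(m))
print(v1_hits, v2_hits)
```

The same list doubles as a regression fixture: when a real site change breaks a pattern, add the new markup to the list before fixing the pattern.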
Missing the DOTALL Flag for Multiline Content
HTML is routinely formatted across multiple lines. Without re.DOTALL, the . metacharacter does not match newline characters, and your patterns silently fail to match content that spans lines.
multiline_html = """
<div class="product-card">
<h2>Product Name</h2>
<span class="price">$19.99</span>
</div>
"""
# Without DOTALL: no match, because .* cannot cross the newline
no_dotall = re.search(r'<div class="product-card">(.*?)</div>', multiline_html)
print(no_dotall) # None
# With DOTALL: matches correctly
with_dotall = re.search(r'<div class="product-card">(.*?)</div>', multiline_html, re.DOTALL)
print(with_dotall.group(1))
# Output:
# <h2>Product Name</h2>
# <span class="price">$19.99</span>
Make re.DOTALL your default flag for any pattern that extracts content between tags. The only time to omit it is when you specifically want to restrict matching to a single line.
When Regex Works vs When to Use a Parser
Regex scraping is the right tool when:
- You need to extract a small number of specific fields from well-structured HTML
- You are working in an environment where you cannot install BeautifulSoup or lxml
- The HTML is generated by a template and has very consistent structure
- Performance matters and you want to avoid the overhead of building a DOM tree
- You are extracting data from non-HTML text that happens to be inside a webpage (embedded JSON, inline scripts, CSV data)
Switch to a proper parser when:
- The HTML has deeply nested structures you need to traverse
- You need to extract data based on relationships between elements (parent, sibling, child) – for example, automating form filling requires understanding label-input relationships
- The site’s HTML changes frequently and you need more resilient selectors
- You are scraping complex pages with hundreds of elements – at that scale, a browser automation framework is usually the better choice
- The extraction logic requires understanding the document tree (like “the third table in the second section”)
For extraction tasks that go beyond what regex and parsers handle well, LLM-powered data extraction and schema-driven scraping with structured output offer compelling alternatives. One hybrid approach that works well in practice: use regex to extract embedded JSON or script blocks, then parse the structured data normally.
import json

def extract_json_ld(html):
    """Extract JSON-LD structured data from script tags."""
    pattern = r'<script\s+type=["\']application/ld\+json["\'][^>]*>(.*?)</script>'
    matches = re.findall(pattern, html, re.IGNORECASE | re.DOTALL)
    results = []
    for match in matches:
        try:
            data = json.loads(match.strip())
            results.append(data)
        except json.JSONDecodeError:
            continue
    return results
This is arguably the strongest use case for regex in scraping. JSON-LD blocks are self-contained, predictable, and the regex just extracts the raw JSON string for json.loads to handle. Once you have that structured data, you can validate it against a schema using tools like Pydantic or Zod – an approach we explore in our post on schema-driven scraping with LLMs.
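As a usage sketch, here is the same idea applied to a made-up page fragment with one valid and one malformed JSON-LD block. The product values are illustrative only, and the pattern is loosened slightly (an assumption, not the version above) so that type does not have to be the first attribute on the script tag.

```python
import json
import re

JSON_LD = re.compile(
    r'<script\b[^>]*type=["\']application/ld\+json["\'][^>]*>(.*?)</script>',
    re.IGNORECASE | re.DOTALL,
)

# Made-up page fragment; the product values are illustrative only
sample = '''
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Product",
 "name": "Wireless Mouse", "offers": {"price": "29.99", "priceCurrency": "USD"}}
</script>
<script type="application/ld+json">{ not valid json }</script>
'''

blocks = []
for raw in JSON_LD.findall(sample):
    try:
        blocks.append(json.loads(raw))
    except json.JSONDecodeError:
        continue   # skip malformed blocks, as extract_json_ld does

# From here on it is plain dictionary access, no regex involved
products = [b for b in blocks if isinstance(b, dict) and b.get('@type') == 'Product']
for p in products:
    print(p['name'], p['offers']['price'], p['offers']['priceCurrency'])
```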
Complete Working Script
Here is the full scraper assembled into a single working script. It fetches a page and runs all of the extraction functions we built:
#!/usr/bin/env python3
"""
Regex Web Scraper - Extract structured data from HTML using only regex.
No BeautifulSoup, no lxml, just requests and re.
"""
import re
import json
import logging
import requests

logging.basicConfig(level=logging.INFO, format='%(levelname)s: %(message)s')
logger = logging.getLogger('regex_scraper')


class RegexScraper:
    """A web scraper that uses regex for all HTML extraction."""

    def __init__(self):
        flags = re.IGNORECASE | re.DOTALL

        # Compiled patterns
        self.p_title = re.compile(r'<title[^>]*>(.*?)</title>', flags)
        self.p_links = re.compile(
            r'<a\s+[^>]*href=["\']([^"\']+)["\'][^>]*>(.*?)</a>', flags
        )
        self.p_meta_name = re.compile(
            r'<meta\s+[^>]*name=["\']([^"\']+)["\'][^>]*content=["\']([^"\']*)["\'][^>]*/?>',
            flags
        )
        self.p_meta_name_rev = re.compile(
            r'<meta\s+[^>]*content=["\']([^"\']*)["\'][^>]*name=["\']([^"\']+)["\'][^>]*/?>',
            flags
        )
        self.p_meta_property = re.compile(
            r'<meta\s+[^>]*property=["\']([^"\']+)["\'][^>]*content=["\']([^"\']*)["\'][^>]*/?>',
            flags
        )
        self.p_json_ld = re.compile(
            r'<script\s+type=["\']application/ld\+json["\'][^>]*>(.*?)</script>',
            flags
        )
        self.p_tags = re.compile(r'<[^>]+>')
        self.p_whitespace = re.compile(r'\s+')

    def fetch(self, url, timeout=10):
        """Fetch a webpage and return its HTML."""
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                          'AppleWebKit/537.36 (KHTML, like Gecko) '
                          'Chrome/131.0.0.0 Safari/537.36',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Language': 'en-US,en;q=0.5',
        }
        response = requests.get(url, headers=headers, timeout=timeout)
        response.raise_for_status()

        # Encoding detection fallback
        if not response.encoding or response.encoding.lower() == 'iso-8859-1':
            meta_match = re.search(
                rb'<meta[^>]+charset=["\']?([^"\'\s;>]+)',
                response.content,
                re.IGNORECASE
            )
            if meta_match:
                response.encoding = meta_match.group(1).decode('ascii', errors='ignore')

        return response.text

    def clean_text(self, text):
        """Remove HTML tags and normalize whitespace."""
        text = self.p_tags.sub(' ', text)
        text = self.p_whitespace.sub(' ', text)
        return text.strip()

    def extract_title(self, html):
        """Extract the page title."""
        match = self.p_title.search(html)
        return match.group(1).strip() if match else None

    def extract_links(self, html):
        """Extract all links with their href and anchor text."""
        matches = self.p_links.findall(html)
        links = []
        for href, text in matches:
            clean_text = self.clean_text(text)
            if href and clean_text:
                links.append({'href': href, 'text': clean_text})
        return links

    def extract_meta(self, html):
        """Extract meta tags as a dictionary."""
        meta = {}
        for name, content in self.p_meta_name.findall(html):
            meta[name.lower()] = content
        for content, name in self.p_meta_name_rev.findall(html):
            if name.lower() not in meta:
                meta[name.lower()] = content
        for prop, content in self.p_meta_property.findall(html):
            meta[prop.lower()] = content
        return meta

    def extract_json_ld(self, html):
        """Extract JSON-LD structured data."""
        results = []
        for match in self.p_json_ld.findall(html):
            try:
                results.append(json.loads(match.strip()))
            except json.JSONDecodeError:
                continue
        return results

    def extract_by_tag(self, html, tag, class_name=None):
        """Extract content from specific tags, optionally filtered by class."""
        if class_name:
            pattern = re.compile(
                rf'<{tag}\s+[^>]*class=["\'][^"\']*{re.escape(class_name)}[^"\']*["\'][^>]*>(.*?)</{tag}>',
                re.IGNORECASE | re.DOTALL
            )
        else:
            pattern = re.compile(
                rf'<{tag}[^>]*>(.*?)</{tag}>',
                re.IGNORECASE | re.DOTALL
            )
        return [self.clean_text(m) for m in pattern.findall(html)]

    def scrape(self, url):
        """Scrape a URL and return all extracted data."""
        logger.info(f"Fetching {url}")
        html = self.fetch(url)
        logger.info(f"Fetched {len(html)} characters")
        data = {
            'url': url,
            'title': self.extract_title(html),
            'meta': self.extract_meta(html),
            'links': self.extract_links(html),
            'json_ld': self.extract_json_ld(html),
        }
        logger.info(f"Extracted: title={bool(data['title'])}, "
                    f"meta_tags={len(data['meta'])}, "
                    f"links={len(data['links'])}, "
                    f"json_ld={len(data['json_ld'])}")
        return data


def main():
    scraper = RegexScraper()

    # Scrape example.com
    result = scraper.scrape('https://example.com')

    print(f"\nTitle: {result['title']}")

    print(f"\nMeta tags ({len(result['meta'])}):")
    for key, value in result['meta'].items():
        print(f" {key}: {value}")

    print(f"\nLinks ({len(result['links'])}):")
    for link in result['links']:
        print(f" {link['text']} -> {link['href']}")

    if result['json_ld']:
        print(f"\nJSON-LD blocks ({len(result['json_ld'])}):")
        for block in result['json_ld']:
            print(f" {json.dumps(block, indent=2)[:200]}")

    # Export as JSON
    print(f"\n--- JSON Output ---")
    print(json.dumps(result, indent=2))


if __name__ == '__main__':
    main()
Save this as regex_scraper.py and run it with python regex_scraper.py. It will fetch https://example.com, extract all available data, and print both a human-readable summary and a JSON export.
Recommendations
If you are going to build scrapers with regex, keep these guidelines in mind:
- Always use non-greedy quantifiers (*?, +?) when extracting content between tags. Greedy matching is the source of most regex scraping bugs.
- Always include re.DOTALL in your flags for any pattern that extracts content from HTML. Real HTML spans multiple lines.
- Compile your patterns with re.compile() and store them in a class or module. You get better performance and a single place to update patterns when the HTML changes.
- Use the two-phase approach for repeating data: first extract the containing block, then parse fields within each block. Never try to match all fields in one pattern across the full page.
- Wrap every extraction in a safety check. re.search returns None, and calling .group() on None is the most common crash in regex scrapers.
- Know when to stop. If you find yourself writing patterns to handle three levels of nesting or attribute order permutations, you have crossed the line where a parser would be simpler and more reliable.
- Use regex for its strongest scraping use case: extracting embedded JSON, script blocks, and non-HTML structured data from within HTML pages. Let json.loads or csv.reader handle the parsing of the actual data.
Regex is not the best general-purpose HTML extraction tool, but it is a capable and sometimes ideal one when you understand its boundaries. The scraper we built here handles the common cases cleanly, and the pattern library approach keeps it maintainable as the sites you scrape evolve.

