Post

Regex Lookahead in Web Scrapers: Advanced Pattern Matching

Regex Lookahead in Web Scrapers: Advanced Pattern Matching

Lookaheads and lookbehinds are one of the most underused features in regex, especially in the context of web scraping. They let you match patterns based on what comes before or after a position in the text – without actually including that surrounding context in the match result. This distinction matters when you are extracting data from web pages, because you often want the value but not the label, or the price but not the currency symbol, or the text between two markers but not the markers themselves. Once you understand how lookarounds work, you will find yourself reaching for them constantly. If you are new to using regex for scraping, our introduction to regex for web scraping covers the fundamentals before you dive into the advanced techniques here.

This post covers all four types of lookaround assertions, explains where they are supported, walks through real scraping examples in Python, and addresses the performance considerations you should keep in mind.

What Is a Lookahead

A lookahead is a zero-width assertion. It checks whether a pattern exists at a certain position in the string, but it does not consume any characters. The regex engine looks ahead from its current position, checks the condition, and if the condition is met, continues matching – but the cursor does not move forward.

The syntax is (?=pattern) for a positive lookahead. It means: “the current position must be followed by this pattern.”

Here is a simple example:

1
2
3
4
5
6
7
8
import re

text = "100px 200em 300px 50vh"

# Match numbers only when followed by "px"
matches = re.findall(r'\d+(?=px)', text)
print(matches)
# ['100', '300']

The (?=px) asserts that px must follow the digits, but px itself is not part of the match. The result is just the numbers. Without the lookahead, you would have to match px and then strip it from the result manually.

Positive Lookahead: Match X Only if Followed by Y

The positive lookahead (?=...) is the most common lookaround you will use in scraping. It answers the question: “does this pattern appear after my match?”

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
import re

html = """
<span class="price">$49.99</span>
<span class="price">EUR 129.00</span>
<span class="price">$89.50</span>
<span class="weight">12.5kg</span>
"""

# Extract numbers only when followed by "kg"
weights = re.findall(r'[\d.]+(?=kg)', html)
print(weights)
# ['12.5']

# Extract numbers only when followed by a closing </span>
span_numbers = re.findall(r'[\d.]+(?=</span>)', html)
print(span_numbers)
# ['49.99', '129.00', '89.50', '12.5']

The key insight is that the lookahead acts as a filter. The regex engine finds all occurrences of [\d.]+, but only keeps the ones where the lookahead condition is satisfied.

Extracting Prices Before a Currency Symbol

A common scraping task is pulling numeric values that appear before a currency indicator:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
import re

product_text = """
Widget A: 29.99 USD
Widget B: 45.00 EUR
Widget C: 12.50 (no currency)
Widget D: 199.99 USD
Dimensions: 10.5 x 20.3 cm
"""

# Match prices only when followed by a space and a currency code
prices = re.findall(r'[\d.]+(?=\s(?:USD|EUR|GBP))', product_text)
print(prices)
# ['29.99', '45.00', '199.99']

The dimension values 10.5 and 20.3 are excluded because they are not followed by a currency code. The value 12.50 is also excluded because it lacks a matching currency suffix.

Negative Lookahead: Match X Only if NOT Followed by Y

The negative lookahead (?!...) is the inverse. It asserts that a pattern does not follow the current position. This is invaluable for excluding unwanted matches.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
import re

text = """
report_2026.pdf
photo_beach.jpg
data_export.csv
banner_ad.png
quarterly_results.pdf
icon_small.gif
user_data.json
"""

# Match filenames but NOT image files
# Capture the filename before the dot, only if the extension is not an image format
non_image_files = re.findall(
    r'\b[\w]+\.(?!jpg|png|gif|svg|webp)\w+',
    text
)
print(non_image_files)
# ['report_2026.pdf', 'data_export.csv', 'quarterly_results.pdf', 'user_data.json']

Filtering Out Unwanted Patterns in Scraped Data

Negative lookaheads are particularly useful when you are scraping pages that mix real content with noise:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
import re

scraped_links = """
https://shop.example.com/product/laptop
https://shop.example.com/cart
https://shop.example.com/product/keyboard
https://shop.example.com/login
https://shop.example.com/product/monitor
https://shop.example.com/checkout
https://shop.example.com/product/mouse
"""

# Match product URLs but skip cart, login, and checkout
product_urls = re.findall(
    r'https://shop\.example\.com/(?!cart|login|checkout)\S+',
    scraped_links
)
print(product_urls)
# ['https://shop.example.com/product/laptop',
#  'https://shop.example.com/product/keyboard',
#  'https://shop.example.com/product/monitor',
#  'https://shop.example.com/product/mouse']

Positive Lookbehind: Match X Only if Preceded by Y

The positive lookbehind (?<=...) checks what comes before the current position. It uses the syntax (?<=pattern).

1
2
3
4
5
6
7
8
9
10
11
12
13
import re

html = """
<span class="price">$49.99</span>
<span class="price">$129.00</span>
<span class="old-price">$199.99</span>
<span class="rating">4.5</span>
"""

# Extract numbers that come after a dollar sign
dollar_amounts = re.findall(r'(?<=\$)[\d.]+', html)
print(dollar_amounts)
# ['49.99', '129.00', '199.99']

The $ sign is not included in the match. You get clean numeric values ready for conversion to float.

Extracting Values After Labels

Web pages often present data as label-value pairs. Lookbehinds let you grab just the value:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
import re

product_page = """
Brand: Acme Corp
Model: X-500
Weight: 2.3kg
Color: Matte Black
SKU: ACM-X500-BLK
Price: $149.99
"""

# Extract the value after "Brand: "
brand = re.search(r'(?<=Brand:\s).+', product_page)
if brand:
    print(brand.group())
    # 'Acme Corp'

# Extract the value after "SKU: "
sku = re.search(r'(?<=SKU:\s)\S+', product_page)
if sku:
    print(sku.group())
    # 'ACM-X500-BLK'

# Extract weight number after "Weight: "
weight = re.search(r'(?<=Weight:\s)[\d.]+', product_page)
if weight:
    print(weight.group())
    # '2.3'

Negative Lookbehind: Match X Only if NOT Preceded by Y

The negative lookbehind (?<!...) asserts that a specific pattern does not precede the current position.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
import re

text = """
Contact us at support@example.com
Logo: header_logo@2x.png
Sales: sales@example.com
Icon: icon_email@3x.jpg
Info: info@example.org
Avatar: user_avatar@1x.webp
"""

# Extract email addresses but not image file references
# Emails should not be preceded by an underscore or alphanumeric
# (which would indicate filename@resolution patterns)
# Instead, match addresses not followed by image extensions
emails = re.findall(
    r'\b[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.(?!png|jpg|gif|webp|svg)[a-zA-Z]{2,}',
    text
)
print(emails)
# ['support@example.com', 'sales@example.com', 'info@example.org']

Here is another practical use – extracting standalone numbers while ignoring version strings:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
import re

log_text = """
Processed 150 records
Version v2.5.1 installed
Exported 3200 rows
Running Python 3.11.4
Found 42 errors
"""

# Match numbers not preceded by 'v' or '.'
standalone_numbers = re.findall(r'(?<!v)(?<!\.)(?<!\d)\b\d+\b(?!\.)', log_text)
print(standalone_numbers)
# ['150', '3200', '42']

Multiple lookbehinds can be chained to exclude several preceding patterns simultaneously. For a deeper look at extracting email addresses specifically, including the edge cases that trip up most patterns, see our guide on email regex patterns for web scraping.

Patterns are everywhere — regex helps you find the ones that matter.
Patterns are everywhere — regex helps you find the ones that matter. Photo by Memet Öz / Pexels

Combining Lookaheads and Lookbehinds

The real power emerges when you combine lookaheads and lookbehinds in a single pattern. This lets you extract data between markers without including the markers.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
import re

html = """
<div class="product-name">Wireless Keyboard</div>
<div class="product-price">$79.99</div>
<div class="product-name">USB Mouse</div>
<div class="product-price">$24.99</div>
<div class="product-name">Monitor Stand</div>
<div class="product-price">$45.00</div>
"""

# Extract product names: text between <div class="product-name"> and </div>
names = re.findall(
    r'(?<=<div class="product-name">).+?(?=</div>)',
    html
)
print(names)
# ['Wireless Keyboard', 'USB Mouse', 'Monitor Stand']

# Extract prices without the dollar sign
prices = re.findall(
    r'(?<=<div class="product-price">\$)[\d.]+(?=</div>)',
    html
)
print(prices)
# ['79.99', '24.99', '45.00']

Both the opening tag and closing tag are excluded from the match. You get clean data without any post-processing.

Extracting Data Between Specific Markers

A frequent scraping scenario is extracting content that sits between two known strings:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
import re

api_response = """
BEGIN_DATA
{"name": "Alice", "score": 95}
{"name": "Bob", "score": 82}
{"name": "Carol", "score": 91}
END_DATA

BEGIN_METADATA
timestamp: 2026-02-16T10:30:00Z
source: api-v3
END_METADATA
"""

# Extract everything between BEGIN_DATA and END_DATA
data_block = re.search(
    r'(?<=BEGIN_DATA\n).+?(?=\nEND_DATA)',
    api_response,
    re.DOTALL
)
if data_block:
    lines = data_block.group().strip().split('\n')
    for line in lines:
        print(line)
    # {"name": "Alice", "score": 95}
    # {"name": "Bob", "score": 82}
    # {"name": "Carol", "score": 91}

# Extract metadata values
metadata_block = re.search(
    r'(?<=BEGIN_METADATA\n).+?(?=\nEND_METADATA)',
    api_response,
    re.DOTALL
)
if metadata_block:
    print(metadata_block.group().strip())
    # timestamp: 2026-02-16T10:30:00Z
    # source: api-v3

Python re Module: Full Lookaround Support

Python’s re module supports all four lookaround types. Here is a quick reference:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
import re

text = "foo123bar456baz789qux"

# Positive lookahead (?=...)
re.findall(r'\d+(?=[a-z])', text)
# ['123', '456', '789'] -- digits followed by a letter

# Negative lookahead (?!...)
re.findall(r'\d+(?![a-z])', text)
# ['12', '45', '78'] -- this matches partial digit runs not followed by a letter
# (the engine finds the longest digit run where the last digit is not followed by a letter)

# Positive lookbehind (?<=...)
re.findall(r'(?<=[a-z])\d+', text)
# ['123', '456', '789'] -- digits preceded by a letter

# Negative lookbehind (?<!...)
re.findall(r'(?<!\d)\d(?!\d)', text)
# [] -- single digits not surrounded by other digits (none here, all are in groups)

The re.findall() function is the workhorse for scraping because it returns all non-overlapping matches as a list. Use re.search() when you only need the first match, and re.compile() when you are reusing the same pattern in a loop.

Python Lookbehind Limitation: Fixed Width Only

This is the most important gotcha. In Python’s re module, lookbehinds must be fixed-width. You cannot use *, +, or {n,m} (where n differs from m) inside a lookbehind.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
import re

text = "Price: $49.99"

# This works -- fixed-width lookbehind
result = re.search(r'(?<=\$)\d+\.\d+', text)
print(result.group())
# '49.99'

# This FAILS -- variable-width lookbehind
try:
    result = re.search(r'(?<=Price:\s+)\d+\.\d+', text)
except re.error as e:
    print(f"Error: {e}")
    # Error: look-behind requires fixed-width pattern

The \s+ quantifier makes the lookbehind variable-width, which Python does not allow. Here are the workarounds.

Workaround 1: Use a Capturing Group Instead

1
2
3
4
5
6
7
8
9
10
11
12
13
14
import re

text = "Price:    $49.99"

# Skip the lookbehind, use a group to capture just what you need
match = re.search(r'Price:\s+\$([\d.]+)', text)
if match:
    print(match.group(1))
    # '49.99'

# With findall, groups are returned automatically
prices = re.findall(r'Price:\s+\$([\d.]+)', text)
print(prices)
# ['49.99']

This is the most common workaround and works in every situation.

Workaround 2: Use the regex Module

The third-party regex module (install with pip install regex) supports variable-width lookbehinds:

1
2
3
4
5
6
7
8
9
import regex

text = "Price:    $49.99"

# Variable-width lookbehind works in the regex module
result = regex.search(r'(?<=Price:\s+\$)[\d.]+', text)
if result:
    print(result.group())
    # '49.99'

The regex module is a drop-in replacement for re with additional features. If you find yourself hitting the fixed-width limitation frequently, it is worth adding to your project.

Workaround 3: Use a Fixed-Width Approximation

If you know the approximate width, you can use a fixed quantifier:

1
2
3
4
5
6
7
8
9
10
import re

text = "Price: $49.99"

# Use {1,} but replace with a reasonable fixed width
# If you know the space is always exactly one space:
result = re.search(r'(?<=Price:\s)\$[\d.]+', text)
if result:
    print(result.group())
    # '$49.99'

This approach is fragile and only recommended when you have tight control over the input format.

JavaScript Lookahead and Lookbehind Support

Modern JavaScript (ES2018 and later) supports all four lookaround types. This matters if you are writing scraping logic in Node.js or running browser-side extraction scripts.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
const text = "Price: $49.99, Weight: 2.3kg, Rating: 4.5/5";

// Positive lookahead
const beforeKg = text.match(/[\d.]+(?=kg)/g);
console.log(beforeKg);
// ['2.3']

// Negative lookahead
const notBeforeKg = text.match(/\d+\.\d+(?!kg)/g);
console.log(notBeforeKg);
// ['49.99', '4.5']

// Positive lookbehind
const afterDollar = text.match(/(?<=\$)[\d.]+/g);
console.log(afterDollar);
// ['49.99']

// Negative lookbehind
const notAfterDollar = text.match(/(?<!\$)[\d.]+(?=kg)/g);
console.log(notAfterDollar);
// ['2.3']

Lookbehinds in JavaScript have no fixed-width restriction, unlike Python’s re module. However, be aware that lookbehinds were only added in V8 (Chrome, Node.js) starting with ES2018. If you are running extraction scripts in older environments, test for support first.

1
2
3
4
5
6
7
8
9
10
11
// Check for lookbehind support
function supportsLookbehind() {
    try {
        new RegExp('(?<=test)');
        return true;
    } catch (e) {
        return false;
    }
}
console.log(supportsLookbehind());
// true in modern environments
A well-crafted pattern can extract data that no parser could reach.
A well-crafted pattern can extract data that no parser could reach. Photo by Berna / Pexels

Performance Considerations

Lookarounds can introduce performance problems if used carelessly. Here are the main points to keep in mind.

Backtracking with Lookaheads

A lookahead inside a repeated group can cause catastrophic backtracking:

1
2
3
4
5
6
7
8
import re

# BAD: This can be extremely slow on certain inputs
# The lookahead is evaluated at every position in the repetition
bad_pattern = r'(?:\w+(?=\s))*'

# GOOD: Move the lookahead outside the repetition when possible
good_pattern = r'\w+(?=\s)'

Anchoring Reduces Work

Always anchor your patterns when you can. An unanchored lookahead forces the engine to check the assertion at every position in the string:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
import re

html = '<div class="price">$49.99</div>' * 10000

# Slower: engine checks lookahead at every position
import time

start = time.time()
re.findall(r'(?<=\$)[\d.]+', html)
elapsed_uncompiled = time.time() - start

# Faster: compile the pattern for repeated use
pattern = re.compile(r'(?<=\$)[\d.]+')
start = time.time()
pattern.findall(html)
elapsed_compiled = time.time() - start

print(f"Uncompiled: {elapsed_uncompiled:.4f}s")
print(f"Compiled: {elapsed_compiled:.4f}s")

Keep Lookarounds Simple

Complex lookarounds with alternations or nested groups slow things down. If your lookaround pattern is getting complicated, consider whether a two-pass approach (match broadly, then filter) would be clearer and faster.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
import re

html = """
<span data-type="price" data-currency="USD">49.99</span>
<span data-type="rating">4.5</span>
<span data-type="price" data-currency="EUR">89.00</span>
"""

# Complex lookaround approach
prices_v1 = re.findall(
    r'(?<=data-currency="(?:USD|EUR)">)[\d.]+(?=</span>)',
    html
)
# This works in Python because both alternatives (USD, EUR) have the same fixed width

# Two-pass approach (clearer and works in Python)
price_spans = re.findall(
    r'data-currency="(USD|EUR)">([\d.]+)</span>',
    html
)
for currency, price in price_spans:
    print(f"{currency}: {price}")
    # USD: 49.99
    # EUR: 89.00

Complete Example: Extracting Structured Product Data

Here is a full working example that combines multiple lookaround techniques to extract structured product data from an HTML fragment.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
import re
from dataclasses import dataclass


@dataclass
class Product:
    name: str
    price: float
    currency: str
    sku: str
    in_stock: bool


def extract_products(html: str) -> list[Product]:
    """Extract product data from HTML using lookaround-based regex patterns."""

    products = []

    # Split HTML into individual product blocks
    product_blocks = re.findall(
        r'(?<=<div class="product">).+?(?=</div>\s*(?:<div class="product">|$))',
        html,
        re.DOTALL
    )

    for block in product_blocks:
        # Extract product name: text inside <h3> tags
        name_match = re.search(
            r'(?<=<h3>).+?(?=</h3>)',
            block
        )

        # Extract price: digits after a currency symbol
        price_match = re.search(
            r'(?<=[\$])[\d.]+',
            block
        )

        # Extract currency: the symbol before the price
        currency_match = re.search(
            r'[\$](?=[\d.]+)',
            block
        )

        # Extract SKU: alphanumeric code after "SKU: "
        sku_match = re.search(
            r'(?<=SKU:\s)[A-Z0-9-]+',
            block
        )

        # Check stock status: look for "In Stock" not preceded by "Not "
        in_stock = bool(re.search(
            r'(?<!Not )In Stock',
            block
        ))

        if name_match and price_match:
            currency_symbol = currency_match.group() if currency_match else '$'
            currency_map = {'$': 'USD'}

            products.append(Product(
                name=name_match.group().strip(),
                price=float(price_match.group()),
                currency=currency_map.get(currency_symbol, 'USD'),
                sku=sku_match.group() if sku_match else 'N/A',
                in_stock=in_stock
            ))

    return products


# Sample HTML
html = """
<div class="product">
<h3>Mechanical Keyboard</h3>
<span class="price">$89.99</span>
<span class="sku">SKU: KB-2026-MX</span>
<span class="stock">In Stock</span>
</div>
<div class="product">
<h3>Ergonomic Mouse</h3>
<span class="price">$45.50</span>
<span class="sku">SKU: MS-2026-ER</span>
<span class="stock">Not In Stock</span>
</div>
<div class="product">
<h3>USB-C Hub</h3>
<span class="price">$29.99</span>
<span class="sku">SKU: HB-2026-UC</span>
<span class="stock">In Stock</span>
</div>
"""

products = extract_products(html)
for product in products:
    print(f"{product.name}: {product.currency} {product.price:.2f} "
          f"[{product.sku}] {'Available' if product.in_stock else 'Sold Out'}")

# Mechanical Keyboard: USD 89.99 [KB-2026-MX] Available
# Ergonomic Mouse: USD 45.50 [MS-2026-ER] Sold Out
# USB-C Hub: USD 29.99 [HB-2026-UC] Available

This example demonstrates several key patterns:

  • Positive lookbehind ((?<=<h3>)) to extract text after an opening tag
  • Positive lookahead ((?=</h3>)) to stop the match before a closing tag
  • Positive lookbehind ((?<=\$)) to grab the price without the currency symbol
  • Positive lookahead ((?=[\d.]+)) to match the currency symbol before a price
  • Negative lookbehind ((?<!Not )) to distinguish “In Stock” from “Not In Stock”

Quick Reference Table

Here is a summary of all four lookaround types with their syntax and behavior:

TypeSyntaxMeaningExample
Positive Lookahead(?=...)Followed by pattern\d+(?=px) matches 100 in 100px
Negative Lookahead(?!...)NOT followed by pattern\d+(?!px) matches 200 in 200em
Positive Lookbehind(?<=...)Preceded by pattern(?<=\$)\d+ matches 50 in $50
Negative Lookbehind(?<!...)NOT preceded by pattern(?<!#)\d+ matches 10 in x10

When to Use Lookarounds vs. Capturing Groups

Lookarounds and capturing groups can often solve the same problem. Here is how to decide:

  • Use lookarounds when you need the match itself to be clean – for example, when passing results directly to re.findall() and you want a flat list of values without post-processing.
  • Use capturing groups when the lookbehind would be variable-width (Python limitation), when you need to capture multiple related values from the same pattern, or when readability matters more than elegance.
  • Combine both when you need to capture a group and also assert something about the surrounding context.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
import re

text = "Product: Widget (SKU: WDG-100) - $25.00"

# Lookaround approach: clean match, but two separate calls
name = re.search(r'(?<=Product:\s)\w+', text).group()
sku = re.search(r'(?<=SKU:\s)[A-Z0-9-]+', text).group()
price = re.search(r'(?<=\$)[\d.]+', text).group()

print(f"Name: {name}, SKU: {sku}, Price: {price}")
# Name: Widget, SKU: WDG-100, Price: 25.00

# Capturing group approach: one call, multiple values
match = re.search(
    r'Product:\s(\w+)\s\(SKU:\s([A-Z0-9-]+)\)\s-\s\$([\d.]+)',
    text
)
if match:
    name, sku, price = match.groups()
    print(f"Name: {name}, SKU: {sku}, Price: {price}")
    # Name: Widget, SKU: WDG-100, Price: 25.00

Both approaches produce the same result. The capturing group version is more efficient because it makes a single pass over the string. The lookaround version is easier to maintain because each pattern is independent – if the page layout changes, you only need to update the affected pattern.

Lookarounds are a precision tool. They will not replace a proper HTML parser for complex page structures — our guide to building a web scraper with regex covers the practical patterns and pitfalls of going that route. For sites where even regex hits its limits, LLM-based structured data extraction can automate the parsing step entirely. For targeted extraction of values from predictable text patterns, lookarounds give you clean results with minimal code. Master the four types – positive and negative, lookahead and lookbehind – and you will handle most text extraction tasks without reaching for heavier tools.

Contact Arman for Complex Problems
This post is licensed under CC BY 4.0 by the author.