Surviving Anti-Bot Updates on E-Commerce Sites: Tool Comparison
E-commerce sites update their anti-bot defenses frequently. The scraper that pulled product listings cleanly last month may be returning 403 errors or CAPTCHA walls today. These sites have strong incentives to block scrapers: protecting pricing data from competitors, preventing inventory hoarding by bots, and reducing server load from automated traffic. The result is a constant cycle of updates that break existing scrapers and force teams to adapt. This post compares the tools available for e-commerce scraping, examines how each one holds up when anti-bot systems get updated, and covers strategies for building scrapers that degrade gracefully instead of failing silently. For a broader look at how these stealth browsers compare outside the e-commerce context, see our guide to the stealth browser landscape.
The Anti-Bot Providers Behind E-Commerce Sites
Most major e-commerce platforms do not build their own anti-bot systems from scratch. They purchase solutions from specialized vendors. Understanding which provider protects a site tells you what you are up against.
Cloudflare is the most common. It sits in front of the site as a reverse proxy, handling DNS, CDN, and bot management in one package. Cloudflare’s Bot Management uses machine learning models trained on traffic across millions of sites. When it updates, the change affects every site behind it simultaneously. You might find your scraper blocked on dozens of targets overnight.
DataDome specializes in bot protection and is popular with mid-to-large e-commerce sites. It runs detection scripts on the client side that probe browser fingerprints deeply. DataDome updates its detection models frequently, sometimes multiple times per week.
PerimeterX (now HUMAN) focuses on behavioral analysis. It watches how users interact with pages, building models of normal human behavior and flagging deviations. Its updates often target specific automation patterns it has observed.
Akamai Bot Manager runs at the CDN edge. It combines device fingerprinting, behavioral signals, and reputation scoring. Akamai tends to roll out changes gradually, but when updates land, they are thorough.
How E-Commerce Anti-Bot Stacks Work
Anti-bot protection on e-commerce sites is not a single check. It is a pipeline. A request passes through multiple layers, each one capable of blocking or challenging the visitor.
graph TD
A["Incoming Request"] --> B["CDN / Edge Layer<br>IP reputation, rate limiting,<br>TLS fingerprint check"]
B -->|Suspicious| C["JavaScript Challenge<br>Invisible JS executes in browser,<br>probes APIs and environment"]
B -->|Clean| D["Page Served"]
C -->|Challenge passed| E["Behavioral Analysis<br>Mouse movements, scroll patterns,<br>click timing, keystroke cadence"]
C -->|Challenge failed| F["Blocked / CAPTCHA"]
E -->|Human-like| D
E -->|Bot-like| F
style A fill:#e6e6e6
style B fill:#ffcccc
style C fill:#ffffcc
style D fill:#99ff99
style E fill:#ccffcc
style F fill:#ff9999
The key insight is that each layer is independently updateable. A vendor might push new fingerprint checks at the JS challenge layer without touching the edge layer. This means partial breakage: your scraper might pass three out of four checks but fail the one that changed.
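One way to make partial breakage visible is to classify each blocked response by the layer that likely produced it. The helper below is a hypothetical sketch (the marker strings are commonly observed values, not guaranteed): edge blocks tend to arrive as bare 403/429 responses, JS challenges come back as challenge pages in place of content, and behavioral failures often surface as CAPTCHAs on otherwise normal responses.

```python
def classify_block(status: int, body: str) -> str:
    """Rough guess at which anti-bot layer rejected a request."""
    body_lower = body.lower()
    # Edge-layer blocks usually return an error status with no real page
    if status in (403, 429) and "<html" not in body_lower:
        return "edge"
    # Challenge pages replace the content you asked for
    if "challenge" in body_lower or "cf-chl" in body_lower:
        return "js_challenge"
    # A CAPTCHA on a 200 response suggests behavioral scoring failed
    if status == 200 and "captcha" in body_lower:
        return "behavioral"
    return "unknown"
```

Logging this per request shows you which layer changed after an update, instead of a flat failure count.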
What Changes When Anti-Bot Systems Update
Anti-bot vendors do not announce their updates. You discover them when your scraper starts failing. The evolution of web scraping detection methods shows how these techniques have compounded over the years. Here are the common categories of change.
New JavaScript Challenges
The JS challenge page is the most frequently updated component. Vendors rewrite their challenge scripts to probe new browser APIs, change obfuscation techniques, and add new consistency checks. A typical update might start checking navigator.keyboard or navigator.ink, APIs that exist in real browsers but are absent in many automation setups.
// Example: a detection script probing for automation artifacts
(function() {
  const checks = [];
  // Check 1: webdriver flag
  checks.push(navigator.webdriver === true);
  // Check 2: Chrome DevTools Protocol artifacts
  checks.push(typeof window.cdc_adoQpoasnfa76pfcZLmcfl !== 'undefined');
  // Check 3: Inconsistent permissions API
  navigator.permissions.query({ name: 'notifications' }).then(function(result) {
    checks.push(result.state === 'prompt' && Notification.permission !== 'default');
  });
  // Check 4: Missing browser-specific APIs
  checks.push(typeof window.chrome === 'undefined');
  // Send results to detection endpoint
  if (checks.some(Boolean)) {
    reportBot();
  }
})();
Updated Fingerprint Checks
TLS fingerprinting evolves as browsers update their cipher suite preferences. When Chrome 130 changes its default extension order, anti-bot systems update their fingerprint databases to match. If your HTTP client still presents a Chrome 125 fingerprint, it looks outdated and suspicious.
Canvas fingerprinting is another moving target. Detection scripts render specific shapes, gradients, and text to an offscreen canvas and hash the output. They know what each real browser version should produce and flag mismatches.
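You can watch your own side of this moving target by hashing the canvas render your browser produces and comparing it across runs: a changed hash after a browser or tooling update means your canvas output drifted. This is a sketch; the page.evaluate snippet in the comment is an assumed Playwright-style call, not any vendor's actual probe.

```python
import hashlib

def canvas_hash(data_url: str) -> str:
    """Hash a canvas toDataURL() string so renders can be compared across runs."""
    return hashlib.sha256(data_url.encode()).hexdigest()

# In a browser session you might obtain the render with something like:
#   data_url = page.evaluate(
#       "() => { const c = document.createElement('canvas');"
#       " c.getContext('2d').fillText('fp-test', 2, 2); return c.toDataURL(); }"
#   )
# then compare canvas_hash(data_url) against the hash recorded for a stock browser.
```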
Behavioral Model Retraining
The behavioral layer gets retrained on fresh data periodically. These models learn new patterns of bot behavior that have been observed in the wild. If a popular automation framework introduces a new way to simulate clicks, behavioral models will eventually learn to detect it.
Cookie and Token Rotation
Anti-bot systems issue challenge cookies that prove a visitor passed validation. These cookies have formats and lifetimes that change with updates. A scraper that relies on a specific cookie structure may break when the format changes.
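A cheap defense is to verify, before reusing a session, that it still carries the clearance cookies its vendor issues. The cookie names below (cf_clearance for Cloudflare, datadome for DataDome) are the commonly observed ones and can change with updates; treat absence as "session expired" and re-run the browser flow.

```python
# Commonly observed clearance cookie names per vendor (an assumption; verify
# against your own targets, since formats rotate with updates).
EXPECTED_COOKIES = {
    "cloudflare": {"cf_clearance"},
    "datadome": {"datadome"},
}

def session_still_valid(vendor: str, cookies: dict[str, str]) -> bool:
    """True if the session still carries every clearance cookie the vendor requires."""
    required = EXPECTED_COOKIES.get(vendor, set())
    return all(cookies.get(name) for name in required)
```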

Tool Resilience Comparison
Not all scraping tools respond to anti-bot updates equally. Some break immediately. Others absorb changes and keep working. Here is how the major options stack up.
Requests and HTTPX
Resilience: Low
Pure HTTP clients like Python’s requests or httpx have no browser engine. They cannot execute JavaScript challenges at all. They fail at the second layer of every anti-bot stack.
import httpx
# This works fine on unprotected sites
response = httpx.get("https://example-shop.com/products")
# But against Cloudflare or DataDome, you get:
# - 403 Forbidden
# - A challenge page HTML instead of product data
# - A redirect to a CAPTCHA endpoint
When anti-bot systems update, requests-based scrapers do not just degrade. They are already broken. The only scenario where they survive is when paired with a tool like httpmorph to fix TLS fingerprints and the target does not serve JS challenges. This combination works for a narrow slice of targets.
When to use: Unprotected sites, APIs, or as a secondary tool for fetching resources after a browser session has obtained valid cookies.
Selenium
Resilience: Low to Moderate
Selenium automates a real browser, which means it can execute JavaScript challenges. However, standard Selenium leaves obvious fingerprints. The navigator.webdriver property is set to true. ChromeDriver injects identifiable variables into the page context. The browser binary itself may differ from a standard Chrome installation.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
options = Options()
options.add_argument("--headless=new")
driver = webdriver.Chrome(options=options)
driver.get("https://example-shop.com/products")
# navigator.webdriver is True
# window.cdc_ variables are present
# Detection scripts catch this immediately
When anti-bot updates target webdriver detection, Selenium breaks. Tools like undetected-chromedriver patch some of these issues, but they play catch-up with every update. The fundamental problem is that Selenium was built for testing, not stealth.
When to use: Sites with minimal bot protection. Legacy projects where rewriting is not feasible.
Playwright with Stealth Plugins
Resilience: Moderate
Playwright provides better low-level control than Selenium. Combined with stealth plugins like puppeteer-extra-plugin-stealth via playwright-extra (for Node.js) or playwright-stealth (for Python), it patches many common detection vectors.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    context = browser.new_context(
        viewport={"width": 1920, "height": 1080},
        user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                   "Chrome/131.0.0.0 Safari/537.36",
        locale="en-US",
        timezone_id="America/New_York",
    )
    page = context.new_page()
    # Stealth patches applied here:
    # navigator.webdriver = false
    # chrome.runtime injected
    # plugin array populated
    page.goto("https://example-shop.com/products")
    products = page.query_selector_all(".product-card")
Playwright stealth survives many routine updates. It breaks when detection scripts start probing deeper than what the stealth patches cover, such as checking Chrome-specific internal APIs or analyzing the timing of how browser APIs respond. For a direct matchup, see Playwright vs Selenium for stealth. After a major anti-bot update, there is usually a gap of days to weeks before the stealth community publishes patches.
When to use: Sites with moderate protection. Projects where you need both scraping and interaction (form filling, pagination).
Nodriver
Resilience: Good
Nodriver connects to Chrome through the raw DevTools Protocol without injecting a driver binary, as explained in the complete nodriver guide. This eliminates the largest category of detection vectors. There is no ChromeDriver, no cdc_ variables, no webdriver flag. The browser itself is a standard Chrome installation.
import nodriver as uc

async def scrape():
    browser = await uc.start()
    page = await browser.get("https://example-shop.com/products")
    # No webdriver artifacts
    # No injected driver variables
    # Standard Chrome TLS fingerprint
    products = await page.select_all(".product-card")
    for product in products:
        title = await product.query_selector(".title")
        price = await product.query_selector(".price")
        print(title.text, price.text)  # .text is a plain property on elements
    browser.stop()

if __name__ == "__main__":
    uc.loop().run_until_complete(scrape())
Nodriver survives most anti-bot updates because it starts from a clean baseline. Detection scripts that look for automation artifacts find none. It typically breaks only when vendors update behavioral models to detect the specific interaction patterns that nodriver users tend to produce, or when new Chrome-level fingerprint checks are introduced that require specific browser configuration.
When to use: Sites with strong protection where you need a Python-native solution.
Camoufox
Resilience: Excellent
Camoufox takes a different approach entirely. It is a custom build of Firefox with anti-detection patches applied at the C++ engine level. This means that fingerprint spoofing happens inside the browser engine itself, not through JavaScript patches that detection scripts can detect.
from camoufox.sync_api import Camoufox

with Camoufox(
    os="windows",
    humanize=True,
    screen={"width": 1920, "height": 1080},
) as browser:
    page = browser.new_page()
    page.goto("https://example-shop.com/products")
    # Canvas fingerprint matches spoofed OS
    # WebGL renderer consistent with claimed GPU
    # Font enumeration returns expected system fonts
    # All checks are engine-level, not JS patches
    products = page.query_selector_all(".product-card")
Camoufox survives anti-bot updates better than any other tool because its modifications are invisible to JavaScript inspection. Detection scripts query browser APIs and get responses that are internally consistent and match a real browser profile. When an anti-bot vendor updates its JS challenges, Camoufox usually passes them without any changes needed on the user’s side.
The main risk is when anti-bot vendors start specifically fingerprinting the Camoufox build of Firefox, or when Firefox itself updates in a way that requires Camoufox patches to be refreshed.
When to use: Sites with the strongest protection. High-value data collection where reliability matters more than speed.
Resilience Summary
graph TD
U["Anti-Bot Update Deployed"] --> R["Requests / HTTPX<br>Already broken<br>No JS execution"]
U --> S["Selenium<br>Breaks immediately<br>WebDriver artifacts"]
U --> P["Playwright + Stealth<br>Breaks on major updates<br>JS patches lag behind"]
U --> N["Nodriver<br>Survives most updates<br>No driver artifacts"]
U --> C["Camoufox<br>Survives nearly all updates<br>Engine-level stealth"]
style R fill:#ff9999
style S fill:#ffcccc
style P fill:#ffffcc
style N fill:#ccffcc
style C fill:#99ff99
Monitoring Strategies: Detect Blocks Before They Cost You Data
The worst outcome is not getting blocked. It is getting blocked and not knowing it. Silent failures, where your scraper receives soft blocks (empty product listings, inflated prices, missing inventory data), can corrupt your dataset for days before anyone notices.
HTTP Status Monitoring
Track the distribution of HTTP status codes over time. A sudden spike in 403 or 429 responses indicates a new block.
import httpx
from collections import Counter
from datetime import datetime

status_log = []

async def monitored_fetch(url: str, client: httpx.AsyncClient):
    response = await client.get(url)
    status_log.append({
        "url": url,
        "status": response.status_code,
        "timestamp": datetime.utcnow().isoformat(),
    })
    # Alert on anomalies across the last 100 requests
    recent = [s["status"] for s in status_log[-100:]]
    counter = Counter(recent)
    error_rate = (counter.get(403, 0) + counter.get(429, 0)) / len(recent)
    if error_rate > 0.1:
        # alert() is your notification hook (Slack, PagerDuty, email, ...)
        alert(f"Block rate at {error_rate:.0%} - possible anti-bot update")
    return response
Content Validation
Status code monitoring is not enough. Some anti-bot systems return 200 OK but serve a challenge page or degraded content. Validate that the response contains the data you expect.
from bs4 import BeautifulSoup

def validate_product_page(html: str) -> bool:
    soup = BeautifulSoup(html, "html.parser")
    # Check for expected elements
    has_products = len(soup.select(".product-card")) > 0
    has_prices = len(soup.select("[data-price]")) > 0
    # Check for challenge page indicators
    is_challenge = any([
        "challenge" in soup.title.text.lower() if soup.title else False,
        soup.select("#challenge-running"),
        soup.select("[data-ray]"),  # Cloudflare ray ID
        "datadome" in html.lower(),
    ])
    return has_products and has_prices and not is_challenge
Fingerprint Drift Detection
Periodically check what your browser looks like from the server’s perspective. Services like browserleaks.com or self-hosted fingerprint checkers can tell you if your browser’s fingerprint has drifted from what a real browser would produce.
import nodriver as uc

async def check_fingerprint():
    browser = await uc.start()
    page = await browser.get("https://browserleaks.com/javascript")
    # Extract key fingerprint values
    webdriver = await page.evaluate("navigator.webdriver")
    plugins_count = await page.evaluate("navigator.plugins.length")
    languages = await page.evaluate("navigator.languages")
    print(f"webdriver: {webdriver}")    # Should be False or undefined
    print(f"plugins: {plugins_count}")  # Should be > 0
    print(f"languages: {languages}")    # Should match locale
    browser.stop()

Adaptation Patterns: How to Recover Quickly
When an anti-bot update breaks your scraper, speed matters. Here are patterns that reduce your recovery time.
Pattern 1: Layered Tool Fallback
Structure your scraper so it can switch between tools at runtime. Start with the fastest option and fall back to more resilient ones.
import asyncio
import httpx

async def fetch_product(url: str) -> dict | None:
    # Layer 1: Try HTTP client first (fastest)
    result = await try_httpx(url)
    if result:
        return result
    # Layer 2: Try Playwright with stealth
    result = await try_playwright(url)
    if result:
        return result
    # Layer 3: Fall back to Camoufox (slowest but most resilient)
    result = await try_camoufox(url)
    if result:
        return result
    # All layers failed - log and alert
    log_failure(url)
    return None

async def try_httpx(url: str) -> dict | None:
    try:
        async with httpx.AsyncClient() as client:
            response = await client.get(url, timeout=10)
            if response.status_code == 200:
                data = parse_product(response.text)
                if validate_product_data(data):
                    return data
    except Exception:
        pass
    return None
Pattern 2: Session Rotation
When blocks start, rotate not just proxies but entire browser profiles. Each session should look like a distinct user.
import random

VIEWPORTS = [
    {"width": 1920, "height": 1080},
    {"width": 1366, "height": 768},
    {"width": 1536, "height": 864},
    {"width": 1440, "height": 900},
]
LOCALES = ["en-US", "en-GB", "en-CA", "en-AU"]
TIMEZONES = [
    "America/New_York",
    "America/Chicago",
    "America/Denver",
    "America/Los_Angeles",
]

def create_fresh_profile() -> dict:
    return {
        "viewport": random.choice(VIEWPORTS),
        "locale": random.choice(LOCALES),
        "timezone_id": random.choice(TIMEZONES),
        "color_scheme": random.choice(["light", "dark"]),
    }
Pattern 3: Delay Randomization
After an anti-bot update, behavioral models are often the trigger. Add human-like timing to your interactions.
import random
import asyncio

async def human_delay(min_ms: int = 800, max_ms: int = 3000):
    """Simulate human thinking time between actions."""
    delay = random.uniform(min_ms, max_ms) / 1000
    # Add occasional longer pauses (reading time)
    if random.random() < 0.1:
        delay += random.uniform(2, 5)
    await asyncio.sleep(delay)

async def scrape_with_human_timing(page, urls: list[str]):
    for url in urls:
        await page.goto(url)
        await human_delay(1000, 3000)
        # Scroll down naturally
        await page.evaluate("window.scrollBy(0, 300)")
        await human_delay(500, 1500)
        # Extract data
        data = await extract_product_data(page)
        yield data
        # Longer pause between pages
        await human_delay(2000, 5000)
The API Alternative
Before investing effort in bypassing anti-bot systems, check whether the data you need is available through an API. Many e-commerce platforms expose product data through official or semi-official channels.
Official APIs
Some platforms offer developer APIs with structured product data. These are rate-limited but reliable and not subject to anti-bot measures. Examples include product catalog APIs, affiliate data feeds, and marketplace seller APIs.
Undocumented APIs
Modern e-commerce sites are typically single-page applications. The frontend fetches product data from internal JSON APIs. These endpoints often have lighter anti-bot protection than the HTML pages.
async def find_api_endpoints(page):
    """Monitor network requests to find JSON API endpoints."""
    api_calls = []

    def record_response(response):
        if "json" in response.headers.get("content-type", ""):
            api_calls.append({
                "url": response.url,
                "status": response.status,
                "content_type": response.headers.get("content-type", ""),
            })

    page.on("response", record_response)
    await page.goto("https://example-shop.com/products")
    await page.wait_for_load_state("networkidle")
    for call in api_calls:
        print(f"API endpoint: {call['url']}")
        # These JSON endpoints often return structured data
        # that is easier to parse than HTML
Once you identify these endpoints, you can often call them directly with an HTTP client, bypassing the frontend anti-bot checks entirely. The trick is that you may need valid session cookies, which you can obtain from a single browser session and reuse across many HTTP requests.
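A minimal sketch of that handoff: flatten the cookie list a browser session exports (Playwright's context.cookies() returns a list of dicts with "name" and "value" keys) into the plain name-to-value mapping HTTP clients accept. The httpx call in the comment is a usage assumption.

```python
def cookies_for_http_client(browser_cookies: list[dict]) -> dict[str, str]:
    """Flatten cookies exported from a browser session into a name -> value
    mapping suitable for an HTTP client's cookies= parameter."""
    return {c["name"]: c["value"] for c in browser_cookies}

# Usage sketch (assumed names):
#   jar = cookies_for_http_client(context.cookies())
#   response = httpx.get(api_url, cookies=jar, headers={"User-Agent": ua})
```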

Structured Data: The Path of Least Resistance
Many e-commerce sites embed structured data in their HTML using JSON-LD or Microdata formats. This data is intended for search engines and is often served even when anti-bot systems are active, because blocking search engine crawlers would hurt the site’s SEO.
<!-- JSON-LD structured data embedded in a product page -->
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Product",
  "name": "Wireless Noise-Canceling Headphones",
  "brand": {"@type": "Brand", "name": "AudioTech"},
  "offers": {
    "@type": "Offer",
    "price": "149.99",
    "priceCurrency": "USD",
    "availability": "https://schema.org/InStock"
  },
  "aggregateRating": {
    "@type": "AggregateRating",
    "ratingValue": "4.5",
    "reviewCount": "2847"
  }
}
</script>
Extracting this structured data is straightforward and resilient to frontend changes.
import json
from bs4 import BeautifulSoup

def extract_structured_data(html: str) -> list[dict]:
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for script in soup.select('script[type="application/ld+json"]'):
        try:
            data = json.loads(script.string)
            if isinstance(data, list):
                results.extend(data)
            else:
                results.append(data)
        except json.JSONDecodeError:
            continue
    return results

def extract_products(structured_data: list[dict]) -> list[dict]:
    products = []
    for item in structured_data:
        if item.get("@type") == "Product":
            products.append({
                "name": item.get("name"),
                "brand": item.get("brand", {}).get("name"),
                "price": item.get("offers", {}).get("price"),
                "currency": item.get("offers", {}).get("priceCurrency"),
                "availability": item.get("offers", {}).get("availability"),
                "rating": item.get("aggregateRating", {}).get("ratingValue"),
            })
    return products
This approach has a significant advantage: structured data formats rarely change when anti-bot systems are updated. The data is maintained for SEO purposes and follows standard schemas.
Building Resilient Scrapers: The Full Architecture
A production e-commerce scraper should combine multiple strategies into a system that degrades gracefully.
graph TD
A["Target URL"] --> B{"Structured Data<br>Available?"}
B -->|Yes| C["Extract JSON-LD<br>No anti-bot risk"]
B -->|No| D{"API Endpoint<br>Discovered?"}
D -->|Yes| E["Call API with<br>session cookies"]
D -->|No| F{"Protection<br>Level?"}
F -->|None| G["HTTP Client<br>requests / httpx"]
F -->|Basic| H["Playwright + Stealth"]
F -->|Strong| I["Nodriver or Camoufox"]
C --> J["Validate Data"]
E --> J
G --> J
H --> J
I --> J
J -->|Valid| K["Store Results"]
J -->|Invalid| L["Escalate to<br>Next Strategy"]
L --> D
style C fill:#99ff99
style E fill:#ccffcc
style G fill:#ffffcc
style H fill:#ffddaa
style I fill:#ffcccc
style K fill:#99ff99
Implementing Graceful Degradation
import asyncio
import logging
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum

logger = logging.getLogger(__name__)

class Strategy(Enum):
    STRUCTURED_DATA = "structured_data"
    API = "api"
    HTTP_CLIENT = "http_client"
    PLAYWRIGHT = "playwright"
    NODRIVER = "nodriver"
    CAMOUFOX = "camoufox"

@dataclass
class ScrapeResult:
    url: str
    strategy: Strategy
    data: dict | None = None
    success: bool = False
    error: str | None = None
    timestamp: datetime = field(default_factory=datetime.utcnow)

@dataclass
class ResilientScraper:
    strategies: list[Strategy] = field(default_factory=lambda: [
        Strategy.STRUCTURED_DATA,
        Strategy.API,
        Strategy.HTTP_CLIENT,
        Strategy.PLAYWRIGHT,
        Strategy.NODRIVER,
        Strategy.CAMOUFOX,
    ])

    async def scrape(self, url: str) -> ScrapeResult:
        for strategy in self.strategies:
            logger.info(f"Trying {strategy.value} for {url}")
            try:
                data = await self._execute(strategy, url)
                if data and self._validate(data):
                    return ScrapeResult(
                        url=url,
                        strategy=strategy,
                        data=data,
                        success=True,
                    )
                logger.warning(f"{strategy.value} returned invalid data")
            except Exception as e:
                logger.warning(f"{strategy.value} failed: {e}")
        return ScrapeResult(url=url, strategy=self.strategies[-1], success=False)

    async def _execute(self, strategy: Strategy, url: str) -> dict | None:
        handlers = {
            Strategy.STRUCTURED_DATA: self._try_structured_data,
            Strategy.API: self._try_api,
            Strategy.HTTP_CLIENT: self._try_http,
            Strategy.PLAYWRIGHT: self._try_playwright,
            Strategy.NODRIVER: self._try_nodriver,
            Strategy.CAMOUFOX: self._try_camoufox,
        }
        return await handlers[strategy](url)

    def _validate(self, data: dict) -> bool:
        required_fields = ["name", "price"]
        return all(data.get(f) for f in required_fields)

    # Each _try_* method implements the specific strategy
    # ...
Health Check Dashboard
Track which strategies are working and which are failing across your targets.
from collections import defaultdict
from datetime import datetime, timedelta

class HealthTracker:
    def __init__(self):
        self.results: list[ScrapeResult] = []

    def record(self, result: ScrapeResult):
        self.results.append(result)

    def report(self, hours: int = 24) -> dict:
        cutoff = datetime.utcnow() - timedelta(hours=hours)
        recent = [r for r in self.results if r.timestamp > cutoff]
        stats = defaultdict(lambda: {"success": 0, "failure": 0})
        for r in recent:
            key = r.strategy.value
            if r.success:
                stats[key]["success"] += 1
            else:
                stats[key]["failure"] += 1
        return {
            strategy: {
                "success_rate": s["success"] / max(s["success"] + s["failure"], 1),
                "total": s["success"] + s["failure"],
            }
            for strategy, s in stats.items()
        }
When the health report shows a strategy’s success rate dropping, that is your signal that an anti-bot update has landed. You can then investigate, update your approach, and shift traffic to strategies that are still working.
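Turning that signal into an alert only needs a threshold rule over the report() output. The thresholds below are assumptions to tune for your traffic volume; the sample-size floor avoids alerting on a strategy that has barely run.

```python
def failing_strategies(report: dict, min_rate: float = 0.5, min_samples: int = 10) -> list[str]:
    """Given the {strategy: {"success_rate": ..., "total": ...}} dict that
    HealthTracker.report() produces, list strategies whose success rate has
    fallen below min_rate, ignoring strategies with too few samples."""
    return sorted(
        name
        for name, s in report.items()
        if s["total"] >= min_samples and s["success_rate"] < min_rate
    )
```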
Ethical Considerations
Building resilient scrapers does not mean building aggressive ones. The techniques in this post are about surviving anti-bot updates without disrupting the sites you scrape.
Respect rate limits. If a site serves you a 429 Too Many Requests response, back off. Do not retry immediately. Implement exponential backoff with jitter.
import asyncio
import logging
import random

logger = logging.getLogger(__name__)

async def backoff_retry(func, max_retries: int = 5):
    for attempt in range(max_retries):
        result = await func()
        if result.status_code != 429:
            return result
        wait = (2 ** attempt) + random.uniform(0, 1)
        logger.info(f"Rate limited. Waiting {wait:.1f}s before retry {attempt + 1}")
        await asyncio.sleep(wait)
    raise Exception("Max retries exceeded")
Do not hammer servers. Space your requests out. A real human does not load 100 product pages per second. Set a reasonable crawl rate and stick to it.
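A minimal way to enforce that: serialize requests through a shared limiter that spaces them at least a fixed interval apart. This is a sketch with names of my own choosing, not a library API.

```python
import asyncio
import time

class CrawlRateLimiter:
    """Allow at most `rate` requests per second, shared across tasks."""

    def __init__(self, rate: float):
        self.min_interval = 1.0 / rate
        self._lock = asyncio.Lock()
        self._last = 0.0

    async def wait(self):
        # The lock serializes callers so concurrent tasks cannot burst
        async with self._lock:
            now = time.monotonic()
            sleep_for = self._last + self.min_interval - now
            if sleep_for > 0:
                await asyncio.sleep(sleep_for)
            self._last = time.monotonic()
```

Call `await limiter.wait()` before each request; every worker that shares the limiter is held to the same site-wide pace.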
Check robots.txt. It may not be legally binding in all jurisdictions, but it represents the site owner’s stated preferences. Respecting it is good practice. Meanwhile, some providers are going further with techniques like Cloudflare’s AI Labyrinth, which actively traps bots that ignore these boundaries.
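Python's standard library already parses robots.txt, including Crawl-delay. The sketch below feeds the parser inline rules so it runs without network access; against a live site you would point set_url at the real file (the example-shop.com URLs are placeholders, as elsewhere in this post).

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
# Against a live site:
#   rp.set_url("https://example-shop.com/robots.txt")
#   rp.read()
rp.parse([
    "User-agent: *",
    "Crawl-delay: 2",
    "Disallow: /checkout/",
])

allowed = rp.can_fetch("my-scraper", "https://example-shop.com/products")
blocked = rp.can_fetch("my-scraper", "https://example-shop.com/checkout/cart")
delay = rp.crawl_delay("my-scraper")  # seconds between requests, if declared
```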
Prefer structured data and APIs. When the data is available through JSON-LD or an API, use those channels. They put less load on the server than rendering full pages.
Cache aggressively. If product data does not change hourly, do not scrape it hourly. Reduce your footprint to what is necessary.
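A small TTL cache is enough to enforce that: serve the stored result until it ages out, and only then hit the site again. A minimal sketch with assumed names:

```python
import time

class TTLCache:
    """Return cached values until they are older than ttl_seconds."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._store: dict[str, tuple[float, object]] = {}

    def get(self, key: str):
        entry = self._store.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        # Evict and miss if the entry has outlived its TTL
        if time.monotonic() - stored_at > self.ttl:
            del self._store[key]
            return None
        return value

    def put(self, key: str, value):
        self._store[key] = (time.monotonic(), value)
```

Wrap your fetch function so it consults the cache first; with product data that changes daily, a TTL of hours cuts your request volume dramatically.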
Key Takeaways
The e-commerce anti-bot landscape is a moving target. No single tool guarantees permanent access to any site. The practical approach is to build systems that combine multiple strategies, monitor for breakage, and adapt quickly.
Start with the lightest approach: structured data and APIs. Escalate to browser-based tools only when necessary. When you do use browsers, choose tools with engine-level stealth over those that rely on JavaScript patches. And always monitor your success rates so you know when an update has landed before it corrupts your data.
The scrapers that survive are not the ones that break through the strongest defenses. They are the ones that find the path of least resistance and switch paths when the landscape changes.

