Types of Web Data You'll Encounter: A Field Guide
Every seasoned web scraper knows that not all data is created equal. From pristine JSON APIs to chaotic HTML soup, the web serves up an incredible variety of data formats, each with its own quirks, challenges, and extraction strategies. Understanding these different data types isn’t just academic—it’s the difference between writing elegant, maintainable scrapers and wrestling with brittle code that breaks at the first sign of change.
Let’s dive into the wild world of web data and explore what you’re likely to encounter in your scraping adventures.
Structured Data: The Holy Grail
Structured data is like finding a well-organized library where every book is exactly where it should be. These formats follow strict rules and schemas, making them predictable and relatively easy to parse.
JSON Data
JSON (JavaScript Object Notation) is the darling of modern web development. You’ll encounter it everywhere—from RESTful APIs to embedded data within HTML pages.
import requests
import json
# API endpoint returning JSON
response = requests.get('https://jsonplaceholder.typicode.com/users')
users = response.json()
for user in users:
    print(f"Name: {user['name']}, Email: {user['email']}")
    print(f"Company: {user['company']['name']}")
JSON’s beauty lies in its simplicity and universality. It maps perfectly to Python dictionaries and JavaScript objects, making data extraction straightforward. However, watch out for nested structures and varying schemas—not all JSON endpoints are as clean as the example above.
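When a schema can vary from record to record, defensive access with dict.get() (or a small helper) keeps the scraper from crashing on a missing key. A minimal sketch against the same placeholder API as above:

import requests

response = requests.get('https://jsonplaceholder.typicode.com/users', timeout=10)
response.raise_for_status()

for user in response.json():
    # .get() returns a default instead of raising KeyError when a field is absent
    name = user.get('name', 'unknown')
    company = (user.get('company') or {}).get('name', 'n/a')
    print(f"Name: {name}, Company: {company}")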
XML Data
XML might feel like a relic from the early 2000s, but it’s still alive and kicking, especially in enterprise environments, RSS feeds, and SOAP APIs.
import xml.etree.ElementTree as ET
import requests
# Parsing RSS feed
response = requests.get('https://feeds.feedburner.com/oreilly/radar')
root = ET.fromstring(response.content)
for item in root.findall('.//item'):
    title = item.find('title').text
    link = item.find('link').text
    print(f"Article: {title}\nURL: {link}\n")
XML’s hierarchical structure can be both a blessing and a curse. While it’s excellent for representing complex relationships, parsing can become verbose, especially with namespaces and attributes thrown into the mix.
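Namespaces are the usual stumbling block: ElementTree won't match a namespaced tag unless you qualify it or pass a namespace map. A quick sketch using a tiny inline Atom document (a namespaced format you'll meet in the wild):

import xml.etree.ElementTree as ET

atom_sample = """<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><title>Example post</title></entry>
</feed>"""

root = ET.fromstring(atom_sample)
# Map a prefix to the feed's default namespace so find()/findall() can use it
ns = {'atom': 'http://www.w3.org/2005/Atom'}
for entry in root.findall('atom:entry', ns):
    print(entry.find('atom:title', ns).text)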
CSV Data
Don’t underestimate the humble CSV file. Many organizations still export data in this format, and you’ll often find downloadable CSV files on data-heavy websites.
import pandas as pd
import requests
from io import StringIO
# Downloading and parsing CSV data
url = 'https://example.com/data/exports/users.csv'
response = requests.get(url)
df = pd.read_csv(StringIO(response.text))
print(df.head())
print(f"Total records: {len(df)}")
The challenge with CSV data often lies in inconsistent formatting, missing headers, or special characters that break parsing. Always validate your assumptions about delimiter types and encoding.
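One defensive habit, sketched below, is to detect the encoding and sniff the delimiter before parsing rather than assuming comma-separated UTF-8 (chardet is an optional third-party detector; the URL is the same placeholder as above):

import csv
from io import StringIO

import chardet  # third-party; pip install chardet
import pandas as pd
import requests

url = 'https://example.com/data/exports/users.csv'  # placeholder URL
raw = requests.get(url).content

# Guess the encoding first, then let csv.Sniffer guess the delimiter
encoding = chardet.detect(raw)['encoding'] or 'utf-8'
text = raw.decode(encoding, errors='replace')
dialect = csv.Sniffer().sniff(text.splitlines()[0])

df = pd.read_csv(StringIO(text), sep=dialect.delimiter)
print(df.head())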
Semi-Structured Data: The Middle Ground
Semi-structured data sits comfortably between the rigid world of structured formats and the chaos of unstructured content. These formats have some organizational principles but allow for flexibility and variation.
HTML with Microdata
HTML pages often contain structured data markup using schemas like JSON-LD, microdata, or RDFa. This is particularly common on e-commerce sites and news websites.
from bs4 import BeautifulSoup
import json

html_content = """
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Product",
  "name": "Wireless Headphones",
  "offers": {
    "@type": "Offer",
    "price": "99.99",
    "priceCurrency": "USD"
  }
}
</script>
"""

soup = BeautifulSoup(html_content, 'html.parser')
json_ld = soup.find('script', type='application/ld+json')
if json_ld:
    data = json.loads(json_ld.string)
    print(f"Product: {data['name']}")
    print(f"Price: {data['offers']['price']} {data['offers']['priceCurrency']}")
This embedded structured data is a goldmine for scrapers. Search engines rely on it for rich snippets, so websites are incentivized to keep it accurate and up-to-date.
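JSON-LD is the easiest variant to consume, but the same page may instead use microdata attributes (itemscope/itemprop) directly on its HTML elements. A minimal sketch of collecting those with BeautifulSoup:

from bs4 import BeautifulSoup

html = """
<div itemscope itemtype="https://schema.org/Product">
  <h1 itemprop="name">Wireless Headphones</h1>
  <span itemprop="price" content="99.99">$99.99</span>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')
item = soup.find(itemscope=True)
# Gather every itemprop inside the item into a flat dict
props = {el['itemprop']: el.get('content') or el.get_text(strip=True)
         for el in item.find_all(attrs={'itemprop': True})}
print(props)  # {'name': 'Wireless Headphones', 'price': '99.99'}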
API Responses with Nested Objects
Many modern APIs return complex nested structures that blend predictable patterns with dynamic content.
# Handling nested API responses
api_data = {
    "user": {
        "id": 123,
        "profile": {
            "personal": {
                "name": "John Doe",
                "preferences": {
                    "notifications": ["email", "sms"],
                    "privacy": {
                        "level": "medium",
                        "custom_settings": {
                            "data_sharing": False
                        }
                    }
                }
            }
        }
    }
}

def extract_nested_value(data, path):
    """Safely extract values from nested dictionaries"""
    current = data
    for key in path.split('.'):
        if isinstance(current, dict) and key in current:
            current = current[key]
        else:
            return None
    return current

# Usage
name = extract_nested_value(api_data, "user.profile.personal.name")
privacy_level = extract_nested_value(api_data, "user.profile.personal.preferences.privacy.level")
The key to handling semi-structured data is building robust extraction functions that can gracefully handle missing or relocated fields.
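One simple way to do that, building on the helper above, is to try several candidate paths in order, so a field that moves between API versions is still found (the fallback path below is hypothetical):

def extract_nested_value_any(data, paths, default=None):
    """Return the first non-None value found among several dotted paths."""
    for path in paths:
        value = extract_nested_value(data, path)
        if value is not None:
            return value
    return default

# The second path is a made-up older location for the same field
name = extract_nested_value_any(api_data, [
    "user.profile.personal.name",
    "user.name",
])
print(name)  # John Doe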
Unstructured Data: The Wild West
Unstructured data is where scraping gets interesting—and challenging. This is raw HTML content without clear patterns, requiring you to understand the visual and semantic structure of web pages.
Traditional HTML Scraping
Most web scraping involves parsing HTML to extract meaningful data. This requires understanding DOM structure, CSS selectors, and XPath expressions.
from bs4 import BeautifulSoup
import requests
def scrape_product_listings(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    products = []
    for product_card in soup.find_all('div', class_='product-item'):
        name_elem = product_card.find('h3', class_='product-name')
        price_elem = product_card.find('span', class_='price')
        if name_elem and price_elem:
            product = {
                'name': name_elem.get_text(strip=True),
                'price': price_elem.get_text(strip=True),
                'url': product_card.find('a')['href'] if product_card.find('a') else None
            }
            products.append(product)
    return products
The challenge with HTML scraping is that websites frequently change their structure. CSS classes get renamed, DOM hierarchies shift, and new elements appear. Building resilient selectors requires understanding common patterns and having fallback strategies.
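One lightweight fallback strategy is to try a list of selectors in order, so a renamed class doesn't silently return zero results. A sketch with purely illustrative selector names:

from bs4 import BeautifulSoup

def select_with_fallbacks(soup, selectors):
    """Return matches for the first CSS selector that yields any results."""
    for selector in selectors:
        matches = soup.select(selector)
        if matches:
            return matches
    return []

html = '<ul><li class="product-card"><h3>Mug</h3></li></ul>'  # toy markup
soup = BeautifulSoup(html, 'html.parser')

# Try the current markup first, then older or alternative patterns
cards = select_with_fallbacks(soup, ['div.product-item', 'li.product-card', '[data-testid="product"]'])
print(len(cards))  # 1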
Text Content and Natural Language
Sometimes the data you need is buried within paragraphs of text, requiring text processing and potentially natural language processing techniques.
import re
from bs4 import BeautifulSoup

def extract_phone_numbers(text):
    """Extract phone numbers from text content"""
    phone_pattern = r'\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}'
    return re.findall(phone_pattern, text)

def extract_email_addresses(text):
    """Extract email addresses from text content"""
    email_pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
    return re.findall(email_pattern, text)
# Example usage
html_content = """
<div class="contact-info">
<p>For more information, please contact us at info@example.com
or call us at (555) 123-4567. You can also reach our sales team
at sales@example.com or 555.987.6543.</p>
</div>
"""
soup = BeautifulSoup(html_content, 'html.parser')
text = soup.get_text()
phones = extract_phone_numbers(text)
emails = extract_email_addresses(text)
print(f"Found phones: {phones}")
print(f"Found emails: {emails}")
Dynamic Data: The Moving Target
Modern web applications increasingly rely on JavaScript to load and manipulate data dynamically. This presents unique challenges for scrapers.
JavaScript-Rendered Content
Single Page Applications (SPAs) and heavy JavaScript sites require browser automation tools to render content before extraction.
from playwright.sync_api import sync_playwright
def scrape_dynamic_content(url):
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()

        # Navigate and wait for content to load
        page.goto(url)
        page.wait_for_selector('.dynamic-content', timeout=10000)

        # Extract data after JavaScript execution
        products = page.eval_on_selector_all(
            '.product-item',
            'elements => elements.map(el => ({name: el.querySelector("h3")?.textContent, price: el.querySelector(".price")?.textContent}))'
        )

        browser.close()
        return products
Real-Time Data Streams
Some applications provide real-time data through WebSockets or Server-Sent Events (SSE). Capturing this requires different approaches.
import asyncio
import websockets
import json
async def monitor_websocket_data(uri):
    async with websockets.connect(uri) as websocket:
        try:
            async for message in websocket:
                data = json.loads(message)
                # Process real-time data
                print(f"Received: {data}")
        except websockets.exceptions.ConnectionClosed:
            print("WebSocket connection closed")
# Usage
# asyncio.run(monitor_websocket_data("wss://example.com/live-data"))
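Server-Sent Events are simpler than WebSockets: the server keeps one HTTP response open and pushes events as lines prefixed with "data:". A minimal sketch using requests in streaming mode (the endpoint URL is a placeholder):

import json
import requests

def monitor_sse_stream(url):
    # stream=True keeps the connection open and yields lines as they arrive
    with requests.get(url, stream=True, headers={'Accept': 'text/event-stream'}) as response:
        for line in response.iter_lines(decode_unicode=True):
            if line and line.startswith('data:'):
                payload = line[len('data:'):].strip()
                try:
                    print(json.loads(payload))
                except json.JSONDecodeError:
                    print(payload)  # not every event body is JSON

# Usage
# monitor_sse_stream('https://example.com/live-events')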
Binary Data and Media Files
Not all web data is text-based. You’ll often need to handle images, PDFs, videos, and other binary formats.
Image Data Extraction
Images might contain data in their metadata or require OCR processing to extract text content.
import requests
from PIL import Image
from PIL.ExifTags import TAGS
import pytesseract
from io import BytesIO
def extract_image_metadata(image_url):
    response = requests.get(image_url)
    image = Image.open(BytesIO(response.content))

    # Extract EXIF data
    exifdata = image.getexif()
    metadata = {}
    for tag_id in exifdata:
        tag = TAGS.get(tag_id, tag_id)
        data = exifdata.get(tag_id)
        metadata[tag] = data
    return metadata

def extract_text_from_image(image_url):
    response = requests.get(image_url)
    image = Image.open(BytesIO(response.content))

    # Use OCR to extract text
    text = pytesseract.image_to_string(image)
    return text.strip()
PDF Data Extraction
Many websites serve data in PDF format, requiring specialized parsing libraries.
import PyPDF2
import requests
from io import BytesIO
def extract_pdf_text(pdf_url):
    response = requests.get(pdf_url)
    pdf_file = BytesIO(response.content)
    reader = PyPDF2.PdfReader(pdf_file)

    text_content = ""
    for page in reader.pages:
        text_content += page.extract_text()
    return text_content
Database-Like Structures in Web Applications
Some web applications expose database-like interfaces through their APIs or embed database queries in their responses.
GraphQL Endpoints
GraphQL provides a query language that allows clients to request specific data structures, offering more control than traditional REST APIs.
import requests
def query_graphql(endpoint, query, variables=None):
    payload = {
        'query': query,
        'variables': variables or {}
    }
    response = requests.post(endpoint, json=payload)
    return response.json()

# Example GraphQL query
query = """
query GetUserPosts($userId: ID!) {
  user(id: $userId) {
    name
    posts {
      title
      content
      createdAt
    }
  }
}
"""

result = query_graphql(
    'https://api.example.com/graphql',
    query,
    {'userId': '123'}
)
Handling Data Quality Issues
Real-world data is messy. You’ll encounter encoding problems, malformed structures, and inconsistent formats that require robust error handling.
import re
import chardet
from urllib.parse import unquote

def clean_extracted_data(raw_data):
    """Clean and normalize extracted data"""
    if not raw_data:
        return None

    # Handle encoding issues
    if isinstance(raw_data, bytes):
        encoding = chardet.detect(raw_data)['encoding']
        raw_data = raw_data.decode(encoding or 'utf-8', errors='ignore')

    # Normalize whitespace
    cleaned = ' '.join(raw_data.split())

    # URL decode if necessary
    if '%' in cleaned:
        try:
            cleaned = unquote(cleaned)
        except Exception:
            pass

    return cleaned.strip()

def validate_extracted_email(email):
    """Validate extracted email addresses"""
    pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
    return re.match(pattern, email) is not None
Understanding the landscape of web data types transforms you from someone who writes scrapers to someone who architects data extraction solutions. Each data type requires different tools, techniques, and strategies. The key is recognizing patterns, building flexible extraction logic, and always having fallback plans for when websites inevitably change.
What’s the most unusual or challenging data format you’ve encountered while scraping? Share your war stories and the creative solutions you developed to tackle them—every scraper has that one website that pushed their skills to the limit!