"Some Characters Could Not Be Decoded": Fixing Replacement Character Errors
You are scraping a page or reading a file, and the output is peppered with diamonds: �. Or maybe your logs print a warning like “some characters could not be decoded, and were replaced with replacement character.” That diamond is U+FFFD, the official Unicode replacement character. It appears whenever a decoder encounters a byte sequence that is invalid for the encoding it was told to use. The data is not necessarily corrupt – your program is just interpreting the bytes with the wrong codebook. Once you understand what causes the substitution and how Python’s codec machinery works, fixing it is straightforward.
What Causes the Replacement Character
Every string of text you see on screen started life as a sequence of bytes. An encoding is the rule that maps those bytes to characters. UTF-8 maps the two-byte sequence 0xC3 0xA9 to the character é (e with an acute accent). Latin-1 maps the single byte 0xE9 to the same character. If you hand a Latin-1 byte stream to a UTF-8 decoder, the decoder will hit byte sequences that violate UTF-8’s rules. What happens next depends on the error mode. The three outcomes you will meet most often are: raise an exception, drop the offending bytes, or substitute the replacement character.

    # Latin-1 encoded bytes for the word "café"
    raw = b"caf\xe9"

    # Decoding with the wrong encoding (UTF-8) in strict mode
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError as e:
        print(e)
        # 'utf-8' codec can't decode byte 0xe9 in position 3:
        # unexpected end of data

    # Decoding with the wrong encoding in replace mode
    text = raw.decode("utf-8", errors="replace")
    print(text)
    # caf�

    # ascii() escapes non-ASCII characters; repr() would print the diamond as-is
    print(ascii(text))
    # 'caf\ufffd'

The replacement character is the decoder’s way of saying “I found bytes here that do not form a valid character in the encoding you specified, so I am putting a placeholder instead.” The original bytes are gone once this substitution happens. If you decoded and stored the result, you have already lost data.
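To make the data loss concrete, here is a minimal sketch that decodes in replace mode and then tries to get the original bytes back. The variable names are illustrative only.

```python
raw = b"caf\xe9"  # Latin-1 bytes for "café"

# Lossy path: the invalid byte is swapped for U+FFFD during decoding.
lossy = raw.decode("utf-8", errors="replace")
round_tripped = lossy.encode("utf-8")

print(round_tripped)         # b'caf\xef\xbf\xbd'
print(round_tripped == raw)  # False: the 0xE9 byte cannot be recovered

# Safe path: keep the bytes around and decode them again with the right codec.
print(raw.decode("latin-1"))  # café
```

The practical takeaway is to keep the raw bytes until you are sure about the encoding, and only then store the decoded text.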
Where You See It
Replacement characters show up in predictable places:
- Web scraping. A server sends Latin-1 or Windows-1252 content but the Content-Type header says utf-8, or says nothing at all. Your HTTP library defaults to UTF-8 and the accented characters become diamonds.
- File reading. You open a CSV exported from Excel on a Windows machine. Excel uses Windows-1252 by default. Python 3’s open() defaults to your system locale, which on Linux and macOS is usually UTF-8.
- Database imports. Data migrated from an older system encoded in ISO-8859-1 gets loaded into a UTF-8 column without conversion.
- API responses. A JSON endpoint serves content scraped from multiple sources. Some of those sources were encoded differently, and the aggregator did not normalize before serializing.
The root cause is always the same: the bytes were written in encoding A, and something is reading them as encoding B. For a deeper look at these common encoding problems and fixes, see our companion guide.
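The "written as A, read as B" pattern is easy to reproduce on disk. The sketch below writes a small Windows-1252 file into a temporary directory and reads it back twice. The file name and sample rows are made up for illustration.

```python
import tempfile
from pathlib import Path

# Simulate a CSV exported by a Windows-1252 tool (hypothetical sample data).
sample = "Name,City\nJos\u00e9,Montr\u00e9al\n"

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "export.csv"
    path.write_bytes(sample.encode("windows-1252"))

    # Reading as encoding B (UTF-8) fails in strict mode...
    try:
        path.read_text(encoding="utf-8")
        utf8_ok = True
    except UnicodeDecodeError as exc:
        utf8_ok = False
        print(f"UTF-8 read failed at byte {exc.start}: {exc.reason}")

    # ...while reading as encoding A (the one used to write) succeeds.
    recovered = path.read_text(encoding="windows-1252")

print(recovered == sample)  # True
```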
Python’s Error Handling Modes
Python’s bytes.decode() method and the built-in open() function both accept an errors parameter that controls what happens when the decoder hits invalid bytes. Understanding these modes is the first step toward a fix.

    raw = b"R\xe9sum\xe9"  # "Résumé" in Latin-1

    # strict: raises UnicodeDecodeError (default for bytes.decode)
    try:
        raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        print("strict mode raised an exception")

    # replace: inserts U+FFFD for each invalid byte sequence
    print(raw.decode("utf-8", errors="replace"))
    # R�sum�

    # ignore: silently drops undecodable bytes
    print(raw.decode("utf-8", errors="ignore"))
    # Rsum

    # backslashreplace: shows the byte value as an escape sequence
    print(raw.decode("utf-8", errors="backslashreplace"))
    # R\xe9sum\xe9

    # surrogateescape: maps bytes to lone surrogates (useful for roundtripping)
    # ascii() is used here because printing lone surrogates directly can
    # raise UnicodeEncodeError on a UTF-8 terminal
    print(ascii(raw.decode("utf-8", errors="surrogateescape")))
    # 'R\udce9sum\udce9' (surrogates, not real characters)

The replace mode is what produces the � characters you see. The backslashreplace mode is useful for debugging because it preserves the original byte values in a readable form. The ignore mode is almost never what you want because it silently destroys data without any indication that something went wrong.
When you open a file with Python’s built-in open(), the default error mode is strict, so an undecodable byte raises UnicodeDecodeError rather than silently producing diamonds. What varies by platform is the default encoding: unless you pass encoding explicitly, open() uses the locale’s preferred encoding. Since Python 3.10 you can run the interpreter with -X warn_default_encoding to get an EncodingWarning wherever that implicit default is used. The requests library uses replace internally when it decodes response.text, which is why you see diamonds in scraped content instead of exceptions.
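The errors parameter works the same way for open() as for bytes.decode(). As a small sketch of the roundtripping use case for surrogateescape, the code below copies a file containing one undecodable byte without damaging it. The file names and contents are invented for the example.

```python
import tempfile
from pathlib import Path

original = b"id,name\n1,Jos\xe9\n"  # one Latin-1 byte in otherwise ASCII data

with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "in.csv"
    dst = Path(tmp) / "out.csv"
    src.write_bytes(original)

    # newline="" keeps line endings untouched; surrogateescape carries the
    # undecodable byte through as a lone surrogate (U+DCE9).
    with open(src, encoding="utf-8", errors="surrogateescape", newline="") as f:
        text = f.read()

    # Writing with the same error handler turns the surrogate back into 0xE9.
    with open(dst, "w", encoding="utf-8", errors="surrogateescape", newline="") as f:
        f.write(text)

    copied = dst.read_bytes()

print(copied == original)  # True
print("\udce9" in text)    # True
```

This only preserves the bytes. The text is still not correctly decoded, so it suits pass-through jobs rather than anything that displays or analyzes the content.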
Diagnosing the Issue
Before you can fix the encoding, you need to see the raw bytes. If you are working with an HTTP response, use response.content (bytes) instead of response.text (string). If you are reading a file, open it in binary mode.

    import requests

    response = requests.get("https://example.com/page")

    # Do NOT use response.text if you suspect encoding issues
    # Instead, examine the raw bytes
    raw = response.content

    # Look at the first 200 bytes
    print(raw[:200])

    # Check what encoding requests thinks it should use
    print(response.encoding)
    # This might say 'ISO-8859-1' even if the page is actually UTF-8

    # Check the Content-Type header
    print(response.headers.get("Content-Type"))
    # e.g. text/html; charset=windows-1252

For files, open in binary mode to see exactly what is on disk:

    with open("data.csv", "rb") as f:
        raw = f.read(500)

    print(raw)
    # b'Name,City\r\nJos\xe9,Montr\xe9al\r\n...'
    # Those \xe9 bytes tell you this is likely Latin-1 or Windows-1252

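Look for byte patterns. Bytes in the range 0x80-0xFF that are not part of valid multi-byte UTF-8 sequences indicate a single-byte encoding like Latin-1 or Windows-1252. Valid UTF-8 multi-byte characters start with a lead byte that announces the length: two-byte sequences start with 0xC2-0xDF, three-byte sequences with 0xE0-0xEF, and four-byte sequences with 0xF0-0xF4. Every byte after the lead byte must be a continuation byte in the range 0x80-0xBF. The bit patterns nominally cover 0xC0-0xDF and 0xF0-0xF7, but 0xC0, 0xC1, and 0xF5-0xF7 never occur in valid UTF-8.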
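When you are staring at a hex dump, a tiny classifier can help. This is a sketch with a hypothetical helper name, classify_utf8_byte. It applies the strict rule that 0xC0, 0xC1, and 0xF5-0xFF never appear in valid UTF-8.

```python
def classify_utf8_byte(value: int) -> str:
    """Describe the role a single byte value (0-255) can play in UTF-8."""
    if value <= 0x7F:
        return "ascii"
    if value <= 0xBF:
        return "continuation"
    if value in (0xC0, 0xC1) or value >= 0xF5:
        return "never valid"
    if value <= 0xDF:
        return "lead of 2-byte sequence"
    if value <= 0xEF:
        return "lead of 3-byte sequence"
    return "lead of 4-byte sequence"


# The same word twice: first as UTF-8, then as Latin-1.
for b in b"Montr\xc3\xa9al vs Montr\xe9al":
    if b >= 0x80:
        print(f"0x{b:02X}: {classify_utf8_byte(b)}")
# 0xC3: lead of 2-byte sequence
# 0xA9: continuation
# 0xE9: lead of 3-byte sequence
```

The last line is the giveaway: 0xE9 claims to start a three-byte sequence, but the next byte is the ASCII letter "a", so the data is not UTF-8 at that point.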

Fix 1: Use the Correct Encoding
The cleanest fix is to decode with the encoding the data was actually written in. If you can determine it from the source, use it directly.

    import requests

    # You inspected the bytes and found single-byte accented characters
    raw = b"caf\xe9"

    # Decode with the correct encoding
    text = raw.decode("latin-1")
    print(text)
    # café

    # For web responses, override the encoding before accessing .text
    response = requests.get("https://example.com/page")
    response.encoding = "windows-1252"  # Set the correct encoding
    text = response.text  # Now decoded with the right codec

When you do not know the encoding, use a charset detection library. charset-normalizer (the default in recent versions of requests) and chardet both analyze byte patterns to guess the encoding.

    import charset_normalizer

    raw = b"R\xe9sum\xe9 du projet"
    results = charset_normalizer.from_bytes(raw)
    best = results.best()

    if best is None:
        print("No plausible encoding found")
    else:
        print(best.encoding)
        # e.g. cp1252 (Windows-1252); very short inputs can produce other guesses
        print(str(best))
        # Résumé du projet (when the guess is right)


    import chardet

    raw = b"R\xe9sum\xe9 du projet"
    detected = chardet.detect(raw)
    print(detected)
    # e.g. {'encoding': 'ISO-8859-1', 'confidence': 0.73, 'language': ''}

    # detected["encoding"] is None when chardet cannot make a guess
    text = raw.decode(detected["encoding"] or "windows-1252")
    print(text)
    # Résumé du projet (when the guess is right)

Both latin-1 (ISO-8859-1) and windows-1252 will decode the same byte correctly in this case. The difference between them matters for bytes in the range 0x80-0x9F: Latin-1 maps those to control characters, while Windows-1252 maps them to printable characters like curly quotes, em dashes, and the euro sign. In practice, if you are dealing with Western European text from the web, Windows-1252 is almost always the right choice over strict ISO-8859-1.
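A quick way to see the difference is to decode a few bytes from the 0x80-0x9F range with both codecs. The byte values below are chosen only for illustration.

```python
raw = bytes([0x80, 0x93, 0x94])  # euro sign and curly double quotes in Windows-1252

as_cp1252 = raw.decode("windows-1252")
as_latin1 = raw.decode("latin-1")

print(as_cp1252)         # €“”
print(ascii(as_latin1))  # '\x80\x93\x94' (invisible C1 control characters)

# Python's cp1252 codec leaves five byte values unassigned
# (0x81, 0x8D, 0x8F, 0x90, 0x9D), while latin-1 accepts every byte.
try:
    bytes([0x81]).decode("windows-1252")
    cp1252_accepts_0x81 = True
except UnicodeDecodeError:
    cp1252_accepts_0x81 = False
print(cp1252_accepts_0x81)  # False
```

Because Latin-1 never fails, a successful latin-1 decode proves nothing about whether the guess was right. Invisible control characters in the output are a hint that Windows-1252 was the real encoding.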
Fix 2: Decode with Replace Then Clean Up
Sometimes you need to process the text even if a few characters are undecodable. Decode with errors="replace", then search for and handle the replacement characters.

    raw = b"Price: \x80100\nName: Caf\xe9 Noir"

    # Decode as UTF-8 with replacement
    text = raw.decode("utf-8", errors="replace")
    print(text)
    # Price: �100
    # Name: Caf� Noir

    # Count how many replacements occurred
    replacement_count = text.count("\ufffd")
    print(f"Found {replacement_count} replacement characters")

    # Find positions of replacement characters.
    # Caveat: a character index equals the byte index here only because every
    # other character is ASCII; multi-byte characters would shift the offsets.
    for i, char in enumerate(text):
        if char == "\ufffd":
            print(f"  Position {i}: original byte was 0x{raw[i]:02x}")

This approach is useful when the majority of the content is valid UTF-8 and only a handful of bytes are in a different encoding. You can log the positions, decode those segments separately, or replace the diamonds with a known fallback.
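If the content contains real multi-byte characters, character positions in the decoded text no longer line up with byte offsets in the raw data. One way around that, sketched here with a hypothetical helper called find_undecodable, is to let UnicodeDecodeError report the byte offsets directly.

```python
def find_undecodable(raw: bytes, encoding: str = "utf-8") -> list[tuple[int, bytes]]:
    """Return (byte_offset, offending_bytes) for every span that fails to decode."""
    problems = []
    pos = 0
    while pos < len(raw):
        try:
            raw[pos:].decode(encoding)
            break  # the rest of the data decodes cleanly
        except UnicodeDecodeError as exc:
            # exc.start and exc.end are relative to the slice we decoded
            start, end = pos + exc.start, pos + exc.end
            problems.append((start, raw[start:end]))
            pos = end
    return problems


print(find_undecodable(b"Price: \x80100\nName: Caf\xe9 Noir"))
# [(7, b'\x80'), (21, b'\xe9')]

# The euro sign takes three bytes, so the bad byte sits at byte offset 4
# even though it would be character index 2 in the decoded text.
print(find_undecodable(b"\xe2\x82\xac \xe9"))
# [(4, b'\xe9')]
```

The loop re-decodes the tail after every failure, which is fine for a handful of bad bytes but slow for data that is mostly in the wrong encoding.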

    def decode_mixed(raw: bytes, primary: str = "utf-8", fallback: str = "windows-1252") -> str:
        """Decode bytes that might mix UTF-8 with a single-byte encoding.

        Try each character as UTF-8 first. For any byte that does not start
        a valid UTF-8 sequence, fall back to the secondary encoding.
        (The lead-byte checks below assume the primary encoding is UTF-8.)
        """
        result = []
        i = 0
        while i < len(raw):
            first = raw[i]
            # Work out how long a UTF-8 sequence starting here would be
            if first & 0x80 == 0:
                length = 1  # ASCII byte
            elif first & 0xE0 == 0xC0:
                length = 2  # Two-byte UTF-8 sequence
            elif first & 0xF0 == 0xE0:
                length = 3  # Three-byte UTF-8 sequence
            elif first & 0xF8 == 0xF0:
                length = 4  # Four-byte UTF-8 sequence
            else:
                length = 0  # Not a valid UTF-8 start byte

            if length:
                try:
                    result.append(raw[i:i + length].decode(primary))
                    i += length
                    continue
                except UnicodeDecodeError:
                    pass  # Sequence was incomplete or invalid, try fallback

            # Fall back one byte at a time. errors="replace" covers the few
            # byte values that Windows-1252 leaves unassigned (such as 0x81),
            # which would otherwise raise an uncaught exception here.
            result.append(raw[i:i + 1].decode(fallback, errors="replace"))
            i += 1
        return "".join(result)


    raw = b"Price: \xe2\x82\xac100\nCaf\xe9 Noir"
    # Contains a valid UTF-8 euro sign AND a Latin-1 accented e
    print(decode_mixed(raw))
    # Price: €100
    # Café Noir

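The same idea can also be expressed through Python's codec error-handler hook instead of a hand-written loop. This is a sketch, not a drop-in replacement: the handler name "cp1252_fallback" is made up for this example, and registering it affects the whole process.

```python
import codecs


def _cp1252_fallback(exc: UnicodeError) -> tuple[str, int]:
    """Error handler: decode the bytes UTF-8 rejected as Windows-1252."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    # errors="replace" covers the byte values cp1252 leaves unassigned.
    return bad.decode("windows-1252", errors="replace"), exc.end


# "cp1252_fallback" is a name invented for this sketch.
codecs.register_error("cp1252_fallback", _cp1252_fallback)

raw = b"Price: \xe2\x82\xac100\nCaf\xe9 Noir"
mixed = raw.decode("utf-8", errors="cp1252_fallback")
print(mixed)
# Price: €100
# Café Noir
```

The decoder does the sequence-length bookkeeping itself and only calls the handler for the spans it cannot decode, so the valid UTF-8 parts take the fast path.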
Fix 3: Use ftfy to Fix Double-Encoded Text
Double encoding is a particularly nasty variant. It happens when text is encoded to bytes, then those bytes are mistakenly treated as characters in a different encoding and encoded again. The result is garbled multi-byte sequences where you expected single characters.

    # How double encoding happens
    original = "caf\u00e9"              # "café"
    step1 = original.encode("utf-8")    # b'caf\xc3\xa9'
    step2 = step1.decode("latin-1")     # 'cafÃ©' (mojibake)
    step3 = step2.encode("utf-8")       # b'caf\xc3\x83\xc2\xa9'

    # Now you have double-encoded bytes
    print(step3.decode("utf-8"))
    # cafÃ© (mojibake: two garbled characters where the é should be)

The ftfy library specializes in detecting and reversing these mangled encodings.

    import ftfy

    # Classic mojibake from UTF-8 decoded as Latin-1
    garbled = "Caf\u00c3\u00a9"  # displays as "CafÃ©"
    print(ftfy.fix_text(garbled))
    # Café

    # Curly quotes are straightened by default (the uncurl_quotes option)
    garbled2 = "Smart quotes: \u201cHello\u201d"
    fixed = ftfy.fix_text(garbled2)
    print(fixed)
    # Smart quotes: "Hello"

    # explain_unicode describes the characters in a string, which helps
    # when you need to see exactly what the mojibake is made of
    ftfy.explain_unicode(garbled)
    # Prints one line per character with its code point, category, and name

ftfy works by recognizing common patterns of encoding errors and reversing them. It handles the most frequent cases: UTF-8 decoded as Latin-1, UTF-8 decoded as Windows-1252, and various combinations of double and triple encoding. If your data looks like it went through a series of wrong encoding/decoding steps, ftfy is the tool to reach for before trying to decode garbled text manually. It cannot bring back characters that were already replaced with U+FFFD, because the original bytes are gone by then.
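For the simplest case, a single round of UTF-8 decoded as Latin-1, the reversal can be sketched by hand. The helper name undo_utf8_as_latin1 is hypothetical, and ftfy handles far more cases than this does.

```python
def undo_utf8_as_latin1(text: str) -> str:
    """Reverse one round of 'UTF-8 bytes decoded as Latin-1' mojibake.

    Returns the input unchanged when the reversal is not possible, which
    usually means the text was not mangled this way.
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


print(undo_utf8_as_latin1("caf\u00c3\u00a9"))  # café (input was "cafÃ©")
print(undo_utf8_as_latin1("caf\u00e9"))        # café (already correct, left alone)
```

The try/except matters: blindly re-encoding correct text would either fail or corrupt it, so the sketch only accepts the result when the round trip yields valid UTF-8.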

Fix 4: Read as Binary and Detect Before Decoding
The safest approach for files with unknown encoding is to always read binary first, detect the encoding, then decode.

    import charset_normalizer
    from pathlib import Path


    def read_unknown_encoding(file_path: str) -> str:
        """Read a text file with unknown encoding."""
        raw = Path(file_path).read_bytes()

        # Try UTF-8 first (most common on modern systems)
        try:
            text = raw.decode("utf-8")
            # Check for BOM (byte order mark) and strip it
            if text.startswith("\ufeff"):
                text = text[1:]
            return text
        except UnicodeDecodeError:
            pass

        # Fall back to detection
        result = charset_normalizer.from_bytes(raw).best()
        if result is None:
            raise ValueError(f"Could not detect encoding for {file_path}")

        print(f"Detected encoding: {result.encoding} "
              f"(coherence: {result.coherence})")
        return str(result)


    # Usage
    text = read_unknown_encoding("/path/to/mystery_file.csv")

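A byte order mark, when present, is the cheapest encoding hint available, so it can be worth checking before any statistical detection. The sketch below uses the BOM constants from the standard codecs module. The helper name sniff_bom is an assumption of this example.

```python
import codecs
from typing import Optional

# Longer BOMs first: the UTF-32 LE BOM begins with the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_bom(raw: bytes) -> Optional[str]:
    """Return a codec name that consumes the BOM, or None if there is no BOM."""
    for bom, codec in _BOMS:
        if raw.startswith(bom):
            return codec
    return None


data = "h\u00e9llo".encode("utf-16")  # Python writes a BOM for "utf-16"
codec = sniff_bom(data)
print(codec)               # utf-16
print(data.decode(codec))  # héllo
print(sniff_bom(b"no bom here"))  # None
```

The returned codec names ("utf-8-sig", "utf-16", "utf-32") all strip the BOM while decoding, so no manual slicing is needed afterwards.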
For CSV files specifically, Python’s csv module works with text streams, so you need to handle the encoding at the file-opening step:

    import csv
    from pathlib import Path

    import charset_normalizer


    def read_csv_any_encoding(file_path: str) -> list[dict]:
        """Read a CSV file regardless of its encoding."""
        # read_bytes() closes the file handle for us
        raw = Path(file_path).read_bytes()

        # Detect encoding
        detection = charset_normalizer.from_bytes(raw).best()
        encoding = detection.encoding if detection else "utf-8"

        # newline="" lets the csv module handle line endings itself
        with open(file_path, "r", encoding=encoding, errors="replace", newline="") as f:
            rows = list(csv.DictReader(f))

        # Warn if any replacement characters slipped through
        for i, row in enumerate(rows):
            for key, value in row.items():
                if isinstance(value, str) and "\ufffd" in value:
                    print(f"Warning: replacement char in row {i}, "
                          f"column '{key}': {value!r}")
        return rows

Prevention in Web Scraping
Most replacement character issues in scraping come from trusting the wrong encoding declaration. Here is how to prevent them.

    import re

    import requests
    import charset_normalizer


    def scrape_with_encoding_safety(url: str) -> str:
        """Fetch a URL and decode its content with proper encoding handling."""
        response = requests.get(url)

        # Step 1: Check the Content-Type header
        content_type = response.headers.get("Content-Type", "")
        print(f"Content-Type: {content_type}")

        # Step 2: Work with raw bytes, not response.text
        raw = response.content

        # Step 3: Look for encoding in the HTML meta tag
        # (only check the first 2KB to avoid scanning the whole thing)
        head_bytes = raw[:2048]
        meta_encoding = None

        # Check for <meta charset="...">
        match = re.search(rb'charset=["\']?([a-zA-Z0-9_-]+)', head_bytes)
        if match:
            meta_encoding = match.group(1).decode("ascii")
            print(f"Meta charset: {meta_encoding}")

        # Step 4: Try encodings in order of reliability
        # 1. Explicit meta charset in the HTML
        # 2. Content-Type header charset
        # 3. Detection from byte patterns
        # 4. UTF-8 as a last resort
        if meta_encoding:
            try:
                return raw.decode(meta_encoding)
            except (UnicodeDecodeError, LookupError):
                pass

        if response.encoding and response.encoding.lower() != "iso-8859-1":
            # requests defaults to ISO-8859-1 for text/* content types
            # per RFC 2616, which is often wrong
            try:
                return raw.decode(response.encoding)
            except (UnicodeDecodeError, LookupError):
                pass

        # Auto-detect
        detected = charset_normalizer.from_bytes(raw).best()
        if detected:
            print(f"Detected: {detected.encoding}")
            return str(detected)

        # Final fallback
        return raw.decode("utf-8", errors="replace")

A critical detail: the requests library has a quirk where it defaults response.encoding to ISO-8859-1 for any response with a text/* content type that does not include a charset parameter. This is technically correct per the old HTTP/1.1 spec (RFC 2616), but in practice almost all modern web content without an explicit charset is UTF-8. Because every byte value is valid in ISO-8859-1, this default never raises an error and never inserts replacement characters. Instead, response.text silently turns each UTF-8 multi-byte character into mojibake (é becomes Ã©), which is just as wrong and harder to notice.

    import requests

    response = requests.get("https://example.com")

    # This is often wrong -- requests defaults to ISO-8859-1
    print(response.encoding)
    # 'ISO-8859-1'

    # Use apparent_encoding instead, which uses charset_normalizer
    print(response.apparent_encoding)
    # 'utf-8'

    # Override before accessing .text
    response.encoding = response.apparent_encoding
    clean_text = response.text

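To tell "the server declared this charset" apart from "requests filled in a default", you can parse the Content-Type header yourself. This sketch uses the standard library's email.message.Message for the parsing. The helper name charset_from_content_type is an assumption of this example and not part of requests.

```python
from email.message import Message
from typing import Optional


def charset_from_content_type(header_value: str) -> Optional[str]:
    """Return the declared charset in lower case, or None if none was declared."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()


print(charset_from_content_type("text/html; charset=Windows-1252"))  # windows-1252
print(charset_from_content_type('text/html; charset="utf-8"'))       # utf-8
print(charset_from_content_type("text/html"))                        # None
```

A None result means the header gave no charset at all, so any value in response.encoding came from a default and deserves less trust than the bytes themselves.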
Complete Workflow
Here is a complete scraping workflow that handles encoding properly from start to finish.

    import requests
    import charset_normalizer
    import re
    from dataclasses import dataclass


    @dataclass
    class ScrapedContent:
        url: str
        encoding_used: str
        encoding_source: str  # "meta", "header", "detected", "fallback"
        text: str
        had_replacements: bool


    def scrape_safely(url: str) -> ScrapedContent:
        """Scrape a URL with robust encoding handling."""
        response = requests.get(url, timeout=30)
        raw = response.content

        # Try meta charset first
        match = re.search(rb'charset=["\']?([a-zA-Z0-9_-]+)', raw[:4096])
        if match:
            encoding = match.group(1).decode("ascii")
            try:
                text = raw.decode(encoding)
                return ScrapedContent(
                    url=url,
                    encoding_used=encoding,
                    encoding_source="meta",
                    text=text,
                    had_replacements=False,
                )
            except (UnicodeDecodeError, LookupError):
                pass

        # Try header charset (skip the requests ISO-8859-1 default)
        ct = response.headers.get("Content-Type", "")
        ct_match = re.search(r'charset=([a-zA-Z0-9_-]+)', ct)
        if ct_match:
            encoding = ct_match.group(1)
            try:
                text = raw.decode(encoding)
                return ScrapedContent(
                    url=url,
                    encoding_used=encoding,
                    encoding_source="header",
                    text=text,
                    had_replacements=False,
                )
            except (UnicodeDecodeError, LookupError):
                pass

        # Auto-detect
        detected = charset_normalizer.from_bytes(raw).best()
        if detected:
            text = str(detected)
            return ScrapedContent(
                url=url,
                encoding_used=detected.encoding,
                encoding_source="detected",
                text=text,
                had_replacements=False,
            )

        # Fallback to UTF-8 with replacement
        text = raw.decode("utf-8", errors="replace")
        return ScrapedContent(
            url=url,
            encoding_used="utf-8",
            encoding_source="fallback",
            text=text,
            had_replacements="\ufffd" in text,
        )


    # Usage
    result = scrape_safely("https://example.com/page")
    print(f"Encoding: {result.encoding_used} (from {result.encoding_source})")
    if result.had_replacements:
        count = result.text.count("\ufffd")
        print(f"Warning: {count} characters could not be decoded")

Common Encoding Pairs That Cause Problems
Most replacement character errors come from a small number of encoding mismatches. Knowing the common pairs helps you diagnose issues faster.
| Actual Encoding | Assumed Encoding | What Happens |
|---|---|---|
| Windows-1252 | UTF-8 | Accented characters (é, ñ, ü) become � |
| UTF-8 | Latin-1 | Multi-byte characters become mojibake (Ã© instead of é) |
| UTF-8 | ASCII | Anything above 127 raises an error or becomes � |
| Shift_JIS | UTF-8 | Japanese text becomes a mix of � and garbage |
| GB2312/GBK | UTF-8 | Chinese text becomes unreadable |
| UTF-8 with BOM | UTF-8 | The BOM (\xef\xbb\xbf) appears as \ufeff at the start |
The Windows-1252 vs UTF-8 mismatch is by far the most common in web scraping of English and Western European content. Windows-1252 matches ISO-8859-1 everywhere except the 0x80-0x9F range, where it replaces rarely used control codes with printable characters, and it was the default encoding on Windows for decades. Many older websites, databases, and file exports still use it.

    # The problematic byte ranges
    # These bytes are valid in Windows-1252 but invalid on their own in UTF-8
    problem_bytes = {
        0x80: "euro sign",
        0x85: "horizontal ellipsis",
        0x91: "left single curly quote",
        0x92: "right single curly quote",
        0x93: "left double curly quote",
        0x94: "right double curly quote",
        0x96: "en dash",
        0x97: "em dash",
        0xA0: "non-breaking space",
        0xE9: "e with acute accent",
        0xF1: "n with tilde",
        0xFC: "u with diaeresis",
    }

    for byte_val, description in problem_bytes.items():
        raw = bytes([byte_val])
        win1252 = raw.decode("windows-1252")
        try:
            utf8 = raw.decode("utf-8")
        except UnicodeDecodeError:
            utf8 = "(invalid)"
        print(f"0x{byte_val:02X}: {description:30s} "
              f"win1252={win1252!r:6s} utf8={utf8}")

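The mismatches in the table can be reproduced in a few lines. This sketch uses a hypothetical helper, misread, that encodes a sample with the actual encoding and decodes it with the assumed one in replace mode. The sample strings are chosen only for illustration.

```python
def misread(text: str, actual: str, assumed: str) -> str:
    """Encode text as `actual`, then decode it as `assumed` in replace mode."""
    return text.encode(actual).decode(assumed, errors="replace")


cases = [
    ("\u00e9", "windows-1252", "utf-8"),     # é
    ("\u00e9", "utf-8", "latin-1"),          # é
    ("\u00e9", "utf-8", "ascii"),            # é
    ("\u65e5\u672c", "shift_jis", "utf-8"),  # 日本
    ("\u4e2d\u6587", "gbk", "utf-8"),        # 中文
]

for text, actual, assumed in cases:
    result = misread(text, actual, assumed)
    # ascii() makes replacement characters visible as \ufffd escapes
    print(f"{actual:>12} read as {assumed:<8}: {ascii(result)}")
```

The first three rows show the pattern worth memorizing: a lone diamond points to a single-byte encoding read as UTF-8, two odd Latin characters point to UTF-8 read as a single-byte encoding, and one diamond per byte points to an ASCII decoder.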
Quick Reference
When you see � in your output, work through this checklist:
- Get the raw bytes. Use response.content for HTTP, open(path, "rb") for files.
- Look at the bytes around the diamond. Single bytes in 0x80-0xFF suggest a single-byte encoding. Multi-byte sequences starting with 0xC3 followed by a byte in 0x80-0xBF are likely valid UTF-8.
- Check the declared encoding. Look at Content-Type headers, HTML meta tags, or file metadata. If it says one thing but the bytes say another, trust the bytes.
- Try Windows-1252. It covers the most common cases for Western text.
- Use charset-normalizer or chardet. Let a detection library analyze the byte patterns.
- If it looks like mojibake, use ftfy. Double-encoded text has a distinctive look – accented characters turn into two-character sequences.
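The first few checklist steps can be rolled into a small triage helper. This is a standard-library-only sketch that assumes Western text, and the function name triage is made up for the example. A detection library is still the better choice for anything outside that assumption.

```python
def triage(raw: bytes) -> tuple[str, str]:
    """Return (encoding_used, text), trying UTF-8 first, then Windows-1252."""
    for encoding in ("utf-8", "windows-1252"):
        try:
            return encoding, raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Latin-1 maps every byte value, so this last step cannot fail.
    return "latin-1", raw.decode("latin-1")


print(triage(b"caf\xc3\xa9"))  # ('utf-8', 'café')
print(triage(b"caf\xe9"))      # ('windows-1252', 'café')
```

Trying UTF-8 first is deliberate: text in a single-byte encoding rarely happens to be valid UTF-8, so a successful strict UTF-8 decode is strong evidence, while a successful Windows-1252 decode is only a plausible guess.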
The replacement character is not a bug. It is your decoder telling you that it needs a different codebook. Listen to it, find the right encoding, and the diamonds will disappear.

