"Some Characters Could Not Be Decoded": Fixing Replacement Character Errors

You are scraping a page or reading a file, and the output is peppered with diamonds: �. Or maybe your logs print a warning like “some characters could not be decoded, and were replaced with replacement character.” That diamond is U+FFFD, the official Unicode replacement character. It appears whenever a decoder encounters a byte sequence that is invalid for the encoding it was told to use. The data is not necessarily corrupt – your program is just interpreting the bytes with the wrong codebook. Once you understand what causes the substitution and how Python’s codec machinery works, fixing it is straightforward.

What Causes the Replacement Character

Every string of text you see on screen started life as a sequence of bytes. An encoding is the rule that maps those bytes to characters. UTF-8 maps the byte 0xC3 0xA9 to the character e with an acute accent. Latin-1 maps the single byte 0xE9 to the same character. If you hand a Latin-1 byte stream to a UTF-8 decoder, the decoder will hit byte sequences that violate UTF-8’s rules. When that happens, it has three choices depending on the error mode: raise an exception, drop the offending bytes, or substitute the replacement character.

# Latin-1 encoded bytes for the word "café"
raw = b"caf\xe9"

# Decoding with the wrong encoding (UTF-8) in strict mode
try:
    text = raw.decode("utf-8")
except UnicodeDecodeError as e:
    print(e)
    # 'utf-8' codec can't decode byte 0xe9 in position 3:
    #   invalid continuation byte

# Decoding with the wrong encoding in replace mode
text = raw.decode("utf-8", errors="replace")
print(text)
# caf�
print(repr(text))
# 'caf\ufffd'

The replacement character is the decoder’s way of saying “I found bytes here that do not form a valid character in the encoding you specified, so I am putting a placeholder instead.” The original bytes are gone once this substitution happens. If you decoded and stored the result, you have already lost data.
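To see the loss concretely, decode with replacement and then encode the result back to bytes; the original byte cannot be recovered:

```python
raw = b"caf\xe9"  # Latin-1 bytes for "café"

# Decode with replacement, then encode the result back to bytes
text = raw.decode("utf-8", errors="replace")
round_tripped = text.encode("utf-8")

print(round_tripped)
# b'caf\xef\xbf\xbd' -- the original 0xe9 byte is gone,
# replaced by the three-byte UTF-8 encoding of U+FFFD
print(round_tripped == raw)
# False
```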

Where You See It

Replacement characters show up in predictable places:

  • Web scraping. A server sends Latin-1 or Windows-1252 content but the Content-Type header says utf-8, or says nothing at all. Your HTTP library defaults to UTF-8 and the accented characters become diamonds.
  • File reading. You open a CSV exported from Excel on a Windows machine. Excel uses Windows-1252 by default. Python 3’s open() defaults to your system locale, which on Linux and macOS is usually UTF-8.
  • Database imports. Data migrated from an older system encoded in ISO-8859-1 gets loaded into a UTF-8 column without conversion.
  • API responses. A JSON endpoint serves content scraped from multiple sources. Some of those sources were encoded differently, and the aggregator did not normalize before serializing.

The root cause is always the same: the bytes were written in encoding A, and something is reading them as encoding B. For a deeper look at these common encoding problems and fixes, see our companion guide.

Python’s Error Handling Modes

Python’s bytes.decode() method and the built-in open() function both accept an errors parameter that controls what happens when the decoder hits invalid bytes. Understanding these modes is the first step toward a fix.

raw = b"R\xe9sum\xe9"  # "Résumé" in Latin-1

# strict: raises UnicodeDecodeError (default for bytes.decode)
try:
    raw.decode("utf-8", errors="strict")
except UnicodeDecodeError:
    print("strict mode raised an exception")

# replace: inserts U+FFFD for each undecodable byte
print(raw.decode("utf-8", errors="replace"))
# R�sum�

# ignore: silently drops undecodable bytes
print(raw.decode("utf-8", errors="ignore"))
# Rsum

# backslashreplace: shows the byte value as an escape sequence
print(raw.decode("utf-8", errors="backslashreplace"))
# R\xe9sum\xe9

# surrogateescape: maps bytes to lone surrogates (useful for round-tripping)
# printed via repr() because lone surrogates cannot be written to most terminals
print(repr(raw.decode("utf-8", errors="surrogateescape")))
# 'R\udce9sum\udce9'  (surrogates, not real characters)

The replace mode is what produces the characters you see. The backslashreplace mode is useful for debugging because it preserves the original byte values in a readable form. The ignore mode is almost never what you want because it silently destroys data without any indication that something went wrong.
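Of these, surrogateescape is the only mode that lets you recover the original bytes: re-encoding with the same handler reverses the mapping exactly.

```python
raw = b"R\xe9sum\xe9"  # Latin-1 bytes

# Invalid bytes become lone surrogates instead of U+FFFD
text = raw.decode("utf-8", errors="surrogateescape")

# Re-encoding with the same handler restores the exact original bytes
restored = text.encode("utf-8", errors="surrogateescape")
print(restored == raw)
# True
```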

When you open a file with Python’s built-in open(), the default error mode is strict, so a mismatched encoding raises UnicodeDecodeError instead of inserting diamonds. (Python 3.10 added an opt-in EncodingWarning, enabled with -X warn_default_encoding, that flags code relying on the locale default encoding.) The requests library uses replace internally when it decodes response.text, which is why you see diamonds in scraped content instead of exceptions.

Diagnosing the Issue

Before you can fix the encoding, you need to see the raw bytes. If you are working with an HTTP response, use response.content (bytes) instead of response.text (string). If you are reading a file, open it in binary mode.

import requests

response = requests.get("https://example.com/page")

# Do NOT use response.text if you suspect encoding issues
# Instead, examine the raw bytes
raw = response.content

# Look at the first 200 bytes
print(raw[:200])

# Check what encoding requests thinks it should use
print(response.encoding)
# This might say 'ISO-8859-1' even if the page is actually UTF-8

# Check the Content-Type header
print(response.headers.get("Content-Type"))
# text/html; charset=windows-1252

For files, open in binary mode to see exactly what is on disk:

with open("data.csv", "rb") as f:
    raw = f.read(500)
    print(raw)
    # b'Name,City\r\nJos\xe9,Montr\xe9al\r\n...'
    # Those \xe9 bytes tell you this is likely Latin-1 or Windows-1252

Look for byte patterns. Single bytes in the range 0x80-0xFF that are not part of valid multi-byte UTF-8 sequences indicate a single-byte encoding like Latin-1 or Windows-1252. Valid UTF-8 multi-byte characters start with specific lead bytes: two-byte sequences with 0xC2-0xDF (0xC0 and 0xC1 would be overlong encodings and are always invalid), three-byte sequences with 0xE0-0xEF, and four-byte sequences with 0xF0-0xF4; every byte that follows a lead byte must be a continuation byte in 0x80-0xBF.
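A rough scan along these lines can flag where the invalid bytes sit. This is a simplified sketch, not a full UTF-8 validator (it does not reject overlong or out-of-range sequences whose trailing bytes happen to be continuations):

```python
def find_invalid_utf8(raw: bytes) -> list[int]:
    """Return offsets of bytes that cannot start or complete a UTF-8 sequence."""
    bad = []
    i = 0
    while i < len(raw):
        b = raw[i]
        if b < 0x80:             # ASCII
            n = 1
        elif 0xC2 <= b <= 0xDF:  # two-byte lead
            n = 2
        elif 0xE0 <= b <= 0xEF:  # three-byte lead
            n = 3
        elif 0xF0 <= b <= 0xF4:  # four-byte lead
            n = 4
        else:                    # stray continuation byte or invalid lead
            bad.append(i)
            i += 1
            continue
        chunk = raw[i:i + n]
        # All trailing bytes must be continuations (0x80-0xBF)
        if len(chunk) < n or any(not 0x80 <= c <= 0xBF for c in chunk[1:]):
            bad.append(i)
            i += 1
        else:
            i += n
    return bad


print(find_invalid_utf8(b"caf\xe9 au lait"))
# [3]  -- the Latin-1 0xe9 byte cannot start a valid UTF-8 sequence
```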

Every character has a number, and getting that number wrong breaks everything. Photo by Nataliya Vaitkevich / Pexels

Fix 1: Use the Correct Encoding

The cleanest fix is to decode with the encoding the data was actually written in. If you can determine it from the source, use it directly.

# You inspected the bytes and found single-byte accented characters
raw = b"caf\xe9"

# Decode with the correct encoding
text = raw.decode("latin-1")
print(text)
# café

# For web responses, override the encoding before accessing .text
import requests

response = requests.get("https://example.com/page")
response.encoding = "windows-1252"  # Set the correct encoding
text = response.text  # Now decoded with the right codec

When you do not know the encoding, use a charset detection library. charset-normalizer (the default in recent versions of requests) and chardet both analyze byte patterns to guess the encoding.

import charset_normalizer

raw = b"R\xe9sum\xe9 du projet"

results = charset_normalizer.from_bytes(raw)
best = results.best()
print(best.encoding)
# cp1252  (Windows-1252)
print(str(best))
# Résumé du projet  (properly decoded)
import chardet

raw = b"R\xe9sum\xe9 du projet"

detected = chardet.detect(raw)
print(detected)
# {'encoding': 'ISO-8859-1', 'confidence': 0.73, 'language': ''}

text = raw.decode(detected["encoding"])
print(text)
# Résumé du projet

Both latin-1 (ISO-8859-1) and windows-1252 will decode the same byte correctly in this case. The difference between them matters for bytes in the range 0x80-0x9F: Latin-1 maps those to control characters, while Windows-1252 maps them to printable characters like curly quotes, em dashes, and the euro sign. In practice, if you are dealing with Western European text from the web, Windows-1252 is almost always the right choice over strict ISO-8859-1.
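The 0x80-0x9F difference is easy to see with the curly-quote bytes:

```python
raw = b"\x93Hello\x94"  # curly double quotes in Windows-1252

print(raw.decode("windows-1252"))
# “Hello”
print(repr(raw.decode("latin-1")))
# '\x93Hello\x94'  -- C1 control characters, not quotes
```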

Fix 2: Decode with Replace Then Clean Up

Sometimes you need to process the text even if a few characters are undecodable. Decode with errors="replace", then search for and handle the replacement characters.

raw = b"Price: \x80100\nName: Caf\xe9 Noir"

# Decode as UTF-8 with replacement
text = raw.decode("utf-8", errors="replace")
print(text)
# Price: �100
# Name: Caf� Noir

# Count how many replacements occurred
replacement_count = text.count("\ufffd")
print(f"Found {replacement_count} replacement characters")

# Find positions of replacement characters
# (character and byte offsets line up here only because every
#  other byte in this example is ASCII; multi-byte sequences
#  would shift the indices)
for i, char in enumerate(text):
    if char == "\ufffd":
        print(f"  Position {i}: original byte was 0x{raw[i]:02x}")

This approach is useful when the majority of the content is valid UTF-8 and only a handful of bytes are in a different encoding. You can log the positions, decode those segments separately, or replace the diamonds with a known fallback.

def decode_mixed(raw: bytes, primary: str = "utf-8", fallback: str = "windows-1252") -> str:
    """Decode bytes that might mix two encodings.

    Try primary encoding first. For any bytes that fail,
    fall back to the secondary encoding.
    """
    result = []
    i = 0
    while i < len(raw):
        byte = raw[i:i+1]
        try:
            # Try decoding as a potential multi-byte UTF-8 sequence
            if raw[i] & 0x80 == 0:
                # ASCII byte
                result.append(byte.decode(primary))
                i += 1
            elif raw[i] & 0xE0 == 0xC0:
                # Two-byte UTF-8 sequence
                chunk = raw[i:i+2]
                result.append(chunk.decode(primary))
                i += 2
            elif raw[i] & 0xF0 == 0xE0:
                # Three-byte UTF-8 sequence
                chunk = raw[i:i+3]
                result.append(chunk.decode(primary))
                i += 3
            elif raw[i] & 0xF8 == 0xF0:
                # Four-byte UTF-8 sequence
                chunk = raw[i:i+4]
                result.append(chunk.decode(primary))
                i += 4
            else:
                # Not valid UTF-8 start byte, try fallback
                result.append(byte.decode(fallback))
                i += 1
        except (UnicodeDecodeError, IndexError):
            # UTF-8 sequence was incomplete or invalid, try fallback
            result.append(raw[i:i+1].decode(fallback))
            i += 1
    return "".join(result)


raw = b"Price: \xe2\x82\xac100\nCaf\xe9 Noir"
# Contains valid UTF-8 euro sign AND a Latin-1 accented e
print(decode_mixed(raw))
# Price: €100
# Café Noir

Fix 3: Use ftfy to Fix Double-Encoded Text

Double encoding is a particularly nasty variant. It happens when text is encoded to bytes, then those bytes are mistakenly treated as characters in a different encoding and encoded again. The result is garbled multi-byte sequences where you expected single characters.

# How double encoding happens
original = "café"
step1 = original.encode("utf-8")        # b'caf\xc3\xa9'
step2 = step1.decode("latin-1")         # 'cafÃ©'  (mojibake)
step3 = step2.encode("utf-8")           # b'caf\xc3\x83\xc2\xa9'
# Now you have double-encoded bytes

print(step3.decode("utf-8"))
# cafÃ©  -- decodes cleanly, but to mojibake, not the original

The ftfy library specializes in detecting and reversing these mangled encodings.

import ftfy

# Classic mojibake from UTF-8 decoded as Latin-1
garbled = "CafÃ©"
print(ftfy.fix_text(garbled))
# Café

# UTF-8 curly quotes decoded as Windows-1252
garbled2 = "Smart quotes: â€œHelloâ€\x9d"
fixed = ftfy.fix_text(garbled2)
print(fixed)
# Smart quotes: “Hello”

# ftfy can explain what it sees
from ftfy import explain_unicode
explain_unicode("CafÃ©")
# Prints a table showing each character and its properties

ftfy works by recognizing common patterns of encoding errors and reversing them. It handles the most frequent cases: UTF-8 decoded as Latin-1, UTF-8 decoded as Windows-1252, and various combinations of double and triple encoding. If your data looks like it went through a series of wrong encoding/decoding steps, ftfy is the tool to reach for before trying to decode garbled text manually.
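For the simplest single-pass case, the reversal ftfy automates is just an encode/decode pair: re-encode the mojibake with the codec that was wrongly used, then decode with the right one. ftfy is still worth using because it detects which pair applies and handles bytes that plain Latin-1 cannot represent.

```python
garbled = "CafÃ©"  # UTF-8 bytes that were decoded as Latin-1

# Undo the wrong decode, then decode correctly
fixed = garbled.encode("latin-1").decode("utf-8")
print(fixed)
# Café
```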

The web speaks many languages — your scraper needs to understand all of them. Photo by Stefan G / Pexels

Fix 4: Read as Binary and Detect Before Decoding

The safest approach for files with unknown encoding is to always read binary first, detect the encoding, then decode.

import charset_normalizer
from pathlib import Path


def read_unknown_encoding(file_path: str) -> str:
    """Read a text file with unknown encoding."""
    raw = Path(file_path).read_bytes()

    # Try UTF-8 first (most common on modern systems)
    try:
        text = raw.decode("utf-8")
        # Check for BOM (byte order mark) and strip it
        if text.startswith("\ufeff"):
            text = text[1:]
        return text
    except UnicodeDecodeError:
        pass

    # Fall back to detection
    result = charset_normalizer.from_bytes(raw).best()
    if result is None:
        raise ValueError(f"Could not detect encoding for {file_path}")

    print(f"Detected encoding: {result.encoding} "
          f"(coherence: {result.coherence})")
    return str(result)


# Usage
text = read_unknown_encoding("/path/to/mystery_file.csv")

For CSV files specifically, Python’s csv module works with text streams, so you need to handle the encoding at the file-opening step:

import csv
import charset_normalizer


def read_csv_any_encoding(file_path: str) -> list[dict]:
    """Read a CSV file regardless of its encoding."""
    with open(file_path, "rb") as f:
        raw = f.read()

    # Detect encoding
    detection = charset_normalizer.from_bytes(raw).best()
    encoding = detection.encoding if detection else "utf-8"

    rows = []
    with open(file_path, "r", encoding=encoding, errors="replace") as f:
        reader = csv.DictReader(f)
        for row in reader:
            rows.append(row)

    # Warn if any replacement characters slipped through
    for i, row in enumerate(rows):
        for key, value in row.items():
            if value and "\ufffd" in value:
                print(f"Warning: replacement char in row {i}, "
                      f"column '{key}': {value!r}")

    return rows

Prevention in Web Scraping

Most replacement character issues in scraping come from trusting the wrong encoding declaration. Here is how to prevent them.

import requests
import charset_normalizer


def scrape_with_encoding_safety(url: str) -> str:
    """Fetch a URL and decode its content with proper encoding handling."""
    response = requests.get(url)

    # Step 1: Check the Content-Type header
    content_type = response.headers.get("Content-Type", "")
    print(f"Content-Type: {content_type}")

    # Step 2: Work with raw bytes, not response.text
    raw = response.content

    # Step 3: Look for encoding in the HTML meta tag
    # (only check the first 2KB to avoid decoding the whole thing)
    head_bytes = raw[:2048]
    meta_encoding = None

    # Check for <meta charset="...">
    import re
    match = re.search(rb'charset=["\']?([a-zA-Z0-9_-]+)', head_bytes)
    if match:
        meta_encoding = match.group(1).decode("ascii")
        print(f"Meta charset: {meta_encoding}")

    # Step 4: Try encodings in order of reliability
    #   1. Explicit meta charset in the HTML
    #   2. Content-Type header charset
    #   3. Detection from byte patterns
    #   4. UTF-8 as a last resort

    if meta_encoding:
        try:
            return raw.decode(meta_encoding)
        except (UnicodeDecodeError, LookupError):
            pass

    if response.encoding and response.encoding.lower() != "iso-8859-1":
        # requests defaults to ISO-8859-1 for text/* content types
        # per RFC 2616, which is often wrong
        try:
            return raw.decode(response.encoding)
        except (UnicodeDecodeError, LookupError):
            pass

    # Auto-detect
    detected = charset_normalizer.from_bytes(raw).best()
    if detected:
        print(f"Detected: {detected.encoding}")
        return str(detected)

    # Final fallback
    return raw.decode("utf-8", errors="replace")

A critical detail: the requests library has a quirk where it defaults response.encoding to ISO-8859-1 for any response with a text/* content type that does not include a charset parameter. This is technically correct per the old HTTP/1.1 spec (RFC 2616), but in practice almost all modern web content without an explicit charset is UTF-8. That default is why response.text often produces replacement characters for pages that are actually UTF-8.

import requests

response = requests.get("https://example.com")

# This is often wrong -- requests defaults to ISO-8859-1
print(response.encoding)
# 'ISO-8859-1'

# Use apparent_encoding instead, which uses charset_normalizer
print(response.apparent_encoding)
# 'utf-8'

# Override before accessing .text
response.encoding = response.apparent_encoding
clean_text = response.text

Complete Workflow

Here is a complete scraping workflow that handles encoding properly from start to finish.

import requests
import charset_normalizer
import re
from dataclasses import dataclass


@dataclass
class ScrapedContent:
    url: str
    encoding_used: str
    encoding_source: str  # "meta", "header", "detected", "fallback"
    text: str
    had_replacements: bool


def scrape_safely(url: str) -> ScrapedContent:
    """Scrape a URL with robust encoding handling."""
    response = requests.get(url, timeout=30)
    raw = response.content

    # Try meta charset first
    match = re.search(rb'charset=["\']?([a-zA-Z0-9_-]+)', raw[:4096])
    if match:
        encoding = match.group(1).decode("ascii")
        try:
            text = raw.decode(encoding)
            return ScrapedContent(
                url=url,
                encoding_used=encoding,
                encoding_source="meta",
                text=text,
                had_replacements=False,
            )
        except (UnicodeDecodeError, LookupError):
            pass

    # Try header charset (skip the requests ISO-8859-1 default)
    ct = response.headers.get("Content-Type", "")
    ct_match = re.search(r'charset=([a-zA-Z0-9_-]+)', ct)
    if ct_match:
        encoding = ct_match.group(1)
        try:
            text = raw.decode(encoding)
            return ScrapedContent(
                url=url,
                encoding_used=encoding,
                encoding_source="header",
                text=text,
                had_replacements=False,
            )
        except (UnicodeDecodeError, LookupError):
            pass

    # Auto-detect
    detected = charset_normalizer.from_bytes(raw).best()
    if detected:
        text = str(detected)
        return ScrapedContent(
            url=url,
            encoding_used=detected.encoding,
            encoding_source="detected",
            text=text,
            had_replacements=False,
        )

    # Fallback to UTF-8 with replacement
    text = raw.decode("utf-8", errors="replace")
    return ScrapedContent(
        url=url,
        encoding_used="utf-8",
        encoding_source="fallback",
        text=text,
        had_replacements="\ufffd" in text,
    )


# Usage
result = scrape_safely("https://example.com/page")
print(f"Encoding: {result.encoding_used} (from {result.encoding_source})")
if result.had_replacements:
    count = result.text.count("\ufffd")
    print(f"Warning: {count} characters could not be decoded")

Common Encoding Pairs That Cause Problems

Most replacement character errors come from a small number of encoding mismatches. Knowing the common pairs helps you diagnose issues faster.

Actual Encoding    Assumed Encoding    What Happens
Windows-1252       UTF-8               Accented characters (é, ñ, ü) become �
UTF-8              Latin-1             Multi-byte characters become mojibake (Ã© instead of é)
UTF-8              ASCII               Anything above 127 raises an error or becomes �
Shift_JIS          UTF-8               Japanese text becomes a mix of � and garbage
GB2312/GBK         UTF-8               Chinese text becomes unreadable
UTF-8 with BOM     UTF-8               The BOM (\xef\xbb\xbf) appears as \ufeff at the start
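For the BOM case in the last row, Python ships a dedicated codec: decoding with utf-8-sig strips a leading BOM if present (and encoding with it writes one).

```python
raw = b"\xef\xbb\xbfName,City\n"  # UTF-8 file content with a BOM

print(repr(raw.decode("utf-8")))
# '\ufeffName,City\n'  -- the BOM leaks into the text

print(repr(raw.decode("utf-8-sig")))
# 'Name,City\n'  -- the BOM is stripped
```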

The Windows-1252 vs UTF-8 mismatch is by far the most common in web scraping of English and Western European content. Windows-1252 is a superset of ISO-8859-1 and was the default encoding on Windows for decades. Many older websites, databases, and file exports still use it.

# The problematic byte ranges
# These bytes are valid in Windows-1252 but invalid in UTF-8
problem_bytes = {
    0x80: "euro sign",
    0x85: "horizontal ellipsis",
    0x91: "left single curly quote",
    0x92: "right single curly quote",
    0x93: "left double curly quote",
    0x94: "right double curly quote",
    0x96: "en dash",
    0x97: "em dash",
    0xA0: "non-breaking space",
    0xE9: "e with acute accent",
    0xF1: "n with tilde",
    0xFC: "u with diaeresis",
}

for byte_val, description in problem_bytes.items():
    raw = bytes([byte_val])
    win1252 = raw.decode("windows-1252")
    try:
        utf8 = raw.decode("utf-8")
    except UnicodeDecodeError:
        utf8 = "(invalid)"
    print(f"0x{byte_val:02X}: {description:30s} "
          f"win1252={win1252!r:6s}  utf8={utf8}")

Quick Reference

When you see � in your output, work through this checklist:

  1. Get the raw bytes. Use response.content for HTTP, open(path, "rb") for files.
  2. Look at the bytes around the diamond. Single bytes in 0x80-0xFF suggest a single-byte encoding. Multi-byte sequences starting with 0xC3 followed by a byte in 0x80-0xBF are likely valid UTF-8.
  3. Check the declared encoding. Look at Content-Type headers, HTML meta tags, or file metadata. If it says one thing but the bytes say another, trust the bytes.
  4. Try Windows-1252. It covers the most common cases for Western text.
  5. Use charset-normalizer or chardet. Let a detection library analyze the byte patterns.
  6. If it looks like mojibake, use ftfy. Double-encoded text has a distinctive look – accented characters turn into two-character sequences.

The replacement character is not a bug. It is your decoder telling you that it needs a different codebook. Listen to it, find the right encoding, and the diamonds will disappear.

This post is licensed under CC BY 4.0 by the author.