Post

How to Decode Garbled Text: Fixing Encoding Mismatches

How to Decode Garbled Text: Fixing Encoding Mismatches

You scraped a page and got “é” instead of “é”, or “ÄŸ” instead of “ğ”, or “’” instead of a curly apostrophe. The data is there, but it looks like it was run through a blender. This corruption has a name – mojibake – and it happens when bytes that were encoded in one character set get decoded using a different one. The good news is that it is almost always reversible. Once you understand the pattern, you can fix garbled text in a single line of Python.

What Causes Garbled Text

Every character you see on screen is stored as a sequence of bytes. The encoding determines how characters map to bytes, and how bytes map back to characters. UTF-8 encodes “é” as two bytes: 0xC3 0xA9. Latin-1 (ISO 8859-1) encodes “é” as a single byte: 0xE9. When a system writes bytes using UTF-8 but the reader interprets those same bytes using Latin-1, each UTF-8 byte gets mapped to a separate Latin-1 character. The two bytes 0xC3 0xA9 become two characters: “Ô (0xC3 in Latin-1) and “©” (0xA9 in Latin-1). The result is “é” where you expected “é”.

This is not data loss. The original bytes are intact. They were just interpreted through the wrong lens.

1
2
3
4
5
6
7
8
9
# Demonstrating how mojibake happens
original = "é"
utf8_bytes = original.encode('utf-8')
print(utf8_bytes)        # b'\xc3\xa9'
print(len(utf8_bytes))   # 2 bytes

# Now decode those UTF-8 bytes as if they were Latin-1
garbled = utf8_bytes.decode('latin-1')
print(garbled)           # é

The same thing happens in the other direction. If content was encoded as Latin-1 but decoded as UTF-8, you get a different kind of corruption – often replacement characters or decode errors.

Common Mojibake Patterns and What They Mean

When you see garbled text, the specific garbage characters tell you exactly what went wrong. Here are the most common patterns you will encounter when scraping.

UTF-8 Decoded as Latin-1

This is the most frequent case. UTF-8 multi-byte sequences get split into individual Latin-1 characters.

OriginalGarbledUTF-8 Bytes
ééC3 A9
üüC3 BC
ññC3 B1
ööC3 B6
ääC3 A4
ğÄŸC4 9F
şÅŸC5 9F
ççC3 A7

The telltale sign is the “Ô character appearing before accented vowels. If you see “Ô scattered throughout your scraped text, you are almost certainly looking at UTF-8 bytes misinterpreted as Latin-1.

UTF-8 Decoded as Windows-1252

Windows-1252 (often mislabeled as “ANSI” on Windows systems) is a superset of Latin-1 that fills in the 0x80-0x9F range with printable characters like curly quotes and dashes. When UTF-8 text gets decoded as Windows-1252, you see a slightly different pattern.

OriginalGarbledUTF-8 Bytes
’ (right single quote)’E2 80 99
” (left double quote)“E2 80 9C
” (right double quote)â€E2 80 9D
– (em dash)—E2 80 94
… (ellipsis)…E2 80 A6

If you see “— sequences in your scraped data, the source is UTF-8 content decoded as Windows-1252.

Double Encoding

Sometimes content gets corrupted twice. UTF-8 text is misread as Latin-1, then the already-garbled result gets encoded to UTF-8 again. The result is even more garbled.

OriginalSingle MojibakeDouble Mojibake
ééÃ�©
üüÃ�¼

Double encoding produces longer garbage strings with more “Ô characters. The fix requires two rounds of the correction process.

The Fix Pattern

The repair is straightforward: re-encode the garbled text using the wrong encoding (the one that was mistakenly used to decode it), then decode the resulting bytes with the correct encoding.

1
2
3
4
# UTF-8 content that was decoded as Latin-1
broken = "é"
fixed = broken.encode('latin-1').decode('utf-8')
print(fixed)  # é

What happens here step by step:

  1. broken.encode('latin-1') converts the garbled string back to bytes using Latin-1. Since the garbled characters came from Latin-1 interpretation, this recovers the original raw bytes: b'\xc3\xa9'.
  2. .decode('utf-8') interprets those bytes as UTF-8, which is what they were all along. Result: “é”.

For Windows-1252 mojibake, adjust accordingly:

1
2
3
4
# UTF-8 content that was decoded as Windows-1252
broken = "’"
fixed = broken.encode('windows-1252').decode('utf-8')
print(fixed)  # '

For double-encoded text, apply the fix twice:

1
2
3
4
5
6
# Double-encoded text
broken = "Ã\x83©"
first_pass = broken.encode('latin-1').decode('utf-8')
print(first_pass)  # é  (still garbled, but only single mojibake now)
second_pass = first_pass.encode('latin-1').decode('utf-8')
print(second_pass)  # é

Diagnosing the Encoding Pair

When you encounter garbled text and are not sure which encoding pair caused the problem, you can try common combinations systematically.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
def try_fix(broken_text, encodings_to_try=None):
    """Try common encoding pair fixes and show results."""
    if encodings_to_try is None:
        encodings_to_try = [
            ('latin-1', 'utf-8'),
            ('windows-1252', 'utf-8'),
            ('iso-8859-15', 'utf-8'),
            ('cp437', 'utf-8'),
        ]

    results = []
    for wrong_enc, right_enc in encodings_to_try:
        try:
            fixed = broken_text.encode(wrong_enc).decode(right_enc)
            results.append((wrong_enc, right_enc, fixed))
        except (UnicodeDecodeError, UnicodeEncodeError):
            continue

    return results


broken = "Café au lait"
for wrong, right, fixed in try_fix(broken):
    print(f"{wrong} -> {right}: {fixed}")
# latin-1 -> utf-8: Café au lait
# iso-8859-15 -> utf-8: Café au lait

You can also inspect the raw bytes to understand what happened:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
def inspect_bytes(text, encoding='utf-8'):
    """Show the byte representation of a string."""
    raw = text.encode(encoding, errors='replace')
    hex_str = ' '.join(f'{b:02x}' for b in raw)
    print(f"Text: {text}")
    print(f"Encoding: {encoding}")
    print(f"Bytes: {hex_str}")
    print(f"Length: {len(raw)} bytes, {len(text)} chars")
    print()


inspect_bytes("é", 'utf-8')
# Text: é
# Encoding: utf-8
# Bytes: c3 a9
# Length: 2 bytes, 1 chars

inspect_bytes("é", 'latin-1')
# Text: é
# Encoding: latin-1
# Bytes: e9
# Length: 1 bytes, 1 chars

inspect_bytes("é", 'latin-1')
# Text: é
# Encoding: latin-1
# Bytes: c3 a9
# Length: 2 bytes, 2 chars

Notice that “é” encoded as Latin-1 produces the exact same bytes as “é” encoded as UTF-8. That confirms the mojibake source.

Using ftfy for Automatic Fixes

The ftfy library (short for “fixes text for you”) automates mojibake detection and repair. It recognizes common corruption patterns and applies the right fix automatically.

1
pip install ftfy

Basic usage:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
import ftfy

# Single mojibake
print(ftfy.fix_text("é"))           # é
print(ftfy.fix_text("Café au lait")) # Café au lait

# Windows-1252 mojibake
print(ftfy.fix_text("’"))           # '
print(ftfy.fix_text("“Helloâ€\x9d"))  # "Hello"

# Double encoding
print(ftfy.fix_text("Ã\x83©"))      # é

# Mixed mojibake in longer text
text = "The résumé was written by José García"
print(ftfy.fix_text(text))
# The résumé was written by José García

ftfy also handles other text problems beyond mojibake:

1
2
3
4
5
6
7
8
9
10
import ftfy

# Fixes HTML entities that leaked into text
print(ftfy.fix_text("&amp; &lt;"))  # & <

# Fixes curly quote issues
print(ftfy.fix_text("won’t"))  # won't

# Normalizes Unicode representations
print(ftfy.fix_text("\u00e9"))  # é (already correct, left as-is)

For more control, you can use ftfy.fix_and_explain to see what ftfy detected and how it fixed the text:

1
2
3
4
5
import ftfy

fixed, explanations = ftfy.fix_and_explain("Café au lait")
print(fixed)         # Café au lait
print(explanations)  # ['encode->#1-decode->utf-8']

The explanation string tells you which encoding pair ftfy used, confirming the diagnosis.

Detecting Encodings with chardet and charset-normalizer

When you have raw bytes and do not know the encoding at all, charset detection libraries can help. These analyze byte patterns and give you a best guess.

1
pip install chardet charset-normalizer

Using chardet:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
import chardet

# Detect encoding of raw bytes
raw_bytes = b'\xc3\xa9'  # UTF-8 encoded "é"
result = chardet.detect(raw_bytes)
print(result)
# {'encoding': 'utf-8', 'confidence': 0.7525, 'language': ''}

# Detect encoding of a larger sample (more accurate with more data)
with open('scraped_page.html', 'rb') as f:
    raw = f.read()
    detected = chardet.detect(raw)
    print(detected)
    # {'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}
    text = raw.decode(detected['encoding'])

Using charset-normalizer (faster, often more accurate):

1
2
3
4
5
6
7
8
9
10
from charset_normalizer import from_bytes

raw_bytes = open('scraped_page.html', 'rb').read()
results = from_bytes(raw_bytes)
best = results.best()

if best:
    print(f"Encoding: {best.encoding}")
    print(f"Coherence: {best.coherence}")
    text = str(best)

charset-normalizer can also rank multiple possible encodings:

1
2
3
4
5
6
7
from charset_normalizer import from_bytes

raw = b'Caf\xe9 au lait'  # Latin-1 encoded
results = from_bytes(raw)

for result in results:
    print(f"{result.encoding}: coherence {result.coherence}")
Every character has a number, and getting that number wrong breaks everything.
Every character has a number, and getting that number wrong breaks everything. Photo by Nataliya Vaitkevich / Pexels

Prevention: Getting Encoding Right the First Time

The best fix for mojibake is never creating it in the first place. Here are the key practices for web scraping.

Use Raw Bytes, Not Decoded Text

When scraping with the requests library, use response.content (bytes) instead of response.text (string). The .text property uses the encoding that requests guesses, which is often wrong.

1
2
3
4
5
6
7
8
9
10
import requests

response = requests.get('https://example.com')

# BAD: requests guesses the encoding (often defaults to ISO-8859-1 for text/html)
text = response.text  # May be garbled

# GOOD: get raw bytes and decode yourself
raw = response.content
text = raw.decode('utf-8')  # You control the encoding

The problem is especially bad with requests because of an HTTP specification quirk. RFC 2616 says that text/* content types with no explicit charset should default to ISO-8859-1. So when a server sends UTF-8 content without a charset header, requests decodes it as ISO-8859-1, producing mojibake.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
import requests

response = requests.get('https://example.com')

# Check what encoding requests detected
print(response.encoding)  # Might show 'ISO-8859-1' even for UTF-8 content

# Check the Content-Type header
print(response.headers.get('Content-Type'))
# 'text/html' (no charset specified -- requests defaults to ISO-8859-1)

# Override with the correct encoding before accessing .text
response.encoding = 'utf-8'
text = response.text  # Now correctly decoded

Specify Encoding When Reading Files

1
2
3
4
5
6
7
8
9
10
11
12
# BAD: uses system default encoding (platform-dependent)
with open('data.csv') as f:
    text = f.read()

# GOOD: explicit encoding
with open('data.csv', encoding='utf-8') as f:
    text = f.read()

# ALSO GOOD: read as bytes and decode
with open('data.csv', 'rb') as f:
    raw = f.read()
    text = raw.decode('utf-8')

Check the HTML Meta Tag

Sometimes the HTTP headers say one encoding but the HTML document declares another. Check both.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
import requests
from bs4 import BeautifulSoup

response = requests.get('https://example.com')

# Check HTTP header
http_encoding = response.encoding
print(f"HTTP says: {http_encoding}")

# Check HTML meta tag
soup = BeautifulSoup(response.content, 'html.parser')
meta = soup.find('meta', charset=True)
if meta:
    html_encoding = meta['charset']
    print(f"HTML says: {html_encoding}")

meta_http = soup.find('meta', attrs={'http-equiv': 'Content-Type'})
if meta_http:
    content = meta_http.get('content', '')
    if 'charset=' in content:
        html_encoding = content.split('charset=')[-1].strip()
        print(f"HTML meta http-equiv says: {html_encoding}")

# Use the HTML-declared encoding as it's usually more accurate
text = response.content.decode(html_encoding, errors='replace')

Common Web Scraping Encoding Scenarios

Server Says Latin-1, Content Is Actually UTF-8

This is the single most common encoding problem in web scraping. The server either sends no charset header (causing libraries to default to Latin-1) or explicitly declares Latin-1 when the content is UTF-8.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
import requests
import chardet

def smart_decode(response):
    """Decode response content using the best available encoding info."""
    # First, try what the server says
    declared = response.encoding

    # Detect from content
    detected = chardet.detect(response.content)

    # If server says latin-1 but detector says utf-8 with high confidence, trust detector
    if declared and declared.lower() in ('iso-8859-1', 'latin-1'):
        if detected['encoding'] and detected['encoding'].lower() == 'utf-8':
            if detected['confidence'] > 0.7:
                return response.content.decode('utf-8')

    # Otherwise use detected encoding
    if detected['encoding']:
        return response.content.decode(detected['encoding'], errors='replace')

    # Fallback
    return response.content.decode('utf-8', errors='replace')

response = requests.get('https://example.com')
text = smart_decode(response)

Mixed Encodings on the Same Page

Some pages have content from multiple sources with different encodings. A page template might be UTF-8 while user-generated content embedded in it was stored as Windows-1252.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
import ftfy

def fix_mixed_encoding(text):
    """Fix text that may have multiple encoding issues."""
    # ftfy handles mixed encoding well
    fixed = ftfy.fix_text(text)

    # For stubborn cases, try line by line
    if 'Ã' in fixed or 'â€' in fixed:
        lines = fixed.split('\n')
        fixed_lines = [ftfy.fix_text(line) for line in lines]
        fixed = '\n'.join(fixed_lines)

    return fixed

No Encoding Specified Anywhere

When neither the HTTP headers nor the HTML meta tags declare an encoding, you have to detect it.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
import requests
from charset_normalizer import from_bytes

def fetch_with_detection(url):
    """Fetch a URL and detect encoding from content."""
    response = requests.get(url)

    # Try charset_normalizer on the raw bytes
    results = from_bytes(response.content)
    best = results.best()

    if best:
        return str(best), best.encoding
    else:
        # Last resort: try UTF-8, fall back to latin-1
        try:
            return response.content.decode('utf-8'), 'utf-8'
        except UnicodeDecodeError:
            return response.content.decode('latin-1'), 'latin-1'

text, encoding = fetch_with_detection('https://example.com')
print(f"Detected encoding: {encoding}")

Complete Fixing Workflow

Here is a practical workflow for handling encoding issues in a scraping pipeline.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
import ftfy
import chardet
import requests


def scrape_with_encoding_fix(url):
    """Full encoding-aware scraping workflow."""

    # Step 1: Fetch raw bytes
    response = requests.get(url)
    raw_bytes = response.content

    # Step 2: Detect encoding
    detected = chardet.detect(raw_bytes)
    encoding = detected['encoding'] or 'utf-8'
    confidence = detected['confidence']

    print(f"Detected: {encoding} (confidence: {confidence:.2f})")
    print(f"Server declared: {response.encoding}")

    # Step 3: Decode with detected encoding
    try:
        text = raw_bytes.decode(encoding)
    except (UnicodeDecodeError, LookupError):
        text = raw_bytes.decode('utf-8', errors='replace')

    # Step 4: Fix any remaining mojibake
    text = ftfy.fix_text(text)

    # Step 5: Verify -- check for common mojibake markers
    mojibake_markers = ['é', 'ü', 'ñ', '’', '“', 'ö', 'ä']
    for marker in mojibake_markers:
        if marker in text:
            print(f"Warning: possible remaining mojibake detected ({marker})")
            # Try one more round of fixing
            text = ftfy.fix_text(text)
            break

    return text


# Usage
text = scrape_with_encoding_fix('https://example.com')

For batch processing scraped files that are already saved with wrong encoding:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
import ftfy
import os


def fix_encoding_in_files(directory, pattern='*.txt'):
    """Fix mojibake in all text files in a directory."""
    import glob

    files = glob.glob(os.path.join(directory, pattern))

    for filepath in files:
        # Read as bytes to avoid further corruption
        with open(filepath, 'rb') as f:
            raw = f.read()

        # Try UTF-8 first
        try:
            text = raw.decode('utf-8')
        except UnicodeDecodeError:
            text = raw.decode('latin-1')

        # Fix any mojibake
        fixed = ftfy.fix_text(text)

        if fixed != text:
            print(f"Fixed: {filepath}")
            with open(filepath, 'w', encoding='utf-8') as f:
                f.write(fixed)
        else:
            print(f"OK: {filepath}")

Quick Reference: Encoding Fix Commands

For quick one-off fixes, here are the most common patterns.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
# UTF-8 misread as Latin-1 (most common)
text.encode('latin-1').decode('utf-8')

# UTF-8 misread as Windows-1252
text.encode('windows-1252').decode('utf-8')

# Latin-1 misread as UTF-8 (less common, usually causes errors not mojibake)
text.encode('utf-8').decode('latin-1')

# Double-encoded UTF-8
text.encode('latin-1').decode('utf-8').encode('latin-1').decode('utf-8')

# Automatic fix with ftfy
import ftfy
ftfy.fix_text(text)

And from the command line:

1
2
3
4
5
6
7
8
9
10
11
12
13
# Convert file encoding with iconv
iconv -f LATIN1 -t UTF-8 input.txt > output.txt

# Detect encoding with file command
file -bi scraped_page.html
# text/html; charset=utf-8

# Python one-liner to fix a file
python -c "
import ftfy, sys
text = open(sys.argv[1], 'rb').read().decode('latin-1')
print(ftfy.fix_text(text))
" garbled.txt > fixed.txt

Summary

Garbled text from web scraping is almost always caused by an encoding mismatch between writer and reader. The fix follows a consistent pattern: re-encode with the wrong encoding to recover the original bytes, then decode with the correct one. For automated fixes, ftfy handles the detection and correction in a single call. For prevention, always work with raw bytes from HTTP responses, detect encoding before decoding, and never trust library defaults. The combination of response.content instead of response.text, chardet or charset-normalizer for detection, and ftfy as a safety net will handle the vast majority of encoding issues in a scraping pipeline.

Contact Arman for Complex Problems
This post is licensed under CC BY 4.0 by the author.