---
title: "How to Decode Garbled Text: Fixing Encoding Mismatches"
canonical_url: https://bytetunnels.com/posts/how-to-decode-garbled-text-fixing-encoding-mismatches/
date: 2026-02-07
author: "Arman Hossain"
author_email: "arman@bytetunnels.com"
author_links: ["https://twitter.com/armanfixing", "https://github.com/arman-bd", "https://www.linkedin.com/in/armanhossain"]
categories: ["Data Extraction"]
tags: ["encoding", "garbled text", "mojibake", "python", "utf-8", "latin-1", "text processing"]
image: "https://bytetunnels.com/assets/img/2026-02-07-how-to-decode-garbled-text-fixing-encoding-mismatches-hero.jpg"
site: "ByteTunnels"
site_url: "https://bytetunnels.com"
license: "https://creativecommons.org/licenses/by/4.0/"
license_name: "CC BY 4.0"
---

You scraped a page and got "Ã©" instead of "é", or "ÄŸ" instead of "ğ", or "â€™" instead of a curly apostrophe. The data is there, but it looks like it was run through a blender. This corruption has a name -- mojibake -- and it happens when bytes that were encoded in one character set get decoded using a different one. The good news is that it is almost always reversible. Once you understand the pattern, you can fix garbled text in a single line of Python.

## What Causes Garbled Text

Every character you see on screen is stored as a sequence of bytes. The [encoding](/posts/character-encodings-handling-text/) determines how characters map to bytes, and how bytes map back to characters. UTF-8 encodes "é" as two bytes: `0xC3 0xA9`. Latin-1 (ISO 8859-1) encodes "é" as a single byte: `0xE9`. When a system writes bytes using UTF-8 but the reader interprets those same bytes using Latin-1, each UTF-8 byte gets mapped to a separate Latin-1 character. The two bytes `0xC3 0xA9` become two characters: "Ã" (`0xC3` in Latin-1) and "©" (`0xA9` in Latin-1). The result is "Ã©" where you expected "é".

This is not data loss. The original bytes are intact. They were just interpreted through the wrong lens.

```python
# Demonstrating how mojibake happens
original = "é"
utf8_bytes = original.encode('utf-8')
print(utf8_bytes)        # b'\xc3\xa9'
print(len(utf8_bytes))   # 2 bytes

# Now decode those UTF-8 bytes as if they were Latin-1
garbled = utf8_bytes.decode('latin-1')
print(garbled)           # Ã©
```

The same thing happens in the other direction. If content was encoded as Latin-1 but decoded as UTF-8, you get a different kind of corruption -- often [replacement characters](/posts/some-characters-could-not-be-decoded-fixing-replacement-character-errors/) or decode errors.

## Common Mojibake Patterns and What They Mean

When you see garbled text, the specific garbage characters tell you exactly what went wrong. Here are the most common patterns you will encounter when scraping.

### UTF-8 Decoded as Latin-1

This is the most frequent case. UTF-8 multi-byte sequences get split into individual Latin-1 characters.

| Original | Garbled | UTF-8 Bytes |
|----------|---------|-------------|
| é | Ã© | `C3 A9` |
| ü | Ã¼ | `C3 BC` |
| ñ | Ã± | `C3 B1` |
| ö | Ã¶ | `C3 B6` |
| ä | Ã¤ | `C3 A4` |
| ğ | ÄŸ | `C4 9F` |
| ş | ÅŸ | `C5 9F` |
| ç | Ã§ | `C3 A7` |

The telltale sign is the "Ã" character appearing before accented vowels. If you see "Ã" scattered throughout your scraped text, you are almost certainly looking at UTF-8 bytes misinterpreted as Latin-1.

### UTF-8 Decoded as Windows-1252

Windows-1252 (often mislabeled as "ANSI" on Windows systems) is a superset of Latin-1 that fills in the `0x80`-`0x9F` range with printable characters like curly quotes and dashes. When UTF-8 text gets decoded as Windows-1252, you see a slightly different pattern.

| Original | Garbled | UTF-8 Bytes |
|----------|---------|-------------|
| ' (right single quote) | â€™ | `E2 80 99` |
| " (left double quote) | â€œ | `E2 80 9C` |
| " (right double quote) | â€ | `E2 80 9D` |
| -- (em dash) | â€" | `E2 80 94` |
| ... (ellipsis) | â€¦ | `E2 80 A6` |

If you see "â€" sequences in your scraped data, the source is UTF-8 content decoded as Windows-1252.

### Double Encoding

Sometimes content gets corrupted twice. UTF-8 text is misread as Latin-1, then the already-garbled result gets encoded to UTF-8 again. The result is even more garbled.

| Original | Single Mojibake | Double Mojibake |
|----------|----------------|-----------------|
| é | Ã© | Ã�Â© |
| ü | Ã¼ | Ã�Â¼ |

Double encoding produces longer garbage strings with more "Ã" characters. The fix requires two rounds of the correction process.

## The Fix Pattern

The repair is straightforward: re-encode the garbled text using the wrong encoding (the one that was mistakenly used to decode it), then decode the resulting bytes with the correct encoding.

```python
# UTF-8 content that was decoded as Latin-1
broken = "Ã©"
fixed = broken.encode('latin-1').decode('utf-8')
print(fixed)  # é
```

What happens here step by step:

1. `broken.encode('latin-1')` converts the garbled string back to bytes using Latin-1. Since the garbled characters came from Latin-1 interpretation, this recovers the original raw bytes: `b'\xc3\xa9'`.
2. `.decode('utf-8')` interprets those bytes as UTF-8, which is what they were all along. Result: "é".

For Windows-1252 mojibake, adjust accordingly:

```python
# UTF-8 content that was decoded as Windows-1252
broken = "â€™"
fixed = broken.encode('windows-1252').decode('utf-8')
print(fixed)  # '
```

For double-encoded text, apply the fix twice:

```python
# Double-encoded text
broken = "Ã\x83Â©"
first_pass = broken.encode('latin-1').decode('utf-8')
print(first_pass)  # Ã©  (still garbled, but only single mojibake now)
second_pass = first_pass.encode('latin-1').decode('utf-8')
print(second_pass)  # é
```

## Diagnosing the Encoding Pair

When you encounter garbled text and are not sure which encoding pair caused the problem, you can try common combinations systematically.

```python
def try_fix(broken_text, encodings_to_try=None):
    """Try common encoding pair fixes and show results."""
    if encodings_to_try is None:
        encodings_to_try = [
            ('latin-1', 'utf-8'),
            ('windows-1252', 'utf-8'),
            ('iso-8859-15', 'utf-8'),
            ('cp437', 'utf-8'),
        ]

    results = []
    for wrong_enc, right_enc in encodings_to_try:
        try:
            fixed = broken_text.encode(wrong_enc).decode(right_enc)
            results.append((wrong_enc, right_enc, fixed))
        except (UnicodeDecodeError, UnicodeEncodeError):
            continue

    return results


broken = "CafÃ© au lait"
for wrong, right, fixed in try_fix(broken):
    print(f"{wrong} -> {right}: {fixed}")
# latin-1 -> utf-8: Café au lait
# iso-8859-15 -> utf-8: Café au lait
```

You can also inspect the raw bytes to understand what happened:

```python
def inspect_bytes(text, encoding='utf-8'):
    """Show the byte representation of a string."""
    raw = text.encode(encoding, errors='replace')
    hex_str = ' '.join(f'{b:02x}' for b in raw)
    print(f"Text: {text}")
    print(f"Encoding: {encoding}")
    print(f"Bytes: {hex_str}")
    print(f"Length: {len(raw)} bytes, {len(text)} chars")
    print()


inspect_bytes("é", 'utf-8')
# Text: é
# Encoding: utf-8
# Bytes: c3 a9
# Length: 2 bytes, 1 chars

inspect_bytes("é", 'latin-1')
# Text: é
# Encoding: latin-1
# Bytes: e9
# Length: 1 bytes, 1 chars

inspect_bytes("Ã©", 'latin-1')
# Text: Ã©
# Encoding: latin-1
# Bytes: c3 a9
# Length: 2 bytes, 2 chars
```

Notice that "Ã©" encoded as Latin-1 produces the exact same bytes as "é" encoded as UTF-8. That confirms the mojibake source.

## Using ftfy for Automatic Fixes

The [ftfy](https://github.com/rspeer/python-ftfy) library (short for "fixes text for you") automates mojibake detection and repair. It recognizes common corruption patterns and applies the right fix automatically.

```bash
pip install ftfy
```

Basic usage:

```python
import ftfy

# Single mojibake
print(ftfy.fix_text("Ã©"))           # é
print(ftfy.fix_text("CafÃ© au lait")) # Café au lait

# Windows-1252 mojibake
print(ftfy.fix_text("â€™"))           # '
print(ftfy.fix_text("â€œHelloâ€\x9d"))  # "Hello"

# Double encoding
print(ftfy.fix_text("Ã\x83Â©"))      # é

# Mixed mojibake in longer text
text = "The rÃ©sumÃ© was written by JosÃ© GarcÃ­a"
print(ftfy.fix_text(text))
# The résumé was written by José García
```

ftfy also handles other text problems beyond mojibake:

```python
import ftfy

# Fixes HTML entities that leaked into text
print(ftfy.fix_text("&amp; &lt;"))  # & <

# Fixes curly quote issues
print(ftfy.fix_text("wonâ€™t"))  # won't

# Normalizes Unicode representations
print(ftfy.fix_text("\u00e9"))  # é (already correct, left as-is)
```

For more control, you can use `ftfy.fix_and_explain` to see what ftfy detected and how it fixed the text:

```python
import ftfy

fixed, explanations = ftfy.fix_and_explain("CafÃ© au lait")
print(fixed)         # Café au lait
print(explanations)  # ['encode->#1-decode->utf-8']
```

The explanation string tells you which encoding pair ftfy used, confirming the diagnosis.

## Detecting Encodings with chardet and charset-normalizer

When you have raw bytes and do not know the encoding at all, [charset detection libraries](/posts/charset-detection-python-chardet-cchardet-charset-normalizer/) can help. These analyze byte patterns and give you a best guess.

```bash
pip install chardet charset-normalizer
```

Using chardet:

```python
import chardet

# Detect encoding of raw bytes
raw_bytes = b'\xc3\xa9'  # UTF-8 encoded "é"
result = chardet.detect(raw_bytes)
print(result)
# {'encoding': 'utf-8', 'confidence': 0.7525, 'language': ''}

# Detect encoding of a larger sample (more accurate with more data)
with open('scraped_page.html', 'rb') as f:
    raw = f.read()
    detected = chardet.detect(raw)
    print(detected)
    # {'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}
    text = raw.decode(detected['encoding'])
```

Using charset-normalizer (faster, often more accurate):

```python
from charset_normalizer import from_bytes

raw_bytes = open('scraped_page.html', 'rb').read()
results = from_bytes(raw_bytes)
best = results.best()

if best:
    print(f"Encoding: {best.encoding}")
    print(f"Coherence: {best.coherence}")
    text = str(best)
```

charset-normalizer can also rank multiple possible encodings:

```python
from charset_normalizer import from_bytes

raw = b'Caf\xe9 au lait'  # Latin-1 encoded
results = from_bytes(raw)

for result in results:
    print(f"{result.encoding}: coherence {result.coherence}")
```


<figure>
  <img src="/assets/img/inline-how-to-decode-garbled-text-fixing-encodi-1.jpg" alt="Every character has a number, and getting that number wrong breaks everything." loading="lazy">
  <figcaption>Every character has a number, and getting that number wrong breaks everything. <span class="img-credit">Photo by Nataliya Vaitkevich / <a href="https://www.pexels.com" target="_blank" rel="noopener noreferrer">Pexels</a></span></figcaption>
</figure>

## Prevention: Getting Encoding Right the First Time

The best fix for mojibake is never creating it in the first place. Here are the key practices for web scraping.

### Use Raw Bytes, Not Decoded Text

When scraping with the `requests` library, use `response.content` (bytes) instead of `response.text` (string). The `.text` property uses the encoding that `requests` guesses, which is often wrong.

```python
import requests

response = requests.get('https://example.com')

# BAD: requests guesses the encoding (often defaults to ISO-8859-1 for text/html)
text = response.text  # May be garbled

# GOOD: get raw bytes and decode yourself
raw = response.content
text = raw.decode('utf-8')  # You control the encoding
```

The problem is especially bad with `requests` because of an HTTP specification quirk. RFC 2616 says that `text/*` content types with no explicit charset should default to ISO-8859-1. So when a server sends UTF-8 content without a charset header, `requests` decodes it as ISO-8859-1, producing mojibake.

```python
import requests

response = requests.get('https://example.com')

# Check what encoding requests detected
print(response.encoding)  # Might show 'ISO-8859-1' even for UTF-8 content

# Check the Content-Type header
print(response.headers.get('Content-Type'))
# 'text/html' (no charset specified -- requests defaults to ISO-8859-1)

# Override with the correct encoding before accessing .text
response.encoding = 'utf-8'
text = response.text  # Now correctly decoded
```

### Specify Encoding When Reading Files

```python
# BAD: uses system default encoding (platform-dependent)
with open('data.csv') as f:
    text = f.read()

# GOOD: explicit encoding
with open('data.csv', encoding='utf-8') as f:
    text = f.read()

# ALSO GOOD: read as bytes and decode
with open('data.csv', 'rb') as f:
    raw = f.read()
    text = raw.decode('utf-8')
```

### Check the HTML Meta Tag

Sometimes the HTTP headers say one encoding but the HTML document declares another. Check both.

```python
import requests
from bs4 import BeautifulSoup

response = requests.get('https://example.com')

# Check HTTP header
http_encoding = response.encoding
print(f"HTTP says: {http_encoding}")

# Check HTML meta tag
soup = BeautifulSoup(response.content, 'html.parser')
meta = soup.find('meta', charset=True)
if meta:
    html_encoding = meta['charset']
    print(f"HTML says: {html_encoding}")

meta_http = soup.find('meta', attrs={'http-equiv': 'Content-Type'})
if meta_http:
    content = meta_http.get('content', '')
    if 'charset=' in content:
        html_encoding = content.split('charset=')[-1].strip()
        print(f"HTML meta http-equiv says: {html_encoding}")

# Use the HTML-declared encoding as it's usually more accurate
text = response.content.decode(html_encoding, errors='replace')
```

## Common Web Scraping Encoding Scenarios

### Server Says Latin-1, Content Is Actually UTF-8

This is the single most [common encoding problem in web scraping](/posts/text-encoding-issues-web-scraping-common-problems-fixes/). The server either sends no charset header (causing libraries to default to Latin-1) or explicitly declares Latin-1 when the content is UTF-8.

```python
import requests
import chardet

def smart_decode(response):
    """Decode response content using the best available encoding info."""
    # First, try what the server says
    declared = response.encoding

    # Detect from content
    detected = chardet.detect(response.content)

    # If server says latin-1 but detector says utf-8 with high confidence, trust detector
    if declared and declared.lower() in ('iso-8859-1', 'latin-1'):
        if detected['encoding'] and detected['encoding'].lower() == 'utf-8':
            if detected['confidence'] > 0.7:
                return response.content.decode('utf-8')

    # Otherwise use detected encoding
    if detected['encoding']:
        return response.content.decode(detected['encoding'], errors='replace')

    # Fallback
    return response.content.decode('utf-8', errors='replace')

response = requests.get('https://example.com')
text = smart_decode(response)
```

### Mixed Encodings on the Same Page

Some pages have content from multiple sources with different encodings. A page template might be UTF-8 while user-generated content embedded in it was stored as Windows-1252.

```python
import ftfy

def fix_mixed_encoding(text):
    """Fix text that may have multiple encoding issues."""
    # ftfy handles mixed encoding well
    fixed = ftfy.fix_text(text)

    # For stubborn cases, try line by line
    if 'Ã' in fixed or 'â€' in fixed:
        lines = fixed.split('\n')
        fixed_lines = [ftfy.fix_text(line) for line in lines]
        fixed = '\n'.join(fixed_lines)

    return fixed
```

### No Encoding Specified Anywhere

When neither the HTTP headers nor the HTML meta tags declare an encoding, you have to detect it.

```python
import requests
from charset_normalizer import from_bytes

def fetch_with_detection(url):
    """Fetch a URL and detect encoding from content."""
    response = requests.get(url)

    # Try charset_normalizer on the raw bytes
    results = from_bytes(response.content)
    best = results.best()

    if best:
        return str(best), best.encoding
    else:
        # Last resort: try UTF-8, fall back to latin-1
        try:
            return response.content.decode('utf-8'), 'utf-8'
        except UnicodeDecodeError:
            return response.content.decode('latin-1'), 'latin-1'

text, encoding = fetch_with_detection('https://example.com')
print(f"Detected encoding: {encoding}")
```

## Complete Fixing Workflow

Here is a practical workflow for handling encoding issues in a scraping pipeline.

```python
import ftfy
import chardet
import requests


def scrape_with_encoding_fix(url):
    """Full encoding-aware scraping workflow."""

    # Step 1: Fetch raw bytes
    response = requests.get(url)
    raw_bytes = response.content

    # Step 2: Detect encoding
    detected = chardet.detect(raw_bytes)
    encoding = detected['encoding'] or 'utf-8'
    confidence = detected['confidence']

    print(f"Detected: {encoding} (confidence: {confidence:.2f})")
    print(f"Server declared: {response.encoding}")

    # Step 3: Decode with detected encoding
    try:
        text = raw_bytes.decode(encoding)
    except (UnicodeDecodeError, LookupError):
        text = raw_bytes.decode('utf-8', errors='replace')

    # Step 4: Fix any remaining mojibake
    text = ftfy.fix_text(text)

    # Step 5: Verify -- check for common mojibake markers
    mojibake_markers = ['Ã©', 'Ã¼', 'Ã±', 'â€™', 'â€œ', 'Ã¶', 'Ã¤']
    for marker in mojibake_markers:
        if marker in text:
            print(f"Warning: possible remaining mojibake detected ({marker})")
            # Try one more round of fixing
            text = ftfy.fix_text(text)
            break

    return text


# Usage
text = scrape_with_encoding_fix('https://example.com')
```

For batch processing scraped files that are already saved with wrong encoding:

```python
import ftfy
import os


def fix_encoding_in_files(directory, pattern='*.txt'):
    """Fix mojibake in all text files in a directory."""
    import glob

    files = glob.glob(os.path.join(directory, pattern))

    for filepath in files:
        # Read as bytes to avoid further corruption
        with open(filepath, 'rb') as f:
            raw = f.read()

        # Try UTF-8 first
        try:
            text = raw.decode('utf-8')
        except UnicodeDecodeError:
            text = raw.decode('latin-1')

        # Fix any mojibake
        fixed = ftfy.fix_text(text)

        if fixed != text:
            print(f"Fixed: {filepath}")
            with open(filepath, 'w', encoding='utf-8') as f:
                f.write(fixed)
        else:
            print(f"OK: {filepath}")
```

## Quick Reference: Encoding Fix Commands

For quick one-off fixes, here are the most common patterns.

```python
# UTF-8 misread as Latin-1 (most common)
text.encode('latin-1').decode('utf-8')

# UTF-8 misread as Windows-1252
text.encode('windows-1252').decode('utf-8')

# Latin-1 misread as UTF-8 (less common, usually causes errors not mojibake)
text.encode('utf-8').decode('latin-1')

# Double-encoded UTF-8
text.encode('latin-1').decode('utf-8').encode('latin-1').decode('utf-8')

# Automatic fix with ftfy
import ftfy
ftfy.fix_text(text)
```

And from the command line:

```bash
# Convert file encoding with iconv
iconv -f LATIN1 -t UTF-8 input.txt > output.txt

# Detect encoding with file command
file -bi scraped_page.html
# text/html; charset=utf-8

# Python one-liner to fix a file
python -c "
import ftfy, sys
text = open(sys.argv[1], 'rb').read().decode('latin-1')
print(ftfy.fix_text(text))
" garbled.txt > fixed.txt
```

## Summary

Garbled text from web scraping is almost always caused by an encoding mismatch between writer and reader. The fix follows a consistent pattern: re-encode with the wrong encoding to recover the original bytes, then decode with the correct one. For automated fixes, ftfy handles the detection and correction in a single call. For prevention, always work with raw bytes from HTTP responses, [detect encoding](/posts/character-encoding-detection-automated-tools-techniques/) before decoding, and never trust library defaults. The combination of `response.content` instead of `response.text`, chardet or charset-normalizer for detection, and ftfy as a safety net will handle the vast majority of encoding issues in a scraping pipeline.
