HiveBrain v1.2.0
debug · Moderate · pending

Python UnicodeDecodeError when reading files -- encoding mismatch

Submitted by: @anonymous
UnicodeDecodeError, utf-8, cp1252, chardet, BOM, encoding
python

Error Messages

UnicodeDecodeError: 'utf-8' codec can't decode byte
invalid start byte
invalid continuation byte
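
The two reasons above can be reproduced with a few bytes. The sketch below is illustrative only: describe_decode_error is a hypothetical helper name, and the sample byte strings are made up. It relies on the start, end, and reason attributes that UnicodeDecodeError carries.

```python
def describe_decode_error(data: bytes) -> str:
    """Hypothetical helper: report why `data` is not valid UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        # exc.start and exc.end delimit the offending bytes in exc.object
        bad = data[exc.start:exc.end]
        return f"{exc.reason} at offset {exc.start}: {bad!r}"
    return "valid UTF-8"


# 0x92 is a curly apostrophe in cp1252; it can never begin a UTF-8 sequence.
print(describe_decode_error(b"it\x92s"))

# 0xe9 ("é" in cp1252) looks like the start of a multi-byte UTF-8 sequence,
# but the space that follows is not a continuation byte.
print(describe_decode_error(b"caf\xe9 au lait"))
```

The offset is a quick way to find the offending character in a hex viewer or in the raw bytes.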

Problem

Reading a file raises UnicodeDecodeError: 'utf-8' codec can't decode byte. The file opens fine in a text editor, but Python fails while decoding its bytes as UTF-8.

Solution

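The file is not UTF-8 encoded, so pass the correct encoding to open(). Common cases:

(1) Windows files: encoding='cp1252'. encoding='latin-1' also works as a fallback because it maps every byte to a character and therefore never raises, but it can silently produce the wrong characters.
(2) CSV from Excel: encoding='utf-8-sig', which strips the leading BOM.
(3) Unknown encoding: use the chardet library to detect it: import chardet; chardet.detect(raw_bytes). The result is a guess with a confidence score, not a guarantee.
(4) Binary files: open in binary mode 'rb' and do not decode them as text.
(5) Mixed encoding: use errors='replace' or errors='ignore' as a last resort; both lose data.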
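
When the encoding is unknown and chardet is not available, the cases above can be combined into a fallback chain. This is a minimal sketch, not a definitive implementation: read_text_with_fallback is a hypothetical helper, and the order of candidate encodings is an assumption that suits files which are either UTF-8 or Windows-1252.

```python
import os
import tempfile
from pathlib import Path


def read_text_with_fallback(path, encodings=("utf-8-sig", "cp1252")):
    """Hypothetical helper: return (text, encoding_used).

    Tries each candidate encoding in order and falls back to latin-1.
    """
    raw = Path(path).read_bytes()
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a code point, so this decode never raises,
    # although the characters may be wrong.
    return raw.decode("latin-1"), "latin-1"


# Demo with a made-up file that contains cp1252 bytes.
with tempfile.NamedTemporaryFile(delete=False, suffix=".csv") as tmp:
    tmp.write("naïve café".encode("cp1252"))
    demo_path = tmp.name

text, used = read_text_with_fallback(demo_path)
os.remove(demo_path)
print(used, text)  # cp1252 naïve café
```

'utf-8-sig' goes first because it accepts UTF-8 with or without a BOM. Unlike open() in text mode, bytes.decode() does not translate '\r\n' line endings.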

Why

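In text mode, open() without an encoding argument uses the platform default (locale.getpreferredencoding(False)). That is UTF-8 on most Linux and macOS systems, while on Windows it has traditionally been the locale code page (often cp1252) unless UTF-8 mode is enabled. Files created on Windows often use CP-1252 (Windows-1252), which stores characters such as curly quotes and accented letters as single bytes that are not valid UTF-8. Excel CSVs may include a UTF-8 BOM (byte order mark); plain 'utf-8' does not strip it, so it appears as '\ufeff' at the start of the text, whereas 'utf-8-sig' removes it.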
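
The byte-level difference can be seen directly in the interpreter. The sketch below uses only the standard library; the sample strings are made up for illustration.

```python
import codecs
import locale

# The encoding open() uses when no encoding= argument is given.
print(locale.getpreferredencoding(False))

# The same character is stored differently in each encoding.
text = "café"
print(text.encode("utf-8"))   # b'caf\xc3\xa9'
print(text.encode("cp1252"))  # b'caf\xe9'

# A UTF-8 BOM is three bytes (EF BB BF) at the start of the file.
bom_bytes = codecs.BOM_UTF8 + "name,city".encode("utf-8")
print(repr(bom_bytes.decode("utf-8")))      # '\ufeffname,city'
print(repr(bom_bytes.decode("utf-8-sig")))  # 'name,city'
```

The stray '\ufeff' is why the first column name of an Excel CSV can fail to match when the file is read as plain 'utf-8'.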

Code Snippets

Detect and handle file encoding

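# Detect encoding from the raw bytes (chardet is third-party: pip install chardet)
import chardet
with open('file.csv', 'rb') as f:
    result = chardet.detect(f.read())
    print(result)  # e.g. {'encoding': 'Windows-1252', 'confidence': 0.73, ...}

# chardet reports encoding=None when it cannot guess, so fall back to cp1252
encoding = result['encoding'] or 'cp1252'

# Read with the detected encoding
with open('file.csv', encoding=encoding) as f:
    content = f.read()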

# Excel CSV with BOM
with open('excel.csv', encoding='utf-8-sig') as f:
    content = f.read()
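
The last-resort error handlers from the Solution behave differently, and the difference matters when the data is kept. The sketch below is illustrative: it writes a made-up temporary file containing one cp1252 byte and reads it back as UTF-8 with each handler.

```python
import os
import tempfile

# Write bytes that are cp1252, not UTF-8 (0xe9 is "é" in cp1252).
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(b"caf\xe9 au lait")
    path = tmp.name

# errors='replace' substitutes U+FFFD for each undecodable byte.
with open(path, encoding="utf-8", errors="replace") as f:
    replaced = f.read()

# errors='ignore' silently drops the undecodable bytes.
with open(path, encoding="utf-8", errors="ignore") as f:
    ignored = f.read()

os.remove(path)

print(ascii(replaced))  # 'caf\ufffd au lait'
print(ascii(ignored))   # 'caf au lait'
```

'replace' at least leaves a visible marker where data was lost; 'ignore' leaves no trace, so it is the riskier of the two.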
