Debug: Python UnicodeDecodeError when reading files

Submitted by: @anonymous
UnicodeDecodeError, encoding, utf-8, latin-1, chardet, BOM

Error Messages

UnicodeDecodeError
codec can't decode byte
invalid start byte
invalid continuation byte

Problem

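Python raises UnicodeDecodeError when it reads a file in text mode and the bytes on disk are not valid in the encoding used to decode them. If open() is called without an encoding argument, Python uses the platform default, which may differ from the encoding the file was written in.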
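A minimal reproduction of the mismatch, as a sketch: the sample string and the helper name try_decode are made up for illustration. Text written as cp1252 fails under UTF-8 with one of the messages listed above, and decodes cleanly with the encoding it was written in.

```python
# "café au lait" written as cp1252 stores é as the single byte 0xE9.
data = "café au lait".encode("cp1252")

def try_decode(raw: bytes, encoding: str) -> str:
    """Hypothetical helper: return the decoded text, or a short failure summary."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError as exc:
        return f"{exc.encoding} failed at byte {exc.start}: {exc.reason}"

# In UTF-8, 0xE9 announces a multi-byte sequence, but the next byte is a space.
print(try_decode(data, "utf-8"))   # utf-8 failed at byte 3: invalid continuation byte
print(try_decode(data, "cp1252"))  # café au lait
```

The exception's start and reason attributes point at the offending byte, which is usually enough to guess which encoding the file really uses.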

Solution

Diagnosis and fixes:

  1. Detect file encoding:


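# Using chardet (third-party: pip install chardet)
import chardet
with open('file.txt', 'rb') as f:
    result = chardet.detect(f.read())
print(result)  # e.g. {'encoding': 'ISO-8859-1', 'confidence': 0.73}; a guess, not a guarantee

# Using the file command (shell)
# file -i filename.txt   # Linux
# file -I filename.txt   # macOS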

  2. Read with the correct encoding:


with open('file.txt', encoding='utf-8') as f: ...
with open('file.txt', encoding='latin-1') as f: ...  # Never raises (every byte maps to a character), but a wrong guess yields mojibake
with open('file.txt', encoding='cp1252') as f: ...  # Windows (Western European)
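When the encoding is unknown, the candidates above can be tried in order. The following is a sketch only: read_text_with_fallback and its default candidate list are hypothetical, not part of any library. Because latin-1 accepts every byte, it has to come last, and a "successful" latin-1 decode does not prove the text is correct.

```python
import tempfile
from pathlib import Path

def read_text_with_fallback(path, encodings=("utf-8", "cp1252", "latin-1")):
    """Hypothetical helper: return (text, encoding) for the first encoding that decodes."""
    raw = Path(path).read_bytes()
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # this candidate rejected the bytes; try the next one
    # Only reachable if the candidate list has no catch-all such as latin-1.
    raise ValueError(f"none of {encodings} could decode {path}")

# Demo: a file written as cp1252 is not valid UTF-8, so the second candidate wins.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.txt"
    sample.write_bytes("naïve €5".encode("cp1252"))
    text, used = read_text_with_fallback(sample)

print(used, text)
```

Strict UTF-8 goes first because it rejects most non-UTF-8 input, which makes a successful UTF-8 decode a fairly strong signal.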

  3. Handle errors gracefully:


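with open('file.txt', encoding='utf-8', errors='replace') as f: ...
# errors='replace' substitutes U+FFFD (the replacement character) for each undecodable byte
# errors='ignore' silently drops undecodable bytes (data loss)
# errors='backslashreplace' keeps undecodable bytes as \xNN escapes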
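To see the three handlers side by side, the sketch below decodes the same illustrative byte string (made up for this example) with each of them. The errors argument behaves the same way in bytes.decode() as in open().

```python
# cp1252 bytes for "café au lait"; 0xE9 is not valid UTF-8 in this position.
raw = b"caf\xe9 au lait"

replaced = raw.decode("utf-8", errors="replace")
ignored = raw.decode("utf-8", errors="ignore")
escaped = raw.decode("utf-8", errors="backslashreplace")

print(repr(replaced))  # 'caf\ufffd au lait'
print(repr(ignored))   # 'caf au lait'
print(repr(escaped))   # 'caf\\xe9 au lait'
```

Only backslashreplace preserves enough information to recover the original byte later; replace and ignore are lossy.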

  4. Common encodings by source:


- Modern: UTF-8
- Windows: cp1252 (Western), cp1251 (Cyrillic)
- Legacy web: ISO-8859-1 / latin-1
- Excel CSV export: cp1252 or UTF-8 with BOM (utf-8-sig)
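A byte-order mark at the start of a file is one of the few reliable encoding hints. The sketch below checks for one using the BOM constants in the standard codecs module. The function name sniff_bom is hypothetical, and a missing BOM says nothing about the encoding.

```python
import codecs

def sniff_bom(raw: bytes):
    """Hypothetical helper: return a codec name if raw starts with a known BOM, else None."""
    # UTF-32 is checked before UTF-16 because the UTF-32 LE BOM begins with the UTF-16 LE BOM.
    candidates = [
        (codecs.BOM_UTF8, "utf-8-sig"),
        (codecs.BOM_UTF32_LE, "utf-32"),
        (codecs.BOM_UTF32_BE, "utf-32"),
        (codecs.BOM_UTF16_LE, "utf-16"),
        (codecs.BOM_UTF16_BE, "utf-16"),
    ]
    for bom, codec_name in candidates:
        if raw.startswith(bom):
            return codec_name
    return None

print(sniff_bom("hi".encode("utf-8-sig")))  # utf-8-sig
print(sniff_bom("hi".encode("utf-16")))     # utf-16
print(sniff_bom(b"hi"))                     # None
```

The returned names are codecs that consume the BOM while decoding, so the result can be passed straight to open(..., encoding=...).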

  5. CSV with BOM:


import csv
with open('file.csv', encoding='utf-8-sig', newline='') as f:
    reader = csv.reader(f)
    rows = list(reader)  # consume the rows while the file is still open
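The effect of utf-8-sig is easiest to see on the first header cell. In this sketch the CSV content and the helper first_header are made up for illustration, and an in-memory buffer stands in for a file on disk.

```python
import csv
import io

# CSV bytes as a BOM-writing exporter might produce them (illustrative data).
raw = "name,city\nZoë,Köln\n".encode("utf-8-sig")

def first_header(data: bytes, encoding: str) -> str:
    """Hypothetical helper: decode data with the given encoding and return the first header cell."""
    text = io.TextIOWrapper(io.BytesIO(data), encoding=encoding, newline="")
    return next(csv.reader(text))[0]

print(repr(first_header(raw, "utf-8")))      # '\ufeffname'
print(repr(first_header(raw, "utf-8-sig")))  # 'name'
```

With plain utf-8 the BOM survives as an invisible U+FEFF prefix on the first column name, which typically shows up later as a key lookup that fails for no visible reason.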
