Python UnicodeDecodeError when reading files -- encoding mismatch
UnicodeDecodeError, utf-8, cp1252, chardet, BOM, encoding
python
Error Messages
Problem
Reading a file raises UnicodeDecodeError: 'utf-8' codec can't decode byte. The file opens fine in a text editor, but Python cannot decode its contents as UTF-8.
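To find out which byte triggers the error, the exception object can be inspected. The following is a minimal sketch that reproduces the failure in memory; the sample word and the variable names are illustrative assumptions, not part of the original report.

```python
# Bytes as a Windows-1252 editor would save the word "café".
raw = "café".encode("cp1252")  # b'caf\xe9'

try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    # exc.start is the offset of the first undecodable byte,
    # exc.object is the bytes object that was being decoded.
    bad_offset = exc.start
    bad_byte = exc.object[exc.start]
    print(f"{exc.reason!r} at offset {bad_offset}: byte 0x{bad_byte:02x}")
```

Bytes in the range 0x80 to 0xFF that appear where an accented letter or a curly quote is expected usually point to a single-byte encoding such as CP-1252.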
Solution
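The file is not UTF-8 encoded, or it contains bytes that are not valid UTF-8. Common cases:
(1) Files created on Windows: pass encoding='cp1252', or encoding='latin-1', which accepts every byte value.
(2) CSV exported from Excel: pass encoding='utf-8-sig', which strips a leading BOM if one is present.
(3) Unknown encoding: use the chardet library to detect it: import chardet; chardet.detect(raw_bytes). The result is a guess with a confidence score, not a guarantee.
(4) Binary files: open in binary mode ('rb') and do not decode them as text.
(5) Mixed encodings: use errors='replace' or errors='ignore' as a last resort, since both silently lose the bytes that could not be decoded.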
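When chardet is not available, the cases above can be approximated by trying encodings in order, strictest first. This is only a sketch under assumptions: read_text_with_fallback is a hypothetical helper, not a library function, and the order of candidate encodings is a guess that suits typical Windows and Excel files.

```python
from pathlib import Path


def read_text_with_fallback(path, encodings=("utf-8-sig", "cp1252")):
    """Hypothetical helper: return (text, encoding_used) for a file."""
    raw = Path(path).read_bytes()
    for enc in encodings:
        try:
            # 'utf-8-sig' decodes plain UTF-8 too and strips a BOM if present.
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a character, so this cannot raise,
    # but the text may be wrong if the real encoding is something else.
    return raw.decode("latin-1"), "latin-1"
```

A successful decode only shows that the bytes are valid in that encoding, not that the encoding is the right one, so the result is worth checking by eye for garbled characters.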
Why
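In text mode, Python 3's open() uses the locale's preferred encoding unless encoding= is given. That is UTF-8 on most Linux and macOS systems, so the error appears whenever UTF-8 is in effect and the file contains bytes that are not valid UTF-8. Files created on Windows often use CP-1252 (Windows-1252), where accented characters are single bytes that UTF-8 rejects. Excel CSVs may include a UTF-8 BOM (byte order mark). Plain 'utf-8' does not raise on a BOM but leaves it in the text as '\ufeff' at the start of the first field, whereas 'utf-8-sig' strips it.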
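Both points can be checked directly. The sketch below prints the encoding that open() would normally fall back to on the current machine and shows how a BOM behaves under 'utf-8' versus 'utf-8-sig'. The sample CSV header is an assumption made for illustration.

```python
import codecs
import locale

# Encoding open() normally uses in text mode when encoding= is omitted.
# The value depends on the platform and on whether UTF-8 mode is enabled.
print(locale.getpreferredencoding(False))

# A small UTF-8 payload with a leading BOM, as some Excel CSV exports write it.
raw = codecs.BOM_UTF8 + "name,city\n".encode("utf-8")

as_utf8 = raw.decode("utf-8")          # BOM survives as U+FEFF
as_utf8_sig = raw.decode("utf-8-sig")  # BOM is stripped

print(repr(as_utf8[0]))
print(as_utf8_sig.startswith("name"))
```

A stray '\ufeff' typically shows up later as a first column name that does not match the expected key, which is why the BOM case is easy to miss.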
Code Snippets
Detect and handle file encoding
# Detect the encoding from the raw bytes
import chardet

with open('file.csv', 'rb') as f:
    result = chardet.detect(f.read())
print(result)  # e.g. {'encoding': 'Windows-1252', 'confidence': 0.73}

# Read with the detected encoding; chardet may return None for 'encoding',
# so fall back to UTF-8 in that case
encoding = result['encoding'] or 'utf-8'
with open('file.csv', encoding=encoding) as f:
    content = f.read()

# Excel CSV with BOM: 'utf-8-sig' strips the leading BOM
with open('excel.csv', encoding='utf-8-sig') as f:
    content = f.read()