gotcha, python, Moderate, pending

Gotcha: Python string encoding and bytes confusion

Submitted by: @anonymous
unicode, encoding, bytes, utf-8, decode, encode

Error Messages

TypeError: can't concat str to bytes
UnicodeDecodeError: 'utf-8' codec can't decode
UnicodeEncodeError: 'ascii' codec can't encode
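
Each of these errors can be reproduced in a few lines. The snippet below is a minimal sketch; the `capture` helper and the sample values are hypothetical and exist only for illustration.

```python
def capture(fn):
    """Run fn and return the exception it raises, or None if it succeeds."""
    try:
        fn()
    except Exception as exc:
        return exc
    return None

# bytes + str: the two types are never coerced into each other.
concat_error = capture(lambda: b"Hello" + " World")

# 0xE9 is 'é' in Latin-1, but it is not a complete UTF-8 sequence.
decode_error = capture(lambda: b"caf\xe9".decode("utf-8"))

# 'é' (U+00E9) has no ASCII representation.
encode_error = capture(lambda: "caf\u00e9".encode("ascii"))

for err in (concat_error, decode_error, encode_error):
    print(type(err).__name__, "-", err)
```

The exact message text can vary between Python versions, so it is safer to match on the exception type than on the wording.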

Problem

Mixing up str and bytes in Python 3 raises TypeError or UnicodeDecodeError, or silently produces garbled text, in file I/O, network code, and API calls.

Solution

Python string encoding essentials:

# FUNDAMENTAL: str is Unicode text, bytes is raw data
text = 'Hello'       # str (Unicode)
data = b'Hello'      # bytes (raw)

# CONVERSION:
data = text.encode('utf-8')   # str -> bytes
text = data.decode('utf-8')   # bytes -> str

# COMMON ERRORS:

# Error 1: Mixing str and bytes
# 'Hello' + b' World'  # TypeError!

# Error 2: Wrong encoding assumption
# data.decode('ascii')  # UnicodeDecodeError if data has non-ASCII

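# Error 3: Double encoding
text = 'caf\u00e9'           # 'café' (precomposed é, U+00E9)
data = text.encode('utf-8')  # b'caf\xc3\xa9'
# Python 3 blocks the direct mistake: bytes has no .encode()
# data.encode('utf-8')  # AttributeError
# It still happens when bytes are decoded with the wrong codec, then re-encoded:
# data.decode('latin-1').encode('utf-8')  # b'caf\xc3\x83\xc2\xa9' (mojibake)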

# FILE I/O:
# Text mode (default) - decodes for you; pass encoding= explicitly because the default depends on the platform
with open('file.txt', 'r', encoding='utf-8') as f:
    text = f.read()  # Returns str

# Binary mode - raw bytes
with open('image.png', 'rb') as f:
    data = f.read()  # Returns bytes

# HTTP RESPONSES:
import requests
resp = requests.get('https://api.example.com/data')
resp.text       # str (decoded with resp.encoding from the headers, else a guess from the body)
resp.content    # bytes (raw response body)
resp.json()     # Parsed JSON (decoded automatically)

# HANDLING UNKNOWN ENCODING:
try:
    text = data.decode('utf-8')
except UnicodeDecodeError:
    text = data.decode('latin-1')  # Never fails (1:1 byte mapping), but yields mojibake if the guess is wrong
    # Or: text = data.decode('utf-8', errors='replace')  # Uses U+FFFD
    # Or: text = data.decode('utf-8', errors='ignore')   # Drops bad bytes

# SUBPROCESS:
import subprocess
result = subprocess.run(['ls'], capture_output=True, text=True)
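result.stdout  # str (text=True decodes with the locale's default encoding)
# Pass encoding='utf-8' to choose the codec explicitly
# Without text=True or encoding=, stdout is bytes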
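
The unknown-encoding fallback above can be wrapped in a small helper. This is a sketch under stated assumptions: `decode_with_fallback` is a hypothetical name, and the candidate list (`utf-8`, then `cp1252`) is only one plausible choice that should be adjusted to the data you actually receive.

```python
def decode_with_fallback(data: bytes, encodings=("utf-8", "cp1252")) -> str:
    """Try each candidate encoding in order; fall back to lossy UTF-8.

    The candidate list is an assumption for illustration, not a rule.
    """
    for name in encodings:
        try:
            return data.decode(name)
        except UnicodeDecodeError:
            continue
    # Last resort: decode anyway, replacing undecodable bytes with U+FFFD.
    return data.decode("utf-8", errors="replace")


utf8_text = decode_with_fallback("caf\u00e9".encode("utf-8"))  # valid UTF-8
legacy_text = decode_with_fallback(b"caf\xe9")                 # Windows-1252 bytes
lossy_text = decode_with_fallback(b"\x81")                     # invalid in both codecs
```

Trying strict codecs before permissive ones matters: a codec such as latin-1 accepts every byte, so placing it first would hide the fact that the data was really UTF-8.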

Why

Python 3 enforces the distinction between text (str) and binary data (bytes). This prevents the silent data corruption that plagued Python 2, but requires explicit encoding/decoding.
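
One common way to live with this rule is to decode bytes as soon as they enter the program, work only with str internally, and encode again on the way out. The sketch below illustrates that pattern; `shout` is a hypothetical function and UTF-8 is an assumed wire encoding.

```python
def shout(payload: bytes, encoding: str = "utf-8") -> bytes:
    """Upper-case a text payload that arrives and leaves as bytes."""
    text = payload.decode(encoding)   # bytes -> str at the input boundary
    result = text.upper()             # all processing happens on str
    return result.encode(encoding)    # str -> bytes at the output boundary


incoming = "caf\u00e9".encode("utf-8")
outgoing = shout(incoming)
print(outgoing.decode("utf-8"))
```

Operating on the decoded text matters here: calling `.upper()` directly on the bytes would only change ASCII letters and leave the encoded `é` untouched.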

Context

Python text processing and I/O
