HiveBrain v1.2.0
debug, Moderate, pending

Character encoding issues — mojibake and UTF-8 handling

Submitted by: @anonymous
mojibake, UTF-8, utf8mb4, character encoding, BOM, Latin-1, charset
nodejs, python, linux

Error Messages

UnicodeDecodeError
invalid byte sequence for encoding UTF8
Incorrect string value
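
As a rough illustration, the first of these errors can be reproduced in Python by decoding Latin-1 bytes as UTF-8. The sample string and the helper name try_decode_utf8 are made up for this sketch and do not come from any library.

```python
def try_decode_utf8(raw: bytes) -> str:
    """Decode bytes as UTF-8, returning a short diagnostic on failure."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        # exc.start is the offset of the first byte that could not be decoded
        bad = raw[exc.start:exc.end]
        return f"not valid UTF-8 at byte {exc.start}: {bad!r}"


latin1_bytes = "café".encode("latin-1")  # b'caf\xe9': a lone 0xE9 byte
utf8_bytes = "café".encode("utf-8")      # b'caf\xc3\xa9'

print(try_decode_utf8(latin1_bytes))  # reports the failure instead of raising
print(try_decode_utf8(utf8_bytes))    # decodes cleanly
```

Catching the exception this way is mainly useful for diagnosis: the byte offset points at the first place where the data stops being valid UTF-8.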

Problem

Text appears as garbage characters (mojibake): accented and other non-ASCII characters display incorrectly (for example, Ã© instead of é), or question marks appear in place of emoji. This is common when reading files, database rows, or API responses.

Solution

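(1) Make the entire pipeline UTF-8: the database connection (for MySQL, SET NAMES utf8mb4), HTTP headers (Content-Type: text/html; charset=utf-8), the HTML meta tag (<meta charset="utf-8">), and the encoding of the files themselves.
(2) MySQL: use the utf8mb4 charset, not utf8. MySQL's utf8 is an alias for utf8mb3, which stores at most 3 bytes per character and therefore cannot hold emoji.
(3) Python: open files with an explicit encoding, e.g. open(path, encoding='utf-8'); without it, the default can depend on the platform locale.
(4) Node.js: decode bytes explicitly, e.g. buffer.toString('utf8') or fs.readFileSync(path, 'utf8'). Note that Buffer.from(str, 'utf-8').toString() only round-trips a string that is already decoded, so it does not repair anything.
(5) Double-encoding: if each accented character shows up as two or more garbled characters (e.g. Ã© instead of é), the text was UTF-8 that was decoded as Latin-1 and then re-encoded as UTF-8. To recover it, encode the garbled string back to Latin-1 bytes and decode those bytes as UTF-8.
(6) BOM issues: strip the UTF-8 BOM (bytes EF BB BF) from the start of files before parsing them.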
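
Steps (3), (5), and (6) can be sketched in Python roughly as follows. The helper names read_text and repair_double_encoding are hypothetical, and the repair assumes the damage came from exactly one UTF-8 to Latin-1 to UTF-8 round trip; text mangled through Windows-1252 or through several rounds would need a different codec or repeated passes.

```python
import os
import tempfile


def read_text(path: str) -> str:
    """Read a UTF-8 file; the 'utf-8-sig' codec also drops a leading BOM."""
    with open(path, encoding="utf-8-sig") as fh:
        return fh.read()


def repair_double_encoding(text: str) -> str:
    """Undo one UTF-8 -> Latin-1 -> UTF-8 round trip.

    Returns the input unchanged when it does not look double-encoded.
    """
    try:
        # Recover the original bytes, then decode them with the right codec.
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


# Demo: write a file that starts with a UTF-8 BOM, then read it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    with open(path, "wb") as fh:
        fh.write(b"\xef\xbb\xbf" + "naïve".encode("utf-8"))
    clean = read_text(path)

# Simulate the damage described in step (5), then repair it.
mangled = "café".encode("utf-8").decode("latin-1")
repaired = repair_double_encoding(mangled)

print(repr(clean))
print(repr(mangled), "->", repr(repaired))
```

The try/except fallback is a deliberate choice in this sketch: correctly decoded text usually fails one of the two conversions, so it passes through untouched instead of being damaged by the repair.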

Why

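Mojibake occurs when text encoded in one character set is decoded using a different one. UTF-8 uses 1 to 4 bytes per character, while Latin-1 always uses exactly 1. When a multi-byte UTF-8 sequence is decoded as Latin-1, each byte becomes its own character, so a single character such as é (bytes C3 A9) turns into the two characters Ã©.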
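
The mismatch can be reproduced in a few lines of Python. This is only an illustration, and the sample characters are arbitrary.

```python
# UTF-8 is variable-width: between 1 and 4 bytes per character.
widths = {ch: len(ch.encode("utf-8")) for ch in "aé€😀"}
print(widths)

# "é" is two bytes in UTF-8 (0xC3 0xA9). Latin-1 maps every byte to exactly
# one character, so decoding those bytes as Latin-1 yields two characters.
utf8_bytes = "é".encode("utf-8")
misread = utf8_bytes.decode("latin-1")
print(utf8_bytes, "->", repr(misread), f"({len(misread)} characters)")
```

Because Latin-1 assigns a character to every possible byte value, this wrong decode never raises an error, which is why the corruption tends to go unnoticed until the text is displayed.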
