patternsqlMinor
Postgres database encoding problem
Viewed 0 times
databaseproblempostgresencoding
Problem
I'm striving to convert badly encoded data from my table. For instance, I have a field with
I tried using Postgres's functions
Using python I'm able to get the correct encoding with:
Any hint on how to convert these wrongly encoded data in postgres ?
Nadège which should be Nadège.I tried using Postgres's functions
convert, convert_from, convert_to without much success.db=# SHOW client_encoding;
client_encoding
-----------------
UTF8
(1 row)
db=# SHOW server_encoding;
server_encoding
-----------------
UTF8
(1 row)
db=# SELECT "firstName", encode("firstName"::bytea, 'hex') FROM contact;
firstName | encode
-----------+--------------------
Nadège | 4e6164c3a86765
Nadège | 4e6164c383c2a86765
(2 rows)
db=# SELECT "firstName", convert_from("firstName"::bytea, 'latin1') FROM contact WHERE "lastName" ILIKE 'crochard';
firstName | convert_from
-----------+----------------
Nadège | Nadège
Nadège | NadÃ\u0083¨ge
(2 rows)
db=# SELECT "firstName", convert("firstName"::bytea, 'utf8', 'latin1') FROM contact;
firstName | convert
-----------+------------------
Nadège | \x4e6164e86765
Nadège | \x4e6164c3a86765
(2 rows)Using python I'm able to get the correct encoding with:
data.encode('latin1').decode('utf8')Any hint on how to convert these wrongly encoded data in postgres ?
Solution
As you have correctly identified,
To fix it you need to:
So:
The Python equivalent would be close to what you wrote, but starts with a unicode representation to illustrate that PostgreSQL stores everything in the database encoding. Something like:
The problem with all your attempted solutions is that the cast from
because you have to explicitly decode the utf-8 representation produced by the cast before re-interpreting as latin-1 and decoding again.
You just needed to use
Nadège is the UTF-8 representation of Nadège incorrectly decoded as ISO-8859-1 ("latin-1"). Then, in your case, re-encoded to UTF-8 for storage in the DB.To fix it you need to:
- Take the current representation and decode the UTF-8 to latin-1 as a byte string
- re-interpret the byte string, decoding it as utf-8
So:
test=> SELECT convert_from(convert_to('Nadège', 'latin-1'), 'utf-8');
convert_from
--------------
Nadège
(1 row)The Python equivalent would be close to what you wrote, but starts with a unicode representation to illustrate that PostgreSQL stores everything in the database encoding. Something like:
>>> print u"Nadège".encode("latin-1").decode("utf-8")
NadègeThe problem with all your attempted solutions is that the cast from
text to bytea uses the database encoding. So you're starting with the bytes for the utf-8 representation of utf-8 mis-decoded as latin-1. With the cast you'd have to write:test=> SELECT convert_from(convert_to(convert_from((TEXT 'Nadège')::bytea, 'utf-8'), 'latin-1'), 'utf-8');
convert_from
--------------
Nadège
(1 row)because you have to explicitly decode the utf-8 representation produced by the cast before re-interpreting as latin-1 and decoding again.
You just needed to use
convert_to(mycol, 'latin-1') instead of mycol::byteaCode Snippets
test=> SELECT convert_from(convert_to('Nadège', 'latin-1'), 'utf-8');
convert_from
--------------
Nadège
(1 row)>>> print u"Nadège".encode("latin-1").decode("utf-8")
Nadègetest=> SELECT convert_from(convert_to(convert_from((TEXT 'Nadège')::bytea, 'utf-8'), 'latin-1'), 'utf-8');
convert_from
--------------
Nadège
(1 row)Context
StackExchange Database Administrators Q#123815, answer score: 4
Revisions (0)
No revisions yet.