snippetsqlMinor
Tell MySQL to start using utf-8 encoding without `convert to`ing it
Viewed 0 times
withoutingutfconverttellmysqlusingstartencoding
Problem
In a fairly unique situation, my team has ended up with UTF-8 bytes in a database that thinks the data is encoded as latin1.
At least, I'm 85% certain that this is the situation at hand.
For example, a right single quotation mark was handed to the database by a programming language that had no concept of encodings (Ruby 1.8) and just treated the data as raw bytes (0xE2 0x80 0x99). This data, as far as I can tell (how to verify?), was stored as those actual bytes. So now when the data is read out by a more intelligent programming language (Ruby 1.9), the database helpfully says "Oh! 0xE2 is 'â', 0x80 is '€', 0x99 is '™'", and so instead of "Mike’s", we end up with "Mike’s". This is also what I get in the mysql prompt when SELECTing that value.
So, essentially, we have a bunch of utf-8 encoded data stored in a database that thinks the data is encoded as latin1.
This makes me to somehow tell the database "No, no matter what you think, this stuff is actually utf-8".
Failed/moronic attempt #1
I noticed this:
And thought that maybe changing
At least, I'm 85% certain that this is the situation at hand.
For example, a right single quotation mark was handed to the database by a programming language that had no concept of encodings (Ruby 1.8) and just treated the data as raw bytes (0xE2 0x80 0x99). This data, as far as I can tell (how to verify?), was stored as those actual bytes. So now when the data is read out by a more intelligent programming language (Ruby 1.9), the database helpfully says "Oh! 0xE2 is 'â', 0x80 is '€', 0x99 is '™'", and so instead of "Mike’s", we end up with "Mike’s". This is also what I get in the mysql prompt when SELECTing that value.
So, essentially, we have a bunch of utf-8 encoded data stored in a database that thinks the data is encoded as latin1.
This makes me to somehow tell the database "No, no matter what you think, this stuff is actually utf-8".
CONVERT TO doesn't seem like the right tool, because then I'll end up with permanent "Mike’s".Failed/moronic attempt #1
I noticed this:
> SHOW VARIABLES WHERE Variable_name LIKE 'character\_set\_%' OR Variable_name LIKE 'collation%';
+--------------------------+-------------------+
| Variable_name | Value |
+--------------------------+-------------------+
| character_set_client | utf8 |
| character_set_connection | utf8 |
| character_set_database | utf8 |
| character_set_filesystem | binary |
| character_set_results | utf8 |
| character_set_server | latin1 |
| character_set_system | utf8 |
| collation_connection | utf8_general_ci |
| collation_database | utf8_unicode_ci |
| collation_server | latin1_swedish_ci |
+--------------------------+-------------------+And thought that maybe changing
character_set_results to latin1 would trick Solution
The solution isn't precisely the same but this question is where I originally found direction for a similar issue and the concepts there should take you where you want to go. MySQL has a
Test case with character_set_client = utf8:
You could use that logic to populate a new column that MySQL believes to be utf8.
BINARY character set and from all appearances, by converting through it, you can prevent MySQL from realizing what you're actually doing and being "too helpful."Test case with character_set_client = utf8:
mysql> select CONVERT(CONVERT(CONVERT('Mike’s' USING latin1) USING binary) USING utf8);
+--------------------------------------------------------------------------------+
| CONVERT(CONVERT(CONVERT('Mike’s' USING latin1) USING binary) USING utf8) |
+--------------------------------------------------------------------------------+
| Mike’s |
+--------------------------------------------------------------------------------+
1 row in set (0.00 sec)You could use that logic to populate a new column that MySQL believes to be utf8.
Code Snippets
mysql> select CONVERT(CONVERT(CONVERT('Mike’s' USING latin1) USING binary) USING utf8);
+--------------------------------------------------------------------------------+
| CONVERT(CONVERT(CONVERT('Mike’s' USING latin1) USING binary) USING utf8) |
+--------------------------------------------------------------------------------+
| Mike’s |
+--------------------------------------------------------------------------------+
1 row in set (0.00 sec)Context
StackExchange Database Administrators Q#33365, answer score: 4
Revisions (0)
No revisions yet.