HiveBrain v1.2.0
Get Started
← Back to all entries
snippetsqlMinor

Tell MySQL to start using utf-8 encoding without `convert to`ing it

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
withoutingutfconverttellmysqlusingstartencoding

Problem

In a fairly unique situation, my team has ended up with UTF-8 bytes in a database that thinks the data is encoded as latin1.

At least, I'm 85% certain that this is the situation at hand.

For example, a right single quotation mark was handed to the database by a programming language that had no concept of encodings (Ruby 1.8) and just treated the data as raw bytes (0xE2 0x80 0x99). This data, as far as I can tell (how to verify?), was stored as those actual bytes. So now when the data is read out by a more intelligent programming language (Ruby 1.9), the database helpfully says "Oh! 0xE2 is 'â', 0x80 is '€', 0x99 is '™'", and so instead of "Mike’s", we end up with "Mike’s". This is also what I get in the mysql prompt when SELECTing that value.

So, essentially, we have a bunch of utf-8 encoded data stored in a database that thinks the data is encoded as latin1.

This makes me to somehow tell the database "No, no matter what you think, this stuff is actually utf-8". CONVERT TO doesn't seem like the right tool, because then I'll end up with permanent "Mike’s".

Failed/moronic attempt #1

I noticed this:

> SHOW VARIABLES WHERE Variable_name LIKE 'character\_set\_%' OR Variable_name LIKE 'collation%';
+--------------------------+-------------------+
| Variable_name            | Value             |
+--------------------------+-------------------+
| character_set_client     | utf8              |
| character_set_connection | utf8              |
| character_set_database   | utf8              |
| character_set_filesystem | binary            |
| character_set_results    | utf8              |
| character_set_server     | latin1            |
| character_set_system     | utf8              |
| collation_connection     | utf8_general_ci   |
| collation_database       | utf8_unicode_ci   |
| collation_server         | latin1_swedish_ci |
+--------------------------+-------------------+


And thought that maybe changing character_set_results to latin1 would trick

Solution

The solution isn't precisely the same but this question is where I originally found direction for a similar issue and the concepts there should take you where you want to go. MySQL has a BINARY character set and from all appearances, by converting through it, you can prevent MySQL from realizing what you're actually doing and being "too helpful."

Test case with character_set_client = utf8:

mysql> select CONVERT(CONVERT(CONVERT('Mike’s' USING latin1) USING binary) USING utf8);
+--------------------------------------------------------------------------------+
| CONVERT(CONVERT(CONVERT('Mike’s' USING latin1) USING binary) USING utf8)     |
+--------------------------------------------------------------------------------+
| Mike’s                                                                         |
+--------------------------------------------------------------------------------+
1 row in set (0.00 sec)


You could use that logic to populate a new column that MySQL believes to be utf8.

Code Snippets

mysql> select CONVERT(CONVERT(CONVERT('Mike’s' USING latin1) USING binary) USING utf8);
+--------------------------------------------------------------------------------+
| CONVERT(CONVERT(CONVERT('Mike’s' USING latin1) USING binary) USING utf8)     |
+--------------------------------------------------------------------------------+
| Mike’s                                                                         |
+--------------------------------------------------------------------------------+
1 row in set (0.00 sec)

Context

StackExchange Database Administrators Q#33365, answer score: 4

Revisions (0)

No revisions yet.