patternsqlMinor
MySQL silently replaces UTF chars with literal question marks
Viewed 0 times
utfwithliteralmarksmysqlreplacessilentlycharsquestion
Problem
I'm experiencing a situation similar to this SO question i.e. I'm working with a legacy database which has UTF8 content in latin1 tables (pretty ugly I know).
Now I'm getting new data from a new application which is completely utf8 and works with its database. To support other legacy system the application also writes a copy of its utf8 data in legacy tables. As far as I know it should be possible to write utf8 stuff in latin1 tables as long as you read it back and show it those data as UTF8. There are a lot of tutorials explaining how to fix this situation for the long term but I'd prefer to not apply them unless absolutely necessary (legacy system will be dismissed soon and I don't want to have downtimes to fix this if possible)
Here is a minimal SQL script which reproduces my issue:
In my system I get no errors from MySQL but after the
However if I paste the same script on http://sqlfiddle.com/ I get an error when I press 'build schema' which states:
Why on my system the invalid utf8 char is simply dropped and I get no errors? Is there any mysql config value to enable in or
Now I'm getting new data from a new application which is completely utf8 and works with its database. To support other legacy system the application also writes a copy of its utf8 data in legacy tables. As far as I know it should be possible to write utf8 stuff in latin1 tables as long as you read it back and show it those data as UTF8. There are a lot of tutorials explaining how to fix this situation for the long term but I'd prefer to not apply them unless absolutely necessary (legacy system will be dismissed soon and I don't want to have downtimes to fix this if possible)
Here is a minimal SQL script which reproduces my issue:
CREATE TABLE `articles` (
`content` mediumtext NOT NULL,
FULLTEXT KEY `content` (`content`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
SET NAMES utf8;
SET CHARACTER SET utf8;
-- Turkish word for Croatia, second char is \xC4\xB1
INSERT INTO `articles` (`content`) VALUES ('Hırvatistan');In my system I get no errors from MySQL but after the
INSERT statement second char of the word is silently dropped and replaced by a literal ? ('\x3F').mysql> SELECT content, HEX(content), HEX('Hırvatistan') FROM articles;
+-------------+------------------------+--------------------------+
| content | HEX(content) | HEX('Hırvatistan') |
+-------------+------------------------+--------------------------+
| H?rvatistan | 483F72766174697374616E | 48C4B172766174697374616E |
+-------------+------------------------+--------------------------+However if I paste the same script on http://sqlfiddle.com/ I get an error when I press 'build schema' which states:
Incorrect string value: '\xC4\xB1rvat...' for column 'content' at row 1Why on my system the invalid utf8 char is simply dropped and I get no errors? Is there any mysql config value to enable in or
Solution
I made some attempts to dig into this issue, here are the results.
When you set a connection charset (i.e.
When I insert invalid utf8 chars into a latin1 it replaces them with question marks as I shown in the original question.
In order to fix my issue I had to run this command on the original table (actually I tried it on a small copy of it). It takes care of changing charset, collation and also convert existing data. I took a record with a char which is different in latin1 and utf8 encoding
After the conversion the
Now I'm experimenting with inserting invalid utf8 chars into the converted table and I get a different result for that. It seems that invalid (or not supported 4 bytes) utf chars are simply dropped from the stored value with a warning (shown only if warning enabled)
After this I found that also in my original example a warning is shown if they are enabled:
Finally to trigger an error instead of just a warning (to avoid data loss) just change SQL mode for session or globally (at server level)
When you set a connection charset (i.e.
SET NAMES utf8) MySQL transparently handle encoding conversion for you. For instance if I insert a à (\xE0 in latin1 \xC3A0 in utf8) in a latin1 table using a UTF8 connection it reads the UTF 8 value and store it in table as \xE0mysql> SELECT HEX('à');
+-----------+
| HEX('à') |
+-----------+
| C3A0 |
+-----------+
mysql> INSERT INTO articles VALUES(50001, 'à');
Query OK, 1 row affected (0,00 sec)
mysql> SELECT content, HEX(content) FROM articles WHERE id_p = 50001;
+---------+--------------+
| content | HEX(content) |
+---------+--------------+
| à | E0 |
+---------+--------------+
1 row in set (0,00 sec)When I insert invalid utf8 chars into a latin1 it replaces them with question marks as I shown in the original question.
In order to fix my issue I had to run this command on the original table (actually I tried it on a small copy of it). It takes care of changing charset, collation and also convert existing data. I took a record with a char which is different in latin1 and utf8 encoding
mysql> select HEX(BINARY SUBSTRING(content, 17, 1)), SUBSTRING(content, 17, 1) from articles where id_p = 40\G
*************************** 1. row ***************************
HEX(BINARY SUBSTRING(content, 17, 1)): 93
SUBSTRING(content, 17, 1): “
1 row in set (0,00 sec)
mysql> ALTER TABLE `articles` CONVERT TO CHARACTER SET utf8 COLLATE utf8_unicode_ci;
Query OK, 34905 rows affected (1 min 10,73 sec)
Records: 34905 Duplicates: 0 Warnings: 0
mysql> select HEX(BINARY SUBSTRING(content, 17, 3)), SUBSTRING(content, 17, 3) from articles where id_p = 40\G
*************************** 1. row ***************************
HEX(BINARY SUBSTRING(content, 17, 1)): E2809C
SUBSTRING(content, 17, 1): “
1 row in set (0,00 sec)After the conversion the
“ char was replaced by its utf8 encoding in the content and all data are still readable. The conversion also changed content column type from MEDIUMTEXT to LONGTEXt because latin1 uses 1 byte per char and utf8 up to 3 bytes per char to avoid data truncation.Now I'm experimenting with inserting invalid utf8 chars into the converted table and I get a different result for that. It seems that invalid (or not supported 4 bytes) utf chars are simply dropped from the stored value with a warning (shown only if warning enabled)
$ mysql --show-warnings
mysql> INSERT INTO articles VALUES(90000, 0xC328);
Query OK, 1 row affected, 1 warning (0,00 sec)
Warning (Code 1366): Incorrect string value: '\xC3(' for column 'content' at row 1
mysql> SELECT 0xf09f8eb6;
+------------+
| 0xf09f8eb6 |
+------------+
| |
+------------+
1 row in set (0,00 sec)
mysql> INSERT INTO articles VALUES(90001, 0xf09f8eb6);
Query OK, 1 row affected, 1 warning (0,00 sec)
Warning (Code 1366): Incorrect string value: '\xF0\x9F\x8E\xB6' for column 'content' at row 1After this I found that also in my original example a warning is shown if they are enabled:
-- With warnings enabled
mysql> INSERT INTO `articles` VALUES (50000, 'Hırvatistan');
Query OK, 1 row affected, 1 warning (0,00 sec)
Warning (Code 1366): Incorrect string value: '\xC4\xB1rvat...' for column 'content' at row 1Finally to trigger an error instead of just a warning (to avoid data loss) just change SQL mode for session or globally (at server level)
mysql> SET SESSION sql_mode = 'TRADITIONAL';
Query OK, 0 rows affected (0,00 sec)
mysql> INSERT INTO `articles` VALUES (50000, 'Hırvatistan');
ERROR 1366 (HY000): Incorrect string value: '\xC4\xB1rvat...' for column 'content' at row 1Code Snippets
mysql> SELECT HEX('à');
+-----------+
| HEX('à') |
+-----------+
| C3A0 |
+-----------+
mysql> INSERT INTO articles VALUES(50001, 'à');
Query OK, 1 row affected (0,00 sec)
mysql> SELECT content, HEX(content) FROM articles WHERE id_p = 50001;
+---------+--------------+
| content | HEX(content) |
+---------+--------------+
| à | E0 |
+---------+--------------+
1 row in set (0,00 sec)mysql> select HEX(BINARY SUBSTRING(content, 17, 1)), SUBSTRING(content, 17, 1) from articles where id_p = 40\G
*************************** 1. row ***************************
HEX(BINARY SUBSTRING(content, 17, 1)): 93
SUBSTRING(content, 17, 1): “
1 row in set (0,00 sec)
mysql> ALTER TABLE `articles` CONVERT TO CHARACTER SET utf8 COLLATE utf8_unicode_ci;
Query OK, 34905 rows affected (1 min 10,73 sec)
Records: 34905 Duplicates: 0 Warnings: 0
mysql> select HEX(BINARY SUBSTRING(content, 17, 3)), SUBSTRING(content, 17, 3) from articles where id_p = 40\G
*************************** 1. row ***************************
HEX(BINARY SUBSTRING(content, 17, 1)): E2809C
SUBSTRING(content, 17, 1): “
1 row in set (0,00 sec)$ mysql --show-warnings
mysql> INSERT INTO articles VALUES(90000, 0xC328);
Query OK, 1 row affected, 1 warning (0,00 sec)
Warning (Code 1366): Incorrect string value: '\xC3(' for column 'content' at row 1
mysql> SELECT 0xf09f8eb6;
+------------+
| 0xf09f8eb6 |
+------------+
| |
+------------+
1 row in set (0,00 sec)
mysql> INSERT INTO articles VALUES(90001, 0xf09f8eb6);
Query OK, 1 row affected, 1 warning (0,00 sec)
Warning (Code 1366): Incorrect string value: '\xF0\x9F\x8E\xB6' for column 'content' at row 1-- With warnings enabled
mysql> INSERT INTO `articles` VALUES (50000, 'Hırvatistan');
Query OK, 1 row affected, 1 warning (0,00 sec)
Warning (Code 1366): Incorrect string value: '\xC4\xB1rvat...' for column 'content' at row 1mysql> SET SESSION sql_mode = 'TRADITIONAL';
Query OK, 0 rows affected (0,00 sec)
mysql> INSERT INTO `articles` VALUES (50000, 'Hırvatistan');
ERROR 1366 (HY000): Incorrect string value: '\xC4\xB1rvat...' for column 'content' at row 1Context
StackExchange Database Administrators Q#96265, answer score: 4
Revisions (0)
No revisions yet.