HiveBrain v1.2.0
Get Started
← Back to all entries
patternMinor

Is there a (non-binary) MySQL collation that doesn't treat different mathematical symbols as the same character?

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
samethenonsymbolscharacterdifferentmysqldoesnbinarythat

Problem

I've run into a real headache with MySQL's collations and non-BMP characters (ones with Unicode codepoints above U+FFFF).

Basically, given a table and data like:

CREATE TABLE `math` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `symbols` varchar(32) character set utf8mb4 not null,
  PRIMARY KEY (`id`),
  UNIQUE KEY `symbols` (`symbols`)
);
INSERT INTO `math` VALUES (1,'');


(You may not have a font to display the character in the string literal above. It's U+1D542 MATHEMATICAL DOUBLE-STRUCK CAPITAL K)

things look OK:

mysql> select * from math;
+----+---------+
| id | symbols |
+----+---------+
|  1 |        |
+----+---------+
1 row in set (0.00 sec)

mysql> select * from math where symbols = '';
+----+---------+
| id | symbols |
+----+---------+
|  1 |        |
+----+---------+
1 row in set (0.00 sec)


So far so good. But then there's this crap:

mysql> select * from math where symbols = '';
+----+---------+
| id | symbols |
+----+---------+
|  1 |        |
+----+---------+
1 row in set (0.00 sec)


and

mysql> INSERT INTO `math` VALUES (2,'');
ERROR 1062 (23000): Duplicate entry '?' for key 'symbols'


(The string literal above has U+1D543 MATHEMATICAL DOUBLE-STRUCK CAPITAL L. Note that MySQL's error message has a ?, but the U+1D542 in the results of the SELECT above does dispaly correctly for me, so there don't seem to be encoding issues as far as IO with the server.)

(Code above updated; it originally had 1 for the primary key, which fails for obvious reasons.)

Screenshot for those with font issues:

So, MySQL thinks these two characters are the same? I know it case-folds, but this isn't a matter of casing.

Needless to say, I didn't even realize I had this problem till it came up in production, because the real-world data involved rarely differs on only these characters. However, this is totally unacceptable collation behavior.

Switching to the binary collation does fix it, however I'm using Django to access the database

Solution

From 10.1.14.1 Unicode Character Sets in the MySQL 5.5 Reference Manual (emphasis added):


For supplementary characters in general collations, the weight is the weight for 0xfffd REPLACEMENT CHARACTER. For supplementary characters in UCA collations, their collating weight is 0xfffd. That is, to MySQL, all supplementary characters are equal to each other, and greater than almost all BMP characters.

and:


The current rule that all supplementary characters are equal to each other is nonoptimal but is not expected to cause trouble. These characters are very rare, so it will be very rare that a multi-character string consists entirely of supplementary characters.

So the answer to your question appears to be "no".

finally:


If you really want rows sorted by MySQL's rule and secondarily by code
point value, it is easy:


ORDER BY s1 COLLATE utf32_unicode_ci, s1 COLLATE utf32_bin

Though it is not clear to me how this should be applied for comparisons.

The documentation extracts quoted above are unchanged for MySQL 5.7.

Context

StackExchange Database Administrators Q#88969, answer score: 4

Revisions (0)

No revisions yet.