gotchasqlMinor
What is the difference between different utf8mb4 binary collations?
Viewed 0 times
thewhatcollationsdifferencedifferentbetweenbinaryutf8mb4
Problem
What is the difference between
utf8mb4_0900_bin vs utf8mb4_bin binary collations?Solution
There are three differences as far as I can tell (according to their documentation):
-
Case-mappings (for
https://dev.mysql.com/doc/refman/8.0/en/charset-unicode-sets.html#charset-unicode-sets-uca
The LOWER() and UPPER() functions perform case folding according to the collation of their argument.
The difference between the two collations in this context is that the
-
Padding vs No Padding (of trailing spaces):
https://dev.mysql.com/doc/refman/8.0/en/charset-unicode-sets.html#charset-unicode-sets-pad-attributes
The pad attribute for
Essentially,
-
Sorting (performance only, not the ordering):
https://dev.mysql.com/doc/refman/8.0/en/charset-unicode-sets.html#charset-unicode-sets-collating-weights
-
For
-
For
Translating that into human, they are saying that for a code point such as U+FF9D,
-
Case-mappings (for
LOWER() / UPPER() functions):https://dev.mysql.com/doc/refman/8.0/en/charset-unicode-sets.html#charset-unicode-sets-uca
The LOWER() and UPPER() functions perform case folding according to the collation of their argument.
The difference between the two collations in this context is that the
_0900_ version, being based on a newer version of Unicode, quite likely has more mapping definitions (and possibly even some corrections).-
Padding vs No Padding (of trailing spaces):
https://dev.mysql.com/doc/refman/8.0/en/charset-unicode-sets.html#charset-unicode-sets-pad-attributes
The pad attribute for
utf8mb4_bin is PAD SPACE, whereas for utf8mb4_0900_bin it is NO PAD. Consequently, operations involving utf8mb4_0900_bin do not add trailing spaces, and comparisons involving strings with trailing spaces may differ for the two collationsEssentially,
utf8mb4_bin ignores trailing spaces while utf8mb4_0900_bin does not ignore them. See the documentation (linked above) for an example.-
Sorting (performance only, not the ordering):
https://dev.mysql.com/doc/refman/8.0/en/charset-unicode-sets.html#charset-unicode-sets-collating-weights
-
For
_bin collations except utf8mb4_0900_bin, the weight is based on the code point, possibly with leading zero bytes added.-
For
utf8mb4_0900_bin, the weight is the utf8mb4 encoding bytes. The sort order is the same as for utf8mb4_bin, but much faster.Translating that into human, they are saying that for a code point such as U+FF9D,
utf8mb4_bin will see the UTF-8 encoded byte sequence of EF BE 9D and convert that into 00 FF 9D. But, utf8mb4_0900_bin will not convert it into the code point value. This is due to the UTF-8 byte sequence already being sequential, hence the ordering is the same as it for the code point values. So why bother with that extra conversion step?Context
StackExchange Database Administrators Q#278010, answer score: 7
Revisions (0)
No revisions yet.