HiveBrain v1.2.0
Get Started
← Back to all entries
gotchasqlMinor

What is the difference between different utf8mb4 binary collations?

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
thewhatcollationsdifferencedifferentbetweenbinaryutf8mb4

Problem

What is the difference between utf8mb4_0900_bin vs utf8mb4_bin binary collations?

Solution

There are three differences as far as I can tell (according to their documentation):

-
Case-mappings (for LOWER() / UPPER() functions):

https://dev.mysql.com/doc/refman/8.0/en/charset-unicode-sets.html#charset-unicode-sets-uca

The LOWER() and UPPER() functions perform case folding according to the collation of their argument.

The difference between the two collations in this context is that the _0900_ version, being based on a newer version of Unicode, quite likely has more mapping definitions (and possibly even some corrections).

-
Padding vs No Padding (of trailing spaces):

https://dev.mysql.com/doc/refman/8.0/en/charset-unicode-sets.html#charset-unicode-sets-pad-attributes

The pad attribute for utf8mb4_bin is PAD SPACE, whereas for utf8mb4_0900_bin it is NO PAD. Consequently, operations involving utf8mb4_0900_bin do not add trailing spaces, and comparisons involving strings with trailing spaces may differ for the two collations

Essentially, utf8mb4_bin ignores trailing spaces while utf8mb4_0900_bin does not ignore them. See the documentation (linked above) for an example.

-
Sorting (performance only, not the ordering):

https://dev.mysql.com/doc/refman/8.0/en/charset-unicode-sets.html#charset-unicode-sets-collating-weights

-
For _bin collations except utf8mb4_0900_bin, the weight is based on the code point, possibly with leading zero bytes added.

-
For utf8mb4_0900_bin, the weight is the utf8mb4 encoding bytes. The sort order is the same as for utf8mb4_bin, but much faster.

Translating that into human, they are saying that for a code point such as U+FF9D, utf8mb4_bin will see the UTF-8 encoded byte sequence of EF BE 9D and convert that into 00 FF 9D. But, utf8mb4_0900_bin will not convert it into the code point value. This is due to the UTF-8 byte sequence already being sequential, hence the ordering is the same as it for the code point values. So why bother with that extra conversion step?

Context

StackExchange Database Administrators Q#278010, answer score: 7

Revisions (0)

No revisions yet.