HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

Are utf8_bin and utf8_general_cs the same?

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
sametheutf8_general_csareutf8_binand

Problem

Here's the situation:

  • I have come across several posts for "utf8_bin vs utf8_general_ci".



  • Similarly, "utf8_general_cs vs utf8_general_ci".



But I can't find a post for "utf8_bin vs utf8_general_cs". So, are they the same? If not, what's the difference between them?

Note: I have checked that utf8_general_cs is not available in default MySQL. May I know why?

Solution

The notion that binary collations are the same as case-sensitive collations is, unfortunately, an extremely common one.

However, they are very much not functionally equivalent. There are four areas where behavioral differences can be seen (at least four that I am aware of):

-
Combining Characters

Consider having a lower-case ü ("u" with diaeresis) and an upper-case Ü ("U" with diaeresis). Both types of collations will be able to distinguish them.

Now, consider having an upper-case U and a ̈ (combining diaeresis). When the combining character is placed after the non-accented U, you get . Visually it is the same as the single Ü ("U" with diaeresis). And a case-sensitive (and even accent-sensitive) collation will seem them as being the same, even though one is a single code point and the other is a combination of two code points. But a binary collation cannot compare them as being equal since they are not the same code points (or even the same number of code points).

-
Fullwidth Characters

A collation that is case-sensitive but width-insensitive would be able to equate =o= and =o=. However, a binary collation cannot seem them as equal since they are different code points.

-
Accent Insensitivity

A collation that is case-sensitive but accent-insensitive would be able to equate o and ô. However, a binary collation cannot seem them as equal since they are different code points.

-
Sorting

A case-sensitive collation will sort ~ before a and A after it. But, a binary collation will sort those same characters as: A then a then ~.

All of this is documented in the following post of mine:

No, Binary Collations are not Case-Sensitive

It is presented in terms of Microsoft SQL Server, but the behavior is based on rules defined by the Unicode standard, which should be the same across RDBMSs, languages, OSs, etc (with minor variation due to which version of the Unicode Standard is being used, and who did the implementation since Unicode is only a standard and has been implemented with slight variation across vendors).


I have checked that utf8_general_cs is not available in default MySQL. May I know why?

My guess is that the "general" collations are obsolete and have been superseded by newer "unicode" and culture-specific collations. The documentation (half-way down the page, paragraph starting with "For any Unicode character set,") even states:


utf8_general_ci is a legacy collation that does not support expansions, contractions, or ignorable characters. It can make only one-to-one comparisons between characters.

The "unicode" collations are probably the default sort weights and collation rules. The culture-specific collations tailor the weights and rules to be correct for that culture (when the default is incorrect). For more info on why there are different collations, please see my answer to the following S.O. question:

What is the point of COLLATIONS for nvarchar (Unicode) columns?

Context

StackExchange Database Administrators Q#217464, answer score: 6

Revisions (0)

No revisions yet.