What is the SQL Server equivalent of MySQL's utf8mb4 character set?
Problem
We have some web applications based on databases using `utf8mb4` as character set and `utf8mb4_Standard` as collation. We saw that we can use any character we want with this setting.

In SQL Server Express the situation is not very clear to me. When I switch to "Standard" it chooses the `Latin1_General_CI_AS` collation. But I do not know which character encoding this is, and how it would affect the scenario if we would like to take over some data from `utf8mb4` MySQL tables into SQL Server.

When I look into the datatype definitions in SQL Server, I can see that there are Unicode and non-Unicode types. So I am wondering whether the collation actually has an impact on how the data is stored.

It seems that if you use `nchar`, `nvarchar` or `nvarchar(max)` you are on the safe side, as you use UTF-16. But what does the collation `Latin1_General_CI_AS` mean? Especially, how would this behave if you had Chinese characters, for example?
Solution
First: the particular "edition" of SQL Server (i.e. Express, Standard, Enterprise, etc) does not matter. All editions for a particular version will behave the same.
> When I switch to "Standard" it chooses the `Latin1_General_CI_AS` collation.

Well, that is better than a collation starting with `SQL_`, but still not ideal. If you are using SQL Server 2008 or newer, then you should be using a version 100 collation (or a version 140 collation IF using SQL Server 2017 or newer and specifying a Japanese collation). AND, if using SQL Server 2012 or newer, then you should be using a collation that supports supplementary characters, which means your choices are:

- version 100 collations with names ending in `_SC`, or
- version 140 collations (only Japanese collations have a version 140 set, but none of these end in `_SC`, since supplementary character support is built into all version 140 collations)
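To see which collations fall into these two groups, one can query the catalog function `sys.fn_helpcollations()`; the filter patterns below are just one illustrative choice:

```sql
-- Version 100 collations with supplementary character support (_SC suffix),
-- plus the version 140 (SQL Server 2017+) Japanese collations.
SELECT [name], [description]
FROM sys.fn_helpcollations()
WHERE [name] LIKE N'Latin1_General_100_%_SC'
   OR [name] LIKE N'Japanese_XJIS_140%';
```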
In your case, most likely you want: `Latin1_General_100_CI_AI_SC`

Technically speaking, the closest equivalent to `utf8mb4_general_ci` (there is no `utf8mb4_Standard`, and your screen shot even shows `utf8mb4_general_ci`) is `Latin1_General_CI_AI`. Reason being:

- The `utf8mb4` character set allows you to store supplementary characters (which `NVARCHAR` also does, regardless of collation),
- the `general` part of the MySQL collation means that supplementary characters all have the same weight. This pre-level-100 SQL Server collation is similar in that all supplementary characters have the same weight; it's just that they have no weight at all.
- the `ci` in the MySQL collation implies `ai`, since `as` isn't specified.

Still, you want to stick with: `Latin1_General_100_CI_AI_SC`.

> I do not know which character encoding this is and how it would affect the scenario if we would like to take over some data from
> `utf8mb4` MySQL tables into SQL Server.

The encoding is determined by a combination of the datatype and the collation:

- `NVARCHAR` (and `NCHAR` / `NTEXT`) is always UTF-16 LE (Little Endian).
- `VARCHAR` (and `CHAR` / `TEXT`) is always an 8-bit encoding, the specific encoding being determined by the code page associated with the collation being used.
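A small sketch of that difference, assuming a database whose default collation uses code page 1252 (e.g. a `Latin1_General` collation):

```sql
-- NVARCHAR is always UTF-16 LE: 2 bytes per BMP character, 4 per supplementary one.
SELECT DATALENGTH(N'中文')           AS nvarchar_bytes,  -- 4: two BMP characters
       DATALENGTH(N'😀')            AS surrogate_bytes, -- 4: one surrogate pair
       CONVERT(VARCHAR(10), N'中文') AS lossy;           -- '??': code page 1252 cannot encode Chinese
```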
That said, it shouldn't matter what the source encoding is, as long as the destination encoding can handle all of the incoming characters (and behaves in a similar manner, of course, which is where the culture and sensitivities come in). Assuming you will be storing everything in `NVARCHAR` (maybe occasionally `NCHAR`, but never `NTEXT`, as that has been deprecated since SQL Server 2005), the data transfer tool will handle any necessary conversions.

> What does the collation `Latin1_General_CI_AS` mean?

It means:
- Because the name does not start with `SQL_`, this is a Windows collation, not a SQL Server collation. (This is a good thing: the SQL Server collations, those starting with `SQL_`, are mainly for pre-SQL Server 2000 compatibility, although quite unfortunately `SQL_Latin1_General_CP1_CI_AS` is very common due to it being the default when installing on an OS that uses US English as its language.)
- `Latin1_General` is the culture / locale.
  - For `NVARCHAR` data this determines the linguistic rules used for sorting and comparison.
  - For `VARCHAR` data this determines:
    - the linguistic rules used for sorting and comparison, and
    - the code page used to encode the characters. For example, `Latin1_General` collations use code page 1252, `Hebrew` collations use code page 1255, and so on.
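That code-page mapping can be checked directly with `COLLATIONPROPERTY` (the collation names here are illustrative; any collation from the same family reports the same code page):

```sql
-- Code page behind a collation: 1252 for Latin1_General, 1255 for Hebrew.
SELECT COLLATIONPROPERTY(N'Latin1_General_100_CI_AI', 'CodePage') AS latin1_code_page,
       COLLATIONPROPERTY(N'Hebrew_CI_AS', 'CodePage')             AS hebrew_code_page;
```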
- `{version}`, while not present in this collation name, refers to the SQL Server version in which the collation was introduced (for the most part). Windows collations with no version number in the name are version 80 (meaning SQL Server 2000, as that is version 8.0). Not all versions of SQL Server come with new collations, so there are gaps in the version numbers. There are some that are 90 (for SQL Server 2005, which is version 9.0), most are 100 (for SQL Server 2008, version 10.0), and a small set has 140 (for SQL Server 2017, version 14.0). I said "for the most part" because the collations ending in `_SC` were introduced in SQL Server 2012 (version 11.0), but the underlying data wasn't new; they merely added support for supplementary characters in the built-in functions. So, those endings exist for version 90 and 100 collations, but only starting in SQL Server 2012.
- Next you have the sensitivities, which can appear in any combination of the following, but are always specified in this order:
  - `CS` = case-sensitive or `CI` = case-insensitive
  - `AS` = accent-sensitive or `AI` = accent-insensitive
  - `KS` = Kana type-sensitive or missing = Kana type-insensitive
  - `WS` = width-sensitive or missing = width-insensitive
  - `VSS` = variation selector-sensitive (only available in the version 140 collations) or missing = variation selector-insensitive
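How the case and accent flags play out can be sketched with explicit `COLLATE` clauses (a toy comparison, not tied to any particular database default):

```sql
-- CI makes 'resume' and 'RESUME' compare equal; AI additionally ignores accents,
-- while AS keeps 'resume' and 'résumé' distinct.
SELECT CASE WHEN N'resume' = N'RESUME' COLLATE Latin1_General_100_CI_AI_SC
            THEN 'equal' ELSE 'different' END AS case_insensitive,    -- equal
       CASE WHEN N'resume' = N'résumé' COLLATE Latin1_General_100_CI_AI_SC
            THEN 'equal' ELSE 'different' END AS accent_insensitive,  -- equal
       CASE WHEN N'resume' = N'résumé' COLLATE Latin1_General_100_CI_AS_SC
            THEN 'equal' ELSE 'different' END AS accent_sensitive;    -- different
```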
- Optional last piece: `_SC` at the end means "Supplementary Character support". The "support" only affects how the built-in functions process supplementary characters (surrogate pairs); the characters themselves can be stored in `NVARCHAR` regardless of collation.
Context
StackExchange Database Administrators Q#72566, answer score: 6