SQL Server 2019 UTF-8 Support Benefits
Problem
I'm already quite comfortable with using COMPRESS() and DECOMPRESS() in an internal forum application for our company (currently on SQL Server 2017), but in trying to make the database as efficient as possible: is there an advantage to adding _UTF8 to my current collation, as in Latin1_General_100_CI_AS_SC_UTF8, upon a future migration to SQL Server 2019?
Solution
trying to make the database as efficient as possible
There are at least two different types of efficiency that are really at play here:
- space (disk and memory)
- speed
Under certain conditions (as described in Outman's answer, which is a copy/paste of the "Recommended Uses / Guidance" section of my blog post, linked at the top of that answer) you can save space, but that is entirely dependent on the type and per-row quantity of characters.
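Whether UTF-8 saves space is purely a function of which characters are stored. As a rough illustration (plain Python, not SQL Server itself; the sample strings are made up), compare the encoded sizes of the same text in UTF-8 versus UTF-16 LE, the encodings behind a VARCHAR column with a _UTF8 collation and an NVARCHAR column, respectively:

```python
# Byte sizes of the same strings under the two encodings SQL Server uses:
# VARCHAR with a _UTF8 collation stores UTF-8; NVARCHAR stores UTF-16 LE.
samples = ["Hello, world", "héllo", "日本語テキスト"]

for s in samples:
    utf8 = len(s.encode("utf-8"))
    utf16 = len(s.encode("utf-16-le"))
    print(f"{s!r}: UTF-8 = {utf8} bytes, UTF-16 LE = {utf16} bytes")

# 'Hello, world': UTF-8 = 12, UTF-16 LE = 24  (ASCII: UTF-8 halves the size)
# 'héllo':        UTF-8 =  6, UTF-16 LE = 10  (mostly ASCII: smaller savings)
# '日本語テキスト': UTF-8 = 21, UTF-16 LE = 14  (CJK: UTF-8 is 50% LARGER)
```

The per-row character mix decides the outcome: mostly-ASCII rows shrink, while rows dominated by characters above U+07FF grow.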
However, at least in its current implementation, you are more likely than not to see a decrease in speed. This could be due to how the UTF-8 data is handled internally. I know that when comparing UTF-8 data to non-UTF-8 VARCHAR data, both values are converted to UTF-16 LE (i.e. NVARCHAR). I wouldn't be surprised if other (perhaps even most) operations needed to convert the UTF-8 data into NVARCHAR, given that this is how Windows / SQL Server / .NET have always handled Unicode.

So, assuming that you have a scenario that could possibly benefit from using UTF-8, you need to choose which efficiency is more important.
Now, whether or not UTF-8 will benefit scenarios where the environment itself is naturally UTF-8 (e.g. Linux) remains to be seen. Typically the database driver (ODBC, SQL Native Client, etc) handles the translation between client and server. I suppose there could be a performance / efficiency gain here if doing this would result in the driver software skipping the additional steps (and CPU cycles) it takes to do those encoding translations. So far this is just a theory as I have not tested it.
Just keep in mind the following:
- UTF-8 was designed to achieve ASCII compatibility for easier implementation. This allows systems that are Standard ASCII-based (values 0 - 127; values 128 - 255 are Extended ASCII and not covered by this) to enable Unicode without having to re-save anything in a new encoding.
For SQL Server, the goal was that existing apps currently using VARCHAR can start supporting Unicode without needing to do much re-coding (i.e. adding N prefixes to string literals) or updating datatypes from VARCHAR to NVARCHAR.

It was not designed to be a form of compression. If you have data that has a reduced footprint in UTF-8, then great. But when working with data that is not Standard ASCII, you either won't have any savings or, even worse, you might increase the data size by going to UTF-8 (given that 63k of the 65k BMP characters are 3 bytes in UTF-8, which is 1 byte more than the 2 bytes they require in UTF-16).
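That "63k of the 65k" figure follows directly from the UTF-8 length rules: only code points below U+0800 fit in one or two bytes, and everything else in the BMP takes three. A quick sanity check of the arithmetic (plain Python, illustrative only):

```python
# UTF-8 byte length by code point range within the BMP:
#   U+0000-U+007F -> 1 byte, U+0080-U+07FF -> 2 bytes, U+0800-U+FFFF -> 3 bytes
# Only the first 0x0800 (2,048) code points need fewer than 3 bytes.
three_byte_count = 0x10000 - 0x0800
print(three_byte_count)  # 63488 code points at 3 bytes each, i.e. ~63k of ~65k

# Example: the Euro sign is 3 bytes in UTF-8 but only 2 in UTF-16 LE.
assert len("\u20ac".encode("utf-8")) == 3
assert len("\u20ac".encode("utf-16-le")) == 2
```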
And, if UTF-8 gives a performance gain, or at least you don't see a drop in performance, then great. But, don't expect it. In fact, don't be surprised if you happen to see a decrease in performance.
- If you decide to implement the UTF-8 collations in SQL Server, you need to be aware of some potential data "issues":
- Data loss from mixing UTF-8 string literals and/or variables (due to the current database having a UTF-8 default collation) and non-UTF-8 VARCHAR columns. This is caused by collation precedence effectively downgrading the collation from UTF-16 to whatever code page the column is using.
- Minor truncation from mixing non-UTF-8 string literals and/or variables with UTF-8 columns (and, in some cases, variables). This is caused by certain characters
requiring more bytes in UTF-8 than they did in their original encoding.
- Invalid byte sequences in UTF-8 can throw an error instead of returning the default replacement character "�". This is a different approach than has been taken so far with invalid sequences in any other 8-bit encoding or in UTF-16.
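The last two bullets can be reproduced outside SQL Server with plain Python (illustrative only; this shows the encoding mechanics, not SQL Server's actual error behavior): a character such as "é" grows from 1 byte under a Latin-1 code page to 2 bytes in UTF-8 (the truncation risk), and an invalid byte sequence either raises an error or becomes U+FFFD depending on the error policy:

```python
# Truncation risk: the same character can need more bytes in UTF-8
# than it did under the column's original code page.
assert len("é".encode("latin-1")) == 1
assert len("é".encode("utf-8")) == 2

# Invalid byte sequences: 0xFF can never occur in valid UTF-8.
bad = b"abc\xff"
try:
    bad.decode("utf-8")                       # strict decoding: raises an error
except UnicodeDecodeError as e:
    print("strict:", e.reason)
print(bad.decode("utf-8", errors="replace"))  # lenient: substitutes U+FFFD ("abc�")
```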
For more details and examples, please see the "Things to Keep in Mind: Operational" section of my post: Native UTF-8 Support in SQL Server 2019: Savior or False Prophet?
Context
StackExchange Database Administrators Q#221531, answer score: 6