HiveBrain v1.2.0
Get Started
← Back to all entries
patternMajor

Does any DBMS have a collation that is both case-sensitive and accent-insensitive?

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
caseinsensitiveanybothaccentthatdbmsdoescollationand

Problem

Note this question is vendor/version agnostic

It seems to me, as a speaker (typist, writer) of English, reasonable to expect words to be properly cased but not necessarily have the correct accents going in the right direction:


as I mused in a tete-a-tete with Chloe the maitre d'hotel at the Champs-Elysees restaurant, while waiting for the garcon to fetch my sauted jalapeno pate...

You get the idea with that.

So today I thought I wanted a search condition to use a case-sensitive but accent-insensitive collation but couldn't find one. Is there a good reason for this or is mine merely a rare use case?

Here's an example of some documentation I was looking at (though thinking vendor/version agnostic):

SQL Server Collation Name (SQL Server 2008 R2)

Solution

TL;DR

There is no such thing as a "vendor-agnostic" view of Collations, nor even "version-agnostic", since their implementations -- including which aspects can be made insensitive and their naming conventions -- are vendor-specific and change over time.

Here is a summary of what I have found, and the details are in the longer section below the line:

RDBMS Naming- Combinations Case-Sensitive and
convention of options? Accent-Insensitive support?
------- ------------ ------------- -----
SQL Server _CS, _AI, etc Yes Latin1_General_100_CS_AI

DB2 _E{x}, _S{y}, etc Yes CLDR181_EO_S1

PostgreSQL locale: en_US N/A unaccent(), not via Collation

MySQL _cs, maybe _ai No No: _cs implies _as & _ci implies _ai
Yes? Create your own Collation :-)

Oracle only _CI & _AI No No: _AI always implies _CI

SAP ASE arbitrary: turdict N/A No: "AI" always implies "CI"

Informix locale.codepage N/A No: no "AI" via Collations


As you can see in the chart, two of the seven RDBMSs do natively support "Case-sensitive and Accent-insensitive" operations via Collations, though they have different naming conventions (and several other functional differences).

One RDBMS -- PostgreSQL -- doesn't natively support this combination, but you can still achieve it by stripping off the accents with the unaccent() add-on function.

The last four RDBMSs, two of which have a similar naming convention for the options, neither natively support this combination nor does there appear to be a means of accomplishing this without writing your own function to remove the accents / diacritical marks. MySQL allows for creating your own Collations, but that requires that you then add it to source control and incorporate it into your testing & deployment process so that it can be applied to all servers in all environments (but still a very cool and flexible option). SAP ASE mentions that SAP can supply additional Unicode sort orders, but no mention of what they might be willing to supply.

With regards to:


Is there a good reason for this or is mine merely a rare use case?

I can say that in doing the research for this answer I came across a lot of instances of people wanting case-insensitive and accent-sensitive for MySQL, but few, if any, asking for your desired combination.


I wanted a search condition to use a case-sensitive but accent-insensitive collation but couldn't find one.

...

this question is vendor/version agnostic

You were unsuccessful in your search because it doesn't really make sense to look for an RDBMS based on a Collation specification. That's just not how Collations work. And while you are wanting to approach this as vendor-agnostic, reality is that Collations -- at least the part that we interact with -- are very much vendor-specific, and don't always fit into the scheme that you were searching on.

String comparison and sorting is highly complex, and there are different ways of performing these rules. One method is to have mappings that take into account one or more rules. Hence the four combinations of Sensitive and Insensitive for Case and Accents would equate to four separate mappings. For example, you saw this on that MSDN page for SQL Server Collation Name. If you scroll down, you will see that the left column of the chart is the Sort Order ID. Each Collation has a different ID: SQL_Latin1_General_Cp1_CI_AS = 52 while SQL_Latin1_General_Cp1_CS_AS = 51, even though the only difference is in the case-sensitivity.

Or, it can be rule-based, such as what Unicode offers through the Unicode Collation Algorithm (UCA). In this approach, every character is given, by default, one or more weights. Then, each culture / locale has the option to override any of those weights, or remove rules, or add rules. The algorithm takes into account any locale-specific rules, and then potentially manipulates those weights based on any options chosen (sensitivity, which case comes first when doing case-sensitive sorts, etc). This is one reason why doing Unicode sorting is a bit slower than non-Unicode sorting.

To get a sense of how many options there really are (i.e. the actual complexity), check out this demo from the ICU (International Components for Unicode) project:

ICU Collation Demo

There are 8 separate options to specify, and some of them get represented in multiple elements of the Collation name specification that you are thinking of (e.g. CS, CI, AS, AI, etc). Given how many variations there are, using the mapping file approach where each combination has its own ID would result in many thousands of files. Many of those files would need to get updated whenever there are changes in those particular languages, or when bugs are found. This is probably why there a

Code Snippets

SELECT *
FROM   sys.fn_helpcollations()
WHERE  [name] LIKE '%[_]CS[_]AI%';

Context

StackExchange Database Administrators Q#144181, answer score: 33

Revisions (0)

No revisions yet.