patternsqlMajor
Which collation should I use for biblical Hebrew?
Viewed 0 times
biblicalhebrewforshouldwhichcollationuse
Problem
Which SQL Server collation should I use for biblical Hebrew? The database under consideration needs to accommodate diacritics (i.e., vowels, accents, trope, etc.).
Solution
First: There is no distinction, collation-wise, between biblical Hebrew and modern Hebrew. We are just dealing with Hebrew.
Second: Regardless of anything else, you want to use the newest set of collations, which are the
Third: There is no concept of "Kana" (or Kana-type) in Hebrew, so you can ignore any of the collation variations with
Forth: The collations ending in
Fifth: You don't want a binary collation (
For example (vowel and cantillation mark combining characters in opposite order):
Sixth: It depends on how you will be interacting with the string values. Hebrew does not have upper-case/lower-case letters, yet there are a few code points that are affected by case-sensitivity. There are even a few code points that are width-sensitive. Accent-sensitive / insensitive affects diacritics used for vowels, pronunciation, and cantillation marks (i.e. trope).
-
Do you need to distinguish between final and non-final forms of the same letter? There are five letters in Hebrew that look different when used as the final letter of a word. SQL Server handles this via case-sensitivity /
-
Do you need to distinguish between pronunciation marks, vowels, and cantillation marks? SQL Server handles this via accent-sensitivity /
Pronunciation marks
There are several letters that have two different sounds. Sometimes the only indicator for which sound to use is the context of what word the letter is in (and sometimes even the surrounding words), such as in the actual Torah (where there are no pronunciation marks or vowels). But, that same text in other forms, as well as other text, will have dots placed either inside the letter, or for the letter Shin, above the letter. The letter Shin can have either an "sh" or "s" sound. To indicate the "sh" sound (i.e. letter "shin"), there is a dot above on the right side, while a dot above on the left side denotes the "s" sound (letter "sin"):
`SELECT NCHAR(0x05E9) AS [Shin], -- ש
NCHAR(0x05E9) + NCHAR(0x05C1) AS [Shin + Shin Dot], -- שׁ
NCHAR(0x05E9) + NCHAR(0x05C2) AS [Shin + Sin Dot] -- שׂ
WHERE NCHAR(0x05E9) = NCHAR(0x05E9) + NCHAR(0x05C1) COLLATE Hebrew_CS_AI_KS_WS
AND NCHAR(0x05E9) = NCHAR(0x05E9) + NCHA
Second: Regardless of anything else, you want to use the newest set of collations, which are the
_100_ series as they have newer / more complete sort weights and linguistic rules than the older series with no version number in the name (technically the are version 80). Normally I highly recommend using the newest version of whatever collation you need, but at least in this one instance there is good reason to use the version with no version number in the name. The version 100 (or newer) collations are far more complete, and can distinguish between supplementary characters (or even fully support them if using an SC or 140 collation), but assuming that you are not dealing with supplementary characters, then the version 80 (no version number in name) collations do a better job of handling Hebrew (see item "Sixth" below).Third: There is no concept of "Kana" (or Kana-type) in Hebrew, so you can ignore any of the collation variations with
_KS in their names (as that is a sensitivity that you will never use).Forth: The collations ending in
_SC support supplementary characters (i.e. full UTF-16) so usually best to pick one of those, if available (meaning: if you are using SQL Server 2012 or newer).Fifth: You don't want a binary collation (
_BIN or _BIN2) as those can't distinguish between Hebrew letters with both vowels and cantillation marks that are the same but have the combining characters in different orders, nor can they ignore vowels and other marks to equate things like א and אֽ.For example (vowel and cantillation mark combining characters in opposite order):
SELECT NCHAR(0x05D0) + NCHAR(0x059C) + NCHAR(0x05B8),
NCHAR(0x05D0) + NCHAR(0x05B8) + NCHAR(0x059C)
WHERE NCHAR(0x05D0) + NCHAR(0x059C) + NCHAR(0x05B8) =
NCHAR(0x05D0) + NCHAR(0x05B8) + NCHAR(0x059C) COLLATE Hebrew_100_CS_AS_SC;
-- אָ֜ אָ֜
SELECT NCHAR(0x05D0) + NCHAR(0x059C) + NCHAR(0x05B8),
NCHAR(0x05D0) + NCHAR(0x05B8) + NCHAR(0x059C)
WHERE NCHAR(0x05D0) + NCHAR(0x059C) + NCHAR(0x05B8) =
NCHAR(0x05D0) + NCHAR(0x05B8) + NCHAR(0x059C) COLLATE Hebrew_100_BIN2;
-- no rows
Sixth: It depends on how you will be interacting with the string values. Hebrew does not have upper-case/lower-case letters, yet there are a few code points that are affected by case-sensitivity. There are even a few code points that are width-sensitive. Accent-sensitive / insensitive affects diacritics used for vowels, pronunciation, and cantillation marks (i.e. trope).
-
Do you need to distinguish between final and non-final forms of the same letter? There are five letters in Hebrew that look different when used as the final letter of a word. SQL Server handles this via case-sensitivity /
_CS collations (though, unfortunately, it appears to be broken in the newer, and usually better, version 100 and newer collations):SELECT NCHAR(0x05DE) AS [Mem],
NCHAR(0x05DD) AS [Final Mem]
WHERE NCHAR(0x05DE) = NCHAR(0x05DD) COLLATE Hebrew_CI_AS_KS_WS;
-- 1 row (expected; all sensitive except case)
-- Mem Final Mem
-- מ ם
SELECT NCHAR(0x05DE) AS [Mem],
NCHAR(0x05DD) AS [Final Mem]
WHERE NCHAR(0x05DE) = NCHAR(0x05DD) COLLATE Hebrew_CS_AI;
-- no rows (expected; all insensitive except case)
SELECT NCHAR(0x05DE) AS [Mem],
NCHAR(0x05DD) AS [Final Mem]
WHERE NCHAR(0x05DE) = NCHAR(0x05DD) COLLATE Hebrew_100_CI_AI;
-- no rows (expected 1 row; all insensitive)
-
Do you need to distinguish between pronunciation marks, vowels, and cantillation marks? SQL Server handles this via accent-sensitivity /
_AS collations (though, unfortunately, it appears to be broken in the newer, and usually better, version 100 and newer collations). Please note that all three of these are grouped together under accent-sensitivity and cannot be controlled separately (i.e. you can't do vowel sensitive but cantillation mark insensitive).Pronunciation marks
There are several letters that have two different sounds. Sometimes the only indicator for which sound to use is the context of what word the letter is in (and sometimes even the surrounding words), such as in the actual Torah (where there are no pronunciation marks or vowels). But, that same text in other forms, as well as other text, will have dots placed either inside the letter, or for the letter Shin, above the letter. The letter Shin can have either an "sh" or "s" sound. To indicate the "sh" sound (i.e. letter "shin"), there is a dot above on the right side, while a dot above on the left side denotes the "s" sound (letter "sin"):
`SELECT NCHAR(0x05E9) AS [Shin], -- ש
NCHAR(0x05E9) + NCHAR(0x05C1) AS [Shin + Shin Dot], -- שׁ
NCHAR(0x05E9) + NCHAR(0x05C2) AS [Shin + Sin Dot] -- שׂ
WHERE NCHAR(0x05E9) = NCHAR(0x05E9) + NCHAR(0x05C1) COLLATE Hebrew_CS_AI_KS_WS
AND NCHAR(0x05E9) = NCHAR(0x05E9) + NCHA
Context
StackExchange Database Administrators Q#250215, answer score: 35
Revisions (0)
No revisions yet.