snippetsqlMajor
How to choose a collation for international database?
Viewed 0 times
howchoosedatabaseinternationalforcollation
Problem
I'm designing a database which will store data in different languages (using UTF-8), so I think the best way to display the query's results is ordering it according to the user's language during the query itself (because there are more than one correct ways to do that), as follows:
Assuming this is the correct way to work with international data, which is the best collation for the database itself? PostgreSQL documentation says:
The C and POSIX collations both specify "traditional C" behavior, in which only the ASCII letters "A" through "Z" are treated as letters, and sorting is done strictly by character code byte values.
I think this is the best choice in this case, or am I wrong?
(Bonus question: is it too slow to select the collation in the query itself?).
SELECT a < b COLLATE "de_DE" FROM test1;Assuming this is the correct way to work with international data, which is the best collation for the database itself? PostgreSQL documentation says:
The C and POSIX collations both specify "traditional C" behavior, in which only the ASCII letters "A" through "Z" are treated as letters, and sorting is done strictly by character code byte values.
I think this is the best choice in this case, or am I wrong?
(Bonus question: is it too slow to select the collation in the query itself?).
Solution
The
Everything is a bit faster without locale. And since no collation is right anyway, create the database without collation, meaning with
It may be a pain to have to provide a collation for many operations. There shouldn't be a noticeable difference in speed between the default collation and an ad-hoc collation, though. After all it's just unsorted data, and collation rules are applied when sorting.
Be aware that Postgres builds on the locale settings provided by the underlying OS, so you need to have locales generated for each locale to be used. More in related answer on SO here and here.
However, as @Craig already mentioned, indexes are the bottleneck in this scenario. The collation of the index has to match the collation of the applied operator in many cases that involve character data.
You can use the
For example, a table with international strings:
And you are mostly interested in one language at a time:
Then create partial indexes like:
One for each language you need.
Actually, inheritance might be a superior approach for a table like this. Then you can have a plain index on each inherited table containing only strings for a single locale. You need to be comfortable with the special rules for inherited tables, of course.
C collation is the right choice.Everything is a bit faster without locale. And since no collation is right anyway, create the database without collation, meaning with
C.It may be a pain to have to provide a collation for many operations. There shouldn't be a noticeable difference in speed between the default collation and an ad-hoc collation, though. After all it's just unsorted data, and collation rules are applied when sorting.
Be aware that Postgres builds on the locale settings provided by the underlying OS, so you need to have locales generated for each locale to be used. More in related answer on SO here and here.
However, as @Craig already mentioned, indexes are the bottleneck in this scenario. The collation of the index has to match the collation of the applied operator in many cases that involve character data.
You can use the
COLLATE specifier in indexes to produce matching indexes. Partial indexes may be the perfect choice if you are mixing data in the same table.For example, a table with international strings:
CREATE TABLE string (
string_id serial
,lang_id int NOT NULL
,string text NOT NULL
);And you are mostly interested in one language at a time:
SELECT *
FROM string
WHERE lang_id = 5 -- 5 being German / Germany here
AND string > 'foo' COLLATE "de_DE"
ORDER BY string COLLATE "de_DE";Then create partial indexes like:
CREATE INDEX string_string_lang_id_idx ON string (string COLLATE "de_DE")
WHERE lang_id = 5;One for each language you need.
Actually, inheritance might be a superior approach for a table like this. Then you can have a plain index on each inherited table containing only strings for a single locale. You need to be comfortable with the special rules for inherited tables, of course.
Code Snippets
CREATE TABLE string (
string_id serial
,lang_id int NOT NULL
,string text NOT NULL
);SELECT *
FROM string
WHERE lang_id = 5 -- 5 being German / Germany here
AND string > 'foo' COLLATE "de_DE"
ORDER BY string COLLATE "de_DE";CREATE INDEX string_string_lang_id_idx ON string (string COLLATE "de_DE")
WHERE lang_id = 5;Context
StackExchange Database Administrators Q#46408, answer score: 32
Revisions (0)
No revisions yet.