snippetsqlMajor

How to choose a collation for international database?

Submitted by: @import:stackexchange-dba·Mar 10, 2026·

Viewed 0 times

howchoosedatabaseinternationalforcollation

Problem

I'm designing a database which will store data in different languages (using UTF-8), so I think the best way to display the query's results is ordering it according to the user's language during the query itself (because there are more than one correct ways to do that), as follows:

SELECT a < b COLLATE "de_DE" FROM test1;

Assuming this is the correct way to work with international data, which is the best collation for the database itself? PostgreSQL documentation says:

The C and POSIX collations both specify "traditional C" behavior, in which only the ASCII letters "A" through "Z" are treated as letters, and sorting is done strictly by character code byte values.

I think this is the best choice in this case, or am I wrong?

(Bonus question: is it too slow to select the collation in the query itself?).

Solution

The C collation is the right choice.

Everything is a bit faster without locale. And since no collation is right anyway, create the database without collation, meaning with C.

It may be a pain to have to provide a collation for many operations. There shouldn't be a noticeable difference in speed between the default collation and an ad-hoc collation, though. After all it's just unsorted data, and collation rules are applied when sorting.

Be aware that Postgres builds on the locale settings provided by the underlying OS, so you need to have locales generated for each locale to be used. More in related answer on SO here and here.

However, as @Craig already mentioned, indexes are the bottleneck in this scenario. The collation of the index has to match the collation of the applied operator in many cases that involve character data.

You can use the COLLATE specifier in indexes to produce matching indexes. Partial indexes may be the perfect choice if you are mixing data in the same table.

For example, a table with international strings:

CREATE TABLE string (
   string_id serial
  ,lang_id   int NOT NULL
  ,string    text NOT NULL
);

And you are mostly interested in one language at a time:

SELECT *
FROM   string
WHERE  lang_id = 5  -- 5 being German / Germany here
AND    string > 'foo' COLLATE "de_DE"
ORDER  BY string COLLATE "de_DE";

Then create partial indexes like:

CREATE INDEX string_string_lang_id_idx ON string (string COLLATE "de_DE")
WHERE lang_id = 5;

One for each language you need.

Actually, inheritance might be a superior approach for a table like this. Then you can have a plain index on each inherited table containing only strings for a single locale. You need to be comfortable with the special rules for inherited tables, of course.

Code Snippets

CREATE TABLE string (
   string_id serial
  ,lang_id   int NOT NULL
  ,string    text NOT NULL
);

SELECT *
FROM   string
WHERE  lang_id = 5  -- 5 being German / Germany here
AND    string > 'foo' COLLATE "de_DE"
ORDER  BY string COLLATE "de_DE";

CREATE INDEX string_string_lang_id_idx ON string (string COLLATE "de_DE")
WHERE lang_id = 5;

Context

StackExchange Database Administrators Q#46408, answer score: 32

Revisions (0)

No revisions yet.