patternsqlModerate
Why not use two columns instead of collation for case insensitive data?
Viewed 0 times
casewhycolumnsinsensitivedatainsteadtwoforcollationuse
Problem
I need to store emails (like 'ivan@email.com') in Postgres 15, be able to search them in case-insensitive manner ('iVaN@email.com', 'IVAN@email.com', etc are the same), and be able to retrieve the original email to use it for actual sending of emails.
The suggested approach for handling case-insensitive data is to use collations:
So far so good, but the documentation states that this approach has its drawbacks in terms of performance. This makes sense as it is challenging to make indexes on non-deterministic data.
This got me thinking: what if we use two columns, one will store the original email, and the second one will store the lower-cased email. To make it fool-proof, the lower-cased email will be a generated column (so it cannot be manually changed), and a primary key at the same time to avoid duplication. Searching on the lower-cased email column can be very efficient as it is deterministic in nature and can use B-trees. Example:
`DROP TABLE IF EXISTS test_two_columns;
CREATE TABLE test_two_columns
(
original_email TEXT NOT NULL UNIQUE,
lowered_email TEXT NOT NULL UNIQUE PRIMARY KEY GENERATED ALWAYS AS ( LOWER(original_email) ) STORED
);
INSERT INTO test_two_columns (original_email)
VALUES ('IVAN@email.com');
-- This entry fails as expected as a duplicate:
INSERT INTO test_two_column
The suggested approach for handling case-insensitive data is to use collations:
DROP TABLE IF EXISTS test_collation;
DROP COLLATION IF EXISTS case_insensitive;
CREATE COLLATION case_insensitive (PROVIDER = icu, LOCALE = '@colStrength=secondary', DETERMINISTIC = FALSE);
CREATE TABLE test_collation
(
original_email TEXT COLLATE case_insensitive NOT NULL UNIQUE PRIMARY KEY
);
INSERT INTO test_collation (original_email)
VALUES ('IVAN@email.com');
-- This entry fails as expected as a duplicate:
INSERT INTO test_collation (original_email)
VALUES ('ivAn@email.com');
-- Getting the original email provided by the user regardless of case in which it is entered:
SELECT original_email
FROM test_collation
WHERE original_email = 'iVaN@EmaIl.com';
So far so good, but the documentation states that this approach has its drawbacks in terms of performance. This makes sense as it is challenging to make indexes on non-deterministic data.
This got me thinking: what if we use two columns, one will store the original email, and the second one will store the lower-cased email. To make it fool-proof, the lower-cased email will be a generated column (so it cannot be manually changed), and a primary key at the same time to avoid duplication. Searching on the lower-cased email column can be very efficient as it is deterministic in nature and can use B-trees. Example:
`DROP TABLE IF EXISTS test_two_columns;
CREATE TABLE test_two_columns
(
original_email TEXT NOT NULL UNIQUE,
lowered_email TEXT NOT NULL UNIQUE PRIMARY KEY GENERATED ALWAYS AS ( LOWER(original_email) ) STORED
);
INSERT INTO test_two_columns (original_email)
VALUES ('IVAN@email.com');
-- This entry fails as expected as a duplicate:
INSERT INTO test_two_column
Solution
Third option:
In this case, we do not store an additional field in the table, only the unique index itself. The condition
CREATE TABLE test
(
original_email TEXT NOT NULL
);
CREATE UNIQUE INDEX test_email_uniq on test(lower(original_email));
INSERT INTO test (original_email)
VALUES ('IVAN@email.com');
-- This entry fails as expected as a duplicate:
INSERT INTO test (original_email)
VALUES ('ivAn@email.com');
-- Getting the original email provided by the user regardless of case in which it is entered:
SELECT original_email
FROM test
WHERE LOWER(original_email) = LOWER('iVaN@EmaIl.com');In this case, we do not store an additional field in the table, only the unique index itself. The condition
WHERE LOWER(original_email) = lower(?) will use this unique index to speed up this query.Code Snippets
CREATE TABLE test
(
original_email TEXT NOT NULL
);
CREATE UNIQUE INDEX test_email_uniq on test(lower(original_email));
INSERT INTO test (original_email)
VALUES ('IVAN@email.com');
-- This entry fails as expected as a duplicate:
INSERT INTO test (original_email)
VALUES ('ivAn@email.com');
-- Getting the original email provided by the user regardless of case in which it is entered:
SELECT original_email
FROM test
WHERE LOWER(original_email) = LOWER('iVaN@EmaIl.com');Context
StackExchange Database Administrators Q#324960, answer score: 11
Revisions (0)
No revisions yet.