HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlModerate

Why not use two columns instead of collation for case insensitive data?

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
casewhycolumnsinsensitivedatainsteadtwoforcollationuse

Problem

I need to store emails (like 'ivan@email.com') in Postgres 15, be able to search them in case-insensitive manner ('iVaN@email.com', 'IVAN@email.com', etc are the same), and be able to retrieve the original email to use it for actual sending of emails.

The suggested approach for handling case-insensitive data is to use collations:
DROP TABLE IF EXISTS test_collation;
DROP COLLATION IF EXISTS case_insensitive;

CREATE COLLATION case_insensitive (PROVIDER = icu, LOCALE = '@colStrength=secondary', DETERMINISTIC = FALSE);

CREATE TABLE test_collation
(
original_email TEXT COLLATE case_insensitive NOT NULL UNIQUE PRIMARY KEY
);

INSERT INTO test_collation (original_email)
VALUES ('IVAN@email.com');

-- This entry fails as expected as a duplicate:
INSERT INTO test_collation (original_email)
VALUES ('ivAn@email.com');

-- Getting the original email provided by the user regardless of case in which it is entered:
SELECT original_email
FROM test_collation
WHERE original_email = 'iVaN@EmaIl.com';



So far so good, but the documentation states that this approach has its drawbacks in terms of performance. This makes sense as it is challenging to make indexes on non-deterministic data.

This got me thinking: what if we use two columns, one will store the original email, and the second one will store the lower-cased email. To make it fool-proof, the lower-cased email will be a generated column (so it cannot be manually changed), and a primary key at the same time to avoid duplication. Searching on the lower-cased email column can be very efficient as it is deterministic in nature and can use B-trees. Example:
`DROP TABLE IF EXISTS test_two_columns;
CREATE TABLE test_two_columns
(
original_email TEXT NOT NULL UNIQUE,
lowered_email TEXT NOT NULL UNIQUE PRIMARY KEY GENERATED ALWAYS AS ( LOWER(original_email) ) STORED
);

INSERT INTO test_two_columns (original_email)
VALUES ('IVAN@email.com');

-- This entry fails as expected as a duplicate:
INSERT INTO test_two_column

Solution

Third option:

CREATE TABLE test
(
    original_email TEXT NOT NULL
);

CREATE UNIQUE INDEX test_email_uniq on test(lower(original_email));

INSERT INTO test (original_email)
VALUES ('IVAN@email.com');

-- This entry fails as expected as a duplicate:
INSERT INTO test (original_email)
VALUES ('ivAn@email.com');

-- Getting the original email provided by the user regardless of case in which it is entered:
SELECT original_email
FROM test
WHERE LOWER(original_email) = LOWER('iVaN@EmaIl.com');


In this case, we do not store an additional field in the table, only the unique index itself. The condition WHERE LOWER(original_email) = lower(?) will use this unique index to speed up this query.

Code Snippets

CREATE TABLE test
(
    original_email TEXT NOT NULL
);

CREATE UNIQUE INDEX test_email_uniq on test(lower(original_email));

INSERT INTO test (original_email)
VALUES ('IVAN@email.com');

-- This entry fails as expected as a duplicate:
INSERT INTO test (original_email)
VALUES ('ivAn@email.com');

-- Getting the original email provided by the user regardless of case in which it is entered:
SELECT original_email
FROM test
WHERE LOWER(original_email) = LOWER('iVaN@EmaIl.com');

Context

StackExchange Database Administrators Q#324960, answer score: 11

Revisions (0)

No revisions yet.