HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

Numeric collation sorts by digits instead of value- postgres

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
postgresnumericdigitsvaluesortsinsteadcollation

Problem

I have a table of users with ids in the following form user123@domain.com.
When searching and sorting the users, I need user1@domian.com to be before user14@domain.com but since 4 is “smaller” than @ it sorts the other way around.

After looking around I came across the following collation:

CREATE COLLATION numeric (provider = icu, locale = 'en-u-kn-true');


Which fixed the problem, but caused another one, users with numbers that started with 2 appeared after users that started with 1

For example, user19@domain.com came before user2@domain.com

From the docs, it says that kn-true and looks at the number’s value but it looks like it only compares the digits.

I'm using Postgres 11.6. I have two servers: one on Windows (local) and a remote one on Linux.

Is anyone familiar with this issue?

Solution

This is a potential alternative solution to your problem. Rather than creating a collation, you could do following (all of the code below is available on the fiddle here):

CREATE TABLE test
(
  e_id  INT NOT NULL GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  email TEXT NOT NULL
);


Then populate it (out of order - i.e. we're not relying on INSERT order or physical order on disk of either the heap or the PRIMARY KEY):

INSERT INTO test (email) VALUES
('user34@x.com'),   
('user123@x.com'),  -- Note: deliberately out of order!
('user1@x.com'),
('user23343@x.com'),
('user3@x.com'),
('user7@x.com'),
('user21@x.com'),
('user223@x.com'),


and then we check sorting by email (SELECT * FROM test ORDER BY email;) - result (out of order both by email and PRIMARY KEY):

test
(2,user123@x.com)
(3,user1@x.com)
(7,user21@x.com)
...
... snipped for brevity - see fiddle
...


So, now we combine a couple of functions to obtain the desired sort order. We will use the SUBSTRING() (with a regular expression1 - also see the manual) and the SPLIT_PART() (manual), then cast the result of this function combination to an INTEGER using PostgreSQL's double-colon (::) cast operator and sort by that INTEGER.

  1. An excellent introduction to regexes



We can see what's happening in the following query:

SELECT 
  e_id, email,
  SUBSTRING(email FROM '\d.*'),
  SPLIT_PART(SUBSTRING(email FROM '\d.*'), '@', 1)::INT AS num_split_part
FROM test
ORDER BY num_split_part;


Result:

e_id    email            substring  num_split_part
   3  user1@x.com          1@x.com               1
   5  user3@x.com          3@x.com               3
   6  user7@x.com          7@x.com               7
   7  user21@x.com        21@x.com              21
   1  user34@x.com        34@x.com              34
   2  user123@x.com      123@x.com             123
   8  user223@x.com      223@x.com             223
   4  user23343@x.com  23343@x.com           23343


Et voilà - the emails are sorted in the desired order!

We can add an enhancement to this by using an expression index (aka a functional index - see the manual) as follows.

First, we'll take a look at the EXPLAIN plan without the index:

EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
SELECT 
  email
FROM test
ORDER BY num_split_part;


Result:

QUERY PLAN
Sort  (cost=10000000104.05..10000000107.22 rows=1270 width=72) (actual time=0.048..0.049 rows=8 loops=1)
  Output: e_id, email, ("substring"(email, '\d.*'::text)), ((split_part("substring"(email, '\d.*'::text), '@'::text, 1))::integer)
  Sort Key: ((split_part("substring"(test.email, '\d.*'::text), '@'::text, 1))::integer)
  Sort Method: quicksort  Memory: 25kB
  Buffers: shared hit=1
  ->  Seq Scan on public.test  (cost=10000000000.00..10000000038.58 rows=1270 width=72) (actual time=0.015..0.034 rows=8 loops=1)
        Output: e_id, email, "substring"(email, '\d.*'::text), (split_part("substring"(email, '\d.*'::text), '@'::text, 1))::integer
        Buffers: shared hit=1
Planning Time: 0.042 ms
Execution Time: 0.074 ms


Note the Seq Scan on public.test which means that every record of the table has to be traversed before a result can be derived - even though I did SET enable_seqscan = OFF;.

A brief diversion about SET enable_seqscan = OFF;.

-
This doesn't actually disable sequential table scans, it just makes them very expensive - see discussion below.

-
Do not do this on production systems, or at least don't do it globally. You could, if and only if you fully understand any consequences, do it on a case-by-case, query-by-query basis, but it's not to be recommended. Today's query hints are tomorrow's bugs - use with caution.

The reason I'm doing it here is to force the optimiser to choose the index over a sequential scan. Without enable_seqscan = OFF, the very small sample tables here would cause the optimiser to automatically choose a sequential scan. With a large number of records on a production system, this should not be a problem.

From the documentation here:

enable_seqscan (boolean)

Enables or disables the query planner's use of sequential scan plan
types. It is impossible to suppress sequential scans entirely, but
turning this variable off discourages the planner from using one if
there are other methods available. The default is on.

We create the expression index as follows:

CREATE INDEX func_idx ON test 
((
  SPLIT_PART(SUBSTRING(email FROM '\d.*'), '@', 1)::INT 
));  -- the use of ::INT obliges us to have two opening and closin brackets!


and rerun the EXPLAIN (ANALYZE, BUFFERS, VERBOSE) (see the fiddle) and we get this:

QUERY PLAN
Index Scan using func_idx on public.test  (cost=0.13..12.35 rows=8 width=72) (actual time=0.037..0.057 rows=8 loops=1)
  Output: e_id, email, "substring"(email, '\d.*'::text), (split_part("substring"(email, '\d.*'::text), '@'::text, 1))::integer
  Buffers: shared hit=1 read=1
Planning Time: 0.225 ms
Execution Time: 0.078 ms


Now,

Code Snippets

CREATE TABLE test
(
  e_id  INT NOT NULL GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  email TEXT NOT NULL
);
INSERT INTO test (email) VALUES
('user34@x.com'),   
('user123@x.com'),  -- Note: deliberately out of order!
('user1@x.com'),
('user23343@x.com'),
('user3@x.com'),
('user7@x.com'),
('user21@x.com'),
('user223@x.com'),
test
(2,user123@x.com)
(3,user1@x.com)
(7,user21@x.com)
...
... snipped for brevity - see fiddle
...
SELECT 
  e_id, email,
  SUBSTRING(email FROM '\d.*'),
  SPLIT_PART(SUBSTRING(email FROM '\d.*'), '@', 1)::INT AS num_split_part
FROM test
ORDER BY num_split_part;
e_id    email            substring  num_split_part
   3  user1@x.com          1@x.com               1
   5  user3@x.com          3@x.com               3
   6  user7@x.com          7@x.com               7
   7  user21@x.com        21@x.com              21
   1  user34@x.com        34@x.com              34
   2  user123@x.com      123@x.com             123
   8  user223@x.com      223@x.com             223
   4  user23343@x.com  23343@x.com           23343

Context

StackExchange Database Administrators Q#321504, answer score: 2

Revisions (0)

No revisions yet.