patternsqlMinor
Numeric collation sorts by digits instead of value- postgres
Viewed 0 times
postgresnumericdigitsvaluesortsinsteadcollation
Problem
I have a table of users with ids in the following form
When searching and sorting the users, I need
After looking around I came across the following collation:
Which fixed the problem, but caused another one, users with numbers that started with 2 appeared after users that started with 1
For example,
From the docs, it says that
I'm using Postgres 11.6. I have two servers: one on Windows (local) and a remote one on Linux.
Is anyone familiar with this issue?
user123@domain.com.When searching and sorting the users, I need
user1@domian.com to be before user14@domain.com but since 4 is “smaller” than @ it sorts the other way around.After looking around I came across the following collation:
CREATE COLLATION numeric (provider = icu, locale = 'en-u-kn-true');Which fixed the problem, but caused another one, users with numbers that started with 2 appeared after users that started with 1
For example,
user19@domain.com came before user2@domain.comFrom the docs, it says that
kn-true and looks at the number’s value but it looks like it only compares the digits.I'm using Postgres 11.6. I have two servers: one on Windows (local) and a remote one on Linux.
Is anyone familiar with this issue?
Solution
This is a potential alternative solution to your problem. Rather than creating a collation, you could do following (all of the code below is available on the fiddle here):
Then populate it (out of order - i.e. we're not relying on
and then we check sorting by email (
So, now we combine a couple of functions to obtain the desired sort order. We will use the
We can see what's happening in the following query:
Result:
Et voilà - the emails are sorted in the desired order!
We can add an enhancement to this by using an
First, we'll take a look at the
Result:
Note the
A brief diversion about
-
This doesn't actually disable sequential table scans, it just makes them very expensive - see discussion below.
-
Do not do this on production systems, or at least don't do it globally. You could, if and only if you fully understand any consequences, do it on a case-by-case, query-by-query basis, but it's not to be recommended. Today's query hints are tomorrow's bugs - use with caution.
The reason I'm doing it here is to force the optimiser to choose the index over a sequential scan. Without enable_seqscan = OFF, the very small sample tables here would cause the optimiser to automatically choose a sequential scan. With a large number of records on a production system, this should not be a problem.
From the documentation here:
enable_seqscan (boolean)
Enables or disables the query planner's use of sequential scan plan
types. It is impossible to suppress sequential scans entirely, but
turning this variable off discourages the planner from using one if
there are other methods available. The default is on.
We create the expression index as follows:
and rerun the
Now,
CREATE TABLE test
(
e_id INT NOT NULL GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
email TEXT NOT NULL
);Then populate it (out of order - i.e. we're not relying on
INSERT order or physical order on disk of either the heap or the PRIMARY KEY):INSERT INTO test (email) VALUES
('user34@x.com'),
('user123@x.com'), -- Note: deliberately out of order!
('user1@x.com'),
('user23343@x.com'),
('user3@x.com'),
('user7@x.com'),
('user21@x.com'),
('user223@x.com'),and then we check sorting by email (
SELECT * FROM test ORDER BY email;) - result (out of order both by email and PRIMARY KEY):test
(2,user123@x.com)
(3,user1@x.com)
(7,user21@x.com)
...
... snipped for brevity - see fiddle
...So, now we combine a couple of functions to obtain the desired sort order. We will use the
SUBSTRING() (with a regular expression1 - also see the manual) and the SPLIT_PART() (manual), then cast the result of this function combination to an INTEGER using PostgreSQL's double-colon (::) cast operator and sort by that INTEGER.- An excellent introduction to regexes
We can see what's happening in the following query:
SELECT
e_id, email,
SUBSTRING(email FROM '\d.*'),
SPLIT_PART(SUBSTRING(email FROM '\d.*'), '@', 1)::INT AS num_split_part
FROM test
ORDER BY num_split_part;Result:
e_id email substring num_split_part
3 user1@x.com 1@x.com 1
5 user3@x.com 3@x.com 3
6 user7@x.com 7@x.com 7
7 user21@x.com 21@x.com 21
1 user34@x.com 34@x.com 34
2 user123@x.com 123@x.com 123
8 user223@x.com 223@x.com 223
4 user23343@x.com 23343@x.com 23343Et voilà - the emails are sorted in the desired order!
We can add an enhancement to this by using an
expression index (aka a functional index - see the manual) as follows.First, we'll take a look at the
EXPLAIN plan without the index:EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
SELECT
email
FROM test
ORDER BY num_split_part;Result:
QUERY PLAN
Sort (cost=10000000104.05..10000000107.22 rows=1270 width=72) (actual time=0.048..0.049 rows=8 loops=1)
Output: e_id, email, ("substring"(email, '\d.*'::text)), ((split_part("substring"(email, '\d.*'::text), '@'::text, 1))::integer)
Sort Key: ((split_part("substring"(test.email, '\d.*'::text), '@'::text, 1))::integer)
Sort Method: quicksort Memory: 25kB
Buffers: shared hit=1
-> Seq Scan on public.test (cost=10000000000.00..10000000038.58 rows=1270 width=72) (actual time=0.015..0.034 rows=8 loops=1)
Output: e_id, email, "substring"(email, '\d.*'::text), (split_part("substring"(email, '\d.*'::text), '@'::text, 1))::integer
Buffers: shared hit=1
Planning Time: 0.042 ms
Execution Time: 0.074 msNote the
Seq Scan on public.test which means that every record of the table has to be traversed before a result can be derived - even though I did SET enable_seqscan = OFF;.A brief diversion about
SET enable_seqscan = OFF;.-
This doesn't actually disable sequential table scans, it just makes them very expensive - see discussion below.
-
Do not do this on production systems, or at least don't do it globally. You could, if and only if you fully understand any consequences, do it on a case-by-case, query-by-query basis, but it's not to be recommended. Today's query hints are tomorrow's bugs - use with caution.
The reason I'm doing it here is to force the optimiser to choose the index over a sequential scan. Without enable_seqscan = OFF, the very small sample tables here would cause the optimiser to automatically choose a sequential scan. With a large number of records on a production system, this should not be a problem.
From the documentation here:
enable_seqscan (boolean)
Enables or disables the query planner's use of sequential scan plan
types. It is impossible to suppress sequential scans entirely, but
turning this variable off discourages the planner from using one if
there are other methods available. The default is on.
We create the expression index as follows:
CREATE INDEX func_idx ON test
((
SPLIT_PART(SUBSTRING(email FROM '\d.*'), '@', 1)::INT
)); -- the use of ::INT obliges us to have two opening and closin brackets!and rerun the
EXPLAIN (ANALYZE, BUFFERS, VERBOSE) (see the fiddle) and we get this:QUERY PLAN
Index Scan using func_idx on public.test (cost=0.13..12.35 rows=8 width=72) (actual time=0.037..0.057 rows=8 loops=1)
Output: e_id, email, "substring"(email, '\d.*'::text), (split_part("substring"(email, '\d.*'::text), '@'::text, 1))::integer
Buffers: shared hit=1 read=1
Planning Time: 0.225 ms
Execution Time: 0.078 msNow,
Code Snippets
CREATE TABLE test
(
e_id INT NOT NULL GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
email TEXT NOT NULL
);INSERT INTO test (email) VALUES
('user34@x.com'),
('user123@x.com'), -- Note: deliberately out of order!
('user1@x.com'),
('user23343@x.com'),
('user3@x.com'),
('user7@x.com'),
('user21@x.com'),
('user223@x.com'),test
(2,user123@x.com)
(3,user1@x.com)
(7,user21@x.com)
...
... snipped for brevity - see fiddle
...SELECT
e_id, email,
SUBSTRING(email FROM '\d.*'),
SPLIT_PART(SUBSTRING(email FROM '\d.*'), '@', 1)::INT AS num_split_part
FROM test
ORDER BY num_split_part;e_id email substring num_split_part
3 user1@x.com 1@x.com 1
5 user3@x.com 3@x.com 3
6 user7@x.com 7@x.com 7
7 user21@x.com 21@x.com 21
1 user34@x.com 34@x.com 34
2 user123@x.com 123@x.com 123
8 user223@x.com 223@x.com 223
4 user23343@x.com 23343@x.com 23343Context
StackExchange Database Administrators Q#321504, answer score: 2
Revisions (0)
No revisions yet.