patternsqlMajor
Algorithm for finding the longest prefix
Viewed 0 times
thealgorithmforfindingprefixlongest
Problem
I have two tables.
First one is a table with prefixes
Second is call records with phone numbers
I need write a script which find longest prefix from prefixes for each record, and write all this data to third table, like this:
For number 834353212 we must trim '8', and then find the longest code from prefix table, its 3435.
We must always drop first '8' and prefix must be in the beginning.
I solved this task long time ago, with very bad way.
Its was terrible perl script which do a lot of queries for each record.
This script:
-
Take a number from calls table, do substring from length(number) to
1 => $prefix in the loop
-
Do the query : select count(*) from prefixes where code like '$prefix'
First problem is query counts - it's
Second problem is
I tried to solve the second problem by:
That speeds up each query, but did not solve problem in general.
I have 20k prefixes and 170k numbers now, and my old solution is bad. Looks like I need some new solution without loops.
Only one query for each call record or something like this.
First one is a table with prefixes
code name price
343 ek1 10
3435 nt 4
3432 ek2 2Second is call records with phone numbers
number time
834353212 10
834321242 20
834312345 30I need write a script which find longest prefix from prefixes for each record, and write all this data to third table, like this:
number code ....
834353212 3435
834321242 3432
834312345 343For number 834353212 we must trim '8', and then find the longest code from prefix table, its 3435.
We must always drop first '8' and prefix must be in the beginning.
I solved this task long time ago, with very bad way.
Its was terrible perl script which do a lot of queries for each record.
This script:
-
Take a number from calls table, do substring from length(number) to
1 => $prefix in the loop
-
Do the query : select count(*) from prefixes where code like '$prefix'
- If count>0 then take first prefixes and write into table
First problem is query counts - it's
call_records * length(number).Second problem is
LIKE expressions. I am afraid those are slow.I tried to solve the second problem by:
CREATE EXTENSION pg_trgm;
CREATE INDEX prefix_idx ON prefix USING gist (code gist_trgm_ops);That speeds up each query, but did not solve problem in general.
I have 20k prefixes and 170k numbers now, and my old solution is bad. Looks like I need some new solution without loops.
Only one query for each call record or something like this.
Solution
Assuming data type
"Simple" Solution
Key elements:
db<>fiddle here
Old sqlfiddle
Without index, this query runs for a very long time (didn't wait to see it finish). We need index support. Trigram indexes (supplied by the additional module
In my tests, a functional trigram GIN index won the race over a trigram GiST index (as expected):
Advanced dbfiddle here.
All test results are from a local Postgres 9.1 test installation with a reduced setup: 17k numbers and 2k codes:
Much faster yet
Failed attempt with
Ignoring the distracting first noise character, it comes down to a basic left anchored pattern match. Therefore I tried a functional B-tree index with the operator class
This works excellently for direct queries with a single search term and makes the trigram index look bad in comparison:
However, the query planner will not consider this index for joining two tables (before
Partial / functional B-tree indexes
The alternative it to use equality checks on partial strings with partial indexes. This can be used in a
Since we typically only have a limited number of
Say, we have prefixes ranging from 1 to 5 characters. Create a number of partial functional indexes, one for every distinct prefix length:
Since these are partial indexes, all of them together are barely larger than a single complete index.
Add matching indexes for numbers (taking the leading noise character into account):
While these indexes only hold a substring each and are partial, each covers most or all of the table. So they are much larger together than a single total index - except for long numbers. And they impose more work for write operations. That's the cost for amazing speed.
If that cost is too high for you (write performance is important / too many write operations / disk space an issue), you can skip these indexes. The rest is still faster, if not quite as fast as it could be ...
If numbers are never shorter then
Recursive CTE
With all the setup so far I was hoping for very elegant solution with a recursive CTE:
```
WITH RECURSIVE cte AS (
SELECT n.number, p.code, 4 AS len
FROM num n
LEFT JOIN prefix p
ON substring(number, 2, 5) = p.code
AND length(n.number) >= 6 -- incl. noise character
AND length(p.code) = 5
UNION ALL
SELECT c.number, p.code, len - 1
FROM cte c
LEFT JOIN prefix p
ON substring(number, 2, c.len) = p.code
AND length(c.number) >= c.len+1 -- incl. noise character
AND length(p.code) = c.len
WHERE c.len > 0
AND c.code IS NULL
)
SELECT number, code
FROM
text for the relevant columns.CREATE TABLE prefix (code text, name text, price int);
CREATE TABLE num (number text, time int);"Simple" Solution
SELECT DISTINCT ON (n.number)
n.number, p.code
FROM num n
JOIN prefix p ON right(n.number, -1) LIKE (p.code || '%')
ORDER BY n.number, p.code DESC;Key elements:
DISTINCT ON is a Postgres extension of the SQL standard DISTINCT. See:- Select first row in each GROUP BY group?.
ORDER BY p.code DESC picks the longest match, because '1234' sorts after '123' (in ascending order).db<>fiddle here
Old sqlfiddle
Without index, this query runs for a very long time (didn't wait to see it finish). We need index support. Trigram indexes (supplied by the additional module
pg_trgm) are a good candidate. You have to choose between GIN and GiST index. The first character of the numbers is just noise and can be excluded from the index, making it a functional index.In my tests, a functional trigram GIN index won the race over a trigram GiST index (as expected):
CREATE INDEX num_trgm_gin_idx ON num USING gin (right(number, -1) gin_trgm_ops);Advanced dbfiddle here.
All test results are from a local Postgres 9.1 test installation with a reduced setup: 17k numbers and 2k codes:
- Total runtime: 1719.552 ms (trigram GiST)
- Total runtime: 912.329 ms (trigram GIN)
Much faster yet
Failed attempt with
text_pattern_opsIgnoring the distracting first noise character, it comes down to a basic left anchored pattern match. Therefore I tried a functional B-tree index with the operator class
text_pattern_ops (assuming column type text).CREATE INDEX num_text_pattern_idx ON num(right(number, -1) text_pattern_ops);This works excellently for direct queries with a single search term and makes the trigram index look bad in comparison:
SELECT * FROM num WHERE right(number, -1) LIKE '2345%'- Total runtime: 3.816 ms (trgm_gin_idx)
- Total runtime: 0.147 ms (text_pattern_idx)
However, the query planner will not consider this index for joining two tables (before
LATERAL joins existed).Partial / functional B-tree indexes
The alternative it to use equality checks on partial strings with partial indexes. This can be used in a
JOIN.Since we typically only have a limited number of
different lengths for prefixes, we can build a solution similar to the one presented here with partial indexes:- Can spatial index help a "range - order by - limit" query
Say, we have prefixes ranging from 1 to 5 characters. Create a number of partial functional indexes, one for every distinct prefix length:
CREATE INDEX prefix_code_idx5 ON prefix(code) WHERE length(code) = 5;
CREATE INDEX prefix_code_idx4 ON prefix(code) WHERE length(code) = 4;
CREATE INDEX prefix_code_idx3 ON prefix(code) WHERE length(code) = 3;
CREATE INDEX prefix_code_idx2 ON prefix(code) WHERE length(code) = 2;
CREATE INDEX prefix_code_idx1 ON prefix(code) WHERE length(code) = 1;Since these are partial indexes, all of them together are barely larger than a single complete index.
Add matching indexes for numbers (taking the leading noise character into account):
CREATE INDEX num_number_idx5 ON num(substring(number, 2, 5)) WHERE length(number) >= 6;
CREATE INDEX num_number_idx4 ON num(substring(number, 2, 4)) WHERE length(number) >= 5;
CREATE INDEX num_number_idx3 ON num(substring(number, 2, 3)) WHERE length(number) >= 4;
CREATE INDEX num_number_idx2 ON num(substring(number, 2, 2)) WHERE length(number) >= 3;
CREATE INDEX num_number_idx1 ON num(substring(number, 2, 1)) WHERE length(number) >= 2;While these indexes only hold a substring each and are partial, each covers most or all of the table. So they are much larger together than a single total index - except for long numbers. And they impose more work for write operations. That's the cost for amazing speed.
If that cost is too high for you (write performance is important / too many write operations / disk space an issue), you can skip these indexes. The rest is still faster, if not quite as fast as it could be ...
If numbers are never shorter then
n characters, drop redundant WHERE clauses from some or all, and also drop the corresponding WHERE clause from all following queries.Recursive CTE
With all the setup so far I was hoping for very elegant solution with a recursive CTE:
```
WITH RECURSIVE cte AS (
SELECT n.number, p.code, 4 AS len
FROM num n
LEFT JOIN prefix p
ON substring(number, 2, 5) = p.code
AND length(n.number) >= 6 -- incl. noise character
AND length(p.code) = 5
UNION ALL
SELECT c.number, p.code, len - 1
FROM cte c
LEFT JOIN prefix p
ON substring(number, 2, c.len) = p.code
AND length(c.number) >= c.len+1 -- incl. noise character
AND length(p.code) = c.len
WHERE c.len > 0
AND c.code IS NULL
)
SELECT number, code
FROM
Code Snippets
CREATE TABLE prefix (code text, name text, price int);
CREATE TABLE num (number text, time int);SELECT DISTINCT ON (n.number)
n.number, p.code
FROM num n
JOIN prefix p ON right(n.number, -1) LIKE (p.code || '%')
ORDER BY n.number, p.code DESC;CREATE INDEX num_trgm_gin_idx ON num USING gin (right(number, -1) gin_trgm_ops);CREATE INDEX num_text_pattern_idx ON num(right(number, -1) text_pattern_ops);SELECT * FROM num WHERE right(number, -1) LIKE '2345%'Context
StackExchange Database Administrators Q#43415, answer score: 23
Revisions (0)
No revisions yet.