patternsqlMajor

Algorithm for finding the longest prefix

Submitted by: @import:stackexchange-dba·Mar 10, 2026·

Viewed 0 times

thealgorithmforfindingprefixlongest

Problem

I have two tables.

First one is a table with prefixes

code name price
343  ek1   10
3435 nt     4
3432 ek2    2

Second is call records with phone numbers

number        time
834353212     10
834321242     20
834312345     30

I need write a script which find longest prefix from prefixes for each record, and write all this data to third table, like this:

number        code   ....
 834353212     3435
 834321242     3432
 834312345     343

For number 834353212 we must trim '8', and then find the longest code from prefix table, its 3435.

We must always drop first '8' and prefix must be in the beginning.

I solved this task long time ago, with very bad way.
Its was terrible perl script which do a lot of queries for each record.
This script:

-
Take a number from calls table, do substring from length(number) to
1 => $prefix in the loop

-
Do the query : select count(*) from prefixes where code like '$prefix'

If count>0 then take first prefixes and write into table

First problem is query counts - it's call_records * length(number).
Second problem is LIKE expressions. I am afraid those are slow.

I tried to solve the second problem by:

CREATE EXTENSION pg_trgm;
CREATE INDEX prefix_idx ON prefix USING gist (code gist_trgm_ops);

That speeds up each query, but did not solve problem in general.

I have 20k prefixes and 170k numbers now, and my old solution is bad. Looks like I need some new solution without loops.

Only one query for each call record or something like this.

Solution

Assuming data type text for the relevant columns.

CREATE TABLE prefix (code text, name text, price int);
CREATE TABLE num (number text, time int);

"Simple" Solution

SELECT DISTINCT ON (n.number)
       n.number, p.code
FROM   num n
JOIN   prefix p ON right(n.number, -1) LIKE (p.code || '%')
ORDER  BY n.number, p.code DESC;

Key elements:

DISTINCT ON is a Postgres extension of the SQL standard DISTINCT. See:

Select first row in each GROUP BY group?.

ORDER BY p.code DESC picks the longest match, because '1234' sorts after '123' (in ascending order).

db<>fiddle here

Old sqlfiddle

Without index, this query runs for a very long time (didn't wait to see it finish). We need index support. Trigram indexes (supplied by the additional module pg_trgm) are a good candidate. You have to choose between GIN and GiST index. The first character of the numbers is just noise and can be excluded from the index, making it a functional index.

In my tests, a functional trigram GIN index won the race over a trigram GiST index (as expected):

CREATE INDEX num_trgm_gin_idx ON num USING gin (right(number, -1) gin_trgm_ops);

Advanced dbfiddle here.

All test results are from a local Postgres 9.1 test installation with a reduced setup: 17k numbers and 2k codes:

Total runtime: 1719.552 ms (trigram GiST)

Total runtime: 912.329 ms (trigram GIN)

Much faster yet
Failed attempt with text_pattern_ops

Ignoring the distracting first noise character, it comes down to a basic left anchored pattern match. Therefore I tried a functional B-tree index with the operator class text_pattern_ops (assuming column type text).

CREATE INDEX num_text_pattern_idx ON num(right(number, -1) text_pattern_ops);

This works excellently for direct queries with a single search term and makes the trigram index look bad in comparison:

SELECT * FROM num WHERE right(number, -1) LIKE '2345%'

Total runtime: 3.816 ms (trgm_gin_idx)

Total runtime: 0.147 ms (text_pattern_idx)

However, the query planner will not consider this index for joining two tables (before LATERAL joins existed).
Partial / functional B-tree indexes

The alternative it to use equality checks on partial strings with partial indexes. This can be used in a JOIN.

Since we typically only have a limited number of different lengths for prefixes, we can build a solution similar to the one presented here with partial indexes:

Can spatial index help a "range - order by - limit" query

Say, we have prefixes ranging from 1 to 5 characters. Create a number of partial functional indexes, one for every distinct prefix length:

CREATE INDEX prefix_code_idx5 ON prefix(code) WHERE length(code) = 5;
CREATE INDEX prefix_code_idx4 ON prefix(code) WHERE length(code) = 4;
CREATE INDEX prefix_code_idx3 ON prefix(code) WHERE length(code) = 3;
CREATE INDEX prefix_code_idx2 ON prefix(code) WHERE length(code) = 2;
CREATE INDEX prefix_code_idx1 ON prefix(code) WHERE length(code) = 1;

Since these are partial indexes, all of them together are barely larger than a single complete index.

Add matching indexes for numbers (taking the leading noise character into account):

CREATE INDEX num_number_idx5 ON num(substring(number, 2, 5)) WHERE length(number) >= 6;
CREATE INDEX num_number_idx4 ON num(substring(number, 2, 4)) WHERE length(number) >= 5;
CREATE INDEX num_number_idx3 ON num(substring(number, 2, 3)) WHERE length(number) >= 4;
CREATE INDEX num_number_idx2 ON num(substring(number, 2, 2)) WHERE length(number) >= 3;
CREATE INDEX num_number_idx1 ON num(substring(number, 2, 1)) WHERE length(number) >= 2;

While these indexes only hold a substring each and are partial, each covers most or all of the table. So they are much larger together than a single total index - except for long numbers. And they impose more work for write operations. That's the cost for amazing speed.

If that cost is too high for you (write performance is important / too many write operations / disk space an issue), you can skip these indexes. The rest is still faster, if not quite as fast as it could be ...

If numbers are never shorter then n characters, drop redundant WHERE clauses from some or all, and also drop the corresponding WHERE clause from all following queries.
Recursive CTE

With all the setup so far I was hoping for very elegant solution with a recursive CTE:

```
WITH RECURSIVE cte AS (
SELECT n.number, p.code, 4 AS len
FROM num n
LEFT JOIN prefix p
ON substring(number, 2, 5) = p.code
AND length(n.number) >= 6 -- incl. noise character
AND length(p.code) = 5

UNION ALL
SELECT c.number, p.code, len - 1
FROM cte c
LEFT JOIN prefix p
ON substring(number, 2, c.len) = p.code
AND length(c.number) >= c.len+1 -- incl. noise character
AND length(p.code) = c.len
WHERE c.len > 0
AND c.code IS NULL
)
SELECT number, code
FROM

Code Snippets

CREATE TABLE prefix (code text, name text, price int);
CREATE TABLE num (number text, time int);

SELECT DISTINCT ON (n.number)
       n.number, p.code
FROM   num n
JOIN   prefix p ON right(n.number, -1) LIKE (p.code || '%')
ORDER  BY n.number, p.code DESC;

CREATE INDEX num_trgm_gin_idx ON num USING gin (right(number, -1) gin_trgm_ops);

CREATE INDEX num_text_pattern_idx ON num(right(number, -1) text_pattern_ops);

SELECT * FROM num WHERE right(number, -1) LIKE '2345%'

Context

StackExchange Database Administrators Q#43415, answer score: 23

Revisions (0)

No revisions yet.