Best approach to apply a long list of LIKE patterns?
Problem
As a follow-up to this question, I have one of my own.
The original question concerns the use of a CASE statement with more than 100 options, and the statement has to be used in 4 places - so obviously the SQL is pretty hairy. The OP's question concerned SQL Server 2012; my question, however, is about PostgreSQL.

In my answer, I proposed the use of a VIEW as a "one-stop-shop" solution - i.e. declare the VIEW once, use it anywhere - and this applies to any future query and any variant thereof.

Another poster (@AndriyM) proposed the use of a CROSS APPLY to resolve the issue, which is another solution. The PostgreSQL syntax is JOIN LATERAL.

I then added a CTE (Common Table Expression) to my original answer as yet another possible solution.

So, the OP now has 5 options:

- CASE
- VIEW
- JOIN LATERAL (CROSS APPLY for SQL Server)
- CTE
- Separate table

I excluded the option of changing the underlying data since, frequently in this forum, consultants/DBAs/programmers are not allowed to change the underlying data - which also makes the answers more interesting!

Obviously, a CASE expression with more than 100 options (x4) is horribly cumbersome and complex - but when is it a good idea to use CASE, and at what point does it become a minus rather than a plus?

In my opinion (and not just because it's my answer!), a VIEW is the optimal solution: it's simple, it works for all RDBMSs, and it's permanent - it will serve any query now and in the future, should the OP wish to modify the query.

The JOIN LATERAL construct will also work as a kind of derived table, which is pretty much what a CTE is too. Both can be used further down in the same query.

Which of the 5 approaches is better/best, and at what point do the technical considerations (ease of use, speed, query-plan optimisation) tilt in a particular solution's favour?
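To make the maintainability trade-off concrete, here is a minimal sketch of the first two options - an inline CASE expression versus a VIEW declared once and reused. It uses Python's sqlite3 in place of PostgreSQL, with made-up table contents and a hypothetical view name:

```python
import sqlite3

# Hypothetical data, standing in for the OP's account table.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE account_code (accountname TEXT)")
con.executemany("INSERT INTO account_code VALUES (?)",
                [("AIR NEW ZEALAND 123",), ("AIR BP 42",), ("UNLISTED",)])

pairs = [("AIR NEW Z%", "AIR NEW ZEALAND"), ("AIR BP%", "AIR BP")]

# Option 1: CASE. With 100+ branches this expression becomes huge,
# and it must be pasted into every query that needs it (x4 here).
case_sql = ("CASE " +
            " ".join(f"WHEN accountname LIKE '{p}' THEN '{t}'" for p, t in pairs) +
            " ELSE accountname END")

# Option 2: VIEW. Declare the mapping once, use it anywhere.
con.execute(f"CREATE VIEW account_mapped AS "
            f"SELECT {case_sql} AS mapped_name FROM account_code")

rows = con.execute("SELECT mapped_name FROM account_mapped ORDER BY 1").fetchall()
print(rows)  # [('AIR BP',), ('AIR NEW ZEALAND',), ('UNLISTED',)]
```

Either way the CASE expression exists, but the VIEW hides it behind one definition instead of four copies.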
Solution
I would use a translation table in a LATERAL subquery. Demo (Postgres 10+):

CREATE TABLE ac_translate (
  ord_nr int GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY
, like_pattern text NOT NULL
, target text NOT NULL
);

INSERT INTO ac_translate(like_pattern, target) VALUES
  ('AIR NEW Z%' , 'AIR NEW ZEALAND') -- rows in order of precedence!
, ('AIR BP%' , 'AIR BP')
, ('ADDICTION ADVICE%', 'ADDICTION ADVICE')
, ('AIA%' , 'AIA')
;

See:
- Does Postgres preserve insertion order of records?
Manipulate ord_nr to adjust priorities.

Query:
SELECT COALESCE(act.target, ac.accountName) AS accountname
, SUM(ac.charge_amount) AS gstexcl
FROM account_code ac
LEFT JOIN LATERAL (
SELECT a1.target
FROM ac_translate a1
WHERE ac.accountname LIKE a1.like_pattern
ORDER BY a1.ord_nr
LIMIT 1
) act ON true
GROUP BY 1;

Or with a correlated subquery:
SELECT COALESCE(
(SELECT a1.target
FROM ac_translate a1
WHERE ac.accountname LIKE a1.like_pattern
ORDER BY a1.ord_nr
LIMIT 1), ac.accountName) AS accountname
, SUM(ac.charge_amount) AS sum_amount
FROM account_code ac
GROUP BY 1;This is easy to handle, keeps the long list of options out of the code and puts it into a table where it can be maintained properly. And is moderately fast.
We can't easily use a plain LEFT JOIN ac_translate, since CASE traverses the patterns in order and returns the single, first match. We can't just join to a set: that might return multiple matches if one pattern is a prefix of another, like 'AIR%' and 'AIR N%'. So we use an ordering number in the translation table to prioritize matches in the subquery.

The ELSE clause in the referenced question resolves to the original value. That is implemented with COALESCE here. Basically, this combines the virtues of the top two answers over there.

To top it off, I slipped in GROUP BY 1 as another way to avoid repeating lengthy expressions (which is not actually needed here any more). See:
- Concatenate multiple result rows of one column into one, group by another column
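The multiple-match hazard is easy to reproduce. In this sqlite3 sketch (made-up patterns), a plain join against the overlapping patterns 'AIR N%' and 'AIR%' duplicates a row, while the ord_nr / LIMIT 1 subquery picks exactly one winner - and ord_nr controls which one:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE ac_translate (
        ord_nr INTEGER PRIMARY KEY, like_pattern TEXT, target TEXT);
    INSERT INTO ac_translate VALUES
      (1, 'AIR N%', 'AIR NEW ZEALAND')  -- more specific pattern ranked first
    , (2, 'AIR%'  , 'AIR (GENERIC)');
    CREATE TABLE account_code (accountname TEXT);
    INSERT INTO account_code VALUES ('AIR NEW ZEALAND 123');
""")

# Plain join: one input row, TWO output rows - both patterns match.
joined = con.execute("""
    SELECT t.target
    FROM   account_code ac
    JOIN   ac_translate t ON ac.accountname LIKE t.like_pattern
""").fetchall()
print(len(joined))  # 2

# Ordered subquery: exactly one row, and ord_nr decides the winner.
picked = con.execute("""
    SELECT (SELECT t.target
            FROM   ac_translate t
            WHERE  ac.accountname LIKE t.like_pattern
            ORDER  BY t.ord_nr
            LIMIT  1)
    FROM   account_code ac
""").fetchall()
print(picked)  # [('AIR NEW ZEALAND',)]
```

Swapping the two ord_nr values would make 'AIR (GENERIC)' win instead, which is exactly the "manipulate ord_nr to adjust priorities" knob.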
Speed
Performance deteriorates with the number of rows in the translation table, since Postgres is forced to walk through all of them sequentially and evaluate the LIKE expression. If that's not fast enough any more, we need index support - but the expression is not "sargable": the expression we need to index is to the right of the operator, and there is no COMMUTATOR for LIKE. Details:
- Can PostgreSQL index array columns?
There is a workaround, though. My example requires that patterns have at least 3 leading characters (3 is my arbitrary pick). Add a CHECK constraint to the translation table to enforce this rule, and an expression index on the leading trigram (the first 3 characters):

CREATE INDEX ac_translate_left_idx ON ac_translate (left(like_pattern, 3));

Adapt the query:
SELECT COALESCE(act.target, ac.accountName) AS accountname
, SUM(ac.charge_amount) AS gstexcl
FROM account_code ac
LEFT JOIN LATERAL (
SELECT a1.target
FROM ac_translate a1
WHERE left(ac.accountname, 3) = left(a1.like_pattern, 3)
AND ac.accountname LIKE a1.like_pattern
ORDER BY a1.ord_nr
LIMIT 1
) act ON true
GROUP BY 1;

With enough rows in the translation table (and favorable estimates and cost settings), Postgres will use a very fast index scan to narrow the patterns down to the few candidates (if any) and only filter those with the LIKE expression. Should scale just fine. I added the EXPLAIN output to the fiddle as proof of concept:

db<>fiddle here
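The same prefix trick can be sketched outside Postgres. sqlite3 has no left(), so this uses substr(pattern, 1, 3), and the CHECK below is a simplified stand-in for the "at least 3 leading literal characters" rule; the point is only that the added equality predicate can be served by an expression index without changing the result:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE ac_translate (
        ord_nr INTEGER PRIMARY KEY
        -- simplified stand-in for "at least 3 leading literal characters"
      , like_pattern TEXT NOT NULL CHECK (length(like_pattern) >= 3)
      , target TEXT NOT NULL);
    INSERT INTO ac_translate (like_pattern, target) VALUES
      ('AIR NEW Z%', 'AIR NEW ZEALAND'), ('AIA%', 'AIA');

    -- Expression index on the leading 3 characters (substr = Postgres left()).
    CREATE INDEX ac_translate_left_idx ON ac_translate (substr(like_pattern, 1, 3));

    CREATE TABLE account_code (accountname TEXT);
    INSERT INTO account_code VALUES
      ('AIR NEW ZEALAND 123'), ('AIA POLICY'), ('UNLISTED');
""")

rows = con.execute("""
    SELECT ac.accountname,
           (SELECT t.target
            FROM   ac_translate t
            -- equality on the leading trigram narrows candidates via the index ...
            WHERE  substr(ac.accountname, 1, 3) = substr(t.like_pattern, 1, 3)
            -- ... then LIKE filters the few that remain
            AND    ac.accountname LIKE t.like_pattern
            ORDER  BY t.ord_nr
            LIMIT  1)
    FROM   account_code ac
    ORDER  BY 1
""").fetchall()
print(rows)
```

The unmatched row comes back as NULL here; in the full query, COALESCE turns that back into the original accountname.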
Context
StackExchange Database Administrators Q#206884, answer score: 6