HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlCritical

Pattern matching with LIKE, SIMILAR TO or regular expressions

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
withregularlikeexpressionssimilarpatternmatching

Problem

I had to write a simple query where I go looking for people's name that start with a B or a D:

SELECT s.name 
FROM spelers s 
WHERE s.name LIKE 'B%' OR s.name LIKE 'D%'
ORDER BY 1


I was wondering if there is a way to rewrite this to become more performant. So I can avoid or and / or like?

Solution

Pattern matching operators

-
LIKE (~~) is simple and fast but limited in its capabilities.

ILIKE (~~*) the case insensitive variant.

-
~ (regular expression match) is powerful but more complex and may be slow for anything more than basic expressions.

~* is the case insensitive variant.

-
SIMILAR TO is just pointless. A peculiar blend of LIKE and regular expressions. I never use it. See below.

All of the above can use a trigram index. See below.

For left-anchored patterns, also a B-tree index using COLLATE "C". Or with any other collation and the operator class text_pattern_ops. See below. Or even with an SP-GiST index.

Basics about pattern matching in the manual.
Related operators

-
^@ is "starts with" operator (for prefix matching), equivalent to the starts_with() function.

Added with Postgres 11, can use an SP-GiST index. Since Postgres 15 also a B-tree index using a "C" collation. See below.

-
% is the "similarity" operator, provided by the additional module pg_trgm. See below.

-
@@ is the text search operator. See below.

Your query

... is pretty much the optimum. Syntax won't get much shorter, query won't get much faster:

SELECT name FROM spelers
WHERE  name LIKE 'B%' OR name LIKE 'D%'
ORDER  BY 1;


Or equivalent (slightly more expensive):

... WHERE name ~ '^B' OR name LIKE '^D'


A bit shorter (can use the same index in modern Postgres):

... WHERE name LIKE ANY ('{B%,D%}')
... WHERE name ~ ANY ('{^B,^D}')


A regular expression with branches shortens the syntax some more:

... WHERE name ~ '^(B|D).*'


Or a character class (only for the case with a single character):

... WHERE name ~ '^[BD].*'


For bigger tables, index support improves performance by orders of magnitude.

In Postgres 11 or later the new ^@ is more convenient as we can use the unadorned prefix directly - and fast when supported with an SP-GiST index:

... WHERE name ^@ 'B' OR name ^@ 'D'


Or:

... WHERE name ^@ ANY ('{B,D}')


Since Postgres 15, the first variant can also use a B-tree index using COLLATE "C".

db<>fiddle here
Index

If concerned with performance, create an index like this for bigger tables to support left-anchored search patterns (matching from the start of the string):

CREATE INDEX spelers_name_special_idx ON spelers (name COLLATE "C");


Requires per-column collation support added with Postgres 9.1.

See:

  • Is there a difference between text_pattern_ops and COLLATE "C"?



  • B-tree index does not seem to be used?



In DBs running with the "C" locale (not typical), a plain B-tree index does the job.

In older versions (or still today), you can use the special operator class text_pattern_ops for the same purpose:

CREATE INDEX spelers_name_special_idx ON spelers (name text_pattern_ops);


SIMILAR TO or regular expressions with basic left-anchored expressions can use this index, too. Very old versions failed to use the index for branches (B|D), character classes [BD], or constructs with LIKE ANY or ~ ANY.

Or use an expression index based on a truncated substring for columns with long strings, and repeat the same expression in queries.

Or use the ^@ operator with a matching index.
Trigram matching

Trigram matches or text search use special GIN or GiST indexes.

You can install the additional module pg_trgm to provide index support for any LIKE / ILIKE pattern (and simple regexp patterns with ~ / ~*) using a GIN or GiST index.

Details, example and links:

  • How is LIKE implemented?



pg_trgm provides additional operators like:

  • % - the "similarity" operator



  • `) - the "word_similarity" operator in Postgres 9.6 or later



  • >) - the "strict_word_similarity" operator in Postgres 11 or later



Text search

Is a special type of pattern matching with separate infrastructure and index types. It uses dictionaries and stemming and is a great tool to find words in documents, especially for natural languages.

Prefix matching is also supported:

  • Get partial match from GIN indexed TSVECTOR column



As well as phrase search since Postgres 9.6:

  • How to search hyphenated words in PostgreSQL full text search?



Consider the introduction in the manual and the overview of operators and functions.
Additional tools for fuzzy string matching

The additional module fuzzystrmatch offers some more options, but performance is generally inferior to all of the above.

In particular, various implementations of the
levenshtein() function may be instrumental.
Why are regular expressions (
~) always faster than SIMILAR TO?

SIMILAR TO expressions are rewritten into regular expressions internally. For every SIMILAR TO expression, there is at least one faster regular expression (saving the overhead of rewriting the expression). There is no performance gain in using SIMILAR TO ever.

Simple expressions that can make do with
LIKE (~~) are faster with LIKE anyway.

SIMILAR TO` is only

Code Snippets

SELECT name FROM spelers
WHERE  name LIKE 'B%' OR name LIKE 'D%'
ORDER  BY 1;
... WHERE name ~ '^B' OR name LIKE '^D'
... WHERE name LIKE ANY ('{B%,D%}')
... WHERE name ~ ANY ('{^B,^D}')
... WHERE name ~ '^(B|D).*'
... WHERE name ~ '^[BD].*'

Context

StackExchange Database Administrators Q#10694, answer score: 236

Revisions (0)

No revisions yet.