snippetsqlMinor
How do you count the occurrences of an anchored string using PostgreSQL?
Viewed 0 times
postgresqltheoccurrencesyouanchoredusinghowcountstring
Problem
If I have a string in a column on a row in a table like this
How would I count the occurrence of a substring
For the purposes of this, let's treat the numbers as substrings
Sample data
1 2 2 2 2 2 2How would I count the occurrence of a substring
2 inside the string. Assume nothing other than a space-delimiter of " ".For the purposes of this, let's treat the numbers as substrings
Sample data
CREATE TABLE foo
AS
SELECT 1 AS id, '1 2 2 2 2 2 2'::text AS data;
TABLE foo
id | data
----+---------------
1 | 1 2 2 2 2 2 2Solution
You can solve this with a
-
FAST A string function, such as one the pattern explained below
Possible solutions
String
Using
Most RDBMS's provide some method to calculate substring occurrences like this,
This method isn't applicable here because without an anchor, we can't be sure if we're replacing the substring inside of something-space-delimited. As an example, the above replaces the
Because we're not splitting on simple spaces (' ') though we could with more complexity, we may also want to accommodate sub-strings of different lengths like in this question. This is why we explicitly include
Using ARRAY[]
Splitting into an
Here we have to subtract one match splits a string into two fragments, and thus the occurrences is one less than the fragment counts: this
Alternatively, we can use
Using a TABLE
Splitting into a table with
Here we split the regex into a table.. In this method you're using
And, from there you can run regular SQL on it.
Using
Here we get away from splitting and instead use the
Procedural Languages
Perl
This method turned out to the be the fastest overall,
Summary and Performance Implications
Following the same format of data, performance implications can be obtained with
Under these constraints, I found that the procedural method with plperl was the fastest. Next I found the following as being the fastest native method,
Keep in mind the tried and true method of string replacement outside of needing to anchor the string is still the fastest and most efficient native method however clunky it may be,
That said, the
- FASTEST was the pl/perl method which I placed last on this list because it requires pl/perl, and is likely not needed for most workloads.
-
FAST A string function, such as one the pattern explained below
length(str) - length(*replace(str, replaceStr))
/ length(replaceStr)- Something that converts from a string to an array.
- SLOW Something that converts from a string to a table.
Possible solutions
String
Using
length and regexp_replaceMost RDBMS's provide some method to calculate substring occurrences like this,
SELECT length(data) - length(replace(data, '2', ''))
/ length('2')
FROM foo;This method isn't applicable here because without an anchor, we can't be sure if we're replacing the substring inside of something-space-delimited. As an example, the above replaces the
2 in 329. We can remedy that by using regexp_replace to anchor the substring.SELECT length(data) - length(regexp_replace(data, '\m2\M', '', 'g'))
/ length('2')
FROM foo;Because we're not splitting on simple spaces (' ') though we could with more complexity, we may also want to accommodate sub-strings of different lengths like in this question. This is why we explicitly include
/ length('2'). That reduces to a no-op, but if we we're search for something that was longer than one character, it'd be required.SELECT length(data) - length(regexp_replace(data, '\m42\M', '', 'g'))
/ length('42')
FROM foo;Using ARRAY[]
Splitting into an
ARRAY[]Here we have to subtract one match splits a string into two fragments, and thus the occurrences is one less than the fragment counts: this
xyx split on y, produces {'x', 'x'} and we want the length to be 1 corresponding to the occurrences of y.SELECT array_length(x, 1) - 1
FROM foo
CROSS JOIN LATERAL regexp_split_to_array(data, '\m2\M') AS t(x);
-- un-anchored version for reference.
-- CROSS JOIN LATERAL string_to_array(data, '2') AS t(x);Alternatively, we can use
string_to_array to separate something that's space-delimited and then count the matches, SELECT id, array_length(array_positions(x, '2'), 1)
FROM foo
CROSS JOIN LATERAL string_to_array(data, ' ') AS t(x);Using a TABLE
Splitting into a table with
regexp_split_to_tableHere we split the regex into a table.. In this method you're using
GROUP BY and count().SELECT id, x
FROM foo
CROSS JOIN LATERAL regexp_split_to_table(data, ' ')
AS t(x);
id | x
----+---
1 | 1
1 | 2
1 | 2
1 | 2
1 | 2
1 | 2
1 | 2
(7 rows)And, from there you can run regular SQL on it.
SELECT id, x, count(*)
FROM (
SELECT id, x
FROM foo
CROSS JOIN LATERAL regexp_split_to_table(data, ' ')
AS t(x)
) AS t(id,x)
GROUP BY id, x;
id | x | count
----+---+-------
1 | 1 | 1
1 | 2 | 6Using
regex_matchesHere we get away from splitting and instead use the
\m, and \M anchors for word-boundaries.SELECT count(*)
FROM foo
CROSS JOIN LATERAL regexp_matches(data, '\m2\M', 'g');Procedural Languages
Perl
This method turned out to the be the fastest overall,
CREATE LANGUAGE plperl
CREATE FUNCTION count_occurances(inputStr text, regex text)
RETURNS smallint
AS $BODY$
scalar @{[ $_[0] =~ m/$_[1]/g ]}
$BODY$
LANGUAGE plperl
IMMUTABLE;Summary and Performance Implications
Following the same format of data, performance implications can be obtained with
CREATE TABLE foo
AS
SELECT
1 AS id,
array_to_string(
ARRAY(SELECT trunc(random()*100+1)::int % 100 FROM generate_series(1,5000) AS t(x)),
' '
) AS data
;Under these constraints, I found that the procedural method with plperl was the fastest. Next I found the following as being the fastest native method,
length(str) - regexp_replace(str, replacement, g)
/ length(replacement)Keep in mind the tried and true method of string replacement outside of needing to anchor the string is still the fastest and most efficient native method however clunky it may be,
length(str) - replace(str, replacement)
/ length(replacement)That said, the
ARRAY[] method is massively faster than splitting to a table.Code Snippets
length(str) - length(*replace(str, replaceStr))
/ length(replaceStr)SELECT length(data) - length(replace(data, '2', ''))
/ length('2')
FROM foo;SELECT length(data) - length(regexp_replace(data, '\m2\M', '', 'g'))
/ length('2')
FROM foo;SELECT length(data) - length(regexp_replace(data, '\m42\M', '', 'g'))
/ length('42')
FROM foo;SELECT array_length(x, 1) - 1
FROM foo
CROSS JOIN LATERAL regexp_split_to_array(data, '\m2\M') AS t(x);
-- un-anchored version for reference.
-- CROSS JOIN LATERAL string_to_array(data, '2') AS t(x);Context
StackExchange Database Administrators Q#166762, answer score: 8
Revisions (0)
No revisions yet.