patternsqlMinor
Find rows where integer sequence contains a given subsequence
Viewed 0 times
rowswheresubsequencegivensequencecontainsfindinteger
Problem
Problem
Note: I refer to the mathematical sequences, not the sequences mechanism of PostgreSQL.
I have a table representing sequences of integers. The definition is:
My goal is to find rows using a given subsequence. That is to say, the rows where the
Example
Suppose the table contains following data:
So if the given subsequence is
Similarly, if the given subsequence is
Finally, if the given subsequence is
Investigations
First, I am not completely certain that represent sequences with an array data type is wise. Although this seems appropriate to the situation; I fear it makes more complicated handling. Perhaps it is better to represent the sequences differently, using a model of relations with another table.
In the same way, I think about expand the sequences using the
I know it is also possible to cut my sequence in subsequence using the
Note: I refer to the mathematical sequences, not the sequences mechanism of PostgreSQL.
I have a table representing sequences of integers. The definition is:
CREATE TABLE sequences
(
id serial NOT NULL,
title character varying(255) NOT NULL,
date date NOT NULL,
sequence integer[] NOT NULL,
CONSTRAINT "PRIM_KEY_SEQUENCES" PRIMARY KEY (id)
);My goal is to find rows using a given subsequence. That is to say, the rows where the
sequence field is a sequence that contains the given subsequence (in my case, the sequence is ordered).Example
Suppose the table contains following data:
+----+-------+------------+-------------------------------+
| id | title | date | sequence |
+----+-------+------------+-------------------------------+
| 1 | BG703 | 2004-12-24 | {1,3,17,25,377,424,242,1234} |
| 2 | BG256 | 2005-05-11 | {5,7,12,742,225,547,2142,223} |
| 3 | BD404 | 2004-10-13 | {3,4,12,5698,526} |
| 4 | BK956 | 2004-08-17 | {12,4,3,17,25,377,456,25} |
+----+-------+------------+-------------------------------+So if the given subsequence is
{12, 742, 225, 547}, I want to find the row 2.Similarly, if the given subsequence is
{3, 17, 25, 377}, I want to find the row 1 and the row 4.Finally, if the given subsequence is
{12, 4, 3, 25, 377}, then there are no rows returned.Investigations
First, I am not completely certain that represent sequences with an array data type is wise. Although this seems appropriate to the situation; I fear it makes more complicated handling. Perhaps it is better to represent the sequences differently, using a model of relations with another table.
In the same way, I think about expand the sequences using the
unnest array function and then add my search criteria. Nevertheless, the number of terms in the sequence being variable I do not see how to do that.I know it is also possible to cut my sequence in subsequence using the
subarray function oSolution
If you are looking for significant performance improvements to dnoeth's answer, consider using a native C-function and creating the appropriate operator.
Here is an example for int4 arrays. (A generic array variant and the corresponding SQL script).
Now you can filter rows like this.
I have conducted a little experiment to find how much faster this solution is.
So, it is about 16 times faster. If it is not enough, you can add support for GIN or GiST indexes, but this will be much more difficult task.
Here is an example for int4 arrays. (A generic array variant and the corresponding SQL script).
Datum
_int_sequence_contained(PG_FUNCTION_ARGS)
{
return DirectFunctionCall2(_int_contains_sequence,
PG_GETARG_DATUM(1),
PG_GETARG_DATUM(0));
}
Datum
_int_contains_sequence(PG_FUNCTION_ARGS)
{
ArrayType *a = PG_GETARG_ARRAYTYPE_P(0);
ArrayType *b = PG_GETARG_ARRAYTYPE_P(1);
int na, nb;
int32 *pa, *pb;
int i, j;
na = ArrayGetNItems(ARR_NDIM(a), ARR_DIMS(a));
nb = ArrayGetNItems(ARR_NDIM(b), ARR_DIMS(b));
pa = (int32 *) ARR_DATA_PTR(a);
pb = (int32 *) ARR_DATA_PTR(b);
/* The naive searching algorithm. Replace it with a better one if your arrays are quite large. */
for (i = 0; i <= na - nb; ++i)
{
for (j = 0; j < nb; ++j)
if (pa[i + j] != pb[j])
break;
if (j == nb)
PG_RETURN_BOOL(true);
}
PG_RETURN_BOOL(false);
}CREATE FUNCTION _int_contains_sequence(_int4, _int4)
RETURNS bool
AS 'MODULE_PATHNAME'
LANGUAGE C STRICT IMMUTABLE;
CREATE FUNCTION _int_sequence_contained(_int4, _int4)
RETURNS bool
AS 'MODULE_PATHNAME'
LANGUAGE C STRICT IMMUTABLE;
CREATE OPERATOR @@> (
LEFTARG = _int4,
RIGHTARG = _int4,
PROCEDURE = _int_contains_sequence,
COMMUTATOR = '',
RESTRICT = contsel,
JOIN = contjoinsel
);
Now you can filter rows like this.
SELECT * FROM sequences WHERE sequence @@> '{12, 742, 225, 547}'
I have conducted a little experiment to find how much faster this solution is.
CREATE TEMPORARY TABLE sequences AS
SELECT array_agg((random() * 10)::int4) AS sequence, g1 AS id
FROM generate_series(1, 100000) g1
CROSS JOIN generate_series(1, 30) g2
GROUP BY g1;
EXPLAIN ANALYZE SELECT * FROM sequences
WHERE translate(cast(sequence as text), '{}',',,')
LIKE '%' || translate(cast('{1,2,3,4}'as text), '{}',',,') || '%'
"Seq Scan on sequences (cost=0.00..7869.42 rows=28 width=36) (actual time=2.487..334.318 rows=251 loops=1)"
" Filter: (translate((sequence)::text, '{}'::text, ',,'::text) ~~ '%,1,2,3,4,%'::text)"
" Rows Removed by Filter: 99749"
"Planning time: 0.104 ms"
"Execution time: 334.365 ms"
EXPLAIN ANALYZE SELECT * FROM sequences WHERE sequence @@> '{1,2,3,4}'
"Seq Scan on sequences (cost=0.00..5752.01 rows=282 width=36) (actual time=0.178..20.792 rows=251 loops=1)"
" Filter: (sequence @@> '{1,2,3,4}'::integer[])"
" Rows Removed by Filter: 99749"
"Planning time: 0.091 ms"
"Execution time: 20.859 ms"
So, it is about 16 times faster. If it is not enough, you can add support for GIN or GiST indexes, but this will be much more difficult task.
Code Snippets
Datum
_int_sequence_contained(PG_FUNCTION_ARGS)
{
return DirectFunctionCall2(_int_contains_sequence,
PG_GETARG_DATUM(1),
PG_GETARG_DATUM(0));
}
Datum
_int_contains_sequence(PG_FUNCTION_ARGS)
{
ArrayType *a = PG_GETARG_ARRAYTYPE_P(0);
ArrayType *b = PG_GETARG_ARRAYTYPE_P(1);
int na, nb;
int32 *pa, *pb;
int i, j;
na = ArrayGetNItems(ARR_NDIM(a), ARR_DIMS(a));
nb = ArrayGetNItems(ARR_NDIM(b), ARR_DIMS(b));
pa = (int32 *) ARR_DATA_PTR(a);
pb = (int32 *) ARR_DATA_PTR(b);
/* The naive searching algorithm. Replace it with a better one if your arrays are quite large. */
for (i = 0; i <= na - nb; ++i)
{
for (j = 0; j < nb; ++j)
if (pa[i + j] != pb[j])
break;
if (j == nb)
PG_RETURN_BOOL(true);
}
PG_RETURN_BOOL(false);
}Context
StackExchange Database Administrators Q#108373, answer score: 4
Revisions (0)
No revisions yet.