patternsqlMinor
Indexing of a PostgreSQL bitstring (up to 20,000 bits each)
Viewed 0 times
postgresqleachindexingbitsbitstring000
Problem
I am building a table that contains chemical compounds (many millions of rows) and of those compounds certain predetermined features/fragments are flagged in a fixed length bitstring.
This bitstring will have between 2000 and 20000 bits, further research needs to be done to determine more precise figures there.
When searching for compounds that have certain specific features, or lack specific features, a search is done on a select subset of this bitstring. This can be a different subset every time.
Is there an index type that can make these searches efficient in PostgreSQL (9.6 or 10)?
Insertions are infrequent and done in a batch process, whereas searches are the mostly used operation and should preferably be quick and not have false positives or false negatives.
To me this vaguely sounds like work for a GIN index, but my understanding of this index type is not enough to say for sure if that is realy the case.
Actually there could be another solution, and that is to create a separate 'fragment_index' table, with fragment identifiers (as they have a fixed position in the bitstring they also have a numerical identifier) + compound-id pairs.
My worry there is that that table can become enormous (20M compounds with avg 50 hits on fragments = 1G rows) and multiple joins against it (one for each fragment) where that join can also return up to 80% of matches with the compounds table (in some cases this is likely) will not perform at all.
I would be gratefull to get any suggestions towards a direction to get this on the road.
update: i tried a GIN index using the trigram module on a varchar array with codified shortcodes, it gave mixed results, mainly depending on the amount of data left over after the filter operations.
For the sake of giving examples that have any meaning let's assume the table looks like this:
```
CREATE TABLE testcompounds (
id serial primary key,
cd_structure text,
features_as_text varchar(128),
features_as_bits bit varying(32)
);
CREA
This bitstring will have between 2000 and 20000 bits, further research needs to be done to determine more precise figures there.
When searching for compounds that have certain specific features, or lack specific features, a search is done on a select subset of this bitstring. This can be a different subset every time.
Is there an index type that can make these searches efficient in PostgreSQL (9.6 or 10)?
Insertions are infrequent and done in a batch process, whereas searches are the mostly used operation and should preferably be quick and not have false positives or false negatives.
To me this vaguely sounds like work for a GIN index, but my understanding of this index type is not enough to say for sure if that is realy the case.
Actually there could be another solution, and that is to create a separate 'fragment_index' table, with fragment identifiers (as they have a fixed position in the bitstring they also have a numerical identifier) + compound-id pairs.
My worry there is that that table can become enormous (20M compounds with avg 50 hits on fragments = 1G rows) and multiple joins against it (one for each fragment) where that join can also return up to 80% of matches with the compounds table (in some cases this is likely) will not perform at all.
I would be gratefull to get any suggestions towards a direction to get this on the road.
update: i tried a GIN index using the trigram module on a varchar array with codified shortcodes, it gave mixed results, mainly depending on the amount of data left over after the filter operations.
For the sake of giving examples that have any meaning let's assume the table looks like this:
```
CREATE TABLE testcompounds (
id serial primary key,
cd_structure text,
features_as_text varchar(128),
features_as_bits bit varying(32)
);
CREA
Solution
I did test some index strategies for bitwise operation. A little long procedure, sorry.
First of all,
Environment
-
version: PostgreSQL 10.3
-
host: A new instance of Amazon RDS db.m4.large (vCPU:2, Memory:8GB, Disk:GP2-SSD 100GB)
table/index definition
Create extensions
Create table
Create one btree index on each boolean column
Create one composite btree index on bool columns
Create one bloom index on bool columns
```
CREATE INDEX IF NOT EXISTS idx_bloom_on_bool ON t_bitwise
USING bloom (
CAST(c_bool_1 AS INTEGER) ,CAST(c_bool_2 AS INTEGER) ,CAST(c_bool_3 AS INTEGER) ,CAST(c_bool_4 AS INTEGER) ,CAST(c_bool_5 AS INTEGER)
,CAST(c_bool_6 AS INTEGER) ,CAST(c_bool_7 AS INTEGER) ,CAST(c_bool_8 AS INTEGER) ,CAST(c_bool_9 AS INTEGER) ,CAST(c_bool_10 AS INTEGER)
,CAST(c_bool_11 AS INTEGER) ,CAST(c_bool_12 AS INTEGER) ,CAST(c_bool_13 AS INTEGER) ,CAST(c_bool_14 AS INTEGER) ,CAST(c_bool_15 AS INTEGER)
,CAST(c_bool_16 AS INTEGER) ,CAS
First of all,
- This is just a test on my RDS environment. Your database performance depends on your environment and data (CPU, Memory, Disk, number of rows in your table, cardinality of your column and so on)
- I am not familiar with PostgreSQL. Especially, I am not sure of which bit length of bloom index is best. (Could someone help me ?)
- Generally, small index size is preferred. So, table partitioning may be a good solution.
Environment
-
version: PostgreSQL 10.3
-
host: A new instance of Amazon RDS db.m4.large (vCPU:2, Memory:8GB, Disk:GP2-SSD 100GB)
table/index definition
Create extensions
CREATE EXTENSION IF NOT EXISTS intarray;
CREATE EXTENSION IF NOT EXISTS bloom;Create table
CREATE TABLE IF NOT EXISTS t_bitwise (
id BIGINT
,c_int INTEGER
,c_bit BIT(31)
,c_int_arr _int4
,c_bool_0 BOOLEAN ,c_bool_1 BOOLEAN ,c_bool_2 BOOLEAN ,c_bool_3 BOOLEAN ,c_bool_4 BOOLEAN
,c_bool_5 BOOLEAN ,c_bool_6 BOOLEAN ,c_bool_7 BOOLEAN ,c_bool_8 BOOLEAN ,c_bool_9 BOOLEAN
,c_bool_10 BOOLEAN ,c_bool_11 BOOLEAN ,c_bool_12 BOOLEAN ,c_bool_13 BOOLEAN ,c_bool_14 BOOLEAN
,c_bool_15 BOOLEAN ,c_bool_16 BOOLEAN ,c_bool_17 BOOLEAN ,c_bool_18 BOOLEAN ,c_bool_19 BOOLEAN
,c_bool_20 BOOLEAN ,c_bool_21 BOOLEAN ,c_bool_22 BOOLEAN ,c_bool_23 BOOLEAN ,c_bool_24 BOOLEAN
,c_bool_25 BOOLEAN ,c_bool_26 BOOLEAN ,c_bool_27 BOOLEAN ,c_bool_28 BOOLEAN ,c_bool_29 BOOLEAN
,c_bool_30 BOOLEAN
,PRIMARY KEY (id)
);Create one btree index on each boolean column
CREATE INDEX IF NOT EXISTS idx_btree_on_bool_1 ON t_bitwise (c_bool_1);
CREATE INDEX IF NOT EXISTS idx_btree_on_bool_2 ON t_bitwise (c_bool_2);
CREATE INDEX IF NOT EXISTS idx_btree_on_bool_3 ON t_bitwise (c_bool_3);
CREATE INDEX IF NOT EXISTS idx_btree_on_bool_4 ON t_bitwise (c_bool_4);
CREATE INDEX IF NOT EXISTS idx_btree_on_bool_5 ON t_bitwise (c_bool_5);
CREATE INDEX IF NOT EXISTS idx_btree_on_bool_6 ON t_bitwise (c_bool_6);
CREATE INDEX IF NOT EXISTS idx_btree_on_bool_7 ON t_bitwise (c_bool_7);
CREATE INDEX IF NOT EXISTS idx_btree_on_bool_8 ON t_bitwise (c_bool_8);
CREATE INDEX IF NOT EXISTS idx_btree_on_bool_9 ON t_bitwise (c_bool_9);
CREATE INDEX IF NOT EXISTS idx_btree_on_bool_10 ON t_bitwise (c_bool_10);
CREATE INDEX IF NOT EXISTS idx_btree_on_bool_11 ON t_bitwise (c_bool_11);
CREATE INDEX IF NOT EXISTS idx_btree_on_bool_12 ON t_bitwise (c_bool_12);
CREATE INDEX IF NOT EXISTS idx_btree_on_bool_13 ON t_bitwise (c_bool_13);
CREATE INDEX IF NOT EXISTS idx_btree_on_bool_14 ON t_bitwise (c_bool_14);
CREATE INDEX IF NOT EXISTS idx_btree_on_bool_15 ON t_bitwise (c_bool_15);
CREATE INDEX IF NOT EXISTS idx_btree_on_bool_16 ON t_bitwise (c_bool_16);
CREATE INDEX IF NOT EXISTS idx_btree_on_bool_17 ON t_bitwise (c_bool_17);
CREATE INDEX IF NOT EXISTS idx_btree_on_bool_18 ON t_bitwise (c_bool_18);
CREATE INDEX IF NOT EXISTS idx_btree_on_bool_19 ON t_bitwise (c_bool_19);
CREATE INDEX IF NOT EXISTS idx_btree_on_bool_20 ON t_bitwise (c_bool_20);
CREATE INDEX IF NOT EXISTS idx_btree_on_bool_21 ON t_bitwise (c_bool_21);
CREATE INDEX IF NOT EXISTS idx_btree_on_bool_22 ON t_bitwise (c_bool_22);
CREATE INDEX IF NOT EXISTS idx_btree_on_bool_23 ON t_bitwise (c_bool_23);
CREATE INDEX IF NOT EXISTS idx_btree_on_bool_24 ON t_bitwise (c_bool_24);
CREATE INDEX IF NOT EXISTS idx_btree_on_bool_25 ON t_bitwise (c_bool_25);
CREATE INDEX IF NOT EXISTS idx_btree_on_bool_26 ON t_bitwise (c_bool_26);
CREATE INDEX IF NOT EXISTS idx_btree_on_bool_27 ON t_bitwise (c_bool_27);
CREATE INDEX IF NOT EXISTS idx_btree_on_bool_28 ON t_bitwise (c_bool_28);
CREATE INDEX IF NOT EXISTS idx_btree_on_bool_29 ON t_bitwise (c_bool_29);
CREATE INDEX IF NOT EXISTS idx_btree_on_bool_30 ON t_bitwise (c_bool_30);
CREATE INDEX IF NOT EXISTS idx_btree_on_bool_31 ON t_bitwise (c_bool_31);
CREATE INDEX IF NOT EXISTS idx_btree_on_bool_32 ON t_bitwise (c_bool_32);Create one composite btree index on bool columns
CREATE INDEX IF NOT EXISTS idx_btree_on_composite_bool ON t_bitwise (
c_bool_32 ,c_bool_31
,c_bool_30 ,c_bool_29 ,c_bool_28 ,c_bool_27 ,c_bool_26 ,c_bool_25 ,c_bool_24 ,c_bool_23 ,c_bool_22 ,c_bool_21
,c_bool_20 ,c_bool_19 ,c_bool_18 ,c_bool_17 ,c_bool_16 ,c_bool_15 ,c_bool_14 ,c_bool_13 ,c_bool_12 ,c_bool_11
,c_bool_10 ,c_bool_9 ,c_bool_8 ,c_bool_7 ,c_bool_6 ,c_bool_5 ,c_bool_4 ,c_bool_3 ,c_bool_2 ,c_bool_1
)
;Create one bloom index on bool columns
```
CREATE INDEX IF NOT EXISTS idx_bloom_on_bool ON t_bitwise
USING bloom (
CAST(c_bool_1 AS INTEGER) ,CAST(c_bool_2 AS INTEGER) ,CAST(c_bool_3 AS INTEGER) ,CAST(c_bool_4 AS INTEGER) ,CAST(c_bool_5 AS INTEGER)
,CAST(c_bool_6 AS INTEGER) ,CAST(c_bool_7 AS INTEGER) ,CAST(c_bool_8 AS INTEGER) ,CAST(c_bool_9 AS INTEGER) ,CAST(c_bool_10 AS INTEGER)
,CAST(c_bool_11 AS INTEGER) ,CAST(c_bool_12 AS INTEGER) ,CAST(c_bool_13 AS INTEGER) ,CAST(c_bool_14 AS INTEGER) ,CAST(c_bool_15 AS INTEGER)
,CAST(c_bool_16 AS INTEGER) ,CAS
Code Snippets
CREATE EXTENSION IF NOT EXISTS intarray;
CREATE EXTENSION IF NOT EXISTS bloom;CREATE TABLE IF NOT EXISTS t_bitwise (
id BIGINT
,c_int INTEGER
,c_bit BIT(31)
,c_int_arr _int4
,c_bool_0 BOOLEAN ,c_bool_1 BOOLEAN ,c_bool_2 BOOLEAN ,c_bool_3 BOOLEAN ,c_bool_4 BOOLEAN
,c_bool_5 BOOLEAN ,c_bool_6 BOOLEAN ,c_bool_7 BOOLEAN ,c_bool_8 BOOLEAN ,c_bool_9 BOOLEAN
,c_bool_10 BOOLEAN ,c_bool_11 BOOLEAN ,c_bool_12 BOOLEAN ,c_bool_13 BOOLEAN ,c_bool_14 BOOLEAN
,c_bool_15 BOOLEAN ,c_bool_16 BOOLEAN ,c_bool_17 BOOLEAN ,c_bool_18 BOOLEAN ,c_bool_19 BOOLEAN
,c_bool_20 BOOLEAN ,c_bool_21 BOOLEAN ,c_bool_22 BOOLEAN ,c_bool_23 BOOLEAN ,c_bool_24 BOOLEAN
,c_bool_25 BOOLEAN ,c_bool_26 BOOLEAN ,c_bool_27 BOOLEAN ,c_bool_28 BOOLEAN ,c_bool_29 BOOLEAN
,c_bool_30 BOOLEAN
,PRIMARY KEY (id)
);CREATE INDEX IF NOT EXISTS idx_btree_on_bool_1 ON t_bitwise (c_bool_1);
CREATE INDEX IF NOT EXISTS idx_btree_on_bool_2 ON t_bitwise (c_bool_2);
CREATE INDEX IF NOT EXISTS idx_btree_on_bool_3 ON t_bitwise (c_bool_3);
CREATE INDEX IF NOT EXISTS idx_btree_on_bool_4 ON t_bitwise (c_bool_4);
CREATE INDEX IF NOT EXISTS idx_btree_on_bool_5 ON t_bitwise (c_bool_5);
CREATE INDEX IF NOT EXISTS idx_btree_on_bool_6 ON t_bitwise (c_bool_6);
CREATE INDEX IF NOT EXISTS idx_btree_on_bool_7 ON t_bitwise (c_bool_7);
CREATE INDEX IF NOT EXISTS idx_btree_on_bool_8 ON t_bitwise (c_bool_8);
CREATE INDEX IF NOT EXISTS idx_btree_on_bool_9 ON t_bitwise (c_bool_9);
CREATE INDEX IF NOT EXISTS idx_btree_on_bool_10 ON t_bitwise (c_bool_10);
CREATE INDEX IF NOT EXISTS idx_btree_on_bool_11 ON t_bitwise (c_bool_11);
CREATE INDEX IF NOT EXISTS idx_btree_on_bool_12 ON t_bitwise (c_bool_12);
CREATE INDEX IF NOT EXISTS idx_btree_on_bool_13 ON t_bitwise (c_bool_13);
CREATE INDEX IF NOT EXISTS idx_btree_on_bool_14 ON t_bitwise (c_bool_14);
CREATE INDEX IF NOT EXISTS idx_btree_on_bool_15 ON t_bitwise (c_bool_15);
CREATE INDEX IF NOT EXISTS idx_btree_on_bool_16 ON t_bitwise (c_bool_16);
CREATE INDEX IF NOT EXISTS idx_btree_on_bool_17 ON t_bitwise (c_bool_17);
CREATE INDEX IF NOT EXISTS idx_btree_on_bool_18 ON t_bitwise (c_bool_18);
CREATE INDEX IF NOT EXISTS idx_btree_on_bool_19 ON t_bitwise (c_bool_19);
CREATE INDEX IF NOT EXISTS idx_btree_on_bool_20 ON t_bitwise (c_bool_20);
CREATE INDEX IF NOT EXISTS idx_btree_on_bool_21 ON t_bitwise (c_bool_21);
CREATE INDEX IF NOT EXISTS idx_btree_on_bool_22 ON t_bitwise (c_bool_22);
CREATE INDEX IF NOT EXISTS idx_btree_on_bool_23 ON t_bitwise (c_bool_23);
CREATE INDEX IF NOT EXISTS idx_btree_on_bool_24 ON t_bitwise (c_bool_24);
CREATE INDEX IF NOT EXISTS idx_btree_on_bool_25 ON t_bitwise (c_bool_25);
CREATE INDEX IF NOT EXISTS idx_btree_on_bool_26 ON t_bitwise (c_bool_26);
CREATE INDEX IF NOT EXISTS idx_btree_on_bool_27 ON t_bitwise (c_bool_27);
CREATE INDEX IF NOT EXISTS idx_btree_on_bool_28 ON t_bitwise (c_bool_28);
CREATE INDEX IF NOT EXISTS idx_btree_on_bool_29 ON t_bitwise (c_bool_29);
CREATE INDEX IF NOT EXISTS idx_btree_on_bool_30 ON t_bitwise (c_bool_30);
CREATE INDEX IF NOT EXISTS idx_btree_on_bool_31 ON t_bitwise (c_bool_31);
CREATE INDEX IF NOT EXISTS idx_btree_on_bool_32 ON t_bitwise (c_bool_32);CREATE INDEX IF NOT EXISTS idx_btree_on_composite_bool ON t_bitwise (
c_bool_32 ,c_bool_31
,c_bool_30 ,c_bool_29 ,c_bool_28 ,c_bool_27 ,c_bool_26 ,c_bool_25 ,c_bool_24 ,c_bool_23 ,c_bool_22 ,c_bool_21
,c_bool_20 ,c_bool_19 ,c_bool_18 ,c_bool_17 ,c_bool_16 ,c_bool_15 ,c_bool_14 ,c_bool_13 ,c_bool_12 ,c_bool_11
,c_bool_10 ,c_bool_9 ,c_bool_8 ,c_bool_7 ,c_bool_6 ,c_bool_5 ,c_bool_4 ,c_bool_3 ,c_bool_2 ,c_bool_1
)
;CREATE INDEX IF NOT EXISTS idx_bloom_on_bool ON t_bitwise
USING bloom (
CAST(c_bool_1 AS INTEGER) ,CAST(c_bool_2 AS INTEGER) ,CAST(c_bool_3 AS INTEGER) ,CAST(c_bool_4 AS INTEGER) ,CAST(c_bool_5 AS INTEGER)
,CAST(c_bool_6 AS INTEGER) ,CAST(c_bool_7 AS INTEGER) ,CAST(c_bool_8 AS INTEGER) ,CAST(c_bool_9 AS INTEGER) ,CAST(c_bool_10 AS INTEGER)
,CAST(c_bool_11 AS INTEGER) ,CAST(c_bool_12 AS INTEGER) ,CAST(c_bool_13 AS INTEGER) ,CAST(c_bool_14 AS INTEGER) ,CAST(c_bool_15 AS INTEGER)
,CAST(c_bool_16 AS INTEGER) ,CAST(c_bool_17 AS INTEGER) ,CAST(c_bool_18 AS INTEGER) ,CAST(c_bool_19 AS INTEGER) ,CAST(c_bool_20 AS INTEGER)
,CAST(c_bool_21 AS INTEGER) ,CAST(c_bool_22 AS INTEGER) ,CAST(c_bool_23 AS INTEGER) ,CAST(c_bool_24 AS INTEGER) ,CAST(c_bool_25 AS INTEGER)
,CAST(c_bool_26 AS INTEGER) ,CAST(c_bool_27 AS INTEGER) ,CAST(c_bool_28 AS INTEGER) ,CAST(c_bool_29 AS INTEGER) ,CAST(c_bool_30 AS INTEGER)
,CAST(c_bool_31 AS INTEGER) ,CAST(c_bool_32 AS INTEGER)
) WITH (length=128)
;Context
StackExchange Database Administrators Q#198711, answer score: 5
Revisions (0)
No revisions yet.