Index for large bit strings
Problem
I'm trying to extend PostgreSQL to index bit strings of up to 1000 bits. (These bit strings are created by quantization of high-dimensional vectors, with up to 4 bits assigned per dimension.) Insertions are rather infrequent, whereas search is by far the most common operation. In a search, I would like to get all the rows that exactly match the bit string.
It looks like a perfect job for GIN (in combination with my own data type), or what do you think?
Solution
"In a search, I would like to get all the rows that exactly match the bit string."
Use a B-Tree index, the default type. I don't see a case for a GIN index here.
Up to 1000 bits result in up to 133 bytes (or slightly more) storage size on disk for a bit varying value:

```sql
SELECT pg_column_size(repeat('1', 1000)::varbit);  -- 133
```

Not that much. A plain B-Tree index should do. But maybe the column is big enough that the following tricks improve performance.
If a small part of the bitstring column is distinctive enough to narrow your search down to a few hits, an index on an expression might give you better performance, because the smaller index can fit into RAM and is faster to process all around. Don't bother for small tables, where the overhead would eat the benefit. But it could make a big difference for big tables.
Example
Given table:

```sql
CREATE TABLE tbl(id serial PRIMARY KEY, b_col varbit);
```

If the first 10 bits are enough to narrow down a search to a few hits, you could create an index on the expression b_col::bit(10). Casting to bit(n) truncates the bit string to n bits.

```sql
CREATE INDEX tbl_b_col10_idx ON tbl ((b_col::bit(10)));
```

Extra parentheses are required for the cast operator in an index definition. See:
- How to create an index on an integer json property in postgres
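To experiment with index size and query plans, you could load some random test data first. This is an untested sketch; the 10,000 rows and 1000 bits per row are arbitrary numbers, not from the original question:

```sql
-- Hypothetical test data: 10,000 rows of 1000 random bits each
INSERT INTO tbl (b_col)
SELECT s.bits::varbit
FROM   generate_series(1, 10000) AS i
CROSS  JOIN LATERAL (
   SELECT string_agg((random() < 0.5)::int::text, '') AS bits
   FROM   generate_series(1, 1000) AS j
   WHERE  i > 0  -- reference i so the subquery runs once per outer row
   ) AS s;
```

After `ANALYZE tbl;`, EXPLAIN on a query filtering on b_col::bit(10) should show the expression index being used.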
Then, instead of the query:

```sql
SELECT * FROM tbl WHERE b_col = '1111011110111101'::varbit;  -- 16 bits
```

you would use:

```sql
SELECT *
FROM   tbl
WHERE  b_col::bit(10) = '1111011110111101'::bit(10)  -- utilize index
AND    b_col = '1111011110111101'::varbit;           -- filter to exact match
```

Be aware that shorter values are padded with 0's to the right (least significant bits) when cast to bit(n). In a real-world application, this starts to make sense with several hundred bits. Test for the turning point.
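To illustrate the padding: a value shorter than the target length is zero-filled on the right when cast to bit(n), which is why the exact comparison on b_col is kept as a second filter:

```sql
SELECT B'101'::bit(10);  -- 1010000000 (zero-padded on the right)
```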
Optimize further
Since most installations operate with a MAXALIGN of 8 bytes (64-bit OS; more details here), your index size is the same for any data not exceeding 8 bytes. Effectively, per row:

- 4 bytes item identifier
- 8 bytes for the index tuple header (or 23 + 1 bytes for heap tuples)
- ? actual space for data
- ? padding to the nearest multiple of 8 bytes
Plus some minor overhead per page and index / table. Details in the manual or in this related answer on Stackoverflow.
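As a worked example of that arithmetic, assuming 8-byte MAXALIGN: a bigint key fills one alignment slot exactly, so an index row costs 4 + 8 + 8 = 20 bytes; a 10-bit varbit key (5 bytes overhead + 2 bytes data = 7 bytes, padded to 8) ends up at the same 20 bytes per row:

```sql
SELECT pg_column_size(0::bigint);  -- 8: exactly one 8-byte MAXALIGN slot
```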
Therefore, you should be able to optimize the above approach further: take the first 64 bits (or the last, or whatever is most distinctive and works for you), cast them to bigint, and build an index on this expression:

```sql
CREATE INDEX tbl_b_col64_idx ON tbl ((b_col::bit(64)::bigint));
```

I cast twice (b_col::bit(64)::bigint) because there is no cast defined between varbit and bigint. Details in this related answer on SO:

- Convert hex in text representation to decimal number
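One detail worth knowing: bigint is signed, so any bit string whose leading bit is set maps to a negative number. That is harmless here, since only equality of the mapped values matters, not their sign:

```sql
SELECT B'11111111'::bit(64)::bigint;  -- -72057594037927936 (8 set bits, zero-padded to 64)
SELECT B'1'::bit(64)::bigint;         -- -9223372036854775808 (only the sign bit set)
```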
Effectively, this is just a very fast and simple hash function, where the hash value also allows looking up ranges of values. Depending on exact requirements, you could go one step further and use any IMMUTABLE hash function - like md5(). See the link above.

The query to go along with that:
```sql
SELECT *
FROM   tbl
WHERE  b_col::bit(64)::bigint = '1111011110111101'::bit(64)::bigint  -- utilize index
AND    b_col = '1111011110111101'::varbit;  -- narrow down to exact match
```

The resulting index should be just as big as the one in the first example, but queries should be considerably faster for three reasons:
- The index typically returns much fewer hits (64 bits of information vs. 10 bits).
- Postgres can work with integer arithmetic, which should be faster, even for a plain = operation. (Didn't test to verify that.)
- An integer type like bigint has none of the 5 or 8 bytes of overhead that varbit carries. (In my installation: 5 bytes for up to 960 bits, 8 bytes for more.)

Effectively, to keep the index at its minimum size, you can only pack 24 bits into a varbit index - compared to 64 bits of information in a bigint index.

CLUSTER

If a typical search matches many rows, CLUSTER should improve performance:

```sql
CLUSTER tbl USING tbl_b_col10_idx;
```

The command rewrites the table and its indexes and holds an exclusive lock until finished, which can take a while for big tables. The effect deteriorates over time with write operations to the table. Repeat at intervals of your design. Be sure to read the manual on CLUSTER. Or consider community tools like pg_repack or pg_squeeze. See:

- Configuring PostgreSQL for read performance
If the first 64 bits of your values are unique most of the time, CLUSTER will barely help, since the index scan returns a single row in most cases. If not, CLUSTER can help a lot. Consequently, the effect will be far greater for the first example with the less optimized index.
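To judge which situation you are in before running CLUSTER, you can compare the row count with the number of distinct 64-bit prefixes (table and column names as in the example above):

```sql
SELECT count(*)                               AS total_rows
     , count(DISTINCT b_col::bit(64)::bigint) AS distinct_prefixes
FROM   tbl;
```

If the two numbers are close, most index lookups return a single row and CLUSTER gains little.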
Context
StackExchange Database Administrators Q#33785, answer score: 20