Index for large bit strings
Problem
I'm trying to extend PostgreSQL to index bit strings of up to 1000 bits. (These bit strings are created by quantization of high-dimensional vectors, with up to 4 bits assigned per dimension.) Insertions are rather infrequent, whereas search is by far the most common operation. In a search, I would like to get all the rows that exactly match the bit string.
It looks like a perfect job for GIN (in combination with my own data type), or what do you think?
Solution
"In a search, I would like to get all the rows that exactly match the bit string."
Use a B-Tree index, the default type. I don't see a case for a GIN index here.
Up to 1000 bits result in up to 133 bytes (or slightly more) storage size on disk for a bit varying value:

```sql
SELECT pg_column_size(repeat('1', 1000)::varbit);  -- 133
```

Not that much. A plain B-Tree index should do. But maybe the column is big enough that the following tricks improve performance.
If a small part of the bitstring column is distinctive enough to narrow your search down to a few hits, an index on an expression might give you better performance, because the smaller index can fit into RAM and is faster to process all around. Don't bother for small tables, where the overhead would eat the benefit. But it could make a big difference for big tables.
Example
Given table:

```sql
CREATE TABLE tbl(id serial PRIMARY KEY, b_col varbit);
```

If the first 10 bits are enough to narrow down a search to a few hits, you could create an index on the expression b_col::bit(10). Casting to bit(n) truncates the bit string to n bits.

```sql
CREATE INDEX tbl_b_col10_idx ON tbl ((b_col::bit(10)));
```

Extra parentheses are required for the cast operator in an index definition. See:
- How to create an index on an integer json property in postgres
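To experiment with index size and query plans, you could load some random test data first. This is an untested sketch; the 10,000 rows and 1000 bits per row are arbitrary numbers, not from the original question:

```sql
-- Hypothetical test data: 10,000 rows of 1000 random bits each
INSERT INTO tbl (b_col)
SELECT s.bits::varbit
FROM   generate_series(1, 10000) AS i
CROSS  JOIN LATERAL (
   SELECT string_agg((random() < 0.5)::int::text, '') AS bits
   FROM   generate_series(1, 1000) AS j
   WHERE  i > 0  -- reference i so the subquery runs once per outer row
   ) AS s;
```

After `ANALYZE tbl;`, EXPLAIN on a query filtering on b_col::bit(10) should show the expression index being used.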
Then, instead of the query:

```sql
SELECT * FROM tbl WHERE b_col = '1111011110111101'::varbit;  -- 16 bits
```

you would use:

```sql
SELECT *
FROM   tbl
WHERE  b_col::bit(10) = '1111011110111101'::bit(10)  -- utilize index
AND    b_col = '1111011110111101'::varbit;           -- filter to exact match
```

Be aware that shorter values are padded with 0's to the right (least significant bits) when cast to bit(n). In a real-world application, this starts to make sense with several hundred bits. Test for the turning point.
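To illustrate the padding: a value shorter than the target length is zero-filled on the right when cast to bit(n), which is why the exact comparison on b_col is kept as a second filter:

```sql
SELECT B'101'::bit(10);  -- 1010000000 (zero-padded on the right)
```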
Optimize further
Since most installations operate with a MAXALIGN of 8 bytes (64-bit OS; more details here), your index size is the same for any data not exceeding 8 bytes. Effectively, per row:

- 4 bytes item identifier
- 8 bytes for the index tuple header (or 23 + 1 bytes for heap tuples)
- ? actual space for data
- ? padding to the nearest multiple of 8 bytes
Plus some minor overhead per page and index / table. Details in the manual or in this related answer on Stackoverflow.
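As a worked example of that arithmetic, assuming 8-byte MAXALIGN: a bigint key fills one alignment slot exactly, so an index row costs 4 + 8 + 8 = 20 bytes; a 10-bit varbit key (5 bytes overhead + 2 bytes data = 7 bytes, padded to 8) ends up at the same 20 bytes per row:

```sql
SELECT pg_column_size(0::bigint);  -- 8: exactly one 8-byte MAXALIGN slot
```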
Therefore, you should be able to optimize the above approach further: take the first 64 bits (or the last, or whatever is most distinctive and works for you), cast them to bigint, and build an index on this expression:

```sql
CREATE INDEX tbl_b_col64_idx ON tbl ((b_col::bit(64)::bigint));
```

I cast twice (b_col::bit(64)::bigint) because there is no cast defined between varbit and bigint. Details in this related answer on SO:

- Convert hex in text representation to decimal number
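One detail worth knowing: bigint is signed, so any bit string whose leading bit is set maps to a negative number. That is harmless here, since only equality of the mapped values matters, not their sign:

```sql
SELECT B'11111111'::bit(64)::bigint;  -- -72057594037927936 (8 set bits, zero-padded to 64)
SELECT B'1'::bit(64)::bigint;         -- -9223372036854775808 (only the sign bit set)
```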
Effectively, this is just a very fast and simple hash function, where the hash value also allows looking up ranges of values. Depending on exact requirements, you could go one step further and use any IMMUTABLE hash function - like md5(). See the link above.

The query to go along with that:
```sql
SELECT *
FROM   tbl
WHERE  b_col::bit(64)::bigint = '1111011110111101'::bit(64)::bigint  -- utilize index
AND    b_col = '1111011110111101'::varbit;  -- narrow down to exact match
```

The resulting index should be just as big as the one in the first example, but queries should be considerably faster for three reasons:
- The index typically returns much fewer hits (64 bits of information vs. 10 bits).
- Postgres can work with integer arithmetic, which should be faster, even for a plain = operation. (Didn't test to verify that.)
- An integer type like bigint has none of the 5 or 8 bytes of overhead that varbit carries. (In my installation: 5 bytes for up to 960 bits, 8 bytes for more.)

Effectively, to keep the index at its minimum size, you can only pack 24 bits into a varbit index - compared to 64 bits of information in a bigint index.

CLUSTER

If a typical search matches many rows, CLUSTER should improve performance:

```sql
CLUSTER tbl USING tbl_b_col10_idx;
```

The command rewrites the table and its indexes and holds an exclusive lock until finished, which can take a while for big tables. The effect deteriorates over time with write operations to the table. Repeat at intervals of your design. Be sure to read the manual on CLUSTER. Or consider community tools like pg_repack or pg_squeeze. See:

- Configuring PostgreSQL for read performance
If the first 64 bits of your values are unique most of the time, CLUSTER will barely help, since the index scan returns a single row in most cases. If not, CLUSTER can help a lot. Consequently, the effect will be far greater for the first example with the less optimized index.
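To judge which situation you are in before running CLUSTER, you can compare the row count with the number of distinct 64-bit prefixes (table and column names as in the example above):

```sql
SELECT count(*)                               AS total_rows
     , count(DISTINCT b_col::bit(64)::bigint) AS distinct_prefixes
FROM   tbl;
```

If the two numbers are close, most index lookups return a single row and CLUSTER gains little.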
Context
StackExchange Database Administrators Q#33785, answer score: 20