HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

PostgreSQL, integer arrays, index for equality

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
postgresqlarraysequalityforindexinteger

Problem

I have a huge list of integer arrays (300,000,000 records) stored in Postgres 9.2 DB. I want to efficiently search these records for an exact match (equality only). I have heard of the intarray module and the corresponding gist-gin indexes. I would like to ask the following questions:

  • Does PostgreSQL uses a hash function for checking equality of integer arrays or does it perform a brute-force algorithm comparing one-by-one the elements of the array?



  • If PostgreSQL uses a hash function, is there some PostgreSQL function code to actually get the hash function result for a particular array?



  • Which index will be best for such a task? B-tree, or the gist - gin indexes provided by the intarray module? The dataset will be static, i.e, once all records are inserted there will no more insertions. So, building index / updating index time is not important to me.

Solution

Q: Does PostgreSQL uses a hash function for checking equality of integer
arrays or does it perform a brute-force algorithm comparing one-by-one
the elements of the array?

Not according to Array Functions and Operators in the doc:


Array comparisons compare the array contents element-by-element, using
the default B-tree comparison function for the element data type

No mention of hashing.

intarray provides other operators but does not replace the equality operator between int[]. The closest function _int_same() that it exposes is semantically different (the order of elements does not matter) and is implemented as sorting+sequential comparison, not hashing.

Fortunately implementing fast hash-based search at the SQL level is not hard, and in your case (large arrays, no updates, exact match) it may even be the most effective method.

Steps:

1) choose a hash function. I'd suggest md5 on the text representation of the array:

create function arr_hash(int[]) returns bytea as
$ select digest($1::text, 'md5');$
language sql immutable;


The function digest(text,text) is part of the pgcrypto extension. Compared to md5 it has the advantage of producing binary (16 bytes) instead of hexadecimal (32 bytes) for a leaner index.

2) create a functional index:

create index index_name on table_name(arr_hash(col_name));


It will be several orders of magnitude faster than a GIN index for the kind of dataset you have (actually I would worry about the GIN index creation taking a really unreasonable time, but do try it).

3) use it like this:

select 1 from table_name
 where arr_hash(col_name)=arr_hash('{10,20,30,...lot of values}'::int[])
 and   col_name='{10,20,30,...lot of values}'::int[];

Code Snippets

create function arr_hash(int[]) returns bytea as
$$ select digest($1::text, 'md5');$$
language sql immutable;
create index index_name on table_name(arr_hash(col_name));
select 1 from table_name
 where arr_hash(col_name)=arr_hash('{10,20,30,...lot of values}'::int[])
 and   col_name='{10,20,30,...lot of values}'::int[];

Context

StackExchange Database Administrators Q#62822, answer score: 7

Revisions (0)

No revisions yet.