patternsqlMinor
PostgreSQL, integer arrays, index for equality
Viewed 0 times
postgresqlarraysequalityforindexinteger
Problem
I have a huge list of integer arrays (300,000,000 records) stored in Postgres 9.2 DB. I want to efficiently search these records for an exact match (equality only). I have heard of the intarray module and the corresponding gist-gin indexes. I would like to ask the following questions:
- Does PostgreSQL uses a hash function for checking equality of integer arrays or does it perform a brute-force algorithm comparing one-by-one the elements of the array?
- If PostgreSQL uses a hash function, is there some PostgreSQL function code to actually get the hash function result for a particular array?
- Which index will be best for such a task? B-tree, or the gist - gin indexes provided by the intarray module? The dataset will be static, i.e, once all records are inserted there will no more insertions. So, building index / updating index time is not important to me.
Solution
Q: Does PostgreSQL uses a hash function for checking equality of integer
arrays or does it perform a brute-force algorithm comparing one-by-one
the elements of the array?
Not according to Array Functions and Operators in the doc:
Array comparisons compare the array contents element-by-element, using
the default B-tree comparison function for the element data type
No mention of hashing.
intarray provides other operators but does not replace the equality operator between
Fortunately implementing fast hash-based search at the SQL level is not hard, and in your case (large arrays, no updates, exact match) it may even be the most effective method.
Steps:
1) choose a hash function. I'd suggest
The function
2) create a functional index:
It will be several orders of magnitude faster than a GIN index for the kind of dataset you have (actually I would worry about the GIN index creation taking a really unreasonable time, but do try it).
3) use it like this:
arrays or does it perform a brute-force algorithm comparing one-by-one
the elements of the array?
Not according to Array Functions and Operators in the doc:
Array comparisons compare the array contents element-by-element, using
the default B-tree comparison function for the element data type
No mention of hashing.
intarray provides other operators but does not replace the equality operator between
int[]. The closest function _int_same() that it exposes is semantically different (the order of elements does not matter) and is implemented as sorting+sequential comparison, not hashing.Fortunately implementing fast hash-based search at the SQL level is not hard, and in your case (large arrays, no updates, exact match) it may even be the most effective method.
Steps:
1) choose a hash function. I'd suggest
md5 on the text representation of the array:create function arr_hash(int[]) returns bytea as
$ select digest($1::text, 'md5');$
language sql immutable;The function
digest(text,text) is part of the pgcrypto extension. Compared to md5 it has the advantage of producing binary (16 bytes) instead of hexadecimal (32 bytes) for a leaner index.2) create a functional index:
create index index_name on table_name(arr_hash(col_name));It will be several orders of magnitude faster than a GIN index for the kind of dataset you have (actually I would worry about the GIN index creation taking a really unreasonable time, but do try it).
3) use it like this:
select 1 from table_name
where arr_hash(col_name)=arr_hash('{10,20,30,...lot of values}'::int[])
and col_name='{10,20,30,...lot of values}'::int[];Code Snippets
create function arr_hash(int[]) returns bytea as
$$ select digest($1::text, 'md5');$$
language sql immutable;create index index_name on table_name(arr_hash(col_name));select 1 from table_name
where arr_hash(col_name)=arr_hash('{10,20,30,...lot of values}'::int[])
and col_name='{10,20,30,...lot of values}'::int[];Context
StackExchange Database Administrators Q#62822, answer score: 7
Revisions (0)
No revisions yet.