HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlModerate

Fast hamming distance queries in postgres

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
fastpostgresdistancequerieshamming

Problem

I have a large database (16M rows) containing perceptual hashes of images.

I'd like to be able to search for rows by hamming distance in a reasonable timeframe.

Currently, as far as I properly understand the issue, I think the best option here would be a custom SP-GiST implementation that implements a BK-Tree, but that seems like a lot of work, and I'm still fuzzy on the practical details of properly implementing a custom index. Calculating the hamming distance is tractable enough, and I do know C, though.

Basically, what is the appropriate approach here? I need to be able to query for matches within a certain edit-distance of a hash. As I understand it, Levenshtein distance with strings of equal length is functionally hamming distance, so there is at least some existing support for what I want, though no clear way to create an index from it (remember, the value I'm querying for changes. I cannot pre-compute the distance from a fixed value, since that would only be useful for that one value).

The hashes are currently stored as a 64-char string containing the binary ASCII encoding of the hash (e.g. "10010101..."), but I can convert them to int64 easily enough. The real issue is I need to be able to query relatively fast.

It seems like it could be possible to achieve something along the lines of what I want with the pg_trgm, but I'm a bit unclear on how the trigram matching mechamism works (in particular, what does the similarity metric it returns actually represent? It looks kind of like edit-distance).

Insert performance is not critical (it's very computationally expensive to calculate the hashes for each row), so I primarily care about searching.

Solution

MOAR ANSWERS!

Ok, I've finally taken the time to write a custom PostgreSQL indexing extension. I used the SP-GiST interface.

This was fairly challenging, mostly because Posgres is big.

Anyways, as usual, it's up on github here.

Performance-wise, it's currently ~2-3 times slower then the pure-in-memory implementation in my other answer to this question, but it's so much more convenient to use I'll happily eat that performance hit (realistically, it's ~50 ms/query - 150 ms/query, which is still pretty small).

Context

StackExchange Database Administrators Q#72134, answer score: 15

Revisions (0)

No revisions yet.