patternsqlModerate
Fast hamming distance queries in postgres
Viewed 0 times
fastpostgresdistancequerieshamming
Problem
I have a large database (16M rows) containing perceptual hashes of images.
I'd like to be able to search for rows by hamming distance in a reasonable timeframe.
Currently, as far as I properly understand the issue, I think the best option here would be a custom SP-GiST implementation that implements a BK-Tree, but that seems like a lot of work, and I'm still fuzzy on the practical details of properly implementing a custom index. Calculating the hamming distance is tractable enough, and I do know C, though.
Basically, what is the appropriate approach here? I need to be able to query for matches within a certain edit-distance of a hash. As I understand it, Levenshtein distance with strings of equal length is functionally hamming distance, so there is at least some existing support for what I want, though no clear way to create an index from it (remember, the value I'm querying for changes. I cannot pre-compute the distance from a fixed value, since that would only be useful for that one value).
The hashes are currently stored as a 64-char string containing the binary ASCII encoding of the hash (e.g. "10010101..."), but I can convert them to int64 easily enough. The real issue is I need to be able to query relatively fast.
It seems like it could be possible to achieve something along the lines of what I want with the
Insert performance is not critical (it's very computationally expensive to calculate the hashes for each row), so I primarily care about searching.
I'd like to be able to search for rows by hamming distance in a reasonable timeframe.
Currently, as far as I properly understand the issue, I think the best option here would be a custom SP-GiST implementation that implements a BK-Tree, but that seems like a lot of work, and I'm still fuzzy on the practical details of properly implementing a custom index. Calculating the hamming distance is tractable enough, and I do know C, though.
Basically, what is the appropriate approach here? I need to be able to query for matches within a certain edit-distance of a hash. As I understand it, Levenshtein distance with strings of equal length is functionally hamming distance, so there is at least some existing support for what I want, though no clear way to create an index from it (remember, the value I'm querying for changes. I cannot pre-compute the distance from a fixed value, since that would only be useful for that one value).
The hashes are currently stored as a 64-char string containing the binary ASCII encoding of the hash (e.g. "10010101..."), but I can convert them to int64 easily enough. The real issue is I need to be able to query relatively fast.
It seems like it could be possible to achieve something along the lines of what I want with the
pg_trgm, but I'm a bit unclear on how the trigram matching mechamism works (in particular, what does the similarity metric it returns actually represent? It looks kind of like edit-distance). Insert performance is not critical (it's very computationally expensive to calculate the hashes for each row), so I primarily care about searching.
Solution
MOAR ANSWERS!
Ok, I've finally taken the time to write a custom PostgreSQL indexing extension. I used the SP-GiST interface.
This was fairly challenging, mostly because Posgres is big.
Anyways, as usual, it's up on github here.
Performance-wise, it's currently ~2-3 times slower then the pure-in-memory implementation in my other answer to this question, but it's so much more convenient to use I'll happily eat that performance hit (realistically, it's ~50 ms/query - 150 ms/query, which is still pretty small).
Ok, I've finally taken the time to write a custom PostgreSQL indexing extension. I used the SP-GiST interface.
This was fairly challenging, mostly because Posgres is big.
Anyways, as usual, it's up on github here.
Performance-wise, it's currently ~2-3 times slower then the pure-in-memory implementation in my other answer to this question, but it's so much more convenient to use I'll happily eat that performance hit (realistically, it's ~50 ms/query - 150 ms/query, which is still pretty small).
Context
StackExchange Database Administrators Q#72134, answer score: 15
Revisions (0)
No revisions yet.