patternModerate
Efficient map data structure supporting approximate lookup
Viewed 0 times
mapapproximateefficientstructuresupportinglookupdata
Problem
I'm looking for a data structure that supports efficient approximate lookups of keys (e.g., Levenshtein distance for strings), returning the closest possible match for the input key. The best suited data structure I've found so far are Burkhard-Keller trees, but I was wondering if there are other/better data structures for this purpose.
Edit:
Some more details of my specific case:
Edit:
Some more details of my specific case:
- Strings usually have a fairly large Levenshtein difference from each other.
- Strings have a max length of around 20-30 chars, with an average closer to 10-12.
- I'm more interested in efficient lookup than insertion as I will be building a set of mostly static data that I want to query efficiently.
Solution
What you are looking for is "approximate near neighbor search" (ANNS) in the Levenshtein/edit distance. From a theoretical perspective, edit distance has so far turned out to be relatively hard for near-neighbor searches, afaik. Still, there are many results, see the references in this Ostrovsky and Rabani paper. If you are willing to consider alternative distance metrics for which there are simpler and better solutions, move on to the next paragraph. For ANNS in edit distance, there is a result due to Indyk, who shows how to build a data structure of size $n^{O(\sqrt{d})}$ that answers any query in time $O(d)$ and reports a string which is at most three times further than the string nearest to the query string (this generalizes to size $n^{O(d^\epsilon)}$ and approximation $3^{1/\epsilon}$). Here $n$ is the number of strings and $d$ is the maximum length of any string. The Ostrovsky and Rabani paper I linked above improve this result by mapping strings to vectors so that the $\ell_1$-distance (a kind of natural geometric distance similar to the euclidean distance) between vectors approximates the edit distance between the corresponding strings (this is called a "low-distortion embedding"). Once this is done, an ANNS data structure for $\ell_1$ can be used, and these turn out to be more efficient (see next paragraph).
If you are willing to consider other distances, then locality sensitive hashing (LSH) does a great job. Locality sensitive hashing is a technique pioneered by Indyk and Motwani for solving the ANNS problem, where points that live in a high-dimensional space (read long vectors, long strings, etc.) are hashed into a small number of buckets so that points that are near each other are mapped to the same bin with good probability and points that are far from each other are mapped to different bins, also with good probability. There is a great and very accessible survey article by Indyk and Andoni in CACM. This technique is simple and fast, and has small space requirements; there is code out there too (I think the article links to the code). It works well for things like Hamming distance (and in certain regimes $\ell_1$ distance) and Euclidean distance, cosine distance. Also, Muthu and Sahinalp design LSH schemes for a very natural generalization of edit distance, the block edit distance (where a some edit operations can operate on a block of symbols).
This kind of question is a good fit for cstheory.SE. There is a related question there, but it seems to ask for exact near neighbor.
If you are willing to consider other distances, then locality sensitive hashing (LSH) does a great job. Locality sensitive hashing is a technique pioneered by Indyk and Motwani for solving the ANNS problem, where points that live in a high-dimensional space (read long vectors, long strings, etc.) are hashed into a small number of buckets so that points that are near each other are mapped to the same bin with good probability and points that are far from each other are mapped to different bins, also with good probability. There is a great and very accessible survey article by Indyk and Andoni in CACM. This technique is simple and fast, and has small space requirements; there is code out there too (I think the article links to the code). It works well for things like Hamming distance (and in certain regimes $\ell_1$ distance) and Euclidean distance, cosine distance. Also, Muthu and Sahinalp design LSH schemes for a very natural generalization of edit distance, the block edit distance (where a some edit operations can operate on a block of symbols).
This kind of question is a good fit for cstheory.SE. There is a related question there, but it seems to ask for exact near neighbor.
Context
StackExchange Computer Science Q#2093, answer score: 19
Revisions (0)
No revisions yet.