HiveBrain v1.2.0
Get Started
← Back to all entries
snippetMinor

How to speed up process of finding duplicates/similar items in a large amount of strings?

Submitted by: @import:stackexchange-cs··
0
Viewed 0 times
howprocessamountstringsitemslargefindingsimilarspeedduplicates

Problem

Our software receives documents (in the order of tens of thousands) from various providers, each document flows through a number of steps, one of those steps finds duplicates and similar documents (within 80% threshold) to this document.

We implemented Levenshtein distance algorithm and we limit amount of documents current document has to look over (via a set of hash codes), but this solution is still horribly slow (since it's M*N problem) even after we launch it on multiple cores.

Are there any suggestions/techniques we can use to speed up the processing? The requirement for it is to work in near real-time (I guess we can accept up to 10-20 minutes gap). The only idea I have is to split exact duplicates and similar docs processing in two steps and use something like HyperLogLog for duplicates.

The texts are from social media (tweets, blog posts, etc.) so it's reasonable to assume that duplicates share are far greater than that of a similar docs.

I apologize if my usage of terminology isn't very accurate.

Solution

Edit distance is definitely not the way to proceed. The standard approach toward near-identical document deduplication is to compute hashes of shingles. Here's one way to do it:

  • Compute a set of k-shingles (say 30 char shingles) for each document



  • Compute min-hashes for the documents



  • Use locally-sensitive hashing to look-up similar documents when inserting into the database



Here's a great article that explains how this can be done: http://deduplication.tumblr.com/

And you can pick up an existing tool do this, for example Solr: https://wiki.apache.org/solr/Deduplication

Context

StackExchange Computer Science Q#47794, answer score: 6

Revisions (0)

No revisions yet.