Find closest database values across several attributes without O(n) traversal
Problem
I have a use case where I have access to a large database with many documents (millions+). I want to build functionality such that, when I want to add a new document to that database, I first want to scan the database for potential duplicate documents. Documents may differ in file type and other information (i.e. one is a digital copy, while another is a hard copy that was scanned in via a printer) while still materially being the same document, so I unfortunately can't rely on metadata or similar.
In order to determine if a document is a duplicate, I plan to use some weighted combination of NLP (via something like spaCy) to compare document similarity, comparison of word distributions (both the simple word distribution and something like TF-IDF), and other relevant metrics.
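To make the similarity side concrete, here is a minimal sketch of TF-IDF vectors plus cosine similarity in plain Python. The tokenizer (whitespace split) and the smoothed IDF formula are stand-ins for whatever spaCy/TF-IDF pipeline you settle on; the point is only that near-duplicates score higher than unrelated documents.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Naive TF-IDF over tokenised documents (lists of words)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))  # document frequency
    return [
        {term: (count / len(doc)) * math.log(1 + n / df[term])  # smoothed IDF
         for term, count in Counter(doc).items()}
        for doc in docs
    ]

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

docs = ["the cat sat on the mat".split(),
        "the cat sat on a mat".split(),
        "completely unrelated text here".split()]
v = tfidf_vectors(docs)
# near-duplicates (docs 0 and 1) score higher than unrelated pairs (docs 0 and 2)
print(cosine(v[0], v[1]), cosine(v[0], v[2]))
```

In practice you would combine this score with the other weighted metrics mentioned above before thresholding.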
In order to actually find potential duplicates for a newly uploaded document, I can't think of any way to avoid scanning every document in the database, comparing one-by-one, and tracking the one with metrics that matched most closely.
Thoughts I've had for optimizing this:
- I know indexing can often be used to speed up search operations, but to my understanding that's only good for searching for a specific value in a specific column. I don't think it's a good fit, as I'm trying to essentially take a weighted average of each metric and report which are closest. I guess I could index every column; could this potentially be worth it, or would the constant reindexing be a huge performance penalty?
- I've been thinking of using a clustering (unsupervised machine learning) model to group similar documents together, then use the clusters it gives me to determine which cluster the new document would fit in and search only there, but I'm getting caught up in the details. I'm sure this would be a practical approach for finding preexisting duplicates in the database, but is this practical to do every time a new document is added to the set (i.e. could it be used to speed up the per-document check)?
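The cluster-then-search idea from the second bullet can be sketched as follows. The feature vectors are synthetic stand-ins for per-document metric vectors, and MiniBatchKMeans is just one possible choice (it can be refreshed incrementally via `partial_fit` rather than reclustered from scratch on each insert):

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
# stand-in for per-document metric vectors (20 measures each)
existing = rng.normal(size=(10_000, 20))

km = MiniBatchKMeans(n_clusters=50, random_state=0, n_init=3).fit(existing)
labels = km.labels_

def candidates(new_doc, k=5):
    """Search only within the new document's predicted cluster."""
    cluster = km.predict(new_doc.reshape(1, -1))[0]
    members = np.flatnonzero(labels == cluster)          # docs in that cluster
    dists = np.linalg.norm(existing[members] - new_doc, axis=1)
    return members[np.argsort(dists)[:k]]                # k closest in-cluster docs

# a near-duplicate of document 123 should be found inside its own cluster
new_doc = existing[123] + rng.normal(scale=0.01, size=20)
print(123 in candidates(new_doc))
```

This cuts the per-insert comparison from all 10,000 documents to roughly one cluster's worth, at the cost of occasionally missing a duplicate that lands just across a cluster boundary.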
Solution
You could define an index on each measure, query each to find the few documents that are closest, then build the intersection of these various results. That turns a size-of-data comparison into (number of measures) x (width of search per measure) index reads, which is likely to be far smaller than the number of rows for a sensible number of measures.
Depending on the DBMS and how good your SQL-fu is, you may even be able to convince the query processor to perform these index reads and then inner join the sub-results to produce the intersection.
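In SQLite terms (a hedged sketch with made-up measure names and synthetic values; the idea carries over to any DBMS): index each measure, take a narrow range scan per measure, and intersect the id sets:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE docs (id INTEGER PRIMARY KEY, tfidf_sim REAL, word_dist REAL);
    CREATE INDEX idx_tfidf ON docs (tfidf_sim);   -- one index per measure
    CREATE INDEX idx_wdist ON docs (word_dist);
""")
conn.executemany("INSERT INTO docs VALUES (?, ?, ?)",
                 [(i, i * 0.01, i * 0.02) for i in range(1000)])

new = {"tfidf_sim": 5.00, "word_dist": 10.00}   # incoming document's measures
eps = 0.05                                      # search width per measure
rows = conn.execute("""
    SELECT id FROM docs WHERE tfidf_sim BETWEEN ? AND ?
    INTERSECT
    SELECT id FROM docs WHERE word_dist BETWEEN ? AND ?
    ORDER BY id
""", (new["tfidf_sim"] - eps, new["tfidf_sim"] + eps,
      new["word_dist"] - eps, new["word_dist"] + eps)).fetchall()
print([r[0] for r in rows])   # candidate ids close on both measures
```

Each branch of the `INTERSECT` is a cheap index range scan, so the cost scales with the width of each search rather than with the table size.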
One big snag with this is outliers. Say we define 20 measures. An incoming document matches an existing document exactly on 19 of the 20 but is out of range on the last. The existing document will never make it into the result even though a human would likely say the two matched. To avoid that you'd have to define a degree of agreement across the index matches, and then you're back to size-of-data operations again.
What's wrong with doing the usual vector-space comparison?
Intel reckons a modern server chip can do a few hundred gigaflops or so. Let's call it 10^11 floating-point operations per second. To compare one document to another on 20 measures using Euclidean distance would need 80 computations. Let's call it 100. That's 10^9 comparisons per second, so comparing against 1M existing documents would take 10^-3 seconds, or 1 millisecond. Let's knock off a few more zeros because data has to move around inside chips. How often will you be adding documents? How many scanners do you have, and how long does OCR take to process a document? Can you live with a duplication check taking a bit less than one second per document?
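The back-of-envelope arithmetic above holds up in practice: a brute-force comparison of one vector against a million 20-dimensional vectors (synthetic data here) is a one-liner in NumPy and typically finishes in tens of milliseconds:

```python
import numpy as np

rng = np.random.default_rng(42)
existing = rng.normal(size=(1_000_000, 20))   # 1M docs x 20 measures
new_doc = existing[777_777].copy()            # plant an exact duplicate

# full-scan Euclidean distance: ~20M multiplies, vectorised
dists = np.linalg.norm(existing - new_doc, axis=1)
best = int(np.argmin(dists))
print(best, dists[best])   # finds the planted duplicate at distance 0.0
```

In other words, at 1M documents the naive scan may already be fast enough, and everything else is an optimisation to apply only if profiling says so.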
Taking the Euclidean approach further, if your DBMS supports a geometry data type you could choose the two or three most selective measures and define a spatial index on those. That should efficiently reduce the search of 1M+ existing documents down to a level where the full vector search is tractable.
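If a geometry type isn't available, the same prefilter can be sketched client-side with a k-d tree over the two or three most selective measures. This uses scipy purely as an illustration, and which columns are "most selective" (columns 0-2 here) is an assumption:

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(7)
all_measures = rng.normal(size=(100_000, 20))   # synthetic 20-measure vectors

# spatial index on the 3 assumed-most-selective measures only
tree = cKDTree(all_measures[:, :3])

new_doc = all_measures[31_337] + 0.001          # near-duplicate of doc 31337
# cheap prefilter: neighbours within radius 0.1 in the 3-D subspace
candidate_ids = tree.query_ball_point(new_doc[:3], r=0.1)

# full 20-D vector comparison only on the survivors
dists = np.linalg.norm(all_measures[candidate_ids] - new_doc, axis=1)
best = candidate_ids[int(np.argmin(dists))]
print(len(candidate_ids), best)   # a handful of candidates, best = 31337
```

The radius plays the same role as the spatial index's bounding-box query: it bounds the candidate set, after which the full vector comparison is cheap.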
Context
StackExchange Database Administrators Q#294460, answer score: 4