patternMinor
Set Similarity - Calculate Jaccard index without quadratic complexity
Viewed 0 times
withoutjaccardcalculatecomplexityindexsimilaritysetquadratic
Problem
I have a group of n sets for which I need to calculate a sort of "uniqueness" or "similarity" value. I've settled on the Jaccard index as a suitable metric. Unfortunately, the Jaccard index only operates on two sets at a time. In order to calculate the similarity between all $n$ sets, it will require in the order of $n^2$ Jaccard calculations.
(If it helps, $n$ is usually between 10 and 10000, and each set contains on average 500 elements. Also, in the end, I don't care how similar any two specific sets are - rather, I only care what the internal similarity of the whole group of sets is. (In other words, the mean (or at least a sufficiently accurate approximation of the mean) of all Jaccard indexes in the group))
Two questions:
(If it helps, $n$ is usually between 10 and 10000, and each set contains on average 500 elements. Also, in the end, I don't care how similar any two specific sets are - rather, I only care what the internal similarity of the whole group of sets is. (In other words, the mean (or at least a sufficiently accurate approximation of the mean) of all Jaccard indexes in the group))
Two questions:
- Is there a way to still use the Jaccard index without the $n^2$ complexity?
- Is there a better way to calculate set similarity/uniqueness across a group of sets than the way I've suggested above?
Solution
An option would be to use the Signature Scheme of [1], size-based filtering: a scheme which uses size information to reduce the number of set pairs that need to be considered.
They also experiment with a weighted form; where weights are IDF-based.
[1] Arasu, Arvind, Venkatesh Ganti, and Raghav Kaushik. “Efficient Exact Set-similarity Joins.” In Proceedings of the 32nd International Conference on Very Large Data Bases, 918–929. VLDB ’06. VLDB Endowment, 2006
They also experiment with a weighted form; where weights are IDF-based.
[1] Arasu, Arvind, Venkatesh Ganti, and Raghav Kaushik. “Efficient Exact Set-similarity Joins.” In Proceedings of the 32nd International Conference on Very Large Data Bases, 918–929. VLDB ’06. VLDB Endowment, 2006
Context
StackExchange Computer Science Q#6526, answer score: 4
Revisions (0)
No revisions yet.