patternMinor

Set Similarity - Calculate Jaccard index without quadratic complexity

Submitted by: @import:stackexchange-cs·Mar 10, 2026·

Viewed 0 times

time-complexity cs stackoverflow algorithms

withoutjaccardcalculatecomplexityindexsimilaritysetquadratic

Problem

I have a group of n sets for which I need to calculate a sort of "uniqueness" or "similarity" value. I've settled on the Jaccard index as a suitable metric. Unfortunately, the Jaccard index only operates on two sets at a time. In order to calculate the similarity between all $n$ sets, it will require in the order of $n^2$ Jaccard calculations.

(If it helps, $n$ is usually between 10 and 10000, and each set contains on average 500 elements. Also, in the end, I don't care how similar any two specific sets are - rather, I only care what the internal similarity of the whole group of sets is. (In other words, the mean (or at least a sufficiently accurate approximation of the mean) of all Jaccard indexes in the group))

Two questions:

Is there a way to still use the Jaccard index without the $n^2$ complexity?

Is there a better way to calculate set similarity/uniqueness across a group of sets than the way I've suggested above?

Solution

An option would be to use the Signature Scheme of [1], size-based filtering: a scheme which uses size information to reduce the number of set pairs that need to be considered.

They also experiment with a weighted form; where weights are IDF-based.

[1] Arasu, Arvind, Venkatesh Ganti, and Raghav Kaushik. “Efficient Exact Set-similarity Joins.” In Proceedings of the 32nd International Conference on Very Large Data Bases, 918–929. VLDB ’06. VLDB Endowment, 2006

Context

StackExchange Computer Science Q#6526, answer score: 4

Revisions (0)

No revisions yet.