gotchaMinor
Computing set difference between two large sets
Viewed 0 times
differencetwobetweenlargesetssetcomputing
Problem
I have two large sets of integers $A$ and $B$. Each set has about a million entries, and each entry is a positive integer that is at most 10 digits long.
What is the best algorithm to compute $A\setminus B$ and $B\setminus A$? In other words, how can I efficiently compute the list of entries of $A$ that are not in $B$ and vice versa? What would be the best data structure to represent these two sets, to make these operations efficient?
The best approach I can come up with is storing these two sets as sorted lists, and compare every element of $A$ against every element of $B$, in a linear fashion. Can we do better?
What is the best algorithm to compute $A\setminus B$ and $B\setminus A$? In other words, how can I efficiently compute the list of entries of $A$ that are not in $B$ and vice versa? What would be the best data structure to represent these two sets, to make these operations efficient?
The best approach I can come up with is storing these two sets as sorted lists, and compare every element of $A$ against every element of $B$, in a linear fashion. Can we do better?
Solution
If you are willing to store the sets in a specialized data-structure, then you can possibly get some interesting complexities.
Let $I=\mathcal O\left(\min\left(|A|,|B|,|A\Delta B|\right)\right)$
Then you can do set operations $A\cup B, A\cap B,A\setminus B$ and $A\Delta B$, each in $\mathcal O\left(I\cdot\log\frac{|A|+|B|}{I}\right)$ expected time. So essentially, you get the minimum size of the two sets, or, the size of the symmetric difference, whichever is less. This is better than linear, if the symmetric difference is small; ie. if they have a large intersection. In fact, for the two set-difference operations you want, this is practically output-sensitive, since together they make up the size of the symmetric difference.
See Confluently Persistent Sets and Maps by Olle Liljenzin (2013) for more information.
Let $I=\mathcal O\left(\min\left(|A|,|B|,|A\Delta B|\right)\right)$
Then you can do set operations $A\cup B, A\cap B,A\setminus B$ and $A\Delta B$, each in $\mathcal O\left(I\cdot\log\frac{|A|+|B|}{I}\right)$ expected time. So essentially, you get the minimum size of the two sets, or, the size of the symmetric difference, whichever is less. This is better than linear, if the symmetric difference is small; ie. if they have a large intersection. In fact, for the two set-difference operations you want, this is practically output-sensitive, since together they make up the size of the symmetric difference.
See Confluently Persistent Sets and Maps by Olle Liljenzin (2013) for more information.
Context
StackExchange Computer Science Q#17984, answer score: 9
Revisions (0)
No revisions yet.