patternMinor
Efficient implementation of sets
Viewed 0 times
implementationsetsefficient
Problem
Let U be a pre-determined and fixed universal set (and |U| = n = 2k for some integer k, so the set may be huge).
I create many arbitrary subsets of U in running time (These sets may be or may not be sparse).
I do not need to check whether an element is in a set.
But I need to be able to
What is the most time-efficient data structure for this? It should be also space-efficient, if possible.
I have bit vector in my mind. Elements of U are numbered from 0 to n-1. Each of n bits represents the existence of the corresponding element.
I create many arbitrary subsets of U in running time (These sets may be or may not be sparse).
I do not need to check whether an element is in a set.
But I need to be able to
- check whether a set is empty,
- do the basic set operations difference and intersection, and
- calculate the cardinality of a set.
What is the most time-efficient data structure for this? It should be also space-efficient, if possible.
I have bit vector in my mind. Elements of U are numbered from 0 to n-1. Each of n bits represents the existence of the corresponding element.
- For k ≤ 5 or k ≤ 6 (depends on the computer architecture and the programming language) I can implement these sets as [unsigned] integers.
- But there will be a need for multi-word arithmetics in general case. So I believe, the space complexity of any set is ϴ(n) and time complexities of operations are ϴ(n) using bitwise operations and hamming weight.
Solution
Yes, using a bitvector, the running time will be $\Theta(n)$, i.e., $\Theta(2^k)$, for all operations.
If most sets are small in cardinality, a better option may be to use a sorted list of the elements in the set. In this way, you can compute the intersection or difference of two sets of size $m_1,m_2$ in $\Theta(m_1+m_2)$ time (they work similar to the "merge" operation in mergesort). You can compute the cardinality of a set of size $m$ in $\Theta(m)$ time, and test emptiness in $O(1)$ time. If the size of the sets is small compared to $n$, this might be much faster.
You can make the cardinality operation take $O(1)$ time when using a sorted list by augmenting the data structure: when you construct the list, compute its cardinality and store that in a separate slot along with the list. Now cardinality lookup takes $O(1)$ time. It is easy to keep track of the cardinality of the new set when performing an intersection or differencing operator.
If there is some "pattern" to the set of elements in the set, another option might be to use a binary decision diagram (BDD). You think of each element of $U$ as a $k$-bit vector, and think of a set as a set of $k$-bit vectors. Then, you can use a BDD to represent that set. Alternatively, we can associate to any set $S$ of a function $f:\{0,1\}^n \to \{0,1\}$ so that $f(x)=1$ if $x \in S$; then represent $f$ as a BDD (this is equivalent, and just a different way of thinking about the same thing). If there is sufficient "structure" in the elements of the set, this might allow a more concise representation of the sets and more efficient operations. The running time to compute the intersection or difference of two BDDs is related to the sizes of those BDDs. In many contexts BDDs might be ineffective (they add implementation complexity with no improvements in performance), but in some special cases they might be useful.
If most sets are small in cardinality, a better option may be to use a sorted list of the elements in the set. In this way, you can compute the intersection or difference of two sets of size $m_1,m_2$ in $\Theta(m_1+m_2)$ time (they work similar to the "merge" operation in mergesort). You can compute the cardinality of a set of size $m$ in $\Theta(m)$ time, and test emptiness in $O(1)$ time. If the size of the sets is small compared to $n$, this might be much faster.
You can make the cardinality operation take $O(1)$ time when using a sorted list by augmenting the data structure: when you construct the list, compute its cardinality and store that in a separate slot along with the list. Now cardinality lookup takes $O(1)$ time. It is easy to keep track of the cardinality of the new set when performing an intersection or differencing operator.
If there is some "pattern" to the set of elements in the set, another option might be to use a binary decision diagram (BDD). You think of each element of $U$ as a $k$-bit vector, and think of a set as a set of $k$-bit vectors. Then, you can use a BDD to represent that set. Alternatively, we can associate to any set $S$ of a function $f:\{0,1\}^n \to \{0,1\}$ so that $f(x)=1$ if $x \in S$; then represent $f$ as a BDD (this is equivalent, and just a different way of thinking about the same thing). If there is sufficient "structure" in the elements of the set, this might allow a more concise representation of the sets and more efficient operations. The running time to compute the intersection or difference of two BDDs is related to the sizes of those BDDs. In many contexts BDDs might be ineffective (they add implementation complexity with no improvements in performance), but in some special cases they might be useful.
Context
StackExchange Computer Science Q#77123, answer score: 4
Revisions (0)
No revisions yet.