patternMinor
Efficiently representing a set of sets
Viewed 0 times
efficientlysetssetrepresenting
Problem
(The question is an extension of this unanswered question on stackoverflow)
If we have a set of strings, we can efficiently represent them with tries. Common branches can also be merged, resulting in a DAG instead of a tree that is even more compact.
However, if we have a set of sets (i.e. the order does not matter), there are a lot of possible tries that represent the same set of elements.
An example can be found in the stackoverflow question I linked above.
Edit: For example, assume that we are given the following sets of integers.
Two possible representations are shown below (trie on the left, DAG on the right)
My questions are:
In the scenario I have in mind there is an additional constraint that no set is a subset of another set.
Any link/paper that is even slightly relevant to any of the questions is helpful.
If we have a set of strings, we can efficiently represent them with tries. Common branches can also be merged, resulting in a DAG instead of a tree that is even more compact.
However, if we have a set of sets (i.e. the order does not matter), there are a lot of possible tries that represent the same set of elements.
An example can be found in the stackoverflow question I linked above.
Edit: For example, assume that we are given the following sets of integers.
{1,2,3,4,5}
{1,2,6,7}
{1,2,4,7}
{1,3,5,7}Two possible representations are shown below (trie on the left, DAG on the right)
My questions are:
- How hard is the problem of finding an optimal (i.e minimal) such trie?
- Are there any fast algorithms for solving this problem?
- If not, are there any fast algorithms that find "good" tries?
- What about the DAG case?
In the scenario I have in mind there is an additional constraint that no set is a subset of another set.
Any link/paper that is even slightly relevant to any of the questions is helpful.
Solution
I suggest you start by looking at BDDs and ZDDs.
Your problem is similar to the problem of finding an optimal BDD.
I'll explain. Suppose we have sets $S_1,\dots,S_k$ over a universe $U$ containing $n$ items (so $S_i \subseteq U$). A BDD is a finite-state automaton (a DAG) for recognizing the language of all of the allowed sets, where each set is presented to the finite-state automaton as the $n$-bit string that is the characteristic function of that set. In other words, given a total order on $U$, each set $S \subseteq U$ corresponds to a binary string $[S] \in \{0,1\}^n$: $[S]_i=1$ if $S$ includes the $i$ element of $U$, otherwise not. (This order is known in the BDD literature as a "variable ordering".) Then define the language $L$ by $L=\{[S_1],[S_2],\dots,[S_k]\}$. A BDD is a finite-state automaton that recognizes the language $L$.
Notice that you have the freedom of what "variable ordering" to choose. The choice of the variable ordering determines how each set is represented as a binary function, which can affect the size of the BDD. It is well-known that the problem of choosing the optimal variable ordering (the one that leads to the smallest BDD) is NP-hard. Nonetheless, there are various heuristics for trying to find a good variable ordering, and in real-world problems, they often work reasonably well.
So, your problem can be framed as the problem of choosing the optimal variable ordering to represent your sets as a BDD.
ZDDs are a variant of BDDs that are better in some circumstances. ZDDs are especially oriented at representing a set of sets, so they might be particularly helpful for your needs. ZDDs are especially handy if you need to represent a bunch of sets that are related to each other in some way; ZDDs provide a compact and efficient encoding for that situation.
Your problem is similar to the problem of finding an optimal BDD.
I'll explain. Suppose we have sets $S_1,\dots,S_k$ over a universe $U$ containing $n$ items (so $S_i \subseteq U$). A BDD is a finite-state automaton (a DAG) for recognizing the language of all of the allowed sets, where each set is presented to the finite-state automaton as the $n$-bit string that is the characteristic function of that set. In other words, given a total order on $U$, each set $S \subseteq U$ corresponds to a binary string $[S] \in \{0,1\}^n$: $[S]_i=1$ if $S$ includes the $i$ element of $U$, otherwise not. (This order is known in the BDD literature as a "variable ordering".) Then define the language $L$ by $L=\{[S_1],[S_2],\dots,[S_k]\}$. A BDD is a finite-state automaton that recognizes the language $L$.
Notice that you have the freedom of what "variable ordering" to choose. The choice of the variable ordering determines how each set is represented as a binary function, which can affect the size of the BDD. It is well-known that the problem of choosing the optimal variable ordering (the one that leads to the smallest BDD) is NP-hard. Nonetheless, there are various heuristics for trying to find a good variable ordering, and in real-world problems, they often work reasonably well.
So, your problem can be framed as the problem of choosing the optimal variable ordering to represent your sets as a BDD.
ZDDs are a variant of BDDs that are better in some circumstances. ZDDs are especially oriented at representing a set of sets, so they might be particularly helpful for your needs. ZDDs are especially handy if you need to represent a bunch of sets that are related to each other in some way; ZDDs provide a compact and efficient encoding for that situation.
Context
StackExchange Computer Science Q#22807, answer score: 3
Revisions (0)
No revisions yet.