HiveBrain v1.2.0
Get Started
← Back to all entries
snippetMinor

How to get the quick answer to the "Is there AT LEAST ONE subset that contains all given elements?" question if the set of subsets is very large?

Submitted by: @import:stackexchange-cs··
0
Viewed 0 times
answerquickleastelementsallsubsetsonegetthathow

Problem

I have a very large amount of objects that look like this:

DataDict = {
id1: {"a": true, "b": true, "bc": true, "hgf": true},
id2: {"bcwe": true, "nKNNn": true, "mjj": true, "AAt": true},
id3: {"h": true, "a": true, "mjj": true, "ABwAU": true},
id4: {"wvzy": true, "zzba": true, "abc": true, "a": true},
...
}


or just (as a set of sets, note that ids may be excluded)

DataSet = {
{"a", "b", "bc", "hgf"},
{"bcwe", "nKNNn", "mjj", "AAt"},
{"h", "a", "mjj", "ABwAU"},
{"wvzy", "zzba", "abc", "a"},
...
}


Let M denote the total number of objects in DataDict (or subsets in DataSet), and let N denote the total number of unique names of properties found in DataDict (or the total number of unique strings found in DataSet).

The question is: given the set of strings {string1, string2, string3, ...}, how to get the Yes/No answer (assuming that there is a way to prepare the “index” for DataSet) to the “Does DataSet contain at least one subset that contains string1 AND string2 AND string3 AND ...?” question as fast as possible?

In another form, the question is: given the array A = [string1, string2, string3, ...], how to build the index (data structure) for DataDict that allows to quickly determine if it contains at least one object obj (I don’t care which object to choose, moreover, I don't want to return this object, all I want is that the function should return true if such an object exists, and false if not) such that DataDict.obj.string1 is true AND DataDict.obj.string2 is true AND DataDict.obj.string3 is true AND ...?

The only way that I see is to build the index like this:

{ 
"a": [1, 3, 4], 
"abc": [4],
...
}


and, for example, if A = ["a", "abc"], then I need to find the intersection of two arrays ([1, 3, 4] and [4]), but stop as soon as one common element is found and return true (if there are no common elements, return false). But these operations are very costly and time-consuming, w

Solution

It's possible to answer such queries in time $O(M^{1 - S/N})$, where $S$ is the size of the subset in the query, using a data structure from Ron Rivest:

Partial-Match Retrieval Algorithms. Ronald L. Rivest. SIAM Journal Computing, vol 5 no 1, March 1976.

Basically, you convert each set to a bitvector of length $N$, then use Rivest's partial-match queries. This is ever so slightly faster than the naive algorithm, whose running time will be about $O(M)$ (or a bit more).

Your idea using indices is another approach that might be faster in practice, especially if most sets are much smaller than $N$. Note that it's possible to optimize this a bit. Your index will have, for each string $x$, a list of indices of sets that contain $x$ and the length of that list. Now, given a set $A$, you can check for each $x \in A$ the length of the corresponding list in the index, and find the shortest such list, and enumerate all entries in that list to see if any of them are a superset of $A$. See also https://cstheory.stackexchange.com/q/19526/5038. I think the worst-case running time is no better than the naive algorithm, but in practice it might perform significantly better.

Context

StackExchange Computer Science Q#70405, answer score: 2

Revisions (0)

No revisions yet.