HiveBrain v1.2.0
Get Started
← Back to all entries
patternMinor

Find all items which are subsets of an item

Submitted by: @import:stackexchange-cs··
0
Viewed 0 times
itemallaresubsetsitemsfindwhich

Problem

I have a problem that I think should have been studied. I am looking for algorithms for it.

Each item is a set of key-value pairs.
Let $x$ be an item and $F$ be a set of items.

Each key and each value can appear multiple times.
The number of possible keys and possible values can be arbitrary large.

We are given $x$ and $F$. We want to find all those items $y$ in $F$ such that $y.val \subseteq x$.

For example,

$x = \{(a,1), (b,2), (c,3), (d,4)\}$

$F= \{$

$(A, \{(a,1)\}), $

$(B, \{(a,1), (b,2)\}),$

$(C, \{(a,1), (b,3)\}),$

$(D, \{(b,2), (c,3), (d,4)\}),$

$(E, \{(a,1), (b,2), (c,3), (d,4)\}),$

$(F, \{(a,1), (b,2), (c,3), (d,4), (e,5)\}),$

$(G, \{(a,1), (b,2), (c,3), (e,5)\})$

$\}$

The answer is:
$A$ yes, $B$ yes, $C$ no (right keys, wrong values), $D$ yes, $E$ yes (exact match),
$F$ no, $G$ no.

Has this problem been studied?

The problem seems similar to finding features from a DNA sequence or detecting plagiarism in a document.

I asked this problem in theoretical CS stack exchange and did not get very helpful answers. https://cstheory.stackexchange.com/questions/18052/find-all-items-which-are-subsets-of-an-item

Solution

Your problem is the same one faced by SAT solvers who want to eliminate subsumed clauses from a CNF formula. Any clause $B$ that contains a superset of the literals of another clause $A$ in the formula is redundant and can be removed without changing the satisfiability of the formula. We say $A$ subsumes $B$, so $B$ can be discarded. Instead of single valued literals, you have these item tuples but it is the same problem as finding $A$.

The algorithms used are refinements of the naive algorithm, where you iterate through each element of a potential subset, checking to see if it occurs in $x$.
If you sort $x$'s items you can reduce the linear time search to log $\vert{x}\vert$. If you use a hashtable to store $x$'s items you can have amortized constant time for each search. Because of the hashing overhead, save this optimization for large sets. Bitmask signatures for each set allow you to do a cheap initial check, only doing the expensive iteration and searching of $x$ if the bitmaps can't rule out a subset match.

Context

StackExchange Computer Science Q#12777, answer score: 3

Revisions (0)

No revisions yet.