patternMinor
Find all items which are subsets of an item
Viewed 0 times
itemallaresubsetsitemsfindwhich
Problem
I have a problem that I think should have been studied. I am looking for algorithms for it.
Each item is a set of key-value pairs.
Let $x$ be an item and $F$ be a set of items.
Each key and each value can appear multiple times.
The number of possible keys and possible values can be arbitrary large.
We are given $x$ and $F$. We want to find all those items $y$ in $F$ such that $y.val \subseteq x$.
For example,
$x = \{(a,1), (b,2), (c,3), (d,4)\}$
$F= \{$
$(A, \{(a,1)\}), $
$(B, \{(a,1), (b,2)\}),$
$(C, \{(a,1), (b,3)\}),$
$(D, \{(b,2), (c,3), (d,4)\}),$
$(E, \{(a,1), (b,2), (c,3), (d,4)\}),$
$(F, \{(a,1), (b,2), (c,3), (d,4), (e,5)\}),$
$(G, \{(a,1), (b,2), (c,3), (e,5)\})$
$\}$
The answer is:
$A$ yes, $B$ yes, $C$ no (right keys, wrong values), $D$ yes, $E$ yes (exact match),
$F$ no, $G$ no.
Has this problem been studied?
The problem seems similar to finding features from a DNA sequence or detecting plagiarism in a document.
I asked this problem in theoretical CS stack exchange and did not get very helpful answers. https://cstheory.stackexchange.com/questions/18052/find-all-items-which-are-subsets-of-an-item
Each item is a set of key-value pairs.
Let $x$ be an item and $F$ be a set of items.
Each key and each value can appear multiple times.
The number of possible keys and possible values can be arbitrary large.
We are given $x$ and $F$. We want to find all those items $y$ in $F$ such that $y.val \subseteq x$.
For example,
$x = \{(a,1), (b,2), (c,3), (d,4)\}$
$F= \{$
$(A, \{(a,1)\}), $
$(B, \{(a,1), (b,2)\}),$
$(C, \{(a,1), (b,3)\}),$
$(D, \{(b,2), (c,3), (d,4)\}),$
$(E, \{(a,1), (b,2), (c,3), (d,4)\}),$
$(F, \{(a,1), (b,2), (c,3), (d,4), (e,5)\}),$
$(G, \{(a,1), (b,2), (c,3), (e,5)\})$
$\}$
The answer is:
$A$ yes, $B$ yes, $C$ no (right keys, wrong values), $D$ yes, $E$ yes (exact match),
$F$ no, $G$ no.
Has this problem been studied?
The problem seems similar to finding features from a DNA sequence or detecting plagiarism in a document.
I asked this problem in theoretical CS stack exchange and did not get very helpful answers. https://cstheory.stackexchange.com/questions/18052/find-all-items-which-are-subsets-of-an-item
Solution
Your problem is the same one faced by SAT solvers who want to eliminate subsumed clauses from a CNF formula. Any clause $B$ that contains a superset of the literals of another clause $A$ in the formula is redundant and can be removed without changing the satisfiability of the formula. We say $A$ subsumes $B$, so $B$ can be discarded. Instead of single valued literals, you have these item tuples but it is the same problem as finding $A$.
The algorithms used are refinements of the naive algorithm, where you iterate through each element of a potential subset, checking to see if it occurs in $x$.
If you sort $x$'s items you can reduce the linear time search to log $\vert{x}\vert$. If you use a hashtable to store $x$'s items you can have amortized constant time for each search. Because of the hashing overhead, save this optimization for large sets. Bitmask signatures for each set allow you to do a cheap initial check, only doing the expensive iteration and searching of $x$ if the bitmaps can't rule out a subset match.
The algorithms used are refinements of the naive algorithm, where you iterate through each element of a potential subset, checking to see if it occurs in $x$.
If you sort $x$'s items you can reduce the linear time search to log $\vert{x}\vert$. If you use a hashtable to store $x$'s items you can have amortized constant time for each search. Because of the hashing overhead, save this optimization for large sets. Bitmask signatures for each set allow you to do a cheap initial check, only doing the expensive iteration and searching of $x$ if the bitmaps can't rule out a subset match.
Context
StackExchange Computer Science Q#12777, answer score: 3
Revisions (0)
No revisions yet.