Existence of Efficient Set Difference Algorithm
Problem
As a foreword, I'm not asking what the algorithm is, just whether one can possibly exist (though, if it does already exist and someone knows what it is, that'd be great).
Basically, given two sets $S$ and $T$, I want to compute the two set differences $I = T \setminus S$ and $R = S \setminus T$. The goal is that if $S$ represents a "before" set, and $T$ an "after" set, I want to know what elements were inserted into (represented by $I$) and removed from (represented by $R$) the original set $S$ to get to $T$.
A naïve approach would be to compare every element of $S$ with every element of $T$, giving a worst-case runtime of $O(nm)$, where $n = |S|$ and $m = |T|$.
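As a sketch of this naïve approach (in Python, since the question names no language), with element-by-element comparison rather than any hashing:

```python
def diff_naive(S, T):
    """Quadratic-time set differences: compare every pair of elements."""
    I = [t for t in T if all(t != s for s in S)]  # inserted: in T but not in S
    R = [s for s in S if all(s != t for t in T)]  # removed: in S but not in T
    return I, R

# Example: going from {1, 2, 3} to {2, 3, 4} inserts 4 and removes 1.
print(diff_naive([1, 2, 3], [2, 3, 4]))  # → ([4], [1])
```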
An improvement would be to treat $S$ as a list and sort it, then do a binary search in the sorted list for every element of $T$. This would be (I believe) worst-case $O(n \log n + m \log n)$: $O(n \log n)$ for the sort, plus an $O(\log n)$ lookup for each of the $m$ elements of $T$.
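A sketch of the sort-plus-binary-search idea, using Python's standard `bisect` module; note that computing *both* differences this way needs a sorted copy of each list, not just of $S$:

```python
import bisect

def contains_sorted(sorted_list, x):
    """Binary search for x in a sorted list, O(log n)."""
    i = bisect.bisect_left(sorted_list, x)
    return i < len(sorted_list) and sorted_list[i] == x

def diff_sorted(S, T):
    S_sorted = sorted(S)  # O(n log n)
    T_sorted = sorted(T)  # O(m log m), needed for the other direction
    I = [t for t in T if not contains_sorted(S_sorted, t)]  # O(m log n)
    R = [s for s in S if not contains_sorted(T_sorted, s)]  # O(n log m)
    return I, R
```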
However, I think I can get it down to $O(n + m)$, and here's my thought process:
If $S$ is a list of $s_i$ and $T$ a list of $t_i$, we can define a "modification" bit for each element in each list.
If we define
$$c_i = \begin{cases} 1, & s_i \in R \\ 0, & \text{else} \end{cases}$$
and
$$d_i = \begin{cases} 1, & t_i \in I \\ 0, & \text{else} \end{cases}$$
and we define a "concatenation" sequence of the two:
$$x_i = \begin{cases} c_i, & 0 \leq i < n \\ d_{i-n}, & n \leq i < n+m \end{cases}$$
we can define a "modification" number:
$$X = \sum_{i=0}^{n+m-1} 2^i x_i$$
We can then compute $R$ and $I$ by breaking down the "modification" number $X$ into $c_i$ and $d_i$, giving
$$I = \{ t_i | d_i = 1 \}$$
and
$$R = \{ s_i \mid c_i = 1 \}.$$
Now, since $c_i$ and $d_i$ rely on $R$ and $I$, we can redefine them based on how $R$ and $I$ are defined:
$$(s_i \in R) \iff (s_i \in S \land s_i \notin T)$$
$$(t_i \in I) \iff (t_i \in T \land t_i \notin S)$$
So $c_i$ and $d_i$ are now:
$$c_i = \begin{cases} 1, & s_i \in S \land s_i \notin T \\ 0, & \text{else} \end{cases}$$
$$d_i = \begin{cases} 1, & t_i \in T \land t_i \notin S \\ 0, & \text{else} \end{cases}$$
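The bit definitions above can be sketched directly; note that this encoding by itself doesn't speed anything up (the membership tests $s_i \notin T$ and $t_i \notin S$ still dominate the cost), it only packs the answer into a single integer $X$:

```python
def modification_number(S, T):
    """Encode which elements changed, per the c_i / d_i bit definitions."""
    s_list, t_list = list(S), list(T)
    c = [1 if s not in t_list else 0 for s in s_list]  # s_i in R <=> s_i not in T
    d = [1 if t not in s_list else 0 for t in t_list]  # t_i in I <=> t_i not in S
    x = c + d                                          # concatenated bit sequence
    return sum(bit << i for i, bit in enumerate(x))    # X = sum_i 2^i * x_i

# S = [1, 2, 3], T = [2, 3, 4]: c = [1, 0, 0], d = [0, 0, 1],
# so X = 2^0 + 2^5 = 33.  Decoding R and I just reads the bits back off.
print(modification_number([1, 2, 3], [2, 3, 4]))  # → 33
```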
Solution
You can compute $S \setminus T$ and $T \setminus S$ from $S$ and $T$ in $O(n+m)$ time using a hash table. Put all of list $S$ into a hash table, then iterate through list $T$, looking each element up in the hash table. Then do the same with $T$ in the hash table, iterating through $S$.
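In Python this is a few lines, since the built-in `set` is a hash table:

```python
def diff_hash(S, T):
    """Expected O(n + m): hash-table membership tests in each direction."""
    S_set = set(S)  # build hash table of S, expected O(n)
    T_set = set(T)  # build hash table of T, expected O(m)
    I = [t for t in T if t not in S_set]  # T \ S: inserted elements
    R = [s for s in S if s not in T_set]  # S \ T: removed elements
    return I, R

print(diff_hash([1, 2, 3], [2, 3, 4]))  # → ([4], [1])
```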
Fine print for complexity purists: this is expected running time, making suitable assumptions about the hash function. However, the probability that the running time takes longer than $c \cdot (n+m)$ can be made exponentially small in $c$ with a suitable choice of hash function. For practical purposes, you typically don't need to worry about this.
You can also do it in $O(n \lg n + m \lg m)$ time, using a suitable sorting algorithm plus a standard merge algorithm. In some special circumstances, sorting might be even faster (look up counting sort and radix sort).
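A sketch of the sort-plus-merge variant: after sorting, a single linear pass over both lists classifies every element as common, removed, or inserted.

```python
def diff_merge(S, T):
    """O(n lg n + m lg m): sort both lists, then one linear merge pass."""
    a, b = sorted(S), sorted(T)
    i = j = 0
    R, I = [], []  # removed (only in S), inserted (only in T)
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            i += 1
            j += 1          # common element, belongs to neither difference
        elif a[i] < b[j]:
            R.append(a[i])  # a[i] can no longer appear in b
            i += 1
        else:
            I.append(b[j])  # b[j] can no longer appear in a
            j += 1
    R.extend(a[i:])  # leftovers of S were removed
    I.extend(b[j:])  # leftovers of T were inserted
    return I, R

print(diff_merge([3, 1, 2], [4, 2, 3]))  # → ([4], [1])
```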
Which of these is faster in practice will depend upon the platform you run it on. You probably will need to implement and try both, to see which will be faster. You can't trust the asymptotic complexity in this situation, as caching effects and other implementation considerations have the potential to be more important than the asymptotics.
Context
StackExchange Computer Science Q#27570, answer score: 5