HiveBrain v1.2.0
Get Started
← Back to all entries
patternMinor

Test if there exists an integer k to add to one sequence to make it a subsequence of another sequence

Submitted by: @import:stackexchange-cs··
0
Viewed 0 times
makesubsequencesequenceonetestanotherexiststhereintegeradd

Problem

Suppose that sequence $A$ contains $n$ integers $a_1,a_2,a_3,\ldots,a_n$ and sequence $B$ contains $m$ integers $b_1,b_2,b_3,\ldots,b_m$. We know that $m \geq n$. We assume without loss of generality that both sequences $A$ and $B$ are sorted in ascending order.

What's the fastest algorithm to determine if there exists an integer $k$ such that the sequence $a_1+k,a_2+k,a_3+k,\ldots,a_n+k$ is a subsequence of the sequence $B$?

Here is a naive algorithm, which would take $O(n(m-n))$ time. Store the sequence $B$ as a hashtable. For each element $b_j$ of the $B$ (except the largest $n$ elements), use $b_j-a_1$ as your guess at $k$; you can verify this guess by checking whether each of $a_1+k,\dots,a_n+k$ are in the hashtable $B$. This takes $O(n)$ expected time per guess at $k$, and you make $m-n$ guesses, so in total the expected running time is $O(n(m-n))$. Can we do better?

I came across this problem while trying to match two binary images (testing whether one image contains the other).

Solution

Here is a heuristic that won't always work, but should work with high probability if the integers in the arrays are chosen randomly from a large enough space.

Initialize a hashtable of counts $C$ to all zeros. Then, repeat $t$ times: randomly pick $i,j$, compute $b_j-a_i$, and increment $C[b_j-a_i]$. Finally, sort $C$ by counts, from largest count to smallest; then for each of the largest few values of $C[k']$, try $k'$ as your guess at $k$, and verify each guess.

Notice that in each iteration, the probability of incrementing $C[k]$ is at least $1/m$; whereas if $l\ne k$, we expect $C[l]$ to be incremented much more rarely (assuming the integers in the arrays are random and large enough). Thus, after $t$ iterations, we expect $\mathbb{E}[C[k]] \ge t/m$ but $\mathbb{E}[C[l]] \ll t/m$. So, once $t$ is large enough, $C[k]$ should be larger than every other entry in $C$.

How large does $t$ need to be? I expect that $t = O(m \log m)$ should suffice, based on a central limit theorem approximation for the counts $C[l]$, assuming we are willing to accept a small probability of error (which can be driven exponentially small). The constant factor hidden by the big-O notation might be non-trivial.

This is only a heuristic, and there will certainly be inputs where it fails.

Context

StackExchange Computer Science Q#117706, answer score: 3

Revisions (0)

No revisions yet.