HiveBrain v1.2.0
Get Started
← Back to all entries
patternMinor

Shortest Non-Subsequence String With Constant-Size Alphabet

Submitted by: @import:stackexchange-cs··
0
Viewed 0 times
alphabetconstantshortestnonwithsizesubsequencestring

Problem

Let $S[1..n] \in \Sigma^*$ be a string of $n$ symbols over the alphabet $\Sigma$ where $|\Sigma| \in \mathcal{O}(1)$. Determine a shortest string which cannot be obtained from $S$ by deleting some (possibly no) symbols.

For instance, we are given $\Sigma = \{ A, C, G, T \}$ and $S = ACGTACGT$. In this case, one valid answer is $CCA$, because it does not occur as a subsequence in $S$ (i.e. cannot be obtained by removing some of the symbols from $S$) and there is no such string of length $2$.

I am looking for an algorithm which takes $\mathcal{O}(n)$ or $\mathcal{O}(n \log n)$ time. It is not sufficient to compute the length of the answer, an example should also be constructed.

Observations and Approaches. The length of the answer does not exceed $\displaystyle \frac{n}{|\Sigma|} + 1$ by pigeonhole principle. This upper bound is attained for $S = ACGTACGT...ACGT$ in the above example. If we are only interested in the length of the answer, the following approach should work. Let $f(i)$ denote the largest $k \in \mathbb{N}$ such that every string of length $k$ is a subsequence of the prefix $S[1..i]$. Our final answer is $f(n) + 1$ and $f$ can be computed by dynamic programming in $\mathcal{O}(n)$ time for $1 \le i \le n$ simply by recurrence

$\displaystyle f(i) = 1 + f\left(\min{(\ell(i, \sigma_1), \ell(i, \sigma_2), \dots, \ell(i, \sigma_{|\Sigma|})) - 1}\right)$

where $\ell(i, \sigma)$ denotes the last occurrence of $\sigma \in \Sigma$ in the prefix $S[1..i]$. Obviously, the function $f$ is increasing, i.e. $f(i + 1) \ge f(i)$ for $1 \le i \le n - 1$. It seems more difficult to construct an example from this information, that is where I am stuck.

Finally note that the problem is a lot easier, if we require the subsequence do be contigous. Since there are only $\mathcal{O}(n^2)$ substrings, but $\mathcal{O}(|\Sigma|^k)$ strings of length $k$ over $\Sigma$, $k$ turns out to be very small and even brute-force will succeed.

Any kind of help is appreciated, thank

Solution

For each $i$, let $s(i)$ be a string of length $f(i)+1$ that is not a subsequence of $S[1\ldots i]$, then what you seek for is exactly $s(n)$. Also denote $g(i) = \min\left(\ell\left(i,\sigma_1\right),\ldots,\ell\left(i,\sigma_{|\Sigma|}\right)\right)$. $g(i)$'s can be computed in $O(n)$ time in advance.

We have
$$
s(i)=s(g(i)-1)S[g(i)].
$$
Firstly $s(g(i)-1)S[g(i)]$ is of length $f(i)+1$. Suppose by contradiction $s(g(i)-1)S[g(i)]$ is a subsequence of $S[1\ldots i]$, then the index that the last character $S[g(i)]$ matches must be less than or equal to $g(i)$, which means $s(g(i)-1)$ is a subsequence of $S[1\ldots (g(i)-1)]$, a contradiction. So the recurrence is correct.

Now you can use this recurrence to compute $s(n)$:

  • Let $i \leftarrow n$, and let $s$ be an empty string.



  • If $g(i)$ does not exist, which means there is at least one character not appearing in $S[1\ldots i]$, insert this character to the beginning of $s$ and return $s$.



  • Insert $S[g(i)]$ to the beginning of $s$.



  • Let $i\leftarrow g(i)-1$.



  • Go back to step 2.

Context

StackExchange Computer Science Q#88786, answer score: 3

Revisions (0)

No revisions yet.