HiveBrain v1.2.0
Get Started
← Back to all entries
patternMinor

Given n strings, is one of them a substring of another?

Submitted by: @import:stackexchange-cs··
0
Viewed 0 times
substringoneanotherthemstringsgiven

Problem

Suppose we are given a collection of $n$ strings, $S_1,\dots,S_n$. I would like to know whether any of those strings is a substring of any other string in the collection. In other words, I'd like an algorithm for the following task:

Input: $S_1,\dots,S_n$

Output: $i,j$ such that $S_i$ is a substring of $S_j$ and $i\ne j$, or None if no such $i,j$ exist

Is there an efficient algorithm for this?

If we replace "substring" with "prefix", there is an efficient algorithm (sort the strings, then do a linear scan to compare adjacent strings; sorting will ensure that substrings are adjacent). But it seems more challenging to test whether any string is a substring of any other string. A naive algorithm is to iterate over all pairs $i,j$, but this requires $\Theta(n^2)$ substring tests. Is there a more efficient algorithm?

I guess we could call this "all-pairs substring testing", or something like that.

My ultimate goal is to prune the collection so no string is a substring of any other, by removing each one that is a substring of something else in the collection.

Solution

You can build a suffix tree in linear time and check if there's an inner node that corresponds to a full string (constant time per node).

In more detail, assume we are given strings $s_1, \dots, s_n \in \Sigma^*$.

-
Build a (generalised) suffix tree of $s_1\$_1, s_2\$_2, \dots, s_n\$_n$ with $n$ pairwise distinct terminal markers $\$_1,\dots,\$_n \notin \Sigma$.

Using Ukkonen's algorithm, this can be done in linear time; linear in the sum of all string lengths.

-
Assuming that you label leaves with $(i,j)$ if they represent suffix $s_i[j..|s_i|]$ of $s_i$, traverse the tree and find those $n$ leaves labelled $(i,0)$, i.e. the leaves that correspond to the full strings.

This takes time linear in the tree size, which itself is linear in the input size.

-
The descendant leaves of the parent of $(i,0)$ (which is reached by an edge labelled $\$_i$) represent all matches from the set; this follows from the basic invariant of suffix trees. Find any one match by descending to any leaf (but $(i,0)$).

This again takes linear time.

The distinct terminal markers are not really necessary; a single one used to terminate all strings is quite sufficient as long as you allow multiple labels per leaf.

Context

StackExchange Computer Science Q#32628, answer score: 6

Revisions (0)

No revisions yet.