HiveBrain v1.2.0
Get Started
← Back to all entries
patternMinor

Which is the proper algorithm to figure repeated substrings of longest total length?

Submitted by: @import:stackexchange-cs··
0
Viewed 0 times
totalsubstringsproperthefigurerepeatedlengthalgorithmwhichlongest

Problem

Given a string $S$ and a substring $T$, define the value of $T$ to be the total length (in characters) of all non-overlapping instances of $T$ in $S$. In other words, the value of $T$ is the length of $T$, times the number of times that $T$ occurs in $S$ without overlap.

I need an algorithm that does the following:

Given a string $S$, find the substring $T$ whose value is maximal, out of all substrings of length at least two; the index of each location where $T$ appears in $S$; and the value of $T$ (the total number of characters across all of those occurrences of $T$).

You can assume the string $S$ is ANSI encoded, where each character is one byte.

Example:

String => "AOOISDF000923RJJASPDFIJJ1023ASPD499FJJDSJIK"

Has the following repetitions:
1 - "DF", two times, total characters (4)
2 - "23", two times, total characters (4)
3 - "JJ", three times, total characters (6)
...
N - "ASPD", two times, total characters (8)


The desired output would be ASPD, because it has produced the largest total amount of characters (8).

Is there an efficient algorithm for this problem? Is there a search term or name for this that I could use to search for it, or papers in the literature that describe techniques for this problem?

Solution

You might want to check out the literature on the longest repeated substring problem, which is closely related but is trying to optimize a different metric. There are many algorithms and techniques; it's possible some of them could be adjusted to apply to your problem.

For instance, one can find the optimum solution to your problem in $O(nr)$ time using Rabin-Karp's method, where $r$ is the length of the longest repeated substring in $S$. First, we compute $r$ using any standard method. Then, for each $i$ in the range $1,2,\dots,r$, we look for the substring $T$ of length $i$ that is repeated the most number of times (without counting overlaps). This can be done in $O(n)$ time for a fixed value of $i$, by using a rolling hash to compute the hash of all length-$i$ substrings of $S$ and for each such length-$i$ substring, compute the number of times it is repeated (without counting overlaps). This can be done using a hashtable and a greedy algorithm (each time you see a substring $T$, if it doesn't overlap with the last accepted occurrence of $T$, increment the count for $T$ and accept this occurrence, otherwise reject it).

For many distributions on input strings, $r$ is relatively small compared to $n$: e.g., $r = O(\lg n)$. If that applies to your situation, then this Rabin-Karp-based method will be easy to implement and perform reasonably well.

You could also look at suffix trees or suffix arrays. Each internal node in the suffix tree represents a candidate for the substring (a substring that occurs more than once in $S$). However, it's not quite clear to me at the moment how to efficiently account for overlap according to your problem statement, so I don't have a specific algorithm to suggest.

Context

StackExchange Computer Science Q#53826, answer score: 2

Revisions (0)

No revisions yet.