patternMinor
Algorithm to find repeated patterns in a large string
Viewed 0 times
repeatedpatternsalgorithmlargefindstring
Problem
For optimization purposes I'm trying to analyze a large list of executed program commands to find chunks of commands that are executed over and over again. This problem is similar to searching repeated substrings in a string. However, in my case I'm not looking for the longest substrings but rather smaller substrings that occur very often.
For example, say each command is represented by a letter, then a program might look like
Things I have tried so far:
-
I played with suffix trees to find the longest repeated substrings that occur at least k times. While that is simple to implement, it doesn't work well in my use case, because also overlapping substrings are found. Trying to remove those wasn't very successful either. The approach mentioned in this post either gave wrong or incomplete results (or I misunderstood the approach), and it also doesn't seem to be customizable. Suffix trees still seem the most promissing approach to me. Perhaps someone has an idea how to include the minumim/maximum chunk lengths into the search here?
EDIT: I finally found a working algorithm for the longest non-overlapping repeated substring, explained here. My current implementation works fine for smaller inputs but has an excessive memory consumption for inputs larger than 10 MB (which shouldn't be a problem of the algorithm per se).
For example, say each command is represented by a letter, then a program might look like
xabca yabca xabca yabca. If we are looking for the longest repeated substrings, the best result is xabca yabca. A "better" result would be abca, though. While being shorter, it occurs more often in the string. a occurs even more often on its own, but it would be considered a too short match. So an algorithm should be parameterizable by a minimum and maximum chunk length.Things I have tried so far:
-
I played with suffix trees to find the longest repeated substrings that occur at least k times. While that is simple to implement, it doesn't work well in my use case, because also overlapping substrings are found. Trying to remove those wasn't very successful either. The approach mentioned in this post either gave wrong or incomplete results (or I misunderstood the approach), and it also doesn't seem to be customizable. Suffix trees still seem the most promissing approach to me. Perhaps someone has an idea how to include the minumim/maximum chunk lengths into the search here?
EDIT: I finally found a working algorithm for the longest non-overlapping repeated substring, explained here. My current implementation works fine for smaller inputs but has an excessive memory consumption for inputs larger than 10 MB (which shouldn't be a problem of the algorithm per se).
- Another attempt was using the substring table that is created for the LZW compression algorithm. The problem with this approach is that is doesn't find repeated chunks that occur early and it also creates longer and longer table entries the farer it processes the input (whi
Solution
To improve your brute force, you would only need to look at the substrings of minimum length, as longer substrings will occur at most as many times as their own substrings. This should take linear time to create the dictionary, and O(nlog(n)) to sort.
If this is too memory intensive, you could do pruning by first brute-forcing this for say, substrings of length 2, and then only consider min-length substrings that contain at least k of the most common 2-length substrings for some threshold k.
If this is too memory intensive, you could do pruning by first brute-forcing this for say, substrings of length 2, and then only consider min-length substrings that contain at least k of the most common 2-length substrings for some threshold k.
Context
StackExchange Computer Science Q#112901, answer score: 2
Revisions (0)
No revisions yet.