HiveBrain v1.2.0
Get Started
← Back to all entries
principlepythonMinor

Longest common substring approach using suffix arrays

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
suffixarraysapproachsubstringusinglongestcommon

Problem

I am trying to speed up a function to return the longest common substring.

The requirements:

  • The function takes two strings of arbitrary length (although on average they will be less than 50 chars each)



  • If two subsequences of the same length exist it can return either



  • Speed is the primary concern



This is what I have and it works according to my tests:

from os.path import commonprefix

class LongestCommonSubstr(object):

    def __init__(self, lstring, rstring):
        self.lstring = lstring+'0'
        self.rstring = rstring+'1'
        self._suffix_str_array = sorted(self._get_suffix_str(self.lstring) 
                                    + self._get_suffix_str(self.rstring))
        self.longest_common_substr = self._get_lcsubstr()

    @staticmethod
    def _get_suffix_str(string):

        return [string[i:] for i in range(len(string))]

    def _get_lcsubstr(self):

        try:
            substr_len =0
            max_len = 0
            lcs = None
            for i,n in enumerate(self._suffix_str_array):
                if n[-1] != self._suffix_str_array[i+1][-1]:
                    substr = commonprefix([n,self._suffix_str_array[i+1]])
                    substr_len = len(substr)
                    if substr_len > max_len:
                        max_len = substr_len
                        lcs = substr

        except IndexError:
            pass
            return lcs


Can the code be made faster in pure Python? Is there a better option to differentiate the input strings when made into one sorted list than concatenating an ID to them?

Solution

OK then: since your concern is speed, let's track our progress with actual timing data. The first step is to run the code through the Python profiler. With the addition of a bit of driver code that just calls LongestCommonSubstr().longest_common_substr 10000 times, I get the following results:

$ python3.4 -m profile lcs_profile.py abcdeffghiklnopqr bdefghijklmnopmnop
         1130008 function calls in 3.083 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.000    0.000 :0(__build_class__)
        1    0.000    0.000    3.082    3.082 :0(exec)
        1    0.000    0.000    0.000    0.000 :0(hasattr)
   280000    0.257    0.000    0.257    0.000 :0(len)
   260000    0.330    0.000    0.330    0.000 :0(max)
   260000    0.350    0.000    0.350    0.000 :0(min)
        1    0.000    0.000    0.000    0.000 :0(setprofile)
    10000    0.042    0.000    0.042    0.000 :0(sorted)
        1    0.000    0.000    0.000    0.000 :2264(_handle_fromlist)
   260000    1.067    0.000    1.746    0.000 genericpath.py:69(commonprefix)
        1    0.022    0.022    3.082    3.082 lcs_profile.py:1()
    20000    0.069    0.000    0.157    0.000 lcs_profile.py:11(_get_suffix_str)
    20000    0.071    0.000    0.071    0.000 lcs_profile.py:13()
    10000    0.807    0.000    2.794    0.000 lcs_profile.py:15(_get_lcsubstr)
        1    0.000    0.000    0.000    0.000 lcs_profile.py:3(LongestCommonSubstr)
    10000    0.067    0.000    3.060    0.000 lcs_profile.py:4(__init__)
        1    0.000    0.000    3.083    3.083 profile:0( at 0x10996f030, file "lcs_profile.py", line 1>)
        0    0.000             0.000          profile:0(profiler)


The most important column is the second one, tottime, marking the total amount of time consumed by each function (but not the functions it calls). The largest entries in that column are for the commonprefix function and the _get_lc_substr function, so those are the spots you should focus on optimizing.

To keep track of our progress, I coupled your class with the following driver program which prints the execution time for 10000 runs:

def lcs_longest_match(a, b):
    return LongestCommonSubstr(a, b).longest_common_substr
if __name__ == '__main__':
    import timeit, sys
    print('lcs    ', timeit.timeit('lcs_longest_match(sys.argv[1], sys.argv[2])',
                                   setup='from __main__ import lcs_longest_match',
                                   number=10000))


(this is for Python 3). A first run on my computer gives

$ python3.4 lcs.py abcdeffghiklnopqr bdefghijklmnopmnop
lcs     0.49046793299203273


I'll start with _get_lc_substr, since you wrote that code and it'll be easier to work through. The algorithm you use is to go through each pair of consecutive strings in the sorted suffix array, find the common prefix, and save that prefix only if it's longer than any common prefix already found. I can suggest a few improvements:

  • Iterating over pairs of consecutive elements is a common task that has a fairly standard recipe, pairwise(iterable), given in the documentation for the itertools module. You can use the implementation from the more-itertools package if you want. This also lets you get rid of the try/except block (which probably didn't affect runtime much, but it helps code clarity).



  • Python has a built-in function, max, to find the maximum value of an iterable. If you use it, then the nuts and bolts of the loop as well as the compare-and-store-if-greater process get handled internally by the interpreter, which should be faster than doing them manually in pure Python.



  • Repeatedly accessing an attribute of an object, namely the method self._suffix_str_array, is slower than accessing it once and storing it locally as a new variable.



Let's check the effect of these changes on the runtime of the driver program:

lcs     0.49046793299203273
optlcs1 0.4739605739887338
optlcs2 0.4895538759883493
optlcs3 0.4828952929965453


Not much of a change. Actually, using max is slower here, so let's discard that change.

Time now to take a closer look at commonprefix.

-
If you look at the source code of this function,

def commonprefix(m):
    "Given a list of pathnames, returns the longest common leading component"
    if not m: return ''
    s1 = min(m)
    s2 = max(m)
    for i, c in enumerate(s1):
        if c != s2[i]:
            return s1[:i]
    return s1


you see that it goes through some preliminary steps relating to the fact that it needs to handle a list of potentially several paths. You only ever have two strings to compare, so you can skip that - and in fact, if you look back at the profiling data, you'll see that these calls to max and min do take up a significant amount of time. Therefore, it makes sense to implement your own version of commonprefix without the max and min calls.

Thi

Code Snippets

$ python3.4 -m profile lcs_profile.py abcdeffghiklnopqr bdefghijklmnopmnop
         1130008 function calls in 3.083 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.000    0.000 :0(__build_class__)
        1    0.000    0.000    3.082    3.082 :0(exec)
        1    0.000    0.000    0.000    0.000 :0(hasattr)
   280000    0.257    0.000    0.257    0.000 :0(len)
   260000    0.330    0.000    0.330    0.000 :0(max)
   260000    0.350    0.000    0.350    0.000 :0(min)
        1    0.000    0.000    0.000    0.000 :0(setprofile)
    10000    0.042    0.000    0.042    0.000 :0(sorted)
        1    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:2264(_handle_fromlist)
   260000    1.067    0.000    1.746    0.000 genericpath.py:69(commonprefix)
        1    0.022    0.022    3.082    3.082 lcs_profile.py:1(<module>)
    20000    0.069    0.000    0.157    0.000 lcs_profile.py:11(_get_suffix_str)
    20000    0.071    0.000    0.071    0.000 lcs_profile.py:13(<listcomp>)
    10000    0.807    0.000    2.794    0.000 lcs_profile.py:15(_get_lcsubstr)
        1    0.000    0.000    0.000    0.000 lcs_profile.py:3(LongestCommonSubstr)
    10000    0.067    0.000    3.060    0.000 lcs_profile.py:4(__init__)
        1    0.000    0.000    3.083    3.083 profile:0(<code object <module> at 0x10996f030, file "lcs_profile.py", line 1>)
        0    0.000             0.000          profile:0(profiler)
def lcs_longest_match(a, b):
    return LongestCommonSubstr(a, b).longest_common_substr
if __name__ == '__main__':
    import timeit, sys
    print('lcs    ', timeit.timeit('lcs_longest_match(sys.argv[1], sys.argv[2])',
                                   setup='from __main__ import lcs_longest_match',
                                   number=10000))
$ python3.4 lcs.py abcdeffghiklnopqr bdefghijklmnopmnop
lcs     0.49046793299203273
lcs     0.49046793299203273
optlcs1 0.4739605739887338
optlcs2 0.4895538759883493
optlcs3 0.4828952929965453
def commonprefix(m):
    "Given a list of pathnames, returns the longest common leading component"
    if not m: return ''
    s1 = min(m)
    s2 = max(m)
    for i, c in enumerate(s1):
        if c != s2[i]:
            return s1[:i]
    return s1

Context

StackExchange Code Review Q#110420, answer score: 8

Revisions (0)

No revisions yet.