principlepythonMinor
Longest common substring approach using suffix arrays
Viewed 0 times
suffixarraysapproachsubstringusinglongestcommon
Problem
I am trying to speed up a function to return the longest common substring.
The requirements:
This is what I have and it works according to my tests:
Can the code be made faster in pure Python? Is there a better option to differentiate the input strings when made into one sorted list than concatenating an ID to them?
The requirements:
- The function takes two strings of arbitrary length (although on average they will be less than 50 chars each)
- If two subsequences of the same length exist it can return either
- Speed is the primary concern
This is what I have and it works according to my tests:
from os.path import commonprefix
class LongestCommonSubstr(object):
def __init__(self, lstring, rstring):
self.lstring = lstring+'0'
self.rstring = rstring+'1'
self._suffix_str_array = sorted(self._get_suffix_str(self.lstring)
+ self._get_suffix_str(self.rstring))
self.longest_common_substr = self._get_lcsubstr()
@staticmethod
def _get_suffix_str(string):
return [string[i:] for i in range(len(string))]
def _get_lcsubstr(self):
try:
substr_len =0
max_len = 0
lcs = None
for i,n in enumerate(self._suffix_str_array):
if n[-1] != self._suffix_str_array[i+1][-1]:
substr = commonprefix([n,self._suffix_str_array[i+1]])
substr_len = len(substr)
if substr_len > max_len:
max_len = substr_len
lcs = substr
except IndexError:
pass
return lcsCan the code be made faster in pure Python? Is there a better option to differentiate the input strings when made into one sorted list than concatenating an ID to them?
Solution
OK then: since your concern is speed, let's track our progress with actual timing data. The first step is to run the code through the Python profiler. With the addition of a bit of driver code that just calls
The most important column is the second one,
To keep track of our progress, I coupled your class with the following driver program which prints the execution time for 10000 runs:
(this is for Python 3). A first run on my computer gives
I'll start with
Let's check the effect of these changes on the runtime of the driver program:
Not much of a change. Actually, using
Time now to take a closer look at
-
If you look at the source code of this function,
you see that it goes through some preliminary steps relating to the fact that it needs to handle a list of potentially several paths. You only ever have two strings to compare, so you can skip that - and in fact, if you look back at the profiling data, you'll see that these calls to
Thi
LongestCommonSubstr().longest_common_substr 10000 times, I get the following results:$ python3.4 -m profile lcs_profile.py abcdeffghiklnopqr bdefghijklmnopmnop
1130008 function calls in 3.083 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.000 0.000 :0(__build_class__)
1 0.000 0.000 3.082 3.082 :0(exec)
1 0.000 0.000 0.000 0.000 :0(hasattr)
280000 0.257 0.000 0.257 0.000 :0(len)
260000 0.330 0.000 0.330 0.000 :0(max)
260000 0.350 0.000 0.350 0.000 :0(min)
1 0.000 0.000 0.000 0.000 :0(setprofile)
10000 0.042 0.000 0.042 0.000 :0(sorted)
1 0.000 0.000 0.000 0.000 :2264(_handle_fromlist)
260000 1.067 0.000 1.746 0.000 genericpath.py:69(commonprefix)
1 0.022 0.022 3.082 3.082 lcs_profile.py:1()
20000 0.069 0.000 0.157 0.000 lcs_profile.py:11(_get_suffix_str)
20000 0.071 0.000 0.071 0.000 lcs_profile.py:13()
10000 0.807 0.000 2.794 0.000 lcs_profile.py:15(_get_lcsubstr)
1 0.000 0.000 0.000 0.000 lcs_profile.py:3(LongestCommonSubstr)
10000 0.067 0.000 3.060 0.000 lcs_profile.py:4(__init__)
1 0.000 0.000 3.083 3.083 profile:0( at 0x10996f030, file "lcs_profile.py", line 1>)
0 0.000 0.000 profile:0(profiler)The most important column is the second one,
tottime, marking the total amount of time consumed by each function (but not the functions it calls). The largest entries in that column are for the commonprefix function and the _get_lc_substr function, so those are the spots you should focus on optimizing.To keep track of our progress, I coupled your class with the following driver program which prints the execution time for 10000 runs:
def lcs_longest_match(a, b):
return LongestCommonSubstr(a, b).longest_common_substr
if __name__ == '__main__':
import timeit, sys
print('lcs ', timeit.timeit('lcs_longest_match(sys.argv[1], sys.argv[2])',
setup='from __main__ import lcs_longest_match',
number=10000))(this is for Python 3). A first run on my computer gives
$ python3.4 lcs.py abcdeffghiklnopqr bdefghijklmnopmnop
lcs 0.49046793299203273I'll start with
_get_lc_substr, since you wrote that code and it'll be easier to work through. The algorithm you use is to go through each pair of consecutive strings in the sorted suffix array, find the common prefix, and save that prefix only if it's longer than any common prefix already found. I can suggest a few improvements:- Iterating over pairs of consecutive elements is a common task that has a fairly standard recipe,
pairwise(iterable), given in the documentation for the itertools module. You can use the implementation from the more-itertools package if you want. This also lets you get rid of thetry/exceptblock (which probably didn't affect runtime much, but it helps code clarity).
- Python has a built-in function,
max, to find the maximum value of an iterable. If you use it, then the nuts and bolts of the loop as well as the compare-and-store-if-greater process get handled internally by the interpreter, which should be faster than doing them manually in pure Python.
- Repeatedly accessing an attribute of an object, namely the method
self._suffix_str_array, is slower than accessing it once and storing it locally as a new variable.
Let's check the effect of these changes on the runtime of the driver program:
lcs 0.49046793299203273
optlcs1 0.4739605739887338
optlcs2 0.4895538759883493
optlcs3 0.4828952929965453Not much of a change. Actually, using
max is slower here, so let's discard that change.Time now to take a closer look at
commonprefix.-
If you look at the source code of this function,
def commonprefix(m):
"Given a list of pathnames, returns the longest common leading component"
if not m: return ''
s1 = min(m)
s2 = max(m)
for i, c in enumerate(s1):
if c != s2[i]:
return s1[:i]
return s1you see that it goes through some preliminary steps relating to the fact that it needs to handle a list of potentially several paths. You only ever have two strings to compare, so you can skip that - and in fact, if you look back at the profiling data, you'll see that these calls to
max and min do take up a significant amount of time. Therefore, it makes sense to implement your own version of commonprefix without the max and min calls.Thi
Code Snippets
$ python3.4 -m profile lcs_profile.py abcdeffghiklnopqr bdefghijklmnopmnop
1130008 function calls in 3.083 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.000 0.000 :0(__build_class__)
1 0.000 0.000 3.082 3.082 :0(exec)
1 0.000 0.000 0.000 0.000 :0(hasattr)
280000 0.257 0.000 0.257 0.000 :0(len)
260000 0.330 0.000 0.330 0.000 :0(max)
260000 0.350 0.000 0.350 0.000 :0(min)
1 0.000 0.000 0.000 0.000 :0(setprofile)
10000 0.042 0.000 0.042 0.000 :0(sorted)
1 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:2264(_handle_fromlist)
260000 1.067 0.000 1.746 0.000 genericpath.py:69(commonprefix)
1 0.022 0.022 3.082 3.082 lcs_profile.py:1(<module>)
20000 0.069 0.000 0.157 0.000 lcs_profile.py:11(_get_suffix_str)
20000 0.071 0.000 0.071 0.000 lcs_profile.py:13(<listcomp>)
10000 0.807 0.000 2.794 0.000 lcs_profile.py:15(_get_lcsubstr)
1 0.000 0.000 0.000 0.000 lcs_profile.py:3(LongestCommonSubstr)
10000 0.067 0.000 3.060 0.000 lcs_profile.py:4(__init__)
1 0.000 0.000 3.083 3.083 profile:0(<code object <module> at 0x10996f030, file "lcs_profile.py", line 1>)
0 0.000 0.000 profile:0(profiler)def lcs_longest_match(a, b):
return LongestCommonSubstr(a, b).longest_common_substr
if __name__ == '__main__':
import timeit, sys
print('lcs ', timeit.timeit('lcs_longest_match(sys.argv[1], sys.argv[2])',
setup='from __main__ import lcs_longest_match',
number=10000))$ python3.4 lcs.py abcdeffghiklnopqr bdefghijklmnopmnop
lcs 0.49046793299203273lcs 0.49046793299203273
optlcs1 0.4739605739887338
optlcs2 0.4895538759883493
optlcs3 0.4828952929965453def commonprefix(m):
"Given a list of pathnames, returns the longest common leading component"
if not m: return ''
s1 = min(m)
s2 = max(m)
for i, c in enumerate(s1):
if c != s2[i]:
return s1[:i]
return s1Context
StackExchange Code Review Q#110420, answer score: 8
Revisions (0)
No revisions yet.