patternpythonMinor
Longest Common Substring solution for a string and a list
Viewed 0 times
solutionsubstringforandlistlongeststringcommon
Problem
Recently I have discovered one thing, which makes me feel frustrated. The Perl is lack of syntax sugar for strings, to do the same things, those are applicable to lists. The implementation I posted here would require to be rewritten to work for lists.
Thus, I had to re-write the code to Python to work for list as efficient as it does for strings:
Here I implemented one minor improvement - filtering out inner substrings from the resultset, which aims to display more laconic result.
To not to frighten anyone with long sheets of code, here I am attaching a link to units tests, I used to tune up this solution: https://gist.github.com/anonymous/aa7540cc7db2ca446e2b
Your comments are welcomed! Especially on function '_contains', which does not look like a simple way of doing a simple task. I have started studying the Python just a few weeks ago.
Thus, I had to re-write the code to Python to work for list as efficient as it does for strings:
def _contains(subseq, inseq):
return any(inseq[pos:pos + len(subseq)] == subseq for pos in range(0, len(inseq) - len(subseq) + 1))
def lcss(s1, s2, threshold=3):
all_of = []
def is_inner_seq(i, j):
return any(k < i < j <= l or k <= i < j < l for k, l in all_of)
for i in range(0, len(s1)):
for j in range(i + threshold, len(s1) + 1):
if _contains(subseq=s1[i:j], inseq=s2):
all_of.append((i, j)) # add a tuple to work OK when s1, s2 are lists
only_outer_seq = filter(lambda tup: not is_inner_seq(tup[0], tup[1]),
all_of)
return [s1[i:j] for i, j in only_outer_seq]Here I implemented one minor improvement - filtering out inner substrings from the resultset, which aims to display more laconic result.
To not to frighten anyone with long sheets of code, here I am attaching a link to units tests, I used to tune up this solution: https://gist.github.com/anonymous/aa7540cc7db2ca446e2b
Your comments are welcomed! Especially on function '_contains', which does not look like a simple way of doing a simple task. I have started studying the Python just a few weeks ago.
Solution
-
No docstrings! What do your functions do and how do I call them?
-
Your
-
Python's
It's also much faster.
-
I guess you didn't use
See this answer for a Boyer–Moore–Horspool implementation in Python.
-
Your algorithm has complexity \$O(m^3n)\$ (where \$m\$ and \$n\$ are the lengths of the two sequences). But the longest common subsequence problem for two sequences has a well known \$O(mn)\$ algorithm, described in detail in the Wikipedia article.
No docstrings! What do your functions do and how do I call them?
-
Your
lcss function has a name that suggests it might return the longest common subsequence. But in fact it returns all common subsequences of length 3 or more (except for ones that appear inside other common subsequences in s1). Is this really what you want the function to return? If so, you probably want to rethink the name.-
Python's
in operator already implements your _contains operation for strings:>>> [w.strip() for w in open('/usr/share/dict/words') if 'abc' in w]
['crabcatcher', 'dabchick']It's also much faster.
-
I guess you didn't use
in because you want your subsequence operation to work on all kinds of sequence (not just strings). But in that case you should prefer to use one of the Boyer–Moore search algorithms, for example Boyer–Moore–Horspool, so that the complexity is more like \$O(n)\$ (on average on random inputs) than \$O(mn)\$.See this answer for a Boyer–Moore–Horspool implementation in Python.
-
Your algorithm has complexity \$O(m^3n)\$ (where \$m\$ and \$n\$ are the lengths of the two sequences). But the longest common subsequence problem for two sequences has a well known \$O(mn)\$ algorithm, described in detail in the Wikipedia article.
Code Snippets
>>> [w.strip() for w in open('/usr/share/dict/words') if 'abc' in w]
['crabcatcher', 'dabchick']Context
StackExchange Code Review Q#56464, answer score: 4
Revisions (0)
No revisions yet.