HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonMinor

Multiple longest common subsequence (another algorithm)

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
subsequencealgorithmanothermultiplelongestcommon

Problem

Here is the algorithm:

Input: a set of \$M\$ strings \$S=\{s_i\}\$ of length \$N\$ with alphabet \$\Sigma\$

Make an index of each string by chars.
Maintain a position \$p=(p_i)\$ where \$p_i\$ is an index into string \$s_i\$.
Set the initial position such that \$\forall i: p_i = |s_i|+1\$
(1-based indexing).

While chars left in strings do:

  • Select the next point among matches (same char in every string) by looping through the alphabet shared by the strings and picking the point with the minimal m-dimensional euclidean distance to the current point.



Is this \$O(|\Sigma|M\log(N))\$?

Here is (quite bad) sample python code:

```
from collections import defaultdict
import sys, bisect

def mlcs(strings):
strings = list(set(strings))
indexes = dict()

sigma = set()
for x in strings:
sigma = sigma.union(set(x))

for x in strings:
tx = defaultdict(list)
for i in range(len(x)):
tx[x[i]] += [i]
indexes[x] = tx

def match(pos):
s = True
for i in range(len(strings)-1):
x = strings[i]
y = strings[i+1]
s &= x[pos[x]] == y[pos[y]]
return s

def distance(ind1, ind2):
d = 0
for x in strings:
d += (ind1[x] - ind2[x])**2
return d

def find(xs, y):
i = bisect.bisect_right(xs, y)
return i-1

pos = {}
for x in strings:
pos[x] = len(x)
result = ""
loop = True
while loop:
for x in strings:
if pos[x] pos[x] - 1:
ind[x] = -128
else:
ind[x] = -128
d = distance(ind, pos)
if d < dr:
indr = ind
dr = d
x = strings[0]
def notnegative(p):
r = True
for x in p.values():
if x < 0:
r = False
return r
if notnegative(indr):

Solution


  1. Analysis



This is a greedy algorithm: at each step it prefixes one letter to the result, choosing the letter than minimizes the change in the indexes into the strings.

The main loop of the algorithm adds one letter to the result, so can execute at most \$ N \$ times. Each iteration of the loop considers each letter in the alphabet \$ Σ \$, and each of the \$ M \$ strings, and searches (by bisection) the list of occurrences of that letter in that string, taking \$ O(\log N) \$. Thus the overall runtime is \$ O(\left|Σ\right|NM\log N) \$. (You missed a factor of \$ N \$.)

This is polynomial time, but the longest common subsequence problem is NP-hard, so the algorithm does not solve the problem in the general case.

Here's a simple case where mlcs returns the wrong result:

>>> mlcs(('ab', 'aba'))
'a'


  1. Review



-
There's no docstring for mlcs. What does this function do? How do I call it? What does it return?

-
The case of no strings is not handled:

>>> mlcs(())
Traceback (most recent call last):
    ...
IndexError: list index out of range


This should probably be a ValueError, like max(()).

-
The case where all strings are empty is not handled:

>>> mlcs(('', ''))
Traceback (most recent call last):
    ....
AttributeError: 'NoneType' object has no attribute 'values'


This should return the empty string.

-
The code takes the union of the letters in the strings. But the only letters that can appear in the longest common subsequence are those letters that appear in all the strings, so the intersection is needed here here, not the union.

-
The code accumulates a list by repeated addition:

for i in range(len(x)):
    tx[x[i]] += [i]


This wastes space because Python might have to allocate a new list each time and copy the old list across. It's more efficient to accumulate a list using the append method.

-
Iteration over the indexes of a sequence can often be simplified using enumerate, for example the loop above can be written:

for i, letter in enumerate(x):
    tx[letter].append(i)


-
pos is a dictionary mapping input string to the current position in that string in the search. Since strings is a list of strings, it would make more sense for pos to be a list of positions, so that pos[i] was the current position in strings[i]. Then match simplifies to:

def match(pos):
    c = strings[0][pos[0]]
    return all(c == s[p] for s, p in zip(strings, pos))


-
The function match is not called, so could be omitted.

-
The initialization of pos:

pos = {}
for x in strings:
    pos[x] = len(x)


Could be written as a dictionary comprehension:

pos = {s: len(s) for s in strings}


or, if you follow my advice above about making pos into a list, a list comprehension:

pos = [len(s) for s in strings]


-
This loop:

for x in strings:
    if pos[x] < 0:
        loop = False
if not loop:
    break


always tests all elements of pos, so that even if the first one is negative, it will go on and uselessly test all the others. It would be better to fail as soon as possible:

if any(pos[s] < 0 for s in strings):
    break


or, if you follow my advice above about making pos into a list:

if any(p < 0 for p in pos):
    break


-
Note that the change discussed above gets rid of the only assignment to loop, so that we have:

loop = True
while loop:
    if any(p < 0 for p in pos):
        break


which simplifies to:

while all(p >= 0 for p in pos):


-
Instead of setting ind[x] and then setting it back to the default value based on a condition:

ind = defaultdict(int)
for x in strings:
    indxc = indexes[x][c]
    if indxc:
        i = find(indxc, pos[x] - 1)
        ind[x] = indxc[i]
        if ind[x] > pos[x] - 1:
            ind[x] = -128                    
    else:
        ind[x] = -128


initialize it to the default value and set it only if the condition does not apply:

ind = defaultdict(lambda:-128)
for x in strings:
    indxc = indexes[x][c]
    if indxc:
        i = find(indxc, pos[x] - 1)
        if indxc[i] <= pos[x] - 1:
            ind[x] = indxc[i]


-
Just as I recommended making pos into a list, I also recommend making ind into a list. If you did, then distance would become:

def distance(v, w):
    return sum((i - j)**2 for i, j in zip(v, w)):


and since this is only called from one place, you could easily inline it.

-
Similarly, indexes would be more conveniently organized so that indexes[letter][i] is a list of the indexes of the occurrences of letter in strings[i].

-
A magic number like this:

dr = 12777216


needs to be explained. It looks to me as though this needs to be some number larger than the biggest possible distance between pos and ind. But for very long strings (tens of thousands of letters), this won't be the case. It would be more reliable to start at infinity:

```
min_distance = flo

Code Snippets

>>> mlcs(('ab', 'aba'))
'a'
>>> mlcs(())
Traceback (most recent call last):
    ...
IndexError: list index out of range
>>> mlcs(('', ''))
Traceback (most recent call last):
    ....
AttributeError: 'NoneType' object has no attribute 'values'
for i in range(len(x)):
    tx[x[i]] += [i]
for i, letter in enumerate(x):
    tx[letter].append(i)

Context

StackExchange Code Review Q#90194, answer score: 7

Revisions (0)

No revisions yet.