patternpythonMinor
Suffix array construction in \$ O(n \log^2 n) \$
Viewed 0 times
suffixconstructionarraylog
Problem
Here is my implementation of the suffix array construction algorithm which follows this paper (specifically, pages 4 and 6).
This implementation looks correct, here are the unit tests I have tried.
My concerns are:
#!/usr/bin/env python3
from math import ceil, log2
def sufar(txt):
txt = txt + chr(0)
N, tokens = len(txt), sorted(set(t for t in txt))
equivalence = {t: i for i, t in enumerate(tokens)}
c, r = [equivalence[t] for t in txt], [(0, 0, 0)]
for i in range(1, ceil(log2(N)) + 1):
n = 2 ** (i - 1)
r = sorted([(c[j], c[(j + n) % N], j) for j in range(N)])
c[r[0][2]] = 0
for j in range(1, N):
c[r[j][2]] = c[r[j - 1][2]]
if r[j][0:2] != r[j - 1][0:2]:
c[r[j][2]] += 1
return [result[2] for result in r][1:]This implementation looks correct, here are the unit tests I have tried.
My concerns are:
txt = txt + chr(0)to add a special symbol$which is not supposed to be intxt.
- using
if r[j][0:2] != r[j - 1][0:2]to understand if the suffixes at positionsjandj - 1are from the same equivalence class. I would expect the comparison to beif r[j][0:1] != r[j - 1][0:1]but that fails the unit tests.
Solution
- Review
I'm reviewing the updated version of the code only.
-
There's no docstring. What does this function do? How do I call it? What does it return? Can you give an example of what it does?
-
The names are obscure. We're not running out of letters, so
suffix_array would be clearer to the reader than sufar. (You said in comments that this was to make it "quicker to type" but in that case maybe you need a better editor? Most programming editors these days have some kind of completion facility on names.)-
The algorithm is tricky enough that it needs explanation. My advice would be to add comments to the loop explaining the invariants that holds at each iteration. (Figuring out loop invariants is invaluable for understand how an algorithm works, and they can be turned into assertions that check the correctness of the implementation.)
-
The code does not actually implement the algorithm in the Vladu/Negruşeri paper. A comparison with their C implementation on page 6 shows that there are two points of difference.
(i) When preparing the
L array (result in sufar), if i + cnt is out of bounds (j + n in sufar), the dummy element is used, but in sufar the index is wrapped around to get cls[(j + n) % N].(ii) When computing the new
P array (cls in sufar), if L[i] differs from L[i - 1] then the entry in P is assigned the value i. However, in sufar, the entry in cls is assigned the value cls[result[j - 1][2]] + 1.Nonetheless, despite these differences in the implementation,
sufar gets the correct result!In case (i) that's because the original algorithm imagines extending the string with enough dummy characters to make the suffixes the same length (thus
abac becomes abac$$$$ and the 4-character suffixes are abac, bac$, ac$$, c$$$, and $$$$ which gets discarded), whereas sufar images extending the string with a single dummy character and then a repeat of itself (thus abac becomes abac$aba and the 4-character suffixes are abac, bac$, ac$a, c$ab and $aba which gets discarded). But the single $ is enough for these to come out in the correct order.In case (ii) we only care about the sorted order of the entries in
P, not their actual values, so the change makes no difference.I think that difference (ii) is fine, but (i) is sufficiently tricky that I would avoid it if at all possible. (Or else write a comment explaining why I did it.)
-
To make it easy for the reader to check that you've correctly implemented the algorithm in the paper, I would use the same notation as the paper unless there was a very good reason not to. So
txt → A, cls → P, result → L, N → n, n → count, etc.-
Instead of
set(t for t in txt), just write set(txt).-
Instead of adding
chr(0) to txt and then making equivalence classes for the entries in txt (which relies on chr(0) not already appearing in txt), make the equivalence class first, and then append an equivalence class that you know is not present, for example -1.-
Instead of computing
ceil(log2(N)), start with count = 1 and double it each iteration until it is too large. This avoids the dependence on the math module.-
It turns out that there's no need to actually add a dummy value to the text in the first place. If we avoid this, then we can also avoid adding 1 to
N and avoid having to drop the initial element of result at the end.-
The initialization of the
cls array based on the sorted equivalence classes of the length-1 substrings of txt is really just a special case of the way that cls is updated in the loop. If the loop were turned upside down so that it computes cls first and then result, you could arrange to avoid the special initialization of cls. See below for how this is done.-
Another benefit of turning the loop upside down is that it avoids a wasteful computation of
cls on the last iteration that will never be used.-
Numeric tuple lookups like the
[2] in result[j][2] are hard to understand. It's usually better to unpack the tuple into local variables so that you can refer to its elements by name.-
In Python it's usually best to avoid iterating over sequence indices, if possible, and to prefer iterating over the elements directly. Here the computation of
cls involves iterating over pairs of adjacent elements in result; you could use the pairwise recipe from the itertools module, or combine zip and itertools.islice as I've done below.- Revised code
```
from itertools import chain, islice
def suffix_array(A):
"""Return a list of the starting positions of the suffixes of the
sequence A in sorted order.
For example, the suffixes of ABAC, in sorted order, are ABAC, AC,
BAC and C, starting at positions 0, 2, 1, and 3 respectively:
>>> suffix_array('ABAC')
[0, 2, 1, 3]
"""
# This implements the algorithm of Vladu and Negruşeri; see
# http://web.stanford.edu
Code Snippets
from itertools import chain, islice
def suffix_array(A):
"""Return a list of the starting positions of the suffixes of the
sequence A in sorted order.
For example, the suffixes of ABAC, in sorted order, are ABAC, AC,
BAC and C, starting at positions 0, 2, 1, and 3 respectively:
>>> suffix_array('ABAC')
[0, 2, 1, 3]
"""
# This implements the algorithm of Vladu and Negruşeri; see
# http://web.stanford.edu/class/cs97si/suffix-array.pdf
L = sorted((a, i) for i, a in enumerate(A))
n = len(A)
count = 1
while count < n:
# Invariant: L is now a list of pairs such that L[i][1] is the
# starting position in A of the i'th substring of length
# 'count' in sorted order. (Where we imagine A to be extended
# with dummy elements as necessary.)
P = [0] * n
for (r, i), (s, j) in zip(L, islice(L, 1, None)):
P[j] = P[i] + (r != s)
# Invariant: P[i] is now the position of A[i:i+count] in the
# sorted list of unique substrings of A of length 'count'.
L = sorted(chain((((P[i], P[i+count]), i) for i in range(n - count)),
(((P[i], -1), i) for i in range(n - count, n))))
count *= 2
return [i for _, i in L]Context
StackExchange Code Review Q#87335, answer score: 7
Revisions (0)
No revisions yet.