HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonMajor

Increasing speed of BWT inverse

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
bwtinversespeedincreasing

Problem

I think understanding basic data compression and pattern finding / entropy reduction primitives is really important. Mostly the basic algorithms (before the many tweaks and optimizations are made) are really elegant and inspiring. What's even more impressive is how widespread they are, but perhaps also how little they are really understood except outside of people working in compression directly. So I want to understand.

I understand and have implemented the naive implementations of LZ77 (just find the maximal prefix that existed before the current factor) and LZ78 (just see if we can extend a factor that occurred before at least twice, or we make a new record and start a new factor). I also understand the naive version of Burrows-Wheeler Transform (maintain the invariant that the first column is sorted, and keep adding that column and resorting until the matrix is square).

I also understand why all these (parts of) compression algorithms work:

  • LZ77 asymptotically optimal since in the long term and with an infinite window all factors that exist are composed of prefixes earlier in the input which LZ77 can find.



  • LZ78 is also pretty good since if a factor occurs at least once before it can be compressed the second time, so long as it is not a suffix of another factor found earlier.



  • BWT, by grouping same letters on the left hand side (of the matrix), and finding patterns in the preceding letters in the right hand side, can exploit bigram repetitions, recursively.



The BWT code below works, but the inverse is not efficient. What are the optimizations, and how can I understand why they work?

BWT

```
def rot(v):
return v[-1] + v[:-1]

def bwt_matrix(s):
return sorted(reduce(lambda m,s : m + [rot(m[-1])],xrange(len(s)-1),[s]))

def last_column(m):
return ''.join(zip(*m)[-1])

def bwt(s):
bwtm = bwt_matrix(s)
print 'BWT matrix : ', bwtm
return bwtm.index(s), ''.join(last_column(bwtm))

def transpose(m):
ret

Solution


  1. Introduction



The only actual question you asked is about how to make the inverse Burrows–Wheeler transform go faster. But as you'll see below, that's plenty for one question here on Code Review. If you have anything to ask about your LZ77 and LZ78 implementations, I suggest you start a new question.

  1. A bug



First, the output of the function ibwt doesn't match the example output in the question:

>>> from pprint import pprint
>>> pprint(ibwt('ssyxxiit'))
['iisstxxy',
'xxiiysts',
'stxxsiyi',
'iystixsx',
'xsiyxtis',
'tixssyxi',
'yxtiissx',
'ssyxxiit']


You should always prepare your question by copying and pasting input and output from source files and from an interactive session, rather than typing in what you think it should be. It's easy to introduce or cover up mistakes if you do the latter.

So it looks as if there is a missing transpose:

>>> pprint(transpose(ibwt('ssyxxiit')))
['ixsixtys',
'ixtysixs',
'sixsixty',
'sixtysix',
'tysixsix',
'xsixtysi',
'xtysixsi',
'ysixsixt']


  1. Avoiding transpositions



With that bug fixed, let's see how long the function takes on a medium-sized amount of data:

>>> import random, string
>>> from timeit import timeit
>>> s = ''.join(random.choice(string.printable) for _ in range(1024))
>>> t = bwt(s)
>>> timeit(lambda:ibwt(t[1]), number=1)
49.83987808227539


Not so good! What can we change? Well, an obvious idea is to cut out all those calls to transpose. The function ibwt transposes the matrix in order to sort it, and then transposes it back again each time round the loop. By keeping the matrix in transposed form throughout, it would be possible to avoid having to call transpose at all:

def ibwt2(s):
    """Return the inverse Burrows-Wheeler Transform matrix based on the
    string s.

    >>> ibwt2('SSYXXIIT')[3]
    'SIXTYSIX'

    """
    matrix = [''] * len(s)
    for _ in s:
        matrix = sorted(i + j for i, j in zip(s, matrix))
    return matrix


That simple change results in a big improvement:

>>> s = ''.join(random.choice(string.printable) for _ in range(1024))
>>> t = bwt(s)
>>> timeit(lambda:ibwt2(t[1]), number=1)
0.9414288997650146


  1. Aside on programming style



Something that's perhaps worth mentioning at this point is that you seem quite keen on transforming code so as to use a functional style (for example in this question, where you wanted to express a loop using reduce). Transforming code into pure functional form is an excellent intellectual exercise, but you have to be aware of two problems that often arise:

-
Pure functional programs can easily end up doing a lot of allocation and copying. For example, it is often tempting to transform your data into a format suitable for passing in to a function (rather than transforming your function so that it operates on the data you have). Thus in ibwt it is convenient to call sorted on the data in row-major order, but convenient to append a new instance of the string s in column-major order. In order to express both of these operations in the most convenient way, the function has to transpose the matrix back and forth, and since it do so functionally, each transposition has to copy the whole matrix.

-
Pursuit of pure functional style often results in tacit or point-free programming where variable names are avoided and data is routed from one function to another using combinators. This can seem very elegant, but variable names have two roles in a program: they don't just move data around, they also act as mnemonics to remind programmers of what the data means. Point-free programs can end up being totally mysterious because you lose track of exactly what is being operated on.

Anyway, aside over. Can we make any more progress on the inverse Burrows–Wheeler transform?

  1. An efficient implementation



If you look at what the inverse Burrows–Wheeler transform does, you'll see that it repeatedly permutes an array of strings (by sorting them). The permutation is exactly the same at each step (exercise: prove this!). For example, if we are given the string SSYXXIIT, then at each step we apply the permutation SSYXXIITIISSTXXY, which (taking note of where repeated letters go) is the permutation 0123456756017342.

This observation allows us to reconstruct a row of the matrix without having to compute the whole matrix. In this example we want to reconstruct row number 3. Well, we know that the final matrix looks like this:

I.......
I.......
S.......
S*......
T.......
X.......
X.......
Y.......


so the first element of the row is S. What's the next letter of this row (marked with *)? Where did this come from? Well, we know that the last step of the transform must have looked like this:

S0...... I5......
S1...... I6......
Y2...... -> S0......
X3...... -> S1......
X4...... -> T7......
I5...... -> X3......
I6...... X4......
T7...... Y2......


So the * must have come from row 1 (whi

Code Snippets

def ibwt2(s):
    """Return the inverse Burrows-Wheeler Transform matrix based on the
    string s.

    >>> ibwt2('SSYXXIIT')[3]
    'SIXTYSIX'

    """
    matrix = [''] * len(s)
    for _ in s:
        matrix = sorted(i + j for i, j in zip(s, matrix))
    return matrix
def ibwt3(k, s):
    """Return the kth row of the inverse Burrows-Wheeler Transform
    matrix based on the string s.

    >>> ibwt3(3, 'SSYXXIIT')
    'SIXTYSIX'

    """
    def row(k):
        permutation = sorted((t, i) for i, t in enumerate(s))
        for _ in s:
            t, k = permutation[k]
            yield t
    return ''.join(row(k))

Context

StackExchange Code Review Q#21058, answer score: 23

Revisions (0)

No revisions yet.