HiveBrain v1.2.0
Get Started
← Back to all entries
patternMinor

What properties must a string satisfy to be a possible output of the Burrows-Wheeler transform?

Submitted by: @import:stackexchange-cs··
0
Viewed 0 times
themustwhatpropertiesburrowswheeleroutputsatisfypossibletransform

Problem

The Buttows-Wheeler transform is a function that takes a string $S$ to a new string $S'$ and an index $I$. The new string $S'$ is always an anagram (rearrangement) of the input string $S$.

However, if I pick a random string $S'$ and index $I$, the inverse Burrows-Wheeler transform usually fails. For example, if $S'$ is "QUICK", the (intuitive, block-sorting) inverse method results in the following block:

CQCQC
IKUIK
KUIKU
QCQCQ
UIKUI


Clearly, none of the rows is an anagram of "QUICK", no matter the choice of $I$.

What properties must the string $S'$ satisfy so that the block contains at least one anagram of $S'$? (I suspect that if one of them is an anagram, they all are, but I don’t know how to prove it.)

Solution

Consider a string with repeats of a letter, for example BANANA. The forward transform will create this block:

ABANAN
ANABAN
ANANAB
BANANA
NABANA
NANABA


Now imagine we label every occurrence of A with 1, 2 or 3 depending on which A in the original word it is.

Observe that the A’s in the leftmost column are in the same order as in the rightmost column. This is because they are both sorted by whatever text comes after them in the original string.

Now imagine we take an arbitrary string $S'$ of length $n$ and label the letters in alphabetical order, but identical letters in order of occurrence. For example, for BANANA the labeling would be:

BANANA
415263

AAABNN  (sorted)
123456


This list of digits from $1$ to $n$ can be interpreted as a permutation (rearrangement).

In order to be a possible output of the Burrows-Wheeler transform, the string $S'$ must have one of the following properties:

  • all letters in the string are the same; or



  • the permutation is a full cycle (i.e. it has order $n$).



For example, the word “QUICK” does not work because Q and C form a separate cycle from the rest of the letters:

By contrast, “QUIKC” works because the permutation of letters now forms a single complete cycle:

This property of $S'$ is necessary because without it, the inverse BWT couldn’t reconstruct the string $S$: instead of visiting every character, it would visit only the characters contained in one of the cycles (and visit them multiple times).

This property of $S'$ is also sufficient because once it is met, the inverse BWT can be performed on it to produce a string $S_i$ (with any index $i \in \left\{0..n-1\right\}$). The result of this is a string which will reconstitute $S'$ when run through BWT again.

Code Snippets

ABANAN
ANABAN
ANANAB
BANANA
NABANA
NANABA
BANANA
415263

AAABNN  (sorted)
123456

Context

StackExchange Computer Science Q#150126, answer score: 2

Revisions (0)

No revisions yet.