HiveBrain v1.2.0
Get Started
← Back to all entries
patternMinor

Succinct data structure to return occurences of letters until an index

Submitted by: @import:stackexchange-cs··
0
Viewed 0 times
untilreturnoccurencesstructuresuccinctindexdataletters

Problem

I want to try and use a succinct type data structure in order to find the amount of occurences of a letter $C$ in a string $S$ until a given index $I$.

Assume we have a string $S$ of length $n$ over the alphabet $\Sigma = \{1,2,3,\ldots,n\}$. I want to build a data structure whose space usage is at most $O(n)$ machine-size words.

Given $C$ (a valid letter from $\Sigma$) and an index $i$ such that $0 \le i \lt n$, the data structure should be able to return the number of occurrences of $C$ in $S[1..i]$ as efficiently as possible (but not necessarily in $O(1)$ time). This is the only query possible.

The data structure is aware of $S$ when created.

What I've tried to far is to split the string $S$ into blocks and to store info about each block with the number of occurrences of each letter in the previous block. I couldn't find how to make it lighter so the memory complexity be as requested.

Solution

To put it another way, you want to implement rank queries on an arbitrary string with an arbitrary alphabet.

If $n$ is of modest size, the usual approach is to use a Wavelet tree, associating a succinct binary rank index with each node in the tree. The shape of the tree is arbitrary, but using a Huffman tree gives you a data structure that is very close to zero-order entropy compression.

When the alphabet is very large, though, the size and implementation of the tree itself becomes a concern. A way to fix this which seems to work well in practice is to store $\left\lceil \log_2 n \right\rceil$ rank-select indexes of length $|S|$, but use an index implementation that it sensitive to local variation in density, such as esp or RRR.

This is covered in Claude & Navarro's 2008 paper, Practical Rank/Select Queries over Arbitrary Sequences. There may be some more recent work than this if you chase citations, but that's a good start.

Context

StackExchange Computer Science Q#67287, answer score: 5

Revisions (0)

No revisions yet.