patternMinor
Looking for a set implementation with small memory footprint
Viewed 0 times
withlookingsmallformemoryfootprintimplementationset
Problem
I am looking for implementation of the set data type. That is, we have to
I don't care about other operations. For orientation, in applications I'm working with we have $u \approx 10^{10}$.
I know of implementations that provide both operations in time $O(1)$, so I worry mostly about the size of data structure. I expect billions of entries but want to avoid swapping as much as possible.
I am willing to sacrifice runtime if necessary. Amortised runtime of $O(\log n)$ is what I can admit; expected runtimes or runtimes in $\omega(\log n)$ are not admissable.
One idea I have is that if $S$ can be represented as a union of ranges
Could you please point me to data structures which can do something like that?
- maintain a dynamic subset $S$ (of size $n$) from the universe $U = \{0, 1, 2, 3, \dots , u – 1\}$ of size $u$ with
- operations
insert(x)(add an elementxto $S$) andfind(x)(checks whether elementxis a member of $S$).
I don't care about other operations. For orientation, in applications I'm working with we have $u \approx 10^{10}$.
I know of implementations that provide both operations in time $O(1)$, so I worry mostly about the size of data structure. I expect billions of entries but want to avoid swapping as much as possible.
I am willing to sacrifice runtime if necessary. Amortised runtime of $O(\log n)$ is what I can admit; expected runtimes or runtimes in $\omega(\log n)$ are not admissable.
One idea I have is that if $S$ can be represented as a union of ranges
[xmin, xmax], then we will be able to save on storage size with the price of some performance decrease. Also, some other data patterns are possible, like [0, 2, 4, 6].Could you please point me to data structures which can do something like that?
Solution
Joe's answer is extremely good, and gives you all the important keywords.
You should be aware that succinct data structure research is still in an early stage, and many of the results are largely theoretical. Many of the proposed data structures are quite complex to implement, but most of the complexity is due to the fact that you need to maintain asymptotic complexity both over the universe size and the number of elements stored. If either one of these is relatively constant, then a lot of the complexity goes away.
If the collection is semi-static (that is, inserts are rare, or at least low-volume), then it's certainly worth considering an easy-to-implement static data structure (Sadakane's sdarray is a fine choice) in conjunction with an update cache. Basically, you record updates in a traditional data structure (e.g. B-tree, trie, hash table), and periodically bulk-update the "main" data structure. This is a very popular technique in information retrieval, since inverted indexes have many advantages for searching but are hard to update in-place. If this is the case, please let me know in a comment and I'll amend this answer to give you some pointers.
If inserts are more frequent, then I suggest succinct hashing. The basic idea is straightforward enough to explain here, so I will do so.
So the basic information theoretic result is that if you're storing $n$ elements from a universe of $u$ items, and there is no other information (e.g. no correlation between the elements) then you need $\log {u \choose n} + O(1)$ bits to store it. (All logarithms are base-2 unless otherwise specified.) You need this many bits. There is no way around it.
Now some terminology:
The difference between succinct and compact is the difference between little-oh and big-oh. Ignoring the absolute-value thing for a moment...
Informally, big-oh and little-oh are both "within a constant factor", but with big-oh the constant is chosen for you (by the algorithm designer, the CPU manufacturer, the laws of physics or whatever), but with little-oh you pick the constant yourself and it can be as small as you like. To put it another way, with succinct data structures, the relative overhead gets arbitrarily small as the size of the problem increases.
Of course, the size of the problem may have to get huge to realise the relative overhead that you want, but you can't have everything.
OK, with that under our belts, let's put some numbers on the problem. Let's supposed that keys are $n$-bit integers (so the universe size is $2^n$), and we want to store $2^m$ of these integers. Let's suppose that we can magically arrange an idealised hash table with full occupancy and no wastage, so that we need exactly $2^m$ hash slots.
A lookup operation would hash the $n$-bit key, mask off $m$ bits to find the hash slots, and then check to see if the value in the table matched the key. So far, so good.
Such a hash table uses $n 2^m$ bits. Can we do better than this?
Suppose that the hash function $h$ is invertible. Then we don't have to store the whole key in each hash slot. The location of the hash slot gives you $m$ bits of the hash value, so if you only stored the $n-m$ remaining bits, you can reconstruct the key from those two pieces of information (the hash slot location and the value stored there). So you would only need $(n - m) 2^m$ bits of storage.
If $2^m$ is small compared with $2^n$, Stirling's approximation and a little arithmetic (proof is an exercise!) reveals that:
$$(n - m) 2^m = \log {2^n \choose 2^m} + o\left(\log {2^n \choose 2^m}\right)$$
So this data structure is succinct.
However, there are two catches.
The first catch is constructing "good" invertible hash functions. Fortunately, this is much easier than it looks; cryptographers make invertible functions all the time, only they call them "cyphers". You could, for example, base a hash function on a Feistel network, which is a straightforward way to construct an inve
You should be aware that succinct data structure research is still in an early stage, and many of the results are largely theoretical. Many of the proposed data structures are quite complex to implement, but most of the complexity is due to the fact that you need to maintain asymptotic complexity both over the universe size and the number of elements stored. If either one of these is relatively constant, then a lot of the complexity goes away.
If the collection is semi-static (that is, inserts are rare, or at least low-volume), then it's certainly worth considering an easy-to-implement static data structure (Sadakane's sdarray is a fine choice) in conjunction with an update cache. Basically, you record updates in a traditional data structure (e.g. B-tree, trie, hash table), and periodically bulk-update the "main" data structure. This is a very popular technique in information retrieval, since inverted indexes have many advantages for searching but are hard to update in-place. If this is the case, please let me know in a comment and I'll amend this answer to give you some pointers.
If inserts are more frequent, then I suggest succinct hashing. The basic idea is straightforward enough to explain here, so I will do so.
So the basic information theoretic result is that if you're storing $n$ elements from a universe of $u$ items, and there is no other information (e.g. no correlation between the elements) then you need $\log {u \choose n} + O(1)$ bits to store it. (All logarithms are base-2 unless otherwise specified.) You need this many bits. There is no way around it.
Now some terminology:
- If you have a data structure which can store the data and support your operations in $\log {u \choose n} + O(1)$ bits of space, we call this an implicit data structure.
- If you have a data structure which can store the data and support your operations in $\log {u \choose n} + O(\log {u \choose n}) = (1 + O(1)) \log {u \choose n}$ bits of space, we call this a compact data structure. Note that in practice this means that the relative overhead (relative to the theoretical minimum) is within a constant. It could be 5% overhead, or 10% overhead, or 10 times overhead.
- If you have a data structure which can store the data and support your operations in $\log {u \choose n} + o(\log {u \choose n}) = (1 + o(1)) \log {u \choose n}$ bits of space, we call this a succinct data structure.
The difference between succinct and compact is the difference between little-oh and big-oh. Ignoring the absolute-value thing for a moment...
- $g(n) = O(f(n))$ means that there exists a constant $c$ and a number $n_0$ such that for all $n > n_0$, $g(n)
- $g(n) = o(f(n))$ means that for all constants $c$ there exists a number $n_0$ such that for all $n > n_0$, $g(n)
Informally, big-oh and little-oh are both "within a constant factor", but with big-oh the constant is chosen for you (by the algorithm designer, the CPU manufacturer, the laws of physics or whatever), but with little-oh you pick the constant yourself and it can be as small as you like. To put it another way, with succinct data structures, the relative overhead gets arbitrarily small as the size of the problem increases.
Of course, the size of the problem may have to get huge to realise the relative overhead that you want, but you can't have everything.
OK, with that under our belts, let's put some numbers on the problem. Let's supposed that keys are $n$-bit integers (so the universe size is $2^n$), and we want to store $2^m$ of these integers. Let's suppose that we can magically arrange an idealised hash table with full occupancy and no wastage, so that we need exactly $2^m$ hash slots.
A lookup operation would hash the $n$-bit key, mask off $m$ bits to find the hash slots, and then check to see if the value in the table matched the key. So far, so good.
Such a hash table uses $n 2^m$ bits. Can we do better than this?
Suppose that the hash function $h$ is invertible. Then we don't have to store the whole key in each hash slot. The location of the hash slot gives you $m$ bits of the hash value, so if you only stored the $n-m$ remaining bits, you can reconstruct the key from those two pieces of information (the hash slot location and the value stored there). So you would only need $(n - m) 2^m$ bits of storage.
If $2^m$ is small compared with $2^n$, Stirling's approximation and a little arithmetic (proof is an exercise!) reveals that:
$$(n - m) 2^m = \log {2^n \choose 2^m} + o\left(\log {2^n \choose 2^m}\right)$$
So this data structure is succinct.
However, there are two catches.
The first catch is constructing "good" invertible hash functions. Fortunately, this is much easier than it looks; cryptographers make invertible functions all the time, only they call them "cyphers". You could, for example, base a hash function on a Feistel network, which is a straightforward way to construct an inve
Context
StackExchange Computer Science Q#20070, answer score: 8
Revisions (0)
No revisions yet.