patterncppMinor
Perfect Hashing Implementation
Viewed 0 times
hashingimplementationperfect
Problem
I'm building a hash table for this problem:
Implement a class for storing a set of integers:
After the call of the
The
Using this class, solve the following problem. The input has a set of different numbers followed by a set of requests, each of which is represented by an integer number. For each request, you need to determine whether or not the number is in the set.
The sizw of set is not bigger than 100000 elements; the absolute values of elements |val| 9.
I've implemented perfect hashing from Cornmen:
```
#include
#include
#include
#include
#include
#include
using std::cout;
using std::cin;
using std::endl;
using std::vector;
using std::list;
typedef long long int long_int;
const int max_int = 1000000001; // value, that could't be in the table. Analog of NULL.
// function for calculation of hash
inline int hash(long_int a_prime, long_int b_prime, int p_prime, int table_size, int key)
{
return (((a_prime * key + b_prime) % p_prime) % table_size);
}
// class for mini-hash table in cells of main hash-table
class Bucket
{
vector _cells;
int size; // the size of mini-table should be greater then 4
long_int hash_a;
long_int hash_b;
int prime;
public:
Bucket() {}
void Initialize()
{
prime = 17;
hash_a = std::rand() % prime;
hash_b = 1 + std::rand() % (prime - 1);
Implement a class for storing a set of integers:
class FixedSet {
public:
FixedSet();
void Initialize(const vector& numbers);
bool Contains(int number) const;
};After the call of the
Initialize() function, FixedSet contains integers from the input vector. The set of numbers stored by FixedSet does not change (until the Initialize() function is called again).The
Contains() function returns true if number is in the set, or false otherwise. The Initialize() function should run in expected O(n), where n is the number of elements in numbers. The memory consumption should not exceed O(n). The Contains() function should run in guaranteed O(1).Using this class, solve the following problem. The input has a set of different numbers followed by a set of requests, each of which is represented by an integer number. For each request, you need to determine whether or not the number is in the set.
The sizw of set is not bigger than 100000 elements; the absolute values of elements |val| 9.
I've implemented perfect hashing from Cornmen:
```
#include
#include
#include
#include
#include
#include
using std::cout;
using std::cin;
using std::endl;
using std::vector;
using std::list;
typedef long long int long_int;
const int max_int = 1000000001; // value, that could't be in the table. Analog of NULL.
// function for calculation of hash
inline int hash(long_int a_prime, long_int b_prime, int p_prime, int table_size, int key)
{
return (((a_prime * key + b_prime) % p_prime) % table_size);
}
// class for mini-hash table in cells of main hash-table
class Bucket
{
vector _cells;
int size; // the size of mini-table should be greater then 4
long_int hash_a;
long_int hash_b;
int prime;
public:
Bucket() {}
void Initialize()
{
prime = 17;
hash_a = std::rand() % prime;
hash_b = 1 + std::rand() % (prime - 1);
Solution
I tried running your implementation on my machine but with the following input (input is 25 random integers within the interval
Note: Never in this context means I killed it after 10 or 15 seconds.
So it's not surprising that a list with 50,000 random elements can easily result in a higher setup time than 100,000 depending on the distribution of the values (the algorithm attempts to re-create the table every time there is a collision). As per given example it's easy to generate inputs which will make it run for a long time for very few values.
Given the problem restrictions I'd say you can get away with the fastest perfect hash function there is: Identity.
You could simply use a bitset with the values as indices and flip the bits on if the value is present. This requires a bitset with 10^9 bits - approx. 125MB of memory which is not all that much these days (at least on desktops).
Resulting implementation:
Searching for 10,000 random elements takes about 3ms
Some notes:
Update
I think the main problem with your implementation is that the bucket size is fairly small (from testing it looks like you rarely get more than 6 or 7 collisions). Especially in buckets with more than 4 collisions your hash function actually throws away all hashes larger than 17 (which is
While reading up on hashing, a popular version seems to be Cuckoo Hashing which has a constant look up time (if you use two hash functions it will perform at most two lookups) and also a constant worst case insert time for an element (even if you have to rehash).
I threw together a quick hack implementation which does not do re-hashing and just relies on the sets being big enough:
Notes:
[0; 10^9]) Initialize never completes (it's stuck in the bucket while (flag) loop):25
882868245
264589055
955665379
570725902
186426836
425509062
780811177
528755197
921593609
210302061
162860187
237314629
771563954
716724339
500613765
749586096
118952462
708453275
530816792
697958285
841037949
796725013
123270367
470484394
578476359
1
252678354Note: Never in this context means I killed it after 10 or 15 seconds.
So it's not surprising that a list with 50,000 random elements can easily result in a higher setup time than 100,000 depending on the distribution of the values (the algorithm attempts to re-create the table every time there is a collision). As per given example it's easy to generate inputs which will make it run for a long time for very few values.
Given the problem restrictions I'd say you can get away with the fastest perfect hash function there is: Identity.
You could simply use a bitset with the values as indices and flip the bits on if the value is present. This requires a bitset with 10^9 bits - approx. 125MB of memory which is not all that much these days (at least on desktops).
Resulting implementation:
class FixedSet
{
vector _set;
public:
FixedSet() : _set(1000000000) { }
void Initialize(const vector& numbers)
{
for (int i = 0; i < numbers.size(); ++i)
{
_set[numbers[i]] = true;
}
}
bool Contains(int number)
{
return _set[number];
}
};Initialize takes approx 45ms on my machine - it doesn't matter if it's 1 or 100,000 input elements. Apparently allocating the memory for the bitset takes the bulk of the time. Flipping the bits on seems to hardly take any time at all.Searching for 10,000 random elements takes about 3ms
Some notes:
- In case you are wondering:
std::vectoris a specialization of a vector packing bools as bits. This container is considered somewhat as a failure in the C++ standard but happens to do exactly what we need here.
- This implementation of
FixedSethas space complexityO(1)(albeit a fairly big constant)
- Initializing it has time complexity
O(n)although the difference between 1 and 100,000 input elements is not measurable with the provided timing method
- Looking up an element has time complexity of
O(1)(indexed container access)
- Disadvantages:
- Fairly big initial upfront cost for setup (memory allocation)
- Might not have been in the spirit of the assignment
Update
I think the main problem with your implementation is that the bucket size is fairly small (from testing it looks like you rarely get more than 6 or 7 collisions). Especially in buckets with more than 4 collisions your hash function actually throws away all hashes larger than 17 (which is
p_prime you pass in from bucket) so you reduce your available bucket space. Also my suspicion is that in your hash function a_prime and b_prime should actually be, well, prime but rand() % prime does not yield a prime number.While reading up on hashing, a popular version seems to be Cuckoo Hashing which has a constant look up time (if you use two hash functions it will perform at most two lookups) and also a constant worst case insert time for an element (even if you have to rehash).
I threw together a quick hack implementation which does not do re-hashing and just relies on the sets being big enough:
class FixedSet
{
// min 36500, sweetspot 73000
static const int _size = 73000;
int _noentry;
std::vector _set1;
std::vector _set2;
std::vector _set3;
public:
FixedSet() : _noentry(std::numeric_limits::min()), _set1(_size, _noentry), _set2(_size, _noentry), _set3(_size, _noentry) { }
void Initialize(const vector& numbers)
{
for (int i = 0; i & _set = (round % 3 == 0) ? _set1 : ((round % 3 == 1) ? _set2 : _set3);
int h = hash(round + 1, number);
if (number == _set[h])
return true;
}
return false;
}
private:
int hash(int rounds, int number)
{
int withOffset = number;
srand(withOffset);
int h = bigRand() % _size;
while (--rounds)
{
h = bigRand() % _size;
}
return h;
}
inline int bigRand()
{
return (rand() & _set = (i % 3 == 0) ? _set1 : ((i % 3 == 1) ? _set2 : _set3);
int current = _set[h];
if (current == _noentry)
{
_set[h] = toInsert;
return true;
}
_set[h] = toInsert;
toInsert = current;
}
return false;
}
};Notes:
- I'm using 3 hash functions instead of the classic 2. Can't get the load factor high enough with just 2 functions for some reason.
- The above is tweaked for no more than 100,000 input elements
- `
Code Snippets
25
882868245
264589055
955665379
570725902
186426836
425509062
780811177
528755197
921593609
210302061
162860187
237314629
771563954
716724339
500613765
749586096
118952462
708453275
530816792
697958285
841037949
796725013
123270367
470484394
578476359
1
252678354class FixedSet
{
vector<bool> _set;
public:
FixedSet() : _set(1000000000) { }
void Initialize(const vector<int>& numbers)
{
for (int i = 0; i < numbers.size(); ++i)
{
_set[numbers[i]] = true;
}
}
bool Contains(int number)
{
return _set[number];
}
};class FixedSet
{
// min 36500, sweetspot 73000
static const int _size = 73000;
int _noentry;
std::vector<int> _set1;
std::vector<int> _set2;
std::vector<int> _set3;
public:
FixedSet() : _noentry(std::numeric_limits<int>::min()), _set1(_size, _noentry), _set2(_size, _noentry), _set3(_size, _noentry) { }
void Initialize(const vector<int>& numbers)
{
for (int i = 0; i < numbers.size(); ++i)
{
if (!add(numbers[i]))
{
std::ostringstream o;
o << "Failed to insert after " << i << " elements, rehashing not implemented";
throw new std::exception(o.str().c_str());
}
}
}
bool Contains(int number)
{
for (int round = 0; round < 3; ++round)
{
std::vector<int>& _set = (round % 3 == 0) ? _set1 : ((round % 3 == 1) ? _set2 : _set3);
int h = hash(round + 1, number);
if (number == _set[h])
return true;
}
return false;
}
private:
int hash(int rounds, int number)
{
int withOffset = number;
srand(withOffset);
int h = bigRand() % _size;
while (--rounds)
{
h = bigRand() % _size;
}
return h;
}
inline int bigRand()
{
return (rand() << 15) | rand(); // RAND_MAX is 0x7FFF in VS2010
}
bool add(int number)
{
int toInsert = number;
for (int i = 0; i < _size; ++i)
{
int h = hash(i % 3 + 1, toInsert);
std::vector<int>& _set = (i % 3 == 0) ? _set1 : ((i % 3 == 1) ? _set2 : _set3);
int current = _set[h];
if (current == _noentry)
{
_set[h] = toInsert;
return true;
}
_set[h] = toInsert;
toInsert = current;
}
return false;
}
};Context
StackExchange Code Review Q#35220, answer score: 6
Revisions (0)
No revisions yet.