Counting occurrences of values in C Array (Shannon Entropy)
Problem
I have written the following C code to calculate the Shannon entropy of a distribution of 8-bit integers. But obviously this is very inefficient for small arrays, and it won't work with, say, 32-bit integers, because that would require literally gigabytes of memory. I am not very experienced in C and don't know what the best approach would be here. If it would simplify things, I could use C++ or Objective-C...
Also, please tell me about any other issues with the code you may find :-)
By the way, this is the formula I implemented: $$ H(X) = -\sum_{x} p(x) \log(p(x))$$
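For concreteness, here is a worked instance of that formula (my example, not part of the question): a fair coin has two outcomes with $p(0) = p(1) = \tfrac{1}{2}$, so
$$ H(X) = -\left(\tfrac{1}{2}\log_2\tfrac{1}{2} + \tfrac{1}{2}\log_2\tfrac{1}{2}\right) = 1 \text{ bit.} $$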
double entropyOfDistribution(uint8_t *x, size_t length)
{
    double entropy = 0.0;
    // Counting number of occurrences of a number (using "buckets")
    double *probabilityOfX = calloc(sizeof(double), 256);
    for (size_t i = 0; i < length; i++)
        probabilityOfX[x[i]] += 1.0;
    double sum = 0.0;
    for (int i = 0; i < 256; i++) {
        probabilityOfX[i] /= (double)length;
        if (probabilityOfX[i] > 0.0)
            sum += probabilityOfX[i] * log2(probabilityOfX[i]);
    }
    entropy = -1.0 * sum;
    free(probabilityOfX);
    return entropy;
}
Solution
Firstly, you know the number of buckets you want to use here, so there is no reason to use dynamic allocation (specifically, double *probabilityOfX = calloc(sizeof(double), 256);). This could simply be:
double probabilityOfX[256];
memset(probabilityOfX, 0, sizeof probabilityOfX);
You don't need to free this memory at the end then, either, reducing the possibility of memory leaks.
Of course, this is fine for small values (like what can fit in a uint8_t); however, with a uint32_t (or larger), this pre-allocates a large array which could potentially be very sparse. In this case, what you actually want is a dictionary data structure (like a hashmap). Since C doesn't have anything like this built in, I'm going to switch over to C++ so we can use std::unordered_map and some other nice things like std::vector (instead of raw uintx_t pointers):
double entropyOfDistribution(const std::vector<uint32_t>& vec)
{
    std::unordered_map<uint32_t, unsigned> counts;
    // Store the number of counts
    for (uint32_t value : vec) {
        ++counts[value];
    }
    double sum = 0.0;
    // Note the cast: otherwise we'd be doing integer
    // division and hence rounding to an int -
    // thanks @syb0rg for pointing that out.
    const double num_samples = static_cast<double>(vec.size());
    for (auto it = counts.begin(); it != counts.end(); ++it) {
        double probability = it->second / num_samples;
        sum += probability * log2(probability);
    }
    return -1.0 * sum;
}
Context
StackExchange Code Review Q#45444, answer score: 7