patterncsharpCritical
Calculating entropy of a string
Viewed 0 times
calculatingentropystring
Problem
We're calculating entropy of a string a few places in Stack Overflow as a signifier of low quality.
I whipped up this simple method which counts unique characters in a string, but it is quite literally the first thing that popped into my head. It's the "dumbest thing that works".
Is there a better / more elegant / more accurate way to calculate the entropy of a string?
Efficiency is also good, though we never call this on large strings so it is not a huge concern.
I whipped up this simple method which counts unique characters in a string, but it is quite literally the first thing that popped into my head. It's the "dumbest thing that works".
///
/// returns the # of unique characters in a string as a rough
/// measurement of entropy
///
public static int Entropy(this string s)
{
var d = new Dictionary();
foreach (char c in s)
if (!d.ContainsKey(c)) d.Add(c, true);
return d.Count();
}Is there a better / more elegant / more accurate way to calculate the entropy of a string?
Efficiency is also good, though we never call this on large strings so it is not a huge concern.
Solution
I also came up with this, based on Shannon entropy.
In information theory, entropy is a measure of the uncertainty associated with a random variable. In this context, the term usually refers to the Shannon entropy, which quantifies the expected value of the information contained in a message, usually in units such as bits.
It is a more "formal" calculation of entropy than simply counting letters:
Results are:
"abcdefghijklmnop" = 4.00
"Hello, World!" = 3.18
"hello world" = 2.85
"123123123123" = 1.58
"aaaa" = 0
In information theory, entropy is a measure of the uncertainty associated with a random variable. In this context, the term usually refers to the Shannon entropy, which quantifies the expected value of the information contained in a message, usually in units such as bits.
It is a more "formal" calculation of entropy than simply counting letters:
///
/// returns bits of entropy represented in a given string, per
/// http://en.wikipedia.org/wiki/Entropy_(information_theory)
///
public static double ShannonEntropy(string s)
{
var map = new Dictionary();
foreach (char c in s)
{
if (!map.ContainsKey(c))
map.Add(c, 1);
else
map[c] += 1;
}
double result = 0.0;
int len = s.Length;
foreach (var item in map)
{
var frequency = (double)item.Value / len;
result -= frequency * (Math.Log(frequency) / Math.Log(2));
}
return result;
}Results are:
"abcdefghijklmnop" = 4.00
"Hello, World!" = 3.18
"hello world" = 2.85
"123123123123" = 1.58
"aaaa" = 0
Code Snippets
/// <summary>
/// returns bits of entropy represented in a given string, per
/// http://en.wikipedia.org/wiki/Entropy_(information_theory)
/// </summary>
public static double ShannonEntropy(string s)
{
var map = new Dictionary<char, int>();
foreach (char c in s)
{
if (!map.ContainsKey(c))
map.Add(c, 1);
else
map[c] += 1;
}
double result = 0.0;
int len = s.Length;
foreach (var item in map)
{
var frequency = (double)item.Value / len;
result -= frequency * (Math.Log(frequency) / Math.Log(2));
}
return result;
}Context
StackExchange Code Review Q#868, answer score: 97
Revisions (0)
No revisions yet.