Estimate entropy based upon observed frequency counts
Problem
Suppose I have $n$ independent observations $x_1,\dots,x_n$ from some unknown distribution over a known alphabet $\Sigma$, and I want to estimate the entropy of the distribution. I can count the frequency $f_s$ of each symbol $s \in \Sigma$ among the observations; how should I use them to estimate the Shannon entropy of the source?
The obvious approach is to estimate the probability of each symbol $s$ as $\Pr[X=s]=f_s/n$, and then calculate the entropy using the standard formula for Shannon entropy. This leads to the following estimate of the entropy $H(X)$:
$$\text{estimate}(H(X)) = - \sum_{s \in \Sigma} {f_s \over n} \lg (f_s/n).$$
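As a concrete sketch of the obvious approach, the plug-in estimate can be computed like this (Python; the function name is my own choosing):

```python
from collections import Counter
from math import log2

def plugin_entropy(observations):
    """Plug-in (maximum-likelihood) entropy estimate in bits:
    -sum over observed symbols of (f_s/n) * lg(f_s/n).
    Unseen symbols contribute nothing, since f_s = 0."""
    n = len(observations)
    counts = Counter(observations)
    return -sum((f / n) * log2(f / n) for f in counts.values())

# A sample with two equally frequent symbols gives exactly 1 bit.
print(plugin_entropy(["H", "T", "H", "T"]))  # 1.0
```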
However, this feels like it might not produce the best estimate. Consider, by analogy, the problem of estimating the probability of symbol $s$ based upon its frequency $f_s$. The naive estimate $f_s/n$ is likely an underestimate of its probability. For instance, if I make 100 observations of birds in my back yard and none of them were a hummingbird, should my best estimate of the probability of seeing a hummingbird on my next observation be exactly 0? No; instead, it's probably more realistic to estimate that the probability is something small but not zero. (A zero estimate means that a hummingbird is absolutely impossible, which seems unlikely.)
For the problem of estimating the probability of symbol $s$, there are a number of standard techniques. Additive smoothing (aka Laplace smoothing) is one, where we estimate the probability of symbol $s$ as $\Pr[X=s] = (f_s + 1)/(n+|\Sigma|)$. Others have proposed Bayesian smoothing or other methods. These smoothing methods are widely used in natural language processing and document analysis, where just because a word never appears in your document set doesn't mean that the word has probability zero.
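A minimal sketch of an entropy estimate built on additively smoothed probabilities (Python; the function name and the parameter `k` generalizing the +1 are my own choices):

```python
from collections import Counter
from math import log2

def smoothed_entropy(observations, alphabet, k=1.0):
    """Entropy estimate using additive (Laplace) smoothing:
    estimate Pr[X=s] as (f_s + k) / (n + k*|alphabet|), so every
    symbol in the alphabet gets a small nonzero probability, then
    plug these probabilities into the Shannon entropy formula."""
    n = len(observations)
    counts = Counter(observations)
    denom = n + k * len(alphabet)
    probs = [(counts.get(s, 0) + k) / denom for s in alphabet]
    return -sum(p * log2(p) for p in probs)

# 100 sparrow sightings and no hummingbirds: the smoothed estimate
# is small but positive, whereas the plug-in estimate would be 0.
obs = ["sparrow"] * 100
print(smoothed_entropy(obs, ["sparrow", "hummingbird"]))
```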
So, taking these considerations into account, how should I estimate the entropy?
Solution
Like most things of this nature, the best method is found by empirical evaluation. One thing worth noting is that most smoothing schemes can be thought of as incorporating a prior into your likelihood estimate. For example, if you are trying to estimate the parameter $\theta$ of a binary random variable $X$ and you have data $\mathcal{D} = \{x_1, \ldots, x_n\}$ consisting of i.i.d. realizations of $X$, then your posterior takes the form
$$
\begin{align*}
P(\theta|\mathcal{D})
&\varpropto P(\theta)P(\mathcal{D}|\theta)\\
\end{align*}
$$
Assuming $P(\theta) = B(\alpha,\beta)^{-1}\theta^{\alpha-1}(1-\theta)^{\beta-1}$, i.e., a Beta distribution, and letting $n_1 = |\{x_i \in \mathcal{D} \colon x_i = 1\}|$, we then have
$$
\begin{align*}
P(\theta|\mathcal{D})
&\varpropto \theta^{\alpha-1}(1-\theta)^{\beta-1} \theta^{n_1}(1-\theta)^{n - n_1}\\
&= \theta^{n_1 + \alpha - 1}(1- \theta)^{n - n_1 + \beta - 1}.
\end{align*}
$$
Notice that the effect of the hyperparameters ($\alpha$ and $\beta$) is essentially just to add a constant to your counts of each possible outcome. So additive smoothing (at least in all applications I've encountered) is really just a special case of Bayesian estimation.
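To make the correspondence concrete, here is a small sketch (Python; function names are hypothetical) checking that the posterior mean under a Beta prior coincides with additive smoothing on a binary alphabet:

```python
from math import isclose

def beta_posterior_mean(n1, n, alpha, beta):
    """Posterior mean of theta under a Beta(alpha, beta) prior after
    observing n1 ones in n Bernoulli trials. The posterior is
    Beta(n1 + alpha, n - n1 + beta), whose mean is the ratio below."""
    return (n1 + alpha) / (n + alpha + beta)

def additive_smoothing(n1, n, k, num_outcomes=2):
    """Additive smoothing estimate (f_s + k) / (n + k * num_outcomes)."""
    return (n1 + k) / (n + k * num_outcomes)

# Laplace smoothing (k = 1) over a binary alphabet matches the
# posterior mean under a uniform Beta(1, 1) prior: both give 4/12.
assert isclose(beta_posterior_mean(3, 10, 1, 1),
               additive_smoothing(3, 10, 1))
```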
Context
StackExchange Computer Science Q#15010, answer score: 3