HiveBrain v1.2.0
Get Started
← Back to all entries
patternMinor

Suffix trees - "smaller half trick"

Submitted by: @import:stackexchange-cs··
0
Viewed 0 times
suffixtricktreessmallerhalf

Problem

I have been reading a paper Finding Maximal Pairs with Bounded Gap:

An in there, there is a sentence (page 6 second paragraph):


The “smaller-half trick” is used in several methods for finding tandem repeats,
e.g. [2, 5, 26]. It says that the sum over all nodes v in an arbitrary binary tree of size n of terms that are $O(n_{1})$, where $n_{1}\leq n_{2}$ are the numbers of leaves in the subtrees rooted at the two children of v, is $O(n\log n)$

Although it sounds very straightforward, I just cannot see why this is true. Could someone with some experience regarding this trick please explain why this is true?
I read several other papers but just don't see it.

Solution

We prove by induction on $n$ that your sum (with $O(n_1)$ replaced with $n_1$, only taken over non-leaves) is at most $n\log_2 n$. We use the convention $0\log_2 0 = 0$. The base case $n$ = 0 trivially holds, since the sum is empty. Now suppose we have a node with subtrees of sizes $n_1 \leq n_2$; this implies that $n = n_1 + n_2 + 1$. The sum is at most
$$ S = n_1\log n_1 + n_2\log n_2 + n_1. $$
Let's concentrate on the first two terms:
$$
\begin{align*}
n_1\log n_1 + n_2\log n_2 &= (n_1 + n_2) \left[ \frac{n_1}{n_1+n_2} \log n_1 + \frac{n_2}{n_1+n2} \log n_2 \right] \\ &=
(n_1 + n_2) \left[ \frac{n_1}{n_1+n_2} \log \frac{n_1}{n_1+n_2} + \frac{n_2}{n_1+n_2} \log \frac{n_2}{n_1+n_2} + \log (n_1+n_2) \right] \\ &= (n_1+n_2) [\log(n_1+n_2) - h(\tfrac{n_1}{n_1+n_2})],
\end{align*}
$$
where $h(p)$ is the binary entropy function. Let $p = n_1/(n_1+n_2) \leq 1/2$. Then we can bound $S$ by
$$ n \log n + (n_1+n_2)(p-h(p)). $$
One checks that $p \leq h(p)$ for $0 \leq p \leq 1/2$, completing the proof. (In fact, $2p \leq h(p)$, so we can get the better bound $\tfrac{1}{2}n\log_2 n$, which is nearly tight for complete binary trees.)

Context

StackExchange Computer Science Q#16253, answer score: 2

Revisions (0)

No revisions yet.