HiveBrain v1.2.0
Get Started
← Back to all entries
patternModerate

Generating random words by grammar

Submitted by: @import:stackexchange-cs··
0
Viewed 0 times
randomgrammargeneratingwords

Problem

A bit of context

I was writing a parser for a grammar, and for testing purposes I come up with idea to generate some random inputs. The grammar I was dealing with was much more complicated, in this question I presented "minimal working example" for simplicity. And of course, I am able to avoid the issue I faced using a bunch of trivial heuristics, but the question really makes me wonder.

The problem

Suppose we have a typical context-free grammar for arithmetic expressions of $+,*$, brackets, and integer literals:

$$E \longrightarrow F(“+”F)^*$$
$$F \longrightarrow T(“”T)^$$
$$T \longrightarrow int|“(”E“)”$$

It is easy to implement a staighforward algorithm for generating random words by this grammar: we implement a separate procedure for each nonterminal. If a nonterminal has multiple production rules (as $T$ does), we choose a production rule by tossing a coin. If rule contains kleene star (e.g. $(“+”F)^*$), we also toss a coin and generate zero or one repetition (sure we could pick any random integer $k\geq0$ and generate $k$ repetitions, but for simplicity we will focus on the simplest version of this procedure). Here is what we get:

generate_E():
    if coin_toss():
        return generate_F() + "+" + generate_F()
    else:
        return generate_F()

generate_F():
    if coin_toss():
        return generate_T() + "*" + generate_T()
    else:
        return generate_F()

def generate_T():
    if coin_toss():
        return "(" + generate_E() + ")"
    else:
        return random_integer()


An invocation of generate_E() yields a random expression.

What could go wrong? It turns out that execution of this procedure on the real machine terminates with stack overflow quite often. Of course, technically here we have a possibility of endless recursion, but my intuition was telling me that probability of reaching recursion depth $k$ decays exponentially with increasing $k$, therefore getting deep levels (let's say, 1000) is almost impossible. Apparentl

Solution

Your process is a textbook example of a branching process. Starting with one $E$, we have an expected $3/2$ many $F$s, $9/4$ many $T$s, and so $9/8$ many remaining $E$s in expectation. Since $9/8 > 1$, it is not surprising that your process often failed to terminate.

To gain more information, we need to know the exact distribution of the number of $E$-offsprings, which is given by the following generating function (see the Wikipedia article linked to above):
$$
h(z) = \frac{33}{128} + \frac{7}{16}z+ \frac{15}{64}z^2 + \frac{1}{16}z^3 + \frac{1}{128}z^4.
$$
This implies that the extinction probability is $d \approx 0.717778143742483$ (by solving $h(z) = z$). This is an upper bound on the probability that your procedure terminates.

We can easily recover your numbers by considering iterates of $h$. The probability that the process terminates in $3k$ steps is $h^{(k)}(0)$. So computing $h(0),h(h(0))-h(0),h(h(h(0)))-h(h(0))$ and so on, we recover the figures in your table.

When $k$ is large, $h^{(k)}(0) \approx d$. We can compute $h'(d) \approx 0.882115879977412$. We have
$$
\frac{d - h^{(k)}(0)}{d - h^{(k-1)}(0)} =
\frac{h(d) - h(h^{(k-1)}(0))}{d - h^{(k-1)}(0)} \approx h'(d).
$$
It follows that
$$
d - h^{(k)}(0) \propto h'(d)^k.
$$
Therefore the probability that the process terminates in exactly $3k$ steps is
$$
h^{(k)}(0) - h^{(k-1)}(0) =
[d - h^{(k-1)}(0)] - [d - h^{(k)}(0)] \propto h'(d)^{k-1} - h'(d)^k \propto h'(d)^k.
$$
Empirically, we can check that the constant of proportionality is roughly $0.0248011196615094$.

Context

StackExchange Computer Science Q#121884, answer score: 11

Revisions (0)

No revisions yet.