HiveBrain v1.2.0
Get Started
← Back to all entries
snippetModerate

How do I reconstruct the forest of syntax trees from the Earley vector?

Submitted by: @import:stackexchange-cs··
0
Viewed 0 times
forestthesyntaxtreesreconstructhowearleyfromvector

Problem

Using the Earley vector as a recognizer is quite straightforward: when the end of the string is reached, you just have to check for a completed axiomatic production started at position 0. If you have at least one, then the string is accepted.

Using the Earley vector to reconstruct the parsing tree(s) is less obvious. Actually, I cannot figure out how an algorithmic procedure would work, moreover the only references I found were either vague or super-technical. Could anybody shed some light on it?

Solution

I am using terminology and notations from Earley's paper. It is possible that the description you read is different.

It seems frequent that general CF parsing algorithms are first
presented in the form of a recognizer, and then the information
management needed to actually build parse trees and parse forests is
sort of added as an afterthought. One reason may be that keeping the
information needed to construct the shared forest require cubic space
$O(n^3)$ where $n$ is the length of the input string being parsed, but
the space requirement is only square $O(n^2)$ for recognition, when this information is
not preserved. The reason for this space complexity increase is quite
simple: the parse forest size can be cubic.

The worst case time complexity is $O(n^3)$, as is well known.

The best reference for Earley's algorithm is of course Earley's paper,
but it is not very explicit about building the parse forest. This can
actually be a messy business, much more so than the fast talk of
section 7 page 101 will let appear. To be true, Earley does not talk
of parse forest, or of forest, but of "a factored representation of
all possible parse trees". And there is a good reason for that: if he
tried to produce a forest according to his grammar, his space (hence
time) complexity bound would climb to $O(n^{s+1})$ where $s$ is the
size of the longest rule right-hand-side. This is why other algorithms
use grammars in binary form (not necessarily Chomsky Normal Form (CNF)).

Actually, Earley uses binary form implicitly, because that is
necessary for the cubic time complexity. This is one of the major
roles of the rule dot in states. But this implicit binary form
produces parses and forests according to the binarized grammar, not
to the original one, which, I fear, is a major source of obscurity. This is detailed further below.

One good way to understand how the forest is obtained is probably to
look at it in a simpler case, the CYK algorithm. It is also often
described as a recognizer, and the parser aspect is added at the end.
You can look at the description in wikipedia. The information needed
to build the forest is what they store in the table of "backpointers".
Backpointers are essentially pointers to substrings (an associated
symbol) that form the constituents of a string according to some
rule. They give all possible ways of parsing a substring. Recall that
CYK uses binary form, usually CNF, so that things are simpler. The CYK
parser has fundamentally the same dynamic programming structure as
Earley, but is much simpler. So understanding it well can be a
significant help.

Going back to Earley's algorithm, I do not believe you need Earley
vector to decide acceptance or to build parse trees and forests. What
Earley calls vector in his paper appears only in page 97, in third
paragraph of implementation. It is only a device to speed up the
search of states pointing back at some given string position k, in
order to get a better complexity. But all the information is in the
state sets, implemented as lists of states. However, this information
is not sufficient to build the forest of parse trees, because the
algorithm does not keep track of the way(s) a state may be
obtain. Indeed, the vector is even used to efficiently discard a state
already found, independently of how it was found.

In section 7 of Earley's article, he explains that in order to "make
the recognizer into a parser", i.e. to be able to recover parse trees,
it is necessary to keep track of the way completions are done.


Each time we perform the completer operation adding a state
$E\rightarrow \alpha D.\beta \; g$ (ignoring lookahead) we construct a pointer from the instance of $D$ in
that state to the state $D\rightarrow \gamma. \; f$ which caused us to do the
operation. This indicates that $D$ was parsed as $\gamma$. In case
D is ambiguous there will be a set of pointers from it,
one for each completer operation which caused $E\rightarrow \alpha
D.\beta \; g$ to be added to the particular state set. Each symbol in
$\gamma$ will also have pointers from it (unless it is terminal),
and so on, thus representing the derivation tree for $D$.

Note that in this text, $f$ and $g$ are indices in the parsed string,
pointing where the recognition of the rule left-hand side started (as
the right-hand side symbol had been predicted. So $f$ is the string
index where the recognition of $D\rightarrow \gamma$ started, and it
ended at index $g$. These "completion pointers" are the Earley
equivalent of the backpointers described (not too well in
wikipedia) for the parser version of CYK.

From such a pointer (as described in the quote) we know that the $D$
in the rule instance $E\rightarrow \alpha D.\beta \; g$ can itself be
developped into a tree (or forest) that parses the input string $w$ from
index $f+1$ to index $g$, which we note $w_{f+1:g}$. The nodes immediately below $D$ are given by
the rule $D\rightarrow \gamma$. By looking for the completion

Context

StackExchange Computer Science Q#27937, answer score: 10

Revisions (0)

No revisions yet.