HiveBrain v1.2.0
Get Started
← Back to all entries
snippetMinor

What information do we get from a compiler's parse tree?

Submitted by: @import:stackexchange-cs··
0
Viewed 0 times
whatparsegetcompilerfromtreeinformation

Problem

In the compiler course by Alex Aiken on Coursera, more specifically lecture 05-02 Context Free Grammars, the professor says that CFGs give answers of the type yes/no, i.e. whether the given string of tokens is valid or not. He adds that it is also desirable to know how a particular string of tokens is in the language; for this purpose he introduces parse trees.

Why is the "how" part important?

Solution

Prelude It might be useful to be pedantic and start with a surprising fact: compilers do not use context free grammars, contrary to what you've been told. Instead they use something closely related but subtly different, which might be termed context-free transducers (please let me know if there's an official name for it). They relate to context free grammars as Mealy-machines do to finite state machines.

The reason why CFGs are not used is that CFGs only give a yes/no answer. That's not enough for the subsequent stages of a compiler. Instead, subsequent stages need a good representation of the input program to work with. This representation is the abstract syntax tree (AST). Something similar happens in lexical analysis: contrary to what most compiler courses tell you, the lexer does not use a finite state machine or regular expression to nail down the lexical structure of a language, because regular expressions and finite state machines also only give a yes/no answer. We want the output of the lexical stage to be a token-list for consumption by the parser. Mealy machines deliver token lists.

If this is the case, why do compiler courses usually use regular expressions and finite state machines for the lexical analysis and CFGs for parsing? Do compiler teachers lie to us? No, they want to help us understand:

-
The concepts are really similar, e.g. a Mealy machine is just a finite state machine where every transition carries not just an input action (as they do with finite state machines) but also a corresponding output action. Likewise context free transducers are just CFGs where every production is also associated with an output.

-
Teachers don't want to be overly formal, and expect the students to bridge the gap between
finite state machines and lexer / CFG and parser themselves. In practise students almost always do.

Main point. With this pedantic caveat we can come to the original question: what is the parse tree used for?

The parse tree contains the
"how a string is in the CFG". The parse tree is used to construct the abstract syntax tree (AST) which is a concise representation of the program that is used by later phases in the compiler, in particular the type checker (for statically typed languages) and the code generator. The advantage of ASTs over other program representations such as strings is that ASTs make access to the immediate sub-programs of a program easy, e.g. the program

if C then P else Q


has three immediate sub-programs namely C, P and Q. The AST for the program has three pointers to the sub-programs:

Both, code-generation and type-checking, involve the recursive invocation of code-generation/type-checker on the immediate sub-programs (plus some glue-code). Hence ASTs provide exactly the information needed for efficient type-checking and code generation.

The relationship between parse trees and ASTs is simple.
The parse tree contains the exact information about how a string 'fits' into the CFG. Every production applied when consuming the input string is noted down in the order used. This
information has a natural tree shape, whence the name "parse tree". An AST is a parse-tree with redundant information removed. An example of redundant information are properly balanced brackets. Another example is the node labelling that
gives the aforementioned information about what productions were used to derive the string. Such information
is important for checking if the input string (or more precisely token-list)
is syntactically valid, but once that has been acertained, this information is no longer
needed for compilation and hence discarded.

Let us look at an example. Consider the arithmetic expression $4 *(3+17)$ in the obvious grammar of arithmetic expressions:
$$
E \ \ \rightarrow\ \ E + E \ |\ E * E \ |\ ( E ) \ |\ 0 \ |\ 1 \ |\ 2 \ |\ ...
$$
Let's ignore the ambiguity and left-recursion in that grammar. Here is a plausible parse tree for $4 *(3+17)$

(Note that the parse tree is usually not constructed explicitly, but is implicity in the structure of the recursion throughout the parsing process.)
The corresponding AST is simpler, but still contains all relevant information, including
precedence of the addition over the multiplication, in its pointer structure:

Summary: compilers construct ASTs (good representation of the program for future compiler stages) from parse trees (which tell "how" the program is in the CFG).

Code Snippets

if C then P else Q

Context

StackExchange Computer Science Q#16647, answer score: 7

Revisions (0)

No revisions yet.