HiveBrain v1.2.0
Get Started
← Back to all entries
patternMinor

What techniques can I use to hand-write a parser for an ambiguous grammar?

Submitted by: @import:stackexchange-cs··
0
Viewed 0 times
handcanwhatparserwriteambiguousgrammarforusetechniques

Problem

I'm writing a compiler, and I've built a recursive-descent parser to handle the syntax analysis. I'd like to enhance the type system to support functions as a valid variable type, but I'm building a statically typed language, and my desired syntax for a function type renders the grammar temporarily* ambiguous until resolved. I'd rather not use a parser generator, though I know that Elkhound would be an option. I know I can alter the grammar to make it parse within a fixed number of steps, but I'm more interested in how to implement this by hand.

I've made a number of attempts at figuring out the high-level control flow, but every time I do this I end up forgetting a dimension of complexity, and my parser becomes impossible to use and maintain.

There are two layers of ambiguity: a statement can be an expression, a variable definition, or a function declaration, and the function declaration can have a complex return type.

Grammar subset, demonstrating the ambiguity:

```
basetype
: TYPE
| IDENTIFIER
;

type
: basetype
| basetype parameter_list
;

arguments
: arraytype IDENTIFIER
| arraytype IDENTIFIER COMMA arguments
;

argument_list
: OPEN_PAREN CLOSE_PAREN
| OPEN_PAREN arguments CLOSE_PAREN
;

parameters
: arraytype
| arraytype COMMA parameters
;

parameter_list
: OPEN_PAREN CLOSE_PAREN
| OPEN_PAREN parameters CLOSE_PAREN
;

expressions
: expression
| expression COMMA expressions
;

expression_list
: OPEN_PAREN CLOSE_PAREN
| OPEN_PAREN expressions CLOSE_PAREN
;

// just a type that can be an array (this language does not support
// multidimensional arrays)
arraytype:
: type
| type OPEN_BRACKET CLOSE_BRACKET
;

block
: OPEN_BRACE CLOSE_BRACE
| OPEN_BRACE statements CLOSE_BRACE
;

function_expression
: arraytype argument_list block
| arraytype IDENTIFIER argument_list block
;

call_expression
: expression expressions
;

index_expression
: expression OPEN_BRACKET expression CLOSE_BRACKET
;

ex

Solution

I did not read in detail your grammar which is far too large for my
available time, and short of an example that you should provide (since you
assert it is ambiguous) I will not not attempt to check whether it is,
which is in general undecidable.

This said, as already remarked by user rici there are general
context-free (GCF) parsers that will parse any CF grammar, whether
ambiguous ot not. (I am ignoring very old depth first, recursive
decent parser). GCF parsing takes in general a time $O(n^3)$ where
$n$ is the size of the input sentence. However, it will be only
$O(n^2)$ for unambiguous, but possibly non-deterministic grammars.
And they are usually linear for many grammars, or for a significant
subset of the language defined by the grammars (see below).

I think it would be abusive to qualify these parsers as using unbounded
lookahead. Actually they try simultaneously all parsing possibilities
in parallel, and abandon some when further information from the input
becomes incompatible with the attempted parse. This is not the same as
not attempting a parse using some known facts about the rest of the
string, which is what lookahead is about. Indeed, several of these
algorithm may try not to attempt some parses, based on bounded
lookahead information: this is true of Earley's, GLL and GLR.
Furthermore, the concept of lookahead is meaningful for on-line algorithms, which is not really the case for GCF parsers.

The best way to survey this technology is probably a fast historical overview.

The (almost) oldest of GCF parser is the well known CYK algorithm. It
is pure dynamic programming constructed directly on the grammar. It is
extremely simple, but not best in performance.

The next one is Earley's algorithm (1968). Contrary to what is often said, it
is not simple, at least in its general form, and there is no standard
reference that I know regarding the format of the produced parse-trees
(parse-forest). It is not clear that its performances are the best one
might expect. Earley's algorithm is actually derived in a somewhat ad hoc
fashion from Knuth's LR(k) parser construction. This of course diminishes in no way the importance of its contribution to introducing new parsing concepts.

This was generalized by Lang (1974), who proposed a general dynamic
programming interpretation of any PDA, that could be combined with any
PDA construction technique, such as LR, LL or precedence, giving
techniques known as GLR or GLL parsing. When the technique an yield a
derterministic automaton, the GCF parser works in linear time.

Tomita realized the first implementation of this technique, applied to
LR(k) PDA construction in 1984, thus implementing the first GLR parser. Then it was also used for LL(k) PDA
construction, in this millenium.

Regarding performance, people have often tried to improve it by using
sophisticated PDA construction techniques that have been designed to
increase the number of grammars that can be parsed
deterministically. This folk wisdom is apparently ill-advised, as
analyzed by Billot and Lang (1989). That suggests that efficient GCF parsers can have a fairly simple structure.

All of these parser can work fast enough on modern computers. They
produce all possible parse-trees, in a condensed form called
parse-forest. One issue may be the representation of the
parse-forest which may be more or less convenient depending on the
implementation. Of course, if the CF grammar is unambiguous, there is
only a single parse-tree, which may do away with this parse-forest problem.

A major difference with traditional deterministic parsers is that the
production of the parses is online for deterministic parsers,
producing the left part of the parse-tree in synchronization with
the reading of the corresponding left part of the sentene being
parsed. With the GCF parsers, it is more an offline behavior, as you may have to wait the very end of the
parsing process before you know what parsing structure is relevant.

Another issue is that I do not know how well parsing errors can be
handled by current implementation. I do know that some very
interesting work has been done on that by the Natural Language
community.

Regarding your own grammar, I think that trying to use these
techniques in a hand written parser is just looking for trouble. You
woud have to have an excellent mastery of the technology. There is
just nothing to be gained. But, given the way you state your question, if you want a simple implementation
that you can master, for testing purposes, you may try the CYK
algorithm. It main advantage is that, like Earley's algorithm, it works directly on the grammar, which you may find more intuitive.

Many parser generation system now propose some form of GCF parsing,
but I have no personnal experience with any of them, thus no
recommendations.

Context

StackExchange Computer Science Q#44102, answer score: 4

Revisions (0)

No revisions yet.