HiveBrain v1.2.0
Get Started
← Back to all entries
patternMinor

LALR(1) grammar for simple math parser

Submitted by: @import:stackexchange-cs··
0
Viewed 0 times
simpleparserlalrmathgrammarfor

Problem

I am trying to write a simple parser for a small calculator project, that should be able to parse e.g. the following inputs:

5 + 3
5 + f(4)
5 + f(x)
x = 5
f(x) = 3*x


so basically, I want to be able to parse expressions (that may contain variables and function calls), variable assignments, and function definitions using the = operator.

The problem is that in function definitions, only identifiers must be allowed in the definition, e.g.

f(5+1) = 3*x


should not be legal in the grammar. Thus, I need to define two distinct cases for functions, where the first (bFunc, containing only ids) could occur on both sides of =, and the second (func, can also contain expressions) must not occur on the left side.

This is the grammar so far:

stmt -> expr.
stmt -> fdef.
stmt -> vdef.

fdef -> bFunc = expr.
vdef -> id = expr.

bFunc -> id ( idList ).

func -> id ( list ).
func -> id ( ).

list -> expr.
list -> list , expr.

idList -> id.
idList -> idList , id.

expr -> term.
expr -> expr + term.
expr -> expr - term.

term -> atom.
term -> term / atom.
term -> term * atom.

atom -> id.
atom -> num.
atom -> func.
atom -> ( expr ).


However, I was not able to figure out how to specify a valid LALR(1) grammar because of this problem.
The tool http://mdaines.github.io/grammophone/#/ reports reduce-reduce conflicts.
The problem is that the parser would not know whether id should be reduced to idList or atom.

Surely it would be possible to simply use func in both cases and catch invalid left sides later in the program.
But my question is now, is it even possible to write a LALR(1) grammar for this problem? How does one decide whether it is possible or not?
And may my problem be the reason why programming languages use keywords like def or function for function definitions?

Solution

As modified by your edit, your grammar is unambiguous. Unfortunately, it is not deterministic; no limited lookahead is sufficient to decide whether when the parser reaches a comma: $$\bf{ID}\;\bf{(}\;\bf{ID}\;\bullet\;\bf{,}\;\cdots$$ it should predict $\it{bFunc}$ by reducing $\bf{ID}$ to $\it{idList}$ or predict $\it{func}$ by reducing it to $\it{atom}$ (and thus eventually to $\it{expr}$ and $\it{list}$).

One reasonably straightforward solution is to differentiate between $\it{expr}$ and $\bf{ID}$ by creating the non-terminal $\it{expr_{!ID}}$ which matches $\tt{expr}\setminus\{\bf{ID}\}$ and using it in $\it{list}$. (Note that $\it{list}$ is not a simple list of $\it{expr_{!ID}}$; rather, it is a list in which at least one element is a $\it{expr_{!ID}}$. See below for sample grammar).

That makes $\it{list}$ and $\it{idList}$ disjoint and the grammar is deterministic. However, the distinction is not strictly related to the semantics of the language. It creates a curious hybrid parse tree node in which some of the nodes in the argument list's parse subtree are $\it{idList}$ and the rest are $\it{list}$. (Look at the parse tree for f(a,b,c,d+3), for example). Indeed, a complete function call might be parsed as a $\it{bFunc}$, which will need to be converted to $\it{bFunc}$ when it turns out that it is not followed by an $\bf{=}$.

A practical parser will need to repair the parse tree for $\it{func}$ by converting the $\it{idList}$ nodes to $\it{list}$. That could be done in a reduction action, or it could be done in a post-parse tree walk, but it will almost certainly need to be done, since the semantic distinction is real: the use of an $\bf{ID}$ in a parameter list is a binding, while the use of the same $\bf{ID}$ in a function call is a use.

So at the high level, what we end up with is:
$$\begin{align}
\tt{bFunc}&\to\;ID\;\tt{(}\;\tt{idList}\;\tt{)}\\
\tt{func}&\to\;ID\;\tt{(}\;\tt{list}\;\tt{)}\\
\tt{func}&\to\;ID\;\tt{(}\;\tt{)}\\
\tt{func}&\to\;ID\;\tt{(}\;\tt{idList}\;\tt{)}\\
\\
\tt{list}&\to\;\tt{expr_{!ID}}\\
\tt{list}&\to\;\tt{list}\;,\;\tt{expr}\\
\tt{list}&\to\;\tt{idList}\;,\;\tt{expr_{!ID}}\\
\\
\tt{idList}&\to\;\tt{ID}\\
\tt{idList}&\to\;\tt{idList}\;,\;\tt{ID}\\
\end{align}$$

We also need to define the non-terminal $\it{expr_{!ID}}$ which doesn't match $\bf{ID}$ (and its pars at the other chained precedence levels). The straightforward solution is to remove the production $\text{atom}\;\to\;ID$, thus removing the token $\bf{ID}$ from the direct unit-production chain. Then we add it back in the non-terminals in which it is permitted (i.e. the ones which are not restricted to $\it{!ID}$). So the expression grammar now looks like this:
$$\begin{align}
\tt{expr}&\to\;\tt{expr_{!ID}}\\
\tt{expr}&\to\;\tt{ID}\\
\tt{term}&\to\;\tt{term_{!ID}}\\
\tt{term}&\to\;\tt{ID}\\
\tt{atom}&\to\;\tt{atom_{!ID}}\\
\tt{atom}&\to\;\tt{ID}\\
\\
\tt{expr_{!ID}}&\to\;\tt{expr}\;\bf{+}\;\tt{term}\\
\tt{expr_{!ID}}&\to\;\tt{expr}\;\bf{-}\;\tt{term}\\
\tt{expr_{!ID}}&\to\;\tt{term_{!ID}}\\
\\
\tt{term_{!ID}}&\to\;\tt{term}\;\bf{/}\;\tt{atom}\\
\tt{term_{!ID}}&\to\;\tt{term}\;\bf{*}\;\tt{atom}\\
\tt{term_{!ID}}&\to\;\tt{atom_{!ID}}\\
\\
\tt{atom_{!ID}}&\to\;\tt{NUM}\\
\tt{atom_{!ID}}&\to\;\tt{func}\\
\tt{atom_{!ID}}&\to\;\bf{(}\;\tt{expr}\;\bf{)}\\
\end{align}$$

which is only a little longer than the equivalent lines in your original.

It's interesting to note that the syntax formalism used by ECMAScript (and some other semi-related technologies) uses a macro-enhanced form of BNF in which non-terminals can be given boolean parameters, like the $\tt{!ID}$ subscript I used above. With that feature, the grammar could be written even more compactly (although the macro expansion would be a bit bigger).

As another interesting note, your grammar will work perfectly without the definition of $\it{expr_{!ID}}$ if processed with Bison or almost any other Yacc-derivative parser generator. (You do need to add the extra productions in the list definitions to convert $\it{idList}$ to $\it{list}$, as indicated above.) The parser generator will signal four reduce-reduce conflicts, since there is an ambiguity. But as long as the list productions come before the expression productions, Yacc/Bison's automatic conflict resolution mechanism will produce the correct resolution. That's probably not the optimal solution, but it is certainly the shortest.

Context

StackExchange Computer Science Q#149212, answer score: 3

Revisions (0)

No revisions yet.