snippetMinor
Is there a way to efficiently parse this grammar?
Viewed 0 times
thisefficientlywayparsegrammarthere
Problem
Consider the following context-free grammar:
$T \to () \mid (T) \mid (T) \mid (TT)$
This effectively forms a notation for binary trees: A binary tree is a $*$, with optional children on either side, all surrounded by $()$. This is clearly unambiguous, because you can distinguish between the presence and absence of a tree, and you can determine where a tree begins and ends.
Now consider the same grammar, except that it does not distinguish between opening and closing parenthesis:
$T \to aba \mid aTba \mid abTa \mid aTbTa$
Here, we have replaced both $($ and $)$ with $a$ (and $*$ with $b$).
After trying several examples, I believe that this grammar is unambiguous. For example, the string $aaabababaabaababaa$ parses as $(a(a(aba)b(aba)a)b(a(aba)ba)a)$.
However, I don't see an efficient way to parse it. The general idea I had was to go as follows:
Here it seems that we would have to create a choice point and use a backtracking method to find it. However, this method would have exponential runtime for many pathological inputs, such as $aababababababababababaa$.
Is this grammar unambiguous? If so, is there a way to parse this grammar that avoids pathological exponential runtime for most inputs?
$T \to () \mid (T) \mid (T) \mid (TT)$
This effectively forms a notation for binary trees: A binary tree is a $*$, with optional children on either side, all surrounded by $()$. This is clearly unambiguous, because you can distinguish between the presence and absence of a tree, and you can determine where a tree begins and ends.
Now consider the same grammar, except that it does not distinguish between opening and closing parenthesis:
$T \to aba \mid aTba \mid abTa \mid aTbTa$
Here, we have replaced both $($ and $)$ with $a$ (and $*$ with $b$).
After trying several examples, I believe that this grammar is unambiguous. For example, the string $aaabababaabaababaa$ parses as $(a(a(aba)b(aba)a)b(a(aba)ba)a)$.
However, I don't see an efficient way to parse it. The general idea I had was to go as follows:
- The string must start and end with $a$.
- If the second character from the start is $b$, recursively parse the part after it, all the way to the second character from the end.
- If the second character from the end is $b$, recursively parse the part before it, all the way to the second character from the start.
- If the second character from both the start and end is $b$, we have a malformed expression.
- If both characters from the start and end are $a$, then... what?
Here it seems that we would have to create a choice point and use a backtracking method to find it. However, this method would have exponential runtime for many pathological inputs, such as $aababababababababababaa$.
Is this grammar unambiguous? If so, is there a way to parse this grammar that avoids pathological exponential runtime for most inputs?
Solution
The grammar is ambiguous: (I used
However, you can use an Earley grammar to find all possible parses in worst-case cubic time. (As long as you are satisfied with a graph representation of "all possible parses" since the maximum number of parses of a string of a given size is exponential.)
I found this example by using a bison GLR grammar and iterating over inputs until I found one which produced an ambiguity. Out of laziness, I produced the list of inputs by generating all strings consisting of $n$ | and $n$ *| where the first element is |. Most of these inputs are syntax errors, but for small $n$ it doesn't matter.
Here's a simpler approach, using Python to generate valid inputs:
If this function generates the same string twice, that string is ambiguous since it has two left-most derivations.
| for a and * for b because I find it easier to read.)||*|*||*|*||
((*(*))*(*))
((*)*((*)*))However, you can use an Earley grammar to find all possible parses in worst-case cubic time. (As long as you are satisfied with a graph representation of "all possible parses" since the maximum number of parses of a string of a given size is exponential.)
I found this example by using a bison GLR grammar and iterating over inputs until I found one which produced an ambiguity. Out of laziness, I produced the list of inputs by generating all strings consisting of $n$ | and $n$ *| where the first element is |. Most of these inputs are syntax errors, but for small $n$ it doesn't matter.
Here's a simpler approach, using Python to generate valid inputs:
def gen(s, n):
"""Generate left-most derivations of 's' with maximum length 'n'"""
if n > 0:
i = s.find('T')
if i == -1:
yield s
else:
yield from gen(s[:i]+'|*|'+s[i+1:], n)
yield from gen(s[:i]+'|T*|'+s[i+1:], n-1)
yield from gen(s[:i]+'|*T|'+s[i+1:], n-1)
yield from gen(s[:i]+'|T*T|'+s[i+1:], n-2)If this function generates the same string twice, that string is ambiguous since it has two left-most derivations.
Code Snippets
||*|*||*|*||
((*(*))*(*))
((*)*((*)*))def gen(s, n):
"""Generate left-most derivations of 's' with maximum length 'n'"""
if n > 0:
i = s.find('T')
if i == -1:
yield s
else:
yield from gen(s[:i]+'|*|'+s[i+1:], n)
yield from gen(s[:i]+'|T*|'+s[i+1:], n-1)
yield from gen(s[:i]+'|*T|'+s[i+1:], n-1)
yield from gen(s[:i]+'|T*T|'+s[i+1:], n-2)Context
StackExchange Computer Science Q#86124, answer score: 4
Revisions (0)
No revisions yet.