snippetMinor

Is there a way to efficiently parse this grammar?

Submitted by: @import:stackexchange-cs·Mar 10, 2026·

Viewed 0 times

thisefficientlywayparsegrammarthere

Problem

Consider the following context-free grammar:

$T \to () \mid (T) \mid (T) \mid (TT)$

This effectively forms a notation for binary trees: A binary tree is a $*$, with optional children on either side, all surrounded by $()$. This is clearly unambiguous, because you can distinguish between the presence and absence of a tree, and you can determine where a tree begins and ends.

Now consider the same grammar, except that it does not distinguish between opening and closing parenthesis:

$T \to aba \mid aTba \mid abTa \mid aTbTa$

Here, we have replaced both $($ and $)$ with $a$ (and $*$ with $b$).

After trying several examples, I believe that this grammar is unambiguous. For example, the string $aaabababaabaababaa$ parses as $(a(a(aba)b(aba)a)b(a(aba)ba)a)$.

However, I don't see an efficient way to parse it. The general idea I had was to go as follows:

The string must start and end with $a$.

If the second character from the start is $b$, recursively parse the part after it, all the way to the second character from the end.

If the second character from the end is $b$, recursively parse the part before it, all the way to the second character from the start.

If the second character from both the start and end is $b$, we have a malformed expression.

If both characters from the start and end are $a$, then... what?

Here it seems that we would have to create a choice point and use a backtracking method to find it. However, this method would have exponential runtime for many pathological inputs, such as $aababababababababababaa$.

Is this grammar unambiguous? If so, is there a way to parse this grammar that avoids pathological exponential runtime for most inputs?

Solution

The grammar is ambiguous: (I used | for a and * for b because I find it easier to read.)

||*|*||*|*||
((*(*))*(*))
((*)*((*)*))

However, you can use an Earley grammar to find all possible parses in worst-case cubic time. (As long as you are satisfied with a graph representation of "all possible parses" since the maximum number of parses of a string of a given size is exponential.)

I found this example by using a bison GLR grammar and iterating over inputs until I found one which produced an ambiguity. Out of laziness, I produced the list of inputs by generating all strings consisting of $n$ | and $n$ *| where the first element is |. Most of these inputs are syntax errors, but for small $n$ it doesn't matter.

Here's a simpler approach, using Python to generate valid inputs:

def gen(s, n):
  """Generate left-most derivations of 's' with maximum length 'n'"""
  if n > 0:
    i = s.find('T')
    if i == -1:
      yield s
    else:
      yield from gen(s[:i]+'|*|'+s[i+1:], n)
      yield from gen(s[:i]+'|T*|'+s[i+1:], n-1)
      yield from gen(s[:i]+'|*T|'+s[i+1:], n-1)
      yield from gen(s[:i]+'|T*T|'+s[i+1:], n-2)

If this function generates the same string twice, that string is ambiguous since it has two left-most derivations.

Code Snippets

||*|*||*|*||
((*(*))*(*))
((*)*((*)*))

def gen(s, n):
  """Generate left-most derivations of 's' with maximum length 'n'"""
  if n > 0:
    i = s.find('T')
    if i == -1:
      yield s
    else:
      yield from gen(s[:i]+'|*|'+s[i+1:], n)
      yield from gen(s[:i]+'|T*|'+s[i+1:], n-1)
      yield from gen(s[:i]+'|*T|'+s[i+1:], n-1)
      yield from gen(s[:i]+'|T*T|'+s[i+1:], n-2)

Context

StackExchange Computer Science Q#86124, answer score: 4

Revisions (0)

No revisions yet.