patternMinor
Generating a set of minimal-length strings that, together, invoke every production of a context free language
Viewed 0 times
productionfreelengthtogethereverylanguagegeneratingthatcontextminimal
Problem
Problem (tl;dr)
Given a context free grammar, $G$, find a set of strings that take $G$ through every production it has at least once.
How and how fast can it be done?
Background
I'm working on a compiler whose parser is implemented with a tool similar to Yacc+Antlr. I've written up most of the parser code, and I'd like to generate some code of the object language that invokes every production of the grammar at least once so that I can feed it to the parser and make sure that nothing is wrong.
In the interest of good testing, what I'd really like is one, short test file that has a particular production "under test" -- so, for each production rule, I want to generate a minimal string that takes the parser from the start state, through the production being tested, to a set of terminals.
Possible solutions
I imagine there is an elegant solution using graph theory, but I'm not quite sure what it is. I would like to just use Dijkstra's algorithm to find shortest paths through some appropriate structure, but I think that a string is parsed by a context free grammar in a tree structure rather than a path, so I don't know how to make that work.
I think there might be a clever way to pose it as a network flow problem. Something like this: take a graph that has a vertex for every symbol (terminal and nonterminal) and a vertex for every production. If a nonterminal has a production, add a directed edge from the nonterminal to the production. If a production produces a symbol, add a directed edge from the production to the symbol. Add a source with some capacity $c$ and attach it to the vertex corresponding to the start symbol. Add a sink with infinite capacity and attach it to each terminal.
If a nonterminal has an in-arc with a capacity $k$, add an arc from the nonterminal to each of its productions with capacity $k$. If a production has an in-arc with capacity $k$ and it has an out-arc to $n$ nonterminals, add an arc with capacity $\frac k n$ from the production
Given a context free grammar, $G$, find a set of strings that take $G$ through every production it has at least once.
How and how fast can it be done?
Background
I'm working on a compiler whose parser is implemented with a tool similar to Yacc+Antlr. I've written up most of the parser code, and I'd like to generate some code of the object language that invokes every production of the grammar at least once so that I can feed it to the parser and make sure that nothing is wrong.
In the interest of good testing, what I'd really like is one, short test file that has a particular production "under test" -- so, for each production rule, I want to generate a minimal string that takes the parser from the start state, through the production being tested, to a set of terminals.
Possible solutions
I imagine there is an elegant solution using graph theory, but I'm not quite sure what it is. I would like to just use Dijkstra's algorithm to find shortest paths through some appropriate structure, but I think that a string is parsed by a context free grammar in a tree structure rather than a path, so I don't know how to make that work.
I think there might be a clever way to pose it as a network flow problem. Something like this: take a graph that has a vertex for every symbol (terminal and nonterminal) and a vertex for every production. If a nonterminal has a production, add a directed edge from the nonterminal to the production. If a production produces a symbol, add a directed edge from the production to the symbol. Add a source with some capacity $c$ and attach it to the vertex corresponding to the start symbol. Add a sink with infinite capacity and attach it to each terminal.
If a nonterminal has an in-arc with a capacity $k$, add an arc from the nonterminal to each of its productions with capacity $k$. If a production has an in-arc with capacity $k$ and it has an out-arc to $n$ nonterminals, add an arc with capacity $\frac k n$ from the production
Solution
In a nutshell
Not knowing enough the literature, I worked out a solution which is
presented in the next section, together with a proof for the hardest
part. Once I knew what was needed, I could search the literature for
the right ideas. Here is a quick presentation
of the algorithm, based on the literature, which is essentially the
same I developed.
The first thing to do is to find a size-minimal terminal string
$\sigma(U)$ for every non-terminal $U$ of the grammar. This can be
done by using Knuth's extension to conc-or graphs (also known as CF
grammars) and and-or graphs of Dijkstra's shortest path algorithm.
Example B in Knuth's paper does what is needed, almost.
Actually, Knuth computes only the length of these terminal strings,
but it is quite easy to modify his algorithm to actually compute one
such terminal string $\sigma(U)$ for each non-terminal $U$ (as I do it
in my own version below). We also define $\sigma(a)=a$ for every
terminal $a$, and we extend $\sigma$ as usual into a string
homomorphism.
Then we consider a directed graph where non-terminals are the nodes,
and there is an arc $(U,V)$ iff there is a rule $U\rightarrow \alpha
V\beta$. If several such rules can produce the same arc $(U,V)$, we
keep one such that the length $|\sigma(\alpha\beta)|$ is minimal.
The arc is labeled with that rule, and that minimal length
$|\sigma(\alpha\beta)|$ becomes the weight of the arc.
Finally, using Dijkstra's shortest path algorithm, we compute the
shortest path from the initial non-terminal $S$ to each non-terminal
of the grammar. Given the shortest path for a non-terminal $U$, the
rule labels on the arcs may be used to get a derivation
$S\overset{*}{\Longrightarrow}\alpha U \beta$. Then, to every rule of
the form $U\rightarrow\gamma$ in the grammar, we associate the
size-minimal terminal string $\sigma(\alpha\gamma\beta)$ which can be
derived using that rule.
To achieve low complexity, both Dijkstra's algorithm and Knuth's
extension are implemented with heaps, AKA priority queues. This gives
for Dijkstra's algorithm a complexity of $O(n\log n +t)$, and for
Knuth's algorithm a complexity $O(m \log n +t)$, where there are $m$
grammar rules and $n$ non-terminals, and $t$ is the total length of
all rules. The whole is dominated by the complexity of Knuth's algorithm since $m\geq n$.
What follows is my own work, before I produced the short answer above.
Deriving the solution from the useless symbols elimination algorithm.
There are several aspects to this algorithm. For better intuition I
chose to present it in three successive versions that introduce
progressively more features. The first version does not answer the
question, but is a standard algorithm for useless symbols elimination that suggests a solution. The second version answers the question without the minimality constraint, The third version gives an answer to the question, satisfying thye minimality constraint.
This third solution is then improved by using an adaptation to
and-or graphs of Dijkstra's shortest path algorithm.
The end result is a very simple algorithm, that avoids reconsidering computations already done. But it is less intuitive and does require a proof.
This answer only tries to answer the question as made precise by the OP's
comment: "for each production rule, I want to generate a minimal
string that takes the parser from the start state, through the
production being tested, to a set of terminals." Hence I only try to
get a set of strings such that for each rule, there is a string in the
set that is one of the size-minimal strings of the language having a
derivation using the rule.
It must be however noted that the fact that a string "invokes" a rule,
that is has a derivation using that rule, does not necessarily means
that the rule will be considered by a parser that work with ambiguous
grammars and resolves ambiguities arbitrarily. Handling such a
situation would probably require more precise knowledge of the parser, and
might well be a more complex question.
The basic algorithm
To solve this question, one can start with the classical algorithm for useless
symbols removal in context-free grammars. It is in section 4.4, pp
88-89, of Hopcroft & Ullman, 1979 edition. But the presentation here may be a bit different.
The algorithm aims precisely at proving the existence of such a covering as
requested by the OP, and consists in two parts:
-
lemma 4.1 of H&U, page 88: removal of all unproductive non-terminals. This
is done by trying find for each terminal a terminal string it can
derive on. A simple way to explain it is as follow: You create a
set $Prod$ od productive symbols, which you initialize with all
terminals. Then for each rule, not yet processed, that has all its
right-hand-side (RHS) symbols in $Prod$, you add the left-hand-side
(LHS) non-terminal to the set $Prod$, and you remove all rules with the
same LHS non-terminal from the set of rules to be processed. You iterate the process until there is no rule l
Not knowing enough the literature, I worked out a solution which is
presented in the next section, together with a proof for the hardest
part. Once I knew what was needed, I could search the literature for
the right ideas. Here is a quick presentation
of the algorithm, based on the literature, which is essentially the
same I developed.
The first thing to do is to find a size-minimal terminal string
$\sigma(U)$ for every non-terminal $U$ of the grammar. This can be
done by using Knuth's extension to conc-or graphs (also known as CF
grammars) and and-or graphs of Dijkstra's shortest path algorithm.
Example B in Knuth's paper does what is needed, almost.
Actually, Knuth computes only the length of these terminal strings,
but it is quite easy to modify his algorithm to actually compute one
such terminal string $\sigma(U)$ for each non-terminal $U$ (as I do it
in my own version below). We also define $\sigma(a)=a$ for every
terminal $a$, and we extend $\sigma$ as usual into a string
homomorphism.
Then we consider a directed graph where non-terminals are the nodes,
and there is an arc $(U,V)$ iff there is a rule $U\rightarrow \alpha
V\beta$. If several such rules can produce the same arc $(U,V)$, we
keep one such that the length $|\sigma(\alpha\beta)|$ is minimal.
The arc is labeled with that rule, and that minimal length
$|\sigma(\alpha\beta)|$ becomes the weight of the arc.
Finally, using Dijkstra's shortest path algorithm, we compute the
shortest path from the initial non-terminal $S$ to each non-terminal
of the grammar. Given the shortest path for a non-terminal $U$, the
rule labels on the arcs may be used to get a derivation
$S\overset{*}{\Longrightarrow}\alpha U \beta$. Then, to every rule of
the form $U\rightarrow\gamma$ in the grammar, we associate the
size-minimal terminal string $\sigma(\alpha\gamma\beta)$ which can be
derived using that rule.
To achieve low complexity, both Dijkstra's algorithm and Knuth's
extension are implemented with heaps, AKA priority queues. This gives
for Dijkstra's algorithm a complexity of $O(n\log n +t)$, and for
Knuth's algorithm a complexity $O(m \log n +t)$, where there are $m$
grammar rules and $n$ non-terminals, and $t$ is the total length of
all rules. The whole is dominated by the complexity of Knuth's algorithm since $m\geq n$.
What follows is my own work, before I produced the short answer above.
Deriving the solution from the useless symbols elimination algorithm.
There are several aspects to this algorithm. For better intuition I
chose to present it in three successive versions that introduce
progressively more features. The first version does not answer the
question, but is a standard algorithm for useless symbols elimination that suggests a solution. The second version answers the question without the minimality constraint, The third version gives an answer to the question, satisfying thye minimality constraint.
This third solution is then improved by using an adaptation to
and-or graphs of Dijkstra's shortest path algorithm.
The end result is a very simple algorithm, that avoids reconsidering computations already done. But it is less intuitive and does require a proof.
This answer only tries to answer the question as made precise by the OP's
comment: "for each production rule, I want to generate a minimal
string that takes the parser from the start state, through the
production being tested, to a set of terminals." Hence I only try to
get a set of strings such that for each rule, there is a string in the
set that is one of the size-minimal strings of the language having a
derivation using the rule.
It must be however noted that the fact that a string "invokes" a rule,
that is has a derivation using that rule, does not necessarily means
that the rule will be considered by a parser that work with ambiguous
grammars and resolves ambiguities arbitrarily. Handling such a
situation would probably require more precise knowledge of the parser, and
might well be a more complex question.
The basic algorithm
To solve this question, one can start with the classical algorithm for useless
symbols removal in context-free grammars. It is in section 4.4, pp
88-89, of Hopcroft & Ullman, 1979 edition. But the presentation here may be a bit different.
The algorithm aims precisely at proving the existence of such a covering as
requested by the OP, and consists in two parts:
-
lemma 4.1 of H&U, page 88: removal of all unproductive non-terminals. This
is done by trying find for each terminal a terminal string it can
derive on. A simple way to explain it is as follow: You create a
set $Prod$ od productive symbols, which you initialize with all
terminals. Then for each rule, not yet processed, that has all its
right-hand-side (RHS) symbols in $Prod$, you add the left-hand-side
(LHS) non-terminal to the set $Prod$, and you remove all rules with the
same LHS non-terminal from the set of rules to be processed. You iterate the process until there is no rule l
Context
StackExchange Computer Science Q#28264, answer score: 3
Revisions (0)
No revisions yet.