HiveBrain v1.2.0
Get Started
← Back to all entries
patternMinor

Find member of CFL that is Levenshtein-closest to non-member string

Submitted by: @import:stackexchange-cs··
0
Viewed 0 times
cflnonclosestlevenshteinmemberthatfindstring

Problem

Is there an (efficient?) algorithm which given a context-free language $L$ (given as a grammar) and a string $x$ with $x \not \in L$ computes a $y$ with $y \in L$ and $\forall y': y' \in L \implies d(x, y) \le d(x, y')$, where $d$ is the Levenshtein distance? (Secondarily, can one enumerate all such $y$?)

The motivation is to give more useful parse error messages: if you give me an almost-valid C program, I would like to tell you "if you delete these parentheses and insert a semicolon here, it'll parse".

A family of related questions:

  • What happens when you vary the distance function? (Can you think of other candidates? Do non-metric distance functions make sense? Do they change the answer?)



  • Does the answer change if you restrict $y$ to e.g.



  • $x$ is a prefix of $y$ (i.e. you can do append-only error-correction)



  • $y$ is a prefix of $x$ (i.e. you can do pop-last-only error-correction)



  • all the characters of $x$ occur in $y$ and in the same relative order, but maybe not adjacently (i.e. you can error-correct by inserting anywhere)



  • [...] $y$ occur in $x$ [...] by deleting anywhere.



  • Does the answer change if you change the optimization criterion? E.g. "find a $y$ in $L$ which maximizes the length of common_prefix(x, y)", or something else?



  • Does the answer change for other language classes, e.g. regular?



  • Does this change if the language is given as $LALR(k)$, $LL(k)$, deterministic or other sub-CF grammar?

Solution

Yes. This can be done, using Levenshtein automata.

Let $S_k = \{y \in \{0,1\}^* : d(x,y) \le k)\}$. Then the set $S_k$ is regular, and one can construct a finite-state automaton for it, called a Levenshtein automaton. Now the intersection of a CFL and a regular language is another CFL. Also, given a CFL, you can efficiently determine whether it is non-empty or not, and if not, you can efficiently find an example of an element of it.

So, we obtain the following simple algorithm:

  • For $k:= 1,2,3,\dots$, do:



  • If $L \cap S_k$ is non-empty, find $y \in L \cap S_k$ and output it.



This can be sped up by using binary search to find the smallest $k$ such that $L \cap S_k$ is non-empty.

The same approach can handle your proposed restrictions as well, as those simply amount to intersecting with another regular language. It also works if $L$ is regular, as any regular language is necessarily context-free.

Context

StackExchange Computer Science Q#60262, answer score: 4

Revisions (0)

No revisions yet.