HiveBrain v1.2.0
Get Started
← Back to all entries
patternMinor

Can an Earley Parser be made into a fuzzy parser similar to the Levenshtein Automata Algo for DFA?

Submitted by: @import:stackexchange-cs··
0
Viewed 0 times
canthemadedfaparserintoautomataalgolevenshteinfor

Problem

There's a way to perform fuzzy parsing (accepts strings even with typos to a certain edit distance), with a DFA and a run-time constructed Levenshtein Automata of the input word. Can something similar be done with an Earley parser? I'm finding it hard to understand the algorithm, let alone answer this question.

Solution

The answer is yes. However I would not do that with an Earley parser
because there are simpler ones with the same capabilities.

Basically, Earley parser belongs to a family of general context-free
parsers, that produces all possible parses for a given string, when
the grammar is ambiguous.

There are two ways (at least) of understanding these parsers:

-
as dynamic programming interpretation of a pushdown automaton
corresponding to the grammar on the input string;

-
as the the construction of the intersection of the grammar
with a finite state automaton.

When parsing a single string, the finite state automaton to be
considered is a linear automaton that recognizes only the string $w$
to be parsed, one symbol at a time (number of state is $|w|+1$). If
you apply the cross-product construction of a FA $A$ and a CF garmmar
$G$ (Bar Hillel, Perlis, Shamir 1961), you get a new CF grammar that
is a new grammar $F$ which generates $\mathcal L(A)\cap\mathcal L(G)$.
The interesting point usually overlooked is that $F$ preserves the
parse-trees used by $G$, up to non-terminals renaming (due to
cross-product).

Thus if the FA $A$ generates only your input string, the grammar $F$
will generate only that string (if it is in $\mathcal L(G)$, otherwise
it generates the empty language $\emptyset$). Furthermore, it
generates it with all the parse-trees that $G$ could use to generate
it.

This grammar $F$ is what is usually called a shared parse forest, and
all general CF parsing algorithms are a more or less optimized version
of the cross-product construction, whether CYK, Earley, generalized LR
or LL, or others. So all I am saying applies to them too.

But, as you see, this generalizes to parsing a whole regular set, if
anyone is interested in doing that.

That is precisely your question. You have a string $w$. You want to
parse it up to some variations defined by a finite state transducer,
which in your case is a transducer that produces all strings within
some given Levenshtein edit distance of $w$ (but the origin of the
transducer is immaterial). The set of those strings is a regular set
that can be defined by a FA, with weighted transition that can compute
the edit distance of each string.

If you do the cross-product with your grammar $G$, you get a shared
parse forest grammar $F$ that generates all the strings in the
intersection. Furthermore, you get the weights on some of the rules so
that you can compute the edit distance of each of the accepted
strings.

If desirable, this can be used to keep only the strings with minimal
distance.

However, this can be improved a bit since composition with finite
state machines is associative.

If you always use the same, finite state transducer, as is the case in
your question, the right way to go about is to compose the Grammar $G$
and the transducer (here the Levenshtein automaton), independently of
the input string. This gives you a weighted grammar that you can use
to parse the input string $w$. The problem is that parsing with the
brutal intersection construction with give you strings at any
Levenshtein distance, i.e. $\Sigma^*$.

It would be easy to prune that construction to get the same result as
before, but the best way is a more controlled intersection
construction, such as the dynamic programming organization used by
most parsers in the literature, including Earley's, and use it to
avoid generating useless rule by computing distances and aborting any
computational path when it exceeds the desired threshold. Dynamic
programming can also be used to compute directly the parse-forest (or
parse-tree) for the string that have the shortest distance to the
input.

Context

StackExchange Computer Science Q#44889, answer score: 8

Revisions (0)

No revisions yet.