HiveBrain v1.2.0

Implementing regular expression matching using Brzozowski derivatives

Submitted by: @import:stackexchange-cs

Problem

I have been taking a language theory class, and we recently learned about Brzozowski derivatives. In class it occurred to me that they could be used to implement a simple regular expression matcher. If you take the derivative of a regular expression with respect to a given string and the resulting expression matches the empty string ($\varepsilon$), then the original expression matches that string:

$$
w \text{ matches } e \Leftrightarrow \varepsilon \text{ matches } D_w(e)
$$
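This idea translates almost directly into code. The sketch below is a minimal, illustrative implementation in Python (not from the question's linked implementations); the tuple-based expression representation and function names are my own choices:

```python
# Regex matching via Brzozowski derivatives.
# Expressions are nested tuples: ('empty',), ('eps',), ('char', c),
# ('seq', a, b), ('alt', a, b), ('star', a).

EMPTY = ('empty',)   # matches nothing
EPS = ('eps',)       # matches only the empty string

def char(c):  return ('char', c)
def seq(a, b): return ('seq', a, b)
def alt(a, b): return ('alt', a, b)
def star(a):  return ('star', a)

def nullable(e):
    """Does e match the empty string?"""
    tag = e[0]
    if tag == 'eps' or tag == 'star':
        return True
    if tag == 'seq':
        return nullable(e[1]) and nullable(e[2])
    if tag == 'alt':
        return nullable(e[1]) or nullable(e[2])
    return False  # 'empty' and 'char'

def deriv(e, c):
    """Brzozowski derivative of e with respect to character c."""
    tag = e[0]
    if tag in ('empty', 'eps'):
        return EMPTY
    if tag == 'char':
        return EPS if e[1] == c else EMPTY
    if tag == 'seq':
        a, b = e[1], e[2]
        d = seq(deriv(a, c), b)
        # If a is nullable, c may also begin a match of b.
        return alt(d, deriv(b, c)) if nullable(a) else d
    if tag == 'alt':
        return alt(deriv(e[1], c), deriv(e[2], c))
    if tag == 'star':
        return seq(deriv(e[1], c), e)

def matches(e, w):
    """w matches e  iff  epsilon matches D_w(e)."""
    for c in w:
        e = deriv(e, c)
    return nullable(e)
```

For example, `matches(star(alt(char('a'), char('b'))), "abba")` checks `"abba"` against `(a|b)*` by deriving once per character and testing nullability at the end. A production version would also simplify expressions after each derivative step to keep them from growing.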

I did a web search and, as expected, other people had the same idea before. There is an implementation here and another one here. They are short and simple.

Is this used in practice anywhere? I'm not sure, but I believe that most regular expression matchers are implemented using either finite automata or backtracking algorithms (like Perl regular expressions). Why is this the case? Is the technique I mentioned too slow? Is it missing functionality? Does anybody know the complexity of regular expression matching using derivatives?

Solution

Yes. This approach has been used before. I've seen it used in a research project, where the goal was to build a regexp matcher that could be formally verified to be correct (with as little effort as possible). This algorithm was implemented in Coq and then proven correct. The nice thing is that it is relatively easy to prove correct. In contrast, a table-driven lexer would be harder to prove correct, because one would have to prove that the table was constructed correctly.

I think that table-driven lexers are often faster in practice. Your scheme essentially computes the transition table on the fly, whereas a table-driven lexer precomputes it, so matching needs only a few instructions per character read from the input (read one character, do a table lookup indexed by the current state and the just-read character, update the current state). Their constant factor can therefore be lower.

To be clear, by a table-driven lexer I mean something that converts the regexp to a DFA and precomputes its transition table. This is useful when the regexp is fixed and known at compile time. For many regexps seen in practice, the DFA is not too large, so this yields a fairly efficient matching algorithm.
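To illustrate the contrast, here is a hand-precomputed table for the DFA of `(ab)*` over the alphabet `{a, b}`; the state numbering and table layout are illustrative, not from the answer:

```python
# Table-driven matching for (ab)*.
# States: 0 = start (accepting), 1 = just saw 'a', 2 = dead state.
TABLE = {
    (0, 'a'): 1, (0, 'b'): 2,
    (1, 'a'): 2, (1, 'b'): 0,
    (2, 'a'): 2, (2, 'b'): 2,
}
ACCEPTING = {0}

def dfa_matches(w):
    state = 0
    for c in w:
        # One table lookup per input character; characters outside
        # the alphabet go to the dead state.
        state = TABLE.get((state, c), 2)
    return state in ACCEPTING
```

The inner loop is just a lookup and a state update per character, which is why the constant factor beats recomputing derivatives on the fly.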

Context

StackExchange Computer Science Q#52097, answer score: 4
