HiveBrain v1.2.0
Get Started
← Back to all entries
patternMinor

Are there real lexers that use NFAs directly instead of first transforming them to DFAs?

Submitted by: @import:stackexchange-cs··
0
Viewed 0 times
realarelexerstransforminginsteaddfasfirstthatdirectlythem

Problem

I am taking the Coursera class on compilers and in the lesson about lexers it is hinted that there is a time-space tradeoff between using non-deterministic finite automaton (NFA) and deterministic finite automaton (DFA) to parse regular expressions. If I understand correctly, the tradeoff is that a NFA is smaller, but is more time consuming to traverse because all possible states have to be regarded at the same time and therefore it is most of the time transformed into a DFA. Are there any lexers that use NFAs instead of DFAs in "real"-life i.e. some compiler that is used in production and not a just a proof of concept?

Solution

Compiled lexical analyzers compile the NFA to a DFA.

Good interpreted regular expression matchers, on the other hand, use Thompson's algorithm, simulating the NFA with memoization. This is equivalent to compiling the NFA to a DFA, but you only produce DFA states on demand, if they are needed. At each step your deterministic state is a set of NFA states, then given the next input character you transition to a new set of NFA states. You cache previously seen states and their output transitions in a hash table. The hash table is flushed if it fills up, it does not grow without bound.

The reason you do it this way is that converting the NFA to DFA can take time exponential in the size of the regular expression. This is certainly not something you want to do if you are only evaluating the regular expression once.

RE2 is an example of a regex engine that (essentially) uses Thompson's algorithm. I can highly recommend the brilliant blog posts by RE2's author Russ Cox if you want to learn more (including lots of historical information and experimental comparisons of lots of different approaches to regex searching.

I can also highly recommend the "why GNU grep is fast" email chain. Lesson 1 is: the common case for regex search is simple string search, so special case your algorithm.

Context

StackExchange Computer Science Q#9708, answer score: 5

Revisions (0)

No revisions yet.