patternMinor
NFA models with characters on nodes, not edges
Viewed 0 times
nodesedgesnfawithcharactersmodelsnot
Problem
I am attempting to understand the inner workings of the open source string matching library Hyperscan.
It takes a multiple-engine approach to the problem of generating string matches, and I'm still in the early stages of following through the debugging output to see how it does what it does.
However, one design decision already visible at this point puzzles me: when constructing an NFA, it doesn't use the traditional model of a directed graph with edges labeled with characters, and nodes labeled with whether or not they accept. Instead, it seems to use a model (*) of a directed graph with unlabeled edges and nodes labeled with a set of characters that can reach this node, along with designated special nodes for accepting strings. (specifically, it has a special
I think this model is equivalent to the standard NFA model, but I can't find any references to this NFA model anywhere; I wonder if it goes by a standard name that I just don't know, and therefore can't search for. (My attempts just get buried in page after page of explanations of the standard NFA model)
So, has anyone seen an NFA model like this before? Are there any references I can look at to find out why one might prefer a model like this?
(On the off chance that anyone has already written up something on how the compiler and the various string matching engines embedded inside the Hyperscan library work, I'd also love any pointers to that. I primarily want to learn about state-of-the-art string-matching techniques, not about how to reverse-engineer C++.)
(*) I'm simplifying slightly here and ignoring bits of node and edge labels that seem unused at the initial stages.
It takes a multiple-engine approach to the problem of generating string matches, and I'm still in the early stages of following through the debugging output to see how it does what it does.
However, one design decision already visible at this point puzzles me: when constructing an NFA, it doesn't use the traditional model of a directed graph with edges labeled with characters, and nodes labeled with whether or not they accept. Instead, it seems to use a model (*) of a directed graph with unlabeled edges and nodes labeled with a set of characters that can reach this node, along with designated special nodes for accepting strings. (specifically, it has a special
NODE_ACCEPT node and a special NODE_ACCEPT_EOD node; nodes that mean "accept and ignore remaining input" have edges to NODE_ACCEPT and nodes that mean "accept if at end of input" have edges to NODE_ACCEPT_EOD)I think this model is equivalent to the standard NFA model, but I can't find any references to this NFA model anywhere; I wonder if it goes by a standard name that I just don't know, and therefore can't search for. (My attempts just get buried in page after page of explanations of the standard NFA model)
So, has anyone seen an NFA model like this before? Are there any references I can look at to find out why one might prefer a model like this?
(On the off chance that anyone has already written up something on how the compiler and the various string matching engines embedded inside the Hyperscan library work, I'd also love any pointers to that. I primarily want to learn about state-of-the-art string-matching techniques, not about how to reverse-engineer C++.)
(*) I'm simplifying slightly here and ignoring bits of node and edge labels that seem unused at the initial stages.
Solution
Hyperscan chief architect here.
The buzzword you are looking for is a Glushkov NFA, as opposed to a Thompson NFA.
The 'traditional model' you refer to is a Thompson NFA but is by no means the only choice. Both have their strengths and weaknesses, but we found the Glushkov NFA quite economical and it maps very nicely to a bitwise implementation (the "LimEx" NFA is a concrete NFA implementation, as opposed to the NFAGraph, which is an abstract representation). The joy of the Glushkov NFA is that the calculation of "reach" (which states are reached on which characters) and "succ" (which states can succeed other states) are completely independent and can be combined with a simple AND operation.
Gonzalo Navarro has a number of very nice papers and an excellent book on this subject. In terms of underlying approaches (if no the specific algorithms) we are entirely indebted to him.
Feel free to contact the Hyperscan list or alias if you want more information. We have been very lax in terms of writing up our approach.
The buzzword you are looking for is a Glushkov NFA, as opposed to a Thompson NFA.
The 'traditional model' you refer to is a Thompson NFA but is by no means the only choice. Both have their strengths and weaknesses, but we found the Glushkov NFA quite economical and it maps very nicely to a bitwise implementation (the "LimEx" NFA is a concrete NFA implementation, as opposed to the NFAGraph, which is an abstract representation). The joy of the Glushkov NFA is that the calculation of "reach" (which states are reached on which characters) and "succ" (which states can succeed other states) are completely independent and can be combined with a simple AND operation.
Gonzalo Navarro has a number of very nice papers and an excellent book on this subject. In terms of underlying approaches (if no the specific algorithms) we are entirely indebted to him.
Feel free to contact the Hyperscan list or alias if you want more information. We have been very lax in terms of writing up our approach.
Context
StackExchange Computer Science Q#71236, answer score: 5
Revisions (0)
No revisions yet.