HiveBrain v1.2.0
Get Started
← Back to all entries
snippetMajor

How to simulate backreferences, lookaheads, and lookbehinds in finite state automata?

Submitted by: @import:stackexchange-cs··
0
Viewed 0 times
lookbehindsfinitebackreferenceslookaheadsautomatastatehowandsimulate

Problem

I created a simple regular expression lexer and parser to take a regular expression and generate its parse tree. Creating a non-deterministic finite state automaton from this parse tree is relatively simple for basic regular expressions. However I can't seem to wrap my head around how to simulate backreferences, lookaheads, and lookbehinds.

From what I read in the purple dragon book I understood that to simulate a lookahead $r/s$ where the regular expression $r$ is matched if and only if the match is followed by a match of the regular expression $s$, you create a non-deterministic finite state automaton in which $/$ is replaced by $\varepsilon$. Is it possible to create a deterministic finite state automaton that does the same?

What about simulating negative lookaheads and lookbehinds? I would really appreciate it if you would link me to a resource which describes how to do this in detail.

Solution

First of all, backreferences can not be simulated by finite automata as they allow you to describe non-regular languages. For example, ([ab]^)\1 matches $\{ww \mid w \in \{a,b\}^\}$, which is not even context-free.

Look-ahead and look-behind are nothing special in the world of finite automata as we only match whole inputs here. Therefore, the special semantic of "just check but don't consume" is meaningless; you just concatenate and/or intersect checking and consuming expressions and use the resulting automata. The idea is to check the look-ahead or look-behind expressions while you "consume" the input and store the result in a state.

When implementing regexps, you want to run the input through an automaton and get back start and end indices of matches. That is a very different task, so there is not really a construction for finite automata. You build your automaton as if the look-ahead or look-behind expression were consuming, and change your index storing resp. reporting accordingly.

Take, for instance, look-behinds. We can mimic the regexp semantics by executing the checking regexp concurrently to the implicitly consuming "match-all" regexp. only from states where the look-behind expression's automaton is in a final state can the automaton of the guarded expression be entered. For example, the regexp /(?=c)[ab]+/ (assuming $\{a,b,c\}$ is the full alphabet) -- note that it translates to the regular expression $\{a,b,c\}^c\{a,b\}^+\{a,b,c\}^$ -- could be matched by

[source]

and you would have to

  • store the current index as $i$ whenever you enter $q_2$ (initially or from $q_2$) and



  • report a (maximum) match from $i$ to the current index ($-1$) whenever you hit (leave) $q_2$.



Note how the left part of the automaton is the parallel automaton of the canonical automata for [abc]* and c (iterated), respectively.

Look-aheads can be dealt with similarly; you have to remember the index $i$ when you enter the "main" automaton, the index $j$ when you leave the main automaton and enter the look-ahead automaton and report a match from $i$ to $j$ only when you hit the look-ahead automaton's final state.

Note that non-determinism is inherent to this: main and look-ahead/-behind automaton might overlap, so you have to store all transitions between them in order to report the matching ones later, or backtrack.

Context

StackExchange Computer Science Q#2557, answer score: 23

Revisions (0)

No revisions yet.