patternMinor
Worst-Case Complexity of Quantifiers in Thompson's Construction
Viewed 0 times
caseconstructionquantifierscomplexityworstthompson
Problem
My understanding is that an NFA compiled using Thompson's Construction should have a running time that is linear in the length of the input string, with a space complexity that is linear with the length of the regex pattern.
Consider this pathological example: $a^a^ab$, which should match at least one $a$ followed by exactly one $b$. I'm going to refer to the compiled NFA as having three parts, with a "front part" representing the first $a^$, a "middle part" representing the second $a^$, and a "back part" representing $ab$.
The front and middle parts would look like this, with a simple $a$ symbol transition for $N(s)$:
The back part would look like two of these symbol transitions, one for $a$ then one for $b$:
Let's try to match $aaab$:
At this point, you can probably see what I'm getting at; the first Kleene star causes execution to split, in the worst case, $O(n)$ times, where $n$ is the length of the input, then the second
Consider this pathological example: $a^a^ab$, which should match at least one $a$ followed by exactly one $b$. I'm going to refer to the compiled NFA as having three parts, with a "front part" representing the first $a^$, a "middle part" representing the second $a^$, and a "back part" representing $ab$.
The front and middle parts would look like this, with a simple $a$ symbol transition for $N(s)$:
The back part would look like two of these symbol transitions, one for $a$ then one for $b$:
Let's try to match $aaab$:
- The compiled NFA starts in the start state.
- Having two options due to the first Kleene star, it splits execution. It doesn't matter which state goes where, so let's just say state #1 takes the epsilon transition to move on to the middle part, and state #2 stays in the front part.
- State #2 would match the first $a$, then split again. Let's say state #2 proceeds along the loop-back epsilon transition and stays in the front part, while a new state #3 moves on to the middle part. Now there are three states.
- Meanwhile, state #1 – now in the middle part – also splits due to the second Kleene star, and does basically the same thing as state #2 did in the front part. Let's say state #1 moves on again to the back part, while a new state #4 goes into the Kleene star construction in the middle part, matches the first $a$, and in turn splits again into a state #5 that loops back within the second Kleene star construction, so state #4 can move on to the back part like state #1 did, except that state #4 has already matched the first $a$.
At this point, you can probably see what I'm getting at; the first Kleene star causes execution to split, in the worst case, $O(n)$ times, where $n$ is the length of the input, then the second
Solution
The point of a finite automaton is that the number of possible states that it can be in is finite. Specifically, if a NFA has $n$ states, then at any point in the input string, you can be in at most $n$ states at the same time.
The first observation is that you can precompute $\epsilon$ transitions. Define $\mathbf{closure} : Q \rightarrow 2^Q$ to be a function that takes a state and returns the state which are reachable using zero or more $\epsilon$ transitions. Then from this, we can compute an "extended" transition function $\delta^{*} : Q \times \Sigma \rightarrow 2^Q$, which maps a state and an input symbol to the set of states that are reachable from it, then you can use this algorithm to recognise a string:
This algorithm looks like it takes $O(k^2n)$ space in the worst case, namely, the space to represent $\delta^{}$. And the inner loop operation, $\cup_{s \in W}\,\delta^{}(w, a)$, looks like it should take $O(k^2)$ time in the worst case.
This isn't quite true in the word RAM model, however. Consider that the space needed to store a set of states can be measured in bits; since $k$ is known in advance, the sets like $W$ and the codomain of $\delta^{*}$ can be represented as a bit vector, and set union can be a logical or. So in the word RAM model, the inner loop will actually take $O\left(\frac{k^2}{\log k}\right)$ time in practice.
Having said that, as you correctly guessed, Thompson's construction is not ideal for this use case. It has many advantages (not the least of which is that the space is linear in the size of the input regular expression), but the efficiency of its "obvious" recogniser algorithm is not one of them. As you rightly suspected, it produces a lot of unnecessary states that we can get rid of.
One obvious point is that it's trivial to optimise common "leaf" cases. For example, Thompson's algorithm will turn $a^$ into a NFA with 4 states, but it should be obvious that you "need" fewer than that, and the complexity of the construction for $E^$ is only needed because $E$ may be more complex.
Following this reasoning, it is possible, without too much difficulty, to construct an NFA with the following properties:
Property 1 gives an accurate bound on the number of states which is computable at parse time.
Property 3 means that you can label states with alphabet symbols instead of transitions. Together with property 2, the transition function $\delta$ doesn't need to consider input symbols.
We define a normal NFA (NNFA) to be a 6-tuple $(\Sigma,Q,\delta,I,F,A)$ where:
Then the inner loop of the recogniser algorithm is to set:
$$W := \left\{ s\,|\,w \in W, s \in \delta(w), A(s) = \alpha\right\}$$
This should give you a lot of ideas as to how to implement this in data structures. For example, consider representing $\delta$ as a bit matrix and $A$ as a map from $\Sigma$ to a bit vector, where $A(\alpha)$ is the set of states $q$ which have the label $\alpha$. Then the inner loop becomes a bunch of bit vector unions (i.e. logical or operations) with a final bit vector intersection (i.e. a logical and). In the word RAM model, this is extremely efficient.
As to how to construct a NNFA, it's worth trying to come up with the idea behind it yourself, now that you know that it's possible. As a hint, start with Thompson's construction, and create a new automaton where the "new" states are the $\mathbf{closure}$ of every state which has a terminal transition coming into it. You can then think about how you could compute that automaton without explicitly constructing Thompson's automaton to begin with.
Once you've tried it, the best explanation that I've seen is Chang's PhD thesis. But here's the amazing part: Chang then goes on to give a compressed representation for a NNFA which takes $O(k)$ space, with a construction algorithm that uses $O(k)$ auxiliary space and $O(r)$ time where $r$ is (in some sense) the "size" of the input regular expression.
That effectively answers your question. There is indeed an algorithm which recognises regular expressions, which stores the automaton in $O(k)$ sp
The first observation is that you can precompute $\epsilon$ transitions. Define $\mathbf{closure} : Q \rightarrow 2^Q$ to be a function that takes a state and returns the state which are reachable using zero or more $\epsilon$ transitions. Then from this, we can compute an "extended" transition function $\delta^{*} : Q \times \Sigma \rightarrow 2^Q$, which maps a state and an input symbol to the set of states that are reachable from it, then you can use this algorithm to recognise a string:
- Given the start state $s$, set $W := \mathbf{closure}(s)$. (Again, this can be precomputed.)
- For each input symbol $a$, set $W := \cup_{s \in W}\,\delta^{*}(w, a)$.
- Accept if there is a final state in $W$.
This algorithm looks like it takes $O(k^2n)$ space in the worst case, namely, the space to represent $\delta^{}$. And the inner loop operation, $\cup_{s \in W}\,\delta^{}(w, a)$, looks like it should take $O(k^2)$ time in the worst case.
This isn't quite true in the word RAM model, however. Consider that the space needed to store a set of states can be measured in bits; since $k$ is known in advance, the sets like $W$ and the codomain of $\delta^{*}$ can be represented as a bit vector, and set union can be a logical or. So in the word RAM model, the inner loop will actually take $O\left(\frac{k^2}{\log k}\right)$ time in practice.
Having said that, as you correctly guessed, Thompson's construction is not ideal for this use case. It has many advantages (not the least of which is that the space is linear in the size of the input regular expression), but the efficiency of its "obvious" recogniser algorithm is not one of them. As you rightly suspected, it produces a lot of unnecessary states that we can get rid of.
One obvious point is that it's trivial to optimise common "leaf" cases. For example, Thompson's algorithm will turn $a^$ into a NFA with 4 states, but it should be obvious that you "need" fewer than that, and the complexity of the construction for $E^$ is only needed because $E$ may be more complex.
Following this reasoning, it is possible, without too much difficulty, to construct an NFA with the following properties:
- The number of states is equal to the number of alphabet symbols in the original regular expression (in some formulations, plus one for the start state, but here we will use a set of start states). That is, $a^a^ab$ should require exactly 4 states.
- There are no $\epsilon$ transitions.
- Every transition into a state $s$ has the same label.
Property 1 gives an accurate bound on the number of states which is computable at parse time.
Property 3 means that you can label states with alphabet symbols instead of transitions. Together with property 2, the transition function $\delta$ doesn't need to consider input symbols.
We define a normal NFA (NNFA) to be a 6-tuple $(\Sigma,Q,\delta,I,F,A)$ where:
- $\Sigma$ is the alphabet.
- $Q$ is a set of states.
- $\delta : Q \rightarrow 2^Q$ (or, alternatively, $\delta \subseteq Q \times Q$) is the transition function.
- $I \subseteq Q$ is the set of initial states.
- $F \subseteq Q$ is the set of final states.
- $A : Q \rightarrow \Sigma$ is a function which assigns labels to states.
Then the inner loop of the recogniser algorithm is to set:
$$W := \left\{ s\,|\,w \in W, s \in \delta(w), A(s) = \alpha\right\}$$
This should give you a lot of ideas as to how to implement this in data structures. For example, consider representing $\delta$ as a bit matrix and $A$ as a map from $\Sigma$ to a bit vector, where $A(\alpha)$ is the set of states $q$ which have the label $\alpha$. Then the inner loop becomes a bunch of bit vector unions (i.e. logical or operations) with a final bit vector intersection (i.e. a logical and). In the word RAM model, this is extremely efficient.
As to how to construct a NNFA, it's worth trying to come up with the idea behind it yourself, now that you know that it's possible. As a hint, start with Thompson's construction, and create a new automaton where the "new" states are the $\mathbf{closure}$ of every state which has a terminal transition coming into it. You can then think about how you could compute that automaton without explicitly constructing Thompson's automaton to begin with.
Once you've tried it, the best explanation that I've seen is Chang's PhD thesis. But here's the amazing part: Chang then goes on to give a compressed representation for a NNFA which takes $O(k)$ space, with a construction algorithm that uses $O(k)$ auxiliary space and $O(r)$ time where $r$ is (in some sense) the "size" of the input regular expression.
That effectively answers your question. There is indeed an algorithm which recognises regular expressions, which stores the automaton in $O(k)$ sp
Context
StackExchange Computer Science Q#138275, answer score: 3
Revisions (0)
No revisions yet.