Possessive Kleene star operator
Problem
Has anyone studied the consequences of the Kleene star in regular expressions to always be "possessive"?
In other words, if * always matched as much as possible and no backtracking were allowed, would I still be able to express any regular language?

Let's say that @ is the possessive * operator. There are cases where it doesn't matter: a*b and a@b will both match any string with zero or more $a$s followed by a $b$.

However, there are cases where being possessive is relevant: a.*b will match any string that starts with $a$ and ends with $b$, but a.@b will never match any string, as the possessive @ eats every character, including the ending $b$. The equivalent expression would be a[^b]@b.

One may be tempted to think that for any non-possessive expression there is an equivalent possessive expression and vice versa, but I wasn't able to find a proof of this.
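To make the difference concrete, here is a hand-rolled sketch of how the three patterns above behave on the same input. No regex engine is used; the matcher functions are hypothetical illustrations written for this example only:

```python
def match_a_dotstar_b(s):
    # a.*b with backtracking: 'a' first, then .* can give characters
    # back, so it suffices that s starts with 'a' and ends with 'b'.
    return len(s) >= 2 and s[0] == "a" and s[-1] == "b"

def match_a_dotposs_b(s):
    # a.@b possessive: after 'a', .@ consumes every remaining character
    # and never backtracks, so nothing is ever left over for the 'b'.
    if not s or s[0] != "a":
        return False
    remaining = ""                    # .@ swallowed the rest, 'b' included
    return remaining.startswith("b")  # always False

def match_a_notb_poss_b(s):
    # a[^b]@b: [^b]@ consumes maximally but must stop at the first 'b',
    # which is then available for the final literal 'b'.
    if not s or s[0] != "a":
        return False
    i = 1
    while i < len(s) and s[i] != "b":  # possessive: take all non-'b's
        i += 1
    return i == len(s) - 1             # exactly the last char is 'b'

print(match_a_dotstar_b("axxb"))    # True
print(match_a_dotposs_b("axxb"))    # False: .@ also ate the final 'b'
print(match_a_notb_poss_b("axxb"))  # True
```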
I'm limiting myself to the * operator, considering a+ equivalent to aa*.

As D.W. suggested in the comments below, I tried to start from the DFA. I didn't get very far, as this seems to have more to do with the way the non-determinism is handled than with the structure of the automaton.
Could anyone point me in the right direction?
EDIT 3
I've posted an answer that I believe is correct.
EDIT 2
(REMOVED)
EDIT 1
(It has been pointed out to me that the right term to use is "possessive" (http://www.regular-expressions.info/possessive.html) rather than "greedy". Thanks.)
Solution
I will share below my attempt to prove that you can express every regular language as a regular expression using just the $+$ and $@$ operators. Unfortunately, my attempt at a proof has a gap. Perhaps you can repair the gap. My attempt is based upon Brzozowski's method for generating regular expressions, so I'll assume you are already familiar with how that works.
First, some definitions and other preliminaries.
Definition. Define $A^+$ to be the language of one or more words from $A$, i.e., $A^+ = \{a_1 a_2 \cdots a_k \mid k \ge 1, a_1,\dots,a_k \in A\}$. In other words, $A^+ = A^* \setminus \{\epsilon\}$.
Definition. $\newcommand{\rp}{\backslash\backslash}$ For sets $A,B \subseteq \Sigma^*$, define $A\rp B$ by $A\rp B = (A^* \backslash B) - (A^+ \Sigma^*)$, where $A^* \backslash B$ is the left quotient, and $-$ represents the difference of two sets. In other words, $A\rp B$ is the smallest set $C \subseteq \Sigma^*$ such that $A^* B = A^* C$ and such that no word in $A^+$ is a prefix of any word in $C$.
Example: $\{01\}\rp \{0101010,101\} = \{0,101\}$.
Note that if $A,B$ are regular languages, then $A\rp B$ is also a regular language and can be computed effectively from $A,B$. (This follows from the standard closure properties of regular languages.)
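For finite $A$ and $B$ the operation can be computed directly from the definition. The sketch below (helper names are my own) reproduces the example above; only finitely many powers of $A$ matter, since a useful prefix can be no longer than the longest word of $B$:

```python
def star_words(A, max_len):
    # All concatenations of zero or more words from A, up to max_len
    # (assumes no word of A is empty, so the loop terminates).
    star, frontier = {""}, {""}
    while frontier:
        frontier = {u + a for u in frontier for a in A
                    if len(u + a) <= max_len and u + a not in star}
        star |= frontier
    return star

def rp(A, B):
    # A \\ B = (A* \ B) - (A+ Sigma*): left-quotient B by A*, then drop
    # every word that still has a nonempty A-power as a prefix.
    max_len = max(map(len, B), default=0)
    star = star_words(A, max_len)
    plus = star - {""}
    quotient = {b[len(u):] for b in B for u in star if b.startswith(u)}
    return {c for c in quotient
            if not any(c.startswith(p) for p in plus)}

print(sorted(rp({"01"}, {"0101010", "101"})))  # ['0', '101']
```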
Definition. The semantics of regular expressions using the $+$ and $@$ operators are defined inductively:
- $L(w) = \{w\}$ if $w \in \Sigma^*$
- $L(A_1+A_2+\dots+A_k) = L(A_1) \cup L(A_2) \cup \dots \cup L(A_k)$
- $L(A^@ B) = \{ab : a \in L(A^*), b \in L(B), b \notin L(A^+ \Sigma^*)\}$
- $L(w B) = \{wb : b \in L(B)\}$ if $w \in \Sigma^*$
- $L(A^@) = L(A^*)$
- $L((A_1+A_2)B) = L(A_1B + A_2 B)$, $L(A(B_1+B_2)) = L(AB_1 + AB_2)$
- $L(A_1 A_2 \cdots A_k) = L(A_1 (A_2 \cdots A_k))$
(As a consequence, every regular expression using $+,@$ can be equivalently written as a sum of terms of the form $A_1 A_2 \cdots A_k$ where each $A_i$ is either of the form $w_i$ or $C_i^@$ for some $w_i \in \Sigma^*$ and some regular expression $C_i$ using $+,@$. Once it is rewritten in that sum-of-products form, the semantics of the resulting expression is given by the definition above.)
For instance, we have the following property: if there is no word in $L(A^+)$ that is a prefix of any word in $L(B)$, then $L(A^@ B) = L(A^* B)$.
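One can check this property on small finite approximations of the languages involved. The sketch below (truncating every language at a fixed length, an assumption of mine for the demo) compares $L(A^* B)$ with $L(A^@ B)$ as defined above:

```python
def star_words(A, max_len):
    # Concatenations of zero or more words of A, truncated at max_len
    # (assumes no word of A is empty).
    star, frontier = {""}, {""}
    while frontier:
        frontier = {u + a for u in frontier for a in A
                    if len(u + a) <= max_len and u + a not in star}
        star |= frontier
    return star

def L_star_concat(A, B, max_len):
    # L(A^* B), truncated at max_len.
    return {u + b for u in star_words(A, max_len)
            for b in B if len(u + b) <= max_len}

def L_poss_concat(A, B, max_len):
    # L(A^@ B) per the definition: drop each b lying in A^+ Sigma^*.
    plus = star_words(A, max_len) - {""}
    ok = {b for b in B if not any(b.startswith(p) for p in plus)}
    return {u + b for u in star_words(A, max_len)
            for b in ok if len(u + b) <= max_len}

# A word of A^+ ("0") is a prefix of "0" in B: the two semantics differ.
print(sorted(L_star_concat({"0"}, {"0", "1"}, 3)))
print(sorted(L_poss_concat({"0"}, {"0", "1"}, 3)))
# No word of A^+ is a prefix of "1": here A^@ B and A^* B agree.
print(L_star_concat({"0"}, {"1"}, 3) == L_poss_concat({"0"}, {"1"}, 3))  # True
```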
In what follows, I won't try to distinguish between a regular expression $E$ and its corresponding language $L(E)$.
Now we can prove a generalization of Arden's lemma.
Lemma 1. Let $A,B \subseteq \Sigma^*$ be given, and assume $\epsilon \notin A$ and no word in $A^+$ is a prefix of any word in $B$. Suppose we have the equation $X=AX + B$ (i.e., $X=AX \cup B$). Then the least solution to this equation is $X=A^@ B$.
Proof. Arden's lemma says the least solution is $X=A^* B$. Now, since no word in $A^+$ is a prefix of any word in $B$, in fact $A^* B = A^@ B$, as there can be no ambiguity about how much the $*$ operator gobbles up.
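As a small worked instance (my own example, not from the original answer): over $\Sigma=\{a,b\}$, take $A=\{a\}$ and $B=\{b\}$. Then $\epsilon \notin A$ and no word $a^k$ ($k \ge 1$) is a prefix of $b$, so Lemma 1 applies:

```latex
% Hypotheses of Lemma 1 hold for A = \{a\}, B = \{b\}:
%   \epsilon \notin A, and no word of A^+ = \{a, aa, \dots\} is a prefix of b.
X = AX + B
  \;\Longrightarrow\;
X = A^{@}B = A^{*}B = \{\, a^k b \mid k \ge 0 \,\}
```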
Lemma 2. Let $A,B \subseteq \Sigma^*$ be given, and assume $\epsilon \notin A$. Suppose we have the equation $X=AX + B$ (i.e., $X=AX \cup B$). Then the least solution to this equation is $X=A^@ C$ where $C=(A\rp B)$.
Proof. Arden's lemma says the least solution is $X=A^* B$. As noted above, $A^* B = A^* C$. Moreover, no word in $A^+$ is a prefix of any word in $C$, so $A^* C = A^@ C$.
Now given any regular language $L$, we can apply Brzozowski's method to it, but using Lemma 2 above instead of Arden's lemma. It's tempting to hope that the result will be a regular expression for $L$ that uses only the operators $+$ and $@$, but unfortunately there's a gap: I don't know whether one can prove that $A\rp B$ can be represented as a regular expression using only the $+$ and $@$ operators. So, this proof method has a big gaping hole in it.
But perhaps someone will see a way to build on this and prove the desired result.
Context
StackExchange Computer Science Q#40712, answer score: 5