patternModerate
Which languages do Perl-compatible regular expressions recognize?
Viewed 0 times
perllanguagesregularrecognizecompatibleexpressionswhich
Problem
As the title says, I spent a couple of hours last weekend trying to wrap up my mind about the class of languages matched by Perl-compatible regular expressions, excluding any matching operator that allows to execute arbitrary code inside the pattern.
If you don't know what PCREs are, please read this and this.
The problem is, the resources available on internet pretty much stop at context-free languages, and PCREs can match more than those (see below); but I really don't know where to find more theorems or papers about this kind of stuff.
In particular: PCREs are obviously a superset of regular languages (as the PCRE syntax has all the regular language operators).
Any CFG can be put in Greibach normal form, which removes left recursion. I think this can be used by means of
This should deal with any CFG. Therefore, PCREs should be able to match any CFL.
There's more: let's take the simple language
$$L = \{ ww | w \in \Lambda^* \} $$
i.e. the language of the strings repeated twice. This language is not a CFL -- the pumping lemma for CFLs fails. (Pay particular attention that
$$ |vxw| \leq p$$
must hold, thus you can't just pump the beginnings or the ends of the two repeated strings.)
However, this language is easily matched by a
If you don't know what PCREs are, please read this and this.
The problem is, the resources available on internet pretty much stop at context-free languages, and PCREs can match more than those (see below); but I really don't know where to find more theorems or papers about this kind of stuff.
In particular: PCREs are obviously a superset of regular languages (as the PCRE syntax has all the regular language operators).
Any CFG can be put in Greibach normal form, which removes left recursion. I think this can be used by means of
(?(DEFINE)...) groups to "translate" the grammar into matching subroutines, avoiding to choke on left recursion, by translating:- the non-terminal at the head of each production becomes a subroutine
(?...)
- the body of each production is put in the subroutine; terminals are left as-is, non-terminals become procedure invocations (i.e.
(?&NONTERMINAL));
- all the productions with the same nonterminal as head are ORed together by means of the
|operator (plus additional grouping with(?:...), if necessary)
- the pattern then becomes a
(?(DEFINE)...)group containing all the "translated" productions, and an invocation for the procedure of the starting symbol, to match the entire string, i.e.^(?(DEFINE)...)(?&START)$
This should deal with any CFG. Therefore, PCREs should be able to match any CFL.
There's more: let's take the simple language
$$L = \{ ww | w \in \Lambda^* \} $$
i.e. the language of the strings repeated twice. This language is not a CFL -- the pumping lemma for CFLs fails. (Pay particular attention that
$$ |vxw| \leq p$$
must hold, thus you can't just pump the beginnings or the ends of the two repeated strings.)
However, this language is easily matched by a
Solution
I've also found this blog post extremely interesting http://nikic.github.io/2012/06/15/The-true-power-of-regular-expressions.html. It provides the very same proof I gave before about the fact that regexps recognizes the CFLs (by rewriting the grammar through
DEFINE blocks), and even some CSLs (like the language of repeated strings); it builds on that and goes on, giving a proof that regexps with backreferences are NP-hard (by reducing 3-SAT to a regexp).Context
StackExchange Computer Science Q#4839, answer score: 11
Revisions (0)
No revisions yet.