HiveBrain v1.2.0
Get Started
← Back to all entries
patternModerate

When is a regexp not a Regular Expression?

Submitted by: @import:stackexchange-cs··
0
Viewed 0 times
expressionregularregexpwhennot

Problem

Since I'm studying for my formal languages college course, I stumbled upon these fascinating posts (One Two) which describe how to find a prime number using a regexp. As I said, a regexp, not a regular expression.
Since a regular expression can match strings computed by a Finite State Automata and finding a prime number can't be done by a FSA, the regexp shown in the blog post is not entirely a regular expression since it does backtracking to match the string.

Since I've never really used any regular expression, now, my question:

How can I immediately recognize a regexp from a "true" regular expression just by looking at it?

Definitions: By regular expression, I refer to the notion as defined in formal languages. By regexp, I mean the notion supported by modern programming languages; the regexp syntax often contains additional features, such as backreferences. Regexps as seen in programming languages are strictly more powerful than formal languages style regular expressions.

Solution

tl;dr backrefs.

As soon as there is a \1 (or any number that isn't used to escape unicode) in the regexp it is not a regular expression.

Backrefs allows you to match (a+)b\1 which matches n times a followed by b followed by n times a for any n>1. This is not a regular language (it's the poster child of a non regular language).

It is necessary and nearly sufficient that the backref references a group that contains a regexp that matches an arbitrarily long string or that it contains a * or +. The only exception (that I found) of a regexp of the form (A)B\1 where A is a finite language (could be replaced by a enumeration of all words that accepts them). You can convert it to word1+Bword1|word2+Bword2 etc. because A is finite.

Look-around groups don't remove the regularness of the regexp. A(?=B)C is the cross-section of regexes AB. and AC and the cross-section of 2 regular languages is regular. Negative lookahead is similar except using the complement of B. (complements of regular languages being regular). Lookbehind is exactly the same as well A(?<=B)C is the cross-section of AC and .*BC.

Context

StackExchange Computer Science Q#38451, answer score: 17

Revisions (0)

No revisions yet.