HiveBrain v1.2.0
Get Started
← Back to all entries
patternModerate

Theory behind regex implementations

Submitted by: @import:stackexchange-cs··
0
Viewed 0 times
theoryregeximplementationsbehind

Problem

In a 2007 article, Russ Cox (at presents, he leads the development of the Go programming language at Google) argues that regex engines in languages like Java, Perl, PHP, Python, Ruby are built on a different theoretical framework from implementations like awk or sed. Cox claims a difference in performance between 60s and 20microseconds (the difference is an order of magnitude of millions!). Link to the article.

Is this issue still up-to-date? That is, is the theory behind these groups of implementations basically different?

Is there any real life scenario when a programmer would need to pick some Jurassic implementation like awk (due to its solid theoretical framework) because otherwise the task at hand could not be dealt with?

Solution

It isn't a matter of implementation. It's a matter of different approaches to parsing that unfortunately have similar names.

NFAs and DFAs (finite-state machines) can only recognize regular languages. Regular expressions are a way of specifying regular languages, and they can be compiled into efficient NFAs or DFAs that recognize the language. This is all taught in undergraduate CS courses.

The so-called "regexes" in Perl and its many imitators have a name that sounds like an abbreviation of "regular expression", and they syntactically resemble the regular expressions of sed and awk, but they are not regular expressions. They are really specifications of backtracking parsers. Backtracking parsers aren't limited to recognizing regular languages. (The name "regex" is Larry Wall's fault. He made a deliberate decision to not call them "regular expressions" because he knew that they weren't, but he should have invented an entirely different name.)

You don't need a Jurassic implementation of NFA/DFA based regular expression matching; modern maintained libraries are available for popular languages. But they lack many features of Perl-style regexes, and they don't offer much in exchange. They certainly have much better worst-case performance, but the pathological worst cases of backtracking parsers can be fixed by modifying the regexes, without losing the extra flexibility of backtracking.

The article by Cox is disingenuous. He evidently wants you to believe that the authors of regex libraries are ignorant of the theory of regular languages, and if they had just studied the standard undergraduate CS curriculum then they would have implemented their libraries in a different way and made them faster. That's false. It's impossible to implement Perl-style regexes that way, and he knows it. The popularity of Perl-style regexes is probably due in large part to their flexible feature set that goes far beyond what's available in any true regular expression library.

Context

StackExchange Computer Science Q#135340, answer score: 13

Revisions (0)

No revisions yet.