For every 'evil' regex, does there exist a non-evil alternative, or is the devil in the grammar?
Problem
Apparently, ReDoS attacks exploit characteristics of some (otherwise useful) regular expressions ... essentially causing an explosion of possible paths through the graph defined by the NFA.
Is it possible to avoid such problems by writing an equivalent 'non-evil' regex? If not (thus, the grammar can't be handled in practical space/time by an NFA), what parsing approaches would be better? Why?
Solution
It depends upon whether you've got a regular expression or a regexp: regexps are evil, but regular expressions are a thing of beauty and will never turn evil on you.
By regexp, I mean a modern regular expression: i.e., a regular expression with additional modern features such as backreferences -- e.g., a Perl-compatible regular expression. This is more powerful than a classical regular expression from a formal languages / automata theory textbook, as classical regular expressions don't allow backreferences, lookahead, lookbehind, and so on.
For a classical regular expression, if you have a good implementation of the matcher, then no regular expression is too evil. In particular, a standard matching algorithm is to convert the regular expression to an NFA and then execute the NFA on an input string. For this algorithm, the worst-case running time to test an $n$-character string is $O(n)$, when the regular expression is fixed. This means that the running time can't explode: there is no string that will cause an exponential increase in running time. Thus, if you're using a matcher based on this algorithm, no classical regular expression will be evil.
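Here is a minimal sketch of that set-of-states NFA simulation, for a hand-built NFA corresponding to the classical regex `(a|b)*abb` (the pattern and the state numbering are illustrative choices, not part of the original answer). The key point is that the matcher advances *all* live states in lockstep and never backtracks, so the cost is linear in the input length for a fixed regex:

```python
# Thompson-style simulation: track the SET of reachable NFA states and
# advance the whole set one input character at a time. Each step does at
# most |states| work, so matching an n-character string is O(n) for a
# fixed regex -- no input can trigger exponential blowup.

# Hand-built NFA for the classical regex (a|b)*abb over alphabet {a, b}.
# States are 0..3; state 3 is accepting.
TRANSITIONS = {
    (0, 'a'): {0, 1},   # either keep looping in (a|b)* or start matching 'abb'
    (0, 'b'): {0},
    (1, 'b'): {2},
    (2, 'b'): {3},
}
START, ACCEPT = 0, 3

def nfa_match(s):
    states = {START}
    for ch in s:
        # Advance every live state simultaneously; no backtracking occurs.
        states = set().union(*(TRANSITIONS.get((q, ch), set()) for q in states))
        if not states:           # every path is dead: reject early
            return False
    return ACCEPT in states

print(nfa_match("abb"))      # True
print(nfa_match("aababb"))   # True
print(nfa_match("abab"))     # False
```

A production matcher builds the NFA automatically from the regex (and typically handles epsilon transitions), but the per-character set-advance loop is the same idea.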
This does depend on the implementation of the regular expression matcher. If you have a naive or poor implementation of the matcher, then matching could take exponential time; there are certainly algorithms with that property. But the best answer to that is probably not to change the regular expression; it's probably better to pick a better matcher, if you are concerned about denial-of-service attacks.
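To see the implementation dependence concretely: Python's built-in `re` module uses a backtracking matcher, so even the textbook "evil" pattern `(a+)+$` — a classical regex that the NFA algorithm above handles in linear time — blows up here. On a string of $n$ `a`s followed by a `b`, the backtracker tries roughly $2^n$ ways to split the `a`s between the inner and outer `+`, and every split fails (the exact step counts below are illustrative, not measured):

```python
# Demonstration of catastrophic backtracking in a backtracking matcher.
# (a+)+$ can never match 'aaa...ab', but Python's re engine explores an
# exponential number of failed partitions before giving up.
import re
import time

EVIL = re.compile(r'(a+)+$')

for n in (12, 16, 20):
    s = 'a' * n + 'b'
    t0 = time.perf_counter()
    assert EVIL.match(s) is None                 # always fails...
    print(n, time.perf_counter() - t0)           # ...taking ~16x longer per +4 chars
```

The fix is the one described above: keep the regex, swap the matcher for one with the linear-time guarantee (e.g., an RE2-style engine).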
In comparison, some modern regexps are unavoidably evil. If you have a modern regexp, then matching can require exponential time. In particular, regexps with backreferences can recognize NP-hard languages. Consequently, under plausible assumptions, there exists a class of evil regexps where testing for a match takes exponential time. Thus, some modern regexps are unavoidably evil: there is no feasible way to find an equivalent regexp that won't cause exponential blowup in running time to match.
(Such an equivalent might exist and might even be findable in theory, but under plausible assumptions, finding the equivalent regexp will take exponential time, which isn't feasible in practice. If you had a systematic procedure to find the equivalent regexp in polynomial time, then you could solve the NP-hard problem in polynomial time, which would prove that P = NP. It doesn't do much good for an equivalent regexp to exist if there's no way to actually find it within your lifetime.)
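For a concrete taste of the extra power backreferences buy (this particular pattern is my illustration, not from the cited references): the regexp `^(.+)\1$` matches exactly the "squares" $ww$, a string repeated twice. The language $\{ww : w \text{ nonempty}\}$ is not regular, so no classical regular expression — and hence no linear-time NFA — can recognize it. The NP-hardness results cited below push this idea much further, encoding hard problems with several interacting backreferences.

```python
# A backreference lets the regexp demand a repeat of whatever group 1
# captured -- something no classical (finite-automaton) regex can do.
import re

SQUARE = re.compile(r'^(.+)\1$')   # matches exactly strings of the form ww

print(bool(SQUARE.match('abcabc')))   # True:  'abc' + 'abc'
print(bool(SQUARE.match('abcabd')))   # False: not a repetition
print(bool(SQUARE.match('aaaa')))     # True:  'aa' + 'aa'
```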
Background and sources:
-
Which languages do Perl-compatible regular expressions recognize? and Expressiveness of modern regular expressions provide references to justify that modern regexps can recognize NP-hard languages.
-
How to simulate backreferences, lookaheads, and lookbehinds in finite state automata? and When a regexp is not a Regular Expression? can be helpful to understand the difference between regular expressions and regexps.
-
This article from Russ Cox has a nice explanation of two different ways of building a regular expression matcher, and explains why the running time is linear in the length of the input string if you use the proper algorithm (when the regular expression is held fixed and its length is treated as a constant). In particular, the NFA-based algorithm -- also known as Thompson's algorithm -- has linear worst-case running time. The article also shows how some popular languages have regexp implementations that can go exponential on some regexps, and it discusses which aspects of modern regexps introduce the exponential running times.
-
In this post I assume P != NP. More to the point, when I refer to "plausible assumptions", I am referring to the exponential time hypothesis.
Context
StackExchange Computer Science Q#46836, answer score: 14