HiveBrain v1.2.0
Get Started
← Back to all entries
patternjavaMinor

Regex parser implementing dot, star and question mark

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
starparserdotquestionandregeximplementingmark

Problem

Regex parser which follows the following rule



  • . (dot) means a any single character match



  • * (star) means 0 or more character match



  • ? (question) means previous char is an option i.e colou?r matches color and colour.




A large feedback has been taken from here.

I'm looking for code review, optimizations, best practices. I also still get confused with complexity. Let's say regex string has length of m and string has length of n. Then it appears the complexity would be O(n! (n-1)! ... (n-m)!)

```
public final class Regex {

private Regex() {};

private static enum Wildcards {
STAR('*'), DOT('.'), QUESTION('?');

private final char wildCard;

Wildcards (char wildCard) {
this.wildCard = wildCard;
}

public char getWildCard() {
return this.wildCard;
}
}

/**
* The index of first non-star character following first occurence of star
* eg: if regex is a**b, then index is startIndex is 1 and return value is 3 (index of b)
*
* @param regex : the regex string.
@param firstStarIndex : index of the first occurene of after a non *, eg: a**b, then starIndex would be 1
* @return : the index of first non-star character following first occurence of star
*/
private static final int getNonWildCardIndex(String regex, int firstQuestionIndex, Wildcards foo) {
for (int k = firstQuestionIndex; k = 0 && regexPos = 0 && strPos strpos is an impossible condition.

while ((regexPos < regex.length() && strPos < str.length())) {

if (regex.charAt(regexPos) == str.charAt(strPos) || regex.charAt(regexPos) == Wildcards.DOT.getWildCard()) {
regexPos++;
strPos++;
}
/*
* nested if else is better than non-nested.
* http://www.macs.hw.ac.uk/~pjbk/pathways/cpp1/node99.html
*/

Solution

Your code is rather nice, but the algorithm is fundamentally flawed. Consider these test cases, which you currently fail:

// The empty string is a substring of any string
Assert.assertTrue(Regex.matches("", "abc"));

// Because you check for direct character equivalence before metacharacters,
// a string containing regex metacharacters will falsely pass
Assert.assertFalse(Regex.matches("a*b", "a*b"));

// We should reject malformed regexes
try {
    Regex.matches("***", "");
    Assert.fail();
} catch (PatternSyntaxException e) {
}


Furthermore, your regex dialect is unable to match all regular languages: It is possible that a group of characters (and not just a single character) is repeated:

S := A
A := "ab" A
A := ""


This grammar can produce the strings "", "ab", "abab" etc. It is matched by the regular expression (ab)*. However, you do not offer such groupings. Other features that are lacking and are generally expected of regexes:

  • A + quantifier



  • alternatives | (this is strictly required)



  • character classes [a-z] (only needed in real-life applications, not in theoretical settings)



  • Escape sequences like \ to match a literal , instead of * being a metacharacter.



Don't interpret the regex character by character. This is buggy, and – as shown above – prone to all kinds of failures(1). Instead, write a proper parser and compile the regex to some intermediate representation. For example, the regex (ab|c)* might be equivalent to this data structure:

new Regex.Star(new Regex.Alternative(new Regex.Constant("ab"),
                                     new Regex.Constant("c")))


Given such a data structure, writing a backtracking regex engine is very easy. Of course, regular expressions can also be translated to a state machine, which is a bit more complicated – but the first step is to make it work correctly, before you try any clever optimizations.

  • Especially, regular expressions themselves do not follow a regular grammar.

Code Snippets

// The empty string is a substring of any string
Assert.assertTrue(Regex.matches("", "abc"));

// Because you check for direct character equivalence before metacharacters,
// a string containing regex metacharacters will falsely pass
Assert.assertFalse(Regex.matches("a*b", "a*b"));

// We should reject malformed regexes
try {
    Regex.matches("***", "");
    Assert.fail();
} catch (PatternSyntaxException e) {
}
S := A
A := "ab" A
A := ""
new Regex.Star(new Regex.Alternative(new Regex.Constant("ab"),
                                     new Regex.Constant("c")))

Context

StackExchange Code Review Q#43579, answer score: 3

Revisions (0)

No revisions yet.