patternjavaMinor
Regex parser implementing dot, star and question mark
Viewed 0 times
starparserdotquestionandregeximplementingmark
Problem
Regex parser which follows the following rule
A large feedback has been taken from here.
I'm looking for code review, optimizations, best practices. I also still get confused with complexity. Let's say regex string has length of
```
public final class Regex {
private Regex() {};
private static enum Wildcards {
STAR('*'), DOT('.'), QUESTION('?');
private final char wildCard;
Wildcards (char wildCard) {
this.wildCard = wildCard;
}
public char getWildCard() {
return this.wildCard;
}
}
/**
* The index of first non-star character following first occurence of star
* eg: if regex is a**b, then index is startIndex is 1 and return value is 3 (index of b)
*
* @param regex : the regex string.
@param firstStarIndex : index of the first occurene of after a non *, eg: a**b, then starIndex would be 1
* @return : the index of first non-star character following first occurence of star
*/
private static final int getNonWildCardIndex(String regex, int firstQuestionIndex, Wildcards foo) {
for (int k = firstQuestionIndex; k = 0 && regexPos = 0 && strPos strpos is an impossible condition.
while ((regexPos < regex.length() && strPos < str.length())) {
if (regex.charAt(regexPos) == str.charAt(strPos) || regex.charAt(regexPos) == Wildcards.DOT.getWildCard()) {
regexPos++;
strPos++;
}
/*
* nested if else is better than non-nested.
* http://www.macs.hw.ac.uk/~pjbk/pathways/cpp1/node99.html
*/
.(dot) means a any single character match
*(star) means 0 or more character match
?(question) means previous char is an option i.ecolou?rmatchescolorandcolour.
A large feedback has been taken from here.
I'm looking for code review, optimizations, best practices. I also still get confused with complexity. Let's say regex string has length of
m and string has length of n. Then it appears the complexity would be O(n! (n-1)! ... (n-m)!)```
public final class Regex {
private Regex() {};
private static enum Wildcards {
STAR('*'), DOT('.'), QUESTION('?');
private final char wildCard;
Wildcards (char wildCard) {
this.wildCard = wildCard;
}
public char getWildCard() {
return this.wildCard;
}
}
/**
* The index of first non-star character following first occurence of star
* eg: if regex is a**b, then index is startIndex is 1 and return value is 3 (index of b)
*
* @param regex : the regex string.
@param firstStarIndex : index of the first occurene of after a non *, eg: a**b, then starIndex would be 1
* @return : the index of first non-star character following first occurence of star
*/
private static final int getNonWildCardIndex(String regex, int firstQuestionIndex, Wildcards foo) {
for (int k = firstQuestionIndex; k = 0 && regexPos = 0 && strPos strpos is an impossible condition.
while ((regexPos < regex.length() && strPos < str.length())) {
if (regex.charAt(regexPos) == str.charAt(strPos) || regex.charAt(regexPos) == Wildcards.DOT.getWildCard()) {
regexPos++;
strPos++;
}
/*
* nested if else is better than non-nested.
* http://www.macs.hw.ac.uk/~pjbk/pathways/cpp1/node99.html
*/
Solution
Your code is rather nice, but the algorithm is fundamentally flawed. Consider these test cases, which you currently fail:
Furthermore, your regex dialect is unable to match all regular languages: It is possible that a group of characters (and not just a single character) is repeated:
This grammar can produce the strings
Don't interpret the regex character by character. This is buggy, and – as shown above – prone to all kinds of failures(1). Instead, write a proper parser and compile the regex to some intermediate representation. For example, the regex
Given such a data structure, writing a backtracking regex engine is very easy. Of course, regular expressions can also be translated to a state machine, which is a bit more complicated – but the first step is to make it work correctly, before you try any clever optimizations.
// The empty string is a substring of any string
Assert.assertTrue(Regex.matches("", "abc"));
// Because you check for direct character equivalence before metacharacters,
// a string containing regex metacharacters will falsely pass
Assert.assertFalse(Regex.matches("a*b", "a*b"));
// We should reject malformed regexes
try {
Regex.matches("***", "");
Assert.fail();
} catch (PatternSyntaxException e) {
}Furthermore, your regex dialect is unable to match all regular languages: It is possible that a group of characters (and not just a single character) is repeated:
S := A
A := "ab" A
A := ""This grammar can produce the strings
"", "ab", "abab" etc. It is matched by the regular expression (ab)*. However, you do not offer such groupings. Other features that are lacking and are generally expected of regexes:- A
+quantifier
- alternatives
|(this is strictly required)
- character classes
[a-z](only needed in real-life applications, not in theoretical settings)
- Escape sequences like
\to match a literal, instead of*being a metacharacter.
Don't interpret the regex character by character. This is buggy, and – as shown above – prone to all kinds of failures(1). Instead, write a proper parser and compile the regex to some intermediate representation. For example, the regex
(ab|c)* might be equivalent to this data structure:new Regex.Star(new Regex.Alternative(new Regex.Constant("ab"),
new Regex.Constant("c")))Given such a data structure, writing a backtracking regex engine is very easy. Of course, regular expressions can also be translated to a state machine, which is a bit more complicated – but the first step is to make it work correctly, before you try any clever optimizations.
- Especially, regular expressions themselves do not follow a regular grammar.
Code Snippets
// The empty string is a substring of any string
Assert.assertTrue(Regex.matches("", "abc"));
// Because you check for direct character equivalence before metacharacters,
// a string containing regex metacharacters will falsely pass
Assert.assertFalse(Regex.matches("a*b", "a*b"));
// We should reject malformed regexes
try {
Regex.matches("***", "");
Assert.fail();
} catch (PatternSyntaxException e) {
}S := A
A := "ab" A
A := ""new Regex.Star(new Regex.Alternative(new Regex.Constant("ab"),
new Regex.Constant("c")))Context
StackExchange Code Review Q#43579, answer score: 3
Revisions (0)
No revisions yet.