patterncsharpMinor
Fighting f̶i̶r̶e̶ regex with f̶i̶r̶e̶ regex
Viewed 0 times
withregexfighting
Problem
So someone requested a Regex Builder / Assistant for Rubberduck and since my CS class had been covering the topic only recently I was thinking to myself: "This sounds interesting, you should do this".
Long story short, a day of designing and a weekend of implementation (and some more polishing) as well as a few reviews later Rubberduck has a fully working Regex Assistant (which even supports i18n).
It can be given a VBA Regular Expression pattern and tells you what it does in a manner similar to other well-known assistants (regex101, expresso, ...)
This means it can parse a given syntactically correct Regex and recognize the separate Atoms and Quantifiers. It analyzes said Atoms and builds a structure that allows displaying useful information about the pattern.
This is accomplished by parsing the expression into a Tree-Structure.
Any Regular Expression is built by either the smallest possible unit you have (an Atom) or other regular expressions. This definition allows us to "abuse" the syntax into building our Design.
We basically need to handle two (well three) things to represent any arbitrary Regular Expression.
to be fully correct, Quantifiers actually belong in cahoots with Atoms, but they're.. different, so we'll treat them differently.
This sets the stage for our "data holders":
Atom.cs
```
using Rubberduck.RegexAssistant.i18n;
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
namespace Rubberduck.RegexAssistant
{
public interface IAtom : IDescribable
{
string Specifier { get; }
}
internal class CharacterClass : IAtom
{
public static readonly string Pattern = @"(?.*?)(? _characterSpecifiers;
public IList CharacterSpecifiers { get { return _characterSpecifiers; } }
private readonly string _specifier;
public CharacterClass(string specifier)
{
Match m = Matcher.Match(specifier);
Long story short, a day of designing and a weekend of implementation (and some more polishing) as well as a few reviews later Rubberduck has a fully working Regex Assistant (which even supports i18n).
It can be given a VBA Regular Expression pattern and tells you what it does in a manner similar to other well-known assistants (regex101, expresso, ...)
This means it can parse a given syntactically correct Regex and recognize the separate Atoms and Quantifiers. It analyzes said Atoms and builds a structure that allows displaying useful information about the pattern.
This is accomplished by parsing the expression into a Tree-Structure.
Any Regular Expression is built by either the smallest possible unit you have (an Atom) or other regular expressions. This definition allows us to "abuse" the syntax into building our Design.
We basically need to handle two (well three) things to represent any arbitrary Regular Expression.
- Atoms
- Regular Expressions
- Quantifiers
to be fully correct, Quantifiers actually belong in cahoots with Atoms, but they're.. different, so we'll treat them differently.
This sets the stage for our "data holders":
Atom.cs
```
using Rubberduck.RegexAssistant.i18n;
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
namespace Rubberduck.RegexAssistant
{
public interface IAtom : IDescribable
{
string Specifier { get; }
}
internal class CharacterClass : IAtom
{
public static readonly string Pattern = @"(?.*?)(? _characterSpecifiers;
public IList CharacterSpecifiers { get { return _characterSpecifiers; } }
private readonly string _specifier;
public CharacterClass(string specifier)
{
Match m = Matcher.Match(specifier);
Solution
Validation
A public method should validate its method parameters. It doesn't matter if it is only used inside one project like
here
The same is true for the constructors of
A bigger issue will come in
Regex
I don't know how often the
CharacterClass
Instead of adding a
If you are using C# 6 then you could get rid of the
Maybe I would let the
or at least would return a new
This prevents that items of the
While we are speaking about properties, you should stick to one style. Right now you have single lined properties like above and also multi lined properties like
Equals()
I avoid the
Quantifier constructor
if
```
public Quantifier(string expression)
{
if (expression.Length == 0)
{
Kind = QuantifierKind.None;
MaximumMatches = 1;
MinimumMatches = 1;
return;
}
if (expression.Length == 1)
{
switch (expression[0])
{
case '*':
MinimumMatches = 0;
MaximumMatches = int.MaxValue;
Kind = QuantifierKind.Wildcard;
break;
case '+':
MinimumMatches = 1;
MaximumMatches = int.MaxValue;
Kind = QuantifierKind.Wildcard;
break;
case '?':
MinimumMatches = 0;
MaximumMatches = 1;
Kind = QuantifierKind.Wildcard;
break;
default:
throw new ArgumentException("Passed Quantifier String was not an allowed Quantifier");
}
return;
}
Kind = QuantifierKind.Expression;
Match m = Matcher.Match(expression);
if (!m.Success)
{
throw new ArgumentException(string.Format("Cannot extract a Quantifier from the expression {1}", expression));
}
int minimum;
// shouldn't ever happen
if (!int.TryParse(m.Groups["min"].Value, out minimum))
{
throw new ArgumentException("Cannot Parse Quantifier Expression into Range");
}
MinimumMatches = minimum;
string maximumString = m.Groups["max"].Value; // drop the comma
if (maximumString.Length > 1)
{
int maximum;
// shouldn't ever happen
if (!int.TryParse(maximumString.Substring(1), out maximum))
{
throw new ArgumentException("Cannot Parse Quantifier Expression into Range");
}
MaximumMatches = maximum;
}
else
A public method should validate its method parameters. It doesn't matter if it is only used inside one project like
RubberDuck or if it is used by other projects or developers.public CharacterClass(string specifier)
{
Match m = Matcher.Match(specifier);here
Matcher.Match() would throw an ArgumentNullException which I would throw as well if specifier would be null but if you throw it at your own validation you wouldn't expose that you use a Regex.The same is true for the constructors of
Group and Literal.A bigger issue will come in
RegularExpression.Parse(string) because here you are exposing internal details of your implementation. If the passed in specifier == null the stacktrace would contain that you have called a method named TryParseAsAtom() and that the exception had been thrown by calling the Regex.Match() method.Regex
I don't know how often the
Regexes are used but you should consider to use the overloaded constructor Regex(string, RegexOptions) so you could use the Compiled enum for the RegexOptions to have the regex compiled which will be faster if called often.CharacterClass
public static readonly string Pattern = @"(?.*?)(? _characterSpecifiers;
public IList CharacterSpecifiers { get { return _characterSpecifiers; } }
private readonly string _specifier;Instead of adding a
private backing field for your properties you could simply have a private set; which would read nicer like sopublic static readonly string Pattern = @"(?.*?)(? CharacterSpecifiers { get; private set; }
private readonly string _specifier;If you are using C# 6 then you could get rid of the
private set; all together which would make it nicer.Maybe I would let the
public IList CharacterSpecifiers { get { return _characterSpecifiers; } } stay with a backing field but would return either as a ReadOnlyCollection like sopublic ReadOnlyCollection CharacterSpecifiers { get { return _characterSpecifiers.AsReadOnly(); } }or at least would return a new
List like sopublic IList CharacterSpecifiers { get { return new List(_characterSpecifiers); } }This prevents that items of the
IList could be changed.While we are speaking about properties, you should stick to one style. Right now you have single lined properties like above and also multi lined properties like
public string Specifier
{
get
{
return _specifier;
}
}Equals()
I avoid the
is operator if I later on cast the object. The is operator just tries to cast the object to the desired type and returns true if it can be casted. So a softcast using as and a null check is doing the same but involves less casting like sopublic override bool Equals(object obj)
{
var item = obj as CharacterClass
if (item != null)
{
return item._specifier.Equals(_specifier);
}
return false;
}Quantifier constructor
if
expression.Length == 0 you should return early following with the check expression.Length == 1 and returning early as well. This saves you one horizontal identation level which makes your code more readable like so```
public Quantifier(string expression)
{
if (expression.Length == 0)
{
Kind = QuantifierKind.None;
MaximumMatches = 1;
MinimumMatches = 1;
return;
}
if (expression.Length == 1)
{
switch (expression[0])
{
case '*':
MinimumMatches = 0;
MaximumMatches = int.MaxValue;
Kind = QuantifierKind.Wildcard;
break;
case '+':
MinimumMatches = 1;
MaximumMatches = int.MaxValue;
Kind = QuantifierKind.Wildcard;
break;
case '?':
MinimumMatches = 0;
MaximumMatches = 1;
Kind = QuantifierKind.Wildcard;
break;
default:
throw new ArgumentException("Passed Quantifier String was not an allowed Quantifier");
}
return;
}
Kind = QuantifierKind.Expression;
Match m = Matcher.Match(expression);
if (!m.Success)
{
throw new ArgumentException(string.Format("Cannot extract a Quantifier from the expression {1}", expression));
}
int minimum;
// shouldn't ever happen
if (!int.TryParse(m.Groups["min"].Value, out minimum))
{
throw new ArgumentException("Cannot Parse Quantifier Expression into Range");
}
MinimumMatches = minimum;
string maximumString = m.Groups["max"].Value; // drop the comma
if (maximumString.Length > 1)
{
int maximum;
// shouldn't ever happen
if (!int.TryParse(maximumString.Substring(1), out maximum))
{
throw new ArgumentException("Cannot Parse Quantifier Expression into Range");
}
MaximumMatches = maximum;
}
else
Code Snippets
public CharacterClass(string specifier)
{
Match m = Matcher.Match(specifier);public static readonly string Pattern = @"(?<!\\)\[(?<expression>.*?)(?<!\\)\]";
private static readonly Regex Matcher = new Regex("^" + Pattern + "$");
private readonly bool _inverseMatching;
public bool InverseMatching { get { return _inverseMatching; } }
private readonly IList<string> _characterSpecifiers;
public IList<string> CharacterSpecifiers { get { return _characterSpecifiers; } }
private readonly string _specifier;public static readonly string Pattern = @"(?<!\\)\[(?<expression>.*?)(?<!\\)\]";
private static readonly Regex Matcher = new Regex("^" + Pattern + "$");
public bool InverseMatching { get; private set; }
public IList<string> CharacterSpecifiers { get; private set; }
private readonly string _specifier;public ReadOnlyCollection<string> CharacterSpecifiers { get { return _characterSpecifiers.AsReadOnly(); } }public IList<string> CharacterSpecifiers { get { return new List<string>(_characterSpecifiers); } }Context
StackExchange Code Review Q#136411, answer score: 5
Revisions (0)
No revisions yet.