patterncsharpMinor
Simple Tokenizer + Parser
Viewed 0 times
parsersimpletokenizer
Problem
Requirements
A function is needed, that is able to parse a list of symbols with the following rules:
Examples:
Available Symbols: ABC, CDE, A, B, C - A, B and C may be associated with numbers.
Approach
Yes, a parser is a little oversized for that kind of string analysis. However, it is very likely that the complexity of the syntax will grow in the course of time, so I decide to write a parse that scales better that regex or string operations.
The parsing process works in 2 steps.
Tokenizer
I created a general purpose regex based
Usage of the tokenizer:
```
Tokenizer tokenizer = Tokenizer.Empty
.WithToken(SYMBOL, "(ABC|CDE)")
.WithToken(SYMBOL_AND_NUMBERS, "(A|B|C|D|E|F)")
.WithToken(RANGE, "-")
.WithToken(SEPARATOR, ",")
.WithToken(NUMBER, "[0-9]+");
Token[] ro
A function is needed, that is able to parse a list of symbols with the following rules:
- Symbols may be associated with numbers or not.
- Numbers are defined by a comma separed list behind the symbol.
- Numbers may be defined as range (e.g. 3-7 or also 7-3).
- Whitespaces and invalid tokens should be ignored after token recognition finished.
- Whitespaces are not needed as seperators.
- The size / number of items to parse is not time critical.
Examples:
Available Symbols: ABC, CDE, A, B, C - A, B and C may be associated with numbers.
- "ABC CDE A 1, 2 B" => [ABC], [CDE], [A{1,2}], [B].
- "A1-4,6 CDE B C9-7,4,3" => [A{1,2,3,4,6}], [CDE], [B], [C{3,4,7,8,9}]
- "ABCBCA5,4" => [ABC], [B], [C], [A{4,5}]
Approach
Yes, a parser is a little oversized for that kind of string analysis. However, it is very likely that the complexity of the syntax will grow in the course of time, so I decide to write a parse that scales better that regex or string operations.
The parsing process works in 2 steps.
- Split string in tokens (Symbols, Numbers, Separator ',', Range operator '-')
- Parse tokens to list of objects
Tokenizer
I created a general purpose regex based
Tokenizer that produces a list of Token objects from an input string. Each token has a type (can be defined when creating the Tokenizer), an index (position of the token within the input string) and the actual value.public class Token
{
public Token(string type, string token, int index)
{
Value = token;
Type = type;
Index = index;
}
public string Value { get; private set; }
public string Type { get; private set; }
public int Index { get; private set; }
}Usage of the tokenizer:
```
Tokenizer tokenizer = Tokenizer.Empty
.WithToken(SYMBOL, "(ABC|CDE)")
.WithToken(SYMBOL_AND_NUMBERS, "(A|B|C|D|E|F)")
.WithToken(RANGE, "-")
.WithToken(SEPARATOR, ",")
.WithToken(NUMBER, "[0-9]+");
Token[] ro
Solution
- I don't like
Emptyproperty, as it makes things ambiguous. Is it safe to callTokenizer.Empty.WithToken? Or will I modify some existing static field? Well I will have to go and check the implementation to know for sure. I would rather just callnew Tokenizer().WithToken(...)instead. OrTokenizer.Create().WithToken(...)if you have something against using default constructors.
-
I think you should merge
Tokenize and CollectTokens methods. It doesn't look like Tokenize does any meaningful work and you can easily declare bool array inside CollectTokens. Also since you already check for string being null, you might as well return early by calling:if(String.IsNullOrWhiteSpace(input)) return new Token[0];-
According to C# naming convention you should use PascalCase for constants (unless you are mirroring some external API which uses different naming convention). So It should be
SymbolWithNumbers, etc. Also you might want to use enum instead of strings:enum TokenType
{
SymbolWithNumbers,
Symbol,
...
}-
my... prefix looks amateurish and adds this "hypothetical" flavor to your code. You should drop it.-
I think you can simplify things if you modify your range regex to look for "Number-Number" pattern (I'm sorry, I don't speak regex).
-
You can use OOP to eliminate
case operators.For example:
interface ITokenParser
{
bool TryParse(IList tokens, out ParserResult result);
}
class SymbolParser : ITokenParser
{
public bool TryParse(IList tokens, out ParserResult result)
{
result = null;
if (tokens.Count == 1 && tokens[0].Type == Parser.SYMBOL)
{
result = new ParserResult(tokens[0].Value);
return true;
}
return false;
}
}The naive implementation of
ParseInternal method might look like this:private IEnumerable ParseInternal(Token[] tokens)
{
var stack = new List();
var parsers = new ITokenParser[] {new SymbolParser(), ...};//implements other parsers
foreach (var current in tokens)
{
stack.Add(current);
foreach (var parser in parsers)
{
ParserResult result;
if (parser.TryParse(stack, out result))
{
stack.Clear();
yield return result;
break;
}
}
}
}It just occurred to me though, that it won't quite work for numbers: for "A1,2" it will return "A[1]". So you'll need a more complex solution. You could add tokens to stack until you can no longer parse it, for example. And only when that happens you would yield last successful
ParserResult and clear the stack. But there might be smarter way to do it, I'll think about it.Code Snippets
if(String.IsNullOrWhiteSpace(input)) return new Token[0];enum TokenType
{
SymbolWithNumbers,
Symbol,
...
}interface ITokenParser
{
bool TryParse(IList<Token> tokens, out ParserResult result);
}
class SymbolParser : ITokenParser
{
public bool TryParse(IList<Token> tokens, out ParserResult result)
{
result = null;
if (tokens.Count == 1 && tokens[0].Type == Parser.SYMBOL)
{
result = new ParserResult(tokens[0].Value);
return true;
}
return false;
}
}private IEnumerable<ParserResult> ParseInternal(Token[] tokens)
{
var stack = new List<Token>();
var parsers = new ITokenParser[] {new SymbolParser(), ...};//implements other parsers
foreach (var current in tokens)
{
stack.Add(current);
foreach (var parser in parsers)
{
ParserResult result;
if (parser.TryParse(stack, out result))
{
stack.Clear();
yield return result;
break;
}
}
}
}Context
StackExchange Code Review Q#129663, answer score: 5
Revisions (0)
No revisions yet.