HiveBrain v1.2.0
Get Started
← Back to all entries
patterncsharpMinor

Simple Tokenizer + Parser

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
parsersimpletokenizer

Problem

Requirements

A function is needed, that is able to parse a list of symbols with the following rules:

  • Symbols may be associated with numbers or not.



  • Numbers are defined by a comma separed list behind the symbol.



  • Numbers may be defined as range (e.g. 3-7 or also 7-3).



  • Whitespaces and invalid tokens should be ignored after token recognition finished.



  • Whitespaces are not needed as seperators.



  • The size / number of items to parse is not time critical.



Examples:

Available Symbols: ABC, CDE, A, B, C - A, B and C may be associated with numbers.

  • "ABC CDE A 1, 2 B" => [ABC], [CDE], [A{1,2}], [B].



  • "A1-4,6 CDE B C9-7,4,3" => [A{1,2,3,4,6}], [CDE], [B], [C{3,4,7,8,9}]



  • "ABCBCA5,4" => [ABC], [B], [C], [A{4,5}]



Approach

Yes, a parser is a little oversized for that kind of string analysis. However, it is very likely that the complexity of the syntax will grow in the course of time, so I decide to write a parse that scales better that regex or string operations.

The parsing process works in 2 steps.

  • Split string in tokens (Symbols, Numbers, Separator ',', Range operator '-')



  • Parse tokens to list of objects



Tokenizer

I created a general purpose regex based Tokenizer that produces a list of Token objects from an input string. Each token has a type (can be defined when creating the Tokenizer), an index (position of the token within the input string) and the actual value.

public class Token
{
    public Token(string type, string token, int index)
    {
        Value = token;
        Type = type;
        Index = index;
    }
    public string Value { get; private set; }
    public string Type { get; private set; }
    public int Index { get; private set; }
}


Usage of the tokenizer:

```
Tokenizer tokenizer = Tokenizer.Empty
.WithToken(SYMBOL, "(ABC|CDE)")
.WithToken(SYMBOL_AND_NUMBERS, "(A|B|C|D|E|F)")
.WithToken(RANGE, "-")
.WithToken(SEPARATOR, ",")
.WithToken(NUMBER, "[0-9]+");

Token[] ro

Solution


  • I don't like Empty property, as it makes things ambiguous. Is it safe to call Tokenizer.Empty.WithToken? Or will I modify some existing static field? Well I will have to go and check the implementation to know for sure. I would rather just call new Tokenizer().WithToken(...) instead. Or Tokenizer.Create().WithToken(...) if you have something against using default constructors.



-
I think you should merge Tokenize and CollectTokens methods. It doesn't look like Tokenize does any meaningful work and you can easily declare bool array inside CollectTokens. Also since you already check for string being null, you might as well return early by calling:

if(String.IsNullOrWhiteSpace(input)) return new Token[0];


-
According to C# naming convention you should use PascalCase for constants (unless you are mirroring some external API which uses different naming convention). So It should be SymbolWithNumbers, etc. Also you might want to use enum instead of strings:

enum TokenType
{
    SymbolWithNumbers,
    Symbol,
    ...
}


-
my... prefix looks amateurish and adds this "hypothetical" flavor to your code. You should drop it.

-
I think you can simplify things if you modify your range regex to look for "Number-Number" pattern (I'm sorry, I don't speak regex).

-
You can use OOP to eliminate case operators.

For example:

interface ITokenParser
{
    bool TryParse(IList tokens, out ParserResult result);
}

class SymbolParser : ITokenParser
{
    public bool TryParse(IList tokens, out ParserResult result)
    {
        result = null;
        if (tokens.Count == 1 && tokens[0].Type == Parser.SYMBOL)
        {
            result = new ParserResult(tokens[0].Value);
            return true;
        }
        return false;
    }
}


The naive implementation of ParseInternal method might look like this:

private IEnumerable ParseInternal(Token[] tokens)
{
    var stack = new List();
    var parsers = new ITokenParser[] {new SymbolParser(), ...};//implements other parsers

    foreach (var current in tokens)
    {
        stack.Add(current);
        foreach (var parser in parsers)
        {
            ParserResult result;
            if (parser.TryParse(stack, out result))
            {
                stack.Clear();
                yield return result;
                break;
            }
        }
    }
}


It just occurred to me though, that it won't quite work for numbers: for "A1,2" it will return "A[1]". So you'll need a more complex solution. You could add tokens to stack until you can no longer parse it, for example. And only when that happens you would yield last successful ParserResult and clear the stack. But there might be smarter way to do it, I'll think about it.

Code Snippets

if(String.IsNullOrWhiteSpace(input)) return new Token[0];
enum TokenType
{
    SymbolWithNumbers,
    Symbol,
    ...
}
interface ITokenParser
{
    bool TryParse(IList<Token> tokens, out ParserResult result);
}

class SymbolParser : ITokenParser
{
    public bool TryParse(IList<Token> tokens, out ParserResult result)
    {
        result = null;
        if (tokens.Count == 1 && tokens[0].Type == Parser.SYMBOL)
        {
            result = new ParserResult(tokens[0].Value);
            return true;
        }
        return false;
    }
}
private IEnumerable<ParserResult> ParseInternal(Token[] tokens)
{
    var stack = new List<Token>();
    var parsers = new ITokenParser[] {new SymbolParser(), ...};//implements other parsers

    foreach (var current in tokens)
    {
        stack.Add(current);
        foreach (var parser in parsers)
        {
            ParserResult result;
            if (parser.TryParse(stack, out result))
            {
                stack.Clear();
                yield return result;
                break;
            }
        }
    }
}

Context

StackExchange Code Review Q#129663, answer score: 5

Revisions (0)

No revisions yet.