HiveBrain v1.2.0

Command Tokenizer

Submitted by: @import:stackexchange-codereview
Tags: command, tokenizer, stackoverflow

Problem

I've written some code to tokenize a command string into its tokens.

A token is either:

  • A block of any non-whitespace characters
  • A block of characters, which may include whitespace, wrapped in quotes

So, for the input:


This is some text "with information" quoted.

I'd expect the tokens:

  • This
  • is
  • some
  • text
  • with information
  • quoted.


The tokenizer

```
using System.Collections.Generic;
using System.Text.RegularExpressions;

namespace MudCore
{
    public static class CommandTokenizer
    {
        static readonly Regex _pattern;

        static CommandTokenizer()
        {
            // Captures either a quoted block (quotes stripped) or a run of
            // non-whitespace characters into the named group "token".
            _pattern = new Regex(@"((\s*""(?<token>[^""]*)(""|$)\s*)|(\s*(?<token>[^\s""]+)\s*))*", RegexOptions.Compiled | RegexOptions.ExplicitCapture);
        }

        public static string[] Tokenise(string input)
        {
            List<string> matches = new List<string>();
            var match = _pattern.Match(input);

            if (match.Success)
            {
                foreach (Capture capture in match.Groups["token"].Captures)
                {
                    matches.Add(capture.Value);
                }
            }
            return matches.ToArray();
        }
    }
}
```
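For reference, here is a minimal, self-contained driver sketch showing the question's example input running through the same pattern (the `Demo` class and its method layout are scaffolding for this sketch; the regex and token-collection logic mirror the class above):

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

class Demo
{
    // Same pattern as CommandTokenizer above, repeated so this sketch stands alone.
    static readonly Regex Pattern = new Regex(
        @"((\s*""(?<token>[^""]*)(""|$)\s*)|(\s*(?<token>[^\s""]+)\s*))*",
        RegexOptions.Compiled | RegexOptions.ExplicitCapture);

    public static string[] Tokenise(string input)
    {
        var matches = new List<string>();
        var match = Pattern.Match(input);
        if (match.Success)
            foreach (Capture capture in match.Groups["token"].Captures)
                matches.Add(capture.Value);
        return matches.ToArray();
    }

    static void Main()
    {
        var tokens = Tokenise("This is some text \"with information\" quoted.");

        // Per the question, quoted spans keep their internal whitespace.
        Console.WriteLine(string.Join(" | ", tokens));
        // This | is | some | text | with information | quoted.
    }
}
```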


The Tests

```
using MudCore;
using NUnit.Framework;

namespace MudCoreTests
{
    [TestFixture]
    public class CommandTokenizerTests
    {
        [Test]
        public void SingleWordBecomesSingleToken()
        {
            var tokens = CommandTokenizer.Tokenise("single");
            Assert.AreEqual(1, tokens.Length);
            Assert.AreEqual("single", tokens[0]);
        }

        [Test]
        public void MultipleWordsReturnMultipleTokens()
        {
            var tokens = CommandTokenizer.Tokenise("there are multiple tokens");
            Assert.AreEqual(4, tokens.Length);
            Assert.AreEqual("there", tokens[0]);
            Assert.AreEqual("are", tokens[1]);
            Assert.AreEqual("multiple", tokens[2]);
            Assert.AreEqual("tokens", tokens[3]);
        }
    }
}
```

Solution

You can shorten your Tokenise method using LINQ (note this needs a using System.Linq; directive):

```
public static string[] Tokenise(string input)
{
    List<string> matches = new List<string>();
    var match = _pattern.Match(input);

    if (match.Success)
    {
        foreach (Capture capture in match.Groups["token"].Captures)
        {
            matches.Add(capture.Value);
        }
    }
    return matches.ToArray();
}
```


Can become

```
public static string[] Tokenise(string input)
{
    var match = _pattern.Match(input);
    if (match.Success)
    {
        return (from Capture capture in match.Groups["token"].Captures select capture.Value).ToArray();
    }
    return default(string[]);
}
```


Or even shorter with the ternary operator

```
public static string[] Tokenise(string input)
{
    var match = _pattern.Match(input);
    return match.Success
        ? (from Capture capture in match.Groups["token"].Captures select capture.Value).ToArray()
        : default(string[]);
}
```
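One behavioural difference to note: the original loop version returns an empty array when the match fails, while default(string[]) evaluates to null, which callers then have to guard against. A variant that keeps the original contract (a sketch; the class name CommandTokenizerLinq is hypothetical, and Array.Empty needs .NET 4.6 or later):

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical wrapper class, included only so this sketch compiles on its own.
public static class CommandTokenizerLinq
{
    static readonly Regex _pattern = new Regex(
        @"((\s*""(?<token>[^""]*)(""|$)\s*)|(\s*(?<token>[^\s""]+)\s*))*",
        RegexOptions.Compiled | RegexOptions.ExplicitCapture);

    public static string[] Tokenise(string input)
    {
        var match = _pattern.Match(input);
        return match.Success
            // Cast<Capture>() is needed because CaptureCollection is non-generic.
            ? match.Groups["token"].Captures.Cast<Capture>().Select(c => c.Value).ToArray()
            // Empty array rather than null, matching the original loop version.
            : Array.Empty<string>();
    }
}
```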


But if performance concerns you, you're better off hand-rolling the tokenizer instead of using regex, unless you're working with really long strings, in which case regex will probably win.

I've made an alternative solution which runs roughly four times faster than your regex version over 1,000,000 iterations with this string:

"There are in the text \"some quoted tokens, that have punctuation. And other stuff\""

```
public static string[] Tokenise(string input)
{
    input = input.Trim();
    List<string> matches = new List<string>();
    StringBuilder builder = new StringBuilder();
    for (int i = 0; i < input.Length; i++)
    {
        if (input[i] == '"')
        {
            int nextQuoteIndex = input.IndexOf('"', i + 1);
            if (nextQuoteIndex != -1)
            {
                matches.Add(input.Substring(i + 1, nextQuoteIndex - i - 1));
                i = nextQuoteIndex;
            }
            else
            {
                matches.Add(input.Substring(i + 1, input.Length - i - 1));
                return matches.ToArray();
            }
        }
        else if (input[i] != ' ')
        {
            builder.Append(input[i]);
        }
        else if (builder.Length > 0)
        {
            matches.Add(builder.ToString());
            builder.Clear();
        }
    }
    if (builder.Length > 0)
    {
        matches.Add(builder.ToString());
    }
    return matches.ToArray();
}
```
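Worth noting how this version handles a quote that is never closed (the else branch of the quote check): everything after the opening quote becomes one final token. A quick usage sketch, assuming the hand-rolled Tokenise above is the one in scope (the command strings are made up for illustration):

```csharp
var a = CommandTokenizer.Tokenise("This is some text \"with information\" quoted.");
// a: This, is, some, text, with information, quoted.

var b = CommandTokenizer.Tokenise("kill \"big rat");
// b: kill, big rat (the unterminated quote swallows the rest of the line)
```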


I'll leave this here, too:


Some people, when confronted with a problem, think "I know, I'll use
regular expressions." Now they have two problems.


Jamie Zawinski


Context

StackExchange Code Review Q#149212, answer score: 7
