Command Tokenizer
Problem
I've written some code to tokenize a command string into its tokens.
A token is either:
- A block of any non-whitespace characters
- A block of characters, which may include whitespace, wrapped in quotes
So, for the input:
This is some text "with information" quoted.
I'd expect the tokens:
- This
- is
- some
- text
- with information
- quoted.
The tokenizer
```
using System.Collections.Generic;
using System.Text.RegularExpressions;

namespace MudCore
{
    public static class CommandTokenizer
    {
        static Regex _pattern;

        static CommandTokenizer()
        {
            // Two alternatives, repeated: a quoted block (the closing quote is optional at end of input)
            // or a run of characters that are neither whitespace nor quotes. Both capture into "token".
            _pattern = new Regex(@"((\s*""(?<token>[^""]*)(""|$)\s*)|(\s*(?<token>[^\s""]+)\s*))*", RegexOptions.Compiled | RegexOptions.ExplicitCapture);
        }

        public static string[] Tokenise(string input)
        {
            List<string> matches = new List<string>();
            var match = _pattern.Match(input);
            if(match.Success)
            {
                foreach(Capture capture in match.Groups["token"].Captures)
                {
                    matches.Add(capture.Value);
                }
            }
            return matches.ToArray();
        }
    }
}
```
The Tests
```
using MudCore;
using NUnit.Framework;

namespace MudCoreTests
{
    [TestFixture]
    public class CommandTokenizerTests
    {
        [Test]
        public void SingleWordBecomesSingleToken()
        {
            var tokens = CommandTokenizer.Tokenise("single");
            Assert.AreEqual(1, tokens.Length);
            Assert.AreEqual("single", tokens[0]);
        }

        [Test]
        public void MultipleWordsReturnMultipleTokens()
        {
            var tokens = CommandTokenizer.Tokenise("there are multiple tokens");
            Assert.AreEqual(4, tokens.Length);
            Assert.AreEqual("there", tokens[0]);
            Assert.AreEqual("are", tokens[1]);
            Assert.AreEqual("multiple", tokens[2]);
            Assert.AreEqual("tokens", toke
```
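To make the expected behaviour concrete, here is a minimal, self-contained sketch that runs the sample input through a pattern of the same shape as the tokenizer's. It assumes both alternatives capture into a named group called token (the name is inferred from the Groups["token"] lookup), and the class name TokenDemo is made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TokenDemo // hypothetical name, for illustration only
{
    // Assumption: both alternatives capture into the same named group, "token".
    static readonly Regex Pattern = new Regex(
        @"((\s*""(?<token>[^""]*)(""|$)\s*)|(\s*(?<token>[^\s""]+)\s*))*",
        RegexOptions.ExplicitCapture);

    public static string[] Tokenise(string input)
    {
        var tokens = new List<string>();
        // .NET keeps every capture of a repeated group, in the order they were matched.
        foreach (Capture capture in Pattern.Match(input).Groups["token"].Captures)
        {
            tokens.Add(capture.Value);
        }
        return tokens.ToArray();
    }

    public static void Main()
    {
        var tokens = Tokenise("This is some text \"with information\" quoted.");
        Console.WriteLine(string.Join("|", tokens));
        // prints: This|is|some|text|with information|quoted.
    }
}
```

Because the outer group is repeated with *, the pattern can also match the empty string, so Match should report success even for empty input and simply yield no captures.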
Solution
You can shorten your Tokenise method using LINQ. This:
```
public static string[] Tokenise(string input)
{
    List<string> matches = new List<string>();
    var match = _pattern.Match(input);
    if (match.Success)
    {
        foreach (Capture capture in match.Groups["token"].Captures)
        {
            matches.Add(capture.Value);
        }
    }
    return matches.ToArray();
}
```
Can become
```
// Requires: using System.Linq;
public static string[] Tokenise(string input)
{
    var match = _pattern.Match(input);
    if (match.Success)
    {
        return (from Capture capture in match.Groups["token"].Captures select capture.Value).ToArray();
    }
    // default(string[]) is null, so a failed match returns null rather than an empty array.
    return default(string[]);
}
```
Or even shorter with the ternary operator
```
public static string[] Tokenise(string input)
{
    var match = _pattern.Match(input);
    return match.Success
        ? (from Capture capture in match.Groups["token"].Captures select capture.Value).ToArray()
        : default(string[]);
}
```
But if performance is a concern, you're better off with your own implementation instead of regex, unless you are working with really long strings, in which case regex will probably win.
I've made an alternative solution which ran about 4 times faster than your regex version over 1,000,000 iterations with this string:
"There are in the text \"some quoted tokens, that have punctionation. And other stuff\""
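```
// Requires: using System.Text; (for StringBuilder)
public static string[] Tokenise(string input)
{
    input = input.Trim();
    List<string> matches = new List<string>();
    StringBuilder builder = new StringBuilder();
    for (int i = 0; i < input.Length; i++)
    {
        if (input[i] == '"')
        {
            int nextQuoteIndex = input.IndexOf('"', i + 1);
            if (nextQuoteIndex != -1)
            {
                // Quoted token: take everything between the two quotes.
                matches.Add(input.Substring(i + 1, nextQuoteIndex - i - 1));
                i = nextQuoteIndex;
            }
            else
            {
                // Unterminated quote: the rest of the input is one token.
                matches.Add(input.Substring(i + 1, input.Length - i - 1));
                return matches.ToArray();
            }
        }
        else if (input[i] != ' ')
        {
            // Only ' ' separates tokens here; tabs and other whitespace become part of a token.
            builder.Append(input[i]);
        }
        else if (builder.Length > 0)
        {
            matches.Add(builder.ToString());
            builder.Clear();
        }
    }
    if (builder.Length > 0)
    {
        matches.Add(builder.ToString());
    }
    return matches.ToArray();
}
```
I will leave this here too:
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.
Jamie Zawinski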
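For anyone who wants to check the timing claim on their own machine, a rough Stopwatch harness along the following lines could be used. This is only a sketch: the class name TokenizerBenchmark and the helper TimeMs are made up here, the regex assumes a named group called token, and the ratio you see will depend on runtime, hardware and input.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Text;
using System.Text.RegularExpressions;

public static class TokenizerBenchmark // hypothetical harness, not from the original post
{
    // Assumption: the regex captures into a named group "token", as the Groups["token"] lookup implies.
    static readonly Regex Pattern = new Regex(
        @"((\s*""(?<token>[^""]*)(""|$)\s*)|(\s*(?<token>[^\s""]+)\s*))*",
        RegexOptions.Compiled | RegexOptions.ExplicitCapture);

    public static string[] RegexTokenise(string input)
    {
        var matches = new List<string>();
        foreach (Capture capture in Pattern.Match(input).Groups["token"].Captures)
            matches.Add(capture.Value);
        return matches.ToArray();
    }

    public static string[] ManualTokenise(string input)
    {
        input = input.Trim();
        var matches = new List<string>();
        var builder = new StringBuilder();
        for (int i = 0; i < input.Length; i++)
        {
            if (input[i] == '"')
            {
                int nextQuoteIndex = input.IndexOf('"', i + 1);
                if (nextQuoteIndex == -1)
                {
                    // Unterminated quote: the rest of the input is one token.
                    matches.Add(input.Substring(i + 1));
                    return matches.ToArray();
                }
                matches.Add(input.Substring(i + 1, nextQuoteIndex - i - 1));
                i = nextQuoteIndex;
            }
            else if (input[i] != ' ')
            {
                builder.Append(input[i]);
            }
            else if (builder.Length > 0)
            {
                matches.Add(builder.ToString());
                builder.Clear();
            }
        }
        if (builder.Length > 0) matches.Add(builder.ToString());
        return matches.ToArray();
    }

    static long TimeMs(Func<string, string[]> tokenise, string input, int iterations)
    {
        tokenise(input); // warm-up call so JIT and regex compilation are not timed
        var watch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++) tokenise(input);
        watch.Stop();
        return watch.ElapsedMilliseconds;
    }

    public static void Main()
    {
        const string input = "There are in the text \"some quoted tokens, that have punctionation. And other stuff\"";
        const int iterations = 1000000;
        Console.WriteLine("regex:  " + TimeMs(RegexTokenise, input, iterations) + " ms");
        Console.WriteLine("manual: " + TimeMs(ManualTokenise, input, iterations) + " ms");
    }
}
```

A single Stopwatch loop like this is fine for a ballpark comparison; run it as a release build, and treat small differences as noise.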
Code Snippets
```
public static string[] Tokenise(string input)
{
    List<string> matches = new List<string>();
    var match = _pattern.Match(input);
    if (match.Success)
    {
        foreach (Capture capture in match.Groups["token"].Captures)
        {
            matches.Add(capture.Value);
        }
    }
    return matches.ToArray();
}
```
```
public static string[] Tokenise(string input)
{
    var match = _pattern.Match(input);
    if (match.Success)
    {
        return (from Capture capture in match.Groups["token"].Captures select capture.Value).ToArray();
    }
    return default(string[]);
}
```
```
public static string[] Tokenise(string input)
{
    var match = _pattern.Match(input);
    return match.Success
        ? (from Capture capture in match.Groups["token"].Captures select capture.Value).ToArray()
        : default(string[]);
}
```
```
public static string[] Tokenise(string input)
{
    input = input.Trim();
    List<string> matches = new List<string>();
    StringBuilder builder = new StringBuilder();
    for (int i = 0; i < input.Length; i++)
    {
        if (input[i] == '"')
        {
            int nextQuoteIndex = input.IndexOf('"', i + 1);
            if (nextQuoteIndex != -1)
            {
                matches.Add(input.Substring(i + 1, nextQuoteIndex - i - 1));
                i = nextQuoteIndex;
            }
            else
            {
                matches.Add(input.Substring(i + 1, input.Length - i - 1));
                return matches.ToArray();
            }
        }
        else if (input[i] != ' ')
        {
            builder.Append(input[i]);
        }
        else if (builder.Length > 0)
        {
            matches.Add(builder.ToString());
            builder.Clear();
        }
    }
    if (builder.Length > 0)
    {
        matches.Add(builder.ToString());
    }
    return matches.ToArray();
}
```
Context
StackExchange Code Review Q#149212, answer score: 7